Execution History Guided Instruction Prefetching

Zhang, Yi; Haga, Steve; Barua, Rajeev
February 2004
Journal of Supercomputing;Feb2004, Vol. 27 Issue 2, p129
Academic Journal
The increasing gap in performance between processors and main memory has made effective instruction prefetching techniques more important than ever. A major deficiency of existing prefetching methods is that most of them require an extra port to the I-cache. A recent study by Rivers et al. [19] shows that this factor alone explains why most modern microprocessors do not use such hardware-based I-cache prefetch schemes. The contribution of this paper is two-fold. First, we present a method that does not require an extra port to the I-cache. Second, the performance improvement for our method is greater than that of the best competing method, BHGP [23], even disregarding the improvement from not having an extra port. The three key features of our method that prevent the above deficiencies are as follows. First, late prefetching is prevented by correlating misses to dynamically preceding instructions. For example, if the I-cache miss latency is 12 cycles, then the instruction that was fetched 12 cycles prior to the miss is used as the prefetch trigger. Second, the miss history table is kept to a reasonable size by grouping contiguous cache misses together and associating them with one preceding instruction, and therefore, one table entry. Third, the extra I-cache port is avoided through efficient prefetch filtering methods. Experiments show that for our benchmarks, chosen for their poor I-cache performance, an average improvement of 9.2% in runtime is achieved versus the BHGP method [23], while the hardware cost is also reduced. The improvement will be greater if the runtime impact of avoiding an extra port is considered. When compared to the original machine without prefetching, our method improves performance by about 35% for our benchmarks.
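The three features described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the class name `MissHistoryTable`, its methods, the block-granularity addressing, and the use of a `resident` set as a stand-in for prefetch filtering are all hypothetical. The abstract does not say how the filtering actually works. The only value taken from the source is the 12-cycle miss latency used in its example.

```python
from collections import deque

MISS_LATENCY = 12  # cycles; the figure used in the abstract's example


class MissHistoryTable:
    """Toy model (hypothetical): one entry per trigger block, covering a run
    of contiguous missed blocks."""

    def __init__(self, latency=MISS_LATENCY):
        self.latency = latency
        # Recent (cycle, block) fetches; sized generously for this toy model.
        self.recent = deque(maxlen=2 * latency)
        self.table = {}            # trigger block -> [first missed block, run length]
        self.open_trigger = None   # trigger whose run of misses is still growing
        self.last_miss = None

    def record_fetch(self, cycle, block):
        self.recent.append((cycle, block))

    def _trigger_for(self, cycle):
        # Latest block fetched at least `latency` cycles before the miss.
        older = [b for c, b in self.recent if c <= cycle - self.latency]
        return older[-1] if older else None

    def record_miss(self, cycle, block):
        contiguous = self.last_miss is not None and block == self.last_miss + 1
        if self.open_trigger is not None and contiguous:
            # Grouping: extend the existing entry instead of adding a new one.
            self.table[self.open_trigger][1] += 1
        else:
            self.open_trigger = self._trigger_for(cycle)
            if self.open_trigger is not None:
                self.table[self.open_trigger] = [block, 1]
        self.last_miss = block

    def prefetches_for(self, block, resident):
        # Assumed filter: skip blocks believed to be in the I-cache already.
        entry = self.table.get(block)
        if entry is None:
            return []
        first, length = entry
        return [b for b in range(first, first + length) if b not in resident]


mht = MissHistoryTable()
for cycle in range(12):            # blocks 100..111 fetched on cycles 0..11
    mht.record_fetch(cycle, 100 + cycle)
for i, cycle in enumerate((12, 13, 14)):
    mht.record_miss(cycle, 200 + i)  # three contiguous misses: 200, 201, 202

to_prefetch = mht.prefetches_for(100, resident={201})
print(mht.table)     # {100: [200, 3]}
print(to_prefetch)   # [200, 202]
```

In this model the miss at cycle 12 is attributed to block 100, which was fetched 12 cycles earlier, so a later fetch of block 100 can trigger the prefetch early enough to hide the miss latency. The three contiguous misses share a single table entry, and the block assumed to be resident is dropped from the prefetch list.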


Related Articles

  • Processors up performance and lower costs. Cravotta, Robert // EDN;4/3/2003, Vol. 48 Issue 7, p18 

    Electronics company Analog Devices Inc. is expanding the Blackfin processor family with three devices that are available for sampling and operate as fast as 600 megahertz (MHz) or sell for as low as 5.95 dollars for a 300-MHz device. The dual-16-bit-multiply-accumulate-unit ADSP-BF531,...

  • Complex Systems Require Complex Benchmarks.  // Database Trends & Applications;Jan2004, Vol. 18 Issue 1, p26 

    Focuses on the IOzone benchmark project tool. Examination of the performance of two hard-disk drives; Assessment of the database community; Observation of the caching mechanisms.

  • Dynamic Memory Instruction Bypassing. Ortega, Daniel; Valero, Mateo; Ayguadé, Eduard // International Journal of Parallel Programming;Jun2004, Vol. 32 Issue 3, p199 

    Reducing the latency of load instructions is among the most crucial aspects to achieve high performance for current and future microarchitectures. Deep pipelining impacts load-to-use latency even for loads that hit in cache. In this paper we present a dynamic mechanism which detects relations...

  • Benchmark results posted for 208MHz 90nm ARM9.  // Electronics Weekly;7/4/2007, Issue 2295, p7 

    The article focuses on the results of the evaluation conducted by the Embedded Microprocessor Benchmarking Consortium (EEMBC) for NXP Semiconductors' LPC3180 microcontroller in Great Britain. EEMBC claims that NXP's microcontroller was the first device to demonstrate the effect of an integrated...

  • A benchmark for microprocessor power. Prophet, Graham // EDN Europe;Apr2006, Vol. 51 Issue 4, p16 

    The article focuses on benchmarking consortium EEMBC's series of metrics of processor performance called EnergyBench. It provides a standardized framework in which users can measure the power demand of a processor while executing EEMBC standard benchmark code. The power measurements are the...

  • On Cache Sizing in Multi-Core Cluster Architectures. Sibai, F. N. // International Review on Computers & Software;May2007, Vol. 2 Issue 3, p235 

    With multi-core products hitting every market segment, we focus on a 16 core cluster-based architecture comprised of 4 clusters. Assuming the L1 cache memories are private, we conduct SPEC92 simulations to study the effect of the block size, associativity, and cache size on the L1 cache hit...

  • Comparison of Processor Performance of SPECint2006 Benchmarks of some Intel Xeon Processors. PARCHUR, Abdul Kareem; SINGH, Ram Asaray // Leonardo Electronic Journal of Practices & Technologies;Jul-Dec2011, Issue 19, p109 

    High performance is a critical requirement to all microprocessors manufacturers. The present paper describes the comparison of performance in two main Intel Xeon series processors (Type A: Intel Xeon X5260, X5460, E5450 & L5320 and Type B: Intel Xeon X5140, 5130, 5120 & E5310). The...

  • Nassda's HSIM Changes Memory Design. Zhang, Xionan // Electronic News;09/11/2000, Vol. 46 Issue 37, p40 

    Focuses on the prevalence of large memory blocks in microprocessors. Importance of cache memories in graphics processors; Challenges about memory design; Background on the capabilities of HSIM, a memory system developed at Nassda.

  • AMD takes Intel's crown. Yates, Darren // Australian PC User;May2005, Vol. 17 Issue 5, p71 

    Focuses on the advantage of AMD Athlon 64 3800+ over Pentium 4 660 processor. Key features of Pentium 4 600 series that differentiate it from 500-series processors; Cache memory of Pentium 4; Multimedia performance of Athlon 64.

