Resource-Aware Compiler Prefetching for Fine-Grained Many-Cores

Caragea, George; Tzannes, Alexandros; Keceli, Fuat; Barua, Rajeev; Vishkin, Uzi
October 2011
International Journal of Parallel Programming;Oct2011, Vol. 39 Issue 5, p615
Academic Journal
Super-scalar, out-of-order processors that can have tens of read and write requests in the execution window place significant demands on Memory Level Parallelism (MLP). Multi- and many-cores with shared parallel caches further increase MLP demand. Current cache hierarchies, however, have been unable to keep up with this trend, with modern designs allowing only 4-16 concurrent cache misses. This disconnect is exacerbated by recent highly parallel architectures (e.g. GPUs), where per-core power and area budgets favor numerous lighter cores with fewer resources, further reducing support for MLP on a per-core basis. Support for hardware and software prefetch increases MLP pressure, since these techniques overlap multiple memory requests with existing computation. In this paper, we propose and evaluate a novel Resource-Aware Prefetching (RAP) compiler algorithm that is aware of the number of simultaneous prefetches supported and optimizes for that limit. We implemented our algorithm in a GCC-derived compiler and evaluated its performance using an emerging fine-grained many-core architecture. Our results show that the RAP algorithm outperforms a well-known loop prefetching algorithm by up to 40.15% in run-time on average across benchmarks and the state-of-the-art GCC implementation by up to 34.79%, depending upon hardware configuration. Moreover, we compare the RAP algorithm with a simple hardware prefetching mechanism, and show run-time improvements of up to 24.61%. To demonstrate the robustness of our approach, we conduct a design-space exploration (DSE) for the considered target architecture by varying (i) the amount of chip resources designated for per-core prefetch storage and (ii) off-chip bandwidth. We show that the RAP algorithm is robust in that it improves performance across all design points considered. We also identify the Pareto-optimal hardware-software configuration, which delivers 53.66% run-time improvement on average while using only 5.47% more chip area than the bare-bones design.
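
The abstract does not describe the RAP algorithm itself, so the following is only a minimal sketch of the general idea of making software prefetching aware of a limited number of prefetch slots. It is not the authors' method. The helper names `capped_distance` and `dot_prefetch`, the slot count, and the rule "streams times distance must not exceed the slots" are all assumptions made for illustration. The only real API used is GCC's `__builtin_prefetch`.

```c
#include <stddef.h>

/* Hypothetical helper (an assumption, not taken from the paper): cap the
 * prefetch distance, measured in loop iterations, so that
 * streams * distance never exceeds the number of prefetch slots that
 * one core is assumed to provide. */
static size_t capped_distance(size_t ideal, size_t streams, size_t slots)
{
    if (streams == 0 || slots == 0)
        return 0;                       /* no streams or no slots: do not prefetch */
    size_t max_by_slots = slots / streams;
    return ideal < max_by_slots ? ideal : max_by_slots;
}

/* Dot product over two array streams with software prefetching `dist`
 * iterations ahead. For simplicity it issues one prefetch per element
 * rather than one per cache line. */
static double dot_prefetch(const double *a, const double *b,
                           size_t n, size_t dist)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (dist > 0 && i + dist < n) {
            /* GCC builtin: address, rw = 0 (read), locality hint = 1 */
            __builtin_prefetch(&a[i + dist], 0, 1);
            __builtin_prefetch(&b[i + dist], 0, 1);
        }
        sum += a[i] * b[i];
    }
    return sum;
}
```

With two streams and an assumed budget of four slots, `capped_distance(8, 2, 4)` reduces an ideal distance of 8 iterations to 2, so the loop never asks for more outstanding prefetches than the assumed hardware could hold.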


Related Articles

  • Compile-Time Scheduling Algorithms for a Heterogeneous Network of Workstations. Cierniak, M.; Zaki, M. J.; Li, W. // Computer Journal;1997, Vol. 40 Issue 6, p356 

    In this paper, we study the problem of scheduling parallel loops at compile time for a heterogeneous network of workstations. We consider heterogeneity in various aspects of parallel programming: program, processor, memory and network. A heterogeneous program has parallel loops with different...

  • Linear and Extended Linear Transformations for Shared-Memory Multiprocessors. Kulkarni, D.; Stumm, M. // Computer Journal;1997, Vol. 40 Issue 6, p373 

    Advances in program transformation frameworks have significantly advanced compiler technology over the past few years. Program transformation frameworks provide mathematical abstractions of loop and data structures and formal methods for manipulating these structures. It is these frameworks that...

  • Protein Secondary Structure Prediction Using Parallelized Rule Induction from Coverings. Leong Lee; Kandoth, Cyriac; Leopold, Jennifer L.; Frank, Ronald L. // International Journal of Biological & Life Sciences;May2012, Vol. 8 Issue 2, p99 

    Protein 3D structure prediction has always been an important research area in bioinformatics. In particular, the prediction of secondary structure has been a well-studied research topic. Despite the recent breakthrough of combining multiple sequence alignment information and artificial...

  • Blocking in Parallel Multisearch Problems. Dittrich, W.; Hutchinson, D. A.; Maheshwari, A. // Theory of Computing Systems;2001, Vol. 34 Issue 2, p145 

    External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Blockwise access to data is a central theme in the design of efficient EM algorithms. A similar requirement arises in the...

  • Stochastic Model Based Proxy Servers Architecture for VoD to Achieve Reduced Client Waiting Time. GopalaKrishnan Nair, T. R.; Dakshayini, M. // International Journal of Computer Science Issues (IJCSI);Jan2010, Vol. 7 Issue 1, p73 

    In a video on demand system, the main video repository may be far away from the user and generally has limited streaming capacities. Since a high quality video's size is huge, it requires high bandwidth for streaming over the internet. In order to achieve a higher video hit ratio, reduced client...

  • COMPONENTS OF THE COMPUTING ENVIRONMENT. Iyamu, Tiko; Olummide, O. Obe // Computer Science & Telecommunications;2010, Vol. 26 Issue 3, p142 

    More and more organizations are depending on the computing environment to support and enable their business processes, activities and services. Strategies of many organizations involve both external and internal factors, including employees of the computing environment and the entire...

  • A NEW APPROACH FOR BROADBAND BACKUP LINK TO INTERNET IN CAMPUS NETWORK ENVIRONMENT. Ismail, Mohd Nazri // Computer Science & Telecommunications;2010, Vol. 26 Issue 3, p123 

    Most research focuses on applying backup resource reprovisioning when a network failure occurs at some particular intervals over a certain time. In this study, we investigate the benefits of performing backup link to improve network connections after the primary link failure as well as backup...

  • Arghhh! Hice, Randy C. // Scientific Computing & Instrumentation;Apr2004, Vol. 21 Issue 5, p10 

    Discusses the system that stores MPEG data on DVDs. Memory storage of the system; Usage of compression algorithms in storing data; Advantages in application of the said system.

  • An introduction to LTFS for digital media. RICHTER, RAINER // Broadcast Engineering;Sep2012, Vol. 54 Issue 9, p28 

    The article offers information on Linear Tape File System (LTFS) related to data storage. LTFS broadens Linear Tape-Open (LTO) technology usefulness by being easier to use and more robust. LTFS format is useful if the tapes are to be sent off-site, archived or shared with many recipients. The...

