FPGA Based High Performance Double-Precision Matrix Multiplication

Kumar, Vinay B. Y.; Joshi, Siddharth; Patkar, Sachin B.; Narayanan, H.
June 2010
International Journal of Parallel Programming;Jun2010, Vol. 38 Issue 3/4, p322
Academic Journal
We present two designs (I and II) for IEEE 754 double precision floating point matrix multiplication, optimized for implementation on high-end FPGAs. Matrix multiplication forms the kernel of many important tile-based BLAS algorithms, making it an excellent candidate for acceleration. The designs, both based on the rank-1 update scheme, can handle arbitrary matrix sizes, and are able to sustain their peak performance except during an initial latency period. Through these designs, the trade-offs between local memory and bandwidth for an FPGA implementation are demonstrated, and an analysis is presented for the optimal choice of design parameters. The designs, implemented on a Virtex-5 SX240T FPGA, scale gracefully from 1 to 40 processing elements (PEs) with less than 1% degradation in the design frequency of 373 MHz. With 40 PEs and a design speed of 373 MHz, a sustained performance of 29.8 GFLOPS is possible with a bandwidth requirement of 750 MB/s for design-II and 5.9 GB/s for design-I. This compares favourably with both related art and general purpose CPU implementations.
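
The rank-1 update scheme named in the abstract can be illustrated in software. The sketch below is not the authors' hardware design: it only shows the arithmetic pattern, in which C = A·B is accumulated as a sum of outer products of one column of A with one row of B. The function names `matmul_rank1` and `peak_gflops` are hypothetical, and the peak-rate helper assumes each PE completes one multiply and one add per cycle (2 FLOPs), an assumption that happens to match the abstract's figures (40 PEs at 373 MHz giving about 29.8 GFLOPS).

```python
def matmul_rank1(A, B):
    """Compute C = A * B as a sum of rank-1 (outer-product) updates.

    A is n x k and B is k x m, both given as lists of rows.
    """
    n, k = len(A), len(B)
    m = len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must agree"

    C = [[0.0] * m for _ in range(n)]
    for p in range(k):
        col = [A[i][p] for i in range(n)]  # p-th column of A
        row = B[p]                         # p-th row of B
        # Rank-1 update: C += col (outer product) row
        for i in range(n):
            a = col[i]
            for j in range(m):
                C[i][j] += a * row[j]
    return C


def peak_gflops(num_pes, freq_mhz, flops_per_pe_per_cycle=2):
    """Peak rate in GFLOPS.

    Assumption (not stated in the abstract): each PE performs one
    multiply and one add per clock cycle, i.e. 2 FLOPs per cycle.
    """
    return num_pes * freq_mhz * 1e6 * flops_per_pe_per_cycle / 1e9


if __name__ == "__main__":
    print(matmul_rank1([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
    print(round(peak_gflops(40, 373), 2))
```

In each step of the outer loop, every entry of C receives one multiply-accumulate, so the work of a single update can be spread across independent processing elements. That is one plausible reading of why the scheme suits a PE array, offered here as an interpretation rather than a description of designs I and II.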


Related Articles

  • Experimental Investigation of Low-Jitter and Wide-Band Dual Cascaded PLL System. Telba, Ahmed; Qasim, Syed Manzoor // AIP Conference Proceedings;8/25/2011, Vol. 1373 Issue 1, p59 

    Jitter is a matter of great concern for high-speed digital designers because of its ability to degrade the overall system performance. Designing a low-jitter and wide-band phase locked loop (PLL) system is of practical importance because of its application in high speed digital systems. This...

  • Design of Data Buffer Circuit in High-speed, High-resolution Video Image Acquisition System on FPGA. Yao Lu; Zhao Yong; Li Xia // International MultiConference of Engineers & Computer Scientists;2007, p1771 

    This paper introduces a logic circuit of data buffer between higher speed ADC and lower speed DSP. While allowing dropping some frames, the circuit also realizes effective image rescale and image buffer transfers frame by frame. The realization of this design initially solves the problem of the...

  • Accelerating Seismic Computations Using Customized Number Representations on FPGAs. Haohuan Fu; Osborne, William; Clapp, Robert G.; Mencer, Oskar; Luk, Wayne // EURASIP Journal on Embedded Systems;1/1/2009, Special section p1 

    The oil and gas industry has an increasingly large demand for high-performance computation over huge volumes of data. Compared to common processors, field-programmable gate arrays (FPGAs) can boost the computation performance with a streaming computation architecture and the support for...

  • FPGA-configuration scheme is flexible. Zhe Lou // EDN;1/22/2004, Vol. 49 Issue 2, p82 

    Provides tips on individually programming multiple field programmable gate arrays (FPGA). Methods in programming an FPGA; Suggested microcontrollers for FPGA programming; Details on the use of a bus switch from On Semiconductor for multiplexing or demultiplexing.

  • The Main Circuit.  // ECN: Electronic Component News;May2010, Vol. 54 Issue 6, p15 

    The article focuses on the performance of field-programmable gate arrays (FPGAs) Mezzanine Card (FMC) drives on the aerospace and defense signal processing systems which have the potential of I/O bandwidth. It mentions that PCI Mezzanine Card (PMC) for embedded computing is the most known...

  • Design of Multi-channel Digital Video Optical Transmitter Based on FPGA. Zhang Changsen; Huang Dexin // Proceedings of the International Symposium on Electronic Commerc;Jun2010, p91 

    This paper presents a design proposal for a multiplexed digital optical transmitter based on an FPGA, built around the EP2C35F6728C chip of Altera's Cyclone II series as the core digital processor. The system includes A/D converter, TDM, high-speed series to parallel converter and...

  • Implementation of Adaptive OFDM System Using FPGA. Mohamed, M. A.; Samarah, A. S.; Allah, M. I. Fath // International Journal of Computer Science Issues (IJCSI);May2012, Vol. 9 Issue 3, p246 

    OFDM is a modulation as well as multiplexing technique which is now widely used in various high speed mobile and wireless communication systems because of its capacity of ensuring high level robustness against interference. In this paper the design and implementation of OFDM system will be...

  • Asynchronous Realization of Algebraic Integer-Based 2D DCT Using Achronix Speedster SPD60 FPGA. Rajapaksha, Nilanka; Edirisuriya, Amila; Madanayake, Arjuna; Cintra, Renato J.; Onen, Dennis; Amer, Ihab; Dimitrov, Vassil S. // Journal of Electrical & Computer Engineering;2013, p1 

    Transformation and quantization play a critical role in video codecs. Recently proposed algebraic-integer-(AI-) based discrete cosine transform (DCT) algorithms are analyzed in the presence of quantization, using the High Efficiency Video Coding (HEVC) standard. AI DCT is implemented and tested...

  • A Novel Scalable and Storage-Efficient Architecture for High Speed Exact String Matching. Peiravi, Ali; Rahimzadeh, Mohammad Javad // ETRI Journal;2009, Vol. 31 Issue 5, p545 

    String matching is a fundamental element of an important category of modern packet processing applications which involve scanning the content flowing through a network for thousands of strings at the line rate. To keep pace with high network speeds, specialized hardware-based solutions are...

