spcl / Gemm_hls

Licence: bsd-3-clause

Scalable systolic array-based matrix-matrix multiplication implemented in Vivado HLS for Xilinx FPGAs.

Scalable matrix-matrix multiplication on FPGA

This repository includes a pure Vivado HLS implementation of matrix-matrix multiplication (A*B=C) for Xilinx FPGAs, using Xilinx Vitis/SDx/SDAccel to instantiate memory and PCIe controllers and interface with the host.

Experiments run on a VCU1525 achieved 462 GFLOP/s, 301 GFLOP/s and 132 GFLOP/s for half, single, and double precision, respectively, with routing across the three SLRs being the primary bottleneck preventing further scaling. The code is not device-specific, and can be configured for any Xilinx FPGA supported by the Xilinx OpenCL runtime. Kernels have also been verified to execute on TUL KU115 and Alveo U250 boards with similar results.

The implementation uses a systolic array approach, where linearly connected processing elements compute distinct contributions to the outer product of tiles of the output matrix.
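As a rough software analogy (not the HLS kernel itself), the outer-product formulation can be sketched in plain Python. The function name, the `num_pes` parameter, and the way rows are split across processing elements are assumptions made for illustration; the sketch also treats the whole matrix as a single tile. The actual hardware design is in kernel/Compute.cpp.

```python
def matmul_outer_product(a, b, num_pes):
    """Compute C = A * B by accumulating one outer product per step of k.

    a is N x K and b is K x M (lists of lists). Rows of C are split evenly
    across `num_pes` hypothetical processing elements (PEs).
    """
    n, k_dim, m = len(a), len(b), len(b[0])
    assert n % num_pes == 0, "for simplicity, N must be divisible by num_pes"
    rows_per_pe = n // num_pes
    c = [[0] * m for _ in range(n)]
    for k in range(k_dim):  # one column of A and one row of B per step
        b_row = b[k]  # in a linear chain, this row is passed from PE to PE
        for pe in range(num_pes):  # in hardware, the PEs work concurrently
            for i in range(pe * rows_per_pe, (pe + 1) * rows_per_pe):
                a_val = a[i][k]  # value of A held by this PE
                for j in range(m):
                    c[i][j] += a_val * b_row[j]  # distinct entries per PE
    return c


a = [[1, 2], [3, 4], [5, 6], [7, 8]]  # N x K = 4 x 2
b = [[1, 0, 2], [0, 1, 3]]            # K x M = 2 x 3
print(matmul_outer_product(a, b, num_pes=2))
# prints [[1, 2, 8], [3, 4, 18], [5, 6, 28], [7, 8, 38]]
```

Each PE only ever updates its own rows of C, which is why the contributions are distinct and no PE needs to wait on another's results.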

The approach used to implement this kernel was presented at FPGA'20 [1]. For a general description of the optimization techniques that we apply, we refer to our article on HLS transformations [2]. We also gave a tutorial on HLS for HPC at HiPEAC'20, SC'19, SC'18, and PPoPP'18.

The compute kernel is in kernel/Compute.cpp, and the modules accessing memory are in kernel/Memory.cpp.

Downloading the code

This project uses the open source Vivado HLS extension library hlslib [3] for simulation, vectorization, finding Xilinx tools, host-side integration and more.

Since hlslib is included as a submodule, make sure you clone with --recursive or grab it after cloning with:

git submodule update --init

Prerequisites

To build and run kernels in hardware, Xilinx Vitis or SDAccel must be installed and available on the PATH (tested with versions 2018.2, 2019.2, and 2020.1).

Configuration and running

This project is configured and built using CMake. Most parameters must be set at configuration-time, as they are used to specialize the hardware.

An example of configuring and building the kernel and executing it in hardware is shown below (starting from the source directory):

mkdir build
cd build
cmake ../ -DMM_DATA_TYPE=float -DMM_PARALLELISM_N=32 -DMM_PARALLELISM_M=8 -DMM_MEMORY_TILE_SIZE_N=512 -DMM_MEMORY_TILE_SIZE_M=512
make
make synthesis
make compile_hardware
make link_hardware
./RunHardware.exe hw

Matrix sizes use the convention that A: NxK, B: KxM, and C: NxM.

By default, the build targets the Alveo U250 acceleration board, but this can be changed using the MM_DSA_NAME CMake parameter.

The implementation is not restricted to use multiplication and addition as operators. To use other operators, for example addition and minimum to implement the distance product, specify them using the MM_MAP_OP and MM_REDUCE_OP CMake parameters, respectively. To see which operators are pre-implemented, and examples of how to implement new operators, see hlslib/include/hlslib/xilinx/Operators.h.
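To illustrate what swapping the operators means, here is a minimal Python sketch of a matrix product parameterized by a map operator (applied to pairs of elements) and a reduce operator (used to combine the results). The function `generalized_matmul` is hypothetical and only mirrors the idea; it does not reproduce the interface in hlslib/include/hlslib/xilinx/Operators.h or the values accepted by MM_MAP_OP and MM_REDUCE_OP.

```python
import operator
from functools import reduce


def generalized_matmul(a, b, map_op, reduce_op):
    """C[i][j] = reduce_op over k of map_op(A[i][k], B[k][j])."""
    n, k_dim, m = len(a), len(b), len(b[0])
    return [
        [reduce(reduce_op, (map_op(a[i][k], b[k][j]) for k in range(k_dim)))
         for j in range(m)]
        for i in range(n)
    ]


# Ordinary matrix multiplication: map = multiply, reduce = add.
standard = generalized_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]],
                              operator.mul, operator.add)
print(standard)  # prints [[19, 22], [43, 50]]

# Distance product: map = add, reduce = minimum.
INF = float("inf")  # marks a missing edge
w = [[0, 3, INF],
     [INF, 0, 1],
     [2, INF, 0]]
distance = generalized_matmul(w, w, operator.add, min)
print(distance)  # prints [[0, 3, 4], [3, 0, 1], [2, 5, 0]]
```

With `w` read as the edge weights of a small directed graph, the distance product of `w` with itself gives the shortest path lengths using at most two edges.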

Selecting tile sizes

See our publication at FPGA'20 [1] on how to choose tile sizes for optimal fast memory and compute utilization.

Parallel performance

The amount of parallelism in the code is determined by the MM_PARALLELISM_N and MM_PARALLELISM_M configuration variables. The former determines the number of processing elements instantiated, and the latter sets the vector width (granularity) of each processing element. MM_PARALLELISM_M should be set to at most 32 bytes / sizeof(<your operand>) (e.g., 8 for float or int, 4 for double or long, 16 for 16-bit int) to avoid performance and routing issues.

The expected performance in Op/s (FLOP/s in the case of floating point types) of a given configuration can be computed as:

2 * MM_PARALLELISM_N * MM_PARALLELISM_M * Frequency

In practice, MM_PARALLELISM_N buffered values of A are applied to MM_PARALLELISM_M values of B.
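As a worked example of the formula above, the following Python sketch evaluates it for the parallelism values used in the configuration example (MM_PARALLELISM_N=32, MM_PARALLELISM_M=8). The function name is hypothetical, and the 200 MHz clock is an assumed value chosen only for illustration; the frequency actually achieved depends on the device and the build.

```python
def expected_ops_per_second(parallelism_n, parallelism_m, frequency_hz):
    """Expected throughput: 2 * MM_PARALLELISM_N * MM_PARALLELISM_M * Frequency.

    The factor 2 counts one map operation (e.g., multiply) and one reduce
    operation (e.g., add) per element and cycle.
    """
    return 2 * parallelism_n * parallelism_m * frequency_hz


# 32 processing elements, vector width 8, assumed clock of 200 MHz.
peak = expected_ops_per_second(32, 8, 200_000_000)
print(f"{peak / 1e9:.1f} GOp/s")  # prints 102.4 GOp/s
```

Because the estimate is linear in all three factors, doubling the number of processing elements or the vector width doubles the expected throughput, provided the design still meets the same clock frequency.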

Bugs

If you experience bugs, or have suggestions for improvements, please use the issue tracker to report them.

Publication

If this code has been useful to your research, please consider citing us:

BibTeX:

@inproceedings{mmm_hls,
  title={Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis},
  author={de~Fine~Licht, Johannes and Kwasniewski, Grzegorz and Hoefler, Torsten},
  booktitle={The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20)},
  year={2020}
}

Plain text:

Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler. "Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis." In Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20).

References

[1] Johannes de Fine Licht, Grzegorz Kwasniewski, and Torsten Hoefler. "Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis." In Proceedings of the 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA'20), 2020.

[2] Johannes de Fine Licht, Maciej Besta, Simon Meierhans, and Torsten Hoefler. "Transformations of High-Level Synthesis Codes for High-Performance Computing." arXiv preprint arXiv:1805.08288 (2018).

[3] Johannes de Fine Licht and Torsten Hoefler. "hlslib: Software Engineering for Hardware Design." Presented at the Fifth International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'19).
