Data movement is the dominating factor affecting
performance and energy in modern computing systems. Consequently, many
algorithms have been developed to minimize the number of I/O operations for
common computing patterns. Matrix multiplication is no exception: lower
bounds have been proven, and matching algorithms implemented, for both shared and distributed memory
systems. Reconfigurable hardware platforms are an attractive target for I/O
minimizing algorithms, as they offer full control of memory accesses to the
programmer. While bounds developed in the context of fixed architectures still
apply to these platforms, the spatially distributed nature of their
computational and memory resources requires a decentralized approach to
optimize algorithms for maximum hardware utilization. We present a model to
optimize matrix multiplication for FPGA platforms, simultaneously targeting
maximum performance and minimum off-chip data movement, within constraints set
by the hardware. We map the model to a concrete architecture using a
high-level synthesis tool, maintaining a high level of abstraction that allows
us to support arbitrary data types and enables maintainability and
portability across FPGA devices. Kernels generated from our architecture are
shown to offer competitive performance in practice, scaling with both compute
and memory resources. We offer our design as an open-source project to
encourage the open development of linear algebra and I/O minimizing algorithms
on reconfigurable hardware platforms.
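
To make the notion of minimizing off-chip data movement concrete, the following is a minimal sketch in Python and not the architecture or model described above. It assumes a simple two-level memory in which one tile of the output matrix is held in fast memory and accumulated as a sum of outer products, and it counts the words moved to and from slow memory. The function name `tiled_matmul` and the parameters `tile_m` and `tile_n` are hypothetical, chosen only for illustration.

```python
def tiled_matmul(A, B, tile_m, tile_n):
    """Compute C = A * B while holding one tile_m x tile_n tile of C in
    (simulated) fast memory. Returns (C, words_moved), where words_moved
    counts words loaded from or stored to (simulated) slow memory."""
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0] * m for _ in range(n)]
    words_moved = 0
    for i0 in range(0, n, tile_m):
        for j0 in range(0, m, tile_n):
            i1 = min(i0 + tile_m, n)
            j1 = min(j0 + tile_n, m)
            # The output tile stays in fast memory for the whole reduction.
            acc = [[0] * (j1 - j0) for _ in range(i1 - i0)]
            for p in range(k):
                # Load one column segment of A and one row segment of B.
                a_col = [A[i][p] for i in range(i0, i1)]
                b_row = [B[p][j] for j in range(j0, j1)]
                words_moved += len(a_col) + len(b_row)
                # Rank-1 (outer product) update of the tile.
                for i, a in enumerate(a_col):
                    for j, b in enumerate(b_row):
                        acc[i][j] += a * b
            # Write the finished tile back to slow memory once.
            for i in range(i1 - i0):
                for j in range(j1 - j0):
                    C[i0 + i][j0 + j] = acc[i][j]
            words_moved += (i1 - i0) * (j1 - j0)
    return C, words_moved
```

Under these assumptions, and when the tile sizes divide the matrix dimensions, the count is N*M*K*(1/tile_m + 1/tile_n) + N*M words for an N x K by K x M product, so larger output tiles reduce traffic until fast memory capacity becomes the limit.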
@inproceedings{,
  author={Johannes de Fine Licht and Grzegorz Kwasniewski and Torsten Hoefler},
  title={{Flexible Communication Avoiding Matrix Multiplication on FPGA with High-Level Synthesis}},
  year={2020},
  month={Feb.},
  note={In Proceedings of the 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays},
  source={http://www.unixer.de/~htor/publications/},
}