algebraic-nnhw
Algebraic enhancements for GEMM & AI accelerators
Stars: 264
This repository contains the source code for a GEMM & deep learning hardware accelerator system used to validate proposed systolic array hardware architectures implementing efficient matrix multiplication algorithms that raise the performance-per-area limits of GEMM & AI accelerators. Achieved results include up to 3× faster CNN inference, >2× higher mults/multiplier/clock cycle, and low area with high clock frequency. The system is specialized for inference of non-sparse DNN models with fixed-point/quantized inputs, fully accelerates all DNN layers in hardware, and also serves as a highly optimized GEMM accelerator.
README:
This repository contains the source code for a GEMM & deep learning hardware accelerator system used to validate proposed systolic array hardware architectures that implement under-explored or newly proposed efficient matrix multiplication algorithms in hardware, increasing the traditional performance-per-area limits of GEMM & AI accelerators while producing identical output. The results achieved:
- Up to 3× faster CNN inference than state-of-the-art accelerators implemented on the same type of compute platform
- >2× higher mults/multiplier/clock cycle
- Low area, high clock frequency
The accelerator system was used to validate the systolic array hardware architectures proposed in the following publications:
- [1] T. E. Pogue and N. Nicolici, "Fast Inner-Product Algorithms and Architectures for Deep Neural Network Accelerators," in IEEE Transactions on Computers, vol. 73, no. 2, pp. 495-509, Feb. 2024, doi: 10.1109/TC.2023.3334140. Public Full-text: https://arxiv.org/abs/2311.12224
- Matrix multiplication and deep learning hardware architectures that require half the multipliers to achieve the same performance, by executing alternative inner-product algorithms that trade half the multiplications for cheap low-bitwidth additions (the sketch after this publication list illustrates the rearrangement in software). The proposed systolic arrays can be seamlessly swapped into existing systolic array systems to double performance per MAC unit, with no other functionality or design changes required in the rest of the system and no hidden tradeoffs.
- [2] T. E. Pogue and N. Nicolici, "Karatsuba Matrix Multiplication and its Efficient Custom Hardware Implementations," in IEEE Transactions on Computers, early access, Jan. 2025, doi: 10.1109/TC.2025.3525606. Public Full-text: https://arxiv.org/abs/2501.08889
- We propose the extension of Karatsuba multiplication to matrix multiplication (KMM) to reduce the complexity of integer matrix multiplication, and we present custom hardware implementations for KMM that provide area or execution time improvements for matrix multiplication and deep learning accelerators
- [3] T. E. Pogue and N. Nicolici, "Strassen Multisystolic Array Hardware Architectures," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, early access, Jan. 2025, doi: 10.1109/TVLSI.2025.3530785. Public Full-text: Strassen_Multisystolic_Array_Hardware_Architectures.pdf
- First efficient custom hardware implementations for Strassen's fast matrix multiplication algorithm, which achieve state-of-the-art performance in a deep learning accelerator
- [4] T. E. Pogue, "Algebraic Enhancements for Systolic Arrays", Ph.D. dissertation, Department of Electrical and Computer Engineering, McMaster University, Hamilton, 2025. [Online]. Available: https://macsphere.mcmaster.ca/handle/11375/30640
- Ph.D. thesis covering the above three methods, as well as more background on deep learning acceleration, algebraic enhancements, the presented deep learning accelerator system design, and future work
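To make the algebraic idea behind [1] concrete, the following is a minimal Python sketch of a Winograd-style fast inner-product rearrangement: multiplying pre-added operand pairs trades roughly half of the element-wise multiplications for low-bitwidth additions, and two correction terms (each depending on only one operand, so one can be precomputed when the weights are fixed) restore the exact result. This is only an illustrative software sketch of the general technique, not the exact FIP/FFIP formulation or hardware mapping given in [1].

```python
def inner_product_baseline(a, b):
    """Conventional inner product: N multiplications."""
    return sum(x * y for x, y in zip(a, b))


def inner_product_fast(a, b):
    """Winograd-style fast inner product for even-length integer vectors.

    Uses ~N/2 multiplications of pre-added operand pairs, plus two
    correction terms that depend only on a or only on b (so the
    weight-side term can be precomputed and reused in DNN inference).
    """
    assert len(a) == len(b) and len(a) % 2 == 0
    paired = sum((a[2 * i] + b[2 * i + 1]) * (a[2 * i + 1] + b[2 * i])
                 for i in range(len(a) // 2))
    corr_a = sum(a[2 * i] * a[2 * i + 1] for i in range(len(a) // 2))
    corr_b = sum(b[2 * i] * b[2 * i + 1] for i in range(len(b) // 2))
    return paired - corr_a - corr_b


# The rearranged form matches the baseline exactly on integer inputs.
a, b = [3, -1, 4, 1], [2, 7, -1, 8]
assert inner_product_fast(a, b) == inner_product_baseline(a, b)
```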
The field of deep learning has seen increasing breakthroughs and commercial adoption in recent years for enabling a wide range of applications including image and speech recognition, multimedia generation, information summarization, and human-like chatbots. This has led to a growing need for hardware that can quickly and efficiently perform deep learning inference, which increasingly requires massive amounts of computational power.
To address this need, recent years have seen many works for optimizing deep learning inference in hardware. Systolic arrays are an efficient class of hardware designs to use as a starting point for this application. However, after hardware-oriented deep learning model optimizations reach their limits, after the known parallelism for executing their compute patterns in hardware is exhausted, and after technology scaling slows to a halt, there is an accelerator wall that limits further improvement on the implementation side.
In this thesis, we contribute to this field through an under-explored direction by presenting new efficient matrix multiplication algorithms and/or their systolic array hardware architectures that increase performance-per-area by reducing the workload at the algebraic level, and thus by computing the same result from a re-arranged compute pattern requiring fewer or cheaper operations to be performed in hardware. We evaluate our architectures in an end-to-end deep learning accelerator, demonstrating their ability to increase the performance-per-area of hardware accelerators beyond their normal theoretical limits.
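As a concrete example of such an algebraic re-arrangement, the sketch below shows the classic Strassen step on a 2×2 block partition in plain Python: seven block multiplications replace the eight used by the conventional algorithm, at the cost of additional block additions, while the result is unchanged for integer inputs. This is only an illustrative software sketch of the underlying algorithm; the multisystolic array architectures that realize it efficiently in hardware are the subject of [3].

```python
import numpy as np


def strassen_2x2_blocks(A, B):
    """One Strassen recursion step on 2x2 block-partitioned matrices.

    Computes C = A @ B using 7 block multiplications instead of 8,
    trading the eighth multiplication for extra block additions.
    """
    n = A.shape[0]
    assert A.shape == B.shape == (n, n) and n % 2 == 0
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C


# Identical result to the conventional product on integer inputs.
rng = np.random.default_rng(0)
A = rng.integers(-8, 8, (4, 4))
B = rng.integers(-8, 8, (4, 4))
assert np.array_equal(strassen_2x2_blocks(A, B), A @ B)
```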
The contributions in [1]-[4] are relevant for the following scope of applications:
- Matrix multiplication and deep learning inference: The proposed systolic array hardware architectures from [1]-[4] improve dense matrix multiplication acceleration in general, and can therefore be exploited in accelerator systems for accelerating all DNN models/layers that decompose mainly into matrix multiplication, including fully-connected layers, CNNs, RNNs, and attention layers/transformer models.
- Fixed-point data types, same numerical stability: Most of the contributions focus on fixed-point data types and quantized neural network inference, and the presented algorithms/hardware architectures produce identical output to conventional algorithms/architectures with no change to numerical stability (the Karatsuba sketch after this list illustrates how [2] exploits the integer bit-width).
- FPGA and ASIC: Results were validated on FPGA, but the proposed architectures are general and most improvements apply to both custom integrated circuit and FPGA implementations.
- Systolic arrays, seamless system integration: The proposed architectures are systolic arrays, an efficient type of design for GEMM & deep learning acceleration (e.g., the Google TPU); it may also be possible to extend some of the concepts to non-systolic-array designs in future work. Additionally, they increase performance-per-area but have otherwise identical functionality and interfaces to traditional systolic arrays; i.e., the algebraic enhancements are fully self-contained within the systolic arrays and require no additional pre- or post-processing steps. This means they can be seamlessly swapped for traditional systolic arrays in existing accelerator systems to increase performance per MAC unit, with little or no other changes required in the rest of the system and no hidden tradeoffs.
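As an illustration of how [2] exploits the fixed-point/integer focus described above, the sketch below applies the Karatsuba idea to integer matrices split by bit-width: each 2w-bit operand matrix is written as w-bit high and low halves, and three matrix products of the halves replace the four that a naive split would need. This is only an illustrative software sketch under these assumptions; the exact KMM formulation and its hardware implementations are described in [2].

```python
import numpy as np


def karatsuba_matmul(A, B, w):
    """Karatsuba-style multiplication of unsigned integer matrices.

    A and B hold 2w-bit unsigned entries. Each is split into w-bit high
    and low halves, and C = A @ B is reconstructed from only three
    matrix products of the narrower halves (instead of four).
    """
    A_hi, A_lo = A >> w, A & ((1 << w) - 1)
    B_hi, B_lo = B >> w, B & ((1 << w) - 1)

    P_hi = A_hi @ B_hi                      # high * high
    P_lo = A_lo @ B_lo                      # low  * low
    P_mid = (A_hi + A_lo) @ (B_hi + B_lo)   # combined middle term

    return (P_hi << (2 * w)) + ((P_mid - P_hi - P_lo) << w) + P_lo


# Identical result to the conventional product.
rng = np.random.default_rng(1)
w = 4
A = rng.integers(0, 1 << (2 * w), (8, 8), dtype=np.int64)
B = rng.integers(0, 1 << (2 * w), (8, 8), dtype=np.int64)
assert np.array_equal(karatsuba_matmul(A, B, w), A @ B)
```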
Combining the architectures from [1] and [3] achieved up to 3× faster CNN inference, 2× higher mults/multiplier/clock cycle, and >40% higher clock frequency in synthesis and performance results compared to state-of-the-art accelerators implemented on similar compute platforms.
See [1], [2], [3], [4] for more results.
The deep learning accelerator system implemented in this source code, which hosts and validates the systolic arrays proposed in [1]-[4], is outlined below. The system implementation is specialized for performing inference of non-sparse DNN models with fixed-point/quantized inputs consisting of convolutional layers, fully-connected layers, and pooling layers. All DNN layers are fully accelerated in hardware, and the system can accelerate ML models with arbitrary layer dimensions/kernel sizes on a single hardware design. Input bitwidths and systolic array dimensions are parametrizable. The system is also a highly optimized GEMM accelerator in general.
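As a software analogy of how a fixed-size systolic array can serve arbitrarily sized GEMM operands, the sketch below tiles a large matrix product into fixed-size blocks and accumulates the partial products, which is roughly the role the GEMM Unit described below plays around the MXU. The tile size and loop order here are illustrative assumptions, not the actual hardware scheduling.

```python
import numpy as np


def tiled_matmul(A, B, tile=16):
    """Compute C = A @ B by accumulating fixed-size tile products.

    Mimics (in software) how a fixed-dimension systolic array can be
    reused for arbitrarily sized GEMM: each (tile x tile) partial
    product is computed independently and accumulated into C.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # One fixed-size tile product, as the MXU would produce.
                C[i:i + tile, j:j + tile] += (
                    A[i:i + tile, k:k + tile] @ B[k:k + tile, j:j + tile])
    return C


rng = np.random.default_rng(2)
A = rng.integers(-4, 4, (40, 24))
B = rng.integers(-4, 4, (24, 56))
assert np.array_equal(tiled_matmul(A, B), A @ B)
```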
System Block Diagram Overview:
- Matrix Multiply Unit (MXU) / systolic array
- Contains a systolic array architecture for performing matrix multiplication
- For each method in [1]-[4], different proposed systolic arrays/MXUs are swapped in for the MXU in the system overview above
- GEMM Unit
- Contains the MXU as well as SRAM and addition logic for accumulating matrix tiles to allow GEMM execution of arbitrarily-sized matrices
- Post-GEMM Unit
- Contains neural network-specific functions to be performed on the matrix multiplication outputs. This includes adding the bias values, inter-layer rescaling for quantization, activation, padding, and pooling.
- Memory Unit
- Contains memory access control logic and on-chip SRAM memory for holding layer activations
- Implements efficient caching & memory access hardware algorithms that map convolution to GEMM in place, without data duplication or added delay (see the convolution-to-GEMM sketch after this overview)
- Uses memory partitioning schemes allowing SRAM memory and control to run at half or quarter clock rate while outputting new data at full clock rate, to improve overall system frequency and power
- Off-chip DDR DRAM memory for weights
- RxTx Unit
- PCIe interface to host
- Instruction Unit
- For decoding accelerator instructions sent from the host that allow the system to accelerate ML models with arbitrary layer dimensions/kernel sizes all on a single hardware design
- compiler
- A compiler for parsing Python ML model descriptions into accelerator instructions that allow it to accelerate ML models with arbitrary layer dimensions/kernel sizes. This part also includes code for interfacing with a PCIe driver for initiating model execution on the accelerator, reading back results and performance counters, and testing the correctness of the results
- rtl
- Synthesizable SystemVerilog accelerator RTL
- sim
- Scripts for setting up simulation environments for verification
- tests
- UVM testbench source code for verifying the accelerator in simulation, written in Python and cocotb
- utils
- Additional Python packages and scripts created for this project as general development utilities and aids
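The Memory Unit's convolution-to-GEMM mapping can be pictured with the standard im2col lowering sketched below: each convolution window becomes one row of an activation matrix, so the convolution reduces to a matrix product with the reshaped weights. The hardware performs this mapping in place through its addressing logic rather than by materializing the matrix as this software sketch does; the sketch only illustrates the equivalence.

```python
import numpy as np


def conv2d_via_gemm(x, w):
    """Map a 2D convolution (valid padding, stride 1) onto GEMM.

    x: input activations, shape (H, W, C_in)
    w: weights, shape (KH, KW, C_in, C_out)
    Returns output of shape (H-KH+1, W-KW+1, C_out).
    """
    H, W, C_in = x.shape
    KH, KW, _, C_out = w.shape
    OH, OW = H - KH + 1, W - KW + 1

    # im2col: one row per output position, one column per weight element.
    rows = [x[i:i + KH, j:j + KW, :].reshape(-1)
            for i in range(OH) for j in range(OW)]
    act_matrix = np.stack(rows)                  # (OH*OW, KH*KW*C_in)
    weight_matrix = w.reshape(-1, C_out)         # (KH*KW*C_in, C_out)

    return (act_matrix @ weight_matrix).reshape(OH, OW, C_out)


# Cross-check against a direct convolution on a small example.
rng = np.random.default_rng(3)
x = rng.integers(-3, 3, (6, 6, 2))
w = rng.integers(-3, 3, (3, 3, 2, 4))
ref = np.zeros((4, 4, 4), dtype=np.int64)
for i in range(4):
    for j in range(4):
        ref[i, j] = np.tensordot(x[i:i + 3, j:j + 3], w, axes=3)
assert np.array_equal(conv2d_via_gemm(x, w), ref)
```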
The files rtl/top/define.svh and rtl/top/pkg.sv contain a number of configurable parameters such as FIP_METHOD in define.svh which defines the systolic array type (e.g., baseline, FIP, or FFIP [1]), SZI and SZJ which define the systolic array height/width, and LAYERIO_WIDTH/WEIGHT_WIDTH which define the input bitwidths.
The directory rtl/arith includes mxu.sv and mac_array.sv which contain the RTL for the baseline and some of the proposed systolic array architectures (FIP and FFIP [1]) depending on the value of the parameter FIP_METHOD.
For more documentation on the accelerator system, refer to [1] and Chapter 3 from [4]. For more details on the proposed systolic array architectures and algebraic enhancements validated in this accelerator system, see [1], [2], [3], [4], as well as the Ph.D. defence slideshow.
Similar Open Source Tools
Nanoflow
NanoFlow is a throughput-oriented high-performance serving framework for Large Language Models (LLMs) that consistently delivers superior throughput compared to other frameworks by utilizing key techniques such as intra-device parallelism, asynchronous CPU scheduling, and SSD offloading. The framework proposes nano-batching to schedule compute-, memory-, and network-bound operations for simultaneous execution, leading to increased resource utilization. NanoFlow also adopts an asynchronous control flow to optimize CPU overhead and eagerly offloads KV-Cache to SSDs for multi-round conversations. The open-source codebase integrates state-of-the-art kernel libraries and provides necessary scripts for environment setup and experiment reproduction.
llm-d
LLM-D is a machine learning model for sentiment analysis. It is designed to classify text data into positive, negative, or neutral sentiment categories. The model is trained on a large dataset of labeled text samples and uses natural language processing techniques to analyze and predict sentiment in new text inputs. LLM-D is a powerful tool for businesses and researchers looking to understand customer feedback, social media sentiment, and other text data sources. It can be easily integrated into existing applications or used as a standalone tool for sentiment analysis tasks.
unified-cache-management
Unified Cache Manager (UCM) is a tool designed to persist the LLM KVCache and replace redundant computations through various retrieval mechanisms. It supports prefix caching and offers training-free sparse attention retrieval methods, enhancing performance for long sequence inference tasks. UCM also provides a PD disaggregation solution based on a storage-compute separation architecture, enabling easier management of heterogeneous computing resources. When integrated with vLLM, UCM significantly reduces inference latency in scenarios like multi-turn dialogue and long-context reasoning tasks.
LongRoPE
LongRoPE is a method to extend the context window of large language models (LLMs) beyond 2 million tokens. It identifies and exploits non-uniformities in positional embeddings to enable 8x context extension without fine-tuning. The method utilizes a progressive extension strategy with 256k fine-tuning to reach a 2048k context. It adjusts embeddings for shorter contexts to maintain performance within the original window size. LongRoPE has been shown to be effective in maintaining performance across various tasks from 4k to 2048k context lengths.
MemoryBear
MemoryBear is a next-generation AI memory system developed by RedBear AI, focusing on overcoming limitations in knowledge storage and multi-agent collaboration. It empowers AI with human-like memory capabilities, enabling deep knowledge understanding and cognitive collaboration. The system addresses challenges such as knowledge forgetting, memory gaps in multi-agent collaboration, and semantic ambiguity during reasoning. MemoryBear's core features include memory extraction engine, graph storage, hybrid search, memory forgetting engine, self-reflection engine, and FastAPI services. It offers a standardized service architecture for efficient integration and invocation across applications.
ReaLHF
ReaLHF is a distributed system designed for efficient RLHF training with Large Language Models (LLMs). It introduces a novel approach called parameter reallocation to dynamically redistribute LLM parameters across the cluster, optimizing allocations and parallelism for each computation workload. ReaL minimizes redundant communication while maximizing GPU utilization, achieving significantly higher Proximal Policy Optimization (PPO) training throughput compared to other systems. It supports large-scale training with various parallelism strategies and enables memory-efficient training with parameter and optimizer offloading. The system seamlessly integrates with HuggingFace checkpoints and inference frameworks, allowing for easy launching of local or distributed experiments. ReaLHF offers flexibility through versatile configuration customization and supports various RLHF algorithms, including DPO, PPO, RAFT, and more, while allowing the addition of custom algorithms for high efficiency.
ai-algorithms
This repository is a work in progress that contains first-principle implementations of groundbreaking AI algorithms using various deep learning frameworks. Each implementation is accompanied by supporting research papers, aiming to provide comprehensive educational resources for understanding and implementing foundational AI algorithms from scratch.
EScAIP
EScAIP is an Efficiently Scaled Attention Interatomic Potential that leverages a novel multi-head self-attention formulation within graph neural networks to predict energy and forces between atoms in molecules and materials. It achieves substantial gains in efficiency, at least 10x speed up in inference time and 5x less memory usage compared to existing models. EScAIP represents a philosophy towards developing general-purpose Neural Network Interatomic Potentials that achieve better expressivity through scaling and continue to scale efficiently with increased computational resources and training data.
glake
GLake is an acceleration library and utilities designed to optimize GPU memory management and IO transmission for AI large model training and inference. It addresses challenges such as GPU memory bottleneck and IO transmission bottleneck by providing efficient memory pooling, sharing, and tiering, as well as multi-path acceleration for CPU-GPU transmission. GLake is easy to use, open for extension, and focuses on improving training throughput, saving inference memory, and accelerating IO transmission. It offers features like memory fragmentation reduction, memory deduplication, and built-in security mechanisms for troubleshooting GPU memory issues.
siiRL
siiRL is a novel, fully distributed reinforcement learning (RL) framework designed to break the scaling barriers in Large Language Models (LLMs) post-training. Developed by researchers from Shanghai Innovation Institute, siiRL delivers near-linear scalability, dramatic throughput gains, and unprecedented flexibility for RL-based LLM development. It eliminates the centralized controller common in other frameworks, enabling scalability to thousands of GPUs, achieving state-of-the-art throughput, and supporting cross-hardware compatibility. siiRL is extensively benchmarked and excels in data-intensive workloads such as long-context and multi-modal training.
ianvs
Ianvs is a distributed synergy AI benchmarking project incubated in KubeEdge SIG AI. It aims to test the performance of distributed synergy AI solutions following recognized standards, providing end-to-end benchmark toolkits, test environment management tools, test case control tools, and benchmark presentation tools. It also collaborates with other organizations to establish comprehensive benchmarks and related applications. The architecture includes critical components like Test Environment Manager, Test Case Controller, Generation Assistant, Simulation Controller, and Story Manager. Ianvs documentation covers quick start, guides, dataset descriptions, algorithms, user interfaces, stories, and roadmap.
llumnix
Llumnix is a cross-instance request scheduling layer built on top of LLM inference engines such as vLLM, providing optimized multi-instance serving performance with low latency, reduced time-to-first-token (TTFT) and queuing delays, reduced time-between-tokens (TBT) and preemption stalls, and high throughput. It achieves this through dynamic, fine-grained, KV-cache-aware scheduling, continuous rescheduling across instances, KV cache migration mechanism, and seamless integration with existing multi-instance deployment platforms. Llumnix is easy to use, fault-tolerant, elastic, and extensible to more inference engines and scheduling policies.
llm-course
The LLM course is divided into three parts: LLM Fundamentals covers essential knowledge about mathematics, Python, and neural networks; The LLM Scientist focuses on building the best possible LLMs using the latest techniques; and The LLM Engineer focuses on creating and deploying LLM-based applications. For an interactive version of the course, the author created two LLM assistants that answer questions and test knowledge in a personalized way: a HuggingChat Assistant (free, using Mixtral-8x7B) and a ChatGPT Assistant (requires a premium account). The course also includes a collection of notebooks and articles related to large language models, with tools such as LLM AutoEval (automatically evaluate LLMs using RunPod), LazyMergekit (merge models with MergeKit in one click), LazyAxolotl (fine-tune models in the cloud using Axolotl in one click), AutoQuant (quantize LLMs in GGUF, GPTQ, EXL2, AWQ, and HQQ formats in one click), Model Family Tree (visualize the family tree of merged models), and ZeroSpace (automatically create a Gradio chat interface using a free ZeroGPU).
Equivariant-Encryption-for-AI
At Nesa, privacy is a critical objective. Equivariant Encryption (EE) is a solution developed to perform inference on neural networks without exposing input and output data. EE integrates specialized transformations for neural networks, maintaining data privacy while ensuring inference operates correctly on encrypted inputs. It provides the same latency as plaintext inference with no slowdowns and offers strong security guarantees. EE avoids the computational costs of traditional Homomorphic Encryption (HE) by preserving non-linear neural functions. The tool is designed for modern neural architectures, ensuring accuracy, scalability, and compatibility with existing pipelines.
awesome-openvino
Awesome OpenVINO is a curated list of AI projects based on the OpenVINO toolkit, offering a rich assortment of projects, libraries, and tutorials covering various topics like model optimization, deployment, and real-world applications across industries. It serves as a valuable resource continuously updated to maximize the potential of OpenVINO in projects, featuring projects like Stable Diffusion web UI, Visioncom, FastSD CPU, OpenVINO AI Plugins for GIMP, and more.
For similar jobs
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
openvino
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. It provides a common API to deliver inference solutions on various platforms, including CPU, GPU, NPU, and heterogeneous devices. OpenVINO™ supports pre-trained models from Open Model Zoo and popular frameworks like TensorFlow, PyTorch, and ONNX. Key components of OpenVINO™ include the OpenVINO™ Runtime, plugins for different hardware devices, frontends for reading models from native framework formats, and the OpenVINO Model Converter (OVC) for adjusting models for optimal execution on target devices.
peft
PEFT (Parameter-Efficient Fine-Tuning) is a collection of state-of-the-art methods that enable efficient adaptation of large pretrained models to various downstream applications. By only fine-tuning a small number of extra model parameters instead of all the model's parameters, PEFT significantly decreases the computational and storage costs while achieving performance comparable to fully fine-tuned models.
jetson-generative-ai-playground
This repo hosts tutorial documentation for running generative AI models on NVIDIA Jetson devices. The documentation is auto-generated and hosted on GitHub Pages using their CI/CD feature to automatically generate/update the HTML documentation site upon new commits.
emgucv
Emgu CV is a cross-platform .Net wrapper for the OpenCV image-processing library. It allows OpenCV functions to be called from .NET compatible languages. The wrapper can be compiled by Visual Studio, Unity, and "dotnet" command, and it can run on Windows, Mac OS, Linux, iOS, and Android.
MMStar
MMStar is an elite vision-indispensable multi-modal benchmark comprising 1,500 challenge samples meticulously selected by humans. It addresses two key issues in current LLM evaluation: the unnecessary use of visual content in many samples and the existence of unintentional data leakage in LLM and LVLM training. MMStar evaluates 6 core capabilities across 18 detailed axes, ensuring a balanced distribution of samples across all dimensions.
VLMEvalKit
VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction.
llava-docker
This Docker image for LLaVA (Large Language and Vision Assistant) provides a convenient way to run LLaVA locally or on RunPod. LLaVA is a powerful AI tool that combines natural language processing and computer vision capabilities. With this Docker image, you can easily access LLaVA's functionalities for various tasks, including image captioning, visual question answering, text summarization, and more. The image comes pre-installed with LLaVA v1.2.0, Torch 2.1.2, xformers 0.0.23.post1, and other necessary dependencies. You can customize the model used by setting the MODEL environment variable. The image also includes a Jupyter Lab environment for interactive development and exploration. Overall, this Docker image offers a comprehensive and user-friendly platform for leveraging LLaVA's capabilities.