algebraic-nnhw
Algebraic enhancements for GEMM & AI accelerators
Stars: 264
This repository contains the source code for a GEMM & deep learning hardware accelerator system used to validate proposed systolic array hardware architectures that implement efficient matrix multiplication algorithms, pushing the performance-per-area limits of GEMM & AI accelerators. Achieved results include up to 3× faster CNN inference, >2× higher mults/multiplier/clock cycle, and low area with high clock frequency. The system is specialized for inference of non-sparse DNN models with fixed-point/quantized inputs, fully accelerates all DNN layers in hardware, and also serves as a highly optimized GEMM accelerator.
README:
This repository contains the source code for a GEMM & deep learning hardware accelerator system used to validate proposed systolic array hardware architectures that implement under-explored or newly proposed efficient matrix multiplication algorithms in hardware, increasing the traditional performance-per-area limits of GEMM & AI accelerators while producing identical output. Results achieved:
- Up to 3× faster CNN inference than state-of-the-art accelerators implemented on the same type of compute platform
- >2× higher mults/multiplier/clock cycle
- Low area, high clock frequency
The accelerator system was used to validate the systolic array hardware architectures proposed in the following publications:
- [1] T. E. Pogue and N. Nicolici, "Fast Inner-Product Algorithms and Architectures for Deep Neural Network Accelerators," in IEEE Transactions on Computers, vol. 73, no. 2, pp. 495-509, Feb. 2024, doi: 10.1109/TC.2023.3334140. Public Full-text: https://arxiv.org/abs/2311.12224
- Matrix multiplication and deep learning hardware architectures that require half the multipliers to achieve the same performance, by executing alternative inner-product algorithms that trade half the multiplications for cheap low-bitwidth additions (see the sketch after this list). The proposed systolic arrays can be seamlessly swapped into existing systolic array systems to double performance per MAC unit, with no other functionality or design changes required in the rest of the system and no hidden tradeoffs.
- [2] T. E. Pogue and N. Nicolici, "Karatsuba Matrix Multiplication and its Efficient Custom Hardware Implementations," in IEEE Transactions on Computers, early access, Jan. 2025, doi: 10.1109/TC.2025.3525606. Public Full-text: https://arxiv.org/abs/2501.08889
- We propose extending Karatsuba multiplication to matrix multiplication (KMM) to reduce the complexity of integer matrix multiplication, and we present custom hardware implementations of KMM that provide area or execution-time improvements for matrix multiplication and deep learning accelerators.
- [3] T. E. Pogue and N. Nicolici, "Strassen Multisystolic Array Hardware Architectures," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, early access, Jan. 2025, doi: 10.1109/TVLSI.2025.3530785. Public Full-text: Strassen_Multisystolic_Array_Hardware_Architectures.pdf
- First efficient custom hardware implementations for Strassen's fast matrix multiplication algorithm, which achieve state-of-the-art performance in a deep learning accelerator.
- [4] T. E. Pogue, "Algebraic Enhancements for Systolic Arrays", Ph.D. dissertation, Department of Electrical and Computer Engineering, McMaster University, Hamilton, 2025. [Online]. Available: https://macsphere.mcmaster.ca/handle/11375/30640
- Ph.D. thesis covering the above three methods, as well as more background on deep learning acceleration, algebraic enhancements, the presented deep learning accelerator system design, and future work.
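The three algebraic ideas above can be illustrated in plain NumPy. The sketch below is a software reference only, not code from this repository and not the proposed hardware architectures: a classic fast inner-product identity of the kind [1] builds on, a Karatsuba-style matrix multiplication in the spirit of [2], and one level of Strassen recursion as in [3]. The split widths and matrix sizes are illustrative.

```python
# Minimal NumPy illustrations of the algebraic identities behind [1]-[3].
# Software reference sketches only; not the proposed hardware architectures.
import numpy as np

def fip_dot(x, y):
    # Fast inner product of the kind [1] builds on: n/2 multiplications
    # plus extra cheap additions instead of n multiplications. The two
    # correction sums each depend on only one operand, so in a GEMM they
    # are amortized across whole rows/columns.
    xe, xo = x[0::2], x[1::2]
    ye, yo = y[0::2], y[1::2]
    return np.sum((xe + yo) * (xo + ye)) - np.sum(xe * xo) - np.sum(ye * yo)

def kmm(A, B, w):
    # Karatsuba matrix multiplication in the spirit of [2]: split the
    # integer entries into high/low w-bit halves, then form the product
    # from 3 matrix multiplications instead of 4.
    mask = (1 << w) - 1
    Ah, Al = A >> w, A & mask
    Bh, Bl = B >> w, B & mask
    P_hi, P_lo = Ah @ Bh, Al @ Bl
    P_mid = (Ah + Al) @ (Bh + Bl) - P_hi - P_lo
    return (P_hi << (2 * w)) + (P_mid << w) + P_lo

def strassen(A, B):
    # One level of Strassen's algorithm, implemented in hardware in [3]:
    # 7 block multiplications instead of 8 (even dimensions assumed).
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    return np.block([[M1 + M4 - M5 + M7, M3 + M5],
                     [M2 + M4, M1 - M2 + M3 + M6]])

# All three match the conventional integer result exactly:
rng = np.random.default_rng(0)
x, y = rng.integers(0, 16, 8), rng.integers(0, 16, 8)
A, B = rng.integers(0, 256, (4, 4)), rng.integers(0, 256, (4, 4))
assert fip_dot(x, y) == x @ y
assert np.array_equal(kmm(A, B, w=4), A @ B)
assert np.array_equal(strassen(A, B), A @ B)
```

In hardware, these identities translate into fewer or narrower multipliers per MAC unit for the same throughput; the papers describe how each one maps onto a systolic array.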
The field of deep learning has seen increasing breakthroughs and commercial adoption in recent years for enabling a wide range of applications including image and speech recognition, multimedia generation, information summarization, and human-like chatbots. This has led to a growing need for hardware that can quickly and efficiently perform deep learning inference, which increasingly requires massive amounts of computational power.
To address this need, recent years have seen many works for optimizing deep learning inference in hardware. Systolic arrays are an efficient class of hardware designs to use as a starting point for this application. However, after hardware-oriented deep learning model optimizations reach their limits, after the known parallelism for executing their compute patterns in hardware is exhausted, and after technology scaling slows to a halt, there is an accelerator wall that limits further improvement on the implementation side.
In this thesis [4], we contribute to this field through an under-explored direction by presenting new efficient matrix multiplication algorithms and/or their systolic array hardware architectures that increase performance-per-area by reducing the workload at the algebraic level, i.e., by computing the same result from a re-arranged compute pattern that requires fewer or cheaper operations to be performed in hardware. We evaluate our architectures in an end-to-end deep learning accelerator, demonstrating their ability to increase the performance-per-area of hardware accelerators beyond their normal theoretical limits.
The contributions in [1]-[4] are relevant for the following scope of applications:
- Matrix multiplication and deep learning inference: The proposed systolic array hardware architectures from [1]-[4] improve dense matrix multiplication acceleration in general, and can therefore be exploited in accelerator systems for accelerating all DNN models/layers that mainly decompose to matrix multiplication, including fully-connected layers, CNNs, RNNs, and attention layers/transformer models (see the im2col sketch after this list).
- Fixed-point data types, same numerical stability: Most of the contributions focus on fixed-point data types and quantized neural network inference, and the presented algorithms/hardware architectures produce output identical to conventional algorithms/architectures, causing no changes to numerical stability.
- FPGA and ASIC: Results were validated on FPGA, but the proposed architectures are general, and most improvements are applicable to both custom integrated circuit (ASIC) and FPGA implementations.
- Systolic arrays, seamless system integration: The proposed architectures are systolic arrays, which are an efficient type of design for GEMM & deep learning acceleration (e.g., the Google TPU). It may also be possible to extend some of the concepts to non-systolic array designs in future work. Additionally, they increase performance-per-area but have otherwise identical functionality and interfaces to traditional systolic arrays. That is, the algebraic enhancements are fully self-contained within the systolic arrays and do not require additional pre- or post-processing steps. This means they can be seamlessly swapped in for traditional systolic arrays in existing accelerator systems to increase performance per MAC unit, with little or no other changes required in the rest of the system and no hidden tradeoffs.
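To make the first scope point above concrete, here is a minimal NumPy sketch (not code from this repository) of the standard im2col lowering that reduces a convolution layer to a single GEMM. The accelerator's Memory Unit performs an equivalent mapping in hardware, but in-place, without materializing the duplicated patch matrix that this software illustration builds.

```python
import numpy as np

def conv2d_as_gemm(x, w):
    # Lower a (C_in, H, W) input and (C_out, C_in, K, K) kernel set to one
    # matrix multiplication (stride 1, no padding; illustrative only).
    c_in, h, wid = x.shape
    c_out, _, k, _ = w.shape
    oh, ow = h - k + 1, wid - k + 1
    # im2col: one row per output pixel, one column per kernel element.
    cols = np.empty((oh * ow, c_in * k * k), dtype=x.dtype)
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = x[:, i:i + k, j:j + k].ravel()
    # The convolution is now a GEMM with the flattened kernels.
    out = cols @ w.reshape(c_out, -1).T   # shape (oh*ow, c_out)
    return out.T.reshape(c_out, oh, ow)

# Self-check against a direct sliding-window convolution:
rng = np.random.default_rng(0)
x = rng.integers(-8, 8, (3, 6, 6))
w = rng.integers(-8, 8, (4, 3, 3, 3))
ref = np.zeros((4, 4, 4), dtype=np.int64)
for co in range(4):
    for i in range(4):
        for j in range(4):
            ref[co, i, j] = np.sum(x[:, i:i + 3, j:j + 3] * w[co])
assert np.array_equal(conv2d_as_gemm(x, w), ref)
```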
Synthesis and performance results for the combined architectures from [1] and [3], compared to state-of-the-art accelerators implemented on similar compute platforms, show up to 3× faster CNN inference, 2× higher mults/multiplier/clock cycle, and >40% higher clock frequency.
See [1], [2], [3], [4] for more results.
The following diagram shows the deep learning accelerator system implemented in this source code, used to host and validate the systolic arrays proposed in [1]-[4]. The system implementation is specialized for performing inference of non-sparse DNN models with fixed-point/quantized inputs consisting of convolutional layers, fully-connected layers, and pooling layers. All DNN layers are fully accelerated in hardware, and a single hardware design can accelerate ML models with arbitrary layer dimensions/kernel sizes. Input bitwidths and systolic array dimensions are parametrizable. The system is also a highly optimized GEMM accelerator in general.
System Block Diagram Overview:
- Matrix Multiply Unit (MXU) / systolic array
- Contains a systolic array architecture for performing matrix multiplication
- For each method in [1]-[4], a different proposed systolic array/MXU is swapped in for the MXU in the system block diagram above
- GEMM Unit
- Contains the MXU as well as SRAM and addition logic for accumulating matrix tiles, allowing GEMM execution on arbitrarily-sized matrices (a functional sketch follows this overview)
- Post-GEMM Unit
- Contains neural network-specific functions to be performed on the matrix multiplication outputs. This includes adding the bias values, inter-layer rescaling for quantization, activation, padding, and pooling.
- Memory Unit
- Contains memory access control logic and on-chip SRAM memory for holding layer activations
- Implements efficient caching & memory access HW algorithms, mapping convolution to GEMM in-place without data duplication or added delay
- Uses memory partitioning schemes that allow the SRAM and its control logic to run at half or quarter clock rate while outputting new data at full clock rate, improving overall system frequency and power
- Off-chip DDR DRAM memory for weights
- RxTx Unit
- PCIe interface to host
- Instruction Unit
- For decoding accelerator instructions sent from the host that allow the system to accelerate ML models with arbitrary layer dimensions/kernel sizes all on a single hardware design
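As noted in the GEMM Unit entry above, parts of this overview can be described functionally in a few lines of NumPy. The sketch below is a hedged software model of what the GEMM Unit and part of the Post-GEMM Unit compute; the tile size, rescaling format, and saturation range are illustrative assumptions, not the RTL's actual parameters.

```python
import numpy as np

def tiled_gemm(A, B, tile=4):
    # Functional model of the GEMM Unit: a fixed-size MXU produces
    # tile x tile partial products, and accumulation logic/SRAM sums them
    # across the K dimension so arbitrarily-sized matrices can be
    # multiplied. Dimensions are assumed multiples of `tile` for brevity.
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n), dtype=np.int64)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

def post_gemm(C, bias, scale_mult, scale_shift):
    # Functional model of part of the Post-GEMM Unit: per-channel bias add,
    # inter-layer requantization as a fixed-point multiply-and-shift, then
    # ReLU with saturation. The RTL's exact rescaling scheme is not
    # reproduced here; this is the common quantized-inference recipe.
    acc = C + bias
    acc = (acc * scale_mult) >> scale_shift
    return np.clip(acc, 0, 255)

rng = np.random.default_rng(0)
A = rng.integers(-8, 8, (8, 8))
B = rng.integers(-8, 8, (8, 8))
assert np.array_equal(tiled_gemm(A, B), A @ B)
out = post_gemm(tiled_gemm(A, B), bias=rng.integers(-4, 4, 8),
                scale_mult=3, scale_shift=2)
```

A real layer would chain these stages: tiled_gemm over im2col-mapped activations, then post_gemm before the next layer's activations are written back to the Memory Unit.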
- compiler
- A compiler for parsing Python ML model descriptions into accelerator instructions that allow it to accelerate ML models with arbitrary layer dimensions/kernel sizes (a toy sketch of this flow follows this list). This part also includes code for interfacing with a PCIe driver for initiating model execution on the accelerator, reading back results and performance counters, and testing the correctness of the results
- rtl
- Synthesizable SystemVerilog accelerator RTL
- sim
- Scripts for setting up simulation environments for verification
- tests
- UVM testbench source code for verifying the accelerator in simulation, written in Python with cocotb
- utils
- Additional Python packages and scripts created for this project as general development utilities and aids
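As a toy illustration of the compiler flow referenced above, the sketch below walks a made-up Python model description and emits one instruction per layer. All field names and the layer format here are invented for illustration; the real instruction set and model-description format are defined in the compiler directory.

```python
# Toy model "compiler": every field name and layer format below is
# hypothetical and for illustration only.
model = [
    {"type": "conv", "in_ch": 3, "out_ch": 64, "kernel": 7},
    {"type": "pool", "kernel": 2},
    {"type": "fc", "in_features": 1024, "out_features": 10},
]

def compile_model(layers):
    program = []
    for i, layer in enumerate(layers):
        if layer["type"] in ("conv", "fc"):
            # Conv and FC layers both lower to GEMM; the dimension fields
            # are what let one hardware design run arbitrary layer shapes.
            if layer["type"] == "conv":
                m, k = layer["out_ch"], layer["in_ch"] * layer["kernel"] ** 2
            else:
                m, k = layer["out_features"], layer["in_features"]
            program.append({"op": "GEMM", "layer": i, "m": m, "k": k})
        elif layer["type"] == "pool":
            program.append({"op": "POOL", "layer": i, "k": layer["kernel"]})
    return program

for instr in compile_model(model):
    print(instr)
```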
The files rtl/top/define.svh and rtl/top/pkg.sv contain a number of configurable parameters, such as FIP_METHOD in define.svh, which selects the systolic array type (e.g., baseline, FIP, or FFIP [1]); SZI and SZJ, which set the systolic array height/width; and LAYERIO_WIDTH/WEIGHT_WIDTH, which set the input bitwidths.
The directory rtl/arith includes mxu.sv and mac_array.sv, which contain the RTL for the baseline and some of the proposed systolic array architectures (FIP and FFIP [1]), depending on the value of the parameter FIP_METHOD.
For more documentation on the accelerator system, refer to [1] and Chapter 3 from [4]. For more details on the proposed systolic array architectures and algebraic enhancements validated in this accelerator system, see [1], [2], [3], [4], as well as the Ph.D. defence slideshow.
Similar Open Source Tools
AI4Animation
AI4Animation is a comprehensive framework for data-driven character animation, including data processing, neural network training, and runtime control, developed in Unity3D/PyTorch. It explores deep learning opportunities for character animation, covering biped and quadruped locomotion, character-scene interactions, sports and fighting games, and embodied avatar motions in AR/VR. The research focuses on generative frameworks, codebook matching, periodic autoencoders, animation layering, local motion phases, and neural state machines for character control and animation.
LongRoPE
LongRoPE is a method to extend the context window of large language models (LLMs) beyond 2 million tokens. It identifies and exploits non-uniformities in positional embeddings to enable 8x context extension without fine-tuning. The method utilizes a progressive extension strategy with 256k fine-tuning to reach a 2048k context. It adjusts embeddings for shorter contexts to maintain performance within the original window size. LongRoPE has been shown to be effective in maintaining performance across various tasks from 4k to 2048k context lengths.
aihwkit
The IBM Analog Hardware Acceleration Kit is an open-source Python toolkit for exploring and using the capabilities of in-memory computing devices in the context of artificial intelligence. It consists of two main components: Pytorch integration and Analog devices simulator. The Pytorch integration provides a series of primitives and features that allow using the toolkit within PyTorch, including analog neural network modules, analog training using torch training workflow, and analog inference using torch inference workflow. The Analog devices simulator is a high-performance (CUDA-capable) C++ simulator that allows for simulating a wide range of analog devices and crossbar configurations by using abstract functional models of material characteristics with adjustable parameters. Along with the two main components, the toolkit includes other functionalities such as a library of device presets, a module for executing high-level use cases, a utility to automatically convert a downloaded model to its equivalent Analog model, and integration with the AIHW Composer platform. The toolkit is currently in beta and under active development, and users are advised to be mindful of potential issues and keep an eye out for improvements, new features, and bug fixes in upcoming versions.
Nanoflow
NanoFlow is a throughput-oriented high-performance serving framework for Large Language Models (LLMs) that consistently delivers superior throughput compared to other frameworks by utilizing key techniques such as intra-device parallelism, asynchronous CPU scheduling, and SSD offloading. The framework proposes nano-batching to schedule compute-, memory-, and network-bound operations for simultaneous execution, leading to increased resource utilization. NanoFlow also adopts an asynchronous control flow to optimize CPU overhead and eagerly offloads KV-Cache to SSDs for multi-round conversations. The open-source codebase integrates state-of-the-art kernel libraries and provides necessary scripts for environment setup and experiment reproduction.
long-context-attention
Long-Context-Attention (YunChang) is a unified sequence parallel approach that combines the strengths of DeepSpeed-Ulysses-Attention and Ring-Attention to provide a versatile and high-performance solution for long context LLM model training and inference. It addresses the limitations of both methods by offering no limitation on the number of heads, compatibility with advanced parallel strategies, and enhanced performance benchmarks. The tool is verified in Megatron-LM and offers best practices for 4D parallelism, making it suitable for various attention mechanisms and parallel computing advancements.
llvm-aie
This repository extends the LLVM framework to generate code for use with AMD/Xilinx AI Engine processors. AI Engine processors are in-order, exposed-pipeline VLIW processors focused on application acceleration for AI, Machine Learning, and DSP applications. The repository adds LLVM support for specific features like non-power of 2 pointers, operand latencies, resource conflicts, negative operand latencies, slot assignment, relocations, code alignment restrictions, and register allocation. It includes support for Clang, LLD, binutils, Compiler-RT, and LLVM-LIBC.
ianvs
Ianvs is a distributed synergy AI benchmarking project incubated in KubeEdge SIG AI. It aims to test the performance of distributed synergy AI solutions following recognized standards, providing end-to-end benchmark toolkits, test environment management tools, test case control tools, and benchmark presentation tools. It also collaborates with other organizations to establish comprehensive benchmarks and related applications. The architecture includes critical components like Test Environment Manager, Test Case Controller, Generation Assistant, Simulation Controller, and Story Manager. Ianvs documentation covers quick start, guides, dataset descriptions, algorithms, user interfaces, stories, and roadmap.
Nucleoid
Nucleoid is a declarative (logic) runtime environment that manages both data and logic under the same runtime. It uses a declarative programming paradigm, which allows developers to focus on the business logic of the application, while the runtime manages the technical details. This allows for faster development and reduces the amount of code that needs to be written. Additionally, the sharding feature can help to distribute the load across multiple instances, which can further improve the performance of the system.
awesome-transformer-nlp
This repository contains a hand-curated list of great machine (deep) learning resources for Natural Language Processing (NLP) with a focus on Generative Pre-trained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT), attention mechanism, Transformer architectures/networks, Chatbot, and transfer learning in NLP.
openspg
OpenSPG is a knowledge graph engine developed by Ant Group in collaboration with OpenKG, based on the SPG (Semantic-enhanced Programmable Graph) framework. It provides explicit semantic representations, logical rule definitions, operator frameworks (construction, inference), and other capabilities for domain knowledge graphs. OpenSPG supports pluggable adaptation of basic engines and algorithmic services by various vendors to build customized solutions.
WeatherGFT
WeatherGFT is a physics-AI hybrid model designed to generalize weather forecasts to finer-grained temporal scales beyond the training dataset. It incorporates physical partial differential equations (PDEs) into neural networks to simulate fine-grained physical evolution and correct biases. The model achieves state-of-the-art performance in forecasting tasks at different time scales, from nowcasting to medium-range forecasts, by utilizing a lead time-aware training framework and a carefully designed PDE kernel. WeatherGFT bridges the gap between nowcast and medium-range forecast by extending forecasting abilities to predict accurately at a 30-minute time scale.
matchem-llm
A public repository collecting links to state-of-the-art training sets, QA, benchmarks and other evaluations for various ML and LLM applications in materials science and chemistry. It includes datasets related to chemistry, materials, multimodal data, and knowledge graphs in the field. The repository aims to provide resources for training and evaluating machine learning models in the materials science and chemistry domains.
agentUniverse
agentUniverse is a multi-agent framework based on large language models, providing flexible capabilities for building individual agents. It focuses on multi-agent collaborative patterns, integrating domain experience to help agents solve problems in various fields. The framework includes pattern components like PEER and DOE for event interpretation, industry analysis, and financial report generation. It offers features for agent construction, multi-agent collaboration, and domain expertise integration, aiming to create intelligent applications with professional know-how.
For similar jobs
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
openvino
OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. It provides a common API to deliver inference solutions on various platforms, including CPU, GPU, NPU, and heterogeneous devices. OpenVINO™ supports pre-trained models from Open Model Zoo and popular frameworks like TensorFlow, PyTorch, and ONNX. Key components of OpenVINO™ include the OpenVINO™ Runtime, plugins for different hardware devices, frontends for reading models from native framework formats, and the OpenVINO Model Converter (OVC) for adjusting models for optimal execution on target devices.
peft
PEFT (Parameter-Efficient Fine-Tuning) is a collection of state-of-the-art methods that enable efficient adaptation of large pretrained models to various downstream applications. By only fine-tuning a small number of extra model parameters instead of all the model's parameters, PEFT significantly decreases the computational and storage costs while achieving performance comparable to fully fine-tuned models.
jetson-generative-ai-playground
This repo hosts tutorial documentation for running generative AI models on NVIDIA Jetson devices. The documentation is auto-generated and hosted on GitHub Pages using their CI/CD feature to automatically generate/update the HTML documentation site upon new commits.
emgucv
Emgu CV is a cross-platform .Net wrapper for the OpenCV image-processing library. It allows OpenCV functions to be called from .NET compatible languages. The wrapper can be compiled by Visual Studio, Unity, and "dotnet" command, and it can run on Windows, Mac OS, Linux, iOS, and Android.
MMStar
MMStar is an elite vision-indispensable multi-modal benchmark comprising 1,500 challenge samples meticulously selected by humans. It addresses two key issues in current LLM evaluation: the unnecessary use of visual content in many samples and the existence of unintentional data leakage in LLM and LVLM training. MMStar evaluates 6 core capabilities across 18 detailed axes, ensuring a balanced distribution of samples across all dimensions.
VLMEvalKit
VLMEvalKit is an open-source evaluation toolkit of large vision-language models (LVLMs). It enables one-command evaluation of LVLMs on various benchmarks, without the heavy workload of data preparation under multiple repositories. In VLMEvalKit, we adopt generation-based evaluation for all LVLMs, and provide the evaluation results obtained with both exact matching and LLM-based answer extraction.
llava-docker
This Docker image for LLaVA (Large Language and Vision Assistant) provides a convenient way to run LLaVA locally or on RunPod. LLaVA is a powerful AI tool that combines natural language processing and computer vision capabilities. With this Docker image, you can easily access LLaVA's functionalities for various tasks, including image captioning, visual question answering, text summarization, and more. The image comes pre-installed with LLaVA v1.2.0, Torch 2.1.2, xformers 0.0.23.post1, and other necessary dependencies. You can customize the model used by setting the MODEL environment variable. The image also includes a Jupyter Lab environment for interactive development and exploration. Overall, this Docker image offers a comprehensive and user-friendly platform for leveraging LLaVA's capabilities.