yzma
Go with your own intelligence - Go applications that directly integrate llama.cpp for local inference using hardware acceleration.
Stars: 312
yzma is a tool that allows you to write Go applications integrating llama.cpp for local inference using hardware acceleration. It supports running Vision Language Models (VLM) and Large/Small/Tiny Language Models (LLM) on Linux, macOS, or Windows with hardware acceleration like CUDA, Metal, or Vulkan. yzma uses purego and ffi packages, eliminating the need for CGo. It works with the latest llama.cpp releases, enabling users to leverage the newest features and model support.
README:
yzma lets you write Go applications that directly integrate llama.cpp for fully local inference using hardware acceleration.
- Run the latest Vision Language Models (VLM) and Large/Small/Tiny Language Models (LLM) on Linux, macOS, or Windows.
- Use any available hardware acceleration such as CUDA, Metal, or Vulkan for maximum performance.
-
yzmauses thepuregoandffipackages so CGo is not needed. - Works with the newest
llama.cppreleases so you can use the latest features and model support.
This example uses the SmolLM2-135M-GGUF model:
package main
import (
"fmt"
"os"
"path/filepath"
"github.com/hybridgroup/yzma/pkg/download"
"github.com/hybridgroup/yzma/pkg/llama"
)
var (
modelFile = "SmolLM2-135M.Q4_K_M.gguf"
prompt = "Are you ready to go?"
libPath = os.Getenv("YZMA_LIB")
responseLength int32 = 12
)
func main() {
llama.Load(libPath)
llama.LogSet(llama.LogSilent())
llama.Init()
model, _ := llama.ModelLoadFromFile(filepath.Join(download.DefaultModelsDir(), modelFile), llama.ModelDefaultParams())
ctx, _ := llama.InitFromModel(model, llama.ContextDefaultParams())
vocab := llama.ModelGetVocab(model)
tokens := llama.Tokenize(vocab, prompt, true, false)
batch := llama.BatchGetOne(tokens)
sampler := llama.SamplerChainInit(llama.SamplerChainDefaultParams())
llama.SamplerChainAdd(sampler, llama.SamplerInitGreedy())
for pos := int32(0); pos < responseLength; pos += batch.NTokens {
llama.Decode(ctx, batch)
token := llama.SamplerSample(sampler, ctx, -1)
if llama.VocabIsEOG(vocab, token) {
fmt.Println()
break
}
buf := make([]byte, 36)
len := llama.TokenToPiece(vocab, token, buf, 0, true)
fmt.Print(string(buf[:len]))
batch = llama.BatchGetOne([]llama.Token{token})
}
fmt.Println()
}Install yzma, then download the model using the yzma command line tool:
yzma model get -u https://huggingface.co/QuantFactory/SmolLM2-135M-GGUF/resolve/main/SmolLM2-135M.Q4_K_M.ggufAnd run the Go program:
$ go run ./examples/hello/
"Yes, I'm ready to go."You can use the convenient yzma command line tool to download the llama.cpp prebuilt libraries for your platform. You can also have your application download them automatically at installation time, including auto-detection for CUDA.
See INSTALL.md for installation instructions for macOS, Linux, and Windows. We also have specific information on running yzma on Raspberry Pi, NVIDIA Jetson Orin, or the Arduino UNO Q.
We have several examples of how you can use yzma in our examples directory.
This example uses the Qwen2.5-VL-3B-Instruct-Q8_0 VLM model to process both a text prompt and an image, then displays the result.
$ go run ./examples/vlm/ -model ~/models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf -mmproj ~/models/mmproj-Qwen2.5-VL-3B-Instruct-Q8_0.gguf -image ./images/domestic_llama.jpg -p "What is in this picture?"
The image features a white llama standing in a fenced-in area, possibly a zoo or a farm. The llama is positioned in the center of the image, with its body facing the right side. The fenced area is surrounded by trees, creating a natural environment for the llama.You can use yzma to do inference on text language models. This example uses the qwen2.5-0.5b-instruct-fp16.gguf model for an interactive chat session.
$ go run ./examples/chat/ -model ./models/qwen2.5-0.5b-instruct-fp16.gguf
Enter prompt: Are you ready to go?
Yes, I'm ready to go! What would you like to do?
Enter prompt: Let's go to the zoo
Great! Let's go to the zoo. What would you like to see?
Enter prompt: I want to feed the llama
Sure! Let's go to the zoo and feed the llama. What kind of llama are you interested in feeding?See the examples directory for more examples of how to use yzma.
Who is using yzma? Check out some of the tools, applications, examples, and blog posts here!
yzma uses models in the GGUF format supported by llama.cpp. There are many models in GGUF format on Hugging Face (over 161k at last count):
https://huggingface.co/models?library=gguf&sort=trending
You can use the yzma command to download models for you!
For example, this downloads the gemma-3-1b-it-GGUF model:
yzma model get -u https://huggingface.co/ggml-org/gemma-3-1b-it-GGUF/resolve/main/gemma-3-1b-it-Q4_K_M.gguf
Check out the Model Usage page for more information.
yzma currently has support for over 95% of llama.cpp functionality. See ROADMAP.md for the complete list.
You can use multimodal models (image/audio) and text language models with full hardware acceleration on Linux, macOS, and Windows.
| OS | CPU | GPU |
|---|---|---|
| Linux | amd64, arm64 | CUDA, Vulkan, HIP, ROCm, SYCL |
| macOS | arm64 | Metal |
| Windows | amd64 | CUDA, Vulkan, HIP, SYCL, OpenCL |
Whenever there is a new release of llama.cpp, the tests for yzma are run automatically. This helps us stay up to date with the latest code and models.
yzma is fast because it calls llama.cpp in the same process. No external servers needed!
For example, here is the Qwen3-VL-2B-Instruct Visual Language Model (VLM) performing multi-modal inference on an image and text prompt running on a Apple M4 Max with 128 GB RAM:
$ go test -run none -benchtime=10s -count=5 -bench BenchmarkMultimodalInference
goos: darwin
goarch: arm64
pkg: github.com/hybridgroup/yzma/pkg/mtmd
cpu: Apple M4 Max
BenchmarkMultimodalInference-16 10 1577948683 ns/op 788.9 tokens/s
BenchmarkMultimodalInference-16 12 1243692014 ns/op 910.8 tokens/s
BenchmarkMultimodalInference-16 7 1654741804 ns/op 737.2 tokens/s
BenchmarkMultimodalInference-16 7 1568106947 ns/op 771.9 tokens/s
BenchmarkMultimodalInference-16 10 1704669371 ns/op 706.1 tokens/s
PASS
ok github.com/hybridgroup/yzma/pkg/mtmd 76.644sWant to see more benchmarks? Take a look at the BENCHMARKS.md document.
yzma is now ready to be used to build complete applications that incorporate language models directly into your Golang code.
Here are some advantages of yzma with llama.cpp:
- Compile Go programs that use
yzmawith the normalgo buildandgo runcommands. No C compiler needed! - Use the
llama.cpplibraries with whatever hardware acceleration is available for your configuration. CUDA, Vulkan, etc. - High performance from making function calls from within the same process. No external model servers!
- Download
llama.cppprecompiled libraries directly from Github, or include them with your application. - Update the
llama.cpplibraries without recompiling your Go program, as long asllama.cppdoes not make any breaking changes.
The idea is to make it easier for Go developers to use language models as part of "normal" applications without having to use containers or do anything other than the normal GOOS and GOARCH env variables for cross-complication.
yzma originally started with definitions from the https://github.com/dianlight/gollama.cpp package, but then has gone on to modify them rather heavily. Thank you!
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for yzma
Similar Open Source Tools
yzma
yzma is a tool that allows you to write Go applications integrating llama.cpp for local inference using hardware acceleration. It supports running Vision Language Models (VLM) and Large/Small/Tiny Language Models (LLM) on Linux, macOS, or Windows with hardware acceleration like CUDA, Metal, or Vulkan. yzma uses purego and ffi packages, eliminating the need for CGo. It works with the latest llama.cpp releases, enabling users to leverage the newest features and model support.
HuggingFaceGuidedTourForMac
HuggingFaceGuidedTourForMac is a guided tour on how to install optimized pytorch and optionally Apple's new MLX, JAX, and TensorFlow on Apple Silicon Macs. The repository provides steps to install homebrew, pytorch with MPS support, MLX, JAX, TensorFlow, and Jupyter lab. It also includes instructions on running large language models using HuggingFace transformers. The repository aims to help users set up their Macs for deep learning experiments with optimized performance.
ollama-r
The Ollama R library provides an easy way to integrate R with Ollama for running language models locally on your machine. It supports working with standard data structures for different LLMs, offers various output formats, and enables integration with other libraries/tools. The library uses the Ollama REST API and requires the Ollama app to be installed, with GPU support for accelerating LLM inference. It is inspired by Ollama Python and JavaScript libraries, making it familiar for users of those languages. The installation process involves downloading the Ollama app, installing the 'ollamar' package, and starting the local server. Example usage includes testing connection, downloading models, generating responses, and listing available models.
agenticSeek
AgenticSeek is a voice-enabled AI assistant powered by DeepSeek R1 agents, offering a fully local alternative to cloud-based AI services. It allows users to interact with their filesystem, code in multiple languages, and perform various tasks autonomously. The tool is equipped with memory to remember user preferences and past conversations, and it can divide tasks among multiple agents for efficient execution. AgenticSeek prioritizes privacy by running entirely on the user's hardware without sending data to the cloud.
openorch
OpenOrch is a daemon that transforms servers into a powerful development environment, running AI models, containers, and microservices. It serves as a blend of Kubernetes and a language-agnostic backend framework for building applications on fixed-resource setups. Users can deploy AI models and build microservices, managing applications while retaining control over infrastructure and data.
zml
ZML is a high-performance AI inference stack built for production, using Zig language, MLIR, and Bazel. It allows users to create exciting AI projects, run pre-packaged models like MNIST, TinyLlama, OpenLLama, and Meta Llama, and compile models for accelerator runtimes. Users can also run tests, explore examples, and contribute to the project. ZML is licensed under the Apache 2.0 license.
simpleAI
SimpleAI is a self-hosted alternative to the not-so-open AI API, focused on replicating main endpoints for LLM such as text completion, chat, edits, and embeddings. It allows quick experimentation with different models, creating benchmarks, and handling specific use cases without relying on external services. Users can integrate and declare models through gRPC, query endpoints using Swagger UI or API, and resolve common issues like CORS with FastAPI middleware. The project is open for contributions and welcomes PRs, issues, documentation, and more.
1backend
1Backend is a flexible and scalable platform designed for running AI models on private servers and handling high-concurrency workloads. It provides a ChatGPT-like interface for users and a network-accessible API for machines, serving as a general-purpose backend framework. The platform offers on-premise ChatGPT alternatives, a microservices-first web framework, out-of-the-box services like file uploads and user management, infrastructure simplification acting as a container orchestrator, reverse proxy, multi-database support with its own ORM, and AI integration with platforms like LlamaCpp and StableDiffusion.
screen-pipe
Screen-pipe is a Rust + WASM tool that allows users to turn their screen into actions using Large Language Models (LLMs). It enables users to record their screen 24/7, extract text from frames, and process text and images for tasks like analyzing sales conversations. The tool is still experimental and aims to simplify the process of recording screens, extracting text, and integrating with various APIs for tasks such as filling CRM data based on screen activities. The project is open-source and welcomes contributions to enhance its functionalities and usability.
allms
allms is a versatile and powerful library designed to streamline the process of querying Large Language Models (LLMs). Developed by Allegro engineers, it simplifies working with LLM applications by providing a user-friendly interface, asynchronous querying, automatic retrying mechanism, error handling, and output parsing. It supports various LLM families hosted on different platforms like OpenAI, Google, Azure, and GCP. The library offers features for configuring endpoint credentials, batch querying with symbolic variables, and forcing structured output format. It also provides documentation, quickstart guides, and instructions for local development, testing, updating documentation, and making new releases.
llm-foundry
LLM Foundry is a codebase for training, finetuning, evaluating, and deploying LLMs for inference with Composer and the MosaicML platform. It is designed to be easy-to-use, efficient _and_ flexible, enabling rapid experimentation with the latest techniques. You'll find in this repo: * `llmfoundry/` - source code for models, datasets, callbacks, utilities, etc. * `scripts/` - scripts to run LLM workloads * `data_prep/` - convert text data from original sources to StreamingDataset format * `train/` - train or finetune HuggingFace and MPT models from 125M - 70B parameters * `train/benchmarking` - profile training throughput and MFU * `inference/` - convert models to HuggingFace or ONNX format, and generate responses * `inference/benchmarking` - profile inference latency and throughput * `eval/` - evaluate LLMs on academic (or custom) in-context-learning tasks * `mcli/` - launch any of these workloads using MCLI and the MosaicML platform * `TUTORIAL.md` - a deeper dive into the repo, example workflows, and FAQs
Trace
Trace is a new AutoDiff-like tool for training AI systems end-to-end with general feedback. It generalizes the back-propagation algorithm by capturing and propagating an AI system's execution trace. Implemented as a PyTorch-like Python library, users can write Python code directly and use Trace primitives to optimize certain parts, similar to training neural networks.
langchain
LangChain is a framework for developing Elixir applications powered by language models. It enables applications to connect language models to other data sources and interact with the environment. The library provides components for working with language models and off-the-shelf chains for specific tasks. It aims to assist in building applications that combine large language models with other sources of computation or knowledge. LangChain is written in Elixir and is not aimed for parity with the JavaScript and Python versions due to differences in programming paradigms and design choices. The library is designed to make it easy to integrate language models into applications and expose features, data, and functionality to the models.
paxml
Pax is a framework to configure and run machine learning experiments on top of Jax.
ash_ai
Ash AI is a tool that provides a Model Context Protocol (MCP) server for exposing tool definitions to an MCP client. It allows for the installation of dev and production MCP servers, and supports features like OAuth2 flow with AshAuthentication, tool data access, tool execution callbacks, prompt-backed actions, and vectorization strategies. Users can also generate a chat feature for their Ash & Phoenix application using `ash_oban` and `ash_postgres`, and specify LLM API keys for OpenAI. The tool is designed to help developers experiment with tools and actions, monitor tool execution, and expose actions as tool calls.
nano-graphrag
nano-GraphRAG is a simple, easy-to-hack implementation of GraphRAG that provides a smaller, faster, and cleaner version of the official implementation. It is about 800 lines of code, small yet scalable, asynchronous, and fully typed. The tool supports incremental insert, async methods, and various parameters for customization. Users can replace storage components and LLM functions as needed. It also allows for embedding function replacement and comes with pre-defined prompts for entity extraction and community reports. However, some features like covariates and global search implementation differ from the original GraphRAG. Future versions aim to address issues related to data source ID, community description truncation, and add new components.
For similar tasks
python-tutorial-notebooks
This repository contains Jupyter-based tutorials for NLP, ML, AI in Python for classes in Computational Linguistics, Natural Language Processing (NLP), Machine Learning (ML), and Artificial Intelligence (AI) at Indiana University.
open-parse
Open Parse is a Python library for visually discerning document layouts and chunking them effectively. It is designed to fill the gap in open-source libraries for handling complex documents. Unlike text splitting, which converts a file to raw text and slices it up, Open Parse visually analyzes documents for superior LLM input. It also supports basic markdown for parsing headings, bold, and italics, and has high-precision table support, extracting tables into clean Markdown formats with accuracy that surpasses traditional tools. Open Parse is extensible, allowing users to easily implement their own post-processing steps. It is also intuitive, with great editor support and completion everywhere, making it easy to use and learn.
MoonshotAI-Cookbook
The MoonshotAI-Cookbook provides example code and guides for accomplishing common tasks with the MoonshotAI API. To run these examples, you'll need an MoonshotAI account and associated API key. Most code examples are written in Python, though the concepts can be applied in any language.
AHU-AI-Repository
This repository is dedicated to the learning and exchange of resources for the School of Artificial Intelligence at Anhui University. Notes will be published on this website first: https://www.aoaoaoao.cn and will be synchronized to the repository regularly. You can also contact me at [email protected].
modern_ai_for_beginners
This repository provides a comprehensive guide to modern AI for beginners, covering both theoretical foundations and practical implementation. It emphasizes the importance of understanding both the mathematical principles and the code implementation of AI models. The repository includes resources on PyTorch, deep learning fundamentals, mathematical foundations, transformer-based LLMs, diffusion models, software engineering, and full-stack development. It also features tutorials on natural language processing with transformers, reinforcement learning, and practical deep learning for coders.
Building-AI-Applications-with-ChatGPT-APIs
This repository is for the book 'Building AI Applications with ChatGPT APIs' published by Packt. It provides code examples and instructions for mastering ChatGPT, Whisper, and DALL-E APIs through building innovative AI projects. Readers will learn to develop AI applications using ChatGPT APIs, integrate them with frameworks like Flask and Django, create AI-generated art with DALL-E APIs, and optimize ChatGPT models through fine-tuning.
examples
This repository contains a collection of sample applications and Jupyter Notebooks for hands-on experience with Pinecone vector databases and common AI patterns, tools, and algorithms. It includes production-ready examples for review and support, as well as learning-optimized examples for exploring AI techniques and building applications. Users can contribute, provide feedback, and collaborate to improve the resource.
lingoose
LinGoose is a modular Go framework designed for building AI/LLM applications. It offers the flexibility to import only the necessary modules, abstracts features for customization, and provides a comprehensive solution for developing AI/LLM applications from scratch. The framework simplifies the process of creating intelligent applications by allowing users to choose preferred implementations or create their own. LinGoose empowers developers to leverage its capabilities to streamline the development of cutting-edge AI and LLM projects.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.
