
llama.rn
React Native binding of llama.cpp
Stars: 662

React Native binding of llama.cpp, which is an inference of LLaMA model in pure C/C++. This tool allows you to use the LLaMA model in your React Native applications for various tasks such as text completion, tokenization, detokenization, and embedding. It provides a convenient interface to interact with the LLaMA model and supports features like grammar sampling and mocking for testing purposes.
README:
React Native binding of llama.cpp.
llama.cpp: Inference of LLaMA model in pure C/C++
npm install llama.rn
On iOS, please re-run npx pod-install again.
By default, llama.rn will use the pre-built rnllama.xcframework for iOS. If you want to build from source, please set RNLLAMA_BUILD_FROM_SOURCE to 1 in your Podfile.
Add the following ProGuard rule if ProGuard is enabled in your project (android/app/proguard-rules.pro):
# llama.rn
-keep class com.rnllama.** { *; }
By default, llama.rn will use pre-built libraries for Android. If you want to build from source, please set rnllamaBuildFromSource to true in android/gradle.properties.
You can search Hugging Face for available models (keyword: GGUF).
To get a GGUF model or quantize one manually, see the Prepare and Quantize section in llama.cpp.
💡 You can find complete examples in the example project.
Load model info only:
import { loadLlamaModelInfo } from 'llama.rn'
const modelPath = 'file://<path to gguf model>'
console.log('Model Info:', await loadLlamaModelInfo(modelPath))
Initialize a Llama context & do completion:
import { initLlama } from 'llama.rn'
// Initialize a Llama context with the model (may take a while)
const context = await initLlama({
  model: modelPath,
  use_mlock: true,
  n_ctx: 2048,
  n_gpu_layers: 99, // number of layers to store in VRAM (currently iOS only)
  // embedding: true, // use embedding
})
const stopWords = ['</s>', '<|end|>', '<|eot_id|>', '<|end_of_text|>', '<|im_end|>', '<|EOT|>', '<|END_OF_TURN_TOKEN|>', '<|end_of_turn|>', '<|endoftext|>']
// Do chat completion
const msgResult = await context.completion(
  {
    messages: [
      {
        role: 'system',
        content: 'This is a conversation between user and assistant, a friendly chatbot.',
      },
      {
        role: 'user',
        content: 'Hello!',
      },
    ],
    n_predict: 100,
    stop: stopWords,
    // ...other params
  },
  (data) => {
    // This is a partial completion callback
    const { token } = data
  },
)
console.log('Result:', msgResult.text)
console.log('Timings:', msgResult.timings)
// Or do text completion
const textResult = await context.completion(
  {
    prompt: 'This is a conversation between user and llama, a friendly chatbot. respond in simple markdown.\n\nUser: Hello!\nLlama:',
    n_predict: 100,
    stop: [...stopWords, 'Llama:', 'User:'],
    // ...other params
  },
  (data) => {
    // This is a partial completion callback
    const { token } = data
  },
)
console.log('Result:', textResult.text)
console.log('Timings:', textResult.timings)
The binding's design is inspired by the server.cpp example in llama.cpp:
- /completion and /chat/completions: context.completion(params, partialCompletionCallback)
- /tokenize: context.tokenize(content)
- /detokenize: context.detokenize(tokens)
- /embedding: context.embedding(content)
- /rerank: context.rerank(query, documents, params)
- ... Other methods
Please visit the Documentation for more details.
You can also visit the example to see how to use it.
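As a quick illustration of the tokenize and detokenize methods listed above, here is a minimal round-trip sketch. It assumes a context created with initLlama as shown earlier; the result shape follows the tokenize example later in this README, but details may vary between versions.
// Minimal sketch: round-trip text through the tokenizer.
// Assumes `context` was created with initLlama() as shown above.
const tokenized = await context.tokenize('Hello, llama.rn!')
console.log('Token count:', tokenized.tokens.length)

const roundTrip = await context.detokenize(tokenized.tokens)
console.log('Detokenized text:', roundTrip)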
llama.rn supports multimodal capabilities, including vision (images) and audio processing. This allows you to interact with models that can understand both text and media content.
Images (Vision):
- JPEG, PNG, BMP, GIF, TGA, HDR, PIC, PNM
- Base64 encoded images (data URLs)
- Local file paths
- HTTP URLs are not supported yet
Audio:
- WAV, MP3 formats
- Base64 encoded audio (data URLs)
- Local file paths
- HTTP URLs are not supported yet
First, you need a multimodal model and its corresponding multimodal projector (mmproj) file; see how to obtain mmproj for more details.
import { initLlama } from 'llama.rn'
// First initialize the model context
const context = await initLlama({
  model: 'path/to/your/multimodal-model.gguf',
  n_ctx: 4096,
  n_gpu_layers: 99, // Recommended for multimodal models
  // Important: Disable context shifting for multimodal
  ctx_shift: false,
})
// Initialize multimodal support with mmproj file
const success = await context.initMultimodal({
  path: 'path/to/your/mmproj-model.gguf',
  use_gpu: true, // Recommended for better performance
})
// Check if multimodal is enabled
console.log('Multimodal enabled:', await context.isMultimodalEnabled())
if (success) {
  console.log('Multimodal support initialized!')
  // Check what modalities are supported
  const support = await context.getMultimodalSupport()
  console.log('Vision support:', support.vision)
  console.log('Audio support:', support.audio)
} else {
  console.log('Failed to initialize multimodal support')
}
// Release multimodal support when you no longer need it
await context.releaseMultimodal()
const result = await context.completion({
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'What do you see in this image?',
        },
        {
          type: 'image_url',
          image_url: {
            url: 'file:///path/to/image.jpg',
            // or base64: 'data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD...'
          },
        },
      ],
    },
  ],
  n_predict: 100,
  temperature: 0.1,
})
console.log('AI Response:', result.text)
// Method 1: Using structured message content (Recommended)
const result = await context.completion({
  messages: [
    {
      role: 'user',
      content: [
        {
          type: 'text',
          text: 'Transcribe or describe this audio:',
        },
        {
          type: 'input_audio',
          input_audio: {
            data: 'data:audio/wav;base64,UklGRiQAAABXQVZFZm10...',
            // or url: 'file:///path/to/audio.wav',
            format: 'wav', // or 'mp3'
          },
        },
      ],
    },
  ],
  n_predict: 200,
})
console.log('Transcription:', result.text)
// Tokenize text with media
const tokenizeResult = await context.tokenize(
  'Describe this image: <__media__>',
  {
    media_paths: ['file:///path/to/image.jpg']
  }
)
console.log('Tokens:', tokenizeResult.tokens)
console.log('Has media:', tokenizeResult.has_media)
console.log('Media positions:', tokenizeResult.chunk_pos_media)
- Context Shifting: Multimodal models require ctx_shift: false to maintain media token positioning
- Memory: Multimodal models require more memory; use an adequate n_ctx and consider GPU offloading
- Media Markers: The system automatically handles <__media__> markers in prompts. When using structured message content, media items are automatically replaced with this marker
- Model Compatibility: Ensure your model supports the media type you're trying to process
llama.rn has universal tool call support by using minja (as the Jinja template parser) and chat.cpp in llama.cpp.
Example:
import { initLlama } from 'llama.rn'
const context = await initLlama({
  // ...params
})
const { text, tool_calls } = await context.completion({
  // ...params
  jinja: true, // Enable Jinja template parser
  tool_choice: 'auto',
  tools: [
    {
      type: 'function',
      function: {
        name: 'ipython',
        description:
          'Runs code in an ipython interpreter and returns the result of the execution after 60 seconds.',
        parameters: {
          type: 'object',
          properties: {
            code: {
              type: 'string',
              description: 'The code to run in the ipython interpreter.',
            },
          },
          required: ['code'],
        },
      },
    },
  ],
  messages: [
    {
      role: 'system',
      content: 'You are a helpful assistant that can answer questions and help with tasks.',
    },
    {
      role: 'user',
      content: 'Test',
    },
  ],
})
console.log('Result:', text)
// If tool_calls is not empty, it means the model has called the tool
if (tool_calls) console.log('Tool Calls:', tool_calls)
You can check chat.cpp to see which models have native tool calling support; for other models it will fall back to the GENERIC type of tool call.
The generic tool call always produces a JSON object as output; when the model decides not to call a tool, the output looks like {"response": "..."}.
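As a hedged sketch of how a tool call might be handled in application code (not an official llama.rn recipe), the snippet below inspects tool_calls from the completion above, runs a hypothetical app-defined executor, and feeds the result back as another message. The runIpython helper and the exact shape of the follow-up tool message are assumptions you should adapt to your model's chat template.
// Sketch only: `runIpython` is a hypothetical app-defined executor, and the
// role/format of the follow-up messages is an assumption that may need
// adjusting for your model's chat template.
if (tool_calls && tool_calls.length > 0) {
  for (const call of tool_calls) {
    const args = JSON.parse(call.function.arguments) // arguments usually arrive as a JSON string
    const toolOutput = await runIpython(args.code) // hypothetical executor

    // Continue the conversation with the tool result
    const followUp = await context.completion({
      jinja: true,
      messages: [
        // ...previous messages
        { role: 'assistant', content: text, tool_calls },
        { role: 'tool', content: JSON.stringify(toolOutput) },
      ],
    })
    console.log('Follow-up:', followUp.text)
  }
}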
GBNF (GGML BNF) is a format for defining formal grammars to constrain model outputs in llama.cpp. For example, you can use it to force the model to generate valid JSON, or speak only in emojis.
See the GBNF Guide for more details.
llama.rn provides a built-in function to convert a JSON Schema to GBNF:
Example gbnf grammar:
root ::= object
value ::= object | array | string | number | ("true" | "false" | "null") ws
object ::=
"{" ws (
string ":" ws value
("," ws string ":" ws value)*
)? "}" ws
array ::=
"[" ws (
value
("," ws value)*
)? "]" ws
string ::=
"\"" (
[^"\\\x7F\x00-\x1F] |
"\\" (["\\bfnrt] | "u" [0-9a-fA-F]{4}) # escapes
)* "\"" ws
number ::= ("-"? ([0-9] | [1-9] [0-9]{0,15})) ("." [0-9]+)? ([eE] [-+]? [0-9] [1-9]{0,15})? ws
# Optional space: by convention, applied in this grammar after literal chars when allowed
ws ::= | " " | "\n" [ \t]{0,20}
import { initLlama } from 'llama.rn'
const gbnf = '...'
const context = await initLlama({
  // ...params
  grammar: gbnf,
})
const { text } = await context.completion({
  // ...params
  messages: [
    {
      role: 'system',
      content: 'You are a helpful assistant that can answer questions and help with tasks.',
    },
    {
      role: 'user',
      content: 'Test',
    },
  ],
})
console.log('Result:', text)
This is also how json_schema in response_format works during completion: it converts the JSON Schema into a GBNF grammar.
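As a hedged sketch, a completion constrained by a JSON Schema via response_format might look like the following; the exact field names below are an assumption based on the OpenAI-style API that llama.rn mirrors, so check the Documentation for the authoritative shape.
// Sketch: the response_format shape below is assumed from the
// OpenAI-style API; verify the exact field names in the docs.
const { text } = await context.completion({
  messages: [
    { role: 'user', content: 'List three colors as JSON.' },
  ],
  response_format: {
    type: 'json_schema',
    json_schema: {
      schema: {
        type: 'object',
        properties: {
          colors: { type: 'array', items: { type: 'string' } },
        },
        required: ['colors'],
      },
    },
  },
})
console.log('Structured result:', JSON.parse(text))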
The session file is a binary file that contains the state of the context; it can save prompt processing time.
const context = await initLlama({ ...params })
// After prompt processing or completion ...
// Save the session
await context.saveSession('<path to save session>')
// Load the session
await context.loadSession('<path to load session>')
- Sessions do not currently support saving state from a multimodal context, so only the text chunks before the first media chunk are stored.
The embedding API is used to get the embedding of a text.
const context = await initLlama({
  ...params,
  embedding: true,
})
const { embedding } = await context.embedding('Hello, world!')
- You can use a model like nomic-ai/nomic-embed-text-v1.5-GGUF for better embedding quality.
- You can use a DB like op-sqlite with sqlite-vec support to store and search embeddings (a minimal in-memory search sketch follows below).
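As a small illustration of searching embeddings in application code, here is a minimal brute-force cosine-similarity sketch. The cosineSimilarity helper is our own example code, not part of llama.rn, and for larger corpora you would store vectors in a database such as op-sqlite with sqlite-vec instead.
// Minimal sketch: brute-force cosine-similarity search over embeddings.
// Assumes `context` was created with embedding: true as shown above.
const cosineSimilarity = (a, b) => {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

const docs = ['AI is a branch of computer science.', 'I like pizza.']
const docEmbeddings = []
for (const doc of docs) {
  const { embedding } = await context.embedding(doc)
  docEmbeddings.push(embedding)
}

const { embedding: queryEmbedding } = await context.embedding('What is AI?')
const ranked = docs
  .map((doc, i) => ({ doc, score: cosineSimilarity(queryEmbedding, docEmbeddings[i]) }))
  .sort((a, b) => b.score - a.score)
console.log('Best match:', ranked[0])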
The rerank API is used to rank documents based on their relevance to a query. This is particularly useful for improving search results and implementing retrieval-augmented generation (RAG) systems.
const context = await initLlama({
  ...params,
  embedding: true, // Required for reranking
  pooling_type: 'rank', // Use rank pooling for rerank models
})
// Rerank documents based on relevance to query
const results = await context.rerank(
  'What is artificial intelligence?', // query
  [
    'AI is a branch of computer science.',
    'The weather is nice today.',
    'Machine learning is a subset of AI.',
    'I like pizza.',
  ], // documents to rank
  {
    normalize: 1, // Optional: normalize scores (default: from model config)
  }
)
// Results are automatically sorted by score (highest first)
results.forEach((result, index) => {
  console.log(`Rank ${index + 1}:`, {
    score: result.score,
    document: result.document,
    originalIndex: result.index,
  })
})
- Model Requirements: Reranking requires models with the RANK pooling type (e.g., reranker models)
- Embedding Enabled: The context must have embedding: true to use rerank functionality
- Automatic Sorting: Results are returned sorted by relevance score in descending order
- Document Access: Each result includes the original document text and its index in the input array
- Score Interpretation: Higher scores indicate higher relevance to the query
- jinaai - jina-reranker-v2-base-multilingual-GGUF
- BAAI - bge-reranker-v2-m3-GGUF
- Other models with "rerank" or "reranker" in their name and GGUF format
We have provided a mock version of llama.rn for testing purposes that you can use with Jest:
jest.mock('llama.rn', () => require('llama.rn/jest/mock'))
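As a small, hedged example of how the mock might be used in a test, the sketch below only exercises API shapes already shown in this README; the mock's exact return values may differ, so the assertion is deliberately loose.
// example.test.js — sketch of a Jest test using the mock
jest.mock('llama.rn', () => require('llama.rn/jest/mock'))

const { initLlama } = require('llama.rn')

test('completion runs against the mocked llama.rn', async () => {
  const context = await initLlama({ model: 'file://dummy.gguf', n_ctx: 512 })
  const result = await context.completion({
    messages: [{ role: 'user', content: 'Hello!' }],
    n_predict: 16,
  })
  // The mock's exact output may differ; we only assert the call succeeds.
  expect(result).toBeDefined()
})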
iOS:
- The Extended Virtual Addressing and Increased Memory Limit capabilities are recommended to enable on the iOS project.
- Metal:
  - We have found in testing that some devices are not able to use Metal (GPU) because llama.cpp uses SIMD-scoped operations; you can check whether your device is supported in the Metal feature set tables. An Apple7 GPU is the minimum requirement.
  - Metal is also not supported in the iOS simulator due to this limitation: more than 14 constant buffers are used.
Android:
- Currently only the arm64-v8a / x86_64 platforms are supported, which means you can't initialize a context on other platforms. The 64-bit platforms are recommended because they can allocate more memory for the model.
- No GPU backend is integrated yet.
See the contributing guide to learn how to contribute to the repository and the development workflow.
- BRICKS: Our product for building interactive signage in a simple way. We provide LLM functions as Generator LLM/Assistant.
- ChatterUI: Simple frontend for LLMs built in react-native.
- PocketPal AI: An app that brings language models directly to your phone.
- llama.node: Another Node.js binding of llama.cpp, with the same API as llama.rn.
MIT
Made with create-react-native-library
Built and maintained by BRICKS.