torchchat
Run PyTorch LLMs locally on servers, desktop and mobile
Stars: 3272
torchchat is a codebase showcasing the ability to run large language models (LLMs) seamlessly. It allows running LLMs using Python in various environments such as desktop, server, iOS, and Android. The tool supports running models via PyTorch, chatting, generating text, running chat in the browser, and running models on desktop/server without Python. It also provides features like AOT Inductor for faster execution, running in C++ using the runner, and deploying and running on iOS and Android. The tool supports popular hardware and OS including Linux, Mac OS, Android, and iOS, with various data types and execution modes available.
README:
torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.
[!IMPORTANT] Update September 25, 2024: torchchat has multimodal support for Llama3.2 11B!!
To try it out, finish the Installation section below, then hop over to our multimodal guide to learn more.
- Run models via PyTorch / Python
- Run models on desktop/server without python
- Run models on mobile
- Evaluate a model
- [New!!] Multimodal Support for Llama 3.2 11B
- Command line interaction with popular LLMs such as Llama 3, Llama 2, Stories, Mistral and more
- PyTorch-native execution with performance
- Supports popular hardware and OS
- Linux (x86)
- Mac OS (M1/M2/M3)
- Android (Devices that support XNNPACK)
- iOS 17+ and 8+ GB of RAM (iPhone 15 Pro+ or iPad with Apple Silicon)
- Multiple data types including: float32, float16, bfloat16
- Multiple quantization schemes
- Multiple execution modes including: Python (Eager, Compile) or Native (AOT Inductor (AOTI), ExecuTorch)
The following models are supported by torchchat and have associated aliases.
Model | Mobile Friendly | Notes |
---|---|---|
meta-llama/Meta-Llama-3.2-3B-Instruct | ✅ | Tuned for chat. Alias to llama3.2-3b. |
meta-llama/Meta-Llama-3.2-3B | ✅ | Best for generate. Alias to llama3.2-3b-base. |
meta-llama/Llama-Guard-3-1B | ✅ | Tuned for classification. Alias to llama3-1b-guard. |
meta-llama/Meta-Llama-3.2-1B-Instruct | ✅ | Tuned for chat. Alias to llama3.2-1b. |
meta-llama/Meta-Llama-3.2-1B | ✅ | Best for generate. Alias to llama3.2-1b-base. |
meta-llama/Llama-3.2-11B-Vision-Instruct | | Multimodal (Image + Text). Tuned for chat. Alias to llama3.2-11B. |
meta-llama/Llama-3.2-11B-Vision | | Multimodal (Image + Text). Tuned for generate. Alias to llama3.2-11B-base. |
meta-llama/Meta-Llama-3.1-8B-Instruct | ✅ | Tuned for chat. Alias to llama3.1. |
meta-llama/Meta-Llama-3.1-8B | ✅ | Best for generate. Alias to llama3.1-base. |
meta-llama/Meta-Llama-3-8B-Instruct | ✅ | Tuned for chat. Alias to llama3. |
meta-llama/Meta-Llama-3-8B | ✅ | Best for generate. Alias to llama3-base. |
meta-llama/Llama-2-7b-chat-hf | ✅ | Tuned for chat. Alias to llama2. |
meta-llama/Llama-2-13b-chat-hf | | Tuned for chat. Alias to llama2-13b-chat. |
meta-llama/Llama-2-70b-chat-hf | | Tuned for chat. Alias to llama2-70b-chat. |
meta-llama/Llama-2-7b-hf | ✅ | Best for generate. Alias to llama2-base. |
meta-llama/CodeLlama-7b-Python-hf | ✅ | Tuned for Python and generate. Alias to codellama. |
meta-llama/CodeLlama-34b-Python-hf | ✅ | Tuned for Python and generate. Alias to codellama-34b. |
mistralai/Mistral-7B-v0.1 | ✅ | Best for generate. Alias to mistral-7b-v01-base. |
mistralai/Mistral-7B-Instruct-v0.1 | ✅ | Tuned for chat. Alias to mistral-7b-v01-instruct. |
mistralai/Mistral-7B-Instruct-v0.2 | ✅ | Tuned for chat. Alias to mistral. |
tinyllamas/stories15M | ✅ | Toy model for generate. Alias to stories15M. |
tinyllamas/stories42M | ✅ | Toy model for generate. Alias to stories42M. |
tinyllamas/stories110M | ✅ | Toy model for generate. Alias to stories110M. |
openlm-research/open_llama_7b | ✅ | Best for generate. Alias to open-llama. |
The following steps require that you have Python 3.10 installed.
[!TIP] torchchat uses the latest changes from various PyTorch projects, so it's highly recommended that you use a venv (created with the commands below) or conda.
git clone https://github.com/pytorch/torchchat.git
cd torchchat
python3 -m venv .venv
source .venv/bin/activate
./install/install_requirements.sh
torchchat is used through Python commands and native runners. The Python commands are listed in the --help menu; the native runners are covered in their respective sections.
python3 torchchat.py --help
# Output
usage: torchchat [-h] {chat,browser,generate,export,eval,download,list,remove,where,server} ...
positional arguments:
{chat,browser,generate,export,eval,download,list,remove,where,server}
The specific command to run
chat Chat interactively with a model via the CLI
generate Generate responses from a model given a prompt
browser Chat interactively with a model in a locally hosted browser
export Export a model artifact to AOT Inductor or ExecuTorch
download Download model artifacts
list List all supported models
remove Remove downloaded model artifacts
where Return directory containing downloaded model artifacts
server [WIP] Starts a locally hosted REST server for model interaction
eval Evaluate a model via lm-eval
options:
-h, --help show this help message and exit
Python Inference (chat, generate, browser, server)
- These commands represent different flavors of performing model inference in a Python environment.
- Models are constructed either from CLI args or from loading exported artifacts.
Exporting (export)
- This command generates model artifacts that are consumed by Python Inference or Native Runners.
- More information is provided in the AOT Inductor and ExecuTorch sections.
Inventory Management (download, list, remove, where)
- These commands are used to manage and download models.
- More information is provided in the Download Weights section.
Evaluation (eval)
- This command tests model fidelity via EleutherAI's lm_evaluation_harness.
- More information is provided in the Evaluation section.
Most models use Hugging Face as the distribution channel, so you will need to create a Hugging Face account.
Create a Hugging Face user access token as documented here with the write role.
Log into Hugging Face:
huggingface-cli login
Take a look at the available models:
python3 torchchat.py list
Then download one for testing (this README uses llama3.1)
python3 torchchat.py download llama3.1
[!NOTE] This command may prompt you to request access to Llama 3 via Hugging Face, if you do not already have access. Simply follow the prompts and re-run the command when access is granted.
Additional Model Inventory Management Commands
This subcommand shows the location of a particular model.
python3 torchchat.py where llama3.1
This is useful in scripts when you do not want to hard-code paths.
This subcommand removes the specified model.
python3 torchchat.py remove llama3.1
More information about these commands can be found by adding the --help option.
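As an illustration of using where from a script instead of hard-coding paths, here is a minimal Python sketch. The helper names (where_command, model_dir, tokenizer_path) are hypothetical and not part of torchchat, and the sketch assumes the model directory is the last non-empty line that the where subcommand prints.

```python
import subprocess
from pathlib import Path


def where_command(model: str, script: str = "torchchat.py") -> list[str]:
    """Build the `where` invocation shown above for a given model alias."""
    return ["python3", script, "where", model]


def model_dir(model: str, run=subprocess.run) -> Path:
    """Ask torchchat where a downloaded model lives and return it as a Path."""
    result = run(where_command(model), capture_output=True, text=True, check=True)
    # Assumption: the directory is the last non-empty line printed on stdout.
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    return Path(lines[-1])


def tokenizer_path(model: str, run=subprocess.run) -> Path:
    """Tokenizer location, mirroring `$(python3 torchchat.py where MODEL)/tokenizer.model`."""
    return model_dir(model, run=run) / "tokenizer.model"
```

Run from the torchchat root directory, tokenizer_path("llama3.1") would resolve the same tokenizer path that the native runner examples later in this README build with shell command substitution. The run parameter exists only so the subprocess call can be replaced in tests.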
The simplest way to run a model in PyTorch is via eager execution. This is the default execution mode for both PyTorch and torchchat. It performs inference without creating exported artifacts or using a separate runner.
The model used for inference can also be configured and tailored to specific needs (compilation, quantization, etc.). See the customization guide for the options supported by torchchat.
[!TIP] For more information about these commands, please refer to the --help menu.
This mode allows you to chat with an LLM in an interactive fashion.
python3 torchchat.py chat llama3.1
This mode generates text based on an input prompt.
python3 torchchat.py generate llama3.1 --prompt "write me a story about a boy and his bear"
This mode exposes a REST API for interacting with a model. The server follows the OpenAI API specification for chat completions.
To test out the REST API, you'll need 2 terminals: one to host the server, and one to send the request. In one terminal, start the server:
python3 torchchat.py server llama3.1
In another terminal, query the server using curl. Depending on the model configuration, this query might take a few minutes to respond.
[!NOTE] Since this feature is under active development, not every parameter is consumed. See api/api.py for details on which request parameters are implemented. If you encounter any issues, please comment on the tracking GitHub issue.
Example Query
Setting stream to "true" in the request emits a response in chunks. If stream is unset or not "true", then the client will await the full response from the server.
Example Input + Output
curl http://127.0.0.1:5000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"stream": "true",
"max_tokens": 200,
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
]
}'
{"response":" I'm a software developer with a passion for building innovative and user-friendly applications. I have experience in developing web and mobile applications using various technologies such as Java, Python, and JavaScript. I'm always looking for new challenges and opportunities to learn and grow as a developer.\n\nIn my free time, I enjoy reading books on computer science and programming, as well as experimenting with new technologies and techniques. I'm also interested in machine learning and artificial intelligence, and I'm always looking for ways to apply these concepts to real-world problems.\n\nI'm excited to be a part of the developer community and to have the opportunity to share my knowledge and experience with others. I'm always happy to help with any questions or problems you may have, and I'm looking forward to learning from you as well.\n\nThank you for visiting my profile! I hope you find my information helpful and interesting. If you have any questions or would like to discuss any topics, please feel free to reach out to me. I"}
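The same request can be issued from Python. The following is a minimal sketch using only the standard library; build_chat_request and send are hypothetical helper names, and the host, port, endpoint path, and field names are copied from the curl example above. The send helper assumes a non-streaming response whose body is a single JSON document.

```python
import json
import urllib.request


def build_chat_request(messages, model="llama3.1", max_tokens=200, stream=False,
                       base_url="http://127.0.0.1:5000"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "max_tokens": max_tokens, "messages": messages}
    if stream:
        # The curl example above passes the string "true", not a JSON boolean.
        payload["stream"] = "true"
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and decode a non-streaming JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running in another terminal, send(build_chat_request([{"role": "user", "content": "Hello!"}])) would wait for the full response. The generous default timeout reflects the note above that a query might take a few minutes.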
This command opens a basic browser interface for local chat by querying a local server.
First, follow the steps in the Server section above to start a local server. Then, in another terminal, launch the interface. Running the following will open a tab in your browser.
streamlit run torchchat/usages/browser.py
Use the "Max Response Tokens" slider to limit the maximum number of tokens generated by the model for each response. Click the "Reset Chat" button to remove the message history and start a fresh chat.
AOTI compiles models before execution for faster inference. The process creates a DSO model (represented by a file with extension .so) that is then loaded for inference. This can be done with both Python and C++ environments.
The following example exports and executes the Llama3.1 8B Instruct model. The first command compiles and performs the actual export.
python3 torchchat.py export llama3.1 --output-dso-path exportedModels/llama3.1.so
[!NOTE] If your machine has CUDA, add the flag --quantize torchchat/quant_config/cuda.json when exporting for better performance.
For more details on quantization and what settings to use for your use case visit our customization guide.
To run in a Python environment, use the generate subcommand like before, but include the DSO file.
python3 torchchat.py generate llama3.1 --dso-path exportedModels/llama3.1.so --prompt "Hello my name is"
Note: Depending on which accelerator is used to generate the DSO file, the command may need the device specified: --device (cuda | cpu).
To run in a C++ environment, we need to build the runner binary.
torchchat/utils/scripts/build_native.sh aoti
Then run the compiled executable, with the exported DSO from earlier.
cmake-out/aoti_run exportedModels/llama3.1.so -z `python3 torchchat.py where llama3.1`/tokenizer.model -l 3 -i "Once upon a time"
Note: Depending on which accelerator is used to generate the DSO file, the runner may need the device specified: -d (CUDA | CPU).
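To make the export-then-run flow easier to script, the sketch below assembles the export and generate commands from this section as argument lists. The function names are hypothetical; the subcommands and flags (--output-dso-path, --quantize, --dso-path, --prompt, --device) are taken from the examples above, and nothing is executed until a list is passed to something like subprocess.run.

```python
import shlex


def export_command(model, dso_path, quantize_config=None):
    """Arguments for exporting a model to an AOTI DSO, optionally quantized."""
    cmd = ["python3", "torchchat.py", "export", model, "--output-dso-path", dso_path]
    if quantize_config:
        # For example torchchat/quant_config/cuda.json on a CUDA machine.
        cmd += ["--quantize", quantize_config]
    return cmd


def generate_command(model, dso_path, prompt, device=None):
    """Arguments for running generate against a previously exported DSO."""
    cmd = ["python3", "torchchat.py", "generate", model,
           "--dso-path", dso_path, "--prompt", prompt]
    if device:
        # Only needed when the DSO was built for a specific accelerator.
        cmd += ["--device", device]
    return cmd


def as_shell(cmd):
    """Render an argument list as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

Keeping the commands as lists avoids shell quoting problems with prompts that contain spaces, and as_shell is handy for logging exactly what was run.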
ExecuTorch enables you to optimize your model for execution on a mobile or embedded device.
Before running any commands in torchchat that require ExecuTorch, you must first install ExecuTorch.
To install ExecuTorch, run the following commands. This will download the ExecuTorch repo to ./et-build/src and install various ExecuTorch libraries to ./et-build/install.
[!IMPORTANT] The following commands should be run from the torchchat root directory.
export TORCHCHAT_ROOT=${PWD}
./torchchat/utils/scripts/install_et.sh
Similar to AOTI, to deploy onto device, we first export the PTE artifact, then we load the artifact for inference.
The following example uses the Llama3.1 8B Instruct model.
# Export
python3 torchchat.py export llama3.1 --quantize torchchat/quant_config/mobile.json --output-pte-path llama3.1.pte
[!NOTE] We use --quantize torchchat/quant_config/mobile.json to quantize the llama3.1 model to reduce model size and improve performance for on-device use cases.
For more details on quantization and what settings to use for your use case visit our customization guide.
While ExecuTorch does not focus on desktop inference, it is capable of doing so. This is handy for testing out PTE models without sending them to a physical device.
There are two ways to do so: in pure Python or via a runner.
Deploying via Python
# Execute
python3 torchchat.py generate llama3.1 --pte-path llama3.1.pte --prompt "Hello my name is"
Deploying via the C++ Runner
Build the runner
torchchat/utils/scripts/build_native.sh et
Execute using the runner
cmake-out/et_run llama3.1.pte -z `python3 torchchat.py where llama3.1`/tokenizer.model -l 3 -i "Once upon a time"
The following assumes you've completed the steps for Setting up ExecuTorch.
Deploying with Xcode
- Xcode 15.0 or later
- CMake 3.19 or later
  - Download and open the macOS .dmg installer and move the CMake app to the /Applications folder.
  - Install CMake command line tools: sudo /Applications/CMake.app/Contents/bin/cmake-gui --install
- A development provisioning profile with the increased-memory-limit entitlement.

1. Open the Xcode project:
   open et-build/src/executorch/examples/demo-apps/apple_ios/LLaMA/LLaMA.xcodeproj
   Note: If you're running into any issues related to package dependencies, close Xcode, clean some of the caches and/or the build products, and open the Xcode project again:
   rm -rf ~/Library/org.swift.swiftpm ~/Library/Caches/org.swift.swiftpm ~/Library/Caches/com.apple.dt.Xcode ~/Library/Developer/Xcode/DerivedData
2. Click the Play button to launch the app in the Simulator.
3. To run on a device, ensure you have it set up for development and a provisioning profile with the increased-memory-limit entitlement. Update the app's bundle identifier to match your provisioning profile with the required capability.
4. After successfully launching the app, copy the exported ExecuTorch model (.pte) and tokenizer (.model) files to the iLLaMA folder. You can find the model file called llama3.1.pte in the current torchchat directory and the tokenizer file at the $(python3 torchchat.py where llama3.1)/tokenizer.model path.
   - For the Simulator: Drag and drop both files onto the Simulator window and save them in the On My iPhone > iLLaMA folder.
   - For a device: Open a separate Finder window, navigate to the Files tab, drag and drop both files into the iLLaMA folder, and wait for the copying to finish.
5. Follow the app's UI guidelines to select the model and tokenizer files from the local filesystem and issue a prompt.
The following assumes you've completed the steps for Setting up ExecuTorch.
Approach 1 (Recommended): Android Studio
- Android Studio
- Java 17
- Android SDK 34
- adb

1. Download the AAR file, which contains the Java library and corresponding JNI library, to build and run the app.
   - executorch-240919.aar (SHASUM: c8a5d38ead03bfa28ee8469f6355840ad0d182ba)
2. Rename the downloaded AAR file to executorch.aar and move the file to torchchat/edge/android/torchchat/app/libs/. You may need to create the directory torchchat/edge/android/torchchat/app/libs/ if it does not exist.
3. Push the model and tokenizer file to your device. You can find the model file called llama3.1.pte in the current torchchat directory and the tokenizer file at the $(python3 torchchat.py where llama3.1)/tokenizer.model path.
   adb shell mkdir -p /data/local/tmp/llama
   adb push <model.pte> /data/local/tmp/llama
   adb push <tokenizer.model or tokenizer.bin> /data/local/tmp/llama
4. Use Android Studio to open the torchchat app skeleton, located at torchchat/edge/android/torchchat.
5. Click the Play button (^R) to launch it to emulator/device.
   - We recommend using a device with at least 12GB RAM and 20GB storage.
   - If using an emulated device, refer to this post on how to set the RAM.
6. Follow the app's UI guidelines to pick the model and tokenizer files from the local filesystem. Then issue a prompt.
Note: The AAR file listed in Step 1 has the tiktoken and sentencepiece tokenizers. To tweak or use a custom tokenizer and runtime, modify the ExecuTorch code and use this script to build the AAR library.
Approach 2: E2E Script
Alternatively, you can run torchchat/utils/scripts/android_example.sh, which sets up Java, Android SDK Manager, Android SDK, Android emulator (if no physical device is found), builds the app, and launches it for you. It can be used if you don't have a GUI.
export TORCHCHAT_ROOT=$(pwd)
sh torchchat/utils/scripts/android_example.sh
Note: This feature is still a work in progress and not all features are working.
torchchat uses the lm_eval library to evaluate model accuracy on a variety of tasks. It defaults to wikitext and can be controlled manually using the tasks and limit args. See Evaluation.
Examples
Eager mode:
python3 torchchat.py eval llama3.1 --dtype fp32 --limit 5
To test the perplexity for a lowered or quantized model, pass it in the same way you would to generate:
python3 torchchat.py eval llama3.1 --pte-path llama3.1.pte --limit 5
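For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood that a model assigns to the evaluated tokens, so lower is better. The sketch below shows the textbook per-token definition on made-up inputs. It is not torchchat or lm_eval code, and the harness's reported metrics may normalize differently (for example per word rather than per token).

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean(log p(token))), using natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every observed token is as
# uncertain as a uniform choice among 4 options, so its perplexity is 4.
uniform_quarter = [math.log(0.25)] * 8
print(round(perplexity(uniform_quarter), 6))  # 4.0
```

When comparing an eager run against a quantized or lowered artifact, a small increase in perplexity on the same task and limit indicates the lowered model stays close to the original.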
torchchat embodies PyTorch's design philosophy, especially "usability over everything else".
torchchat is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (eg: Hugging Face models, etc), all of the core functionality is written in PyTorch.
torchchat is designed to be easy to understand, use and extend.
- Composition over implementation inheritance - layers of inheritance for code re-use makes the code hard to read and extend
- No training frameworks - explicitly outlining the training logic makes it easy to extend for custom use cases
- Code duplication is preferred over unnecessary abstractions
- Modular building blocks over monolithic components
torchchat provides well-tested components with a high bar on correctness. We provide:
- Extensive unit tests to ensure things operate as they should
We really value our community and the contributions made by our wonderful users. We'll use this section to call out some of these contributions! If you'd like to help out as well, please see the CONTRIBUTING guide.
This section lists commonly encountered setup errors and exceptions. If it doesn't cover your situation, check the GitHub issues.
Access to model is restricted and you are not in the authorized list
Some models require an additional step to access. Follow the link provided in the error to get access.
Failed Building Wheel
If ./torchchat/utils/scripts/install_et.sh fails with an error like Building wheel for executorch (pyproject.toml) did not run successfully, it's possible that it's linking to an older version of PyTorch installed some other way, such as via Homebrew. You can break the link by uninstalling other versions, for example with brew uninstall pytorch.
Note: You may break something that depends on this, so be aware.
CERTIFICATE_VERIFY_FAILED
Run pip install --upgrade certifi.
If you encounter bugs or difficulty using torchchat, please file a GitHub issue.
Please include the exact command you ran and the output of that command.
Also, run this script and include the output saved to system_info.txt so that we can better debug your issue.
(echo "Operating System Information"; uname -a; echo ""; cat /etc/os-release; echo ""; echo "Python Version"; python --version || python3 --version; echo ""; echo "PIP Version"; pip --version || pip3 --version; echo ""; echo "Installed Packages"; pip freeze || pip3 freeze; echo ""; echo "PyTorch Version"; python -c "import torch; print(torch.__version__)" || python3 -c "import torch; print(torch.__version__)"; echo ""; echo "Collection Complete") > system_info.txt
The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.
Thank you to the community for all the awesome libraries and tools you've built around local LLM inference.
- Georgi Gerganov and his GGML project shining a spotlight on community-based enablement and inspiring so many other projects.
- Andrej Karpathy and his llama2.c project. So many great (and simple!) ideas in llama2.c that we have directly adopted (both ideas and code) from his repo. You can never go wrong by following Andrej's work.
- Michael Gschwind, Bert Maher, Scott Wolchok, Bin Bao, Chen Yang, Huamin Li and Mu-Chu Li, who built the first version of nanoGPT (DSOGPT) with AOT Inductor, proving that AOTI can be used to build efficient LLMs and that DSOs are a viable distribution format for models.
- Bert Maher and his llama2.so, which built on Andrej's llama2.c and on DSOGPT to close the loop on Llama models with AOTInductor.
- Christian Puhrsch, Horace He, Joe Isaacson and many more for their many contributions in Accelerating GenAI models in the "Anything, Fast!" pytorch.org blogs, and, in particular, Horace He for GPT, Fast!, which we have directly adopted (both ideas and code) from his repo.
torchchat is released under the BSD 3 license. (Additional code in this distribution is covered by the MIT and Apache Open Source licenses.) However, you may have other legal obligations that govern your use of content, such as the terms of service for third-party models.
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for torchchat
Similar Open Source Tools
torchchat
torchchat is a codebase showcasing the ability to run large language models (LLMs) seamlessly. It allows running LLMs using Python in various environments such as desktop, server, iOS, and Android. The tool supports running models via PyTorch, chatting, generating text, running chat in the browser, and running models on desktop/server without Python. It also provides features like AOT Inductor for faster execution, running in C++ using the runner, and deploying and running on iOS and Android. The tool supports popular hardware and OS including Linux, Mac OS, Android, and iOS, with various data types and execution modes available.
telemetry-airflow
This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow". Some links relevant to users and developers of WTMO: * The `dags` directory in this repository contains some custom DAG definitions * Many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl * The Data SRE team maintains a WTMO Developer Guide (behind SSO)
BentoVLLM
BentoVLLM is an example project demonstrating how to serve and deploy open-source Large Language Models using vLLM, a high-throughput and memory-efficient inference engine. It provides a basis for advanced code customization, such as custom models, inference logic, or vLLM options. The project allows for simple LLM hosting with OpenAI compatible endpoints without the need to write any code. Users can interact with the server using Swagger UI or other methods, and the service can be deployed to BentoCloud for better management and scalability. Additionally, the repository includes integration examples for different LLM models and tools.
WindowsAgentArena
Windows Agent Arena (WAA) is a scalable Windows AI agent platform designed for testing and benchmarking multi-modal, desktop AI agents. It provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, enabling testing of agentic AI workflows across various tasks. WAA supports deploying agents at scale using Azure ML cloud infrastructure, allowing parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes.
lmql
LMQL is a programming language designed for large language models (LLMs) that offers a unique way of integrating traditional programming with LLM interaction. It allows users to write programs that combine algorithmic logic with LLM calls, enabling model reasoning capabilities within the context of the program. LMQL provides features such as Python syntax integration, rich control-flow options, advanced decoding techniques, powerful constraints via logit masking, runtime optimization, sync and async API support, multi-model compatibility, and extensive applications like JSON decoding and interactive chat interfaces. The tool also offers library integration, flexible tooling, and output streaming options for easy model output handling.
web-llm
WebLLM is a modular and customizable javascript package that directly brings language model chats directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU. WebLLM is fully compatible with OpenAI API. That is, you can use the same OpenAI API on any open source models locally, with functionalities including json-mode, function-calling, streaming, etc. We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
mosec
Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices. It bridges the gap between any machine learning models you just trained and the efficient online service API. * **Highly performant** : web layer and task coordination built with Rust 🦀, which offers blazing speed in addition to efficient CPU utilization powered by async I/O * **Ease of use** : user interface purely in Python 🐍, by which users can serve their models in an ML framework-agnostic manner using the same code as they do for offline testing * **Dynamic batching** : aggregate requests from different users for batched inference and distribute results back * **Pipelined stages** : spawn multiple processes for pipelined stages to handle CPU/GPU/IO mixed workloads * **Cloud friendly** : designed to run in the cloud, with the model warmup, graceful shutdown, and Prometheus monitoring metrics, easily managed by Kubernetes or any container orchestration systems * **Do one thing well** : focus on the online serving part, users can pay attention to the model optimization and business logic
langroid-examples
Langroid-examples is a repository containing examples of using the Langroid Multi-Agent Programming framework to build LLM applications. It provides a collection of scripts and instructions for setting up the environment, working with local LLMs, using OpenAI LLMs, and running various examples. The repository also includes optional setup instructions for integrating with Qdrant, Redis, Momento, GitHub, and Google Custom Search API. Users can explore different scenarios and functionalities of Langroid through the provided examples and documentation.
metavoice-src
MetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities: * Emotional speech rhythm and tone in English. * Zero-shot cloning for American & British voices, with 30s reference audio. * Support for (cross-lingual) voice cloning with finetuning. * We have had success with as little as 1 minute training data for Indian speakers. * Synthesis of arbitrary length text
code2prompt
code2prompt is a command-line tool that converts your codebase into a single LLM prompt with a source tree, prompt templating, and token counting. It automates generating LLM prompts from codebases of any size, customizing prompt generation with Handlebars templates, respecting .gitignore, filtering and excluding files using glob patterns, displaying token count, including Git diff output, copying prompt to clipboard, saving prompt to an output file, excluding files and folders, adding line numbers to source code blocks, and more. It helps streamline the process of creating LLM prompts for code analysis, generation, and other tasks.
unstructured
The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.
bia-bob
BIA `bob` is a Jupyter-based assistant for interacting with data using large language models to generate Python code. It can utilize OpenAI's chatGPT, Google's Gemini, Helmholtz' blablador, and Ollama. Users need respective accounts to access these services. Bob can assist in code generation, bug fixing, code documentation, GPU-acceleration, and offers a no-code custom Jupyter Kernel. It provides example notebooks for various tasks like bio-image analysis, model selection, and bug fixing. Installation is recommended via conda/mamba environment. Custom endpoints like blablador and ollama can be used. Google Cloud AI API integration is also supported. The tool is extensible for Python libraries to enhance Bob's functionality.
autoarena
AutoArena is a tool designed to create leaderboards ranking Language Model outputs against one another using automated judge evaluation. It allows users to rank outputs from different LLMs, RAG setups, and prompts to find the best configuration of their system. Users can perform automated head-to-head evaluation using judges from various platforms like OpenAI, Anthropic, and Cohere. Additionally, users can define and run custom judges, connect to internal services, or implement bespoke logic. AutoArena enables users to run the application locally, providing full control over their environment and data.
agentok
Agentok Studio is a visual tool built for AutoGen, a cutting-edge agent framework from Microsoft and various contributors. It offers intuitive visual tools to simplify the construction and management of complex agent-based workflows. Users can create workflows visually as graphs, chat with agents, and share flow templates. The tool is designed to streamline the development process for creators and developers working on next-generation Multi-Agent Applications.
character-factory
Character Factory is a Python script that generates detailed character cards for SillyTavern, TavernAI, TextGenerationWebUI, and more using a large language model (LLM) and Stable Diffusion. It streamlines character generation by using deep learning models to create names, summaries, personalities, greeting messages, and avatars. The tool provides an easy way to create unique and imaginative characters for storytelling, chatting, and other purposes.
For similar tasks
LLM_Web_search
The LLM_Web_search project gives local LLMs the ability to search the web by outputting a specific command. It uses regular expressions to extract search queries from the model output and then uses duckduckgo-search to search the web. LangChain's contextual compression and Okapi BM25 or SPLADE are used to extract the relevant parts of the web pages in the search results. The extracted results are appended to the model's output.
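The command-extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Search_web("...")` command format, the function names, and the layout of the appended results are all assumptions made for the example.

```python
import re

# Hypothetical command format; the real project's syntax may differ.
SEARCH_COMMAND = re.compile(r'Search_web\("([^"]*)"\)')


def extract_search_queries(model_output: str) -> list[str]:
    """Return every non-empty query found inside a search command."""
    return [q.strip() for q in SEARCH_COMMAND.findall(model_output) if q.strip()]


def append_results(model_output: str, results: list[str]) -> str:
    """Append retrieved snippets to the model output, one per line."""
    if not results:
        return model_output
    lines = "\n".join(f"- {r}" for r in results)
    return f"{model_output}\n\nSearch results:\n{lines}"
```

In a real pipeline, the extracted queries would be passed to a search backend and the returned pages filtered for relevance before `append_results` is called.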
node-llama-cpp
node-llama-cpp is a tool that allows users to run AI models locally on their machines. It provides pre-built bindings with the option to build from source using cmake. Users can interact with text generation models, chat with models using a chat wrapper, and force models to generate output in a parseable format like JSON. The tool supports Metal and CUDA, offers CLI functionality for chatting with models without coding, and ensures up-to-date compatibility with the latest version of llama.cpp. Installation includes pre-built binaries for macOS, Linux, and Windows, with the option to build from source if binaries are not available for the platform.
Jlama
Jlama is a modern Java inference engine designed for large language models. It supports various model types such as Gemma, Llama, Mistral, GPT-2, BERT, and more. The tool implements features like Flash Attention and Mixture of Experts and supports several model quantization formats. Built with Java 21 and using the new Vector API for faster inference, Jlama lets users add LLM inference directly to their Java applications. The tool includes a CLI for running models, a simple UI for chatting with LLMs, and examples for different model types.
chatgpt-cli
ChatGPT CLI provides a command-line interface for interacting with ChatGPT models via OpenAI and Azure. It offers extensive configuration options and supports streaming, query, and interactive modes. Users can manage thread-based context, sliding window history, and provide custom context from any source. The CLI also offers model and thread listing and supports GPT-4, GPT-3.5-turbo, and Perplexity's models. Installation is available via Homebrew or direct download, and users can configure settings through default values, a config.yaml file, or environment variables.
elmer
Elmer is a user-friendly wrapper over common APIs for calling LLMs, with support for streaming and easy registration and calling of R functions. Users can interact with Elmer through an interactive chat console, interactive method calls, programmatic chat, and streaming results. Elmer also supports async usage for running multiple chat sessions concurrently, which is useful for Shiny applications. The tool calling feature lets users define external tools that Elmer can request to execute, extending the capabilities of the chat model.
duix.ai
Duix is a silicon-based digital human SDK for intelligent interaction, providing users with instant virtual human interaction experience on devices like Android and iOS. The SDK offers intuitive effect display and supports user customization through open documentation. It is fully open-source, allowing developers to understand its workings, optimize, and innovate further.
LLMBox
LLMBox is a library for implementing Large Language Models (LLMs) with a focus on a unified training pipeline and comprehensive model evaluation. It serves as a one-stop solution for training and utilizing LLMs, offering flexibility and efficiency in both stages. For training, the library supports diverse training strategies, a wide range of datasets, tokenizer vocabulary merging, data construction strategies, parameter-efficient fine-tuning, and efficient training methods. For utilization, LLMBox provides evaluation on various datasets, in-context learning strategies, chain-of-thought evaluation, multiple evaluation methods, prefix caching for faster inference, support for vLLM and Flash Attention, and quantization options. The tool is suitable for researchers and developers working with LLMs for natural language processing tasks.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models, from images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It is self-contained, with no need for a DBMS or cloud service; it exposes an OpenAPI interface that is easy to integrate with existing infrastructure (e.g., a Cloud IDE); and it supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.