torchchat

Run PyTorch LLMs locally on servers, desktop and mobile

Stars: 3468

torchchat is a codebase showcasing the ability to run large language models (LLMs) seamlessly. It allows running LLMs using Python in various environments such as desktop, server, iOS, and Android. The tool supports running models via PyTorch, chatting, generating text, running chat in the browser, and running models on desktop/server without Python. It also provides features like AOT Inductor for faster execution, running in C++ using the runner, and deploying and running on iOS and Android. The tool supports popular hardware and OS including Linux, Mac OS, Android, and iOS, with various data types and execution modes available.

README:

Chat with LLMs Everywhere

torchchat is a small codebase showcasing the ability to run large language models (LLMs) seamlessly. With torchchat, you can run LLMs using Python, within your own (C/C++) application (desktop or server) and on iOS and Android.

[!IMPORTANT] Update September 25, 2024: torchchat has multimodal support for Llama3.2 11B!!

To try it out, finish the Installation section below, then hop over to our multimodal guide to learn more.

What can you do with torchchat?

Run models via PyTorch / Python
Run models on desktop/server without Python
- Use AOT Inductor for faster execution
- Run in C++ using the runner
Run models on mobile
- Deploy and run on iOS
- Deploy and run on Android
Evaluate a model

Highlights

[New!!] Multimodal Support for Llama 3.2 11B
Command line interaction with popular LLMs such as Llama 3, Llama 2, Stories, Mistral and more
PyTorch-native execution with performance
Supports popular hardware and OS
- Linux (x86)
- macOS (M1/M2/M3)
- Android (Devices that support XNNPACK)
- iOS 17+ and 8+ GB of RAM (iPhone 15 Pro+ or iPad with Apple Silicon)
Multiple data types including: float32, float16, bfloat16
Multiple quantization schemes
Multiple execution modes including: Python (Eager, Compile) or Native (AOT Inductor (AOTI), ExecuTorch)

Models

The following models are supported by torchchat and have associated aliases.

Model	Mobile Friendly	Notes
meta-llama/Meta-Llama-3.2-3B-Instruct	✅	Tuned for `chat`. Alias to `llama3.2-3b`.
meta-llama/Meta-Llama-3.2-3B	✅	Best for `generate`. Alias to `llama3.2-3b-base`.
meta-llama/Llama-Guard-3-1B	✅	Tuned for classification. Alias to `llama3-1b-guard`.
meta-llama/Meta-Llama-3.2-1B-Instruct	✅	Tuned for `chat`. Alias to `llama3.2-1b`.
meta-llama/Meta-Llama-3.2-1B	✅	Best for `generate`. Alias to `llama3.2-1b-base`.
meta-llama/Llama-3.2-11B-Vision-Instruct		Multimodal (Image + Text). Tuned for `chat`. Alias to `llama3.2-11B`.
meta-llama/Llama-3.2-11B-Vision		Multimodal (Image + Text). Tuned for `generate`. Alias to `llama3.2-11B-base`.
meta-llama/Meta-Llama-3.1-8B-Instruct	✅	Tuned for `chat`. Alias to `llama3.1`.
meta-llama/Meta-Llama-3.1-8B	✅	Best for `generate`. Alias to `llama3.1-base`.
meta-llama/Meta-Llama-3-8B-Instruct	✅	Tuned for `chat`. Alias to `llama3`.
meta-llama/Meta-Llama-3-8B	✅	Best for `generate`. Alias to `llama3-base`.
meta-llama/Llama-2-7b-chat-hf	✅	Tuned for `chat`. Alias to `llama2`.
meta-llama/Llama-2-13b-chat-hf		Tuned for `chat`. Alias to `llama2-13b-chat`.
meta-llama/Llama-2-70b-chat-hf		Tuned for `chat`. Alias to `llama2-70b-chat`.
meta-llama/Llama-2-7b-hf	✅	Best for `generate`. Alias to `llama2-base`.
meta-llama/CodeLlama-7b-Python-hf	✅	Tuned for Python and `generate`. Alias to `codellama`.
meta-llama/CodeLlama-34b-Python-hf	✅	Tuned for Python and `generate`. Alias to `codellama-34b`.
mistralai/Mistral-7B-v0.1	✅	Best for `generate`. Alias to `mistral-7b-v01-base`.
mistralai/Mistral-7B-Instruct-v0.1	✅	Tuned for `chat`. Alias to `mistral-7b-v01-instruct`.
mistralai/Mistral-7B-Instruct-v0.2	✅	Tuned for `chat`. Alias to `mistral`.
tinyllamas/stories15M	✅	Toy model for `generate`. Alias to `stories15M`.
tinyllamas/stories42M	✅	Toy model for `generate`. Alias to `stories42M`.
tinyllamas/stories110M	✅	Toy model for `generate`. Alias to `stories110M`.
openlm-research/open_llama_7b	✅	Best for `generate`. Alias to `open-llama`.
ibm-granite/granite-3b-code-instruct-128k	✅	Alias to `granite-code` and `granite-code-3b`.
ibm-granite/granite-8b-code-instruct-128k	✅	Alias to `granite-code-8b`.
ibm-granite/granite-3.0-2b-instruct	✅	Alias to `granite3-2b` and `granite3`.
ibm-granite/granite-3.0-8b-instruct	✅	Alias to `granite3-8b`.
ibm-granite/granite-3.1-2b-instruct	✅	Alias to `granite3.1-2b` and `granite3.1`.
ibm-granite/granite-3.1-8b-instruct	✅	Alias to `granite3.1-8b`.
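
As a quick illustration of how aliases relate to full model names, the sketch below hard-codes a few rows of the table in a plain dictionary. The `ALIASES` mapping and the `resolve_alias` helper are hypothetical names used only for this example and are not part of torchchat; `python3 torchchat.py list` remains the authoritative source.

```python
# Hypothetical lookup table copied from a few rows of the table above.
ALIASES = {
    "llama3.2-3b": "meta-llama/Meta-Llama-3.2-3B-Instruct",
    "llama3.1": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "llama3.1-base": "meta-llama/Meta-Llama-3.1-8B",
    "mistral": "mistralai/Mistral-7B-Instruct-v0.2",
    "stories15M": "tinyllamas/stories15M",
    # Some models are reachable through more than one alias.
    "granite-code": "ibm-granite/granite-3b-code-instruct-128k",
    "granite-code-3b": "ibm-granite/granite-3b-code-instruct-128k",
}


def resolve_alias(name: str) -> str:
    """Return the full model name for an alias, or the input unchanged."""
    return ALIASES.get(name, name)


print(resolve_alias("llama3.1"))
```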

Installation

The following steps require that you have Python 3.10 installed.

[!TIP] torchchat uses the latest changes from various PyTorch projects so it's highly recommended that you use a venv (by using the commands below) or CONDA.

git clone https://github.com/pytorch/torchchat.git
cd torchchat
python3 -m venv .venv
source .venv/bin/activate
./install/install_requirements.sh

Commands

The interfaces of torchchat are leveraged through Python Commands and Native Runners. While the Python Commands are enumerable in the --help menu, the latter are explored in their respective sections.

python3 torchchat.py --help

# Output
usage: torchchat [-h] {chat,browser,generate,export,eval,download,list,remove,where,server} ...

positional arguments:
  {chat,browser,generate,export,eval,download,list,remove,where,server}
                        The specific command to run
    chat                Chat interactively with a model via the CLI
    generate            Generate responses from a model given a prompt
    browser             Chat interactively with a model in a locally hosted browser
    export              Export a model artifact to AOT Inductor or ExecuTorch
    download            Download model artifacts
    list                List all supported models
    remove              Remove downloaded model artifacts
    where               Return directory containing downloaded model artifacts
    server              [WIP] Starts a locally hosted REST server for model interaction
    eval                Evaluate a model via lm-eval

options:
  -h, --help            show this help message and exit

Python Inference (chat, generate, browser, server)

These commands represent different flavors of performing model inference in a Python environment.
Models are constructed either from CLI args or from loading exported artifacts.

Exporting (export)

This command generates model artifacts that are consumed by Python Inference or Native Runners.
More information is provided in the AOT Inductor and ExecuTorch sections.

Inventory Management (download, list, remove, where)

These commands are used to manage and download models.
More information is provided in the Download Weights section.

Evaluation (eval)

This command tests model fidelity via EleutherAI's lm_evaluation_harness.
More information is provided in the Evaluation section.

Download Weights

Most models use Hugging Face as the distribution channel, so you will need to create a Hugging Face account. Create a Hugging Face user access token as documented here with the write role.

Log into Hugging Face:

huggingface-cli login

Take a look at the available models:

python3 torchchat.py list

Then download one for testing (this README uses llama3.1):

python3 torchchat.py download llama3.1

[!NOTE] This command may prompt you to request access to Llama 3 via Hugging Face, if you do not already have access. Simply follow the prompts and re-run the command when access is granted.

Additional Model Inventory Management Commands

Where

This subcommand shows the location of a particular model.

python3 torchchat.py where llama3.1

This is useful in scripts when you do not want to hard-code paths.
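
A minimal sketch of that scripting use from Python is shown here. The helper names `model_dir` and `tokenizer_path` are hypothetical, and the sketch assumes that the `where` subcommand prints the directory as the last non-empty line of its standard output. The `run` parameter exists only so the subprocess call can be swapped out when testing.

```python
import subprocess
from pathlib import Path


def model_dir(model: str, run=subprocess.run) -> Path:
    """Ask `torchchat.py where` for the directory holding a model's artifacts."""
    result = run(
        ["python3", "torchchat.py", "where", model],
        capture_output=True,
        text=True,
        check=True,
    )
    # Assumption: the directory is the last non-empty line of stdout.
    lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
    if not lines:
        raise RuntimeError(f"no path reported for model {model!r}")
    return Path(lines[-1])


def tokenizer_path(model: str, run=subprocess.run) -> Path:
    """Build the tokenizer path used by the runner examples in this README."""
    return model_dir(model, run) / "tokenizer.model"
```

Calling `tokenizer_path("llama3.1")` from the torchchat root directory would then stand in for the backtick substitution used in the runner commands.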

Remove

This subcommand removes the specified model.

python3 torchchat.py remove llama3.1

More information about these commands can be found by adding the --help option.

Running via PyTorch / Python

The simplest way to run a model in PyTorch is via eager execution. This is the default execution mode for both PyTorch and torchchat. It performs inference without creating export artifacts or using a separate runner.

The model used for inference can also be configured and tailored to specific needs (compilation, quantization, etc.). See the customization guide for the options supported by torchchat.

[!TIP] For more information about these commands, please refer to the --help menu.

Chat

This mode allows you to chat with an LLM in an interactive fashion.

python3 torchchat.py chat llama3.1

Generate

This mode generates text based on an input prompt.

python3 torchchat.py generate llama3.1 --prompt "write me a story about a boy and his bear"

Server

This mode exposes a REST API for interacting with a model. The server follows the OpenAI API specification for chat completions.

To test out the REST API, you'll need two terminals: one to host the server and one to send the request. In one terminal, start the server:

python3 torchchat.py server llama3.1
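
Loading the model can take a while, so a script that starts the server may want to wait until the port accepts connections before sending requests. The sketch below polls a TCP port using only the standard library. The helper name `wait_for_port` is hypothetical, and the host and port (127.0.0.1 and 5000) are taken from the curl example in this README. An open port only shows that the server process is listening; it is an assumption that the model is ready at that point.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 90.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(interval)
    return False


if __name__ == "__main__":
    ready = wait_for_port("127.0.0.1", 5000)
    print("server is listening" if ready else "gave up waiting for the server")
```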

In another terminal, query the server using curl. Depending on the model configuration, this query might take a few minutes to respond.

[!NOTE] Since this feature is under active development, not every parameter is consumed. See api/api.py for details on which request parameters are implemented. If you encounter any issues, please comment on the tracking GitHub issue.

Example Query

Setting stream to "true" in the request emits a response in chunks. If stream is unset or not "true", then the client will await the full response from the server.

Example Input + Output

curl http://127.0.0.1:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "stream": "true",
    "max_tokens": 200,
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }'

{"response":" I'm a software developer with a passion for building innovative and user-friendly applications. I have experience in developing web and mobile applications using various technologies such as Java, Python, and JavaScript. I'm always looking for new challenges and opportunities to learn and grow as a developer.\n\nIn my free time, I enjoy reading books on computer science and programming, as well as experimenting with new technologies and techniques. I'm also interested in machine learning and artificial intelligence, and I'm always looking for ways to apply these concepts to real-world problems.\n\nI'm excited to be a part of the developer community and to have the opportunity to share my knowledge and experience with others. I'm always happy to help with any questions or problems you may have, and I'm looking forward to learning from you as well.\n\nThank you for visiting my profile! I hope you find my information helpful and interesting. If you have any questions or would like to discuss any topics, please feel free to reach out to me. I"}
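
The same request can be sent from Python. The following is a sketch using only the standard library; the function names `build_chat_request` and `send_chat_request` are hypothetical, and the URL, model name, and payload fields mirror the curl example above. Because the response format may change while the server is under development, the sketch returns the raw response text rather than assuming a schema.

```python
import json
import urllib.request


def build_chat_request(prompt, model="llama3.1", base_url="http://127.0.0.1:5000",
                       max_tokens=200, stream=False):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
    }
    if stream:
        # The curl example passes the string "true" rather than a JSON boolean.
        payload["stream"] = "true"
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=300.0):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(send_chat_request(build_chat_request("Hello!")))
```

The generous default timeout reflects the note above that a query might take a few minutes to respond.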

Browser

This command opens a basic browser interface for local chat by querying a local server.

First, follow the steps in the Server section above to start a local server. Then, in another terminal, launch the interface. Running the following will open a tab in your browser.

streamlit run torchchat/usages/browser.py

Use the "Max Response Tokens" slider to limit the maximum number of tokens generated by the model for each response. Click the "Reset Chat" button to remove the message history and start a fresh chat.

Desktop/Server Execution

AOTI (AOT Inductor)

AOTI compiles models before execution for faster inference. The process creates a zipped PT2 file containing all the artifacts generated by AOTInductor, and a .so file with the runnable contents that is then loaded for inference. This can be done in both Python and C++ environments.

The following example exports and executes the Llama3.1 8B Instruct model. The first command compiles and performs the actual export.

python3 torchchat.py export llama3.1 --output-aoti-package-path exportedModels/llama3_1_artifacts.pt2

[!NOTE] If your machine has CUDA, add the flag --quantize torchchat/quant_config/cuda.json when exporting for better performance.

For more details on quantization and what settings to use for your use case, visit our customization guide.

Run in a Python Environment

To run in a Python environment, use the generate subcommand like before, but include the pt2 file.

python3 torchchat.py generate llama3.1 --aoti-package-path exportedModels/llama3_1_artifacts.pt2 --prompt "Hello my name is"
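
When scripting the export and run steps together, it can help to build both commands from a single package path so they cannot drift apart. The sketch below only assembles argument lists with the flags shown in this section; the helper names `export_aoti_cmd` and `generate_aoti_cmd` are hypothetical and not part of torchchat.

```python
import shlex


def export_aoti_cmd(model, package_path, quantize_config=None):
    """Argument list for exporting a model to an AOTI package."""
    cmd = ["python3", "torchchat.py", "export", model,
           "--output-aoti-package-path", package_path]
    if quantize_config:
        # Optional, e.g. torchchat/quant_config/cuda.json on CUDA machines.
        cmd += ["--quantize", quantize_config]
    return cmd


def generate_aoti_cmd(model, package_path, prompt):
    """Argument list for running generate against an exported AOTI package."""
    return ["python3", "torchchat.py", "generate", model,
            "--aoti-package-path", package_path, "--prompt", prompt]


package = "exportedModels/llama3_1_artifacts.pt2"
# Print the shell-quoted commands; pass the lists to subprocess.run to execute.
print(shlex.join(export_aoti_cmd("llama3.1", package)))
print(shlex.join(generate_aoti_cmd("llama3.1", package, "Hello my name is")))
```

Passing each list to `subprocess.run(cmd, check=True)` from the torchchat root directory would run the two steps in order.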

Run using our C++ Runner

To run in a C++ environment, we need to build the runner binary.

torchchat/utils/scripts/build_native.sh aoti

Then run the compiled executable with the pt2 file.

cmake-out/aoti_run exportedModels/llama3_1_artifacts.pt2 -z `python3 torchchat.py where llama3.1`/tokenizer.model -i "Once upon a time"

Mobile Execution

ExecuTorch enables you to optimize your model for execution on a mobile or embedded device.

Set Up ExecuTorch

Before running any commands in torchchat that require ExecuTorch, you must first install ExecuTorch.

To install ExecuTorch, run the following commands. This will download the ExecuTorch repo to ./et-build/src and install various ExecuTorch libraries to ./et-build/install.

[!IMPORTANT] The following commands should be run from the torchchat root directory.

export TORCHCHAT_ROOT=${PWD}
./torchchat/utils/scripts/install_et.sh

Export for mobile

Similar to AOTI, to deploy onto device, we first export the PTE artifact, then we load the artifact for inference.

The following example uses the Llama3.1 8B Instruct model.

# Export
python3 torchchat.py export llama3.1 --quantize torchchat/quant_config/mobile.json --output-pte-path llama3.1.pte

[!NOTE] We use --quantize torchchat/quant_config/mobile.json to quantize the llama3.1 model to reduce model size and improve performance for on-device use cases.

For more details on quantization and what settings to use for your use case, visit our customization guide.

Deploy and run on Desktop

While ExecuTorch does not focus on desktop inference, it is capable of doing so. This is handy for testing out PTE models without sending them to a physical device.

Specifically, there are two ways of doing so: in pure Python or via a runner.

Deploying via Python

# Execute
python3 torchchat.py generate llama3.1 --pte-path llama3.1.pte --prompt "Hello my name is"

Deploying via the C++ Runner

Build the runner

torchchat/utils/scripts/build_native.sh et

Execute using the runner

cmake-out/et_run llama3.1.pte -z `python3 torchchat.py where llama3.1`/tokenizer.model -l 3 -i "Once upon a time"

Deploy and run on iOS

The following assumes you've completed the steps for Setting up ExecuTorch.

Deploying with Xcode

Requirements

Xcode 15.0 or later
CMake 3.19 or later
- Download and open the macOS .dmg installer and move the CMake app to the /Applications folder.
- Install the CMake command line tools: sudo /Applications/CMake.app/Contents/bin/cmake-gui --install
A development provisioning profile with the increased-memory-limit entitlement.

Steps

Open the Xcode project:

open et-build/src/executorch/examples/demo-apps/apple_ios/LLaMA/LLaMA.xcodeproj

Note: If you're running into any issues related to package dependencies, close Xcode, clean some of the caches and/or the build products, and open the Xcode project again:
rm -rf \
  ~/Library/org.swift.swiftpm \
  ~/Library/Caches/org.swift.swiftpm \
  ~/Library/Caches/com.apple.dt.Xcode \
  ~/Library/Developer/Xcode/DerivedData

Click the Play button to launch the app in the Simulator.
To run on a device, ensure you have it set up for development and a provisioning profile with the increased-memory-limit entitlement. Update the app's bundle identifier to match your provisioning profile with the required capability.
After successfully launching the app, copy the exported ExecuTorch model (.pte) and tokenizer (.model) files to the iLLaMA folder. You can find the model file called llama3.1.pte in the current torchchat directory and the tokenizer file at $(python3 torchchat.py where llama3.1)/tokenizer.model path.
- For the Simulator: Drag and drop both files onto the Simulator window and save them in the On My iPhone > iLLaMA folder.
- For a device: Open a separate Finder window, navigate to the Files tab, drag and drop both files into the iLLaMA folder, and wait for the copying to finish.
Follow the app's UI guidelines to select the model and tokenizer files from the local filesystem and issue a prompt.

Deploy and run on Android

The following assumes you've completed the steps for Setting up ExecuTorch.

Approach 1 (Recommended): Android Studio

Requirements

Android Studio
Java 17
Android SDK 34
adb

Steps

Download the AAR file, which contains the Java library and corresponding JNI library, to build and run the app.
- executorch.aar (sha256sums)
Move the downloaded AAR file to torchchat/edge/android/torchchat/app/libs/. You may need to create directory torchchat/edge/android/torchchat/app/libs/ if it does not exist.
Push the model and tokenizer file to your device. You can find the model file called llama3.1.pte in the current torchchat directory and the tokenizer file at $(python3 torchchat.py where llama3.1)/tokenizer.model path.
```
adb shell mkdir -p /data/local/tmp/llama
adb push <model.pte> /data/local/tmp/llama
adb push <tokenizer.model or tokenizer.bin> /data/local/tmp/llama
```
Use Android Studio to open the torchchat app skeleton, located at torchchat/edge/android/torchchat.
Click the Play button (^R) to launch it on an emulator or device.
- We recommend using a device with at least 12GB RAM and 20GB storage.
- If using an emulated device, refer to this post on how to set the RAM.
Follow the app's UI guidelines to pick the model and tokenizer files from the local filesystem. Then issue a prompt.

Note: The AAR file listed in Step 1 includes the tiktoken and sentencepiece tokenizers. To tweak or use a custom tokenizer and runtime, modify the ExecuTorch code and use this script to build the AAR library.

Approach 2: E2E Script

Alternatively, you can run torchchat/utils/scripts/android_example.sh which sets up Java, Android SDK Manager, Android SDK, Android emulator (if no physical device is found), builds the app, and launches it for you. It can be used if you don't have a GUI.

export TORCHCHAT_ROOT=$(pwd)
sh torchchat/utils/scripts/android_example.sh

Eval

Note: This feature is still a work in progress and not all features are working.

Uses the lm_eval library to evaluate model accuracy on a variety of tasks. Defaults to wikitext and can be manually controlled using the tasks and limit args. See Evaluation.

Examples

Eager mode:

python3 torchchat.py eval llama3.1 --dtype fp32 --limit 5

To test the perplexity for a lowered or quantized model, pass it in the same way you would to generate:

python3 torchchat.py eval llama3.1 --pte-path llama3.1.pte --limit 5

Design Principles

torchchat embodies PyTorch’s design philosophy, especially "usability over everything else".

Native PyTorch

torchchat is a native-PyTorch library. While we provide integrations with the surrounding ecosystem (eg: Hugging Face models, etc), all of the core functionality is written in PyTorch.

Simplicity and Extensibility

torchchat is designed to be easy to understand, use and extend.

Composition over implementation inheritance - layers of inheritance for code re-use make the code hard to read and extend
No training frameworks - explicitly outlining the training logic makes it easy to extend for custom use cases
Code duplication is preferred over unnecessary abstractions
Modular building blocks over monolithic components

Correctness

torchchat provides well-tested components with a high bar for correctness. We provide:

Extensive unit-tests to ensure things operate as they should

Community Contributions

We really value our community and the contributions made by our wonderful users. We'll use this section to call out some of these contributions! If you'd like to help out as well, please see the CONTRIBUTING guide.

To connect with us and other community members, we invite you to join our Slack community by filling out this form. Once you've joined, you can:

Head to the #torchchat-general channel for general questions, discussion, and community support.
Join the #torchchat-contributors channel if you're interested in contributing directly to project development.

We look forward to discussing the future of torchchat with you!

Troubleshooting

A section of commonly encountered setup errors/exceptions. If this section doesn't cover your situation, check the GitHub issues.

Model Access

Access to model is restricted and you are not in the authorized list

Some models require an additional step to access. Follow the link provided in the error to get access.

Installing ExecuTorch

Failed Building Wheel

If ./torchchat/utils/scripts/install_et.sh fails with an error like "Building wheel for executorch (pyproject.toml) did not run successfully", it may be linking to an older version of PyTorch installed some other way, such as via Homebrew. You can break the link by uninstalling the other versions, for example with brew uninstall pytorch. Note: this may break something that depends on the uninstalled version, so proceed with care.

CERTIFICATE_VERIFY_FAILED

Run pip install --upgrade certifi.

Filing Issues

If you encounter bugs or difficulty using torchchat, please file a GitHub issue.

Please include the exact command you ran and the output of that command. Also, run this script and include the output saved to system_info.txt so that we can better debug your issue.

(echo "Operating System Information"; uname -a; echo ""; cat /etc/os-release; echo ""; echo "Python Version"; python --version || python3 --version; echo ""; echo "PIP Version"; pip --version || pip3 --version; echo ""; echo "Installed Packages"; pip freeze || pip3 freeze; echo ""; echo "PyTorch Version"; python -c "import torch; print(torch.__version__)" || python3 -c "import torch; print(torch.__version__)"; echo ""; echo "Collection Complete") > system_info.txt
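
On systems where the shell one-liner is awkward to run, a rough Python equivalent can gather part of the same information. The sketch below is an assumption-laden subset: the function name `collect_system_info` is hypothetical, and it omits the /etc/os-release contents and the pip package list that the script above collects, so prefer the original script when filing an issue.

```python
import platform
import sys


def collect_system_info() -> str:
    """Collect OS, Python, and PyTorch version details as plain text."""
    lines = [
        "Operating System Information",
        " ".join(platform.uname()),
        "",
        "Python Version",
        sys.version.split()[0],
        "",
        "PyTorch Version",
    ]
    try:
        import torch  # imported lazily so the script still runs without PyTorch

        lines.append(str(torch.__version__))
    except ImportError:
        lines.append("torch is not installed")
    lines += ["", "Collection Complete"]
    return "\n".join(lines)


if __name__ == "__main__":
    with open("system_info.txt", "w", encoding="utf-8") as handle:
        handle.write(collect_system_info() + "\n")
```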

Disclaimer

The torchchat Repository Content is provided without any guarantees about performance or compatibility. In particular, torchchat makes available model architectures written in Python for PyTorch that may not perform in the same manner or meet the same standards as the original versions of those models. When using the torchchat Repository Content, including any model architectures, you are solely responsible for determining the appropriateness of using or redistributing the torchchat Repository Content and assume any risks associated with your use of the torchchat Repository Content or any models, outputs, or results, both alone and in combination with any other technologies. Additionally, you may have other legal obligations that govern your use of other content, such as the terms of service for third-party models, weights, data, or other technologies, and you are solely responsible for complying with all such obligations.

Acknowledgements

Thank you to the community for all the awesome libraries and tools you've built around local LLM inference.

Georgi Gerganov and his GGML project shining a spotlight on community-based enablement and inspiring so many other projects.
Andrej Karpathy and his llama2.c project. So many great (and simple!) ideas in llama2.c that we have directly adopted (both ideas and code) from his repo. You can never go wrong by following Andrej's work.
Michael Gschwind, Bert Maher, Scott Wolchok, Bin Bao, Chen Yang, Huamin Li and Mu-Chu Li, who built the first version of nanoGPT (DSOGPT) with AOT Inductor, proving that AOTI can be used to build efficient LLMs and that DSOs are a viable distribution format for models.
Bert Maher and his llama2.so, which built on Andrej's llama2.c and on DSOGPT to close the loop on Llama models with AOTInductor.
Christian Puhrsch, Horace He, Joe Isaacson and many more for their many contributions in Accelerating GenAI models in the "Anything, Fast!" pytorch.org blogs, and, in particular, Horace He for GPT, Fast!, which we have directly adopted (both ideas and code) from his repo.

License

torchchat is released under the BSD 3 license. (Additional code in this distribution is covered by the MIT and Apache Open Source licenses.) However, you may have other legal obligations that govern your use of content, such as the terms of service for third-party models.

For Tasks:

Click tags to check more tools for each tasks

chat with model generate text deploy on mobile evaluate model manage model inventory

For Jobs:

software developer machine learning engineer data scientist ai researcher mobile app developer

Alternative AI tools for torchchat

Similar Open Source Tools

torchchat

github

: 3.5k

telemetry-airflow

This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow". Some links relevant to users and developers of WTMO: * The `dags` directory in this repository contains some custom DAG definitions * Many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl * The Data SRE team maintains a WTMO Developer Guide (behind SSO)

github

: 185

WindowsAgentArena

Windows Agent Arena (WAA) is a scalable Windows AI agent platform designed for testing and benchmarking multi-modal, desktop AI agents. It provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, enabling testing of agentic AI workflows across various tasks. WAA supports deploying agents at scale using Azure ML cloud infrastructure, allowing parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes.

github

: 147

lmql

LMQL is a programming language designed for large language models (LLMs) that offers a unique way of integrating traditional programming with LLM interaction. It allows users to write programs that combine algorithmic logic with LLM calls, enabling model reasoning capabilities within the context of the program. LMQL provides features such as Python syntax integration, rich control-flow options, advanced decoding techniques, powerful constraints via logit masking, runtime optimization, sync and async API support, multi-model compatibility, and extensive applications like JSON decoding and interactive chat interfaces. The tool also offers library integration, flexible tooling, and output streaming options for easy model output handling.

github

: 3.4k

bia-bob

BIA `bob` is a Jupyter-based assistant for interacting with data using large language models to generate Python code. It can utilize OpenAI's chatGPT, Google's Gemini, Helmholtz' blablador, and Ollama. Users need respective accounts to access these services. Bob can assist in code generation, bug fixing, code documentation, GPU-acceleration, and offers a no-code custom Jupyter Kernel. It provides example notebooks for various tasks like bio-image analysis, model selection, and bug fixing. Installation is recommended via conda/mamba environment. Custom endpoints like blablador and ollama can be used. Google Cloud AI API integration is also supported. The tool is extensible for Python libraries to enhance Bob's functionality.

github

: 110

web-llm

WebLLM is a modular and customizable javascript package that directly brings language model chats directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU. WebLLM is fully compatible with OpenAI API. That is, you can use the same OpenAI API on any open source models locally, with functionalities including json-mode, function-calling, streaming, etc. We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.

github

: 13.1k

LeanCopilot

Lean Copilot is a tool that enables the use of large language models (LLMs) in Lean for proof automation. It provides features such as suggesting tactics/premises, searching for proofs, and running inference of LLMs. Users can utilize built-in models from LeanDojo or bring their own models to run locally or on the cloud. The tool supports platforms like Linux, macOS, and Windows WSL, with optional CUDA and cuDNN for GPU acceleration. Advanced users can customize behavior using Tactic APIs and Model APIs. Lean Copilot also allows users to bring their own models through ExternalGenerator or ExternalEncoder. The tool comes with caveats such as occasional crashes and issues with premise selection and proof search. Users can get in touch through GitHub Discussions for questions, bug reports, feature requests, and suggestions. The tool is designed to enhance theorem proving in Lean using LLMs.

github

: 1.0k

unstructured

The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.

github

: 10.5k

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

h2o-llmstudio

H2O LLM Studio is a framework and no-code GUI designed for fine-tuning state-of-the-art large language models (LLMs). With H2O LLM Studio, you can easily and effectively fine-tune LLMs without the need for any coding experience. The GUI is specially designed for large language models, and you can finetune any LLM using a large variety of hyperparameters. You can also use recent finetuning techniques such as Low-Rank Adaptation (LoRA) and 8-bit model training with a low memory footprint. Additionally, you can use Reinforcement Learning (RL) to finetune your model (experimental), use advanced evaluation metrics to judge generated answers by the model, track and compare your model performance visually, and easily export your model to the Hugging Face Hub and share it with the community.

github

: 4.1k

vector-inference

This repository provides an easy-to-use solution for running inference servers on Slurm-managed computing clusters using vLLM. All scripts in this repository run natively on the Vector Institute cluster environment. Users can deploy models as Slurm jobs, check server status and performance metrics, and shut down models. The repository also supports launching custom models with specific configurations. Additionally, users can send inference requests and set up an SSH tunnel to run inference from a local device.

github

: 53

civitai

Civitai is a platform where people can share their stable diffusion models (textual inversions, hypernetworks, aesthetic gradients, VAEs, and any other crazy stuff people do to customize their AI generations), collaborate with others to improve them, and learn from each other's work. The platform allows users to create an account, upload their models, and browse models that have been shared by others. Users can also leave comments and feedback on each other's models to facilitate collaboration and knowledge sharing.

github

: 6.5k

KrillinAI

KrillinAI is a video subtitle translation and dubbing tool built on large AI models, featuring speech recognition, intelligent sentence segmentation, professional translation, and one-click deployment of the entire process. It provides a one-stop workflow from video downloading to the final product, supporting cross-language cultural communication with AI. The tool supports multiple languages for input and translation, and integrates features such as automatic dependency installation, video downloading from platforms like YouTube and Bilibili, high-speed subtitle recognition, intelligent subtitle segmentation and alignment, custom vocabulary replacement, a professional-level translation engine, and a choice of external services for speech and large models.

github

: 655

neural

Neural is a Vim and Neovim plugin that integrates various machine learning tools to assist users in writing code, generating text, and explaining code or paragraphs. It supports multiple machine learning models, focuses on privacy, and is compatible with Vim 8.0+ and Neovim 0.8+. Users can easily configure Neural to interact with third-party machine learning tools, such as OpenAI, to enhance code generation and completion. The plugin also provides commands like `:NeuralExplain` to explain code or text and `:NeuralStop` to stop Neural from working. Neural is maintained by the Dense Analysis team and comes with a disclaimer about sending input data to third-party servers for machine learning queries.

github

: 492

autoarena

AutoArena is a tool designed to create leaderboards ranking Language Model outputs against one another using automated judge evaluation. It allows users to rank outputs from different LLMs, RAG setups, and prompts to find the best configuration of their system. Users can perform automated head-to-head evaluation using judges from various platforms like OpenAI, Anthropic, and Cohere. Additionally, users can define and run custom judges, connect to internal services, or implement bespoke logic. AutoArena enables users to run the application locally, providing full control over their environment and data.

github

: 65
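
The AutoArena description above says leaderboards are built from automated head-to-head judgments, but it does not say how the judgments are aggregated. As a rough sketch only, the snippet below uses Elo ratings as an assumed stand-in for that aggregation; the function names, the starting rating, and the `K` value are all hypothetical and are not taken from AutoArena.

```python
from collections import defaultdict

K = 32  # update step size; a common Elo default, assumed here


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats a player rated r_b under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_leaderboard(matches, start: float = 1000.0):
    """Build a leaderboard from (winner, loser) pairs decided by a judge.

    Returns a list of (name, rating) tuples sorted from best to worst.
    """
    ratings = defaultdict(lambda: start)
    for winner, loser in matches:
        e_w = expected_score(ratings[winner], ratings[loser])
        delta = K * (1.0 - e_w)  # the winner gains exactly what the loser drops
        ratings[winner] += delta
        ratings[loser] -= delta
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judge verdicts between three configurations.
    verdicts = [("prompt_v2", "prompt_v1"), ("prompt_v2", "rag_setup")]
    for name, rating in elo_leaderboard(verdicts):
        print(f"{name}: {rating:.1f}")
```

Each verdict moves the same number of points from loser to winner, so the total rating stays constant and only the relative order matters for the leaderboard.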

ScandEval

ScandEval is a framework for evaluating pretrained language models on mono- or multilingual language tasks. It provides a unified interface for benchmarking models on a variety of tasks, including sentiment analysis, question answering, and machine translation. ScandEval is designed to be easy to use and extensible, making it a valuable tool for researchers and practitioners alike.

github

: 81

For similar tasks

LLM_Web_search

The LLM_Web_search project gives local LLMs the ability to search the web by outputting a specific command. It uses regular expressions to extract search queries from the model output and then uses duckduckgo-search to search the web. LangChain's contextual compression and either Okapi BM25 or SPLADE are used to extract the relevant parts of the web pages in the search results. The extracted results are appended to the model's output.

github

: 232
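
The LLM_Web_search flow described above (find a search command in the model output with a regular expression, then append the results to that output) can be sketched roughly as follows. The `Search_web("...")` command format, the function names, and the result layout are assumptions made for illustration; the project's actual pattern and output format may differ, and the web search step itself is left out.

```python
import re

# Assumed command format: the model writes Search_web("some query").
# This pattern is illustrative and may not match the project's real one.
SEARCH_COMMAND = re.compile(r'Search_web\("([^"]+)"\)')


def extract_search_queries(model_output: str) -> list[str]:
    """Return every query found inside a search command in the model output."""
    return [query.strip() for query in SEARCH_COMMAND.findall(model_output)]


def append_results(model_output: str, results: list[str]) -> str:
    """Append retrieved snippets to the model output, one bullet per snippet."""
    if not results:
        return model_output
    bullets = "\n".join(f"- {snippet}" for snippet in results)
    return f"{model_output}\n\nSearch results:\n{bullets}"


if __name__ == "__main__":
    output = 'Let me look that up. Search_web("torchchat AOT Inductor")'
    queries = extract_search_queries(output)
    print(queries)
    # A real pipeline would run the search and rank page chunks here;
    # a placeholder snippet stands in for that step.
    print(append_results(output, ["(placeholder snippet)"]))
```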

node-llama-cpp

node-llama-cpp is a tool that allows users to run AI models locally on their machines. It provides pre-built bindings with the option to build from source using cmake. Users can interact with text generation models, chat with models using a chat wrapper, and force models to generate output in a parseable format like JSON. The tool supports Metal and CUDA, offers CLI functionality for chatting with models without coding, and ensures up-to-date compatibility with the latest version of llama.cpp. Installation includes pre-built binaries for macOS, Linux, and Windows, with the option to build from source if binaries are not available for the platform.

github

: 853

Jlama

Jlama is a modern Java inference engine designed for large language models. It supports various model types such as Gemma, Llama, Mistral, GPT-2, BERT, and more. The tool implements features like Flash Attention, Mixture of Experts, and supports different model quantization formats. Built with Java 21 and utilizing the new Vector API for faster inference, Jlama allows users to add LLM inference directly to their Java applications. The tool includes a CLI for running models, a simple UI for chatting with LLMs, and examples for different model types.

github

: 987

torchchat

github

: 3.5k

chatgpt-cli

ChatGPT CLI provides a powerful command-line interface for seamless interaction with ChatGPT models via OpenAI and Azure. It features streaming capabilities, extensive configuration options, and supports various modes like streaming, query, and interactive mode. Users can manage thread-based context, sliding window history, and provide custom context from any source. The CLI also offers model and thread listing, advanced configuration options, and supports GPT-4, GPT-3.5-turbo, and Perplexity's models. Installation is available via Homebrew or direct download, and users can configure settings through default values, a config.yaml file, or environment variables.

github

: 661

elmer

Elmer is a user-friendly wrapper over common APIs for calling LLMs, with support for streaming and easy registration and calling of R functions. Users can interact with Elmer in various ways, such as an interactive chat console, interactive method calls, programmatic chat, and streaming results. Elmer also supports async usage for running multiple chat sessions concurrently, which is useful for Shiny applications. The tool calling feature allows users to define external tools that Elmer can request to execute, extending the capabilities of the chat model.

github

: 71

mlx-lm

MLX LM is a Python package designed for generating text and fine-tuning large language models on Apple silicon using MLX. It offers integration with the Hugging Face Hub for easy access to thousands of LLMs, support for quantizing and uploading models to the Hub, low-rank and full model fine-tuning capabilities, and distributed inference and fine-tuning with `mx.distributed`. Users can interact with the package through command line options or the Python API, enabling tasks such as text generation, chatting with language models, model conversion, streaming generation, and sampling. MLX LM supports various Hugging Face models and provides tools for efficient scaling to long prompts and generations, including a rotating key-value cache and prompt caching. It requires macOS 15.0 or higher for optimal performance.

github

: 339

duix.ai

Duix is a silicon-based digital human SDK for intelligent interaction, providing users with an instant virtual human interaction experience on devices such as Android and iOS. The SDK offers intuitive effect display and supports user customization through open documentation. It is fully open-source, allowing developers to understand its workings, optimize, and innovate further.

github

: 4.2k

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 855

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models, from images to sound.

github

: 94

kaito

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI red teaming tasks so that operators can focus on more complicated and time-consuming work, and it can also identify security harms such as misuse (e.g., malware generation, jailbreaking) and privacy harms (e.g., identity theft). The goal is to give researchers a baseline of how well their model and entire inference pipeline are doing against different harm categories, and to let them compare that baseline to future iterations of their model. This gives them empirical data on how well their model is doing today and lets them detect any degradation of performance in future iterations.

github

: 2.3k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features:
- Self-contained, with no need for a DBMS or cloud service.
- OpenAPI interface, easy to integrate with existing infrastructure (e.g., Cloud IDE).
- Supports consumer-grade GPUs.

github

: 30.6k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675