web-llm

High-performance In-browser LLM Inference Engine

Stars: 13132

Visit

WebLLM is a modular and customizable javascript package that directly brings language model chats directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU. WebLLM is fully compatible with OpenAI API. That is, you can use the same OpenAI API on any open source models locally, with functionalities including json-mode, function-calling, streaming, etc. We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.

README:

WebLLM

High-Performance In-Browser LLM Inference Engine.

Get Started | Blogpost | Examples | Documentation

Overview

WebLLM is a high-performance in-browser LLM inference engine that brings language model inference directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU.

WebLLM is fully compatible with OpenAI API. That is, you can use the same OpenAI API on any open source models locally, with functionalities including streaming, JSON-mode, function-calling (WIP), etc.

We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.

You can use WebLLM as a base npm package and build your own web application on top of it by following the examples below. This project is a companion project of MLC LLM, which enables universal deployment of LLM across hardware environments.

Check out WebLLM Chat to try it out!

Key Features

In-Browser Inference: WebLLM is a high-performance, in-browser language model inference engine that leverages WebGPU for hardware acceleration, enabling powerful LLM operations directly within web browsers without server-side processing.
Full OpenAI API Compatibility: Seamlessly integrate your app with WebLLM using OpenAI API with functionalities such as streaming, JSON-mode, logit-level control, seeding, and more.
Structured JSON Generation: WebLLM supports state-of-the-art JSON mode structured generation, implemented in the WebAssembly portion of the model library for optimal performance. Check WebLLM JSON Playground on HuggingFace to try generating JSON output with custom JSON schema.
Extensive Model Support: WebLLM natively supports a range of models including Llama 3, Phi 3, Gemma, Mistral, Qwen(通义千问), and many others, making it versatile for various AI tasks. For the complete supported model list, check MLC Models.
Custom Model Integration: Easily integrate and deploy custom models in MLC format, allowing you to adapt WebLLM to specific needs and scenarios, enhancing flexibility in model deployment.
Plug-and-Play Integration: Easily integrate WebLLM into your projects using package managers like NPM and Yarn, or directly via CDN, complete with comprehensive examples and a modular design for connecting with UI components.
Streaming & Real-Time Interactions: Supports streaming chat completions, allowing real-time output generation which enhances interactive applications like chatbots and virtual assistants.
Web Worker & Service Worker Support: Optimize UI performance and manage the lifecycle of models efficiently by offloading computations to separate worker threads or service workers.
Chrome Extension Support: Extend the functionality of web browsers through custom Chrome extensions using WebLLM, with examples available for building both basic and advanced extensions.

Built-in Models

Check the complete list of available models on MLC Models. WebLLM supports a subset of these available models and the list can be accessed at prebuiltAppConfig.model_list.

Here are the primary families of models currently supported:

Llama: Llama 3, Llama 2, Hermes-2-Pro-Llama-3
Phi: Phi 3, Phi 2, Phi 1.5
Gemma: Gemma-2B
Mistral: Mistral-7B-v0.3, Hermes-2-Pro-Mistral-7B, NeuralHermes-2.5-Mistral-7B, OpenHermes-2.5-Mistral-7B
Qwen (通义千问): Qwen2 0.5B, 1.5B, 7B

If you need more models, request a new model via opening an issue or check Custom Models for how to compile and use your own models with WebLLM.

Jumpstart with Examples

Learn how to use WebLLM to integrate large language models into your application and generate chat completions through this simple Chatbot example:

For an advanced example of a larger, more complicated project, check WebLLM Chat.

More examples for different use cases are available in the examples folder.

Get Started

WebLLM offers a minimalist and modular interface to access the chatbot in the browser. The package is designed in a modular way to hook to any of the UI components.

Installation

Package Manager

# npm
npm install @mlc-ai/web-llm
# yarn
yarn add @mlc-ai/web-llm
# or pnpm
pnpm install @mlc-ai/web-llm

Then import the module in your code.

// Import everything
import * as webllm from "@mlc-ai/web-llm";
// Or only import what you need
import { CreateMLCEngine } from "@mlc-ai/web-llm";

CDN Delivery

Thanks to jsdelivr.com, WebLLM can be imported directly through URL and work out-of-the-box on cloud development platforms like jsfiddle.net, Codepen.io, and Scribbler:

import * as webllm from "https://esm.run/@mlc-ai/web-llm";

It can also be dynamicall imported as:

const webllm = await import ("https://esm.run/@mlc-ai/web-llm");

Create MLCEngine

Most operations in WebLLM are invoked through the MLCEngine interface. You can create an MLCEngine instance and loading the model by calling the CreateMLCEngine() factory function.

(Note that loading models requires downloading and it can take a significant amount of time for the very first run without caching previously. You should properly handle this asynchronous call.)

import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Callback function to update model loading progress
const initProgressCallback = (initProgress) => {
  console.log(initProgress);
}
const selectedModel = "Llama-3.1-8B-Instruct-q4f32_1-MLC";

const engine = await CreateMLCEngine(
  selectedModel,
  { initProgressCallback: initProgressCallback }, // engineConfig
);

Under the hood, this factory function does the following steps for first creating an engine instance (synchronous) and then loading the model (asynchronous). You can also do them separately in your application.

import { MLCEngine } from "@mlc-ai/web-llm";

// This is a synchronous call that returns immediately
const engine = new MLCEngine({
  initProgressCallback: initProgressCallback
});

// This is an asynchronous call and can take a long time to finish
await engine.reload(selectedModel);

Chat Completion

After successfully initializing the engine, you can now invoke chat completions using OpenAI style chat APIs through the engine.chat.completions interface. For the full list of parameters and their descriptions, check section below and OpenAI API reference.

(Note: The model parameter is not supported and will be ignored here. Instead, call CreateMLCEngine(model) or engine.reload(model) instead as shown in the Create MLCEngine above.)

const messages = [
  { role: "system", content: "You are a helpful AI assistant." },
  { role: "user", content: "Hello!" },
]

const reply = await engine.chat.completions.create({
  messages,
});
console.log(reply.choices[0].message);
console.log(reply.usage);

Streaming

WebLLM also supports streaming chat completion generating. To use it, simply pass stream: true to the engine.chat.completions.create call.

const messages = [
  { role: "system", content: "You are a helpful AI assistant." },
  { role: "user", content: "Hello!" },
]

// Chunks is an AsyncGenerator object
const chunks = await engine.chat.completions.create({
  messages,
  temperature: 1,
  stream: true, // <-- Enable streaming
  stream_options: { include_usage: true },
});

let reply = "";
for await (const chunk of chunks) {
  reply += chunk.choices[0]?.delta.content || "";
  console.log(reply);
  if (chunk.usage) {
    console.log(chunk.usage); // only last chunk has usage
  }
}

const fullReply = await engine.getMessage();
console.log(fullReply);

Advanced Usage

Using Workers

You can put the heavy computation in a worker script to optimize your application performance. To do so, you need to:

Create a handler in the worker thread that communicates with the frontend while handling the requests.
Create a Worker Engine in your main application, which under the hood sends messages to the handler in the worker thread.

For detailed implementations of different kinds of Workers, check the following sections.

Dedicated Web Worker

WebLLM comes with API support for WebWorker so you can hook the generation process into a separate worker thread so that the computing in the worker thread won't disrupt the UI.

We create a handler in the worker thread that communicates with the frontend while handling the requests.

// worker.ts
import { WebWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

// A handler that resides in the worker thread
const handler = new WebWorkerMLCEngineHandler();
self.onmessage = (msg: MessageEvent) => {
  handler.onmessage(msg);
};

In the main logic, we create a WebWorkerMLCEngine that implements the same MLCEngineInterface. The rest of the logic remains the same.

// main.ts
import { CreateWebWorkerMLCEngine } from "@mlc-ai/web-llm";

async function main() {
  // Use a WebWorkerMLCEngine instead of MLCEngine here
  const engine = await CreateWebWorkerMLCEngine(
    new Worker(
      new URL("./worker.ts", import.meta.url), 
      {
        type: "module",
      }
    ),
    selectedModel,
    { initProgressCallback }, // engineConfig
  );

  // everything else remains the same
}

Use Service Worker

WebLLM comes with API support for ServiceWorker so you can hook the generation process into a service worker to avoid reloading the model in every page visit and optimize your application's offline experience.

(Note, Service Worker's life cycle is managed by the browser and can be killed any time without notifying the webapp. ServiceWorkerMLCEngine will try to keep the service worker thread alive by periodically sending heartbeat events, but your application should also include proper error handling. Check keepAliveMs and missedHeatbeat in ServiceWorkerMLCEngine for more details.)

We create a handler in the worker thread that communicates with the frontend while handling the requests.

// sw.ts
import { ServiceWorkerMLCEngineHandler } from "@mlc-ai/web-llm";

let handler: ServiceWorkerMLCEngineHandler;

self.addEventListener("activate", function (event) {
  handler = new ServiceWorkerMLCEngineHandler();
  console.log("Service Worker is ready");
});

Then in the main logic, we register the service worker and create the engine using CreateServiceWorkerMLCEngine function. The rest of the logic remains the same.

// main.ts
import { MLCEngineInterface, CreateServiceWorkerMLCEngine } from "@mlc-ai/web-llm";

if ("serviceWorker" in navigator) {
  navigator.serviceWorker.register(
    new URL("sw.ts", import.meta.url),  // worker script
    { type: "module" },
  );
}

const engine: MLCEngineInterface =
  await CreateServiceWorkerMLCEngine(
    selectedModel,
    { initProgressCallback }, // engineConfig
  );

You can find a complete example on how to run WebLLM in service worker in examples/service-worker.

Chrome Extension

You can also find examples of building Chrome extension with WebLLM in examples/chrome-extension and examples/chrome-extension-webgpu-service-worker. The latter one leverages service worker, so the extension is persistent in the background.

Full OpenAI Compatibility

WebLLM is designed to be fully compatible with OpenAI API. Thus, besides building a simple chatbot, you can also have the following functionalities with WebLLM:

streaming: return output as chunks in real-time in the form of an AsyncGenerator
json-mode: efficiently ensure output is in JSON format, see OpenAI Reference for more.
seed-to-reproduce: use seeding to ensure a reproducible output with fields seed.
function-calling (WIP): function calling with fields tools and tool_choice (with preliminary support); or manual function calling without tools or tool_choice (keeps the most flexibility).

Custom Models

WebLLM works as a companion project of MLC LLM and it supports custom models in MLC format. It reuses the model artifact and builds the flow of MLC LLM. To compile and use your own models with WebLLM, please check out MLC LLM document on how to compile and deploy new model weights and libraries to WebLLM.

Here, we go over the high-level idea. There are two elements of the WebLLM package that enable new models and weight variants.

model: Contains a URL to model artifacts, such as weights and meta-data.
model_lib: A URL to the web assembly library (i.e. wasm file) that contains the executables to accelerate the model computations.

Both are customizable in the WebLLM.

import { CreateMLCEngine } from "@mlc-ai/web-llm";

async main() {
  const appConfig = {
    "model_list": [
      {
        "model": "/url/to/my/llama",
        "model_id": "MyLlama-3b-v1-q4f32_0",
        "model_lib": "/url/to/myllama3b.wasm",
      }
    ],
  };
  // override default
  const chatOpts = {
    "repetition_penalty": 1.01
  };

  // load a prebuilt model
  // with a chat option override and app config
  // under the hood, it will load the model from myLlamaUrl
  // and cache it in the browser cache
  // The chat will also load the model library from "/url/to/myllama3b.wasm",
  // assuming that it is compatible to the model in myLlamaUrl.
  const engine = await CreateMLCEngine(
    "MyLlama-3b-v1-q4f32_0",
    { appConfig }, // engineConfig
    chatOpts,
  );
}

In many cases, we only want to supply the model weight variant, but not necessarily a new model (e.g. NeuralHermes-Mistral can reuse Mistral's model library). For examples of how a model library can be shared by different model variants, see webllm.prebuiltAppConfig.

Build WebLLM Package From Source

NOTE: you don't need to build from source unless you would like to modify the WebLLM package. To use the npm, simply follow Get Started or any of the examples instead.

To build from source, simply run:

npm install
npm run build

Then, to test the effects of your code change in an example, inside examples/get-started/package.json, change from "@mlc-ai/web-llm": "^0.2.73" to "@mlc-ai/web-llm": ../...

Then run:

cd examples/get-started
npm install
npm start

Note that sometimes you would need to switch between file:../.. and ../.. to trigger npm to recognize new changes. In the worst case, you can run:

cd examples/get-started
rm -rf node_modules dist package-lock.json .parcel-cache
npm install
npm start

In case you need to build TVMjs from source

WebLLM's runtime largely depends on TVMjs: https://github.com/apache/tvm/tree/main/web

While it is also available as an npm package: https://www.npmjs.com/package/@mlc-ai/web-runtime, you can build it from source if needed by following the steps below.

Install emscripten. It is an LLVM-based compiler that compiles C/C++ source code to WebAssembly.
- Follow the installation instruction to install the latest emsdk.
- Source emsdk_env.sh by source path/to/emsdk_env.sh, so that emcc is reachable from PATH and the command emcc works.
We can verify the successful installation by trying out emcc terminal.

Note: We recently found that using the latest emcc version may run into issues during runtime. Use ./emsdk install 3.1.56 instead of ./emsdk install latest for now as a workaround. The error may look like
```
Init error, LinkError: WebAssembly.instantiate(): Import #6 module="wasi_snapshot_preview1"
function="proc_exit": function import requires a callable
```
In ./package.json, change from "@mlc-ai/web-runtime": "0.18.0-dev2", to "@mlc-ai/web-runtime": "file:./tvm_home/web",.
Setup necessary environment

Prepare all the necessary dependencies for web build:
```
./scripts/prep_deps.sh
```
In this step, if $TVM_SOURCE_DIR is not defined in the environment, we will execute the following line to build tvmjs dependency:
```
git clone https://github.com/mlc-ai/relax 3rdparty/tvm-unity --recursive
```
This clones the current HEAD of mlc-ai/relax. However, it may not always be the correct branch or commit to clone. To build a specific npm version from source, refer to the version bump PR, which states which branch (i.e. mlc-ai/relax or apache/tvm) and which commit the current WebLLM version depends on. For instance, version 0.2.52, according to its version bump PR https://github.com/mlc-ai/web-llm/pull/521, is built by checking out the following commit https://github.com/apache/tvm/commit/e6476847753c80e054719ac47bc2091c888418b6 in apache/tvm, rather than the HEAD of mlc-ai/relax.

Besides, --recursive is necessary and important. Otherwise, you may encounter errors like fatal error: 'dlpack/dlpack.h' file not found.
Build WebLLM Package
```
npm run build
```
Validate some of the sub-packages

You can then go to the subfolders in examples to validate some of the sub-packages. We use Parcelv2 for bundling. Although Parcel is not very good at tracking parent directory changes sometimes. When you make a change in the WebLLM package, try to edit the package.json of the subfolder and save it, which will trigger Parcel to rebuild.

Acknowledgement

This project is initiated by members from CMU Catalyst, UW SAMPL, SJTU, OctoML, and the MLC community. We would love to continue developing and supporting the open-source ML community.

This project is only possible thanks to the shoulders open-source ecosystems that we stand on. We want to thank the Apache TVM community and developers of the TVM Unity effort. The open-source ML community members made these models publicly available. PyTorch and Hugging Face communities make these models accessible. We would like to thank the teams behind Vicuna, SentencePiece, LLaMA, and Alpaca. We also would like to thank the WebAssembly, Emscripten, and WebGPU communities. Finally, thanks to Dawn and WebGPU developers.

For Tasks:

Click tags to check more tools for each tasks

build chatbots train language models deploy machine learning models accelerate web applications enable privacy

For Jobs:

chatbot developer ai researcher web developer data scientist machine learning engineer

Alternative AI tools for web-llm

Similar Open Source Tools

web-llm

github

: 13.1k

LeanCopilot

Lean Copilot is a tool that enables the use of large language models (LLMs) in Lean for proof automation. It provides features such as suggesting tactics/premises, searching for proofs, and running inference of LLMs. Users can utilize built-in models from LeanDojo or bring their own models to run locally or on the cloud. The tool supports platforms like Linux, macOS, and Windows WSL, with optional CUDA and cuDNN for GPU acceleration. Advanced users can customize behavior using Tactic APIs and Model APIs. Lean Copilot also allows users to bring their own models through ExternalGenerator or ExternalEncoder. The tool comes with caveats such as occasional crashes and issues with premise selection and proof search. Users can get in touch through GitHub Discussions for questions, bug reports, feature requests, and suggestions. The tool is designed to enhance theorem proving in Lean using LLMs.

github

: 1.0k

torchchat

torchchat is a codebase showcasing the ability to run large language models (LLMs) seamlessly. It allows running LLMs using Python in various environments such as desktop, server, iOS, and Android. The tool supports running models via PyTorch, chatting, generating text, running chat in the browser, and running models on desktop/server without Python. It also provides features like AOT Inductor for faster execution, running in C++ using the runner, and deploying and running on iOS and Android. The tool supports popular hardware and OS including Linux, Mac OS, Android, and iOS, with various data types and execution modes available.

github

: 3.5k

storm

STORM is a LLM system that writes Wikipedia-like articles from scratch based on Internet search. While the system cannot produce publication-ready articles that often require a significant number of edits, experienced Wikipedia editors have found it helpful in their pre-writing stage. **Try out our [live research preview](https://storm.genie.stanford.edu/) to see how STORM can help your knowledge exploration journey and please provide feedback to help us improve the system 🙏!**

github

: 17.0k

lmql

LMQL is a programming language designed for large language models (LLMs) that offers a unique way of integrating traditional programming with LLM interaction. It allows users to write programs that combine algorithmic logic with LLM calls, enabling model reasoning capabilities within the context of the program. LMQL provides features such as Python syntax integration, rich control-flow options, advanced decoding techniques, powerful constraints via logit masking, runtime optimization, sync and async API support, multi-model compatibility, and extensive applications like JSON decoding and interactive chat interfaces. The tool also offers library integration, flexible tooling, and output streaming options for easy model output handling.

github

: 3.4k

mflux

MFLUX is a line-by-line port of the FLUX implementation in the Huggingface Diffusers library to Apple MLX. It aims to run powerful FLUX models from Black Forest Labs locally on Mac machines. The codebase is minimal and explicit, prioritizing readability over generality and performance. Models are implemented from scratch in MLX, with tokenizers from the Huggingface Transformers library. Dependencies include Numpy and Pillow for image post-processing. Installation can be done using `uv tool` or classic virtual environment setup. Command-line arguments allow for image generation with specified models, prompts, and optional parameters. Quantization options for speed and memory reduction are available. LoRA adapters can be loaded for fine-tuning image generation. Controlnet support provides more control over image generation with reference images. Current limitations include generating images one by one, lack of support for negative prompts, and some LoRA adapters not working.

github

: 1.3k

VoiceStreamAI

VoiceStreamAI is a Python 3-based server and JavaScript client solution for near-realtime audio streaming and transcription using WebSocket. It employs Huggingface's Voice Activity Detection (VAD) and OpenAI's Whisper model for accurate speech recognition. The system features real-time audio streaming, modular design for easy integration of VAD and ASR technologies, customizable audio chunk processing strategies, support for multilingual transcription, and secure sockets support. It uses a factory and strategy pattern implementation for flexible component management and provides a unit testing framework for robust development.

github

: 547

Bard-API

The Bard API is a Python package that returns responses from Google Bard through the value of a cookie. It is an unofficial API that operates through reverse-engineering, utilizing cookie values to interact with Google Bard for users struggling with frequent authentication problems or unable to authenticate via Google Authentication. The Bard API is not a free service, but rather a tool provided to assist developers with testing certain functionalities due to the delayed development and release of Google Bard's API. It has been designed with a lightweight structure that can easily adapt to the emergence of an official API. Therefore, using it for any other purposes is strongly discouraged. If you have access to a reliable official PaLM-2 API or Google Generative AI API, replace the provided response with the corresponding official code. Check out https://github.com/dsdanielpark/Bard-API/issues/262.

github

: 5.4k

ell

ell is a lightweight, functional prompt engineering framework that treats prompts as programs rather than strings. It provides tools for prompt versioning, monitoring, and visualization, as well as support for multimodal inputs and outputs. The framework aims to simplify the process of prompt engineering for language models.

github

: 4.9k

WindowsAgentArena

Windows Agent Arena (WAA) is a scalable Windows AI agent platform designed for testing and benchmarking multi-modal, desktop AI agents. It provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, enabling testing of agentic AI workflows across various tasks. WAA supports deploying agents at scale using Azure ML cloud infrastructure, allowing parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes.

github

: 147

VMind

VMind is an open-source solution for intelligent visualization, providing an intelligent chart component based on LLM by VisActor. It allows users to create chart narrative works with natural language interaction, edit charts through dialogue, and export narratives as videos or GIFs. The tool is easy to use, scalable, supports various chart types, and offers one-click export functionality. Users can customize chart styles, specify themes, and aggregate data using LLM models. VMind aims to enhance efficiency in creating data visualization works through dialogue-based editing and natural language interaction.

github

: 263

openorch

OpenOrch is a daemon that transforms servers into a powerful development environment, running AI models, containers, and microservices. It serves as a blend of Kubernetes and a language-agnostic backend framework for building applications on fixed-resource setups. Users can deploy AI models and build microservices, managing applications while retaining control over infrastructure and data.

github

: 186

lotus

LOTUS (LLMs Over Tables of Unstructured and Structured Data) is a query engine that provides a declarative programming model and an optimized query engine for reasoning-based query pipelines over structured and unstructured data. It offers a simple and intuitive Pandas-like API with semantic operators for fast and easy LLM-powered data processing. The tool implements a semantic operator programming model, allowing users to write AI-based pipelines with high-level logic and leaving the rest of the work to the query engine. LOTUS supports various semantic operators like sem_map, sem_filter, sem_extract, sem_agg, sem_topk, sem_join, sem_sim_join, and sem_search, enabling users to perform tasks like mapping records, filtering data, aggregating records, and more. The tool also supports different model classes such as LM, RM, and Reranker for language modeling, retrieval, and reranking tasks respectively.

github

: 988

paper-qa

PaperQA is a minimal package for question and answering from PDFs or text files, providing very good answers with in-text citations. It uses OpenAI Embeddings to embed and search documents, and includes a process of embedding docs, queries, searching for top passages, creating summaries, using an LLM to re-score and select relevant summaries, putting summaries into prompt, and generating answers. The tool can be used to answer specific questions related to scientific research by leveraging citations and relevant passages from documents.

github

: 6.6k

KrillinAI

KrillinAI is a video subtitle translation and dubbing tool based on AI large models, featuring speech recognition, intelligent sentence segmentation, professional translation, and one-click deployment of the entire process. It provides a one-stop workflow from video downloading to the final product, empowering cross-language cultural communication with AI. The tool supports multiple languages for input and translation, integrates features like automatic dependency installation, video downloading from platforms like YouTube and Bilibili, high-speed subtitle recognition, intelligent subtitle segmentation and alignment, custom vocabulary replacement, professional-level translation engine, and diverse external service selection for speech and large model services.

github

: 655

LLMUnity

LLM for Unity enables seamless integration of Large Language Models (LLMs) within the Unity engine, allowing users to create intelligent characters for immersive player interactions. The tool supports major LLM models, runs locally without internet access, offers fast inference on CPU and GPU, and is easy to set up with a single line of code. It is free for both personal and commercial use, tested on Unity 2021 LTS, 2022 LTS, and 2023. Users can build multiple AI characters efficiently, use remote servers for processing, and customize model settings for text generation.

github

: 1.0k

For similar tasks

web-llm

github

: 13.1k

agentcloud

AgentCloud is an open-source platform that enables companies to build and deploy private LLM chat apps, empowering teams to securely interact with their data. It comprises three main components: Agent Backend, Webapp, and Vector Proxy. To run this project locally, clone the repository, install Docker, and start the services. The project is licensed under the GNU Affero General Public License, version 3 only. Contributions and feedback are welcome from the community.

github

: 583

zep-python

Zep is an open-source platform for building and deploying large language model (LLM) applications. It provides a suite of tools and services that make it easy to integrate LLMs into your applications, including chat history memory, embedding, vector search, and data enrichment. Zep is designed to be scalable, reliable, and easy to use, making it a great choice for developers who want to build LLM-powered applications quickly and easily.

github

: 60

lollms

LoLLMs Server is a text generation server based on large language models. It provides a Flask-based API for generating text using various pre-trained language models. This server is designed to be easy to install and use, allowing developers to integrate powerful text generation capabilities into their applications.

github

: 287

LlamaIndexTS

LlamaIndex.TS is a data framework for your LLM application. Use your own data with large language models (LLMs, OpenAI ChatGPT and others) in Typescript and Javascript.

github

: 2.5k

semantic-kernel

Semantic Kernel is an SDK that integrates Large Language Models (LLMs) like OpenAI, Azure OpenAI, and Hugging Face with conventional programming languages like C#, Python, and Java. Semantic Kernel achieves this by allowing you to define plugins that can be chained together in just a few lines of code. What makes Semantic Kernel _special_ , however, is its ability to _automatically_ orchestrate plugins with AI. With Semantic Kernel planners, you can ask an LLM to generate a plan that achieves a user's unique goal. Afterwards, Semantic Kernel will execute the plan for the user.

github

: 23.9k

botpress

Botpress is a platform for building next-generation chatbots and assistants powered by OpenAI. It provides a range of tools and integrations to help developers quickly and easily create and deploy chatbots for various use cases.

github

: 13.5k

BotSharp

BotSharp is an open-source machine learning framework for building AI bot platforms. It provides a comprehensive set of tools and components for developing and deploying intelligent virtual assistants. BotSharp is designed to be modular and extensible, allowing developers to easily integrate it with their existing systems and applications. With BotSharp, you can quickly and easily create AI-powered chatbots, virtual assistants, and other conversational AI applications.

github

: 2.6k

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 855

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.3k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 30.6k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675

web-llm

README:

WebLLM

Overview

Key Features

Built-in Models

Jumpstart with Examples

Get Started

Installation

Package Manager

CDN Delivery

Create MLCEngine

Chat Completion

Streaming

Advanced Usage

Using Workers

Dedicated Web Worker

Use Service Worker

Chrome Extension

Full OpenAI Compatibility

Custom Models

Build WebLLM Package From Source

In case you need to build TVMjs from source

Links

Acknowledgement

For Tasks:

For Jobs:

Alternative AI tools for web-llm

Similar Open Source Tools

web-llm

LeanCopilot

torchchat

storm

lmql

mflux

VoiceStreamAI

Bard-API

ell

WindowsAgentArena

VMind

openorch

lotus

paper-qa

KrillinAI

LLMUnity

For similar tasks

web-llm

agentcloud

zep-python

lollms

LlamaIndexTS

semantic-kernel

botpress

BotSharp

For similar jobs

weave

LLMStack

VisionCraft

kaito

PyRIT

tabby

spear

Magick