ai00_server
The all-in-one RWKV runtime box with embed, RAG, AI agents, and more.
Stars: 600
AI00 RWKV Server is an inference API server for the RWKV language model based upon the web-rwkv inference engine. It supports VULKAN parallel and concurrent batched inference and can run on all GPUs that support VULKAN. No need for Nvidia cards!!! AMD cards and even integrated graphics can be accelerated!!! No need for bulky pytorch, CUDA and other runtime environments, it's compact and ready to use out of the box! Compatible with OpenAI's ChatGPT API interface. 100% open source and commercially usable, under the MIT license. If you are looking for a fast, efficient, and easy-to-use LLM API server, then AI00 RWKV Server is your best choice. It can be used for various tasks, including chatbots, text generation, translation, and Q&A.
README:
AI00 RWKV Server is an inference API server for the RWKV language model based upon the web-rwkv inference engine.
It supports Vulkan parallel and concurrent batched inference and can run on all GPUs that support Vulkan. No need for Nvidia cards!!! AMD cards and even integrated graphics can be accelerated!!!
No need for bulky pytorch, CUDA and other runtime environments, it's compact and ready to use out of the box!
Compatible with OpenAI's ChatGPT API interface.
100% open source and commercially usable, under the MIT license.
If you are looking for a fast, efficient, and easy-to-use LLM API server, then AI00 RWKV Server is your best choice. It can be used for various tasks, including chatbots, text generation, translation, and Q&A.
Join the AI00 RWKV Server community now and experience the charm of AI!
QQ Group for communication: 30920262
- Based on the
RWKVmodel, it has high performance and accuracy - Supports
Vulkaninference acceleration, you can enjoy GPU acceleration without the need forCUDA! Supports AMD cards, integrated graphics, and all GPUs that supportVulkan - No need for bulky
pytorch,CUDAand other runtime environments, it's compact and ready to use out of the box! - Compatible with OpenAI's ChatGPT API interface
- Chatbot
- Text generation
- Translation
- Q&A
- Any other tasks that LLM can do
-
Directly download the latest version from Release
-
After downloading the model, place the model in the
assets/models/path, for example,assets/models/RWKV-x060-World-3B-v2-20240228-ctx4096.st -
Optionally modify
assets/configs/Config.tomlfor model configurations like model path, quantization layers, etc. -
Run in the command line
$ ./ai00_rwkv_server
-
Open the browser and visit the WebUI at http://localhost:65530 (https://localhost:65530 if
tlsis enabled)
-
Clone this repository
$ git clone https://github.com/cgisky1980/ai00_rwkv_server.git $ cd ai00_rwkv_server -
After downloading the model, place the model in the
assets/models/path, for example,assets/models/RWKV-x060-World-3B-v2-20240228-ctx4096.st -
Compile
$ cargo build --release
-
After compilation, run
$ cargo run --release
-
Open the browser and visit the WebUI at http://localhost:65530 (https://localhost:65530 if
tlsis enabled)
It only supports Safetensors models with the .st extension now. Models saved with the .pth extension using torch need to be converted before use.
-
(Recommended) Run the python script
convert_safetensors.py:$ python assets/scripts/convert_safetensors.py --input /path/to/model.pth --output /path/to/model.st
Requirements: Python, with
torchandsafetensorsinstalled. -
If you do not want to install python, In the Release you could find an executable called
converter. Run
$ ./converter --input /path/to/model.pth --output /path/to/model.st- If you are building from source, run
$ cargo run --release --package converter -- --input /path/to/model.pth --output /path/to/model.st- Just like the steps mentioned above, place the model in the
.stmodel in theassets/models/path and modify the model path inassets/configs/Config.toml
-
--config: Configure file path (default:assets/configs/Config.toml) -
--ip: The IP address the server is bound to -
--port: Running port
The API service starts at port 65530, and the data input and output format follow the Openai API specification.
Note that some APIs like chat and completions have additional optional fields for advanced functionalities. Visit http://localhost:65530/api-docs for API schema.
/api/oai/v1/models/api/oai/models/api/oai/v1/chat/completions/api/oai/chat/completions/api/oai/v1/completions/api/oai/completions/api/oai/v1/embeddings/api/oai/embeddings
The following is an out-of-box example of Ai00 API invocations in Python:
import openai
class Ai00:
def __init__(self,model="model",port=65530,api_key="JUSTSECRET_KEY") :
openai.api_base = f"http://127.0.0.1:{port}/api/oai"
openai.api_key = api_key
self.ctx = []
self.params = {
"system_name": "System",
"user_name": "User",
"assistant_name": "Assistant",
"model": model,
"max_tokens": 4096,
"top_p": 0.6,
"temperature": 1,
"presence_penalty": 0.3,
"frequency_penalty": 0.3,
"half_life": 400,
"stop": ['\x00','\n\n']
}
def set_params(self,**kwargs):
self.params.update(kwargs)
def clear_ctx(self):
self.ctx = []
def get_ctx(self):
return self.ctx
def continuation(self, message):
response = openai.Completion.create(
model=self.params['model'],
prompt=message,
max_tokens=self.params['max_tokens'],
half_life=self.params['half_life'],
top_p=self.params['top_p'],
temperature=self.params['temperature'],
presence_penalty=self.params['presence_penalty'],
frequency_penalty=self.params['frequency_penalty'],
stop=self.params['stop']
)
result = response.choices[0].text
return result
def append_ctx(self,role,content):
self.ctx.append({
"role": role,
"content": content
})
def send_message(self, message,role="user"):
self.ctx.append({
"role": role,
"content": message
})
result = openai.ChatCompletion.create(
model=self.params['model'],
messages=self.ctx,
names={
"system": self.params['system_name'],
"user": self.params['user_name'],
"assistant": self.params['assistant_name']
},
max_tokens=self.params['max_tokens'],
half_life=self.params['half_life'],
top_p=self.params['top_p'],
temperature=self.params['temperature'],
presence_penalty=self.params['presence_penalty'],
frequency_penalty=self.params['frequency_penalty'],
stop=self.params['stop']
)
result = result.choices[0].message['content']
self.ctx.append({
"role": "assistant",
"content": result
})
return result
ai00 = Ai00()
ai00.set_params(
max_tokens = 4096,
top_p = 0.55,
temperature = 2,
presence_penalty = 0.3,
frequency_penalty = 0.8,
half_life = 400,
stop = ['\x00','\n\n']
)
print(ai00.send_message("how are you?"))
print(ai00.send_message("me too!"))
print(ai00.get_ctx())
ai00.clear_ctx()
print(ai00.continuation("i like"))Since v0.5, Ai00 has a unique feature called BNF sampling. BNF forces the model to output in specified formats (e.g., JSON or markdown with specified fields) by limiting the possible next tokens the model can choose from.
Here is an example BNF for JSON with fields "name", "age" and "job":
start ::= json_object;
json_object ::= "{\n" object_members "\n}";
object_members ::= json_member | json_member ",\n" object_members;
json_member ::= "\t" json_key ": " json_value;
json_key ::= '"' "name" '"' | '"' "age" '"' | '"' "job" '"';
json_value ::= json_string | json_number;
json_string ::= '"'content'"';
content ::= #"\\w*";
json_number ::= positive_digit digits|'0';
digits ::= digit|digit digits;
digit ::= '0'|positive_digit;
positive_digit::="1"|"2"|"3"|"4"|"5"|"6"|"7"|"8"|"9";
- [x] Support for
text_completionsandchat_completions - [x] Support for sse push
- [x] Integrate basic front-end
- [x] Parallel inference via
batch serve - [x] Support for
int8quantization - [x] Support for
NF4quantization - [x] Support for
LoRAmodel - [x] Support for tuned initial states
- [ ] Hot loading and switching of
LoRAmodel - [x] Hot loading and switching of tuned initial states
- [x] BNF sampling
We are always looking for people interested in helping us improve the project. If you are interested in any of the following, please join us!
- 💀Writing code
- 💬Providing feedback
- 🔆Proposing ideas or needs
- 🔍Testing new features
- ✏Translating documentation
- 📣Promoting the project
- 🏅Anything else that would be helpful to us
No matter your skill level, we welcome you to join us. You can join us in the following ways:
- Join our Discord channel
- Join our QQ group
- Submit issues or pull requests on GitHub
- Leave feedback on our website
We can't wait to work with you to make this project better! We hope the project is helpful to you!
Thank you to these awesome individuals who are insightful and outstanding for their support and selfless dedication to the project!
|
顾真牛 📖 💻 🖋 🎨 🧑🏫 |
研究社交 💻 💡 🤔 🚧 👀 📦 |
josc146 🐛 💻 🤔 🔧 |
l15y 🔧 🔌 💻 |
Cahya Wirawan 🐛 |
yuunnn_w 📖 |
longzou 💻 🛡️ |
|
luoqiqi 📖 |
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for ai00_server
Similar Open Source Tools
ai00_server
AI00 RWKV Server is an inference API server for the RWKV language model based upon the web-rwkv inference engine. It supports VULKAN parallel and concurrent batched inference and can run on all GPUs that support VULKAN. No need for Nvidia cards!!! AMD cards and even integrated graphics can be accelerated!!! No need for bulky pytorch, CUDA and other runtime environments, it's compact and ready to use out of the box! Compatible with OpenAI's ChatGPT API interface. 100% open source and commercially usable, under the MIT license. If you are looking for a fast, efficient, and easy-to-use LLM API server, then AI00 RWKV Server is your best choice. It can be used for various tasks, including chatbots, text generation, translation, and Q&A.
hezar
Hezar is an all-in-one AI library designed specifically for the Persian community. It brings together various AI models and tools, making it easy to use AI with just a few lines of code. The library seamlessly integrates with Hugging Face Hub, offering a developer-friendly interface and task-based model interface. In addition to models, Hezar provides tools like word embeddings, tokenizers, feature extractors, and more. It also includes supplementary ML tools for deployment, benchmarking, and optimization.
Scrapegraph-ai
ScrapeGraphAI is a Python library that uses Large Language Models (LLMs) and direct graph logic to create web scraping pipelines for websites, documents, and XML files. It allows users to extract specific information from web pages by providing a prompt describing the desired data. ScrapeGraphAI supports various LLMs, including Ollama, OpenAI, Gemini, and Docker, enabling users to choose the most suitable model for their needs. The library provides a user-friendly interface through its `SmartScraper` class, which simplifies the process of building and executing scraping pipelines. ScrapeGraphAI is open-source and available on GitHub, with extensive documentation and examples to guide users. It is particularly useful for researchers and data scientists who need to extract structured data from web pages for analysis and exploration.
candle-vllm
Candle-vllm is an efficient and easy-to-use platform designed for inference and serving local LLMs, featuring an OpenAI compatible API server. It offers a highly extensible trait-based system for rapid implementation of new module pipelines, streaming support in generation, efficient management of key-value cache with PagedAttention, and continuous batching. The tool supports chat serving for various models and provides a seamless experience for users to interact with LLMs through different interfaces.
LocalAGI
LocalAGI is a powerful, self-hostable AI Agent platform that allows you to design AI automations without writing code. It provides a complete drop-in replacement for OpenAI's Responses APIs with advanced agentic capabilities. With LocalAGI, you can create customizable AI assistants, automations, chat bots, and agents that run 100% locally, without the need for cloud services or API keys. The platform offers features like no-code agents, web-based interface, advanced agent teaming, connectors for various platforms, comprehensive REST API, short & long-term memory capabilities, planning & reasoning, periodic tasks scheduling, memory management, multimodal support, extensible custom actions, fully customizable models, observability, and more.
instructor
Instructor is a Python library that makes it a breeze to work with structured outputs from large language models (LLMs). Built on top of Pydantic, it provides a simple, transparent, and user-friendly API to manage validation, retries, and streaming responses. Get ready to supercharge your LLM workflows!
Webscout
Webscout is an all-in-one Python toolkit for web search, AI interaction, digital utilities, and more. It provides access to diverse search engines, cutting-edge AI models, temporary communication tools, media utilities, developer helpers, and powerful CLI interfaces through a unified library. With features like comprehensive search leveraging Google and DuckDuckGo, AI powerhouse for accessing various AI models, YouTube toolkit for video and transcript management, GitAPI for GitHub data extraction, Tempmail & Temp Number for privacy, Text-to-Speech conversion, GGUF conversion & quantization, SwiftCLI for CLI interfaces, LitPrinter for styled console output, LitLogger for logging, LitAgent for user agent generation, Text-to-Image generation, Scout for web parsing and crawling, Awesome Prompts for specialized tasks, Weather Toolkit, and AI Search Providers.
rust-genai
genai is a multi-AI providers library for Rust that aims to provide a common and ergonomic single API to various generative AI providers such as OpenAI, Anthropic, Cohere, Ollama, and Gemini. It focuses on standardizing chat completion APIs across major AI services, prioritizing ergonomics and commonality. The library initially focuses on text chat APIs and plans to expand to support images, function calling, and more in the future versions. Version 0.1.x will have breaking changes in patches, while version 0.2.x will follow semver more strictly. genai does not provide a full representation of a given AI provider but aims to simplify the differences at a lower layer for ease of use.
pocketgroq
PocketGroq is a tool that provides advanced functionalities for text generation, web scraping, web search, and AI response evaluation. It includes features like an Autonomous Agent for answering questions, web crawling and scraping capabilities, enhanced web search functionality, and flexible integration with Ollama server. Users can customize the agent's behavior, evaluate responses using AI, and utilize various methods for text generation, conversation management, and Chain of Thought reasoning. The tool offers comprehensive methods for different tasks, such as initializing RAG, error handling, and tool management. PocketGroq is designed to enhance development processes and enable the creation of AI-powered applications with ease.
crush
Crush is a versatile tool designed to enhance coding workflows in your terminal. It offers support for multiple LLMs, allows for flexible switching between models, and enables session-based work management. Crush is extensible through MCPs and works across various operating systems. It can be installed using package managers like Homebrew and NPM, or downloaded directly. Crush supports various APIs like Anthropic, OpenAI, Groq, and Google Gemini, and allows for customization through environment variables. The tool can be configured locally or globally, and supports LSPs for additional context. Crush also provides options for ignoring files, allowing tools, and configuring local models. It respects `.gitignore` files and offers logging capabilities for troubleshooting and debugging.
Webscout
WebScout is a versatile tool that allows users to search for anything using Google, DuckDuckGo, and phind.com. It contains AI models, can transcribe YouTube videos, generate temporary email and phone numbers, has TTS support, webai (terminal GPT and open interpreter), and offline LLMs. It also supports features like weather forecasting, YT video downloading, temp mail and number generation, text-to-speech, advanced web searches, and more.
LightRAG
LightRAG is a repository hosting the code for LightRAG, a system that supports seamless integration of custom knowledge graphs, Oracle Database 23ai, Neo4J for storage, and multiple file types. It includes features like entity deletion, batch insert, incremental insert, and graph visualization. LightRAG provides an API server implementation for RESTful API access to RAG operations, allowing users to interact with it through HTTP requests. The repository also includes evaluation scripts, code for reproducing results, and a comprehensive code structure.
sonarqube-mcp-server
The SonarQube MCP Server is a Model Context Protocol (MCP) server that enables seamless integration with SonarQube Server or Cloud for code quality and security. It supports the analysis of code snippets directly within the agent context. The server provides various tools for analyzing code, managing issues, accessing metrics, and interacting with SonarQube projects. It also supports advanced features like dependency risk analysis, enterprise portfolio management, and system health checks. The server can be configured for different transport modes, proxy settings, and custom certificates. Telemetry data collection can be disabled if needed.
avante.nvim
avante.nvim is a Neovim plugin that emulates the behavior of the Cursor AI IDE, providing AI-driven code suggestions and enabling users to apply recommendations to their source files effortlessly. It offers AI-powered code assistance and one-click application of suggested changes, streamlining the editing process and saving time. The plugin is still in early development, with functionalities like setting API keys, querying AI about code, reviewing suggestions, and applying changes. Key bindings are available for various actions, and the roadmap includes enhancing AI interactions, stability improvements, and introducing new features for coding tasks.
generative-ai-python
The Google AI Python SDK is the easiest way for Python developers to build with the Gemini API. The Gemini API gives you access to Gemini models created by Google DeepMind. Gemini models are built from the ground up to be multimodal, so you can reason seamlessly across text, images, and code.
For similar tasks
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
onnxruntime-genai
ONNX Runtime Generative AI is a library that provides the generative AI loop for ONNX models, including inference with ONNX Runtime, logits processing, search and sampling, and KV cache management. Users can call a high level `generate()` method, or run each iteration of the model in a loop. It supports greedy/beam search and TopP, TopK sampling to generate token sequences, has built in logits processing like repetition penalties, and allows for easy custom scoring.
jupyter-ai
Jupyter AI connects generative AI with Jupyter notebooks. It provides a user-friendly and powerful way to explore generative AI models in notebooks and improve your productivity in JupyterLab and the Jupyter Notebook. Specifically, Jupyter AI offers: * An `%%ai` magic that turns the Jupyter notebook into a reproducible generative AI playground. This works anywhere the IPython kernel runs (JupyterLab, Jupyter Notebook, Google Colab, Kaggle, VSCode, etc.). * A native chat UI in JupyterLab that enables you to work with generative AI as a conversational assistant. * Support for a wide range of generative model providers, including AI21, Anthropic, AWS, Cohere, Gemini, Hugging Face, NVIDIA, and OpenAI. * Local model support through GPT4All, enabling use of generative AI models on consumer grade machines with ease and privacy.
khoj
Khoj is an open-source, personal AI assistant that extends your capabilities by creating always-available AI agents. You can share your notes and documents to extend your digital brain, and your AI agents have access to the internet, allowing you to incorporate real-time information. Khoj is accessible on Desktop, Emacs, Obsidian, Web, and Whatsapp, and you can share PDF, markdown, org-mode, notion files, and GitHub repositories. You'll get fast, accurate semantic search on top of your docs, and your agents can create deeply personal images and understand your speech. Khoj is self-hostable and always will be.
langchain_dart
LangChain.dart is a Dart port of the popular LangChain Python framework created by Harrison Chase. LangChain provides a set of ready-to-use components for working with language models and a standard interface for chaining them together to formulate more advanced use cases (e.g. chatbots, Q&A with RAG, agents, summarization, extraction, etc.). The components can be grouped into a few core modules: * **Model I/O:** LangChain offers a unified API for interacting with various LLM providers (e.g. OpenAI, Google, Mistral, Ollama, etc.), allowing developers to switch between them with ease. Additionally, it provides tools for managing model inputs (prompt templates and example selectors) and parsing the resulting model outputs (output parsers). * **Retrieval:** assists in loading user data (via document loaders), transforming it (with text splitters), extracting its meaning (using embedding models), storing (in vector stores) and retrieving it (through retrievers) so that it can be used to ground the model's responses (i.e. Retrieval-Augmented Generation or RAG). * **Agents:** "bots" that leverage LLMs to make informed decisions about which available tools (such as web search, calculators, database lookup, etc.) to use to accomplish the designated task. The different components can be composed together using the LangChain Expression Language (LCEL).
danswer
Danswer is an open-source Gen-AI Chat and Unified Search tool that connects to your company's docs, apps, and people. It provides a Chat interface and plugs into any LLM of your choice. Danswer can be deployed anywhere and for any scale - on a laptop, on-premise, or to cloud. Since you own the deployment, your user data and chats are fully in your own control. Danswer is MIT licensed and designed to be modular and easily extensible. The system also comes fully ready for production usage with user authentication, role management (admin/basic users), chat persistence, and a UI for configuring Personas (AI Assistants) and their Prompts. Danswer also serves as a Unified Search across all common workplace tools such as Slack, Google Drive, Confluence, etc. By combining LLMs and team specific knowledge, Danswer becomes a subject matter expert for the team. Imagine ChatGPT if it had access to your team's unique knowledge! It enables questions such as "A customer wants feature X, is this already supported?" or "Where's the pull request for feature Y?"
infinity
Infinity is an AI-native database designed for LLM applications, providing incredibly fast full-text and vector search capabilities. It supports a wide range of data types, including vectors, full-text, and structured data, and offers a fused search feature that combines multiple embeddings and full text. Infinity is easy to use, with an intuitive Python API and a single-binary architecture that simplifies deployment. It achieves high performance, with 0.1 milliseconds query latency on million-scale vector datasets and up to 15K QPS.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.
