llm-rag-workshop
Chat with your own data - LLM+RAG workshop
Stars: 166
The LLM RAG Workshop repository provides a workshop on using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to generate and understand text in a human-like manner. It includes instructions on setting up the environment, indexing Zoomcamp FAQ documents, creating a Q&A system, and using OpenAI for generation based on retrieved information. The repository focuses on enhancing language model responses with retrieved information from external sources, such as document databases or search engines, to improve factual accuracy and relevance of generated text.
README:
The content here is based on LLM Zoomcamp - a free course about the engineering aspects of LLMs. The course has just started, so you can still enroll.
If you want to run a similar workshop in your company, contact me at [email protected].
For this workshop, you need:
- Docker
- Python 3 (we use 3.10)
- GitHub account + VS Code (optional - if you want to use codespaces, already contains Docker and Python)
- OpenAI account (optional - possible to replace with a local LLM)
- HuggingFace account (optional - if you want to access some open-source LLMs in the extended version)
Workshop plan:
- LLM and RAG (theory)
- Preparing the environment (codespaces)
- Installing pipenv and direnv
- Running ElasticSearch
- Indexing and retrieving documents with ElasticSearch
- Generating the answers with OpenAI
Extended workshop:
- Creating a web interface with Streamlit
- Running LLMs locally
- Replacing OpenAI with Ollama
- Running Ollama and ElasticSearch in Docker-Compose
- Using Open-Source LLMs from HuggingFace Hub
I generated these summaries with ChatGPT.
LLM (Large Language Model):
- Purpose: Generate and understand text in a human-like manner.
- Structure: Built using deep learning techniques, especially Transformer architectures.
- Size: Characterized by having a vast number of parameters (billions to trillions), enabling nuanced understanding and generation.
- Training: Pre-trained on large datasets of text to learn a broad understanding of language, then fine-tuned for specific tasks.
- Applications: Used in chatbots, translation services, content creation, and more.
RAG (Retrieval-Augmented Generation):
- Purpose: Enhance language model responses with information retrieved from external sources.
- How It Works: Combines a language model with a retrieval system, typically a document database or search engine.
- Process:
  - Queries an external knowledge source based on input.
  - Integrates retrieved information into the generation process to provide contextually rich and accurate responses.
- Advantages: Improves the factual accuracy and relevance of generated text.
- Use Cases: Fact-checking, knowledge-intensive tasks like medical diagnosis assistance, and detailed content creation where accuracy is crucial.
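The retrieve-then-generate process described above can be sketched in a few lines of plain Python. This is a toy illustration only: the in-memory FAQ list, the word-overlap scoring, and the names tokenize, toy_retrieve and toy_prompt are assumptions made for this example. The workshop itself uses Elasticsearch for retrieval and an LLM for generation.

```python
# Toy illustration of the RAG flow: retrieve first, then build the prompt.
# The data and the scoring are invented for this sketch.
import string

FAQ = [
    {"question": "Can I still join the course after the start date?",
     "text": "Yes, you can still submit the homeworks."},
    {"question": "How do I run Kafka?",
     "text": "Use the docker-compose file from the repository."},
]


def tokenize(text):
    """Lowercase, strip punctuation, and split into a set of words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


def toy_retrieve(query, docs, max_results=1):
    """Rank documents by how many words they share with the query."""
    query_words = tokenize(query)
    ranked = sorted(
        docs,
        key=lambda d: len(query_words & tokenize(d["question"] + " " + d["text"])),
        reverse=True,
    )
    return ranked[:max_results]


def toy_prompt(query, docs):
    """Place the retrieved text next to the question, ready to send to a model."""
    context = "\n\n".join(f"Q: {d['question']}\nA: {d['text']}" for d in docs)
    return f"QUESTION: {query}\n\nCONTEXT:\n{context}"


retrieved = toy_retrieve("can I join the course late?", FAQ)
print(toy_prompt("can I join the course late?", retrieved))
```

Without the retrieval step, the model would only see the question; with it, the model also sees the FAQ entry that most likely contains the answer.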
Use ChatGPT to show the difference between generating and RAG.
What we will do:
- Index Zoomcamp FAQ documents
- Create a Q&A system for answering questions about these documents
We will use codespaces, but the workshop works in any environment with Docker and Python 3.
In codespaces:
- Create a repository, e.g. "llm-zoomcamp-rag-workshop"
- Start a codespace there
We will use pipenv for dependency management. It's optional but strongly recommended if you're doing the workshop locally, and not on codespaces.
Let's install it:
pip install pipenv
Install the packages:
pipenv install tqdm jupyter openai elasticsearch
If you use OpenAI, we need the key:
- Sign up at https://platform.openai.com/ if you don't have an account
- Go to https://platform.openai.com/api-keys
- Create a new key, copy it
Let's put the key into an environment variable:
export OPENAI_API_KEY="TOKEN"
You can also use GitHub Codespaces secrets for better secret management.
If you don't use codespaces, you can do it with direnv:
sudo apt update
sudo apt install direnv
direnv hook bash >> ~/.bashrc
Create / edit .envrc in your project directory:
export OPENAI_API_KEY='sk-proj-key'
Make sure .envrc is in your .gitignore - never commit it!
echo ".envrc" >> .gitignore
Allow direnv to run:
direnv allow
Start a new terminal, and there run jupyter:
pipenv run jupyter notebook
In another terminal, run elasticsearch with docker:
docker run -it \
--rm \
--name elasticsearch \
-m 2G \
-p 9200:9200 \
-p 9300:9300 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:8.4.3
Verify that ES is running
curl http://localhost:9200
You should get something like this:
{
"name" : "63d0133fc451",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "AKW1gxdRTuSH8eLuxbqH6A",
"version" : {
"number" : "8.4.3",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "42f05b9372a9a4a470db3b52817899b99a76ee73",
"build_date" : "2022-10-04T07:17:24.662462378Z",
"build_snapshot" : false,
"lucene_version" : "9.3.0",
"minimum_wire_compatibility_version" : "7.17.0",
"minimum_index_compatibility_version" : "7.0.0"
},
"tagline" : "You Know, for Search"
}
RAG consists of multiple components, and the first is R - "retrieval". For retrieval, we need a search system. In our example, we will use elasticsearch for searching.
Create a notebook "elastic-rag" or something like that. We will use it for our experiments.
First, we need to download the docs:
wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json
Let's load the documents
import json

with open('./documents.json', 'rt') as f_in:
    documents_file = json.load(f_in)

documents = []

for course in documents_file:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
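To make the reshaping concrete, here is the same flattening applied to a tiny in-memory sample. The sample records are invented for illustration and only mimic the nesting that the loop above expects (a list of courses, each with a list of documents). flatten_documents is a hypothetical helper name, not part of the workshop code.

```python
def flatten_documents(documents_file):
    """Copy the course name onto each document and return one flat list."""
    flat = []
    for course in documents_file:
        for doc in course["documents"]:
            # Build a new dict so the input structure is left unchanged
            flat.append({**doc, "course": course["course"]})
    return flat


# Invented sample data with the same nesting as documents.json
sample = [
    {
        "course": "data-engineering-zoomcamp",
        "documents": [
            {"section": "General", "question": "Q1", "text": "A1"},
            {"section": "General", "question": "Q2", "text": "A2"},
        ],
    },
    {
        "course": "mlops-zoomcamp",
        "documents": [{"section": "Setup", "question": "Q3", "text": "A3"}],
    },
]

flat = flatten_documents(sample)
print(len(flat))          # 3
print(flat[2]["course"])  # mlops-zoomcamp
```

Each flat record now carries its own course field, which is what makes the course filter in the search step possible.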
Now we'll index these documents with Elasticsearch.
First initiate the connection and check that it's working:
from elasticsearch import Elasticsearch
es = Elasticsearch("http://localhost:9200")
es.info()
You should see the same response as earlier with curl.
Before we can index the documents, we need to create an index (an index in Elasticsearch is like a table in a "usual" database):
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"}
        }
    }
}
index_name = "course-questions"
response = es.indices.create(index=index_name, body=index_settings)
response
Now we're ready to index all the documents:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es.index(index=index_name, document=doc)
Now let's retrieve the documents that match a user question:
user_question = "How do I join the course after it has started?"

search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}
This query:
- Retrieves top 5 matching documents.
- Searches in the "question", "text", "section" fields, prioritizing "question" using a multi_match query with type best_fields (see the Elasticsearch multi_match documentation for more information).
- Matches user query "How do I join the course after it has started?".
- Shows results only for the "data-engineering-zoomcamp" course.
Let's see the output:
response = es.search(index=index_name, body=search_query)
for hit in response['hits']['hits']:
    doc = hit['_source']
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")
We can make it cleaner by putting it into a function:
def retrieve_documents(query, index_name="course-questions", max_results=5):
    es = Elasticsearch("http://localhost:9200")

    search_query = {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es.search(index=index_name, body=search_query)
    documents = [hit['_source'] for hit in response['hits']['hits']]
    return documents
And print the answers:
user_question = "How do I join the course after it has started?"
response = retrieve_documents(user_question)
for doc in response:
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")
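The retrieve_documents function above hardcodes the course in its filter. As a sketch, the query construction can be split into its own function so that the course becomes a parameter. build_search_query is a hypothetical helper name introduced here, not part of the workshop code; the query body it returns has the same structure as the one shown earlier.

```python
def build_search_query(question, course, max_results=5):
    """Return the Elasticsearch query body used above, with the course as a parameter."""
    return {
        "size": max_results,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": question,
                        # "question^3" boosts matches in the question field by a factor of 3
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # The filter restricts the results without changing the relevance score
                "filter": {"term": {"course": course}},
            }
        },
    }


query_body = build_search_query(
    "How do I join the course after it has started?",
    course="machine-learning-zoomcamp",
    max_results=3,
)
print(query_body["query"]["bool"]["filter"])
```

The result could then be passed as body=query_body to es.search, in the same way as inside retrieve_documents.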
Now let's do the "G" part - generation based on the "R" output.
Today we will use OpenAI (it's the easiest to get started with). In the course, we will learn how to use open-source models.
Make sure we have the SDK installed and the key is set.
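A quick way to check the key from the notebook is sketched below. It relies on the fact that the OpenAI client reads OPENAI_API_KEY from the environment when no key is passed explicitly. has_openai_key is a hypothetical helper name used only for this example.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty in the given mapping."""
    if env is None:
        env = os.environ
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not has_openai_key():
    print("OPENAI_API_KEY is not set - export it or add it to .envrc first")
```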
This is how we communicate with GPT-4o:
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "The course already started. Can I still join?"}]
)
print(response.choices[0].message.content)
Now let's build a prompt. First, we put all the documents together in one string:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()
context_docs = retrieve_documents(user_question)
context_result = ""
for doc in context_docs:
    doc_str = context_template.format(**doc)
    context_result += ("\n\n" + doc_str)
context = context_result.strip()
print(context)
Now build the actual prompt:
prompt = f"""
You're a course teaching assistant. Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Only use the facts from the CONTEXT. If the CONTEXT doesn't contain the answer, return "NONE"
QUESTION: {user_question}
CONTEXT:
{context}
""".strip()
Now we can put it to OpenAI API:
response = client.chat.completions.create(
model="gpt-4o",
messages=[{"role": "user", "content": prompt}]
)
answer = response.choices[0].message.content
answer
Note: there are system and user prompts, we can also experiment with them to make the design of the prompt cleaner.
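As one possible way to split the prompt, the instructions can go into a system message and the question plus context into a user message. This is a sketch, not the workshop's final design: SYSTEM_PROMPT and build_messages are names introduced for this example, and the wording of the system prompt is adapted from the prompt above.

```python
SYSTEM_PROMPT = (
    "You're a course teaching assistant. "
    "Answer the user QUESTION based on CONTEXT - the documents retrieved "
    "from our FAQ database. Only use the facts from the CONTEXT."
)


def build_messages(user_question, context):
    """Put the instructions in a system message and the data in a user message."""
    user_content = f"QUESTION: {user_question}\n\nCONTEXT:\n{context}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "The course already started. Can I still join?",
    "Section: General\nQuestion: Can I join late?\nAnswer: Yes.",
)
for message in messages:
    print(message["role"])
```

The resulting list can be passed as messages=messages to client.chat.completions.create in place of the single user message.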
Now let's put everything together in one function:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()
prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.
QUESTION: {user_question}
CONTEXT:
{context}
""".strip()
def build_context(documents):
    context_result = ""

    for doc in documents:
        doc_str = context_template.format(**doc)
        context_result += ("\n\n" + doc_str)

    return context_result.strip()


def build_prompt(user_question, documents):
    context = build_context(documents)
    prompt = prompt_template.format(
        user_question=user_question,
        context=context
    )
    return prompt


def ask_openai(prompt, model="gpt-4o"):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    return answer


def qa_bot(user_question):
    context_docs = retrieve_documents(user_question)
    prompt = build_prompt(user_question, context_docs)
    answer = ask_openai(prompt)
    return answer
Now we can ask it different questions
qa_bot("I'm getting invalid reference format: repository name must be lowercase")
qa_bot("I can't connect to postgres port 5432, my password doesn't work")
qa_bot("how can I run kafka?")
Possible next steps:
- Use open-source LLMs
- Build an interface, e.g. with streamlit
- Deploy it
For the extended version of this workshop, we will:
- Build a UI with streamlit
- Experiment with open-source LLMs and replace OpenAI
We can build simple UI apps with streamlit. Let's install it
pipenv install streamlit
If you want to learn more about streamlit, you can use this material.
We need a simple form with
- Input box for the prompt
- Button
- Text field to display the response (in markdown)
import streamlit as st


def qa_bot(prompt):
    # Placeholder that simulates a slow model call
    import time
    time.sleep(2)
    return f"Response for the prompt: {prompt}"


def main():
    st.title("DTC Q&A System")

    with st.form(key='rag_form'):
        prompt = st.text_input("Enter your prompt")
        response_placeholder = st.empty()
        submit_button = st.form_submit_button(label='Submit')

    if submit_button:
        response_placeholder.markdown("Loading...")
        response = qa_bot(prompt)
        response_placeholder.markdown(response)


if __name__ == "__main__":
    main()
Save the code as app.py and run it:
streamlit run app.py
Now we can replace the function qa_bot. Let's create a file rag.py with the content from the notebook.
You can see the content of the file here.
Also, we add a special dropdown menu to select the course:
courses = [
    "data-engineering-zoomcamp",
    "machine-learning-zoomcamp",
    "mlops-zoomcamp"
]
zoomcamp_option = st.selectbox("Select a zoomcamp", courses)
There are many open-source LLMs. We will use two platforms:
- Ollama for running on CPU
- HuggingFace for running on GPU
The easiest way to run an LLM without a GPU is using Ollama
Note that the 2 core codespaces instance is not enough. For this part it's better to create a separate instance with 4 cores.
You can also run it locally. I have 8 cores on my laptop, so it's faster than doing it on codespaces.
Installing for Linux:
curl -fsSL https://ollama.com/install.sh | sh
Installing for other OS - check the Ollama website. I successfully tested it on Windows too.
Let's run it:
# in one terminal
ollama start
# in another terminal
ollama run phi3
Prompt example:
Question: I just discovered the couse. can i still enrol
Context:
Course - Can I still join the course after the start date? Yes, even if you don't register, you're still eligible to submit the homeworks. Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.
Environment - Is Python 3.9 still the recommended version to use in 2024? Yes, for simplicity (of troubleshooting against the recorded videos) and stability. [source] But Python 3.10 and 3.11 should work fine.
How can we contribute to the course? Star the repo! Share it with friends if you find it useful ❣️ Create a PR if you see you can improve the text or the structure of the repository.
Answer:
Ollama's API is compatible with OpenAI's python client, so we can use it by changing only a few lines of code:
from openai import OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1/',
    api_key='ollama',
)

response = client.chat.completions.create(
    model='phi3',
    messages=[{"role": "user", "content": prompt}]
)
response.choices[0].message.content
That's it! Now let's put everything in Docker
We already know how to run Elasticsearch in Docker:
docker run -it \
--rm \
--name elasticsearch \
-p 9200:9200 \
-p 9300:9300 \
-e "discovery.type=single-node" \
-e "xpack.security.enabled=false" \
docker.elastic.co/elasticsearch/elasticsearch:8.4.3
This is how we run Ollama in Docker:
docker run -it \
--rm \
--name ollama \
-v ollama:/root/.ollama \
-p 11434:11434 \
ollama/ollama
When we run it, we need to log in to the container to download the phi3 model:
docker exec -it ollama bash
ollama pull phi3
After pulling the model, we can query it with OpenAI's python package. Because we map a volume (-v ollama:/root/.ollama), the model files are stored in that volume and persist across runs, even though --rm removes the container itself.
Let's now combine them into one docker-compose file.
Create a docker-compose.yaml file with both Ollama and Elasticsearch.
And now run it:
docker-compose up
Ollama can run locally on a CPU. But there are many models that require a GPU.
For running them, we will use Colab or another notebook platform with a GPU (for example, SaturnCloud). Let's stop our codespace for now.
In Colab, you need to enable GPU:
- Create a notebook: https://colab.research.google.com/#create=true
- Runtime -> Change runtime type -> T4 GPU
- Run !nvidia-smi to verify you have a GPU
Now we need to install the dependencies:
!pip install -U transformers accelerate bitsandbytes
Also, it's tricky to run Elasticsearch on Colab, so we will replace it with minsearch - a simple in-memory search library:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Let's get the data and create an index:
import requests
docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

import minsearch

index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)
index.fit(documents)
Searching with minsearch:
query = "I just discovered the course, can I still join?"
filter_dict = {"course": "data-engineering-zoomcamp"}
boost_dict = {"question": 3}
index.search(query, filter_dict, boost_dict, num_results=5)
Let's replace our search function:
def retrieve_documents(query, max_results=5):
    filter_dict = {"course": "data-engineering-zoomcamp"}
    boost_dict = {"question": 3}
    # Pass max_results through instead of a hardcoded 5
    return index.search(query, filter_dict, boost_dict, num_results=max_results)
We will use Google's FLAN T5 model: google/flan-t5-xl.
Downloading and loading it:
from transformers import T5ForConditionalGeneration, T5Tokenizer
model_name = "google/flan-t5-xl"
model = T5ForConditionalGeneration.from_pretrained(model_name, device_map="auto")
tokenizer = T5Tokenizer.from_pretrained(model_name, legacy=False)
tokenizer.model_max_length = 4096
Using it:
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
Let's put it to a function:
def llm(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(input_ids)
    result = tokenizer.decode(outputs[0])
    return result
Everything together:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()
prompt_template = """
You're a course teaching assistant.
Answer the user QUESTION based on CONTEXT - the documents retrieved from our FAQ database.
Don't use other information outside of the provided CONTEXT.
QUESTION: {user_question}
CONTEXT:
{context}
""".strip()
def build_context(documents):
    context_result = ""

    for doc in documents:
        doc_str = context_template.format(**doc)
        context_result += ("\n\n" + doc_str)

    return context_result.strip()


def build_prompt(user_question, documents):
    context = build_context(documents)
    prompt = prompt_template.format(
        user_question=user_question,
        context=context
    )
    return prompt


def qa_bot(user_question):
    context_docs = retrieve_documents(user_question)
    prompt = build_prompt(user_question, context_docs)
    answer = llm(prompt)
    return answer
Making the answers longer:
def llm(prompt, generate_params=None):
    if generate_params is None:
        generate_params = {}

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
    outputs = model.generate(
        input_ids,
        max_length=generate_params.get("max_length", 100),
        num_beams=generate_params.get("num_beams", 5),
        do_sample=generate_params.get("do_sample", False),
        temperature=generate_params.get("temperature", 1.0),
        top_k=generate_params.get("top_k", 50),
        top_p=generate_params.get("top_p", 0.95),
    )
    result = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return result
Explanation of the parameters:
- max_length: Set this to a higher value if you want longer responses. For example, max_length=300.
- num_beams: Increasing this can lead to more thorough exploration of possible sequences. Typical values are between 5 and 10.
- do_sample: Set this to True to use sampling methods. This can produce more diverse responses.
- temperature: Lowering this value makes the model more confident and deterministic, while higher values increase diversity. Typical values range from 0.7 to 1.5.
- top_k and top_p: These parameters restrict the pool of tokens to sample from. top_k limits the sampling pool to the k most likely tokens, while top_p (nucleus sampling) uses cumulative probability to cut off the sampling pool. Adjust these based on the desired level of randomness.
Note that temperature, top_k and top_p only take effect when do_sample is True.
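To illustrate how the two sampling cutoffs differ, here is a simplified sketch on a made-up next-token distribution. It is not how the transformers library implements them internally (which works on logits in batches); top_k_filter, top_p_filter and the probabilities are assumptions for this example.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Invented distribution over four candidate next tokens
next_token_probs = {"yes": 0.5, "no": 0.25, "maybe": 0.125, "never": 0.125}

print(sorted(top_k_filter(next_token_probs, k=3)))      # ['maybe', 'no', 'yes']
print(sorted(top_p_filter(next_token_probs, p=0.75)))   # ['no', 'yes']
```

With top_k the pool always has a fixed size, while with top_p the pool shrinks when the model is confident and grows when the distribution is flat.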
Final notebook:
Other models:
- microsoft/Phi-3-mini-128k-instruct
- mistralai/Mistral-7B-v0.1
- And many more
That was fun - thanks!
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for llm-rag-workshop
Similar Open Source Tools
llm-rag-workshop
The LLM RAG Workshop repository provides a workshop on using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to generate and understand text in a human-like manner. It includes instructions on setting up the environment, indexing Zoomcamp FAQ documents, creating a Q&A system, and using OpenAI for generation based on retrieved information. The repository focuses on enhancing language model responses with retrieved information from external sources, such as document databases or search engines, to improve factual accuracy and relevance of generated text.
ruby-openai
Use the OpenAI API with Ruby! 🤖🩵 Stream text with GPT-4, transcribe and translate audio with Whisper, or create images with DALL·E... Hire me | 🎮 Ruby AI Builders Discord | 🐦 Twitter | 🧠 Anthropic Gem | 🚂 Midjourney Gem ## Table of Contents * Ruby OpenAI * Table of Contents * Installation * Bundler * Gem install * Usage * Quickstart * With Config * Custom timeout or base URI * Extra Headers per Client * Logging * Errors * Faraday middleware * Azure * Ollama * Counting Tokens * Models * Examples * Chat * Streaming Chat * Vision * JSON Mode * Functions * Edits * Embeddings * Batches * Files * Finetunes * Assistants * Threads and Messages * Runs * Runs involving function tools * Image Generation * DALL·E 2 * DALL·E 3 * Image Edit * Image Variations * Moderations * Whisper * Translate * Transcribe * Speech * Errors * Development * Release * Contributing * License * Code of Conduct
vim-ai
vim-ai is a plugin that adds Artificial Intelligence (AI) capabilities to Vim and Neovim. It allows users to generate code, edit text, and have interactive conversations with GPT models powered by OpenAI's API. The plugin uses OpenAI's API to generate responses, requiring users to set up an account and obtain an API key. It supports various commands for text generation, editing, and chat interactions, providing a seamless integration of AI features into the Vim text editor environment.
aiocsv
aiocsv is a Python module that provides asynchronous CSV reading and writing. It is designed to be a drop-in replacement for the Python's builtin csv module, but with the added benefit of being able to read and write CSV files asynchronously. This makes it ideal for use in applications that need to process large CSV files efficiently.
lmstudio.js
lmstudio.js is a pre-release alpha client SDK for LM Studio, allowing users to use local LLMs in JS/TS/Node. It is currently undergoing rapid development with breaking changes expected. Users can follow LM Studio's announcements on Twitter and Discord. The SDK provides API usage for loading models, predicting text, setting up the local LLM server, and more. It supports features like custom loading progress tracking, model unloading, structured output prediction, and cancellation of predictions. Users can interact with LM Studio through the CLI tool 'lms' and perform tasks like text completion, conversation, and getting prediction statistics.
client-python
The Mistral Python Client is a tool inspired by cohere-python that allows users to interact with the Mistral AI API. It provides functionalities to access and utilize the AI capabilities offered by Mistral. Users can easily install the client using pip and manage dependencies using poetry. The client includes examples demonstrating how to use the API for various tasks, such as chat interactions. To get started, users need to obtain a Mistral API Key and set it as an environment variable. Overall, the Mistral Python Client simplifies the integration of Mistral AI services into Python applications.
parea-sdk-py
Parea AI provides a SDK to evaluate & monitor AI applications. It allows users to test, evaluate, and monitor their AI models by defining and running experiments. The SDK also enables logging and observability for AI applications, as well as deploying prompts to facilitate collaboration between engineers and subject-matter experts. Users can automatically log calls to OpenAI and Anthropic, create hierarchical traces of their applications, and deploy prompts for integration into their applications.
redis-vl-python
The Python Redis Vector Library (RedisVL) is a tailor-made client for AI applications leveraging Redis. It enhances applications with Redis' speed, flexibility, and reliability, incorporating capabilities like vector-based semantic search, full-text search, and geo-spatial search. The library bridges the gap between the emerging AI-native developer ecosystem and the capabilities of Redis by providing a lightweight, elegant, and intuitive interface. It abstracts the features of Redis into a grammar that is more aligned to the needs of today's AI/ML Engineers or Data Scientists.
json-repair
JSON Repair is a toolkit designed to address JSON anomalies that can arise from Large Language Models (LLMs). It offers a comprehensive solution for repairing JSON strings, ensuring accuracy and reliability in your data processing. With its user-friendly interface and extensive capabilities, JSON Repair empowers developers to seamlessly integrate JSON repair into their workflows.
Lumos
Lumos is a Chrome extension powered by a local LLM co-pilot for browsing the web. It allows users to summarize long threads, news articles, and technical documentation. Users can ask questions about reviews and product pages. The tool requires a local Ollama server for LLM inference and embedding database. Lumos supports multimodal models and file attachments for processing text and image content. It also provides options to customize models, hosts, and content parsers. The extension can be easily accessed through keyboard shortcuts and offers tools for automatic invocation based on prompts.
instructor
Instructor is a popular Python library for managing structured outputs from large language models (LLMs). It offers a user-friendly API for validation, retries, and streaming responses. With support for various LLM providers and multiple languages, Instructor simplifies working with LLM outputs. The library includes features like response models, retry management, validation, streaming support, and flexible backends. It also provides hooks for logging and monitoring LLM interactions, and supports integration with Anthropic, Cohere, Gemini, Litellm, and Google AI models. Instructor facilitates tasks such as extracting user data from natural language, creating fine-tuned models, managing uploaded files, and monitoring usage of OpenAI models.
nvim.ai
nvim.ai is a powerful Neovim plugin that enables AI-assisted coding and chat capabilities within the editor. Users can chat with buffers, insert code with an inline assistant, and utilize various LLM providers for context-aware AI assistance. The plugin supports features like interacting with AI about code and documents, receiving relevant help based on current work, code insertion, code rewriting (Work in Progress), and integration with multiple LLM providers. Users can configure the plugin, add API keys to dotfiles, and integrate with nvim-cmp for command autocompletion. Keymaps are available for chat and inline assist functionalities. The chat dialog allows parsing content with keywords and supports roles like /system, /you, and /assistant. Context-aware assistance can be accessed through inline assist by inserting code blocks anywhere in the file.
hf-waitress
HF-Waitress is a powerful server application for deploying and interacting with HuggingFace Transformer models. It simplifies running open-source Large Language Models (LLMs) locally on-device, providing on-the-fly quantization via BitsAndBytes, HQQ, and Quanto. It requires no manual model downloads, offers concurrency, streaming responses, and supports various hardware and platforms. The server uses a `config.json` file for easy configuration management and provides detailed error handling and logging.
motorhead
Motorhead is a memory and information retrieval server for LLMs. It provides three simple APIs to assist with memory handling in chat applications using LLMs. The first API, GET /sessions/:id/memory, returns messages up to a maximum window size. The second API, POST /sessions/:id/memory, allows you to send an array of messages to Motorhead for storage. The third API, DELETE /sessions/:id/memory, deletes the session's message list. Motorhead also features incremental summarization, where it processes half of the maximum window size of messages and summarizes them when the maximum is reached. Additionally, it supports searching by text query using vector search. Motorhead is configurable through environment variables, including the maximum window size, whether to enable long-term memory, the model used for incremental summarization, the server port, your OpenAI API key, and the Redis URL.
langchain-extract
LangChain Extract is a simple web server that allows you to extract information from text and files using LLMs. It is built using FastAPI, LangChain, and Postgresql. The backend closely follows the extraction use-case documentation and provides a reference implementation of an app that helps to do extraction over data using LLMs. This repository is meant to be a starting point for building your own extraction application which may have slightly different requirements or use cases.
chat-ui
A chat interface using open source models, eg OpenAssistant or Llama. It is a SvelteKit app and it powers the HuggingChat app on hf.co/chat.
For similar tasks
llm-rag-workshop
The LLM RAG Workshop repository provides a workshop on using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to generate and understand text in a human-like manner. It includes instructions on setting up the environment, indexing Zoomcamp FAQ documents, creating a Q&A system, and using OpenAI for generation based on retrieved information. The repository focuses on enhancing language model responses with retrieved information from external sources, such as document databases or search engines, to improve factual accuracy and relevance of generated text.
kernel-memory
Kernel Memory (KM) is a multi-modal AI Service specialized in the efficient indexing of datasets through custom continuous data hybrid pipelines, with support for Retrieval Augmented Generation (RAG), synthetic memory, prompt engineering, and custom semantic memory processing. KM is available as a Web Service, as a Docker container, a Plugin for ChatGPT/Copilot/Semantic Kernel, and as a .NET library for embedded applications. Utilizing advanced embeddings and LLMs, the system enables Natural Language querying for obtaining answers from the indexed data, complete with citations and links to the original sources. Designed for seamless integration as a Plugin with Semantic Kernel, Microsoft Copilot and ChatGPT, Kernel Memory enhances data-driven features in applications built for most popular AI platforms.
swirl-search
Swirl is an open-source software that allows users to simultaneously search multiple content sources and receive AI-ranked results. It connects to various data sources, including databases, public data services, and enterprise sources, and utilizes AI and LLMs to generate insights and answers based on the user's data. Swirl is easy to use, requiring only the download of a YML file, starting in Docker, and searching with Swirl. Users can add credentials to preloaded SearchProviders to access more sources. Swirl also offers integration with ChatGPT as a configured AI model. It adapts and distributes user queries to anything with a search API, re-ranking the unified results using Large Language Models without extracting or indexing anything. Swirl includes five Google Programmable Search Engines (PSEs) to get users up and running quickly. Key features of Swirl include Microsoft 365 integration, SearchProvider configurations, query adaptation, synchronous or asynchronous search federation, optional subscribe feature, pipelining of Processor stages, results stored in SQLite3 or PostgreSQL, built-in Query Transformation support, matching on word stems and handling of stopwords, duplicate detection, re-ranking of unified results using Cosine Vector Similarity, result mixers, page through all results requested, sample data sets, optional spell correction, optional search/result expiration service, easily extensible Connector and Mixer objects, and a welcoming community for collaboration and support.
paper-qa
PaperQA is a minimal package for question answering over PDFs or text files, providing very good answers with in-text citations. It uses OpenAI Embeddings to embed and search documents, and follows a fixed process: embed the docs and the query, search for the top passages, create a summary of each, score and select the relevant summaries, put the selected summaries into the prompt, and generate the answer. Users can customize prompts and use various models for embeddings and LLMs. The tool can be used asynchronously and supports adding documents from paths, files, or URLs.
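The process above (embed, search, summarize, score and select, build the prompt) can be sketched as below. This is only an illustration under stated assumptions and does not use PaperQA's API: the bag-of-words `embed`, the sentence-picking `summarize`, the `min_score` threshold, and the sample passages are all hypothetical stand-ins for the embedding model and LLM calls a real pipeline would make. The final generation step is left out; the sketch stops at the prompt.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_passages(query, passages, k=2):
    """Search step: return the k passages most similar to the query."""
    q = embed(query)
    return sorted(passages, key=lambda p: cosine(q, embed(p["text"])), reverse=True)[:k]


def summarize(query, passage):
    """Stand-in for an LLM summary: keep the most query-like sentence and score it."""
    q = embed(query)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage["text"]) if s.strip()]
    best = max(sentences, key=lambda s: cosine(q, embed(s)))
    return {"source": passage["source"], "summary": best, "score": cosine(q, embed(best))}


def build_prompt(query, passages, k=2, min_score=0.1):
    """Select summaries above the threshold and place them, with sources, in the prompt."""
    summaries = [summarize(query, p) for p in top_passages(query, passages, k)]
    kept = [s for s in summaries if s["score"] >= min_score]
    context = "\n".join(f"- {s['summary']} ({s['source']})" for s in kept)
    return (
        "Answer using only the context and cite sources in parentheses.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical sample passages with citation keys.
PASSAGES = [
    {"source": "Smith2020", "text": "Transformers use attention. Attention weighs tokens by relevance."},
    {"source": "Lee2019", "text": "Bread needs yeast. Yeast makes dough rise."},
    {"source": "Kim2021", "text": "Attention layers scale quadratically with sequence length."},
]

prompt = build_prompt("How does attention weigh tokens?", PASSAGES)
print(prompt)
```

Carrying the source key alongside each summary is what lets the generated answer include in-text citations.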
quick-start-connectors
Cohere's Build-Your-Own-Connector framework integrates Cohere's Command LLM, via the Chat API endpoint, with any datastore or software that holds text and exposes a search endpoint, so that user queries can be grounded in proprietary information. Use cases include question answering, knowledge work, communications summaries, and research. The repository provides code for popular datastores and a template connector. It requires Python 3.11+ and Poetry, and connectors can be built and deployed using Docker. Authorization values are set through environment variables, and pre-commit hooks handle linting. The connectors are tailored to Cohere's Chat API for building chatbots: they return documents as JSON objects, which Cohere's API uses to generate answers with citations.
RAGMeUp
RAG Me Up is a generic framework that lets users easily perform Retrieval-Augmented Generation (RAG) on their own dataset. It consists of a small server and UIs for communicating with it, and is best run on a GPU with 16GB of VRAM. Users can combine RAG with fine-tuning using the LLaMa2Lang repository. The tool allows configuration of the LLM, data, LLM parameters, prompt, and document splitting. Funding is sought to democratize AI and advance its applications.
local-genAI-search
Local-GenAI Search is a local generative search engine powered by the Llama3 model, allowing users to ask questions about their local files and receive concise answers with relevant document references. It utilizes MS MARCO embeddings for semantic search and can run locally on a 32GB laptop or computer. The tool can be used to index local documents, search for information, and provide generative search services through a user interface.
nanoPerplexityAI
nanoPerplexityAI is an open-source implementation of a large language model service that fetches information from Google. It involves a simple architecture where the user query is checked by the language model, reformulated for Google search, and an answer is generated and saved in a markdown file. The tool requires minimal setup and is designed for easy visualization of answers.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models, from images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open-access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks so that operators can focus on more complicated and time-consuming tasks, and it can also identify security harms such as misuse (e.g., malware generation, jailbreaking) and privacy harms (e.g., identity theft). The goal is to give researchers a baseline of how well their model and entire inference pipeline are doing against different harm categories, and to let them compare that baseline to future iterations of their model. This gives them empirical data on how well their model is doing today and lets them detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. Its key features: it is self-contained, with no need for a DBMS or cloud service; it has an OpenAPI interface that is easy to integrate with existing infrastructure (e.g., a Cloud IDE); and it supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.