RAGElo
RAGElo is a set of tools that helps you select the best RAG-based LLM agents by using an Elo ranker.
Stars: 127
RAGElo is a streamlined toolkit for evaluating Retrieval Augmented Generation (RAG)-powered Large Language Models (LLMs) question answering agents using the Elo rating system. It simplifies the process of comparing different outputs from multiple prompt and pipeline variations to a 'gold standard' by allowing a powerful LLM to judge between pairs of answers and questions. RAGElo conducts tournament-style Elo ranking of LLM outputs, providing insights into the effectiveness of different settings.
README:
Elo-based RAG Agent evaluator
RAGElo is a streamlined toolkit for evaluating Retrieval Augmented Generation (RAG)-powered Large Language Models (LLMs) question answering agents using the Elo rating system.
While it has become easier to prototype and incorporate generative LLMs in production, evaluation is still the most challenging part of the solution. Comparing different outputs from multiple prompt and pipeline variations to a "gold standard" is not easy. Still, we can ask a powerful LLM to judge between pairs of answers and a set of questions.
This led us to develop a simple tool for tournament-style Elo ranking of LLM outputs. By comparing answers from different RAG pipelines and prompts over multiple questions, RAGElo computes a ranking of the different settings, providing a good overview of what works (and what doesn't).
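To make the tournament idea concrete, here is a minimal sketch of a standard Elo update over pairwise judgments. It is not RAGElo's implementation: the function names, the K-factor of 32, and the starting rating of 1000 are illustrative assumptions, and the numbers it prints will not necessarily match RAGElo's output.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_elo(
    games: Iterable[Tuple[str, str, float]],
    k: float = 32.0,        # assumed K-factor, not taken from RAGElo
    initial: float = 1000.0,  # assumed starting rating
) -> Dict[str, float]:
    """Apply Elo updates for (agent_a, agent_b, score_a) games.

    score_a is 1.0 if A won, 0.5 for a tie, and 0.0 if B won.
    """
    ratings: Dict[str, float] = {}
    for agent_a, agent_b, score_a in games:
        ra = ratings.setdefault(agent_a, initial)
        rb = ratings.setdefault(agent_b, initial)
        ea = expected_score(ra, rb)
        # Each agent moves by K times (actual score - expected score).
        ratings[agent_a] = ra + k * (score_a - ea)
        ratings[agent_b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
    return ratings


# Two queries, and the judge preferred agent1 both times.
games = [("agent1", "agent2", 1.0), ("agent1", "agent2", 1.0)]
ratings = run_elo(games)
for agent, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{agent}: {rating:.1f}")
```

Because each update depends on the ratings at the time of the game, the order of games affects the final numbers, which is one reason to average over several shuffled tournaments.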
To use RAGElo as a Python library or as a CLI, install it with pip:
pip install ragelo
When working from source, we recommend an isolated environment (e.g., uv venv && uv pip install -e '.[dev]'). The project's Python interpreter then lives at .venv/bin/python.
Environment variables and providers:
- OpenAI requires OPENAI_API_KEY. Set it in your shell or load it via dotenv before invoking the CLI.
- Ollama is supported for local models (--llm-provider-name ollama).
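If you call RAGElo from a script, it can help to fail early when credentials are missing. The following is a small standard-library sketch; missing_env_vars is a hypothetical helper name and is not part of RAGElo.

```python
import os
from typing import Iterable, List


def missing_env_vars(names: Iterable[str]) -> List[str]:
    """Return the variable names that are unset or empty in the environment."""
    return [name for name in names if not os.environ.get(name, "").strip()]


# OPENAI_API_KEY is the variable named above; add others as your provider needs.
missing = missing_env_vars(["OPENAI_API_KEY"])
if missing:
    print(f"Missing credentials: {', '.join(missing)}. Export them before running ragelo.")
else:
    print("Credentials found.")
```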
To use RAGElo as a library, all you need to do is import RAGElo, initialize an Evaluator and call either evaluate() for evaluating a retrieved document or an LLM answer, or evaluate_experiment() to evaluate a full experiment. For example, using the RDNAM retrieval evaluator from the Thomas et al. (2023) paper on using GPT-4 for annotating retrieval results:
from ragelo import get_retrieval_evaluator
evaluator = get_retrieval_evaluator("RDNAM", llm_provider="openai")
result = evaluator.evaluate(query="What is the capital of France?", document='Lyon is the second largest city in France.')
print(result.answer)
# Output: RDNAMAnswerEvaluatorFormat(overall=1.0, intent_match=None, trustworthiness=None)
print(result.answer.overall)
# Output: 1.0
print(result.raw_answer)
# Output: '{"overall": 1.0}'
In most cases, result.answer contains a Pydantic BaseModel with the parsed judge response. For more details, check the answer_formats.py file.
RAGElo supports Experiments to keep track of which documents and answers were already evaluated and to compute overall scores for each Agent:
from ragelo import Experiment, get_retrieval_evaluator, get_answer_evaluator, get_agent_ranker, get_llm_provider
experiment = Experiment(experiment_name="A_really_cool_RAGElo_experiment")
# Add two user queries. Alternatively, we can load them from a csv file with .add_queries_from_csv()
experiment.add_query("What is the capital of Brazil?", query_id=0)
experiment.add_query("What is the capital of France?", query_id=1)
# Add four documents retrieved for these queries. Alternatively, we can load them from a csv file with .add_documents_from_csv()
experiment.add_retrieved_doc("Brasília is the capital of Brazil", query_id=0, doc_id=0)
experiment.add_retrieved_doc("Rio de Janeiro used to be the capital of Brazil.", query_id=0, doc_id=1)
experiment.add_retrieved_doc("Paris is the capital of France.", query_id=1, doc_id=2)
experiment.add_retrieved_doc("Lyon is the second largest city in France.", query_id=1, doc_id=3)
# Add the answers generated by agents
experiment.add_agent_answer("Brasília is the capital of Brazil, according to [0].", agent="agent1", query_id=0)
experiment.add_agent_answer("According to [1], Rio de Janeiro used to be the capital of Brazil, until the 60s.", agent="agent2", query_id=0)
experiment.add_agent_answer("Paris is the capital of France, according to [2].", agent="agent1", query_id=1)
experiment.add_agent_answer("According to [3], Lyon is the second largest city in France. Meanwhile, Paris is its capital [2].", agent="agent2", query_id=1)
llm_provider = get_llm_provider("openai", model="gpt-4.1-nano")
retrieval_evaluator = get_retrieval_evaluator("reasoner", llm_provider, rich_print=True)
answer_evaluator = get_answer_evaluator("pairwise", llm_provider, rich_print=True)
elo_ranker = get_agent_ranker("elo", verbose=True)
# Evaluate the retrieval results.
retrieval_evaluator.evaluate_experiment(experiment)
# With the retrieved documents evaluated, evaluate the quality of the answers using the pairwise evaluator
answer_evaluator.evaluate_experiment(experiment)
# Run the Elo ranker to score the agents
elo_ranker.run(experiment)
# Output:
------- Agents Elo Ratings -------
agent1 : 1035.7(±2.9)
agent2 : 961.3(±2.9)
The experiment is saved as a JSON file in ragelo_cache/experiment_name.json.
For a more complete example, we can evaluate with a custom prompt, and inject metadata into our evaluation prompt:
from pydantic import BaseModel, Field
from ragelo import get_retrieval_evaluator
system_prompt = """You are a helpful assistant for evaluating the relevance of a retrieved document to a user query.
You should pay extra attention to how **recent** a document is. A document older than 5 years is considered outdated.
The answer should be evaluated according to its recency, truthfulness, and relevance to the user query.
"""
user_prompt = """
User query: {{ query.query }}
Retrieved document: {{ document.text }}
The document has a date of {{ document.metadata.date }}.
Today is {{ query.metadata.today_date }}.
"""
class ResponseSchema(BaseModel):
    relevance: int = Field(description="An integer, either 0 or 1. 0 if the document is irrelevant, 1 if it is relevant.")
    recency: int = Field(description="An integer, either 0 or 1. 0 if the document is outdated, 1 if it is recent.")
    truthfulness: int = Field(description="An integer, either 0 or 1. 0 if the document is false, 1 if it is true.")
    reasoning: str = Field(description="A short explanation of why you think the document is relevant or irrelevant.")
evaluator = get_retrieval_evaluator(
    "custom_prompt", # name of the retrieval evaluator
    llm_provider="openai", # Which LLM provider to use
    system_prompt=system_prompt, # your custom prompt
    user_prompt=user_prompt, # your custom prompt
    llm_response_schema=ResponseSchema, # The response schema for the LLM.
)
result = evaluator.evaluate(
    query="What is the capital of Brazil?", # The user query
    document="Rio de Janeiro is the capital of Brazil.", # The retrieved document
    query_metadata={"today_date": "08-04-2024"}, # Some metadata for the query
    doc_metadata={"date": "04-03-1950"}, # Some metadata for the document
)
result.answer.model_dump_json(indent=2)
# Output:
'{
"relevance": 0,
"recency": 0,
"truthfulness": 0,
"reasoning": "The document is outdated and incorrect. Rio de Janeiro was the capital of Brazil until 1960 when it was changed to Brasília."
}'
Note that, in this example, we passed two dictionaries to the evaluate method, with metadata for the query and the document. This metadata is injected into the prompt by matching its keys to the placeholders in the prompt (note the document.metadata.date and query.metadata.today_date templates).
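The placeholder matching described above can be illustrated with a small standard-library stand-in. This is not RAGElo's rendering code, which is not shown here; render and resolve are hypothetical helpers that replace each {{ dotted.path }} placeholder with the matching value from nested dictionaries.

```python
import re
from typing import Any, Mapping

# Matches placeholders such as {{ document.metadata.date }}
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def resolve(path: str, context: Mapping[str, Any]) -> Any:
    """Walk a dotted path through nested mappings; raises KeyError if a key is missing."""
    value: Any = context
    for key in path.split("."):
        value = value[key]
    return value


def render(template: str, context: Mapping[str, Any]) -> str:
    """Replace every placeholder with the value found at its dotted path."""
    return PLACEHOLDER.sub(lambda m: str(resolve(m.group(1), context)), template)


context = {
    "query": {
        "query": "What is the capital of Brazil?",
        "metadata": {"today_date": "08-04-2024"},
    },
    "document": {
        "text": "Rio de Janeiro is the capital of Brazil.",
        "metadata": {"date": "04-03-1950"},
    },
}
template = (
    "The document has a date of {{ document.metadata.date }}. "
    "Today is {{ query.metadata.today_date }}."
)
rendered = render(template, context)
print(rendered)
# The document has a date of 04-03-1950. Today is 08-04-2024.
```

A metadata key that appears in the prompt but is missing from the dictionaries raises a KeyError in this sketch, which makes typos in placeholder names easy to spot.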
For a comprehensive example of how to use RAGElo, see the docs/examples/notebooks/rag_eval.ipynb notebook.
After installing RAGElo as a CLI app (and exporting the appropriate LLM provider credentials, e.g., OPENAI_API_KEY), you can run it with the following command:
ragelo run-all \
    queries.csv documents.csv answers.csv \
    --data-dir tests/data/ \
    --experiment-name tutorial \
    --output-file tutorial.json \
    --verbose
With --verbose enabled you will see outputs such as:
Loaded 2 queries from .../tests/data/queries.csv
Loaded 4 new documents from .../tests/data/documents.csv
Loaded 4 answers from .../tests/data/answers.csv
Evaluating Retrieved documents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4
✅ Done!
🔎 Query ID: 0
📜 Document ID: 0
Parsed Answer: Very relevant: The document directly answers the user question by stating that Brasília is the capital of Brazil.
🔎 Query ID: 0
📜 Document ID: 1
Parsed Answer: Somewhat relevant: The document mentions a former capital of Brazil but does not provide the current capital.
🔎 Query ID: 1
📜 Document ID: 2
Parsed Answer: Very relevant: The document clearly states that Paris is the capital of France, directly answering the user question.
🔎 Query ID: 1
📜 Document ID: 3
Parsed Answer: Not relevant: The document does not provide information about the capital of France.
Evaluating Retrieved documents 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4/4 [ 0:00:02 < 0:00:00 , 2 it/s ]
✅ Done!
Total evaluations: 4
🔎 Query ID: 0
agent1 🆚 agent2
Parsed Answer: A
🔎 Query ID: 1
agent1 🆚 agent2
Parsed Answer: A
Evaluating Agent Answers 100% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 [ 0:00:09 < 0:00:00 , 0 it/s ]
✅ Done!
Total evaluations: 2
------- Agents Elo Ratings -------
agent1 : 1033.0(±0.0)
agent2 : 966.0(±0.0)
By default, evaluations are persisted to ragelo_cache/<experiment>.json alongside incremental results in ragelo_cache/<experiment>_results.jsonl. Passing --output-file writes the experiment JSON (without evaluator traces) to a custom location.
In this example, the output file is a JSON file with the experiment definition and tournament summary. It can be loaded directly as a new Experiment object:
experiment = Experiment.load("experiment", "experiment.json")
When running as a CLI, RAGElo expects its inputs as CSV files: one with the user queries, one with the documents retrieved by the retrieval system, and one with the answers each agent produced. These files can be passed with the parameters --queries_csv_file, --documents_csv_file and --answers_csv_file, respectively, or directly as positional arguments.
CSV columns and inference:
- Queries: qid,query (infers qid if missing)
- Documents: qid,did,document (infers qid/did if missing)
- Answers: qid,agent,answer
Extra columns are captured as metadata and available to prompts.
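The column conventions above can be sketched with the standard csv module. This illustrates the idea and is not RAGElo's loader: load_queries is a hypothetical helper, the language column is a made-up extra column, and falling back to the row position when qid is missing is an assumption about how inference might work.

```python
import csv
import io
from typing import Any, Dict, List


def load_queries(csv_text: str) -> List[Dict[str, Any]]:
    """Parse a queries CSV into records with qid, query, and leftover metadata."""
    # skipinitialspace drops the blank that follows each comma in the examples.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    queries = []
    for row_index, row in enumerate(reader):
        # Assumed fallback: use the row position when no qid column is present.
        qid = row.pop("qid", None) or str(row_index)
        text = row.pop("query")
        # Whatever columns remain are kept as metadata for the prompts.
        queries.append({"qid": qid, "query": text, "metadata": dict(row)})
    return queries


sample = (
    "qid,query,language\n"
    "0, What is the capital of Brazil?,en\n"
    "1, What is the capital of France?,en\n"
)
for record in load_queries(sample):
    print(record)
```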
Here are some examples of their expected formats:
queries.csv:
qid,query
0, What is the capital of Brazil?
1, What is the capital of France?
documents.csv:
qid,did,document
0,0, Brasília is the capital of Brazil.
0,1, Rio de Janeiro used to be the capital of Brazil.
1,2, Paris is the capital of France.
1,3, Lyon is the second largest city in France.
answers.csv:
qid,agent,answer
0, agent1,"Brasília is the capital of Brazil, according to [0]."
0, agent2,"According to [1], Rio de Janeiro used to be the capital of Brazil, until the 60s."
1, agent1,"Paris is the capital of France, according to [2]."
1, agent2,"According to [3], Lyon is the second largest city in France. Meanwhile, Paris is its capital [2]."
While RAGElo can be used end-to-end (run-all), you can also drive individual CLI components.
The retrieval-evaluator tool annotates retrieved documents based on their relevance to the user query. This is done regardless of the answers provided by any Agent. For example, to call the Reasoner retrieval evaluator (which only outputs the reasoning for why a document is or is not relevant), we can use:
ragelo retrieval-evaluator reasoner \
    queries.csv documents.csv \
    --data-dir tests/data/ \
    --experiment-name experiment \
    --output-file experiment-docs.json \
    --verbose
Each run updates the experiment cache and appends evaluation traces to <experiment>_results.jsonl. If all documents already have evaluations, you will see an informational message unless --force is provided.
Domain expert example:
ragelo retrieval-evaluator domain-expert \
    queries.csv documents.csv \
    --data-dir tests/data/ \
    --experiment-name experiment \
    --expert-in "Chemical Engineering" \
    --company "ChemCorp" \
    --output-file experiment-docs.json \
    --verbose
RDNAM example:
ragelo retrieval-evaluator rdnam \
    queries.csv documents.csv \
    --data-dir tests/data/ \
    --experiment-name experiment \
    --output-file experiment-docs.json \
    --verbose
The answer-evaluator subcommands annotate agent answers. The default pairwise mode compares answers two at a time and can optionally inject reasoning annotations:
ragelo answer-evaluator pairwise \
    queries.csv documents.csv answers.csv \
    --data-dir tests/data/ \
    --experiment-name experiment \
    --output-file experiment-answers.json \
    --add-reasoning \
    --verbose
If --add-reasoning is supplied, the CLI will run the reasoner retrieval evaluator first, include the relevance scores in the prompts, and then proceed with pairwise games. Newly created games are tracked inside the experiment and re-used by the Elo ranker.
Domain expert pairwise example:
ragelo answer-evaluator expert-pairwise \
    queries.csv documents.csv answers.csv \
    --data-dir tests/data/ \
    --experiment-name experiment \
    --expert-in "Healthcare" \
    --add-reasoning \
    --output-file experiment-answers.json \
    --verbose
Concurrency and Rich output:
- Use --n-processes to control parallel LLM calls.
- Use --no-rich-print in CI to avoid live display issues.
Reproducibility tips:
- Pairwise sampling (n_games_per_query) is randomized; persist experiment JSON/JSONL to stabilize comparisons.
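To see why randomized sampling changes which games get played, here is a standalone sketch of drawing agent pairs for one query. It is not RAGElo's sampler: sample_games and its seed parameter are hypothetical, and whether RAGElo exposes a seed is not stated here, which is why persisting the experiment files is the documented way to keep comparisons stable.

```python
import itertools
import random
from typing import List, Optional, Tuple


def sample_games(
    agents: List[str],
    n_games_per_query: int,
    seed: Optional[int] = None,  # hypothetical parameter, for illustration only
) -> List[Tuple[str, str]]:
    """Draw up to n_games_per_query distinct agent pairs for a single query."""
    rng = random.Random(seed)
    all_pairs = list(itertools.combinations(sorted(agents), 2))
    # Never ask for more games than there are distinct pairs.
    n_games = min(n_games_per_query, len(all_pairs))
    return rng.sample(all_pairs, n_games)


agents = ["agent1", "agent2", "agent3", "agent4"]
# Without a seed, repeated calls may pick different pairs.
print(sample_games(agents, n_games_per_query=3))
# With a fixed seed, the same pairs come back on every call.
print(sample_games(agents, n_games_per_query=3, seed=42))
```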
Evaluating retrieval metrics (optional):
from ragelo import Experiment
exp = Experiment(experiment_name="my_exp", save_on_disk=False)
# load queries/docs/answers and evaluations...
exp.evaluate_retrieval(metrics=["Precision@10", "nDCG@10"], relevance_threshold=1)
To install the development dependencies, download the repository and run the following:
git clone https://github.com/zeta-alpha/ragelo && cd ragelo
uv pip install -e '.[dev]'
This installs the package and its development dependencies in editable mode (i.e., changes to the code take effect without reinstalling).
For building a new version, use the build command:
python -m build
- [ ] Add full documentation of all implemented Evaluators
- [x] Add CI/CD for publishing
- [x] Add option for few-shot examples (undocumented for now)
- [x] Testing!
- [x] Publish on PyPi
- [x] Add more document evaluators
- [x] Split Elo evaluator
- [x] Install as standalone CLI
The RAGElo logo was created using Dall-E 3 and GPT-4 with the following prompt: "Vector logo design for a toolkit named 'RAGElo'. The logo should have bold, modern typography with emphasis on 'RAG' in a contrasting color. Include a minimalist icon symbolizing retrieval or ranking."
Alternative AI tools for RAGElo
Similar Open Source Tools
gfm-rag
The GFM-RAG is a graph foundation model-powered pipeline that combines graph neural networks to reason over knowledge graphs and retrieve relevant documents for question answering. It features a knowledge graph index, efficiency in multi-hop reasoning, generalizability to unseen datasets, transferability for fine-tuning, compatibility with agent-based frameworks, and interpretability of reasoning paths. The tool can be used for conducting retrieval and question answering tasks using pre-trained models or fine-tuning on custom datasets.
stark
STaRK is a large-scale semi-structure retrieval benchmark on Textual and Relational Knowledge Bases. It provides natural-sounding and practical queries crafted to incorporate rich relational information and complex textual properties, closely mirroring real-life scenarios. The benchmark aims to assess how effectively large language models can handle the interplay between textual and relational requirements in queries, using three diverse knowledge bases constructed from public sources.
appworld
AppWorld is a high-fidelity execution environment of 9 day-to-day apps, operable via 457 APIs, populated with digital activities of ~100 people living in a simulated world. It provides a benchmark of natural, diverse, and challenging autonomous agent tasks requiring rich and interactive coding. The repository includes implementations of AppWorld apps and APIs, along with tests. It also introduces safety features for code execution and provides guides for building agents and extending the benchmark.
paper-qa
PaperQA is a minimal package for question and answering from PDFs or text files, providing very good answers with in-text citations. It uses OpenAI Embeddings to embed and search documents, and includes a process of embedding docs, queries, searching for top passages, creating summaries, using an LLM to re-score and select relevant summaries, putting summaries into prompt, and generating answers. The tool can be used to answer specific questions related to scientific research by leveraging citations and relevant passages from documents.
ai2-scholarqa-lib
Ai2 Scholar QA is a system for answering scientific queries and literature review by gathering evidence from multiple documents across a corpus and synthesizing an organized report with evidence for each claim. It consists of a retrieval component and a three-step generator pipeline. The retrieval component fetches relevant evidence passages using the Semantic Scholar public API and reranks them. The generator pipeline includes quote extraction, planning and clustering, and summary generation. The system is powered by the ScholarQA class, which includes components like PaperFinder and MultiStepQAPipeline. It requires environment variables for Semantic Scholar API and LLMs, and can be run as local docker containers or embedded into another application as a Python package.
tonic_validate
Tonic Validate is a framework for the evaluation of LLM outputs, such as Retrieval Augmented Generation (RAG) pipelines. Validate makes it easy to evaluate, track, and monitor your LLM and RAG applications. Validate allows you to evaluate your LLM outputs through the use of our provided metrics which measure everything from answer correctness to LLM hallucination. Additionally, Validate has an optional UI to visualize your evaluation results for easy tracking and monitoring.
AirspeedVelocity.jl
AirspeedVelocity.jl is a tool designed to simplify benchmarking of Julia packages over their lifetime. It provides a CLI to generate benchmarks, compare commits/tags/branches, plot benchmarks, and run benchmark comparisons for every submitted PR as a GitHub action. The tool freezes the benchmark script at a specific revision to prevent old history from affecting benchmarks. Users can configure options using CLI flags and visualize benchmark results. AirspeedVelocity.jl can be used to benchmark any Julia package and offers features like generating tables and plots of benchmark results. It also supports custom benchmarks and can be integrated into GitHub actions for automated benchmarking of PRs.
Hurley-AI
Hurley AI is a next-gen framework for developing intelligent agents through Retrieval-Augmented Generation. It enables easy creation of custom AI assistants and agents, supports various agent types, and includes pre-built tools for domains like finance and legal. Hurley AI integrates with LLM inference services and provides observability with Arize Phoenix. Users can create Hurley RAG tools with a single line of code and customize agents with specific instructions. The tool also offers various helper functions to connect with Hurley RAG and search tools, along with pre-built tools for tasks like summarizing text, rephrasing text, understanding memecoins, and querying databases.
chatWeb
ChatWeb is a tool that can crawl web pages, extract text from PDF, DOCX, TXT files, and generate an embedded summary. It can answer questions based on text content using chatAPI and embeddingAPI based on GPT3.5. The tool calculates similarity scores between text vectors to generate summaries, performs nearest neighbor searches, and designs prompts to answer user questions. It aims to extract relevant content from text and provide accurate search results based on keywords. ChatWeb supports various modes, languages, and settings, including temperature control and PostgreSQL integration.
siftrank
siftrank is an implementation of the Sift Rank document ranking algorithm that uses Large Language Models (LLMs) to efficiently find the most relevant items in any dataset based on a given prompt. It addresses issues like non-determinism, limited context, output constraints, and scoring subjectivity encountered when using LLMs directly. siftrank allows users to rank anything without fine-tuning or domain-specific models, running in seconds and costing pennies. It supports JSON input, Go template syntax for customization, and various advanced options for configuration and optimization.
paper-qa
PaperQA is a minimal package for question and answering from PDFs or text files, providing very good answers with in-text citations. It uses OpenAI Embeddings to embed and search documents, and follows a process of embedding docs and queries, searching for top passages, creating summaries, scoring and selecting relevant summaries, putting summaries into prompt, and generating answers. Users can customize prompts and use various models for embeddings and LLMs. The tool can be used asynchronously and supports adding documents from paths, files, or URLs.
ragtacts
Ragtacts is a Clojure library that allows users to easily interact with Large Language Models (LLMs) such as OpenAI's GPT-4. Users can ask questions to LLMs, create question templates, call Clojure functions in natural language, and utilize vector databases for more accurate answers. Ragtacts also supports RAG (Retrieval-Augmented Generation) method for enhancing LLM output by incorporating external data. Users can use Ragtacts as a CLI tool, API server, or through a RAG Playground for interactive querying.
simpleAI
SimpleAI is a self-hosted alternative to the not-so-open AI API, focused on replicating main endpoints for LLM such as text completion, chat, edits, and embeddings. It allows quick experimentation with different models, creating benchmarks, and handling specific use cases without relying on external services. Users can integrate and declare models through gRPC, query endpoints using Swagger UI or API, and resolve common issues like CORS with FastAPI middleware. The project is open for contributions and welcomes PRs, issues, documentation, and more.
godot-llm
Godot LLM is a plugin that enables the utilization of large language models (LLM) for generating content in games. It provides functionality for text generation, text embedding, multimodal text generation, and vector database management within the Godot game engine. The plugin supports features like Retrieval Augmented Generation (RAG) and integrates llama.cpp-based functionalities for text generation, embedding, and multimodal capabilities. It offers support for various platforms and allows users to experiment with LLM models in their game development projects.
magentic
Easily integrate Large Language Models into your Python code. Simply use the `@prompt` and `@chatprompt` decorators to create functions that return structured output from the LLM. Mix LLM queries and function calling with regular Python code to create complex logic.
For similar tasks
ChainForge
ChainForge is a visual programming environment for battle-testing prompts to LLMs. It is geared towards early-stage, quick-and-dirty exploration of prompts, chat responses, and response quality that goes beyond ad-hoc chatting with individual LLMs. With ChainForge, you can: * Query multiple LLMs at once to test prompt ideas and variations quickly and effectively. * Compare response quality across prompt permutations, across models, and across model settings to choose the best prompt and model for your use case. * Setup evaluation metrics (scoring function) and immediately visualize results across prompts, prompt parameters, models, and model settings. * Hold multiple conversations at once across template parameters and chat models. Template not just prompts, but follow-up chat messages, and inspect and evaluate outputs at each turn of a chat conversation. ChainForge comes with a number of example evaluation flows to give you a sense of what's possible, including 188 example flows generated from benchmarks in OpenAI evals. This is an open beta of Chainforge. We support model providers OpenAI, HuggingFace, Anthropic, Google PaLM2, Azure OpenAI endpoints, and Dalai-hosted models Alpaca and Llama. You can change the exact model and individual model settings. Visualization nodes support numeric and boolean evaluation metrics. ChainForge is built on ReactFlow and Flask.
llm-consortium
LLM Consortium is a plugin for the `llm` package that implements a model consortium system with iterative refinement and response synthesis. It orchestrates multiple learned language models to collaboratively solve complex problems through structured dialogue, evaluation, and arbitration. The tool supports multi-model orchestration, iterative refinement, advanced arbitration, database logging, configurable parameters, hundreds of models, and the ability to save and load consortium configurations.
judges
The 'judges' repository is a small library designed for using and creating LLM-as-a-Judge evaluators. It offers a curated set of LLM evaluators in a low-friction format for various use cases, backed by research. Users can use these evaluators off-the-shelf or as inspiration for building custom LLM evaluators. The library provides two types of judges: Classifiers that return boolean values and Graders that return scores on a numerical or Likert scale. Users can combine multiple judges using the 'Jury' object and evaluate input-output pairs with the '.judge()' method. Additionally, the repository includes detailed instructions on picking a model, sending data to an LLM, using classifiers, combining judges, and creating custom LLM judges with 'AutoJudge'.
mcp-rubber-duck
MCP Rubber Duck is a Model Context Protocol server that acts as a bridge to query multiple LLMs, including OpenAI-compatible HTTP APIs and CLI coding agents. Users can explain their problems to various AI 'ducks' to get different perspectives. The tool offers features like universal OpenAI compatibility, CLI agent support, conversation management, multi-duck querying, consensus voting, LLM-as-Judge evaluation, structured debates, health monitoring, usage tracking, and more. It supports various HTTP providers like OpenAI, Google Gemini, Anthropic, Groq, Together AI, Perplexity, and CLI providers like Claude Code, Codex, Gemini CLI, Grok, Aider, and custom agents. Users can install the tool globally, configure it using environment variables, and access interactive UIs for comparing ducks, voting, debating, and usage statistics. The tool provides multiple tools for asking questions, chatting, clearing conversations, listing ducks, comparing responses, voting, judging, iterating, debating, and more. It also offers prompt templates for different analysis purposes and extensive documentation for setup, configuration, tools, prompts, CLI providers, MCP Bridge, guardrails, Docker deployment, troubleshooting, contributing, license, acknowledgments, changelog, registry & directory, and support.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.
