# ai-rag-chat-evaluator
Tools for evaluation of RAG Chat Apps using Azure AI Evaluate SDK and OpenAI
This repo contains scripts and tools for evaluating a chat app that uses the RAG architecture. There are many parameters that affect the quality and style of answers generated by the chat app, such as the system prompt, search parameters, and GPT model parameters.
Whenever you are making changes to a RAG chat app with the goal of improving the answers, you should evaluate the results. This repository offers tools to make it easier to run evaluations, plus examples of evaluations that we've run on our sample chat app.
📺 Watch a video overview of this repo
Table of contents:
- Cost estimation
- Setting up this project
- Deploying a GPT-4 model
- Generating ground truth data
- Running an evaluation
- Viewing the results
- Measuring app's ability to say "I don't know"
## Cost estimation

There are several places where this project can incur costs:
| Cost | Description | Estimated tokens used |
| --- | --- | --- |
| Generating ground truth data | This is a one-time cost for generating the initial set of questions and answers, and involves pulling data down from your search index and sending it to the GPT model. | 1000 tokens per question generated, which would be 200,000 tokens for the recommended 200 questions. |
| Running evaluations | Each time you run an evaluation, you may choose to use the GPT-based evaluators (groundedness, coherence, etc.). For each GPT evaluator used, you will incur costs for the tokens used by the GPT model. | 1000 tokens per question per evaluator used, which would be 600,000 tokens for the default 200 questions and 3 evaluators. |
For a full estimate of the costs for your region and model, see the Azure OpenAI pricing page or use the Azure OpenAI pricing calculator.
## Setting up this project

If you open this project in a Dev Container or GitHub Codespaces, it will automatically set up the environment for you. If not, then follow these steps:
- Install Python 3.10 or higher.
- Create a Python virtual environment.
- Inside that virtual environment, install the requirements:

  ```shell
  python -m pip install -r requirements.txt
  ```
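If you haven't set up a Python virtual environment before, a minimal sequence looks like this; the `.venv` directory name is just a common convention, not something this repo requires:

```shell
# Create and activate a virtual environment before installing the requirements
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
```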
## Deploying a GPT-4 model

It's best to use a GPT-4 model for performing the evaluation, even if your chat app uses GPT-3.5 or another model. You can either use an Azure OpenAI instance or an openai.com instance.
To use a new Azure OpenAI instance, you'll need to create a new instance and deploy the app to it. We've made that easy to deploy with the `azd` CLI tool.
- Install the Azure Developer CLI.
- Run `azd auth login` to log in to your Azure account.
- Run `azd up` to deploy a new GPT-4 instance.
- Create a `.env` file based on the provisioned resources by running one of the following commands.

  Bash:

  ```shell
  azd env get-values > .env
  ```

  PowerShell:

  ```powershell
  $output = azd env get-values; Add-Content -Path .env -Value $output;
  ```
If you already have an Azure OpenAI instance, you can use that instead of creating a new one.
- Create a `.env` file by copying `.env.sample`.
- Fill in the values for your instance:

  ```shell
  AZURE_OPENAI_EVAL_DEPLOYMENT="<deployment-name>"
  AZURE_OPENAI_SERVICE="<service-name>"
  ```

- The scripts default to keyless access (via `DefaultAzureCredential`), but you can optionally use a key by setting `AZURE_OPENAI_KEY` in `.env`.
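For reference, keyless access with `DefaultAzureCredential` typically looks something like the sketch below. This is a general pattern using the `openai` and `azure-identity` packages, not necessarily the exact code in this repo's scripts; the API version is a placeholder:

```python
# Sketch: keyless access to Azure OpenAI via azure-identity (assumed pattern, not this repo's exact code)
import os

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint=f"https://{os.environ['AZURE_OPENAI_SERVICE']}.openai.azure.com",
    azure_ad_token_provider=token_provider,  # or api_key=os.environ["AZURE_OPENAI_KEY"] for key-based access
    api_version="2024-06-01",  # placeholder API version
)
```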
If you have an openai.com instance, you can use that instead of an Azure OpenAI instance.
- Create a `.env` file by copying `.env.sample`.
- Change `OPENAI_HOST` to "openai" and fill in the key for your OpenAI account. If you do not have an organization, you can leave that blank:

  ```shell
  OPENAI_HOST="openai"
  OPENAICOM_KEY=""
  OPENAICOM_ORGANIZATION=""
  ```
## Generating ground truth data

In order to evaluate new answers, they must be compared to "ground truth" answers: the ideal answer for a particular question. See `example_input/qa.jsonl` for an example of the format.
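Each line of that file is a JSON object pairing a question with its ideal answer. The exact field names to use are the ones in `example_input/qa.jsonl`; a line typically looks roughly like this (field names and content here are illustrative, not authoritative):

```json
{"question": "<a question your users might ask>", "truth": "<the ideal answer, including a citation such as [somefile.pdf#page=2]>"}
```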
We recommend at least 200 QA pairs if possible.
There are a few ways to get this data:
- Manually curate a set of questions and answers that you consider to be ideal. This is the most accurate, but also the most time-consuming. Make sure your answers include citations in the expected format. This approach requires domain expertise in the data.
- Use the generator script to generate a set of questions and answers. This is the fastest, but may also be the least accurate. See below for details on how to run the generator script.
- Use the generator script to generate a set of questions and answers, and then manually curate them, rewriting any answers that are subpar and adding missing citations. This is a good middle ground, and is what we recommend.
Additional tips for ground truth data generation:
- Generate more QA pairs than you need, then prune them down manually based on quality and overlap. Remove low quality answers, and remove questions that are too similar to other questions.
- Be aware of the knowledge distribution in the document set, so you effectively sample questions across the knowledge space.
- Once your chat application is live, continually sample live user questions (in accordance with your privacy policy) to make sure you're representing the sorts of questions that users are asking.
This repo includes a script for generating questions and answers from documents stored in Azure AI Search.
> [!IMPORTANT]
> The generator script can only generate English Q/A pairs right now, due to limitations in the azure-ai-generative SDK.
- Create a `.env` file by copying `.env.sample`.
- Fill in the values for your Azure AI Search instance:

  ```shell
  AZURE_SEARCH_SERVICE="<service-name>"
  AZURE_SEARCH_INDEX="<index-name>"
  AZURE_SEARCH_KEY=""
  ```

  The key may not be necessary if the service is configured for keyless access from your account. If you do provide a key, it's best to use a query key, since the script only requires that level of access.

- Run the generator script:

  ```shell
  python -m scripts generate --output=example_input/qa.jsonl --numquestions=200 --persource=5
  ```

  That script will generate 200 questions and answers and store them in `example_input/qa.jsonl`. We've already provided an example based on the sample documents for this app.

  To further customize the generator beyond the `numquestions` and `persource` parameters, modify `scripts/generate.py`.

  Optional: By default, this script assumes your index's citation field is named `sourcepage`. If your search index uses a different citation field name, pass the `citationfieldname` option to specify the correct name:

  ```shell
  python -m scripts generate --output=example_input/qa.jsonl --numquestions=200 --persource=5 --citationfieldname=filepath
  ```
## Running an evaluation

We provide a script that loads in the current `azd` environment's variables, installs the requirements for the evaluation, and runs the evaluation against the local app. Run it like this:

```shell
python -m scripts evaluate --config=example_config.json
```
The config.json should contain these fields as a minimum:
```json
{
    "testdata_path": "example_input/qa.jsonl",
    "target_url": "http://localhost:50505/chat",
    "requested_metrics": ["groundedness", "relevance", "coherence", "latency", "answer_length"],
    "results_dir": "example_results/experiment<TIMESTAMP>"
}
```
If you're running this evaluator in a container and your app is running in a container on the same system, use a URL like this for the `target_url`:

```json
"target_url": "http://host.docker.internal:50505/chat"
```
To run against a deployed endpoint, change the `target_url` to the chat endpoint of the deployed app:

```json
"target_url": "https://app-backend-j25rgqsibtmlo.azurewebsites.net/chat"
```
It's common to run the evaluation on a subset of the questions, to get a quick sense of how the changes are affecting the answers. To do this, use the `--numquestions` parameter:

```shell
python -m scripts evaluate --config=example_config.json --numquestions=2
```
The `evaluate` command will use the metrics specified in the `requested_metrics` field of the config JSON.
Some of those metrics are built into the evaluation SDK, and the rest are custom metrics that we've added.
These metrics are calculated by sending a call to the GPT model, asking it to provide a 1-5 rating, and storing that rating.
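For intuition, each prompt-based metric follows roughly the pattern sketched below: send the question and answer (and, for some metrics, the context) to the GPT model with a rating rubric, then parse the 1-5 rating out of the reply. This is a simplified illustration using the `openai` package, not the evaluation SDK's actual implementation; the function name and rubric wording are placeholders.

```python
# Sketch of a prompt-based 1-5 rater (illustrative pattern, not the evaluation SDK's implementation)
import re

from openai import AzureOpenAI


def rate_coherence(client: AzureOpenAI, deployment: str, question: str, answer: str) -> int:
    """Ask the GPT model for a 1-5 coherence rating and parse the digit out of its reply."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {
                "role": "system",
                "content": "Rate the coherence of the answer on a scale of 1 to 5. Reply with only the number.",
            },
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
        temperature=0.0,
    )
    match = re.search(r"[1-5]", response.choices[0].message.content)
    return int(match.group()) if match else 1
```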
> [!IMPORTANT]
> The built-in metrics are only intended for evaluating English-language answers, since they use English-language prompts internally. For non-English languages, you should use the custom prompt metrics instead.
- `gpt_coherence` measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language.
- `gpt_relevance` assesses the ability of answers to capture the key points of the context.
- `gpt_groundedness` assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context.
- `gpt_similarity` measures the similarity between a source data (ground truth) sentence and the generated response by an AI model.
- `gpt_fluency` measures the grammatical proficiency of a generative AI's predicted answer.
- `f1_score` measures the ratio of the number of shared words between the model generation and the ground truth answers.
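For reference, a word-overlap F1 score is typically computed like the sketch below; this is a minimal illustration and may not match the SDK's exact tokenization:

```python
# Sketch of a word-overlap F1 score between a generated answer and the ground truth (illustrative only)
from collections import Counter


def f1_score(generated: str, truth: str) -> float:
    gen_tokens = generated.lower().split()
    truth_tokens = truth.lower().split()
    common = Counter(gen_tokens) & Counter(truth_tokens)  # multiset intersection of shared words
    num_shared = sum(common.values())
    if num_shared == 0:
        return 0.0
    precision = num_shared / len(gen_tokens)
    recall = num_shared / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```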
The following metrics are implemented very similarly to the built-in metrics, but use a locally stored prompt. They're a great fit if you find that the built-in metrics are not working well for you or if you need to translate the prompt to another language.
- `mycoherence`: Measures how well the language model can produce output that flows smoothly, reads naturally, and resembles human-like language. Based on `scripts/evaluate_metrics/prompts/coherence.prompty`.
- `myrelevance`: Assesses the ability of answers to capture the key points of the context. Based on `scripts/evaluate_metrics/prompts/relevance.prompty`.
- `mygroundedness`: Assesses the correspondence between claims in an AI-generated answer and the source context, making sure that these claims are substantiated by the context. Based on `scripts/evaluate_metrics/prompts/groundedness.prompty`.
These metrics are calculated with some local code based on the results of the chat app, and do not require a call to the GPT model.
- `latency`: The time it takes for the chat app to generate an answer, in seconds.
- `length`: The length of the generated answer, in characters.
- `has_citation`: Whether the answer contains a correctly formatted citation to a source document, assuming citations are in square brackets.
- `citation_match`: Whether the answer contains at least all of the citations that were in the ground truth answer.
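Because citations are assumed to be in square brackets, the last two metrics can be approximated with a regular expression, roughly as sketched below (an illustration of the idea, not this repo's exact code):

```python
# Sketch of citation checks based on square-bracket citations (illustrative only)
import re

CITATION_PATTERN = re.compile(r"\[([^\]]+)\]")  # matches text like [somefile.pdf#page=2]


def has_citation(answer: str) -> bool:
    """True if the answer contains at least one square-bracket citation."""
    return bool(CITATION_PATTERN.search(answer))


def citation_match(answer: str, truth: str) -> bool:
    """True if every citation in the ground truth answer also appears in the generated answer."""
    truth_citations = set(CITATION_PATTERN.findall(truth))
    answer_citations = set(CITATION_PATTERN.findall(answer))
    return truth_citations <= answer_citations
```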
This repo assumes that your chat app is following the AI Chat Protocol, which means that all POST requests look like this:

```json
{
    "messages": [{"content": "<Actual user question goes here>", "role": "user"}],
    "context": {...}
}
```
Any additional app parameters would be specified in the `context` of that JSON, such as temperature, search settings, prompt overrides, etc. To specify those parameters, add a `target_parameters` key to your config JSON. For example:

```json
"target_parameters": {
    "overrides": {
        "semantic_ranker": false,
        "prompt_template": "<READFILE>example_input/prompt_refined.txt"
    }
}
```
The `overrides` key is the same as the `overrides` key in the `context` of the POST request.
As a convenience, you can use the `<READFILE>` prefix to read in a file and use its contents as the value for the parameter.
That way, you can store potentially long prompts separately from the config JSON file.
The evaluator needs to know where to find the answer and context in the response from the chat app. If your app returns responses following the recommendations of the AI Chat Protocol, then the answer will be at `message.content` and the context will be a list of strings at `context.data_points.text`.
If your app returns responses in a different format, you can specify the JMESPath expressions to extract the answer and context from the response. For example:

```json
"target_response_answer_jmespath": "message.content",
"target_response_context_jmespath": "context.data_points.text"
```
## Viewing the results

The results of each evaluation are stored in a results folder (defaulting to `example_results`).
Inside each run's folder, you'll find:
- `eval_results.jsonl`: Each question and answer, along with the GPT metrics for each QA pair.
- `parameters.json`: The parameters used for the run, like the overrides.
- `summary.json`: The overall results, like the average GPT metrics.
- `config.json`: The original config used for the run. This is useful for reproducing the run.
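If you want to inspect a run programmatically rather than through the review tools, the per-question results are plain JSONL, so something like the sketch below works; the metric column names depend on which `requested_metrics` you configured, and the folder name is whatever you set as `results_dir`:

```python
# Sketch: loading a run's per-question results into a DataFrame for quick inspection
import pandas as pd

results = pd.read_json("example_results/experiment<TIMESTAMP>/eval_results.jsonl", lines=True)
print(results.columns.tolist())           # see which columns this run produced
print(results.mean(numeric_only=True))    # average of the numeric metrics across questions
```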
To make it easier to view and compare results across runs, we've built a few tools, located inside the `review-tools` folder.

To view a summary across all the runs, use the `summary` command with the path to the results folder:

```shell
python -m review_tools summary example_results
```
This will display an interactive table with the results for each run.
To see the parameters used for a particular run, select the folder name. A modal will appear with the parameters, including any prompt override.
To compare the answers generated for each question across two runs, use the `diff` command with two result folder paths:

```shell
python -m review_tools diff example_results/baseline_1 example_results/baseline_2
```
This will display each question, one at a time, with the two generated answers in scrollable panes, and the GPT metrics below each answer.
Use the buttons at the bottom to navigate to the next question or quit the tool.
You can also filter to only show questions where the value changed for a particular metric, like this:

```shell
python -m review_tools diff example_results/baseline_1 example_results/baseline_2 --changed=has_citation
```
## Measuring app's ability to say "I don't know"

The evaluation flow described above focused on evaluating a model's answers for a set of questions that could be answered by the data. But what about all those questions that can't be answered by the data? Does your model know how to say "I don't know"? The GPT models are trained to try to be helpful, so their tendency is to always give some sort of answer, especially for answers that were in their training data. If you want to ensure your app can say "I don't know" when it should, you need to evaluate it on a different set of questions with a different metric.
For this evaluation, our ground truth data needs to be a set of questions that should provoke an "I don't know" response from the app. There are several categories of such questions:

- Unknowable: Questions that are related to the sources but not actually in them (and not public knowledge).
- Uncitable: Questions whose answers are well known to the LLM from its training data, but are not in the sources. There are two flavors of these:
  - Related: Similar topics to the sources, so the LLM will be particularly tempted to think the sources know.
  - Unrelated: Completely unrelated to the sources, so the LLM shouldn't be as tempted to think the sources know.
- Nonsensical: Questions that are non-questions, that a human would scratch their head at and ask for clarification.
You can write these questions manually, but it's also possible to generate them using a generator script in this repo, assuming you already have ground truth data with answerable questions:

```shell
python -m scripts generate-dontknows --input=example_input/qa.jsonl --output=example_input/qa_dontknows.jsonl --numquestions=45
```
That script sends the current questions to the configured GPT-4 model along with prompts to generate questions of each kind.
When it's done, you should review and curate the resulting ground truth data. Pay special attention to the "unknowable" questions at the top of the file, since you may decide that some of those are actually knowable, and you may want to reword or rewrite them entirely.
This repo contains a custom GPT metric called "dontknowness" that rates answers from 1-5, where 1 is "answered the question completely with no uncertainty" and 5 is "said it didn't know and attempted no answer". The goal is for all answers to be rated 4 or 5.
Here's an example configuration JSON that requests that metric, referencing the new ground truth data and a new output folder:

```json
{
    "testdata_path": "example_input/qa_dontknows.jsonl",
    "results_dir": "example_results_dontknows/baseline",
    "requested_metrics": ["dontknowness", "answer_length", "latency", "has_citation"],
    "target_url": "http://localhost:50505/chat",
    "target_parameters": {
    },
    "target_response_answer_jmespath": "message.content",
    "target_response_context_jmespath": "context.data_points.text"
}
```
We recommend a separate output folder, as you'll likely want to make multiple runs and easily compare between those runs using the review tools.
Run the evaluation like this:

```shell
python -m scripts evaluate --config=dontknows.config.json
```
The results will be stored in the `results_dir` folder, and can be reviewed using the review tools.
If the app is not saying "I don't know" enough, you can use the `diff` tool to compare the answers for the "dontknows" questions across runs, and see if the answers are improving. Changes you can try:
- Adjust the prompt to encourage the model to say "I don't know" more often. Remove anything in the prompt that might be distracting or overly encouraging it to answer.
- Try using GPT-4 instead of GPT-3.5. The results will be slower (see the latency column) but it may be more likely to say "I don't know" when it should.
- Adjust the temperature of the model used by your app.
- Add an additional LLM step in your app after generating the answer, to have the LLM rate its own confidence that the answer is found in the sources. If the confidence is low, the app should say "I don't know". A sketch of this idea is shown below.
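For that last idea, here is a minimal sketch of a post-answer confidence check, assuming an Azure OpenAI client like the one configured earlier; the function name, prompt wording, and threshold are illustrative placeholders, not part of this repo:

```python
# Sketch: a post-answer self-confidence check (names, prompt, and threshold are placeholders)
import re

from openai import AzureOpenAI


def confidence_guard(client: AzureOpenAI, deployment: str, answer: str, sources: str) -> str:
    """Ask the LLM to rate its own confidence that the answer is supported by the sources."""
    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {
                "role": "system",
                "content": "Rate from 1 to 5 how confident you are that the answer is fully supported by the sources. Reply with only the number.",
            },
            {"role": "user", "content": f"Sources:\n{sources}\n\nAnswer:\n{answer}"},
        ],
        temperature=0.0,
    )
    match = re.search(r"[1-5]", response.choices[0].message.content)
    rating = int(match.group()) if match else 1
    return answer if rating >= 4 else "I don't know."
```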