
langevals
LangEvals aggregates various language model evaluators into a single platform, providing a standard interface for a wide range of scores and LLM guardrails so you can protect and benchmark your LLM models and pipelines.
Stars: 65

LangEvals is an all-in-one Python library for testing and evaluating LLM models. It can be used in notebooks for exploration, in pytest for writing unit tests, or as a server API for live evaluations and guardrails. The library is modular, with 20+ evaluators including Ragas for RAG quality, OpenAI Moderation, and Azure Jailbreak detection. LangEvals powers LangWatch evaluations and provides tools for batch evaluations on notebooks and unit test evaluations with PyTest. It also offers LangEvals evaluators for LLM-as-a-Judge scenarios and out-of-the-box evaluators for language detection and answer relevancy checks.
README:
LangEvals is the all-in-one library for LLM testing and evaluation in Python. It can be used in notebooks for exploration, in pytest for writing unit tests, or as a server API for live evaluations and guardrails. LangEvals is modular, including 20+ evaluators such as Ragas for RAG quality, OpenAI Moderation and Azure Jailbreak detection for safety, and many others, all under the same interface.
LangEvals is the backend that powers LangWatch evaluations.
To use LangEvals locally, install it as a dependency, together with the evaluators you are interested in:
pip install "langevals[all]"
# or select only the ones you are interested in, e.g.:
pip install "langevals[azure,ragas,lingua]"
Then right away you can start LangEvals as a server with:
langevals-server
And navigate to http://localhost:5562/ to try out the evaluators.
Alternatively, you can use it as a library as the examples below show.
When exploring, it is common to generate a number of outputs from your LLM and then evaluate them all for quality, for example in a Jupyter Notebook. You can use LangEvals' evaluate() to score the results in batch using diverse evaluators:
import langevals
from langevals_ragas.answer_relevancy import RagasAnswerRelevancyEvaluator
from langevals_langevals.competitor_blocklist import (
    CompetitorBlocklistEvaluator,
    CompetitorBlocklistSettings,
)
import pandas as pd

entries = pd.DataFrame(
    {
        "input": ["hello", "how are you?", "what is your name?"],
        "output": ["hi", "I am a chatbot, no feelings", "My name is Bob"],
    }
)

results = langevals.evaluate(
    entries,
    [
        RagasAnswerRelevancyEvaluator(),
        CompetitorBlocklistEvaluator(
            settings=CompetitorBlocklistSettings(competitors=["Bob"])
        ),
    ],
)

results.to_pandas()
Results:
| input | output | answer_relevancy | competitor_blocklist | competitor_blocklist_details |
|---|---|---|---|---|
| hello | hi | 0.800714 | True | None |
| how are you? | I am a chatbot, no feelings | 0.813168 | True | None |
| what is your name? | My name is Bob | 0.971663 | False | Competitors mentioned: Bob |
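Assuming to_pandas() returns a regular pandas DataFrame with one column per evaluator (matching the table above), you can slice it with ordinary pandas operations. A quick sketch, using the column names shown above, to keep only the entries that failed the blocklist check:

df = results.to_pandas()

# keep only the rows flagged by the competitor blocklist evaluator
blocked = df[df["competitor_blocklist"] == False]
print(blocked[["input", "output", "competitor_blocklist_details"]])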
Combining various pytest plugins with LangEvals is a powerful way to write unit tests for LLMs and prevent regressions. Due to the probabilistic nature of LLMs, some extra care is needed, as you will see below.
The first simple case is when LLMs are used where the expected output is fairly unambiguous, for example, extracting address entities from natural language text. In this example we use the instructor library to have the LLM extract values into a pydantic model, together with the litellm library to call multiple LLM models:
from itertools import product

import pytest
import pandas as pd
import instructor
from litellm import completion
from pydantic import BaseModel


class Address(BaseModel):
    number: int
    street_name: str
    city: str
    country: str


entries = pd.DataFrame(
    {
        "input": [
            "Please send the package to 123 Main St, Springfield.",
            "J'ai déménagé récemment à 56 Rue de l'Université, Paris.",
            "A reunião será na Avenida Paulista, 900, São Paulo.",
        ],
        "expected_output": [
            Address(
                number=123, street_name="Main St", city="Springfield", country="USA"
            ).model_dump_json(),
            Address(
                number=56,
                street_name="Rue de l'Université",
                city="Paris",
                country="France",
            ).model_dump_json(),
            Address(
                number=900,
                street_name="Avenida Paulista",
                city="São Paulo",
                country="Brazil",
            ).model_dump_json(),
        ],
    }
)

models = ["gpt-3.5-turbo", "gpt-4-turbo", "groq/llama3-70b-8192"]

client = instructor.from_litellm(completion)


@pytest.mark.parametrize("entry, model", product(entries.itertuples(), models))
@pytest.mark.flaky(max_runs=3)
@pytest.mark.pass_rate(0.6)
def test_extracts_the_right_address(entry, model):
    address = client.chat.completions.create(
        model=model,
        response_model=Address,
        messages=[
            {"role": "user", "content": entry.input},
        ],
        temperature=0.0,
    )

    assert address.model_dump_json() == entry.expected_output
In the example above, our test actually becomes 9 tests, checking address extraction correctness for each of the 3 samples against 3 different models: gpt-3.5-turbo, gpt-4-turbo and groq/llama3-70b-8192. This is done by the @pytest.mark.parametrize annotation together with the product function, which combines entries and models. The actual assertion is a simple assert with an == comparison, as you can see in the last line.
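To make the combination concrete, here is a small standalone sketch (with placeholder entries rather than the real DataFrame) of what itertools.product generates, which is why pytest ends up with 3 × 3 = 9 test cases:

from itertools import product

entries = ["entry_1", "entry_2", "entry_3"]  # placeholder entries
models = ["gpt-3.5-turbo", "gpt-4-turbo", "groq/llama3-70b-8192"]

# every (entry, model) pair becomes its own parametrized test case
for entry, model in product(entries, models):
    print(entry, model)  # 9 combinations in total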
Apart from parametrize, we also use the flaky library for retries with @pytest.mark.flaky(max_runs=3); this effectively gives us 3-shot prompting with our LLM. If you wish, you can also require that the majority of the attempts are correct by using @pytest.mark.flaky(max_runs=3, min_passes=2).
Lastly, we use the @pytest.mark.pass_rate annotation provided by LangEvals, which allows the test to pass even if some samples fail, as they do for example when the model guesses "United States" instead of "USA" for the country field. Since LLMs are probabilistic, this is necessary to bring more stability to your test suite while still ensuring a minimum threshold of accuracy, which in our case is defined as 0.6 (60%).
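As a minimal illustration of the threshold math (the toy data below is made up, and the sketch assumes the langevals pytest plugin is installed so the pass_rate marker is registered): with a pass rate of 0.6, a parametrized group is reported as passing as long as at least 60% of its cases pass.

import pytest


# toy example: 2 out of 3 cases pass (~67%), which clears the 60% threshold,
# so the parametrized group as a whole is reported as passing
@pytest.mark.parametrize("country", ["USA", "France", "United States"])
@pytest.mark.pass_rate(0.6)
def test_country_normalization_sketch(country):
    assert country in ["USA", "France", "Brazil"]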
As things get more nuanced and less objective, exact string matches are no longer possible. We can then rely on LangEvals evaluators to validate many aspects of the LLM inputs and outputs. For complete flexibility, we can use, for example, a custom LLM-as-a-judge with CustomLLMBooleanEvaluator. In the example below, we validate that more than 80% of the generated recipes are vegetarian:
import litellm
import pytest
import pandas as pd
from litellm import ModelResponse
from langevals import expect
from langevals_langevals.llm_boolean import (
    CustomLLMBooleanEvaluator,
    CustomLLMBooleanSettings,
)

entries = pd.DataFrame(
    {
        "input": [
            "Generate me a recipe for a quick breakfast with bacon",
            "Generate me a recipe for a lunch using lentils",
            "Generate me a recipe for a vegetarian dessert",
        ],
    }
)


@pytest.mark.parametrize("entry", entries.itertuples())
@pytest.mark.flaky(max_runs=3)
@pytest.mark.pass_rate(0.8)
def test_llm_as_judge(entry):
    response: ModelResponse = litellm.completion(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "You are a tweet-size recipe generator, just recipe name and ingredients, no yapping.",
            },
            {"role": "user", "content": entry.input},
        ],
        temperature=0.0,
    )  # type: ignore
    recipe = response.choices[0].message.content  # type: ignore

    vegetarian_checker = CustomLLMBooleanEvaluator(
        settings=CustomLLMBooleanSettings(
            prompt="Is the recipe vegetarian?",
        )
    )

    expect(input=entry.input, output=recipe).to_pass(vegetarian_checker)
This test fails with a nice explanation from the LLM judge:
FAILED tests/test_llm_as_judge.py::test_llm_as_judge[entry0] - AssertionError: Custom LLM Boolean Evaluator to_pass FAILED - The recipe for a quick breakfast with bacon includes bacon strips, making it a non-vegetarian recipe.
Notice we use the expect assertion utility, which makes it easy to run the evaluation and prints a nice output with a detailed explanation in case of failures. The expect utility interface is modeled after Jest assertions, so you can expect a somewhat similar API if you are experienced with Jest.
Just like CustomLLMBooleanEvaluator, you can use any other evaluator available from LangEvals to prevent regressions on a variety of cases. For example, here we check that the LLM answers are always in English, regardless of the language used in the question, and we also measure how relevant the answers are to the question:
from langevals_lingua.language_detection import (
    LinguaLanguageDetectionEvaluator,
    LinguaLanguageDetectionSettings,
)
from langevals_ragas.answer_relevancy import RagasAnswerRelevancyEvaluator

entries = pd.DataFrame(
    {
        "input": [
            "What's the connection between 'breaking the ice' and the Titanic's first voyage?",
            "Comment la bataille de Verdun a-t-elle influencé la cuisine française?",
            "¿Puede el musgo participar en la purificación del aire en espacios cerrados?",
        ],
    }
)


@pytest.mark.parametrize("entry", entries.itertuples())
@pytest.mark.flaky(max_runs=3)
@pytest.mark.pass_rate(0.8)
def test_language_and_relevancy(entry):
    response: ModelResponse = litellm.completion(
        model="gpt-3.5-turbo",
        messages=[
            {
                "role": "system",
                "content": "You reply to questions only in English, no matter the language the question was asked in",
            },
            {"role": "user", "content": entry.input},
        ],
        temperature=0.0,
    )  # type: ignore
    answer = response.choices[0].message.content  # type: ignore

    language_checker = LinguaLanguageDetectionEvaluator(
        settings=LinguaLanguageDetectionSettings(
            check_for="output_matches_language",
            expected_language="EN",
        )
    )
    answer_relevancy_checker = RagasAnswerRelevancyEvaluator()

    expect(input=entry.input, output=answer).to_pass(language_checker)
    expect(input=entry.input, output=answer).score(
        answer_relevancy_checker
    ).to_be_greater_than(0.8)
In this example, we are not only validating a boolean assertion, but also making sure that 80% of our samples keep an answer relevancy score above 0.8 from the Ragas Answer Relevancy Evaluator.
LangEvals is a monorepo with many subpackages, each with its own dependencies per evaluator library or provider. We use poetry to install all dependencies and create a virtual env for each subpackage to make sure they are fully isolated. Given this complexity, to make it easier to contribute to LangEvals we recommend using VS Code for development. Before opening the project in VS Code, though, you need to install all dependencies, thus generating the .venv for each package:
make install
This will also generate the langevals.code-workspace file, creating a different workspace per evaluator and telling VS Code which venv to use for each. Then, open this file in VS Code and click the "Open Workspace" button.
To add a completely new evaluator for a library or API that is not already implemented, copy the evaluators/example folder and follow the example/word_count.py boilerplate to implement your own evaluator, adding its dependencies to pyproject.toml and testing it properly, following the test_word_count.py example.
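For orientation only, here is a rough, hypothetical sketch of the shape such an evaluator tends to take; the import path, class names, and fields below are assumptions loosely modeled on the word_count example, so always defer to the actual evaluators/example boilerplate for the real base classes and signatures:

# hypothetical sketch; the import path and base class names are assumptions,
# copy the real structure from evaluators/example/word_count.py instead
from langevals_core.base_evaluator import (
    BaseEvaluator,
    EvaluatorEntry,
    EvaluationResult,
    EvaluatorSettings,
)


class MyWordCountEntry(EvaluatorEntry):
    output: str


class MyWordCountSettings(EvaluatorSettings):
    pass


class MyWordCountResult(EvaluationResult):
    score: float


class MyWordCountEvaluator(
    BaseEvaluator[MyWordCountEntry, MyWordCountSettings, MyWordCountResult]
):
    name = "My Word Count"

    def evaluate(self, entry: MyWordCountEntry) -> MyWordCountResult:
        # score the output by its number of words
        return MyWordCountResult(score=len(entry.output.split()))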
If you want to add a new eval to an existing evaluator package (say, if OpenAI launches a new API for example), simply create a new Python file next to the existing ones.
To test it all together, run:
make lock
make install
make test
Similar Open Source Tools


airflow-ai-sdk
This repository contains an SDK for working with LLMs from Apache Airflow, based on Pydantic AI. It allows users to call LLMs and orchestrate agent calls directly within their Airflow pipelines using decorator-based tasks. The SDK leverages the familiar Airflow `@task` syntax with extensions like `@task.llm`, `@task.llm_branch`, and `@task.agent`. Users can define tasks that call language models, orchestrate multi-step AI reasoning, change the control flow of a DAG based on LLM output, and support various models in the Pydantic AI library. The SDK is designed to integrate LLM workflows into Airflow pipelines, from simple LLM calls to complex agentic workflows.

backtrack_sampler
Backtrack Sampler is a framework for experimenting with custom sampling algorithms that can backtrack the latest generated tokens. It provides a simple and easy-to-understand codebase for creating new sampling strategies. Users can implement their own strategies by creating new files in the `/strategy` directory. The repo includes examples for usage with llama.cpp and transformers, showcasing different strategies like Creative Writing, Anti-slop, Debug, Human Guidance, Adaptive Temperature, and Replace. The goal is to encourage experimentation and customization of backtracking algorithms for language models.

Tools4AI
Tools4AI is a Java-based Agentic Framework for building AI agents to integrate with enterprise Java applications. It enables the conversion of natural language prompts into actionable behaviors, streamlining user interactions with complex systems. By leveraging AI capabilities, it enhances productivity and innovation across diverse applications. The framework allows for seamless integration of AI with various systems, such as customer service applications, to interpret user requests, trigger actions, and streamline workflows. Prompt prediction anticipates user actions based on input prompts, enhancing user experience by proactively suggesting relevant actions or services based on context.

keras-hub
KerasHub is a pretrained modeling library that provides Keras 3 implementations of popular model architectures with pretrained checkpoints. It supports text, image, and audio data for generation, classification, and other tasks. Models are compatible with JAX, TensorFlow, and PyTorch, and can be fine-tuned on GPUs and TPUs. KerasHub components are provided as Layer and Model implementations, extending the core Keras API.

npcsh
`npcsh` is a python-based command-line tool designed to integrate Large Language Models (LLMs) and Agents into one's daily workflow by making them available and easily configurable through the command line shell. It leverages the power of LLMs to understand natural language commands and questions, execute tasks, answer queries, and provide relevant information from local files and the web. Users can also build their own tools and call them like macros from the shell. `npcsh` allows users to take advantage of agents (i.e. NPCs) through a managed system, tailoring NPCs to specific tasks and workflows. The tool is extensible with Python, providing useful functions for interacting with LLMs, including explicit coverage for popular providers like ollama, anthropic, openai, gemini, deepseek, and openai-like providers. Users can set up a flask server to expose their NPC team for use as a backend service, run SQL models defined in their project, execute assembly lines, and verify the integrity of their NPC team's interrelations. Users can execute bash commands directly, use favorite command-line tools like VIM, Emacs, ipython, sqlite3, git, pipe the output of these commands to LLMs, or pass LLM results to bash commands.

experts
Experts.js is a tool that simplifies the creation and deployment of OpenAI's Assistants, allowing users to link them together as Tools to create a Panel of Experts system with expanded memory and attention to detail. It leverages the new Assistants API from OpenAI, which offers advanced features such as referencing attached files & images as knowledge sources, supporting instructions up to 256,000 characters, integrating with 128 tools, and utilizing the Vector Store API for efficient file search. Experts.js introduces Assistants as Tools, enabling the creation of Multi AI Agent Systems where each Tool is an LLM-backed Assistant that can take on specialized roles or fulfill complex tasks.

palimpzest
Palimpzest (PZ) is a tool for managing and optimizing workloads, particularly for data processing tasks. It provides a CLI tool and Python demos for users to register datasets, run workloads, and access results. Users can easily initialize their system, register datasets, and manage configurations using the CLI commands provided. Palimpzest also supports caching intermediate results and configuring for parallel execution with remote services like OpenAI and together.ai. The tool aims to streamline the workflow of working with datasets and optimizing performance for data extraction tasks.

langchain
LangChain is a framework for developing Elixir applications powered by language models. It enables applications to connect language models to other data sources and interact with the environment. The library provides components for working with language models and off-the-shelf chains for specific tasks. It aims to assist in building applications that combine large language models with other sources of computation or knowledge. LangChain is written in Elixir and is not aimed for parity with the JavaScript and Python versions due to differences in programming paradigms and design choices. The library is designed to make it easy to integrate language models into applications and expose features, data, and functionality to the models.

chores
The chores package provides a library of ergonomic LLM assistants designed to help users complete repetitive, hard-to-automate tasks quickly. Users can select code, trigger the chores addin, choose a helper, and watch their code be rewritten. The package offers chore helpers for tasks like converting to cli, testthat, and documenting functions with roxygen. Users can also create their own chore helpers by providing instructions in a markdown file. The cost of using helpers depends on the length of the prompt and the model chosen.

ActionWeaver
ActionWeaver is an AI application framework designed for simplicity, relying on OpenAI and Pydantic. It supports both OpenAI API and Azure OpenAI service. The framework allows for function calling as a core feature, extensibility to integrate any Python code, function orchestration for building complex call hierarchies, and telemetry and observability integration. Users can easily install ActionWeaver using pip and leverage its capabilities to create, invoke, and orchestrate actions with the language model. The framework also provides structured extraction using Pydantic models and allows for exception handling customization. Contributions to the project are welcome, and users are encouraged to cite ActionWeaver if found useful.

paper-qa
PaperQA is a minimal package for question and answering from PDFs or text files, providing very good answers with in-text citations. It uses OpenAI Embeddings to embed and search documents, and includes a process of embedding docs, queries, searching for top passages, creating summaries, using an LLM to re-score and select relevant summaries, putting summaries into prompt, and generating answers. The tool can be used to answer specific questions related to scientific research by leveraging citations and relevant passages from documents.

forust
Forust is a lightweight package for building gradient boosted decision tree ensembles. The algorithm code is written in Rust with a Python wrapper. It implements the same algorithm as XGBoost and provides nearly identical results. The package was developed to better understand XGBoost, as a fun project in Rust, and to experiment with adding new features to the algorithm in a simpler codebase. Forust allows training gradient boosted decision tree ensembles with multiple objective functions, predicting on datasets, inspecting model structures, calculating feature importance, and saving/loading trained boosters.

AI
AI is an open-source Swift framework for interfacing with generative AI. It provides functionalities for text completions, image-to-text vision, function calling, DALLE-3 image generation, audio transcription and generation, and text embeddings. The framework supports multiple AI models from providers like OpenAI, Anthropic, Mistral, Groq, and ElevenLabs. Users can easily integrate AI capabilities into their Swift projects using AI framework.

ai2-scholarqa-lib
Ai2 Scholar QA is a system for answering scientific queries and literature review by gathering evidence from multiple documents across a corpus and synthesizing an organized report with evidence for each claim. It consists of a retrieval component and a three-step generator pipeline. The retrieval component fetches relevant evidence passages using the Semantic Scholar public API and reranks them. The generator pipeline includes quote extraction, planning and clustering, and summary generation. The system is powered by the ScholarQA class, which includes components like PaperFinder and MultiStepQAPipeline. It requires environment variables for Semantic Scholar API and LLMs, and can be run as local docker containers or embedded into another application as a Python package.

minions
Minions is a communication protocol that enables small on-device models to collaborate with frontier models in the cloud. By only reading long contexts locally, it reduces cloud costs with minimal or no quality degradation. The repository provides a demonstration of the protocol.
For similar tasks

langchain-benchmarks
A package to help benchmark various LLM related tasks. The benchmarks are organized by end-to-end use cases, and utilize LangSmith heavily. We have several goals in open sourcing this: * Showing how we collect our benchmark datasets for each task * Showing what the benchmark datasets we use for each task is * Showing how we evaluate each task * Encouraging others to benchmark their solutions on these tasks (we are always looking for better ways of doing things!)

LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing
LLM-PowerHouse is a comprehensive and curated guide designed to empower developers, researchers, and enthusiasts to harness the true capabilities of Large Language Models (LLMs) and build intelligent applications that push the boundaries of natural language understanding. This GitHub repository provides in-depth articles, codebase mastery, LLM PlayLab, and resources for cost analysis and network visualization. It covers various aspects of LLMs, including NLP, models, training, evaluation metrics, open LLMs, and more. The repository also includes a collection of code examples and tutorials to help users build and deploy LLM-based applications.

Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, benchmarks, demos, papers for Large Language Models (like ChatGPT, LLaMA, GLM, Baichuan, etc) Evaluation on Language capabilities, Knowledge, Reasoning, Fairness and Safety.

moonshot
Moonshot is a simple and modular tool developed by the AI Verify Foundation to evaluate Large Language Models (LLMs) and LLM applications. It brings Benchmarking and Red-Teaming together to assist AI developers, compliance teams, and AI system owners in assessing LLM performance. Moonshot can be accessed through various interfaces including User-friendly Web UI, Interactive Command Line Interface, and seamless integration into MLOps workflows via Library APIs or Web APIs. It offers features like benchmarking LLMs from popular model providers, running relevant tests, creating custom cookbooks and recipes, and automating Red Teaming to identify vulnerabilities in AI systems.

Cherry_LLM
Cherry Data Selection project introduces a self-guided methodology for LLMs to autonomously discern and select cherry samples from open-source datasets, minimizing manual curation and cost for instruction tuning. The project focuses on selecting impactful training samples ('cherry data') to enhance LLM instruction tuning by estimating instruction-following difficulty. The method involves phases like 'Learning from Brief Experience', 'Evaluating Based on Experience', and 'Retraining from Self-Guided Experience' to improve LLM performance.

self-learn-llms
Self Learn LLMs is a repository containing resources for self-learning about Large Language Models. It includes theoretical and practical hands-on resources to facilitate learning. The repository aims to provide a clear roadmap with milestones for proper understanding of LLMs. The owner plans to refactor the repository to remove irrelevant content, organize model zoo better, and enhance the learning experience by adding contributors and hosting notes, tutorials, and open discussions.


vim-airline
Vim-airline is a lean and mean status/tabline plugin for Vim that provides a nice statusline at the bottom of each Vim window. It consists of several sections displaying information such as mode, environment status, filename, filetype, file encoding, and current position in the file. The plugin is highly customizable and integrates with various plugins, providing a tiny core with extensibility in mind. It is optimized for speed, supports multiple themes, and integrates seamlessly with other plugins. Vim-airline is written in 100% Vimscript, eliminating the need for Python. The plugin aims to be stable and includes a unit testing suite for reliability.
For similar jobs

weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.