eval-assist

EvalAssist is an open-source project that simplifies using large language models as evaluators (LLM-as-a-Judge) of the output of other large language models by supporting users in iteratively refining evaluation criteria in a web-based user experience.

EvalAssist is an LLM-as-a-Judge framework built on top of the Unitxt open source evaluation library for large language models. It provides users with a convenient way of iteratively testing and refining LLM-as-a-judge criteria, supporting both direct (rubric-based) and pairwise assessment paradigms. EvalAssist is model-agnostic, supporting a rich set of off-the-shelf judge models that can be extended. Users can auto-generate a Notebook with Unitxt code to run bulk evaluations and save their own test cases. The tool is designed for evaluating text data using language models.

README:

EvalAssist

Project Website · Documentation · Video demo

EvalAssist is an LLM-as-a-Judge framework built on top of the Unitxt open source evaluation library for large language models. The EvalAssist application provides users with a convenient way of iteratively testing and refining LLM-as-a-judge criteria, and supports both direct (rubric-based) and pairwise (relation-based) assessment paradigms, the two most prevalent forms of LLM-as-a-judge evaluation. EvalAssist is designed to be model-agnostic, i.e., the content to be evaluated can come from any model. It supports a rich set of off-the-shelf judge models that can easily be extended; an API key is required to use the pre-defined judge models. Once users are satisfied with their criteria, they can auto-generate a notebook with Unitxt code to run bulk evaluations on larger data sets based on their criteria definition. EvalAssist also includes a catalog of example test cases illustrating the use of LLM-as-a-judge across a variety of scenarios, and users can save their own test cases.
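
To make the two paradigms concrete, the sketch below contrasts the information a direct (rubric-based) criterion and a pairwise (relation-based) comparison work with. This is illustrative only: the dictionaries and field names are assumptions for exposition, not EvalAssist objects or API.

# Illustrative sketch only; these are not EvalAssist data structures.

# Direct (rubric-based): a single output is scored against named options.
direct_criterion = {
    "name": "conciseness",
    "description": "Is the response concise, without unnecessary repetition?",
    "options": ["Yes", "No"],
}

# Pairwise (relation-based): two candidate outputs are compared against
# the same criterion and the judge picks the better one.
pairwise_task = {
    "criterion": "Which response answers the user's question more faithfully?",
    "candidate_a": "Response from model A ...",
    "candidate_b": "Response from model B ...",
}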

How to install and run EvalAssist

EvalAssist can be installed using various package managers. Before proceeding, ensure you're using Python >= 3.10, <3.14 to avoid compatibility issues. Make sure to set DATA_DIR to avoid data loss (e.g. export DATA_DIR="~/.eval_assist").

Installation via pip

python3 -m venv venv
source venv/bin/activate  # or venv\Scripts\activate.bat on Windows
pip install 'evalassist[webapp]'
eval-assist serve

Installation via uv

uvx --python 3.11 --from 'evalassist[webapp]' eval-assist serve

Installation via conda

conda create -n evalassist python=3.11
conda activate evalassist
pip install 'evalassist[webapp]'
eval-assist serve

In all cases, after running the command, you can access the EvalAssist server at http://localhost:8000.

EvalAssist can be configured through environment variables and command parameters. Take a look at the configuration documentation.
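
For example, a minimal environment-based setup might look like the sketch below. DATA_DIR and the serve command are the only settings taken from this README; check the configuration documentation for everything else.

# Persist EvalAssist data in a known location to avoid data loss.
export DATA_DIR="$HOME/.eval_assist"

# Start the server, then open http://localhost:8000 in a browser.
eval-assist serve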

Check out the tutorials to see how to run evaluations and generate synthetic data.

Use the EvalAssist backend standalone

You can run LLM-as-a-Judge evaluations using Python only. For example:

from evalassist.judges import SimpleDirectJudge
from evalassist.judges.const import DEFAULT_JUDGE_INFERENCE_PARAMS
from unitxt.inference import CrossProviderInferenceEngine

judge = SimpleDirectJudge(
    inference_engine=CrossProviderInferenceEngine(
        model="llama-3-3-70b-instruct",
        provider="watsonx",
        **DEFAULT_JUDGE_INFERENCE_PARAMS,
    ),
)

results = judge(
    instances=[
        "Use the API client to fetch data from the server and the cache to store frequently accessed results for faster performance."
    ],
    criteria="Is the text self-explanatory and self-contained?",  # Create yes/no direct assessment criteria
)
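
As a quick sanity check you can print what the judge returns. The snippet below assumes the judge yields one result per input instance; the exact fields of each result are described in the judges documentation referenced next.

# Inspect the raw result objects (assumed: one per instance).
for result in results:
    print(result)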

Look at the documentation of the judges sub-package.

Contributing

You can contribute to EvalAssist or to Unitxt. Look at the Contribution Guidelines for more details.

Look at the Local Development Guide for instructions on setting up a local development environment.

Documentation

You can find extensive documentation of the system on the Documentation page.
