rhesis

Open-source platform & SDK for testing LLM and agentic apps. Define expected behavior, generate and run test scenarios, and review failures collaboratively.

Stars: 277


Rhesis is a test management platform for Gen AI teams, offering tools to create, manage, and execute test cases for generative AI applications. Features such as test set management, automated test generation, edge-case discovery, compliance validation, integrations, and performance tracking help teams keep AI systems robust, reliable, and compliant. The platform is open source, emphasizing community-driven development, transparency, an extensible architecture, and democratizing AI safety. It comprises backend services, frontend applications, an SDK for developers, worker services, chatbot applications, and Polyphemus, an uncensored LLM service. Rhesis helps users address challenges specific to testing generative AI applications, such as non-deterministic outputs, hallucinations, edge cases, ethical concerns, and compliance requirements.

README:

Rhesis: Collaborative Testing for LLM & Agentic Applications

Website · Docs · Discord · Changelog

More than just evals.
Collaborative agent testing for teams.

Generate tests from requirements, simulate conversation flows, detect adversarial behaviors, evaluate with 60+ metrics, and trace failures with OpenTelemetry. Engineers and domain experts, working together.

Rhesis Platform Overview (demo video)


Core features

Test generation

AI-Powered Synthesis - Describe requirements in plain language. Rhesis generates hundreds of test scenarios including edge cases and adversarial prompts.

Knowledge-Aware - Connect context sources via file upload or MCP (Notion, GitHub, Jira, Confluence) for better test generation.

Single-turn & conversation simulation

Single-turn for Q&A validation. Conversation simulation for dialogue flows.

Penelope Agent simulates realistic conversations to test context retention, role adherence, and dialogue coherence across extended interactions.

Adversarial testing (red-teaming)

Polyphemus Agent proactively finds vulnerabilities:

  • Jailbreak attempts and prompt injection
  • PII leakage and data extraction
  • Harmful content generation
  • Role violation and instruction bypassing

Garak Integration - Built-in support for garak, the LLM vulnerability scanner, for comprehensive security testing.

60+ pre-built metrics

| Framework | Example Metrics |
| --- | --- |
| RAGAS | Context relevance, faithfulness, answer accuracy |
| DeepEval | Bias, toxicity, PII leakage, role violation, turn relevancy, knowledge retention |
| Garak | Jailbreak detection, prompt injection, XSS, malware generation, data leakage |
| Custom | NumericJudge, CategoricalJudge for domain-specific evaluation |

All metrics include LLM-as-Judge reasoning explanations.
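
To make the LLM-as-Judge idea concrete, here is a minimal, generic sketch of the pattern, written against the OpenAI Python client rather than the Rhesis metric API (Rhesis's NumericJudge and CategoricalJudge wrap the same idea behind their own interface):

import json

from openai import OpenAI  # assumes the openai package and an API key are available

client = OpenAI()

def judge_answer(question: str, answer: str) -> dict:
    """Ask a judge model for a 1-5 accuracy score plus a short reasoning string."""
    prompt = (
        "Rate the following answer for factual accuracy on a 1-5 scale. "
        'Respond as JSON: {"score": <int>, "reasoning": "<why>"}.\n\n'
        f"Question: {question}\nAnswer: {answer}"
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)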

Traces & observability

Monitor your LLM applications with OpenTelemetry-based tracing:

from rhesis.sdk.decorators import observe

@observe.llm(model="gpt-4")
def generate_response(prompt: str) -> str:
    # Your LLM call here
    return response

Track LLM calls, latency, token usage, and link traces to test results for debugging.
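
The decorator does not change how the function is called; invoking it as usual is enough for the span to be recorded:

# Called like any other function; the @observe.llm decorator records the trace for this call
answer = generate_response("Summarize the refund policy in one sentence.")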

Bring your own model

Use any LLM provider for test generation and evaluation:

Cloud: OpenAI, Anthropic, Google Gemini, Mistral, Cohere, Groq, Together AI

Local/Self-hosted: Ollama, vLLM, LiteLLM

See Model Configuration Docs for setup instructions.


Why Rhesis?

Platform for teams. SDK for developers.

Use the collaborative platform for team-based testing: product managers define requirements, domain experts review results, engineers integrate via CI/CD. Or integrate directly with the Python SDK for code-first workflows.

The testing lifecycle

Six integrated phases from project setup to team collaboration:

| Phase | What You Do |
| --- | --- |
| 1. Projects | Configure your AI application, upload & connect context sources (files, docs), set up SDK connectors |
| 2. Requirements | Define expected behaviors (what your app should and shouldn't do), covering all relevant aspects from product, marketing, customer support, legal, and compliance teams |
| 3. Metrics | Select from 60+ pre-built metrics or create custom LLM-as-Judge evaluations to assess whether your requirements are met |
| 4. Tests | Generate single-turn and conversation simulation test scenarios. Organize them into test sets and understand your test coverage |
| 5. Execution | Run tests via UI, SDK, or API; integrate into CI/CD pipelines; collect traces during execution |
| 6. Collaboration | Review results with your team through comments, tasks, workflows, and side-by-side comparisons |

Rhesis vs...

| Instead of... | Rhesis gives you... |
| --- | --- |
| Manual testing | AI-generated test cases based on your context, hundreds in minutes |
| Traditional test frameworks | Non-deterministic output handling built in |
| LLM observability tools | Pre-production validation, not post-production monitoring |
| Red-teaming services | Continuous, self-service adversarial testing, not one-time audits |

What you can test

| Use Case | What Rhesis Tests |
| --- | --- |
| Conversational AI | Conversation simulation, role adherence, knowledge retention |
| RAG Systems | Context relevance, faithfulness, hallucination detection |
| NL-to-SQL / NL-to-Code | Query accuracy, syntax validation, edge-case handling |
| Agentic Systems | Tool selection, goal achievement, multi-agent coordination |

SDK: Code-first testing

Test your Python functions directly with the @endpoint decorator:

from rhesis.sdk.decorators import endpoint

@endpoint(name="my-chatbot")
def chat(message: str) -> str:
    # Your LLM logic here
    return response

Features: Zero configuration, automatic parameter binding, auto-reconnection, environment management (dev/staging/production).
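
For a slightly fuller sketch of the same decorator, here is a version whose body actually calls a model. The OpenAI client usage is an assumption about the application under test; the only Rhesis-specific piece is @endpoint:

from openai import OpenAI  # assumption: the app under test talks to OpenAI
from rhesis.sdk.decorators import endpoint

client = OpenAI()

@endpoint(name="support-chatbot")  # illustrative endpoint name
def chat(message: str) -> str:
    """The function under test: takes a user message, returns the model's reply."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful support assistant."},
            {"role": "user", "content": message},
        ],
    )
    return completion.choices[0].message.content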

Generate tests programmatically:

from rhesis.sdk.synthesizers import PromptSynthesizer

synthesizer = PromptSynthesizer(
    prompt="Generate tests for a medical chatbot that must never provide diagnosis"
)
test_set = synthesizer.generate(num_tests=10)
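
The synthesis prompt can be steered toward specific behaviors. For example, a variation of the call above that targets adversarial phrasing (the prompt wording is illustrative):

# Same API as above, different synthesis prompt
adversarial_set = PromptSynthesizer(
    prompt=(
        "Generate adversarial user messages that try to pressure the medical "
        "chatbot into giving a diagnosis or a dosage recommendation"
    )
).generate(num_tests=25)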

Deployment options

| Option | Best For | Setup Time |
| --- | --- | --- |
| Rhesis Cloud | Teams wanting managed deployment | Instant |
| Docker | Local development and testing | 5 minutes |
| Kubernetes | Production self-hosting | See docs |

Quick Start

Option 1: Cloud (fastest) - app.rhesis.ai - Managed service, just connect your app

Option 2: Self-host with Docker

git clone https://github.com/rhesis-ai/rhesis.git && cd rhesis && ./rh start

Access: Frontend at localhost:3000, API at localhost:8080/docs

Commands: ./rh logs · ./rh stop · ./rh restart · ./rh delete

Note: This setup enables auto-login for local testing. For production, see Self-hosting Documentation.

Option 3: Python SDK

pip install rhesis-sdk

Integrations

Connect Rhesis to your LLM stack:

| Integration | Languages | Description |
| --- | --- | --- |
| Rhesis SDK | Python, JS/TS | Native SDK with decorators for endpoints and observability. Full control over test execution and tracing. |
| OpenAI | Python | Drop-in replacement for the OpenAI SDK. Automatic instrumentation with zero code changes. |
| Anthropic | Python | Native support for Claude models with automatic tracing. |
| LangChain | Python | Add the Rhesis callback handler to your LangChain app for automatic tracing and test execution. |
| LangGraph | Python | Built-in integration for LangGraph agent workflows with full observability. |
| AutoGen | Python | Automatic instrumentation for Microsoft AutoGen multi-agent conversations. |
| LiteLLM | Python | Unified interface for 100+ LLMs (OpenAI, Azure, Anthropic, Cohere, Ollama, vLLM, HuggingFace, Replicate); see the sketch below. |
| Google Gemini | Python | Native integration for Google's Gemini models. |
| Ollama | Python | Local LLM deployment with Ollama integration. |
| OpenRouter | Python | Access to multiple LLM providers through OpenRouter. |
| Vertex AI | Python | Google Cloud Vertex AI model support. |
| HuggingFace | Python | Direct integration with HuggingFace models. |
| REST API | Any | Direct API access for custom integrations. OpenAPI spec available. |

See Integration Docs for setup instructions.
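
As an illustration of what the LiteLLM row above means in practice, the snippet below uses LiteLLM's own interface (independent of the Rhesis integration) to send the same request to a hosted and a local model:

import litellm

# Hosted model (assumes OPENAI_API_KEY is set in the environment)
hosted = litellm.completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)

# Local model served by Ollama (assumes `ollama serve` is running and llama3 is pulled)
local = litellm.completion(
    model="ollama/llama3",
    messages=[{"role": "user", "content": "Say hello in one word."}],
)

print(hosted.choices[0].message.content, local.choices[0].message.content)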


Open source

MIT licensed. No plans to relicense core features. An enterprise version will live in ee/ folders and remain separate.

We built Rhesis because existing LLM testing tools didn't meet our needs. If you face the same challenges, contributions are welcome.


Contributing

See CONTRIBUTING.md for guidelines.

Ways to contribute: Fix bugs or add features · Contribute test sets for common failure modes · Improve documentation · Help others in Discord or GitHub discussions


Support


Security & privacy

We take data security seriously. See our Privacy Policy for details.

Telemetry: Rhesis collects basic, anonymized usage statistics to improve the product. No sensitive data is collected or shared with third parties.

  • Self-hosted: Opt out by setting OTEL_RHESIS_TELEMETRY_ENABLED=false
  • Cloud: Telemetry enabled as part of Terms & Conditions

Made in Potsdam, Germany 🇩🇪

Learn more at rhesis.ai
