arthur-engine

arthur-engine

Make AI work for Everyone - Monitoring and governing for your AI/ML

Stars: 55

Visit
 screenshot

The Arthur Engine is a comprehensive tool for monitoring and governing AI/ML workloads. It provides evaluation and benchmarking of machine learning models, guardrails enforcement, and extensibility for fitting into various application architectures. With support for a wide range of evaluation metrics and customizable features, the tool aims to improve model understanding, optimize generative AI outputs, and prevent data-security and compliance risks. Key features include real-time guardrails, model performance monitoring, feature importance visualization, error breakdowns, and support for custom metrics and models integration.

README:

Arthur AI Logo

Make AI work for Everyone.

GenAI Engine CI Discord

Website - Documentation - Talk to someone at Arthur

The Arthur Engine

The Arthur Engine provides a complete service for monitoring and governing your AI/ML workloads using popular Open-Source technologies and frameworks. It is a tool designed for:

  • Evaluating and Benchmarking Machine Learning models
    • Support for a wide range of evaluation metrics (e.g., drift, accuracy, precision, recall, F1, and AUC)
    • Tools for comparing models, exploring feature importance, and identifying areas for optimization
    • For LLMs/GenAI applications, measure and monitor response relevance, hallucination rates, token counts, latency, and more
  • Enforcing guardrails in your LLM Applications and Generative AI Workflows
    • Configurable metrics for real-time detection of PII or Sensitive Data leakage, Hallucination, Prompt Injection attempts, Toxic language, and other quality metrics
  • Extensibility to fit into your application's architecture
    • Support for plug-and-play metrics and extensible API so you can bring your own custom-models or popular open-source models (inc. HuggingFace, etc.)

Quickstart - See Examples

  1. Clone the repository and cd deployment/docker-compose/genai-engine
  2. Create .env file from .env.template file and modify it (more instructions can be found in README on the current path)
  3. Run docker compose up
  4. Wait for the genai-engine container to initialize then navigate to localhost:3030/docs to see the API docs
  5. Start building!.

Arthur Platform Free Version

The genai-engine standalone deployment in the Quickstart provides powerful LLM evaluation and guardrailing features. To unlock the full capabilities of the Arthur Platform, sign up and get started for free.

Arthur GenAI Evals

Arthur Platform Enterprise Version

The enterprise version of the Arthur Platform provides better performance, additional features, and capabilities, including custom enterprise-ready guardrails + metrics, which can maximize the potential of AI for your organization.

Key features:

  • State-of-the-art proprietary evaluation models trained by Arthur's world-class machine learning engineering team
  • Airgapped deployment of the Arthur Engine (no dependency to Hugging Face Hub)
  • Optional on-premises deployment of the entire Arthur Platform
  • Support from the world-class engineering teams at Arthur

To learn more about the enterprise version of the Arthur Platform, reach out!

Performance Comparison between Free vs Enterprise version of Arthur Engine :

Enterprise version of Arthur Engine leverages state-of-the-art high-performing, low latency proprietary models for some of the LLM evaluations. Please see below for a detailed comparison between open-source vs enterprise performance.

Evaluation Type Dataset Free Version Performance (f1) Enterprise Performance (f1) Free Version Average Latency per Inference (s) Enterprise Average Latency per Inference (s)
Prompt Injection deepset 0.52 (0.44, 0.60) 0.89 (0.85, 0.93) 0.966 0.03
Prompt Injection Arthur’s Custom Benchmark 0.79 (0.62, 0.93) 0.85 (0.71, 0.96) 0.16 0.005
Toxicity Arthur’s Custom Benchmark 0.633 (0.45, 0.79) 0.89 (0.85, 0.93) 3.096 0.0358

Overview

The Arthur Engine is built with a focus on transparency and explainability, this framework provides users with comprehensive performance metrics, error analysis, and interpretable results to improve model understanding and outcomes. With support for plug-and-play metrics and extensible APIs, the Arthur Engine simplifies the process of understanding and optimizing generative AI outputs. The Arthur Engine can prevent data-security and compliance risks from creating negative or harmful experiences for your users in production or negatively impacting your organization's reputation.

Key Features:

  • Evaluate models on structured/tabular datasets with customizable metrics
  • Evaluate LLMs and generative AI workflows with customizable metrics
  • Support building real-time guardrails for LLM applications and agentic workflows
  • Trace and monitor model performance over time
  • Visualize feature importance and error breakdowns
  • Compare multiple models side-by-side
  • Extensible APIs for custom metric development or for using custom models
  • Integration with popular libraries like LangChain or LlamaIndex (coming soon!)

LLM Evaluations:

Eval Technique Source Docs
Hallucination Claim-based LLMJudge technique Source Docs
Prompt Injection Open Source: Using deberta-v3-base-prompt-injection-v2 Source Docs
Toxicity Open Source: Using roberta_toxicity_classifier Source Docs
Sensitive Data Few-shot optimized LLM Judge technique Source Docs
Personally Identifiable Information Using presidio based off Named-Entity recognition Source Docs
CustomRules Extend the service to support whatever monitoring or guardrails are applicable for your use-case Build your own! Docs

NB: We have provided open-source models for Prompt Injection and Toxicity evaluation as default in the free version of Arthur. In the case that you already have custom solutions for these evaluations and would like to use them, the models used for Prompt Injection and Toxicity are fully customizable and can be substituted out here (PI Code Pointer, Toxicity Code Pointer). If you are interested in higher performing and/or lower latency evaluations out of the box, please enquire about the enterprise version of Arthur Engine.

Contributing

  • Join the Arthur community on Discord to get help and share your feedback.
  • To make a request for a bug fix or a new feature, please file a Github issue.
  • For making code contributions, please review the contributing guidelines.

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for arthur-engine

Similar Open Source Tools

For similar tasks

For similar jobs