Quantus
Quantus is an eXplainable AI toolkit for responsible evaluation of neural network explanations
Stars: 502
Quantus is a toolkit designed for the evaluation of neural network explanations. It offers more than 30 metrics in 6 categories for eXplainable Artificial Intelligence (XAI) evaluation. The toolkit supports different data types (image, time-series, tabular, NLP) and models (PyTorch, TensorFlow). It provides built-in support for explanation methods like captum, tf-explain, and zennit. Quantus is under active development and aims to provide a comprehensive set of quantitative evaluation metrics for XAI methods.
README:
Quantus is currently under active development, so carefully note the Quantus release version to ensure reproducibility of your work.
- If you want to contribute, improve, or extend Quantus, join our Discord!
- New metrics added: EfficientMPRT and SmoothMPRT by Hedström et al. (2023)
- Released a new version here, which discontinues support for Python 3.7
- Accepted to the Journal of Machine Learning Research (MLOSS), read the paper
- Offers 30+ metrics in 6 categories for XAI evaluation
- Supports different data types (image, time-series, tabular; NLP next up!) and models (PyTorch, TensorFlow)
- Extended built-in support for explanation methods (captum, tf-explain and zennit)
If you find this toolkit or its companion paper Quantus: An Explainable AI Toolkit for Responsible Evaluation of Neural Network Explanations and Beyond interesting or useful in your research, please use the following BibTeX entry to cite us:
@article{hedstrom2023quantus,
author = {Anna Hedstr{\"{o}}m and Leander Weber and Daniel Krakowczyk and Dilyara Bareeva and Franz Motzkus and Wojciech Samek and Sebastian Lapuschkin and Marina M.{-}C. H{\"{o}}hne},
title = {Quantus: An Explainable AI Toolkit for Responsible Evaluation of Neural Network Explanations and Beyond},
journal = {Journal of Machine Learning Research},
year = {2023},
volume = {24},
number = {34},
pages = {1--11},
url = {http://jmlr.org/papers/v24/22-0142.html}
}

When applying the individual metrics of Quantus, please make sure to also properly cite the work of the original authors (as linked below).
A simple visual comparison of eXplainable Artificial Intelligence (XAI) methods is often not sufficient to decide which explanation method works best, as shown exemplarily in Figure a) for four gradient-based methods: Saliency (Mørch et al., 1995; Baehrens et al., 2010), Integrated Gradients (Sundararajan et al., 2017), GradientShap (Lundberg and Lee, 2017) and FusionGrad (Bykov et al., 2021). Yet such visual comparison is common practice for evaluating XAI methods in the absence of ground-truth data. Therefore, we developed Quantus, an easy-to-use yet comprehensive toolbox for the quantitative evaluation of explanations, including 30+ different metrics.
With Quantus, we can obtain richer insights into how the methods compare, e.g., b) by holistic quantification across several evaluation criteria and c) by sensitivity analysis of how a single parameter, e.g., the pixel replacement strategy of a faithfulness test, influences the ranking of the XAI methods.
This project started with the goal of collecting existing evaluation metrics that have been introduced in the context of XAI research, to help automate the task of XAI quantification. During implementation, it became clear that XAI metrics most often belong to one of six categories, i.e., 1) faithfulness, 2) robustness, 3) localisation, 4) complexity, 5) randomisation (sensitivity) or 6) axiomatic metrics. The library contains implementations of the following evaluation metrics:
Faithfulness
Quantifies to what extent explanations follow the predictive behaviour of the model (asserting that more important features play a larger role in model outcomes).
- Faithfulness Correlation (Bhatt et al., 2020): iteratively replaces a random subset of given attributions with a baseline value and then measures the correlation between the sum of this attribution subset and the difference in function output
- Faithfulness Estimate (Alvarez-Melis et al., 2018): computes the correlation between probability drops and attribution scores on various points
- Monotonicity Metric (Arya et al. 2019): starts from a reference baseline to then incrementally replace each feature in a sorted attribution vector, measuring the effect on model performance
- Monotonicity Metric (Nguyen et al., 2020): measures the Spearman rank correlation between the absolute values of the attribution and the uncertainty in the probability estimation
- Pixel Flipping (Bach et al., 2015): captures the impact of perturbing pixels in descending order according to the attributed value on the classification score (a conceptual sketch follows this list)
- Region Perturbation (Samek et al., 2015): is an extension of Pixel-Flipping to flip an area rather than a single pixel
- Selectivity (Montavon et al., 2018): measures how quickly an evaluated prediction function starts to drop when removing features with the highest attributed values
- SensitivityN (Ancona et al., 2019): computes the correlation between the sum of the attributions and the variation in the target output while varying the fraction of the total number of features, averaged over several test samples
- IROF (Rieger et al., 2020): computes the area over the curve per class for sorted mean importances of feature segments (superpixels) as they are iteratively removed (and prediction scores are collected), averaged over several test samples
- Infidelity (Yeh et al., 2019): represents the expected mean squared error between 1) the dot product of an attribution and an input perturbation and 2) the difference in model output after a significant perturbation
- ROAD (Rong, Leemann, et al., 2022): measures the accuracy of the model on the test set in an iterative process of removing the k most important pixels; at each step, the k most relevant pixels (MoRF order) are replaced with noisy linear imputations
- Sufficiency (Dasgupta et al., 2022): measures the extent to which similar explanations have the same prediction label
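To make the common recipe behind several faithfulness metrics concrete, below is a minimal, framework-agnostic sketch of the pixel-flipping idea referenced above. It illustrates the underlying procedure, not Quantus' implementation; the predict callable, baseline value and step size are assumptions made for the example.

import numpy as np

def pixel_flipping_curve(predict, x, attribution, target, baseline=0.0, steps=10):
    # predict: callable mapping a batch (N, C, H, W) to class probabilities (assumed for illustration).
    # x: a single input of shape (C, H, W); attribution: a relevance map of shape (H, W).
    order = np.argsort(attribution.flatten())[::-1]  # pixel indices, most relevant first
    x_perturbed = x.copy()
    scores = [predict(x_perturbed[None])[0, target]]  # score before any flipping
    chunk = max(1, order.size // steps)
    for i in range(0, order.size, chunk):
        rows, cols = np.unravel_index(order[i:i + chunk], attribution.shape)
        x_perturbed[:, rows, cols] = baseline  # replace the next chunk of pixels with the baseline
        scores.append(predict(x_perturbed[None])[0, target])
    return scores  # for a faithful explanation, this curve should drop quickly

The faster the class score drops as highly attributed pixels are removed, the more faithful the explanation is judged to be.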
Robustness
Measures to what extent explanations are stable when subject to slight perturbations of the input, assuming that the model output stays approximately the same.
- Local Lipschitz Estimate (Alvarez-Melis et al., 2018): tests the consistency in the explanation between adjacent examples
- Max-Sensitivity (Yeh et al., 2019): measures the maximum sensitivity of an explanation using a Monte Carlo sampling-based approximation (a conceptual sketch follows this list)
- Avg-Sensitivity (Yeh et al., 2019): measures the average sensitivity of an explanation using a Monte Carlo sampling-based approximation
- Continuity (Montavon et al., 2018): captures the strongest variation in explanation of an input and its perturbed version
- Consistency (Dasgupta et al., 2022): measures the probability that the inputs with the same explanation have the same prediction label
- Relative Input Stability (RIS) (Agarwal et al., 2022): measures the relative distance between explanations e_x and e_x' with respect to the distance between the two inputs x and x'
- Relative Representation Stability (RRS) (Agarwal et al., 2022): measures the relative distance between explanations e_x and e_x' with respect to the distance between the internal model representations L_x and L_x' for x and x', respectively
- Relative Output Stability (ROS) (Agarwal et al., 2022): measures the relative distance between explanations e_x and e_x' with respect to the distance between the output logits h(x) and h(x') for x and x', respectively
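The following is a minimal sketch of the Monte Carlo estimate behind Max-Sensitivity mentioned above, written against a generic explain_func. It illustrates the idea rather than Quantus' implementation (quantus.MaxSensitivity additionally exposes the norms and the perturbation function as parameters); the radius and sample count are illustrative assumptions.

import numpy as np

def max_sensitivity(explain_func, model, x, y, radius=0.2, nr_samples=10):
    # explain_func(model, x, y) -> attribution array (assumed signature for illustration).
    e_x = explain_func(model, x, y)
    sensitivities = []
    for _ in range(nr_samples):
        x_perturbed = x + np.random.uniform(-radius, radius, size=x.shape)  # small uniform noise
        e_perturbed = explain_func(model, x_perturbed, y)
        sensitivities.append(np.linalg.norm(e_perturbed - e_x))  # norm of the explanation change
    return max(sensitivities)  # lower values indicate a more robust explanation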
Localisation
Tests if the explainable evidence is centred around a region of interest (RoI), which may be defined around an object by a bounding box, a segmentation mask or a cell within a grid.
- Pointing Game (Zhang et al., 2018): checks whether the attribution with the highest score is located within the targeted object (a conceptual sketch follows this list)
- Attribution Localization (Kohlbrenner et al., 2020): measures the ratio of positive attributions within the targeted object towards the total positive attributions
- Top-K Intersection (Theiner et al., 2021): computes the intersection between a ground truth mask and the binarized explanation at the top k feature locations
- Relevance Rank Accuracy (Arras et al., 2021): measures the ratio of highly attributed pixels within a ground-truth mask towards the size of the ground truth mask
- Relevance Mass Accuracy (Arras et al., 2021): measures the ratio of positive attributions inside the ground-truth mask towards the overall positive attributions
- AUC (Fawcett et al., 2006): compares the ranking between attributions and a given ground-truth mask
- Focus (Arias et al., 2022): quantifies the precision of the explanation by creating mosaics of data instances from different classes
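As a concrete illustration of the localisation idea, below is a minimal sketch of the pointing-game check referenced above, assuming a 2-D attribution map and a boolean ground-truth mask of the same shape; it is not Quantus' implementation.

import numpy as np

def pointing_game_hit(attribution, mask):
    # attribution: 2-D relevance map; mask: boolean array of the same shape marking the region of interest.
    idx = np.unravel_index(np.argmax(attribution), attribution.shape)  # location of the highest attribution
    return bool(mask[idx])  # True counts as a hit; the metric averages hits over many samples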
Complexity
Captures to what extent explanations are concise, i.e., that few features are used to explain a model prediction.
- Sparseness (Chalasani et al., 2020): uses the Gini index to measure whether only highly attributed features are truly predictive of the model output (a conceptual sketch follows this list)
- Complexity (Bhatt et al., 2020): computes the entropy of the fractional contribution of all features to the total magnitude of the attribution individually
- Effective Complexity (Nguyen et al., 2020): measures how many attributions in absolute value exceed a certain threshold
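The Gini index underlying the Sparseness metric can be computed from the sorted absolute attribution values; the snippet below is a minimal sketch of that standard formula, not Quantus' implementation.

import numpy as np

def gini_index(attribution):
    # 0 means relevance is spread evenly over all features; values near 1 mean few features carry most relevance.
    a = np.sort(np.abs(attribution).flatten()) + 1e-12  # sort ascending, avoid division by zero
    n = a.size
    ranks = np.arange(1, n + 1)
    return np.sum((2 * ranks - n - 1) * a) / (n * np.sum(a))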
Randomisation (Sensitivity)
Tests to what extent explanations deteriorate as the inputs to the evaluation problem, e.g., the model parameters, are increasingly randomised.
- MPRT (Model Parameter Randomisation Test) (Adebayo et al., 2018): randomises the parameters of single model layers in a cascading or independent way and measures the distance of the respective explanation to the original explanation (a conceptual sketch follows this list)
- Smooth MPRT (Hedström et al., 2023): adds a "denoising" preprocessing step to the original MPRT, where the explanations are averaged over N noisy samples before the similarity between the original and fully random model's explanations is measured
- Efficient MPRT (Hedström et al., 2023): reinterprets MPRT by evaluating the rise in explanation complexity (discrete entropy) before and after full model randomisation, asking for increased explanation complexity post-randomisation
- Random Logit Test (Sixt et al., 2020): computes the distance between the original explanation and the explanation for a random other class
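Below is a minimal PyTorch sketch of the cascading-randomisation idea behind MPRT, as referenced above; explain_func and similarity_func are assumed callables and the sketch is illustrative rather than Quantus' implementation.

import copy

def mprt_similarities(model, explain_func, x, y, similarity_func):
    # model: a torch.nn.Module; explain_func(model, x, y) -> attribution; similarity_func compares two
    # attributions (e.g., a rank correlation). All three are assumptions made for this illustration.
    e_original = explain_func(model, x, y)
    randomised = copy.deepcopy(model)
    results = {}
    # Walk the layers from top to bottom and re-initialise their parameters cascadingly.
    for name, module in reversed(list(randomised.named_modules())):
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()  # randomise this layer's weights, keeping earlier randomisations
            results[name] = similarity_func(e_original, explain_func(randomised, x, y))
    return results  # similarity to the original explanation should drop as more layers are randomised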
Axiomatic
Assesses if explanations fulfil certain axiomatic properties.
- Completeness (Sundararajan et al., 2017): evaluates whether the sum of attributions equals the difference between the function values at the input x and the baseline x'; also referred to as Summation to Delta (Shrikumar et al., 2017), Sensitivity-n (slight variation, Ancona et al., 2018) and Conservation (Montavon et al., 2018) (a conceptual sketch follows this list)
- Non-Sensitivity (Nguyen et al., 2020): measures whether the total attribution is proportional to the explainable evidence at the model output
- Input Invariance (Kindermans et al., 2017): adds a shift to input, asking that attributions should not change in response (assuming the model does not)
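The completeness axiom above reduces to a simple check; the snippet below sketches it for a single input, assuming the model outputs f(x) and f(x') are already computed (illustrative, not Quantus' implementation).

import numpy as np

def is_complete(attribution, f_x, f_baseline, tol=1e-3):
    # The attributions should sum to the difference between the model output at the input
    # and at the baseline, f(x) - f(x'), up to a small tolerance.
    return bool(np.abs(np.sum(attribution) - (f_x - f_baseline)) < tol)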
Additional metrics will be included in future releases. Please open an issue if you have a metric you believe should be a part of Quantus.
Disclaimers. It is worth noting that the implementations of the metrics in this library have not been verified by the original authors. Thus, any metric implementation in this library may differ from the original authors' implementation. Further, bear in mind that evaluation metrics for XAI methods are often empirical interpretations (or translations) of qualities that some researcher(s) claimed were important for explanations to fulfil, so there may be a discrepancy between what the author claims to measure with the proposed metric and what is actually measured, e.g., using entropy as an operationalisation of explanation complexity. Please read the user guidelines for further guidance on how to best use the library.
If you already have PyTorch or TensorFlow installed on your machine, the most lightweight version of Quantus can be obtained from PyPI as follows (no additional explainability functionality or deep learning framework will be included):
pip install quantus
Alternatively, you can add the desired deep learning framework as an extra (in brackets) to have it installed together with Quantus. To install Quantus with PyTorch, please run:
pip install "quantus[torch]"
For TensorFlow, please run:
pip install "quantus[tensorflow]"
The package requirements are as follows:
python>=3.8.0
torch>=1.11.0
tensorflow>=2.5.0
Please note that the exact PyTorch and/or TensorFlow versions to be installed depend on your Python version (3.8-3.11) and platform (darwin, linux, …). See the [project.optional-dependencies] section in the pyproject.toml file.
The following will give a short introduction to how to get started with Quantus. Note that this example is based on the PyTorch framework, but we also support TensorFlow, which would differ only in the loading of the model, data and explanations. To get started with Quantus, you need:
- A model (model), inputs (x_batch) and labels (y_batch)
- Some explanations you want to evaluate (a_batch)
Step 1. Load data and model
Let's first load the data and model. In this example, a pre-trained LeNet available from Quantus
for the purpose of this tutorial is loaded, but generally, you might use any PyTorch (or TensorFlow) model instead. To follow this example, you need to have quantus and torch installed, e.g., via pip install 'quantus[torch]'.
import quantus
from quantus.helpers.model.models import LeNet
import torch
import torchvision
from torchvision import transforms
# Enable GPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# Load a pre-trained LeNet classification model (architecture at quantus/helpers/models).
model = LeNet()
if device.type == "cpu":
model.load_state_dict(torch.load("tests/assets/mnist", map_location=torch.device('cpu')))
else:
model.load_state_dict(torch.load("tests/assets/mnist"))
# Load datasets and make loaders.
test_set = torchvision.datasets.MNIST(root='./sample_data', download=True, transform=transforms.Compose([transforms.ToTensor()]))
test_loader = torch.utils.data.DataLoader(test_set, batch_size=24)
# Load a batch of inputs and outputs to use for XAI evaluation.
x_batch, y_batch = next(iter(test_loader))
x_batch, y_batch = x_batch.cpu().numpy(), y_batch.cpu().numpy()

Step 2. Load explanations
We still need some explanations to evaluate. For this, there are two possibilities in Quantus. You can provide either:
- a set of pre-computed attributions (np.ndarray)
- an arbitrary explanation function (callable), e.g., the built-in method quantus.explain or your own customised function
We show the different options below.
Quantus allows you to evaluate explanations that you have pre-computed,
assuming that they match the data you provide in x_batch. Let's say you have explanations
for Saliency and Integrated Gradients
already pre-computed.
In that case, you can simply load these into corresponding variables a_batch_saliency
and a_batch_intgrad:
a_batch_saliency = load("path/to/precomputed/saliency/explanations")
a_batch_intgrad = load("path/to/precomputed/intgrad/explanations")

Another option is to simply obtain the attributions using one of the many XAI frameworks out there, such as Captum, Zennit, tf-explain, or iNNvestigate. The following code example shows how to obtain explanations (Saliency and Integrated Gradients, to be specific) using Captum:
import numpy as np
import captum
from captum.attr import Saliency, IntegratedGradients

# Captum expects torch tensors, so wrap the numpy batches back into tensors on the chosen device.
x_tensor = torch.tensor(x_batch, device=device)
y_tensor = torch.tensor(y_batch, device=device)

# Generate Saliency and Integrated Gradients attributions of the first batch of the test set.
a_batch_saliency = Saliency(model).attribute(inputs=x_tensor, target=y_tensor, abs=True).sum(axis=1).cpu().numpy()
a_batch_intgrad = IntegratedGradients(model).attribute(inputs=x_tensor, target=y_tensor, baselines=torch.zeros_like(x_tensor)).sum(axis=1).cpu().numpy()

# Quick assert: the batches passed to the metric calls should all be numpy arrays.
assert all(isinstance(obj, np.ndarray) for obj in [x_batch, y_batch, a_batch_saliency, a_batch_intgrad])

If you don't have a pre-computed set of explanations but rather want to pass an arbitrary explanation function that you wish to evaluate with Quantus, this option exists.
For this, you can for example rely on the built-in quantus.explain function to get started, which includes some popular explanation methods
(please run quantus.available_methods() to see which ones). Examples of how to use quantus.explain
or your own customised explanation function are included in the next section.
As seen in the image above, the qualitative aspects of explanations may be difficult to interpret: since we lack a ground truth of what the explanations should look like, it is hard to draw conclusions about the explainable evidence. To gather quantitative evidence for the quality of the different explanation methods, we can apply Quantus.
Step 3. Evaluate with Quantus
Quantus implements XAI evaluation metrics from different categories,
e.g., Faithfulness, Localisation and Robustness, which all inherit from the base quantus.Metric class.
To apply a metric to your setting (e.g., Max-Sensitivity)
it first needs to be instantiated:
metric = quantus.MaxSensitivity(
    nr_samples=10,
    lower_bound=0.2,
    norm_numerator=quantus.fro_norm,
    norm_denominator=quantus.fro_norm,
    perturb_func=quantus.uniform_noise,
    similarity_func=quantus.difference,
    abs=True,
    normalise=True,
)

and then applied to your model, data, and (pre-computed) explanations:
scores = metric(
model=model,
x_batch=x_batch,
y_batch=y_batch,
a_batch=a_batch_saliency,
device=device,
explain_func=quantus.explain,
explain_func_kwargs={"method": "Saliency"},
)

Since a re-computation of the explanations is necessary for robustness evaluation, in this example we also pass an explanation function (explain_func) to the metric call. Here, we rely on the built-in quantus.explain function to recompute the explanations. The hyperparameters are set with the explain_func_kwargs dictionary. Please find more details on how to use quantus.explain in the API documentation.
You can alternatively use your own customised explanation function
(assuming it returns an np.ndarray in a shape that matches the input x_batch). This is done as follows:
def your_own_callable(model, inputs, targets, **kwargs) -> np.ndarray:
    """Compute the attributions and return an explanation in the same shape
    as x_batch (np.ndarray); flatten the channels if necessary."""
    # `explanation` stands in for your own attribution logic.
    return explanation(model, inputs, targets)
scores = metric(
model=model,
x_batch=x_batch,
y_batch=y_batch,
device=device,
explain_func=your_own_callable
)

Quantus also provides high-level functionality to support large-scale evaluations,
e.g., multiple XAI methods, multifaceted evaluation through several metrics, or a combination thereof. To utilise quantus.evaluate(), you simply need to define two things:
- The metrics you would like to use for evaluation (each __init__ parameter configuration counts as its own metric):

metrics = {
    "max-sensitivity-10": quantus.MaxSensitivity(nr_samples=10),
    "max-sensitivity-20": quantus.MaxSensitivity(nr_samples=20),
    "region-perturbation": quantus.RegionPerturbation(),
}

- The XAI methods you would like to evaluate, e.g., a dict with pre-computed attributions:

xai_methods = {
    "Saliency": a_batch_saliency,
    "IntegratedGradients": a_batch_intgrad,
}
You can then simply run a large-scale evaluation as follows (this aggregates the result by np.mean averaging):
import numpy as np
results = quantus.evaluate(
metrics=metrics,
xai_methods=xai_methods,
agg_func=np.mean,
model=model,
x_batch=x_batch,
y_batch=y_batch,
**{"softmax": False,}
)

Please see the Getting started tutorial to run code similar to this example. For more information on how to customise metrics and extend Quantus' functionality, please see the Getting started guide.
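To compare the XAI methods side by side, the aggregated scores can be arranged into a table. The snippet below is a minimal sketch, assuming results comes back as a nested dict of the form {xai_method: {metric_name: aggregated_score}}; the exact structure may vary across Quantus versions.

import pandas as pd

# One row per XAI method, one column per metric configuration.
df = pd.DataFrame(results).T
print(df)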
Further tutorials are available that showcase the many types of analysis that can be done using Quantus. For this purpose, please see notebooks in the tutorials folder which includes examples such as:
- All Metrics ImageNet Example: shows how to instantiate the different metrics for the ImageNet dataset
- Metric Parameterisation Analysis: explores how sensitive a metric could be to its hyperparameters
- Robustness Analysis Model Training: measures robustness of explanations as model accuracy increases
- Full Quantification with Quantus: example of benchmarking explanation methods
- Tabular Data Example: example of how to use Quantus with tabular data
- Quantus and TensorFlow Data Example: showcases how to use Quantus with TensorFlow
... and more.
We welcome any sort of contribution to Quantus! For a detailed contribution guide, please refer to Contributing documentation first.
If you have any developer-related questions, please open an issue or write us at [email protected].
Alternative AI tools for Quantus
Similar Open Source Tools
gepa
GEPA (Genetic-Pareto) is a framework for optimizing arbitrary systems composed of text components like AI prompts, code snippets, or textual specs against any evaluation metric. It employs LLMs to reflect on system behavior, using feedback from execution and evaluation traces to drive targeted improvements. Through iterative mutation, reflection, and Pareto-aware candidate selection, GEPA evolves robust, high-performing variants with minimal evaluations, co-evolving multiple components in modular systems for domain-specific gains. The repository provides the official implementation of the GEPA algorithm as proposed in the paper titled 'GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning'.
llm-reasoners
LLM Reasoners is a library that enables LLMs to conduct complex reasoning, with advanced reasoning algorithms. It approaches multi-step reasoning as planning and searches for the optimal reasoning chain, which achieves the best balance of exploration vs exploitation with the idea of "World Model" and "Reward". Given any reasoning problem, simply define the reward function and an optional world model (explained below), and let LLM reasoners take care of the rest, including Reasoning Algorithms, Visualization, LLM calling, and more!
zshot
Zshot is a highly customizable framework for performing Zero and Few shot named entity and relationships recognition. It can be used for mentions extraction, wikification, zero and few shot named entity recognition, zero and few shot named relationship recognition, and visualization of zero-shot NER and RE extraction. The framework consists of two main components: the mentions extractor and the linker. There are multiple mentions extractors and linkers available, each serving a specific purpose. Zshot also includes a relations extractor and a knowledge extractor for extracting relations among entities and performing entity classification. The tool requires Python 3.6+ and dependencies like spacy, torch, transformers, evaluate, and datasets for evaluation over datasets like OntoNotes. Optional dependencies include flair and blink for additional functionalities. Zshot provides examples, tutorials, and evaluation methods to assess the performance of the components.
llms
The 'llms' repository is a comprehensive guide on Large Language Models (LLMs), covering topics such as language modeling, applications of LLMs, statistical language modeling, neural language models, conditional language models, evaluation methods, transformer-based language models, practical LLMs like GPT and BERT, prompt engineering, fine-tuning LLMs, retrieval augmented generation, AI agents, and LLMs for computer vision. The repository provides detailed explanations, examples, and tools for working with LLMs.
fuse-med-ml
FuseMedML is a Python framework designed to accelerate machine learning-based discovery in the medical field by promoting code reuse. It provides a flexible design concept where data is stored in a nested dictionary, allowing easy handling of multi-modality information. The framework includes components for creating custom models, loss functions, metrics, and data processing operators. Additionally, FuseMedML offers 'batteries included' key components such as fuse.data for data processing, fuse.eval for model evaluation, and fuse.dl for reusable deep learning components. It supports PyTorch and PyTorch Lightning libraries and encourages the creation of domain extensions for specific medical domains.
MME-RealWorld
MME-RealWorld is a benchmark designed to address real-world applications with practical relevance, featuring 13,366 high-resolution images and 29,429 annotations across 43 tasks. It aims to provide substantial recognition challenges and overcome common barriers in existing Multimodal Large Language Model benchmarks, such as small data scale, restricted data quality, and insufficient task difficulty. The dataset offers advantages in data scale, data quality, task difficulty, and real-world utility compared to existing benchmarks. It also includes a Chinese version with additional images and QA pairs focused on Chinese scenarios.
SheetCopilot
SheetCopilot is an assistant agent that manipulates spreadsheets by following user commands. It leverages Large Language Models (LLMs) to interact with spreadsheets like a human expert, enabling non-expert users to complete tasks on complex software such as Google Sheets and Excel via a language interface. The tool observes spreadsheet states, polishes generated solutions based on external action documents and error feedback, and aims to improve success rate and efficiency. SheetCopilot offers a dataset with diverse task categories and operations, supporting operations like entry & manipulation, management, formatting, charts, and pivot tables. Users can interact with SheetCopilot in Excel or Google Sheets, executing tasks like calculating revenue, creating pivot tables, and plotting charts. The tool's evaluation includes performance comparisons with leading LLMs and VBA-based methods on specific datasets, showcasing its capabilities in controlling various aspects of a spreadsheet.
xFinder
xFinder is a model specifically designed for key answer extraction from large language models (LLMs). It addresses the challenges of unreliable evaluation methods by optimizing the key answer extraction module. The model achieves high accuracy and robustness compared to existing frameworks, enhancing the reliability of LLM evaluation. It includes a specialized dataset, the Key Answer Finder (KAF) dataset, for effective training and evaluation. xFinder is suitable for researchers and developers working with LLMs to improve answer extraction accuracy.
RLHF-Reward-Modeling
This repository contains code for training reward models for Deep Reinforcement Learning-based Reward-modulated Hierarchical Fine-tuning (DRL-based RLHF), Iterative Selection Fine-tuning (Rejection sampling fine-tuning), and iterative Decision Policy Optimization (DPO). The reward models are trained using a Bradley-Terry model based on the Gemma and Mistral language models. The resulting reward models achieve state-of-the-art performance on the RewardBench leaderboard for reward models with base models of up to 13B parameters.
k2
K2 (GeoLLaMA) is a large language model for geoscience, trained on geoscience literature and fine-tuned with knowledge-intensive instruction data. It outperforms baseline models on objective and subjective tasks. The repository provides K2 weights, core data of GeoSignal, GeoBench benchmark, and code for further pretraining and instruction tuning. The model is available on Hugging Face for use. The project aims to create larger and more powerful geoscience language models in the future.
LongBench
LongBench v2 is a benchmark designed to assess the ability of large language models (LLMs) to handle long-context problems requiring deep understanding and reasoning across various real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words, covering six major task categories. The dataset is collected from nearly 100 highly educated individuals with diverse professional backgrounds and is designed to be challenging even for human experts. The evaluation results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.
Trace
Trace is a new AutoDiff-like tool for training AI systems end-to-end with general feedback. It generalizes the back-propagation algorithm by capturing and propagating an AI system's execution trace. Implemented as a PyTorch-like Python library, users can write Python code directly and use Trace primitives to optimize certain parts, similar to training neural networks.
BambooAI
BambooAI is a lightweight library utilizing Large Language Models (LLMs) to provide natural language interaction capabilities, much like a research and data analysis assistant enabling conversation with your data. You can either provide your own data sets, or allow the library to locate and fetch data for you. It supports Internet searches and external API interactions.
LLMeBench
LLMeBench is a flexible framework designed for accelerating benchmarking of Large Language Models (LLMs) in the field of Natural Language Processing (NLP). It supports evaluation of various NLP tasks using model providers like OpenAI, HuggingFace Inference API, and Petals. The framework is customizable for different NLP tasks, LLM models, and datasets across multiple languages. It features extensive caching capabilities, supports zero- and few-shot learning paradigms, and allows on-the-fly dataset download and caching. LLMeBench is open-source and continuously expanding to support new models accessible through APIs.
kafka-ml
Kafka-ML is a framework designed to manage the pipeline of Tensorflow/Keras and PyTorch machine learning models on Kubernetes. It enables the design, training, and inference of ML models with datasets fed through Apache Kafka, connecting them directly to data streams like those from IoT devices. The Web UI allows easy definition of ML models without external libraries, catering to both experts and non-experts in ML/AI.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.