TrustLLM
[ICML 2024] TrustLLM: Trustworthiness in Large Language Models
Stars: 412
TrustLLM is a comprehensive study of trustworthiness in LLMs, covering principles for different dimensions of trustworthiness, an established benchmark, an evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight dimensions. Based on these principles, we establish a benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, covering over 30 datasets. This document explains how to use the trustllm Python package to assess the trustworthiness of your LLM more quickly. For more details about TrustLLM, please refer to the project website.
README:
- [01/09/2024] The TrustLLM toolkit has been downloaded 4,000+ times!
- [15/07/2024] TrustLLM now supports UniGen for dynamic evaluation.
- [02/05/2024] TrustLLM has been accepted by ICML 2024! See you in Vienna!
- [23/04/2024] Version 0.3.0: Major updates including bug fixes, enhanced evaluation, and newly added models (including ChatGLM3, Llama3-8b, Llama3-70b, GLM4, Mixtral). (See details)
- [20/03/2024] Version 0.2.4: Fixed many bugs and added support for the Gemini Pro API.
- [01/02/2024] Version 0.2.2: See our new paper on awareness in LLMs! (link)
- [29/01/2024] Version 0.2.1: The trustllm toolkit now supports (1) an easy evaluation pipeline, (2) LLMs on Replicate and DeepInfra, and (3) the Azure OpenAI API.
- [20/01/2024] Version 0.2.0 of the trustllm toolkit is released! See the new features.
- [12/01/2024] The dataset, leaderboard, and evaluation toolkit are released!
- TrustLLM (ICML 2024) is a comprehensive framework for studying the trustworthiness of large language models, including principles, surveys, and benchmarks.
- This code repository is designed to provide an easy toolkit for evaluating the trustworthiness of LLMs (See our docs).
Table of Contents
- About TrustLLM
- Before Evaluation
- Evaluation
- Dataset & Task
- Leaderboard
- Contribution
- Citation
- License
We introduce TrustLLM, a comprehensive study of trustworthiness in LLMs, covering principles for different dimensions of trustworthiness, an established benchmark, an evaluation and analysis of trustworthiness for mainstream LLMs, and a discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight dimensions. Based on these principles, we establish a benchmark across six dimensions: truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, covering over 30 datasets. This repository explains how to use the trustllm Python package to assess the trustworthiness of your LLM more quickly. For more details about TrustLLM, please refer to the project website.
Installation via GitHub (recommended):
git clone git@github.com:HowieHwong/TrustLLM.git
Installation via pip:
pip install trustllm
Installation via conda:
conda install -c conda-forge trustllm
Create a new environment:
conda create --name trustllm python=3.9
Install required packages:
cd trustllm_pkg
pip install .
Download TrustLLM dataset:
from trustllm.dataset_download import download_dataset
download_dataset(save_path='save_path')
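To check what was fetched, here is a minimal sketch that lists the downloaded files (an assumption: download_dataset writes its files directly into save_path; the exact file names depend on the toolkit version):

import os

# 'save_path' is the same directory passed to download_dataset above.
for name in sorted(os.listdir('save_path')):
    print(name)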
We have included a generation section since version 0.2.0. Start your generation from this page. Here is an example:
from trustllm.generation.generation import LLMGeneration

llm_gen = LLMGeneration(
    model_path="your model name",
    test_type="test section",
    data_path="your dataset file path",
    model_name="",
    online_model=False,
    use_deepinfra=False,
    use_replicate=False,
    repetition_penalty=1.0,
    num_gpus=1,
    max_new_tokens=512,
    debug=False,
    device='cuda:0'
)

llm_gen.generation_results()
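The same class is also used for API-hosted models. A minimal sketch, assuming that setting online_model=True together with use_deepinfra=True (or use_replicate=True) routes generation through the corresponding API and that the model is then identified by model_name (see the generation docs for the exact identifiers; the remaining arguments mirror the local example above and may be ignored for online models):

from trustllm.generation.generation import LLMGeneration

llm_gen_online = LLMGeneration(
    model_path="your model name",
    test_type="test section",
    data_path="your dataset file path",
    model_name="your online model name",  # assumed identifier for the hosted model
    online_model=True,                    # switch to an API-hosted model
    use_deepinfra=True,                   # or use_replicate=True
    use_replicate=False,
    repetition_penalty=1.0,
    num_gpus=1,
    max_new_tokens=512,
    debug=False,
    device='cuda:0'
)

llm_gen_online.generation_results()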
We provide a toolkit that allows you to assess the trustworthiness of large language models more conveniently. Please refer to the documentation for more details. Here is an example:
from trustllm.task.pipeline import run_truthfulness

truthfulness_results = run_truthfulness(
    internal_path="path_to_internal_consistency_data.json",
    external_path="path_to_external_consistency_data.json",
    hallucination_path="path_to_hallucination_data.json",
    sycophancy_path="path_to_sycophancy_data.json",
    advfact_path="path_to_advfact_data.json"
)
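The pipeline functions return the computed scores; a minimal sketch for persisting and inspecting them, assuming the return value is a JSON-serializable dict mapping metric names to values:

import json

# 'truthfulness_results' comes from the run_truthfulness call above.
with open("truthfulness_results.json", "w") as f:
    json.dump(truthfulness_results, f, indent=2)

for metric, score in truthfulness_results.items():
    print(f"{metric}: {score}")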
✓ means the dataset is from prior work, and ✗ means the dataset is first proposed in our benchmark.
Dataset | Description | Num. | Exist? | Section |
---|---|---|---|---|
SQuAD2.0 | It combines questions in SQuAD1.1 with over 50,000 unanswerable questions. | 100 | ✓ | Misinformation |
CODAH | It contains 28,000 commonsense questions. | 100 | ✓ | Misinformation |
HotpotQA | It contains 113k Wikipedia-based question-answer pairs for complex multi-hop reasoning. | 100 | ✓ | Misinformation |
AdversarialQA | It contains 30,000 adversarial reading comprehension question-answer pairs. | 100 | ✓ | Misinformation |
Climate-FEVER | It contains 7,675 climate change-related claims manually curated by human fact-checkers. | 100 | ✓ | Misinformation |
SciFact | It contains 1,400 expert-written scientific claims paired with evidence abstracts. | 100 | ✓ | Misinformation |
COVID-Fact | It contains 4,086 real-world COVID claims. | 100 | ✓ | Misinformation |
HealthVer | It contains 14,330 health-related claims verified against scientific articles. | 100 | ✓ | Misinformation |
TruthfulQA | Multiple-choice questions that evaluate whether a language model is truthful in generating answers. | 352 | ✓ | Hallucination |
HaluEval | It contains 35,000 generated and human-annotated hallucinated samples. | 300 | ✓ | Hallucination |
LM-exp-sycophancy | A dataset of human questions, each with one sycophantic and one non-sycophantic response example. | 179 | ✓ | Sycophancy |
Opinion pairs | It contains 120 pairs of opposite opinions. | 240, 120 | ✗ | Sycophancy, Preference |
WinoBias | It contains 3,160 sentences, split for development and testing, created by researchers familiar with the project. | 734 | ✓ | Stereotype |
StereoSet | It contains sentences that measure model preferences across gender, race, religion, and profession. | 734 | ✓ | Stereotype |
Adult | The dataset, containing attributes like sex, race, age, education, work hours, and work type, is used to predict salary levels for individuals. | 810 | ✓ | Disparagement |
Jailbreak Trigger | The dataset contains prompts based on 13 jailbreak attacks. | 1300 | ✗ | Jailbreak, Toxicity |
Misuse (additional) | This dataset contains prompts crafted to assess how LLMs react when confronted by attackers or malicious users seeking to exploit the model for harmful purposes. | 261 | ✗ | Misuse |
Do-Not-Answer | It is curated and filtered to consist only of prompts to which responsible LLMs do not answer. | 344 + 95 | ✓ | Misuse, Stereotype |
AdvGLUE | A multi-task dataset with different adversarial attacks. | 912 | ✓ | Natural Noise |
AdvInstruction | 600 instructions generated by 11 perturbation methods. | 600 | ✗ | Natural Noise |
ToolE | A dataset of user queries that may trigger LLMs to use external tools. | 241 | ✓ | Out of Domain (OOD) |
Flipkart | A product review dataset, collected starting from December 2022. | 400 | ✓ | Out of Domain (OOD) |
DDXPlus | A 2022 medical diagnosis dataset comprising synthetic data representing about 1.3 million patient cases. | 100 | ✓ | Out of Domain (OOD) |
ETHICS | It contains numerous descriptions of morally relevant scenarios and their moral correctness. | 500 | ✓ | Implicit Ethics |
Social Chemistry 101 | It contains various social norms, each consisting of an action and its label. | 500 | ✓ | Implicit Ethics |
MoralChoice | It consists of different contexts with morally correct and wrong actions. | 668 | ✓ | Explicit Ethics |
ConfAIde | It contains descriptions of how information is used. | 196 | ✓ | Privacy Awareness |
Privacy Awareness | It includes different privacy information queries about various scenarios. | 280 | ✗ | Privacy Awareness |
Enron Email | It contains approximately 500,000 emails generated by employees of the Enron Corporation. | 400 | ✓ | Privacy Leakage |
Xstest | It is a test suite for identifying exaggerated safety behaviors in LLMs. | 200 | ✓ | Exaggerated Safety |
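Each row above corresponds to a JSON file shipped with the downloaded dataset. A minimal sketch for inspecting one of them, assuming each file is a list of JSON objects; the file name below ('internal.json') is a placeholder, so substitute one of the files actually produced by download_dataset:

import json

with open('save_path/internal.json') as f:  # placeholder file name
    samples = json.load(f)

print(len(samples), "samples")
print(samples[0])  # inspect the fields expected by the generation and evaluation scripts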
○ means evaluation through automatic scripts (e.g., keyword matching), ● means automatic evaluation by ChatGPT, GPT-4, or Longformer, and ◐ means a mixture of both.
More trustworthy LLMs are expected to have a higher value for metrics marked ↑ and a lower value for metrics marked ↓.
Task Name | Metrics | Type | Eval | Section |
---|---|---|---|---|
Closed-book QA | Accuracy (↑) | Generation | ● | Misinformation(Internal) |
Fact-Checking | Macro F-1 (↑) | Classification | ○ | Misinformation(External) |
Multiple Choice QA | Accuracy (↑) | Classification | ○ | Hallucination |
Hallucination Classification | Accuracy (↑) | Classification | ○ | Hallucination |
Persona Sycophancy | Embedding similarity (↑) | Generation | ○ | Sycophancy |
Opinion Sycophancy | Percentage change (↓) | Generation | ● | Sycophancy |
Factuality Correction | Percentage change (↑) | Generation | ● | Adversarial Factuality |
Jailbreak Attack Evaluation | RtA (↑) | Generation | ● | Jailbreak |
Toxicity Measurement | Toxicity Value (↓) | Generation | ○ | Toxicity |
Misuse Evaluation | RtA (↑) | Generation | ● | Misuse |
Exaggerated Safety Evaluation | RtA (↓) | Generation | ● | Exaggerated Safety |
Agreement on Stereotypes | Accuracy (↑) | Generation | ◐ | Stereotype |
Recognition of Stereotypes | Agreement Percentage (↓) | Classification | ◐ | Stereotype |
Stereotype Query Test | RtA (↑) | Generation | ● | Stereotype |
Preference Selection | RtA (↑) | Generation | ● | Preference |
Salary Prediction | p-value (↑) | Generation | ○ | Disparagement |
Adversarial Perturbation in Downstream Tasks | ASR (↓), RS (↑) | Generation | ◐ | Natural Noise |
Adversarial Perturbation in Open-Ended Tasks | Embedding similarity (↑) | Generation | ◐ | Natural Noise |
OOD Detection | RtA (↑) | Generation | ● | Out of Domain (OOD) |
OOD Generalization | Micro F1 (↑) | Classification | ○ | Out of Domain (OOD) |
Agreement on Privacy Information | Pearson's correlation (↑) | Classification | ○ | Privacy Awareness |
Privacy Scenario Test | RtA (↑) | Generation | ● | Privacy Awareness |
Probing Privacy Information Usage | RtA (↑), Accuracy (↓) | Generation | ◐ | Privacy Leakage |
Moral Action Judgement | Accuracy (↑) | Classification | ◐ | Implicit Ethics |
Moral Reaction Selection (Low-Ambiguity) | Accuracy (↑) | Classification | ◐ | Explicit Ethics |
Moral Reaction Selection (High-Ambiguity) | RtA (↑) | Generation | ● | Explicit Ethics |
Emotion Classification | Accuracy (↑) | Classification | ○ | Emotional Awareness |
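RtA (Refuse-to-Answer) is the share of prompts the model declines. In the toolkit it is computed with a Longformer classifier and/or GPT judging (the ● evaluations above), but a rough keyword-matching version in the spirit of the ○ "automatic scripts" type looks like the sketch below; the refusal phrases are illustrative, not the toolkit's actual list:

# Illustrative refusal markers; the real evaluator uses a trained classifier
# and a more complete phrase list.
REFUSAL_MARKERS = [
    "i'm sorry",
    "i am sorry",
    "i cannot",
    "i can't",
    "as an ai",
    "i will not",
]

def is_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def rta(responses: list[str]) -> float:
    """Refuse-to-Answer rate: fraction of responses that look like refusals."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)

# Example: two refusals out of three responses -> RtA ≈ 0.67
print(rta([
    "I'm sorry, but I can't help with that.",
    "Sure! Here is the information you asked for...",
    "As an AI, I will not provide instructions for that.",
]))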
If you want to view the performance of all models or upload the performance of your LLM, please refer to this link.
We welcome your contributions, including but not limited to the following:
- New evaluation datasets
- Research on trustworthy issues
- Improvements to the toolkit
If you intend to make improvements to the toolkit, please fork the repository first, make the relevant modifications to the code, and finally initiate a pull request.
- [x] Faster and simpler evaluation pipeline (Version 0.2.1)
- [x] Dynamic dataset (UniGen)
- [ ] More fine-grained datasets
- [ ] Chinese output evaluation
- [ ] Downstream application evaluation
@inproceedings{huang2024trustllm,
title={TrustLLM: Trustworthiness in Large Language Models},
author={Yue Huang and Lichao Sun and Haoran Wang and Siyuan Wu and Qihui Zhang and Yuan Li and Chujie Gao and Yixin Huang and Wenhan Lyu and Yixuan Zhang and Xiner Li and Hanchi Sun and Zhengliang Liu and Yixin Liu and Yijue Wang and Zhikun Zhang and Bertie Vidgen and Bhavya Kailkhura and Caiming Xiong and Chaowei Xiao and Chunyuan Li and Eric P. Xing and Furong Huang and Hao Liu and Heng Ji and Hongyi Wang and Huan Zhang and Huaxiu Yao and Manolis Kellis and Marinka Zitnik and Meng Jiang and Mohit Bansal and James Zou and Jian Pei and Jian Liu and Jianfeng Gao and Jiawei Han and Jieyu Zhao and Jiliang Tang and Jindong Wang and Joaquin Vanschoren and John Mitchell and Kai Shu and Kaidi Xu and Kai-Wei Chang and Lifang He and Lifu Huang and Michael Backes and Neil Zhenqiang Gong and Philip S. Yu and Pin-Yu Chen and Quanquan Gu and Ran Xu and Rex Ying and Shuiwang Ji and Suman Jana and Tianlong Chen and Tianming Liu and Tianyi Zhou and William Yang Wang and Xiang Li and Xiangliang Zhang and Xiao Wang and Xing Xie and Xun Chen and Xuyu Wang and Yan Liu and Yanfang Ye and Yinzhi Cao and Yong Chen and Yue Zhao},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=bWUU0LwwMp}
}
The code in this repository is open source under the MIT license.
Alternative AI tools for TrustLLM
Similar Open Source Tools
COLD-Attack
COLD-Attack is a framework designed for controllable jailbreaks on large language models (LLMs). It formulates the controllable attack generation problem and utilizes the Energy-based Constrained Decoding with Langevin Dynamics (COLD) algorithm to automate the search of adversarial LLM attacks with control over fluency, stealthiness, sentiment, and left-right-coherence. The framework includes steps for energy function formulation, Langevin dynamics sampling, and decoding process to generate discrete text attacks. It offers diverse jailbreak scenarios such as fluent suffix attacks, paraphrase attacks, and attacks with left-right-coherence.
LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing
LLM-PowerHouse is a comprehensive and curated guide designed to empower developers, researchers, and enthusiasts to harness the true capabilities of Large Language Models (LLMs) and build intelligent applications that push the boundaries of natural language understanding. This GitHub repository provides in-depth articles, codebase mastery, LLM PlayLab, and resources for cost analysis and network visualization. It covers various aspects of LLMs, including NLP, models, training, evaluation metrics, open LLMs, and more. The repository also includes a collection of code examples and tutorials to help users build and deploy LLM-based applications.
EasyEdit
EasyEdit is a Python package for editing Large Language Models (LLMs) such as `GPT-J`, `Llama`, `GPT-NEO`, `GPT2`, and `T5` (supporting models from **1B** to **65B** parameters). Its objective is to alter the behavior of LLMs efficiently within a specific domain without negatively impacting performance on other inputs. It is designed to be easy to use and easy to extend.
jailbreak_llms
This is the official repository for the ACM CCS 2024 paper 'Do Anything Now': Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. The project employs a new framework called JailbreakHub to conduct the first measurement study on jailbreak prompts in the wild, collecting 15,140 prompts from December 2022 to December 2023, including 1,405 jailbreak prompts. The dataset serves as the largest collection of in-the-wild jailbreak prompts. The repository contains examples of harmful language and is intended for research purposes only.
GenAI-Showcase
The Generative AI Use Cases Repository showcases a wide range of applications in generative AI, including Retrieval-Augmented Generation (RAG), AI Agents, and industry-specific use cases. It provides practical notebooks and guidance on utilizing frameworks such as LlamaIndex and LangChain, and demonstrates how to integrate models from leading AI research companies like Anthropic and OpenAI.
llm-compression-intelligence
This repository presents the findings of the paper "Compression Represents Intelligence Linearly". The study reveals a strong linear correlation between the intelligence of LLMs, as measured by benchmark scores, and their ability to compress external text corpora. Compression efficiency, derived from raw text corpora, serves as a reliable evaluation metric that is linearly associated with model capabilities. The repository includes the compression corpora used in the paper, code for computing compression efficiency, and data collection and processing pipelines.
EAGLE
Eagle is a family of Vision-Centric High-Resolution Multimodal LLMs that enhance multimodal LLM perception using a mix of vision encoders and various input resolutions. The model features a channel-concatenation-based fusion for vision experts with different architectures and knowledge, supporting up to over 1K input resolution. It excels in resolution-sensitive tasks like optical character recognition and document understanding.
awesome-open-data-annotation
At ZenML, we believe in the importance of annotation and labeling workflows in the machine learning lifecycle. This repository showcases a curated list of open-source data annotation and labeling tools that are actively maintained and fit for purpose. The tools cover various domains such as multi-modal, text, images, audio, video, time series, and other data types. Users can contribute to the list and discover tools for tasks like named entity recognition, data annotation for machine learning, image and video annotation, text classification, sequence labeling, object detection, and more. The repository aims to help users enhance their data-centric workflows by leveraging these tools.
flute
FLUTE (Flexible Lookup Table Engine for LUT-quantized LLMs) is a tool designed for uniform quantization and lookup table quantization of weights in lower-precision intervals. It offers flexibility in mapping intervals to arbitrary values through a lookup table. FLUTE supports various quantization formats such as int4, int3, int2, fp4, fp3, fp2, nf4, nf3, nf2, and even custom tables. The tool also introduces new quantization algorithms like Learned Normal Float (NFL) for improved performance and calibration data learning. FLUTE provides benchmarks, model zoo, and integration with frameworks like vLLM and HuggingFace for easy deployment and usage.
Awesome-LLM-Large-Language-Models-Notes
Awesome-LLM-Large-Language-Models-Notes is a repository that provides a comprehensive collection of information on various Large Language Models (LLMs) classified by year, size, and name. It includes details on known LLM models, their papers, implementations, and specific characteristics. The repository also covers LLM models classified by architecture, must-read papers, blog articles, tutorials, and implementations from scratch. It serves as a valuable resource for individuals interested in understanding and working with LLMs in the field of Natural Language Processing (NLP).
visionOS-examples
visionOS-examples is a repository containing accelerators for Spatial Computing. It includes examples such as Local Large Language Model, Chat Apple Vision Pro, WebSockets, Anchor To Head, Hand Tracking, Battery Life, Countdown, Plane Detection, Timer Vision, and PencilKit for visionOS. The repository showcases various functionalities and features for Apple Vision Pro, offering tools for developers to enhance their visionOS apps with capabilities like hand tracking, plane detection, and real-time cryptocurrency prices.
phoenix
Phoenix is a tool that provides MLOps and LLMOps insights at lightning speed with zero-config observability. It offers a notebook-first experience for monitoring models and LLM Applications by providing LLM Traces, LLM Evals, Embedding Analysis, RAG Analysis, and Structured Data Analysis. Users can trace through the execution of LLM Applications, evaluate generative models, explore embedding point-clouds, visualize generative application's search and retrieval process, and statistically analyze structured data. Phoenix is designed to help users troubleshoot problems related to retrieval, tool execution, relevance, toxicity, drift, and performance degradation.
tamingLLMs
The 'Taming LLMs' repository provides a practical guide to the pitfalls and challenges associated with Large Language Models (LLMs) when building applications. It focuses on key limitations and implementation pitfalls, offering practical Python examples and open source solutions to help engineers and technical leaders navigate these challenges. The repository aims to equip readers with the knowledge to harness the power of LLMs while avoiding their inherent limitations.
OpenAI-CLIP-Feature
This repository provides code for extracting image and text features using OpenAI CLIP models, supporting both global and local grid visual features. It aims to facilitate multi visual-and-language downstream tasks by allowing users to customize input and output grid resolution easily. The extracted features have shown comparable or superior results in image captioning tasks without hyperparameter tuning. The repo supports various CLIP models and provides detailed information on supported settings and results on MSCOCO image captioning. Users can get started by setting up experiments with the extracted features using X-modaler.
Awesome-LLM-Safety
Welcome to our Awesome-llm-safety repository! We've curated a collection of the latest, most comprehensive, and most valuable resources on large language model safety (llm-safety). But we don't stop there; included are also relevant talks, tutorials, conferences, news, and articles. Our repository is constantly updated to ensure you have the most current information at your fingertips.
For similar tasks
deepeval
DeepEval is a simple-to-use, open-source LLM evaluation framework specialized for unit testing LLM outputs. It incorporates various metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., and runs locally on your machine for evaluation. It provides a wide range of ready-to-use evaluation metrics, allows for creating custom metrics, integrates with any CI/CD environment, and enables benchmarking LLMs on popular benchmarks. DeepEval is designed for evaluating RAG and fine-tuning applications, helping users optimize hyperparameters, prevent prompt drifting, and transition from OpenAI to hosting their own Llama2 with confidence.
bench
Bench is a tool for evaluating LLMs for production use cases. It provides a standardized workflow for LLM evaluation with a common interface across tasks and use cases. Bench can be used to test whether open source LLMs can do as well as the top closed-source LLM API providers on specific data, and to translate the rankings on LLM leaderboards and benchmarks into scores that are relevant for actual use cases.
llm-autoeval
LLM AutoEval is a tool that simplifies the process of evaluating Large Language Models (LLMs) using a convenient Colab notebook. It automates the setup and execution of evaluations using RunPod, allowing users to customize evaluation parameters and generate summaries that can be uploaded to GitHub Gist for easy sharing and reference. LLM AutoEval supports various benchmark suites, including Nous, Lighteval, and Open LLM, enabling users to compare their results with existing models and leaderboards.
moonshot
Moonshot is a simple and modular tool developed by the AI Verify Foundation to evaluate Large Language Models (LLMs) and LLM applications. It brings Benchmarking and Red-Teaming together to assist AI developers, compliance teams, and AI system owners in assessing LLM performance. Moonshot can be accessed through various interfaces including a user-friendly Web UI, an interactive Command Line Interface, and seamless integration into MLOps workflows via Library APIs or Web APIs. It offers features like benchmarking LLMs from popular model providers, running relevant tests, creating custom cookbooks and recipes, and automating Red Teaming to identify vulnerabilities in AI systems.
llm_client
llm_client is a Rust interface designed for Local Large Language Models (LLMs) that offers automated build support for CPU, CUDA, MacOS, easy model presets, and a novel cascading prompt workflow for controlled generation. It provides a breadth of configuration options and API support for various OpenAI compatible APIs. The tool is primarily focused on deterministic signals from probabilistic LLM vibes, enabling specialized workflows for specific tasks and reproducible outcomes.
LLM-Synthetic-Data
LLM-Synthetic-Data is a repository focused on real-time, fine-grained LLM-Synthetic-Data generation. It includes methods, surveys, and application areas related to synthetic data for language models. The repository covers topics like pre-training, instruction tuning, model collapse, LLM benchmarking, evaluation, and distillation. It also explores application areas such as mathematical reasoning, code generation, text-to-SQL, alignment, reward modeling, long context, weak-to-strong generalization, agent and tool use, vision and language, factuality, federated learning, generative design, and safety.
llm-random
This repository contains code for research conducted by the LLM-Random research group at IDEAS NCBR in Warsaw, Poland. The group focuses on developing and using this repository to conduct research. For more information about the group and its research, refer to their blog, llm-random.github.io.
For similar jobs
promptflow
**Prompt flow** is a suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications, from ideation, prototyping, testing, evaluation to production deployment and monitoring. It makes prompt engineering much easier and enables you to build LLM apps with production quality.
deepeval
DeepEval is a simple-to-use, open-source LLM evaluation framework specialized for unit testing LLM outputs. It incorporates various metrics such as G-Eval, hallucination, answer relevancy, RAGAS, etc., and runs locally on your machine for evaluation. It provides a wide range of ready-to-use evaluation metrics, allows for creating custom metrics, integrates with any CI/CD environment, and enables benchmarking LLMs on popular benchmarks. DeepEval is designed for evaluating RAG and fine-tuning applications, helping users optimize hyperparameters, prevent prompt drifting, and transition from OpenAI to hosting their own Llama2 with confidence.
MegaDetector
MegaDetector is an AI model that identifies animals, people, and vehicles in camera trap images (which also makes it useful for eliminating blank images). This model is trained on several million images from a variety of ecosystems. MegaDetector is just one of many tools that aim to make conservation biologists more efficient with AI. If you want to learn about other ways to use AI to accelerate camera trap workflows, check out our overview of the field, affectionately titled "Everything I know about machine learning and camera traps".
leapfrogai
LeapfrogAI is a self-hosted AI platform designed to be deployed in air-gapped resource-constrained environments. It brings sophisticated AI solutions to these environments by hosting all the necessary components of an AI stack, including vector databases, model backends, API, and UI. LeapfrogAI's API closely matches that of OpenAI, allowing tools built for OpenAI/ChatGPT to function seamlessly with a LeapfrogAI backend. It provides several backends for various use cases, including llama-cpp-python, whisper, text-embeddings, and vllm. LeapfrogAI leverages Chainguard's apko to harden base python images, ensuring the latest supported Python versions are used by the other components of the stack. The LeapfrogAI SDK provides a standard set of protobuffs and python utilities for implementing backends and gRPC. LeapfrogAI offers UI options for common use-cases like chat, summarization, and transcription. It can be deployed and run locally via UDS and Kubernetes, built out using Zarf packages. LeapfrogAI is supported by a community of users and contributors, including Defense Unicorns, Beast Code, Chainguard, Exovera, Hypergiant, Pulze, SOSi, United States Navy, United States Air Force, and United States Space Force.
llava-docker
This Docker image for LLaVA (Large Language and Vision Assistant) provides a convenient way to run LLaVA locally or on RunPod. LLaVA is a powerful AI tool that combines natural language processing and computer vision capabilities. With this Docker image, you can easily access LLaVA's functionalities for various tasks, including image captioning, visual question answering, text summarization, and more. The image comes pre-installed with LLaVA v1.2.0, Torch 2.1.2, xformers 0.0.23.post1, and other necessary dependencies. You can customize the model used by setting the MODEL environment variable. The image also includes a Jupyter Lab environment for interactive development and exploration. Overall, this Docker image offers a comprehensive and user-friendly platform for leveraging LLaVA's capabilities.
carrot
The 'carrot' repository on GitHub provides a list of free and user-friendly ChatGPT mirror sites for easy access. The repository includes sponsored sites offering various GPT models and services. Users can find and share sites, report errors, and access stable and recommended sites for ChatGPT usage. The repository also includes a detailed list of ChatGPT sites, their features, and accessibility options, making it a valuable resource for ChatGPT users seeking free and unlimited GPT services.
AI-YinMei
AI-YinMei is an AI virtual anchor (VTuber) development tool (NVIDIA GPU version). It supports fastgpt knowledge-base chat dialogue with a complete LLM stack ([fastgpt] + [one-api] + [Xinference]); replying to Bilibili live-stream danmaku and greeting viewers who enter the stream; speech synthesis via Microsoft edge-tts, Bert-VITS2, and GPT-SoVITS; expression control through Vtuber Studio; streaming stable-diffusion-webui paintings to an OBS live room; NSFW image filtering (public-NSFW-y-distinguish); image search via DuckDuckGo (requires proxy access) and Baidu image search (no proxy required); an AI reply chat box [html plug-in]; AI singing (Auto-Convert-Music) with a playlist [html plug-in]; dancing, expression-video playback, head-patting and gift-smashing actions, automatic dancing when singing starts, and automatic idle motions during chat and singing; multi-scene switching, background-music switching, and automatic day/night scene changes; and open-ended singing and painting, letting the AI judge the content automatically.