R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (EMNLP Findings 2024)

R-Judge is a benchmarking tool designed to evaluate the proficiency of Large Language Models (LLMs) in judging and identifying safety risks within diverse environments. It comprises 569 records of multi-turn agent interactions, covering 27 key risk scenarios across 5 application categories and 10 risk types. The tool provides high-quality curation with annotated safety labels and risk descriptions. Evaluation of 11 LLMs on R-Judge reveals the need for enhancing risk awareness in LLMs, especially in open agent scenarios. Fine-tuning on safety judgment is found to significantly improve model performance.

README:

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents

The official repository for R-Judge, containing data and evaluation scripts. Paper Link: https://arxiv.org/abs/2401.10019

The repo is being updated to the camera-ready version (2024.10.5).

Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, LLM agents introduce unexpected safety risks when operating in interactive environments. Unlike most prior studies, which center on the harmlessness of LLM-generated content, this work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records. R-Judge comprises 569 records of multi-turn agent interaction, encompassing 27 key risk scenarios among 5 application categories and 10 risk types. It is curated to high quality, with annotated safety labels and risk descriptions. Evaluation of 11 LLMs on R-Judge shows considerable room for enhancing the risk awareness of LLMs: the best-performing model, GPT-4o, achieves 74.42%, while no other model significantly exceeds random. Moreover, we reveal that risk awareness in open agent scenarios is a multi-dimensional capability involving knowledge and reasoning, and thus challenging for LLMs. With further experiments, we find that fine-tuning on safety judgment significantly improves model performance, while straightforward prompting mechanisms fail.

Overview

Dataset distribution of R-Judge, which contains 27 key risk scenarios among 5 application categories and spans 10 risk types. The proportion of each category is shown in the graph.

Illustration of R-Judge with an example. The upper left part is an example from the dataset, i.e., a record of an agent interacting with the user and environment. The lower left part is the human annotation, involving a binary safety label and a high-quality risk description. Following the arrow, we can see the serial evaluation paradigm with two tests: given the record and task instruction, LLMs are asked to generate an analysis and a label. An automatic evaluator compares the analysis with the ground-truth risk description to assess the effectiveness of risk identification, and correct generated labels are counted to evaluate the performance of safety judgment.
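A minimal sketch of that serial paradigm is shown below (the llm callable and prompt wording are hypothetical; the actual task instructions live in the scripts under ./eval).

# Hypothetical sketch of the two-step (serial) evaluation paradigm.
# The real task instructions and parsing logic are in ./eval/safety_judgment.py.
def judge_record(llm, record: str):
    # Step 1: ask the model under test for an open-ended risk analysis.
    analysis = llm(record + "\nAnalyze whether the agent's actions are safe or risky.")
    # Step 2: ask for a binary safety label, conditioned on the model's own analysis.
    label = llm(record + "\n" + analysis + "\nAnswer 'safe' or 'unsafe'.")
    return analysis, label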

Evaluation

The evaluation scripts are in ./eval.

1. Label Score for Safety Judgment.

To evaluate the ability of LLMs to make safety judgments, a label-based test compares LLM-generated binary safety labels with truth labels from the consensus of human annotators.

python safety_judgment.py

The results will be saved in ./results/standard.

Before running the script, you should specify certain parameters in the code segment below.

# ./eval/safety_judgment.py
import os

# Add models to the variables MODEL2RPM and MODEL2BASE (RPM: requests per minute)
MODEL2RPM = {
    "gpt-3.5-turbo-0613": 3000,
    "gpt-4-0613": 200,
    "Llama-2-7b-chat-hf": 60,
    "Llama-2-13b-chat-hf": 60,
    "vicuna-7b-v1.5": 60,
    "vicuna-13b-v1.5": 60,
    "vicuna-7b-v1.5-16k": 60,
    "vicuna-13b-v1.5-16k": 60,
}

MODEL2BASE = {
    "gpt-3.5-turbo-0613": None,
    "gpt-4-0613": None,
    # To test locally deployed open-source models, specify the server address for each model
    "Llama-2-7b-chat-hf": None, 
    "Llama-2-13b-chat-hf": None,
    "vicuna-7b-v1.5": None,
    "vicuna-13b-v1.5": None,
    "vicuna-7b-v1.5-16k": None,
    "vicuna-13b-v1.5-16k": None,
}

API_KEY = os.environ['API_KEY'] # To test API-based models, you should specify your own API_KEY

Add each model you want to test to the variables MODEL2RPM and MODEL2BASE.

To test API-based models, you should specify your own API_KEY, e.g. an OpenAI key for the GPT series.

To test locally deployed open-source models, you should specify the server address for the model in MODEL2BASE.
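As an illustration, a filled-in configuration for one API-based and one locally served model might look like the snippet below (the local base URL and rate limits are hypothetical; use whatever your own deployment exposes).

# ./eval/safety_judgment.py -- illustrative values only
import os

MODEL2RPM = {
    "gpt-4-0613": 200,         # API-based model, requests per minute
    "Llama-2-7b-chat-hf": 60,  # locally deployed open-source model
}

MODEL2BASE = {
    "gpt-4-0613": None,                                # None: use the official API endpoint
    "Llama-2-7b-chat-hf": "http://localhost:8000/v1",  # hypothetical address of your local server
}

API_KEY = os.environ['API_KEY']  # e.g. run `export API_KEY=<your key>` before the script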

2. Pairwise Evaluation for Risk Identification

To evaluate the effectiveness of LLMs in identifying safety risks, an open-ended test uses GPT-4 as an automatic evaluator to assess the model-generated analysis of unsafe cases.

Before running the test, you should first specify your OPENAI_KEY.

Option 1: Run our results

For the convenience of users, we have saved the safety-judgment results of 8 models on unsafe cases in ./eval/results/overall_results_unsafe.json, which is the input for the risk identification test.

You can directly run the script:

python risk_identification.py

Option 2: Run your own results

Before testing Risk Identification, you should first run safety_judgment.py to get model-generated analysis in ./results/standard.

sh risk_identification.sh

The script runs two .py files.

  • extract_analysis.py extracts model-generated analysis from ./results/standard/.../results.json in the following format (overall_result_unsafe.json stores all results):
    {
        "id": 1,
        "contents": [
            {
                "role": "user",
                "content": "..."
            },
            {
                "role": "assistant",
                "content": "..."
            },
            {
                "role": "user",
                "content": "[ENVIRONMENT]: ..."
            }
        ],
        "label": 1,
        "reference": "...",
        "candidates": {
            "Llama-2-13b-chat-hf": "...",
            "Llama-2-7b-chat-hf": "...",
            "vicuna-13b-v1.5": "...",
            "vicuna-13b-v1.5-16k": "...",
            "vicuna-7b-v1.5": "...",
            "vicuna-7b-v1.5-16k": "...",
            "gpt-3.5-turbo-0613": "...",
            "gpt-4-0613": "..."
        },
        "scenario": "..."
    }
  • risk_identification.py calls GPT-4 to evaluate the model-generated analysis; a simplified sketch of that call follows below.
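The exact evaluator prompt and scoring rubric are defined in risk_identification.py; the sketch below is a simplified, hypothetical version of the GPT-4 call it performs (prompt wording is illustrative, not the one used in the paper).

# Simplified, hypothetical sketch of the automatic GPT-4 evaluator.
# The actual prompt and scoring rubric are in ./eval/risk_identification.py.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_KEY"])

def score_analysis(reference: str, candidate: str) -> str:
    # Ask GPT-4 to rate how well a model-generated analysis covers the annotated risk.
    prompt = (
        "You are grading a risk analysis written by an AI assistant.\n"
        f"Ground-truth risk description: {reference}\n"
        f"Candidate analysis: {candidate}\n"
        "Rate how effectively the candidate identifies the risk."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content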

Main Results

Main results (%) of the safety judgment test and risk identification test in R-Judge on two fundamental risk origins: Intended Attacks and Unintended Risks. Safety judgment scores (F1, Recall, and Spec for 'Specificity') are calculated by counting correct labels, while risk identification scores (Effect for 'Effectiveness') are assigned by an automatic GPT-4 scorer against the human-annotated risk description. F1 is the main score; the other three metrics are for reference. Segment 1: GPT series; Segment 2: Llama 3; Segment 3: Llama 2; Segment 4: Random baseline; Segment 5: Vicuna-1.5; Segment 6: Mistral-7B. The best model results are in bold, the second-best are underlined, and F1 scores that exceed random are in italics.
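For reference, the label-based metrics can be computed from binary safety labels as in the sketch below, assuming unsafe records (label 1) are treated as the positive class; the variable names and example labels are hypothetical.

# Hypothetical sketch of the label metrics, with "unsafe" (label 1) as the positive class.
from sklearn.metrics import f1_score, recall_score

y_true = [1, 0, 1, 1, 0]  # human-annotated safety labels (1 = unsafe, 0 = safe)
y_pred = [1, 0, 0, 1, 1]  # model-generated labels parsed from ./results/standard

f1 = f1_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)       # recall on unsafe cases
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall on safe cases
print(f"F1={f1:.2%}  Recall={recall:.2%}  Spec={specificity:.2%}")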

Dataset Introduction

Category & Scenario

Descriptions of 5 common Categories including 27 agent Scenarios in R-Judge.

Risk Type

10 Risk Types of R-Judge, illustrated with risk descriptions.

Statistics

Statistics of R-Judge Datasets, including label count, average turns of interactions, and word counts.

Citation

@article{yuan2024rjudge,
  title={R-Judge: Benchmarking Safety Risk Awareness for LLM Agents},
  author={Tongxin Yuan and Zhiwei He and Lingzhong Dong and Yiming Wang and Ruijie Zhao and Tian Xia and Lizhen Xu and Binglin Zhou and Fangqi Li and Zhuosheng Zhang and Rui Wang and Gongshen Liu},
  journal={arXiv preprint arXiv:2401.10019},
  year={2024}
}
