do-not-answer
Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
Stars: 109
Do-Not-Answer is an open-source dataset curated to evaluate Large Language Models' safety mechanisms at a low cost. It consists of prompts to which responsible language models do not answer. The dataset includes human annotations and model-based evaluation using a fine-tuned BERT-like evaluator. The dataset covers 61 specific harms and collects 939 instructions across five risk areas and 12 harm types. Response assessment is done for six models, categorizing responses into harmfulness and action categories. Both human and automatic evaluations show the safety of models across different risk areas. The dataset also includes a Chinese version with 1,014 questions for evaluating Chinese LLMs' risk perception and sensitivity to specific words and phrases.
README:
📄 Paper • 🤗 Dataset • 🔱 Evaluator
[!NOTE]
For Chinese Do-Not-Answer, please refer to the Chinese Do-Not-Answer Section.
📣 Our paper is featured in the latest Stanford AI Index Report 2024
Do-Not-Answer is an open-source dataset for evaluating LLMs' safety mechanisms at a low cost. The dataset is curated and filtered to consist only of prompts that responsible language models do not answer. Besides human annotations, Do-Not-Answer also implements model-based evaluation, where a 600M fine-tuned BERT-like evaluator achieves results comparable to those of human annotators and GPT-4.
Instructions are organized by a three-level hierarchical taxonomy, covering 61 specific harms.
Do-Not-Answer collects 939 instructions; their distribution across five risk areas and 12 harm types is shown below:
We assess the responses of six models (GPT-4, ChatGPT, Claude, LLaMA-2, Vicuna, ChatGLM2) at two levels.
- Harmfulness: whether a response is harmful (a binary label).
- Action category: responses are categorized into six distinct groups, each representing a specific response type.
The high percentage of safe responses across the five risk areas shows that all six models are largely safe. LLaMA-2 ranks first, followed by ChatGPT, Claude, GPT-4, and Vicuna; ChatGLM2 ranks last, with 85 harmful responses out of 939.
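As a quick check of the arithmetic behind these rankings, the sketch below converts a count of harmful responses into a harmless percentage. The function name `harmless_rate` is hypothetical and not part of the repository; only the figures 85 and 939 come from the text.

```python
def harmless_rate(n_harmful: int, n_total: int) -> float:
    """Percentage of responses that were NOT labelled harmful."""
    if n_total <= 0 or not 0 <= n_harmful <= n_total:
        raise ValueError("need 0 <= n_harmful <= n_total and n_total > 0")
    return 100.0 * (n_total - n_harmful) / n_total


# ChatGLM2: 85 harmful responses among 939 instructions.
print(f"{harmless_rate(85, 939):.1f}")  # prints 90.9
```

The result agrees with the 90.9 reported for ChatGLM2 in the Human column of the evaluation table.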
- Harmlessness Rank
- Action Category Distribution
Longformer evaluators achieve results comparable to those of human annotators and GPT-4.
- GPT-4 evaluation
- Fine-tuned Longformer classifier (600M)
Model | Human | GPT-4 | Longformer |
---|---|---|---|
LLaMA-2 | 99.7 | 99.4 | 99.3 |
ChatGPT | 98.5 | 97.7 | 97.9 |
Claude | 98.3 | 98.3 | 97.6 |
GPT-4 | 97.6 | 96.5 | 97.2 |
Vicuna | 94.5 | 94.9 | 95.0 |
ChatGLM2 | 90.9 | 92.9 | 92.9 |
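To make "comparable" concrete, the following sketch measures how far each automatic evaluator is from the human scores in the table above. The numbers are copied from the table; the names `SCORES` and `mean_abs_gap` are illustrative only, and mean absolute gap is one possible agreement measure, not necessarily the one used in the paper.

```python
# Harmless percentages from the table above: (Human, GPT-4, Longformer).
SCORES = {
    "LLaMA-2": (99.7, 99.4, 99.3),
    "ChatGPT": (98.5, 97.7, 97.9),
    "Claude": (98.3, 98.3, 97.6),
    "GPT-4": (97.6, 96.5, 97.2),
    "Vicuna": (94.5, 94.9, 95.0),
    "ChatGLM2": (90.9, 92.9, 92.9),
}


def mean_abs_gap(column: int) -> float:
    """Mean absolute difference between the human score and another column."""
    gaps = [abs(row[0] - row[column]) for row in SCORES.values()]
    return sum(gaps) / len(gaps)


# Rank models by the human-annotated harmless percentage, best first.
ranking = sorted(SCORES, key=lambda model: SCORES[model][0], reverse=True)
print(ranking)
print(round(mean_abs_gap(1), 2))  # GPT-4 vs. human
print(round(mean_abs_gap(2), 2))  # Longformer vs. human
```

On these six rows both evaluators differ from the human scores by well under one percentage point on average, with the largest gap (2.0 points) on ChatGLM2.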
NOTICE: please fill in your API info in do_not_answer\utils\info.yaml
before using commercial models such as GPT-4.
Refer to notebooks for details and paper replication.
Refer to datasets for the dataset used in our paper.
📄 Paper • 🤗 Dataset • 🔱 Evaluator
We construct a Chinese LLM safety evaluation dataset by translating and localizing the "Do-not-answer" dataset, expanding it with region-specific questions, and aligning it with country-specific AI generation regulations.
We then extend the resulting 1,014 questions from two perspectives:
- False Negative (FN) questions: risky questions posed in an evasive way, aimed at evaluating an LLM’s sensitivity to perceiving risks
- False Positive (FP) questions: harmless questions containing seemingly risky words, aimed at assessing whether the model is oversensitive, which can limit its helpfulness
This yields 3,042 Chinese questions for evaluating the risk perception and sensitivity to specific words and phrases of LLMs.
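The total can be reproduced with a small sketch. It assumes that every one of the 1,014 base questions receives exactly one FN and one FP variant, which is consistent with the stated total but is not spelled out in the text; the `Variant` enum and its member names are hypothetical.

```python
from enum import Enum


class Variant(Enum):
    ORIGINAL = "original"    # translated and localized risky question
    FALSE_NEGATIVE = "fn"    # risky question posed in an evasive way
    FALSE_POSITIVE = "fp"    # harmless question with risky-looking words


N_BASE_QUESTIONS = 1_014


def total_questions(n_base: int) -> int:
    """Each base question appears once per variant (assumption)."""
    return n_base * len(Variant)


print(total_questions(N_BASE_QUESTIONS))  # prints 3042
```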
Questions are organized by a two-level hierarchical taxonomy, covering 6 risk types and 17 specific harm types.
To evaluate the responses of LLMs to the questions, we use GPT-4 for automatic evaluation, and human evaluation to check the gap between human and GPT-4-based evaluation.
We use two strategies for automatic evaluation of the model responses:
- Question Set Evaluator: Given a risk area, we ask 1–5 risk-specific questions to identify whether a response touches on unsafe areas.
- Action Type Evaluator: Given a response, we ask the LLM to select one of six fine-grained categories describing how an LLM responds to a question. This is similar to the automatic evaluation in the original "Do-not-answer" dataset.
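The two strategies can be sketched roughly as follows. This is an illustration under several assumptions, not the repository's implementation: `Judge` stands for any function that sends a prompt to an LLM and returns its reply, the prompt wording is invented, a single "yes" is assumed to be enough to flag a response, and the six action categories are assumed to be numbered 0 to 5.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: prompt text in, reply text out.
Judge = Callable[[str], str]


def question_set_eval(response: str, questions: Sequence[str], judge: Judge) -> bool:
    """Return True if any risk-specific question is answered 'yes'.

    Assumption: one 'yes' marks the response as touching an unsafe area.
    """
    if not 1 <= len(questions) <= 5:
        raise ValueError("expected 1 to 5 risk-specific questions")
    for question in questions:
        prompt = f"Response:\n{response}\n\nQuestion: {question}\nAnswer yes or no."
        if judge(prompt).strip().lower().startswith("yes"):
            return True
    return False


def action_type_eval(question: str, response: str, judge: Judge,
                     n_categories: int = 6) -> int:
    """Ask the judge for one action category and parse it as an integer.

    Assumption: categories are numbered 0 .. n_categories - 1.
    """
    prompt = (
        f"Question:\n{question}\n\nResponse:\n{response}\n\n"
        f"Which of the {n_categories} action categories "
        f"(0 to {n_categories - 1}) best describes the response? "
        "Reply with the number only."
    )
    reply = judge(prompt)
    digits = [ch for ch in reply if ch.isdigit()]
    if not digits:
        raise ValueError(f"no category found in reply: {reply!r}")
    label = int(digits[0])
    if not 0 <= label < n_categories:
        raise ValueError(f"category out of range: {label}")
    return label
```

Passing the judge in as a parameter keeps the sketch independent of any particular API client, so a stub function can stand in for GPT-4 during testing.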
LLMs pre-trained mostly on Chinese text are better at understanding Chinese questions and give better answers than models pre-trained mainly on English text. The English-centric LLaMA-2 series performs worse than Chinese-centric models on Chinese safety datasets, especially for region-specific questions (Type VI).
The number of harmful responses is largest for the indirect-attack questions (yellow bar), followed by the direct-attack questions (blue bar). For LLaMA2 and Qwen, the numbers of harmful responses to direct-attack and general questions are very close, which suggests that some chatbots were likely instruction-tuned to be too sensitive to certain words.
GPT-4 correctly identifies the majority of safe responses, but performs at the level of random guessing on harmful responses. For action classification, responses falling into categories 3 and 4 tend to be classified as 5 by GPT-4. This implies, to some extent, that humans make more fine-grained distinctions between response patterns than GPT-4 does.
If our work is useful for your own, you can cite us with the following BibTeX entry:
@inproceedings{wang-etal-2024-answer,
title = "Do-Not-Answer: Evaluating Safeguards in {LLM}s",
author = "Wang, Yuxia and
Li, Haonan and
Han, Xudong and
Nakov, Preslav and
Baldwin, Timothy",
editor = "Graham, Yvette and
Purver, Matthew",
booktitle = "Findings of the Association for Computational Linguistics: EACL 2024",
month = mar,
year = "2024",
address = "St. Julian{'}s, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.findings-eacl.61",
pages = "896--911",
}
For the Chinese Do-Not-Answer dataset, you can cite us with the following BibTeX entry:
@article{wang2024chinese,
title={A Chinese Dataset for Evaluating the Safeguards in Large Language Models},
author={Wang, Yuxia and Zhai, Zenan and Li, Haonan and Han, Xudong and Lin, Lizhi and Zhang, Zhenxuan and Zhao, Jingru and Nakov, Preslav and Baldwin, Timothy},
journal={to appear in ACL 2024 findings},
year={2024}
}
All datasets in this repository are released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. All source files in this repository are released under the Apache 2.0 license, the text of which can be found in the LICENSE file.