BIG-Bench-Mistake

A dataset of LLM-generated chain-of-thought steps annotated with mistake location.

BIG-Bench Mistake is a dataset of chain-of-thought (CoT) outputs annotated with the location of the first logical mistake. It was released as part of a research paper on benchmarking LLMs' mistake-finding ability. The dataset includes CoT traces for five tasks: Word Sorting, Tracking Shuffled Objects, Logical Deduction, Multistep Arithmetic, and Dyck Languages. Human annotators identified the mistake steps for four of the tasks, while Dyck Languages was annotated automatically. Each JSONL file contains the input question, the chain-of-thought steps, the model's answer, the correct answer, and the index of the first logical mistake.

README:

BIG-Bench Mistake

BIG-Bench Mistake is a dataset of chain-of-thought (CoT) outputs annotated with the location of the first logical mistake. This dataset was released as part of our paper, LLMs cannot find reasoning errors, but can correct them given the error location.

Our user interface for annotation

Few-shot prompting results

In our paper, we use this dataset to benchmark LLMs in terms of their mistake-finding ability. We try 3 different types of prompting, and find that LLMs struggle to identify logical mistakes. Results from GPT-4-Turbo, GPT-4, GPT-3.5-Turbo, Gemini Pro, and PaLM 2 Unicorn are shown below; for further details, please refer to our paper. Prompts used for mistake finding are found here.

Image of Table 4 in our paper, comparing few-shot mistake-finding performance between GPT-4-Turbo, GPT-4, GPT-3.5-Turbo, Gemini Pro, and PaLM 2 Unicorn. The best result is from GPT-4 at 52.87 overall accuracy using direct step-level prompting.
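For a rough sense of how a trace can be turned into a mistake-finding query, here is a minimal sketch. It is illustrative only: `build_mistake_finding_prompt` is a hypothetical helper, and the repository's own prompt files (linked above) use different wording and include few-shot examples.

```python
# Illustrative only: build_mistake_finding_prompt is a hypothetical helper,
# not one of the prompts used in the paper.
def build_mistake_finding_prompt(question: str, steps: list[str]) -> str:
    # The dataset's steps omit the "Thought N:" prefixes, so we add them here.
    numbered = "\n".join(f"Thought {i + 1}: {step}" for i, step in enumerate(steps))
    return (
        f"Question: {question}\n"
        f"{numbered}\n\n"
        "Is there a logical mistake in the reasoning above? "
        "If so, give the number of the first mistaken thought; "
        "otherwise answer 'No mistake'."
    )
```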

Data description

We use PaLM 2-L (Unicorn) to generate CoT traces for 5 tasks:

  1. Word Sorting
  2. Tracking Shuffled Objects
  3. Logical Deduction
  4. Multistep Arithmetic
  5. Dyck Languages

In our experiments, we treat Tracking Shuffled Objects and Logical Deduction as multiple choice tasks, while for Word Sorting, Multistep Arithmetic, and Dyck Languages we use exact matching.
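As a hedged sketch of how these two scoring regimes could be applied to a record (`is_correct` is a hypothetical helper, not code from this repository; the exact normalisation used in the paper may differ):

```python
# Hypothetical helper sketching the two scoring regimes described above.
def is_correct(record: dict, multiple_choice: bool) -> bool:
    answer = (record.get("answer") or "").strip()
    target = (record.get("target") or "").strip()
    if multiple_choice:
        # Multiple-choice answers are option letters such as "(A)":
        # compare the letters, tolerating parentheses and case.
        return answer.strip("() ").upper() == target.strip("() ").upper()
    # Word Sorting, Multistep Arithmetic, Dyck Languages: exact string match.
    return answer == target
```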

We then recruit human annotators to identify mistake steps in 4 of the 5 tasks. For Dyck Languages, we automatically annotate most of the traces using annotate_dyck_langauges.py. For further details, please refer to our paper.
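For intuition only, the sketch below shows a generic stack-based check that reports where a bracket sequence first breaks Dyck validity. This is not the logic of annotate_dyck_langauges.py, just the standard technique such a check rests on.

```python
# Generic bracket-matching check, for intuition only; the repository's
# annotate_dyck_langauges.py implements its own, task-specific annotation.
PAIRS = {")": "(", "]": "[", "}": "{", ">": "<"}

def first_violation(sequence: str) -> int | None:
    """Return the index of the first closer that breaks Dyck validity, or None."""
    stack = []
    for i, ch in enumerate(sequence):
        if ch in PAIRS.values():
            stack.append(ch)
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                return i  # mismatched or unexpected closing bracket
    return None  # leftover openers mean the sequence is incomplete, not invalid
```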

Each JSONL file contains outputs for a task from the BIG-Bench dataset. Each line contains a dictionary with the following keys (a minimal loading sketch follows the list):

  • input: A string containing the input question. For multiple choice tasks (Tracking Shuffled Objects and Logical Deduction), this also includes the options.
  • steps: A list of strings containing each step in the chain of thought. Note that this does not include the prefixes Thought 1:, Thought 2:, etc.
  • answer: A string containing the model's answer, extracted from the list of steps using the regex (?<=[Tt]he answer is).*$. For multiple choice tasks (Tracking Shuffled Objects and Logical Deduction), this is the letter indicating the option (e.g. (A)).
  • target: A string containing the target "correct" answer.
  • mistake_index: The index of the step containing the first logical mistake. Please note that this number is 0-indexed, so 0 indicates a mistake in the first step, 1 indicates a mistake in the second step, and so on. If there are no mistakes, this value is null.
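The fields can be read directly with Python's standard json module. A minimal loading sketch (the file name is a placeholder for whichever task split you use; the answer regex is the one quoted above):

```python
import json
import re

# Placeholder path: substitute any of the task JSONL files from this repository.
PATH = "word_sorting.jsonl"

# Answer-extraction regex quoted in the field description above.
ANSWER_RE = re.compile(r"(?<=[Tt]he answer is).*$")

with open(PATH) as f:
    records = [json.loads(line) for line in f]

for rec in records[:3]:
    print("Input:", rec["input"][:80])
    print("Steps:", len(rec["steps"]), "| answer:", rec["answer"], "| target:", rec["target"])

    # mistake_index is 0-indexed; JSON null becomes Python None when there is no mistake.
    idx = rec["mistake_index"]
    if idx is not None:
        print(f"First mistake at step {idx + 1} (0-indexed {idx}):", rec["steps"][idx])

    # The answer is typically stated in the final step, so the regex should recover it.
    match = ANSWER_RE.search(rec["steps"][-1])
    if match:
        print("Re-extracted answer:", match.group().strip())
```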

Citation

@inproceedings{tyen-etal-2024-llms,
    title = "{LLMs} cannot find reasoning errors, but can correct them given the error location",
    author = "Tyen, Gladys and Mansoor, Hassan and C\u{a}rbune, Victor and Chen, Peter and Mak, Tony",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    year = "2024",
    publisher = "Association for Computational Linguistics",
}
