TurtleBench

TurtleBench

Benchmark for LLM Reasoning & Understanding with Challenging Tasks from Real Users.

Stars: 106

Visit
 screenshot

TurtleBench is a dynamic evaluation benchmark that assesses the reasoning capabilities of large language models through real-world yes/no puzzles. It emphasizes logical reasoning over knowledge recall by using user-generated data from a Turtle Soup puzzle platform. The benchmark is objective and unbiased, focusing purely on reasoning abilities and providing clear, measurable outcomes for easy comparison. TurtleBench constantly evolves with real user-generated questions, making it impossible to 'game' the system. It tests the model's ability to comprehend context and make logical inferences.

README:

TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles

arXiv Paper Hugging Face TurtleBench1.5k Apache 2.0 License

TurtleBench is a dynamic evaluation benchmark designed to assess the reasoning capabilities of large language models (LLMs) through real-world yes/no puzzles, emphasizing logical reasoning over knowledge recall by using user-generated data from a Turtle Soup puzzle platform.

Highlights

  • Objective and Unbiased: Eliminates the need for background knowledge, focusing purely on reasoning abilities.
  • Quantifiable Results: Clear, measurable outcomes (correct/incorrect/unknown) for easy comparison.
  • Constantly Evolving: Uses real user-generated questions, making it impossible to "game" the system.
  • Language Understanding: Tests the model's ability to comprehend context and make logical inferences.

Quick Start

# Install dependencies
conda create -n turtle python=3.10
conda activate turtle
pip install -r requirements.txt

# Set up configuration
mv config_example.ini config.ini
# Edit config.ini to add your API key

# Run evaluations
python eval.py --shot 0 --models Claude_3_5_Sonnet --language zh --save_interval 10 --time_delay 2

# Analyze results
python analyst.py

Project Structure

.
├── README.md                  # Project description and documentation
├── analyst.py                 # Data analysis script
├── archived                   # Archived result files
├── config.ini                 # Configuration file (for actual run)
├── config_example.ini         # Example configuration file
├── data                       # Directory containing project data
│   ├── en                     # Subdirectory for English data
│   └── zh                     # Subdirectory for Chinese data
├── eval.py                    # Evaluation script
├── insights                   # Analysis and visualization-related scripts
│   ├── case_handler.py         # OpenAI o1 Error case handling script
│   ├── plots.ipynb             # Data visualization and plot generation
│   ├── token_boxplot.py        # Box plot for token usage
│   └── token_cal.py            # Token calculation script
├── logs                       # Directory for storing logs
├── models.py                  # Model definition script
├── outputs                    # Directory for storing output results
├── prompts                    # Files related to prompt generation
├── requirements.txt           # Project dependencies list
└── stats                      # Statistics and result-related files

Results

Note: You can find detailed experiment results in the archived directory.

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for TurtleBench

Similar Open Source Tools

For similar tasks

For similar jobs