
llm_chess
LLM Chess - Large Language Models Competing in Chess

llm_chess is a tool designed to pit Large Language Models (LLMs) against a Random Player in a chess game, testing basic instruction-following capabilities and chess proficiency. The tool sets constraints for the game, determines win/loss conditions, and handles exceptions/execution errors. Users can run games between LLMs and random players, configure player types, and analyze game results. The tool also supports running multiple games, processing logs, and preparing data for web visualization. It provides insights into player performance, model behavior, and future ideas for validation and benchmarking.
README:
Putting LLMs up against each other or a Random Player in a chess game. Testing basic instruction-following capabilities and, of course, chess proficiency :)
- llm_chess.py runs the game, collects the results, and records video
- The Random Player plays as white, the LLM as black
- Game constraints:
  - Max 200 moves (100 moves per player)
  - Max 10 turns in the LLM dialog when deciding on a move (a turn being a user/assistant pair of messages)
  - Max 3 mistakes in the LLM dialog (i.e. the 3rd mistake in the dialog breaks the game loop and assigns a LOSS to the LLM)
- Win OR Loss:
  - In the LLM dialog, if max turns is reached OR 3 mistakes are made by the LLM, the Random Player gets a WIN and the LLM gets a LOSS
  - If max moves is reached, a DRAW is given
  - The chess engine evaluates the board after each move and can give the corresponding player a WIN in case of a checkmate, or stop the game and give a DRAW for the following reasons: Stalemate, Insufficient Material, Seventy-five Moves, or Fivefold Repetition
- Exceptions/Execution Errors:
  - If execution is halted due to a programmatic error, the game is stopped and a DRAW is scored.
  - MANUAL inspection of logs is required:
    - If the error happens to be a connectivity or throttle-limit error from the API, the game is to be discarded
    - If the error is a model error (e.g. "The model produced invalid content. Consider modifying your prompt if you are seeing this error persistently." from OpenAI, or 400/500 errors from local models), the DRAW is to be changed to a LOSS for the LLM
- Decide whether you would like to put one LLM against another OR against a random player (a chaos monkey picking a random move out of a list of legal moves provided to it)
- Set use_random_player to True to make the random player play white and the LLM play black
- Set LLM params in the .env file (API key, etc.) for both the white and black player agents; a hypothetical .env-loading sketch follows this list
- Azure OpenAI is used by default; modify utils.py to use a different provider that is supported by Autogen 0.2
- Check configs (see next)
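For orientation, a minimal sketch of how such provider settings are typically loaded from a .env file. The variable names below are hypothetical placeholders, not necessarily the keys utils.py actually reads:

```python
# Hypothetical sketch only: check utils.py for the exact .env keys the
# project reads. The key names below are illustrative placeholders.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # pulls key=value pairs from .env into the environment

azure_config = {
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],          # hypothetical name
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],  # hypothetical name
    "api_version": os.environ.get("AZURE_OPENAI_API_VERSION", "2024-02-01"),
}
```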
- pip install -r requirements.txt
- Optionally create a venv first
- Run llm_chess.py
Adjust the global configs in llm_chess.py:
- white_player_type: Determines the type of player controlling the white pieces. Options include RANDOM_PLAYER, LLM_WHITE, LLM_BLACK, CHESS_ENGINE_SUNFISH, and CHESS_ENGINE_STOCKFISH.
- black_player_type: Determines the type of player controlling the black pieces. Options are the same as for white_player_type.
- enable_reflection: Whether to offer the LLM time to think and evaluate moves by giving it an extra "reflect" action.
- use_fen_board: A boolean indicating whether to use the FEN format for board representation. The default is False.
- max_game_moves: An integer specifying the maximum number of moves allowed in a game before it is automatically terminated. The default is 200.
- Constraints for a single move (LLM dialogs, if an LLM agent is used):
  - max_llm_turns: An integer indicating the maximum number of conversation turns (pairs of user/assistant messages) an LLM can take while deciding on and making a move. The default is 10.
  - max_failed_attempts: An integer that sets the number of incorrect replies or actions a player agent can make before the game is halted and the player is declared the loser. E.g. if a model returns an action name not in the requested format OR asks to make a move that is not possible, an internal counter grows and the model is asked to self-correct. If max_failed_attempts is reached, the game is interrupted and a WIN is given to the opposite player. The default value is 3.
- throttle_delay_moves: A delay in seconds between moves to prevent throttling by LLM providers due to frequent API requests. The default is 1 second.
These settings configure the game environment and control the flow of the chess match between the agents.
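Put together, the global config section of llm_chess.py might look roughly like this. The values shown are the documented defaults; the PlayerType enum container and the enable_reflection default are assumptions for illustration:

```python
# Sketch of the documented global configs (defaults per the README).
# The PlayerType enum container is an assumption for illustration.
from enum import Enum, auto

class PlayerType(Enum):
    RANDOM_PLAYER = auto()
    LLM_WHITE = auto()
    LLM_BLACK = auto()
    CHESS_ENGINE_SUNFISH = auto()
    CHESS_ENGINE_STOCKFISH = auto()

white_player_type = PlayerType.RANDOM_PLAYER
black_player_type = PlayerType.LLM_BLACK

enable_reflection = False  # extra "reflect" action (default not stated in the README)
use_fen_board = False      # FEN instead of the default board rendering
max_game_moves = 200       # DRAW once reached

# Constraints for a single move (LLM dialogs):
max_llm_turns = 10         # user/assistant pairs per move dialog
max_failed_attempts = 3    # wrong replies before the game is halted (LOSS)

throttle_delay_moves = 1   # seconds between moves, avoids provider throttling
```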
The run_multiple_games.py script allows you to execute multiple chess games between different agents and aggregate the results.
To run multiple games:
- Adjust the NUM_REPETITIONS variable to set the number of games you want to simulate.
- The results, including win/loss statistics and material counts, are aggregated and can be analyzed to understand the strengths and weaknesses of each player type.
- Aggregate logs and logs for individual games (if STORE_INDIVIDUAL_LOGS is set to True) can be stored in the specified LOG_FOLDER for further inspection.
This feature is used to compare different kinds of players and generalize the findings. For LLM players 10 games were used; for random/chess-engine players, 1000 games. Some stats are provided below.
When running multiple games, individual logs are collected in {date_time}.json files (e.g. 2024.10.11_17:34.json) gathered in a directory. After the games are finished, summary results are collected in aggregate_results.csv. A schematic sketch of this loop follows.
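Schematically, the multiple-games loop ties these constants together. run_single_game below is a stand-in for the project's actual game loop, shown only to illustrate the flow and the documented log-naming convention:

```python
# Schematic sketch: run_single_game is a stand-in, not the project's code.
import datetime
import json
import os

NUM_REPETITIONS = 10          # how many games to simulate
STORE_INDIVIDUAL_LOGS = True  # one {date_time}.json per game
LOG_FOLDER = "_logs/demo"     # where the logs land

def run_single_game() -> dict:
    """Placeholder for the real game loop in run_multiple_games.py."""
    return {"winner": "NONE", "reason": "Max moves reached"}

os.makedirs(LOG_FOLDER, exist_ok=True)
for _ in range(NUM_REPETITIONS):
    result = run_single_game()
    if STORE_INDIVIDUAL_LOGS:
        # e.g. 2024.10.11_17:34.json, matching the documented convention
        # (note: ':' in filenames is not Windows-friendly)
        name = datetime.datetime.now().strftime("%Y.%m.%d_%H:%M") + ".json"
        with open(os.path.join(LOG_FOLDER, name), "w") as f:
            json.dump(result, f, indent=2)
```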
Notes:
- "wrong_moves" and "wrong_actions" is the total number of erroneous replies by the player (e.g. not following convention) made in "inner" dialogs between the proxy and player. To get the total number of erroneous replies add up the metrics (i.e.
wrong_moves
do not includewrong_actions
)- A single game is finished if any of the players make more than
max_failed_attempts
in the inner dialogs (i.e.wrong_moves
+wrong_actions
<max_failed_attempts
- see above config)
- A single game is finished if any of the players make more than
- data_processing/aggregate_logs_to_csv.py can process individual game logs within a given folder and produce a CSV grouping logs by model name and aggregating stats. Change the default params to point the script to the correct folder and output CSV file.
Run data_processing/get_refined.py to produce refined.csv ready for embedding in index.html. Manually copy the needed rows into the index.html const in JS; the data will be used in the Web Leaderboard.
Why copy manually? Because you are likely updating a leaderboard, and some of the models on the Web might have different names than the ones used in the logs. A hedged sketch of this kind of log-to-CSV aggregation follows.
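A sketch of that kind of aggregation, not the actual script; the field names are assumed from the stats JSON shown later in this README and may differ in real per-game logs:

```python
# Illustrative aggregation sketch; field names assumed from the stats JSON
# shown later in this README and may differ in real per-game logs.
import csv
import glob
import json
import statistics
from collections import defaultdict

LOGS_DIR = "_logs/demo"   # folder with per-game {date_time}.json files
OUTPUT_CSV = "aggregate.csv"

games_by_model = defaultdict(list)
for path in glob.glob(f"{LOGS_DIR}/*.json"):
    with open(path) as f:
        game = json.load(f)
    games_by_model[game["player_black"]["model"]].append(game)

with open(OUTPUT_CSV, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "games", "avg_moves", "avg_wrong_moves", "avg_wrong_actions"])
    for model, games in sorted(games_by_model.items()):
        writer.writerow([
            model,
            len(games),
            statistics.mean(g["total_moves"] for g in games),
            statistics.mean(g["player_black"]["wrong_moves"] for g in games),
            statistics.mean(g["player_black"]["wrong_actions"] for g in games),
        ])
```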
- LLM Player Agent (instantiated from Autogen's standard ConversableAgent) is used to evaluate the board and make moves on behalf of one of the players (white or black)
- Random Player Agent (custom RandomPlayerAgent) requests a list of legal moves, randomly picks one, and then makes the move. Always plays as white
- Proxy Agent (custom AutoReplyAgent) is used to kick off the conversation with a player agent and provides the list of available actions (get_current_board, get_legal_moves, make_move); a minimal sketch of such an action dispatcher follows this list
- Chess Engine Players:
- Sunfish: A simple Python engine.
- Stockfish: A powerful engine requiring separate installation.
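The three actions map directly onto python-chess primitives. A minimal sketch of such a dispatcher, illustrative only; the project's AutoReplyAgent differs in detail:

```python
# Minimal sketch of the documented proxy actions on top of python-chess.
# Illustrative only - the project's AutoReplyAgent implementation differs.
import chess

board = chess.Board()

def handle_action(reply: str) -> str:
    reply = reply.strip()
    if reply == "get_current_board":
        return str(board)  # board rendering (FEN if use_fen_board is set)
    if reply == "get_legal_moves":
        return ",".join(move.uci() for move in board.legal_moves)
    if reply.startswith("make_move "):
        uci = reply.split(" ", 1)[1]
        try:
            move = board.parse_uci(uci)  # raises on malformed/illegal UCI
        except ValueError:
            return f"Failed to make move: illegal uci: '{uci}' in {board.fen()}"
        board.push(move)
        return "Move made, switching player"
    return ("Invalid action. Pick one, reply exactly with the name and space "
            "delimited argument: get_current_board, get_legal_moves, "
            "make_move <UCI formatted move>")
```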
Sample dialog (GPT-4o-Mini):
Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of 3 actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.
--------------------------------------------------------------------------------
Player_Black (to Proxy):
get_current_board
--------------------------------------------------------------------------------
Proxy (to Player_Black):
[Unicode rendering of the current board; the piece glyphs were garbled in extraction]
--------------------------------------------------------------------------------
Player_Black (to Proxy):
get_legal_moves
--------------------------------------------------------------------------------
Proxy (to Player_Black):
f8h8,f8g8,b8a8,g7h8,g7h6,f6g8,f6h5,f6d5,f6g4,f6e4,c6e5,c6a5,c6d4,c6b4,h7h6,e7e6,d7d6,b7b6,a7a6,e7e5,d7d5,b7b5,a7a5
--------------------------------------------------------------------------------
Player_Black (to Proxy):
make_move f8g8
--------------------------------------------------------------------------------
Proxy (to Player_Black):
Move made, switching player
Game run results are stored under the _logs folder.
- Logs before 08.10.2024 (first 8) are more strict with wrong-move stats
- Logs before 16.10.2024 had an ambiguous "Unknown issue, Player_Black failed to make a move" reason, which often meant that a single dialog took more than 10 turns (20 messages total) and execution was halted; changed to the more specific reason "Max turns in single dialog"
- Different configs could be used; see the corresponding directory/folder
- Log _15.10.2024_gpt-4o-anthropic.claude-v3-5-sonnet_reflectio had a timeout error; the aggregate only had 9 out of 10 consistent runs
- After 19.10.2024, default hyperparams were set ("temperature": 0.3, "top_p": 1.0, "frequency_penalty": 0.0, "presence_penalty": 0.0)
- 19.10.2024, Gemini-1.5-flash-001 consistently failed to follow the instructions
- 22.10.2024, slightly updated the common prompt, removing excessive tabs
- 05.11.2024, fixed a bug with wrong-action counting (not global per game but per move), set temperature to 0.7, re-ran no_reflection
- 14.01.2025, changed logs and switched DRAWs to LLM losses due to model errors for Gemini-1.5-pro-preview-0409 (8 logs from November, no reflection), qwq-32b-preview@q4_k_m (1 log from January 9, no reflection), sky-t1-32b-preview@q4_0 (1 log from January 14, no reflection), o1-mini-2024-09-12 (1 log from December 22, no reflection), gpt-4-turbo-2024-04-09 (4 logs from November, reflection)
- 31.01.2025 - changed from weighted average to simple average for material count, wrong moves, wrong actions
- 08.03.2025 - thinking models with think tags; changed ignore_text to remove_text, ensuring that thinking tags are not included in subsequent messages shared with the model (i.e. removing think sections returned previously from later messages in multi-turn dialogs; less clutter, recommended by QwQ and presumably assumed by any thinking model, since they don't need to see past thinking trails). A sketch of this stripping follows the changelog
- 12.03.2025 - added PGN history to logs
- 16.03.2025 - wrong-action and wrong-move stats had been collected incorrectly (underreporting) in all prior logs; the mistakes metric for all prior logs is invalid
- 16.04.2025 - removed "frequency_penalty": 0.0, "presence_penalty": 0.0 as not relevant and breaking LLM config validation with some models
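A sketch of the think-tag stripping idea. The <think>...</think> tag format is an assumption here (it is what QwQ/R1-style models commonly emit), and the project's remove_text handling may differ:

```python
# Illustrative: drop <think>...</think> sections from past assistant messages
# so reasoning trails are not re-sent in multi-turn dialogs. The tag format
# is an assumption; the project's remove_text handling may differ.
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def remove_think_sections(text: str) -> str:
    return THINK_RE.sub("", text).strip()

assert remove_think_sections("<think>Nf6 looks safe...</think>make_move g8f6") == "make_move g8f6"
```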
I have conducted a number of games putting an LLM (playing black) against a Random Player (as white), typically 10 games with a cap of 200 moves.
!NOTE: prompts were not touched; parsing logic was changed to be more relaxed. Logs before 08.10.2024 (first 8) are more strict with wrong-move stats
Original kick-off prompt:
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of 3 actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.
And the failure message:
Invalid action. Pick one, reply exactly with the name and space delimited argument: get_current_board, get_legal_moves, make_move <UCI formatted move>
- GPT-4o, GPT-4o mini, and gemini-1.5-pro-preview-0409 worked fine with the original prompts and exact-match logic for make_move
- Claude 3.5 Sonnet failed to reply with a simple string, always adding verbosity
Proxy (to Player_Black):
Invalid action. Pick one, reply exactly with the name and space delimited argument: get_current_board, get_legal_moves, make_move <UCI formatted move>
--------------------------------------------------------------------------------
Player_Black (to Proxy):
I apologize for the continued confusion. Let me provide the correct format for the move I want to make:
make_move c7c5
- Gemini-1.5-flash-001 always wrapped answers in JSON code blocks (while not asked to)
- llama-3.1-nemotron-70b-instruct-hf 4-bit was too verbose, failing to take any actions (2024.10.22_23:18_lama-3.1-nemotron-70b-instruct-hf.md)
- gpt-35-turbo-0125 failed to make any moves by not complying with the instructions and reply format (see _23.10.2024_gpt-35-turbo-0125/fail.md)
Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of 3 actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
Respond with the action.
--------------------------------------------------------------------------------
Player_Black (to Proxy):
get_current_board
--------------------------------------------------------------------------------
Proxy (to Player_Black):
[Unicode rendering of the current board; the piece glyphs were garbled in extraction]
--------------------------------------------------------------------------------
Player_Black (to Proxy):
get_legal_moves
--------------------------------------------------------------------------------
Proxy (to Player_Black):
f8g7,e8f7,b8c6,a8a7,h7h8,h7g7,h7f7,h5g7,h5f4,h5g3,e7e6,d7d6,c7c6,b7b6,g6g5,f6f5,e7e5,d7d5,c7c5,b7b5
--------------------------------------------------------------------------------
Player_Black (to Proxy):
make_move h7h6
--------------------------------------------------------------------------------
Proxy (to Player_Black):
Failed to make move: illegal uci: 'h7h6' in rnbqkb2/1pppp2r/p4ppp/P6n/3P1P2/1PN2N2/R1P1P1PP/2BQKB1R b Kq - 2 8
- Azure OpenAI:
- gpt-4.1-2025-04-14
- gpt-4.5-preview-2025-02-27
- o3-mini-2025-01-31-medium
- o3-mini-2025-01-31-low
- o1-preview-2024-09-12
- o1-mini-2024-09-12
- gpt-4o-2024-11-20
- gpt-4o-2024-08-06
- gpt-4o-2024-05-13
- gpt-4o-mini-2024-07-18
- gpt-35-turbo-0125
- gpt-35-turbo-0301
- gpt-35-turbo-0613
- gpt-35-turbo-1106
- gpt-4-turbo-2024-04-09
- gpt-4-0613
- gpt-4-32k-0613
- OpenAI
- gpt-4.1-2025-04-14
- gpt-4.1-mini-2025-04-14
- gpt-4.1-nano-2025-04-14
- gpt-4.5-preview-2025-02-27
- AWS:
- anthropic.claude-v3-5-sonnet-v1
- anthropic.claude-v3-5-sonnet-v2
- anthropic.claude-v3-opus
- anthropic.claude-v3-haiku
- amazon.nova-pro-v1
- amazon.nova-lite-v1
- Google:
- gemini-1.5-flash-001
- gemini-1.5-pro-preview-0409
- gemini-2.0-flash-exp
- gemini-2.0-flash-001
- gemini-2.0-flash-lite-001
- gemini-2.0-flash-lite-preview-02-05
- gemini-2.0-flash-thinking-exp-1219
- gemini-2.0-flash-thinking-exp-01-21
- gemini-2.0-flash-lite-preview-02-05
- chat-bison-32k@002 (Palm 2)
- x.ai
- grok-2-1212
- Groq:
- llama-3.2-90b-vision-preview
- llama-3.3-70b-versatile
- llama-3.3-70b-specdec
- llama3-70b-8192
- llama-3.1-8b-instant
- llama3-8b-8192
- mixtral-8x7b-32768
- gemma2-9b-it
- Cerebras:
- llama-4-scout-17b-16e-instruct
- llama-3.3-70b
- llama3.1-8b
- qwen3-32b
- Deepseek:
- deepseek-chat-v3
- deepseek-reasoner-r1
- Alibaba:
- qwen-max-2025-01-25
- qwen-turbo-2024-11-01
- qwen-plus-2025-01-25
- qwen2.5-72b-instruct
- qwen2.5-14b-instruct-1m
- qwen2.5-7b-instruct-1m
- Mercury
- mercury-coder-small
- Local (up-to-date llama.cpp/LM Studio as of the time of testing, RTX 4090 24GB, default settings - 8k context, 100% GPU off-load when possible):
- llama-2-7b-chat@q8_0
- llama-3-70b-instruct-awq
- meta-llama-3.1-8b-instruct@fp16
- meta-llama-3.1-8b-instruct@q8_0
- meta-llama-3.1-8b-instruct@q4_k_m
- gemma-2-27b-it@q6_k_l
- gemma-2-9b-it-8bit
- gemma-2-9b-it@iq4_xs
- granite-3.1-8b-instruct@q8_0
- internlm3-8b-instruct@q8_0
- ministral-8b-instruct-2410@q8_0
- mistral-nemo-12b-instruct-2407@q8_0
- mistral-small-instruct-2409
- phi-4@q8
- qwen2.5-14b-instruct@q8_0
- qwq-32b-preview@q4_k_m
- sky-t1-32b-preview@q4_0
- deephermes-3-llama-3-8b-preview@q8
- qwen3-14b@iq4_xs
- Random players request all legal moves and randomly pick one, thus always making valid moves; a minimal sketch follows this list
- Chess engines:
  - Simple no-dependency Python engine Sunfish (https://github.com/thomasahle/sunfish)
  - Powerful Stockfish; a python-chess usage sketch follows this list as well
    - Requires separate installation and properly defining the path with python-chess
      - On macOS, I installed it via brew install stockfish
      - The path to Stockfish is identified via where stockfish, yielding /opt/homebrew/bin/stockfish
      - Set the path in llm_chess.py.
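A minimal sketch of the random-player idea with python-chess: ask for the legal moves and pick one at random, so the move is valid by construction:

```python
# Random-player sketch: choosing from board.legal_moves is always valid.
import random
import chess

board = chess.Board()
while not board.is_game_over():
    board.push(random.choice(list(board.legal_moves)))
print(board.result())  # e.g. "1/2-1/2"
```

And driving Stockfish through python-chess, using the Homebrew path mentioned above; the time limit is an illustrative choice, not the project's setting:

```python
# Stockfish via python-chess; path per `where stockfish` on macOS/Homebrew.
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("/opt/homebrew/bin/stockfish")
board = chess.Board()
result = engine.play(board, chess.engine.Limit(time=0.1))  # 100 ms per move
board.push(result.move)
engine.quit()
```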
Below are some stats from simulating many games (1000) and collecting results, to get a baseline for LLM players playing against a random player. The first block below is Chess_Engine_Stockfish (black) vs Random_Player (white); the second is Random_Player vs Random_Player. A material-count sketch follows the two blocks.
{
"total_games": 1000,
"white_wins": 0,
"black_wins": 1000,
"draws": 0,
"total_moves": 57922,
"reasons": {
"Checkmate": 1000
},
"player_white": {
"name": "Random_Player",
"model": "",
"total_material": 19125,
"wrong_moves": 0,
"wrong_actions": 0,
"reflections_used": 0,
"reflections_used_before_board": 0,
"avg_material": 19.125,
"std_dev_material": 10.335966679861503
},
"player_black": {
"name": "Chess_Engine_Stockfish_Black",
"model": "",
"total_material": 35113,
"wrong_moves": 0,
"wrong_actions": 0,
"reflections_used": 0,
"reflections_used_before_board": 0,
"avg_material": 35.113,
"std_dev_material": 5.572188105249165
},
"average_moves": 57.922,
"std_dev_moves": 30.961727092404658
}
{
"total_games": 1000,
"white_wins": 105,
"black_wins": 0,
"draws": 895,
"total_moves": 190073,
"reasons": {
"Max moves reached": 886,
"Checkmate": 105,
"Stalemate": 9
},
"player_white": {
"name": "Random_Player",
"model": "",
"total_material": 10555,
"wrong_moves": 0,
"wrong_actions": 0,
"avg_material": 10.555,
"std_dev_material": 7.228224001940049
},
"player_black": {
"name": "Random_Player",
"model": "",
"total_material": 11076,
"wrong_moves": 0,
"wrong_actions": 0,
"avg_material": 11.076,
"std_dev_material": 7.491217680880882
},
"average_moves": 190.073,
"std_dev_moves": 32.05006829755403
}
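For reference, a material count can be computed from a board like this. Standard piece values are shown; the exact weights behind the stats above are an assumption:

```python
# Material-count sketch with standard piece values (P=1, N=3, B=3, R=5, Q=9).
# The exact weights used for the stats above are an assumption.
import chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9}

def material(board: chess.Board, color: chess.Color) -> int:
    return sum(value * len(board.pieces(piece, color))
               for piece, value in PIECE_VALUES.items())

board = chess.Board()
print(material(board, chess.WHITE))  # 39 at the start: 8 + 6 + 6 + 10 + 9
```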
If enabled (enable_reflection = True), an extra action is added to the list of available options, offering the LLM a chance to reflect on the best move, generating more tokens and thinking through the best next move.
Sample dialog:
Proxy (to Player_Black):
You are a professional chess player and you play as black. Now is your turn to make a move. Before making a move you can pick one of 3 actions:
- 'get_current_board' to get the schema and current status of the board
- 'get_legal_moves' to get a UCI formatted list of available moves
- 'do_reflection' to take a moment to think about your strategy
- 'make_move <UCI formatted move>' when you are ready to complete your turn (e.g., 'make_move e2e4')
--------------------------------------------------------------------------------
Player_Black (to Proxy):
I'll start by checking the current state of the board.
Action: `get_current_board`
--------------------------------------------------------------------------------
Proxy (to Player_Black):
[Unicode rendering of the current board; the piece glyphs were garbled in extraction]
--------------------------------------------------------------------------------
Player_Black (to Proxy):
Let's analyze the board further by checking possible moves.
Action: `get_legal_moves`
--------------------------------------------------------------------------------
Proxy (to Player_Black):
g8e7,g8h6,g8f6,f8e7,f8d6,f8c5,f8b4,f8a3,e8e7,e8d7,d8e7,d8d7,d8f6,d8d6,d8g5,d8h4,c8d7,c8e6,c8f5,c8g4,c8h3,b8d7,b8c6,b8a6,h7h6,g7g6,f7f6,c7c6,b7b6,a7a6,e5e4,d5d4,h7h5,g7g5,f7f5,c7c5,b7b5,a7a5
--------------------------------------------------------------------------------
Player_Black (to Proxy):
I'll think about the best strategy given these options.
Action: `do_reflection`
--------------------------------------------------------------------------------
Proxy (to Player_Black):
Before deciding on the next move you can reflect on your current situation, write down notes, and evaluate.
Here are a few recommendations that you can follow to make a better move decision:
- Shortlist the most valuable next moves
- Consider how they affect the situation
- What could be the next moves from your opponent in each case
- Is there any strategy fitting the situation and your choice of moves
- Rerank the shortlisted moves based on the previous steps
--------------------------------------------------------------------------------
Player_Black (to Proxy):
**Reflection:**
1. **Shortlist Valuable Moves:**
- **g8f6**: Develops the knight, controls the center, and targets e4.
- **h7h6**: Prepares to fianchetto the bishop or prevents a knight pin on g5.
- **c7c5**: Challenges the center and prepares to expand on the queenside.
2. **Immediate Effects:**
- **g8f6**: Develops a piece, and brings the knight nearer to the center.
- **h7h6**: Solidifies kingside but is more of a precautionary move.
- **c7c5**: Direct action against the center, preparing for future breaks.
3. **Opponent's Responses:**
- **g8f6**: White might continue developing with Nc3 or push e4.
- **h7h6**: White might ignore it and proceed with normal development.
- **c7c5**: White could opt for dxc5 or develop with Nc3.
4. **Strategy Considerations:**
- Aim to control the center and set up a solid defensive position.
- Balance between development and maintaining structural integrity.
5. **Rerank Shortlisted Moves:**
- **1. g8f6**: Best for development and central control.
- **2. c7c5**: Strong challenge to the center.
- **3. h7h6**: Defensive but adds limited value immediately.
With that in mind, I'll proceed with the development strategy.
Action: `make_move g8f6`
--------------------------------------------------------------------------------
Proxy (to Player_Black):
Move made, switching player
Prompt:
Please name chess pieces and their colors.
[the prompt included two rows of Unicode chess-piece glyphs; the glyphs here and in the model answers below were garbled in extraction and are omitted]
Here are the chess pieces and their colors:
White pieces: Queen, King, Bishop, Knight, Rook, Pawn
Black pieces: Queen, King, Bishop, Knight, Rook, Pawn
In a standard chess set, one player uses the white pieces, and the other player uses the black pieces. Each side starts with 16 pieces: 1 king, 1 queen, 2 rooks, 2 knights, 2 bishops, and 8 pawns.
Here are the chess pieces and their colors:
**White Pieces:**
* Queen
* King
* Bishop
* Knight
* Rook
* Pawn
**Black Pieces:**
* Queen
* King
* Bishop
* Knight
* Rook
* Pawn
White Pieces
White Queen
White King
White Bishop
White Knight
White Rook
White Pawn
Black Pieces
Black Queen
Black King
Black Bishop
Black Knight
Black Rook
Black Pawn
In chess, the pieces are typically divided into two colors: white and black. Here's a breakdown of the pieces and their corresponding colors:
White Pieces:
White Queen
White King
White Bishop
White Knight
White Rook
White Pawn
Black Pieces:
Black Queen
Black King
Black Bishop
Black Knight
Black Rook
Black Pawn
Each player controls one color and aims to checkmate the opponent's king.
- Validate that Win/Draw rates are representative of reasoning abilities
  - Currently there's a threshold of 0 wins which was reliably beaten only by OpenAI o1 models and DeepSeek R1, models generally well accepted by the public as special and overperforming
  - This supports the idea that chess proficiency is a good indicator of reasoning; the behavior is binary: either you step over the threshold or you stay behind it, either you do some reasoning or you keep generating text converting input to output
- Cross-check with other benchmarks
  - I see alignment of the instruction-following evaluation with other benches
    - E.g. ComplexFuncBench has this distinctive split between open-source and closed models, with the latter doing much better at complying with the prompt
ComplexFuncBench is designed to test complex function calling with over 1,000 real-world scenarios across five aspects: (1) multiple steps in a single turn, (2) user-provided constraints, (3) parameter value reasoning from implicit information, (4) long parameter values that exceed 500 tokens, and (5) 128k long-context length. Insights: Claude 3.5 61%; OpenAI 4o 60.5%; Qwen 2.5 72B 40%; Llama 3.1 70B 2.7%. Open LLMs underperform compared to closed models. Value Errors are a major error type. Early stops occur in multi-step calls, failing to complete all necessary functions. Keys like 'filter' and 'legs' are more challenging. Paper: https://lnkd.in/eTmf4RPc GitHub: https://lnkd.in/eKaSG6KR
- Rerun the AIW benchmark (https://github.com/LAION-AI/AIW?tab=readme-ov-file) for more recent models and cross-check results; a brainteaser question that led to the breakdown of older Claude 3 and GPT-4
- Correlate with TAU-Bench (https://github.com/sierra-research/tau-bench)
- Better study and describe stochasticity. Randomization of games makes the bench unique: no memorization can happen, every game is different
  - Something similar to the GSM-Symbolic observations, where adding mutations to a well-known math bench demonstrates how models struggle and the pass rate drops: https://machinelearning.apple.com/research/gsm-symbolic