
# mem-kk-logic: On Memorization of Large Language Models in Logical Reasoning
Stars: 60

README:
📃 Paper • Data • Perturbed Data • Project Page
This repository provides the PyTorch implementation of the paper "On Memorization of Large Language Models in Logical Reasoning".
Introduction: In this work, we investigate memorization of LLMs in reasoning tasks.
- We propose a memorization metric for reasoning tasks and a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles.
- LLMs could achieve high training accuracy after fine-tuning, yet fail when those puzzles are slightly perturbed, suggesting that the models heavily rely on memorization to solve those training puzzles.
- On the other hand, fine-tuning also consistently improves generalization performance. In-depth analyses with perturbation tests, cross-difficulty transferability, probing of model internals, and fine-tuning with wrong answers suggest that LLMs learn to reason on K&K puzzles despite memorizing the training data.
- Finally, we use puzzle-based and model-based indicators to classify puzzles solved by reasoning vs. memorization.
10/31/2024: Code, data, arXiv paper, and project page are available.
To set up the environment, run:
conda env create -f environment.yml
conda activate kk
When using our code for evaluation or fine-tuning, we load the datasets from Hugging Face:
import datasets
datasets.load_dataset('K-and-K/knights-and-knaves', 'test')
datasets.load_dataset('K-and-K/perturbed-knights-and-knaves', 'test')
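A quick way to sanity-check the download is to print the split names and one record. The snippet below is a minimal sketch that only assumes the two dataset IDs shown above; field names are whatever the hub datasets define.

```python
import datasets

# Load the clean and the perturbed test puzzles from the Hugging Face Hub.
kk_test = datasets.load_dataset('K-and-K/knights-and-knaves', 'test')
perturbed_test = datasets.load_dataset('K-and-K/perturbed-knights-and-knaves', 'test')

# Print the available splits and the fields of one example to see the schema.
print(kk_test)
first_split = next(iter(kk_test))
print(kk_test[first_split][0])
```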
To generate K&K data for {2,3,4,5,6,7,8}-people puzzles with a train/test split, run:
python data_prep/data_gen_kk.py
Locally perturbed data will also be generated, and all generated data is stored in the data directory.
In addition, the script can generate wrong-answer data and wrong-CoT data (including one wrong step and shuffled CoT steps).
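For intuition about what the generated puzzles look like, here is a minimal, self-contained sketch of a Knights-and-Knaves checker. It is illustrative only (not the logic of data_prep/data_gen_kk.py): it enumerates role assignments, keeps the ones consistent with every statement, and shows how perturbing a single statement changes the consistent solutions.

```python
from itertools import product

# Each person is a knight (always truthful) or a knave (always lying), so a
# statement must evaluate to True exactly when its speaker is a knight.

def consistent_assignments(statements):
    """Return all role assignments consistent with every statement.

    `statements[i]` is a function of a role tuple (True = knight) giving the
    truth value of person i's claim under that assignment.
    """
    n = len(statements)
    return [roles for roles in product([True, False], repeat=n)
            if all(statements[i](roles) == roles[i] for i in range(n))]

# 2-person puzzle: A says "B is a knave"; B says "we are both knights".
original = {
    0: lambda roles: not roles[1],
    1: lambda roles: roles[0] and roles[1],
}
print(consistent_assignments(original))   # unique solution: (True, False)

# A statement-level perturbation flips one claim; the solution set changes
# (here it is no longer unique), which is what the perturbed puzzles test.
perturbed = dict(original)
perturbed[0] = lambda roles: roles[1]     # A now says "B is a knight"
print(consistent_assignments(perturbed))
```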
Some general evaluation parameters:
| Argument | Example | Description |
|---|---|---|
| `--max_token` | `2048` | Maximum number of tokens. |
| `--split` | `train`, `test` | Data split used for evaluation. |
| `--limit` | `100` | Limit the number of evaluation samples. |
| `--ntrain` | `0`, `1` | Number of demonstrations for 0-shot/few-shot prompting. |
| `--problem_type` | `clean`, `perturbed_statement`, `perturbed_leaf`, `random_pair`, `reorder_statement`, `uncommon_name`, `flip_role` | Type of problem, supporting various perturbations. |
| `--eval_nppl` | `2`, `3`, `4`, `5`, `6`, `7`, `8` | Number of people in the K&K puzzles. If not set, all n-people tasks are evaluated. |
| `--vllm` | `true` | Enable vLLM for faster inference with open-source models. |
| `--model` | `openai/gpt-4o-mini-2024-07-18` | The model to evaluate. Both open-source and closed-source models are supported. |
For each K&K task, we evaluate all test samples (100 samples per task).
Evaluate on test samples under 0-shot/1-shot prompting, with and without CoT, by running:
bash scripts/eval/run_test.sh
Evaluate under 0-shot prompting without CoT on the 2 math-level perturbation types (`perturbed_statement`, `perturbed_leaf`):
bash scripts/eval/eval_test_pertub.sh
After fine-tuning the models (see the Fine-Tuning section below), we evaluate on training samples: the first 100 samples for the fine-tuned GPT-4o-mini, and all training samples for open-source models.
Evaluate under 0-shot prompting without CoT:
bash scripts/eval/eval_train.sh
Evaluation on perturbed training samples: evaluate under 0-shot prompting without CoT on the 6 perturbation types (`perturbed_statement`, `perturbed_leaf`, `random_pair`, `reorder_statement`, `uncommon_name`, `flip_role`):
bash scripts/eval/eval_train_pertub.sh
Provide API keys:
export OPENAI_API_KEY='your-api-key-here'
export ANTHROPIC_API_KEY='your-api-key-here'
Example usage for OpenAI/Anthropic models with direct prompting:
bash scripts/eval/gpt4omini_direct.sh
bash scripts/eval/claude-sonet.sh
Evaluate with CoT prompting:
bash scripts/eval/gpt4omini_cot.sh
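For reference, here is a minimal sketch of what direct vs. CoT prompting looks like against the OpenAI API. It is illustrative only (not the repository's evaluation harness); the puzzle text and prompt suffixes are placeholders, and `OPENAI_API_KEY` is assumed to be exported as above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

puzzle = ("A very special island is inhabited only by knights and knaves. "
          "Knights always tell the truth, and knaves always lie. "
          "You meet 2 inhabitants: Alice and Bob. Alice says, 'Bob is a knave.' "
          "Bob says, 'We are both knights.' Who is a knight and who is a knave?")

direct_prompt = puzzle + "\nGive only the final answer."
cot_prompt = puzzle + "\nLet's think step by step, then state the final answer."

for prompt in (direct_prompt, cot_prompt):
    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=2048,
    )
    print(response.choices[0].message.content)
```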
To fine-tune the model directly on answers (without CoT), run:
bash scripts/ft/ft_lm3.sh
To fine-tune the model with CoT, run:
bash scripts/ft/ft_lm3_cot.sh
You can change the saved model path `output_dir` in the above scripts.
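Conceptually, adapter-based fine-tuning sets up a LoRA-wrapped causal LM and trains only the adapter weights on the K&K question/answer pairs. The sketch below is illustrative only; the base model name and LoRA hyperparameters are placeholders, not the values used in scripts/ft/ft_lm3.sh.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Meta-Llama-3-8B"   # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,                                   # placeholder rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the adapter weights are trainable

# Training then proceeds with a standard causal-LM objective on the puzzles,
# using direct answers (no CoT) or answers with CoT reasoning steps.
```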
Load the saved adapter from fine-tuning, as well as the base model, then save the merged model by running:
bash scripts/ft/merge_adapter.sh
Make sure to change the model paths `base_model_path` and `adapter_path` in the script as needed.
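The merge step corresponds conceptually to the following PEFT sketch (paths are placeholders; the actual commands live in scripts/ft/merge_adapter.sh):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_path = "meta-llama/Meta-Llama-3-8B"   # placeholder base model
adapter_path = "outputs/kk-ft-adapter"           # placeholder adapter directory
merged_path = "outputs/kk-ft-merged"             # placeholder output directory

base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base_model_path)

model = PeftModel.from_pretrained(base, adapter_path)
model = model.merge_and_unload()   # fold the LoRA weights into the base weights

model.save_pretrained(merged_path)
tokenizer.save_pretrained(merged_path)
```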
For closed-source models, we use the OpenAI fine-tuning API to fine-tune GPT-4o-mini.
To probe the model's internal representations, update the model paths and the number of people (ppl) in the puzzles for evaluation in the script:
bash scripts/probe/run.sh
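As a rough illustration of what probing involves (a conceptual sketch, not scripts/probe/run.sh; the model path, layer index, and example data are placeholders): extract a hidden-layer representation per puzzle and fit a linear classifier on it.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_path = "outputs/kk-ft-merged"   # placeholder: a fine-tuned model directory
layer = 16                            # placeholder: which hidden layer to probe

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, output_hidden_states=True)
model.eval()

def last_token_rep(text):
    """Return the chosen layer's representation of the last prompt token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1].float().numpy()

# Placeholder data: puzzle prompts with binary labels (e.g., solved correctly or not).
texts = ["Alice says Bob is a knave. Bob says both are knights. Who is who?",
         "Alice says Bob is a knight. Bob says Alice is a knave. Who is who?"]
labels = [1, 0]

X = [last_token_rep(t) for t in texts]
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("probe training accuracy:", probe.score(X, labels))
```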
Here we classify puzzles as consistently solved vs. non-consistently solved.
Update the model paths, provide data with a binary label (consistently solved vs. not) for each training sample, and then run the following:
Classification with puzzle-based indicators:
bash scripts/mem_classify/model_indicator.sh
Classification with model-based indicators:
bash scripts/mem_classify/puzzle_indicator.sh
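As a rough sketch of indicator-based classification (the feature set here is an assumption for illustration, not the repository's exact indicators): build a feature vector per puzzle and fit a standard classifier against the consistently-solved labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder puzzle-based features per training sample:
# [number of people, number of statements, prompt length in tokens]
X = np.array([
    [2, 2, 85],
    [3, 3, 140],
    [5, 5, 260],
    [8, 8, 420],
])
y = np.array([1, 1, 0, 0])  # 1 = consistently solved, 0 = not

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=2)
print("cross-validated accuracy:", scores.mean())
```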
If you find our work helpful, please consider citing it as follows:
@article{xie2024memorization,
title={On Memorization of Large Language Models in Logical Reasoning},
author={Chulin Xie and Yangsibo Huang and Chiyuan Zhang and Da Yu and Xinyun Chen and Bill Yuchen Lin and Bo Li and Badih Ghazi and Ravi Kumar},
year={2024},
eprint={2410.23123},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.23123},
}
Please reach out to us if you have any suggestions or need any help in reproducing the results. You can submit an issue or pull request, or send an email to [email protected].