mem-kk-logic

On Memorization of Large Language Models in Logical Reasoning

Stars: 60

Visit

README:

On Memorization of Large Language Models in Logical Reasoning

📃 Paper • Data • Perturbed Data • Project Page

This repository provides the PyTorch implementation of the paper "On Memorization of Large Language Models in Logical Reasoning".

Introduction: In this work, we investigate memorization of LLMs in reasoning tasks.

We propose a memorizatioin metric for reasoning tasks and a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles.
LLMs could achieve high training accuracy after fine-tuning, yet fail when those puzzles are slightly perturbed, suggesting that the models heavily rely on memorization to solve those training puzzles.
On the other hand, fine-tuning also consistently improves generalization performance. In-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers suggest that the LLMs learn to reason on K&K puzzles despite training data memorization.
Finally, we use puzzle-based indicators and model-based indicators to classify puzzles solved by reasoning v.s. memorization.

Updates

10/31/2024: Code, data, ArXiv article and project page are available.

🛠️ Installation

conda env create -f environment.yml
conda activate kk

📝 Synthetic data

Option 1: use HF dataset

When using our code for evaluation / fine-tuning, we import the datasets from huggingface:

import datasets
datasets.load_dataset('K-and-K/knights-and-knaves', 'test')
datasets.load_dataset('K-and-K/perturbed-knights-and-knaves', 'test')

Option 2: generate data locally

To generate K&K data for {2,3,4,5,6,7,8}-people puzzles with a train/test split, run:

python data_prep/data_gen_kk.py

Locally perturbed data will also be generated. The generated data will be stored in the data directory.

In addition, you can use it to generate wrong answer data and wrong CoT data (including one wrong step and shuffuled CoT steps).

🤖 Evaluation

Some general evaluation parameters:

Argument	Example	Description
`--max_token`	`2048`	Maximum number of tokens.
`--split`	`train`, `test`	Choose the data split for evaluation.
`--limit`	`100`	Limit the number of evaluation samples.
`--ntrain`	`0`, `1`	Number of demonstrations for 0-shot/few-shot prompting.
`--problem_type`	`clean`, `perturbed_statement`, `perturbed_leaf`, `random_pair`, `reorder_statement`, `uncommon_name`, `flip_role`	Type of problem, supporting various perturbations.
`--eval_nppl`	`2`,`3`,`4`,`5`,`6`,`7`,`8`	Number of people in K&K puzzles. If not set, it will evaluate all n-ppl tasks.
`--vllm`	`true`	Enable VLLM for faster inference for open-source models.
`--model`	`openai/gpt-4o-mini-2024-07-18`	The model to be evaluated. We support open-source and closed-sourced models.

Evaluation on test samples

For each K&K task, evaluate all test samples (100 samples).

Evaluate on test samples under 1/0-shot & with/without CoT by running:

bash scripts/eval/run_test.sh

Evaluate under 0-shot & without CoT on 2 math-level perturbation types (perturbed_statement, perturbed_leaf):

bash scripts/eval/eval_test_pertub.sh

Evaluation on training samples

After fine-tuning the models following ## 4. Fine-Tuning, we evaluate on training samples. We evaluate the first 100 samples for the fine-tuned GPT-4o-mini, and all samples for open-source models.

Evaluate under 0-shot & without CoT

bash scripts/eval/eval_train.sh

Evaluation on Perturbed Training Samples:

Evaluate under 0-shot & without CoT on 6 perturbation types (perturbed_statement, perturbed_leaf random_pair, reorder_statement, uncommon_name, flip_role):

bash scripts/eval/eval_train_pertub.sh

Evaluation on closed-sourced models

Provide API keys:

export OPENAI_API_KEY='your-api-key-here'
export ANTHROPIC_API_KEY='your-api-key-here'

Example usages for OpenAI/Anthropic models with direct prompting:

bash scripts/eval/gpt4omini_direct.sh
bash scripts/eval/claude-sonet.sh

Evaluate with cot prompting:

bash scripts/eval/gpt4omini_cot.sh

🚗 Fine-tuning

Direct fine-tune

To fine-tune the model directly on answers (without CoT), run:

bash scripts/ft/ft_lm3.sh

CoT fine-tune

To fine-tune the model with CoT, run:

bash scripts/ft/ft_lm3_cot.sh

You can change the saved model path output_dir in the above scripts.

Merge fine-tuned adapter and base model

Load the saved adapter from fine-tuning, as well as the base model, then save the merged model by running:

bash scripts/ft/merge_adapter.sh

Make sure to change the model paths base_model_path, adapter_path, base_model_path in the script as needed.

Fine-tune closed-sourced models

For closed-sourced models, we follow the OpenAI finetuning API to finetune GPT-4o-mini.

🔍 Probe

To probe the model's internal representations, update the model paths and the number of ppl in the puzzles for evaluation in the script:

bash scripts/probe/run.sh

🗃️ Sample classification

Here we classify on consistenly solved v.s. non consistenly solved puzzles.

Update the model paths and provide data with binary label of consistenly solved v.s. non consistenly solved for each training sample, and then run the following:

Classification with puzzled-based indicators:

bash scripts/mem_classify/model_indicator.sh

Classification with model-bases indicators:

bash scripts/mem_classify/puzzle_indicator.sh

📚 Citation

If you find our work helpful, please consider citing it as follows:

@article{xie2024memorization,
title={On Memorization of Large Language Models in Logical Reasoning}, 
author={Chulin Xie and Yangsibo Huang and Chiyuan Zhang and Da Yu and Xinyun Chen and Bill Yuchen Lin and Bo Li and Badih Ghazi and Ravi Kumar},
year={2024},
eprint={2410.23123},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.23123}, 
}

📖 Questions

Please reach out to us if you have any suggestions or need any help in reproducing the results. You can submit an issue or pull request, or send an email to [email protected].

For Tasks:

Click tags to check more tools for each tasks

For Jobs:

Alternative AI tools for mem-kk-logic

Similar Open Source Tools

Detoxify is a library that provides trained models and code to predict toxic comments on 3 Jigsaw challenges: Toxic comment classification, Unintended Bias in Toxic comments, Multilingual toxic comment classification. It includes models like 'original', 'unbiased', and 'multilingual' trained on different datasets to detect toxicity and minimize bias. The library aims to help in stopping harmful content online by interpreting visual content in context. Users can fine-tune the models on carefully constructed datasets for research purposes or to aid content moderators in flagging out harmful content quicker. The library is built to be user-friendly and straightforward to use.

github

: 980

can-ai-code

Can AI Code is a self-evaluating interview tool for AI coding models. It includes interview questions written by humans and tests taken by AI, inference scripts for common API providers and CUDA-enabled quantization runtimes, a Docker-based sandbox environment for validating untrusted Python and NodeJS code, and the ability to evaluate the impact of prompting techniques and sampling parameters on large language model (LLM) coding performance. Users can also assess LLM coding performance degradation due to quantization. The tool provides test suites for evaluating LLM coding performance, a webapp for exploring results, and comparison scripts for evaluations. It supports multiple interviewers for API and CUDA runtimes, with detailed instructions on running the tool in different environments. The repository structure includes folders for interviews, prompts, parameters, evaluation scripts, comparison scripts, and more.

github

: 511

AgentPoison

AgentPoison is a repository that provides the official PyTorch implementation of the paper 'AgentPoison: Red-teaming LLM Agents via Memory or Knowledge Base Backdoor Poisoning'. It offers tools for red-teaming LLM agents by poisoning memory or knowledge bases. The repository includes trigger optimization algorithms, agent experiments, and evaluation scripts for Agent-Driver, ReAct-StrategyQA, and EHRAgent. Users can fine-tune motion planners, inject queries with triggers, and evaluate red-teaming performance. The codebase supports multiple RAG embedders and provides a unified dataset access for all three agents.

github

: 78

LEADS

LEADS is a lightweight embedded assisted driving system designed to simplify the development of instrumentation, control, and analysis systems for racing cars. It is written in Python and C/C++ with impressive performance. The system is customizable and provides abstract layers for component rearrangement. It supports hardware components like Raspberry Pi and Arduino, and can adapt to various hardware types. LEADS offers a modular structure with a focus on flexibility and lightweight design. It includes robust safety features, modern GUI design with dark mode support, high performance on different platforms, and powerful ESC systems for traction control and braking. The system also supports real-time data sharing, live video streaming, and AI-enhanced data analysis for driver training. LEADS VeC Remote Analyst enables transparency between the driver and pit crew, allowing real-time data sharing and analysis. The system is designed to be user-friendly, adaptable, and efficient for racing car development.

github

: 241

green-bit-llm

Green-Bit-LLM is a Python toolkit designed for fine-tuning, inferencing, and evaluating GreenBitAI's low-bit Language Models (LLMs). It utilizes the Bitorch Engine for efficient operations on low-bit LLMs, enabling high-performance inference on various GPUs and supporting full-parameter fine-tuning using quantized LLMs. The toolkit also provides evaluation tools to validate model performance on benchmark datasets. Green-Bit-LLM is compatible with AutoGPTQ series of 4-bit quantization and compression models.

github

: 78

llama.vim

llama.vim is a plugin that provides local LLM-assisted text completion for Vim users. It offers features such as auto-suggest on cursor movement, manual suggestion toggling, suggestion acceptance with Tab and Shift+Tab, control over text generation time, context configuration, ring context with chunks from open and edited files, and performance stats display. The plugin requires a llama.cpp server instance to be running and supports FIM-compatible models. It aims to be simple, lightweight, and provide high-quality and performant local FIM completions even on consumer-grade hardware.

github

: 1.3k

stable-diffusion-webui

Stable Diffusion WebUI Docker Image allows users to run Automatic1111 WebUI in a docker container locally or in the cloud. The images do not bundle models or third-party configurations, requiring users to use a provisioning script for container configuration. It supports NVIDIA CUDA, AMD ROCm, and CPU platforms, with additional environment variables for customization and pre-configured templates for Vast.ai and Runpod.io. The service is password protected by default, with options for version pinning, startup flags, and service management using supervisorctl.

github

: 98

comfyui

ComfyUI is a highly-configurable, cloud-first AI-Dock container that allows users to run ComfyUI without bundled models or third-party configurations. Users can configure the container using provisioning scripts. The Docker image supports NVIDIA CUDA, AMD ROCm, and CPU platforms, with version tags for different configurations. Additional environment variables and Python environments are provided for customization. ComfyUI service runs on port 8188 and can be managed using supervisorctl. The tool also includes an API wrapper service and pre-configured templates for Vast.ai. The author may receive compensation for services linked in the documentation.

github

: 434

Construction-Hazard-Detection

Construction-Hazard-Detection is an AI-driven tool focused on improving safety at construction sites by utilizing the YOLOv8 model for object detection. The system identifies potential hazards like overhead heavy loads and steel pipes, providing real-time analysis and warnings. Users can configure the system via a YAML file and run it using Docker. The primary dataset used for training is the Construction Site Safety Image Dataset enriched with additional annotations. The system logs are accessible within the Docker container for debugging, and notifications are sent through the LINE messaging API when hazards are detected.

github

: 153

FinRL_DeepSeek

FinRL-DeepSeek is a project focusing on LLM-infused risk-sensitive reinforcement learning for trading agents. It provides a framework for training and evaluating trading agents in different market conditions using deep reinforcement learning techniques. The project integrates sentiment analysis and risk assessment to enhance trading strategies in both bull and bear markets. Users can preprocess financial news data, add LLM signals, and train agent-ready datasets for PPO and CPPO algorithms. The project offers specific training and evaluation environments for different agent configurations, along with detailed instructions for installation and usage.

github

: 100

nano-graphrag

nano-GraphRAG is a simple, easy-to-hack implementation of GraphRAG that provides a smaller, faster, and cleaner version of the official implementation. It is about 800 lines of code, small yet scalable, asynchronous, and fully typed. The tool supports incremental insert, async methods, and various parameters for customization. Users can replace storage components and LLM functions as needed. It also allows for embedding function replacement and comes with pre-defined prompts for entity extraction and community reports. However, some features like covariates and global search implementation differ from the original GraphRAG. Future versions aim to address issues related to data source ID, community description truncation, and add new components.

github

: 2.6k

LL3DA

LL3DA is a Large Language 3D Assistant that responds to both visual and textual interactions within complex 3D environments. It aims to help Large Multimodal Models (LMM) comprehend, reason, and plan in diverse 3D scenes by directly taking point cloud input and responding to textual instructions and visual prompts. LL3DA achieves remarkable results in 3D Dense Captioning and 3D Question Answering, surpassing various 3D vision-language models. The code is fully released, allowing users to train customized models and work with pre-trained weights. The tool supports training with different LLM backends and provides scripts for tuning and evaluating models on various tasks.

github

: 207

octopus-v4

The Octopus-v4 project aims to build the world's largest graph of language models, integrating specialized models and training Octopus models to connect nodes efficiently. The project focuses on identifying, training, and connecting specialized models. The repository includes scripts for running the Octopus v4 model, methods for managing the graph, training code for specialized models, and inference code. Environment setup instructions are provided for Linux with NVIDIA GPU. The Octopus v4 model helps users find suitable models for tasks and reformats queries for effective processing. The project leverages Language Large Models for various domains and provides benchmark results. Users are encouraged to train and add specialized models following recommended procedures.

github

: 97

MindSearch

MindSearch is an open-source AI Search Engine Framework that mimics human minds to provide deep AI search capabilities. It allows users to deploy their own search engine using either close-source or open-source language models. MindSearch offers features such as answering any question using web knowledge, in-depth knowledge discovery, detailed solution paths, optimized UI experience, and dynamic graph construction process.

github

: 4.8k

BodhiApp

Bodhi App runs Open Source Large Language Models locally, exposing LLM inference capabilities as OpenAI API compatible REST APIs. It leverages llama.cpp for GGUF format models and huggingface.co ecosystem for model downloads. Users can run fine-tuned models for chat completions, create custom aliases, and convert Huggingface models to GGUF format. The CLI offers commands for environment configuration, model management, pulling files, serving API, and more.

github

: 67

For similar tasks

No tools available

For similar jobs

No tools available

mem-kk-logic

README:

On Memorization of Large Language Models in Logical Reasoning

Updates

🛠️ Installation

📝 Synthetic data

Option 1: use HF dataset

Option 2: generate data locally

🤖 Evaluation

Evaluation on test samples

Evaluation on training samples

Evaluation on closed-sourced models

🚗 Fine-tuning

Direct fine-tune

CoT fine-tune

Merge fine-tuned adapter and base model

Fine-tune closed-sourced models

🔍 Probe

🗃️ Sample classification

📚 Citation

📖 Questions

For Tasks:

For Jobs:

Alternative AI tools for mem-kk-logic

Similar Open Source Tools

mem-kk-logic

detoxify

can-ai-code

AgentPoison

LEADS

green-bit-llm

llama.vim

stable-diffusion-webui

comfyui

Construction-Hazard-Detection

FinRL_DeepSeek

nano-graphrag

LL3DA

octopus-v4

MindSearch

BodhiApp

For similar tasks

For similar jobs