EasyInstruct

[ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs.

EasyInstruct is a Python package proposed as an easy-to-use instruction processing framework for Large Language Models (LLMs) like GPT-4, LLaMA, ChatGLM in your research experiments. EasyInstruct modularizes instruction generation, selection, and prompting, while also considering their combination and interaction.

What's New
Overview
Installation
Quickstart
- Shell Script
- Gradio App
Use EasyInstruct
- Generators
- Selectors
- Prompts
- Engines
Citation
Contributors

🔔News

2024-06-04 EasyInstruct was accepted to the ACL 2024 System Demonstration Track. 🎉🎉
2024-02-06 We release a new paper: "EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models" with an HF demo EasyInstruct.
2024-02-06 We release a preliminary tool EasyDetect for hallucination detection, with a demo.
2024-02-05 We release version 0.1.2, supporting new features and optimising the function interface.
2023-12-09 The paper "When Do Program-of-Thoughts Work for Reasoning?" (supported by EasyInstruct) was accepted by AAAI 2024!
2023-10-28 We release version 0.1.1, supporting new features for instruction generation and instruction selection.
2023-08-09 We release version 0.0.6, supporting Cohere API calls.
2023-07-12 We release EasyEdit, an easy-to-use framework to edit Large Language Models.

Previous news

2023-5-23 We release version 0.0.5, removing requirement of llama-cpp-python.
2023-5-16 We release version 0.0.4, fixing some problems.
2023-4-21 We release version 0.0.3, check out our documentation for more details.
2023-3-25 We release version 0.0.2, supporting IndexPrompt, MMPrompt, IEPrompt and more LLMs.
2023-3-13 We release version 0.0.1, supporting in-context learning, chain-of-thought with ChatGPT.

This repository is a subproject of KnowLM.

🌟Overview

EasyInstruct is a Python package proposed as an easy-to-use instruction processing framework for Large Language Models (LLMs) such as GPT-4, LLaMA, and ChatGLM in your research experiments. EasyInstruct modularizes instruction generation, selection, and prompting, while also considering their combination and interaction.

The currently supported instruction generation techniques are as follows:

Methods	Description
Self-Instruct	The method that randomly samples a few instructions from a human-annotated seed tasks pool as demonstrations and prompts an LLM to generate more instructions and corresponding input-output pairs.
Evol-Instruct	The method that incrementally upgrades an initial set of instructions into more complex instructions by prompting an LLM with specific prompts.
Backtranslation	The method that creates an instruction following training instance by predicting an instruction that would be correctly answered by a portion of a document of the corpus.
KG2Instruct	The method that creates an instruction following training instance by predicting an instruction that would be correctly answered by a portion of a document of the corpus.

The currently supported instruction selection metrics are as follows:

Metrics	Notation	Description
Length	$Len$	The bounded length of every pair of instruction and response.
Perplexity	$PPL$	The exponentiated average negative log-likelihood of response.
MTLD	$MTLD$	Measure of Textual Lexical Diversity, the mean length of sequential word runs in a text that maintain a minimum threshold type-token ratio (TTR).
ROUGE	$ROUGE$	Recall-Oriented Understudy for Gisting Evaluation, a set of metrics used for evaluating similarities between sentences.
GPT score	$GPT$	The score of whether the output is a good example of how an AI assistant should respond to the user's instruction, provided by ChatGPT.
CIRS	$CIRS$	The score using the abstract syntax tree to encode structural and logical attributes, to measure the correlation between code and reasoning abilities.

API service providers and their corresponding LLM products that are currently available:

Model	Description	Default Version
OpenAI
GPT-3.5	A set of models that improve on GPT-3 and can understand as well as generate natural language or code.	`gpt-3.5-turbo`
GPT-4	A set of models that improve on GPT-3.5 and can understand as well as generate natural language or code.	`gpt-4`
Anthropic
Claude	A next-generation AI assistant based on Anthropic’s research into training helpful, honest, and harmless AI systems.	`claude-2.0`
Claude-Instant	A lighter, less expensive, and much faster option than Claude.	`claude-instant-1.2`
Cohere
Command	A flagship text generation model of Cohere trained to follow user commands and to be instantly useful in practical business applications.	`command`
Command-Light	A light version of Command models that are faster but may produce lower-quality generated text.	`command-light`

🔧Installation

Installation from git repo branch:

pip install git+https://github.com/zjunlp/EasyInstruct@main

Installation for local development:

git clone https://github.com/zjunlp/EasyInstruct
cd EasyInstruct
pip install -e .

Installation using PyPI (not the latest version):

pip install easyinstruct -i https://pypi.org/simple

⏩Quickstart

We provide two ways for users to quickly get started with EasyInstruct. You can either use the shell script or the Gradio app based on your specific needs.

Shell Script

Step1: Prepare a configuration file

Users can configure the parameters of EasyInstruct in a YAML-style file or use the default parameters in the configuration files we provide. The following is an example of the configuration file for Self-Instruct:

generator:
  SelfInstructGenerator:
    target_dir: data/generations/
    data_format: alpaca
    seed_tasks_path: data/seed_tasks.jsonl
    generated_instructions_path: generated_instructions.jsonl
    generated_instances_path: generated_instances.jsonl
    num_instructions_to_generate: 100
    engine: gpt-3.5-turbo
    num_prompt_instructions: 8

More example configuration files can be found at configs.

Step2: Run the shell script

Users should first specify the configuration file and provide their own OpenAI API key. Then, run the following shell script to launch the instruction generation or selection process.

config_file=""
openai_api_key=""

python demo/run.py \
    --config "$config_file" \
    --openai_api_key "$openai_api_key"

Gradio App

We provide a Gradio app for users to quickly get started with EasyInstruct. You can run the following command to launch the Gradio app locally on port 8080 (if available).

python demo/app.py

We also host a running Gradio app on Hugging Face Spaces. You can try it out here.

📌Use EasyInstruct

Please refer to our documentation for more details.

Generators

The Generators module streamlines the process of instruction data generation, allowing for the generation of instruction data based on seed data. You can choose the appropriate generator based on your specific needs.

BaseGenerator

BaseGenerator is the base class for all generators.

You can also inherit from this base class to customize your own generator class. Just override the __init__ and generate methods.
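The subclassing pattern described above can be sketched as follows. This is only an illustration: `StandInBaseGenerator` and `TemplateGenerator` are hypothetical names, the stand-in base class is not the library's real `BaseGenerator`, and its constructor signature is an assumption. With EasyInstruct installed, you would inherit from the real base class instead.

```python
import random
from typing import Dict, List


class StandInBaseGenerator:
    """Hypothetical stand-in for BaseGenerator (not the library's class)."""

    def __init__(self, num_instructions_to_generate: int = 10):
        self.num_instructions_to_generate = num_instructions_to_generate

    def generate(self) -> List[Dict[str, str]]:
        raise NotImplementedError


class TemplateGenerator(StandInBaseGenerator):
    """Toy generator: fills fixed templates with topics, no LLM call."""

    def __init__(self, topics: List[str], num_instructions_to_generate: int = 10, seed: int = 0):
        # Override __init__ to accept the extra arguments this generator needs.
        super().__init__(num_instructions_to_generate)
        self.topics = topics
        self.rng = random.Random(seed)

    def generate(self) -> List[Dict[str, str]]:
        # Override generate to produce the instruction records.
        templates = ["Explain {} to a beginner.", "List three facts about {}."]
        return [
            {"instruction": self.rng.choice(templates).format(self.rng.choice(self.topics))}
            for _ in range(self.num_instructions_to_generate)
        ]


instructions = TemplateGenerator(["perplexity", "ROUGE"], num_instructions_to_generate=4).generate()
for item in instructions:
    print(item["instruction"])
```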

SelfInstructGenerator

SelfInstructGenerator is the class for the instruction generation method of Self-Instruct. See Self-Instruct: Aligning Language Model with Self Generated Instructions for more details.

Example

from easyinstruct import SelfInstructGenerator
from easyinstruct.utils.api import set_openai_key

# Step1: Set your own API-KEY
set_openai_key("YOUR-KEY")

# Step2: Declare a generator class
generator = SelfInstructGenerator(num_instructions_to_generate=10)

# Step3: Generate self-instruct data
generator.generate()

BacktranslationGenerator

BacktranslationGenerator is the class for the instruction generation method of Instruction Backtranslation. See Self-Alignment with Instruction Backtranslation for more details.

Example

from easyinstruct import BacktranslationGenerator
from easyinstruct.utils.api import set_openai_key

# Step1: Set your own API-KEY
set_openai_key("YOUR-KEY")

# Step2: Declare a generator class
generator = BacktranslationGenerator(num_instructions_to_generate=10)

# Step3: Generate backtranslation data
generator.generate()

EvolInstructGenerator

EvolInstructGenerator is the class for the instruction generation method of Evol-Instruct. See WizardLM: Empowering Large Language Models to Follow Complex Instructions for more details.

Example

from easyinstruct import EvolInstructGenerator
from easyinstruct.utils.api import set_openai_key

# Step1: Set your own API-KEY
set_openai_key("YOUR-KEY")

# Step2: Declare a generator class
generator = EvolInstructGenerator(num_instructions_to_generate=10)

# Step3: Generate evolution data
generator.generate()

KG2InstructGenerator

KG2InstructGenerator is the class for the instruction generation method of KG2Instruct. See InstructIE: A Chinese Instruction-based Information Extraction Dataset for more details.

Selectors

The Selectors module standardizes the instruction selection process, enabling the extraction of high-quality instruction datasets from raw, unprocessed instruction data. The raw data can be sourced from publicly available instruction datasets or generated by the framework itself. You can choose the appropriate selector based on your specific needs.

BaseSelector

BaseSelector is the base class for all selectors.

You can also inherit from this base class to customize your own selector class. Just override the __init__ and __process__ methods.
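A minimal sketch of that pattern, under stated assumptions: `StandInBaseSelector` is a hypothetical stand-in rather than the library's `BaseSelector`, and the way `process()` hands data to `__process__` here is assumed for illustration. The example keeps samples whose instruction has a word count inside a chosen range.

```python
from typing import Dict, List

Sample = Dict[str, str]


class StandInBaseSelector:
    """Hypothetical stand-in for BaseSelector (not the library's class)."""

    def __init__(self, data: List[Sample]):
        self.data = data

    def __process__(self, data: List[Sample]) -> List[Sample]:
        raise NotImplementedError

    def process(self) -> List[Sample]:
        # Public entry point that delegates to the overridable hook.
        return self.__process__(self.data)


class WordCountSelector(StandInBaseSelector):
    """Keeps samples whose instruction word count is within [min_words, max_words]."""

    def __init__(self, data: List[Sample], min_words: int = 3, max_words: int = 50):
        super().__init__(data)
        self.min_words = min_words
        self.max_words = max_words

    def __process__(self, data: List[Sample]) -> List[Sample]:
        return [
            s for s in data
            if self.min_words <= len(s["instruction"].split()) <= self.max_words
        ]


samples = [
    {"instruction": "Hi", "output": "Hello"},
    {"instruction": "Summarize the following paragraph in one sentence.", "output": "..."},
]
selected = WordCountSelector(samples, min_words=3, max_words=50).process()
print(len(selected))
```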

Deduplicator

Deduplicator is the class for eliminating duplicate instruction samples that could adversely affect both pre-training stability and the performance of LLMs. Deduplicator also enables efficient use and optimization of storage space.
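To make the idea concrete, here is a small sketch of exact-match deduplication after whitespace and case normalization. The matching strategy and the `deduplicate` helper are assumptions for illustration; this README does not state how Deduplicator decides that two samples are duplicates.

```python
from typing import Dict, List


def _normalize(text: str) -> str:
    # Lowercase and collapse runs of whitespace so trivial variants match.
    return " ".join(text.lower().split())


def deduplicate(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return samples with later duplicates removed, preserving order."""
    seen = set()
    unique = []
    for sample in samples:
        key = (_normalize(sample["instruction"]), _normalize(sample.get("output", "")))
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


raw = [
    {"instruction": "Translate to French: cat", "output": "chat"},
    {"instruction": "translate to  french: cat", "output": "Chat"},
    {"instruction": "Translate to French: dog", "output": "chien"},
]
deduped = deduplicate(raw)
print(len(deduped))
```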

LengthSelector

LengthSelector is the class for selecting instruction samples based on the length of the instruction. Instructions that are too long or too short can affect data quality and are not conducive to instruction tuning.

RougeSelector

RougeSelector is the class for selecting instruction samples based on the ROUGE metric which is often used for evaluating the quality of automated generation of text.
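As an illustration of the metric itself, the sketch below computes a ROUGE-L F-measure from the longest common subsequence of whitespace tokens. It is a simplified, self-contained version written for this example; which ROUGE variant and tokenization RougeSelector uses is not stated here, and `rouge_l_f1` is a hypothetical helper name.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F-measure with equal weight on precision and recall."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "Write a short poem about autumn",
    "Write a poem about the autumn season",
)
print(round(score, 3))
```

A selector built on such a score could, for example, drop a new instruction when its similarity to any already-kept instruction exceeds a chosen threshold.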

GPTScoreSelector

GPTScoreSelector is the class for selecting instruction samples based on the GPT score, which reflects whether the output is a good example of how an AI assistant should respond to the user's instruction, provided by ChatGPT.

PPLSelector

PPLSelector is the class for selecting instruction samples based on the perplexity, which is the exponentiated average negative log-likelihood of response.
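The definition above translates directly into code. The sketch assumes you already have per-token natural-log probabilities of the response from some language model; obtaining them is outside this example, and `perplexity` is a hypothetical helper rather than part of the library.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the average negative log-likelihood over the response tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Four tokens, each assigned probability 0.5 by the model.
ppl = perplexity([math.log(0.5)] * 4)
print(ppl)
```

Lower perplexity means the model found the response more predictable, which is why it can serve as a filtering signal.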

MTLDSelector

MTLDSelector is the class for selecting instruction samples based on the MTLD, which is short for Measure of Textual Lexical Diversity.
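For readers unfamiliar with the metric, here is a compact sketch of MTLD as it is commonly described: count how many "factors" (runs of tokens whose type-token ratio stays above a threshold) the text contains, add a partial factor for the leftover run, and average the forward and backward passes. The 0.72 threshold is the conventional default. The fallback for texts that never reach the threshold is a convention chosen for this sketch, and none of this is taken from the MTLDSelector source.

```python
from typing import List


def _mtld_one_direction(tokens: List[str], threshold: float) -> float:
    factors = 0.0
    types = set()
    count = 0
    for token in tokens:
        count += 1
        types.add(token)
        if len(types) / count <= threshold:
            # TTR reached the threshold: close this factor and start a new run.
            factors += 1
            types, count = set(), 0
    if count > 0:
        # Partial factor for the leftover run, scaled by how far TTR has fallen.
        ttr = len(types) / count
        factors += (1 - ttr) / (1 - threshold)
    # Sketch-specific fallback: if no factor accumulated, return the text length.
    return len(tokens) / factors if factors > 0 else float(len(tokens))


def mtld(tokens: List[str], threshold: float = 0.72) -> float:
    """Average of the forward and backward MTLD passes."""
    if not tokens:
        raise ValueError("need at least one token")
    forward = _mtld_one_direction(tokens, threshold)
    backward = _mtld_one_direction(tokens[::-1], threshold)
    return (forward + backward) / 2


repetitive = mtld("the cat the cat the cat the cat".split())
varied = mtld("write a concise summary of this research article".split())
print(repetitive, varied)
```

Higher values indicate that the text sustains lexical variety over longer stretches.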

CodeSelector

CodeSelector is the class for selecting code instruction samples based on the Complexity-Impacted Reasoning Score (CIRS), which combines structural and logical attributes, to measure the correlation between code and reasoning abilities. See When Do Program-of-Thoughts Work for Reasoning? for more details.

Example

from easyinstruct import CodeSelector

# Step1: Specify your source file of code instructions
src_file = "data/code_example.json"

# Step2: Declare a code selector class
selector = CodeSelector(
    source_file_path=src_file,
    target_dir="data/selections/",
    manually_partion_data=True,
    min_boundary=0.125,
    max_boundary=0.5,
    automatically_partion_data=True,
    k_means_cluster_number=2,
)

# Step3: Process the code instructions
selector.process()
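The description of CIRS mentions using the abstract syntax tree to encode structural attributes. The sketch below shows what extracting a few such attributes can look like with Python's standard `ast` module. It is not the CIRS formula: the chosen features and the `structural_features` helper are illustrative assumptions, and the logical-complexity side of the score is not covered.

```python
import ast


def structural_features(source: str) -> dict:
    """Collect simple AST statistics for a piece of Python source code."""
    tree = ast.parse(source)

    def depth(node: ast.AST) -> int:
        # Depth of the subtree rooted at node; a leaf has depth 1.
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    nodes = list(ast.walk(tree))
    return {
        "node_count": len(nodes),
        "distinct_node_types": len({type(n).__name__ for n in nodes}),
        "max_depth": depth(tree),
    }


flat = structural_features("x = 1")
nested = structural_features("for i in range(3):\n    if i % 2:\n        print(i)\n")
print(flat)
print(nested)
```

Nested control flow yields more nodes and a deeper tree than a single assignment, which is the kind of signal a structure-based score can build on.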

MultiSelector

MultiSelector is the class for combining multiple appropriate selectors based on your specific needs.
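Conceptually, combining selectors amounts to running several filters in sequence over the same data. The sketch below shows that idea with plain functions; `chain_selectors` and the two toy filters are hypothetical, and MultiSelector's actual constructor and behavior are not documented in this section.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]
Filter = Callable[[List[Sample]], List[Sample]]


def chain_selectors(filters: List[Filter]) -> Filter:
    """Build one filter that applies the given filters in order."""

    def run(samples: List[Sample]) -> List[Sample]:
        for apply_filter in filters:
            samples = apply_filter(samples)
        return samples

    return run


def drop_short(samples: List[Sample]) -> List[Sample]:
    # Keep instructions with at least three words.
    return [s for s in samples if len(s["instruction"].split()) >= 3]


def drop_empty_output(samples: List[Sample]) -> List[Sample]:
    # Keep samples whose output is not blank.
    return [s for s in samples if s["output"].strip()]


pipeline = chain_selectors([drop_short, drop_empty_output])
kept = pipeline([
    {"instruction": "Hi", "output": "Hello"},
    {"instruction": "Explain what ROUGE measures.", "output": ""},
    {"instruction": "Explain what perplexity measures.", "output": "How surprised a model is."},
])
print(len(kept))
```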

Prompts

The Prompts module standardizes the instruction prompting step, where user requests are constructed as instruction prompts and sent to specific LLMs to obtain responses. You can choose the appropriate prompting method based on your specific needs.

Please check out link for more details.

Engines

The Engines module standardizes the instruction execution process, enabling the execution of instruction prompts on specific locally deployed LLMs. You can choose the appropriate engine based on your specific needs.

Please check out link for more details.

🚩Citation

Please cite our repository if you use EasyInstruct in your work.

@article{ou2024easyinstruct,
  title={EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models},
  author={Ou, Yixin and Zhang, Ningyu and Gui, Honghao and Xu, Ziwen and Qiao, Shuofei and Bi, Zhen and Chen, Huajun},
  journal={arXiv preprint arXiv:2402.03049},
  year={2024}
}

@misc{knowlm,
  author = {Ningyu Zhang and Jintian Zhang and Xiaohan Wang and Honghao Gui and Kangwei Liu and Yinuo Jiang and Xiang Chen and Shengyu Mao and Shuofei Qiao and Yuqi Zhu and Zhen Bi and Jing Chen and Xiaozhuan Liang and Yixin Ou and Runnan Fang and Zekun Xi and Xin Xu and Lei Li and Peng Wang and Mengru Wang and Yunzhi Yao and Bozhong Tian and Yin Fang and Guozhou Zheng and Huajun Chen},
  title = {KnowLM: An Open-sourced Knowledgeable Large Language Model Framework},
  year = {2023},
  url = {http://knowlm.zjukg.cn/},
}

@article{bi2023program,
  title={When do program-of-thoughts work for reasoning?},
  author={Bi, Zhen and Zhang, Ningyu and Jiang, Yinuo and Deng, Shumin and Zheng, Guozhou and Chen, Huajun},
  journal={arXiv preprint arXiv:2308.15452},
  year={2023}
}

🎉Contributors

We will offer long-term maintenance to fix bugs, resolve issues, and meet new requests. If you run into any problems, please open an issue.

Other Related Projects

🙌 We would like to express our heartfelt gratitude for the contribution of Self-Instruct to our project, as we have utilized portions of their source code in our project.
