LongRAG

Official repo for "LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs".

This repository contains the code for LongRAG, a framework that enhances retrieval-augmented generation with long-context LLMs. LongRAG introduces a 'long retriever' and a 'long reader' to improve performance by using a 4K-token retrieval unit, offering insights into combining RAG with long-context LLMs. The repo provides instructions for installation, quick start, corpus preparation, long retriever, and long reader.

This repo contains the code for "LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs". We are still polishing the repo.

Introduction
Installation
Quick Start
Corpus Preparation (Optional)
Long Retriever
Long Reader
License
Citation

Introduction

In the traditional RAG framework, the basic retrieval units are normally short. Such a design forces the retriever to search over a large corpus to find the "needle" unit, while the reader only needs to extract answers from the short retrieved units. This imbalance between a heavy retriever and a light reader can lead to sub-optimal performance. We propose a new framework, LongRAG, consisting of a "long retriever" and a "long reader". Our framework uses a 4K-token retrieval unit, which is 30x longer than before. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.

Installation

Clone this repository and install the required packages:

git clone https://github.com/TIGER-AI-Lab/LongRAG.git
cd LongRAG
pip install -r requirements.txt

Quick Start

Please go to the "Long Reader" section and follow the instructions. This will help you get the final prediction for 100 examples. The output will be similar to our sample files in the exp/ directory.

Corpus Preparation (Optional)

This is an optional step. You can use our processed corpus directly. We have released two versions of the retrieval corpus for NQ and HotpotQA on Hugging Face.

from datasets import load_dataset
corpus_nq = load_dataset("TIGER-Lab/LongRAG", "nq_corpus")
corpus_hotpotqa = load_dataset("TIGER-Lab/LongRAG", "hotpot_qa_corpus")

If you are still interested in how we craft the corpus, you can start reading here.

Wikipedia raw data cleaning: We first clean the raw Wikipedia data following the standard process. We use WikiExtractor, a widely used Python script that extracts and cleans text from a Wikipedia database backup dump. Please ensure you use the required Python environment. A sample script is:

sh scripts/extract_and_clean_wiki_dump.sh

Preprocess Wikipedia data: After cleaning the raw Wikipedia data, run the following script to gather more information.

sh scripts/process_wiki_page.sh

dir_path: The directory path of the cleaned Wikipedia dump, which is the output of the previous step.
output_path_dir: The output directory. It will contain several pickle files, each storing a dictionary keyed by Wikipedia page title:
    degree.pickle: The value is the number of hyperlinks.
    abs_adj.pickle: The value is the pages linked from the abstract paragraph.
    full_adj.pickle: The value is the pages linked from the entire page.
    doc_size.pickle: The value is the number of tokens on that page.
    doc_dict.pickle: The value is the text of the page.
corpus_title_path: The path to the page titles used to filter the Wikipedia pages for the NQ dataset. In the original DPR paper, certain Wikipedia pages, such as list pages and disambiguation pages, were removed, reducing the total number of Wikipedia pages from 5 million to 3 million. For a fair comparison, we also exclude these pages. (For HotpotQA, we did not remove any pages, so the number of Wikipedia pages remains at 5 million.) You can download DPR's titles from this link.
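
As a quick sanity check on the preprocessing output, the pickle files can be read back with the standard library. The following is a minimal sketch, not part of the repo: `load_pickle` and `largest_pages` are hypothetical helper names, and the toy `doc_size.pickle` written to a temporary directory only stands in for the real `output_path_dir`.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one pickled dictionary from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


def largest_pages(output_dir, top_k=3):
    """Return the top_k (title, token_count) pairs from doc_size.pickle."""
    doc_size = load_pickle(Path(output_dir) / "doc_size.pickle")
    return sorted(doc_size.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Toy stand-in for output_path_dir so the sketch runs without the real dump.
demo_dir = Path(tempfile.mkdtemp())
with open(demo_dir / "doc_size.pickle", "wb") as f:
    pickle.dump({"Page A": 1200, "Page B": 300, "Page C": 4500}, f)

print(largest_pages(demo_dir, top_k=2))
```

Only unpickle files you produced yourself or obtained from a source you trust, since loading a pickle can execute arbitrary code.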

We have provided the processed Wikipedia data in our Hugging Face repo. Please check out the nq_wiki and hotpot_qa_wiki subsets for more information. You can easily derive these pickle files from these two datasets.

Retrieval Corpus: By grouping multiple related documents, we can construct long retrieval units with more than 4K tokens. This design could also significantly reduce the corpus size (number of retrieval units in the corpus). Then, the retriever’s task becomes much easier. Additionally, the long retrieval unit will also improve the information completeness to avoid ambiguity or confusion.

sh scripts/group_documents.sh

processed_wiki_dir: The output directory of the above step.
mode: abs is for HotpotQA corpus, full is for NQ corpus.
output_dir: The output directory. It will contain several pickle files, each representing a dictionary for the retrieval corpus. The most important one is group_text.pickle, which maps the corpus ID to the corpus text. For more details, please refer to our released corpus on Hugging Face.
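
To make the grouping idea concrete, here is a simplified greedy sketch. It is an illustration only and is not taken from scripts/group_documents.sh: the function name `group_documents`, the `max_tokens` budget, and the visiting order (highest degree first) are all assumptions. The inputs mirror the dictionaries described above (page size, adjacency, and degree, each keyed by page title).

```python
def group_documents(doc_size, adj, degree, max_tokens=4000):
    """Greedily merge linked pages into groups of at most max_tokens tokens."""
    group_of = {}      # title -> group id
    groups = []        # group id -> list of titles
    group_tokens = []  # group id -> total token count

    # Visit well-connected pages first; ties are broken by title for determinism.
    for title in sorted(doc_size, key=lambda t: (-degree.get(t, 0), t)):
        placed = False
        for neighbor in adj.get(title, []):
            gid = group_of.get(neighbor)
            if gid is not None and group_tokens[gid] + doc_size[title] <= max_tokens:
                groups[gid].append(title)
                group_tokens[gid] += doc_size[title]
                group_of[title] = gid
                placed = True
                break
        if not placed:
            # No linked group has room, so this page starts a new group.
            group_of[title] = len(groups)
            groups.append([title])
            group_tokens.append(doc_size[title])
    return groups


# Toy data, not real Wikipedia statistics.
doc_size = {"A": 1500, "B": 1000, "C": 2000, "D": 500}
adj = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": []}
degree = {"A": 2, "B": 1, "C": 1, "D": 0}
print(group_documents(doc_size, adj, degree))
```

In this toy run, B joins A's group, C no longer fits under the budget and starts its own group, and the unlinked page D stays alone, so four pages collapse into three retrieval units.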

Long Retriever

We use Tevatron, an open-source dense retrieval toolkit, for all our retrieval experiments. The base embedding model is bge-large-en-v1.5. We have provided a sample script; make sure to update the parameters with the local path to your own dataset. Additionally, our script uses 4 GPUs to encode the corpus to save time; please adjust this for your own use case.

sh scripts/run_retrieve_tevatron.sh

Long Reader

We select Gemini-1.5-Pro and GPT-4o as our long readers given their strong ability to handle long-context input. (We also plan to test other LLMs capable of handling long contexts in the future.)

The input of the reader is a concatenation of all the long retrieval units from the long retriever. We have provided the input file in our Hugging Face repo.

mkdir -p exp/
sh scripts/run_eval_qa.sh

test_data_name: Test set name, nq (NQ) or hotpot_qa (HotpotQA).
test_data_split: For each test set, there are three splits: full, subset_1000, subset_100. We suggest starting with subset_100 for a quick start or debugging and using subset_1000 to obtain relatively stable results.
output_file_path: The path of the output file; here it is placed in the exp/ directory.
reader_model: The long-context reader model to use. Our code currently supports GPT-4o, GPT-4-Turbo, Gemini-1.5-Pro, and Claude-3-Opus. Please note that you need to update the related API key and API configuration in the code. For example, if you are using the GPT-4 series, you need to configure the code in utils/gpt_inference.py; if you are using the Gemini series, you need to configure the code in utils/gemini_inference.py. We will continue to support more models in the future.

The output file contains one test case per row. The short_ans field is our final prediction.

{
    "query_id": "383", 
    "question": "how many episodes of touching evil are there", 
    "answers": ["16"], 
    "long_ans": "16 episodes.", 
    "short_ans": "16", 
    "is_exact_match": 1, 
    "is_substring_match": 1, 
    "is_retrieval": 1
}
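
For readers wondering how flags such as is_exact_match and is_substring_match are typically derived, the sketch below uses the normalization that is common in open-domain QA evaluation (lowercasing, stripping punctuation and English articles). This is an assumption for illustration: the helper names are hypothetical, and the repo's eval code may normalize differently or compare different fields.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    """Return 1 if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(a) for a in answers))


def substring_match(prediction, answers):
    """Return 1 if any normalized gold answer occurs inside the prediction."""
    pred = normalize_answer(prediction)
    return int(any(normalize_answer(a) in pred for a in answers))


print(exact_match("16", ["16"]))
print(exact_match("16 episodes.", ["16"]))
print(substring_match("16 episodes.", ["16"]))
```

The substring variant is more lenient: a longer answer such as "16 episodes." fails the exact comparison but still contains the gold answer.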

We have provided some sample output files in our exp/ directory. For example, exp/nq_gpt4o_100.json contains the results of running:

python eval/eval_qa.py \
  --test_data_name "nq" \
  --test_data_split "subset_100" \
  --output_file_path "./exp/nq_gpt4o_100.json" \
  --reader_model "GPT-4o"

The top-1 retrieval accuracy is 88%, and the exact match rate is 64%.
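
Aggregate numbers like these can be recomputed from an output file by averaging the per-example flags. The sketch below makes two assumptions that are not confirmed by the repo: that the file stores one JSON object per line, and that is_retrieval marks a successful retrieval for that example. `load_records` and `summarize` are hypothetical helper names, and the demo records are toy values rather than real results.

```python
import json


def load_records(path):
    """Read one JSON object per non-empty line (an assumed JSON Lines layout)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def summarize(records):
    """Average the binary flags over all test cases."""
    n = len(records)
    if n == 0:
        return {"n": 0, "exact_match": 0.0, "retrieval": 0.0}
    return {
        "n": n,
        "exact_match": sum(r["is_exact_match"] for r in records) / n,
        "retrieval": sum(r["is_retrieval"] for r in records) / n,
    }


# Toy records, not real results.
demo = [
    {"is_exact_match": 1, "is_retrieval": 1},
    {"is_exact_match": 0, "is_retrieval": 1},
    {"is_exact_match": 1, "is_retrieval": 1},
    {"is_exact_match": 0, "is_retrieval": 0},
]
print(summarize(demo))
```

With a real file, `summarize(load_records("./exp/nq_gpt4o_100.json"))` would be the corresponding call, provided the file follows the assumed layout.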

License

Please check out the license of each subset we use in our work.

Dataset Name	License Type
NQ	Apache License 2.0
HotpotQA	CC BY-SA 4.0 License

Citation

Please cite our paper if you find our project useful:

@article{jiang2024longrag,
  title={LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs},
  author={Ziyan Jiang and Xueguang Ma and Wenhu Chen},
  journal={arXiv preprint arXiv:2406.15319},
  year={2024},
  url={https://arxiv.org/abs/2406.15319}
}
