easy-dataset

A powerful tool for creating fine-tuning datasets for LLMs

README:

A powerful tool for creating fine-tuning datasets for Large Language Models

Simplified Chinese | English

Features • Quick Start • Documentation • Contributing • License

If you like this project, please give it a Star⭐️, or buy the author a coffee => Donate ❤️!

Overview

Easy Dataset is an application specifically designed for creating fine-tuning datasets for Large Language Models (LLMs). It provides an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

With Easy Dataset, you can transform domain knowledge into structured datasets, compatible with all LLM APIs that follow the OpenAI format, making the fine-tuning process simple and efficient.
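Here, "OpenAI format" means the standard chat-completions request shape. As a rough sketch (the endpoint, key, and model below are placeholders rather than values shipped with Easy Dataset), any provider that accepts a request like the following can be plugged in:

   curl https://api.openai.com/v1/chat/completions \
     -H "Authorization: Bearer $API_KEY" \
     -H "Content-Type: application/json" \
     -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "Hello"}]}'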

Features

  • Intelligent Document Processing: Supports intelligent recognition and processing of multiple formats, including PDF, Markdown, and DOCX
  • Intelligent Text Splitting: Supports multiple intelligent text-splitting algorithms and customizable visual segmentation
  • Intelligent Question Generation: Extracts relevant questions from each text segment
  • Domain Labels: Intelligently builds global domain labels for datasets, enabling global understanding of the corpus
  • Answer Generation: Uses an LLM API to generate comprehensive answers with Chain of Thought (CoT) reasoning
  • Flexible Editing: Edit questions, answers, and datasets at any stage of the process
  • Multiple Export Formats: Export datasets in various formats (Alpaca, ShareGPT) and file types (JSON, JSONL); sample records are shown after this list
  • Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
  • User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
  • Custom System Prompts: Add custom system prompts to guide model responses
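For orientation, the two export schemas typically look like the records below. These are generic examples of the Alpaca and ShareGPT conventions; the exact field names in Easy Dataset's output may differ slightly:

   Alpaca (one instruction/response pair per record):
   {"instruction": "Summarize the refund policy.", "input": "", "output": "Refunds are issued within 30 days of purchase ..."}

   ShareGPT (a conversation as a list of turns):
   {"conversations": [{"from": "human", "value": "Summarize the refund policy."}, {"from": "gpt", "value": "Refunds are issued within 30 days of purchase ..."}]}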

Quick Demo

https://github.com/user-attachments/assets/6ddb1225-3d1b-4695-90cd-aa4cb01376a8

Local Run

Download Client

  • Windows: Setup.exe
  • macOS: Intel / Apple Silicon (M)
  • Linux: AppImage

Install with NPM

  1. Clone the repository:
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
  2. Install dependencies:
   npm install
  3. Build and start the application:
   npm run build
   npm run start
  4. Open your browser and visit http://localhost:1717

Using the Official Docker Image

  1. Clone the repository:
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
  2. Modify the docker-compose.yml file:
services:
  easy-dataset:
    image: ghcr.io/conardli/easy-dataset
    container_name: easy-dataset
    ports:
      - '1717:1717'
    volumes:
      - ./local-db:/app/local-db
      # - ./prisma:/app/prisma  # if mounting is required, initialize the database file first (see the note below)
    restart: unless-stopped

Note: The host paths ./local-db and ./prisma determine where the local database is stored. Using the local-db and prisma folders in the current code repository directory is recommended, to stay consistent with the database paths used when starting via NPM.

Note: If you need to mount the Prisma database file, run npm run db:push in advance to initialize it, as shown below.
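For example, from the repository root (assuming Node.js and the project dependencies are available), the database file can be initialized before mounting it:

   npm install      # install dependencies
   npm run db:push  # initialize the local database file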

  3. Start with docker-compose:
   docker-compose up -d
  4. Open a browser and visit http://localhost:1717
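If the page does not load, the container's status and logs can be inspected with standard docker-compose commands:

   docker-compose ps
   docker-compose logs -f easy-dataset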

Building with a Local Dockerfile

If you want to build the image yourself, use the Dockerfile in the project root directory:

  1. Clone the repository:
   git clone https://github.com/ConardLi/easy-dataset.git
   cd easy-dataset
  2. Build the Docker image:
   docker build -t easy-dataset .
  3. Run the container:
docker run -d \
  -p 1717:1717 \
  -v {YOUR_LOCAL_DB_PATH}:/app/local-db \
  -v {LOCAL_PRISMA_PATH}:/app/prisma \
  --name easy-dataset \
  easy-dataset

Note: Replace {YOUR_LOCAL_DB_PATH} and {LOCAL_PRISMA_PATH} with the actual paths where you want to store the local database. It is recommended to use the local-db and prisma folders in the current code repository directory to maintain consistency with the database paths when starting via NPM.
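For example, to follow that recommendation from the repository root:

   docker run -d \
     -p 1717:1717 \
     -v $(pwd)/local-db:/app/local-db \
     -v $(pwd)/prisma:/app/prisma \
     --name easy-dataset \
     easy-dataset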

  4. Open a browser and visit http://localhost:1717

How to Use

Create a Project

  1. Click the "Create Project" button on the homepage;
  2. Enter a project name and description;
  3. Configure your preferred LLM API settings (a typical configuration is sketched below)
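Since any OpenAI-format provider works, the settings usually reduce to three values. A hypothetical configuration (the endpoint and model names are illustrative placeholders, not bundled defaults):

   API Base URL: https://api.openai.com/v1
   API Key:      sk-...
   Model:        gpt-4o-mini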

Process Documents

  1. Upload your files in the "Text Split" section (supports PDF, Markdown, TXT, and DOCX);
  2. View and adjust the automatically split text segments;
  3. View and adjust the global domain tree

Generate Questions

  1. Batch construct questions based on text blocks;
  2. View and edit the generated questions;
  3. Organize questions using the label tree

Create Datasets

  1. Batch construct datasets based on questions;
  2. Generate answers using the configured LLM;
  3. View, edit, and optimize the generated answers

Export Datasets

  1. Click the "Export" button in the Datasets section;
  2. Choose your preferred format (Alpaca or ShareGPT);
  3. Select the file format (JSON or JSONL);
  4. Add custom system prompts as needed (see the sample line after this list);
  5. Export your dataset
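For illustration, a single exported JSONL line in ShareGPT format with a custom system prompt could look like the following sketch (the exact placement of the system prompt depends on the chosen format and tool version):

   {"conversations": [{"from": "system", "value": "You are a domain expert assistant."}, {"from": "human", "value": "What does the warranty cover?"}, {"from": "gpt", "value": "The warranty covers manufacturing defects for two years ..."}]}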

Documentation

Community Practice

Contributing

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add some amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request (submit to the DEV branch)

Please ensure that tests are appropriately updated and adhere to the existing coding style.

Join Discussion Group & Contact the Author

https://docs.easy-dataset.com/geng-duo/lian-xi-wo-men

License

This project is licensed under the AGPL 3.0 License - see the LICENSE file for details.

Citation

If you find this work helpful, please cite it as:

@misc{miao2025easydataset,
  title={Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents},
  author={Ziyang Miao and Qiyu Sun and Jingyuan Wang and Yuchen Gong and Yaowei Zheng and Shiqi Li and Richong Zhang},
  year={2025},
  eprint={2507.04009},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.04009}
}

Star History

Star History Chart

Built with ❤️ by ConardLi • Follow me: WeChat Official Account • Bilibili • Juejin • Zhihu • YouTube
