easy-dataset

Version 1.0.0 · Apache 2.0 License · Next.js 14.1.0 · React 18.2.0 · Material UI 5.15.7

A powerful tool for creating fine-tuning datasets for Large Language Models

简体中文 | English

Features | Getting Started | Usage | Documentation | Contributing | License

Overview

Easy Dataset is a specialized application designed to streamline the creation of fine-tuning datasets for Large Language Models (LLMs). It offers an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning.

With Easy Dataset, you can transform your domain knowledge into structured datasets compatible with all OpenAI-format compatible LLM APIs, making the fine-tuning process accessible and efficient.

Features

  • Intelligent Document Processing: Upload Markdown files and automatically split them into meaningful segments
  • Smart Question Generation: Extract relevant questions from each text segment
  • Answer Generation: Generate comprehensive answers for each question using LLM APIs
  • Flexible Editing: Edit questions, answers, and datasets at any stage of the process
  • Multiple Export Formats: Export datasets in Alpaca or ShareGPT format, as JSON or JSONL files (see the example records after this list)
  • Wide Model Support: Compatible with all LLM APIs that follow the OpenAI format
  • User-Friendly Interface: Intuitive UI designed for both technical and non-technical users
  • Customizable System Prompts: Add custom system prompts to guide model responses
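For reference, a single exported training record looks roughly like this in the two supported formats. The examples below follow the standard Alpaca and ShareGPT conventions; the actual questions and answers come from your own documents, and a system prompt appears only if you configure one.

    # Alpaca-style record (one JSON object per line when exporting JSONL)
    {"instruction": "What is a fine-tuning dataset?", "input": "", "output": "A fine-tuning dataset is a curated set of prompt/response pairs used to adapt a base model to a domain."}

    # ShareGPT-style record
    {"conversations": [{"from": "human", "value": "What is a fine-tuning dataset?"}, {"from": "gpt", "value": "A fine-tuning dataset is a curated set of prompt/response pairs used to adapt a base model to a domain."}]}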

Getting Started

Prerequisites

  • Node.js 18.x or higher
  • pnpm (recommended) or npm

Using npm

  1. Clone the repository:

    git clone https://github.com/ConardLi/easy-dataset.git
    cd easy-dataset
  2. Install dependencies:

    npm install
  3. Build and start the application:

    npm run build
    
    npm run start
  4. Open your browser and navigate to http://localhost:3000
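If you prefer pnpm (recommended in the prerequisites), the same package.json scripts can be run with:

    pnpm install
    pnpm build
    pnpm start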

Using Docker

  1. Pull the image:

    docker pull conardli17/easy-dataset:latest
  2. Run the container:

    docker run -d -p 3000:3000 -v {YOUR_LOCAL_DB_PATH}:/app/local-db --name easy-dataset conardli17/easy-dataset:latest

Note: To prevent data loss, replace {YOUR_LOCAL_DB_PATH} with the host path where you want the local database stored (see the example below).

  3. Open your browser and navigate to http://localhost:3000
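For example, to keep the database in a directory under your home folder (the path below is only an illustration; any writable host directory works):

    mkdir -p ~/easy-dataset-db
    docker run -d -p 3000:3000 \
      -v ~/easy-dataset-db:/app/local-db \
      --name easy-dataset \
      conardli17/easy-dataset:latest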

Usage

Creating a Project

  1. Click the "Create Project" button on the home page
  2. Enter a project name and description
  3. Configure your preferred LLM API settings
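The settings form may name its fields differently, but for an OpenAI-format provider you typically supply a base URL, a model name, and an API key; the values below are purely illustrative:

    API Base URL:  https://api.openai.com/v1    # or any other OpenAI-compatible endpoint (e.g. an Ollama gateway)
    Model:         gpt-4o-mini                  # example model name
    API Key:       sk-...                       # key issued by your provider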

Processing Documents

  1. Upload your Markdown files in the "Text Split" section
  2. Review the automatically split text segments (see the illustration after this list)
  3. Adjust the segmentation if needed
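As a rough illustration (the exact rules are implemented in lib/text-splitter), a Markdown file organized by headings, like the one sketched below, typically ends up as one segment per section, and the boundaries can then be adjusted in the UI:

    # Product Manual        -> segment 1: the introduction under the top-level heading
    Intro paragraph ...

    ## Installation         -> segment 2: the installation section
    Step-by-step text ...

    ## Troubleshooting      -> segment 3: the troubleshooting section
    Common issues ...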

Generating Questions

  1. Navigate to the "Questions" section
  2. Select text segments to generate questions from
  3. Review and edit the generated questions
  4. Organize questions using the tag tree

Creating Datasets

  1. Go to the "Datasets" section
  2. Select questions to include in your dataset
  3. Generate answers using your configured LLM
  4. Review and edit the generated answers

Exporting Datasets

  1. Click the "Export" button in the Datasets section
  2. Select your preferred format (Alpaca or ShareGPT)
  3. Choose file format (JSON or JSONL; see the layout comparison after this list)
  4. Add custom system prompts if needed
  5. Export your dataset
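The two file formats differ only in layout: JSON wraps all records in a single array, while JSONL writes one record per line, which is easier to stream for large datasets. A sketch using the Alpaca format (file names are illustrative):

    dataset.json
    [
      {"instruction": "Question 1 ...", "input": "", "output": "Answer 1 ..."},
      {"instruction": "Question 2 ...", "input": "", "output": "Answer 2 ..."}
    ]

    dataset.jsonl
    {"instruction": "Question 1 ...", "input": "", "output": "Answer 1 ..."}
    {"instruction": "Question 2 ...", "input": "", "output": "Answer 2 ..."}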

Project Structure

easy-dataset/
├── app/                                # Next.js application directory
│   ├── api/                            # API routes
│   │   ├── llm/                        # LLM API integration
│   │   │   ├── ollama/                 # Ollama API integration
│   │   │   └── openai/                 # OpenAI API integration
│   │   ├── projects/                   # Project management APIs
│   │   │   ├── [projectId]/            # Project-specific operations
│   │   │   │   ├── chunks/             # Text chunk operations
│   │   │   │   ├── datasets/           # Dataset generation and management
│   │   │   │   │   └── optimize/       # Dataset optimization API
│   │   │   │   ├── generate-questions/ # Batch question generation
│   │   │   │   ├── questions/          # Question management
│   │   │   │   └── split/              # Text splitting operations
│   │   │   └── user/                   # User-specific project operations
│   ├── projects/                       # Front-end project pages
│   │   └── [projectId]/                # Project-specific pages
│   │       ├── datasets/               # Dataset management UI
│   │       ├── questions/              # Question management UI
│   │       ├── settings/               # Project settings UI
│   │       └── text-split/             # Text processing UI
│   └── page.js                         # Home page
├── components/                         # React components
│   ├── datasets/                       # Dataset-related components
│   ├── home/                           # Home page components
│   ├── projects/                       # Project management components
│   ├── questions/                      # Question management components
│   └── text-split/                     # Text processing components
├── lib/                                # Core libraries and utilities
│   ├── db/                             # Database operations
│   ├── i18n/                           # Internationalization
│   ├── llm/                            # LLM integration
│   │   ├── common/                     # Common LLM utilities
│   │   ├── core/                       # Core LLM client
│   │   └── prompts/                    # Prompt templates
│   │       ├── answer.js               # Answer generation prompts (Chinese)
│   │       ├── answerEn.js             # Answer generation prompts (English)
│   │       ├── question.js             # Question generation prompts (Chinese)
│   │       ├── questionEn.js           # Question generation prompts (English)
│   │       └── ... other prompts
│   └── text-splitter/                  # Text splitting utilities
├── locales/                            # Internationalization resources
│   ├── en/                             # English translations
│   └── zh-CN/                          # Chinese translations
├── public/                             # Static assets
│   └── imgs/                           # Image resources
└── local-db/                           # Local file-based database
    └── projects/                       # Project data storage

Documentation

For detailed documentation on all features and APIs, please visit our Documentation Site.

Contributing

We welcome contributions from the community! If you'd like to contribute to Easy Dataset, please follow these steps:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/amazing-feature)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add some amazing feature')
  5. Push to the branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

Please make sure to update tests as appropriate and adhere to the existing coding style.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Built with ❤️ by ConardLi • Follow me: WeChat | Bilibili | Juejin | Zhihu
