DataFlow

Easy data preparation with the latest LLM-based operators and pipelines.



Documents

🎉 If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.

简体中文 | English

https://github.com/user-attachments/assets/19742159-cfe0-42a6-9d3d-152466d2d588

📰 1. News

🎉 [2025-06-28] We’re excited to announce that DataFlow, our Data-centric AI system, is now released! Stay tuned for future updates.

🔍 2. Overview

DataFlow is a data preparation and training system designed to parse, generate, process, and evaluate high-quality data from noisy sources (PDF, plain text, low-quality QA), thereby improving the performance of large language models (LLMs) in specific domains through targeted training (pre-training, supervised fine-tuning, RL training) or RAG over a cleaned knowledge base. DataFlow has been empirically validated to improve the performance of domain-oriented LLMs in fields such as healthcare, finance, and law.

Specifically, we construct diverse operators leveraging rule-based methods, deep learning models, LLMs, and LLM APIs. These operators are systematically integrated into distinct pipelines, collectively forming the comprehensive DataFlow system. Additionally, we develop an intelligent DataFlow-agent capable of dynamically assembling new pipelines by recombining existing operators on demand.

🛠️ 3. Operators Functionality

🔧 3.1 How Operators Work

DataFlow adopts a modular operator design philosophy, building flexible data processing pipelines by combining different types of operators. As the basic unit of data processing, an operator receives structured input (e.g., JSON/JSONL/CSV) and, after intelligent processing, outputs high-quality results. For a detailed guide on using operators, please refer to the Operator Documentation.
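To make the operator concept concrete, below is a minimal Python sketch of a single filter operator streaming over a JSONL file. The Operator base class, MinLengthFilter, and run_operator names are illustrative assumptions for this sketch, not DataFlow's actual API; see the Operator Documentation for the real interface.

import json

class Operator:
    # Illustrative base class (hypothetical, not DataFlow's actual API):
    # one operator = one unit of data processing.
    def process(self, record: dict) -> dict | None:
        raise NotImplementedError

class MinLengthFilter(Operator):
    # Drops records whose text field is too short to be useful.
    def __init__(self, min_chars: int = 50):
        self.min_chars = min_chars

    def process(self, record: dict) -> dict | None:
        return record if len(record.get("text", "")) >= self.min_chars else None

def run_operator(op: Operator, in_path: str, out_path: str) -> None:
    # Streams a JSONL file through the operator, keeping non-None outputs.
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            result = op.process(json.loads(line))
            if result is not None:
                fout.write(json.dumps(result, ensure_ascii=False) + "\n")

run_operator(MinLengthFilter(min_chars=50), "raw.jsonl", "filtered.jsonl")

Real DataFlow operators additionally cover generation and evaluation, not just filtering.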

📊 3.2 Operator Classification System

In the DataFlow framework, operators are divided into three core categories based on their functional characteristics:

| Operator Type | Quantity | Main Function |
|---|---|---|
| Generic Operators | 80+ | Covers general functions for text evaluation, processing, and synthesis |
| Domain-Specific Operators | 40+ | Specialized processing for specific domains (e.g., medical, financial, legal) |
| Evaluation Operators | 20+ | Comprehensively evaluates data quality across 6 dimensions |

🛠️ 4. Pipelines Functionality

🔧 4.1 Ready-to-Use Pipelines

The pipelines currently available in DataFlow are as follows:

⚙️ 4.2 Flexible Operator Pipelines

In this framework, operators are categorized as Fundamental Operators, Generic Operators, Domain-Specific Operators, and Evaluation Operators, together supporting data processing and evaluation functionality. Please refer to the documentation for details.
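As a rough sketch of how operators of these categories might be chained, the hypothetical Pipeline below applies steps in order and drops a record as soon as any step rejects it. All names here are illustrative placeholders, not DataFlow's real composition API; refer to the documentation for actual usage.

from typing import Callable, Iterable

# A step takes a record and returns a processed record, or None to drop it.
Step = Callable[[dict], dict | None]

class Pipeline:
    # Hypothetical sketch: operators applied in order; a None result
    # from any step removes the record from the stream.
    def __init__(self, steps: list[Step]):
        self.steps = steps

    def run(self, records: Iterable[dict]) -> list[dict]:
        kept = []
        for record in records:
            for step in self.steps:
                record = step(record)
                if record is None:
                    break
            if record is not None:
                kept.append(record)
        return kept

# Toy stand-ins for a Generic (processing) and an Evaluation (filtering) operator.
def strip_whitespace(record: dict) -> dict:
    record["text"] = record["text"].strip()
    return record

def drop_if_short(record: dict) -> dict | None:
    return record if len(record["text"]) >= 10 else None

pipeline = Pipeline([strip_whitespace, drop_if_short])
clean = pipeline.run([{"text": "  a noisy but usable document  "}])
print(clean)  # the record survives both steps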

🤖 4.3 Agent Guided Pipelines

⚡ 5. Quick Start

🛠️ 5.1 Environment Setup and Installation

Please use the following commands for environment setup and installation👇

conda create -n dataflow python=3.10 
conda activate dataflow

pip install open-dataflow

If you want to use your own GPU for local inference, please use:

pip install "open-dataflow[vllm]"

DataFlow supports Python >= 3.10 environments.

After installation, you can use the following command to check if dataflow has been installed correctly:

dataflow -v

If installed correctly, you should see:

open-dataflow codebase version: 1.0.0
        Checking for updates...
        Local version:  1.0.0
        PyPI newest version:  1.0.0
You are using the latest version: 1.0.0.

🚀 5.2 Using the Gradio Web Interface

DataFlow provides two interactive web interfaces to help you use operators, pipelines, and agents:

5.2.1 DataFlow Operators Interface

Launch the DataFlow operator interface to test and visualize all operators and pipelines:

dataflow webui

This command will start an interactive web interface, allowing you to visualize and flexibly use all operators and pipelines.

5.2.2 DataFlow Agent Interface

Launch the DataFlow agent interface for operator authoring and pipeline design:

dataflow webui agent

This command will start the DataFlow-Agent interface, providing automated operator authoring and pipeline recommendation services.

https://github.com/user-attachments/assets/fda1ad47-a9f3-447a-b5c0-cf4c9ad64763

🌐 5.3 ADP Intelligent Data Platform

Beyond the local Gradio interface, DataFlow is also available as a fully managed SaaS solution on the ADP Intelligent Data Platform.

ADP is an end-to-end system by OriginHub, designed to help enterprises accelerate the development of custom agents and models by integrating large language models (LLMs) with private data.

Core Capabilities:

  • 🤖 Automated Data Preparation: Leverage DataFlow for full-process automation of your data workflows.
  • 📚 Unified Knowledge System: Integrate and manage large-scale, multimodal knowledge bases.
  • 🤝 Intelligent Collaboration: Build and orchestrate powerful multi-agent systems.
  • 🗄️ AI-Native Database: Manage the full lifecycle of your multimodal data with a purpose-built AI database.

ADP Platform Interface

Get Started for Free

👉 Sign up now to claim your free compute credits!

📖 5.4 Reference Project Documentation

For detailed usage instructions and getting started guide, please visit our Documentation.

🧪 6. Experimental Results

For detailed experiment settings, please visit our documentation.

📝 6.1 Text Pipeline

6.1.1 Pre-training data filter pipeline

The pre-training data processing pipeline was applied to randomly sampled data from the RedPajama dataset, resulting in a final data retention rate of 13.65%. The analysis results using QuratingScorer are shown in the figure. As can be seen, the filtered pre-training data significantly outperforms the original data across four scoring dimensions: writing style, requirement for expert knowledge, factual content, and educational value. This demonstrates the effectiveness of DataFlow's pre-training data processing pipeline.

6.1.2 SFT data filter pipeline

We filtered 3k records from the Alpaca dataset and compared them against 3k randomly selected records from the same dataset by fine-tuning Qwen2.5-7B on each subset. Results are:
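For reference, the random 3k baseline could be drawn as in the sketch below, which assumes the Alpaca data is a single JSON list of records (the file names and seed are illustrative, not the exact setup used in the experiment):

import json
import random

# Load the full Alpaca dataset (a JSON list of instruction records).
with open("alpaca_data.json") as f:
    records = json.load(f)

# Draw a reproducible 3k random subset to compare against the
# 3k records selected by the DataFlow filtering pipeline.
random.seed(42)
baseline = random.sample(records, k=3000)

with open("alpaca_random_3k.json", "w") as f:
    json.dump(baseline, f, ensure_ascii=False, indent=2)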

🧠 6.2 Reasoning Pipeline

We verified our reasoning pipeline by running SFT on Qwen2.5-32B-Instruct with data synthesized by the Reasoning Pipeline. We generated 1k and 5k SFT data pairs. Results are:

🗃️ 6.3 Text2SQL Pipeline

We fine-tuned the Qwen2.5-Coder-7B-Instruct model using both Supervised Fine-tuning (SFT) and Reinforcement Learning (RL), with data constructed via the DataFlow-Text2SQL Pipeline. Results are:

📄 7. Publications

Our team has published the following papers that form core components of the DataFlow system:

| Paper Title | DataFlow Component | Venue | Year |
|---|---|---|---|
| MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification | Multimodal reasoning verification framework for data processing and evaluation | ACL | 2025 |
| Efficient Pretraining Data Selection for Language Models via Multi-Actor Collaboration | Multi-actor collaborative data selection mechanism for enhanced data filtering and processing | ACL | 2025 |

Contributing Institutions: PKU, HKUST, CAS, Shanghai AI Lab, Baichuan, Ant Group

💐 8. Acknowledgements

We sincerely appreciate MinerU's outstanding contribution, particularly its robust text extraction capabilities from PDFs and documents, which greatly facilitates data loading.

🤝 9. Community & Support

Join the DataFlow open-source community to ask questions, share ideas, and collaborate with other developers!

• 📮 GitHub Issues: Report bugs or suggest features

• 🔧 GitHub Pull Requests: Contribute code improvements

• 💬 Join our community groups to connect with us and other contributors!

📜 10. Citation

If you use DataFlow in your research, please cite us:

@misc{dataflow2025,
  author       = {DataFlow Develop Team},
  title        = {DataFlow: A Unified Framework for Data-Centric AI},
  year         = {2025},
  howpublished = {\url{https://github.com/OpenDCAI/DataFlow}},
  note         = {Accessed: 2025-07-08}
}

📊 11. Statistics

Star History Chart

Connect with the PKU-DCAI Research Team on Xiaohongshu: 26133106768
