
Dingo: A Comprehensive Data Quality Evaluation Tool
Stars: 109

README:
- 2024/12/27: Project Initialization
Dingo is a data quality evaluation tool that automatically detects quality issues in your datasets. It ships with a variety of built-in rules and model-based evaluation methods, and also supports custom evaluation methods. Dingo works with commonly used text and multimodal datasets, including pre-training, fine-tuning, and evaluation datasets. It can be used via a local CLI or SDK, making it easy to integrate into evaluation platforms such as OpenCompass.
```shell
pip install dingo-python
```
```python
from dingo.io import InputArgs
from dingo.exec import Executor

# Evaluate a plaintext file
input_data = {
    "eval_group": "sft",         # Rule set for SFT data
    "input_path": "data.txt",    # Path to local text file
    "dataset": "local",
    "data_format": "plaintext",  # Format: plaintext
    "save_data": True            # Save evaluation results
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)
```
```python
from dingo.io import InputArgs
from dingo.exec import Executor

# Evaluate a dataset from Hugging Face
input_data = {
    "eval_group": "sft",               # Rule set for SFT data
    "input_path": "tatsu-lab/alpaca",  # Dataset from Hugging Face
    "data_format": "plaintext",        # Format: plaintext
    "save_data": True                  # Save evaluation results
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)
```
```python
from dingo.io import InputArgs
from dingo.exec import Executor

# Evaluate a JSON file
input_data = {
    "eval_group": "default",    # Default rule set
    "input_path": "data.json",  # Path to local JSON file
    "dataset": "local",
    "data_format": "json",      # Format: json
    "column_content": "text",   # Column containing the text to evaluate
    "save_data": True           # Save evaluation results
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)
```
```python
from dingo.io import InputArgs
from dingo.exec import Executor

# Evaluate using a GPT model
input_data = {
    "input_path": "data.jsonl",  # Path to local JSONL file
    "dataset": "local",
    "data_format": "jsonl",
    "column_content": "content",
    "custom_config": {
        "prompt_list": ["PromptRepeat"],  # Prompt to use
        "llm_config": {
            "detect_text_quality": {
                "model": "gpt-4o",
                "key": "YOUR_API_KEY",
                "api_url": "https://api.openai.com/v1/chat/completions"
            }
        }
    }
}

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()
print(result)
```
You can also run evaluations from the command line:

```shell
# Evaluate a plaintext file with the sft rule group
python -m dingo.run.cli --input_path data.txt --dataset local -e sft --data_format plaintext --save_data True

# Evaluate a JSON file with an LLM (configured in config_gpt.json)
python -m dingo.run.cli --input_path data.json --dataset local -e openai --data_format json --column_content text --custom_config config_gpt.json --save_data True
```
Example `config_gpt.json`:

```json
{
    "llm_config": {
        "openai": {
            "model": "gpt-4o",
            "key": "YOUR_API_KEY",
            "api_url": "https://api.openai.com/v1/chat/completions"
        }
    }
}
```
After evaluation (with `save_data=True`), a frontend page will be automatically generated. To manually start the frontend:

```shell
python -m dingo.run.vsl --input output_directory
```

where `output_directory` contains the evaluation results, including a `summary.json` file.
Try Dingo on our online demo on Hugging Face 🤗.
Dingo classifies data quality issues into 7 dimensions of Quality Metrics. Each dimension can be evaluated using both rule-based methods and LLM-based prompts:
| Quality Metric | Description | Rule Examples | LLM Prompt Examples |
|---|---|---|---|
| COMPLETENESS | Checks if data is incomplete or missing | `RuleColonEnd`, `RuleContentNull` | Evaluates if text abruptly ends with a colon or ellipsis, has mismatched parentheses, or is missing critical components |
| EFFECTIVENESS | Checks if data is meaningful and properly formatted | `RuleAbnormalChar`, `RuleHtmlEntity`, `RuleSpecialCharacter` | Detects garbled text, words stuck together without spaces, and text lacking proper punctuation |
| FLUENCY | Checks if text is grammatically correct and reads naturally | `RuleAbnormalNumber`, `RuleNoPunc`, `RuleWordStuck` | Identifies excessively long words, text fragments without punctuation, or content with chaotic reading order |
| RELEVANCE | Detects irrelevant content within the data | `RuleHeadWord` variants for different languages | Examines for irrelevant information such as citation details, headers/footers, entity markers, and HTML tags |
| SECURITY | Identifies sensitive information or value conflicts | `RuleIDCard`, `RuleUnsafeWords` | Checks for personal information and content related to gambling, pornography, or political issues |
| SIMILARITY | Detects repetitive or highly similar content | `RuleDocRepeat` | Evaluates text for consecutive repeated content or multiple occurrences of special characters |
| UNDERSTANDABILITY | Assesses how easily data can be interpreted | `RuleCapitalWords` | Ensures LaTeX formulas and Markdown are correctly formatted, with proper segmentation and line breaks |
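Each dimension can therefore be covered from two angles. Below is a minimal sketch that pairs the built-in `default` rule group with the `TEXT_QUALITY_V3` prompt described in the tables that follow; it only reuses configuration keys shown elsewhere in this README, and whether rules and prompts run in the same pass is an assumption to verify against the current docs.

```python
from dingo.io import InputArgs
from dingo.exec import Executor

# Sketch: rule-based checks (eval_group) plus an LLM prompt (prompt_list)
# on the same dataset. Adjust model/key/api_url for your own LLM endpoint.
input_data = {
    "eval_group": "default",      # Rule-based checks
    "input_path": "data.jsonl",
    "dataset": "local",
    "data_format": "jsonl",
    "column_content": "content",
    "save_data": True,
    "custom_config": {
        "prompt_list": ["TEXT_QUALITY_V3"],   # LLM-based checks
        "llm_config": {
            "detect_text_quality": {
                "model": "gpt-4o",
                "key": "YOUR_API_KEY",
                "api_url": "https://api.openai.com/v1/chat/completions"
            }
        }
    }
}

input_args = InputArgs(**input_data)
result = Executor.exec_map["local"](input_args).execute()
print(result)
```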
Dingo provides several LLM-based assessment methods defined by prompts in the `dingo/model/prompt` directory. These prompts are registered with the `prompt_register` decorator and can be combined with LLM models for quality evaluation:
Text quality assessment prompts:

| Prompt Type | Metric | Description |
|---|---|---|
| `TEXT_QUALITY_V2`, `TEXT_QUALITY_V3` | Various quality dimensions | Comprehensive text quality evaluation covering effectiveness, relevance, completeness, understandability, similarity, fluency, and security |
| `QUALITY_BAD_EFFECTIVENESS` | Effectiveness | Detects garbled text and anti-crawling content |
| `QUALITY_BAD_SIMILARITY` | Similarity | Identifies text repetition issues |
| `WORD_STICK` | Fluency | Checks for words stuck together without proper spacing |
| `CODE_LIST_ISSUE` | Completeness | Evaluates code blocks and list formatting issues |
| `UNREAD_ISSUE` | Effectiveness | Detects unreadable characters due to encoding issues |
Honesty, helpfulness, and harmlessness assessment prompts:

| Prompt Type | Metric | Description |
|---|---|---|
| `QUALITY_HONEST` | Honesty | Evaluates if responses provide accurate information without fabrication or deception |
| `QUALITY_HELPFUL` | Helpfulness | Assesses if responses address questions directly and follow instructions appropriately |
| `QUALITY_HARMLESS` | Harmlessness | Checks if responses avoid harmful content, discriminatory language, and dangerous assistance |
Domain-specific assessment prompts:

| Prompt Type | Metric | Description |
|---|---|---|
| `TEXT_QUALITY_KAOTI` | Exam question quality | Specialized assessment for evaluating the quality of exam questions, focusing on formula rendering, table formatting, paragraph structure, and answer formatting |
| `Html_Abstract` | HTML extraction quality | Compares different methods of extracting Markdown from HTML, evaluating completeness, formatting accuracy, and semantic coherence |
Classification prompts:

| Prompt Type | Metric | Description |
|---|---|---|
| `CLASSIFY_TOPIC` | Topic categorization | Classifies text into categories such as language processing, writing, code, mathematics, role-play, or knowledge Q&A |
| `CLASSIFY_QR` | Image classification | Identifies images as CAPTCHA, QR code, or normal images |
Image assessment prompts:

| Prompt Type | Metric | Description |
|---|---|---|
| `IMAGE_RELEVANCE` | Image relevance | Evaluates if an image matches a reference image in terms of face count, feature details, and visual elements |
To use these assessment prompts in your evaluations, specify them in your configuration:
```python
input_data = {
    # Other parameters...
    "custom_config": {
        "prompt_list": ["QUALITY_BAD_SIMILARITY"],  # Specific prompt to use
        "llm_config": {
            "detect_text_quality": {                # LLM model to use
                "model": "gpt-4o",
                "key": "YOUR_API_KEY",
                "api_url": "https://api.openai.com/v1/chat/completions"
            }
        }
    }
}
```
You can customize these prompts to focus on specific quality dimensions or to adapt to particular domain requirements. When combined with appropriate LLM models, these prompts enable comprehensive evaluation of data quality across multiple dimensions.
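If none of the built-in prompts fit your domain, you can register your own via the `prompt_register` decorator mentioned above. The sketch below mirrors the rule- and model-registration pattern shown later in this README; the base class, its module path, and the decorator's second argument are assumptions, so check the prompts shipped under `dingo/model/prompt` for the actual signature.

```python
from dingo.model import Model
# Assumption: the base class name and module path may differ; see the
# prompts under dingo/model/prompt for the actual pattern.
from dingo.model.prompt.base import BasePrompt


# Hypothetical prompt name; the second argument (group list) is assumed to
# behave like the one in rule_register.
@Model.prompt_register("QUALITY_BAD_DOMAIN_JARGON", [])
class PromptDomainJargon(BasePrompt):
    # Prompt text sent to the LLM alongside each record.
    content = """
    Assess whether the text relies on unexplained domain jargon.
    Return a JSON object with fields: score (0-10) and reason.
    """
```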
Dingo provides pre-configured rule groups for different types of datasets:
| Group | Use Case | Example Rules |
|---|---|---|
| `default` | General text quality | `RuleColonEnd`, `RuleContentNull`, `RuleDocRepeat`, etc. |
| `sft` | Fine-tuning datasets | Rules from `default` plus `RuleLineStartWithBulletpoint` |
| `pretrain` | Pre-training datasets | Comprehensive set of 20+ rules, including `RuleAlphaWords`, `RuleCapitalWords`, etc. |
To use a specific rule group:
```python
input_data = {
    "eval_group": "sft",  # Use "default", "sft", or "pretrain"
    # Other parameters...
}
```
- Data Sources: Local files, Hugging Face datasets, S3 storage
- Data Types: Pre-training, fine-tuning, and evaluation datasets
- Data Modalities: Text and image
- Built-in Rules: 20+ general heuristic evaluation rules
- LLM Integration: OpenAI, Kimi, and local models (e.g., Llama3)
- Custom Rules: Easily extend with your own rules and models
- Security Evaluation: Perspective API integration
- Interfaces: CLI and SDK options
- Integration: Easy integration with other platforms
- Execution Engines: Local and Spark
- Quality Metrics: 7-dimensional quality assessment
- Traceability: Detailed reports for anomaly tracking
If the built-in rules don't meet your requirements, you can create custom ones:
```python
from dingo.model import Model
from dingo.model.rule.base import BaseRule
from dingo.config.config import DynamicRuleConfig
from dingo.io import MetaData
from dingo.model.modelres import ModelRes


@Model.rule_register('QUALITY_BAD_RELEVANCE', ['default'])
class MyCustomRule(BaseRule):
    """Check for custom pattern in text"""

    dynamic_config = DynamicRuleConfig(pattern=r'your_pattern_here')

    @classmethod
    def eval(cls, input_data: MetaData) -> ModelRes:
        res = ModelRes()
        # Your rule implementation here
        return res
```
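Because the decorator above registers the rule into the `default` group (its second argument), selecting that group should pick the rule up automatically. A minimal usage sketch, assuming the module defining the rule has been imported first:

```python
from dingo.io import InputArgs
from dingo.exec import Executor

# MyCustomRule was registered into the 'default' group above, so selecting
# that group should include it in the run (import the defining module first).
input_args = InputArgs(
    eval_group="default",
    input_path="data.jsonl",
    dataset="local",
    data_format="jsonl",
    column_content="content",
    save_data=True,
)
result = Executor.exec_map["local"](input_args).execute()
print(result)
```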
You can register a custom LLM model in a similar way:

```python
from dingo.model import Model
from dingo.model.llm.base_openai import BaseOpenAI


@Model.llm_register('my_custom_model')
class MyCustomModel(BaseOpenAI):
    # Custom implementation here
    pass
```
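To route an evaluation through the model registered above, reference its registered name in `llm_config`, mirroring the `openai` entry in the `config_gpt.json` example; that the key must match the registered name is an assumption worth verifying.

```python
# Sketch: point llm_config at the custom model registered above.
# Assumption: the llm_config key is the model's registered name.
input_data = {
    # Other parameters...
    "custom_config": {
        "prompt_list": ["TEXT_QUALITY_V3"],
        "llm_config": {
            "my_custom_model": {
                "model": "your-model-name",
                "key": "YOUR_API_KEY",
                "api_url": "https://your-endpoint/v1/chat/completions"
            }
        }
    }
}
```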
See more examples in the project repository.
When running locally through the SDK, you can also retrieve results programmatically:

```python
from dingo.io import InputArgs
from dingo.exec import Executor

input_args = InputArgs(**input_data)
executor = Executor.exec_map["local"](input_args)
result = executor.execute()

# Get results
summary = executor.get_summary()           # Overall evaluation summary
bad_data = executor.get_bad_info_list()    # List of problematic data
good_data = executor.get_good_info_list()  # List of high-quality data
```
To run evaluations on a Spark cluster:

```python
from dingo.io import InputArgs
from dingo.exec import Executor
from pyspark.sql import SparkSession

# Initialize Spark
spark = SparkSession.builder.appName("Dingo").getOrCreate()
spark_rdd = spark.sparkContext.parallelize([...])  # Your data as MetaData objects

input_args = InputArgs(eval_group="default", save_data=True)
executor = Executor.exec_map["spark"](input_args, spark_session=spark, spark_rdd=spark_rdd)
result = executor.execute()
```
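The Spark executor consumes an RDD of MetaData objects (imported from `dingo.io`). A minimal sketch of building such an RDD from raw strings is shown below; the field names passed to MetaData (`data_id`, `prompt`, `content`) are assumptions, so check `dingo.io` for the actual schema.

```python
from dingo.io import MetaData
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Dingo").getOrCreate()

raw_texts = ["first document ...", "second document ..."]

# Assumption: MetaData accepts data_id / prompt / content fields;
# verify the constructor in dingo.io before relying on this sketch.
spark_rdd = spark.sparkContext.parallelize([
    MetaData(data_id=str(i), prompt="", content=text)
    for i, text in enumerate(raw_texts)
])
```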
After evaluation, Dingo generates:

- Summary Report (`summary.json`): Overall metrics and scores
- Detailed Reports: Specific issues for each rule violation

Example summary:
```json
{
    "task_id": "d6c922ec-981c-11ef-b723-7c10c9512fac",
    "task_name": "dingo",
    "eval_group": "default",
    "input_path": "test/data/test_local_jsonl.jsonl",
    "output_path": "outputs/d6c921ac-981c-11ef-b723-7c10c9512fac",
    "create_time": "20241101_144510",
    "score": 50.0,
    "num_good": 1,
    "num_bad": 1,
    "total": 2,
    "type_ratio": {
        "QUALITY_BAD_COMPLETENESS": 0.5,
        "QUALITY_BAD_RELEVANCE": 0.5
    },
    "name_ratio": {
        "QUALITY_BAD_COMPLETENESS-RuleColonEnd": 0.5,
        "QUALITY_BAD_RELEVANCE-RuleSpecialCharacter": 0.5
    }
}
```
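Because the summary is plain JSON, it is straightforward to post-process. The short script below, which only touches fields shown in the example above, loads a run's summary and ranks the most frequent issue types; the path is a placeholder for your actual output directory.

```python
import json
from pathlib import Path

# Load the summary written by a run with save_data=True and rank issue types.
# Replace "output_directory" with the run's actual output path.
summary = json.loads(Path("output_directory/summary.json").read_text())

print(f"Score: {summary['score']} ({summary['num_bad']}/{summary['total']} bad)")
for name, ratio in sorted(summary["name_ratio"].items(), key=lambda kv: -kv[1]):
    print(f"{name}: {ratio:.0%}")
```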
- [ ] Richer graphic and text evaluation indicators
- [ ] Audio and video data modality evaluation
- [ ] Small model evaluation (fasttext, Qurating)
- [ ] Data diversity evaluation
The current built-in detection rules and model methods focus on common data quality problems. For specialized evaluation needs, we recommend customizing detection rules.
We appreciate all the contributors for their efforts to improve and enhance Dingo. Please refer to the Contribution Guide for guidance on contributing to the project.
This project is released under the Apache 2.0 open-source license.
If you find this project useful, please consider citing our tool:
```bibtex
@misc{dingo,
  title={Dingo: A Comprehensive Data Quality Evaluation Tool for Large Models},
  author={Dingo Contributors},
  howpublished={\url{https://github.com/DataEval/dingo}},
  year={2024}
}
```
Similar Open Source Tools

lumen
Lumen is a command-line tool that leverages AI to enhance your git workflow. It assists in generating commit messages, understanding changes, interactive searching, and analyzing impacts without the need for an API key. With smart commit messages, git history insights, interactive search, change analysis, and rich markdown output, Lumen offers a seamless and flexible experience for users across various git workflows.

json-translator
The json-translator repository provides a free tool to translate JSON/YAML files or JSON objects into different languages using various translation modules. It supports CLI usage and package support, allowing users to translate words, sentences, JSON objects, and JSON files. The tool also offers multi-language translation, ignoring specific words, and safe translation practices. Users can contribute to the project by updating CLI, translation functions, JSON operations, and more. The roadmap includes features like Libre Translate option, Argos Translate option, Bing Translate option, and support for additional translation modules.

freeGPT
freeGPT provides free access to text and image generation models. It supports various models, including gpt3, gpt4, alpaca_7b, falcon_40b, prodia, and pollinations. The tool offers both asynchronous and non-asynchronous interfaces for text completion and image generation. It also features an interactive Discord bot that provides access to all the models in the repository. The tool is easy to use and can be integrated into various applications.

LightRAG
LightRAG is a repository hosting the code for LightRAG, a system that supports seamless integration of custom knowledge graphs, Oracle Database 23ai, Neo4J for storage, and multiple file types. It includes features like entity deletion, batch insert, incremental insert, and graph visualization. LightRAG provides an API server implementation for RESTful API access to RAG operations, allowing users to interact with it through HTTP requests. The repository also includes evaluation scripts, code for reproducing results, and a comprehensive code structure.

e2m
E2M is a Python library that can parse and convert various file types into Markdown format. It supports the conversion of multiple file formats, including doc, docx, epub, html, htm, url, pdf, ppt, pptx, mp3, and m4a. The ultimate goal of the E2M project is to provide high-quality data for Retrieval-Augmented Generation (RAG) and model training or fine-tuning. The core architecture consists of a Parser responsible for parsing various file types into text or image data, and a Converter responsible for converting text or image data into Markdown format.

ChatGLM3
ChatGLM3 is a conversational pretrained model jointly released by Zhipu AI and THU's KEG Lab. ChatGLM3-6B is the open-sourced model in the ChatGLM3 series. It inherits the advantages of its predecessors, such as fluent conversation and low deployment threshold. In addition, ChatGLM3-6B introduces the following features: 1. A stronger foundation model: ChatGLM3-6B's foundation model ChatGLM3-6B-Base employs more diverse training data, more sufficient training steps, and more reasonable training strategies. Evaluation on datasets from different perspectives, such as semantics, mathematics, reasoning, code, and knowledge, shows that ChatGLM3-6B-Base has the strongest performance among foundation models below 10B parameters. 2. More complete functional support: ChatGLM3-6B adopts a newly designed prompt format, which supports not only normal multi-turn dialogue, but also complex scenarios such as tool invocation (Function Call), code execution (Code Interpreter), and Agent tasks. 3. A more comprehensive open-source sequence: In addition to the dialogue model ChatGLM3-6B, the foundation model ChatGLM3-6B-Base, the long-text dialogue model ChatGLM3-6B-32K, and ChatGLM3-6B-128K, which further enhances the long-text comprehension ability, are also open-sourced. All the above weights are completely open to academic research and are also allowed for free commercial use after filling out a questionnaire.

quantalogic
QuantaLogic is a ReAct framework for building advanced AI agents that seamlessly integrates large language models with a robust tool system. It aims to bridge the gap between advanced AI models and practical implementation in business processes by enabling agents to understand, reason about, and execute complex tasks through natural language interaction. The framework includes features such as ReAct Framework, Universal LLM Support, Secure Tool System, Real-time Monitoring, Memory Management, and Enterprise Ready components.

BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students

agentic_security
Agentic Security is an open-source vulnerability scanner designed for safety scanning, offering customizable rule sets and agent-based attacks. It provides comprehensive fuzzing for any LLMs, LLM API integration, and stress testing with a wide range of fuzzing and attack techniques. The tool is not a foolproof solution but aims to enhance security measures against potential threats. It offers installation via pip and supports quick start commands for easy setup. Users can utilize the tool for LLM integration, adding custom datasets, running CI checks, extending dataset collections, and dynamic datasets with mutations. The tool also includes a probe endpoint for integration testing. The roadmap includes expanding dataset variety, introducing new attack vectors, developing an attacker LLM, and integrating OWASP Top 10 classification.

gemini-openai-proxy
Gemini-OpenAI-Proxy is a proxy software designed to convert OpenAI API protocol calls into Google Gemini Pro protocol, allowing software using OpenAI protocol to utilize Gemini Pro models seamlessly. It provides an easy integration of Gemini Pro's powerful features without the need for complex development work.

LLMVoX
LLMVoX is a lightweight 30M-parameter, LLM-agnostic, autoregressive streaming Text-to-Speech (TTS) system designed to convert text outputs from Large Language Models into high-fidelity streaming speech with low latency. It achieves significantly lower Word Error Rate compared to speech-enabled LLMs while operating at comparable latency and speech quality. Key features include being lightweight & fast with only 30M parameters, LLM-agnostic for easy integration with existing models, multi-queue streaming for continuous speech generation, and multilingual support for easy adaptation to new languages.

GPT-Vis
GPT-Vis is a tool designed for GPTs, generative AI, and LLM projects. It provides components such as LLM Protocol for conversational interaction, LLM Component for application development, and LLM access for knowledge base and model solutions. The tool aims to facilitate rapid integration into AI applications by offering a visual protocol, built-in components, and chart recommendations for LLM.

arxiv-mcp-server
The ArXiv MCP Server acts as a bridge between AI assistants and arXiv's research repository, enabling AI models to search for and access papers programmatically through the Model Context Protocol (MCP). It offers features like paper search, access, listing, local storage, and research prompts. Users can install it via Smithery or manually for Claude Desktop. The server provides tools for paper search, download, listing, and reading, along with specialized prompts for paper analysis. Configuration can be done through environment variables, and testing is supported with a test suite. The tool is released under the MIT License and is developed by the Pearl Labs Team.

mcp-framework
MCP-Framework is a TypeScript framework for building Model Context Protocol (MCP) servers with automatic directory-based discovery for tools, resources, and prompts. It provides powerful abstractions, simple server setup, and a CLI for rapid development and project scaffolding.

botgroup.chat
botgroup.chat is a multi-person AI chat application based on React and Cloudflare Pages for free one-click deployment. It supports multiple AI roles participating in conversations simultaneously, providing an interactive experience similar to group chat. The application features real-time streaming responses, customizable AI roles and personalities, group management functionality, AI role mute function, Markdown format support, mathematical formula display with KaTeX, aesthetically pleasing UI design, and responsive design for mobile devices.