awesome-open-data-annotation
Open Source Data Annotation & Labeling Tools
Stars: 425
At ZenML, we believe in the importance of annotation and labeling workflows in the machine learning lifecycle. This repository showcases a curated list of open-source data annotation and labeling tools that are actively maintained and fit for purpose. The tools cover various domains such as multi-modal, text, images, audio, video, time series, and other data types. Users can contribute to the list and discover tools for tasks like named entity recognition, data annotation for machine learning, image and video annotation, text classification, sequence labeling, object detection, and more. The repository aims to help users enhance their data-centric workflows by leveraging these tools.
README:
At ZenML we believe that annotation and labeling workflows are a core part of the machine learning lifecycle. As an open-source tool ourselves, we wanted to highlight and recognize the variety of tools available to help your workflows become more data-centric. We used three core criteria to decide whether a particular tool could make it onto the list:
- The tool has an open-source licence.
- The tool is actively maintained.
- The tool is functional and fit for purpose.
We welcome contributions to this list, so if you know of a tool that we've missed or if you've built one yourself, please do create a PR!
🔥 Do you use these tools or do you want to add one to your MLOps stack? At ZenML, we are looking for design partnerships and collaboration to develop the integrations and workflows around using annotation within the MLOps lifecycle. If you'd like to learn more, please join our Slack and leave us a message!
Name | Description | License |
---|---|---|
Acharya | A Data Centric MLOps tool for your Named Entity Recognition projects | ? |
Adala | An Autonomous Data (Labeling) Agent framework. | Apache-2 |
Classifai | A comprehensive open-source data annotation platform | Apache-2 |
Computer Vision Annotation Tool (CVAT) | A free, online, interactive video and image annotation tool for computer vision | MIT |
Data Annotator for Machine Learning (DAML) | An application that helps machine learning teams facilitate the creation and management of annotations | Apache-2 |
DataGym | Open source annotation and labeling tool for image and video assets | MIT |
Diffgram | Training Data (Data Labeling, Annotation, Workflow) for all Data Types (Image, Video, 3D, Text, Geo, Audio, more) at scale | ELv2 |
Hover | Explore and label on a map of raw data. Handles text, audio and images. | MIT |
Label Studio | A multi-type data labeling and annotation tool with standardized output format | Apache-2 |
Pigeon | A simple widget that lets you quickly annotate a dataset of unlabeled examples from the comfort of your Jupyter notebook | Apache-2 |
QSL: Quick and Simple Labeler | A quick and simple tool for labeling images, videos and time series data, right from Jupyter | MIT |
Shoonya | Platform to Annotate and label data at scale | MIT |
Tator | Video analytics web platform | AGPL-3 |
TornadoAi | A human-in-the-loop machine learning framework | AGPL-3 |
Universal Data Tool | A web/desktop app for editing and annotating images, text, audio, documents and to view and edit any data defined in the extensible .udt.json and .udt.csv standard | MIT |
VGG Image Annotator (VIA) | A standalone image annotator application packaged as a single HTML file (< 400 KB) that runs on most modern web browsers | BSD-2 |
VIAME | Video and Image Analytics for Multiple Environments | Custom |
Xtreme1 | An all-in-one data labeling and annotation platform for multimodal data training, supporting 3D LiDAR point clouds, images, and LLMs | Apache-2 |
Name | Description | License |
---|---|---|
Annotation Lab | An NLP annotation tool included in spark-nlp | Apache-2 |
Argilla | A production-ready Python framework for exploring, annotating, and managing data in NLP projects | Apache-2 |
bulk | Bulk is a quick developer tool to apply some bulk labels | MIT |
CoreNLP | A Java suite of core NLP tools | GPL-3 |
DataQA | Labeling platform for text using weak supervision | GPL-3 |
doccano | An open source text annotation tool supporting text classification, sequence labeling and sequence to sequence tasks | MIT |
FLAT - FoLiA Linguistic Annotation Tool | A web-based linguistic annotation environment based around the FoLiA format, an XML-based format for linguistic annotation | GPL-3 |
INCEpTION | A semantic annotation platform offering intelligent annotation assistance and knowledge management | Apache-2 |
knodle | Knodle (Knowledge-supervised Deep Learning Framework) | Apache-2 |
Markup | A web-based document annotation tool, powered by GPT-4 | Unknown |
NER Annotator for Spacy | NER Annotator for SpaCy allows you to create training data for creating a custom NER Model with custom tags. | MIT |
NPLM | Noisy Partial Label Model (NPLM) | N/A |
Potato | An annotation framework with 20+ templates, editable UI, quality control, data management and an option to add a survey for crowdsourcing | PolyForm Shield |
refinery | The data scientist's open-source choice to scale, assess and maintain natural language data. | Apache-2 |
Slate | A Super-Lightweight Annotation Tool for Experts: Label text in a terminal with just Python | ISC |
SMART | A tool for building labeled training datasets for supervised machine learning tasks in NLP | MIT |
SpaCy annotator | Spacy NER annotator using ipywidgets | N/A |
Small-Text | Active Learning for Text Classification | MIT |
Snorkel | Programmatically Build and Manage Training Data | Apache-2 |
skweak | skweak: Weak supervision for NLP | MIT |
TALEN | A way to do annotations for NER | Custom |
Theme | Minimalistic CLI labeling tool for text classification | MIT |
YEDDA | A lightweight collaborative text span annotation tool | Apache-2 |
WeaSEL | WeaSEL: Weakly Supervised End-to-end Learning | Apache-2 |
Name | Description | License |
---|---|---|
3D Slicer | Visualization, processing, segmentation, registration, and analysis of medical, biomedical, and other 3D images and meshes | BSD |
Annotorious | A JavaScript library for image annotation | BSD-3 |
AnyLabeling | Effortless AI-assisted data labeling with AI support from YOLO, Segment Anything, MobileSAM | GPL-3 |
autodistill | Images to inference with no labeling (use foundation models to train supervised models) | Apache-2 |
bbox-visualizer | Make drawing and labeling bounding boxes easy as cake | MIT |
Bounding Box Editor | A JavaFX desktop application for creating image-object-annotations with bounding boxes | GPL-3 |
CATMAID | The Collaborative Annotation Toolkit for Massive Amounts of Image Data | GPL-3 |
COCO Annotator | A web-based image segmentation tool for object detection, localization, and keypoints | MIT |
DeepLabel | A cross-platform desktop image annotation tool for machine learning | MIT |
ilastik | Segment, classify, track and count your cells or other experimental data | Custom |
ImageTagger | An open source online platform for collaborative image labeling | MIT |
imglab | A web based tool to label images for objects that can be used to train dlib or other object detectors | MIT |
KNOSSOS | A software tool for the visualization and annotation of 3D image data, developed for the rapid reconstruction of neural morphology and connectivity | GPL-2 |
labelCloud | A lightweight tool for labeling 3D bounding boxes in point clouds | GPL-3 |
LabelFlow | An open platform for image labeling | Custom |
labelme | Image Polygonal Annotation with Python (polygon, rectangle, circle, line, point and image-level flag annotation) | Custom |
LabelImg | A graphical image annotation tool for labeling object bounding boxes in images | MIT |
LOST | A flexible web-based framework for semi-automatic image annotation | MIT |
Make Sense | A free-to-use online tool for labeling photos | GPL-3 |
MyVision | Computer vision based ML training data generation tool | GPL-3 |
OHIF Medical Imaging Viewer | OHIF zero-footprint DICOM viewer and oncology specific Lesion Tracker | MIT |
OpenLabeler | An open source desktop application for annotating objects for AI applications | Apache-2 |
Pixano | A web-based smart-annotation tool for computer vision applications | CeCILL-C |
Scalabel | A web-based visual data annotation tool, supporting both 2D and 3D data labeling | Apache-2 |
webKnossos | A fully cloud- and browser-based 3D annotation tool for distributed large-scale data analysis in light- and electron-microscopy based Connectomics | AGPL-3 |
Yolo_Label | GUI for marking bounding boxes of objects in images for training the YOLO neural network | MIT |
Name | Description | License |
---|---|---|
DIVE | Media annotation and analysis tools for web and desktop | Apache-2 |
UltimateLabeling | A multi-purpose Video Labeling GUI in Python with integrated SOTA detector and tracker | MIT |
Name | Description | License |
---|---|---|
aubio | A library for audio and music analysis | GPL-3 |
audino | Open source audio annotation tool | MIT |
Praat | Annotation tool for phonetics analysis | GPL-3 |
Peaks.js | JavaScript UI component for interacting with audio waveforms | LGPL-3 |
Wavesurfer.js | Navigable waveform built on Web Audio and Canvas | BSD-3 |
Name | Description | License |
---|---|---|
sktime | A framework for machine learning with time series | BSD-3 |
Name | Description | License |
---|---|---|
Compose | Automated prediction engineering. Allows you to easily structure prediction problems and generate labels for supervised learning | BSD-3 |
Encord Active | Toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling | Apache-2 |
OpenCRAVAT | A modular annotation tool for genomic variants | MIT |
PatchSorter | An open-source digital pathology tool for histologic object labeling | BSD-3 |
Personal Cancer Genome Reporter (PCGR) | A stand-alone software package for translation of individual tumor genomes for precision cancer medicine | MIT |
Quepid | Gather Human Judgements (aka Explicit Ratings) for Search Quality. Also a safe space to play with your search algorithm. | Apache-2 |
Thanks to the creators of these other repositories (and this one!) for getting us going down the path of creating our own. I used these efforts to get started in my survey of the space before adding, updating and pruning as per the open-source and other criteria specified above.
Alternative AI tools for awesome-open-data-annotation
Similar Open Source Tools
llm-app-stack
LLM App Stack, also known as Emerging Architectures for LLM Applications, is a comprehensive list of available tools, projects, and vendors at each layer of the LLM app stack. It covers various categories such as Data Pipelines, Embedding Models, Vector Databases, Playgrounds, Orchestrators, APIs/Plugins, LLM Caches, Logging/Monitoring/Eval, Validators, LLM APIs (proprietary and open source), App Hosting Platforms, Cloud Providers, and Opinionated Clouds. The repository aims to provide a detailed overview of tools and projects for building, deploying, and maintaining enterprise data solutions, AI models, and applications.
LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing
LLM-PowerHouse is a comprehensive and curated guide designed to empower developers, researchers, and enthusiasts to harness the true capabilities of Large Language Models (LLMs) and build intelligent applications that push the boundaries of natural language understanding. This GitHub repository provides in-depth articles, codebase mastery, LLM PlayLab, and resources for cost analysis and network visualization. It covers various aspects of LLMs, including NLP, models, training, evaluation metrics, open LLMs, and more. The repository also includes a collection of code examples and tutorials to help users build and deploy LLM-based applications.
TrustLLM
TrustLLM is a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, an established benchmark, evaluation and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. The document explains how to use the trustllm python package to help you assess the trustworthiness of your LLM more quickly. For more details about TrustLLM, please refer to the project website.
Topu-ai
TOPU Md is a simple WhatsApp user bot created by Topu Tech. It offers various features such as multi-device support, AI photo enhancement, downloader commands, hidden NSFW commands, logo commands, anime commands, economy menu, various games, and audio/video editor commands. Users can fork the repo, get a session ID by pairing code, and deploy on Heroku. The bot requires Node version 18.x or higher for optimal performance. Contributions to TOPU-MD are welcome, and the tool is safe for use on WhatsApp and Heroku. The tool is licensed under the MIT License and is designed to enhance the WhatsApp experience with diverse features.
visionOS-examples
visionOS-examples is a repository containing accelerators for Spatial Computing. It includes examples such as Local Large Language Model, Chat Apple Vision Pro, WebSockets, Anchor To Head, Hand Tracking, Battery Life, Countdown, Plane Detection, Timer Vision, and PencilKit for visionOS. The repository showcases various functionalities and features for Apple Vision Pro, offering tools for developers to enhance their visionOS apps with capabilities like hand tracking, plane detection, and real-time cryptocurrency prices.
GenAI-Showcase
The Generative AI Use Cases Repository showcases a wide range of applications in generative AI, including Retrieval-Augmented Generation (RAG), AI Agents, and industry-specific use cases. It provides practical notebooks and guidance on utilizing frameworks such as LlamaIndex and LangChain, and demonstrates how to integrate models from leading AI research companies like Anthropic and OpenAI.
chat-your-doc
Chat Your Doc is an experimental project exploring various applications based on LLM technology. It goes beyond being just a chatbot project, focusing on researching LLM applications using tools like LangChain and LlamaIndex. The project delves into UX, computer vision, and offers a range of examples in the 'Lab Apps' section. It includes links to different apps, descriptions, launch commands, and demos, aiming to showcase the versatility and potential of LLM applications.
langfuse
Langfuse is a powerful tool that helps you develop, monitor, and test your LLM applications. With Langfuse, you can:
- **Develop:** Instrument your app and start ingesting traces to Langfuse, inspect and debug complex logs, and manage, version, and deploy prompts from within Langfuse.
- **Monitor:** Track metrics (cost, latency, quality) and gain insights from dashboards & data exports, collect and calculate scores for your LLM completions, run model-based evaluations, collect user feedback, and manually score observations in Langfuse.
- **Test:** Track and test app behaviour before deploying a new version, test expected input and output pairs and benchmark performance before deploying, and track versions and releases in your application.
Langfuse is easy to get started with and offers a generous free tier. You can sign up for Langfuse Cloud or deploy Langfuse locally or on your own infrastructure. Langfuse also offers a variety of integrations to make it easy to connect to your LLM applications.
agents
The LiveKit Agent Framework is designed for building real-time, programmable participants that run on servers. Easily tap into LiveKit WebRTC sessions and process or generate audio, video, and data streams. The framework includes plugins for common workflows, such as voice activity detection and speech-to-text. Agents integrates seamlessly with LiveKit server, offloading job queuing and scheduling responsibilities to it. This eliminates the need for additional queuing infrastructure. Agent code developed on your local machine can scale to support thousands of concurrent sessions when deployed to a server in production.
EAGLE
Eagle is a family of Vision-Centric High-Resolution Multimodal LLMs that enhance multimodal LLM perception using a mix of vision encoders and various input resolutions. The model features a channel-concatenation-based fusion for vision experts with different architectures and knowledge, supporting up to over 1K input resolution. It excels in resolution-sensitive tasks like optical character recognition and document understanding.
TRACE
TRACE is a temporal grounding video model that utilizes causal event modeling to capture videos' inherent structure. It presents a task-interleaved video LLM tailored for sequential encoding/decoding of timestamps, salient scores, and textual captions. The project includes various model checkpoints for different stages and fine-tuning on specific datasets. It provides evaluation code for different tasks like VTG, MVBench, and VideoMME. The repository also offers annotation files and links to raw video preparation projects. Users can train the model on different tasks and evaluate the performance based on metrics like CIDER, METEOR, SODA_c, F1, mAP, Hit@1, etc. TRACE has been enhanced with trace-retrieval and trace-uni models, showing improved performance on dense video captioning and general video understanding tasks.
phoenix
Phoenix is a tool that provides MLOps and LLMOps insights at lightning speed with zero-config observability. It offers a notebook-first experience for monitoring models and LLM Applications by providing LLM Traces, LLM Evals, Embedding Analysis, RAG Analysis, and Structured Data Analysis. Users can trace through the execution of LLM Applications, evaluate generative models, explore embedding point-clouds, visualize generative application's search and retrieval process, and statistically analyze structured data. Phoenix is designed to help users troubleshoot problems related to retrieval, tool execution, relevance, toxicity, drift, and performance degradation.
tamingLLMs
The 'Taming LLMs' repository provides a practical guide to the pitfalls and challenges associated with Large Language Models (LLMs) when building applications. It focuses on key limitations and implementation pitfalls, offering practical Python examples and open source solutions to help engineers and technical leaders navigate these challenges. The repository aims to equip readers with the knowledge to harness the power of LLMs while avoiding their inherent limitations.
llm-compression-intelligence
This repository presents the findings of the paper "Compression Represents Intelligence Linearly". The study reveals a strong linear correlation between the intelligence of LLMs, as measured by benchmark scores, and their ability to compress external text corpora. Compression efficiency, derived from raw text corpora, serves as a reliable evaluation metric that is linearly associated with model capabilities. The repository includes the compression corpora used in the paper, code for computing compression efficiency, and data collection and processing pipelines.
kangaroo
Kangaroo is an AI-powered SQL client and admin tool for popular databases like SQLite, MySQL, PostgreSQL, etc. It supports various functionalities such as table design, query, model, sync, export/import, and more. The tool is designed to be comfortable, fun, and developer-friendly, with features like code intellisense and autocomplete. Kangaroo aims to provide a seamless experience for database management across different operating systems.
For similar tasks
X-AnyLabeling
X-AnyLabeling is a robust annotation tool that seamlessly incorporates an AI inference engine alongside an array of sophisticated features. Tailored for practical applications, it is committed to delivering comprehensive, industrial-grade solutions for image data engineers. This tool excels in swiftly and automatically executing annotations across diverse and intricate tasks.
file-organizer-2000
AI File Organizer 2000 is an Obsidian Plugin that uses AI to transcribe audio, annotate images, and automatically organize files by moving them to the most likely folders. It supports text, audio, and images, with upcoming local-first LLM support. Users can simply place unorganized files into the 'Inbox' folder for automatic organization. The tool renames and moves files quickly, providing a seamless file organization experience. Self-hosting is also possible by running the server and enabling the 'Self-hosted' option in the plugin settings. Join the community Discord server for more information and use the provided iOS shortcut for easy access on mobile devices.
LabelLLM
LabelLLM is an open-source data annotation platform designed to optimize the data annotation process for LLM development. It offers flexible configuration, multimodal data support, comprehensive task management, and AI-assisted annotation. Users can access a suite of annotation tools, enjoy a user-friendly experience, and enhance efficiency. The platform allows real-time monitoring of annotation progress and quality control, ensuring data integrity and timeliness.
anylabeling
AnyLabeling is a tool for effortless data labeling with AI support from YOLO and Segment Anything. It combines features from LabelImg and Labelme with an improved UI and auto-labeling capabilities. Users can annotate images with polygons, rectangles, circles, lines, and points, as well as perform auto-labeling using YOLOv5 and Segment Anything. The tool also supports text detection, recognition, and Key Information Extraction (KIE) labeling, with multiple language options available such as English, Vietnamese, and Chinese.
awesome-object-detection-datasets
This repository is a curated list of awesome public object detection and recognition datasets. It includes a wide range of datasets related to object detection and recognition tasks, such as general detection and recognition datasets, autonomous driving datasets, adverse weather datasets, person detection datasets, anti-UAV datasets, optical aerial imagery datasets, low-light image datasets, infrared image datasets, SAR image datasets, multispectral image datasets, 3D object detection datasets, vehicle-to-everything field datasets, super-resolution field datasets, and face detection and recognition datasets. The repository also provides information on tools for data annotation, data augmentation, and data management related to object detection tasks.
llm2vec
LLM2Vec is a simple recipe to convert decoder-only LLMs into text encoders. It consists of 3 simple steps: 1) enabling bidirectional attention, 2) training with masked next token prediction, and 3) unsupervised contrastive learning. The model can be further fine-tuned to achieve state-of-the-art performance.
marvin
Marvin is a lightweight AI toolkit for building natural language interfaces that are reliable, scalable, and easy to trust. Each of Marvin's tools is simple and self-documenting, using AI to solve common but complex challenges like entity extraction, classification, and generating synthetic data. Each tool is independent and incrementally adoptable, so you can use them on their own or in combination with any other library. Marvin is also multi-modal, supporting both image and audio generation as well as using images as inputs for extraction and classification. Marvin is for developers who care more about _using_ AI than _building_ AI, and we are focused on creating an exceptional developer experience. Marvin users should feel empowered to bring tightly-scoped "AI magic" into any traditional software project with just a few extra lines of code. Marvin aims to merge the best practices for building dependable, observable software with the best practices for building with generative AI into a single, easy-to-use library. It's a serious tool, but we hope you have fun with it. Marvin is open-source, free to use, and made with 💙 by the team at Prefect.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features:
- Self-contained, with no need for a DBMS or cloud service.
- OpenAPI interface, easy to integrate with existing infrastructure (e.g., Cloud IDE).
- Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.