awesome-open-data-annotation
Open Source Data Annotation & Labeling Tools
At ZenML, we believe in the importance of annotation and labeling workflows in the machine learning lifecycle. This repository showcases a curated list of open-source data annotation and labeling tools that are actively maintained and fit for purpose. The tools cover various domains such as multi-modal, text, images, audio, video, time series, and other data types. Users can contribute to the list and discover tools for tasks like named entity recognition, data annotation for machine learning, image and video annotation, text classification, sequence labeling, object detection, and more. The repository aims to help users enhance their data-centric workflows by leveraging these tools.
README:
At ZenML we believe that annotation and labeling workflows are a core part of the machine learning lifecycle. As an open-source tool ourselves, we wanted to highlight and recognize the variety of tools available to help your workflows become more data-centric. We used three core criteria to decide whether a tool made it into the list:
- The tool has an open-source licence.
- The tool is actively maintained.
- The tool is functional and fit for purpose.
We welcome contributions to this list, so if you know of a tool that we've missed or if you've built one yourself, please do create a PR!
🔥 Do you use these tools or do you want to add one to your MLOps stack? At ZenML, we are looking for design partnerships and collaboration to develop the integrations and workflows around using annotation within the MLOps lifecycle. If you'd like to learn more, please join our Slack and leave us a message!
Multi-modal
Name | Description | License |
---|---|---|
Acharya | A Data Centric MLOps tool for your Named Entity Recognition projects | ? |
Adala | An Autonomous Data (Labeling) Agent framework. | Apache-2 |
Classifai | A comprehensive open-source data annotation platform | Apache-2 |
Computer Vision Annotation Tool (CVAT) | A free, online, interactive video and image annotation tool for computer vision | MIT |
Data Annotator for Machine Learning (DAML) | An application that helps machine learning teams facilitate the creation and management of annotations | Apache-2 |
DataGym | Open source annotation and labeling tool for image and video assets | MIT |
Diffgram | Training Data (Data Labeling, Annotation, Workflow) for all Data Types (Image, Video, 3D, Text, Geo, Audio, more) at scale | ELv2 |
Hover | Explore and label on a map of raw data. Handles text, audio and images. | MIT |
Label Studio | A multi-type data labeling and annotation tool with standardized output format | Apache-2 |
Pigeon | A simple widget that lets you quickly annotate a dataset of unlabeled examples from the comfort of your Jupyter notebook | Apache-2 |
QSL: Quick and Simple Labeler | A quick and simple tool for labeling images, videos and time series data, right from Jupyter | MIT |
Shoonya | Platform to annotate and label data at scale | MIT |
Tator | Video analytics web platform | AGPL-3 |
TornadoAi | A human-in-the-loop machine learning framework | AGPL-3 |
Universal Data Tool | A web/desktop app for editing and annotating images, text, audio, documents and to view and edit any data defined in the extensible .udt.json and .udt.csv standard | MIT |
VGG Image Annotator (VIA) | A standalone image annotator application packaged as a single HTML file (< 400 KB) that runs on most modern web browsers | BSD-2 |
VIAME | Video and Image Analytics for Multiple Environments | Custom |
Xtreme1 | An all-in-one data labeling and annotation platform for multimodal data training that supports 3D LiDAR point clouds, images, and LLMs | Apache-2 |
Text
Name | Description | License |
---|---|---|
Annotation Lab | An NLP annotation tool included in spark-nlp | Apache-2 |
Argilla | A production-ready Python framework for exploring, annotating, and managing data in NLP projects | Apache-2 |
bulk | Bulk is a quick developer tool to apply some bulk labels | MIT |
CoreNLP | A Java suite of core NLP tools | GPL-3 |
DataQA | Labeling platform for text using weak supervision | GPL-3 |
doccano | An open source text annotation tool supporting text classification, sequence labeling and sequence to sequence tasks | MIT |
FLAT - FoLiA Linguistic Annotation Tool | A web-based linguistic annotation environment based around the FoLiA format, an XML-based format for linguistic annotation | GPL-3 |
INCEpTION | A semantic annotation platform offering intelligent annotation assistance and knowledge management | Apache-2 |
knodle | Knodle (Knowledge-supervised Deep Learning Framework) | Apache-2 |
Markup | A web-based document annotation tool, powered by GPT-4 | Unknown |
NER Annotator for Spacy | NER Annotator for spaCy lets you create training data for a custom NER model with custom tags. | MIT |
NPLM | Noisy Partial Label Model (NPLM) | N/A |
Potato | An annotation framework with 20+ templates, editable UI, quality control, data management and an option to add a survey for crowdsourcing | PolyForm Shield |
refinery | The data scientist's open-source choice to scale, assess and maintain natural language data. | Apache-2 |
Slate | A Super-Lightweight Annotation Tool for Experts: Label text in a terminal with just Python | ISC |
SMART | A tool for building labeled training datasets for supervised machine learning tasks in NLP | MIT |
SpaCy annotator | Spacy NER annotator using ipywidgets | N/A |
Small-Text | Active Learning for Text Classification | MIT |
Snorkel | Programmatically Build and Manage Training Data | Apache-2 |
skweak | skweak: Weak supervision for NLP | MIT |
TALEN | A way to do annotations for NER | Custom |
Theme | Minimalistic CLI labeling tool for text classification | MIT |
YEDDA | A lightweight collaborative text span annotation tool | Apache-2 |
WeaSEL | WeaSEL: Weakly Supervised End-to-end Learning | Apache-2 |
Images
Name | Description | License |
---|---|---|
3D Slicer | Visualization, processing, segmentation, registration, and analysis of medical, biomedical, and other 3D images and meshes | BSD |
Annotorious | A JavaScript library for image annotation | BSD-3 |
AnyLabeling | Effortless AI-assisted data labeling with AI support from YOLO, Segment Anything, MobileSAM | GPL-3 |
autodistill | Images to inference with no labeling (use foundation models to train supervised models) | Apache-2 |
bbox-visualizer | Make drawing and labeling bounding boxes easy as cake | MIT |
Bounding Box Editor | A JavaFX desktop application for creating image-object-annotations with bounding boxes | GPL-3 |
CATMAID | The Collaborative Annotation Toolkit for Massive Amounts of Image Data | GPL-3 |
COCO Annotator | A web-based image segmentation tool for object detection, localization, and keypoints | MIT |
DeepLabel | A cross-platform desktop image annotation tool for machine learning | MIT |
ilastik | Segment, classify, track and count your cells or other experimental data | Custom |
ImageTagger | An open source online platform for collaborative image labeling | MIT |
imglab | A web based tool to label images for objects that can be used to train dlib or other object detectors | MIT |
KNOSSOS | A software tool for the visualization and annotation of 3D image data, developed for the rapid reconstruction of neural morphology and connectivity | GPL-2 |
labelCloud | A lightweight tool for labeling 3D bounding boxes in point clouds | GPL-3 |
LabelFlow | An open platform for image labeling | Custom |
labelme | Image Polygonal Annotation with Python (polygon, rectangle, circle, line, point and image-level flag annotation) | Custom |
LabelImg | A graphical image annotation tool for labeling object bounding boxes in images | MIT |
LOST | A flexible web-based framework for semi-automatic image annotation | MIT |
Make Sense | A free-to-use online tool for labeling photos | GPL-3 |
MyVision | Computer vision based ML training data generation tool | GPL-3 |
OHIF Medical Imaging Viewer | OHIF zero-footprint DICOM viewer and oncology specific Lesion Tracker | MIT |
OpenLabeler | An open source desktop application for annotating objects for AI applications | Apache-2 |
Pixano | A web-based smart-annotation tool for computer vision applications | CeCILL-C |
Scalabel | A web-based visual data annotation tool, supporting both 2D and 3D data labeling | Apache-2 |
webKnossos | A fully cloud- and browser-based 3D annotation tool for distributed large-scale data analysis in light- and electron-microscopy based Connectomics | AGPL-3 |
Yolo_Label | GUI for marking bounding boxes of objects in images for training the YOLO neural network | MIT |
Video
Name | Description | License |
---|---|---|
DIVE | Media annotation and analysis tools for web and desktop | Apache-2 |
UltimateLabeling | A multi-purpose Video Labeling GUI in Python with integrated SOTA detector and tracker | MIT |
Audio
Name | Description | License |
---|---|---|
aubio | A library for audio and music analysis | GPL-3 |
audino | Open source audio annotation tool | MIT |
Praat | Annotation tool for phonetics analysis | GPL-3 |
Peaks.js | JavaScript UI component for interacting with audio waveforms | LGPL-3 |
Wavesurfer.js | Navigable waveform built on Web Audio and Canvas | BSD-3 |
Time series
Name | Description | License |
---|---|---|
sktime | A framework for machine learning with time series | BSD-3 |
Other
Name | Description | License |
---|---|---|
Compose | Automated prediction engineering. Allows you to easily structure prediction problems and generate labels for supervised learning | BSD-3 |
Encord Active | Toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling | Apache-2 |
OpenCRAVAT | A modular annotation tool for genomic variants | MIT |
PatchSorter | An open-source digital pathology tool for histologic object labeling | BSD-3 |
Personal Cancer Genome Reporter (PCGR) | A stand-alone software package for translation of individual tumor genomes for precision cancer medicine | MIT |
Quepid | Gather Human Judgements (aka Explicit Ratings) for Search Quality. Also a safe space to play with your search algorithm. | Apache-2 |
Thanks to the creators of these other repositories (and this one!) for getting us going down the path of creating our own. I used these efforts to get started in my survey of the space before adding, updating and pruning as per the open-source and other criteria specified above.