Best AI tools for< Data Collection >
Infographic
20 - AI tool Sites

SOAX AI data collection
SOAX AI data collection is a powerful tool that utilizes artificial intelligence to gather and analyze data from various online sources. It automates the process of data collection, saving time and effort for users. The tool is designed to extract relevant information efficiently and accurately, providing valuable insights for businesses and researchers. With its advanced algorithms, SOAX AI data collection can handle large volumes of data quickly and effectively, making it a valuable asset for anyone in need of data-driven decision-making.

Globose Technology Solutions
Globose Technology Solutions Pvt Ltd (GTS) is an AI data collection company that provides various datasets such as image datasets, video datasets, text datasets, speech datasets, etc., to train machine learning models. They offer premium data collection services with a human touch, aiming to refine AI vision and propel AI forward. With over 25+ years of experience, they specialize in data management, annotation, and effective data collection techniques for AI/ML. The company focuses on unlocking high-quality data, understanding AI's transformative impact, and ensuring data accuracy as the backbone of reliable AI.

Webscrape AI
Webscrape AI is a no-code web scraping tool that allows users to collect data from websites without writing any code. It is easy to use, accurate, and affordable, making it a great option for businesses of all sizes. With Webscrape AI, you can automate your data collection process and free up your time to focus on other tasks.

TypeflowAI
TypeflowAI is a platform that allows users to create AI tools without coding in minutes. Users can easily build, share, and embed AI tools into their websites to enhance SEO, increase traffic, and generate more leads. The platform offers features such as creating dynamic lead magnets, AI quizzes, calculators, and more, along with customization options to fit users' branding. With flexible pricing plans and integrations with popular apps, TypeflowAI simplifies the process of creating AI tools and lead magnets for businesses and individuals.

VideoAsk by Typeform
VideoAsk by Typeform is an interactive video platform that helps streamline conversations and build business relationships at scale. It offers features such as asynchronous interviews, easy scheduling, tagging, gathering contact info, capturing leads, research and feedback, training, customer support, and more. Users can create interactive video forms, conduct async interviews, and engage with their audience through AI-powered video chatbots. The platform is user-friendly, code-free, and integrates with over 1,500 applications through Zapier.

Alice
Alice is a fast, accurate AI transcription and recorder application that prioritizes privacy and cost-effectiveness. It allows users to securely record audio and video, transcribe in multiple languages and accents with high accuracy, and offers real-time text streaming. Alice integrates with various tools, supports webhooks, and is trusted by journalists for its reliability and security features. The application is designed to be user-friendly, efficient, and suitable for a wide range of tasks, making it a valuable tool for journalists, freelancers, and anyone in need of transcription services.

Macgence AI Training Data Services
Macgence is an AI training data services platform that offers high-quality off-the-shelf structured training data for organizations to build effective AI systems at scale. They provide services such as custom data sourcing, data annotation, data validation, content moderation, and localization. Macgence combines global linguistic, cultural, and technological expertise to create high-quality datasets for AI models, enabling faster time-to-market across the entire model value chain. With more than 5 years of experience, they support and scale AI initiatives of leading global innovators by designing custom data collection programs. Macgence specializes in handling AI training data for text, speech, image, and video data, offering cognitive annotation services to unlock the potential of unstructured textual data.

DataHawk
DataHawk is an eCommerce analytics platform powered by AI that helps users optimize their eCommerce operations from A to Z. It provides competitive intelligence, performance analysis, conversion optimization, sales growth, brand image protection, traffic generation, and more. DataHawk offers automated data collection, composable analytics, AI-powered productivity, integrations, exclusive datasets, and professional services. The platform empowers users with end-to-end visibility, comprehensive marketplace data, flexibility, control over operational data, and expert support. With AI capabilities, DataHawk automates data collection, detects anomalies, generates recommended actions, and augments productivity. The platform enhances revenue, RoAS, and time savings for users, supported by dedicated experts and customer success stories.

Smace
Smace is an AI-powered SaaS platform designed to enhance process implementation efficiency. It offers features such as enhanced process collaboration, automated workflows and integration, streamlined task management, and data-driven decision support. Smace aims to bridge the gap between process design and execution, promoting team efficiency, streamlined collaboration, and advanced integration.

BlockSurvey
BlockSurvey is a privacy-first AI-powered survey tool that empowers users to create surveys with confidence. It ensures data ownership, end-to-end encryption, and AI-driven efficiency, allowing users to get meaningful insights without compromising privacy. With features like anonymous surveys, adaptive questioning, and multilingual support, BlockSurvey offers a secure and user-friendly platform for survey creation and analysis.

Innovatiana
Innovatiana is a data labeling outsourcing company that provides high-quality training data for AI models. They specialize in computer vision, data moderation, document processing, natural language processing, and data collection. Innovatiana is committed to ethical and sustainable practices, and they pay their data labelers fair wages and provide them with good working conditions. They also use a variety of quality control measures to ensure that their data is accurate and reliable.

MyReport
MyReport is an AI-powered tool that helps users create automated reports in minutes. It uses advanced NLP technology to navigate the web and gather relevant information based on a user's input. The tool offers appealing full reports with professional outcomes, including images, graphs, tables, citations, quotes, and references. It also allows users to work with their own data by sharing a drive folder with their documents. MyReport is private and secure, and the user's information is not shared with third parties. The tool is available for professional users and offers fast generation and instant link sharing.

involve.me
involve.me is an AI-powered platform that allows users to create high-converting forms, surveys, quizzes, and calculators for interactive user flows that boost website conversions. It offers a wide range of pre-built templates and customization options to fit brand identities. The platform provides features like AI form generation, quiz creation, survey building, and analytics insights to help businesses collect quality data and increase user engagement. With involve.me, users can automate sales processes, personalize customer interactions, and grow their email list effortlessly.

GoodVision
GoodVision is an AI-driven traffic solutions platform that empowers cities to prevent congestion, enhance safety, and optimize traffic flow using real-time camera-based analytics. The platform offers a range of products and services, including real-time traffic monitoring, automatic incident detection, traffic data collection from cameras and drones, and traffic data analytics for traffic modeling. GoodVision's cutting-edge solutions provide cities with the tools they need to make data-driven decisions and improve traffic management.

Innodata Inc.
Innodata Inc. is a global data engineering company that delivers AI-enabled software platforms and managed services for AI data collection/annotation, AI digital transformation, and industry-specific business processes. They provide a full-suite of services and products to power data-centric AI initiatives using artificial intelligence and human expertise. With a 30+ year legacy, they offer the highest quality data and outstanding service to their customers.

AILYZE
AILYZE is an AI tool designed for qualitative data collection and analysis. Users can upload various document formats in any language to generate codes, conduct thematic, frequency, content, and cross-group analysis, extract top quotes, and more. The tool also allows users to create surveys, utilize an AI voice interviewer, and recruit participants globally. AILYZE offers different plans with varying features and data security measures, including options for advanced analysis and AI interviewer add-ons. Additionally, users can tap into data scientists for detailed and customized analyses on a wide range of documents.

Quid
Quid is an AI-powered consumer and market intelligence platform that goes beyond simple data collection and analytics. It provides a complete picture of customer context, helping businesses make informed decisions based on future trends and opportunities. With features like Quid Discover for uncovering insights, Quid Monitor for real-time analytics, Quid Predict for future focus, Quid Compete for competitive analysis, and Quid Connect for data integration, the platform empowers organizations with proactive, data-driven decision-making.

Lychee
Lychee is an all-in-one data platform designed to streamline the data workflow process, from data collection to analysis, visualization, and sharing. It offers powerful features such as instant graph generation, AI data analysis, website deployment, social media sharing, and team collaboration. Lychee aims to simplify complex data tasks and help users make faster progress in their data-related projects.

Pulan
Pulan is a comprehensive platform designed to assist in collecting, curating, annotating, and evaluating data points for various AI initiatives. It offers services in Natural Language Processing, Data Annotation, and Computer Vision across multiple industries such as Agriculture, Medical, Life Sciences, Government, Automotive, Insurance & Finance, Logistics, Software & Internet, Manufacturing, Retail, Construction, Energy, and Food & Beverage. Pulan provides a one-stop destination for reliable data collection and curation by industry experts, with a vast inventory of millions of datasets available for licensing at a fraction of the cost of creating the data oneself.

Sapien.io
Sapien.io is a decentralized data foundry that offers data labeling services powered by a decentralized workforce and gamified platform. The platform provides high-quality training data for large language models through a human-in-the-loop labeling process, enabling fine-tuning of datasets to build performant AI models. Sapien combines AI and human intelligence to collect and annotate various data types for any model, offering customized data collection and labeling models across industries.
20 - Open Source Tools

TagUI
TagUI is an open-source RPA tool that allows users to automate repetitive tasks on their computer, including tasks on websites, desktop apps, and the command line. It supports multiple languages and offers features like interacting with identifiers, automating data collection, moving data between TagUI and Excel, and sending Telegram notifications. Users can create RPA robots using MS Office Plug-ins or text editors, run TagUI on the cloud, and integrate with other RPA tools. TagUI prioritizes enterprise security by running on users' computers and not storing data. It offers detailed logs, enterprise installation guides, and support for centralised reporting.

Steel-LLM
Steel-LLM is a project to pre-train a large Chinese language model from scratch using over 1T of data to achieve a parameter size of around 1B, similar to TinyLlama. The project aims to share the entire process including data collection, data processing, pre-training framework selection, model design, and open-source all the code. The goal is to enable reproducibility of the work even with limited resources. The name 'Steel' is inspired by a band '万能青年旅店' and signifies the desire to create a strong model despite limited conditions. The project involves continuous data collection of various cultural elements, trivia, lyrics, niche literature, and personal secrets to train the LLM. The ultimate aim is to fill the model with diverse data and leave room for individual input, fostering collaboration among users.

ProactiveAgent
Proactive Agent is a project aimed at constructing a fully active agent that can anticipate user's requirements and offer assistance without explicit requests. It includes a data collection and generation pipeline, automatic evaluator, and training agent. The project provides datasets, evaluation scripts, and prompts to finetune LLM for proactive agent. Features include environment sensing, assistance annotation, dynamic data generation, and construction pipeline with a high F1 score on the test set. The project is intended for coding, writing, and daily life scenarios, distributed under Apache License 2.0.

SmallLanguageModel-project
This repository provides all the necessary items to build a Language Model from scratch, inspired by Karpathy's nanoGPT and Shakespeare generator. It includes data collection tools, data processing scripts, various models like BERT, GPT, and Seq-2-Seq, along with tokenizer and training files.

CritiqueLLM
CritiqueLLM is an official implementation of a model designed for generating informative critiques to evaluate large language model generation. It includes functionalities for data collection, referenced pointwise grading, referenced pairwise comparison, reference-free pairwise comparison, reference-free pointwise grading, inference for pointwise grading and pairwise comparison, and evaluation of the generated results. The model aims to provide a comprehensive framework for evaluating the performance of large language models based on human ratings and comparisons.

DB-GPT-Hub
DB-GPT-Hub is an experimental project leveraging Large Language Models (LLMs) for Text-to-SQL parsing. It includes stages like data collection, preprocessing, model selection, construction, and fine-tuning of model weights. The project aims to enhance Text-to-SQL capabilities, reduce model training costs, and enable developers to contribute to improving Text-to-SQL accuracy. The ultimate goal is to achieve automated question-answering based on databases, allowing users to execute complex database queries using natural language descriptions. The project has successfully integrated multiple large models and established a comprehensive workflow for data processing, SFT model training, prediction output, and evaluation.

PC-Agent
PC Agent introduces a novel framework to empower autonomous digital agents through human cognition transfer. It consists of PC Tracker for data collection, Cognition Completion for transforming raw data, and a multi-agent system for decision-making and visual grounding. Users can set up the tool in Python environment, customize data collection with PC Tracker, process data into cognitive trajectories, and run the multi-agent system. The tool aims to enable AI to work autonomously while users sleep, providing a cognitive journey into the digital world.

KG-LLM-MDQA
This repository contains code and demo for Knowledge Graph Prompting for Multi-Document Question Answering. It includes modules for data collection, training DPR and MDR models, fine-tuning T5 and LLaMA, and reproducing KGP-LLM algorithm. The workflow involves document collection, knowledge graph construction, fine-tuning models, and reproducing main table results. The repository provides instructions for environment setup, folder architecture, and running different modules.

LLM-Engineers-Handbook
The LLM Engineer's Handbook is an official repository containing a comprehensive guide on creating an end-to-end LLM-based system using best practices. It covers data collection & generation, LLM training pipeline, a simple RAG system, production-ready AWS deployment, comprehensive monitoring, and testing and evaluation framework. The repository includes detailed instructions on setting up local and cloud dependencies, project structure, installation steps, infrastructure setup, pipelines for data processing, training, and inference, as well as QA, tests, and running the project end-to-end.

RoboMatrix
RoboMatrix is a skill-centric hierarchical framework for scalable robot task planning and execution in an open-world environment. It provides a structured approach to robot task execution using a combination of hardware components, environment configuration, installation procedures, and data collection methods. The framework is developed using the ROS2 framework on Ubuntu and supports robots from DJI's RoboMaster series. Users can follow the provided installation guidance to set up RoboMatrix and utilize it for various tasks such as data collection, task execution, and dataset construction. The framework also includes a supervised fine-tuning dataset and aims to optimize communication and release additional components in the future.

LLM-Geo
LLM-Geo is an AI-powered geographic information system (GIS) that leverages Large Language Models (LLMs) for automatic spatial data collection, analysis, and visualization. By adopting LLM as the reasoning core, it addresses spatial problems with self-generating, self-organizing, self-verifying, self-executing, and self-growing capabilities. The tool aims to make spatial analysis easier, faster, and more accessible by reducing manual operation time and delivering accurate results through case studies. It uses GPT-4 API in a Python environment and advocates for further research and development in autonomous GIS.

Ultimate-Data-Science-Toolkit---From-Python-Basics-to-GenerativeAI
Ultimate Data Science Toolkit is a comprehensive repository covering Python basics to Generative AI. It includes modules on Python programming, data analysis, statistics, machine learning, MLOps, case studies, and deep learning. The repository provides detailed tutorials on various topics such as Python data structures, control statements, functions, modules, object-oriented programming, exception handling, file handling, web API, databases, list comprehension, lambda functions, Pandas, Numpy, data visualization, statistical analysis, supervised and unsupervised machine learning algorithms, model serialization, ML pipeline orchestration, case studies, and deep learning concepts like neural networks and autoencoders.

trubrics-sdk
Trubrics-sdk is a software development kit designed to facilitate the integration of analytics features into applications. It provides a set of tools and functionalities that enable developers to easily incorporate analytics capabilities, such as data collection, analysis, and reporting, into their software products. The SDK streamlines the process of implementing analytics solutions, allowing developers to focus on building and enhancing their applications' functionality and user experience. By leveraging trubrics-sdk, developers can quickly and efficiently integrate robust analytics features, gaining valuable insights into user behavior and application performance.

llm-compression-intelligence
This repository presents the findings of the paper "Compression Represents Intelligence Linearly". The study reveals a strong linear correlation between the intelligence of LLMs, as measured by benchmark scores, and their ability to compress external text corpora. Compression efficiency, derived from raw text corpora, serves as a reliable evaluation metric that is linearly associated with model capabilities. The repository includes the compression corpora used in the paper, code for computing compression efficiency, and data collection and processing pipelines.

flo-ai
Flo AI is a Python framework that enables users to build production-ready AI agents and teams with minimal code. It allows users to compose complex AI architectures using pre-built components while maintaining the flexibility to create custom components. The framework supports composable, production-ready, YAML-first, and flexible AI systems. Users can easily create AI agents and teams, manage teams of AI agents working together, and utilize built-in support for Retrieval-Augmented Generation (RAG) and compatibility with Langchain tools. Flo AI also provides tools for output parsing and formatting, tool logging, data collection, and JSON output collection. It is MIT Licensed and offers detailed documentation, tutorials, and examples for AI engineers and teams to accelerate development, maintainability, scalability, and testability of AI systems.

ai-trend-publish
AI TrendPublish is an AI-based trend discovery and content publishing system that supports multi-source data collection, intelligent summarization, and automatic publishing to WeChat official accounts. It features data collection from various sources, AI-powered content processing using DeepseekAI Together, key information extraction, intelligent title generation, automatic article publishing to WeChat official accounts with custom templates and scheduled tasks, notification system integration with Bark for task status updates and error alerts. The tool offers multiple templates for content customization and is built using Node.js + TypeScript with AI services from DeepseekAI Together, data sources including Twitter/X API and FireCrawl, and uses node-cron for scheduling tasks and EJS as the template engine.

deepflow
DeepFlow is an open-source project that provides deep observability for complex cloud-native and AI applications. It offers Zero Code data collection with eBPF for metrics, distributed tracing, request logs, and function profiling. DeepFlow is integrated with SmartEncoding to achieve Full Stack correlation and efficient access to all observability data. With DeepFlow, cloud-native and AI applications automatically gain deep observability, removing the burden of developers continually instrumenting code and providing monitoring and diagnostic capabilities covering everything from code to infrastructure for DevOps/SRE teams.

embodied-agents
Embodied Agents is a toolkit for integrating large multi-modal models into existing robot stacks with just a few lines of code. It provides consistency, reliability, scalability, and is configurable to any observation and action space. The toolkit is designed to reduce complexities involved in setting up inference endpoints, converting between different model formats, and collecting/storing datasets. It aims to facilitate data collection and sharing among roboticists by providing Python-first abstractions that are modular, extensible, and applicable to a wide range of tasks. The toolkit supports asynchronous and remote thread-safe agent execution for maximal responsiveness and scalability, and is compatible with various APIs like HuggingFace Spaces, Datasets, Gymnasium Spaces, Ollama, and OpenAI. It also offers automatic dataset recording and optional uploads to the HuggingFace hub.

AITreasureBox
AITreasureBox is a comprehensive collection of AI tools and resources designed to simplify and accelerate the development of AI projects. It provides a wide range of pre-trained models, datasets, and utilities that can be easily integrated into various AI applications. With AITreasureBox, developers can quickly prototype, test, and deploy AI solutions without having to build everything from scratch. Whether you are working on computer vision, natural language processing, or reinforcement learning projects, AITreasureBox has something to offer for everyone. The repository is regularly updated with new tools and resources to keep up with the latest advancements in the field of artificial intelligence.
20 - OpenAI Gpts

Coach Gestion Data
Collecte et analyse de données sur la résilience aux catastrophes naturelles.

GA4 Implementation Assistant
A helper for implementing Google Analytics 4 with tips and troubleshooting.

Credit & Collections Advisor
Manages credit risk and implements effective collection strategies.

AI Entrepreneur
An adept financier overseeing a varied collection of investments, dedicated to recognizing and nurturing innovative business endeavors.

Collect, Value, Connect
Expert in collectible valuation with real-time market data insights.

Interactive Spring API Creator
Pass in the attributes of Pojo entity class objects, generate corresponding addition, deletion, modification, and pagination query functions, including generating database connection configuration files yaml and database script files, as well as XML dynamic SQL concatenation statements.

Faith Scienceness
Analysis of religiosity with an eye towards cognitive empathy and collective intelligences. Member of the Hipster Energy Team. https://hipster.energy/team

Ask Cris about File Maker
An experiment in personal FileMaker guidance from the collective works of lifetime award-winning FileMaker trainer, Cris Ippolite. Not just links to resources, but direct access to 20+ years of custom training curriculum combined with expert AI instruction without the noise of external web links.