Best AI tools for< Collect Datasets >
20 - AI tool Sites

Sapien.io
Sapien.io is a decentralized data foundry that offers data labeling services powered by a decentralized workforce and gamified platform. The platform provides high-quality training data for large language models through a human-in-the-loop labeling process, enabling fine-tuning of datasets to build performant AI models. Sapien combines AI and human intelligence to collect and annotate various data types for any model, offering customized data collection and labeling models across industries.

Shaip
Shaip is a human-powered data processing service specializing in AI and ML models. They offer a wide range of services including data collection, annotation, de-identification, and more. Shaip provides high-quality training data for various AI applications, such as healthcare AI, conversational AI, and computer vision. With over 15 years of expertise, Shaip helps organizations unlock critical information from unstructured data, enabling them to achieve better results in their AI initiatives.

Globose Technology Solutions
Globose Technology Solutions Pvt Ltd (GTS) is an AI data collection company that provides various datasets such as image datasets, video datasets, text datasets, speech datasets, etc., to train machine learning models. They offer premium data collection services with a human touch, aiming to refine AI vision and propel AI forward. With over 25+ years of experience, they specialize in data management, annotation, and effective data collection techniques for AI/ML. The company focuses on unlocking high-quality data, understanding AI's transformative impact, and ensuring data accuracy as the backbone of reliable AI.

Innovatiana
Innovatiana is a data labeling outsourcing platform that offers high-quality datasets for artificial intelligence models. They specialize in image, audio/video, and text data labeling tasks, providing ethical outsourcing with a focus on impact and transparency. Innovatiana recruits and trains their own team in Madagascar, ensuring fair pay and good working conditions. They offer competitive rates, secure data handling, and high-quality labeled data to feed AI models. The platform supports various AI tasks such as Computer Vision, Data Collection, Data Moderation, Documents Processing, and Natural Language Processing.

Pulan
Pulan is a comprehensive platform designed to assist in collecting, curating, annotating, and evaluating data points for various AI initiatives. It offers services in Natural Language Processing, Data Annotation, and Computer Vision across multiple industries such as Agriculture, Medical, Life Sciences, Government, Automotive, Insurance & Finance, Logistics, Software & Internet, Manufacturing, Retail, Construction, Energy, and Food & Beverage. Pulan provides a one-stop destination for reliable data collection and curation by industry experts, with a vast inventory of millions of datasets available for licensing at a fraction of the cost of creating the data oneself.

Appen
Appen is a leading provider of high-quality data for training AI models. The company's end-to-end platform, flexible services, and deep expertise ensure the delivery of high-quality, diverse data that is crucial for building foundation models and enterprise-ready AI applications. Appen has been providing high-quality datasets that power the world's leading AI models for decades. The company's services enable it to prepare data at scale, meeting the demands of even the most ambitious AI projects. Appen also provides enterprises with software to collect, curate, fine-tune, and monitor traditionally human-driven tasks, creating massive efficiencies through a trustworthy, traceable process.

Dobb·E
Dobb·E is an open-source, general framework for learning household robotic manipulation. It aims to create a generalist machine for homes that can adapt and learn from users' needs efficiently. Dobb·E can learn a new task with just five minutes of demonstration, achieving an 81% success rate in various household environments. The project focuses on accelerating research on home robots and making robot assistants a common sight in every home.

Collect.chat
Collect.chat is a website chatbot that helps businesses collect leads, appointments, enquiries, meetings, bookings, feedbacks, orders, surveys, and suggestions. It is a drag-and-drop builder that is easy to use and requires no coding. Collect.chat can be integrated with other apps without coding, such as Salesforce, Mailchimp, and Slack. It is a great tool for businesses of all sizes to improve their online presence and customer service.

re:collect
re:collect is an AI-powered tool that helps you enhance your memory, perception, and synthesis. It connects the information you consume and helps you quickly recall the right information when you need it. With re:collect, you can:

TrueAccord
TrueAccord is an industry-leading recovery and collections platform powered by machine learning and a consumer-friendly digital experience. It offers third-party collection services for better relationships and faster recoveries through engagement, commitment, and resolution. TrueAccord's platform can handle delinquency needs from one day past due through charge off, with features like HeartBeat for better results, self-serve digital experience for consumers, scalability in delinquency infrastructure, and end-to-end personalization. The platform is trusted by market leaders and loved by consumers, providing a full-service digital-first debt collection agency.

ProProfs
ProProfs is a user research and customer feedback software that offers a comprehensive suite of tools for collecting real-time insights from website visitors and app users. With features like AI sentiment analysis, question branching, branding customization, advanced user targeting, and nudge for prototype, ProProfs empowers businesses to gather valuable feedback to enhance user experience, product functionality, and customer satisfaction. The application is designed to help businesses improve website leads, iOS & Android app feedback, NPS, customer experience management, ecommerce conversion, SaaS conversions, and marketing performance. ProProfs is known for its AI-powered analytics and reports, user-friendly interface, and exceptional customer support.

Testimonial
Testimonial is an AI-powered tool that helps businesses collect and embed customer testimonials on their website in minutes. It provides an easy-to-use interface that allows users to create beautiful testimonial widgets, customize them to match their brand, and embed them on any web page. Testimonial also offers a range of features to help businesses manage their testimonials, such as the ability to collect testimonials from multiple sources, moderate them before they are published, and track their performance.

Sonny9
Sonny9 is an AI-powered data collection tool designed specifically for CPAs, tax preparers, and auditors. It helps professionals in these fields collect customer information and documents efficiently, minimizing the time and effort spent on back-and-forth communications. With Sonny9, users can automate repetitive tasks, receive notifications about new insights and consulting opportunities, and get prepared data for further analysis. The tool integrates with QuickBooks and can automatically extract data from documents into CSV format. Sonny9 also provides users with tips and opportunities for high-level consulting services based on customer information.

heardeer
heardeer is an AI-powered user interview platform that helps businesses collect valuable feedback from their users. With heardeer, you can create custom interview questionnaires, share them with your users, and let AI manage the interviews. heardeer's AI engine will ask your users questions, follow up on their answers, and provide you with detailed insights into their feedback. This can help you understand your users' needs, improve your products and services, and make better decisions.

CookieChimp
CookieChimp is a modern consent management platform designed to help websites effortlessly collect user consent for 3rd party services while ensuring compliance with various privacy standards such as GDPR, CCPA, TCF 2.2, and Google Consent Mode. The platform offers features like dashboard monitoring, automated cookie scanning, customizable consent banners, integrations with CRM and marketing tools, global compliance solutions, robust consent logging, and AI-powered efficiency. CookieChimp is trusted by modern companies for its user-friendly interface, valuable insights, and quick setup process.

ChoiceChaser
ChoiceChaser is an AI-powered lead generation tool that helps businesses find and connect with potential customers on social media, forums, and other online platforms. It uses natural language processing and machine learning to identify relevant posts and conversations, and then notifies users when there is a match. ChoiceChaser can help businesses save time and energy by automating the process of lead generation, and it can also help them reach a wider audience of potential customers.

Archive
Archive is an AI-first influencer marketing platform that automates the process of tracking, managing, and repurposing user-generated content (UGC) for brands. It offers features such as Campaign Dashboards, Social Listening, AI-powered Creator Search, Competitor Insights, Shoppable UGC Feeds, and detailed Reports. Archive helps brands drive results by simplifying influencer marketing campaigns and improving conversion rates through the use of UGC. With a user-friendly interface and powerful AI capabilities, Archive is a trusted tool used by over 50,000 brands to streamline their influencer marketing strategies.

Yasna
Yasna by Fastuna is an AI assistant designed for conducting interviews with people, regardless of the scale. It enables users to collect facts, opinions, and ideas from any audience without direct communication. Yasna is trained to act as a professional interviewer, providing detailed responses beyond traditional surveys. The platform supports multiple languages, adapts to various styles, and offers effortless analytics for interview analysis. It creates a safe and engaging environment for participants to express themselves freely.

Leanbe
Leanbe is a product management platform designed for data-driven teams. It helps product managers capture insights from customers, competitors, and team members to make better product decisions. Leanbe uses AI to continuously update idea scores based on customer behavior, insights, and feedback from the team. This helps product managers prioritize the right ideas and build products that meet the needs of their customers.

WellyBox
WellyBox is an AI-powered platform designed to help users organize receipts and invoices effortlessly. With the power of AI, WellyBox automatically collects and analyzes receipts from various sources, integrates them with business tools and accountants, and simplifies financial operations for over 70,000 businesses worldwide. The platform offers a seamless solution for managing financial documents, saving time on administrative tasks, and ensuring efficient receipt organization.
20 - Open Source AI Tools

langchain-benchmarks
A package to help benchmark various LLM related tasks. The benchmarks are organized by end-to-end use cases, and utilize LangSmith heavily. We have several goals in open sourcing this: * Showing how we collect our benchmark datasets for each task * Showing what the benchmark datasets we use for each task is * Showing how we evaluate each task * Encouraging others to benchmark their solutions on these tasks (we are always looking for better ways of doing things!)

Q-Bench
Q-Bench is a benchmark for general-purpose foundation models on low-level vision, focusing on multi-modality LLMs performance. It includes three realms for low-level vision: perception, description, and assessment. The benchmark datasets LLVisionQA and LLDescribe are collected for perception and description tasks, with open submission-based evaluation. An abstract evaluation code is provided for assessment using public datasets. The tool can be used with the datasets API for single images and image pairs, allowing for automatic download and usage. Various tasks and evaluations are available for testing MLLMs on low-level vision tasks.

ai-audio-datasets
AI Audio Datasets List (AI-ADL) is a comprehensive collection of datasets consisting of speech, music, and sound effects, used for Generative AI, AIGC, AI model training, and audio applications. It includes datasets for speech recognition, speech synthesis, music information retrieval, music generation, audio processing, sound synthesis, and more. The repository provides a curated list of diverse datasets suitable for various AI audio tasks.

deeplake
Deep Lake is a Database for AI powered by a storage format optimized for deep-learning applications. Deep Lake can be used for: 1. Storing data and vectors while building LLM applications 2. Managing datasets while training deep learning models Deep Lake simplifies the deployment of enterprise-grade LLM-based products by offering storage for all data types (embeddings, audio, text, videos, images, pdfs, annotations, etc.), querying and vector search, data streaming while training models at scale, data versioning and lineage, and integrations with popular tools such as LangChain, LlamaIndex, Weights & Biases, and many more. Deep Lake works with data of any size, it is serverless, and it enables you to store all of your data in your own cloud and in one place. Deep Lake is used by Intel, Bayer Radiology, Matterport, ZERO Systems, Red Cross, Yale, & Oxford.

lance
Lance is a modern columnar data format optimized for ML workflows and datasets. It offers high-performance random access, vector search, zero-copy automatic versioning, and ecosystem integrations with Apache Arrow, Pandas, Polars, and DuckDB. Lance is designed to address the challenges of the ML development cycle, providing a unified data format for collection, exploration, analytics, feature engineering, training, evaluation, deployment, and monitoring. It aims to reduce data silos and streamline the ML development process.

AITreasureBox
AITreasureBox is a comprehensive collection of AI tools and resources designed to simplify and accelerate the development of AI projects. It provides a wide range of pre-trained models, datasets, and utilities that can be easily integrated into various AI applications. With AITreasureBox, developers can quickly prototype, test, and deploy AI solutions without having to build everything from scratch. Whether you are working on computer vision, natural language processing, or reinforcement learning projects, AITreasureBox has something to offer for everyone. The repository is regularly updated with new tools and resources to keep up with the latest advancements in the field of artificial intelligence.

agentneo
AgentNeo is a Python package that provides functionalities for project, trace, dataset, experiment management. It allows users to authenticate, create projects, trace agents and LangGraph graphs, manage datasets, and run experiments with metrics. The tool aims to streamline AI project management and analysis by offering a comprehensive set of features.

curator
Bespoke Curator is an open-source tool for data curation and structured data extraction. It provides a Python library for generating synthetic data at scale, with features like programmability, performance optimization, caching, and integration with HuggingFace Datasets. The tool includes a Curator Viewer for dataset visualization and offers a rich set of functionalities for creating and refining data generation strategies.

InstructGraph
InstructGraph is a framework designed to enhance large language models (LLMs) for graph-centric tasks by utilizing graph instruction tuning and preference alignment. The tool collects and decomposes 29 standard graph datasets into four groups, enabling LLMs to better understand and generate graph data. It introduces a structured format verbalizer to transform graph data into a code-like format, facilitating code understanding and generation. Additionally, it addresses hallucination problems in graph reasoning and generation through direct preference optimization (DPO). The tool aims to bridge the gap between textual LLMs and graph data, offering a comprehensive solution for graph-related tasks.

ReST-MCTS
ReST-MCTS is a reinforced self-training approach that integrates process reward guidance with tree search MCTS to collect higher-quality reasoning traces and per-step value for training policy and reward models. It eliminates the need for manual per-step annotation by estimating the probability of steps leading to correct answers. The inferred rewards refine the process reward model and aid in selecting high-quality traces for policy model self-training.

ice-score
ICE-Score is a tool designed to instruct large language models to evaluate code. It provides a minimum viable product (MVP) for evaluating generated code snippets using inputs such as problem, output, task, aspect, and model. Users can also evaluate with reference code and enable zero-shot chain-of-thought evaluation. The tool is built on codegen-metrics and code-bert-score repositories and includes datasets like CoNaLa and HumanEval. ICE-Score has been accepted to EACL 2024.

awesome-ai-llm4education
The 'awesome-ai-llm4education' repository is a curated list of papers related to artificial intelligence (AI) and large language models (LLM) for education. It collects papers from top conferences, journals, and specialized domain-specific conferences, categorizing them based on specific tasks for better organization. The repository covers a wide range of topics including tutoring, personalized learning, assessment, material preparation, specific scenarios like computer science, language, math, and medicine, aided teaching, as well as datasets and benchmarks for educational research.

langfuse
Langfuse is a powerful tool that helps you develop, monitor, and test your LLM applications. With Langfuse, you can: * **Develop:** Instrument your app and start ingesting traces to Langfuse, inspect and debug complex logs, and manage, version, and deploy prompts from within Langfuse. * **Monitor:** Track metrics (cost, latency, quality) and gain insights from dashboards & data exports, collect and calculate scores for your LLM completions, run model-based evaluations, collect user feedback, and manually score observations in Langfuse. * **Test:** Track and test app behaviour before deploying a new version, test expected in and output pairs and benchmark performance before deploying, and track versions and releases in your application. Langfuse is easy to get started with and offers a generous free tier. You can sign up for Langfuse Cloud or deploy Langfuse locally or on your own infrastructure. Langfuse also offers a variety of integrations to make it easy to connect to your LLM applications.

Odyssey
Odyssey is a framework designed to empower agents with open-world skills in Minecraft. It provides an interactive agent with a skill library, a fine-tuned LLaMA-3 model, and an open-world benchmark for evaluating agent capabilities. The framework enables agents to explore diverse gameplay opportunities in the vast Minecraft world by offering primitive and compositional skills, extensive training data, and various long-term planning tasks. Odyssey aims to advance research on autonomous agent solutions by providing datasets, model weights, and code for public use.

ProactiveAgent
Proactive Agent is a project aimed at constructing a fully active agent that can anticipate user's requirements and offer assistance without explicit requests. It includes a data collection and generation pipeline, automatic evaluator, and training agent. The project provides datasets, evaluation scripts, and prompts to finetune LLM for proactive agent. Features include environment sensing, assistance annotation, dynamic data generation, and construction pipeline with a high F1 score on the test set. The project is intended for coding, writing, and daily life scenarios, distributed under Apache License 2.0.

LLMTSCS
LLMLight is a novel framework that employs Large Language Models (LLMs) as decision-making agents for Traffic Signal Control (TSC). The framework leverages the advanced generalization capabilities of LLMs to engage in a reasoning and decision-making process akin to human intuition for effective traffic control. LLMLight has been demonstrated to be remarkably effective, generalizable, and interpretable against various transportation-based and RL-based baselines on nine real-world and synthetic datasets.

Awesome-Colorful-LLM
Awesome-Colorful-LLM is a meticulously assembled anthology of vibrant multimodal research focusing on advancements propelled by large language models (LLMs) in domains such as Vision, Audio, Agent, Robotics, and Fundamental Sciences like Mathematics. The repository contains curated collections of works, datasets, benchmarks, projects, and tools related to LLMs and multimodal learning. It serves as a comprehensive resource for researchers and practitioners interested in exploring the intersection of language models and various modalities for tasks like image understanding, video pretraining, 3D modeling, document understanding, audio analysis, agent learning, robotic applications, and mathematical research.

cambrian
Cambrian-1 is a fully open project focused on exploring multimodal Large Language Models (LLMs) with a vision-centric approach. It offers competitive performance across various benchmarks with models at different parameter levels. The project includes training configurations, model weights, instruction tuning data, and evaluation details. Users can interact with Cambrian-1 through a Gradio web interface for inference. The project is inspired by LLaVA and incorporates contributions from Vicuna, LLaMA, and Yi. Cambrian-1 is licensed under Apache 2.0 and utilizes datasets and checkpoints subject to their respective original licenses.

embodied-agents
Embodied Agents is a toolkit for integrating large multi-modal models into existing robot stacks with just a few lines of code. It provides consistency, reliability, scalability, and is configurable to any observation and action space. The toolkit is designed to reduce complexities involved in setting up inference endpoints, converting between different model formats, and collecting/storing datasets. It aims to facilitate data collection and sharing among roboticists by providing Python-first abstractions that are modular, extensible, and applicable to a wide range of tasks. The toolkit supports asynchronous and remote thread-safe agent execution for maximal responsiveness and scalability, and is compatible with various APIs like HuggingFace Spaces, Datasets, Gymnasium Spaces, Ollama, and OpenAI. It also offers automatic dataset recording and optional uploads to the HuggingFace hub.
20 - OpenAI Gpts

Collect, Value, Connect
Expert in collectible valuation with real-time market data insights.

👑 Data Privacy for Insurance Companies 👑
Insurance providers collect and process personal health, financial, and property information, making it crucial to implement comprehensive data protection strategies.

👑 Data Privacy for Spa & Beauty Salons 👑
Spa and Beauty Salons collect Customer inforation, including personal details and treatment records, necessitating a high level of confidentiality and data protection.

👑 Data Privacy for Public Transportation 👑
Public transport authorities collect data on travel patterns, fares, and sometimes personal details of passengers, necessitating strong privacy measures.

👑 Data Privacy for Fitness & Wellness Centers 👑
Fitness and Wellness Centers collect personal health and fitness data of their clients, including potentially sensitive health metrics, requiring careful handling and protection of this data.

👑 Data Privacy for Language & Training Centers 👑
Language and Skill Training Centers collect personal information of learners, including progress tracking and sometimes payment details.

👑 Data Privacy for Social Media Companies 👑
Data Privacy for Social Media Companies & Platforms collect detailed personal information, preferences, and interactions of users, making it essential to have strong data privacy policies and practices in place.

BREAKING NEWS: BOT
A GPT/AI system designed to collect, analyze, and summarize recent news from established media outlets, emphasizing balance in perspectives and precision in content delivery, with a default focus on top breaking news stories, adaptable to user-specified topics.

Highlight Optimizer
Supercharge your personal knowledge management journey by using a highlight capturing service (such as Readwise) and then turning those highlights into useful knowledge assets. Examples include flash cards, research abstracts or articles based off the highlights you collect and choose to combine.

👑 Data Privacy for Travel & Hospitality 👑
Travel and Hospitality Industry. Hotels, Airlines, and Travel Agencies collect personal information like travel histories, passport details, and payment information, necessitating robust privacy and security measures.

Log Analyzer
I'm designed to help You analyze any logs like Linux system logs, Windows logs, any security logs, access logs, error logs, etc. Please do not share information that You would like to keep private. The author does not collect or process any personal data.

Customer Feedback Management Advisor
Facilitates customer satisfaction through effective feedback management.

Lívia
Assistente de atendimento de vendas da IntegraVoip, especializado em fornecer suporte inicial e coletar informações essenciais dos clientes durante o processo de venda consultiva.

Qualitative Quest
I'm Qualitative Quest, here to offer concise advice on qualitative research methods, analysis techniques, and tools.

M&E Expert
I'm an M&E expert for NGOs, offering professional, detailed guidance to specialists.