Best AI tools for Extract Data
58 - AI tool Sites
Greptile AI
Greptile AI is an advanced web scraping tool that utilizes artificial intelligence to extract data from websites efficiently and accurately. It offers users the ability to sign in with GitHub or other methods to access its powerful features. With Greptile AI, users can easily scrape and collect data from various websites for analysis, research, or any other purposes.
Dili
Dili is an AI Diligence Platform that automates diligence processes for various industries such as real estate, private equity, and venture capital. It helps users extract key data, summarize documents, flag issues, and generate reports with high accuracy and efficiency. Dili's advanced AI technology enhances due diligence procedures, reduces human errors, and provides valuable insights for making informed decisions in high-stakes deals.
GetOData
GetOData is a powerful web scraping API and Chrome extension that offers AI-based data extraction tools for small-scale scraping projects. It allows users to extract large amounts of data without being blocked by anti-bot mechanisms like CAPTCHAs, Cloudflare, or Akamai. The API is built by data extraction experts and provides features such as choosing the output format (HTML or JSON), setting proxy locations, executing JavaScript, taking screenshots, and more. GetOData offers simplified pricing options for freelancers, startups, and businesses, with competitive rates and high success rates compared to other services.
Base64.ai
Base64.ai is an automated document processing API that offers a leading no-code AI solution for understanding documents, photos, and videos. It provides a comprehensive set of features for document processing across various industries, with a strong focus on accuracy, security, and extensibility. Base64.ai is designed to streamline document automation processes and improve data extraction efficiency.
Browse AI
Browse AI is an AI tool that offers the easiest way to extract and monitor data from any website without the need for coding. Users can train a robot in just 2 minutes to extract specific data in spreadsheet format or monitor data on a schedule. With over 7,000 integrations, Browse AI allows users to scrape structured data, run multiple robots simultaneously, emulate user interactions, handle pagination, and more. Trusted by over 370,000 individuals and teams, Browse AI is a powerful tool for data extraction and monitoring tasks.
Apify
Apify is a full-stack web scraping and data extraction platform that allows developers to build, deploy, and publish web scraping, data extraction, and web automation tools. The platform offers ready-made web scraping tools, serverless programs, integrations with various apps and services, storage for results, anti-blocking proxies, and an open-source web scraping and crawling library. Apify simplifies web scraping challenges by providing a full-stack platform that includes tools for developing, deploying, and running web scrapers efficiently.
Centari
Centari is an AI-powered platform that helps firms transform complex documents into valuable insights using generative AI technology. It enables users to enhance marketing materials, visualize market trends, extract deal points, validate data, and navigate deal history with ease. Centari's innovative features and capabilities make it a valuable tool for law firms and legal professionals looking to streamline deal intelligence processes and gain a competitive edge in the market.
Beatandraise
Beatandraise is an AI-powered platform that offers Equity Research services using AI and ChatGPT technology on SEC Edgar Chat history. The platform allows users to quickly gain insights into their investment process by leveraging AI to extract attributed sources and time series data from documents. Users can sign up for free to access the platform and perform in-depth searches and analyses on SEC filings. Beatandraise aims to streamline the research process for investors by providing accurate and timely information sourced from official documents.
Procys
Procys is a document processing platform powered by AI that offers intelligent document extraction and processing solutions. It provides a self-learning engine for efficient data extraction, seamless integration with various apps, OCR API powered by AI, customized data extraction capabilities, and AI autosplit feature for automatic document splitting. Procys caters to various use cases such as invoice OCR, ID card OCR, receipt OCR, and account payable automation. The platform ensures data security as a top priority and simplifies the document processing workflow through advanced OCR technology.
AutoRegex
AutoRegex is a web application that utilizes Natural Language Processing (NLP) and Artificial Intelligence (AI) to convert English text into Regular Expressions (RegEx) effortlessly. With the help of AI technology, users can easily translate their English requirements into complex RegEx patterns without the need for deep technical knowledge. The tool simplifies the process of creating RegEx, making it accessible to a wider audience, including those with limited programming experience. AutoRegex aims to streamline the conversion process and enhance productivity for individuals working with text data and pattern matching tasks.
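To illustrate the kind of translation described above, the sketch below pairs a plain-English requirement with a hand-written regular expression in Python. The requirement, the pattern, and the `find_dates` helper are illustrative assumptions; none of this is output from AutoRegex itself.

```python
import re

# English requirement (hypothetical): "match dates written as YYYY-MM-DD".
# Hand-written pattern for illustration only; not generated by AutoRegex.
DATE_PATTERN = re.compile(r"\b(\d{4})-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])\b")


def find_dates(text: str) -> list[str]:
    """Return every YYYY-MM-DD date found in the text, in order of appearance."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


sample = "Invoice issued 2024-03-15, due 2024-04-14; ref 2024-13-01 is invalid."
dates = find_dates(sample)
print(dates)  # the month-13 string is rejected by the month alternation
```

The pattern only checks the shape of the date (month 01 to 12, day 01 to 31), so a string such as 2024-02-31 would still match; calendar validation would need a separate step.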
Agentive
Agentive is an AI-powered audit software that simplifies and automates audits using machine learning and large language AI models. It helps users set up audit procedures templates, extract structured data from files, and match attributes to values with AI. The platform aims to make auditing easier and more efficient by eliminating manual procedures and providing support in various formats.
Receipt OCR API
Receipt OCR API by ReceiptUp is an advanced tool for precise data extraction from receipt and invoice images. The API leverages OCR and AI technology to accurately extract total amounts, taxes, dates, and merchant information, streamlining financial operations. It supports over 50 languages, multiple image formats, and offers affordable pricing. Users can easily integrate the API into their software systems for efficient receipt management and enhanced business analytics.
Simplescraper
Simplescraper is a web scraping tool that allows users to extract data from any website in seconds. It offers the ability to download data instantly, scrape at scale in the cloud, or create APIs without the need for coding. The tool is designed for developers and no-coders, making web scraping simple and efficient. Simplescraper AI Enhance provides a new way to pull insights from web data, allowing users to summarize, analyze, format, and understand extracted data using AI technology.
Pentest Copilot
Pentest Copilot by BugBase is an ultimate ethical hacking assistant that guides users through each step of the hacking journey, from analyzing web apps to root shells. It eliminates redundant research, automates payload and command generation, and provides intelligent contextual analysis to save time. The application excels at data extraction, privilege escalation, lateral movement, and leaving no trace behind. With features like secure VPN integration, total control over sessions, parallel command processing, and flexibility to choose between local or cloud execution, Pentest Copilot offers a seamless and efficient hacking experience without the need for Kali Linux installation.
Docugami
Docugami is an AI-powered document engineering platform that enables business users to extract, analyze, and automate data from various types of documents. It empowers users with immediate impact without the need for extensive machine learning investments or IT development. Docugami's proprietary Business Document Foundation Model leverages Generative AI to transform unstructured text into structured information, allowing users to unlock insights and drive business processes efficiently.
PixieBrix
PixieBrix is an AI engagement platform that allows users to build, deploy, and manage internal AI tools to drive team productivity. It unifies AI landscapes with oversight and governance for enterprise scale. The platform is enterprise-ready and fully customizable to meet unique needs, and can be deployed on any site, making it easy to integrate into existing systems. PixieBrix uses AI and automation to streamline workflows and raise productivity.
Visus
Visus is a tool that allows you to create your own ChatGPT AI. With Visus, you can train your AI on your own data, ask it questions, and get instant answers. Visus is designed to understand your language and provide quick and accurate responses to any question you may have about your documents. It can help you uncover valuable insights from your data quickly and effortlessly.
Magical
Magical is an AI-powered productivity app that automates repetitive tasks on any website. It offers a range of features including text expansion, autofill, data entry and research, and AI-powered writing assistance. Magical is designed to help users save time and focus on more important work.
Doclime
Doclime is an AI-powered document analysis tool that helps users extract insights and automate tasks from their documents. It uses natural language processing and machine learning to analyze text, identify key information, and generate summaries, reports, and other insights. Doclime can be used for a variety of tasks, including contract review, due diligence, market research, and customer support.
ChatDOC
ChatDOC is an AI-powered tool that allows users to chat with PDF documents and get instant answers with cited sources. It can summarize long documents, explain complex concepts, and find key information in seconds. ChatDOC is built for professionals and is used by over 500,000 global users.
Bardeen AI
Bardeen AI is a workflow task automation software that helps users automate repetitive tasks in apps and websites they use for work. It offers a no-code builder, pre-built workflow templates, and integrations with popular apps like Salesforce, Slack, and Google Sheets. Bardeen AI is designed to be easy to use, even for non-technical users, and it can help users save time and increase productivity.
Elicit
Elicit is a research tool that uses artificial intelligence to help researchers analyze research papers more efficiently. It can summarize papers, extract data, and synthesize findings, saving researchers time and effort. Elicit is used by over 800,000 researchers worldwide and has been featured in publications such as Nature and Science. It is a powerful tool that can help researchers stay up-to-date on the latest research and make new discoveries.
Flowshot
Flowshot is an AI plugin for Google Sheets that allows users to supercharge their spreadsheets with AI. With Flowshot, users can work faster with AI prompts, autocomplete repetitive tasks, build custom AI models without code, and generate formulas and AI images. Flowshot is used by organizations of all shapes and sizes and has been rated 5 stars by its customers.
HARPA AI
HARPA AI is a Google Chrome extension that brings AI to your browser. It can summarize and reply to emails; rewrite, rephrase, correct, and expand text; read articles; translate; and scan web pages for data. HARPA has a hybrid AI engine and works with the OpenAI GPT-3 and GPT-4 APIs, ChatGPT, Claude 2, and Google Gemini.
Levity
Levity is an AI-powered email automation tool designed specifically for the freight industry. It connects to your inbox, categorizes incoming emails, extracts critical information, and pushes it to your TMS, allowing you to focus on building customer relationships instead of manual data entry and repetitive tasks.
Bytebot
Bytebot is a web automation tool that uses AI to make it easy to create and manage web tasks. With Bytebot, you can create browser automations as intuitively as writing a simple prompt. Bytebot will take care of the code for you, so you can focus on the task at hand. Bytebot is perfect for a variety of tasks, including data extraction, form filling, and website monitoring.
Scrol.ai
Scrol.ai is a powerful AI-powered tool that allows users to search, analyze, and generate data from various sources. It utilizes advanced language models like GPT-4 and ChatGPT to provide users with a seamless and efficient way to extract insights, summarize information, and create new content. With its user-friendly interface and robust features, Scrol.ai empowers users to streamline their workflow, enhance productivity, and make informed decisions.
Spreadsheet Daddy
Spreadsheet Daddy is an AI-powered add-on for Google Sheets that enables users to automate tasks, generate content, extract data, and perform various other functions using advanced AI models like GPT-4 and GPT-4-32k. It seamlessly integrates with Google Sheets, allowing users to leverage the power of AI within their spreadsheets. With its user-friendly interface and diverse range of features, Spreadsheet Daddy empowers businesses and individuals to enhance their productivity and efficiency.
Sonny9
Sonny9 is an AI-powered data collection tool designed specifically for CPAs, tax preparers, and auditors. It helps professionals in these fields collect customer information and documents efficiently, minimizing the time and effort spent on back-and-forth communications. With Sonny9, users can automate repetitive tasks, receive notifications about new insights and consulting opportunities, and get prepared data for further analysis. The tool integrates with QuickBooks and can automatically extract data from documents into CSV format. Sonny9 also provides users with tips and opportunities for high-level consulting services based on customer information.
Sharly AI
Sharly AI is a revolutionary tool that utilizes advanced AI technology to transform complex documents and PDFs into easily digestible summaries and facilitate interactive chat-based interactions. It empowers users to engage in natural language conversations with their documents, ask questions, and retrieve specific information effortlessly. Sharly AI's capabilities extend to various domains, including research, legal analysis, project management, and content summarization, offering tailored solutions for professionals in each field. By leveraging the power of AI, Sharly AI streamlines workflows, enhances productivity, and unlocks deeper insights from vast amounts of information.
FranzAI LLM Playground
FranzAI LLM Playground is an AI-powered tool that helps you extract, classify, and analyze unstructured text data. It leverages transformer models to provide accurate and meaningful results, enabling you to build data applications faster and more efficiently. With FranzAI, you can accelerate product and content classification, enhance data interpretation, and advance data extraction processes, unlocking key insights from your textual data.
Amped Software
Amped Software develops solutions for the analysis and enhancement of images and videos for forensic, security, and investigative applications. Their tools are used by top forensic labs, law enforcement, military, security, and government agencies worldwide.
Rgx.tools
Rgx.tools is an AI-powered text-to-regex generator that helps users create regular expressions quickly and easily. It is a wrapper around OpenAI's gpt-3.5-chat model, which generates clean, readable, and efficient regular expressions based on user input. Rgx.tools is designed to make the process of writing regular expressions less painful and more accessible, even for those with limited experience.
SectorFlow
SectorFlow is a zero-code AI partner that helps businesses and teams elevate their productivity through chat-powered productivity, data insights, and secure AI chat for the workplace. With SectorFlow, users can handle requests, analyze data, and access leading language models through a user-friendly chat interface. The platform offers flexible deployment options, robust APIs, and enterprise-grade security, making it suitable for businesses of all sizes. SectorFlow's mission is to empower organizations to harness the power of AI responsibly and securely, enabling them to make informed decisions, save time, and boost efficiency.
TextMine
TextMine is an AI-powered knowledge base that helps businesses analyze, manage, and search thousands of documents. It uses AI to analyze unstructured textual data and document databases, automatically retrieving key terms to help users make informed decisions. TextMine's features include a document vault for storing and managing documents, a categorization system for organizing documents, and a data extraction tool for extracting insights from documents. TextMine can help businesses save time, money, and improve efficiency by automating manual data entry and information retrieval tasks.
Innodata Inc.
Innodata Inc. is a global data engineering company that delivers AI-enabled software platforms and managed services for AI data collection and annotation, AI digital transformation, and industry-specific business processes. They provide a full suite of services and products to power data-centric AI initiatives using artificial intelligence and human expertise. With a 30+ year legacy, they offer the highest quality data and outstanding service to their customers.
Chat With PDF AI Tool
The Chat With PDF AI Tool is an innovative application that allows users to interact with PDF documents using artificial intelligence technology. Users can engage in conversations with the AI tool to extract information, ask questions, and receive instant responses. The tool simplifies the process of working with PDF files by providing a conversational interface, making it user-friendly and efficient. With its advanced AI capabilities, the tool can understand natural language queries and provide accurate results, enhancing productivity and workflow efficiency.
Vellum AI
Vellum AI is an AI platform that supports using Microsoft Azure hosted OpenAI models. It offers tools for prompt engineering, semantic search, prompt chaining, evaluations, and monitoring. Vellum enables users to build AI systems with features like workflow automation, document analysis, fine-tuning, Q&A over documents, intent classification, summarization, vector search, chatbots, blog generation, sentiment analysis, and more. The platform is backed by top VCs and founders of well-known companies, providing a complete solution for building LLM-powered applications.
Exante
Exante is an AI-powered contract intelligence platform that offers a single source of truth for organizations' contracts. It revolutionizes contract handling by providing centralized, secure storage, AI-powered extraction and organization of unstructured data, real-time visibility, user-friendly reporting, and collaboration tools. The platform aims to streamline processes, reduce risks, and improve compliance for efficient contract management. Exante delivers tangible value by automating data extraction, reducing costs, improving accuracy, reinforcing compliance, enhancing accessibility, and providing actionable insights.
Elicit
Elicit is an AI research assistant that helps researchers analyze research papers at superhuman speed. It automates time-consuming research tasks such as summarizing papers, extracting data, and synthesizing findings. Trusted by researchers, Elicit offers a plethora of features to speed up the research process and is particularly beneficial for empirical domains like biomedicine and machine learning.
PrivacyDoc
PrivacyDoc is an AI-powered portal that allows users to analyze and query PDFs and ebooks effortlessly. By leveraging advanced NLP technology, PrivacyDoc enables users to uncover insights and conduct thorough document analysis. The platform offers features such as easy file upload, query functionality, enhanced security measures, and free access to powerful PDF analysis tools. With PrivacyDoc, users can log in with their Google account, submit queries for prompt AI-driven responses, and keep data private through secure file handling.
SOAX AI data collection
SOAX AI data collection is a powerful tool that utilizes artificial intelligence to gather and analyze data from various online sources. It automates the process of data collection, saving time and effort for users. The tool is designed to extract relevant information efficiently and accurately, providing valuable insights for businesses and researchers. With its advanced algorithms, SOAX AI data collection can handle large volumes of data quickly and effectively, making it a valuable asset for anyone in need of data-driven decision-making.
Web Transpose
Web Transpose is an AI-powered web scraping and web crawling API that allows users to transform any website into structured data. By utilizing artificial intelligence, Web Transpose can instantly build web scrapers for any website, enabling users to extract valuable information efficiently and accurately. The tool is designed for production use, offering low latency and effective proxy handling. Web Transpose learns the structure of the target website, reducing latency and preventing hallucinations commonly associated with traditional web scraping methods. Users can query any website like an API and build products quickly using the scraped data.
PandasAI
PandasAI is an open-source AI tool designed for conversational data analysis. It allows users to ask questions in natural language to their enterprise data and receive real-time data insights. The tool is integrated with various data sources and offers enhanced analytics, actionable insights, detailed reports, and visual data representation. PandasAI aims to democratize data analysis for better decision-making, offering enterprise solutions for stable and scalable internal data analysis. Users can also fine-tune models, ingest universal data, structure data automatically, augment datasets, extract data from websites, and forecast trends using AI.
Kensho Solutions
Kensho Solutions is an AI tool that illuminates insights in the world's data by providing AI solutions for audio transcription, entity identification, document classification, data extraction, and company data mapping. Their AI solutions unlock insights, enabling users to make data-driven decisions with conviction. In partnership with S&P Global, Kensho Solutions has access to vast amounts of data, which they use to train and develop machine learning algorithms to address the business world's most pressing challenges.
Invoice Detector
Invoice Detector is an AI-powered application designed to streamline invoice management by automatically fetching invoices from emails and subscriptions. It offers features such as auto invoice collection, spend optimization notifications, and expense reports. The application provides users with clarity and control over their expenses through easy-to-understand reports and smart notifications. Invoice Detector ensures data security by encrypting all data and offers different pricing plans to cater to various needs. With a user-friendly interface and AI agents, the application simplifies the process of tracking expenses and managing invoices.
ScrapeComfort
ScrapeComfort is an AI-driven web scraping tool that offers an effortless and intuitive data mining solution. It leverages AI technology to extract data from websites without the need for complex coding or technical expertise. Users can easily input URLs, download data, set up extractors, and save extracted data for immediate use. The tool is designed to cater to various needs such as data analytics, market investigation, and lead acquisition, making it a versatile solution for businesses and individuals looking to streamline their data collection process.
Kudra
Kudra is an AI-powered data extraction tool that offers dedicated solutions for finance, human resources, logistics, legal, and more. It effortlessly extracts critical data fields, tables, relationships, and summaries from various documents, transforming unstructured data into actionable insights. Kudra provides customizable AI models, seamless integrations, and secure document processing while supporting over 20 languages. With features like custom workflows, model training, API integration, and workflow builder, Kudra aims to streamline document processing for businesses of all sizes.
Infrrd
Infrrd is an intelligent document automation platform that offers advanced document extraction solutions. It leverages AI technology to enhance, classify, extract, and review documents with high accuracy, eliminating the need for human review. Infrrd provides effective process transformation solutions across various industries, such as mortgage, invoice, insurance, and audit QC. The platform is known for its world-class document extraction engine, supported by over 10 patents and award-winning algorithms. Infrrd's AI-powered automation streamlines document processing, improves data accuracy, and enhances operational efficiency for businesses.
FillBot
FillBot is an AI-powered form filling and autofill extension designed to streamline workflows and save time by automating tedious data entry tasks. It enhances productivity, ensures data accuracy, and provides bank-level data security. FillBot simplifies tasks with advanced AI technology, smart data extraction, and seamless integration with Chrome. Trusted by professionals worldwide, FillBot is endorsed for its top-tier security, worldwide network support, award-winning innovation, and high customer satisfaction.
basebox
basebox is an AI application designed to provide secure and efficient AI solutions for businesses across various industries. It offers a range of features such as secure text editing, data extraction from PDFs and Excel documents, academic text summarization, multilingual translation, and blog post creation. With a focus on data privacy and security, basebox ensures end-to-end encryption, GDPR compliance, and hosting in Europe. The application is user-friendly, requiring no technical expertise for setup, and offers transparent pricing based on actual usage.
ASSIST
ASSIST is an AI-driven document management software designed to streamline financial paperwork processing and data entry tasks. The application offers features such as SmartDoc Entry for extracting information from invoices and receipts, Polyglot Processing for multilingual support, One-Tap Integration with accounting platforms, ExportEase for data export in CSV format, and AutoFlow Revolution for automated workflows. ASSIST aims to simplify document management, enhance efficiency, and drive digital transformation in businesses by leveraging AI technology.
TextMine
TextMine is an AI-powered knowledge base designed for businesses to manage and analyze critical documents efficiently. It offers features such as document analysis, smart-search capabilities, automated data extraction, and structured dataset transformation. TextMine helps businesses save time and money by streamlining document management processes and enabling informed decision-making. The application caters to various industries like Technology, Legal Services, and Financial Services, providing solutions for teams in Procurement, Finance, Compliance, CIOs, and CDOs.
PDFMerse
PDFMerse is an AI-powered data extraction tool that revolutionizes how users handle document data. It allows users to effortlessly extract information from PDFs with precision, saving time and enhancing workflow. With cutting-edge AI technology, PDFMerse automates data extraction, ensures data accuracy, and offers versatile output formats like CSV, JSON, and Excel. The tool is designed to dramatically reduce processing time and operational costs, enabling users to focus on higher-value tasks.
Feathery
Feathery is an AI-powered platform that enables users to create powerful forms and workflows without the need for coding. It offers advanced features such as AI data extraction, document intelligence, signatures, and collaboration tools. Feathery caters to various industries like insurance, healthcare, financial services, software, and education, providing solutions to streamline processes and enhance user experiences. The platform is designed to automate form workflows, extract and fill documents, and connect with different systems, making it a versatile tool for data management and workflow optimization.
AgentQL
AgentQL is an AI-powered tool for painless data extraction and web automation. It eliminates the need for fragile XPath or DOM selectors by using semantic selectors and natural language descriptions to find web elements reliably. With controlled output and deterministic behavior, AgentQL allows users to shape data exactly as needed. The tool offers features such as extracting data, filling forms automatically, and streamlining testing processes. It is designed to be user-friendly and efficient for developers and data engineers.
Octoparse
Octoparse is an AI web scraping tool that offers a no-coding solution for turning web pages into structured data with just a few clicks. It provides users with the ability to build reliable web scrapers without any coding knowledge, thanks to its intuitive workflow designer. With features like AI assistance, automation, and template libraries, Octoparse is a powerful tool for data extraction and analysis across various industries.
FB Group Extractor
FB Group Extractor is an AI-powered tool designed to scrape Facebook group members' data with one click. It allows users to easily extract, analyze, and utilize valuable information from Facebook groups using artificial intelligence technology. The tool provides features such as data extraction, behavioral analytics for personalized ads, content enhancement, user research, and more. With over 10k satisfied users, FB Group Extractor offers a seamless experience for businesses to enhance their marketing strategies and customer insights.
28 - Open Source AI Tools
skyvern
Skyvern automates browser-based workflows using LLMs and computer vision. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions. Traditional approaches to browser automation required writing custom scripts for websites, often relying on DOM parsing and XPath-based interactions that would break whenever the website layout changed. Instead of relying only on code-defined XPath interactions, Skyvern adds computer vision and LLMs to the mix to parse items in the viewport in real time, create a plan for interaction, and interact with them. This approach gives us a few advantages: (1) Skyvern can operate on websites it’s never seen before, as it’s able to map visual elements to the actions necessary to complete a workflow, without any customized code; (2) Skyvern is resistant to website layout changes, as there are no pre-determined XPaths or other selectors our system is looking for while trying to navigate; (3) Skyvern leverages LLMs to reason through interactions to ensure we can cover complex situations. Examples of such reasoning include: (a) if you wanted to get an auto insurance quote from Geico, the answer to a common question “Were you eligible to drive at 18?” could be inferred from the driver receiving their license at age 16; (b) if you were doing competitor analysis, it’s understanding that an Arnold Palmer 22 oz can at 7/11 is almost definitely the same product as a 23 oz can at Gopuff (even though the sizes are slightly different, which could be a rounding error!).
airbyte-connectors
This repository contains Airbyte connectors used in Faros and Faros Community Edition platforms as well as Airbyte Connector Development Kit (CDK) for JavaScript/TypeScript.
open-parse
Open Parse is a Python library for visually discerning document layouts and chunking them effectively. It is designed to fill the gap in open-source libraries for handling complex documents. Unlike text splitting, which converts a file to raw text and slices it up, Open Parse visually analyzes documents for superior LLM input. It also supports basic markdown for parsing headings, bold, and italics, and has high-precision table support, extracting tables into clean Markdown formats with accuracy that surpasses traditional tools. Open Parse is extensible, allowing users to easily implement their own post-processing steps. It is also intuitive, with great editor support and completion everywhere, making it easy to use and learn.
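For contrast, here is a minimal sketch of the naive text splitting that the description above says Open Parse avoids. It does not use Open Parse; `split_fixed`, the sample table text, and the chunk size of 30 are all assumptions made for illustration.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Naive text splitting: slice the raw text every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical raw text extracted from a small table, one row per line.
raw = "Item | Qty | Price\nWidget | 2 | 9.99\nGadget | 1 | 24.50\n"

chunks = split_fixed(raw, 30)
for chunk in chunks:
    # repr() makes the cut points and newlines visible.
    print(repr(chunk))
```

Because the splitter knows nothing about layout, the cut falls in the middle of the Widget row, separating the quantity from the price. A layout-aware chunker aims to keep units such as a table together.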
unstract
Unstract is a no-code platform that enables users to launch APIs and ETL pipelines to structure unstructured documents. With Unstract, users can go beyond co-pilots by enabling machine-to-machine automation. Unstract's Prompt Studio provides a simple, no-code approach to creating prompts for LLMs, vector databases, embedding models, and text extractors. Users can then configure Prompt Studio projects as API deployments or ETL pipelines to automate critical business processes that involve complex documents. Unstract supports a wide range of LLM providers, vector databases, embeddings, text extractors, ETL sources, and ETL destinations, providing users with the flexibility to choose the best tools for their needs.
Dot
Dot is a standalone, open-source application designed for seamless interaction with documents and files using local LLMs and Retrieval Augmented Generation (RAG). It is inspired by solutions like Nvidia's Chat with RTX, providing a user-friendly interface for those without a programming background. Pre-packaged with Mistral 7B, Dot ensures accessibility and simplicity right out of the box. Dot allows you to load multiple documents into an LLM and interact with them in a fully local environment. Supported document types include PDF, DOCX, PPTX, XLSX, and Markdown. Users can also engage with Big Dot for inquiries not directly related to their documents, similar to interacting with ChatGPT. Built with Electron JS, Dot encapsulates a comprehensive Python environment that includes all necessary libraries. The application leverages libraries such as FAISS for creating local vector stores, Langchain, llama.cpp & Huggingface for setting up conversation chains, and additional tools for document management and interaction.
instructor
Instructor is a Python library that makes it a breeze to work with structured outputs from large language models (LLMs). Built on top of Pydantic, it provides a simple, transparent, and user-friendly API to manage validation, retries, and streaming responses. Get ready to supercharge your LLM workflows!
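The validate-and-retry idea behind structured outputs can be sketched with the standard library alone. This is not Instructor's API and it does not use Pydantic; `Invoice`, `parse_invoice`, `extract_with_retries`, and `fake_model` are hypothetical names, and the "model" is a stub that returns canned replies.

```python
import json
from dataclasses import dataclass


@dataclass
class Invoice:
    vendor: str
    total: float


def parse_invoice(raw: str) -> Invoice:
    """Parse and validate one model reply; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("vendor"), str):
        raise ValueError("'vendor' must be a string")
    total = data.get("total")
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("'total' must be a number")
    return Invoice(vendor=data["vendor"], total=float(total))


def extract_with_retries(ask_model, prompt: str, max_retries: int = 2) -> Invoice:
    """Call ask_model, validate the reply, and re-ask with the error on failure."""
    message = prompt
    for _ in range(max_retries + 1):
        raw = ask_model(message)
        try:
            return parse_invoice(raw)
        except ValueError as err:
            last_error = err
            # Feed the validation error back so the model can correct itself.
            message = f"{prompt}\nYour previous answer was rejected: {err}"
    raise ValueError(f"no valid answer after {max_retries + 1} attempts: {last_error}")


# Hypothetical stand-in for an LLM: the first reply is invalid, the second is valid.
_replies = iter([
    '{"vendor": "Acme", "total": "n/a"}',
    '{"vendor": "Acme", "total": 41.5}',
])


def fake_model(message: str) -> str:
    return next(_replies)


invoice = extract_with_retries(fake_model, "Extract the vendor and total.")
```

Here the first reply fails validation because `total` is a string, so the loop asks again and accepts the second reply.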
sparrow
Sparrow is an innovative open-source solution for efficient data extraction and processing from various documents and images. It seamlessly handles forms, invoices, receipts, and other unstructured data sources. Sparrow stands out with its modular architecture, offering independent services and pipelines, all optimized for robust performance. One of Sparrow's critical features is its pluggable architecture: you can easily integrate and run data extraction pipelines using tools and frameworks like LlamaIndex, Haystack, or Unstructured. Sparrow enables local LLM data extraction pipelines through Ollama or Apple MLX. Sparrow also provides an API that helps process and transform your data into structured output, ready to be integrated with custom workflows. With Sparrow Agents, you can build independent LLM agents and invoke them from your system through the API.
**List of available agents:**
* **llamaindex** - RAG pipeline with LlamaIndex for PDF processing
* **vllamaindex** - RAG pipeline with LlamaIndex multimodal for image processing
* **vprocessor** - RAG pipeline with OCR and LlamaIndex for image processing
* **haystack** - RAG pipeline with Haystack for PDF processing
* **fcall** - Function call pipeline
* **unstructured-light** - RAG pipeline with Unstructured and LangChain, supports PDF and image processing
* **unstructured** - RAG pipeline with Weaviate vector DB query, Unstructured and LangChain, supports PDF and image processing
* **instructor** - RAG pipeline with Unstructured and Instructor libraries, supports PDF and image processing. Works great for JSON response generation
Open-DocLLM
Open-DocLLM is an open-source project that addresses data extraction and processing challenges using OCR and LLM technologies. It consists of two main layers: OCR for reading document content and LLM for extracting specific content in a structured manner. The project offers a larger context window size compared to JP Morgan's DocLLM and integrates tools like Tesseract OCR and Mistral for efficient data analysis. Users can run the models on-premises using LLM studio or Ollama, and the project includes a FastAPI app for testing purposes.
concierge
Concierge is a versatile automation tool designed to streamline repetitive tasks and workflows. It provides a user-friendly interface for creating custom automation scripts without the need for extensive coding knowledge. With Concierge, users can automate various tasks across different platforms and applications, increasing efficiency and productivity. The tool offers a wide range of pre-built automation templates and allows users to customize and schedule their automation processes. Concierge is suitable for individuals and businesses looking to automate routine tasks and improve overall workflow efficiency.
blinkid-ios
BlinkID iOS is a mobile SDK that enables developers to easily integrate ID scanning and data extraction capabilities into their iOS applications. The SDK supports scanning and processing various types of identity documents, such as passports, driver's licenses, and ID cards. It provides accurate and fast data extraction, including personal information and document details. With BlinkID iOS, developers can enhance their apps with secure and reliable ID verification functionality, improving user experience and streamlining identity verification processes.
llm-document-ocr
LLM Document OCR is a Node.js tool that uses GPT-4 and Claude 3 for OCR and data extraction. It converts PDFs into PNGs, crops whitespace, cleans up JSON strings, and supports various image formats. Users can customize prompts for data extraction. The tool is sponsored by Mercoa, which offers an API for BillPay and Invoicing.
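Cleaning up JSON strings usually means stripping the chatter and code fences a model wraps around its answer. The sketch below shows one way to do that in Python; it is not the tool's own Node.js code, and `clean_json_reply` is a hypothetical helper.

```python
import json
import re

# Matches a fenced block (three backticks, optional "json" tag) and captures its body.
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def clean_json_reply(reply: str):
    """Return the first JSON object or array found in a model reply."""
    match = _FENCE.search(reply)
    text = match.group(1) if match else reply
    # Find where the JSON value starts, skipping any leading prose.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    if not starts:
        raise ValueError("no JSON object or array found")
    # raw_decode parses one value and ignores whatever follows it.
    value, _end = json.JSONDecoder().raw_decode(text[min(starts):])
    return value
```

A reply such as `Result: [1, 2, 3] done` yields the list `[1, 2, 3]`; text that contains no `{` or `[` raises `ValueError`.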
blinkid-react-native
BlinkID SDK wrapper for React Native provides best-in-class ID scanning software for cross-platform apps built with React Native. It offers complete guidance on installing and linking BlinkID library with iOS and Android apps. The SDK requires a valid license key for scanning, with offline data extraction. It supports React Native v0.71.2 and includes installation and linking instructions for iOS and Android. The repository also contains a script to create a sample React Native project and dependencies. Video tutorials demonstrate using documentVerificationOverlay and CombinedRecognizer for scanning various document types.
crawl4ai
Crawl4AI is a free, open-source web crawling service that extracts valuable data from websites and provides LLM-friendly output formats. It supports crawling multiple URLs simultaneously and replaces media tags with their ALT text. Users can integrate Crawl4AI into Python projects as a library or run it as a standalone local server. The tool allows users to crawl and extract data from specified URLs using different providers and models, with options to include raw HTML content, force fresh crawls, and extract meaningful text blocks. Configuration settings can be adjusted in the `crawler/config.py` file to customize providers, API keys, chunk processing, and word thresholds. Contributions to Crawl4AI are welcome from the open-source community to enhance its value for AI enthusiasts and developers.
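Replacing media tags with their ALT text is one way to keep image information in text-only output for an LLM. A minimal sketch using Python's standard `html.parser` follows; it is not Crawl4AI's implementation, and `AltTextExtractor`, `html_to_text`, and the `[image: ...]` marker format are assumptions made for illustration.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collect visible text, substituting each <img> with its alt text."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>, whose text is not content

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(f"[image: {alt}]")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = AltTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = (
    '<p>Quarterly results</p><img src="chart.png" alt="Revenue chart">'
    "<script>var x = 1;</script><p>Up 4%</p>"
)
text = html_to_text(sample)
```

Images without an `alt` attribute are dropped silently in this sketch; a real crawler might keep the file name instead.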
ExtractThinker
ExtractThinker is a library designed for extracting data from files and documents using Large Language Models (LLMs). It offers ORM-style interaction between files and LLMs, supporting multiple document loaders such as Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI. Users can customize extraction using contract definitions, process documents asynchronously, handle various document formats efficiently, and split and process documents. The project is inspired by the LangChain ecosystem and focuses on Intelligent Document Processing (IDP) using LLMs to achieve high accuracy in document extraction tasks.
fuji-web
Fuji-Web is an intelligent AI partner designed for full browser automation. It autonomously navigates websites and performs tasks on behalf of the user while providing explanations for each action step. Users can easily install the extension in their browser, access the Fuji icon to input tasks, and interact with the tool to streamline web browsing tasks. The tool aims to enhance user productivity by automating repetitive web actions and providing a seamless browsing experience.
empower-functions
Empower Functions is a family of large language models (LLMs) that provide GPT-4 level capabilities for real-world 'tool using' use cases. These models offer compatibility support to be used as drop-in replacements, enabling interactions with external APIs by recognizing when a function needs to be called and generating JSON containing necessary arguments based on user inputs. This capability is crucial for building conversational agents and applications that convert natural language into API calls, facilitating tasks such as weather inquiries, data extraction, and interactions with knowledge bases. The models can handle multi-turn conversations, choose between tools or standard dialogue, ask for clarification on missing parameters, integrate responses with tool outputs in a streaming fashion, and efficiently execute multiple functions either in parallel or sequentially with dependencies.
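The function-calling flow described above, where the model either replies in plain dialogue or emits JSON naming a function and its arguments, can be sketched as a small dispatcher. The JSON shape (`name`, `arguments`), the `get_weather` tool, and `dispatch` are assumptions for illustration and are not the Empower Functions API.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> str:
    # Hypothetical tool; a real one would call a weather service.
    return f"Weather for {city} in {unit}"


# Registry mapping the names the model may emit to Python callables.
TOOLS = {"get_weather": get_weather}


def dispatch(model_output: str) -> str:
    """Run the tool named in a JSON function call, or pass plain text through."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # ordinary dialogue, not a tool call
    if not isinstance(call, dict) or "name" not in call:
        return model_output
    func = TOOLS.get(call["name"])
    if func is None:
        raise KeyError(f"unknown tool: {call['name']}")
    return func(**call.get("arguments", {}))


result = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
```

Keeping the registry explicit means the model can only trigger functions the application has chosen to expose.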
aio-scrapy
Aio-scrapy is an asyncio-based web crawling and web scraping framework inspired by Scrapy. It supports distributed crawling/scraping, implements compatibility with scrapyd, and provides options for using redis queue and rabbitmq queue. The framework is designed for fast extraction of structured data from websites. Aio-scrapy requires Python 3.9+ and is compatible with Linux, Windows, macOS, and BSD systems.
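The core of asyncio-based crawling is running many fetches concurrently while capping how many are in flight. The sketch below uses only the standard library; it is not aio-scrapy's API, and `fake_fetch`, `crawl`, and the example URLs are hypothetical (the fetch is a stub so the example runs offline).

```python
import asyncio


async def fake_fetch(url: str) -> str:
    # Stand-in for a real HTTP request; sleeps briefly to simulate network I/O.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


async def crawl(urls, fetch=fake_fetch, max_concurrency: int = 3) -> dict:
    """Fetch all URLs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return url, await fetch(url)

    pairs = await asyncio.gather(*(bounded(u) for u in urls))
    return dict(pairs)


pages = asyncio.run(crawl([f"https://example.com/page/{i}" for i in range(5)]))
```

A distributed setup would replace the in-memory URL list with a shared queue, such as the Redis or RabbitMQ queues mentioned above.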
crawlee-python
Crawlee-python is a web scraping and browser automation library that covers crawling and scraping end-to-end, helping users build reliable scrapers fast. It allows users to crawl the web for links, scrape data, and store it in machine-readable formats without worrying about technical details. With rich configuration options, users can customize almost any aspect of Crawlee to suit their project's needs.
graphrag
The GraphRAG project is a data pipeline and transformation suite designed to extract meaningful, structured data from unstructured text using LLMs. It enhances LLMs' ability to reason about private data. The repository provides guidance on using knowledge graph memory structures to enhance LLM outputs, with a warning about the potential costs of GraphRAG indexing. It offers contribution guidelines and development resources, and it encourages prompt tuning for optimal results. The Responsible AI FAQ addresses GraphRAG's capabilities, intended uses, evaluation metrics, limitations, and operational factors for effective and responsible use.
Scrapegraph-LabLabAI-Hackathon
ScrapeGraphAI is a web scraping Python library that utilizes LangChain, LLM, and direct graph logic to create scraping pipelines. Users can specify the information they want to extract, and the library will handle the extraction process. The tool is designed to simplify web scraping tasks by providing a streamlined and efficient approach to data extraction.
parsera
Parsera is a lightweight Python library designed for scraping websites using LLMs. It offers simplicity and efficiency by minimizing token usage, enhancing speed, and reducing costs. Users can easily set up and run the tool to extract specific elements from web pages, generating JSON output with relevant data. Additionally, Parsera supports integration with various chat models, such as Azure, expanding its functionality and customization options for web scraping tasks.
CyberScraper-2077
CyberScraper 2077 is an advanced web scraping tool powered by AI, designed to extract data from websites with precision and style. It offers a user-friendly interface, supports multiple data export formats, operates in stealth mode to avoid detection, and promises lightning-fast scraping. The tool respects ethical scraping practices, including robots.txt and site policies. With upcoming features like proxy support and page navigation, CyberScraper 2077 is a futuristic solution for data extraction in the digital realm.
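Respecting robots.txt means checking each URL against the site's rules before fetching it. A minimal sketch with Python's standard `urllib.robotparser` follows; it is not CyberScraper's code, and `is_allowed`, the `my-bot` user agent, and the sample rules are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample rules, inlined so the example runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/public/page")
private_ok = is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/private/data")
```

In a live scraper the rules would be loaded from the site with `set_url` and `read` rather than from an inline string.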
linkedin-api
The LinkedIn API for Python allows users to programmatically search profiles, send messages, and find jobs using a regular LinkedIn user account. It does not require 'official' API access, just a valid LinkedIn account. Users can authenticate using any LinkedIn account credentials and access features like getting profiles, profile contact info, and connections. The project also points to commercial alternatives for extracting data, scraping public profiles, and accessing a full LinkedIn API. The library is not endorsed or supported by LinkedIn, using it may violate LinkedIn's Terms of Service, and it is intended for educational purposes and personal use only.
scylla
Scylla is an intelligent proxy pool tool designed for humanities, enabling users to extract content from the internet and build their own Large Language Models in the AI era. It features automatic proxy IP crawling and validation, an easy-to-use JSON API, a simple web-based user interface, HTTP forward proxy server, Scrapy and requests integration, and headless browser crawling. Users can start using Scylla with just one command, making it a versatile tool for various web scraping and content extraction tasks.
Scrapegraph-demo
ScrapeGraphAI is a web scraping Python library that utilizes LangChain, LLM, and direct graph logic to create scraping pipelines. Users can specify the information they want to extract, and the library will handle the extraction process. This repository contains an official demo/trial for the ScrapeGraphAI library, showcasing its capabilities in web scraping tasks. The tool is designed to simplify the process of extracting data from websites by providing a user-friendly interface and powerful scraping functionalities.
spider
Spider is a high-performance web crawler and indexer designed to handle data curation workloads efficiently. It offers features such as concurrency, streaming, decentralization, headless Chrome rendering, HTTP proxies, cron jobs, subscriptions, smart mode, blacklisting, whitelisting, budgeting depth, dynamic AI prompt scripting, CSS scraping, and more. Users can easily get started with the Spider Cloud hosted service or set up local installations with spider-cli. The tool supports integration with Node.js and Python for additional flexibility. With a focus on speed and scalability, Spider is ideal for extracting and organizing data from the web.
asktube
AskTube is an AI-powered YouTube video summarizer and QA assistant that utilizes Retrieval Augmented Generation (RAG) technology. It offers a comprehensive solution with Q&A functionality and aims to provide a user-friendly experience for local machine usage. The project integrates various technologies including Python, JS, Sanic, Peewee, Pytubefix, Sentence Transformers, Sqlite, Chroma, and NuxtJs/DaisyUI. AskTube supports multiple providers for analysis, AI services, and speech-to-text conversion. The tool is designed to extract data from YouTube URLs, store embedding chapter subtitles, and facilitate interactive Q&A sessions with enriched questions. It is not intended for production use but rather for end-users on their local machines.
panda-etl
PandaETL is an open-source, no-code ETL tool designed to extract and parse data from various document types including PDFs, emails, websites, audio files, and more. With an intuitive interface and powerful backend, PandaETL simplifies the process of data extraction and transformation, making it accessible to users without programming skills.
26 - OpenAI GPTs
Advanced Web Scraper with Code Generator
Generates web scraping code with accurate selectors.
Data Extractor Pro
Expert in data extraction and context-driven analysis. Can read most file types, including PDF, XLSX, Word, TXT, CSV, and EML.
Regex Wizard
Generates and explains regex patterns from your description; it supports English and Chinese.
DocuScan and Scribe
Scans and transcribes images into documents, offers downloadable copies as a document, and offers to translate them into different languages.
PDF AI
PDFChat: Analyse thousands of PDFs in seconds; extract from and chat with PDFs in any language.
Image Analyzer
I'm an image analysis assistant, providing detailed summaries and insights.
RegExp Builder
This GPT lets you build PCRE regular expressions (for use with the RegExp constructor).
Assistente Codificação TUSS Exames com OCR
Portuguese OCR for medical test coding, outputs in table format.