extractor
Extractor: AI-Powered Data Extraction for your Laravel application.
Stars: 86
Extractor is an AI-powered data extraction library for Laravel that leverages OpenAI's capabilities to effortlessly extract structured data from various sources, including images, PDFs, and emails. It features a convenient wrapper around OpenAI Chat and Completion endpoints, supports multiple input formats, includes a flexible Field Extractor for arbitrary data extraction, and integrates with Textract for OCR functionality. Extractor utilizes JSON Mode from the latest GPT-3.5 and GPT-4 models, providing accurate and efficient data extraction.
README:
Effortlessly extract structured data from various sources, including images, PDFs, and emails, using OpenAI within your Laravel application.
- A convenient wrapper around OpenAI Chat and Completion endpoints.
- Supports multiple input formats such as Plain Text, PDF, RTF, Images, Word documents, and Web content.
- Includes a flexible Field Extractor that can extract any arbitrary data without writing custom logic.
- Can return a regular array or a Spatie/data object.
- Integrates with Textract for OCR functionality.
- Uses JSON Mode from the latest GPT-3.5 and GPT-4 models.
Example code
<?php
use HelgeSverre\Extractor\Facades\Extractor;
use HelgeSverre\Extractor\Facades\Text;
use Illuminate\Support\Facades\Storage;
$image = Storage::get("restaurant_menu.png");
// Extract text from images
$textFromImage = Text::textract($image);
// Extract structured data from plain text
$menu = Extractor::fields($textFromImage,
fields: [
'restaurantName',
'phoneNumber',
'dishes' => [
'name' => 'name of the dish',
'description' => 'description of the dish',
'price' => 'price of the dish as a number',
],
],
model: "gpt-3.5-turbo-1106",
maxTokens: 4000,
);
Install the package via composer:
composer require helgesverre/extractor
Publish the configuration file:
php artisan vendor:publish --tag="extractor-config"
You can find all the configuration options in the configuration file.
Since this package relies on the OpenAI Laravel Package, you also need to publish their configuration and add the OPENAI_API_KEY to your .env file:
php artisan vendor:publish --provider="OpenAI\Laravel\ServiceProvider"
OPENAI_API_KEY="your-key-here"
# Optional: Set request timeout (default: 30s).
OPENAI_REQUEST_TIMEOUT=60
use HelgeSverre\Extractor\Facades\Text;
$textPlainText = Text::text(file_get_contents('./data.txt'));
$textPdf = Text::pdf(file_get_contents('./data.pdf'));
$textImageOcr = Text::textract(file_get_contents('./data.jpg'));
$textPdfOcr = Text::textractUsingS3Upload(file_get_contents('./data.pdf'));
$textWord = Text::word(file_get_contents('./data.doc'));
$textWeb = Text::web('https://example.com');
$textHtml = Text::html(file_get_contents('./data.html'));
| Description | Method |
|---|---|
| Extract text from plain text; useful if you need to trim or normalize whitespace in a string. | Text::text |
| Extract text from a PDF file (uses smalot/pdfparser). | Text::pdf |
| Extract text with AWS Textract by sending the content as a base64-encoded string (faster, but has limitations). | Text::textract |
| Extract text with AWS Textract by uploading the file to S3 and polling for completion (handles larger files and multi-page PDFs). | Text::textractUsingS3Upload |
| Extract plain text from a Word document (uses simple XML parsing and unzipping). | Text::word |
| Fetch HTML from a URL via HTTP, strip all HTML tags, then squish and trim all whitespace. | Text::web |
| Extract text from an HTML file (same as Text::web, but for local HTML content). | Text::html |
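The two Textract methods complement each other: the base64 variant is faster but limited to smaller, single-page inputs, while the S3-upload variant handles larger files and multi-page PDFs. A minimal sketch of choosing between them by file extension (the dispatch helper itself is illustrative, not part of the package):

use HelgeSverre\Extractor\Facades\Text;

function ocrText(string $path): string
{
    $contents = file_get_contents($path);

    // Multi-page PDFs exceed the synchronous Textract call's limits,
    // so route them through the S3-upload variant instead.
    return str_ends_with(strtolower($path), '.pdf')
        ? Text::textractUsingS3Upload($contents)
        : Text::textract($contents);
}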
The Extractor package includes a set of pre-built extractors designed to simplify the extraction of structured data from various types of text. Each extractor is optimized for a specific data format, making it easy to process different types of information. Below is a list of the included extractors along with brief descriptions and convenient shortened methods for each:
| Example | Extractor | Description |
|---|---|---|
| Extractor::extract(Contacts::class, $text); | Contacts | Extracts a list of contacts (name, title, email, phone). |
| Extractor::extract(Receipt::class, $text); | Receipt | Extracts common receipt data; see receipt-scanner for details. |
| Extractor::fields($text, fields: ["name", "address", "phone"]); | Fields | Extracts arbitrary fields provided as an array of output keys with optional descriptions; also supports nested fields. |
These extractors are provided out of the box and offer a convenient way to extract specific types of structured data from text. You can use the shortened methods to easily access the functionality of each extractor.
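For instance, running the Contacts extractor over an email signature might look like this (the import path for the Contacts class is an assumption based on the package's namespace layout; adjust it to the actual class location):

use HelgeSverre\Extractor\Facades\Extractor;
use HelgeSverre\Extractor\Extraction\Builtins\Contacts; // assumed path

$text = "Jane Doe, CTO, jane@example.com, +47 123 45 678";

$contacts = Extractor::extract(Contacts::class, $text);
// $contacts holds the extracted name/title/email/phone entries.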
The Field Extractor is great if you don't need much custom logic or validation and just want to pull structured data out of a piece of text.
Here is an example of extracting information from a CV. Note that you can provide a description to guide the AI model, and that nested items are supported (useful for lists of sub-items such as work history, line items, or comments on a product):
use HelgeSverre\Extractor\Engine; // path assumed; provides the model-name constants
use HelgeSverre\Extractor\Facades\Extractor;
use HelgeSverre\Extractor\Facades\Text;

$sample = Text::pdf(file_get_contents(__DIR__.'/../samples/helge-cv.pdf'));
$data = Extractor::fields($sample,
fields: [
'name' => 'the name of the candidate',
'email',
'certifications' => 'list of certifications, if any',
'workHistory' => [
'companyName',
'from' => 'Y-m-d if available, Year only if not, null if missing',
'to' => 'Y-m-d if available, Year only if not, null if missing',
'text',
],
],
model: Engine::GPT_3_TURBO_1106,
);
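The call returns the extracted structure as a plain array (or a spatie/data object, as covered later), so the result can be consumed like any other nested array. A minimal sketch, assuming the keys come back exactly as requested above:

// $data is the array returned by Extractor::fields() above.
echo $data['name'];

foreach ($data['workHistory'] ?? [] as $job) {
    echo "{$job['companyName']} ({$job['from']} to {$job['to']})\n";
}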
Note: This feature is still WIP.
The Extractor package also integrates with OpenAI's new Vision API, leveraging the powerful gpt-4-vision-preview model to extract structured data from images. This feature enables you to analyze and interpret visual content with ease, whether it's reading text from images, extracting data from charts, or understanding complex visual scenarios.
To use the Vision features in Extractor, you need to provide an image as input. This can be done in a few different ways:
- Using a File Path: Load an image from a file path.
- Using Raw Image Data: Use the raw data of an image, for example, from an uploaded file.
- Using an Image URL: Load an image directly from a URL.
Here's how you can use each method:
use HelgeSverre\Extractor\Text\ImageContent;
$imagePath = __DIR__ . '/../samples/sample-image.jpg';
$imageContent = ImageContent::file($imagePath);
use HelgeSverre\Extractor\Text\ImageContent;
$rawImageData = file_get_contents(__DIR__ . '/../samples/sample-image.jpg');
$imageContent = ImageContent::raw($rawImageData);
use HelgeSverre\Extractor\Text\ImageContent;
$imageUrl = 'https://example.com/sample-image.jpg';
$imageContent = ImageContent::url($imageUrl);
After preparing your ImageContent object, you can pass it to the Extractor::fields method to extract structured data using OpenAI's Vision API. For example:
use HelgeSverre\Extractor\Engine; // path assumed; provides the model-name constants
use HelgeSverre\Extractor\Facades\Extractor;
use HelgeSverre\Extractor\Text\ImageContent;
$imageContent = ImageContent::file(__DIR__ . '/../samples/product-catalog.jpg');
$data = Extractor::fields(
$imageContent,
fields: [
'productName',
'price',
'description',
],
model: Engine::GPT_4_VISION,
);
Custom extractors in Extractor allow for tailored data extraction to meet specific needs. Here's how you can create and use a custom extractor, using a Job Posting Extractor as the example.
Create a new class for your custom extractor by extending the Extractor class. In this example, we'll create a JobPostingExtractor to extract key information from job postings:
<?php
namespace App\Extractors;
use HelgeSverre\Extractor\Extraction\Extractor;
use HelgeSverre\Extractor\Text\TextContent;
class JobPostingExtractor extends Extractor
{
public function prompt(string|TextContent $input): string
{
$outputKey = $this->expectedOutputKey();
return "Extract the following fields from the job posting below:"
. "\n- jobTitle: The title or designation of the job."
. "\n- companyName: The name of the company or organization posting the job."
. "\n- location: The geographical location or workplace where the job is based."
. "\n- jobType: The nature of employment (e.g., Full-time, Part-time, Contract)."
. "\n- description: A brief summary or detailed description of the job."
. "\n- applicationDeadline: The closing date for applications, if specified."
. "\n\nThe output should be a JSON object under the key '{$outputKey}'."
. "\n\nINPUT STARTS HERE\n\n$input\n\nOUTPUT IN JSON:\n";
}
public function expectedOutputKey(): string
{
return 'extractedData';
}
}
Note: Adding an instruction about which key to nest the data under is recommended, as the JSON-mode response from OpenAI tends to put everything under a root key. Overriding the expectedOutputKey() method tells the base Extractor class which key to pull the data from.
After defining your custom extractor, register it with the main Extractor class using the extend method:
use HelgeSverre\Extractor\Extractor;
Extractor::extend("job-posting", fn() => new JobPostingExtractor());
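A natural place to register the extractor is the boot method of a service provider, so the alias is available throughout the application. A minimal sketch, assuming a standard Laravel app structure:

<?php

namespace App\Providers;

use App\Extractors\JobPostingExtractor;
use HelgeSverre\Extractor\Extractor;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        // Register the custom extractor under the "job-posting" alias.
        Extractor::extend('job-posting', fn () => new JobPostingExtractor());
    }
}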
Once registered, you can use your custom extractor just like the built-in ones. Here's an example of how to use the JobPostingExtractor:
use HelgeSverre\Extractor\Facades\Text;
use HelgeSverre\Extractor\Extractor;
$jobPostingContent = Text::web("https://www.finn.no/job/fulltime/ad.html?finnkode=329443482");
$extractedData = Extractor::extract('job-posting', $jobPostingContent);
// Or you can specify the class-string instead
// ex: Extractor::extract(JobPostingExtractor::class, $jobPostingContent);
// $extractedData now contains structured information from the job posting
With the JobPostingExtractor, you can efficiently parse and extract key information from job postings, structuring it in a way that's easy to manage and use within your Laravel application.
To ensure the integrity of the extracted data, you can add validation rules to your Job Posting Extractor. This is done by using the HasValidation trait and defining validation rules in the rules method:
<?php
namespace App\Extractors;
use HelgeSverre\Extractor\Extraction\Concerns\HasValidation;
use HelgeSverre\Extractor\Extraction\Extractor;
class JobPostingExtractor extends Extractor
{
use HasValidation;
public function rules(): array
{
return [
'jobTitle' => ['required', 'string'],
'companyName' => ['required', 'string'],
'location' => ['required', 'string'],
'jobType' => ['required', 'string'],
'salary' => ['required', 'numeric'],
'description' => ['required', 'string'],
'applicationDeadline' => ['required', 'date']
];
}
}
This will ensure that each key field in the job posting data meets the specified criteria, enhancing the reliability of your data extraction.
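The package does not document what happens when validation fails, but since these are standard Laravel validation rules, a reasonable assumption is that failures surface as Illuminate\Validation\ValidationException. A hedged sketch of guarding against that (the exception type is an assumption, not confirmed by the package):

use HelgeSverre\Extractor\Facades\Extractor;
use Illuminate\Validation\ValidationException;

try {
    $data = Extractor::extract('job-posting', $jobPostingContent);
} catch (ValidationException $e) {
    // Assumption: HasValidation delegates to Laravel's validator,
    // so the usual errors() bag is available on failure.
    logger()->warning('Job posting failed validation', $e->errors());
}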
Extractor can integrate with spatie/data to cast the extracted data into a Data Transfer Object (DTO) of your choosing. To do this, add the HasDto trait to your extractor and specify the DTO class in the dataClass method:
<?php
namespace App\Extractors;
use DateTime;
use App\Extractors\JobPostingDto;
use HelgeSverre\Extractor\Extraction\Concerns\HasDto;
use HelgeSverre\Extractor\Extraction\Extractor;
use Spatie\LaravelData\Data;
class JobPostingDto extends Data
{
public function __construct(
public string $jobTitle,
public string $companyName,
public string $location,
public string $jobType,
public int|float $salary,
public string $description,
public DateTime $applicationDeadline
) {
}
}
class JobPostingExtractor extends Extractor
{
use HasDto;
public function dataClass(): string
{
return JobPostingDto::class;
}
public function isCollection(): bool
{
return false;
}
}
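With the trait in place, the extraction result can be consumed as a typed object rather than an array. A minimal sketch, assuming extraction succeeds and isCollection() is false, so a single JobPostingDto comes back (the URL is a placeholder):

use App\Extractors\JobPostingExtractor;
use HelgeSverre\Extractor\Facades\Extractor;
use HelgeSverre\Extractor\Facades\Text;

$content = Text::web('https://example.com/job-posting');

/** @var \App\Extractors\JobPostingDto $job */
$job = Extractor::extract(JobPostingExtractor::class, $content);

echo $job->jobTitle;
echo $job->applicationDeadline->format('Y-m-d');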
To use AWS Textract for extracting text from large images and multi-page PDFs, the package needs to upload the file to S3 and pass the S3 object location along to the Textract service.
You therefore need to configure your AWS credentials (read via config/extractor.php) in your .env file as follows:
TEXTRACT_KEY="your-aws-access-key"
TEXTRACT_SECRET="your-aws-secret-key"
TEXTRACT_REGION="your-textract-region"
# Can be omitted
TEXTRACT_VERSION="2018-06-27"
You also need to configure a separate Textract disk where the files will be stored. Open your config/filesystems.php configuration file and add the following:
'textract' => [
'driver' => 's3',
'key' => env('TEXTRACT_KEY'),
'secret' => env('TEXTRACT_SECRET'),
'region' => env('TEXTRACT_REGION'),
'bucket' => env('TEXTRACT_BUCKET'),
],
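The disk reads its bucket name from TEXTRACT_BUCKET, so add that to your .env alongside the credentials (the value below is a placeholder):

TEXTRACT_BUCKET="your-textract-bucket-name"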
Ensure the textract_disk setting in config/extractor.php matches the disk name in your filesystems.php config; you can change it with the .env value TEXTRACT_DISK.
return [
"textract_disk" => env("TEXTRACT_DISK")
];
.env
TEXTRACT_DISK="uploads"
You can configure a lifecycle rule on your S3 bucket to delete the files after a certain amount of time, see the AWS docs for more info:
https://repost.aws/knowledge-center/s3-empty-bucket-lifecycle-rule
By default, the package will NOT delete the files that have been uploaded to the Textract S3 bucket. If you want to delete these files, you can implement this using the TextractUsingS3Upload::cleanupFileUsing(Closure) hook.
// Delete the file from the S3 bucket once Textract has processed it.
TextractUsingS3Upload::cleanupFileUsing(function (string $filePath) {
    Storage::disk('textract')->delete($filePath);
});
Note: Textract is not available in all regions:
Q: In which AWS regions is Amazon Textract available? Amazon Textract is currently available in the US East (Northern Virginia), US East (Ohio), US West (Oregon), US West (N. California), AWS GovCloud (US-West), AWS GovCloud (US-East), Canada (Central), EU (Ireland), EU (London), EU (Frankfurt), EU (Paris), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Seoul), and Asia Pacific (Mumbai) Regions.
See: https://aws.amazon.com/textract/faqs/
$input (TextContent|string)
The input text or data to be processed. It accepts either a TextContent object or a string.
$model (string)
Specifies the OpenAI model used for the extraction process, as a string value. Different models have different speed/accuracy characteristics and use cases; for convenience, most of the accepted models are provided as constants on the Engine class.
Available Models:
| Model Identifier | Model | Note |
|---|---|---|
| Engine::GPT_4_1106_PREVIEW | 'gpt-4-1106-preview' | GPT-4 Turbo, featuring improved instruction following, JSON mode, reproducible outputs, and parallel function calling. Maximum 4,096 output tokens. Preview model, not yet suited for production traffic. |
| Engine::GPT_3_TURBO_1106 | 'gpt-3.5-turbo-1106' | Updated GPT-3.5 Turbo, with improvements similar to GPT-4 Turbo. Returns up to 4,096 output tokens. |
| Engine::GPT_4 | 'gpt-4' | Large multimodal model, capable of solving complex problems with greater accuracy. Suited for both chat and traditional completion tasks. |
| Engine::GPT4_32K | 'gpt-4-32k' | Extended version of GPT-4 with a larger context window of 32,768 tokens. |
| Engine::GPT_3_TURBO_INSTRUCT | 'gpt-3.5-turbo-instruct' | Similar to text-davinci-003, optimized for the legacy Completions endpoint, not for Chat Completions. |
| Engine::GPT_3_TURBO_16K | 'gpt-3.5-turbo-16k' | Extended version of GPT-3.5 Turbo with a larger context window of 16,385 tokens. |
| Engine::GPT_3_TURBO | 'gpt-3.5-turbo' | Optimized for chat using the Chat Completions API; also suitable for traditional completion tasks. |
| Engine::TEXT_DAVINCI_003 | 'text-davinci-003' | Legacy model with better quality and consistency for language tasks. To be deprecated on Jan 4, 2024. |
| Engine::TEXT_DAVINCI_002 | 'text-davinci-002' | Similar to text-davinci-003 but trained with supervised fine-tuning. To be deprecated on Jan 4, 2024. |
$maxTokens (int)
The maximum number of tokens that the model will process. The default value is 2000; adjusting it may be necessary for very long text, but 2000 is usually sufficient.
$temperature (float)
Controls the randomness/creativity of the model's output. A higher value (e.g., 0.8) makes the output more random, which is usually not desired in this context. A recommended value is 0.1 or 0.2; anything over 0.5 tends to be less useful. The default is 0.1.
This package is licensed under the MIT License. For more details, refer to the License File.
Similar Open Source Tools
monacopilot
Monacopilot is a powerful and customizable AI auto-completion plugin for the Monaco Editor. It supports multiple AI providers such as Anthropic, OpenAI, Groq, and Google, providing real-time code completions with an efficient caching system. The plugin offers context-aware suggestions, customizable completion behavior, and framework agnostic features. Users can also customize the model support and trigger completions manually. Monacopilot is designed to enhance coding productivity by providing accurate and contextually appropriate completions in daily spoken language.
receipt-scanner
The receipt-scanner repository is an AI-Powered Receipt and Invoice Scanner for Laravel that allows users to easily extract structured receipt data from images, PDFs, and emails within their Laravel application using OpenAI. It provides a light wrapper around OpenAI Chat and Completion endpoints, supports various input formats, and integrates with Textract for OCR functionality. Users can install the package via composer, publish configuration files, and use it to extract data from plain text, PDFs, images, Word documents, and web content. The scanned receipt data is parsed into a DTO structure with main classes like Receipt, Merchant, and LineItem.
llm-client
LLMClient is a JavaScript/TypeScript library that simplifies working with large language models (LLMs) by providing an easy-to-use interface for building and composing efficient prompts using prompt signatures. These signatures enable the automatic generation of typed prompts, allowing developers to leverage advanced capabilities like reasoning, function calling, RAG, ReAcT, and Chain of Thought. The library supports various LLMs and vector databases, making it a versatile tool for a wide range of applications.
LLMDebugger
This repository contains the code and dataset for LDB, a novel debugging framework that enables Large Language Models (LLMs) to refine their generated programs by tracking the values of intermediate variables throughout the runtime execution. LDB segments programs into basic blocks, allowing LLMs to concentrate on simpler code units, verify correctness block by block, and pinpoint errors efficiently. The tool provides APIs for debugging and generating code with debugging messages, mimicking how human developers debug programs.
syncode
SynCode is a novel framework for the grammar-guided generation of Large Language Models (LLMs) that ensures syntactically valid output with respect to defined Context-Free Grammar (CFG) rules. It supports general-purpose programming languages like Python, Go, SQL, JSON, and more, allowing users to define custom grammars using EBNF syntax. The tool compares favorably to other constrained decoders and offers features like fast grammar-guided generation, compatibility with HuggingFace Language Models, and the ability to work with various decoding strategies.
auto-playwright
Auto Playwright is a tool that allows users to run Playwright tests using AI. It eliminates the need for selectors by determining actions at runtime based on plain-text instructions. Users can automate complex scenarios, write tests concurrently with or before functionality development, and benefit from rapid test creation. The tool supports various Playwright actions and offers additional options for debugging and customization. It uses HTML sanitization to reduce costs and improve text quality when interacting with the OpenAI API.
magentic
Easily integrate Large Language Models into your Python code. Simply use the `@prompt` and `@chatprompt` decorators to create functions that return structured output from the LLM. Mix LLM queries and function calling with regular Python code to create complex logic.
mistral-inference
Mistral Inference repository contains minimal code to run 7B, 8x7B, and 8x22B models. It provides model download links, installation instructions, and usage guidelines for running models via CLI or Python. The repository also includes information on guardrailing, model platforms, deployment, and references. Users can interact with models through commands like mistral-demo, mistral-chat, and mistral-common. Mistral AI models support function calling and chat interactions for tasks like testing models, chatting with models, and using Codestral as a coding assistant. The repository offers detailed documentation and links to blogs for further information.
aiohttp-session
aiohttp_session is a Python library that provides session management for aiohttp.web applications. It allows storing user-specific data in session objects with a dict-like interface. The library offers different session storage options, including SimpleCookieStorage for testing, EncryptedCookieStorage for secure data storage, and RedisStorage for storing data in Redis. Users can easily integrate session management into their aiohttp.web applications by registering the session middleware. The library is designed to simplify session handling and enhance the security of web applications.
litserve
LitServe is a high-throughput serving engine for deploying AI models at scale. It generates an API endpoint for a model, handles batching, streaming, autoscaling across CPU/GPUs, and more. Built for enterprise scale, it supports every framework like PyTorch, JAX, Tensorflow, and more. LitServe is designed to let users focus on model performance, not the serving boilerplate. It is like PyTorch Lightning for model serving but with broader framework support and scalability.
langserve
LangServe helps developers deploy `LangChain` runnables and chains as a REST API. This library is integrated with FastAPI and uses pydantic for data validation. In addition, it provides a client that can be used to call into runnables deployed on a server. A JavaScript client is available in LangChain.js.
python-tgpt
Python-tgpt is a Python package that enables seamless interaction with over 45 free LLM providers without requiring an API key. It also provides image generation capabilities. The name _python-tgpt_ draws inspiration from its parent project tgpt, which operates on Golang. Through this Python adaptation, users can effortlessly engage with a number of free LLMs available, fostering a smoother AI interaction experience.
datadreamer
DataDreamer is an advanced toolkit designed to facilitate the development of edge AI models by enabling synthetic data generation, knowledge extraction from pre-trained models, and creation of efficient and potent models. It eliminates the need for extensive datasets by generating synthetic datasets, leverages latent knowledge from pre-trained models, and focuses on creating compact models suitable for integration into any device and performance for specialized tasks. The toolkit offers features like prompt generation, image generation, dataset annotation, and tools for training small-scale neural networks for edge deployment. It provides hardware requirements, usage instructions, available models, and limitations to consider while using the library.
detoxify
Detoxify is a library that provides trained models and code to predict toxic comments on 3 Jigsaw challenges: Toxic comment classification, Unintended Bias in Toxic comments, Multilingual toxic comment classification. It includes models like 'original', 'unbiased', and 'multilingual' trained on different datasets to detect toxicity and minimize bias. The library aims to help in stopping harmful content online by interpreting visual content in context. Users can fine-tune the models on carefully constructed datasets for research purposes or to aid content moderators in flagging out harmful content quicker. The library is built to be user-friendly and straightforward to use.
can-ai-code
Can AI Code is a self-evaluating interview tool for AI coding models. It includes interview questions written by humans and tests taken by AI, inference scripts for common API providers and CUDA-enabled quantization runtimes, a Docker-based sandbox environment for validating untrusted Python and NodeJS code, and the ability to evaluate the impact of prompting techniques and sampling parameters on large language model (LLM) coding performance. Users can also assess LLM coding performance degradation due to quantization. The tool provides test suites for evaluating LLM coding performance, a webapp for exploring results, and comparison scripts for evaluations. It supports multiple interviewers for API and CUDA runtimes, with detailed instructions on running the tool in different environments. The repository structure includes folders for interviews, prompts, parameters, evaluation scripts, comparison scripts, and more.
For similar tasks
NeMo-Guardrails
NeMo Guardrails is an open-source toolkit for easily adding _programmable guardrails_ to LLM-based conversational applications. Guardrails (or "rails" for short) are specific ways of controlling the output of a large language model, such as not talking about politics, responding in a particular way to specific user requests, following a predefined dialog path, using a particular language style, extracting structured data, and more.
kor
Kor is a prototype tool designed to help users extract structured data from text using Language Models (LLMs). It generates prompts, sends them to specified LLMs, and parses the output. The tool works with the parsing approach and is integrated with the LangChain framework. Kor is compatible with pydantic v2 and v1, and schema is typed checked using pydantic. It is primarily used for extracting information from text based on provided reference examples and schema documentation. Kor is designed to work with all good-enough LLMs regardless of their support for function/tool calling or JSON modes.
awesome-llm-json
This repository is an awesome list dedicated to resources for using Large Language Models (LLMs) to generate JSON or other structured outputs. It includes terminology explanations, hosted and local models, Python libraries, blog articles, videos, Jupyter notebooks, and leaderboards related to LLMs and JSON generation. The repository covers various aspects such as function calling, JSON mode, guided generation, and tool usage with different providers and models.
tensorzero
TensorZero is an open-source platform that helps LLM applications graduate from API wrappers into defensible AI products. It enables a data & learning flywheel for LLMs by unifying inference, observability, optimization, and experimentation. The platform includes a high-performance model gateway, structured schema-based inference, observability, experimentation, and data warehouse for analytics. TensorZero Recipes optimize prompts and models, and the platform supports experimentation features and GitOps orchestration for deployment.
stagehand
Stagehand is an AI web browsing framework that simplifies and extends web automation using three simple APIs: act, extract, and observe. It aims to provide a lightweight, configurable framework without complex abstractions, allowing users to automate web tasks reliably. The tool generates Playwright code based on atomic instructions provided by the user, enabling natural language-driven web automation. Stagehand is open source, maintained by the Browserbase team, and supports different models and model providers for flexibility in automation tasks.
terraform-genai-doc-summarization
This solution showcases how to summarize a large corpus of documents using Generative AI. It provides an end-to-end demonstration of document summarization going all the way from raw documents, detecting text in the documents and summarizing the documents on-demand using Vertex AI LLM APIs, Cloud Vision Optical Character Recognition (OCR) and BigQuery.
chatwise-releases
ChatWise is an offline tool that supports various AI models such as OpenAI, Anthropic, Google AI, Groq, and Ollama. It is multi-modal, allowing text-to-speech powered by OpenAI and ElevenLabs. The tool supports text files, PDFs, audio, and images across different models. ChatWise is currently available for macOS (Apple Silicon & Intel) with Windows support coming soon.
For similar jobs
book
Podwise is an AI knowledge management app designed specifically for podcast listeners. With the Podwise platform, you only need to follow your favorite podcasts, such as "Hardcore Hackers". When a program is released, Podwise will use AI to transcribe, extract, summarize, and analyze the podcast content, helping you to break down the hard-core podcast knowledge. At the same time, it is connected to platforms such as Notion, Obsidian, Logseq, and Readwise, embedded in your knowledge management workflow, and integrated with content from other channels including news, newsletters, and blogs, helping you to improve your second brain 🧠.
Scrapegraph-ai
ScrapeGraphAI is a Python library that uses Large Language Models (LLMs) and direct graph logic to create web scraping pipelines for websites, documents, and XML files. It allows users to extract specific information from web pages by providing a prompt describing the desired data. ScrapeGraphAI supports various LLMs, including Ollama, OpenAI, Gemini, and Docker, enabling users to choose the most suitable model for their needs. The library provides a user-friendly interface through its `SmartScraper` class, which simplifies the process of building and executing scraping pipelines. ScrapeGraphAI is open-source and available on GitHub, with extensive documentation and examples to guide users. It is particularly useful for researchers and data scientists who need to extract structured data from web pages for analysis and exploration.
databerry
Chaindesk is a no-code platform that allows users to easily set up a semantic search system for personal data without technical knowledge. It supports loading data from various sources such as raw text, web pages, files (Word, Excel, PowerPoint, PDF, Markdown, Plain Text), and upcoming support for web sites, Notion, and Airtable. The platform offers a user-friendly interface for managing datastores, querying data via a secure API endpoint, and auto-generating ChatGPT Plugins for each datastore. Chaindesk utilizes a Vector Database (Qdrant), Openai's text-embedding-ada-002 for embeddings, and has a chunk size of 1024 tokens. The technology stack includes Next.js, Joy UI, LangchainJS, PostgreSQL, Prisma, and Qdrant, inspired by the ChatGPT Retrieval Plugin.
auto-news
Auto-News is an automatic news aggregator tool that utilizes Large Language Models (LLM) to pull information from various sources such as Tweets, RSS feeds, YouTube videos, web articles, Reddit, and journal notes. The tool aims to help users efficiently read and filter content based on personal interests, providing a unified reading experience and organizing information effectively. It features feed aggregation with summarization, transcript generation for videos and articles, noise reduction, task organization, and deep dive topic exploration. The tool supports multiple LLM backends, offers weekly top-k aggregations, and can be deployed on Linux/MacOS using docker-compose or Kubernetes.
SemanticFinder
SemanticFinder is a frontend-only live semantic search tool that calculates embeddings and cosine similarity client-side using transformers.js and SOTA embedding models from Huggingface. It allows users to search through large texts like books with pre-indexed examples, customize search parameters, and offers data privacy by keeping input text in the browser. The tool can be used for basic search tasks, analyzing texts for recurring themes, and has potential integrations with various applications like wikis, chat apps, and personal history search. It also provides options for building browser extensions and future ideas for further enhancements and integrations.
1filellm
1filellm is a command-line data aggregation tool designed for LLM ingestion. It aggregates and preprocesses data from various sources into a single text file, facilitating the creation of information-dense prompts for large language models. The tool supports automatic source type detection, handling of multiple file formats, web crawling functionality, integration with Sci-Hub for research paper downloads, text preprocessing, and token count reporting. Users can input local files, directories, GitHub repositories, pull requests, issues, ArXiv papers, YouTube transcripts, web pages, Sci-Hub papers via DOI or PMID. The tool provides uncompressed and compressed text outputs, with the uncompressed text automatically copied to the clipboard for easy pasting into LLMs.
Agently-Daily-News-Collector
Agently Daily News Collector is an open-source project showcasing a workflow powered by the Agently AI application development framework. It allows users to generate news collections on various topics by inputting the field topic. The AI agents automatically perform the necessary tasks to generate a high-quality news collection saved in a markdown file. Users can edit settings in the YAML file, install Python and required packages, input their topic idea, and wait for the news collection to be generated. The process involves tasks like outlining, searching, summarizing, and preparing column data. The project dependencies include the Agently AI Development Framework, duckduckgo-search, BeautifulSoup4, and PyYAML.