llm-data-scrapers

A list of useful Open Source tools and scrapers to gather data for LLMs

Stars: 90

LLM Data Scrapers is a collection of open source tools and scrapers for gathering data for Large Language Models (LLMs). It lists gitingest for extracting prompt-friendly text from codebases, repomix for packing a repository into a single AI-friendly file, llm-scraper for turning webpages into structured data, crawl4ai for web crawling and scraping, and firecrawl for turning websites into LLM-ready markdown or structured data. It also lists llmstxt-generator for generating consolidated llms.txt files from websites, trafilatura for gathering web text and metadata, RepoToTextForLLMs for fetching repo content, marker for converting PDFs to markdown or JSON, reader for converting URLs to LLM-friendly inputs, and files-to-prompt for concatenating a directory of files into a single prompt.

README:

LLM Data Scrapers 🚀

A list of useful Open Source tools and scrapers to gather data for LLMs:

Name: Description
gitingest: Replace "hub" with "ingest" in any GitHub URL to get a prompt-friendly extract of a codebase
repomix: Packs your entire repository into a single, AI-friendly file
llm-scraper: Turn any webpage into structured data using LLMs
crawl4ai: LLM-friendly web crawler and scraper
firecrawl: API to turn websites into LLM-ready markdown or structured data; can be self-hosted
llmstxt-generator: Generate consolidated llms.txt files from websites for LLM training and inference
trafilatura: Python and command-line tool to gather text and metadata on the web
RepoToTextForLLMs: Simple Python script to fetch repo content
marker: Convert PDF to markdown or JSON quickly
reader: Convert any URL to an LLM-friendly input by adding the prefix https://r.jina.ai/
files-to-prompt: Concatenate a directory full of files into a single prompt for use with LLMs
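
The gitingest and reader entries in the list above both work by rewriting a URL, so they are easy to script. The following is a minimal sketch, not part of either project: the function names `to_gitingest_url` and `to_reader_url` are hypothetical, and the code only builds the rewritten URL strings as the list describes them. It does not fetch anything, and it assumes the input is a plain `github.com` URL for gitingest.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gitingest_url(github_url: str) -> str:
    """Replace "hub" with "ingest" in the host of a github.com URL.

    Hypothetical helper: it only rewrites the string, per the gitingest
    entry in the list, and rejects anything that is not github.com.
    """
    parts = urlsplit(github_url)
    if parts.netloc.lower() != "github.com":
        raise ValueError(f"not a github.com URL: {github_url}")
    # github.com -> gitingest.com; path, query, and fragment stay unchanged
    return urlunsplit(parts._replace(netloc="gitingest.com"))


def to_reader_url(url: str) -> str:
    """Prepend the reader prefix from the list to any URL (hypothetical helper)."""
    return "https://r.jina.ai/" + url


if __name__ == "__main__":
    print(to_gitingest_url("https://github.com/owner/repo"))
    print(to_reader_url("https://example.com/page"))
```

The resulting URLs can be opened in a browser or passed to any HTTP client. Rate limits, authentication, and response formats are up to each service and are not covered here.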
