spider

Efficient Web Crawler and Scraper with AI Integration in Rust

Stars: 946
Spider is a high-performance web crawler and indexer designed to handle data curation workloads efficiently. It offers features such as concurrency, streaming, decentralization, headless Chrome rendering, HTTP proxies, cron jobs, subscriptions, smart mode, blacklisting, whitelisting, depth budgeting, dynamic AI prompt scripting, CSS scraping, and more. Users can get started quickly with the Spider Cloud hosted service or set up a local installation with spider-cli. The tool also integrates with Node.js and Python for additional flexibility. With its focus on speed and scalability, Spider is well suited to extracting and organizing data from the web.

README:

Spider


Website | Guides | API Docs | Chat

The fastest web crawler and indexer. Foundational building blocks for data curation workloads.

  • Concurrent
  • Streaming
  • Decentralization
  • Headless Chrome Rendering
  • HTTP Proxies
  • Cron Jobs
  • Subscriptions
  • Smart Mode
  • Blacklisting, Whitelisting, and Budgeting Depth
  • Dynamic AI Prompt Scripting Headless with Step Caching
  • CSS Scraping with spider_utils
  • Changelog

Getting Started

The simplest way to get started is to use the Spider Cloud hosted service. View the spider or spider_cli directory for local installations. You can also use spider with Node.js using spider-nodejs and Python using spider-py.
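For local use with the Rust crate, a minimal crawl can be sketched as follows. This assumes the `Website` API exposed by the spider crate (check docs.rs for the current version and feature flags); the target URL is illustrative.

```rust
// Cargo.toml (version is illustrative; check crates.io for the latest):
// [dependencies]
// spider = "2"

use spider::tokio;
use spider::website::Website;

#[tokio::main]
async fn main() {
    // Configure a crawl rooted at the target site.
    let mut website = Website::new("https://example.com");

    // Crawl concurrently, collecting every visited URL.
    website.crawl().await;

    // Print the links discovered during the crawl.
    for link in website.get_links() {
        println!("{}", link.as_ref());
    }
}
```

Features such as streaming subscriptions, headless Chrome rendering, and budgeting are enabled through crate feature flags and builder-style configuration on `Website`; see the API docs for details.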

Benchmarks

See BENCHMARKS.

Examples

See EXAMPLES.

License

This project is licensed under the MIT license.

Contributing

See CONTRIBUTING.
