
πŸ₯· ShadowCrawl MCP β€” v3.1.0

ShadowCrawl Logo

Search Smarter. Scrape Anything. Block Nothing.

The God-Tier Intelligence Engine for AI Agents

The Sovereign, Self-Hosted Alternative to Firecrawl, Jina, and Tavily.

License: MIT Rust MCP CI


ShadowCrawl is not just a scraper or a search wrapper β€” it is a complete intelligence layer purpose-built for AI Agents. ShadowCrawl ships a native Rust meta-search engine running inside the same binary. Zero extra containers. Parallel engines. LLM-grade clean output.

When every other tool gets blocked, ShadowCrawl doesn't retreat β€” it escalates: native engines β†’ native Chromium CDP headless β†’ Human-In-The-Loop (HITL) nuclear option. You always get results.


⚑ God-Tier Internal Meta-Search (v3.0.0)

ShadowCrawl v3.0.0 ships a 100% Rust-native metasearch engine that queries 4 engines in parallel and fuses results intelligently:

| Engine | Coverage | Notes |
|--------|----------|-------|
| πŸ”΅ DuckDuckGo | General Web | HTML scrape, no API key needed |
| 🟒 Bing | General + News | Best for current events |
| πŸ”΄ Google | Authoritative Results | High-relevance, deduped |
| 🟠 Brave Search | Privacy-Focused | Independent index, low overlap |

🧠 What makes it God-Tier?

Parallel Concurrency β€” All 4 engines fire simultaneously. Total latency = slowest engine, not sum of all.
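
As a toy illustration of that fan-out (assuming the tokio crate with its full feature set; the engine function and timings are stand-ins, not ShadowCrawl's actual API):

```rust
use std::time::Instant;
use tokio::time::{sleep, Duration};

// Stand-in for one engine query; the real call would be an HTTP round trip.
async fn engine(name: &'static str, ms: u64) -> Vec<String> {
    sleep(Duration::from_millis(ms)).await;
    vec![format!("{name}: result")]
}

#[tokio::main]
async fn main() {
    let start = Instant::now();
    // join! drives all four futures concurrently, so total latency tracks the
    // slowest engine (~400 ms here), not the sum of all four (~1000 ms).
    let (ddg, bing, google, brave) = tokio::join!(
        engine("duckduckgo", 150),
        engine("bing", 250),
        engine("google", 400),
        engine("brave", 200),
    );
    let total = ddg.len() + bing.len() + google.len() + brave.len();
    println!("{total} results in {:?}", start.elapsed());
}
```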

Smart Deduplication + Scoring β€” Cross-engine results are merged by URL fingerprint. Pages confirmed by 2+ engines receive a corroboration score boost. Domain authority weighting (docs, .gov, .edu, major outlets) pushes high-trust sources to the top.
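
The fusion step is internal to the binary; a minimal sketch of the idea, with a hypothetical SearchHit type and made-up boost constants:

```rust
use std::collections::HashMap;

// Hypothetical per-result type, for illustration only.
struct SearchHit {
    url: String,
    title: String,
    score: f32,           // per-engine relevance score
    engines: Vec<String>, // engines that returned this page
}

// Collapse scheme, "www.", query strings and trailing slashes so the same page
// found by different engines maps to one fingerprint.
fn fingerprint(url: &str) -> String {
    let mut u = url.trim().to_lowercase();
    for prefix in ["https://", "http://", "www."] {
        if let Some(rest) = u.strip_prefix(prefix) {
            u = rest.to_string();
        }
    }
    u.split('?').next().unwrap_or("").trim_end_matches('/').to_string()
}

fn merge_and_score(raw: Vec<(String, SearchHit)>) -> Vec<SearchHit> {
    let mut merged: HashMap<String, SearchHit> = HashMap::new();
    for (engine, hit) in raw {
        let key = fingerprint(&hit.url);
        // First engine to report a URL creates the entry; later engines only corroborate.
        let entry = merged
            .entry(key)
            .or_insert_with(|| SearchHit { engines: Vec::new(), ..hit });
        if !entry.engines.contains(&engine) {
            entry.engines.push(engine);
        }
    }
    let mut out: Vec<SearchHit> = merged.into_values().collect();
    for hit in &mut out {
        // Corroboration boost: every extra engine that confirms the URL adds weight.
        hit.score += 0.15 * (hit.engines.len() as f32 - 1.0);
        // Crude domain-authority weighting: docs / .gov / .edu float upward.
        if hit.url.contains("docs.") || hit.url.contains(".gov/") || hit.url.contains(".edu/") {
            hit.score += 0.2;
        }
    }
    out.sort_by(|a, b| b.score.total_cmp(&a.score));
    out
}
```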

Ultra-Clean Output for LLMs β€” Clean fields and predictable structure:

  • published_at is parsed and stored as a clean ISO-8601 field (2025-07-23T00:00:00)
  • content / snippet is clean β€” zero date-prefix garbage
  • breadcrumbs extracted from URL path for navigation context
  • domain and source_type auto-classified (blog, docs, reddit, news, etc.)

Result: LLMs receive dense, token-efficient, structured data β€” not a wall of noisy text.
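
For orientation only, a hypothetical serde struct mirroring the fields listed above (assuming serde with its derive feature; the real JSON schema may differ):

```rust
use serde::{Deserialize, Serialize};

/// Assumed result shape for illustration; field names follow the bullets above.
#[derive(Debug, Serialize, Deserialize)]
struct SearchResult {
    title: String,
    url: String,
    domain: String,               // e.g. "docs.rs"
    source_type: String,          // "blog", "docs", "reddit", "news", ...
    breadcrumbs: Vec<String>,     // derived from the URL path
    published_at: Option<String>, // clean ISO-8601, e.g. "2025-07-23T00:00:00"
    snippet: String,              // clean text, no date-prefix garbage
    score: f32,                   // corroboration + domain-authority weight
}
```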

Unstoppable Fallback β€” If an engine returns a bot-challenge page (anomaly.js, Cloudflare, PerimeterX), it is automatically retried via the native Chromium CDP instance (headless Chrome, bundled in-binary). No manual intervention. No 0-result failures.

Quality > Quantity β€” ~20 deduplicated, scored results rather than 50 raw duplicates. For an AI agent with a limited context window, 20 high-quality results outperform 50 noisy ones every time.


Deep Research Engine (v3.1.0)

ShadowCrawl v3.1.0 ships a self-contained multi-hop research pipeline as a first-class MCP tool β€” no external infra, no key required for local LLMs.

How it works

  1. Query Expansion β€” expands your question into multiple targeted sub-queries (3 axes: core concept, comparison/alternatives, implementation specifics)
  2. Parallel Search + Scrape β€” fires all sub-queries across 4 search engines; auto-scrapes top results (configurable depth 1–3, up to 20 sources)
  3. Semantic Filtering β€” Model2Vec-powered relevance scoring keeps only on-topic content chunks
  4. LLM Synthesis β€” condenses all findings into a zero-fluff Markdown fact-sheet via any OpenAI-compatible API
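
Step 4 only needs an OpenAI-compatible /v1/chat/completions endpoint. A rough sketch of such a call (assuming the reqwest blocking client with the json feature plus serde_json; the function, prompt, and token limit are illustrative, not ShadowCrawl's internal code):

```rust
use serde_json::{json, Value};

fn synthesize(base_url: &str, api_key: Option<&str>, model: &str, findings: &str)
    -> Result<String, Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();
    let mut req = client
        .post(format!("{base_url}/chat/completions"))
        .json(&json!({
            "model": model,
            "max_tokens": 1024,
            "messages": [
                { "role": "system",
                  "content": "Condense the findings into a zero-fluff Markdown fact-sheet." },
                { "role": "user", "content": findings }
            ]
        }));
    // Key-less local endpoints (Ollama / LM Studio) simply skip the Authorization header.
    if let Some(key) = api_key {
        req = req.bearer_auth(key);
    }
    let body: Value = req.send()?.error_for_status()?.json()?;
    Ok(body["choices"][0]["message"]["content"]
        .as_str()
        .unwrap_or_default()
        .to_string())
}
```

Pointing base_url at http://localhost:11434/v1 with api_key = None corresponds to the Ollama row in the table below.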

LLM Backend Options

| Backend | llm_base_url | Key required |
|---------|--------------|--------------|
| OpenAI (default) | https://api.openai.com/v1 | Yes β€” OPENAI_API_KEY |
| Ollama (local) | http://localhost:11434/v1 | No |
| LM Studio (local) | http://localhost:1234/v1 | No |
| Any OpenAI-compatible proxy | custom URL | Optional |

Configuration (shadowcrawl.json)

Create shadowcrawl.json in the same directory as the binary (or repo root) to configure the engine β€” no rebuild needed. All fields are optional; env vars are used as fallback.

{
  "deep_research": {
    "enabled": true,
    "llm_base_url": "http://localhost:11434/v1",
    "llm_api_key": "",
    "llm_model": "llama3",
    "synthesis_enabled": true,
    "synthesis_max_sources": 8,
    "synthesis_max_chars_per_source": 2500,
    "synthesis_max_tokens": 1024
  }
}

Priority: shadowcrawl.json field β†’ env var fallback β†’ hardcoded default.
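
A minimal sketch of that precedence (resolve is a hypothetical helper, not part of the codebase; it assumes serde_json and the field / env-var names documented here):

```rust
use std::{env, fs};
use serde_json::Value;

/// Resolve one setting: shadowcrawl.json field -> env var -> hardcoded default.
fn resolve(cfg: &Value, json_path: &[&str], env_var: &str, default: &str) -> String {
    let mut node = cfg;
    for key in json_path {
        node = &node[*key]; // serde_json yields Null for missing keys instead of panicking
    }
    if let Some(s) = node.as_str().filter(|s| !s.is_empty()) {
        return s.to_string();
    }
    if let Ok(v) = env::var(env_var) {
        return v;
    }
    default.to_string()
}

fn main() {
    // Missing or unparsable shadowcrawl.json falls back to Null, i.e. env vars / defaults.
    let cfg: Value = fs::read_to_string("shadowcrawl.json")
        .ok()
        .and_then(|s| serde_json::from_str(&s).ok())
        .unwrap_or(Value::Null);
    let base_url = resolve(
        &cfg,
        &["deep_research", "llm_base_url"],
        "OPENAI_BASE_URL",
        "https://api.openai.com/v1",
    );
    println!("LLM endpoint: {base_url}");
}
```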

Build flags

# Full build (deep_research included by default)
cargo build --release

# Lean build β€” strip deep_research feature entirely
cargo build --release --no-default-features --features non_robot_search

The deep-research Cargo feature is on by default. Use --no-default-features for minimal deployments.


πŸ›  Full Feature Roster

| Feature | Details |
|---------|---------|
| Deep Research Engine | Multi-hop search + scrape + semantic filter + LLM synthesis (OpenAI / Ollama / LM Studio) |
| πŸ” God-Tier Meta-Search | Parallel Google / Bing / DDG / Brave Β· dedup Β· scoring Β· breadcrumbs Β· published_at |
| πŸ•· Universal Scraper | Rust-native + native Chromium CDP for JS-heavy and anti-bot sites |
| πŸ›‚ Human Auth (HITL) | human_auth_session: real browser + persistent cookies + instruction overlay + automatic re-injection. Fetch any protected URL. |
| 🧠 Semantic Memory | Embedded LanceDB + Model2Vec for long-term research recall (no DB container) |
| πŸ€– HITL Non-Robot Search | Visible Brave Browser + keyboard hooks for human CAPTCHA / login-wall bypass |
| 🌐 Deep Crawler | Recursive, bounded crawl to map entire subdomains |
| πŸ”’ Proxy Master | Native HTTP/SOCKS5 pool rotation with health checks |
| 🧽 Universal Janitor | Strips cookie banners, popups, skeleton screens β€” delivers clean Markdown |
| πŸ”₯ Hydration Extractor | Resolves React/Next.js hydration JSON (__NEXT_DATA__, embedded state) |
| πŸ›‘ Anti-Bot Arsenal | Stealth UA rotation, fingerprint spoofing, CDP automation, mobile profile emulation |
| πŸ“Š Structured Extract | CSS-selector + prompt-driven field extraction from any page |
| πŸ” Batch Scrape | Parallel scrape of N URLs with configurable concurrency |

πŸ— Zero-Bloat Architecture

ShadowCrawl is pure binary: a single Rust executable exposes MCP tools (stdio) and an optional HTTP server β€” no Docker, no sidecars.


πŸ’Ž The Nuclear Option: Human Auth Session (v3.0.0)

When standard automation fails (Cloudflare, CAPTCHA, complex logins), ShadowCrawl activates the human element.

πŸ›‚ human_auth_session β€” The "Unblocker"

This is our signature tool that surpasses all competitors. While most scrapers fail on login-walled content, human_auth_session opens a real, visible browser window for you to solve the challenge.

Once you click FINISH & RETURN, all authentication cookies are transparently captured and persisted in ~/.shadowcrawl/sessions/. Subsequent requests to the same domain automatically inject these cookies β€” making future fetches fully automated and effortless.

  • 🟒 Instruction Overlay β€” A native green banner guides the user on what to solve.
  • πŸͺ Persistent Sessions β€” Solve once, scrape forever. No need to log in manually again for weeks.
  • πŸ›‘ Security first β€” Cookies are stored locally and encrypted (optional/upcoming).
  • πŸš€ Auto-injection β€” Next web_fetch or web_crawl calls automatically load found sessions.

πŸ’₯ Boss-Level Anti-Bot Evidence

We don't claim β€” we show receipts. All captured with human_auth_session and our advanced CDP engines (2026-02-20):

| Target | Protection | Evidence | Extracted |
|--------|------------|----------|-----------|
| LinkedIn | Cloudflare + Auth | JSON Β· Snippet | 60+ job listings βœ… |
| Ticketmaster | Cloudflare Turnstile | JSON Β· Snippet | Tour dates & venues βœ… |
| Airbnb | DataDome | JSON Β· Snippet | 1,000+ Tokyo listings βœ… |
| Upwork | reCAPTCHA | JSON Β· Snippet | 160K+ job postings βœ… |
| Amazon | AWS Shield | JSON Β· Snippet | RTX 5070 Ti search results βœ… |
| nowsecure.nl | Cloudflare | JSON | Manual button verified βœ… |

πŸ“– Full analysis: proof/README.md


πŸ“¦ Quick Start

Option A β€” Download Prebuilt Binaries (Recommended)

Download the latest release assets from GitHub Releases. Prebuilt assets are published for windows-x64, windows-arm64, linux-x64, and linux-arm64. Run one of:

  β€’ shadowcrawl-mcp β€” MCP stdio server (recommended for VS Code / Cursor / Claude Desktop)
  β€’ shadowcrawl β€” HTTP server (default port 5000; override via --port, PORT, or SHADOWCRAWL_PORT)

Confirm the HTTP server is alive:

./shadowcrawl --port 5000
curl http://localhost:5000/health

πŸ§ͺ Build (Release, All Features)

Build all binaries with all optional features enabled:

cd mcp-server
cargo build --release --all-features

Option B β€” Build / Install from Source

git clone https://github.com/DevsHero/shadowcrawl.git
cd shadowcrawl

Build:

cd mcp-server
cargo build --release --features non_robot_search --bin shadowcrawl --bin shadowcrawl-mcp

Or install (puts binaries into your Cargo bin directory):

cargo install --path mcp-server --locked

Binaries land at:

  • target/release/shadowcrawl β€” HTTP server (default port 5000; override via --port, PORT, or SHADOWCRAWL_PORT)
  • target/release/shadowcrawl-mcp β€” MCP stdio server

Prerequisites for HITL:

  • Brave Browser (brave.com/download)
  • Accessibility permission (macOS: System Preferences β†’ Privacy & Security β†’ Accessibility)
  • A desktop session (not SSH-only)

Platform guides: docs/window_setup.md Β· docs/ubuntu_setup.md

After any binary rebuild/update, restart your MCP client session to pick up new tool definitions.


βœ… Agent Best Practices (ShadowCrawl Rules)

Use this exact decision flow to get the highest-quality results with minimal tokens:

  1. memory_search first (avoid re-fetching)
  2. web_search_json for initial research (search + content summaries in one call)
  3. web_fetch for specific URLs (docs/articles): use output_format="clean_json" for token-efficient output; set query + strict_relevance=true when you want only query-relevant paragraphs
  4. If web_fetch returns 403/429/rate-limit β†’ proxy_control grab then retry with use_proxy=true
  5. If web_fetch returns auth_risk_score >= 0.4 β†’ visual_scout (confirm login wall) β†’ human_auth_session (The God-Tier Nuclear Option)

Structured extraction (schema-first):

  • Prefer fetch_then_extract for one-shot fetch + extract.
  • strict=true (default) enforces schema shape: missing arrays become [], missing scalars become null (no schema drift).
  • Treat confidence=0.0 as β€œplaceholder / unrendered page” (often JS-only like crates.io). Escalate to browser rendering (CDP/HITL) instead of trusting the fields.
  • πŸ’‘ New in v3.0.0: Placeholder detection is now scalar-only. Pure-array schemas (only lists/structs) never trigger confidence=0.0, fixing prior regressions.

clean_json notes:

  • Large pages are truncated to respect max_chars (look for clean_json_truncated warning). Increase max_chars to see more.
  • key_code_blocks is extracted from fenced blocks and signature-like inline code; short docs pages are supported.
  • πŸ•· v3.0.0 fix: Module extraction on docs.rs works recursively for all relative and absolute sub-paths.

🧩 MCP Integration

ShadowCrawl exposes all tools via the Model Context Protocol (stdio transport).

VS Code / Copilot Chat

Add to your MCP config (~/.config/Code/User/mcp.json):

{
  "servers": {
    "shadowcrawl": {
      "type": "stdio",
      "command": "env",
      "args": [
        "RUST_LOG=info",
        "SEARCH_ENGINES=google,bing,duckduckgo,brave",
        "LANCEDB_URI=/YOUR_PATH/shadowcrawl/lancedb",
        "HTTP_TIMEOUT_SECS=30",
        "MAX_CONTENT_CHARS=10000",
        "IP_LIST_PATH=/YOUR_PATH/shadowcrawl/ip.txt",
        "PROXY_SOURCE_PATH=/YOUR_PATH/shadowcrawl/proxy_source.json",
        "/YOUR_PATH/shadowcrawl/mcp-server/target/release/shadowcrawl-mcp"
      ]
    }
  }
}

Cursor / Claude Desktop

Use the same stdio setup as VS Code (run shadowcrawl-mcp locally and pass env vars via env or your client’s env field).

πŸ“– Full multi-IDE guide: docs/IDE_SETUP.md


βš™οΈ Key Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| CHROME_EXECUTABLE | auto-detected | Override path to Chromium/Chrome/Brave binary |
| SEARCH_ENGINES | google,bing,duckduckgo,brave | Active search engines (comma-separated) |
| SEARCH_MAX_RESULTS_PER_ENGINE | 10 | Results per engine before merge |
| SEARCH_CDP_FALLBACK | true if browser found | Auto-retry blocked engines via native Chromium CDP (alias: SEARCH_BROWSERLESS_FALLBACK) |
| SEARCH_SIMULATE_BLOCK | β€” | Force blocked path for testing: duckduckgo,bing or all |
| LANCEDB_URI | β€” | Path for semantic research memory (optional) |
| SHADOWCRAWL_NEUROSIPHON | 1 (enabled) | Set to 0 / false / off to disable all NeuroSiphon techniques (import nuking, SPA extraction, semantic shaving, search reranking) |
| HTTP_TIMEOUT_SECS | 30 | Per-request timeout |
| OUTBOUND_LIMIT | 32 | Max concurrent outbound connections |
| MAX_CONTENT_CHARS | 10000 | Max chars per scraped document |
| IP_LIST_PATH | β€” | Path to proxy IP list |
| SCRAPE_DELAY_PRESET | polite | fast / polite / cautious |
| DEEP_RESEARCH_ENABLED | 1 (enabled) | Set 0 to disable the deep_research tool at runtime (without rebuild) |
| OPENAI_API_KEY | β€” | API key for LLM synthesis. Leave unset for key-less local endpoints (Ollama / LM Studio) |
| OPENAI_BASE_URL | https://api.openai.com/v1 | LLM endpoint. Override for Ollama (http://localhost:11434/v1) or LM Studio (http://localhost:1234/v1). Config: deep_research.llm_base_url |
| DEEP_RESEARCH_LLM_MODEL | gpt-4o-mini | Model name (e.g. llama3, mistral). Config: deep_research.llm_model |
| DEEP_RESEARCH_SYNTHESIS | 1 (enabled) | Set 0 to run search + scrape only (skip LLM step). Config: deep_research.synthesis_enabled |
| DEEP_RESEARCH_SYNTHESIS_MAX_SOURCES | 8 | Max source docs fed to LLM. Config: deep_research.synthesis_max_sources |
| DEEP_RESEARCH_SYNTHESIS_MAX_CHARS_PER_SOURCE | 2500 | Max chars per source. Config: deep_research.synthesis_max_chars_per_source |
| DEEP_RESEARCH_SYNTHESIS_MAX_TOKENS | 1024 | Max tokens in the LLM response. Tune per model: 512–1024 for small 4k-ctx models (e.g. lfm2-2.6b), 2048+ for large models. Config: deep_research.synthesis_max_tokens |

πŸ† Comparison

| Feature | Firecrawl / Jina / Tavily | ShadowCrawl v3.1.0 |
|---------|---------------------------|--------------------|
| Deep Research | None / paid add-on | Native: multi-hop + LLM synthesis (local or cloud) |
| Cost | $49–$499/mo | $0 β€” self-hosted |
| Privacy | They see your queries | 100% private, local-only |
| Search Engine | Proprietary / 3rd-party API | Native Rust (4 engines, parallel) |
| Result Quality | Mixed, noisy snippets | Deduped, scored, LLM-clean |
| Cloudflare Bypass | Rarely | Native Chromium CDP + HITL fallback |
| LinkedIn / Airbnb | Blocked | 99.99% success (HITL) |
| JS Rendering | Cloud API | Native Brave + bundled Chromium CDP |
| Semantic Memory | None | Embedded LanceDB + Model2Vec |
| Proxy Support | Paid add-on | Native SOCKS5/HTTP rotation |
| MCP Native | Partial | Full MCP stdio + HTTP |

πŸ€– Agent Optimal Setup: IDE Copilot Instructions

ShadowCrawl works best when your AI agent knows the operational rules before it starts β€” which tool to call first, when to rotate proxies, and when not to use extract_structured. Without these rules, agents waste tokens re-fetching cached data and can misuse tools on incompatible sources.

The complete rules file lives at .github/copilot-instructions.md (VS Code / GitHub Copilot) and is also available as .clinerules for Cline. Copy the block below into the IDE-specific file for your editor.


πŸ—‚οΈ VS Code β€” .github/copilot-instructions.md

Create (or append to) .github/copilot-instructions.md in your workspace root:

## MCP Usage Guidelines β€” ShadowCrawl

### Shadowcrawl Priority Rules

1. **Memory first (NEVER skip):** ALWAYS call `research_history` BEFORE calling `search_web`,
   `search_structured`, or `scrape_url`.
   **Cache-quality guard:** only skip a live fetch when ALL of the following are true:
   - similarity score β‰₯ 0.60
   - entry_type is NOT "search" (search entries have no word_count β€” always follow up with scrape_url)
   - word_count β‰₯ 50 (cached crates.io pages are JS-placeholders with ~11 words)
   - no placeholder/sparse warnings (placeholder_page, short_content, content_restricted)

2. **Initial research:** use `search_structured` (search + content summaries in one call).
   For private/internal tools not indexed publicly, skip search and go directly to
   `scrape_url` on the known repo/docs URL.

3. **Doc/article pages:** `scrape_url` with `output_format: clean_json`,
   `strict_relevance: true`, `query: "<your question>"`.
   Raw `.md`/`.txt` URLs are auto-detected β€” HTML pipeline is skipped, raw content returned.

4. **Proxy rotation (mandatory on first block):** if `scrape_url` or `search_web` returns
   403/429/rate-limit, immediately call `proxy_manager` with `action: "grab"` then retry
   with `use_proxy: true`. Do NOT wait for a second failure.

4a. **Auto-escalation on low confidence:** if `scrape_url` returns confidence < 0.3 or
    extraction_score < 0.4 β†’ retry with `quality_mode: "aggressive"` β†’ `visual_scout`
    β†’ `human_auth_session`. Never stay stuck on a low-confidence result.

5. **Schema extraction:** use `fetch_then_extract` (one-shot) or `extract_structured`.
   Both auto-inject `raw_markdown_url` warning when called on raw file URLs.
   Do NOT point at raw `.md`/`.json`/`.txt` unless intentional.

6. **Sub-page discovery:** use `crawl_website` before `scrape_url` when you only know
   an index URL and need to find the right sub-page.

7. **Last resort:** `non_robot_search` only after direct fetch + proxy rotation have both
   failed (Cloudflare / CAPTCHA / login walls). Session cookies are persisted after login.

🐾 Cursor β€” .cursorrules

Create or append to .cursorrules in your project root with the same block above.


🟩 Cline (VS Code extension) β€” .clinerules

Already included in this repository as .clinerules. Cline loads it automatically β€” no action needed.


🧠 Claude Desktop β€” System Prompt / Custom Instructions

Paste the rules block into the Custom Instructions or System Prompt field in Claude Desktop settings (Settings β†’ Advanced β†’ System Prompt).


🧳 Other Agents (Windsurf, Aider, Continue, AutoGen, etc.)

Any agent that accepts a system prompt or workspace instruction file: paste the same block. The rules are plain markdown and tool-agnostic.


Quick Decision Flow

Question / research task
        β”‚
        β–Ό
research_history ──► hit (β‰₯ 0.60)? ──► cache-quality guard:
        β”‚ miss            β”‚  entry_type=="search"? ──► don't skip; do scrape_url
        β”‚                 β”‚  word_count < 50 or placeholder warnings? ──► don't skip
        β”‚                 └──► quality OK? ──► use cached result, STOP
        β”‚
        β–Ό
search_structured ──► enough content? ──► use it, STOP
        β”‚ need deeper page
        β–Ό
scrape_url (clean_json + strict_relevance + query)
  β”‚ confidence < 0.3 or extraction_score < 0.4?
  β”œβ”€β”€β–Ί retry quality_mode: aggressive ──► visual_scout ──► human_auth_session
  β”‚ 403/429/blocked? ──► proxy_manager grab ──► retry use_proxy: true
  β”‚ still blocked? ──► non_robot_search  (LAST RESORT)
  β”‚
  └── need schema JSON? ──► fetch_then_extract (schema + strict=true)

πŸ“– Full rules + per-tool quick-reference table: .github/copilot-instructions.md


v3.0.0 (2026-02-20)

Added

  • human_auth_session (The Nuclear Option): Launches a visible browser for human login/CAPTCHA solving. Captures and persists full authentication cookies to ~/.shadowcrawl/sessions/{domain}.json. Enables full automation for protected URLs after a single manual session.
  • Instruction Overlay: human_auth_session now displays a custom green "ShadowCrawl" instruction banner on top of the browser window to guide users through complex auth walls.
  • Persistent Session Auto-Injection: web_fetch, web_crawl, and visual_scout now automatically check for and inject matching cookies from the local session store.
  • extract_structured / fetch_then_extract: new optional params placeholder_word_threshold (int, default 10) and placeholder_empty_ratio (float 0–1, default 0.9) allow agents to tune placeholder detection sensitivity per-call.
  • web_crawl: new optional max_chars param (default 10 000) caps total JSON output size to prevent workspace storage spill.
  • Rustdoc module extraction: extract_structured / fetch_then_extract correctly populate modules: [...] on docs.rs pages using the NAME/index.html sub-directory convention.
  • GitHub Discussions & Issues hydration: fetch_via_cdp detects github.com/*/discussions/* and /issues/* URLs; extends network-idle window to 2.5 s / 12 s max and polls for .timeline-comment, .js-discussion, .comment-body DOM nodes.
  • Contextual code blocks (clean_json mode): SniperCodeBlock gains a context: Option<String> field. Performs two-pass extraction for prose preceding fenced blocks and Markdown sentences containing inline snippets.
  • IDE copilot-instructions guide (README): new πŸ€– Agent Optimal Setup section.
  • .clinerules workspace file: all 7 priority rules + decision-flow diagram + per-tool quick-reference table.
  • Agent priority rules in tool schemas: every MCP tool description now carries machine-readable ⚠️ AGENT RULE / βœ… BEST PRACTICE.

Changed

  • Placeholder detection (Scalar-Only Logic): Confidence override to 0.0 now only considers scalar (non-array) fields. Pure-array schemas (headers, modules, structs) never trigger fake placeholder warnings, fixing false-positives on rich but list-heavy documentation pages.
  • web_fetch(output_format="clean_json"): applies a max_chars-based paragraph budget and emits clean_json_truncated when output is clipped.
  • extract_fields / fetch_then_extract: placeholder/unrendered pages (very low content + mostly empty schema fields) force confidence=0.0.
  • Short-content bypass (strict_relevance / extract_relevant_sections): early exit with a descriptive warning when word_count < 200. Short pages (GitHub Discussions, Q&A threads) are returned whole.

Fixed

  • BUG-6: modules: [] always empty on rustdoc pages β€” refactored regex to support both absolute and simple relative module links (init/index.html, optim/index.html).
  • BUG-7: false-positive confidence=0.0 on real docs.rs pages; replaced whole-schema empty ratio with scalar-only ratio + raised threshold.
  • BUG-9: web_crawl could spill 16 KB+ of JSON into VS Code workspace storage; handler now truncates response to max_chars (default 10 000).
  • web_fetch(output_format="clean_json"): paragraph filter now adapts for word_count < 200.
  • fetch_then_extract: prevents false-high confidence on JS-only placeholder pages (e.g. crates.io) by overriding confidence to 0.0.
  • cdp_fallback_failed on GitHub Discussions: extended CDP hydration window and selector polling ensures full thread capture.

β˜• Acknowledgments & Support

ShadowCrawl is built with ❀️ by a solo developer for the open-source AI community. If this tool saved you from a $500/mo scraping API bill:

  • ⭐ Star the repo β€” it helps others discover this
  • πŸ› Found a bug? Open an issue
  • πŸ’‘ Feature request? Start a discussion
  • β˜• Fuel more updates:

Sponsor

License: MIT β€” free for personal and commercial use.
