
Website-Crawler
Extract data from websites in LLM-ready JSON or CSV format. Crawl or scrape an entire website with Website Crawler.
Stars: 61

Website-Crawler is a tool designed to extract data from websites in an automated manner. It allows users to scrape information such as text, images, links, and more from web pages. The tool provides functionalities to navigate through websites, handle different types of content, and store extracted data for further analysis. Website-Crawler is useful for tasks like web scraping, data collection, content aggregation, and competitive analysis. It can be customized to extract specific data elements based on user requirements, making it a versatile tool for various web data extraction needs.
README:
The Website Crawler API allows developers to programmatically crawl websites and access structured metadata via four simple endpoints. The API gives you clean JSON responses and real-time crawl updates. The JSON/structured response generated by the /crawl/cwdata endpoint can be used for a variety of purposes. For example, because the data is in a JSON/LLM-ready format, you can use it to train an AI model, build chatbots, audit websites, etc.
To use the API, you'll need an API Key.
How to get one:
- Visit websitecrawler.org
- Create an account or log in
- Go to the Settings page to generate your API key
https://www.websitecrawler.org/api
Obtain an access token through the API. This token must be included in all subsequent requests.

Key required in the JSON payload:
- apiKey (string): Your API Key

Sample Request to get the token:

curl -X POST https://www.websitecrawler.org/api/crawl/authenticate \
-H "Content-Type: application/json" \
-d '{"apiKey": "your_api_key"}'

Sample Response:

{
  "token": "api_generated_token"
}
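For reference, here is a minimal Python sketch of the same call using the requests library (the library choice and error handling are my own assumptions, not part of the official SDK):

```python
import requests

API_KEY = "your_api_key"  # generated on the websitecrawler.org Settings page

# Exchange the API key for a bearer token (mirrors the curl sample above)
resp = requests.post(
    "https://www.websitecrawler.org/api/crawl/authenticate",
    json={"apiKey": API_KEY},
    timeout=30,
)
resp.raise_for_status()
token = resp.json()["token"]  # e.g. "api_generated_token"
print("Bearer token:", token)
```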
Initiate a new crawl for a given domain.

Keys required in the JSON payload:
- url (string, required): Target website (e.g. example.com), i.e. a non-redirecting main URL of the website.
- limit (integer, required): Max pages to crawl (the free tier is restricted to 100).

Sample Request to initiate crawling:

curl -X POST https://www.websitecrawler.org/api/crawl/start \
-H "Authorization: Bearer api_generated_token" \
-H "Content-Type: application/json" \
-d '{"url": "your_url","limit":"your_limit"}'

Sample Response 1:

{
  "status": "Crawling"
}

Sample Response 2:

{
  "status": "Completed!"
}
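Building on the authentication sketch above, here is a rough Python example that starts a crawl and waits for completion. Re-posting the same /crawl/start request as a status check is an assumption drawn from the two sample responses; the SDK demos below use a dedicated status method instead, and the polling interval is arbitrary.

```python
import time
import requests

BASE = "https://www.websitecrawler.org/api"
headers = {"Authorization": f"Bearer {token}"}  # token from the authenticate step
payload = {"url": "example.com", "limit": "100"}  # free tier allows up to 100 pages

# Kick off the crawl, then repeat the request until the status flips to "Completed!"
while True:
    resp = requests.post(f"{BASE}/crawl/start", headers=headers, json=payload, timeout=30)
    resp.raise_for_status()
    status = resp.json().get("status")
    print("Status:", status)
    if status == "Completed!":
        break
    time.sleep(5)  # arbitrary polling interval
```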
Retrieve the structured crawl output once crawling has completed.

Required key in the JSON payload:
- url (string, required): Target website (e.g. example.com)

Sample Request to get data:

curl -X POST https://www.websitecrawler.org/api/crawl/cwdata \
-H "Authorization: Bearer api_generated_token" \
-H "Content-Type: application/json" \
-d '{"url": "your_url"}'

Sample Response:
{
"status": [
{
"tt": "WPTLS - WordPress Plugins, themes and related services",
"np": "12",
"h1": "",
"nw": "534",
"h2": "Why learn HTML when there is WordPress?",
"h3": "",
"h4": "",
"h5": "",
"atb": "Why learn HTML when there is WordPress?",
"sc": "200",
"md": "Reviews, comparison, and collection of top WordPress themes, plugins, related services, and useful WP tips.",
"elsc": "",
"textCN": "Websitedata.",
"d": "",
"mr": "follow, index",
"pname": "wptls.com",
"al": "",
"cn": "https://wptls.com/",
"kw": "",
"url": "https://wptls.com",
"at": "",
"external_links": "https://www.facebook.com/wptls",
"tm": "96",
"image_links": "https://wptls.com/wp-content/uploads/2021/12/cropped-wptls-logo.png | https://wptls.com/wp-content/uploads/2021/12/cropped-wptls-logo.png | https://wptls.com/wp-content/uploads/2024/02/Spaceship-768x378.jpg | https://wptls.com/wp-content/uploads/2023/12/AdSense-768x612.png | https://wptls.com/wp-content/uploads/2023/12/Exabytes-768x375.jpg | https://wptls.com/wp-content/uploads/2023/10/HTML-768x112.jpg | https://wptls.com/wp-content/uploads/2023/10/Cloudflare-add-site-768x363.png | https://wptls.com/wp-content/uploads/2023/01/Google-Trends-768x363.webp | https://wptls.com/wp-content/uploads/2022/11/Twenty-Twenty-Three-768x351.webp | https://wptls.com/wp-content/uploads/2022/11/Broken-Link-Checker-768x223.webp | https://wptls.com/wp-content/uploads/2022/11/wordpress_logo.webp | https://wptls.com/wp-content/uploads/2022/11/footer-css-768x327.webp",
"internal_links": "https://wptls.com/why-learn-html-when-there-is-wordpress/ | https://wptls.com/customize-footer-wordpress/",
"nofollow_links": ""
}
]
}
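The field names in the response are abbreviated; tt appears to hold the page title, md the meta description, and sc the HTTP status code, but that mapping is inferred from the sample above rather than documented here, so treat it as an assumption. A minimal Python sketch that flattens the per-page records into the CSV format mentioned at the top of this README:

```python
import csv
import requests

headers = {"Authorization": f"Bearer {token}"}  # token from the authenticate step

resp = requests.post("https://www.websitecrawler.org/api/crawl/cwdata",
                     headers=headers,
                     json={"url": "example.com"},
                     timeout=60)
resp.raise_for_status()
pages = resp.json().get("status", [])  # the page records live under the "status" key

# Write a few recognisable fields to CSV; column meanings are inferred from the sample response
with open("crawl.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "meta_description", "status_code", "internal_links"])
    for page in pages:
        writer.writerow([
            page.get("url", ""),
            page.get("tt", ""),  # assumed: page title
            page.get("md", ""),  # assumed: meta description
            page.get("sc", ""),  # assumed: HTTP status code
            page.get("internal_links", ""),
        ])
```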
Get the last crawled/processed URL.

Required key in the JSON payload:
- url (string, required): Target website (e.g. example.com), i.e. a non-redirecting main URL of the website.

Sample Request to get the last crawled/processed URL:

curl -X POST https://www.websitecrawler.org/api/crawl/currentURL \
-H "Authorization: Bearer api_generated_token" \
-H "Content-Type: application/json" \
-d '{"url": "your_url"}'

Sample Response:
{
"currentURL": "https://wptls.com"
}
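As a rough sketch (same requests-based assumptions as the earlier examples), this endpoint can serve as a simple progress indicator while a crawl is running:

```python
import requests

headers = {"Authorization": f"Bearer {token}"}  # token from the authenticate step

resp = requests.post("https://www.websitecrawler.org/api/crawl/currentURL",
                     headers=headers,
                     json={"url": "example.com"},
                     timeout=30)
resp.raise_for_status()
print("Last crawled URL:", resp.json().get("currentURL"))
```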
Clear the previous job in case you want to rerun the crawler.

Required key in the JSON payload:
- url (string, required): Target website (e.g. example.com), i.e. a non-redirecting main URL of the website.

Sample Request to clear the job:

curl -X POST https://www.websitecrawler.org/api/crawl/clear \
-H "Authorization: Bearer api_generated_token" \
-H "Content-Type: application/json" \
-d '{"url": "your_url"}'

Sample Response:
{
"clearStatus": "Job cannot be cleared as the URL of the entered website is being crawled."
}
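A short Python sketch for clearing a finished job before rerunning the crawler (same requests-based assumptions as above; the sample response shows the call is rejected while the URL is still being crawled):

```python
import requests

headers = {"Authorization": f"Bearer {token}"}  # token from the authenticate step

resp = requests.post("https://www.websitecrawler.org/api/crawl/clear",
                     headers=headers,
                     json={"url": "example.com"},
                     timeout=30)
resp.raise_for_status()
print("Clear status:", resp.json().get("clearStatus"))
```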
The Python, Java, and Node.js demos below showcase how to use the WebsiteCrawlerSDK to interact with websitecrawler.org, enabling automated URL submission, status tracking, and retrieval of crawl data via the API.

Install the website crawler SDK:

pip install website-crawler-sdk

Change YOUR_API_KEY, YOUR_LIMIT, and YOUR_URL in the following demo script and run it. The script submits a URL to websitecrawler.org, reports the crawl status and the URL currently being processed by websitecrawler in real time, and retrieves the structured data once crawling is finished.
import time
from website_crawler_sdk import WebsiteCrawlerConfig, WebsiteCrawlerClient

"""
Author: Pramod Choudhary (websitecrawler.org)
Version: 1.1
Date: July 10, 2025
"""

# Replace with your actual API key, target URL, and limit
YOUR_API_KEY = "YOUR_API_KEY"  # Your API key goes here
URL = "YOUR_URL"  # Enter a non-redirecting URL/domain with https or http
LIMIT = YOUR_LIMIT  # Change YOUR_LIMIT

def main():
    cfg = WebsiteCrawlerConfig(YOUR_API_KEY)
    client = WebsiteCrawlerClient(cfg)

    # Submit the URL and limit to websitecrawler via the API
    client.submit_url_to_website_crawler(URL, LIMIT)

    while True:
        task_status = client.get_task_status()  # Start retrieving data if task_status is true
        print(f"{task_status} << task status")
        time.sleep(2)  # Wait for 2 seconds
        if not task_status:
            break
        if task_status:
            status = client.get_crawl_status()  # get_crawl_status() gets the crawl status
            currenturl = client.get_current_url()  # get_current_url() gets the current URL
            data = client.get_crawl_data()  # get_crawl_data() gets the structured data once crawling has completed
            if status:
                print(f"Current Status:: {status}")
            if status == "Crawling":  # Crawling is one of the statuses
                print(f"Current URL:: {currenturl}")
            if status == "Completed!":  # Completed! (with exclamation) is one of the statuses
                print("Task has been completed... closing the loop and getting the data...")
                if data:
                    print(f"JSON Data:: {data}")
                    time.sleep(20)  # Give extra time for a large JSON response
                    break
    print("Job over")

if __name__ == "__main__":
    main()
- Submit any website URL to be crawled
- Track crawl status in real-time
- View current URL being crawled
- Retrieve JSON-formatted crawl data on completion
Requirements for the Java SDK:
- Java 11 or higher
- Maven build system
- API key from WebsiteCrawler.org

Download the jar file WebsiteCrawlerSDK-Java-1.0.jar and add it as a dependency in your Java project. Create the WebsiteCrawlerConfig object as shown in the following code, pass it to WebsiteCrawlerClient, and call the methods on the WebsiteCrawlerClient object.
WebsiteCrawlerConfig config = new WebsiteCrawlerConfig("YOUR_API_KEY");
WebsiteCrawlerClient crawler = new WebsiteCrawlerClient(config);
package wc.WebsiteCrawlerAPIUsageDemo;

import wc.websitecrawlersdk.WebsiteCrawlerClient;
import wc.websitecrawlersdk.WebsiteCrawlerConfig;

/**
 *
 * @author Pramod
 */
public class WebsiteCrawlerAPIUsageDemo {

    public static void main(String[] args) throws InterruptedException {
        String status;
        String currenturl;
        String data;

        WebsiteCrawlerConfig cfg = new WebsiteCrawlerConfig(YOUR_API_KEY); // replace YOUR_API_KEY with your API key
        WebsiteCrawlerClient client = new WebsiteCrawlerClient(cfg);
        client.submitUrlToWebsiteCrawler(URL, LIMIT); // replace URL with the URL you want WebsiteCrawler.org to crawl and LIMIT with the number of URLs

        boolean taskStatus;
        while (true) {
            taskStatus = client.getTaskStatus(); // getTaskStatus() should be true before you call any other methods
            System.out.println(taskStatus + "<<task status");
            Thread.sleep(9000);
            if (taskStatus == true) {
                status = client.getCrawlStatus(); // getCrawlStatus() returns the live crawling status
                currenturl = client.getCurrentURL(); // getCurrentURL() returns the URL being processed by WebsiteCrawler.org
                data = client.getcwData(); // getcwData() returns the JSON array of the website data
                System.out.println("Crawl status::");
                if (status != null) {
                    System.out.println(status);
                }
                if (status != null && status.equals("Crawling")) { // status: Crawling ----> crawl job is in progress
                    System.out.println("Current URL::" + currenturl);
                }
                if (status != null && status.equals("Completed!")) { // status: Completed! ---> crawl job has completed successfully
                    System.out.println("Task has been completed.. closing the while loop");
                    if (data != null) {
                        System.out.println("Json Data::" + data);
                        Thread.sleep(20000); // JSON data might be huge; the pause gives it time to be retrieved
                        break; // exits the while(true) loop
                    }
                }
            }
        }
        System.out.println("job over");
    }
}
Install the website crawler SDK:

npm i website-crawler-sdk

Change YOUR_API_KEY, YOUR_LIMIT, and YOUR_URL in the following demo script and run it. The script submits a URL to websitecrawler.org, reports the crawl status and the URL currently being processed by websitecrawler in real time, and retrieves the structured data once crawling is finished.
const { WebsiteCrawlerConfig, WebsiteCrawlerClient } = require('website-crawler-sdk');
const config = new WebsiteCrawlerConfig('YOUR_API_KEY');
const client = new WebsiteCrawlerClient(config);
client.submitUrlToWebsiteCrawler('YOUR_URL', 'YOUR_LIMIT');
const intervalId = setInterval(() => {
  const status = client.getCrawlStatus();
  console.log('Status:', status);
  console.log('Current URL:', client.getCurrentURL());

  if (status === 'Completed!') {
    console.log('Crawl Data:', client.getCrawlData());
    console.log('Job completed...');
    clearInterval(intervalId);
  }
}, 3000);
This section highlights how the XML-Sitemap-Generator project uses the websitecrawler.org API to automate XML sitemap generation.

The following steps outline the flow between the Website Crawler API and the sitemap generation logic:

- Start Crawling
  - Use the crawl/start endpoint to initiate crawling of your website:
    https://www.websitecrawler.org/api/crawl/start?url=example.com&limit=100&key=YOUR_API_KEY
- Fetch Crawled Data
  - Once crawling is complete, retrieve data using:
    https://www.websitecrawler.org/api/crawl/cwdata?url=example.com&key=YOUR_API_KEY
  - The response includes structured metadata (titles, links, status codes, etc.) in JSON format.
- Process and Transform
  - The XML Sitemap Generator parses the response and extracts valid URLs.
- Generate Sitemap
  - The extracted URLs are then converted into a compliant sitemap.xml for SEO optimization and better search engine indexing (a rough sketch of this step follows below).
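For illustration, here is a minimal Python sketch of the last two steps, turning the url and internal_links fields from a /crawl/cwdata response into a sitemap.xml. The pipe-separated link format is taken from the sample response above; everything else, including how the real XML-Sitemap-Generator filters URLs, is an assumption.

```python
import xml.etree.ElementTree as ET

def build_sitemap(pages, outfile="sitemap.xml"):
    """pages: the list found under the "status" key of the /crawl/cwdata response."""
    urls = set()
    for page in pages:
        if page.get("url"):
            urls.add(page["url"])
        # internal_links is a " | "-separated string in the sample response
        for link in page.get("internal_links", "").split("|"):
            link = link.strip()
            if link.startswith("http"):
                urls.add(link)

    # Emit a minimal, standards-compliant urlset
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in sorted(urls):
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    ET.ElementTree(urlset).write(outfile, encoding="utf-8", xml_declaration=True)

# Example usage: build_sitemap(resp.json().get("status", []))
```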
Check out the full implementation here:
XML-Sitemap-Generator
For best results, ensure your API key is valid and your domain permits crawling.
## Feedback & Support
Found a bug or need help? Open an issue or connect via websitecrawler.org
Alternative AI tools for Website-Crawler
Similar Open Source Tools


waidrin
Waidrin is a powerful web scraping tool that allows users to easily extract data from websites. It provides a user-friendly interface for creating custom web scraping scripts and supports various data formats for exporting the extracted data. With Waidrin, users can automate the process of collecting information from multiple websites, saving time and effort. The tool is designed to be flexible and scalable, making it suitable for both beginners and advanced users in the field of web scraping.

onlook
Onlook is a web scraping tool that allows users to extract data from websites easily and efficiently. It provides a user-friendly interface for creating web scraping scripts and supports various data formats for exporting the extracted data. With Onlook, users can automate the process of collecting information from multiple websites, saving time and effort. The tool is designed to be flexible and customizable, making it suitable for a wide range of web scraping tasks.

falkordb-browser
FalkorDB Browser is a user-friendly web application for browsing and managing databases. It provides an intuitive interface for users to interact with their databases, allowing them to view, edit, and query data easily. With FalkorDB Browser, users can perform various database operations without the need for complex commands or scripts, making database management more accessible and efficient.

proxyless-llm-websearch
Proxyless-LLM-WebSearch is a tool that enables users to perform large language model-based web search without the need for proxies. It leverages state-of-the-art language models to provide accurate and efficient web search results. The tool is designed to be user-friendly and accessible for individuals looking to conduct web searches at scale. With Proxyless-LLM-WebSearch, users can easily search the web using natural language queries and receive relevant results in a timely manner. This tool is particularly useful for researchers, data analysts, content creators, and anyone interested in leveraging advanced language models for web search tasks.

SQLBot
SQLBot is a versatile tool for executing SQL queries and managing databases. It provides a user-friendly interface for interacting with databases, allowing users to easily query, insert, update, and delete data. SQLBot supports various database systems such as MySQL, PostgreSQL, and SQLite, making it a valuable tool for developers, data analysts, and database administrators. With SQLBot, users can streamline their database management tasks and improve their productivity by quickly accessing and manipulating data without the need for complex SQL commands.

ROGRAG
ROGRAG is a powerful open-source tool designed for data analysis and visualization. It provides a user-friendly interface for exploring and manipulating datasets, making it ideal for researchers, data scientists, and analysts. With ROGRAG, users can easily import, clean, analyze, and visualize data to gain valuable insights and make informed decisions. The tool supports a wide range of data formats and offers a variety of statistical and visualization tools to help users uncover patterns, trends, and relationships in their data. Whether you are working on exploratory data analysis, statistical modeling, or data visualization, ROGRAG is a versatile tool that can streamline your workflow and enhance your data analysis capabilities.

vizra-adk
Vizra-ADK is a data visualization tool that allows users to create interactive and customizable visualizations for their data. With a user-friendly interface and a wide range of customization options, Vizra-ADK makes it easy for users to explore and analyze their data in a visually appealing way. Whether you're a data scientist looking to create informative charts and graphs, or a business analyst wanting to present your findings in a compelling way, Vizra-ADK has you covered. The tool supports various data formats and provides features like filtering, sorting, and grouping to help users make sense of their data quickly and efficiently.

datatune
Datatune is a data analysis tool designed to help users explore and analyze datasets efficiently. It provides a user-friendly interface for importing, cleaning, visualizing, and modeling data. With Datatune, users can easily perform tasks such as data preprocessing, feature engineering, model selection, and evaluation. The tool offers a variety of statistical and machine learning algorithms to support data analysis tasks. Whether you are a data scientist, analyst, or researcher, Datatune can streamline your data analysis workflow and help you derive valuable insights from your data.

turftopic
Turftopic is a Python library that provides tools for sentiment analysis and topic modeling of text data. It allows users to analyze large volumes of text data to extract insights on sentiment and topics. The library includes functions for preprocessing text data, performing sentiment analysis using machine learning models, and conducting topic modeling using algorithms such as Latent Dirichlet Allocation (LDA). Turftopic is designed to be user-friendly and efficient, making it suitable for both beginners and experienced data analysts.

cellm
Cellm is an Excel extension that allows users to leverage Large Language Models (LLMs) like ChatGPT within cell formulas. It enables users to extract AI responses to text ranges, making it useful for automating repetitive tasks that involve data processing and analysis. Cellm supports various models from Anthropic, Mistral, OpenAI, and Google, as well as locally hosted models via Llamafiles, Ollama, or vLLM. The tool is designed to simplify the integration of AI capabilities into Excel for tasks such as text classification, data cleaning, content summarization, entity extraction, and more.

vivaria
Vivaria is a web application tool designed for running evaluations and conducting agent elicitation research. Users can interact with Vivaria using a web UI and a command-line interface. It allows users to start task environments based on METR Task Standard definitions, run AI agents, perform agent elicitation research, view API requests and responses, add tags and comments to runs, store results in a PostgreSQL database, sync data to Airtable, test prompts against LLMs, and authenticate using Auth0.

upgini
Upgini is an intelligent data search engine with a Python library that helps users find and add relevant features to their ML pipeline from various public, community, and premium external data sources. It automates the optimization of connected data sources by generating an optimal set of machine learning features using large language models, GraphNNs, and recurrent neural networks. The tool aims to simplify feature search and enrichment for external data to make it a standard approach in machine learning pipelines. It democratizes access to data sources for the data science community.

ollama4j
Ollama4j is a Java library that serves as a wrapper or binding for the Ollama server. It allows users to communicate with the Ollama server and manage models for various deployment scenarios. The library provides APIs for interacting with Ollama, generating fake data, testing UI interactions, translating messages, and building web UIs. Users can easily integrate Ollama4j into their Java projects to leverage the functionalities offered by the Ollama server.

arconia
Arconia is a powerful open-source tool for managing and visualizing data in a user-friendly way. It provides a seamless experience for data analysts and scientists to explore, clean, and analyze datasets efficiently. With its intuitive interface and robust features, Arconia simplifies the process of data manipulation and visualization, making it an essential tool for anyone working with data.

yek
Yek is a fast Rust-based tool designed to read text-based files in a repository or directory, chunk them, and serialize them for Large Language Models (LLM) consumption. It utilizes .gitignore rules to skip unwanted files, Git history to infer important files, and additional ignore patterns. Yek splits content into chunks based on token count or byte size, supports processing multiple directories, and can stream content when output is piped. It is configurable via a 'yek.toml' file and prioritizes important files at the end of the output.
For similar tasks

Revornix
Revornix is an information management tool designed for the AI era. It allows users to conveniently integrate all visible information and generates comprehensive reports at specific times. The tool offers cross-platform availability, all-in-one content aggregation, document transformation & vectorized storage, native multi-tenancy, localization & open-source features, smart assistant & built-in MCP, seamless LLM integration, and multilingual & responsive experience for users.

crssnt
crssnt is a tool that converts RSS/Atom feeds into LLM-friendly Markdown or JSON, simplifying integration of feed content into AI workflows. It supports LLM-optimized conversion, multiple output formats, feed aggregation, and Google Sheet support. Users can access various endpoints for feed conversion and Google Sheet processing, with query parameters for customization. The tool processes user-provided URLs transiently without storing feed data, and can be self-hosted as Firebase Cloud Functions. Contributions are welcome under the MIT License.


temporal-ai-agent
Temporal AI Agent is a demo showcasing a multi-turn conversation with an AI agent running inside a Temporal workflow. The agent collects information towards a goal using a simple DSL input. It is currently set up to search for events, book flights around those events, and create an invoice for those flights. The AI agent responds with clarifications and prompts for missing information. Users can configure the agent to use ChatGPT 4o or a local LLM via Ollama. The tool requires Rapidapi key for sky-scrapper to find flights and a Stripe key for creating invoices. Users can customize the agent by modifying tool and goal definitions in the codebase.

skyvern
Skyvern automates browser-based workflows using LLMs and computer vision. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions. Traditional approaches to browser automations required writing custom scripts for websites, often relying on DOM parsing and XPath-based interactions which would break whenever the website layouts changed. Instead of only relying on code-defined XPath interactions, Skyvern adds computer vision and LLMs to the mix to parse items in the viewport in real-time, create a plan for interaction and interact with them. This approach gives us a few advantages: 1. Skyvern can operate on websites it's never seen before, as it's able to map visual elements to actions necessary to complete a workflow, without any customized code 2. Skyvern is resistant to website layout changes, as there are no pre-determined XPaths or other selectors our system is looking for while trying to navigate 3. Skyvern leverages LLMs to reason through interactions to ensure we can cover complex situations. Examples include: 1. If you wanted to get an auto insurance quote from Geico, the answer to a common question "Were you eligible to drive at 18?" could be inferred from the driver receiving their license at age 16 2. If you were doing competitor analysis, it's understanding that an Arnold Palmer 22 oz can at 7/11 is almost definitely the same product as a 23 oz can at Gopuff (even though the sizes are slightly different, which could be a rounding error!) Want to see examples of Skyvern in action? Jump to #real-world-examples-of-skyvern

airbyte-connectors
This repository contains Airbyte connectors used in Faros and Faros Community Edition platforms as well as Airbyte Connector Development Kit (CDK) for JavaScript/TypeScript.

open-parse
Open Parse is a Python library for visually discerning document layouts and chunking them effectively. It is designed to fill the gap in open-source libraries for handling complex documents. Unlike text splitting, which converts a file to raw text and slices it up, Open Parse visually analyzes documents for superior LLM input. It also supports basic markdown for parsing headings, bold, and italics, and has high-precision table support, extracting tables into clean Markdown formats with accuracy that surpasses traditional tools. Open Parse is extensible, allowing users to easily implement their own post-processing steps. It is also intuitive, with great editor support and completion everywhere, making it easy to use and learn.

unstract
Unstract is a no-code platform that enables users to launch APIs and ETL pipelines to structure unstructured documents. With Unstract, users can go beyond co-pilots by enabling machine-to-machine automation. Unstract's Prompt Studio provides a simple, no-code approach to creating prompts for LLMs, vector databases, embedding models, and text extractors. Users can then configure Prompt Studio projects as API deployments or ETL pipelines to automate critical business processes that involve complex documents. Unstract supports a wide range of LLM providers, vector databases, embeddings, text extractors, ETL sources, and ETL destinations, providing users with the flexibility to choose the best tools for their needs.
For similar jobs

databerry
Chaindesk is a no-code platform that allows users to easily set up a semantic search system for personal data without technical knowledge. It supports loading data from various sources such as raw text, web pages, files (Word, Excel, PowerPoint, PDF, Markdown, Plain Text), and upcoming support for web sites, Notion, and Airtable. The platform offers a user-friendly interface for managing datastores, querying data via a secure API endpoint, and auto-generating ChatGPT Plugins for each datastore. Chaindesk utilizes a Vector Database (Qdrant), Openai's text-embedding-ada-002 for embeddings, and has a chunk size of 1024 tokens. The technology stack includes Next.js, Joy UI, LangchainJS, PostgreSQL, Prisma, and Qdrant, inspired by the ChatGPT Retrieval Plugin.

OAD
OAD is a powerful open-source tool for analyzing and visualizing data. It provides a user-friendly interface for exploring datasets, generating insights, and creating interactive visualizations. With OAD, users can easily import data from various sources, clean and preprocess data, perform statistical analysis, and create customizable visualizations to communicate findings effectively. Whether you are a data scientist, analyst, or researcher, OAD can help you streamline your data analysis workflow and uncover valuable insights from your data.

sqlcoder
Defog's SQLCoder is a family of state-of-the-art large language models (LLMs) designed for converting natural language questions into SQL queries. It outperforms popular open-source models like gpt-4 and gpt-4-turbo on SQL generation tasks. SQLCoder has been trained on more than 20,000 human-curated questions based on 10 different schemas, and the model weights are licensed under CC BY-SA 4.0. Users can interact with SQLCoder through the 'transformers' library and run queries using the 'sqlcoder launch' command in the terminal. The tool has been tested on NVIDIA GPUs with more than 16GB VRAM and Apple Silicon devices with some limitations. SQLCoder offers a demo on their website and supports quantized versions of the model for consumer GPUs with sufficient memory.

TableLLM
TableLLM is a large language model designed for efficient tabular data manipulation tasks in real office scenarios. It can generate code solutions or direct text answers for tasks like insert, delete, update, query, merge, and chart operations on tables embedded in spreadsheets or documents. The model has been fine-tuned based on CodeLlama-7B and 13B, offering two scales: TableLLM-7B and TableLLM-13B. Evaluation results show its performance on benchmarks like WikiSQL, Spider, and self-created table operation benchmark. Users can use TableLLM for code and text generation tasks on tabular data.

mlcraft
Synmetrix (prev. MLCraft) is an open source data engineering platform and semantic layer for centralized metrics management. It provides a complete framework for modeling, integrating, transforming, aggregating, and distributing metrics data at scale. Key features include data modeling and transformations, semantic layer for unified data model, scheduled reports and alerts, versioning, role-based access control, data exploration, caching, and collaboration on metrics modeling. Synmetrix leverages Cube (Cube.js) for flexible data models that consolidate metrics from various sources, enabling downstream distribution via a SQL API for integration into BI tools, reporting, dashboards, and data science. Use cases include data democratization, business intelligence, embedded analytics, and enhancing accuracy in data handling and queries. The tool speeds up data-driven workflows from metrics definition to consumption by combining data engineering best practices with self-service analytics capabilities.

data-scientist-roadmap2024
The Data Scientist Roadmap2024 provides a comprehensive guide to mastering essential tools for data science success. It includes programming languages, machine learning libraries, cloud platforms, and concepts categorized by difficulty. The roadmap covers a wide range of topics from programming languages to machine learning techniques, data visualization tools, and DevOps/MLOps tools. It also includes web development frameworks and specific concepts like supervised and unsupervised learning, NLP, deep learning, reinforcement learning, and statistics. Additionally, it delves into DevOps tools like Airflow and MLFlow, data visualization tools like Tableau and Matplotlib, and other topics such as ETL processes, optimization algorithms, and financial modeling.

VMind
VMind is an open-source solution for intelligent visualization, providing an intelligent chart component based on LLM by VisActor. It allows users to create chart narrative works with natural language interaction, edit charts through dialogue, and export narratives as videos or GIFs. The tool is easy to use, scalable, supports various chart types, and offers one-click export functionality. Users can customize chart styles, specify themes, and aggregate data using LLM models. VMind aims to enhance efficiency in creating data visualization works through dialogue-based editing and natural language interaction.

quadratic
Quadratic is a modern multiplayer spreadsheet application that integrates Python, AI, and SQL functionalities. It aims to streamline team collaboration and data analysis by enabling users to pull data from various sources and utilize popular data science tools. The application supports building dashboards, creating internal tools, mixing data from different sources, exploring data for insights, visualizing Python workflows, and facilitating collaboration between technical and non-technical team members. Quadratic is built with Rust + WASM + WebGL to ensure seamless performance in the browser, and it offers features like WebGL Grid, local file management, Python and Pandas support, Excel formula support, multiplayer capabilities, charts and graphs, and team support. The tool is currently in Beta with ongoing development for additional features like JS support, SQL database support, and AI auto-complete.