PulsarRPA

Automate webpages at scale, scrape web data completely and accurately with high performance, distributed AI-RPA.

Stars: 805

PulsarRPA is a high-performance, distributed, open-source Robotic Process Automation (RPA) framework designed to handle large-scale RPA tasks with ease. It provides a comprehensive solution for browser automation, web content understanding, and data extraction. PulsarRPA addresses challenges of browser automation and accurate web data extraction from complex and evolving websites. It incorporates innovative technologies like browser rendering, RPA, intelligent scraping, advanced DOM parsing, and distributed architecture to ensure efficient, accurate, and scalable web data extraction. The tool is open-source, customizable, and supports cutting-edge information extraction technology, making it a preferred solution for large-scale web data extraction.

🚄 Get started

💖 PulsarRPA is All You Need! 💖

PulsarRPA is a high-performance, distributed, open-source Robotic Process Automation (RPA) framework. It is designed to handle large-scale RPA tasks with ease, providing a comprehensive solution for browser automation, web content understanding, and data extraction. It addresses the inherent challenges of browser automation and extracting accurate, comprehensive web data amidst rapidly evolving and increasingly intricate websites.

Videos

Bilibili: https://www.bilibili.com/video/BV1kM2rYrEFC

Most scraping attempts can start with (almost) a single line of code:

fun main() = PulsarContexts.createSession().scrapeOutPages(
  "https://www.amazon.com/",  "-outLink a[href~=/dp/]", listOf("#title", "#acrCustomerReviewText"))

The code above scrapes fields specified by CSS selectors #title and #acrCustomerReviewText from a set of product pages.

Example code: kotlin.

Most real-world web crawling projects can start from the following code template:

fun main() {
    val context = PulsarContexts.create()

    val parseHandler = { _: WebPage, document: FeaturedDocument ->
        // use the document
        // ...
        // and then extract further hyperlinks
        context.submitAll(document.selectHyperlinks("a[href~=/dp/]"))
    }
    val urls = LinkExtractors.fromResource("seeds10.txt")
        .map { ParsableHyperlink("$it -refresh", parseHandler) }
    context.submitAll(urls).await()
}

Example code: kotlin, java.

The most complicated web crawling challenges can be overcome with cutting-edge robotic process automation (RPA) technologies:

val options = session.options(args)
val event = options.eventHandlers.browseEventHandlers
event.onBrowserLaunched.addLast { page, driver ->
    // warm up the browser to avoid being blocked by the website,
    // or choose the global settings, such as your location.
    warmUpBrowser(page, driver)
}
event.onWillFetch.addLast { page, driver ->
    // have to visit a referrer page before we can visit the desired page
    waitForReferrer(page, driver)
    // websites may prevent us from opening too many pages at a time, so we should open links one by one.
    waitForPreviousPage(page, driver)
}
event.onWillCheckDocumentState.addLast { page, driver ->
    // wait for a specific field to appear on the page
    driver.waitForSelector("body h1[itemprop=name]")
    // close the mask layer, it might be promotions, ads, or something else.
    driver.click(".mask-layer-close-button")
}
// visit the URL and trigger events
session.load(url, options)

Example code: kotlin.

The most complicated Web data extraction problems can be effectively managed using X-SQL technology:

select
    dom_first_text(dom, '#productTitle') as title,
    dom_first_text(dom, '#bylineInfo') as brand,
    dom_first_text(dom, '#price tr td:matches(^Price) ~ td, #corePrice_desktop tr td:matches(^Price) ~ td') as price,
    dom_first_text(dom, '#acrCustomerReviewText') as ratings,
    str_first_float(dom_first_text(dom, '#reviewsMedley .AverageCustomerReviews span:contains(out of)'), 0.0) as score
from load_and_select('https://www.amazon.com/dp/B0C1H26C46  -i 1s -njr 3', 'body');

🥁 Introduction

Extracting Web data at scale is extremely hard. Websites change frequently and are becoming more complex, meaning web data collected is often inaccurate or incomplete. PulsarRPA has developed a range of cutting-edge technologies to solve this problem.

Challenges in Large-Scale Web Data Extraction

Frequent Website Changes: Online platforms continuously update their layouts, structures, and content, making it difficult to maintain reliable extraction processes over time. Traditional scraping tools may struggle to adapt promptly to these changes, leading to outdated or irrelevant data.
Intelligent Extraction of Web Content: The internet hosts billions of websites, each containing vast amounts of data. To extract information from such a diverse range of websites and keep up with frequent changes, intelligent web content extraction techniques are crucial. Traditional web scraping methods fail to effectively handle large volumes of webpages, leading to inefficient data extraction.
Complex Website Architecture: Modern websites often employ sophisticated design patterns, dynamic content loading, and advanced security measures, presenting formidable obstacles for conventional scraping techniques. Extracting data from such sites requires deep understanding of their structure and behavior, as well as the ability to interact with them as a human user would.

PulsarRPA: A Comprehensive Solution for Web Data Collection

To conquer these challenges, PulsarRPA incorporates a suite of innovative technologies that ensure efficient, accurate, and scalable web data extraction:

Browser Rendering: Utilizes browser rendering and AJAX data crawling to extract content from websites.
RPA (Robotic Process Automation): Employs human-like behaviors to interact with webpages, enabling data collection from modern, complex websites.
Intelligent Scraping: PulsarRPA employs intelligent scraping technology that can automatically recognize and understand web content, ensuring accurate and timely data extraction. Utilizing smart algorithms and machine learning techniques, PulsarRPA can independently learn and apply data extraction models, significantly improving the efficiency and accuracy of data retrieval.
Advanced DOM Parsing: Leveraging advanced Document Object Model (DOM) parsing techniques, PulsarRPA can navigate complex website architectures with ease. It accurately identifies and extracts data from elements in modern web pages, handles dynamic content rendering, and bypasses anti-scraping measures, delivering complete and accurate datasets despite website intricacies.
Distributed Architecture: Built on a distributed architecture, PulsarRPA harnesses the combined processing power of multiple nodes to handle large-scale extraction tasks efficiently. This allows for parallel crawling, faster data retrieval, and seamless scalability as your data requirements grow, without compromising performance or reliability.
Open-Source & Customizable: As an open-source solution, PulsarRPA offers unparalleled flexibility and extensibility. Developers can easily customize its components, integrate with existing systems, or contribute new features to meet specific project requirements.

In summary, PulsarRPA, with its web content understanding, intelligent scraping, advanced DOM parsing, distributed processing, and open-source features, becomes the preferred open-source solution for large-scale web data extraction. Its unique technology combination allows users to effectively address the complexities and challenges associated with extracting valuable web data on a large scale, ultimately facilitating wiser decision-making and competitive advantage.

We have released complete solutions for site-wide Web scraping for some of the largest e-commerce websites. These solutions meet the highest standards of performance, quality, and cost, and they will be free and open source forever.

🕷 PulsarRPA supports high-quality, large-scale Web data collection and processing. We have developed a range of infrastructure and cutting-edge technologies to ensure the highest standards of performance, quality, and TCO (total cost of ownership), even in very large-scale data collection scenarios.

🏪 PulsarRPA supports the Network-As-A-Database paradigm. PulsarRPA treats the external network as a database. If the required data is not in the local storage, or the existing version does not meet the analysis needs, the system will collect the latest version of the data from the Internet. We also developed X-SQL to query the Web directly and convert webpages into tables and charts.
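The fetch-on-miss behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not PulsarRPA's implementation: `PageStore`, `StoredPage`, and the `fetch` callback are hypothetical names, and an in-memory map stands in for real local storage.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical record of a stored page; not a PulsarRPA type.
data class StoredPage(val url: String, val content: String, val fetchedAt: Instant)

// Hypothetical store: serves the local copy while it is fresh enough,
// otherwise collects a new version through the supplied fetch function.
class PageStore(private val fetch: (String) -> String) {
    private val pages = mutableMapOf<String, StoredPage>()

    fun load(url: String, expires: Duration, now: Instant = Instant.now()): StoredPage {
        val cached = pages[url]
        // The local copy is usable only if it exists and is younger than the expiry.
        if (cached != null && Duration.between(cached.fetchedAt, now) < expires) {
            return cached
        }
        val fresh = StoredPage(url, fetch(url), now)
        pages[url] = fresh
        return fresh
    }
}

fun main() {
    var fetchCount = 0
    val store = PageStore { url -> fetchCount++; "content of $url" }
    val t0 = Instant.parse("2024-01-01T00:00:00Z")
    val oneDay = Duration.ofHours(24)

    store.load("https://example.com/a", oneDay, t0)                     // not stored yet: fetched
    store.load("https://example.com/a", oneDay, t0.plusSeconds(60))     // fresh: served locally
    store.load("https://example.com/a", oneDay, t0.plusSeconds(90_000)) // older than 24h: fetched again
    println(fetchCount) // prints 2
}
```

Passing `now` explicitly keeps the freshness rule easy to test; a caller would normally rely on the default.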

🌈 PulsarRPA supports browser rendering as the primary method to collect Web data. By using browser rendering as the primary method to collect Web data, we achieve an optimal balance between data point scale, data quality, labor cost, and hardware cost, and achieve the lowest TCO (total cost of ownership). With optimizations such as blocking unnecessary resource files, the performance of browser rendering can even be comparable to the traditional single resource collection method.

💫 PulsarRPA supports RPA-based Web scraping. PulsarRPA includes an RPA subsystem for Web interaction: scrolling, typing, screen capture, dragging and dropping, clicking, etc. This subsystem is similar to the well-known Selenium, Playwright, and Puppeteer, but all behaviors are optimized for more realistic simulation, better execution performance, better parallelism, better fault tolerance, and so on.

🔪 PulsarRPA supports single resource collection. PulsarRPA's default data collection method is to harvest the complete Web data through browser rendering, but if the data you need can be retrieved through a single link, for example through an AJAX interface, you can also call PulsarRPA's resource collection method for super high-speed collection.

💯 PulsarRPA plans to support cutting-edge information extraction technology. We plan to release an advanced AI to automatically extract every field from all valuable webpages (e.g., product detail pages) with remarkable accuracy, and we currently offer a preview version.

🚀 Features

Web spider: browser rendering, ajax data crawling
RPA: robotic process automation, mimic human behaviors, SPA crawling, or do something else valuable
Simple API: single line of code to scrape, or single SQL to turn a website into a table
X-SQL: extended SQL to manage web data: Web crawling, scraping, Web content mining, Web BI
Bot stealth: web driver stealth, IP rotation, privacy context rotation, never get banned
High performance: highly optimized, rendering hundreds of pages in parallel on a single machine without being blocked
Low cost: scrape 100,000 browser-rendered e-commerce webpages, or n * 10,000,000 data points, each day with only an 8-core CPU and 32 GB of memory
Data quantity assurance: smart retry, accurate scheduling, web data lifecycle management
Large scale: fully distributed, designed for large scale crawling
Big data: various backend storage support: Local File/MongoDB/HBase/Gora
Logs & metrics: monitored closely and every event is recorded
[Preview] Information Extraction: Learns Web data patterns and automatically extracts every field in a webpage with remarkable precision

♾ Core concepts

To unlock the full potential of PulsarRPA and tackle even the most complex data scraping tasks, a solid understanding of its core concepts is essential. By grasping these fundamental principles, you'll be equipped to wield PulsarRPA as a powerful tool for extracting valuable information from the web. Let's delve into the key concepts that form the foundation of your data scraping journey with PulsarRPA:

Web Scraping: the process of using bots to extract content and data from a website
Auto Extract: learn the data schema automatically and extract every field from webpages, powered by cutting-edge AI algorithms
RPA: stands for robotic process automation, which is the only way to scrape modern webpages
Network As A Database: access the network just like a database
X-SQL: query the Web using SQL directly
Pulsar Session: provides a set of simple, powerful, and flexible APIs to do web scraping tasks
Web Driver: defines a concise interface to visit and interact with webpages, all behaviors are optimized to mimic real people as closely as possible
UrlAware: This concept encompasses URLs and additional information describing a task. In PulsarRPA, every task is defined as a type of UrlAware, with the main types being: PlainUrl, HyperLink, ListenableHyperlink, and ParsableHyperlink.
Load Options: also known as load arguments. These control parameters fine-tune how PulsarRPA loads and scrapes webpages, allowing you to tailor the process to specific requirements.
Event Handlers: capture and process various events that occur throughout the entire lifecycle of webpage scraping

Check PulsarRPA concepts for details.
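To illustrate how a URL and its load arguments travel together in one string (as in `"$it -refresh"` in the template earlier), here is a minimal sketch that splits such a string apart. `TaskSpec` and `splitUrlAndArgs` are hypothetical helpers written for this example, not PulsarRPA's API, and the sketch assumes argument values never start with `-`.

```kotlin
// Hypothetical holder for a task: the URL plus its load arguments.
data class TaskSpec(val url: String, val args: Map<String, String?>)

// Splits "url -name value -flag ..." into the URL and a name -> value map.
// Flags that carry no value map to null.
fun splitUrlAndArgs(configuredUrl: String): TaskSpec {
    val tokens = configuredUrl.trim().split(Regex("\\s+"))
    val url = tokens.first()
    val args = linkedMapOf<String, String?>()
    var i = 1
    while (i < tokens.size) {
        val name = tokens[i].removePrefix("-")
        val next = tokens.getOrNull(i + 1)
        if (next != null && !next.startsWith("-")) {
            // "-name value" pair
            args[name] = next
            i += 2
        } else {
            // bare flag such as "-refresh"
            args[name] = null
            i += 1
        }
    }
    return TaskSpec(url, args)
}

fun main() {
    val spec = splitUrlAndArgs("https://www.amazon.com/dp/B0C1H26C46  -i 1s -njr 3 -refresh")
    println(spec.url)
    println(spec.args)
}
```

The meaning of each argument is defined by PulsarRPA itself; this sketch only shows the shape of the input.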

🧮 PulsarRPA as an executable jar

We have released a standalone executable jar based on PulsarRPA, which includes:

Web scraping examples of a set of top sites
An applet based on self-supervised machine learning for information extraction: the AI identifies all fields on a detail page, with more than 90% of fields extracted at 99.9% accuracy or higher
An applet based on self-supervised machine learning that outputs all extraction rules, which can help traditional Web scraping methods
An applet that scrapes Web data directly from the command line, like wget or curl, without writing code
An upgraded PulsarRPA server to which we can send SQLs to collect Web data
A Web UI from which we can write SQLs and send them to the server

Download PulsarRPAPro and explore its capabilities with a single command line:

java -jar PulsarRPAPro.jar

🎁 PulsarRPA as a java library

The simplest way to leverage the power of PulsarRPA is to add it to your project as a library.

Maven:

<dependency>
  <groupId>ai.platon.pulsar</groupId>
  <artifactId>pulsar-bom</artifactId>
  <version>2.2.0-SNAPSHOT</version>
</dependency>

Gradle:

implementation("ai.platon.pulsar:pulsar-bom:2.2.0-SNAPSHOT")

You can clone the template project from github.com: kotlin, java-11, java-17.

You can also start your own large-scale web crawling projects based on our commercial-grade open source projects: PulsarRPAPro, Exotic-amazon.

Check the quick start for more details.

🌐 PulsarRPA as a REST Service

When PulsarRPA runs as a REST service, X-SQL can be used to scrape webpages or to query web data directly at any time, from anywhere, without opening an IDE.

Build from Source

git clone https://github.com/platonai/PulsarRPA.git
cd PulsarRPA && bin/build-run.sh

For Chinese developers, we strongly suggest following this instruction to speed up the build.

Use X-SQL to Query the Web

Start the pulsar server if it is not already running:

bin/pulsar

Scrape a webpage in another terminal window:

bin/scrape.sh

The bash script is straightforward. It merely uses curl to send a POST request whose body is an X-SQL query.

curl -X POST --location "http://localhost:8182/api/x/e" -H "Content-Type: text/plain" -d "
  select
      dom_base_uri(dom) as url,
      dom_first_text(dom, '#productTitle') as title,
      dom_first_slim_html(dom, 'img:expr(width > 400)') as img
  from load_and_select('https://www.amazon.com/dp/B0C1H26C46', 'body');
"

Example code: bash, batch, java, kotlin, php.
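The same request can be sent from Kotlin without extra dependencies. The sketch below uses the JDK's built-in `java.net.http.HttpClient` and the endpoint from the curl command above; `buildXSqlRequest` is a hypothetical helper name, and nothing is assumed about the format of the server's response.

```kotlin
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse

// Builds the same request as the curl command: a POST with a text/plain body holding the X-SQL.
fun buildXSqlRequest(endpoint: String, sql: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(endpoint))
        .header("Content-Type", "text/plain")
        .POST(HttpRequest.BodyPublishers.ofString(sql))
        .build()

fun main() {
    val sql = """
        select
            dom_base_uri(dom) as url,
            dom_first_text(dom, '#productTitle') as title
        from load_and_select('https://www.amazon.com/dp/B0C1H26C46', 'body');
    """.trimIndent()

    val request = buildXSqlRequest("http://localhost:8182/api/x/e", sql)
    // Requires a running PulsarRPA server; the body is printed as-is.
    val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
    println(response.statusCode())
    println(response.body())
}
```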

See X-SQL for a detailed introduction and function descriptions.

📖 Step-by-Step Course

We have a step-by-step course by example:

📊 Logs & Metrics

PulsarRPA has carefully designed the logging and metrics subsystem to record every event that occurs in the system. PulsarRPA logs the status for every load execution, providing a clear and comprehensive overview of system performance. This detailed logging allows for quick assessment of the system’s health and efficiency. It answers key questions such as: Is the system operating smoothly? How many pages have been successfully retrieved? How many attempts were made to reload pages? And how many proxy IP addresses have been utilized? This information is invaluable for monitoring and troubleshooting purposes, ensuring that any issues can be promptly identified and addressed.

By focusing on a concise set of indicators, you can unlock a deeper understanding of the system’s overall condition: 💯 💔 🗙 ?💿 🔃 🤺.

Typical page loading logs are shown below. Check the log-format to learn how to read the logs and gain insight into the state of the entire system at a glance.

2022-09-24 11:46:26.045  INFO [-worker-14] a.p.p.c.c.L.Task - 3313. 💯  ?U for N got 200 580.92 KiB in 1m14.277s, fc:1 | 75/284/96/277/6554 | 106.32.12.75 | 3xBpaR2 | https://www.walmart.com/ip/Restored-iPhone-7-32GB-Black-T-Mobile-Refurbished/329207863  -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:46:09.190  INFO [-worker-32] a.p.p.c.c.L.Task - 3738. 💯 💿 U got 200 452.91 KiB in 55.286s, last fetched 9h32m50s ago, fc:1 | 49/171/82/238/6172 | 121.205.2.0.5 | https://www.walmart.com/ip/Boost-Mobile-Apple-iPhone-SE-2-Cell-Phone-Black-64GB-Prepaid-Smartphone/490934488  -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:46:28.567  INFO [-worker-17] a.p.p.c.c.L.Task - 2269. 💯 🔃 U for SC got 200 565.07 KiB <- 543.41 KiB in 1m22.767s, last fetched 16m58s ago, fc:6 | 58/230/98/295/6272 | 27.158.125.76 | 9uwu602 | https://www.walmart.com/ip/Straight-Talk-Apple-iPhone-11-64GB-Purple-Prepaid-Smartphone/356345388?variantFieldId=actual_color  -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:47:18.390  INFO [r-worker-8] a.p.p.c.c.L.Task - 3732. 💔  ?U for N got 1601 0 <- 0 in 32.201s, fc:1/1 Retry(1601) rsp: CRAWL, rrs: EMPTY_0B | 2zYxg52 | https://www.walmart.com/ip/Apple-iPhone-7-256GB-Jet-Black-AT-T-Locked-Smartphone-Grade-B-Used/182353175?variantFieldId=actual_color  -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:47:13.860  INFO [-worker-60] a.p.p.c.c.L.Task - 2828. 🗙 🗙 U for SC got 200 0 <- 348.31 KiB <- 684.75 KiB in 0s, last fetched 18m55s ago, fc:2 | 34/130/52/181/5747 | 60.184.124.232 | 11zTa0r2 | https://www.walmart.com/ip/Walmart-Family-Mobile-Apple-iPhone-11-64GB-Black-Prepaid-Smartphone/209201965?athbdg=L1200  -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
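For quick ad-hoc checks, lines like these can be picked apart with a regular expression. The sketch below is based only on the sample lines above, not on the official log-format description: it pulls out the task number, the status code after `got`, and the first URL. `LoadEntry` and `parseLoadLine` are hypothetical names.

```kotlin
// Hypothetical summary of one load line; the fields are inferred from the samples only.
data class LoadEntry(val taskId: Int, val statusCode: Int, val url: String)

// "Task - <id>. ... got <code> ... <url>"
private val linePattern = Regex("""Task - (\d+)\. .*? got (\d+) .*?(https?://\S+)""")

// Returns null when the line does not look like a page loading log.
fun parseLoadLine(line: String): LoadEntry? {
    val match = linePattern.find(line) ?: return null
    val (id, code, url) = match.destructured
    return LoadEntry(id.toInt(), code.toInt(), url)
}

fun main() {
    val line = "2022-09-24 11:47:18.390  INFO [r-worker-8] a.p.p.c.c.L.Task - 3732. 💔  ?U for N got 1601 0 <- 0 in 32.201s, fc:1/1 Retry(1601) rsp: CRAWL, rrs: EMPTY_0B | 2zYxg52 | https://www.walmart.com/ip/Apple-iPhone-7-256GB-Jet-Black-AT-T-Locked-Smartphone-Grade-B-Used/182353175?variantFieldId=actual_color  -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000"
    println(parseLoadLine(line))
}
```

Counting entries by `statusCode` over a log file would then give a rough success and retry overview.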

💻 System Requirements

Memory 4G+
JDK 17+
java on the PATH
Latest Google Chrome
[Optional] MongoDB started

PulsarRPA is tested on Ubuntu 18.04, Ubuntu 20.04, Windows 7, Windows 11, and WSL. Any other operating system that meets the requirements should work as well.

🛸 Advanced Topics

Check the advanced topics to find answers to the following questions:

What’s so difficult about scraping web data at scale?
How to scrape a million product pages from an e-commerce website a day?
How to scrape pages behind a login?
How to download resources directly within a browser context?
How to scrape a single page application (SPA)?
Resource mode
RPA mode
How to make sure all fields are extracted correctly?
How to crawl paginated links?
How to crawl newly discovered links?
How to crawl the entire website?
How to simulate human behaviors?
How to schedule priority tasks?
How to start a task at a fixed time point?
How to drop a scheduled task?
How to know the status of a task?
How to know what's going on in the system?
How to automatically generate the CSS selectors for fields to scrape?
How to extract content from websites using machine learning automatically with commercial accuracy?
How to scrape amazon.com to match industrial needs?

🆚 Compare with Other Solutions

In general, the features mentioned in the Features section are well supported by PulsarRPA but not by other solutions.

Check the solution comparison to see the detailed comparison to the other solutions:

PulsarRPA vs selenium/puppeteer/playwright
PulsarRPA vs nutch
PulsarRPA vs scrapy+splash

🤓 Technical Details

Check the technical details to see answers to the following questions:

How to rotate my IP addresses?
How to hide my bot from being detected?
How & why to simulate human behaviors?
How to render as many pages as possible on a single machine without being blocked?

🐦 Contact

Wechat: galaxyeye
Weibo: galaxyeye
Email: [email protected], [email protected]
Twitter: galaxyeye8
Website: platon.ai

For Tasks:

Click tags to check more tools for each tasks

scrape product data extract web content automate web interactions parse complex websites handle large-scale data extraction

For Jobs:

automation engineer data analyst web scraper software developer data scientist

Alternative AI tools for PulsarRPA

Similar Open Source Tools

PulsarRPA

github

: 805

morphik-core

Morphik is an AI-native toolset designed to help developers integrate context into their AI applications by providing tools to store, represent, and search unstructured data. It offers features such as multimodal search, fast metadata extraction, and integrations with existing tools. Morphik aims to address the challenges of traditional AI approaches that struggle with visually rich documents and provide a more comprehensive solution for understanding and processing complex data.

github

: 3.5k

reductstore

ReductStore is a high-performance time series database designed for storing and managing large amounts of unstructured blob data. It offers features such as real-time querying, batching data, and HTTP(S) API for edge computing, computer vision, and IoT applications. The database ensures data integrity, implements retention policies, and provides efficient data access, making it a cost-effective solution for applications requiring unstructured data storage and access at specific time intervals.

github

: 242

kdbai-samples

KDB.AI is a time-based vector database that allows developers to build scalable, reliable, and real-time applications by providing advanced search, recommendation, and personalization for Generative AI applications. It supports multiple index types, distance metrics, top-N and metadata filtered retrieval, as well as Python and REST interfaces. The repository contains samples demonstrating various use-cases such as temporal similarity search, document search, image search, recommendation systems, sentiment analysis, and more. KDB.AI integrates with platforms like ChatGPT, Langchain, and LlamaIndex. The setup steps require Unix terminal, Python 3.8+, and pip installed. Users can install necessary Python packages and run Jupyter notebooks to interact with the samples.

github

: 95

nucliadb

NucliaDB is a robust database that allows storing and searching on unstructured data. It is an out of the box hybrid search database, utilizing vector, full text and graph indexes. NucliaDB is written in Rust and Python. We designed it to index large datasets and provide multi-teanant support. When utilizing NucliaDB with Nuclia cloud, you are able to the power of an NLP database without the hassle of data extraction, enrichment and inference. We do all the hard work for you.

github

: 704

llm-app

Pathway's LLM (Large Language Model) Apps provide a platform to quickly deploy AI applications using the latest knowledge from data sources. The Python application examples in this repository are Docker-ready, exposing an HTTP API to the frontend. These apps utilize the Pathway framework for data synchronization, API serving, and low-latency data processing without the need for additional infrastructure dependencies. They connect to document data sources like S3, Google Drive, and Sharepoint, offering features like real-time data syncing, easy alert setup, scalability, monitoring, security, and unification of application logic.

github

: 22.9k

AntSK

AntSK is an AI knowledge base/agent built with .Net8+Blazor+SemanticKernel. It features a semantic kernel for accurate natural language processing, a memory kernel for continuous learning and knowledge storage, a knowledge base for importing and querying knowledge from various document formats, a text-to-image generator integrated with StableDiffusion, GPTs generation for creating personalized GPT models, API interfaces for integrating AntSK into other applications, an open API plugin system for extending functionality, a .Net plugin system for integrating business functions, real-time information retrieval from the internet, model management for adapting and managing different models from different vendors, support for domestic models and databases for operation in a trusted environment, and planned model fine-tuning based on llamafactory.

github

: 1.3k

supervisely

Supervisely is a computer vision platform that provides a range of tools and services for developing and deploying computer vision solutions. It includes a data labeling platform, a model training platform, and a marketplace for computer vision apps. Supervisely is used by a variety of organizations, including Fortune 500 companies, research institutions, and government agencies.

github

: 517

project_alice

Alice is an agentic workflow framework that integrates task execution and intelligent chat capabilities. It provides a flexible environment for creating, managing, and deploying AI agents for various purposes, leveraging a microservices architecture with MongoDB for data persistence. The framework consists of components like APIs, agents, tasks, and chats that interact to produce outputs through files, messages, task results, and URL references. Users can create, test, and deploy agentic solutions in a human-language framework, making it easy to engage with by both users and agents. The tool offers an open-source option, user management, flexible model deployment, and programmatic access to tasks and chats.

github

: 176

pathway

Pathway is a Python data processing framework for analytics and AI pipelines over data streams. It's the ideal solution for real-time processing use cases like streaming ETL or RAG pipelines for unstructured data. Pathway comes with an **easy-to-use Python API** , allowing you to seamlessly integrate your favorite Python ML libraries. Pathway code is versatile and robust: **you can use it in both development and production environments, handling both batch and streaming data effectively**. The same code can be used for local development, CI/CD tests, running batch jobs, handling stream replays, and processing data streams. Pathway is powered by a **scalable Rust engine** based on Differential Dataflow and performs incremental computation. Your Pathway code, despite being written in Python, is run by the Rust engine, enabling multithreading, multiprocessing, and distributed computations. All the pipeline is kept in memory and can be easily deployed with **Docker and Kubernetes**. You can install Pathway with pip: `pip install -U pathway` For any questions, you will find the community and team behind the project on Discord.

github

: 59.6k

Customer-Service-Conversational-Insights-with-Azure-OpenAI-Services

This solution accelerator is built on Azure Cognitive Search Service and Azure OpenAI Service to synthesize post-contact center transcripts for intelligent contact center scenarios. It converts raw transcripts into customer call summaries to extract insights around product and service performance. Key features include conversation summarization, key phrase extraction, speech-to-text transcription, sensitive information extraction, sentiment analysis, and opinion mining. The tool enables data professionals to quickly analyze call logs for improvement in contact center operations.

github

: 191

Conversation-Knowledge-Mining-Solution-Accelerator

The Conversation Knowledge Mining Solution Accelerator enables customers to leverage intelligence to uncover insights, relationships, and patterns from conversational data. It empowers users to gain valuable knowledge and drive targeted business impact by utilizing Azure AI Foundry, Azure OpenAI, Microsoft Fabric, and Azure Search for topic modeling, key phrase extraction, speech-to-text transcription, and interactive chat experiences.

github

: 255

ChainForge

ChainForge is a visual programming environment for battle-testing prompts to LLMs. It is geared towards early-stage, quick-and-dirty exploration of prompts, chat responses, and response quality that goes beyond ad-hoc chatting with individual LLMs. With ChainForge, you can: * Query multiple LLMs at once to test prompt ideas and variations quickly and effectively. * Compare response quality across prompt permutations, across models, and across model settings to choose the best prompt and model for your use case. * Setup evaluation metrics (scoring function) and immediately visualize results across prompts, prompt parameters, models, and model settings. * Hold multiple conversations at once across template parameters and chat models. Template not just prompts, but follow-up chat messages, and inspect and evaluate outputs at each turn of a chat conversation. ChainForge comes with a number of example evaluation flows to give you a sense of what's possible, including 188 example flows generated from benchmarks in OpenAI evals. This is an open beta of Chainforge. We support model providers OpenAI, HuggingFace, Anthropic, Google PaLM2, Azure OpenAI endpoints, and Dalai-hosted models Alpaca and Llama. You can change the exact model and individual model settings. Visualization nodes support numeric and boolean evaluation metrics. ChainForge is built on ReactFlow and Flask.

github

: 2.8k

MemoryBear

MemoryBear is a next-generation AI memory system developed by RedBear AI, focusing on overcoming limitations in knowledge storage and multi-agent collaboration. It empowers AI with human-like memory capabilities, enabling deep knowledge understanding and cognitive collaboration. The system addresses challenges such as knowledge forgetting, memory gaps in multi-agent collaboration, and semantic ambiguity during reasoning. MemoryBear's core features include memory extraction engine, graph storage, hybrid search, memory forgetting engine, self-reflection engine, and FastAPI services. It offers a standardized service architecture for efficient integration and invocation across applications.

github

: 244

EDA-GPT

EDA GPT is an open-source data analysis companion that offers a comprehensive solution for structured and unstructured data analysis. It streamlines the data analysis process, empowering users to explore, visualize, and gain insights from their data. EDA GPT supports analyzing structured data in various formats like CSV, XLSX, and SQLite, generating graphs, and conducting in-depth analysis of unstructured data such as PDFs and images. It provides a user-friendly interface, powerful features, and capabilities like comparing performance with other tools, analyzing large language models, multimodal search, data cleaning, and editing. The tool is optimized for maximal parallel processing, searching internet and documents, and creating analysis reports from structured and unstructured data.

github

: 160

airbyte-platform

Airbyte is an open-source data integration platform that makes it easy to move data from any source to any destination. With Airbyte, you can build and manage data pipelines without writing any code. Airbyte provides a library of pre-built connectors that make it easy to connect to popular data sources and destinations. You can also create your own connectors using Airbyte's low-code Connector Development Kit (CDK). Airbyte is used by data engineers and analysts at companies of all sizes to move data for a variety of purposes, including data warehousing, data analysis, and machine learning.

github

Stars: 273

For similar tasks

LaVague

LaVague is an open-source Large Action Model framework that uses advanced AI techniques to compile natural language instructions into browser automation code. It leverages Selenium or Playwright for browser actions. Users can interact with LaVague through an interactive Gradio interface to automate web interactions. The tool requires an OpenAI API key for the default examples and offers a Playwright integration guide. Contributors can help by working on outlined tasks, submitting PRs, and engaging with the community on Discord. A project roadmap is available to track progress. Users should exercise caution when executing LLM-generated code using 'exec'.

github

Stars: 5.8k

AutoNode

AutoNode is a self-operating computer system designed to automate web interactions and data extraction processes. It leverages advanced technologies like OCR (Optical Character Recognition), YOLO (You Only Look Once) models for object detection, and a custom site-graph to navigate and interact with web pages programmatically. Users can define objectives, create site-graphs, and utilize AutoNode via API to automate tasks on websites. The tool also supports training custom YOLO models for object detection and OCR for text recognition on web pages. AutoNode can be used for tasks such as extracting product details, automating web interactions, and more.

github

Stars: 116

PulsarRPA

github

Stars: 805

agent-browser

agent-browser is a headless browser automation CLI tool designed for AI agents. It is a fast Rust CLI tool with a Node.js fallback. The tool allows users to automate web interactions, perform various browser actions, interact with elements using semantic locators, wait for specific conditions, control mouse and keyboard events, manage browser settings, handle cookies and storage, monitor network requests, work with tabs and windows, interact with frames and dialogs, debug browser sessions, navigate pages, set up sessions, use persistent profiles, and take snapshots with filtering options. It can also control the browser via CDP, stream the browser viewport, work with iOS simulators and real devices, and use Browserbase, Browser Use, and Kernel cloud browser infrastructure. It supports multiple platforms and browsers, provides a comprehensive set of commands for web automation, and is suitable for AI agents and coding assistants.

github

Stars: 13.9k

Scrapling

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For both beginners and experts, Scrapling provides powerful features while maintaining simplicity. It offers features like fast and stealthy HTTP requests, adaptive scraping with smart element tracking and flexible selection, high performance with lightning-fast speed and memory efficiency, and developer-friendly navigation API and rich text processing. It also includes advanced parsing features like smart navigation, content-based selection, handling structural changes, and finding similar elements. Scrapling is designed to handle anti-bot protections and website changes effectively, making it a versatile tool for web scraping tasks.

github

Stars: 7.3k

openbrowser-ai

OpenBrowser is a framework for intelligent browser automation that combines direct CDP communication with a CodeAgent architecture. It allows users to navigate, interact with, and extract information from web pages autonomously. The tool supports various LLM providers, offers vision support for screenshot analysis, and includes an MCP (Model Context Protocol) server. Users can record browser sessions as video files, and full documentation is available at docs.openbrowser.me.

github

Stars: 91

For similar jobs

aiscript

AiScript is a lightweight scripting language that runs on JavaScript. It supports arrays, objects, and functions as first-class citizens, and is easy to write without the need for semicolons or commas. AiScript runs in a secure sandbox environment, preventing infinite loops from freezing the host. It also allows for easy provision of variables and functions from the host.

github

Stars: 227

askui

AskUI is a reliable end-to-end automation tool that depends only on what is shown on your screen, not on the technology or platform you are running on.

github

Stars: 83

bots

The 'bots' repository is a collection of guides, tools, and example bots for programming bots to play video games. It provides resources on running bots live, installing the BotLab client, debugging bots, testing bots in simulated environments, and more. The repository also includes example bots for games like EVE Online, Tribal Wars 2, and Elvenar. Users can learn about developing bots for specific games, Elm syntax and the Elm core library, and tools for memory reading development. Additionally, there are guides on bot programming and contributing to BotLab.

github

Stars: 179

ain

Ain is a terminal HTTP API client designed for scripting input and processing output via pipes. It allows flexible organization of APIs using files and folders, supports shell scripts and executables for common tasks, handles URL encoding, and enables sharing the resulting curl, wget, or httpie command line. Users can put things that change in environment variables or .env files, and pipe the API output for further processing. Ain targets users who work with many APIs; it uses a simple file format and relies on curl, wget, or httpie to make the actual calls.

github

Stars: 592

LaVague

github

Stars: 5.8k

robocorp

Robocorp is a platform that allows users to create, deploy, and operate Python automations and AI actions. It provides an easy way to extend the capabilities of AI agents, assistants, and copilots with custom actions written in Python. Users can create and deploy tools, skills, loaders, and plugins that securely connect any AI Assistant platform to their data and applications. The Robocorp Action Server makes Python scripts compatible with ChatGPT and LangChain by automatically creating and exposing an API based on function declaration, type hints, and docstrings. It simplifies the process of developing and deploying AI actions, enabling users to interact with AI frameworks effortlessly.

github

Stars: 501

Open-Interface

Open Interface is self-driving software that automates computer tasks by sending user requests to a language model backend (e.g., GPT-4V) and simulating keyboard and mouse inputs to execute the steps. It course-corrects by sending current screenshots to the language model. The tool supports macOS, Linux, and Windows, and requires an OpenAI API key for access to GPT-4V. It can automate tasks such as creating meal plans, and it supports setting up custom language model backends. Open Interface is currently weak at accurate spatial reasoning, tracking itself in tabular contexts, and navigating complex GUI-rich applications. Future improvements aim to enhance the tool's capabilities with better models trained on video walkthroughs. User requests cost between $0.05 and $0.20 each, and the tool offers features like interrupting the app and primary display visibility in multi-monitor setups.

github

Stars: 934

AI-Case-Sorter-CS7.1

AI-Case-Sorter-CS7.1 is a project focused on building a case sorter using machine vision and machine learning AI to sort cases by headstamp. The repository includes Arduino code and 3D models necessary for the project.

github

Stars: 67