AutoNode

AutoNode: A Neuro-Graphic Self-Learnable Engine for Cognitive GUI Automation

Stars: 116

Visit

AutoNode is a self-operating computer system designed to automate web interactions and data extraction processes. It leverages advanced technologies like OCR (Optical Character Recognition), YOLO (You Only Look Once) models for object detection, and a custom site-graph to navigate and interact with web pages programmatically. Users can define objectives, create site-graphs, and utilize AutoNode via API to automate tasks on websites. The tool also supports training custom YOLO models for object detection and OCR for text recognition on web pages. AutoNode can be used for tasks such as extracting product details, automating web interactions, and more.

README:

Generalized Super Intelligence Research, Infrastructure & Autonomous Apps

Follow SuperAGI

Connect with the Creator

Share This Repository

AutoNode: A Neuro-Graphic Self-Learnable Engine for Cognitive GUI Automation

What is AutoNode?

Installation

To get started with AutoNode, you need to have Python installed on your system. Follow these steps to install AutoNode:

Open your terminal and clone the SuperAGI repository.

git clone https://github.com/TransformerOptimus/AutoNode.git

Navigate to the root of the cloned repository directory using the command:

cd AutoNode

Create a copy of .env.example , and name it .env. Repeat this step for all the three modules - autonode, yolo, ocr
Ensure that Docker is installed on your system. You can download and install it from here.
Once you have Docker Desktop running, run the following command in the AutoNode directory:

docker compose -f docker-compose.yaml up --build

Open your web browser and navigate to http://localhost:8001/health to check server is running

How to Use AutoNode

AutoNode operates based on a site-graph that defines the navigation and actions to be performed on a website. Here's a basic overview of how to use AutoNode:

Define Your Objective: Specify what you want to achieve with AutoNode, such as data extraction or automation of specific web interactions.
Prepare Your Autonode-Site-Graph: Create a JSON file that represents the site-graph. This graph outlines the nodes (web elements) and edges (actions) that AutoNode will navigate and interact with.
Prepare you AutoNode initiator planner prompt : Using the template structure given for planner prompts in openai_prompts.py, for OpenAI LLM, you can create a new prompt file in the prompts directory with structure <llm_prompts.py>
Run AutoNode:

Using AutoNode via API

AutoNode can be controlled and utilized through its API, allowing users to automate web interactions and data extraction tasks programmatically. This guide will walk you through the process of sending requests to AutoNode using its API endpoint.

Accessing the API Documentation

Before you start, ensure that AutoNode is running on your local machine. Once AutoNode is up and running, you can access the API documentation by visiting:
```
http://localhost:8001/docs
```
This URL will take you to the Swagger UI, where you can find detailed documentation on all available API endpoints, including the one used to initiate AutoNode tasks.

Sending a Request to AutoNode

To automate a task with AutoNode, you will use the /api/autonode/initiate endpoint. This endpoint accepts a JSON payload that specifies the task's objective, the path to the site-graph JSON file, the root node to start traversal, and the URL of the website you wish to interact with.

Request Structure

Here is the structure of the JSON payload you need to send to the /api/autonode/initiate endpoint:
```
{
    "site_url": "string",
    "objective": "string",
    "graph_path": "string",
    "planer_prompt": "string"

}
```
Example request: { "site_url": "https://app.apollo.io/#/login", "objective": "Find the list of 20 ceo, cto of tech companies in san francisco. Login into apollo using the creds [email protected] and password dummypassword@123", "graph_path": "autonode/site_trees/apollo.json" "planner_prompt": "apollo" }
- site_url: The URL of the website AutoNode will visit and interact with.
- objective: The goal you want to achieve on the website. This could be anything from data extraction to automating a series of web interactions. Make sure you provide the login instructions along with the credentials if login is required in your use-case
- graph_path: The path to the JSON file that contains your site-graph. The site-graph defines the structure and navigation flow of the website for AutoNode.
- planner_prompt: The key for planning prompt. You can change or map new key in planning_agent.py for AutoNode planner prompt.
Using CURL
```
curl -X 'POST' \
    'http://localhost:8001/api/autonode/initiate' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
    "site_url": "https://example.com/products",
    "objective": "Extract product details",
    "graph_path": "/path/to/your/site-graph.json"
    "planner_prompt": "planner_key"
    
}'
```

YOLO/OCR Models

AutoNode utilizes YOLO models for object detection and OCR for text recognition on web pages. These models are crucial for identifying clickable elements, reading text from images, and interacting with web pages dynamically.

We are providing some general yolo models trained on YOLO-V8 over thousands of web-screenshots Navigate to - yolo/web_detection_models/ dir to find those

How to Train Your Own YOLO Model

Collect a Dataset: Gather images of the web elements you want to detect and annotate them with bounding boxes.
Prepare the Dataset: Split your dataset into training and validation sets.
Train the Model: Use a YOLO training script to train your model on the prepared dataset. Adjust the training parameters according to your needs.
Evaluate the Model: Test your trained model on a separate test set to evaluate its performance.
Integrate with AutoNode: Once trained, integrate your custom YOLO model with AutoNode by specifying the model path in the configuration.

Hosting Custom YOLO model and OCR apps remotely

If you don't have enough resources on your local machine, you can host the OCR and YOLO modules on any cloud server. Following steps can be taken:

In your ocr/.env file, add USE_REMOTE_OCR=True and set the url for the remote service in OCR_REMOTE_URL
In your yolo/.env file, add USE_REMOTE_YOLO=True and set the url for the remote service in YOLO_REMOTE_URL
Update the yolo web detection model path in your yolo/.env file, add SAHI_MODEL_PATH and ULTRALYTICS_MODEL_PATH. Example: SAHI_MODEL_PATH = yolo/web_detection_models/twitter.pt

AutoNode-Site-Graph Preparation

The site-graph is a JSON file that describes the structure and navigation flow of a website for AutoNode. Here's how to prepare it:

Identify Web Elements: Navigate through your target website and identify the key elements you want to interact with, such as buttons, text-boxes, and links.
Define Nodes: For each web element, define a node in the JSON file. Include properties like node_name, actionable_element_type, location, and is_type.
Define Edges: Specify the relationships between nodes using adjacent_to and adjacent_from properties to represent the navigation flow.
Include Action Details: For nodes that require typing or clicking, provide additional details like type_description or click_action.

Example of a simple site-graph:

{
    "1": {
        "node_type": "clickable_and_typeable",
        "node_name": "Login Button",
        "actionable_element_type": "button",
        "location": [100, 200],
        "is_type": false,
        "adjacent_to": ["2"]
    },
    "2": {
        "node_type": "clickable_and_typeable",
        "node_name": "Username Field",
        "actionable_element_type": "textbox",
        "location": [150, 250],
        "is_type": true,
        "type_description": "Enter username here",
        "adjacent_to": []
    }
}

Storing Debugging Screenshots & Downloaded Output

Screenshots at every node for web element detection is stored in requests directory in the root folder. You can store them on an AWS s3 account in case you wish to, or persist them locally or neither as per your choice. For use-cases which require downloading output, downloadable content (like output in apollo) needs to be stored either locally or remotely.

Storing Screenshots & Downloads Remotely

In your autonode/.env file, configure AWS keys AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from your AWS console
In autonode/services/autonode.py, uncomment the following snippet at line 35,

   # Uncomment If you have aws account and want to store result in your AWS S3
   # self.s3_client = S3Helper(access_key=self.config.AWS_ACCESS_KEY_ID,
   #                           secret_key=self.config.AWS_SECRET_ACCESS_KEY,
   #                           bucket_name=self.config.bucket_name)

and pass s3_client=self.s3_client at line 84

   s3_client=None

In autonode/utils/screenshot_generator.py uncomment this on line 18,

   # Uncomment If you have aws account and want to store result in your AWS S3
   s3_client.upload_file(file_path=screenshot_filename)

In autonode/nodes/download.py uncomment this on line 44

   # Uncomment If you have aws account and want to store result in your AWS S3
   s3_client.upload_file(file_path=download_file_path)

Storing Screenshots & Downloaded files locally

By default, the screenshots and downloaded output are stored locally during execution and are deleted as the task completes or fails
In autonode/worker.py comment this finally block on line 79 to persist screenshots locally,

   finally:
           # Comment if you don't want to delete screenshots locally
           if os.path.exists(screenshots_dir):
               logger.info(f"Deleting request directory {screenshots_dir}")
               shutil.rmtree(screenshots_dir)
           session.close()

For Tasks:

Click tags to check more tools for each tasks

automate data extraction interact with web pages train yolo models ocr text recognition automate web interactions

For Jobs:

web developer data analyst automation engineer ai engineer software developer

Alternative AI tools for AutoNode

Similar Open Source Tools

AutoNode

github

: 116

Figma-Context-MCP

Figma-Context-MCP is a plugin for Figma that allows users to easily manage and switch between multiple design contexts within a single Figma file. This tool simplifies the process of working on different design variations or versions by providing a seamless way to organize and switch between them. With Figma-Context-MCP, designers can streamline their workflow and improve collaboration by keeping all design contexts in one place and easily accessible. This plugin enhances productivity and efficiency for Figma users who frequently work on multiple design iterations or versions within a project.

github

: 10.6k

palimpzest

Palimpzest (PZ) is a tool for managing and optimizing workloads, particularly for data processing tasks. It provides a CLI tool and Python demos for users to register datasets, run workloads, and access results. Users can easily initialize their system, register datasets, and manage configurations using the CLI commands provided. Palimpzest also supports caching intermediate results and configuring for parallel execution with remote services like OpenAI and together.ai. The tool aims to streamline the workflow of working with datasets and optimizing performance for data extraction tasks.

github

: 144

Trinity

Trinity is an Explainable AI (XAI) Analysis and Visualization tool designed for Deep Learning systems or other models performing complex classification or decoding. It provides performance analysis through interactive 3D projections that are hyper-dimensional aware, allowing users to explore hyperspace, hypersurface, projections, and manifolds. Trinity primarily works with JSON data formats and supports the visualization of FeatureVector objects. Users can analyze and visualize data points, correlate inputs with classification results, and create custom color maps for better data interpretation. Trinity has been successfully applied to various use cases including Deep Learning Object detection models, COVID gene/tissue classification, Brain Computer Interface decoders, and Large Language Model (ChatGPT) Embeddings Analysis.

github

: 93

Guardrails

Guardrails is a security tool designed to help developers identify and fix security vulnerabilities in their code. It provides automated scanning and analysis of code repositories to detect potential security issues, such as sensitive data exposure, injection attacks, and insecure configurations. By integrating Guardrails into the development workflow, teams can proactively address security concerns and reduce the risk of security breaches. The tool offers detailed reports and actionable recommendations to guide developers in remediation efforts, ultimately improving the overall security posture of the codebase. Guardrails supports multiple programming languages and frameworks, making it versatile and adaptable to different development environments. With its user-friendly interface and seamless integration with popular version control systems, Guardrails empowers developers to prioritize security without compromising productivity.

github

: 5.1k

neo4j-graphrag-python

The Neo4j GraphRAG package for Python is an official repository that provides features for creating and managing vector indexes in Neo4j databases. It aims to offer developers a reliable package with long-term commitment, maintenance, and fast feature updates. The package supports various Python versions and includes functionalities for creating vector indexes, populating them, and performing similarity searches. It also provides guidelines for installation, examples, and development processes such as installing dependencies, making changes, and running tests.

github

: 463

zep

Zep is a long-term memory service for AI Assistant apps. With Zep, you can provide AI assistants with the ability to recall past conversations, no matter how distant, while also reducing hallucinations, latency, and cost. Zep persists and recalls chat histories, and automatically generates summaries and other artifacts from these chat histories. It also embeds messages and summaries, enabling you to search Zep for relevant context from past conversations. Zep does all of this asyncronously, ensuring these operations don't impact your user's chat experience. Data is persisted to database, allowing you to scale out when growth demands. Zep also provides a simple, easy to use abstraction for document vector search called Document Collections. This is designed to complement Zep's core memory features, but is not designed to be a general purpose vector database. Zep allows you to be more intentional about constructing your prompt: 1. automatically adding a few recent messages, with the number customized for your app; 2. a summary of recent conversations prior to the messages above; 3. and/or contextually relevant summaries or messages surfaced from the entire chat session. 4. and/or relevant Business data from Zep Document Collections.

github

: 2.4k

vscode-pddl

The vscode-pddl extension provides comprehensive support for Planning Domain Description Language (PDDL) in Visual Studio Code. It enables users to model planning domains, validate them, industrialize planning solutions, and run planners. The extension offers features like syntax highlighting, auto-completion, plan visualization, plan validation, plan happenings evaluation, search debugging, and integration with Planning.Domains. Users can create PDDL files, run planners, visualize plans, and debug search algorithms efficiently within VS Code.

github

: 81

langchain

LangChain is a framework for developing Elixir applications powered by language models. It enables applications to connect language models to other data sources and interact with the environment. The library provides components for working with language models and off-the-shelf chains for specific tasks. It aims to assist in building applications that combine large language models with other sources of computation or knowledge. LangChain is written in Elixir and is not aimed for parity with the JavaScript and Python versions due to differences in programming paradigms and design choices. The library is designed to make it easy to integrate language models into applications and expose features, data, and functionality to the models.

github

: 830

talking-avatar-with-ai

The 'talking-avatar-with-ai' project is a digital human system that utilizes OpenAI's GPT-3 for generating responses, Whisper for audio transcription, Eleven Labs for voice generation, and Rhubarb Lip Sync for lip synchronization. The system allows users to interact with a digital avatar that responds with text, facial expressions, and animations, creating a realistic conversational experience. The project includes setup for environment variables, chat prompt templates, chat model configuration, and structured output parsing to enhance the interaction with the digital human.

github

: 132

codecompanion.nvim

CodeCompanion.nvim is a Neovim plugin that provides a Copilot Chat experience, adapter support for various LLMs, agentic workflows, inline code creation and modification, built-in actions for language prompts and error fixes, custom actions creation, async execution, and more. It supports Anthropic, Ollama, and OpenAI adapters. The plugin is primarily developed for personal workflows with no guarantees of regular updates or support. Users can customize the plugin to their needs by forking the project.

github

: 5.3k

kafka-ml

Kafka-ML is a framework designed to manage the pipeline of Tensorflow/Keras and PyTorch machine learning models on Kubernetes. It enables the design, training, and inference of ML models with datasets fed through Apache Kafka, connecting them directly to data streams like those from IoT devices. The Web UI allows easy definition of ML models without external libraries, catering to both experts and non-experts in ML/AI.

github

: 163

numerapi

Numerapi is a Python client to the Numerai API that allows users to automatically download and upload data for the Numerai machine learning competition. It provides functionalities for downloading training data, uploading predictions, and accessing user, submission, and competitions information for both the main competition and Numerai Signals competition. Users can interact with the API using Python modules or command line interface. Tokens are required for certain actions like uploading predictions or staking, which can be obtained from Numer.ai account settings. The tool also supports features like checking new rounds, getting leaderboards, and managing stakes.

github

: 174

perplexity-mcp

MCP-researcher Server is a tool that serves as your research assistant inside Cline, utilizing Perplexity's Sonar Pro API to access documentation, create API routes, and check for deprecated code. It includes Chain of Thought Reasoning and local chat history through SQLite. The tool offers functionalities like general search queries, retrieving documentation, finding APIs, and analyzing deprecated code. Installation can be done via Smithery or manually by cloning the repository, installing dependencies, and setting up the Perplexity API key and server configurations for Claude Desktop and Cline.

github

: 150

lingo

Lingo is a lightweight ML model proxy that runs on Kubernetes, allowing you to run text-completion and embedding servers without changing OpenAI client code. It supports serving OSS LLMs, is compatible with OpenAI API, plug-and-play with messaging systems, scales from zero based on load, and has zero dependencies. Namespaced with no cluster privileges needed.

github

: 95

labelbox-python

Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.

github

: 135

For similar tasks

genaiscript

GenAIScript is a scripting environment designed to facilitate file ingestion, prompt development, and structured data extraction. Users can define metadata and model configurations, specify data sources, and define tasks to extract specific information. The tool provides a convenient way to analyze files and extract desired content in a structured format. It offers a user-friendly interface for working with data and automating data extraction processes, making it suitable for various data processing tasks.

github

: 2.8k

AutoNode

github

: 116

oxylabs-mcp

The Oxylabs MCP Server acts as a bridge between AI models and the web, providing clean, structured data from any site. It enables scraping of URLs, rendering JavaScript-heavy pages, content extraction for AI use, bypassing anti-scraping measures, and accessing geo-restricted web data from 195+ countries. The implementation utilizes the Model Context Protocol (MCP) to facilitate secure interactions between AI assistants and web content. Key features include scraping content from any site, automatic data cleaning and conversion, bypassing blocks and geo-restrictions, flexible setup with cross-platform support, and built-in error handling and request management.

github

: 61

LaVague

LaVague is an open-source Large Action Model framework that uses advanced AI techniques to compile natural language instructions into browser automation code. It leverages Selenium or Playwright for browser actions. Users can interact with LaVague through an interactive Gradio interface to automate web interactions. The tool requires an OpenAI API key for default examples and offers a Playwright integration guide. Contributors can help by working on outlined tasks, submitting PRs, and engaging with the community on Discord. The project roadmap is available to track progress, but users should exercise caution when executing LLM-generated code using 'exec'.

github

: 5.8k

PulsarRPA

PulsarRPA is a high-performance, distributed, open-source Robotic Process Automation (RPA) framework designed to handle large-scale RPA tasks with ease. It provides a comprehensive solution for browser automation, web content understanding, and data extraction. PulsarRPA addresses challenges of browser automation and accurate web data extraction from complex and evolving websites. It incorporates innovative technologies like browser rendering, RPA, intelligent scraping, advanced DOM parsing, and distributed architecture to ensure efficient, accurate, and scalable web data extraction. The tool is open-source, customizable, and supports cutting-edge information extraction technology, making it a preferred solution for large-scale web data extraction.

github

: 805

playword

PlayWord is a tool designed to supercharge web test automation experience with AI. It provides core features such as enabling browser operations and validations using natural language inputs, as well as monitoring interface to record and dry-run test steps. PlayWord supports multiple AI services including Anthropic, Google, and OpenAI, allowing users to select the appropriate provider based on their requirements. The tool also offers features like assertion handling, frame handling, custom variables, test recordings, and an Observer module to track user interactions on web pages. With PlayWord, users can interact with web pages using natural language commands, reducing the need to worry about element locators and providing AI-powered adaptation to UI changes.

github

: 52

WebDreamer

WebDreamer is a model-based planning tool for web agents that uses large language models (LLMs) as a world model of the internet to predict outcomes of actions on websites. It employs LLM-based simulation for speculative planning on the web, offering greater safety and flexibility compared to traditional tree search methods. The tool provides modules for world model prediction, simulation scoring, and controller actions, enabling users to interact with web pages and achieve specific goals through simulated actions.

github

: 82

For similar jobs

sweep

Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.

github

: 7.1k

teams-ai

The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.

github

: 502

ai-guide

This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

github

: 159

classifai

Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

github

: 668

chatbot-ui

Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.

github

: 27.7k

BricksLLM

BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM: * Set LLM usage limits for users on different pricing tiers * Track LLM usage on a per user and per organization basis * Block or redact requests containing PIIs * Improve LLM reliability with failovers, retries and caching * Distribute API keys with rate limits and cost limits for internal development/production use cases * Distribute API keys with rate limits and cost limits for students

github

: 953

uAgents

uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.

github

: 1.3k

griptape

Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.

github

: 2.2k

AutoNode

README:

AutoNode: A Neuro-Graphic Self-Learnable Engine for Cognitive GUI Automation

What is AutoNode?

Installation

How to Use AutoNode

Using AutoNode via API

Accessing the API Documentation

Sending a Request to AutoNode

Request Structure

Using CURL

YOLO/OCR Models

How to Train Your Own YOLO Model

Hosting Custom YOLO model and OCR apps remotely

AutoNode-Site-Graph Preparation

Storing Debugging Screenshots & Downloaded Output

For Tasks:

For Jobs:

Alternative AI tools for AutoNode

Similar Open Source Tools

AutoNode

Figma-Context-MCP

palimpzest

Trinity

Guardrails

neo4j-graphrag-python

zep

vscode-pddl

langchain

talking-avatar-with-ai

codecompanion.nvim

kafka-ml

numerapi

perplexity-mcp

lingo

labelbox-python

For similar tasks

genaiscript

AutoNode

oxylabs-mcp

LaVague

PulsarRPA

playword

WebDreamer

For similar jobs

sweep

teams-ai

ai-guide

classifai

chatbot-ui

BricksLLM

uAgents

griptape