CyberScraper-2077

A Powerful web scraper powered by LLM | OpenAI, Gemini & Ollama

Stars: 951

Visit

CyberScraper 2077 is an advanced web scraping tool powered by AI, designed to extract data from websites with precision and style. It offers a user-friendly interface, supports multiple data export formats, operates in stealth mode to avoid detection, and promises lightning-fast scraping. The tool respects ethical scraping practices, including robots.txt and site policies. With upcoming features like proxy support and page navigation, CyberScraper 2077 is a futuristic solution for data extraction in the digital realm.

README:

🌐 CyberScraper 2077

Rip data from the net, leaving no trace. Welcome to the future of web scraping.

🔍 About

CyberScraper 2077 is not just another web scraping tool – it's a glimpse into the future of data extraction. Born from the neon-lit streets of a cyberpunk world, this AI-powered scraper uses OpenAI, Gemini and LocalLLM Models to slice through the web's defenses, extracting the data you need with unparalleled precision and style.

Whether you're a corpo data analyst, a street-smart netrunner, or just someone looking to pull information from the digital realm, CyberScraper 2077 has got you covered.

✨ Features

🤖 AI-Powered Extraction: Utilizes cutting-edge AI models to understand and parse web content intelligently.
🖥️ Sleek Streamlit Interface: User-friendly GUI that even a chrome-armed street samurai could navigate.
🔄 Multi-Format Support: Export your data in JSON, CSV, HTML, SQL or Excel – whatever fits your cyberdeck.
🕵️ Stealth Mode: Implemented stealth mode parameters that help avoid detection as a bot.
🦙 Ollama Support: Use a huge library of open source LLMs.
⚡ Async Operations: Lightning-fast scraping that would make a Trauma Team jealous.
🧠 Smart Parsing: Structures scraped content as if it was extracted straight from the engram of a master netrunner.
💾 Caching: Implemented content-based and query-based caching using LRU cache and a custom dictionary to reduce redundant API calls.
📊 Upload to Google Sheets: Now you can easily upload your extracted CSV data to Google Sheets with one click.
🛡️ Bypass Captcha: Bypass captcha by using the -captcha at the end of the URL. (Currently only works natively, doesn't work on Docker)
🌐 Current Browser: The current browser feature uses your local browser instance which will help you bypass 99% of bot detections. (Only use when necessary)
🔒 Proxy Mode (Coming Soon): Built-in proxy support to keep you ghosting through the net.
🧭 Navigate through the Pages (BETA): Navigate through the webpage and scrape data from different pages.

🎥 Demo

Check out our Redesigned and Improved Version of CyberScraper-2077 with more functionality YouTube video for a full walkthrough of CyberScraper 2077's capabilities.

Check out our first build (Old Video) YouTube video

🪟 For Windows Users

Please follow the Docker Container Guide given below, as I won't be able to maintain another version for Windows systems.

🛠 Installation

Note: CyberScraper 2077 requires Python 3.10 or higher.

Clone this repository:

git clone https://github.com/itsOwen/CyberScraper-2077.git
cd CyberScraper-2077

Create and activate a virtual environment:

virtualenv venv
source venv/bin/activate  # Optional

Install the required packages:
```
pip install -r requirements.txt
```
Install the playwright:
```
playwright install
```

Set OpenAI & Gemini Key in your environment:

Linux/Mac:

export OPENAI_API_KEY="your-api-key-here"

export GOOGLE_API_KEY="your-api-key-here"

If you want to use Ollama:

Note: I only recommend using OpenAI and Gemini API as these models are really good at following instructions. If you are using open-source LLMs, make sure you have a good system as the speed of the data generation/presentation depends on how well your system can run the LLM. You may also have to fine-tune the prompt and add some additional filters yourself.

1. Setup Ollama using `pip install ollama`
2. Download Ollama from the official website: https://ollama.com/download
3. Now type: ollama pull llama3.1 or whatever LLM you want to use.
4. Now follow the rest of the steps below.

🐳 Docker Installation

If you prefer to use Docker, follow these steps to set up and run CyberScraper 2077:

Ensure you have Docker installed on your system.

Clone this repository:

git clone https://github.com/itsOwen/CyberScraper-2077.git
cd CyberScraper-2077

Build the Docker image:
```
docker build -t cyberscraper-2077 .
```

Run the container:

Without API key:

docker run -p 8501:8501 cyberscraper-2077

With OpenAI API key:

docker run -p 8501:8501 -e OPENAI_API_KEY="your-actual-api-key" cyberscraper-2077

With Gemini API key:

docker run -p 8501:8501 -e GOOGLE_API_KEY="your-actual-api-key" cyberscraper-2077

Open your browser and navigate to http://localhost:8501.

Using Ollama with Docker

If you want to use Ollama with the Docker setup:

Install Ollama on your host machine following the instructions at https://ollama.com/download
Run Ollama on your host machine:
```
ollama pull llama3.1
```
Find your host machine's IP address:
- On Linux/Mac: ifconfig or ip addr show
- On Windows: ipconfig

Run the Docker container with the host network and set the Ollama URL:

docker run -e OLLAMA_BASE_URL=http://host.docker.internal:11434 -p 8501:8501 cyberscraper-2077

On Linux you might need to use this below:

docker run -e OLLAMA_BASE_URL=http://<your-host-ip>:11434 -p 8501:8501 cyberscraper-2077

Replace <your-host-ip> with your actual host machine IP address.

In the Streamlit interface, select the Ollama model you want to use (e.g., "ollama:llama3.1").

Note: Ensure that your firewall allows connections to port 11434 for Ollama.

🚀 Usage

Fire up the Streamlit app:
```
streamlit run main.py
```
Open your browser and navigate to http://localhost:8501.
Enter the URL of the site you want to scrape or ask a question about the data you need.
Ask the chatbot to extract the data in any format. Select whatever data you want to export or even everything from the webpage.
Watch as CyberScraper 2077 tears through the net, extracting your data faster than you can say "flatline"!

🌐 Multi-Page Scraping (BETA)

Note: The multi-page scraping feature is currently in beta. While functional, you may encounter occasional issues or unexpected behavior. We appreciate your feedback and patience as we continue to improve this feature.

CyberScraper 2077 now supports multi-page scraping, allowing you to extract data from multiple pages of a website in one go. This feature is perfect for scraping paginated content, search results, or any site with data spread across multiple pages.

How to Use Multi-Page Scraping

I suggest you enter the URL structure every time if you want to scrape multiple pages so it can detect the URL structure easily. It detects nearly all URL types.

Basic Usage: To scrape multiple pages, use the following format when entering the URL:
```
https://example.com/page 1-5
https://example.com/p/ 1-6
https://example.com/xample/something-something-1279?p=1 1-3
```
This will scrape pages 1 through 5 of the website.
Custom Page Ranges: You can specify custom page ranges:
```
https://example.com/p/ 1-5,7,9-12
https://example.com/xample/something-something-1279?p=1 1,7,8,9
```
This will scrape pages 1 to 5, page 7, and pages 9 to 12.
URL Patterns: For websites with different URL structures, you can specify a pattern:
```
https://example.com/search?q=cyberpunk&page={page} 1-5
```
Replace {page} with where the page number should be in the URL.
Automatic Pattern Detection: If you don't specify a pattern, CyberScraper 2077 will attempt to detect the URL pattern automatically. However, for best results, specifying the pattern is recommended.

Tips for Effective Multi-Page Scraping

Start with a small range of pages to test before scraping a large number.
Be mindful of the website's load and your scraping speed to avoid overloading servers.
Use the simulate_human option for more natural scraping behavior on sites with anti-bot measures.
Regularly check the website's robots.txt file and terms of service to ensure compliance.

Example

URL Example : "https://news.ycombinator.com/?p=1 1-3 or 1,2,3,4"

If you want to scrape a specific page, just enter the query "please scrape page number 1 or 2". If you want to scrape all pages, simply give a query like "scrape all pages in csv" or whatever format you want.

Handling Errors

If you encounter errors during multi-page scraping:

Check your internet connection
Verify the URL pattern is correct
Ensure the website allows scraping
Try reducing the number of pages or increasing the delay between requests

Beta Feedback

As this feature is in beta, we highly value your feedback. If you encounter any issues or have suggestions for improvement, please:

Open an issue on our GitHub repository
Provide detailed information about the problem, including the URL structure and number of pages you were attempting to scrape
Share any error messages or unexpected behaviors you observed

Your input is crucial in helping us refine and stabilize this feature for future releases.

Setup Google Sheets Authentication:

Go to the Google Cloud Console (https://console.cloud.google.com/).
Select your project.
Navigate to "APIs & Services" > "Credentials".
Find your existing OAuth 2.0 Client ID and delete it.
Click "Create Credentials" > "OAuth client ID".
Choose "Web application" as the application type.
Name your client (e.g., "CyberScraper 2077 Web Client").
Under "Authorized JavaScript origins", add:
Under "Authorized redirect URIs", add:
Click "Create" to generate the new client ID.
Download the new client configuration JSON file and rename it to client_secret.json.

Adjusting PlaywrightScraper Settings (optional)

Customize the PlaywrightScraper settings to fit your scraping needs. If some websites are giving you issues, you might want to check the behavior of the website:

use_stealth: bool = True,
simulate_human: bool = False,
use_custom_headers: bool = True,
hide_webdriver: bool = True,
bypass_cloudflare: bool = True:

Adjust these settings based on your target website and environment for optimal results.

You can also bypass the captcha using the -captcha parameter at the end of the URL. The browser window will pop up, complete the captcha, and go back to your terminal window. Press enter and the bot will complete its task.

🤝 Contributing

We welcome all cyberpunks, netrunners, and code samurais to contribute to CyberScraper 2077!

🔧 Troubleshooting

Ran into a glitch in the matrix? Let me know by adding the issue to this repo so that we can fix it together.

❓ FAQ

Q: Is CyberScraper 2077 legal to use? A: CyberScraper 2077 is designed for ethical web scraping. Always ensure you have the right to scrape a website and respect their robots.txt file.

Q: Can I use this for commercial purposes? A: Yes, under the terms of the MIT License. But remember, in Night City, there's always a price to pay. Just kidding!

📄 License

This project is licensed under the MIT License - see the LICENSE file for details. Use it, mod it, sell it – just don't blame us if you end up flatlined.

📞 Contact

Got questions? Need support? Want to hire me for a gig?

📧 Email: [email protected]
🐦 Twitter: @owensingh_
💬 Website: Portfolio

🚨 Disclaimer

Listen up, choombas! Before you jack into this code, you better understand the risks:

This software is provided "as is", without warranty of any kind, express or implied.
The authors are not liable for any damages or losses resulting from the use of this software.
This tool is intended for educational and research purposes only. Any illegal use is strictly prohibited.
We do not guarantee the accuracy, completeness, or reliability of any data obtained through this tool.
By using this software, you acknowledge that you are doing so at your own risk.
You are responsible for complying with all applicable laws and regulations in your use of this software.
We reserve the right to modify or discontinue the software at any time without notice.

Remember, samurai: In the dark future of the NET, knowledge is power, but it's also a double-edged sword. Use this tool wisely, and may your connection always be strong and your firewalls impenetrable. Stay frosty out there in the digital frontier.

CyberScraper 2077 – Because in 2077, what makes someone a criminal? Getting caught.

For Tasks:

Click tags to check more tools for each tasks

extract data scrape websites parse content export data navigate webpages

For Jobs:

data analyst web developer data scientist research analyst digital marketer

Alternative AI tools for CyberScraper-2077

Similar Open Source Tools

CyberScraper-2077

github

: 951

ImageIndexer

LLMII is a tool that uses a local AI model to label metadata and index images without relying on cloud services or remote APIs. It runs a visual language model on your computer to generate captions and keywords for images, enhancing their metadata for indexing, searching, and organization. The tool can be run multiple times on the same image files, allowing for adding new data, regenerating data, and discovering files with issues. It supports various image formats, offers a user-friendly GUI, and can utilize GPU acceleration for faster processing. LLMII requires Python 3.8 or higher and operates directly on image file metadata fields like MWG:Keyword and XMP:Identifier.

github

: 156

coral-cloud

Coral Cloud Resorts is a sample hospitality application that showcases Data Cloud, Agents, and Prompts. It provides highly personalized guest experiences through smart automation, content generation, and summarization. The app requires licenses for Data Cloud, Agents, Prompt Builder, and Einstein for Sales. Users can activate features, deploy metadata, assign permission sets, import sample data, and troubleshoot common issues. Additionally, the repository offers integration with modern web development tools like Prettier, ESLint, and pre-commit hooks for code formatting and linting.

github

: 62

azure-search-openai-javascript

This sample demonstrates a few approaches for creating ChatGPT-like experiences over your own data using the Retrieval Augmented Generation pattern. It uses Azure OpenAI Service to access the ChatGPT model (gpt-35-turbo), and Azure AI Search for data indexing and retrieval.

github

: 277

Open_Data_QnA

Open Data QnA is a Python library that allows users to interact with their PostgreSQL or BigQuery databases in a conversational manner, without needing to write SQL queries. The library leverages Large Language Models (LLMs) to bridge the gap between human language and database queries, enabling users to ask questions in natural language and receive informative responses. It offers features such as conversational querying with multiturn support, table grouping, multi schema/dataset support, SQL generation, query refinement, natural language responses, visualizations, and extensibility. The library is built on a modular design and supports various components like Database Connectors, Vector Stores, and Agents for SQL generation, validation, debugging, descriptions, embeddings, responses, and visualizations.

github

: 127

clickolas-cage

Clickolas-cage is a Chrome extension designed to autonomously perform web browsing actions to achieve specific goals using LLM as a brain. Users can interact with the extension by setting goals, which triggers a series of actions including navigation, element extraction, and step generation. The extension is developed using Node.js and can be locally run for testing and development purposes before packing it for submission to the Chrome Web Store.

github

: 81

Perplexica

Perplexica is an open-source AI-powered search engine that utilizes advanced machine learning algorithms to provide clear answers with sources cited. It offers various modes like Copilot Mode, Normal Mode, and Focus Modes for specific types of questions. Perplexica ensures up-to-date information by using SearxNG metasearch engine. It also features image and video search capabilities and upcoming features include finalizing Copilot Mode and adding Discover and History Saving features.

github

: 21.0k

classifai

Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

github

: 620

crawlee-python

Crawlee-python is a web scraping and browser automation library that covers crawling and scraping end-to-end, helping users build reliable scrapers fast. It allows users to crawl the web for links, scrape data, and store it in machine-readable formats without worrying about technical details. With rich configuration options, users can customize almost any aspect of Crawlee to suit their project's needs.

github

: 5.5k

obsidian-chat-cbt-plugin

ChatCBT is an AI-powered journaling assistant for Obsidian, inspired by cognitive behavioral therapy (CBT). It helps users reframe negative thoughts and rewire reactions to distressful situations. The tool provides kind and objective responses to uncover negative thinking patterns, store conversations privately, and summarize reframed thoughts. Users can choose between a cloud-based AI service (OpenAI) or a local and private service (Ollama) for handling data. ChatCBT is not a replacement for therapy but serves as a journaling assistant to help users gain perspective on their problems.

github

: 191

cognita

Cognita is an open-source framework to organize your RAG codebase along with a frontend to play around with different RAG customizations. It provides a simple way to organize your codebase so that it becomes easy to test it locally while also being able to deploy it in a production ready environment. The key issues that arise while productionizing RAG system from a Jupyter Notebook are: 1. **Chunking and Embedding Job** : The chunking and embedding code usually needs to be abstracted out and deployed as a job. Sometimes the job will need to run on a schedule or be trigerred via an event to keep the data updated. 2. **Query Service** : The code that generates the answer from the query needs to be wrapped up in a api server like FastAPI and should be deployed as a service. This service should be able to handle multiple queries at the same time and also autoscale with higher traffic. 3. **LLM / Embedding Model Deployment** : Often times, if we are using open-source models, we load the model in the Jupyter notebook. This will need to be hosted as a separate service in production and model will need to be called as an API. 4. **Vector DB deployment** : Most testing happens on vector DBs in memory or on disk. However, in production, the DBs need to be deployed in a more scalable and reliable way. Cognita makes it really easy to customize and experiment everything about a RAG system and still be able to deploy it in a good way. It also ships with a UI that makes it easier to try out different RAG configurations and see the results in real time. You can use it locally or with/without using any Truefoundry components. However, using Truefoundry components makes it easier to test different models and deploy the system in a scalable way. Cognita allows you to host multiple RAG systems using one app. ### Advantages of using Cognita are: 1. A central reusable repository of parsers, loaders, embedders and retrievers. 2. Ability for non-technical users to play with UI - Upload documents and perform QnA using modules built by the development team. 3. Fully API driven - which allows integration with other systems. > If you use Cognita with Truefoundry AI Gateway, you can get logging, metrics and feedback mechanism for your user queries. ### Features: 1. Support for multiple document retrievers that use `Similarity Search`, `Query Decompostion`, `Document Reranking`, etc 2. Support for SOTA OpenSource embeddings and reranking from `mixedbread-ai` 3. Support for using LLMs using `Ollama` 4. Support for incremental indexing that ingests entire documents in batches (reduces compute burden), keeps track of already indexed documents and prevents re-indexing of those docs.

github

: 3.2k

LLavaImageTagger

LLMImageIndexer is an intelligent image processing and indexing tool that leverages local AI to generate comprehensive metadata for your image collection. It uses advanced language models to analyze images and generate captions and keyword metadata. The tool offers features like intelligent image analysis, metadata enhancement, local processing, multi-format support, user-friendly GUI, GPU acceleration, cross-platform support, stop and start capability, and keyword post-processing. It operates directly on image file metadata, allowing users to manage files, add new files, and run the tool multiple times without reprocessing previously keyworded files. Installation instructions are provided for Windows, macOS, and Linux platforms, along with usage guidelines and configuration options.

github

: 97

Akagi

Akagi is a project designed to help users understand and improve their performance in Majsoul game matches in real-time. It provides educational insights and tools for analyzing gameplay. Users can install Akagi on Windows or Mac systems and follow the setup instructions to enhance their gaming experience. The project aims to offer features like Autoplay, Auto Ron, and integration with MajsoulUnlocker. It also focuses on enhancing user safety by providing guidelines to minimize the risk of account suspension. Akagi is a tool that combines MITM interception, AI decision-making, and user interaction to optimize gameplay strategies and performance.

github

: 258

devika

Devika is an advanced AI software engineer that can understand high-level human instructions, break them down into steps, research relevant information, and write code to achieve the given objective. Devika utilizes large language models, planning and reasoning algorithms, and web browsing abilities to intelligently develop software. Devika aims to revolutionize the way we build software by providing an AI pair programmer who can take on complex coding tasks with minimal human guidance. Whether you need to create a new feature, fix a bug, or develop an entire project from scratch, Devika is here to assist you.

github

: 18.1k

litlytics

LitLytics is an affordable analytics platform leveraging LLMs for automated data analysis. It simplifies analytics for teams without data scientists, generates custom pipelines, and allows customization. Cost-efficient with low data processing costs. Scalable and flexible, works with CSV, PDF, and plain text data formats.

github

: 83

chat-with-notes

Chat-with-Notes is a Flask web application that enables users to upload text files, view their content, and engage with an AI chatbot for discussions. The application prioritizes privacy by utilizing a locally hosted Ollama Llama 3.1 (8B) model for AI responses, ensuring data security. Users can upload files during conversations, clear chat history, and export chat logs. The tool operates locally, requiring Python 3.x, pip, Git, and a locally running Ollama Llama 3.1 (8B) model as prerequisites.

github

: 60

For similar tasks

mage-ai

Mage is an open-source data pipeline tool for transforming and integrating data. It offers an easy developer experience, engineering best practices built-in, and data as a first-class citizen. Mage makes it easy to build, preview, and launch data pipelines, and provides observability and scaling capabilities. It supports data integrations, streaming pipelines, and dbt integration.

github

: 7.8k

nucliadb

NucliaDB is a robust database that allows storing and searching on unstructured data. It is an out of the box hybrid search database, utilizing vector, full text and graph indexes. NucliaDB is written in Rust and Python. We designed it to index large datasets and provide multi-teanant support. When utilizing NucliaDB with Nuclia cloud, you are able to the power of an NLP database without the hassle of data extraction, enrichment and inference. We do all the hard work for you.

github

: 688

LLMstudio

LLMstudio by TensorOps is a platform that offers prompt engineering tools for accessing models from providers like OpenAI, VertexAI, and Bedrock. It provides features such as Python Client Gateway, Prompt Editing UI, History Management, and Context Limit Adaptability. Users can track past runs, log costs and latency, and export history to CSV. The tool also supports automatic switching to larger-context models when needed. Coming soon features include side-by-side comparison of LLMs, automated testing, API key administration, project organization, and resilience against rate limits. LLMstudio aims to streamline prompt engineering, provide execution history tracking, and enable effortless data export, offering an evolving environment for teams to experiment with advanced language models.

github

: 311

CyberScraper-2077

github

: 951

kangaroo

Kangaroo is an AI-powered SQL client and admin tool for popular databases like SQLite, MySQL, PostgreSQL, etc. It supports various functionalities such as table design, query, model, sync, export/import, and more. The tool is designed to be comfortable, fun, and developer-friendly, with features like code intellisense and autocomplete. Kangaroo aims to provide a seamless experience for database management across different operating systems.

github

: 370

skyvern

Skyvern automates browser-based workflows using LLMs and computer vision. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions. Traditional approaches to browser automations required writing custom scripts for websites, often relying on DOM parsing and XPath-based interactions which would break whenever the website layouts changed. Instead of only relying on code-defined XPath interactions, Skyvern adds computer vision and LLMs to the mix to parse items in the viewport in real-time, create a plan for interaction and interact with them. This approach gives us a few advantages: 1. Skyvern can operate on websites it’s never seen before, as it’s able to map visual elements to actions necessary to complete a workflow, without any customized code 2. Skyvern is resistant to website layout changes, as there are no pre-determined XPaths or other selectors our system is looking for while trying to navigate 3. Skyvern leverages LLMs to reason through interactions to ensure we can cover complex situations. Examples include: 1. If you wanted to get an auto insurance quote from Geico, the answer to a common question “Were you eligible to drive at 18?” could be inferred from the driver receiving their license at age 16 2. If you were doing competitor analysis, it’s understanding that an Arnold Palmer 22 oz can at 7/11 is almost definitely the same product as a 23 oz can at Gopuff (even though the sizes are slightly different, which could be a rounding error!) Want to see examples of Skyvern in action? Jump to #real-world-examples-of- skyvern

github

: 12.9k

airbyte-connectors

This repository contains Airbyte connectors used in Faros and Faros Community Edition platforms as well as Airbyte Connector Development Kit (CDK) for JavaScript/TypeScript.

github

: 115

open-parse

Open Parse is a Python library for visually discerning document layouts and chunking them effectively. It is designed to fill the gap in open-source libraries for handling complex documents. Unlike text splitting, which converts a file to raw text and slices it up, Open Parse visually analyzes documents for superior LLM input. It also supports basic markdown for parsing headings, bold, and italics, and has high-precision table support, extracting tables into clean Markdown formats with accuracy that surpasses traditional tools. Open Parse is extensible, allowing users to easily implement their own post-processing steps. It is also intuitive, with great editor support and completion everywhere, making it easy to use and learn.

github

: 2.4k

For similar jobs

lollms-webui

LoLLMs WebUI (Lord of Large Language Multimodal Systems: One tool to rule them all) is a user-friendly interface to access and utilize various LLM (Large Language Models) and other AI models for a wide range of tasks. With over 500 AI expert conditionings across diverse domains and more than 2500 fine tuned models over multiple domains, LoLLMs WebUI provides an immediate resource for any problem, from car repair to coding assistance, legal matters, medical diagnosis, entertainment, and more. The easy-to-use UI with light and dark mode options, integration with GitHub repository, support for different personalities, and features like thumb up/down rating, copy, edit, and remove messages, local database storage, search, export, and delete multiple discussions, make LoLLMs WebUI a powerful and versatile tool.

github

: 4.6k

Azure-Analytics-and-AI-Engagement

The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (Containing a demo web application, Power BI reports, Synapse resources, AML Notebooks etc.) that can be deployed in a customer’s subscription using the CAPE tool within a matter of few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.

github

: 136

minio

MinIO is a High Performance Object Storage released under GNU Affero General Public License v3.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads.

github

: 46.0k

mage-ai

github

: 7.8k

AiTreasureBox

AiTreasureBox is a versatile AI tool that provides a collection of pre-trained models and algorithms for various machine learning tasks. It simplifies the process of implementing AI solutions by offering ready-to-use components that can be easily integrated into projects. With AiTreasureBox, users can quickly prototype and deploy AI applications without the need for extensive knowledge in machine learning or deep learning. The tool covers a wide range of tasks such as image classification, text generation, sentiment analysis, object detection, and more. It is designed to be user-friendly and accessible to both beginners and experienced developers, making AI development more efficient and accessible to a wider audience.

github

: 368

tidb

TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.

github

: 37.1k

airbyte

Airbyte is an open-source data integration platform that makes it easy to move data from any source to any destination. With Airbyte, you can build and manage data pipelines without writing any code. Airbyte provides a library of pre-built connectors that make it easy to connect to popular data sources and destinations. You can also create your own connectors using Airbyte's no-code Connector Builder or low-code CDK. Airbyte is used by data engineers and analysts at companies of all sizes to build and manage their data pipelines.

github

: 17.8k

labelbox-python

Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.

github

: 135