awesome-generative-ai-data-scientist

A curated list of 100+ resources for building and deploying generative AI, focused on helping you become a Generative AI Data Scientist with LLMs.

Stars: 393

A curated list of 100+ resources to help you become a Generative AI Data Scientist. This repository includes resources on building GenAI applications with Large Language Models (LLMs) and deploying LLMs and GenAI with cloud-based solutions.

README:

Awesome Generative AI Data Scientist
100+ Free Resources On Generative AI for Data Scientists

The Future is using AI and ML Together

🚀🚀 100+ Free Resources On Generative AI for Data Scientists

A curated list of 100+ resources to help you become a Generative AI Data Scientist. This repository includes resources on building GenAI Data Science applications with Large Language Models (LLMs) and deploying LLMs and Generative AI/ML with Cloud-based solutions.

Please ⭐ us on GitHub (it takes 2 seconds and means a lot).

Contributing

Contributions are welcome! Please submit a pull request or open an issue if you have suggestions for new resources or improvements to existing ones. Thanks for your support!

Awesome Real-World AI Use Cases

Project Description Links
🚀🚀 AI-Powered Data Science Team In Python An AI-powered data science team of copilots that uses agents to help you perform common data science tasks 10X faster. Apps | Examples | GitHub
🚀 Awesome LLM Apps LLM RAG AI Apps with Step-By-Step Tutorials. GitHub
AI Hedge Fund Proof of concept for an AI-powered hedge fund. GitHub
AI Financial Agent A financial agent for investment research. GitHub
Structured Report Generation (LangGraph) How to build an agent that can orchestrate the end-to-end process of report planning, web research, and writing. Produces reports of varying and easily configurable formats. Video | Blog | Code
Uber QueryGPT Uber's QueryGPT uses large language models (LLMs), vector databases, and similarity search to generate complex queries from English (natural language) questions, enhancing productivity for engineers, operations managers, and data scientists. Blog
Nir Diamant GenAI Agents Hub Tutorials and implementations for various Generative AI Agent techniques, from basic to advanced. A comprehensive guide for building intelligent, interactive AI systems. GitHub
AI Engineering Hub Real-world AI agent applications, LLM and RAG tutorials, with examples to implement. GitHub
StockChat An open-source alternative to Perplexity Finance. GitHub

Python Libraries

Curated Python AI, Data Science, and ML Compilations

Project Description Links
Awesome Generative AI Data Scientist A curated list of 100+ resources for building and deploying generative AI, focused on helping you become a Generative AI Data Scientist. GitHub
LLM Engineer Toolkit A curated list of 120+ LLM libraries organized by category. GitHub
🚀🚀 Best of ML Python A ranked list of awesome machine learning Python libraries. Updated weekly. GitHub
🚀🚀 Awesome Python Data Science Probably the best curated list of data science software in Python. GitHub
Awesome Production Machine Learning A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning GitHub
Awesome AI Agents A list of AI autonomous agents GitHub

Data Science And AI Agents

Project Description Links
🚀🚀 AI Data Science Team In Python AI Agents to help you perform common data science tasks 10X faster. Apps | Examples | GitHub
🚀 PandasAI Open Source AI Agents for Data Analysis. Documentation | GitHub
Microsoft Data Formulator Transform data and create rich visualizations iteratively with AI 🪄. Paper | GitHub
Jupyter Agent Let an LLM agent write and execute code inside a notebook. Hugging Face
Jupyter AI A generative AI extension for JupyterLab. Documentation | GitHub
WrenAI Open-source GenBI AI Agent. Text2SQL made Easy! Documentation | GitHub
Google GenAI Toolbox for Databases Gen AI Toolbox for Databases is an open-source server that makes it easier to build Gen AI tools for interacting with databases. Blog | Documentation | GitHub
Vanna AI The fastest way to get actionable insights from your SQL database just by asking questions. Documentation | GitHub

Coding Agents

Project Description Links
Qwen-Agent A framework for developing LLM applications based on the instruction following, tool usage, planning, and memory capabilities of Qwen. It also comes with example applications such as Browser Assistant, Code Interpreter, and Custom Assistant. Documentation | Examples | GitHub

Deep Research Agents

Project Description Links
HuggingFace OpenDeepResearch Open version of OpenAI's Deep Research agent. Blog | Example | GitHub
OpenDeepResearcher AI researcher that continuously searches for information based on a user query until the system is confident that it has gathered all the necessary details. GitHub

AI Frameworks (Build Your Own)

Project Description Links
LangChain A framework for developing applications powered by large language models (LLMs). Documentation | GitHub | Cookbook
LangGraph A library for building stateful, multi-actor applications with LLMs, used to create agent and multi-agent workflows. Documentation | Tutorials
LangSmith A platform for building production-grade LLM applications. It allows you to closely monitor and evaluate your application, so you can quickly and confidently ship. Documentation | GitHub
LlamaIndex A framework for building context-augmented generative AI applications with LLMs. Documentation | GitHub
LlamaIndex Workflows A mechanism for orchestrating actions in increasingly complex AI applications. Blog
CrewAI Streamline workflows across industries with powerful AI agents. Documentation | GitHub
AutoGen Microsoft's programming framework for agentic AI. GitHub
Pydantic AI Python agent framework designed to make building production-grade applications with Generative AI less painful. GitHub
ControlFlow Prefect's Python framework for building agentic AI workflows. Documentation | GitHub
FlatAI Frameworkless LLM Agents. GitHub
Llama Stack Meta (Facebook) core building blocks needed to bring generative AI applications to market. Documentation | GitHub
Haystack Deepset AI's open-source AI orchestration framework for building customizable, production-ready LLM applications. Documentation | GitHub
Pocket Flow A 100-line minimalist LLM framework for Agents, Task Decomposition, RAG, etc. Documentation | GitHub
Agency Swarm An open-source agent orchestration framework built on top of the latest OpenAI Assistants API. Documentation | GitHub
Google GenAI Google Gen AI Python SDK provides an interface for developers to integrate Google's generative models into their Python applications. Documentation | GitHub
AutoAgent A fully-automated and highly self-developing framework that enables users to create and deploy LLM agents through natural language alone. GitHub
Legion A flexible and provider-agnostic framework designed to simplify the creation of sophisticated multi-agent systems. Documentation | GitHub

AI Frameworks (Drag and Drop)

Project Description Links
LangGraph Studio IDE that enables visualization, interaction, and debugging of complex agentic applications. GitHub
Langflow A low-code tool that makes building powerful AI agents and workflows that can use any API, model, or database easier. Documentation | GitHub
Pyspur Graph-Based Editor for LLM Workflows. Documentation | GitHub
LangWatch Monitor, Evaluate & Optimize your LLM performance with 1-click. Drag and drop interface for LLMOps platform. Documentation | GitHub
AutoGen Studio A low-code interface to rapidly prototype AI agents, enhance them with tools, compose them into teams, and interact with them to accomplish tasks. Built on AutoGen AgentChat. Documentation
n8n Fair-code workflow automation platform with native AI capabilities. Combine visual building with custom code, self-host or cloud, 400+ integrations. Documentation | GitHub

LLM Providers

Provider Description Links
OpenAI The official Python library for the OpenAI API. GitHub
OpenAI Agents The OpenAI Agents SDK is a lightweight yet powerful framework for building multi-agent workflows. GitHub
Hugging Face Models Open LLM models by Meta, Mistral, and hundreds of other providers. Hugging Face
Anthropic Claude The official Python library for the Anthropic API. GitHub
Meta Llama Models The open-source AI model you can fine-tune, distill, and deploy anywhere. Meta
Google Gemini The official Python library for the Google Gemini API. GitHub
Ollama Get up and running with large language models locally. GitHub
Groq The official Python library for the Groq API. GitHub

Open Source LLM Models

Project Description Links
DeepSeek-R1 1st generation reasoning model that competes with OpenAI o1. Paper | GitHub
Qwen Alibaba's Qwen models. GitHub
Llama Meta's foundational models. GitHub

LangChain Ecosystem

Project Description Links
LangChain A framework for developing applications powered by large language models (LLMs). Documentation | GitHub | Cookbook
LangGraph A library for building stateful, multi-actor applications with LLMs, used to create agent and multi-agent workflows. Documentation | Tutorials
LangSmith A platform for building production-grade LLM applications. It allows you to closely monitor and evaluate your application, so you can quickly and confidently ship. Documentation | GitHub

LangGraph Extensions

Project Description Links
LangGraph Prebuilt Agents Prebuilt agents for LangGraph (includes 3rd Party LangGraph extensions). Documentation
AI Data Science Team An AI-powered data science team of agents to help you perform common data science tasks 10X faster. GitHub
LangMem LangMem provides tooling to extract important information from conversations, optimize agent behavior through prompt refinement, and maintain long-term memory. GitHub
LangGraph Supervisor A Python library for creating hierarchical multi-agent systems using LangGraph. GitHub
Open Deep Research An open-source assistant that automates research and produces customizable reports on any topic. GitHub
LangGraph Reflection This prebuilt graph is an agent that uses a reflection-style architecture to check and improve an initial agent's output. GitHub
LangGraph Big Tool Create LangGraph agents that can access large numbers of tools. GitHub
LangGraph CodeAct This library implements the CodeAct architecture in LangGraph. This architecture is used by Manus.im. GitHub
LangGraph Swarm Create swarm-style multi-agent systems using LangGraph. Agents dynamically hand off control to one another based on their specializations. GitHub
LangChain MCP Adapters Provides a lightweight wrapper that makes Anthropic Model Context Protocol (MCP) tools compatible with LangChain and LangGraph. GitHub

Huggingface Ecosystem

Project Description Links
Huggingface An open-source platform for machine learning (ML) and artificial intelligence (AI) tools and models. Documentation
Transformers Transformers provides APIs and tools to easily download and train state-of-the-art pretrained models. Documentation
Tokenizers Tokenizers provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Documentation | GitHub
Sentence Transformers Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. Documentation
smolagents The simplest framework out there to build powerful agents. Documentation | GitHub

Vector Databases (RAG)

Project Description Links
ChromaDB The fastest way to build Python or JavaScript LLM apps with memory! GitHub
FAISS A library for efficient similarity search and clustering of dense vectors. GitHub
Qdrant High-Performance Vector Search at Scale. Website
Pinecone The official Pinecone Python SDK. GitHub
Milvus Milvus is an open-source vector database built to power embedding similarity search and AI applications. GitHub
SQLite Vec A vector search SQLite extension that runs anywhere! GitHub

Pretraining

Project Description Links
PyTorch PyTorch is an open-source machine learning library based on the Torch library, used for applications such as computer vision and natural language processing. Website
TensorFlow TensorFlow is an open-source machine learning library developed by Google. Website
JAX Google's library for high-performance computing and automatic differentiation. GitHub
tinygrad A minimalistic deep learning library with a focus on simplicity and educational use, created by George Hotz. GitHub
micrograd A simple, lightweight autograd engine for educational purposes, created by Andrej Karpathy. GitHub

Fine-tuning

Project Description Links
Transformers Hugging Face Transformers is a popular library for Natural Language Processing (NLP) tasks, including fine-tuning large language models. Documentation
Unsloth Finetune Llama 3.2, Mistral, Phi-3.5 & Gemma 2-5x faster with 80% less memory! GitHub
LitGPT 20+ high-performance LLMs with recipes to pretrain, finetune, and deploy at scale. GitHub
AutoTrain No code fine-tuning of LLMs and other machine learning tasks. GitHub

Testing and Monitoring (Observability)

Project Description Links
LangSmith LangSmith is a platform for building production-grade LLM applications. It allows you to closely monitor and evaluate your application, so you can quickly and confidently ship. Documentation | GitHub
LangWatch Monitor, Evaluate & Optimize your LLM performance with 1-click. Drag and drop interface for LLMOps platform. Documentation | GitHub
Opik Opik is an open-source platform for evaluating, testing, and monitoring LLM applications. GitHub
MLflow Tracing and Evaluation MLflow has a suite of features for LLMs. MLflow LLM Documentation | Model Tracing | Model Evaluation | GitHub
Langfuse Traces, evals, prompt management, and metrics to debug and improve your LLM application. Documentation | GitHub

Document Parsing

Project Description Links
LangChain Document Loaders LangChain has hundreds of integrations with various data sources to load data from: Slack, Notion, Google Drive, etc. Documentation
Embedchain Create an AI app on your own data in a minute. Documentation | GitHub
Docling by IBM Parse documents and export them to the desired format with ease and speed. GitHub
Markitdown by Microsoft Python tool for converting files and office documents to Markdown. GitHub
DocETL A system for agentic LLM-powered data processing and ETL. Documentation | GitHub

Web Parsing (HTML) and Web Crawling

Project Description Links
Gitingest Turn any Git repository into a simple text ingest of its codebase. This is useful for feeding a codebase into any LLM. GitHub
Crawl4AI Open-source, blazing-fast, AI-ready web crawling tailored for LLMs, AI agents, and data pipelines. Documentation | GitHub
GPT Crawler Crawl a site to generate knowledge files to create your own custom GPT from a URL. Documentation | GitHub
ScrapeGraphAI A web scraping Python library that uses LLM and direct graph logic to create scraping pipelines for websites and local documents (XML, HTML, JSON, Markdown, etc.). Documentation | GitHub
Scrapling 🕷️ Undetectable, Lightning-Fast, and Adaptive Web Scraping for Python. GitHub
Firecrawl 🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl, and extract with a single API. Documentation | GitHub

Agents and Tools (Build Your Own)

Project Description Links
LangChain Agents Build agents with LangChain. Documentation
LangChain Tools Integrate Tools (Function Calling) with LangChain. Documentation
smolagents The simplest framework out there to build powerful agents. Documentation | GitHub
Agentarium Open-source framework for creating and managing simulations populated with AI-powered agents. It provides an intuitive platform for designing complex, interactive environments where agents can act, learn, and evolve. GitHub
AutoGen AgentChat Build applications quickly with preset agents. Documentation

Agents and Tools (Prebuilt)

Project Description Links
Agno (Formerly Phidata) An open-source platform to build, ship and monitor agentic systems. Documentation | GitHub
Composio Integration Platform for AI Agents & LLMs (works with LangChain, CrewAI, etc). Documentation | GitHub

LLM Memory

Project Description Links
Mem0 Mem0 is a self-improving memory layer for LLM applications, enabling personalized AI experiences that save costs and delight users. Documentation | GitHub
Memary Open Source Memory Layer For Autonomous Agents. GitHub
Memobase 1st User Profile-Based Memory for GenAI Apps. Documentation | GitHub

LLMOps

Project Description Links
LangWatch Monitor, Evaluate & Optimize your LLM performance with 1-click. Drag and drop interface for LLMOps platform. Documentation | GitHub
MLflow MLflow Tracing for LLM Observability. Documentation
Agenta Open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM Observability all in one place. Documentation
LLMOps Best practices designed to support your LLMOps initiatives. GitHub
Helicone Open-source LLM observability platform for developers to monitor, debug, and improve production-ready applications. Documentation | GitHub

Code Sandbox (Security)

Project Description Links
E2B E2B is an open-source runtime for executing AI-generated code in secure cloud sandboxes. Made for agentic & AI use cases. Documentation | GitHub
AutoGen Docker Code Executor Executes code through a command line environment in a Docker container. Documentation

Browser Control Agents

Project Description Links
Browser-Use Make websites accessible for AI agents. Documentation | GitHub
WebUI Built on Gradio and supports most of browser-use functionalities. This UI is designed to be user-friendly and enables easy interaction with the browser agent. GitHub
WebRover WebRover is an AI-powered web agent that combines autonomous browsing with advanced research capabilities. GitHub

Prompt Improvement

Project Description Links
Microsoft PromptWizard Task-Aware Prompt Optimization Framework. GitHub
Promptify A library for prompt engineering that simplifies NLP tasks (e.g., NER, classification) using LLMs like GPT. GitHub
AutoPrompt A framework for prompt tuning using Intent-based Prompt Calibration. GitHub

Other

Project Description Links
AI Suite Simple, unified interface to multiple Generative AI providers. GitHub
AdalFlow The library to build & auto-optimize LLM applications, from Chatbot, RAG, to Agent by SylphAI. GitHub
dspy DSPy: The framework for programming, not prompting, foundation models. GitHub
LiteLLM Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format. GitHub
AI Agent Service Toolkit Full toolkit for running an AI agent service built with LangGraph, FastAPI, and Streamlit. App | GitHub
Microsoft Tiny Troupe LLM-powered multiagent persona simulation for imagination enhancement and business insights. GitHub
Distributed Llama Connect home devices into a powerful cluster to accelerate LLM inference. GitHub

R Libraries

Curated AI, ML, Data Science Lists

Project Description Links
LLM tools for R An ongoing roundup of useful developments in the LLM/genAI space, with a specific focus on R. Website

Ellmer-Verse

Project Description Links
ellmer Makes it easy to use large language models (LLM) from R. It supports a wide variety of LLM providers and implements a rich set of features including streaming outputs, tool/function calling, structured data extraction, and more. Website
hellmer Enables sequential and parallel batch processing for chat models supported by ellmer. Documentation
chores Provides a library of ergonomic LLM assistants designed to help you complete repetitive, hard-to-automate tasks quickly. Documentation
ggpal LLM assistant specifically for ggplot2. GitHub
gander A high-performance and low-friction chat experience for data scientists in RStudio and Positron, sort of like completions with Copilot, but it knows how to talk to the objects in your R environment. Documentation

mlverse

Project Description Links
mall Run multiple LLM predictions against a data frame. The predictions are processed row-wise over a specified column. Website
lang Use an LLM to translate a function's help documentation on-the-fly. Website
chattr An interface to LLMs (Large Language Models). Website

Other Popular Interfaces to LLM Models in R

Project Description Links
chatgpt Interface with models from OpenAI to get assistance while coding. GitHub
groqR Brings GroqCloud's lightning-fast LPU (Language Processing Unit) technology directly to your R workflow. Website
gptstudio Easily incorporate large language models (LLMs) into your project workflows. Website
llmR R interface to various Large Language Models (LLMs) such as OpenAI's GPT models, Azure's language models, Google's Gemini models, or custom local servers. GitHub
tidychatmodels A simple interface to chat with your favorite AI chatbot from R, inspired by tidymodels where you can easily swap out any ML model for another one but keep the other parts of the workflow the same. Website
tidyllm Access various large language model APIs, including Anthropic Claude, OpenAI, Google Gemini, Perplexity, Groq, Mistral, and local models via Ollama or OpenAI-compatible APIs. Website
gemini.R R package to use Google's Gemini via API on R. Website
PerplexR Intuitive interface for leveraging the capabilities of the Perplexity API Pro subscription. GitHub
ollama-r The easiest way to integrate R with Ollama, which lets you run language models locally on your own machine. Website
rollama Wraps the Ollama API, which allows you to run different LLMs locally and create an experience similar to ChatGPT/OpenAI's API. Website

RAG in R

Project Description Links
Ragnar Helps implement Retrieval-Augmented Generation (RAG) workflows. Website

LLM Deployment (Cloud Services)

Service Description Links
AWS Bedrock Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon. AWS Bedrock
Microsoft Azure AI Services Azure AI services help developers and organizations rapidly create intelligent, cutting-edge, market-ready, and responsible applications with out-of-the-box and prebuilt and customizable APIs and models. Microsoft Azure AI Services
Google Vertex AI Vertex AI is a fully-managed, unified AI development platform for building and using generative AI. Google Vertex AI
NVIDIA NIM NVIDIA NIM™, part of NVIDIA AI Enterprise, provides containers to self-host GPU-accelerated inferencing microservices for pretrained and customized AI models across clouds, data centers, and workstations. NVIDIA NIM

Examples and Cookbooks

Building AI

Project Description Links
LangChain Cookbook Example code for building applications with LangChain, with an emphasis on more applied and end-to-end examples. GitHub
LangGraph Examples Example code for building applications with LangGraph. GitHub
Llama Index Examples Example code for building applications with Llama Index. GitHub
Streamlit LLM Examples Streamlit LLM app examples for getting started. GitHub

Deploying AI

Amazon Web Services (AWS)

Project Description Links
Amazon Bedrock Workshop Introduces how to leverage foundation models (FMs) through Amazon Bedrock. GitHub

Microsoft Azure

Project Description Links
Microsoft Generative AI for Beginners 21 lessons teaching everything you need to know to start building Generative AI applications. GitHub
Microsoft Intro to Generative AI Course A comprehensive introduction to Generative AI concepts and applications. Microsoft Learn
Azure Generative AI Examples Prompt Flow and RAG examples for use with the Microsoft Azure Cloud platform. GitHub

Google Cloud Platform (GCP)

Project Description Links
Google Vertex AI Examples Notebooks, code samples, sample apps, and other resources that demonstrate how to use, develop, and manage machine learning and generative AI workflows using Google Cloud Vertex AI. GitHub
Google Generative AI Examples Sample code and notebooks for Generative AI on Google Cloud, with Gemini on Vertex AI. GitHub

NVIDIA

Project Description Links
NVIDIA NIM Anywhere An entry point for developing with NIMs that natively scales out to full-sized labs and up to production environments. GitHub
NVIDIA NIM Deploy Reference implementations, example documents, and architecture guides that can be used as a starting point to deploy multiple NIMs and other NVIDIA microservices into Kubernetes and other production deployment environments. GitHub

Newsletters

Newsletter Description Links
Python AI/ML Tips Free newsletter on Generative AI and Data Science. GitHub
unwind ai Latest AI news, tools, and tutorials for AI Developers. Website

Courses and Training

Free Training

Workshop Description Links
Generative AI Data Scientist Workshops Get free training on how to build and deploy Generative AI / ML Solutions. Register for the next free workshop here.

Paid Courses

Course Description Links
8-Week AI Bootcamp To Become A Generative AI-Data Scientist Focused on helping you become a Generative AI Data Scientist. Learn how to build and deploy AI-powered data science solutions using LangChain, LangGraph, Pandas, Scikit Learn, Streamlit, AWS, Bedrock, and EC2. Enroll Here
