simple-dataengineering-ai-stack

Stars: 156

This repository provides curated, dockerized blueprints for building and demoing modern data and AI platforms. Users can spin up end-to-end environments including data lake foundations, pipeline orchestration, observability, and AI-friendly tooling with just a few commands. The vision is to accelerate experimentation, stay modular, promote best practices, and bridge different personas in data and AI fields. The repository includes various directories focusing on different aspects of data engineering and AI, such as data infrastructure, data pipeline orchestration, AI-powered job orchestration, and more. Users can choose a stack, launch it locally, and compose their platform by running multiple stacks side-by-side. Typical use cases include prototyping a lakehouse, trialing ETL & AI pipelines, providing sandboxes for analysts, and validating monitoring/backup strategies.

README:

Simple Data AI Stack

Build and demo modern data and AI platforms without waiting on infrastructure tickets. This repository collects curated, dockerized blueprints that let data engineers, ML teams, and platform builders spin up end-to-end environments—data lake foundations, pipeline orchestration, observability, and AI-friendly tooling—in a few commands.

The Vision

Accelerate experimentation: Stand up realistic data/AI environments locally or on a single VM, then iterate on pipelines, models, and dashboards with production-inspired defaults.
Stay modular: Each stack is self-contained and composable—pick the lakehouse, orchestration, or monitoring pieces you need today and combine them as your platform grows.
Promote best practices: Included services cover security, backups, health checks, and resource monitoring so teams focus on insights, not plumbing.
Bridge personas: Empower data engineers, AI engineers, analytics developers, and operators to collaborate against the same sandbox with role-aligned interfaces.

Repository Guide

Directory	Focus	Highlights	Docs
`data-Infrastructure/`	Platform foundations	Opinionated essays covering the why behind stack choices—start with hidden pitfalls that derail data platforms before they scale	The Hidden Problems in Data Infrastructure
`datalake/`	Data infrastructure	PostgreSQL-based lake with connection pooling, Redis cache, no-code access, backups, and uptime monitoring	Postgres Lake README
`data_pipeline_orchestration/`	Data & AI engineering	Apache Airflow bundle with MinIO object storage, customizable ETL worker, resource monitoring, and helper scripts	Airflow Stack README
`ducklake-ai-platform/`	Lakehouse + AI workspace	DuckDB + DuckLake core with Marimo notebooks, MinIO object storage, Postgres metadata, and vector search-ready defaults	DuckLake README
`dataengineering-dashboard-vision/`	Observability agent	Conversational Grafana + Prometheus assistant delivers root-cause context and anomaly summaries via chat	Dashboard Agent README
`dwh-rag-framework/`	Warehouse-first RAG lab	DuckDB snapshots feeding LightRAG indexing with Marimo notebooks and Cronicle automation for agent validation	RAG Framework README
`n8n-data-ai-orchestration/`	AI-powered job orchestration	Customer retention workflow that blends SQL, enrichment, OpenAI strategy generation, Slack/email reporting, and failure alerting in n8n	n8n Flow README
`mcp-data-server/`	Universal data loader MCP	Format-agnostic FastAPI server with auto-detect parsers, DuckDB SQL querying, and REST endpoints for instant file-to-query workflows	MCP Data Server README
`data-agent-sdk/`	Data engineering agent SDK	Minimal SDK for building data agents with SQL/Polars tools, governance hooks, lineage tracking, and MCP server support in ~2,000 lines	Data Agent SDK README
`python-redis-streaming/`	Streaming ingestion engine	Async Python + Redis Streams + Postgres stack with uv tooling, DLQ handling, and CLI helpers for monitoring and benchmarks	Python Redis Streaming README
`redis-postgres-pipeline/`	High-performance pipeline	Production-ready data pipeline with Redis queues, dedup, caching, Postgres 18 async I/O, UNLOGGED staging, materialized views, and Polars — handles 500M records without Spark	Redis Postgres Pipeline README
`postgres-duckdb-sync/`	Postgres → DuckDB sync lab	150-line Polars loop that copies live Postgres rows to DuckDB via Parquet, SQLite checkpoints, schema drift detection, and soft-delete support — exactly what the “Copying Postgres to DuckDB” post prescribes	Postgres → DuckDB Sync README
`spark-to-polars-migration/`	Spark-to-single-node rewrite lab	Side-by-side Spark UDF baseline with Polars and DuckDB replacements, Dockerized for benchmarking single-node performance	Spark-to-Polars README
`data-pipeline-security/`	Data Pipeline Security	Secrets & Identity	Data-pipeline-security README
`elasticsearch-vs-vector-search/`	Search architecture lab	Hands-on comparison of Elasticsearch keyword search vs pgvector semantic search with hybrid approach, performance benchmarks, and production decision framework	Elasticsearch vs Vector Search README
`knowledge-search-hybrid/`	Local hybrid search stack	Config-driven Lucene BM25 + local embeddings + HNSW kNN + RAG answers in one container; Polars ETL on JSONL drops, autocomplete, and disk-backed indexes	Hybrid Knowledge Search README
`mdm-polars-duckdb/`	MDM golden customer table	Implements “Creating One Clean Customer Table from 7 Conflicting Sources” with Polars, Pandera, RapidFuzz, and DuckDB; includes synthetic messy inputs, uv workflow, and Docker image for five-minute runs	Polars + DuckDB Golden Table README

Pair the conceptual deep dives with the hands-on stack READMEs: skim data-Infrastructure/ to understand the platform philosophy, then jump into the stack directory that matches your next experiment for deployment steps and credentials.
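
To make one row of the table concrete, the checkpointed copy loop described for `postgres-duckdb-sync/` might look roughly like the sketch below. This is not the repository's code. It uses only the Python standard library, with two in-memory SQLite databases standing in for Postgres and DuckDB, and it leaves out the Parquet hop and schema drift detection. The table and column names (`customers`, `updated_at`, `deleted_at`, `sync_checkpoint`) and the functions `setup` and `sync_once` are hypothetical.

```python
import sqlite3

SCHEMA = (
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER, deleted_at INTEGER)"
)


def setup():
    """Create the two stand-in databases (assumption: SQLite replaces both engines)."""
    source = sqlite3.connect(":memory:")  # stand-in for Postgres
    target = sqlite3.connect(":memory:")  # stand-in for DuckDB
    source.execute(SCHEMA)
    target.execute(SCHEMA)
    # The checkpoint remembers the highest updated_at value already copied.
    target.execute(
        "CREATE TABLE sync_checkpoint (table_name TEXT PRIMARY KEY, last_seen INTEGER)"
    )
    return source, target


def sync_once(source, target):
    """Copy rows changed since the last checkpoint; return how many were copied."""
    row = target.execute(
        "SELECT last_seen FROM sync_checkpoint WHERE table_name = 'customers'"
    ).fetchone()
    last_seen = row[0] if row else 0

    changed = source.execute(
        "SELECT id, name, updated_at, deleted_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    if not changed:
        return 0

    # Soft deletes travel as ordinary updates: deleted_at is just another column.
    target.executemany("INSERT OR REPLACE INTO customers VALUES (?, ?, ?, ?)", changed)
    # Rows are ordered by updated_at, so the last one carries the new high-water mark.
    target.execute(
        "INSERT OR REPLACE INTO sync_checkpoint VALUES ('customers', ?)",
        (changed[-1][2],),
    )
    target.commit()
    return len(changed)
```

Calling `sync_once` repeatedly is safe: a run with no new changes copies nothing and leaves the checkpoint alone. A strict `updated_at > ?` filter can miss a row that later commits with a timestamp equal to the checkpoint, so a real loop would probably need an overlap window or a monotonic version column.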

Getting Started

Install prerequisites: Docker + Docker Compose v2 on a machine with adequate CPU, RAM, and disk (see stack-specific READMEs for sizing).

Clone the repo:

git clone https://github.com/hottechstack/simple-data-ai-stack.git
cd simple-data-ai-stack

Choose a stack: Browse the directories above and open the corresponding README for detailed instructions.
Launch locally: Most stacks run with a single command (docker compose up -d, ./start_pipeline.sh start, etc.). Scripts expose health checks, sample data loaders, and log helpers to keep you moving.
Compose your platform: Run stacks side-by-side for a fuller platform—pipe object storage into the SQL lake, orchestrate model feature jobs, or layer BI tooling on top.
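
The launch step above can be wrapped in a small helper when several stacks are started in sequence. The sketch below is an illustration, not part of the repository: `launch_command` and `launch` are hypothetical names, and the rule "prefer `start_pipeline.sh` when present, otherwise fall back to `docker compose up -d`" is an assumption about how a stack directory is laid out. The Compose file names checked are the ones Docker Compose looks for by default.

```python
import shutil
import subprocess
from pathlib import Path

# Default file names that Docker Compose discovers in a project directory.
COMPOSE_FILES = (
    "compose.yaml",
    "compose.yml",
    "docker-compose.yaml",
    "docker-compose.yml",
)


def launch_command(stack_dir):
    """Pick a launch command for a stack directory (assumed layout, see lead-in)."""
    stack = Path(stack_dir)
    if (stack / "start_pipeline.sh").is_file():
        return ["./start_pipeline.sh", "start"]
    if any((stack / name).is_file() for name in COMPOSE_FILES):
        return ["docker", "compose", "up", "-d"]
    raise FileNotFoundError(f"no launch script or Compose file found in {stack}")


def launch(stack_dir, dry_run=True):
    """Return the command; run it from inside the stack directory unless dry_run."""
    cmd = launch_command(stack_dir)
    if dry_run:
        return cmd
    if cmd[0] == "docker" and shutil.which("docker") is None:
        raise RuntimeError("docker not found on PATH; install Docker + Compose v2 first")
    subprocess.run(cmd, cwd=stack_dir, check=True)
    return cmd
```

With `dry_run=True` (the default) nothing is executed, which makes it easy to print and review the commands before starting several stacks side-by-side. Each stack's own README remains the authority on the actual command and required credentials.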

Typical Use Cases

Prototype a lakehouse with production-grade components before committing to cloud services.
Trial ETL & AI feature pipelines with real datasets and observe resource footprints.
Provide analysts and business users a sandbox with self-service interfaces (NocoDB, pgAdmin, dashboards).
Validate monitoring/backup strategies in isolation before promoting to shared environments.

Opinionated Workflow

Land structured/unstructured data via MinIO or direct DB ingestion.
Transform using Airflow-managed ETL jobs powered by DuckDB and Polars.
Serve & explore through PostgreSQL, NocoDB, BI tools, or custom APIs.
Observe everything with built-in uptime checks, metrics dashboards, and automated backups.

The stacks are designed to connect: object storage flows into transformation jobs, refined outputs land back into the data lake, and monitoring tools keep the feedback loop tight.
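
The four workflow steps can be sketched end to end in a few lines. This is a toy under loud assumptions, not how the stacks are implemented: a plain dict stands in for a MinIO bucket, the `csv` module stands in for a DuckDB/Polars job that Airflow would schedule, SQLite stands in for PostgreSQL, and a row count stands in for the uptime and metrics checks. The function names, the object key, and the `revenue_by_country` table are all hypothetical.

```python
import csv
import io
import sqlite3


def land(bucket, key, text):
    """Land: store a raw object (a dict stands in for a MinIO bucket)."""
    bucket[key] = text


def transform(bucket, key):
    """Transform: parse the raw CSV and total the amount per country."""
    totals = {}
    for row in csv.DictReader(io.StringIO(bucket[key])):
        totals[row["country"]] = totals.get(row["country"], 0.0) + float(row["amount"])
    return sorted(totals.items())


def serve(db, rows):
    """Serve: write refined rows to a queryable table (SQLite stands in for PostgreSQL)."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS revenue_by_country (country TEXT PRIMARY KEY, total REAL)"
    )
    db.executemany("INSERT OR REPLACE INTO revenue_by_country VALUES (?, ?)", rows)
    db.commit()


def observe(db):
    """Observe: a minimal health check reporting whether the served table has rows."""
    count = db.execute("SELECT COUNT(*) FROM revenue_by_country").fetchone()[0]
    return {"healthy": count > 0, "rows": count}
```

A run chains the steps in order: `land` a CSV string into the bucket, pass the key to `transform`, hand the result to `serve` with a `sqlite3.connect(":memory:")` connection, then call `observe` on the same connection. Keeping each step a separate function mirrors the modular split between the stacks, so any stand-in can be swapped for the real service without touching the others.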

Roadmap Inspiration

✅ Vector databases + search architecture comparison (see elasticsearch-vs-vector-search/)
Streaming ingestion profile (Kafka/Redpanda + stream processing + materialized views).
Notebook & model experimentation workspace with GPU-ready containers.
Terraform modules to mirror these blueprints in managed cloud environments.

Have an idea or internal stack you want to share? Contributions are welcome—open an issue or PR to propose a new module or enhancement.

Contributing

Fork the repository and work inside a dedicated directory for your stack.
Document your stack thoroughly (architecture, environment variables, health checks, teardown steps).
Reuse existing patterns for Docker Compose profiles, scripts, and monitoring hooks to keep experiences consistent.
Submit a PR describing the use case, prerequisites, and any sample data included.

License

Unless otherwise stated in a subdirectory, content is provided as-is for educational and production experimentation. Review upstream container licenses before deploying in regulated environments.

For Tasks:

build data lake, orchestrate pipelines, trial ai pipelines, validate monitoring, prototype a platform

For Jobs:

data engineer, machine learning engineer, platform builder, data analyst, ai engineer

Alternative AI tools for simple-dataengineering-ai-stack

Similar Open Source Tools

synmetrix

Synmetrix is an open source data engineering platform and semantic layer for centralized metrics management. It provides a complete framework for modeling, integrating, transforming, aggregating, and distributing metrics data at scale. Key features include data modeling and transformations, semantic layer for unified data model, scheduled reports and alerts, versioning, role-based access control, data exploration, caching, and collaboration on metrics modeling. Synmetrix leverages Cube.js to consolidate metrics from various sources and distribute them downstream via a SQL API. Use cases include data democratization, business intelligence and reporting, embedded analytics, and enhancing accuracy in data handling and queries. The tool speeds up data-driven workflows from metrics definition to consumption by combining data engineering best practices with self-service analytics capabilities.

Stars: 531

mlcraft

Synmetrix (prev. MLCraft) is an open source data engineering platform and semantic layer for centralized metrics management. It provides a complete framework for modeling, integrating, transforming, aggregating, and distributing metrics data at scale. Key features include data modeling and transformations, semantic layer for unified data model, scheduled reports and alerts, versioning, role-based access control, data exploration, caching, and collaboration on metrics modeling. Synmetrix leverages Cube (Cube.js) for flexible data models that consolidate metrics from various sources, enabling downstream distribution via a SQL API for integration into BI tools, reporting, dashboards, and data science. Use cases include data democratization, business intelligence, embedded analytics, and enhancing accuracy in data handling and queries. The tool speeds up data-driven workflows from metrics definition to consumption by combining data engineering best practices with self-service analytics capabilities.

Stars: 480

aistore

AIStore is a lightweight object storage system designed for AI applications. It is highly scalable, reliable, and easy to use. AIStore can be deployed on any commodity hardware, and it can be used to store and manage large datasets for deep learning and other AI applications.

Stars: 1.8k

positronic

Positronic is an end-to-end toolkit for building ML-driven robotics systems, aiming to simplify data collection, messy data handling, and complex deployment in the field of robotics. It provides a Python-native stack for real-life ML robotics, covering hardware integration, dataset curation, policy training, deployment, and monitoring. The toolkit is designed to make professional-grade ML robotics approachable, without the need for ROS. Positronic offers solutions for data ops, hardware drivers, unified inference API, and iteration workflows, enabling teams to focus on developing manipulation systems for robots.

Stars: 60

ksail

KSail is a tool that bundles common Kubernetes tooling into a single binary, providing a unified workflow for creating clusters, deploying workloads, and operating cloud-native stacks across different distributions and providers. It eliminates the need for multiple CLI tools and bespoke scripts, offering features like one binary for provisioning and deployment, support for various cluster configurations, mirror registries, GitOps integration, customizable stack selection, built-in SOPS for secrets management, AI assistant for interactive chat, and a VSCode extension for cluster management.

Stars: 129

neuro-san-studio

Neuro SAN Studio is an open-source library for building agent networks across various industries. It simplifies the development of collaborative AI systems by enabling users to create sophisticated multi-agent applications using declarative configuration files. The tool offers features like data-driven configuration, adaptive communication protocols, safe data handling, dynamic agent network designer, flexible tool integration, robust traceability, and cloud-agnostic deployment. It has been used in various use-cases such as automated generation of multi-agent configurations, airline policy assistance, banking operations, market analysis in consumer packaged goods, insurance claims processing, intranet knowledge management, retail operations, telco network support, therapy vignette supervision, and more.

Stars: 200

ToolJet

ToolJet is an open-source platform for building and deploying internal tools, workflows, and AI agents. It offers a visual builder with drag-and-drop UI, integrations with databases, APIs, SaaS apps, and object storage. The community edition includes features like a visual app builder, ToolJet database, multi-page apps, collaboration tools, extensibility with plugins, code execution, and security measures. ToolJet AI, the enterprise version, adds AI capabilities for app generation, query building, debugging, agent creation, security compliance, user management, environment management, GitSync, branding, access control, embedded apps, and enterprise support.

Stars: 37.4k

genkit

Firebase Genkit (beta) is a framework with tooling that helps app developers build, test, deploy, and monitor AI-powered features. Genkit is cloud-optimized and code-centric, and it integrates with many services that have free tiers to get started. It provides a unified API for generation, context-aware AI features, evaluation of AI workflows, extensibility through plugins, easy deployment to Firebase or Google Cloud, observability and monitoring with OpenTelemetry, and a developer UI for prototyping and testing AI features locally. Genkit works with Firebase or Google Cloud projects through official plugins and templates.

Stars: 5.6k

video-search-and-summarization

The NVIDIA AI Blueprint for Video Search and Summarization is a repository showcasing a video search and summarization agent built with NVIDIA NIM microservices. It aims to help industries make better decisions faster by providing insightful, accurate, and interactive video analytics AI agents. These agents can perform tasks like video summarization and visual question-answering. The repository includes software components such as NIM microservices, an ingestion pipeline, and a CA-RAG module for analyzing and summarizing large volumes of video data. The target audience includes video analysts, IT engineers, and GenAI developers, who can benefit from the blueprint's 1-click deployment steps, easy-to-manage configurations, and customization options. The repository contains directories for deployment, source code, and training notebooks, along with documentation for detailed instructions. Hardware requirements vary with the deployment topology and with dependencies such as the VLM and LLM; Launchable Deployment, Docker Compose Deployment, and Helm Chart Deployment are provided for different use cases.

Stars: 404

Genkit

Genkit is an open-source framework for building full-stack AI-powered applications, used in production by Google's Firebase. It provides SDKs for JavaScript/TypeScript (Stable), Go (Beta), and Python (Alpha) with a unified interface for integrating AI models from providers such as Google, OpenAI, Anthropic, and Ollama. Developers can build chatbots, automations, and recommendation systems using streamlined APIs for multimodal content, structured outputs, tool calling, and agentic workflows. Genkit also offers text and image generation, structured data generation, prompt templating, persisted chat interfaces, AI workflows, and AI-powered data retrieval (RAG).

Stars: 2.8k

graphbit

GraphBit is an industry-grade agentic AI framework built for developers and AI teams that demand stability, scalability, and low resource usage. It is written in Rust for maximum performance and safety, delivering significantly lower CPU usage and memory footprint compared to leading alternatives. The framework is designed to run multi-agent workflows in parallel, persist memory across steps, recover from failures, and ensure 100% task success under load. With lightweight architecture, observability, and concurrency support, GraphBit is suitable for deployment in high-scale enterprise environments and low-resource edge scenarios.

Stars: 510

doris

Apache Doris is an open-source, real-time analytical database built on an MPP (massively parallel processing) architecture. It is designed for fast SQL analytics over large datasets, covering both high-concurrency point queries and complex ad hoc analysis, and it is compatible with the MySQL protocol so that existing clients and BI tools can connect to it. Typical uses include reporting, dashboards, log analysis, and queries over external lakehouse data.

Stars: 15.0k

instill-core

Instill Core is an open-source orchestrator comprising a collection of source-available projects designed to streamline every aspect of building versatile AI features with unstructured data. It includes Instill VDP (Versatile Data Pipeline) for unstructured data, AI, and pipeline orchestration, Instill Model for scalable MLOps and LLMOps for open-source or custom AI models, and Instill Artifact for unified unstructured data management. Instill Core can be used for tasks such as building, testing, and sharing pipelines, importing, serving, fine-tuning, and monitoring ML models, and transforming documents, images, audio, and video into a unified AI-ready format.

Stars: 2.3k

pluto

Pluto is a development tool dedicated to helping developers build cloud and AI applications more conveniently, addressing problems such as the difficult deployment of AI applications and open-source models. Developers write applications in familiar programming languages like Python and TypeScript, directly defining and using the cloud resources the application needs within their code base, such as AWS SageMaker, DynamoDB, and more. Pluto deduces the app's infrastructure resource needs through static program analysis and then creates these resources on the specified cloud platform, simplifying resource creation and application deployment.

Stars: 90

ai-factory

AI Factory is a CLI tool and skill system that streamlines AI-powered development by handling context setup, skill installation, and workflow configuration. It supports multiple AI coding agents, offers spec-driven development, and integrates with popular tech stacks like Next.js, Laravel, Django, and Express. The tool ensures zero configuration, best practices adherence, community skills utilization, and multi-agent support. Users can create plans, tasks, and commits for structured feature development, bug fixes, and self-improvement. Security is a priority with mandatory two-level scans for external skills. The tool's learning loop generates patches from bug fixes to enhance future implementations.

Stars: 187

For similar tasks

llm-zoomcamp

LLM Zoomcamp is a free online course focusing on real-life applications of Large Language Models (LLMs). Over 10 weeks, participants will learn to build an AI bot capable of answering questions based on a knowledge base. The course covers topics such as LLMs, RAG, open-source LLMs, vector databases, orchestration, monitoring, and advanced RAG systems. Pre-requisites include comfort with programming, Python, and the command line, with no prior exposure to AI or ML required. The course features a pre-course workshop and is led by instructors Alexey Grigorev and Magdalena Kuhn, with support from sponsors and partners.

Stars: 3.2k

emqx

EMQX is a highly scalable and reliable MQTT platform designed for IoT data infrastructure. It supports MQTT 5.0, 3.1.1, and 3.1, as well as MQTT-SN, CoAP, LwM2M, and MQTT over QUIC. EMQX can connect millions of IoT devices, process messages in real time, and integrate with backend data systems. It is suitable for applications in AI, IoT, IIoT, connected vehicles, smart cities, and more. Features include massive scalability, a powerful rule engine, a flow designer, AI processing, robust security, observability, management, and extensibility, delivered as a single unified offering under the Business Source License (BSL) 1.1.

Stars: 15.3k

For similar jobs

sweep

Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.

Stars: 7.1k

teams-ai

The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.

Stars: 502

ai-guide

This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

Stars: 159

classifai

Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

Stars: 697

chatbot-ui

Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.

Stars: 27.7k

BricksLLM

BricksLLM is a cloud-native AI gateway written in Go. It currently provides native support for OpenAI, Anthropic, Azure OpenAI, and vLLM, and aims to provide enterprise-level infrastructure that can power any LLM production use case. Example use cases:
Set LLM usage limits for users on different pricing tiers.
Track LLM usage on a per-user and per-organization basis.
Block or redact requests containing PII.
Improve LLM reliability with failovers, retries, and caching.
Distribute API keys with rate limits and cost limits for internal development and production use.
Distribute API keys with rate limits and cost limits for students.

Stars: 953

uAgents

uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.

Stars: 1.3k

griptape

Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.

Stars: 2.2k