spiceai

A data query and AI-inference compute engine written in Rust using Apache Arrow and DataFusion for secure, fast, agentic AI applications.

Stars: 1970

Spice is a portable runtime written in Rust that offers developers a unified SQL interface to materialize, accelerate, and query data from any database, data warehouse, or data lake. It connects, fuses, and delivers data to applications, machine-learning models, and AI-backends, functioning as an application-specific, tier-optimized Database CDN. It is built with industry-leading technologies such as Apache DataFusion, Apache Arrow, Apache Arrow Flight, SQLite, and DuckDB. Spice makes it fast and easy to query data from one or more sources using SQL, co-locating a managed dataset with applications or machine learning models, and accelerating it with Arrow in-memory, SQLite/DuckDB, or attached PostgreSQL for fast, high-concurrency, low-latency queries.

README:

License: Apache-2.0

Spice is a portable runtime written in Rust that offers developers a unified SQL interface to materialize, accelerate, and query data from any database, data warehouse, or data lake.

📣 Read the Spice.ai OSS announcement blog post.

Spice connects, fuses, and delivers data to applications, machine-learning models, and AI-backends, functioning as an application-specific, tier-optimized Database CDN.

Spice is built with industry-leading technologies such as Apache DataFusion, Apache Arrow, Apache Arrow Flight, SQLite, and DuckDB.

How Spice works.

🎓 Read the MaterializedView interview on Spice.ai

🎥 Watch the CMU Databases talk, Accelerating Data and AI with Spice.ai Open-Source

Why Spice?

Spice makes it fast and easy to query data from one or more sources using SQL. You can co-locate a managed dataset with your application or machine learning model, and accelerate it with Arrow in-memory, SQLite/DuckDB, or with attached PostgreSQL for fast, high-concurrency, low-latency queries. Accelerated engines give you flexibility and control over query cost and performance.

How is Spice different?

  1. Application-focused: Spice is designed to integrate at the application level, with a 1:1 or 1:N mapping of applications to Spice instances, whereas most other data systems are designed for multiple applications to share a single database or data warehouse. It's not uncommon to have many Spice instances, even down to one for each tenant or customer.

  2. Dual-Engine Acceleration: Spice supports both OLAP (Arrow/DuckDB) and OLTP (SQLite/PostgreSQL) databases at the dataset level, unlike other systems that only support one type.

  3. Separation of Materialization and Storage/Compute: Spice separates storage and compute, allowing you to keep data close to its source and bring a materialized working set next to your application, dashboard, or data/ML pipeline.

  4. Edge to Cloud Native: Spice is designed to be deployed anywhere, from a standalone instance to a Kubernetes container sidecar, microservice, or cluster at the Edge/POP, On-Prem, or in public clouds. You can also chain Spice instances and deploy them across multiple infrastructure tiers.
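The idea in point 3, keeping the full data at its source while bringing a materialized working set next to the application, can be illustrated with a small sketch. This is not how Spice is implemented: it uses Python's built-in sqlite3 module, and the orders table, tenant names, and helper functions are all hypothetical stand-ins.

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, tenant TEXT, total REAL)"


def make_source():
    """Stand-in for a remote system of record (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "acme", 10.0), (2, "acme", 25.5), (3, "globex", 7.25), (4, "initech", 99.0)],
    )
    conn.commit()
    return conn


def materialize(source, tenant):
    """Copy only one tenant's rows into a local in-memory store."""
    local = sqlite3.connect(":memory:")
    local.execute(SCHEMA)
    rows = source.execute(
        "SELECT id, tenant, total FROM orders WHERE tenant = ?", (tenant,)
    ).fetchall()
    local.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    local.commit()
    return local


def tenant_summary(local):
    """Answer an application query from the local working set only."""
    return local.execute("SELECT COUNT(*), SUM(total) FROM orders").fetchone()
```

Calling materialize(make_source(), "acme") yields a local store that holds only that tenant's rows, and tenant_summary keeps working after the source connection is closed, because queries no longer reach the source.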

How does Spice compare?

| | Spice | Trino/Presto | Dremio | Clickhouse |
| --- | --- | --- | --- | --- |
| Primary Use-Case | Data & AI Applications | Big Data Analytics | Interactive Analytics | Real-Time Analytics |
| Typical Deployment | Colocated with application | Cloud Cluster | Cloud Cluster | On-Prem/Cloud Cluster |
| Application-to-Data System | One-to-One/Many | Many-to-One | Many-to-One | Many-to-One |
| Query Federation | Native with query push-down | Supported with push-down | Supported with limited push-down | Limited |
| Materialization | Arrow/SQLite/DuckDB/PostgreSQL | Intermediate Storage | Reflections (Iceberg) | Views & MergeTree |
| Query Result Caching | Supported | Supported | Supported | Supported |
| Typical Configuration | Single-Binary/Sidecar/Microservice | Coordinator+Executor w/ Zookeeper | Coordinator+Executor w/ Zookeeper | Clickhouse Keeper+Nodes |

Example Use-Cases

1. Faster applications and frontends. Accelerate and co-locate datasets with applications and frontends, to serve more concurrent queries and users with faster page loads and data updates. Try the CQRS sample app

2. Faster dashboards, analytics, and BI. Faster, more responsive dashboards without massive compute costs. Watch the Apache Superset demo

3. Faster data pipelines, machine learning training and inferencing. Co-locate datasets in pipelines where the data is needed to minimize data movement and improve query performance.

4. Easily query many data sources. Federated SQL query across databases, data warehouses, and data lakes using Data Connectors.

FAQ

  • Is Spice a cache? No. However, you can think of Spice data materialization as an active cache or data prefetcher. A cache fetches data on a cache miss, while Spice prefetches and materializes filtered data on an interval or as new data becomes available. In addition to materialization, Spice supports results caching.

  • Is Spice a CDN for databases? Yes, you can think of Spice like a CDN for different data sources. Using CDN concepts, Spice enables you to ship (load) a working set of your database (or data lake, or data warehouse) where it's most frequently accessed, like from a data application or for AI-inference.

  • Where is the AI? Spice provides a unified API for both data and AI/ML with a high-performance bus between the two. However, because the first step in AI-readiness is data-readiness, the Getting Started content is focused on data. Spice has endpoints and APIs for model deployment and inference including LLMs, accelerated embeddings, and an AI-gateway for providers like OpenAI and Anthropic. Read more about the vision to enable development of intelligent AI-driven applications.
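The distinction drawn in the first FAQ answer, fetching on a cache miss versus prefetching a filtered working set, can be sketched in a few lines of Python. The classes below are hypothetical illustrations, not Spice APIs; the Source class simply counts how often it is read so the two access patterns can be compared.

```python
class Source:
    """Hypothetical upstream data source that counts how often it is read."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.reads = 0

    def fetch(self, key):
        self.reads += 1
        return self.rows[key]

    def scan(self, predicate):
        self.reads += 1
        return {k: v for k, v in self.rows.items() if predicate(k, v)}


class ReadThroughCache:
    """Fetches from the source only when a lookup misses."""

    def __init__(self, source):
        self.source = source
        self.store = {}

    def get(self, key):
        if key not in self.store:
            # Cache miss: this query waits on the source.
            self.store[key] = self.source.fetch(key)
        return self.store[key]


class MaterializedDataset:
    """Loads a filtered working set ahead of queries."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate
        self.store = {}

    def refresh(self):
        # A real system would call this on an interval or when new data arrives.
        self.store = self.source.scan(self.predicate)

    def get(self, key):
        # Queries are served locally and never touch the source.
        return self.store.get(key)
```

With the cache, the first lookup of each key reaches the source. With the materialized dataset, only refresh() does, and a key outside the filter returns None rather than triggering a fetch.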

Watch a 30-sec BI dashboard acceleration demo

https://github.com/spiceai/spiceai/assets/80174/7735ee94-3f4a-4983-a98e-fe766e79e03a

Supported Data Connectors

Currently supported data connectors for upstream datasets. More coming soon.

| Name | Description | Status | Protocol/Format |
| --- | --- | --- | --- |
| duckdb | DuckDB | Release Candidate | |
| github | GitHub | Release Candidate | |
| graphql | GraphQL | Release Candidate | JSON |
| mysql | MySQL | Release Candidate | |
| postgres | PostgreSQL | Release Candidate | |
| s3 | S3 | Release Candidate | Parquet, CSV |
| databricks (mode: delta_lake) | Databricks | Release Candidate | S3/Delta Lake |
| file | File | Release Candidate | Parquet, CSV |
| databricks (mode: spark_connect) | Databricks | Beta | Spark Connect |
| delta_lake | Delta Lake | Beta | Delta Lake |
| flightsql | FlightSQL | Beta | Arrow Flight SQL |
| mssql | Microsoft SQL Server | Beta | Tabular Data Stream (TDS) |
| odbc | ODBC | Beta | ODBC |
| spiceai | Spice.ai | Beta | Arrow Flight |
| abfs | Azure BlobFS | Alpha | Parquet, CSV |
| clickhouse | Clickhouse | Alpha | |
| debezium | Debezium CDC | Alpha | Kafka + JSON |
| dremio | Dremio | Alpha | Arrow Flight |
| ftp, sftp | FTP/SFTP | Alpha | Parquet, CSV |
| http, https | HTTP(s) | Alpha | Parquet, CSV |
| iceberg | Apache Iceberg | Alpha | Parquet |
| localpod | Local dataset replication | Alpha | |
| sharepoint | Microsoft SharePoint | Alpha | Unstructured UTF-8 documents |
| snowflake | Snowflake | Alpha | Arrow |
| spark | Spark | Alpha | Spark Connect |

Supported Data Stores/Accelerators

Currently supported data stores for local materialization/acceleration. More coming soon.

| Name | Description | Status | Engine Modes |
| --- | --- | --- | --- |
| arrow | In-Memory Arrow Records | Release Candidate | memory |
| duckdb | Embedded DuckDB | Release Candidate | memory, file |
| postgres | Attached PostgreSQL | Release Candidate | |
| sqlite | Embedded SQLite | Release Candidate | memory, file |

⚑️ Quickstart (Local Machine)

https://github.com/spiceai/spiceai/assets/88671039/85cf9a69-46e7-412e-8b68-22617dcbd4e0

Step 1. Install the Spice CLI:

On macOS, Linux, and WSL:

curl https://install.spiceai.org | /bin/bash

Or using brew:

brew install spiceai/spiceai/spice

On Windows:

curl -L "https://install.spiceai.org/Install.ps1" -o Install.ps1 && PowerShell -ExecutionPolicy Bypass -File ./Install.ps1

Step 2. Initialize a new Spice app with the spice init command:

spice init spice_qs

A spicepod.yaml file is created in the spice_qs directory. Change to that directory:

cd spice_qs

Step 3. Start the Spice runtime:

spice run

Example output will be shown as follows:

Spice.ai runtime starting...
2024-08-05T13:02:40.247484Z  INFO runtime::flight: Spice Runtime Flight listening on 127.0.0.1:50051
2024-08-05T13:02:40.247490Z  INFO runtime::metrics_server: Spice Runtime Metrics listening on 127.0.0.1:9090
2024-08-05T13:02:40.247949Z  INFO runtime: Initialized results cache; max size: 128.00 MiB, item ttl: 1s
2024-08-05T13:02:40.248611Z  INFO runtime::http: Spice Runtime HTTP listening on 127.0.0.1:8090
2024-08-05T13:02:40.252356Z  INFO runtime::opentelemetry: Spice Runtime OpenTelemetry listening on 127.0.0.1:50052

The runtime is now started and ready for queries.

Step 4. In a new terminal window, add the spiceai/quickstart Spicepod. A Spicepod is a package of configuration defining datasets and ML models.

spice add spiceai/quickstart

The spicepod.yaml file will be updated with the spiceai/quickstart dependency.

version: v1beta1
kind: Spicepod
name: spice_qs
dependencies:
  - spiceai/quickstart

The spiceai/quickstart Spicepod adds a taxi_trips data table to the runtime, which is now available to query with SQL.

2024-08-05T13:04:56.742779Z  INFO runtime: Dataset taxi_trips registered (s3://spiceai-demo-datasets/taxi_trips/2024/), acceleration (arrow, 10s refresh), results cache enabled.
2024-08-05T13:04:56.744062Z  INFO runtime::accelerated_table::refresh_task: Loading data for dataset taxi_trips
2024-08-05T13:05:03.556169Z  INFO runtime::accelerated_table::refresh_task: Loaded 2,964,624 rows (421.71 MiB) for dataset taxi_trips in 6s 812ms.

Step 5. Start the Spice SQL REPL:

spice sql

The SQL REPL interface will be shown:

Welcome to the Spice.ai SQL REPL! Type 'help' for help.

show tables; -- list available tables
sql>

Enter show tables; to display the available tables for query:

sql> show tables;
+---------------+--------------+---------------+------------+
| table_catalog | table_schema | table_name    | table_type |
+---------------+--------------+---------------+------------+
| spice         | public       | taxi_trips    | BASE TABLE |
| spice         | runtime      | query_history | BASE TABLE |
| spice         | runtime      | metrics       | BASE TABLE |
+---------------+--------------+---------------+------------+

Time: 0.022671708 seconds. 3 rows.

Enter a query to display the longest taxi trips:

SELECT trip_distance, total_amount FROM taxi_trips ORDER BY trip_distance DESC LIMIT 10;

Output:

+---------------+--------------+
| trip_distance | total_amount |
+---------------+--------------+
| 312722.3      | 22.15        |
| 97793.92      | 36.31        |
| 82015.45      | 21.56        |
| 72975.97      | 20.04        |
| 71752.26      | 49.57        |
| 59282.45      | 33.52        |
| 59076.43      | 23.17        |
| 58298.51      | 18.63        |
| 51619.36      | 24.2         |
| 44018.64      | 52.43        |
+---------------+--------------+

Time: 0.045150667 seconds. 10 rows.
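The same query can also be sent from application code rather than the REPL. The sketch below uses only the Python standard library and targets the HTTP address shown in the startup log (127.0.0.1:8090). The /v1/sql path, the plain-text request body, and the JSON response are assumptions not stated in this README; verify them against the Spice HTTP API documentation for your runtime version. The function names are hypothetical.

```python
import json
import urllib.request

# Address taken from the "Spice Runtime HTTP listening on" log line.
SPICE_HTTP = "http://127.0.0.1:8090"


def build_sql_request(sql, base_url=SPICE_HTTP):
    """Build a POST request carrying the SQL text as the body.

    Assumption: the runtime accepts plain-text SQL at /v1/sql.
    """
    return urllib.request.Request(
        url=f"{base_url}/v1/sql",
        data=sql.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def run_sql(sql, base_url=SPICE_HTTP, timeout=10.0):
    """Send the query and decode the response (assumed to be JSON)."""
    request = build_sql_request(sql, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the runtime started and the quickstart Spicepod added, calling run_sql with the SELECT statement above should return the ten longest trips, provided the assumptions about the endpoint hold.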

⚙️ Runtime Container Deployment

Using the Docker image locally:

docker pull spiceai/spiceai

In a Dockerfile:

FROM spiceai/spiceai:latest

Using Helm:

helm repo add spiceai https://helm.spiceai.org
helm install spiceai spiceai/spiceai

🏎️ Next Steps

Explore the Spice.ai Cookbook

The Spice.ai Cookbook is a collection of recipes and examples for using Spice. Find it at https://github.com/spiceai/cookbook.

Using Spice.ai Cloud Platform

You can use any number of predefined datasets available from the Spice.ai Cloud Platform in the Spice runtime.

A list of publicly available datasets from the Spice.ai Cloud Platform can be found on Spicerack: https://spicerack.org/.

In order to access public datasets from Spice.ai, you will first need to create an account with Spice.ai by selecting the free tier membership.

Navigate to spice.ai and create a new account by clicking on Try for Free.

After creating an account, you will need to create an app in order to create an API key.

You will now be able to access datasets from Spice.ai. For this demonstration, we will be using the taxi_trips dataset from the https://spice.ai/spiceai/quickstart Spice.ai app.

Step 1. Initialize a new project.

# Initialize a new Spice app
spice init spice_app

# Change to app directory
cd spice_app

Step 2. Log in and authenticate from the command line using the spice login command. A pop-up browser window will prompt you to authenticate:

spice login

Step 3. Start the runtime:

# Start the runtime
spice run

Step 4. Configure the dataset:

In a new terminal window, configure a new dataset using the spice dataset configure command:

spice dataset configure

Enter a dataset name that will be used to reference the dataset in queries. This name does not need to match the name in the dataset source.

dataset name: (spice_app) taxi_trips

Enter the description of the dataset:

description: Taxi trips dataset

Enter the location of the dataset:

from: spice.ai/spiceai/quickstart/datasets/taxi_trips

Select y when prompted whether to accelerate the data:

Locally accelerate (y/n)? y

You should see the following output from your runtime terminal:

2024-12-16T05:12:45.803694Z  INFO runtime::init::dataset: Dataset taxi_trips registered (spice.ai/spiceai/quickstart/datasets/taxi_trips), acceleration (arrow, 10s refresh), results cache enabled.
2024-12-16T05:12:45.805494Z  INFO runtime::accelerated_table::refresh_task: Loading data for dataset taxi_trips
2024-12-16T05:13:24.218345Z  INFO runtime::accelerated_table::refresh_task: Loaded 2,964,624 rows (8.41 GiB) for dataset taxi_trips in 38s 412ms.

Step 5. In a new terminal window, use the Spice SQL REPL to query the dataset:

spice sql
SELECT tpep_pickup_datetime, passenger_count, trip_distance FROM taxi_trips LIMIT 10;

The output displays the results of the query along with the query execution time:

+----------------------+-----------------+---------------+
| tpep_pickup_datetime | passenger_count | trip_distance |
+----------------------+-----------------+---------------+
| 2024-01-11T12:55:12  | 1               | 0.0           |
| 2024-01-11T12:55:12  | 1               | 0.0           |
| 2024-01-11T12:04:56  | 1               | 0.63          |
| 2024-01-11T12:18:31  | 1               | 1.38          |
| 2024-01-11T12:39:26  | 1               | 1.01          |
| 2024-01-11T12:18:58  | 1               | 5.13          |
| 2024-01-11T12:43:13  | 1               | 2.9           |
| 2024-01-11T12:05:41  | 1               | 1.36          |
| 2024-01-11T12:20:41  | 1               | 1.11          |
| 2024-01-11T12:37:25  | 1               | 2.04          |
+----------------------+-----------------+---------------+

Time: 0.00538925 seconds. 10 rows.

To compare query times against a non-accelerated dataset, change the acceleration setting from true to false in the datasets.yaml file and run the same query again.
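A comparison like this is easier to read when each query is run several times. The helper below is a minimal sketch of such a timing harness; run_query is a hypothetical callable that executes a SQL string with whatever client you use, and nothing here is specific to Spice.

```python
import statistics
import time


def time_query(run_query, sql, repeats=5):
    """Run `sql` through `run_query` several times and summarize the timings.

    Returns the minimum, median, and maximum wall-clock time in seconds.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

Run it once with acceleration enabled and once with it disabled, using the same SQL text, and compare the medians. The first run after a restart may include data loading time, so the median is usually a fairer figure than the maximum.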

📄 Documentation

Comprehensive documentation is available at docs.spiceai.org.

🔌 Extensibility

Spice.ai is designed to be extensible with extension points documented at EXTENSIBILITY.md. Build custom Data Connectors, Data Accelerators, Catalog Connectors, Secret Stores, Models, or Embeddings.

🔨 Upcoming Features

🚀 See the Roadmap to v1.0-stable for upcoming features.

🀝 Connect with us

We greatly appreciate and value your support! You can help Spice in a number of ways:

⭐️ star this repo! Thank you for your support! πŸ™
