
indexify
A real-time serving engine for Data-Intensive Generative AI Applications
Stars: 985

Indexify is an open-source engine for building fast data pipelines for unstructured data (video, audio, images, and documents) using reusable extractors for embedding, transformation, and feature extraction. LLM applications can query the transformed, LLM-friendly content with semantic search and SQL queries. Indexify keeps vector databases and structured databases (PostgreSQL) up to date by automatically invoking the pipelines as new data is ingested from external data sources.

**Why use Indexify**

* Makes unstructured data **queryable** with **SQL** and **semantic search**.
* **Real-time** extraction engine keeps indexes **automatically** updated as new data is ingested.
* **Extraction Graphs** describe **data transformation**, **embedding** extraction, and **structured extraction**.
* **Incremental extraction** and **selective deletion** when content is deleted or updated.
* **Extractor SDK** for adding new extraction capabilities, plus many ready-made extractors for **PDF**, **image**, and **video** indexing and extraction.
* Works with **any LLM framework**, including **Langchain**, **DSPy**, etc.
* Runs on your laptop during **prototyping** and scales to **thousands of machines** in the cloud.
* Works with many **blob stores**, **vector stores**, and **structured databases**.
* **Open-sourced automation** for deploying to Kubernetes in production.
README:
Indexify simplifies building and serving durable, multi-stage data-intensive workflows and exposes them as HTTP APIs or Python Remote APIs.
> [!NOTE]
> Indexify is the open-source core compute engine that powers Tensorlake's Serverless Workflow Engine for processing unstructured data.
Indexify is a versatile data processing framework for all kinds of use cases, including:
- Extracting and Indexing PDF Documents
- Scraping and Summarizing Websites
- Transcribing and Summarizing Audio Files
- Object Detection and Description
- Knowledge Graph RAG and Question Answering
- Multi-Cloud/Datacenter/Region: Leverage compute from other clouds in your workflows with very little hassle or configuration.
- Distributed Processing: Run functions in parallel across machines for scale-out use cases.
- Resource Allocation: Span workflows across GPU and CPU instances so that functions can be assigned to their optimal hardware.
- Dynamic Routing: Route data to different specialized compute functions distributed on a cluster based on conditional branching logic.
Install the Tensorlake SDK for building workflows and the Indexify CLI:
```bash
pip install indexify tensorlake
```
Functions decorated with @tensorlake_function() are the units of compute in your Workflow APIs. These functions can have data dependencies on other functions.
Tensorlake functions are durable, i.e., if a function crashes or the node running it is lost, it is automatically retried on another running instance with the same input.
You can run as many instances of a function on the cluster as you like; inputs are automatically load-balanced across them when the workflow is invoked in parallel by other applications.
The example below is a workflow API that accepts some text, chunks and embeds it, and writes the embeddings to a local vector database. Each function can be placed on a different class of machine (CPU-only machines for chunking and writing to the database, NVIDIA GPUs for embedding).
```python
from typing import List

from pydantic import BaseModel
from tensorlake import tensorlake_function, Graph, Image, TensorlakeCompute

# Define the inputs and outputs of the various functions in your workflow.
class Text(BaseModel):
    text: str

class TextChunk(BaseModel):
    chunk: str
    page_number: int

class ChunkEmbedding(BaseModel):
    text: str
    embedding: List[float]

# Define an image capable of running the functions. Each function
# can have its own image.
embedding_image = (
    Image()
    .name("text_embedding_image")
    .run("pip install langchain")
    .run("pip install sentence-transformers")
    .run("pip install langchain-text-splitters")
    .run("pip install chromadb")
    .run("pip install uuid")
)

# Chunk the text for embedding and retrieval.
@tensorlake_function(input_encoder="json", image=embedding_image)
def chunk_text(input: dict) -> List[TextChunk]:
    text = Text.model_validate(input)
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=20,
        length_function=len,
        is_separator_regex=False,
    )
    texts = text_splitter.create_documents([text.text])
    return [
        TextChunk(chunk=chunk.page_content, page_number=i)
        for i, chunk in enumerate(texts)
    ]

# Embed a single chunk.
# Note (Automatic Map): Indexify automatically parallelizes a function when it
# consumes an element from a function that produces a List. Here, each text
# chunk is processed in parallel by an Embedder function.
class Embedder(TensorlakeCompute):
    name = "embedder"
    image = embedding_image

    # A TensorlakeCompute class lets you initialize resources in the constructor;
    # they stay loaded until the compute object is destroyed.
    def __init__(self):
        from sentence_transformers import SentenceTransformer

        self._model = SentenceTransformer("all-MiniLM-L6-v2")

    def run(self, chunk: TextChunk) -> ChunkEmbedding:
        embeddings = self._model.encode(chunk.chunk)
        return ChunkEmbedding(text=chunk.chunk, embedding=embeddings)

class EmbeddingWriter(TensorlakeCompute):
    name = "embedding_writer"
    image = embedding_image

    def __init__(self):
        import chromadb

        self._chroma = chromadb.PersistentClient("./chromadb_tensorlake")
        self._collection = self._chroma.create_collection(
            name="my_collection", get_or_create=True
        )

    def run(self, embedding: ChunkEmbedding) -> None:
        import uuid

        self._collection.upsert(
            ids=[str(uuid.uuid4())],
            embeddings=[embedding.embedding],
            documents=[embedding.text],
        )

# Construct a compute graph connecting the three functions defined above
# into a workflow that runs them as a pipeline.
graph = Graph(
    name="text_embedder",
    start_node=chunk_text,
    description="Splits, embeds and indexes text",
)
graph.add_edge(chunk_text, Embedder)
graph.add_edge(Embedder, EmbeddingWriter)
```
You can test the workflow locally with only the `tensorlake` package installed.
```python
invocation_id = graph.run(input={"text": "This is a test text"})
print(f"Invocation ID: {invocation_id}")

# You can get the output of each function in the graph.
embedding = graph.output(invocation_id, "embedder")
print(embedding)
```
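Since `graph.output` takes a function name, intermediate results can be inspected the same way; for example, the chunks produced by `chunk_text` (illustrative snippet):
```python
# Illustrative: retrieve the intermediate outputs of the chunking step
# with the same graph.output() call used above for the embedder.
chunks = graph.output(invocation_id, "chunk_text")
for chunk in chunks:
    print(chunk.page_number, chunk.chunk[:80])
```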
Big picture: you deploy the Indexify Server on a machine and run a container for each function in the workflow separately.
But first, we will show how to do this locally on a single machine.
Download a server release from here. Open a terminal and start the server.
```bash
./indexify-server -dev
```
The Executor is the component responsible for running your functions. In a terminal where all the dependencies are installed, start an executor in development mode:
```bash
indexify-cli executor --dev
```
Set the environment variable:
```bash
export INDEXIFY_URL=http://localhost:8900
```
Then add the following to the workflow code to deploy the graph to the server:
```python
from tensorlake import RemoteGraph

RemoteGraph.deploy(graph)
```
At this point, you have a Graph endpoint on the Indexify Server, ready to be called as an API from any application.
You can invoke the Graph as a REST API if the first function is configured to accept a JSON payload.
```bash
curl -X 'POST' http://localhost:8900/namespaces/default/compute_graphs/text_embedder/invoke_object \
  -H 'Content-Type: application/json' \
  -d '{"input": {"text": "hello world"}}'
```
This returns an invocation id, e.g. `{"id":"55df51b4a84ffc69"}`. An invocation id can be used to get the status of the workflow as it processes that input and to retrieve any outputs from the graph.
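The same invocation can be made from any HTTP client; for example, a minimal sketch using Python's `requests` library (an assumed extra dependency), mirroring the curl command above:
```python
import requests

# Invoke the text_embedder graph over HTTP; URL and payload are taken from
# the curl example above.
resp = requests.post(
    "http://localhost:8900/namespaces/default/compute_graphs/text_embedder/invoke_object",
    json={"input": {"text": "hello world"}},
)
resp.raise_for_status()
invocation_id = resp.json()["id"]  # e.g. "55df51b4a84ffc69"
print(invocation_id)
```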
Get the outputs of the embedder function:
```bash
curl -X GET http://localhost:8900/namespaces/default/compute_graphs/text_embedder/invocations/55df51b4a84ffc69/outputs
```
This returns all the outputs of the invocation:
```json
{
  "status": "finalized",
  "outputs": [
    {"compute_fn": "chunk_text", "id": "89de2063abadf5d3", "created_at": 1738110077424},
    {"compute_fn": "embedder", "id": "4908f00d711c4cd1", "created_at": 1738110081015}
  ],
  "cursor": null
}
```
You can now retrieve one of the outputs:
```bash
curl -X GET http://localhost:8900/namespaces/default/compute_graphs/text_embedder/invocations/55df51b4a84ffc69/fn/embedder/output/4908f00d711c4cd1
```
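Combining the two calls, a sketch of retrieving the embedder outputs programmatically with `requests` (the output payload format is not specified here, so it is printed raw):
```python
import requests

BASE = "http://localhost:8900/namespaces/default/compute_graphs/text_embedder"
invocation_id = "55df51b4a84ffc69"

# List all outputs of the invocation (same endpoint as the first curl call).
outputs = requests.get(f"{BASE}/invocations/{invocation_id}/outputs").json()["outputs"]

# Download every output produced by the "embedder" function.
for out in outputs:
    if out["compute_fn"] == "embedder":
        resp = requests.get(
            f"{BASE}/invocations/{invocation_id}/fn/embedder/output/{out['id']}"
        )
        print(resp.content)  # raw payload; its encoding depends on the function's output type
```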
You can invoke the Graph from Python too:
```python
from tensorlake import RemoteGraph

remote_graph = RemoteGraph.by_name("text_embedder")
```
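A minimal sketch of invoking the remote graph, assuming `RemoteGraph` exposes the same `run()`/`output()` interface as the local `Graph` shown earlier (an assumption, not confirmed by the snippet above):
```python
# Assumption: RemoteGraph mirrors the local Graph API.
invocation_id = remote_graph.run(input={"text": "hello world"})
embeddings = remote_graph.output(invocation_id, "embedder")
print(embeddings)
```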
Deploying a workflow to production is a two-step process. Run the Indexify Server, for example in Docker:
```bash
docker run -it -p 8900:8900 tensorlake/indexify-server
```
- First, build and deploy container images that contain all the Python and system dependencies of your code. They can be built using standard Docker build systems. For this example, we have a single image that can run all the functions; for more complex projects, you can split functions across images to reduce image size.
```bash
indexify-cli build-image workflow.py
```
This builds the image defined in the workflow code above: `text_embedding_image`.
- Next, deploy the containers, one per function:
```bash
docker run -it text_embedding_image indexify-cli executor --function default:text_embedder:chunk_text
docker run -it text_embedding_image indexify-cli executor --function default:text_embedder:embedder
docker run -it text_embedding_image indexify-cli executor --function default:text_embedder:embedding_writer
```
Containers are treated as ephemeral; only one type of function is ever scheduled on a given container. Here we start three containers, placing one function in each.
Indexify won't complain if you shut down the containers at night. It will still accept new API calls from external systems even if it can't find machines to run the functions; it simply queues them up and waits for executors to come up. It also emits telemetry about pending tasks waiting to be placed, which can be used as an input to autoscalers.
This is it!
- You have built a workflow API that is durable, can be distributed across many kinds of hardware, and can handle virtually limitless scale.
- You can use any Python library under the sun and any system packages, and package them into a container image with your favorite tools.
- Deploying code is as simple as uploading it to the server; it gets distributed and updated automatically.
On the roadmap:
- Function Batching: Process multiple functions in a single batch to improve efficiency.
- Data-Localized Execution: Boost performance by prioritizing execution on machines where intermediate outputs already exist.
- Reducer Optimizations: Optimize performance by batching the serial execution of reduced function calls.
- Parallel Scheduling: Reduce latency by enabling parallel execution across multiple machines.
- Cyclic Graph Support: Enable more flexible agentic behaviors by leveraging cycles in graphs.
- Ephemeral Graphs: Perform multi-stage inference and retrieval without persisting intermediate outputs.
- Data Loader Functions: Stream values into graphs over time using the `yield` keyword.
- TypeScript SDK: Build an SDK for writing workflows in TypeScript.
Similar Open Source Tools


restai
RestAI is an AIaaS (AI as a Service) platform that allows users to create and consume AI agents (projects) using a simple REST API. It supports various types of agents, including RAG (Retrieval-Augmented Generation), RAGSQL (RAG for SQL), inference, vision, and router. RestAI features automatic VRAM management, support for any public LLM supported by LlamaIndex or any local LLM supported by Ollama, a user-friendly API with Swagger documentation, and a frontend for easy access. It also provides evaluation capabilities for RAG agents using deepeval.

premsql
PremSQL is an open-source library designed to help developers create secure, fully local Text-to-SQL solutions using small language models. It provides essential tools for building and deploying end-to-end Text-to-SQL pipelines with customizable components, ideal for secure, autonomous AI-powered data analysis. The library offers features like Local-First approach, Customizable Datasets, Robust Executors and Evaluators, Advanced Generators, Error Handling and Self-Correction, Fine-Tuning Support, and End-to-End Pipelines. Users can fine-tune models, generate SQL queries from natural language inputs, handle errors, and evaluate model performance against predefined metrics. PremSQL is extendible for customization and private data usage.

eole
EOLE is an open language modeling toolkit based on PyTorch. It aims to provide a research-friendly approach with a comprehensive yet compact and modular codebase for experimenting with various types of language models. The toolkit includes features such as versatile training and inference, dynamic data transforms, comprehensive large language model support, advanced quantization, efficient finetuning, flexible inference, and tensor parallelism. EOLE is a work in progress with ongoing enhancements in configuration management, command line entry points, reproducible recipes, core API simplification, and plans for further simplification, refactoring, inference server development, additional recipes, documentation enhancement, test coverage improvement, logging enhancements, and broader model support.

Upsonic
Upsonic offers a cutting-edge enterprise-ready framework for orchestrating LLM calls, agents, and computer use to complete tasks cost-effectively. It provides reliable systems, scalability, and a task-oriented structure for real-world cases. Key features include production-ready scalability, task-centric design, MCP server support, tool-calling server, computer use integration, and easy addition of custom tools. The framework supports client-server architecture and allows seamless deployment on AWS, GCP, or locally using Docker.

llm-answer-engine
This repository contains the code and instructions needed to build a sophisticated answer engine that leverages the capabilities of Groq, Mistral AI's Mixtral, Langchain.JS, Brave Search, Serper API, and OpenAI. Designed to efficiently return sources, answers, images, videos, and follow-up questions based on user queries, this project is an ideal starting point for developers interested in natural language processing and search technologies.

maiar-ai
MAIAR is a composable, plugin-based AI agent framework designed to abstract data ingestion, decision-making, and action execution into modular plugins. It enables developers to define triggers and actions as standalone plugins, while the core runtime handles decision-making dynamically. This framework offers extensibility, composability, and model-driven behavior, allowing seamless addition of new functionality. MAIAR's architecture is influenced by Unix pipes, ensuring highly composable plugins, dynamic execution pipelines, and transparent debugging. It remains declarative and extensible, allowing developers to build complex AI workflows without rigid architectures.

tinystruct
Tinystruct is a simple Java framework designed for easy development with better performance. It offers a modern approach with features like CLI and web integration, built-in lightweight HTTP server, minimal configuration philosophy, annotation-based routing, and performance-first architecture. Developers can focus on real business logic without dealing with unnecessary complexities, making it transparent, predictable, and extensible.

inngest
Inngest is a platform that offers durable functions to replace queues, state management, and scheduling for developers. It allows writing reliable step functions faster without dealing with infrastructure. Developers can create durable functions using various language SDKs, run a local development server, deploy functions to their infrastructure, sync functions with the Inngest Platform, and securely trigger functions via HTTPS. Inngest Functions support retrying, scheduling, and coordinating operations through triggers, flow control, and steps, enabling developers to build reliable workflows with robust support for various operations.

uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.

petals
Petals is a tool that allows users to run large language models at home in a BitTorrent-style manner. It enables fine-tuning and inference up to 10x faster than offloading. Users can generate text with distributed models like Llama 2, Falcon, and BLOOM, and fine-tune them for specific tasks directly from their desktop computer or Google Colab. Petals is a community-run system that relies on people sharing their GPUs to increase its capacity and offer a distributed network for hosting model layers.

UFO
UFO is a UI-focused dual-agent framework that fulfills user requests on Windows OS by seamlessly navigating and operating within individual applications or across multiple applications.

koog
Koog is a Kotlin-based framework for building and running AI agents entirely in idiomatic Kotlin. It allows users to create agents that interact with tools, handle complex workflows, and communicate with users. Key features include pure Kotlin implementation, MCP integration, embedding capabilities, custom tool creation, ready-to-use components, intelligent history compression, powerful streaming API, persistent agent memory, comprehensive tracing, flexible graph workflows, modular feature system, scalable architecture, and multiplatform support.

arbigent
Arbigent (Arbiter-Agent) is an AI agent testing framework designed to make AI agent testing practical for modern applications. It addresses challenges faced by traditional UI testing frameworks and AI agents by breaking down complex tasks into smaller, dependent scenarios. The framework is customizable for various AI providers, operating systems, and form factors, empowering users with extensive customization capabilities. Arbigent offers an intuitive UI for scenario creation and a powerful code interface for seamless test execution. It supports multiple form factors, optimizes UI for AI interaction, and is cost-effective by utilizing models like GPT-4o mini. With a flexible code interface and open-source nature, Arbigent aims to revolutionize AI agent testing in modern applications.

sec-parser
The `sec-parser` project simplifies extracting meaningful information from SEC EDGAR HTML documents by organizing them into semantic elements and a tree structure. It helps in parsing SEC filings for financial and regulatory analysis, analytics and data science, AI and machine learning, causal AI, and large language models. The tool is especially beneficial for AI, ML, and LLM applications by streamlining data pre-processing and feature extraction.

cosdata
Cosdata is a cutting-edge AI data platform designed to power the next generation search pipelines. It features immutability, version control, and excels in semantic search, structured knowledge graphs, hybrid search capabilities, real-time search at scale, and ML pipeline integration. The platform is customizable, scalable, efficient, enterprise-grade, easy to use, and can manage multi-modal data. It offers high performance, indexing, low latency, and high requests per second. Cosdata is designed to meet the demands of modern search applications, empowering businesses to harness the full potential of their data.
For similar tasks


unstructured
The `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and many more. The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.
For similar jobs

db2rest
DB2Rest is a modern low-code REST DATA API platform that simplifies the development of intelligent applications. It seamlessly integrates existing and new databases with language models (LMs/LLMs) and vector stores, enabling the rapid delivery of context-aware, reasoning applications without vendor lock-in.

mage-ai
Mage is an open-source data pipeline tool for transforming and integrating data. It offers an easy developer experience, engineering best practices built-in, and data as a first-class citizen. Mage makes it easy to build, preview, and launch data pipelines, and provides observability and scaling capabilities. It supports data integrations, streaming pipelines, and dbt integration.

airbyte
Airbyte is an open-source data integration platform that makes it easy to move data from any source to any destination. With Airbyte, you can build and manage data pipelines without writing any code. Airbyte provides a library of pre-built connectors that make it easy to connect to popular data sources and destinations. You can also create your own connectors using Airbyte's no-code Connector Builder or low-code CDK. Airbyte is used by data engineers and analysts at companies of all sizes to build and manage their data pipelines.

labelbox-python
Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.

telemetry-airflow
This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow". Some links relevant to users and developers of WTMO:
* The `dags` directory in this repository contains some custom DAG definitions.
* Many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl.
* The Data SRE team maintains a WTMO Developer Guide (behind SSO).

airflow
Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

airbyte-platform
Airbyte is an open-source data integration platform that makes it easy to move data from any source to any destination. With Airbyte, you can build and manage data pipelines without writing any code. Airbyte provides a library of pre-built connectors that make it easy to connect to popular data sources and destinations. You can also create your own connectors using Airbyte's low-code Connector Development Kit (CDK). Airbyte is used by data engineers and analysts at companies of all sizes to move data for a variety of purposes, including data warehousing, data analysis, and machine learning.

chronon
Chronon is a platform that simplifies and improves ML workflows by providing a central place to define features, ensuring point-in-time correctness for backfills, simplifying orchestration for batch and streaming pipelines, offering easy endpoints for feature fetching, and guaranteeing and measuring consistency. It offers benefits over other approaches by enabling the use of a broad set of data for training, handling large aggregations and other computationally intensive transformations, and abstracting away the infrastructure complexity of data plumbing.