distill

Reliable LLM outputs start with clean context. Deterministic deduplication, compression, and caching for RAG pipelines.

Stars: 89

Distill is a reliability layer for LLM context: it deterministically deduplicates, compresses, summarizes, and caches context before it reaches the model, cutting redundant data, token costs, and latency while keeping results reproducible. It installs as a prebuilt binary, via go install, with Docker, or from source, and handles tasks such as deduplicating chunks, connecting to vector databases, integrating with AI assistants over MCP, analyzing files for duplicates, syncing vectors to Pinecone, running queries from the command line, and managing configuration files. Self-hosting is supported via Docker, Docker Compose, source builds, Fly.io, Render, and Railway, and the server ships with Prometheus-compatible metrics, a Grafana dashboard, and OpenTelemetry tracing.

README:

Distill

Reliable LLM outputs start with clean context.

A reliability layer for LLM context. Deterministic deduplication that removes redundancy before it reaches your model.

Less redundant data. Lower costs. Faster responses. More efficient & deterministic results.

Learn more →

Context sources → Distill → LLM
(RAG, tools, memory, docs)    (reliable outputs)

The Problem

LLM outputs are unreliable because context is polluted. "Garbage in, garbage out."

30-40% of context assembled from multiple sources is semantically redundant. The same information from docs, code, memory, and tools competes for attention. This leads to:

  • Non-deterministic outputs — Same workflow, different results
  • Confused reasoning — Signal diluted by repetition
  • Production failures — Works in demos, breaks at scale

You can't fix unreliable outputs with better prompts. You need to fix the context that goes in.

How It Works

Math, not magic. No LLM calls. Fully deterministic.

Step          What it does                                   Benefit
Deduplicate   Remove redundant information across sources    More reliable outputs
Compress      Keep what matters, remove the noise            Lower token costs
Summarize     Condense older context intelligently           Longer sessions
Cache         Instant retrieval for repeated patterns        Faster responses

Pipeline

Query → Over-fetch (50) → Cluster → Select → MMR Re-rank (8) → LLM
  1. Over-fetch - Retrieve 3-5x more chunks than needed
  2. Cluster - Group semantically similar chunks (agglomerative clustering)
  3. Select - Pick best representative from each cluster
  4. MMR Re-rank - Balance relevance and diversity

Result: Deterministic, diverse context in ~12ms. No LLM calls. Fully auditable.
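
For intuition, here is a minimal Go sketch of the MMR re-ranking step. It is not Distill's implementation: the function names, toy vectors, and the choice of cosine similarity are assumptions, but the scoring rule (weight query relevance by lambda, penalize similarity to already-selected chunks by 1 - lambda) matches the --lambda parameter documented under Parameters below.

package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// mmrSelect greedily picks k chunk indices, weighting relevance to the query
// by lambda and penalizing similarity to already-selected chunks by (1 - lambda).
// lambda = 1.0 is pure relevance, 0.0 is pure diversity.
func mmrSelect(query []float64, chunks [][]float64, k int, lambda float64) []int {
	var selected []int
	used := make([]bool, len(chunks))
	for len(selected) < k && len(selected) < len(chunks) {
		bestIdx, bestScore := -1, math.Inf(-1)
		for i := range chunks {
			if used[i] {
				continue
			}
			relevance := cosine(query, chunks[i])
			redundancy := 0.0
			for _, j := range selected {
				if s := cosine(chunks[i], chunks[j]); s > redundancy {
					redundancy = s
				}
			}
			if score := lambda*relevance - (1-lambda)*redundancy; score > bestScore {
				bestIdx, bestScore = i, score
			}
		}
		selected = append(selected, bestIdx)
		used[bestIdx] = true
	}
	return selected
}

func main() {
	query := []float64{1, 0}
	// Chunks 0 and 1 are near-duplicates; chunk 2 is distinct.
	chunks := [][]float64{{0.99, 0.10}, {0.98, 0.12}, {0.70, -0.70}}
	fmt.Println(mmrSelect(query, chunks, 2, 0.5)) // [0 2]: the near-duplicate is dropped
}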

Installation

Binary (Recommended)

Download from GitHub Releases:

# macOS (Apple Silicon)
curl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep "browser_download_url.*darwin_arm64.tar.gz" | cut -d '"' -f 4) | tar xz

# macOS (Intel)
curl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep "browser_download_url.*darwin_amd64.tar.gz" | cut -d '"' -f 4) | tar xz

# Linux (amd64)
curl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep "browser_download_url.*linux_amd64.tar.gz" | cut -d '"' -f 4) | tar xz

# Linux (arm64)
curl -sL $(curl -s https://api.github.com/repos/Siddhant-K-code/distill/releases/latest | grep "browser_download_url.*linux_arm64.tar.gz" | cut -d '"' -f 4) | tar xz

# Move to PATH
sudo mv distill /usr/local/bin/

Or download directly from the releases page.

Go Install

go install github.com/Siddhant-K-code/distill@latest

Docker

docker pull ghcr.io/siddhant-k-code/distill:latest
docker run -p 8080:8080 -e OPENAI_API_KEY=your-key ghcr.io/siddhant-k-code/distill

Build from Source

git clone https://github.com/Siddhant-K-code/distill.git
cd distill
go build -o distill .

Quick Start

1. Standalone API (No Vector DB Required)

Start the API server and send chunks directly:

export OPENAI_API_KEY="your-key"  # For embeddings
distill api --port 8080

Deduplicate chunks:

curl -X POST http://localhost:8080/v1/dedupe \
  -H "Content-Type: application/json" \
  -d '{
    "chunks": [
      {"id": "1", "text": "React is a JavaScript library for building UIs."},
      {"id": "2", "text": "React.js is a JS library for building user interfaces."},
      {"id": "3", "text": "Vue is a progressive framework for building UIs."}
    ]
  }'

Response:

{
  "chunks": [
    {"id": "1", "text": "React is a JavaScript library for building UIs.", "cluster_id": 0},
    {"id": "3", "text": "Vue is a progressive framework for building UIs.", "cluster_id": 1}
  ],
  "stats": {
    "input_count": 3,
    "output_count": 2,
    "reduction_pct": 33,
    "latency_ms": 12
  }
}

With pre-computed embeddings (no OpenAI key needed):

curl -X POST http://localhost:8080/v1/dedupe \
  -H "Content-Type: application/json" \
  -d '{
    "chunks": [
      {"id": "1", "text": "React is...", "embedding": [0.1, 0.2, ...]},
      {"id": "2", "text": "React.js is...", "embedding": [0.11, 0.21, ...]},
      {"id": "3", "text": "Vue is...", "embedding": [0.9, 0.8, ...]}
    ]
  }'

2. With Vector Database

Connect to Pinecone or Qdrant for retrieval + deduplication:

export PINECONE_API_KEY="your-key"
export OPENAI_API_KEY="your-key"

distill serve --index my-index --port 8080

Query with automatic deduplication:

curl -X POST http://localhost:8080/v1/retrieve \
  -H "Content-Type: application/json" \
  -d '{"query": "how do I reset my password?"}'

3. MCP Integration (AI Assistants)

Works with Claude, Cursor, Amp, and other MCP-compatible assistants:

distill mcp

Add to Claude Desktop (~/Library/Application Support/Claude/claude_desktop_config.json):

{
  "mcpServers": {
    "distill": {
      "command": "/path/to/distill",
      "args": ["mcp"]
    }
  }
}

See mcp/README.md for more configuration options.

CLI Commands

distill api       # Start standalone API server
distill serve     # Start server with vector DB connection
distill mcp       # Start MCP server for AI assistants
distill analyze   # Analyze a file for duplicates
distill sync      # Upload vectors to Pinecone with dedup
distill query     # Test a query from command line
distill config    # Manage configuration files

Configuration

Config File

Distill supports a distill.yaml configuration file for persistent settings. Generate a template:

distill config init              # Creates distill.yaml in current directory
distill config init --stdout     # Print template to stdout
distill config validate          # Validate existing config file

Config file search order: ./distill.yaml, $HOME/distill.yaml.

Priority: CLI flags > environment variables > config file > defaults.

Example distill.yaml:

server:
  port: 8080
  host: 0.0.0.0
  read_timeout: 30s
  write_timeout: 60s

embedding:
  provider: openai
  model: text-embedding-3-small
  batch_size: 100

dedup:
  threshold: 0.15
  method: agglomerative
  linkage: average
  lambda: 0.5
  enable_mmr: true

retriever:
  backend: pinecone    # pinecone or qdrant
  index: my-index
  host: ""             # required for qdrant
  namespace: ""
  top_k: 50
  target_k: 8

auth:
  api_keys:
    - ${DISTILL_API_KEY}

Environment variables can be referenced using ${VAR} or ${VAR:-default} syntax.

Environment Variables

OPENAI_API_KEY      # For text → embedding conversion (see note below)
PINECONE_API_KEY    # For Pinecone backend
QDRANT_URL          # For Qdrant backend (default: localhost:6334)
DISTILL_API_KEYS    # Optional: protect your self-hosted instance (see below)

Protecting Your Self-Hosted Instance

If you're exposing Distill publicly, set DISTILL_API_KEYS to require authentication:

# Generate a random API key
export DISTILL_API_KEYS="sk-$(openssl rand -hex 32)"

# Or multiple keys (comma-separated)
export DISTILL_API_KEYS="sk-key1,sk-key2,sk-key3"

Then include the key in requests:

curl -X POST http://your-server:8080/v1/dedupe \
  -H "Authorization: Bearer sk-your-key" \
  -H "Content-Type: application/json" \
  -d '{"chunks": [...]}'

If DISTILL_API_KEYS is not set, the API is open (suitable for local/internal use).

About OpenAI API Key

When you need it:

  • Sending text chunks without pre-computed embeddings
  • Using text queries with vector database retrieval
  • Using the MCP server with text-based tools

When you DON'T need it:

  • Sending chunks with pre-computed embeddings (include "embedding": [...] in your request)
  • Using Distill purely for clustering/deduplication on existing vectors

What it's used for:

  • Converts text to embeddings using text-embedding-3-small model
  • ~$0.00002 per 1K tokens (very cheap)
  • Embeddings are used only for similarity comparison, never stored

Alternatives:

  • Bring your own embeddings - include "embedding" field in chunks
  • Self-host an embedding model - set EMBEDDING_API_URL to your endpoint
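
As one way to exercise the bring-your-own-embeddings path from Go, the sketch below posts chunks with an embedding field to /v1/dedupe, mirroring the request shape shown in the Quick Start. The Chunk struct and the toy vectors are illustrative assumptions, not types exported by Distill.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Chunk mirrors the /v1/dedupe request shape from the Quick Start:
// an id, the text, and an optional pre-computed embedding.
type Chunk struct {
	ID        string    `json:"id"`
	Text      string    `json:"text"`
	Embedding []float64 `json:"embedding,omitempty"`
}

func main() {
	// Toy embeddings; in practice these come from your own embedding model,
	// so no OPENAI_API_KEY is needed on the Distill side.
	payload := map[string][]Chunk{
		"chunks": {
			{ID: "1", Text: "React is a JavaScript library for building UIs.", Embedding: []float64{0.10, 0.20, 0.05}},
			{ID: "2", Text: "React.js is a JS library for building user interfaces.", Embedding: []float64{0.11, 0.21, 0.04}},
			{ID: "3", Text: "Vue is a progressive framework for building UIs.", Embedding: []float64{0.90, 0.80, 0.10}},
		},
	}
	body, err := json.Marshal(payload)
	if err != nil {
		panic(err)
	}

	resp, err := http.Post("http://localhost:8080/v1/dedupe", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var result map[string]any
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", result) // deduplicated chunks plus stats (input_count, output_count, ...)
}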

Parameters

Parameter        Description                                      Default
--threshold      Clustering distance (lower = stricter)           0.15
--lambda         MMR balance: 1.0 = relevance, 0.0 = diversity    0.5
--over-fetch-k   Chunks to retrieve initially                     50
--target-k       Chunks to return after dedup                     8

Self-Hosting

Docker (Recommended)

Use the pre-built image from GitHub Container Registry:

# Pull and run
docker run -p 8080:8080 -e OPENAI_API_KEY=your-key ghcr.io/siddhant-k-code/distill:latest

# Or with a specific version
docker run -p 8080:8080 -e OPENAI_API_KEY=your-key ghcr.io/siddhant-k-code/distill:v0.1.0

Docker Compose

# Start Distill + Qdrant (local vector DB)
docker-compose up

Build from Source

docker build -t distill .
docker run -p 8080:8080 -e OPENAI_API_KEY=your-key distill api

Fly.io

fly launch
fly secrets set OPENAI_API_KEY=your-key
fly deploy

Render

Deploy to Render

Or manually:

  1. Connect your GitHub repo
  2. Set environment variables (OPENAI_API_KEY)
  3. Deploy

Railway

Connect your repo and set OPENAI_API_KEY in environment variables.

Monitoring

Distill exposes a Prometheus-compatible /metrics endpoint in both the api and serve commands.

Metrics

Metric                              Type        Description
distill_requests_total              Counter     Total requests by endpoint and status code
distill_request_duration_seconds    Histogram   Request latency distribution
distill_chunks_processed_total      Counter     Chunks processed (input/output)
distill_reduction_ratio             Histogram   Chunk reduction ratio per request
distill_active_requests             Gauge       Currently processing requests
distill_clusters_formed_total       Counter     Clusters formed during deduplication

Prometheus Scrape Config

scrape_configs:
  - job_name: distill
    static_configs:
      - targets: ['localhost:8080']

Grafana Dashboard

Import the included dashboard from grafana/dashboard.json or use dashboard UID distill-overview.

OpenTelemetry Tracing

Distill supports distributed tracing via OpenTelemetry. Each pipeline stage (embedding, clustering, selection, MMR) is instrumented as a separate span.

Enable via distill.yaml:

telemetry:
  tracing:
    enabled: true
    exporter: otlp         # otlp, stdout, or none
    endpoint: localhost:4317
    sample_rate: 1.0
    insecure: true

Or via environment variables:

export DISTILL_TELEMETRY_TRACING_ENABLED=true
export DISTILL_TELEMETRY_TRACING_ENDPOINT=localhost:4317

Spans emitted per request:

Span                 Attributes
distill.request      endpoint
distill.embedding    chunk_count
distill.clustering   input_count, threshold
distill.selection    cluster_count
distill.mmr          input_count, lambda
distill.retrieval    top_k, backend

Result attributes (distill.result.*) are added to the root span: input_count, output_count, cluster_count, latency_ms, reduction_ratio.

W3C Trace Context propagation is enabled by default for cross-service tracing.

Pipeline Modules

Compression (pkg/compress)

Reduces token count while preserving meaning. Three strategies:

  • Extractive — Scores sentences by position, keyword density, and length; keeps the most salient spans
  • Placeholder — Replaces verbose JSON, XML, and table outputs with compact structural summaries
  • Pruner — Strips filler phrases, redundant qualifiers, and boilerplate patterns

Strategies can be chained via compress.Pipeline. Configure with a target reduction ratio (e.g., 0.3 = keep 30% of the original).
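
To make the extractive idea concrete, here is a stand-alone Go sketch (not the pkg/compress API; the scoring weights and helper names are made up for illustration): score sentences by position, keyword overlap with a query, and length, then keep roughly the target fraction in their original order.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// extractive keeps roughly targetRatio of the sentences, preferring early
// sentences, sentences that share words with the query, and mid-length
// sentences. Illustrative only; the real scorer and weights may differ.
func extractive(text, query string, targetRatio float64) string {
	sentences := strings.Split(text, ". ")
	queryWords := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(query)) {
		queryWords[w] = true
	}

	type scored struct {
		idx   int
		score float64
	}
	var ranked []scored
	for i, s := range sentences {
		score := 1.0 / float64(i+1) // earlier sentences score higher
		for _, w := range strings.Fields(strings.ToLower(s)) {
			if queryWords[w] {
				score += 0.5 // keyword overlap with the query
			}
		}
		if n := len(strings.Fields(s)); n >= 5 && n <= 30 {
			score += 0.25 // prefer mid-length sentences
		}
		ranked = append(ranked, scored{i, score})
	}
	sort.Slice(ranked, func(a, b int) bool { return ranked[a].score > ranked[b].score })

	keep := int(float64(len(sentences))*targetRatio + 0.5)
	if keep < 1 {
		keep = 1
	}
	chosen := ranked[:keep]
	sort.Slice(chosen, func(a, b int) bool { return chosen[a].idx < chosen[b].idx }) // restore original order
	var out []string
	for _, c := range chosen {
		out = append(out, sentences[c.idx])
	}
	return strings.Join(out, ". ")
}

func main() {
	doc := "Distill removes redundant context. It was written in Go. The weather is nice. Deduplication makes LLM outputs more reliable"
	// → "Distill removes redundant context. Deduplication makes LLM outputs more reliable"
	fmt.Println(extractive(doc, "redundant context deduplication", 0.5))
}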

Cache (pkg/cache)

KV cache for repeated context patterns (system prompts, tool definitions, boilerplate). Sub-millisecond retrieval for cache hits.

  • MemoryCache — In-memory LRU with TTL, configurable size limits (entries and bytes), background cleanup
  • PatternDetector — Identifies cacheable content: system prompts, tool/function definitions, code blocks
  • RedisCache — Interface for distributed deployments (requires external Redis)
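
As a rough illustration of the caching idea (content-keyed lookups with a TTL), here is a minimal sketch; the type and method names are hypothetical, and the real MemoryCache additionally does LRU eviction, size limits, and background cleanup.

package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// entry is a cached value plus its expiry time.
type entry struct {
	value     string
	expiresAt time.Time
}

// contextCache is a minimal TTL cache keyed by a hash of the context text.
type contextCache struct {
	ttl   time.Duration
	items map[string]entry
}

func key(text string) string {
	sum := sha256.Sum256([]byte(text))
	return hex.EncodeToString(sum[:])
}

func (c *contextCache) Get(text string) (string, bool) {
	e, ok := c.items[key(text)]
	if !ok || time.Now().After(e.expiresAt) {
		return "", false
	}
	return e.value, true
}

func (c *contextCache) Put(text, value string) {
	c.items[key(text)] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

func main() {
	cache := &contextCache{ttl: time.Minute, items: map[string]entry{}}
	cache.Put("You are a helpful assistant...", "compressed-system-prompt")
	if v, ok := cache.Get("You are a helpful assistant..."); ok {
		fmt.Println("cache hit:", v) // repeated patterns are served without re-processing
	}
}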

Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                            Your App                                  │
└──────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                             Distill                                  │
│                                                                      │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──────────┐  ┌─────────┐  │
│  │  Cache  │→ │ Cluster │→ │ Select  │→ │ Compress │→ │  MMR    │  │
│  │  check  │  │  dedup  │  │  best   │  │  prune   │  │ re-rank │  │
│  └─────────┘  └─────────┘  └─────────┘  └──────────┘  └─────────┘  │
│     <1ms          6ms         <1ms          2ms           3ms        │
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │  /metrics (Prometheus)  ·  distill.yaml  ·  MCP server      │    │
│  └──────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────────┐
│                              LLM                                     │
└──────────────────────────────────────────────────────────────────────┘

Supported Backends

  • Pinecone - Fully supported
  • Qdrant - Fully supported
  • Weaviate - Coming soon

Use Cases

  • Code Assistants - Dedupe context from multiple files/repos
  • RAG Pipelines - Remove redundant chunks before LLM
  • Agent Workflows - Clean up tool outputs + memory + docs
  • Enterprise - Deterministic outputs for compliance

Why not just use an LLM?

LLMs are non-deterministic. Reliability requires deterministic preprocessing.

                 LLM Compression    Distill
Latency          ~500ms             ~12ms
Cost per call    $0.01+             $0.0001
Deterministic    No                 Yes
Lossless         No                 Yes
Auditable        No                 Yes

Use LLMs for reasoning. Use deterministic algorithms for reliability.

Integrations

Works with your existing AI stack:

  • LLM Providers: OpenAI, Anthropic
  • Frameworks: LangChain, LlamaIndex
  • Vector DBs: Pinecone, Qdrant, Weaviate, Chroma, pgvector
  • Tools: Cursor, Lovable, and more

Contributing

Contributions welcome! Please read the contributing guidelines first.

# Run tests
go test ./...

# Build
go build -o distill .

License

AGPL-3.0 - see LICENSE

For commercial licensing, contact: [email protected]
