codemogger
Codemogger is a code indexing library and MCP server for AI coding agents
Stars: 103
Code indexing library for AI coding agents. Parses source code with tree-sitter, chunks it into semantic units (functions, structs, classes, impl blocks), embeds them locally, and stores everything in a single SQLite file with vector + full-text search. No Docker, no server, no API keys. One .db file per codebase. Enables keyword search and semantic search for AI coding tools, facilitating precise identifier lookup and natural language queries. Suitable for understanding codebases, discovering implementations, and navigating unfamiliar code quickly. Can be used as a library or CLI tool with incremental indexing and high search quality.
README:
Code indexing library for AI coding agents. Parses source code with tree-sitter, chunks it into semantic units (functions, structs, classes, impl blocks), embeds them locally, and stores everything in a single SQLite file with vector + full-text search.
No Docker, no server, no API keys. One .db file per codebase.
Coding agents need to understand codebases. They need to find where things are defined, discover how concepts are implemented across files, and navigate unfamiliar code quickly. This requires both keyword search (precise identifier lookup) and semantic search (natural language queries when you don't know the exact names).
As AI coding tools become more composable - agents calling agents, MCP servers plugging into different hosts - this capability needs to exist as a library that runs locally. No external servers, no API keys, no Docker containers. Just a function call that returns results.
codemogger is that library. Embedded SQLite (via Turso) with FTS + vector search in a single .db file.
npm install -g codemogger
Or use npx to run without installing.
# Index a project
codemogger index ./my-project
# Search
codemogger search "authentication middleware"
Add to your coding agent's MCP config (Claude Code, OpenCode, etc.):
{
  "mcpServers": {
    "codemogger": {
      "command": "npx",
      "args": ["-y", "codemogger", "mcp"]
    }
  }
}
The MCP server exposes three tools:
- codemogger_search - semantic and keyword search over indexed code
- codemogger_index - index a codebase for the first time
- codemogger_reindex - update the index after modifying files
codemogger is also usable as a library. The SDK has no model dependency - you provide your own embedding function:
import { CodeIndex } from "codemogger"
import { pipeline } from "@huggingface/transformers"
// Load embedding model (runs locally, no API keys)
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", { dtype: "q8" })
const embedder = async (texts: string[]): Promise<number[][]> => {
const output = await extractor(texts, { pooling: "mean", normalize: true })
return output.tolist() as number[][]
}
const db = new CodeIndex({
dbPath: "./my-project.db",
embedder,
embeddingModel: "all-MiniLM-L6-v2",
})
await db.index("/path/to/project")
// Semantic: natural language queries
const semanticResults = await db.search("authentication middleware", { mode: "semantic" })
// Keyword: precise identifier lookup
const keywordResults = await db.search("BTreeCursor", { mode: "keyword" })
await db.close()
The MCP server and CLI ship with all-MiniLM-L6-v2 by default.
# Install globally
npm install -g codemogger
# Index a directory
codemogger index ./my-project
# Search
codemogger search "authentication middleware"
# List indexed codebases
codemogger list
- Scan - walk the directory, respect .gitignore, detect language from extension
- Chunk - parse each file with tree-sitter (WASM), extract top-level definitions (functions, structs, classes, impl blocks). Items >150 lines are split into sub-items.
- Embed - encode each chunk with the provided embedding model (runs locally, no API)
- Store - write chunks + embeddings to SQLite with FTS index
- Search - vector cosine similarity (semantic) or FTS with weighted fields (keyword)
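The semantic side of the last step reduces to cosine similarity between the query embedding and each stored chunk embedding. A minimal sketch of the math only — in codemogger the comparison presumably runs inside Turso's vector index, and `cosine` is not part of its API:

```typescript
// Illustrative only: cosine similarity between two embedding vectors,
// dot(a, b) / (|a| * |b|). Both inputs are plain number arrays.
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

For normalized embeddings (as produced by the `normalize: true` option in the library example above), the denominator is 1 and cosine similarity is just the dot product.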
Incremental: only changed files (by SHA-256 hash) are re-processed on subsequent runs.
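The incremental check can be sketched as follows. This is illustrative only, assuming nothing about codemogger's internals — `changedFiles` and the Map-based state are hypothetical names, not its API:

```typescript
import { createHash } from "node:crypto"

// SHA-256 digest of a file's contents.
function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex")
}

// Hypothetical sketch of hash-based change detection: a file is
// re-chunked and re-embedded only when its current digest differs
// from the digest recorded during the previous index run.
function changedFiles(
  current: Map<string, string>, // path -> current file contents
  stored: Map<string, string>,  // path -> digest from the last run
): string[] {
  return [...current]
    .filter(([path, contents]) => stored.get(path) !== sha256(contents))
    .map(([path]) => path)
}
```

Unchanged files skip the embedding step entirely, which is what makes re-indexing cheap given that embedding dominates indexing time.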
Rust, C, C++, Go, Python, Zig, Java, Scala, JavaScript, TypeScript, TSX, PHP, Ruby.
Benchmarked on 4 real-world codebases on an Apple M2 (8GB). Each project uses its own isolated database. Embeddings use vector8 (int8 quantized, 395 bytes/chunk vs 1,536 for float32). Embedding model: all-MiniLM-L6-v2 (q8 quantized, local CPU). Search times are p50 over 3 runs.
| Project | Language | Files | Semantic | Keyword | ripgrep |
|---|---|---|---|---|---|
| Turso | Rust | 748 | 35 ms | 1 ms | 25 ms |
| Bun | Zig | 9,255 | 137 ms | 2 ms | 166 ms |
| TypeScript | TypeScript | 39,298 | 242 ms | 4 ms | 1,500 ms |
| Kubernetes | Go | 16,668 | 617 ms | 12 ms | 731 ms |
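The vector8 storage figure is easy to sanity-check: a 384-dimensional embedding stored as one int8 per component takes 384 bytes of payload plus a small header (≈395 bytes), versus 384 × 4 = 1,536 bytes as float32. A rough sketch of the quantization step — illustrative only, since `quantizeInt8` is not codemogger's API and Turso's actual vector8 encoding may differ in scaling and layout:

```typescript
// Rough sketch: map components of a normalized embedding
// (values in [-1, 1]) to int8 values in [-127, 127].
function quantizeInt8(v: number[]): Int8Array {
  const out = new Int8Array(v.length)
  for (let i = 0; i < v.length; i++) {
    out[i] = Math.max(-127, Math.min(127, Math.round(v[i] * 127)))
  }
  return out
}

// A 384-dim vector quantizes to a 384-byte payload,
// vs 384 * 4 = 1,536 bytes as float32.
```

Int8 quantization trades a small amount of ranking precision for a roughly 4x reduction in storage and memory traffic.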
Keyword search is 25x-370x faster than ripgrep and returns precise definitions instead of thousands of file matches.
Indexing is a one-time cost dominated by embedding (~97% of time). Subsequent runs only re-embed changed files.
The real advantage isn't speed - it's finding the right code when you don't know the exact keywords.
"write-ahead log replication and synchronization" (Turso)
| codemogger (top 5) | ripgrep |
|---|---|
| impl LogicalLog - core/mvcc/persistent_storage/logical_log.rs | 3 files matched |
| enum CommitState - core/mvcc/database/mod.rs | (keyword: "write-ahead") |
| function new - core/mvcc/database/checkpoint_state_machine.rs | |
| struct LogicalLog - core/mvcc/persistent_storage/logical_log.rs | |
| function checkpoint_shutdown - core/storage/pager.rs | |
"SQL statement parsing and compilation" (Turso)
| codemogger (top 5) | ripgrep |
|---|---|
| function parse_and_build - core/translate/logical.rs | 139 files matched |
| macro compile_sql - core/incremental/compiler.rs | (keyword: "statement") |
| function parse_from_clause_opt - parser/src/parser.rs | |
| function parse_from_clause_table - core/translate/planner.rs | |
| function parse_table - core/translate/planner.rs | |
"HTTP request parsing and response writing" (Bun)
| codemogger (top 5) | ripgrep |
|---|---|
| function consumeRequestLine - packages/bun-uws/src/HttpParser.h | 0 files matched |
| declaration ConsumeRequestLineResult - packages/bun-uws/src/HttpParser.h | (keyword: "HTTP") |
| function llhttp__after_headers_complete - src/bun.js/bindings/node/http/llhttp/http.c | |
| function llhttp_message_needs_eof - src/bun.js/bindings/node/http/llhttp/http.c | |
| function shortRead - packages/bun-uws/src/HttpParser.h | |
"scheduling pods to nodes based on resource requirements" (Kubernetes)
| codemogger (top 5) | ripgrep |
|---|---|
| type Scheduling - staging/src/k8s.io/api/node/v1beta1/types.go | 429 files matched |
| type Scheduling - staging/src/k8s.io/api/node/v1/types.go | (keyword: "scheduling") |
| type SchedulingApplyConfiguration - staging/.../node/v1/scheduling.go | |
| function runPodAndGetNodeName - test/e2e/scheduling/predicates.go | |
| type createPodsOp - test/integration/scheduler_perf/scheduler_perf.go | |
"container health check probes and restart policy" (Kubernetes)
| codemogger (top 5) | ripgrep |
|---|---|
| type ContainerFailures - test/utils/conditions.go | 1,652 files matched |
| variable _ - test/e2e/common/node/container_probe.go | (keyword: "container") |
| function checkContainerStateTransition - pkg/kubelet/status/status_manager.go | |
| function TestDoProbe_TerminatedContainerWithRestartPolicyNever - pkg/kubelet/prober/worker_test.go | |
| function proveHealthCheckNodePortDeallocated - pkg/registry/core/service/storage/storage_test.go | |
ripgrep matches thousands of files on common keywords. codemogger returns the 5 most relevant definitions.
- Bun/TypeScript runtime
- tree-sitter (WASM) for AST-aware chunking - 13 language grammars
- all-MiniLM-L6-v2 for local embeddings (384 dimensions, q8 quantized)
- Turso for storage - embedded SQLite with FTS + vector search extensions
- Single DB file stores multiple codebases with per-codebase FTS tables and global vector search
MIT
Similar Open Source Tools
flyto-core
Flyto-core is a powerful Python library for geospatial analysis and visualization. It provides a wide range of tools for working with geographic data, including support for various file formats, spatial operations, and interactive mapping. With Flyto-core, users can easily load, manipulate, and visualize spatial data to gain insights and make informed decisions. Whether you are a GIS professional, a data scientist, or a developer, Flyto-core offers a versatile and user-friendly solution for geospatial tasks.
shodh-memory
Shodh-Memory is a cognitive memory system designed for AI agents to persist memory across sessions, learn from experience, and run entirely offline. It features Hebbian learning, activation decay, and semantic consolidation, packed into a single ~17MB binary. Users can deploy it on cloud, edge devices, or air-gapped systems to enhance the memory capabilities of AI agents.
skylos
Skylos is a privacy-first SAST tool for Python, TypeScript, and Go that bridges the gap between traditional static analysis and AI agents. It detects dead code, security vulnerabilities (SQLi, SSRF, Secrets), and code quality issues with high precision. Skylos uses a hybrid engine (AST + optional Local/Cloud LLM) to eliminate false positives, verify via runtime, find logic bugs, and provide context-aware audits. It offers automated fixes, end-to-end remediation, and 100% local privacy. The tool supports taint analysis, secrets detection, vulnerability checks, dead code detection and cleanup, agentic AI and hybrid analysis, codebase optimization, operational governance, and runtime verification.
augustus
Augustus is a Go-based LLM vulnerability scanner designed for security professionals to test large language models against a wide range of adversarial attacks. It integrates with 28 LLM providers, covers 210+ adversarial attacks including prompt injection, jailbreaks, encoding exploits, and data extraction, and produces actionable vulnerability reports. The tool is built for production security testing with features like concurrent scanning, rate limiting, retry logic, and timeout handling out of the box.
SWE-AF
SWE-AF is an autonomous engineering team runtime built on AgentField, designed to spin up a full engineering team that can scope, build, adapt, and ship complex software end-to-end. It enables autonomous software engineering factories, scaling from simple goals to multi-issue programs with hundreds to thousands of agent invocations. SWE-AF offers one-call DX for quick deployment and adaptive factory control using three nested control loops to adapt to task difficulty in real-time. It features a factory architecture, continual learning, agent-scale parallelism, fleet-scale orchestration with AgentField, explicit compromise tracking, and long-run reliability.
tokscale
Tokscale is a high-performance CLI tool and visualization dashboard for tracking token usage and costs across multiple AI coding agents. It helps monitor and analyze token consumption from various AI coding tools, providing real-time pricing calculations using LiteLLM's pricing data. Inspired by the Kardashev scale, Tokscale measures token consumption as users scale the ranks of AI-augmented development. It offers interactive TUI mode, multi-platform support, real-time pricing, detailed breakdowns, web visualization, flexible filtering, and social platform features.
Code
A3S Code is an embeddable AI coding agent framework in Rust that allows users to build agents capable of reading, writing, and executing code with tool access, planning, and safety controls. It is production-ready with features like permission system, HITL confirmation, skill-based tool restrictions, and error recovery. The framework is extensible with 19 trait-based extension points and supports lane-based priority queue for scalable multi-machine task distribution.
FalkorDB
FalkorDB is the first queryable Property Graph database to use sparse matrices to represent the adjacency matrix in graphs and linear algebra to query the graph. Primary features:
- Adopting the Property Graph Model
- Nodes (vertices) and Relationships (edges) that may have attributes
- Nodes can have multiple labels
- Relationships have a relationship type
- Graphs represented as sparse adjacency matrices
- OpenCypher with proprietary extensions as a query language
- Queries are translated into linear algebra expressions
goclaw
GoClaw is a multi-agent AI gateway that connects LLMs to your tools, channels, and data. It orchestrates agent teams, inter-agent delegation, and quality-gated workflows across 11+ LLM providers with full multi-tenant isolation. It is a Go port of OpenClaw with enhanced security, multi-tenant PostgreSQL, and production-grade observability. GoClaw's unique strengths include multi-tenant PostgreSQL, agent teams, conversation handoff, evaluate-loop quality gates, runtime custom tools via API, and MCP protocol support.
agent-security-scanner-mcp
The 'agent-security-scanner-mcp' is a security scanner designed for AI coding agents and autonomous assistants. It scans code for vulnerabilities, detects hallucinated packages, and blocks prompt injection. The tool supports two versions: ProofLayer (lightweight) and Full Version (advanced) with different features and capabilities. It provides various tools for scanning code, fixing vulnerabilities, checking package legitimacy, and detecting prompt injection. The scanner also includes specific tools for scanning MCP servers, OpenClaw skills, and integrating with OpenClaw for autonomous AI threat detection. The tool utilizes AST analysis, taint tracking, and cross-file analysis to provide accurate security assessments. It supports multiple languages and ecosystems, offering comprehensive security coverage for various development environments.
PraisonAI
Praison AI is a low-code, centralised framework that simplifies the creation and orchestration of multi-agent systems for various LLM applications. It emphasizes ease of use, customization, and human-agent interaction. The tool leverages AutoGen and CrewAI frameworks to facilitate the development of AI-generated scripts and movie concepts. Users can easily create, run, test, and deploy agents for scriptwriting and movie concept development. Praison AI also provides options for full automatic mode and integration with OpenAI models for enhanced AI capabilities.
llamafarm
LlamaFarm is a comprehensive AI framework that empowers users to build powerful AI applications locally, with full control over costs and deployment options. It provides modular components for RAG systems, vector databases, model management, prompt engineering, and fine-tuning. Users can create differentiated AI products without needing extensive ML expertise, using simple CLI commands and YAML configs. The framework supports local-first development, production-ready components, strategy-based configuration, and deployment anywhere from laptops to the cloud.
zeroclaw
ZeroClaw is a fast, small, and fully autonomous AI assistant infrastructure built with Rust. It features a lean runtime, cost-efficient deployment, fast cold starts, and a portable architecture. It is secure by design, fully swappable, and supports OpenAI-compatible provider support. The tool is designed for low-cost boards and small cloud instances, with a memory footprint of less than 5MB. It is suitable for tasks like deploying AI assistants, swapping providers/channels/tools, and pluggable everything.
roam-code
Roam is a tool that builds a semantic graph of your codebase and allows AI agents to query it with one shell command. It pre-indexes your codebase into a semantic graph stored in a local SQLite DB, providing architecture-level graph queries offline, cross-language, and compact. Roam understands functions, modules, tests coverage, and overall architecture structure. It is best suited for agent-assisted coding, large codebases, architecture governance, safe refactoring, and multi-repo projects. Roam is not suitable for real-time type checking, dynamic/runtime analysis, small scripts, or pure text search. It offers speed, dependency-awareness, LLM-optimized output, fully local operation, and CI readiness.
gpt-load
GPT-Load is a high-performance, enterprise-grade AI API transparent proxy service designed for enterprises and developers needing to integrate multiple AI services. Built with Go, it features intelligent key management, load balancing, and comprehensive monitoring capabilities for high-concurrency production environments. The tool serves as a transparent proxy service, preserving native API formats of various AI service providers like OpenAI, Google Gemini, and Anthropic Claude. It supports dynamic configuration, distributed leader-follower deployment, and a Vue 3-based web management interface. GPT-Load is production-ready with features like dual authentication, graceful shutdown, and error recovery.
For similar tasks
chunkhound
ChunkHound is a modern tool for transforming your codebase into a searchable knowledge base for AI assistants. It utilizes semantic search via the cAST algorithm and regex search, integrating with AI assistants through the Model Context Protocol (MCP). With features like cAST Algorithm, Multi-Hop Semantic Search, Regex search, and support for 22 languages, ChunkHound offers a local-first approach to code analysis and discovery. It provides intelligent code discovery, universal language support, and real-time indexing capabilities, making it a powerful tool for developers looking to enhance their coding experience.
sourcegraph
Sourcegraph is a code search and navigation tool that helps developers read, write, and fix code in large, complex codebases. It provides features such as code search across all repositories and branches, code intelligence for navigation and refactoring, and the ability to fix and refactor code across multiple repositories at once.
awesome-code-ai
A curated list of AI coding tools, including code completion, refactoring, and assistants. This list includes both open-source and commercial tools, as well as tools that are still in development. Some of the most popular AI coding tools include GitHub Copilot, CodiumAI, Codeium, Tabnine, and Replit Ghostwriter.
moatless-tools
Moatless Tools is a hobby project focused on experimenting with using Large Language Models (LLMs) to edit code in large existing codebases. The project aims to build tools that insert the right context into prompts and handle responses effectively. It utilizes an agentic loop functioning as a finite state machine to transition between states like Search, Identify, PlanToCode, ClarifyChange, and EditCode for code editing tasks.
CodeGeeX4
CodeGeeX4-ALL-9B is an open-source multilingual code generation model based on GLM-4-9B, offering enhanced code generation capabilities. It supports functions like code completion, code interpreter, web search, function call, and repository-level code Q&A. The model has competitive performance on benchmarks like BigCodeBench and NaturalCodeBench, outperforming larger models in terms of speed and performance.
code-companion
CodeCompanion.AI is an AI coding assistant desktop app that helps with various coding tasks. It features an interactive chat interface, file system operations, web search capabilities, semantic code search, a fully functional terminal, code preview and approval, unlimited context window, dynamic context management, and more. Users can save chat conversations and set custom instructions per project.
For similar jobs
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM:
- Set LLM usage limits for users on different pricing tiers
- Track LLM usage on a per user and per organization basis
- Block or redact requests containing PIIs
- Improve LLM reliability with failovers, retries and caching
- Distribute API keys with rate limits and cost limits for internal development/production use cases
- Distribute API keys with rate limits and cost limits for students
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.
