codemogger
Codemogger is a code indexing library and MCP server for AI coding agents
Stars: 103
Code indexing library for AI coding agents. Parses source code with tree-sitter, chunks it into semantic units (functions, structs, classes, impl blocks), embeds them locally, and stores everything in a single SQLite file with vector + full-text search. No Docker, no server, no API keys. One .db file per codebase. Enables keyword search and semantic search for AI coding tools, facilitating precise identifier lookup and natural language queries. Suitable for understanding codebases, discovering implementations, and navigating unfamiliar code quickly. Can be used as a library or CLI tool with incremental indexing and high search quality.
README:
Code indexing library for AI coding agents. Parses source code with tree-sitter, chunks it into semantic units (functions, structs, classes, impl blocks), embeds them locally, and stores everything in a single SQLite file with vector + full-text search.
No Docker, no server, no API keys. One .db file per codebase.
Coding agents need to understand codebases. They need to find where things are defined, discover how concepts are implemented across files, and navigate unfamiliar code quickly. This requires both keyword search (precise identifier lookup) and semantic search (natural language queries when you don't know the exact names).
As AI coding tools become more composable - agents calling agents, MCP servers plugging into different hosts - this capability needs to exist as a library that runs locally. No external servers, no API keys, no Docker containers. Just a function call that returns results.
codemogger is that library. Embedded SQLite (via Turso) with FTS + vector search in a single .db file.
npm install -g codemogger
Or use npx to run without installing.
# Index a project
codemogger index ./my-project
# Search
codemogger search "authentication middleware"
Add to your coding agent's MCP config (Claude Code, OpenCode, etc.):
{
  "mcpServers": {
    "codemogger": {
      "command": "npx",
      "args": ["-y", "codemogger", "mcp"]
    }
  }
}
The MCP server exposes three tools:
- codemogger_search - semantic and keyword search over indexed code
- codemogger_index - index a codebase for the first time
- codemogger_reindex - update the index after modifying files
codemogger is also usable as a library. The SDK has no model dependency - you provide your own embedding function:
import { CodeIndex } from "codemogger"
import { pipeline } from "@huggingface/transformers"
// Load embedding model (runs locally, no API keys)
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2", { dtype: "q8" })
const embedder = async (texts: string[]): Promise<number[][]> => {
const output = await extractor(texts, { pooling: "mean", normalize: true })
return output.tolist() as number[][]
}
const db = new CodeIndex({
dbPath: "./my-project.db",
embedder,
embeddingModel: "all-MiniLM-L6-v2",
})
await db.index("/path/to/project")
// Semantic: natural language queries
const semanticResults = await db.search("authentication middleware", { mode: "semantic" })
// Keyword: precise identifier lookup
const keywordResults = await db.search("BTreeCursor", { mode: "keyword" })
await db.close()
The MCP server and CLI ship with all-MiniLM-L6-v2 by default.
# Install globally
npm install -g codemogger
# Index a directory
codemogger index ./my-project
# Search
codemogger search "authentication middleware"
# List indexed codebases
codemogger list
- Scan - walk the directory, respect .gitignore, detect language from extension
- Chunk - parse each file with tree-sitter (WASM), extract top-level definitions (functions, structs, classes, impl blocks). Items >150 lines are split into sub-items.
- Embed - encode each chunk with the provided embedding model (runs locally, no API)
- Store - write chunks + embeddings to SQLite with FTS index
- Search - vector cosine similarity (semantic) or FTS with weighted fields (keyword)
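The semantic side of the last step reduces to cosine similarity between the query embedding and each stored chunk embedding. A minimal sketch of the math only — in codemogger the comparison presumably runs inside Turso's vector index, and `cosine` is not part of its API:

```typescript
// Illustrative only: cosine similarity between two embedding vectors,
// dot(a, b) / (|a| * |b|). Both inputs are plain number arrays.
function cosine(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```

For normalized embeddings (as produced by the `normalize: true` option in the library example above), the denominator is 1 and cosine similarity is just the dot product.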
Incremental: only changed files (by SHA-256 hash) are re-processed on subsequent runs.
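The incremental check can be sketched as follows. This is illustrative only, assuming nothing about codemogger's internals — `changedFiles` and the Map-based state are hypothetical names, not its API:

```typescript
import { createHash } from "node:crypto"

// SHA-256 digest of a file's contents.
function sha256(text: string): string {
  return createHash("sha256").update(text).digest("hex")
}

// Hypothetical sketch of hash-based change detection: a file is
// re-chunked and re-embedded only when its current digest differs
// from the digest recorded during the previous index run.
function changedFiles(
  current: Map<string, string>, // path -> current file contents
  stored: Map<string, string>,  // path -> digest from the last run
): string[] {
  return [...current]
    .filter(([path, contents]) => stored.get(path) !== sha256(contents))
    .map(([path]) => path)
}
```

Unchanged files skip the embedding step entirely, which is what makes re-indexing cheap given that embedding dominates indexing time.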
Rust, C, C++, Go, Python, Zig, Java, Scala, JavaScript, TypeScript, TSX, PHP, Ruby.
Benchmarked on 4 real-world codebases on an Apple M2 (8GB). Each project uses its own isolated database. Embeddings use vector8 (int8 quantized, 395 bytes/chunk vs 1,536 for float32). Embedding model: all-MiniLM-L6-v2 (q8 quantized, local CPU). Search times are p50 over 3 runs.
| Project | Language | Files | Semantic | Keyword | ripgrep |
|---|---|---|---|---|---|
| Turso | Rust | 748 | 35 ms | 1 ms | 25 ms |
| Bun | Zig | 9,255 | 137 ms | 2 ms | 166 ms |
| TypeScript | TypeScript | 39,298 | 242 ms | 4 ms | 1,500 ms |
| Kubernetes | Go | 16,668 | 617 ms | 12 ms | 731 ms |
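The vector8 storage figure is easy to sanity-check: a 384-dimensional embedding stored as one int8 per component takes 384 bytes of payload plus a small header (≈395 bytes), versus 384 × 4 = 1,536 bytes as float32. A rough sketch of the quantization step — illustrative only, since `quantizeInt8` is not codemogger's API and Turso's actual vector8 encoding may differ in scaling and layout:

```typescript
// Rough sketch: map components of a normalized embedding
// (values in [-1, 1]) to int8 values in [-127, 127].
function quantizeInt8(v: number[]): Int8Array {
  const out = new Int8Array(v.length)
  for (let i = 0; i < v.length; i++) {
    out[i] = Math.max(-127, Math.min(127, Math.round(v[i] * 127)))
  }
  return out
}

// A 384-dim vector quantizes to a 384-byte payload,
// vs 384 * 4 = 1,536 bytes as float32.
```

Int8 quantization trades a small amount of ranking precision for a roughly 4x reduction in storage and memory traffic.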
Keyword search is 25x-370x faster than ripgrep and returns precise definitions instead of thousands of file matches.
Indexing is a one-time cost dominated by embedding (~97% of time). Subsequent runs only re-embed changed files.
The real advantage isn't speed - it's finding the right code when you don't know the exact keywords.
"write-ahead log replication and synchronization" (Turso)
| codemogger (top 5) | ripgrep |
|---|---|
| impl LogicalLog - core/mvcc/persistent_storage/logical_log.rs | 3 files matched |
| enum CommitState - core/mvcc/database/mod.rs | (keyword: "write-ahead") |
| function new - core/mvcc/database/checkpoint_state_machine.rs | |
| struct LogicalLog - core/mvcc/persistent_storage/logical_log.rs | |
| function checkpoint_shutdown - core/storage/pager.rs | |
"SQL statement parsing and compilation" (Turso)
| codemogger (top 5) | ripgrep |
|---|---|
| function parse_and_build - core/translate/logical.rs | 139 files matched |
| macro compile_sql - core/incremental/compiler.rs | (keyword: "statement") |
| function parse_from_clause_opt - parser/src/parser.rs | |
| function parse_from_clause_table - core/translate/planner.rs | |
| function parse_table - core/translate/planner.rs | |
"HTTP request parsing and response writing" (Bun)
| codemogger (top 5) | ripgrep |
|---|---|
| function consumeRequestLine - packages/bun-uws/src/HttpParser.h | 0 files matched |
| declaration ConsumeRequestLineResult - packages/bun-uws/src/HttpParser.h | (keyword: "HTTP") |
| function llhttp__after_headers_complete - src/bun.js/bindings/node/http/llhttp/http.c | |
| function llhttp_message_needs_eof - src/bun.js/bindings/node/http/llhttp/http.c | |
| function shortRead - packages/bun-uws/src/HttpParser.h | |
"scheduling pods to nodes based on resource requirements" (Kubernetes)
| codemogger (top 5) | ripgrep |
|---|---|
| type Scheduling - staging/src/k8s.io/api/node/v1beta1/types.go | 429 files matched |
| type Scheduling - staging/src/k8s.io/api/node/v1/types.go | (keyword: "scheduling") |
| type SchedulingApplyConfiguration - staging/.../node/v1/scheduling.go | |
| function runPodAndGetNodeName - test/e2e/scheduling/predicates.go | |
| type createPodsOp - test/integration/scheduler_perf/scheduler_perf.go | |
"container health check probes and restart policy" (Kubernetes)
| codemogger (top 5) | ripgrep |
|---|---|
| type ContainerFailures - test/utils/conditions.go | 1,652 files matched |
| variable _ - test/e2e/common/node/container_probe.go | (keyword: "container") |
| function checkContainerStateTransition - pkg/kubelet/status/status_manager.go | |
| function TestDoProbe_TerminatedContainerWithRestartPolicyNever - pkg/kubelet/prober/worker_test.go | |
| function proveHealthCheckNodePortDeallocated - pkg/registry/core/service/storage/storage_test.go | |
ripgrep matches thousands of files on common keywords. codemogger returns the 5 most relevant definitions.
- Bun/TypeScript runtime
- tree-sitter (WASM) for AST-aware chunking - 13 language grammars
- all-MiniLM-L6-v2 for local embeddings (384 dimensions, q8 quantized)
- Turso for storage - embedded SQLite with FTS + vector search extensions
- Single DB file stores multiple codebases with per-codebase FTS tables and global vector search
MIT
Similar Open Source Tools
flyto-core
Flyto-core is a powerful Python library for geospatial analysis and visualization. It provides a wide range of tools for working with geographic data, including support for various file formats, spatial operations, and interactive mapping. With Flyto-core, users can easily load, manipulate, and visualize spatial data to gain insights and make informed decisions. Whether you are a GIS professional, a data scientist, or a developer, Flyto-core offers a versatile and user-friendly solution for geospatial tasks.
shodh-memory
Shodh-Memory is a cognitive memory system designed for AI agents to persist memory across sessions, learn from experience, and run entirely offline. It features Hebbian learning, activation decay, and semantic consolidation, packed into a single ~17MB binary. Users can deploy it on cloud, edge devices, or air-gapped systems to enhance the memory capabilities of AI agents.
skylos
Skylos is a privacy-first SAST tool for Python, TypeScript, and Go that bridges the gap between traditional static analysis and AI agents. It detects dead code, security vulnerabilities (SQLi, SSRF, Secrets), and code quality issues with high precision. Skylos uses a hybrid engine (AST + optional Local/Cloud LLM) to eliminate false positives, verify via runtime, find logic bugs, and provide context-aware audits. It offers automated fixes, end-to-end remediation, and 100% local privacy. The tool supports taint analysis, secrets detection, vulnerability checks, dead code detection and cleanup, agentic AI and hybrid analysis, codebase optimization, operational governance, and runtime verification.
augustus
Augustus is a Go-based LLM vulnerability scanner designed for security professionals to test large language models against a wide range of adversarial attacks. It integrates with 28 LLM providers, covers 210+ adversarial attacks including prompt injection, jailbreaks, encoding exploits, and data extraction, and produces actionable vulnerability reports. The tool is built for production security testing with features like concurrent scanning, rate limiting, retry logic, and timeout handling out of the box.
SWE-AF
SWE-AF is an autonomous engineering team runtime built on AgentField, designed to spin up a full engineering team that can scope, build, adapt, and ship complex software end-to-end. It enables autonomous software engineering factories, scaling from simple goals to multi-issue programs with hundreds to thousands of agent invocations. SWE-AF offers one-call DX for quick deployment and adaptive factory control using three nested control loops to adapt to task difficulty in real-time. It features a factory architecture, continual learning, agent-scale parallelism, fleet-scale orchestration with AgentField, explicit compromise tracking, and long-run reliability.
tokscale
Tokscale is a high-performance CLI tool and visualization dashboard for tracking token usage and costs across multiple AI coding agents. It helps monitor and analyze token consumption from various AI coding tools, providing real-time pricing calculations using LiteLLM's pricing data. Inspired by the Kardashev scale, Tokscale measures token consumption as users scale the ranks of AI-augmented development. It offers interactive TUI mode, multi-platform support, real-time pricing, detailed breakdowns, web visualization, flexible filtering, and social platform features.
Code
A3S Code is an embeddable AI coding agent framework in Rust that allows users to build agents capable of reading, writing, and executing code with tool access, planning, and safety controls. It is production-ready with features like permission system, HITL confirmation, skill-based tool restrictions, and error recovery. The framework is extensible with 19 trait-based extension points and supports lane-based priority queue for scalable multi-machine task distribution.
FalkorDB
FalkorDB is the first queryable Property Graph database to use sparse matrices to represent the adjacency matrix in graphs and linear algebra to query the graph. Primary features:
- Adopting the Property Graph Model
- Nodes (vertices) and Relationships (edges) that may have attributes
- Nodes can have multiple labels
- Relationships have a relationship type
- Graphs represented as sparse adjacency matrices
- OpenCypher with proprietary extensions as a query language
- Queries are translated into linear algebra expressions
goclaw
GoClaw is a multi-agent AI gateway that connects LLMs to your tools, channels, and data. It orchestrates agent teams, inter-agent delegation, and quality-gated workflows across 11+ LLM providers with full multi-tenant isolation. It is a Go port of OpenClaw with enhanced security, multi-tenant PostgreSQL, and production-grade observability. GoClaw's unique strengths include multi-tenant PostgreSQL, agent teams, conversation handoff, evaluate-loop quality gates, runtime custom tools via API, and MCP protocol support.
agent-security-scanner-mcp
The 'agent-security-scanner-mcp' is a security scanner designed for AI coding agents and autonomous assistants. It scans code for vulnerabilities, detects hallucinated packages, and blocks prompt injection. The tool supports two versions: ProofLayer (lightweight) and Full Version (advanced) with different features and capabilities. It provides various tools for scanning code, fixing vulnerabilities, checking package legitimacy, and detecting prompt injection. The scanner also includes specific tools for scanning MCP servers, OpenClaw skills, and integrating with OpenClaw for autonomous AI threat detection. The tool utilizes AST analysis, taint tracking, and cross-file analysis to provide accurate security assessments. It supports multiple languages and ecosystems, offering comprehensive security coverage for various development environments.
PraisonAI
Praison AI is a low-code, centralised framework that simplifies the creation and orchestration of multi-agent systems for various LLM applications. It emphasizes ease of use, customization, and human-agent interaction. The tool leverages AutoGen and CrewAI frameworks to facilitate the development of AI-generated scripts and movie concepts. Users can easily create, run, test, and deploy agents for scriptwriting and movie concept development. Praison AI also provides options for full automatic mode and integration with OpenAI models for enhanced AI capabilities.
llamafarm
LlamaFarm is a comprehensive AI framework that empowers users to build powerful AI applications locally, with full control over costs and deployment options. It provides modular components for RAG systems, vector databases, model management, prompt engineering, and fine-tuning. Users can create differentiated AI products without needing extensive ML expertise, using simple CLI commands and YAML configs. The framework supports local-first development, production-ready components, strategy-based configuration, and deployment anywhere from laptops to the cloud.
zeroclaw
ZeroClaw is a fast, small, and fully autonomous AI assistant infrastructure built with Rust. It features a lean runtime, cost-efficient deployment, fast cold starts, and a portable architecture. It is secure by design, fully swappable, and supports OpenAI-compatible provider support. The tool is designed for low-cost boards and small cloud instances, with a memory footprint of less than 5MB. It is suitable for tasks like deploying AI assistants, swapping providers/channels/tools, and pluggable everything.
roam-code
Roam is a tool that builds a semantic graph of your codebase and allows AI agents to query it with one shell command. It pre-indexes your codebase into a semantic graph stored in a local SQLite DB, providing architecture-level graph queries offline, cross-language, and compact. Roam understands functions, modules, tests coverage, and overall architecture structure. It is best suited for agent-assisted coding, large codebases, architecture governance, safe refactoring, and multi-repo projects. Roam is not suitable for real-time type checking, dynamic/runtime analysis, small scripts, or pure text search. It offers speed, dependency-awareness, LLM-optimized output, fully local operation, and CI readiness.
gpt-load
GPT-Load is a high-performance, enterprise-grade AI API transparent proxy service designed for enterprises and developers needing to integrate multiple AI services. Built with Go, it features intelligent key management, load balancing, and comprehensive monitoring capabilities for high-concurrency production environments. The tool serves as a transparent proxy service, preserving native API formats of various AI service providers like OpenAI, Google Gemini, and Anthropic Claude. It supports dynamic configuration, distributed leader-follower deployment, and a Vue 3-based web management interface. GPT-Load is production-ready with features like dual authentication, graceful shutdown, and error recovery.
For similar tasks
chunkhound
ChunkHound is a modern tool for transforming your codebase into a searchable knowledge base for AI assistants. It utilizes semantic search via the cAST algorithm and regex search, integrating with AI assistants through the Model Context Protocol (MCP). With features like cAST Algorithm, Multi-Hop Semantic Search, Regex search, and support for 22 languages, ChunkHound offers a local-first approach to code analysis and discovery. It provides intelligent code discovery, universal language support, and real-time indexing capabilities, making it a powerful tool for developers looking to enhance their coding experience.
sourcegraph
Sourcegraph is a code search and navigation tool that helps developers read, write, and fix code in large, complex codebases. It provides features such as code search across all repositories and branches, code intelligence for navigation and refactoring, and the ability to fix and refactor code across multiple repositories at once.
awesome-code-ai
A curated list of AI coding tools, including code completion, refactoring, and assistants. This list includes both open-source and commercial tools, as well as tools that are still in development. Some of the most popular AI coding tools include GitHub Copilot, CodiumAI, Codeium, Tabnine, and Replit Ghostwriter.
moatless-tools
Moatless Tools is a hobby project focused on experimenting with using Large Language Models (LLMs) to edit code in large existing codebases. The project aims to build tools that insert the right context into prompts and handle responses effectively. It utilizes an agentic loop functioning as a finite state machine to transition between states like Search, Identify, PlanToCode, ClarifyChange, and EditCode for code editing tasks.
CodeGeeX4
CodeGeeX4-ALL-9B is an open-source multilingual code generation model based on GLM-4-9B, offering enhanced code generation capabilities. It supports functions like code completion, code interpreter, web search, function call, and repository-level code Q&A. The model has competitive performance on benchmarks like BigCodeBench and NaturalCodeBench, outperforming larger models in terms of speed and performance.
code-companion
CodeCompanion.AI is an AI coding assistant desktop app that helps with various coding tasks. It features an interactive chat interface, file system operations, web search capabilities, semantic code search, a fully functional terminal, code preview and approval, unlimited context window, dynamic context management, and more. Users can save chat conversations and set custom instructions per project.
For similar jobs
sweep
Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.
teams-ai
The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.
ai-guide
This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.
classifai
Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.
chatbot-ui
Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.
BricksLLM
BricksLLM is a cloud native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI and vLLM. BricksLLM aims to provide enterprise level infrastructure that can power any LLM production use cases. Here are some use cases for BricksLLM:
- Set LLM usage limits for users on different pricing tiers
- Track LLM usage on a per user and per organization basis
- Block or redact requests containing PIIs
- Improve LLM reliability with failovers, retries and caching
- Distribute API keys with rate limits and cost limits for internal development/production use cases
- Distribute API keys with rate limits and cost limits for students
uAgents
uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.
griptape
Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.
