
yek
A fast Rust-based tool to serialize text-based files in a repository or directory for LLM consumption
Stars: 2332

Yek is a fast Rust-based tool designed to read text-based files in a repository or directory, chunk them, and serialize them for Large Language Model (LLM) consumption. It uses .gitignore rules to skip unwanted files, Git history to infer which files are most important, and additional ignore patterns. Yek splits content into chunks based on token count or byte size, supports processing multiple directories, and can stream content when output is piped. It is configurable via a yek.yaml (or yek.toml/yek.json) file and places the most important files at the end of the output.
README:
A fast Rust-based tool to serialize text-based files in a repository or directory for LLM consumption.[1]
By default:
- Uses .gitignore rules to skip unwanted files.
- Uses the Git history to infer what files are more important.
- Infers additional ignore patterns (binary, large, etc.).
- Automatically detects if output is being piped and streams content instead of writing to files.
- Supports processing multiple directories in a single command.
- Supports glob patterns and individual file selection.
- Configurable via a yek.yaml file.
Yek يک means "One" in Farsi/Persian.
Consider having a simple repo like this:
.
├── README.md
├── src
│   ├── main.rs
│   └── utils.rs
└── tests
    └── test.rs
Running yek in this directory will produce a single file and write it to the temp directory with the following content:
>>>> README.md
... content ...
>>>> tests/test.rs
... content ...
>>>> src/utils.rs
... content ...
>>>> src/main.rs
... content ...
[!NOTE] yek will prioritize more important files to come last in the output. This is useful for LLM consumption since LLMs tend to pay more attention to content that appears later in the context.
Choose the installation method for your platform:
For macOS/Linux:
curl -fsSL https://bodo.run/yek.sh | bash
For Windows (PowerShell):
irm https://bodo.run/yek.ps1 | iex
Build from Source
git clone https://github.com/bodo-run/yek
cd yek
cargo install --path .
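Whichever installation method you pick, you can verify the install afterwards (the -V/--version flag is documented in the options below):
yek --version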
yek has sensible defaults; you can simply run yek in a directory to serialize the entire repository. It will serialize all files in the repository and write them into a temporary file. The path to the file will be printed to the console.
Process current directory and write to temp directory:
yek
Pipe output to clipboard (macOS):
yek src/ | pbcopy
Cap the max output size to 128K tokens:
yek --tokens 128k
[!NOTE] yek will remove any files that won't fit in the capped context size. It will try to fit in more important files.
Cap the max output size to 100KB and write to a specific directory:
yek --max-size 100KB --output-dir /tmp/yek src/
Process multiple directories:
yek src/ tests/
Process multiple files:
yek file1.txt file2.txt file3.txt
Use glob patterns:
yek "src/**/*.ts"
yek "src/main.rs" "tests/*.rs" "docs/README.md"
[!NOTE] When using glob patterns, make sure to quote them to prevent shell expansion.
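Directories, individual files, and quoted globs can also be mixed in a single invocation; the paths below are purely illustrative:
yek src/ README.md "docs/**/*.md"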
yek --help
Usage: yek [OPTIONS] [input-paths]...
Arguments:
[input-paths]... Input files and/or directories to process
Options:
--no-config Do not use a config file
--config-file <CONFIG_FILE> Path to the config file
-V, --version Print version of yek
--max-size <MAX_SIZE> Max size per chunk. e.g. "10MB" or "128K" or when using token counting mode, "100" or "128K" [default: 10MB]
--tokens <TOKENS> Use token mode instead of byte mode
--json Enable JSON output
--debug Enable debug output
--line-numbers Include line numbers in output
--output-dir [<OUTPUT_DIR>] Output directory. If none is provided & stdout is a TTY, we pick a temp dir
--output-name [<OUTPUT_NAME>] Output filename. If provided, write output to this file in current directory
--output-template [<OUTPUT_TEMPLATE>] Output template. Defaults to ">>>> FILE_PATH\nFILE_CONTENT"
--ignore-patterns <IGNORE_PATTERNS>... Ignore patterns
--unignore-patterns <UNIGNORE_PATTERNS>... Unignore patterns. Yek has some built-in ignore patterns, but you can override them here.
-t, --tree-header Include directory tree header in output (incompatible with JSON output)
--tree-only Show only the directory tree (no file contents, incompatible with JSON output)
-h, --help Print help
- [input-paths]... - Files or directories to process. Supports glob patterns (quote them to prevent shell expansion)
- --no-config - Skip loading any configuration file
- --config-file <CONFIG_FILE> - Use a specific configuration file path instead of searching for default config files
- -V, --version - Print version information and exit
- --max-size <MAX_SIZE> - Maximum size limit per output (e.g., "10MB", "128K"). Used in byte mode
- --tokens <TOKENS> - Use token-based counting instead of bytes (e.g., "128k", "100"). Enables token mode
- --json - Output results in JSON format instead of text
- --debug - Enable debug logging for troubleshooting
- --line-numbers - Include line numbers in the output for each file
- --output-dir [<OUTPUT_DIR>] - Directory to write output files. If not specified and not streaming, uses a temp directory
- --output-name [<OUTPUT_NAME>] - Specific filename for output. If specified, writes to the current directory with this name
- --output-template [<OUTPUT_TEMPLATE>] - Template for formatting output. Use FILE_PATH and FILE_CONTENT placeholders
- --ignore-patterns <IGNORE_PATTERNS>... - Additional patterns to ignore (extends .gitignore and defaults)
- --unignore-patterns <UNIGNORE_PATTERNS>... - Patterns to override built-in ignore rules
- -t, --tree-header - Include a directory tree at the beginning of output (incompatible with JSON)
- --tree-only - Show only the directory tree structure without file contents (incompatible with JSON)
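As a sketch of how these flags compose (the paths are illustrative), the following run caps output at 128k tokens, prepends a directory tree, numbers each line, and writes into a chosen directory:
yek --tokens 128k --tree-header --line-numbers --output-dir /tmp/yek src/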
You can place a file called yek.yaml at your project root or pass a custom path via --config-file. The configuration file allows you to:
- Add custom ignore patterns
- Define file priority rules for processing order
- Add additional binary file extensions to ignore (extends the built-in list)
- Configure Git-based priority boost
- Define output directory and output filename
- Define output template and other output options
Most CLI options can be configured in the config file. The following options can be set:
File Processing:
- max_size - Size limit (same as --max-size)
- tokens - Token count limit (same as --tokens)
- ignore_patterns - Additional ignore patterns (same as --ignore-patterns)
- unignore_patterns - Override built-in ignores (same as --unignore-patterns)
Output Configuration:
- json - Enable JSON output (same as --json)
- debug - Enable debug mode (same as --debug)
- line_numbers - Include line numbers (same as --line-numbers)
- output_dir - Output directory (same as --output-dir)
- output_name - Output filename (same as --output-name)
- output_template - Output template (same as --output-template)
- tree_header - Include directory tree header (same as --tree-header)
- tree_only - Show only directory tree (same as --tree-only)
Config-only Options:
- priority_rules - File priority rules (config file only)
- binary_extensions - Additional binary file extensions (config file only)
- git_boost_max - Maximum Git-based priority boost (config file only)
[!NOTE] Some CLI options like --no-config, --config-file, and --version are CLI-only and cannot be set in config files.
You can also use yek.toml or yek.json instead of yek.yaml.
This is optional; you can configure yek with a yek.yaml file at the root of your project.
# Add patterns to ignore (in addition to .gitignore)
ignore_patterns:
  - "ai-prompts/**"
  - "__generated__/**"

# Configure Git-based priority boost (optional)
git_boost_max: 50 # Maximum score boost based on Git history (default: 100)

# Define priority rules for processing order
# Higher scores are processed first
priority_rules:
  - score: 100
    pattern: "^src/lib/"
  - score: 90
    pattern: "^src/"
  - score: 80
    pattern: "^docs/"

# Add additional binary file extensions to ignore
# These extend the built-in list (.jpg, .png, .exe, etc.)
binary_extensions:
  - ".blend" # Blender files
  - ".fbx"   # 3D model files
  - ".max"   # 3ds Max files
  - ".psd"   # Photoshop files

# Output configuration
max_size: "128K"     # Size limit (can also use tokens: "100k")
json: false          # Enable JSON output
debug: false         # Enable debug logging
line_numbers: false  # Include line numbers in output
tree_header: false   # Include directory tree at start

# Define output directory
output_dir: /tmp/yek

# Define output filename (writes to current directory with this name)
output_name: yek-output.txt

# Define output template.
# FILE_PATH and FILE_CONTENT are expected to be present in the template.
output_template: "FILE_PATH\n\nFILE_CONTENT"
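To load a config stored elsewhere, point yek at it with --config-file, or skip config loading entirely with --no-config; the path below is illustrative:
yek --config-file ./configs/yek.yaml src/
yek --no-config src/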
yek is fast. It's written in Rust and does many things in parallel to speed up processing.
Here is a benchmark comparing it to Repomix serializing the Next.js project:
time yek
Executed in 5.19 secs fish external
usr time 2.85 secs 54.00 micros 2.85 secs
sys time 6.31 secs 629.00 micros 6.31 secs
time repomix
Executed in 22.24 mins fish external
usr time 21.99 mins 0.18 millis 21.99 mins
sys time 0.23 mins 1.72 millis 0.23 mins
yek is 230x faster than repomix.
See proposed features. I am open to accepting new feature requests. Please write a detailed proposal to discuss new features.
- Repomix: A tool to serialize a repository into a single file in a similar way to yek.
- Aider: A full IDE-like experience for coding using AI.
[1] yek is not "blazingly" fast. It's just fast, as fast as your computer can be.
Alternative AI tools for yek
Similar Open Source Tools


mcp-use
MCP-Use is a Python library for analyzing and processing text data using Markov Chains. It provides functionalities for generating text based on input data, calculating transition probabilities, and simulating text sequences. The library is designed to be user-friendly and efficient, making it suitable for natural language processing tasks.

PaddleOCR
PaddleOCR is an easy-to-use and scalable OCR toolkit based on PaddlePaddle. It provides a series of text detection and recognition models, supporting multiple languages and various scenarios. With PaddleOCR, users can perform accurate and efficient text extraction from images and videos, making it suitable for tasks such as document scanning, text recognition, and information extraction.

GraphLLM
GraphLLM is a graph-based framework designed to process data using LLMs. It offers a set of tools including a web scraper, PDF parser, YouTube subtitles downloader, Python sandbox, and TTS engine. The framework provides a GUI for building and debugging graphs with advanced features like loops, conditionals, parallel execution, streaming of results, hierarchical graphs, external tool integration, and dynamic scheduling. GraphLLM is a low-level framework that gives users full control over the raw prompt and output of models, with a steeper learning curve. It is tested with llama70b and qwen 32b, under heavy development with breaking changes expected.

Website-Crawler
Website-Crawler is a tool designed to extract data from websites in an automated manner. It allows users to scrape information such as text, images, links, and more from web pages. The tool provides functionalities to navigate through websites, handle different types of content, and store extracted data for further analysis. Website-Crawler is useful for tasks like web scraping, data collection, content aggregation, and competitive analysis. It can be customized to extract specific data elements based on user requirements, making it a versatile tool for various web data extraction needs.

VectorCode
VectorCode is a code repository indexing tool that helps users write better prompts for coding LLMs by providing information about the code repository being worked on. It includes a neovim plugin and supports multiple embedding engines. The tool enhances completion results by providing project context and improves understanding of close-source or cutting edge projects.

jsonrepair
Jsonrepair is a Python library that provides functionalities to repair and validate JSON files. It helps users to fix common issues in JSON data such as missing commas, incorrect data types, and structural errors. With jsonrepair, users can easily clean up and standardize their JSON files, ensuring they are well-formed and error-free.

ollama4j
Ollama4j is a Java library that serves as a wrapper or binding for the Ollama server. It allows users to communicate with the Ollama server and manage models for various deployment scenarios. The library provides APIs for interacting with Ollama, generating fake data, testing UI interactions, translating messages, and building web UIs. Users can easily integrate Ollama4j into their Java projects to leverage the functionalities offered by the Ollama server.

falkordb-browser
FalkorDB Browser is a user-friendly web application for browsing and managing databases. It provides an intuitive interface for users to interact with their databases, allowing them to view, edit, and query data easily. With FalkorDB Browser, users can perform various database operations without the need for complex commands or scripts, making database management more accessible and efficient.

file-organizer-2000
AI File Organizer 2000 is an Obsidian Plugin that uses AI to transcribe audio, annotate images, and automatically organize files by moving them to the most likely folders. It supports text, audio, and images, with upcoming local-first LLM support. Users can simply place unorganized files into the 'Inbox' folder for automatic organization. The tool renames and moves files quickly, providing a seamless file organization experience. Self-hosting is also possible by running the server and enabling the 'Self-hosted' option in the plugin settings. Join the community Discord server for more information and use the provided iOS shortcut for easy access on mobile devices.

turftopic
Turftopic is a Python library that provides tools for sentiment analysis and topic modeling of text data. It allows users to analyze large volumes of text data to extract insights on sentiment and topics. The library includes functions for preprocessing text data, performing sentiment analysis using machine learning models, and conducting topic modeling using algorithms such as Latent Dirichlet Allocation (LDA). Turftopic is designed to be user-friendly and efficient, making it suitable for both beginners and experienced data analysts.

CredSweeper
CredSweeper is a tool designed to detect credentials like tokens, passwords, and API keys in directories or files. It helps users identify potential exposure of sensitive information by scanning lines, filtering, and utilizing an AI model. The tool reports lines containing possible credentials, their location, and the expected type of credential.

langfuse-docs
Langfuse Docs is a repository for langfuse.com, built on Nextra. It provides guidelines for contributing to the documentation using GitHub Codespaces and local development setup. The repository includes Python cookbooks in Jupyter notebooks format, which are converted to markdown for rendering on the site. It also covers media management for images, videos, and gifs. The stack includes Nextra, Next.js, shadcn/ui, and Tailwind CSS. Additionally, there is a bundle analysis feature to analyze the production build bundle size using @next/bundle-analyzer.

PerforatedAI
PerforatedAI is a machine learning tool designed to automate the process of analyzing and extracting information from perforated documents. It uses advanced OCR technology to accurately identify and extract data from documents with perforations, such as surveys, questionnaires, and forms. The tool can handle various types of perforations and is capable of processing large volumes of documents quickly and efficiently. PerforatedAI streamlines the data extraction process, saving time and reducing errors associated with manual data entry. It is a valuable tool for businesses and organizations that deal with large amounts of perforated documents on a regular basis.

SQLBot
SQLBot is a versatile tool for executing SQL queries and managing databases. It provides a user-friendly interface for interacting with databases, allowing users to easily query, insert, update, and delete data. SQLBot supports various database systems such as MySQL, PostgreSQL, and SQLite, making it a valuable tool for developers, data analysts, and database administrators. With SQLBot, users can streamline their database management tasks and improve their productivity by quickly accessing and manipulating data without the need for complex SQL commands.

vivaria
Vivaria is a web application tool designed for running evaluations and conducting agent elicitation research. Users can interact with Vivaria using a web UI and a command-line interface. It allows users to start task environments based on METR Task Standard definitions, run AI agents, perform agent elicitation research, view API requests and responses, add tags and comments to runs, store results in a PostgreSQL database, sync data to Airtable, test prompts against LLMs, and authenticate using Auth0.
For similar tasks


emigo
Emigo is an AI-powered development tool for Emacs that integrates large language models to interact with projects, read files, write code, execute commands, and more. It acts as an agentic AI assistant, leveraging tool use to enhance development workflows within Emacs. Emigo is actively developed, offering features like agentic tool use, Emacs integration, flexible LLM support, and context-aware interactions. Users can install Emigo with Python dependencies and configure it within Emacs for seamless integration. The tool's core strength lies in its agentic tool use, where the AI analyzes requests, selects appropriate tools, executes actions, and provides feedback, enabling users to accomplish complex tasks efficiently.

atomic_agents
Atomic Agents is a modular and extensible framework designed for creating powerful applications. It follows the principles of Atomic Design, emphasizing small and single-purpose components. Leveraging Pydantic for data validation and serialization, the framework offers a set of tools and agents that can be combined to build AI applications. It depends on the Instructor package and supports various APIs like OpenAI, Cohere, Anthropic, and Gemini. Atomic Agents is suitable for developers looking to create AI agents with a focus on modularity and flexibility.

atomic-agents
The Atomic Agents framework is a modular and extensible tool designed for creating powerful applications. It leverages Pydantic for data validation and serialization. The framework follows the principles of Atomic Design, providing small and single-purpose components that can be combined. It integrates with Instructor for AI agent architecture and supports various APIs like Cohere, Anthropic, and Gemini. The tool includes documentation, examples, and testing features to ensure smooth development and usage.

worker-vllm
The worker-vLLM repository provides a serverless endpoint for deploying OpenAI-compatible vLLM models with blazing-fast performance. It supports deploying various model architectures, such as Aquila, Baichuan, BLOOM, ChatGLM, Command-R, DBRX, DeciLM, Falcon, Gemma, GPT-2, GPT BigCode, GPT-J, GPT-NeoX, InternLM, Jais, LLaMA, MiniCPM, Mistral, Mixtral, MPT, OLMo, OPT, Orion, Phi, Phi-3, Qwen, Qwen2, Qwen2MoE, StableLM, Starcoder2, Xverse, and Yi. Users can deploy models using pre-built Docker images or build custom images with specified arguments. The repository also supports OpenAI compatibility for chat completions, completions, and models, with customizable input parameters. Users can modify their OpenAI codebase to use the deployed vLLM worker and access a list of available models for deployment.

open-assistant-api
Open Assistant API is an open-source, self-hosted AI intelligent assistant API compatible with the official OpenAI interface. It supports integration with more commercial and private models, R2R RAG engine, internet search, custom functions, built-in tools, code interpreter, multimodal support, LLM support, and message streaming output. Users can deploy the service locally and expand existing features. The API provides user isolation based on tokens for SaaS deployment requirements and allows integration of various tools to enhance its capability to connect with the external world.
For similar jobs

db2rest
DB2Rest is a modern low-code REST DATA API platform that simplifies the development of intelligent applications. It seamlessly integrates existing and new databases with language models (LMs/LLMs) and vector stores, enabling the rapid delivery of context-aware, reasoning applications without vendor lock-in.

mage-ai
Mage is an open-source data pipeline tool for transforming and integrating data. It offers an easy developer experience, engineering best practices built-in, and data as a first-class citizen. Mage makes it easy to build, preview, and launch data pipelines, and provides observability and scaling capabilities. It supports data integrations, streaming pipelines, and dbt integration.

airbyte
Airbyte is an open-source data integration platform that makes it easy to move data from any source to any destination. With Airbyte, you can build and manage data pipelines without writing any code. Airbyte provides a library of pre-built connectors that make it easy to connect to popular data sources and destinations. You can also create your own connectors using Airbyte's no-code Connector Builder or low-code CDK. Airbyte is used by data engineers and analysts at companies of all sizes to build and manage their data pipelines.

labelbox-python
Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.

telemetry-airflow
This repository codifies the Airflow cluster that is deployed at workflow.telemetry.mozilla.org (behind SSO) and commonly referred to as "WTMO" or simply "Airflow". Some links relevant to users and developers of WTMO: * The `dags` directory in this repository contains some custom DAG definitions * Many of the DAGs registered with WTMO don't live in this repository, but are instead generated from ETL task definitions in bigquery-etl * The Data SRE team maintains a WTMO Developer Guide (behind SSO)

airflow
Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative. Use Airflow to author workflows as directed acyclic graphs (DAGs) of tasks. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies. Rich command line utilities make performing complex surgeries on DAGs a snap. The rich user interface makes it easy to visualize pipelines running in production, monitor progress, and troubleshoot issues when needed.

airbyte-platform
Airbyte is an open-source data integration platform that makes it easy to move data from any source to any destination. With Airbyte, you can build and manage data pipelines without writing any code. Airbyte provides a library of pre-built connectors that make it easy to connect to popular data sources and destinations. You can also create your own connectors using Airbyte's low-code Connector Development Kit (CDK). Airbyte is used by data engineers and analysts at companies of all sizes to move data for a variety of purposes, including data warehousing, data analysis, and machine learning.

chronon
Chronon is a platform that simplifies and improves ML workflows by providing a central place to define features, ensuring point-in-time correctness for backfills, simplifying orchestration for batch and streaming pipelines, offering easy endpoints for feature fetching, and guaranteeing and measuring consistency. It offers benefits over other approaches by enabling the use of a broad set of data for training, handling large aggregations and other computationally intensive transformations, and abstracting away the infrastructure complexity of data plumbing.