gepa
Optimize prompts, code, and more with AI-powered Reflective Text Evolution
Optimize text components—AI prompts, code, or instructions—of any system using reflective text evolution.
GEPA (Genetic-Pareto) is a framework for optimizing arbitrary systems composed of text components—like AI prompts, code snippets, or textual specs—against any evaluation metric. It employs LLMs to reflect on system behavior, using feedback from execution and evaluation traces to drive targeted improvements. Through iterative mutation, reflection, and Pareto-aware candidate selection, GEPA evolves robust, high-performing variants with minimal evaluations, co-evolving multiple components in modular systems for domain-specific gains.
This repository provides the official implementation of the GEPA algorithm as proposed in the paper titled "GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning" (https://arxiv.org/abs/2507.19457). In order to reproduce experiments from the paper, we provide a separate reproduction artifact.
```
pip install gepa
```

To install the very latest from main:

```
pip install git+https://github.com/gepa-ai/gepa.git
```

The easiest and most powerful way to use GEPA for prompt optimization is within DSPy, where the GEPA algorithm is directly available through the `dspy.GEPA` API. Directly executable tutorial notebooks are at the dspy.GEPA Tutorials.
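As a rough sketch of that route (the argument names and metric signature follow the `dspy.GEPA` documentation at the time of writing, so treat the details as indicative rather than authoritative), assuming a tiny toy dataset:

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4.1-mini"))

# Toy dataset; real runs would use a proper train/val split.
trainset = [dspy.Example(question="What is 2+2?", answer="4").with_inputs("question")]
valset = [dspy.Example(question="What is 3+3?", answer="6").with_inputs("question")]

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # GEPA metrics may return a plain score, or a score plus textual feedback
    # that the reflection LM uses when proposing improved prompts.
    score = float(gold.answer == pred.answer)
    feedback = "Correct." if score else f"Expected {gold.answer}."
    return dspy.Prediction(score=score, feedback=feedback)

optimizer = dspy.GEPA(metric=metric, auto="light",
                      reflection_lm=dspy.LM("openai/gpt-5"))
optimized = optimizer.compile(dspy.ChainOfThought("question -> answer"),
                              trainset=trainset, valset=valset)
```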
GEPA can be run in just a few lines of code. In this example, we'll use GEPA to optimize a system prompt for math problems from the AIME benchmark (full tutorial). Run the following in an environment with OPENAI_API_KEY:
```python
import gepa

# Load AIME dataset
trainset, valset, _ = gepa.examples.aime.init_dataset()

seed_prompt = {
    "system_prompt": "You are a helpful assistant. You are given a question and you need to answer it. The answer should be given at the end of your response in exactly the format '### <final answer>'"
}

# Let's run the GEPA optimization process.
gepa_result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",  # <-- This is the model being optimized
    max_metric_calls=150,           # <-- Set a budget
    reflection_lm="openai/gpt-5",   # <-- Use a strong model to reflect on mistakes and propose better prompts
)

print("GEPA Optimized Prompt:", gepa_result.best_candidate["system_prompt"])
```

Here, we can see the optimized prompt that GEPA generates for AIME, which improves GPT-4.1 Mini's performance on AIME 2025 from 46.6% to 56.6%, a 10-point gain. Note the details captured in the prompt after just two iterations of GEPA. GEPA can be thought of as precomputing some reasoning (during optimization) to come up with a good plan for future task instances.
GEPA is built around a flexible GEPAAdapter abstraction that lets it plug into any system and optimize different types of text snippets. The above example used a simple DefaultAdapter that plugs into a single-turn LLM environment and evolves system prompts, where tasks are presented as user messages. GEPA can be easily extended to multi-turn and other agentic settings. For example, the dspy.GEPA integration uses a DSPyAdapter.
Beyond prompt optimization, GEPA can evolve entire programs. The DSPy Full Program Adapter demonstrates this by evolving complete DSPy programs—including custom signatures, modules, and control flow logic. Starting from a basic dspy.ChainOfThought("question -> answer") that achieves 67% on the MATH benchmark, GEPA evolves a multi-step reasoning program that reaches 93% accuracy. A fully executable example notebook shows how to use this adapter.
GEPA can be used to optimize any system consisting of textual components. Follow these steps (a minimal sketch follows the list):
- Implement GEPAAdapter: To let the GEPA optimizer pair with your system and its environment, implement the GEPAAdapter interface defined in src/gepa/core/adapter.py. GEPAAdapter requires two methods:
  - Evaluate: Given a candidate consisting of proposed text components and a minibatch of inputs sampled from the train/val sets, run the system, return execution scores, and capture the system traces.
  - Extract traces for reflection: Given the execution traces obtained from executing a proposed candidate and a named component being optimized, return the textual content from the traces relevant to that component.
- Prepare trainset and valset: lists of example inputs and task metadata.
- Call gepa.optimize with your adapter, metric, and system configuration.
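Here is a minimal sketch of what that looks like end to end. The method names and EvaluationBatch fields mirror the interface in src/gepa/core/adapter.py at the time of writing, but defer to that file for exact signatures; `run_my_system`, `my_metric`, and the example lists below are hypothetical placeholders for your own system:

```python
import gepa
from gepa.core.adapter import EvaluationBatch, GEPAAdapter

class MyAdapter(GEPAAdapter):
    def evaluate(self, batch, candidate, capture_traces=False):
        # Run the system with the candidate's text components on each input,
        # scoring outputs and (optionally) capturing execution traces.
        outputs, scores, trajectories = [], [], []
        for example in batch:
            out = run_my_system(candidate["system_prompt"], example)  # hypothetical helper
            outputs.append(out)
            scores.append(my_metric(example, out))                    # hypothetical metric
            if capture_traces:
                trajectories.append({"input": example, "output": out})
        return EvaluationBatch(outputs=outputs, scores=scores,
                               trajectories=trajectories if capture_traces else None)

    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
        # For each component being optimized, return the trace content the
        # reflection LM should see when proposing improved text.
        return {
            comp: [{"Inputs": str(t["input"]), "Generated Outputs": str(t["output"])}
                   for t in eval_batch.trajectories]
            for comp in components_to_update
        }

gepa_result = gepa.optimize(
    seed_candidate={"system_prompt": "You are a helpful assistant."},
    trainset=trainset,  # your prepared example lists
    valset=valset,
    adapter=MyAdapter(),
    reflection_lm="openai/gpt-5",
    max_metric_calls=150,
)
```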
We are actively working on implementing adapters to integrate into many different frameworks. Please open an issue if there's a specific framework you would like to see supported!
Terminal-bench is a benchmark for evaluating the performance of terminal-use agents. Terminus is a leading terminal-use agent. In this script, we use GEPA to optimize the system prompt/terminal-use instruction for the Terminus agent through a custom GEPAAdapter implementation.
Note that the Terminus agent and terminal-bench run in an external environment and are integrated into GEPA via the TerminusAdapter.
To run this example:
```
pip install terminal-bench
python src/gepa/examples/terminal-bench/train_terminus.py --model_name=gpt-5-mini
```

The Generic RAG Adapter enables GEPA to optimize Retrieval-Augmented Generation (RAG) systems using any vector store (ChromaDB, Weaviate, Qdrant, Pinecone) through a pluggable interface. It optimizes query reformulation, context synthesis, answer generation, and document reranking simultaneously.
See the complete RAG adapter examples and documentation for usage examples, supported vector stores, and step-by-step guides.
GEPA optimizes the text components of systems with an evolutionary search algorithm that uses LLM-based reflection to mutate candidates. Most importantly, GEPA leverages task-specific textual feedback (for example, compiler error messages, profiler performance reports, or documentation) to guide the search process. For further details, refer to the paper: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning.
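At a high level, the loop looks roughly like the following toy, self-contained sketch. Everything here (the stand-in metric, the mutation step, and the acceptance rule) is illustrative only and is not the library's implementation; see the paper for the real algorithm:

```python
import random
from statistics import mean

def evaluate(candidate, examples):
    # Stand-in metric: rewards longer prompts. A real system would run an LLM
    # and score its outputs against each example.
    return [min(1.0, len(candidate["prompt"]) / 100) for _ in examples]

def reflect_and_mutate(candidate, feedback):
    # Stand-in for the reflection LM proposing an improved component
    # from execution/evaluation feedback.
    return {"prompt": candidate["prompt"] + " Be precise."}

trainset, valset = list(range(8)), list(range(16))
candidates = [{"prompt": "You are a helpful assistant."}]
val_scores = [evaluate(candidates[0], valset)]

for _ in range(10):  # budget: 10 iterations
    # Pareto-aware sampling: favor candidates that are best on some val instance.
    frontier = {max(range(len(candidates)), key=lambda c: val_scores[c][i])
                for i in range(len(valset))}
    parent = random.choice(sorted(frontier))
    batch = random.sample(trainset, 4)
    before = evaluate(candidates[parent], batch)
    child = reflect_and_mutate(candidates[parent], feedback=before)
    after = evaluate(child, batch)
    if mean(after) > mean(before):  # keep children that improve on the minibatch
        candidates.append(child)
        val_scores.append(evaluate(child, valset))

best = max(range(len(candidates)), key=lambda c: mean(val_scores[c]))
print(candidates[best]["prompt"])
```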
We encourage the community and users to help us develop adapters so GEPA can be used to optimize all kinds of systems that leverage textual components. Refer to DSPy/GEPAAdapter and src/gepa/adapters/ for example GEPAAdapter implementations. Please flag any problems you face as issues.
If you'd like to list yourself as a user, or highlight your usecase for GEPA, please reach out to [email protected].
- Paper: 📄 GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning (arXiv:2507.19457)
- Experiment reproduction artifact: GEPA Artifact Repository
- Talk Slides: GEPA Talk Slides
- Tutorials & Examples:
  - dspy.GEPA Tutorials, with executable notebooks: step-by-step notebooks showing how to use GEPA for practical optimization tasks via DSPy, including math, structured data extraction for enterprise tasks, and a privacy-conscious delegation task
  - Video tutorial by @weaviate on using dspy.GEPA to optimize a listwise reranker
  - Matei Zaharia - Reflective Optimization of Agents with GEPA and DSPy
  - Building and optimizing a multi-agent system for the healthcare domain using DSPy+GEPA
- Social and Discussion:
  - X (formerly Twitter) Announcement Thread (Lakshya A Agrawal)
  - GEPA covered by VentureBeat
  - GEPA's use by Databricks covered by VentureBeat
  - Stay up to date:
  - Questions, Discussions?
- GEPA Integrations: want to use GEPA in other frameworks?
  - DSPy Adapter Code (integrates GEPA with DSPy)
  - MLflow Prompt Optimization: GEPA is integrated into MLflow's `mlflow.genai.optimize_prompts()` API for automatic prompt improvement using evaluation metrics and training data. It works with any agent framework and supports multi-prompt optimization.
  - Contributed Adapters: see our adapter templates and issue tracker to request new integrations.
- DefaultAdapter - System Prompt Optimization for a single-turn task.
- DSPy Full Program Adapter - Evolves entire DSPy programs including signatures, modules, and control flow. Achieves 93% accuracy on MATH benchmark (vs 67% with basic DSPy ChainOfThought).
- Generic RAG Adapter - Vector store-agnostic RAG optimization supporting ChromaDB, Weaviate, Qdrant, Pinecone, and more. Optimizes query reformulation, context synthesis, answer generation, and document reranking prompts.
- MCP Adapter - Optimize Model Context Protocol (MCP) tool usage. Supports local stdio servers, remote SSE/HTTP servers, and optimizes tool descriptions and system prompts.
- TerminalBench Adapter - Integrates GEPA into Terminus, a sophisticated external agentic pipeline, and optimizes the agent's system prompt.
- AnyMaths Adapter - Adapter for optimizing mathematical problem-solving and reasoning tasks. Contributed by @egmaminta.
- GEPA uses:
- Context Compression using GEPA
- GEPA Integration into SuperOptiX-AI
- GEPA for Observable Javascript
- bandit_dspy
- GEPA in Go Programming Language
- 100% accuracy using GEPA on the clock-hands problem
- Prompt Optimization for Reliable Backdoor Detection in AI-Generated Code
- Teaching LLMs to Diagnose Production Incidents with ATLAS+GEPA
- Databricks: Building State-of-the-Art Enterprise Agents 90x Cheaper with GEPA
- comet-ml/opik adds support for GEPA
- Tuning small models (Gemma3-1B) for writing fiction
- Cut OCR Error Rates by up to 38% across model classes (Gemini 2.5 Pro, 2.5 Flash, 2.0 Flash)
- Optimizing a Data Analysis coding agent with GEPA, using execution-guided feedback on real-world workloads
- Generating Naruto (Anime) style dialogues with GPT-4o-mini using GEPA
- Augmenting RL-tuned models with GEPA: Achieving +142% student performance improvement by augmenting an RL-tuned teacher with GEPA
- DeepResearch Agent Optimized with GEPA
- Boosting Sanskrit QA: Finetuning EmbeddingGemma with 50k GEPA generated synthetic data samples (Tweet), (Code)
- Simulating Realistic Market Research Focus Groups with GEPA-Optimized AI Personas
- Optimizing Google ADK Agents' SOP using GEPA
- Hugging Face Cookbook on prompt optimization with DSPy and GEPA
- OpenAI Cookbook showing how to build self-evolving agents using GEPA
If you use this repository, or the GEPA algorithm, kindly cite:
```bibtex
@misc{agrawal2025gepareflectivepromptevolution,
  title={GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning},
  author={Lakshya A Agrawal and Shangyin Tan and Dilara Soylu and Noah Ziems and Rishi Khare and Krista Opsahl-Ong and Arnav Singhvi and Herumb Shandilya and Michael J Ryan and Meng Jiang and Christopher Potts and Koushik Sen and Alexandros G. Dimakis and Ion Stoica and Dan Klein and Matei Zaharia and Omar Khattab},
  year={2025},
  eprint={2507.19457},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.19457},
}
```