AirspeedVelocity.jl
Easily benchmark a Julia package over its commit history
Stars: 106
AirspeedVelocity.jl is a tool designed to simplify benchmarking of Julia packages over their lifetime. It provides a CLI to generate benchmarks, compare commits/tags/branches, plot benchmarks, and run benchmark comparisons for every submitted PR as a GitHub action. The tool freezes the benchmark script at a specific revision to prevent old history from affecting benchmarks. Users can configure options using CLI flags and visualize benchmark results. AirspeedVelocity.jl can be used to benchmark any Julia package and offers features like generating tables and plots of benchmark results. It also supports custom benchmarks and can be integrated into GitHub actions for automated benchmarking of PRs.
README:
AirspeedVelocity.jl strives to make it easy to benchmark Julia packages over their lifetime. It is inspired by asv.
This package allows you to:
- Generate benchmarks directly from the terminal with an easy-to-use CLI.
- Compare many commits/tags/branches at once.
- Plot those benchmarks, automatically flattening your benchmark suite into a list of plots with generated titles.
- Run as a GitHub action to create benchmark comparisons for every submitted PR (in a bot comment).
This package also freezes the benchmark script at a particular revision, so there is no worry about the old history overwriting the benchmark.
You can install the CLI with:
julia -e 'using Pkg; Pkg.activate(temp=true); Pkg.add("AirspeedVelocity")'
This will install two executables at ~/.julia/bin - make sure to have it on your PATH.
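For example (assuming a bash- or zsh-style shell; adapt to your own setup), you could add a line like this to your shell profile:
export PATH="$HOME/.julia/bin:$PATH"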
You may use the CLI to generate benchmarks for any package with, e.g.,
benchpkg
This will benchmark the package defined in the current directory at its current dirty state, against the default branch (i.e., main or master), over all benchmarks defined in benchmark/benchmarks.jl. It will then print a markdown table of the results while also saving the JSON results to the current directory.
You can configure all options with CLI flags. For example, to benchmark the registered package Transducers.jl at the revisions v0.4.20, v0.4.70, and master, you can use:
benchpkg Transducers \
--rev=v0.4.20,v0.4.70,master \
--bench-on=v0.4.20
This will further use the benchmark script benchmark/benchmarks.jl as it was defined at v0.4.20, and then save the JSON results in the current directory.
We can explicitly view the results of the benchmark as a table with benchpkgtable:
benchpkgtable Transducers \
--rev=v0.4.20,v0.4.70,master
We can also generate plots of the revisions with:
benchpkgplot Transducers \
--rev=v0.4.20,v0.4.70,master \
--format=pdf \
--npart=5
which will generate a pdf file for each set of 5 plots, showing the change with each revision.
You can also provide a custom benchmark.
For example, let's say you have a file script.jl defining a benchmark for SymbolicRegression.jl (we always need to define the SUITE variable as a BenchmarkGroup):
using BenchmarkTools, SymbolicRegression
const SUITE = BenchmarkGroup()
# Create hierarchy of benchmarks:
SUITE["eval_tree_array"] = BenchmarkGroup()
options = Options(; binary_operators=[+, -, *], unary_operators=[cos])
tree = Node(; feature=1) + cos(3.2f0 * Node(; feature=2))
for n in [10, 20]
    SUITE["eval_tree_array"][n] = @benchmarkable(
        eval_tree_array($tree, X, $options),
        evals=10,
        samples=1000,
        setup=(X=randn(Float32, 2, $n))
    )
end
Inside this script, we will also have access to the PACKAGE_VERSION constant, to allow for different behavior depending on the tag.
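For instance, here is a minimal sketch of version-dependent behavior inside the script (assuming PACKAGE_VERSION compares like a VersionNumber; the version cutoff and benchmark size are illustrative):
# Hypothetical sketch: only register this benchmark on revisions where
# the measured code path exists (assumes PACKAGE_VERSION is a VersionNumber).
if PACKAGE_VERSION >= v"0.16.0"
    SUITE["eval_tree_array"][40] = @benchmarkable(
        eval_tree_array($tree, X, $options),
        evals=10,
        samples=1000,
        setup=(X=randn(Float32, 2, 40))
    )
end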
We can run this benchmark over the history of SymbolicRegression.jl with:
benchpkg SymbolicRegression \
-r v0.15.3,v0.16.2 \
-s script.jl \
-o results/ \
--exeflags="--threads=4 -O3"
where we have also specified the output directory and extra flags to pass to the julia executable. We can also now visualize this:
benchpkgplot SymbolicRegression \
-r v0.15.3,v0.16.2 \
-i results/ \
-o plots/
You can use this package in GitHub actions to benchmark every PR submitted to your package, by copying the example workflow: .github/workflows/benchmark_pr.yml.
Every time a PR is submitted to your package, this workflow will run and generate plots of the performance of the PR against the default branch, as well as a markdown table, showing whether the PR improves or worsens performance.
For running benchmarks, you can use the benchpkg command, which is installed in the ~/.julia/bin folder:
benchpkg [package_name] [-r --rev <arg>]
[--url <arg>]
[--path <arg>]
[-o, --output-dir <arg>]
[-e, --exeflags <arg>]
[-a, --add <arg>]
[-s, --script <arg>]
[--bench-on <arg>]
[-f, --filter <arg>]
[--nsamples-load-time <arg>]
[--tune]
[--dont-print]
Benchmark a package over a set of revisions.
# Arguments
- `package_name`: Name of the package. If not given, the package is assumed to be
the current directory.
# Options
- `-r, --rev <arg>`: Revisions to test (delimit by comma). Use `dirty` to
benchmark the current state of the package at `path` (and not a git commit).
The default is `{DEFAULT},dirty`, which will attempt to find the default branch
of the package.
- `--url <arg>`: URL of the package.
- `--path <arg>`: Path of the package. The default is `.` if other arguments are not given.
- `-o, --output-dir <arg>`: Where to save the JSON results. The default is `.`.
- `-e, --exeflags <arg>`: CLI flags for Julia (default: none).
- `-a, --add <arg>`: Extra packages needed (delimit by comma).
- `-s, --script <arg>`: The benchmark script. Default: `benchmark/benchmarks.jl` downloaded from `stable`.
- `--bench-on <arg>`: If the script is not set, this specifies the revision at which
to download `benchmark/benchmarks.jl` from the package.
- `-f, --filter <arg>`: Filter the benchmarks to run (delimit by comma).
- `--nsamples-load-time <arg>`: Number of samples to take when measuring load time of
the package (default: 5). (This means starting a Julia process for each sample.)
- `--dont-print`: Don't print the table.
# Flags
- `--tune`: Whether to run benchmarks with tuning (default: false).
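For example, here is an illustrative invocation combining several of the options above (the revision names and filter patterns are placeholders for your own benchmarks):
benchpkg Transducers \
    --rev=v0.4.70,master \
    --filter=map,foldl \
    --output-dir=results/ \
    --tune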
You can also just generate a table:
benchpkgtable [package_name] [-r --rev <arg>]
[-i --input-dir <arg>]
[--ratio]
[--mode <arg>]
[--url <arg>]
[--path <arg>]
Print a table of the benchmarks of a package as created with `benchpkg`.
# Arguments
- `package_name`: Name of the package.
# Options
- `-r, --rev <arg>`: Revisions to test (delimit by comma).
The default is `{DEFAULT},dirty`, which will attempt to find the default branch
of the package.
- `-i, --input-dir <arg>`: Where the JSON results were saved (default: ".").
- `--url <arg>`: URL of the package. Only used to get the package name.
- `--path <arg>`: Path of the package. The default is `.` if other arguments are not given.
Only used to get the package name.
# Flags
- `--ratio`: Whether to include the ratio (default: false). Only applies when
comparing two revisions.
- `--mode`: Table mode(s). Valid values are "time" (default), to print the
benchmark time, or "memory", to print the allocation and memory usage.
Both options can be passed, if delimited by comma.
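For example, here is an illustrative invocation using the options and flags above (the revision names are placeholders):
benchpkgtable Transducers \
    --rev=v0.4.70,master \
    --input-dir=results/ \
    --ratio \
    --mode=time,memory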
For plotting, you can use the benchpkgplot function:
benchpkgplot package_name [-r --rev <arg>]
[-i --input-dir <arg>]
[-o --output-dir <arg>]
[-n --npart <arg>]
[--format <arg>]
Plot the benchmarks of a package as created with `benchpkg`.
# Arguments
- `package_name`: Name of the package.
# Options
- `-r, --rev <arg>`: Revisions to test (delimit by comma).
- `-i, --input-dir <arg>`: Where the JSON results were saved (default: ".").
- `-o, --output-dir <arg>`: Where to save the plots results (default: ".").
- `-n, --npart <arg>`: Max number of plots per page (default: 10).
- `--format <arg>`: File type to save the plots as (default: "png").
If you prefer to use the Julia API, you can use the benchmark function for generating data. The API is given here.
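A minimal sketch of what that might look like (the positional arguments and return value here are assumptions; consult the API docs for the actual signature and keywords):
using AirspeedVelocity
# Hypothetical sketch: benchmark two revisions of a registered package
# and collect the results (assumed calling convention; see the API docs).
results = AirspeedVelocity.benchmark("Transducers", ["v0.4.20", "v0.4.70"])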
Also be sure to check out PkgBenchmark.jl. PkgBenchmark.jl is a simple wrapper of BenchmarkTools.jl to interface it with Git, and is a good choice for building custom analysis workflows.
However, for me this wrapper is a bit too thin, which is why I created this package.
AirspeedVelocity.jl tries to have more features and workflows readily-available.
It also emphasizes a CLI (though there is a Julia API), as my subjective view is that this is more suitable for interacting side-by-side with git.