llm-benchmarks


Popular LLM benchmarks for Ruby code generation


LLM Benchmarks is a tool for benchmarking AI models' code-generation performance and debugging capabilities. It provides comprehensive metrics, automated testing, and an interactive website for exploring results. The tool aims to ensure fair competition among models by tracking and comparing results consistently, and it emphasizes transparency through open-source code and public results. Users can contribute by adding new model implementations or benchmark types, improving the website, optimizing benchmarks, or enhancing documentation.


class Reality < StandardError
  def initialize
    super("Your model may be hallucinating benchmarks")
  end
end

promised_performance = "O(1)"
performance = "O(n²)" # what the model actually shipped

begin
  puts "Running AI-generated code..."
  raise Reality if performance != promised_performance
rescue Reality
  puts "Welcome to our benchmarks!"
end

MIT License Ruby


🧪 Where AI Models Come to Face Their RuntimeErrors

LLM Benchmarks Website

๐ŸŒ Interactive Results Website

See real-time rankings, detailed metrics, and comprehensive analysis of AI model performance across all benchmark types.

Requirements

unless Gem::Version.new(RUBY_VERSION) >= Gem::Version.new("3.4.0")
  puts "⚠️ Hold up! We need Ruby 3.4+ for this party! ⚠️"
  exit 1
end

puts "✨ You're good to go! Let's benchmark some AI! ✨"

Features

module BenchmarkFeatures
  class << self
    def dual_benchmark_types
      # Two comprehensive benchmark categories
      {
        performance: "⚡️ Raw speed & memory efficiency tests",
        program_fixer: "🛠️ AI debugging & code repair challenges"
      }
    end

    def automated_testing
      # One command for both running benchmarks and generating implementations
      system("bin/main")
    end

    def implementation_generation
      # Automatic implementation generation with OpenRouter models,
      # powered by the ruby_llm gem
      available_models = true
      easy_setup = true
      consistent_results = true

      puts "✨ AI-powered solution generation" if available_models && easy_setup && consistent_results
    end

    def interactive_website
      # Modern React Router website for exploring results
      {
        real_time_rankings: true,
        detailed_metrics: true,
        model_comparisons: true,
        benchmark_insights: true
      }
    end

    def fair_competition(models)
      models.each do |_model|
        # Each model gets the same prompt.
        # Each implementation is saved with a timestamp.
        # Results are tracked and compared consistently.
      end
    end

    def comprehensive_metrics
      {
        performance: {
          speed: "⚡️ Microseconds matter",
          memory: "🧠 Every byte counts",
          complexity: "🤯 O(n) or go home"
        },
        program_fixer: {
          test_success: "✅ Tests passing ratio",
          syntax_validity: "Clean, compilable code",
          rubocop_score: "💎 Ruby style compliance"
        }
      }
    end

    def transparency
      open_source = true
      results_public = true
      bias = nil # We don't do that here

      puts "Trust through code, not words" if open_source && results_public && bias.nil?
    end

    private

    def marketing_buzz
      raise NotImplementedError, "We prefer cold, hard benchmarks"
    end
  end
end

# No AI models were permanently harmed in the making of these benchmarks
# (They just learned some humility)
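The speed side of `comprehensive_metrics` can be sketched with Ruby's standard `Benchmark` module; `measure_speed` below is a hypothetical helper for illustration, not part of this repo's actual harness.

```ruby
require "benchmark"

# Illustrative sketch of capturing a wall-clock speed metric.
# Benchmark.realtime returns the elapsed time in seconds as a Float.
def measure_speed
  Benchmark.realtime { 100_000.times { |i| i * i } }
end

puts format("⚡️ finished in %.4f s", measure_speed)
```

The real benchmarks also track memory, which typically needs a gem such as memory_profiler rather than the standard library alone.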

Quick Start

# Clone this beauty
git clone https://github.com/OskarsEzerins/llm-benchmarks
cd llm-benchmarks

# Install dependencies
bundle install

# Choose your adventure 🎮
bin/main

# Interactive menu with options:
# 1. 🏃‍♂️ Run benchmarks with existing implementations
#    - Performance benchmarks (speed & memory)
#    - Program fixer benchmarks (debugging challenges)
# 2. 🤖 Generate new AI implementations with OpenRouter models
#    - 44+ models available
#    - Automated prompt-to-implementation pipeline

# See detailed results by category
bin/show_all_results

# See combined rankings across all benchmark types
bin/show_total_rankings

# 🌐 Launch the website locally (optional)
cd website
pnpm install
pnpm dev  # runs bin/aggregate_results automatically via predev hook

Adding a new model? Also add its slug → display name entry to config/model_names.json at the repo root. bin/aggregate_results reads this file at build time and bakes the display name into the website's data. The JSON is bundled statically by Vite, so it works on Vercel without any runtime filesystem access.
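An entry in config/model_names.json might look like the fragment below; the slug and display name here are made up for illustration, not taken from the actual file.

```json
{
  "openai/gpt-4o": "GPT-4o"
}
```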

Ways to Contribute

if you.have_ideas? && you.like_benchmarks?
  puts "We'd love your help!"
  fork_it
  create_branch
  push_changes
  pull_request
else
  puts "No pressure! Star us and come back later!"
end
  • Add new model implementations: Run a model against all benchmarks and open a PR with the results. See #19 as an example.
  • Add new benchmark types: Got a clever challenge for AI models?
  • Improve the website: Make those charts even prettier
  • Optimize benchmarks: More accurate, more fair, more challenging
  • Documentation: Help others understand the madness

Project Anatomy

📦 LLM_BENCHMARKS
 ┣ 📂 benchmarks      # Where AI models face their destiny
 ┃ ┣ 📂 performance   # ⚡️ Speed & memory challenges
 ┃ ┗ 📂 program_fixer # 🛠️ Code debugging challenges
 ┣ 📂 implementations # AI's best attempts at glory
 ┃ ┣ 📂 performance   # Generated speed solutions
 ┃ ┗ 📂 program_fixer # Generated debugging fixes
 ┣ 📂 config          # Shared config (model_names.json — slug → display name)
 ┣ 📂 lib             # Our benchmark orchestration tools
 ┣ 📂 results         # The cold, hard truth (JSON data)
 ┣ 📂 website         # 🌐 Interactive results dashboard
 ┗ 📂 bin             # Press buttons, get answers

Benchmark Types

Performance Benchmarks ⚡️

  • CSV data processing at scale
  • Graph shortest path algorithms
  • LRU cache implementations
  • Run-length encoding optimization
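As an illustration of the category, an LRU cache implementation might look like the sketch below. The class name and method signatures are assumptions for illustration; the real required interface is defined by the benchmark's prompt.

```ruby
# Minimal LRU cache sketch — illustrative of the benchmark category,
# not the exact interface the benchmark prompt specifies.
class LRUCache
  def initialize(capacity)
    @capacity = capacity
    @store = {} # Ruby hashes preserve insertion order
  end

  def get(key)
    return nil unless @store.key?(key)
    @store[key] = @store.delete(key) # re-insert to mark as most recently used
  end

  def put(key, value)
    @store.delete(key)
    @store[key] = value
    @store.delete(@store.first[0]) if @store.size > @capacity # evict LRU entry
  end
end
```

Leaning on Ruby's insertion-ordered hashes keeps both operations O(1) without a separate linked list, which is exactly the kind of trade-off the performance benchmarks measure.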

Program Fixer Benchmarks 🛠️

  • Calendar system debugging
  • Parking garage logic repair
  • School library management fixes
  • Vending machine state handling
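To give a flavor of the category: a program fixer challenge hands the model buggy code and scores the repair. The vending machine below is a made-up example in the same spirit, not one of the repo's actual challenge files.

```ruby
# Hypothetical program_fixer-style example: the "fixed" version guards
# the purchase and resets state, where the buggy original did not.
class VendingMachine
  VALID_COINS = [5, 10, 25, 100].freeze

  def initialize
    @credit = 0
  end

  def insert_coin(cents)
    raise ArgumentError, "invalid coin: #{cents}" unless VALID_COINS.include?(cents)
    @credit += cents
  end

  def buy(price_cents)
    return :insufficient_credit if @credit < price_cents
    change = @credit - price_cents
    @credit = 0 # reset state after a successful purchase
    change
  end
end
```

A model's fix is then graded on test pass ratio, syntax validity, and RuboCop compliance, per the metrics above.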

🔬 assert_equal(ai_promises, reality)

Where AI code meets its maker... literally

Built with ❤️ by developers who believe in transparency over hype
