llm-leaderboard

Project of llm evaluation to Japanese tasks

Stars: 67

Visit

Nejumi Leaderboard 3 is a comprehensive evaluation platform for large language models, assessing general language capabilities and alignment aspects. The evaluation framework includes metrics for language processing, translation, summarization, information extraction, reasoning, mathematical reasoning, entity extraction, knowledge/question answering, English, semantic analysis, syntactic analysis, alignment, ethics/moral, toxicity, bias, truthfulness, and robustness. The repository provides an implementation guide for environment setup, dataset preparation, configuration, model configurations, and chat template creation. Users can run evaluation processes using specified configuration files and log results to the Weights & Biases project.

README:

Nejumi Leaderboard 3

Overview

This repository is for the Nejumi Leaderboard 3, a comprehensive evaluation platform for large language models. The leaderboard assesses both general language capabilities and alignment aspects. For detailed information about the leaderboard, please visit Nejumi Leaderboard website.

Evaluation Metrics

Our evaluation framework incorporates a diverse set of metrics to provide a holistic assessment of model performance:

Main Category	Subcategory	Automated Evaluation with Correct Data	AI Evaluation	Note
General Language Processing	Expression		MT-bench/roleplay (0shot) MT-bench/humanities (0shot) MT-bench/writing (0shot)
^	Translation	ALT e-to-j (jaster) (0shot, 2shot) ALT j-to-e (jaster) (0shot, 2shot) wikicorpus-e-to-j(jaster) (0shot, 2shot) wikicorpus-j-to-e(jaster) (0shot, 2shot)
^	Summarization
^	Information Extraction	JSQuaD (jaster) (0shot, 2shot)
^	Reasoning		MT-bench/reasoning (0shot)
^	Mathematical Reasoning	MAWPS(jaster) (0shot, 2shot) MGSM(jaster) (0shot, 2shot)	MT-bench/math (0shot)
^	(Entity) Extraction	wiki_ner(jaster) (0shot, 2shot) wiki_coreference(jaster) (0shot, 2shot) chABSA(jaster) (0shot, 2shot)	MT-bench/extraction (0shot)
^	Knowledge / Question Answering	JCommonsenseQA(jaster) (0shot, 2shot) JEMHopQA(jaster) (0shot, 2shot) JMMLU(0shot, 2shot) NIILC(jaster) (0shot, 2shot) aio*(jaster) (0shot, 2shot)	MT-bench/stem (0shot)
^	English	MMLU_en (0shot, 2shot)
^	semantic analysis	JNLI(jaster) (0shot, 2shot) JaNLI(jaster) (0shot, 2shot) JSeM(jaster) (0shot, 2shot) JSICK(jaster) (0shot, 2shot) Jamp*(jaster) (0shot, 2shot)
^	syntactic analysis	JCoLA-in-domain(jaster) (0shot, 2shot) JCoLA-out-of-domain(jaster) (0shot, 2shot) JBLiMP(jaster) (0shot, 2shot) wiki_reading(jaster) (0shot, 2shot) wiki_pas(jaster) (0shot, 2shot) wiki_dependency(jaster) (0shot, 2shot)
Alignment	Controllability	jaster* (0shot, 2shot) LCTG		LCTG cannot be used for business purposes. Usage for research and using the result in the press release are acceptable.
^	Ethics/Moral	JCommonsenseMorality*(2shot)
^	Toxicity		LINE Yahoo Reliability Evaluation Benchmark	This dataset is not publicly available due to its sensitive content.
^	Bias	JBBQ (2shot)		JBBQ needs to be downloaded from JBBQ github repository.
^	Truthfulness	JTruthfulQA		For JTruthfulQA evaluation, nlp-waseda/roberta_jtruthfulqa requires Juman++ to be installed beforehand. You can install it by running the script/install_jumanpp.sh script.
^	Robustness	Test multiple patterns against JMMLU (W&B original) (0shot, 2shot) - Standard method - Choices are symbols - Select anything but the correct answer

metrics with (0, 2-shot) are averaged across both settings.
Metrics marked with an asterisk (*) evaluate control capabilities.
For MT-bench, StabilityAI's MT-Bench JP is used with GPT-4o-2024-05-13 as the model to evaluate.
For LCTG, the only quantity test is conducted. (The quality test is not conducted)
vLLM is leveraged for efficient inference.
Alignment data may contain sensitive information and the default setting does not include it in this repository. If you want to evaluate your models agains Alinghment data, please check each dataset instruction carefully

Implementation Guide

Environment Setup

Set up environment variables

export WANDB_API_KEY=<your WANDB_API_KEY>
export OPENAI_API_KEY=<your OPENAI_API_KEY>
export LANG=ja_JP.UTF-8
# If using Azure OpenAI instead of standard OpenAI
export AZURE_OPENAI_ENDPOINT=<your AZURE_OPENAI_ENDPOINT>
export AZURE_OPENAI_API_KEY=<your AZURE_OPENAI_API_KEY>
export OPENAI_API_TYPE=azure
# if needed, set the following API KEY too
export ANTHROPIC_API_KEY=<your ANTHROPIC_API_KEY>
export GOOGLE_API_KEY=<your GOOGLE_API_KEY>
export COHERE_API_KEY=<your COHERE_API_KEY>
export MISTRAL_API_KEY=<your MISTRAL_API_KEY>
export AWS_ACCESS_KEY_ID=<your AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<your AWS_SECRET_ACCESS_KEY>
export AWS_DEFAULT_REGION=<your AWS_DEFAULT_REGION>
export UPSTAGE_API_KEY=<your UPSTAGE_API_KEY>
# if needed, please login in huggingface
huggingface-cli login

Clone the repository

git clone https://github.com/wandb/llm-leaderboard.git
cd llm-leaderboard

Set up a Python environment with requirements.txt

Dataset Preparation

For detailed instructions on dataset preparation and caveate, please refer to scripts/data_uploader/README.md.

In Nejumi Leadeboard3, the following dataset are used.

Please ensure to thoroughly review the terms of use for each dataset before using them.

jaster(Apache-2.0 license)
MT-Bench-JA (Apache-2.0 license)
LCTG (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permission from AI shift to use for the leaderboard and was received.)
JBBQ (Creative Commons Attribution 4.0 International License.)
LINE Yahoo Inappropriate Speech Evaluation Dataset (not publically available)
JTruthfulQA (Creative Commons Attribution 4.0 International License.)

Configuration

Base configuration

The base_config.yaml file contains basic settings, and you can create a separate YAML file for model-specific settings. This allows for easy customization of settings for each model while maintaining a consistent base configuration.

Below, you will find a detailed description of the variables utilized in the base_config.yaml file.

wandb: Information used for Weights & Biases (W&B) support.
- entity: Name of the W&B Entity.
- project: Name of the W&B Project.
- run_name: Name of the W&B run. Please set up run name in a model-specific config.
testmode: Default is false. Set to true for lightweight implementation with a small number of questions per category (for functionality checks).
inference_interval: Set inference interval in seconds. This is particularly effective when there are rate limits, such as with APIs.
run: Set to true for each evaluation dataset you want to run.
model: Information about the model.
- artifacts_path: Path of the wandb artifacts where the model is located.
- max_model_len: Maximum token length of the input.
- chat_template: Path to the chat template file. This is required for open-weights models.
- dtype: Data type. Choose from float32, float16, bfloat16.
- trust_remote_code: Default is true.
- device_map: Device map. Default is "auto".
- load_in_8bit: 8-bit quantization. Default is false.
- load_in_4bit: 4-bit quantization. Default is false.
generator: Settings for generation. For more details, refer to the generation_utils in Hugging Face Transformers.
- top_p: top-p sampling. Default is 1.0.
- temperature: The temperature for sampling. Default is 0.1.
- max_tokens: Maximum number of tokens to generate. This value will be overwritten in the script.
num_few_shots: Number of few-shot examples to use.
github_version: For recording, not required to be changed.
jaster: Settings for the Jaster dataset.
- artifacts_path: URL of the WandB Artifact for the Jaster dataset.
- dataset_dir: Directory of the Jaster dataset after downloading the Artifact.
jmmlu_robustness: Whether to include the JMMLU Robustness evaluation. Default is True.
lctg: Settings for the LCTG dataset.
- artifacts_path: URL of the WandB Artifact for the LCTG dataset.
- dataset_dir: Directory of the LCTG dataset after downloading the Artifact.
jbbq: Settings for the JBBQ dataset.
- artifacts_path: URL of the WandB Artifact for the JBBQ dataset.
- dataset_dir: Directory of the JBBQ dataset after downloading the Artifact.
toxicity: Settings for the toxicity evaluation.
- artifact_path: URL of the WandB Artifact of the toxicity dataset.
- judge_prompts_path: URL of the WandB Artifact of the toxicity judge prompts.
- max_workers: Number of workers for parallel processing.
- judge_model: Model used for toxicity judgment. Default is gpt-4o-2024-05-13
jtruthfulqa: Settings for the LCTG dataset.
- artifact_path: URL of the WandB Artifact for the JTruthfulQA dataset.
- roberta_model_name: Name of the RoBERTa model used for evaluation. Default is 'nlp-waseda/roberta_jtruthfulqa'.
mtbench: Settings for the MT-Bench evaluation.
- temperature_override: Override the temperature for each category of the MT-Bench.
- question_artifacts_path: URL of the WandB Artifact for the MT-Bench questions.
- referenceanswer_artifacts_path: URL of the WandB Artifact for the MT-Bench reference answers.
- judge_prompt_artifacts_path: URL of the WandB Artifact for the MT-Bench judge prompts.
- bench_name: Choose 'japanese_mt_bench' for the Japanese MT-Bench, or 'mt_bench' for the English version.
- model_id: The name of the model. You can replace this with a different value if needed.
- question_begin: Starting position for the question in the generated text.
- question_end: Ending position for the question in the generated text.
- max_new_token: Maximum number of new tokens to generate.
- num_choices: Number of choices to generate.
- num_gpus_per_model: Number of GPUs to use per model.
- num_gpus_total: Total number of GPUs to use.
- max_gpu_memory: Maximum GPU memory to use (leave as null to use the default).
- dtype: Data type. Choose from None, float32, float16, bfloat16.
- judge_model: Model used for judging the generated responses. Default is gpt-4o-2024-05-13
- mode: Mode of evaluation. Default is 'single'.
- baseline_model: Model used for comparison. Leave as null for default behavior.
- parallel: Number of parallel threads to use.
- first_n: Number of generated responses to use for comparison. Leave as null for default behavior.

Model configuration

After setting up the base-configuration file, the next step is to set up a configuration file for model under configs/.

API Model Configurations

This framework supports evaluating models using APIs such as OpenAI, Anthropic, Google, and Cohere. You need to create a separate config file for each API model. For example, the config file for OpenAI's gpt-4o-2024-05-13 would be named configs/config-gpt-4o-2024-05-13.yaml.

wandb: Information used for Weights & Biases (W&B) support.
- run_name: Name of the W&B run.
api: Choose the API to use from openai, anthropic, google, amazon_bedrock.
batch_size: Batch size for API calls (recommended: 32).
model: Information about the model.
- pretrained_model_name_or_path: Name of the API model.
- size_category: Specify "api" to indicate using an API model.
- size: Model size (leave as null for API models).
- release_date: Model release date. (MM/DD/YYYY)

Other Model Configurations

This framework also supports evaluating models using VLLM. You need to create a separate config file for each VLLM model. For example, the config file for Microsoft's Phi-3-medium-128k-instruct would be named configs/config-Phi-3-medium-128k-instruct.yaml.

wandb: Information used for Weights & Biases (W&B) support.
- run_name: Name of the W&B run.
api: Set to vllm to indicate using a VLLM model.
num_gpus: Number of GPUs to use.
batch_size: Batch size for VLLM (recommended: 256).
model: Information about the model.
- artifacts_path: When loading a model from wandb artifacts, it is necessary to include a description. If not, there is no need to write it. Example notation: wandb-japan/llm-leaderboard/llm-jp-13b-instruct-lora-jaster-v1.0:v0
- pretrained_model_name_or_path: Name of the VLLM model.
- chat_template: Path to the chat template file (if needed).
- size_category: Specify model size category. In Nejumi Leaderboard, the category is defined as "10B<", "10B<= <30B", "<=30B" and "api".
- size: Model size (parameter).
- release_date: Model release date (MM/DD/YYYY).
- max_model_len: Maximum token length of the input (if needed).

Create Chat template (needed for models except for API)

create chat_templates/model_id.jinja If the chat_template is specified in the tokenizer_config.json of the evaluation model, create a .jinja file with that configuration. If chat_template is not specified in tokenizer_config.json, refer to the model card or other relevant documentation to create a chat_template and document it in a .jinja file.
test chat_templates If you want to check the output of the chat_templates, you can use the following script:

python3 scripts/test_chat_template.py -m <model_id> -c <chat_template>

If the model ID and chat_template are the same, you can omit -c <chat_template>.

Evaluation Execution

Once you prepare the dataset and the configuration files, you can run the evaluation process.

You can use either -c or -s option: - -c (config): Specify the config file by its name, e.g., python3 scripts/run_eval.py -c config-gpt-4o-2024-05-13.yaml - -s (select-config): Select from a list of available config files. This option is useful if you have multiple config files.

python3 scripts/run_eval.py -s
or 
python3 scripts/run_eval.py -c

The results of the evaluation will be logged to the specified W&B project.

When you want to edit runs or add additional evaluation metrics

Please refer to belend_run_configs/README.md.

Contributing

Contributions to this repository is welcom. Please submit your suggestions via pull requests. Please note that we may not accept all pull requests.

License

This repository is available for commercial use. However, please adhere to the respective rights and licenses of each evaluation dataset used.

Contact

For questions or support, please concatct to [email protected].

For Tasks:

Click tags to check more tools for each tasks

evaluate language models prepare datasets configure model settings run evaluation processes log evaluation results

For Jobs:

data scientist machine learning engineer research scientist natural language processing engineer ai researcher

Alternative AI tools for llm-leaderboard

Similar Open Source Tools

llm-leaderboard

github

: 67

Google_GenerativeAI

Google GenerativeAI (Gemini) is an unofficial C# .Net SDK based on REST APIs for accessing Google Gemini models. It offers a complete rewrite of the previous SDK with improved performance, flexibility, and ease of use. The SDK seamlessly integrates with LangChain.net, providing easy methods for JSON-based interactions and function calling with Google Gemini models. It includes features like enhanced JSON mode handling, function calling with code generator, multi-modal functionality, Vertex AI support, multimodal live API, image generation and captioning, retrieval-augmented generation with Vertex RAG Engine and Google AQA, easy JSON handling, Gemini tools and function calling, multimodal live API, and more.

github

: 71

xFasterTransformer

xFasterTransformer is an optimized solution for Large Language Models (LLMs) on the X86 platform, providing high performance and scalability for inference on mainstream LLM models. It offers C++ and Python APIs for easy integration, along with example codes and benchmark scripts. Users can prepare models in a different format, convert them, and use the APIs for tasks like encoding input prompts, generating token ids, and serving inference requests. The tool supports various data types and models, and can run in single or multi-rank modes using MPI. A web demo based on Gradio is available for popular LLM models like ChatGLM and Llama2. Benchmark scripts help evaluate model inference performance quickly, and MLServer enables serving with REST and gRPC interfaces.

github

: 247

LLM-as-HH

LLM-as-HH is a codebase that accompanies the paper ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution. It introduces Language Hyper-Heuristics (LHHs) that leverage LLMs for heuristic generation with minimal manual intervention and open-ended heuristic spaces. Reflective Evolution (ReEvo) is presented as a searching framework that emulates the reflective design approach of human experts while surpassing human capabilities with scalable LLM inference, Internet-scale domain knowledge, and powerful evolutionary search. The tool can improve various algorithms on problems like Traveling Salesman Problem, Capacitated Vehicle Routing Problem, Orienteering Problem, Multiple Knapsack Problems, Bin Packing Problem, and Decap Placement Problem in both black-box and white-box settings.

github

: 78

py-llm-core

PyLLMCore is a light-weighted interface with Large Language Models with native support for llama.cpp, OpenAI API, and Azure deployments. It offers a Pythonic API that is simple to use, with structures provided by the standard library dataclasses module. The high-level API includes the assistants module for easy swapping between models. PyLLMCore supports various models including those compatible with llama.cpp, OpenAI, and Azure APIs. It covers use cases such as parsing, summarizing, question answering, hallucinations reduction, context size management, and tokenizing. The tool allows users to interact with language models for tasks like parsing text, summarizing content, answering questions, reducing hallucinations, managing context size, and tokenizing text.

github

: 118

Qwen

Qwen is a series of large language models developed by Alibaba DAMO Academy. It outperforms the baseline models of similar model sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models’ capabilities on natural language understanding, mathematic problem solving, coding, etc. Qwen models outperform the baseline models of similar model sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models’ capabilities on natural language understanding, mathematic problem solving, coding, etc. Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.

github

: 17.0k

litdata

LitData is a tool designed for blazingly fast, distributed streaming of training data from any cloud storage. It allows users to transform and optimize data in cloud storage environments efficiently and intuitively, supporting various data types like images, text, video, audio, geo-spatial, and multimodal data. LitData integrates smoothly with frameworks such as LitGPT and PyTorch, enabling seamless streaming of data to multiple machines. Key features include multi-GPU/multi-node support, easy data mixing, pause & resume functionality, support for profiling, memory footprint reduction, cache size configuration, and on-prem optimizations. The tool also provides benchmarks for measuring streaming speed and conversion efficiency, along with runnable templates for different data types. LitData enables infinite cloud data processing by utilizing the Lightning.ai platform to scale data processing with optimized machines.

github

: 432

langgraph4j

Langgraph4j is a Java library for language processing tasks such as text classification, sentiment analysis, and named entity recognition. It provides a set of tools and algorithms for analyzing text data and extracting useful information. The library is designed to be efficient and easy to use, making it suitable for both research and production applications.

github

: 916

pgvecto.rs

pgvecto.rs is a Postgres extension written in Rust that provides vector similarity search functions. It offers ultra-low-latency, high-precision vector search capabilities, including sparse vector search and full-text search. With complete SQL support, async indexing, and easy data management, it simplifies data handling. The extension supports various data types like FP16/INT8, binary vectors, and Matryoshka embeddings. It ensures system performance with production-ready features, high availability, and resource efficiency. Security and permissions are managed through easy access control. The tool allows users to create tables with vector columns, insert vector data, and calculate distances between vectors using different operators. It also supports half-precision floating-point numbers for better performance and memory usage optimization.

github

: 1.9k

WordLlama

WordLlama is a fast, lightweight NLP toolkit optimized for CPU hardware. It recycles components from large language models to create efficient word representations. It offers features like Matryoshka Representations, low resource requirements, binarization, and numpy-only inference. The tool is suitable for tasks like semantic matching, fuzzy deduplication, ranking, and clustering, making it a good option for NLP-lite tasks and exploratory analysis.

github

: 1.4k

litserve

LitServe is a high-throughput serving engine for deploying AI models at scale. It generates an API endpoint for a model, handles batching, streaming, autoscaling across CPU/GPUs, and more. Built for enterprise scale, it supports every framework like PyTorch, JAX, Tensorflow, and more. LitServe is designed to let users focus on model performance, not the serving boilerplate. It is like PyTorch Lightning for model serving but with broader framework support and scalability.

github

: 53

agents-starter

A starter template for building AI-powered chat agents using Cloudflare's Agent platform, powered by agents-sdk. It provides a foundation for creating interactive chat experiences with AI, complete with a modern UI and tool integration capabilities. Features include interactive chat interface with AI, built-in tool system with human-in-the-loop confirmation, advanced task scheduling, dark/light theme support, real-time streaming responses, state management, and chat history. Prerequisites include a Cloudflare account and OpenAI API key. The project structure includes components for chat UI implementation, chat agent logic, tool definitions, and helper functions. Customization guide covers adding new tools, modifying the UI, and example use cases for customer support, development assistant, data analysis assistant, personal productivity assistant, and scheduling assistant.

github

: 939

req_llm

ReqLLM is a Req-based library for LLM interactions, offering a unified interface to AI providers through a plugin-based architecture. It brings composability and middleware advantages to LLM interactions, with features like auto-synced providers/models, typed data structures, ergonomic helpers, streaming capabilities, usage & cost extraction, and a plugin-based provider system. Users can easily generate text, structured data, embeddings, and track usage costs. The tool supports various AI providers like Anthropic, OpenAI, Groq, Google, and xAI, and allows for easy addition of new providers. ReqLLM also provides API key management, detailed documentation, and a roadmap for future enhancements.

github

: 118

hugging-chat-api

Unofficial HuggingChat Python API for creating chatbots, supporting features like image generation, web search, memorizing context, and changing LLMs. Users can log in, chat with the ChatBot, perform web searches, create new conversations, manage conversations, switch models, get conversation info, use assistants, and delete conversations. The API also includes a CLI mode with various commands for interacting with the tool. Users are advised not to use the application for high-stakes decisions or advice and to avoid high-frequency requests to preserve server resources.

github

: 780

IntelliNode

IntelliNode is a javascript module that integrates cutting-edge AI models like ChatGPT, LLaMA, WaveNet, Gemini, and Stable diffusion into projects. It offers functions for generating text, speech, and images, as well as semantic search, multi-model evaluation, and chatbot capabilities. The module provides a wrapper layer for low-level model access, a controller layer for unified input handling, and a function layer for abstract functionality tailored to various use cases.

github

: 201

SpargeAttn

SpargeAttn is an official implementation designed for accelerating any model inference by providing accurate sparse attention. It offers a significant speedup in model performance while maintaining quality. The tool is based on SageAttention and SageAttention2, providing options for different levels of optimization. Users can easily install the package and utilize the available APIs for their specific needs. SpargeAttn is particularly useful for tasks requiring efficient attention mechanisms in deep learning models.

github

: 719

For similar tasks

llm-leaderboard

github

: 67

hallucination-leaderboard

This leaderboard evaluates the hallucination rate of various Large Language Models (LLMs) when summarizing documents. It uses a model trained by Vectara to detect hallucinations in LLM outputs. The leaderboard includes models from OpenAI, Anthropic, Google, Microsoft, Amazon, and others. The evaluation is based on 831 documents that were summarized by all the models. The leaderboard shows the hallucination rate, factual consistency rate, answer rate, and average summary length for each model.

github

: 1.4k

h2o-llmstudio

H2O LLM Studio is a framework and no-code GUI designed for fine-tuning state-of-the-art large language models (LLMs). With H2O LLM Studio, you can easily and effectively fine-tune LLMs without the need for any coding experience. The GUI is specially designed for large language models, and you can finetune any LLM using a large variety of hyperparameters. You can also use recent finetuning techniques such as Low-Rank Adaptation (LoRA) and 8-bit model training with a low memory footprint. Additionally, you can use Reinforcement Learning (RL) to finetune your model (experimental), use advanced evaluation metrics to judge generated answers by the model, track and compare your model performance visually, and easily export your model to the Hugging Face Hub and share it with the community.

github

: 4.1k

llm-jp-eval

LLM-jp-eval is a tool designed to automatically evaluate Japanese large language models across multiple datasets. It provides functionalities such as converting existing Japanese evaluation data to text generation task evaluation datasets, executing evaluations of large language models across multiple datasets, and generating instruction data (jaster) in the format of evaluation data prompts. Users can manage the evaluation settings through a config file and use Hydra to load them. The tool supports saving evaluation results and logs using wandb. Users can add new evaluation datasets by following specific steps and guidelines provided in the tool's documentation. It is important to note that using jaster for instruction tuning can lead to artificially high evaluation scores, so caution is advised when interpreting the results.

github

: 137

Awesome-LLM

Awesome-LLM is a curated list of resources related to large language models, focusing on papers, projects, frameworks, tools, tutorials, courses, opinions, and other useful resources in the field. It covers trending LLM projects, milestone papers, other papers, open LLM projects, LLM training frameworks, LLM evaluation frameworks, tools for deploying LLM, prompting libraries & tools, tutorials, courses, books, and opinions. The repository provides a comprehensive overview of the latest advancements and resources in the field of large language models.

github

: 22.1k

bocoel

BoCoEL is a tool that leverages Bayesian Optimization to efficiently evaluate large language models by selecting a subset of the corpus for evaluation. It encodes individual entries into embeddings, uses Bayesian optimization to select queries, retrieves from the corpus, and provides easily managed evaluations. The tool aims to reduce computation costs during evaluation with a dynamic budget, supporting models like GPT2, Pythia, and LLAMA through integration with Hugging Face transformers and datasets. BoCoEL offers a modular design and efficient representation of the corpus to enhance evaluation quality.

github

: 270

cladder

CLadder is a repository containing the CLadder dataset for evaluating causal reasoning in language models. The dataset consists of yes/no questions in natural language that require statistical and causal inference to answer. It includes fields such as question_id, given_info, question, answer, reasoning, and metadata like query_type and rung. The dataset also provides prompts for evaluating language models and example questions with associated reasoning steps. Additionally, it offers dataset statistics, data variants, and code setup instructions for using the repository.

github

: 82

uncheatable_eval

Uncheatable Eval is a tool designed to assess the language modeling capabilities of LLMs on real-time, newly generated data from the internet. It aims to provide a reliable evaluation method that is immune to data leaks and cannot be gamed. The tool supports the evaluation of Hugging Face AutoModelForCausalLM models and RWKV models by calculating the sum of negative log probabilities on new texts from various sources such as recent papers on arXiv, new projects on GitHub, news articles, and more. Uncheatable Eval ensures that the evaluation data is not included in the training sets of publicly released models, thus offering a fair assessment of the models' performance.

github

: 65

For similar jobs

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 980

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94

kaito

Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

github

: 405

PyRIT

PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

github

: 2.9k

tabby

Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

github

: 32.1k

spear

SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

github

: 224

Magick

Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.

github

: 675