
llm-leaderboard
A project for evaluating LLMs on Japanese tasks
Stars: 67

Nejumi Leaderboard 3 is a comprehensive evaluation platform for large language models, assessing general language capabilities and alignment aspects. The evaluation framework includes metrics for language processing, translation, summarization, information extraction, reasoning, mathematical reasoning, entity extraction, knowledge/question answering, English, semantic analysis, syntactic analysis, alignment, ethics/moral, toxicity, bias, truthfulness, and robustness. The repository provides an implementation guide for environment setup, dataset preparation, configuration, model configurations, and chat template creation. Users can run evaluation processes using specified configuration files and log results to the Weights & Biases project.
README:
This repository is for the Nejumi Leaderboard 3, a comprehensive evaluation platform for large language models. The leaderboard assesses both general language capabilities and alignment aspects. For detailed information about the leaderboard, please visit the Nejumi Leaderboard website.
Our evaluation framework incorporates a diverse set of metrics to provide a holistic assessment of model performance:
Main Category | Subcategory | Automated Evaluation with Correct Data | AI Evaluation | Note |
---|---|---|---|---|
General Language Processing | Expression | | MT-bench/roleplay (0shot)<br>MT-bench/humanities (0shot)<br>MT-bench/writing (0shot) | |
^ | Translation | ALT e-to-j (jaster) (0shot, 2shot)<br>ALT j-to-e (jaster) (0shot, 2shot)<br>wikicorpus-e-to-j (jaster) (0shot, 2shot)<br>wikicorpus-j-to-e (jaster) (0shot, 2shot) | | |
^ | Summarization | | | |
^ | Information Extraction | JSQuaD (jaster) (0shot, 2shot) | | |
^ | Reasoning | | MT-bench/reasoning (0shot) | |
^ | Mathematical Reasoning | MAWPS* (jaster) (0shot, 2shot)<br>MGSM* (jaster) (0shot, 2shot) | MT-bench/math (0shot) | |
^ | (Entity) Extraction | wiki_ner* (jaster) (0shot, 2shot)<br>wiki_coreference (jaster) (0shot, 2shot)<br>chABSA* (jaster) (0shot, 2shot) | MT-bench/extraction (0shot) | |
^ | Knowledge / Question Answering | JCommonsenseQA* (jaster) (0shot, 2shot)<br>JEMHopQA* (jaster) (0shot, 2shot)<br>JMMLU* (0shot, 2shot)<br>NIILC* (jaster) (0shot, 2shot)<br>aio* (jaster) (0shot, 2shot) | MT-bench/stem (0shot) | |
^ | English | MMLU_en (0shot, 2shot) | | |
^ | Semantic Analysis | JNLI* (jaster) (0shot, 2shot)<br>JaNLI* (jaster) (0shot, 2shot)<br>JSeM* (jaster) (0shot, 2shot)<br>JSICK* (jaster) (0shot, 2shot)<br>Jamp* (jaster) (0shot, 2shot) | | |
^ | Syntactic Analysis | JCoLA-in-domain* (jaster) (0shot, 2shot)<br>JCoLA-out-of-domain* (jaster) (0shot, 2shot)<br>JBLiMP* (jaster) (0shot, 2shot)<br>wiki_reading* (jaster) (0shot, 2shot)<br>wiki_pas* (jaster) (0shot, 2shot)<br>wiki_dependency* (jaster) (0shot, 2shot) | | |
Alignment | Controllability | jaster* (0shot, 2shot)<br>LCTG | | LCTG cannot be used for business purposes. Usage for research and citing the results in a press release are acceptable. |
^ | Ethics/Moral | JCommonsenseMorality* (2shot) | | |
^ | Toxicity | | LINE Yahoo Reliability Evaluation Benchmark | This dataset is not publicly available due to its sensitive content. |
^ | Bias | JBBQ (2shot) | | JBBQ needs to be downloaded from the JBBQ GitHub repository. |
^ | Truthfulness | JTruthfulQA | | For JTruthfulQA evaluation, nlp-waseda/roberta_jtruthfulqa requires Juman++ to be installed beforehand. You can install it by running the script/install_jumanpp.sh script. |
^ | Robustness | Test multiple patterns against JMMLU (W&B original) (0shot, 2shot):<br>- Standard method<br>- Choices are symbols<br>- Select anything but the correct answer | | |
- Metrics with (0shot, 2shot) are averaged across both settings.
- Metrics marked with an asterisk (*) are also used for the controllability evaluation.
- For MT-bench, Stability AI's MT-Bench JP is used, with gpt-4o-2024-05-13 as the judge model.
- For LCTG, only the quantity test is conducted (the quality test is not).
- vLLM is leveraged for efficient inference.
- Alignment data may contain sensitive information, so it is not included in this repository by default. If you want to evaluate your models against the alignment data, please review each dataset's instructions carefully.
- Set up environment variables
export WANDB_API_KEY=<your WANDB_API_KEY>
export OPENAI_API_KEY=<your OPENAI_API_KEY>
export LANG=ja_JP.UTF-8
# If using Azure OpenAI instead of standard OpenAI
export AZURE_OPENAI_ENDPOINT=<your AZURE_OPENAI_ENDPOINT>
export AZURE_OPENAI_API_KEY=<your AZURE_OPENAI_API_KEY>
export OPENAI_API_TYPE=azure
# if needed, set the following API KEY too
export ANTHROPIC_API_KEY=<your ANTHROPIC_API_KEY>
export GOOGLE_API_KEY=<your GOOGLE_API_KEY>
export COHERE_API_KEY=<your COHERE_API_KEY>
export MISTRAL_API_KEY=<your MISTRAL_API_KEY>
export AWS_ACCESS_KEY_ID=<your AWS_ACCESS_KEY_ID>
export AWS_SECRET_ACCESS_KEY=<your AWS_SECRET_ACCESS_KEY>
export AWS_DEFAULT_REGION=<your AWS_DEFAULT_REGION>
export UPSTAGE_API_KEY=<your UPSTAGE_API_KEY>
# if needed, log in to Hugging Face
huggingface-cli login
- Clone the repository
git clone https://github.com/wandb/llm-leaderboard.git
cd llm-leaderboard
- Set up a Python environment with requirements.txt
For detailed instructions on dataset preparation and its caveats, please refer to scripts/data_uploader/README.md.
In Nejumi Leaderboard 3, the following datasets are used.
Please make sure to thoroughly review the terms of use for each dataset before using them.
- jaster(Apache-2.0 license)
- MT-Bench-JA (Apache-2.0 license)
- LCTG (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permission to use it for the leaderboard was received from AI Shift.)
- JBBQ (Creative Commons Attribution 4.0 International License.)
- LINE Yahoo Inappropriate Speech Evaluation Dataset (not publicly available)
- JTruthfulQA (Creative Commons Attribution 4.0 International License.)
The `base_config.yaml` file contains basic settings, and you can create a separate YAML file for model-specific settings. This allows for easy customization of settings for each model while maintaining a consistent base configuration.

Below is a detailed description of the variables used in the `base_config.yaml` file (a minimal example is sketched after this list).

- `wandb`: Information used for Weights & Biases (W&B) support.
  - `entity`: Name of the W&B entity.
  - `project`: Name of the W&B project.
  - `run_name`: Name of the W&B run. Set the run name in the model-specific config.
- `testmode`: Default is false. Set to true for a lightweight run with a small number of questions per category (for functionality checks).
- `inference_interval`: Inference interval in seconds. This is particularly effective when rate limits apply, such as with APIs.
- `run`: Set to true for each evaluation dataset you want to run.
- `model`: Information about the model.
  - `artifacts_path`: Path of the W&B Artifact where the model is located.
  - `max_model_len`: Maximum token length of the input.
  - `chat_template`: Path to the chat template file. This is required for open-weights models.
  - `dtype`: Data type. Choose from float32, float16, bfloat16.
  - `trust_remote_code`: Default is true.
  - `device_map`: Device map. Default is "auto".
  - `load_in_8bit`: 8-bit quantization. Default is false.
  - `load_in_4bit`: 4-bit quantization. Default is false.
- `generator`: Settings for generation. For more details, refer to generation_utils in Hugging Face Transformers.
  - `top_p`: Top-p sampling. Default is 1.0.
  - `temperature`: Sampling temperature. Default is 0.1.
  - `max_tokens`: Maximum number of tokens to generate. This value is overwritten in the script.
- `num_few_shots`: Number of few-shot examples to use.
- `github_version`: For record keeping; it does not need to be changed.
- `jaster`: Settings for the jaster dataset.
  - `artifacts_path`: URL of the W&B Artifact for the jaster dataset.
  - `dataset_dir`: Directory of the jaster dataset after downloading the Artifact.
- `jmmlu_robustness`: Whether to include the JMMLU robustness evaluation. Default is true.
- `lctg`: Settings for the LCTG dataset.
  - `artifacts_path`: URL of the W&B Artifact for the LCTG dataset.
  - `dataset_dir`: Directory of the LCTG dataset after downloading the Artifact.
- `jbbq`: Settings for the JBBQ dataset.
  - `artifacts_path`: URL of the W&B Artifact for the JBBQ dataset.
  - `dataset_dir`: Directory of the JBBQ dataset after downloading the Artifact.
- `toxicity`: Settings for the toxicity evaluation.
  - `artifact_path`: URL of the W&B Artifact for the toxicity dataset.
  - `judge_prompts_path`: URL of the W&B Artifact for the toxicity judge prompts.
  - `max_workers`: Number of workers for parallel processing.
  - `judge_model`: Model used for toxicity judgment. Default is gpt-4o-2024-05-13.
- `jtruthfulqa`: Settings for the JTruthfulQA dataset.
  - `artifact_path`: URL of the W&B Artifact for the JTruthfulQA dataset.
  - `roberta_model_name`: Name of the RoBERTa model used for evaluation. Default is 'nlp-waseda/roberta_jtruthfulqa'.
- `mtbench`: Settings for the MT-Bench evaluation.
  - `temperature_override`: Override the temperature for each MT-Bench category.
  - `question_artifacts_path`: URL of the W&B Artifact for the MT-Bench questions.
  - `referenceanswer_artifacts_path`: URL of the W&B Artifact for the MT-Bench reference answers.
  - `judge_prompt_artifacts_path`: URL of the W&B Artifact for the MT-Bench judge prompts.
  - `bench_name`: Choose 'japanese_mt_bench' for the Japanese MT-Bench, or 'mt_bench' for the English version.
  - `model_id`: The name of the model. You can replace this with a different value if needed.
  - `question_begin`: Starting position for the question in the generated text.
  - `question_end`: Ending position for the question in the generated text.
  - `max_new_token`: Maximum number of new tokens to generate.
  - `num_choices`: Number of choices to generate.
  - `num_gpus_per_model`: Number of GPUs to use per model.
  - `num_gpus_total`: Total number of GPUs to use.
  - `max_gpu_memory`: Maximum GPU memory to use (leave as null to use the default).
  - `dtype`: Data type. Choose from None, float32, float16, bfloat16.
  - `judge_model`: Model used for judging the generated responses. Default is gpt-4o-2024-05-13.
  - `mode`: Mode of evaluation. Default is 'single'.
  - `baseline_model`: Model used for comparison. Leave as null for the default behavior.
  - `parallel`: Number of parallel threads to use.
  - `first_n`: Number of generated responses to use for comparison. Leave as null for the default behavior.
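To make the structure concrete, here is a minimal sketch of what a `base_config.yaml` could look like, assembled from the variables described above. It is not the repository's actual default file: the nesting (in particular under `run`), the artifact URL, and all values are illustrative placeholders, so please check the sample config shipped with the repository for the exact schema.

```yaml
# Minimal sketch of a base_config.yaml, assembled from the variables described above.
# All values are illustrative placeholders, not the repository defaults.
wandb:
  entity: your-wandb-entity
  project: nejumi-leaderboard3
  run_name: null            # set in the model-specific config

testmode: false             # true = lightweight run for functionality checks
inference_interval: 0       # seconds between calls; useful for API rate limits

run:                        # enable each evaluation you want to execute (nesting is an assumption)
  jaster: true
  mtbench: true
  lctg: true
  jbbq: false               # requires downloading JBBQ separately
  toxicity: false           # dataset not publicly available
  jtruthfulqa: true

model:
  max_model_len: 4096
  chat_template: chat_templates/model_id.jinja
  dtype: bfloat16           # float32 / float16 / bfloat16
  trust_remote_code: true
  device_map: auto
  load_in_8bit: false
  load_in_4bit: false

generator:
  top_p: 1.0
  temperature: 0.1
  max_tokens: 1024          # overwritten by the script

num_few_shots: 2
github_version: v3.0.0      # for record keeping

jaster:
  artifacts_path: wandb-japan/llm-leaderboard/jaster:v0   # placeholder artifact URL
  dataset_dir: jaster

jmmlu_robustness: true

mtbench:
  bench_name: japanese_mt_bench
  judge_model: gpt-4o-2024-05-13
  mode: single
  parallel: 1
```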
After setting up the base configuration file, the next step is to set up a configuration file for each model under configs/.

This framework supports evaluating models served via APIs such as OpenAI, Anthropic, Google, and Cohere. You need to create a separate config file for each API model. For example, the config file for OpenAI's gpt-4o-2024-05-13 would be named configs/config-gpt-4o-2024-05-13.yaml (see the sketch after this list).

- `wandb`: Information used for Weights & Biases (W&B) support.
  - `run_name`: Name of the W&B run.
- `api`: Choose the API to use from `openai`, `anthropic`, `google`, `amazon_bedrock`.
- `batch_size`: Batch size for API calls (recommended: 32).
- `model`: Information about the model.
  - `pretrained_model_name_or_path`: Name of the API model.
  - `size_category`: Specify "api" to indicate that an API model is used.
  - `size`: Model size (leave as null for API models).
  - `release_date`: Model release date (MM/DD/YYYY).
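For illustration, a hedged sketch of such an API-model config follows. The values are placeholders; please take the exact schema from the sample configs in the repository.

```yaml
# Sketch of configs/config-gpt-4o-2024-05-13.yaml for an API-served model.
# Values are illustrative, not an authoritative configuration.
wandb:
  run_name: gpt-4o-2024-05-13

api: openai          # openai / anthropic / google / amazon_bedrock
batch_size: 32       # recommended for API calls

model:
  pretrained_model_name_or_path: gpt-4o-2024-05-13
  size_category: api
  size: null                 # null for API models
  release_date: 05/13/2024   # MM/DD/YYYY
```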
This framework also supports evaluating models served with vLLM. You need to create a separate config file for each vLLM model. For example, the config file for Microsoft's Phi-3-medium-128k-instruct would be named configs/config-Phi-3-medium-128k-instruct.yaml (see the sketch after this list).

- `wandb`: Information used for Weights & Biases (W&B) support.
  - `run_name`: Name of the W&B run.
- `api`: Set to `vllm` to indicate that a vLLM model is used.
- `num_gpus`: Number of GPUs to use.
- `batch_size`: Batch size for vLLM (recommended: 256).
- `model`: Information about the model.
  - `artifacts_path`: Required only when loading the model from a W&B Artifact; otherwise it can be omitted. Example notation: wandb-japan/llm-leaderboard/llm-jp-13b-instruct-lora-jaster-v1.0:v0
  - `pretrained_model_name_or_path`: Name of the vLLM model.
  - `chat_template`: Path to the chat template file (if needed).
  - `size_category`: Model size category. In the Nejumi Leaderboard, the categories are defined as "10B<", "10B<= <30B", "<=30B", and "api".
  - `size`: Model size (number of parameters).
  - `release_date`: Model release date (MM/DD/YYYY).
  - `max_model_len`: Maximum token length of the input (if needed).
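Below is a similar sketch for a vLLM-served model. The chat template path is a hypothetical file name, and all values (including size and release date) are illustrative placeholders rather than an authoritative configuration.

```yaml
# Sketch of configs/config-Phi-3-medium-128k-instruct.yaml for a vLLM-served model.
# Values are illustrative placeholders; the chat template file name is hypothetical.
wandb:
  run_name: Phi-3-medium-128k-instruct

api: vllm
num_gpus: 1
batch_size: 256      # recommended for vLLM

model:
  # artifacts_path is only needed when loading the model from a W&B Artifact, e.g.
  # artifacts_path: wandb-japan/llm-leaderboard/llm-jp-13b-instruct-lora-jaster-v1.0:v0
  pretrained_model_name_or_path: microsoft/Phi-3-medium-128k-instruct
  chat_template: chat_templates/phi-3.jinja   # hypothetical path
  size_category: "10B<= <30B"
  size: 14                   # parameters in billions (indicative)
  release_date: 05/21/2024   # MM/DD/YYYY (indicative)
  max_model_len: 4096
```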
- Create chat_templates/model_id.jinja. If a chat_template is specified in the tokenizer_config.json of the evaluated model, create a .jinja file with that configuration. If chat_template is not specified in tokenizer_config.json, refer to the model card or other relevant documentation to create a chat template and document it in a .jinja file.
- Test chat_templates. If you want to check the output of a chat template, you can use the following script:
  python3 scripts/test_chat_template.py -m <model_id> -c <chat_template>
  If the model ID and the chat_template name are the same, you can omit -c <chat_template>.
Once you have prepared the datasets and the configuration files, you can run the evaluation process.
You can use either the `-c` or the `-s` option:
- `-c` (config): Specify the config file by its name, e.g., python3 scripts/run_eval.py -c config-gpt-4o-2024-05-13.yaml
- `-s` (select-config): Select from a list of available config files. This option is useful if you have multiple config files.
python3 scripts/run_eval.py -s
or
python3 scripts/run_eval.py -c <config file>
The results of the evaluation will be logged to the specified W&B project.
Please refer to blend_run_configs/README.md.
Contributions to this repository are welcome. Please submit your suggestions via pull requests. Please note that we may not accept all pull requests.
This repository is available for commercial use. However, please adhere to the respective rights and licenses of each evaluation dataset used.
For questions or support, please contact [email protected].
Alternative AI tools for llm-leaderboard
Similar Open Source Tools


wanda
Official PyTorch implementation of Wanda (Pruning by Weights and Activations), a simple and effective pruning approach for large language models. The pruning approach removes weights on a per-output basis, by the product of weight magnitudes and input activation norms. The repository provides support for various features such as LLaMA-2, ablation study on OBS weight update, zero-shot evaluation, and speedup evaluation. Users can replicate main results from the paper using provided bash commands. The tool aims to enhance the efficiency and performance of language models through structured and unstructured sparsity techniques.

xFasterTransformer
xFasterTransformer is an optimized solution for Large Language Models (LLMs) on the X86 platform, providing high performance and scalability for inference on mainstream LLM models. It offers C++ and Python APIs for easy integration, along with example codes and benchmark scripts. Users can prepare models in a different format, convert them, and use the APIs for tasks like encoding input prompts, generating token ids, and serving inference requests. The tool supports various data types and models, and can run in single or multi-rank modes using MPI. A web demo based on Gradio is available for popular LLM models like ChatGLM and Llama2. Benchmark scripts help evaluate model inference performance quickly, and MLServer enables serving with REST and gRPC interfaces.

LLM-as-HH
LLM-as-HH is a codebase that accompanies the paper ReEvo: Large Language Models as Hyper-Heuristics with Reflective Evolution. It introduces Language Hyper-Heuristics (LHHs) that leverage LLMs for heuristic generation with minimal manual intervention and open-ended heuristic spaces. Reflective Evolution (ReEvo) is presented as a searching framework that emulates the reflective design approach of human experts while surpassing human capabilities with scalable LLM inference, Internet-scale domain knowledge, and powerful evolutionary search. The tool can improve various algorithms on problems like Traveling Salesman Problem, Capacitated Vehicle Routing Problem, Orienteering Problem, Multiple Knapsack Problems, Bin Packing Problem, and Decap Placement Problem in both black-box and white-box settings.

rag-chatbot
The RAG ChatBot project combines Lama.cpp, Chroma, and Streamlit to build a Conversation-aware Chatbot and a Retrieval-augmented generation (RAG) ChatBot. The RAG Chatbot works by taking a collection of Markdown files as input and provides answers based on the context provided by those files. It utilizes a Memory Builder component to load Markdown pages, divide them into sections, calculate embeddings, and save them in an embedding database. The chatbot retrieves relevant sections from the database, rewrites questions for optimal retrieval, and generates answers using a local language model. It also remembers previous interactions for more accurate responses. Various strategies are implemented to deal with context overflows, including creating and refining context, hierarchical summarization, and async hierarchical summarization.

FlexFlow
FlexFlow Serve is an open-source compiler and distributed system for **low latency**, **high performance** LLM serving. FlexFlow Serve outperforms existing systems by 1.3-2.0x for single-node, multi-GPU inference and by 1.4-2.4x for multi-node, multi-GPU inference.

DaoCloud-docs
DaoCloud Enterprise 5.0 Documentation provides detailed information on using DaoCloud, a Certified Kubernetes Service Provider. The documentation covers current and legacy versions, workflow control using GitOps, and instructions for opening a PR and previewing changes locally. It also includes naming conventions, writing tips, references, and acknowledgments to contributors. Users can find guidelines on writing, contributing, and translating pages, along with using tools like MkDocs, Docker, and Poetry for managing the documentation.

tts-generation-webui
TTS Generation WebUI is a comprehensive tool that provides a user-friendly interface for text-to-speech and voice cloning tasks. It integrates various AI models such as Bark, MusicGen, AudioGen, Tortoise, RVC, Vocos, Demucs, SeamlessM4T, and MAGNeT. The tool offers one-click installers, Google Colab demo, videos for guidance, and extra voices for Bark. Users can generate audio outputs, manage models, caches, and system space for AI projects. The project is open-source and emphasizes ethical and responsible use of AI technology.

py-llm-core
PyLLMCore is a light-weighted interface with Large Language Models with native support for llama.cpp, OpenAI API, and Azure deployments. It offers a Pythonic API that is simple to use, with structures provided by the standard library dataclasses module. The high-level API includes the assistants module for easy swapping between models. PyLLMCore supports various models including those compatible with llama.cpp, OpenAI, and Azure APIs. It covers use cases such as parsing, summarizing, question answering, hallucinations reduction, context size management, and tokenizing. The tool allows users to interact with language models for tasks like parsing text, summarizing content, answering questions, reducing hallucinations, managing context size, and tokenizing text.

spandrel
Spandrel is a library for loading and running pre-trained PyTorch models. It automatically detects the model architecture and hyperparameters from model files, and provides a unified interface for running models.

Qwen
Qwen is a series of large language models developed by Alibaba DAMO Academy. It outperforms baseline models of similar sizes on a series of benchmark datasets, e.g., MMLU, C-Eval, GSM8K, MATH, HumanEval, MBPP, BBH, etc., which evaluate the models' capabilities in natural language understanding, mathematical problem solving, coding, etc. Qwen-72B achieves better performance than LLaMA2-70B on all tasks and outperforms GPT-3.5 on 7 out of 10 tasks.

notte
Notte is a web browser designed specifically for LLM agents, providing a language-first web navigation experience without the need for DOM/HTML parsing. It transforms websites into structured, navigable maps described in natural language, enabling users to interact with the web using natural language commands. By simplifying browser complexity, Notte allows LLM policies to focus on conversational reasoning and planning, reducing token usage, costs, and latency. The tool supports various language model providers and offers a reinforcement learning style action space and controls for full navigation control.

LLamaSharp
LLamaSharp is a cross-platform library to run 🦙LLaMA/LLaVA model (and others) on your local device. Based on llama.cpp, inference with LLamaSharp is efficient on both CPU and GPU. With the higher-level APIs and RAG support, it's convenient to deploy LLM (Large Language Model) in your application with LLamaSharp.

pgvecto.rs
pgvecto.rs is a Postgres extension written in Rust that provides vector similarity search functions. It offers ultra-low-latency, high-precision vector search capabilities, including sparse vector search and full-text search. With complete SQL support, async indexing, and easy data management, it simplifies data handling. The extension supports various data types like FP16/INT8, binary vectors, and Matryoshka embeddings. It ensures system performance with production-ready features, high availability, and resource efficiency. Security and permissions are managed through easy access control. The tool allows users to create tables with vector columns, insert vector data, and calculate distances between vectors using different operators. It also supports half-precision floating-point numbers for better performance and memory usage optimization.

open-unlearning
OpenUnlearning is an easily extensible framework that unifies LLM unlearning evaluation benchmarks. It provides efficient implementations of TOFU and MUSE unlearning benchmarks, supporting 5 unlearning methods, 3+ datasets, 6+ evaluation metrics, and 7+ LLMs. Users can easily extend the framework to incorporate more variants, collaborate by adding new benchmarks, unlearning methods, datasets, and evaluation metrics, and drive progress in the field.

evidently
Evidently is an open-source Python library designed for evaluating, testing, and monitoring machine learning (ML) and large language model (LLM) powered systems. It offers a wide range of functionalities, including working with tabular, text data, and embeddings, supporting predictive and generative systems, providing over 100 built-in metrics for data drift detection and LLM evaluation, allowing for custom metrics and tests, enabling both offline evaluations and live monitoring, and offering an open architecture for easy data export and integration with existing tools. Users can utilize Evidently for one-off evaluations using Reports or Test Suites in Python, or opt for real-time monitoring through the Dashboard service.
For similar tasks


hallucination-leaderboard
This leaderboard evaluates the hallucination rate of various Large Language Models (LLMs) when summarizing documents. It uses a model trained by Vectara to detect hallucinations in LLM outputs. The leaderboard includes models from OpenAI, Anthropic, Google, Microsoft, Amazon, and others. The evaluation is based on 831 documents that were summarized by all the models. The leaderboard shows the hallucination rate, factual consistency rate, answer rate, and average summary length for each model.

h2o-llmstudio
H2O LLM Studio is a framework and no-code GUI designed for fine-tuning state-of-the-art large language models (LLMs). With H2O LLM Studio, you can easily and effectively fine-tune LLMs without the need for any coding experience. The GUI is specially designed for large language models, and you can finetune any LLM using a large variety of hyperparameters. You can also use recent finetuning techniques such as Low-Rank Adaptation (LoRA) and 8-bit model training with a low memory footprint. Additionally, you can use Reinforcement Learning (RL) to finetune your model (experimental), use advanced evaluation metrics to judge generated answers by the model, track and compare your model performance visually, and easily export your model to the Hugging Face Hub and share it with the community.

llm-jp-eval
LLM-jp-eval is a tool designed to automatically evaluate Japanese large language models across multiple datasets. It provides functionalities such as converting existing Japanese evaluation data to text generation task evaluation datasets, executing evaluations of large language models across multiple datasets, and generating instruction data (jaster) in the format of evaluation data prompts. Users can manage the evaluation settings through a config file and use Hydra to load them. The tool supports saving evaluation results and logs using wandb. Users can add new evaluation datasets by following specific steps and guidelines provided in the tool's documentation. It is important to note that using jaster for instruction tuning can lead to artificially high evaluation scores, so caution is advised when interpreting the results.

Awesome-LLM
Awesome-LLM is a curated list of resources related to large language models, focusing on papers, projects, frameworks, tools, tutorials, courses, opinions, and other useful resources in the field. It covers trending LLM projects, milestone papers, other papers, open LLM projects, LLM training frameworks, LLM evaluation frameworks, tools for deploying LLM, prompting libraries & tools, tutorials, courses, books, and opinions. The repository provides a comprehensive overview of the latest advancements and resources in the field of large language models.

bocoel
BoCoEL is a tool that leverages Bayesian Optimization to efficiently evaluate large language models by selecting a subset of the corpus for evaluation. It encodes individual entries into embeddings, uses Bayesian optimization to select queries, retrieves from the corpus, and provides easily managed evaluations. The tool aims to reduce computation costs during evaluation with a dynamic budget, supporting models like GPT2, Pythia, and LLAMA through integration with Hugging Face transformers and datasets. BoCoEL offers a modular design and efficient representation of the corpus to enhance evaluation quality.

cladder
CLadder is a repository containing the CLadder dataset for evaluating causal reasoning in language models. The dataset consists of yes/no questions in natural language that require statistical and causal inference to answer. It includes fields such as question_id, given_info, question, answer, reasoning, and metadata like query_type and rung. The dataset also provides prompts for evaluating language models and example questions with associated reasoning steps. Additionally, it offers dataset statistics, data variants, and code setup instructions for using the repository.

uncheatable_eval
Uncheatable Eval is a tool designed to assess the language modeling capabilities of LLMs on real-time, newly generated data from the internet. It aims to provide a reliable evaluation method that is immune to data leaks and cannot be gamed. The tool supports the evaluation of Hugging Face AutoModelForCausalLM models and RWKV models by calculating the sum of negative log probabilities on new texts from various sources such as recent papers on arXiv, new projects on GitHub, news articles, and more. Uncheatable Eval ensures that the evaluation data is not included in the training sets of publicly released models, thus offering a fair assessment of the models' performance.
For similar jobs

weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.

PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.

tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.

spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.

Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.