Korean-SAT-LLM-Leaderboard
Korean SAT leaderboard
Stars: 160
The Korean SAT LLM Leaderboard is a benchmarking project that allows users to test their fine-tuned Korean language models on a 10-year dataset of the Korean College Scholastic Ability Test (CSAT). The project provides a platform to compare human academic ability with the performance of large language models (LLMs) on various question types to assess reading comprehension, critical thinking, and sentence interpretation skills. It aims to share benchmark data, utilize a reliable evaluation dataset curated by the Korea Institute for Curriculum and Evaluation, provide annual updates to prevent data leakage, and promote open-source LLM advancement for achieving top-tier performance on the Korean CSAT.
README:
- What is the Korean SAT LLM Leaderboard?
- Leaderboard
- Submission Guidelines
- About the Benchmark Dataset
- Metric
- Helpful Reference
- More About the CSAT (Korean college entrance exam)
- Notice
- Contact Us
Take advantage of this unique opportunity to compare human academic ability with the performance of large language models (LLMs) based on the highly reputable College Scholastic Ability Test (CSAT)!
Test how well your fine-tuned Korean LLM performs on a 10-year benchmark of the Korean CSAT and see what score it would achieve right now!
The current benchmark has been completed using GPT-based models. Future performance evaluations for other models will be conducted as additional resources and funding become available.
Please note that the current grade thresholds are estimated, and they will be updated once the official thresholds are announced.
Rank | Model Name | Standard Score | Raw Score | Common Subject Score | Elective Subject Score | Estimated Grade Cut (CruxTable Standard) |
---|---|---|---|---|---|---|
🥇 1st | o1-Preview | 133 | 97 | 73 | 24 | Grade 1 |
🥈 2nd | o1-mini | 115 | 78 | 57 | 21 | Grade 4 |
🥉 3rd | gemini_2.0_experimental_advanced | 114 | 77 | 55 | 22 | Grade 4 |
4th | gpt-4o | 112 | 75 | 56 | 19 | Grade 4 |
5th | claude-3-5-sonnet-20241022 | 108 | 70 | 54 | 16 | Grade 4 |
6th | HyperClovaX | 108 | 61 | 48 | 24 | Grade 4 |
7th | gpt-4o-mini | 97 | 59 | 44 | 15 | Grade 5 |
8th | gpt-3.5-turbo | 56 | 16 | 10 | 6 | Grade 9 |
cf.)
The only question o1-Preview answered incorrectly was question 8 (3 points) from the CSAT Korean section, a non-literary text question!
For those curious about the analysis of the incorrect question and a detailed explanation of the experiment, please refer to this link.
The Korean SAT LLM Leaderboard benchmarks 10 years of Korean CSAT (College Scholastic Ability Test) exams, which are developed by the reputable KICE (Korea Institute for Curriculum and Evaluation).
The CSAT includes a wide range of question types at varying difficulty levels, designed to assess reading comprehension, critical thinking, and sentence interpretation skills.
If you want to know more about the Korean SAT (Korean college entrance exam), please refer to this!
Leaderboard Rank | Model Name | Submitter Name | Avg. std Score | Avg. Grade | 2024 SAT | 2023 SAT | 2022 SAT | 2021 SAT | 2020 SAT | 2019 SAT | 2018 SAT | 2017 SAT | 2016 SAT | 2015 SAT | URL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
🥇 1st | gpt-4o-2024-08-06 | OpenAI | 114.9 | 3.6 | 65 (4) | 81 (4) | 70 (4) | 69 (4) | 76 (4) | 74 (3) | 77 (4) | 86 (2) | 84 (3) | 77 (4) | Link |
🥈 2nd | Meta-Llama-3.1-405B-Instruct-Turbo | meta-llama | 113.8 | 3.8 | 77 (3) | 87 (3) | 69 (4) | 70 (4) | 65 (5) | 68 (4) | 78 (4) | 80 (3) | 87 (3) | 68 (5) | Link |
🥉 3rd | Qwen2.5-72B-Instruct-Turbo | Qwen | 105.8 | 4.6 | 61 (5) | 78 (4) | 52 (6) | 60 (5) | 60 (5) | 64 (4) | 74 (4) | 70 (5) | 74 (4) | 79 (4) | Link |
4th | Meta-Llama-3.1-70B-Instruct-Turbo | meta-llama | 103.7 | 4.8 | 50 (6) | 72 (5) | 73 (3) | 61 (5) | 79 (3) | 51 (5) | 58 (6) | 66 (5) | 71 (5) | 70 (5) | Link |
5th | claude-3-5-sonnet-20241022 | Anthropic | 102.6 | 5 | 60 (5) | 61 (6) | 69 (4) | 58 (5) | 72 (4) | 63 (4) | 71 (5) | 70 (5) | 58 (6) | 55 (6) | Link |
6th | Qwen2-72B-Instruct | Qwen | 98 | 5.2 | 53 (5) | 57 (6) | 59 (5) | 45 (6) | 57 (5) | 56 (5) | 76 (4) | 69 (5) | 58 (6) | 63 (5) | Link |
7th | gpt-4o-mini-2024-07-18 | OpenAI | 93.9 | 5.6 | 57 (5) | 53 (6) | 50 (6) | 55 (5) | 50 (6) | 46 (6) | 62 (5) | 58 (6) | 64 (5) | 57 (6) | Link |
8th | gemma-2-27b-it | Google | 91 | 5.9 | 51 (6) | 54 (6) | 51 (6) | 51 (6) | 50 (6) | 37 (7) | 50 (6) | 71 (4) | 54 (6) | 56 (6) | Link |
9th | solar-mini-ja | Upstage | 85.9 | 6.2 | 46 (6) | 58 (6) | 43 (6) | 41 (7) | 46 (6) | 51 (5) | 49 (6) | 48 (7) | 40 (7) | 52 (6) | Link |
10th | solar-mini | Upstage | 85.5 | 6.4 | 33 (7) | 57 (6) | 48 (6) | 42 (7) | 46 (6) | 50 (6) | 43 (7) | 55 (6) | 42 (7) | 56 (6) | Link |
11th | Mixtral-8x22B-Instruct-v0.1 | MistralAI | 83.4 | 6.6 | 40 (7) | 44 (7) | 47 (6) | 31 (8) | 38 (7) | 35 (7) | 65 (5) | 57 (6) | 50 (6) | 44 (7) | Link |
12th | WizardLM-2-8x22B | Microsoft | 83.3 | 6.6 | 37 (7) | 56 (6) | 47 (6) | 30 (8) | 52 (6) | 29 (8) | 51 (6) | 47 (7) | 51 (6) | 53 (6) | Link |
13th | Qwen2.5-7B-Instruct-Turbo | Qwen | 80.3 | 6.8 | 40 (7) | 40 (7) | 39 (7) | 35 (7) | 35 (7) | 35 (7) | 58 (6) | 53 (6) | 44 (7) | 42 (7) | Link |
14th | Meta-Llama-3.1-8B-Instruct-Turbo | meta-llama | 74.7 | 7.1 | 46 (6) | 31 (8) | 36 (7) | 34 (7) | 36 (7) | 24 (8) | 38 (7) | 38 (7) | 37 (7) | 45 (7) | Link |
15th | gpt-3.5-turbo-0125 | OpenAI | 68.7 | 7.7 | 29 (8) | 39 (7) | 26 (8) | 17 (9) | 36 (7) | 24 (8) | 38 (7) | 25 (8) | 45 (7) | 27 (8) | Link |
16th | Mixtral-8x7B-Instruct-v0.1 | MistralAI | 63.4 | 8.3 | 19 (9) | 25 (8) | 40 (7) | 20 (9) | 27 (8) | 19 (9) | 37 (7) | 16 (9) | 30 (8) | 19 (9) | Link |
17th | gemma-2-9b-it | Google | 61.2 | 8.4 | 24 (8) | 20 (9) | 16 (9) | 22 (9) | 17 (9) | 29 (8) | 24 (8) | 25 (8) | 25 (8) | 29 (8) | Link |
18th | Llama-3.2-3B-Instruct-Turbo | meta-llama | 60.6 | 8.7 | 28 (8) | 18 (9) | 27 (8) | 23 (9) | 16 (9) | 17 (9) | 21 (9) | 29 (8) | 22 (9) | 23 (9) | Link |
19th | Mistral-7B-Instruct-v0.3 | MistralAI | 57.2 | 8.9 | 17 (9) | 11 (9) | 22 (9) | 12 (9) | 18 (9) | 21 (9) | 19 (9) | 27 (8) | 23 (9) | 21 (9) | Link |
- Ranking Criteria: Average of the standard scores over the 10 years of CSAT exams (the standard scores reflect the difficulty level of each year's exam in the overall score); a minimal aggregation sketch follows these notes.
- Avg. Std Score: The average of standard scores calculated using the KICE (Korea Institute for Curriculum and Evaluation) method.
- Avg. Grade: The average grade across the 10 exams.
- CSAT Scores by Year: Raw score (grade).
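To make the ranking criteria concrete, here is a minimal sketch of how the leaderboard columns could be aggregated from per-year results. The names (`YearResult`, `aggregate`, `rank`) are illustrative assumptions for this sketch, not the project's actual evaluation code, which is not public.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class YearResult:
    """One model's result on one year's CSAT Korean exam."""
    year: int          # e.g. 2024
    raw_score: int     # raw score out of 100
    std_score: float   # standard score (KICE method)
    grade: int         # 1 (best) to 9 (worst)

def aggregate(results: list[YearResult]) -> dict:
    """Collapse the per-year results into the leaderboard columns."""
    return {
        "avg_std_score": round(mean(r.std_score for r in results), 1),  # ranking key
        "avg_grade": round(mean(r.grade for r in results), 1),
        "by_year": {r.year: f"{r.raw_score} ({r.grade})" for r in results},
    }

def rank(models: dict[str, list[YearResult]]) -> list[tuple[str, dict]]:
    """Order models by descending average standard score."""
    rows = [(name, aggregate(res)) for name, res in models.items()]
    return sorted(rows, key=lambda row: row[1]["avg_std_score"], reverse=True)
```

For example, ten yearly results whose standard scores average 114.9 and whose grades average 3.6 would reproduce the first row of the table above.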
Click here for details on the scoring system.
e.g.)
- o1-preview: 88 points (Grade 1, Top 4%)
- o1-mini: 60 points (Grade 5)
Note: The GPT o1 model is scheduled for a benchmark update when the official version of o1 is released!
- Donating GPU resources would be greatly appreciated and would help with the evaluations. Thank you!
- We welcome any and all feedback! Feel free to share your thoughts anytime!
- Sharing Benchmark Data: Provide benchmarks that allow for the comparison of human performance and LLM performance.
- Reliable Evaluation Dataset: Utilize the highly authoritative benchmark dataset curated by KICE (Korea Institute for Curriculum and Evaluation) to assess Korean language proficiency.
- Annual Updates with Leakage Prevention: Update the CSAT Korean benchmark dataset annually to prevent data leakage.
- Open-Source LLM Advancement: Enable open-source LLMs, independent of any specific country or company, to achieve top-tier (1st-grade) performance on the Korean CSAT.
- If you prefer to keep your model's performance private and not appear on the public leaderboard, feel free to leave a note in the "Comments" section.
- ⭐️ Your model must have a minimum context length of 8K tokens to solve the Korean SAT questions! (A quick way to check this is sketched below.)
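Before submitting, you may want to confirm that your model's configured context window meets the 8K-token requirement. The snippet below is only a rough, unofficial sketch using the Hugging Face `transformers` library; the attribute that holds the context window varies by architecture, and the model ID shown is just an example.

```python
from transformers import AutoConfig

def max_context_length(model_id: str) -> int | None:
    """Return the model's configured context window, if the config exposes one."""
    config = AutoConfig.from_pretrained(model_id)
    # Different architectures store the context window under different names.
    for attr in ("max_position_embeddings", "n_positions", "seq_length"):
        if hasattr(config, attr):
            return int(getattr(config, attr))
    return None

if __name__ == "__main__":
    length = max_context_length("Qwen/Qwen2.5-7B-Instruct")  # example model ID
    if length is not None and length >= 8192:
        print(f"Context window of {length} tokens meets the 8K requirement.")
    else:
        print(f"Context window {length}: too short or could not be determined.")
```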
- Model Submission:
  - Submit via the Form: Fill in the survey form to submit your model!
  - Email Submission: Send the Hugging Face URL of your fine-tuned model along with your nickname.
    - Submission email: [email protected]
  - Submit via GitHub Issue: Post your model's Hugging Face URL and nickname in a GitHub issue.
  <Example for email or GitHub issue submission>
  Submitter Name: Elon Musk
  Hugging Face Submission URL: https://huggingface.co/Elon_model
  Comments: Let's go Mars!
- Check the Leaderboard: View your rank on GitHub or Hugging Face.
- Climb the Ranks: Improve your score and claim the Slayer Champion title!
Notice: Evaluation may take 1-3 weeks depending on available GPU resources and submission volume.
- This competition utilizes CSAT Korean questions from the 10-year period between 2015 and 2024.
- For the elective subjects introduced in 2022, Speech and Composition will be used as the elective subject for benchmarking.
- The key evaluation metrics for the benchmark dataset include: Language Comprehension, Core Content Understanding, Logical Reasoning, Critical Thinking, Creative Thinking, and Multimedia Interpretation Skills. <Source: 2024 KICE CSAT Korean Evaluation Criteria>
- The comprehension of LLMs can be assessed by examining their performance on problems from various fields such as humanities, social sciences, science, technology, and the arts.
Non-literary (Reading) Section
- Content: Passages are drawn from a wide range of fields, including humanities, social sciences, science, technology, and the arts.
  - Humanities passages: Cover topics such as philosophy, argumentation, and history.
  - Social sciences passages: Focus on subjects like economics, politics, law, and culture.
  - Science passages: Include content related to physics, chemistry, biology, and earth sciences.
  - Technology passages: Feature topics like computers, machinery, and everyday science.
  - Arts passages: Encompass a variety of artistic themes, including visual arts, music, architecture, and film.
  - Integrated passages: Complex, long passages that combine multiple fields are also frequently included.
- Evaluation focus: Assessment of understanding across diverse domains, evaluation of reasoning and critical thinking skills.
Literature Section
- Content: Covers a variety of literary genres such as classical and modern novels, classical and modern poetry, and classical essays.
- Evaluation focus: Assessment of emotional and stylistic comprehension, evaluation of understanding of various literary periods and genres.
Speech and Composition Section
- Content: Includes problems dealing with dialogue and writing.
- Evaluation focus: Assessment of understanding in conversational contexts and evaluation of writing skills.
- In this competition, the answers submitted by each model are evaluated based on whether they match the actual correct answers.
- Scores are graded for each year's questions, and the final ranking is determined by the average of the standardized scores.
- Raw Score: The score achieved out of a maximum of 100 points on the test.
- Standardized Score: A score that measures how far the test taker's raw score deviates from the average.
- Grade: Based on standardized scores, test takers are classified into 9 grades. For the Korean, Math, and Inquiry subjects, the top 4% of all test takers receive a grade of 1, the next 7% (cumulative 11%) receive a grade of 2, and the next 12% (cumulative 23%) receive a grade of 3. Refer to EBSI
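As a rough illustration of the metric, the sketch below scores a model's answers by exact match against the answer key, converts the raw score into a standard score, and maps a percentile to a grade. It assumes the commonly cited formula for the CSAT Korean section (standard score = 20 × z-score + 100) and the standard 9-grade cumulative cut-offs, of which only the first three are quoted above; the leaderboard itself follows the KICE method and the Crux Table, so details may differ.

```python
from statistics import mean, pstdev

# Cumulative percentage of test takers for grades 1..9 (top 4%, 11%, 23%, ...).
# Only the first three values are stated above; the rest follow the standard
# CSAT 9-grade scheme and are an assumption of this sketch.
GRADE_CUMULATIVE = [4, 11, 23, 40, 60, 77, 89, 96, 100]

def raw_score(predictions: dict[int, int], answer_key: dict[int, int],
              points: dict[int, int]) -> int:
    """Exact match: a question earns its points only if the chosen option matches the key."""
    return sum(points[q] for q, correct in answer_key.items() if predictions.get(q) == correct)

def standard_score(raw: float, population_raw_scores: list[float]) -> int:
    """Commonly cited CSAT Korean formula: 20 * z-score + 100, rounded."""
    mu, sigma = mean(population_raw_scores), pstdev(population_raw_scores)
    return round(20 * (raw - mu) / sigma + 100)

def grade(percentile_from_top: float) -> int:
    """Map a 'top X%' standing to a grade (1 = best, 9 = worst)."""
    for g, cutoff in enumerate(GRADE_CUMULATIVE, start=1):
        if percentile_from_top <= cutoff:
            return g
    return 9
```

For example, a model whose standard score places it in the top 4% of test takers would receive Grade 1, matching the thresholds listed above.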
The Korean CSAT (College Scholastic Ability Test) Korean Language section is a university entrance exam, with questions meticulously selected by renowned scholars from KICE (Korea Institute for Curriculum and Evaluation), who are confined to a hotel for this process. It is one of the most reputable exams in Korea. Evaluating the performance of LLMs (Large Language Models) based on the metrics used to assess test-takers provides an excellent opportunity to compare human proficiency with the language capabilities of LLMs.
- Due to copyright concerns, the CSAT benchmark dataset will not be made publicly available. The evaluation data spans from the 2015 CSAT to the 2024 CSAT, and elective subjects for the years 2022 to 2024 will be limited to Speech and Composition.
- To ensure fairness, the prompts will not be disclosed.
- In the future, the leaderboard will be updated to reflect models submitted for all subjects, including Korean, English, Math, Science, and Social Studies, on the day of the CSAT.
- This Korean-SAT benchmarking system is powered by AutoRAG. AutoRAG is awesome!! (AutoRAG is an automatic RAG optimization tool that can also be used for LLM performance comparison and prompt engineering.) Since the introduction of elective subjects in 2022, the standard score formula has been derived using the Crux Table provided by Crux Consulting.
For questions, errors, or support, feel free to reach out:
- Email: [email protected]
Are you ready to become the next KO-SAT Slayer Champion?
- This dataset is sourced from the Korea Institute for Curriculum and Evaluation (KICE).
This benchmark leaderboard is a non-profit project that aims to provide information on LLM performance with SAT benchmarks!
Alternative AI tools for Korean-SAT-LLM-Leaderboard
Similar Open Source Tools
awesome-llm
Awesome LLM is a curated list of resources related to Large Language Models (LLMs), including models, projects, datasets, benchmarks, materials, papers, posts, GitHub repositories, HuggingFace repositories, and reading materials. It provides detailed information on various LLMs, their parameter sizes, announcement dates, and contributors. The repository covers a wide range of LLM-related topics and serves as a valuable resource for researchers, developers, and enthusiasts interested in the field of natural language processing and artificial intelligence.
Native-LLM-for-Android
This repository provides a demonstration of running a native Large Language Model (LLM) on Android devices. It supports various models such as Qwen2.5-Instruct, MiniCPM-DPO/SFT, Yuan2.0, Gemma2-it, StableLM2-Chat/Zephyr, and Phi3.5-mini-instruct. The demo models are optimized for extreme execution speed after being converted from HuggingFace or ModelScope. Users can download the demo models from the provided drive link, place them in the assets folder, and follow specific instructions for decompression and model export. The repository also includes information on quantization methods and performance benchmarks for different models on various devices.
Tutorial
The Bookworm·Puyu large model training camp aims to promote the implementation of large models in more industries and provide developers with a more efficient platform for learning the development and application of large models. Within two weeks, you will learn the entire process of fine-tuning, deploying, and evaluating large models.
InternVL
InternVL scales up the ViT to 6B parameters and aligns it with LLMs. It is a vision-language foundation model that supports tasks across visual perception (linear-probe image classification, semantic segmentation, zero-shot image and video classification, and multilingual zero-shot image classification), cross-modal retrieval (English, Chinese, and multilingual zero-shot image-text retrieval, including XTD), and multimodal dialogue (zero-shot image captioning, multimodal benchmarks with frozen or trainable LLMs, and Tiny LVLM). InternVL achieves state-of-the-art results on a variety of benchmarks; for example, it reaches 51.6% on the MMMU benchmark and 82.2% on the DocVQA question answering benchmark, higher than GPT-4V and Gemini Pro on both. InternVL is open-sourced and available on Hugging Face, and can be used for applications such as image classification, object detection, semantic segmentation, image captioning, and question answering.
tt-metal
TT-NN is a python & C++ Neural Network OP library. It provides a low-level programming model, TT-Metalium, enabling kernel development for Tenstorrent hardware.
goodai-ltm-benchmark
This repository contains code and data for replicating experiments on Long-Term Memory (LTM) abilities of conversational agents. It includes a benchmark for testing agents' memory performance over long conversations, evaluating tasks requiring dynamic memory upkeep and information integration. The repository supports various models, datasets, and configurations for benchmarking and reporting results.
awesome-mobile-llm
Awesome Mobile LLMs is a curated list of Large Language Models (LLMs) and related studies focused on mobile and embedded hardware. The repository includes information on various LLM models, deployment frameworks, benchmarking efforts, applications, multimodal LLMs, surveys on efficient LLMs, training LLMs on device, mobile-related use-cases, industry announcements, and related repositories. It aims to be a valuable resource for researchers, engineers, and practitioners interested in mobile LLMs.
visionOS-examples
visionOS-examples is a repository containing accelerators for Spatial Computing. It includes examples such as Local Large Language Model, Chat Apple Vision Pro, WebSockets, Anchor To Head, Hand Tracking, Battery Life, Countdown, Plane Detection, Timer Vision, and PencilKit for visionOS. The repository showcases various functionalities and features for Apple Vision Pro, offering tools for developers to enhance their visionOS apps with capabilities like hand tracking, plane detection, and real-time cryptocurrency prices.
LLMs
LLMs is a Chinese large language model technology stack for practical use. It includes high-availability pre-training, SFT, and DPO preference alignment code framework. The repository covers pre-training data cleaning, high-concurrency framework, SFT dataset cleaning, data quality improvement, and security alignment work for Chinese large language models. It also provides open-source SFT dataset construction, pre-training from scratch, and various tools and frameworks for data cleaning, quality optimization, and task alignment.
awesome-japanese-llm
A comprehensive overview of Japanese LLMs.
unsloth
Unsloth is a tool that allows users to fine-tune large language models (LLMs) 2-5x faster with 80% less memory. It is a free and open-source tool that can be used to fine-tune LLMs such as Gemma, Mistral, Llama 2-5, TinyLlama, and CodeLlama 34b. Unsloth supports 4-bit and 16-bit QLoRA / LoRA fine-tuning via bitsandbytes. It also supports DPO (Direct Preference Optimization), PPO, and Reward Modelling. Unsloth is compatible with Hugging Face's TRL, Trainer, Seq2SeqTrainer, and Pytorch code. It is also compatible with NVIDIA GPUs since 2018+ (minimum CUDA Capability 7.0).
info8006-introduction-to-ai
INFO8006 Introduction to Artificial Intelligence is a course at ULiège that covers various topics in AI such as intelligent agents, problem-solving, games, probabilistic reasoning, machine learning, neural networks, reinforcement learning, and decision-making. The course includes lectures, exercises, and programming projects using Python. Students can access course materials, previous exams, and archived lectures to enhance their understanding of AI concepts.
Awesome-Tabular-LLMs
This repository is a collection of papers on Tabular Large Language Models (LLMs) specialized for processing tabular data. It includes surveys, models, and applications related to table understanding tasks such as Table Question Answering, Table-to-Text, Text-to-SQL, and more. The repository categorizes the papers based on key ideas and provides insights into the advancements in using LLMs for processing diverse tables and fulfilling various tabular tasks based on natural language instructions.
TrustLLM
TrustLLM is a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. The document explains how to use the trustllm python package to help you assess the performance of your LLM in trustworthiness more quickly. For more details about TrustLLM, please refer to project website.
awesome-local-llms
The 'awesome-local-llms' repository is a curated list of open-source tools for local Large Language Model (LLM) inference, covering both proprietary and open weights LLMs. The repository categorizes these tools into LLM inference backend engines, LLM front end UIs, and all-in-one desktop applications. It collects GitHub repository metrics as proxies for popularity and active maintenance. Contributions are encouraged, and users can suggest additional open-source repositories through the Issues section or by running a provided script to update the README and make a pull request. The repository aims to provide a comprehensive resource for exploring and utilizing local LLM tools.
For similar tasks
LLMFarm
LLMFarm is an iOS and MacOS app designed to work with large language models (LLM). It allows users to load different LLMs with specific parameters, test the performance of various LLMs on iOS and macOS, and identify the most suitable model for their projects. The tool is based on ggml and llama.cpp by Georgi Gerganov and incorporates sources from rwkv.cpp by saharNooby, Mia by byroneverson, and LlamaChat by alexrozanski. LLMFarm features support for MacOS (13+) and iOS (16+), various inferences and sampling methods, Metal compatibility (not supported on Intel Mac), model setting templates, LoRA adapters support, LoRA finetune support, LoRA export as model support, and more. It also offers a range of inferences including LLaMA, GPTNeoX, Replit, GPT2, Starcoder, RWKV, Falcon, MPT, Bloom, and others. Additionally, it supports multimodal models like LLaVA, Obsidian, and MobileVLM. Users can customize inference options through JSON files and access supported models for download.
abliterator
abliterator.py is a simple Python library/structure designed to ablate features in large language models (LLMs) supported by TransformerLens. It provides capabilities to enter temporary contexts, cache activations with N samples, calculate refusal directions, and includes tokenizer utilities. The library aims to streamline the process of experimenting with ablation direction turns by encapsulating useful logic and minimizing code complexity. While currently basic and lacking comprehensive documentation, the library serves well for personal workflows and aims to expand beyond feature ablation to augmentation and additional features over time with community support.
MMOS
MMOS (Mix of Minimal Optimal Sets) is a dataset designed for math reasoning tasks, offering higher performance and lower construction costs. It includes various models and data subsets for tasks like arithmetic reasoning and math word problem solving. The dataset is used to identify minimal optimal sets through reasoning paths and statistical analysis, with a focus on QA-pairs generated from open-source datasets. MMOS also provides an auto problem generator for testing model robustness and scripts for training and inference.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.