Korean-SAT-LLM-Leaderboard
Korean SAT leaderboard
Stars: 160
The Korean SAT LLM Leaderboard is a benchmarking project that allows users to test their fine-tuned Korean language models on a 10-year dataset of the Korean College Scholastic Ability Test (CSAT). The project provides a platform to compare human academic ability with the performance of large language models (LLMs) on various question types to assess reading comprehension, critical thinking, and sentence interpretation skills. It aims to share benchmark data, utilize a reliable evaluation dataset curated by the Korea Institute for Curriculum and Evaluation, provide annual updates to prevent data leakage, and promote open-source LLM advancement for achieving top-tier performance on the Korean CSAT.
README:
- What is the Korean SAT LLM Leaderboard?
- Leaderboard
- Submission Guidelines
- About the Benchmark Dataset
- Metric
- Helpful Reference
- More About the CSAT (Korean college entrance exam)
- Notice
- Contact Us
Take advantage of this unique opportunity to compare human academic ability with the performance of large language models (LLMs) based on the highly reputable College Scholastic Ability Test (CSAT)!
Test how well your fine-tuned Korean LLM performs on a 10-year benchmark of the Korean CSAT and see what score it would achieve right now!
The current benchmark has been completed using GPT-based models. Future performance evaluations for other models will be conducted as additional resources and funding become available.
Please note that the current grade thresholds are estimated, and they will be updated once the official thresholds are announced.
Rank | Model Name | Standard Score | Raw Score | Common Subject Score | Elective Subject Score | Estimated Grade Cut (CruxTable Standard) |
---|---|---|---|---|---|---|
🥇 1st | o1-Preview | 133 | 97 | 73 | 24 | Grade 1 |
🥈 2nd | o1-mini | 115 | 78 | 57 | 21 | Grade 4 |
🥉 3rd | gemini_2.0_experimental_advanced | 114 | 77 | 55 | 22 | Grade 4 |
4th | gpt-4o | 112 | 75 | 56 | 19 | Grade 4 |
5th | claude-3-5-sonnet-20241022 | 108 | 70 | 54 | 16 | Grade 4 |
6th | HyperClovaX | 108 | 61 | 48 | 24 | Grade 4 |
7th | gpt-4o-mini | 97 | 59 | 44 | 15 | Grade 5 |
8th | gpt-3.5-turbo | 56 | 16 | 10 | 6 | Grade 9 |
cf.)
The only question o1-Preview answered incorrectly was question 8 (3 points) from the CSAT Korean section, a non-literary text question!
For those curious about the analysis of the incorrect question and a detailed explanation of the experiment, please refer to this link.
The Korean SAT LLM Leaderboard benchmarks 10 years of Korean CSAT (College Scholastic Ability Test) exams, which are developed by the reputable KICE (Korea Institute for Curriculum and Evaluation).
The CSAT includes a wide range of question types at varying difficulty levels, designed to assess reading comprehension, critical thinking, and sentence interpretation skills.
If you want to know more about the Korean SAT (Korean college entrance exam), please refer to this!
Leaderboard Rank | Model Name | Submitter Name | Avg. std Score | Avg. Grade | 2024 SAT | 2023 SAT | 2022 SAT | 2021 SAT | 2020 SAT | 2019 SAT | 2018 SAT | 2017 SAT | 2016 SAT | 2015 SAT | URL |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
🥇 1st | gpt-4o-2024-08-06 | OpenAI | 114.9 | 3.6 | 65 (4) | 81 (4) | 70 (4) | 69 (4) | 76 (4) | 74 (3) | 77 (4) | 86 (2) | 84 (3) | 77 (4) | Link |
🥈 2nd | Meta-Llama-3.1-405B-Instruct-Turbo | meta-llama | 113.8 | 3.8 | 77 (3) | 87 (3) | 69 (4) | 70 (4) | 65 (5) | 68 (4) | 78 (4) | 80 (3) | 87 (3) | 68 (5) | Link |
🥉 3rd | Qwen2.5-72B-Instruct-Turbo | Qwen | 105.8 | 4.6 | 61 (5) | 78 (4) | 52 (6) | 60 (5) | 60 (5) | 64 (4) | 74 (4) | 70 (5) | 74 (4) | 79 (4) | Link |
4th | Meta-Llama-3.1-70B-Instruct-Turbo | meta-llama | 103.7 | 4.8 | 50 (6) | 72 (5) | 73 (3) | 61 (5) | 79 (3) | 51 (5) | 58 (6) | 66 (5) | 71 (5) | 70 (5) | Link |
5th | claude-3-5-sonnet-20241022 | Anthropic | 102.6 | 5 | 60 (5) | 61 (6) | 69 (4) | 58 (5) | 72 (4) | 63 (4) | 71 (5) | 70 (5) | 58 (6) | 55 (6) | Link |
6th | Qwen2-72B-Instruct | Qwen | 98 | 5.2 | 53 (5) | 57 (6) | 59 (5) | 45 (6) | 57 (5) | 56 (5) | 76 (4) | 69 (5) | 58 (6) | 63 (5) | Link |
7th | gpt-4o-mini-2024-07-18 | OpenAI | 93.9 | 5.6 | 57 (5) | 53 (6) | 50 (6) | 55 (5) | 50 (6) | 46 (6) | 62 (5) | 58 (6) | 64 (5) | 57 (6) | Link |
8th | gemma-2-27b-it | Google | 91 | 5.9 | 51 (6) | 54 (6) | 51 (6) | 51 (6) | 50 (6) | 37 (7) | 50 (6) | 71 (4) | 54 (6) | 56 (6) | Link |
9th | solar-mini-ja | Upstage | 85.9 | 6.2 | 46 (6) | 58 (6) | 43 (6) | 41 (7) | 46 (6) | 51 (5) | 49 (6) | 48 (7) | 40 (7) | 52 (6) | Link |
10th | solar-mini | Upstage | 85.5 | 6.4 | 33 (7) | 57 (6) | 48 (6) | 42 (7) | 46 (6) | 50 (6) | 43 (7) | 55 (6) | 42 (7) | 56 (6) | Link |
11th | Mixtral-8x22B-Instruct-v0.1 | MistralAI | 83.4 | 6.6 | 40 (7) | 44 (7) | 47 (6) | 31 (8) | 38 (7) | 35 (7) | 65 (5) | 57 (6) | 50 (6) | 44 (7) | Link |
12th | WizardLM-2-8x22B | Microsoft | 83.3 | 6.6 | 37 (7) | 56 (6) | 47 (6) | 30 (8) | 52 (6) | 29 (8) | 51 (6) | 47 (7) | 51 (6) | 53 (6) | Link |
13th | Qwen2.5-7B-Instruct-Turbo | Qwen | 80.3 | 6.8 | 40 (7) | 40 (7) | 39 (7) | 35 (7) | 35 (7) | 35 (7) | 58 (6) | 53 (6) | 44 (7) | 42 (7) | Link |
14th | Meta-Llama-3.1-8B-Instruct-Turbo | meta-llama | 74.7 | 7.1 | 46 (6) | 31 (8) | 36 (7) | 34 (7) | 36 (7) | 24 (8) | 38 (7) | 38 (7) | 37 (7) | 45 (7) | Link |
15th | gpt-3.5-turbo-0125 | OpenAI | 68.7 | 7.7 | 29 (8) | 39 (7) | 26 (8) | 17 (9) | 36 (7) | 24 (8) | 38 (7) | 25 (8) | 45 (7) | 27 (8) | Link |
16th | Mixtral-8x7B-Instruct-v0.1 | MistralAI | 63.4 | 8.3 | 19 (9) | 25 (8) | 40 (7) | 20 (9) | 27 (8) | 19 (9) | 37 (7) | 16 (9) | 30 (8) | 19 (9) | Link |
17th | gemma-2-9b-it | Google | 61.2 | 8.4 | 24 (8) | 20 (9) | 16 (9) | 22 (9) | 17 (9) | 29 (8) | 24 (8) | 25 (8) | 25 (8) | 29 (8) | Link |
18th | Llama-3.2-3B-Instruct-Turbo | meta-llama | 60.6 | 8.7 | 28 (8) | 18 (9) | 27 (8) | 23 (9) | 16 (9) | 17 (9) | 21 (9) | 29 (8) | 22 (9) | 23 (9) | Link |
19th | Mistral-7B-Instruct-v0.3 | MistralAI | 57.2 | 8.9 | 17 (9) | 11 (9) | 22 (9) | 12 (9) | 18 (9) | 21 (9) | 19 (9) | 27 (8) | 23 (9) | 21 (9) | Link |
- Ranking Criteria: Average of the standard scores over the 10 years of CSAT exams (the standard scores reflect the difficulty level of each year's exam in the overall score); a minimal aggregation sketch follows these notes.
- Avg. Std Score: The average of standard scores calculated using the KICE (Korea Institute for Curriculum and Evaluation) method.
- Avg. Grade: The average grade across the 10 exams.
- CSAT Scores by Year: Raw score (grade).
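To make the ranking criteria concrete, here is a minimal sketch of how the leaderboard columns could be aggregated from per-year results. The names (`YearResult`, `aggregate`, `rank`) are illustrative assumptions for this sketch, not the project's actual evaluation code, which is not public.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class YearResult:
    """One model's result on one year's CSAT Korean exam."""
    year: int          # e.g. 2024
    raw_score: int     # raw score out of 100
    std_score: float   # standard score (KICE method)
    grade: int         # 1 (best) to 9 (worst)

def aggregate(results: list[YearResult]) -> dict:
    """Collapse the per-year results into the leaderboard columns."""
    return {
        "avg_std_score": round(mean(r.std_score for r in results), 1),  # ranking key
        "avg_grade": round(mean(r.grade for r in results), 1),
        "by_year": {r.year: f"{r.raw_score} ({r.grade})" for r in results},
    }

def rank(models: dict[str, list[YearResult]]) -> list[tuple[str, dict]]:
    """Order models by descending average standard score."""
    rows = [(name, aggregate(res)) for name, res in models.items()]
    return sorted(rows, key=lambda row: row[1]["avg_std_score"], reverse=True)
```

For example, ten yearly results whose standard scores average 114.9 and whose grades average 3.6 would reproduce the first row of the table above.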
Click here for details on the scoring system.
e.g.)
- o1-preview: 88 points (Grade 1, Top 4%)
- o1-mini: 60 points (Grade 5)
Note: The GPT o1 model is scheduled for a benchmark update when the official version of o1 is released!
- Donating GPU resources would be greatly appreciated and would help with the evaluations. Thank you!
- We welcome any and all feedback! Feel free to share your thoughts anytime!
- Sharing Benchmark Data: Provide benchmarks that allow for the comparison of human performance and LLM performance.
- Reliable Evaluation Dataset: Utilize the highly authoritative benchmark dataset curated by KICE (Korea Institute for Curriculum and Evaluation) to assess Korean language proficiency.
- Annual Updates with Leakage Prevention: Update the CSAT Korean benchmark dataset annually to prevent data leakage.
- Open-Source LLM Advancement: Enable open-source LLMs, independent of any specific country or company, to achieve top-tier (1st-grade) performance on the Korean CSAT.
- If you prefer to keep your model's performance private and not appear on the public leaderboard, feel free to leave a note in the "Comments" section.
- ⭐️ Your model must have a minimum context length of 8K tokens to solve the Korean SAT questions! (A quick way to check this is sketched below.)
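Before submitting, you may want to confirm that your model's configured context window meets the 8K-token requirement. The snippet below is only a rough, unofficial sketch using the Hugging Face `transformers` library; the attribute that holds the context window varies by architecture, and the model ID shown is just an example.

```python
from transformers import AutoConfig

def max_context_length(model_id: str) -> int | None:
    """Return the model's configured context window, if the config exposes one."""
    config = AutoConfig.from_pretrained(model_id)
    # Different architectures store the context window under different names.
    for attr in ("max_position_embeddings", "n_positions", "seq_length"):
        if hasattr(config, attr):
            return int(getattr(config, attr))
    return None

if __name__ == "__main__":
    length = max_context_length("Qwen/Qwen2.5-7B-Instruct")  # example model ID
    if length is not None and length >= 8192:
        print(f"Context window of {length} tokens meets the 8K requirement.")
    else:
        print(f"Context window {length}: too short or could not be determined.")
```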
- Model Submission:
  - Submit via the Form: Fill in the survey form to submit your model!
  - Email Submission: Send the Hugging Face URL of your fine-tuned model along with your nickname.
    - Submission email: [email protected]
  - Submit via GitHub Issue: Post your model's Hugging Face URL and nickname in a GitHub issue.
  <Example for email or GitHub issue submission>
  Submitter Name: Elon Musk
  Hugging Face Submission URL: https://huggingface.co/Elon_model
  Comments: Let's go Mars!
- Check the Leaderboard: View your rank on GitHub or Hugging Face.
- Climb the Ranks: Improve your score and claim the Slayer Champion title!
Notice: Evaluation may take 1-3 weeks depending on available GPU resources and submission volume.
- This competition utilizes CSAT Korean questions from the 10-year period between 2015 and 2024.
- For the elective subjects introduced in 2022, Speech and Composition will be used as the elective subject for benchmarking.
- The key evaluation metrics for the benchmark dataset include: Language Comprehension, Core Content Understanding, Logical Reasoning, Critical Thinking, Creative Thinking, and Multimedia Interpretation Skills. <Source: 2024 KICE CSAT Korean Evaluation Criteria>
- The comprehension of LLMs can be assessed by examining their performance on problems from various fields such as humanities, social sciences, science, technology, and the arts.
Non-literary (Reading) Section
- Content: Passages are drawn from a wide range of fields, including humanities, social sciences, science, technology, and the arts.
  - Humanities passages: Cover topics such as philosophy, argumentation, and history.
  - Social sciences passages: Focus on subjects like economics, politics, law, and culture.
  - Science passages: Include content related to physics, chemistry, biology, and earth sciences.
  - Technology passages: Feature topics like computers, machinery, and everyday science.
  - Arts passages: Encompass a variety of artistic themes, including visual arts, music, architecture, and film.
  - Integrated passages: Complex, long passages that combine multiple fields are also frequently included.
- Evaluation focus: Assessment of understanding across diverse domains, evaluation of reasoning and critical thinking skills.
Literature Section
- Content: Covers a variety of literary genres such as classical and modern novels, classical and modern poetry, and classical essays.
- Evaluation focus: Assessment of emotional and stylistic comprehension, evaluation of understanding of various literary periods and genres.
Speech and Composition Section
- Content: Includes problems dealing with dialogue and writing.
- Evaluation focus: Assessment of understanding in conversational contexts and evaluation of writing skills.
- In this competition, the answers submitted by each model are evaluated based on whether they match the actual correct answers.
- Scores are graded for each year's questions, and the final ranking is determined by the average of the standardized scores.
- Raw Score: The score achieved out of a maximum of 100 points on the test.
- Standardized Score: A score that measures how far the test taker's raw score deviates from the average.
- Grade: Based on standardized scores, test takers are classified into 9 grades. For the Korean, Math, and Inquiry subjects, the top 4% of all test takers receive a grade of 1, the next 7% (cumulative 11%) receive a grade of 2, and the next 12% (cumulative 23%) receive a grade of 3. Refer to EBSI
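As a rough illustration of the metric, the sketch below scores a model's answers by exact match against the answer key, converts the raw score into a standard score, and maps a percentile to a grade. It assumes the commonly cited formula for the CSAT Korean section (standard score = 20 × z-score + 100) and the standard 9-grade cumulative cut-offs, of which only the first three are quoted above; the leaderboard itself follows the KICE method and the Crux Table, so details may differ.

```python
from statistics import mean, pstdev

# Cumulative percentage of test takers for grades 1..9 (top 4%, 11%, 23%, ...).
# Only the first three values are stated above; the rest follow the standard
# CSAT 9-grade scheme and are an assumption of this sketch.
GRADE_CUMULATIVE = [4, 11, 23, 40, 60, 77, 89, 96, 100]

def raw_score(predictions: dict[int, int], answer_key: dict[int, int],
              points: dict[int, int]) -> int:
    """Exact match: a question earns its points only if the chosen option matches the key."""
    return sum(points[q] for q, correct in answer_key.items() if predictions.get(q) == correct)

def standard_score(raw: float, population_raw_scores: list[float]) -> int:
    """Commonly cited CSAT Korean formula: 20 * z-score + 100, rounded."""
    mu, sigma = mean(population_raw_scores), pstdev(population_raw_scores)
    return round(20 * (raw - mu) / sigma + 100)

def grade(percentile_from_top: float) -> int:
    """Map a 'top X%' standing to a grade (1 = best, 9 = worst)."""
    for g, cutoff in enumerate(GRADE_CUMULATIVE, start=1):
        if percentile_from_top <= cutoff:
            return g
    return 9
```

For example, a model whose standard score places it in the top 4% of test takers would receive Grade 1, matching the thresholds listed above.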
The Korean CSAT (College Scholastic Ability Test) Korean Language section is a university entrance exam, with questions meticulously selected by renowned scholars from KICE (Korea Institute for Curriculum and Evaluation), who are confined to a hotel for this process. It is one of the most reputable exams in Korea. Evaluating the performance of LLMs (Large Language Models) based on the metrics used to assess test-takers provides an excellent opportunity to compare human proficiency with the language capabilities of LLMs.
- Due to copyright concerns, the CSAT benchmark dataset will not be made publicly available. The evaluation data spans from the 2015 CSAT to the 2024 CSAT, and elective subjects for the years 2022 to 2024 will be limited to Speech and Composition.
- To ensure fairness, the prompts will not be disclosed.
- In the future, the leaderboard will be updated to reflect models submitted for all subjects, including Korean, English, Math, Science, and Social Studies, on the day of the CSAT.
- This Korean-SAT benchmarking system is powered by AutoRAG. AutoRAG is awesome!! (AutoRAG is an automatic RAG optimization tool that can also be used for LLM performance comparison and prompt engineering.) Since the introduction of elective subjects in 2022, the standard score formula has been derived using the Crux Table provided by Crux Consulting.
For questions, errors, or support, feel free to reach out:
- Email: [email protected]
Are you ready to become the next KO-SAT Slayer Champion?
- This dataset is sourced from the Korea Institute for Curriculum and Evaluation (KICE).
This benchmark leaderboard is a non-profit project that aims to provide information on LLM performance with SAT benchmarks!
Alternative AI tools for Korean-SAT-LLM-Leaderboard
Similar Open Source Tools
awesome-llm
Awesome LLM is a curated list of resources related to Large Language Models (LLMs), including models, projects, datasets, benchmarks, materials, papers, posts, GitHub repositories, HuggingFace repositories, and reading materials. It provides detailed information on various LLMs, their parameter sizes, announcement dates, and contributors. The repository covers a wide range of LLM-related topics and serves as a valuable resource for researchers, developers, and enthusiasts interested in the field of natural language processing and artificial intelligence.
Native-LLM-for-Android
This repository provides a demonstration of running a native Large Language Model (LLM) on Android devices. It supports various models such as Qwen2.5-Instruct, MiniCPM-DPO/SFT, Yuan2.0, Gemma2-it, StableLM2-Chat/Zephyr, and Phi3.5-mini-instruct. The demo models are optimized for extreme execution speed after being converted from HuggingFace or ModelScope. Users can download the demo models from the provided drive link, place them in the assets folder, and follow specific instructions for decompression and model export. The repository also includes information on quantization methods and performance benchmarks for different models on various devices.
Tutorial
The Bookworm·Puyu large model training camp aims to promote the implementation of large models in more industries and provide developers with a more efficient platform for learning the development and application of large models. Within two weeks, you will learn the entire process of fine-tuning, deploying, and evaluating large models.
InternVL
InternVL scales up the ViT to 6B parameters and aligns it with LLMs. It is a vision-language foundation model that supports tasks across visual perception (linear-probe image classification, semantic segmentation, zero-shot image and video classification, and multilingual zero-shot image classification), cross-modal retrieval (English, Chinese, and multilingual zero-shot image-text retrieval, including XTD), and multimodal dialogue (zero-shot image captioning, multimodal benchmarks with frozen or trainable LLMs, and Tiny LVLM). InternVL achieves state-of-the-art results on a variety of benchmarks; for example, it reaches 51.6% on the MMMU benchmark and 82.2% on the DocVQA question answering benchmark, higher than GPT-4V and Gemini Pro on both. InternVL is open-sourced and available on Hugging Face, and can be used for applications such as image classification, object detection, semantic segmentation, image captioning, and question answering.
tt-metal
TT-NN is a python & C++ Neural Network OP library. It provides a low-level programming model, TT-Metalium, enabling kernel development for Tenstorrent hardware.
goodai-ltm-benchmark
This repository contains code and data for replicating experiments on Long-Term Memory (LTM) abilities of conversational agents. It includes a benchmark for testing agents' memory performance over long conversations, evaluating tasks requiring dynamic memory upkeep and information integration. The repository supports various models, datasets, and configurations for benchmarking and reporting results.
awesome-mobile-llm
Awesome Mobile LLMs is a curated list of Large Language Models (LLMs) and related studies focused on mobile and embedded hardware. The repository includes information on various LLM models, deployment frameworks, benchmarking efforts, applications, multimodal LLMs, surveys on efficient LLMs, training LLMs on device, mobile-related use-cases, industry announcements, and related repositories. It aims to be a valuable resource for researchers, engineers, and practitioners interested in mobile LLMs.
visionOS-examples
visionOS-examples is a repository containing accelerators for Spatial Computing. It includes examples such as Local Large Language Model, Chat Apple Vision Pro, WebSockets, Anchor To Head, Hand Tracking, Battery Life, Countdown, Plane Detection, Timer Vision, and PencilKit for visionOS. The repository showcases various functionalities and features for Apple Vision Pro, offering tools for developers to enhance their visionOS apps with capabilities like hand tracking, plane detection, and real-time cryptocurrency prices.
LLMs
LLMs is a Chinese large language model technology stack for practical use. It includes high-availability pre-training, SFT, and DPO preference alignment code framework. The repository covers pre-training data cleaning, high-concurrency framework, SFT dataset cleaning, data quality improvement, and security alignment work for Chinese large language models. It also provides open-source SFT dataset construction, pre-training from scratch, and various tools and frameworks for data cleaning, quality optimization, and task alignment.
awesome-japanese-llm
A comprehensive overview of Japanese LLMs.
unsloth
Unsloth is a tool that allows users to fine-tune large language models (LLMs) 2-5x faster with 80% less memory. It is a free and open-source tool that can be used to fine-tune LLMs such as Gemma, Mistral, Llama 2-5, TinyLlama, and CodeLlama 34b. Unsloth supports 4-bit and 16-bit QLoRA / LoRA fine-tuning via bitsandbytes. It also supports DPO (Direct Preference Optimization), PPO, and Reward Modelling. Unsloth is compatible with Hugging Face's TRL, Trainer, Seq2SeqTrainer, and Pytorch code. It is also compatible with NVIDIA GPUs since 2018+ (minimum CUDA Capability 7.0).
info8006-introduction-to-ai
INFO8006 Introduction to Artificial Intelligence is a course at ULiège that covers various topics in AI such as intelligent agents, problem-solving, games, probabilistic reasoning, machine learning, neural networks, reinforcement learning, and decision-making. The course includes lectures, exercises, and programming projects using Python. Students can access course materials, previous exams, and archived lectures to enhance their understanding of AI concepts.
Awesome-Tabular-LLMs
This repository is a collection of papers on Tabular Large Language Models (LLMs) specialized for processing tabular data. It includes surveys, models, and applications related to table understanding tasks such as Table Question Answering, Table-to-Text, Text-to-SQL, and more. The repository categorizes the papers based on key ideas and provides insights into the advancements in using LLMs for processing diverse tables and fulfilling various tabular tasks based on natural language instructions.
TrustLLM
TrustLLM is a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. The document explains how to use the trustllm python package to help you assess the performance of your LLM in trustworthiness more quickly. For more details about TrustLLM, please refer to project website.
awesome-local-llms
The 'awesome-local-llms' repository is a curated list of open-source tools for local Large Language Model (LLM) inference, covering both proprietary and open weights LLMs. The repository categorizes these tools into LLM inference backend engines, LLM front end UIs, and all-in-one desktop applications. It collects GitHub repository metrics as proxies for popularity and active maintenance. Contributions are encouraged, and users can suggest additional open-source repositories through the Issues section or by running a provided script to update the README and make a pull request. The repository aims to provide a comprehensive resource for exploring and utilizing local LLM tools.
For similar tasks
LLMFarm
LLMFarm is an iOS and MacOS app designed to work with large language models (LLM). It allows users to load different LLMs with specific parameters, test the performance of various LLMs on iOS and macOS, and identify the most suitable model for their projects. The tool is based on ggml and llama.cpp by Georgi Gerganov and incorporates sources from rwkv.cpp by saharNooby, Mia by byroneverson, and LlamaChat by alexrozanski. LLMFarm features support for MacOS (13+) and iOS (16+), various inferences and sampling methods, Metal compatibility (not supported on Intel Mac), model setting templates, LoRA adapters support, LoRA finetune support, LoRA export as model support, and more. It also offers a range of inferences including LLaMA, GPTNeoX, Replit, GPT2, Starcoder, RWKV, Falcon, MPT, Bloom, and others. Additionally, it supports multimodal models like LLaVA, Obsidian, and MobileVLM. Users can customize inference options through JSON files and access supported models for download.
abliterator
abliterator.py is a simple Python library/structure designed to ablate features in large language models (LLMs) supported by TransformerLens. It provides capabilities to enter temporary contexts, cache activations with N samples, calculate refusal directions, and includes tokenizer utilities. The library aims to streamline the process of experimenting with ablation direction turns by encapsulating useful logic and minimizing code complexity. While currently basic and lacking comprehensive documentation, the library serves well for personal workflows and aims to expand beyond feature ablation to augmentation and additional features over time with community support.
MMOS
MMOS (Mix of Minimal Optimal Sets) is a dataset designed for math reasoning tasks, offering higher performance and lower construction costs. It includes various models and data subsets for tasks like arithmetic reasoning and math word problem solving. The dataset is used to identify minimal optimal sets through reasoning paths and statistical analysis, with a focus on QA-pairs generated from open-source datasets. MMOS also provides an auto problem generator for testing model robustness and scripts for training and inference.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.