evalscope

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking

Stars: 692

EvalScope is a framework for evaluating large language models (LLMs). It provides pre-configured benchmark datasets, common evaluation metrics, model integration, automatic evaluation of objective questions, expert-model evaluation of complex tasks, report generation, visualization tools, and inference performance evaluation. It is lightweight and easy to customize: you can integrate new datasets and evaluate models hosted on ModelScope or deployed locally. It also supports several evaluation modes, including single mode, pairwise-baseline mode, and pairwise (all) mode.

README:

Chinese | English

📖 Chinese Documentation | 📖 English Documentation

⭐ If you like this project, please click the "Star" button at the top right to support us. Your support is our motivation to keep going!

📋 Contents

📝 Introduction
☎ User Groups
🎉 News
🛠️ Installation
- Method 1: Install Using pip
- Method 2: Install from Source
🚀 Quick Start
📈 Visualization of Evaluation Results
🌐 Evaluation of Specified Model API
⚙️ Custom Parameter Evaluation
- Parameter
Evaluation Backend
📈 Model Serving Performance Evaluation
🖊️ Custom Dataset Evaluation
🏟️ Arena Mode
👷‍♂️ Contribution
🔜 Roadmap
Star History

📝 Introduction

EvalScope is ModelScope's official framework for model evaluation and benchmarking, designed for diverse assessment needs. It supports various model types including large language models, multimodal, embedding, reranker, and CLIP models.

The framework accommodates multiple evaluation scenarios such as end-to-end RAG evaluation, arena mode, and inference performance testing. It features built-in benchmarks and metrics like MMLU, CMMLU, C-Eval, and GSM8K. Seamlessly integrated with the ms-swift training framework, EvalScope enables one-click evaluations, offering comprehensive support for model training and assessment 🚀

EvalScope Framework.

Framework Description

The architecture includes the following modules:

Model Adapter: Converts the outputs of specific models into the format the framework requires. It supports both models called through an API and models run locally.
Data Adapter: Converts and processes input data to meet the needs and formats of different evaluations.
Evaluation Backend:
- Native: EvalScope’s own default evaluation framework, supporting various evaluation modes, including single model evaluation, arena mode, baseline model comparison mode, etc.
- OpenCompass: Supports OpenCompass as the evaluation backend, providing advanced encapsulation and task simplification, allowing you to submit tasks for evaluation more easily.
- VLMEvalKit: Supports VLMEvalKit as the evaluation backend, enabling easy initiation of multi-modal evaluation tasks, supporting various multi-modal models and datasets.
- RAGEval: Supports RAG evaluation, including independent evaluation of embedding models and rerankers using MTEB/CMTEB and end-to-end evaluation using RAGAS.
- ThirdParty: Other third-party evaluation tasks, such as ToolBench.
Performance Evaluator: Measures the performance of model inference services, covering performance testing, stress testing, performance report generation, and visualization.
Evaluation Report: The final report summarizes the model's performance and can be used for decision-making and further model optimization.
Visualization: Visualized results help users understand evaluation results intuitively and make it easier to analyze and compare the performance of different models.
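
As a rough illustration of how these modules fit together, the sketch below wires a data adapter, a model adapter, and a scoring step into one loop. Every name in it (`Sample`, `DataAdapter`, `ModelAdapter`, `evaluate`) is hypothetical and chosen only for illustration; it is not EvalScope's internal API, and the "model" is a toy callable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Sample:
    prompt: str      # text sent to the model
    reference: str   # expected answer


class DataAdapter:
    """Hypothetical: converts raw records into prompt/reference samples."""

    def load(self, records: Iterable[dict]) -> List[Sample]:
        return [
            Sample(prompt=f"Q: {r['question']}\nA:", reference=str(r["answer"]))
            for r in records
        ]


class ModelAdapter:
    """Hypothetical: wraps any callable (local model or API client) and
    normalizes its output into a plain, stripped string."""

    def __init__(self, generate: Callable[[str], str]):
        self._generate = generate

    def predict(self, prompt: str) -> str:
        return self._generate(prompt).strip()


def evaluate(model: ModelAdapter, samples: List[Sample]) -> float:
    """Exact-match accuracy over the samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    correct = sum(model.predict(s.prompt) == s.reference for s in samples)
    return correct / len(samples)


# Toy stand-in for a model: it always answers "4".
records = [
    {"question": "2 + 2 = ?", "answer": 4},
    {"question": "3 + 3 = ?", "answer": 6},
]
score = evaluate(ModelAdapter(lambda prompt: " 4 "), DataAdapter().load(records))
print(score)
```

The point of the split is that the scoring loop never needs to know whether predictions come from a local model or a remote API, or how the dataset was stored.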

☎ User Groups

Please scan the QR code below to join our community groups:

Discord Group	WeChat Group	DingTalk Group

🎉 News

🔥 [2025.03.20] The model inference service stress testing now supports generating prompts of specified length using random values. Refer to the user guide for more details.
🔥 [2025.03.13] Added support for the LiveCodeBench code evaluation benchmark, which can be used by specifying live_code_bench. Evaluating QwQ-32B on LiveCodeBench is supported; refer to the best practices.
🔥 [2025.03.11] Added support for the SimpleQA and Chinese SimpleQA evaluation benchmarks. These are used to assess the factual accuracy of models, and you can specify simple_qa and chinese_simpleqa for use. Support for specifying a judge model is also available. For more details, refer to the relevant parameter documentation.
🔥 [2025.03.07] Added support for evaluating the reasoning ability and reasoning efficiency of the QwQ-32B model. Refer to 📖 Best Practices for QwQ-32B Evaluation for more details.
🔥 [2025.03.04] Added support for the SuperGPQA dataset, which covers 13 categories, 72 first-level disciplines, and 285 second-level disciplines, totaling 26,529 questions. You can use it by specifying super_gpqa.
🔥 [2025.03.03] Added support for evaluating the IQ and EQ of models. Refer to 📖 Best Practices for IQ and EQ Evaluation to find out how smart your AI is!
🔥 [2025.02.27] Added support for evaluating the reasoning efficiency of models. Refer to 📖 Best Practices for Evaluating Thinking Efficiency. This implementation is inspired by the works Overthinking and Underthinking.
🔥 [2025.02.25] Added support for two reasoning-related evaluation benchmarks: MuSR and ProcessBench. To use them, specify musr and process_bench respectively in the datasets parameter.
🔥 [2025.02.18] Added support for the AIME25 dataset, which contains 15 questions (Grok3 scored 93 on this dataset).
🔥 [2025.02.13] Added support for evaluating DeepSeek distilled models on the AIME24, MATH-500, and GPQA-Diamond datasets; refer to the best practice. Also added the eval_batch_size parameter to accelerate model evaluation.
🔥 [2025.01.20] Support for visualizing evaluation results, including single model evaluation results and multi-model comparison, refer to the 📖 Visualizing Evaluation Results for more details; Added iquiz evaluation example, evaluating the IQ and EQ of the model.
🔥 [2025.01.07] Native backend: Support for model API evaluation is now available. Refer to the 📖 Model API Evaluation Guide for more details. Additionally, support for the ifeval evaluation benchmark has been added.
🔥🔥 [2024.12.31] Support for adding benchmark evaluations, refer to the 📖 Benchmark Evaluation Addition Guide; support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the 📖 Mixed Dataset Evaluation Guide.
🔥 [2024.12.13] Model evaluation optimization: no need to pass the --template-type parameter anymore; supports starting evaluation with evalscope eval --args. Refer to the 📖 User Guide for more details.
🔥 [2024.11.26] The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. For more details, refer to the 📖 User Guide.
🔥 [2024.10.31] The best practice for evaluating Multimodal-RAG has been updated, please check the 📖 Blog for more details.
🔥 [2024.10.23] Supports multimodal RAG evaluation, including the assessment of image-text retrieval using CLIP_Benchmark, and extends RAGAS to support end-to-end multimodal metrics evaluation.
🔥 [2024.10.08] Support for RAG evaluation, including independent evaluation of embedding models and rerankers using MTEB/CMTEB, as well as end-to-end evaluation using RAGAS.
🔥 [2024.09.18] Our documentation has been updated to include a blog module, featuring some technical research and discussions related to evaluations. We invite you to 📖 read it.
🔥 [2024.09.12] Support for LongWriter evaluation, which supports 10,000+ word generation. You can use the benchmark LongBench-Write to measure the long output quality as well as the output length.
🔥 [2024.08.30] Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.
🔥 [2024.08.20] Updated the official documentation, including getting started guides, best practices, and FAQs. Feel free to 📖 read it here!
🔥 [2024.08.09] Simplified the installation process, allowing for pypi installation of vlmeval dependencies; optimized the multimodal model evaluation experience, achieving up to 10x acceleration based on the OpenAI API evaluation chain.
🔥 [2024.07.31] Important change: The package name llmuses has been changed to evalscope. Please update your code accordingly.
🔥 [2024.07.26] Support for VLMEvalKit as a third-party evaluation framework to initiate multimodal model evaluation tasks.
🔥 [2024.06.29] Support for OpenCompass as a third-party evaluation framework, which we have encapsulated at a higher level, supporting pip installation and simplifying evaluation task configuration.
🔥 [2024.06.13] EvalScope seamlessly integrates with the fine-tuning framework SWIFT, providing full-chain support from LLM training to evaluation.
🔥 [2024.06.13] Integrated the Agent evaluation dataset ToolBench.

🛠️ Installation

Method 1: Install Using pip

We recommend using conda to manage your environment and installing dependencies with pip:

Create a conda environment (optional)

# It is recommended to use Python 3.10
conda create -n evalscope python=3.10
# Activate the conda environment
conda activate evalscope

Install dependencies using pip

pip install evalscope                # Install Native backend (default)
# Additional options
pip install 'evalscope[opencompass]'   # Install OpenCompass backend
pip install 'evalscope[vlmeval]'       # Install VLMEvalKit backend
pip install 'evalscope[rag]'           # Install RAGEval backend
pip install 'evalscope[perf]'          # Install dependencies for the model performance testing module
pip install 'evalscope[app]'           # Install dependencies for visualization
pip install 'evalscope[all]'           # Install all backends (Native, OpenCompass, VLMEvalKit, RAGEval)

[!WARNING] The project was renamed to evalscope. To install v0.4.3 or an earlier version, use the old package name. Quote the requirement so that the shell does not interpret <= as a redirection:
pip install 'llmuses<=0.4.3'
With those versions, import from llmuses instead:
from llmuses import ...
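
Because of the rename, older environments may still have `llmuses` installed instead of `evalscope`. The sketch below checks which top-level package is importable without actually importing it. The helper name `first_available` is an assumption made for this example; only the two package names come from the text above.

```python
import importlib.util
from typing import Optional, Sequence


def first_available(candidates: Sequence[str] = ("evalscope", "llmuses")) -> Optional[str]:
    """Return the first importable top-level package name, or None."""
    for name in candidates:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


package = first_available()
if package is None:
    print("Neither evalscope nor llmuses is installed.")
elif package == "llmuses":
    print("Legacy package found; consider upgrading to evalscope.")
else:
    print("evalscope is installed.")
```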

Method 2: Install from Source

Download the source code

git clone https://github.com/modelscope/evalscope.git

Install dependencies

cd evalscope/
pip install -e .                  # Install Native backend
# Additional options
pip install -e '.[opencompass]'   # Install OpenCompass backend
pip install -e '.[vlmeval]'       # Install VLMEvalKit backend
pip install -e '.[rag]'           # Install RAGEval backend
pip install -e '.[perf]'          # Install Perf dependencies
pip install -e '.[app]'           # Install visualization dependencies
pip install -e '.[all]'           # Install all backends (Native, OpenCompass, VLMEvalKit, RAGEval)

🚀 Quick Start

To evaluate a model on specified datasets using default configurations, this framework supports two ways to initiate evaluation tasks: using the command line or using Python code.

Method 1. Using Command Line

Execute the eval command in any directory:

evalscope eval \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --datasets gsm8k arc \
 --limit 5
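
If you want to launch the same command from a script, one option is to assemble the argument list in Python. This is a minimal sketch that uses only the flags shown above (`--model`, `--datasets`, `--limit`); the helper name `build_eval_command` is hypothetical, and the `subprocess` call is left as a comment because it requires evalscope to be installed.

```python
import shlex
from typing import List, Optional, Sequence


def build_eval_command(model: str, datasets: Sequence[str],
                       limit: Optional[int] = None) -> List[str]:
    """Assemble the argv list for `evalscope eval`."""
    cmd = ["evalscope", "eval", "--model", model, "--datasets", *datasets]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


cmd = build_eval_command("Qwen/Qwen2.5-0.5B-Instruct", ["gsm8k", "arc"], limit=5)
print(shlex.join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with model paths that contain spaces.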

Method 2. Using Python Code

To run an evaluation from Python, submit the task with the run_task function and pass it a TaskConfig. The task_cfg argument can also be a Python dictionary, a yaml file path, or a json file path, for example:

Using Python Dictionary

from evalscope.run import run_task

task_cfg = {
    'model': 'Qwen/Qwen2.5-0.5B-Instruct',
    'datasets': ['gsm8k', 'arc'],
    'limit': 5
}

run_task(task_cfg=task_cfg)

More Startup Methods

Using TaskConfig

from evalscope.run import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['gsm8k', 'arc'],
    limit=5
)

run_task(task_cfg=task_cfg)

Using yaml file

config.yaml:

model: Qwen/Qwen2.5-0.5B-Instruct
datasets:
  - gsm8k
  - arc
limit: 5

from evalscope.run import run_task

run_task(task_cfg="config.yaml")

Using json file

config.json:

{
    "model": "Qwen/Qwen2.5-0.5B-Instruct",
    "datasets": ["gsm8k", "arc"],
    "limit": 5
}

from evalscope.run import run_task

run_task(task_cfg="config.json")
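
A config file like the one above can also be generated programmatically, for example when sweeping over several models. The following is a sketch using only the standard library and the three keys shown above; the helper name `write_task_config` and the temporary file location are assumptions made for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional, Sequence


def write_task_config(path: Path, model: str, datasets: Sequence[str],
                      limit: Optional[int] = None) -> Path:
    """Write a JSON task config with the same keys as the example above."""
    cfg = {"model": model, "datasets": list(datasets)}
    if limit is not None:
        cfg["limit"] = limit
    path.write_text(json.dumps(cfg, indent=4), encoding="utf-8")
    return path


config_path = write_task_config(
    Path(tempfile.gettempdir()) / "evalscope_config.json",
    model="Qwen/Qwen2.5-0.5B-Instruct",
    datasets=["gsm8k", "arc"],
    limit=5,
)
loaded = json.loads(config_path.read_text(encoding="utf-8"))
print(loaded)
# With evalscope installed, pass the file on as shown above:
#   from evalscope.run import run_task
#   run_task(task_cfg=str(config_path))
```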

Basic Parameter

--model: The model_id of a model on ModelScope, which is downloaded automatically (e.g., Qwen/Qwen2.5-0.5B-Instruct), or the local path of a model (e.g., /path/to/model)
--datasets: Dataset names; separate multiple datasets with spaces. Datasets are downloaded automatically from ModelScope. For supported datasets, refer to the Dataset List
--limit: Maximum number of samples to evaluate per dataset. If not specified, all data is evaluated. Useful for quick validation

Output Results

+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Model Name            | Dataset Name   | Metric Name     | Category Name   | Subset Name   |   Num |   Score |
+=======================+================+=================+=================+===============+=======+=========+
| Qwen2.5-0.5B-Instruct | gsm8k          | AverageAccuracy | default         | main          |     5 |     0.4 |
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Qwen2.5-0.5B-Instruct | ai2_arc        | AverageAccuracy | default         | ARC-Easy      |     5 |     0.8 |
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
| Qwen2.5-0.5B-Instruct | ai2_arc        | AverageAccuracy | default         | ARC-Challenge |     5 |     0.4 |
+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+
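
If you capture this table as text, it can be turned back into records for further analysis. The sketch below parses the grid-style table and computes a sample-weighted score per dataset. It is an illustration only: the function names are hypothetical, the sample table is abridged (the Category Name column is dropped), and EvalScope's own report files may be a more convenient source than console output.

```python
from typing import Dict, List


def parse_ascii_table(text: str) -> List[Dict[str, str]]:
    """Parse a grid-style table into a list of row dicts keyed by header."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.strip().startswith("|")  # skip +---+ and +===+ border lines
    ]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


def weighted_scores(rows: List[Dict[str, str]]) -> Dict[str, float]:
    """Average the Score column per dataset, weighted by the Num column."""
    totals: Dict[str, List[float]] = {}
    for row in rows:
        num, score = int(row["Num"]), float(row["Score"])
        acc = totals.setdefault(row["Dataset Name"], [0.0, 0])
        acc[0] += num * score
        acc[1] += num
    return {name: hit / count for name, (hit, count) in totals.items() if count}


report = """
+-----------------------+--------------+-----------------+---------------+-----+-------+
| Model Name            | Dataset Name | Metric Name     | Subset Name   | Num | Score |
+=======================+==============+=================+===============+=====+=======+
| Qwen2.5-0.5B-Instruct | gsm8k        | AverageAccuracy | main          |   5 |   0.4 |
+-----------------------+--------------+-----------------+---------------+-----+-------+
| Qwen2.5-0.5B-Instruct | ai2_arc      | AverageAccuracy | ARC-Easy      |   5 |   0.8 |
+-----------------------+--------------+-----------------+---------------+-----+-------+
| Qwen2.5-0.5B-Instruct | ai2_arc      | AverageAccuracy | ARC-Challenge |   5 |   0.4 |
+-----------------------+--------------+-----------------+---------------+-----+-------+
"""
rows = parse_ascii_table(report)
per_dataset = weighted_scores(rows)
print(per_dataset)
```

With only 5 samples per subset, as in the quick-start run, these scores are a smoke test rather than a meaningful measurement.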

📈 Visualization of Evaluation Results

Install the dependencies required for visualization, including gradio, plotly, etc.

pip install 'evalscope[app]'

Start the Visualization Service

Run the following command to start the visualization service.

evalscope app

You can access the visualization service in the browser if the following output appears.

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.

Screenshots: Setting Interface, Model Comparison, Report Overview, Report Details

For more details, refer to: 📖 Visualization of Evaluation Results

🌐 Evaluation of Specified Model API

Specify the model API service address (api_url) and API Key (api_key) to evaluate a deployed model API service. In this case, the eval-type parameter must be set to service.

For example, to launch a model service using vLLM:

export VLLM_USE_MODELSCOPE=True && python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-0.5B-Instruct --served-model-name qwen2.5 --trust_remote_code --port 8801

Then, you can use the following command to evaluate the model API service:

evalscope eval \
 --model qwen2.5 \
 --api-url http://127.0.0.1:8801/v1 \
 --api-key EMPTY \
 --eval-type service \
 --datasets gsm8k \
 --limit 10
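
Before starting a long evaluation, it can help to confirm that the service is reachable and that the served model name matches what you pass to `--model`. The sketch below queries the model-listing endpoint that OpenAI-compatible servers such as vLLM expose under the base URL. The helper names are hypothetical, and the address and key are simply the values from the example above.

```python
import json
import urllib.request
from typing import List, Optional


def models_endpoint(api_url: str) -> str:
    """Return the model-listing URL for an OpenAI-compatible base URL."""
    return api_url.rstrip("/") + "/models"


def list_served_models(api_url: str, api_key: str = "EMPTY",
                       timeout: float = 5.0) -> Optional[List[str]]:
    """Return the served model names, or None if the service is unreachable."""
    request = urllib.request.Request(
        models_endpoint(api_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):  # refused, timed out, HTTP error, or invalid JSON
        return None
    data = payload.get("data", []) if isinstance(payload, dict) else []
    return [item.get("id", "") for item in data if isinstance(item, dict)]


print(models_endpoint("http://127.0.0.1:8801/v1"))
# Example check (needs the service from the vLLM command above to be running):
#   names = list_served_models("http://127.0.0.1:8801/v1")
#   assert names is not None and "qwen2.5" in names
```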

⚙️ Custom Parameter Evaluation

For more customized evaluations, such as setting model parameters or dataset parameters, add the options below. The evaluation is started in the same way as a simple evaluation. The following example uses the eval command:

evalscope eval \
 --model Qwen/Qwen2.5-0.5B-Instruct \
 --model-args revision=master,precision=torch.float16,device_map=auto \
 --generation-config do_sample=true,temperature=0.5 \
 --dataset-args '{"gsm8k": {"few_shot_num": 0, "few_shot_random": false}}' \
 --datasets gsm8k \
 --limit 10

Parameter

--model-args: Model loading parameters, separated by commas in key=value format. Default parameters:
- revision: Model version, default is master
- precision: Model precision, default is auto
- device_map: Model device allocation, default is auto
--generation-config: Generation parameters, separated by commas in key=value format. Default parameters:
- do_sample: Whether to use sampling, default is false
- max_length: Maximum length, default is 2048
- max_new_tokens: Maximum length of generation, default is 512
--dataset-args: Configuration parameters for evaluation datasets, passed in json format. Each key is a dataset name and each value is that dataset's parameters. Every key must match a dataset listed in the --datasets parameter:
- few_shot_num: Number of few-shot examples
- few_shot_random: Whether to randomly sample few-shot data; defaults to true if not set
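
To make the two argument formats concrete, the sketch below parses a comma-separated key=value string into a dictionary and builds the json string for `--dataset-args`. It only illustrates the formats described above: `parse_kv_args` and `unknown_dataset_keys` are hypothetical helpers, and the literal conversion rules (true/false, integers, floats) are assumptions, not EvalScope's actual parser.

```python
import json
from typing import Any, Dict, List, Sequence


def parse_kv_args(arg_string: str) -> Dict[str, Any]:
    """Parse 'k1=v1,k2=v2' into a dict, converting obvious literals."""
    def convert(value: str) -> Any:
        lowered = value.lower()
        if lowered in ("true", "false"):
            return lowered == "true"
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                pass
        return value  # leave anything else (e.g. torch.float16) as a string

    result: Dict[str, Any] = {}
    for pair in filter(None, (p.strip() for p in arg_string.split(","))):
        key, sep, value = pair.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {pair!r}")
        result[key.strip()] = convert(value.strip())
    return result


def unknown_dataset_keys(dataset_args: Dict[str, dict], datasets: Sequence[str]) -> List[str]:
    """Return dataset-args keys that are not listed in datasets."""
    return sorted(set(dataset_args) - set(datasets))


model_args = parse_kv_args("revision=master,precision=torch.float16,device_map=auto")
generation = parse_kv_args("do_sample=true,temperature=0.5")
dataset_args = {"gsm8k": {"few_shot_num": 0, "few_shot_random": False}}
dataset_args_json = json.dumps(dataset_args)

print(model_args)
print(generation)
print(dataset_args_json)
print(unknown_dataset_keys(dataset_args, ["gsm8k"]))
```

Building the json with `json.dumps` rather than by hand avoids the common mistake of writing Python's `False` where json requires `false`.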

Reference: Full Parameter Description

Evaluation Backend

In addition to its own framework, EvalScope can launch evaluation tasks through third-party evaluation frameworks. We call each of these an Evaluation Backend. The currently supported Evaluation Backends are:

Native: EvalScope's own default evaluation framework, supporting various evaluation modes including single model evaluation, arena mode, and baseline model comparison mode.
OpenCompass: Initiate OpenCompass evaluation tasks through EvalScope. It is lightweight, easy to customize, and integrates seamlessly with the LLM fine-tuning framework ms-swift. 📖 User Guide
VLMEvalKit: Initiate VLMEvalKit multimodal evaluation tasks through EvalScope. It supports various multimodal models and datasets and integrates seamlessly with the LLM fine-tuning framework ms-swift. 📖 User Guide
RAGEval: Initiate RAG evaluation tasks through EvalScope, including independent evaluation of embedding models and rerankers using MTEB/CMTEB and end-to-end evaluation using RAGAS. 📖 User Guide
ThirdParty: Third-party evaluation tasks, such as ToolBench and LongBench-Write.

📈 Model Serving Performance Evaluation

EvalScope includes a stress testing tool for large language model inference services. It can be customized to support various dataset formats and different API protocol formats.

Reference: Performance Testing 📖 User Guide

Supports wandb for recording results

Supports Speed Benchmark

It supports speed testing and provides speed benchmarks similar to those found in the official Qwen reports:

Speed Benchmark Results:
+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
|       1       |      50.69      |      0.97      |
|     6144      |      51.36      |      1.23      |
|     14336     |      49.93      |      1.59      |
|     30720     |      49.56      |      2.34      |
+---------------+-----------------+----------------+
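
As a worked reading of the table above, the sketch below computes how much the generation speed varies across prompt lengths and how GPU memory grows with prompt size. The numbers are copied from the table; the derived quantities are simple arithmetic for illustration, not metrics that EvalScope itself reports.

```python
# (prompt_tokens, tokens_per_second, gpu_memory_gb), copied from the table above
rows = [
    (1, 50.69, 0.97),
    (6144, 51.36, 1.23),
    (14336, 49.93, 1.59),
    (30720, 49.56, 2.34),
]

speeds = [speed for _, speed, _ in rows]
# Relative gap between the fastest and slowest measurement.
speed_spread = (max(speeds) - min(speeds)) / max(speeds)

(first_tokens, _, first_mem), (last_tokens, _, last_mem) = rows[0], rows[-1]
# Average memory growth per 1,000 additional prompt tokens.
mem_per_1k_tokens = (last_mem - first_mem) / (last_tokens - first_tokens) * 1000

print(f"speed spread: {speed_spread:.1%}")
print(f"memory growth: {mem_per_1k_tokens:.3f} GB per 1k prompt tokens")
```

In this sample the speed stays within about 3.5% across all prompt lengths, while memory grows by roughly 0.045 GB per 1,000 prompt tokens.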

🖊️ Custom Dataset Evaluation

EvalScope supports custom dataset evaluation. For detailed information, please refer to the Custom Dataset Evaluation 📖User Guide

🏟️ Arena Mode

Arena mode evaluates multiple candidate models through pairwise battles. The evaluation report can be produced either by the AI Enhanced Auto-Reviewer (AAR) automatic evaluation process or by manual evaluation.

Refer to: Arena Mode 📖 User Guide
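
The idea of pairwise battles can be sketched in a few lines: enumerate the model pairs, ask a judge for the winner of each battle, and tally win rates. This is an illustration of the concept only, not EvalScope's implementation. The function names are hypothetical, and the toy `judge` stands in for either an automatic reviewer or a human.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def make_pairs(models: List[str], baseline: Optional[str] = None) -> List[Pair]:
    """All-vs-all pairs, or every candidate against one baseline if given."""
    if baseline is None:
        return list(combinations(models, 2))
    return [(m, baseline) for m in models if m != baseline]


def win_rates(pairs: List[Pair], judge: Callable[[str, str], str]) -> Dict[str, float]:
    """Fraction of battles won per model; judge returns the winner's name or 'tie'."""
    wins: Counter = Counter()
    played: Counter = Counter()
    for a, b in pairs:
        winner = judge(a, b)
        played.update([a, b])
        if winner in (a, b):
            wins[winner] += 1
    return {model: wins[model] / played[model] for model in played}


models = ["model-a", "model-b", "model-c"]
toy_judge = lambda a, b: min(a, b)  # placeholder: alphabetically first model wins

all_pairs = make_pairs(models)
baseline_pairs = make_pairs(models, baseline="model-a")
rates = win_rates(all_pairs, toy_judge)
print(all_pairs)
print(baseline_pairs)
print(rates)
```

All-vs-all comparison needs n(n-1)/2 battles for n models, whereas comparing against a single baseline needs only n-1, which is why a baseline mode is cheaper for large candidate sets.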

👷‍♂️ Contribution

EvalScope, as the official evaluation tool of ModelScope, is continuously optimizing its benchmark evaluation features! We invite you to refer to the Contribution Guide to easily add your own evaluation benchmarks and share your contributions with the community. Let’s work together to support the growth of EvalScope and make our tools even better! Join us now!

🔜 Roadmap

[ ] Support for better evaluation report visualization
[x] Support for mixed evaluations across multiple datasets
[x] RAG evaluation
[x] VLM evaluation
[x] Agents evaluation
[x] vLLM
[ ] Distributed evaluating
[x] Multi-modal evaluation
[ ] Benchmarks
- [ ] GAIA
- [ ] GPQA
- [x] MBPP

Star History

Alternative AI tools for evalscope

Similar Open Source Tools

evalscope

github

: 692

TempCompass

TempCompass is a benchmark designed to evaluate the temporal perception ability of Video LLMs. It encompasses a diverse set of temporal aspects and task formats to comprehensively assess the capability of Video LLMs in understanding videos. The benchmark includes conflicting videos to prevent models from relying on single-frame bias and language priors. Users can clone the repository, install required packages, prepare data, run inference using examples like Video-LLaVA and Gemini, and evaluate the performance of their models across different tasks such as Multi-Choice QA, Yes/No QA, Caption Matching, and Caption Generation.

github

: 71

cb-tumblebug

CB-Tumblebug (CB-TB) is a system for managing multi-cloud infrastructure consisting of resources from multiple cloud service providers. It provides an overview, features, and architecture. The tool supports various cloud providers and resource types, with ongoing development and localization efforts. Users can deploy a multi-cloud infra with GPUs, enjoy multiple LLMs in parallel, and utilize LLM-related scripts. The tool requires Linux, Docker, Docker Compose, and Golang for building the source. Users can run CB-TB with Docker Compose or from the Makefile, set up prerequisites, contribute to the project, and view a list of contributors. The tool is licensed under an open-source license.

github

: 52

llama-assistant

Llama Assistant is an AI-powered assistant that helps with daily tasks, such as voice recognition, natural language processing, summarizing text, rephrasing sentences, answering questions, and more. It runs offline on your local machine, ensuring privacy by not sending data to external servers. The project is a work in progress with regular feature additions.

github

: 300

ScaleLLM

ScaleLLM is a cutting-edge inference system engineered for large language models (LLMs), meticulously designed to meet the demands of production environments. It extends its support to a wide range of popular open-source models, including Llama3, Gemma, Bloom, GPT-NeoX, and more. ScaleLLM is currently undergoing active development. We are fully committed to consistently enhancing its efficiency while also incorporating additional features. Feel free to explore our **_Roadmap_** for more details. ## Key Features * High Efficiency: Excels in high-performance LLM inference, leveraging state-of-the-art techniques and technologies like Flash Attention, Paged Attention, Continuous batching, and more. * Tensor Parallelism: Utilizes tensor parallelism for efficient model execution. * OpenAI-compatible API: An efficient golang rest api server that compatible with OpenAI. * Huggingface models: Seamless integration with most popular HF models, supporting safetensors. * Customizable: Offers flexibility for customization to meet your specific needs, and provides an easy way to add new models. * Production Ready: Engineered with production environments in mind, ScaleLLM is equipped with robust system monitoring and management features to ensure a seamless deployment experience.

github

: 418

duolingo-clone

Lingo is an interactive platform for language learning that provides a modern UI/UX experience. It offers features like courses, quests, and a shop for users to engage with. The tech stack includes React JS, Next JS, Typescript, Tailwind CSS, Vercel, and Postgresql. Users can contribute to the project by submitting changes via pull requests. The platform utilizes resources from CodeWithAntonio, Kenney Assets, Freesound, Elevenlabs AI, and Flagpack. Key dependencies include @clerk/nextjs, @neondatabase/serverless, @radix-ui/react-avatar, and more. Users can follow the project creator on GitHub and Twitter, as well as subscribe to their YouTube channel for updates. To learn more about Next.js, users can refer to the Next.js documentation and interactive tutorial.

github

: 104

amica

Amica is an application that allows you to easily converse with 3D characters in your browser. You can import VRM files, adjust the voice to fit the character, and generate response text that includes emotional expressions.

github

: 879

rig

Rig is a Rust library designed for building scalable, modular, and user-friendly applications powered by large language models (LLMs). It provides full support for LLM completion and embedding workflows, offers simple yet powerful abstractions for LLM providers like OpenAI and Cohere, as well as vector stores such as MongoDB and in-memory storage. With Rig, users can easily integrate LLMs into their applications with minimal boilerplate code.

github

: 3.4k

client-ts

Mistral Typescript Client is an SDK for Mistral AI API, providing Chat Completion and Embeddings APIs. It allows users to create chat completions, upload files, create agent completions, create embedding requests, and more. The SDK supports various JavaScript runtimes and provides detailed documentation on installation, requirements, API key setup, example usage, error handling, server selection, custom HTTP client, authentication, providers support, standalone functions, debugging, and contributions.

github

: 52

llama-assistant

Llama Assistant is a local AI assistant that respects your privacy. It is an AI-powered assistant that can recognize your voice, process natural language, and perform various actions based on your commands. It can help with tasks like summarizing text, rephrasing sentences, answering questions, writing emails, and more. The assistant runs offline on your local machine, ensuring privacy by not sending data to external servers. It supports voice recognition, natural language processing, and customizable UI with adjustable transparency. The project is a work in progress with new features being added regularly.

github

: 485

PromptClip

PromptClip is a tool that allows developers to create video clips using LLM prompts. Users can upload videos from various sources, prompt the video in natural language, use different LLM models, instantly watch the generated clips, finetune the clips, and add music or image overlays. The tool provides a seamless way to extract specific moments from videos based on user queries, making video editing and content creation more efficient and intuitive.

github

: 100

tiddlywiki-starter-kit

TiddlyWiki Starter Kit is a pre-configured setup for TiddlyWiki, utilizing Tailwind CSS for responsive design and providing multiple wiki support for different purposes. It offers quick operations with keyboard shortcuts, simplified configuration through editing the .env file, and one-click installation using npm create command.

github

: 51

fastserve-ai

FastServe-AI is a machine learning serving tool focused on GenAI & LLMs with simplicity as the top priority. It allows users to easily serve custom models by implementing the 'handle' method for 'FastServe'. The tool provides a FastAPI server for custom models and can be deployed using Lightning AI Studio. Users can install FastServe-AI via pip and run it to serve their own GPT-like LLM models in minutes.

github

: 56

Apollo

Apollo is a multilingual medical LLM that covers English, Chinese, French, Hindi, Spanish, Hindi, and Arabic. It is designed to democratize medical AI to 6B people. Apollo has achieved state-of-the-art results on a variety of medical NLP tasks, including question answering, medical dialogue generation, and medical text classification. Apollo is easy to use and can be integrated into a variety of applications, making it a valuable tool for healthcare professionals and researchers.

github

: 155

Noi

Noi is an AI-enhanced customizable browser designed to streamline digital experiences. It includes curated AI websites, allows adding any URL, offers prompts management, Noi Ask for batch messaging, various themes, Noi Cache Mode for quick link access, cookie data isolation, and more. Users can explore, extend, and empower their browsing experience with Noi.

github

: 6.0k

Imagine_AI

IMAGINE - AI is a groundbreaking image generator tool that leverages the power of OpenAI's DALL-E 2 API library to create extraordinary visuals. Developed using Node.js and Express, this tool offers a transformative way to unleash artistic creativity and imagination by generating unique and captivating images through simple prompts or keywords.

github

: 51

For similar tasks

ai-on-gke

This repository contains assets related to AI/ML workloads on Google Kubernetes Engine (GKE). Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. A robust AI/ML platform considers the following layers: Infrastructure orchestration that support GPUs and TPUs for training and serving workloads at scale Flexible integration with distributed computing and data processing frameworks Support for multiple teams on the same infrastructure to maximize utilization of resources

github

: 280

ray

Ray is a unified framework for scaling AI and Python applications. It consists of a core distributed runtime and a set of AI libraries for simplifying ML compute, including Data, Train, Tune, RLlib, and Serve. Ray runs on any machine, cluster, cloud provider, and Kubernetes, and features a growing ecosystem of community integrations. With Ray, you can seamlessly scale the same code from a laptop to a cluster, making it easy to meet the compute-intensive demands of modern ML workloads.

github

: 36.4k

labelbox-python

Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.

github

: 135

djl

Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning. It is designed to be easy to get started with and simple to use for Java developers. DJL provides a native Java development experience and allows users to integrate machine learning and deep learning models with their Java applications. The framework is deep learning engine agnostic, enabling users to switch engines at any point for optimal performance. DJL's ergonomic API interface guides users with best practices to accomplish deep learning tasks, such as running inference and training neural networks.

github

: 4.1k

mlflow

MLflow is a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models. MLflow offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc), wherever you currently run ML code (e.g. in notebooks, standalone applications or the cloud). MLflow's current components are: * `MLflow Tracking `_: An API to log parameters, code, and results in machine learning experiments and compare them using an interactive UI. * `MLflow Projects `_: A code packaging format for reproducible runs using Conda and Docker, so you can share your ML code with others. * `MLflow Models `_: A model packaging format and tools that let you easily deploy the same model (from any ML library) to batch and real-time scoring on platforms such as Docker, Apache Spark, Azure ML and AWS SageMaker. * `MLflow Model Registry `_: A centralized model store, set of APIs, and UI, to collaboratively manage the full lifecycle of MLflow Models.

github

: 19.9k

tt-metal

TT-NN is a python & C++ Neural Network OP library. It provides a low-level programming model, TT-Metalium, enabling kernel development for Tenstorrent hardware.

Stars: 786

burn

Burn is a new, comprehensive, dynamic deep learning framework built in Rust, with extreme flexibility, compute efficiency, and portability as its primary goals.

Stars: 10.2k

awsome-distributed-training

This repository contains reference architectures and test cases for distributed model training with Amazon SageMaker HyperPod, AWS ParallelCluster, AWS Batch, and Amazon EKS. The test cases cover different types and sizes of models as well as different frameworks and parallel optimizations (PyTorch DDP/FSDP, MegatronLM, NemoMegatron, ...).

Stars: 230

For similar jobs

sweep

Sweep is an AI junior developer that turns bugs and feature requests into code changes. It automatically handles developer experience improvements like adding type hints and improving test coverage.

Stars: 7.1k

teams-ai

The Teams AI Library is a software development kit (SDK) that helps developers create bots that can interact with Teams and Microsoft 365 applications. It is built on top of the Bot Framework SDK and simplifies the process of developing bots that interact with Teams' artificial intelligence capabilities. The SDK is available for JavaScript/TypeScript, .NET, and Python.

Stars: 502

ai-guide

This guide is dedicated to Large Language Models (LLMs) that you can run on your home computer. It assumes your PC is a lower-end, non-gaming setup.

Stars: 159

classifai

Supercharge WordPress Content Workflows and Engagement with Artificial Intelligence. Tap into leading cloud-based services like OpenAI, Microsoft Azure AI, Google Gemini and IBM Watson to augment your WordPress-powered websites. Publish content faster while improving SEO performance and increasing audience engagement. ClassifAI integrates Artificial Intelligence and Machine Learning technologies to lighten your workload and eliminate tedious tasks, giving you more time to create original content that matters.

Stars: 620

chatbot-ui

Chatbot UI is an open-source AI chat app that allows users to create and deploy their own AI chatbots. It is easy to use and can be customized to fit any need. Chatbot UI is perfect for businesses, developers, and anyone who wants to create a chatbot.

Stars: 27.7k

BricksLLM

BricksLLM is a cloud-native AI gateway written in Go. Currently, it provides native support for OpenAI, Anthropic, Azure OpenAI, and vLLM. BricksLLM aims to provide enterprise-level infrastructure that can power any LLM production use case. Here are some use cases for BricksLLM:

- Set LLM usage limits for users on different pricing tiers
- Track LLM usage on a per-user and per-organization basis
- Block or redact requests containing PII
- Improve LLM reliability with failovers, retries, and caching
- Distribute API keys with rate limits and cost limits for internal development/production use cases
- Distribute API keys with rate limits and cost limits for students

Stars: 953

uAgents

uAgents is a Python library developed by Fetch.ai that allows for the creation of autonomous AI agents. These agents can perform various tasks on a schedule or take action on various events. uAgents are easy to create and manage, and they are connected to a fast-growing network of other uAgents. They are also secure, with cryptographically secured messages and wallets.

Stars: 1.3k

griptape

Griptape is a modular Python framework for building AI-powered applications that securely connect to your enterprise data and APIs. It offers developers the ability to maintain control and flexibility at every step. Griptape's core components include Structures (Agents, Pipelines, and Workflows), Tasks, Tools, Memory (Conversation Memory, Task Memory, and Meta Memory), Drivers (Prompt and Embedding Drivers, Vector Store Drivers, Image Generation Drivers, Image Query Drivers, SQL Drivers, Web Scraper Drivers, and Conversation Memory Drivers), Engines (Query Engines, Extraction Engines, Summary Engines, Image Generation Engines, and Image Query Engines), and additional components (Rulesets, Loaders, Artifacts, Chunkers, and Tokenizers). Griptape enables developers to create AI-powered applications with ease and efficiency.

Stars: 2.2k