cube-studio
cube studio开源云原生一站式机器学习/深度学习/大模型AI平台,支持sso登录,多租户,大数据平台对接,notebook在线开发,拖拉拽任务流pipeline编排,多机多卡分布式训练,超参搜索,推理服务VGPU,边缘计算,serverless,标注平台,自动化标注,数据集管理,大模型微调,vllm大模型推理,llmops,私有知识库,AI模型应用商店,支持模型一键开发/推理/微调,支持国产cpu/gpu/npu芯片,支持RDMA,支持pytorch/tf/mxnet/deepspeed/paddle/colossalai/horovod/spark/ray/volcano分布式
Stars: 913
Cube Studio is an open-source all-in-one cloud-native machine learning platform that provides various functionalities such as project group management, network configuration, user management, role management, billing functions, SSO single sign-on, support for multiple computing power types, support for multiple resource groups and clusters, edge cluster support, serverless cluster mode support, database storage support, machine resource management, storage disk management, internationalization capabilities, data map management, data calculation, ETL orchestration, data set management, data annotation, image/audio/text dataset support, feature processing, traditional machine learning algorithms, distributed deep learning frameworks, distributed acceleration frameworks, model evaluation, model format conversion, model registration, model deployment, distributed media processing, custom operators, automatic learning, custom training images, automatic parameter tuning, TensorBoard jobs, internal services, model management, inference services, monitoring, model application management, model marketplace, model development, model fine-tuning, web model deployment, automated annotation, dataset SDK, notebook SDK, pipeline training SDK, inference service SDK, large model distributed training, large model inference, large model fine-tuning, intelligent conversation, private knowledge base, model deployment for WeChat public accounts, enterprise WeChat group chatbot integration, DingTalk group chatbot integration, and more. Cube Studio offers template-based functionality for data import/export, data processing, feature processing, machine learning frameworks, machine learning algorithms, deep learning frameworks, model processing, model serving, monitoring, and more.
README:
English | 简体中文
https://github.com/data-infra/cube-studio/wiki
学习、部署、体验、开源建设、商业合作 欢迎来撩。或添加微信luanpeng1234,备注<开源建设>
cube studio是 开源一站式云原生机器学习平台,目前主要包含
模块分组 | 功能模块 | 功能清单 |
---|---|---|
基础能力 | 项目组管理 |
|
基础能力 | 网络 |
|
基础能力 | 用户管理 角色管理/权限管理 |
|
基础能力 | 计量计费功能 | |
基础能力 | SSO单点登录 |
|
基础能力 | 支持多种算力 |
|
基础能力 | 多资源组/多集群 |
|
基础能力 | 边缘集群 | |
基础能力 | serverless集群模式 | |
基础能力 | 数据库存储 |
|
基础能力 | 机器资源管理 | |
基础能力 | 存储盘管理 | |
基础能力 | 国际化能力 | |
数据管理 | 数据地图 | |
数据管理 | 数据计算 | |
数据管理 | ETL编排 | |
数据管理 | 数据集管理 |
|
数据管理 | 数据标注 | |
开发环境 | 镜像功能 |
|
开发环境 | notebook |
|
模型训练 | 拖拉拽任务流编排调试 |
|
模型训练 | 主流功能算子 | 基础算子: |
模型训练 | 算子自定义 | 支持算子自定义,通过web界面操作将自定义算法代码镜像,注册为可被他人复用的pipeline算子 |
模型训练 | 自动学习 | 面向非AI背景的用户提供自动学习服务,用户选择某一个场景之后,上传训练数据即可自动开始训练和模型部署,支持示例automl任务流导入导出 |
模型训练 | 自定义镜像 | 面向高级 AI 开发者,提供自定义训练作业(执行环境 + 代码)功能; |
模型训练 | 自动调参 | 基于单机/分布式自动超参搜索 |
模型训练 | TensorBoard作业 | 实时/离线观察模型训练过程中的参数和指标变化情况 |
模型管理 推理服务 | 内部服务 | 支持开发或运维工具快捷部署,提供mysql-web,postgresql web,mobgo web, redis web,neo4j,rstudio等开源工具 |
模型管理 推理服务 | 模型管理 | 模型管理用于对模型多版本管理,支持模型发布为推理服务 |
模型管理 推理服务 | 推理服务 |
|
监控 | 整体资源 |
|
监控 | 监控体系 |
|
模型应用市场 | 模型应用管理方案 | |
模型应用市场 | 模型应用管理方案 | |
模型应用市场 | 模型应用管理方案 | |
模型应用市场 | 预训练模型 | |
模型应用市场 | 模型市场 | |
模型应用市场 | 模型一键开发 | |
模型应用市场 | 模型一键微调 | |
模型应用市场 | 模型一键部署web | |
模型应用市场 | 模型自动化标注 | |
模型应用市场 | 数据集sdk | |
模型应用市场 | notebook sdk | |
模型应用市场 | pipeline训练sdk | |
模型应用市场 | 推理服务sdk | |
大模型 | 大模型分布式多机多卡 | |
大模型 | 支持大模型推理aihub形式 | |
大模型 | 支持大模型推理openai接口 | |
大模型 | 支持大模型微调 | |
大模型 | 智能对话 | |
大模型 | 私有知识库 | |
大模型 | 私有知识库 | |
大模型 | 私有知识库 | |
大模型 | 私有知识库 | |
大模型 | 私有知识库 | |
大模型 | 私有知识库 |
提示:
- 1、可自由定制任务插件,更适用当前业务需求
模块 | 模板 | 类型 | 文档地址 |
---|---|---|---|
数据导入导出 | datax | 单机 | job-template/job/datax/README.md |
数据导入导出 | 数据集导入 | 单机 | job-template/job/dataset/README.md |
数据导入导出 | 模型导入 | 单机 | job-template/job/model_download/README.md |
数据预处理 | data-process | 单机 | job-template/job/data-process/README.md |
数据处理工具 | hadoop | 单机 | job-template/job/hadoop/README.md |
数据处理工具 | ray | 分布式 | job-template/job/ray/README.md |
数据处理工具 | volcanojob | 分布式 | job-template/job/volcano/README.md |
特征处理 | feature-process | 单机 | job-template/job/feature-process/README.md |
机器学习框架 | ray-sklearn | 分布式 | job-template/job/ray_sklearn/README.md |
机器学习算法 | random_forest | 单机 | job-template/job/random_forest/README.md |
机器学习算法 | lr | 单机 | job-template/job/lr/README.md |
机器学习算法 | lightgbm | 单机 | job-template/job/lightgbm/README.md |
机器学习算法 | knn | 单机 | job-template/job/knn/README.md |
机器学习算法 | kmeans | 单机 | job-template/job/kmeans/README.md |
机器学习算法 | nni | 单机 | job-template/job/hyperparam-search-nni/README.md |
机器学习算法 | xgb | 单机 | job-template/job/xgb/README.md |
机器学习算法 | gbdt | 单机 | job-template/job/gbdt/README.md |
机器学习算法 | decision-tree | 单机 | job-template/job/decision_tree/README.md |
机器学习算法 | bayesian | 单机 | job-template/job/bayesian/README.md |
机器学习算法 | adaboost | 单机 | job-template/job/adaboost/README.md |
深度学习 | tfjob | 分布式 | job-template/job/tf/README.md |
深度学习 | pytorchjob | 分布式 | job-template/job/pytorch/README.md |
深度学习 | paddle | 分布式 | job-template/job/paddle/README.md |
深度学习 | mxnet | 分布式 | job-template/job/mxnet/README.md |
深度学习 | mindspore | 分布式 | job-template/job/mindspore/README.md |
深度学习 | horovod | 分布式 | job-template/job/horovod/README.md |
深度学习 | mpi | 分布式 | job-template/job/mpi/README.md |
深度学习 | colossalai | 分布式 | job-template/job/colossalai/README.md |
深度学习 | deepspeed | 分布式 | job-template/job/deepspeed/README.md |
深度学习 | megatron | 分布式 | job-template/job/megatron/README.md |
模型处理 | model-evaluation | 单机 | job-template/job/model_evaluation/README.md |
模型服务化 | model-convert | 单机 | job-template/job/model_convert/README.md |
模型服务化 | model-register | 单机 | job-template/job/model_register/README.md |
模型服务化 | deploy-service | 单机 | job-template/job/deploy-service/README.md |
模型服务化 | model-offline-predict | 分布式 | job-template/job/model_offline_predict/README.md |
多媒体类 | media-download | 分布式 | job-template/job/video-audio/README.md |
多媒体类 | video-img | 分布式 | job-template/job/video-audio/README.md |
多媒体类 | video-audio | 分布式 | job-template/job/video-audio/README.md |
大模型 | llama2 | 单机多卡 | job-template/job/llama2/README.md |
大模型 | chatglm2 | 单机多卡 | job-template/job/chatglm2/README.md |
大模型 | chatglm3 | 单机多卡 | job-template/job/chatglm3/README.md |
大模型 | chatglm4 | 单机多卡 | job-template/job/chatglm4/README.md |
大模型 | baichuan2 | 单机多卡 | job-template/job/baichuan2/README.md |
大模型 | qwen2 | 单机多卡 | job-template/job/qwen2/README.md |
大模型 | llama-factory | 单机多卡 | job-template/job/llama_factory/README.md |
完整的平台包含
- 1、机器的标准化
- 2、分布式存储(单机可忽略)、k8s集群、监控体系(prometheus/efk/zipkin)
- 3、基础能力(tf/pytorch/mxnet/valcano/ray等分布式,nni/katib超参搜索)
- 4、平台web部分(oa/权限/项目组、在线构建镜像、在线开发、pipeline拖拉拽、超参搜索、推理服务管理等)
算力:
- 云原生统筹平台cpu/gpu等算力
- 支持划分多资源组,支持多k8s集群,多地部署
- 支持T4/V100/A100/昇腾/dcu/VGPU等异构GPU/NPU环境
- 支持边缘集群模式,支持边缘节点上开发/训练/推理
- 支持鲲鹏芯片arm64架构,RDMA
存储:
- 自带分布式存储,支持多机分布式下文件处理
- 支持外部存储挂载,支持项目组挂载绑定
- 支持个人存储空间/组空间等多种形式
- 平台内存储空间不需要迁移
用户权限:
- 支持sso登录,对接公司账号体系
- 支持项目组划分,支持配置相应项目组用户的权限
- 管理平台用户的基本信息,组织架构,rbac权限体系
cube支持多集群调度,可同时管控多个训练或推理集群。在单个集群内,不仅能做到一个项目组内对在线开发、训练、推理的隔离,还可以做到一个k8s集群下多个项目组算力的隔离。另外在不同项目组下的算力间具有动态均衡的能力,能够在多项目间共享公共算力池和私有化算力池,做到成本最低化。
cube会自动为用户挂载用户的个人目录,同一个用户在平台任何地方启动的容器,其用户个人子目录均为/mnt/$username。可以将pvc/hostpath/memory/configmap等挂载成容器目录。同时可以在项目组中配置项目组的默认挂载,进而实现一个项目组共享同一个目录等功能。
- 系统多租户/多实例管理,在线交互开发调试,无需安装三方控件,只需浏览器就能完成开发。
- 支持vscode,jupyter,Matlab,Rstudio等多种在线IDE类型
- Jupyter支持cube-studio sdk,Julia,R,python,pyspark多内核版本,
- 支持c++,java,conda等多种开发语言,以及tensorboard/git/gpu监控等多种插件
- 支持ssh remote与notebook互通,本地进行代码开发
- 在线镜像构建,通过Web Shell方式在浏览器中完成构建;并提供各种版本notebook,inference,gpu,python等基础镜像
- 支持图/文/音/多模态/大模型多种类型标注功能,用户管理,工作任务分发
- 对接aihub模型市场,支持自动化标注;对接数据集,支持标注数据导入;对接pipeline,支持标注结果自动化训练
1、Ml全流程
数据导入,数据预处理,超惨搜索,模型训练,模型评估,模型压缩,模型注册,服务上线,ml算法全流程
2、灵活开放
支持单任务调试、分布式任务日志聚合查看,pipeline调试跟踪,任务运行资源监控,以及定时调度功能(包含补录,忽略,重试,依赖,并发限制,过期淘汰等功能)
1、训练框架支持分布式(协议和策略)
2、代码识别分布式角色(有状态)
3、控制器部署分布式训练集群(operator)
4、配置分布式训练集群的部署(CRD)
以k8s为核心,
1、支持tf分布式训练、pytorch分布式训练、spark分布式数据处理、ray分布式超参搜索、mpi分布式训练、horovod分布式训练、nni分布式超参搜索、mxnet分布式训练、volcano分布式数据处理、kaldi分布式语音训练等,
2、 以及在此衍生出来的分布式的数据下载,hdfs拉取,cos上传下载,视频采帧,音频抽取,分布式的训练,例如推荐场景的din算法,ComiRec算法,MMoE算法,DeepFM算法,youtube dnn算法,ple模型,ESMM模型,双塔模型,音视频的wenet,containAI等算法的分布式训练。
- 和非模板开发相比,使用模板建立应用成本会更低一些,无需开发平台。
- 迁移更加容易,通过模板标准化后,后续应用迁移迭代只需迁移配置模板,简化复杂的配置操作。
- 配置复用,通过简单的配置就可以复用这些能力,算法与工程分离避免重复开发。
为了避免重复开发,对pipeline中的task功能进行模板化开发。平台开发者或用户可自行开发模板镜像,将镜像注册到平台,这样其他用户就可以复用这些功能。平台自带模板在job-template目录下
- Pipeline调试支持定时执行,支持,补录,并发限制,超时,实例依赖等。
- Pipeling运行,支持变量在任务间输入输出,全局变量,流向控制,模板变量,数据时间等
- Pipeling运行,支持任务结果可视化,图片、csv/json,echart源码可视化
界面化呈现训练各组数据,通过图形界面进行直观呈现。 减少以往开发调参过程的枯燥感,让整个调参过程更加生动具有趣味性,完全无需丰富经验就能实现更精准的参数控制调节。
# 上报当前迭代目标值
nni.report_intermediate_result(test_acc)
# 上报最终目标值
nni.report_final_result(test_acc)
# 接收超参数为输入参数
parser.add_argument('--batch_size', type=int)
0代码发布推理服务从底层到上层,包含服务网格,serverless,pipeline,http框架,模型计算。
-
服务网格阶段:主要工作是代理流量的中转和管控,例如分流,镜像,限流,黑白名单之类的。
-
serverless阶段:主要为服务的智能化运维,例如服务的激活,伸缩容,版本管理,蓝绿发布。
-
pipeline阶段:主要为请求在各数据处理/推理之间的流动。推理的前后置处理逻辑等。
-
http/grpc框架:主要为处理客户端的请求,准备推理样本,推理后作出响应。
-
模型计算:模型在cpu/gpu上对输入样本做前向计算。
主要功能:
- 支持模型管理注册,灰度发布,版本回退,模型指标可视化,以及在piepline中进行模型注册
- 推理服务支持多集群,多资源组,异构gpu环境,平台资源统筹监控,VGPU,服务流量分流,复制,sidecar
- 支持0代码的模型发布,gpu推理加速,支持训练推理混部,服务优先级,自定义指标弹性伸缩。
监控:cube-studio集成prometheus生态,可以监控包括主机,进程,服务流量,gpu等相关负载,并配套grafana进行可视化
推送:cube-studio开放推送接口,可自定义推送给企业oa系统
- 系统自带通用模型数量400+,覆盖绝大数行业场景,根据需求可以不断扩充。
- 模型开源、按需定制,方便快速集成,满足用户业务增长及二次开发升级。
- 模型标准化开发管理,大幅降低使用门槛,开发周期时长平均下降30%以上。
- AIHub模型可一键部署为WEB端应用,手机端/PC端皆可,实时查看模型应用效果
- 点击模型开发即可进入notebook进行模型代码的二次开发,实现一键开发
- 点击训练即可加入自己的数据进行一键微调,使模型更贴合自身场景
- cube-studio支持deepspeed/colossalai等分布式加速框架,可一键实现大模型多机多卡分布式训练
- AIHub包含gpt/AIGC大模型,可一键转为微调pipeline,修改为自己的数据后,便可以微调并部署
rdma插件部署后,k8s机器可用资源
capacity:
cpu: '128'
memory: 1056469320Ki
nvidia.com/gpu: '8'
rdma/hca: '500'
代码分布式训练中使用IB设备
export NCCL_IB_HCA=mlx5
export MLP_WORKER_GPU=$GPU_NUM
export MLP_WORKER_NUM=$WORLD_SIZE
export MLP_ROLE_INDEX=$RANK
export MLP_WORKER_0_HOST=$MASTER_ADDR
export MLP_WORKER_0_PORT=$MASTER_PORT
- 数据智能模块可配置专业领域智能对话,快速敏捷使用llm
- 可为某个聊天场景配置私有知识库文件,支持主题分割,语义embedding,意图识别,概要提取,多路召回,排序,多种功能融合
- 可以将智能会话与AIHub相结合,例如下面AIGC模型与聊天会话
- 可使用Autogpt方式串联所有aihub模型,进行图文音智能化处理
- 智能会话与公共直接打通,可在微信公众号中进行图文音对话
为了加速AI算法平台的使用,cube-studio支持对接公司原有数据中台,包括数据计算引擎sqllab,元数据管理,指标管理,维表管理,数据ETL,数据集管理
tmeps支持tf框架实时训练,秒级上线,能应对embedding稀疏大模型推荐场景
针对企业需求,根据不同场景对计算实时性的不同需求,可以提供三种建设模式
模式一:私有化部署——对数据安全要求高、预算充足、自己有开发能力
模式二:边缘集群部署——算力分散,多个子网环境的场景,或边缘设备场景
模式三:serverless集群——成本有限,按需申请算力的场景
通过边缘集群的形式,在中心节点部署平台,并将边缘节点加入调度,每个私有网用户,通过项目组,将notebook,pipeline,service部署在边缘节点
- 1、避免数据到中心节点的带宽传输
- 2、避免中心节点的算力成本,充分利用边缘节点算力
- 3、避免边缘节点的运维成本
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for cube-studio
Similar Open Source Tools
cube-studio
Cube Studio is an open-source all-in-one cloud-native machine learning platform that provides various functionalities such as project group management, network configuration, user management, role management, billing functions, SSO single sign-on, support for multiple computing power types, support for multiple resource groups and clusters, edge cluster support, serverless cluster mode support, database storage support, machine resource management, storage disk management, internationalization capabilities, data map management, data calculation, ETL orchestration, data set management, data annotation, image/audio/text dataset support, feature processing, traditional machine learning algorithms, distributed deep learning frameworks, distributed acceleration frameworks, model evaluation, model format conversion, model registration, model deployment, distributed media processing, custom operators, automatic learning, custom training images, automatic parameter tuning, TensorBoard jobs, internal services, model management, inference services, monitoring, model application management, model marketplace, model development, model fine-tuning, web model deployment, automated annotation, dataset SDK, notebook SDK, pipeline training SDK, inference service SDK, large model distributed training, large model inference, large model fine-tuning, intelligent conversation, private knowledge base, model deployment for WeChat public accounts, enterprise WeChat group chatbot integration, DingTalk group chatbot integration, and more. Cube Studio offers template-based functionality for data import/export, data processing, feature processing, machine learning frameworks, machine learning algorithms, deep learning frameworks, model processing, model serving, monitoring, and more.
haystack-core-integrations
This repository contains integrations to extend the capabilities of Haystack version 2.0 and onwards. The code in this repo is maintained by deepset, see each integration's `README` file for details around installation, usage and support.
Nocode-Wep
Nocode/WEP is a forward-looking office visualization platform that includes modules for document building, web application creation, presentation design, and AI capabilities for office scenarios. It supports features such as configuring bullet comments, global article comments, multimedia content, custom drawing boards, flowchart editor, form designer, keyword annotations, article statistics, custom appreciation settings, JSON import/export, content block copying, and unlimited hierarchical directories. The platform is compatible with major browsers and aims to deliver content value, iterate products, share technology, and promote open-source collaboration.
awesome-langchain-zh
The awesome-langchain-zh repository is a collection of resources related to LangChain, a framework for building AI applications using large language models (LLMs). The repository includes sections on the LangChain framework itself, other language ports of LangChain, tools for low-code development, services, agents, templates, platforms, open-source projects related to knowledge management and chatbots, as well as learning resources such as notebooks, videos, and articles. It also covers other LLM frameworks and provides additional resources for exploring and working with LLMs. The repository serves as a comprehensive guide for developers and AI enthusiasts interested in leveraging LangChain and LLMs for various applications.
widgets
Widgets is a desktop component front-end open source component. The project is still being continuously improved. The desktop component client can be downloaded and run in two ways: 1. https://www.microsoft.com/store/productId/9NPR50GQ7T53 2. https://widgetjs.cn After cloning the code, you need to download the dependency in the project directory: `shell pnpm install` and run: `shell pnpm serve`
WristAssist
WristAssist is the first app for all WearOS watches that fully brings the classic ChatGPT and DALL-E features to your wrist. It allows users to chat, save chats, edit chats, view galleries, create images, and share images directly from their wrist. The app provides a seamless user experience with easy installation from Google Play and free lifetime updates. Users can refer to the Wiki page for detailed setup instructions. WristAssist is licensed under the terms of the Apache 2.0 license.
Qbot
Qbot is an AI-oriented automated quantitative investment platform that supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and reinforcement learning. It provides a full closed-loop process from data acquisition, strategy development, backtesting, simulation trading to live trading. The platform emphasizes AI strategies such as machine learning, reinforcement learning, and deep learning, combined with multi-factor models to enhance returns. Users with some Python knowledge and trading experience can easily utilize the platform to address trading pain points and gaps in the market.
awesome-ai-painting
This repository, named 'awesome-ai-painting', is a comprehensive collection of resources related to AI painting. It is curated by a user named 秋风, who is an AI painting enthusiast with a background in the AIGC industry. The repository aims to help more people learn AI painting and also documents the user's goal of creating 100 AI products, with current progress at 4/100. The repository includes information on various AI painting products, tutorials, tools, and models, providing a valuable resource for individuals interested in AI painting and related technologies.
agentica
Agentica is a human-centric framework for building large language model agents. It provides functionalities for planning, memory management, tool usage, and supports features like reflection, planning and execution, RAG, multi-agent, multi-role, and workflow. The tool allows users to quickly code and orchestrate agents, customize prompts, and make API calls to various services. It supports API calls to OpenAI, Azure, Deepseek, Moonshot, Claude, Ollama, and Together. Agentica aims to simplify the process of building AI agents by providing a user-friendly interface and a range of functionalities for agent development.
swift
SWIFT (Scalable lightWeight Infrastructure for Fine-Tuning) supports training, inference, evaluation and deployment of nearly **200 LLMs and MLLMs** (multimodal large models). Developers can directly apply our framework to their own research and production environments to realize the complete workflow from model training and evaluation to application. In addition to supporting the lightweight training solutions provided by [PEFT](https://github.com/huggingface/peft), we also provide a complete **Adapters library** to support the latest training techniques such as NEFTune, LoRA+, LLaMA-PRO, etc. This adapter library can be used directly in your own custom workflow without our training scripts. To facilitate use by users unfamiliar with deep learning, we provide a Gradio web-ui for controlling training and inference, as well as accompanying deep learning courses and best practices for beginners. Additionally, we are expanding capabilities for other modalities. Currently, we support full-parameter training and LoRA training for AnimateDiff.
vectordb-recipes
This repository contains examples, applications, starter code, & tutorials to help you kickstart your GenAI projects. * These are built using LanceDB, a free, open-source, serverless vectorDB that **requires no setup**. * It **integrates into python data ecosystem** so you can simply start using these in your existing data pipelines in pandas, arrow, pydantic etc. * LanceDB has **native Typescript SDK** using which you can **run vector search** in serverless functions! This repository is divided into 3 sections: - Examples - Get right into the code with minimal introduction, aimed at getting you from an idea to PoC within minutes! - Applications - Ready to use Python and web apps using applied LLMs, VectorDB and GenAI tools - Tutorials - A curated list of tutorials, blogs, Colabs and courses to get you started with GenAI in greater depth.
ASTRA.ai
Astra.ai is a multimodal agent powered by TEN, showcasing its capabilities in speech, vision, and reasoning through RAG from local documentation. It provides a platform for developing AI agents with features like RTC transportation, extension store, workflow builder, and local deployment. Users can build and test agents locally using Docker and Node.js, with prerequisites including Agora App ID, Azure's speech-to-text and text-to-speech API keys, and OpenAI API key. The platform offers advanced customization options through config files and API keys setup, enabling users to create and deploy their AI agents for various tasks.
Firefly
Firefly is an open-source large model training project that supports pre-training, fine-tuning, and DPO of mainstream large models. It includes models like Llama3, Gemma, Qwen1.5, MiniCPM, Llama, InternLM, Baichuan, ChatGLM, Yi, Deepseek, Qwen, Orion, Ziya, Xverse, Mistral, Mixtral-8x7B, Zephyr, Vicuna, Bloom, etc. The project supports full-parameter training, LoRA, QLoRA efficient training, and various tasks such as pre-training, SFT, and DPO. Suitable for users with limited training resources, QLoRA is recommended for fine-tuning instructions. The project has achieved good results on the Open LLM Leaderboard with QLoRA training process validation. The latest version has significant updates and adaptations for different chat model templates.
For similar tasks
ai-on-gke
This repository contains assets related to AI/ML workloads on Google Kubernetes Engine (GKE). Run optimized AI/ML workloads with Google Kubernetes Engine (GKE) platform orchestration capabilities. A robust AI/ML platform considers the following layers: Infrastructure orchestration that support GPUs and TPUs for training and serving workloads at scale Flexible integration with distributed computing and data processing frameworks Support for multiple teams on the same infrastructure to maximize utilization of resources
ray
Ray is a unified framework for scaling AI and Python applications. It consists of a core distributed runtime and a set of AI libraries for simplifying ML compute, including Data, Train, Tune, RLlib, and Serve. Ray runs on any machine, cluster, cloud provider, and Kubernetes, and features a growing ecosystem of community integrations. With Ray, you can seamlessly scale the same code from a laptop to a cluster, making it easy to meet the compute-intensive demands of modern ML workloads.
labelbox-python
Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.
djl
Deep Java Library (DJL) is an open-source, high-level, engine-agnostic Java framework for deep learning. It is designed to be easy to get started with and simple to use for Java developers. DJL provides a native Java development experience and allows users to integrate machine learning and deep learning models with their Java applications. The framework is deep learning engine agnostic, enabling users to switch engines at any point for optimal performance. DJL's ergonomic API interface guides users with best practices to accomplish deep learning tasks, such as running inference and training neural networks.
mlflow
MLflow is a platform to streamline machine learning development, including tracking experiments, packaging code into reproducible runs, and sharing and deploying models. MLflow offers a set of lightweight APIs that can be used with any existing machine learning application or library (TensorFlow, PyTorch, XGBoost, etc), wherever you currently run ML code (e.g. in notebooks, standalone applications or the cloud). MLflow's current components are:
* `MLflow Tracking
tt-metal
TT-NN is a python & C++ Neural Network OP library. It provides a low-level programming model, TT-Metalium, enabling kernel development for Tenstorrent hardware.
burn
Burn is a new comprehensive dynamic Deep Learning Framework built using Rust with extreme flexibility, compute efficiency and portability as its primary goals.
awsome-distributed-training
This repository contains reference architectures and test cases for distributed model training with Amazon SageMaker Hyperpod, AWS ParallelCluster, AWS Batch, and Amazon EKS. The test cases cover different types and sizes of models as well as different frameworks and parallel optimizations (Pytorch DDP/FSDP, MegatronLM, NemoMegatron...).
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.