data-prep-kit
Open source project for data preparation of LLM application builders
Stars: 215
Data Prep Kit is a community project aimed at democratizing and speeding up unstructured data preparation for LLM app developers. It provides high-level APIs and modules for transforming data (code, language, speech, visual) to optimize LLM performance across different use cases. The toolkit supports Python, Ray, Spark, and Kubeflow Pipelines runtimes, offering scalability from laptop to datacenter-scale processing. Developers can contribute new custom modules and leverage the data processing library for building data pipelines. Automation features include workflow automation with Kubeflow Pipelines for transform execution.
README:
Data Prep Kit is a community project to democratize and accelerate unstructured data preparation for LLM app developers. With the explosive growth of LLM-enabled use cases, developers are faced with the enormous challenge of preparing use case-specific unstructured data to fine-tune, instruct-tune the LLMs or to build RAG applications for LLMs. As the variety of use cases grow, so does the need to support:
- New ways of transforming the data to enhance the performance of the resulting LLMs for each specific use case.
- A large variety in the scale of data to be processed, from laptop-scale to datacenter-scale
- Support for different data modalities including language, code, vision, multimodal etc
Data Prep Kit offers implementations of commonly needed data preparation steps, called modules or transforms, for both Code and Language modalities, with vision to extend to images, speech and multimodal data. The goal is to offer high-level APIs for developers to quickly get started in working with their data, without needing expertise in the underlying runtimes and frameworks.
- About
- Getting Started
- Scaling transforms from laptop to cluster
- Repository Use and Navigation
- How to Contribute
- Talks and Papers
- Citations
Data Prep Kit is a toolkit for streamlining data preparation for developers looking to build LLM-enabled applications via fine-tuning, RAG or instruction-tuning. Data Prep Kit contributes a set of modules that the developer can get started with to easily build data pipelines suitable for their use case. These modules have been tested while producing pre-training datasets for the Granite open source LLM models.
The modules are built on common frameworks (for Spark and Ray), called the data processing library that allows the developers to build new custom modules that readily scale across a variety of runtimes.
Features of the toolkit:
- It aims to accelerate unstructured data prep for the "long tail" of LLM use cases.
- It offers a growing set of module implementations across multiple runtimes, targeting laptop-scale to datacenter-scale processing.
- It provides a growing set of sample data processing pipelines that can be used for real enterprise use cases.
- It provides the Data processing library to enable contribution of new custom modules targeting new use cases.
- It uses Kubeflow Pipelines-based workflow automation.
Data modalities supported today: Code and Natural Language.
With no setup necessary, let's use a Google Colab friendly notebook to try Data Prep Kit. This is a simple transform to extract content from PDF files: examples/notebooks/Run_your_first_transform_colab.ipynb | . (Here are some tips for running Data Prep Kit transforms on Google Colab. For this simple example, these tips are either already taken care of, or are not needed.)
To run on a local machine, follow these steps to quickly set up and deploy the Data Prep Kit in your virtual Python environment.
conda create -n data-prep-kit -y python=3.11
conda activate data-prep-kit
python --version
Check if the python version is 3.11.
If you are using a linux system, install gcc using the below commands:
conda install gcc_linux-64
conda install gxx_linux-64
Next, install the data prep toolkit library. This library installs both the python and ray versions of the transforms. For better management of dependencies, it is recommended to install the same tagged version of both the library and the transform.
pip3 install 'data-prep-toolkit[ray]==0.2.2.dev1'
pip3 install 'data-prep-toolkit-transforms[ray,all]==0.2.2.dev1'
pip3 install jupyterlab ipykernel ipywidgets
## install custom kernel
python -m ipykernel install --user --name=data-prep-kit --display-name "dataprepkit"
Test, your installation. If you are able to import these data-prep-kit libraries successfully in python, your installation has succeeded.
## start python interpreter
$ python
# import DPK libraries
>>> from data_processing_ray.runtime.ray import RayTransformLauncher
>>> from data_processing.runtime.pure_python import PythonTransformLauncher
If there are no errors, you are good to go!
Let's try the same simple transform to extract content from PDF files on a local machine.
Local Notebook versions
You can try either one or both of the following two versions:
- Option 1: Pure python notebook: examples/notebooks/Run_your_first_transform_python.ipynb - easiest to get started
- Option 2: Ray version: This one uses Ray framework for parallel execution while still allowing local processing - examples/notebooks/Run_your_first_transform_ray.ipynb
To run the notebooks, launch jupyter from the same virtual environment you created using the command below.
jupyter lab
After opening the jupyter notebook, change the kernel to dataprepkit
, so all libraries will be properly loaded.
Explore more examples here.
Now that you have run a single transform, the next step is to explore how to put these transforms together to run a data prep pipeline for an end to end use case like fine tuning model or building a RAG application. This notebook gives an example of how to build an end to end data prep pipeline for fine tuning for code LLMs. You can also explore how to build a RAG pipeline here.
The matrix below shows the the combination of modules and supported runtimes. All the modules can be accessed here and can be combined to form data processing pipelines, as shown in the examples folder.
Modules | Python-only | Ray | Spark | KFP on Ray |
---|---|---|---|---|
Data Ingestion | ||||
Code (from zip) to Parquet | ✅ | ✅ | ✅ | |
PDF to Parquet | ✅ | ✅ | ✅ | |
HTML to Parquet | ✅ | ✅ | ||
Universal (Code & Language) | ||||
Exact dedup filter | ✅ | ✅ | ✅ | |
Fuzzy dedup filter | ✅ | ✅ | ||
Unique ID annotation | ✅ | ✅ | ✅ | ✅ |
Filter on annotations | ✅ | ✅ | ✅ | ✅ |
Profiler | ✅ | ✅ | ✅ | ✅ |
Resize | ✅ | ✅ | ✅ | ✅ |
HAP | ✅ | ✅ | ✅ | |
Tokenizer | ✅ | ✅ | ✅ | |
Language-only | ||||
Language identification | ✅ | ✅ | ✅ | |
Document quality | ✅ | ✅ | ✅ | |
Document chunking for RAG | ✅ | ✅ | ✅ | |
Text encoder | ✅ | ✅ | ✅ | |
PII Annotator/Redactor | ✅ | ✅ | ✅ | |
Code-only | ||||
Programming language annnotation | ✅ | ✅ | ✅ | |
Code quality annotation | ✅ | ✅ | ✅ | |
Malware annotation | ✅ | ✅ | ✅ | |
Header cleanser | ✅ | ✅ | ✅ | |
Semantic file ordering | ✅ | |||
License Select Annotation | ✅ | ✅ | ✅ |
Contributors are welcome to add new modules to expand to other data modalities as well as add runtime support for existing modules!
At the core of the framework, is a data processing library, that provides a systematic way to implement the data processing modules. The library is python-based and enables the application of "transforms" to a one or more input data files to produce one or more output data files. We use the popular parquet format to store the data (code or language). Every parquet file follows a set schema. A user can use one or more transforms (or modules) as discussed above to process their data. A transform can follow one of the two patterns: annotator or filter.
-
Annotator An annotator transform adds information during the processing by adding one more columns to the parquet files. The annotator design also allows a user to verify the results of the processing before the actual filtering of the data.
-
Filter A filter transform processes the data and outputs the transformed data, e.g., exact deduplication. A general purpose SQL-based filter transform enables a powerful mechanism for identifying columns and rows of interest for downstream processing.
For a new module to be added, a user can pick the right design based on the processing to be applied. More details here.
One can leverage Python-based processing logic and the Data Processing Library to easily build and contribute new transforms. We have provided an example transform that can serve as a template to add new simple transforms. Follow the step by step tutorial to help you add your own new transform.
For a deeper understanding of the library's architecture, its transforms, and available runtimes, we encourage the reader to consult the comprehensive overview document alongside dedicated sections on transforms and runtimes.
Additionally, check out our video tutorial for a visual, example-driven guide on adding custom modules.
Data-prep-kit provides the flexibility to transition your projects from proof-of-concept (PoC) stage to full-scale production mode, offering all the necessary tools to run your data transformations at high volume. In this section, we enable you how to run your transforms at scale and how to automate them.
To enable processing of large data volumes leveraging multi-mode clusters, Ray or Spark wrappers are provided, to readily scale out the Python implementations.
A generalized workflow is shown here.
The toolkit also supports transform execution automation based on Kubeflow pipelines (KFP), tested on a locally deployed Kind cluster and external OpenShift clusters. There is an automation to create a Kind cluster and deploy all required components on it. The KFP implementation is based on the KubeRay Operator for creating and managing the Ray cluster and KubeRay API server to interact with the KubeRay operator. An additional framework along with several kfp components is used to simplify the pipeline implementation.
A simple transform pipeline tutorial explains the pipeline creation and execution. In addition, if you want to combine several transformers in a single pipeline, you can look at multi-steps pipeline
When you finish working with the cluster, and want to clean up or destroy it. See the clean up the cluster
You can run transforms via docker image or using virtual environments. This document shows how to run a transform using virtual environment. You can follow this document to run using docker image.
- Data-Prep-Kit: getting your data ready for LLM application development
- Granite Code Models: A Family of Open Foundation Models for Code Intelligence
- Scaling Granite Code Models to 128K Context
- Talk on "Building Successful LLM Apps: The Power of high quality data" Video Slides
- Talk on "Hands on session for fine tuning LLMs" Video
- Talk on "Build your own data preparation module using data-prep-kit" Video
If you use Data Prep Kit in your research, please cite our paper:
@misc{wood2024dataprepkitgettingdataready,
title={Data-Prep-Kit: getting your data ready for LLM application development},
author={David Wood and Boris Lublinsky and Alexy Roytman and Shivdeep Singh
and Abdulhamid Adebayo and Revital Eres and Mohammad Nassar and Hima Patel
and Yousaf Shah and Constantin Adam and Petros Zerfos and Nirmit Desai
and Daiki Tsuzuku and Takuya Goto and Michele Dolfi and Saptha Surendran
and Paramesvaran Selvam and Sungeun An and Yuan Chi Chang and Dhiraj Joshi
and Hajar Emami-Gohari and Xuan-Hong Dang and Yan Koyfman and Shahrokh Daijavad},
year={2024},
eprint={2409.18164},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2409.18164},
}
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for data-prep-kit
Similar Open Source Tools
data-prep-kit
Data Prep Kit is a community project aimed at democratizing and speeding up unstructured data preparation for LLM app developers. It provides high-level APIs and modules for transforming data (code, language, speech, visual) to optimize LLM performance across different use cases. The toolkit supports Python, Ray, Spark, and Kubeflow Pipelines runtimes, offering scalability from laptop to datacenter-scale processing. Developers can contribute new custom modules and leverage the data processing library for building data pipelines. Automation features include workflow automation with Kubeflow Pipelines for transform execution.
langkit
LangKit is an open-source text metrics toolkit for monitoring language models. It offers methods for extracting signals from input/output text, compatible with whylogs. Features include text quality, relevance, security, sentiment, toxicity analysis. Installation via PyPI. Modules contain UDFs for whylogs. Benchmarks show throughput on AWS instances. FAQs available.
llama-recipes
The llama-recipes repository provides a scalable library for fine-tuning Llama 2, along with example scripts and notebooks to quickly get started with using the Llama 2 models in a variety of use-cases, including fine-tuning for domain adaptation and building LLM-based applications with Llama 2 and other tools in the LLM ecosystem. The examples here showcase how to run Llama 2 locally, in the cloud, and on-prem.
generative-ai-cdk-constructs
The AWS Generative AI Constructs Library is an open-source extension of the AWS Cloud Development Kit (AWS CDK) that provides multi-service, well-architected patterns for quickly defining solutions in code to create predictable and repeatable infrastructure, called constructs. The goal of AWS Generative AI CDK Constructs is to help developers build generative AI solutions using pattern-based definitions for their architecture. The patterns defined in AWS Generative AI CDK Constructs are high level, multi-service abstractions of AWS CDK constructs that have default configurations based on well-architected best practices. The library is organized into logical modules using object-oriented techniques to create each architectural pattern model.
leapfrogai
LeapfrogAI is a self-hosted AI platform designed to be deployed in air-gapped resource-constrained environments. It brings sophisticated AI solutions to these environments by hosting all the necessary components of an AI stack, including vector databases, model backends, API, and UI. LeapfrogAI's API closely matches that of OpenAI, allowing tools built for OpenAI/ChatGPT to function seamlessly with a LeapfrogAI backend. It provides several backends for various use cases, including llama-cpp-python, whisper, text-embeddings, and vllm. LeapfrogAI leverages Chainguard's apko to harden base python images, ensuring the latest supported Python versions are used by the other components of the stack. The LeapfrogAI SDK provides a standard set of protobuffs and python utilities for implementing backends and gRPC. LeapfrogAI offers UI options for common use-cases like chat, summarization, and transcription. It can be deployed and run locally via UDS and Kubernetes, built out using Zarf packages. LeapfrogAI is supported by a community of users and contributors, including Defense Unicorns, Beast Code, Chainguard, Exovera, Hypergiant, Pulze, SOSi, United States Navy, United States Air Force, and United States Space Force.
taranis-ai
Taranis AI is an advanced Open-Source Intelligence (OSINT) tool that leverages Artificial Intelligence to revolutionize information gathering and situational analysis. It navigates through diverse data sources like websites to collect unstructured news articles, utilizing Natural Language Processing and Artificial Intelligence to enhance content quality. Analysts then refine these AI-augmented articles into structured reports that serve as the foundation for deliverables such as PDF files, which are ultimately published.
torchtune
Torchtune is a PyTorch-native library for easily authoring, fine-tuning, and experimenting with LLMs. It provides native-PyTorch implementations of popular LLMs using composable and modular building blocks, easy-to-use and hackable training recipes for popular fine-tuning techniques, YAML configs for easily configuring training, evaluation, quantization, or inference recipes, and built-in support for many popular dataset formats and prompt templates to help you quickly get started with training.
GenAIExamples
This project provides a collective list of Generative AI (GenAI) and Retrieval-Augmented Generation (RAG) examples such as chatbot with question and answering (ChatQnA), code generation (CodeGen), document summary (DocSum), etc.
synmetrix
Synmetrix is an open source data engineering platform and semantic layer for centralized metrics management. It provides a complete framework for modeling, integrating, transforming, aggregating, and distributing metrics data at scale. Key features include data modeling and transformations, semantic layer for unified data model, scheduled reports and alerts, versioning, role-based access control, data exploration, caching, and collaboration on metrics modeling. Synmetrix leverages Cube.js to consolidate metrics from various sources and distribute them downstream via a SQL API. Use cases include data democratization, business intelligence and reporting, embedded analytics, and enhancing accuracy in data handling and queries. The tool speeds up data-driven workflows from metrics definition to consumption by combining data engineering best practices with self-service analytics capabilities.
mlcraft
Synmetrix (prev. MLCraft) is an open source data engineering platform and semantic layer for centralized metrics management. It provides a complete framework for modeling, integrating, transforming, aggregating, and distributing metrics data at scale. Key features include data modeling and transformations, semantic layer for unified data model, scheduled reports and alerts, versioning, role-based access control, data exploration, caching, and collaboration on metrics modeling. Synmetrix leverages Cube (Cube.js) for flexible data models that consolidate metrics from various sources, enabling downstream distribution via a SQL API for integration into BI tools, reporting, dashboards, and data science. Use cases include data democratization, business intelligence, embedded analytics, and enhancing accuracy in data handling and queries. The tool speeds up data-driven workflows from metrics definition to consumption by combining data engineering best practices with self-service analytics capabilities.
redis-ai-resources
A curated repository of code recipes, demos, and resources for basic and advanced Redis use cases in the AI ecosystem. It includes demos for ArxivChatGuru, Redis VSS, Vertex AI & Redis, Agentic RAG, ArXiv Search, and Product Search. Recipes cover topics like Getting started with RAG, Semantic Cache, Advanced RAG, and Recommendation systems. The repository also provides integrations/tools like RedisVL, AWS Bedrock, LangChain Python, LangChain JS, LlamaIndex, Semantic Kernel, RelevanceAI, and DocArray. Additional content includes blog posts, talks, reviews, and documentation related to Vector Similarity Search, AI-Powered Document Search, Vector Databases, Real-Time Product Recommendations, and more. Benchmarks compare Redis against other Vector Databases and ANN benchmarks. Documentation includes QuickStart guides, official literature for Vector Similarity Search, Redis-py client library docs, Redis Stack documentation, and Redis client list.
ByteMLPerf
ByteMLPerf is an AI Accelerator Benchmark that focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and versatility of software and hardware. Byte MLPerf has the following characteristics: - Models and runtime environments are more closely aligned with practical business use cases. - For ASIC hardware evaluation, besides evaluate performance and accuracy, it also measure metrics like compiler usability and coverage. - Performance and accuracy results obtained from testing on the open Model Zoo serve as reference metrics for evaluating ASIC hardware integration.
qlib
Qlib is an open-source, AI-oriented quantitative investment platform that supports diverse machine learning modeling paradigms, including supervised learning, market dynamics modeling, and reinforcement learning. It covers the entire chain of quantitative investment, from alpha seeking to order execution. The platform empowers researchers to explore ideas and implement productions using AI technologies in quantitative investment. Qlib collaboratively solves key challenges in quantitative investment by releasing state-of-the-art research works in various paradigms. It provides a full ML pipeline for data processing, model training, and back-testing, enabling users to perform tasks such as forecasting market patterns, adapting to market dynamics, and modeling continuous investment decisions.
hass-ollama-conversation
The Ollama Conversation integration adds a conversation agent powered by Ollama in Home Assistant. This agent can be used in automations to query information provided by Home Assistant about your house, including areas, devices, and their states. Users can install the integration via HACS and configure settings such as API timeout, model selection, context size, maximum tokens, and other parameters to fine-tune the responses generated by the AI language model. Contributions to the project are welcome, and discussions can be held on the Home Assistant Community platform.
open-assistant-api
Open Assistant API is an open-source, self-hosted AI intelligent assistant API compatible with the official OpenAI interface. It supports integration with more commercial and private models, R2R RAG engine, internet search, custom functions, built-in tools, code interpreter, multimodal support, LLM support, and message streaming output. Users can deploy the service locally and expand existing features. The API provides user isolation based on tokens for SaaS deployment requirements and allows integration of various tools to enhance its capability to connect with the external world.
yuna-ai
Yuna AI is a unique AI companion designed to form a genuine connection with users. It runs exclusively on the local machine, ensuring privacy and security. The project offers features like text generation, language translation, creative content writing, roleplaying, and informal question answering. The repository provides comprehensive setup and usage guides for Yuna AI, along with additional resources and tools to enhance the user experience.
For similar tasks
data-prep-kit
Data Prep Kit is a community project aimed at democratizing and speeding up unstructured data preparation for LLM app developers. It provides high-level APIs and modules for transforming data (code, language, speech, visual) to optimize LLM performance across different use cases. The toolkit supports Python, Ray, Spark, and Kubeflow Pipelines runtimes, offering scalability from laptop to datacenter-scale processing. Developers can contribute new custom modules and leverage the data processing library for building data pipelines. Automation features include workflow automation with Kubeflow Pipelines for transform execution.
instill-core
Instill Core is an open-source orchestrator comprising a collection of source-available projects designed to streamline every aspect of building versatile AI features with unstructured data. It includes Instill VDP (Versatile Data Pipeline) for unstructured data, AI, and pipeline orchestration, Instill Model for scalable MLOps and LLMOps for open-source or custom AI models, and Instill Artifact for unified unstructured data management. Instill Core can be used for tasks such as building, testing, and sharing pipelines, importing, serving, fine-tuning, and monitoring ML models, and transforming documents, images, audio, and video into a unified AI-ready format.
fastRAG
fastRAG is a research framework designed to build and explore efficient retrieval-augmented generative models. It incorporates state-of-the-art Large Language Models (LLMs) and Information Retrieval to empower researchers and developers with a comprehensive tool-set for advancing retrieval augmented generation. The framework is optimized for Intel hardware, customizable, and includes key features such as optimized RAG pipelines, efficient components, and RAG-efficient components like ColBERT and Fusion-in-Decoder (FiD). fastRAG supports various unique components and backends for running LLMs, making it a versatile tool for research and development in the field of retrieval-augmented generation.
ai-on-openshift
AI on OpenShift is a site providing installation recipes, patterns, and demos for AI/ML tools and applications used in Data Science and Data Engineering projects running on OpenShift. It serves as a comprehensive resource for developers looking to deploy AI solutions on the OpenShift platform.
sematic
Sematic is an open-source ML development platform that allows ML Engineers and Data Scientists to write complex end-to-end pipelines with Python. It can be executed locally, on a cloud VM, or on a Kubernetes cluster. Sematic enables chaining data processing jobs with model training into reproducible pipelines that can be monitored and visualized in a web dashboard. It offers features like easy onboarding, local-to-cloud parity, end-to-end traceability, access to heterogeneous compute resources, and reproducibility.
SuperKnowa
SuperKnowa is a fast framework to build Enterprise RAG (Retriever Augmented Generation) Pipelines at Scale, powered by watsonx. It accelerates Enterprise Generative AI applications to get prod-ready solutions quickly on private data. The framework provides pluggable components for tackling various Generative AI use cases using Large Language Models (LLMs), allowing users to assemble building blocks to address challenges in AI-driven text generation. SuperKnowa is battle-tested from 1M to 200M private knowledge base & scaled to billions of retriever tokens.
ZetaForge
ZetaForge is an open-source AI platform designed for rapid development of advanced AI and AGI pipelines. It allows users to assemble reusable, customizable, and containerized Blocks into highly visual AI Pipelines, enabling rapid experimentation and collaboration. With ZetaForge, users can work with AI technologies in any programming language, easily modify and update AI pipelines, dive into the code whenever needed, utilize community-driven blocks and pipelines, and share their own creations. The platform aims to accelerate the development and deployment of advanced AI solutions through its user-friendly interface and community support.
AdalFlow
AdalFlow is a library designed to help developers build and optimize Large Language Model (LLM) task pipelines. It follows a design pattern similar to PyTorch, offering a light, modular, and robust codebase. Named in honor of Ada Lovelace, AdalFlow aims to inspire more women to enter the AI field. The library is tailored for various GenAI applications like chatbots, translation, summarization, code generation, and autonomous agents, as well as classical NLP tasks such as text classification and named entity recognition. AdalFlow emphasizes modularity, robustness, and readability to support users in customizing and iterating code for their specific use cases.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.