curator
Synthetic data curation for post-training and structured data extraction
Stars: 79
Bespoke Curator is an open-source tool for data curation and structured data extraction. It provides a Python library for generating synthetic data at scale, with features like programmability, performance optimization, caching, and integration with HuggingFace Datasets. The tool includes a Curator Viewer for dataset visualization and offers a rich set of functionalities for creating and refining data generation strategies.
README:
Bespoke Curator makes it easy to create high-quality synthetic data at scale, which you can use to fine-tune models or for structured data extraction.
Bespoke Curator is an open-source project:
- It comes with a rich Python-based library for generating and curating synthetic data.
- It includes a Curator Viewer, which makes it easy to inspect datasets and thus aids dataset creation.
- We will also be releasing high-quality datasets that should move the needle on post-training.
- Programmability and Structured Outputs: Synthetic data generation is a lot more than just using a single prompt -- it involves calling LLMs multiple times and orchestrating control flow. Curator treats structured outputs as first-class citizens and helps you design complex pipelines.
- Built-in Performance Optimization: We often see LLMs being called in loops, or inefficient multi-threading implementations. We have baked in performance optimizations so that you don't need to worry about those!
- Intelligent Caching and Fault Recovery: Given that LLM calls can add up in cost and time, failures are undesirable but sometimes unavoidable. We cache LLM requests and responses so that it is easy to recover from a failure. Moreover, when working on a multi-stage pipeline, caching each stage makes it easy to iterate.
- Native HuggingFace Dataset Integration: Work directly on HuggingFace Dataset objects throughout your pipeline. Your synthetic data is immediately ready for fine-tuning!
- Interactive Curator Viewer: Improve and iterate on your prompts using our built-in viewer. Inspect LLM requests and responses in real-time, allowing you to iterate and refine your data generation strategy with immediate feedback.
pip install bespokelabs-curator
To run the examples below, make sure to set your OpenAI API key in the environment variable OPENAI_API_KEY by running export OPENAI_API_KEY=sk-... in your terminal.
from bespokelabs import curator
llm = curator.SimpleLLM(model_name="gpt-4o-mini")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)
# Or you can pass a list of prompts to generate multiple responses.
poems = llm(["Write a poem about the importance of data in AI.",
             "Write a haiku about the importance of data in AI."])
print(poems)
Note that retries and caching are enabled by default, so if you run the same prompt again, you will get the same response almost instantly. You can delete the cache at ~/.cache/curator.
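For example, running the same prompt twice should hit the cache on the second call. A minimal sketch, assuming the default cache settings (timings are illustrative):
import time

from bespokelabs import curator

llm = curator.SimpleLLM(model_name="gpt-4o-mini")

start = time.time()
first = llm("Write a haiku about the importance of data in AI.")   # goes to the API
print(f"first call took {time.time() - start:.2f}s")

start = time.time()
second = llm("Write a haiku about the importance of data in AI.")  # served from ~/.cache/curator
print(f"second call took {time.time() - start:.2f}s")

assert first == second  # the cached response is identical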
You can use the LiteLLM backend for calling other models.
from bespokelabs import curator
llm = curator.SimpleLLM(model_name="claude-3-5-sonnet-20240620", backend="litellm")
poem = llm("Write a poem about the importance of data in AI.")
print(poem)
Run curator-viewer on the command line to see the dataset in the viewer. You can click on a run and then click on a specific row to see the LLM request and response. More examples below.
Let's use structured outputs to generate poems.
from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field
from typing import List
topics = Dataset.from_dict({"topic": [
    "Urban loneliness in a bustling city",
    "Beauty of Bespoke Labs's Curator library"
]})
Define a class to encapsulate a list of poems.
class Poem(BaseModel):
    poem: str = Field(description="A poem.")

class Poems(BaseModel):
    poems_list: List[Poem] = Field(description="A list of poems.")
We define an LLM object that generates poems, which gets applied to the topics dataset.
poet = curator.LLM(
    prompt_func=lambda row: f"Write two poems about {row['topic']}.",
    model_name="gpt-4o-mini",
    response_format=Poems,
    parse_func=lambda row, poems: [
        {"topic": row["topic"], "poem": p.poem} for p in poems.poems_list
    ],
)
Here:
- prompt_func takes a row of the dataset as input and returns the prompt for the LLM.
- response_format is the structured output class we defined above.
- parse_func takes the input (row) and the structured output (poems) and converts it to a list of dictionaries. This is so that we can easily convert the output to a HuggingFace Dataset object.
Now we can apply the LLM object to the dataset, which reads very pythonically.
poem = poet(topics)
print(poem.to_pandas())
# Example output:
# topic poem
# 0 Urban loneliness in a bustling city In the city's heart, where the sirens wail,\nA...
# 1 Urban loneliness in a bustling city City streets hum with a bittersweet song,\nHor...
# 2 Beauty of Bespoke Labs's Curator library In whispers of design and crafted grace,\nBesp...
# 3 Beauty of Bespoke Labs's Curator library In the hushed breath of parchment and ink,\nBe...
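Since the result is a regular HuggingFace Dataset, you can persist it or upload it with the standard datasets methods. A minimal sketch (the repository name is a placeholder):
poem.save_to_disk("poems_dataset")          # persist the Dataset locally
# poem.push_to_hub("your-username/poems")   # or upload to the Hugging Face Hub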
Note that topics can be created with curator.LLM as well, and we can scale this up to create tens of thousands of diverse poems. A sketch of that two-stage pipeline is shown below.
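The sketch below reuses the conventions from the example above; the Topics model, the prompt wording, and the target count are illustrative assumptions, not part of the library:
from typing import List

from bespokelabs import curator
from pydantic import BaseModel, Field

class Topics(BaseModel):
    topics_list: List[str] = Field(description="A list of topics.")

# Stage 1: generate topics with curator.LLM instead of writing them by hand.
topic_generator = curator.LLM(
    prompt_func=lambda row: "Generate 100 diverse topics that would make good poems.",
    model_name="gpt-4o-mini",
    response_format=Topics,
    parse_func=lambda row, topics: [{"topic": t} for t in topics.topics_list],
)
topics = topic_generator()  # a HuggingFace Dataset with a "topic" column

# Stage 2: feed the generated topics to the poet defined above.
poems = poet(topics)
print(poems.to_pandas())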
You can see a more detailed example in the examples/poem.py file,
and other examples in the examples directory.
See the docs for more details as well as for troubleshooting information.
To run the bespoke dataset viewer:
curator-viewer
This will pop up a browser window with the viewer running on 127.0.0.1:3000 by default if you haven't specified a different host and port.
The dataset viewer shows all the different runs you have made.
You can also see the dataset and the responses from the LLM.
Optional parameters to run the viewer on a different host and port:
>>> curator-viewer -h
usage: curator-viewer [-h] [--host HOST] [--port PORT] [--verbose]
Curator Viewer
options:
-h, --help show this help message and exit
--host HOST Host to run the server on (default: localhost)
--port PORT Port to run the server on (default: 3000)
--verbose, -v Enables debug logging for more verbose output
The only requirement for running curator-viewer is to install Node.js. You can install it by following the instructions here.
For example, to check if you have node installed, you can run:
node -v
If it's not installed, you can install the latest Node.js on macOS by running:
# installs nvm (Node Version Manager)
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.0/install.sh | bash
# download and install Node.js (you may need to restart the terminal)
nvm install 22
# verifies the right Node.js version is in the environment
node -v # should print `v22.11.0`
# verifies the right npm version is in the environment
npm -v # should print `10.9.0`
- Online Inference with Local Models
- Offline Inference with Local Models
- Full list of local models examples
You can use local models served with a vLLM OpenAI-compatible server for online data generation. To do that, first start a vLLM server:
vllm serve \
NousResearch/Meta-Llama-3-8B-Instruct \
--host localhost \
--port 8787 \
--api-key token-abc123
Note that you still need to provide a dummy API key since the OpenAI client expects one (LiteLLM uses the OpenAI client in the background). See here for a full example of starting a vLLM server.
Then, to use this server, provide the API endpoint URL and the dummy API key. We use LiteLLM to generate data with vLLM-hosted local models, so you need to set the backend to litellm. For LiteLLM to recognize the proper backend, add the hosted_vllm/ prefix to your model name. Set the HOSTED_VLLM_API_KEY environment variable to your dummy API key. Example:
import os

from bespokelabs import curator

model_path = "hosted_vllm/NousResearch/Meta-Llama-3-8B-Instruct"  # Make sure to add the hosted_vllm/ prefix
PORT = 8787
HOST = "localhost"
URL = f"http://{HOST}:{PORT}/v1"
os.environ["HOSTED_VLLM_API_KEY"] = "token-abc123"
poem_prompter = curator.LLM(
    model_name=model_path,
    prompt_func=lambda row: "Generate a poem",
    backend="litellm",
    base_url=URL,
)
poem_prompter()
See here for a full example. LiteLLM supports structured output; see here for an example with a vLLM-hosted model.
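A hedged sketch of combining the litellm backend with a structured response_format against the vLLM server started above; the Ideas model, prompt, and parse_func here are illustrative assumptions:
import os
from typing import List

from bespokelabs import curator
from pydantic import BaseModel, Field

class Ideas(BaseModel):
    ideas_list: List[str] = Field(description="A list of ideas.")

os.environ["HOSTED_VLLM_API_KEY"] = "token-abc123"

idea_generator = curator.LLM(
    model_name="hosted_vllm/NousResearch/Meta-Llama-3-8B-Instruct",
    prompt_func=lambda row: "Generate 5 dataset ideas for post-training.",
    response_format=Ideas,
    parse_func=lambda row, ideas: [{"idea": i} for i in ideas.ideas_list],
    backend="litellm",
    base_url="http://localhost:8787/v1",
)
print(idea_generator().to_pandas())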
We use vLLM's offline inference engine to generate synthetic data with local models.
Here is the full list of models that are supported by vLLM.
Install vLLM in your Python or Conda environment:
pip install vllm
You may also need to install Ray for multi-node inference on Ray clusters:
pip install ray
Please refer to the vLLM installation instructions.
To use curator with a local model, provide your model's local path in the model_name argument and set the backend to vllm:
model_path = "meta-llama/Meta-Llama-3.1-8B-Instruct"
curator.LLM(
model_name=model_path,
backend="vllm"
)
See full example here.
To use local models that are too big to fit on one GPU, use tensor parallelism. That way you can split the model across multiple GPUs. To do that, specify the tensor_parallel_size argument, which should be equal to the number of GPUs you have:
model_path = "meta-llama/Meta-Llama-3.1-70B-Instruct"
curator.LLM(
model_name=model_path,
backend="vllm",
tensor_parallel_size=4 # split across 4 GPUs
)
We use vLLM's Guided Decoding to obtain structured output from local models during offline inference:
from typing import List

from bespokelabs import curator
from pydantic import BaseModel, Field

class Cuisines(BaseModel):
    cuisines_list: List[str] = Field(description="A list of cuisines.")

model_path = "/local/path/to/weights/meta-llama/Meta-Llama-3.1-70B-Instruct"
cuisines_generator = curator.LLM(
    prompt_func=lambda: "Generate 10 diverse cuisines.",
    model_name=model_path,
    response_format=Cuisines,
    backend="vllm",
)
See full example here.
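As in the other examples, calling the generator should return a HuggingFace Dataset. A minimal usage sketch:
cuisines = cuisines_generator()
print(cuisines.to_pandas())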
Offline vLLM inference supports batched inference by default; the default batch size (the number of sequences to process at a time) is 256. Set the batch_size argument to change the default value:
curator.LLM(
    model_name=model_path,
    prompt_func=lambda row: "write a poem",
    backend="vllm",
    batch_size=32
)
Other optional parameters for the vllm backend:
- max_model_length (int, optional): The maximum model context length. Defaults to 4096.
- enforce_eager (bool, optional): Whether to enforce eager execution. Defaults to False.
- tensor_parallel_size (int, optional): The tensor parallel size. Defaults to 1.
- gpu_memory_utilization (float, optional): The GPU memory utilization. Defaults to 0.95.
- max_tokens (int, optional): The maximum number of tokens for models to generate. Defaults to 1024.
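A hedged sketch of overriding several of these defaults in one call, assuming they are passed as keyword arguments to curator.LLM (the values are illustrative):
from bespokelabs import curator

model_path = "meta-llama/Meta-Llama-3.1-70B-Instruct"
llm = curator.LLM(
    model_name=model_path,
    prompt_func=lambda row: "Write a recipe",
    backend="vllm",
    tensor_parallel_size=4,       # split across 4 GPUs
    max_model_length=8192,        # larger context window than the 4096 default
    gpu_memory_utilization=0.90,  # leave some GPU memory headroom
    max_tokens=2048,              # allow longer generations
)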
- Generate recipes with Meta LLama 3.1 8B offline
- Recipes with structured output
- Use vLLM OpenAI compatible server
- Use vLLM OpenAI compatible server with structured output
Contributions are welcome!
Alternative AI tools for curator
Similar Open Source Tools
web-llm
WebLLM is a modular and customizable JavaScript package that brings language model chats directly onto web browsers with hardware acceleration. Everything runs inside the browser with no server support and is accelerated with WebGPU. WebLLM is fully compatible with the OpenAI API. That is, you can use the same OpenAI API on any open source models locally, with functionalities including json-mode, function-calling, streaming, etc. We can bring a lot of fun opportunities to build AI assistants for everyone and enable privacy while enjoying GPU acceleration.
LeanCopilot
Lean Copilot is a tool that enables the use of large language models (LLMs) in Lean for proof automation. It provides features such as suggesting tactics/premises, searching for proofs, and running inference of LLMs. Users can utilize built-in models from LeanDojo or bring their own models to run locally or on the cloud. The tool supports platforms like Linux, macOS, and Windows WSL, with optional CUDA and cuDNN for GPU acceleration. Advanced users can customize behavior using Tactic APIs and Model APIs. Lean Copilot also allows users to bring their own models through ExternalGenerator or ExternalEncoder. The tool comes with caveats such as occasional crashes and issues with premise selection and proof search. Users can get in touch through GitHub Discussions for questions, bug reports, feature requests, and suggestions. The tool is designed to enhance theorem proving in Lean using LLMs.
storm
STORM is a LLM system that writes Wikipedia-like articles from scratch based on Internet search. While the system cannot produce publication-ready articles that often require a significant number of edits, experienced Wikipedia editors have found it helpful in their pre-writing stage. **Try out our [live research preview](https://storm.genie.stanford.edu/) to see how STORM can help your knowledge exploration journey and please provide feedback to help us improve the system 🙏!**
tonic_validate
Tonic Validate is a framework for the evaluation of LLM outputs, such as Retrieval Augmented Generation (RAG) pipelines. Validate makes it easy to evaluate, track, and monitor your LLM and RAG applications. Validate allows you to evaluate your LLM outputs through the use of our provided metrics which measure everything from answer correctness to LLM hallucination. Additionally, Validate has an optional UI to visualize your evaluation results for easy tracking and monitoring.
OnAIR
The On-board Artificial Intelligence Research (OnAIR) Platform is a framework that enables AI algorithms written in Python to interact with NASA's cFS. It is intended to explore research concepts in autonomous operations in a simulated environment. The platform provides tools for generating environments, handling telemetry data through Redis, running unit tests, and contributing to the repository. Users can set up a conda environment, configure telemetry and Redis examples, run simulations, and conduct unit tests to ensure the functionality of their AI algorithms. The platform also includes guidelines for licensing, copyright, and contributions to the repository.
kvpress
This repository implements multiple key-value cache pruning methods and benchmarks using transformers, aiming to simplify the development of new methods for researchers and developers in the field of long-context language models. It provides a set of 'presses' that compress the cache during the pre-filling phase, with each press having a compression ratio attribute. The repository includes various training-free presses, special presses, and supports KV cache quantization. Users can contribute new presses and evaluate the performance of different presses on long-context datasets.
paper-qa
PaperQA is a minimal package for question and answering from PDFs or text files, providing very good answers with in-text citations. It uses OpenAI Embeddings to embed and search documents, and includes a process of embedding docs, queries, searching for top passages, creating summaries, using an LLM to re-score and select relevant summaries, putting summaries into prompt, and generating answers. The tool can be used to answer specific questions related to scientific research by leveraging citations and relevant passages from documents.
ell
ell is a lightweight, functional prompt engineering framework that treats prompts as programs rather than strings. It provides tools for prompt versioning, monitoring, and visualization, as well as support for multimodal inputs and outputs. The framework aims to simplify the process of prompt engineering for language models.
WindowsAgentArena
Windows Agent Arena (WAA) is a scalable Windows AI agent platform designed for testing and benchmarking multi-modal, desktop AI agents. It provides researchers and developers with a reproducible and realistic Windows OS environment for AI research, enabling testing of agentic AI workflows across various tasks. WAA supports deploying agents at scale using Azure ML cloud infrastructure, allowing parallel running of multiple agents and delivering quick benchmark results for hundreds of tasks in minutes.
oasis
OASIS is a scalable, open-source social media simulator that integrates large language models with rule-based agents to realistically mimic the behavior of up to one million users on platforms like Twitter and Reddit. It facilitates the study of complex social phenomena such as information spread, group polarization, and herd behavior, offering a versatile tool for exploring diverse social dynamics and user interactions in digital environments. With features like scalability, dynamic environments, diverse action spaces, and integrated recommendation systems, OASIS provides a comprehensive platform for simulating social media interactions at a large scale.
torchchat
torchchat is a codebase showcasing the ability to run large language models (LLMs) seamlessly. It allows running LLMs using Python in various environments such as desktop, server, iOS, and Android. The tool supports running models via PyTorch, chatting, generating text, running chat in the browser, and running models on desktop/server without Python. It also provides features like AOT Inductor for faster execution, running in C++ using the runner, and deploying and running on iOS and Android. The tool supports popular hardware and OS including Linux, Mac OS, Android, and iOS, with various data types and execution modes available.
raptor
RAPTOR introduces a novel approach to retrieval-augmented language models by constructing a recursive tree structure from documents. This allows for more efficient and context-aware information retrieval across large texts, addressing common limitations in traditional language models. Users can add documents to the tree, answer questions based on indexed documents, save and load the tree, and extend RAPTOR with custom summarization, question-answering, and embedding models. The tool is designed to be flexible and customizable for various NLP tasks.
probsem
ProbSem is a repository that provides a framework to leverage large language models (LLMs) for assigning context-conditional probability distributions over queried strings. It supports OpenAI engines and HuggingFace CausalLM models, and is flexible for research applications in linguistics, cognitive science, program synthesis, and NLP. Users can define prompts, contexts, and queries to derive probability distributions over possible completions, enabling tasks like cloze completion, multiple-choice QA, semantic parsing, and code completion. The repository offers CLI and API interfaces for evaluation, with options to customize models, normalize scores, and adjust temperature for probability distributions.
rtdl-num-embeddings
This repository provides the official implementation of the paper 'On Embeddings for Numerical Features in Tabular Deep Learning'. It focuses on transforming scalar continuous features into vectors before integrating them into the main backbone of tabular neural networks, showcasing improved performance. The embeddings for continuous features are shown to enhance the performance of tabular DL models and are applicable to various conventional backbones, offering efficiency comparable to Transformer-based models. The repository includes Python packages for practical usage, exploration of metrics and hyperparameters, and reproducing reported results for different algorithms and datasets.
jina
Jina is a tool that allows users to build multimodal AI services and pipelines using cloud-native technologies. It provides a Pythonic experience for serving ML models and transitioning from local deployment to advanced orchestration frameworks like Docker-Compose, Kubernetes, or Jina AI Cloud. Users can build and serve models for any data type and deep learning framework, design high-performance services with easy scaling, serve LLM models while streaming their output, integrate with Docker containers via Executor Hub, and host on CPU/GPU using Jina AI Cloud. Jina also offers advanced orchestration and scaling capabilities, a smooth transition to the cloud, and easy scalability and concurrency features for applications. Users can deploy to their own cloud or system with Kubernetes and Docker Compose integration, and even deploy to JCloud for autoscaling and monitoring.
For similar tasks
mindsdb
MindsDB is a platform for customizing AI from enterprise data. You can create, serve, and fine-tune models in real-time from your database, vector store, and application data. MindsDB "enhances" SQL syntax with AI capabilities to make it accessible for developers worldwide. With MindsDB’s nearly 200 integrations, any developer can create AI customized for their purpose, faster and more securely. Their AI systems will constantly improve themselves — using companies’ own data, in real-time.
training-operator
Kubeflow Training Operator is a Kubernetes-native project for fine-tuning and scalable distributed training of machine learning (ML) models created with various ML frameworks such as PyTorch, Tensorflow, XGBoost, MPI, Paddle and others. Training Operator allows you to use Kubernetes workloads to effectively train your large models via Kubernetes Custom Resources APIs or using Training Operator Python SDK. > Note: Before v1.2 release, Kubeflow Training Operator only supports TFJob on Kubernetes. * For a complete reference of the custom resource definitions, please refer to the API Definition. * TensorFlow API Definition * PyTorch API Definition * Apache MXNet API Definition * XGBoost API Definition * MPI API Definition * PaddlePaddle API Definition * For details of all-in-one operator design, please refer to the All-in-one Kubeflow Training Operator * For details on its observability, please refer to the monitoring design doc.
helix
HelixML is a private GenAI platform that allows users to deploy the best of open AI in their own data center or VPC while retaining complete data security and control. It includes support for fine-tuning models with drag-and-drop functionality. HelixML brings the best of open source AI to businesses in an ergonomic and scalable way, optimizing the tradeoff between GPU memory and latency.
nntrainer
NNtrainer is a software framework for training neural network models on devices with limited resources. It enables on-device fine-tuning of neural networks using user data for personalization. NNtrainer supports various machine learning algorithms and provides examples for tasks such as few-shot learning, ResNet, VGG, and product rating. It is optimized for embedded devices and utilizes CBLAS and CUBLAS for accelerated calculations. NNtrainer is open source and released under the Apache License version 2.0.
petals
Petals is a tool that allows users to run large language models at home in a BitTorrent-style manner. It enables fine-tuning and inference up to 10x faster than offloading. Users can generate text with distributed models like Llama 2, Falcon, and BLOOM, and fine-tune them for specific tasks directly from their desktop computer or Google Colab. Petals is a community-run system that relies on people sharing their GPUs to increase its capacity and offer a distributed network for hosting model layers.
LLaVA-pp
This repository, LLaVA++, extends the visual capabilities of the LLaVA 1.5 model by incorporating the latest LLMs, Phi-3 Mini Instruct 3.8B, and LLaMA-3 Instruct 8B. It provides various models for instruction-following LMMS and academic-task-oriented datasets, along with training scripts for Phi-3-V and LLaMA-3-V. The repository also includes installation instructions and acknowledgments to related open-source contributions.
KULLM
KULLM (구름) is a Korean Large Language Model developed by Korea University NLP & AI Lab and HIAI Research Institute. It is based on the upstage/SOLAR-10.7B-v1.0 model and has been fine-tuned for instruction. The model has been trained on 8×A100 GPUs and is capable of generating responses in Korean language. KULLM exhibits hallucination and repetition phenomena due to its decoding strategy. Users should be cautious as the model may produce inaccurate or harmful results. Performance may vary in benchmarks without a fixed system prompt.
For similar jobs
weave
Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.
LLMStack
LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.
VisionCraft
The VisionCraft API is a free API for using over 100 different AI models. From images to sound.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
PyRIT
PyRIT is an open access automation framework designed to empower security professionals and ML engineers to red team foundation models and their applications. It automates AI Red Teaming tasks to allow operators to focus on more complicated and time-consuming tasks and can also identify security harms such as misuse (e.g., malware generation, jailbreaking), and privacy harms (e.g., identity theft). The goal is to allow researchers to have a baseline of how well their model and entire inference pipeline is doing against different harm categories and to be able to compare that baseline to future iterations of their model. This allows them to have empirical data on how well their model is doing today, and detect any degradation of performance based on future improvements.
tabby
Tabby is a self-hosted AI coding assistant, offering an open-source and on-premises alternative to GitHub Copilot. It boasts several key features: * Self-contained, with no need for a DBMS or cloud service. * OpenAPI interface, easy to integrate with existing infrastructure (e.g Cloud IDE). * Supports consumer-grade GPUs.
spear
SPEAR (Simulator for Photorealistic Embodied AI Research) is a powerful tool for training embodied agents. It features 300 unique virtual indoor environments with 2,566 unique rooms and 17,234 unique objects that can be manipulated individually. Each environment is designed by a professional artist and features detailed geometry, photorealistic materials, and a unique floor plan and object layout. SPEAR is implemented as Unreal Engine assets and provides an OpenAI Gym interface for interacting with the environments via Python.
Magick
Magick is a groundbreaking visual AIDE (Artificial Intelligence Development Environment) for no-code data pipelines and multimodal agents. Magick can connect to other services and comes with nodes and templates well-suited for intelligent agents, chatbots, complex reasoning systems and realistic characters.