
datatune
Perform transformations on your data with natural language using LLMs
Stars: 101

Datatune is a Python library for transforming tabular data with natural language prompts and LLMs. Built on Dask DataFrames, it provides prompt-driven map and filter operations, an agentic interface that plans multi-step transformations from a single instruction, and support for multiple LLM providers through LiteLLM. Typical uses include data cleaning, PII anonymization, classification, and semantic filtering at scale.
README:
Perform transformations on your data with natural language using LLMs
Install from PyPI:
pip install datatune
From source:
pip install -e .
import os
import dask.dataframe as dd
import datatune as dt
from datatune.llm.llm import OpenAI
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
# Set tokens-per-minute and requests-per-minute limits
llm = OpenAI(model_name="gpt-3.5-turbo", tpm=200000, rpm=50)
# Load data from your source with Dask
df = dd.read_csv("tests/test_data/products.csv")
print(df.head())
# Transform data with Map
mapped = dt.map(
    prompt="Extract categories from the description and name of product.",
    output_fields=["Category", "Subcategory"],
    input_fields=["Description", "Name"]  # Relevant input fields (optional)
)(llm, df)
# Filter data based on criteria
filtered = dt.filter(
    prompt="Keep only electronics products",
    input_fields=["Name"]  # Relevant input fields (optional)
)(llm, mapped)
# Get the final dataframe with `finalize`, which cleans up metadata and drops rows deleted by earlier operations.
result = dt.finalize(filtered)
result.compute().to_csv("electronics_products.csv")  # the default index is written too, appearing below as "Unnamed: 0"
new_df = dd.read_csv("electronics_products.csv")
print(new_df.head())
products.csv
ProductID Name Price Quantity Description SKU
0 1001 Wireless Mouse 25.99 150 Ergonomic wireless mouse with 2.4GHz connectivity WM-1001
1 1002 Office Chair 89.99 75 Comfortable swivel office chair with lumbar su... OC-2002
2 1003 Coffee Mug 9.49 300 Ceramic mug, 12oz, microwave safe CM-3003
3 1004 LED Monitor 24" 149.99 60 24-inch Full HD LED monitor with HDMI and VGA ... LM-2404
4 1005 Notebook Pack 6.99 500 Pack of 3 ruled notebooks, 100 pages each NP-5005
electronics_products.csv
Unnamed: 0 ProductID Name ... SKU Category Subcategory
0 0 1001 Wireless Mouse ... WM-1001 Electronics Computer Accessories
1 3 1004 LED Monitor 24" ... LM-2404 Electronics Monitors
2 6 1007 USB-C Cable 1m ... UC-7007 Electronics Cables
3 8 1009 Bluetooth Speaker ... BS-9009 Electronics Audio
If you don't set rpm or tpm, Datatune will automatically look up default limits for your model from its model_rate_limits dictionary. If the model is not in the lookup, rpm and tpm default to the gpt-3.5-turbo limits.
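For example, a minimal sketch (the limit values are illustrative, not recommendations):
# Explicit limits override the lookup
llm = OpenAI(model_name="gpt-3.5-turbo", tpm=100000, rpm=60)
# No limits passed: Datatune falls back to model_rate_limits,
# or to the gpt-3.5-turbo limits if the model is not listed there
llm = OpenAI(model_name="gpt-3.5-turbo")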
Passing input_fields reduces cost: only the relevant columns are sent as input to the given LLM API, cutting the number of tokens per request.
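For instance, a sketch (the prompt and column names are illustrative):
# Only the Description column is serialized into the prompt;
# the other columns stay local and cost no tokens
branded = dt.map(
    prompt="Extract the brand name from the product description.",
    output_fields=["Brand"],
    input_fields=["Description"]
)(llm, df)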
Protect sensitive information while preserving data utility:
# Anonymize personally identifiable information
customer_data = dd.read_csv("customer_records.csv")
anonymized = dt.map(
    prompt="Replace all personally identifiable fields with XX - emails, phone numbers, names, addresses",
    output_fields=["anonymized_text"],
    input_fields=["customer_notes"]
)(llm, customer_data)
Output:
CustomerID Original_Notes Anonymized_Text
0 3001 "John Smith called about bill" "XX called about bill"
1 3002 "Email: [email protected] for updates" "Email: XX for updates"
2 3003 "Call 555-1234 regarding order" "Call XX regarding order"
Extract and categorize information:
# Classify customer support emails by department and urgency
support_emails = dd.read_csv("support_emails.csv")
classified = dt.map(
    prompt="Classify emails by department (Technical/Billing/Sales) and urgency level (Low/Medium/High/Critical)",
    output_fields=["department", "urgency_level", "estimated_response_time"],
    input_fields=["subject", "email_body"]
)(llm, support_emails)
Output:
EmailID Subject Department Urgency_Level Estimated_Response_Time
0 4001 "Login issues on mobile" Technical High "2 hours"
1 4002 "Invoice payment question" Billing Medium "1 day"
2 4003 "Server completely down" Technical Critical "30 minutes"
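A map result can be chained straight into a filter; a sketch building on the example above (the prompt is illustrative):
# Keep only the most urgent emails from the classified frame
critical = dt.filter(
    prompt="Keep only emails with Critical urgency",
    input_fields=["urgency_level"]
)(llm, classified)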
Filter to remove rows based on criteria:
# Filter high-quality product reviews
reviews = dd.read_csv("reviews.csv")
quality_reviews = dt.filter(
    prompt="Keep only genuine, detailed reviews that are not spam",
    input_fields=["review_text", "reviewer_history"]
)(llm, reviews)
Output:
ReviewID Review_Text Reviewer_History Rating
0 5001 "Excellent product, works as expected..." "50+ reviews, verified" 5
1 5004 "Good value for money, fast shipping..." "25+ reviews, verified" 4
2 5007 "Quality exceeded my expectations..." "15+ reviews, verified" 5
Transform data with natural language:
customers = dd.read_csv("customers.csv")
mapped = dt.map(
    prompt="Extract country and city from the address field",
    output_fields=["country", "city"]
)(llm, customers)
# Filter to remove rows
filtered = dt.filter(
    prompt="Keep only customers who are from Asia"
)(llm, mapped)
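As in the quickstart, call finalize before computing the result:
result = dt.finalize(filtered)
print(result.compute().head())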
Datatune works with a variety of LLM providers, using LiteLLM under the hood:
# Using Ollama
from datatune.llm.llm import Ollama
llm = Ollama()
# Using Azure
from datatune.llm.llm import Azure
llm = Azure(
    model_name="gpt-3.5-turbo",
    api_key=api_key,
    api_base=api_base,
    api_version=api_version)
# OpenAI
from datatune.llm.llm import OpenAI
llm = OpenAI(model_name="gpt-3.5-turbo")
Datatune provides an agentic interface that allows large language models (LLMs) to autonomously plan and execute data transformation steps from natural language prompts. Agents understand your instructions and dynamically generate the appropriate sequence of Map, Filter, and other operations on your data, so there is no need to manually compose transformation chains.
With just a single prompt, the agent analyzes your intent, determines the necessary transformations, and applies them directly to your Dask DataFrame.
import datatune as dt
from datatune.llm.llm import OpenAI
llm = OpenAI(model_name="gpt-3.5-turbo", tpm=200000)
# Create a Datatune Agent
agent = dt.Agent(llm)
# Define your transformation task
prompt = "Add a new column called ProfitMargin = (Total Profit / Total Revenue) * 100."
# Let the agent handle it! (df is a Dask DataFrame loaded earlier, e.g. with dd.read_csv)
df = agent.do(prompt, df)
result = dt.finalize(df)
The agent automatically infers the right operations for the job:
- Column creation: Derive new columns using arithmetic, string manipulation, or semantic understanding.
- Conditional filtering: Keep or drop rows based on complex logic.
- Semantic classification: Categorize data based on textual cues or domain knowledge.
- Multi-step pipelines: Chain multiple transformations from a single prompt.
prompt = "Add a new column called ProfitMargin = (Total Profit / Total Revenue) * 100."
df = agent.do(prompt, df)
→ Adds the column, infers data types, and inserts it in place.
prompt = "Create a new column called Category and Sub-Category based on the Industry column and only keep organizations that are in Africa."
df = agent.do(prompt, df)
→ Categorizes based on industry and filters by region, all in a single command.
prompt = "Extract year from date of birth column into a new column called Year and keep only people who are in STEM related jobs."
df = agent.do(prompt, df)
→ Extracts the year, identifies STEM professions, and filters accordingly.
After the agent has performed its tasks, finalize the dataframe to apply clean-up and remove intermediate metadata:
result = dt.finalize(df)
result.compute().to_csv("output.csv", index=False)
Agents make Datatune ideal for non-technical users, rapid prototyping, and intelligent data workflows: just describe what you want and let the agent do the rest.
Datatune leverages Dask DataFrames to enable scalable processing across large datasets. This approach allows you to:
- Process data larger than the context length of LLMs
- Execute parallel computations efficiently
If you're working with pandas DataFrames, convert them with a simple:
import dask.dataframe as dd
dask_df = dd.from_pandas(pandas_df, npartitions=4) # adjust partitions based on your data size
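Large files can also be partitioned at read time; a sketch assuming a hypothetical large_dataset.csv:
# Read in ~64 MB partitions so Dask can process chunks in parallel
df = dd.read_csv("large_dataset.csv", blocksize="64MB")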
Check out the examples, and read our documentation to learn how to use datatune.
Want to raise an issue or request a new feature? Head over to issues and raise a ticket!
You can also mail us at [email protected]
Join our Discord community to connect with other users, ask questions, and get support.
MIT License
Similar Open Source Tools


ROGRAG
ROGRAG is a powerful open-source tool designed for data analysis and visualization. It provides a user-friendly interface for exploring and manipulating datasets, making it ideal for researchers, data scientists, and analysts. With ROGRAG, users can easily import, clean, analyze, and visualize data to gain valuable insights and make informed decisions. The tool supports a wide range of data formats and offers a variety of statistical and visualization tools to help users uncover patterns, trends, and relationships in their data. Whether you are working on exploratory data analysis, statistical modeling, or data visualization, ROGRAG is a versatile tool that can streamline your workflow and enhance your data analysis capabilities.

arconia
Arconia is a powerful open-source tool for managing and visualizing data in a user-friendly way. It provides a seamless experience for data analysts and scientists to explore, clean, and analyze datasets efficiently. With its intuitive interface and robust features, Arconia simplifies the process of data manipulation and visualization, making it an essential tool for anyone working with data.

catwalk
Catwalk is a lightweight and user-friendly tool for visualizing and analyzing data. It provides a simple interface for users to explore and understand their datasets through interactive charts and graphs. With Catwalk, users can easily upload their data, customize visualizations, and gain insights from their data without the need for complex coding or technical skills.

Eridanus
Eridanus is a powerful data visualization tool designed to help users create interactive and insightful visualizations from their datasets. With a user-friendly interface and a wide range of customization options, Eridanus makes it easy for users to explore and analyze their data in a meaningful way. Whether you are a data scientist, business analyst, or student, Eridanus provides the tools you need to communicate your findings effectively and make data-driven decisions.

xorq
Xorq (formerly LETSQL) is a data processing library built on top of Ibis and DataFusion to write multi-engine data workflows. It provides a flexible and powerful tool for processing and analyzing data from various sources, enabling users to create complex data pipelines and perform advanced data transformations.

AI_Spectrum
AI_Spectrum is a versatile machine learning library that provides a wide range of tools and algorithms for building and deploying AI models. It offers a user-friendly interface for data preprocessing, model training, and evaluation. With AI_Spectrum, users can easily experiment with different machine learning techniques and optimize their models for various tasks. The library is designed to be flexible and scalable, making it suitable for both beginners and experienced data scientists.

SQLBot
SQLBot is a versatile tool for executing SQL queries and managing databases. It provides a user-friendly interface for interacting with databases, allowing users to easily query, insert, update, and delete data. SQLBot supports various database systems such as MySQL, PostgreSQL, and SQLite, making it a valuable tool for developers, data analysts, and database administrators. With SQLBot, users can streamline their database management tasks and improve their productivity by quickly accessing and manipulating data without the need for complex SQL commands.

waidrin
Waidrin is a powerful web scraping tool that allows users to easily extract data from websites. It provides a user-friendly interface for creating custom web scraping scripts and supports various data formats for exporting the extracted data. With Waidrin, users can automate the process of collecting information from multiple websites, saving time and effort. The tool is designed to be flexible and scalable, making it suitable for both beginners and advanced users in the field of web scraping.

dyad
Dyad is a lightweight Python library for analyzing dyadic data, which involves pairs of individuals and their interactions. It provides functions for computing various network metrics, visualizing network structures, and conducting statistical analyses on dyadic data. Dyad is designed to be user-friendly and efficient, making it suitable for researchers and practitioners working with relational data in fields such as social network analysis, communication studies, and psychology.

agent-lightning
Agent Lightning is a lightweight and efficient tool for automating repetitive tasks in the field of data analysis and machine learning. It provides a user-friendly interface to create and manage automated workflows, allowing users to easily schedule and execute data processing, model training, and evaluation tasks. With its intuitive design and powerful features, Agent Lightning streamlines the process of building and deploying machine learning models, making it ideal for data scientists, machine learning engineers, and AI enthusiasts looking to boost their productivity and efficiency in their projects.

read-frog
Read-frog is a powerful text analysis tool designed to help users extract valuable insights from text data. It offers a wide range of features including sentiment analysis, keyword extraction, entity recognition, and text summarization. With its user-friendly interface and robust algorithms, Read-frog is suitable for both beginners and advanced users looking to analyze text data for various purposes such as market research, social media monitoring, and content optimization. Whether you are a data scientist, marketer, researcher, or student, Read-frog can streamline your text analysis workflow and provide actionable insights to drive decision-making and enhance productivity.

PulsarRPAPro
PulsarRPAPro is a powerful robotic process automation (RPA) tool designed to automate repetitive tasks and streamline business processes. It offers a user-friendly interface for creating and managing automation workflows, allowing users to easily automate tasks without the need for extensive programming knowledge. With features such as task scheduling, data extraction, and integration with various applications, PulsarRPAPro helps organizations improve efficiency and productivity by reducing manual work and human errors. Whether you are a small business looking to automate simple tasks or a large enterprise seeking to optimize complex processes, PulsarRPAPro provides the flexibility and scalability to meet your automation needs.

LightLLM
LightLLM is a lightweight library for linear and logistic regression models. It provides a simple and efficient way to train and deploy machine learning models for regression tasks. The library is designed to be easy to use and integrate into existing projects, making it suitable for both beginners and experienced data scientists. With LightLLM, users can quickly build and evaluate regression models using a variety of algorithms and hyperparameters. The library also supports feature engineering and model interpretation, allowing users to gain insights from their data and make informed decisions based on the model predictions.

cipher
Cipher is a versatile encryption and decryption tool designed to secure sensitive information. It offers a user-friendly interface with various encryption algorithms to choose from, ensuring data confidentiality and integrity. With Cipher, users can easily encrypt text or files using strong encryption methods, making it suitable for protecting personal data, confidential documents, and communication. The tool also supports decryption of encrypted data, providing a seamless experience for users to access their secured information. Cipher is a reliable solution for individuals and organizations looking to enhance their data security measures.

firecrawl
Firecrawl is an API service that empowers AI applications with clean data from any website. It features advanced scraping, crawling, and data extraction capabilities. The repository is still in development, integrating custom modules into the mono repo. Users can run it locally but it's not fully ready for self-hosted deployment yet. Firecrawl offers powerful capabilities like scraping, crawling, mapping, searching, and extracting structured data from single pages, multiple pages, or entire websites with AI. It supports various formats, actions, and batch scraping. The tool is designed to handle proxies, anti-bot mechanisms, dynamic content, media parsing, change tracking, and more. Firecrawl is available as an open-source project under the AGPL-3.0 license, with additional features offered in the cloud version.
For similar tasks

opendataeditor
The Open Data Editor (ODE) is a no-code application to explore, validate and publish data in a simple way. It is an open source project powered by the Frictionless Framework. The ODE is currently available for download and testing in beta.

data-juicer
Data-Juicer is a one-stop data processing system to make data higher-quality, juicier, and more digestible for LLMs. It is a systematic & reusable library of 80+ core OPs, 20+ reusable config recipes, and 20+ feature-rich dedicated toolkits, designed to function independently of specific LLM datasets and processing pipelines. Data-Juicer allows detailed data analyses with an automated report generation feature for a deeper understanding of your dataset. Coupled with multi-dimension automatic evaluation capabilities, it supports a timely feedback loop at multiple stages in the LLM development process. Data-Juicer offers tens of pre-built data processing recipes for pre-training, fine-tuning, en, zh, and more scenarios. It provides a speedy data processing pipeline requiring less memory and CPU usage, optimized for maximum productivity. Data-Juicer is flexible & extensible, accommodating most types of data formats and allowing flexible combinations of OPs. It is designed for simplicity, with comprehensive documentation, easy start guides and demo configs, and intuitive configuration with simple adding/removing OPs from existing configs.

OAD
OAD is a powerful open-source tool for analyzing and visualizing data. It provides a user-friendly interface for exploring datasets, generating insights, and creating interactive visualizations. With OAD, users can easily import data from various sources, clean and preprocess data, perform statistical analysis, and create customizable visualizations to communicate findings effectively. Whether you are a data scientist, analyst, or researcher, OAD can help you streamline your data analysis workflow and uncover valuable insights from your data.

Streamline-Analyst
Streamline Analyst is a cutting-edge, open-source application powered by Large Language Models (LLMs) designed to revolutionize data analysis. This Data Analysis Agent effortlessly automates tasks such as data cleaning, preprocessing, and complex operations like identifying target objects, partitioning test sets, and selecting the best-fit models based on your data. With Streamline Analyst, results visualization and evaluation become seamless. It aims to expedite the data analysis process, making it accessible to all, regardless of their expertise in data analysis. The tool is built to empower users to process data and achieve high-quality visualizations with unparalleled efficiency, and to execute high-performance modeling with the best strategies. Future enhancements include Natural Language Processing (NLP), neural networks, and object detection utilizing YOLO, broadening its capabilities to meet diverse data analysis needs.

2021-13th-ironman
This repository is a part of the 13th iT Help Ironman competition, focusing on exploring explainable artificial intelligence (XAI) in machine learning and deep learning. The content covers the basics of XAI, its applications, cases, challenges, and future directions. It also includes practical machine learning algorithms, model deployment, and integration concepts. The author aims to provide detailed resources on AI and share knowledge with the audience through this competition.

crazyai-ml
The 'crazyai-ml' repository is a collection of resources related to machine learning, specifically focusing on explaining artificial intelligence models. It includes articles, code snippets, and tutorials covering various machine learning algorithms, data analysis, model training, and deployment. The content aims to provide a comprehensive guide for beginners in the field of AI, offering practical implementations and insights into popular machine learning packages and model tuning techniques. The repository also addresses the integration of AI models and frontend-backend concepts, making it a valuable resource for individuals interested in AI applications.

ProX
ProX is a lm-based data refinement framework that automates the process of cleaning and improving data used in pre-training large language models. It offers better performance, domain flexibility, efficiency, and cost-effectiveness compared to traditional methods. The framework has been shown to improve model performance by over 2% and boost accuracy by up to 20% in tasks like math. ProX is designed to refine data at scale without the need for manual adjustments, making it a valuable tool for data preprocessing in natural language processing tasks.

LLM4DB
LLM4DB is a repository focused on the intersection of Large Language Models (LLMs) and Database technologies. It covers various aspects such as data processing, data analysis, database optimization, and data management for LLMs. The repository includes research papers, tools, and techniques related to leveraging LLMs for tasks like data cleaning, entity matching, schema matching, data discovery, NL2SQL, data exploration, data visualization, knob tuning, query optimization, and database diagnosis.
For similar jobs

Azure-Analytics-and-AI-Engagement
The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (containing a demo web application, Power BI reports, Synapse resources, AML Notebooks, etc.) that can be deployed in a customer's subscription using the CAPE tool within a matter of a few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.

skyvern
Skyvern automates browser-based workflows using LLMs and computer vision. It provides a simple API endpoint to fully automate manual workflows, replacing brittle or unreliable automation solutions. Traditional approaches to browser automation required writing custom scripts for websites, often relying on DOM parsing and XPath-based interactions which would break whenever the website layouts changed. Instead of only relying on code-defined XPath interactions, Skyvern adds computer vision and LLMs to the mix to parse items in the viewport in real-time, create a plan for interaction and interact with them. This approach gives us a few advantages: 1. Skyvern can operate on websites it's never seen before, as it's able to map visual elements to the actions necessary to complete a workflow, without any customized code 2. Skyvern is resistant to website layout changes, as there are no pre-determined XPaths or other selectors our system is looking for while trying to navigate 3. Skyvern leverages LLMs to reason through interactions to ensure we can cover complex situations. Examples include: 1. If you wanted to get an auto insurance quote from Geico, the answer to a common question "Were you eligible to drive at 18?" could be inferred from the driver receiving their license at age 16 2. If you were doing competitor analysis, it's understanding that an Arnold Palmer 22 oz can at 7/11 is almost definitely the same product as a 23 oz can at Gopuff (even though the sizes are slightly different, which could be a rounding error!) Want to see examples of Skyvern in action? Jump to #real-world-examples-of-skyvern

pandas-ai
PandasAI is a Python library that makes it easy to ask questions to your data in natural language. It helps you to explore, clean, and analyze your data using generative AI.

vanna
Vanna is an open-source Python framework for SQL generation and related functionality. It uses Retrieval-Augmented Generation (RAG) to train a model on your data, which can then be used to ask questions and get back SQL queries. Vanna is designed to be portable across different LLMs and vector databases, and it supports any SQL database. It is also secure and private, as your database contents are never sent to the LLM or the vector database.

databend
Databend is an open-source cloud data warehouse that serves as a cost-effective alternative to Snowflake. With its focus on fast query execution and data ingestion, it's designed for complex analysis of the world's largest datasets.

Avalonia-Assistant
Avalonia-Assistant is an open-source desktop intelligent assistant that aims to provide a user-friendly interactive experience based on the Avalonia UI framework and the integration of Semantic Kernel with OpenAI or other large LLM models. By utilizing Avalonia-Assistant, you can perform various desktop operations through text or voice commands, enhancing your productivity and daily office experience.

marvin
Marvin is a lightweight AI toolkit for building natural language interfaces that are reliable, scalable, and easy to trust. Each of Marvin's tools is simple and self-documenting, using AI to solve common but complex challenges like entity extraction, classification, and generating synthetic data. Each tool is independent and incrementally adoptable, so you can use them on their own or in combination with any other library. Marvin is also multi-modal, supporting both image and audio generation as well as using images as inputs for extraction and classification. Marvin is for developers who care more about using AI than building AI, and we are focused on creating an exceptional developer experience. Marvin users should feel empowered to bring tightly-scoped "AI magic" into any traditional software project with just a few extra lines of code. Marvin aims to merge the best practices for building dependable, observable software with the best practices for building with generative AI into a single, easy-to-use library. It's a serious tool, but we hope you have fun with it. Marvin is open-source, free to use, and made with 💙 by the team at Prefect.

activepieces
Activepieces is an open source replacement for Zapier, designed to be extensible through a type-safe pieces framework written in Typescript. It features a user-friendly Workflow Builder with support for Branches, Loops, and Drag and Drop. Activepieces integrates with Google Sheets, OpenAI, Discord, and RSS, along with 80+ other integrations. The list of supported integrations continues to grow rapidly, thanks to valuable contributions from the community. Activepieces is an open ecosystem; all piece source code is available in the repository, and they are versioned and published directly to npmjs.com upon contributions. If you cannot find a specific piece on the pieces roadmap, please submit a request by visiting the following link: Request Piece Alternatively, if you are a developer, you can quickly build your own piece using our TypeScript framework. For guidance, please refer to the following guide: Contributor's Guide