greenmask

PostgreSQL database anonymization and synthetic data generation tool

Stars: 1285

Greenmask is a powerful open-source utility designed for logical database backup dumping, anonymization, synthetic data generation, and restoration. It is highly customizable, stateless, and backward-compatible with existing PostgreSQL utilities. Greenmask supports advanced subset systems, deterministic transformers, dynamic parameters, transformation conditions, and more. It is cross-platform, database type safe, extensible, and supports parallel execution and various storage options. Ideal for backup and restoration tasks, anonymization, transformation, and data masking.

README:

Greenmask

Dump anonymization and synthetic data generation tool

Greenmask is a powerful open-source utility designed for logical database backup dumping, anonymization, synthetic data generation, and restoration. It uses ported PostgreSQL libraries, which makes it reliable. It is stateless and does not require any changes to your database schema. It is designed to be fast, highly customizable, and backward-compatible with existing PostgreSQL utilities.

Getting started

Greenmask has a Playground: a sandbox environment in Docker with sample databases included, so you can try Greenmask without any additional setup.

Clone the greenmask repository and navigate to its directory by running the following commands:
```
git clone git@github.com:GreenmaskIO/greenmask.git && cd greenmask
```
Once you have cloned the repository, start the environment by running Docker Compose:
```
docker-compose run greenmask
```

Features

Database subset - One of the most advanced subset systems on the market. It supports virtual references, nullable columns, and polymorphic references, and can subset even the most complex schemas with cyclic references.
Deterministic transformers - Use hash functions to ensure consistent output for the same input. Most transformers support both random and hash engines, offering flexibility for various use cases.
Dynamic parameters - Most transformers support dynamic parameters, allowing them to adapt based on table column values. This feature helps manage dependencies between columns and meet constraints effectively.
Transformation condition - Applies the transformation only when a specified condition is met, making it useful for targeting specific rows.
Transformation validation and easy maintenance - Greenmask provides validation warnings, data transformation diffs, and schema diffs during configuration, enabling effective monitoring and maintenance of transformations. The schema diff feature helps prevent data leakage when the schema changes.
Transformation inheritance - Transformations are inherited by partitioned tables and by tables with foreign keys. Define once and apply to all.
Stateless - Greenmask operates as a logical dump and does not impact your existing database schema.
Cross-platform - Can be easily built and executed on any platform, thanks to its Go-based architecture, which eliminates platform dependencies.
Database type safe - Ensures data integrity by validating data and using the database driver for encoding and decoding operations, preserving accurate data formats.
Backward compatible - Fully supports the same features and protocols as standard PostgreSQL utilities. Dumps created by Greenmask can be seamlessly restored using the pg_restore utility.
Extensible - Users have the flexibility to implement domain-based transformations in any programming language or use predefined templates.
Parallel execution - Enables parallel dumping and restoration to significantly speed up results.
Variety of storages - Supports both local and remote storage, including directories and S3-compatible solutions.
Pgzip support for faster compression - Speeds up dump and restoration processes with parallel compression when the --pgzip flag is set.
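
The difference between the hash and random engines, and the effect of a transformation condition, can be illustrated with a small Python sketch. This is not Greenmask code or configuration: the function names, the salt, and the `is_internal` column are hypothetical and exist only to show the idea.

```python
import hashlib
import hmac
import random


def hash_engine_email(value: str, salt: bytes) -> str:
    """Deterministic: the same input and salt always give the same output."""
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"


def random_engine_email(rng: random.Random) -> str:
    """Non-deterministic: the output does not depend on the input value."""
    return f"user_{rng.getrandbits(48):012x}@example.com"


def transform_when(row: dict, column: str, condition, transformer) -> dict:
    """Apply `transformer` to `column` only for rows where `condition` holds."""
    if not condition(row):
        return row
    masked = dict(row)  # copy so the original row is left untouched
    masked[column] = transformer(row[column])
    return masked


if __name__ == "__main__":
    salt = b"demo-salt"  # hypothetical secret; keep it stable to keep outputs stable
    rows = [
        {"id": 1, "email": "alice@corp.test", "is_internal": False},
        {"id": 2, "email": "bob@corp.test", "is_internal": True},
    ]
    for row in rows:
        print(transform_when(
            row,
            "email",
            condition=lambda r: not r["is_internal"],
            transformer=lambda v: hash_engine_email(v, salt),
        ))
```

Because the hash engine maps equal inputs to equal outputs, a value that appears in several tables stays consistent after masking, which is what keeps joins meaningful in an anonymized copy.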

Use Cases

Greenmask is ideal for various scenarios, including:

Backup and Restoration. Use Greenmask for your daily routines involving logical backup dumping and restoration. It seamlessly handles tasks like table restoration after truncation. Its functionality closely mirrors that of pg_dump and pg_restore, making it a straightforward replacement.
Anonymization, Transformation, and Data Masking. Employ Greenmask for anonymizing, transforming, and masking backups, especially when setting up a staging environment or for analytical purposes. It simplifies the deployment of a pre-production environment with consistently anonymized data, facilitating faster time-to-market in the development lifecycle.

General Information

The best approach for logical backup dumping and restoration is to use core PostgreSQL utilities, specifically pg_dump and pg_restore. Greenmask is designed to align with these native tools, ensuring full compatibility. It independently manages data dumping while delegating schema dumping and restoration to pg_dump and pg_restore, ensuring smooth integration with PostgreSQL’s standard workflow.

Greenmask utilizes the directory format of pg_dump and pg_restore, ideal for parallel execution and partial restoration. This format includes metadata files to guide backup and restoration steps.

Storage Options

s3 - Supports any S3-compatible storage system, including AWS S3, offering flexibility across different cloud storage solutions.
directory - This is the default option, representing a standard filesystem directory for local storage.

Data Anonymization and Validation

Greenmask works with COPY lines, collects schema metadata using the Golang driver, and uses the same driver to encode and decode values. The validate command lets you assess the impact of a configuration on both the schema (validation warnings) and the data (transformed rows displayed as differences), so you can confirm that transformations produce the desired outcome during the anonymization process.
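
To make "working with COPY lines" concrete, here is a minimal Python sketch of decoding one line of PostgreSQL's COPY text format, transforming a single column, and encoding it again. It is an illustration under simplifying assumptions, not Greenmask's implementation: it handles only the tab delimiter, the `\N` NULL marker, and the common single-character backslash escapes (octal and hex escapes are ignored), and the helper names are hypothetical.

```python
# Escape sequences used by the COPY text format (subset; no octal/hex forms).
_UNESCAPE = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v", "\\": "\\"}
_ESCAPE = {"\\": "\\\\", "\b": "\\b", "\f": "\\f", "\n": "\\n",
           "\r": "\\r", "\t": "\\t", "\v": "\\v"}


def decode_field(raw):
    """Turn one raw COPY field into a Python str, or None for NULL."""
    if raw == "\\N":
        return None
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch == "\\" and i + 1 < len(raw):
            nxt = raw[i + 1]
            # Unknown backslashed characters stand for themselves.
            out.append(_UNESCAPE.get(nxt, nxt))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def encode_field(value):
    """Turn a Python str (or None) back into a raw COPY field."""
    if value is None:
        return "\\N"
    return "".join(_ESCAPE.get(ch, ch) for ch in value)


def decode_line(line):
    """Split a COPY line on tabs; literal tabs in data are always escaped."""
    body = line[:-1] if line.endswith("\n") else line
    return [decode_field(f) for f in body.split("\t")]


def encode_line(fields):
    return "\t".join(encode_field(f) for f in fields) + "\n"


def mask_column(line, index, transformer):
    """Apply `transformer` to one column, leaving NULLs as they are."""
    fields = decode_line(line)
    if fields[index] is not None:
        fields[index] = transformer(fields[index])
    return encode_line(fields)


if __name__ == "__main__":
    line = "1\tAlice\\tSmith\t\\N\t\n"  # id, name with an escaped tab, NULL, empty string
    print(repr(decode_line(line)))
    print(repr(mask_column(line, 1, str.upper)))
```

A real tool additionally needs type-aware decoding (dates, numerics, arrays), which is the part the text above attributes to the database driver.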

Customization

If your table schema relies on functional dependencies between columns, you can handle them with dynamic parameters. For example, dynamic parameters resolve the created_at and updated_at case, where updated_at must be greater than or equal to created_at.
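
The created_at and updated_at case can be sketched in a few lines of Python. This shows the idea only, assuming a random timestamp generator whose lower bound is taken from another column of the same row; it is not Greenmask's configuration syntax, and `random_timestamp` and `transform_row` are hypothetical names.

```python
import random
from datetime import datetime, timedelta


def random_timestamp(rng, min_value, max_value):
    """Pick a timestamp in [min_value, max_value] with one-second resolution."""
    span = int((max_value - min_value).total_seconds())
    return min_value + timedelta(seconds=rng.randint(0, span))


def transform_row(row, rng, global_min, global_max):
    """Regenerate both timestamps while preserving updated_at >= created_at."""
    created = random_timestamp(rng, global_min, global_max)
    # "Dynamic" part: the lower bound for updated_at is not a constant,
    # it is read from this row's (already transformed) created_at value.
    updated = random_timestamp(rng, created, global_max)
    return {**row, "created_at": created, "updated_at": updated}


if __name__ == "__main__":
    rng = random.Random(42)
    row = {"id": 1, "created_at": None, "updated_at": None}
    print(transform_row(row, rng, datetime(2020, 1, 1), datetime(2024, 1, 1)))
```

With a static lower bound, roughly half of the generated rows would violate the constraint; binding the bound to the row's own value removes that failure mode.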

If you need to implement custom logic imperatively, use the Cmd, TemplateRecord, or Template transformers.

PostgreSQL Version Compatibility

Greenmask is compatible with PostgreSQL versions 11 and higher.

References

Utilized the Demo database, provided by PostgresPro, for integration testing purposes.
Employed the adventureworks database created by morenoh149/postgresDBSamples, in the Docker Compose playground.

For Tasks:

backup and restoration, anonymization and masking, data transformation, logical data dumping, synthetic data creation

For Jobs:

database administrator, data engineer, data analyst, software developer, devops engineer

Alternative AI tools for greenmask

Similar Open Source Tools

greenmask

github: 1.3k

agent-zero

Agent Zero is a personal and organic AI framework designed to be dynamic, organically growing, and learning as you use it. It is fully transparent, readable, comprehensible, customizable, and interactive. The framework uses the computer as a tool to accomplish tasks, with no single-purpose tools pre-programmed. It emphasizes multi-agent cooperation, complete customization, and extensibility. Communication is key in this framework, allowing users to give proper system prompts and instructions to achieve desired outcomes. Agent Zero is capable of dangerous actions and should be run in an isolated environment. The framework is prompt-based, highly customizable, and requires a specific environment to run effectively.

github: 5.4k

fridon-ai

FridonAI is an open-source project offering AI-powered tools for cryptocurrency analysis and blockchain operations. It includes modules like FridonAnalytics for price analysis, FridonSearch for technical indicators, FridonNotifier for custom alerts, FridonBlockchain for blockchain operations, and FridonChat as a unified chat interface. The platform empowers users to create custom AI chatbots, access crypto tools, and interact effortlessly through chat. The core functionality is modular, with plugins, tools, and utilities for easy extension and development. FridonAI implements a scoring system to assess user interactions and incentivize engagement. The application uses Redis extensively for communication and includes a Nest.js backend for system operations.

github: 82

AgentForge

AgentForge is a low-code framework tailored for the rapid development, testing, and iteration of AI-powered autonomous agents and Cognitive Architectures. It is compatible with a range of LLM models and offers flexibility to run different models for different agents based on specific needs. The framework is designed for seamless extensibility and database-flexibility, making it an ideal playground for various AI projects. AgentForge is a beta-testing ground and future-proof hub for crafting intelligent, model-agnostic autonomous agents.

github: 494

chatnio

Chat Nio is a next-generation AI one-stop solution that provides a rich and user-friendly interface for interacting with various AI models. It offers features such as AI chat conversation, rich format compatibility, markdown support, message menu support, multi-platform adaptation, dialogue memory, full-model file parsing, full-model DuckDuckGo online search, full-screen large text editing, model marketplace, preset support, site announcements, preference settings, internationalization support, and a rich admin system. Chat Nio also boasts a powerful channel management system that utilizes a self-developed channel distribution algorithm, supports multi-channel management, is compatible with multiple formats, allows for custom models, supports channel retries, enables balanced load within the same channel, and provides channel model mapping and user grouping. Additionally, Chat Nio offers forwarding API services that are compatible with multiple formats in the OpenAI universal format and support multiple model compatible layers. It also provides a custom build and install option for highly customizable deployments. Chat Nio is an open-source project licensed under the Apache License 2.0 and welcomes contributions from the community.

github: 2.8k

ROSGPT_Vision

ROSGPT_Vision is a new robotic framework designed to command robots using only two prompts: a Visual Prompt for visual semantic features and an LLM Prompt to regulate robotic reactions. It is based on the Prompting Robotic Modalities (PRM) design pattern and is used to develop CarMate, a robotic application for monitoring driver distractions and providing real-time vocal notifications. The framework leverages state-of-the-art language models to facilitate advanced reasoning about image data and offers a unified platform for robots to perceive, interpret, and interact with visual data through natural language. LangChain is used for easy customization of prompts, and the implementation includes the CarMate application for driver monitoring and assistance.

github: 74

deepflow

DeepFlow is an open-source project that provides deep observability for complex cloud-native and AI applications. It offers Zero Code data collection with eBPF for metrics, distributed tracing, request logs, and function profiling. DeepFlow is integrated with SmartEncoding to achieve Full Stack correlation and efficient access to all observability data. With DeepFlow, cloud-native and AI applications automatically gain deep observability, removing the burden of developers continually instrumenting code and providing monitoring and diagnostic capabilities covering everything from code to infrastructure for DevOps/SRE teams.

github: 3.2k

persian-license-plate-recognition

The Persian License Plate Recognition (PLPR) system is a state-of-the-art solution designed for detecting and recognizing Persian license plates in images and video streams. Leveraging advanced deep learning models and a user-friendly interface, it ensures reliable performance across different scenarios. The system offers advanced detection using YOLOv5 models, precise recognition of Persian characters, real-time processing capabilities, and a user-friendly GUI. It is well-suited for applications in traffic monitoring, automated vehicle identification, and similar fields. The system's architecture includes modules for resident management, entrance management, and a detailed flowchart explaining the process from system initialization to displaying results in the GUI. Hardware requirements include an Intel Core i5 processor, 8 GB RAM, a dedicated GPU with at least 4 GB VRAM, and an SSD with 20 GB of free space. The system can be installed by cloning the repository and installing required Python packages. Users can customize the video source for processing and run the application to upload and process images or video streams. The system's GUI allows for parameter adjustments to optimize performance, and the Wiki provides in-depth information on the system's architecture and model training.

github: 345

CodeGPT

CodeGPT is an extension for JetBrains IDEs that provides access to state-of-the-art large language models (LLMs) for coding assistance. It offers a range of features to enhance the coding experience, including code completions, a ChatGPT-like interface for instant coding advice, commit message generation, reference file support, name suggestions, and offline development support. CodeGPT is designed to keep privacy in mind, ensuring that user data remains secure and private.

github: 1.2k

plandex

Plandex is an open source, terminal-based AI coding engine designed for complex tasks. It uses long-running agents to break up large tasks into smaller subtasks, helping users work through backlogs, navigate unfamiliar technologies, and save time on repetitive tasks. Plandex supports various AI models, including OpenAI, Anthropic Claude, Google Gemini, and more. It allows users to manage context efficiently in the terminal, experiment with different approaches using branches, and review changes before applying them. The tool is platform-independent and runs from a single binary with no dependencies.

github: 11.4k

reductstore

ReductStore is a high-performance time series database designed for storing and managing large amounts of unstructured blob data. It offers features such as real-time querying, batching data, and HTTP(S) API for edge computing, computer vision, and IoT applications. The database ensures data integrity, implements retention policies, and provides efficient data access, making it a cost-effective solution for applications requiring unstructured data storage and access at specific time intervals.

github: 199

burpference

Burpference is an open-source extension designed to capture in-scope HTTP requests and responses from Burp's proxy history and send them to a remote LLM API in JSON format. It automates response capture, integrates with APIs, optimizes resource usage, provides color-coded findings visualization, offers comprehensive logging, supports native Burp reporting, and allows flexible configuration. Users can customize system prompts, API keys, and remote hosts, and host models locally to prevent high inference costs. The tool is ideal for offensive web application engagements to surface findings and vulnerabilities.

github: 92

Hexabot

Hexabot Community Edition is an open-source chatbot solution designed for flexibility and customization, offering powerful text-to-action capabilities. It allows users to create and manage AI-powered, multi-channel, and multilingual chatbots with ease. The platform features an analytics dashboard, multi-channel support, visual editor, plugin system, NLP/NLU management, multi-lingual support, CMS integration, user roles & permissions, contextual data, subscribers & labels, and inbox & handover functionalities. The directory structure includes frontend, API, widget, NLU, and docker components. Prerequisites for running Hexabot include Docker and Node.js. The installation process involves cloning the repository, setting up the environment, and running the application. Users can access the UI admin panel and live chat widget for interaction. Various commands are available for managing the Docker services. Detailed documentation and contribution guidelines are provided for users interested in contributing to the project.

github: 692

deepdoctection

**deep** doctection is a Python library that orchestrates document extraction and document layout analysis tasks using deep learning models. It does not implement models but enables you to build pipelines using highly acknowledged libraries for object detection, OCR and selected NLP tasks and provides an integrated framework for fine-tuning, evaluating and running models. For more specific text processing tasks use one of the many other great NLP libraries. **deep** doctection focuses on applications and is made for those who want to solve real world problems related to document extraction from PDFs or scans in various image formats.

**deep** doctection provides model wrappers of supported libraries for various tasks to be integrated into pipelines. Its core function does not depend on any specific deep learning library. Selected models for the following tasks are currently supported:

* Document layout analysis including table recognition in Tensorflow with **Tensorpack**, or PyTorch with **Detectron2**.
* OCR with support of **Tesseract**, **DocTr** (Tensorflow and PyTorch implementations available) and a wrapper to an API for a commercial solution.
* Text mining for native PDFs with **pdfplumber**.
* Language detection with **fastText**.
* Deskewing and rotating images with **jdeskew**.
* Document and token classification with all LayoutLM models provided by the **Transformer library**. (Yes, you can use any LayoutLM-model with any of the provided OCR-or pdfplumber tools straight away!)
* Table detection and table structure recognition with **table-transformer**.
* There is a small dataset for token classification available and a lot of new tutorials that show how to train and evaluate this dataset using LayoutLMv1, LayoutLMv2, LayoutXLM and LayoutLMv3.
* Comprehensive configuration of **analyzer** like choosing different models, output parsing, OCR selection. Check this notebook or the docs for more information.
* Document layout analysis and table recognition now runs with **Torchscript** (CPU) as well and **Detectron2** is not required anymore for basic inference.
* [**new**] More angle predictors for determining the rotation of a document based on **Tesseract** and **DocTr** (not contained in the built-in Analyzer).
* [**new**] Token classification with **LiLT** via **transformers**. We have added a model wrapper for token classification with LiLT and added some LiLT models to the model catalog that seem to look promising, especially if you want to train a model on non-english data. The training script for LayoutLM can be used for LiLT as well and we will be providing a notebook on how to train a model on a custom dataset soon.

**deep** doctection provides on top of that methods for pre-processing inputs to models like cropping or resizing and to post-process results, like validating duplicate outputs, relating words to detected layout segments or ordering words into contiguous text. You will get an output in JSON format that you can customize even further by yourself. Have a look at the **introduction notebook** in the notebook repo for an easy start. Check the **release notes** for recent updates.

**deep** doctection or its support libraries provide pre-trained models that are in most of the cases available at the **Hugging Face Model Hub** or that will be automatically downloaded once requested. For instance, you can find pre-trained object detection models from the Tensorpack or Detectron2 framework for coarse layout analysis, table cell detection and table recognition.

Training is a substantial part to get pipelines ready on some specific domain, let it be document layout analysis, document classification or NER. **deep** doctection provides training scripts for models that are based on trainers developed from the library that hosts the model code. Moreover, **deep** doctection hosts code to some well established datasets like **Publaynet** that makes it easy to experiment. It also contains mappings from widely used data formats like COCO and it has a dataset framework (akin to **datasets**) so that setting up training on a custom dataset becomes very easy. **This notebook** shows you how to do this.

**deep** doctection comes equipped with a framework that allows you to evaluate predictions of a single or multiple models in a pipeline against some ground truth. Check again **here** how it is done. Having set up a pipeline it takes you a few lines of code to instantiate the pipeline and after a for loop all pages will be processed through the pipeline.

github: 2.8k

hal-9100

This repository is now archived and the code is privately maintained. If you are interested in this infrastructure, please contact the maintainer directly.

github: 353

OpenCAGE

OpenCAGE is an open-source modding toolkit for Alien: Isolation, enabling custom scripting, configuration, and content modification through graphical interfaces. It includes tools for editing assets, configurations, scripts, behaviour trees, launching the game, and managing backups. The project is constantly evolving with a roadmap that includes features like contextual script editing, content porter, new level creator, mod installers, 3D viewer improvements, navmesh generation, skinned meshes support, sound import/export, and more. OpenCAGE is supported financially by the community and welcomes code contributions.

github: 278

For similar tasks

greenmask

github: 1.3k

For similar jobs

lollms-webui

LoLLMs WebUI (Lord of Large Language Multimodal Systems: One tool to rule them all) is a user-friendly interface to access and utilize various LLM (Large Language Models) and other AI models for a wide range of tasks. With over 500 AI expert conditionings across diverse domains and more than 2500 fine tuned models over multiple domains, LoLLMs WebUI provides an immediate resource for any problem, from car repair to coding assistance, legal matters, medical diagnosis, entertainment, and more. The easy-to-use UI with light and dark mode options, integration with GitHub repository, support for different personalities, and features like thumb up/down rating, copy, edit, and remove messages, local database storage, search, export, and delete multiple discussions, make LoLLMs WebUI a powerful and versatile tool.

github: 4.6k

Azure-Analytics-and-AI-Engagement

The Azure-Analytics-and-AI-Engagement repository provides packaged Industry Scenario DREAM Demos with ARM templates (containing a demo web application, Power BI reports, Synapse resources, AML Notebooks, etc.) that can be deployed in a customer’s subscription using the CAPE tool within a few hours. Partners can also deploy DREAM Demos in their own subscriptions using DPoC.

github: 136

minio

MinIO is a High Performance Object Storage released under GNU Affero General Public License v3.0. It is API compatible with Amazon S3 cloud storage service. Use MinIO to build high performance infrastructure for machine learning, analytics and application data workloads.

github: 46.0k

mage-ai

Mage is an open-source data pipeline tool for transforming and integrating data. It offers an easy developer experience, engineering best practices built-in, and data as a first-class citizen. Mage makes it easy to build, preview, and launch data pipelines, and provides observability and scaling capabilities. It supports data integrations, streaming pipelines, and dbt integration.

github: 7.8k

AiTreasureBox

AiTreasureBox is a versatile AI tool that provides a collection of pre-trained models and algorithms for various machine learning tasks. It simplifies the process of implementing AI solutions by offering ready-to-use components that can be easily integrated into projects. With AiTreasureBox, users can quickly prototype and deploy AI applications without the need for extensive knowledge in machine learning or deep learning. The tool covers a wide range of tasks such as image classification, text generation, sentiment analysis, object detection, and more. It is designed to be user-friendly and accessible to both beginners and experienced developers, making AI development more efficient and accessible to a wider audience.

github: 368

tidb

TiDB is an open-source distributed SQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability.

github: 37.1k

airbyte

Airbyte is an open-source data integration platform that makes it easy to move data from any source to any destination. With Airbyte, you can build and manage data pipelines without writing any code. Airbyte provides a library of pre-built connectors that make it easy to connect to popular data sources and destinations. You can also create your own connectors using Airbyte's no-code Connector Builder or low-code CDK. Airbyte is used by data engineers and analysts at companies of all sizes to build and manage their data pipelines.

github: 17.8k

labelbox-python

Labelbox is a data-centric AI platform for enterprises to develop, optimize, and use AI to solve problems and power new products and services. Enterprises use Labelbox to curate data, generate high-quality human feedback data for computer vision and LLMs, evaluate model performance, and automate tasks by combining AI and human-centric workflows. The academic & research community uses Labelbox for cutting-edge AI research.

github: 135
