pyomop

Python package for managing OHDSI clinical data models. Includes support for LLM based plain text queries, MCP server and FHIR import.

Stars: 57

Visit

pyomop is a versatile tool designed as an OMOP Swiss Army Knife for working with OHDSI OMOP Common Data Model (CDM) v5.4 or v6 compliant databases using SQLAlchemy as the ORM. It supports converting query results to pandas DataFrames for machine learning pipelines and provides utilities for working with OMOP vocabularies. The tool is lightweight, easy-to-use, and can be used both as a command-line tool and as an imported library in code. It supports SQLite, PostgreSQL, and MySQL databases, LLM-based natural language queries, FHIR to OMOP conversion utilities, and executing QueryLibrary.

README:

pyomop: OMOP Swiss Army Knife 🔧

✨ Overview

pyomop is your OMOP Swiss Army Knife 🔧 for working with OHDSI OMOP Common Data Model (CDM) v5.4 or v6 compliant databases using SQLAlchemy as the ORM. It supports converting query results to pandas DataFrames for machine learning pipelines and provides utilities for working with OMOP vocabularies. Table definitions are based on the omop-cdm library. Pyomop is designed to be a lightweight, easy-to-use library for researchers and developers experimenting and testing with OMOP CDM databases. It can be used both as a commandline tool and as an imported library in your code.

Supports SQLite, PostgreSQL, and MySQL. CDM and Vocab tables are created in the same schema. (See usage below for more details)
LLM-based natural language queries via langchain. Usage.
🔥 FHIR to OMOP conversion utilities. (See usage below for more details)
Execute QueryLibrary. (See usage below for more details)

Please ⭐️ If you find this project useful!

Installation

Stable release:

pip install pyomop

Development version:

git clone https://github.com/dermatologist/pyomop.git
cd pyomop
pip install -e .

LLM support:

pip install pyomop[llm]

✨ See this notebook or script for examples. 👇 MCP SERVER is recommended for advanced usage.

Docker

A docker-compose is provided to quickly set up an environment with postgrs, webapi, atlas and a sql script to create a source in webapi. The script can be run using the psql command line tool or via the webapi UI. Please refresh after running the script by sending a request to /WebAPI/source/refresh.

🔧 Usage

import asyncio
import datetime

from sqlalchemy import select

from pyomop import CdmEngineFactory, CdmVector, CdmVocabulary
# cdm6 and cdm54 are supported
from pyomop.cdm54 import Base, Cohort, Person, Vocabulary

async def main():
    cdm = CdmEngineFactory() # Creates SQLite database by default for fast testing
    # cdm = CdmEngineFactory(db='pgsql', host='', port=5432,
    #                       user='', pw='',
    #                       name='', schema='')
    # cdm = CdmEngineFactory(db='mysql', host='', port=3306,
    #                       user='', pw='',
    #                       name='')
    engine = cdm.engine
    # Comment the following line if using an existing database. Both cdm6 and cdm54 are supported, see the import statements above
    await cdm.init_models(Base.metadata) # Initializes the database with the OMOP CDM tables
    vocab = CdmVocabulary(cdm, version='cdm54') # or 'cdm6' for v6
    # Uncomment the following line to create a new vocabulary from CSV files
    # vocab.create_vocab('/path/to/csv/files')

    async with cdm.session() as session:  # type: ignore
        # Add Persons
        async with session.begin():
            session.add(
                Person(
                    person_id=100,
                    gender_concept_id=8532,
                    gender_source_concept_id=8512,
                    year_of_birth=1980,
                    month_of_birth=1,
                    day_of_birth=1,
                    birth_datetime=datetime.datetime(1980, 1, 1),
                    race_concept_id=8552,
                    race_source_concept_id=8552,
                    ethnicity_concept_id=38003564,
                    ethnicity_source_concept_id=38003564,
                )
            )
            session.add(
                Person(
                    person_id=101,
                    gender_concept_id=8532,
                    gender_source_concept_id=8512,
                    year_of_birth=1980,
                    month_of_birth=1,
                    day_of_birth=1,
                    birth_datetime=datetime.datetime(1980, 1, 1),
                    race_concept_id=8552,
                    race_source_concept_id=8552,
                    ethnicity_concept_id=38003564,
                    ethnicity_source_concept_id=38003564,
                )
            )

        # Query the Person
        stmt = select(Person).where(Person.person_id == 100)
        result = await session.execute(stmt)
        for row in result.scalars():
            print(row)
            assert row.person_id == 100

        # Query the person pattern 2
        person = await session.get(Person, 100)
        print(person)
        assert person is not None
        assert person.person_id == 100

    # Convert result to a pandas dataframe
    vec = CdmVector()

    # https://github.com/OHDSI/QueryLibrary/blob/master/inst/shinyApps/QueryLibrary/queries/person/PE02.md
    result = await vec.query_library(cdm, resource='person', query_name='PE02')
    df = vec.result_to_df(result)
    print("DataFrame from result:")
    print(df.head())

    result = await vec.execute(cdm, query='SELECT * from person;')
    print("Executing custom query:")
    df = vec.result_to_df(result)
    print("DataFrame from result:")
    print(df.head())

    # Close engine
    await engine.dispose() # type: ignore

# Run the main function
asyncio.run(main())

🔥 FHIR to OMOP mapping

pyomop can load FHIR Bulk Export (NDJSON) files into an OMOP CDM database.

Sample datasets: https://github.com/smart-on-fhir/sample-bulk-fhir-datasets
Remove any non-FHIR files (for example, log.ndjson) from the input folder.
Download OMOP vocabulary CSV files (for example from OHDSI Athena) and place them in a folder.

Run:

pyomop --create --vocab ~/Downloads/omop-vocab/ --input ~/Downloads/fhir/

This will create an OMOP CDM in SQLite, load the vocabulary files, and import the FHIR data from the input folder and reconcile vocabulary, mapping source_value to concept_id. The mapping is defined in the mapping.example.json file. The default mapping is here. Mapping happens in 5 steps as implemented here.

Example using postgres (Docker)

pyomop --dbtype pgsql --host localhost --user postgres --pw mypass  --create --vocab ~/Downloads/omop-vocab/ --input ~/Downloads/fhir/

FHIR to data frame mapping is done with FHIRy
Most of the code for this functionality was written by an LLM agent. The prompts used are here

Command-line

  -c, --create                Create CDM tables (see --version).
  -t, --dbtype TEXT           Database Type for creating CDM (sqlite, mysql or
                              pgsql)
  -h, --host TEXT             Database host
  -p, --port TEXT             Database port
  -u, --user TEXT             Database user
  -w, --pw TEXT               Database password
  -v, --version TEXT          CDM version (cdm54 (default) or cdm6)
  -n, --name TEXT             Database name
  -s, --schema TEXT           Database schema (for pgsql)
  -i, --vocab TEXT            Folder with vocabulary files (csv) to import
  -f, --input DIRECTORY       Input folder with FHIR bundles or ndjson files.
  -e, --eunomia-dataset TEXT  Download and load Eunomia dataset (e.g.,
                              'GiBleed', 'Synthea')
  --eunomia-path TEXT         Path to store/find Eunomia datasets (uses
                              EUNOMIA_DATA_FOLDER env var if not specified)
  --connection-info           Display connection information for the database (For R package compatibility)
  --mcp-server                Start MCP server for stdio interaction
  --pyhealth-path TEXT        Path to export PyHealth compatible CSV files
  --help                      Show this message and exit.

MCP Server

pyomop includes an MCP (Model Context Protocol) server that exposes tools for interacting with OMOP CDM databases. This allows MCP clients to create databases, load data, and execute SQL statements.

Usage with MCP Clients

The MCP server can be used with any MCP-compatible client such as Claude desktop. Example configuration for VSCODE as below is already provided in the repository. So if you are viewing this in VSCODE, you can start server and enable tools directly in Copilot.

{
  "servers": {
      "pyomop": {
      "command": "uv",
      "args": ["run", "pyomop", "--mcp-server"]
    }
  }
}

If the vocabulary is not installed locally or advanced vocabulary support is required from Athena, it is recommended to combine omop_mcp with PyOMOP.

Available MCP Tools

create_cdm: Create an empty CDM database
create_eunomia: Add Eunomia sample dataset
get_table_columns: Get column names for a specific table
get_single_table_info: Get detailed table information, including foreign keys
get_usable_table_names: Get a list of all available table names
run_sql: Execute SQL statements with error handling
example_query: Get example queries for specific OMOP CDM tables from OHDSI QueryLibrary
check_sql: Validate SQL query syntax before execution

create_cdm and create_eunomia support only local sqlite databases to avoid inadvertent data loss in production databases.

HTTP Transport Support

The MCP server now supports both stdio (default) and HTTP transports:

Stdio transport (default):

pyomop --mcp-server
# or
pyomop-mcp-server

HTTP transport:

pyomop-mcp-server-http
# or with custom host/port
pyomop-mcp-server-http --host 0.0.0.0 --port 8000
# or via Python module
python -m pyomop.mcp.server --http --host 0.0.0.0 --port 8000

To use HTTP transport, install additional dependencies:

pip install pyomop[http]
# or for both LLM and HTTP features
pip install pyomop[llm,http]

Available Prompts

query_execution_steps: Provides step-by-step guidance for executing database queries based on free text instructions

Eunomia import and cohort creation

pyomop -e Synthea27Nj -v 5.4 --connection-info
pyomop -e GiBleed -v 5.3 --connection-info

PyHealth and PLP Compatibility (For Machine Learning pipelines)

pyomop supports exporting OMOP CDM data (to --pyhealth-path) in a format compatible with PyHealth, a machine learning library for healthcare data analysis (See Notebook and usage below). Additionally, you can export the connection information for use with the various R packages such as PatientLevelPrediction using the --connection-info option.

pyomop -e GiBleed -v 5.3 --connection-info --pyhealth-path ~/pyhealth

Additional Tools

Convert FHIR to pandas DataFrame: fhiry
.NET and Golang OMOP CDM: .NET, Golang

Supported Databases

PostgreSQL
MySQL
SQLite

Environment Variables for Database Connection

You can configure database connection parameters using environment variables. These will be used as defaults by pyomop and the MCP server:

PYOMOP_DB: Database type (sqlite, mysql, pgsql)
PYOMOP_HOST: Database host
PYOMOP_PORT: Database port
PYOMOP_USER: Database user
PYOMOP_PW: Database password
PYOMOP_SCHEMA: Database schema (for PostgreSQL)

Example usage:

export PYOMOP_DB=pgsql
export PYOMOP_HOST=localhost
export PYOMOP_PORT=5432
export PYOMOP_USER=postgres
export PYOMOP_PW=mypass
export PYOMOP_SCHEMA=omop

These environment variables will be checked before assigning default values for database connection in pyomop and MCP server tools.

🗄️ Agent Assisted ETL - Work in progress

Use --migrate to run the generic loader from the command line. Provide source-database connection details with --src-* options; target-database details use the standard --dbtype / --host / … options.

# SQLite source → SQLite OMOP target
pyomop-migrate --migrate \
  --src-dbtype sqlite --src-name source.sqlite \
  --dbtype sqlite --name omop.sqlite \
  --mapping mapping.json

# PostgreSQL source → PostgreSQL OMOP target
pyomop-migrate --migrate \
  --src-dbtype pgsql --src-host srchost --src-user reader --src-pw secret --src-name ehr \
  --dbtype pgsql --host omophost --user writer --pw secret --name omop \
  --mapping ehr_to_omop.json --batch-size 500

Source connection credentials can also be provided via environment variables (SRC_DB_HOST, SRC_DB_PORT, SRC_DB_USER, SRC_DB_PASSWORD, SRC_DB_NAME) to avoid exposing passwords in the shell history.

Schema extraction

Use --extract-schema to generate a Markdown document describing the source database schema (tables, columns, types, PK/FK relationships). This is especially useful for feeding to an AI agent to generate the mapping JSON.

pyomop-migrate --extract-schema \
  --src-dbtype sqlite --src-name source.sqlite \
  --schema-output schema.md

The same SRC_DB_* environment variables are supported for credentials.

Plan

Use the extracted schema to generate a mapping JSON using an appropriate agentic skill.

See the bundled example mapping and the full documentation for all supported options.

Contributing

Pull requests are welcome! See CONTRIBUTING.md.

Contributors

Bell Eapen

For Tasks:

Click tags to check more tools for each tasks

analyze patient data convert fhir to omop execute database queries work with omop vocabularies create machine learning pipelines

For Jobs:

data scientist healthcare analyst clinical researcher database administrator bioinformatics specialist

Alternative AI tools for pyomop

Similar Open Source Tools

pyomop

github

: 57

graphiti

Graphiti is a framework for building and querying temporally-aware knowledge graphs, tailored for AI agents in dynamic environments. It continuously integrates user interactions, structured and unstructured data, and external information into a coherent, queryable graph. The framework supports incremental data updates, efficient retrieval, and precise historical queries without complete graph recomputation, making it suitable for developing interactive, context-aware AI applications.

github

: 23.0k

AutoDocs

AutoDocs by Sita is a tool designed to automate documentation for any repository. It parses the repository using tree-sitter and SCIP, constructs a code dependency graph, and generates repository-wide, dependency-aware documentation and summaries. It provides a FastAPI backend for ingestion/search and a Next.js web UI for chat and exploration. Additionally, it includes an MCP server for deep search capabilities. The tool aims to simplify the process of generating accurate and high-signal documentation for codebases.

github

: 134

easy-dataset

Easy Dataset is a specialized application designed to streamline the creation of fine-tuning datasets for Large Language Models (LLMs). It offers an intuitive interface for uploading domain-specific files, intelligently splitting content, generating questions, and producing high-quality training data for model fine-tuning. With Easy Dataset, users can transform domain knowledge into structured datasets compatible with all OpenAI-format compatible LLM APIs, making the fine-tuning process accessible and efficient.

github

: 11.0k

MCPSharp

MCPSharp is a .NET library that helps build Model Context Protocol (MCP) servers and clients for AI assistants and models. It allows creating MCP-compliant tools, connecting to existing MCP servers, exposing .NET methods as MCP endpoints, and handling MCP protocol details seamlessly. With features like attribute-based API, JSON-RPC support, parameter validation, and type conversion, MCPSharp simplifies the development of AI capabilities in applications through standardized interfaces.

github

: 142

pentagi

PentAGI is an innovative tool for automated security testing that leverages cutting-edge artificial intelligence technologies. It is designed for information security professionals, researchers, and enthusiasts who need a powerful and flexible solution for conducting penetration tests. The tool provides secure and isolated operations in a sandboxed Docker environment, fully autonomous AI-powered agent for penetration testing steps, a suite of 20+ professional security tools, smart memory system for storing research results, web intelligence for gathering information, integration with external search systems, team delegation system, comprehensive monitoring and reporting, modern interface, API integration, persistent storage, scalable architecture, self-hosted solution, flexible authentication, and quick deployment through Docker Compose.

github

: 8.2k

databao-context-engine

Databao Context Engine is a Python library that automatically generates governed semantic context from databases, BI tools, documents, and spreadsheets. It provides accurate, context-aware answers without the need for manual schema copying or documentation writing. The tool can be integrated as a standard Python dependency or via the Databao CLI. It supports various data sources like Athena, BigQuery, MySQL, PDF files, and more, and works with LLMs such as Ollama. Users can create domains, configure data sources, build context, and utilize the built contexts for search queries. The tool is governed, versioned, and supports dynamic or static serving of context via MCP server or export as an artifact. Contributions are welcome, and the tool is licensed under Apache 2.0.

github

: 51

chunkr

Chunkr is an open-source document intelligence API that provides a production-ready service for document layout analysis, OCR, and semantic chunking. It allows users to convert PDFs, PPTs, Word docs, and images into RAG/LLM-ready chunks. The API offers features such as layout analysis, OCR with bounding boxes, structured HTML and markdown output, and VLM processing controls. Users can interact with Chunkr through a Python SDK, enabling them to upload documents, process them, and export results in various formats. The tool also supports self-hosted deployment options using Docker Compose or Kubernetes, with configurations for different AI models like OpenAI, Google AI Studio, and OpenRouter. Chunkr is dual-licensed under the GNU Affero General Public License v3.0 (AGPL-3.0) and a commercial license, providing flexibility for different usage scenarios.

github

: 2.1k

metis

Metis is an open-source, AI-driven tool for deep security code review, created by Arm's Product Security Team. It helps engineers detect subtle vulnerabilities, improve secure coding practices, and reduce review fatigue. Metis uses LLMs for semantic understanding and reasoning, RAG for context-aware reviews, and supports multiple languages and vector store backends. It provides a plugin-friendly and extensible architecture, named after the Greek goddess of wisdom, Metis. The tool is designed for large, complex, or legacy codebases where traditional tooling falls short.

github

: 470

openai-kotlin

OpenAI Kotlin API client is a Kotlin client for OpenAI's API with multiplatform and coroutines capabilities. It allows users to interact with OpenAI's API using Kotlin programming language. The client supports various features such as models, chat, images, embeddings, files, fine-tuning, moderations, audio, assistants, threads, messages, and runs. It also provides guides on getting started, chat & function call, file source guide, and assistants. Sample apps are available for reference, and troubleshooting guides are provided for common issues. The project is open-source and licensed under the MIT license, allowing contributions from the community.

github

: 1.4k

FileScopeMCP

FileScopeMCP is a TypeScript-based tool for ranking files in a codebase by importance, tracking dependencies, and providing summaries. It analyzes code structure, generates importance scores, maps bidirectional dependencies, visualizes file relationships, and allows adding custom summaries. The tool supports multiple languages, persistent storage, and offers tools for file tree management, file analysis, file summaries, diagram generation, and file watching. It is built with TypeScript/Node.js, implements the Model Context Protocol, uses Mermaid.js for diagram generation, and stores data in JSON format. FileScopeMCP aims to enhance code understanding and visualization for developers.

github

: 279

LEANN

LEANN is an innovative vector database that democratizes personal AI, transforming your laptop into a powerful RAG system that can index and search through millions of documents using 97% less storage than traditional solutions without accuracy loss. It achieves this through graph-based selective recomputation and high-degree preserving pruning, computing embeddings on-demand instead of storing them all. LEANN allows semantic search of file system, emails, browser history, chat history, codebase, or external knowledge bases on your laptop with zero cloud costs and complete privacy. It is a drop-in semantic search MCP service fully compatible with Claude Code, enabling intelligent retrieval without changing your workflow.

github

: 10.0k

golf

Golf is a simple command-line tool for calculating the distance between two geographic coordinates. It uses the Haversine formula to accurately determine the distance between two points on the Earth's surface. This tool is useful for developers working on location-based applications or projects that require distance calculations. With Golf, users can easily input latitude and longitude coordinates and get the precise distance in kilometers or miles. The tool is lightweight, easy to use, and can be integrated into various programming workflows.

github

: 776

mmore

MMORE is an open-source, end-to-end pipeline for ingesting, processing, indexing, and retrieving knowledge from various file types such as PDFs, Office docs, images, audio, video, and web pages. It standardizes content into a unified multimodal format, supports distributed CPU/GPU processing, and offers hybrid dense+sparse retrieval with an integrated RAG service through CLI and APIs.

github

: 134

Zero

Zero is an open-source AI email solution that allows users to self-host their email app while integrating external services like Gmail. It aims to modernize and enhance emails through AI agents, offering features like open-source transparency, AI-driven enhancements, data privacy, self-hosting freedom, unified inbox, customizable UI, and developer-friendly extensibility. Built with modern technologies, Zero provides a reliable tech stack including Next.js, React, TypeScript, TailwindCSS, Node.js, Drizzle ORM, and PostgreSQL. Users can set up Zero using standard setup or Dev Container setup for VS Code users, with detailed environment setup instructions for Better Auth, Google OAuth, and optional GitHub OAuth. Database setup involves starting a local PostgreSQL instance, setting up database connection, and executing database commands for dependencies, tables, migrations, and content viewing.

github

: 4.8k

py-llm-core

PyLLMCore is a light-weighted interface with Large Language Models with native support for llama.cpp, OpenAI API, and Azure deployments. It offers a Pythonic API that is simple to use, with structures provided by the standard library dataclasses module. The high-level API includes the assistants module for easy swapping between models. PyLLMCore supports various models including those compatible with llama.cpp, OpenAI, and Azure APIs. It covers use cases such as parsing, summarizing, question answering, hallucinations reduction, context size management, and tokenizing. The tool allows users to interact with language models for tasks like parsing text, summarizing content, answering questions, reducing hallucinations, managing context size, and tokenizing text.

github

: 118

For similar tasks

Open-Medical-Reasoning-Tasks

Open Life Science AI: Medical Reasoning Tasks is a collaborative hub for developing cutting-edge reasoning tasks for Large Language Models (LLMs) in the medical, healthcare, and clinical domains. The repository aims to advance AI capabilities in healthcare by fostering accurate diagnoses, personalized treatments, and improved patient outcomes. It offers a diverse range of medical reasoning challenges such as Diagnostic Reasoning, Treatment Planning, Medical Image Analysis, Clinical Data Interpretation, Patient History Analysis, Ethical Decision Making, Medical Literature Comprehension, and Drug Interaction Assessment. Contributors can join the community of healthcare professionals, AI researchers, and enthusiasts to contribute to the repository by creating new tasks or improvements following the provided guidelines. The repository also provides resources including a task list, evaluation metrics, medical AI papers, and healthcare datasets for training and evaluation.

github

: 80

pyomop

github

: 57

For similar jobs

pyomop

github

: 57

seismometer

Seismometer is a suite of tools designed to evaluate AI model performance in healthcare settings. It helps healthcare organizations assess the accuracy of AI models and ensure equitable care for diverse patient populations. The tool allows users to validate model performance using standardized evaluation criteria based on local data and workflows. It includes templates for analyzing statistical performance, fairness across different cohorts, and the impact of interventions on outcomes. Seismometer is continuously evolving to incorporate new validation and analysis techniques.

github

: 179

ctakes

Apache cTAKES is a clinical Text Analysis and Knowledge Extraction System that focuses on extracting knowledge from clinical text through Natural Language Processing (NLP) techniques. It is modular and employs rule-based and machine learning methods to extract concepts such as symptoms, procedures, diagnoses, medications, and anatomy with attributes and standard codes. cTAKES can identify temporal events, dates, and times, placing events in a patient timeline. It supports various biomedical text processing tasks and can handle different types of clinical and health-related narratives using multiple data standards. cTAKES is widely used in research initiatives and encourages contributions from professionals, researchers, doctors, and students from diverse backgrounds.

github

: 121

weave

Weave is a toolkit for developing Generative AI applications, built by Weights & Biases. With Weave, you can log and debug language model inputs, outputs, and traces; build rigorous, apples-to-apples evaluations for language model use cases; and organize all the information generated across the LLM workflow, from experimentation to evaluations to production. Weave aims to bring rigor, best-practices, and composability to the inherently experimental process of developing Generative AI software, without introducing cognitive overhead.

github

: 1.1k

agentcloud

AgentCloud is an open-source platform that enables companies to build and deploy private LLM chat apps, empowering teams to securely interact with their data. It comprises three main components: Agent Backend, Webapp, and Vector Proxy. To run this project locally, clone the repository, install Docker, and start the services. The project is licensed under the GNU Affero General Public License, version 3 only. Contributions and feedback are welcome from the community.

github

: 583

oss-fuzz-gen

This framework generates fuzz targets for real-world `C`/`C++` projects with various Large Language Models (LLM) and benchmarks them via the `OSS-Fuzz` platform. It manages to successfully leverage LLMs to generate valid fuzz targets (which generate non-zero coverage increase) for 160 C/C++ projects. The maximum line coverage increase is 29% from the existing human-written targets.

github

: 1.2k

LLMStack

LLMStack is a no-code platform for building generative AI agents, workflows, and chatbots. It allows users to connect their own data, internal tools, and GPT-powered models without any coding experience. LLMStack can be deployed to the cloud or on-premise and can be accessed via HTTP API or triggered from Slack or Discord.

github

: 1.5k

VisionCraft

The VisionCraft API is a free API for using over 100 different AI models. From images to sound.

github

: 94