fuse-med-ml
A python framework accelerating ML based discovery in the medical field by encouraging code reuse. Batteries included :)
FuseMedML is a Python framework designed to accelerate machine learning-based discovery in the medical field by promoting code reuse. It provides a flexible design concept where data is stored in a nested dictionary, allowing easy handling of multi-modality information. The framework includes components for creating custom models, loss functions, metrics, and data processing operators. Additionally, FuseMedML offers 'batteries included' key components such as fuse.data for data processing, fuse.eval for model evaluation, and fuse.dl for reusable deep learning components. It supports PyTorch and PyTorch Lightning libraries and encourages the creation of domain extensions for specific medical domains.
README:
A python framework accelerating ML based discovery in the medical field by encouraging code reuse. Batteries included :)
FuseMedML is part of the PyTorch Ecosystem.
- Installation instructions
- Complete code examples
- Community support
- Contributing to FuseMedML guide
- Citation info
Analyzing many ML research projects, we discovered that:
- Project bring-up takes far too long, even when very similar projects were already done in the past by the same lab!
- Porting individual components across projects was painful - resulting in "reinventing the wheel" time after time
Storing data in a nested (hierarchical) dictionary is a key aspect of FuseMedML (called "fuse" for short). It's a key driver of flexibility, and it makes it easy to deal with multi-modality information.
from fuse.utils import NDict

sample_ndict = NDict()
sample_ndict['input.mri'] = ...                          # e.g., an MRI volume
sample_ndict['input.ct_view_a'] = ...                    # e.g., one CT view
sample_ndict['input.ct_view_b'] = ...                    # e.g., another CT view
sample_ndict['groundtruth.disease_level_label'] = ...    # e.g., the ground-truth label
This data can be a single sample, a minibatch, an entire epoch, or anything else that is desired. The "nested key" ("a.b.c.d.etc") is called a "path key", as it can be seen as a path inside the nested dictionary.
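For instance, here is a minimal sketch of the same path keys used at minibatch level; the tensor shapes below are hypothetical, and the only API relied on is the dotted-key indexing shown above:

import torch
from fuse.utils import NDict

# the same dotted path keys work whether the dict holds a single sample or a whole minibatch
batch_dict = NDict()
batch_dict['input.mri'] = torch.zeros(4, 1, 32, 32, 32)                    # hypothetical batched MRI tensor
batch_dict['groundtruth.disease_level_label'] = torch.tensor([0, 1, 2, 1])
labels = batch_dict['groundtruth.disease_level_label']                     # read back using the same path key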
Components are written in a way that allows defining input and output keys, to be read from and written to the nested dict. See a short introduction video (3 minutes) on how FuseMedML components work:
https://user-images.githubusercontent.com/7043815/177197158-d3ea0736-629e-4dcb-bd5e-666993fbcfa2.mp4
A multi-head model is a FuseMedML-style component that allows easy reuse across projects:
ModelMultiHead(
    conv_inputs=(('data.input.img', 1),),        # input to the backbone model
    backbone=BackboneResnet3D(in_channels=1),    # PyTorch nn.Module
    heads=[                                      # list of heads - gives the option to support a multi-task / multi-head approach
        Head3D(
            head_name='classification',
            mode='classification',
            conv_inputs=[('model.backbone_features', 512)],  # input to the classification head
        ),
    ],
)
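To illustrate how such a model is typically used, here is a minimal sketch of a forward pass. It assumes the ModelMultiHead above has been assigned to a variable named model, and that - as in typical FuseMedML usage - the model reads its inputs from the nested batch dict and writes its outputs back into it under 'model.*' path keys (the key 'model.logits.classification' matches the loss example below); the input shape is hypothetical:

import torch
from fuse.utils import NDict

batch_dict = NDict()
batch_dict['data.input.img'] = torch.zeros(2, 1, 32, 256, 256)   # hypothetical batch of 3D images
batch_dict = model(batch_dict)                                    # the model reads 'data.input.img' and writes its outputs back
logits = batch_dict['model.logits.classification']                # head outputs land under 'model.*' path keys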
Our default loss implementation is an easy wrapper around a callable function, in FuseMedML style:
LossDefault(
    pred='model.logits.classification',            # input - model prediction scores
    target='data.label',                           # input - ground truth labels
    callable=torch.nn.functional.cross_entropy,    # callable - function that receives the prediction scores and labels extracted from batch_dict and computes the loss
)
An example metric that can be used:
MetricAUCROC(
    pred='model.output',   # input - model prediction scores
    target='data.label',   # input - ground truth labels
)
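In practice, components like these are usually collected into plain dictionaries keyed by name and handed to whatever training loop or wrapper you use. A minimal sketch, reusing the same path keys as above (the dictionary names here are just conventions, not a required API):

losses = {
    'cls_loss': LossDefault(
        pred='model.logits.classification',
        target='data.label',
        callable=torch.nn.functional.cross_entropy,
    ),
}
train_metrics = {
    'auc': MetricAUCROC(pred='model.output', target='data.label'),
}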
Note that several components return answers directly rather than writing them into the nested dictionary. This is perfectly fine, and to allow maximum flexibility we do not require any usage of output path keys.
Creating custom FuseMedML components is easy - in the following example we add a new data processing operator:
A data pipeline operator
from typing import List, Optional

import numpy as np

from fuse.data.ops.op_base import OpBase  # import path assumed - adjust to your FuseMedML version
from fuse.utils import NDict

class OpPad(OpBase):
    def __call__(
        self,
        sample_dict: NDict,
        key_in: str,
        padding: List[int],
        fill: int = 0,
        mode: str = 'constant',
        key_out: Optional[str] = None,
    ):
        # extract the element in the defined key location (for example 'input.xray_img')
        img = sample_dict[key_in]
        assert isinstance(img, np.ndarray), f'Expected np.ndarray but got {type(img)}'
        processed_img = np.pad(img, pad_width=padding, mode=mode, constant_values=fill)
        # store the result in the requested output key (or in key_in if no key_out is provided)
        key_out = key_in if key_out is None else key_out
        sample_dict[key_out] = processed_img
        # return the modified nested dict
        return sample_dict
Since the key location isn't hardcoded, this module can be easily reused across different research projects with very different data sample structures. More code reuse - Hooray!
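As a quick illustration, the operator can also be applied directly to a nested sample dict; the key name and array below are purely illustrative, and it is assumed here that OpBase can be constructed without arguments:

import numpy as np
from fuse.utils import NDict

sample = NDict()
sample['input.xray_img'] = np.zeros((8, 8), dtype=np.float32)              # hypothetical image
sample = OpPad()(sample, key_in='input.xray_img', padding=[2, 2], fill=0)  # pad 2 pixels on every edge
print(sample['input.xray_img'].shape)                                      # (12, 12)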
FuseMedML-style components in general are any classes or functions that define which key paths will be written and which will be read. Arguments can be freely named, and you don't even have to write anything to the nested dict. Some FuseMedML components return a value directly - for example, loss functions.
fuse.data - a declarative, super-flexible data processing pipeline
- Easy handling of complex multi-modality scenarios
- Advanced caching, including periodic audits to automatically detect stale caches
- Default ready-to-use Dataset and Sampler classes
- See the detailed introduction here
fuse.eval - a standalone library for evaluating ML models (not necessarily trained with FuseMedML)
The package includes a collection of off-the-shelf metrics and utilities such as statistical significance tests, calibration, thresholding, model comparison and more. See the detailed introduction here.
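A rough sketch of standalone evaluation over tabular predictions follows. The import paths, the eval() arguments, and the expected per-class probability format mirror common FuseMedML examples but should be treated as assumptions and checked against the fuse.eval documentation:

import numpy as np
import pandas as pd
from fuse.eval.evaluator import EvaluatorDefault                                          # assumed import path
from fuse.eval.metrics.classification.metrics_classification_common import MetricAUCROC   # assumed import path

# here the "nested dict" is simply a flat table: column names act as the keys
data = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'pred': [np.array([0.9, 0.1]), np.array([0.2, 0.8]), np.array([0.6, 0.4]), np.array([0.3, 0.7])],  # assumed per-class scores
    'label': [0, 1, 0, 1],
})
metrics = {'auc': MetricAUCROC(pred='pred', target='label')}
results = EvaluatorDefault().eval(ids=None, data=data, metrics=metrics)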
fuse.dl - reusable dl (deep learning) model architecture components, loss functions, etc.
Some components depend on PyTorch. For example, fuse.data is oriented towards PyTorch's Dataset, DataLoader and Sampler classes, and fuse.dl makes heavy use of PyTorch models. Other components, such as fuse.eval, do not depend on any specific DL library.
Broadly speaking, the supported DL libraries are:
- "Pure" pytorch
- pytorch-lightning
Before you ask - pytorch-lightning and FuseMedML play along very nicely and have in practice orthogonal and additive benefits :) See Simple FuseMedML + PytorchLightning Example for simple supervised learning cases, and this example for completely custom usage of pytorch-lightning and FuseMedML - useful for advanced scenarios such as Reinforcement Learning and generative models.
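For orientation, here is a rough sketch of the "completely custom" pattern: a plain pytorch-lightning module that drives the FuseMedML-style components defined above. It assumes the model writes its outputs into the batch dict and that each FuseMedML loss is callable on the batch dict and returns a scalar tensor; see the linked examples for the exact, supported APIs:

import pytorch_lightning as pl
import torch

class MyLightningModule(pl.LightningModule):
    def __init__(self, model, losses):
        super().__init__()
        self.model = model      # e.g., the ModelMultiHead above
        self.losses = losses    # e.g., the losses dict above

    def training_step(self, batch_dict, batch_idx):
        batch_dict = self.model(batch_dict)                             # model writes its outputs into the batch_dict
        return sum(loss(batch_dict) for loss in self.losses.values())   # each loss reads its own path keys from the batch_dict

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)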
fuse-med-ml, the core library, is completely domain agnostic! Domain extensions are optionally installable packages that deal with specific (sub) domains. For example:
- fuseimg which was battle-tested in many medical imaging related projects (different organs, imaging modalities, tasks, etc.)
- fusedrug (to be released soon) which focuses on molecular biology and chemistry - prediction, generation and more
Domain extensions contain concrete implementation of components and components parts within the relevant domain, for example:
- Data pipeline operations - for example, a 3d affine transformation of a 3d image
- Evaluation metrics - for example, a custom metric evaluating docking of a potential drug with a protein target
- Loss functions - for example, a custom segmentation evaluation loss
The recommended directory structure mimics the fuse-med-ml core structure:
your_package
    data    # everything related to datasets, samplers, data processing pipeline Ops, etc.
    dl      # everything related to deep learning architectures, optimizers, loss functions, etc.
    eval    # evaluation metrics
    utils   # any utilities
You are highly encouraged to create additional domain extensions and/or contribute to the existing ones! There's no need to wait for any approval - you can create domain extensions in your own repos right away.
Note - in general, we find it helpful to follow the same directory structure shown above even in small and specific research projects that use FuseMedML, for consistency and an easy landing for newcomers to your project :)
FuseMedML is tested on Python >= 3.9 and PyTorch >= 2.0
Create a conda environment using the following command (you can replace FUSEMEDML with your preferred environment name):
conda create -n FUSEMEDML python=3.9
conda activate FUSEMEDML
Next, install PyTorch and its corresponding cudatoolkit. See here for the exact command that suits your local environment. For example:
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
and then do Option 1 or Option 2 below inside the activated conda env
Option 1: The best way to install FuseMedML is to clone the repository and install it in editable mode using pip:
$ pip install -e .[all]
This mode installs all the currently publicly available domain extensions - fuseimg as of now, fusedrug will be added soon.
To install FuseMedML together with the included collection of examples, use:
$ pip install -e .[all,examples]
Option 2: Alternatively, install FuseMedML directly from PyPI:
$ pip install fuse-med-ml[all]
or, with examples:
$ pip install fuse-med-ml[all,examples]
Examples:
- Easy access "Hello World" colab notebook
- Classification
  - MNIST - a simple example, including training, inference and evaluation over the MNIST dataset
  - STOIC - severe COVID-19 classifier baseline given a Computed Tomography (CT) scan, age group and gender. Challenge description
  - KNIGHT Challenge - preoperative prediction of risk class for patients with renal masses identified in clinical Computed Tomography (CT) imaging of the kidneys, including data pre-processing, baseline implementation and an evaluation pipeline for the challenge.
  - Multimodality tutorial - demonstration of two popular simple methods for integrating imaging and clinical (tabular) data using FuseMedML
  - Skin Lesion - skin lesion classification, including training, inference and evaluation over the public dataset introduced in the ISIC challenge
  - Breast Cancer Lesion Classification - classification of tumor lesions (benign, malignant) in breast mammography over the public dataset introduced in The Chinese Mammography Database (CMMD)
  - Mortality prediction for ICU patients - an EHR transformer applied to data of Intensive Care Unit patients for in-hospital mortality prediction. The dataset is from the PhysioNet Computing in Cardiology Challenge (2012)
- Pre-training
  - Medical Imaging Pre-training and Downstream Task Validation - pre-training a model on 3D MRI medical imaging and then using it for classification and segmentation downstream tasks.
- Walkthrough Template - includes several TODO notes, marking the minimal scope of code required to get your pipeline up and running. The template also includes useful explanations and tips.
- Slack workspace at fusemedml.slack.com for informal communication - click here to join
- Github Discussions
If you use FuseMedML in a scientific context, please consider citing our JOSS paper:
@article{Golts2023,
  doi = {10.21105/joss.04943},
  url = {https://doi.org/10.21105/joss.04943},
  year = {2023},
  publisher = {The Open Journal},
  volume = {8},
  number = {81},
  pages = {4943},
  author = {Alex Golts and Moshe Raboh and Yoel Shoshan and Sagi Polaczek and Simona Rabinovici-Cohen and Efrat Hexter},
  title = {FuseMedML: a framework for accelerated discovery in machine learning based biomedicine},
  journal = {Journal of Open Source Software}
}