bionemo-framework
BioNeMo Framework: For building and adapting AI models in drug discovery at scale
README:
NVIDIA BioNeMo Framework is a collection of programming tools, libraries, and models for computational drug discovery. It accelerates the most time-consuming and costly stages of building and adapting biomolecular AI models by providing domain-specific, optimized models and tooling that are easily integrated into GPU-based computational resources for the fastest performance on the market. You can access BioNeMo Framework as a free community resource here in this repository, or learn more about getting an enterprise license for improved expert-level support at https://www.nvidia.com/en-us/clara/bionemo/
The bionemo2 code is partitioned into independently installable namespace packages, located under the sub-packages/ directory. Please refer to PEP 420 – Implicit Namespace Packages for details.
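Because each sub-package is independently installable, you can (as a minimal sketch, assuming a working Python environment) install just one of them:
# install only the core package, in editable mode
pip install -e sub-packages/bionemo-core/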
The latest released container for the BioNeMo Framework is available for download through NGC. Comprehensive documentation, including user guides, API references, and troubleshooting information, can be found in our official documentation set at:
https://docs.nvidia.com/bionemo-framework/latest/
For those interested in exploring the latest developments and features not yet included in the released container, we also maintain an up-to-date documentation set that reflects the current state of the main branch. This in-progress documentation can be accessed at:
https://nvidia.github.io/bionemo-framework/
Please note that while this documentation is generally accurate and helpful, it may contain references to features or APIs not yet stabilized or released. As always, we appreciate feedback on our documentation and strive to continually improve its quality.
By contributing to this repo you acknowledge that either this is your original work, or that you have the right to submit the work under our license, which as of this writing is Apache v2. See the license for the current terms, and the contributing document for more information.
If you find yourself having made a number of commits in a PR and need to sign them all, a useful procedure is the following:
- Find your first unsigned commit; say it is mYcmtShrtHash.
- Run git rebase --signoff mYcmtShrtHash^ to sign that commit and all subsequent commits (in your branch, please).
- Push the updated commits: git push -f
The NeMo and Megatron-LM dependencies are vendored in the bionemo-2 repository workspace as git submodules for development purposes. The pinned commits for these submodules represent the "last-known-good" versions of these packages that are confirmed to be working with bionemo2 (and those that are tested in CI).
To initialize these sub-modules when cloning the repo, add the --recursive flag to the git clone command:
git clone --recursive [email protected]:NVIDIA/bionemo-framework.git
To download the pinned versions of these submodules within an existing git repository, run
git submodule update --init --recursive
Different branches of the repo can have different pinned versions of these third-party submodules. Make sure you update submodules after switching branches or pulling recent changes!
To configure git to automatically update submodules when switching branches, run
git config submodule.recurse true
NOTE: this setting will not download new or remove old submodules with the branch's changes.
You will have to run the full git submodule update --init --recursive command in these situations.
After cloning the repository, you need to run the setup script first:
./internal/scripts/setup_env_file.sh
This will return an exit code of 1 on a first-time run.
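If you are scripting the setup, you can tolerate the expected first-run failure (a sketch):
# the first invocation exits 1, as documented above; ignore the failure and continue
./internal/scripts/setup_env_file.sh || true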
To build the release image, run the following script:
DOCKER_BUILDKIT=1 ./ci/scripts/build_docker_image.sh \
-regular-docker-builder \
-image-name "nvcr.io/nvidian/cvai_bnmo_trng/bionemo:bionemo2-$(git rev-parse HEAD)"
To build the development image, run the following script:
./internal/scripts/build_dev_image.sh
After building the development image, you can start a container from it and open a bash shell in it by executing:
./internal/scripts/run_dev.sh
Set the AWS access info in your environment prior to running the dev-container launch script (export the variables so they are visible to the script):
export AWS_ACCESS_KEY_ID="team-bionemo"
export AWS_SECRET_ACCESS_KEY=$(grep aws_secret_access_key ~/.aws/config | cut -d' ' -f 3)
export AWS_REGION="us-east-1"
export AWS_ENDPOINT_URL="https://pbss.s8k.io"
Running tests downloads the test data to a cache location when first invoked.
For more information on adding new test artifacts, see the documentation in bionemo.core.data.load.
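For example (a sketch, run inside the development container), invoking one sub-package's tests will populate the cache on first use:
# the first run downloads test artifacts into the cache; subsequent runs reuse them
py.test sub-packages/bionemo-core/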
Pinned commits are bumped by Dependabot. To update the pinned commits of NeMo or Megatron-LM manually, check out the commit of interest in the submodule folder, and then commit the result in the top-level bionemo repository.
cd 3rdparty/NeMo/
git fetch
git checkout <desired_sha>
cd ../..
git add '3rdparty/NeMo/'
git commit -m "updating NeMo commit"
Inside the development container, run ./ci/scripts/static_checks.sh to validate that code changes will pass the code formatting and license checks run during CI. In addition, run the longer ./ci/scripts/pr_test.sh script to run unit tests for all sub-packages.
We use setuptools-scm to dynamically determine the library version from git tags. As an example:
$ git tag 2.0.0a1
$ docker build . -t bionemo-uv
$ docker run --rm -it bionemo-uv:latest python -c "from importlib.metadata import version; print(version('bionemo.esm2'))"
2.0.0a1
Bionemo packages follow semantic versioning 2.0 rules: API-breaking changes are MAJOR, new features are MINOR, and bug fixes and refactors are PATCH, in the MAJOR.MINOR.PATCH version string format.
If subsequent commits are added after a git tag, the version string will reflect the additional commits (e.g., 2.0.0a1.post1). NOTE: we don't consider uncommitted changes in determining the version string.
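As an illustration (a hypothetical sketch, assuming the post-release versioning scheme described above):
git tag 2.0.0a1
git commit --allow-empty -m "an additional commit after the tag"
# the derived version now reflects the extra commit, e.g. 2.0.0a1.post1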
An overview of publishing packages with uv can be found here: https://docs.astral.sh/uv/guides/publish/
Build the bionemo sub-package project by executing the following for the desired package:
uv build sub-packages/bionemo-core/
This produces a wheel file and a source distribution for the sub-package's code:
$ ls sub-packages/bionemo-core/dist/
bionemo_core-2.0.0a1.post0-py3-none-any.whl bionemo_core-2.0.0a1.post0.tar.gz
After building, the wheel file may be uploaded to PyPI (or a compatible package registry) by executing uvx twine upload sub-packages/bionemo-core/dist/*.
The full sequence, assuming we're building a wheel for bionemo-core:
git tag MY-VERSION-TAG
uv build sub-packages/bionemo-core
TWINE_PASSWORD="<pypi pass>" TWINE_USERNAME="<pypi user>" uvx twine upload sub-packages/bionemo-core/dist/*
BioNeMo 2 provides two entrypoints for models, one argparse-based and one pydantic-based. Both are documented in the Models section below.
Pydantic-based configuration is designed to accept a configuration YAML file as input, along with context-specific arguments (e.g., should we resume from existing checkpoints?). These YAML configs go through a Pydantic validator, in this case referred to as MainConfig. This config is composed of several other Pydantic models; see the class definition for details. To pre-populate a config with reasonable defaults for various standard models, we provide 'recipes.' These are simple methods that instantiate the config object and then serialize it to a YAML configuration file. From this file, you may either submit it directly, or modify the various parameters to meet your use case. For example, Weights and Biases, devices, precision, and dataset options are all extremely useful to modify. Then, you would submit this config for training.
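In outline, the pydantic-based workflow looks like this (a sketch; the concrete commands and flags appear in the sections below):
# 1. generate a default config YAML from a recipe
bionemo-esm2-recipe --recipe esm2_8m_recipe --dest my_config.yaml ...
# 2. edit my_config.yaml (devices, precision, Weights and Biases, dataset options)
# 3. submit the edited config for training
bionemo-esm2-train --config my_config.yaml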
These two workflows are packaged as executables when esm2 or geneformer are installed with pip. These commands will appear as:
bionemo-geneformer-recipe
bionemo-esm2-recipe
bionemo-geneformer-train
bionemo-esm2-train
First off, we have a utility function called download_bionemo_data for downloading full/test data and model checkpoints, which the following examples use. It will download the object if it is not already on your local system, and then return the path either way. For example, if you run it twice in a row, you should expect the second run to return the path almost instantly.
NOTE: NVIDIA employees should use pbss rather than ngc for the data source.
export MY_DATA_SOURCE="ngc"
or, for NVIDIA internal employees (with new data, etc.):
export MY_DATA_SOURCE="pbss"
TEST_DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source $MY_DATA_SOURCE); \
ESM2_650M_CKPT=$(download_bionemo_data esm2/650m:2.0 --source $MY_DATA_SOURCE); \
train_esm2 \
--train-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/train_clusters_sanity.parquet \
--train-database-path ${TEST_DATA_DIR}/2024_03_sanity/train_sanity.db \
--valid-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/valid_clusters.parquet \
--valid-database-path ${TEST_DATA_DIR}/2024_03_sanity/validation.db \
--result-dir ./results \
--experiment-name test_experiment \
--num-gpus 1 \
--num-nodes 1 \
--val-check-interval 10 \
--num-dataset-workers 1 \
--num-steps 10 \
--max-seq-length 1024 \
--limit-val-batches 2 \
--micro-batch-size 2 \
--restore-from-checkpoint-path ${ESM2_650M_CKPT}
Alternatively, we provide a validated and serialized configuration file entrypoint for executing the same workflow. These can be generated using the bionemo-esm2-recipe entrypoints. Recipes are available for the 8m, 650m, and 3b ESM2 models. You may select which preset config to use by setting the --recipe parameter.
The output is then a serialized configuration file that may be used in the associated bionemo-esm2-train commands.
TEST_DATA_DIR=$(download_bionemo_data esm2/testdata_esm2_pretrain:2.0 --source $MY_DATA_SOURCE); \
bionemo-esm2-recipe \
--train-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/train_clusters_sanity.parquet \
--train-database-path ${TEST_DATA_DIR}/2024_03_sanity/train_sanity.db \
--valid-cluster-path ${TEST_DATA_DIR}/2024_03_sanity/valid_clusters.parquet \
--valid-database-path ${TEST_DATA_DIR}/2024_03_sanity/validation.db \
--result-dir ./results \
--dest my_config.yaml \
--recipe esm2_8m_recipe
⚠️ IMPORTANT: Inspect and edit the contents of the outputted my_config.yaml as you see fit.
NOTE: To continue training from an existing checkpoint, simply pass the checkpoint path via --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from the existing checkpoint.
To submit a training job with the passed config, first update the yaml file with any additional execution parameters of your choosing: number of devices, workers, steps, etc. Second, invoke our training entrypoint. To do this, we need three things:
- Configuration file, the YAML produced by the previous step
- Model config type, in this case the pretraining config. This will validate the arguments in the config YAML against those required for pretraining. Alternatively, things like fine-tuning with custom task heads may be specified here. This allows for mixing/matching Data Modules with various tasks.
- Data Config type, this specifies how to parse, validate, and prepare the DataModule. This may change depending on task, for example, pretraining ESM2 uses a protein cluster oriented sampling method. In the case of inference or fine-tuning a pretrained model, a simple fasta file may be sufficient. There is a one-to-one relationship between DataConfig types and DataModule types.
⚠️ Warning: This setup does NO configuration of Weights and Biases. Edit your config YAML and populate it with your WandB details.
bionemo-esm2-train \
--data-config-cls bionemo.esm2.run.config_models.ESM2DataConfig \
--model-config-cls bionemo.esm2.run.config_models.ExposedESM2PretrainConfig \
--config my_config.yaml
NOTE: both data-config-cls and model-config-cls have default values corresponding to ESM2DataConfig and ExposedESM2PretrainConfig.
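Given these defaults, the invocation above could be shortened to the following (a sketch, valid only when the defaults match your config):
bionemo-esm2-train --config my_config.yaml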
DataConfigCls and ModelConfigCls can also refer to locally defined types by the user. As long as python knows how to import the specified path, they may be configured. For example, you may have a custom Dataset/DataModule that you would like to mix with an existing recipe. In this case, you define a DataConfig object with the generic specified as your DataModule type, and then pass in the config type to the training recipe.
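For example (a hypothetical sketch; my_project.configs.MyDataConfig stands in for your own importable DataConfig class):
# use a locally defined DataConfig while keeping the stock ESM2 pretraining model config
bionemo-esm2-train \
  --data-config-cls my_project.configs.MyDataConfig \
  --model-config-cls bionemo.esm2.run.config_models.ExposedESM2PretrainConfig \
  --config my_config.yaml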
Similar to ESM-2, you can download the dataset and checkpoint through our utility function.
TEST_DATA_DIR=$(download_bionemo_data single_cell/testdata-20241203 --source $MY_DATA_SOURCE); \
GENEFORMER_10M_CKPT=$(download_bionemo_data geneformer/10M_240530:2.0 --source $MY_DATA_SOURCE); \
train_geneformer \
--data-dir ${TEST_DATA_DIR}/cellxgene_2023-12-15_small_processed_scdl \
--result-dir ./results \
--restore-from-checkpoint-path ${GENEFORMER_10M_CKPT} \
--experiment-name test_experiment \
--num-gpus 1 \
--num-nodes 1 \
--val-check-interval 10 \
--num-dataset-workers 0 \
--num-steps 55 \
--seq-length 128 \
--limit-val-batches 2 \
--micro-batch-size 2
To fine-tune, you need to specify a different combination of model and loss. Pass the checkpoint directory output by the previous step as the --restore-from-checkpoint-path, and change --training-model-config-class to the newly created model config class.
While no CLI option currently exists to hot-swap in different data modules and processing functions, you could copy sub-packages/bionemo-geneformer/geneformer/scripts/train_geneformer.py and modify the DataModule class that gets initialized.
Simple fine-tuning example (NOTE: please change --restore-from-checkpoint-path to be the checkpoint directory path that was output last by the previous train run):
TEST_DATA_DIR=$(download_bionemo_data single_cell/testdata-20241203 --source $MY_DATA_SOURCE); \
train_geneformer \
--data-dir ${TEST_DATA_DIR}/cellxgene_2023-12-15_small_processed_scdl \
--result-dir ./results \
--experiment-name test_finetune_experiment \
--num-gpus 1 \
--num-nodes 1 \
--val-check-interval 10 \
--num-dataset-workers 0 \
--num-steps 55 \
--seq-length 128 \
--limit-val-batches 2 \
--micro-batch-size 2 \
--training-model-config-class FineTuneSeqLenBioBertConfig \
--restore-from-checkpoint-path results/test_experiment/dev/checkpoints/test_experiment--val_loss=4.3506-epoch=1-last
Alternatively, we provide a validated and serialized configuration file entrypoint for executing the same workflow. Recipes are available for the 10m and 106m geneformer models. Additionally, we provide an example recipe for fine-tuning, where the objective is to 'regress' on token IDs rather than the traditional masked language model approach. In practice, you will likely need to implement your own DataModule, DataConfig, and fine-tuning model. You can use the same overall approach, but with customizations for your task.
TEST_DATA_DIR=$(download_bionemo_data single_cell/testdata-20241203 --source $MY_DATA_SOURCE); \
bionemo-geneformer-recipe \
--recipe 10m-pretrain \
--dest my_config.yaml \
--data-path ${TEST_DATA_DIR}/cellxgene_2023-12-15_small_processed_scdl \
--result-dir ./results
⚠️ IMPORTANT: Inspect and edit the contents of the outputted my_config.yaml as you see fit.
NOTE: To pretrain from an existing checkpoint, simply pass the checkpoint path via --initial-ckpt-path to the recipe command. This will populate the YAML with the correct field to ensure pretraining is initialized from the existing checkpoint.
To submit a training job with the passed config, first update the yaml file with any additional execution parameters of your choosing: number of devices, workers, steps, etc. Second, invoke our training entrypoint. To do this, we need three things:
- Configuration file, the YAML produced by the previous step
- Model config type, in this case the pretraining config. This will validate the arguments in the config YAML against those required for pretraining. Alternatively, things like fine-tuning with custom task heads may be specified here. This allows for mixing/matching Data Modules with various tasks.
- Data Config type, this specifies how to parse, validate, and prepare the DataModule. This may change depending on task, for example, while fine-tuning you may want to use a custom Dataset/DataModule that includes PERTURB-seq. In this case, the default pretraining DataConfig and DataModule will be insufficient. See ESM2 for additional example usecases.
⚠️ Warning: This setup does NO configuration of Weights and Biases. Edit your config YAML and populate it with your WandB details.
bionemo-geneformer-train \
--data-config-cls bionemo.geneformer.run.config_models.GeneformerPretrainingDataConfig \
--model-config-cls bionemo.geneformer.run.config_models.ExposedGeneformerPretrainConfig \
--config my_config.yaml
NOTE: both data-config-cls and model-config-cls have default values corresponding to GeneformerPretrainingDataConfig and ExposedGeneformerPretrainConfig
DataConfigCls and ModelConfigCls can also refer to locally defined types by the user. As long as python knows how to import the specified path, they may be configured. For example, you may have a custom Dataset/DataModule that you would like to mix with an existing recipe. In this case, you define a DataConfig object with the generic specified as your DataModule type, and then pass in the config type to the training recipe.
If you add new Python (.py) files, be sure to run our license-check. If you have not already done so, please install the dependencies in dev-requirements.txt. If you are working directly inside a release container, you may need to install these manually.
We recommend using the developer container for contributions.
pip install -r dev-requirements.txt --user
python ./scripts/license_check.py --modify --replace --license-header ./license_header -c sub-packages/ -c docs/ -c scripts/ -c ci/ -c internal/
If false-positives are raised by the detect-secrets pre-commit hook, they can be added to the baseline files by running the following commands:
detect-secrets scan --baseline .secrets.baseline --exclude-files '(.*\.ipynb|.*\.baseline)$'
detect-secrets scan --baseline .secrets-nb.baseline --exclude-files '^.(?!.*\.ipynb)' --exclude-lines '"(hash|id|image/\w+)":.*'
The resulting altered baseline files should then be committed.
BioNeMo FW is migrating to use uv (https://docs.astral.sh/uv/) for handling python packaging inside our docker containers. In addition to streamlining how we specify intra-repo dependencies, it allows us to create a uv lockfile to pin our dependencies for our bionemo docker container.
We'll maintain two images going forward:
- An image that derives from nvcr.io/nvidia/pytorch that will be our performance baseline. The advantage of this image base is that the performance of pytorch is validated by the NVIDIA pytorch team, but the downsides are that (1) the overall image size is quite large, and (2) using uv sync to install a pinned virtual environment is not possible with the existing python environment in the ngc image.
- An image that derives from nvcr.io/nvidia/cuda, where we use uv to create the python environment from scratch. This image uses pytorch wheels from https://download.pytorch.org.
Currently, the devcontainer derives from the cuda-based image above, while the release image derives from the pytorch image.
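For example (a sketch; adjust the mounts, image tag, and test paths to your setup), the bionemo-uv:latest image built earlier can run the test suite in a container: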
docker run --rm -it \
-v ${HOME}/.aws:/home/bionemo/.aws \
-v ${HOME}/.ngc:/home/bionemo/.ngc \
-v ${PWD}:/home/bionemo/ \
-v ${HOME}/.cache:/home/bionemo/.cache \
-e HOST_UID=$(id -u) \
-e HOST_GID=$(id -g) \
--gpus=all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
bionemo-uv:latest \
py.test sub-packages/ scripts/
Similar Open Source Tools
h2o-llmstudio
H2O LLM Studio is a framework and no-code GUI designed for fine-tuning state-of-the-art large language models (LLMs). With H2O LLM Studio, you can easily and effectively fine-tune LLMs without the need for any coding experience. The GUI is specially designed for large language models, and you can finetune any LLM using a large variety of hyperparameters. You can also use recent finetuning techniques such as Low-Rank Adaptation (LoRA) and 8-bit model training with a low memory footprint. Additionally, you can use Reinforcement Learning (RL) to finetune your model (experimental), use advanced evaluation metrics to judge generated answers by the model, track and compare your model performance visually, and easily export your model to the Hugging Face Hub and share it with the community.
eval-dev-quality
DevQualityEval is an evaluation benchmark and framework designed to compare and improve the quality of code generation of Language Model Models (LLMs). It provides developers with a standardized benchmark to enhance real-world usage in software development and offers users metrics and comparisons to assess the usefulness of LLMs for their tasks. The tool evaluates LLMs' performance in solving software development tasks and measures the quality of their results through a point-based system. Users can run specific tasks, such as test generation, across different programming languages to evaluate LLMs' language understanding and code generation capabilities.
MARS5-TTS
MARS5 is a novel English speech model (TTS) developed by CAMB.AI, featuring a two-stage AR-NAR pipeline with a unique NAR component. The model can generate speech for various scenarios like sports commentary and anime with just 5 seconds of audio and a text snippet. It allows steering prosody using punctuation and capitalization in the transcript. Speaker identity is specified using an audio reference file, enabling 'deep clone' for improved quality. The model can be used via torch.hub or HuggingFace, supporting both shallow and deep cloning for inference. Checkpoints are provided for AR and NAR models, with hardware requirements of 750M+450M params on GPU. Contributions to improve model stability, performance, and reference audio selection are welcome.
kaito
Kaito is an operator that automates the AI/ML inference model deployment in a Kubernetes cluster. It manages large model files using container images, avoids tuning deployment parameters to fit GPU hardware by providing preset configurations, auto-provisions GPU nodes based on model requirements, and hosts large model images in the public Microsoft Container Registry (MCR) if the license allows. Using Kaito, the workflow of onboarding large AI inference models in Kubernetes is largely simplified.
uni-api
uni-api is a project that unifies the management of large language model APIs, allowing you to call multiple backend services through a single unified API interface, converting them all to OpenAI format, and supporting load balancing. It supports various backend services such as OpenAI, Anthropic, Gemini, Vertex, Azure, xai, Cohere, Groq, Cloudflare, OpenRouter, and more. The project offers features like no front-end, pure configuration file setup, unified management of multiple backend services, support for multiple standard OpenAI format interfaces, rate limiting, automatic retry, channel cooling, fine-grained model timeout settings, and fine-grained permission control.
redbox
Redbox is a retrieval augmented generation (RAG) app that uses GenAI to chat with and summarise civil service documents. It increases organisational memory by indexing documents and can summarise reports read months ago, supplement them with current work, and produce a first draft that lets civil servants focus on what they do best. The project uses a microservice architecture with each microservice running in its own container defined by a Dockerfile. Dependencies are managed using Python Poetry. Contributions are welcome, and the project is licensed under the MIT License. Security measures are in place to ensure user data privacy and considerations are being made to make the core-api secure.
ezkl
EZKL is a library and command-line tool for doing inference for deep learning models and other computational graphs in a zk-snark (ZKML). It enables the following workflow: 1. Define a computational graph, for instance a neural network (but really any arbitrary set of operations), as you would normally in pytorch or tensorflow. 2. Export the final graph of operations as an .onnx file and some sample inputs to a .json file. 3. Point ezkl to the .onnx and .json files to generate a ZK-SNARK circuit with which you can prove statements such as: > "I ran this publicly available neural network on some private data and it produced this output" > "I ran my private neural network on some public data and it produced this output" > "I correctly ran this publicly available neural network on some public data and it produced this output" In the backend we use the collaboratively-developed Halo2 as a proof system. The generated proofs can then be verified with much less computational resources, including on-chain (with the Ethereum Virtual Machine), in a browser, or on a device.
stable-diffusion-webui
Stable Diffusion web UI is a web interface for Stable Diffusion, implemented using Gradio library. It provides a user-friendly interface to access the powerful image generation capabilities of Stable Diffusion. With Stable Diffusion web UI, users can easily generate images from text prompts, edit and refine images using inpainting and outpainting, and explore different artistic styles and techniques. The web UI also includes a range of advanced features such as textual inversion, hypernetworks, and embeddings, allowing users to customize and fine-tune the image generation process. Whether you're an artist, designer, or simply curious about the possibilities of AI-generated art, Stable Diffusion web UI is a valuable tool that empowers you to create stunning and unique images.
agentok
Agentok Studio is a visual tool built for AutoGen, a cutting-edge agent framework from Microsoft and various contributors. It offers intuitive visual tools to simplify the construction and management of complex agent-based workflows. Users can create workflows visually as graphs, chat with agents, and share flow templates. The tool is designed to streamline the development process for creators and developers working on next-generation Multi-Agent Applications.
agentok
Agentok Studio is a tool built upon AG2, a powerful agent framework from Microsoft, offering intuitive visual tools to streamline the creation and management of complex agent-based workflows. It simplifies the process for creators and developers by generating native Python code with minimal dependencies, enabling users to create self-contained code that can be executed anywhere. The tool is currently under development and not recommended for production use, but contributions are welcome from the community to enhance its capabilities and functionalities.
askui
AskUI is a reliable, automated end-to-end automation tool that only depends on what is shown on your screen instead of the technology or platform you are running on.
AI-on-the-edge-device-docs
This repository contains documentation for the AI on the Edge Device Project. Users can edit Markdown documents in the 'docs' folder, create Pull Requests to merge changes, and Github Actions will regenerate the documentation on the 'gh-pages' branch. The documentation includes parameter documentation, template generation for new parameters, formatting options like boxes using the admonition extension, and local testing instructions using MkDocs.
empirical
Empirical is a tool that allows you to test different LLMs, prompts, and other model configurations across all the scenarios that matter for your application. With Empirical, you can run your test datasets locally against off-the-shelf models, test your own custom models and RAG applications, view, compare, and analyze outputs on a web UI, score your outputs with scoring functions, and run tests on CI/CD.
hi-ml
The Microsoft Health Intelligence Machine Learning Toolbox is a repository that provides low-level and high-level building blocks for Machine Learning / AI researchers and practitioners. It simplifies and streamlines work on deep learning models for healthcare and life sciences by offering tested components such as data loaders, pre-processing tools, deep learning models, and cloud integration utilities. The repository includes two Python packages, 'hi-ml-azure' for helper functions in AzureML, 'hi-ml' for ML components, and 'hi-ml-cpath' for models and workflows related to histopathology images.
ersilia
The Ersilia Model Hub is a unified platform of pre-trained AI/ML models dedicated to infectious and neglected disease research. It offers an open-source, low-code solution that provides seamless access to AI/ML models for drug discovery. Models housed in the hub come from two sources: published models from literature (with due third-party acknowledgment) and custom models developed by the Ersilia team or contributors.
For similar tasks
Generative-AI-Drug-Discovery
Generative-AI-Drug-Discovery is a public repository on GitHub focused on using tensor network machine learning approaches to accelerate GenAI for drug discovery. The repository aims to implement effective architectures and methodologies into Large Language Models (LLMs) to enhance Drug Discovery Generative AI performance.
Pathway-AI-Bootcamp
Welcome to the μLearn x Pathway Initiative, an exciting adventure into the world of Artificial Intelligence (AI)! This comprehensive course, developed in collaboration with Pathway, will empower you with the knowledge and skills needed to navigate the fascinating world of AI, with a special focus on Large Language Models (LLMs).
LLM-Agent-Survey
Autonomous agents are designed to achieve specific objectives through self-guided instructions. With the emergence and growth of large language models (LLMs), there is a growing trend in utilizing LLMs as fundamental controllers for these autonomous agents. This repository conducts a comprehensive survey study on the construction, application, and evaluation of LLM-based autonomous agents. It explores essential components of AI agents, application domains in natural sciences, social sciences, and engineering, and evaluation strategies. The survey aims to be a resource for researchers and practitioners in this rapidly evolving field.
genkit
Firebase Genkit (beta) is a framework with powerful tooling to help app developers build, test, deploy, and monitor AI-powered features with confidence. Genkit is cloud optimized and code-centric, integrating with many services that have free tiers to get started. It provides unified API for generation, context-aware AI features, evaluation of AI workflow, extensibility with plugins, easy deployment to Firebase or Google Cloud, observability and monitoring with OpenTelemetry, and a developer UI for prototyping and testing AI features locally. Genkit works seamlessly with Firebase or Google Cloud projects through official plugins and templates.
vector-cookbook
The Vector Cookbook is a collection of recipes and sample application starter kits for building AI applications with LLMs using PostgreSQL and Timescale Vector. Timescale Vector enhances PostgreSQL for AI applications by enabling the storage of vector, relational, and time-series data with faster search, higher recall, and more efficient time-based filtering. The repository includes resources, sample applications like TSV Time Machine, and guides for creating, storing, and querying OpenAI embeddings with PostgreSQL and pgvector. Users can learn about Timescale Vector, explore performance benchmarks, and access Python client libraries and tutorials.
cogai
The W3C Cognitive AI Community Group focuses on advancing Cognitive AI through collaboration on defining use cases, open source implementations, and application areas. The group aims to demonstrate the potential of Cognitive AI in various domains such as customer services, healthcare, cybersecurity, online learning, autonomous vehicles, manufacturing, and web search. They work on formal specifications for chunk data and rules, plausible knowledge notation, and neural networks for human-like AI. The group positions Cognitive AI as a combination of symbolic and statistical approaches inspired by human thought processes. They address research challenges including mimicry, emotional intelligence, natural language processing, and common sense reasoning. The long-term goal is to develop cognitive agents that are knowledgeable, creative, collaborative, empathic, and multilingual, capable of continual learning and self-awareness.
ai-hub
The Enterprise Azure OpenAI Hub is a comprehensive repository designed to guide users through the world of Generative AI on the Azure platform. It offers a structured learning experience to accelerate the transition from concept to production in an Enterprise context. The hub empowers users to explore various use cases with Azure services, ensuring security and compliance. It provides real-world examples and playbooks for practical insights into solving complex problems and developing cutting-edge AI solutions. The repository also serves as a library of proven patterns, aligning with industry standards and promoting best practices for secure and compliant AI development.
For similar jobs
AlphaFold3
AlphaFold3 is an implementation of the Alpha Fold 3 model in PyTorch for accurate structure prediction of biomolecular interactions. It includes modules for genetic diffusion and full model examples for forward pass computations. The tool allows users to generate random pair and single representations, operate on atomic coordinates, and perform structure predictions based on input tensors. The implementation also provides functionalities for training and evaluating the model.
biochatter
Generative AI models have shown tremendous usefulness in increasing accessibility and automation of a wide range of tasks. This repository contains the `biochatter` Python package, a generic backend library for the connection of biomedical applications to conversational AI. It aims to provide a common framework for deploying, testing, and evaluating diverse models and auxiliary technologies in the biomedical domain. BioChatter is part of the BioCypher ecosystem, connecting natively to BioCypher knowledge graphs.
admet_ai
ADMET-AI is a platform for ADMET prediction using Chemprop-RDKit models trained on ADMET datasets from the Therapeutics Data Commons. It offers command line, Python API, and web server interfaces for making ADMET predictions on new molecules. The platform can be easily installed using pip and supports GPU acceleration. It also provides options for processing TDC data, plotting results, and hosting a web server. ADMET-AI is a machine learning platform for evaluating large-scale chemical libraries.
AI-Drug-Discovery-Design
AI-Drug-Discovery-Design is a repository focused on Artificial Intelligence-assisted Drug Discovery and Design. It explores the use of AI technology to accelerate and optimize the drug development process. The advantages of AI in drug design include speeding up research cycles, improving accuracy through data-driven models, reducing costs by minimizing experimental redundancies, and enabling personalized drug design for specific patients or disease characteristics.
grand-challenge.org
Grand Challenge is a platform that provides access to large amounts of annotated training data, objective comparisons of state-of-the-art machine learning solutions, and clinical validation using real-world data. It assists researchers, data scientists, and clinicians in collaborating to develop robust machine learning solutions to problems in biomedical imaging.
Detection-and-Classification-of-Alzheimers-Disease
This tool is designed to detect and classify Alzheimer's Disease using Deep Learning and Machine Learning algorithms on an early basis, which is further optimized using the Crow Search Algorithm (CSA). Alzheimer's is a fatal disease, and early detection is crucial for patients to predetermine their condition and prevent its progression. By analyzing MRI scanned images using Artificial Intelligence technology, this tool can classify patients who may or may not develop AD in the future. The CSA algorithm, combined with ML algorithms, has proven to be the most effective approach for this purpose.
OpenCRISPR
OpenCRISPR is a set of free and open gene editing systems designed by Profluent Bio. The OpenCRISPR-1 protein maintains the prototypical architecture of a Type II Cas9 nuclease but is hundreds of mutations away from SpCas9 or any other known natural CRISPR-associated protein. You can view OpenCRISPR-1 as a drop-in replacement for many protocols that need a cas9-like protein with an NGG PAM and you can even use it with canonical SpCas9 gRNAs. OpenCRISPR-1 can be fused in a deactivated or nickase format for next generation gene editing techniques like base, prime, or epigenome editing.