REINVENT4

AI molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design and molecule optimization.

REINVENT 4

Description

REINVENT is a molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design, molecule optimization, and other small molecule design tasks. REINVENT uses a Reinforcement Learning (RL) algorithm to generate optimized molecules that comply with a user-defined property profile expressed as a multi-component score. Transfer Learning (TL) can be used to create or pre-train a model that generates molecules closer to a set of input molecules.

A paper describing the software has been published as Open Access in the Journal of Cheminformatics: Reinvent 4: Modern AI–driven generative molecule design. See AUTHORS.md for references to previous papers.

Requirements

REINVENT is developed on Linux and supports both GPU and CPU. The Linux version is fully validated. On Windows and macOS, REINVENT also supports both GPU and CPU but is only partially tested, so support for these platforms is limited.

The code is written in Python 3 (>= 3.10). The list of dependencies can be found in the repository (see also Installation below).

A GPU is not strictly necessary but is strongly recommended for performance, especially for transfer learning and model training. Reinforcement learning (RL) requires the computation of scores, and most scoring components run on the CPU. A GPU is therefore less important for RL, depending on how much time is spent on the CPU.

Note that if no GPU is installed in your computer, the code will run on the CPU automatically. REINVENT supports NVIDIA GPUs and also some AMD GPUs. For most design tasks, about 8 GiB of both CPU main memory and GPU memory is sufficient.
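
The automatic CPU fallback described above can be pictured with a small sketch. This is not REINVENT's own device-selection code: `python_is_supported` and `pick_device` are hypothetical helpers, and the sketch assumes that PyTorch reports a usable GPU through `torch.cuda` (ROCm builds of PyTorch use the same interface for AMD GPUs).

```python
import sys


def python_is_supported(minimum=(3, 10)):
    """Return True if the running interpreter meets the stated minimum version."""
    return sys.version_info[:2] >= minimum


def pick_device():
    """Return "cuda" when PyTorch sees a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print("Python >= 3.10:", python_is_supported())
    print("Device:", pick_device())
```

Running the script before installation is a quick way to see whether the environment meets the Python requirement and whether a GPU would be visible to PyTorch.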

Installation

1. Clone this Git repository.
2. Install a compatible version of Python, for example with Conda (other virtual environments such as Docker or pyenv, or the system package manager, work too).
```
conda create --name reinvent4 python=3.10
conda activate reinvent4
```
3. Change directory to the repository and install the dependencies from the lockfile:
```
pip install -r requirements-linux-64.lock
```
4. Optional: to use AMD GPUs on Linux, install the ROCm PyTorch version manually after installing the dependencies in step 3, e.g.
```
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/rocm5.7
```
5. Optional: on macOS, use the requirements file requirements-macOS.lock.
6. Install the tool. The dependencies were already installed in the previous steps, so there is no need to install them again (flag `--no-deps`). To install in editable mode (changes to the code are automatically available), add `-e` before the dot.
```
pip install --no-deps .
```
7. Test the tool. The installer has added a script reinvent to your PATH.
```
reinvent --help
```

Basic Usage

REINVENT is a command line tool and works principally as follows:

```
reinvent -l sampling.log sampling.toml
```

This writes logging information to the file sampling.log. To write it to the screen instead, leave out the -l sampling.log part. sampling.toml is the configuration file. The main user format is TOML as it tends to be more user-friendly. JSON can be used too by adding -f json, but a specialised editor is recommended as the format is very sensitive to minor changes.
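
When REINVENT is launched from a script, the command line can be assembled programmatically. The following is a minimal sketch that uses only the options mentioned above (-l for the log file, -f for the configuration format); `build_reinvent_command` is a hypothetical helper and not part of REINVENT.

```python
import shlex


def build_reinvent_command(config, log_file=None, config_format=None):
    """Assemble the argument list for a `reinvent` run.

    Only the options mentioned in the text are used: -l for the log file
    and -f for the configuration format.
    """
    cmd = ["reinvent"]
    if log_file is not None:
        cmd += ["-l", log_file]
    if config_format is not None:
        cmd += ["-f", config_format]
    cmd.append(config)
    return cmd


if __name__ == "__main__":
    # Print the shell-quoted commands instead of running them.
    print(shlex.join(build_reinvent_command("sampling.toml", log_file="sampling.log")))
    print(shlex.join(build_reinvent_command("sampling.json", config_format="json")))
```

The returned list can be passed directly to `subprocess.run(cmd, check=True)` once the tool is installed.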

Sample configuration files for all run modes are located in configs/toml in the repository; the file paths in these files need to be adjusted to your local installation. In particular, ready-made prior models are located in priors. Choose a model and the appropriate run mode depending on the research problem you are trying to address. Several *.md files in configs/toml contain additional instructions on how to configure the TOML file. Internal priors can be referenced with a dot notation (see reinvent/prior_registry.py).

Tutorials / `Jupyter` notebooks

Basic instructions can be found in the comments in the config examples in configs/toml.

Notebooks are provided in the notebooks/ directory. Please note that we provide the notebooks in jupytext "light script" format. To work with the light scripts you will need to install jupytext. A few other packages will come in handy too.

```
pip install jupytext mols2grid seaborn
```

The Python files in notebooks/ can then be converted to notebooks, e.g.

```
jupytext -o Reinvent_demo.ipynb Reinvent_demo.py
```
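
To convert every light script in one go, the same command can be generated for each file. This is a sketch only: `jupytext_commands` is a hypothetical helper, and it assumes that each *.py file in the directory is a jupytext light script.

```python
from pathlib import Path


def jupytext_commands(notebook_dir):
    """Return one `jupytext -o X.ipynb X.py` argument list per light script."""
    commands = []
    for script in sorted(Path(notebook_dir).glob("*.py")):
        target = script.with_suffix(".ipynb")
        commands.append(["jupytext", "-o", str(target), str(script)])
    return commands


if __name__ == "__main__":
    import subprocess

    for cmd in jupytext_commands("notebooks"):
        subprocess.run(cmd, check=True)  # requires jupytext to be installed
```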

Updating dependencies

Update the lock files with pip-tools (please, do not edit the files manually):

```
pip-compile --extra-index-url=https://download.pytorch.org/whl/cu121 --extra-index-url=https://pypi.anaconda.org/OpenEye/simple --resolver=backtracking pyproject.toml
```

To update a single package, use pip-compile --upgrade-package somepackage (see the documentation for pip-tools).

Scoring Plugins

The scoring subsystem uses a simple plugin mechanism (Python native namespace packages). If you wish to write your own plugin, follow the instructions below. There is no need to touch any of the REINVENT code. The public repository contains a contrib directory with some useful examples.

1. Create /top/dir/somewhere/reinvent_plugins/components where /top/dir/somewhere is a convenient location for you.
2. Do not place an __init__.py in either reinvent_plugins or components as this would break the mechanism. It is fine to create normal packages within components as long as you import those correctly.
3. Place a file whose name starts with comp_ into reinvent_plugins/components. Files with different names will be ignored, i.e. not imported. The directory will be searched recursively, so structure your code as needed, but directory/package names must be unique.
4. Tag the scoring component class(es) in that file with the @add_tag decorator. More than one component class can be added to the same comp_ file. See existing code.
5. Optional: tag at most one dataclass for parameters in the same file. See existing code.
6. Set or add /top/dir/somewhere to the PYTHONPATH environment variable or use any other mechanism to extend sys.path.
7. The scoring component should now automatically be picked up by REINVENT.
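
Before debugging a plugin that is not picked up, it can help to check the directory rules listed above. The sketch below mirrors those rules with the standard library only; it is not REINVENT's discovery code, and `check_plugin_layout` and `add_to_search_path` are hypothetical helper names.

```python
import sys
from pathlib import Path


def check_plugin_layout(top_dir):
    """Validate the directory rules listed above and return the comp_* files found."""
    top = Path(top_dir)
    components = top / "reinvent_plugins" / "components"
    if not components.is_dir():
        raise FileNotFoundError(f"missing directory: {components}")

    # Neither namespace level may contain an __init__.py.
    for package_dir in (components.parent, components):
        if (package_dir / "__init__.py").exists():
            raise ValueError(f"remove {package_dir / '__init__.py'}")

    # Only files whose names start with comp_ are considered; search recursively.
    return sorted(components.rglob("comp_*.py"))


def add_to_search_path(top_dir):
    """Extend sys.path, the in-process equivalent of setting PYTHONPATH."""
    top = str(Path(top_dir).resolve())
    if top not in sys.path:
        sys.path.append(top)
```

For example, `check_plugin_layout("/top/dir/somewhere")` returns the list of files that follow the naming rule, and raises an error if the directory is missing or an __init__.py is in the way.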

Unit and Integration Tests

This is primarily for developers and admins/users who wish to ensure that the installation works. The information here is not relevant to the practical use of REINVENT. Please refer to Basic Usage for instructions on how to use the reinvent command.

The REINVENT project uses the pytest framework for its tests. Before you run them you first have to create a configuration file for the tests.

In the project directory, create a config.json file in the configs/ directory. You can use the example config example.config.json as a base. Make sure that you set MAIN_TEST_PATH to a non-existent directory. That is where temporary files will be written during the tests. If it is set to an existing directory, that directory will be removed once the tests have finished.
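
Creating the test configuration can be scripted. The following sketch assumes that example.config.json is a JSON object with MAIN_TEST_PATH as a top-level key; `write_test_config` is a hypothetical helper. It refuses to use an existing directory, since that directory would be removed after the tests.

```python
import json
from pathlib import Path


def write_test_config(example_path, output_path, main_test_path):
    """Copy the example config and point MAIN_TEST_PATH at a fresh directory."""
    scratch = Path(main_test_path)
    if scratch.exists():
        # The text warns that an existing directory is deleted after the tests.
        raise FileExistsError(f"{scratch} already exists; choose a non-existent path")

    # Assumption: MAIN_TEST_PATH is a top-level key of the JSON object.
    config = json.loads(Path(example_path).read_text())
    config["MAIN_TEST_PATH"] = str(scratch)
    Path(output_path).write_text(json.dumps(config, indent=4) + "\n")
    return config
```

A call such as `write_test_config("configs/example.config.json", "configs/config.json", "/tmp/reinvent_tests")` would then produce the file expected by the tests; the scratch path here is only an illustration.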

Some tests require a proprietary OpenEye license. You have to set up a few things to make the tests read your license. The simple way is to just set the OE_LICENSE environment variable to the path of the file containing the license.

Once you have a configuration and your license can be read, you can run the tests.

```
pytest tests --json /path/to/config.json --device cuda
```
