
REINVENT4
AI molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design and molecule optimization.
Stars: 447

REINVENT is a molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design, molecule optimization, and other small molecule design tasks. It uses a Reinforcement Learning (RL) algorithm to generate optimized molecules compliant with a user-defined property profile defined as a multi-component score. Transfer Learning (TL) can be used to create or pre-train a model that generates molecules closer to a set of input molecules.
README:
REINVENT is a molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design, molecule optimization, and other small molecule design tasks. REINVENT uses a Reinforcement Learning (RL) algorithm to generate optimized molecules compliant with a user defined property profile defined as a multi-component score. Transfer Learning (TL) can be used to create or pre-train a model that generates molecules closer to a set of input molecules.
A paper describing the software has been published as Open Access in the Journal of Cheminformatics: Reinvent 4: Modern AI–driven generative molecule design. See AUTHORS.md for references to previous papers.
REINVENT is being developed on Linux and supports both GPU and CPU. The Linux version is fully validated. REINVENT on Windows and MacOSX supports both GPU and CPU but is only partially tested on these platforms and therefore support is limited.
The code is written in Python 3 (>= 3.10). The list of dependencies can be found in the repository (see also Installation below).
A GPU is not strictly necessary but strongly recommended for performance reasons especially for transfer learning and model training. Reinforcement learning (RL) requires the computation of scores where most scoring components run on the CPU. Thus, a GPU is less important for RL (depending on how much time is spent on the CPU).
Note that if no GPU is installed in your computer the code will run on the CPU automatically. REINVENT supports NVIDIA GPUs and also some AMD GPUs. For most design tasks a memory of about 8 GiB for both CPU main memory and GPU memory is sufficient.
- Clone this Git repository.
- Install a compatible version of Python, for example with Conda (other virtual environments like Docker, pyenv, or the system package manager work too).
conda create --name reinvent4 python=3.10 conda activate reinvent4
- Change directory to the repository and install the dependencies from the lockfile:
pip install -r requirements-linux-64.lock
-
Optional: if you want to use AMD GPUs on Linux you would need to install the ROCm PyTorch version manually after installation of the dependencies in point 3, e.g.
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/rocm5.7
-
Optional: use requirements file
requirements-macOS.lock
for MacOSX.
-
Optional: if you want to use AMD GPUs on Linux you would need to install the ROCm PyTorch version manually after installation of the dependencies in point 3, e.g.
- Install the tool. The dependencies were already installed in the previous step, so there is no need to install them again (flag `--no-deps). If you want to install in editable mode (changes to the code are automatically available), add -e before the dot.
pip install --no-deps .
- Test the tool. The installer has added a script
reinvent
to your PATH.reinvent --help
REINVENT is a command line tool and works principally as follows
reinvent -l sampling.log sampling.toml
This writes logging information to the file sampling.log
. If you wish to write
this to the screen, leave out the -l sampling.log
part. sampling.toml
is the
configuration file. The main user format is TOML as it tends to be more
use friendly. JSON can be used too, add -f json
, but a specialised editor is
recommended as the format is very sensitive to minor changes.
Sample configuration files for all run modes are
located in configs/toml
in the repository and file paths in these files would need to be
adjusted to your local installation. In particular, ready made prior models are
located in priors
and you would choose a model and the
appropriate run mode depending on the research problem you are trying to address.
There is additional information in configs/toml
in several *.md
files with
instructions on how to configure the TOML file. Internal priors can be referenced with a
dot notation (see reinvent/prior_registry.py
).
Basic instructions can be found in the comments in the config examples in configs/toml
.
Notebooks are provided in the notebooks/
directory. Please note that we
provide the notebooks in jupytext "light script" format. To work with the light
scripts you will need to install jupytext. A few other packages will come in handy too.
pip install jupytext mols2grid seaborn
The Python files in notebooks/
can then be converted to a notebook e.g.
jupytext -o Reinvent_demo.ipynb Reinvent_demo.py
Update the lock files with pip-tools (please, do not edit the files manually):
pip-compile --extra-index-url=https://download.pytorch.org/whl/cu121 --extra-index-url=https://pypi.anaconda.org/OpenEye/simple --resolver=backtracking pyproject.toml
To update a single package, use pip-compile --upgrade-package somepackage
(see the documentation for pip-tools).
The scoring subsystem uses a simple plugin mechanism (Python native namespace packages). If you wish to write your own plugin, follow the instructions below. There is no need to touch any of the REINVENT code. The public repository contains a contrib directory with some useful examples.
- Create
/top/dir/somewhere/reinvent\_plugins/components
where/top/dir/somewhere
is a convenient location for you. - Do not place a
__init__.py
in eitherreinvent_plugins
orcomponents
as this would break the mechanism. It is fine to create normal packages withincomponents
as long as you import those correctly. - Place a file whose name starts with
comp_*
intoreinvent_plugins/components
. Files with different names will be ignored i.e. not imported. The directory will be searched recursively so structure your code as needed but directory/package names must be unique. - Tag the scoring component class(es) in that file with the @add_tag decorator. More than one component class can be added to the same comp_ file. See existing code.
- Tag at most one dataclass for parameters in the same file, see existing code. This is optional.
- Set or add
/top/dir/somewhere
to thePYTHONPATH
environment variable or use any other mechanism to extendsys.path
. - The scoring component should now automatically be picked up by REINVENT.
This is primarily for developers and admins/users who wish to ensure that the
installation works. The information here is not relevant to the practical use
of REINVENT. Please refer to Basic Usage for instructions on how to use the
reinvent
command.
The REINVENT project uses the pytest
framework for its tests. Before you run
them you first have to create a configuration file for the tests.
In the project directory, create a config.json
file in the configs/
directory.
You can use the example config example.config.json
as a base. Make sure that
you set MAIN_TEST_PATH
to a non-existent directory. That is where temporary
files will be written during the tests. If it is set to an existing directory,
that directory will be removed once the tests have finished.
Some tests require a proprietary OpenEye license. You have to set up a few
things to make the tests read your license. The simple way is to just set the
OE_LICENSE
environment variable to the path of the file containing the
license.
Once you have a configuration and your license can be read, you can run the tests.
$ pytest tests --json /path/to/config.json --device cuda
For Tasks:
Click tags to check more tools for each tasksFor Jobs:
Alternative AI tools for REINVENT4
Similar Open Source Tools

REINVENT4
REINVENT is a molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design, molecule optimization, and other small molecule design tasks. It uses a Reinforcement Learning (RL) algorithm to generate optimized molecules compliant with a user-defined property profile defined as a multi-component score. Transfer Learning (TL) can be used to create or pre-train a model that generates molecules closer to a set of input molecules.

airwin2rack
The 'airwin2rack' repository is a collection of Airwindows audio plugins presented in various formats, including as a static library, a module for VCV Rack, and as CLAP/VST3/AU/LV2/Standalone plugins for DAWs. Users can access these plugins through different methods and interfaces, such as a uniform registry and access pattern, making it easy to integrate Airwindows plugins into their audio projects. The repository also provides instructions for updating the Airwindows sub-library and information on licensing, ensuring that users can utilize the plugins in both open and closed source environments.

ollama-autocoder
Ollama Autocoder is a simple to use autocompletion engine that integrates with Ollama AI. It provides options for streaming functionality and requires specific settings for optimal performance. Users can easily generate text completions by pressing a key or using a command pallete. The tool is designed to work with Ollama API and a specified model, offering real-time generation of text suggestions.

llm.c
LLM training in simple, pure C/CUDA. There is no need for 245MB of PyTorch or 107MB of cPython. For example, training GPT-2 (CPU, fp32) is ~1,000 lines of clean code in a single file. It compiles and runs instantly, and exactly matches the PyTorch reference implementation. I chose GPT-2 as the first working example because it is the grand-daddy of LLMs, the first time the modern stack was put together.

RouteLLM
RouteLLM is a framework for serving and evaluating LLM routers. It allows users to launch an OpenAI-compatible API that routes requests to the best model based on cost thresholds. Trained routers are provided to reduce costs while maintaining performance. Users can easily extend the framework, compare router performance, and calibrate cost thresholds. RouteLLM supports multiple routing strategies and benchmarks, offering a lightweight server and evaluation framework. It enables users to evaluate routers on benchmarks, calibrate thresholds, and modify model pairs. Contributions for adding new routers and benchmarks are welcome.

AirSane
AirSane is a SANE frontend and scanner server that supports Apple's AirScan protocol. It automatically detects scanners and publishes them through mDNS. Acquired images can be transferred in JPEG, PNG, and PDF/raster format. The tool is intended to be used with AirScan/eSCL clients such as Apple's Image Capture, sane-airscan on Linux, and the eSCL client built into Windows 10 and 11. It provides a simple web interface and encodes images on-the-fly to keep memory/storage demands low, making it suitable for devices like Raspberry Pi. Authentication and secure communication are supported in conjunction with a proxy server like nginx. AirSane has been reverse-engineered from Apple's AirScanScanner client communication protocol and offers a range of installation and configuration options for different operating systems.

MultiPL-E
MultiPL-E is a system for translating unit test-driven neural code generation benchmarks to new languages. It is part of the BigCode Code Generation LM Harness and allows for evaluating Code LLMs using various benchmarks. The tool supports multiple versions with improvements and new language additions, providing a scalable and polyglot approach to benchmarking neural code generation. Users can access a tutorial for direct usage and explore the dataset of translated prompts on the Hugging Face Hub.

eureka-ml-insights
The Eureka ML Insights Framework is a repository containing code designed to help researchers and practitioners run reproducible evaluations of generative models efficiently. Users can define custom pipelines for data processing, inference, and evaluation, as well as utilize pre-defined evaluation pipelines for key benchmarks. The framework provides a structured approach to conducting experiments and analyzing model performance across various tasks and modalities.

RAGMeUp
RAG Me Up is a generic framework that enables users to perform Retrieve, Answer, Generate (RAG) on their own dataset easily. It consists of a small server and UIs for communication. The tool can run on CPU but is optimized for GPUs with at least 16GB of vRAM. Users can combine RAG with fine-tuning using the LLaMa2Lang repository. The tool provides a configurable RAG pipeline without the need for coding, utilizing indexing and inference steps to accurately answer user queries.

RAGMeUp
RAG Me Up is a generic framework that enables users to perform Retrieve and Generate (RAG) on their own dataset easily. It consists of a small server and UIs for communication. Best run on GPU with 16GB vRAM. Users can combine RAG with fine-tuning using LLaMa2Lang repository. The tool allows configuration for LLM, data, LLM parameters, prompt, and document splitting. Funding is sought to democratize AI and advance its applications.

llama3-tokenizer-js
JavaScript tokenizer for LLaMA 3 designed for client-side use in the browser and Node, with TypeScript support. It accurately calculates token count, has 0 dependencies, optimized running time, and somewhat optimized bundle size. Compatible with most LLaMA 3 models. Can encode and decode text, but training is not supported. Pollutes global namespace with `llama3Tokenizer` in the browser. Mostly compatible with LLaMA 3 models released by Facebook in April 2024. Can be adapted for incompatible models by passing custom vocab and merge data. Handles special tokens and fine tunes. Developed by belladore.ai with contributions from xenova, blaze2004, imoneoi, and ConProgramming.

tutor-gpt
Tutor-GPT is an LLM powered learning companion developed by Plastic Labs. It dynamically reasons about your learning needs and updates its own prompts to best serve you. It is an expansive learning companion that uses theory of mind experiments to provide personalized learning experiences. The project is split into different modules for backend logic, including core logic, discord bot implementation, FastAPI API interface, NextJS web front end, common utilities, and SQL scripts for setting up local supabase. Tutor-GPT is powered by Honcho to build robust user representations and create personalized experiences for each user. Users can run their own instance of the bot by following the provided instructions.

nx_open
The `nx_open` repository contains open-source components for the Network Optix Meta Platform, used to build products like Nx Witness Video Management System. It includes source code, specifications, and a Desktop Client. The repository is licensed under Mozilla Public License 2.0. Users can build the Desktop Client and customize it using a zip file. The build environment supports Windows, Linux, and macOS platforms with specific prerequisites. The repository provides scripts for building, signing executable files, and running the Desktop Client. Compatibility with VMS Server versions is crucial, and automatic VMS updates are disabled for the open-source Desktop Client.

llamafile
llamafile is a tool that enables users to distribute and run Large Language Models (LLMs) with a single file. It combines llama.cpp with Cosmopolitan Libc to create a framework that simplifies the complexity of LLMs into a single-file executable called a 'llamafile'. Users can run these executable files locally on most computers without the need for installation, making open LLMs more accessible to developers and end users. llamafile also provides example llamafiles for various LLM models, allowing users to try out different LLMs locally. The tool supports multiple CPU microarchitectures, CPU architectures, and operating systems, making it versatile and easy to use.

chronon
Chronon is a platform that simplifies and improves ML workflows by providing a central place to define features, ensuring point-in-time correctness for backfills, simplifying orchestration for batch and streaming pipelines, offering easy endpoints for feature fetching, and guaranteeing and measuring consistency. It offers benefits over other approaches by enabling the use of a broad set of data for training, handling large aggregations and other computationally intensive transformations, and abstracting away the infrastructure complexity of data plumbing.

SlicerTotalSegmentator
TotalSegmentator is a 3D Slicer extension designed for fully automatic whole body CT segmentation using the 'TotalSegmentator' AI model. The computation time is less than one minute, making it efficient for research purposes. Users can set up GPU acceleration for faster segmentation. The tool provides a user-friendly interface for loading CT images, creating segmentations, and displaying results in 3D. Troubleshooting steps are available for common issues such as failed computation, GPU errors, and inaccurate segmentations. Contributions to the extension are welcome, following 3D Slicer contribution guidelines.
For similar tasks

REINVENT4
REINVENT is a molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design, molecule optimization, and other small molecule design tasks. It uses a Reinforcement Learning (RL) algorithm to generate optimized molecules compliant with a user-defined property profile defined as a multi-component score. Transfer Learning (TL) can be used to create or pre-train a model that generates molecules closer to a set of input molecules.
For similar jobs

REINVENT4
REINVENT is a molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design, molecule optimization, and other small molecule design tasks. It uses a Reinforcement Learning (RL) algorithm to generate optimized molecules compliant with a user-defined property profile defined as a multi-component score. Transfer Learning (TL) can be used to create or pre-train a model that generates molecules closer to a set of input molecules.