
REINVENT4
AI molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design and molecule optimization.
Stars: 395

REINVENT is a molecular design tool for de novo design, scaffold hopping, R-group replacement, linker design, molecule optimization, and other small molecule design tasks. It uses a Reinforcement Learning (RL) algorithm to generate optimized molecules compliant with a user-defined property profile defined as a multi-component score. Transfer Learning (TL) can be used to create or pre-train a model that generates molecules closer to a set of input molecules.
README:
A paper describing the software has been published as Open Access in the Journal of Cheminformatics: Reinvent 4: Modern AI–driven generative molecule design. See AUTHORS.md for references to previous papers.
REINVENT is being developed on Linux and supports both GPU and CPU. The Linux version is fully validated. REINVENT on Windows and MacOSX supports both GPU and CPU but is only partially tested on these platforms and therefore support is limited.
The code is written in Python 3 (>= 3.10). The list of dependencies can be found in the repository (see also Installation below).
A GPU is not strictly necessary but strongly recommended for performance reasons especially for transfer learning and model training. Reinforcement learning (RL) requires the computation of scores where most scoring components run on the CPU. Thus, a GPU is less important for RL (depending on how much time is spent on the CPU).
Note that if no GPU is installed in your computer the code will run on the CPU automatically. REINVENT supports NVIDIA GPUs and also some AMD GPUs. For most design tasks a memory of about 8 GiB for both CPU main memory and GPU memory is sufficient.
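The automatic CPU fallback described above can be illustrated with a small check. This is only a sketch and not REINVENT's own device handling: the helper name pick_device is hypothetical, and it assumes that PyTorch (installed with the dependencies) is the library that reports GPU availability. ROCm builds of PyTorch expose AMD GPUs through the same torch.cuda interface.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu".

    Hypothetical helper for illustration; REINVENT selects the device itself.
    """
    try:
        import torch  # provided by the REINVENT dependency lockfile
    except ImportError:
        # Without PyTorch there is nothing to query, so report the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Computations would run on: {pick_device()}")
```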
- Clone this Git repository.
- Install a compatible version of Python, for example with Conda (other virtual environments like Docker, pyenv, or the system package manager work too).
conda create --name reinvent4 python=3.10
conda activate reinvent4
- Change directory to the repository and install the dependencies from the lockfile:
pip install -r requirements-linux-64.lock
- Optional: if you want to use AMD GPUs on Linux, install the ROCm PyTorch version manually after installing the dependencies in the previous step, e.g.
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/rocm5.7
- Optional: for MacOSX, use the requirements file requirements-macOS.lock.
- Install the tool. The dependencies were already installed in the previous step, so there is no need to install them again (flag --no-deps). If you want to install in editable mode (changes to the code are automatically available), add -e before the dot.
pip install --no-deps .
- Test the tool. The installer has added a script reinvent to your PATH.
reinvent --help
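The two preconditions stated in the installation notes (Python 3.10 or later, and a reinvent script on the PATH) can be checked with the standard library alone. The function name check_install and the shape of its return value are assumptions made for this sketch.

```python
import shutil
import sys

# Minimum interpreter version stated in the README.
MIN_PYTHON = (3, 10)


def check_install(script: str = "reinvent") -> dict:
    """Report whether the interpreter is new enough and where the script lives.

    "script_path" is None when the script cannot be found on PATH.
    """
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "script_path": shutil.which(script),
    }


if __name__ == "__main__":
    print(check_install())
```

A script_path of None after installation usually means the environment used for pip install is not the one that is currently active.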
REINVENT is a command line tool and works principally as follows:
reinvent -l sampling.log sampling.toml
This writes logging information to the file sampling.log. If you wish to write this to the screen, leave out the -l sampling.log part. sampling.toml is the configuration file. The main user format is TOML as it tends to be more user friendly. JSON can be used too (add -f json), but a specialised editor is recommended as the format is very sensitive to minor changes.
Sample configuration files for all run modes are located in configs/toml in the repository and file paths in these files would need to be adjusted to your local installation. In particular, ready made prior models are located in priors and you would choose a model and the appropriate run mode depending on the research problem you are trying to address. There is additional information in configs/toml in several *.md files with instructions on how to configure the TOML file. Internal priors can be referenced with a dot notation (see reinvent/prior_registry.py).
Basic instructions can be found in the comments in the config examples in configs/toml.
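Because the file paths in the sample configurations have to be adjusted by hand, a quick check for paths that still point nowhere can save a failed run. The sketch below walks an already parsed configuration (a nested dict) and reports string values with a given suffix that do not exist on disk. The helper names, the .prior suffix and the keys in the example are assumptions for illustration, not REINVENT's configuration schema.

```python
from pathlib import Path
from typing import Any, Iterator


def iter_strings(node: Any, prefix: str = "") -> Iterator[tuple[str, str]]:
    """Yield (dotted key, value) for every string in a nested configuration."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{prefix}[{index}]")
    elif isinstance(node, str):
        yield prefix, node


def missing_files(config: dict, suffixes: tuple[str, ...] = (".prior",)) -> list[tuple[str, str]]:
    """List string values that end in one of `suffixes` but do not exist on disk."""
    return [
        (key, value)
        for key, value in iter_strings(config)
        if value.endswith(suffixes) and not Path(value).exists()
    ]


if __name__ == "__main__":
    # Hypothetical configuration; the key names are illustrative only.
    example = {"parameters": {"model_file": "/no/such/dir/example.prior"}}
    print(missing_files(example))
```

On Python 3.11 or later the dict can be obtained with tomllib.load on a file opened in binary mode.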
Notebooks are provided in the notebooks/ directory. Please note that we provide the notebooks in jupytext "light script" format. To work with the light scripts you will need to install jupytext. A few other packages will come in handy too.
pip install jupytext mols2grid seaborn
The Python files in notebooks/ can then be converted to a notebook, e.g.
jupytext -o Reinvent_demo.ipynb Reinvent_demo.py
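To convert every light script in a directory rather than a single file, the jupytext command shown above can be generated once per file. This is a sketch only: jupytext_commands is a hypothetical helper, and it assumes that every *.py file in the directory is a notebook script.

```python
from pathlib import Path


def jupytext_commands(notebook_dir: str) -> list[list[str]]:
    """Build one `jupytext -o X.ipynb X.py` command per Python file."""
    commands = []
    for script in sorted(Path(notebook_dir).glob("*.py")):
        target = script.with_suffix(".ipynb")
        commands.append(["jupytext", "-o", str(target), str(script)])
    return commands


if __name__ == "__main__":
    for command in jupytext_commands("notebooks"):
        print(" ".join(command))
```

Each command list can be executed with subprocess.run(command, check=True) once jupytext is installed.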
Update the lock files with pip-tools (please, do not edit the files manually):
pip-compile --extra-index-url=https://download.pytorch.org/whl/cu121 --extra-index-url=https://pypi.anaconda.org/OpenEye/simple --resolver=backtracking pyproject.toml
To update a single package, use pip-compile --upgrade-package somepackage (see the documentation for pip-tools).
The scoring subsystem uses a simple plugin mechanism (Python native namespace packages). If you wish to write your own plugin, follow the instructions below. There is no need to touch any of the REINVENT code. The public repository contains a contrib directory with some useful examples.
- Create /top/dir/somewhere/reinvent_plugins/components where /top/dir/somewhere is a convenient location for you.
- Do not place a __init__.py in either reinvent_plugins or components as this would break the mechanism. It is fine to create normal packages within components as long as you import those correctly.
- Place a file whose name starts with comp_* into reinvent_plugins/components. Files with different names will be ignored i.e. not imported. The directory will be searched recursively so structure your code as needed but directory/package names must be unique.
- Tag the scoring component class(es) in that file with the @add_tag decorator. More than one component class can be added to the same comp_ file. See existing code.
- Tag at most one dataclass for parameters in the same file, see existing code. This is optional.
- Set or add /top/dir/somewhere to the PYTHONPATH environment variable or use any other mechanism to extend sys.path.
- The scoring component should now automatically be picked up by REINVENT.
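The layout rules in the steps above are easy to get wrong, so a small pre-flight check can help. The sketch below inspects only the file system and does not import REINVENT or the plugin code. The function name check_plugin_layout is hypothetical, and it assumes that plugin files are matched by the pattern comp_*.py.

```python
from pathlib import Path


def check_plugin_layout(top: str) -> list[str]:
    """Return the comp_*.py files found under top/reinvent_plugins/components.

    Raises ValueError if the layout would break namespace-package discovery.
    """
    root = Path(top) / "reinvent_plugins"
    components = root / "components"
    if not components.is_dir():
        raise ValueError(f"missing directory: {components}")
    # Namespace packages must not contain an __init__.py at these two levels.
    for package in (root, components):
        marker = package / "__init__.py"
        if marker.exists():
            raise ValueError(f"remove {marker}")
    # The directory is searched recursively; other file names are ignored.
    return sorted(str(path.relative_to(components)) for path in components.rglob("comp_*.py"))


if __name__ == "__main__":
    print(check_plugin_layout("/top/dir/somewhere"))
```

An empty list means that no file matches the naming rule, so nothing would be picked up even if PYTHONPATH is set correctly.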
This is primarily for developers and admins/users who wish to ensure that the installation works. The information here is not relevant to the practical use of REINVENT. Please refer to Basic Usage for instructions on how to use the reinvent command.
The REINVENT project uses the pytest framework for its tests. Before you run them you first have to create a configuration file for the tests.
In the project directory, create a config.json file in the configs/ directory. You can use the example config example.config.json as a base. Make sure that you set MAIN_TEST_PATH to a non-existent directory. That is where temporary files will be written during the tests. If it is set to an existing directory, that directory will be removed once the tests have finished.
Some tests require a proprietary OpenEye license. You have to set up a few things to make the tests read your license. The simple way is to just set the OE_LICENSE environment variable to the path of the file containing the license.
Once you have a configuration and your license can be read, you can run the tests.
pytest tests --json /path/to/config.json --device cuda
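Since an existing MAIN_TEST_PATH directory is removed after the tests, it can be worth guarding against pointing it at real data. The following sketch copies the example configuration and refuses to proceed if the chosen directory already exists. write_test_config is a hypothetical helper, and it assumes that MAIN_TEST_PATH is a top-level key in the JSON file.

```python
import json
from pathlib import Path


def write_test_config(example: str, target: str, main_test_path: str) -> dict:
    """Copy the example config to `target` with MAIN_TEST_PATH replaced."""
    if Path(main_test_path).exists():
        # The tests delete this directory when they finish, so never reuse one.
        raise FileExistsError(f"{main_test_path} already exists; choose a new directory")
    config = json.loads(Path(example).read_text())
    config["MAIN_TEST_PATH"] = main_test_path  # assumed to be a top-level key
    Path(target).write_text(json.dumps(config, indent=4))
    return config


if __name__ == "__main__":
    write_test_config(
        "configs/example.config.json",
        "configs/config.json",
        "/tmp/reinvent_test_scratch",  # hypothetical scratch location
    )
```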