LLM-for-genomics-training

Tutorial on large language models for genomics

This repository provides training on large language models (LLMs) for genomics, including lecture notes and lab classes covering pretraining, finetuning, zero-shot prediction of mutation effects, synthetic DNA sequence generation, and DNA sequence optimization.

Training on large language models for genomics

Overview

This repository provides a training course on large language models (LLMs) for genomics. The training comprises a short lecture and several lab classes.

Lecture notes

You can download the lecture notes here.

Video of the lecture

Lab classes

Data to pretrain the model

The data can be found in the file:

  • data/genome_sequences/hg38/sequences_hg38_200b_verysmall.csv.gz

The file contains 100,000 non-overlapping DNA sequences of 200 bases each, corresponding to around 1% of the human genome.
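
As an illustration (not code from the repository), a 200-base sequence can be split into overlapping k-mers before tokenization, one common preprocessing step for DNA language models; note the actual model may instead use a learned tokenizer such as BPE, and the `kmerize` helper below is hypothetical:

```python
def kmerize(seq: str, k: int = 6) -> list[str]:
    """Split a DNA sequence into overlapping k-mers (hypothetical helper)."""
    assert set(seq) <= set("ACGTN"), "unexpected character in DNA sequence"
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# A 200-base sequence yields 200 - 6 + 1 = 195 overlapping 6-mers.
seq = "ACGT" * 50  # toy 200-base sequence
kmers = kmerize(seq)
print(len(kmers))  # → 195
```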

Pretraining of an LLM on DNA sequences

We will pretrain an LLM from scratch (a simplified Mistral model; see the folder data/models/Mixtral-8x7B-v0.1/) on the 100,000 DNA sequences from the human genome. The LLM is pretrained with causal language modeling on 200-base DNA sequences from the human genome (hg38 assembly).
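
Causal language modeling trains the model to predict each token from the ones before it; conceptually, the labels are simply the inputs shifted by one position. A minimal sketch with toy token ids (not the repository's training code):

```python
# Toy token ids for a DNA fragment (the id values are arbitrary here).
token_ids = [12, 7, 3, 9, 15]

# For causal LM, the model sees tokens 0..i and must predict token i+1,
# so inputs and labels are the same sequence shifted by one position.
inputs = token_ids[:-1]   # [12, 7, 3, 9]
labels = token_ids[1:]    # [7, 3, 9, 15]

# Each training pair: (context token, token to predict next).
pairs = list(zip(inputs, labels))
print(pairs)  # → [(12, 7), (7, 3), (3, 9), (9, 15)]
```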

Script on Google Colab

Video of the tutorial

Medium article

Finetuning of an LLM for DNA sequence classification

We will use a pretrained LLM from Hugging Face (https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-17M-hg38) and finetune it for DNA sequence classification. The aim is to classify a DNA sequence according to whether it binds a protein (a transcription factor), carries a histone mark, or acts as an active promoter.
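
Conceptually, finetuning adds a classification head on top of the pretrained model: the per-token embeddings are pooled into a single vector and passed through a linear layer. A numpy sketch of that head (shapes, weights, and values are illustrative stand-ins, not the repository's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend the pretrained LLM produced one 16-d embedding per token
# for a 200-token sequence (random values stand in for real outputs).
token_embeddings = rng.normal(size=(200, 16))

# Mean-pool over tokens, then apply a linear head with 2 classes
# (e.g. "binds the transcription factor" vs "does not bind").
pooled = token_embeddings.mean(axis=0)          # shape (16,)
W, b = rng.normal(size=(16, 2)), np.zeros(2)    # hypothetical head weights
logits = pooled @ W + b                          # shape (2,)
predicted_class = int(np.argmax(logits))
print(logits.shape, predicted_class)
```

During finetuning, the head weights (and usually the pretrained layers) are updated with a cross-entropy loss on labeled sequences.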

Script on Google Colab.

Video of the tutorial

Medium article

Zero-shot prediction of mutation effects

We will use a pretrained LLM from Hugging Face (https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-17M-hg38) to predict the impact of mutations with zero-shot learning (using the pretrained model directly, without finetuning). We compute the embedding of the wild-type sequence and the embedding of the mutated sequence, then take the L2 distance between the two. The higher the distance, the larger the expected mutation effect.
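
The distance computation itself is straightforward; in the sketch below, the two embeddings are small synthetic stand-ins for what the pretrained model would return:

```python
import numpy as np

# Synthetic stand-ins for the embeddings the pretrained LLM would
# produce for the wild-type and mutated sequences.
emb_wild_type = np.array([0.2, -0.5, 1.1, 0.0])
emb_mutant    = np.array([0.1, -0.4, 0.6, 0.3])

# L2 (Euclidean) distance between the two embeddings:
# a larger distance suggests a larger mutation effect.
l2_distance = np.linalg.norm(emb_wild_type - emb_mutant)
print(round(float(l2_distance), 4))
```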

Script on Google Colab.

Video of the tutorial

Medium article

Synthetic DNA sequence generation

We will use a pretrained LLM from Hugging Face (https://huggingface.co/RaphaelMourad/Mistral-DNA-v1-138M-yeast) to generate artificial yeast DNA sequences.
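
Autoregressive generation repeatedly samples the next base from the model's predicted distribution, starting from a prompt. A toy sketch in which a fixed base distribution stands in for the real model (a real LLM would condition the probabilities on the prefix):

```python
import random

random.seed(42)

# Toy next-base distribution standing in for the LLM's softmax output.
BASES = "ACGT"
PROBS = [0.3, 0.2, 0.2, 0.3]

def generate(prefix: str, n_new: int) -> str:
    """Sample n_new bases autoregressively after the given prompt."""
    seq = prefix
    for _ in range(n_new):
        seq += random.choices(BASES, weights=PROBS)[0]
    return seq

generated = generate("ATG", 20)
print(generated)  # 23 bases in total, starting with the "ATG" prompt
```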

Script on Google Colab.

Video of the tutorial

Medium article

DNA sequence optimization

We will use a finetuned LLM to optimize DNA sequences for promoter activity or transcription factor binding.
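
Sequence optimization can be framed as a search problem: propose single-base mutations and keep those the finetuned model scores higher. A greedy sketch in which a toy scoring function (GC content) stands in for the finetuned classifier's predicted activity:

```python
def score(seq: str) -> float:
    """Toy stand-in for the finetuned model's predicted activity:
    here simply the GC content of the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)

def greedy_optimize(seq: str, n_rounds: int = 5) -> str:
    """Greedily apply the best single-base mutation for up to n_rounds."""
    for _ in range(n_rounds):
        best_seq, best_score = seq, score(seq)
        for i in range(len(seq)):          # try every position...
            for b in "ACGT":               # ...and every substitution
                if b == seq[i]:
                    continue
                cand = seq[:i] + b + seq[i + 1:]
                if score(cand) > best_score:
                    best_seq, best_score = cand, score(cand)
        if best_seq == seq:                # no improving mutation found
            break
        seq = best_seq
    return seq

optimized = greedy_optimize("ATATATAT")
print(optimized, score(optimized))
```

With a real finetuned model, `score` would return the predicted promoter-activity or binding probability, and the same search loop would apply.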

Script on Google Colab.

Video of the tutorial
