Scientific-LLM-Survey

Scientific Large Language Models: A Survey on Biological & Chemical Domains

Stars: 261


Scientific Large Language Models (Sci-LLMs) is a repository that collects papers on scientific large language models, focusing on biology and chemistry domains. It includes textual, molecular, protein, and genomic languages, as well as multimodal language. The repository covers various large language models for tasks such as molecule property prediction, interaction prediction, protein sequence representation, protein sequence generation/design, DNA-protein interaction prediction, and RNA prediction. It also provides datasets and benchmarks for evaluating these models. The repository aims to facilitate research and development in the field of scientific language modeling.

README:

Scientific Large Language Models (Sci-LLMs)

This repository collects papers on scientific large language models, particularly in the domains of biology and chemistry.

😎 You are welcome to recommend missing papers by opening an Issue or a Pull Request.

🔔 News

💥 [2024/07] We have updated our survey paper by incorporating the latest related works. Please refer to the revised version on arXiv.
💥 [2024/01] Our survey paper 'Scientific Large Language Models: A Survey on Biological & Chemical Domains' has been released on arXiv.

In this survey, we focus on scientific languages (i.e., textual, molecular, protein and genomic languages), as well as their combination (i.e., multimodal language).
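As a rough illustration of what these scientific languages look like as raw sequences, the sketch below pairs each modality with a toy string and a naive tokenizer. The function names, the regex, and the made-up protein and DNA strings are assumptions for illustration only; they are not the tokenizers used by any model listed in this repository.

```python
import re

# Toy examples of each "scientific language" (illustrative, not from the survey).
EXAMPLES = {
    "textual": "Aspirin inhibits cyclooxygenase.",
    "molecular": "CC(=O)OC1=CC=CC=C1C(=O)O",  # SMILES string for aspirin
    "protein": "MKTAYIAKQR",                   # made-up amino-acid sequence (one-letter codes)
    "genomic": "ATGCGTACGT",                   # made-up DNA sequence
}

# Hypothetical, simplified SMILES pattern: two-letter halogens, bracket atoms,
# single-letter atoms, ring-closure digits, and bond/branch symbols.
SMILES_PATTERN = re.compile(r"Cl|Br|\[[^\]]+\]|[A-Za-z]|\d|[=#()+\-/\\.@%]")


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, branch, and ring tokens."""
    return SMILES_PATTERN.findall(smiles)


def tokenize_residues(sequence: str) -> list[str]:
    """Treat every amino-acid letter as one token."""
    return list(sequence)


def tokenize_kmers(sequence: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    print(tokenize_smiles(EXAMPLES["molecular"]))
    print(tokenize_residues(EXAMPLES["protein"]))
    print(tokenize_kmers(EXAMPLES["genomic"], k=3))
```

Character-level and k-mer splitting are only two of several possible choices; individual models in the lists below may use learned subword vocabularies or graph and structure inputs instead.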

🌟 Contents

Scientific Large Language Models (Sci-LLMs)

📖 Textual Scientific Large Language Models (Text-Sci-LLMs)

Medical

2019.04 ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission, arXiv, Code
2022.02 GatorTron: A Large Clinical Language Model to Unlock Patient Information from Unstructured Electronic Health Records, arXiv, Model
2022.12 BioMedLM, .stanford.edu, Hugging Face
2023.05 A Study of Generative Large Language Model for Medical Research and Healthcare (GatorTronGPT), arXiv, Code
2023.11 MEDITRON-70B: Scaling Medical Pretraining for Large Language Models, arXiv, Code
2024.03 Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks (Meerkat), arXiv
2023.06 ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation, arXiv
2023.10 Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model, arXiv, Code
2023.03 ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge, arXiv, Code
2023.04 HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge, arXiv, Code
2023.05 HuatuoGPT, towards Taming Language Model to Be a Doctor, arXiv, Code
2023.04 Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data, arXiv, Code
2023.08 Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue, arXiv, Code
2023.04 PMC-LLaMA: Towards Building Open-source Language Models for Medicine, arXiv, Code
2023.09 CPLLM: Clinical Prediction with Large Language Models, arXiv, Code
2023.05 Towards Expert-Level Medical Question Answering with Large Language Models (Med-PaLM 2), Google Research, arXiv
2023.05 Clinical Camel: An Open Expert-Level Medical Language Model with Dialogue-Based Knowledge Encoding, arXiv, Code
2023.04 DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task, arXiv, Code
2023.10 BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT, arXiv, Code
2024.01 Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain, arXiv
2024.02 Me LLaMA: Foundation Large Language Models for Medical Applications, arXiv, Code
2024.02 BiMediX: Bilingual Medical Mixture of Experts LLM, arXiv, Code, Hugging Face

Biology

2019.04 BioELMo: Probing Biomedical Embeddings from Language Models, arXiv, Code
2019.05 BioBERT: a pre-trained biomedical language representation model for biomedical text mining, arXiv, Code
2019.07 Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets, arXiv, Code
2020.10 BioMegatron: Larger Biomedical Domain Language Model, arXiv, Code
2020.10 Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, arXiv, Hugging Face
2021.06 BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA, ACL Anthology, Code
2022.03 LinkBERT: Pretraining Language Models with Document Links, arXiv, Code
2023.03 BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining, arXiv, Code
2023.08 BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine, arXiv, Code
2023.09 BioinspiredLLM: Conversational Large Language Model for the Mechanics of Biological and Bio-Inspired Materials, arXiv
2024.02 BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains, arXiv, Code

Chemistry

2021.06 Automated Chemical Reaction Extraction from Scientific Literature, Journal of Chemical Information and Modeling, Code
2021.09 MatSciBERT: A materials domain language model for text mining and information extraction, npj Computational Materials, Code
2022.09 A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing, npj Computational Materials, Hugging Face
2024.01 ChemDFM: Dialogue Foundation Model for Chemistry, arXiv, Model
2024.02 ChemLLM: A Chemical Large Language Model, arXiv, Model
2024.02 LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset, arXiv, Page, Model, Dataset
2024.02 PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry, arXiv

Comprehensive

2019.09 SciBERT: A Pretrained Language Model for Scientific Text, arXiv, Code
2023.05 The Diminishing Returns of Masked Language Models to Science, arXiv, Hugging Face
2023.08 DARWIN Series: Domain Specific Large Language Models for Natural Science, arXiv, Code
2024.01 SciGLM: Training Scientific Language Models with Self-Reflective Instruction Annotation and Tuning, arXiv, GitHub
2024.03 Uni-SMART: Universal Science Multimodal Analysis and Research Transformer, arXiv
2024.05 INDUS: Effective and Efficient Language Models for Scientific Applications, arXiv

Datasets and Benchmarks

The MIMIC dataset, 2016.05. mimic-code, Data Descriptor: MIMIC-III, a freely accessible critical care database, Scientific Data
eICU-CRD. 2019.04. The eICU Collaborative Research Database, a freely available multi-center database for critical care research, Scientific Data
cMedQA2, 2018.11. Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection, IEEE Access
MedDialog-Chinese. MedDialog: Large-scale Medical Dialogue Datasets, EMNLP 2020
ChiMed. 2023.10. Qilin-Med: Multi-stage Knowledge Injection Advanced Medical Large Language Model, arXiv
HealthCareMagic-100k, 2023.03. ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) Using Medical Domain Knowledge, arXiv
MedQuAD, 2019.01. A Question-Entailment Approach to Question Answering, arXiv
MultiMedQA, 2023.07. Large language models encode clinical knowledge, Nature
Open-I, 2015.07. Preparing a collection of radiology examinations for distribution and retrieval, JAMIA
Psych8k, 2024.03. ChatCounselor: A Large Language Models for Mental Health Support, arXiv
CMD. dataset, 2019.09. DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task, arXiv
BianQueCorpus, 2023.10. BianQue: Balancing the Questioning and Suggestion Ability of Health LLMs with Multi-turn Health Conversations Polished by ChatGPT, arXiv
MedQA-USMLE, 2021.04. What Disease does this Patient Have? A Large-scale Open Domain Question Answering Dataset from Medical Exams, arXiv
MedMCQA, 2022.03. MedMCQA : A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering, arXiv
JAMA Clinical Challenge dataset, 2024.05. Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions, arXiv
CMtMedQA, 2023.08. Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue, arXiv
huatuo-26M, 2023.05. Huatuo-26M, a Large-scale Chinese Medical QA Dataset, arXiv
MMLU, 2020.09. Measuring Massive Multitask Language Understanding, arXiv
C-Eval, 2023.05. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models, arXiv
AGIEval, 2023.05. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models, arXiv
ScienceQA, 2022.09. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering, arXiv
Xiezhi, 2023.06. Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation, arXiv
SciEval, 2023.08. SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research, arXiv
Bioinfo-Bench, 2023.10. A Simple Benchmark Framework for LLM Bioinformatics Skills Evaluation, bioRxiv
BLURB, 2020.07. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing, arXiv
ARC, 2018.03. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge, arXiv
SciQ, 2017.07. Crowdsourcing Multiple Choice Science Questions, arXiv

🧪 Molecular Large Language Models (Mol-LLMs)

Molecule Property Prediction

2019.09 SMILES-BERT: Large Scale Unsupervised Pre-Training for Molecular Property Prediction, ACM-BCB, Code
2019.11 SMILES Transformer: Pre-trained Molecular Fingerprint for Low Data Drug Discovery, arXiv, Code
2020.02 Molecule attention transformer, arXiv, Code
2020.10 ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction, arXiv, Code
2020.10 Self-Supervised Graph Transformer on Large-Scale Molecular Data, arXiv, Code
2020.11 Language models in molecular discovery, NeurIPS, Code
2021.05 MG-BERT: leveraging unsupervised atomic representation learning for molecular property prediction, Briefings in Bioinformatics, Code
2021.06 Algebraic graph-assisted bidirectional transformers for molecular property prediction, Nature Communications, Code
2021.09 Mol-BERT: An Effective Molecular Representation with BERT for Molecular Property Prediction, Wireless Communications and Mobile Computing, Code
2021.10 Relative molecule self-attention transformer, Journal of Cheminformatics, Code
2022.08 KPGT: Knowledge-Guided Pre-training of Graph Transformer for Molecular Property Prediction, Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Code
2022.09 ChemBERTa-2: Towards Chemical Foundation Models, arXiv, Code
2022.01 Chemformer: a pre-trained transformer for computational chemistry, Mach. Learn.: Sci. Technol., Code
2022.10 Large-Scale Distributed Training of Transformers for Chemical Fingerprinting, JCIM, Code
2022.11 BARTSmiles: Generative Masked Language Models for Molecular Representations, arXiv, Code
2022.12 Large-Scale Chemical Language Representations Capture Molecular Structure and Properties, arXiv, Code
2022.12 Pushing the Boundaries of Molecular Property Prediction for Drug Discovery with Multitask Learning BERT Enhanced by SMILES Enumeration, Research, Code
2023.01 MolRoPE-BERT: An enhanced molecular representation with Rotary Position Embedding for molecular property prediction, Journal of Molecular Graphics and Modelling
2023.01 Molformer: Motif-based Transformer on 3D Heterogeneous Molecular Graphs, arXiv, Code
2023.02 Uni-Mol: A Universal 3D Molecular Representation Learning Framework, NeurIPS, Code
2023.05 SELFormer: Molecular Representation Learning via SELFIES Language Models, arXiv, Code
2023.07 Molecular Descriptors Property Prediction Using Transformer-Based Approach, IJMS
2023.10 GTMGC: Using Graph Transformer to Predict Molecule’s Ground-State Conformation, ICLR

Interaction Prediction

2020.12 X-MOL: large-scale pre-training for molecular understanding and diverse molecular analysis, bioRxiv, Code

Molecule Generation/Design/Edit

2021.05 MolGPT: Molecular Generation Using a Transformer-Decoder Model, JCIM, Code
2021.07 Transmol: repurposing a language model for molecular generation, RSC Advances, Code
2021.09 Generative Pre-Training from Molecules, ChemRxiv, Code
2021.12 Generative Chemical Transformer: Neural Machine Learning of Molecular Geometric Structures from Chemical Language via Attention, JCIM, Code
2022.10 A Pre-trained Conditional Transformer for Target-specific De Novo Molecular Generation, arXiv
2023.05 iupacGPT: IUPAC-based large-scale molecular pre-trained model for property prediction and molecule generation, ChemRxiv, Code
2023.05 cMolGPT: A Conditional Generative Pre-Trained Transformer for Target-Specific De Novo Molecular Generation, Molecules, Code
2023.05 Molecule generation using transformers and policy gradient reinforcement learning, Scientific Reports, Code
2023.10 Domain-Agnostic Molecular Generation with Self-Feedback, arXiv, Code

Reaction Prediction

2019.08 Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction, ACS Cent. Sci., Code
2019.08 Molecular Transformer unifies reaction prediction and retrosynthesis across pharma chemical space, Chemical Communications
2019.09 A Transformer Model for Retrosynthesis, ICANN, Code
2019.12 Predicting Retrosynthetic Reaction using Self-Corrected Transformer Neural Networks, arXiv, Code
2020.11 State-of-the-art augmented NLP transformer models for direct and single-step retrosynthesis, Nature Communications, Code
2021.01 Valid, Plausible, and Diverse Retrosynthesis Using Tied Two-Way Transformers with Latent Variables, JCIM, Code
2021.01 Prediction of chemical reaction yields using deep learning, Mach. Learn.: Sci. Technol., Code
2021.03 Predicting Chemical Reaction Outcomes: A Grammar Ontology-based Transformer Framework, AIChE Journal
2021.10 Molecular Graph Enhanced Transformer for Retrosynthesis Prediction, Neurocomputing, Code
2021.10 Permutation Invariant Graph-to-Sequence Model for Template-Free Retrosynthesis and Reaction Prediction, arXiv, Code
2022.03 Retrosynthetic reaction pathway prediction through neural machine translation of atomic environments, Nature Communications, Code
2023.02 Enhancing diversity in language based models for single-step retrosynthesis, Digital Discovery, Code
2023.07 Unbiasing Retrosynthesis Language Models with Disconnection Prompts, ACS Cent. Sci., Code

Datasets and Benchmarks

ZINC 15, 2015.10 ZINC 15 – Ligand Discovery for Everyone, JCIM
ZINC 20, 2020.12 ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery, JCIM
ZINC-250k, 2012.07 ZINC − A Free Database of Commercially Available Compounds for Virtual Screening, JCIM
PubChem, 2023.01 PubChem 2023 update, Nucleic Acids Research
USPTO, USPTO MIT, USPTO-15K, USPTO-full, 2012.10 Extraction of chemical structures and reactions from the literature, University of Cambridge
PCQM4Mv2, 2021.10 OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs, arXiv
PCQM4M-LSC, 2021.06 First Place Solution of KDD Cup 2021 & OGB Large-Scale Challenge Graph Prediction Track, arXiv
GEOM, 2022.04 GEOM, energy-annotated molecular conformations for property prediction and molecular generation, Nature
ToyMix, LargeMix, UltraLarge, 2023.10 Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets, arXiv
ChEMBL, 2023.05 The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods, Nucleic Acids Research
DrugBank 5.0, 2017.11 DrugBank 5.0: a major update to the DrugBank database for 2018, Nucleic Acids Research
GDB-17, 2012.10 Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17, JCIM
ExCAPE-DB, 2017.03 ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics, Journal of Cheminformatics
MoleculeNet, 2017.10 MoleculeNet: a benchmark for molecular machine learning, Chemical Science
MARCEL, 2023.09 Learning Over Molecular Conformer Ensembles: Datasets and Benchmarks, arXiv
GuacaMol, 2019.03 GuacaMol: Benchmarking Models for de Novo Molecular Design, JCIM
MOSES, 2020.12 Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models, Frontiers in Pharmacology
ADMETlab 2.0, 2021.04 ADMETlab 2.0: an integrated online platform for accurate and comprehensive predictions of ADMET properties, Nucleic Acids Research
SPECTRA, 2024.02 Evaluating generalizability of artificial intelligence models for molecular datasets, bioRxiv
Molecule3D, 2021.09 Molecule3D: A Benchmark for Predicting 3D Geometries from Molecular Graphs, arXiv

🧬 Protein Large Language Models (Prot-LLMs)

Protein Sequence Representation

2020.02 Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences, PNAS, Code
2021.02 MSA transformer, PMLR, Code
2021.02 Multi-scale representation learning on proteins, NeurIPS
2021.02 Language models enable zero-shot prediction of the effects of mutations on protein function, NeurIPS, Code
2021.07 ProtTrans: Toward Understanding the Language of Life Through Self-Supervised Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence, Code
2021.07 Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model, CoRR
2021.09 Toward more general embeddings for protein design: Harnessing joint representations of sequence and structure, bioRxiv
2022.02 ProteinBERT: a universal deep-learning model of protein sequence and function, bioRxiv, Code
2022.04 LM-GVP: an extensible sequence and structure informed deep learning framework for protein property prediction, bioRxiv, Code
2022.05 Retrieved Sequence Augmentation for Protein Representation Learning, bioRxiv, Code
2022.06 OntoProtein: Protein Pretraining With Gene Ontology Embedding, arXiv, Code
2022.07 Language models of protein sequences at the scale of evolution enable accurate structure prediction, bioRxiv, Code
2023.02 Multi-level Protein Structure Pre-training via Prompt Learning, ICLR, Code
2023.02 Protein Representation Learning via Knowledge Enhanced Primary Structure Modeling, arXiv, Code
2023.10 Deciphering the protein landscape with ProtFlash, a lightweight language model, bioRxiv, Code
2023.10 Enhancing protein language models with structure-based encoder and pre-training, arXiv, Code
2023.10 SaProt: Protein language modeling with structure-aware vocabulary, bioRxiv, Code
2023.12 ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers, bioRxiv
2024.02 Codon language embeddings provide strong signals for use in protein engineering, Nature Machine Intelligence, Code
2024.04 Pre-training Sequence, Structure, and Surface Features for Comprehensive Protein Representation Learning, ICLR
2024.06 ESM All-Atom: Multi-scale Protein Language Model for Unified Molecular Modeling, bioRxiv
2024.06 Knowledge-aware Reinforced Language Models for Protein Directed Evolution, ICML
2024.06 Simulating 500 million years of evolution with a language model, bioRxiv, Code

Protein Sequence Generation/Design

2020.03 ProGen: Language Modeling for Protein Generation, arXiv, Code
2021.01 A deep unsupervised language model for protein design, bioRxiv, Code
2021.01 Fold2seq: A joint sequence (1d)-fold (3d) embedding-based generative model for protein design, PMLR, Code
2022.01 ZymCTRL: a conditional language model for the controllable generation of artificial enzymes, NeurIPS, Code
2022.04 Few Shot Protein Generation, arXiv
2022.05 RITA: a Study on Scaling Up Generative Protein Sequence Models, arXiv
2022.12 Generative language modeling for antibody design, arXiv, Code
2023.02 Structure-informed Language Models Are Protein Designers, bioRxiv
2023.02 Generative power of a protein language model trained on multiple sequence alignments, eLife, Code
2023.02 Protein sequence design in a latent space via model-based reinforcement learning, ICLR
2023.06 Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation, arXiv, Code
2023.07 ProstT5: Bilingual Language Model for Protein Sequence and Structure, bioRxiv, Code
2023.07 xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein, bioRxiv
2023.08 Efficient and accurate sequence generation with small-scale protein language models, bioRxiv
2023.10 Generative Antibody Design for Complementary Chain Pairing Sequences through Encoder-Decoder Language Model, NeurIPS
2023.10 ProGen2: exploring the boundaries of protein language models, Cell, Code
2023.10 ProteinRL: Reinforcement learning with generative protein language models for property-directed sequence design, NeurIPS
2023.11 PoET: A generative model of protein families as sequences-of-sequences, arXiv
2024.03 Protein Discovery with Discrete Walk-Jump Sampling, arXiv

Datasets and Benchmarks

UniRef100, 90, 50, 2007.03 UniRef: comprehensive and non-redundant UniProt reference clusters, Bioinformatics
UniProtKB/Swiss-Prot, 2016.01 UniProtKB/Swiss-Prot, the manually annotated section of the UniProt KnowledgeBase: how to use the entry view, Springer Plant Bioinformatics
UniProtKB/TrEMBL, 1999.03 EDITtoTrEMBL: a distributed approach to high-quality automated protein sequence annotation, Bioinformatics
UniParc, 2022.11 UniProt: the Universal Protein Knowledgebase in 2023, Bioinformatics
Pfam, 1999.03 Pfam: clans, web tools and services, Nucleic Acids Research
BFD, 2018.06 Clustering huge protein sequence sets in linear time, Nature Communications
PDB, 2018.10 Protein Data Bank: the single global archive for 3D macromolecular structure data, Nucleic Acids Research
AlphaFoldDB, 2021.11 AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models, Nucleic Acids Research
CASP, 2021.09 Critical assessment of methods of protein structure prediction (CASP)—Round XIV, PROTEINS
EC, 2008.09 ExplorEnz: the primary source of the IUBMB enzyme list, Nucleic Acids Research
GO, 2000.05 Gene ontology: tool for the unification of biology, Nature Genetics
CATH, 1997.08 CATH--a hierarchic classification of protein domain structures, NIH
HIPPIE, 2012.02 HIPPIE: Integrating protein interaction networks with experiment based quality scores, PLoS ONE
SCOP, 2000.01 SCOP: a structural classification of proteins database, Nucleic Acids Research
ProteinGym, 2023.09 ProteinGym: Large-scale benchmarks for protein fitness prediction and design, NeurIPS
FLIP, 2022.01 FLIP: Benchmark tasks in fitness landscape inference for proteins, bioRxiv
PEER, 2022.09 PEER: a comprehensive and multi-task benchmark for protein sequence understanding, NeurIPS
TAPE, 2019.09 Evaluating Protein Transfer Learning with TAPE, NeurIPS

🦠 Genomic Large Language Models (Gene-LLMs)

General

2021.02 DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome, Bioinformatics
2022.08 MoDNA: motif-oriented pre-training for DNA language model, ACM-BCB
2023.01 Species-aware DNA language modeling, bioRxiv
2023.01 The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics, bioRxiv
2023.06 HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution, arXiv
2023.06 DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome, arXiv
2023.06 GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences, bioRxiv
2023.06 Geneformer: Transfer learning enables predictions in network biology, bioRxiv
2023.07 EpiGePT: a Pretrained Transformer model for epigenomics, bioRxiv
2023.08 Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision, bioRxiv
2023.08 DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks, bioRxiv
2024.02 Evo: Sequence modeling and design from molecular to genome scale with Evo, Nature
2024.02 GenomicLLM: Exploring Genomic Large Language Models: Bridging the Gap between Natural Language and Gene Sequences, bioRxiv

Function Prediction

2021.10 Effective gene expression prediction from sequence by integrating long-range interactions, Nature Methods
2022.08 iEnhancer-BERT: A Novel Transfer Learning Architecture Based on DNA-Language Model for Identifying Enhancers and Their Strength, ICIC 2022
2022.10 iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations, Genome Biology
2022.12 iEnhancer-ELM: improve enhancer identification by extracting position-related multiscale contextual information based on enhancer language models, arXiv
2023.03 miProBERT: identification of microRNA promoters based on the pre-trained model BERT, Briefings in Bioinformatics
2023.07 PLPMpro: Enhancing promoter sequence prediction with prompt-learning based pre-trained language model, Computers in Biology and Medicine
2024.02 FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics, arXiv

Variants and Evolution Prediction

2022.08 DNA language models are powerful predictors of genome-wide variant effects, bioRxiv
2022.10 GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics, bioRxiv
2023.10 GPN-MSA: an alignment-based DNA language model for genome-wide variant effect prediction, bioRxiv
2024.12 Bio-xLSTM: Generative modeling, representation and in-context learning of biological and chemical sequences, arXiv

DNA-Protein Interaction Prediction

2023.05 Improving language model of human genome for DNA–protein binding prediction based on task-specific pre-training, Interdisciplinary Sciences: Computational Life Sciences
2024.04 gLM: Genomic language model predicts protein co-regulation and function, bioRxiv

RNA Prediction

2023.02 Self-supervised learning on millions of pre-mRNA sequences improves sequence-based RNA splicing prediction, bioRxiv
2023.03 Multiple sequence-alignment-based RNA language model and its application to structural inference, bioRxiv
2023.06 Prediction of Multiple Types of RNA Modifications via Biological Language Model, IEEE/ACM Transactions on Computational Biology and Bioinformatics
2023.07 Uni-RNA: Universal Pre-trained Models Revolutionize RNA Research, bioRxiv
2024.02 RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks, arXiv
2024.05 RNAErnie: Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning, Nature Machine Intelligence

Datasets and Benchmarks

MGC, 1999.10 The Mammalian Gene Collection, Science
GRCh38, 2013.12 Improvements and impacts of GRCh38 human reference on high throughput sequencing data analysis, Genomics
690 ChIP-seq, 2016.06 Convolutional neural network architectures for predicting DNA–protein binding
DeepSEA, 2017.04 Predicting effects of noncoding variants with deep learning–based sequence model, Nature Methods
1000 Genomes Project, 2017.10 A global reference for human genetic variation, Nature
EPDnew, 2019.11 EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era, Nucleic Acids Research
Panglao Dataset, 2020.03 PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data, Database
ExPecto, 2020.12 Sequence-based prediction of variants’ effects, Nature Methods
UCSC Genome Database, 2022.11 The UCSC Genome Browser Database, Nucleic Acids Research
BV-BRC, 2023.01 Introducing the Bacterial and Viral Bioinformatics Resource Center (BV-BRC): a resource combining PATRIC, IRD and ViPR, Nucleic Acids Research
Ensembl, 2023.02 The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools, Nucleic Acids Research
RNAcmap, 2023.07 RNAcmap: a fully automatic pipeline for predicting contact maps of RNAs by evolutionary coupling analysis
ENCODE, 2023.09 An integrated encyclopedia of DNA elements in the human genome, Nature
NCBI Genome Database, 2023.10
TAIR, 2023.12 The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools, Nucleic Acids Research
VGDB, 2023.12 Viral Genome DataBase: storing and analyzing genes and proteins from complete viral genomes, Bioinformatics
CAGI5, 2023.07 CAGI5: Objective performance assessments of predictions based on the Evolutionary Action equation, Human Mutation
Protein–RNA Interaction Prediction, 2023.08 A systematic benchmark of machine learning methods for protein–RNA interaction prediction
The Nucleotide Transformer Benchmark, 2023.09 The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
GenBench, 2024.06 GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models, arXiv
BEACON, BEACON: Benchmark for Comprehensive RNA Tasks and Language Models, arXiv

Ⓜ️ Multimodal Scientific Large Language Models (MM-Sci-LLMs)

Molecule&text

2021.11 Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries, EMNLP, Code
2022.02 KV-PLM: A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals, Nature, Code
2022.09 MoMu: A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language, arXiv, Code
2022.11 MolT5: Translation between Molecules and Natural Language, arXiv, Code
2023.05 Text+Chem T5: Unifying Molecular and Textual Representations via Multi-task Language Modelling, arXiv, Code
2023.05 DrugChat: Towards Enabling ChatGPT-Like Capabilities on Drug Molecule Graphs, techRxiv, Code
2023.06 Enhancing Activity Prediction Models in Drug Discovery with the Ability to Understand Human Language, arXiv, Code
2023.06 GIMLET: A Unified Graph-Text Model for Instruction-Based Molecule Zero-Shot Learning, bioRxiv, Code
2023.06 MolReGPT: Empowering Molecule Discovery for Molecule-Caption Translation with Large Language Models: A ChatGPT Perspective, arXiv, Code
2023.06 ChatMol: Interactive Molecular Discovery with Natural Language, arXiv, Code
2023.07 MolXPT: Wrapping Molecules with Text for Generative Pre-training, ACL
2023.07 MolFM: A Multimodal Molecular Foundation Model, arXiv, Code
2023.08 GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text, arXiv
2023.10 GPT-MolBERTa: GPT Molecular Features Language Model for molecular property prediction, arXiv, Code
2023.12 MoleculeSTM: Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing, arXiv, Code
2024.04 Atomas: Hierarchical Alignment on Molecule-Text for Unified Molecule Understanding and Generation, arXiv, Code
2024.05 DrugLLM: Open Large Language Model for Few-shot Molecule Generation, arXiv
2024.06 3D-MolT5: Towards Unified 3D Molecule-Text Modeling with 3D Molecular Tokenization, arXiv
2024.06 MolecularGPT: Open Large Language Model (LLM) for Few-Shot Molecular Property Prediction, arXiv, Code
2024.10 Chemical Language Model Linker: blending text and molecules with modular adapters, arXiv, Code
2024.11 MolReFlect: Towards In-Context Fine-grained Alignments between Molecules and Texts, arXiv

Protein&text

2022.04 ProTranslator: zero-shot protein function prediction using textual description, arXiv, Code
2023.02 ProteinDT: A Text-guided Protein Design Framework, arXiv
2023.07 ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts, arXiv, Code
2023.07 Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers, arXiv
2023.10 InstructProtein: Aligning Human and Protein Language via Knowledge Instruction, arXiv
2024.02 ProtLLM: An Interleaved Protein-Language LLM with Protein-as-Word Pre-Training, arXiv, Code
2024.02 ProtChatGPT: Towards Understanding Proteins with Large Language Models, arXiv
2024.02 ProLLaMA: A Protein Large Language Model for Multi-Task Protein Language Processing, arXiv, Code
2024.04 Functional Protein Design with Local Domain Alignment, arXiv
2024.05 ProtT3: Protein-to-Text Generation for Text-based Protein Understanding, arXiv, Code
2024.05 ProteinCLIP: enhancing protein language models with natural language, bioRxiv, Code
2024.07 ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction, bioRxiv, Code
2024.08 ProteinGPT: Multimodal LLM for Protein Property Prediction and Structure Understanding, arXiv
2024.10 ProteinAligner: A Multi-modal Pretraining Framework for Protein Foundation Models, bioRxiv, Code
2024.10 Structure-Enhanced Protein Instruction Tuning: Towards General-Purpose Protein Understanding, arXiv
2024.10 TourSynbio: A Multi-Modal Large Model and Agent Framework to Bridge Text and Protein Sequences for Protein Engineering, IEEE 2024, Code
2024.12 ProtDAT: A Unified Framework for Protein Sequence Design from Any Protein Text Description, arXiv, Code

Protein&Molecule

2022.09 ChemBERTaLM: Exploiting pretrained biochemical language models for targeted drug design, Bioinformatics, Code
2023.03 Deep generative model for drug design from protein target sequence, Journal of Cheminformatics, Code
2023.06 DrugGPT: A GPT-based Strategy for Designing Potential Ligands Targeting Specific Proteins, bioRxiv, Code
2023.10 DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening, arXiv
2024.10 scChat: A Large Language Model-Powered Co-Pilot for Contextualized Single-Cell RNA Sequencing Analysis, bioRxiv, Code
2024.10 BioLLMNet: Enhancing RNA-Interaction Prediction with a Specialized Cross-LLM Transformation Network, bioRxiv

Cell&Text

2023.09 Cell2Sentence: Teaching Large Language Models the Language of Biology, bioRxiv, Code
2023.10 CellPLM: Pre-training of Cell Language Model Beyond Single Cells, ICLR, Code
2023.10 GenePT: A Simple But Effective Foundation Model for Genes and Cells Built From ChatGPT, bioRxiv, Code
2024.02 ChatCell: Facilitating Single-Cell Analysis with Natural Language, arXiv, Code
2024.03 Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis, Nature Methods, Code
2024.03 Joint Embedding of Transcriptomes and Text Enables Interactive Single-Cell RNA-seq Data Exploration via Natural Language, ICLR, Code
2024.06 LangCell: Language-Cell Pre-training for Cell Identity Understanding, arXiv, Code
2024.11 RNA-GPT: Multimodal Generative System for RNA Sequence Understanding, NeurIPS 2024
2024.11 BioLLM: A Standardized Framework for Integrating and Benchmarking Single-Cell Foundation Models, bioRxiv, Code

Comprehensive

2022.11 Galactica: A Large Language Model for Science, arXiv, Code
2023.02 BioTranslator: Multilingual translation for zero-shot biomedical classification using BioTranslator, Nature Communications, Code
2023.05 ChatDrug: ChatGPT-powered Conversational Drug Editing Using Retrieval and Domain Feedback, arXiv, Code
2023.08 BioMedGPT: A Pre-trained Language Model for Biomedical Text Mining, arXiv, Code
2023.08 DARWIN Series: Domain Specific Large Language Models for Natural Science, arXiv, Code
2023.10 BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations, arXiv, Code
2023.11 Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models, arXiv, Code
2024.01 BioBridge: Bridging Biomedical Foundation Models via Knowledge Graphs, arXiv, Code
2024.02 LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset, arXiv, Page, Model, Dataset
2024.02 Sequence modeling and design from molecular to genome scale with Evo, bioRxiv, Code
2024.02 BioT5+: Towards Generalized Biological Understanding with IUPAC Integration and Multi-task Tuning, arXiv, Code
2024.04 MolBind: Multimodal Alignment of Language, Molecules, and Proteins, arXiv, Code
2024.06 Uni-SMART: Universal Science Multimodal Analysis and Research Transformer, arXiv
2024.07 SciMind: A Multimodal Mixture-of-Experts Model for Advancing Pharmaceutical Sciences, ACL Workshop
2024.10 BSM: Small but Powerful Biological Sequence Model for Genes and Proteins, arXiv
2024.10 MAMMAL -- Molecular Aligned Multi-Modal Architecture and Language, arXiv, Code

Datasets and Benchmarks

Molecule&Text

ChEBI-20, 2021.11 Text2Mol: Cross-modal molecule retrieval with natural language queries, EMNLP 2021
PCdes, 2022.02 A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals, Nature Communications
MoMu, 2022.09 A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language, arXiv
PubChemSTM, 2022.12 Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing, arXiv
ChEBI-dia, 2023.06 ChatMol: Interactive Molecular Discovery with Natural Language, arXiv
PubChemQA, 2023.08 BioMedGPT: A Pre-trained Language Model for Biomedical Text Mining, arXiv
MoleculeQA, 2024.03 MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension, arXiv
MolCap-Arena, 2024.11 MolCap-Arena: A Comprehensive Captioning Benchmark on Language-Enhanced Molecular Property Prediction, arXiv
TOMG-Bench, 2024.12 TOMG-Bench: Evaluating LLMs on Text-based Open Molecule Generation, arXiv

Protein&Text

SwissProtCLAP, 2023.02 ProteinDT: A Text-guided Protein Design Framework, arXiv
ProtDescribe, 2023.07 ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts, arXiv
Prot2Text, 2023.07 Prot2Text: Multimodal Protein's Function Generation with GNNs and Transformers, arXiv
UniProtQA, 2023.08 BioMedGPT: A Pre-trained Language Model for Biomedical Text Mining, arXiv
InstructProtein, 2023.10 InstructProtein: Aligning Human and Protein Language via Knowledge Instruction, arXiv
ProteinLMDataset, ProteinLMBench, 2024.06 A Fine-tuning Dataset and Benchmark for Large Language Models for Protein Understanding, arXiv
OPI, 2024.11 OPI: An Open Instruction Dataset for Adapting Large Language Models to Protein-Related Tasks, NeurIPS 2024

Protein&Molecule

DUD-E, 2012.06 Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking, Journal of Medicinal Chemistry
BioLiP, 2012.10 BioLiP: a semi-manually curated database for biologically relevant ligand–protein interactions, Nucleic Acids Research
BindingDB, 2016.01 BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology, Nucleic Acids Research

Cell&Text

GEO, 2016 The gene expression omnibus database, Methods Mol Biol
The Human Cell Atlas, 2017.12 The Human Cell Atlas, eLife
ARCHS4, 2018.04 Massive mining of publicly available RNA-seq data from human and mouse, Nat Commun
NCBI, 2020.10 Database resources of the National Center for Biotechnology Information, Nucleic Acids Res
SRT, 2021.11 High-Plex Multiomic Analysis in FFPE Tissue at Single-Cellular and Subcellular Resolution by Spatial Molecular Imaging, bioRxiv
cellxgene, 2021.04 cellxgene: a performant, scalable exploration platform for high dimensional sparse matrices, bioRxiv
CellTypist, 2022.05 Cross-tissue immune cell analysis reveals tissue-specific features in humans, Science
scLibrary, 2024.06 LangCell: Language-Cell Pre-training for Cell Identity Understanding, arXiv

Comprehensive

Galactica, 2022.11 Galactica: A Large Language Model for Science, arXiv
Scientific Knowledge Dataset, 2023.08 DARWIN Series: Domain Specific Large Language Models for Natural Science, arXiv
Mol-Instructions, 2023.10 Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models, arXiv
SMolInstruct, 2024.02 LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset, arXiv
MolBind-M4, 2024.04 MolBind: Multimodal Alignment of Language, Molecules, and Proteins, arXiv

👥 Contributions

Citation

If you find this repository useful, please cite our paper:

@misc{zhang2024scientific,
      title={Scientific Large Language Models: A Survey on Biological & Chemical Domains}, 
      author={Qiang Zhang and Keyan Ding and Tianwen Lyv and Xinda Wang and Qingyu Yin and Yiwen Zhang and Jing Yu and Yuhao Wang and Xiaotong Li and Zhuoyi Xiang and Xiang Zhuang and Zeyuan Wang and Ming Qin and Mengyao Zhang and Jinlu Zhang and Jiyu Cui and Renjun Xu and Hongyang Chen and Xiaohui Fan and Huabin Xing and Huajun Chen},
      year={2024},
      eprint={2401.14656},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributors

Keyan Ding @dingkeyan93
Jing Yu @jiing17
Tianwen Lyu @smiling-k
Yiwen Zhang @zhangyiwen2002
Xinda Wang @Wwwduojin
Qingyu Yin @MikaStars39

Contact

Xinda Wang [email protected]
