matchem-llm

A public repository collecting links to state of the art QA and evaluation sets for various ML and LLM applications

Stars: 51


A public repository collecting links to state-of-the-art training sets, QA, benchmarks and other evaluations for various ML and LLM applications in materials science and chemistry. It includes datasets related to chemistry, materials, multimodal data, and knowledge graphs in the field. The repository aims to provide resources for training and evaluating machine learning models in the materials science and chemistry domains.

README:

Materials Science and Chemistry LLM Resources

A public repository collecting links to state-of-the-art training sets, QA, benchmarks and other evaluations for various ML and LLM applications in materials science and chemistry.

Chemistry

  • ChemQA: ChemQA is a multimodal question-answering dataset on chemistry reasoning, inspired by IsoBench and ChemLLMBench. It contains 5 QA tasks in total: counting the numbers of carbons and hydrogens in organic molecules, calculating molecular weights of organic molecules, name conversion from SMILES to IUPAC, molecule captioning and editing, and retro-synthesis planning (inspired by [2], adapted from the dataset provided in [4], following the same training, validation and evaluation splits).
  • ChemLLMBench: A comprehensive benchmark on eight chemistry tasks
  • SMolInstruct: SMolInstruct is a large-scale, comprehensive, and high-quality instruction tuning dataset crafted for chemistry. It centers on small molecules and contains 14 meticulously selected tasks and over 3M samples. The dataset has both SMILES and SELFIES versions; you can switch to SELFIES by passing use_selfies=True when loading. Tasks include name conversion, property prediction, molecule description, and chemical reaction prediction.
  • CamelAI - Chemistry: The chemistry dataset is composed of 20K problem-solution pairs obtained using GPT-4. The problem-solution pairs are generated from 25 chemistry topics, with 25 subtopics for each topic and 32 problems for each (topic, subtopic) pair.
  • ChemData700k: ChemData is a large-scale chemistry competency instruction tuning dataset for language models, which includes nine chemistry core tasks and 730K high-quality questions and answers, sampled from 1/10 of 7 million pieces of data.
  • ChemBench4k: ChemBench is a large-scale chemistry competency evaluation benchmark for language models, which includes nine chemistry core tasks and 4100 high-quality single-choice questions and answers.
  • ChemBench - Lamalab: A benchmark with more than 7000 questions, manually curated for various chemical topics. It covers multiple-choice and free-form questions, supports models as well as tool-augmented systems, and provides leaderboards and a human baseline. See the paper for more details.
  • Chem-RnD and ChemEDU CLAIRify: Chem-EDU (Everyday Educational Chemistry) - 40 natural language instructions containing only safe (edible) chemicals. Chem-RnD (Chemistry Research & Development) - 108 detailed chemistry protocols for synthesizing different organic compounds in real-world chemistry labs. This is a subset of Benchmarking results and the XDL XML schema.
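
As a quick sanity check on the CamelAI - Chemistry entry above, the stated structure (25 topics, 25 subtopics per topic, 32 problems per pair) can be multiplied out to confirm it matches the quoted 20K total. The function and parameter names below are hypothetical and only illustrate the arithmetic; they are not part of the dataset's tooling.

```python
def camel_chemistry_size(topics=25, subtopics_per_topic=25, problems_per_pair=32):
    """Return (number of (topic, subtopic) pairs, total number of problems).

    Default values are the counts quoted in the CamelAI - Chemistry entry.
    """
    pairs = topics * subtopics_per_topic
    problems = pairs * problems_per_pair
    return pairs, problems


pairs, problems = camel_chemistry_size()
print(f"{pairs} (topic, subtopic) pairs, {problems} problems")
```

The product is 625 pairs and 20,000 problems, which agrees with the "20K problem-solution pairs" figure in the entry.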

Materials

  • Battery Device QA: Battery device records, including anode, cathode, and electrolyte. Examples of the question answering evaluation dataset: {'question': 'What is the cathode?', 'answer': 'Al foil', 'context': 'The blended slurry was then cast onto a clean current collector (Al foil for the cathode and Cu foil for the anode) and dried at 90 °C under vacuum overnight.', 'start index': 645}
  • MaScQA: A dataset of 650 challenging questions from the materials domain that require the knowledge and skills of a materials science student who has cleared their undergraduate degree. Questions are classified based on their structure and the materials science domain-based subcategories.
  • Optical BERT Training Data: Has HF errors, but data are available.
  • Optical Table QA:
  • LLMs for Sustainable Concrete paper
  • PNCExtract: Extracting Polymer Nanocomposite Samples from Articles: The dataset contains a manually curated list of samples for each PNC article. The data is divided into 52 validation articles and 151 test articles.
  • MatKG: MatKG is a comprehensive Knowledge Graph of Materials Science that captures a diverse range of entities and relationships from scientific literature. MatKG includes materials, properties, applications, characterization methods, synthesis methods, symmetry phase labels, and descriptors, among other entities, which are extracted automatically using advanced natural language processing methods from over 5 million papers in the domain.
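
The Battery Device QA example above has the shape of an extractive question-answering record: a question, an answer span, and the context the span is taken from. A minimal sketch of how such a record might be checked is shown below. The helper name `locate_answer` is hypothetical. Note also that the quoted 'start index' of 645 is larger than the excerpt shown, so it presumably refers to a longer source passage (an assumption); the sketch therefore only locates the answer within the excerpt and does not try to reproduce that value.

```python
def locate_answer(record):
    """Return the character offset of the answer inside the context.

    Returns None when the answer string does not occur in the context,
    which would indicate a malformed extractive QA record.
    """
    offset = record["context"].find(record["answer"])
    return offset if offset >= 0 else None


# Record copied from the Battery Device QA entry. The 'start index' field is
# kept as given; it is assumed to index into a longer source passage.
record = {
    "question": "What is the cathode?",
    "answer": "Al foil",
    "context": (
        "The blended slurry was then cast onto a clean current collector "
        "(Al foil for the cathode and Cu foil for the anode) and dried at "
        "90 °C under vacuum overnight."
    ),
    "start index": 645,
}

local_offset = locate_answer(record)
if local_offset is None:
    print("answer not found in context")
else:
    end = local_offset + len(record["answer"])
    print(f"answer found at excerpt offset {local_offset}: {record['context'][local_offset:end]!r}")
```

A check like this is useful before evaluation, since extractive QA metrics assume the answer can be recovered verbatim from the context.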

Multimodal

Unknown Quality

Chemistry Dataset from Andrersonbcedef

Community Needs

This is an area for relatively "open comments" on community needs.

  • One good thing to add would be a benchmark set for weak interactions between amino acids in proteins, if it exists. (Marcos Veríssimo Alves)

Ethics and Dual Use Concerns

Ensuring safe usage of data is of paramount concern, especially for the chemistry data included in this list. We encourage all users of these data to read papers by e.g., Gabe Gomes et al. [1,2] on the potential dangers of dual use before constructing their datasets or models. If you have concerns about potential dual use of any of the datasets listed, create an issue and we will do our best to evaluate the safety concerns, remove the data, or consult with outside experts if necessary.

[1] Emergent autonomous scientific research capabilities of large language models.
[2] Censoring chemical data to mitigate dual use risk.

Contributing Guidelines

Add other resources to the list by either 1. forking the repository and creating a pull request back to the main repository, or 2. creating an issue with the markdown that should be added to the repository, and one of the admins will add it to the main document.

If you would like to help maintain this repository, open an issue expressing your interest and we will reach out to you!
