ChemDFM-R: A Chemical Reasoning LLM Enhanced with Atomized Chemical Knowledge

While large language models (LLMs) have achieved impressive progress, their application in scientific domains such as chemistry remains hindered by shallow domain understanding and limited reasoning capabilities. In this work, we focus on the specific field of chemistry and develop a Chemical Reasoning LLM, ChemDFM-R. We first construct a comprehensive dataset of atomized chemical knowledge, ChemFG, annotating the presence of functional groups in molecules and the changes of functional groups during chemical reactions, to enhance the model’s understanding of the fundamental principles and internal logic of chemistry. Then, we propose a mix-sourced distillation method that integrates expertise in atomized knowledge with general reasoning skills, followed by domain-specific reinforcement learning to enhance chemical reasoning. Experiments on diverse chemical benchmarks demonstrate that ChemDFM-R achieves cutting-edge performance while providing interpretable, rationale-driven outputs. Further case studies illustrate how explicit reasoning chains significantly improve the model's reliability, transparency, and practicality in real-world human-AI collaboration scenarios. For more details, please refer to our paper.
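For illustration only, here is a minimal sketch (our own, not the ChemFG construction pipeline described in the paper) of how functional-group presence can be annotated with RDKit SMARTS patterns:

from rdkit import Chem

# A few example SMARTS patterns; the actual ChemFG annotation set is described in the paper.
FUNCTIONAL_GROUPS = {
    "hydroxyl": Chem.MolFromSmarts("[OX2H]"),
    "carbonyl": Chem.MolFromSmarts("[CX3]=[OX1]"),
    "amine": Chem.MolFromSmarts("[NX3;H2,H1;!$(NC=O)]"),
}

def annotate_functional_groups(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # report which functional groups occur in the molecule
    return {name: mol.HasSubstructMatch(patt) for name, patt in FUNCTIONAL_GROUPS.items()}

print(annotate_functional_groups("CC(=O)O"))  # acetic acid: hydroxyl and carbonyl present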

Local Inference

Here is an example of how to load and run ChemDFM-R locally:

import re
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name_or_id = "OpenDFM/ChemDFM-R-14B"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_id)
model = AutoModelForCausalLM.from_pretrained(model_name_or_id, torch_dtype=torch.float16).to("cuda")

instruction = "Can you please give detailed descriptions of the molecule below?\nCl.O=C1c2c(O)cccc2-c2nn(CCNCCO)c3ccc(NCCNCCO)c1c23"
message = [
    {
        "role": "system",
        "content": "You are a helpful assistant that is good at reasoning. You always reason thoroughly before giving response. The reasoning process and answer are enclosed within <think> </think> and <ans    wer> </answer> tags, respectively.\ni.e.,\n<think>\nreasoning process here\n</think>\n<answer>\nanswer here\n</answer>"
    },
    {
        "role": "user",
        "content": instruction
    }
]

input_text = tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
generation_config = GenerationConfig(
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.9,
    max_new_tokens=1024,
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id
)
outputs = model.generate(**inputs, generation_config=generation_config)

generated_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# the decoded output contains the prompt; decode the prompt and strip it to keep only the new text
input_text = tokenizer.decode(inputs["input_ids"][0], skip_special_tokens=True)
generated_text = generated_text[len(input_text):].strip()
print(f"{generated_text=}")

thinking, answer = re.match(r'<think>(.*?)</think>\s?<answer>(.*?)</answer>', generated_text, re.DOTALL).groups()
thinking, answer = thinking.strip(), answer.strip()
print(f"{thinking=}")
print(f"{answer=}")

SMILES Preprocessing

When your input involves SMILES notation, we recommend preprocessing it with the rdkit package to canonicalize the SMILES. Here is an example:

from rdkit import Chem
def canonicalize_smiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True, kekuleSmiles=False)

or directly:

from rdkit import Chem
def canonicalize_smiles(smiles):
    return Chem.CanonSmiles(smiles, useChiral=True)
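
Either function can then be applied to each SMILES string before it is inserted into the prompt, for example:

smiles = "Cl.O=C1c2c(O)cccc2-c2nn(CCNCCO)c3ccc(NCCNCCO)c1c23"
instruction = f"Can you please give detailed descriptions of the molecule below?\n{canonicalize_smiles(smiles)}"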

Citation

@misc{zhao2025chemdfmr,
      title={ChemDFM-R: A Chemical Reasoner LLM Enhanced with Atomized Chemical Knowledge}, 
      author={Zihan Zhao and Bo Chen and Ziping Wan and Lu Chen and Xuanze Lin and Shiyang Yu and Situo Zhang and Da Ma and Zichen Zhu and Danyang Zhang and Huayang Wang and Zhongyang Dai and Liyang Wen and Xin Chen and Kai Yu},
      year={2025},
      eprint={2507.21990},
      archivePrefix={arXiv},
      primaryClass={cs.CE},
      url={https://arxiv.org/abs/2507.21990}, 
}

Disclaimer

The current version of ChemDFM-R may generate incorrect or misleading information. Please use it with caution and verify the results with domain experts before making any decisions based on them.

Contact

If you have any questions or further requests, please contact Zihan Zhao, Bo Chen, and Lu Chen.
