A newer version of this model is available: Derify/ChemMRL

ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer

This is a Chem-MRL (sentence-transformers) model fine-tuned from Derify/ChemBERTa-druglike on the pubchem_10m_genmol_similarity dataset. It maps SMILES strings to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.

Model Details

Model Description

  • Model Type: Chem-MRL (Sentence Transformer)
  • Base Model: Derify/ChemBERTa-druglike
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: pubchem_10m_genmol_similarity

Model Sources

  • Repository: https://github.com/emapco/chem-mrl

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
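
Since the modules above form a standard sentence-transformers pipeline (Transformer encoder, mean pooling, L2 normalization), the checkpoint can presumably also be loaded without the chem_mrl wrapper. A minimal sketch under that assumption:

from sentence_transformers import SentenceTransformer

# Assumption: the repository is a standard sentence-transformers checkpoint,
# so SentenceTransformer can resolve it directly.
model = SentenceTransformer("Derify/ChemMRL-beta")
embeddings = model.encode([
    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
    "O=Cc1nc2ccccc2o1",
])
print(embeddings.shape)  # (2, 1024)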

Usage

Direct Usage (Chem-MRL)

First install the Chem-MRL library:

pip install -U "chem-mrl>=0.7.3"

Then you can load this model and run inference.

from chem_mrl import ChemMRL

# Download from the 🤗 Hub
model = ChemMRL("Derify/ChemMRL-beta")
# Run inference
sentences = [
    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
    "O=Cc1nc2ccccc2o1",
    "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
]
embeddings = model.backbone.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.backbone.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.3200, 0.1209],
#         [0.3200, 1.0000, 0.0950],
#         [0.1209, 0.0950, 1.0000]])

# Load the model with half precision
model = ChemMRL("Derify/ChemMRL-beta", use_half_precision=True)
sentences = [
    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
    "O=Cc1nc2ccccc2o1",
    "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
]
embeddings = model.embed(sentences)  # Use the embed method for half precision
print(embeddings.shape)
# [3, 1024]
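
Because the model is trained with Matryoshka representation learning over the dimensions listed under Training Details, the leading components of each embedding should remain usable on their own. A sketch of truncating to 256 dimensions and re-normalizing (256 is an illustrative choice among the trained sizes):

import numpy as np
from chem_mrl import ChemMRL

model = ChemMRL("Derify/ChemMRL-beta")
embeddings = model.backbone.encode([
    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
    "O=Cc1nc2ccccc2o1",
])

# Keep only the first 256 Matryoshka dimensions and re-normalize
# so cosine similarity remains meaningful on the truncated vectors.
truncated = embeddings[:, :256]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)  # (2, 256)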

Evaluation

Metrics

Semantic Similarity

  • Dataset: pubchem_10m_genmol_similarity
  • Evaluated with chem_mrl.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator with these parameters:
    {
        "precision": "float32"
    }
    
Split Metric Value
validation spearman 0.993212
test spearman 0.993243
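
The Spearman scores above measure the rank correlation between model cosine similarities and the dataset's similarity labels. A rough sketch of how such a score could be reproduced outside the built-in evaluator; the pairs and label values below are made up for illustration, and scipy is an extra dependency:

from scipy.stats import spearmanr
from chem_mrl import ChemMRL

model = ChemMRL("Derify/ChemMRL-beta")

# Hypothetical evaluation pairs with illustrative (not real) similarity labels
pairs = [
    ("Clc1nccc(C#CCCc2nc3ccccc3o2)n1", "O=Cc1nc2ccccc2o1"),
    ("Clc1nccc(C#CCCc2nc3ccccc3o2)n1", "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1"),
    ("O=Cc1nc2ccccc2o1", "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1"),
]
labels = [0.32, 0.12, 0.10]

emb_a = model.backbone.encode([a for a, _ in pairs])
emb_b = model.backbone.encode([b for _, b in pairs])
cosine = (emb_a * emb_b).sum(axis=1)  # dot product equals cosine similarity: embeddings are L2-normalized
score, _ = spearmanr(cosine, labels)
print(f"Spearman: {score:.4f}")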

Training Details

Training Dataset

pubchem_10m_genmol_similarity

  • Dataset: pubchem_10m_genmol_similarity at f68d779
  • Size: 19,692,766 training samples
  • Columns: smiles_a, smiles_b, and label
  • Approximate statistics based on the first 1000 samples:
    • smiles_a (string): min 17 tokens, mean 39.66 tokens, max 119 tokens
    • smiles_b (string): min 11 tokens, mean 38.29 tokens, max 115 tokens
    • label (float): min 0.02, mean 0.57, max 1.0
  • Loss: Matryoshka2dLoss with these parameters (see the construction sketch after the parameter listing below):
    {
        "loss": "TanimotoSentLoss",
        "n_layers_per_step": -1,
        "last_layer_weight": 2.0,
        "prior_layers_weight": 1.0,
        "kl_div_weight": 0.5,
        "kl_temperature": 0.3,
        "matryoshka_dims": [
            1024,
            512,
            256,
            128,
            64,
            32,
            16,
            8
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
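
For orientation, a loss in this shape could be assembled with sentence-transformers' Matryoshka2dLoss. The sketch below uses CoSENTLoss as a stand-in for TanimotoSentLoss (the actual inner loss, provided by the chem-mrl package, whose import path is not documented in this card) and assumes the base checkpoint loads directly into SentenceTransformer:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import CoSENTLoss, Matryoshka2dLoss

# Assumption: the base model resolves as a plain transformers checkpoint.
model = SentenceTransformer("Derify/ChemBERTa-druglike")
inner_loss = CoSENTLoss(model)  # stand-in for TanimotoSentLoss
loss = Matryoshka2dLoss(
    model,
    inner_loss,
    matryoshka_dims=[1024, 512, 256, 128, 64, 32, 16, 8],
    matryoshka_weights=[1, 1, 1, 1, 1, 1, 1, 1],
    n_layers_per_step=-1,
    n_dims_per_step=-1,
    last_layer_weight=2.0,
    prior_layers_weight=1.0,
    kl_div_weight=0.5,
    kl_temperature=0.3,
)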
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 128
  • learning_rate: 8e-06
  • weight_decay: 6.505130550397454e-06
  • warmup_ratio: 0.2
  • data_seed: 42
  • fp16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_apex_fused
  • dataloader_pin_memory: False
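
As a rough sketch, these non-default values map onto sentence-transformers' SentenceTransformerTrainingArguments as shown below; output_dir is a placeholder and the actual training script is not part of this card:

from sentence_transformers.training_args import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="chem-mrl-output",  # placeholder, not from this card
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    learning_rate=8e-6,
    weight_decay=6.505130550397454e-06,
    warmup_ratio=0.2,
    data_seed=42,
    fp16=True,
    tf32=True,
    load_best_model_at_end=True,
    optim="adamw_apex_fused",  # requires NVIDIA Apex to be installed
    dataloader_pin_memory=False,
)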

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 8e-06
  • weight_decay: 6.505130550397454e-06
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.2
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: 42
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_apex_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: False
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss pubchem_10m_genmol_similarity_spearman
0.0796 24500 121.4633 -
0.08 24616 - 0.9739
0.1592 49000 118.6111 -
0.16 49232 - 0.9817
0.2389 73500 117.491 -
0.24 73848 - 0.9848
0.3185 98000 116.3786 -
0.32 98464 - 0.9865
0.3997 123000 115.9773 -
0.4 123080 - 0.9873
0.4794 147500 115.2441 -
0.48 147696 - 0.9885
0.5590 172000 114.8674 -
0.56 172312 - 0.9887
0.6386 196500 114.6483 -
0.64 196928 - 0.9892
0.7199 221500 114.0507 -
0.72 221544 - 0.9898
0.7995 246000 113.5606 -
0.8 246160 - 0.9902
0.8791 270500 113.2762 -
0.88 270776 - 0.9907
0.9587 295000 113.3295 -
0.96 295392 - 0.9908
1.0400 320000 112.9253 -
1.04 320008 - 0.9909
1.1196 344500 112.584 -
1.12 344624 - 0.9910
1.1992 369000 112.616 -
1.2 369240 - 0.9916
1.2788 393500 112.4692 -
1.28 393856 - 0.9914
1.3585 418000 112.2679 -
1.3600 418472 - 0.9917
1.4397 443000 112.1639 -
1.44 443088 - 0.9919
1.5193 467500 112.1139 -
1.52 467704 - 0.9921
1.5990 492000 111.8096 -
1.6 492320 - 0.9923
1.6786 516500 111.8252 -
1.6800 516936 - 0.9922
1.7598 541500 111.836 -
1.76 541552 - 0.9924
1.8395 566000 111.8471 -
1.8400 566168 - 0.9924
1.9191 590500 111.7778 -
1.92 590784 - 0.9925
1.9987 615000 111.4892 -
2.0 615400 - 0.9927
2.0799 640000 111.2659 -
2.08 640016 - 0.9928
2.1596 664500 111.3635 -
2.16 664632 - 0.9927
2.2392 689000 111.0114 -
2.24 689248 - 0.9928
2.3188 713500 111.0559 -
2.32 713864 - 0.9929
2.3984 738000 110.5276 -
2.4 738480 - 0.9929
2.4797 763000 110.9828 -
2.48 763096 - 0.9930
2.5593 787500 110.8404 -
2.56 787712 - 0.9930
2.6389 812000 111.1937 -
2.64 812328 - 0.9931
2.7186 836500 110.6662 -
2.7200 836944 - 0.9931
2.7998 861500 110.7714 -
2.8 861560 - 0.9932
2.8794 886000 110.7638 -
2.88 886176 - 0.9932
2.9591 910500 110.7021 -
2.96 910792 - 0.9932
2.9997 923000 110.6097 -

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: AMD Ryzen 7 3700X 8-Core Processor
  • RAM Size: 62.70 GB

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.0.0
  • Transformers: 4.53.3
  • PyTorch: 2.7.1+cu126
  • Accelerate: 1.9.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Matryoshka2dLoss

@misc{li20242d,
    title={2D Matryoshka Sentence Embeddings},
    author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li},
    year={2024},
    eprint={2402.14776},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}

TanimotoSentLoss

@online{cortes-2025-tanimotosentloss,
    title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
    author={Emmanuel Cortes},
    year={2025},
    month={Jan},
    url={https://github.com/emapco/chem-mrl},
}

Model Card Authors

@eacortes

Model Card Contact

Manny Cortes ([email protected])
