---
license: apache-2.0
tags:
  - sentence-transformers
  - molecular-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - loss:Matryoshka2dLoss
  - loss:MatryoshkaLoss
  - loss:TanimotoSentLoss
base_model: Derify/ChemBERTa-druglike
widget:
  - source_sentence: CC1CCc2c(N)nc(C3CCCC3)n2C1
    sentences:
      - CC1CCc2c(N)nc(OC3CC3)n2C1
      - CN1CC[NH+](C[C@H](O)C2CC2)C2(CCCCC2)C1
      - Cc1c(F)cc(CNCC2CCC(C3CCC(C)CO3)CO2)cc1F
  - source_sentence: CC(CCCO)NC(=O)CNc1ccccc1
    sentences:
      - CC(CCCO)N[C@H]1CCCN(Nc2ccccc2)[C@H]1C
      - Cc1ccc(OC2=NCCO2)nc1
      - Cc1ccccc1C#Cc1ccccc1N(O)c1ccccc1
  - source_sentence: CCCCCCCc1ccc(CC=N[NH+]=C(N)N)cc1
    sentences:
      - COCC1(N2CCN(C)CC2)CCC[NH+]1Cc1cnc(N(C)C)nc1
      - Cc1ccc(N=C(c2ccccc2)c2ccc(-n3ccnn3)cc2)cc1
      - CCCCCCCc1cncc(CC=N[NH+]=C(N)N)c1
  - source_sentence: CC(=CCCS(=O)(=O)[O-])C(=O)OCCCS(=O)(=O)[O-]
    sentences:
      - CC(=CCCS(=O)(=O)[O-])C(=O)OCCCS(=O)(=O)[O-]
      - CCCCCOc1ccc(NC(=S)NC=O)cc1
      - CCC(=O)N1CCCC(NC(=O)c2ccc(S(=O)(=O)N(C)C)cc2)C1
  - source_sentence: Clc1nccc(C#CCCc2nc3ccccc3o2)n1
    sentences:
      - O=Cc1nc2ccccc2o1
      - >-
        O=C([O-])COc1ccc(CCCS(=O)(=O)c2ccc(Cl)cc2)cc1NC(=O)c1cccc(C=Cc2nc3ccccc3s2)c1
      - O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1
datasets:
  - Derify/pubchem_10m_genmol_similarity
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - spearman
model-index:
  - name: 'ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer'
    results:
      - task:
          type: semantic-similarity
          name: Semantic Similarity
        dataset:
          name: pubchem 10m genmol similarity
          type: pubchem_10m_genmol_similarity
        metrics:
          - type: spearman
            value: 0.9932120589500998
            name: Spearman
new_version: Derify/ChemMRL
---

ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer

This is a Chem-MRL (sentence-transformers) model finetuned from Derify/ChemBERTa-druglike on the pubchem_10m_genmol_similarity dataset. It maps SMILES strings to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'RobertaModel'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
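
The checkpoint is published as a standard sentence-transformers model, so the stack above can also be loaded and inspected directly with that library. A minimal sketch, assuming only the sentence-transformers package is installed:

```python
from sentence_transformers import SentenceTransformer

# Load the published checkpoint with the plain sentence-transformers loader
model = SentenceTransformer("Derify/ChemMRL-beta")

# Printing the model reproduces the Transformer -> Pooling -> Normalize stack above
print(model)
print(model.get_sentence_embedding_dimension())  # 1024
```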

Usage

Direct Usage (Chem-MRL)

First install the Chem-MRL library:

pip install -U "chem-mrl>=0.7.3"

Then you can load this model and run inference.

from chem_mrl import ChemMRL

# Download from the 🤗 Hub
model = ChemMRL("Derify/ChemMRL-beta")
# Run inference
sentences = [
    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
    "O=Cc1nc2ccccc2o1",
    "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
]
embeddings = model.backbone.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.backbone.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.3200, 0.1209],
#         [0.3200, 1.0000, 0.0950],
#         [0.1209, 0.0950, 1.0000]])

# Load the model with half precision
model = ChemMRL("Derify/ChemMRL-beta", use_half_precision=True)
sentences = [
    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
    "O=Cc1nc2ccccc2o1",
    "O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
]
embeddings = model.embed(sentences)  # Use the embed method for half precision
print(embeddings.shape)
# [3, 1024]
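
Because the model was trained with Matryoshka objectives over dimensions 1024 down to 8 (see the loss parameters below), embeddings can be truncated to a shorter prefix and re-normalized for cheaper storage and search. A minimal sketch in plain NumPy; the 256-dimension cut-off is only an example:

```python
import numpy as np
from chem_mrl import ChemMRL

model = ChemMRL("Derify/ChemMRL-beta")
smiles = [
    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
    "O=Cc1nc2ccccc2o1",
]
embeddings = model.backbone.encode(smiles)  # shape: (2, 1024)

# Keep the first 256 Matryoshka dimensions and re-normalize, since truncation
# breaks the unit norm enforced by the final Normalize() module.
truncated = embeddings[:, :256]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Cosine similarity of the truncated embeddings
print(truncated @ truncated.T)
```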

Evaluation

Metrics

Semantic Similarity

  • Dataset: pubchem_10m_genmol_similarity
  • Evaluated with chem_mrl.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator with these parameters:
    {
        "precision": "float32"
    }
    
| Split      | Metric   | Value    |
|:-----------|:---------|:---------|
| validation | spearman | 0.993212 |
| test       | spearman | 0.993243 |
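
The Spearman score reported above is the rank correlation between the model's cosine similarities and the dataset's similarity labels. A minimal sketch of that computation with SciPy; the pairs and labels below are placeholders for illustration, not rows from the evaluation split:

```python
from scipy.stats import spearmanr
from chem_mrl import ChemMRL

model = ChemMRL("Derify/ChemMRL-beta")

# Placeholder molecule pairs and hypothetical ground-truth similarity labels
smiles_a = ["CC(CCCO)NC(=O)CNc1ccccc1", "O=Cc1nc2ccccc2o1"]
smiles_b = ["CC(CCCO)N[C@H]1CCCN(Nc2ccccc2)[C@H]1C", "Cc1ccc(OC2=NCCO2)nc1"]
labels = [0.62, 0.08]  # hypothetical values, for illustration only

emb_a = model.backbone.encode(smiles_a)
emb_b = model.backbone.encode(smiles_b)
cosine_scores = (emb_a * emb_b).sum(axis=1)  # embeddings are already L2-normalized

correlation, _ = spearmanr(cosine_scores, labels)
print(correlation)
```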

Training Details

Training Dataset

pubchem_10m_genmol_similarity

  • Dataset: pubchem_10m_genmol_similarity at f68d779
  • Size: 19,692,766 training samples
  • Columns: smiles_a, smiles_b, and label
  • Approximate statistics based on the first 1000 samples:

    |         | smiles_a                                             | smiles_b                                             | label                            |
    |:--------|:-----------------------------------------------------|:-----------------------------------------------------|:---------------------------------|
    | type    | string                                               | string                                               | float                            |
    | details | min: 17 tokens, mean: 39.66 tokens, max: 119 tokens  | min: 11 tokens, mean: 38.29 tokens, max: 115 tokens  | min: 0.02, mean: 0.57, max: 1.0  |
  • Loss: Matryoshka2dLoss with these parameters:
    Click to expand
    {
        "loss": "TanimotoSentLoss",
        "n_layers_per_step": -1,
        "last_layer_weight": 2.0,
        "prior_layers_weight": 1.0,
        "kl_div_weight": 0.5,
        "kl_temperature": 0.3,
        "matryoshka_dims": [
            1024,
            512,
            256,
            128,
            64,
            32,
            16,
            8
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
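
The smiles_a / smiles_b / label pairs described above can be pulled straight from the Hub with the datasets library. A minimal sketch, assuming the split is named train:

```python
from datasets import load_dataset

# Stream the split to avoid materializing all ~19.7M pairs locally
ds = load_dataset("Derify/pubchem_10m_genmol_similarity", split="train", streaming=True)

for row in ds.take(3):
    print(row["smiles_a"], row["smiles_b"], row["label"])
```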
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 128
  • learning_rate: 8e-06
  • weight_decay: 6.505130550397454e-06
  • warmup_ratio: 0.2
  • data_seed: 42
  • fp16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_apex_fused
  • dataloader_pin_memory: False
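
These overrides map onto the standard sentence-transformers / transformers training arguments. A hedged sketch of reconstructing them is below; output_dir is a placeholder, the fused Apex optimizer plus fp16/tf32 assume a CUDA Ampere-class GPU with NVIDIA Apex installed, and this is not the exact training script:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Non-default hyperparameters from the run above; output_dir is a placeholder.
args = SentenceTransformerTrainingArguments(
    output_dir="chem-mrl-beta",  # hypothetical output directory
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    learning_rate=8e-06,
    weight_decay=6.505130550397454e-06,
    warmup_ratio=0.2,
    data_seed=42,
    fp16=True,   # requires a CUDA device
    tf32=True,   # requires an Ampere-or-newer GPU
    load_best_model_at_end=True,
    optim="adamw_apex_fused",  # requires NVIDIA Apex
    dataloader_pin_memory=False,
)
```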

All Hyperparameters

Click to expand
  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 8e-06
  • weight_decay: 6.505130550397454e-06
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.2
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: 42
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_apex_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: False
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Click to expand
| Epoch | Step | Training Loss | pubchem_10m_genmol_similarity_spearman |
|:------|:-----|:--------------|:---------------------------------------|
| 0.0796 | 24500 | 121.4633 | - |
| 0.08 | 24616 | - | 0.9739 |
| 0.1592 | 49000 | 118.6111 | - |
| 0.16 | 49232 | - | 0.9817 |
| 0.2389 | 73500 | 117.491 | - |
| 0.24 | 73848 | - | 0.9848 |
| 0.3185 | 98000 | 116.3786 | - |
| 0.32 | 98464 | - | 0.9865 |
| 0.3997 | 123000 | 115.9773 | - |
| 0.4 | 123080 | - | 0.9873 |
| 0.4794 | 147500 | 115.2441 | - |
| 0.48 | 147696 | - | 0.9885 |
| 0.5590 | 172000 | 114.8674 | - |
| 0.56 | 172312 | - | 0.9887 |
| 0.6386 | 196500 | 114.6483 | - |
| 0.64 | 196928 | - | 0.9892 |
| 0.7199 | 221500 | 114.0507 | - |
| 0.72 | 221544 | - | 0.9898 |
| 0.7995 | 246000 | 113.5606 | - |
| 0.8 | 246160 | - | 0.9902 |
| 0.8791 | 270500 | 113.2762 | - |
| 0.88 | 270776 | - | 0.9907 |
| 0.9587 | 295000 | 113.3295 | - |
| 0.96 | 295392 | - | 0.9908 |
| 1.0400 | 320000 | 112.9253 | - |
| 1.04 | 320008 | - | 0.9909 |
| 1.1196 | 344500 | 112.584 | - |
| 1.12 | 344624 | - | 0.9910 |
| 1.1992 | 369000 | 112.616 | - |
| 1.2 | 369240 | - | 0.9916 |
| 1.2788 | 393500 | 112.4692 | - |
| 1.28 | 393856 | - | 0.9914 |
| 1.3585 | 418000 | 112.2679 | - |
| 1.3600 | 418472 | - | 0.9917 |
| 1.4397 | 443000 | 112.1639 | - |
| 1.44 | 443088 | - | 0.9919 |
| 1.5193 | 467500 | 112.1139 | - |
| 1.52 | 467704 | - | 0.9921 |
| 1.5990 | 492000 | 111.8096 | - |
| 1.6 | 492320 | - | 0.9923 |
| 1.6786 | 516500 | 111.8252 | - |
| 1.6800 | 516936 | - | 0.9922 |
| 1.7598 | 541500 | 111.836 | - |
| 1.76 | 541552 | - | 0.9924 |
| 1.8395 | 566000 | 111.8471 | - |
| 1.8400 | 566168 | - | 0.9924 |
| 1.9191 | 590500 | 111.7778 | - |
| 1.92 | 590784 | - | 0.9925 |
| 1.9987 | 615000 | 111.4892 | - |
| 2.0 | 615400 | - | 0.9927 |
| 2.0799 | 640000 | 111.2659 | - |
| 2.08 | 640016 | - | 0.9928 |
| 2.1596 | 664500 | 111.3635 | - |
| 2.16 | 664632 | - | 0.9927 |
| 2.2392 | 689000 | 111.0114 | - |
| 2.24 | 689248 | - | 0.9928 |
| 2.3188 | 713500 | 111.0559 | - |
| 2.32 | 713864 | - | 0.9929 |
| 2.3984 | 738000 | 110.5276 | - |
| 2.4 | 738480 | - | 0.9929 |
| 2.4797 | 763000 | 110.9828 | - |
| 2.48 | 763096 | - | 0.9930 |
| 2.5593 | 787500 | 110.8404 | - |
| 2.56 | 787712 | - | 0.9930 |
| 2.6389 | 812000 | 111.1937 | - |
| 2.64 | 812328 | - | 0.9931 |
| 2.7186 | 836500 | 110.6662 | - |
| 2.7200 | 836944 | - | 0.9931 |
| 2.7998 | 861500 | 110.7714 | - |
| 2.8 | 861560 | - | 0.9932 |
| 2.8794 | 886000 | 110.7638 | - |
| 2.88 | 886176 | - | 0.9932 |
| 2.9591 | 910500 | 110.7021 | - |
| 2.96 | 910792 | - | 0.9932 |
| 2.9997 | 923000 | 110.6097 | - |

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: AMD Ryzen 7 3700X 8-Core Processor
  • RAM Size: 62.70 GB

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.0.0
  • Transformers: 4.53.3
  • PyTorch: 2.7.1+cu126
  • Accelerate: 1.9.0
  • Datasets: 3.6.0
  • Tokenizers: 0.21.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Matryoshka2dLoss

@misc{li20242d,
    title={2D Matryoshka Sentence Embeddings},
    author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li},
    year={2024},
    eprint={2402.14776},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

CoSENTLoss

@online{kexuefm-8847,
    title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
    author={Su Jianlin},
    year={2022},
    month={Jan},
    url={https://kexue.fm/archives/8847},
}

TanimotoSentLoss

@online{cortes-2025-tanimotosentloss,
    title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
    author={Emmanuel Cortes},
    year={2025},
    month={Jan},
    url={https://github.com/emapco/chem-mrl},
}

Model Card Authors

@eacortes

Model Card Contact

Manny Cortes ([email protected])