---
license: apache-2.0
tags:
- sentence-transformers
- molecular-similarity
- feature-extraction
- dense
- generated_from_trainer
- loss:Matryoshka2dLoss
- loss:MatryoshkaLoss
- loss:TanimotoSentLoss
base_model: Derify/ChemBERTa-druglike
widget:
- source_sentence: CC1CCc2c(N)nc(C3CCCC3)n2C1
sentences:
- CC1CCc2c(N)nc(OC3CC3)n2C1
- CN1CC[NH+](C[C@H](O)C2CC2)C2(CCCCC2)C1
- Cc1c(F)cc(CNCC2CCC(C3CCC(C)CO3)CO2)cc1F
- source_sentence: CC(CCCO)NC(=O)CNc1ccccc1
sentences:
- CC(CCCO)N[C@H]1CCCN(Nc2ccccc2)[C@H]1C
- Cc1ccc(OC2=NCCO2)nc1
- Cc1ccccc1C#Cc1ccccc1N(O)c1ccccc1
- source_sentence: CCCCCCCc1ccc(CC=N[NH+]=C(N)N)cc1
sentences:
- COCC1(N2CCN(C)CC2)CCC[NH+]1Cc1cnc(N(C)C)nc1
- Cc1ccc(N=C(c2ccccc2)c2ccc(-n3ccnn3)cc2)cc1
- CCCCCCCc1cncc(CC=N[NH+]=C(N)N)c1
- source_sentence: CC(=CCCS(=O)(=O)[O-])C(=O)OCCCS(=O)(=O)[O-]
sentences:
- CC(=CCCS(=O)(=O)[O-])C(=O)OCCCS(=O)(=O)[O-]
- CCCCCOc1ccc(NC(=S)NC=O)cc1
- CCC(=O)N1CCCC(NC(=O)c2ccc(S(=O)(=O)N(C)C)cc2)C1
- source_sentence: Clc1nccc(C#CCCc2nc3ccccc3o2)n1
sentences:
- O=Cc1nc2ccccc2o1
- >-
O=C([O-])COc1ccc(CCCS(=O)(=O)c2ccc(Cl)cc2)cc1NC(=O)c1cccc(C=Cc2nc3ccccc3s2)c1
- O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1
datasets:
- Derify/pubchem_10m_genmol_similarity
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- spearman
model-index:
- name: 'ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer'
results:
- task:
type: semantic-similarity
name: Semantic Similarity
dataset:
name: pubchem 10m genmol similarity
type: pubchem_10m_genmol_similarity
metrics:
- type: spearman
value: 0.9932120589500998
name: Spearman
new_version: Derify/ChemMRL
---
# ChemMRL: SMILES Matryoshka Representation Learning Embedding Transformer
This is a [Chem-MRL](https://github.com/emapco/chem-mrl) ([sentence-transformers](https://www.SBERT.net)) model fine-tuned from [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) on the [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) dataset. It maps SMILES strings to a 1024-dimensional dense vector space and can be used for molecular similarity, semantic search, database indexing, molecular classification, clustering, and more.
## Model Details
### Model Description
- **Model Type:** ChemMRL (Sentence Transformer)
- **Base model:** [Derify/ChemBERTa-druglike](https://huggingface.co/Derify/ChemBERTa-druglike) <!-- at revision 5e76559157fde4f1aead643d9e1d402289f522af -->
- **Maximum Sequence Length:** 128 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Tanimoto
- **Training Dataset:**
- [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity)
- **License:** apache-2.0
### Model Sources
- **Repository:** [Chem-MRL on GitHub](https://github.com/emapco/chem-mrl)
- **Demo App Repository:** [Chem-MRL-demo on GitHub](https://github.com/emapco/chem-mrl-demo)
### Full Model Architecture
```
SentenceTransformer(
(0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'RobertaModel'})
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
```
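The final `Normalize()` module L2-normalizes the pooled embeddings, and pairwise scores are computed with the Tanimoto similarity rather than cosine (see the similarity function above). The sketch below shows the continuous Tanimoto (Jaccard-like) generalization commonly used for real-valued embeddings; it is a minimal NumPy illustration, and the exact formulation inside the library may differ:

```python
import numpy as np


def tanimoto_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise continuous Tanimoto similarity between two batches of embeddings.

    a: (n, d) array, b: (m, d) array -> (n, m) similarity matrix.
    """
    dot = a @ b.T                                 # (n, m) pairwise dot products
    sq_a = (a * a).sum(axis=1, keepdims=True)     # (n, 1) squared norms of a
    sq_b = (b * b).sum(axis=1, keepdims=True)     # (m, 1) squared norms of b
    return dot / (sq_a + sq_b.T - dot)


# Toy check: orthogonal unit vectors score 0, identical vectors score 1.
x = np.array([[1.0, 0.0], [0.0, 1.0]])
print(tanimoto_similarity(x, x))
# [[1. 0.]
#  [0. 1.]]
```

For unit-norm vectors, as produced by the `Normalize()` module, this reduces to `dot / (2 - dot)`, a monotone transform of cosine similarity.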
## Usage
### Direct Usage (Chem-MRL)
First install the Chem-MRL library:
```bash
pip install -U "chem-mrl>=0.7.3"
```
Then you can load this model and run inference.
```python
from chem_mrl import ChemMRL
# Download from the 🤗 Hub
model = ChemMRL("Derify/ChemMRL-beta")
# Run inference
sentences = [
"Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
"O=Cc1nc2ccccc2o1",
"O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
]
embeddings = model.backbone.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.backbone.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.3200, 0.1209],
# [0.3200, 1.0000, 0.0950],
# [0.1209, 0.0950, 1.0000]])
# Load the model with half precision
model = ChemMRL("Derify/ChemMRL-beta", use_half_precision=True)
sentences = [
"Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
"O=Cc1nc2ccccc2o1",
"O[C@H]1CN(C(Cc2ccccc2)c2ccccc2)C[C@@H]1Cc1cnc[nH]1",
]
embeddings = model.embed(sentences) # Use the embed method for half precision
print(embeddings.shape)
# [3, 1024]
```
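### Direct Usage (Sentence Transformers)
Since the checkpoint is packaged as a standard Sentence Transformers model (see the architecture above), it should also load directly with the `sentence-transformers` library. The snippet below is a minimal sketch under that assumption; the `chem-mrl` loader above remains the documented entry point.

```python
from sentence_transformers import SentenceTransformer

# Assumption: the Hub repository loads as a plain Sentence Transformers checkpoint.
model = SentenceTransformer("Derify/ChemMRL-beta")

smiles = [
    "Clc1nccc(C#CCCc2nc3ccccc3o2)n1",
    "O=Cc1nc2ccccc2o1",
]
embeddings = model.encode(smiles)
print(embeddings.shape)
# (2, 1024)

# similarity() applies the similarity function stored in the model configuration.
print(model.similarity(embeddings, embeddings))
```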
## Evaluation
### Metrics
#### Semantic Similarity
* Dataset: `pubchem_10m_genmol_similarity`
* Evaluated with <code>chem_mrl.evaluation.EmbeddingSimilarityEvaluator</code> with these parameters:
```json
{
"precision": "float32"
}
```
| Split | Metric | Value |
| :------------- | :----------- | :----------- |
| **validation** | **spearman** | **0.993212** |
| **test** | **spearman** | **0.993243** |
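The Spearman scores above measure the rank correlation between the model's similarity scores for SMILES pairs and the dataset's reference similarity labels. A rough sketch of how such a number can be computed with SciPy is shown below; the SMILES pairs and labels are made-up placeholders, the `similarity_pairwise` helper is the standard sentence-transformers method assumed to be exposed by the backbone, and the reported metrics were produced by the evaluator named above over the full validation and test splits:

```python
from chem_mrl import ChemMRL
from scipy.stats import spearmanr

model = ChemMRL("Derify/ChemMRL-beta")

# Hypothetical SMILES pairs and reference similarity labels; in practice these
# come from the validation or test split of pubchem_10m_genmol_similarity.
smiles_a = ["CCO", "c1ccccc1", "CCN(CC)CC"]
smiles_b = ["CCN", "c1ccccc1C", "CCOC(C)=O"]
labels = [0.33, 0.62, 0.18]  # placeholder Tanimoto labels

emb_a = model.backbone.encode(smiles_a)
emb_b = model.backbone.encode(smiles_b)

# Score each pair with the model's similarity function, then rank-correlate.
scores = model.backbone.similarity_pairwise(emb_a, emb_b).cpu().numpy()
rho, _ = spearmanr(scores, labels)
print(rho)
```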
## Training Details
### Training Dataset
#### pubchem_10m_genmol_similarity
* Dataset: [pubchem_10m_genmol_similarity](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity) at [f68d779](https://huggingface.co/datasets/Derify/pubchem_10m_genmol_similarity/tree/f68d779a6284578132a3922655f6b1f74c576642)
* Size: 19,692,766 training samples
* Columns: <code>smiles_a</code>, <code>smiles_b</code>, and <code>label</code>
* Approximate statistics based on the first 1000 samples:
| | smiles_a | smiles_b | label |
| :------ | :---------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------- | :-------------------------------------------------------------- |
| type | string | string | float |
| details | <ul><li>min: 17 tokens</li><li>mean: 39.66 tokens</li><li>max: 119 tokens</li></ul> | <ul><li>min: 11 tokens</li><li>mean: 38.29 tokens</li><li>max: 115 tokens</li></ul> | <ul><li>min: 0.02</li><li>mean: 0.57</li><li>max: 1.0</li></ul> |
* Loss: [<code>Matryoshka2dLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshka2dloss) with these parameters:
<details><summary>Click to expand</summary>
```json
{
"loss": "TanimotoSentLoss",
"n_layers_per_step": -1,
"last_layer_weight": 2.0,
"prior_layers_weight": 1.0,
"kl_div_weight": 0.5,
"kl_temperature": 0.3,
"matryoshka_dims": [
1024,
512,
256,
128,
64,
32,
16,
8
],
"matryoshka_weights": [
1,
1,
1,
1,
1,
1,
1,
1
],
"n_dims_per_step": -1
}
```
</details>
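Because the model is trained with a Matryoshka objective over the dimensions listed above (1024 down to 8), embeddings can be truncated to any of those sizes and re-normalized, trading some accuracy for storage and search speed. A minimal NumPy sketch is shown below; the library may also expose a dedicated option for this, and the snippet only assumes the `encode` call shown in the usage section:

```python
import numpy as np

from chem_mrl import ChemMRL

model = ChemMRL("Derify/ChemMRL-beta")
embeddings = model.backbone.encode(["CCO", "c1ccccc1"])  # shape (2, 1024)

# Keep the first 256 Matryoshka dimensions and re-normalize to unit length
# so that similarity scores remain comparable across dimensionalities.
truncated = embeddings[:, :256]
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)
print(truncated.shape)
# (2, 256)
```

Shorter prefixes are expected to trade away some retrieval quality; the metrics reported in this card were computed at the full 1024 dimensions.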
### Training Hyperparameters
#### Non-Default Hyperparameters
- `eval_strategy`: steps
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 128
- `learning_rate`: 8e-06
- `weight_decay`: 6.505130550397454e-06
- `warmup_ratio`: 0.2
- `data_seed`: 42
- `fp16`: True
- `tf32`: True
- `load_best_model_at_end`: True
- `optim`: adamw_apex_fused
- `dataloader_pin_memory`: False
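For reference, the non-default values above map onto standard `SentenceTransformerTrainingArguments`. The sketch below is an illustrative reconstruction, not the exact training script: the output directory is hypothetical, and training was driven through the chem-mrl trainer, so details may differ.

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Illustrative mapping of the non-default hyperparameters listed above.
args = SentenceTransformerTrainingArguments(
    output_dir="chem-mrl-output",  # hypothetical path
    eval_strategy="steps",
    per_device_train_batch_size=64,
    per_device_eval_batch_size=128,
    learning_rate=8e-6,
    weight_decay=6.505130550397454e-06,
    warmup_ratio=0.2,
    data_seed=42,
    fp16=True,
    tf32=True,
    load_best_model_at_end=True,
    optim="adamw_apex_fused",
    dataloader_pin_memory=False,
)
```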
#### All Hyperparameters
<details><summary>Click to expand</summary>
- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 64
- `per_device_eval_batch_size`: 128
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 8e-06
- `weight_decay`: 6.505130550397454e-06
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 3
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.2
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: 42
- `jit_mode_eval`: False
- `use_ipex`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: True
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: True
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_apex_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: False
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: False
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: False
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional
- `router_mapping`: {}
- `learning_rate_mapping`: {}
</details>
### Training Logs
<details><summary>Click to expand</summary>
| Epoch | Step | Training Loss | pubchem_10m_genmol_similarity_spearman |
| :----: | :----: | :-----------: | :------------------------------------: |
| 0.0796 | 24500 | 121.4633 | - |
| 0.08 | 24616 | - | 0.9739 |
| 0.1592 | 49000 | 118.6111 | - |
| 0.16 | 49232 | - | 0.9817 |
| 0.2389 | 73500 | 117.491 | - |
| 0.24 | 73848 | - | 0.9848 |
| 0.3185 | 98000 | 116.3786 | - |
| 0.32 | 98464 | - | 0.9865 |
| 0.3997 | 123000 | 115.9773 | - |
| 0.4 | 123080 | - | 0.9873 |
| 0.4794 | 147500 | 115.2441 | - |
| 0.48 | 147696 | - | 0.9885 |
| 0.5590 | 172000 | 114.8674 | - |
| 0.56 | 172312 | - | 0.9887 |
| 0.6386 | 196500 | 114.6483 | - |
| 0.64 | 196928 | - | 0.9892 |
| 0.7199 | 221500 | 114.0507 | - |
| 0.72 | 221544 | - | 0.9898 |
| 0.7995 | 246000 | 113.5606 | - |
| 0.8 | 246160 | - | 0.9902 |
| 0.8791 | 270500 | 113.2762 | - |
| 0.88 | 270776 | - | 0.9907 |
| 0.9587 | 295000 | 113.3295 | - |
| 0.96 | 295392 | - | 0.9908 |
| 1.0400 | 320000 | 112.9253 | - |
| 1.04 | 320008 | - | 0.9909 |
| 1.1196 | 344500 | 112.584 | - |
| 1.12 | 344624 | - | 0.9910 |
| 1.1992 | 369000 | 112.616 | - |
| 1.2 | 369240 | - | 0.9916 |
| 1.2788 | 393500 | 112.4692 | - |
| 1.28 | 393856 | - | 0.9914 |
| 1.3585 | 418000 | 112.2679 | - |
| 1.3600 | 418472 | - | 0.9917 |
| 1.4397 | 443000 | 112.1639 | - |
| 1.44 | 443088 | - | 0.9919 |
| 1.5193 | 467500 | 112.1139 | - |
| 1.52 | 467704 | - | 0.9921 |
| 1.5990 | 492000 | 111.8096 | - |
| 1.6 | 492320 | - | 0.9923 |
| 1.6786 | 516500 | 111.8252 | - |
| 1.6800 | 516936 | - | 0.9922 |
| 1.7598 | 541500 | 111.836 | - |
| 1.76 | 541552 | - | 0.9924 |
| 1.8395 | 566000 | 111.8471 | - |
| 1.8400 | 566168 | - | 0.9924 |
| 1.9191 | 590500 | 111.7778 | - |
| 1.92 | 590784 | - | 0.9925 |
| 1.9987 | 615000 | 111.4892 | - |
| 2.0 | 615400 | - | 0.9927 |
| 2.0799 | 640000 | 111.2659 | - |
| 2.08 | 640016 | - | 0.9928 |
| 2.1596 | 664500 | 111.3635 | - |
| 2.16 | 664632 | - | 0.9927 |
| 2.2392 | 689000 | 111.0114 | - |
| 2.24 | 689248 | - | 0.9928 |
| 2.3188 | 713500 | 111.0559 | - |
| 2.32 | 713864 | - | 0.9929 |
| 2.3984 | 738000 | 110.5276 | - |
| 2.4 | 738480 | - | 0.9929 |
| 2.4797 | 763000 | 110.9828 | - |
| 2.48 | 763096 | - | 0.9930 |
| 2.5593 | 787500 | 110.8404 | - |
| 2.56 | 787712 | - | 0.9930 |
| 2.6389 | 812000 | 111.1937 | - |
| 2.64 | 812328 | - | 0.9931 |
| 2.7186 | 836500 | 110.6662 | - |
| 2.7200 | 836944 | - | 0.9931 |
| 2.7998 | 861500 | 110.7714 | - |
| 2.8 | 861560 | - | 0.9932 |
| 2.8794 | 886000 | 110.7638 | - |
| 2.88 | 886176 | - | 0.9932 |
| 2.9591 | 910500 | 110.7021 | - |
| 2.96 | 910792 | - | 0.9932 |
| 2.9997 | 923000 | 110.6097 | - |
</details>
### Training Hardware
- **On Cloud**: No
- **GPU Model**: 1 x NVIDIA GeForce RTX 3090
- **CPU Model**: AMD Ryzen 7 3700X 8-Core Processor
- **RAM Size**: 62.70 GB
### Framework Versions
- Python: 3.12.11
- Sentence Transformers: 5.0.0
- Transformers: 4.53.3
- PyTorch: 2.7.1+cu126
- Accelerate: 1.9.0
- Datasets: 3.6.0
- Tokenizers: 0.21.2
## Citation
### BibTeX
#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
```
#### Matryoshka2dLoss
```bibtex
@misc{li20242d,
title={2D Matryoshka Sentence Embeddings},
author={Xianming Li and Zongxi Li and Jing Li and Haoran Xie and Qing Li},
year={2024},
eprint={2402.14776},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
title={Matryoshka Representation Learning},
author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
year={2024},
eprint={2205.13147},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
#### CoSENTLoss
```bibtex
@online{kexuefm-8847,
title={CoSENT: A more efficient sentence vector scheme than Sentence-BERT},
author={Su Jianlin},
year={2022},
month={Jan},
url={https://kexue.fm/archives/8847},
}
```
#### TanimotoSentLoss
```bibtex
@online{cortes-2025-tanimotosentloss,
title={TanimotoSentLoss: Tanimoto Loss for SMILES Embeddings},
author={Emmanuel Cortes},
year={2025},
month={Jan},
url={https://github.com/emapco/chem-mrl},
}
```
## Model Card Authors
[@eacortes](https://huggingface.co/eacortes)
## Model Card Contact
Manny Cortes ([email protected])