CLAP model trained on LibriSpeech

This is a sentence-transformers model finetuned from tomaarsen/clap-htsat-fused-librispeech on the librispeech_asr dataset. It maps text and audio into a shared dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: tomaarsen/clap-htsat-fused-librispeech
  • Maximum Sequence Length: not specified
  • Output Dimensionality: not specified
  • Similarity Function: Cosine Similarity
  • Training Dataset: librispeech_asr
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'get_text_features', 'method_output_name': None}, 'audio': {'method': 'get_audio_features', 'method_output_name': None}}, 'module_output_name': 'sentence_embedding', 'architecture': 'ClapModel'})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/clap-htsat-fused-librispeech-cont-4-epochs-128bs")
# Run inference
sentences = [
    'THERE ARE NATURES TOO TO WHOSE SENSE OF JUSTICE THE PRICE EXACTED LOOMS UP MONSTROUSLY ENORMOUS ODIOUS OPPRESSIVE WORRYING HUMILIATING EXTORTIONATE INTOLERABLE THOSE ARE THE FANATICS',
    'HE BEGAN TO WISH THAT HE HAD COMPROMISED IN SOME WAY OR OTHER THAT HE HAD SENT THE MONEY PERHAPS HE COULD DO IT UP HERE',
    'HERE THE HOLY PRELATE OF FERNS MET HIM AND RELATED A VISION IN WHICH HE HAD BEEN INSTRUCTED TO DEMAND THE ABOLITION OF THE IMPOST',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000, -0.1652, -0.0721],
#         [-0.1652,  1.0000,  0.6024],
#         [-0.0721,  0.6024,  1.0000]])
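
The similarity matrix printed above can also be used to find each sentence's nearest neighbour. The following is a minimal sketch in plain Python that reuses the scores from the example output; the helper name `nearest_neighbours` is hypothetical and is not part of Sentence Transformers.

```python
# Similarity scores copied from the example output above.
similarities = [
    [1.0000, -0.1652, -0.0721],
    [-0.1652, 1.0000, 0.6024],
    [-0.0721, 0.6024, 1.0000],
]


def nearest_neighbours(matrix):
    """Hypothetical helper: for each row, return the index of the most similar *other* row."""
    result = []
    for i, row in enumerate(matrix):
        # Skip the diagonal, which is always 1.0 for cosine similarity.
        candidates = [j for j in range(len(row)) if j != i]
        result.append(max(candidates, key=lambda j: row[j]))
    return result


print(nearest_neighbours(similarities))
# [2, 2, 1]
```

With these scores, the second and third sentences are each other's closest match, while the first sentence is only weakly related to either.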

Evaluation

Metrics

Information Retrieval

Metric               librispeech-eval  librispeech-test
cosine_accuracy@1    0.616             0.66
cosine_accuracy@3    0.81              0.838
cosine_accuracy@5    0.875             0.9
cosine_accuracy@10   0.93              0.94
cosine_precision@1   0.616             0.66
cosine_precision@3   0.27              0.2793
cosine_precision@5   0.175             0.18
cosine_precision@10  0.093             0.094
cosine_recall@1      0.616             0.66
cosine_recall@3      0.81              0.838
cosine_recall@5      0.875             0.9
cosine_recall@10     0.93              0.94
cosine_ndcg@10       0.7732            0.8051
cosine_mrr@10        0.7227            0.7613
cosine_map@100       0.7263            0.7645
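
In the table, accuracy@k equals recall@k, and precision@k equals accuracy@k divided by k. That pattern is what one would expect if every query has exactly one relevant item. Under that assumption (the card does not state it), the metrics can be sketched as below. The function `retrieval_metrics` and the toy ranks are hypothetical, and this is not the evaluator's actual code.

```python
def retrieval_metrics(ranks, k):
    """Compute accuracy@k, precision@k, recall@k and MRR@k.

    ranks: 1-based rank of the single relevant item for each query,
           or None if it was not retrieved at all.
    Assumes exactly one relevant item per query.
    """
    n = len(ranks)
    hits = [r for r in ranks if r is not None and r <= k]
    return {
        "accuracy": len(hits) / n,           # fraction of queries with a hit in the top k
        "precision": len(hits) / (k * n),    # at most 1 of the k results can be relevant
        "recall": len(hits) / n,             # the single relevant item is found or not
        "mrr": sum(1.0 / r for r in hits) / n,
    }


# Toy example: four queries whose relevant item landed at these ranks.
metrics = retrieval_metrics([1, 2, 4, None], k=3)
print(metrics)
```

With one relevant item per query, precision@k is accuracy@k divided by k, which matches rows such as 0.81 / 3 = 0.27 in the table.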

Training Details

Training Dataset

librispeech_asr

  • Dataset: librispeech_asr at 71cacbf
  • Size: 132,553 training samples
  • Columns: audio and text
  • Approximate statistics based on the first 1000 samples:
    • audio (type: dict)
    • text (type: string)
      • min: 20 characters
      • mean: 189.15 characters
      • max: 294 characters
  • Samples:
    • audio: {'path': '374-180298-0000.flac', 'array': array([ 6.92203816e-04, 8.04404495e-04, 8.03834875e-04, ..., -3.02505396e-05, -6.59527450e-06, 1.11444592e-06]), 'sampling_rate': 48000}
      text: CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED
    • audio: {'path': '374-180298-0001.flac', 'array': array([-9.33515839e-05, -1.25754057e-04, -1.44482241e-04, ..., -2.66165182e-04, -2.03228556e-04, -1.03404833e-04]), 'sampling_rate': 48000}
      text: MARGUERITE TO BE UNABLE TO LIVE APART FROM ME IT WAS THE DAY AFTER THE EVENING WHEN SHE CAME TO SEE ME THAT I SENT HER MANON LESCAUT FROM THAT TIME SEEING THAT I COULD NOT CHANGE MY MISTRESS'S LIFE I CHANGED MY OWN
    • audio: {'path': '374-180298-0002.flac', 'array': array([-2.47883319e-04, -2.91854434e-04, -2.82971043e-04, ..., -1.43931946e-04, -1.17829914e-04, -6.32331648e-05]), 'sampling_rate': 48000}
      text: I WISHED ABOVE ALL NOT TO LEAVE MYSELF TIME TO THINK OVER THE POSITION I HAD ACCEPTED FOR IN SPITE OF MYSELF IT WAS A GREAT DISTRESS TO ME THUS MY LIFE GENERALLY SO CALM
  • Loss: CachedMultipleNegativesSymmetricRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "mini_batch_size": 64,
        "gather_across_devices": false
    }
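
To illustrate what these parameters control, here is a minimal sketch of a symmetric in-batch ranking loss in plain Python. It assumes the general recipe of scaling cosine similarities by `scale`, applying cross-entropy in both directions (anchor to positive and positive to anchor), and averaging the two. It is not the library's implementation: the gradient caching that `mini_batch_size` configures is omitted, and all function names are hypothetical.

```python
import math


def cos_sim(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def diagonal_cross_entropy(scores):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(scores)


def symmetric_ranking_loss(anchors, positives, scale=20.0):
    """Average of the anchor->positive and positive->anchor ranking losses."""
    scores = [[scale * cos_sim(a, p) for p in positives] for a in anchors]
    transposed = [list(col) for col in zip(*scores)]
    return (diagonal_cross_entropy(scores) + diagonal_cross_entropy(transposed)) / 2


# Toy embeddings: correctly paired items give a loss near zero,
# while swapped pairs give a large loss.
aligned = symmetric_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

In this sketch every other item in the batch acts as a negative, which is why a larger batch generally makes the ranking task harder.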
    

Evaluation Dataset

librispeech_asr

  • Dataset: librispeech_asr at 71cacbf
  • Size: 1,000 evaluation samples
  • Columns: audio and text
  • Approximate statistics based on the first 1000 samples:
    • audio (type: dict)
    • text (type: string)
      • min: 8 characters
      • mean: 104.62 characters
      • max: 516 characters
  • Samples:
    • audio: {'path': '2277-149896-0000.flac', 'array': array([ 0.00179741, 0.00170625, 0.00120927, ..., -0.00144462, -0.00102732, -0.00048062]), 'sampling_rate': 48000}
      text: HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE
    • audio: {'path': '2277-149896-0001.flac', 'array': array([ 0.00111104, 0.00081758, 0.00021103, ..., -0.00138193, -0.0009173 , -0.00041702]), 'sampling_rate': 48000}
      text: HE WOULD HAVE TO PAY HER THE MONEY WHICH SHE WOULD NOW REGULARLY DEMAND OR THERE WOULD BE TROUBLE IT DID NOT MATTER WHAT HE DID
    • audio: {'path': '2277-149896-0002.flac', 'array': array([0.00080266, 0.00088462, 0.00083408, ..., 0.00105488, 0.00083673, 0.00043296]), 'sampling_rate': 48000}
      text: HURSTWOOD WALKED THE FLOOR MENTALLY ARRANGING THE CHIEF POINTS OF HIS SITUATION
  • Loss: CachedMultipleNegativesSymmetricRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "mini_batch_size": 64,
        "gather_across_devices": false
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • warmup_ratio: 0.1
  • bf16: True
  • batch_sampler: no_duplicates

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • use_cpu: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • bf16: True
  • fp16: False
  • half_precision_backend: None
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch   Step  Training Loss  Validation Loss  librispeech-eval_cosine_ndcg@10  librispeech-test_cosine_ndcg@10
-1      -1    -              -                0.2578                           0.3112
0.0801  83    1.8184         2.0262           0.2650                           -
0.1602  166   1.8023         2.0307           0.2663                           -
0.2403  249   1.7706         1.8957           0.3116                           -
0.3205  332   1.7092         1.8817           0.2935                           -
0.4006  415   1.6373         1.8190           0.3382                           -
0.4807  498   1.6326         1.8886           0.3072                           -
0.5608  581   1.5066         1.8244           0.3356                           -
0.6409  664   1.47           1.6148           0.3962                           -
0.7210  747   1.3779         1.5519           0.4049                           -
0.8012  830   1.3566         1.4406           0.4418                           -
0.8813  913   1.3229         1.4122           0.4560                           -
0.9614  996   1.2295         1.3453           0.4777                           -
1.0415  1079  1.1413         1.3783           0.4647                           -
1.1216  1162  1.0143         1.2593           0.4813                           -
1.2017  1245  0.9226         1.3579           0.4552                           -
1.2819  1328  0.8701         1.1575           0.5407                           -
1.3620  1411  0.8354         1.0661           0.5742                           -
1.4421  1494  0.7969         1.0900           0.5615                           -
1.5222  1577  0.7667         0.9902           0.6099                           -
1.6023  1660  0.7354         1.0506           0.5770                           -
1.6824  1743  0.6864         0.9822           0.5971                           -
1.7625  1826  0.6407         0.9009           0.6293                           -
1.8427  1909  0.6193         0.8974           0.6319                           -
1.9228  1992  0.5999         0.8587           0.6571                           -
2.0029  2075  0.5631         0.8723           0.6448                           -
2.0830  2158  0.5036         0.8252           0.6558                           -
2.1631  2241  0.4913         0.8168           0.6585                           -
2.2432  2324  0.4722         0.7609           0.6969                           -
2.3234  2407  0.4558         0.7469           0.6923                           -
2.4035  2490  0.4425         0.6988           0.7048                           -
2.4836  2573  0.4307         0.7233           0.6907                           -
2.5637  2656  0.4047         0.6843           0.7170                           -
2.6438  2739  0.3956         0.6634           0.7251                           -
2.7239  2822  0.3846         0.6762           0.7214                           -
2.8041  2905  0.3781         0.6236           0.7428                           -
2.8842  2988  0.3511         0.6418           0.7397                           -
2.9643  3071  0.3408         0.6076           0.7537                           -
3.0444  3154  0.3324         0.6056           0.7553                           -
3.1245  3237  0.3029         0.6142           0.7437                           -
3.2046  3320  0.2983         0.6205           0.7451                           -
3.2847  3403  0.288          0.5939           0.7618                           -
3.3649  3486  0.2841         0.5538           0.7750                           -
3.4450  3569  0.2796         0.5916           0.7637                           -
3.5251  3652  0.2781         0.5686           0.7671                           -
3.6052  3735  0.2762         0.5639           0.7726                           -
3.6853  3818  0.2635         0.5395           0.7825                           -
3.7654  3901  0.2657         0.5386           0.7781                           -
3.8456  3984  0.2652         0.5323           0.7821                           -
3.9257  4067  0.2637         0.5405           0.7797                           -
-1      -1    -              -                0.7732                           0.8051

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Energy Consumed: 1.542 kWh
  • Carbon Emitted: 0.413 kg of CO2
  • Hours Used: 6.749 hours
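
For context, the three figures above imply an average power draw and a carbon intensity. The short sketch below only does that arithmetic on the reported numbers; the variable names are illustrative, and the result covers whatever components CodeCarbon measured, which the card does not break down.

```python
energy_kwh = 1.542   # Energy Consumed
carbon_kg = 0.413    # Carbon Emitted
hours = 6.749        # Hours Used

# Average power over the run, in watts.
avg_power_w = energy_kwh / hours * 1000
# Implied carbon intensity of the electricity, in kg CO2 per kWh.
carbon_intensity = carbon_kg / energy_kwh

print(f"{avg_power_w:.0f} W, {carbon_intensity:.3f} kg CO2/kWh")
```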

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
  • RAM Size: 31.78 GB

Framework Versions

  • Python: 3.11.6
  • Sentence Transformers: 5.2.0.dev0
  • Transformers: 4.57.0.dev0
  • PyTorch: 2.8.0+cu128
  • Accelerate: 1.6.0
  • Datasets: 3.6.0
  • Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}