CLAP model trained on LibriSpeech

This is a sentence-transformers model finetuned from tomaarsen/clap-htsat-fused-librispeech on the librispeech_asr dataset. It maps text (sentences and paragraphs) and audio to a shared dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: tomaarsen/clap-htsat-fused-librispeech
Maximum Sequence Length: not specified
Output Dimensionality: not specified
Similarity Function: Cosine Similarity
Training Dataset:
- librispeech_asr
Language: en
License: apache-2.0

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'get_text_features', 'method_output_name': None}, 'audio': {'method': 'get_audio_features', 'method_output_name': None}}, 'module_output_name': 'sentence_embedding', 'architecture': 'ClapModel'})
)

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/clap-htsat-fused-librispeech-cont-4-epochs-128bs")
# Run inference
sentences = [
    'THERE ARE NATURES TOO TO WHOSE SENSE OF JUSTICE THE PRICE EXACTED LOOMS UP MONSTROUSLY ENORMOUS ODIOUS OPPRESSIVE WORRYING HUMILIATING EXTORTIONATE INTOLERABLE THOSE ARE THE FANATICS',
    'HE BEGAN TO WISH THAT HE HAD COMPROMISED IN SOME WAY OR OTHER THAT HE HAD SENT THE MONEY PERHAPS HE COULD DO IT UP HERE',
    'HERE THE HOLY PRELATE OF FERNS MET HIM AND RELATED A VISION IN WHICH HE HAD BEEN INSTRUCTED TO DEMAND THE ABOLITION OF THE IMPOST',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.0000, -0.1652, -0.0721],
#         [-0.1652,  1.0000,  0.6024],
#         [-0.0721,  0.6024,  1.0000]])
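The card lists cosine similarity as the similarity function, so the matrix printed above should hold the cosine of the angle between each pair of embeddings. As a minimal sketch of that computation, here is a pure-Python version run on made-up two-dimensional vectors. The helper names and toy vectors are illustrative assumptions, not part of the library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities, laid out like the tensor printed above."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy stand-ins for real embeddings (hypothetical values).
toy_embeddings = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
matrix = similarity_matrix(toy_embeddings)
for row in matrix:
    print([round(value, 4) for value in row])
```

The diagonal is 1 because every vector is perfectly aligned with itself, and the matrix is symmetric, which matches the shape of the output shown in the usage example.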

Evaluation

Metrics

Information Retrieval

Datasets: librispeech-eval and librispeech-test
Evaluated with InformationRetrievalEvaluator

Metric	librispeech-eval	librispeech-test
cosine_accuracy@1	0.616	0.66
cosine_accuracy@3	0.81	0.838
cosine_accuracy@5	0.875	0.9
cosine_accuracy@10	0.93	0.94
cosine_precision@1	0.616	0.66
cosine_precision@3	0.27	0.2793
cosine_precision@5	0.175	0.18
cosine_precision@10	0.093	0.094
cosine_recall@1	0.616	0.66
cosine_recall@3	0.81	0.838
cosine_recall@5	0.875	0.9
cosine_recall@10	0.93	0.94
cosine_ndcg@10	0.7732	0.8051
cosine_mrr@10	0.7227	0.7613
cosine_map@100	0.7263	0.7645
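In the table above, accuracy@k and recall@k are identical at every cutoff, which is what one would expect if each query has exactly one relevant item. Under that assumption (inferred from the numbers, not stated in the card), precision@k is simply accuracy@k divided by k. The sketch below checks this against the reported values; the function name is illustrative.

```python
def precision_at_k_single_relevant(accuracy_at_k, k):
    """Precision@k when every query has exactly one relevant item (assumed)."""
    return accuracy_at_k / k


# (accuracy@k, reported precision@k) pairs copied from the table above.
reported = {
    "librispeech-eval": {3: (0.81, 0.27), 5: (0.875, 0.175), 10: (0.93, 0.093)},
    "librispeech-test": {3: (0.838, 0.2793), 5: (0.9, 0.18), 10: (0.94, 0.094)},
}

max_gap = 0.0
for dataset, rows in reported.items():
    for k, (accuracy, precision) in rows.items():
        derived = precision_at_k_single_relevant(accuracy, k)
        max_gap = max(max_gap, abs(derived - precision))
        print(f"{dataset} k={k}: derived={derived:.4f} reported={precision}")
```

The derived and reported values agree to within rounding, so the low precision@10 figures reflect the single-relevant-item setup rather than poor ranking.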

Training Details

Training Dataset

librispeech_asr

Dataset: librispeech_asr at 71cacbf
Size: 132,553 training samples
Columns: audio and text
Approximate statistics based on the first 1000 samples:

	audio	text
type	dict	string
details		min: 20 characters mean: 189.15 characters max: 294 characters

Samples:

audio	text
`{'path': '374-180298-0000.flac', 'array': array([ 6.92203816e-04, 8.04404495e-04, 8.03834875e-04, ..., -3.02505396e-05, -6.59527450e-06, 1.11444592e-06]), 'sampling_rate': 48000}`	`CHAPTER SIXTEEN I MIGHT HAVE TOLD YOU OF THE BEGINNING OF THIS LIAISON IN A FEW LINES BUT I WANTED YOU TO SEE EVERY STEP BY WHICH WE CAME I TO AGREE TO WHATEVER MARGUERITE WISHED`
`{'path': '374-180298-0001.flac', 'array': array([-9.33515839e-05, -1.25754057e-04, -1.44482241e-04, ..., -2.66165182e-04, -2.03228556e-04, -1.03404833e-04]), 'sampling_rate': 48000}`	`MARGUERITE TO BE UNABLE TO LIVE APART FROM ME IT WAS THE DAY AFTER THE EVENING WHEN SHE CAME TO SEE ME THAT I SENT HER MANON LESCAUT FROM THAT TIME SEEING THAT I COULD NOT CHANGE MY MISTRESS'S LIFE I CHANGED MY OWN`
`{'path': '374-180298-0002.flac', 'array': array([-2.47883319e-04, -2.91854434e-04, -2.82971043e-04, ..., -1.43931946e-04, -1.17829914e-04, -6.32331648e-05]), 'sampling_rate': 48000}`	`I WISHED ABOVE ALL NOT TO LEAVE MYSELF TIME TO THINK OVER THE POSITION I HAD ACCEPTED FOR IN SPITE OF MYSELF IT WAS A GREAT DISTRESS TO ME THUS MY LIFE GENERALLY SO CALM`

Loss: CachedMultipleNegativesSymmetricRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "mini_batch_size": 64,
    "gather_across_devices": false
}
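To make the parameters above concrete, here is a rough pure-Python sketch of a symmetric in-batch ranking objective: cosine similarities are multiplied by `scale`, and cross-entropy is applied in both directions (anchor to positive and positive to anchor) with the matching pair as the correct class. This is a simplified illustration under those assumptions, not the library's implementation. In particular, it omits the gradient caching that `mini_batch_size` controls, which as far as I understand affects memory use rather than the value of the loss.

```python
import math


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cross_entropy_diagonal(scores):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        top = max(row)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


def symmetric_ranking_loss(anchors, positives, scale=20.0):
    """Average of anchor-to-positive and positive-to-anchor cross-entropy."""
    scores = [[scale * cos_sim(a, p) for p in positives] for a in anchors]
    transposed = [list(column) for column in zip(*scores)]
    return (cross_entropy_diagonal(scores) + cross_entropy_diagonal(transposed)) / 2


# Toy embeddings (hypothetical): the pairs are perfectly aligned in the first
# call and swapped in the second.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

With `scale` at 20.0, the aligned pairs give a loss close to zero and the swapped pairs a loss close to 20, which shows how sharply the scaled softmax penalises a mismatched pair.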

Evaluation Dataset

librispeech_asr

Dataset: librispeech_asr at 71cacbf
Size: 1,000 evaluation samples
Columns: audio and text
Approximate statistics based on the first 1000 samples:

	audio	text
type	dict	string
details		min: 8 characters mean: 104.62 characters max: 516 characters

Samples:

audio	text
`{'path': '2277-149896-0000.flac', 'array': array([ 0.00179741, 0.00170625, 0.00120927, ..., -0.00144462, -0.00102732, -0.00048062]), 'sampling_rate': 48000}`	`HE WAS IN A FEVERED STATE OF MIND OWING TO THE BLIGHT HIS WIFE'S ACTION THREATENED TO CAST UPON HIS ENTIRE FUTURE`
`{'path': '2277-149896-0001.flac', 'array': array([ 0.00111104, 0.00081758, 0.00021103, ..., -0.00138193, -0.0009173 , -0.00041702]), 'sampling_rate': 48000}`	`HE WOULD HAVE TO PAY HER THE MONEY WHICH SHE WOULD NOW REGULARLY DEMAND OR THERE WOULD BE TROUBLE IT DID NOT MATTER WHAT HE DID`
`{'path': '2277-149896-0002.flac', 'array': array([0.00080266, 0.00088462, 0.00083408, ..., 0.00105488, 0.00083673, 0.00043296]), 'sampling_rate': 48000}`	`HURSTWOOD WALKED THE FLOOR MENTALLY ARRANGING THE CHIEF POINTS OF HIS SITUATION`

Loss: CachedMultipleNegativesSymmetricRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "mini_batch_size": 64,
    "gather_across_devices": false
}

Training Hyperparameters

Non-Default Hyperparameters

eval_strategy: steps
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
learning_rate: 2e-05
num_train_epochs: 4
warmup_ratio: 0.1
bf16: True
batch_sampler: no_duplicates

All Hyperparameters

overwrite_output_dir: False
do_predict: False
eval_strategy: steps
prediction_loss_only: True
per_device_train_batch_size: 128
per_device_eval_batch_size: 128
gradient_accumulation_steps: 1
eval_accumulation_steps: None
torch_empty_cache_steps: None
learning_rate: 2e-05
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 4
max_steps: -1
lr_scheduler_type: linear
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_nan_inf_filter: True
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
use_cpu: False
seed: 42
data_seed: None
jit_mode_eval: False
bf16: True
fp16: False
half_precision_backend: None
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
debug: []
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: False
ignore_data_skip: False
fsdp: []
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: adamw_torch_fused
optim_args: None
group_by_length: False
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: every_save
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
include_for_metrics: []
eval_do_concat_batches: True
mp_parameters:
auto_find_batch_size: False
full_determinism: False
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: no
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: True
prompts: None
batch_sampler: no_duplicates
multi_dataset_batch_sampler: proportional
router_mapping: {}
learning_rate_mapping: {}

Training Logs

Epoch	Step	Training Loss	Validation Loss	librispeech-eval_cosine_ndcg@10	librispeech-test_cosine_ndcg@10
-1	-1	-	-	0.2578	0.3112
0.0801	83	1.8184	2.0262	0.2650	-
0.1602	166	1.8023	2.0307	0.2663	-
0.2403	249	1.7706	1.8957	0.3116	-
0.3205	332	1.7092	1.8817	0.2935	-
0.4006	415	1.6373	1.8190	0.3382	-
0.4807	498	1.6326	1.8886	0.3072	-
0.5608	581	1.5066	1.8244	0.3356	-
0.6409	664	1.47	1.6148	0.3962	-
0.7210	747	1.3779	1.5519	0.4049	-
0.8012	830	1.3566	1.4406	0.4418	-
0.8813	913	1.3229	1.4122	0.4560	-
0.9614	996	1.2295	1.3453	0.4777	-
1.0415	1079	1.1413	1.3783	0.4647	-
1.1216	1162	1.0143	1.2593	0.4813	-
1.2017	1245	0.9226	1.3579	0.4552	-
1.2819	1328	0.8701	1.1575	0.5407	-
1.3620	1411	0.8354	1.0661	0.5742	-
1.4421	1494	0.7969	1.0900	0.5615	-
1.5222	1577	0.7667	0.9902	0.6099	-
1.6023	1660	0.7354	1.0506	0.5770	-
1.6824	1743	0.6864	0.9822	0.5971	-
1.7625	1826	0.6407	0.9009	0.6293	-
1.8427	1909	0.6193	0.8974	0.6319	-
1.9228	1992	0.5999	0.8587	0.6571	-
2.0029	2075	0.5631	0.8723	0.6448	-
2.0830	2158	0.5036	0.8252	0.6558	-
2.1631	2241	0.4913	0.8168	0.6585	-
2.2432	2324	0.4722	0.7609	0.6969	-
2.3234	2407	0.4558	0.7469	0.6923	-
2.4035	2490	0.4425	0.6988	0.7048	-
2.4836	2573	0.4307	0.7233	0.6907	-
2.5637	2656	0.4047	0.6843	0.7170	-
2.6438	2739	0.3956	0.6634	0.7251	-
2.7239	2822	0.3846	0.6762	0.7214	-
2.8041	2905	0.3781	0.6236	0.7428	-
2.8842	2988	0.3511	0.6418	0.7397	-
2.9643	3071	0.3408	0.6076	0.7537	-
3.0444	3154	0.3324	0.6056	0.7553	-
3.1245	3237	0.3029	0.6142	0.7437	-
3.2046	3320	0.2983	0.6205	0.7451	-
3.2847	3403	0.288	0.5939	0.7618	-
3.3649	3486	0.2841	0.5538	0.7750	-
3.4450	3569	0.2796	0.5916	0.7637	-
3.5251	3652	0.2781	0.5686	0.7671	-
3.6052	3735	0.2762	0.5639	0.7726	-
3.6853	3818	0.2635	0.5395	0.7825	-
3.7654	3901	0.2657	0.5386	0.7781	-
3.8456	3984	0.2652	0.5323	0.7821	-
3.9257	4067	0.2637	0.5405	0.7797	-
-1	-1	-	-	0.7732	0.8051

Environmental Impact

Carbon emissions were measured using CodeCarbon.

Energy Consumed: 1.542 kWh
Carbon Emitted: 0.413 kg of CO2
Hours Used: 6.749 hours
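The three figures above can be combined into two derived quantities that are easier to compare across runs: the average power draw over the run and the implied carbon intensity of the electricity. The short calculation below is plain arithmetic on the reported numbers; how CodeCarbon attributes energy to individual components is not covered here.

```python
energy_kwh = 1.542  # Energy Consumed
carbon_kg = 0.413   # Carbon Emitted
hours = 6.749       # Hours Used

average_power_watts = energy_kwh / hours * 1000
carbon_intensity_kg_per_kwh = carbon_kg / energy_kwh

print(f"average power: {average_power_watts:.1f} W")
print(f"carbon intensity: {carbon_intensity_kg_per_kwh:.3f} kg CO2 per kWh")
```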

Training Hardware

On Cloud: No
GPU Model: 1 x NVIDIA GeForce RTX 3090
CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
RAM Size: 31.78 GB

Framework Versions

Python: 3.11.6
Sentence Transformers: 5.2.0.dev0
Transformers: 4.57.0.dev0
PyTorch: 2.8.0+cu128
Accelerate: 1.6.0
Datasets: 3.6.0
Tokenizers: 0.22.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Model tree for tomaarsen/clap-htsat-fused-librispeech-cont-4-epochs-128bs

Base model: laion/clap-htsat-fused
Finetuned: tomaarsen/clap-htsat-fused-librispeech
Finetuned: this model
