--- language: - en tags: - ColBERT - PyLate - sentence-transformers - sentence-similarity - feature-extraction - generated_from_trainer - dataset_size:640000 - loss:Distillation base_model: NeuML/bert-hash-nano datasets: - lightonai/ms-marco-en-bge-gemma pipeline_tag: sentence-similarity library_name: PyLate license: apache-2.0 metrics: - MaxSim_accuracy@1 - MaxSim_accuracy@3 - MaxSim_accuracy@5 - MaxSim_accuracy@10 - MaxSim_precision@1 - MaxSim_precision@3 - MaxSim_precision@5 - MaxSim_precision@10 - MaxSim_recall@1 - MaxSim_recall@3 - MaxSim_recall@5 - MaxSim_recall@10 - MaxSim_ndcg@10 - MaxSim_mrr@10 - MaxSim_map@100 model-index: - name: ColBERT MUVERA Nano results: - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoClimateFEVER type: NanoClimateFEVER metrics: - type: MaxSim_accuracy@1 value: 0.3 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.4 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.48 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.54 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.3 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.14666666666666664 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.10800000000000001 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.07200000000000001 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.12999999999999998 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.19833333333333333 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.24 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.295 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.25693689476232956 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.3689126984126984 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.20238579189860822 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoDBPedia type: NanoDBPedia metrics: - type: 
MaxSim_accuracy@1 value: 0.66 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.82 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.9 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.92 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.66 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.5666666666666665 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.524 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.48200000000000004 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.05242687651506869 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.13921550887445522 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.19747421384152156 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.3240079111162478 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.5626087795584744 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.7505 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.43170467867134016 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoFEVER type: NanoFEVER metrics: - type: MaxSim_accuracy@1 value: 0.8 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.84 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.86 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.92 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.8 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.29333333333333333 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.184 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.09999999999999998 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.7566666666666667 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.8133333333333332 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.84 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.89 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.8297799319515553 
name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.8250476190476193 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.8113438403701564 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoFiQA2018 type: NanoFiQA2018 metrics: - type: MaxSim_accuracy@1 value: 0.4 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.52 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.66 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.76 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.4 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.22666666666666668 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.18799999999999997 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.11199999999999999 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.2461904761904762 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.3277936507936508 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.4542380952380952 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.5184047619047619 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.4359941442854233 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.5049126984126984 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.36500352221028065 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoHotpotQA type: NanoHotpotQA metrics: - type: MaxSim_accuracy@1 value: 0.8 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.94 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.96 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 1.0 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.8 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.4599999999999999 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.30799999999999994 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 
0.16999999999999996 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.4 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.69 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.77 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.85 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.7791724226460205 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.8793333333333332 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.7004123402170076 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoMSMARCO type: NanoMSMARCO metrics: - type: MaxSim_accuracy@1 value: 0.44 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.64 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.72 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.78 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.44 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.21333333333333332 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.14400000000000002 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.07800000000000001 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.44 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.64 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.72 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.78 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.6037947687284007 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.5473333333333332 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.5593945067344082 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoNFCorpus type: NanoNFCorpus metrics: - type: MaxSim_accuracy@1 value: 0.44 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.6 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.62 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.66 name: Maxsim 
Accuracy@10 - type: MaxSim_precision@1 value: 0.44 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.3933333333333333 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.33199999999999996 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.26 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.042699664136408834 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.07806895769271134 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.09327844593663599 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.11679808931654996 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.32747074296711476 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.5297222222222222 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.14001894101433573 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoNQ type: NanoNQ metrics: - type: MaxSim_accuracy@1 value: 0.36 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.58 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.72 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.8 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.36 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.19333333333333333 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.14800000000000002 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.08199999999999999 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.35 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.56 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.69 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.77 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.5666668175637105 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.5109365079365079 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.5060753330396947 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information 
Retrieval dataset: name: NanoQuoraRetrieval type: NanoQuoraRetrieval metrics: - type: MaxSim_accuracy@1 value: 0.86 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.9 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.92 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.94 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.86 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.32666666666666666 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.21599999999999997 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.11599999999999998 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.784 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.8313333333333333 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.8713333333333333 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.898 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.8685515910259487 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.8838888888888888 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.8518536903500825 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoSCIDOCS type: NanoSCIDOCS metrics: - type: MaxSim_accuracy@1 value: 0.38 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.5 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.62 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.76 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.38 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.2333333333333333 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.18799999999999997 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.128 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.08066666666666666 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.14566666666666667 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.19466666666666668 name: Maxsim Recall@5 - type: 
MaxSim_recall@10 value: 0.26366666666666666 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.26913925437260855 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.4848809523809524 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.19279409328239955 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoArguAna type: NanoArguAna metrics: - type: MaxSim_accuracy@1 value: 0.14 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.32 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.42 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.56 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.14 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.10666666666666666 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.08400000000000002 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.056000000000000015 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.14 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.32 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.42 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.56 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.3381955845251465 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.2689920634920634 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.27812414648450906 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoSciFact type: NanoSciFact metrics: - type: MaxSim_accuracy@1 value: 0.6 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.72 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 0.8 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.82 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.6 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.2533333333333333 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.17600000000000002 name: Maxsim 
Precision@5 - type: MaxSim_precision@10 value: 0.092 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.565 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.7 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.785 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.81 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.6965110171594289 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.6668888888888889 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.659209595959596 name: Maxsim Map@100 - task: type: py-late-information-retrieval name: Py Late Information Retrieval dataset: name: NanoTouche2020 type: NanoTouche2020 metrics: - type: MaxSim_accuracy@1 value: 0.673469387755102 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.9591836734693877 name: Maxsim Accuracy@3 - type: MaxSim_accuracy@5 value: 1.0 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 1.0 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.673469387755102 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.7074829931972788 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.6612244897959185 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.5204081632653061 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.04513010438618095 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.14053118239478446 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.21284594516135155 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.3318018815073785 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.5876998904974655 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.826530612244898 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.41690806080588444 name: Maxsim Map@100 - task: type: nano-beir name: Nano BEIR dataset: name: NanoBEIR mean type: NanoBEIR_mean metrics: - type: MaxSim_accuracy@1 value: 0.5271899529042385 name: Maxsim Accuracy@1 - type: MaxSim_accuracy@3 value: 0.6722448979591836 name: Maxsim Accuracy@3 
- type: MaxSim_accuracy@5 value: 0.7446153846153846 name: Maxsim Accuracy@5 - type: MaxSim_accuracy@10 value: 0.8046153846153847 name: Maxsim Accuracy@10 - type: MaxSim_precision@1 value: 0.5271899529042385 name: Maxsim Precision@1 - type: MaxSim_precision@3 value: 0.3169858712715855 name: Maxsim Precision@3 - type: MaxSim_precision@5 value: 0.25086342229199377 name: Maxsim Precision@5 - type: MaxSim_precision@10 value: 0.17449293563579277 name: Maxsim Precision@10 - type: MaxSim_recall@1 value: 0.31021388112011294 name: Maxsim Recall@1 - type: MaxSim_recall@3 value: 0.4295596897247899 name: Maxsim Recall@3 - type: MaxSim_recall@5 value: 0.49914128462904644 name: Maxsim Recall@5 - type: MaxSim_recall@10 value: 0.5698214854239696 name: Maxsim Recall@10 - type: MaxSim_ndcg@10 value: 0.5478862953879712 name: Maxsim Ndcg@10 - type: MaxSim_mrr@10 value: 0.6190676783533925 name: Maxsim Mrr@10 - type: MaxSim_map@100 value: 0.47040219546448486 name: Maxsim Map@100 ---

# ColBERT MUVERA Nano

This is a [PyLate](https://github.com/lightonai/pylate) model finetuned from [neuml/bert-hash-nano](https://huggingface.co/neuml/bert-hash-nano) on the [msmarco-en-bge-gemma unnormalized split](https://huggingface.co/datasets/lightonai/ms-marco-en-bge-gemma) dataset. It maps sentences & paragraphs to sequences of 128-dimensional dense vectors and can be used for semantic textual similarity using the MaxSim operator.

This model is trained with un-normalized scores, making it compatible with [MUVERA fixed-dimensional encoding](https://arxiv.org/abs/2405.19504).

## Usage (txtai)

This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).
_Note: txtai 9.0+ is required for late interaction model support_

```python
import txtai

embeddings = txtai.Embeddings(
  sparse="neuml/colbert-muvera-nano",
  content=True
)

# documents() is a placeholder for your own iterable of texts to index
embeddings.index(documents())

# Run a query
embeddings.search("query to run")
```

Late interaction models excel as reranker pipelines.

```python
from txtai.pipeline import Reranker, Similarity

# Reuses the embeddings instance built in the previous example
similarity = Similarity(path="neuml/colbert-muvera-nano", lateencode=True)
ranker = Reranker(embeddings, similarity)
ranker("query to run")
```

## Usage (PyLate)

Alternatively, the model can be loaded with [PyLate](https://github.com/lightonai/pylate).

```python
from pylate import rank, models

queries = [
    "query A",
    "query B",
]

# One list of candidate documents per query
documents = [
    ["document A", "document B"],
    ["document 1", "document C", "document B"],
]

# IDs aligned position by position with the candidate documents
documents_ids = [
    [1, 2],
    [1, 3, 2],
]

model = models.ColBERT(
    model_name_or_path="neuml/colbert-muvera-nano",
)

queries_embeddings = model.encode(
    queries,
    is_query=True,
)

documents_embeddings = model.encode(
    documents,
    is_query=False,
)

# Rerank each query's candidates by MaxSim score
reranked_documents = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=queries_embeddings,
    documents_embeddings=documents_embeddings,
)
```

### Full Model Architecture

```
ColBERT(
  (0): Transformer({'max_seq_length': 299, 'do_lower_case': False}) with Transformer model: BertHashModel
  (1): Dense({'in_features': 128, 'out_features': 128, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
```

## Evaluation

### BEIR Subset

The following table shows a subset of BEIR scored with the [txtai benchmarks script](https://github.com/neuml/txtai/blob/master/examples/benchmarks.py).

Scores reported are `ndcg@10` and grouped into the following three categories.
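Since the categories differ mainly in where MaxSim is applied, a minimal sketch of the operator may help. It sums, over the query token vectors, the best similarity each one achieves against any document token vector. The helper names and the toy 2-dimensional vectors are illustrative only, and using a raw dot product as the similarity is an assumption made here; this is not the PyLate or txtai implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vectors: List[Vector], document_vectors: List[Vector]) -> float:
    """Sum, over query token vectors, of the best dot product with any document token vector."""
    return sum(
        max(dot(q, d) for d in document_vectors)
        for q in query_vectors
    )


# Toy 2-dimensional token vectors (hypothetical; the model emits 128-dimensional ones).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.5, 0.5]]
doc_b = [[0.0, 0.25], [0.25, 0.0]]

# Score each document against the query and rank from best to worst.
scores = {"doc_a": maxsim(query, doc_a), "doc_b": maxsim(query, doc_b)}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Every query token contributes its single best match, so a document only has to cover each query token somewhere in its text to score well.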
#### FULL multi-vector maxsim

| Model | Parameters | NFCorpus | SciDocs | SciFact | Average |
|:------------------|:-----------|:---------|:---------|:--------|:--------|
| [ColBERT v2](https://huggingface.co/colbert-ir/colbertv2.0) | 110M | 0.3165 | 0.1497 | 0.6456 | 0.3706 |
| [ColBERT MUVERA Femto](https://huggingface.co/neuml/colbert-muvera-femto) | 0.2M | 0.2513 | 0.0870 | 0.4710 | 0.2698 |
| [ColBERT MUVERA Pico](https://huggingface.co/neuml/colbert-muvera-pico) | 0.4M | 0.3005 | 0.1117 | 0.6452 | 0.3525 |
| [**ColBERT MUVERA Nano**](https://huggingface.co/neuml/colbert-muvera-nano) | **0.9M** | **0.3180** | **0.1262** | **0.6576** | **0.3673** |
| [ColBERT MUVERA Micro](https://huggingface.co/neuml/colbert-muvera-micro) | 4M | 0.3235 | 0.1244 | 0.6676 | 0.3718 |

#### MUVERA encoding + maxsim re-ranking of the top 100 results per MUVERA paper

| Model | Parameters | NFCorpus | SciDocs | SciFact | Average |
|:------------------|:-----------|:---------|:---------|:--------|:--------|
| [ColBERT v2](https://huggingface.co/colbert-ir/colbertv2.0) | 110M | 0.3025 | 0.1538 | 0.6278 | 0.3614 |
| [ColBERT MUVERA Femto](https://huggingface.co/neuml/colbert-muvera-femto) | 0.2M | 0.2316 | 0.0858 | 0.4641 | 0.2605 |
| [ColBERT MUVERA Pico](https://huggingface.co/neuml/colbert-muvera-pico) | 0.4M | 0.2821 | 0.1004 | 0.6090 | 0.3305 |
| [**ColBERT MUVERA Nano**](https://huggingface.co/neuml/colbert-muvera-nano) | **0.9M** | **0.2996** | **0.1201** | **0.6249** | **0.3482** |
| [ColBERT MUVERA Micro](https://huggingface.co/neuml/colbert-muvera-micro) | 4M | 0.3095 | 0.1228 | 0.6464 | 0.3596 |

#### MUVERA encoding only

| Model | Parameters | NFCorpus | SciDocs | SciFact | Average |
|:------------------|:-----------|:---------|:---------|:--------|:--------|
| [ColBERT v2](https://huggingface.co/colbert-ir/colbertv2.0) | 110M | 0.2356 | 0.1229 | 0.5002 | 0.2862 |
| [ColBERT MUVERA Femto](https://huggingface.co/neuml/colbert-muvera-femto) | 0.2M | 0.1851 | 0.0411 | 0.3518 | 0.1927 |
| [ColBERT MUVERA Pico](https://huggingface.co/neuml/colbert-muvera-pico) | 0.4M | 0.1926 | 0.0564 | 0.4424 | 0.2305 |
| [**ColBERT MUVERA Nano**](https://huggingface.co/neuml/colbert-muvera-nano) | **0.9M** | **0.2355** | **0.0807** | **0.4904** | **0.2689** |
| [ColBERT MUVERA Micro](https://huggingface.co/neuml/colbert-muvera-micro) | 4M | 0.2348 | 0.0882 | 0.4875 | 0.2702 |

_Note: The scores reported don't match scores reported in the respective papers due to different default settings in the txtai benchmark scripts._

As noted earlier, models trained with min-max score normalization don't perform well with MUVERA encoding. See this [GitHub Issue](https://github.com/lightonai/pylate/issues/142) for more.

**This model packs a punch into 950K parameters. It uses the same architecture as the 4M parameter model, with the modified embeddings layer bringing the parameter count down. It even beats the original ColBERT v2 model on a couple of the benchmarks.**

### Nano BEIR

* Dataset: `NanoBEIR_mean`
* Evaluated with `pylate.evaluation.nano_beir_evaluator.NanoBEIREvaluator`

| Metric | Value |
|:--------------------|:-----------|
| MaxSim_accuracy@1 | 0.5272 |
| MaxSim_accuracy@3 | 0.6722 |
| MaxSim_accuracy@5 | 0.7446 |
| MaxSim_accuracy@10 | 0.8046 |
| MaxSim_precision@1 | 0.5272 |
| MaxSim_precision@3 | 0.317 |
| MaxSim_precision@5 | 0.2509 |
| MaxSim_precision@10 | 0.1745 |
| MaxSim_recall@1 | 0.3102 |
| MaxSim_recall@3 | 0.4296 |
| MaxSim_recall@5 | 0.4991 |
| MaxSim_recall@10 | 0.5698 |
| **MaxSim_ndcg@10** | **0.5479** |
| MaxSim_mrr@10 | 0.6191 |
| MaxSim_map@100 | 0.4704 |

## Training Details

### Training Hyperparameters

#### Non-Default Hyperparameters

- `eval_strategy`: steps
- `per_device_train_batch_size`: 32
- `learning_rate`: 0.0003
- `num_train_epochs`: 1
- `warmup_ratio`: 0.05
- `fp16`: True

#### All Hyperparameters

- `overwrite_output_dir`: False
- `do_predict`: False
- `eval_strategy`: steps
- `prediction_loss_only`: True
- `per_device_train_batch_size`: 32
- `per_device_eval_batch_size`: 8
- `per_gpu_train_batch_size`: None
- `per_gpu_eval_batch_size`: None
- `gradient_accumulation_steps`: 1
- `eval_accumulation_steps`: None
- `torch_empty_cache_steps`: None
- `learning_rate`: 0.0003
- `weight_decay`: 0.0
- `adam_beta1`: 0.9
- `adam_beta2`: 0.999
- `adam_epsilon`: 1e-08
- `max_grad_norm`: 1.0
- `num_train_epochs`: 1
- `max_steps`: -1
- `lr_scheduler_type`: linear
- `lr_scheduler_kwargs`: {}
- `warmup_ratio`: 0.05
- `warmup_steps`: 0
- `log_level`: passive
- `log_level_replica`: warning
- `log_on_each_node`: True
- `logging_nan_inf_filter`: True
- `save_safetensors`: True
- `save_on_each_node`: False
- `save_only_model`: False
- `restore_callback_states_from_checkpoint`: False
- `no_cuda`: False
- `use_cpu`: False
- `use_mps_device`: False
- `seed`: 42
- `data_seed`: None
- `jit_mode_eval`: False
- `bf16`: False
- `fp16`: True
- `fp16_opt_level`: O1
- `half_precision_backend`: auto
- `bf16_full_eval`: False
- `fp16_full_eval`: False
- `tf32`: None
- `local_rank`: 0
- `ddp_backend`: None
- `tpu_num_cores`: None
- `tpu_metrics_debug`: False
- `debug`: []
- `dataloader_drop_last`: False
- `dataloader_num_workers`: 0
- `dataloader_prefetch_factor`: None
- `past_index`: -1
- `disable_tqdm`: False
- `remove_unused_columns`: True
- `label_names`: None
- `load_best_model_at_end`: False
- `ignore_data_skip`: False
- `fsdp`: []
- `fsdp_min_num_params`: 0
- `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- `fsdp_transformer_layer_cls_to_wrap`: None
- `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- `parallelism_config`: None
- `deepspeed`: None
- `label_smoothing_factor`: 0.0
- `optim`: adamw_torch_fused
- `optim_args`: None
- `adafactor`: False
- `group_by_length`: False
- `length_column_name`: length
- `project`: huggingface
- `trackio_space_id`: trackio
- `ddp_find_unused_parameters`: None
- `ddp_bucket_cap_mb`: None
- `ddp_broadcast_buffers`: False
- `dataloader_pin_memory`: True
- `dataloader_persistent_workers`: False
- `skip_memory_metrics`: True
- `use_legacy_prediction_loop`: False
- `push_to_hub`: False
- `resume_from_checkpoint`: None
- `hub_model_id`: None
- `hub_strategy`: every_save
- `hub_private_repo`: None
- `hub_always_push`: False
- `hub_revision`: None
- `gradient_checkpointing`: False
- `gradient_checkpointing_kwargs`: None
- `include_inputs_for_metrics`: False
- `include_for_metrics`: []
- `eval_do_concat_batches`: True
- `fp16_backend`: auto
- `push_to_hub_model_id`: None
- `push_to_hub_organization`: None
- `mp_parameters`:
- `auto_find_batch_size`: False
- `full_determinism`: False
- `torchdynamo`: None
- `ray_scope`: last
- `ddp_timeout`: 1800
- `torch_compile`: False
- `torch_compile_backend`: None
- `torch_compile_mode`: None
- `include_tokens_per_second`: False
- `include_num_input_tokens_seen`: no
- `neftune_noise_alpha`: None
- `optim_target_modules`: None
- `batch_eval_metrics`: False
- `eval_on_start`: False
- `use_liger_kernel`: False
- `liger_kernel_config`: None
- `eval_use_gather_object`: False
- `average_tokens_across_devices`: True
- `prompts`: None
- `batch_sampler`: batch_sampler
- `multi_dataset_batch_sampler`: proportional

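
As a rough sanity check on the hyperparameters listed above, the sketch below works out how many optimizer steps one epoch takes and what a linear warmup and decay schedule looks like for those values. The dataset size comes from the card metadata (`dataset_size:640000`). The single-device setting is an assumption, since the card does not state how many devices were used, and `linear_schedule` is an illustrative helper rather than the trainer's own code.

```python
import math

# Values taken from the model card metadata and the hyperparameter list.
dataset_size = 640_000
per_device_train_batch_size = 32
gradient_accumulation_steps = 1
num_train_epochs = 1
warmup_ratio = 0.05
peak_learning_rate = 3e-4

# Assumption: a single training device; the card does not state the device count.
num_devices = 1

examples_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices
total_steps = math.ceil(dataset_size / examples_per_step) * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)


def linear_schedule(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return peak_learning_rate * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_learning_rate * remaining / max(1, total_steps - warmup_steps)


print(total_steps, warmup_steps)
for step in (0, 500, 1_000, 10_500, 20_000):
    print(step, linear_schedule(step))
```

Under the single-device assumption this gives 20,000 optimizer steps, of which the first 1,000 are warmup. With more devices the step count shrinks proportionally.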
### Framework Versions

- Python: 3.10.18
- Sentence Transformers: 4.0.2
- PyLate: 1.3.2
- Transformers: 4.57.0
- PyTorch: 2.8.0+cu128
- Accelerate: 1.10.1
- Datasets: 4.1.1
- Tokenizers: 0.22.1

## Citation

### BibTeX

#### Sentence Transformers

```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084"
}
```

#### PyLate

```bibtex
@misc{PyLate,
    title={PyLate: Flexible Training and Retrieval for Late Interaction Models},
    author={Chaffin, Antoine and Sourty, Raphaël},
    url={https://github.com/lightonai/pylate},
    year={2024}
}
```