lbourdois committed · Commit 1aa1caa · verified · 1 Parent(s): c27da8a

Update model card for Danish

Files changed (1)
  1. README.md +51 -23
README.md CHANGED
@@ -1,42 +1,70 @@
  ---
- pipeline_tag: fill-mask
  language: dan
  license: mit
  tags:
  - trimmed
- library_name: transformers
  base_model: intfloat/multilingual-e5-base
  base_model_relation: quantized
  datasets:
- - Lumberjackk/fineweb-2-trimming
  ---
 
  # multilingual-e5-base-dan-32768
 
- This model is a 60.0% smaller version of [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base)
- optimized for multiple languages with vocabulary pruning.
 
- **Total vocabulary size**: 32768 tokens (reduced from 250002)
- **Tokenizer type**: Unigram
- **Training samples per language**: 200000 texts
- **Dataset**: [Lumberjackk/fineweb-2-trimming](https://huggingface.co/datasets/Lumberjackk/fineweb-2-trimming)
 
- ## Language Distribution
-
- - **dan**: 32768 tokens
-
- This pruned model should perform similarly to the original model for the selected languages with a much smaller
- memory footprint. However, it may not perform well for other languages as tokens not commonly used in the selected
- languages were removed from the vocabulary.
 
  ## Usage
-
- You can use this model with the Transformers library:
  ```python
- from transformers import AutoModel, AutoTokenizer
 
- model_name = "Lumberjackk/multilingual-e5-base-dan-32768"
- model = AutoModel.from_pretrained(model_name)
- tokenizer = AutoTokenizer.from_pretrained(model_name)
  ```
-
  ---
+ pipeline_tag: sentence-similarity
  language: dan
  license: mit
  tags:
  - trimmed
+ library_name: sentence-transformers
  base_model: intfloat/multilingual-e5-base
  base_model_relation: quantized
  datasets:
+ - lbourdois/fineweb-2-trimming
  ---
 
  # multilingual-e5-base-dan-32768
+ This model is a **60.00% smaller** version of [intfloat/multilingual-e5-base](https://huggingface.co/intfloat/multilingual-e5-base) optimized for **Danish** language via vocabulary size reduction using the [trimming](https://huggingface.co/blog/lbourdois/introduction-to-trimming) method.
+ This trimmed model should perform similarly to the original model with only 32,768 tokens and a much smaller memory footprint. However, it may not perform well for other languages as tokens not commonly used in the selected languages were removed from the vocabulary.
 
+ ## Model Statistics
+ | Metric | Original | Trimmed | Reduction |
+ |--------|----------|---------|-----------|
+ | **Vocabulary size** | 250,037 tokens | 32,768 tokens | **86.89%** |
+ | **Model size** | 278,043,648 params | 111,207,936 params | **60.00%** |
 
+ ![image](https://raw.githubusercontent.com/lbourdois/blog/refs/heads/master/assets/images/Trimming/me5-base-32768.png)
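
As a quick sanity check on the Model Statistics table, the reduction percentages can be recomputed from the raw counts. The sketch below is not part of the commit: `reduction_pct` is a hypothetical helper, and the last step assumes a hidden size of 768 for the base model and that every removed parameter is an embedding row, neither of which the card states.

```python
def reduction_pct(original: int, trimmed: int) -> float:
    """Return the percentage by which `trimmed` is smaller than `original`."""
    return 100.0 * (original - trimmed) / original


# Counts copied from the Model Statistics table.
vocab_original, vocab_trimmed = 250_037, 32_768
params_original, params_trimmed = 278_043_648, 111_207_936

vocab_reduction = reduction_pct(vocab_original, vocab_trimmed)
param_reduction = reduction_pct(params_original, params_trimmed)
print(f"vocabulary: {vocab_reduction:.2f}%  parameters: {param_reduction:.2f}%")

# Assumption (not stated in the card): hidden size 768, and all removed
# parameters are rows of the token embedding matrix.
HIDDEN_SIZE = 768
rows_removed, remainder = divmod(params_original - params_trimmed, HIDDEN_SIZE)
implied_original_rows = rows_removed + vocab_trimmed
print(f"implied original embedding rows: {implied_original_rows}")
```

Under those assumptions the removed parameters divide evenly into embedding rows, and the implied original embedding matrix has 250,002 rows. That is the figure quoted in the removed lines rather than the 250,037 shown in the table; the card does not explain the difference.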
 
 
 
 
+ ## Mining Dataset Statistics
+ - **Number of texts used for mining**: 200,000 texts
+ - **Dataset**: [lbourdois/fineweb-2-trimming](https://huggingface.co/datasets/lbourdois/fineweb-2-trimming)
 
  ## Usage
  ```python
+ from sentence_transformers import SentenceTransformer
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("alphaedge-ai/multilingual-e5-base-dan-32768")
+ # Run inference with queries and documents
+ query = "My query in Danish"
+ documents = [
+ "Chunk in Danish",
+ "Chunk in Danish",
+ "Chunk in Danish",
+ ]
+ query_embeddings = model.encode_query(query)
+ document_embeddings = model.encode_document(documents)
+ print(query_embeddings.shape, document_embeddings.shape)
+ # Compute similarities to determine a ranking
+ similarities = model.similarity(query_embeddings, document_embeddings)
+ print(similarities)
+ ```
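
The last two lines of the usage snippet score each document against the query. As a rough illustration of what such a score means, the sketch below computes cosine similarity by hand and ranks documents by it. It assumes the model is configured for cosine similarity (to my understanding the Sentence Transformers default, though a model can override it). The vectors are made-up three-dimensional stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper, not a library call.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the query and document embeddings (illustrative only).
query_embedding = [1.0, 0.0, 0.0]
document_embeddings = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction, different length
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

scores = [cosine_similarity(query_embedding, doc) for doc in document_embeddings]

# Document indices ordered from most to least similar.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

Because cosine similarity ignores vector length, the second toy document scores highest even though its magnitude differs from the query's.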
+
+ ## Citations
+
+ #### Multilingual E5
+ ```
+ @article{wang2024multilingual,
+ title={Multilingual E5 Text Embeddings: A Technical Report},
+ author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
+ journal={arXiv preprint arXiv:2402.05672},
+ year={2024}
+ }
+ ```
 
+ #### Trimming blog post
  ```
+ @misc{hf_blogpost_trimming,
+ title={Introduction to Trimming},
+ author={Loïck BOURDOIS and Tom AARSEN and Bram VANROY and Christopher AKIKI and Woojun JUNG and Manuel ROMERO and Prithiv SAKTHI},
+ year={2026},
+ url={https://huggingface.co/blog/lbourdois/introduction-to-trimming},
+ }
+ ```