jinaai/jina-embeddings-v3

@@ -66,7 +66,7 @@ language:
   - my
   - ne
   - nl
-  - 'no'
   - om
   - or
   - pa
@@ -160,37 +160,107 @@ The data and training details are described in the technical report (coming soon
 ## Usage
-1. The easiest way to starting using jina-clip-v1-en is to use Jina AI's [Embeddings API](https://jina.ai/embeddings/).
-2. Alternatively, you can use Jina CLIP directly via transformers package.
 ```python
-!pip install transformers einops flash_attn
 from transformers import AutoModel
 # Initialize the model
 model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)
-# New meaningful sentences
-sentences = [
-    "Organic skincare for sensitive skin with aloe vera and chamomile.",
-    "New makeup trends focus on bold colors and innovative techniques",
-    "Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille",
-    "Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken",
-    "Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla",
-    "Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras",
-    "针对敏感肌专门设计的天然有机护肤产品",
-    "新的化妆趋势注重鲜艳的颜色和创新的技巧",
-    "敏感肌のために特別に設計された天然有機スキンケア製品",
-    "新しいメイクのトレンドは鮮やかな色と革新的な技術に焦点を当てています",
 ]
-# Encode sentences
-embeddings = model.encode(sentences, truncate_dim=1024, task_type='index') # TODO UPDATE
 # Compute similarities
 print(embeddings[0] @ embeddings[1].T)
 ```
 ## Performance

   - my
   - ne
   - nl
+  - no
   - om
   - or
   - pa
 ## Usage
+**<details><summary>Apply mean pooling when integrating the model.</summary>**
+<p>
+### Why Use Mean Pooling?
+Mean pooling averages the token embeddings of each input (a sentence or a paragraph) into a single vector, ignoring padding positions.
+This approach has been shown to produce high-quality sentence embeddings.
+We provide an `encode` function that handles this for you automatically.
+However, if you're working with the model directly, outside of the `encode` function,
+you'll need to apply mean pooling manually. Here's how you can do it:
 ```python
+import torch
+import torch.nn.functional as F
+from transformers import AutoTokenizer, AutoModel
+def mean_pooling(model_output, attention_mask):
+    token_embeddings = model_output[0]
+    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+sentences = ['How is the weather today?', 'What is the current weather like today?']
+tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v3')
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)
+encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+with torch.no_grad():
+    model_output = model(**encoded_input)
+embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+embeddings = F.normalize(embeddings, p=2, dim=1)
+```
+</p>
+</details>
+The easiest way to start using `jina-embeddings-v3` is Jina AI's [Embeddings API](https://jina.ai/embeddings/).
+Alternatively, you can use `jina-embeddings-v3` directly via the Transformers package:
+```python
+!pip install transformers
 from transformers import AutoModel
 # Initialize the model
 model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)
+texts = [
+    'Follow the white rabbit.',              # English
+    'Sigue al conejo blanco.',               # Spanish
+    'Suis le lapin blanc.',                  # French
+    '跟着白兔走。',                            # Chinese
+    'اتبع الأرنب الأبيض.',                     # Arabic
+    'Folge dem weißen Kaninchen.'            # German
 ]
+# When calling the `encode` function, you can choose a `task_type` based on the use case:
+# 'retrieval.query', 'retrieval.passage', 'separation', 'classification', 'text-matching'
+# Alternatively, you can choose not to pass a `task_type`, and no specific LoRA adapter will be used.
+embeddings = model.encode(texts, task_type='text-matching')
 # Compute similarities
 print(embeddings[0] @ embeddings[1].T)
 ```
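The similarity line above compares only the first two texts. The following is a minimal sketch of an all-pairs comparison. It assumes each embedding can be treated as a plain list of floats, and it uses toy stand-in vectors rather than real model output. The helper names `cosine_similarity` and `similarity_matrix` are illustrative and not part of the model's API. A dot product equals cosine similarity only for unit-length vectors, so the sketch normalizes explicitly.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise cosine similarities as a list of rows."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy stand-in vectors (assumption: real embeddings would come from model.encode)
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 2.0],
]

sims = similarity_matrix(toy_embeddings)
for row in sims:
    print([round(s, 3) for s in row])
```

Each row lists how similar one text is to every other text, so the diagonal is always 1.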
+By default, the model supports a maximum sequence length of 8192 tokens.
+To truncate input texts to a shorter length, pass the `max_length` parameter to the `encode` function:
+```python
+embeddings = model.encode(
+    ['Very long ... document'],
+    max_length=2048
+)
+```
+To use **Matryoshka embeddings** with a smaller output dimension,
+pass the `truncate_dim` parameter to the `encode` function:
+```python
+embeddings = model.encode(
+    ['Sample text'],
+    truncate_dim=256
+)
+```
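Conceptually, a Matryoshka-style embedding can be shortened by keeping its leading components. The sketch below illustrates that idea on a toy vector. It is an assumption about the general technique, not a description of what `encode` does internally, and the re-normalization step is also an assumption. The helper name `truncate_and_normalize` is hypothetical.

```python
import math

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale for an all-zero prefix
    return [x / norm for x in head]

# Toy stand-in for a full-size embedding (not real model output)
full = [3.0, 4.0, 12.0, 0.5]
short = truncate_and_normalize(full, 2)
print(short)
```

Re-normalizing after truncation keeps dot products comparable to cosine similarities at the smaller dimension.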
+The latest version (#todo: specify version) of SentenceTransformers also supports `jina-embeddings-v3`:
+```python
+!pip install -U sentence-transformers
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer(
+    "jinaai/jina-embeddings-v3", trust_remote_code=True
+)
+embeddings = model.encode(['How is the weather today?'], task_type='retrieval.query')
+```
 ## Performance