---
base_model:
- Qwen/Qwen2.5-Coder-1.5B
license: cc-by-nc-4.0
---

<br><br>

<p align="center">
<img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
</p>

<p align="center">
<b>The code embedding model trained by <a href="https://jina.ai/">Jina AI</a>.</b>
</p>

# Jina Embeddings c1: A Small but Performant Code Embedding Model

## Intended Usage & Model Info

`jina-embeddings-c1` is an embedding model for code retrieval. It supports several retrieval tasks (text-to-code, code-to-code, code-to-text, and code-to-completion) as well as technical question answering, across 15+ programming languages.

Built on [Qwen/Qwen2.5-Coder-1.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-1.5B), `jina-embeddings-c1-1.5B` features:

- **Multilingual support** (15+ programming languages) across a wide range of domains, including web development, software development, machine learning, data science, and educational coding problems.
- **Task-specific instruction prefixes** for NL2Code, Code2Code, Code2NL, Code2Completion, and Technical QA, selectable at inference time.
- **Flexible embedding size**: dense embeddings are 1536-dimensional by default but can be truncated (Matryoshka-style) to as few as 128 dimensions with minimal performance loss.

Summary of features:

| Feature | Jina Embeddings C1 1.5B |
|------------|------------|
| Base Model | Qwen2.5-Coder-1.5B |
| Supported Tasks | `nl2code`, `code2code`, `code2nl`, `code2completion`, `qa` |
| Model DType | BFloat16 |
| Max Sequence Length | 32768 |
| Embedding Vector Dimension | 1536 |
| Matryoshka Dimensions | 128, 256, 512, 1024, 1536 |
| Pooling Strategy | Last-token pooling |
| Attention Mechanism | FlashAttention2 |

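The Matryoshka dimensions listed above mean that a full 1536-dimensional embedding can simply be sliced to a shorter prefix and re-normalized before computing similarities. A minimal sketch of that truncation step, using random placeholder arrays in place of real `model.encode` output:

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalize."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Placeholder embeddings standing in for model.encode(...) output.
rng = np.random.default_rng(0)
full = rng.normal(size=(2, 1536)).astype(np.float32)

small = truncate_embeddings(full, 128)
print(small.shape)  # (2, 128)
```

Re-normalizing after slicing keeps cosine similarity well-defined at the reduced dimension.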
## Usage

<details>
<summary>Requirements</summary>

The following Python packages are required:

- `transformers>=4.53.0`
- `torch>=2.7.1`

### Optional / Recommended
- **flash-attention**: Installing [flash-attention](https://github.com/Dao-AILab/flash-attention) is recommended for faster, more memory-efficient inference, but is not mandatory.
- **sentence-transformers**: To use the model via the `sentence-transformers` interface, install this package as well.
</details>

<details>
<summary>via <a href="https://huggingface.co/docs/transformers/en/index">transformers</a></summary>

```python
# !pip install transformers>=4.53.0 torch>=2.7.1

from transformers import AutoModel
import torch

# Initialize the model
model = AutoModel.from_pretrained("jinaai/jina-embeddings-c1-1.5B", trust_remote_code=True)
model.to("cuda")

# Configure truncate_dim, max_length, batch_size in the encode function if needed

# Encode query
query_embeddings = model.encode(
    ["print hello world in python"],
    task="nl2code",
    prompt_name="query",
)

# Encode passage
passage_embeddings = model.encode(
    ["print('Hello World!')"],
    task="nl2code",
    prompt_name="passage",
)
```
</details>
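The embeddings returned above can be compared with plain cosine similarity: L2-normalize each row, then take a matrix product. A sketch with random placeholder arrays standing in for the `model.encode` outputs:

```python
import numpy as np

# Placeholder arrays standing in for model.encode(...) outputs.
rng = np.random.default_rng(0)
query_embeddings = rng.normal(size=(1, 1536))
passage_embeddings = rng.normal(size=(2, 1536))

# L2-normalize rows, then a matrix product yields cosine similarities.
q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
p = passage_embeddings / np.linalg.norm(passage_embeddings, axis=1, keepdims=True)
scores = q @ p.T
print(scores.shape)  # (1, 2): one row per query, one column per passage
```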

<details>
<summary>via <a href="https://sbert.net/">sentence-transformers</a></summary>

```python
# !pip install sentence_transformers>=5.0.0 torch>=2.7.1

import torch
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer(
    "jinaai/jina-embeddings-c1-1.5B",
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "attn_implementation": "flash_attention_2",
        "device_map": "auto",
    },
)

# The queries and documents to embed
queries = [
    "print hello world in python",
    "initialize array of 5 zeros in c++",
]
documents = [
    "print('Hello World!')",
    "int arr[5] = {0, 0, 0, 0, 0};",
]

query_embeddings = model.encode(queries, prompt_name="nl2code_query")
document_embeddings = model.encode(documents, prompt_name="nl2code_document")

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.8157, 0.1222],
#         [0.1201, 0.5500]])
```
</details>
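Given a query-by-document similarity matrix like the one printed above, retrieval reduces to sorting each row by score. A small illustration, reusing the score values from that expected output:

```python
import numpy as np

# Similarity scores from the example above: rows = queries, cols = documents.
similarity = np.array([[0.8157, 0.1222],
                       [0.1201, 0.5500]])

# Rank document indices from most to least similar for each query.
ranking = np.argsort(-similarity, axis=1)
print(ranking)
# [[0 1]
#  [1 0]]
```

Each query correctly retrieves its matching snippet first: the Python query ranks the Python document highest, and the C++ query the C++ document.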

## Training & Evaluation

Please refer to the `jina-embeddings-c1` technical report for training details and benchmarks.

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.