jina-code-embeddings-1.5b / README.md

base_model:
  - Qwen/Qwen2.5-Coder-1.5B
license: cc-by-nc-4.0

The code embedding model trained by Jina AI.

Jina Code Embeddings: A Small but Performant Code Embedding Model

Intended Usage & Model Info

jina-code-embeddings is an embedding model for code retrieval. The model supports various types of code retrieval (text-to-code, code-to-code, code-to-text, code-to-completion) and technical question answering across 15+ programming languages.

Built on Qwen/Qwen2.5-Coder-1.5B, jina-code-embeddings-1.5b features:

- Multilingual support (15+ programming languages) and coverage of a wide range of domains, including web development, software development, machine learning, data science, and educational coding problems.
- Task-specific instruction prefixes for NL2Code, Code2Code, Code2NL, Code2Completion, and Technical QA, which can be selected at inference time.
- Flexible embedding size: dense embeddings are 1536-dimensional by default but can be truncated to as few as 128 dimensions with minimal performance loss.
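
To illustrate the flexible embedding size, here is a minimal sketch of how a Matryoshka-style embedding is typically shortened: keep the leading components, then rescale to unit length so cosine similarity stays meaningful. The helper names `truncate_and_normalize` and `cosine_similarity` are hypothetical, the vectors are toy data rather than model output, and the renormalization step is the common convention, not something this README confirms about the model's own `truncate_dim` handling.

```python
import math

# Dimensions listed in this README's feature summary.
MATRYOSHKA_DIMS = (128, 256, 512, 1024, 1536)


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}")
    if len(embedding) < dim:
        raise ValueError("embedding is shorter than the requested dimension")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale for an all-zero vector
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 1536-dimensional vectors standing in for real model output.
full_a = [math.sin(i) for i in range(1536)]
full_b = [math.sin(i + 0.1) for i in range(1536)]

small_a = truncate_and_normalize(full_a, 128)
small_b = truncate_and_normalize(full_b, 128)
print(len(small_a), round(cosine_similarity(small_a, small_b), 3))
```

Queries and documents need to be truncated to the same dimension before they are compared.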

Summary of features:

| Feature | Jina Code Embeddings 1.5B |
|---|---|
| Base Model | Qwen2.5-Coder-1.5B |
| Supported Tasks | `nl2code`, `code2code`, `code2nl`, `code2completion`, `qa` |
| Model DType | BFloat16 |
| Max Sequence Length | 32768 |
| Embedding Vector Dimension | 1536 |
| Matryoshka dimensions | 128, 256, 512, 1024, 1536 |
| Pooling Strategy | Last-token pooling |
| Attention Mechanism | FlashAttention2 |
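
The pooling strategy named in the table can be sketched in a few lines. Last-token pooling takes, for each sequence, the hidden state of the final non-padding token as the sequence embedding. The function below is a hypothetical plain-Python illustration on nested lists, not the model's own implementation, and it assumes an attention mask in which 1 marks real tokens and 0 marks padding.

```python
def last_token_pool(hidden_states, attention_mask):
    """Return, per sequence, the hidden state of its last non-padding token.

    hidden_states: list of sequences, each a list of per-token vectors.
    attention_mask: list of sequences of 0/1 flags (1 = real token).
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        real_positions = [i for i, flag in enumerate(mask) if flag == 1]
        if not real_positions:
            raise ValueError("sequence contains no real tokens")
        # Taking the highest real position works for left and right padding.
        pooled.append(states[real_positions[-1]])
    return pooled


# Two toy sequences with 2-dimensional hidden states; the first is right-padded.
hidden = [
    [[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]],
    [[3.0, 3.0], [4.0, 4.0], [5.0, 5.0]],
]
mask = [
    [1, 1, 0],
    [1, 1, 1],
]
print(last_token_pool(hidden, mask))  # [[2.0, 2.0], [5.0, 5.0]]
```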

Usage

Requirements

The following Python packages are required:

- transformers>=4.53.0
- torch>=2.7.1
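
A quick way to confirm that an environment meets these minimums is to read the installed versions with the standard library. This is a minimal sketch: `parse_version`, `meets_minimum`, and `check_requirements` are hypothetical helpers, and the parser only compares the leading numeric `major.minor.patch` part, so pre-release tags such as `rc1` are ignored.

```python
import re
from importlib import metadata

# Minimum versions stated in this README.
MINIMUM_VERSIONS = {"transformers": "4.53.0", "torch": "2.7.1"}


def parse_version(text):
    """Turn '2.7.1+cu126' into (2, 7, 1); suffixes after the numbers are ignored."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part or 0) for part in match.groups())


def meets_minimum(installed, minimum):
    """True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_requirements(minimums=MINIMUM_VERSIONS):
    """Return {package: (installed_version_or_None, ok)} for each requirement."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = (None, False)
            continue
        report[name] = (installed, meets_minimum(installed, minimum))
    return report


for name, (installed, ok) in check_requirements().items():
    status = "ok" if ok else "missing or too old"
    print(f"{name}: {installed} ({status})")
```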

Optional / Recommended

- flash-attention (the `flash-attn` package): recommended for faster, more memory-efficient inference, but not mandatory.
- sentence-transformers: required only if you want to use the model through the sentence-transformers interface.

via transformers

# !pip install "transformers>=4.53.0" "torch>=2.7.1"

from transformers import AutoModel
import torch

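# Initialize the model
model = AutoModel.from_pretrained("jinaai/jina-code-embeddings-1.5b", trust_remote_code=True)
# Use the GPU when one is available, otherwise fall back to the CPU
model.to("cuda" if torch.cuda.is_available() else "cpu")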

# Configure truncate_dim, max_length, batch_size in the encode function if needed

# Encode query
query_embeddings = model.encode(
    ["print hello world in python"],
    task="nl2code",
    prompt_name="query",
)

# Encode passage
passage_embeddings = model.encode(
    ["print('Hello World!')"],
    task="nl2code",
    prompt_name="passage",
)
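
The call above selects an instruction prefix with a `task` plus a `prompt_name` of `query` or `passage`, while the sentence-transformers example in this README uses a single combined name such as `nl2code_query` or `nl2code_document`. The sketch below maps between the two conventions. `combined_prompt_name` is a hypothetical helper, and only the two `nl2code` names appear in this README; the combined names it produces for the other tasks are an assumption that follows the same pattern and should be checked against the model's configuration.

```python
# Task identifiers listed in this README's feature summary.
TASKS = ("nl2code", "code2code", "code2nl", "code2completion", "qa")

# "passage" (transformers example) corresponds to "document" (sentence-transformers example).
ROLE_ALIASES = {"query": "query", "passage": "document", "document": "document"}


def combined_prompt_name(task, role):
    """Build a '<task>_<role>' prompt name, e.g. 'nl2code_query'."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    if role not in ROLE_ALIASES:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(ROLE_ALIASES)}")
    return f"{task}_{ROLE_ALIASES[role]}"


print(combined_prompt_name("nl2code", "query"))    # nl2code_query
print(combined_prompt_name("nl2code", "passage"))  # nl2code_document
```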

via sentence-transformers

# !pip install "sentence-transformers>=5.0.0" "torch>=2.7.1"

import torch
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer(
    "jinaai/jina-code-embeddings-1.5b",
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        # Requires the optional flash-attn package; remove this entry to use the default attention
        "attn_implementation": "flash_attention_2",
        "device_map": "auto"
    }
)

# The queries and documents to embed
queries = [
    "print hello world in python",
    "initialize array of 5 zeros in c++"
]
documents = [
    "print('Hello World!')",
    "int arr[5] = {0, 0, 0, 0, 0};"
]

query_embeddings = model.encode(queries, prompt_name="nl2code_query")
document_embeddings = model.encode(documents, prompt_name="nl2code_document")

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
# tensor([[0.8157, 0.1222],
#         [0.1201, 0.5500]])
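
For retrieval, each row of the similarity matrix is usually turned into a ranking of documents for that query. The sketch below does this with plain Python lists, reusing the queries, documents, and printed scores from the example above; `rank_documents` is a hypothetical helper, and a real pipeline would operate on the returned tensor instead.

```python
def rank_documents(similarity_row):
    """Return document indices ordered from most to least similar."""
    return sorted(
        range(len(similarity_row)),
        key=lambda i: similarity_row[i],
        reverse=True,
    )


queries = [
    "print hello world in python",
    "initialize array of 5 zeros in c++",
]
documents = [
    "print('Hello World!')",
    "int arr[5] = {0, 0, 0, 0, 0};",
]
# Scores copied from the printed tensor in the example above.
similarity = [
    [0.8157, 0.1222],
    [0.1201, 0.5500],
]

for query, row in zip(queries, similarity):
    best = rank_documents(row)[0]
    print(f"{query!r} -> {documents[best]!r} (score {row[best]:.4f})")

top_matches = [rank_documents(row)[0] for row in similarity]
print(top_matches)  # [0, 1]
```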

Training & Evaluation

Please refer to our technical report on jina-code-embeddings for training details and benchmarks.

Contact

Join our Discord community and chat with other community members about ideas.