Model Card for BGE-M3 ONNX Int8

This is an ONNX export of the BAAI/BGE-M3 embedding model, quantized to int8.

Model Description

Developed by: Mahrad Hosseini
Model type: Embedding Model
License: Apache-2.0
Quantized from model: BAAI/BGE-M3

Uses

Faster inference on CPUs
Faster inference on low-end GPUs
Running on edge devices with low computational power
Running on servers with low-latency requirements
Running on devices with limited RAM

How to Get Started with the Model

Use the code below to get started with the model.

import numpy as np
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer


def mean_pooling(last_hidden_state, attention_mask):
    # last_hidden_state: [batch_size, seq_len, hidden_size]
    # attention_mask: [batch_size, seq_len]
    # Average the token vectors, ignoring padded positions.
    input_mask_expanded = np.expand_dims(attention_mask, -1).astype(np.float32)
    return np.sum(last_hidden_state * input_mask_expanded, axis=1) / np.clip(
        input_mask_expanded.sum(axis=1), a_min=1e-9, a_max=None
    )


# Better to use the original tokenizer (very lightweight)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = ORTModelForFeatureExtraction.from_pretrained("MahradHosseini/bge-m3-onnx-int8")

questions = ["What are your opening hours?", "Where are your offices?"]

# Tokenize to NumPy arrays so the ONNX model can consume them directly
input_q = tokenizer(
    questions,
    padding=True,
    truncation=True,
    return_tensors="np"
)
print(f"Question input keys: {list(input_q.keys())}, shapes: {[v.shape for v in input_q.values()]}")

output_q = model(**input_q)
print(f"Question output keys: {list(output_q.keys())}, shapes: {[v.shape for v in output_q.values()]}")

question_embeddings = {
    "dense_vecs": mean_pooling(output_q["last_hidden_state"], input_q["attention_mask"]),
}
print(f"Embedded {len(question_embeddings['dense_vecs'])} questions")
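
Once the pooled vectors are available, a typical next step is to compare them. The sketch below is a minimal illustration, not part of the original card: it assumes the embeddings are compared by cosine similarity (a common choice that this card does not state), and it uses random arrays as a stand-in for the model output so that it runs without downloading anything. The helper names `l2_normalize` and `cosine_similarity_matrix` are hypothetical.

```python
import numpy as np


def mean_pooling(last_hidden_state, attention_mask):
    # Average token vectors over the sequence axis, ignoring padded positions.
    mask = np.expand_dims(attention_mask, -1).astype(np.float32)
    summed = np.sum(last_hidden_state * mask, axis=1)
    counts = np.clip(mask.sum(axis=1), a_min=1e-9, a_max=None)
    return summed / counts


def l2_normalize(vectors):
    # Scale each row to unit length (hypothetical helper, not from the card).
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, a_min=1e-12, a_max=None)


def cosine_similarity_matrix(a, b):
    # Cosine similarity between every row of a and every row of b.
    return l2_normalize(a) @ l2_normalize(b).T


# Synthetic stand-in for output_q["last_hidden_state"]: 2 texts, 4 tokens, 8 dims.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8)).astype(np.float32)
# Attention mask: the first text has 3 real tokens, the second has 2.
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

emb = mean_pooling(hidden, mask)
scores = cosine_similarity_matrix(emb, emb)
print(scores.shape)
```

With real model output, `hidden` and `mask` would be replaced by the model's `last_hidden_state` and the tokenizer's `attention_mask`; each entry of `scores` is then the similarity between one pair of texts.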

Conversion Details

Hugging Face Optimum was used to convert the base model to ONNX and then quantize it to int8.

0. fresh env with the required packages

pip install "optimum[exporters]" onnx onnxruntime

1. export WITHOUT optimisation

optimum-cli export onnx -m BAAI/bge-m3 bge-m3-onnx

2. int8 dynamic quantisation, AVX2 preset, per-channel

optimum-cli onnxruntime quantize \
  --onnx_model bge-m3-onnx \
  --avx2 --per_channel \
  -o bge-m3-int8
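
To make the "per-channel" option in step 2 more concrete, here is a simplified NumPy sketch of symmetric int8 weight quantization with one scale per output channel. It is an illustration of the general idea only and does not reproduce ONNX Runtime's actual quantization scheme; the function names and the random weight matrix are hypothetical.

```python
import numpy as np


def quantize_per_channel(weights):
    # One scale per row (output channel), so a channel with a large range
    # does not reduce the precision of the other channels.
    max_abs = np.max(np.abs(weights), axis=1, keepdims=True)
    scales = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales


def dequantize(q, scales):
    # Map int8 values back to approximate float32 weights.
    return q.astype(np.float32) * scales


# Hypothetical weight matrix: 4 output channels, 16 inputs.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 16)).astype(np.float32)
weights[0] *= 50.0  # one channel with a much larger value range

q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)

# Worst-case rounding error per channel is about half of that channel's scale.
max_error = np.max(np.abs(weights - restored), axis=1)
print(scales.ravel())
print(max_error)
```

With a single scale for the whole matrix, the large first channel would set the step size for every row; per-channel scales keep the error of the smaller channels proportional to their own range. In dynamic quantization, weights are quantized ahead of time like this, while activation ranges are computed at inference time.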
