Model Card for BGE-M3 ONNX Int8

This is the ONNX version of the BAAI/BGE-M3 embedding model, quantized to int8.

Model Description

  • Developed by: Mahrad Hosseini
  • Model type: Embedding Model
  • License: Apache-2.0
  • Finetuned from model: BAAI/BGE-M3

Uses

  • Running with better performance on CPU
  • Running with better performance on low-end GPUs
  • Running on Edge Devices with low computational power
  • Running on servers that need low-latency responses
  • Running on devices with limited RAM

How to Get Started with the Model

Use the code below to get started with the model.

import numpy as np
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer


def mean_pooling(last_hidden_state, attention_mask):
    # last_hidden_state: [batch_size, seq_len, hidden_size]
    # attention_mask: [batch_size, seq_len]
    # Average the token embeddings over real (non-padding) tokens only.
    input_mask_expanded = np.expand_dims(attention_mask, -1).astype(np.float32)
    return np.sum(last_hidden_state * input_mask_expanded, axis=1) / np.clip(
        input_mask_expanded.sum(axis=1), a_min=1e-9, a_max=None
    )


# Better to use the original tokenizer (very lightweight)
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-m3")
model = ORTModelForFeatureExtraction.from_pretrained("MahradHosseini/bge-m3-onnx-int8")

questions = ["What is your opening hour?", "Where are your offices?"]

input_q = tokenizer(
    questions,
    padding=True,
    truncation=True,
    return_tensors="np"
)
print(f"Question input keys: {list(input_q.keys())}, shapes: {[v.shape for v in input_q.values()]}")

output_q = model(**input_q)
print(f"Question output keys: {list(output_q.keys())}, shapes: {[v.shape for v in output_q.values()]}")

question_embeddings = {
    "dense_vecs": mean_pooling(output_q["last_hidden_state"], input_q["attention_mask"]),
}
print(f"Embedded {len(question_embeddings['dense_vecs'])} questions")
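
Pooled dense vectors are usually compared with cosine similarity. The following is a minimal sketch of that step, not part of the original card: the helper names `l2_normalize` and `cosine_similarity_matrix` are hypothetical, and the small arrays are placeholders standing in for real pooled embeddings.

```python
import numpy as np


def l2_normalize(vectors, eps=1e-12):
    # Scale each row to unit length; eps guards against division by zero.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, eps, None)


def cosine_similarity_matrix(a, b):
    # Rows of a against rows of b: result has shape [len(a), len(b)].
    return l2_normalize(a) @ l2_normalize(b).T


# Placeholder vectors standing in for pooled embeddings (real ones are much wider).
query_vecs = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]], dtype=np.float32)
doc_vecs = np.array([[3.0, 0.0, 0.0], [0.0, 1.0, 1.0]], dtype=np.float32)

scores = cosine_similarity_matrix(query_vecs, doc_vecs)
best_match = scores.argmax(axis=1)  # index of the closest document per query
print(scores)
print(best_match)
```

With real data, `query_vecs` and `doc_vecs` would be the arrays returned by `mean_pooling` for the questions and for the candidate passages.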

Conversion Details

Hugging Face Optimum was used to export the base model to ONNX and then to quantize it to int8.

0. Start from a fresh environment and install the dependencies

pip install "optimum[exporters]" onnx onnxruntime

1. Export to ONNX without graph optimisation

optimum-cli export onnx -m BAAI/bge-m3 bge-m3-onnx

2. Int8 dynamic quantisation, AVX2 preset, per-channel

optimum-cli onnxruntime quantize \
  --onnx_model bge-m3-onnx \
  --avx2 --per_channel \
  -o bge-m3-int8
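
To give a feel for what "per-channel" int8 quantisation means, here is a small NumPy sketch. It is an illustration under simplifying assumptions (symmetric scaling, weights only, one scale per row), not ONNX Runtime's actual implementation, which also handles zero points and, in dynamic mode, computes activation scales at inference time. The function names and the random weight matrix are hypothetical.

```python
import numpy as np


def quantize_per_channel(weights):
    # Symmetric int8 quantization with one scale per output channel (row).
    max_abs = np.abs(weights).max(axis=1, keepdims=True)
    scales = np.where(max_abs == 0, 1.0, max_abs / 127.0).astype(np.float32)
    q = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return q, scales


def dequantize(q, scales):
    # Map int8 values back to approximate float32 weights.
    return q.astype(np.float32) * scales


rng = np.random.default_rng(0)
# Hypothetical weight matrix whose rows have very different magnitudes.
row_magnitudes = np.array([[0.01], [0.1], [1.0], [10.0]], dtype=np.float32)
weights = rng.standard_normal((4, 8)).astype(np.float32) * row_magnitudes

q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)
max_error = np.abs(weights - restored).max(axis=1)

print(q.dtype, q.nbytes, weights.nbytes)  # int8 storage is 4x smaller than float32
print(max_error)  # the rounding error per row stays within about half that row's scale
```

Because each row gets its own scale, rows with small weights are not forced onto the coarse grid needed by rows with large weights, which is the motivation for a per-channel setting over a single per-tensor scale.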
