---
license: mit
language:
- en
base_model:
- google/siglip-so400m-patch14-384
pipeline_tag: zero-shot-image-classification
tags:
- siglip
- Int8
---

# SigLIP (shape-optimized model)

SigLIP model pre-trained on WebLI at a resolution of 384x384. It was introduced in the paper *Sigmoid Loss for Language Image Pre-Training* by Zhai et al. and first released in the [big_vision](https://github.com/google-research/big_vision) repository.

The original model repository is https://huggingface.co/google/siglip-so400m-patch14-384.

This version of SigLIP has been converted to run on the Axera NPU using **w8a16** quantization.
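
For readers unfamiliar with the notation, w8a16 is commonly read as 8-bit weights with 16-bit activations. The sketch below illustrates only the weight side, using symmetric per-tensor int8 quantization in plain Python. It is an illustration of the general idea under that assumption, not Pulsar2's actual algorithm, and the function names and sample weights are hypothetical.

```python
# Illustrative only: symmetric per-tensor int8 weight quantization.
# This is NOT Pulsar2's algorithm; all names and values are hypothetical.


def quantize_weights_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_weights(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.05, 0.90, -0.33]
quantized, scale = quantize_weights_int8(weights)
restored = dequantize_weights(quantized, scale)

# The rounding error per weight is bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"scale={scale:.4f}", f"max_error={max_error:.4f}")
```

Keeping activations at 16 bits leaves more headroom for intermediate values than an all-8-bit scheme would, which is the usual motivation for this kind of mixed precision.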

Compatible with Pulsar2 version: 3.4

## Conversion tool links

If you are interested in model conversion, you can export the axmodel yourself using the following resources:

- [The AXera platform repository](https://github.com/AXERA-TECH/SigLIP.axera), which contains the detailed conversion guide
- [Pulsar2 documentation: how to convert ONNX to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/pulsar2/introduction.html)


## Supported platforms

- AX650
  - [M4N-Dock (AXera-Pi Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)
  
 
Inference time per encoder:

| Model         | Raspberry Pi 5 (CPU only) | Intel i7-13700 | Raspberry Pi 5 + M.2 card |
| ------------- | ------------------------- | -------------- | ------------------------- |
| Image Encoder | 8.3 s                     | 1.2 s          | 0.19 s                    |
| Text Encoder  | 1.3 s                     | 0.3 s          | 0.05 s                    |
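
To put the timings in perspective, the short sketch below computes the speedup of the M.2 card over the two CPU-only setups. The latencies are copied from the table; the dictionary keys are just illustrative labels.

```python
# Latencies in seconds, copied from the table above.
# The keys (pi5_cpu, i7_13700, pi5_m2_card) are illustrative labels only.
latency_s = {
    "Image Encoder": {"pi5_cpu": 8.3, "i7_13700": 1.2, "pi5_m2_card": 0.19},
    "Text Encoder": {"pi5_cpu": 1.3, "i7_13700": 0.3, "pi5_m2_card": 0.05},
}


def speedup(baseline_s, accelerated_s):
    """Return how many times faster the accelerated run is."""
    return baseline_s / accelerated_s


speedups = {
    model: {
        host: speedup(seconds, times["pi5_m2_card"])
        for host, seconds in times.items()
        if host != "pi5_m2_card"
    }
    for model, times in latency_s.items()
}

for model, per_host in speedups.items():
    for host, factor in per_host.items():
        print(f"{model}: M.2 card is {factor:.1f}x faster than {host}")
```

On these numbers the card is roughly 44x (image encoder) and 26x (text encoder) faster than the Raspberry Pi 5 CPU alone, and about 6x faster than the i7-13700 for both encoders.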

## How to use

Download all files from this repository to the device. The resulting directory layout should look like this:

```
(axcl) axera@raspberrypi:~/samples/siglip $ tree -L 2
.
├── 000000039769.jpg
├── ax650
│   ├── siglip_text_u16.axmodel
│   └── siglip_vision_u16_fcu8.axmodel
├── config.json
├── onnx
│   ├── siglip-so400m-patch14-384_text.onnx
│   └── siglip-so400m-patch14-384_vision.onnx
├── python
│   ├── inference_axmodel.py
│   ├── inference_onnx.py
│   └── requirements.txt
└── tokenizer
    ├── config.json
    ├── preprocessor_config.json
    ├── special_tokens_map.json
    ├── spiece.model
    ├── tokenizer_config.json
    └── tokenizer.json

5 directories, 15 files
```
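
Before running inference, it can help to confirm that the download is complete. The following is a minimal sketch that checks for the 15 files in the listing above; the helper name `find_missing_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# Relative paths taken from the `tree` listing above.
EXPECTED_FILES = [
    "000000039769.jpg",
    "ax650/siglip_text_u16.axmodel",
    "ax650/siglip_vision_u16_fcu8.axmodel",
    "config.json",
    "onnx/siglip-so400m-patch14-384_text.onnx",
    "onnx/siglip-so400m-patch14-384_vision.onnx",
    "python/inference_axmodel.py",
    "python/inference_onnx.py",
    "python/requirements.txt",
    "tokenizer/config.json",
    "tokenizer/preprocessor_config.json",
    "tokenizer/special_tokens_map.json",
    "tokenizer/spiece.model",
    "tokenizer/tokenizer_config.json",
    "tokenizer/tokenizer.json",
]


def find_missing_files(root="."):
    """Return the expected files that are not present under `root`."""
    root_path = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root_path / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_files(".")
    if missing:
        print("Missing files:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected files are present.")
```

Run it from the directory that contains `ax650/`, `onnx/`, `python/`, and `tokenizer/`.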

### Python environment requirements

#### pyaxengine

Install pyaxengine from https://github.com/AXERA-TECH/pyaxengine:

```
wget https://github.com/AXERA-TECH/pyaxengine/releases/download/0.1.3rc0/axengine-0.1.3-py3-none-any.whl
pip install axengine-0.1.3-py3-none-any.whl
```

#### Other dependencies

```
pip install -r python/requirements.txt
```
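
A quick way to confirm that the environment is ready is to check that the `axengine` module can be found. The sketch below uses only the standard library and does not load any model; the helper name `is_installed` is hypothetical.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("axengine"):
    print("axengine is importable")
else:
    print("axengine not found; install the wheel shown above")
```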

## Inputs

**Text**
```
"a photo of 2 cats", "a photo of 2 dogs"
```

**Image**
![](000000039769.jpg)

## Inference on an AX650 host, such as the M4N-Dock (AXera-Pi Pro)

```
root@ax650:/mnt/qtang/inner/SigLIP.axera# python3 python/inference_axmodel.py
[INFO] Available providers:  ['AxEngineExecutionProvider']
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.7.2a
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 3.4-dirty 739e2b35-dirty
Model loading time: 3.86 seconds
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 3.4-dirty 739e2b35-dirty
Model loading time: 3.22 seconds
Total model loading time: 7.08 seconds
Model inference time: 0.19 seconds
Model inference time: 0.05 seconds
Total inference time: 0.24 seconds
49.4% that image 0 is 'a photo of 2 cats'
root@ax650:/mnt/qtang/inner/SigLIP.axera# 
```

## Inference with the M.2 accelerator card

[What is the M.2 accelerator card?](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) This demo runs on a Raspberry Pi 5.

```
(axcl) axera@raspberrypi:~/samples/siglip $ python python/inference_axmodel.py
[INFO] Available providers:  ['AXCLRTExecutionProvider']
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 3.4-dirty 739e2b35-dirty
Model loading time: 12.31 seconds
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 3.4-dirty 739e2b35-dirty
Model loading time: 12.37 seconds
Total model loading time: 24.68 seconds
Model inference time: 0.19 seconds
Model inference time: 0.05 seconds
Total inference time: 0.24 seconds
52.5% that image 0 is 'a photo of 2 cats'
(axcl) axera@raspberrypi:~/samples/siglip $ 
```
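
The percentage in the last line of each log is consistent with SigLIP's sigmoid scoring, in which every image-text pair gets an independent probability rather than a softmax share across prompts. The sketch below assumes that `inference_axmodel.py` applies a sigmoid to the image-text logit, as the upstream SigLIP example does; the logit values are made up for illustration and are not outputs of the model.

```python
import math


def sigmoid(logit_value):
    """Convert an image-text logit into an independent probability."""
    return 1.0 / (1.0 + math.exp(-logit_value))


def logit(probability):
    """Inverse of sigmoid: recover the logit behind a reported probability."""
    return math.log(probability / (1.0 - probability))


# The two prompts from the "Inputs" section, with hypothetical logits.
texts = ["a photo of 2 cats", "a photo of 2 dogs"]
example_logits = [0.1, -4.0]

for text, value in zip(texts, example_logits):
    print(f"{sigmoid(value):.1%} that image 0 is '{text}'")

# Logits implied by the probabilities reported in the two logs above.
for reported in (0.494, 0.525):
    print(f"p={reported:.3f} -> logit={logit(reported):+.3f}")
```

Under that assumption, the 49.4% (AX650 host) and 52.5% (M.2 card) results both correspond to logits close to zero, where a small change in the logit moves the reported score by a few percentage points.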