---
license: mit
language:
- en
base_model:
- google/siglip-so400m-patch14-384
pipeline_tag: zero-shot-image-classification
tags:
- siglip
- Int8
---

# SigLIP (shape-optimized model)

SigLIP model pre-trained on WebLi at resolution 384x384. It was introduced in the paper Sigmoid Loss for Language Image Pre-Training by Zhai et al. and first released in this repository.

The original model repository is https://huggingface.co/google/siglip-so400m-patch14-384.

This version of SigLIP has been converted to run on the Axera NPU using **w8a16** quantization.

This model has been optimized with the following LoRA:

Compatible with Pulsar2 version: 3.4

## Conversion tool links

If you are interested in model conversion, you can export the axmodel yourself using:

- [The AXera Platform repo](https://github.com/AXERA-TECH/SigLIP.axera), which contains a detailed guide
- [Pulsar2: How to Convert ONNX to axmodel](https://pulsar2-docs.readthedocs.io/en/latest/pulsar2/introduction.html)

## Supported Platforms

- AX650
  - [M4N-Dock(爱芯派Pro)](https://wiki.sipeed.com/hardware/zh/maixIV/m4ndock/m4ndock.html)
  - [M.2 Accelerator card](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html)

| Models        | Raspberry Pi 5 (CPU only) | Intel i7-13700 | Raspberry Pi 5 + M.2 Card |
| ------------- | ------------------------- | -------------- | ------------------------- |
| Image Encoder | 8.3 s                     | 1.2 s          | 0.19 s                    |
| Text Encoder  | 1.3 s                     | 0.3 s          | 0.05 s                    |

## How to use

Download all files from this repository to the device.

```
(axcl) axera@raspberrypi:~/samples/siglip $ tree -L 2
.
├── 000000039769.jpg
├── ax650
│   ├── siglip_text_u16.axmodel
│   └── siglip_vision_u16_fcu8.axmodel
├── config.json
├── onnx
│   ├── siglip-so400m-patch14-384_text.onnx
│   └── siglip-so400m-patch14-384_vision.onnx
├── python
│   ├── inference_axmodel.py
│   ├── inference_onnx.py
│   └── requirements.txt
└── tokenizer
    ├── config.json
    ├── preprocessor_config.json
    ├── special_tokens_map.json
    ├── spiece.model
    ├── tokenizer_config.json
    └── tokenizer.json

5 directories, 15 files
```

### Python environment requirements

#### pyaxengine

https://github.com/AXERA-TECH/pyaxengine

```
wget https://github.com/AXERA-TECH/pyaxengine/releases/download/0.1.3rc0/axengine-0.1.3-py3-none-any.whl
pip install axengine-0.1.3-py3-none-any.whl
```

#### others

```
pip install -r python/requirements.txt
```

## Inputs

**Text**

```
"a photo of 2 cats", "a photo of 2 dogs"
```

**Image**

![](000000039769.jpg)

## Inference with AX650 Host, such as M4N-Dock(爱芯派Pro)

```
root@ax650:/mnt/qtang/inner/SigLIP.axera# python3 python/inference_axmodel.py
[INFO] Available providers: ['AxEngineExecutionProvider']
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Chip type: ChipType.MC50
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Engine version: 2.7.2a
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 3.4-dirty 739e2b35-dirty
Model loading time: 3.86 seconds
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 3.4-dirty 739e2b35-dirty
Model loading time: 3.22 seconds
Total model loading time: 7.08 seconds
Model inference time: 0.19 seconds
Model inference time: 0.05 seconds
Total inference time: 0.24 seconds
49.4% that image 0 is 'a photo of 2 cats'
root@ax650:/mnt/qtang/inner/SigLIP.axera#
```

## Inference with M.2 Accelerator card

[What is the M.2 Accelerator card?](https://axcl-docs.readthedocs.io/zh-cn/latest/doc_guide_hardware.html) This demo runs on a Raspberry Pi 5.

```
(axcl) axera@raspberrypi:~/samples/siglip $ python python/inference_axmodel.py
[INFO] Available providers: ['AXCLRTExecutionProvider']
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 3.4-dirty 739e2b35-dirty
Model loading time: 12.31 seconds
[INFO] Using provider: AXCLRTExecutionProvider
[INFO] SOC Name: AX650N
[INFO] VNPU type: VNPUType.DISABLED
[INFO] Compiler version: 3.4-dirty 739e2b35-dirty
Model loading time: 12.37 seconds
Total model loading time: 24.68 seconds
Model inference time: 0.19 seconds
Model inference time: 0.05 seconds
Total inference time: 0.24 seconds
52.5% that image 0 is 'a photo of 2 cats'
(axcl) axera@raspberrypi:~/samples/siglip $
```
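
The last line of each log reports a probability for a text prompt. As a rough illustration of where such a number comes from, here is a minimal sketch of SigLIP-style scoring, assuming the inference script follows the standard formulation from the paper: cosine similarity between the image and text embeddings, scaled and shifted by learned parameters, then passed through a sigmoid. The embeddings, `logit_scale`, and `logit_bias` values below are made up for illustration and are not outputs of the axmodel files, and `siglip_probabilities` is a hypothetical helper, not a function from this repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def siglip_probabilities(image_emb, text_embs, logit_scale, logit_bias):
    """Return one independent sigmoid probability per text prompt.

    logit_scale is assumed to be stored in log space, so it is
    exponentiated before use.
    """
    image_emb = l2_normalize(image_emb)
    probs = []
    for text_emb in text_embs:
        text_emb = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(image_emb, text_emb))
        logit = cosine * math.exp(logit_scale) + logit_bias
        probs.append(1.0 / (1.0 + math.exp(-logit)))
    return probs


# Toy values (assumptions), NOT real encoder outputs or learned parameters.
image_emb = [0.9, 0.1, 0.2]
text_embs = [[0.8, 0.2, 0.1], [-0.3, 0.9, 0.1]]
texts = ["a photo of 2 cats", "a photo of 2 dogs"]

probs = siglip_probabilities(
    image_emb, text_embs, logit_scale=math.log(10.0), logit_bias=-8.0
)
for text, p in zip(texts, probs):
    print(f"{100 * p:.1f}% that image 0 is '{text}'")
```

Because each prompt is scored with its own sigmoid rather than a softmax over all prompts, the probabilities are independent and need not sum to 100%.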