---
license: apache-2.0
datasets:
- frgfm/imagenette
language:
- en
base_model:
- google/siglip2-base-patch16-224
pipeline_tag: image-classification
library_name: transformers
tags:
- ImageNet
- SigLIP2
- Classifier
---

![3.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/XKcsM33R3XKl5JBBfQHNM.png)

# IMAGENETTE

> IMAGENETTE is an image classification model fine-tuned from the google/siglip2-base-patch16-224 vision-language encoder. It is trained to classify images into the 10 categories of the popular Imagenette dataset using the SiglipForImageClassification architecture.

> [!note]
> *SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features* https://arxiv.org/pdf/2502.14786

> [!note]
> *ImageNet Large Scale Visual Recognition Challenge* https://arxiv.org/pdf/1409.0575

```
Classification Report:
                  precision    recall  f1-score   support

           tench     0.9885    0.9834    0.9859       963
english springer     0.9843    0.9822    0.9832       955
 cassette player     0.9544    0.9486    0.9515       993
       chain saw     0.9257    0.8998    0.9125       858
          church     0.9654    0.9798    0.9726       941
     French horn     0.9757    0.9665    0.9711       956
   garbage truck     0.8883    0.9761    0.9301       961
        gas pump     0.9366    0.9044    0.9202       931
       golf ball     0.9925    0.9716    0.9819       951
       parachute     0.9821    0.9708    0.9764       960

        accuracy                         0.9590      9469
       macro avg     0.9593    0.9583    0.9586      9469
    weighted avg     0.9597    0.9590    0.9591      9469
```

![download.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/74PN9tMCvZIfg_qegVOa9.png)

---

## Label Space: 10 Classes

The model predicts one of the following image classes:

```
0: tench
1: english springer
2: cassette player
3: chain saw
4: church
5: French horn
6: garbage truck
7: gas pump
8: golf ball
9: parachute
```

---

## Install Dependencies

```bash
pip install -q transformers torch pillow gradio hf_xet
```

---

## Inference Code

```python
import gradio as gr
from transformers import AutoImageProcessor, SiglipForImageClassification
from PIL import Image
import torch

# Load model and processor
model_name = "prithivMLmods/IMAGENETTE"
model = SiglipForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

# Label mapping
id2label = {
    "0": "tench",
    "1": "english springer",
    "2": "cassette player",
    "3": "chain saw",
    "4": "church",
    "5": "French horn",
    "6": "garbage truck",
    "7": "gas pump",
    "8": "golf ball",
    "9": "parachute"
}

def classify_image(image):
    # Convert the incoming NumPy array to RGB and preprocess for SigLIP2
    image = Image.fromarray(image).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    # Map class probabilities to human-readable labels
    prediction = {
        id2label[str(i)]: round(probs[i], 3) for i in range(len(probs))
    }

    return prediction

# Gradio Interface
iface = gr.Interface(
    fn=classify_image,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(num_top_classes=3, label="Image Classification"),
    title="IMAGENETTE - SigLIP2 Classifier",
    description="Upload an image to classify it into one of 10 categories from the Imagenette dataset."
)

if __name__ == "__main__":
    iface.launch()
```

---

## Intended Use

IMAGENETTE is designed for:

* Educational purposes and model benchmarking.
* Demonstrating the performance of SigLIP2 on a small but diverse classification task.
* Fine-tuning workflows on vision-language models (a minimal sketch is included at the end of this card).
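
---

## Quick Check with `pipeline`

For a quicker check than the Gradio demo above, the generic `transformers` image-classification pipeline should also work with this checkpoint, since its config resolves to `SiglipForImageClassification`. This is a minimal sketch: the image path is a placeholder, and depending on how the checkpoint stores `id2label`, the returned labels may be the class names listed above or generic `LABEL_i` identifiers.

```python
from transformers import pipeline

# Generic image-classification pipeline; the model class is picked
# automatically from the checkpoint's config.
classifier = pipeline("image-classification", model="prithivMLmods/IMAGENETTE")

# "example.jpg" is a placeholder path; a URL or PIL image also works.
predictions = classifier("example.jpg", top_k=3)

# Each entry is a dict with "label" and "score" keys.
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```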
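
---

## Fine-Tuning Sketch

The last bullet above mentions fine-tuning workflows. Below is a minimal, illustrative sketch of how the base `google/siglip2-base-patch16-224` checkpoint could be fine-tuned on `frgfm/imagenette` with the `Trainer` API. It is not the training recipe used for this model: it assumes the dataset exposes `image`/`label` columns, a `"160px"` configuration, and `train`/`validation` splits, and the hyperparameters are placeholders.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoImageProcessor,
    SiglipForImageClassification,
    Trainer,
    TrainingArguments,
)

base_model = "google/siglip2-base-patch16-224"

# Assumed layout: "image" and "label" columns, "160px" config, train/validation splits.
dataset = load_dataset("frgfm/imagenette", "160px")
labels = dataset["train"].features["label"].names  # assumes "label" is a ClassLabel
id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in enumerate(labels)}

processor = AutoImageProcessor.from_pretrained(base_model)
model = SiglipForImageClassification.from_pretrained(
    base_model,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

def transform(batch):
    # Convert PIL images to the pixel_values tensor expected by SigLIP2.
    images = [img.convert("RGB") for img in batch["image"]]
    inputs = processor(images=images, return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

dataset = dataset.with_transform(transform)

def collate_fn(examples):
    pixel_values = torch.stack([ex["pixel_values"] for ex in examples])
    labels = torch.tensor([ex["labels"] for ex in examples])
    return {"pixel_values": pixel_values, "labels": labels}

args = TrainingArguments(
    output_dir="siglip2-imagenette",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
    remove_unused_columns=False,  # keep the raw "image" column for the transform
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=collate_fn,
)

trainer.train()
print(trainer.evaluate())
```

Because preprocessing runs as an on-the-fly transform, `remove_unused_columns=False` is required so the `Trainer` does not drop the raw `image` column before it reaches the transform.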