---
license: apache-2.0
datasets:
- frgfm/imagenette
language:
- en
base_model:
- google/siglip2-base-patch16-224
pipeline_tag: image-classification
library_name: transformers
tags:
- ImageNet
- SigLIP2
- Classifier
---

![3.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/XKcsM33R3XKl5JBBfQHNM.png)

# IMAGENETTE

> IMAGENETTE is an image classification model fine-tuned from the google/siglip2-base-patch16-224 vision-language encoder. It is trained to classify images into the 10 categories of the popular Imagenette dataset using the SiglipForImageClassification architecture.

> [!note]
> *SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features* https://arxiv.org/pdf/2502.14786

> [!note]
> *ImageNet Large Scale Visual Recognition Challenge* https://arxiv.org/pdf/1409.0575

```
Classification Report:
                  precision    recall  f1-score   support

           tench     0.9885    0.9834    0.9859       963
english springer     0.9843    0.9822    0.9832       955
 cassette player     0.9544    0.9486    0.9515       993
       chain saw     0.9257    0.8998    0.9125       858
          church     0.9654    0.9798    0.9726       941
     French horn     0.9757    0.9665    0.9711       956
   garbage truck     0.8883    0.9761    0.9301       961
        gas pump     0.9366    0.9044    0.9202       931
       golf ball     0.9925    0.9716    0.9819       951
       parachute     0.9821    0.9708    0.9764       960

        accuracy                         0.9590      9469
       macro avg     0.9593    0.9583    0.9586      9469
    weighted avg     0.9597    0.9590    0.9591      9469
```

![download.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/74PN9tMCvZIfg_qegVOa9.png)

---

## Label Space: 10 Classes

The model predicts one of the following image classes:

```
0: tench
1: english springer
2: cassette player
3: chain saw
4: church
5: French horn
6: garbage truck
7: gas pump
8: golf ball
9: parachute
```

---

## Install Dependencies

```bash
pip install -q transformers torch pillow gradio hf_xet
```

---

## Inference Code

```python
import gradio as gr
from transformers import AutoImageProcessor, SiglipForImageClassification
from PIL import Image
import torch

# Load model and processor
model_name = "prithivMLmods/IMAGENETTE"
model = SiglipForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

# Label mapping
id2label = {
    "0": "tench",
    "1": "english springer",
    "2": "cassette player",
    "3": "chain saw",
    "4": "church",
    "5": "French horn",
    "6": "garbage truck",
    "7": "gas pump",
    "8": "golf ball",
    "9": "parachute"
}

def classify_image(image):
    # Convert the incoming NumPy array to RGB and preprocess for SigLIP2
    image = Image.fromarray(image).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    # Map class probabilities to human-readable labels
    prediction = {
        id2label[str(i)]: round(probs[i], 3) for i in range(len(probs))
    }

    return prediction

# Gradio Interface
iface = gr.Interface(
    fn=classify_image,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(num_top_classes=3, label="Image Classification"),
    title="IMAGENETTE - SigLIP2 Classifier",
    description="Upload an image to classify it into one of 10 categories from the Imagenette dataset."
)

if __name__ == "__main__":
    iface.launch()
```

---

## Intended Use

IMAGENETTE is designed for:

* Educational purposes and model benchmarking.
* Demonstrating the performance of SigLIP2 on a small but diverse classification task.
* Fine-tuning workflows on vision-language models (a minimal sketch is included at the end of this card).
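
---

## Quick Check with `pipeline`

For a quicker check than the Gradio demo above, the generic `transformers` image-classification pipeline should also work with this checkpoint, since its config resolves to `SiglipForImageClassification`. This is a minimal sketch: the image path is a placeholder, and depending on how the checkpoint stores `id2label`, the returned labels may be the class names listed above or generic `LABEL_i` identifiers.

```python
from transformers import pipeline

# Generic image-classification pipeline; the model class is picked
# automatically from the checkpoint's config.
classifier = pipeline("image-classification", model="prithivMLmods/IMAGENETTE")

# "example.jpg" is a placeholder path; a URL or PIL image also works.
predictions = classifier("example.jpg", top_k=3)

# Each entry is a dict with "label" and "score" keys.
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```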
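
---

## Fine-Tuning Sketch

The last bullet above mentions fine-tuning workflows. Below is a minimal, illustrative sketch of how the base `google/siglip2-base-patch16-224` checkpoint could be fine-tuned on `frgfm/imagenette` with the `Trainer` API. It is not the training recipe used for this model: it assumes the dataset exposes `image`/`label` columns, a `"160px"` configuration, and `train`/`validation` splits, and the hyperparameters are placeholders.

```python
import torch
from datasets import load_dataset
from transformers import (
    AutoImageProcessor,
    SiglipForImageClassification,
    Trainer,
    TrainingArguments,
)

base_model = "google/siglip2-base-patch16-224"

# Assumed layout: "image" and "label" columns, "160px" config, train/validation splits.
dataset = load_dataset("frgfm/imagenette", "160px")
labels = dataset["train"].features["label"].names  # assumes "label" is a ClassLabel
id2label = {i: name for i, name in enumerate(labels)}
label2id = {name: i for i, name in enumerate(labels)}

processor = AutoImageProcessor.from_pretrained(base_model)
model = SiglipForImageClassification.from_pretrained(
    base_model,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

def transform(batch):
    # Convert PIL images to the pixel_values tensor expected by SigLIP2.
    images = [img.convert("RGB") for img in batch["image"]]
    inputs = processor(images=images, return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

dataset = dataset.with_transform(transform)

def collate_fn(examples):
    pixel_values = torch.stack([ex["pixel_values"] for ex in examples])
    labels = torch.tensor([ex["labels"] for ex in examples])
    return {"pixel_values": pixel_values, "labels": labels}

args = TrainingArguments(
    output_dir="siglip2-imagenette",
    per_device_train_batch_size=32,
    num_train_epochs=3,
    learning_rate=5e-5,
    remove_unused_columns=False,  # keep the raw "image" column for the transform
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=collate_fn,
)

trainer.train()
print(trainer.evaluate())
```

Because preprocessing runs as an on-the-fly transform, `remove_unused_columns=False` is required so the `Trainer` does not drop the raw `image` column before it reaches the transform.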