---
library_name: transformers
license: apache-2.0
base_model: google/vit-base-patch16-224
tags:
- image-classification
- cifar10
- computer-vision
- vision-transformer
- transfer-learning
metrics:
- accuracy
model-index:
- name: vit-base-cifar10-augmented
results:
- task:
type: image-classification
name: Image Classification
dataset:
name: CIFAR-10
type: cifar10
metrics:
- type: accuracy
value: 0.9554
---
# vit-base-cifar10-augmented
This model is a fine-tuned version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) using data augmentation.
It achieves the following results (see the full per-epoch table below):
- **Best test accuracy:** 95.54% (epoch 3)
- **Final training loss:** 0.0445 (epoch 10)
## 🧠 Model Description
The base model is a Vision Transformer (ViT) originally trained on ImageNet-21k. This version has been fine-tuned on CIFAR-10, a standard image classification benchmark, using PyTorch and Hugging Face Transformers.
Training used extensive **data augmentation**, including random crops, flips, rotations, and color jitter, to improve generalization on CIFAR-10's small 32×32 images (resized to 224×224 for ViT).
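## 🚀 How to Use

A minimal inference sketch. The repo id below is a placeholder; substitute the actual Hub path of this checkpoint:

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

model_id = "your-username/vit-base-cifar10-augmented"  # placeholder repo id

processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForImageClassification.from_pretrained(model_id)
model.eval()

image = Image.open("example.png").convert("RGB")       # any RGB image
inputs = processor(images=image, return_tensors="pt")  # resizes to 224x224 and normalizes

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(-1).item()
print(model.config.id2label[pred])  # one of the 10 CIFAR-10 labels
```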
## ✅ Intended Uses & Limitations
### Intended uses
- Educational and research use on small image classification tasks
- Benchmarking transfer learning for ViT on CIFAR-10
- Demonstrating the impact of data augmentation on fine-tuning performance
### Limitations
- Not optimized for real-time inference
- Fine-tuned only on CIFAR-10; not suitable for general-purpose image classification
- Requires resized input (224×224)
## 📦 Training and Evaluation Data
- **Dataset**: [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html)
- **Size**: 60,000 images (10 classes, 6,000 per class)
- **Split**: 75% training / 25% test (45,000 / 15,000 images)
All images were resized to 224×224 and normalized using ViT’s original mean/std values.
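As a sketch, the evaluation-time preprocessing described above can be reproduced with torchvision, reading the mean/std from the base model's image processor rather than hard-coding them:

```python
from torchvision import transforms
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")

# Deterministic evaluation pipeline: resize, then normalize with the
# processor's mean/std (0.5 per channel for this ViT checkpoint).
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=processor.image_mean, std=processor.image_std),
])
```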
## ⚙️ Training Procedure
### Hyperparameters
- Learning rate: `1e-4`
- Optimizer: `Adam`
- Batch size: `8`
- Epochs: `10`
- Scheduler: `ReduceLROnPlateau`
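A minimal sketch of how these hyperparameters fit together. `model`, `train_one_epoch`, `evaluate`, and the data loaders are hypothetical names, and the scheduler's monitored metric, factor, and patience are assumptions rather than documented settings:

```python
import torch

# Assumes `model`, `train_loader`, and `test_loader` already exist.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=2  # factor/patience are assumptions
)

for epoch in range(10):
    train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    accuracy = evaluate(model, test_loader)          # hypothetical helper
    scheduler.step(accuracy)  # lower the LR when test accuracy stops improving
```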
### Data Augmentation Used
- `RandomResizedCrop(224)`
- `RandomHorizontalFlip()`
- `RandomRotation(10)`
- `ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)`
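Assembled into a single training pipeline, with normalization appended to match the preprocessing described earlier (the exact transform ordering used in training is an assumption):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # ViT processor defaults
])
```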
### Training Results
| Epoch | Training Loss | Test Accuracy |
|-------|---------------|---------------|
| 1 | 0.1969 | 94.62% |
| 2 | 0.1189 | 95.05% |
| 3 | 0.0899 | **95.54%** |
| 4 | 0.0720 | 94.68% |
| 5 | 0.0650 | 94.84% |
| 6 | 0.0576 | 94.76% |
| 7 | 0.0560 | 95.33% |
| 8 | 0.0488 | 94.31% |
| 9 | 0.0499 | 95.42% |
| 10 | 0.0445 | 94.33% |
## 🧪 Framework Versions
- `transformers`: 4.50.0
- `torch`: 2.6.0+cu124
- `datasets`: 3.4.1
- `tokenizers`: 0.21.1