---
library_name: transformers
license: apache-2.0
base_model: google/vit-base-patch16-224
tags:
- image-classification
- cifar10
- computer-vision
- vision-transformer
- transfer-learning
metrics:
- accuracy
model-index:
- name: vit-base-cifar10-augmented
  results:
  - task:
      type: image-classification
      name: Image Classification
    dataset:
      name: CIFAR-10
      type: cifar10
    metrics:
      - type: accuracy
        value: 0.9554
---

# vit-base-cifar10-augmented

This model is a fine-tuned version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224) on the [CIFAR-10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) using data augmentation.

It achieves the following results (see the per-epoch table below):
- **Final training loss:** 0.0445 (epoch 10)
- **Best test accuracy:** 95.54% (epoch 3)
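
A minimal inference sketch using the `transformers` pipeline. The repo id below is a placeholder for wherever this checkpoint is hosted; the saved image processor handles the 224×224 resize and normalization automatically:

```python
from transformers import pipeline

# Placeholder repo id -- substitute the actual Hub path of this checkpoint.
classifier = pipeline(
    "image-classification",
    model="<user>/vit-base-cifar10-augmented",
)

preds = classifier("path/to/image.png")  # also accepts a PIL.Image or a URL
print(preds)  # e.g. [{"label": "cat", "score": 0.98}, ...]
```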

## 🧠 Model Description

The base model is a Vision Transformer (ViT) originally trained on ImageNet-21k. This version has been fine-tuned on CIFAR-10, a standard image classification benchmark, using PyTorch and Hugging Face Transformers.

Training used extensive **data augmentation** (random crops, flips, rotations, and color jitter) to improve generalization on CIFAR-10's small 32×32 inputs, which are resized to 224×224 for the ViT.

## ✅ Intended Uses & Limitations

### Intended uses
- Educational and research use on small image classification tasks
- Benchmarking transfer learning for ViT on CIFAR-10
- Demonstrating the impact of data augmentation on fine-tuning performance

### Limitations
- Not optimized for real-time inference
- Fine-tuned only on CIFAR-10; not suitable for general-purpose image classification
- Requires resized input (224×224)

## 📦 Training and Evaluation Data

- **Dataset**: [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html)
- **Size**: 60,000 images (10 classes)
- **Split**: 75% training, 25% test
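
Note that 75/25 differs from CIFAR-10's standard 50,000/10,000 split, so the data was presumably pooled and re-split. A plausible reconstruction with 🤗 Datasets (the seed is an assumption):

```python
from datasets import load_dataset, concatenate_datasets

# Pool the standard train/test splits, then re-split 75/25.
full = concatenate_datasets([
    load_dataset("cifar10", split="train"),
    load_dataset("cifar10", split="test"),
])
splits = full.train_test_split(test_size=0.25, seed=42)  # seed assumed
train_ds, test_ds = splits["train"], splits["test"]
```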

All images were resized to 224×224 and normalized using ViT’s original mean/std values.
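
For evaluation, that preprocessing can be sketched with torchvision; reading the mean/std from the base model's image processor (0.5 per channel for this ViT) avoids hard-coding them:

```python
from torchvision import transforms
from transformers import ViTImageProcessor

processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")

eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=processor.image_mean, std=processor.image_std),
])
```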

## ⚙️ Training Procedure

### Hyperparameters

- Learning rate: `1e-4`
- Optimizer: `Adam`
- Batch size: `8`
- Epochs: `10`
- Scheduler: `ReduceLROnPlateau`
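
A sketch of how these hyperparameters map to PyTorch. Replacing the 1,000-class ImageNet head with a 10-class one and stepping the plateau scheduler on test accuracy (`mode="max"`) are assumptions:

```python
import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224",
    num_labels=10,
    ignore_mismatched_sizes=True,  # swap the 1000-class ImageNet head for 10 classes
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")
# After each epoch: scheduler.step(test_accuracy)
```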

### Data Augmentation Used
- `RandomResizedCrop(224)`
- `RandomHorizontalFlip()`
- `RandomRotation(10)`
- `ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1)`
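
Composed as a torchvision training pipeline (placing the augmentations before tensor conversion and normalization is an assumption about the original setup):

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # ViT defaults, assumed
])
```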

### Training Results

| Epoch | Training Loss | Test Accuracy |
|-------|---------------|---------------|
| 1     | 0.1969        | 94.62%        |
| 2     | 0.1189        | 95.05%        |
| 3     | 0.0899        | **95.54%**    |
| 4     | 0.0720        | 94.68%        |
| 5     | 0.0650        | 94.84%        |
| 6     | 0.0576        | 94.76%        |
| 7     | 0.0560        | 95.33%        |
| 8     | 0.0488        | 94.31%        |
| 9     | 0.0499        | 95.42%        |
| 10    | 0.0445        | 94.33%        |

## 🧪 Framework Versions

- `transformers`: 4.50.0  
- `torch`: 2.6.0+cu124  
- `datasets`: 3.4.1  
- `tokenizers`: 0.21.1