Matthijs Hollemans committed
Commit · cdc6976
1 Parent(s): 3caac63
add basic usage instructions
README.md
CHANGED
@@ -13,3 +13,52 @@ datasets:
Vision Transformer (ViT) model pre-trained on ImageNet-21k (14 million images, 21,843 classes) at resolution 224x224, and fine-tuned on ImageNet 2012 (1 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929) by Dosovitskiy et al. and first released in [this repository](https://github.com/google-research/vision_transformer). However, the weights were converted from the [timm repository](https://github.com/rwightman/pytorch-image-models) by Ross Wightman, who already converted the weights from JAX to PyTorch. Credits go to him.

This repo contains a Core ML version of [google/vit-base-patch16-224](https://huggingface.co/google/vit-base-patch16-224).

## Usage instructions

Create a `VNCoreMLRequest` that loads the ViT model:

```swift
import CoreML
import Vision

lazy var classificationRequest: VNCoreMLRequest = {
    do {
        let config = MLModelConfiguration()
        config.computeUnits = .all

        // ViT is the class that Xcode auto-generates for the Core ML model file.
        let coreMLModel = try ViT(configuration: config)
        let visionModel = try VNCoreMLModel(for: coreMLModel.model)

        let request = VNCoreMLRequest(model: visionModel, completionHandler: { [weak self] request, error in
            if let results = request.results as? [VNClassificationObservation] {
                /* do something with the results */
            }
        })

        // Vision resizes and center-crops the input to the 224x224 pixels the model expects.
        request.imageCropAndScaleOption = .centerCrop
        return request
    } catch {
        fatalError("Failed to create VNCoreMLModel: \(error)")
    }
}()
```
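
The completion handler above leaves the result handling as a placeholder. One possible way to fill it in (a minimal sketch; the `processClassifications(_:)` helper is hypothetical and not part of this repo) is to read the top-scoring observations, which Vision returns sorted by confidence:

```swift
// Hypothetical helper, called with `request.results` from the completion handler above.
func processClassifications(_ results: [VNClassificationObservation]) {
    guard let best = results.first else {
        print("No classifications returned")
        return
    }
    // `identifier` is the predicted class label, `confidence` lies in 0...1.
    print("Top prediction: \(best.identifier) (\(best.confidence))")

    // Or inspect the five most likely classes.
    for observation in results.prefix(5) {
        print("\(observation.identifier): \(observation.confidence)")
    }
}
```

Note that the completion handler runs on whatever queue performed the request, so dispatch back to the main queue before updating any UI from it.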

Perform the request:

```swift
func classify(image: UIImage) {
    guard let ciImage = CIImage(image: image) else {
        print("Unable to create CIImage")
        return
    }

    // Run the request on a background queue so the main thread stays responsive.
    DispatchQueue.global(qos: .userInitiated).async {
        let handler = VNImageRequestHandler(ciImage: ciImage, orientation: .up)
        do {
            try handler.perform([self.classificationRequest])
        } catch {
            print("Failed to perform classification: \(error)")
        }
    }
}
```
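
As a usage example (a sketch, not part of the original instructions), `classify(image:)` could be called from an image picker delegate, assuming that it and `classificationRequest` live on a hypothetical `ViewController`:

```swift
import UIKit

// Hypothetical call site: ViewController is assumed to own both
// `classificationRequest` and `classify(image:)` from the snippets above.
extension ViewController: UIImagePickerControllerDelegate, UINavigationControllerDelegate {
    func imagePickerController(_ picker: UIImagePickerController,
                               didFinishPickingMediaWithInfo info: [UIImagePickerController.InfoKey: Any]) {
        picker.dismiss(animated: true)
        if let image = info[.originalImage] as? UIImage {
            classify(image: image)
        }
    }
}
```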