Ross Wightman
committed on
Commit · e0a996a
1 Parent(s): 923c8d7
Update README add tokenizer/vocab/preprocessor cfg
- README.md +19 -8
- preprocessor_config.json +19 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- vocab.json +0 -0
README.md
CHANGED
@@ -6,11 +6,12 @@ license: mit
# Table of Contents

1. [Model Details](#model-details)
-
-
-
-
-
+2. [Uses](#uses)
+3. [Training Details](#training-details)
+4. [Evaluation](#evaluation)
+5. [Acknowledgements](#acknowledgements)
+6. [Citation](#citation)
+7. [How To Get Started With the Model](#how-to-get-started-with-the-model)


# Model Details
@@ -19,9 +20,11 @@ license: mit

A CLIP ViT-g/14 model trained with the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).

+Model training done by Romain Beaumont on the [stability.ai](https://stability.ai/) cluster.
+
# Uses

-As per the original OpenAI CLIP
+As per the original [OpenAI CLIP model card](https://github.com/openai/CLIP/blob/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1/model-card.md), this model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models.

The OpenAI CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis. Additionally, the LAION-5B blog (https://laion.ai/blog/laion-5b/) and upcoming paper include additional discussion as it relates specifically to the training dataset.

@@ -55,7 +58,7 @@ This model was trained with the 2 Billion sample English subset of LAION-5B (htt

## Training Procedure

-
+Please see [training notes](https://docs.google.com/document/d/1EFbMLRWSSV0LUf9Du1pWzWqgeiIRPwEWX2s1C6mAk5c) and [wandb logs](https://wandb.ai/rom1504/eval_openclip/reports/slow-g-14--VmlldzoyNTMwMjg5).

# Evaluation

@@ -71,7 +74,15 @@ The testing is performed with VTAB+ (A combination of VTAB (https://arxiv.org/ab

## Results

-
+The model achieves a 76.6% zero-shot top-1 accuracy on ImageNet-1k.
+
+An initial round of benchmarks has been performed on a wider range of datasets, currently viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb
+
+**TODO** - create table for just this model's metrics.
+
+# Acknowledgements
+
+Acknowledging [stability.ai](https://stability.ai/) for the compute used to train this model.

# Citation

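The card above quotes a 76.6% zero-shot top-1 accuracy on ImageNet-1k for this ViT-g/14 checkpoint. For orientation, the sketch below shows the usual OpenCLIP zero-shot classification pattern; the `ViT-g-14` model name and the `laion2b_s12b_b42k` pretrained tag are assumptions about how this release is registered, so check `open_clip.list_pretrained()` for the exact identifiers.

```python
# Minimal zero-shot classification sketch with OpenCLIP.
# The model/pretrained identifiers below are assumed, not taken from this commit;
# verify them with open_clip.list_pretrained().
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-g-14", pretrained="laion2b_s12b_b42k"  # assumed identifiers for this release
)
tokenizer = open_clip.get_tokenizer("ViT-g-14")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # any RGB image
text = tokenizer(["a diagram", "a dog", "a cat"])           # candidate captions

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine-normalize, then turn similarities into probabilities over captions.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # highest probability for the caption that best matches the image
```

The preprocessor and tokenizer configs added in this commit appear intended to let the corresponding Hugging Face `transformers` components (feature extractor and tokenizer) be loaded straight from the repo; sketches for both follow their respective files below.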
preprocessor_config.json
ADDED
@@ -0,0 +1,19 @@
+{
+  "crop_size": 224,
+  "do_center_crop": true,
+  "do_normalize": true,
+  "do_resize": true,
+  "feature_extractor_type": "CLIPFeatureExtractor",
+  "image_mean": [
+    0.48145466,
+    0.4578275,
+    0.40821073
+  ],
+  "image_std": [
+    0.26862954,
+    0.26130258,
+    0.27577711
+  ],
+  "resample": 3,
+  "size": 224
+}
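For reference, the preprocessing this config describes (resize to 224, center crop to 224, normalize with the CLIP mean/std; `"resample": 3` is PIL's bicubic filter) can be approximated with torchvision as below. This is an illustrative equivalent, not a pipeline shipped in the repo.

```python
# Approximate torchvision equivalent of preprocessor_config.json above.
# "resample": 3 corresponds to bicubic interpolation in PIL.
from torchvision import transforms
from torchvision.transforms import InterpolationMode

clip_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=InterpolationMode.BICUBIC),  # do_resize / size
    transforms.CenterCrop(224),                                       # do_center_crop / crop_size
    transforms.ToTensor(),                                            # PIL image -> float tensor in [0, 1]
    transforms.Normalize(                                             # do_normalize
        mean=[0.48145466, 0.4578275, 0.40821073],                     # image_mean
        std=[0.26862954, 0.26130258, 0.27577711],                     # image_std
    ),
])
```

In practice, `transformers` reads this file itself (e.g. via `CLIPFeatureExtractor.from_pretrained` pointed at the repo), which is presumably why it is being added here.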
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
+{"bos_token": {"content": "<|startoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": "<|endoftext|>"}
tokenizer.json
ADDED
The diff for this file is too large to render; see the raw diff.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
+{"unk_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<|startoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": "<|endoftext|>", "add_prefix_space": false, "errors": "replace", "do_lower_case": true, "name_or_path": "./clip_ViT_B_32/"}
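The tokenizer files in this commit (tokenizer_config.json and special_tokens_map.json above, plus the tokenizer.json and vocab.json files listed in the header) describe the standard CLIP byte-pair-encoding tokenizer, with `<|startoftext|>`/`<|endoftext|>` as BOS/EOS and `<|endoftext|>` doubling as the pad and unknown token. A minimal loading sketch with `transformers` follows; the repository id is a placeholder, and the 77-token context length is the usual CLIP default rather than something stated in this diff.

```python
# Sketch: loading the tokenizer files from this commit with Hugging Face transformers.
# "your-org/your-clip-repo" is a placeholder; point it at the actual model repo
# (or a local directory containing the files above).
from transformers import CLIPTokenizerFast

tokenizer = CLIPTokenizerFast.from_pretrained("your-org/your-clip-repo")

enc = tokenizer(
    ["a photo of a cat", "a photo of a dog"],
    padding="max_length",   # pads with <|endoftext|>, per special_tokens_map.json
    max_length=77,          # usual CLIP text context length (assumed, not stated here)
    truncation=True,
    return_tensors="pt",
)
print(enc.input_ids.shape)                                    # torch.Size([2, 77])
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.pad_token)
```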
vocab.json
ADDED
The diff for this file is too large to render; see the raw diff.