Automatic Speech Recognition
Transformers
PyTorch
English
wav2vec2
audio
speech
xlsr-fine-tuning-week
Eval Results (legacy)
How to use jonatasgrosman/wav2vec2-large-english with Transformers:

    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model="jonatasgrosman/wav2vec2-large-english")

    # Load model directly
    from transformers import AutoProcessor, AutoModelForCTC

    processor = AutoProcessor.from_pretrained("jonatasgrosman/wav2vec2-large-english")
    model = AutoModelForCTC.from_pretrained("jonatasgrosman/wav2vec2-large-english")
Commit 9938d89 · Parent(s): 7fee82d
update model
Changed files:
- README.md +14 -72
- config.json +1 -1
- preprocessor_config.json +1 -0
- pytorch_model.bin +1 -1
README.md
CHANGED
@@ -24,10 +24,10 @@ model-index:
 metrics:
 - name: Test WER
 type: wer
-value: 21.
+value: 21.53
 - name: Test CER
 type: cer
-value: 9.
+value: 9.66
 ---
 
 # Wav2vec2-Large-English
@@ -81,16 +81,16 @@ for i, predicted_sentence in enumerate(predicted_sentences):
 
 | Reference | Prediction |
 | ------------- | ------------- |
-| "SHE'LL BE ALL RIGHT." |
+| "SHE'LL BE ALL RIGHT." | SHELL BE ALL RIGHT |
 | SIX | SIX |
-| "ALL'S WELL THAT ENDS WELL." |
-| DO YOU MEAN IT? |
-| THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES
-| HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? | HOW IS
-| "I GUESS YOU MUST THINK I'M KINDA BATTY." |
+| "ALL'S WELL THAT ENDS WELL." | ALLAS WELL THAT ENDS WELL |
+| DO YOU MEAN IT? | W MEAN IT |
+| THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS. | THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESTION |
+| HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE? | HOW IS MOSILLA GOING TO BANDL AND BE WHIT IS LIKE QU AND QU |
+| "I GUESS YOU MUST THINK I'M KINDA BATTY." | RUSTION AS HAME AK AN THE POT |
 | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING? | NO ONE NEAR THE REMOTE MACHINE YOU COULD RING |
-| SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. | SAUCE FOR THE
-| GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. |
+| SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER. | SAUCE FOR THE GUCE IS SAUCE FOR THE GONDER |
+| GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD. | GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD |
 
 ## Evaluation
 
@@ -159,76 +159,18 @@ print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_
 
 **Test Result**:
 
-In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-
-
----
-
-**Common Voice**
+In the table below I report the Word Error Rate (WER) and the Character Error Rate (CER) of the model. I ran the evaluation script described above on other models as well (on 2021-06-17). Note that the table below may show different results from those already reported, this may have been caused due to some specificity of the other evaluation scripts used.
 
 | Model | WER | CER |
 | ------------- | ------------- | ------------- |
-| jonatasgrosman/wav2vec2-large-xlsr-53-english | **
-| jonatasgrosman/wav2vec2-large-english | 21.
+| jonatasgrosman/wav2vec2-large-xlsr-53-english | **18.98%** | **8.29%** |
+| jonatasgrosman/wav2vec2-large-english | 21.53% | 9.66% |
 | facebook/wav2vec2-large-960h-lv60-self | 22.03% | 10.39% |
 | facebook/wav2vec2-large-960h-lv60 | 23.97% | 11.14% |
+| boris/xlsr-en-punctuation | 29.10% | 10.75% |
 | facebook/wav2vec2-large-960h | 32.79% | 16.03% |
-| boris/xlsr-en-punctuation | 34.81% | 15.51% |
 | facebook/wav2vec2-base-960h | 39.86% | 19.89% |
 | facebook/wav2vec2-base-100h | 51.06% | 25.06% |
 | elgeish/wav2vec2-large-lv60-timit-asr | 59.96% | 34.28% |
 | facebook/wav2vec2-base-10k-voxpopuli-ft-en | 66.41% | 36.76% |
 | elgeish/wav2vec2-base-timit-asr | 68.78% | 36.81% |
-
----
-
-**LibriSpeech (clean)**
-
-| Model | WER | CER |
-| ------------- | ------------- | ------------- |
-| facebook/wav2vec2-large-960h-lv60-self | **1.86%** | **0.54%** |
-| facebook/wav2vec2-large-960h-lv60 | 2.15% | 0.61% |
-| facebook/wav2vec2-large-960h | 2.82% | 0.84% |
-| facebook/wav2vec2-base-960h | 3.44% | 1.06% |
-| jonatasgrosman/wav2vec2-large-xlsr-53-english | 4.16% | 1.28% |
-| facebook/wav2vec2-base-100h | 6.26% | 2.00% |
-| jonatasgrosman/wav2vec2-large-english | 8.00% | 2.55% |
-| elgeish/wav2vec2-large-lv60-timit-asr | 15.53% | 4.93% |
-| boris/xlsr-en-punctuation | 19.28% | 6.45% |
-| elgeish/wav2vec2-base-timit-asr | 29.19% | 8.38% |
-| facebook/wav2vec2-base-10k-voxpopuli-ft-en | 31.82% | 12.41% |
-
----
-
-**LibriSpeech (other)**
-
-| Model | WER | CER |
-| ------------- | ------------- | ------------- |
-| facebook/wav2vec2-large-960h-lv60-self | **3.89%** | **1.40%** |
-| facebook/wav2vec2-large-960h-lv60 | 4.45% | 1.56% |
-| facebook/wav2vec2-large-960h | 6.49% | 2.52% |
-| jonatasgrosman/wav2vec2-large-xlsr-53-english | 8.82% | 3.42% |
-| facebook/wav2vec2-base-960h | 8.90% | 3.55% |
-| jonatasgrosman/wav2vec2-large-english | 13.62% | 5.24% |
-| facebook/wav2vec2-base-100h | 13.97% | 5.51% |
-| boris/xlsr-en-punctuation | 26.40% | 10.11% |
-| elgeish/wav2vec2-large-lv60-timit-asr | 28.39% | 12.08% |
-| elgeish/wav2vec2-base-timit-asr | 42.04% | 15.57% |
-| facebook/wav2vec2-base-10k-voxpopuli-ft-en | 45.19% | 20.32% |
-
----
-
-**TIMIT**
-
-| Model | WER | CER |
-| ------------- | ------------- | ------------- |
-| facebook/wav2vec2-large-960h-lv60-self | **5.17%** | **1.33%** |
-| facebook/wav2vec2-large-960h-lv60 | 6.24% | 1.54% |
-| jonatasgrosman/wav2vec2-large-xlsr-53-english | 6.81% | 2.02% |
-| facebook/wav2vec2-large-960h | 9.63% | 2.19% |
-| facebook/wav2vec2-base-960h | 11.48% | 2.76% |
-| elgeish/wav2vec2-large-lv60-timit-asr | 13.83% | 4.36% |
-| jonatasgrosman/wav2vec2-large-english | 13.91% | 4.01% |
-| facebook/wav2vec2-base-100h | 16.75% | 4.79% |
-| elgeish/wav2vec2-base-timit-asr | 25.40% | 8.16% |
-| boris/xlsr-en-punctuation | 25.93% | 9.99% |
-| facebook/wav2vec2-base-10k-voxpopuli-ft-en | 51.08% | 19.84% |
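The WER and CER columns in the README diff above can be illustrated with a small, dependency-free sketch. This is not the evaluation script the README refers to (only a fragment of it is visible in this diff); the helper names `edit_distance`, `wer`, and `cer` are hypothetical, and the example pair is one row of the Reference/Prediction table with the trailing period removed by hand. How the README's script normalizes text and aggregates over a whole test set is not shown here, so this scores a single utterance only.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, prediction: str) -> float:
    """Word error rate for one utterance; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, prediction.split()) / len(ref_words)


def cer(reference: str, prediction: str) -> float:
    """Character error rate for one utterance (spaces count as characters)."""
    return edit_distance(reference, prediction) / len(reference)


# Row taken from the Reference/Prediction table, punctuation stripped by hand.
reference = "SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER"
prediction = "SAUCE FOR THE GUCE IS SAUCE FOR THE GONDER"

print(f"WER: {wer(reference, prediction) * 100:.2f}%")
print(f"CER: {cer(reference, prediction) * 100:.2f}%")
```

Both metrics divide an edit count by the reference length, which is why a single misrecognized word costs far more in WER than in CER for the same utterance.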
config.json
CHANGED
@@ -64,6 +64,6 @@
   "num_feat_extract_layers": 7,
   "num_hidden_layers": 24,
   "pad_token_id": 0,
-  "transformers_version": "4.
+  "transformers_version": "4.7.0.dev0",
   "vocab_size": 33
 }
preprocessor_config.json
CHANGED
@@ -1,5 +1,6 @@
 {
   "do_normalize": true,
+  "feature_extractor_type": "Wav2Vec2FeatureExtractor",
   "feature_size": 1,
   "padding_side": "right",
   "padding_value": 0.0,
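As a rough illustration of what the `do_normalize`, `padding_side`, and `padding_value` fields in this hunk imply for a batch of raw waveforms, here is a pure-Python sketch. It assumes `do_normalize` means per-utterance zero-mean, unit-variance scaling; the function names and the `eps` constant are hypothetical, and this is not the library's own implementation.

```python
import math
from typing import List

# Values copied from the preprocessor_config.json hunk above.
CONFIG = {"do_normalize": True, "padding_side": "right", "padding_value": 0.0}


def normalize(waveform: List[float], eps: float = 1e-7) -> List[float]:
    """Scale one utterance to zero mean and (approximately) unit variance.

    `eps` is an assumed small constant that avoids division by zero on silence.
    """
    mean = sum(waveform) / len(waveform)
    var = sum((x - mean) ** 2 for x in waveform) / len(waveform)
    return [(x - mean) / math.sqrt(var + eps) for x in waveform]


def pad_batch(batch: List[List[float]], side: str, value: float) -> List[List[float]]:
    """Pad every utterance to the length of the longest one in the batch."""
    longest = max(len(w) for w in batch)
    padded = []
    for w in batch:
        fill = [value] * (longest - len(w))
        padded.append(w + fill if side == "right" else fill + w)
    return padded


def preprocess(batch: List[List[float]]) -> List[List[float]]:
    """Normalize first, then pad, so padding does not skew the statistics."""
    if CONFIG["do_normalize"]:
        batch = [normalize(w) for w in batch]
    return pad_batch(batch, CONFIG["padding_side"], CONFIG["padding_value"])


features = preprocess([[0.1, 0.3, 0.5, 0.7], [0.2, 0.4]])
print([len(f) for f in features])
```

Normalizing before padding is a design choice in this sketch: the padded zeros then stay out of each utterance's mean and variance.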
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
+oid sha256:ef32d54a3ebc911d64e7a8b8d04896f0957bbab4e3da4c6c0ae42ab901d6e4e7
 size 1262022892
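The `pytorch_model.bin` entry in the repository is a Git LFS pointer, not the weights themselves: it records the hash algorithm, digest, and byte size of the real file. A minimal sketch of checking a downloaded file against the new pointer follows. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the local path in the usage note is an assumption.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and digest agree with the pointer."""
    if path.stat().st_size != pointer["size"]:
        return False  # cheap check first; skips hashing a wrong-sized file
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as fh:
        # Read in chunks so a file of over a gigabyte is never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


# New-side pointer contents, copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ef32d54a3ebc911d64e7a8b8d04896f0957bbab4e3da4c6c0ae42ab901d6e4e7
size 1262022892
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])
```

With the weights downloaded to a local path of your choosing, `matches_pointer(Path("pytorch_model.bin"), pointer)` would report whether the file corresponds to this commit.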