Den4ikAI/faster-whisper-large-v2-no-digits-norm-punct

Automatic Speech Recognition

audio

Den4ikAI committed on Jun 20

Commit f353225 (verified) · 1 parent: de5ffbb

Update README.md

Files changed (1)

README.md +3 -61

@@ -108,66 +108,8 @@ base_model:
 pipeline_tag: automatic-speech-recognition
 ---
-# Den4ikAI/whisper-large-v2-no-digits-norm-punct
-This is a special version of the `openai/whisper-large-v2` model whose vocabulary has had all tokens corresponding to digits removed, as well as tokens with extraneous punctuation.
-The primary goal of this modification is to **force the model to generate numbers as words rather than digits**. This is extremely useful for text normalization tasks, for example when preparing data for text-to-speech (TTS) systems, where numbers need to be fully spelled out.
-## Comparison with the Original Model
-Here’s a clear example demonstrating the difference in behavior between the models when transcribing the same audio clip containing the phrase “Билет стоил двадцать тысяч рублей” (“The ticket cost twenty thousand rubles”).
-| Model                                                       | Transcription Output                                                                                   |
-| ----------------------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
-| `openai/whisper-large-v2` (Original)                        | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **20000** рублей.<\|endoftext\|>` |
-| `Den4ikAI/whisper-large-v2-no-digits-norm-punct` (This model) | `<\|startoftranscript\|><\|ru\|><\|transcribe\|><\|notimestamps\|> Билет стоил **двадцать тысяч** рублей.<\|endoftext\|>` |
-As you can see, this modified model correctly normalized the number into words, whereas the original version left it as digits.
-## How to Use
-You can use this model just like any other Whisper model in the `transformers` library.
-```python
-from transformers import WhisperProcessor, WhisperForConditionalGeneration
-import torchaudio
-import torch
-# Specify the device (GPU if available)
-device = "cuda:0" if torch.cuda.is_available() else "cpu"
-# Load the audio file
-wav, sr = torchaudio.load("numbers5.mp3")
-# Convert to mono and resample to 16 kHz
-if wav.shape[0] > 1:
-    wav = torch.mean(wav, dim=0, keepdim=True)
-resampler = torchaudio.transforms.Resample(sr, 16000)
-wav = resampler(wav)
-audio_input = wav.squeeze(0)
-# Load the processor and model
-model_id = "Den4ikAI/whisper-large-v2-no-digits-norm-punct"
-processor = WhisperProcessor.from_pretrained(model_id)
-model = WhisperForConditionalGeneration.from_pretrained(model_id).to(device)
-# Prepare inputs and extract features
-input_features = processor(
-    audio_input,
-    sampling_rate=16000,
-    return_tensors="pt"
-).input_features.to(device)
-# Generate token IDs (for Russian specify language="russian")
-predicted_ids = model.generate(input_features, language="russian", task="transcribe")
-# Decode tokens back to text
-transcription = processor.batch_decode(
-    predicted_ids,
-    skip_special_tokens=False
-)
-print(transcription)
-# Example output for an audio clip with numbers:
-# ['<|startoftranscript|><|ru|><|transcribe|><|notimestamps|> Билет стоил двадцать тысяч рублей.<|endoftext|>']

+# Den4ikAI/faster-whisper-large-v2-no-digits-norm-punct
+CTranslate2 version of https://huggingface.co/Den4ikAI/whisper-large-v2-no-digits-norm-punct
+This model requires an improved version of CTranslate2, which you will have to build yourself. See here: https://github.com/Den4ikAI/CTranslate2/
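
The new README does not show how to load the converted model. Below is a minimal sketch, assuming the `faster-whisper` package is installed on top of the modified CTranslate2 build mentioned above and that it can resolve the repository ID from the Hugging Face Hub. The helper names `join_segments` and `transcribe` are hypothetical, and the audio path is a placeholder; none of this is confirmed by the model author.

```python
def join_segments(segments):
    """Concatenate the .text field of each segment into one stripped string."""
    return "".join(seg.text for seg in segments).strip()


def transcribe(
    audio_path,
    model_id="Den4ikAI/faster-whisper-large-v2-no-digits-norm-punct",
    language="ru",
):
    """Transcribe an audio file with faster-whisper (assumed to be installed)."""
    # Imported lazily so this module can be loaded without faster-whisper.
    from faster_whisper import WhisperModel

    # device="auto" picks a GPU when one is available, otherwise the CPU.
    model = WhisperModel(model_id, device="auto", compute_type="default")

    # transcribe() returns a lazy generator of segments plus an info object;
    # decoding only happens while the generator is consumed.
    segments, _info = model.transcribe(
        audio_path, language=language, task="transcribe"
    )
    return join_segments(segments)
```

Calling `transcribe("numbers5.mp3")` would return the transcription as a single string. Whether this works with a stock CTranslate2 install is not established by the README, which says a custom build is needed.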