spanish-verification-model-pkt-c

Model Summary

We define verification models as ASR models specifically designed to assess the reliability of transcriptions. These models are particularly useful when no reference transcription is available, as they can generate hypotheses with a certain degree of confidence.

The core idea behind verification models is to train or fine-tune two or more ASR models on different datasets. If these models produce identical transcriptions for the same audio input, the result is likely to be accurate. Furthermore, if a verification model agrees with an existing reference transcription, this agreement can also be interpreted as a signal of reliability.

In this model card, we present Verification Model C for Spanish, available as "spanish-verification-model-pkt-c". This acoustic model is based on "nvidia/parakeet-rnnt-1.1b" and is designed for Automatic Speech Recognition in Spanish. It is intended to be used in tandem with Verification Model D, "spanish-verification-model-pkt-d", to enable cross-verification and boost transcription confidence in unannotated or weakly supervised scenarios. These models can also be used together with Model A "spanish-verification-model-pkt-a" and Model B "spanish-verification-model-pkt-b" for better results.

Intended Uses and Limitations

This model is designed for the following scenarios:

Verification of transcriptions: When two or more verification models produce the same output for a given audio segment, the transcription can be considered highly reliable. This is particularly useful in low-resource or weakly supervised settings.
Transcription without references: When no reference transcription exists, this model can still produce a hypothesis that may be considered trustworthy once a second verification model corroborates it.
Data filtering and quality control: It can be used to automatically detect and retain high-confidence segments in large-scale speech datasets (e.g., for training or evaluation purposes).
Human-in-the-loop workflows: These models can assist human annotators by flagging reliable transcriptions, helping reduce manual verification time.
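
As a rough illustration of the filtering and human-in-the-loop scenarios above, the following sketch splits segments into an auto-accepted pile and a pile that still needs manual review. The record layout (`id`, `hypotheses`) and the normalization step (lowercasing, stripping punctuation) are assumptions made for this example; the card does not state how hypotheses are compared beyond requiring the same output.

```python
import string

# ASCII punctuation plus the Spanish inverted marks (an assumed normalization choice).
_PUNCTUATION = string.punctuation + "¿¡"


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    cleaned = text.lower().translate(str.maketrans("", "", _PUNCTUATION))
    return " ".join(cleaned.split())


def route_segments(segments: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split segments into (accepted, needs_review) by hypothesis agreement."""
    accepted, needs_review = [], []
    for seg in segments:
        distinct = {normalize(h) for h in seg["hypotheses"]}
        # Agreement needs at least two hypotheses that normalize to one string.
        if len(seg["hypotheses"]) >= 2 and len(distinct) == 1:
            accepted.append(seg)
        else:
            needs_review.append(seg)
    return accepted, needs_review


# Hypothetical hypotheses from two verification models for two segments.
segments = [
    {"id": "s1", "hypotheses": ["Hola, mundo.", "hola mundo"]},
    {"id": "s2", "hypotheses": ["buenos días", "buenas días"]},
]
accepted, needs_review = route_segments(segments)
```

A segment with a single hypothesis is always sent to review, which reflects the limitation that one model in isolation provides no verification.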

As limitations, we identify the following:

No ground-truth guarantee: Agreement between models does not guarantee correctness; it only increases the likelihood of reliability.
Domain sensitivity: The accuracy and agreement rate may drop if used on speech data that differs significantly from the training domain (e.g., different accents, topics, or recording conditions).
Designed for pairwise comparison: This model is intended to work in conjunction with at least one other verification model. Using it in isolation does not provide verification benefits.
Language and model-specific: This particular model is optimized for Spanish and based on the Parakeet RNNT architecture. Performance in other languages or under different acoustic models may vary significantly.

How to Get Started with the Model

For an up-to-date, working version of this code, please visit NVIDIA's official repository.

Installation

To use this model, you may install the NVIDIA NeMo Framework:

Create a virtual environment:

python -m venv /path/to/venv

Activate the environment:

source /path/to/venv/bin/activate

Install the modules:

BRANCH=main
python -m pip install "git+https://github.com/NVIDIA/NeMo.git@${BRANCH}#egg=nemo_toolkit[all]"

For Inference

To transcribe audio in Spanish using this model, you can follow this example:

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="BSC-LT/spanish-verification-model-pkt-c")

output = asr_model.transcribe(['YOUR_WAV_FILE.wav'])
print(output[0].text)
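
Because this model is meant to be paired with another verification model, a possible way to run two models over the same files and flag agreement is sketched below. The helper names (`load_transcriber`, `cross_verify`) are hypothetical, the comparison is a plain string equality, and the Hub identifier used for Model D in the trailing comment is assumed by analogy with this model's identifier.

```python
from typing import Callable, Sequence

# A transcriber maps a list of audio paths to a list of text hypotheses.
Transcriber = Callable[[Sequence[str]], list]


def load_transcriber(model_name: str) -> Transcriber:
    """Wrap a NeMo RNNT model as a plain paths -> texts function."""
    # Imported lazily so the comparison logic below works without NeMo installed.
    import nemo.collections.asr as nemo_asr

    model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name=model_name)

    def transcribe(paths: Sequence[str]) -> list:
        # Mirrors the example above: each result exposes its text as `.text`.
        return [hyp.text for hyp in model.transcribe(list(paths))]

    return transcribe


def cross_verify(first: Transcriber, second: Transcriber, paths: Sequence[str]) -> list:
    """Transcribe every file with both models and flag exact agreement."""
    texts_first = first(paths)
    texts_second = second(paths)
    return [
        {"path": p, "first": a, "second": b, "agree": a == b}
        for p, a, b in zip(paths, texts_first, texts_second)
    ]


# Possible wiring (the Model D identifier is an assumption):
# model_c = load_transcriber("BSC-LT/spanish-verification-model-pkt-c")
# model_d = load_transcriber("BSC-LT/spanish-verification-model-pkt-d")
# results = cross_verify(model_c, model_d, ["YOUR_WAV_FILE.wav"])
```

Keeping the comparison separate from model loading makes it easy to swap in Model A or Model B, or to test the logic with stub transcribers.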

Training Details

Training data

The training data for Model C consists of 1,500 hours of Spanish speech extracted from the YODAS dataset.

To ensure high-quality supervision, we applied a triple-consensus filtering strategy: we only kept those utterances where the reference transcription in YODAS, the output of Model A "spanish-verification-model-pkt-a", and the output of Model B "spanish-verification-model-pkt-b" were identical.

This approach allowed us to minimize noisy or ambiguous transcriptions while maintaining a large amount of diverse training material.
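
The triple-consensus rule can be sketched as a simple filter. This is an illustration only, not the authors' pipeline: the field names (`reference`, `model_a`, `model_b`) are hypothetical, and strings are compared as-is because the card does not say whether any text normalization was applied before the comparison.

```python
from typing import Iterable


def triple_consensus(records: Iterable[dict]) -> list:
    """Keep records whose reference and both model outputs are identical."""
    kept = []
    for rec in records:
        # Chained equality: reference == Model A output == Model B output.
        if rec["reference"] == rec["model_a"] == rec["model_b"]:
            kept.append(rec)
    return kept


# Hypothetical utterances with a dataset reference and two model outputs.
utterances = [
    {"id": "u1", "reference": "hola mundo", "model_a": "hola mundo", "model_b": "hola mundo"},
    {"id": "u2", "reference": "hola mundo", "model_a": "hola mundo", "model_b": "ola mundo"},
    {"id": "u3", "reference": "buen dia", "model_a": "buenos dias", "model_b": "buenos dias"},
]
training_set = triple_consensus(utterances)
```

Note that `u3` is dropped even though both models agree, because the reference differs: all three sources must match.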

Training procedure

This model is the result of fine-tuning "nvidia/parakeet-rnnt-1.1b" by following this tutorial.

Training Hyperparameters

language: Spanish
hours of training audio: 1500
learning rate: 2e-4
devices=4
num_nodes=8
batch_size=8
accelerator=accelerator
strategy="ddp"
max_epochs=20
enable_checkpointing=True
logger=False
log_every_n_steps=100
check_val_every_n_epoch=1
precision='bf16-mixed'
callbacks=[checkpoint_callback]
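
For orientation, the global batch size implied by these settings can be worked out as follows. This assumes that `batch_size` is the per-GPU value and that no gradient accumulation was used; the card states neither. Under those assumptions the run uses 32 GPUs and 256 utterances per optimizer step.

```python
devices = 4      # GPUs per node, from the list above
num_nodes = 8    # nodes, from the list above
batch_size = 8   # assumed to be the per-GPU batch size

# Under DDP, every GPU processes its own batch in each step.
world_size = devices * num_nodes
global_batch_size = world_size * batch_size  # assumes no gradient accumulation

print(world_size, global_batch_size)
```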

Citation

If this model contributes to your research, please cite the work:

@misc{bsc-esvermodel-pkt-c-2025,
  title={Spanish Verification Model Parakeet C},
  author={Hernandez Mena, Carlos Daniel and España-Bonet, Cristina},
  organization={Barcelona Supercomputing Center},
  url={https://huggingface.co/BSC-LT/spanish-verification-model-pkt-c},
  year={2025}
}

Additional Information

Author

The fine-tuning was performed in August 2025 at the Language Technologies Laboratory of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena, supervised by Cristina España-Bonet.

Contact

For further information, please email [email protected].

Copyright

License

Apache-2.0

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.

The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.

We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5, hosted at BSC, Spain.

Evaluation results

All values are self-reported word error rates (WER).

Common Voice 17.0 Spanish (Test): 8.747
Common Voice 17.0 Spanish (Dev): 7.867
Multilingual LibriSpeech Spanish (Test): 7.471
Multilingual LibriSpeech Spanish (Dev): 7.738
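
For readers unfamiliar with the metric, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below computes it for a single utterance. It is not the evaluation script behind the figures above, and it applies no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One deleted word out of three reference words.
score = word_error_rate("el gato negro", "el gato")
```

Multiplying the result by 100 gives a percentage. Corpus-level scores are usually obtained by summing errors and reference words over all utterances rather than averaging per-utterance values.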
