nithinraok committed
Commit · 7cc2afa
1 Parent(s): f3c361a

update

Signed-off-by: nithinraok <[email protected]>

README.md CHANGED
@@ -1763,7 +1763,6 @@ model-index:
 ---
 ## <span style="color:#ffb300;">🐤 Canary 1B v2: Multitask Speech Transcription and Translation Model </span>
 
-## <span style="color:#b37800;">Description</span>
 
 **``Canary-1b-v2``** is a powerful 1-billion parameter model built for high-quality speech transcription and translation across 25 European languages.
 
@@ -1779,6 +1778,13 @@ Bulgarian (**bg**), Croatian (**hr**), Czech (**cs**), Danish (**da**), Dutch (*
 
 🗣️ **Experience `Canary-1b-v2` in action** at [Hugging Face Demo](https://huggingface.co/spaces/nvidia/canary-1b-v2)
 
+`Canary-1b-v2` model is ready for commercial/non-commercial use.
+
+
+## <span style="color:#b37800;">License/Terms of Use</span>
+
+GOVERNING TERMS: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license.
+
 ## <span style="color:#b37800;">Key Features</span>
 
 **`Canary-1b-v2`** is a scaled and enhanced version of the Canary model family, offering:
@@ -1799,9 +1805,6 @@ For a deeper glimpse to Canary family models, explore this comprehensive [NeMo t
 
 We will soon release a comprehensive **Canary-1b-v2 technical report** detailing the model architecture, training methodology, datasets, and evaluation results.
 
-`Canary-1b-v2` model is ready for commercial/non-commercial use.
-
----
 
 ### Automatic Speech Recognition (ASR)
 
@@ -1837,10 +1840,6 @@ We will soon release a comprehensive **Canary-1b-v2 technical report** detailing
 ---
 
 
-## <span style="color:#b37800;">License/Terms of Use</span>
-
-GOVERNING TERMS: Use of this model is governed by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode.en) license.
-
 ## <span style="color:#b37800;">Deployment Geography</span>
 
 Global
@@ -1851,7 +1850,7 @@ This model serves developers, researchers, academics, and industries building ap
 
 ## <span style="color:#b37800;">Release Date</span>
 
-08/14/2025
+Huggingface [08/14/2025](https://huggingface.co/nvidia/canary-1b-v2)
 
 ## <span style="color:#b37800;">Model Architecture</span>
 
@@ -1914,6 +1913,8 @@ print(output[0].text)
 
 #### Transcribing with timestamps
 
+> **Note:** Use [main branch of NeMo](https://github.com/NVIDIA/NeMo/) to get timestamps until it is released in NeMo 2.5.
+
 To transcribe with timestamps:
 ```python
 output = asr_model.transcribe(['2086-149220-0033.wav'], source_lang='en', target_lang='en', timestamps=True)
@@ -1944,7 +1945,7 @@ For translation task, please, refer to segment-level timestamps for getting intu
 
 **Runtime Engine(s):**
 
-* NeMo 2.
+* NeMo main branch (until it is released in NeMo 2.5)
 
 **Supported Hardware Microarchitecture Compatibility:**
 
@@ -2022,7 +2023,13 @@ To read more about the pseudo-labeling technique and [pipeline](https://github.c
 All transcripts include punctuation and capitalization.
 
 
-**
+**Data Collection Method by dataset**
+
+* Hybrid: Automated, Human
+
+**Labeling Method by dataset**
+
+* Hybrid: Synthetic, Human
 
 
 ---
@@ -2034,7 +2041,13 @@ All transcripts include punctuation and capitalization.
 * Earnings-22 \[14], This American Life \[15] (long-form)
 * MUSAN \[16]
 
-**
+**Data Collection Method by dataset**
+
+* Human
+
+**Labeling Method by dataset**
+
+* Human
 
 ## <span style="color:#b37800;">Benchmark Results</span>
 
@@ -2219,4 +2232,4 @@ Model and dataset restrictions | The Principle of least privilege (P
 
 \[15] [Speech Recognition and Multi-Speaker Diarization of Long Conversations](https://arxiv.org/abs/2005.08072)
 
-\[16] [MUSAN: A Music, Speech, and Noise Corpus](https://arxiv.org/abs/1510.08484)
+\[16] [MUSAN: A Music, Speech, and Noise Corpus](https://arxiv.org/abs/1510.08484)
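Reviewer sketch (not part of the commit): the note added in the "Transcribing with timestamps" hunk ties timestamp support to the NeMo main branch until NeMo 2.5 is released. A caller could guard the `timestamps=True` call with a version check along the lines below. The helper name `meets_minimum_version` is hypothetical, and the code assumes the caller already has the installed NeMo version as a plain string. It compares only the major and minor numbers, so it does not distinguish a pre-release build from the final 2.5 release.

```python
import re


def meets_minimum_version(version: str, minimum: tuple = (2, 5)) -> bool:
    """Return True if the major.minor part of `version` is >= `minimum`.

    Hypothetical helper for illustration only. Suffixes after major.minor
    (patch numbers, "rc0", and so on) are ignored.
    """
    match = re.match(r"\s*(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders by major first, then minor.
    return (major, minor) >= minimum


if __name__ == "__main__":
    for candidate in ("2.4.0", "2.5.0", "2.10.1"):
        print(candidate, meets_minimum_version(candidate))
```

If the check returns False, the README change above suggests installing NeMo from the main branch before requesting timestamps.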