Automatic Speech Recognition Models
By LINAGORA - French, English, Arabic
We provide models for only a few languages, but we do them well, achieving state-of-the-art performance and accuracy for French, Arabic, and English.
tip
These models are the most generic ones and achieve the best overall performance. We also maintain specific acoustic models for business use cases such as heavily noisy environments, aeroplanes, phones, and call centers, as well as decoding graphs for specific vocabularies such as medical or banking. Contact us to learn more.
- French v2
- French v1
- English US
- Arabic
French v2
Acoustic model
- A deep Time Delay Neural Network (TDNN) model, trained on a large corpus of spontaneous speech. Data augmentation was applied to increase the quantity of training data and to artificially simulate some environment conditions (noise, speaker variability); a sketch of the noise-mixing step appears after this list. The full corpus after data augmentation is approximately 7100 hours.
- A deep neural network architecture (~30M parameters). This model is trained on the same data (7100 hours).
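To make the augmentation step concrete, here is a minimal sketch of mixing background noise into a clean recording at a chosen signal-to-noise ratio. The file names and the `soundfile`/`numpy` packages are assumptions of this example, not a description of the actual training pipeline.

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library for this sketch

def mix_noise(speech, noise, snr_db):
    """Mix background noise into clean (mono) speech at a target SNR in dB."""
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

speech, sr = sf.read("clean_utterance.wav")  # hypothetical clean recording
noise, _ = sf.read("background_noise.wav")   # hypothetical noise recording
sf.write("augmented_utterance.wav", mix_noise(speech, noise, snr_db=10), sr)
```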
Decoding graph
- This model is trained on multiple text corpora from different sources. It requires significant memory but provides very accurate transcription.
- This model is trained on various large corpora. It should provide the best accuracy but is a bit more resource-intensive than the other models.
French v1
Acoustic model
- A deep Time Delay Neural Network (TDNN) model, trained on 1700 hours of spontaneous speech. It is resistant to background noise, and a speaker adaptation model is used to make predictions robust to speaker variability.
Decoding graph
- This model is trained on a small corpus. It is a small model (100 MB) that generates acceptable transcriptions and is well suited to embedded applications.
- This model is trained on a much larger corpus than the small one. It requires significant memory but provides very accurate transcription.
- This model is trained on various large corpora. It should provide the best accuracy but is a bit more resource-intensive than the other models.
English US
Acoustic model
- A chain model based on TDNN-F, trained on 1000 hours of speech with volume and speed perturbation (a sketch of these perturbations appears below).
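As an illustration only, speed perturbation is commonly implemented by resampling the waveform (factors such as 0.9 and 1.1, as in Kaldi recipes) and volume perturbation by scaling the amplitude. A minimal sketch, assuming mono WAV files and the `soundfile` package (both assumptions of this example):

```python
import numpy as np
import soundfile as sf  # assumed audio I/O library for this sketch

def speed_perturb(samples, factor):
    """Resample-based speed change: factor > 1 shortens (speeds up) the audio."""
    positions = np.arange(0, len(samples) - 1, factor)
    return np.interp(positions, np.arange(len(samples)), samples)

def volume_perturb(samples, gain):
    """Scale the amplitude, clipping to the valid [-1, 1] range."""
    return np.clip(samples * gain, -1.0, 1.0)

speech, sr = sf.read("utterance.wav")  # hypothetical mono training utterance
for factor in (0.9, 1.1):              # typical speed-perturbation factors
    sf.write(f"utterance_sp{factor}.wav", speed_perturb(speech, factor), sr)
sf.write("utterance_vol.wav", volume_perturb(speech, 0.7), sr)
```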
Decoding graph
- Two language models are used for decoding: a medium model performs the first decoding pass, and a big model, trained on a large corpus of books, performs the rescoring pass (see the sketch after this list).
- This model is trained on various large corpora. It should provide the best accuracy but is a bit more resource-intensive than the other models.
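To make the two-pass idea concrete, here is a toy re-ranking sketch over an n-best list. It is only an illustration: the real pipeline rescores Kaldi lattices, and `big_lm_score` below is a dummy stand-in for the book-trained language model.

```python
def rescore_nbest(nbest, big_lm_score, lm_weight=0.7):
    """Re-rank first-pass hypotheses by replacing the small-LM score
    with a score from the larger language model."""
    rescored = [(acoustic + lm_weight * big_lm_score(text), text)
                for text, acoustic, _small_lm in nbest]
    return [text for _, text in sorted(rescored, reverse=True)]

def big_lm_score(text):
    # Dummy stand-in: in practice this would query the big LM trained on books.
    return -0.5 * len(text.split())

# (hypothesis, acoustic log-score, first-pass LM log-score)
nbest = [("i scream you scream", -22.4, -6.3),
         ("ice cream you scream", -22.6, -5.1)]
print(rescore_nbest(nbest, big_lm_score))
```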
Community-built models & other languages
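The models below use the standard Vosk (Kaldi) model format and can be loaded with the `vosk` Python package. A minimal transcription sketch, assuming an unpacked model directory and a 16 kHz mono PCM WAV file (both assumptions of this example):

```python
import json
import wave
from vosk import Model, KaldiRecognizer

wf = wave.open("audio_16k_mono.wav", "rb")    # hypothetical input file
model = Model("vosk-model-small-en-us-0.15")  # path to an unpacked model directory
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)                            # include word-level timing in results

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result())["text"])   # finalized segment
print(json.loads(rec.FinalResult())["text"])       # whatever audio remains
```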
Model | Size | Word error rate/Speed | Notes | License |
---|---|---|---|---|
English | ||||
vosk-model-small-en-us-0.15 | 40M | 9.85 (librispeech test-clean) 10.38 (tedlium) | Lightweight wideband model for Android and RPi | Apache 2.0 |
vosk-model-en-us-0.22 | 1.8G | 5.69 (librispeech test-clean) 6.05 (tedlium) 29.78 (callcenter) | Accurate generic US English model | Apache 2.0 |
vosk-model-en-us-0.22-lgraph | 128M | 7.82 (librispeech) 8.20 (tedlium) | Big US English model with dynamic graph | Apache 2.0 |
English Other | Older Models | |||
vosk-model-en-us-daanzu-20200905 | 1.0G | 7.08 (librispeech test-clean) 8.25 (tedlium) | Wideband model for dictation from Kaldi-active-grammar project | AGPL |
vosk-model-en-us-daanzu-20200905-lgraph | 129M | 8.20 (librispeech test-clean) 9.28 (tedlium) | Wideband model for dictation from Kaldi-active-grammar project with configurable graph | AGPL |
vosk-model-en-us-librispeech-0.2 | 845M | TBD | Repackaged Librispeech model from Kaldi, not very accurate | Apache 2.0 |
vosk-model-small-en-us-zamia-0.5 | 49M | 11.55 (librispeech test-clean) 12.64 (tedlium) | Repackaged Zamia model f_250, mainly for research | LGPL-3.0 |
vosk-model-en-us-aspire-0.2 | 1.4G | 13.64 (librispeech test-clean) 12.89 (tedlium) 33.82 (callcenter) | Kaldi original ASPIRE model, not very accurate | Apache 2.0 |
vosk-model-en-us-0.21 | 1.6G | 5.43 (librispeech test-clean) 6.42 (tedlium) 40.63 (callcenter) | Wideband model previous generation | Apache 2.0 |
Indian English | ||||
vosk-model-en-in-0.5 | 1G | 36.12 (NPTEL Pure) | Generic Indian English model for telecom and broadcast | Apache 2.0 |
vosk-model-small-en-in-0.4 | 36M | 49.05 (NPTEL Pure) | Lightweight Indian English model for mobile applications | Apache 2.0 |
Chinese | ||||
vosk-model-small-cn-0.22 | 42M | 23.54 (SpeechIO-02) 38.29 (SpeechIO-06) 17.15 (THCHS) | Lightweight model for Android and RPi | Apache 2.0 |
vosk-model-cn-0.22 | 1.3G | 13.98 (SpeechIO-02) 27.30 (SpeechIO-06) 7.43 (THCHS) | Big generic Chinese model for server processing | Apache 2.0 |
Chinese Other | ||||
vosk-model-cn-kaldi-multicn-0.15 | 1.5G | 17.44 (SpeechIO-02) 9.56 (THCHS) | Original Wideband Kaldi multi-cn model from Kaldi with Vosk LM | Apache 2.0 |
Russian | ||||
vosk-model-ru-0.22 | 1.5G | 5.74 (our audiobooks) 13.35 (open_stt audiobooks) 20.73 (open_stt youtube) 37.38 (openstt calls) 8.65 (golos crowd) 19.71 (sova devices) | Big mixed band Russian model for server processing | Apache 2.0 |
vosk-model-small-ru-0.22 | 45M | 22.71 (openstt audiobooks) 31.97 (openstt youtube) 29.89 (sova devices) 11.79 (golos crowd) | Lightweight wideband model for Android/iOS and RPi | Apache 2.0 |
Russian Other | ||||
vosk-model-ru-0.10 | 2.5G | 5.71 (our audiobooks) 16.26 (open_stt audiobooks) 26.20 (public_youtube_700_val open_stt) 40.15 (asr_calls_2_val open_stt) | Big narrowband Russian model for server processing | Apache 2.0 |
French | ||||
vosk-model-small-fr-0.22 | 41M | 23.95 (cv test) 19.30 (mtedx) 27.25 (podcast) | Lightweight wideband model for Android/iOS and RPi | Apache 2.0 |
vosk-model-fr-0.22 | 1.4G | 14.72 (cv test) 11.64 (mls) 13.10 (mtedx) 21.61 (podcast) 13.22 (voxpopuli) | Big accurate model for servers | Apache 2.0 |
French Other | ||||
vosk-model-small-fr-pguyot-0.3 | 39M | 37.04 (cv test) 28.72 (mtedx) 37.46 (podcast) | Lightweight wideband model for Android and RPi trained by Paul Guyot | CC-BY-NC-SA 4.0 |
vosk-model-fr-0.6-linto-2.2.0 | 1.5G | 16.19 (cv test) 16.44 (mtedx) 23.77 (podcast) 0.4xRT | Model from LINTO project | AGPL |
German | ||||
vosk-model-de-0.21 | 1.9G | 9.83 (Tuda-de test), 24.00 (podcast) 12.82 (cv-test) 12.42 (mls) 33.26 (mtedx) | Big German model for telephony and server | Apache 2.0 |
vosk-model-de-tuda-0.6-900k | 4.4G | 9.48 (Tuda-de test), 25.82 (podcast) 4.97 (cv-test) 11.01 (mls) 35.20 (mtedx) | Latest big wideband model from Tuda-DE project | Apache 2.0 |
vosk-model-small-de-zamia-0.3 | 49M | 14.81 (Tuda-de test), 37.46 (podcast) | Zamia f_250 small model repackaged (not recommended) | LGPL-3.0 |
vosk-model-small-de-0.15 | 45M | 13.75 (Tuda-de test), 30.67 (podcast) | Lightweight wideband model for Android and RPi | Apache 2.0 |
Spanish | ||||
vosk-model-small-es-0.42 | 39M | 16.02 (cv test) 16.72 (mtedx test) 11.21 (mls) | Lightweight wideband model for Android and RPi | Apache 2.0 |
vosk-model-es-0.42 | 1.4G | 7.50 (cv test) 10.05 (mtedx test) 5.84 (mls) | Big model for Spanish | Apache 2.0 |
Portuguese/Brazilian Portuguese | ||||
vosk-model-small-pt-0.3 | 31M | 68.92 (coraa dev) 32.60 (cv test) | Lightweight wideband model for Android and RPi | Apache 2.0 |
vosk-model-pt-fb-v0.1.1-20220516_2113 | 1.6G | 54.34 (coraa dev) 27.70 (cv test) | Big model from FalaBrazil | GPLv3.0 |
Greek | ||||
vosk-model-el-gr-0.7 | 1.1G | TBD | Big narrowband Greek model for server processing, not extremely accurate though | Apache 2.0 |
Turkish | ||||
vosk-model-small-tr-0.3 | 35M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
Vietnamese | ||||
vosk-model-small-vn-0.3 | 32M | TBD | Lightweight wideband model for Android and RPi | Apache 2.0 |
Italian | ||||
vosk-model-small-it-0.22 | 48M | 16.88 (cv test) 25.87 (mls) 17.01 (mtedx) | Lightweight model for Android and RPi | Apache 2.0 |
vosk-model-it-0.22 | 1.2G | 8.10 (cv test) 15.68 (mls) 11.23 (mtedx) | Big generic Italian model for servers | Apache 2.0 |
Dutch | ||||
vosk-model-small-nl-0.22 | 39M | 22.45 (cv test) 26.80 (tv) 25.84 (mls) 24.09 (voxpopuli) | Lightweight model for Dutch | Apache 2.0 |
Dutch Other | ||||
vosk-model-nl-spraakherkenning-0.6 | 860M | 20.40 (cv test) 32.64 (tv) 17.73 (mls) 19.96 (voxpopuli) | Medium Dutch model from Kaldi_NL | CC-BY-NC-SA |
vosk-model-nl-spraakherkenning-0.6-lgraph | 100M | 22.82 (cv test) 34.01 (tv) 18.81 (mls) 21.01 (voxpopuli) | Smaller model with dynamic graph | CC-BY-NC-SA |
Catalan | ||||
vosk-model-small-ca-0.4 | 42M | TBD | Lightweight wideband model for Android and RPi for Catalan | Apache 2.0 |
Arabic | ||||
vosk-model-ar-mgb2-0.4 | 318M | 16.40 (MGB-2 dev set) | Repackaged Arabic model trained on MGB2 dataset from Kaldi | Apache 2.0 |
Farsi | ||||
vosk-model-small-fa-0.4 | 47M | TBD | Lightweight wideband model for Android and RPi for Farsi (Persian) | Apache 2.0 |
vosk-model-fa-0.5 | 1G | TBD | Model with large vocabulary, not yet accurate but better than before (Persian) | Apache 2.0 |
vosk-model-small-fa-0.5 | 60M | TBD | Bigger small model for desktop application (Persian) | Apache 2.0 |
Filipino | ||||
vosk-model-tl-ph-generic-0.6 | 320M | TBD | Medium wideband model for Filipino (Tagalog) by feddybear | CC-BY-NC-SA 4.0 |
Ukrainian | ||||
vosk-model-small-uk-v3-nano | 73M | TBD | Nano model from Speech Recognition for Ukrainian | Apache 2.0 |
vosk-model-small-uk-v3-small | 133M | TBD | Small model from Speech Recognition for Ukrainian | Apache 2.0 |
vosk-model-uk-v3 | 343M | TBD | Bigger model from Speech Recognition for Ukrainian | Apache 2.0 |
vosk-model-uk-v3-lgraph | 325M | TBD | Big dynamic model from Speech Recognition for Ukrainian | Apache 2.0 |
Kazakh | ||||
vosk-model-small-kz-0.15 | 42M | 9.60 (dev) 8.32 (test) | Small mobile model from SAIDA_Kazakh | Apache 2.0 |
vosk-model-kz-0.15 | 378M | 8.06 (dev) 6.81 (test) | Bigger wideband model from SAIDA_Kazakh | Apache 2.0 |
Swedish | ||||
vosk-model-small-sv-rhasspy-0.15 | 289M | TBD | Repackaged model from Rhasspy project | MIT |
Japanese | ||||
vosk-model-small-ja-0.22 | 48M | 9.52 (csj CER) 17.07 (ted10k CER) | Lightweight wideband model for Japanese | Apache 2.0 |
vosk-model-ja-0.22 | 1G | 8.40 (csj CER) 13.91 (ted10k CER) | Big model for Japanese | Apache 2.0 |
Esperanto | ||||
vosk-model-small-eo-0.42 | 42M | 7.24 (CV Test) | Lightweight model for Esperanto | Apache 2.0 |
Hindi | ||||
vosk-model-small-hi-0.22 | 42M | 20.89 (IITM Challenge) 24.72 (MUCS Challenge) | Lightweight model for Hindi | Apache 2.0 |
vosk-model-hi-0.22 | 1.5G | 14.85 (CV Test) 14.83 (IITM Challenge) 13.11 (MUCS Challenge) | Big accurate model for servers | Apache 2.0 |
Czech | ||||
vosk-model-small-cs-0.4-rhasspy | 44M | 21.29 (CV Test) | Lightweight model for Czech from Rhasspy project | MIT |
Polish | ||||
vosk-model-small-pl-0.22 | 50.5M | 18.36 (CV Test) 16.88 (MLS Test) 11.55 (Voxpopuli Test) | Lightweight model for Polish for Android | Apache 2.0 |
Speaker identification model | ||||
vosk-model-spk-0.4 | 13M | TBD | Model for speaker identification, should work for all languages | Apache 2.0 |
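The speaker identification model is used alongside a regular recognition model; the recognizer then returns an x-vector per utterance that can be compared across recordings with cosine distance. A minimal sketch, assuming unpacked model directories and a previously enrolled reference vector (assumptions of this example):

```python
import json
import wave
import numpy as np
from vosk import Model, SpkModel, KaldiRecognizer

model = Model("vosk-model-small-en-us-0.15")  # any recognition model
spk_model = SpkModel("vosk-model-spk-0.4")    # the speaker identification model

wf = wave.open("speaker_sample.wav", "rb")    # hypothetical 16 kHz mono recording
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetSpkModel(spk_model)

reference = np.ones(128)  # placeholder; in practice, the "spk" vector of an enrollment utterance

def cosine_dist(x, y):
    """Smaller distance suggests the two utterances share a speaker."""
    x, y = np.asarray(x), np.asarray(y)
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        result = json.loads(rec.Result())
        if "spk" in result:
            print("distance to reference:", cosine_dist(result["spk"], reference))
```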