Picovoice on-device speech-to-text engines slash the requirements and cost of transcription

Picovoice Leopard and Cheetah offline, on-device speech-to-text engines are said to achieve cloud-level accuracy, rely on tiny Speech-to-Text models, and slash the cost of automatic transcription by up to 10 times.

Leopard is an on-device speech-to-text engine, while Cheetah is an on-device streaming speech-to-text engine, and both are cross-platform with support for Linux x86_64, macOS (x86_64, arm64), Windows x86_64, Android, iOS, Raspberry Pi 3/4, and NVIDIA Jetson Nano.

Picovoice speech-to-textLooking at the cost is always tricky since companies have different pricing structures, and the table above basically shows the best scenario, where Picovoice is 6 to 20 times more cost-effective than solutions from Microsoft Azure or Google STT. Picovoice Leopard/Cheetah is free for the first 100 hours, and customers can pay a monthly $999 fee for up to 10,000 hours hence the $0.1 per hour cost with PicoVoice. If you were to use only 1000 hours out of your plan that would be $1 per hour, still not too bad. Check out the pricing page for details.

But the price is not everything, and a cheap service that does not do the job would be worthless, so the company provided some speech-to-text benchmarks with instructions to reproduce their setup on Github comparing Picovoice Leopard/Cheetah against AWS Transcribe, Google STT/STT-Enhanced, IBM Watson STT, and Microsoft Azure.

Speech-to-text benchmarks accuracy

The first metric looked into is the word error rate to estimate the accuracy of the services/solutions. Picovoice Leopard and Cheetah achieve a relatively low word error rate similar to cloud-based services such as Azure, Amazon, and Google Enhanced, and much better than Mozilla DeepSpeech offline, on-device speech-to-text engine.

Mozilla DeepSpeech would still be the most cost-effective solution (since it’s free) provided your application can do with the lower accuracy, but another aspect is that Picovoice speech-to-text engines make use of much fewer resources than the Mozilla STT solution with a lower Real-Time Factor (RTF), the ratio of CPU processing time to the length of the input speech file, and acoustic and language models that are 60 times smaller.

Mozilla DeepSpeech vs Picovoice Cheetah & Leopard
Tested on Ubuntu 20.04 machine with Intel Core i5-6500 CPU @ 3.20GHz, 64 GB of RAM, and NVMe storage

The closed-source libraries for all supported platforms, as well as documentation, can be found on Github in the respective Cheetah and Leopard repositories.

Share this:

Support CNX Software! Donate via cryptocurrencies, become a Patron on Patreon, or purchase goods on Amazon or Aliexpress

ROCK 5 ITX RK3588 mini-ITX motherboard
Notify of
The comment form collects your name, email and content to allow us keep track of the comments placed on the website. Please read and accept our website Terms and Privacy Policy to post a comment.
2 years ago

Interesting but does it manage many languages ? Like French for example 😅

2 years ago

Funny how far the benchmark numbers diverge depending on the dataset. The numbers shown on the diagram picture, being an average between the 5 speech datasets Picovoice chose for their benchmark, seem to be very highly driven by how badly each speech-to-text model/service f*cks up on the CommonVoice dataset. Ironically, that’s the dataset collected by mozilla themselves and where Mozilla DeepSpeech has the worst hangups. The sad thing with Mozilla DeepSpeech is that in 2020, Mozilla CEO Mitchell Baker fired first 70 then 250 more Mozilla employees (out of a staff ca. 1000), among which, iirc, the Rust team and…… Read more »

Khadas VIM4 SBC