Espressif ESP-SR framework enables on-device speech recognition on ESP32-S3 and ESP32 WiSoCs

Espressif ESP-SR is a framework enabling on-device speech recognition on ESP32 and ESP32-S3 wireless microcontrollers, with the latter being recommended due to its vector extension for AI acceleration and larger, high-speed octal SPI PSRAM.

The ESP-SR framework was first released on December 17, 2021, with version 1.0, and the v1.20 update followed in March of this year, but I only found out about the ESP-SR offline speech recognition solution through a tweet by John Lee showing an ESP-SR demo video by @ThatProject.

I was initially confused since ESP32 boards have supported speech recognition for years using the ESP-ADF framework. But the key difference is that the latter relies on online voice assistants such as Baidu DuerOS, Amazon Alexa, and Google Assistant, while the relatively new ESP-SR does the recognition locally, directly on the ESP32 CPU, so you don’t even need a network connection for this to work. We’ve written about various offline voice recognition modules in the last few years, but I didn’t know this was already implemented on the ESP32 chips.

The GitHub repository for ESP-SR lists four main components:

  • Audio Front-end (AFE)
  • WakeNet Wake Word Engine
  • MultiNet Speech Command Word Recognition
  • Speech Synthesis (only supports the Chinese language at this time)

If some of the components above ring a bell, that’s because they are existing solutions: we covered the ESP-AFE algorithms when they became Alexa certified, while WakeNet and MultiNet are part of the ESP-SKAINET assistant introduced in 2019. What appears to be new are test apps for speech recognition and text-to-speech conversion that were committed just 3 to 5 days ago.

[Figure: ESP-SR speech recognition workflow]
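
As a rough illustration of how the components listed above could chain together, here is a minimal Python sketch. It is not the ESP-SR API: the class, the function names, the use of text strings as stand-ins for audio frames, the "hi esp" trigger phrase, and the three-frame listening window are all assumptions made for the example. It only shows the general idea of a front-end feeding a wake word detector, which in turn opens a short window for command recognition.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional

# Every name here is a hypothetical stand-in, not part of ESP-SR.
# "Frames" are plain strings so the example runs without audio hardware.


@dataclass
class VoicePipeline:
    front_end: Callable[[str], str]                 # AFE stand-in: cleans one frame
    is_wake_word: Callable[[str], bool]             # WakeNet stand-in
    match_command: Callable[[str], Optional[str]]   # MultiNet stand-in
    listen_frames: int = 3                          # assumed window after wake-up
    _remaining: int = field(default=0, init=False)  # frames left in the window

    def feed(self, raw_frame: str) -> Optional[str]:
        """Process one frame and return a command ID, or None."""
        frame = self.front_end(raw_frame)
        if self._remaining == 0:
            # Idle: only the wake word detector looks at the frame.
            if self.is_wake_word(frame):
                self._remaining = self.listen_frames
            return None
        # Awake: use up one frame of the window and try to match a command.
        self._remaining -= 1
        command = self.match_command(frame)
        if command is not None:
            self._remaining = 0  # go back to idle once a command is found
        return command


def run(pipeline: VoicePipeline, frames: Iterable[str]) -> List[str]:
    """Feed all frames and collect the recognized command IDs."""
    results = []
    for frame in frames:
        command = pipeline.feed(frame)
        if command is not None:
            results.append(command)
    return results


# Placeholder command table for the demo.
COMMANDS = {"turn on the light": "LIGHT_ON", "turn off the light": "LIGHT_OFF"}


def make_demo_pipeline() -> VoicePipeline:
    return VoicePipeline(
        front_end=lambda f: f.strip().lower(),   # toy "noise removal"
        is_wake_word=lambda f: f == "hi esp",    # placeholder trigger phrase
        match_command=COMMANDS.get,
    )
```

In this toy version, a command spoken before the trigger phrase is ignored, and the pipeline falls back to idle if no command arrives within the listening window. A real implementation would work on audio buffers and follow the Espressif documentation for the actual interfaces.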

So it looks like ESP-SR simply combines all those different projects as components to ease integration into customers’ projects. You’ll find documentation on the Espressif website, and the company recommends the ESP32-S3-Korvo-1 or ESP32-S3-Korvo-2 development boards to get started, although I’d assume it should also work on other ESP32-S3 smart audio devkits with microphones such as the ESP32-S3-BOX.

3 Comments
Hedda
10 months ago

Wake word on ESP32 would be a perfect project in ESPHome for “Home Assistant’s year of Voice”
https://www.home-assistant.io/blog/2022/12/20/year-of-voice/

Jon Smirl
10 months ago

Willow is already doing this.
https://github.com/toverainc/willow

The problem with Willow is that you need to leave an RTX 3xxx class card running 24/7 to do the processing.

Stuart Naylor
10 months ago

For some reason, Willow has chosen to use Whisper for ASR, which means that if you use the large model, you need some oomph for off-device ASR. ESP-SR does offer on-device ASR through MultiNet, but it’s really pushing past the device’s capability, hence why it broadcasts via a KW trigger to a central ASR. WakeNet, the KW part, also isn’t the best, and the claim that Willow is competitive is extremely optimistic, but the ADF does give a 2/3-mic BSS (blind source separation algorithm) that can really help with noise and far field. Still though it uses…
