Espressif ESP-SR framework enables on-device speech recognition on ESP32-S3 and ESP32 WiSoCs

Espressif ESP-SR is a speech recognition framework enabling on-device speech recognition on ESP32 and ESP32-S3 wireless microcontrollers, with the latter being recommended due to its vector extension for AI acceleration and larger, high-speed octal SPI PSRAM.

The ESP-SR framework was first released on December 17, 2021 with version 1.0, and the v1.20 update followed in March of this year. I only found out about the ESP-SR offline speech recognition solution through a tweet by John Lee showing an ESP-SR demo video by @ThatProject.

Comrades of the world, liberate your hands from the chains of typing and touching germy switches! Embrace the revolutionary power of speech recognition with ESP32-S3 + ESP-SR. Let your words flow freely, for the proletariat shall not be silenced by keyboards or bourgeois input… pic.twitter.com/bm3udteB3o

— John Lee (@EspressifSystem) July 15, 2023

I was initially confused since ESP32 boards have supported speech recognition for years using the ESP-ADF framework. But the key difference is that the latter relies on online voice assistants such as Baidu DuerOS, Amazon Alexa, and Google Assistant, while the relatively new ESP-SR runs speech recognition locally on the ESP32 CPU, so you don’t even need a network connection for this to work. We’ve written about various offline voice recognition modules in the last few years, but I didn’t know this was already implemented on the ESP32 chips.

The GitHub repository for ESP-SR lists four main components:

Audio Front-end AFE
WakeNet Wake Word Engine
MultiNet Speech Command Word Recognition
Speech Synthesis (only supports the Chinese language at this time)

If some of the components above ring a bell, that’s because they are existing solutions: we covered the ESP-AFE algorithms when they became Alexa certified, while WakeNet and MultiNet are part of the ESP-SKAINET assistant introduced in 2019. What appears to be new are test apps for speech recognition and text-to-speech conversion that were committed just 3 to 5 days ago.

ESP-SR ESP32 on-device speech recognition workflow

So it looks like ESP-SR simply combines all those different projects as components to help with integration into customers’ projects. You’ll find documentation on the Espressif website, and the company recommends the ESP32-S3-Korvo-1 or ESP32-S3-Korvo-2 development boards to get started, although I’d assume it should also work on other ESP32-S3 smart audio devkits with microphones such as the ESP32-S3-BOX.
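To make the workflow a little more concrete, here is a minimal Python sketch of how the components could chain together: a front-end cleans each frame, a wake word stage waits for a trigger, and a command stage then listens for a known phrase until it times out. This is not the ESP-SR API. The function name `run_pipeline`, the `timeout_frames` parameter, the use of text strings in place of audio frames, and the example wake word and commands are all assumptions made purely for illustration.

```python
from enum import Enum, auto


class State(Enum):
    WAITING_FOR_WAKE_WORD = auto()
    LISTENING_FOR_COMMAND = auto()


def run_pipeline(frames, wake_word, commands, timeout_frames=3):
    """Toy model of an AFE -> wake word -> command recognition flow.

    Hypothetical stand-ins (not ESP-SR code):
      frames   : list of strings playing the role of audio frames
      commands : dict mapping a command phrase to a command ID
    Returns a list of (event, value) tuples.
    """
    state = State.WAITING_FOR_WAKE_WORD
    frames_left = 0
    events = []

    for frame in frames:
        # Stand-in for the audio front-end: normalise the "audio".
        cleaned = frame.strip().lower()

        if state is State.WAITING_FOR_WAKE_WORD:
            # Stand-in for the wake word engine.
            if cleaned == wake_word:
                events.append(("wake", cleaned))
                state = State.LISTENING_FOR_COMMAND
                frames_left = timeout_frames
        else:
            # Stand-in for command word recognition.
            if cleaned in commands:
                events.append(("command", commands[cleaned]))
                state = State.WAITING_FOR_WAKE_WORD
            else:
                frames_left -= 1
                if frames_left == 0:
                    events.append(("timeout", None))
                    state = State.WAITING_FOR_WAKE_WORD

    return events


if __name__ == "__main__":
    demo_commands = {"turn on the light": 0, "turn off the light": 1}
    demo_frames = ["noise", " Hi ESP ", "turn on the light"]
    print(run_pipeline(demo_frames, "hi esp", demo_commands))
```

The point of the two-state design is that the command stage only runs after a wake word has been detected, so anything said before the trigger is ignored.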

Jean-Luc Aufranc (CNXSoft)

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.

3 Comments

Hedda

1 year ago

Wake word on ESP32 would be a perfect project in ESPHome for “Home Assistant’s year of Voice”
https://www.home-assistant.io/blog/2022/12/20/year-of-voice/

Jon Smirl

Willow is already doing this.
https://github.com/toverainc/willow

The problem with Willow is that you need to leave an RTX 3xxx class card running 24/7 to do the processing.

Stuart Naylor

For some reason Willow has chosen to use Whisper for ASR, which, if you use the large model, means you need some oomph for off-device ASR. ESP-SR does offer on-device ASR through MultiNet, but it’s really pushing past the device’s capability, hence why it broadcasts via a KW trigger to a central ASR. WakeNet, the KW part, also isn’t the best, and the claim that Willow is competitive is extremely optimistic, but the ADF does give a 2/3 mic BSS (blind source separation algorithm) that can really help with noise and far field. Still though it uses…