Espressif introduces ESP32-S3-BOX AI development kit for online and offline voice applications

Espressif Systems has just introduced the ESP32-S3-BOX AI voice devkit, designed for developing applications with offline and online voice assistants. I find its design similar to the M5Stack Core2 devkit, but the applications will be different.

The ESP32-S3-BOX features the latest ESP32-S3 processor with WiFi and BLE connectivity, AI capabilities, as well as a 2.4-inch capacitive touchscreen display, a 2-mic microphone array, a speaker, and I/O connectors with everything housed in a plastic enclosure with a stand.


ESP32-S3-BOX specifications:

  • WiSoC – ESP32-S3 dual-core Tensilica LX7 up to 240 MHz with Wi-Fi & Bluetooth 5, AI instructions, 512KB SRAM
  • Memory and Storage – 8MB octal PSRAM and 16MB QSPI flash
  • Display – 2.4-inch capacitive touchscreen display with 320×240 resolution
  • Audio – Dual microphone, speaker
  • USB – 1x USB Type-C port for power and debugging (JTAG/serial)
  • Expansion – 2x Pmod-compatible headers for up to 16x GPIOs
  • Misc
    • Power LED, Mute button and LED, boot mode button, reset button
    • 6-axis IMU Sensor (InvenSense ICM-42670)
    • Infrared “controller”
  • Power Supply – 5V via USB Type-C connector or dock
  • Dimensions – TBD

The company says the ESP32-S3-BOX is ideal for the development of smart speakers, gateways, and IoT devices that require human-computer voice interaction. The development kit supports far-field voice interaction thanks to the built-in microphone array, offline voice wake-up and speech command recognition in Chinese and English, reconfigurable voice commands (again in Chinese and English), as well as the ESP-RainMaker IoT development framework.
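To make the "wake-up, then command" flow a bit more concrete, here is a minimal Python sketch of how a wake word can gate a reconfigurable command table. It is only a toy model of the idea described above, not Espressif's API: the `VoiceFrontEnd` class, the "hi box" wake word, and the `light:on` / `light:off` actions are all hypothetical, and the speech-to-text step is replaced by plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class VoiceFrontEnd:
    """Toy model of a wake-word-gated command table (hypothetical, not an Espressif API)."""

    wake_word: str
    commands: Dict[str, Callable[[], str]] = field(default_factory=dict)
    awake: bool = False

    def register(self, phrase: str, action: Callable[[], str]) -> None:
        # "Reconfigurable" here simply means the table can be changed at runtime.
        self.commands[phrase.lower().strip()] = action

    def hear(self, phrase: str) -> Optional[str]:
        phrase = phrase.lower().strip()
        if not self.awake:
            # Everything is ignored until the wake word is heard.
            self.awake = phrase == self.wake_word.lower().strip()
            return None
        # Accept at most one command per wake-up, then go back to sleep.
        self.awake = False
        action = self.commands.get(phrase)
        return action() if action else None


box = VoiceFrontEnd(wake_word="hi box")
box.register("turn on the light", lambda: "light:on")
box.register("turn off the light", lambda: "light:off")

heard = ["turn on the light", "hi box", "turn on the light", "turn off the light"]
results = [box.hear(p) for p in heard]
print(results)  # [None, None, 'light:on', None]
```

Only the phrase spoken right after the wake word triggers an action. On the real hardware, both steps would run on audio from the microphone array rather than on text.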

Software support for the ESP32-S3-BOX builds upon previous work done for the ESP32, including the ESP-Skainet voice assistant and the ESP-DL library for machine learning, as well as third-party solutions like the Alexa for IoT SDK or the LVGL open-source graphics library used to develop HMI solutions. You’ll find the ESP-BOX AIoT development framework and documentation to get started with the AI voice development kit on GitHub.

A blog post goes into more detail about the current and future capabilities of the kit, notably the use of the Pmod connectors to add Zigbee and Thread connectivity with an ESP32-H2 module and/or even cellular IoT connectivity (5G, NB-IoT, LTE Cat-M1).

The ESP32-S3-BOX AI voice development kit can be pre-ordered for $45 on Amazon, Aliexpress, and Adafruit. At the time of writing, the devkit is only in stock on Aliexpress, and interestingly the listing is on the just-opened official Espressif Systems store, so we may expect the company to sell future devkits that way going forward.

2 years ago

The ESP-BOX looks interesting, but… The Getting Started page [1] says: “…with the SDKs and examples provided by Espressif, you will be able to develop a wide variety of AIoT applications based on the ESP32-S3-BOX such as online and offline voice assistants, voice-enabled devices, HMI touch-screen devices, control panels, multi-protocol gateways easily.” Yeah, but I seriously doubt the ESP-BOX can do much voice recognition-wise besides turning a light on or off (the only provided getting-started example) without a (creepy) persistent online connection with Espressif. Model training, which is never mentioned in the Getting Started guide, will very likely also need…

Stuart Naylor
2 years ago

ESP32-S3-BOX is just a demo unit that I have ordered, and it should ship in 20 days: £47 to the UK, with 287 in stock on Aliexpress. The box, like many of their audio dev kit boards, has much that isn’t needed, but what you are missing if you have ever tried for a voice HMI is that the ESP32 could only manage simple CNN models, whilst the new vector instructions speed up ML 4-6x as benched with some of their model zoo samples. There is no need for the unit to do ASR as it can act as a KWS broadcast mic…

Stuart Naylor
2 years ago

PS: what should be interesting is how it can handle examples such as Where embedded voice in product becomes a low-cost HMI rather than the ESP32-S3 becoming the new Alexa. Also, due to new federated learning like on the Pixel phones, ML can equally be offline as online, and it’s only control, as units are not controlled directly: they poll a server, which is more about bridging the problems of NAT and firewalls for remote control rather than any ominous snooping needs, where no-one has managed to balance remote needs, so it’s an intermediary server for most users…
