Espressif introduces ESP32-S3-BOX AI development kit for online and offline voice applications

Espressif Systems has just introduced the ESP32-S3-BOX AI voice devkit, designed for developing applications with offline and online voice assistants. I find its design similar to the M5Stack Core2 devkit, but the applications will be different.

The ESP32-S3-BOX features the latest ESP32-S3 processor with WiFi and BLE connectivity, AI capabilities, as well as a 2.4-inch capacitive touchscreen display, a 2-mic microphone array, a speaker, and I/O connectors with everything housed in a plastic enclosure with a stand.

ESP32-S3-BOX specifications:

  • WiSoC – ESP32-S3 dual-core Tensilica LX7 up to 240 MHz with Wi-Fi & Bluetooth 5, AI instructions, 512KB SRAM
  • Memory and Storage – 8MB octal PSRAM and 16MB QSPI flash
  • Display – 2.4-inch capacitive touchscreen display with 320×240 resolution
  • Audio – Dual microphone, speaker
  • USB – 1x USB Type-C port for power and debugging (JTAG/serial)
  • Expansion – 2x Pmod-compatible headers for up to 16x GPIOs
  • Misc
    • Power LED, Mute button and LED, boot mode button, reset button
    • 6-axis IMU Sensor (InvenSense ICM-42670)
    • Infrared “controller”
  • Power Supply – 5V via USB Type-C connector or dock
  • Dimensions – TBD
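
As a rough sanity check on the memory figures above, here is a minimal sketch of how much RAM one full frame of the 320×240 display would take. The 16-bit RGB565 pixel format is my assumption and is not part of the specifications listed; the SRAM and PSRAM sizes come from the list.

```python
# Rough framebuffer budget for the 320x240 display.
# Assumption (not from the spec list): 16 bits per pixel (RGB565).
WIDTH, HEIGHT = 320, 240
BYTES_PER_PIXEL = 2  # assumed RGB565

SRAM_BYTES = 512 * 1024          # 512KB on-chip SRAM (from the spec list)
PSRAM_BYTES = 8 * 1024 * 1024    # 8MB octal PSRAM (from the spec list)


def framebuffer_bytes(width, height, bytes_per_pixel):
    """Return the size in bytes of one full frame."""
    return width * height * bytes_per_pixel


fb = framebuffer_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL)
print(f"One full frame: {fb} bytes ({fb / 1024:.0f} KiB)")
print(f"Share of SRAM:  {fb / SRAM_BYTES:.1%}")
print(f"Share of PSRAM: {fb / PSRAM_BYTES:.1%}")
```

Under that assumption, a single frame is 153,600 bytes, or a bit under a third of the on-chip SRAM, which hints at why the external PSRAM matters for graphical interfaces running next to voice models.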

The company says the ESP32-S3-BOX is ideal for the development of smart speakers, gateways, and IoT devices that require human-computer voice interaction. The development kit supports far-field voice interaction thanks to the built-in microphone array, offline voice wake-up and speech command recognition in Chinese and English, reconfigurable voice commands, again in Chinese and English, as well as the ESP-RainMaker IoT development framework.

Software support for the ESP32-S3-BOX builds upon previous work done for ESP32, including the ESP-Skainet Voice Assistant and the ESP-DL library for machine learning, as well as third-party solutions like Alexa for IoT SDK or the LVGL open-source graphics library used to develop HMI solutions. You’ll find the ESP-BOX AIoT development framework and documentation to get started with the AI voice development kit on GitHub.

A blog post goes into more detail about the current and future capabilities of the kit, notably the use of the Pmod connectors to add Zigbee and Thread connectivity with an ESP32-H2 module and/or even cellular IoT connectivity (5G, NB-IoT, LTE Cat-M1).

The ESP32-S3-BOX AI voice development kit can be pre-ordered for $45 on Amazon, Aliexpress, and Adafruit. At the time of writing, the devkit is only in stock on Aliexpress, and interestingly, it’s sold through the just-opened official Espressif Systems store on Aliexpress, so we may expect the company to sell future devkits that way going forward.

3 Replies to “Espressif introduces ESP32-S3-BOX AI development kit for online and offline voice applications”

  1. The ESP-BOX looks interesting, but…

    The Getting Started page [1] says: “…with the SDKs and examples provided by Espressif, you will be able to develop a wide variety of AIoT applications based on the ESP32-S3-BOX such as online and offline voice assistants, voice-enabled devices, HMI touch-screen devices, control panels, multi-protocol gateways easily.”

    Yeah, but I seriously doubt the ESP-BOX can do much voice recognition-wise besides turning a light on or off (the only provided getting-started example) without a (creepy) persistent online connection with Espressif. Model training, which is never mentioned in the Getting Started guide, will very likely also need to be online to be useful. I did not look further into the Dev-Tools, the name “Skainet” scared me away.

    The Digilent Pmod connectors are in a specification-compliant [2][3] 100mil spaced, 25mil square, male-pin female keyed-box-header physical form. Good luck finding mating wired keyed-box-connectors, crimp pin-sockets, and a quality, affordable crimp tool. They exist, but Digilent does not specify examples, so digging is needed. Fortunately, the ubiquitous female “DuPont Wires” should work in a pinch.

    Where to purchase the ESP-BOX? [4] At the time of my writing:

    AliExpress [5] has only eight ESP-BOXs left at USD $45.00 + $3.64 Shipping each to the U.S. (where I live). That’s $48.64 each total with a ridiculous estimated 38 day ship time to the U.S., which will probably be much longer due to the now perpetually-broken supply chains.

    Adafruit [6] shows the ESP-BOX at $49.95 each plus shipping with zero-stock and no estimated available date.

    Amazon [7] lists the ESP-BOX as “Currently unavailable. We don’t know when or if this item will be back in stock.” No price or shipping estimate is shown.

    1. The ESP32-S3-BOX is just a demo unit. I have ordered one, and it should ship in 20 days for £47 to the UK, with 287 in stock on AliExpress.
      The box, like many of their audio dev kit boards, has much that isn’t needed, but what you are missing, if you have ever tried for a voice HMI, is that the ESP32 could only manage simple CNN models, whilst the new vector instructions speed up ML 4-6x as benchmarked with some of their model zoo samples.
      There is no need for the unit to do ASR, as it can act as a KWS broadcast mic to a central machine that can relay back and use that GPIO if needed.
      So for me, my wish list item of a low-cost, low-energy broadcast KWS looks very interesting: if the ESP32-S3 follows the economies of scale the ESP32 did, then distributed array microphones should be very cost effective and voice becomes a default HMI for rooms.
      How well the simple algorithms for AEC, BSS & NS work on the BOX is of most interest, and with AEC, due to clock drift, audio in/out usually needs to be from the same clock source. I have been wondering for a time if NTP-synced audio protocols like Snapcast can be used with AEC, as wireless room audio & KWS microphones are much better suited as interoperable separates than a ‘Smart speaker’ from a single vendor.
      I am also slightly bemused at the custom wake word offering, but I guess it’s an offering to those who cannot DIY a wake word model. The ESP-DL library (https://www.cnx-software.com/2021/08/29/esp32-s3-ai-capabilities-esp-dl-library/) allows you to convert and use ONNX models, which means nearly all models with supported layers can be used. This is really interesting purely with the LX7 vector instructions and what layers have been ported, as models from TensorFlow Lite Micro to Torch should be convertible, and we have 4-6x more processing power than the previous ESP32.
      PMOD is more of an electrical spec and pinout, as they are just standard 2.54mm spaced headers. I have never really used PMOD and probably will not, but they have standard pinouts for certain interfaces, from H-bridge to I2S, and Espressif has used Digilent’s rather than, say, creating their own like a Pi. PMOD tends to be a cluster for a sole purpose.

    2. PS what should be interesting is how it can handle examples such as https://github.com/google-research/google-research/blob/master/kws_streaming/experiments/kws_experiments_35_labels.md

      Where embedded voice in product becomes a low cost HMI rather than the esp32-S3 becoming the new Alexa.
      Also, due to new federated learning like on the Pixel phones, ML can equally be offline as online, and it’s only control: units are not controlled directly, they poll a server, which is more about bridging the problems of NAT and firewalls for remote control than any ominous snooping needs. No one has managed to balance remote needs, so it’s an intermediary server for most users or the choice of local-only control of an embedded appliance.

      Google seems to be doing the most interesting embedded ML at the moment, and it’s interesting as it is offline.
      But an ESP32-S3 is not a TensorTPU-equipped Pixel phone and was never intended to be, but for me it is equally interesting for voice-enabled products and maybe distributed array microphones for interoperable systems.

      Time will tell and also my patience as I scream why did I not spend more effort in C proficiency 🙂
