Those Charts Show The Benefits of Microphone Arrays for Hot Word Detection

Since I started looking more into smart speakers, including DIY ones such as the one I made with an Orange Pi Zero board, Google Assistant, and a single microphone, I have been told about the importance of microphone arrays, but so far I had not seen any clear study or data about that. That changed today, as I came across a review of mic arrays by the makers of Snips Voice Platform. They tested five arrays connected to a Raspberry Pi 3 running their system, and also added a generic USB microphone to the mix. The results speak for themselves…


In that experiment, they measured the rate at which a hot word was successfully detected while incrementally increasing the distance from 0.5 meters to 5 meters (16 ft). For each distance, they repeated the hot word 25 times at 3-second intervals, using a pre-recording to keep the voice level constant, and the same gain for all microphones. They did so in a silent room, a room with white noise, and finally one with background music. The generic USB microphone works just as well as the mic arrays in a silent room up to 2 meters, but further away, its success rate drops dramatically.
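
As a rough illustration of how such a success rate could be tallied, here is a minimal Python sketch. Only the 25 repetitions per distance come from the test description; the function names are hypothetical and the detection counts are made up for the example, so they are not the Snips results.

```python
# Hypothetical tally of hot word detections. The counts below are invented
# for illustration and are NOT taken from the Snips charts.
REPEATS = 25  # the hot word was played 25 times at each distance


def success_rate(detections, repeats=REPEATS):
    """Return the percentage of playbacks that triggered the hot word."""
    if not 0 <= detections <= repeats:
        raise ValueError("detections must be between 0 and repeats")
    return 100.0 * detections / repeats


def rates_by_distance(counts, repeats=REPEATS):
    """Map each distance (in meters) to its success rate in percent."""
    return {
        distance: success_rate(n, repeats)
        for distance, n in sorted(counts.items())
    }


# Made-up detection counts for one microphone in one room condition.
example_counts = {0.5: 25, 1.0: 24, 2.0: 22, 3.0: 15, 5.0: 6}

for distance, rate in rates_by_distance(example_counts).items():
    print(f"{distance:.1f} m: {rate:.0f}%")
```

Running the same tally for each microphone and each room condition (silent, white noise, background music) would give one curve per device, which is what the charts plot.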

The case for the arrays is even stronger when white noise is added, as the USB microphone’s success rate is only comparable at around 50 cm. When you add background music to the mix, it’s a bit messier, with the USB microphone performing even worse than with noise, and while the microphone arrays’ success rates drop, most manage fairly well. The best array based on that test is the PlayStation 3 Eye, which comes with 4 microphones and costs just $8 US on Amazon, significantly cheaper than any competitor in the test, including the USB microphone… ReSpeaker does not perform too badly either, and MiniDSP appears to be the weakest of the lot, especially in the tests at one meter, despite using the same XMOS XVSM-2000 chip as the ReSpeaker array.


11 Replies to “Those Charts Show The Benefits of Microphone Arrays for Hot Word Detection”

  1. The best microphone array is the Microsoft Kinect; it works even better than this PS3 array.
    The PS3 Eye remains a really great device at a great price.
    I know that because we ran many tests years ago with the S.A.R.A.H community.
    S.A.R.A.H is a kind of Jarvis, Google Home, or Amazon Echo, but I have been using it since before 2010, and it is really easy to create plugins for it or add recognition of unknown words. It may be mainly oriented toward French speakers…
    An example with kodi/xbmc:

    Anyway, run a test with the Kinect microphone array; you will not be disappointed!

  2. I’d like to see the Conexant CX20924 compared too. This is a new chip that does hot word and array processing inside the chip. That is different than the x-powers chip which relies on the host to do the processing. The idea with the Conexant chip is that the host can be in sleep mode while it watches for the hot word.

    Complete dev kit:

  3. I missed the Conexant entry. The chip is $6.25 so it is doing pretty good for something that cheap. That includes the hot word engine so you avoid a license fee for something like snowboy.

    I am still waiting for someone to release this on ESP32 with the two ADC and no license fees.

  4. Take those test results with a LARGE grain of salt…

    Mic synthetic beam-formed array performance is all about good DSP code carefully tuned to the position and directivity patterns of the mics. This makes these arrays notoriously difficult to test and compare.

    Often testing in an anechoic chamber will yield widely different results when the microphones are repositioned by even a small amount. This is because the synthetic directivity pattern of all the microphones working together post processing can be very complex with narrow lobes. The same effect is observed as the test frequency changes.

    In a working environment, echoes help to fill in the synthetic directivity gaps, and using human speech spreads the spectrum. But still, even in a controlled echo environment using spoken word stimulus, comparative test results are highly variable and difficult to reproduce.

    Time-domain polling of the microphone array often outperforms a synthetic array in pure word recognition. But with polling comes the need for more mics, multiplexing, and higher power consumption because of reduced sleep time. In the end, the best solution is probably going to be a hybrid of synthetic and polled mics, especially as integration increases and prices drop.

    The varying results between the two arrays that both use the xCORE-VOICE parts may be due not only to the test setup, but also to how well the manufacturer’s DSP code mates up with the mic array design. Oh how I wish the documentation and tool-chain were better for the xCORE chips, they’re cool.

    For “Makers”, especially when the design is fixed in place and not constrained by battery operation, I would like to see more experiments done with simple polled mics using native code. With some thoughtful mic positioning, mic polling can be surprisingly effective, sometimes easily outperforming complex synthetic mic arrays.

  5. The PS3 Eye mic requires special drivers to be supported on Linux/Windows. By default it is installed without the mic array features, and the special drivers are available only in a paid version.

  6. @cnxsoft

    “Any idea about the wide price disparity? How can PS3 Eye be only $8, while others are close to $100? Or is it just because the other are dev kits, and hence cost more?”

    1. The xCORE-VOICE devices are at the development level, so yes you are paying for the “free” software tool-chain and documentation they come with.

    2. At $8, the PS Eye hardware looks to me like surplus old stock bought in bulk for pennies on the dollar. The PS Eye was first introduced in 2007, a decade ago. I don’t believe the PS Eye being tested here is really acting as a true active mic array at all. So the good test results from the PS Eye are just another reason to mistrust these tests altogether.
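
The point made in the fourth comment about narrow lobes and frequency dependence can be illustrated with a textbook delay-and-sum beamformer. This is only a sketch under stated assumptions: a uniform linear array of 4 microphones spaced 5 cm apart, an ideal far-field source, and a speed of sound of 343 m/s. None of these values, nor the `array_gain` function name, come from the tested products, whose DSP code is certainly more elaborate.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in room-temperature air


def array_gain(num_mics, spacing_m, freq_hz, steer_deg, source_deg):
    """Normalized delay-and-sum gain (0 to 1) of a uniform linear array.

    The array is steered toward steer_deg, while the sound actually
    arrives from source_deg. Angles are measured from broadside.
    """
    wavenumber = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    # Residual phase difference between adjacent mics after steering.
    phase_step = wavenumber * spacing_m * (
        math.sin(math.radians(source_deg)) - math.sin(math.radians(steer_deg))
    )
    total = sum(cmath.exp(1j * m * phase_step) for m in range(num_mics))
    return abs(total) / num_mics


# Assumed geometry: 4 mics, 5 cm apart, steered straight ahead (0 degrees).
for freq_hz in (1000.0, 3430.0):
    gains = [
        f"{angle:>3} deg: {array_gain(4, 0.05, freq_hz, 0.0, angle):.2f}"
        for angle in (0, 15, 30, 45, 60)
    ]
    print(f"{freq_hz:.0f} Hz -> " + ", ".join(gains))
```

With this assumed geometry, a source 30 degrees off-axis is barely attenuated at 1 kHz but falls into a null at 3430 Hz, which hints at why small changes in position or test frequency can swing the results.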
