TensorFlow Lite for Microcontrollers Benchmarked on Linux SBCs

Dimitris Tassopoulos (Dimtass) decided to learn more about machine learning for embedded systems now that the technology is more mature, and wrote a series of five posts documenting his experience with low-end hardware such as STM32 Bluepill board, Arduino UNO, or ESP8266-12E module starting with simple NN examples, before moving to TensorFlow Lite for microcontrollers.

Dimitris recently followed up his latest “stupid project” (that’s the name of his blog, not being demeaning here :)) by running and benchmarking TensorFlow Lite for microcontrollers on various Linux SBC.

But why? you might ask. Dimitris tried to build tflite C++ API designed for Linux, but found it was hard to build, and no pre-built binary are available except for x86_64. He had no such issues with tflite-micro API, even though it’s really meant for baremetal MCU platforms.

Let’s get straight to the results which also include a Ryzen platform, probably a laptop, for reference:

SBC Average for 1000 runs  (ms)
Ryzen 2700X (this is not SBC) 2.19
AML-S905X-CC 15.54
Raspberry Pi 3 B+ 13.47
Jetson nano 9.34
NanoPi Duo 36.76
NanoPi Neo 16
NanoPi NEO2 22.83
NanoPi NEO4 5.82
NanoPi K1 Plus 14.32
Orange Pi Prime 18.40
Beaglebone Black 97.03
STM32F746 @ 216MHz 76.75
STM32F746 @ 288 MHz 57.95

And in chart form.

Click to Enlarge

The Ryzen 2700X processor is the fastest, but Rockchip RK3399 CPU found in NanoPi NEO4 is only 2.6 times slower, and outperforms all other Arm SBCs, including Jetson Nano. Not bad for a $50 board. Allwinner H3 based NanoPi Neo board also deserves a mention as at $10, it offers the best performance/price ratio for those test.

If you want to try it on your own board or computer, you can do so as follows:

Note that’s for Aarch64 (Arm 64-bit targets), the last command line will be different for other architectures, for example on Cortex-A7 based SoC, the program will be named “mnist-tflite-micro-armv7l” instead.

Note that while tflite-micro is easy to port to any SBCs, there are some drawbacks over using tflite C++ API. Notably tflite-micro does not support multi-threading, and it’s much slower than tflite C++ API.

CPU tflite-micro/tflite speed ratio
Ryzen 2700X 10.63x
Jetson nano (5W) 9.46x
Jetson nano (MAXN) 3.86x

The model is also embedded in the executable instead of being loading from a file, unless you implement your own parse. You’ll find a more detailed analysis and explanation on Dimtass’ blog post.

Support CNX Software - Donate via PayPal or cryptocurrencies, become a Patron on Patreon, or buy review samples
Notify of
newest most voted