Dimitris Tassopoulos (Dimtass) decided to learn more about machine learning for embedded systems now that the technology is more mature. He wrote a series of five posts documenting his experience with low-end hardware such as the STM32 Bluepill board, Arduino UNO, or ESP8266-12E module, starting with simple NN examples before moving to TensorFlow Lite for microcontrollers.
Dimitris recently followed up on his latest “stupid project” (that’s the name of his blog, not being demeaning here :)) by running and benchmarking TensorFlow Lite for microcontrollers on various Linux SBCs.
But why? you might ask. Dimitris tried to build the tflite C++ API designed for Linux, but found it was hard to build, and no pre-built binaries are available except for x86_64. He had no such issues with the tflite-micro API, even though it’s really meant for bare-metal MCU platforms.
Let’s get straight to the results, which also include a Ryzen desktop platform for reference:
| SBC | Average for 1000 runs (ms) |
|---|---|
| Ryzen 2700X (not an SBC) | 2.19 |
| Raspberry Pi 3 B+ | 13.47 |
| NanoPi K1 Plus | 14.32 |
| Orange Pi Prime | 18.40 |
| STM32F746 @ 216 MHz | 76.75 |
| STM32F746 @ 288 MHz | 57.95 |
And in chart form.
The Ryzen 2700X processor is the fastest, but the Rockchip RK3399 CPU found in the NanoPi NEO4 is only 2.6 times slower, and outperforms all other Arm SBCs, including the Jetson Nano. Not bad for a $50 board. The Allwinner H3 based NanoPi Neo board also deserves a mention since, at $10, it offers the best performance/price ratio in those tests.
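To put these numbers in perspective, the slowdown relative to the Ryzen 2700X can be worked out directly. The short Python sketch below uses only the rows listed in the table above; the `avg_ms` dictionary and the `slowdown` helper are illustrative names, not part of Dimitris' code.

```python
# Average inference times in ms, copied from the results table.
avg_ms = {
    "Ryzen 2700X": 2.19,
    "Raspberry Pi 3 B+": 13.47,
    "NanoPi K1 Plus": 14.32,
    "Orange Pi Prime": 18.40,
    "STM32F746 @ 216 MHz": 76.75,
    "STM32F746 @ 288 MHz": 57.95,
}

def slowdown(board, reference="Ryzen 2700X"):
    """Return how many times slower `board` is than `reference`."""
    return avg_ms[board] / avg_ms[reference]

for board in avg_ms:
    print(f"{board:22s} {slowdown(board):6.2f}x")

# Gain from overclocking the STM32F746 from 216 MHz to 288 MHz.
stm32_speedup = avg_ms["STM32F746 @ 216 MHz"] / avg_ms["STM32F746 @ 288 MHz"]
print(f"STM32F746 overclock speedup: {stm32_speedup:.2f}x")
```

With these figures, the Raspberry Pi 3 B+ comes out at roughly 6 times slower than the Ryzen, and the STM32F746 gains about 1.32x from overclocking, which is close to the 1.33x increase in clock frequency.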
If you want to try it on your own board or computer, you can do so as follows:
sudo apt install cmake g++
git clone https://email@example.com/dimtass/tflite-micro-python-comparison.git
Note that this is for AArch64 (64-bit Arm) targets. The last command line will be different for other architectures; for example, on a Cortex-A7 based SoC, the program will be named “mnist-tflite-micro-armv7l” instead.
Note that while tflite-micro is easy to port to any SBC, there are some drawbacks compared to using the tflite C++ API. Notably, tflite-micro does not support multi-threading, and it’s much slower than the tflite C++ API.
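For readers wondering what an "average for 1000 runs" figure involves, here is a minimal timing sketch in Python. It is not how Dimitris' benchmark is implemented (his program is a native binary); `average_ms` and `fake_inference` are hypothetical names, and the workload is a placeholder standing in for a real model invocation.

```python
import time

def average_ms(fn, runs=1000):
    """Call fn() `runs` times and return the mean wall-clock time in ms."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / runs

def fake_inference():
    # Placeholder workload, NOT a real neural network inference.
    return sum(i * i for i in range(1000))

if __name__ == "__main__":
    print(f"{average_ms(fake_inference):.3f} ms per run")
```

Averaging over many runs smooths out scheduler noise and one-off effects such as cold caches, which matters when comparing boards whose results differ by only a few milliseconds.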
| CPU | tflite-micro/tflite speed ratio |
|---|---|
| Jetson Nano (5W) | 9.46x |
| Jetson Nano (MAXN) | 3.86x |
The model is also embedded in the executable instead of being loaded from a file, unless you implement your own parser. You’ll find a more detailed analysis and explanation in Dimtass’ blog post.
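Embedding a model in an executable usually means turning the model file into a C array that gets compiled into the program, in the same spirit as `xxd -i`. The Python sketch below illustrates the idea; the `to_c_array` helper and the `model_data` symbol name are assumptions made for this example and may not match what Dimitris' repository uses.

```python
def to_c_array(data: bytes, name: str = "model_data", per_line: int = 12) -> str:
    """Render `data` as a C array definition, similar in spirit to `xxd -i`."""
    lines = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (
        f"const unsigned char {name}[] = {{\n{body}\n}};\n"
        f"const unsigned int {name}_len = {len(data)};\n"
    )

if __name__ == "__main__":
    # Dummy bytes standing in for the contents of a model file.
    print(to_c_array(bytes([0x1c, 0x00, 0x00, 0x00, 0x54, 0x46, 0x4c, 0x33])))
```

The generated source would be compiled together with the application, so swapping the model means rebuilding the binary, which is the trade-off compared to loading the model from a file at runtime.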