Movidius Neural Compute Stick Shown to Boost Deep Learning Performance by about 3 Times on Raspberry Pi 3 Board

Intel recently launched the Movidius Neural Compute Stick (MvNCS) for low-power, USB-based deep learning applications such as object recognition, and after some initial confusion, we could confirm the stick can also be used on ARM-based platforms such as the Raspberry Pi 3. Kochi Nakamura, who wrote the code for GPU-accelerated object recognition on the Raspberry Pi 3 board, got hold of one sample in order to compare the performance of GPU and MvNCS acceleration.

His first attempt was quite confusing: with GoogLeNet, Raspberry Pi 3 + MvNCS achieved an average inference time of about 560 ms, against 320 ms when using the VideoCore IV GPU in the RPi3 board. But it was then discovered that the “stream_infer.py” demo only used one of the 12 VLIW 128-bit vector SHAVE processors in Intel’s Movidius Myriad 2 VPU, and after enabling all 12 cores, the average time per inference dropped to around 108 ms. That’s almost 3 times faster than using the GPU in the RPi3 for this specific demo, and results may vary for other demos / applications.
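The ratios behind the headline are easy to check. Here is a minimal Python sketch that uses only the three average times quoted above; the dictionary labels and function name are made up for illustration.

```python
# Average GoogLeNet inference times (ms) as reported in the article.
TIMES_MS = {
    "MvNCS, 1 SHAVE": 560.0,
    "VideoCore IV GPU": 320.0,
    "MvNCS, 12 SHAVEs": 108.0,
}


def speedup(baseline_ms: float, new_ms: float) -> float:
    """Return how many times faster new_ms is compared with baseline_ms."""
    return baseline_ms / new_ms


if __name__ == "__main__":
    best_ms = TIMES_MS["MvNCS, 12 SHAVEs"]
    for name, ms in TIMES_MS.items():
        # 1000 / ms converts a per-inference latency into inferences per second.
        print(f"{name:18s} {ms:6.0f} ms  {1000.0 / ms:5.2f} inferences/s  "
              f"ratio vs 12 SHAVEs: {speedup(ms, best_ms):.2f}x")
```

The GPU-to-stick ratio works out to 320 / 108, or a little under 3, which matches the "almost 3 times" figure.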

Here’s the description from the YouTube video:

Comparison of deep learning inference acceleration by Movidius’ Neural Compute Stick (MvNCS) and by Idein’s software which uses Raspberry Pi’s GPU (VideoCore IV) without any extra computing resources.

Movidius’ demo runs GoogLeNet with 16-bit floating point precision. Average inference time is 108ms.
We used MvNC SDK 1.07.07 and their official demo script without any changes. (ncapi/py_examples/stream_infer/stream_infer.py)
It seems something is wrong with the inference results.
We recompiled graph file with -s12 option to use 12 SHAVE vector processor simultaneously.

Idein’s demo also runs GoogLeNet with 32-bit floating point precision. Average inference time is 320ms.

It’s interesting to note the GPU demo used 32-bit floating point precision, against 16-bit floating point precision on the Neural Compute Stick, although it’s unclear to me how that may affect the performance of such algorithms. Intel recommends a USB 3.0 interface for MvNCS, and the Raspberry Pi 3 only comes with a USB 2.0 interface whose bandwidth is shared between the USB webcam and the MvNCS, so it’s possible an ARM board with a USB 3.0 interface for the stick and a separate USB interface for the webcam could perform better. Has anybody tested it? A USB 3.0 interface and hub would also allow cascading several Neural Compute Sticks.
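One part of the precision question can be quantified without any hardware: a 16-bit float takes half the storage of a 32-bit float, at the cost of fewer significant digits. The sketch below illustrates both effects with Python's standard `struct` module. It assumes the standard 224×224×3 GoogLeNet input, and says nothing about how either demo actually lays out its buffers.

```python
import struct

# Assumption: standard GoogLeNet input of 224 x 224 pixels, 3 colour channels.
WIDTH, HEIGHT, CHANNELS = 224, 224, 3


def tensor_bytes(fmt: str) -> int:
    """Size in bytes of one input tensor stored with struct format fmt."""
    return WIDTH * HEIGHT * CHANNELS * struct.calcsize(fmt)


def roundtrip(value: float, fmt: str) -> float:
    """Store value in the given float format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


if __name__ == "__main__":
    # "<e" is IEEE 754 half precision, "<f" is single precision.
    for label, fmt in (("FP16", "<e"), ("FP32", "<f")):
        print(f"{label}: {tensor_bytes(fmt)} bytes per input, "
              f"0.1 is stored as {roundtrip(0.1, fmt)!r}")
```

Halving the bytes per value halves the memory and bus traffic for the same tensor, which is one plausible reason FP16 hardware can be faster. Whether the reduced precision changes the recognition results depends on the network and is not something this sketch can show.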

tkaiser

Interesting that switching from 1 to 12 vector SHAVE processors only decreases the inference time from 560ms to 108ms (about 5 times better when throwing 12 times more hardware at the problem).
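To put a number on that observation, here is a rough Python sketch. It assumes a simple Amdahl-style model, T(n) = overhead + parallel / n, where only the parallel part scales with the number of SHAVEs. The model is an assumption, and two data points can fit it but cannot validate it.

```python
from typing import Tuple


def parallel_efficiency(t1_ms: float, tn_ms: float, n: int) -> float:
    """Achieved speedup divided by the ideal speedup of n."""
    return (t1_ms / tn_ms) / n


def fit_fixed_overhead(t1_ms: float, tn_ms: float, n: int) -> Tuple[float, float]:
    """Solve T(k) = overhead + parallel / k from T(1) and T(n).

    Subtracting the two equations gives
    T(1) - T(n) = parallel * (1 - 1/n), hence the formula below.
    """
    parallel = (t1_ms - tn_ms) * n / (n - 1)
    overhead = t1_ms - parallel
    return overhead, parallel


if __name__ == "__main__":
    t1, t12, shaves = 560.0, 108.0, 12  # figures quoted in the article
    overhead, parallel = fit_fixed_overhead(t1, t12, shaves)
    print(f"efficiency on {shaves} SHAVEs: {parallel_efficiency(t1, t12, shaves):.0%}")
    print(f"non-scaling part: {overhead:.1f} ms, scaling part: {parallel:.1f} ms")
    print(f"speedup ceiling under this model: {t1 / overhead:.1f}x")
```

Under this model, roughly 67 ms of each inference does not scale with the SHAVE count, which caps the speedup at about 8.4x no matter how many cores are added. That fixed part could be USB transfers, host-side pre-processing, or work on the stick that simply does not parallelize; the numbers alone cannot tell which.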

Maybe it’s just another nice example of a use case where Raspberries don’t really fit due to the ‘single USB2 port’ bottleneck (camera stream trashing Movidius performance?). I hope the Movidius team starts to focus on more interesting ARM hardware instead. If they also supported CSI camera streams, the rather weird video encode/decode efforts could be saved (decoding probably harms performance too?), and some small Allwinner quad-core H3 boards that are popular on drones (Orange Pi Lite, NanoPi Air, Orange Pi Zero 2+) could become nice target platforms, even if the USB connection there requires using pins on the GPIO header (but looking at the size of the stick, ‘USB on pin header’ is also the only way to use more than one Movidius dongle at the same time on most SBCs, except maybe OPi Lite, OPi PC/PC+ and NanoPi M1).

And if USB3 is Movidius’ recommendation for the stick, I would really love to see results made with a board like ROCK64 with the stick on the USB3 port and the camera on a separate USB2 receptacle. In case the Movidius software installation has problems running on arm64 distros, there are a couple of community-provided armhf Debian/Ubuntu images to test with.

blu

@tkaiser
I too suspect it’s very likely a buffer transfer bottleneck at this stage. The problem itself is Embarrassingly Parallel ™, so it should scale well with ALUs.

crashoverride

I think @cnxsoft should contact the company for comment. If the 3X number is true, then an Odroid XU4 would absolutely destroy this stick in performance and cost (6 core Mali + 8 ARM NEON cores).

tkaiser

@crashoverride
I would think this Movidius stick is meant for use cases where battery life is important (drones/robots), so something like an ODROID-XU4 utilizing all CPU/GPU cores has a clear disadvantage here (the Movidius Myriad 2 VPU is said to have a 1W power profile).

Relying on the performance numbers after ‘We recompiled graph file with -s12 option to use 12 SHAVE vector processor simultaneously’, it should be pretty easy to identify/graph bottlenecks in active benchmarking mode. Just test from -s1 through -s12 and plot an ‘inference time’ graph. Then change the platform: leave the bottlenecked Raspberry Pi, follow Intel’s recommendation to access the Movidius stick through USB3 at ‘SuperSpeed’ while having the camera on a separate USB bus, and draw the graph again (a $25 ROCK64 comes to my mind here as the device of choice).
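A sweep like that could be driven by a small timing harness along these lines. This is only a sketch: `make_runner` is a hypothetical hook that would have to load the graph compiled for a given SHAVE count and return a function performing one inference. The stand-in below just burns CPU cycles so the script runs anywhere, and its timings mean nothing.

```python
import statistics
import time
from typing import Callable, Dict, Iterable


def mean_latency_ms(run_once: Callable[[], object],
                    runs: int = 20, warmup: int = 3) -> float:
    """Average wall-clock time of run_once() in ms, after a few warm-up calls."""
    for _ in range(warmup):
        run_once()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


def sweep(make_runner: Callable[[int], Callable[[], object]],
          shave_counts: Iterable[int]) -> Dict[int, float]:
    """Measure the mean latency for each SHAVE count."""
    return {n: mean_latency_ms(make_runner(n)) for n in shave_counts}


def placeholder_runner(shaves: int) -> Callable[[], object]:
    """Hypothetical stand-in for 'run one inference on the -s<shaves> graph'."""
    def run_once() -> int:
        return sum(range(100_000 // shaves))  # dummy work, NOT an inference
    return run_once


if __name__ == "__main__":
    for shaves, ms in sweep(placeholder_runner, range(1, 13)).items():
        print(f"-s{shaves:<2d} {ms:8.3f} ms")
```

Running the same harness on two boards and plotting both curves would show whether the latency flattens at the same SHAVE count on each, which is the kind of evidence needed to separate a USB bottleneck from a limit inside the stick.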

In case the dependencies really require Raspbian, this also isn’t an issue, especially with ROCK64, since all that’s needed is choosing any of the community OS images, then replacing the entire rootfs on the 7th partition with a Raspbian userland (only /lib/modules/$(uname -r)/ has to be preserved!) and re-testing.

crashoverride

@tkaiser
The Exynos in XU4 was designed for use in mobile phone products. It does “low power” better than chips targeting set-top boxes. The concept behind big.LITTLE is that it’s more power efficient to “burst” high computation tasks since they complete faster. The shorter time results in power savings. Which would be more efficient (PI3 + compute stick vs. XU4) can only be speculated without any actual testing. This topic *almost* has enough of my curiosity to conduct experiments.

willmore

Does the XU4 have that much GPU OpenCL ability? I know it has four very capable big cores, but we’re not looking at a CPU benchmark here, this is the Videocore IV vs the Movidius NCS. Are the A53 cores on the RPi doing much?

tkaiser

crashoverride :
The concept behind big.LITTLE is that it’s more power efficient to “burst” high computation tasks since they complete faster.

Sure. But that mostly applies to use cases like a mobile phone (let the thing run on the little cores by default, as soon as some peak performance is needed switch to the big CPU cores, finish the work and switch back over to the little ones). But in such a ‘deep learning’ situation as discussed above is there a ‘completes faster’ or are the CPU cores utilized at full clockspeed permanently to achieve better results (I’m a total noob in this area)?

Wrt NEON: I did a test a few months ago on my XU4 and let cpuminer run as an example of highly optimized NEON software: 2.27 khash/s on the little cores at 1.4 GHz vs. 8.23 khash/s on the big cores starting at 2.0 GHz and immediately throttling, so maybe it’s even 8.50 khash/s when constantly running at 2.0 GHz (I find Hardkernel’s fansink somewhat disappointing). Since cpuminer also heavily depends on memory bandwidth, I’m not entirely sure how to interpret these numbers, but if the Exynos should shine with NEON performance you clearly want to utilize the big cores as well, and also at 100% CPU utilization. And then the Exynos is quite the opposite of ‘power efficient’…

Anyway: I still think a more interesting test wrt this Movidius stick would be to leave the crippled Raspberry platform and move on to ARM boards of similar size that have at least 1 USB3 port and a separate USB2 port for a camera to avoid being bottlenecked by limited IO.

And CSI support seems mandatory to me for the use cases such an external deep learning device would most probably be used for (drones/robots — since otherwise it will be hard to use the Movidius stick with a Raspberry Pi Zero W or one of the similar small but better suited OrangePi or NanoPi mentioned above)

crashoverride

@willmore
I have not done any formal tests/benchmarks. Based on experiments, it amazes me that XU4 is faster than more modern Malis in recent SoCs. I attribute this to core count and architecture.

crashoverride

@tkaiser
My *theory* is that if the XU4 is orders of magnitude faster, then the power required (and thus throttling) to achieve the same amount of work in the same time would be lower than RPi + compute stick. It’s not an assertion that higher FPS (more work) will result in lower power requirements.

I would also like to see a comparison of ROCK64 vs RPi3. However, I do not believe the bottleneck is the image acquisition (USB/CSI bandwidth). I speculate a very low resolution image is used (640×480?).

crashoverride

@cnxsoft
The Intel compute stick does not support Tensorflow from what I understand. It uses Caffe Deep Learning so comparisons should be based on that to ensure we are all measuring the same thing.

https://github.com/BVLC/caffe/tree/opencl

nobe

@crashoverride
XU4 has mali T628 MP6 (midgard architecture), this gpu model has 2 arithmetic pipelines per core, which means 12 total (this could explain part of your experiments results)

if i remember correctly, mali 4xx/T720/T820 only have 1 AP per core while mali T760/T830/T860 have also 2 AP per core

mic_s

(1) USB 2.0 vs USB 3.0: when ONE stick is connected, inference is not bandwidth-limited here, so there is not a great difference between USB 2.0 and USB 3.0. When more than one stick is connected (via a USB hub) to one USB port, there is a difference between USB 2.0 and USB 3.0.
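A back-of-the-envelope check of the single-stick case, as a Python sketch. It assumes one 224×224×3 GoogLeNet input in FP16 is sent per inference and uses the nominal signalling rates of USB 2.0 (480 Mbit/s) and USB 3.0 (5 Gbit/s). Real throughput is lower because of protocol overhead and bus sharing, so the `efficiency` parameter is a guess to be adjusted, not a measured value.

```python
USB2_BITS_PER_S = 480e6  # USB 2.0 High Speed nominal signalling rate
USB3_BITS_PER_S = 5e9    # USB 3.0 SuperSpeed nominal signalling rate

# Assumption: one FP16 GoogLeNet input (224 x 224 x 3 values, 2 bytes each).
INPUT_BYTES = 224 * 224 * 3 * 2


def transfer_ms(num_bytes: int, bits_per_s: float, efficiency: float = 1.0) -> float:
    """Time in ms to move num_bytes at the given rate.

    efficiency scales the nominal rate down to an assumed usable fraction.
    """
    return num_bytes * 8 / (bits_per_s * efficiency) * 1000.0


if __name__ == "__main__":
    for label, rate in (("USB 2.0", USB2_BITS_PER_S), ("USB 3.0", USB3_BITS_PER_S)):
        ideal = transfer_ms(INPUT_BYTES, rate)
        halved = transfer_ms(INPUT_BYTES, rate, efficiency=0.5)  # arbitrary 50% guess
        print(f"{label}: {ideal:.2f} ms at nominal rate, {halved:.2f} ms at 50%")
```

At the nominal USB 2.0 rate the input takes about 5 ms to transfer, a small fraction of the 108 ms inference time, which is consistent with the claim that a single stick is not bandwidth-limited. A webcam stream sharing the same bus would eat into that margin, and this estimate does not account for it.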

(2) The 12 QPUs in the VideoCore IV don’t support 16-bit FP, they are 32-bit FP only. (This has been changed with the VideoCore V. BTW there are other changes as well: the VideoCore V uses the MMU, so CMA isn’t needed any more.) In contrast to the VideoCore IV, the stick does support 16-bit floats, so the stick is (very roughly speaking) about 2 times faster than the VideoCore IV in pi-0/1/2/3.

(3) Be fair, take the price point into account. The stick is currently about $80, the pi-0 (with the VideoCore IV and its 12 QPUs, and 512 MB RAM) is $5-$10. That’s a great difference.

Michael

Mark Jay

I was able to get the Movidius NCS working with the ROCK64 board, which has USB3. I did see a performance boost: with a tiny-yolo model, it went from 1.4 FPS on the Raspberry Pi 3B+ to 2.0 FPS with the ROCK64. I posted the results to YouTube (https://www.youtube.com/watch?v=AXzIYk7-lr8)
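For comparison with the per-inference times in the article, the two frame rates can be converted to per-frame times. A minimal Python sketch follows; the function names are made up for illustration, and the result covers the whole pipeline (capture, transfer, inference, display), not inference alone.

```python
def frame_time_ms(fps: float) -> float:
    """Average time spent per frame, in milliseconds."""
    return 1000.0 / fps


def throughput_gain(old_fps: float, new_fps: float) -> float:
    """Relative throughput increase, e.g. 0.25 means 25% more frames per second."""
    return new_fps / old_fps - 1.0


if __name__ == "__main__":
    rpi_fps, rock64_fps = 1.4, 2.0  # figures quoted in the comment
    print(f"Raspberry Pi 3B+: {frame_time_ms(rpi_fps):.0f} ms per frame")
    print(f"ROCK64:           {frame_time_ms(rock64_fps):.0f} ms per frame")
    print(f"throughput gain:  {throughput_gain(rpi_fps, rock64_fps):.0%}")
```

That is about 714 ms versus 500 ms per frame, or roughly 43% more frames per second. The two boards also differ in CPU and memory, so the gain cannot be attributed to USB3 alone.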

onebir

Might be worth trying this on the rock64 GPU via plaidml*:
https://github.com/plaidml/plaidml
…which supports ONNX:
https://github.com/onnx/models/tree/master/tiny_yolov2

*It’s very easy to install, got it working in a few minutes on my laptop’s discrete GPU (NVS 5200M, too old to support CUDA/cuDNN etc). Should work on any GPU supporting OpenCL 1.2+ (I think).