August 9, 2017 by Jean-Luc Aufranc (CNXSoft) - 16 Comments

Movidius Neural Compute Stick Shown to Boost Deep Learning Performance by about 3 Times on Raspberry Pi 3 Board

Intel recently launched the Movidius Neural Compute Stick (MvNCS) for low-power, USB-based deep learning applications such as object recognition, and after some initial confusion, we could confirm the stick can also be used on ARM-based platforms such as the Raspberry Pi 3. Kochi Nakamura, who wrote the code for GPU-accelerated object recognition on the Raspberry Pi 3 board, got hold of a sample in order to compare the performance of GPU and MvNCS acceleration.

His first attempt was puzzling: with GoogLeNet, Raspberry Pi 3 + MvNCS achieved an average inference time of about 560 ms, against 320 ms when using the VideoCore IV GPU in the RPi3 board. But it then turned out that the “stream_infer.py” demo would only use one of the 12 VLIW 128-bit vector SHAVE processors in Intel’s Movidius Myriad 2 VPU, and after enabling all 12 cores instead of just one, the average time per inference dropped to around 108 ms. That’s almost 3 times faster than using the GPU in the RPi3 for this specific demo, and results may vary for other demos and applications.
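As a quick check on the figures above, the ratios can be worked out directly. This is only arithmetic on the average times quoted in this post; the variable and function names are mine and not part of any SDK.

```python
# Average inference times quoted in the post, in milliseconds.
gpu_ms = 320.0           # Idein's VideoCore IV GPU demo
ncs_1_shave_ms = 560.0   # MvNCS with a single SHAVE in use
ncs_12_shave_ms = 108.0  # MvNCS with all 12 SHAVEs enabled


def speedup(baseline_ms, new_ms):
    """Return how many times faster new_ms is than baseline_ms."""
    return baseline_ms / new_ms


ncs_vs_gpu = speedup(gpu_ms, ncs_12_shave_ms)
gpu_vs_single_shave = speedup(ncs_1_shave_ms, gpu_ms)
inferences_per_second = 1000.0 / ncs_12_shave_ms

print(f"MvNCS (12 SHAVEs) vs GPU: {ncs_vs_gpu:.2f}x")
print(f"GPU vs MvNCS (1 SHAVE):   {gpu_vs_single_shave:.2f}x")
print(f"MvNCS (12 SHAVEs) rate:   {inferences_per_second:.1f} inferences/s")
```

The first ratio comes out just under 3, which is where the "about 3 times" in the title comes from.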

Here’s the description from the YouTube video:

Comparison of deep learning inference acceleration by Movidius’ Neural Compute Stick (MvNCS) and by Idein’s software which uses Raspberry Pi’s GPU (VideoCore IV) without any extra computing resources.

Movidius’ demo runs GoogLeNet with 16-bit floating point precision. Average inference time is 108ms.
We used MvNC SDK 1.07.07 and their official demo script without any changes. (ncapi/py_examples/stream_infer/stream_infer.py)
It seems something is wrong with the inference results.
We recompiled graph file with -s12 option to use 12 SHAVE vector processor simultaneously.

Idein’s demo also runs GoogLeNet with 32-bit floating point precision. Average inference time is 320ms.

It’s interesting to note the GPU demo used 32-bit floating point precision, against 16-bit floating point precision on the Neural Compute Stick, although it’s unclear to me how that may affect the performance of such algorithms. Intel recommends a USB 3.0 interface for the MvNCS, and the Raspberry Pi 3 only comes with a USB 2.0 interface whose bandwidth is shared between the USB webcam and the MvNCS, so it’s possible an ARM board with a USB 3.0 interface for the stick and a separate USB interface for the webcam could perform better. Has anybody tested it? A USB 3.0 interface and hub would also make it possible to cascade several Neural Compute Sticks.
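To get a feel for how much the USB 2.0 link could matter, here is a rough back-of-the-envelope sketch. It assumes the stick receives one 224x224 RGB GoogLeNet input per inference as 16-bit floats, and it ignores the webcam traffic sharing the same bus. The 480 Mbit/s figure is the nominal USB 2.0 high-speed rate; the 35 MB/s "practical" figure is my own assumed ballpark, not a measurement.

```python
# Assumed input tensor: GoogLeNet's 224x224 RGB image as 16-bit floats.
WIDTH, HEIGHT, CHANNELS = 224, 224, 3
BYTES_PER_FP16 = 2

# Nominal USB 2.0 high-speed signalling rate, converted to bytes per second.
USB2_NOMINAL_BYTES_PER_S = 480e6 / 8
# ASSUMPTION: rough real-world bulk throughput, not measured on a Pi.
USB2_ASSUMED_BYTES_PER_S = 35e6

AVG_INFERENCE_MS = 108.0  # 12-SHAVE figure quoted in the post


def transfer_ms(num_bytes, bytes_per_s):
    """Time in milliseconds to move num_bytes at bytes_per_s."""
    return 1000.0 * num_bytes / bytes_per_s


input_bytes = WIDTH * HEIGHT * CHANNELS * BYTES_PER_FP16
nominal_ms = transfer_ms(input_bytes, USB2_NOMINAL_BYTES_PER_S)
assumed_ms = transfer_ms(input_bytes, USB2_ASSUMED_BYTES_PER_S)

print(f"Input tensor size: {input_bytes} bytes")
print(f"Transfer at nominal rate: {nominal_ms:.1f} ms")
print(f"Transfer at assumed rate: {assumed_ms:.1f} ms "
      f"({100.0 * assumed_ms / AVG_INFERENCE_MS:.0f}% of {AVG_INFERENCE_MS:.0f} ms)")
```

Under these assumptions the input transfer is a small part of the 108 ms average, so any USB penalty on the Raspberry Pi 3 would more likely come from bus sharing with the webcam or per-transfer overhead than from raw bandwidth for a single stick. Only a test on real hardware can settle it.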

Jean-Luc Aufranc (CNXSoft)

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.


16 Comments

tkaiser

6 years ago

Interesting that switching from 1 to 12 vector SHAVE processors only decreases the inference time from 560ms to 108ms (5 times better when throwing 12 times more hardware at the problem). Maybe it’s just another nice example of a use case where Raspberries don’t really fit due to the ‘single USB2 port’ bottleneck (camera stream trashing Movidius performance?). I hope the Movidius team starts to focus on more interesting ARM hardware instead. If they would also support CSI camera streams the rather weird video encode/decode efforts could be saved (decoding might probably harm performance too?) and some small Allwinner quad-core H3…

blu

6 years ago

@tkaiser
I too suspect it’s very likely a buffer transfer bottleneck at this stage. The problem itself is Embarrassingly Parallel ™, so it should scale well with ALUs.
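The scaling question raised in these two comments can be put into numbers with a simple Amdahl's law fit. This is a sketch only: it assumes the 560 ms and 108 ms averages correspond to 1 and 12 SHAVEs, and that the time splits cleanly into a fixed part and a part that scales perfectly with the SHAVE count, which is a simplification.

```python
def amdahl_serial_fraction(t1, tn, n):
    """Solve T(n) = T(1) * (s + (1 - s) / n) for the serial fraction s."""
    ratio = tn / t1
    return (ratio - 1.0 / n) / (1.0 - 1.0 / n)


t1_ms, t12_ms, shaves = 560.0, 108.0, 12

speedup = t1_ms / t12_ms          # observed speedup from 1 to 12 SHAVEs
efficiency = speedup / shaves     # fraction of ideal 12x scaling achieved
serial = amdahl_serial_fraction(t1_ms, t12_ms, shaves)
fixed_ms = serial * t1_ms         # time that does not shrink with more SHAVEs

print(f"Speedup: {speedup:.2f}x, parallel efficiency: {efficiency:.0%}")
print(f"Implied non-scaling fraction: {serial:.1%} (about {fixed_ms:.0f} ms)")
```

Under this model, a fixed cost of several tens of milliseconds per inference (transfers, host-side work, or anything else that does not scale) would be enough to explain a 5x rather than 12x improvement.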

crashoverride

6 years ago

I think @Jean-Luc Aufranc (CNXSoft) should contact the company for comment. If the 3X number is true, then an Odroid XU4 would absolutely destroy this stick in performance and cost (6 core Mali + 8 ARM NEON cores).

tkaiser

6 years ago

@crashoverride I would think using this Movidius stick is something where battery life (drones/robots) is important so something like an ODROID-XU4 utilizing all CPU/GPU cores has a clear disadvantage here (the Movidius Myriad 2 VPU is said to have a 1W power profile). Relying on the performance numbers after ‘We recompiled graph file with -s12 option to use 12 SHAVE vector processor simultaneously’ in active benchmarking mode it should be pretty easy to identify/graph bottlenecks. Just test through -s1 to -s12 and plot an ‘inference time’ graph. Then change the platform, leave the bottlenecked Raspberry Pi, follow Intel’s recommendation…
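The -s1 to -s12 sweep suggested here could be organised along these lines. This is a hypothetical harness, not Movidius code: `sweep_shaves`, `report` and `placeholder_measure` are names I made up, and the timing function is a synthetic stand-in. A real run would replace it with a function that recompiles the graph file with the given -s value and times inferences on the stick; that part is left out because the exact SDK invocation is not given in the post.

```python
from typing import Callable, Dict, Iterable


def sweep_shaves(measure_ms: Callable[[int], float],
                 counts: Iterable[int] = range(1, 13)) -> Dict[int, float]:
    """Run measure_ms(n) for each SHAVE count n and collect the timings."""
    return {n: measure_ms(n) for n in counts}


def report(results: Dict[int, float]) -> str:
    """Format timings, speedup vs. the smallest count, and a text bar."""
    base_ms = results[min(results)]
    lines = []
    for n in sorted(results):
        ms = results[n]
        bar = "#" * round(ms / 50)  # one '#' per 50 ms
        lines.append(f"-s{n:<2} {ms:7.1f} ms  x{base_ms / ms:4.2f}  {bar}")
    return "\n".join(lines)


def placeholder_measure(n: int) -> float:
    """SYNTHETIC stand-in, not real data: fixed cost plus a scaling part."""
    return 100.0 + 1200.0 / n


results = sweep_shaves(placeholder_measure)
print(report(results))
```

If the real curve flattens out early, as the synthetic one does, that points to a fixed per-inference cost; repeating the same sweep on a USB 3.0 board would show how much of that cost comes from the platform.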

crashoverride

6 years ago

@tkaiser
The Exynos in XU4 was designed for use in mobile phone products. It does “low power” better than chips targeting set-top boxes. The concept behind big.LITTLE is that it’s more power efficient to “burst” high computation tasks since they complete faster. The shorter time results in power savings. It can only be speculated which would be more efficient without any actual testing (Pi 3 + compute stick vs. XU4). This topic *almost* has enough of my curiosity to conduct experiments.

Author

cnxsoft

6 years ago

@crashoverride
It might be a challenge, as it looks like OpenCL may not be well supported by TensorFlow.
The thread started in 2015, but people are still involved today.
https://github.com/tensorflow/tensorflow/issues/22

So if I understood correctly, you might be able to install TensorFlow with GoogleNet model on ODROID-XU4, but probably with CPU support only. It will be slower than the Raspberry Pi 3, unless you find a way to make OpenCL work.

willmore

6 years ago

Does the XU4 have that much GPU OpenCL ability? I know it has four very capable big cores, but we’re not looking at a CPU benchmark here, this is the VideoCore IV vs the Movidius NCS. Are the A53 cores on the RPi doing much?

tkaiser

6 years ago

crashoverride: The concept behind big.LITTLE is that its more power efficient to “burst” high computation tasks since they complete faster.
Sure. But that mostly applies to use cases like a mobile phone (let the thing run on the little cores by default, as soon as some peak performance is needed switch to the big CPU cores, finish the work and switch back over to the little ones). But in such a ‘deep learning’ situation as discussed above is there a ‘completes faster’ or are the CPU cores utilized at full clockspeed permanently to achieve better results (I’m a total…

crashoverride

6 years ago

@willmore
I have not done any formal tests/benchmarks. Based on experiments, it amazes me that XU4 is faster than more modern Malis in recent SoCs. I attribute this to core count and architecture.

crashoverride

6 years ago

@tkaiser
My *theory* is that if the XU4 is orders of magnitude faster, then the power required (and thus throttling) to achieve the same amount of work in the same time would be lower than RPi + compute stick. It’s not an assertion that higher FPS (more work) will result in lower power requirements.

I would also like to see a comparison of ROCK64 vs RPi3. However, I do not believe the bottleneck is the image acquisition (USB/CSI bandwidth). I speculate a very low resolution image is used (640×480?).

crashoverride

6 years ago

@Jean-Luc Aufranc (CNXSoft)
The Intel compute stick does not support Tensorflow from what I understand. It uses Caffe Deep Learning so comparisons should be based on that to ensure we are all measuring the same thing.

https://github.com/BVLC/caffe/tree/opencl

Author

cnxsoft

6 years ago

@crashoverride
I was referring to the code running the RPi 3 GPU demo. I thought it was based on TensorFlow, but I’m not sure what they are actually running, since they did not provide that many details.

nobe

6 years ago

@crashoverride
XU4 has mali T628 MP6 (midgard architecture), this gpu model has 2 arithmetic pipelines per core, which means 12 total (this could explain part of your experiments results)

if i remember correctly, mali 4xx/T720/T820 only have 1 AP per core while mali T760/T830/T860 have also 2 AP per core

mic_s

6 years ago

(1) USB20/USB30 ? : When ONE stick is connected the AI is here not bandwidth-limited, so there is not a great difference between USB20/USB30. When more than one stick is connected (via a USB-hub) to one USB, there is a difference between USB20 and USB30
(2) The 12 QPUs in the videocore IV don’t support 16 bit FP, they are 32 bit FP only. (This has been changed with the videocore V. BTW there are other changes as well: The videocore V uses the MMU, so CMA isn’t needed any more.) In contrast to the videocore IV, the stick does…

Mark Jay

6 years ago

I was able to get the Movidius NCS working with the rock64 board, which has USB3. I did see a performance boost. With a tiny-yolo model, it went from 1.4 FPS on Raspberry Pi 3B+ to 2.0 FPS with rock64. I posted the results to YouTube (https://www.youtube.com/watch?v=AXzIYk7-lr8)

onebir

6 years ago

Mark Jay

Might be worth trying this on the rock64 GPU via plaidml*:
https://github.com/plaidml/plaidml
…which supports ONNX:
https://github.com/onnx/models/tree/master/tiny_yolov2

*It’s very easy to install, got it working in a few minutes on my laptop’s discrete GPU (NVS 5200M, too old to support CUDA/cuDNN etc). Should work on any GPU supporting OpenCL 1.2+ (I think).