How to Get Started with OpenCL on ODROID-XU4 Board (with Arm Mali-T628MP6 GPU)


Last week, I reviewed Ubuntu 18.04 on the ODROID-XU4 board, testing most of the advertised features. However, I skipped one of the features listed in the Changelog:

GPU hardware acceleration via OpenGL ES 3.1 and OpenCL 1.2 drivers for Mali T628MP6 GPU

While I tested OpenGL ES with tools like glmark2-es2 and es2gears, as well as WebGL demos in Chromium, I did not test OpenCL, since I’m not that familiar with it, except that it’s used for GPGPU (General Purpose GPU) computing to accelerate tasks like image/audio processing. That was a good excuse to learn a bit more, try it out on the board, and write a short guide to get started with OpenCL on hardware with an Arm Mali GPU. The purpose of this tutorial is to show how to run OpenCL samples and an OpenCL utility, and I won’t go into the nitty-gritty of OpenCL code. If you want to learn more about OpenCL coding on Arm, one way would be to check out the source code of the provided samples.

Arm Compute Library and OpenCL Samples

Since I did not know where to start, Hardkernel redirected me to a forum thread where we are shown how to use Arm Compute Library to test OpenCL on the board.

The relevant post is dated January 2018 and relies on Compute Library 17.12, but you can check out the latest version and documentation at https://arm-software.github.io/ComputeLibrary/latest/. The latest version is 18.03 at the time of writing this post, so I retrieved it and tried to build it as instructed:


However, it failed with:


Looking at the kernel log with dmesg, it was clear the board ran out of memory: “Out of memory: Kill process 4984 (cc1plus) Out of memory: Kill process 4984 (cc1plus)”. So I had to set up a swap file (1GB):


…giving us more memory…


before restarting the build with NEON and OpenCL enabled:


and this time it could complete:


[Update: Based on comments below, setting up ZRAM instead of swap is usually better in case you run out of memory]
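Since the failure came from parallel compiler jobs exhausting RAM, it can help to check available memory before picking a job count. The sketch below is only an illustration: it parses /proc/meminfo and suggests a -j value from a hypothetical budget of 512 MB per compiler job (`kb_per_job` is an assumption, not a measured figure for the Compute Library build).

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def suggest_jobs(info, cores, kb_per_job=512 * 1024):
    """Suggest a parallel job count that fits in available RAM plus free swap.

    kb_per_job is a hypothetical per-job memory budget chosen for illustration.
    """
    available = info.get("MemAvailable", info.get("MemFree", 0)) + info.get("SwapFree", 0)
    return max(1, min(cores, available // kb_per_job))


if __name__ == "__main__":
    # Only meaningful on Linux, where /proc/meminfo exists.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        print("MemAvailable (kB):", info.get("MemAvailable"))
        print("SwapTotal (kB):", info.get("SwapTotal"))
        print("Suggested -j:", suggest_jobs(info, os.cpu_count() or 1))
```

With about 1 GB available and no swap, this would suggest two jobs rather than eight. The real per-job footprint depends on the source files being compiled, so the budget is worth adjusting after watching an actual build.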

And we can copy the libraries to /usr/lib:


We have a bunch of samples to play with:


Note that some are NEON only, not using OpenCL, and the prefix explains the type of sample:

  1. cl_*.cpp –> OpenCL examples
  2. gc_*.cpp –> GLES compute shaders examples
  3. graph_*.cpp –> Graph examples
  4. neoncl_*.cpp –> NEON / OpenCL interoperability examples
  5. neon_*.cpp –> NEON examples

All samples have also been built and can be found in the build/examples directory. I ran cl_convolution after generating a raw PPM image using GIMP:


It could process the photo (5184 x 3456) in less than 6 seconds. If we look at the resulting image, we can see the OpenCL convolution converts the image to grayscale.

Original Image (Left) vs After OpenCL Convolution (Right) – Click to Enlarge
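For readers new to the operation, a 2D convolution replaces each pixel with a weighted sum of its neighbours. The pure-Python sketch below shows the idea on a tiny grayscale image. The 3x3 Gaussian kernel and the border handling are assumptions chosen for illustration; this is not the Compute Library implementation, and it runs on the CPU.

```python
# A common 3x3 Gaussian blur kernel; its weights sum to 16.
GAUSSIAN_3X3 = [
    [1, 2, 1],
    [2, 4, 2],
    [1, 2, 1],
]


def convolve3x3(image, kernel, scale):
    """Apply a 3x3 kernel to a grayscale image given as a list of rows of ints.

    Border pixels are copied unchanged to keep the sketch short. The kernel is
    not flipped, which gives the same result for symmetric kernels like the
    Gaussian above.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += kernel[ky][kx] * image[y + ky - 1][x + kx - 1]
            # Normalise and clamp to the 8-bit range.
            out[y][x] = min(255, max(0, acc // scale))
    return out


if __name__ == "__main__":
    # A single bright pixel in the middle of a dark 5x5 image.
    image = [[0] * 5 for _ in range(5)]
    image[2][2] = 160
    for row in convolve3x3(image, GAUSSIAN_3X3, 16):
        print(row)
```

Every output pixel is computed independently of the others, which is why this kind of workload maps well onto the many parallel cores of a GPU.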

So I repeated a similar operation with convert, which has not been compiled with OpenCL support and therefore runs in software only:


It took a little over 10 seconds, so almost twice the time used by the OpenCL demo. The PPM image files are however over 50MB, so part of the time is spent reading and saving the files from/to the eMMC flash. Repeating the tests gives similar results (~6s vs ~11s), so the I/O time may be negligible.
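When comparing run times like this, repeating each command a few times and looking at the spread makes the numbers easier to trust. A minimal sketch of such a helper follows. The command in the demo is a placeholder that only starts the Python interpreter; substitute the sample binary or the convert invocation you want to measure.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Run argv several times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so failed runs are not timed silently.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder command, not one of the commands used in this post.
    argv = [sys.executable, "-c", "pass"]
    durations = time_command(argv)
    print("min %.3fs, median %.3fs" % (min(durations), statistics.median(durations)))
```

The first run usually includes reading the input from storage, while later runs may hit the page cache, so a large gap between the first and the following durations hints at how much of the time is I/O.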

convert version output showing OpenCL is not part of the enabled features in ImageMagick:


It’s fun, so I tried another sample:


What did it do? When I open the file, it looks the same as the first sample (grayscale image), but it actually scaled the image (50% width, 50% height):
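One way to confirm such a change in dimensions is to read the image header directly. The sketch below assumes the sample wrote a binary PGM (P5) or PPM (P6) file, and reads the width and height from it. The file created in the demo is synthetic, not output from the board.

```python
import os
import tempfile


def read_pnm_size(path):
    """Return (magic, width, height) from a binary PGM (P5) or PPM (P6) header."""
    tokens = []
    with open(path, "rb") as f:
        while len(tokens) < 3:
            line = f.readline()
            if not line:
                raise ValueError("truncated header")
            # Header comments start with '#' and run to the end of the line.
            line = line.split(b"#", 1)[0]
            tokens.extend(line.split())
    magic = tokens[0].decode("ascii", "replace")
    if magic not in ("P5", "P6"):
        raise ValueError("not a binary PGM/PPM file: %r" % magic)
    return magic, int(tokens[1]), int(tokens[2])


if __name__ == "__main__":
    # Write a tiny synthetic 4x2 PPM file and read its size back.
    path = os.path.join(tempfile.mkdtemp(), "demo.ppm")
    with open(path, "wb") as f:
        f.write(b"P6\n# demo file\n4 2\n255\n" + bytes(4 * 2 * 3))
    print(read_pnm_size(path))
```

Running the function on both the input and the output file and comparing the two sizes shows whether the image was scaled.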


The last sample, cl_sgemm, manipulates matrices. The main goal of the three OpenCL samples (cl_*) is to show how to use OpenCL convolution, events, and SGEMM (Single-precision GEneral Matrix Multiply) with the Compute Library.
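In the usual BLAS sense, SGEMM computes alpha * A * B + beta * C for matrices A, B, C and scalars alpha, beta. The reference sketch below spells this out in plain Python as an illustration of the math only. It is not the Compute Library API, and Python floats are double precision rather than the single precision the "S" stands for.

```python
def sgemm(alpha, a, b, beta, c):
    """Return alpha * (a x b) + beta * c for matrices given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    if (any(len(row) != k for row in a)
            or any(len(row) != n for row in b)
            or len(c) != m
            or any(len(row) != n for row in c)):
        raise ValueError("incompatible matrix shapes")
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
         for j in range(n)]
        for i in range(m)
    ]


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(sgemm(2.0, a, b, 3.0, c))
```

Each element of the result is an independent dot product, so a GPU can compute many of them at once.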

You can also play with other samples for NEON and OpenGL ES, and Arm Community published a blog post explaining how to run neon_cartoon_effect on Raspberry Pi, and explaining the source code in detail. You don’t actually need an RPi board for that, as any Arm board with a processor supporting NEON should do.

clinfo Utility

clinfo is a utility that prints information about OpenCL platforms and devices in the system. So I installed it on the board:


But running the program does not return any useful information:


Not what I expected. Luckily, setting up clinfo is explained in ODROID Magazine, so let’s have a try.

We need to install Mali’s framebuffer driver:


and set up the vendor ICD file:
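For background, the Khronos ICD loader on Linux discovers vendor drivers by reading small text files ending in .icd, by default from /etc/OpenCL/vendors. Each file contains the path (or name) of a vendor's OpenCL library. The sketch below shows what creating such a file amounts to. The file name `mali.icd` and the library path are hypothetical placeholders, and the demo writes to a temporary directory rather than /etc so it can run without root.

```python
import os
import tempfile


def write_icd(vendors_dir, name, library_path):
    """Create <vendors_dir>/<name>.icd containing the path of an OpenCL library."""
    os.makedirs(vendors_dir, exist_ok=True)
    icd_path = os.path.join(vendors_dir, name + ".icd")
    with open(icd_path, "w") as f:
        f.write(library_path + "\n")
    return icd_path


if __name__ == "__main__":
    # Hypothetical values: the real library path depends on how the Mali
    # driver package is installed on the board.
    demo_dir = os.path.join(tempfile.mkdtemp(), "OpenCL", "vendors")
    path = write_icd(demo_dir, "mali", "/path/to/libmali.so")
    with open(path) as f:
        print(path, "->", f.read().strip())
```

On a real system the target directory would be /etc/OpenCL/vendors, which requires root privileges, and the library path must point at the OpenCL library shipped with the GPU driver.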


Now we can run clinfo:


That’s a lot of information, and it shows one platform with two OpenCL devices (both Mali-T628) supporting OpenCL 1.2.

That’s all for this little getting started guide. Now if you actually want to make something with OpenCL, it’s time to read the Arm Compute Library documentation and other resources on the web.

tkaiser

I wonder why everyone is using swap when it’s so easy to activate zram on Ubuntu?

Also it should be noted when comparing execution times that the ImageMagick version you’re using is Q16 and not Q8 (internal bit depth, defaults to 16 bit which results in more precise operations with some tasks but with a simple grayscale conversion or downscaling only slows things unnecessarily down — this can only be specified at build time ‘–with-quantum-depth’)

It would be worth a try to repeat the test with GraphicsMagick which should default to Q8 (can be checked with ‘gm version’). Execution syntax as above just prefixed with ‘gm ‘. Installation is just the usual ‘sudo apt install graphicsmagick’

tkaiser

And on a big.LITTLE platform I would try to ensure execution on an A15 core: So prefixing IM commands with

ImageMagick is considered the ‘swiss army knife’ of image processing for whatever reasons but usually it’s slow as hell and we try to avoid it wherever possible (using GraphicsMagick instead usually results in better performance, avoiding it altogether and using tools that are optimized for the task is an even better idea)

Just mentioning since comparing ‘convert’ execution times with the OpenCL examples is not comparing efficiency of CPU vs. GPU but most probably just unoptimized vs. optimized software in this case.

blu

Another way to avoid costly swapfiles is to lower the -jN option to the compiler. In this particular case -j4 would have utilized the big cores and likely not needed the swapfile.

back2future

[ additional zram (<3/4 of ram, because of fs overhead for zram, if not used) together with conventional swap space (file/partition), could be recommendable when compiling ]

echo lz4 into /sys/block/zram0/comp_algorithm

ASM

Thanks for this guide!

Pretty sure clinfo is not reporting anything but support for OpenCL 1.2 — there is no 2.1 support listed.

Also, I find it odd that the MP6 GPU is listed as *two* devices: one with 4 compute units (MPs) and one with 2… I’ve never seen that before!

blu

@ASM, Indeed, the T628 in the exynos 5422 is a 6-cluster setup, which apparently comes at 4 + 2 CUs.

I was pondering not long ago how to split a workload for it, and one inevitably gets to a 3-way split, which is, well, odd for most workloads.

ASM

@BLU, thanks for confirming the 4+2 split!

I didn’t know and don’t understand why ARM would ever split their GPUs!

This may be useful for a reliable embedded application but it’s still odd.

It’s a challenge to program and coordinate multiple GPUs.

Furthermore, OpenCL 1.2+ already supports a “device fission” feature (clCreateSubDevices()) in case you wanted to logically split the CL device into sub-devices.

Too bad… as an OpenCL dev I would prefer using all 6 MPs in one device. 🙁

FWIW, here is the clinfo for the HiKey 960 and the G71 MP8: https://pastebin.com/RTjNdKyT

blu

Thanks for the G71 clinfo, that’s the first time I’ve seen it.

‘Max work group size: 384’ — apparently ARM have a thing for the number 3..

AreaScout

The error “Out of memory: Kill process 4984 (cc1plus) Out of memory: Kill process 4984 (cc1plus)” happens rarely if you choose to build on all eight cores -j8

onebir

“Now if you actually want to make something with OpenCL, it’s time to read Arm Compute Library documentation….”

Unless you’re interested in deep learning, and use Keras, in which case PlaidML* can make things quite simple:
1) pip install plaidml
2) plaidml-setup
Great, there’s ‘experimental support’ for the Nvidia NVS 5200M in my laptop
3) Pull Keras code for MNIST digit recognition into a Jupyter notebook**
4) Add two lines at top:

import plaidml.keras
plaidml.keras.install_backend()

5) Run. Fails: Great Firewall of China is blocking dataset download. Fortunately, it’s on Baidupan (Chinese Google Drive). Modify code slightly to use downloaded data.

6) Run. “ERROR:plaidml:unable to fill buffer: CL_INVALID_OPERATION”

7) Could be out of memory – an Nvidia NVS 5200M only has 1GB. Cutting the batch size / layers sizes doesn’t seem to help. Let’s just kill a layer:
##model.add(Conv2D(64, (3, 3), activation='relu'))

8) Success! Kind of…
About 15s/epoch, but with a far smaller network.

Eliminating that layer did away with a lot of weights and the network’s capacity to fit the data, with accuracy of only 98.35% after 12 epochs (cf. 99.25% reported for the original network).

That said, PlaidML can turn even an aging laptop’s GPU into a ‘just works’ platform for experimenting with fairly small, fairly standard networks.

Love to hear how it works on cheapish new ARM devices!

*https://github.com/plaidml/plaidml
**https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py