May 13, 2018May 16, 2018 by Jean-Luc Aufranc (CNXSoft) - 20 Comments

How to Get Started with OpenCL on ODROID-XU4 Board (with Arm Mali-T628MP6 GPU)

Last week, I reviewed Ubuntu 18.04 on ODROID-XU4 board testing most of the advertised features. However I skipped on the features listed in the Changelog:

GPU hardware acceleration via OpenGL ES 3.1 and OpenCL 1.2 drivers for Mali T628MP6 GPU

While I tested OpenGL ES with tools like glmark2-es2 and es2gears, as well as WebGL demos in Chromium, I did not test OpenCL, since I’m not that familiar with it, except it’s used for GPGPU (General Purpose GPU) to accelerate tasks like image/audio processing. That was a good excuse to learn a bit more, try it out on the board, and write a short guide to get started with OpenGL on hardware with Arm Mali GPU. The purpose of this tutorial is to show how to run an OpenCL sample, and OpenCL utility, and I won’t go into the nitty gritty of OpenCL code. If you want to learn more about OpenCL coding on Arm, one way would be to check out the source code of the provided samples.

Arm Compute Library and OpenCL Samples

Since I did not know where to start, Hardkernel redirected me to a forum thread where we are shown how to use Arm Compute Library to test OpenCL on the board.

The relevant post is dated January 2018, and relies on Compute Library 17.12, but you can check out the latest version and documentation @ https://arm-software.github.io/ComputeLibrary/latest/. The latest version is 18.03 at the time of writing this post, so I retrieved it, and tried to build it as instructed:

wget https://github.com/ARM-software/ComputeLibrary/archive/v18.03.tar.gz
tar xvf v18.03.tar.gz
cd ComputeLibrary-18.03/
sudo apt install scons
scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=armv7a build=native

wget https://github.com/ARM-software/ComputeLibrary/archive/v18.03.tar.gz

tar xvf v18.03.tar.gz

cd ComputeLibrary-18.03/

sudo apt install scons

scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=armv7a build=native

However, It failed with:

g++: internal compiler error: Killed (program cc1plus)

1	g++: internal compiler error: Killed (program cc1plus)

Looking at the kernel log with dmesg, it was clear the board ran out of memory: “Out of memory: Kill process 4984 (cc1plus) Out of memory: Kill process 4984 (cc1plus)“. So I had to setup a swap file (1GB):

sudo dd if=/dev/zero of=/swapfile bs=1024 count=1M
sudo chown root.root /swapfile
sudo chmod 0600 /swapfile
sudo mkswap /swapfile 
sudo swapon /swapfile

sudo dd if=/dev/zero of=/swapfile bs=1024 count=1M

sudo chown root.root /swapfile

sudo chmod 0600 /swapfile

sudo mkswap /swapfile

sudo swapon /swapfile

…giving us more memory…

free -m
              total        used        free      shared  buff/cache   available
Mem:           1994         336         520          34        1137        1568
Swap:          1023           0        1023

free -m

total used free shared buff/cache available

Mem: 1994 336 520 34 1137 1568

Swap: 1023 0 1023

before restarting the build with NEON and OpenCL enabled:

scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=armv7a build=native

1	scons Werror=1 -j8 debug=0 neon=1 opencl=1 embed_kernels=1 os=linux arch=armv7a build=native

and this time it could complete:

scons: done building targets.

1	scons: done building targets.

[Update: Based on comments below, setting up ZRAM instead of swap is usually better in case you run out of memory]

And we can copy the libraries to /usr/lib:

sudo cp build/*.so /usr/lib/

1	sudo cp build/*.so /usr/lib/

We have a bunch of samples to play with:

ls examples/
SConscript              graph_mobilenet_qasymm8.cpp
cl_convolution.cpp      graph_resnet50.cpp
cl_events.cpp           graph_squeezenet.cpp
cl_sgemm.cpp            graph_squeezenet_v1_1.cpp
gc_absdiff.cpp          graph_vgg16.cpp
gc_dc.cpp               graph_vgg19.cpp
graph_alexnet.cpp       neon_cartoon_effect.cpp
graph_googlenet.cpp     neon_cnn.cpp
graph_inception_v3.cpp  neon_convolution.cpp
graph_inception_v4.cpp  neon_copy_objects.cpp
graph_lenet.cpp         neon_scale.cpp
graph_mobilenet.cpp     neoncl_scale_median_gaussian.cpp

ls examples/

SConscript graph_mobilenet_qasymm8.cpp

cl_convolution.cpp graph_resnet50.cpp

cl_events.cpp graph_squeezenet.cpp

cl_sgemm.cpp graph_squeezenet_v1_1.cpp

gc_absdiff.cpp graph_vgg16.cpp

gc_dc.cpp graph_vgg19.cpp

graph_alexnet.cpp neon_cartoon_effect.cpp

graph_googlenet.cpp neon_cnn.cpp

graph_inception_v3.cpp neon_convolution.cpp

graph_inception_v4.cpp neon_copy_objects.cpp

graph_lenet.cpp neon_scale.cpp

graph_mobilenet.cpp neoncl_scale_median_gaussian.cpp

Note that some are NEON only, not using OpenCL, and the prefix explains the type of sample:

cl_*.cpp –> OpenCL examples
gc_*.cpp –> GLES compute shaders examples
graph_*.cpp –> Graph examples
neoncl_*.cpp –> NEON / OpenCL interoperability examples
neon_*.cpp –> NEON examples

All samples have also been built and can be found in build/examples directory. I ran cl_convolution after generating a Raw ppm image using Gimp:

time ./cl_convolution ~/ODROID-XU4Q-Large.ppm 

./cl_convolution


Test passed

real	0m5.814s
user	0m4.893s
sys	0m0.758s

time ./cl_convolution ~/ODROID-XU4Q-Large.ppm

./cl_convolution

Test passed

real 0m5.814s

user 0m4.893s

sys 0m0.758s

It could process the photo (5184 x 3456) in less than 6 seconds. If we look at the resulting image, we can see the OpenCL convolution converts the image to grayscale.

Original Image (Left) vs After OpenCL Convolution (Right) – Click to Enlarge

So I’ve repeated a similar operation with convert which has not been compiled with OpenCL support, so using software only:

time convert ODROID-XU4Q-Large.ppm -colorspace Gray ODROID-XU4Q-Large-Grayscale.ppm

real	0m10.475s
user	0m0.724s
sys	0m2.957s

time convert ODROID-XU4Q-Large.ppm -colorspace Gray ODROID-XU4Q-Large-Grayscale.ppm

real 0m10.475s

user 0m0.724s

sys 0m2.957s

It took a little over 10 seconds, so almost twice the time used by the OpenCL demo. The PPM image files are however over 50MB, so part of the time is used to read and save the file from the eMMC flash. Repeating the tests provide similar performance (~6s vs ~11s), so it may be negligible.

convert version output showing OpenCL is not part of the enabled features in ImageMagick:

convert -version
Version: ImageMagick 6.9.7-4 Q16 arm 20170114 http://www.imagemagick.org
Copyright: © 1999-2017 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC Modules OpenMP 
Delegates (built-in): bzlib djvu fftw fontconfig freetype jbig jng jpeg lcms lqr ltdl lzma openexr pangocairo png tiff wmf x xml zlib

convert -version

Version: ImageMagick 6.9.7-4 Q16 arm 20170114 http://www.imagemagick.org

License: http://www.imagemagick.org/script/license.php

Features: Cipher DPC Modules OpenMP

Delegates (built-in): bzlib djvu fftw fontconfig freetype jbig jng jpeg lcms lqr ltdl lzma openexr pangocairo png tiff wmf x xml zlib

It’s fun, so I tried another sample:

time ./cl_events ~/ODROID-XU4Q-Large.ppm 

./cl_events


Test passed

real	0m3.068s
user	0m2.527s
sys	0m0.369s

time ./cl_events ~/ODROID-XU4Q-Large.ppm

./cl_events

Test passed

real 0m3.068s

user 0m2.527s

sys 0m0.369s

What did it do? When I open the file it looks the same of the first sample (Grayscale image), but it actually scaled the image (50% width, 50% height):

file ~/ODROID-XU4Q-Large.ppm_out.ppm 
/home/odroid/ODROID-XU4Q-Large.ppm_out.ppm: Netpbm image data, size = 2592 x 1728, rawbits, pixmap

1 2	file ~/ODROID-XU4Q-Large.ppm_out.ppm /home/odroid/ODROID-XU4Q-Large.ppm_out.ppm: Netpbm image data, size = 2592 x 1728, rawbits, pixmap

The last sample cl_sgemm manipulates matrices. The main goal of the three OpenCL (cl_xxx_ samples) is to show how to use OpenCL Convolution, Events and SGEMM (Single-precision GEneral Matrix Multiply) using the Compute Library.

You can also play with other samples for NEON and OpenGL ES, and Arm Community published a blog post explaining how to run neon_cartoon_effect on Raspberry Pi , and explaining the source code in details. You don’t actually need an RPi board for that as any Arm board with a processor supporting NEON should do.

clinfo Utility

clinfo is a utility that print information about OpenCL platforms and devices in the system. So I install it in the board:

sudo apt install clinfo

1	sudo apt install clinfo

But running the program does not return any useful information:

clinfo
Number of platforms 0

1 2	clinfo Number of platforms 0

Not what I expected. Luckily, setting up clinfo is explained in ODROID Magazine, so let’s have a try.

We need to Mali’s framebuffer driver:

sudo apt install mali-fbdev

1	sudo apt install mali-fbdev

and setup the vendor ICD file:

sudo mkdir -p /etc/OpenCL/vendors
sudo sh -c 'echo "/usr/lib/arm-linux-gnueabihf/mali-egl/libOpenCL.so" &gt; /etc/OpenCL/vendors/armocl.icd'

1 2	sudo mkdir -p /etc/OpenCL/vendors sudo sh -c 'echo "/usr/lib/arm-linux-gnueabihf/mali-egl/libOpenCL.so" > /etc/OpenCL/vendors/armocl.icd'

Now we can run clinfo:

clinfo
Number of platforms                               1
  Platform Name                                   ARM Platform
  Platform Vendor                                 ARM
  Platform Version                                OpenCL 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory
  Platform Extensions function suffix             ARM

  Platform Name                                   ARM Platform
Number of devices                                 2
  Device Name                                     Mali-T628
  Device Vendor                                   ARM
  Device Vendor ID                                0x6200010
  Device Version                                  OpenCL 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82
  Driver Version                                  1.2
  Device OpenCL C Version                         OpenCL C 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               4
  Max clock frequency                             600MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple              4
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 8 / 8        (cl_khr_fp16)
    float                                                4 / 4       
    double                                               2 / 2        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              2090397696 (1.947GiB)
  Error Correction support                        No
  Max memory allocation                           522599424 (498.4MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        131072 (128KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             65536x65536 pixels
    Max 3D image size                             65536x65536x65536 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     8
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory

  Device Name                                     Mali-T628
  Device Vendor                                   ARM
  Device Vendor ID                                0x6200010
  Device Version                                  OpenCL 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82
  Driver Version                                  1.2
  Device OpenCL C Version                         OpenCL C 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               2
  Max clock frequency                             600MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple              4
  Preferred / native vector sizes                 
    char                                                16 / 16      
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 8 / 8        (cl_khr_fp16)
    float                                                4 / 4       
    double                                               2 / 2        (cl_khr_fp64)
  Half-precision Floating-point support           (cl_khr_fp16)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    64, Little-Endian
  Global memory size                              2090397696 (1.947GiB)
  Error Correction support                        No
  Max memory allocation                           522599424 (498.4MiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        131072 (128KiB)
  Global Memory cache line size                   64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Max 2D image size                             65536x65536 pixels
    Max 3D image size                             65536x65536x65536 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Global
  Local memory size                               32768 (32KiB)
  Max number of constant args                     8
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        Yes
    Profiling                                     Yes
  Prefer user sync for interop                    No
  Profiling timer resolution                      1000ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  ARM Platform
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [ARM]
  clCreateContext(NULL, ...) [default]            Success [ARM]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-T628
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (2)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-T628
    Device Name                                   Mali-T628
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (2)
    Platform Name                                 ARM Platform
    Device Name                                   Mali-T628
    Device Name                                   Mali-T628

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

183

184

185

186

187

188

189

190

191

192

193

194

195

196

197

198

199

200

201

202

203

204

205

206

207

208

209

210

211

clinfo

Number of platforms 1

Platform Name ARM Platform

Platform Vendor ARM

Platform Version OpenCL 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82

Platform Profile FULL_PROFILE

Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory

Platform Extensions function suffix ARM

Platform Name ARM Platform

Number of devices 2

Device Name Mali-T628

Device Vendor ARM

Device Vendor ID 0x6200010

Device Version OpenCL 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82

Driver Version 1.2

Device OpenCL C Version OpenCL C 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82

Device Type GPU

Device Profile FULL_PROFILE

Device Available Yes

Compiler Available Yes

Linker Available Yes

Max compute units 4

Max clock frequency 600MHz

Device Partition (core)

Max number of sub-devices 0

Supported partition types None

Max work item dimensions 3

Max work item sizes 256x256x256

Max work group size 256

Preferred work group size multiple 4

Preferred / native vector sizes

char 16 / 16

short 8 / 8

int 4 / 4

long 2 / 2

half 8 / 8 (cl_khr_fp16)

float 4 / 4

double 2 / 2 (cl_khr_fp64)

Half-precision Floating-point support (cl_khr_fp16)

Denormals Yes

Infinity and NANs Yes

Round to nearest Yes

Round to zero Yes

Round to infinity Yes

IEEE754-2008 fused multiply-add Yes

Support is emulated in software No

Single-precision Floating-point support (core)

Denormals Yes

Infinity and NANs Yes

Round to nearest Yes

Round to zero Yes

Round to infinity Yes

IEEE754-2008 fused multiply-add Yes

Support is emulated in software No

Correctly-rounded divide and sqrt operations No

Double-precision Floating-point support (cl_khr_fp64)

Denormals Yes

Infinity and NANs Yes

Round to nearest Yes

Round to zero Yes

Round to infinity Yes

IEEE754-2008 fused multiply-add Yes

Support is emulated in software No

Address bits 64, Little-Endian

Global memory size 2090397696 (1.947GiB)

Error Correction support No

Max memory allocation 522599424 (498.4MiB)

Unified memory for Host and Device Yes

Minimum alignment for any data type 128 bytes

Alignment of base address 1024 bits (128 bytes)

Global Memory cache type Read/Write

Global Memory cache size 131072 (128KiB)

Global Memory cache line size 64 bytes

Image support Yes

Max number of samplers per kernel 16

Max size for 1D images from buffer 65536 pixels

Max 1D or 2D image array size 2048 images

Max 2D image size 65536x65536 pixels

Max 3D image size 65536x65536x65536 pixels

Max number of read image args 128

Max number of write image args 8

Local memory type Global

Local memory size 32768 (32KiB)

Max number of constant args 8

Max constant buffer size 65536 (64KiB)

Max size of kernel argument 1024

Queue properties

Out-of-order execution Yes

Profiling Yes

Prefer user sync for interop No

Profiling timer resolution 1000ns

Execution capabilities

Run OpenCL kernels Yes

Run native kernels No

printf() buffer size 1048576 (1024KiB)

Built-in kernels

Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_fp16 cl_khr_gl_sharing cl_khr_icd cl_khr_egl_event cl_khr_egl_image cl_arm_core_id cl_arm_printf cl_arm_thread_limit_hint cl_arm_non_uniform_work_group_size cl_arm_import_memory

Device Name Mali-T628

Device Vendor ARM

Device Vendor ID 0x6200010

Device Version OpenCL 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82

Driver Version 1.2

Device OpenCL C Version OpenCL C 1.2 v1.r12p0-04rel0.03af15950392f3702b248717f4938b82

Device Type GPU

Device Profile FULL_PROFILE

Device Available Yes

Compiler Available Yes

Linker Available Yes

Max compute units 2

Max clock frequency 600MHz

Device Partition (core)

Max number of sub-devices 0

Supported partition types None

Max work item dimensions 3

Max work item sizes 256x256x256

Max work group size 256

Preferred work group size multiple 4

Preferred / native vector sizes

char 16 / 16

short 8 / 8

int 4 / 4

long 2 / 2

half 8 / 8 (cl_khr_fp16)

float 4 / 4

double 2 / 2 (cl_khr_fp64)

Half-precision Floating-point support (cl_khr_fp16)

Denormals Yes

Infinity and NANs Yes

Round to nearest Yes

Round to zero Yes

Round to infinity Yes

IEEE754-2008 fused multiply-add Yes

Support is emulated in software No

Single-precision Floating-point support (core)

Denormals Yes

Infinity and NANs Yes

Round to nearest Yes

Round to zero Yes

Round to infinity Yes

IEEE754-2008 fused multiply-add Yes

Support is emulated in software No

Correctly-rounded divide and sqrt operations No

Double-precision Floating-point support (cl_khr_fp64)

Denormals Yes

Infinity and NANs Yes

Round to nearest Yes

Round to zero Yes

Round to infinity Yes

IEEE754-2008 fused multiply-add Yes

Support is emulated in software No

Address bits 64, Little-Endian

Global memory size 2090397696 (1.947GiB)

Error Correction support No

Max memory allocation 522599424 (498.4MiB)

Unified memory for Host and Device Yes

Minimum alignment for any data type 128 bytes

Alignment of base address 1024 bits (128 bytes)

Global Memory cache type Read/Write

Global Memory cache size 131072 (128KiB)

Global Memory cache line size 64 bytes

Image support Yes

Max number of samplers per kernel 16

Max size for 1D images from buffer 65536 pixels

Max 1D or 2D image array size 2048 images

Max 2D image size 65536x65536 pixels

Max 3D image size 65536x65536x65536 pixels

Max number of read image args 128

Max number of write image args 8

Local memory type Global

Local memory size 32768 (32KiB)

Max number of constant args 8

Max constant buffer size 65536 (64KiB)

Max size of kernel argument 1024

Queue properties

Out-of-order execution Yes

Profiling Yes

Prefer user sync for interop No

Profiling timer resolution 1000ns

Execution capabilities

Run OpenCL kernels Yes

Run native kernels No

printf() buffer size 1048576 (1024KiB)

Built-in kernels

NULL platform behavior

clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) ARM Platform

clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [ARM]

clCreateContext(NULL, ...) [default] Success [ARM]

clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT) Success (1)

Platform Name ARM Platform

Device Name Mali-T628

clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform

clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (2)

Platform Name ARM Platform

Device Name Mali-T628

clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform

clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform

clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (2)

Platform Name ARM Platform

Device Name Mali-T628

ICD loader properties

ICD loader Name OpenCL ICD Loader

ICD loader Vendor OCL Icd free software

ICD loader Version 2.2.11

ICD loader Profile OpenCL 2.1

That’s a lot of information, and it shows one platform with two OpenCL devices (both Mali-T628) supporting OpenCL 1.2.

That’s all for this little getting started guide. Now if you actually want to make something with OpenCL, it’s time to read Arm Compute Library documentation, and other resources on the web.

Jean-Luc Aufranc (CNXSoft)

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.

20 Replies to “How to Get Started with OpenCL on ODROID-XU4 Board (with Arm Mali-T628MP6 GPU)”

I wonder why everyone is using swap when it’s so easy to activate zram on Ubuntu?

Also it should be noted when comparing execution times that the ImageMagick version you’re using is Q16 and not Q8 (internal bit depth, defaults to 16 bit which results in more precise operations with some tasks but with a simple grayscale conversion or downscaling only slows things unnecessarily down — this can only be specified at build time ‘–with-quantum-depth’)

It would be worth a try to repeat the test with GraphicsMagick which should default to Q8 (can be checked with ‘gm version’). Execution syntax as above just prefixed with ‘gm ‘. Installation is just the usual ‘sudo apt install graphicsmagick’

And on a big.LITTLE platform I would try to ensure execution on an A15 core: So prefixing IM commands with

taskset -c 7 gm

1	taskset -c 7 gm

ImageMagick is considered the ‘swiss army knife’ of image processing for whatever reasons but usually it’s slow as hell and we try to avoid it whereever possible (using GraphicsMagick instead usually results in better performance, avoiding it altogether and using tools that are optimized for the task is an even better idea)

Just mentioning since comparing ‘convert’ execution times with the OpenCL examples is not comparing efficiency of CPU vs. GPU but most probably just unoptimized vs. optimized software in this case.

gm is also build with Q16

gm version
GraphicsMagick 1.3.28 2018-01-20 Q16 http://www.GraphicsMagick.org/

1 2	gm version GraphicsMagick 1.3.28 2018-01-20 Q16 http://www.GraphicsMagick.org/

Result if I force convert / gm convert to run on the big core

time taskset -c 7 convert ODROID-XU4Q-Large.ppm -colorspace Gray ODROID-XU4Q-Large-Grayscale.ppm

real	0m10.703s
user	0m0.723s
sys	0m3.425s
odroid@odroid:~$ time taskset -c 7 gm convert ODROID-XU4Q-Large.ppm -colorspace Gray ODROID-XU4Q-Large-Grayscale.ppm

real	0m1.354s
user	0m0.705s
sys	0m0.638s

time taskset -c 7 convert ODROID-XU4Q-Large.ppm -colorspace Gray ODROID-XU4Q-Large-Grayscale.ppm

real 0m10.703s

user 0m0.723s

sys 0m3.425s

odroid@odroid:~$ time taskset -c 7 gm convert ODROID-XU4Q-Large.ppm -colorspace Gray ODROID-XU4Q-Large-Grayscale.ppm

real 0m1.354s

user 0m0.705s

sys 0m0.638s

So gm is much faster, but not related to Q8 or Q16.
Full result of gm version for reference:

gm -version
GraphicsMagick 1.3.28 2018-01-20 Q16 http://www.GraphicsMagick.org/
Copyright (C) 2002-2018 GraphicsMagick Group.
Additional copyrights and licenses apply to this software.
See http://www.GraphicsMagick.org/www/Copyright.html for details.

Feature Support:
  Native Thread Safe       yes
  Large Files (> 32 bit)   yes
  Large Memory (> 32 bit)  no
  BZIP                     yes
  DPS                      no
  FlashPix                 no
  FreeType                 yes
  Ghostscript (Library)    no
  JBIG                     yes
  JPEG-2000                no
  JPEG                     yes
  Little CMS               yes
  Loadable Modules         no
  OpenMP                   yes (201511)
  PNG                      yes
  TIFF                     yes
  TRIO                     no
  UMEM                     no
  WebP                     yes
  WMF                      yes
  X11                      yes
  XML                      yes
  ZLIB                     yes

Host type: arm-unknown-linux-gnueabihf

Configured using the command:
  ./configure  '--build' 'arm-linux-gnueabihf' '--enable-shared' '--enable-static' '--enable-libtool-verbose' '--prefix=/usr' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--docdir=${prefix}/share/doc/graphicsmagick' '--with-gs-font-dir=/usr/share/fonts/type1/gsfonts' '--with-x' '--x-includes=/usr/include/X11' '--x-libraries=/usr/lib/X11' '--without-dps' '--without-modules' '--without-frozenpaths' '--with-webp' '--with-perl' '--with-perl-options=INSTALLDIRS=vendor' '--enable-quantum-library-names' '--with-quantum-depth=16' 'build_alias=arm-linux-gnueabihf' 'CFLAGS=-Wall -g -fno-strict-aliasing -O2' 'LDFLAGS=' 'CXXFLAGS=-Wall -g -fno-strict-aliasing -O2'

Final Build Parameters:
  CC       = gcc
  CFLAGS   = -fopenmp -Wall -g -fno-strict-aliasing -O2 -Wall -pthread
  CPPFLAGS = -I/usr/include/X11 -I/usr/include/freetype2 -I/usr/include/libxml2
  CXX      = g++
  CXXFLAGS = -Wall -g -fno-strict-aliasing -O2 -pthread
  LDFLAGS  = -L/usr/lib/X11 -R/usr/lib/X11
  LIBS     = -ljbig -lwebp -lwebpmux -llcms2 -ltiff -lfreetype -ljpeg -lpng16 -lwmflite -lXext -lSM -lICE -lX11 -llzma -lbz2 -lxml2 -lz -lm -lgomp -lpthread

gm -version

GraphicsMagick 1.3.28 2018-01-20 Q16 http://www.GraphicsMagick.org/

Additional copyrights and licenses apply to this software.

See http://www.GraphicsMagick.org/www/Copyright.html for details.

Feature Support:

Native Thread Safe yes

Large Files (> 32 bit) yes

Large Memory (> 32 bit) no

BZIP yes

DPS no

FlashPix no

FreeType yes

Ghostscript (Library) no

JBIG yes

JPEG-2000 no

JPEG yes

Little CMS yes

Loadable Modules no

OpenMP yes (201511)

PNG yes

TIFF yes

TRIO no

UMEM no

WebP yes

WMF yes

X11 yes

XML yes

ZLIB yes

Host type: arm-unknown-linux-gnueabihf

Configured using the command:

./configure '--build' 'arm-linux-gnueabihf' '--enable-shared' '--enable-static' '--enable-libtool-verbose' '--prefix=/usr' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--docdir=${prefix}/share/doc/graphicsmagick' '--with-gs-font-dir=/usr/share/fonts/type1/gsfonts' '--with-x' '--x-includes=/usr/include/X11' '--x-libraries=/usr/lib/X11' '--without-dps' '--without-modules' '--without-frozenpaths' '--with-webp' '--with-perl' '--with-perl-options=INSTALLDIRS=vendor' '--enable-quantum-library-names' '--with-quantum-depth=16' 'build_alias=arm-linux-gnueabihf' 'CFLAGS=-Wall -g -fno-strict-aliasing -O2' 'LDFLAGS=' 'CXXFLAGS=-Wall -g -fno-strict-aliasing -O2'

Final Build Parameters:

CC = gcc

CFLAGS = -fopenmp -Wall -g -fno-strict-aliasing -O2 -Wall -pthread

CPPFLAGS = -I/usr/include/X11 -I/usr/include/freetype2 -I/usr/include/libxml2

CXX = g++

CXXFLAGS = -Wall -g -fno-strict-aliasing -O2 -pthread

LDFLAGS = -L/usr/lib/X11 -R/usr/lib/X11

LIBS = -ljbig -lwebp -lwebpmux -llcms2 -ltiff -lfreetype -ljpeg -lpng16 -lwmflite -lXext -lSM -lICE -lX11 -llzma -lbz2 -lxml2 -lz -lm -lgomp -lpthread

tkaiser says:

May 13, 2018 at 14:49

> So gm is much faster, but not related to Q8 or Q16.

Thank you for the tests which simply show one more time that ImageMagick is just slow as hell. When I tested last time with Q8 vs. Q16 I got a performance improvement of 20% for Q8 (with a task that involved loading the rather large images through a slow GbE network so no filesystem buffers/caches involved as in your example).

So by using GraphicsMagick vs. ImageMagick now CPU processing easily outperforms OpenCL on the GPU. This should be taken into account by people really interested in doing high performance image processing (or other tasks that could run via OpenCL on the GPU).

BTW: ‘Converting to grayscale’ can be done in many ways. The most primitive and fastest is to add the values of the 3 RGB channels and divide by 3. A more complicated variant would be to transform the RGB image into a ‘device independent color space’ and then reducing saturation to 0. This is way more ‘expensive’ in terms of processing power but results in a slightly better image result. Next improvement is to take ICC profiles into account (assuming something generic like sRGB for source and destination images if the fileformat doesn’t support color profiles).

I would assume the OpenCL example uses the most primitive and fastest algorithm while GraphicsMagick implements the latter variant. Would be interesting to compare the two grayscale results:

gm convert gm.ppm opencl.ppm -compose difference -composite -colorspace Gray difference.ppm

1

gm convert gm.ppm opencl.ppm -compose difference -composite -colorspace Gray difference.ppm

Usually this will result in a pretty dark image so adding ‘-normalize’ or inverting the resulting image helps in spotting differences.

Reply
1. crashoverride says:
  
  May 13, 2018 at 16:21
  
  Before this discussion goes further off the rails, the cl_convolution example’s function is not to convert to grayscale:
  https://github.com/ARM-software/ComputeLibrary/blob/master/examples/cl_convolution.cpp
  // Apply a Gaussian 3×3 filter to the source image followed by a Gaussian 5×5:
  
  A convolution filter in this context is explained here:
  https://en.wikipedia.org/wiki/Kernel_(image_processing)
  
  The function it performs depends on the filter values used. Since the operation is scaler (single value) and not vector (multiple values), the output is typically represented as grayscale.
  
  Further information (and uses) can be found here:
  http://www.roborealm.com/help/Convolution.php
  
  For an ImageMagic comparison, the appropriate convolution filter operation should be used.:
  https://www.imagemagick.org/Usage/convolve/
  
  Note there are two passes (3×3 and 5×5). The kernel filter values used are here:
  https://github.com/ARM-software/ComputeLibrary/blob/02c62c8030e7aca592b294396556a93c6bfb9f7a/examples/cl_convolution.cpp#L36-L54
2. tkaiser says:
  
  May 13, 2018 at 17:14
  
  Interesting. So it’s a gaussian blur applied twice in reality and for whatever reasons color information has been dropped from the final image (5184 x 3456 in grayscale is 17.9 MB in size but resulting image is 53.7 MB, so it’s not a Graymap but still a Pixmap)

blu says:

May 13, 2018 at 13:23

Another way to avoid costly swapfiles is to lower the -jN option to the compiler. In this particular case -j4 would have utilized the big cores and likely not needed the swapfile.

Reply
1. cnxsoft says:
  
  May 13, 2018 at 14:20
  
  Good point, but I think the build failed at the link stage, which I believe only uses one core. I’m compiling the thing with -j1 to see if it goes through, then I’ll try with ZRAM enabled.
  
  Reply
  1. cnxsoft says:
    
    May 14, 2018 at 09:22
    
    -j1 could complete, but I repeated with -j8 with zram nor swap, and completed too. Maybe it was just at the limit…
    
    Reply
back2future says:

May 13, 2018 at 21:51

[ additional zram (<3/4 of ram, because of fs overhead for zram, if not used) together with conventional swap space (file/partition), could be recommendable when compiling ]

echo lz4 into /sys/block/zram0/comp_algorithm

Reply

ASM says:

May 13, 2018 at 14:03

Thanks for this guide!

Pretty sure clinfo is not reporting anything but support for OpenCL 1.2 — there is no 2.1 support listed.

Also, I find it odd that the MP6 GPU is listed as *two* devices: one with 4 compute units (MPs) and one with 2… I’ve never seen that before!

Reply
1. cnxsoft says:
  
  May 13, 2018 at 14:16
  
  hmm, you’re right the drivers only show OpenCL 1.2. I saw OpenCL 2.1 at this line:
  
  ICD loader Profile OpenCL 2.1
  
  1
  
  ICD loader Profile OpenCL 2.1
  
  Since we have two devices, I wonder if the selection of the device matters, especially one has 4 compute units, and the other has both, or do OpenCL code just run on both at the same time?
  
  Reply
  1. blu says:
    
    May 13, 2018 at 18:40
    
    I really depends how the client code was written — while a CL kernel would not care on how many devices it would run, the client-side context and queue setup would need to be carried properly for multi-device dispatch.
    
    For client code that was not written to run on multiple devices, the user has to make sure they run the test on the most performant device — in this case on the 4-CU one.
    
    Reply
2. blu says:
  
  May 13, 2018 at 18:54
  
  @ASM, Indeed, the T628 in the exynos 5422 is a 6-cluster setup, which apparently comes at 4 + 2 CUs.
  
  I was pondering not long ago how to split a workload for it, and one inevitably gets to a 3-way split, which is, well, odd for most workloads.
  
  Reply
  1. ASM says:
    
    May 14, 2018 at 00:31
    
    @BLU, thanks for confirming the 4+2 split!
    
    I didn’t know and don’t understand why ARM would ever split their GPUs !
    
    This may be useful for a reliable embedded application but it’s still odd.
    
    It’s a challenge to program and coordinate multiple GPUs.
    
    Furthermore, OpenCL 1.2+ already supports a “device fission” feature (clCreateSubDevices()) in case you wanted to logically split the CL device into sub-devices.
    
    Too bad… as an OpenCL dev I would prefer using all 6 MPs in one device. 🙁
    
    FWIW, here is the clinfo for the HiKey 960 and the G71 MP8: https://pastebin.com/RTjNdKyT
    
    Reply
    1. blu says:
      
      May 14, 2018 at 01:04
      
      Thanks for the G71 clinfo — that’s the first time I see it.
      
      ‘Max work group size: 384’ — apparently ARM have a thing for the number 3..
      
      Reply
AreaScout says:

May 13, 2018 at 19:10

The error “Out of memory: Kill process 4984 (cc1plus) Out of memory: Kill process 4984 (cc1plus)” happens rarely if you choose to build on all eight cores -j8

Reply
onebir says:

May 18, 2018 at 14:27

“Now if you actually want to make something with OpenCL, it’s time to read Arm Compute Library documentation….”

Unless you’re interested in deep learning, and use Keras, in which case PlaidML* can make things quite simple:
1) pip install plaidml
2 plaidml-setup
Great, there’s ‘experimental support’ for the Nvidia NVS 5200M in my laptop
3) Pull Keras code for MNIST digit recognition code into a Jupyter notebook**
4) Add two lines at top:

import plaidml.keras
plaidml.keras.install_backend()

5) Run. Fails: Great Firewall of China is blocking dataset download. Fortunately, it’s on Baidupan (Chinese Google Drive). Modify code slightly to use downloaded data.

6) Run. “ERROR:plaidml:unable to fill buffer: CL_INVALID_OPERATION”

7) Could be out of memory – an Nvidia NVS 5200M only has 1GB. Cutting the batch size / layers sizes doesn’t seem to help. Let’s just kill a layer:
##model.add(Conv2D(64, (3, 3), activation=’relu’))

8) Success! Kind of…
About 15s/epoch, but with a far smaller network.

Eliminating that layer did away with a lot of weights and the network’s capacity to fit the data, with accuracy of only 98,35% after 12 epochs (CF 99.25% reported for the original network).

That said, PlaidML can turn even an aging laptop’s GPU into a ‘just works’ platform for experimenting with fairly small, fairly standard networks.

Love to hear how it works on cheapish new ARM devices!

*https://github.com/plaidml/plaidml
**https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py

Reply
1. cnxsoft says:
  
  May 18, 2018 at 14:34
  
  That name is familiar… It’s been a while. Welcome back!
  
  Reply
  1. onebir says:
    
    May 18, 2018 at 18:33
    
    Thanks – I’m hoping to tempt you into trying this on some ARM boards from that big stack you must have in Chiang Mai!
    
    The networks they can train might be limited (big neural networks takes a lot of memory). But some ARM GPUs may quite useful for inference (predicting with pre-trained models). PlaidML’s support of ONNX (an open format to represent deep learning models, https://onnx.ai/) should make it fairly easy to test how well they work.
    
    In particular, the Intel Movidius and AAEON UP AI Core only have 512-1024MB ram, which I think (?) is comparable with the RAM ARM GPUs can often access. So some people may find projects they thought would need a Movidius/UP AI Core/Jetson will work on other cheaper devices.
    
    (I also stumbled across Qualcomm’s Neural Processing Engine (NPE) SDK while googling for this:
    https://developer.qualcomm.com/blog/run-your-onnx-ai-models-faster-snapdragon )
    
    Reply

Boardcon LGA3576 Rockchip RK3576 System-on-Module designed for AI and IoT applications

Arm Compute Library and OpenCL Samples

clinfo Utility

20 Replies to “How to Get Started with OpenCL on ODROID-XU4 Board (with Arm Mali-T628MP6 GPU)”

Leave a Reply Cancel reply

Leave a Reply