Posts Tagged ‘opencl’

First OpenCL Encounters on Cortex-A72: Some Benchmarking

November 14th, 2017 1 comment

This is a guest post by blu about his experience with OpenCL on a MacchiatoBin board with a quad-core Cortex-A72 processor, and on an Intel-based MacBook. He previously contributed several technical articles, such as How ARM Nerfed NEON Permute Instructions in ARMv8 and OpenGL ES development on Ubuntu Touch.

Qualcomm launched their long-awaited server ARM chip the other day, and we started getting the first benchmarks. Incidentally, I too managed to get some OpenCL ray-tracing code running on an ARM Cortex-A72 machine that same day (thanks to pocl – an LLVM-based open-source OCL multi-platform implementation), so my benchmarking curiosity got the better of me.

The code in question is an OCL (half-finished) port of a graphics demo from 2014. Some remarks on what it does:

For each frame: a single thread builds a sparse voxel octree from a dynamic voxel scene; the octree, along with the current camera settings, is passed to an OCL kernel via double buffering; the kernel computes a screen-space map of object IDs from the voxels hit by primary rays (the kernel utilizes all compute units of a user-specified device); finally, in the headless mode used in this test, the app discards the frame. The test runs for a user-specified number of frames, and reports the average frames per second (FPS) upon termination.
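To make the structure of that loop concrete, below is a rough host-side sketch using the standard OpenCL C API. It is not the demo's actual code: every name in it (render_frames, the buffer names, the camera layout) is a hypothetical stand-in, and error handling is omitted.

  // Simplified per-frame dispatch loop (hypothetical names, standard OpenCL C API).
  // Assumes 'context', 'queue' and 'kernel' were created during the usual OpenCL setup.
  #include <CL/cl.h>      // <OpenCL/opencl.h> on macOS
  #include <cstddef>

  void render_frames(cl_context context, cl_command_queue queue, cl_kernel kernel,
                     const void* host_octree, size_t octree_bytes,
                     size_t width, size_t height, int frames)
  {
      cl_int err = CL_SUCCESS;
      // Two octree buffers, so the CPU can refill one while the device still reads the other.
      cl_mem octree_buf[2];
      for (int i = 0; i < 2; ++i)
          octree_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY, octree_bytes, NULL, &err);
      cl_mem id_map = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                     width * height * sizeof(cl_uint), NULL, &err);

      for (int frame = 0; frame < frames; ++frame) {
          const int cur = frame & 1;
          // Single-threaded part: upload the freshly rebuilt octree into the current buffer.
          err = clEnqueueWriteBuffer(queue, octree_buf[cur], CL_FALSE, 0,
                                     octree_bytes, host_octree, 0, NULL, NULL);

          // Current camera settings (layout hypothetical), the octree, and the output ID map.
          cl_float16 camera = {};
          err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &octree_buf[cur]);
          err = clSetKernelArg(kernel, 1, sizeof(camera), &camera);
          err = clSetKernelArg(kernel, 2, sizeof(cl_mem), &id_map);

          // One work-item per pixel; the runtime spreads the work over all compute units.
          size_t global[2] = { width, height };
          err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, NULL);
          clFinish(queue);  // headless mode: the resulting object-ID map is simply discarded
      }
      (void)err;  // error checks omitted for brevity

      clReleaseMemObject(id_map);
      clReleaseMemObject(octree_buf[0]);
      clReleaseMemObject(octree_buf[1]);
  }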

Now, one of the baselines I wanted to compare the ARM machine against was a MacBook with Penryn (Intel Core 2 Duo Processor P8600), as the latter had exhibited very similar IPC characteristics to the Cortex-A72 in previous (non-OCL) tests, and the two machines also have very similar FLOPS paper specs, which matters since our OCL test is particularly FP-heavy (the arithmetic behind the figures is spelled out right after the list):

  • 2x Penryn @ 2400MHz: 4xfp32 mul + 4xfp32 add per clock = 38.4GFLOPS total
  • 4x Cortex-A72 @ 1300MHz: 4xfp32 mul-add per clock = 41.6GFLOPS total
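In case those peak numbers look opaque, they are simply cores × clock × flops per clock:

  Penryn: 2 cores × 2.4GHz × (4 mul + 4 add) flops/clock = 38.4 GFLOPS
  Cortex-A72: 4 cores × 1.3GHz × 4 FMA/clock × 2 flops/FMA = 41.6 GFLOPS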

Beyond paper specs, on a SGEMM test the two machines showed the following performance for cached data:

  • Penryn: 4.86 flop/clock/core, 23.33GFLOPS total
  • Cortex-A72: 6.52 flop/clock/core, 33.90GFLOPS total

And finally RAM bandwidth (again, paper specs):

  • Penryn: 8.53GB/s (DDR3 @ 1066MT/s)
  • Cortex-A72: 12.8GB/s (DDR4 @ 1600MT/s)

On the ray-tracing OCL test, though, things turned out to be interesting (the MacBook was running Apple’s own OCL stack, which, to the best of my knowledge, is also LLVM-based):

  • Penryn average FPS: 2.31
  • Cortex-A72 average FPS: 7.61

So while on the SGEMM test the ARM chip was ~1.5x faster than Penryn for cached data, on the ray-tracing test, which is much more complex code than SGEMM, the ARM speedup turned out to be ~3x? Remember, we are talking about two μarchs that perform quite closely in general-purpose-code IPC. Could something be wrong with Apple’s OCL stack? Let’s try pocl (exact same version of pocl and LLVM as on the ARM machine):

  • Penryn average FPS: 11.58

OK, that’s much more reasonable. This time Penryn holds a 1.5x speed advantage. Now, while Penryn is a fairly mature μarch that reached its toolchain-support peak long ago, could we expect improvements from LLVM’s (and pocl’s) support for the Cortex family? Perhaps. In the case of our little test I could eventually finish the AArch64 port of the non-OCL version of this code (originally x86-64 with SSE/AVX), but hey, OCL saved me that initial effort while satisfying my curiosity!
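As a side note, switching between OCL stacks such as Apple’s and pocl simply comes down to picking a different platform at enumeration time. A quick way to see what is installed (plain OpenCL API, nothing demo-specific) looks roughly like this:

  // List every OpenCL platform and its devices, e.g. to tell Apple's stack from pocl.
  #include <CL/cl.h>      // <OpenCL/opencl.h> on macOS
  #include <cstdio>

  int main()
  {
      cl_platform_id platforms[8];
      cl_uint num_platforms = 0;
      clGetPlatformIDs(8, platforms, &num_platforms);

      for (cl_uint i = 0; i < num_platforms; ++i) {
          char name[256] = {0};
          clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
          printf("platform %u: %s\n", i, name);  // pocl reports itself as "Portable Computing Language"

          cl_device_id devices[8];
          cl_uint num_devices = 0;
          clGetDeviceIDs(platforms[i], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);
          for (cl_uint d = 0; d < num_devices; ++d) {
              char dev_name[256] = {0};
              clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dev_name), dev_name, NULL);
              printf("  device %u: %s\n", d, dev_name);
          }
      }
      return 0;
  }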

What is more interesting, though, is that assuming a Qualcomm Falkor core is at least as performant as a Cortex-A72 core in both general-purpose and NEON IPC (not a baseless supposition), and taking into account that the top-specced Centriq 2400 has 12x the cores and 10x the RAM bandwidth of our ARM machine, we can speculate about the Centriq 2400’s performance on this OCL test when using the same OCL stack.

Hypothetical Qualcomm Centriq 2400 server: 48x Falkor @ 2200-2600MHz, 6x DDR4 @ 2667MT/s (128GB/s).

Any such estimate assumes linear scaling from the measured ARMADA 8040 performance; in practice the single-threaded part of the test will impede linear scaling, and so could the slightly lower per-core RAM bandwidth paper specs.
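With those caveats in mind, a naive linear extrapolation from the measured 7.61 FPS would land somewhere around:

  7.61 FPS × (48 / 4 cores) × (2.2GHz / 1.3GHz) ≈ 155 FPS
  7.61 FPS × (48 / 4 cores) × (2.6GHz / 1.3GHz) ≈ 183 FPS

Treat these as an optimistic ceiling rather than a prediction, precisely because of the serial octree-build stage and the lower per-core memory bandwidth mentioned above.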

Of course, CPU-based solutions are not the best candidates for this OCL test; a decent GPU would obliterate even a 2S Xeon server here. But the goal of this entire exercise was to get a first-encounter estimate of the Cortex-A72 in FP-heavy scenarios that are less trivial than matrix multiplication, and things can only go up from here. Raw data from the pocl tests on the MacchiatoBin and the MacBook is available here.

Imagination PowerVR “Furian” Series8XT GT8525 GPU Targets High-end Smartphones, Virtual Reality and Automotive Products

May 11th, 2017 No comments

Imagination Technologies has unveiled their first GPU based on the PowerVR Furian architecture: the Series8XT GT8525 GPU, equipped with two clusters and designed for SoCs going into products such as high-end smartphones and tablets, mid-range dedicated VR and AR devices, and mid- to high-end automotive infotainment and ADAS systems.

Block Diagram for PowerVR Furian GT8525 GPU – Click to Enlarge

The Furian architecture is said to allow for improvements in performance density, GPU efficiency, and system efficiency; it features a new 32-wide ALU cluster design, and can be manufactured on sub-14nm processes (e.g. 7nm once available). The PowerVR GT8525 GPU supports compute APIs such as OpenCL 2.0, Vulkan 1.0 and OpenVX 1.1.

Compared to the previous Series7XT GPU family, the Series8XT GT8525 GPU delivers 80% higher fps in the T-Rex benchmark, an extra 50% fps in the GFXBench Manhattan benchmark, 50% higher fps in Antutu, double the fillrate throughput for GUI workloads, and over 50% more GFLOPS for compute applications.

GT8525 GPU is available for licensing now, and has already been delivered to lead customers. More details should eventually surface on PowerVR Series8XT Core page.

Open Source ARM Compute Library Released with NEON and OpenCL Accelerated Functions for Computer Vision, Machine Learning

April 3rd, 2017 12 comments

GPU compute promises to deliver much better performance than CPU compute for applications such as computer vision and machine learning, but the problem is that many developers may not have the right skills or time to leverage APIs such as OpenCL. So ARM decided to write their own ARM Compute Library, and has now released it under an MIT license.

The functions found in the library include:

  • Basic arithmetic, mathematical, and binary operator functions
  • Color manipulation (conversion, channel extraction, and more)
  • Convolution filters (Sobel, Gaussian, and more)
  • Canny Edge, Harris corners, optical flow, and more
  • Pyramids (such as Laplacians)
  • HOG (Histogram of Oriented Gradients)
  • SVM (Support Vector Machines)
  • H/SGEMM (Half and Single precision General Matrix Multiply)
  • Convolutional Neural Networks building blocks (Activation, Convolution, Fully connected, Locally connected, Normalization, Pooling, Soft-max)

The library works on Linux, Android, or bare metal on armv7a (32-bit) or arm64-v8a (64-bit) architectures, and makes use of NEON, OpenCL, or NEON + OpenCL. The OpenCL path requires an OpenCL-capable GPU, so Mali-4xx GPUs are not fully supported; you need an SoC with a Mali-T6xx, T7xx, T8xx, or G71 GPU to make full use of the library, although the NEON-only functions work regardless.
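To give an idea of what using the library looks like, here is a minimal NEON-path sketch modeled on the examples shipped with the library; the image size and the missing fill/IO code are placeholders, so treat it as an illustration rather than a tested program:

  // Minimal ARM Compute Library (NEON) example: 5x5 Gaussian blur on a grayscale image.
  #include "arm_compute/runtime/NEON/NEFunctions.h"  // NEON function objects (NEGaussian5x5, ...)
  #include "arm_compute/runtime/Tensor.h"

  using namespace arm_compute;

  int main()
  {
      Tensor src, dst;
      // Describe two single-channel 640x480 U8 images; memory is only allocated later.
      src.allocator()->init(TensorInfo(640U, 480U, Format::U8));
      dst.allocator()->init(TensorInfo(640U, 480U, Format::U8));

      NEGaussian5x5 gauss;                                 // NEON-accelerated 5x5 Gaussian filter
      gauss.configure(&src, &dst, BorderMode::UNDEFINED);  // configure before allocating the tensors

      src.allocator()->allocate();
      dst.allocator()->allocate();

      // ... fill 'src' with pixel data here (e.g. loaded from a PPM file) ...

      gauss.run();  // runs the NEON kernel(s) on the CPU
      return 0;
  }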

In order to showcase their new library, ARM compared its performance to the OpenCV library on a Huawei Mate 9 smartphone with a HiSilicon Kirin 960 processor and an ARM Mali-G71MP8 GPU.

ARM Compute Library (OpenCL) vs OpenCV (NEON)

Even with some NEON acceleration in OpenCV, convolution and SGEMM functions are around 15 times faster with the ARM Compute Library. Note that ARM selected a hardware platform with one of their best GPUs, so while the library should still be faster on other OpenCL-capable ARM GPUs, the difference will be smaller, though still significant, i.e. several times faster.

ARM Compute Library vs OpenCV, single-threaded, CPU (NEON)

The performance boost for the other functions is not quite as impressive, but the Compute Library is still 2x to 4x faster than OpenCV.

While the open source release happened just about three weeks ago, the ARM Compute Library had already been utilized by several embedded, consumer and mobile silicon vendors and OEMs before it was open sourced, for applications such as 360-degree camera panoramic stitching, computational camera, virtual and augmented reality, image segmentation, feature detection and extraction, image processing, tracking, stereo and depth calculation, and several machine learning based algorithms.

ARM Introduces Bifrost Mali-G51 GPU, and Mali-V61 4K H.265 & VP9 Video Processing Unit

November 1st, 2016 4 comments

Back in May of this year, ARM unveiled the Mali-G71 GPU for premium devices, the company’s first GPU based on the Bifrost architecture. The company has now introduced the second Bifrost GPU, Mali-G51, targeting augmented & virtual reality and the higher resolution screens found in mainstream devices in 2018, as well as the Mali-V61 VPU with 4K H.265 & VP9 video decode and encode capabilities, previously known under the codename “Egil”.

Mali-G51 GPU

ARM Mali-G51 will be 60% more energy efficient, and will have 60% higher performance density compared to the Mali-T830 GPU, making the new GPU the most efficient ARM GPU to date. It will also be 30% smaller, and support 1080p to 4K displays.

Under the hood, Mali-G51 includes an updated low-level Bifrost instruction set, a dual-pixel shader core per GPU core to deliver twice the texel and pixel rates, the latest ARM Frame Buffer Compression (AFBC) 1.2, and support for the Vulkan, OpenGL ES 3.2, and OpenCL 2.0 APIs.

More information can be found on the product page, and in an ARM community blog post entitled “The Mali-G51 GPU brings premium performance to mainstream mobile”.

Mali-V61 VPU

Mali-V61 can scale from 1 to 8 cores to handle anything from 1080p60 up to 4K @ 120 fps, and supports 8-/10-bit HEVC and 8-/10-bit VP9 encoding and decoding up to 4K UHD, making it ideal for 4K video conferencing and chat, as well as 32MP multi-shot @ 20 fps.

The company claims H.265 and VP9 video encoding quality is about the same for a given bitrate with Mali-V61 as shown in the diagram below.

VP9 vs HEVC vs H.264 – Click to Enlarge

Beside the ability to select 1 to 8 cores, silicon vendors can also decide whether they need the encoding or decoding block in their SoC. For example, a camera SoC may not need video decoding support, while STB SoCs might do without encoding. While Mali-V61 is a premium IP block, ARM also expects it in mainstream devices, possibly alongside Cortex-A53 processor cores and a Mali-G51 GPU.

You’ll find more details on the product page, and in the ARM community blog post “Mali-V61 – Premium video processing for Generation Z and beyond”.

PowerVR GT7200 Plus and GT7400 Plus GPUs Support OpenCL 2.0, Better Computer Vision Features

January 7th, 2016 3 comments

Imagination Technologies introduced the PowerVR Series7XT GPU family with up to 512 cores at the end of 2014, and at CES 2016 they announced the Series7XT Plus family with GT7200 Plus and GT7400 Plus GPUs. These share many of the features of the Series7XT family, but add support for the OpenCL 2.0 API and improvements for computer vision thanks to a new Image Processing Data Master and 8-bit and 16-bit integer data paths, instead of just 32-bit in the previous generation; since four 8-bit values fit in one 32-bit lane, this can yield up to 4 times more performance for applications, e.g. deep learning, leveraging the OpenVX computer vision API.

Block Diagram (Click to Enlarge)

The GT7200 Plus GPU features 64 ALU cores in two clusters, while the GT7400 Plus has 128 ALU cores in a quad-cluster configuration. Beside OpenCL 2.0 and the computer vision improvements, they still support OpenGL ES 3.2, Vulkan, hardware virtualization, advanced security, and more. The company has also made some microarchitectural enhancements to improve performance and reduce power consumption:

  • Support for the latest bus interface features including requestor priority support
  • Doubled memory burst sizes, matching the latest system fabrics, memory controllers and memory components
  • Tuned the size of caches and improved their efficiency, leading to a ~10% reduction in bandwidth

The new features and improvements in PowerVR Series7XT Plus GPUs should help design better systems for image classification, face/body/gesture tracking, smart video surveillance, HDR rendering, advanced driver assistance systems (ADAS), object and scene reconstruction, augmented reality, visual inspection, robotics, etc.

You can find more details on Imagination Tech Blog.

Fujitsu MB86S70 and MB86S73 ARM Cortex A15 & A7 Processors Run Linux for the Embedded Market

November 28th, 2014 1 comment

I like to check the ARM Linux kernel mailing list from time to time, as you may discover a few upcoming ARM processors there. This week I found out that Exynos 5433 and Exynos 7 are actually two different processors (thanks David!), and that AMD had submitted code for their 64-bit ARM Opteron A1100 server SoC. I also noticed a patchset for Fujitsu MB86S7x SoCs, and since I don’t often mention Japanese silicon vendors, probably because they now deal mostly with the embedded market, which gets very little press, and most information is in Japanese, I decided to have a look.

Fujitsu MB86S70 Block Diagram

There appear to be four parts in the MB86S7x family: the MB86S70 quad-core processor with two ARM Cortex-A15 and two ARM Cortex-A7 cores in a big.LITTLE configuration, the MB86S71 and MB86S72 also with 2x A15 and 2x A7 cores, and the MB86S73 with two ARM Cortex-A7 cores only, all featuring a single- or quad-core Mali-T624 GPU.

Fujitsu provides an English comparison table for the MB86S70 and MB86S73 processors, but there’s very little info about the MB86S71/72 SoCs.

Block | Function | MB86S70 | MB86S73
CSS/DMC | CPU | Cortex-A15, 2 cores up to 2.4GHz, 1MB L2 cache | –
CSS/DMC | CPU | Cortex-A7, 2 cores up to 800MHz, 256KB L2 cache | Cortex-A7, 2 cores up to 1.2GHz, 512KB L2 cache
CSS/DMC | 3D/GPGPU | Mali-T624, 4 cores @ 400MHz, 128KB L2 cache | Mali-T624, 1 core @ 400MHz, 32KB L2 cache
CSS/DMC | MEMC | 2-ch DDR3-1.333Gbps, 32-bit | 1-ch DDR3-1.333Gbps, 64-bit
SCB | CPU | ARM Cortex-M3 @ 125MHz | ARM Cortex-M3 @ 125MHz
SCB | LAN | GbE, WoL, TCP acceleration | GbE, WoL, TCP acceleration
SCB | FLASH-IF | HSSPI, NOR, eMMC, NAND; SecureBoot (SROM/NOR) | HSSPI, NOR, eMMC, NAND; SecureBoot (SROM/NOR)
SCB | SERIAL-IF | 3x UART, 16x GPIO, 10x I2C | 3x UART, 16x GPIO, 3x I2C
MPB | CODEC | 1080p multi-encode, 4-stream H.264 decode, 32k × 32k JPEG codec | 32k × 32k JPEG codec
MPB | Display | HDMI 1.4a with HDCP, MIPI-DSI 1Gbps 4-lane | LVDS (1-ch clock / 4-ch data)
MPB | Capture | 1-ch RGB/YUV, 720p capture only | –
MPB | TSIF | 2x serial TS demux | –
MPB | Audio | 2-ch I2S (I/O independent) + 4-ch I2S (HDMI) | 2-ch I2S (I/O independent)
MPB | SD | 1-ch SDIO UHS-I | 1-ch SDIO UHS-I
HSIOB | PCIe | 2-ch PCIe Gen2 4-lane + data scrambler | 2-ch PCIe Gen2 4-lane + data scrambler
HSIOB | USB | 2-ch USB 3.0 host | USB 3.0 host
HSIOB | USB | USB 2.0 HDC | 1-ch USB 2.0 host, 1-ch USB 2.0 device

MB86S70 is the more powerful of the two, not only in terms of CPU power, but also with regards to multimedia capabilities, with 1080p multi-encode, 4-stream H.264 decode, TS demux, and RGB/YUV 720p video capture, whereas MB86S73 does not appear to support hardware video decoding or encoding at all, providing only JPEG acceleration and an LVDS interface, so it is most probably destined for use in control panels, for example. Both processors however feature high-speed interfaces such as USB 3.0 host, Gigabit Ethernet, and PCIe, the latter being not so common in ARM SoCs, and only found in a few products such as Freescale i.MX6 and Nvidia Tegra K1 SoCs.

Fujitsu MB86S73 Block Diagram

The company also provides evaluation boards for the two processors, together with a Linux-based software development platform with support for OpenGL, OpenCL, and OpenMAX for graphics and video decoding, and they’ve also started getting some code into the mainline kernel.

MB86S70 (Left) and MB86S73 (Right) Evaluation Kits (Click to Enlarge)

More information is available, in Japanese only, on Fujitsu’s Platform SoC page, and in a presentation (PDF) given at Java Day Tokyo 2014.

Imagination Technologies Introduces PowerVR Series7 GPUs with Up to 512 Cores, Virtualization Support

November 10th, 2014 3 comments

Imagination Technologies has announced a new PowerVR Series7 GPU architecture that will be used in their high-end PowerVR Series7XT GPUs, delivering up to 1.5 TFLOPS for mid-range and high-end mobile devices, set-top boxes, gaming consoles and even servers, as well as in their low-power, low-cost PowerVR Series7XE GPUs for entry-level mobile devices, set-top boxes, and wearables.

PowerVR Series7XT GPU Block Diagram

PowerVR Series7 GPUs, both Series7XT and Series7XE, can achieve up to a 60% performance improvement over PowerVR Series6XT/6XE GPUs for a given configuration. For example, a 64-core Series7XT GPU should be up to 60% faster than a 64-core PowerVR Series6XT clocked at the same frequency, with all of the extra performance coming from the different, improved architecture.

Some of Series7 architectural enhancements include:

  • Instruction set enhancements including added co-issue capability, resulting in improved application performance and increased GPU efficiency
  • New hierarchical layout structure that enables scalable polygon throughput and pixel fillrate improvements in addition to increased clock frequencies
  • GPU compute setup and cache throughput improvements resulting in up to 300% better parallel processing performance

The new GPUs can also optionally support 10-bit YUV color depths, security (e.g. for DRM), and hardware virtualization, as well as other features specific to some market segments:

  • Android Extension Pack (AEP) – Full hardware tessellation and native OpenGL ES 3.1 support. Compatible with Android 5.0 ‘Lollipop’ release.
  • DirectX 11 Feature Pack – Full DirectX 11.2 feature set for Microsoft operating systems.
  • OpenCL FP64 Feature Pack –  Scalable 64-bit floating point co-processor per cluster for high-performance server compute. Series7XT only.

PowerVR Series7XE Block Diagram

The PowerVR Series7XT family scales from 100 GFLOPS to 1.5 TFLOPS, and is designed to provide the best possible performance. It features AEP and 10-bit YUV support by default, and supports between two and sixteen clusters with 32 multi-threaded, multi-tasking ALU cores each. Current Series7XT GPUs include the GT7200 (64 cores), GT7400 (128 cores), GT7600 (192 cores), GT7800 (256 cores), and the GT7900, the most powerful PowerVR GPU to date with 512 cores.

On the other hand, Series7XE GPUs are optimized for area, efficiency, and cost thanks to feature configurability, which lets SoC manufacturers choose whether they want options such as 10-bit YUV support for HEVC, virtualization, or AEP support. Beside low-cost mobile devices and media players, Series7XE GPUs are also expected to be used in photocopiers, printers, and other consumer and enterprise devices which may require 3D user interfaces at a lower price point. There are two GPUs in the Series7XE family: the GE7400 with 16 cores, and the GE7800 with 32 cores.

The company will provide their usual free PowerVR SDK for 3D graphics and GPU compute application development. Hypervisors will be able to utilize the GPU virtualization to implement true heterogeneous security in any of the PowerVR Series7 GPUs (if virtualization is enabled).

PowerVR Series7XE and Series7XT GPUs are available for licensing now, and licensing partners have already started implementing the new GPU IP into their SoCs. More technical details can be found in two blog posts: New PowerVR Series7XE family targets the next billion mobile and embedded GPUs and PowerVR Series7XT GPUs push graphics and compute performance to the max.

Adapteva Announces Three Parallella Fanless Boards for Microserver, Desktop, and Embedded Applications

July 15th, 2014 6 comments

Adapteva’s Parallella is a low-cost, open source hardware “supercomputer” board powered by a Xilinx Zynq-7010/7020 dual-core Cortex-A9 + FPGA SoC and the company’s Epiphany coprocessor. It had a successful Kickstarter campaign in 2012, with the 16-core version selling for just $99, it is capable of handling applications such as image and video processing and ray-tracing, and it comes with an OpenCL SDK. The board was fairly difficult to source after the crowdfunding campaign, and one of the common complaints from backers was that the board had to be actively cooled by a fan. The company has fixed both issues by slightly increasing the price, and by redesigning the board so that it can be passively cooled by a larger heatsink.

Parallella Desktop Board with Heatsink

There are now three versions of the Parallella board:

  • Parallella Microserver ($119) – Used as an Ethernet connected headless server
  • Parallella Desktop ($149) – Used as a  personal computer
  • Parallella Embedded ($249) – Used for “leading edge” embedded systems

Here are the simplified specs of the boards:

  • SoC
    • Microserver and Desktop – Xilinx Zynq Z7010 dual-Core ARM Cortex A9 with 512KB L2 Shared Cache + Artix-7 FPGA with 28K logic cells
    • Embedded – Xilinx Zynq Z7020 dual-Core ARM Cortex A9 with 512KB L2 Shared Cache + Artix-7 FPGA with 85K logic cells
  • Coprocessor – 16-core Epiphany-III processor
  • System Memory – 1GB DDR3
  • Storage – micro SD slot + 128Mb quad SPI flash
    • Connectivity – 10/100/1000M Ethernet
  • Video Output – 1x micro HDMI (Desktop and Embedded only)
  • USB – 1x micro USB host port  (Desktop and Embedded only)
  • Expansion I/O
    • Microserver – N/A
    • Desktop – 2 eLinks (Epiphany Links) + 24 GPIO pins
    • Embedded – 2 eLinks + 48 GPIO pins
  • Dimensions – 86.36mm x 53.34mm

Parallella Embedded

The boards will sell with the heatsink and a power adapter. If you have one of the boards from the Kickstarter campaign, or a board purchased before the 10th of July, you can’t go fanless by simply replacing the fan with the new heatsink, as it won’t fit.

The Parallella-16 Desktop computer is available now for $149 on the Adapteva shop, and in a couple of days it will also be on Amazon US. The Microserver and Embedded versions will be available in a few weeks. You can read the announcement on the company’s website, where you’ll also find some interesting projects (videos) completed so far by the community of developers.