Posts Tagged ‘gpu’

ARM Cortex-A75 & Cortex-A55 Cores, and Mali-G72 GPU Details Revealed

May 27th, 2017

We’ve already seen that ARM Cortex A75 cores were coming thanks to a leak showing the Snapdragon 845 SoC will feature custom Cortex A75 cores, but we did not have many details. But since we live in a world where “to leak is glorious”, we already have some slides, originally leaked through VideoCardz – the post has since been deleted, but Liliputing & TheAndroidSoul got some of the slides before deletion – so let’s see what we’ve got here.

ARM Cortex A75

So ARM Cortex-A75 will be about 20% faster than Cortex-A73 for single-threaded operation, the latter itself already 30% faster than Cortex-A72. It will also be the first DynamIQ-capable processor, together with Cortex-A55, with both cores potentially used in big.LITTLE configuration.

Cortex A75 is only better in terms of peak performance; sustained performance remains the same as Cortex-A73’s.

The chart above does not start at zero, so it appears as though there are massive performance increases, but look at the numbers and we can see a 1.34x higher score with GeekBench, and 1.48x with Octane 2.0. Other benchmarks also show higher scores, but between 1.16x and 1.33x.


Cortex A75 cores will be manufactured using 10nm process technology, and clocked at up to 3.0 GHz. While (peak) performance will be higher than Cortex A73, efficiency will remain the same.

ARM Cortex A55


ARM Cortex A55 is the successor of Cortex-A53, with about twice the performance, and support for up to eight cores in a single cluster. There are octa-core (and even 24-core) ARM Cortex A53 processors, but they use multiple 4-core clusters.


Power efficiency is 15% better too, and ARM claims it is 10x more configurable, probably because of DynamIQ & 8-core cluster support.


If we have a closer look at the benchmarks released by the company, we can see the 2x performance increase is only valid for the LMBench memcpy memory benchmark, with other benchmarks from GeekBench v4 to SPECINT2006 showing 1.14x to 1.38x better performance. So integer performance appears to be only slightly better, floating point gets close to a 40% improvement, and the most noticeable gain is in memory bandwidth.

ARM Mali-G72 GPU


Mali-G72 will offer a 1.4x performance improvement over 2017 devices – which must mean Mali-G71 – and will allow for machine learning directly on the device instead of having to rely on the cloud, better games, and an improved mobile VR experience.


The new GPU is also 25% more efficient, and supports up to 32 shader cores. GEMM (general matrix multiplication) performance – used for example in machine learning algorithms – is improved by 17% over Mali-G71.


Based on the information we’ve got from the Qualcomm Snapdragon 845 leak, devices based on ARM Cortex A75/A55 processors and the Mali-G72 GPU should start selling in Q1 2018. We may learn a few more details on Monday, once the embargo is lifted.

Getting Started with OpenCV for Tegra on NVIDIA Tegra K1, CPU vs GPU Computer Vision Comparison

May 24th, 2017

This is a guest post by Leonardo Graboski Veiga, Field Application Engineer, Toradex Brasil

Introduction

Computer vision (CV) is everywhere – from cars to surveillance and production lines, the need for efficient, low-power yet powerful embedded systems makes this one of the bleeding-edge areas of technology development.

Since computer vision is a very computationally intensive task, running CV algorithms on an embedded system’s CPU might not be enough for some applications. Developers and scientists have noticed that the use of dedicated hardware, such as co-processors and GPUs – the latter traditionally employed for graphics rendering – can greatly improve the performance of CV algorithms.

In the embedded scenario, things usually are not as simple as they look. Embedded GPUs tend to be different from desktop GPUs, thus requiring many workarounds to get extra performance from them. A good example of a drawback of embedded GPUs is that they are hardly supported by OpenCV – the de facto standard library for computer vision – thus requiring a big effort from the developer to achieve some performance gains.

The silicon manufacturers are paying attention to the growing need for graphics and CV-oriented embedded systems, and powerful processors are being released. This is the case with the NVIDIA Tegra K1, which has a built-in GPU using the NVIDIA Kepler architecture, with 192 cores and a processing power of 325 GFLOPS. In addition, this is one of the very few embedded GPUs in the market that supports CUDA, a parallel computing platform from NVIDIA. The good news is that OpenCV also supports CUDA.

And this is why Toradex has decided to develop a System on Module (aka Computer on Module) – the Apalis TK1 – using this processor. On it, the K1 SoC’s quad-core ARM Cortex-A15 CPU runs at up to 2.2 GHz, interfaced to 2GB of DDR3L RAM and a 16GB 8-bit eMMC. The full specification of the CoM can be found here.

The purpose of this article is to install the NVIDIA JetPack on the Apalis TK1 System on Module, thus also installing OpenCV for Tegra, and to assess how much effort is required to code a simple CV application accelerated by CUDA. The public OpenCV is also tested using the same examples, to determine whether it is a viable alternative to the closed-source version from NVIDIA.

Hardware

The hardware employed in this article consists of the Apalis TK1 System on Module and the Apalis Evaluation Board. The main features of the Apalis TK1 have been presented in the introduction, and regarding the Apalis Evaluation Board, we will use the DVI output to connect to a display and the USB ports to interface a USB camera and a keyboard. The Apalis TK1 is presented in figure 1 and the Apalis Evaluation Board in figure 2:

Figure 1 – Apalis TK1 – Click to Enlarge

Figure 2 – Apalis Evaluation Board – Click to Enlarge

System Setup

NVIDIA already provides an SDK package – the NVIDIA JetPack – that comes with all the tools supported for the TK1 architecture. It is an easy way to start developing applications with OpenCV for Tegra support. JetPack also provides many source code samples for CUDA, VisionWorks, and GameWorks. It also installs NVIDIA Nsight, an Eclipse-based IDE that can be useful for debugging CPU and GPU applications.

OpenCV for Tegra is based on version 2.4.13 of the public OpenCV source code. It is closed-source but free to use and benefits from NEON and multicore optimizations that are not present in the open-source version; on the other hand, the non-free libraries are not included. If you want or need the open-source version, you can find more information on how to build OpenCV with CUDA support here – these instructions were followed and the public OpenCV 2.4.13 was also tested during this article’s development.

Toradex provides an article on its developer website with concise information describing how to install JetPack on the Apalis TK1.

Regarding hardware, it is recommended that you have a USB webcam connected to the Apalis Evaluation Board, because the samples tested in this article often need a video source as input.

OpenCV for Tegra

After you have finished installing the NVIDIA JetPack, OpenCV for Tegra will already be installed on the system, as well as the toolchain required for compilation on the target. You must have access to a serial terminal by means of a USB to RS-232 adapter, or an SSH connection.

If you want to run Python code, an additional step on the target is required:
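As a hedged sketch, the step amounts to installing the Python bindings for OpenCV for Tegra from the JetPack apt repositories; the package name below is our assumption and may differ depending on the JetPack version:

    # Install the Python bindings for OpenCV for Tegra (package name may vary)
    sudo apt-get update
    sudo apt-get install libopencv4tegra-python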

The easiest way to check that everything works as expected is to compile and run some samples from the public OpenCV repository, since it already has the CMake configuration files as well as some source code for applications that make use of CUDA:
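As a rough sketch, assuming an internet connection on the target and the pre-installed toolchain (flags and the parallel job count are illustrative):

    # Get the public OpenCV 2.4.13 sources and build the samples with CUDA support
    git clone https://github.com/opencv/opencv.git
    cd opencv && git checkout 2.4.13
    mkdir build && cd build
    cmake -DWITH_CUDA=ON -DBUILD_EXAMPLES=ON ..
    make -j4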

We can begin testing a Python sample, for instance, the edge detector. The running application is displayed in figure 3.
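For instance, assuming the public OpenCV sources fetched above and a webcam as the video source:

    # Run the Python edge detector sample from the OpenCV source tree
    cd opencv/samples/python2
    python edge.py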

Figure 3 – running Python edge detector sample – Click to Enlarge

After the samples are compiled, you can try some of them. A nice choice is the “background/foreground segmentation” samples, since they are available with and without GPU support. You can run them with the commands below, and see the results in figures 4 and 5.
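As a hedged example – OpenCV names sample binaries <module>-example-<name> when built with BUILD_EXAMPLES=ON, so the exact paths may differ on your build:

    # CPU version
    ./opencv/build/bin/cpp-example-bgfg_segm
    # GPU (CUDA) version
    ./opencv/build/bin/gpu-example-bgfg_segm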

Figure 4 – running bgfg_segm CPU sample – Click to Enlarge

Figure 5 – running bgfg_segm GPU sample – Click to Enlarge

By running both samples it is possible to subjectively notice the performance difference. The CPU version has more delay.

Playing Around

After having things set up, the question comes: how easy is it to port an application from CPU to GPU, or even start developing with GPU support? It was decided to play around a little with the Sobel application that is well described in the Sobel Derivatives tutorial.

The purpose is to check if it’s possible to benefit from CUDA out-of-the-box, therefore only the getTickCount function from OpenCV is employed to measure the execution time of the main loop of the Sobel implementations. You can use NVIDIA Nsight for advanced remote debugging and profiling.

The Code

The first code runs completely on the CPU, while in the first attempt to port to the GPU (the second code, which will be called CPU-GPU), the goal is to find functions analogous to the CPU ones, but with GPU optimization. In the last attempt to port, some improvements are made, such as creating filter engines – which reduces buffer allocation – and finding a way to replace the CPU function convertScaleAbs with GPU-accelerated functions.

A diagram describing the loop for the three examples is provided in figure 6.

Figure 6 – CPU / CPU-GPU / GPU main loop for Sobel implementations

The main loop for the three applications tested is presented below. You can find the full source code for them on Github:

  • CPU only code:
  • CPU-GPU code:
  • GPU code
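For illustration only, here is a condensed sketch of what the GPU variant’s main loop boils down to with the OpenCV 2.4 gpu module. This is not the exact code from the repository; in particular, the grayscale-first ordering is our simplification to stay within the image formats the gpu filters accept:

    #include <iostream>
    #include <opencv2/core/core.hpp>
    #include <opencv2/highgui/highgui.hpp>
    #include <opencv2/imgproc/imgproc.hpp>
    #include <opencv2/gpu/gpu.hpp>

    using namespace cv;

    int main()
    {
        Mat src = imread("input.jpg");
        gpu::GpuMat d_src, d_gray, d_blur, d_gx, d_gy, d_ax, d_ay, d_dst;

        int64 start = getTickCount();

        d_src.upload(src);                                  // CPU -> GPU copy
        gpu::cvtColor(d_src, d_gray, CV_BGR2GRAY);          // grayscale conversion
        gpu::GaussianBlur(d_gray, d_blur, Size(3, 3), 0);   // noise reduction
        gpu::Sobel(d_blur, d_gx, CV_16S, 1, 0, 3);          // horizontal derivative
        gpu::Sobel(d_blur, d_gy, CV_16S, 0, 1, 3);          // vertical derivative
        // convertScaleAbs replacement: absolute value, then 16S -> 8U conversion
        gpu::abs(d_gx, d_gx);
        gpu::abs(d_gy, d_gy);
        d_gx.convertTo(d_ax, CV_8U);
        d_gy.convertTo(d_ay, CV_8U);
        gpu::addWeighted(d_ax, 0.5, d_ay, 0.5, 0, d_dst);   // combine gradients

        Mat result;
        d_dst.download(result);                             // GPU -> CPU copy

        double ms = (getTickCount() - start) * 1000.0 / getTickFrequency();
        std::cout << "Sobel main loop took " << ms << " ms" << std::endl;
        imwrite("sobel.jpg", result);
        return 0;
    }

Note how the upload/download calls bracket the processing: as discussed in the analysis below, these transfers are a significant part of the total cost.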

The Tests

  • Each of the three examples is executed using a random picture in JPEG format as input.
  • The input picture dimensions in pixels that were tested are: 3483×2642, 2122×1415, 845×450 and 460×290.
  • Therefore there are 12 runs in total (3 implementations × 4 image sizes).
  • The main loop is iterated 500 times for each run.
  • All of the steps described in figure 6 have their execution time measured.
  • The numbers presented in the results are the average values of the 500 iterations for each run.

The Results

The results presented are the total time required to execute the main loop – with and without image capture and display time, available in tables 1 and 2 – and the time each task takes to be executed, which is described in figures 7, 8, 9 and 10. If you want to have a look at the raw data or reproduce the tests, everything is in the aforelinked GitHub repository.

Table 1 – Main loop execution time, in milliseconds

Table 2 – Main loop execution time, discarding read and display image times, in milliseconds

Figure 7 – execution time by task – larger image (3483×2642 pixels) – Click to Enlarge

Figure 8 – execution time by task – large image (2122×1415 pixels) – Click to Enlarge

Figure 9 – execution time by task – small image (845×450 pixels) – Click to Enlarge

Figure 10 – execution time by task – smaller image (460×290 pixels) – Click to Enlarge

The Analysis

Regarding OpenCV for Tegra in comparison to the public OpenCV, the results point out that OpenCV for Tegra has been optimized, mostly for some CPU functions. Even when discarding image read – which takes a long time to execute, and shows approximately a 2x gain – and display frame execution times, OpenCV for Tegra still bests the open-source version.

When considering only OpenCV for Tegra, the tables show that using GPU functions without care might even make performance worse than using only the CPU. It is also possible to notice that, for these specific implementations, the GPU is better for large images, while the CPU is best for small images. Where they tie, it would be nice to have a power consumption comparison – which hasn’t been done – and to keep in mind that this GPU code is not optimized as much as it could be.

Looking at figures 7 to 10, it can be seen that the Gaussian blur and the 16-bit to 8-bit scale conversion got a big boost when running on the GPU, while the conversion of the original image to grayscale and the Sobel derivatives had their performance degraded. Another point of interest is that transferring data to/from the GPU has a high cost, which is, in part, one of the reasons why the first GPU port was unsuccessful – it had more copies than needed.

Regarding image size, it can be noticed that image read and display have an impact on overall performance that might be relevant depending on the complexity of the algorithm being implemented, or how the image capture is being done.

There are probably many ways to make this code more optimized, be it by only using OpenCV; by combining custom CUDA functions with OpenCV; by writing the application fully in CUDA; or by using another framework or tool such as VisionWorks.

Two points that might be of interest regarding optimization still in OpenCV are the use of streams – asynchronous execution of code on the CPU/GPU – and zero-copy or shared memory, since the Tegra K1 has CPU and GPU shared memory supported by CUDA (see this NVIDIA presentation from GPU Technology Conference and this NVIDIA blog post for reference).
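As a hedged illustration of the first point (OpenCV 2.4 gpu API; this shows the mechanics, not measured code):

    #include <opencv2/core/core.hpp>
    #include <opencv2/imgproc/imgproc.hpp>
    #include <opencv2/gpu/gpu.hpp>

    using namespace cv;

    // Sketch: overlap CPU work with GPU processing using cv::gpu::Stream
    void process_async(const Mat& frame, Mat& result)
    {
        static gpu::Stream stream;
        // Page-locked host memory enables truly asynchronous copies;
        // on Tegra K1, gpu::CudaMem::ALLOC_ZEROCOPY could instead map the
        // same physical memory into the GPU address space (shared memory)
        gpu::CudaMem host(frame.size(), frame.type(),
                          gpu::CudaMem::ALLOC_PAGE_LOCKED);
        frame.copyTo(host.createMatHeader());
        result.create(frame.size(), CV_8UC1);

        gpu::GpuMat d_src, d_gray;
        stream.enqueueUpload(host, d_src);              // async CPU -> GPU copy
        gpu::cvtColor(d_src, d_gray, CV_BGR2GRAY, 0, stream);
        stream.enqueueDownload(d_gray, result);         // async GPU -> CPU copy
        // ... the CPU is free to do useful work here while the GPU runs ...
        stream.waitForCompletion();                     // synchronize before using result
    }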

Conclusion

In this article, the installation of the NVIDIA JetPack SDK and its deployment on the Toradex Apalis TK1 have been presented. Having this tool installed, you are able to use OpenCV for Tegra, thus benefiting from all of the optimizations provided by NVIDIA. The JetPack SDK also provides much other useful content, such as CUDA, VisionWorks and GameWorks samples, and the NVIDIA Nsight IDE.

In order to assess how easy it is for a developer freshly introduced to the CV and GPU concepts to take advantage of CUDA, purely using OpenCV optimized functions, a CPU to GPU port of a Sobel filter application was written and tested. From this experience, some interesting results were found, such as the fact that the GPU indeed improves performance – and the magnitude of this improvement depends on a series of factors: the size of the input image, the quality of the implementation (or developer experience), the algorithms being used, and the complexity of the application.

With a myriad of sample source code available, it is easy to start developing your own applications, although care is required to make the Apalis TK1 System on Module yield its best performance. You can find more development information in the NVIDIA documentation, as well as the OpenCV documentation. Toradex also provides documentation about Linux usage on its developer website, and has a community forum. Hope this information was helpful, see you next time!

Android Play Store Tidbits – Blocking Unlocked/Uncertified/Rooted Devices, Graphics Drivers as an App

May 20th, 2017

There have been at least two or three notable stories about the Play Store this week. It started with Netflix no longer installing from the Google Play Store on rooted devices, devices with unlocked bootloaders, or uncertified devices, instead showing as “incompatible”. AndroidPolice contacted Netflix, which answered:

With our latest 5.0 release, we now fully rely on the Widevine DRM provided by Google; therefore, many devices that are not Google-certified or have been altered will no longer work with our latest app and those users will no longer see the Netflix app in the Play Store.

So that means you need Google Widevine DRM in your device, which means many Android TV boxes may stop working with Netflix. You can check whether your device is certified by opening Google Play, tapping Settings, and scrolling to the bottom: Device Certification will show either Certified or Uncertified (H/T jon for the tip).

I tried this on my Chinese phone, and unsurprisingly it is “Uncertified”. AndroidPolice however successfully tested both Netflix 4.16 and Netflix 5.0.4 on an unlocked Galaxy S tab with Level 3 DRM, and both worked. So the only drawback right now is that you can’t install Netflix from the Play Store, but it still works normally. Some boxes do not come with any DRM at all – which you can check with DRM Info – and they may not work at all (TBC).

We’ve now learned this will not only affect Netflix, as developers will now be able to block installation of apps that fail “SafetyNet”, as explained at Google I/O 2017:

Developers will be able to choose from the 3 states shown in the image above:

  • not excluding devices based on SafetyNet
  • excluding those that don’t pass integrity
  • excluding the latter plus those that aren’t certified by Google.

That means any dev could potentially block their apps from showing up and being directly installable in the Play Store on devices that are rooted and/or running a custom ROM, as well as on emulators and uncertified devices. This is exactly what many of you were afraid would happen after the Play Store app started surfacing a Device certification status.

This means it might become more complicated to install apps from the Google Play Store on some devices, and we may have to start side-loading apps again, or use other app stores. That’s provided they don’t stop the apps from running altogether. The latter has been possible for years; for example, many mobile banking apps refuse to run on rooted phones.

I’ll end with some better news: starting with Android O, it will be possible to update graphics drivers from the Play Store, just like you would update an app. Usually, a graphics driver update requires an OTA firmware update, or flashing a new firmware image manually, and it’s quite possible this new feature has been made possible thanks to Project Treble.


Imagination PowerVR “Furian” Series8XT GT8525 GPU Targets High-end Smartphones, Virtual Reality and Automotive Products

May 11th, 2017

Imagination Technologies has unveiled their first GPU based on the PowerVR Furian architecture: the Series8XT GT8525 GPU is equipped with two clusters, and designed for SoCs going into products such as high-end smartphones and tablets, mid-range dedicated VR and AR devices, and mid- to high-end automotive infotainment and ADAS systems.

Block Diagram for PowerVR Furian GT8525 GPU – Click to Enlarge

The Furian architecture is said to allow for improvements in performance density, GPU efficiency, and system efficiency; it features a new 32-wide ALU cluster design, and can be manufactured using sub-14nm processes (e.g. 7nm once available). PowerVR GT8525 GPU supports compute APIs such as OpenCL 2.0, Vulkan 1.0 and OpenVX 1.1.

Compared to the previous Series7XT GPU family, Series8XT GT8525 GPU delivers 80% higher fps in the T-Rex benchmark, an extra 50% fps in the GFXBench Manhattan benchmark, 50% higher fps in AnTuTu, doubles the fillrate throughput for GUI, and increases GFLOPS for compute applications by over 50%.

GT8525 GPU is available for licensing now, and has already been delivered to lead customers. More details should eventually surface on PowerVR Series8XT Core page.

Think Silicon Ultra Low Power NEMA GPUs are Designed for Wearables and IoT Applications

May 8th, 2017

When you purchase a wearable device – let’s say a smartwatch or fitness tracker – you have to make trade-offs between user interface and battery life. For example, a fitness tracker such as Xiaomi Mi Band 2 will last about 2 weeks per charge with a limited display, while Android smartwatches with a much better interface need to be recharged every 1 or 2 days. Think Silicon aims to improve the battery life of devices with nicer user interfaces thanks to their ultra-low power NEMA 2D, 3D, and GP GPUs, which can be integrated into SoCs with ARM Cortex-M and Cortex-A cores.

Nema|t 3D GPU Block Diagram

The company has three families of GPUs:

  • NEMA|p pico 2D GPU with one core
    • 4bpp framebuffer, 6bpp texture with/out alpha
    • Fill Rate – 1pixel/cycle
    • Silicon Area – 0.07 mm2 with 28nm process
    • Power Consumption – leakage power GPU consumption of 0.06mW; with compression (TSFSc): 0.03 mW
  • NEMA|t tiny 2D & 3D GPU with one to 4 cores
    • 4bpp framebuffer, 6bpp texture with/out alpha
    • OpenGL ES support
    • Can render a 420×420 3D UI @ 80 MHz
    • Fill Rate – 1-4pixel/cycle; up to 1,600 MPixel/s for the quad core version  @ 400 MHz
    • Silicon Area – 0.1 to 0.25 mm2 with 28nm process
    • Power Consumption – leakage power GPU consumption of 0.07mW; with proprietary compression technology (TSFBc, TSTXc): 0.03 mW
  • NEMA|s GPGPU with one to four cores
    • Supports Network On Chip (NoC) interconnect for clusters with each cluster supporting up to four cores, and each core handling up to 128 threads
    • Fill Rate – 1pixel/cycle
    • Silicon Area and Power Consumption – TBA, as Nema|s is only implemented via FPGA for now

NEMA|s GPU

The first two models are available right now, while the third is still in development. The company is also working on a fourth family with the NEMA|ts “tiny small” GPU, but no details have been provided.

Provided the website is up-to-date, the NEMA|p 2D GPU is supported in FreeRTOS V8.0.1 and Linux kernel 3.x, while NEMA|t can be used with Linux 3.x and Android 4.x. The company also provides a software library in ANSI C, as well as DirectFB and Qt support.

I found out about the NEMA GPUs through a Charbax video at Mobile World Congress 2017.


Think Silicon GPUs are said to already be used in Microchip and Dialog MCUs, and Sequans recently announced an “LTE for IoT System-on-Chip” with a NEMA|p 2D/2.5D GPU. The demo in the video above also shows an Ambiq Micro board connected to an FPGA implementation of one of their GPUs. You’ll find more information on the Think Silicon website.

MQMaker MiQi & ASUS Tinker Boards Get Linux 4.11 with 3D Graphics Acceleration

May 2nd, 2017

One day after the release of Linux 4.11, developer “Miouyouyou” has released Linux 4.11 for Rockchip RK3288 platforms such as the MQMaker MiQi and ASUS Tinker boards, with some patchsets for the ARM Mali r16p0 kernel drivers, ARM fbdev, and performance improvements.

The kernel has been tested with the Mali r12p0 user-space drivers for fbdev and Wayland written for the Firefly-RK3288, and some OpenGL ES 3.1/3.2 samples could successfully run on the board. 3D graphics acceleration does not work in X11, however.

Miouyouyou also plans to add support for Rockchip VPU code, as well as ARM gator, and document how to use ARM DS-5 Streamline for OpenGL ES 2.x/3.x debugging.

If you have a MiQi or Tinker board running Debian, you can try the kernel by adding the beta.armbian.com Debian repository to your apt sources file, and installing the following packages:
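As a hedged sketch of the procedure (the repository line and the package names are our assumptions – check beta.armbian.com for the exact ones):

    # Add the Armbian beta repository and install the kernel packages
    echo "deb http://beta.armbian.com $(lsb_release -cs) main" | \
        sudo tee /etc/apt/sources.list.d/armbian-beta.list
    sudo apt-get update
    sudo apt-get install linux-image-dev-rockchip linux-dtb-dev-rockchip linux-headers-dev-rockchip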

Via linux-rockchip G+ community.

Linux 4.11 Release – Main Changes, ARM & MIPS Architecture

May 1st, 2017

Linus Torvalds has just released Linux 4.11:

So after that extra week with an rc8, things were pretty calm, and I’m much happier releasing a final 4.11 now.

We still had various smaller fixes the last week, but nothing that made me go “hmm..”. Shortlog appended for people who want to peruse the details, but it’s a mix all over, with about half being drivers (networking dominates, but some sound fixlets too), with the rest being some arch updates, generic networking, and filesystem (nfs[d]) fixes. But it’s all really small, which is what I like to see the last week of the release cycle.

And with this, the merge window is obviously open. I already have two pull requests for 4.12 in my inbox, I expect that overnight I’ll get a lot more.

Linux 4.10 added Virtual GPU support, the ‘perf c2c’ tool, improved writeback management, a faster initial WiFi connection (802.11ai), and more.

Some notable changes for Linux 4.11 include:

  • Pluggable IO schedulers framework in the multiqueue block layer – The Linux block layer is known to have different IO schedulers (deadline, cfq, noop, etc). In Linux 3.13, the block layer added a new multiqueue design that performs better with modern hardware (e.g. SSD, NVM). However, this new multiqueue design didn’t include support for pluggable IO schedulers. This release solves that problem with the merge of a multiqueue-ready IO scheduling framework. A port of the deadline scheduler has also been added (more IO schedulers will be added in the future)
  • Support for OPAL drives – The Opal Storage Specification is a set of specifications for features of data storage devices that enhance their security. For example, it defines a way of encrypting the stored data so that an unauthorized person who gains possession of the device cannot see the data. This release adds Linux support for Opal-enabled NVMe controllers. It enables users to setup/unlock/lock locking ranges of SED devices using the Opal protocol.
  • Support for the SMC-R protocol (RFC7609) – This release includes the initial part of the implementation of the “Shared Memory Communications-RDMA” (SMC-R) protocol as defined in RFC7609. SMC-R is an IBM protocol that provides RDMA capabilities over RoCE transparently for applications exploiting TCP sockets. While SMC-R does not aim to replace TCP, it taps a wealth of existing data center TCP socket applications to become more efficient without the need for rewriting them. A new socket protocol family PF_SMC is introduced. There are no changes required to applications using the sockets API for TCP stream sockets other than the specification of the new socket family AF_SMC. Unmodified applications can be used by means of a dynamic preload shared library; a minimal socket sketch follows this list.
  • Intel Bay Trail (and Cherry Trail) improvements – Intel HDMI audio support, patchsets for AXP288 PMIC, I2C driver, and C-state support to avoid freezes.
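As a minimal sketch of the SMC-R point above, creating an SMC socket only changes the address family compared to TCP (AF_SMC is 43 in linux/socket.h as of Linux 4.11; older libc headers may not define it yet):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #ifndef AF_SMC
    #define AF_SMC 43   /* from <linux/socket.h>, added in Linux 4.11 */
    #endif

    int main(void)
    {
        /* Identical to a TCP stream socket, except for the address family */
        int fd = socket(AF_SMC, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket(AF_SMC)");   /* fails if the kernel lacks SMC support */
            return 1;
        }
        close(fd);
        return 0;
    }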

New features and bug fixes specific to ARM architecture:

  • Allwinner:
    • Allwinner A23 –  Audio codec device tree changes
    • Allwinner A31 – SPDIF output support
    • Allwinner A33 – cpufreq support, Audio codec support
    • Allwinner A64 – MMC Support, USB support
    • Allwinner A80 – sunxi-ng style clock support
    • Allwinner H2+ – New SoC variant, similar to H3 (mostly with a different, lower end VPU)
    • Allwinner H3 – Audio codec device tree changes, SPDIF output support
    • Allwinner V3s – New SoC support, USB PHY driver, pinctrl driver, CCU driver
    • New boards & devices – LicheePi One, Orange Pi Zero, LicheePi Zero, Banana Pi M64, Beelink X2
  • Rockchip:
    • Renamed RK1108 to RV1108
    • Clock drivers – New driver for RK3328, and non-critical fixes and clk id additions
    • Tweaks for Rockchip GRF (General Register File) usage (kitchensink misc register range on the SoCs)
    • thermal, eDP, pinctrl enhancements
    • PCI – add Rockchip system power management support
    • Add machine driver for RK3288 boards that use analog/HDMI audio
  • Amlogic
    • Add support for Amlogic Meson I2C controller
    • Add SAR ADC driver
    • Add ADC laddered keys to meson-gxbb-p200 board
    • Add configurable RGMII TX delay to fix/improve Gigabit Ethernet performance on some boards
    • Add pinctrl nodes for HDMI HPD and DDC pins modes for Amlogic Meson GXL and GXBB SoCs
    • New hardware: WeTek TV boxes
  • Samsung
    • Add USB 3.0 support in Exynos 5433
    • Removed clock driver for Samsung Exynos4415 SoCs
    • TM2 touchkey, Exynos5433 HDMI and power management improvements
    • Added Samsung Exynos4412 Prime SoC
    • Removed Samsung Exynos 4412 SoC
    • Added audio on Odroid-X board
    • Samsung Device Tree updates:
      • Add necessary initial configuration for clocks of the display subsystem. Till now it worked mostly thanks to the bootloader.
      • Use macro definitions instead of hard-coded values for pinctrl on Exynos7.
      • Enable USB 3.0 (DWC3) on Exynos7.
      • Add descriptive user-friendly label names for power domains. This makes debugging easier.
      • Use proper drive strengths on Exynos7.
      • Use a bigger reserved memory region for the Multi Format Codec on all Exynos chipsets so it can decode Full HD easily.
      • Cleanup of old MACHs in s5pv210.
      • Enable IP_MULTICAST for libnss-mdns.
      • Add bus frequency and voltage scaling on the Exynos5433 TM2 device (along with necessary bus nodes and the Platform Performance Monitoring Unit on Exynos5433).
      • Use macros for pinctrl settings on Exynos5433.
      • Create common DTSI between Exynos5433 TM2 and TM2E.
  • Qualcomm
    • Added coresight, gyro/accelerometer, hdmi to Qualcomm MSM8916 SoC
    • Clock drivers – Updates to Qualcomm IPQ4019 CPU clks and general PLL support, Qualcomm MSM8974 RPM
    • Errata workarounds for Qualcomm’s Falkor CPU
    • Qualcomm L2 Cache PMU driver
    • Qualcomm SMCCC firmware quirk
    • Qualcomm PM8xxx ADC bindings
    • Add USB HSIC and HS phy driver for Qualcomm’s SoC
    • Device Tree Changes:
      • Add Coresight components for APQ8064
      • Fixup PM8058 nodes
      • Add APQ8060 gyro and accel support
      • Enable SD600 HDMI support
      • Add RIVA support for Sony Yuga and SD600
      • Add PM8821 support
      • Add MSM8974 ADSP, USB gadget, SMD, and SMP2P support
      • Fix IPQ8064 clock frequencies
      • Enable APQ8060 Dragonboard related devices
      • Add Vol+ support for DB820C and APQ8016
      • Add HDMI audio support for APQ8016
      • Fix DB820C GPIO pinctrl name
      • etc…
  • Mediatek
    • Mediatek MT2701 – Added clocks, iommu, spi, nand, adc, thermal
    • Added Mediatek MT8173 thermal
    • Added Mediatek IR remote receiver
  • GPU – Add Mali Utgard bindings;  the ARM Mali Utgard GPU family is embedded into a number of SoCs from Allwinner, Amlogic, Mediatek or Rockchip
  • Other new ARM hardware platforms and SoCs:
    • Marvell – SolidRun MACCHIATOBin board, Marvell Prestera DX packet processors
    • Broadcom – BCM958712DxXMC NorthStar2 reference board
    • HiSilicon – Kirin960/Hi3660 SoC, and HiKey960 development board
    • NXP – LS1012a SoC with three reference boards; SoMs: Is.IoT MX6UL, SavageBoard, Engicam i.Core; Liebherr (LWN) monitor 6;
    • Microchip/Atmel – SAMA5d36ek Reference platform
    • Texas Instruments – Beaglebone Green Wireless and Black Wireless, phyCORE-AM335x System on Module
    • Lego Mindstorms EV3
    • “Romulus” baseboard management controller for OpenPower
    • Axentia TSE-850 Data Radio Channel (DARC) encoder
    • Luxul XAP-1410 and XWR-1200 wireless access points
    • New revision of “vf610-zii” Zodiac Inflight Innovations board

Finally, here are some of the changes made to the MIPS architecture in Linux 4.11:

  • PCI: Register controllers in the right order to avoid a PCI error
  • KGDB: Use kernel context for sleeping threads
  • smp-cps: Fix potentially uninitialised value of core
  • KASLR: Fix build
  • ELF: Fix BUG() warning in arch_check_elf
  • Fix modversioning of _mcount symbol
  • fix out-of-tree defconfig target builds
  • cevt-r4k: Fix out-of-bounds array access
  • perf: fix deadlock
  • Malta: Fix i8259 irqchip setup
  • Lantiq – Fix adding xbar resources causing a panic
  • Loongson3
    • Some Loongson 3A don’t identify themselves as having an FTLB so hardwire that knowledge into CPU probing.
    • Handle Loongson 3 TLB peculiarities in the fast path of the RDHWR emulation.
    • Fix invalid FTLB entries with huge page on VTLB+FTLB platforms
    • Add missing calculation of S-cache and V-cache cache-way size
  • Ralink – Fix typos in rt3883 pinctrl data
  • Generic:
    • Force o32 fp64 support on 32bit MIPS64r6 kernels
    • Yet another build fix after the linux/sched.h changes
    • Wire up statx system call
    • Fix stack unwinding after introduction of IRQ stack
    • Fix spinlock code to build even for microMIPS with recent binutils
  • SMP-CPS: Fix retrieval of VPE mask on big endian CPUs

Read the Linux 4.11 changelog – with comments only – generated using git log v4.10..v4.11 --stat, to get the full list of changes. You may also want to check out the Linux 4.11 changelog on kernelnewbies.org.

Open Source ARM Compute Library Released with NEON and OpenCL Accelerated Functions for Computer Vision, Machine Learning

April 3rd, 2017

GPU compute promises to deliver much better performance than CPU compute for applications such as computer vision and machine learning, but the problem is that many developers may not have the right skills or time to leverage APIs such as OpenCL. So ARM decided to write their own ARM Compute Library, and has now released it under an MIT license.

The functions found in the library include:

  • Basic arithmetic, mathematical, and binary operator functions
  • Color manipulation (conversion, channel extraction, and more)
  • Convolution filters (Sobel, Gaussian, and more)
  • Canny Edge, Harris corners, optical flow, and more
  • Pyramids (such as Laplacians)
  • HOG (Histogram of Oriented Gradients)
  • SVM (Support Vector Machines)
  • H/SGEMM (Half and Single precision General Matrix Multiply)
  • Convolutional Neural Networks building blocks (Activation, Convolution, Fully connected, Locally connected, Normalization, Pooling, Soft-max)

The library works on Linux, Android, or bare metal on armv7a (32-bit) or arm64-v8a (64-bit) architectures, and makes use of NEON, OpenCL, or NEON + OpenCL. You’ll need an OpenCL-capable GPU for the OpenCL functions, so Mali-4xx GPUs won’t be fully supported; you need an SoC with a Mali-T6xx, T-7xx, T-8xx, or G71 GPU to make full use of the library, except for the NEON-only functions.
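To give an idea of the programming model, here is a hedged sketch of calling one of the NEON-accelerated functions; the class and type names follow the library’s early (v17.03-era) API and examples, and may differ in later releases:

    #include "arm_compute/runtime/NEON/NEFunctions.h"
    #include "arm_compute/runtime/Tensor.h"
    #include "arm_compute/core/Types.h"

    using namespace arm_compute;

    int main()
    {
        Tensor src, dst;
        // Describe two single-channel 640x480 8-bit images
        src.allocator()->init(TensorInfo(640, 480, Format::U8));
        dst.allocator()->init(TensorInfo(640, 480, Format::U8));

        // Configure a NEON-accelerated 3x3 Gaussian filter
        NEGaussian3x3 gauss;
        gauss.configure(&src, &dst, BorderMode::UNDEFINED);

        // Allocate the backing memory, then run the function
        src.allocator()->allocate();
        dst.allocator()->allocate();
        // ... fill src with image data here ...
        gauss.run();
        return 0;
    }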

In order to showcase their new library, ARM compared its performance to the OpenCV library on the Huawei Mate 9 smartphone, with a HiSilicon Kirin 960 processor and an ARM Mali-G71MP8 GPU.

ARM Compute Library vs OpenCV, single-threaded, CPU (NEON)

Even with some NEON acceleration in OpenCV, convolutions and SGEMM functions are around 15 times faster with the ARM Compute Library. Note that ARM selected a hardware platform with one of their best GPUs, so while the library should still be faster on other OpenCL-capable ARM GPUs, the difference will be lower, but should still be significant, i.e. several times faster.

ARM Compute Library vs OpenCV, single-threaded, CPU (NEON)

The performance boost for other functions is not quite as impressive, but the Compute Library is still 2x to 4x faster than OpenCV.

While the open source release was just about three weeks ago, the ARM Compute Library had already been utilized by several embedded, consumer and mobile silicon vendors and OEMs before it was open sourced, for applications such as 360-degree camera panoramic stitching, computational camera, virtual and augmented reality, segmentation of images, feature detection and extraction, image processing, tracking, stereo and depth calculation, and several machine learning based algorithms.