Optimizing JPEG Transformations on Qualcomm Centriq Arm Servers with NEON Instructions

Arm servers are already deployed in some datacenters, but they are still new compared to their Intel counterparts, so at this stage software is not always as well optimized on Arm as on Intel. Vlad Krasnov, who works for Cloudflare, found one of those unoptimized cases when testing Jpegtran (a utility performing lossless transformation of JPEG files) on one of their Xeon Silver 4116 servers and comparing it to a server based on the Qualcomm Centriq 2400 Arm SoC: the Arm server was nearly four times slower on a single core. That is not good enough, as the company aims for at least 50% of the Xeon's per-core performance, since the Arm processor has double the number of cores. Vlad had previously optimized the code for the Intel processor using SSE instructions, so he decided to look into optimizing the Arm code with NEON instructions. The first step was to check with perf which functions slow down the process the most: encode_mcu_AC_refine and encode_mcu_AC_first are the main culprits. …
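The 50% per-core target above follows from simple arithmetic, which can be sketched in C. The helper name relative_server_throughput is hypothetical, and the sketch assumes the workload scales linearly with core count, which the excerpt does not state:

```c
#include <assert.h>

/* Hypothetical helper: throughput of the whole Arm server relative to the
 * whole Xeon server, ASSUMING perfectly linear scaling across cores.
 * per_core_ratio: Arm single-core speed divided by Xeon single-core speed.
 * core_ratio:     Arm core count divided by Xeon core count. */
double relative_server_throughput(double per_core_ratio, double core_ratio)
{
    assert(per_core_ratio > 0.0 && core_ratio > 0.0);
    return per_core_ratio * core_ratio;
}
```

Under that assumption, a core that is four times slower (ratio 0.25) with twice the cores gives about half the Xeon server's throughput, while reaching the 50% per-core target (ratio 0.5) would bring the two servers to parity.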

How ARM Nerfed NEON Permute Instructions in ARMv8

This is a guest post by blu about an issue he found with a specific instruction in ARMv8 NEON. He previously wrote an article about OpenGL ES development on Ubuntu Touch, and one or two other posts. This is not a happy-ending story. But as with most unhappy-ending stories, this is a story with a certain moral for the reader. So read on if you appreciate a good moral. Once upon a time there was a very well-devised SIMD instruction set. Its name was NEON, or formally ARM Advanced SIMD, ASIMD for short (most people still called it NEON). It was so nice that veteran coders versed in multiple SIMD ISAs often wished other SIMD ISAs were more like NEON. NEON had originated as part of the larger ARM ISA version 7, or ARMv7 for short. After much success in the mobile and embedded domains, ARMv7 was superseded by what experts acknowledged as the next step in the evolution …

Open Source ARM Compute Library Released with NEON and OpenCL Accelerated Functions for Computer Vision, Machine Learning

GPU compute promises to deliver much better performance than CPU compute for applications such as computer vision and machine learning, but the problem is that many developers may not have the right skills or time to leverage APIs such as OpenCL. So ARM decided to write its own ARM Compute Library and has now released it under an MIT license. The functions found in the library include: basic arithmetic, mathematical, and binary operator functions; color manipulation (conversion, channel extraction, and more); convolution filters (Sobel, Gaussian, and more); Canny Edge, Harris corners, optical flow, and more; pyramids (such as Laplacians); HOG (Histogram of Oriented Gradients); SVM (Support Vector Machines); H/SGEMM (Half and Single precision General Matrix Multiply); and Convolutional Neural Networks building blocks (Activation, Convolution, Fully connected, Locally connected, Normalization, Pooling, Soft-max). The library works on Linux, Android or bare metal on armv7a (32-bit) or arm64-v8a (64-bit) architectures, and makes use of NEON, OpenCL, or NEON + OpenCL. You’ll need an …

Linaro Connect Hong Kong 2015 Schedule and Demos

Linaro Connect Hong Kong 2015 will take place on February 9 – 13, 2015 in Hong Kong, and the organization has released the schedule for the five-day event with keynotes, sessions, and demos. Each day will start with a keynote, with speakers including: George Grey, Linaro CEO, who will welcome attendees to Linaro Connect and provide an update on the latest Linaro developments; Jon Masters, Chief ARM Architect, Red Hat, who will present a Red Hat update and the latest ARMv8-A demonstrations; Dejan Milojicic, Senior Researcher & Manager, HP Labs; Bob Monkman, Enterprise Segment Marketing Manager, ARM, who will discuss the impact of ARM in next generation cloud and communication network infrastructure; Greg Kroah-Hartman, Linux Foundation Fellow, who will introduce the Greybus Project (Linux for Project Ara modular phones); and Warren Rehman, Android Partner Engineering Manager, Google. The agenda also features sessions covering Android, ARMv8-A, Automation & Validation, Digital Home, Enterprise Servers, LAVA, Linux Kernel, Networking, Power Management, Security, Toolchain, Virtualization and multiple training …

Linaro 13.08 Release With Linux Kernel 3.11 and Android 4.3

Linaro 13.08 has been released with Linux Kernel 3.11-rc6 (stating), Kernel 3.10.9 (LSK – beta), and Android 4.3. This month is the first release based on Android 4.3, which was only pushed to AOSP at the end of last month. I can also see work on new SoCs/hardware this month with the Texas Instruments Keystone II ARM Cortex A15+DSP SoC and the Fujitsu AA9 board (which processor? I could not find out). A lot of work also appears to have gone into OpenEmbedded, further optimizations have gone into NEON optimized AES encryption in OpenSSL, and more. It’s also the first time I can see an Ubuntu Raring engineering build image for HighBank (Calxeda EnergyCore). Here are the highlights of this release: Android Engineering: the Android stack was tuned to achieve a 100% CTS pass result on Android 4.3; analysis of the UEFI EDK II boot loader for Android is complete, and implementation of the fastboot application and USB drivers is in progress. Builds and Baselines: Linaro Stable Kernel (beta) …

ARM Releases Ne10: An Open Source Library with NEON Optimized Functions

The Advanced SIMD extension (aka NEON or “MPE” Media Processing Engine) is a combined 64- and 128-bit single instruction multiple data (SIMD) instruction set that provides standardized acceleration for media and signal processing applications on ARM Cortex A (ARMv7) processors. The goal of these instructions is similar to that of the MMX, SSE and 3DNow! extensions for x86 processors. Since early 2011, ARM has been working internally on a project codenamed Snappy to develop common functions accelerated by NEON. They have now released the first version of Snappy, now called the Ne10 library, which is available on GitHub at https://github.com/projectNe10/Ne10 . The code has been developed in C and assembler and tested on Ubuntu on ARM (Linaro). A Makefile is also included to build it for Android (AOSP). The current functions include vector and matrix operations accelerated by NEON instructions. Since the library is open source, ARM hopes developers will make use of the Ne10 library in their open source packages, add new functions …

ARM NEON Tutorial in C and Assembler

The Advanced SIMD extension (aka NEON or “MPE” Media Processing Engine) is a combined 64- and 128-bit single instruction multiple data (SIMD) instruction set that provides standardized acceleration for media and signal processing applications, similar to the MMX, SSE and 3DNow! extensions found in x86 processors. Doulos has a video tutorial that shows how to exploit NEON instructions in assembler and how to modify your C code, and provides the gcc compile options that enable NEON during the build. Abstract: With the v7-A architecture, ARM has introduced a powerful SIMD implementation called NEON™. NEON is a coprocessor which comes with its own instruction set for vector operations. While NEON instructions could be hand coded in assembler language, ideally we want our compiler to generate them for us. Automatic analysis of whether an iterative algorithm can be mapped to parallel vector operations is not trivial, not least because the C language lacks the constructs necessary to support this. This paper explains how …
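As a minimal sketch of the kind of C modification that helps a compiler generate vector code, the loop below uses the C99 restrict qualifier. The function name add_u32 is hypothetical and is not taken from the Doulos tutorial; the example only illustrates one commonly used hint and assumes a C99 compiler:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example (not from the tutorial): element-wise addition.
 * The restrict qualifiers promise the compiler that dst, a and b never
 * overlap. Without that promise the compiler has to allow for aliasing
 * between the arrays, which makes vectorizing the loop harder. */
void add_u32(uint32_t *restrict dst, const uint32_t *restrict a,
             const uint32_t *restrict b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];  /* independent iterations, SIMD-friendly */
    }
}
```

With GCC targeting 32-bit ARM, options along the lines of -O3 -mfpu=neon are typically what allow the auto-vectorizer to use NEON for a loop like this; the exact flags depend on the toolchain and target, so treat them as an assumption to check against your compiler's documentation.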

Faster JPEG decoding on ARM with libjpeg-turbo and NEON Instructions

libjpeg-turbo is based on libjpeg, but uses SIMD instructions (MMX, SSE2, etc.) to accelerate JPEG compression and decompression on x86 targets. On such systems, libjpeg-turbo is generally 2-4x as fast as the original version of libjpeg on the same hardware. ARM does not support MMX or SSE2 instructions, but it has its own SIMD instructions, processed by the NEON engine on ARM Cortex-A5, A8, A9 and A15 cores. ARM claims that “NEON technology can accelerate multimedia and signal processing algorithms such as video encode/decode, 2D/3D graphics, gaming, audio and speech processing, image processing, telephony, and sound synthesis by at least 3x the performance of ARMv5 and at least 2x the performance of ARMv6 SIMD.” Linaro worked on libjpeg-turbo and added NEON support to it. The code is available on Launchpad at https://code.launchpad.net/~tom-gall/linaro/libjpeg-turbo . Linaro has also provided benchmark results for libjpeg-turbo with a 12 Mpixel image on TI OMAP4 (Pandaboard) using the command: djpeg 12mp.jpeg > /dev/null Non-optimized libjpeg-turbo (5 runs): 2078 …