August 7, 2017 by Jean-Luc Aufranc (CNXSoft) - 36 Comments

How ARM Nerfed NEON Permute Instructions in ARMv8

This is a guest post by blu about an issue he found with a specific instruction in ARMv8 NEON. He previously wrote an article about OpenGL ES development on Ubuntu Touch, and one or two other posts.

This is not a happy-ending story. But as with most unhappy-ending stories, this is a story with certain moral for the reader. So read on if you appreciate a good moral.

Once upon a time there was a very well-devised SIMD instruction set. Its name was NEON, or formally — ARM Advanced SIMD — ASIMD for short (most people still called it NEON). It was so nice, that veteran coders versed in multiple SIMD ISAs often wished other SIMD ISAs were more like NEON.

NEON had originated as part of the larger ARM ISA version 7, or ARMv7, for short. After much success in the mobile and embedded domains, ARMv7 was superseded by what experts acknowledged as the next step in the evolution of modern ISAs – ARMv8. It was so good that it was praised by compiler writers as possibly the best ISA they could wish for. As part of all the enhancements in the new ISA, NEON too got its fair share of improvements – and so ASIMD2 superseded NEON (ARMv8’s SIMD ISA is called ASIMD2, but some call it NEON2).

Now, one of the many things the original NEON got right was the permute capabilities. Contrary to other ISAs whose architects kept releasing head-banging permute ops one after another, the architects of NEON got permutes right from the start. They did so by providing a compact-yet-powerful set of permutation ops, the most versatile of which, by far, being the tbl op and its sister op tbx; each of those provided a means to compose a SIMD vector from all the thinkable combinations of the individual byte-lanes of up to 4 source SIMD vectors. Neat. The closest thing on AMD64 is pshufb from SSSE3, but it takes a single vector as input (and the AVX2 256-bit vpshufb is borked even further).

Not only NEON had those ops on an architectural level, but the actual implementations – the different μarchitectures that embodied NEON, did so quite efficiently. Second and third generation performant ARMv7 Cortex CPUs could issue up to two tbl ops per clock and return the results as soon as 3 clocks later.

So, with this fairy tale to jump-start our story, let’s teleport ourselves to present-day reality.

I was writing down an ingenious algorithm last week, one that was meant to filter elements from an input stream. Naturally, the algorithm relied heavily on integer SIMD vectors for maximum efficiency, and it happened so that I was writing the initial version on ARM64, with plans for later translation to AMD64. Now, as part of that algorithm, a vector-wise horizontal sort had to be carried – something which is best left to a sorting network (See Ken Batcher’s Sorting Network algorithms). Sorting networks are characterized by doing a fixed number of steps to sort their input vector, and at each of those steps a good amount of permutations occur. As I was sorting a 16-lane vector (a rather wide one), its sorting network was a 10-deep one, and while some of the stages required trivial permutations, others called for the most versatile permutes of them all – the mighty tbl op. So I decided that for an initial implementation I’d use tbl throughput the sorting network.

As I was writing the algorithm away from home, I was using my trusty Ubuntu tablet (Cortex-A53, ARM64) as a workstation (yes, with a keyboard). I had a benchmark for a prima-vista version up and running off L1 cache, showing the algo performing in line with my work-per-clock expectations. It wasn’t until early on the following week that I was able to finally test it on my Cortex-A72 ARM64 workhorse desktop. And there things turned bizarre.

To my stupefaction, on the A72 the bench performed nothing like on the A53. It was effectively twice slower, both in absolute times as well as in per-clock performance (tablet is 1.5GHz, desktop is 2.0GHz but I kept it at 1.3GHz when doing nothing taxing). I checked and double-checked that the compiler had not done anything stupid – it hadn’t – disassembled code was exactly as expected, and yet, there was the ‘big’ A72, 3-decode, 8-dispatch, potent-OoO design getting owned by a ‘little’ tablet’s (or a toaster’s – A53s are so omnipresent these days) in-order, 2-decode design. Luckily for me, my ARM64 desktop is perf-clad (perf is the linux profiler used by kernel developers), so seconds later I was staring at perf reports.

There was no room for guessing – there were some huge, nay, massive stalls clumped around the permute ops. The algo was spending the bulk of its time in stalling on those permutes. Those beautiful, convenient tbl permutes – part of the reason I went to prototype the algo on ARM64 in the first place. The immediate take was that A72 tbl op performed nothing like the A53 tbl op. Time to dust up the manual, buddy. What I saw in the A72 (and A57) optimization manual had me scratch my head more than I could’ve expected.

First off, in 32-bit mode (A32) tbl op performs as I’d expect it to, and as it appears to still do on the A53 in A64 mode (64-bit):

op	throughput, ops/clock	latency, clocks
tbl from 1 source, 64-bit-wide	2	3
tbl from 2 sources, 64-bit-wide	2	3
tbl from 3 sources, 64-bit-wide	2	6
tbl from 4 sources, 64-bit-wide	2	6

But in 64-bit mode (A64), that transforms into:

op	throughput, ops/clock	latency, clocks
tbl from 1 source, 64-bit-wide	2	3 * 1 = 3
tbl from 2 sources, 64-bit-wide	2	3 * 2 = 6
tbl from 3 sources, 64-bit-wide	2	3 * 3 = 9
tbl from 4 sources, 64-bit-wide	2	3 * 4 = 12
tbl from 1 source, 128-bit-wide	2	3 * 1 + 3 = 6
tbl from 2 sources, 128-bit-wide	2	3 * 2 + 3 = 9
tbl from 3 sources, 128-bit-wide	2	3 * 3 + 3 = 12
tbl from 4 sources, 128-bit-wide	2	3 * 4 + 3 = 15

That’s right – 64-bit-wide tbl is severely penalized in A64 mode on A72 vs A32 mode. In my case, I was using the 128-bit-wide versions of the op, with 2 source arguments. So on the A72 I ended up getting (snippet of relevant code timeline):

 >- tbl 2-source
     |   tbl 2-source
     |
     |
     | 9 clocks latency, with co-issue
     |
     |
     |
     >- consumer_op of 1st tbl
     |   consumer_op of 2nd tbl
     |
     | 3 clocks latency, with co-issue
     |
     >

>- tbl 2-source

| tbl 2-source

| 9 clocks latency, with co-issue

>- consumer_op of 1st tbl

| consumer_op of 2nd tbl

| 3 clocks latency, with co-issue

= 12 clocks of latency for the snippet

But on the A53 same snippet yielded:

>- tbl 2-source
     |   tbl 2-source
     |
     | 3 clocks latency, with co-issue
     |
     >- consumer_op of 1st tbl
     |   consumer_op of 2nd tbl
     |
     | 3 clocks latency, with co-issue
     |
     >

>- tbl 2-source

| tbl 2-source

| 3 clocks latency, with co-issue

>- consumer_op of 1st tbl

| consumer_op of 2nd tbl

| 3 clocks latency, with co-issue

= 6 clocks of latency for the snippet

As the performance of the entire algorithm was dominated by the network sort, and the entirety of the network sort was comprised of repetitions of the above snippet, all observations fell into place — A53 was indeed twice faster (per-clock) than A72/A57 on this code, by design! So much for my elegant algorithm. Now I’d need to increase the data window so much as to be able to amortize the massive pipeline bubbles with more non-dependent work. Anything less would penalize the ‘big’ ARMv8 designs.

But that’s not what gets me in this entire story – I have no issue rewriting prototype or any other code. What does put me into contemplative mood is that code written for optimal work on A53’s pipeline could choke its ‘big’ brothers A57 & A72, and code written for optimal utilization of the pipelines of those CPUs could not necessarily be the most efficient code on the A53. All it takes is some tbl permutes. That is only exacerbated by big.LITTLE setups. That begs the question what were ARM thinking when they were designing A64 mode tbl on the ‘big’ cores.

Jean-Luc Aufranc (CNXSoft)

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.

36 Replies to “How ARM Nerfed NEON Permute Instructions in ARMv8”

willmore says:

August 7, 2017 at 18:46

Great article! I would love to see if ARM will comment on this.

Reply
dx says:

August 7, 2017 at 20:06

If you have that much inter loop dependency you are screwed anyway. But with 15 cycle latency how would arm think to handle the massive register pressure?

Reply
blu says:

August 7, 2017 at 20:37

@dx
Exactly. Re inner-loop data dependencies, my original plan was to 3x the data window to fill up the bubbles on A53 (there’re enough regs for that), but with 9 clock latency that is an entirely different ball game.

Reply
geokon says:

August 7, 2017 at 21:42

Sorry for the dumb question from an amateur – but how do you build NEON code in an armhf userspace?

I was actually just trying to learn more about NEON these past few days but I’ve hit a bit of a wall. (I usually do higher level C++, so I’m trying to push my comfort zone here and explore these lower level optimizations) I don’t have a tablet like you, but I’m using a Chromebook with crouton to build my code. It also has an armhf userspace – but from what I understood of the GCC manual – NEON and hard floats don’t mix in the same code (b/c the FPU shares registers with NEON/SIMD). So I’m scratching my head on how to get it all working on my system.

The NEON intrinsics make GCC angry and I get errors that look like this:

/usr/lib/gcc/arm-linux-gnueabihf/6/include/arm_neon.h:6252:1: error: inlining failed in call to always_inline ‘float32x2_t vget_high_f32(float32x4_t)’: target specific option mismatch

adding a -mfpu=neon or -mfpu=neon-vfpv4 flag doesn’t help

I think I’m misunderstanding something fundamentally here though. Any pointers? And any recommendations on resources where to learn more about this?

Great write up btw. It’s good to read something really from the technical trenches 🙂

Reply
blu says:

August 7, 2017 at 22:30

@geokon
You’re on the right track. I don’t have a chromebook to test this on, but Ubuntu tablets come out of the factory as armhf (I’ve force-fed mine with an aarch64 toolchain on top of armhf), so something along the lines of:

$ gcc -march=armv8-a -marm -mfpu=neon

for ARMv8 cpus, and for ARMv7 cpus:

$ gcc -march=armv7-a -marm -mfpu=neon

should do. You can even throw in -mcpu= for good measure (e.g. cortex-a57).

Also, there’s no issue mixing neon and vfp code, particularly when you do neon via intrinsics, as the compiler is fully aware of the effects of each op. Back on the Cortex-A8 you could tell the compiler to do all fp math via neon, as the vfp was non-pipelined ( : / ) but that was a software optimisation, not a hw limitation.

Reply
AlexN says:

August 8, 2017 at 04:40

Thanks.

I would like to see more posts like this here, and the convinient replies.

Reply
- says:

August 8, 2017 at 10:18

Optimising for CPUs often does mean you have to write multiple versions of the same function targeting each specific CPU.

ARM, unfortunately, doesn’t make this easy as an A53 is quite different to an A72. I’ve never optimised for ARM, only for x86. Fortunately in x86, most CPUs are fairly similar.
As an aside, PSHUFB typically has a 1 cycle latency and 1-2 ops/clock throughput on most x86 CPUs. There was an XOP VPPERM instruction which could source from 2x 128-bit registers, but support for it is limited. AVX-512 VBMI has a VPPERMB instruction which can source from 2x 512-bit registers, but no CPU supports it as of yet; I also don’t expect it to be fast. SSE/AVX in general isn’t as consistent as NEON.

I can’t explain your find, but permute is usually quite different from most other SIMD operations. Perhaps the A72 decided to trade permute performance (which is probably not frequently used for most ARM applications) for something else.

Reply
blu says:

August 8, 2017 at 12:55

@–
Don’t get me wrong – this is not a rant about permutes on the A72/A57 per se. I’m looking at them in the context of their LITTLE companion, the one they’re supposed to ping-pong code with. Clearly, ARM had their priorities with the big designs, but some of those turned out rather puzzling in retrospective, you got to agree.

Apropos, just out of curiosity, I rewrote the original function in d-form ops (64-bit SIMD) and that made the A72 only 50% slower than the A53 (and the A53 took a slight hit thanks to a few ins ops not needed in the q-form (128-bit)).

For the record, the entire effort along that algo has moved on – now I’m on a revision of the code which is again performing predictably-well on A53 and not so well on the A72, while on the AMD64 it performs nicely on desktop CPUs but suffers on Bobcats/Jaguars (death by popcnt ™). So I’m wondering if I should prioritize desktop AMD64 and little ARM64, and call it a day : )

Reply
Shimon says:

August 8, 2017 at 16:19

@blu
That’s an interesting find. Have you opened a gcc issue about it?

Reply
blu says:

August 8, 2017 at 17:57

@Shimon
It’s a hardware limitation – there’s only so much compilers can do in certain situations. When multitudes of big latencies are involved in tight data-dependency closures it’s the code’s author who can (or cannot) properly fix things by manipulating the workload size and distribution; a classic example of if one does not help themselves nobody else will. The crux of the issue in my particular case is that I did not anticipate such large permute latencies on the big cores, vis-a-vis the normal latencies on the little cores. I always viewed the big cores as better-or-equally performing versions of the little cores. Or perhaps I hoped that’s how ARM viewed them. I was clearly naive.

Reply
Shimon says:

August 8, 2017 at 19:30

Even if you’re right and there’s nothing more to it, an erratum might still be possible to mitigate the problem.

It won’t hurt to get a more formal reply from ARM’s gcc developers.

Reply
champ says:

August 8, 2017 at 20:26

In article, you mentioned that your A53 platform is BQ Aquaris M10 (MT8163).
I’m curious. What’s the A72 platform you used?

Reply
blu says:

August 8, 2017 at 21:17

@champ
Marvell MacchiatoBin miniITX board using Marvell ARMADA 8040 (2x dual-A72 clusters), running Ubuntu 16.04 LTS.

Reply
Nobody of Import says:

August 8, 2017 at 21:20

@willmore
Egg on face (Which is what this is…)?

ARM will take a long while, if ever, much like most other CPU vendors, to comment or even OWN this one.

This is off into the, “Lay down the damn crack pipe,” stupid on their part.

Reply
rm says:

August 9, 2017 at 15:27

The instruction itself is not “nerfed”, it’s just a problem of suboptimal interpretation. And that can always be fixed.

Reading the subject I thought there’s something in the ARMv8 spec redifining the behaviour of the instruction itself, making it less useful or useless entirely.

Reply
blu says:

August 9, 2017 at 18:45

@rm
Apologies if you found the title misleading. Once one reads the body I think the title snaps into place, though. As you put it, the instruction was made (significantly) less useful for at least two of the big cores. I’m curious if the trend continues in A73 and A75.

Reply
crashoverride says:

August 9, 2017 at 20:11

blu :
@rm
I’m curious if the trend continues in A73 and A75.

You should make a test program that illustrates the issue using cycle counters. The issue may be isolated to a particular silicon revision.

Reply
blu says:

August 9, 2017 at 22:11

@crashoverride
The observed behavior is in agreement with the software optimisation manuals for A72 and A57. But yes, a latency-measuring test app for these ops would be both useful and trivial to do.

Reply
blu says:

August 10, 2017 at 03:17

https://pastebin.com/NsufCsbx

Just cpufreq an aarch64 machine into a steady clock, then run the above via time, then multiply the resulting time by the machine’s clock and divide that by 5 * 10^8 * 16. Do the same passing -DCOISSUE to the compiler to see how/if tbl coissue works on that machine.

Turns out that on my A53 the q-form tbl has latency of 2, rather than the expected 3.

Reply
crashoverride says:

August 10, 2017 at 04:32

Billions of years of evolution. Trillions of dollars of computing infrastructure at our fingertips. Yet, somehow, I am still the one that has to do the math?

Reply
blu says:

August 10, 2017 at 14:32

@crashoverride
You can always hire a professional.

Reply
crashoverride says:

August 10, 2017 at 14:51

@blu
I should outsource it!

I actually did run the program (after some corrections), and it told me what I wanted to know: the A53 cores completed faster than the A72 cores (big.LITTLE) despite the big cores being clocked faster (1.5Ghz vs 2Ghz on RK3399 using taskset for CPU affinity). I did a generic compile and did not specify any CPU tuning parameters to gcc.

Reply
blu says:

August 10, 2017 at 16:28

@crashoverride
A generic compile is fine – the entire measured code is in inline assembly. Here’s a shell script that does all the latency computations: https://pastebin.com/MFEJcWCt — just pass the latency executable to it. Note that it requires the time utility (not the bash built-in) and bc utility.

Reply
blu says:

August 10, 2017 at 20:17

Just to demonstrate how the script works, here’s the result from the amd64 version of the code (https://pastebin.com/QbCGFexr) on an intel desktop:

$ clang++-3.9 -Ofast -mssse3 lattest.cpp -DCOISSUE
$ ./lat.sh ./a.out
.9997
$ clang++-3.9 -Ofast -mssse3 lattest.cpp -DCOISSUE=0
$ ./lat.sh ./a.out
.9997

It shows that (a) there is not co-issue of pshufb on intel, and (b) the op’s latency is 1 clock.

Reply
Shimon says:

August 11, 2017 at 03:40

@crashoverride
Does it make any difference if you actually shut the little cores down instead of using taskset?

Reply
crashoverride says:

August 11, 2017 at 04:11

Offline-ing the little cores yielded no change in execution time from what was previously observed for the big cores.

Reply
blu says:

August 11, 2017 at 05:03

Running the latency test on my A72 yields the following:

latency of q-form 1-src tbl with co-issue is 2
latency of q-form 1-src tbl without co-issue is 5

Curiously enough the difference form the figures quoted in the optimisation manual is 1, ie. 2 + 1 = 3; 3 * 2 (for co-issue) = 6, and 5 + 1 = 6 without co-issue. I guess there’s some sort of early-forwarding mechanism if the results get consumed by the same port, but that’s just a wild guess; it could be just as well imprecision of the method.

Reply
willmore says:

August 15, 2017 at 18:56

@blu
Do you have that testing code available? I’d like to run it on the various A53 cores I have if I may.

Reply
rm says:

August 17, 2017 at 16:42

rm :
The instruction itself is not “nerfed”, it’s just a problem of suboptimal interpretation. And that can always be fixed.

Bit too late, but I meant to say “implementation” there.

Reply
blu says:

August 18, 2017 at 13:05

@willmore
All relevant code can be found in this repo — it contains the original algorithm I was working on, and the latency tester.
https://github.com/blu/ascii_pruner

Reply
willmore says:

August 18, 2017 at 23:00

@blu
Thank you!

Reply
willmore says:

September 4, 2017 at 21:55

@blu
On an Odroid-C2 I get:
2.0016
1.0621

Reply
blu says:

September 5, 2017 at 00:19

@willmore
That’s normal for an A53. It means the latency of tbl is 2 clocks and co-issue works.

Reply
Pavel P says:

May 20, 2018 at 00:35

I wish I could get some arm device where I could profile neon code similarly to what you show here. Any tutorials that you could possibly point to?

Reply
1. blu says:
  
  June 1, 2018 at 14:30
  
  @Pavel P
  You need a PMU (performance-monitoring-unit)-enabled Cortex (pretty much any armv8 — I haven’t tried armv7 PMUs, even though those should be operational) with a modern-enough mainline kernel (4.x) with performance counters enabled (should be by default these days). Then you just build perf profiler found in the kernel tree under tools/perf and use it.
  
  Here’s a thread on the macchiatobin forums investigating certain A72 perf counters: http://macchiatobin.net/forums/topic/kernel-4-4-52-armada-17-06-2/#post-401
  
  Reply
Jake says:

September 12, 2021 at 14:04

I’m glad I came across this article. I was writing an image feature extractor in aarch64 assembly that makes heavy use of 4 sources tbl instructions, and successfully managed to hide the latency of 15 cycles completely after reading this post. I thought 5 cycles on A53 were bad enough, but ARM surprised me with 15 cycles. I have to stop trusting ARM blindly.

Reply

Boardcon LGA3576 Rockchip RK3576 System-on-Module designed for AI and IoT applications

36 Replies to “How ARM Nerfed NEON Permute Instructions in ARMv8”

Leave a Reply Cancel reply

Leave a Reply