How ARM Nerfed NEON Permute Instructions in ARMv8

This is a guest post by blu about an issue he found with a specific instruction in ARMv8 NEON. He previously wrote an article about OpenGL ES development on Ubuntu Touch, and one or two other posts.

This is not a happy-ending story. But as with most unhappy-ending stories, there is a certain moral in it for the reader. So read on if you appreciate a good moral.

Once upon a time there was a very well-devised SIMD instruction set. Its name was NEON, or formally ARM Advanced SIMD, ASIMD for short (most people still called it NEON). It was so nice that veteran coders versed in multiple SIMD ISAs often wished other SIMD ISAs were more like NEON.

NEON originated as part of the larger ARM ISA version 7, or ARMv7 for short. After much success in the mobile and embedded domains, ARMv7 was superseded by what experts acknowledged as the next step in the evolution of modern ISAs – ARMv8. It was so good that it was praised by compiler writers as possibly the best ISA they could wish for. As part of all the enhancements in the new ISA, NEON too got its fair share of improvements, and so ASIMD2 superseded NEON (ARMv8’s SIMD ISA is called ASIMD2, but some call it NEON2).

Now, one of the many things the original NEON got right was its permute capabilities. Contrary to other ISAs, whose architects kept releasing head-banging permute ops one after another, the architects of NEON got permutes right from the start. They did so by providing a compact yet powerful set of permutation ops, the most versatile of which, by far, were the tbl op and its sister op tbx; each of those provided a means to compose a SIMD vector from any thinkable combination of the individual byte lanes of up to 4 source SIMD vectors. Neat. The closest thing on AMD64 is pshufb from SSSE3, but it takes a single source vector as input (and the AVX2 256-bit vpshufb is borked even further, as it only shuffles within each 128-bit half).
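To make that concrete, here is a minimal sketch of my own (not code from the algorithm discussed below; the helper name is made up for illustration) of what a two-source byte-wise lookup looks like through the AArch64 NEON intrinsics:

#include <arm_neon.h>

// Compose a 16-byte result by picking arbitrary bytes out of a 32-byte,
// two-vector table. vqtbl2q_u8 maps to the 2-source, 128-bit-wide tbl op:
// each lane of 'indices' selects one of the 32 source bytes, and an
// out-of-range index yields zero in that lane (tbx would instead leave
// the corresponding destination lane untouched).
static inline uint8x16_t pick_bytes(uint8x16x2_t table, uint8x16_t indices)
{
    return vqtbl2q_u8(table, indices);
}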

Not only did NEON have those ops at the architectural level, but the actual implementations – the different μarchitectures that embodied NEON – executed them quite efficiently. Second- and third-generation performant ARMv7 Cortex CPUs could issue up to two tbl ops per clock and return the results as soon as 3 clocks later.

So, with this fairy tale to jump-start our story, let’s teleport ourselves to present-day reality.

I was writing down an ingenious algorithm last week, one meant to filter elements from an input stream. Naturally, the algorithm relied heavily on integer SIMD vectors for maximum efficiency, and it so happened that I was writing the initial version on ARM64, with plans for a later translation to AMD64. Now, as part of that algorithm, a vector-wise horizontal sort had to be carried out – something best left to a sorting network (see Ken Batcher’s sorting network algorithms). Sorting networks are characterized by doing a fixed number of steps to sort their input vector, and at each of those steps a good amount of permutation occurs. As I was sorting a 16-lane vector (a rather wide one), its sorting network was 10 stages deep, and while some of the stages required trivial permutations, others called for the most versatile permute of them all – the mighty tbl op. So I decided that for an initial implementation I’d use tbl throughout the sorting network.
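For illustration, here is a rough sketch (with hypothetical permutation and selection constants and a made-up helper name, not the actual code of the algorithm) of what one compare-exchange stage of such a 16-lane network can look like in intrinsics: a tbl permute fetches each lane's partner, a lane-wise min/max does the compare, and a blend keeps the right half of each pair:

#include <arm_neon.h>

// One compare-exchange stage of a 16-lane byte sorting network (sketch).
// 'perm' holds the byte indices pairing each lane with its partner for this
// stage, and 'keep_min' marks with 0xFF the lanes that should take the
// minimum of their pair; both constants depend on the particular network.
static inline uint8x16_t sort16_stage(uint8x16_t v, uint8x16_t perm,
                                      uint8x16_t keep_min)
{
    const uint8x16_t partner = vqtbl1q_u8(v, perm);  // permute: a 1-source, 128-bit tbl
    const uint8x16_t lo      = vminq_u8(v, partner); // lane-wise minimum
    const uint8x16_t hi      = vmaxq_u8(v, partner); // lane-wise maximum
    return vbslq_u8(keep_min, lo, hi);               // pick min or max per lane
}

Each stage is a data-dependent chain: the min/max cannot start before the tbl result arrives, so the latency of tbl sits squarely on the critical path of every one of the 10 stages.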

As I was writing the algorithm away from home, I was using my trusty Ubuntu tablet (Cortex-A53, ARM64) as a workstation (yes, with a keyboard). I had a benchmark of a prima-vista version up and running off the L1 cache, showing the algo performing in line with my work-per-clock expectations. It wasn’t until early the following week that I was finally able to test it on my Cortex-A72 ARM64 workhorse desktop. And there things turned bizarre.

To my stupefaction, on the A72 the bench performed nothing like on the A53. It was effectively twice as slow, both in absolute time and in per-clock performance (the tablet runs at 1.5GHz, the desktop at 2.0GHz, but I keep the latter at 1.3GHz when doing nothing taxing). I checked and double-checked that the compiler had not done anything stupid – it hadn’t, the disassembled code was exactly as expected – and yet there was the ‘big’ A72, a 3-decode, 8-dispatch, potent-OoO design, getting owned by a ‘little’ tablet’s (or a toaster’s – A53s are so omnipresent these days) in-order, 2-decode design. Luckily for me, my ARM64 desktop is perf-clad (perf is the Linux profiler used by kernel developers), so seconds later I was staring at perf reports.

There was no room for guessing – there were some huge, nay, massive stalls clumped around the permute ops. The algo was spending the bulk of its time stalling on those permutes. Those beautiful, convenient tbl permutes – part of the reason I went to prototype the algo on ARM64 in the first place. The immediate takeaway was that the A72 tbl op performed nothing like the A53 tbl op. Time to dust off the manual, buddy. What I saw in the A72 (and A57) optimization manual had me scratching my head more than I could have expected.

First off, in 32-bit mode (A32) the tbl op performs as I’d expect it to, and as it appears to still do on the A53 in A64 (64-bit) mode:

op                               throughput, ops/clock   latency, clocks
tbl from 1 source,  64-bit-wide  2                       3
tbl from 2 sources, 64-bit-wide  2                       3
tbl from 3 sources, 64-bit-wide  2                       6
tbl from 4 sources, 64-bit-wide  2                       6

But in 64-bit mode (A64), that transforms into:

op                                throughput, ops/clock   latency, clocks
tbl from 1 source,  64-bit-wide   2                       3 * 1 = 3
tbl from 2 sources, 64-bit-wide   2                       3 * 2 = 6
tbl from 3 sources, 64-bit-wide   2                       3 * 3 = 9
tbl from 4 sources, 64-bit-wide   2                       3 * 4 = 12
tbl from 1 source,  128-bit-wide  2                       3 * 1 + 3 = 6
tbl from 2 sources, 128-bit-wide  2                       3 * 2 + 3 = 9
tbl from 3 sources, 128-bit-wide  2                       3 * 3 + 3 = 12
tbl from 4 sources, 128-bit-wide  2                       3 * 4 + 3 = 15

That’s right – the 64-bit-wide tbl is severely penalized in A64 mode on the A72 versus A32 mode, and the 128-bit-wide forms fare even worse. In my case, I was using the 128-bit-wide version of the op with 2 source arguments. So on the A72 I ended up getting (snippet of relevant code timeline):


= 12 clocks of latency for the snippet

But on the A53 the same snippet yielded:


= 6 clocks of latency for the snippet

As the performance of the entire algorithm was dominated by the network sort, and the entirety of the network sort comprised repetitions of the above snippet, all observations fell into place — the A53 was indeed twice as fast (per clock) as the A72/A57 on this code, by design! So much for my elegant algorithm. Now I’d need to widen the data window enough to amortize the massive pipeline bubbles with more non-dependent work. Anything less would penalize the ‘big’ ARMv8 designs.
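To give an idea of what widening the data window means in practice, here is a sketch (assuming the surrounding filter can process several input vectors independently, and reusing the hypothetical sort16_stage helper from the earlier sketch): interleaving two or more independent chains lets an out-of-order core overlap the tbl bubbles of one chain with the min/max work of another.

#include <arm_neon.h>

// Hypothetical compare-exchange helper from the earlier sketch.
uint8x16_t sort16_stage(uint8x16_t v, uint8x16_t perm, uint8x16_t keep_min);

// Sort two independent 16-lane vectors side by side; the chains share no
// data, so the long tbl latency of one can be hidden behind work from the
// other (on the A72/A57 even more independent chains may be needed).
static void sort16_pair(uint8x16_t& a, uint8x16_t& b,
                        const uint8x16_t perm[10],
                        const uint8x16_t keep_min[10])
{
    for (int stage = 0; stage < 10; ++stage) {
        a = sort16_stage(a, perm[stage], keep_min[stage]);
        b = sort16_stage(b, perm[stage], keep_min[stage]);
    }
}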

But that’s not what gets me in this entire story – I have no issue rewriting prototype or any other code. What does put me in a contemplative mood is that code written for optimal work on the A53’s pipeline can choke its ‘big’ brothers, the A57 and A72, while code written for optimal utilization of those CPUs’ pipelines is not necessarily the most efficient code on the A53. All it takes is some tbl permutes. That is only exacerbated by big.LITTLE setups, which raises the question of what ARM were thinking when they designed the A64-mode tbl on the ‘big’ cores.

36 Comments
willmore
6 years ago

Great article! I would love to see if ARM will comment on this.

dx
6 years ago

If you have that much inter-loop dependency you are screwed anyway. But with 15-cycle latency, how did ARM think one would handle the massive register pressure?

blu
6 years ago

@dx
Exactly. Re inner-loop data dependencies, my original plan was to 3x the data window to fill up the bubbles on the A53 (there are enough regs for that), but with 9-clock latency that is an entirely different ball game.

geokon
6 years ago

Sorry for the dumb question from an amateur – but how do you build NEON code in an armhf userspace? I was actually just trying to learn more about NEON these past few days but I’ve hit a bit of a wall. (I usually do higher level C++, so I’m trying to push my comfort zone here and explore these lower level optimizations) I don’t have a tablet like you, but I’m using a Chromebook with crouton to build my code. It also has an armhf userspace – but from what I understood of the GCC manual – NEON and…

blu
6 years ago

@geokon You’re on the right track. I don’t have a chromebook to test this on, but Ubuntu tablets come out of the factory as armhf (I’ve force-fed mine with an aarch64 toolchain on top of armhf), so something along the lines of: $ gcc -march=armv8-a -marm -mfpu=neon for ARMv8 cpus, and for ARMv7 cpus: $ gcc -march=armv7-a -marm -mfpu=neon should do. You can even throw in -mcpu= for good measure (e.g. cortex-a57). Also, there’s no issue mixing neon and vfp code, particularly when you do neon via intrinsics, as the compiler is fully aware of the effects of each op.…

AlexN
6 years ago

Thanks.

I would like to see more posts like this here, and the convenient replies.

-
6 years ago

Optimising for CPUs often does mean you have to write multiple versions of the same function targeting each specific CPU. ARM, unfortunately, doesn’t make this easy as an A53 is quite different to an A72. I’ve never optimised for ARM, only for x86. Fortunately in x86, most CPUs are fairly similar. As an aside, PSHUFB typically has a 1 cycle latency and 1-2 ops/clock throughput on most x86 CPUs. There was an XOP VPPERM instruction which could source from 2x 128-bit registers, but support for it is limited. AVX-512 VBMI has a VPPERMB instruction which can source from 2x 512-bit…

blu
6 years ago

@- Don’t get me wrong – this is not a rant about permutes on the A72/A57 per se. I’m looking at them in the context of their LITTLE companion, the one they’re supposed to ping-pong code with. Clearly, ARM had their priorities with the big designs, but some of those turned out rather puzzling in retrospect, you’ve got to agree. Apropos, just out of curiosity, I rewrote the original function in d-form ops (64-bit SIMD) and that made the A72 only 50% slower than the A53 (and the A53 took a slight hit thanks to a few ins ops not…

Shimon
6 years ago

@blu
That’s an interesting find. Have you opened a gcc issue about it?

blu
6 years ago

@Shimon It’s a hardware limitation – there’s only so much compilers can do in certain situations. When multitudes of big latencies are involved in tight data-dependency closures it’s the code’s author who can (or cannot) properly fix things by manipulating the workload size and distribution; a classic example of if one does not help themselves nobody else will. The crux of the issue in my particular case is that I did not anticipate such large permute latencies on the big cores, vis-a-vis the normal latencies on the little cores. I always viewed the big cores as better-or-equally performing versions of…

Shimon
6 years ago

Even if you’re right and there’s nothing more to it, an erratum might still be possible to mitigate the problem.

It won’t hurt to get a more formal reply from ARM’s gcc developers.

champ
6 years ago

In the article, you mentioned that your A53 platform is a BQ Aquaris M10 (MT8163).
I’m curious. What’s the A72 platform you used?

blu
6 years ago

@champ
Marvell MacchiatoBin miniITX board using Marvell ARMADA 8040 (2x dual-A72 clusters), running Ubuntu 16.04 LTS.

Nobody of Import
6 years ago

@willmore
Egg on face (Which is what this is…)?

ARM, much like most other CPU vendors, will take a long while, if ever, to comment on or even OWN this one.

This is off into the, “Lay down the damn crack pipe,” stupid on their part.

rm
6 years ago

The instruction itself is not “nerfed”, it’s just a problem of suboptimal interpretation. And that can always be fixed.

Reading the subject I thought there was something in the ARMv8 spec redefining the behaviour of the instruction itself, making it less useful or useless entirely.

blu
6 years ago

@rm
Apologies if you found the title misleading. Once one reads the body I think the title snaps into place, though. As you put it, the instruction was made (significantly) less useful for at least two of the big cores. I’m curious if the trend continues in A73 and A75.

crashoverride
6 years ago

blu :
@rm
I’m curious if the trend continues in A73 and A75.

You should make a test program that illustrates the issue using cycle counters. The issue may be isolated to a particular silicon revision.

blu
6 years ago

@crashoverride
The observed behavior is in agreement with the software optimisation manuals for A72 and A57. But yes, a latency-measuring test app for these ops would be both useful and trivial to do.

blu
6 years ago

https://pastebin.com/NsufCsbx

Just cpufreq an aarch64 machine into a steady clock, then run the above via time, then multiply the resulting time by the machine’s clock and divide that by 5 * 10^8 * 16. Do the same passing -DCOISSUE to the compiler to see how/if tbl coissue works on that machine.

Turns out that on my A53 the q-form tbl has latency of 2, rather than the expected 3.

crashoverride
6 years ago

Billions of years of evolution. Trillions of dollars of computing infrastructure at our fingertips. Yet, somehow, I am still the one that has to do the math?

blu
6 years ago

@crashoverride
You can always hire a professional.

crashoverride
6 years ago

@blu
I should outsource it!

I actually did run the program (after some corrections), and it told me what I wanted to know: the A53 cores completed faster than the A72 cores (big.LITTLE) despite the big cores being clocked faster (1.5GHz vs 2GHz on an RK3399, using taskset for CPU affinity). I did a generic compile and did not specify any CPU tuning parameters to gcc.

blu
6 years ago

@crashoverride
A generic compile is fine – the entire measured code is in inline assembly. Here’s a shell script that does all the latency computations: https://pastebin.com/MFEJcWCt — just pass the latency executable to it. Note that it requires the time utility (not the bash built-in) and bc utility.

blu
6 years ago

Just to demonstrate how the script works, here’s the result from the amd64 version of the code (https://pastebin.com/QbCGFexr) on an intel desktop:

$ clang++-3.9 -Ofast -mssse3 lattest.cpp -DCOISSUE
$ ./lat.sh ./a.out
.9997
$ clang++-3.9 -Ofast -mssse3 lattest.cpp -DCOISSUE=0
$ ./lat.sh ./a.out
.9997

It shows that (a) there is no co-issue of pshufb on Intel, and (b) the op’s latency is 1 clock.

Shimon
6 years ago

@crashoverride
Does it make any difference if you actually shut the little cores down instead of using taskset?

crashoverride
6 years ago

Offline-ing the little cores yielded no change in execution time from what was previously observed for the big cores.

blu
6 years ago

Running the latency test on my A72 yields the following:

latency of q-form 1-src tbl with co-issue is 2
latency of q-form 1-src tbl without co-issue is 5

Curiously enough, the difference from the figures quoted in the optimisation manual is 1, i.e. 2 + 1 = 3; 3 * 2 (for co-issue) = 6, and 5 + 1 = 6 without co-issue. I guess there’s some sort of early-forwarding mechanism if the results get consumed by the same port, but that’s just a wild guess; it could just as well be imprecision of the method.

willmore
6 years ago

@blu
Do you have that testing code available? I’d like to run it on the various A53 cores I have if I may.

rm
6 years ago

rm :
The instruction itself is not “nerfed”, it’s just a problem of suboptimal interpretation. And that can always be fixed.

Bit too late, but I meant to say “implementation” there.

blu
6 years ago

@willmore
All relevant code can be found in this repo — it contains the original algorithm I was working on, and the latency tester.
https://github.com/blu/ascii_pruner

willmore
6 years ago

@blu
Thank you!

willmore
6 years ago

@blu
On an Odroid-C2 I get:
2.0016
1.0621

blu
6 years ago

@willmore
That’s normal for an A53. It means the latency of tbl is 2 clocks and co-issue works.

Pavel P
5 years ago

I wish I could get some ARM device where I could profile NEON code similarly to what you show here. Any tutorials that you could possibly point me to?

blu
5 years ago

@Pavel P
You need a PMU (performance monitoring unit)-enabled Cortex (pretty much any ARMv8 — I haven’t tried ARMv7 PMUs, even though those should be operational) with a modern-enough mainline kernel (4.x) with performance counters enabled (they should be by default these days). Then you just build the perf profiler found in the kernel tree under tools/perf and use it.

Here’s a thread on the macchiatobin forums investigating certain A72 perf counters: http://macchiatobin.net/forums/topic/kernel-4-4-52-armada-17-06-2/#post-401

Jake
2 years ago

I’m glad I came across this article. I was writing an image feature extractor in aarch64 assembly that makes heavy use of 4-source tbl instructions, and after reading this post I successfully managed to hide the 15-cycle latency completely. I thought 5 cycles on the A53 were bad enough, but ARM surprised me with 15 cycles. I have to stop trusting ARM blindly.
