How ARM Nerfed NEON Permute Instructions in ARMv8

This is a guest post by blu about an issue he found with a specific instruction in ARMv8 NEON. He previously wrote an article about OpenGL ES development on Ubuntu Touch, and one or two other posts.

This is not a happy-ending story. But as with most unhappy-ending stories, this is a story with certain moral for the reader. So read on if you appreciate a good moral.

Once upon a time there was a very well-devised SIMD instruction set. Its name was NEON, or formally — ARM Advanced SIMD — ASIMD for short (most people still called it NEON). It was so nice, that veteran coders versed in multiple SIMD ISAs often wished other SIMD ISAs were more like NEON.

NEON had originated as part of the larger ARM ISA version 7, or ARMv7, for short. After much success in the mobile and embedded domains, ARMv7 was superseded by what experts acknowledged as the next step in the evolution of modern ISAs – ARMv8. It was so good that it was praised by compiler writers as possibly the best ISA they could wish for. As part of all the enhancements in the new ISA, NEON too got its fair share of improvements – and so ASIMD2 superseded NEON (ARMv8’s SIMD ISA is called ASIMD2, but some call it NEON2).

Now, one of the many things the original NEON got right was the permute capabilities. Contrary to other ISAs whose architects kept releasing head-banging permute ops one after another, the architects of NEON got permutes right from the start. They did so by providing a compact-yet-powerful set of permutation ops, the most versatile of which, by far, being the tbl op and its sister op tbx; each of those provided a means to compose a SIMD vector from all the thinkable combinations of the individual byte-lanes of up to 4 source SIMD vectors. Neat. The closest thing on AMD64 is pshufb from SSSE3, but it takes a single vector as input (and the AVX2 256-bit vpshufb is borked even further).

Not only NEON had those ops on an architectural level, but the actual implementations – the different μarchitectures that embodied NEON, did so quite efficiently. Second and third generation performant ARMv7 Cortex CPUs could issue up to two tbl ops per clock and return the results as soon as 3 clocks later.

So, with this fairy tale to jump-start our story, let’s teleport ourselves to present-day reality.

I was writing down an ingenious algorithm last week, one that was meant to filter elements from an input stream. Naturally, the algorithm relied heavily on integer SIMD vectors for maximum efficiency, and it happened so that I was writing the initial version on ARM64, with plans for later translation to AMD64. Now, as part of that algorithm, a vector-wise horizontal sort had to be carried – something which is best left to a sorting network (See Ken Batcher’s Sorting Network algorithms). Sorting networks are characterized by doing a fixed number of steps to sort their input vector, and at each of those steps a good amount of permutations occur. As I was sorting a 16-lane vector (a rather wide one), its sorting network was a 10-deep one, and while some of the stages required trivial permutations, others called for the most versatile permutes of them all – the mighty tbl op. So I decided that for an initial implementation I’d use tbl throughput the sorting network.

As I was writing the algorithm away from home, I was using my trusty Ubuntu tablet (Cortex-A53, ARM64) as a workstation (yes, with a keyboard). I had a benchmark for a prima-vista version up and running off L1 cache, showing the algo performing in line with my work-per-clock expectations. It wasn’t until early on the following week that I was able to finally test it on my Cortex-A72 ARM64 workhorse desktop. And there things turned bizarre.

To my stupefaction, on the A72 the bench performed nothing like on the A53. It was effectively twice slower, both in absolute times as well as in per-clock performance (tablet is 1.5GHz, desktop is 2.0GHz but I kept it at 1.3GHz when doing nothing taxing). I checked and double-checked that the compiler had not done anything stupid – it hadn’t – disassembled code was exactly as expected, and yet, there was the ‘big’ A72, 3-decode, 8-dispatch, potent-OoO design getting owned by a ‘little’ tablet’s (or a toaster’s – A53s are so omnipresent these days) in-order, 2-decode design. Luckily for me, my ARM64 desktop is perf-clad (perf is the linux profiler used by kernel developers), so seconds later I was staring at perf reports.

There was no room for guessing – there were some huge, nay, massive stalls clumped around the permute ops. The algo was spending the bulk of its time in stalling on those permutes. Those beautiful, convenient tbl permutes – part of the reason I went to prototype the algo on ARM64 in the first place. The immediate take was that A72 tbl op performed nothing like the A53 tbl op. Time to dust up the manual, buddy. What I saw in the A72 (and A57) optimization manual had me scratch my head more than I could’ve expected.

First off, in 32-bit mode (A32) tbl op performs as I’d expect it to, and as it appears to still do on the A53 in A64 mode (64-bit):

op throughput, ops/clock latency, clocks
tbl from 1 source,  64-bit-wide 2 3
tbl from 2 sources, 64-bit-wide 2 3
tbl from 3 sources, 64-bit-wide 2 6
tbl from 4 sources, 64-bit-wide 2 6

But in 64-bit mode (A64), that transforms into:

op throughput, ops/clock latency, clocks
tbl from 1 source,  64-bit-wide 2 3 * 1 = 3
tbl from 2 sources, 64-bit-wide 2 3 * 2 = 6
tbl from 3 sources, 64-bit-wide 2 3 * 3 = 9
tbl from 4 sources, 64-bit-wide 2 3 * 4 = 12
tbl from 1 source,  128-bit-wide 2 3 * 1 + 3 = 6
tbl from 2 sources, 128-bit-wide 2 3 * 2 + 3 = 9
tbl from 3 sources, 128-bit-wide 2 3 * 3 + 3 = 12
tbl from 4 sources, 128-bit-wide 2 3 * 4 + 3 = 15

That’s right – 64-bit-wide tbl is severely penalized in A64 mode on A72 vs A32 mode. In my case, I was using the 128-bit-wide versions of the op, with 2 source arguments. So on the A72 I ended up getting (snippet of relevant code timeline):

= 12 clocks of latency for the snippet

But on the A53 same snippet yielded:

= 6 clocks of latency for the snippet

As the performance of the entire algorithm was dominated by the network sort, and the entirety of the network sort was comprised of repetitions of the above snippet, all observations fell into place — A53 was indeed twice faster (per-clock) than A72/A57 on this code, by design! So much for my elegant algorithm. Now I’d need to increase the data window so much as to be able to amortize the massive pipeline bubbles with more non-dependent work. Anything less would penalize the ‘big’ ARMv8 designs.

But that’s not what gets me in this entire story – I have no issue rewriting prototype or any other code. What does put me into contemplative mood is that code written for optimal work on A53’s pipeline could choke its ‘big’ brothers A57 & A72, and code written for optimal utilization of the pipelines of those CPUs could not necessarily be the most efficient code on the A53. All it takes is some tbl permutes. That is only exacerbated by big.LITTLE setups. That begs the question what were ARM thinking when they were designing A64 mode tbl on the ‘big’ cores.

Share this:

Support CNX Software! Donate via cryptocurrencies or become a Patron on Patreon

36 Replies to “How ARM Nerfed NEON Permute Instructions in ARMv8”

  1. If you have that much inter loop dependency you are screwed anyway. But with 15 cycle latency how would arm think to handle the massive register pressure?

  2. @dx
    Exactly. Re inner-loop data dependencies, my original plan was to 3x the data window to fill up the bubbles on A53 (there’re enough regs for that), but with 9 clock latency that is an entirely different ball game.

  3. Sorry for the dumb question from an amateur – but how do you build NEON code in an armhf userspace?

    I was actually just trying to learn more about NEON these past few days but I’ve hit a bit of a wall. (I usually do higher level C++, so I’m trying to push my comfort zone here and explore these lower level optimizations) I don’t have a tablet like you, but I’m using a Chromebook with crouton to build my code. It also has an armhf userspace – but from what I understood of the GCC manual – NEON and hard floats don’t mix in the same code (b/c the FPU shares registers with NEON/SIMD). So I’m scratching my head on how to get it all working on my system.

    The NEON intrinsics make GCC angry and I get errors that look like this:

    /usr/lib/gcc/arm-linux-gnueabihf/6/include/arm_neon.h:6252:1: error: inlining failed in call to always_inline ‘float32x2_t vget_high_f32(float32x4_t)’: target specific option mismatch

    adding a -mfpu=neon or -mfpu=neon-vfpv4 flag doesn’t help

    I think I’m misunderstanding something fundamentally here though. Any pointers? And any recommendations on resources where to learn more about this?

    Great write up btw. It’s good to read something really from the technical trenches 🙂

  4. @geokon
    You’re on the right track. I don’t have a chromebook to test this on, but Ubuntu tablets come out of the factory as armhf (I’ve force-fed mine with an aarch64 toolchain on top of armhf), so something along the lines of:

    $ gcc -march=armv8-a -marm -mfpu=neon

    for ARMv8 cpus, and for ARMv7 cpus:

    $ gcc -march=armv7-a -marm -mfpu=neon

    should do. You can even throw in -mcpu= for good measure (e.g. cortex-a57).

    Also, there’s no issue mixing neon and vfp code, particularly when you do neon via intrinsics, as the compiler is fully aware of the effects of each op. Back on the Cortex-A8 you could tell the compiler to do all fp math via neon, as the vfp was non-pipelined ( : / ) but that was a software optimisation, not a hw limitation.

  5. Optimising for CPUs often does mean you have to write multiple versions of the same function targeting each specific CPU.

    ARM, unfortunately, doesn’t make this easy as an A53 is quite different to an A72. I’ve never optimised for ARM, only for x86. Fortunately in x86, most CPUs are fairly similar.
    As an aside, PSHUFB typically has a 1 cycle latency and 1-2 ops/clock throughput on most x86 CPUs. There was an XOP VPPERM instruction which could source from 2x 128-bit registers, but support for it is limited. AVX-512 VBMI has a VPPERMB instruction which can source from 2x 512-bit registers, but no CPU supports it as of yet; I also don’t expect it to be fast. SSE/AVX in general isn’t as consistent as NEON.

    I can’t explain your find, but permute is usually quite different from most other SIMD operations. Perhaps the A72 decided to trade permute performance (which is probably not frequently used for most ARM applications) for something else.

  6. @–
    Don’t get me wrong – this is not a rant about permutes on the A72/A57 per se. I’m looking at them in the context of their LITTLE companion, the one they’re supposed to ping-pong code with. Clearly, ARM had their priorities with the big designs, but some of those turned out rather puzzling in retrospective, you got to agree.

    Apropos, just out of curiosity, I rewrote the original function in d-form ops (64-bit SIMD) and that made the A72 only 50% slower than the A53 (and the A53 took a slight hit thanks to a few ins ops not needed in the q-form (128-bit)).

    For the record, the entire effort along that algo has moved on – now I’m on a revision of the code which is again performing predictably-well on A53 and not so well on the A72, while on the AMD64 it performs nicely on desktop CPUs but suffers on Bobcats/Jaguars (death by popcnt ™). So I’m wondering if I should prioritize desktop AMD64 and little ARM64, and call it a day : )

  7. @Shimon
    It’s a hardware limitation – there’s only so much compilers can do in certain situations. When multitudes of big latencies are involved in tight data-dependency closures it’s the code’s author who can (or cannot) properly fix things by manipulating the workload size and distribution; a classic example of if one does not help themselves nobody else will. The crux of the issue in my particular case is that I did not anticipate such large permute latencies on the big cores, vis-a-vis the normal latencies on the little cores. I always viewed the big cores as better-or-equally performing versions of the little cores. Or perhaps I hoped that’s how ARM viewed them. I was clearly naive.

  8. Even if you’re right and there’s nothing more to it, an erratum might still be possible to mitigate the problem.

    It won’t hurt to get a more formal reply from ARM’s gcc developers.

  9. In article, you mentioned that your A53 platform is BQ Aquaris M10 (MT8163).
    I’m curious. What’s the A72 platform you used?

  10. @willmore
    Egg on face (Which is what this is…)?

    ARM will take a long while, if ever, much like most other CPU vendors, to comment or even OWN this one.

    This is off into the, “Lay down the damn crack pipe,” stupid on their part.

  11. The instruction itself is not “nerfed”, it’s just a problem of suboptimal interpretation. And that can always be fixed.

    Reading the subject I thought there’s something in the ARMv8 spec redifining the behaviour of the instruction itself, making it less useful or useless entirely.

  12. @rm
    Apologies if you found the title misleading. Once one reads the body I think the title snaps into place, though. As you put it, the instruction was made (significantly) less useful for at least two of the big cores. I’m curious if the trend continues in A73 and A75.

  13. blu :
    I’m curious if the trend continues in A73 and A75.

    You should make a test program that illustrates the issue using cycle counters. The issue may be isolated to a particular silicon revision.

  14. @crashoverride
    The observed behavior is in agreement with the software optimisation manuals for A72 and A57. But yes, a latency-measuring test app for these ops would be both useful and trivial to do.


    Just cpufreq an aarch64 machine into a steady clock, then run the above via time, then multiply the resulting time by the machine’s clock and divide that by 5 * 10^8 * 16. Do the same passing -DCOISSUE to the compiler to see how/if tbl coissue works on that machine.

    Turns out that on my A53 the q-form tbl has latency of 2, rather than the expected 3.

  16. Billions of years of evolution. Trillions of dollars of computing infrastructure at our fingertips. Yet, somehow, I am still the one that has to do the math?

  17. @blu
    I should outsource it!

    I actually did run the program (after some corrections), and it told me what I wanted to know: the A53 cores completed faster than the A72 cores (big.LITTLE) despite the big cores being clocked faster (1.5Ghz vs 2Ghz on RK3399 using taskset for CPU affinity). I did a generic compile and did not specify any CPU tuning parameters to gcc.

  18. Just to demonstrate how the script works, here’s the result from the amd64 version of the code ( on an intel desktop:

    $ clang++-3.9 -Ofast -mssse3 lattest.cpp -DCOISSUE
    $ ./ ./a.out
    $ clang++-3.9 -Ofast -mssse3 lattest.cpp -DCOISSUE=0
    $ ./ ./a.out

    It shows that (a) there is not co-issue of pshufb on intel, and (b) the op’s latency is 1 clock.

  19. @crashoverride
    Does it make any difference if you actually shut the little cores down instead of using taskset?

  20. Offline-ing the little cores yielded no change in execution time from what was previously observed for the big cores.

  21. Running the latency test on my A72 yields the following:

    latency of q-form 1-src tbl with co-issue is 2
    latency of q-form 1-src tbl without co-issue is 5

    Curiously enough the difference form the figures quoted in the optimisation manual is 1, ie. 2 + 1 = 3; 3 * 2 (for co-issue) = 6, and 5 + 1 = 6 without co-issue. I guess there’s some sort of early-forwarding mechanism if the results get consumed by the same port, but that’s just a wild guess; it could be just as well imprecision of the method.

  22. rm :
    The instruction itself is not “nerfed”, it’s just a problem of suboptimal interpretation. And that can always be fixed.

    Bit too late, but I meant to say “implementation” there.

  23. I wish I could get some arm device where I could profile neon code similarly to what you show here. Any tutorials that you could possibly point to?

    1. @Pavel P
      You need a PMU (performance-monitoring-unit)-enabled Cortex (pretty much any armv8 — I haven’t tried armv7 PMUs, even though those should be operational) with a modern-enough mainline kernel (4.x) with performance counters enabled (should be by default these days). Then you just build perf profiler found in the kernel tree under tools/perf and use it.

      Here’s a thread on the macchiatobin forums investigating certain A72 perf counters:

  24. I’m glad I came across this article. I was writing an image feature extractor in aarch64 assembly that makes heavy use of 4 sources tbl instructions, and successfully managed to hide the latency of 15 cycles completely after reading this post. I thought 5 cycles on A53 were bad enough, but ARM surprised me with 15 cycles. I have to stop trusting ARM blindly.

Leave a Reply

Your email address will not be published. Required fields are marked *