NUMA emulation patch boosts Geekbench 6 benchmark results by up to 18% on Raspberry Pi 5

Igalia engineer Tvrtko Ursulin recently submitted a patch to the Linux kernel that adds a NUMA (Non-Uniform Memory Access) emulation implementation for arm64 platforms. It can boost the performance of 64-bit Arm targets by “splitting the physical RAM into chunks and utilizing an allocation policy to better utilize parallelism in physical memory chip organization”.
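To make the quoted idea more concrete, here is a toy model in Python of what "splitting the physical RAM into chunks" and interleaving allocations across them could look like. It is only a sketch and not the code from the patch: the 8 GiB RAM size, the 4 KiB page size, and the function names are all assumptions made for illustration.

```python
PAGE_SIZE = 4096   # assumed 4 KiB pages
GIB = 1 << 30


def split_into_nodes(ram_start, ram_size, nodes):
    """Split [ram_start, ram_start + ram_size) into equal contiguous chunks,
    one per emulated NUMA node. Returns a list of (start, end) addresses."""
    chunk = ram_size // nodes
    return [(ram_start + i * chunk, ram_start + (i + 1) * chunk)
            for i in range(nodes)]


def allocate_interleaved(num_pages, node_ranges):
    """Place pages round-robin across the emulated nodes, which is the
    behaviour an interleaving allocation policy aims for."""
    next_free = [start for start, _end in node_ranges]
    placements = []
    for page in range(num_pages):
        node = page % len(node_ranges)
        addr = next_free[node]
        if addr + PAGE_SIZE > node_ranges[node][1]:
            raise MemoryError(f"emulated node {node} is full")
        placements.append((node, addr))
        next_free[node] += PAGE_SIZE
    return placements


# Hypothetical 8 GiB board split into four emulated nodes.
nodes = split_into_nodes(0, 8 * GIB, 4)
for node, addr in allocate_interleaved(6, nodes):
    print(f"page on node {node} at {addr:#x}")
```

In this model, consecutive pages land in different quarters of the address space instead of next to each other, which is the kind of spread that may let accesses exploit parallelism inside the memory chip.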

The NUMA emulation implementation was tested on a Raspberry Pi 5 SBC. With the RAM split into four emulated NUMA nodes, the Geekbench 6 single-core score improved by 6% and the multi-core score by 18%. In other words, if performance scaled linearly with clock speed, the multi-core gain would be like overclocking the Broadcom BCM2712 CPU from 2.4 GHz to 2.83 GHz.
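The overclocking comparison is simple arithmetic, shown below as a short Python sketch. It assumes that the benchmark score scales linearly with CPU frequency, which is a simplification, and the function name is made up for this example.

```python
BASE_CLOCK_GHZ = 2.4     # stock BCM2712 clock quoted in the article
MULTI_CORE_GAIN = 0.18   # 18% multi-core improvement quoted in the article


def equivalent_clock_ghz(base_ghz, gain):
    """Clock speed that would give the same gain under linear scaling."""
    return base_ghz * (1 + gain)


print(f"{equivalent_clock_ghz(BASE_CLOCK_GHZ, MULTI_CORE_GAIN):.2f} GHz")  # 2.83 GHz
```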

Raspberry Pi 5 NUMA emulation

The patch is quite short at around 100 lines, and the main C code file is about 60 lines long once the SPDX header is stripped.


The code is enabled at build time with the new NUMA_EMULATION Kconfig option, and then at runtime with the existing (shared with other platforms) numa=fake=<N> kernel boot argument. Users also need to apply an interleaving allocation policy when launching the test program.


That applies the policy to one specific program, but Tvrtko also explains that a system-wide policy could be configured via systemd.
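As a rough illustration of the two runtime steps described above, the Python sketch below builds the boot argument and a per-program launch command. The numa=fake=<N> argument comes from the patch description. Using numactl's --interleave=all option to apply the interleaving policy is an assumption here, as are the helper names and the ./geekbench6 path.

```python
import shlex


def fake_numa_bootarg(nodes):
    """Kernel command-line fragment requesting `nodes` emulated NUMA nodes."""
    if nodes < 1:
        raise ValueError("at least one node is required")
    return f"numa=fake={nodes}"


def interleaved_command(program_argv):
    """Wrap a program so its memory is interleaved across all NUMA nodes.
    Assumes the numactl utility is installed on the target system."""
    return ["numactl", "--interleave=all", *program_argv]


print(fake_numa_bootarg(4))
# numa=fake=4
print(shlex.join(interleaved_command(["./geekbench6"])))  # hypothetical path
# numactl --interleave=all ./geekbench6
```

On a kernel built with the option and booted with the argument, the returned list could be passed to subprocess.run() to launch the benchmark under the interleaving policy.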

Although there’s no guarantee that benchmark improvements translate into an overall system improvement, it’s great to have a “free” performance boost. The patch will still have to go through some iterations, and it’s unclear whether it will be accepted, as Greg Kroah-Hartman replied:

> Why not just properly describe the numa topology in your bootloader or device tree and not need any such “fake” stuff at all?
>
> Also, you are now asking me to maintain these new files, not something I’m comfortable doing at all sorry.

Time will tell.

Via Tom’s Hardware and Phoronix

9 Comments
tkaiser
20 days ago

> the Geekbench 6 single-core score improved by 6%

That’s the same ‘improvement’ you get on the RPi 5 by executing Geekbench 6 multiple times after booting w/o any modification at all. As for the multi score, the RPi 5 is a great target since it suffers from some sort of memory bottleneck anyway.

The quad core BCM2712 scores in GB6 multi only ~200% of the single score, while SoCs with a better memory interface get closer to 300% or beyond (e.g. only the A76 cores in RK3588).

tkaiser
20 days ago

> executing Geekbench 6 multiple times after booting w/o any modification at all

I remembered wrong. The 6% score improvement on RPi 5 is not due to multiple executions but uptime related. You get lower GB6 scores directly after (re)boot but if you wait 20 minutes or so the scores magically improve by 6%.

fossxplorer
20 days ago

Amazing.

fossxplorer
20 days ago

I wonder which of these ARM SBCs would work ok-ish for a desktop with Fedora and i3/Sway?
Would RK3588 do it?

tkaiser
20 days ago

> and other arm64 platforms

IMO both the title and the contents are misleading since the patch comment talks only about RPi 5 (massively being bottlenecked wrt memory access) and not arm64 in general and parts of the ‘speed improvements’ are most probably the result of ‘benchmarking gone wrong’.

It’s easy to reproduce: boot RPi 5 and execute GB6 immediately, then wait 20 minutes and execute it again: scores improve by ~6% anyway.

tkaiser
20 days ago

The patch code is arm64 but the ‘speed improvements’ have only been tested on a single SBC in a flawed way using a) a benchmark that improves scores based on uptime and b) testing it solely on the arm64 SoC with the worst memory interface known.

Your readers now may think they will see GB6 scores (or even ‘real world performance’ – LMAO) improving by 18% on any arm64 with this patch set which is pretty unlikely.

jfikar
15 days ago

Well, I have tried Geekbench 6 on a RPi5 immediately after reboot and also 30 minutes later and the scores are identical. I have just the Raspberry Pi OS lite – no GUI, no HDMI connected, just SSH access. When I apply the NUMA patch in question, I see an improvement of +4.6% and +15.6% for single-core and multi-core, respectively. On RPi4 I see +1.0% and +7.1% for single-core and multi-core, respectively. This is with THP enabled. THP itself brings about +1.8% and +1.1% (RPi5) and +2.7% and 1.9% (RPi4). Don’t know why THP is not compiled in on the Raspberry Pi kernel.…
