Raspberry Pi 4 Benchmarked with 32-bit and 64-bit Debian OS

The first Raspberry Pi board with a 64-bit Arm processor was the Raspberry Pi 3 Model B, and all subsequent models, including the latest Raspberry Pi 4, come with four 64-bit Arm Cortex-A cores.

But in order to keep backward software compatibility with the original Raspberry Pi and Raspberry Pi 2, the Raspberry Pi Foundation decided to keep providing a 32-bit OS image, so nearly everybody is now running a 32-bit OS on 64-bit hardware, and Eben Upton famously claimed it did not matter.

Several years ago, we reported that 64-bit Arm (AArch64) instructions boosted performance by 15 to 30% compared to 32-bit Arm (AArch32), but Matteo Croce decided to try it out himself on a Raspberry Pi 4 board, first running benchmarks on 32-bit Raspbian before switching to a lightweight version of Debian compiled for aarch64.

Dhrystone is much faster with the 64-bit OS, about 50% faster, but as a synthetic benchmark its usefulness is limited. Benchmarks closer to real use cases, such as SHA1 or audio encoding, do confirm the improved performance, although to a lesser but still significant extent.
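
For readers who want a comparable SHA1 data point on their own board, here is a minimal Python sketch (not Matteo's setup) that times `hashlib.sha1` over an in-memory buffer so storage speed does not interfere. It assumes hashing is done by whatever backend the local Python build links against, usually OpenSSL, so the result reflects how that build was compiled as much as the CPU itself. The function name and buffer size are illustrative choices.

```python
import hashlib
import time


def sha1_throughput_mb_s(size_mb: int = 32, rounds: int = 3) -> float:
    """Return the best SHA-1 throughput in MB/s over `rounds` runs.

    The data is an in-memory, zero-filled buffer, so only hashing is timed.
    """
    data = bytes(size_mb * 1024 * 1024)
    best = 0.0
    for _ in range(rounds):
        start = time.perf_counter()
        hashlib.sha1(data).hexdigest()
        elapsed = time.perf_counter() - start
        best = max(best, size_mb / elapsed)
    return best


if __name__ == "__main__":
    print(f"SHA-1: {sha1_throughput_mb_s():.1f} MB/s")
```

Running the same script under the 32-bit and the 64-bit OS on the same board, at the same clock speed, gives a rough like-for-like comparison.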

However, in some cases switching to a 64-bit OS brings no benefit: VPN performance with either OpenVPN or WireGuard is virtually the same as with the default 32-bit Raspbian OS.

But the firewall performs much better with AArch64 (557k packets/s) than when the software is compiled for ARMv7 (268k packets/s), roughly twice the throughput.
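
The relative gains quoted here are simple ratios. As a quick check, this Python sketch (the `speedup` helper is a hypothetical name, not from the benchmarks) turns the firewall figures into a ratio and a percentage; the same helper expresses the roughly 50% Dhrystone gain.

```python
def speedup(new: float, baseline: float) -> float:
    """Percentage by which `new` exceeds `baseline` (higher is better)."""
    return (new / baseline - 1.0) * 100.0


# Firewall throughput from the benchmarks, in thousands of packets/s.
aarch64_kpps = 557
armv7_kpps = 268

print(f"ratio: {aarch64_kpps / armv7_kpps:.2f}x")            # ratio: 2.08x
print(f"speedup: {speedup(aarch64_kpps, armv7_kpps):.0f}%")  # speedup: 108%
```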

Benchmark results can differ greatly depending on the selected compiler flags, but sadly Matteo did not provide the full command lines used to build the OS and the benchmark samples.

I wanted to get some more data points, so I looked at sbc-bench results available for both 32-bit Raspbian and 64-bit Debian Buster, with the processor overclocked to 1850 MHz and running Linux 4.19 in both cases. But the results here are completely different, at least when it comes to AES numbers, which are only about half as fast on the 64-bit version, and one of the reasons is the lack of ARMv8 Crypto Extensions in the Broadcom BCM2711 processor.

Higher is better – memset/memcpy in MB/s, AES in KB/s

The lack of hardware crypto may explain why it’s not faster, but it does not explain why it is that much slower with 64-bit instructions. Thomas Kaiser also noted that 64-bit code has a larger memory footprint, which causes the 7-zip test to run out of memory (triggering the oom-killer) on a Raspberry Pi 4 with 1GB RAM, while it runs fine with a 32-bit OS on the same hardware.

Via Hackaday


Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager and starting to write daily news and reviews full time later in 2011.

49 Comments
jqpabc123
7 months ago

Isn’t the primary motivation for 64 bit to increase the available memory address space?

But at this point in time, most embedded systems have limited memory and storage — constrained by cost. 64 bit does nothing to alleviate this constraint and actually makes it worse — it typically *requires* more memory and storage footprint to provide the same function (as in the 7zip example).

But never say never — maybe one day it will have more to offer but right now, it appears to be a solution looking for a problem.

KarelG
7 months ago

The primary motivation for the 32-bit -> 64-bit move is the change from the ARMv7 to the ARMv8 architecture. ARMv8 provides more physical resources (e.g. number of registers), hence is more performant. The situation is very similar to the i386 -> amd64 transition in the past.

dgp
7 months ago

> Armv8 provides more physical resources (e.g. number of registers)

More registers are visible from the programming model, but how does it actually work internally? More registers might make your assembly look better, but do they make a real-world difference for anything that actually matters? Modern CPUs are terribly complex internally, and I’d say it gets very hard to claim that some tiny snippet of assembly “performs better” out of context. That is why benchmarks should try to benchmark a system as a whole.

> hence is more performant.

I think the fact that no one can usually…

Laurent
7 months ago

Yes indeed. And you can see that happen on Intel where i386 code is as fast as x86-64 (when there’s no data cache thrashing or explicit 64-bit data).

That being said, it’s not necessarily applicable to ARM CPUs, and 32-bit support might get worse in the future. For instance, Cortex-A76 no longer has 32-bit system support, and I wouldn’t be surprised if 32-bit is not a priority and could even disappear entirely (as Apple did).

I wonder where ARM ILP32 support is in the wild.

dgp
7 months ago

I don’t see 32-bit ARM going anywhere. I can see support for anything pre-v7 being dropped from GCC, Linux etc., but v7 itself is the go-to core for cheap junk that runs Linux.

tkaiser
7 months ago

> I wouldn’t be surprised that 32-bit is not the priority and could even fully disappear

Not on the Raspberry Pi which is a true 32-bit platform anyway. The primary OS (ThreadX) is 32-bit and will remain 32-bit. Currently when running a secondary OS in 64-bit fashion on any RPi you can’t access the primary OS in a reasonable way which is a show stopper for a lot of stuff those Raspberries are used for.

Laurent
7 months ago

I meant AArch32 might disappear from high-end chips even the ones that don’t go in servers.

Mark
6 months ago

The things that always seem to come out clearly ahead, like crypto, can be offloaded entirely from the CPU to DMA-capable hardware, which could be far superior and more beneficial to overall system performance.

Dan
7 months ago

64-bit is a new architecture, with more available registers and newer instructions. There are additional optimizations that can be made.

https://www.cnx-software.com/2016/03/01/64-bit-arm-aarch64-instructions-boost-performance-by-15-to-30-compared-to-32-bit-arm-aarch32-instructions/

Laurent
7 months ago

That doesn’t mean all programs will be faster. For instance, on a board with an SDA845 chip (Cortex-A75 derivative), 2 of the nbench tests are faster when compiled for AArch32: the NUMERIC SORT index is 9.66 vs 7.50 and ASSIGNMENT is 23.42 vs 20.75.

But the other tests of nbench are indeed faster for AArch64 and sometimes much faster.

AArch32:
MEMORY INDEX : 13.964
INTEGER INDEX : 16.480
FLOATING-POINT INDEX: 24.726

AArch64:
MEMORY INDEX : 15.533
INTEGER INDEX : 18.994
FLOATING-POINT INDEX: 30.047

tkaiser
7 months ago

> But the other tests of nbench are indeed faster for AArch64 and sometimes much faster

Where are those ‘much faster’ numbers? I read your numbers like this (‘improvement’ of 64-bit over 32-bit):

NUMERIC SORT: -28%
ASSIGNMENT: -13%
MEMORY INDEX : 11%
INTEGER INDEX : 15%
FLOATING-POINT INDEX: 22%

I personally consider benchmark results that vary by less than 10% as identical or ‘margin of error’ so I fail to see ‘much faster’ here 🙂

Anyway, when switching from the ‘micro benchmark’ perspective to ‘system as a whole’ I always wonder why nobody is concerned about the larger memory footprint…
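
The percentages above depend on which score is used as the baseline. This Python sketch, using Laurent's nbench indexes, prints the change relative to the AArch32 score alongside the winner's margin over the loser; tkaiser's two negative figures appear to follow the second convention (for example, 9.66 / 7.50 is about 1.29). The helper and variable names are illustrative.

```python
# nbench indexes from the comment above: (AArch32, AArch64); higher is better.
RESULTS = {
    "NUMERIC SORT": (9.66, 7.50),
    "ASSIGNMENT": (23.42, 20.75),
    "MEMORY INDEX": (13.964, 15.533),
    "INTEGER INDEX": (16.480, 18.994),
    "FLOATING-POINT INDEX": (24.726, 30.047),
}


def pct_change(new: float, base: float) -> float:
    """Percentage change of `new` relative to `base`."""
    return (new / base - 1.0) * 100.0


for name, (a32, a64) in RESULTS.items():
    vs_aarch32 = pct_change(a64, a32)                  # baseline: 32-bit score
    margin = pct_change(max(a32, a64), min(a32, a64))  # baseline: losing score
    winner = "AArch64" if a64 > a32 else "AArch32"
    print(f"{name:<21} {vs_aarch32:+6.1f}%   {winner} ahead by {margin:.1f}%")
```

Measured against the AArch32 score, NUMERIC SORT is about 22% slower on AArch64; measured against the AArch64 score, AArch32 is about 29% faster. Both describe the same gap.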

dgp
7 months ago

> I always wonder why nobody is concerned about the larger memory footprint of a 64-bit userland compared to 32-bit.

I think if your application needs a “performance” grade chip you have the budget for enough memory that it doesn’t matter. I think situations like the RK3308G (4 x A35 + 64MB of built-in DDR2) might be a bit painful and you would opt for a 64-bit kernel and 32-bit userland, but in those cases you know what you’re getting into. Either way, memory is a problem even for small 32-bit machines. I’d love to use systemd on my system…

Willy
7 months ago

> > I always wonder why nobody is concerned about the larger memory footprint of a 64-bit userland compared to 32-bit.
> I think if your application needs a “performance” grade chip you have the budget for enough memory that it doesn’t matter.

It’s not as much a matter of memory cost as it is of L1 and iTLB waste. I really miss Thumb-2 on aarch64. Thumb-2 was *theoretically* slower than ARM due to less capable instructions, but ended up being faster on most machines thanks to the much more compact code. And with aarch64 code even larger than aarch32 code,…

Laurent
7 months ago

You don’t think >20% faster is “much faster”? It’s worth a generation of chips or two.

I work in a CPU design team and I can guarantee you we fight to gain even 0.5% for high-end chips.

rando
7 months ago

Nope. This is a Microsoft lie.

32-bit systems can address any relevant amount of memory one wishes.

WinXP server (32-bit OS) before SP1 could address way more than 4GB

https://en.wikipedia.org/wiki/3_GB_barrier#Windows_version_dependencies

willy
7 months ago

You obviously have no idea what you’re talking about! Paginated memory doesn’t allow your application to use more memory than its address space permits. For efficient use, your dataset often needs to be accessible in the process’s address space, and this is almost mandatory when you go multi-threaded. So sure, you can design your application to implement paged mode in userland and swap large work areas to disk like image manipulation programs do, but that’s not exactly the way efficient applications work…
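
The distinction here is between the physical memory an OS can manage and the virtual address space a single process can use. A small Python calculation makes the arithmetic concrete; it assumes the original 36-bit physical addressing of x86 PAE and the common Linux 3G/1G user/kernel split on 32-bit systems, and the helper name is illustrative.

```python
GIB = 2**30


def addressable_gib(address_bits: int) -> float:
    """Size of an address space with `address_bits`-bit addresses, in GiB."""
    return 2**address_bits / GIB


# Virtual address space of one 32-bit process: 4 GiB in total...
total_virtual = addressable_gib(32)
# ...of which user space gets 3 GiB under a 3G/1G user/kernel split.
user_virtual = total_virtual * 3 / 4
# PAE on x86 originally widened *physical* addresses to 36 bits.
pae_physical = addressable_gib(36)

print(total_virtual, user_virtual, pae_physical)  # 4.0 3.0 64.0
```

So a PAE machine can hold up to 64 GiB of RAM, but any single process still sees at most 3 to 4 GiB of it at once unless it explicitly remaps windows of memory, which is the userland paging described above.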

Hugh
7 months ago

It is quite possible to use segmentation in userland to allow addresses to be wider than a “word”. It does not work well in C but is fairly easy in languages that don’t expose pointer characteristics. A generation of programmers remember small/medium/large program models for 16-bit x86. Explicit segmentation is a lot easier for a program than explicit paging. Paging is better for providing virtual memory (implicit paging, managed by the OS). Database programs were designed in x86-32 days to use segmentation to access more than 4G of RAM. Actually, depending on the OS, the userland address space was limited…

zoobab
7 months ago

I’m waiting for tkaiser’s comments 🙂

Rob
7 months ago

Interesting write-up. I did some research and there is a config.txt switch to enable 64-bit kernel on Raspbian. Did you test that out?

tkaiser
7 months ago

I just did that and added arm_64bit=1 to the ThreadX config so it loads an aarch64 kernel with userland unchanged. Quick sbc-bench test shows no real differences compared to running with kernel7.img (ARMv7/32-bit): http://ix.io/28N5 — though stuff that runs entirely inside the kernel might benefit from running with this 64-bit kernel.
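
To check which combination a board is actually running after setting arm_64bit=1, one option is a small Python sketch comparing the machine name the kernel reports with the pointer width of the running interpreter. It assumes the interpreter comes from the distribution's own userland, so its pointer size reflects that userland; the function name is illustrative.

```python
import platform
import struct


def describe_runtime() -> dict:
    """Report the kernel's machine name and this userland's pointer width."""
    return {
        # Same value as `uname -m`, e.g. 'aarch64' or 'armv7l'.
        "kernel_machine": platform.machine(),
        # Pointer size of this Python build, in bits (32 or 64).
        "userland_bits": struct.calcsize("P") * 8,
    }


if __name__ == "__main__":
    info = describe_runtime()
    print(info)
    # A 64-bit kernel may report 'armv8l' to 32-bit processes running
    # under the linux32 personality, so accept both names here.
    if info["kernel_machine"] in ("aarch64", "armv8l") and info["userland_bits"] == 32:
        print("64-bit kernel with a 32-bit userland")
```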

BTW: the RPi 4 sat passively cooled in a cheap aluminium enclosure. No throttling whatsoever…

willy
7 months ago

Thanks for the link to the enclosure, it’s better than mine. It’s slightly larger, but completely encloses it, while mine leaves some connectors/corners exposed. In any case, yes such passively cooled devices are perfect for this board.

tkaiser
7 months ago

> Thanks for the link to the enclosure, it’s better than mine.

I was skeptical before it arrived since it was below $10 at that time. I chose it since it also tries to take care of the heat dissipation of the PMIC area (which is one of the hottest spots on the PCB under load even with recent ThreadX releases). When it arrived I used some Blu-Tack to ‘measure’ the distance between enclosure and SoC/RAM/PMIC and since this was fine I used the provided thermal pads instead of going with a copper shim + thermal paste at least…

Franco
7 months ago

ThreadX! You said ThreadX! LOL. Nobody cares you complete & utter bore!

tkaiser
7 months ago

Seriously, this is not a comparison of “32-bit vs. 64-bit” but Raspbian vs. official Debian arm64. Raspbian on the 64-bit ARMv8 RPis combines a kernel built for ARMv7 with a userland built and optimized for ARMv6. And in Matteo’s case he switched entirely to ‘compiler benchmarks’ by using this pathetic Dhrystone anachronism or comparing a stock Raspbian binary’s performance with his own optimized build (‘Unfortunately the Debian sha1sum utility was compiled without libssl or kernel crypto support, so I had to compile it from source’). I won’t comment on his ‘network benchmarks’ since I have not the slightest idea which…

Jerry
7 months ago

Many use the RPi as their nice little home NAS. There’s been Gigabit Ethernet support for some time. At least this shows that Raspbian AES performance is better, so they’ve made the right choice sticking with Raspbian. Armbian might be better for Chinese knockoffs, but nowadays you have to worry about the CoV when shipping from China. Sunxi kernel support might also be inferior, maybe even fail to boot at all. For instance, Orange Pi Zero boards contain built-in storage but don’t boot to onboard Linux by default.

Marco
7 months ago

>but nowadays you have to worry about the CoV when shipping from China

Are you serious?
What do you do with every pack you receive? You lick it or touch it everywhere before putting your fingers in your eyes or mouth?
What a bunch of ignorance in just one sentence.

Willy
7 months ago

He could even lick it, he would only get the bacteria left there by the postman. Any trace of a virus left on the pack in the factory would have died even before leaving the factory. Frankly, I fear that erratic behaviors from scared people reading facebook will cause far more harm to Chinese people than the virus.

Willy
7 months ago

BTW I received a package yesterday from China (and I forgot to lick the package before trashing it, we’ll see in 14 days if I’m still alive :-)). They seem to be shipped very quickly these days, probably because the queue to the airport and/or cargo is much shorter due to the reduced activity; it’s the right moment to order things you need from shops that are still open!

Willy
7 months ago

My observation has always been very mixed in 32-vs-64 bit. When you have crypto extensions and depend on them, you’ll definitely win with 64-bit. When lots of pointers are used (hash tables, linked lists etc), you store twice as many pointers per cache line in 32-bit than in 64-bit and waste much less memory bandwidth, which is critical when you only have a 32-bit memory bus. My observation on ARMv7 and ARMv8 has consistently shown that gcc is 15-20% slower when built in 64-bit mode when building code for the same target (both 32 and 64-bit). If you need to manipulate…
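
The pointer-density point can be quantified with a short Python sketch that uses `ctypes` to lay out a doubly linked list node (two pointers plus a 32-bit value) with 4-byte and 8-byte pointer fields. It assumes 64-byte cache lines, as on the Cortex-A72 cores in the BCM2711. Pointer fields are modeled as fixed-width integers so both layouts can be compared on any host, and the exact padding follows the host's alignment rules; `make_node` is an illustrative helper.

```python
import ctypes

CACHE_LINE = 64  # bytes per cache line (Cortex-A72 in the BCM2711)


def make_node(ptr_type):
    """Build a doubly linked list node type: two pointers + a 32-bit value."""
    class Node(ctypes.Structure):
        _fields_ = [("next", ptr_type), ("prev", ptr_type),
                    ("value", ctypes.c_uint32)]
    return Node


# Fixed-width integers stand in for 32-bit and 64-bit pointers.
Node32 = make_node(ctypes.c_uint32)
Node64 = make_node(ctypes.c_uint64)

for name, ptr_bytes, node in (("32-bit", 4, Node32), ("64-bit", 8, Node64)):
    size = ctypes.sizeof(node)
    print(f"{name}: {CACHE_LINE // ptr_bytes} pointers/line, "
          f"node = {size} bytes, {CACHE_LINE // size} nodes/line")
```

On a typical 64-bit host the 64-bit node is padded to 24 bytes versus 12 bytes for the 32-bit one, so a cache line holds two nodes instead of five.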

blu
7 months ago

For situations where pointers dominate the throughput ILP32 should be the target, not aarch32, IMO.

Willy
7 months ago

That’s apparently what you get when building for armv8l IIRC, which is used when you build code for Cortex-A32. I tried this a year ago with my compilers and got roughly similar code to the ARM one in aarch32; I couldn’t equal Thumb-2 in code density.

Hugh
7 months ago

Since the 16 -> 32 bit transition, I’ve always thought that much of UNIX / Linux userland should be compiled with the more modest model and only the performance or memory-eating programs should be compiled with the wider model. But I’ve been too lazy to do this. I seem to remember that on 64-bit Power and SPARC this approach is used. That’s partly because the code density went way down. On X86 and (I think) Arm, the transition was accompanied by a major improvement in the instruction set and so code density didn’t take such a big hit. I would…

Willy
6 months ago

> I’ve always thought that much of UNIX / Linux userland should be compiled with the more modest model

That’s exactly what we used to do on our load balancers in the past: kernel+haproxy were 64 bits and the rest was 32. But these days you have to share many libraries, forcing you to have them in both versions, taking twice the space. So while we used to do that to save on the porting effort, it actually resulted in more space usage and code loaded in memory (two libc, two openssl etc). All this to say that it’s not always…

theguyuk
7 months ago

It just adds to the confusion over performance in real home media use, where the RPi, the hot tart 4 with heat sink, fan, case and remote, sits.

I will get flamed, but for a home media player with storage that won’t throttle after 60 minutes playing 4K files,

this is better when no GPIO is needed, and with a 32-bit OS.

theguyuk
7 months ago

Since my link got removed: this Magicsee N5 Plus Android 9.0 TV Box 8K HDR Ultra-HD Video with Amlogic S905X3 4GB RAM 64GB ROM Dual-band WiFi USB 3.0 4TB HDD / SSD Hard Drive Expansion,
on sale at GearBest for less than £47.00 including p&p, case, power supply, remote, HDMI cable and antenna.

theguyuk
7 months ago

What do the downvoters fear people knowing?

tkaiser
7 months ago

Jean-Luc, can you please remove this voting crap on your blog entirely? It already sucks to be confronted with all those TV box advertisements by the theclownuk but him always whining about others not happy with his ad links really adds to the mess.

blu
7 months ago

Au contraire, let’s keep it — I find the random downvotes a good indicator of the level of butthurt going around.

theguyuk
7 months ago

Not ad links at all, but real-life value comparisons and alternatives.
It is rich coming from you, with all your hardware and software promotion so you get discounts.
I still remember Steven paid you for software support of Orange Pi, hardware which you then promoted like mad! Users who believed you and bought the hardware found your software buggy and unreliable.

Thank heaven for Armbian and FriendlyElec for the usable software support for FriendlyElec boards.

rando
7 months ago

You seek a credibility epeen measuring contest with tkaiser? On a rando nerd/geek hardware forum on some backwater of the internet?

Step back and look at what you are saying.

Your point that specific ewaste android boxes MAY be better vid players for some short window before you trash them and get another is getting lost in your desire to measure dicks.

And to a crowd that does not think of genitals as a unit of measurement? Who are you playing to?

Franco
7 months ago

How could ThreadX anyone possibly argue ThreadX with TKaiser? I’ve heard ThreadX on the grapevine that ThreadX he knows ThreadX the name of the real-time OS ThreadX used by the Raspberry Pi and ThreadX likes to name drop it regularly, almost randomly. However, that name currently escapes me.

Gaetano
7 months ago

This Magicsee set-top box is not comparable to the Pi 4: it only has 100 Mbit Ethernet, while the Pi 4 has Gigabit Ethernet, which is better for use as a PC and also in some media player use cases.

theguyuk
7 months ago

Not so: for a home media player, 100M Ethernet is enough.

Many on audio-visual forums are dumping PC/NAS and Blu-ray rip setups for Android boxes with built-in video and audio codecs, HDR, 4K video playback and Android app support.

Jerry
7 months ago

Do these Android boxes support the latest Blu-ray lossless audio, Dirac, room EQ, Atmos? H.265 isn’t sufficient.

theguyuk
7 months ago

Many.
Does the hot tart RPi 4 support Android, and the official Android apps for Amazon, iTunes, Rakuten, Sling, Hulu, The CW, Tubi, Stirr and Pluto TV? Does the hot tart RPi 4 come with an integrated remote control, a case and a power supply, plus p&p, in the purchase price?
Is the hot tart RPi 4 recognised by Netflix for 4K playback?

Willy
7 months ago

It’s fun to see heated discussions with huge upvotes/downvotes like every single time there’s an RPi article. At least it proves these articles drain some traffic to the site, which is great 🙂 Yeah, please start downvoting me as well you gentle RPi fanboys, just to see how far we can go 🙂

Franco
7 months ago

It just goes to show the RPi is the only SBC worth talking about. All of the competing SBCs exist in a social vacuum, and due to limited software support are destined to become e-waste much sooner than any RPi board so who – other than eco-warriors – really cares about them?

And if there weren’t any more RPi articles, tkaiser would have to find a new audience to bore to death.

willy
7 months ago

> It just goes to show the RPi is the only SBC worth talking about

No, it’s even the opposite. When there are discussions about other ones here, the vendors take note of good ideas and improve future models. Look how good the VIMs, NanoPis, RockPis, Odroids, Libre Computer boards and whatever good ones have become by taking comments into account; their designers participate in the discussions here and at other places. *These* are worth talking about because there’s hope to improve them, and of course sometimes comments are wrong but who cares. RPi doesn’t care a single second about comments…

tkaiser
7 months ago

No ‘RPi fanboys’ involved in this constant downvoting/upvoting crap happening here from time to time.
