Raspberry Pi 4 Benchmarked with 32-bit and 64-bit Debian OS

The first Raspberry Pi board with a 64-bit Arm processor was Raspberry Pi 3 Model B, and all new models including the latest Raspberry Pi 4 come with four Arm Cortex-A 64-bit cores.

But to keep backward software compatibility with the original Raspberry Pi and Raspberry Pi 2, the Raspberry Pi Foundation decided to keep providing a 32-bit OS image, so nearly everybody is now running a 32-bit OS on 64-bit hardware, and Eben Upton famously claimed it did not matter.

We already wrote several years ago that 64-bit Arm (AArch64) boosted performance by 15 to 30% against 32-bit Arm (AArch32), but Matteo Croce decided to try it out himself on a Raspberry Pi 4 board, first running benchmarks on 32-bit Raspbian before switching to a lightweight version of Debian compiled for aarch64.

Dhrystone is much faster with the 64-bit OS, namely 50% faster, but as a synthetic benchmark, its usefulness is limited. Benchmarks closer to real use cases, such as SHA1 or audio encoding, confirm the improved performance, to a lesser but still significant extent.

Aarch32 vs Aarch64 Raspberry Pi 4

However, in some cases there is no benefit to switching to a 64-bit OS: VPN performance with either OpenVPN or WireGuard is virtually the same as with the default 32-bit Raspbian OS.

Raspberry Pi 4 32-bit vs 64-bit VPN & Firewall

But the firewall performs much better with AArch64 (557k packets/s) than when the software is compiled for Armv7 (268k packets/s).
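As a quick sanity check on "much better", the two firewall figures quoted above can be turned into a speedup ratio. This is only a sketch: the function names are made up for illustration, and the inputs are the rounded packet rates from the article, not raw measurements.

```python
def speedup(new: float, old: float) -> float:
    """Return how many times faster `new` is than `old` (higher-is-better metrics)."""
    return new / old


def percent_gain(new: float, old: float) -> float:
    """Return the relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Rounded firewall throughput figures quoted in the article (packets per second).
FIREWALL_AARCH64_PPS = 557_000
FIREWALL_ARMV7_PPS = 268_000

if __name__ == "__main__":
    ratio = speedup(FIREWALL_AARCH64_PPS, FIREWALL_ARMV7_PPS)
    gain = percent_gain(FIREWALL_AARCH64_PPS, FIREWALL_ARMV7_PPS)
    print(f"speedup: {ratio:.2f}x, gain: {gain:.0f}%")
```

With these inputs the ratio works out to a little over 2x, i.e. roughly double the packet rate.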

Benchmark results can differ greatly depending on the selected compiler flags, but sadly Matteo did not provide the full command lines used to build the OS and samples.

I wanted to get some more data points, so I had a look at sbc-bench results available for both 32-bit Raspbian and 64-bit Debian Buster, with the processor overclocked to 1850 MHz and running Linux 4.19 in both cases. But the results we have here are completely different, at least when it comes to the AES numbers, which are twice as slow on the 64-bit version, and one of the reasons is the lack of ARMv8 Crypto Extensions in the Broadcom BCM2711 processor.

Raspberry Pi 4 AES memset 32-bit vs 64-bit
Higher is better – memset/memcpy in MB/s, AES in KB/s

The lack of hardware crypto may explain why it’s not faster, but it does not explain why it is that much slower with 64-bit instructions. Thomas Kaiser also noted that 64-bit code has a larger footprint, which causes the 7-zip test to run out of memory (oom-killer) on a Raspberry Pi 4 with 1GB RAM, while it runs fine with a 32-bit OS on the same hardware.
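Whether a given board exposes the ARMv8 Crypto Extensions can usually be checked from the feature flags the Linux kernel reports in /proc/cpuinfo. The sketch below assumes a Linux system that lists an `aes` flag on the `Features` (Arm) or `flags` (x86) line when AES instructions are available; the helper names and the two sample lines are illustrative and were not captured from the boards discussed here.

```python
import os


def cpu_features(cpuinfo_text: str) -> set:
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    features = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Arm kernels use "Features", x86 kernels use "flags".
        if sep and key.strip().lower() in ("features", "flags"):
            features.update(value.split())
    return features


def has_aes(cpuinfo_text: str) -> bool:
    """Return True if the kernel advertises hardware AES instructions."""
    return "aes" in cpu_features(cpuinfo_text)


# Illustrative sample lines (assumptions, not real captures).
SAMPLE_WITH_CRYPTO = "Features\t: fp asimd evtstrm aes pmull sha1 sha2 crc32\n"
SAMPLE_WITHOUT_CRYPTO = "Features\t: fp asimd evtstrm crc32 cpuid\n"

if __name__ == "__main__":
    print("sample with crypto:", has_aes(SAMPLE_WITH_CRYPTO))
    print("sample without crypto:", has_aes(SAMPLE_WITHOUT_CRYPTO))
    if os.path.exists("/proc/cpuinfo"):
        with open("/proc/cpuinfo") as f:
            print("this machine:", has_aes(f.read()))
```

A missing `aes` flag means AES has to be computed in plain software, which is consistent with the article's point that the BCM2711 gets no hardware help here.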

Via Hackaday

 


49 Replies to “Raspberry Pi 4 Benchmarked with 32-bit and 64-bit Debian OS”

  1. Isn’t the primary motivation for 64 bit to increase the available memory address space?

    But at this point in time, most embedded systems have limited memory and storage — constrained by cost. 64 bit does nothing to alleviate this constraint and actually makes it worse — it typically *requires* more memory and storage footprint to provide the same function (as in the 7zip example).

    But never say never — maybe one day it will have more to offer but right now, it appears to be a solution looking for a problem.

    1. The primary motivation for going from 32-bit to 64-bit is the change from the Armv7 to the Armv8 architecture. Armv8 provides more physical resources (e.g. number of registers) and hence is more performant. The situation is very similar to what we had during the i386 -> amd64 transition in the past.

      1. >Armv8 provides more physical resources (e.g. number of registers)

        More registers that are visible from the programming model but how does it actually work internally? More registers might make your assembly look better but does it actually make a real world difference for anything that actually matters? Modern CPUs are terribly complex internally and I’d say it gets very hard to actually say that some tiny snippet of assembly “performs better” out of context. Which is why benchmarks should try to benchmark a system as a whole.

        >hence is more performant.

        I think the fact that no one can usually agree on whether benchmarks that try to prove this are actually good benchmarks makes that statement moot. The things that seem to always come out as clearly performing better, like crypto, can be accelerated totally off of the CPU by DMA capable hardware which could be far far superior and more beneficial to overall system performance.

        1. Yes indeed. And you can see that happen on Intel where i386 code is as fast as x86-64 (when there’s no data cache thrashing or explicit 64-bit data).

          That being said, it’s not necessarily applicable to ARM CPUs, and 32-bit support might get worse in the future. For instance, Cortex-A76 has no 32-bit system support any more, and I wouldn’t be surprised if 32-bit stops being a priority and even fully disappears (as Apple did).

          I wonder where ARM ILP32 support is in the wild.

          1. I don’t see 32 bit ARM going anywhere. I can see support for anything pre v7 being dropped from GCC, linux etc but v7 itself is the go to core for cheap junk that runs linux.

          2. > I wouldn’t be surprised that 32-bit is not the priority and could even fully disappear

            Not on the Raspberry Pi which is a true 32-bit platform anyway. The primary OS (ThreadX) is 32-bit and will remain 32-bit. Currently when running a secondary OS in 64-bit fashion on any RPi you can’t access the primary OS in a reasonable way which is a show stopper for a lot of stuff those Raspberries are used for.

          3. I meant AArch32 might disappear from high-end chips even the ones that don’t go in servers.

        2. The things that seem to always come out as clearly performing better like crypto can be accelerated totally from the CPU by DMA capable hardware which could be far far superior and more beneficial to overall system performance.

      1. That doesn’t mean all programs will be faster. For instance, on a board with an SDA845 chip (Cortex-A75 derivative), two of the nbench tests are faster when compiled for AArch32: NUMERIC SORT index is 9.66 vs 7.50 and ASSIGNMENT is 23.42 vs 20.75.

        But the other tests of nbench are indeed faster for AArch64 and sometimes much faster.

        AArch32:
        MEMORY INDEX : 13.964
        INTEGER INDEX : 16.480
        FLOATING-POINT INDEX: 24.726

        AArch64:
        MEMORY INDEX : 15.533
        INTEGER INDEX : 18.994
        FLOATING-POINT INDEX: 30.047

        1. > But the other tests of nbench are indeed faster for AArch64 and sometimes much faster

          Where are those ‘much faster’ numbers? I read your numbers like this (‘improvement’ of 64-bit over 32-bit):

          NUMERIC SORT: –28%
          ASSIGNMENT: –13%
          MEMORY INDEX : 11%
          INTEGER INDEX : 15%
          FLOATING-POINT INDEX: 22%

          I personally consider benchmark results that vary by less than 10% as identical or ‘margin of error’ so I fail to see ‘much faster’ here 🙂

          Anyway, when switching from the ‘micro benchmark’ perspective to ‘system as a whole’ I always wonder why nobody is concerned about the larger memory footprint of a 64-bit userland compared to 32-bit.

          This comparison of an Octane 2.0 benchmark for nodeJS shows identical performance for 32-bit and 64-bit binaries while the needed memory of the arm64 userland was almost twice as high: https://github.com/nodesource/distributions/issues/375#issuecomment-290440706

          As soon as systems low on memory (like SBC for example) start to swap usually it’s game over with performance.
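The percentages in the comment above can be recomputed from the nbench indices quoted in this thread. The sketch below assumes higher indices are better and that, for NUMERIC SORT and ASSIGNMENT, the first number of each quoted pair is the AArch32 result; the function and dictionary names are illustrative. Note that when AArch32 is used as the baseline for every row, the two regressions come out closer to -22% and -11%; the -28% and -13% figures appear to use AArch64 as the baseline for those two rows.

```python
def pct_change(new: float, baseline: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new / baseline - 1.0) * 100.0


# nbench indices as quoted in the thread (higher is assumed to be better).
AARCH32 = {
    "NUMERIC SORT": 9.66,
    "ASSIGNMENT": 23.42,
    "MEMORY INDEX": 13.964,
    "INTEGER INDEX": 16.480,
    "FLOATING-POINT INDEX": 24.726,
}
AARCH64 = {
    "NUMERIC SORT": 7.50,
    "ASSIGNMENT": 20.75,
    "MEMORY INDEX": 15.533,
    "INTEGER INDEX": 18.994,
    "FLOATING-POINT INDEX": 30.047,
}


def compare(a32: dict, a64: dict) -> dict:
    """Change of AArch64 over AArch32 for every test, using AArch32 as baseline."""
    return {name: pct_change(a64[name], a32[name]) for name in a32}


if __name__ == "__main__":
    for name, change in compare(AARCH32, AARCH64).items():
        print(f"{name:<22} {change:+.1f}%")
```

Keeping one baseline for all rows makes gains and losses directly comparable, which matters when the argument is about whether a difference exceeds a 10% noise threshold.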

          1. >I always wonder why nobody is concerned about the larger
            >memory footprint of a 64-bit userland compared to 32-bit.

            I think if your application needs a “performance” grade chip you have the budget for enough memory that it doesn’t matter. I think situations like the RK3308G (4 x A35 + 64MB of built in DDR2) might be a bit painful and you would opt for a 64bit kernel and 32bit userland but in those cases you know what you’re getting into.

            Either way memory is a problem even for small 32bit machines. I’d love to use systemd on my system with 16MB of flash and 64MB of DDR2 but it ends up bigger (storage and resident size) than a whole busybox based system currently. There was some effort to make the kernel more configurable to make it smaller for these cases but that seems to have died out and no one seems interested in making systemd small enough that you don’t need 256MB+ of memory and eMMC storage to use it.

          2. > >I always wonder why nobody is concerned about the larger
            > >memory footprint of a 64-bit userland compared to 32-bit.
            > I think if your application needs a “performance” grade chip you have the budget for enough memory that it doesn’t matter.
            It’s not as much a matter of memory cost as it is of L1 and iTLB waste. I really miss thumb2 on aarch64. Thumb2 was *theoretically* slower than ARM due to less capable instructions but ended up being faster on most machines thanks to the much more compact code. And with aarch64 code even larger than aarch32 code, we do lose performance just because of the instruction set and code size here. However it’s also a bit more modern and comes with new goodies (like the CAS instruction in ARMv8.1).

          3. You don’t think >20% faster is “much faster”? It’s worth a generation of chips or two.

            I work in a CPU design team and I can guarantee you we fight to gain even 0.5% for high-end chips.

      1. You obviously have no idea what you’re talking about! Paginated memory doesn’t allow your application to use more memory than its address space permits. For efficient use your dataset often needs to be accessible in the process’s address space, and this is almost mandatory when you go multi-threaded. So sure, you can design your application to implement paging in userland and swap large work areas to disk like image manipulation programs do, but that’s not exactly the way efficient applications work…

        1. It is quite possible to use segmentation in userland to allow addresses to be wider than a “word”. It does not work well in C but is fairly easy in languages that don’t expose pointer characteristics. A generation of programmers remember small/medium/large program models for 16-bit x86.

          Explicit segmentation is a lot easier for a program than explicit paging. Paging is better for providing virtual memory (implicit paging, managed by the OS).

          Database programs were designed in x86-32 days to use segmentation to access more than 4G of RAM. Actually, depending on the OS, the userland address space was limited to 2G or 3G.

          In those days, swapping wasn’t great, but it was more practical than the RPi’s swapping to SD card. A 6G program might have a 3G working set, for example.

  2. Interesting write-up. I did some research and there is a config.txt switch to enable 64-bit kernel on Raspbian. Did you test that out?

    1. I just did that and added arm_64bit=1 to the ThreadX config so it loads an aarch64 kernel with userland unchanged. Quick sbc-bench test shows no real differences compared to running with kernel7.img (ARMv7/32-bit): http://ix.io/28N5 — though stuff that runs entirely inside the kernel might benefit from running with this 64-bit kernel.

      BTW: the RPi 4 sat passively cooled in a cheap aluminium enclosure. No throttling whatsoever…
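The mixed setup described above (64-bit kernel, unchanged 32-bit userland) can be confirmed from a running system by comparing what the kernel reports with the pointer size of the running process. This is a rough sketch: the list of 64-bit machine strings is a heuristic that only covers a few common architectures, the function names are illustrative, and a process started under a 32-bit personality may see a different machine string.

```python
import platform
import struct

# Heuristic: machine strings treated as 64-bit kernels (not exhaustive).
KNOWN_64BIT_MACHINES = {"aarch64", "arm64", "x86_64", "amd64"}


def kernel_bits(machine: str) -> int:
    """Guess the kernel word size from a `uname -m` style string."""
    return 64 if machine.lower() in KNOWN_64BIT_MACHINES else 32


def userland_bits(pointer_bytes: int) -> int:
    """Word size of the running process, derived from its pointer size."""
    return pointer_bytes * 8


def describe(machine: str, pointer_bytes: int) -> str:
    """Summarise kernel and userland word sizes in one line."""
    return f"{kernel_bits(machine)}-bit kernel, {userland_bits(pointer_bytes)}-bit userland"


if __name__ == "__main__":
    # struct.calcsize("P") is the size of a C pointer in this interpreter.
    print(describe(platform.machine(), struct.calcsize("P")))
```

On a board booted with a 64-bit kernel but running a 32-bit Python from the stock userland, this would be expected to report a 64-bit kernel with a 32-bit userland.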

      1. Thanks for the link to the enclosure, it’s better than mine. It’s slightly larger, but completely encloses it, while mine leaves some connectors/corners exposed. In any case, yes such passively cooled devices are perfect for this board.

        1. > Thanks for the link to the enclosure, it’s better than mine.

          I was skeptical before it arrived since it was below $10 at that time. I chose it since it also tries to take care of the heat dissipation of the PMIC area (which is one of the hottest spots on the PCB under load even with recent ThreadX releases).

          When it arrived I used some Blu-Tack to ‘measure’ the distance between enclosure and SoC/RAM/ PMIC and since this was fine I used the provided thermal pads instead of going with a copper shim + thermal paste at least for the SoC.

          All fine so far (enclosure fits perfectly) and no throttling even under highest loads.

  3. Seriously, this is not a comparison of “32-bit vs. 64-bit” but Raspbian vs. official Debian arm64. Raspbian on the 64-bit ARMv8 RPis combines a kernel built for ARMv7 with a userland built and optimized for ARMv6. And in Matteo’s case he switched entirely to ‘compiler benchmarks’ by using this pathetic Dhrystone anachronism or comparing a stock Raspbian binary’s performance with his own optimized build (‘Unfortunately the Debian sha1sum utility was compiled without libssl or kernel crypto support, so I had to compile it from source’). I won’t comment on his ‘network benchmarks’ since I have not the slightest idea which use cases they represent…

    Some more datapoints and an explanation why such ‘Raspbian vs. something else’ comparisons are not 32-bit vs. 64-bit can be found here: https://www.raspberrypi.org/forums/viewtopic.php?t=247959&start=175#p1524499 (by searching the forum for ‘ejolson benchmark 64 32’ combined with ‘python’ or ‘openssl’ a few more explanations might come up why numbers differ).

  4. Many use RPi as their nice little home NAS. There’s been gigabit ethernet support for some time. At least this shows that Raspbian AES performance is better so they’ve made the right choice sticking with Raspbian. ArmBian might be better for Chinese knockoffs, but nowadays you have to worry about the CoV when shipping from China. Sunxi kernel support might also be inferior, maybe even fail to boot at all. For instance Orange Pi Zero boards contain built-in storage but don’t boot to onboard Linux by default.

    1. >but nowadays you have to worry about the CoV when shipping from China

      Are you serious ?
      What do you do with every pack you receive ? You lick it or touch it everywhere before putting fingers in your eyes or mouth ?
      What a bunch of ignorance in just one sentence.

      1. He could even lick it, he would only get the bacteria left there by the postman. Any trace of a virus left on the pack in the factory would have died even before leaving the factory. Frankly, I fear that erratic behaviors from scared people reading facebook will cause far more harm to Chinese people than the virus.

      2. BTW I received a package yesterday from China (and I forgot to lick the package before trashing it, we’ll see in 14 days if I’m still alive :-)). They seem to be shipped very quickly these days, probably that the queue to the airport and/or cargo is much shorter due to the reduced activity, it’s the right moment to order things you need from shops that are still open!

  5. My observation has always been very mixed in 32-vs-64 bit. When you have crypto extensions and depend on them, you’ll definitely win with 64-bit. When lots of pointers are used (hash tables, linked lists etc), you store twice as many pointers per cache line in 32-bit as in 64-bit and waste much less memory bandwidth, which is critical when you only have a 32-bit memory bus. My observation on ARMv7 and ARMv8 has consistently shown that gcc is 15-20% slower when built in 64-bit mode when building code for the same target (both 32 and 64-bit). If you need to manipulate large objects in memory, 64-bit might help (just mmap the whole file and you’re done, instead of paging in the application). Code size is quite a bit smaller in ARMv7 thumb2 mode and causes fewer cache misses.

    I’d say that usually 64-bit is the way to go at least because it’s where future efforts will go, unless there is a very compelling reason to stay in 32-bit (i.e. your most significant application runs slower).
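The cache-line argument in the comment above is easy to put into numbers. The sketch below assumes a 64-byte cache line and a toy linked-list node made of one `next` pointer plus a 4-byte payload, padded to pointer alignment; both are illustrative assumptions, and real layouts depend on the compiler and ABI.

```python
CACHE_LINE_BYTES = 64  # assumption: a common L1 cache line size


def pointers_per_cache_line(pointer_bytes: int, line_bytes: int = CACHE_LINE_BYTES) -> int:
    """How many pointers of the given size fit in one cache line."""
    return line_bytes // pointer_bytes


def list_node_bytes(pointer_bytes: int, payload_bytes: int = 4) -> int:
    """Size of a node {next pointer, payload}, rounded up to pointer alignment."""
    raw = pointer_bytes + payload_bytes
    return (raw + pointer_bytes - 1) // pointer_bytes * pointer_bytes


if __name__ == "__main__":
    for label, ptr in (("32-bit", 4), ("64-bit", 8)):
        node = list_node_bytes(ptr)
        print(
            f"{label}: {pointers_per_cache_line(ptr)} pointers per line, "
            f"{node}-byte list node, {CACHE_LINE_BYTES // node} nodes per line"
        )
```

Under these assumptions a 32-bit build fits twice as many pointers, and twice as many of these small nodes, into each cache line, which is the effect the comment describes.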

      1. That’s apparently what you get when building for armv8l IIRC, which is used when you build code for Cortex A32. I tried this a year ago with my compilers and got roughly the same code as ARM mode in aarch32; I couldn’t match thumb2 in code density.

      2. Since the 16 -> 32 bit transition, I’ve always thought that much of UNIX / Linux userland should be compiled with the more modest model and only the performance or memory-eating programs should be compiled with the wider model. But I’ve been too lazy to do this.

        I seem to remember that on 64-bit Power and SPARC this approach is used. That’s partly because the code density went way down. On X86 and (I think) Arm, the transition was accompanied by a major improvement in the instruction set and so code density didn’t take such a big hit.

        I would guess that ARMv8l is a big improvement over ARMv7. Too bad about thumb2. I don’t know if there is a blessed variant with 64-bit registers and instructions but where they are only used for long long. I don’t even know if that would be a win.

        1. > I’ve always thought that much of UNIX / Linux userland should be compiled with the more modest model

          That’s exactly what we used to do on our load balancers in the past: kernel+haproxy were 64 bits and the rest was 32. But these days you have to share many libraries, forcing to have them in both versions, taking twice the space. So while we used to do that to save on the porting effort, it actually resulted in more space usage and code loaded in memory (two libc, two openssl etc). All this to say that it’s not always the best solution.

  6. It just adds the confusion, over performance in real home media use, the rpi, hot tart 4 with heat sink, fan, case and remote sits.

    I will get flamed, but for home media player with storage that won’t throttle after 60 mins playing 4 k files.

    This better when no gpio needed and in a 32 bit OS.

    1. Since my link got removed. This Magicsee N5 Plus Android 9.0 TV Box 8K HDR Ultra-HD Video with Amlogic S905X3 4GB RAM 64GB ROM Dual-band WiFi USB 3.0 4TB HDD / SSD Hard Drive Expansion
      On sale at gearbest for less than £47.00. inc p&p, case, power supply. remote, hdmi cable and antenna.

        1. Jean-Luc, can you please remove this voting crap on your blog entirely? It already sucks to be confronted with all those TV box advertisements by the theclownuk but him always whining about others not happy with his ad links really adds to the mess.

          1. Au contraire, let’s keep it — I find the random downvotes a good indicator of the level of butthurt going around.

          2. Not ad links at all, real life value comparisons and alternatives.
            It is rich coming from you with all your hardware and software promotion, so you get discounts.
            I still remember Steven paid you for software support of Orange pi. Which hardware you then promoted like mad! Users who believed you and bought the hardware found your software buggy and unreliable.

            Thank heaven for Armbian and Friendlyelec for the usable software support for Friendlyelec boards.

          3. You seek a credibility epeen measuring contest with tkaiser? On a rando nerd/geek hardware forum on some backwater of the internet?

            Step back and look at what you are saying.

            Your point that specific ewaste android boxes MAY be better vid players for some short window before you trash them and get another is getting lost in your desire to measure dicks.

            And to a crowd that does not think of genitals as a unit of measurement? Who are you playing to?

          4. How could ThreadX anyone possibly argue ThreadX with TKaiser? I’ve heard ThreadX on the grapevine that ThreadX he knows ThreadX the name of the real-time OS ThreadX used by the Raspberry Pi and ThreadX likes to name drop it regularly, almost randomly. However, that name currently escapes me.

      1. This magicsee set top box is not comparable to PI4: it only has a 100MBit ethernet, while the PI4 has 1000Mbps, better for use as a PC and also in some mediaplayer use cases.

        1. Not so. As a home media player, 100M Ethernet is enough.

          Many on audio Visual forums are dumping PC Nas and Blu-ray rips for Android box with built in codecs for video and audio. HDR, 4K video playback and Android app support.

          1. Do these Android boxes support the latest Bluray lossless audio, Dirac, room eq, Atmos? H265 isn’t sufficient.

          2. Many,
            Does hot tart rpi 4 support Android, Android apps Amazon, iTunes, Rakuten, Sling, Hulu, the CW, Tubi, Stirr, Pluto TV official apps? Does hot tart 4 rpi come with an integrated remote control, a case and power supply, plus p&p included in the purchase price?
            Is hot tart rpi4 recognised by Netflix for 4K playback?

  7. It’s fun to see heated discussions with huge upvotes/downvotes like every single time there’s an RPi article. At least it proves these articles drain some traffic to the site, which is great 🙂 Yeah, please start downvoting me as well you gentle RPi fanboys, just to see how far we can go 🙂

    1. It just goes to show the RPi is the only SBC worth talking about. All of the competing SBCs exist in a social vacuum, and due to limited software support are destined to become e-waste much sooner than any RPi board so who – other than eco-warriors – really cares about them?

      And if there weren’t any more RPi articles, tkaiser would have to find a new audience to bore to death.

      1. > It just goes to show the RPi is the only SBC worth talking about

        No it’s even the opposite. When there are discussions about other ones here, the vendors take note of good ideas and improve future models. Look how good the VIMs, NanoPis, RockPis, Odroid, Librecomputer’s and whatever good ones have become by taking comments into account; their designers participate in the discussions here and at other places. *These* are worth talking about because there’s a hope to improve them, and of course sometimes comments are wrong but who cares. RPi doesn’t care a single second about comments on their products, otherwise it wouldn’t have taken them as much as 8 years before arriving at a decent one for the price. However fanboys do care a lot about what is being said about the latest device they’ve put their money on, it’s only a matter of ego, which is very well reflected in these high vote counts. And I’m absolutely certain that a number of them only comment on RPi posts and are never seen in other discussions, which is another indication of their feeling that some comments hurt them directly.
