Linux Benchmarks – Intel J3455 Apollo Lake vs Z3735F Bay Trail vs RK3399 and Other ARM Platforms

Since I’ve just installed Ubuntu 17.10 on MeLE PCG35 Apo, I decided I should also run some benchmarks comparing it with other ARM and x86 Linux platforms I’ve tested in the past. I was particularly interested in comparing the performance of Intel Apollo Lake processors (Celeron J3455 in this case) against higher-end ARM processors like Rockchip RK3399 (2x A72, 4x A53), since the systems have a similar price (~$150+), as well as against the older Bay Trail processor to see the progress achieved over the last 2 to 3 years.

To do so, I used the Phoronix Test Suite to compare against the Videostrong VS-RK3399 results (RK3399 development board):


The benchmark first issued a warning about the “powersave” governor, but I still went ahead, and once it completed I changed it to the “performance” governor:


…and ran the tests again. All results are available on OpenBenchmarking.

Let’s address the governor results first. cpufreq-info reports that the powersave governor can also switch between 800 MHz and 2.30 GHz (turbo frequency).
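For readers without cpufreq-info installed, the same information can usually be read straight from the Linux cpufreq sysfs files. The snippet below is a minimal sketch, not the method used for this article: the function names are made up for illustration, and it assumes the standard /sys/devices/system/cpu layout, where frequencies are stored in kHz.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location; frequencies in these files are in kHz.
SYSFS_CPU_ROOT = "/sys/devices/system/cpu"
FIELDS = ("scaling_governor", "scaling_min_freq", "scaling_max_freq")


def read_cpufreq(cpu=0, sysfs_root=SYSFS_CPU_ROOT):
    """Return the governor and frequency limits for one core.

    Hypothetical helper: missing files (e.g. no cpufreq support)
    are reported as None instead of raising.
    """
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpufreq"
    info = {}
    for name in FIELDS:
        path = base / name
        info[name] = path.read_text().strip() if path.is_file() else None
    return info


def khz_to_mhz(value):
    """Convert a kHz string from sysfs to MHz, passing None through."""
    return None if value is None else int(value) / 1000.0


if __name__ == "__main__":
    info = read_cpufreq(0)
    print("governor:", info["scaling_governor"])
    print("min MHz:", khz_to_mhz(info["scaling_min_freq"]))
    print("max MHz:", khz_to_mhz(info["scaling_max_freq"]))
```

On a system matching the description above, one would expect a minimum of 800 MHz and a maximum of 2300 MHz, whichever governor is active.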


As we’ll see from the results below, which pit “MeLE PCG35 Apo – Ubuntu 17.10” (with powersave) against “MeLE PCG35 Apo- Ubuntu 17.10 Performance”, the governor setting did not matter one bit, at least for the six benchmarks I ran.

Note that “MeUbuntu 14.04.3” represents the MeLE PCG02U TV stick running Ubuntu 14.04.3. Every platform runs a different OS and kernel, so keep in mind the results may differ slightly (up or down) with different versions. But as we’ll see, the differences in performance are large enough that it likely does not matter that much.

John the Ripper password cracker, a multi-threaded benchmark, shows the Apollo Lake processor clearly ahead of the Rockchip RK3399 hexa-core processor, while the fastest ARM platform, Banana Pi M3, is equipped with an Allwinner A83T octa-core Cortex A7 processor @ 2.0 GHz. The Bay Trail system is over twice as slow as the Apollo Lake one; also note the largish standard deviation (+/- 83.72), due to some cooling problem in the small form factor.

C-Ray is another multi-threaded benchmark, and here the Rockchip RK3399 SoC does fairly well, but still not quite as well as the Celeron J3455.

Smallpt, yet another multi-threaded benchmark, does not really change the order, with MeLE PCG35 Apo well ahead.

Himeno, a linear solver of the pressure Poisson equation, must be using some x86-specific instructions or optimizations, as Intel platforms are well ahead, with the Celeron J3455 about 2.5x faster than the Rockchip RK3399 board.

OpenSSL is the domain of Intel platforms, likely benefiting from the Advanced Encryption Standard instruction set (AES-NI). The performance improvement between Bay Trail and Apollo Lake is also impressive here. You’d need 10 Raspberry Pi 3 boards to match MeLE PCG35 Apo in this particular test.


Intel is normally better with SIMD-accelerated multimedia applications, and FLAC audio encoding (single-threaded) confirms that.

I was expecting a close fight between Rockchip RK3399 and Celeron J3455, but RK3399 only has two fast Cortex A72 cores against four x86 cores in the Intel Apollo Lake SoC.

 

68 Comments
tkaiser
2 years ago

Two small remarks:

1) On Intel systems with the intel_pstate driver running (Sandy Bridge or above and obviously a somewhat recent kernel), powersave and performance are expected to give similar results in synthetic benchmarks (since CPU clockspeeds ramp up fast enough). But on other platforms, depending on kernel version, powersave vs performance can result in huge performance differences (since powersave can lead to the CPU cores remaining on the lowest allowed clockspeed all the time, while performance chooses the maximum as on Intel).

2) OpenSSL performance and AES-NI: if we’re talking about AES here then ARMv8 SoCs with AES crypto extensions (like RK3399…

willy
2 years ago

For now I’ve been quite disappointed by RK3399’s performance. I managed to make my build farm run on the H96max at 1.8 GHz (A72) + 1.4 GHz (A53), and at 1.8 GHz, the A72 shows slightly lower performance than the RK3288 on the MiQi (ie 16.2 sec instead of 16.0 for a build). And that’s the best I can take out of it, it’s in 32-bit (armv7t or armv8l). In 64-bit (armv8), it’s about 10-15% slower for the same task. It was said that the A72’s architecture was very close to the A17, the former with 3-decode and 5-issue, the…

blu
2 years ago

@willy
Those settop boxes traditionally have notoriously bad thermal design, and a 28nm A72 can get really hot at near-2GHz — have you monitored the throttling?

Manuel
2 years ago

Why not add a comparative about power consumption? It would be interesting…

Thanks.

theguyuk
2 years ago

Unless it is like for like hardware the results have no place in the real world.

Most TV boxes are used to watch TV, and comparing a board costing £30 or less against a £130+ Intel J3455 Apollo Lake is like racing a child’s pedal bike against an F1 racer. Funny, but no value.

How about benchmarking an Xbox One against an Intel J3455 Apollo Lake? It would have just as much real-world use.

willy
2 years ago

@blu Yes absolutely, and it’s really at 1.8 GHz while running such tests, and doesn’t throttle. BTW the advertised 1.5 and 2.0 GHz frequencies are not even part of the frequencies list at all, that’s the beauty of “up to”, it probably means that it can go “up to 2.0 GHz if you hack your DTB”. Honestly the perf is not bad at all and it heats less than the 3288 at the same performance level, it just doesn’t have enough cores to compete with it. The RK3288 has 4 strong cores, but the 3399 has only two, and the…

bob
2 years ago

ARM is not power efficient, it’s just cheap in some TV boxes:

$50 A53 x4, CPU burn = 5 W
$200 Intel i3-5005U, CPU burn = 15-20 W
Intel is 10x more powerful in benchmarks, so about 3x more efficient in benchmark points per watt.
Conclusion: expensive ARM is a no-go.
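The arithmetic behind that “3x” figure can be spelled out in a few lines. This is only a sketch that takes the comment’s own inputs at face value (a claimed 10x benchmark advantage, 5 W versus 15-20 W); the function name is invented for illustration, and the 10x claim itself is disputed in later replies.

```python
def relative_efficiency(perf_ratio, watts_fast, watts_slow):
    """Performance per watt of the faster system relative to the slower one.

    perf_ratio is how many times faster the fast system is (claimed, not
    measured here); the wattages are the figures quoted in the comment.
    """
    power_ratio = watts_fast / watts_slow
    return perf_ratio / power_ratio


if __name__ == "__main__":
    # Claimed 10x performance at 15 W and at 20 W, against a 5 W ARM box.
    for watts in (15.0, 20.0):
        ratio = relative_efficiency(10.0, watts, 5.0)
        print(f"{watts:.0f} W: {ratio:.2f}x more benchmark points per watt")
```

With these inputs the advantage lands between 2.5x and about 3.3x, so the result is only as good as the assumed 10x performance gap.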

tkaiser
2 years ago

@willy
I don’t get it. Your workload scales horizontally? And 2 x A72 are as fast as 4 x A17?

theguyuk
2 years ago

@bob
The real fact is that most Arm SoCs spend most of the time not running at the claimed top speed; they can only run very short bursts.

Even an over-designed cooler cannot help.

blu
2 years ago

@bob
There is no quad-core A53 that a 5005U would be 10x as powerful as: the average IPC difference between Broadwell and A53 is approx 2-2.5 in favor of the former. There might be isolated cases where AVX2-centric scenarios can produce a 10x difference, but how common are they?

btw, 410c is a severely underclocked SBC — its normal ‘stock-cooling’ clock is ~600MHz IIRC.

bob
2 years ago

@theguyuk
Right about the idle/low-use CPU needs. That’s why I don’t understand ARM’s strategy of not providing HW encoder/decoder drivers for good power efficiency… it’s a shame for their business.

And maybe we are missing a source code compilation benchmark, because I don’t think my little H3 compiles 10x slower than my Intel CPU…

Jonathan
2 years ago

What are the RAM speeds and configurations on these boxes? I suspect that may tell as much or more of the story, especially if the benchmark is blowing out the icache on the ARMs.

tkaiser
2 years ago

@Jonathan
The Phoronix test suite as well as the results on openbenchmarking.org should be considered useless garbage especially when trying to compare different platforms in the way it’s done here. Most of the ‘benchmarks’ use questionable or no optimization settings at build time so results vary a lot with the OS used and the platform it’s running on (compiler shipped and default settings).

I’ve seen some of these ‘benchmarks’ performing 3 times faster on the same hardware just by exchanging the OS and thereby default compiler and settings (switching from GCC 4.x to 6.x for example).

RooTer
2 years ago

@theguyuk
When thinking about a (relatively) low-power server, I actually do wonder whether I should go with ARM or x86, and such a benchmark did make sense to me.
You can argue like other commenters that the metric used was not the best, but the idea itself? No. Look at the scores, they are clearly comparable, especially if you think about the price involved for the performance.

tkaiser
2 years ago

RooTer :
@theguyuk
When thinking about a (relatively) low-power server, I actually do wonder whether I should go with ARM or x86, and such a benchmark did make sense to me.

Which kind of server task(s) could be represented by the collection of synthetic benchmarks above? Really curious 🙂

Eversor
2 years ago

@bob
That may be true for these cheap implementations of A53 cores. Have a look at Nvidia Jetson TX2 benchmarks. Phoronix did some a good while back. Same is true for Apple’s cores.

m][sko
2 years ago

My results for Hikey 960
https://pastebin.com/twarU4eC

tkaiser
2 years ago

@m][sko
Can you please provide output from

blu
2 years ago


Those benchmarks are extremely sensitive to compiler version and flags. One wrong flag and/or compiler version can make all the difference. It’s essential when doing synthetic benchmarking to know very well what a given compiler produces from the sources (e.g. ‘Hey, it’s not generating AES instructions!’, ‘Hey, it produced this snafu in the inner-most loop!’, etc)

raul
2 years ago

Arm boards for the general use case have lost their initial promise. Lack of support from Arm, lack of proper software support from vendors, and the throttling and performance have left the promise in tatters. And the cost of higher performing boards, even the 3399, takes one into low power lower priced Intel platforms, at which point it’s game over as the software support on Intel just cannot be compared. But for boards like the Pi or Odroid with better software support, at least things like a low power HTPC make sense. And if the SOC provider decides to provide…

willmore
2 years ago

@raul
I would say that there are vendors who don’t suffer the problems you mention. ODROID from Hardkernel, for one, provides boards without cooling problems. They support their boards with software.

tkaiser
2 years ago

blu : Those benchmarks are extremely sensitive to compiler version and flags. One wrong flag and/or compiler version can make all the difference.

And that’s the main reason no one with a brain in his head should use this terrible Phoronix crap. I used Jean-Luc’s last results to rely on and tested on a ROCK64: https://openbenchmarking.org/result/1710279-TY-1710254TY63

According to the Phoronix results ROCK64 shows just 41% OpenSSL performance compared to the octa core Banana Pi M3 and less than 9 percent compared to the Apollo Lake box (obviously the latter making use of AES-NI here). So obviously the Phoronix ‘benchmark’ uses…

tkaiser
2 years ago

@Jean-Luc Aufranc (CNXSoft) Thank you. So on x64 we’re talking about:

openssl speed rsa4096 -multi 4:
Phoronix: 201.8 sign/s and 12934.5 verify/s
Ubuntu 17.10: 160.4 sign/s and 10645.9 verify/s

openssl speed -elapsed -evp aes-128-cbc:
Phoronix: 436592.26k (16 bytes) 598267.22k (8KB)
Ubuntu 17.10: 415316.34k (16 bytes) 571697.83k (8KB)

With both tests the PTS binary scores even better than Ubuntu’s (but on Intel totally…
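To put a number on “scores even better”, the quoted figures can be turned into percentage differences. The sketch below only restates the numbers from this comment; the dictionary layout and the function name are illustrative choices, not anything PTS or OpenSSL provides.

```python
# (PTS-built binary, Ubuntu 17.10 binary) pairs, as quoted in the comment above.
RESULTS = {
    "rsa4096 sign/s": (201.8, 160.4),
    "rsa4096 verify/s": (12934.5, 10645.9),
    "aes-128-cbc 16 bytes (k)": (436592.26, 415316.34),
    "aes-128-cbc 8KB (k)": (598267.22, 571697.83),
}


def percent_faster(pts, distro):
    """How much faster the PTS-built binary is than the distro one, in percent."""
    return (pts / distro - 1.0) * 100.0


if __name__ == "__main__":
    for name, (pts, distro) in RESULTS.items():
        print(f"{name}: PTS binary {percent_faster(pts, distro):+.1f}%")
```

On this x86 box the PTS binary comes out roughly a quarter faster for RSA and about 5% faster for AES.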

blu
2 years ago

@tkaiser
Nice catch. Yes, that’s part of what I was talking about — in this particular case what phoronix has produced is literally apples to shotshells.

theguyuk
2 years ago

I know lots will think I am nuts for suggesting it, but if you want TV boxes and ARM SBC to improve, people need to be encouraged to play more 3D games on them.

Just look how demanding 3D games have pushed Game Console and Gaming PC hardware. The same market can push ARM hardware.

Does the average office really need an i7 or i5 to write letters etc.? No.

Network gear is a different market same as financial data, Movie editing etc

willy
2 years ago

@tkaiser
No Thomas, I meant 2*A72 are exactly as fast as 2*A17, or about half as fast as 4*A17. Yes the build workload scales reasonably well (provided the memory bandwidth is there and the number of files to build is high enough and these files are about the same size). So for this workload, the RK3288 is almost twice as fast as the RK3399 just because it has twice the number of really usable cores.

willy
2 years ago

@bob Power efficiency doesn’t compare like this because it decreases with peak performance. You need to compare the power drawn by a device capable of achieving a given peak performance. For example, my RK3288, while less power efficient than my Atom 8350, is significantly faster (15-20%). These two are the only ones in this thermal envelope capable of delivering more or less comparable performance. If I want higher performance on x86, I have to seek a significantly larger design (a much faster one) which will eat more power than the RK3288. It might be a bit more power efficient but…

theguyuk
2 years ago

I always understood that the main trick Intel use, is to do more things in memory, and have as much of the data as you can, as close to the CPU cores as possible.

Bigger faster cache and faster memory?

willy
2 years ago

@theguyuk That’s true for high end CPUs, where the cache algorithms are very advanced, allowing almost instant access to any memory location. But having compared memory performance between a few Atoms and the RK3288 in dual-channel configuration shows a significant difference in favor of the latter on in-cache and RAM patterns. One feature helping x86 CPUs is the trace cache replacing the instruction cache. The principle is to store decoded and fused instructions, saving a few pipeline stages for hot code paths. This also allows some old CISC instructions (eg: “rep movs” for memcpy()) to be expanded to very efficient…

wzyy2
2 years ago

RK3399 is more like Celeron N series, since Celeron J series usually have a higher power consumption.

tkaiser
2 years ago

@blu Well, almost all of those Phoronix ‘benchmarks’ are totally broken since how they’re built and executed is not with providing useful numbers in mind but only focusing on ‘portability/compatibility’ so that PTS users being tricked into believing they would do serious benchmarking get all those tests compiled on their hosts regardless of platform, OS and environment. Isn’t it really ‘funny’ when I’m interested in encryption performance of an inexpensive ARM thingie and compare with a way more expensive Intel box that Phoronix is telling me the ARM thing would be 11 times slower since the ‘test suite’ only produces…

blu
2 years ago

@willy
Re x86 i-cache: Atoms, at least up to Silvermont, don’t have either uop cache or trace cache (the latter was something found in Netburst; the former appeared in SNB). The only i-cache optimisation employed by Atoms (read: Silvermonts) is op boundary marks in the i-cache, and thus Atoms are often decode-bound (one of the reasons the Cortexes fare so well against them).

tkaiser
2 years ago

@Jean-Luc Aufranc (CNXSoft) Yes, PTS is using $some defaults in ‘fire and forget’ mode. Sometimes build settings are hard-coded in the makefile, sometimes they rely on what the environment defines (so if for example EXTRAOPTFLAGS='-O3' is set in /etc/profile, some ‘benchmarks’ will show magnitudes better scores later, e.g. the infamous Smallpt stuff), and sometimes they just use what the operating system’s default compiler does for whatever reasons. Please think about it again: what PTS reports as ‘OpenSSL’ performance is RK3328 being 11 times slower than the J3455. While with another benchmark and when using a sanely built openssl binary the ARM SoC easily outperforms…

Danand
2 years ago

If you are interested: This is how it looks on an ODROID XU4 (Exynos5422 Cortex™-A15 2Ghz and Cortex™-A7 Octa core CPU) on Ubuntu 16.04.3 LTS – Kernel 4.13.0

# openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 14966032 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 4189907 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 1104424 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 279763 aes-128-cbc's in 3.00s
Doing aes-128-cbc for…
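The “Doing … size blocks” lines above can be converted into the throughput figures that openssl speed prints in its summary table, which are in thousands of bytes per second. The parser below is a rough sketch with an invented function name; it only handles the line format shown in this comment and embeds the four complete lines as sample input.

```python
import re

# Matches lines such as:
#   Doing aes-128-cbc for 3s on 16 size blocks: 14966032 aes-128-cbc's in 3.00s
LINE_RE = re.compile(
    r"Doing (?P<algo>\S+) for \d+s on (?P<size>\d+) size blocks: "
    r"(?P<count>\d+) \S+ in (?P<secs>[\d.]+)s"
)

# The four complete lines quoted in the comment above (ODROID XU4).
SAMPLE = """\
Doing aes-128-cbc for 3s on 16 size blocks: 14966032 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 4189907 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 1104424 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 279763 aes-128-cbc's in 3.00s
"""


def throughput_k(text):
    """Map block size to throughput in thousands of bytes per second."""
    result = {}
    for match in LINE_RE.finditer(text):
        size = int(match["size"])
        count = int(match["count"])
        secs = float(match["secs"])
        # operations * bytes per operation / elapsed seconds, scaled to 'k'
        result[size] = count * size / secs / 1000.0
    return result


if __name__ == "__main__":
    for size, rate in throughput_k(SAMPLE).items():
        print(f"{size:>5} bytes: {rate:.2f}k")
```

For these four lines the result stays below 100000k at every block size, which is the range to keep in mind when comparing against SoCs with ARMv8 crypto extensions.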

nobitakun
2 years ago

I’m sorry but talking about efficiency comparing two different fabs is nonsense, each platform is designed for what they are, everything said.

tkaiser
2 years ago

Danand : This is how it looks on a ODROID XU4

Same numbers as in the referenced link in my first comment above (we tried to collect numbers from A7, A9, A15, A17, A53, with and without ARMv8 crypto stuff, and A72. Currently A17 and A72 still missing). With RSA signing the Exynos doesn’t look that bad but when it comes to AES encryption the SoC is far behind compared to those SoCs with ARMv8 crypto extensions, both considering ‘raw performance’ and especially ‘performance per Watt’. An ODROID XU4/HC1 doing AES stuff on the CPU cores compared to…

blu
2 years ago

@tkaiser
Let me know if and what A72 benchmarking you need – I have one macchiatobin idling here.

tkaiser
2 years ago

@blu
Would be great to get full output from

The compiler and version the binaries were built with are important of course. And if you can, results from github.com/ssvb/tinymembench would also be great.

willy
2 years ago

@tkaiser On H96-Max, I get this on the 2 A72@1.8 GHz :

# openssl version -a
OpenSSL 1.0.2k 26 Jan 2017
built on: reproducible build, date unspecified
platform: linux-aarch64
options: bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) idea(int) blowfish(ptr)
compiler: aarch64-gcc47l_glibc218-linux-gnueabi-gcc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -O3 -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
OPENSSLDIR: "/usr/share/openssl"

(built on linaro gcc-4.7.4, not rebuilt since).

# taskset -c 4,5 openssl speed rsa4096 -multi 2
                  sign    verify    sign/s verify/s
rsa 4096 bits 0.019186s 0.000257s     52.1   3887.3

And this on the 4 A53 cores at 1.4 GHz :

# taskset -c 0-3 openssl speed rsa4096 -multi 4…

blu
2 years ago

Model: MACCHIATOBin-8040
Clock: CPU 1300 [MHz]
DDR 800 [MHz]
FABRIC 800 [MHz]
MSS 200 [MHz]

macchiato:~$ ./gov.sh # columns: core governor clock
0 performance 1300000
1 performance 1300000
2 performance 1300000
3 performance 1300000

tinymembench: https://pastebin.com/bg1Wha41
openssl: https://pastebin.com/WRnVa07L
7z: https://pastebin.com/Z6RFZLkh

dogs breath
2 years ago

The benchmark we want to see is Huawei’s fastest versus Apple’s latest.

tkaiser
2 years ago

@blu Thank you! This is DDR4, right? As a quick reference: Jean-Luc’s tinymembench numbers for RK3399 from a few weeks ago. And the 7-zip numbers are the only ‘generic CPU’ benchmark results I need for my use cases (server stuff). A72 with ARMv8 crypto extensions performs really impressively, this time also with very small chunks of data (the A53 implementation suffers here somewhat).

@willy I put your AES encryption numbers and the Celeron results together:

type       16 bytes     64 bytes    256 bytes   1024 bytes
AES-NI   436592.26k   549185.24k   554032.38k   588775.08k
A72      465434.04k   938648.81k  1221223.85k  1300021.93k
A53      170812.96k   464463.87k   783981.14k   982121.81k
…
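The table is easier to read as ratios against the Celeron’s AES-NI row. The following sketch simply divides the quoted figures; the variable and function names are illustrative, and it says nothing about clock speeds or how the binaries were built.

```python
BLOCK_SIZES = (16, 64, 256, 1024)

# aes-128-cbc throughput in thousands of bytes per second, as quoted above.
THROUGHPUT_K = {
    "AES-NI": (436592.26, 549185.24, 554032.38, 588775.08),
    "A72": (465434.04, 938648.81, 1221223.85, 1300021.93),
    "A53": (170812.96, 464463.87, 783981.14, 982121.81),
}


def ratios_vs(baseline, table=THROUGHPUT_K):
    """Return each row divided element-wise by the baseline row."""
    base = table[baseline]
    return {
        name: tuple(value / ref for value, ref in zip(row, base))
        for name, row in table.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, row in ratios_vs("AES-NI").items():
        cells = "  ".join(
            f"{size}B: {ratio:.2f}x" for size, ratio in zip(BLOCK_SIZES, row)
        )
        print(f"{name}: {cells}")
```

By this measure the A72 is slightly ahead of the Celeron at 16 bytes and more than twice as fast from 256 bytes up, while the A53 only overtakes it at the larger block sizes.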

blu
2 years ago

@tkaiser
Yep, DDR4 at the lowest rate (1600) — board and DIMM are capable of 2400, but I keep mine at the lowest setting for minimal active cooling (side-mounted 80mm silent fan).

On an unrelated note, Teres-A64 acquired! ; ]

Anonymous
2 years ago

@tkaiser

Maybe file a bug against PTS instead of just complaining on a random article on another site?

PTS is mostly used/developed on x86_64 so they probably aren’t even aware it doesn’t work properly (much slower than distro defaults) on arm.
