Qualcomm Centriq 2400 ARM SoC Launched for Datacenters, Benchmarked against Intel Xeon SoCs

Qualcomm Centriq 2400 ARM Server-on-Chip has been four years in the making. The company announced sampling in Q4 2016 using 10nm FinFET process technology with the SoC featuring up to 48 Qualcomm Falkor ARMv8 CPU cores optimized for datacenter workloads. More recently, Qualcomm provided a few more details about the Falkor core, fully customized with a 64-bit only micro-architecture based on ARMv8 / Aarch64.

Finally, here it is as the SoC formally launched with the company announcing commercial shipments of Centriq 2400 SoCs.

Qualcom Centriq 2400 key features and specifications:

CPU – Up to 48 physical ARMv8 compliant 64-bit only Falkor cores @ 2.2 GHz (base frequency) / 2.6 GHz (peak frequency)
Cache – 64 KB L1 instructions cache with 24 KB single-cycle L0 cache, 512 KB L2 cache per duplex; 60 MB unified L3 cache; Cache QoS
Memory – 6 channels of DDR4 2667 MT/s for up to 768 GB RAM; 128 GB/s peak aggregate bandwidth; inline memory bandwidth compression
Integrated Chipset – 32 PCIe Gen3 lanes with 6 PCIe controllers; low speed I/Os; management controller
Security – Root of trust, EL3 (TrustZone) and EL2 (hypervisor)
TDP – < 120W (~2.5 W per core)

The SoC is ARM SBSA v3 compliant, meaning it can run any compliant operating systems without having to resort to “cute embedded nonsense hacks“. The processor if optimized for cloud workloads, and the company explains the SoC are already been used demonstrated for the following tasks:

Web front end with HipHop Virtual Machine
NoSQL databases including MongoDB, Varnish, Scylladb
Cloud orchestration and automation including Kubernetes, Docker, metal-as-a-service
Data analytics including Apache Spark
Deep learning inference
Network function virtualization
Video and image processing acceleration
Multi-core electronic design automation
High throughput compute bioinformatics
Neural class networks
OpenStack Platform
Scaleout Server SAN with NVMe
Server-based network offload

Three Qualcom Centriq 2400 SKUs are available today

Centriq 2434 – 40 cores @ 2.3 / 2.5 GHz; 50 MB L3 cache; 110W TDP
Centriq 2452 – 46 cores @ 2.2 / 2.6 GHz; 57.5 MB L3 cache; 120W TDP
Centriq 2460 – 48 cores @ 2.2 / 2.6 GHz; 60 MB L3 cache; 120W TDP

Qualcomm Centriq 2460 (48-cores) was compared to an Intel Xeon Platinum 8160 with 24-cores/48 threads (150 W) and found to perform a little better in both integer and floating point benchmarks.

The most important metrics for server SoCs are performance per thread, performance per watt, and performance per dollars, so Qualcomm pitted Centriq 2460, 2452 and 2434 against respectively Intel Xeon Platinum 8180 (28 cores/205W TDP), Xeon Gold 6152 (22 cores/140W TDP), and Xeon Silver 4116 (12 cores/85W TDP). Performance per watt was found to be significantly better for the Qualcomm chip when using SPECint_rate2006 benchmark.

Performance per dollars of the Qualcomm SoCs look excellent too, but…

Qualcomm took Xeon SoCs pricing from Intel’s ARK, and in the past prices there did not reflect the real selling price of the chip, at least for low power Apollo Lake / Cherry Trail processors.

This compares to the prices for Centriq 2434 ($880), Centriq 2452 ($1,373), and Centriq 2460 ($1,995).

Qualcomm also boasted better performance per mm2, and typical power consumption of Centriq 2460 under load of around 60W, well below the 120W TDP. Idle power consumption is around 8 watts using C1 mode, and under 4 Watts when all idle states are enabled.

If you are wary of company provided benchmarks, Cloudflare independently tested Qualcomm Centriq and Intel Skylake/Broadwell servers using Openssl speed, compression algorithms (gzip, brotli…), Go, NGINX web server, and more.

Usually, Intel single core performance is better, but since ARM has more cores, multi-threaded performance is often better on ARM. Here’s their conclusion:

The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive.

The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% better performance.

Obviously the Skylake server we tested is not the flagship Platinum unit that has 28 cores, but those 28 cores come both with a big price and over 200W TDP, whereas we are interested in improving our bang for buck metric, and performance per watt.

Currently my main concern is weak Go language performance, but that is bound to improve quickly once ARM based servers start gaining some market share.

Both C and LuaJIT performance is very competitive, and in many cases outperforms the Skylake contender. In almost every benchmark Falkor shows itself as a worthy upgrade from Broadwell.

The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the go benchmark). In comparison Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.

Back to software support, the SoC is supported by a large ecosystem with technologies such as memcached, MongoDB, MySQL, …, cloud management solutions such as Openstack and Kubernetes, programming languages (Java, Python, PHP, Node, Golang…), tools (GVV/ LLVM, GBD…), virtualization solution including KVM, Xen and Docker, as well as operating systems like Ubuntu, Redhat, Suse, and Centos.

Qualcomm is already working on its next generation SoC: Firetail based on Qualcomm Saphira core. But no details were provided yet.

Thanks to David for the links.

Jean-Luc Aufranc (CNXSoft)

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.

Share this:

Support CNX Software! Donate via cryptocurrencies, become a Patron on Patreon, or purchase goods on Amazon or Aliexpress. We also use affiliate links in articles to earn commissions if you make a purchase after clicking on those links.

13 Replies to “Qualcomm Centriq 2400 ARM SoC Launched for Datacenters, Benchmarked against Intel Xeon SoCs”

Performance per watt is where ARM have been excelling so far, so nothig extraordinary here. As long as the single-thread performance is good enough ™, many server workloads care more about throughput.

If Qualcomm can make a board with this chip, 1 dimm per channel, 4 x4 NVMe slots, and 2 x8 PCIe slots, they’ll have my money.

Just give me minimal firmware and a serial port, and no extraneous bullshit.

@noone
Seconded. I want something in this space, even if it seems “pricey” in comparison to the embedded stuff we’re normally used to.

What an amazing read Cloudfare’s benchmark test is. Kudos to Vlad Krasnov! Very very well done pointing out what the numbers mean, why they are as they are and how they could/will improve.

@tkaiser
They have a follow up article about Intel performance only. The CPU cores are clocked lower when some crypto instructions (e.g. AVX-512) are used, so they may actually impact the system performance negatively (depending on your load).
https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

@cnxsoft
That’s a well-known phenomenon on Intel cores since AVX2 and AVX512 only exacerbated things. If your code has predominantly SIMD portions then you win from the SIMD throughput gains. But:
1) on recent AVX architectures core clocks are dropped when AVX is encountered — recent intels just cannot run AVX at their stock clocks.
2) on AVX512 the SIMD engines are kept dormant, so upon encountering AVX512 code the SIMD engines are first fired up, and then clocks are dropped.

So it’s perfectly possible to have code where the throughput gain from limited use of AVX/AVX512 is insufficient to offset the increased latencies from the need to fire up engines first (the smaller detriment) and to downclock entire pipelines (the larger detriment), so you end up with deceleration from using AVX/AVX512.

@blu
Do ARM processors exhibit the same behavior for some SIMD instructions, or is it specific to x86?

@cnxsoft
Not on any of the ARMs I’ve encountered, but we shall see how things develop with the new SVE (scalable vector extension), where vectors can go as wide as 2048 bits. On Intel it started with AVX2 on Haswell, though it’s not necessarily linked to AVX2 per se, as much as to AVX256 workloads and Intel’s push for ever higher clocks. Basically, very high clocks or very wide throughput — you cannot have both; the ever-lasting dilemma of latency vs throughput.

@blu

It’s not the dilemma of latency vs. throughput, it’s the dilemma of the thermal envelope.

@slackstick
The dilemma (an imposed choice between two or more alternatives) is latency vs throughput; what brought to it in this case was beyond my point (yes, it’s usually thermal envelope and product costs).

How do they meet the cooling needs on these at present?

@theguyuk

Under-clocking for the most part.

@blu

You get both latency and throughput 10 times over per watt with DSPs. Won’t be long before someone figures out how to get it done for general compute.

They are working on improving Go crypto performance on arm64 based on disappointing benchmark results reported by Cloudflare:

Specifically, the work is to be done to address performance issues in the Go crypto libraries, which to date have been extensively optimized for x86 processors but not yet for arm64. The algorithms in the Go librares crypto/ecsda, crypto/rsa, crypto/aes, and crypto/chacha20poly1305 are targeted for improvement.

The Go source tree had a feature freeze for Go 1.10 on November 1, 2017, which means that the earliest that these issues will be addressed is likely to be in Go 1.11.

https://github.com/golang/go/issues/19840#issuecomment-345513189
https://github.com/golang/go/issues/22806 – ECDSA
https://github.com/golang/go/issues/22807 – RSA
https://github.com/golang/go/issues/22808 – AES
https://github.com/golang/go/issues/22809 – CHACHA
https://github.com/golang/go/milestone/62 – Go 1.11 milestone

Source: https://www.worksonarm.com/blog/woa-issue-30/

Boardcon LGA3576 Rockchip RK3576 System-on-Module designed for AI and IoT applications

blu says:

November 9, 2017 at 14:36

Performance per watt is where ARM have been excelling so far, so nothig extraordinary here. As long as the single-thread performance is good enough ™, many server workloads care more about throughput.

noone says:

November 9, 2017 at 22:59

If Qualcomm can make a board with this chip, 1 dimm per channel, 4 x4 NVMe slots, and 2 x8 PCIe slots, they’ll have my money.

Just give me minimal firmware and a serial port, and no extraneous bullshit.

Nobody of Import says:

November 10, 2017 at 00:55

@noone
Seconded. I want something in this space, even if it seems “pricey” in comparison to the embedded stuff we’re normally used to.

tkaiser says:

November 10, 2017 at 05:23

What an amazing read Cloudfare’s benchmark test is. Kudos to Vlad Krasnov! Very very well done pointing out what the numbers mean, why they are as they are and how they could/will improve.

cnxsoft says:

November 11, 2017 at 11:10

@tkaiser
They have a follow up article about Intel performance only. The CPU cores are clocked lower when some crypto instructions (e.g. AVX-512) are used, so they may actually impact the system performance negatively (depending on your load).
https://blog.cloudflare.com/on-the-dangers-of-intels-frequency-scaling/

blu says:

November 11, 2017 at 14:20

@cnxsoft
That’s a well-known phenomenon on Intel cores since AVX2 and AVX512 only exacerbated things. If your code has predominantly SIMD portions then you win from the SIMD throughput gains. But:
1) on recent AVX architectures core clocks are dropped when AVX is encountered — recent intels just cannot run AVX at their stock clocks.
2) on AVX512 the SIMD engines are kept dormant, so upon encountering AVX512 code the SIMD engines are first fired up, and then clocks are dropped.

So it’s perfectly possible to have code where the throughput gain from limited use of AVX/AVX512 is insufficient to offset the increased latencies from the need to fire up engines first (the smaller detriment) and to downclock entire pipelines (the larger detriment), so you end up with deceleration from using AVX/AVX512.

cnxsoft says:

November 11, 2017 at 14:33

@blu
Do ARM processors exhibit the same behavior for some SIMD instructions, or is it specific to x86?

blu says:

November 11, 2017 at 14:49

@cnxsoft
Not on any of the ARMs I’ve encountered, but we shall see how things develop with the new SVE (scalable vector extension), where vectors can go as wide as 2048 bits. On Intel it started with AVX2 on Haswell, though it’s not necessarily linked to AVX2 per se, as much as to AVX256 workloads and Intel’s push for ever higher clocks. Basically, very high clocks or very wide throughput — you cannot have both; the ever-lasting dilemma of latency vs throughput.

slackstick says:

November 11, 2017 at 18:01

@blu

It’s not the dilemma of latency vs. throughput, it’s the dilemma of the thermal envelope.

blu says:

November 11, 2017 at 19:04

@slackstick
The dilemma (an imposed choice between two or more alternatives) is latency vs throughput; what brought to it in this case was beyond my point (yes, it’s usually thermal envelope and product costs).

theguyuk says:

November 11, 2017 at 19:50

How do they meet the cooling needs on these at present?

RK says:

November 12, 2017 at 00:10

@theguyuk

Under-clocking for the most part.

@blu

You get both latency and throughput 10 times over per watt with DSPs. Won’t be long before someone figures out how to get it done for general compute.

cnxsoft says:

December 5, 2017 at 10:42

They are working on improving Go crypto performance on arm64 based on disappointing benchmark results reported by Cloudflare:

Specifically, the work is to be done to address performance issues in the Go crypto libraries, which to date have been extensively optimized for x86 processors but not yet for arm64. The algorithms in the Go librares crypto/ecsda, crypto/rsa, crypto/aes, and crypto/chacha20poly1305 are targeted for improvement.

The Go source tree had a feature freeze for Go 1.10 on November 1, 2017, which means that the earliest that these issues will be addressed is likely to be in Go 1.11.

https://github.com/golang/go/issues/19840#issuecomment-345513189
https://github.com/golang/go/issues/22806 – ECDSA
https://github.com/golang/go/issues/22807 – RSA
https://github.com/golang/go/issues/22808 – AES
https://github.com/golang/go/issues/22809 – CHACHA
https://github.com/golang/go/milestone/62 – Go 1.11 milestone

Source: https://www.worksonarm.com/blog/woa-issue-30/

13 Replies to “Qualcomm Centriq 2400 ARM SoC Launched for Datacenters, Benchmarked against Intel Xeon SoCs”

Leave a Reply Cancel reply

Leave a Reply