Qualcomm Centriq 2400 ARM SoC Launched for Datacenters, Benchmarked against Intel Xeon SoCs

Qualcomm Centriq 2400 ARM Server-on-Chip has been four years in the making. The company announced sampling in Q4 2016 using 10nm FinFET process technology with the SoC featuring up to 48 Qualcomm Falkor ARMv8 CPU cores optimized for datacenter workloads. More recently, Qualcomm provided a few more details about the Falkor core, fully customized with a 64-bit only micro-architecture based on ARMv8 / Aarch64.

Finally, here it is as the SoC formally launched with the company announcing commercial shipments of Centriq 2400 SoCs.

Qualcom Centriq 2400 key features and specifications:

  • CPU – Up to 48 physical ARMv8 compliant 64-bit only Falkor cores @ 2.2 GHz (base frequency) / 2.6 GHz (peak frequency)
  • Cache – 64 KB L1 instructions cache with 24 KB single-cycle L0 cache, 512 KB L2 cache per duplex; 60 MB unified L3 cache; Cache QoS
  • Memory – 6 channels of DDR4 2667 MT/s  for up to 768 GB RAM; 128 GB/s peak aggregate bandwidth; inline memory bandwidth compression
  • Integrated Chipset – 32 PCIe Gen3 lanes with 6 PCIe controllers; low speed I/Os; management controller
  • Security – Root of trust, EL3 (TrustZone) and EL2 (hypervisor)
  • TDP – < 120W (~2.5 W per core)
Click to Enlarge

The SoC is ARM SBSA v3 compliant, meaning it can run any compliant operating systems without having to resort to “cute embedded nonsense hacks“. The processor if optimized for cloud workloads, and the company explains the SoC are already been used demonstrated for the following tasks:

  • Web front end with HipHop Virtual Machine
  • NoSQL databases including MongoDB, Varnish, Scylladb
  • Cloud orchestration and automation including Kubernetes, Docker, metal-as-a-service
  • Data analytics including Apache Spark
  • Deep learning inference
  • Network function virtualization
  • Video and image processing acceleration
  • Multi-core electronic design automation
  • High throughput compute bioinformatics
  • Neural class networks
  • OpenStack Platform
  • Scaleout Server SAN with NVMe
  • Server-based network offload

Three Qualcom Centriq 2400 SKUs are available today

  • Centriq 2434 – 40 cores @ 2.3 / 2.5 GHz; 50 MB L3 cache; 110W TDP
  • Centriq 2452 – 46 cores @ 2.2 / 2.6 GHz; 57.5 MB L3 cache; 120W TDP
  • Centriq 2460 – 48 cores @ 2.2 / 2.6 GHz; 60 MB L3 cache; 120W TDP

Qualcomm Centriq 2460 (48-cores) was compared to an Intel Xeon Platinum 8160 with 24-cores/48 threads (150 W) and found to perform a little better in both integer and floating point benchmarks.

The most important metrics for server SoCs are performance per thread, performance per watt, and performance per dollars, so Qualcomm pitted Centriq 2460, 2452 and 2434 against respectively Intel Xeon Platinum 8180 (28 cores/205W  TDP), Xeon Gold 6152 (22 cores/140W TDP), and Xeon Silver 4116 (12 cores/85W  TDP). Performance per watt was found to be significantly better for the Qualcomm chip when using SPECint_rate2006 benchmark.

Performance per dollars of the Qualcomm SoCs look excellent too, but…

Qualcomm took Xeon SoCs pricing from Intel’s ARK, and in the past prices there did not reflect the real selling price of the chip, at least for low power Apollo Lake / Cherry Trail processors.

This compares to the prices for Centriq 2434 ($880), Centriq 2452 ($1,373), and Centriq 2460 ($1,995).

Qualcomm also boasted better performance per mm2, and typical power consumption of Centriq 2460 under load of around 60W, well below the 120W TDP. Idle power consumption is around 8 watts using C1 mode, and under 4 Watts when all idle states are enabled.

If you are wary of company provided benchmarks, Cloudflare independently tested Qualcomm Centriq and Intel  Skylake/Broadwell servers using Openssl speed, compression algorithms (gzip, brotli…), Go, NGINX web server, and more.

Multicore OpenSSL Performance

Usually, Intel single core performance is better, but since ARM has more cores, multi-threaded performance is often better on ARM. Here’s their conclusion:

The engineering sample of Falkor we got certainly impressed me a lot. This is a huge step up from any previous attempt at ARM based servers. Certainly core for core, the Intel Skylake is far superior, but when you look at the system level the performance becomes very attractive.

The production version of the Centriq SoC will feature up to 48 Falkor cores, running at a frequency of up to 2.6GHz, for a potential additional 8% better performance.

Obviously the Skylake server we tested is not the flagship Platinum unit that has 28 cores, but those 28 cores come both with a big price and over 200W TDP, whereas we are interested in improving our bang for buck metric, and performance per watt.

Currently my main concern is weak Go language performance, but that is bound to improve quickly once ARM based servers start gaining some market share.

Both C and LuaJIT performance is very competitive, and in many cases outperforms the Skylake contender. In almost every benchmark Falkor shows itself as a worthy upgrade from Broadwell.

The largest win by far for Falkor is the low power consumption. Although it has a TDP of 120W, during my tests it never went above 89W (for the go benchmark). In comparison Skylake and Broadwell both went over 160W, while the TDP of the two CPUs is 170W.

Back to software support, the SoC is supported by a large ecosystem with technologies such as memcached, MongoDB, MySQL, …, cloud management solutions such as  Openstack and Kubernetes, programming languages (Java, Python, PHP, Node, Golang…), tools (GVV/ LLVM, GBD…), virtualization solution including KVM, Xen and Docker, as well as operating systems like Ubuntu, Redhat, Suse, and Centos.

Qualcomm is already working on its next generation SoC: Firetail based on Qualcomm Saphira core. But no details were provided yet.

Thanks to David for the links.

13
Leave a Reply

avatar
13 Comment threads
0 Thread replies
6 Followers
 
Most reacted comment
Hottest comment thread
8 Comment authors
RKtheguyukslacksticktkaiser Recent comment authors
  Subscribe  
newest oldest most voted
Notify of
blu
Guest
blu

Performance per watt is where ARM have been excelling so far, so nothig extraordinary here. As long as the single-thread performance is good enough ™, many server workloads care more about throughput.

noone
Guest
noone

If Qualcomm can make a board with this chip, 1 dimm per channel, 4 x4 NVMe slots, and 2 x8 PCIe slots, they’ll have my money.

Just give me minimal firmware and a serial port, and no extraneous bullshit.

Nobody of Import
Guest
Nobody of Import

@noone
Seconded. I want something in this space, even if it seems “pricey” in comparison to the embedded stuff we’re normally used to.

tkaiser
Guest
tkaiser

What an amazing read Cloudfare’s benchmark test is. Kudos to Vlad Krasnov! Very very well done pointing out what the numbers mean, why they are as they are and how they could/will improve.

blu
Guest
blu

@cnxsoft
That’s a well-known phenomenon on Intel cores since AVX2 and AVX512 only exacerbated things. If your code has predominantly SIMD portions then you win from the SIMD throughput gains. But:
1) on recent AVX architectures core clocks are dropped when AVX is encountered — recent intels just cannot run AVX at their stock clocks.
2) on AVX512 the SIMD engines are kept dormant, so upon encountering AVX512 code the SIMD engines are first fired up, and then clocks are dropped.

So it’s perfectly possible to have code where the throughput gain from limited use of AVX/AVX512 is insufficient to offset the increased latencies from the need to fire up engines first (the smaller detriment) and to downclock entire pipelines (the larger detriment), so you end up with deceleration from using AVX/AVX512.

blu
Guest
blu

@cnxsoft
Not on any of the ARMs I’ve encountered, but we shall see how things develop with the new SVE (scalable vector extension), where vectors can go as wide as 2048 bits. On Intel it started with AVX2 on Haswell, though it’s not necessarily linked to AVX2 per se, as much as to AVX256 workloads and Intel’s push for ever higher clocks. Basically, very high clocks or very wide throughput — you cannot have both; the ever-lasting dilemma of latency vs throughput.

slackstick
Guest
slackstick

@blu

It’s not the dilemma of latency vs. throughput, it’s the dilemma of the thermal envelope.

blu
Guest
blu

@slackstick
The dilemma (an imposed choice between two or more alternatives) is latency vs throughput; what brought to it in this case was beyond my point (yes, it’s usually thermal envelope and product costs).

theguyuk
Guest
theguyuk

How do they meet the cooling needs on these at present?

RK
Guest
RK

@theguyuk

Under-clocking for the most part.

@blu

You get both latency and throughput 10 times over per watt with DSPs. Won’t be long before someone figures out how to get it done for general compute.