More Details about Goldmont Plus Microarchitecture (used in Gemini Lake Processors)

2017 was the year of systems based on Intel’s low power, low cost Apollo Lake processors, and provided Intel does not suddenly decide to cancel yet another product, they will be replaced by Gemini Lake processors in 2018. The former is based on the Goldmont microarchitecture, while the latter relies on the updated Goldmont Plus microarchitecture.

Intel has now released a document entitled “Intel 64 and IA-32 Architectures Optimization Reference Manual” where you’ll find more gritty technical details about Goldmont Plus in chapter 16 “SOFTWARE OPTIMIZATION FOR GOLDMONT PLUS, GOLDMONT, AND SILVERMONT MICROARCHITECTURES”.


The enhancements over Goldmont include:

  • Back-end pipeline widened from the previous-generation Atom design to 4-wide allocation and 4-wide retire, while keeping the 3-wide fetch and decode pipeline.
  • Enhanced branch prediction unit.
  • Improved AES-NI instruction latency and throughput.
  • 64KB shared second level pre-decode cache (16KB in Goldmont microarchitecture).
  • Larger reservation station and ROB entries to support large out-of-order window.
  • Wider integer execution unit. New dedicated JEU port with support for faster branch redirection.
  • Radix-1024 floating point divider for fast scalar/packed single, double and extended precision floating point divides.
  • Larger load and store buffers. Improved store-to-load forwarding latency for store data from register.
  • Shared instruction and data second level TLB. Paging Cache Enhancements (PxE/ePxE caches).
  • Modular system design with four cores sharing up to 4MB L2 cache.
  • Support for the new Read Processor ID (RDPID) instruction.
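
On a Linux system, two of the features in this list can be checked from user space, since the kernel reports AES-NI as the `aes` flag and Read Processor ID as the `rdpid` flag in /proc/cpuinfo. The snippet below is only a minimal sketch under that assumption (Linux on x86); the function names are made up for illustration and do not come from the Intel document.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of CPU flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line found (e.g. non-x86 kernel): report nothing.
    return set()


def feature_report(cpuinfo_text):
    """Hypothetical helper: report the two flags relevant to the list above."""
    flags = cpu_flags(cpuinfo_text)
    return {
        "aes_ni": "aes" in flags,   # AES-NI instructions
        "rdpid": "rdpid" in flags,  # Read Processor ID instruction
    }
```

To try it on a real machine, pass the file contents, e.g. `feature_report(open("/proc/cpuinfo").read())`. The presence of both flags does not identify Goldmont Plus by itself, as other processors also support these instructions.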

We had a discussion a little while ago comparing 64-bit ARM and Intel Apollo Lake OpenSSL benchmarks, and Intel was a bit behind for some key sizes, so maybe the AES-NI improvements in Gemini Lake/Goldmont Plus will bring low power Intel processors back to the front.

The document also contains a table comparing Goldmont Plus and Goldmont’s “Front End Cluster Features”.

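| Feature | Goldmont Plus Microarchitecture | Goldmont Microarchitecture |
| --- | --- | --- |
| Number of Decoders | 3 | 3 |
| Max. Throughput Decoders | 20 Bytes per cycle | 20 Bytes per cycle |
| Fetch and Icache Pipeline | Decoupled | Decoupled |
| ITLB | 48 entries, large page support | 48 entries, large page support |
| 2nd Level ITLB | Shared with DTLB | Shared with DTLB |
| Branch Mispredict Penalty | 13 cycles (12 cycles for certain Jcc) | 12 cycles |
| L2 Predecode Cache | 64K | 16K |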

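This table shows many similarities, but GLM+ has a bigger 64KB L2 predecode cache (16KB in Goldmont), and a slightly larger branch mispredict penalty of 13 cycles instead of 12 for most branches, which is likely more than compensated by the larger predecode cache and the enhanced branch prediction unit. More information can be found in the Intel document.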
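
To get a feel for what the extra penalty cycle means, the cost of mispredicts can be estimated as branch fraction × mispredict rate × penalty. The sketch below uses the 12 and 13 cycle penalties from the table, but the branch fraction and mispredict rate are purely illustrative assumptions, not figures from the Intel document or from any measurement.

```python
def mispredict_cycles_per_kilo_instr(penalty_cycles, branch_fraction, mispredict_rate):
    """Cycles lost to branch mispredicts per 1000 instructions."""
    return 1000 * branch_fraction * mispredict_rate * penalty_cycles


def break_even_mispredict_rate(old_rate, old_penalty, new_penalty):
    """Mispredict rate at which the new, longer penalty costs the same as the old one."""
    return old_rate * old_penalty / new_penalty


# Assumed workload characteristics (hypothetical values, for illustration only).
BRANCH_FRACTION = 0.20   # one instruction in five is a branch
MISPREDICT_RATE = 0.05   # 5% of branches are mispredicted

# Penalties taken from the front end comparison table.
goldmont = mispredict_cycles_per_kilo_instr(12, BRANCH_FRACTION, MISPREDICT_RATE)
goldmont_plus = mispredict_cycles_per_kilo_instr(13, BRANCH_FRACTION, MISPREDICT_RATE)
break_even = break_even_mispredict_rate(MISPREDICT_RATE, 12, 13)

print(f"Goldmont:      {goldmont:.1f} cycles per 1000 instructions")
print(f"Goldmont Plus: {goldmont_plus:.1f} cycles per 1000 instructions")
print(f"Break-even mispredict rate for Goldmont Plus: {break_even:.4f}")
```

Under these assumptions, the longer pipeline costs about 8% more mispredict cycles at equal prediction accuracy, so the enhanced branch predictor only needs to cut the mispredict rate by roughly one thirteenth to break even, before counting any benefit from the larger predecode cache.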

Jean-Luc started CNX Software in 2010 as a part-time endeavor, before quitting his job as a software engineering manager, and starting to write daily news, and reviews full time later in 2011.

2 Comments
willy
2 years ago

It looks a lot like what I remember of the P3 or Pentium-M, both of which were very efficient CPUs in their time. It’s good to see x86 improving on the low power designs. If Atoms can at least become as fast as ARM CPUs again, we can hope for a more balanced offering depending on expected performance level, power envelope and price. I.e., choose between power efficiency and price for a similar performance level, or select between performance and price for a given power envelope.

Tomm
2 years ago

😮 that 64KB difference with 16KB
