More Details about Goldmont Plus Microarchitecture (used in Gemini Lake Processors)

2017 was the year of systems based on Intel’s low power, low cost Apollo Lake processors, and provided Intel does not suddenly decide to cancel yet another product, they will be replaced by Gemini Lake processors in 2018. The former is based on the Goldmont microarchitecture, while the latter relies on the updated Goldmont Plus microarchitecture.

Intel has now released a document entitled “Intel 64 and IA-32 Architectures Optimization Reference Manual” where you’ll find more gritty technical details about Goldmont Plus in chapter 16 “SOFTWARE OPTIMIZATION FOR GOLDMONT PLUS, GOLDMONT, AND SILVERMONT MICROARCHITECTURES”.

The enhancements over Goldmont include:

  • Back-end pipeline widened over the previous generation Atom processor: 4-wide allocation through 4-wide retire, while the fetch and decode pipeline stays 3-wide.
  • Enhanced branch prediction unit.
  • Improved AES-NI instruction latency and throughput.
  • 64KB shared second level pre-decode cache (16KB in Goldmont microarchitecture).
  • Larger reservation station and ROB entries to support large out-of-order window.
  • Wider integer execution unit. New dedicated JEU port with support for faster branch redirection.
  • Radix-1024 floating point divider for fast scalar/packed single, double and extended precision floating point divides.
  • Larger load and store buffers. Improved store-to-load forwarding latency for store data from register.
  • Shared instruction and data second level TLB. Paging Cache Enhancements (PxE/ePxE caches).
  • Modular system design with four cores sharing up to 4MB L2 cache.
  • Support for the new Read Processor ID (RDPID) instruction.
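
Two of the items above, AES-NI and RDPID, are visible to software as CPU feature flags. The following is a minimal sketch, assuming a Linux system where `/proc/cpuinfo` lists them as `aes` and `rdpid`; the helper names `cpu_flags` and `feature_hints` are made up for illustration. Seeing both flags does not prove the CPU is Goldmont Plus, since other microarchitectures also expose them.

```python
def cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    # No 'flags' line (e.g. non-x86 kernels use a different field name).
    return set()


def feature_hints(flags):
    """Report whether the AES-NI and RDPID flags are present (assumed Linux names)."""
    return {"aes_ni": "aes" in flags, "rdpid": "rdpid" in flags}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            text = fh.read()
    except OSError:
        text = ""  # Not Linux, or /proc is unavailable.
    print(feature_hints(cpu_flags(text)))
```

On an Apollo Lake (Goldmont) system one would expect `aes` to be present and `rdpid` to be absent, but that is worth checking on real hardware rather than taking from this sketch.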

We had a discussion a little while ago comparing OpenSSL benchmarks on 64-bit ARM and Intel Apollo Lake, and Intel was a bit behind for some key sizes, so maybe the AES-NI improvements in Gemini Lake/Goldmont Plus will bring low power Intel processors back to the front.
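
For anyone wanting to repeat that kind of comparison, here is a small sketch that builds `openssl speed` commands for the three AES key sizes. It assumes the `openssl` command line tool is installed and that CBC mode is the one of interest (the earlier discussion may have used other modes); `speed_command` and `run_speed` are hypothetical helper names. The `-evp` option goes through OpenSSL's EVP interface, which is the path that uses AES-NI when the CPU supports it.

```python
import shutil
import subprocess

KEY_SIZES = (128, 192, 256)  # AES key sizes in bits


def speed_command(key_bits, mode="cbc"):
    """Build the openssl speed command for one AES key size."""
    return ["openssl", "speed", "-evp", f"aes-{key_bits}-{mode}"]


def run_speed(key_bits):
    """Run the benchmark and return its raw output, or None if openssl is missing."""
    if shutil.which("openssl") is None:
        return None
    result = subprocess.run(
        speed_command(key_bits), capture_output=True, text=True, check=False
    )
    return result.stdout


if __name__ == "__main__":
    # Only print the commands here; each real run takes a while.
    for bits in KEY_SIZES:
        print(" ".join(speed_command(bits)))
```

Running the same commands on an Apollo Lake and a Gemini Lake board would give a like-for-like view of the AES-NI changes, provided the OpenSSL versions match.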

The document also contains a table comparing Goldmont Plus and Goldmont’s “Front End Cluster Features”.

| Feature | Goldmont Plus Microarchitecture | Goldmont Microarchitecture |
|---|---|---|
| Number of Decoders | 3 | 3 |
| Max. Throughput Decoders | 20 Bytes per cycle | 20 Bytes per cycle |
| Fetch and Icache Pipeline | Decoupled | Decoupled |
| ITLB | 48 entries, large page support | 48 entries, large page support |
| 2nd Level ITLB | Shared with DTLB | Shared with DTLB |
| Branch Mispredict Penalty | 13 cycles (12 cycles for certain Jcc) | 12 cycles |
| L2 Predecode Cache | 64K | 16K |

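This table shows many similarities, but Goldmont Plus has a bigger 64KB L2 pre-decode cache (16KB in Goldmont), and a mispredict penalty that is one cycle longer in most cases (that’s presumably more than compensated by the larger pre-decode cache and the enhanced branch prediction unit). More information can be found in the Intel document.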
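
To put the extra penalty cycle in perspective, here is a rough back-of-the-envelope sketch. It assumes a deliberately simple model where the cycles lost to mispredictions equal the number of mispredictions times the penalty, ignoring everything else in the pipeline; the function names and the example misprediction rate are hypothetical. Under that model, going from 12 to 13 cycles is break-even if the enhanced branch predictor removes 1/13 of the mispredictions, or about 7.7%.

```python
def mispredict_cost(mispredicts_per_kilo_instr, penalty_cycles):
    """Cycles lost to mispredictions per 1000 instructions (simple linear model)."""
    return mispredicts_per_kilo_instr * penalty_cycles


def break_even_reduction(old_penalty, new_penalty):
    """Fraction of mispredictions that must disappear to offset a longer penalty."""
    return 1.0 - old_penalty / new_penalty


if __name__ == "__main__":
    rate = 5  # hypothetical mispredictions per 1000 instructions
    print("Goldmont cost:     ", mispredict_cost(rate, 12))
    print("Goldmont Plus cost:", mispredict_cost(rate, 13))
    print(f"Break-even reduction: {break_even_reduction(12, 13):.1%}")
```

This says nothing about the actual predictor accuracy of either core; it only shows how small an improvement is needed before the longer penalty stops mattering.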

2 Comments

willy

It looks a lot like what I remember of the P3 or Pentium-M, both of which were very efficient CPUs in their time. It’s good to see x86 improving on the low power designs. If Atoms can at least become as fast as ARM CPUs again, we can hope for a more balanced offering depending on expected performance level, power envelope and price, i.e. choose between power efficiency and price for a similar performance level, or select between performance and price for a given power envelope.

Tomm

😼 that 64KB difference with 16KB