Updated x86/x64 Software Optimization Manual Reveals More Intel Tremont Details

Tremont Microarchitecture Processor Pipeline

We first found out about the Tremont microarchitecture in April 2018 through Intel documents and Linux mainline source code, which suggested it was meant to succeed Goldmont Plus. Last year, Intel formally announced the Tremont architecture, providing some details along with a block diagram and key features. This morning, we were informed by email of a new revision of the x86/x64 Software Optimization Manual (PDF) with further details about the Tremont architecture. If you want all the details, jump to section 4.1 "Tremont Architecture" of the document, but here are some of the highlights and improvements over the Goldmont Plus microarchitecture:

- Enhanced branch prediction unit.
  - Increased capacity with improved path-based conditional and indirect prediction.
  - New committed Return Stack Buffer.
- Clustered 6-wide out-of-order front-end fetch and decode pipeline.
  - Banked ICache with dual 16B reads.
  - Two 3-wide decode clusters enabling up to 6 instructions per cycle.
- Deeper back-end out-of-order windows.
- Dedicated integer and vector integer/floating-point store data ports.
- 33% increase …

Support CNX Software – Donate via PayPal or become a Patron on Patreon

Intel Unveils Tremont Low-power x86 Architecture, Lakefield Hybrid Processor

Intel LakeField Processor

The Intel Tremont microarchitecture was first leaked in April 2018 as a successor to Goldmont Plus, which is used in Gemini Lake processors among others. Intel has now made it official and revealed details about the Tremont architecture at the Linley Fall Processor Conference in Santa Clara, California. The new architecture is said to deliver significant IPC (instructions per cycle) gains compared with Intel's prior low-power x86 architectures. Tremont-based processors will target client devices, IoT products, 5G networking, efficient datacenter servers, and more.

Tremont Architecture

Some of the highlights of the Tremont architecture include:

- Intel Core class branch prediction with long history, 32 bytes based, L1 predictor (no penalty) and large L2 predictor
- Out of order fetch: 32KB instruction cache, 32 bytes/cycle, up to 8 outstanding misses
- 6-wide out of order instruction decode
  - Dual 3-wide clusters
  - Wide decode without the area of a uop cache
  - Optional single cluster mode based on product targets
- 4 wide allocation
- 10 execution ports
- Dual load/store pipelines
- 32KB data cache
- 1024 …
