Arm Cortex-A75 based processors are only found in a few SoCs and devices, but Arm keeps on innovating, and has now announced a new suite of IP: the Cortex-A76 CPU, with 35 percent more performance, and the Mali-G76 GPU, with ML support and 30 percent higher efficiency and performance.
SoCs based on the new CPU and GPU IP should provide “laptop-class” performance. The company also announced the Arm Mali-V76 VPU with support for 8K video decoding and encoding.
After Cortex-A75, the Arm Cortex-A76 CPU is the second high-performance processor core based on DynamIQ technology. Besides the 35 percent performance gain mentioned in the introduction, it also offers 40 percent better efficiency and delivers a 4x compute performance improvement for AI/ML at the edge.
Highlights of Cortex A76:
- Architecture – Armv8-A (Harvard) with Armv8.1, Armv8.2, Armv8.3 (LDAPR instructions only), cryptography and RAS extensions
- ISA support – A64; A32 and T32 (at EL0 only)
- Pipeline – Out-of-order
- NEON / Floating Point Unit
- Optional Cryptography Unit
- Up to four CPUs in cluster
- Physical addressing (PA) – 40-bit
- Memory system and external interfaces
  - 64KB L1 I-Cache / D-Cache
  - 256KB to 512KB L2 Cache
  - Optional 512KB to 4MB L3 cache
  - ECC Support, LPAE
  - Bus interfaces – AMBA ACE or CHI
  - Optional ACP
  - Optional Peripheral Port
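To put the 40-bit physical addressing figure from the list above into perspective, here is a minimal Python sketch that converts an address width into addressable memory. It assumes byte-granular addressing; the helper name `addressable_bytes` is made up for this illustration.

```python
# Convert a physical address width into the amount of addressable memory.
# Assumption: one address per byte (byte-granular addressing).
PA_BITS = 40  # "Physical addressing (PA) – 40-bit" from the highlights


def addressable_bytes(bits: int) -> int:
    """Return the number of distinct byte addresses for an address width."""
    return 1 << bits


total = addressable_bytes(PA_BITS)
print(f"{PA_BITS}-bit PA: {total} bytes = {total // 2**30} GiB = {total // 2**40} TiB")
```

So a 40-bit physical address covers 1 TiB, which is an upper bound on addressable memory rather than what any actual device will ship with.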
A Cortex-A76 SoC should provide around twice the performance of a Cortex-A73 SoC in laptops, thanks to the microarchitecture improvements, a smaller process node (7nm vs 16nm), and a higher CPU frequency (up to 3GHz+). big.LITTLE performance within the same power envelope (5W) should also be about twice as good.
Some of the key microarchitectural enhancements include:
- Decoupled branch prediction and instruction fetch – Built to hide latency at high bandwidth, the in-order Cortex-A76 front-end is able to fetch 4 to 8 instructions per cycle, using multi-level branch target caches and hybrid indirect predictor to sustain the maximum throughput.
- Arm’s first 4-wide decode core, increasing the maximum instructions-per-cycle capability. Up to 8 operations per cycle can then be dispatched to the out-of-order core, supporting a wider area-/power-optimized instruction window.
- More integer and vector execution throughput – Quad-issue integer units are integrated in the core including 3x simple ALU and 1x multi-cycle integer. Moreover, Cortex-A76 supports dual-issue native 16B (128-bit) vector and floating-point units, twice the throughput of any previous Arm CPU. Vitally, it can deliver the 4x ML performance improvements we mentioned earlier.
- Enhanced memory system – The full cache hierarchy is co-optimized for latency and bandwidth, with a sophisticated 4th generation prefetcher and deep memory-level parallelism.
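The "dual-issue native 16B (128-bit)" vector figure above can be turned into a rough per-cycle lane count. The Python sketch below is back-of-the-envelope arithmetic only: it assumes both vector pipes can accept any element type every cycle, which Arm's announcement does not confirm, so treat the numbers as theoretical upper bounds. The function name `lanes_per_cycle` is hypothetical.

```python
# Peak vector lanes per cycle for two 128-bit vector pipes.
# Assumption (not confirmed by the source): both pipes can process any
# element type every cycle, so these are theoretical upper bounds.
VECTOR_WIDTH_BITS = 128  # "native 16B (128-bit)"
VECTOR_PIPES = 2         # "dual-issue"


def lanes_per_cycle(element_bits: int,
                    pipes: int = VECTOR_PIPES,
                    width_bits: int = VECTOR_WIDTH_BITS) -> int:
    """Elements processed per cycle across all vector pipes."""
    return pipes * (width_bits // element_bits)


for name, bits in [("FP64", 64), ("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{name}: up to {lanes_per_cycle(bits)} lanes per cycle")
```

Narrower element types pack more lanes into the same 128-bit registers, which is one reason low-precision ML workloads benefit most from wider vector throughput.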
Besides the 30% improvement in performance density and energy efficiency, the Bifrost-based Arm Mali-G76 GPU also delivers around 2.7 times better machine learning (ML) performance than the Mali-G72 GPU.
Some of the specifications of Mali-G76 GPU:
- Anti-Aliasing – 4x MSAA, 8x MSAA, 16x MSAA
- API Support – OpenGL ES 1.1, 2.0, 3.1, 3.2; Vulkan 1.1; OpenCL 1.1, 1.2, 2.0 Full Profile
- Bus Interface – AMBA 4, ACE-LITE
- L2 Cache – 512KB to 4MB
- Scalability – 4 to 20 Cores
- Adaptive Scalable Texture Compression (ASTC) – Low Dynamic Range (LDR) and High Dynamic Range (HDR), supports both 2D and 3D images.
- Arm Frame Buffer Compression (AFBC) – Version 1.2; 4×4 pixel block size
The GPU will be used in “premium mobile”, virtual reality, machine learning, and automotive applications.
Mali-V76 is the latest video processing unit (VPU) from Arm, with support for 8K video decoding at 60 fps. It is also suitable for video walls showing 2×2 4K UHD videos or 4×4 1080p HD videos.
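The video wall configurations follow directly from the pixel dimensions: a 2×2 grid of 4K UHD streams or a 4×4 grid of 1080p streams covers exactly one 8K frame. A small Python check, using the standard UHD resolutions (the `tiled` helper is made up for this illustration):

```python
# Check that the video wall layouts add up to one 8K UHD frame.
RESOLUTIONS = {
    "8K UHD": (7680, 4320),
    "4K UHD": (3840, 2160),
    "1080p HD": (1920, 1080),
}


def tiled(name: str, cols: int, rows: int) -> tuple[int, int]:
    """Total (width, height) of a cols x rows grid of identical streams."""
    width, height = RESOLUTIONS[name]
    return (width * cols, height * rows)


for name, grid in [("4K UHD", 2), ("1080p HD", 4)]:
    total = tiled(name, grid, grid)
    matches = total == RESOLUTIONS["8K UHD"]
    print(f"{grid}x{grid} {name}: {total[0]}x{total[1]}, equals 8K UHD: {matches}")
```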
Main features of Mali-V76 VPU:
- Multi-standard video processor
- 10/8-bit HEVC, VP9, VP8, H.264, AVS+/AVS and legacy
- Simultaneous encode and decode
- Scalable 2-8 cores (8K60D/8K30E)
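As a rough illustration of what "Scalable 2-8 cores (8K60D/8K30E)" could mean per core, the Python sketch below divides the headline pixel rates by the core count. It rests on two assumptions the announcement does not spell out: that the 8K60 decode and 8K30 encode figures apply to the 8-core configuration, and that throughput scales linearly with the number of cores.

```python
# Per-core pixel throughput implied by the 8K60 decode / 8K30 encode figures.
# Assumptions (not confirmed by the source): the headline figures are for
# the 8-core configuration, and throughput scales linearly with core count.
MAX_CORES = 8
W_8K, H_8K = 7680, 4320


def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Raw pixel rate of a video stream."""
    return width * height * fps


decode_per_core = pixels_per_second(W_8K, H_8K, 60) // MAX_CORES
encode_per_core = pixels_per_second(W_8K, H_8K, 30) // MAX_CORES

print(f"Decode per core: {decode_per_core} pixels/s")
print(f"Encode per core: {encode_per_core} pixels/s")
print("Decode per core equals 4K30:",
      decode_per_core == pixels_per_second(3840, 2160, 30))
print("Encode per core equals 1080p60:",
      encode_per_core == pixels_per_second(1920, 1080, 60))
```

Under those assumptions, each core would handle roughly one 4K30 stream when decoding and one 1080p60 stream when encoding.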
No mention of AV1 codec, so we’ll probably have to wait for 2020 or beyond before AV1 makes it into silicon.
Mali-V76 is an evolution of the Mali-V61 video processor with twice the decode performance, a 40% smaller area for 4K120 performance, 25% additional bitrate savings, twice the bus fabric latency tolerance, and additional support for 10-bit H.264 and 8-bit AVS+/AVS decode.