Alder Lake's cores: Gracemont and Golden Cove



Next-gen Intel Alder Lake processors are expected to release later this year, bringing with them a new design philosophy on a new node set to challenge AMD. But ahead of its release, Intel provided a breakdown of its best features during Intel Architecture Day 2021 to answer a question: What makes Alder Lake tick?

All image sources: Intel

Alder Lake deviates significantly from Intel's previous processor designs. Like the systems-on-a-chip (SoCs) found in smartphones, it combines not one but two main architectures, linked together using new packaging technology from Intel. It will have up to 16 cores, split between eight Golden Cove performance cores (Intel calls them P-cores) and eight Gracemont efficiency cores (E-cores). Alder Lake also offers up to 30MB of cache, 16 PCIe 5 lanes, support for both DDR5 and DDR4 memory, and Tiger Lake's Xe LP graphics ported to the Intel 7 node.

Alder Lake design targets.
Alder Lake targets different form factors using a modular approach.

Alder Lake uses three fabrics to connect all of its components and fine-tune energy consumption. The path between the compute cores, graphics, last-level cache (LLC), and memory is the compute fabric, which can run at up to 1 TB/s. The input/output (I/O) fabric, which operates at up to 64 GB/s, passes data between I/O and internal devices. Finally, the memory fabric operates at up to 204 GB/s and can dynamically adjust the width and frequency of its bus across several operating points. Intel says having multiple dynamically scaled fabrics allows Alder Lake to direct power more efficiently to where it is needed most.

Gracemont efficiency core

Gracemont is the name of Intel’s efficiency core architecture. It features a revamped design with a deeper front end and a wider back end, and will be built on the Intel 7 node, formerly known as 10nm Enhanced SuperFin. Its many power and performance improvements, along with the node's advanced transistors, come together to form the efficiency cores that will make their debut in Alder Lake.

Branch prediction is an essential feature of modern processors. It predicts which instructions will be needed next before the program reaches them, reducing CPU wait times and wasted work. Many pipeline stages depend on accurate branch predictions; for example, after a misprediction, the instructions held in the out-of-order buffer may need to be flushed. Gracemont has a 5,000-entry branch target cache for its history-based branch prediction, which helps it generate accurate instruction pointers and reduces the risk of mispredictions.
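To make the idea concrete, here is a minimal Python sketch of a history-based predictor that pairs 2-bit saturating counters with a small branch target buffer. It is a textbook illustration, not Gracemont's actual design; the class name, the table size of 16 entries, and the sample trace are all assumptions made for the example.

```python
class BranchPredictor:
    """Toy predictor: 2-bit saturating counters plus a branch target buffer.

    Hypothetical illustration only; real predictors are far more elaborate.
    """

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries     # 0-1 predict not taken, 2-3 predict taken
        self.targets = [None] * entries   # last target seen for each slot

    def _slot(self, pc):
        # Index the tables with the low bits of the branch address.
        return pc % self.entries

    def predict(self, pc):
        """Return the predicted target, or None to predict fall-through."""
        slot = self._slot(pc)
        if self.counters[slot] >= 2 and self.targets[slot] is not None:
            return self.targets[slot]
        return None

    def update(self, pc, taken, target):
        """Train the counter and remember the target once the branch resolves."""
        slot = self._slot(pc)
        if taken:
            self.counters[slot] = min(3, self.counters[slot] + 1)
            self.targets[slot] = target
        else:
            self.counters[slot] = max(0, self.counters[slot] - 1)


def count_mispredictions(predictor, trace):
    """Replay (pc, taken, target) records and count wrong predictions."""
    misses = 0
    for pc, taken, target in trace:
        actual = target if taken else None
        if predictor.predict(pc) != actual:
            misses += 1   # in hardware, this is where speculative work is flushed
        predictor.update(pc, taken, target)
    return misses


# A loop branch at 0x40 that jumps back to 0x10 nine times, then exits.
loop_trace = [(0x40, True, 0x10)] * 9 + [(0x40, False, None)]
print(count_mispredictions(BranchPredictor(), loop_trace))  # 2
```

The predictor misses only on the first iteration, before it has learned anything, and on the loop exit. A larger table lets more branches keep their own history, which is the intuition behind a large branch target cache.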

In addition to reducing wait times, more branch prediction resources reduce cache misses by loading relevant instructions into the cache before the program requests them. Gracemont contains a 64KB instruction cache that keeps frequently used instructions close at hand, as well as Intel’s first “on-demand instruction length decoder,” which speeds up the decoding of new code.

The main instruction decoder has also been improved. It can now decode up to six instructions per cycle while retaining the efficiency of a much narrower core. The decoder, which translates instructions into micro-ops (uOps), must keep the back end fed at all times for the processor to reach maximum efficiency, so being able to decode more instructions per clock is better for overall performance.

The decoders are aided by a new hardware-driven load balancer. Instead of sending a long chain of sequential instructions to just a few decoders, the load balancer breaks the chain into smaller segments and distributes them across all decoders, increasing parallelism.
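The splitting and distributing step can be sketched in a few lines of Python. This is only an illustration of the general idea: the segment size, the number of decoders, and the round-robin policy are assumptions, since Intel did not describe how the hardware chooses them.

```python
def split_into_segments(instructions, segment_size):
    """Cut one long sequential chain into fixed-size segments."""
    return [
        instructions[i:i + segment_size]
        for i in range(0, len(instructions), segment_size)
    ]


def distribute(segments, num_decoders):
    """Hand segments to decoders in round-robin order (assumed policy)."""
    queues = [[] for _ in range(num_decoders)]
    for index, segment in enumerate(segments):
        queues[index % num_decoders].append(segment)
    return queues


# Twelve sequential instructions, hypothetical segment size of 3, 2 decoders.
chain = list(range(12))
queues = distribute(split_into_segments(chain, 3), 2)
print(queues[0])  # [[0, 1, 2], [6, 7, 8]]
print(queues[1])  # [[3, 4, 5], [9, 10, 11]]
```

Without the split, the whole chain would sit in a single queue while the other decoder idles; with it, both queues hold the same amount of work.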

On the back end, Gracemont has a five-wide allocation stage and a 256-entry out-of-order window. The allocation stage bridges the front end and back end of the processor, while the out-of-order window specifies the number of out-of-order uOp entries the core can buffer before they are sent to the execution units.

Intel claims that Gracemont’s microarchitecture improvements deliver higher overall instructions per cycle (IPC) while consuming a fraction of the power.

Further down the pipeline are the execution units, or EUs for short. Gracemont’s 17 execution ports can be tailored to the needs of each unit. The integer execution ports are complemented by dual integer multipliers and dividers. In addition, the single instruction, multiple data (SIMD) arithmetic logic units (ALUs) used for vector operations now support Intel's Vector Neural Network Instructions (VNNI).

Two floating point pipelines allow the execution of two independent add or multiply operations, as well as two multiply-add instructions per cycle with new vector extension instructions. Gracemont’s vector stack also comes with crypto units that provide AES and SHA acceleration, allowing it to offload encryption workloads in security-sensitive applications.

Finally, there is the memory subsystem. To increase cache bandwidth, Intel added two load pipelines and two store pipelines that allow 32 bytes to be read and 32 bytes to be written simultaneously. The size of the L2 cache is configurable between 2 and 4 MB.

In a core-to-core comparison, Intel said that Gracemont delivers 40% more performance than Skylake at the same power, or the same performance at less than 40% of the power. This means that Gracemont is about 2.5 times more efficient in single-core scenarios. In a four-core configuration, Gracemont provided 80% more performance than two Skylake cores running four threads while consuming less power. Additionally, Intel noted that four Gracemont cores can fit in the same footprint as a single Skylake core.
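The 2.5 times figure follows if efficiency is taken to mean performance per watt and the same performance is reached at roughly 40% of Skylake's power. The short Python check below works through that arithmetic; the function name is made up for the example, and all values are ratios relative to Skylake rather than measured data.

```python
def relative_efficiency(performance_ratio, power_ratio):
    """Performance per watt relative to a baseline whose ratios are both 1.0."""
    return performance_ratio / power_ratio


# Same performance at 40% of the power: 1.0 / 0.4
print(round(relative_efficiency(1.0, 0.4), 2))  # 2.5

# For contrast, "40% less power" would mean 60% of the power: 1.0 / 0.6
print(round(relative_efficiency(1.0, 0.6), 2))  # 1.67

# 40% more performance at the same power: 1.4 / 1.0
print(round(relative_efficiency(1.4, 1.0), 2))  # 1.4
```

Only the first reading matches the quoted 2.5 times gain, which is why the iso-performance claim is best understood as "at 40% of the power".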

Golden Cove performance core

The story is much the same with Golden Cove, Alder Lake's performance core (P-core). The theme of going deeper, wider, and smarter persists, starting with branch prediction.

Like Gracemont, Golden Cove also has a deeper out-of-order scheduler and buffer, more physical registers, a larger allocation window, and more execution ports to increase parallelism.

It can perform four page table walks in parallel. A page table is a “map” of the virtual addresses assigned to a program and is used to help allocate memory more efficiently. A table walk is the process of tracing the page tables to determine which physical address a virtual memory address maps to. The resulting mappings are stored in a translation lookaside buffer (TLB) to minimize the number of table walks.
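The relationship between a page table walk and the TLB can be sketched in Python as follows. This is a deliberately simplified model: it assumes a single-level page table held in a dictionary, 4K pages, and a TLB that never evicts entries, none of which reflects Golden Cove's real hardware. The class and attribute names are made up for the example.

```python
PAGE_SIZE = 4096  # 4K pages


class Mmu:
    """Toy address translator with a TLB in front of a page table."""

    def __init__(self, page_table):
        self.page_table = page_table   # virtual page number -> physical frame number
        self.tlb = {}                  # cache of recent translations
        self.walks = 0                 # number of page table walks performed

    def translate(self, virtual_address):
        page, offset = divmod(virtual_address, PAGE_SIZE)
        if page not in self.tlb:
            # TLB miss: walk the page table and cache the mapping.
            self.walks += 1
            self.tlb[page] = self.page_table[page]
        return self.tlb[page] * PAGE_SIZE + offset


mmu = Mmu({0: 5, 1: 9})
print(mmu.translate(4100))  # 36868 (page 1, offset 4 -> frame 9)
print(mmu.translate(4200))  # 36968 (same page, served from the TLB)
print(mmu.walks)            # 1
```

The second access hits the TLB and skips the walk entirely. A larger TLB raises the odds of that happening, and parallel walkers shorten the wait when several misses occur at once.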

For programs with larger code footprints, Alder Lake’s P-cores double the number of 4K pages stored in the iTLB, improve branch prediction accuracy to reduce jump mispredictions, and add a better code prefetch mechanism. The branch target buffer is also twice the size of the previous generation's and uses a machine learning algorithm to dynamically adjust its size to either reduce power consumption or improve performance.

Golden Cove also includes new dedicated hardware and ISA extensions for matrix multiplication, which Intel says will significantly accelerate AI workloads.

Being the performance core doesn’t mean that efficiency is left behind; power management is also one of Golden Cove’s primary goals. On that front, Golden Cove has a new microcontroller that can measure and adjust power consumption in microseconds instead of milliseconds. Intel says the adjustments are based on the actual behavior of the application rather than on general speculation. The finer power tuning allows for a higher average frequency in any application without a severe power penalty.

Golden Cove has six decoders that can process 32 bytes per cycle. The uOp cache has been enlarged to hold 4,000 operations instead of 2,250, allowing it to increase front-end bandwidth while reducing latency with a shorter pipeline.

The front end has certainly seen some improvements, but Intel credits the out-of-order engine as the component that separates Alder Lake from previous architectures. P-cores have a six-wide rename and allocation stage and 12 execution ports, compared to five-wide and 10 ports in the previous generation. Other improvements include more physical registers, a deeper scheduling window, and a new 512-entry reorder buffer.

The L1 and L2 caches have been enlarged and the rate at which data can be retrieved from them has increased. Two L2 cache configurations are available: 1.25 MB for client processors and 2 MB for data center processors.

Overall, Intel says these improvements give Alder Lake’s P-cores an average performance lead of 19% over the Cypress Cove cores of the previous-generation Rocket Lake at the same frequency. Rocket Lake’s Cypress Cove microarchitecture is built on Intel’s 14nm node and is a backport of Ice Lake's Sunny Cove microarchitecture.

Intel Thread Director

Improving performance while reducing power consumption is a perpetual balancing act in processor design. In previous generations of processors, a single core architecture had to serve both ends. With Alder Lake’s dedicated performance and efficiency cores, Intel hopes to serve both ends better, much the same way today’s smartphone chips do.

But the mix of architectures presents its own set of challenges. Now that the processor is no longer monolithic, it needs a data highway that connects its components to avoid latency, which engineers are working to minimize. The scheduling of threads also becomes an issue; what workloads should be prioritized and how? And how do you optimize them for current and emerging workloads?

To resolve these issues, Intel has added a new hardware scheduler called Intel Thread Director. Its job is to keep an eye on the types of instructions fed into the processor and help the operating system make optimal scheduling decisions. In addition to the programs themselves, Thread Director takes into account thermals, operating conditions, and power limits. It also identifies the threads that require the most performance so that they can be assigned to the P-cores. Likewise, it delegates background tasks to E-cores and AI threads to P-cores. Everything is dynamic, based on the tasks at hand, and completely autonomous.

But that doesn’t mean Thread Director locks heavy workloads exclusively onto P-cores. It will take advantage of any idle cores if resources are available. In a heavy multithreaded workload, Thread Director will distribute the work across all P-cores and E-cores.
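The placement behavior described above can be sketched as a toy Python scheduler. This is not Intel's algorithm: the `demand` score, the 0.5 threshold, and the greedy policy are all assumptions made for illustration. In the real design, Thread Director supplies hints and the operating system makes the final decision.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    demand: float  # hypothetical 0..1 score of how much performance the thread needs


def assign(threads, p_cores, e_cores, threshold=0.5):
    """Greedy placement: demanding threads prefer P-cores, light ones prefer E-cores.

    If the preferred core type is full, the thread spills to any idle core
    of the other type instead of waiting.
    """
    free = {"P": p_cores, "E": e_cores}
    placement = {}
    # Serve the most demanding threads first.
    for thread in sorted(threads, key=lambda t: t.demand, reverse=True):
        preferred = "P" if thread.demand >= threshold else "E"
        fallback = "E" if preferred == "P" else "P"
        if free[preferred] > 0:
            chosen = preferred
        elif free[fallback] > 0:
            chosen = fallback
        else:
            chosen = None  # no idle core left: the thread has to wait
        if chosen is not None:
            free[chosen] -= 1
        placement[thread.name] = chosen
    return placement


threads = [Thread("game", 0.9), Thread("render", 0.8), Thread("sync", 0.1)]
print(assign(threads, p_cores=1, e_cores=2))
# {'game': 'P', 'render': 'E', 'sync': 'E'}
```

With only one P-core free, the second heavy thread spills onto an idle E-core rather than waiting, mirroring the point that heavy workloads are not locked to P-cores.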

