To a large extent, the only thing that really matters in a computer system is what changes in its memory – and in that sense, that is what makes computers like us. All the computational capacity in the world, and every manipulation or transformation of data, matters less than creating new data that is stored in memory so it can be used again, one way or another, at high speed.
The problem with systems and their memory is that you can’t get a memory subsystem that has it all.
You can turn 3D XPoint into a kind of main memory, as Intel demonstrated with its Optane PMem DIMM form factor. The persistence of PMem is useful, but you end up with memory that is more expensive than flash and slower than regular DRAM, so it cannot really replace either – though it can serve as another layer in the memory hierarchy, and does in some systems and storage arrays.
With regular vanilla DRAM, you can create a significant memory space for applications and data, but it can be expensive and the bandwidth is not great. Higher memory speeds and more memory controllers on CPUs help, but latency is still relatively high (at least compared to stacked HBM) and bandwidth is nowhere near as high as HBM delivers. The industry does not yet know how to manufacture HBM in truly large volumes, and as a result yields are lower and unit costs are higher.
The ubiquity of DDR DIMMs – there have been five generations now – and their mass production keep them inexpensive even when bandwidth is the constraint. DDR SDRAM, specified by JEDEC in 1998 and widely commercialized in 2000, debuted with memory clocks between 100 MHz and 200 MHz and delivered between 1.6 GB/sec and 3.2 GB/sec of bandwidth per channel. Over the DDR generations, the memory clock, I/O bus clock, and data throughput of memory modules have all risen, as have capacity and bandwidth. With DDR4, still commonly used in servers, high-end modules run the memory at 400 MHz and the I/O bus at 1.6 GHz, for a data rate of 3.2 GT/sec and 25.6 GB/sec of bandwidth per module. DDR5 doubles that bandwidth to 51.2 GB/sec and doubles the maximum capacity per DIMM to 512 GB. The JEDEC specification for DDR5 allows for speeds of up to 7.2 GT/sec eventually, and we will see how that affects system designs.
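Those per-channel and per-module figures fall out of simple arithmetic on the data rate and the 64-bit (8-byte) DDR data bus. A quick sanity check on the numbers quoted above:

```python
# DDR bandwidth: data rate (GT/sec) times bus width. A standard DDR
# channel is 64 bits (8 bytes) wide, ignoring any ECC bits.
def ddr_bandwidth_gbps(data_rate_gt, bus_bytes=8):
    return data_rate_gt * bus_bytes

print(ddr_bandwidth_gbps(0.2))   # original DDR-200: 1.6 GB/sec
print(ddr_bandwidth_gbps(3.2))   # DDR4-3200: 25.6 GB/sec
print(ddr_bandwidth_gbps(6.4))   # DDR5-6400: 51.2 GB/sec
```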
Our guess is that for many devices this capacity is fine, but the bandwidth just won't be enough. And so we will end up with a split memory hierarchy inside the node, near the compute engines, for the foreseeable future. More specifically, customers will have to choose between devices with DDR5 memory and HBM3 memory, will be able to mix them within systems and across nodes in a cluster, and some of them may have Optane or another persistent memory type such as ReRAM or PCM where applicable.
Programming across major memory types and speeds is going to remain a problem for mixed-memory systems until someone creates a memory processing unit and memory hypervisor that can present a single-level memory space to the compute engines that share it – MemVerge and VMware, are you listening? (We need more than a memory hypervisor; we need something that accelerates it, too.)
Or, companies will use one type of memory to cache another. Skinny, fast memory can cache fat, slow memory, or vice versa. In many hybrid CPU-GPU systems today, the GPU memory is where most of the processing happens, and the coherence between the DDR memory on the CPU and the HBM memory on the GPU is used mainly so that the DDR memory acts like a giant L4 cache for the GPU – yes, the CPU has been relegated to data babysitting. Conversely, in Xeon SP systems that support Optane DIMMs, in one mode (the easiest one to program), the 3D XPoint memory is treated as slow main memory and the machine's DDR4 or DDR5 DIMMs act as a super-fast cache for that Optane memory.
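The appeal of caching one memory in another can be seen with a back-of-the-envelope effective-latency model. This is an illustrative sketch with placeholder latency numbers, not measurements of any real system:

```python
# Toy model of a two-tier memory: a fast tier (e.g. DRAM acting as a cache)
# in front of a slow tier (e.g. Optane in memory mode). Latencies below
# are illustrative assumptions, not measured values.

def effective_latency_ns(hit_rate, fast_ns, slow_ns):
    """Average access time when a fraction hit_rate of accesses is
    served from the fast tier; a miss pays the fast-tier lookup plus
    the slow-tier access."""
    return hit_rate * fast_ns + (1.0 - hit_rate) * (fast_ns + slow_ns)

# Assumed figures: ~100 ns for DRAM, ~350 ns for persistent memory.
for hit_rate in (0.50, 0.90, 0.99):
    print(f"hit rate {hit_rate:.0%}: "
          f"{effective_latency_ns(hit_rate, 100, 350):.1f} ns")
```

The point the model makes is that with a high enough hit rate in the DRAM tier, the big, slow tier looks almost as fast as DRAM itself – which is exactly the bet the Optane memory-mode design makes.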
As we pointed out last July when previewing what HBM3 memory could mean for systems as it becomes available this year, we believe that HBM memory – never put the name of the thing itself in the abbreviation, because we can't say High Bandwidth Memory memory, and what was wrong with HBRAM? – is going to be used in all sorts of systems and will eventually become more ubiquitous, and therefore cheaper. After all, we are not all still using core memory, and many workloads are limited by memory bandwidth, not compute. That is why we think there will be versions of HBM with skinnier 512-bit buses and no interposer, as well as those with the full 1,024-bit bus and an interposer.
With HBM memory (as well as the now-defunct Hybrid Memory Cube stacked memory once created by Intel and Micron and used in Intel's Xeon Phi accelerators), you can stack DRAM, attach it to a very wide bus very close to a compute engine, and boost bandwidth by factors ranging up to an order of magnitude over DRAM hanging directly off processors. But that fast HBM memory is skinny on capacity, and it is also considerably more expensive. It is inherently more costly, but the price/performance of the resulting memory subsystem is arguably better.
We don't have a good sense of how much HBM costs compared to DDR main memory, but Frank Ferro, senior product marketing manager for IP cores at Rambus, does know how it compares to GDDR memory.
"The cost adder for HBM2 compared to GDDR5 was around 4X," Ferro tells The Next Platform. "And the reason is not just the DRAM chips, but the cost of the interposer and the 2.5D fabrication. But the good news with HBM is you get the highest bandwidth, you get really good power and very good performance, and you get a very small footprint. You have to pay for all of that. But the HPC and hyperscale communities aren't particularly cost-constrained. They want less power, of course, but for them, it's all about bandwidth."
Nvidia knows the benefits of HBM3 memory and is the first to market with it in the "Hopper" H100 GPU accelerator announced last month. That came hot on the heels of JEDEC releasing the final HBM3 specification in January.
The HBM3 specification came along faster than SK Hynix hinted last July with its early work, when it was expecting at least 5.2 Gb/sec of signaling and at least 665 GB/sec of bandwidth per stack.
The HBM3 specification calls for the signaling rate per pin to double to 6.4 Gb/sec from the 3.2 Gb/sec used in Samsung's implementation of HBM2E, an extended form of HBM2 that pushed the technology beyond the official JEDEC specification, which initially defined the signaling rate at 2 Gb/sec. (There was an earlier HBM2E variant that used 2.5 Gb/sec signaling, and SK Hynix pushed to 3.6 Gb/sec signaling to try to gain an HBM2E edge over Samsung.)
The number of memory channels has also doubled, from 8 with HBM2 to 16 with HBM3, and there is even support for 32 "pseudo channels" in the architecture, by which we presume a kind of interleaving is possible across DRAM banks, as was commonly done in high-end server main memories. The HBM2 and HBM2E variants could stack DRAM 4, 8, or 12 chips high, and HBM3 allows for expansion to 16-high DRAM stacks. DRAM die densities for HBM3 are expected to range from 8 Gb to 32 Gb, with a four-high stack of 8 Gb chips yielding 4 GB of capacity and a 16-high stack of 32 Gb chips yielding 64 GB per stack. First-generation devices using HBM3 memory are expected to be based on 16 Gb dies, according to JEDEC. The memory interface is still 1,024 bits wide, and a single HBM3 stack can drive 819 GB/sec of bandwidth.
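Those per-stack figures are straightforward arithmetic on the pin speed, bus width, die density, and stack height, all taken from the JEDEC numbers quoted above:

```python
# HBM stack bandwidth: pin signaling rate (Gb/sec) times bus width (bits),
# converted from gigabits to gigabytes.
def stack_bandwidth_gbps(pin_rate_gbit, bus_width_bits=1024):
    return pin_rate_gbit * bus_width_bits / 8.0

# HBM stack capacity: die density (Gbit) times stack height, in GB.
def stack_capacity_gb(die_density_gbit, stack_height):
    return die_density_gbit * stack_height / 8.0

print(stack_bandwidth_gbps(6.4))    # HBM3 at 6.4 Gb/sec: 819.2 GB/sec per stack
print(stack_capacity_gb(8, 4))      # four-high stack of 8 Gb dies: 4 GB
print(stack_capacity_gb(32, 16))    # 16-high stack of 32 Gb dies: 64 GB
```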
So, with six stacks of HBM3, a device could in theory have roughly 4.9 TB/sec of bandwidth and 384 GB of capacity. We wonder what a Hopper H100 GPU accelerator with that much bandwidth and capacity would cost – and what its thermals would look like. . . .
Because the upper echelons of computing are impatient for memory bandwidth, Rambus is already pushing beyond the relatively new HBM3 specification with something that could possibly be called HBM3E in the table above. Specifically, Rambus already has signaling circuitry designed to drive HBM3 pins at 8.4 Gb/sec and deliver 1,075 GB/sec – yes, nearly 1.1 TB/sec – of bandwidth per HBM3 stack. Six of those stacks and you are up around 6.45 TB/sec of memory bandwidth. This is made possible by custom HBM3 memory controllers and custom HBM3 stack PHYs. (Rambus had pushed signaling up to 4 Gb/sec with HBM2E, by the way.)
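The same bus-width arithmetic applies to the Rambus pin rate; this is a sketch of the raw numbers, since delivered bandwidth depends on the actual controller and stack implementation:

```python
# Per-stack bandwidth at the Rambus 8.4 Gb/sec pin rate over a
# 1,024-bit HBM3 interface, and the six-stack total.
PIN_RATE_GBIT = 8.4
BUS_WIDTH_BITS = 1024

per_stack_gbps = PIN_RATE_GBIT * BUS_WIDTH_BITS / 8   # 1,075.2 GB/sec
six_stacks_tbps = 6 * per_stack_gbps / 1000           # ~6.45 TB/sec

print(per_stack_gbps, six_stacks_tbps)
```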
Such bandwidth could actually keep a compute device – an Nvidia Hopper GPU, an upcoming Google TPU5 machine-learning matrix engine, or the data-hungry device of your dreams – well fed. We shudder at the watts and the costs, though. But then again, if bandwidth is the bottleneck, maybe it makes sense to invest more there and liquid-cool everything.
Someone please build such a beast so we can see how it works and analyze its economics.
It’s up to you, Samsung and SK Hynix.