One of the reasons Intel spent $16.7 billion to acquire FPGA maker Altera six years ago was that it was convinced its onload model, where large parts of the storage and networking stack run on CPUs, would fall out of favor and that companies would want to offload this work onto network interface cards with plenty of their own processing, which is much cheaper and much more energy efficient.
This is what we used to call SmartNICs, which meant offloading and accelerating certain functions using a custom ASIC on the network interface card, and what we now increasingly call a DPU, short for Data Processing Unit, because these devices take a hybrid approach to their compute and acceleration, mixing CPUs, GPUs, and FPGAs on the same device. Because it has to be different, Intel gives offload devices that are significantly extended SmartNICs the name Infrastructure Processing Units, or IPUs, but to avoid confusion, we are sticking with the name DPU for all of these.
Either way, Intel showcased three of its impending DPUs during its recent Architecture Day extravaganza, and the leaders of its Data Platforms Group showed that they have indeed been on the road to Damascus over the past two years and would not only stop persecuting DPUs, but fully embrace them. Well, it wasn’t so much a conversion as it was an injection of new people bringing new ideas, and that includes Guido Appenzeller, who is now CTO of what was previously called the Data Center Group. Appenzeller led the Clean Slate Lab at Stanford University, which spawned the OpenFlow software-defined networking control plane standard, and was co-founder and CEO of Big Switch Networks (now part of Arista Networks). Appenzeller was also in charge of technology strategy in the Networking and Security business unit of VMware for a time and was behind the OpenSwitch open source network operating system project created by Hewlett Packard Enterprise some years ago.
Intel hasn’t talked much about offloading work from CPUs, because that is heresy – even if it is happening and even though there are very good economic and security reasons for doing so. The DPU metaphor that Appenzeller came up with and talked about at Architecture Day is clever, and it’s more about resource sharing and multi-tenancy than about better price/performance across a cluster of systems, which, in our opinion, is the real driver behind the DPU. (We realize this is splitting hairs. Offloading network and storage to the DPU helps reduce latency, improve throughput, lower costs, and provide secure multi-tenancy.)
“If you want to think of an analogy, it’s kind of like hotels versus single-family homes,” Appenzeller explained. “In my house, I want it to be easy to move from the living room to the kitchen to the dining table. In a hotel, it’s very different. The bedrooms, the dining room and the kitchen are perfectly separated. The areas where hotel staff work are different from those where hotel guests are located. And you get a bed, you might want to switch between them in some cases. And that’s basically the same trend we’re seeing in cloud infrastructure today.”
In Intel’s conception of the DPU, the IPU is where the control plane of the cloud service providers – what we call hyperscalers and cloud builders – runs, while the hypervisor and the tenants’ code run on the processor cores inside the server chassis where the DPU is plugged in. Many would dispute this approach, and Amazon Web Services, which has perfected the art of the DPU with its “Nitro” SmartNICs, would be the first to object. All network and storage virtualization code runs on the Nitro DPU for all EC2 instances, and, importantly, so does the server virtualization hypervisor, except for the tiniest piece of paravirtualized code, which has practically no overhead. The processor cores are only intended for running operating systems and performing compute tasks. Nothing more.
In a sense, as we’ve been saying for some time, a CPU is really a serial compute accelerator for the DPU, and not too far into the future the DPU will have all of the accelerators connected to it in a high-speed fabric that allows everything to be disaggregated and composable, with the DPU – and not the CPU – at the heart of the architecture. This goes too far for Intel, we suspect. But it makes more sense, and goes a long way toward fulfilling the four-decade-old vision that “the network is the computer” espoused by former Sun Microsystems techie John Gage. There will be more and more processing in the network, in the DPUs and in the switches themselves, as we go forward, because this is the natural place for collective operations to run, and perhaps they never should have been placed on the CPUs in the first place.
To be fair, later in his speech, as you see in the graph above, Appenzeller admitted that CPU offload is happening, allowing customers to “maximize CPU revenue.” Intel has certainly done this over the past decade, but this strategy no longer works. This is one of the reasons Appenzeller was brought in from outside of Intel.
And this data below, from Facebook, as cited by Appenzeller, makes it clear why Intel has changed its mindset – especially after seeing AWS and Microsoft fully embrace DPUs in recent years and other hyperscalers and cloud builders following suit with varying levels of deployment and success.
This might be a generous dataset, especially if you don’t include the overhead of a server virtualization hypervisor, as many large enterprises have to, even though hyperscalers and cloud builders tend to run bare metal with containers on top.
At the moment, because its oneAPI software stack is not fully cooked and it doesn’t have a software ecosystem running on GPU-accelerated devices, Intel is only talking about DPUs based on CPUs, FPGAs, and custom ASICs. But eventually, we believe that GPUs, which excel at certain types of parallel processing and are faster to reprogram than FPGAs, will be part of the DPU mix at Intel, as they have become at Nvidia. It’s just a matter of time.
But for now, two of the DPUs presented by Intel at Architecture Day were based on CPU and FPGA combos – one called “Arrow Creek” based on an FPGA/CPU SoC, the other called “Oak Springs Canyon” with a mix of an FPGA plus an external Xeon D processor – and one was based on a custom ASIC named “Mount Evans” that Intel was creating for an unnamed “first cloud provider.”
Here are the Arrow Creek (left) and Oak Springs Canyon (right) cards, which plug into the PCI-Express slots inside the servers:
And here’s a rundown of Arrow Creek’s features:
The Arrow Creek DPU has two 100 Gb/s ports that use QSFP28 connectors and has an Agilex FPGA as its compute engine. The DPU has an E810 dual-port Ethernet controller chip that connects to eight lanes of PCI-Express 4.0 slot capacity, and the Agilex FPGA also has its own eight lanes of PCI-Express; both go back to the CPU complex on the servers via the PCI-Express bus. The Agilex FPGA has built-in Arm cores, which can perform modest compute tasks and which have five memory channels (four plus one spare, it seems) with a total of 1 GB of capacity. The FPGA part of the Agilex device has four channels of DDR4 memory with a combined capacity of 16 GB.
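To put those lane counts in perspective, here is a back-of-the-envelope sketch of how the card’s PCI-Express capacity compares with its Ethernet ports. The 16 GT/s lane rate and 128b/130b encoding are the standard PCI-Express 4.0 figures; treating the FPGA’s own eight lanes as PCI-Express 4.0 as well is our assumption, since the generation of that link was not spelled out, and the function and variable names are ours.

```python
# Rough bandwidth check for Arrow Creek: Ethernet ports versus PCI-Express links.
# Figures are per direction and ignore PCI-Express protocol (TLP/DLLP) overhead.

PCIE4_GT_PER_S = 16.0        # PCI-Express 4.0 raw signaling rate per lane
PCIE4_ENCODING = 128 / 130   # 128b/130b line encoding efficiency


def pcie4_gbps(lanes: int) -> float:
    """Return the post-encoding bandwidth of a PCI-Express 4.0 link in Gb/s."""
    return lanes * PCIE4_GT_PER_S * PCIE4_ENCODING


ethernet_gbps = 2 * 100        # two 100 Gb/s QSFP28 ports
e810_link = pcie4_gbps(8)      # x8 link behind the E810 Ethernet controller
fpga_link = pcie4_gbps(8)      # x8 link behind the Agilex FPGA (generation assumed)

print(f"Ethernet ports: {ethernet_gbps} Gb/s")
print(f"One x8 link:    {e810_link:.1f} Gb/s")
print(f"Both x8 links:  {e810_link + fpga_link:.1f} Gb/s")
```

Under these assumptions a single x8 link tops out at roughly 126 Gb/s, which is less than the 200 Gb/s the two ports can carry, while the two links together exceed it. That is consistent with a bump-in-the-wire design in which a good share of the traffic is handled on the card and never crosses the bus to the host.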
This Arrow Creek DPU is aimed specifically at network acceleration workloads, including customizable packet processing done as a “bump in the wire,” as we’ve long said about FPGA-accelerated SmartNICs. This device is programmable through the OFS and DPDK software development kits and has Open vSwitch and Juniper Contrail virtual switching as well as SRv6 and vFW stacks already implemented on its FPGA logic gates. It’s for workloads that change sometimes, but not very often, which is what we’ve been saying about FPGAs from the start.
Oak Springs Canyon is a little different, as you can see:
The feeds and speeds of the Xeon D processor have yet to be revealed, but it probably has 16 cores, as a lot of SmartNICs tend to these days. As far as we know, the Xeon D processor and the Agilex FPGA are not on the same die – Intel has been working on this for years and promised such devices as part of the Altera acquisition in 2015 – but, again as far as we know, they are integrated into one socket using EMIB interconnects. The CPU and the FPGA each have 16 GB of four-channel DDR4 memory, and they connect through the FPGA to a pair of 100 Gb/s QSFP28 ports.
The Oak Springs Canyon DPU is programmable through the OFS, DPDK and SPDK toolkits and has built-in stacks for Open vSwitch virtual switching as well as the NVM-Express over Fabrics and RoCE RDMA protocols. Obviously, this DPU aims to accelerate networking and storage and offload them from the CPU complex in the servers.
The third DPU, the Mount Evans device, is perhaps the most interesting since it was co-designed with this “first cloud provider” and has a custom Arm processor complex and a custom network subsystem integrated on the same package. Like this:
The networking subsystem has four SerDes operating at 56 Gb/s, which deliver 200 Gb/s of full-duplex bandwidth that can be split and used by four host servers. (The charts indicate that the hosts must be Xeons, but it seems unlikely that this is a requirement. Ethernet is Ethernet.) The network interface implements the RoCE v2 protocol to accelerate networking without involving the processor (as RDMA implementations do) and also has an NVM-Express offload engine so that host processors do not have to deal with this overhead either. There is a custom programmable packet processing engine, which uses the P4 programming language and which we strongly suspect is based on pieces of the “Tofino” switch ASICs from Intel’s acquisition of Barefoot Networks more than two years ago. The network subsystem has a traffic shaping logic block to improve performance and reduce latency between the network and the hosts, and there is also a logic block that performs inline IPSec encryption and decryption at line rate.
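The SerDes arithmetic is worth spelling out. The sketch below uses only the figures quoted above; the even split across hosts is our assumption (nothing says the bandwidth has to be carved up equally), and the gap between raw and delivered bandwidth presumably goes to line encoding and error correction, which is also our reading rather than anything Intel has stated. The names are ours.

```python
# Mount Evans network subsystem: raw SerDes capacity versus delivered Ethernet bandwidth.

SERDES_LANES = 4
SERDES_RAW_GBPS = 56     # raw rate per SerDes lane, as quoted
ETHERNET_GBPS = 200      # full-duplex bandwidth delivered, as quoted
MAX_HOSTS = 4            # number of host servers that can share the device

raw_total = SERDES_LANES * SERDES_RAW_GBPS      # aggregate raw signaling capacity
overhead_gbps = raw_total - ETHERNET_GBPS       # capacity not delivered as payload
overhead_fraction = overhead_gbps / raw_total   # share of the raw rate that is overhead


def per_host_share(hosts: int) -> float:
    """Bandwidth per host in Gb/s, assuming an even split (our assumption)."""
    if not 1 <= hosts <= MAX_HOSTS:
        raise ValueError(f"hosts must be between 1 and {MAX_HOSTS}")
    return ETHERNET_GBPS / hosts


print(f"Raw SerDes capacity: {raw_total} Gb/s")
print(f"Overhead: {overhead_gbps} Gb/s ({overhead_fraction:.1%} of raw)")
for hosts in range(1, MAX_HOSTS + 1):
    print(f"{hosts} host(s): {per_host_share(hosts):.1f} Gb/s each")
```

With all four hosts attached and an even split, each server would see the equivalent of a 50 Gb/s port, which is in the range you would expect for the kind of small servers that share a network device.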
The compute complex on the Mount Evans device has 16 Neoverse N1 cores licensed from Arm Holdings, which front-end an undisclosed cache hierarchy and an unusual three DDR4 memory controllers (three is not a very base-2 number). The compute complex also has a lookaside cryptography engine and a compression engine, thus offloading these two tasks from the host CPUs, and a management complex to allow out-of-band management of the DPU.
The workload is unclear, but Intel says that when it comes to the programming environment, it will “leverage and extend” the DPDK and SPDK tools, presumably with P4. We strongly suspect that Mount Evans is used in Facebook microservers. But this is only a guess. And we also strongly suspect that it won’t be available to anyone other than its target customer, which would be a shame. We hope we are wrong on this hunch.