In this article, I focus on the compilation process in hardware emulators. The design capacity of a hardware emulator, as well as its compilation flow, depends heavily on the technology underlying its verification engine. Currently, each of the three hardware emulation vendors has adopted its own architecture:
- Cadence: processor-based architecture
- Mentor: custom-FPGA-based architecture (emulator-on-chip)
- Synopsys: commercial-FPGA-based architecture
The first two are based on custom chips; the third is built using commercial FPGA boards. The compilation process is different and unique to each architecture.
Let us briefly recall the basics of a software compiler. A software compiler is a computer program or, better, a set of programs that transforms source code written in a programming language into an executable program consisting of a sequence of instructions in the form of object code. The main operations performed by a compiler include lexical analysis, syntactic analysis, semantic analysis, code optimization, and code generation.
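The classic phases recalled above can be made concrete with a toy sketch. This is purely illustrative (all names and the tiny expression grammar are hypothetical, not any production compiler): lexical analysis turns text into tokens, syntactic analysis builds a tree, and code generation emits a linear sequence of instructions.

```python
# Toy illustration of compiler phases for expressions like "a & b | c":
# lexing, parsing, and generation of an object-code-like instruction list.
import re

TOKEN_RE = re.compile(r"\s*(?:(?P<ID>[a-z]+)|(?P<OP>[&|]))")

def lex(src):
    """Lexical analysis: turn source text into a token stream."""
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if not m:
            raise SyntaxError(f"bad character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def parse(tokens):
    """Syntactic analysis: build a left-associative expression tree."""
    it = iter(tokens)
    kind, name = next(it)
    assert kind == "ID"
    tree = name
    for (_, op), (_, ident) in zip(it, it):
        tree = (op, tree, ident)
    return tree

def codegen(tree, out):
    """Code generation: emit a linear sequence of 2-input operations."""
    if isinstance(tree, str):
        return tree
    op, lhs, rhs = tree
    a, b = codegen(lhs, out), codegen(rhs, out)
    tmp = f"t{len(out)}"
    out.append((tmp, op, a, b))
    return tmp

code = []
result = codegen(parse(lex("a & b | c")), code)
# code == [('t0', '&', 'a', 'b'), ('t1', '|', 't0', 'c')]
```

A hardware compiler, as described next, replaces the final instruction sequence with a configuration of physical resources.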
Compilation into software simulators
A simulator is essentially a software algorithm running on a computer. The algorithm processes a data structure representing a design model described in a design language at one of several levels of abstraction, as illustrated in Table 1.
The compiler converts the design model into this data structure, following the guidelines of a software compiler, with a few differences.
When the output of a compiler targets computer hardware at a very low level, such as an FPGA or a structured ASIC, it is called a hardware compiler because its output effectively controls the final hardware's configuration and operation. The compiler output is not a sequence of instructions; rather, it is an interconnection of transistors or look-up tables. A quintessential example is the compiler of a hardware emulator.
Compiler in processor-based emulators
The operation of a processor-based emulator is vaguely similar to that of a software simulator. In both engines, the design-under-test (DUT) model is converted into a data structure stored in memory. In the emulator, the design data structure is processed by a calculation engine implemented as a vast array of Boolean processors, which gives this type of emulator its name.
Typically, the array is made up of relatively simple 4-input ALUs, whose count can reach into the millions in fully expanded configurations. These processors evaluate all the logic (Boolean) functions of the DUT in a time order controlled by a sequencer.
The principle of operation is shown in Figure 1 and Table 2 using an oversimplified approach.
All operations are time-stepped and assigned to processors according to a set of rules to preserve the functional integrity of the DUT.
Figure 1 The logic functions in the DUT are in a time order controlled by a sequencer.
Table 2 An emulator’s processors are responsible for evaluating the functions in the DUT in a time order controlled by a sequencer.
An emulation cycle consists of executing all the processor steps required for one complete evaluation of the design. Large designs typically require hundreds of steps. The more operations that can be performed in each step, the faster the emulation speed.
During each time step, each processor can execute any 4-input logic function, using as inputs the results of any prior calculation by any of the processors, as well as any design input or memory content. The more processors available in the emulator, the more parallelism can be achieved, with the benefit of faster emulation.
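The stepped evaluation just described can be sketched in a few lines. This is a hypothetical simplification (the schedule, operator set, and data layout are invented for illustration): in each time step a group of processors evaluates Boolean functions in parallel, and their results become available to later steps.

```python
# Hypothetical sketch of a processor-based emulation cycle: in each time
# step, every processor evaluates one Boolean function whose inputs may be
# design inputs or results computed in earlier steps by any processor.
from operator import and_, or_, xor

# schedule[step] = list of (result_name, op, input_names), as produced
# by the compiler; entries within one step run on parallel processors.
schedule = [
    [("n1", and_, ("a", "b")), ("n2", or_, ("c", "d"))],  # step 0
    [("n3", xor, ("n1", "n2"))],                          # step 1 uses step-0 results
]

def emulation_cycle(schedule, design_inputs):
    values = dict(design_inputs)       # design inputs and memory contents
    for step in schedule:              # steps execute sequentially
        results = {}
        for name, op, ins in step:     # processors in a step run in parallel
            results[name] = op(*(values[i] for i in ins))
        values.update(results)         # results visible to later steps
    return values

out = emulation_cycle(schedule, {"a": 1, "b": 1, "c": 0, "d": 0})
# out["n3"] == (1 & 1) ^ (0 | 0) == 1
```

Note how more processors per step (wider inner lists) shortens the schedule, which is exactly the parallelism-versus-speed trade-off described above.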
The task of the compiler is to partition the design among the processors and schedule the individual Boolean operations into time steps. Here are the main steps of the compilation flow:
- Map RTL code into primitive cells such as gates and registers
- Reduce the Boolean logic (gates) to 4-input functions
- Assign all design inputs and outputs to processors
- Assign each cell in the design to a processor
- Schedule the activity of each processor in sequential time steps
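The last two steps above can be sketched with a minimal levelization-and-assignment pass. This is not the vendor's algorithm, only an illustration under simple assumptions (an acyclic netlist, round-robin processor assignment):

```python
# Minimal sketch of scheduling: levelize a gate netlist into time steps
# by data dependency, then assign each step's gates to processors.

def levelize(netlist, primary_inputs):
    """netlist: {gate: [fanin names]} (assumed acyclic).
    Returns {gate: time step} where a gate's step follows all its fanins."""
    level = {i: -1 for i in primary_inputs}  # inputs ready before step 0
    remaining = dict(netlist)
    while remaining:
        for gate, fanins in list(remaining.items()):
            if all(f in level for f in fanins):
                level[gate] = max(level[f] for f in fanins) + 1
                del remaining[gate]
    return {g: s for g, s in level.items() if g in netlist}

def assign(levels, num_procs):
    """Round-robin assignment of each step's gates onto processors."""
    steps = {}
    for gate, step in sorted(levels.items(), key=lambda kv: kv[1]):
        steps.setdefault(step, []).append(gate)
    return {g: i % num_procs
            for gates in steps.values()
            for i, g in enumerate(gates)}

netlist = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"]}
levels = levelize(netlist, ["a", "b", "c"])
# levels == {"g1": 0, "g2": 0, "g3": 1}
procs = assign(levels, num_procs=2)  # g1, g2 run in parallel in step 0
```

A real compiler also balances processor load across steps and accounts for communication between processors, which this sketch ignores.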
The compiler optimizes processor scheduling for any given emulator configuration to maximize speed and capacity. It must also handle several additional issues, such as three-state bus modeling, memory modeling, debug probing, triggering, and other factors. The main tasks listed above, however, characterize the compilation flow of a processor-based emulator.
What the compiler does not have to deal with is the internal timing of an array of FPGAs, one of the most critical issues for FPGA-based emulators. Timing control in an assembly of FPGAs is difficult, unpredictable, and a path to failure unless care and engineering skill go into the compiler's design.
For this reason, processor-based emulators compile much faster and with fewer resources. Despite the Boolean optimization it performs, the compiler retains all the RTL net names originally designated for use in debugging. This allows users to debug with familiar signal names.
Compiler in FPGA-based emulators
The compiler of an FPGA-based emulator, whether based on custom or commercial FPGAs, is quite different from that of the CPU-based emulator. It is more complex and includes additional steps not required in its counterpart. The differences between custom and commercial FPGA emulators relate to some of the steps, but not the overall flow, as highlighted later.
The main tasks of the compiler include RTL parsing, synthesis, netlist partitioning, timing analysis, clock mapping, memory mapping, board placement and routing, and FPGA placement and routing. The goal is to map the SoC RTL onto a fully placed, fully routed, timing-correct system of tens or even hundreds of FPGAs.
The similarities with the processor-based emulator end with the mapping of the RTL code into a netlist of gates and registers.
Once the design is synthesized, the netlist is partitioned across an array of FPGAs that implement the DUT. This is a critical step: assigning unequal blocks of logic to one or a few FPGAs can cause a large increase in interconnect that requires pin multiplexing at factors of 16x or 32x, or perhaps even more. The impact on emulation speed can then be severe.
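The speed penalty of pin multiplexing can be seen with back-of-the-envelope arithmetic. The numbers below are illustrative, not vendor data:

```python
# Illustrative arithmetic: when a partition cut needs more inter-FPGA
# signals than there are physical pins, signals are time-multiplexed,
# and each emulated cycle stretches by roughly the multiplexing factor.
import math

def mux_factor(cut_signals, physical_pins):
    """Smallest time-multiplexing ratio that fits the cut onto the pins."""
    return math.ceil(cut_signals / physical_pins)

def emulation_mhz(base_mhz, cut_signals, physical_pins):
    """Crude speed estimate: base clock divided by the mux factor."""
    return base_mhz / mux_factor(cut_signals, physical_pins)

# A balanced partition might need a 2x mux; a poor one, 30x.
good = emulation_mhz(10.0, cut_signals=1500, physical_pins=800)   # -> 5.0 MHz
bad = emulation_mhz(10.0, cut_signals=24000, physical_pins=800)   # -> ~0.33 MHz
```

This is why a partitioner that minimizes cut size matters so much to emulation speed.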
Similarly, a timing-blind partitioner can create long critical paths on combinational signals by routing them through multiple FPGAs, generating hops that are detrimental to emulation speed. An effective partitioner must include an accurate timing-analysis capability to identify such long critical paths and avoid these hops.
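What a timing-aware partitioner must estimate can be sketched as a hop count over a candidate gate-to-FPGA assignment. All names and delay figures here are hypothetical:

```python
# Sketch of timing-aware partition evaluation: count the inter-FPGA
# crossings (hops) a combinational path accumulates, and weigh them
# against on-chip gate delay. Delay numbers are invented for illustration.

def path_hops(path, placement):
    """Count FPGA-boundary crossings along a combinational path."""
    return sum(1 for a, b in zip(path, path[1:])
               if placement[a] != placement[b])

def path_delay_ns(path, placement, gate_ns=1.0, hop_ns=15.0):
    """Each hop costs far more than a gate, so hops dominate the path."""
    return len(path) * gate_ns + path_hops(path, placement) * hop_ns

path = ["g1", "g2", "g3", "g4"]
bad_cut = {"g1": 0, "g2": 1, "g3": 0, "g4": 1}    # zig-zags: 3 hops
good_cut = {"g1": 0, "g2": 0, "g3": 0, "g4": 1}   # 1 hop
# path_delay_ns(path, bad_cut) == 49.0 vs path_delay_ns(path, good_cut) == 19.0
```

Keeping a path's gates on one FPGA whenever possible is the essence of avoiding the hops described above.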
The need to effectively map clocks poses an even greater challenge. Modern designs can use hundreds of thousands of derived clocks spread across hundreds of FPGAs. Designers reduce power consumption by using complex clock gating strategies. A significant amount of effort goes into the compiler’s ability to manage these clocks well.
Finally, the FPGAs must be placed and routed.
As with the processor-based emulator, the compiler performs additional tasks that are beyond the scope of this article.
The compiler of an FPGA-based emulator requires state-of-the-art synthesis, partitioning, timing analysis, clock mapping, and placement and routing technologies.
So what are the differences in the compilation flow between a custom FPGA-based emulator and a commercial FPGA-based emulator?
I have written elsewhere about the architectural differences between the two approaches. In the custom FPGA used by Mentor Graphics in its Veloce2 emulator-on-chip platform, the interconnect network among the programmable elements inside each FPGA and between FPGAs follows a patented architecture that guarantees predictable, repeatable, fast, and congestion-free routing. It removes placement constraints and ensures simple routing and fast compilation. Clock trees are hardwired on dedicated paths independent of the data paths, leading to predictable and repeatable timing, and preventing hold-time violations by design because data paths are always longer than clock paths. Unpredictable timing and hold-time violations, by contrast, plague commercial FPGAs.
The custom FPGA architecture used in the emulator-on-chip ensures a high level of reprogrammable-resource utilization. It is designed to ease the partitioning of a large design across an array of these custom FPGAs, and it enables rapid place and route (P&R). One spec says it all: placing and routing a custom FPGA takes about five minutes, nearly 100 times faster than P&R of a leading commercial FPGA. However, it is not intended as a general-purpose FPGA, since its capacity is lower than that of the largest commercial FPGAs.
Table 3 compares compile times between the three main types of emulation platforms.
Table 3 A comparison of compile times in processor-based, emulator-on-chip, and commercial FPGA-based emulators, based on vendor datasheets.
Compiling a design is a time-consuming process that depends heavily on the size and complexity of the design. To speed up the task, the process is highly parallelized across multiple threads that can run concurrently on banks of PCs. This parallelization adds another dimension to the already difficult task of designing a compiler (Figure 2).
Figure 2 The compilation flow of an FPGA-based emulation system includes many steps, from synthesis to partitioning to placement and routing, which can be parallelized to speed up the process.
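The parallelization just described exploits the fact that, once the design is partitioned, the per-FPGA place-and-route jobs are independent. A minimal sketch (the job function is a hypothetical stand-in; a real flow dispatches vendor P&R tools to a compute farm):

```python
# Sketch of parallel per-FPGA compilation: independent P&R jobs are
# dispatched concurrently. The job body is a placeholder for a real
# invocation of FPGA place-and-route tools.
from concurrent.futures import ThreadPoolExecutor

def place_and_route(fpga_id):
    """Stand-in for one per-FPGA P&R job."""
    return (fpga_id, "routed")

def compile_partitions(fpga_ids, workers=4):
    # Jobs have no dependencies on one another, so they run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(place_and_route, fpga_ids))

results = compile_partitions(range(8))
# all eight per-FPGA jobs complete independently
```

Wall-clock compile time then scales with the slowest single FPGA job rather than the sum of all jobs, which is why per-FPGA P&R speed (see the custom-FPGA discussion above) dominates the overall flow.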
In summary, each of the three types of hardware emulators presents its own set of challenges to the hardware compiler. The processor-based emulator's compiler is simpler and runs faster than those of FPGA-based emulators. Thanks to its architecture, the compiler of the custom FPGA-based emulator avoids most of the pitfalls of its commercial FPGA-based counterpart, at a compilation speed not far from that of the fastest processor-based emulator.
Dr. Lauro Rizzatti is a verification consultant and industry expert in hardware emulation.