e600
The e600 is a reduced instruction set computer (RISC) core that implements the PowerPC architecture. It includes separate 32-Kbyte L1 instruction and data caches and a 1-Mbyte L2 cache. The core is a high-performance design with multiple execution units, including four independent units that execute AltiVec instructions.
The e600 core implements the 32-bit portion of the PowerPC architecture, which provides 32-bit effective addresses, integer data types of 8, 16, and 32 bits, and floating-point data types of 32 and 64 bits. The core supports up to 4 petabytes (2^52 bytes) of virtual memory and up to 64 gigabytes (2^36 bytes) of physical (real) memory.
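As a quick check of these figures, a 52-bit virtual address spans 2^52 bytes and a 36-bit physical address spans 2^36 bytes. The sketch below assumes binary (1024-based) units; the function and constant names are illustrative only.

```python
def address_space_bytes(address_bits: int) -> int:
    """Number of bytes addressable with the given number of address bits."""
    return 1 << address_bits

PIB = 1 << 50  # bytes per binary petabyte (pebibyte)
GIB = 1 << 30  # bytes per binary gigabyte (gibibyte)

virtual_pib = address_space_bytes(52) // PIB   # 52-bit virtual address
physical_gib = address_space_bytes(36) // GIB  # 36-bit physical address
print(virtual_pib, "PiB virtual,", physical_gib, "GiB physical")
```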
The e600 core also implements the AltiVec instruction set architectural extension. The e600 core can dispatch and complete three instructions simultaneously. It incorporates the following execution units:
- 64-bit floating-point unit (FPU)
- Branch processing unit (BPU)
- Load/store unit (LSU)
- Four integer units (IUs):
  - Three shorter latency IUs (IU1a–IU1c): execute all integer instructions except multiply, divide, and move to/from special-purpose register (SPR) instructions.
  - Longer latency IU (IU2): executes miscellaneous instructions including condition register (CR) logical operations, integer multiplication and division instructions, and move to/from SPR instructions.
- Four vector units that support AltiVec instructions:
  - Vector permute unit (VPU)
  - Vector integer unit 1 (VIU1): performs shorter latency integer calculations
  - Vector integer unit 2 (VIU2): performs longer latency integer calculations
  - Vector floating-point unit (VFPU)
The ability to execute several instructions in parallel and the use of simple instructions with rapid execution times yield high efficiency and throughput for e600-based systems. Most integer instructions (including VIU1 instructions) have an execution latency of one clock cycle.
Several execution units feature multiple-stage pipelines; that is, the tasks they perform are broken into subtasks executed in successive stages. Typically, instructions follow one another through the stages, so a four-stage unit can work on four instructions when its pipeline is full. Thus, although each instruction may take several cycles to pass through all the stages, a full pipeline can complete one instruction per clock cycle.
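The relationship between latency and throughput can be illustrated with a small calculation. This is a minimal sketch of an idealized pipeline with no stalls or dependencies, not a model of any specific e600 unit; the function name is hypothetical.

```python
def cycles_to_complete(num_instructions: int, stages: int) -> int:
    """Cycles needed by an ideal pipeline: fill it once, then finish one instruction per cycle."""
    if num_instructions <= 0:
        return 0
    return stages + (num_instructions - 1)

# A single instruction still needs all four stages.
single = cycles_to_complete(1, stages=4)

# 100 back-to-back instructions overlap in the pipeline,
# instead of taking 100 * 4 cycles one after another.
pipelined = cycles_to_complete(100, stages=4)
print(single, pipelined)
```

As the instruction count grows, the average cost approaches one cycle per instruction even though each instruction spends four cycles in the unit.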
AltiVec computational instructions are executed in four independent, pipelined AltiVec execution units. A maximum of two AltiVec instructions can be issued out-of-order to any combination of AltiVec execution units per clock cycle from the bottom two VIQ entries (VIQ1–VIQ0). This means an instruction in VIQ1 does not have to wait for an instruction in VIQ0 that is waiting for operand availability. Moreover, the VIU2, VFPU, and VPU are pipelined, so they can operate on multiple instructions. The VPU has a two-stage pipeline; the VIU2 and VFPU each have four-stage pipelines. As many as ten AltiVec instructions can be executing concurrently.
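The issue rule for the bottom two VIQ entries can be sketched as follows. This is a simplified illustration, not the core's actual issue logic: the tuple representation is hypothetical, and the sketch assumes that each unit accepts at most one new instruction per cycle and ignores whether a unit is otherwise busy.

```python
from typing import List, Tuple

# Each queue entry is (unit_name, operands_ready); index 0 is VIQ0, the oldest entry.
Entry = Tuple[str, bool]

def issue_from_viq(viq: List[Entry]) -> List[int]:
    """Return the indices (0 for VIQ0, 1 for VIQ1) of entries that issue this cycle."""
    issued = []
    claimed_units = set()
    for index, (unit, ready) in enumerate(viq[:2]):  # only VIQ0 and VIQ1 are candidates
        # Assumption: at most one instruction is issued to a given unit per cycle.
        if ready and unit not in claimed_units:
            issued.append(index)
            claimed_units.add(unit)
    return issued

# VIQ0 is waiting for operands, so VIQ1 issues ahead of it (out of order).
print(issue_from_viq([("VFPU", False), ("VPU", True)]))
# Both entries are ready and target different units, so both issue.
print(issue_from_viq([("VIU1", True), ("VIU2", True)]))
```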
Note that for the e600 core, double- and single-precision versions of floating-point instructions have the same latency. For example, a floating-point multiply-add instruction takes 5 cycles to execute, regardless of whether it is single (fmadds) or double precision (fmadd).
The e600 core has independent on-chip, 32-Kbyte, eight-way set-associative, physically addressed L1 (level one) caches for instructions and data, and independent instruction and data memory management units (MMUs). Each MMU has a 128-entry, two-way set-associative translation lookaside buffer (DTLB and ITLB) that saves recently used page address translations. Block address translation is implemented with the eight-entry instruction and data block address translation (IBAT and DBAT) arrays defined by the PowerPC architecture. During block translation, effective addresses are compared simultaneously with all BAT entries.
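The set counts implied by these figures follow from dividing the number of entries by the associativity. In the sketch below, the TLB result uses only numbers stated above; the L1 result additionally assumes a 32-byte cache line, which the text does not state.

```python
def num_sets(total_entries: int, ways: int) -> int:
    """Number of sets in a set-associative structure with the given entry count and ways."""
    if total_entries % ways != 0:
        raise ValueError("entry count must be a multiple of the associativity")
    return total_entries // ways

# 128-entry, two-way set-associative TLB (figures from the text).
tlb_sets = num_sets(128, ways=2)

# 32-Kbyte, eight-way L1 cache.
ASSUMED_LINE_BYTES = 32  # assumption: line size is not given in the text
l1_lines = (32 * 1024) // ASSUMED_LINE_BYTES
l1_sets = num_sets(l1_lines, ways=8)
print(tlb_sets, l1_lines, l1_sets)
```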
The L2 cache is implemented with an on-chip, 1-Mbyte, eight-way set-associative physically addressed memory available for storing data, instructions, or both. The L2 cache supports parity generation and checking for both tags and data. If ECC is disabled, it responds with an 11-cycle load latency for an L1 miss that hits in L2; if ECC is enabled, the L2 load access time is 12 cycles. The L2 cache is fully pipelined for two-cycle throughput.
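The latency and throughput figures can be combined into a rough estimate for a run of back-to-back L2 hits. This is a simplified model and an assumption on our part: it supposes the first load returns after the full load latency and each further pipelined hit completes two cycles after the previous one, ignoring all other sources of delay.

```python
L2_LATENCY_ECC_OFF = 11    # cycles, L1 miss that hits in L2, ECC disabled
L2_LATENCY_ECC_ON = 12     # cycles, same access with ECC enabled
L2_CYCLES_PER_ACCESS = 2   # two-cycle throughput when fully pipelined

def l2_burst_cycles(num_hits: int, ecc_enabled: bool) -> int:
    """Estimated cycles for a run of consecutive L2 hits under the simplified model."""
    if num_hits <= 0:
        return 0
    latency = L2_LATENCY_ECC_ON if ecc_enabled else L2_LATENCY_ECC_OFF
    return latency + L2_CYCLES_PER_ACCESS * (num_hits - 1)

print(l2_burst_cycles(1, ecc_enabled=False), l2_burst_cycles(4, ecc_enabled=True))
```

Under this model, enabling ECC adds one cycle to the whole run rather than one cycle per access, because the extra latency is overlapped by pipelining.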
The e600 core has three power-saving modes (nap, sleep, and deep sleep) that progressively reduce power dissipation. When functional units are idle, a dynamic power management mode causes those units to enter a low-power mode automatically without affecting operational performance, software execution, or external hardware.
The performance monitor facility provides the ability to monitor and count predefined events such as processor clocks, misses in the instruction cache, data cache, or L2 cache, types of instructions dispatched, mispredicted branches, and other occurrences. The count of such events (which may be an approximation) can be used to trigger the performance monitor exception.