To recap, the number of clock cycles required to process an instruction, or the total time an instruction spends in the execution units, is the instruction latency; a longer pipeline means longer instruction latencies. To minimize the performance penalty of fetching every new instruction from main memory, all modern x86 CPUs use instruction buffers. A longer pipeline and the increased execution latency that comes with it lengthen the time each instruction waits to be processed, so a deeply pipelined CPU needs a larger buffer than a less pipelined one. The Pentium 4 accordingly has a larger instruction buffer and can keep more than 100 instructions in flight.
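As a rough illustration of why a deeper pipeline calls for a larger buffer, the sketch below estimates how many instructions must be in flight to keep every stage busy. The function name, the stage counts and the issue width are hypothetical values chosen for the example, not Pentium 4 specifications, and the model ignores stalls and mispredictions.

```python
def in_flight_capacity(pipeline_stages: int, issue_width: int) -> int:
    """Rough estimate: instructions needed in flight to keep the pipeline full.

    Assumes `issue_width` instructions enter the pipeline every cycle and
    each one occupies the pipeline for `pipeline_stages` cycles.
    """
    return pipeline_stages * issue_width


# Hypothetical figures for illustration only.
short_pipeline = in_flight_capacity(pipeline_stages=10, issue_width=3)
deep_pipeline = in_flight_capacity(pipeline_stages=20, issue_width=3)

print(short_pipeline, deep_pipeline)
```

Under these assumptions, doubling the pipeline depth doubles the number of instructions the buffer has to track.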
However, an even more compelling feature takes the 'decode' step out of the fetch-decode-execute-store loop. The Pentium 4 has a Trace Cache, which stores already-decoded instructions in execution sequence. For example, if instruction A at location 100 jumps to instruction B at location 200, the trace cache stores B right behind A. This simplifies processing: instructions found in the trace cache do not have to be decoded again, which shortens execution and reduces execution latency.
Fig 1. Pentium 4 NetBurst Micro-Architecture. Notice the absence of the decode unit.
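The A-to-B example above can be sketched as a toy model. This is a minimal illustration of the idea, not a description of the actual hardware: the addresses, the instruction labels, and the names `PROGRAM`, `build_trace` and `TraceCache` are all hypothetical.

```python
# Toy program: address -> (label, jump target or None for fall-through).
# All addresses and labels are made up for illustration.
PROGRAM = {
    100: ("A: jump to 200", 200),
    101: ("X: skipped by the jump", None),
    200: ("B: add", None),
    201: ("C: store", None),
}


def build_trace(program, start, max_len=6):
    """Follow execution from `start`, recording instructions in executed order."""
    trace = []
    addr = start
    while addr in program and len(trace) < max_len:
        label, target = program[addr]
        trace.append((addr, label))
        # A jump continues at its target; anything else falls through.
        addr = target if target is not None else addr + 1
    return trace


class TraceCache:
    """Keeps traces keyed by start address so they are built only once."""

    def __init__(self):
        self.traces = {}
        self.builds = 0  # counts how often a trace had to be built (a miss)

    def fetch(self, program, start):
        if start not in self.traces:
            self.builds += 1
            self.traces[start] = build_trace(program, start)
        return self.traces[start]


cache = TraceCache()
first = cache.fetch(PROGRAM, 100)
second = cache.fetch(PROGRAM, 100)  # served from the cache, no rebuild

for addr, label in first:
    print(addr, label)
```

In this model B (location 200) sits directly behind A (location 100) in the stored trace, and the second fetch reuses the stored trace instead of building it again.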