Intel's Pentium 4, a closer look
May 17, 2001, 12:00pm EDT

Branch Prediction Cont.
By: Sander Sassen
To recap, the instruction latency is the number of clock cycles required to process an instruction, that is, the total time an instruction spends in the execution units; a longer pipeline means longer instruction latencies. To avoid the performance penalty of fetching every new instruction from main memory, all modern x86 CPUs use instruction buffers. A longer pipeline and the resulting higher execution latency increase the time any instruction waits to be processed, so a deeply pipelined CPU needs a larger buffer than a less pipelined one. Accordingly, the Pentium 4 has a larger instruction buffer and can handle more than 100 instructions in flight.
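The link between pipeline depth and buffer size can be sketched with a back-of-the-envelope estimate: to keep the pipeline busy, roughly latency times issue rate instructions have to be in flight at once. The function name, the stage counts and the issue width below are illustrative assumptions, not figures from this article or from Intel.

```python
def instructions_in_flight(latency_cycles, issued_per_cycle):
    """Rough number of instructions that must be buffered to keep the pipeline busy.

    Simple estimate: every cycle `issued_per_cycle` instructions enter the
    pipeline, and each one stays in it for `latency_cycles` cycles.
    """
    return latency_cycles * issued_per_cycle


ISSUE_WIDTH = 3  # assumed issue rate, for illustration only

# Hypothetical shallow pipeline versus a pipeline twice as deep.
shallow = instructions_in_flight(latency_cycles=10, issued_per_cycle=ISSUE_WIDTH)
deep = instructions_in_flight(latency_cycles=20, issued_per_cycle=ISSUE_WIDTH)

print(shallow, deep)  # 30 60
```

Under these assumptions, doubling the pipeline depth doubles the number of instructions the buffer has to hold, which is the reason a deeper pipeline calls for a larger buffer.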
However, an even more compelling feature takes the 'decode' section out of the fetch-decode-execute-store loop. The Pentium 4 has a Trace Cache, which stores already decoded instructions in execution sequence. For example, if instruction A at location 100 jumps to instruction B at location 200, the trace cache will store B right behind A. This simplifies processing: code served from the trace cache skips the decode section, which shortens execution and reduces execution latency.
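The jump example can be illustrated with a toy model that records instructions in the order they execute rather than in address order. This is only a sketch of the idea, not how the hardware is implemented; the program table, the instruction names X and C, the 4-byte step and the function `build_trace` are all hypothetical.

```python
# Hypothetical toy program: address -> (instruction name, jump target or None).
PROGRAM = {
    100: ("A", 200),   # A at location 100 jumps to location 200
    104: ("X", None),  # never executed here, because A jumps over it
    200: ("B", None),  # B falls through to the next address
    204: ("C", None),
}


def build_trace(program, start, length, step=4):
    """Follow execution from `start` and record instructions in the order they run."""
    trace = []
    pc = start
    while pc in program and len(trace) < length:
        name, target = program[pc]
        trace.append(name)
        # Taken jump: continue at the target; otherwise fall through.
        pc = target if target is not None else pc + step
    return trace


trace = build_trace(PROGRAM, start=100, length=3)
print(trace)  # ['A', 'B', 'C']
```

In the recorded trace B sits directly behind A even though the two are 100 addresses apart in memory, and the skipped instruction X does not appear at all.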
Fig 1. Pentium 4 NetBurst Micro-Architecture; notice the absence of the decode unit.