To increase the performance of a CPU, so that it executes these instructions faster and with lower latency, the obvious answer is to increase clockspeed and thus complete the 'fetch-decode-execute-store' loop faster. That is quite viable and frequently done, but it can only go so far. So once we reach the point where we simply can't make the CPU run any faster, why not give it less to do per cycle? Instead of fetching, decoding, executing and storing all within one clock cycle, suppose we break the loop into four steps: fetch, decode, execute and store are each done in a single, much shorter clock cycle. We’ve now created something called a 4-stage pipeline, which effectively quadruples clockspeed.
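However, this pipelined CPU will not be any faster than the original one, as it takes the same time to finish each instruction. Its IPC (instructions per clock cycle) rating drops to 0.25 while its clock runs four times as fast, so the two designs complete the same number of instructions per second and thus perform identically.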
In reality, the CPU does not need to finish one complete fetch-decode-execute-store loop before starting the next; for example, why wait to fetch the next instruction until the first instruction has been stored? Simply start fetching the next instruction right away, while the first one is being decoded. As a result, only the first instruction requires four clock cycles; after that, one instruction is finished every clock cycle.
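The overlap described above can be sketched in a few lines of Python. This is only an illustration of an idealized pipeline in which no instruction ever has to wait for another; the names `STAGES`, `pipeline_schedule` and `completion_cycle` are made up for this example and do not come from any real tool.

```python
# Idealized 4-stage pipeline: each instruction enters "fetch" one cycle
# after the previous one. Assumes no stalls of any kind.
STAGES = ["fetch", "decode", "execute", "store"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Map each clock cycle to the (instruction, stage) pairs active in it."""
    schedule = {}
    for instr in range(1, num_instructions + 1):
        for offset, stage in enumerate(stages):
            cycle = instr + offset  # instruction n is fetched in cycle n
            schedule.setdefault(cycle, []).append((instr, stage))
    return schedule


def completion_cycle(instr, num_stages=len(STAGES)):
    """Cycle in which instruction number `instr` (1-based) leaves the last stage."""
    return instr + num_stages - 1


if __name__ == "__main__":
    for cycle, work in sorted(pipeline_schedule(5).items()):
        busy = ", ".join(f"I{n}:{stage}" for n, stage in work)
        print(f"cycle {cycle}: {busy}")
```

From the fourth cycle onward every stage is busy with a different instruction, which is why one instruction completes per clock once the pipeline has filled.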
In other words, after 100 clock cycles our 4-stage pipelined CPU will actually have completed 97 instructions: four cycles for the first instruction, then one instruction per clock for the remaining 96 clocks, rather than the 25 it managed earlier without the overlapped fetch. This gives the 4-stage CPU an IPC rating of about 0.97 instructions per clock cycle, much better than 0.25, but still slightly less than the 1.0 IPC of the non-pipelined CPU. Although the IPC rating is about 3% lower than that of our non-pipelined CPU, the clock runs four times as fast, so our 4-stage CPU is actually a much faster design; in fact, it is roughly 4 x 0.97 = 3.88 times faster.
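The arithmetic for the 100-cycle example can be checked with a short Python sketch. It assumes the idealized case of this article: a pipeline that never stalls, and a 4-stage design clocked exactly four times as fast as the non-pipelined one. The function names are hypothetical and chosen only for this illustration.

```python
def completed_instructions(cycles, stages=4, overlapped=True):
    """Instructions finished after `cycles` clocks on an idealized pipeline."""
    if cycles < stages:
        return 0  # the first instruction has not left the pipeline yet
    if overlapped:
        # First instruction needs `stages` cycles, then one finishes per cycle.
        return cycles - stages + 1
    # Without overlap, every instruction occupies the CPU for `stages` cycles.
    return cycles // stages


def ipc(cycles, stages=4, overlapped=True):
    """Average instructions per clock cycle over the given window."""
    return completed_instructions(cycles, stages, overlapped) / cycles


if __name__ == "__main__":
    cycles = 100
    clock_multiplier = 4  # assumption: 4-stage design runs at 4x the clock
    print("no overlap:", completed_instructions(cycles, overlapped=False),
          "instructions, IPC", ipc(cycles, overlapped=False))
    print("overlapped:", completed_instructions(cycles),
          "instructions, IPC", ipc(cycles))
    # Speedup relative to a non-pipelined CPU with an IPC of 1.0.
    print("speedup:", round(clock_multiplier * ipc(cycles), 2))
```

Working from the exact figure of 97 completed instructions, the IPC comes out at 0.97 and the speedup at about 3.88. The longer the window, the closer the IPC gets to 1.0, since the cost of filling the pipeline is paid only once.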
This has been one of the most important motivations behind Intel's design of the Pentium 4 micro-architecture, as the P6 architecture could not be made to run much faster than 1 GHz without extensive rework of its fundamentals. One of the most prominent features of the Pentium 4 architecture is therefore its deep 20-stage pipeline, implemented to reduce the work done in each stage and so increase the clockspeed scalability of the architecture.