  Pentium4 L1 cache (Dan or Sander) 
 
futureman Jun 20, 2001, 09:46pm EDT
Any thoughts about the pitifully small 8kb L1 cache on the P4?

What kind of speed increases in legacy apps would be seen by having at least 32kb like the PIII, or an L1 the size of the Athlon's?

I really think that this is one area where Intel blew it, whether by design or because of die-size limits on the .18u process.

Will the L1 size be increased in any further implementations of the P4?


Dan Mepham Jun 20, 2001, 10:04pm EDT
>> [No Subject]
I'm not sure I'd say they 'blew it', but rather that they made a tough call, and I'm not sure how it would've worked out the other way (I suppose only a few engineers at Intel know that for sure).

First off, it's not technically correct to say the entire L1 cache is 8kb where the P3's is 32kb. That 32kb on the P3 consists of a 16kb Instruction cache and a 16kb Data cache. On the P4, the 8kb figure refers to the Data cache only. There is an 'Instruction cache' as well (although it's not called that; Intel calls it the Execution Trace Cache, and it holds decoded micro-ops rather than raw x86 instructions), which brings the overall cache size up to probably around 16kb.

Anyway, the reason the D-cache is so small is that it's fast. The P3's and Athlon's L1 D-caches have a 3-cycle load-to-use latency. The P4's L1 D-cache, by contrast, is only 2 cycles, meaning you get at the data that much faster. Unfortunately, the problem with memory in general is that you can't have both fast and big at the same time. You either get big and slow (Athlon, P3), or small and fast (P4). The question is, why didn't Intel make the L1 larger? The answer is that they probably CAN'T, at least not while maintaining the 2-cycle latency. If you make the cache bigger, it (a) takes longer to look up your data, and (b) spreads the cache over more die area, which means signals have to travel further. That may not seem like a big deal, but when you're talking about a CPU that operates at 1,700,000,000 cycles per SECOND, the time signals take to cross the silicon becomes a factor.
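
To put rough numbers on the small-and-fast versus big-and-slow trade-off, here is a minimal Python sketch of the usual average-access-time arithmetic. The 2-cycle and 3-cycle latencies come from the post above; the hit rates and the 7-cycle miss penalty are made-up values for illustration only, not measurements of any real CPU.

```python
def average_load_cycles(l1_latency, l1_hit_rate, miss_penalty):
    """Average cycles per load: every load pays the L1 latency, misses pay extra."""
    return l1_latency + (1.0 - l1_hit_rate) * miss_penalty


def break_even_hit_rate_gain(extra_latency, miss_penalty):
    """Hit-rate improvement a slower cache needs just to match the faster one."""
    return extra_latency / miss_penalty


# Hypothetical hit rates and miss penalty, chosen only for illustration.
small_fast = average_load_cycles(l1_latency=2, l1_hit_rate=0.92, miss_penalty=7)
big_slow = average_load_cycles(l1_latency=3, l1_hit_rate=0.96, miss_penalty=7)

print(f"2-cycle L1: {small_fast:.2f} cycles per load")
print(f"3-cycle L1: {big_slow:.2f} cycles per load")
print(f"break-even hit-rate gain: {break_even_hit_rate_gain(1, 7):.3f}")
```

Under these assumed numbers the smaller cache still wins, because one extra cycle on every load outweighs the misses it avoids. With a larger miss penalty or a bigger hit-rate gap the result can flip, which is why this is a tough call rather than an obvious one.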

Basically what Intel did can be likened to what they did with the Katmai L2 versus the Coppermine L2. They made it smaller, and faster (lower latency).

Was that the right call? I don't know. I don't have a P4 with 32kb of L1 to compare it to. :-) But I'd be inclined to believe that Intel's engineers know what they're doing, and that it's probably the best option more often than not.

Hope that helped. And I'm sure if I missed anything Sander will happily jump in. ;-)
Dan Mepham

Editor in Chief, Hardware Analysis
Email : dmepham@hardwareanalysis.com
Visit us at : http://www.hardwareanalysis.com

Dan Mepham Jun 20, 2001, 10:06pm EDT
>> [No Subject]
And as to whether they'll increase the size later - I have no idea (I'm not privy to any plans or anything). Maybe later...but not on 0.18um, they won't. But once they get down to 0.13 or smaller, it might become doable. I'm sure someone at Intel has at least considered it, so we'll just have to wait and see.

Robert Kropiewnicki Jun 20, 2001, 11:43pm EDT
>> Regarding Coppermine cache...
Dan,

I thought the Katmai to Coppermine cache design change was the same as when AMD went from the original Athlon core to the Thunderbird core......

That is, the cache went from off-die, running at some fraction of the processor speed, to on-die, running at the same speed as the processor.

Granted, your conclusion still sort of holds true with regard to a smaller cache operating at higher speed, but the parallel to the Pentium 4 seems sketchy. Your original explanation of the design decision (before making that parallel) sounds plausible, though.

Dan Mepham Jun 20, 2001, 11:48pm EDT
>> [No Subject]
Yes, the Katmai -> Coppermine was pretty much the same as K7 -> Thunderbird. (Intel actually made some changes beyond what AMD did, but same general idea). But yes, you're right ... it's not the same thing, I was just using that as an example of smaller and faster versus larger and slower.

In general, you can't have very large, very fast memory (at least not practically/affordably). For Intel to have made the L1 D-cache larger, it would have had to be made slower as well. Can't have your cake and eat it too. :-)

Robert Kropiewnicki Jun 20, 2001, 11:53pm EDT
>> Just out of curiosity....
Do you remember what other changes there were between Katmai and Coppermine?

I remember the change in the cache (both size and move from off die to on die), but I don't recall much else of note.

I think both Intel and AMD went from the .25 to the .18 micron process at that point in their respective processors' timelines. But I can't really remember anything more significant.

Dan Mepham Jun 21, 2001, 12:00am EDT
>> [No Subject]
AMD was on 0.18um with the K7 even before the switch. For them, the change was just introducing the cache onto the die, and nothing more. 512kb of external to 256kb of internal, that's it. Actually, now that I think about it, the K7s used an inclusive cache, which means that the L1 is mirrored in the L2. So effectively the L2 on a K7 was only 384kb, since 128kb of it was data mirrored from the L1. The Thunderbird switched to an exclusive cache, where the L1 isn't mirrored in the L2 (for obvious reasons... mirroring 128kb in a 256kb L2 would cost you half the cache size).
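
The inclusive versus exclusive arithmetic above can be sketched in a few lines of Python. This assumes the simple worst case described in the post, where an inclusive L2 always holds a full copy of the L1; the function names are made up for illustration.

```python
def unique_l2_kb(l1_kb, l2_kb, inclusive):
    """L2 capacity left for data that is not already sitting in the L1."""
    # Assumption: an inclusive L2 always mirrors the entire L1.
    return l2_kb - l1_kb if inclusive else l2_kb


def total_unique_kb(l1_kb, l2_kb, inclusive):
    """Distinct data the L1 and L2 can hold between them."""
    return l1_kb + unique_l2_kb(l1_kb, l2_kb, inclusive)


# Cache sizes quoted in the post: 128kb L1, 512kb (K7) or 256kb (Thunderbird) L2.
print("K7, inclusive:          ", unique_l2_kb(128, 512, True), "kb of unique L2")
print("Thunderbird, exclusive: ", unique_l2_kb(128, 256, False), "kb of unique L2")
print("Thunderbird if inclusive:", unique_l2_kb(128, 256, True), "kb of unique L2")
```

The last line is the case the Thunderbird avoids: with a mirrored L1, only half of the 256kb L2 would hold anything new.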

Both the Katmai and Coppermine use inclusive caches, which is fine, since their L1s are only 32kb. The big difference with the Coppermine was that Intel expanded the L2 bus width to 256 bits (K7, Tbird, and Katmai are all 64-bit), which basically quadrupled the L2 bandwidth per clock. Coppermine's cache structure is still one of the best in the business, IMHO. There were also some more minor changes: Intel increased the number of fillback buffers, write buffers, etc.

That's all I can think of right now. :-)

Robert Kropiewnicki Jun 21, 2001, 12:03am EDT
>> Ahhh, that's right
Forgot about the L2 bus widening for Coppermine......yep, the AMD faithful have been pining away for it for a long time now.

Still, I'm glad that they put in two of the main things that were really needed for Palomino to be a performance success, hardware prefetch and full SSE compatibility.

Dan Mepham Jun 21, 2001, 12:07am EDT
>> [No Subject]
It's possible that a 256-bit L2 bus wouldn't help the Athlon at all ... who knows. Maybe that's why AMD never bothered.

But yes, the prefetch and SSE seem to be helping Palomino nicely. Sander should be getting a couple in shortly... hopefully we can run some tests to see the impact each has on its own. I'm curious as to what's actually going on.

Robert Kropiewnicki Jun 21, 2001, 09:26am EDT
>> What's going on.....
Hardware prefetch, as far as I'm concerned, was the most important new feature in the Palomino core. Lack of it was a major reason why neither the Thunderbird core nor the P3 was able to gain any real-world performance from DDR. Note that the P4 does have it.

That brings me to another point......with DDR barely carrying any price premium over SDR, why is Intel pushing the Northwood on SDR first? I'd need someone more knowledgeable than I to go over the numbers, but it seems to me that the Northwood P4 on an SDR system is going to be bandwidth starved.
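
As a rough back-of-the-envelope check on the "bandwidth starved" concern, peak bandwidth is bus width times clock times transfers per clock. The figures below are the commonly quoted nominal specs (PC133 SDR, DDR266, and a quad-pumped 100 MHz P4 front-side bus); they are assumptions brought in for illustration, not numbers from this thread, and real sustained bandwidth is lower than peak.

```python
def peak_mb_per_s(bus_width_bits, clock_mhz, transfers_per_clock):
    """Theoretical peak bandwidth in MB/s for a simple parallel bus."""
    return bus_width_bits / 8 * clock_mhz * transfers_per_clock


# Nominal figures, assumed for illustration (not taken from the thread).
sdr_pc133 = peak_mb_per_s(64, 133, 1)  # single data rate SDRAM
ddr_266 = peak_mb_per_s(64, 133, 2)    # double data rate SDRAM
p4_fsb = peak_mb_per_s(64, 100, 4)     # quad-pumped front-side bus

for name, mb in [("PC133 SDR", sdr_pc133), ("DDR266", ddr_266), ("P4 FSB", p4_fsb)]:
    print(f"{name:10s} {mb:7.0f} MB/s  ({mb / p4_fsb:.0%} of the FSB peak)")
```

On these nominal numbers, single-channel SDR can feed only about a third of what the front-side bus can carry, and DDR266 about two thirds.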

