
  Radical Changes 
Patrick Eberhart Nov 13, 2005, 11:48am EST
I guess it's my disappointment with the performance/cost ratio of 64-bit extensions and dual core that has my mind searching for a better solution. Instead of tying up a register with a thread of instructions that must complete before the register is free, why not use hard-wired registers? Sort of like the way FTP, Telnet, SMTP, DNS, TFTP, finger, HTTP, POP3, NNTP and SNMP site services work.

Instead of using a single register to work on the same data with a sequence of various instructions, why not let the data be transient, as if it were on an assembly line, instead of held in a vise to be operated on by various tools? Such a scheme would go a long way toward the realization of processors based upon light instead of electrons, since prism-based processing registers do not require an electronic interface.

(Please reply to The email address I used to register with is no longer active.)

Robert Eachus Dec 02, 2005, 10:00pm EST
>> Re: Radical Changes
Instead of tying up a register with a thread of instructions that must complete before the register is free, why not use hard-wired registers?

Modern CPUs don't work that way. ;-) Instead of having real hardware registers for most purposes, they have register files combined with register renaming. The logic that does the out-of-order scheduling for instructions determines dependencies between instructions caused by use of the same register. But the logic doesn't just look at the NAME of the register. Every time the code clears or overwrites the contents of a register, the logic assigns a new (temporary) name to that register.
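The renaming idea can be sketched with a toy model in Python. Everything here is an illustrative assumption (the `Renamer` class, the `p0`, `p1`, ... physical names, the four-instruction program); it only shows the bookkeeping, not how any particular Athlon does it.

```python
class Renamer:
    """Toy register renamer: every full overwrite gets a fresh physical name."""

    def __init__(self):
        self.next_phys = 0   # counter for hypothetical physical registers
        self.current = {}    # architectural name -> physical name holding it now

    def _fresh(self):
        name = f"p{self.next_phys}"
        self.next_phys += 1
        return name

    def rename(self, dest, srcs):
        """Rename one instruction that reads `srcs` and overwrites `dest`."""
        phys_srcs = []
        for s in srcs:
            # A register never written so far still needs some physical home.
            if s not in self.current:
                self.current[s] = self._fresh()
            phys_srcs.append(self.current[s])
        # The overwrite is what triggers the new (temporary) name.
        phys_dest = self._fresh()
        self.current[dest] = phys_dest
        return phys_dest, phys_srcs


program = [
    ("eax", ["ebx"]),  # 0: eax = f(ebx)
    ("ecx", ["eax"]),  # 1: ecx = g(eax), really depends on instruction 0
    ("eax", ["edx"]),  # 2: eax = h(edx), reuses the NAME eax only
    ("esi", ["eax"]),  # 3: esi = k(eax), depends on instruction 2, not 0
]

renamer = Renamer()
renamed = [renamer.rename(dest, srcs) for dest, srcs in program]
for (dest, srcs), (pd, ps) in zip(program, renamed):
    print(f"{dest} <- {srcs}   becomes   {pd} <- {ps}")
```

After renaming, instructions 0 and 2 write different physical registers, so a scheduler looking only at physical names sees no conflict between them even though both were written as `eax`.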

If you know what you are doing, and I can assure you that compilers do, your code should try to erase or overwrite any register completely to eliminate dependencies in the code. Loading a byte or half-word into a 32-bit register without clearing the otherwise unchanged bits is bad juju--unless you really care about both halves of the register.
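To see why a partial write is worse than a full overwrite, here is a minimal sketch in Python. It assumes a simplified rule: an instruction that changes only some bits of a register must merge with the old contents, so it waits for the previous writer. Real CPUs handle partial writes in different ways; the `true_dependencies` function and the two tiny programs are made up for illustration.

```python
def true_dependencies(program):
    """For each instruction, list the earlier instructions it must wait for.

    Each instruction is (dest, srcs, full_write). With renaming, a full
    overwrite of `dest` does not depend on the previous writer of `dest`.
    """
    last_writer = {}  # register name -> index of its most recent writer
    deps = []
    for i, (dest, srcs, full_write) in enumerate(program):
        needed = {last_writer[s] for s in srcs if s in last_writer}
        if not full_write and dest in last_writer:
            # Unchanged bits must be carried over from the old value.
            needed.add(last_writer[dest])
        deps.append(sorted(needed))
        last_writer[dest] = i
    return deps


full = [
    ("eax", ["ebx"], True),  # 0: some long computation into eax
    ("eax", [], True),       # 1: load a byte, zero-extended to all 32 bits
]
partial = [
    ("eax", ["ebx"], True),  # 0: same long computation
    ("eax", [], False),      # 1: load a byte into the low 8 bits only
]

print(true_dependencies(full))
print(true_dependencies(partial))
```

In the first program the load can start immediately; in the second it has to wait for instruction 0, which is the dependency the post is warning about.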

AFAIK, the Athlon had 88 and the Athlon64 has 120 renaming registers for the floating-point, MMX, SSE, etc. registers. For integer/general-purpose registers it isn't so easy to count. AMD plays some neat tricks and effectively separates data by how it is used. The integer/general-purpose unit has 44 LSU (load/store unit) locations in the Athlon64. (In the Athlon, as I recall, there were 36 load and 8 store locations.) If a calculation is used only for generating an address, it is assigned to the address side of an LSU location. If it is used as data in subsequent calculations or is stored, it goes into the data side. (As does data fetched from memory or cache.)

What if the result of a (partial) calculation is needed in both places? Different trick, if you want to call it that. Pretend that the value is stored in a renaming register and use that register as the 'calculated' address. A programmer can't write that instruction--he has no name for the virtual register used. But after instruction decoding, the CPU only works with virtual addresses, so it doesn't even know it should have had a problem. (And yes, no arithmetic operations translates into no clock cycles required.)

The LSU is also used for several other purposes. For example, what happens if you write a result to memory, then reference that memory location before the write completes? No problem--the CPU can use the copy in the LSU. In fact that, and dealing with some less-than-32-bit writes, were the only real uses of the 8-location store unit in the original Athlon.

Incidentally, the LSU can transfer up to two 64-bit operands per clock cycle, reading or writing. If the CPU actually generates more data than that, the LSU will fill up with results to be written, and eventually choke down the CPU to a level it can handle. (This should only happen with vector path instructions that write lots of locations. For example, fill a block of memory with a non-zero constant. Why can it only happen on writes? The L1 data cache can only deliver two 64-bit operands per clock, so in that direction you can't create a backlog.)
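The "use the copy in the LSU" behaviour, and the way a full LSU throttles the CPU, can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `StoreQueue` class, its method names, and the dict standing in for cache and memory are all hypothetical, and the default capacity of 44 is simply borrowed from the Athlon64 figure quoted above.

```python
from collections import deque


class StoreQueue:
    """Toy store queue with store-to-load forwarding and back-pressure."""

    def __init__(self, memory, capacity=44):
        self.memory = memory      # dict: address -> value (stands in for cache/RAM)
        self.capacity = capacity  # number of pending stores the queue can hold
        self.pending = deque()    # (address, value) pairs not yet written out

    def store(self, addr, value):
        """Queue a store; return False if the queue is full (the CPU must stall)."""
        if len(self.pending) >= self.capacity:
            return False
        self.pending.append((addr, value))
        return True

    def load(self, addr):
        """Read `addr`, preferring the youngest pending store to that address."""
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def drain_one(self):
        """Complete the oldest pending store, if there is one."""
        if self.pending:
            a, v = self.pending.popleft()
            self.memory[a] = v


mem = {0x10: 1}
sq = StoreQueue(mem, capacity=2)
sq.store(0x10, 42)
forwarded = sq.load(0x10)      # served from the queue, memory still holds 1
still_old = mem[0x10]
sq.store(0x18, 7)
accepted = sq.store(0x20, 9)   # queue is full, so this store is refused
sq.drain_one()
after_drain = mem[0x10]
```

The load sees the new value before memory does, and once the queue is full further stores are refused until something drains, which is the "choke down" effect in miniature.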

Also, why are there so many renaming registers for floating-point? Two reasons. First, floating-point and/or vector instructions work better if you have lots of fast registers. The original programmer may write a sequence of instructions that pound heavily on a couple of the XMM registers, or on the x87 result stack. But register renaming, if the code is written correctly, will translate those apparently inherently sequential instructions into a nicely unrolled loop with, if necessary, more than one hundred register names. For best performance, you want the burst speed when you do this to match the main-memory fetch times.

If the code you use is written correctly, for things like matrix multiply, you want any numbers used in the calculation to be read once from main memory. So in AxB = C, you want to read every element of A and B exactly once, and write every element of C exactly once. You can't always do this, but clever use of cache, and non-caching reads and writes, will help. The Athlon64 supports special instructions to do this. For example, you can write results directly from a register to main memory, bypassing the L1 and L2 caches. This prevents the values of C elements you write from pushing elements of A or B out of the L1 or L2 caches. Neat trick, but again it requires lots of renaming registers so that the results being written can sit around until the write completes.*
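To put rough numbers on the read-once, write-once goal, here is a back-of-the-envelope count in Python for square N x N matrices. The two function names are made up, and the "no reuse" case assumes the worst: every operand of every multiply-add is fetched from main memory again.

```python
def ideal_traffic(n):
    """Main-memory transfers if A and B are each read once and C is written once."""
    reads = 2 * n * n   # n*n elements of A plus n*n elements of B
    writes = n * n      # n*n elements of C
    return reads, writes


def no_reuse_traffic(n):
    """Transfers if nothing is ever reused from cache or registers."""
    reads = 2 * n ** 3  # each of the n*n results needs n elements of A and n of B
    writes = n * n
    return reads, writes


n = 1000
ideal_reads, ideal_writes = ideal_traffic(n)
worst_reads, worst_writes = no_reuse_traffic(n)
print(f"ideal: {ideal_reads} reads, {ideal_writes} writes")
print(f"no reuse: {worst_reads} reads, {worst_writes} writes")
print(f"read ratio: {worst_reads // ideal_reads}x")
```

For N = 1000 the gap in reads is a factor of N, which is why keeping A and B in cache, and keeping C from evicting them, matters so much.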

Complicated? You bet. And I only described one particular set of complications in the Athlon64/Opteron design. The net result is that the ISA that programmers (or in reality compilers) write to is not implemented by the CPU, and the states in the ISA model of what is going on only correspond to actual machine states when the CPU is halted.

*Why does the value in the register have to sit around all that time? The Athlon and Athlon64 compute and transmit ECC values throughout the processor and, of course, to main memory as well if you have ECC memory. (HyperTransport between CPUs actually uses a CRC check every N words sent instead.) What happens if the ECC or CRC doesn't check? Obviously, the receiving location requests that the data be retransmitted. If it isn't there?
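The point of the footnote, that the sender has to keep its copy until the receiver has checked it, can be shown with a small Python sketch. It uses `zlib.crc32` from the standard library as a stand-in check; the `send` and `transfer` functions and the single-bit corruption are illustrative assumptions, not the actual ECC or HyperTransport mechanism.

```python
import zlib


def send(payload, corrupt=False):
    """Return (data, crc) as the receiver sees them; optionally flip one bit."""
    crc = zlib.crc32(payload)
    data = bytearray(payload)
    if corrupt:
        data[0] ^= 0x01  # simulate a single-bit error in transit
    return bytes(data), crc


def transfer(payload, corrupt_first_attempt):
    """Send until the receiver's check passes; return (data, attempts)."""
    retained = payload  # the sender's copy, kept until the transfer is accepted
    corrupt = corrupt_first_attempt
    attempts = 0
    while True:
        attempts += 1
        data, crc = send(retained, corrupt)
        if zlib.crc32(data) == crc:
            return data, attempts
        corrupt = False  # the retry goes through cleanly in this toy model


clean = transfer(b"result", corrupt_first_attempt=False)
retried = transfer(b"result", corrupt_first_attempt=True)
```

The retry only works because `retained` still exists; if the sender had already reused that storage, there would be nothing left to resend.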


