Darek Mihocka of Emulators Inc. has written a very extensive article about the latest processor from Chipzilla (Intel). Since the man apparently would still rather be sitting at his Atari ST, it is not a cheerful story. Although the author cannot be called entirely objective, fans of this kind of in-depth technical treatise will still find it interesting reading.
The article opens with the claim that the Pentium 4 is anything but the most powerful processor. The author then lists what he considers the biggest flaws in the Pentium 4 design:
1) Small L1 data cache: My testing shows that while the Pentium 4 has extremely fast memory access for working sets of data up to 8K in size, at 16K and 32K sizes it is no faster than a 650 MHz Pentium III.
2) No L3 cache (as originally specified): My testing shows that at working sets between 256K and 2M, a 700 MHz Xeon processor easily outperforms the Pentium 4 at memory operations. How much is 256K or 2M? Well, that's about the typical size of an uncompress bitmap. It's the reason a Power Mac G4 running Photoshop kills a typical Pentium III running Photoshop. And axing the L3 cache is a main reason why the Pentium 4 is not the G4 killer it could have been.
3) Decoder is crippled: Intel took a rather idiotic approach to the U-V pairing and 4-1-1 grouping limitations of past decoders. They simply eliminated the extra decoders and went back to a single decoder. ...How long [will] it thus take that piece of code to execute? More than 21 clock cycles. Now, compare this to the Pentium III or Athlon. How long will those chips need to decode the bytes? Roughly 7 to 11 cycles.
4) Trace cache throughput too low: these execution units can in theory process 9 micro-ops per clock cycle - 4 simple integer operations, 1 integer shift/rotate, a read and write to memory, a floating point operating, and an MMX operation. Sounds pretty sweet, except for the problem that the trace cache feeds only 3 micro-ops at a time! While on the Pentium III we have the situation that the decoder can feed up to 3 instructions and 6 micro-ops (4+1+1) to the core per clock cycle, the Pentium 4 is crippled to the point of decoding one instruction per cycle and feeding at most 3 micro-ops to the code per clock cycle.
5) Wrong distribution of execution units: ...5 of the 7 execution units are dedicated to handling the integer registers... ...only one single execution unit handles MMX. And if you read Intel's specs in more detail, it states that the unit can only accept a micro-ops every second clock cycle. ...the three ALUs can accept up to 5 micro-ops per clock cycle. But we've already learned that the trace cache can provide at most 3. So one or more integer ALUs sit idle each clock cycle.
6) Shifts and rotates are slow: ...they created the shift/rotate execution unit, which by design operates at normal clock speed (not double clock speed), but in my testing actually operates even slower. A typical shift operation on the Pentium 4 requires 4 to 6 clock cycles to complete. Compare this with a single clock cycle on any 486, Pentium, or Athlon processor. How bad is this mistake? For emulation code, it's absolutely devastating. Shift operations are used for table lookups, for bit extractions, for byte swapping, and for any number of other operations.
7) Fixed the partial register stall with a worse solution: Accessing certain partial registers now involves the shift/rotate unit, meaning that a simple 8-bit register read or write can take longer than accessing L1 cache memory!
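To make the sixth point more concrete, here is a minimal Python sketch of the kinds of operations the quote mentions: byte swapping, bit extraction and table lookups. It does not come from Mihocka's article; the function names `swap32`, `extract_bits` and `table_offset` are made up for illustration. Python's shifts also do not map one-to-one onto hardware shift instructions, so this only shows the pattern, not the cost. An emulator for a big-endian machine such as the 68000 in the Atari ST, running on a little-endian x86, would need something like the byte swap on many memory accesses.

```python
def swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit value using shifts and masks."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return (
        ((value >> 24) & 0x000000FF)    # highest byte moves to the lowest position
        | ((value >> 8) & 0x0000FF00)   # second-highest byte moves down one
        | ((value << 8) & 0x00FF0000)   # second-lowest byte moves up one
        | ((value << 24) & 0xFF000000)  # lowest byte moves to the highest position
    )


def extract_bits(value: int, low: int, width: int) -> int:
    """Return `width` bits of `value`, starting at bit position `low`."""
    return (value >> low) & ((1 << width) - 1)


def table_offset(index: int, entry_size_log2: int) -> int:
    """Byte offset of entry `index` in a table whose entries are 2**entry_size_log2 bytes."""
    return index << entry_size_log2


if __name__ == "__main__":
    print(hex(swap32(0x12345678)))
    print(hex(extract_bits(0xABCD, 4, 8)))
    print(table_offset(5, 2))
```

Each helper performs one or more shifts per call, which is why a processor that needs several clock cycles per shift would be at a disadvantage in code of this kind.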
In the end the author concludes that we should certainly not buy a Pentium 4 unless Intel puts back all the features it stripped out of the P4 to make the chip affordable. All in all a fascinating article for the interested reader, provided the necessary grains of salt are kept at hand. Accordingly, the article has drawn a lot of criticism over the past few days, mainly because many feel the author takes too one-sided a view of the trade-offs Intel made to improve the scalability of the Pentium 4 architecture with an eye to the future. Here is a quote from a post by Paul DeMone on the Ace's Hardware forum:
I don't think the author understands many of the key features and design decisions that went into the P4. I also think he doesn't understand the principle of delayed gratification in microarchitecture design trade-offs. Many of the choices P4 architects made have little or no benefit and are sometimes even liabilities in the original 0.18 um implementation. But these decisions make a whole lot more sense viewed over the expected ~5 year lifetime of this basic microarchitecture where it will likely experience multiple process shrinks. The original 0.8 um Pentium and 0.5 um PPro were not overly impressive either compared to their predecessor in the same process but these cores proved to be quite good after one or two shrinks.
For more information, see also this article from The Register and the discussions on the Technical Forum of Ace's Hardware.