Darek Mihocka of Emulators Inc. has written a very extensive article about the latest processor from Chipzilla (Intel). Since the man apparently would still rather be sitting at his Atari ST, it is not a cheerful story. Although the author cannot be called entirely objective, fans of this kind of in-depth technical treatise will still find it interesting reading.
The article opens with the claim that the Pentium 4 is anything but the most powerful processor. The author then lists what he considers the biggest flaws in the Pentium 4 design:
1) Small L1 data cache: My testing shows that while the Pentium 4 has extremely fast memory access for working sets of data up to 8K in size, at 16K and 32K sizes it is no faster than a 650 MHz Pentium III.
2) No L3 cache (as originally specified): My testing shows that at working sets between 256K and 2M, a 700 MHz Xeon processor easily outperforms the Pentium 4 at memory operations. How much is 256K or 2M? Well, that's about the typical size of an uncompressed bitmap. It's the reason a Power Mac G4 running Photoshop kills a typical Pentium III running Photoshop. And axing the L3 cache is a main reason why the Pentium 4 is not the G4 killer it could have been.
3) Decoder is crippled: Intel took a rather idiotic approach to the U-V pairing and 4-1-1 grouping limitations of past decoders. They simply eliminated the extra decoders and went back to a single decoder. ...How long [will] it thus take that piece of code to execute? More than 21 clock cycles. Now, compare this to the Pentium III or Athlon. How long will those chips need to decode the bytes? Roughly 7 to 11 cycles.
4) Trace cache throughput too low: these execution units can in theory process 9 micro-ops per clock cycle - 4 simple integer operations, 1 integer shift/rotate, a read and write to memory, a floating point operation, and an MMX operation. Sounds pretty sweet, except for the problem that the trace cache feeds only 3 micro-ops at a time! While on the Pentium III we have the situation that the decoder can feed up to 3 instructions and 6 micro-ops (4+1+1) to the core per clock cycle, the Pentium 4 is crippled to the point of decoding one instruction per cycle and feeding at most 3 micro-ops to the core per clock cycle.
5) Wrong distribution of execution units: ...5 of the 7 execution units are dedicated to handling the integer registers... ...only one single execution unit handles MMX. And if you read Intel's specs in more detail, it states that the unit can only accept a micro-op every second clock cycle. ...the three ALUs can accept up to 5 micro-ops per clock cycle. But we've already learned that the trace cache can provide at most 3. So one or more integer ALUs sit idle each clock cycle.
6) Shifts and rotates are slow: ...they created the shift/rotate execution unit, which by design operates at normal clock speed (not double clock speed), but in my testing actually operates even slower. A typical shift operation on the Pentium 4 requires 4 to 6 clock cycles to complete. Compare this with a single clock cycle on any 486, Pentium, or Athlon processor. How bad is this mistake? For emulation code, it's absolutely devastating. Shift operations are used for table lookups, for bit extractions, for byte swapping, and for any number of other operations.
7) Fixed the partial register stall with a worse solution: Accessing certain partial registers now involves the shift/rotate unit, meaning that a simple 8-bit register read or write can take longer than accessing L1 cache memory!
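To make point 6 more concrete, here is a minimal sketch of the three uses of shifts that the author names: byte swapping, bit extraction, and table lookups. It is not taken from the article or from any Emulators Inc. product. The function names, the 16-entry `HANDLERS` table, and the choice to index it by the top four bits of a 16-bit opcode are all assumptions made for illustration only.

```python
def bswap32(x):
    """Reverse the byte order of a 32-bit value using shifts and masks.

    An emulator running big-endian guest code on a little-endian host
    needs something like this for multi-byte memory accesses.
    """
    x &= 0xFFFFFFFF  # keep only the low 32 bits
    return (((x >> 24) & 0x000000FF) |
            ((x >> 8) & 0x0000FF00) |
            ((x << 8) & 0x00FF0000) |
            ((x << 24) & 0xFF000000))


def extract_bits(x, lo, width):
    """Return the `width`-bit field of x that starts at bit position `lo`."""
    return (x >> lo) & ((1 << width) - 1)


# Hypothetical dispatch table with one entry per value of the top 4 opcode bits.
HANDLERS = ["handler_%x" % i for i in range(16)]


def dispatch(opcode):
    """Pick a handler by shifting the top 4 bits of a 16-bit opcode into an index."""
    return HANDLERS[(opcode >> 12) & 0xF]
```

Each helper performs at least one shift per call, and an emulator would run code like this for nearly every guest instruction. If a shift really does cost 4 to 6 cycles rather than 1, as the author reports, that cost lands directly on the inner loop.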