First, the L2 seems to be 8-way, although some sources had been talking about 4-way; I hope Intel will clear this up by releasing its uPF presentation, as AMD did. Second, the SPECfp_base95 score of 25 seems legit: PC Watch Japan has again cited this number, and it is reported to be part of Intel's uPF presentation.
Now, for people questioning whether the number is possible, we can look at the architecture and at the highest known SPECfp95 scores of a few well-known architectures. First, the 21264: this CPU has 2 pipelined FP execution units, 2 load/store units, and a quad-issue front end. Then the P6 architecture, with 1 issue port to the pipelined FPU, one load unit, one store unit, and mostly 2-wide decode/issue in the front end. Lastly the K6-3, with a single-issue FPU that can start an operation only every 2 cycles, 1 load/store unit, and a 2-issue design.
If we assume the execution units are fed to the same level and that memory latencies are hidden by prefetch, OOE, or deep buffering/cache, we can scale the score linearly with MHz and likewise with FP throughput. Take the K6-3 450's top known SPECfp95 score (8.33, from c't), multiply it by 6/4.5 = 1.333... to adjust for MHz, and multiply by 2 to adjust for throughput: we get 22.2. So if we can feed a 600 MHz P6's FPU as well as we can feed a K6-3 450's, we should be able to get at least 22 out of it. Now take the 21264 700's score of 68.1, multiply by 6/7 to adjust for MHz, and divide by two to account for the two FP units in the 21264: we get about 29, which is higher than the 25.
Of course, the ROPs in each CPU aren't the same, and how busy one can keep the execution units also depends on the instruction set, so this is a very crude estimate aimed only at showing that the P6 core does indeed have enough execution units to get the score shown. It also suggests that the off-die L2 on the current P3s is not feeding the core very well, and that the memory subsystem might be the deciding factor for all future CPUs.
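The back-of-the-envelope scaling above can be redone in a few lines of Python. This is only a sketch of the same crude linear model; the function name `scale_specfp` and its parameters are made up for illustration, and the only inputs are the scores and clock speeds quoted in the post.

```python
def scale_specfp(score, src_mhz, dst_mhz, throughput_factor):
    """Crude linear rescaling of a SPECfp95 score (hypothetical helper).

    Assumes, as the post does, that the execution units are fed equally
    well and that memory latency is fully hidden, so the score scales
    linearly with clock speed and with FP throughput.
    """
    return score * (dst_mhz / src_mhz) * throughput_factor


# K6-3 450 (8.33 per c't): scale to 600 MHz, then x2 because its FPU
# starts an operation only every 2 cycles versus every cycle on the P6.
k6_3_estimate = scale_specfp(8.33, 450, 600, 2.0)

# 21264 700 (68.1): scale to 600 MHz, then halve because it has two
# FP units where the P6 has a single FP issue port.
alpha_estimate = scale_specfp(68.1, 700, 600, 0.5)

print(f"from K6-3 450:  {k6_3_estimate:.1f}")   # 22.2
print(f"from 21264 700: {alpha_estimate:.1f}")  # 29.2
```

Both estimates bracket the reported 25, which is all the argument needs; the model says nothing about how well the units are actually fed.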
Note that besides the very low latency on-die L2, the new scores are also affected by the new compilers, which seem to use prefetch to hide the memory latencies and keep the execution units as busy as possible.
[break] Just saw that JC has posted a follow-up to his own posting: [/break]
Confirmation from SI that the impressively high Coppermine performance numbers are so high partly because the compiler was changed to take advantage of ISSE instructions. The poster (an Intel guy himself, I believe) speculated (very wisely) that it's likely ISSE prefetching optimizations, and not actual optimizing for SIMD (once again, since the phenomenal majority of SPECfp uses double-precision floating point, neither SSE nor 3DNow! SIMD applies). Now, I'm tempted to go into ponder mode ... see, I'm pretty sure the SPEC rules say that the compiler has to be freely available for results to be published officially. And the Athlon happens to support the prefetching instructions from ISSE (as well as separate 3DNow! prefetching instructions). So I'll assume that it can be used with this compiler (unless Intel's compiler turns off SSE support "if(CPUID) ne 'IntelInside'"). I'm interested in finding out the timings (you know, latencies and such) of the prefetching instructions in both the Coppermine (should be the same as in the Katmai PIII) and the Athlon. See, if the timings are identical, you can probably expect to see some strong performance boosts with an SSE'd compiler on the Athlon. If the timings are different, then you'll likely see some improvement, but it'd probably get muted. Oh, BTW, the same fellow put up some estimates of how much prefetching would help the Katmai on a 133MHz bus (PIII-600B).