The Core architecture (1)
Before we delve into the ins and outs of the Core, it is useful to consider what sort of processor it is. According to Intel it is a mix of the best of the Pentium 4 and the Pentium M, but it does not take a PhD to see that the overlap with the latter is much larger than with the former. In effect, only a few features of Netburst (such as the 64-bit extension) have been carried over; of that chip's actual design, not a brick was left standing. Whether the Core can be classified as Pentium M offspring depends on the level of analysis. It was designed by the same team, and its characteristics make it clear that the architects did not forget what they learned on earlier projects. The Core is clearly based on a mobile philosophy, but closer examination reveals a few striking new features, which suffice to justify characterizing the design as new.
Wide Dynamic Execution
The Core is built to decode, execute and retire up to four instructions per clock cycle. Other x86 chips such as the Pentium 4, Pentium M and Athlon 64 top out at three. Although in practice it is hard to find that many instructions within a single thread that can be executed independently of one another, every improvement of the average is welcome. To get the most out of this width, the Core uses 'fusion': two related instructions are merged by the hardware. This does not reduce the amount of actual work, but it does lead to slightly better efficiency, since less bookkeeping is needed.
Fusion takes place at two levels: that of the processor's internal instruction set (microfusion) and that of the external x86 instruction set (macrofusion). The Pentium M can also do microfusion, but the technique has been improved for the Core to allow more combinations. Intel says that microfusion takes ten percent off the processor's internal instruction load. Macrofusion operates directly on x86 instructions and eliminates unnecessary overhead, for instance by merging a 'compare' and a 'jump' into a single 'compare and jump'. Note that macrofusion is not applied in 64-bit mode, possibly on the assumption that modern compilers do not generate unnecessary instructions.
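To make the idea of macrofusion concrete, here is a minimal sketch in Python of a pass that merges an adjacent compare and conditional jump into one entry. It is a toy model, not a description of Intel's hardware: the tuple encoding, the `macrofuse` function and the `FUSIBLE_JUMPS` set are all assumptions made for illustration.

```python
from typing import List, Tuple

# Toy instruction encoding (an assumption): (mnemonic, operand, ...)
Instr = Tuple[str, ...]

# Hypothetical set of conditional jumps allowed to fuse with a compare;
# the real list of fusible pairs is not given in the article.
FUSIBLE_JUMPS = {"je", "jne", "jl", "jge"}


def macrofuse(program: List[Instr]) -> List[Instr]:
    """Merge each adjacent compare + conditional-jump pair into one fused op."""
    fused: List[Instr] = []
    i = 0
    while i < len(program):
        cur = program[i]
        nxt = program[i + 1] if i + 1 < len(program) else None
        if cur[0] == "cmp" and nxt is not None and nxt[0] in FUSIBLE_JUMPS:
            # One fused entry replaces two, so later stages track one item.
            fused.append(("cmp+" + nxt[0],) + cur[1:] + nxt[1:])
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused


program = [
    ("mov", "eax", "[esi]"),
    ("add", "eax", "1"),
    ("cmp", "eax", "ebx"),
    ("jne", "loop"),
]
fused_program = macrofuse(program)
```

In this sketch the four-instruction loop body shrinks to three entries. The same comparison and the same jump still happen; only the number of items to be tracked goes down, which is the bookkeeping saving described above.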

Smart Memory Access
One of the most innovative features of the Core architecture is Memory Disambiguation. To guarantee that a program executes correctly, instructions must be processed in the right order or, at the very least, appear to be. For a number of years there have been processors that 'covertly' work through instructions in a different order to gain performance. The first x86 chips to apply this principle were AMD's K5 and Intel's Pentium Pro. For these out-of-order execution (OoOE) architectures it is of paramount importance that the appearance of sequential processing is maintained. After all, it is undesirable, to say the least, for a processor to be so eager to execute instructions that it ends up working with data that an earlier instruction still has to modify.
In practice, it is hardly ever necessary to execute all instructions in the exact order in which they appear in the program. One of the tricks up Intel's sleeve (also in the repertoire of AMD's K8L) is to execute reads before it is their turn, so that the data is available sooner. Writes are more troublesome: if there is a store instruction in the pipeline whose target address is not yet known, the CPU cannot risk reading anything in the meantime, because the read might touch the very location that is about to be written. Programmers and compilers avoid writing and then rereading the same data for the sake of efficiency, but a processor's job is to get correct results with all possible code, including suboptimal code. Until now, this has been reason enough to block reads until all preceding writes have been handled.
The Core's Memory Disambiguator solves this issue by predicting the targets of pending writes and giving the go-ahead to any reads that are considered safe. As a result, instructions wait less on average and the CPU gets more work done in the same time span. The accuracy of the Disambiguator is reportedly 90%, so it still regularly goes to work on the wrong data. When this is noticed, which always happens before the results are finalized, the affected processing is simply redone. The system is somewhat comparable to branch prediction, which involves guessing which branch of an if construct program execution will take. The guesses are usually right, which increases efficiency, but sometimes extra work is needed to undo the damage of a miss. Naturally the net result has to be positive, so if the Disambiguator makes too many errors it is disabled for the remainder of the thread.
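The speculate, check and redo cycle can be sketched in Python as follows. This is a deliberately simplified toy: the `Disambiguator` class always guesses that a load does not conflict with the preceding store, and the `max_misses` threshold is a hypothetical value. The article does not say how the real predictor forms its guesses or how many errors it tolerates.

```python
class Disambiguator:
    """Toy model: lets a load pass an unresolved store until it errs too often."""

    def __init__(self, max_misses: int = 2):
        self.max_misses = max_misses  # hypothetical threshold, not Intel's
        self.misses = 0

    @property
    def enabled(self) -> bool:
        return self.misses < self.max_misses

    def execute(self, memory: dict, store_addr: int, store_val: int,
                load_addr: int) -> int:
        """Run 'store; load' and return the value the load must observe."""
        if not self.enabled:
            # Conservative path: the load waits for the store to finish.
            memory[store_addr] = store_val
            return memory[load_addr]

        speculative = memory[load_addr]  # load early, guessing "no conflict"
        memory[store_addr] = store_val   # store address is now resolved
        if store_addr == load_addr:
            # Wrong guess, caught before the result is finalized: redo the load.
            self.misses += 1
            return memory[load_addr]
        return speculative


memory = {0x10: 1, 0x20: 2}
d = Disambiguator(max_misses=2)
r1 = d.execute(memory, 0x10, 5, 0x20)  # independent: speculation pays off
r2 = d.execute(memory, 0x20, 7, 0x20)  # conflict: first miss, load redone
r3 = d.execute(memory, 0x10, 9, 0x10)  # conflict: second miss, load redone
r4 = d.execute(memory, 0x20, 3, 0x10)  # predictor disabled, conservative path
```

In every case the returned value matches what strict program order would produce. Speculation only changes how early the load can start, never the final result. After the second miss the sketch falls back to the blocking behaviour for all later accesses, mirroring the fallback described above.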
