Advanced Smart Cache
The Core has a shared L2 cache of 2MB or 4MB, depending on the model of the CPU. Each core can access data that the other one has requested, which reduces the average waiting time when both cores are working on the same task. Cache capacity is distributed among the cores dynamically, so a single core can claim the entire cache if needed. Sharing the cache also reduces bus traffic, since communication between the cores can be handled via the L2. Incidentally, the cores' L1 caches are also connected, but Intel has so far declined to divulge the reason for that.
A dual-core Core processor has a total of eight prefetchers on board, which work together with the large cache to lower memory latency. Each core has three prefetchers at the L1 level, two for data and one for instructions, and the remaining two are shared at the L2. Using multiple prefetchers for the same cache allows the design to recognize several different access patterns. In contrast to older designs, the Core's prefetchers check whether the data they fetch actually ends up being used. This avoids putting unnecessary load on the bus and reduces the amount of useful data that accidentally gets pushed out of the cache. In addition, reads requested by the running program get priority over those issued by the prefetchers, which minimizes the risk of performance degradation due to overly enthusiastic prefetching.
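To make the idea of pattern recognition and usefulness tracking concrete, here is a minimal toy model of a stride prefetcher in Python. It is a sketch for illustration only and not Intel's design: the class name `StridePrefetcher`, the `threshold` parameter and the `issued`/`useful` counters are all hypothetical. The model predicts the next address once it has seen the same distance between consecutive accesses twice in a row, and it counts how many of its predictions were later actually requested.

```python
class StridePrefetcher:
    """Toy stride detector (hypothetical model, not Intel's implementation)."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # repeats of a stride needed before prefetching
        self.last_addr = None
        self.stride = None
        self.confidence = 0
        self.pending = set()        # prefetched addresses not yet used
        self.issued = 0             # prefetches issued so far
        self.useful = 0             # prefetches that a later access hit

    def access(self, addr):
        """Record a demand access; return the address to prefetch, or None."""
        # Usefulness check: did an earlier prefetch cover this access?
        if addr in self.pending:
            self.pending.remove(addr)
            self.useful += 1

        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride:
                self.confidence += 1
            else:
                # New pattern: remember it, but do not trust it yet.
                self.stride = stride
                self.confidence = 1
            if stride != 0 and self.confidence >= self.threshold:
                prediction = addr + stride
                self.pending.add(prediction)
                self.issued += 1
        self.last_addr = addr
        return prediction


# A regular walk through memory in 64-byte steps is recognized quickly.
regular = StridePrefetcher()
regular_predictions = [regular.access(a) for a in (0, 64, 128, 192, 256)]

# An irregular sequence never builds up confidence, so nothing is prefetched.
irregular = StridePrefetcher()
irregular_predictions = [irregular.access(a) for a in (0, 640, 64, 4096)]
```

In this model the ratio of `useful` to `issued` is the kind of feedback a prefetcher could use to throttle itself when its guesses stop paying off.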
Although the lack of an integrated memory controller gives the Core a higher memory latency than the K8, the combination of cache and prefetchers is so good that it even fools various latency benchmarks that were specifically designed to circumvent primitive prefetchers. The only drawback to prefetchers is that they can become so active that they drive up power consumption. For this reason Intel has made the aggressiveness of the prefetchers adjustable, giving the mobile Merom the mildest default settings and the server chip Woodcrest the most aggressive configuration.
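As background on how such latency benchmarks try to sidestep a prefetcher, a common approach is pointer chasing: every load depends on the result of the previous one, and the order of the loads is randomized so that no fixed stride exists. The sketch below shows only how such an access chain can be built, using Sattolo's algorithm to obtain a single cycle through all elements. The function names `build_chain` and `chase` are hypothetical, and a real benchmark would be written in a compiled language and time the walk over a buffer much larger than the cache.

```python
import random


def build_chain(n, seed=0):
    """Return a list where nxt[i] is the element visited after i.

    Sattolo's algorithm yields a permutation that forms one single cycle,
    so a walk visits every element exactly once before it repeats.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain; each step depends on the result of the previous one."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


chain = build_chain(1024)

# Walking exactly len(chain) steps brings us back to the starting element.
end = chase(chain, len(chain))
```

Because the next index is only known once the current load has completed, the hardware cannot run ahead of the walk unless it manages to guess the pattern anyway.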
Intelligent Power Capability
The efficiency of the Core in comparison to its predecessors is not only due to the 65nm production process, but is largely the result of clever tricks: just about every part of the core can be switched on and off. The chip is partitioned into several main areas that are only active when they are in use. Certain components such as caches, buses and buffers can even be partially switched off. A disadvantage of switching off components is that it takes time to fire them back up, which drives response time up and performance down. To solve this problem, Intel's engineers have designed a system that predicts when a particular part of the chip will be needed, so that it is always ready in the nick of time.
Advanced Digital Media Boost
The Core is the first CPU capable of processing 128-bit SSE instructions in a single pass. Earlier chip designs split these into two 64-bit halves, which costs at least one extra clock tick and also requires more bookkeeping. The wide data paths for multimedia allow a maximum of four 64-bit floating-point operations per clock tick per core, which is twice as much as Netburst and K8 can handle. Finally, eight new multimedia instructions have been added under the name SSE4, which are meant to help specific applications achieve performance gains. However, Intel has paid so little attention to these that we do not expect them to deliver anything spectacular.
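The per-clock figures above translate into a theoretical peak as cores times clock speed times operations per clock. The short calculation below works this out; the 2.0GHz clock and the dual-core configuration are hypothetical values chosen for round numbers, and the function name `peak_gflops` is likewise only illustrative.

```python
def peak_gflops(cores, clock_ghz, flops_per_clock):
    """Theoretical peak in GFLOPS: cores x clock (GHz) x flops per clock per core."""
    return cores * clock_ghz * flops_per_clock


CLOCK_GHZ = 2.0  # hypothetical clock speed, used for both designs

# Four 64-bit flops per clock per core, as stated for the Core.
core_peak = peak_gflops(cores=2, clock_ghz=CLOCK_GHZ, flops_per_clock=4)   # 16.0

# Two per clock per core, the figure implied for Netburst and K8.
older_peak = peak_gflops(cores=2, clock_ghz=CLOCK_GHZ, flops_per_clock=2)  # 8.0
```

These are upper bounds that assume every clock tick is filled with floating-point work, which real software rarely achieves.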