Stephen Fisher
This is a translation of a Dutch article that can be found here.
Most people who visit the Californian town of Folsom, a two-hour drive northeast of San Francisco, go there because it lies close to the beautiful Lake Tahoe and several ski areas in the Sierra Nevada mountain range. Driving around the picturesque, characteristically American town of about 70,000 inhabitants, you would not readily guess that it is also the birthplace of one of the most advanced pieces of technology that man has so far produced.
Talking to the locals would probably change that: the huge campus at the edge of the town - seven buildings displaying large 'Intel' logos - all but guarantees that nearly every inhabitant of Folsom knows someone who works there.
Within the complex some seven thousand men and women work in various departments, and these days their main source of pride is that this is where the world's first 45nm processor was developed, tested and perfected. Tweakers.net was shown around the facility and spoke to Stephen Fisher, lead architect of the Penryn project.
Fisher has been on Intel's payroll for quite some time: he worked on the 486 CPU, on the definition of the MMX and SSE instruction sets, and on the Pentium III. The previous product Fisher worked on was codenamed 'Tejas': a 65nm version of the Pentium 4 with an extremely long pipeline of 40 to 50 stages, intended to achieve clock speeds of 7GHz or even higher.
Work on Tejas had progressed considerably when it dawned on Intel, back in 2004, that the Pentium 4 architecture had no future. The team had just reached the 'tape-out' point when news came in that the project had been cancelled. The 'tape', which was meant to be sent to the factory to produce the first physical version of the chip, now lies in a safe gathering dust.
Needless to say, this was a disappointment for Fisher, but he wasn't without work for long. Immediately after Tejas was cancelled, two new projects were started: Penryn and Nehalem. Both teams set off simultaneously and from the same starting point - an early alpha version of the Merom architecture, now known as the Core 2 Duo - but the Penryn team was expected to finish a year ahead of the folks working on Nehalem.
Tick-tock
In Intel jargon, simultaneously developing multiple designs for the same manufacturing process is known as the 'tick-tock' model. One reason for applying this model is that the behaviour of small transistors is uncertain to some degree. When work on Penryn and Nehalem commenced in May 2004, the most advanced manufacturing technology available was 90nm. Back then, it was hard to say how 45nm circuits would behave, just as it is difficult now to predict how things will go at 22nm. There are always rough estimates to work from, but these change as more decisions are made and theory is put into practice.

The first 45nm chip was made in early 2006, less than a year before Penryn had to be finished.
The changing characteristics of a manufacturing process, laid down in so-called 'design rules', present significant challenges to designers. Tackle problems too radically, and ideas may turn out to be impossible to implement; play it too safe, and you make suboptimal use of what the multi-billion-dollar factories have to offer.
This is why Intel chooses to explore two different approaches simultaneously, with an offset of about a year. For the 45nm generation, Penryn represents the conservative approach while Nehalem is the progressive one. Every obstacle that the first team encountered was a lesson for the second one, which then had an extra six to twelve months to solve it. This luxurious position enabled the second team to concentrate much more on new features, such as the integration of multithreading, a memory controller and a video chip.
Intel plans to follow this strategy in the coming years, meaning that the 32nm generation also has two entries on the roadmap: 'Westmere' will be a conservatively shrunk version of Nehalem, while the progressive 'Sandy Bridge' is to add a good number of new features. And things don't stop there: Fisher told us that he is already working on a 22nm design, and elsewhere within Intel, teams are already figuring out what features might be added to the 16nm generation.
Penryn's birth
Let us return to the present, and to Penryn. The first version of this chip was produced in December 2006 at Fab D1D, Intel's development facility in Oregon, almost 500 miles north of Folsom. Some fifty engineers stayed up through the night while the A0 version of the processor was flown to Folsom on one of Intel's private aircraft.
Of course they could have waited until the next morning, but they were simply too anxious to find out whether the chip, which by then had been under development for over two years, actually worked. Besides, they were eager to beat the Merom team, which had managed to boot Windows on the A0 version of the Core 2 Duo in under thirty minutes. That night there was both good news and bad news: Penryn worked, but it took six hours to get Windows to boot properly on it.
After the A0 version, several other steppings were made to fix bugs, improve yields and enable higher clock speeds. The version that will hit the shops on November 12 is called C0 and is the fifth revision of the design.

Penryn up close: 420 million transistors on 107mm²
But how does Intel go about testing these things? Simply starting Windows is obviously not enough to establish that everything functions properly. An all-nighter running SuperPi? Contrary to what many overclockers believe, that too is insufficient to declare a chip stable at a given clock speed. And what if a bug is found in the hardware? Penryn crams almost four million transistors into each square millimeter, so surely it is impossible to figure out exactly where things go wrong. Or is it?
Logical validation
Logical validation is a process purely intended for testing the functionality of a chip, and it involves three types of tests. The first category comprises benchmarks, games, operating systems, server applications, and so on. This is the easiest category to achieve success in, because this sort of 'normal' software usually does not contain very exotic code.
The second type of test is harder and involves trying out specific (new) features. Usually these tests are written by the chip developers themselves, since they know best how to cover every possible peculiarity and exception in their design.
The final category is probably the hardest. It involves firing random instructions and data at the processor to see if the physical chip behaves in the same way as the software model of it. Naturally, the 'emulator' is tested extensively before the first chip arrives from the factory, so if the hardware implementation responds identically to every combination of instructions and data, then chances are that everything is fine.
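To make this concrete, here is a minimal sketch in Python of what such differential random testing boils down to. Everything in it is hypothetical: the three-instruction 'architecture', the function names, and especially hardware_execute, which in a real lab would dispatch the instruction to the chip under test rather than reuse the software model.

import random

# Toy 'reference model': what each instruction is supposed to compute.
def model_execute(op, a, b):
    if op == "add":
        return (a + b) & 0xFFFFFFFF  # 32-bit wrap-around
    if op == "sub":
        return (a - b) & 0xFFFFFFFF
    if op == "mul":
        return (a * b) & 0xFFFFFFFF
    raise ValueError(op)

# Stand-in for the silicon: in the lab, this would send the instruction
# to the physical chip instead of calling the model again.
def hardware_execute(op, a, b):
    return model_execute(op, a, b)

random.seed(42)  # fixed seed, so any failure can be replayed exactly
for _ in range(100_000):
    op = random.choice(["add", "sub", "mul"])
    a, b = random.getrandbits(32), random.getrandbits(32)
    expected = model_execute(op, a, b)
    actual = hardware_execute(op, a, b)
    if actual != expected:
        # Log the exact instruction and operands that exposed the mismatch.
        print(f"mismatch: {op} {a:#010x}, {b:#010x} -> "
              f"{actual:#010x}, expected {expected:#010x}")
        break

The essence is the comparison, not the arithmetic: because the seed is fixed, a failing combination can be replayed over and over while the engineers probe the chip.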

This board is set apart for some serious continuity testing
As can be expected from one of the largest technology companies in the world, nearly everything is automated. A large network takes in tests and returns the results automatically, so the employees only need to concern themselves with actual problems.
When a problem is found, the main job for the people in the lab is to figure out what is causing it, since it does not necessarily have anything to do with the processor: software, BIOS and chipset can cause crashes too. By hooking the test platform up to a whole range of instruments, the flow of signals across the buses and the state of the processor (such as the contents of the registers) can be determined with great precision, so that a diagnosis can be made. Sometimes the real problem turns out to be in the testing software rather than in the product itself. All in all, finding a cause can be quite a puzzle.
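One way to picture that diagnosis - again a conceptual sketch, not Intel's actual tooling - is to replay the captured register dumps against the software model step by step and report the first instruction at which the two disagree. The trace format and the toy two-instruction model below are invented for the example.

# trace: the instructions that were executed; hw_states: register dumps
# captured from the test board, one per step. Both formats are hypothetical.
def first_divergence(trace, hw_states, model):
    state = dict(hw_states[0])       # start from the captured initial state
    for i, insn in enumerate(trace):
        state = model(state, insn)   # what the model says should happen
        for reg, value in state.items():
            if hw_states[i + 1][reg] != value:
                return i, reg        # step and register where they differ
    return None                      # hardware matched the model throughout

# Toy model: 'movi' loads an immediate, 'add' adds one register to another.
def toy_model(state, insn):
    op, dst, src = insn
    new = dict(state)
    new[dst] = src & 0xFF if op == "movi" else (state[dst] + state[src]) & 0xFF
    return new

trace = [("movi", "r0", 5), ("movi", "r1", 7), ("add", "r0", "r1")]
hw_states = [{"r0": 0, "r1": 0},
             {"r0": 5, "r1": 0},
             {"r0": 5, "r1": 7},
             {"r0": 13, "r1": 7}]    # hardware reported 13, but 5 + 7 = 12
print(first_divergence(trace, hw_states, toy_model))  # -> (2, 'r0')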
Circuit validation
It is one thing for the chip to be functionally OK, but it also needs to reach the desired clock speed. As long as the speed isn't satisfactory, the engineers will want to know which 'speed paths' - bottlenecks in the circuitry - need to be fixed in order to take things to the next level.
The lab where this is done is set up much like the one described in the previous section: a room packed with 19" racks, every case containing a motherboard specifically designed for testing purposes. These test boards use standard chipsets, but have been augmented with dozens of extra connections around the processor, enabling the engineers to read out all sorts of data while the system is running. An ordinary PC sits above each system under test, issuing experiments and collecting the results.
The main difference between the logic lab and the circuit lab is that here, the engineers also play with clock speed and temperature. Intel uses an advanced form of 'temperature control': a central liquid cooling system that allows temperatures to be set anywhere between -50 and 80 degrees Celsius.

The goal of this so-called 'circuit marginality validation' is to figure out the maximum clock speed and temperature at which the chip can still pass all tests. As soon as the chip starts failing, or the amount of cooling required is deemed too high, the engineers try to find the cause. In most cases a circuit that performs below par can be identified.
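In essence this is a sweep over a grid of frequencies and temperatures, recording where the chip still passes all tests - overclockers know the resulting chart as a 'shmoo plot'. The Python sketch below illustrates the idea; the pass/fail function is a fabrication, whereas on a real test board every grid point would mean setting the clock and the coolant temperature and running the full suite.

FREQS_MHZ = range(2400, 4001, 200)  # hypothetical sweep range
TEMPS_C = range(-50, 81, 10)        # matches the lab's -50 to 80 degree range

def passes_all_tests(freq_mhz, temp_c):
    # Stand-in for a full validation run; fakes a chip that loses
    # frequency headroom as it gets hotter.
    return freq_mhz <= 3800 - 5 * max(temp_c, 0)

for temp in TEMPS_C:
    passing = [f for f in FREQS_MHZ if passes_all_tests(f, temp)]
    fmax = max(passing) if passing else None
    print(f"{temp:4d} C: Fmax = {fmax} MHz")

Any point where the measured maximum falls short of the target frequency points at a speed path worth investigating.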
The obvious question of whether Penryn has been overclocked in this lab was answered with broad smiles, but none of the engineers would say whether they had managed to beat the 5.56GHz mark demonstrated by XtremeSystems. The triple phase-change cooling system used for that particular effort is definitely more advanced than the gear in the circuit lab, but then again, only Intel has access to the guts of the chip.
Patching at 45nm
Needless to say, whenever a bug is encountered it needs to be taken care of. Whether the problem lies in the logic or in an unfortunate circuit implementation that limits the clock speed, getting from the A0 version to a finished product takes a fair bit of tweaking.
However, some bugs are very difficult to trace. Especially when the design seems fine, or the simulations predict that a slow circuit should have been fast, the designers can seemingly do little more than change a few things and hope for better results... but Intel has a couple of very expensive machines designed to tackle exactly this sort of situation.
A processor suffering from an 'inexplicable' problem receives special treatment. First, the standard metal heat spreader is removed. Next, the protective layer underneath is stripped away, leaving only 10µm of the original 750µm. The chip is then placed in a specially designed test socket with a pure-diamond heatsink. Diamond is a very good thermal conductor, but, more importantly, it is also transparent.
As it happens, one of silicon's lesser-known properties is that it, too, is transparent: not to the naked eye, but infrared light can pass through it. Hence a very precise laser can be used to see what goes on inside the chip. For this purpose Intel has a custom-built machine costing 2.6 million dollars, which can scrutinize individual transistors to check whether they are switching - and this with picosecond accuracy, a millionth of a millionth of a second. It quite literally lets Intel see where and how fast the transistors switch.
One interesting feature of the test platforms is the ability to slow down the chip's clock. The operator of the machine can go from billions of ticks per second down to a single tick, allowing human eyes to follow what is going on. The clock can just as easily be stopped altogether, giving the engineers the luxury of staring for two hours, if need be, at a situation that would normally be over in a fraction of a nanosecond.

To locate individual transistors, the chip design is automatically projected over the images.
A second machine goes one step further: it can zoom in on details of less than a nanometer and even perform modifications at that scale. This is done by evaporating certain chemicals into the beam, which are then shot directly into the chip. Corrosives are used to break connections, while metal and silicon are deposited to build new structures.
This machine is so sensitive that it had to be re-stabilized after speed bumps were constructed on the road outside: the vibrations of cars passing over them made the images bounce up and down.
The technology saves Intel huge amounts of time. Having the factory produce a new revision takes at least four to six weeks, but with this machine it can be done within a day. Chips modified this way aren't meant for mass production, although it has happened: two weeks before the original Pentium 4 was due to launch, a bug was found in the southbridge of the chipset, and Intel engineers scrambled to patch hundreds of chips with machines like these so that motherboards could still ship in the first few weeks.
Errata and conclusion
However hard Intel works, a chip as complex as Penryn can never be made entirely perfect - at least not unless the company were to put a few extra years of work into it. Some problems are solved with software patches, such as microcode or BIOS updates. Other problems, usually the ones discovered while running random instructions, sometimes aren't fixed at all: either because the circumstances are so bizarre that practically no-one will ever encounter them, or because the potential consequences are so minor that they simply aren't worth the effort of solving.
But everything that is found in the labs does get published. For the first commercially available version of the Core 2 Duo, a total of 112 of these so-called 'errata' are known. Intel uses this term because it feels the word 'bug' does not apply: bugs must be fixed, whereas most errata aren't really anything to worry about. For example:
With respect to the retirement of instructions, stores to the uncacheable memory-based APIC register space are handled in a non-synchronized way. For example if an instruction that masks the interrupt flag, e.g. CLI, is executed soon after an uncacheable write to the Task Priority Register (TPR) that lowers the APIC priority, the interrupt masking operation may take effect before the actual priority has been lowered. This may cause interrupts whose priority is lower than the initial TPR, but higher than the final TPR, to not be serviced until the interrupt enabled flag is finally set, i.e. by STI instruction. Interrupts will remain pending and are not lost.
This summer, 39 of these issues were addressed in the new G0 stepping, though one new issue was identified as well. Despite the 74 known 'mistakes' that remain in the chip design, most people will concede that they never experience any problems whatsoever - which means the people in the validation lab have done a decent job.
Conclusion
Developing a new processor easily takes three to four years, with the last year before introduction mostly reserved for testing and perfecting the chip. However, validation does not start the moment the first silicon returns from the factory, nor does it stop once the product hits the market. It is also rare for the processor to be tested in isolation: usually the chipset must be tested at the same time, which complicates things further.
Hence, what we have shown in this article is only a small part of the total trajectory, but hopefully it has become clear that there is a huge difference between having a sample and having a commercial product. The fact that two weeks ago Intel had a machine on stage that could read out the sentence 'Hi, I am Nehalem. I am only three weeks old, and I am already talking' may be encouraging, but in the coming year that chip will have to endure an extensive testing programme before it is ready for release.

Chronological overview of the various stages in the validation process