The first word that you’ll think of when someone yells ‘Intel’ is probably ‘Pentium’. These two words have been used together so often in the past years that the average computer buyer doesn’t even know the difference between them. But as you, a faithful reader of Tweakers.net, will undoubtedly be aware, Intel does a lot more than ‘just’ design and produce chips for desktops. A substantial part of Intel's revenue is generated by the server and professional workstation market. The number of customers that generate this part of the revenue isn’t exactly large, and in general we’re only talking about bigger companies here. Of course even those big companies won’t have a server on every desk, so in terms of volume this market is very small. What makes this market so attractive is its customers, who demand high-quality products and are prepared to pay for them. This is one of the reasons why companies like AMD are trying to enter this market; there’s a lot of money involved. We’re talking about amounts of money that led Compaq and IBM to design processors specifically for servers.
‘Xeon’ has the same recognizable sound to server or professional workstation users as the name ‘Pentium’ has to home users. After the design of the original Pentium processor, Intel designed an early successor to it, codenamed P6. The P6 was heavily optimized for 32-bit applications, which weren’t really available for the desktop market at the time. The result was that pre-production samples of the P6 were, in many of the applications of that time, slower than a normal Pentium. Intel decided to wait a while with the desktop launch of the P6, and created a server version of it, the Pentium Pro. The new server processor was released in autumn 1995. Things changed a bit when Intel finally released the Pentium II, two years later. While the packaging was different and the cache size smaller, the new desktop chip used exactly the same techniques as the Pentium Pro did. Intel decided to create a clearer distinction between the two processors, and named the Slot 2 versions of the Pentium Pro ‘Xeon’. Still, it has always been common to add the name of the desktop chip that it resembles most. At this moment, we’re talking about the Pentium III Xeon processors, and they’re available in speeds up to 1GHz.
The future of the Intel server series once again, just like 6 years ago, depends on a new core: Itanium. However, the differences between the x86 IA-32 and the EPIC IA-64 architecture are so big that an immediate change to the new architecture would be impossible. As you’ve seen in our previous sneak preview, the Itanium isn’t exactly the best at emulating its predecessors, so you’ll need special software to make use of the new generation of processors. Add to that the fact that the prices of the 64-bit processors will be very high when they're finally released, even for the wealthy server and workstation buyers. Seeing this, it’s only reasonable that Intel will continue to develop server and workstation processors based on the old x86 architecture. The only logical next step in the evolution of the Xeon processor is to use the Pentium 4 core, and that’s exactly what Intel has been working on lately. The new Xeon is codenamed ‘Foster’, and will be released as the ‘Intel Xeon’, dropping the ‘Pentium’ name.
Foster is, as said, based on the Pentium 4 core, but just like the older Xeons it can’t be used on the same chipsets as its little desktop nephew. Intel chose a new interface for Foster, Socket 603, and the accompanying chipset was named i860 Colusa. The i860 chipset supports a maximum of 2 processors and uses Rambus RDRAM memory. The official presentation of Foster to the press and the public will take place in a few days; until then we can’t say much about the features of the new Xeon. But on the next pages you can read what we can expect of it, based on the specs of previous Xeons and some leaked facts.
One look at the pricewatch tells us that every Xeon is a lot more expensive than its Pentium III equivalent, and the Slot 2 mainboards don’t come cheap either. But what exactly is the difference between the two? Actually, the differences have decreased over time. The original Pentium Pro had a new core and was the first processor with an in-package, full-speed L2 cache, and a lot of it too; you could buy them with up to 2MB cache. The Pentium had a maximum of 512KB cache, which was located on the motherboard and ran at the same clock speed as the FSB. The Pentium II also had 512KB, but it only ran at half the speed of the processor. The Xeon has always had full-speed cache, and was also the first Intel processor able to address 64GB of memory, 16 times as much as its desktop equivalents.
This made the Xeon the perfect processor for some heavy tasks, and its most important advantage was very clear: the Xeon was the only Intel chip that could be used in a 4-way, or even bigger, SMP system. Because of the high prices of RISC chips, the possibility of building relatively low-cost but powerful x86 systems was a solution for many companies. The Xeon also had ECC features and hardware monitoring functions. Despite the high cost of the Xeon (for us mortals, that is), buyers were very glad with its introduction.
The release of the 0.18 micron Pentium III Coppermine changed the situation. This core gave every home user on-die full-speed L2 cache, a feature previously found only in the expensive Xeons. Furthermore, Intel created a distinction between ‘light’ and ‘heavy’ Xeons. The light ones were equipped with only 256KB of L2 cache, just like the normal Coppermine, and couldn’t be used in an SMP system. The advantages of using a light Xeon over a normal Pentium III became minimal, and the prices reflect that; the pricing of the light Xeons is closer to that of desktop processors than ever before. The heavy Xeons still exist, but there aren’t many of them. If you wanted to build a 4-way system, for example, until recently you only had the choice of the 550MHz and 700MHz versions, both only available with 1 or 2MB cache. The recently announced 900MHz version is only available with a 2MB cache.
This strategy won’t change a lot with the release of Foster, but Intel did put a little more effort in giving the light version its own place. Alas, this has an unfortunate consequence for the Pentium 4, which isn’t able to run in an SMP-configuration. If you want a dual processor system based on the Pentium 4 core, you’ll have to buy a pair of Foster DP’s. DP stands, of course, for Dual Processor, and that means that you can’t have more than two of them in one system. Companies who don’t think that that is enough, will have to buy the more expensive MP (Multi-Processor) versions. The MP’s also distinguish themselves from the DP’s by a maximum of 4MB full-speed L3 cache.
We can only speculate about the exact amounts of L2 and L3 cache in the different versions of Foster. We only know that the first available DP models will have 256KB of L2 cache. The MP version is mostly mentioned with 512KB or 1MB of L3 cache. This cache memory should provide better performance in applications that use the same data many times, like databases. The cache memory can also hide a part of the system memory latency, something that really comes in handy when dealing with RDRAM. The original Pentium 4 design had an L3 cache too, but this has been scrapped because of its high cost, something that is of less importance in the market for server processors.
According to the latest rumours, the DP version of Foster will be released on the 8th of May, at the same frequency range as the Pentium 4: 1.3GHz to 1.7GHz. The heavier models will make their debut a couple of months later and will initially only be available at 1.4GHz. At this moment, three chipsets with support for 4 Fosters have been announced: Intel 870, ServerWorks Grand Champion and IBM Summit.
Aside from the option of more cache and SMP, a leaked PowerPoint presentation from Intel pointed us to another interesting feature of Foster. On the slide below, you can see that ‘Jackson Technology’ is mentioned, with the caption ‘On-Chip multi-threading support’. According to some sources, Jackson Technology is an incarnation of SMT (Simultaneous Multi-Threading).
There isn’t a lot of documentation available about Jackson Technology, and the mysterious line under its name on the slide won’t tell us much either. There is of course a lot of speculation about it. Some people thought, for example, that this meant that Foster would consist of multiple processor cores on one chip. Fortunately, there is a more realistic explanation for all this, and it’s called SMT. SMT stands for Simultaneous MultiThreading, and it is a technology that should make sure the execution units of a processor are used optimally at any time. To understand what SMT exactly is, we’ll work up to it in a few small steps. We’ll start with a schematic diagram of a normal processor:
The little squares are the execution units, the parts of the processor that do the raw calculating. This imaginary processor has four of them (four rows from top to bottom). The time in clock ticks goes from the left to the right. To follow a 1GHz chip for one second, the picture would have to be 60 million times as big as this, so we’ll just stick to 17Hz here. When an execution unit is used at a certain moment, the square is coloured. The different colours are the different programs that are running on this processor, with the grey colour representing the operating system.
The processor is, as you can see, capable of executing multiple instructions at a time, but you can also see that a program isn’t optimally using the possibilities of the processor all the time. In this case, the OS has the task to switch between two threads. As you can see this can hardly be called multitasking; the processor can only run one program at a time, and the OS is necessary to create the illusion of multitasking.
The graphic you can see above looks a lot better already. The method that is used here is called Coarse-Grained Multithreading. A processor that works this way always has multiple threads loaded in its hardware. When the active thread needs some data from the system memory – something that will take an eternity if you’re looking through the eyes of the processor – the processor will work on another thread while waiting for data. There isn’t an operating system involved in switching threads, which guarantees that all this happens fast and efficiently.
The image above looks a lot like the previous one, but this method, Fine-Grained Multithreading, is actually a little bit more advanced. It switches threads at every clock tick, which makes the processor easier to design than the previous one; with that one, it wasn’t clear when it would have to switch threads. There are only some rumours about processors using CGM, while FGM has already been used. There is, however, a big disadvantage of working with FGM when only running one thread; it would only be able to use a quarter of the processor time.
Here we see the ultimate form of thread-level parallelism: SMT. You could describe it as a train. Little groups of people and lonely travellers are waiting for a train that arrives at a predetermined time. When the train arrives, and the doors open, everybody will try to find a place that isn’t taken yet, until nobody wants to enter the train anymore or it’s just full. The available resources of the processor are all used the best way possible here, but it is of course dependent on the number of threads that a processor can load at a time, how many instructions these threads need and the way that these instructions are spread. Still, the difference with the first picture is huge.
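To put some numbers on the difference between the pictures, here is a minimal sketch in Python. It is a toy model of our own and says nothing about how Jackson Technology actually works: each thread is just a list of steps, each step needs a number of execution units, and memory stalls and switching costs are ignored. The function names and the two example threads are made up for the illustration. The first function issues from only one thread per clock tick, as the multithreading schemes before SMT do in the diagrams; the second lets steps from different threads share a tick when they fit.

```python
from typing import List

WIDTH = 4  # execution units, matching the imaginary processor in the diagrams


def check(threads: List[List[int]], width: int) -> None:
    """Reject steps that could never fit into a single clock tick."""
    for steps in threads:
        for need in steps:
            if not 0 <= need <= width:
                raise ValueError(f"step needs {need} units, but only {width} exist")


def cycles_one_thread_per_cycle(threads: List[List[int]], width: int = WIDTH) -> int:
    """Ticks needed when every tick issues one step from a single thread."""
    check(threads, width)
    # Each step occupies a whole tick, however few units it really uses.
    return sum(len(steps) for steps in threads)


def cycles_smt(threads: List[List[int]], width: int = WIDTH) -> int:
    """Ticks needed when steps of different threads may share a tick."""
    check(threads, width)
    position = [0] * len(threads)  # index of the next step of each thread
    cycles = 0
    while any(position[i] < len(steps) for i, steps in enumerate(threads)):
        free = width
        for i, steps in enumerate(threads):
            # A thread boards the "train" only if its whole step still fits.
            if position[i] < len(steps) and steps[position[i]] <= free:
                free -= steps[position[i]]
                position[i] += 1
        cycles += 1
    return cycles


def utilisation(threads: List[List[int]], cycles: int, width: int = WIDTH) -> float:
    """Fraction of all unit slots that did useful work."""
    return sum(sum(steps) for steps in threads) / (cycles * width)


if __name__ == "__main__":
    # Two hypothetical threads; the numbers are units needed per step.
    a = [2, 1, 3, 1]
    b = [1, 2, 1, 2]
    for name, fn in (("one thread per tick", cycles_one_thread_per_cycle),
                     ("SMT", cycles_smt)):
        n = fn([a, b])
        print(f"{name}: {n} ticks, utilisation {utilisation([a, b], n):.0%}")
```

With these two made-up threads the one-thread-per-tick version needs 8 ticks and the SMT version 4, which doubles the utilisation of the execution units. Feed the SMT function a single thread and it needs exactly as many ticks as the other one, since there is nobody else to fill the empty seats.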
SMT has no effect on the performance of programs that consist of just one thread, or if the threads of a program need each other’s results. Because of this, SMT won’t always have an obvious advantage when using the current software, which isn’t optimized for it. However, Intel has already mentioned that there should be about 20 programs that fully use Jackson Technology in 2002. Still, Foster should be able to get a performance boost when doing some heavy multitasking. We don’t know yet if Jackson Technology has to be supported by the operating system.
As you have no doubt already guessed, looking at the page titles (yes, we know you already looked at them, you little rascals) and the fact that we actually published this article, Tweakers.net had the chance to experiment with a pre-production system based on the Intel Foster processor. The story starts on a cold Wednesday, when the two of us (Wouter Tinus and Arjan van Leeuwen) were sitting in a train, heading for a secret location, of which we can only say that it is in Europe. On the next pages, we’ll tell you exactly what our experiences were with this new processor.
The mainboard was of course based on the Intel 860 Colusa chipset, and had two sockets ready for action. Alas, only one of these sockets was filled, so we won’t be able to tell you anything about the performance of a dual Foster system (say: Aaaaahhhh….). But wait! This gave us one big advantage. This way, we could compare the new Foster to its desktop nephew, the Pentium 4, which ran at the same speed! (yell: Yeeeaaaaahhhhhh!) Both processors ran at a speed of 1.5GHz, and had 256MB of PC800 RDRAM at their service.
The Intel 860 chipset in all its glory
The processor itself looks a lot like the Pentium 4. In the picture below, you can see both alongside each other (Foster on the left, the Pentium 4 on the right). Remarkable differences between the two are the little blocks that you can see on both sides of the processor. Unfortunately, we don’t have a clue what these things are for, and the same holds for the 2 little ICs that you can see at the top. They might have something to do with voltage regulation or hardware monitoring, but that’s just a wild guess.
At the other side of both processors, you can see the new pin-layout of the Foster, which is remarkable. It looks like it's an exact negative of the bottom of the Pentium 4. It’s possible that this makes it easier for Intel to produce both processors. The little resistors at the bottom have been placed in another direction too, but it still looks a lot more like its desktop brother than the previous Xeons.
The Foster was supported in its tasks by a GeForce2 GTS; the Pentium 4 had to make do with an ATi Radeon. While this difference could possibly disrupt our comparison of the two, we don’t think that’s the case here; all the benchmark programs we used are a lot more dependent on chipset and CPU than they are on the graphics card. Both systems ran Windows 2000 Professional with Service Pack 1 installed. Starting WCPUID on the Foster computer already gave us the first surprise. While we were dealing with a 1.5GHz system here, the CPU name string indicated something different:
We heard that it is possible to change the multiplier of the CPU with a special disk, which should also work for the Pentium 4. This was of course very interesting information, and we really hope to hear more about it in the near future. However, the disk wasn’t available at the time, so we couldn’t check if this processor was really able to run at 1.8GHz. According to the latest news, the Foster should debut at 1.7GHz, so this should be no problem. We can also see that WCPUID recognizes our Foster as a Pentium 4, which is of course logical, looking at the similarities between the two. Our Foster has 256KB L2 cache, the same amount as a normal Pentium 4. This makes our comparison even more interesting. A look at the cache info window in WCPUID reveals that the caches are divided in the same way as in a normal Pentium 4.
Trying to look at the chipset information gives us, as expected, this alarming window:
Intel’s own Frequency ID program obviously knows more about it and identifies our processor, correctly, as an Intel Xeon. Still we do get a warning that we’re not dealing with a production processor and that only Intel processors are supported.
A good benchmark to start with is SysMark 2000. SysMark 2000 uses real programs and performs tasks that are very likely to be used in real life. For example, PowerPoint 2000 is installed, and the computer, guided by a script, creates two new presentations at an amazing speed, including animations and other nice gadgets. Twelve of these tests are performed, and the computer gets an index number from SysMark. A Pentium II 450MHz gets an index of 100. Unfortunately, our pre-production system wouldn’t finish all of the tests, even after rerunning and reinstalling SysMark a couple of times. Nothing to be worried about, however; SysMark still continues with the other tests, so that we can calculate our own index for both machines. Below are the results:
This was a pleasant surprise. When you are comparing a Pentium III 1GHz and a Pentium III Xeon 1GHz with 256KB cache, the difference would be negligible; the average difference between a Foster and a Pentium 4, however, is almost 10%! In Photoshop we even see the Foster sprinting 23% faster than the Pentium 4 at the same clock speed. The only software that didn’t profit from the new possibilities of the Intel Xeon is Excel, but this could also be caused by the same problem that disrupted our CorelDraw, Paradox and Word tests.
[Table: SysMark 2000 results, Foster vs. Pentium 4 (rows include Elastic Reality 3.1, NaturallySpeaking Pref 4.0 and Windows Media Encoder 4.0)]
* = The tests that wouldn't finish on Foster have also been omitted when calculating the average of Pentium 4
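The idea behind that footnote can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the test names and scores below are placeholders, not our measurements, and a plain arithmetic mean is assumed, which is not necessarily how SysMark weighs its index. A test that didn't finish is stored as `None`, and only tests that finished on both machines count towards either average.

```python
from statistics import mean
from typing import Dict, Optional, Tuple

Scores = Dict[str, Optional[float]]  # None marks a test that did not finish


def comparable_averages(foster: Scores, pentium4: Scores) -> Tuple[float, float]:
    """Average both machines over the tests that finished on both of them."""
    common = [
        name
        for name, score in foster.items()
        if score is not None and pentium4.get(name) is not None
    ]
    if not common:
        raise ValueError("no test finished on both machines")
    return (
        mean(foster[name] for name in common),
        mean(pentium4[name] for name in common),
    )


if __name__ == "__main__":
    # Placeholder scores, purely hypothetical.
    foster = {"test_a": 220.0, "test_b": None, "test_c": 180.0}
    pentium4 = {"test_a": 200.0, "test_b": 150.0, "test_c": 160.0}
    print(comparable_averages(foster, pentium4))
```

Leaving `test_b` in the Pentium 4 average while it is missing from the Foster one would compare two different sets of programs, which is exactly what the footnote avoids.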
The next test on our list was SiSoft Sandra 2001. Sandra measures the raw power of a CPU and gives us an indication of the internal and external bandwidth. The CPU benchmark was the first test we ran. We can obviously see here that the Foster does better than a Pentium 4, but it’s still not fast enough to get past a Pentium 4 1.6GHz. The CPU Multimedia benchmark shows the same results:
The real surprise for us came when running the memory benchmark. The i850 Tehama chipset does 1201 and 1216MB/s in the ALU and FPU tests. The new Colusa chipset beats these numbers with ease, and ends at 1394 and 1457MB/s. The 860 chipset is almost 18% faster than the i850, and that might explain why Photoshop really liked this one.
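For those who want to check the arithmetic, here is a short Python sketch using the four Sandra numbers quoted above. The helper name `percent_gain` is ours, and we assume the "almost 18%" figure is the mean of the two per-test gains.

```python
def percent_gain(old: float, new: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100.0


# Sandra 2001 memory bandwidth in MB/s, as quoted in the text.
i850 = {"ALU": 1201, "FPU": 1216}
i860 = {"ALU": 1394, "FPU": 1457}

gains = {test: percent_gain(i850[test], i860[test]) for test in i850}
average_gain = sum(gains.values()) / len(gains)

if __name__ == "__main__":
    for test, gain in gains.items():
        print(f"{test}: {gain:.1f}% faster")
    print(f"average: {average_gain:.1f}% faster")
```

The ALU test gains roughly 16% and the FPU test nearly 20%, so the average lands just under 18%.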
FlaskMPEG & ScienceMark
The re-encoding of MPEG2 video images to the MPEG4 DivX format is where the Pentium 4 really shines, and shows what you can do with the bandwidth of RDRAM. The nice thing about FlaskMPEG is that you can use different encoding methods, so that you can see the effects of optimizations such as MMX and SSE2. We see that the average performance in fps climbs 5% when using Foster. The improved bandwidth of the i860 Colusa chipset will probably account for the biggest part of this difference. Remarkably, encoding with SSE2 at low quality is far slower than encoding at medium or high quality. We think this is a bug in Flask, as the same thing happened on the P4 system. Below you’ll find a table with the results and a screenshot of the record:
Now that we know that Foster has a higher memory bandwidth available, thanks to the 860 chipset, and can outperform the Pentium 4 with Jackson Technology, we can look at a third method that can be used to further improve the performance of the chip: optimizations. Intel has always said that software developed especially for the Pentium 4 will run significantly better. By ‘especially developed’ we are, in this case, not referring to SSE2 routines, but to simple compiler settings. It takes little effort to use them, and they can generate code that's far better tuned to the specific strengths of a processor.
Intel have been working for some time on version 5 of their compilers and performance analyser. The new software offers, alongside special options for the Pentium 4, support for the IA-64 architecture. Tim Wilkens, the creator of ScienceMark, was kind enough to provide us with some special new versions of his benchmark, created with the latest beta version of this new compiler. ScienceMark is, as the name implies, a scientific benchmark. We can't explain what it does exactly with virtual liquid argon molecules, but that isn’t the most important thing. The fact that a 1.5GHz Pentium 4 takes almost 2 minutes to finish calculating says enough about the complexity of this software. The optimized executables surprised us, positively:
On the left you can see the result of the normal executable, which is free for download at the ScienceMark site. QxW and QaxW are optimized versions, compiled with the latest beta of the Intel 5.0 compiler. Despite the fact that the source code of the three programs is exactly the same, and they’re essentially doing exactly the same job, the optimized versions finish almost 2 times as fast as the standard one. A performance gain of almost 100% caused by a simple recompile promises a lot for the future of the Pentium 4, but it’s not all roses there:
This has to hurt. Despite the fantastic result in one test, the other one has to suffer badly. The effect is actually turned around here. QxW and QaxW are almost two times as slow as the original version, which has been optimized for a Pentium III. The explanation for this is that ScienceMark is a very strange program to a compiler, including instructions that aren't used very often. Tim Wilkens is working with Intel on improving the performance and fixing bugs in the new ScienceMark.
While a recent version is already better than the one we had at the time of testing, only 2 of the 4 important tests in ScienceMark could be compiled in the ‘optimized’ version, which causes the differences in the end result that you can see below:
* = Absent tests cause a lower score
The compiler is still far from finished, but it has shown us some impressive results. While everyday applications won’t easily benefit from the new optimizations as much as ScienceMark does, the difference should still be big enough. People that are afraid that the new optimizations will slow down their Pentium III or Athlon systems don’t have to be worried; while QxW immediately crashes on these systems, QaxW shows the same performance effects as it does on the P4, only a little less extreme. Moreover, the compiler has the possibility of optimizing code for the Athlon processor.
According to our tests, the development of Foster has been a step in the right direction for Xeon-class processors. Intel has obviously learned from the mistakes that have been made at the time of the Pentium III, and the new generation of Xeons has its own place on the market. By taking the dual-processing ability out of the desktop chips, high-end workstation users will always have to get a Xeon, which will cause bigger sales than ever for the Xeon processors. While the Foster will of course be more expensive than the Pentium 4, the extra money is partly justified by Jackson Technology and the faster i860 chipset. This justification is hard to find when looking at the current generation of Xeon processors (at least the ‘light’ ones). The improvements that the Xeon has made over the Pentium 4 are very obvious, but we’re pretty sure that these will play a more important role in the future, especially when commonly used server software is optimised for these new technologies.
What the exact price of the new Xeon will be isn’t clear yet, but with the price of the Pentium 4 falling rapidly at this moment, it is quite possible that the Foster DP will be released at the same price as the one at which the Pentium 4 is currently selling. With big names like SuperMicro and Tyan behind the mainboards it won’t be difficult to build attractive CAD/CAM workstations based on two Fosters. This brings us to another point, because there is another company that has a solution for dual-processor workstations in the making.
The AMD 760MP chipset will be released at the same time as the Palomino processor, probably before Foster and the i860. We can expect a system with two Palomino processors to be somewhat cheaper than a dual Foster setup. Moreover, Palomino will have some improvements to the core and will probably perform a bit better than the current Athlon. However, the choice of mainboards and chipsets will be bigger on the Intel side, and big companies that don’t trust a newcomer like AMD will certainly not work to the advantage of the Palomino either. Furthermore, AMD is far from matching the scalability of Intel. Designing dual Socket A mainboards is already a complex task because of the point-to-point bus, let alone designing them with two or four times as many processors. We still don’t know what exactly the AMD 760MPX chipset is, but we do know that Intel isn’t waiting for AMD to make the next move at this point. Also, AMD has no official plans to release Athlons with a larger cache.
The future of Foster depends, among other things, on the 870 chipset from Intel. With this chipset, systems consisting of up to 512 processors can be built, in subclusters of four. This chipset will support both RDRAM and DDR SDRAM. At the same time that the Pentium 4 receives its Northwood core, Foster will get a new 0.13 micron revision too. This core, codenamed ‘Prestonia’, will be released together with the Plumas DDR chipset.
Of course our thanks go out to our anonymous source, who let us use this system while endangering his own life. Thanks a lot and keep up the good work.