Introduction
This article was originally written in Dutch; a Dutch version is also available.
Computers with x86 processors come in many shapes and sizes, ranging from ultra-slim notebooks for business folks to neon-pimped desktops for gamers. One of the most excessive members of this family is the Sun Fire X4600, a server that can accommodate up to eight dual-core Opterons in its casing. Compared to this machine, all the gear that we tested earlier looks like a bunch of toys. But is it possible to use sixteen cores effectively? And if so, do things run smoothly enough to justify a price tag of more than 35,000 euros? These are the questions we shall be trying to answer in this review.
Market distribution
Each quarter, some 1.8 million x86 servers are sold worldwide. About 95% of these are of the standard type which we have tested before, with one or two processors on board. They get used as web, mail, file, print or proxy/firewall boxes, but also for (light) database and application work. Many of the smaller companies will never need more than that. Out of the remaining 90,000 or so servers, the great majority has four sockets. Usually, these machines are used for running groupware, ERP, CRM and other 'enterprise'-like software, which many hundreds to a few thousand people work with on a daily basis. Only about 1500 machines remain that have eight or more processors, which are used for the heaviest and/or most critical tasks. In this segment, x86 competes with chips such as the Itanium, Power and UltraSparc, which have been specifically designed for the most demanding applications.
Although the number of heavy-duty x86 servers that get sold is relatively small, the same rule applies for servers as in other markets: the more expensive the model, the bigger the margin. Usually, the purchase of a powerful server goes hand in hand with a substantial storage and/or backup system, software and service packages or other services such as consultancy. Consequently, selling such systems brings in more money than one might expect purely on the basis of the number of machines sold. More than half of the total server revenue (i.e. x86, Itanium and RISC put together) is made in the segment with four or more processors, which comprises only fifteen percent of the market. Note that we are just talking about the hardware: software and services must still be added.
The AMD-advantage
In just a few years, the Opteron has taken a considerable amount of market share from the once dominant Xeon. AMD did the most damage in the segment of servers with four or more processors. Worldwide, it commands more than forty percent of this market; in the US the figure is even higher and stands at more than fifty percent, in contrast with the Opteron's overall share of a quarter. The reason for this is that the Opteron can seamlessly scale from two to four processors. Sockets can be tied together with integrated HyperTransport links, and since each chip has its own memory controller, there is always a sufficient supply of bandwidth. Intel's Xeons, on the other hand, can only cooperate via an expensive chipset and have to share limited amounts of bandwidth. At the moment, a system with four Xeons has only 12.8GB/s at its disposal, compared with 42.7GB/s for a four-way Socket F Opteron. And, believe it or not, this is an enormous improvement compared with the old Xeon chipset, which AMD fought against and crushed with ease in the first two years of its server adventure.
| Total bandwidth 4-way servers | GB/s |
| Xeon MP (2002-2005) | 3.2 |
| Opteron (2003-2006) | 25.6 |
| Xeon MP (2005-2007) | 12.8 |
| Opteron (2006-...) | 42.7 |
| Xeon MP (2007-...) | 34.1 |
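To put the table in perspective, the following minimal Python sketch computes how many times more bandwidth each Opteron generation has than a Xeon MP platform. The figures come straight from the table; the choice of which platforms to pair up is an assumption made purely for illustration.

```python
# Total memory bandwidth of 4-way servers in GB/s, copied from the table above.
bandwidth_gbs = {
    "Xeon MP (2002-2005)": 3.2,
    "Opteron (2003-2006)": 25.6,
    "Xeon MP (2005-2007)": 12.8,
    "Opteron (2006-...)": 42.7,
    "Xeon MP (2007-...)": 34.1,
}


def advantage(opteron: str, xeon: str) -> float:
    """Return how many times more bandwidth the Opteron platform offers."""
    return bandwidth_gbs[opteron] / bandwidth_gbs[xeon]


# Which platforms count as each other's contemporaries is an assumption.
pairs = [
    ("Opteron (2003-2006)", "Xeon MP (2002-2005)"),
    ("Opteron (2006-...)", "Xeon MP (2005-2007)"),
    ("Opteron (2006-...)", "Xeon MP (2007-...)"),
]

for opteron, xeon in pairs:
    print(f"{opteron} vs {xeon}: {advantage(opteron, xeon):.1f}x")
```

The shrinking ratio shows how much of the gap Intel closes with each chipset generation, at least on paper.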
A few remarks must be made concerning both sides of the battlefield. First of all, AMD's HyperTransport links cannot scale up indefinitely, because a simple 'broadcast' protocol is used to keep the caches of the various cores in sync. This means that each processor is continuously talking to every other processor, even if they are working on completely unrelated tasks. This mutual chatter causes delays, since a core can do nothing but sit and wait with a piece of data until all the other cores have confirmed that no change has been applied to it. The more chips (or better, the greater the longest distance in the network), the higher the latency will get. Although the influence of this is minor in four-socket systems, there are benchmarks for eight sockets where the effect is clearly noticeable.
The Xeon, meanwhile, does not need to be as limited as it is with Intel's own chipset: with the help of IBM's X3 'Hurricane', up to 32 Xeons can cooperate quite effectively. This is achieved by building a network between the processors at the chipset level - something which is built into the Opteron's architecture. The difference is that the X3 chipset manages things somewhat smarter, and won't let two processors exchange small talk when they are not working on the same data. An attempt by Newisys to make such a filtering chipset for Opteron unfortunately never made it to market, but rumour now has it that AMD is planning to integrate this technique into the processor.
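The effect of network size on a broadcast protocol can be illustrated with a small Python sketch. It counts the probes a socket has to send per lookup (one to every other socket) and the longest distance in hops that a reply may have to travel. The `ladder` topology is a hypothetical layout chosen for simplicity; it is an assumption and not the actual wiring of any Opteron server.

```python
from collections import deque


def diameter(links: dict[int, set[int]]) -> int:
    """Longest shortest path, in hops, between any two sockets."""
    longest = 0
    for start in links:
        hops = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from this socket
            node = queue.popleft()
            for neighbour in links[node]:
                if neighbour not in hops:
                    hops[neighbour] = hops[node] + 1
                    queue.append(neighbour)
        longest = max(longest, max(hops.values()))
    return longest


def ladder(columns: int) -> dict[int, set[int]]:
    """Hypothetical 2 x columns ladder of sockets (illustration only)."""
    links: dict[int, set[int]] = {n: set() for n in range(2 * columns)}

    def connect(a: int, b: int) -> None:
        links[a].add(b)
        links[b].add(a)

    for c in range(columns):
        top, bottom = c, columns + c
        connect(top, bottom)  # rung of the ladder
        if c + 1 < columns:
            connect(top, top + 1)  # upper rail
            connect(bottom, bottom + 1)  # lower rail
    return links


for sockets in (2, 4, 8):
    network = ladder(sockets // 2)
    probes = sockets - 1  # broadcast: every other socket must answer
    print(f"{sockets} sockets: {probes} probes per lookup, "
          f"longest distance {diameter(network)} hops")
```

In this assumed layout both the number of probes and the longest distance grow with the socket count, which is the mechanism the paragraph above describes.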
Trends
There are several trends going on which influence the position of 4-way and 8-way servers. On one side there’s consolidation and virtualization. The first term involves running multiple applications on a single system, to save space, power and cost. Virtualization is a powerful tool for this, since it allows administrators to partition a physical machine into multiple virtual machines. This reduces the difficulties and risks usually associated with the coexistence of several types of software on a system. If companies were to bring fat machines to bear for the replacement of large numbers of smaller ones, the heavy-duty segment could grow further. IDC’s most recent figures show that the growth of cheap servers stagnated in the last quarter of 2006, but of course a single quarter does not make for a trend.
Meanwhile, the competition on the other side is not resting: blade servers offer high density and reliability thanks to cheap redundancy. Moreover, they can be clustered when heavy duty calls. Many companies are interested in the so-called 'Google model', where hardware is merely a building block that can be added or removed as desired. In this model, heavy and expensive machines with four or eight processors are uncalled for. Oddly enough, virtualization can be handy for this sort of application too, since a virtual machine doesn't care about the physical hardware, even if it changes from one moment to the next.

Another factor that can make the heavier systems less attractive is the growth of the number of cores per socket. Two years ago both the Opteron and the Xeon were single-core, but no-one would be surprised if there were chips with eight cores by 2010. There are two ways of looking at this: one is to see it as simply the next method to increase CPU performance, on a par with bigger caches and higher clock speed. The other perspective is that the whole multicore business is going to get seriously crippled by software limitations, weakening demand for more sockets. We can ignore whether this happens at 16, 32, 64 or 128 cores, but if there is a practical limit not far from two sockets, demand for heavier systems would decrease.
Which of the factors mentioned is the strongest is not important for the short term: the server market is a slow colossus worth about 50 billion dollars. Whichever way it moves, everyone in the business has plenty of time to respond. It is certain that this type of machine will be in reasonable demand for the next five years, so let's take a look at what such a beast can achieve.
Sun Fire X4600
The Sun Fire X4600 is a 4U rack-mounted server with support for four or eight processors and a maximum of 64GB of memory. There is room for four 2.5" SAS disks, four gigabit network ports, two USB ports at the front and another two at the rear. There is plenty of room for those who wish to plug in some extra cards, since six PCIe slots (four 8x and two 4x slots) and two 100MHz PCI-X slots are available. In principle, there is sufficient room in the casing for full-height cards, but in order to stick with a single standard across all servers, only low profile cards are supported. The motherboard uses a combination of an AMD 8132 PCI-X tunnel and an nVidia nForce Pro 2200 chipset.

The machine has some quite remarkable design features: instead of having the processors sit on the motherboard, a blade system is used, with each of them housing a socket and four memory banks. This clever configuration keeps things compact while allowing the cooling fans to do their work.



The operating systems supported are Solaris, Red Hat Enterprise Linux, Suse Linux Enterprise Server, VMware ESX Server, and Windows Server 2003. As may be expected of a system of this class, it can be fully administered remotely. The built-in 'service processor' can operate independently from the rest of the system, and can be reached via its own 100Mbit ethernet port or a serial connection. Administrators may use the built-in web interface or the standard protocols SNMP, IPMI, or DMTF.
Tweakers.net received the original Fire X4600 from Sun, fitted with eight 2.6GHz dual-core Opteron 885 processors and 32GB of DDR memory. There is a newer model, the X4600 M2, with support for Socket F and up to 128GB of DDR2 memory, which isn't any more expensive than its predecessor and will hence be the better choice for most people. But the difference between the two is not really that big, so we think this review will be fairly representative. Pricing starts at 16,500 euros for a model with four 2.4GHz processors, 16GB of memory and a single 73GB hard disk. Our model, with eight 2.6GHz processors and 32GB, is listed at 38,300 euros. To feed it energy, four 850 Watt power supplies are used, of which two can fail before the system gets into serious difficulty. The machine weighs almost 57 kilograms.

Fujitsu-Siemens TX200
The second machine that we look at is the Fujitsu-Siemens TX200. This is a server for cost-aware customers, based on the Intel 5000V chipset. It is sold as a (deep) tower or as a 4U rack mount, and has room for two dual- or quad-core Xeons, 24GB of FBD667 memory and six hard disks. Controllers are present for SATA, SAS and SCSI with basic RAID features and optional extensions. Those who prefer to use their own controller can resort to one of the two PCIe slots (x8 and x4), two PCI-X slots or the single ordinary PCI slot. The standard 600 Watt power supply can be given a brother for some extra security. The machine supports Windows 2000 and 2003, Suse and Red Hat Linux, VMware, SCO OpenServer and UnixWare.
Note that we do not pretend in any way that the TX200 is competition for the X4600: we wanted to use the opportunity to get an idea of the performance that a relatively cheap system with two quad-cores can deliver. With two 1.6GHz Clovertowns and only 4GB of memory, this machine stands in sharp contrast to the high-end gear that we usually like to grill, but that is precisely the reason that some valuable lessons may be learnt from it. The configuration which we tested costs 3299 euros.

PostgreSQL 8.2 final vs. dev
In this series of articles, we have been using a development version of PostgreSQL 8.2 so far. Although it has never caused us any problems, the final version that has been released in the meantime will undoubtedly become a lot more popular. The differences between the two versions may have been extensively documented in the changelogs, but they are hard to summarize. What is clear, however, is that neither performance nor scalability has improved over the past few months. Under a heavy load (25 simultaneous users and up) there is a 24% decline on average when eight processors are used, and 14% with four processors. The gain that is made when going from four to eight processors has also diminished: while the development version still gains 24%, the final release is barely 6% faster when the number of cores is doubled.
The significant loss of performance first came to light when we tried 8.2-rc1, a near-complete version of the software. As a result of our discovery, the PostgreSQL team quickly released three patches to soothe the pain, but unfortunately, these did not make it into the final release. However, they did land in version 8.2.1. We used the 'final' version 8.2.0, with these specific patches but without any other changes that may have been made for 8.2.1.

Unfortunately, under Solaris things aren't much different. The development version was not doing too well (note that the configuration with eight processors is some 10% slower than the one with four processors), and the picture for the final version isn't any brighter. This was a dilemma: PostgreSQL 8.2 final will be used much more often than 8.2-dev, but does a worse job at using the hardware well. So as not to misrepresent the server unnecessarily, and because we have already collected a lot of material using 8.2-dev, we figured we might as well continue for now to use it as a basis.

A less serious version problem occurred with MySQL, of which version 4.1.22 turned out to be about 10% slower than 4.1.20, which rules out the possibility of direct comparisons.
Scaling behaviour from 4 to 8 sockets
The Sun X4600 appeared to be an ideal machine for studying the Opteron's scaling behaviour. The blade design allows for processors to be physically removed from the system, making it seem possible to build configurations of one to eight sockets by adding one socket at a time. Unfortunately this isn't so simple in practice, since each socket functions as a node in a network of HyperTransport links. This network is not constructed in a stepwise fashion but has two specific configurations. As the figure below shows, the connections for a 4-way and an 8-way system are largely different. Alternative configurations are unfortunately not supported, although a BIOS update for the newer M2 is in the making, in order to add support for six processors.

Left: configuration with eight sockets, Right: configuration with four sockets
We did manage to get the system running on two sockets, but the performance was clearly inferior to what might be expected from a standard dual-Opteron server. Presumably, this is caused by the fact that the two sockets in the X4600 can only talk to each other in a roundabout way. Unfortunately, the consequence is that there is no sensible data on the scaling behaviour from one socket upwards, but we do have information for four and eight processors.
A few interesting things can be discerned in the graph below. MySQL 4.1.22 does not do very well on four processors in any case, but working with sixteen cores is clearly too much. The transition costs 41% of the performance, leaving it at a level that would normally be expected from a single-socket system. MySQL 5.0.32 does somewhat better, but a 4% gain when the theoretical computing power is doubled isn't too impressive. In any case, it is clear that the developers have improved their skills in working with multiple threads in the new version. Still, the score with four sockets (eight cores) is only marginally better than that of a single quad-core Xeon.
Loyal readers will know by now that PostgreSQL is a fine piece of software as far as scaling is concerned, but as we saw on the previous page, this doesn't hold on the X4600. The final version makes for a humble improvement of 6%, but the figure shows an effect that we have so far only seen in MySQL: a heavier load translates into worse rather than constant performance. The development version gives more favourable results: with scale gains of 44% it yields a new record of 950 requests per second.
Solaris vs. Linux
One can write long stories on the differences between Solaris and Linux, and the virtual infinity of the internet means that some people will in fact do just that. For instance, Dr. Nikolai Bezroukov has made a fundamental comparison listing advantages and disadvantages, but also the history and mutual influences. His conclusion is, in a nutshell, that Solaris has a lot of interesting features and is ahead in terms of architecture in a great many areas, but at the same time its commercial character - which has diminished but is still present - prevents it from enjoying the same name recognition as Linux. One of the claims being made is that Solaris scales better than Linux, especially where multithreading and open source databases are concerned. Our previous experience confirmed this hypothesis of Bezroukov: on a dual 2.4GHz Opteron, PostgreSQL ran 6% better and MySQL 14% better under Solaris than under Linux.
When four or eight processors are used to compare Solaris and Linux, we see mixed results. There were already a few somewhat disappointing scores to be digested running MySQL 4.1.22 under Linux; moreover, these went down considerably with the transition from four to eight Opterons. The behaviour is different under Solaris: the peaks with small numbers of users are markedly higher than they are under Linux, but as soon as the load increases, performance diminishes quickly here, while under Linux this does not happen until the eight processor mark. Whether this is a good or a bad thing depends on the situation at hand, but actually neither of the two options looks very favourable.

Under Linux, MySQL 5.0.32 behaves a good deal better than MySQL 4.1.22, but under Solaris it still exhibits strange behaviour. With four processors, the Solaris version beats the best results obtained under Linux (including the one with eight processors) but makes a poor showing on further upscaling, by falling back almost a quarter.

Scaling favourite PostgreSQL 8.2-dev behaves in an odd fashion: performance is virtually equal with four processors, but the Linux version wins 24% on the transition to eight processors while the Solaris version drops by 10%, allowing Linux to reach the finish line with a 38% advantage. It must be said that at Tweakers.net, we have a great deal more experience with Linux than with Solaris, but we do not think that the strange behaviour in the various databases can be attributed to spurious configurations: the installation of Solaris was done by people from Sun itself, and they subsequently provided us with a number of tips and tricks. A final check performed by Tweakers.net and Sun did not bring up an explanation either, but Sun will continue to look into the matter.
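The 38% figure follows from the two scaling percentages, as this short Python sketch shows. It assumes that both operating systems start from exactly the same score with four processors (the text says "virtually equal"), normalised here to 1.0; the variable names are ours.

```python
# Scores with four processors, normalised to 1.0 for both systems.
# Assumption: "virtually equal" is treated as exactly equal.
baseline = 1.0

linux_gain = 0.24     # Linux gains 24% going from four to eight processors
solaris_gain = -0.10  # Solaris loses 10% on the same transition

linux_eight = baseline * (1 + linux_gain)
solaris_eight = baseline * (1 + solaris_gain)

# Advantage of Linux over Solaris with eight processors.
linux_advantage = linux_eight / solaris_eight - 1

print(f"Linux advantage with eight processors: {linux_advantage:.0%}")
```

Because the two percentages compound, the gap is larger than the 34 percentage points one would get by simply adding them up.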
X4600 vs. Clovertown
We have seen that running a system on four or eight processors is not a straightforward matter: a large HyperTransport network appears to be sensitive to the way in which the software has been put together. Although we have seen troublesome scaling behaviour before in smaller systems, this is the first time we are seeing really drastic negative effects when the number of cores is doubled. Possibly, this is a problem that all servers run into, but unfortunately we do not have any other results in the same hardware class. What we can do is compare the X4600 to Intel's quad-core. Beside the new Fujitsu TX200 with 1.6GHz processors, we draw the 2.66GHz version which we reviewed earlier into the equation.
In MySQL 5.0.32 we see that two Clovertowns do not perform badly at all compared to four or even eight Opterons, in spite of AMD having twice as much bandwidth per core at its disposal. At the same time, it turns out that a 2.66GHz Clovertown with a 1333MHz bus is only 17% faster than a 1.6GHz model with a 1066MHz bus, while the former is more than twice as expensive. Though this sort of imbalance is nothing new in the world of computers, it's good to be reminded of it every now and then.

In the PostgreSQL 8.2-dev graph, we see a more logical picture, where the system with eight Opterons is clearly on top. Even though this looks a good deal better for the X4600, questions remain as to the extent to which it can really be called positive. After all, the top model Clovertown duo goes for 2344 dollars while the Opteron octet that we tested costs no less than 9320 dollars. Add to that the fact that a pair of 2.66GHz Clovertowns burn up a total of 240 Watt and tend to fit into a 1U box, while the eight Opterons have a total TDP of 760 Watt and take up 4U. Although this is a somewhat simplistic comparison, it should be clear that for most companies, investing in a well-equipped X4600 isn't really justifiable when based on a performance gain of 19%.
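The comparison in the paragraph above can be made explicit with a few lines of Python. This is a rough sketch that only looks at processor list prices and TDP, as the text does; the rest of the system is ignored, and the variable names are ours.

```python
# Processor prices (dollars) and total TDP (Watt) as quoted in the text.
clovertown = {"price": 2344, "tdp": 240, "performance": 1.00}
opteron = {"price": 9320, "tdp": 760, "performance": 1.19}  # 19% faster

price_ratio = opteron["price"] / clovertown["price"]
tdp_ratio = opteron["tdp"] / clovertown["tdp"]
perf_ratio = opteron["performance"] / clovertown["performance"]

# How much more performance the Clovertown pair delivers per dollar and
# per Watt of TDP, relative to the eight Opterons.
per_dollar = price_ratio / perf_ratio
per_tdp_watt = tdp_ratio / perf_ratio

print(f"Opterons cost {price_ratio:.1f}x as much and have {tdp_ratio:.1f}x the TDP")
print(f"Clovertown: {per_dollar:.1f}x the performance per dollar")
print(f"Clovertown: {per_tdp_watt:.1f}x the performance per Watt of TDP")
```

Like the paragraph itself, this is a simplistic comparison: TDP is not measured consumption, and a complete server costs far more than its processors.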
PovRay and K-means
So, what type of application does clearly suit the X4600? Answering that question is probably best left up to Sun: the marketing department mentions records in a number of highly parallelizable, computationally intensive benchmarks: SpecFp_rate2000, SpecInt_rate2000, and SpecOmpm2001. These have been achieved under Solaris and with Sun's own compiler. A number of the usual series of benchmarks (such as TPC-C and SpecJbb2005) have not been run or in any case not been published. The one that did appear is disappointing: in SAP-SD, 1650 users are served using eight 2.6GHz Opterons, worse than the 1980 or so which IBM and HP managed on four processors and also less than the score achieved on a double Clovertown: 1841. It appears then, that we are not the only ones that have (had) problems maximizing the potential of the X4600.
Armed with two benchmarks which, like the tests cited by Sun's marketing department, can be spread across multiple threads almost perfectly, we gave the X4600 an opportunity to prove its mettle. The first of these is PovRay, the best-known software-based ray tracer. We chose version 3.7 beta, the first multithreaded version. The fastest time for the eight processor system was 160 seconds, some 35% less than the 246 seconds which were needed running on four processors. Still far from the ideal of double speed, but nevertheless the best we have seen so far. The 1.6GHz Clovertown does not finish the test before 406 seconds have passed. Unfortunately the 2.66GHz version was not tested, but on the basis of the available evidence it does not appear as if it could overtake the X4600.
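The difference between "35% less time" and "double speed" is easy to mix up, so here is a small Python sketch using the render times quoted above. The helper names `speedup` and `time_saved` are our own.

```python
def speedup(t_before: float, t_after: float) -> float:
    """How many times faster the second run is."""
    return t_before / t_after


def time_saved(t_before: float, t_after: float) -> float:
    """Fraction of the original run time that was saved."""
    return 1.0 - t_after / t_before


t_four, t_eight = 246.0, 160.0  # seconds on four and eight processors

s = speedup(t_four, t_eight)
efficiency = s / 2.0  # doubling the sockets would ideally give 2x

print(f"time saved: {time_saved(t_four, t_eight):.0%}")
print(f"speedup:    {s:.2f}x (ideal: 2.00x)")
print(f"efficiency of the extra sockets: {efficiency:.0%}")
print(f"X4600 (8 sockets) vs 1.6GHz Clovertown: {speedup(406.0, t_eight):.2f}x")
```

Run time drops by roughly a third, which corresponds to about one and a half times the speed rather than twice the speed.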

The second test was developed by Tweakers.net and is executed in PostgreSQL. It is an implementation of the K-means algorithm, that is meant to divide objects into groups of things which are similar. It is often used as an aid in the areas of data mining and search. The benchmark partitions 43,665 of our news items on the basis of the words which occur in them. On average, for every news item a little over a hundred characteristic words are selected. On the basis of the number of times that they appear, a 'distance' between two items can be computed. Stories that are close in terms of word usage are lumped together, and the test continues until every group has fewer than twenty members. The algorithm can be readily spread out across several threads by assigning subsets to each core.
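To give an impression of the idea, here is a minimal Python sketch of this kind of clustering. It is not the benchmark itself, which runs inside PostgreSQL and whose code is not shown here. The cosine distance, the repeated two-way split and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_distance(a: Counter, b: Counter) -> float:
    """Distance based on word counts: 0 = same word usage, 1 = nothing shared."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def centroid(items: list) -> Counter:
    """Summed word counts of a group; its direction acts as the group centre."""
    total = Counter()
    for item in items:
        total.update(item)
    return total


def split_in_two(items: list, rounds: int = 10) -> tuple:
    """One 2-means step: seed with two distant items, then refine the centres."""
    seed = items[0]
    centres = [seed, max(items, key=lambda it: cosine_distance(seed, it))]
    groups = ([], [])
    for _ in range(rounds):
        groups = ([], [])
        for item in items:
            d0 = cosine_distance(item, centres[0])
            d1 = cosine_distance(item, centres[1])
            groups[0 if d0 <= d1 else 1].append(item)
        if not groups[0] or not groups[1]:
            half = len(items) // 2  # identical items: fall back to halving
            return items[:half], items[half:]
        centres = [centroid(group) for group in groups]
    return groups


def cluster(items: list, max_size: int = 20) -> list:
    """Keep splitting until every group has fewer than max_size members."""
    if max_size < 2:
        raise ValueError("max_size must be at least 2")
    pending, done = [list(items)], []
    while pending:
        group = pending.pop()
        if len(group) < max_size:
            done.append(group)
        else:
            pending.extend(split_in_two(group))
    return done


# Toy data: 30 "processor" stories and 30 "storage" stories.
stories = ([Counter({"cpu": 3, "opteron": 1 + i % 3}) for i in range(30)]
           + [Counter({"disk": 3, "raid": 1 + i % 3}) for i in range(30)])
for group in cluster(stories):
    print(len(group), sorted(centroid(group)))
```

The groups waiting in `pending` are independent of each other, which is why work of this kind can be handed out to several cores; the sketch itself is single-threaded to keep it short.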
The algorithm's scaling behaviour is illustrated nicely by Clovertown: a single thread on 1.6GHz takes 6019 seconds, which decreases to 3883 seconds on 2.66GHz, almost perfect scaling of the performance with the clock speed. Doubling the number of threads works very well on small numbers, with more than 95% better performance for the step from one to two cores, more than 85% for the move to four cores, and more than 80% for the final transition to eight cores. In total, the program runs 6.8 times faster on 2.66GHz and 7.2 times quicker on 1.6GHz, which is not far from the ideal of 8 times.
For the Opteron, we distinguish between Linux and Solaris. The first one yields a reasonable performance with eight cores: a time of 691 seconds may be higher than the 568 of the top model Clovertown, but it is not an odd result. It is rather strange, however, that the system needs almost twice the amount of time after going from eight to sixteen cores. Once more, we encounter an application which appears to scale well but collapses on eight sockets. A possible explanation for this is that PostgreSQL, under the intense pressure of eight dual-cores, needs to clean up its tables more often (vacuuming), which eventually has an adverse effect. However, under Solaris the picture is different again: here, a gain of 21% is achieved by adding the extra computational power. Unfortunately this is only a small comfort given the fact that the performance lags behind in absolute terms: even on 16 cores, the Opteron on Solaris needs almost four minutes more than the cheapest Clovertown.
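These Clovertown figures can be checked with a short Python sketch. It compares the single-thread speedup with the ratio of the clock speeds, and expresses the eight-core speedups as a fraction of the ideal. All inputs are taken from the paragraph above; the variable names are ours.

```python
# Single-thread run times in seconds and clock speeds in GHz.
t_slow, t_fast = 6019.0, 3883.0
clock_slow, clock_fast = 1.6, 2.66

time_ratio = t_slow / t_fast           # measured single-thread speedup
clock_ratio = clock_fast / clock_slow  # speedup if time scaled with clock only
clock_scaling = time_ratio / clock_ratio

print(f"measured {time_ratio:.2f}x for a {clock_ratio:.2f}x higher clock "
      f"({clock_scaling:.0%} of ideal)")

# Reported speedups going from one to eight threads.
cores = 8
reported_speedup = {"2.66GHz": 6.8, "1.6GHz": 7.2}
efficiency = {cpu: s / cores for cpu, s in reported_speedup.items()}

for cpu, value in efficiency.items():
    print(f"{cpu}: {value:.0%} parallel efficiency on {cores} cores")
```

The faster chip gets slightly less out of its eight cores. One possible reason, and this is only a guess, is that memory and the bus become relatively more of a bottleneck as the cores get faster.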
Power consumption and conclusion
Relatively moderate performance in combination with four 850 Watt power supplies does not sound like good news as far as performance per Watt is concerned, but for completeness we shall give the figures. The X4600, with eight processors and 32GB of memory, uses 825 Watt when idle and 1030 Watt when loaded. With four processors and 16GB we measured 585 and 703 Watt, respectively. These figures have been obtained without the power saving option 'PowerNow!' switched on, because for some reason which did not become clear, we could not get this to work with our BIOS and/or Linux version. If it had worked, the idle consumption would have been lower in any case, and possibly we might have nibbled something off the consumption under load. The 2.66GHz Clovertown equipped with 8GB of memory required 355 Watt under full load, while the Fujitsu machine with its 1.6GHz CPUs and 4GB managed to make do with 279 Watt. This means that Clovertown easily achieves double the performance per Watt in the best scaling database.
| Power consumption under load | Watt |
| Sun X4600 (8x Opteron 2.6GHz) | 1030 |
| Sun X4600 (4x Opteron 2.6GHz) | 703 |
| Melrow (2x Clovertown 2.66GHz) | 355 |
| Fujitsu TX200 (2x Clovertown 1.6GHz) | 279 |

| Performance per Watt (PostgreSQL 8.2-dev) | |
| Melrow (2x Clovertown 2.66GHz) | 1264 |
| Fujitsu TX200 (2x Clovertown 1.6GHz) | 1205 |
| Sun X4600 (4x Opteron 2.6GHz) | 615 |
| Sun X4600 (8x Opteron 2.6GHz) | 519 |
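The two tables can be combined to recover the underlying performance differences, as sketched in Python below. The unit of the performance-per-Watt figures is not stated, so the `implied_score` values are only meaningful relative to each other; multiplying the two tables back together is our own assumption about how the figures were derived.

```python
# Power under load (Watt) and performance per Watt, copied from the tables.
systems = {
    "Sun X4600 (8x Opteron 2.6GHz)": {"watt": 1030, "perf_per_watt": 519},
    "Sun X4600 (4x Opteron 2.6GHz)": {"watt": 703, "perf_per_watt": 615},
    "Melrow (2x Clovertown 2.66GHz)": {"watt": 355, "perf_per_watt": 1264},
    "Fujitsu TX200 (2x Clovertown 1.6GHz)": {"watt": 279, "perf_per_watt": 1205},
}


def implied_score(name: str) -> float:
    """Performance score implied by the tables (unit unknown, relative only)."""
    entry = systems[name]
    return entry["perf_per_watt"] * entry["watt"]


x8 = "Sun X4600 (8x Opteron 2.6GHz)"
x4 = "Sun X4600 (4x Opteron 2.6GHz)"
melrow = "Melrow (2x Clovertown 2.66GHz)"

gain_8_vs_4 = implied_score(x8) / implied_score(x4)
gain_8_vs_melrow = implied_score(x8) / implied_score(melrow)
efficiency_gap = systems[melrow]["perf_per_watt"] / systems[x8]["perf_per_watt"]

print(f"8 vs 4 Opterons:        {gain_8_vs_4 - 1:.0%} more performance")
print(f"8 Opterons vs Melrow:   {gain_8_vs_melrow - 1:.0%} more performance")
print(f"Melrow vs 8 Opterons:   {efficiency_gap:.1f}x the performance per Watt")
```

Under this assumption the results line up with the 24% and 19% gains mentioned earlier in the article, and they confirm that Clovertown delivers well over double the performance per Watt.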
X4600 conclusion
It is hard to be very positive about the 8-way Opteron. In the applications in which it isn't slower than a 4-way, the gains are often not impressive enough to justify the extra power consumption and cost. This isn't Sun's fault: the X4600 is stylish, user-friendly and complete - which is what we are used to from the company. It is simply AMD's architecture that does not seem very suited to scaling up to more than four sockets. There are some tests that demonstrate the potential, but those involve extremely well-parallelizable applications. For the price of 38,300 euros we do not even have to glance at the competition to find an alternative for this kind of software: the same money buys us almost ten(!) dual 2.8GHz Opteron servers with 4GB of memory. That's twenty processors for the price of eight. If an application is easily distributed, why not use a cluster? Easier said than done, but it does yield more possibilities for extension and flexibility.
The main advantage of a big machine over a cluster is that a large supply of memory is available which is shared by all threads, so programmers do not have to worry about synchronisation. The X4600 supports a maximum of 64GB of memory (the M2 even allows for 128GB) and that can hold a pretty big data set, for instance, for a scientific simulation. One location where the X4600 has been found to be a fit choice is the Japanese Tsubame, which at the time of writing is the ninth fastest supercomputer in the world. In a nutshell, the 8-way Opteron is interesting for tough simulations written by even tougher programmers, but the great majority will be better off with two or four sockets, or another processor that feels at home in heavy systems. Fortunately Sun is aware of this and can supply the X4600 with four sockets as well. This configuration still isn't a winner in our database test, but it does have a lot of success stories out there.

8-way Opterons in action in Japan
Fujitsu TX200 conclusion
This machine does not have the prestige and appearance of the Sun X4600, but armed with eight cores it can stick up for itself pretty well. Naturally the choice is always up to the customer, but we found the 1.6GHz Clovertowns of our test machine to be surprisingly competitive: they offer at least sixty to eighty percent of the top model's performance at less than half the price and a 2 x 40 Watt lower TDP - which makes for an excellent price/performance ratio. The only drawback is that the casing is fairly big. This does leave room for a lot of (cheap) 3.5" hard disks instead of the 2.5" server models, but it surely isn't much fun to try and cram it into an overpopulated rack. In short, the TX200 is an interesting option for somewhat smaller companies that are above all interested in value for money and do not suffer from lack of space.
Acknowledgements
Tweakers.net would like to thank Hans Nijbacker, Bart Muijzer, Jignesh Shah, James van Geene, and Gert Jan van Gent from Sun Netherlands for their cooperation on this article. We also thank Jeroen de Bruijn from Fujitsu Siemens Computers for lending us the TX200, our own sysadmins ACM and moto-moi for developing and executing the benchmarks, and Mick de Neeve for the English translation.
Previous articles in this series
12-12-2006: Intel Xeon 'Clovertown' 2.66GHz
13-11-2006: Intel Xeon 'Woodcrest' 3.0GHz (Apollo 5)
4-9-2006: Intel Xeon 'Woodcrest' 2.66GHz
30-7-2006: AMD Opteron Socket F 2.4GHz
27-7-2006: Sun UltraSparc T1 vs. AMD Opteron
19-4-2006: Xeon vs. Opteron, single- and dual-core (in Dutch)