Over three years ago, Sun started releasing information on a rather unorthodox new concept for server processors. The idea was that, instead of trying to accomplish a single task very quickly, the processor would work on many tasks at the same time and achieve decent net performance that way. The first result of this drastic change of direction is the UltraSparc T1 'Niagara', which was introduced at the end of last year. In this article, we dissect the vision behind this processor, look at the servers that Sun built around it, and, most of all, test to what extent this concept is usable for a website database like the one used here at Tweakers.net. A 'traditional' server with two dual-core Opterons will be used for comparison. Additionally, we shall take the opportunity to look at the differences between MySQL and PostgreSQL on the one hand, and between Solaris and Linux on the other.
Background: the problem
Sun's motivation for starting on the Niagara series was the fact that it is getting increasingly hard to make conventional cores faster: while it may be simple to add extra muscle in the form of gigaflops, ensuring that this power is actually applied in a useful way is a problem on which thousands of smart people rack their brains on a daily basis. One problem is that other hardware develops at a much slower pace than processors, so from the core's perspective it takes longer and longer for data to become available from memory. In spite of a good deal of research that has been (and is being) done on reducing the average access time, or on putting the waiting time to good use, the gap keeps growing.
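To make 'longer and longer from the core's perspective' concrete, here is a small back-of-the-envelope sketch. The 100 ns memory latency and the clock speeds are illustrative assumptions, not measurements of any particular system; the point is only that a fixed latency in nanoseconds costs more clock cycles as the core gets faster.

```python
def stall_cycles(memory_latency_ns, clock_ghz):
    """Clock cycles a core spends waiting on one memory access.

    One nanosecond at 1 GHz is exactly one cycle, so the number of
    cycles is simply latency (ns) multiplied by clock speed (GHz).
    """
    return memory_latency_ns * clock_ghz


# Assumed, illustrative figure: memory latency stays at 100 ns
# while the clock speed of the core keeps rising.
ASSUMED_LATENCY_NS = 100

for clock_ghz in (0.5, 1, 2, 3):
    cycles = stall_cycles(ASSUMED_LATENCY_NS, clock_ghz)
    print(f"{clock_ghz} GHz core: {cycles:.0f} cycles lost per memory access")
```

Under these assumptions the same memory access costs six times as many cycles on the 3 GHz core as on the 0.5 GHz core, even though the memory itself has not become any slower.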
Memory is not the only bottleneck: code does not tend to be very cooperative either. Instructions usually depend on each other in one way or another, in the sense that the output of one is needed as input for another. Modern cores can handle between three and eight instructions simultaneously, but on average you may call yourself lucky if two can be found that can be executed wholly independently. Often only one can be found, and regularly there are no instructions at all that can safely be sent into the pipeline at a particular moment. Hand-optimized code probably does better in many cases, but not everything can be fixed with better software: for most algorithms there are clear practical and theoretical limits.
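The effect of these dependencies can be illustrated with a toy model. The sketch below is a simplification under stated assumptions, not a description of any real core: every instruction takes one cycle, the core can issue an unlimited number of instructions per cycle, and only true data dependencies (an instruction needing the output of an earlier one) are considered. The function name and the instruction format are hypothetical.

```python
def average_parallelism(instructions):
    """Estimate how many instructions per cycle a dependency chain allows.

    `instructions` is a list of (destination, sources) tuples in program
    order. Returns (instruction_count, cycles_needed, instructions_per_cycle).
    Assumptions: unit latency, unlimited issue width, true dependencies only.
    """
    if not instructions:
        return 0, 0, 0.0

    ready_at = {}  # value name -> cycle in which that value is produced
    cycles_needed = 0
    for dest, sources in instructions:
        # An instruction can execute one cycle after its latest input is ready;
        # inputs never written by an earlier instruction are ready at cycle 0.
        cycle = 1 + max((ready_at.get(s, 0) for s in sources), default=0)
        ready_at[dest] = cycle
        cycles_needed = max(cycles_needed, cycle)

    count = len(instructions)
    return count, cycles_needed, count / cycles_needed


# Each instruction needs the result of the previous one: no parallelism.
chain = [("a", ["x"]), ("b", ["a"]), ("c", ["b"]), ("d", ["c"])]

# Four unrelated instructions: all can run in the same cycle.
independent = [("a", ["x"]), ("b", ["y"]), ("c", ["z"]), ("d", ["w"])]

# Two independent loads feeding a dependent chain: somewhere in between.
mixed = [("a", ["x"]), ("b", ["y"]), ("c", ["a", "b"]), ("d", ["c"])]

for name, program in (("chain", chain), ("independent", independent), ("mixed", mixed)):
    count, cycles, ipc = average_parallelism(program)
    print(f"{name}: {count} instructions in {cycles} cycles ({ipc:.2f} per cycle)")
```

Even with unlimited hardware, the fully dependent chain never gets beyond one instruction per cycle in this model, which is why a wider core alone does not help code of this kind.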
Mindlessly driving up clock speed by using ever smaller transistors is no longer an option either, because customers are becoming aware of the energy demands of their servers. Moreover, the deep pipelines that are necessary for high clock speeds make it difficult to use the available computing power optimally, so even a large power budget will not guarantee a speed monster that beats the competition on performance. In other words: the simplest tricks for raising a core's performance have run out, so ever bigger investments in complexity have to be made for relatively small gains.
Explosion of complexity
The problems mentioned above have led to a situation in which modern processors carry tens to hundreds of millions of transistors that do nothing at all to increase the functionality of the chip, and are present only to relieve the bottlenecks in the rest of the system. We are talking not only about the cache, but also about logic to execute instructions in a different order than the developer specified, or to guess which direction a branch instruction will take. Naturally, making mistakes is not an option, so literally hundreds of instructions, including their interdependencies, have to be tracked simultaneously in order to guarantee correct code execution under all circumstances, including the most improbable situations.
Intel, AMD and IBM are all doing their best to address these issues with their new generations of x86 and Power processors. These companies consequently pour vast amounts of money into developing and improving their cores, while battling the increase in complexity at the same time. Meanwhile, alternative paths are being explored: in the Itanium series, a lot of the logic mentioned above has been removed from the hardware and moved to the compiler. The reasoning is that, compared to other architectures, developing more powerful cores will get easier and easier in the long run, while the software gets smarter at the same time. After all, a compiler has plenty of time to look for parallelism, whereas a processor has to decide in a fraction of a nanosecond. In practice there are other considerations that will determine the success of the Itanium approach, but that is better left to a future article.
