The Blackford and the Creek are the first chipsets to use a new type of memory module, designed by a large group of companies led by Intel and IBM. Unlike an ordinary DIMM, a so-called FB-DIMM does not use a parallel bus to send data, but a serial point-to-point connection. One of the main reasons for this is that scaling a parallel bus to high speeds and large numbers of modules has proven difficult. DDR2, for instance, allows a maximum of four modules per channel at 400MHz and 533MHz, but no more than two per channel at 667MHz and 800MHz. It has been predicted that next year, with DDR3, things will reach the point where only one module can be used per channel. From a server perspective this is an unwelcome situation, because it limits capacity and forces the use of more expensive modules to reach a given number of gigabytes. FB-DIMM supports a comfortable eight modules per channel, eliminating this problem.
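The capacity argument can be illustrated with a quick back-of-the-envelope sketch. The per-channel limits are the ones quoted above; the 4GB module size is purely an assumption for illustration:

```python
# Maximum modules per channel as quoted above, and the resulting
# capacity per channel. The module size is an assumed example value,
# not a figure from the article.
max_modules = {
    "DDR2-400/533": 4,
    "DDR2-667/800": 2,
    "FB-DIMM": 8,
}
MODULE_GB = 4  # assumed module size for illustration

for tech, n in max_modules.items():
    print(f"{tech}: {n} modules -> {n * MODULE_GB} GB per channel")
```

With these numbers, a high-speed DDR2 channel tops out at 8GB while an FB-DIMM channel reaches 32GB, which is the capacity gap the paragraph above describes.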
A second advantage of this approach is that the controller no longer talks directly to the memory chips, but only to the buffer chip (AMB - Advanced Memory Buffer). The memory controller therefore no longer needs to know which chips sit on the other side of the buffer. At the moment, all available FBD memory uses DDR2 chips, but these could painlessly be replaced by a different kind, such as DDR3 or something more exotic like XDR. Such a switch would no longer require changing the motherboard or sockets.
A third motivation for turning to FB-DIMM is that it uses only 69 traces per channel on the motherboard. An ordinary DDR2 channel requires 240 traces, which all need to be (nearly) the same length – a frustrating constraint for motherboard designers. Open a computer and look at the area between the CPU and the memory banks, and you will see traces that wriggle and wind their way from one side to the other: detours meant to slow the signal down so that all traces arrive in step. FB-DIMM instead has the chipset compensate for differences in length, and combined with the smaller number of traces, this means that in practice there is room for two or three times as many channels at equal or lower complexity.
The final issue that has been tackled is reliability: ECC error correction is no longer applied just to data, but also to addresses and commands. Moreover, a transaction can be retried after an error without sending the processor or the operating system into a panic. This includes support for hot-swapping and for switching off data paths that prove unreliable. Although that reduces bandwidth, it keeps the system up and running.
Not everything about FB-DIMM is positive: a clear downside is the increased latency. Besides the buffer forming an extra step between processor and memory, the controller has a direct connection only with the first module on the channel. The remaining modules are accessible only indirectly, with each hop adding a delay of three to five nanoseconds (two or three clock cycles). By the time the eighth module is reached, something of an eternity has elapsed from the processor's perspective. The Opteron's NUMA architecture suffers a similar drawback when data has to be fetched from a module attached to the other socket, which adds thirty nanoseconds (about twenty cycles for DDR2-667) to the total access time.
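The daisy-chain penalty is simple arithmetic, sketched below using the three-to-five nanosecond per-hop figure from above (the helper function is illustrative, not a real timing model):

```python
# Rough estimate of the extra daisy-chain latency on an FB-DIMM channel.
# The 3-5 ns hop delay comes from the figures quoted in the text; this
# is illustrative arithmetic, not an actual memory timing model.

def extra_latency_ns(module_index, hop_delay_ns):
    """Extra delay to reach module N (1-based) behind N-1 buffer hops."""
    return (module_index - 1) * hop_delay_ns

for hop in (3, 5):
    print(f"module 8 at {hop} ns/hop: +{extra_latency_ns(8, hop)} ns")
```

At three to five nanoseconds per hop, the eighth module sits an extra 21 to 35 nanoseconds away, in the same range as the roughly thirty-nanosecond remote-node penalty of the Opteron's NUMA design.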
Most manufacturers are working breadth-first rather than depth-first in applying this to the Xeon DP: the most popular configuration for Blackford servers is four channels with two or three modules each, so the extreme case of eight modules per channel has not yet appeared in practice. Under heavy load, the adverse effect of higher latency per transaction is offset by the fact that more operations can be executed simultaneously: reads and writes can proceed at the same time, and on each clock tick, commands can be sent to three different modules per channel. Under heavy pressure this can bring average latency below that of DDR2, although this is not expected to hold for all applications.
Another drawback of the buffer is that it is kept rather busy: with an effective transmission rate of 3.2GHz (PC2-4200F) or 4.0GHz (PC2-5300F) in two directions at once, it will surprise few people that this raises the power consumption of the module. We found that every additional (533MHz) module costs an extra 7.6 watts, regardless of server load. By comparison, the 1GB DDR2-667 modules of our Socket F Opteron used 1.9 watts when idle and 2.4 watts under load. From this one can conclude that every FB-DIMM adds a good 5 watts to the total consumption of the server, which comes to 40 watts with a full set of eight modules.
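The power comparison works out as follows, using the measured figures above (the loaded DDR2 number is taken as the fairest baseline):

```python
# Back-of-the-envelope power comparison from the measured figures above:
# an FB-DIMM drew 7.6 W regardless of load, while a 1 GB DDR2-667 module
# used 1.9 W idle / 2.4 W under load.
FBDIMM_W = 7.6
DDR2_LOADED_W = 2.4

extra_per_module = FBDIMM_W - DDR2_LOADED_W     # roughly 5.2 W per FB-DIMM
extra_for_eight = 8 * round(extra_per_module)   # "a good 5 W" times eight

print(f"extra per module: {extra_per_module:.1f} W")
print(f"eight modules:    {extra_for_eight} W")
```

The difference of roughly 5 watts per module is what multiplies out to the 40-watt figure for a fully populated set of eight FB-DIMMs.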
It is expected that power consumption will come down as more experience is gained with buffer design. At the same time, however, frequencies will have to go up to support faster memory, making for an everlasting battle. The memory controller itself, at least, no longer seems to draw much power now that the buffer chips are taking over a substantial share of its functions: the maximum power consumption of the 5000P chipset with four channels is specified at 30 watts, but each channel accounts for only 1.75 watts of that, suggesting that the northbridge's power consumption resides largely in its other functions.