What Drives Performance in HPC? That's a good question. What does drive performance in HPC? On a qualitative basis it's easy to answer: a faster processor and memory, more memory, a better network or disk I/O subsystem. Unfortunately, those answers are rarely specific enough when faced with purchase decisions for a Linux cluster. This article is designed to support and expand on the "What Drives Performance in HPC?" webinar presented in June 2007, which outlined a quantitative approach to performance. In this paper, we will apply a more quantitative treatment. As in the webinar, the first order of business is to define the terms and narrow the scope of the topics.
What is HPC?
High Performance Computing will be loosely defined as some type of technical workload running on 16 or more compute servers. A smaller number of servers running a technical workload will be tagged as belonging to the emerging “personal cluster” segment. Whether the dividing line is 8, 12 or 16 servers is not important. Under our definition we assume each server is running its own OS (standalone) and its associated I/O infrastructure is based on Commodity Off The Shelf (COTS) systems. In short, we are talking about Linux clusters.
A datacenter with 20,000 such servers running molecular dynamics simulations would certainly qualify, as would a small engineering firm with a half rack in an equipment room running CFD simulations. The only restriction on the workload is that it be technical. A large Internet “Web farm” would not qualify, although much of what is discussed below can be directly applied to that case.
It is rare today to talk about pure performance with Linux HPC clusters. More often one encounters the term in the context of four metrics: Performance, Performance/Watt, Performance/Square Foot, and Performance/Dollar. The reason is intuitively obvious for the 20,000-server molecular dynamics cluster mentioned above. The constraints while running such a system are often dominated by its power consumption (Watts) and its size (SqFt), both of which factor into the total cost of ownership (TCO, $). Business benefit per TCO dollar is of paramount concern.
With that said, the scope of this article will be limited to Performance, while understanding that power, density and TCO can often be of equal or greater practical importance.
Performance will be defined here as a rate of computation, e.g., jobs completed per day, floating point operations per second (FLOPS), and so on. In the following discussion it will be shown that there are good reasons to think in terms of the execution time of a given workload. The two are directly related: Rate = 1 / (Time/Job). Thus, performance can be measured by running a workload, recording its execution time, and converting to rates where needed.
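As a small, hypothetical illustration of the Rate = 1 / (Time/Job) conversion (the function name and the 30-minute job are invented for this sketch):

```python
def rate_per_day(seconds_per_job: float) -> float:
    """Rate = 1 / (Time/Job), expressed as jobs completed per day."""
    return 86_400.0 / seconds_per_job

# A workload that takes 30 minutes (1800 s) per job completes 48 jobs/day.
print(rate_per_day(1800.0))  # -> 48.0
```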
Quantitative vs Qualitative
As mentioned in the first paragraph, this article is concerned with a quantitative approach to performance as it relates to Linux clusters. The goal will be to present quantitative models and techniques that are accurate enough to help guide business decisions but simple enough to be of practical use. For example, these decisions could include:
- Purchasing — guide system component selection to obtain the best performance or price performance.
- Deployment & configuration — identify bottlenecks in system / application performance to guide tuning efforts.
- Planning — highlight performance dependencies & limitations to enable medium term business planning.
The Prototypical Linux Cluster
Our model Linux cluster will have four main hardware components. (1) The compute nodes or compute servers which execute the technical workload. (2) A head node for cluster management, job control and so on. (3) The interconnect fabric, Gigabit Ethernet (GbE) being most common today. (4) Some type of global storage, which could be as simple as an NFS file system exported from the head node. This is shown in Figure One.
Figure One: Prototypical Linux Cluster
A Simple, Quantitative, Performance Model
At a high level, a quantitative performance model is fairly obvious. The execution time of a given workload on a given cluster is approximated as a sum of time spent in individual subsystems:
||Time = Tnode + Tfabric + Tstorage     (1)
Where Time is the workload execution time, Tnode is the part of execution time spent in the compute nodes, Tfabric is the part of execution time spent in the interconnect fabric communicating amongst the nodes and Tstorage is the part of execution time spent accessing either local scratch or global storage. As in the webinar, discussion of Tfabric and Tstorage will be deferred to a later date. Instead we’ll focus on Tnode. The execution time on the compute node can also be approximated as the sum of time spent in its individual subsystems:
||Tnode = Tcore + Tmemory     (2)
Where Tcore is the part of execution time on the compute node that is spent in the microprocessor itself, while Tmemory is the part of execution time spent accessing main memory. This model works well in practice for single CPU compute nodes and is easy enough to extend to common dual socket “SMP” compute nodes. In order to make this model (2) of practical use, the subsystem execution times must be related to physical configuration parameters of a compute node, e.g. processor speed, memory speed, etc.
The Compute Node
Let's take a closer look at a prototype compute node, as shown in Figure Two, to identify the relevant configuration parameters. There are two processor sockets at the top, connected via the FSB (front side bus) to the MCH (memory controller hub). The MCH has four memory channels. The MCH also has an InfiniBand HCA connected to it via a PCIe link.
Lower speed I/O, such as Gigabit Ethernet (GbE) and Serial ATA (SATA) drives, is connected via the South Bridge. In Figure Two, a performance-related parameter is shown in red beside each major component. These are specific hardware features that impact performance, though not the only ones. They are also generally directly related to the cost of the hardware. For example, fcore, the processor clock speed, has a large impact on performance for most workloads. Faster processors also cost more, due to supply and demand intersecting semiconductor yield curves. The size of the cache, MBcache, also impacts performance by reducing the number of relatively slow accesses to main memory needed by a workload. The number of cores in the processor, Ncores, has a large impact on performance as well as cost. The speed of the memory subsystem, parameterized jointly by fDIMM and fBus, has a large impact on performance for many workloads. Similarly, the speed of the interconnect fabric can be limited by the PCIe frequency, fPCIe. Note that many factors, like DIMM CAS latency and the number of memory channels, have been ignored a priori as second-order effects.
Figure Two: Prototype Compute Node
Performance Parameters We Can Use
Of the six performance-related parameters shown in Figure Two, four will be retained as relevant to the model. First, let's ignore fPCIe, as it impacts the performance of the interconnect fabric, which is outside the scope of this paper. Next, note that fDIMM and fBus are constrained to fixed ratios for a typical MCH; on a current Core 2 Duo system these ratios are typically 4:5, 1:1, or 5:4. We'll use only one of them for now, fBus. The cache size, MBcache, is very important and will be retained in the model. The number of cores, Ncores, is also very important, along with the core frequency, fcore.
The HPC Model
The following section develops a mathematical performance model. Readers who would rather skip the math can jump ahead to the Almost Done section and continue from there. In any case, you are invited to read along and see how the model is developed.
The basic form of the model (2) has been around for many years in the computer architecture research community (see References). A common form is:
||CPI = CPI0 + MPI * PPM     (3)
Where CPI is the processor “cycles per instruction” executed in the workload, CPI0 is the “core CPI” or CPI that would result if there were no cache misses, MPI is the cache “misses per instruction” executed in the workload (Note: In the HPC community the MPI acronym is used for “message passing interface” but we’ll stick with the processor architects’ convention here), and PPM is the “penalty per cache miss” in units of processor clock ticks. It is worth noting that equations (2) and (3) correspond to each other as the first term represents the processor and the second term represents the memory.
This correspondence can be made more explicit by assuming that a workload corresponds to executing P instructions per job and letting fcore stand for the processor frequency (units of processor cycles per second). Multiplying (3) by P / fcore gets us to:
||Tnode = (CPI0 * P) * (1 / fcore) + (MPI * P) * PPM * (1 / fcore)     (4)
Note that (CPI0 * P) has units of "processor cycles per job", which is essentially a constant for a given workload running on a particular processor micro-architecture. Therefore, we rename it α. (This assertion may be counterintuitive. It is essential to understand that "processor cycles" are not themselves measures of time; you must multiply by the time per cycle, 1 / fcore, to get a measure of time. Hence Tnode on the left side of (4). Thinking of these cycles as instruction execution slots makes it easier to accept this as a constant for a given workload and architecture.)
The same is true of (MPI * P). It's a constant for a given job and architecture, except that it depends very strongly on the cache size. We rename it M(MBcache) to capture the cache size dependence. The penalty per cache miss, PPM, is the cost to access main memory and bring in one cache line of data. For a given workload, it takes on average some fixed number, C, of bus clocks (cycles) to service a cache miss. However, PPM has units of "processor cycles per miss", so we multiply by fcore / fBus to convert from bus cycles to processor cycles. Therefore, PPM = C * fcore / fBus. Absorbing the constant, C, into M(MBcache) gives:
||Tnode = α * (1 / fcore) + M(MBcache) * (1 / fBus)     (5)
For cases where the bus frequency is held constant, equation (5) can be simplified to equation (6):
||Tnode = α * (1 / fcore) + β     (6)
Where Tcore = α * (1 / fcore) and Tmemory = β (i.e., the terms in (2)). The point of this exercise is twofold. First, the models (2), (5) and (6) have a solid theoretical grounding because we've shown how to derive them from (3), which is used in computer architecture theory. Second, the model now includes three of the four hardware performance parameters. We are still missing Ncores.
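To make the trade-off in (5) concrete, here is a small numerical sketch in Python. All parameter values are hypothetical, chosen only to show how the core and memory terms behave; they are not measurements from the article:

```python
def t_node(alpha, m_cache, f_core_hz, f_bus_hz):
    """Tnode = alpha * (1/fcore) + M(MBcache) * (1/fBus), equation (5)."""
    return alpha / f_core_hz + m_cache / f_bus_hz

# Assumed workload constants: 5e11 core cycles/job, 2e10 bus cycles/job.
alpha, m = 5.0e11, 2.0e10

# Raising the core clock shrinks only the first (core) term; the memory
# term is untouched, which is why core-bound and bus-bound workloads
# respond so differently to a faster processor.
print(t_node(alpha, m, 2.0e9, 266e6))  # slower core, 266 MHz bus
print(t_node(alpha, m, 3.0e9, 266e6))  # faster core, same bus
```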
An intuitive way to account for the number of cores is to assume that N cores are roughly equivalent to a single core running at a net frequency of N * fcore. Then from (6) we have, very approximately:
||Tcore ~ α / (N * fcore) = (α / N) * (1 / fcore)     (7)
We'll denote this dependence as:
||αN = α / N     (8)
That is, alpha for a multicore processor is expected to be roughly 1/N times the alpha of a single core.
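A quick sanity check of this approximation, with made-up values (the α and clock below are illustrative assumptions): under αN = α / N, doubling the core count at a fixed clock should roughly halve Tcore.

```python
def t_core(alpha, n_cores, f_core_hz):
    """Core time per job under the alpha_N = alpha / N approximation."""
    return (alpha / n_cores) / f_core_hz

alpha = 4.0e11  # core cycles per job (assumed)
f = 2.4e9       # 2.4 GHz core clock

# 2 cores vs 4 cores at the same clock: the 4-core time is half as long.
print(t_core(alpha, 2, f) / t_core(alpha, 4, f))  # -> 2.0
```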
Almost Done (with the math!)
Normally we think of computer system performance (a rate) in terms of the system core and bus frequencies (also rates), as shown in (5). But the left side of (5) has units of time: the execution time of a workload. It's a bit cleaner to express the main system parameters on the right-hand side directly in units of time as well. Note that the core clock period, τcore (that is, the time per core cycle or clock tick), is just equal to 1 / fcore. The same holds for the bus clock period.
||Tnode = αN * τcore + M(MBcache) * τBus     (9)
This conversion also gives us a model for execution time that is linear in the two principal independent variables, τcore and τBus. This will be convenient later on when we analyze some actual system data using a simple spreadsheet.
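The spreadsheet analysis alluded to here can be sketched as an ordinary least-squares line fit: given measured (τcore, Tnode) pairs at a fixed bus speed, the slope estimates αN and the intercept estimates β. The data points below are synthetic, generated from assumed coefficients, purely to show the mechanics:

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Synthetic data built from T = 0.15 * tau_core + 2.0 (units illustrative);
# the fit should recover slope (alpha_N) = 0.15 and intercept (beta) = 2.0.
taus = [0.31, 0.38, 0.42, 0.50]
times = [0.15 * t + 2.0 for t in taus]
alpha_n, beta = linear_fit(taus, times)
print(round(alpha_n, 3), round(beta, 3))  # -> 0.15 2.0
```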
Does It Work?
How good is the model in (9)? To answer that question, let's look at two very common workloads, Linpack and Stream. While these are generally thought of as synthetic, they do have direct applicability to workloads of commercial interest. Linpack, for example, is of direct interest for Boundary Element Methods, which have been used to simulate aircraft radar cross sections and submarine acoustic returns. The four loops, or kernels, in the Stream workload can be found in various amounts in hundreds of HPC application codes; they represent vector, or Level 1 BLAS, type operations. These two workloads are also very useful because they represent the two extreme ends of the spectrum of HPC workloads: Linpack is core compute bound while Stream is memory access bound. Together they constitute a good initial test of the execution time model in (9).
Linpack: A Core Bound Workload
Figure Three shows a plot of Linpack execution time vs τcore for a single CPU (socket) system (based on the Intel S3000PT server board) with 3 different multicore CPUs. The first CPU is a PentiumD (Netburst microarchitecture) dual core running at a variety of frequencies up to 3.2 GHz. The Linpack execution time data points for this workload are shown as yellow triangles. A linear fit to those data points is shown as a dashed yellow line.
The second CPU is a Core 2 Duo (Conroe) running at four frequencies up to 2.66 GHz. The data points and linear fit for this CPU are shown in blue. The third CPU is a Core 2 Quad (Kentsfield) running at two frequencies up to 2.4 GHz. The data points and linear fit for this CPU are shown in red. The bus frequency was held constant in all cases at 266 MHz (1066 MHz in quad-pumped terminology). Using equation (6) and examining the linear fit curves, it's clear that β, or M(MBcache) * (1 / fBus), is essentially zero.
This result is consistent with user experience with Linpack, which runs almost entirely out of cache and rarely needs to access main memory. We expect Tmemory, or β, to be zero, which the linear fit confirms. In this respect the model holds. Next, note that the line falls essentially right on each data point. For example, the Core 2 Duo case shows an almost exact linear fit through four data points. Certainly within the range of the data points we can assert that the actual physical behavior of the system is linear with respect to τcore. So we have confirmed another important aspect of the model: the physical behavior is actually linear in τcore.
Figure Three: Linpack Execution Time vs Core Clock Period
Now we’ll take a look at the slopes of the linear fit curves. From equation (9) we know that the slope corresponds to the value of αN. In the development above it was asserted that αN = α / N . That is, the slope for N cores executing the workload in parallel is 1/N times the slope for a single core executing the same workload. From the Core 2 Duo linear fit (2 cores executing) we find a slope value of 0.154, while the Core 2 Quad linear fit (4 cores executing) has a slope value of 0.0806. This is a measured ratio of 1.91x vs the “expected” 2x. That is roughly a 5% error, which is more than good enough for practical use.
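The ratio-and-error arithmetic above is easy to reproduce; the slope values below are the ones reported for the two linear fits:

```python
slope_duo = 0.154    # Core 2 Duo fit (2 cores executing)
slope_quad = 0.0806  # Core 2 Quad fit (4 cores executing)

# Under alpha_N = alpha / N, the 2-core slope should be 2x the 4-core slope.
ratio = slope_duo / slope_quad
error = abs(ratio - 2.0) / 2.0 * 100.0
print(f"measured {ratio:.2f}x vs expected 2x, {error:.1f}% error")
```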
Comparing the slopes of the Pentium D and the Core 2 Duo shows another aspect of alpha (α). Both processors are dual core, however Core 2 Duo can execute 4 SSE2 instructions per clock as opposed to 2 for Pentium D. The Linpack workload is able to take advantage of the SSE2 execution units. Therefore we expect the slope for Pentium D to be about 2x that of Core 2 Duo, which it is.
Stream: A Bus Bound Workload
Figure Four shows a plot of both the Linpack and Stream workloads executing on a Core 2 Duo based system. As in Figure Three, the Linpack linear fit curve has a beta value of approximately zero and a relatively large slope, as expected. The Stream linear fit curve is just the opposite. It has a relatively large value of beta while the slope is very small. This behavior is expected for a bus bound workload. The speed of the processor has very little impact on the execution time of the Stream workload.
Figure Four: Linpack & Stream Execution Time Behavior
What about the impact of bus speed (equivalently, MCH speed and DIMM speed) on Stream performance? Figure Five shows the impact of varying the speed of the front side bus, fBus. This data was collected on an ASUS P5B system with a Core 2 Duo. The ratio of fcore to fBus was held fixed at 7 to keep things simple. Two different ratios of fBus to fDIMM are represented, 1:1 and 4:5.
Figure Five: Stream Execution Time vs Bus Frequency
The linear fit curve in Figure Five demonstrates that Tmemory varies in proportion to the bus clock period, τBus. We can also see that there are really two distinct sets of data points: those at a 1:1 and those at a 4:5 value of the fBus : fDIMM ratio.
A More Complicated Workload: SPEC_CPU2000
Before wrapping up, it's worth a quick look at a more complicated workload to show that the execution time behavior is still essentially linear in the primary parameters. It's also a chance to demonstrate the impact of larger caches on Tmemory.
This analysis will give an indication of how important M(MBcache) really is. The SPEC CPU2000 benchmark suite is a reasonable choice: it is very well known and widely scrutinized. Here we'll simply compile each component workload with baseline compiler flags (-O2). The execution time of the composite workload is just the sum of the execution times of the individual component workloads (~25 of them).
In order to examine the impact of the cache size, a system based on the “Gallatin” processor was used as the test platform. This processor was a Netburst architecture CPU with a single core, a 512KB L2 cache and a 2MB L3 cache. Figure Six shows the execution time of the composite workload as the processor clock period is varied.
Figure Six: SPEC CPU2000 Execution Time vs Core Clock Period
The dark blue data points and linear fit curve correspond to a configuration with the 2MB L3 cache enabled. The light blue data points and linear fit curve correspond to a configuration with the L3 cache disabled, effectively cutting the cache size down to 512KB. The two linear fits are parallel; the values of their slopes are identical to within a few percent. More interesting is what happens to the value of beta as the cache size increases from 512KB to 2MB: there is a 1.75x reduction in the amount of execution time spent accessing memory, Tmemory. This result is equivalent to increasing the bus and memory frequency by 1.75x. For this composite workload, cache size is important.
Wrap It Up
It's time to wrap up our analysis. On a qualitative basis it's easy to list the things that drive performance in HPC: a faster processor, memory, bus, disk, network, and so on. Answering on a quantitative basis is a bit more difficult, and for those without unlimited budgets, more important. Is a slightly faster processor worth the extra money? Do I need PCI Express Gen2 interfaces in my cluster? Should I buy one speed bin faster memory or get another GB of memory per compute node? Is my workload bus bound, and how can I provide 5x more compute power for my customers two years down the road (on the same annual budget)?
We’ve attempted to demonstrate that a relatively simple model of compute node performance works quite well. For a given workload and compute node the model is easily validated with common tools (i.e. a spreadsheet). The model can then be used to help answer some of the questions posed above related to purchasing, deployment & tuning, and long term capacity planning. As to the initial question of “What Drives Performance in HPC?”, it all depends on the workload.
For those who want to dig into this further, try:
- Emma, P.G., "Understanding Some Simple Processor-Performance Limits", IBM Journal of Research and Development, Vol. 41, No. 3, 1997.
- Emma, P.G. et al., "The Effect of Miss Clustering on the Cost of a Cache Miss", HPCA 2007 tutorial.
- Predtechenski, A., "A Method for Benchmark Analysis", SPEC Workshop on Performance Evaluation with Realistic Applications, San Jose, CA, Jan. 25, 1999.
- Hennessy, J. and D. Patterson, "Computer Architecture: A Quantitative Approach", 2nd edition, Ch. 1, pg. 36, Morgan Kaufmann, San Francisco, CA, 1996.