Linux Scalability in a NUMA World

The numa library provides many tools for the programmer to control where programs are run and where they allocate memory.

Architectural features that were once found only on very large computer systems are gradually appearing on ever smaller ones. Sometimes these features require software changes to make the best use of the hardware's capabilities. Here I'll describe how to get the best performance out of a NUMA machine.

SMP and NUMA

The processors within the main cabinet of a computer system can be connected to memory in several ways.

Symmetric Multi-Processors (SMP)

The simplest way, from a software perspective, is for all processors to be equally well connected to all memory. Such systems are identified by the acronym SMP.

Non-Uniform Memory Access (NUMA)

However, as the number of processors in a system increases, the aggregate bandwidth required between processors and memory also increases. Building a symmetrical interconnect that can provide that bandwidth becomes increasingly expensive. One solution is the type of system known as NUMA (Non-Uniform Memory Access). A NUMA system connects processors and memory banks this way:

  • Connect each processor directly to one or more local memory banks.
  • Provide a forwarding method so that processors can access data in the memory banks to which they are not directly connected.

Today, even a single-board computer may be NUMA when its processors have integrated memory controllers.

Figure 1: The simplest NUMA system

NUMA Performance

The overall performance of a NUMA system depends on the proportion of memory accesses made by all processors to local (directly connected) memory. Each access a task makes to remote memory reduces the performance of that task. It may also reduce the performance of other tasks, by causing contention for remote memory connections.

In certain memory-bound benchmarks, NUMA systems perform better than SMP systems because the available aggregate memory bandwidth actually increases as more nodes are added to the system. This is because processors can access local memory in parallel, with no contention for a common bus.

Achieving NUMA Performance

So, the first step towards good performance from a NUMA system is finding out which processors are connected directly to which blocks of memory, and how far apart the processors are from their associated memory banks. Linux learns about the platform topology from two ACPI (Advanced Configuration and Power Interface) tables provided by the firmware: the Static Resource Affinity Table (SRAT) and the System Locality Information Table (SLIT).

The SRAT table associates each processor and each block of memory with a simple integer value, called a proximity domain in ACPI jargon. Many computer vendors and software people prefer the term node. The terms are interchangeable: objects in the same proximity domain are on the same node, and objects in different proximity domains are on different nodes.

The distance between nodes is described by the SLIT table. To show how the ACPI SLIT table provides node-to-node latency information, here is a slightly more complex system than the simple NUMA system we saw in Figure 1. Each node consists of just one processor, with an associated local block of memory. The nodes are connected in a ring, so a memory access from node zero to the memory on node one has just one hop, but accessing memory on node three requires two hops (going either through node one or node two).

Figure 2: Demonstration of Nodes, Mapping Processors to Memory Banks

The SLIT table is a simple matrix, providing relative latencies from one node to another. The reference value is 10, which marks the time for access to local memory. The leading diagonal of the matrix of a SLIT table will always be filled with the value 10. These entries mark a processor accessing memory on the same node. If, in our example, the time for a one-hop remote access is 80% longer than for a local access, then the SLIT table entry to denote that will have the value 18. If the two hop access takes 3.1 times as long, then the value for this is 31. So, the whole SLIT table for this system would look like this:

SLIT Table:

              node 0   node 1   node 2   node 3
    node 0      10       18       18       31
    node 1      18       10       31       18
    node 2      18       31       10       18
    node 3      31       18       18       10

Once Linux has parsed this information, it can create logical structures for each node. It assigns the memory belonging to each node to separate free lists, and creates preferential search paths to use in case a node runs out of memory. The default memory allocation policy in Linux is to allocate from the memory nearest to the processor on which the code is executing.

There are some exceptions: a few system-wide data structures are allocated evenly from all nodes, both to share the load and to make sure that Linux does not run node zero out of memory by allocating everything from it. A strictly local allocation policy would do exactly that, because the majority of boot-time initialization code runs on a single processor.

For our example system, processes running on the processor in node zero would allocate memory from node zero by default. If node zero were low on memory, either node one or node two would be the next choice, and node three would be the last resort.

Once a process is running on a processor on a given node, Linux prefers to keep it there. Combined with the policy of allocating memory from the local node, this helps keep the ratio of local memory accesses high. New tasks may be moved to a different node during the exec system call. At that point Linux has discarded the memory of the old program image but has not yet begun allocating memory for the new one, so it is the perfect time to migrate.

Several workloads consist of many short-lived tasks: compiling a large program from many source files, for example, or running a web server that generates dynamic content.

For these types of workloads, the default Linux kernel behavior is sufficient to make sure there is reasonable scaling on a NUMA system. Individual processes will be spread out to run across all nodes in the system, allocating memory locally as they run. Because a steady stream of new processes are being created, Linux can keep the load balanced across the system by migrating new tasks to the least loaded nodes as they are created.

In more complex situations, the operating system can run into difficulties. For example, if many long running processes are started and are spread out across all the nodes in the system, and at some later point all the processes on certain nodes complete, the operating system is left with the dilemma of deciding what to do. It has several choices. It could:

  • Start running some of the remaining processes on the now empty nodes. They will run more slowly there, because their memory will be remotely located on the nodes where they were started. But running slowly is better than not running at all.
  • Migrate the memory for some processes to empty nodes, and then start running them on the new nodes. This is an expensive operation, and the operating system has no information on whether the processes will continue to run for long enough to make that extra work worthwhile.
  • Do nothing. Perhaps some new processes will be created soon, which can be allocated to the empty nodes, rebalancing the system. If Linux had moved tasks around, it may need to undo that work.

Lacking a crystal ball to predict the future, Linux employs the first of these tactics, but this can be far from optimal for some workloads.

To make the best scheduling choices for these types of workloads, the operating system needs some advice from the user about how to distribute the workload across the nodes in the system. In many cases, it is not necessary to modify the source code and recompile an application. The administrative tool, numactl, can be used to set the desired policies and then invoke a command that inherits those policies.

Take a hypothetical application that allocates a large, shared memory segment that is later used by many smaller tasks for intensive data processing. The system administrator can use numactl to first run the application with the --interleave option, so that the shared memory segment is spread across all nodes. Then the data-intensive tasks can be started using the --membind and --cpunodebind options to spread them evenly across all the nodes in the system.
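
As a rough sketch of how that might look on a four-node system (the program names setup_shared_segment and worker are hypothetical; the actual node numbers would come from numactl --hardware):

    # Create the shared memory segment with its pages interleaved
    # across all nodes
    numactl --interleave=all ./setup_shared_segment

    # Then start one data-processing task per node, binding each task's
    # processors and memory to that node
    numactl --cpunodebind=0 --membind=0 ./worker &
    numactl --cpunodebind=1 --membind=1 ./worker &
    numactl --cpunodebind=2 --membind=2 ./worker &
    numactl --cpunodebind=3 --membind=3 ./worker &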

At the next level of sophistication, getting peak performance from a NUMA system requires changes to the application. Next I'll describe the low-level Linux interfaces, to provide a sense of the functionality that is available. Most programmers would be advised to avoid coding directly to these interfaces and should look instead to the more user-friendly wrapper functions provided by the numa library (libnuma).
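
As a taste of those wrappers, here is a minimal sketch (not the full API) that restricts a task to node zero and allocates a buffer from node zero's memory, using the libnuma functions numa_available, numa_run_on_node, numa_alloc_onnode, and numa_free:

    /* A minimal libnuma sketch: run this task on node 0 and allocate a buffer
     * from node 0's memory.  Link with -lnuma. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not supported on this system\n");
            return 1;
        }

        /* Restrict this task to the processors of node 0 ... */
        if (numa_run_on_node(0) != 0)
            perror("numa_run_on_node");

        /* ... and allocate a 16 MB buffer from node 0's memory. */
        size_t len = 16 * 1024 * 1024;
        void *buf = numa_alloc_onnode(len, 0);
        if (buf == NULL) {
            perror("numa_alloc_onnode");
            return 1;
        }

        memset(buf, 0, len);    /* touch the pages so they are really allocated */
        numa_free(buf, len);
        return 0;
    }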

Linux provides topology information about which processors are on which nodes through the /sys file system in /sys/devices/system/node/node*/{cpumap,distance}. The following system calls can also be used by the application to control placement of tasks onto cpus, and to choose appropriate memory allocation policies.
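
Before looking at those calls, here is a quick sketch that reads the topology files just mentioned and prints each node's distance vector (it assumes node numbers are contiguous, starting from zero):

    /* A quick sketch: print each node's distance vector, as exported by Linux
     * in /sys/devices/system/node/node<N>/distance. */
    #include <stdio.h>

    int main(void)
    {
        for (int node = 0; node < 64; node++) {
            char path[64], line[256];

            snprintf(path, sizeof(path),
                     "/sys/devices/system/node/node%d/distance", node);
            FILE *f = fopen(path, "r");
            if (f == NULL)
                break;                  /* no more nodes */
            if (fgets(line, sizeof(line), f) != NULL)
                printf("node %d distances: %s", node, line);
            fclose(f);
        }
        return 0;
    }

On the four-node ring of Figure 2, node zero's output would be its SLIT row: 10 18 18 31.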

sched_setaffinity: This system call provides a bit mask to specify which processors a task can run on. In NUMA applications, it is typically used to lock a task to the processors on a single node.
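
A minimal sketch of that typical use, assuming for illustration that node zero contains CPUs 0-3 (the real mapping comes from /sys/devices/system/node/node0/cpumap):

    /* A sketch that locks the calling task onto CPUs 0-3, which are assumed
     * here (for illustration only) to be the processors of node 0. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t set;

        CPU_ZERO(&set);
        for (int cpu = 0; cpu < 4; cpu++)
            CPU_SET(cpu, &set);

        /* A pid of 0 means "the calling task". */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* From here on, the scheduler will only run this task on CPUs 0-3,
         * so (with the default policy) its memory will be allocated on node 0. */
        return 0;
    }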

mbind: This system call allows a process to choose one of several memory allocation policies across an already allocated range of virtual addresses. Available policies are:

  • MPOL_DEFAULT: Use the default policy, which allocates from the local node.
  • MPOL_BIND: Allocate only from the nodes specified in a bit mask argument to the system call. This is a strict policy; memory will not be allocated from any other nodes.
  • MPOL_INTERLEAVE: Spread allocations round-robin across a set of nodes.
  • MPOL_PREFERRED: Allocate from one specific node while memory is available on it. If not, fall back to other nodes near the preferred node.

By default the policy only affects pages that have not yet been allocated: future page faults in the range will follow the desired policy. If the MPOL_MF_MOVE flag is passed, Linux will also attempt to migrate pages in the range that have already been allocated so that they conform to the new policy.
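
Here is a minimal sketch of binding a freshly created mapping to node zero with the strict policy; it uses the mbind() wrapper declared in <numaif.h> (part of the numactl/libnuma package, so link with -lnuma) and assumes node zero exists:

    /* A minimal sketch: create an anonymous mapping and bind it to node 0
     * with the strict MPOL_BIND policy. */
    #include <numaif.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 4 * 1024 * 1024;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        unsigned long nodemask = 1UL << 0;      /* bit 0 set: node 0 only */

        if (mbind(buf, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, 0) != 0) {
            perror("mbind");
            return 1;
        }

        /* Pages in this range will now come from node 0 as they are
         * first touched. */
        return 0;
    }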

set_mempolicy: This call is similar to mbind, and the same set of policies is available. The difference is that the chosen policy applies to all future allocations made by the process, rather than being limited to a particular range of virtual addresses. The two calls can be combined to provide a general policy for most allocations with specific overrides for particular regions of memory.
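
A short sketch, assuming nodes zero and one exist: interleave all future allocations made by the process across those two nodes.

    /* Interleave all future allocations by this process across nodes 0 and 1.
     * Link with -lnuma for the set_mempolicy() wrapper in <numaif.h>. */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        unsigned long nodemask = (1UL << 0) | (1UL << 1);   /* nodes 0 and 1 */

        if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
                          sizeof(nodemask) * 8) != 0) {
            perror("set_mempolicy");
            return 1;
        }

        /* Pages backing this allocation will alternate between node 0 and
         * node 1 as they are first touched. */
        size_t len = 8 * 1024 * 1024;
        char *buf = malloc(len);
        if (buf == NULL)
            return 1;
        for (size_t i = 0; i < len; i += 4096)
            buf[i] = 0;                         /* touch each page */
        free(buf);
        return 0;
    }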

move_pages: This call moves a set of pages belonging to a specified process to new nodes. The pages are given as an array of virtual addresses, and the desired destination for each page is given in a matching array of node numbers.
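
A minimal sketch, assuming node one exists: move a single page of the calling process to node one and report where it ended up (again via the <numaif.h> wrappers, linked with -lnuma).

    /* Move one page of the calling process to node 1 and report its new node. */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* One page-aligned, already-touched page to move. */
        void *page;
        if (posix_memalign(&page, 4096, 4096) != 0)
            return 1;
        *(char *)page = 0;              /* touch it so it is really allocated */

        void *pages[1]  = { page };
        int   nodes[1]  = { 1 };        /* desired target node for each page */
        int   status[1] = { -1 };       /* filled in with the resulting node */

        /* A pid of 0 means "the calling process". */
        if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) < 0) {
            perror("move_pages");
            return 1;
        }
        printf("page is now on node %d\n", status[0]);
        return 0;
    }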

Linux also provides some statistics counters in /sys/devices/system/node/node*/numastat. These counters let a user monitor how often applications succeed or fail in allocating memory from the node they asked for. More sophisticated tools are needed to measure how frequently local and remote memory is actually being accessed.

In summary, performance on NUMA systems can be very good, scaling up as more nodes are added. Some workloads may be well suited to the NUMA environment and require no changes at all, or just some simple placement hints to Linux, using the numactl program. Other programs will need source code changes to make the best use of the available memory bandwidth.

The numa library provides many tools for the programmer to control where programs are run and where they allocate memory. Finally, there are some applications that are inherently unsuited to running on a NUMA system. For example, if an application makes essentially random accesses to all memory, then performance will decrease as the number of nodes increases, because progressively more of the memory references will be to remote memory. In a system with only two nodes, about 50% of the random accesses will be to the local node, but on a sixteen-node system almost 94% of accesses (15 out of every 16) will be to remote memory.
