Dynamic Memory Allocation — Part II

Last month we introduced some of the concepts that are involved with memory management. We discussed how to dynamically allocate memory in your applications. We also described the various methods that allocators use to manage free memory. This month we turn our attention to memory allocation at a lower level -- the level of the operating system.








Compile Figure 1
Figure One: A process address space.

Last month we introduced some of the concepts that are involved with memory management. We discussed how to dynamically allocate memory in your applications. We also described the various methods that allocators use to manage free memory. This month we turn our attention to memory allocation at a lower level — the level of the operating system.

In all our discussions of memory allocation techniques, we never addressed the question, “Where does the allocator get actual blocks of memory to give to your application?” This month, we will answer that question by delving deeper into the relationship between the memory allocator and the operating system.

Every Linux process has its own dedicated memory address space. The kernel maps these “virtual” addresses in the process address space to actual physical memory addresses as necessary. In other words, the data stored at address 0xDEADBEEF in one process is not necessarily the same as the data stored at the same address in another process. When an attempt is made to access the data, the kernel translates the virtual address into an appropriate physical address in order to retrieve the data from memory. In this column, we are not really concerned with physical addresses. Rather, we are investigating exactly how the operating system allocates more virtual addresses to a process.

The part of the process address space that is used for dynamic memory allocation is called the “heap.” Any memory that you allocate using malloc(), calloc(), or realloc() comes from the heap. In contrast, the “stack” is the memory where all local variables and parameters to functions are stored.

Figure One shows a picture of the address space associated with a process. The heap is the section of memory that we are concerned with. The space in between the heap and the stack is unallocated space, therefore the top of the heap is often referred to as the “break point” because this is where the memory space is split. As more dynamic memory is needed, your application must tell the operating system to move up the break point to allocate more memory for the process, thus increasing the size of the heap.

The brk() system call moves the break point for the calling process. When the application calls this function, the operating system moves the break point to the given address. To ease the use of this system call, the standard C library defines another function, sbrk(), which simply takes as an argument the number of bytes you wish to move the break point. The prototypes for brk() and sbrk() can be found below:


int brk (void* end_data_segment);
void* sbrk (ptrdiff_t increment);

brk() returns 0 if the new allocation is successful and -1 if not. sbrk() returns a pointer to the new break point and -1 if the break point could not be moved. Figure Two shows the implementation of sbrk() as defined in the glibc package (file sysdeps/generic/sbrk.c). __curbrk is a global variable that keeps track of the current position of the break point so you don’t have to. The sbrk() function simply adds the increment passed into it to __curbrk and passes this value to the brk() system call to actually allocate the memory.




Figure Two: Implementation of sbrk()


/* Extend the process’s data space by INCREMENT.
If INCREMENT is negative, shrink data space by – INCREMENT.
Return start of new space allocated, or -1 for errors. */
void *
__sbrk (ptrdiff_t increment)
{
void *oldbrk;

/* If this is not part of the dynamic library or the library is used
via dynamic loading in a statically linked program update
__curbrk from the kernel’s brk value. That way two separate
instances of __brk and __sbrk can share the heap, returning
interleaved pieces of it. */
if (__curbrk == NULL || __libc_multiple_libcs)
if (__brk (0) < 0) /* Initialize the break. */
return (void *) -1;

if (increment == 0)
return __curbrk;

oldbrk = __curbrk;
if (__brk (oldbrk + increment) < 0)
return (void *) -1;

return oldbrk;
}

On a call to brk(), the Linux kernel performs a few checks and then allocates the new memory for the process. Figure Three shows the source code for the Linux kernel function sys_brk(). This is the function called when you perform a system call to brk().




Figure Three: The brk() System Call in mm/mmap.c


asmlinkage unsigned long sys_brk(unsigned long brk)
{
unsigned long rlim, retval;
unsigned long newbrk, oldbrk;
struct mm_struct *mm = current->mm;

down(&mm->mmap_sem);

if (brk < mm->end_code)
goto out;
newbrk = PAGE_ALIGN(brk);
oldbrk = PAGE_ALIGN(mm->brk);
if (oldbrk == newbrk)
goto set_brk;

/* Always allow shrinking brk. */
if (brk <= mm->brk) {
if (!do_munmap(mm, newbrk, oldbrk-newbrk))
goto set_brk;
goto out;
}

/* Check against rlimit.. */
rlim = current->rlim[RLIMIT_DATA].rlim_cur;
if (rlim < RLIM_INFINITY && brk – mm->start_data > rlim)
goto out;

/* Check against existing mmap mappings. */
if (find_vma_intersection(mm, oldbrk, newbrk+PAGE_SIZE))
goto out;

/* Check if we have enough memory.. */
if (!vm_enough_memory((newbrk-oldbrk) >> PAGE_SHIFT))
goto out;

/* Ok, looks good – let it rip. */
if (do_brk(oldbrk, newbrk-oldbrk) != oldbrk)
goto out;
set_brk:
mm->brk = brk;
out:
retval = mm->brk;
up(&mm->mmap_sem);
return retval;
}


Walking through this function shows the act of changing the break point for a task is fairly simple. First, the kernel aligns the old and the new break point to be on page boundaries. A page is the smallest unit of memory that the operating system will give to any process. Page sizes vary from system to system, but 4096 bytes is a common value (and is the value used by Linux on most architectures).

After aligning the addresses to fall on a boundary that is divisible by 4096 bytes, the system call checks to see if the amount of memory is decreased. If so, it immediately decreases the amount of heap space for that process with a call to do_munmap() (we’ll discuss this later) and returns.

If the system call determines that the caller actually wishes to increase the amount of heap space allocated to it, it performs three checks to make sure the allocation can succeed. The first check verifies that the specified limit of memory for this process will not be exceeded by allocating more memory to the process. Linux provides a mechanism for setting the limits of memory usage for a given process through the setrlimit() function. For more information about this function and others like it, look at the man page for setrlimit().

After verifying that a process is not requiring more memory than its limit allows, the kernel verifies that increasing the memory will not interfere with any of the other parts of memory for that process. (Other parts of memory can include mmaped files, the stack, etc.) Finally, as a last check, the kernel checks to see if there is enough memory to accommodate the new process. Once it has determined that it is okay to allocate the memory, it calls the do_brk() function to increase the memory for the process.

The two basic functions used by the sys_brk() system call to allocate and deallocate memory are do_brk()and do_munmap(). In fact, the do_brk() function is really just a simplified version of the do_munmap() function. These two functions are similar to the libc functions mmap() and munmap(). This reveals how the kernel views all the different address spaces used by a process. They are simply maps of memory. Each map takes a certain range of addresses in the processes address space and maps it to physical addresses in memory. When the process calls brk(), the operating system increases (or decreases) the size of the map for the heap, thus giving the process more memory to use.

It seems relatively simple for the kernel to allocate more memory; it only needs to make a mapping larger. However, this simplicity comes at a cost. When the size of the map is increased, the new pages that are mapped into the address space of the processes do not actually exist in memory! In other words, no physical pages are allocated for the new mapping. The operating system, by increasing the size of the map, only provides a mapping between the new addresses in the process space and the memory that those pages will occupy when they are used. As a result, it is possible, in Linux, to allocate more memory to a process than actually exists on the system.

To fully understand this, we need to more closely examine the function that checks the amount of memory remaining, vm_enough_memory() (see Figure Four). Notice that the comment in this function claims that this is a “stupid algorithm” and in fact, it is quite easy to fool this function into returning true even when there is not enough memory to satisfy a request.




Figure Four: The ‘vm_enough_memory’ Function in mm/mmap.c


/* Check that a process has enough memory to allocate a
* new virtual mapping.
*/
int vm_enough_memory(long pages)
{
/* Stupid algorithm to decide if we have enough memory: while
* simple, it hopefully works in most obvious cases.. Easy to
* fool it, but this should catch most mistakes.
*/
/* 23/11/98 NJC: Somewhat less stupid version of algorithm,
* which tries to do “TheRightThing”. Instead of using half of
* (buffers+cache), use the minimum values. Allow an extra 2%
* of num_physpages for safety margin.
*/

long free;
/* Sometimes we want to use more memory than we have. */
if (sysctl_overcommit_memory)
return 1;

free = atomic_read(&buffermem_pages);
free += atomic_read(&page_cache_size);
free += nr_free_pages();
free += nr_swap_pages;
return free > pages;
}

The vm_enough_memory() function looks at the amount of free space that the system currently has between all cache, buffers, free pages in memory, and free pages in the swap space (which is the portion of your hard drive used for additional memory). However, since the operating system can increase the size of the map without actually placing those mapped pages in memory, this function will not account for any allocated space that has not been placed in memory.

In Linux, allocated space returned by brk() is only placed in memory after the user writes to the memory. Therefore, if your process allocates a gigabyte of memory 1 MB at a time, even if your system only has 500 MB of memory available, Linux will give you that gigabyte of memory without thinking twice. This happens because Linux thinks that it can satisfy each single request of 1 MB. Of course, when you start writing to all of that newly allocated memory, you will have a problem. Once the actual memory fills up, no more memory is available, even if processes have more allocated space!

What does Linux do in this case? When the memory is completely saturated, it kills off any process that tries to write to space that isn’t in memory (even if that space has been allocated to the process).

The choice made in Linux has some unfortunate consequences. Imagine that you are testing a program that allocates memory and never frees it. Once the program allocates too much memory and starts writing to it, the memory on your machine will become full. At this point, any process running on your machine (e.g., sendmail, httpd, or even X) that tries to write to a page of allocated, but unwritten, memory will be instantly killed by the operating system. In this case, it is almost impossible to determine exactly which set of processes is killed. Of course, you will notice if X dies, but the absence of programs like sendmail and other background processes may not be immediately obvious. As a result, you will probably be forced to reboot your machine. (You may think that this example is contrived, but unfortunately, this situation occurs more frequently than we’d like.)

You are probably wondering how to avoid such situations. As mentioned previously, you can limit the amount of memory any process is allowed to allocate with the setrlimit() function. Also, the shell commands limit and ulimit can restrict the memory usage of processes that are called from that shell. These techniques can guard against a single buggy process allocating too much memory. However, the use of setrlimit(), limit, and ulimit will not solve the problem of 1,000 different processes each allocating 10 MB of memory and writing to it later. Fortunately, this situation tends to be much less common than a single process going haywire.

Now that we understand how the operating system gives us more memory, let’s back up a level and discuss the implementation of malloc() in Linux and how it uses sbrk() (and by extension, the brk() system call) to allocate more memory when it can’t find a free block to return. The source is quite long, so we will simply outline the functions that are called.

If you look at the source for malloc() (found in the malloc/malloc.c file in the glibc package), this description will help you identify what is going on where. The malloc() function calls an internal helper, chunk_ alloc(), to find the block to return to the user. This helper is the function that looks through the bins for a match to return (as described last month). If it cannot find a suitable match, it calls malloc_extend_top(), another helper function in the same file. It is this function that actually calls sbrk() in the following line (MORECORE is defined to be sbrk() in the Linux implementation):


brk = (char*)(MORECORE (sbrk_size));

Once the memory is retrieved from the operating system, chunk_alloc() splits off the piece that is required for the current allocation. It then adds the remainder to the appropriate bin for future use and returns the new block to malloc(), which then returns that block to the user. As it turns out, most of the work that malloc() does deals with deciding how to manage memory that has already been allocated by the OS.

This concludes our two-part exploration of memory allocation in Linux. With this solid understanding of the ins and outs of dynamic memory from both the application and operating system perspectives, you probably now know enough to write your own memory allocator. Or, perhaps just knowing exactly what happens when you call malloc() is enough for you. In either case, happy hacking!



Benjamin Chelf is an author and engineer at CodeSourcery, LLC. He can be reached at chelf@codesourcery.com.

Fatal error: Call to undefined function aa_author_bios() in /opt/apache/dms/b2b/linux-mag.com/site/www/htdocs/wp-content/themes/linuxmag/single.php on line 62