Down Memory Lane, Part Two

Learn how the kernel views physical memory.
Last month’s “Gearheads” showed you how to use the kernel’s Memory Technology Devices (MTD) subsystem to interface an embedded Linux device with different flavors of flash memory. This month, let’s look at how the kernel manages physical memory (RAM). Let’s cautiously flip through kernel memory pages with the help of an example, and see how the kernel accesses some of the memory components shown in Figure One.

Linux’s View of Physical Memory

The kernel organizes physical memory into pages. The exact size of each page depends on the architecture. On x86-based machines, it’s 4,096 bytes.
Each page in physical memory has a struct page (defined in include/linux/mm.h) associated with it:
struct page {
    page_flags_t flags;   /* Page status */
    atomic_t _count;      /* Reference count */
    /* … */
    void *virtual;        /* Explained later */
    /* … */
};
On 32-bit x86 systems, the default kernel configuration splits the available 4 GB address space into a 3 GB virtual memory space for user processes, and a 1 GB space for the kernel. This imposes a 1 GB limit on the amount of physical memory that the kernel can handle. (In reality, the limit is 896 MB, because the top 128 MB of the kernel's address space is set aside for vmalloc regions and other special kernel mappings.)
FIGURE ONE: Memory components on an x86-based laptop or embedded derivative

Page structures are associated with physical memory addresses, not kernel addresses. Kernel addresses that map the low 896 MB differ from physical addresses by a constant offset, and are called logical addresses. With “high memory” support, the kernel can access memory beyond 896 MB by generating virtual addresses for those regions using special mappings. All logical addresses are kernel virtual addresses, but not vice versa.
This leads to the following kernel memory zones:
* ZONE_DMA (< 16 MB). ZONE_DMA is the DMA-able zone. Since legacy ISA devices have 24-bit address lines and can hence access only the first 16 MB, the kernel tries to dedicate this area to such devices.
* ZONE_NORMAL (16 MB to 896 MB). This is the normally addressable region, also called low memory. kmalloc() returns memory from this area. The virtual field in struct page for low memory pages contains the corresponding logical addresses.
* ZONE_HIGHMEM (> 896 MB). ZONE_HIGHMEM is the space that the kernel can access only after mapping such pages into ZONE_NORMAL (using kmap() and kunmap()). The corresponding kernel addresses are virtual, not logical. vmalloc() returns kernel virtual addresses. The virtual field in struct page for high memory pages is NULL unless the page is "kmapped."
As mentioned above, Linux user processes own a virtual address space of 3 GB. Virtual addresses allow programs to have an address space larger than the available physical memory. Processors support a mechanism called a page table to automatically generate physical addresses from virtual addresses.

Page Management Made Easy

To learn more about page management, let’s implement a small modification to the kernel page setup code.
The kernel maintains a pool of free pages to help it manage physical memory. The modification will test each page before releasing it into the free pool. Merely adding test logic before freeing the corresponding struct page during boot is time consuming, so let’s adopt the following strategy:
1. At boot time (mm/bootmem.c), test and free some number of pages just sufficient to get the kernel running.
2. Create a kernel thread, which will slowly scan through the remaining memory pages, testing and freeing them in chunks. The number of pages available to your system will gradually increase as the thread works its way through untested pages.
You can see the available free memory at any instant, using the top command or by peeking into /proc/meminfo.
Listing One shows the changes to mm/bootmem.c (lines starting with +) that add only MIN_MEMORY_FOR_BOOT pages to the free pool during boot.
Listing One: Test and Free minimum pages at boot time

+ #define MIN_MEMORY_FOR_BOOT (64*1024*1024)

+ EXPORT_SYMBOL(max_low_pfn); /* Export so you can use it in Listing Two */
+
static unsigned long __init free_all_bootmem_core(pg_data_t *pgdat)
{
/* … */

if (gofast && v == ~0UL) {
/* … */
+ if (i < (unsigned long)(MIN_MEMORY_FOR_BOOT >> PAGE_SHIFT)) {
+ if (test_pages (page, order))
__free_pages (page, order);
+ }
/* … */
} else if (v) {
/* … */
+ if (i < (unsigned long)(MIN_MEMORY_FOR_BOOT >> PAGE_SHIFT)) {
+ if (test_pages (page, order))
__free_pages (page, order);
+ }
/* … */
} else {
/* … */
+ if (test_page (page))
__free_page (page);
/* … */
}
/* … */

Depending on the page size imposed by the architecture, each physical address has a fixed number of least-significant bits (defined as PAGE_SHIFT) that hold the page offset. The remaining higher bits form the page address. Right-shifting an address by PAGE_SHIFT yields the page frame number (PFN). In Listing One, MIN_MEMORY_FOR_BOOT is right-shifted by PAGE_SHIFT to get the corresponding PFN. The kernel variable max_low_pfn contains the largest PFN in low memory.
The actual memory test functions, test_pages() and test_page(), are not included in the listing. Algorithms that test the wiring of address/data lines and detect missing chips can easily be obtained elsewhere.
Compile the above changes into your kernel, reboot, and peek at system memory statistics:
$ cat /proc/meminfo
MemTotal: 245520 kB
MemFree: 10464 kB
Buffers: 5448 kB
Cached: 24192 kB

MemTotal is the total usable memory on your system, while MemFree is the memory currently available. Cached is the size of the page cache, which is used to buffer accesses to the disk. The kernel automatically usurps memory from MemFree to expand the page cache, but also contracts the page cache if programs need more memory.
Listing Two implements the kernel thread memwalkd, which tests and frees the remaining memory pages after boot.
Listing Two: A kernel thread that walks, tests, and frees memory

#include <linux/mm.h>
#include <linux/signal.h>
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/bootmem.h>
#include <linux/page-flags.h>

/* Same value as in Listing One */
#define MIN_MEMORY_FOR_BOOT (64*1024*1024)

static int npages = 0;

static int
memwalkd (void *unused)
{
  long curr_pfn = (MIN_MEMORY_FOR_BOOT >> PAGE_SHIFT);
  struct page *curr_page;
  siginfo_t info;
  int i, sig;

  daemonize ("memwalkd");
  allow_signal (SIGKILL);
  allow_signal (SIGUSR1);

  for (;;) {
    /* Sleep until a signal arrives */
    set_current_state (TASK_INTERRUPTIBLE);
    schedule ();

    if (signal_pending (current)) {
      sig = dequeue_signal (current, &current->blocked, &info);
      if (sig == SIGUSR1) {
        /* Test and free a chunk in one shot if a SIGUSR1
         * signal is received */
        for (i = 0; i < npages; i++, curr_pfn++) {
          if (curr_pfn >= max_low_pfn) {
            /* No more pages */
            break;
          }
          /* Get the page structure corresponding to this PFN */
          curr_page = pfn_to_page (curr_pfn);
          if (!PageReserved (curr_page)) {
            /* Test the page, and if okay,
             * release it to the free pool */
            set_page_count (curr_page, 1);
            if (test_page (curr_page))
              __free_page (curr_page);
          }
        }
        printk ("memwalkd: %d KB Tested and Freed\n",
                (npages << PAGE_SHIFT) / 1024);
      } else if (sig == SIGKILL) {
        /* Die if a KILL signal is received */
        break;
      }
    }
  }

  set_current_state (TASK_RUNNING);
  return 0;
}

static int __init
memwalkd_init (void)
{
  kernel_thread (memwalkd, NULL, CLONE_KERNEL);
  return 0;
}

#ifdef MODULE
module_init (memwalkd_init);
module_param (npages, int, 0);
MODULE_LICENSE ("GPL");
#endif

For easier illustration, memwalkd is designed to test and free npages pages each time a SIGUSR1 signal is delivered to it. Of course, in a practical implementation, you'd want memwalkd to do its job continuously in the background rather than in response to received signals. npages is passed to the thread as a parameter at module insertion time. The 2.6 kernels use module_param() for defining module parameters, which obsoletes the MODULE_PARM() used in earlier versions.
To obtain the page structure corresponding to a PFN, memwalkd uses the pfn_to_page() function. If you need the reverse mapping, use page_to_pfn().
Insert your module as follows:
$ insmod memwalk.ko npages=10000
Now, ask memwalkd to test and free a chunk of 40,000 KB (npages * PAGE_SIZE) by dispatching a SIGUSR1 signal:
$ ps -ef | grep memwalkd | \
  grep -v grep | \
  awk '{print $2}' | xargs kill -SIGUSR1

$ tail /var/log/messages
Nov 10 23:09:15 localhost kernel: memwalkd: 40000 KB Tested and Freed.
Verify that 40,000 KB has indeed been added to the free pool. You won't see an exact 40,000 KB increase, due to instant demands from the page cache and temporary buffers:
$ cat /proc/meminfo
MemTotal: 245520 kB
MemFree: 43428 kB
Buffers: 5772 kB
Cached: 27248 kB

Debugging the Memory Thread

If you have a small amount of RAM in your machine, say 256 MB, you may run into an error after delivering SIGUSR1 to memwalkd for some time:
Bad page state at free_hot_cold_page (in process 'memwalkd', page c11ecde0)
flags:0x40000080 mapping:00000000 mapcount:0 count:0
[<c01040fe>] dump_stack+0x1e/0x20
[<c0141f16>] bad_page+0x76/0xb0
[<c014263f>] free_hot_cold_page+0x5f/0x130
Trying to fix it up, but a reboot is needed
To figure out what’s happened, let’s add a debug statement before invoking __free_page():
printk ("Current PFN=%ld, Page Flags=%lx\n",
        curr_pfn, pfn_to_page(curr_pfn)->flags);
This emits the following message before the crash:
Current PFN=63087, Page Flags=40000080
The bit corresponding to PG_slab (see include/linux/page-flags.h) is set for the page that caused the crash. This means that the kernel slab layer owns this page. The slab layer is an allocator that makes it easier for drivers to manage memory buffers. But how did the slab layer get its hands on page number 63087, which wasn’t added to the free pool during boot time?
If you search for a clue in the kernel boot-up messages (/var/log/dmesg), a line similar to this raises a red flag: Freeing initrd memory: 387k freed. initrd is a RAM disk that is loaded by the bootloader. It's mounted as the root filesystem after the kernel boots, to load additional modules required to mount the actual root filesystem. The memory address where initrd is loaded (INITRD_START) is passed down to the kernel by the bootloader.
A printk() in arch/i386/kernel/setup.c to elicit this address yielded:
Debug: INITRD_START = 0xF66F000
If you convert this to a page frame number (by dividing by PAGE_SIZE), you get 63087. That’s exactly the page responsible for the crash!
After mounting the root filesystem, the memory where initrd resides is freed by the kernel. After this, those pages are doled out to other parts of the kernel that request memory (the slab layer in this case).
To get past the problem, modify the code in Listing Two so that it checks for all page flags (like PG_slab) before it adds each page to the free pool, and modify mm/init.c to test pages occupied by initrd before freeing them.

ECC Memory

The DRAM chip shown in Figure One is labeled ECC, or Error-Correcting Code memory. ECC memory contains special silicon to measure the accuracy of data. Typical ECC DRAM chips can correct single-bit errors (SBEs) and detect multi-bit errors (MBEs).
The mainstream Linux kernel tree does not support ECC. This means that if your DRAM controller supports ECC, error correction and detection occur silently, and Linux user applications don’t get a chance to fashion error handling policies.
The linux-ecc project, hosted at http://www.anime.net/~goemon/linux-ecc/, adds ECC configuration and error reporting support for different chipsets.
ECC DRAM controllers generally have two related registers: an Error Status Register and an Error Address Pointer Register. When an ECC error is detected, the former register contains the status (whether the error is an SBE or an MBE), while the latter register contains the address where the error occurred. The linux-ecc driver periodically checks these registers and reports results to user space via the process filesystem. To support your ECC DRAM chip with the linux-ecc driver, add hooks to read these registers, usually accessed via the PCI configuration space of your north bridge.

Accessing the CMOS

Desktops and laptops contain a small chunk of battery-powered Complementary Metal Oxide Semiconductor (CMOS) memory to hold Real Time Clock (RTC) registers and BIOS setup parameters. The kernel provides a character driver (/dev/nvram) to access the contents of the CMOS.
To use this, turn on CONFIG_NVRAM in your kernel configuration. CMOS contents are protected using a Cyclic Redundancy Check (CRC), which the nvram driver adjusts after each write. The first 14 CMOS bytes are usually used by the RTC. Look at /proc/driver/nvram to see how the BIOS interprets the remaining 114 bytes.
To dump the contents of CMOS, do:
$ od -x /dev/nvram

0000000 0000 0040 00f0 8003 0002 01fc 0000 1f00

0000160 00b2

Flash Memory on Desktops/Laptops

Desktops and laptops contain a flash memory chip called the firmware hub (FWH) that holds the BIOS. The FWH is not directly connected to the processor’s address and data bus. Instead, it’s interfaced via the Low Pin Count (LPC) bus, which is part of south bridge chipsets (see Figure One).
The kernel’s MTD subsystem is responsible for interfacing your system with the FWH. FWHs are usually not compliant with the Common Flash Interface (CFI) specification, a mechanism to automatically detect the configuration and command set used by the flash chip. Instead, they conform to the JEDEC (Joint Electron Device Engineering Council) standard.
To inform MTD about an unsupported JEDEC chip, add an entry to the jedec_table array in drivers/mtd/chips/jedec_probe.c with information like the chip manufacturer ID and the command set ID.
An example configuration to make the kernel aware of an FWH is:

CONFIG_MTD=y
CONFIG_MTD_JEDECPROBE=y
CONFIG_MTD_ICH2ROM=y
CONFIG_MTD_CFI_INTELEXT=y

CONFIG_MTD_JEDECPROBE enables the JEDEC MTD chip driver, while CONFIG_MTD_ICH2ROM adds the MTD map driver that maps the firmware hub to the processor's address space. Additionally, you need to include the appropriate command set implementation (for example, CONFIG_MTD_CFI_INTELEXT for Intel Extension commands). Once these modules are built in, applications can talk to the FWH using MTD APIs.

Looking at the Sources

The kernel memory management code lives in the mm/ subdirectory. Dive into the arch/your-arch/mm/ directory for the platform-dependent code that does things like manipulate page tables. To access an unsupported BIOS firmware hub from Linux, use drivers/mtd/maps/ich2rom.c as the starting point.
If you enable CONFIG_DEBUG_SLAB, CONFIG_DEBUG_HIGHMEM, or CONFIG_DEBUG_PAGEALLOC while configuring your kernel, additional error-checking code gets compiled in, which can help debug problems related to memory management.

Sreekrishnan Venkateswaran has been working for IBM India for about ten years. His recent projects include porting Linux to a pacemaker programmer and writing firmware for a lung biopsy machine. You can reach Krishnan at krishhna@gmail.com.
