The Passage of Time

The kernel keeps track of the flow of time. It provides services to sleep wait for long periods and busy wait for short periods, can schedule events to run at a future time, and can do periodic work with microsecond precision. Here’s a look at kernel time.
Many features of the Linux kernel are critically dependent on the passage of time. The Linux kernel makes use of different timers provided by hardware to provide time-dependent services such as busy wait and sleep wait. The kernel also facilitates scheduling for functions that desire to run after some specified time has elapsed.
In this month’s “Gearheads” column, let’s explore the the semantics of kernel variables like jiffies, HZ, loops_per_jiffy, and xtime, learn how to measure execution times using the Pentium Time Stamp Counter (TSC), and use the Real-Time Clock (RTC) to perform periodic work from an application with microsecond precision.

Of HZ and Jiffies

System timers interrupt the processor, or pop, at programmable frequencies. This frequency, or the number of timer ticks per second, is contained in the kernel variable HZ. Choosing a value for HZ is a trade-off: a large HZ value results in finer timer granularity and better resolution. However, high values of HZ also result in larger overhead, since more cycles are spent executing the timer interrupt handler.
The value of HZ is architecture-dependent. On x86 systems,the 2.4 kernels set HZ to 100, but with 2.6, this value was changed to 1000. On ARM-based platforms, 2.6 kernels use a HZ value of 100.
The jiffies variable holds the number of times the system timer popped since the system booted. The kernel increments jiffies HZ times every second. Thus, on a kernel with a HZ value of 100, a “jiffy” is a 10-millisecond duration, while on a kernel with HZ set to 1000, a jiffy is only 1-millisecond.
To better understand HZ and jiffies, consider the following code snippet from the IDE driver (drivers/ide/ide.c) that polls disk drives for busy status:
unsigned long timeout = jiffies + (3 * HZ);
while (hwgroup->busy) {
/* … */
if (time_after (jiffies, timeout)) {
return -EBUSY;

/* continue your work */

return SUCCESS;
The snippet returns SUCCESS if the busy condition gets cleared in less than 3 seconds, and -EBUSY otherwise. 3*HZ is the number of jiffies present in 3 seconds. The calculated timeout value (jiffies+3*HZ) is thus, the value that will be contained in jiffies after the 3-second timeout has elapsed. The time_after() macro compares the current value of jiffies with the requested timeout, taking care to account for wrap-arounds due to overflows. Related functions available for doing similar comparisons are time_before(), time_before_eq() and time_after_eq().
jiffies is defined as volatile, which asks the compiler not to optimize access to the variable. This ensures that jiffies (which is updated by the timer interrupt handler during each tick) is re-read during each pass through the loop.
For the reverse conversion from jiffies to seconds, have a look at this snippet from the SCSI driver, (drivers/block/scsi_ioctl.c):
start_time = jiffies;
blk_execute_rq (q, bd_disk, rq);
/* … */
hdr->duration = ((jiffies - start_time) * 1000) / HZ;
Here, hdr->duration contains the amount of time (in milliseconds) that it took for a SCSI command to execute. (jiffies-start_time) is the number of jiffies that elapsed while the command was executing. The division by HZ converts that value into seconds, while the multiplication by 1000 changes the measure to milliseconds.

Delaying Execution for Long Durations

In kernel terms, delays in the order of jiffies can be considered long durations. A possible, but non-optimal way to accomplish long delays is by busy waiting. A function that busy waits has a “dog-in-the-manger” attitude. It neither uses the processor for doing useful work, nor does it let others use it.
For example, the following code hogs the processor for one second:
unsigned long timeout = jiffies + HZ;
while (time_before (jiffies, timeout));
A better approach is to sleep wait, because it can yield the processor to others while waiting for the time delay to elapse. This can be done using schedule_timeout():
unsigned long timeout = jiffies + HZ;
schedule_timeout (timeout);
The delay guarantee is only on the lower bound of the timeout. Whether from kernel space or from user space, it is difficult to get more precise control over timeouts than the granularity of HZ, since process time slices are updated by the scheduler only during timer ticks. Also, even if your process enters the run queue after the specified timeout, the scheduler can decide to pick another process from the queue based on priorities.
Two other functions that facilitate sleep waiting are wait_event_timeout() and msleep(). Both are implemented with the help of the schedule_timeout() function. wait_event_timeout() can be used if your code desires to resume execution either if a specified condition becomes true or if a timeout occurs. msleep() sleeps for the specified number of milliseconds.
These delay techniques are suitable for use only from process context. Busy waiting for long durations is possible from interrupt context, but is considered a mortal sin. Equally taboo is busy waiting with interrupts disabled. Sleep waiting cannot be done from interrupt context since interrupt handlers are not allowed to schedule or sleep.

Doing Work in the Future

The kernel also provides programming interfaces to execute a function at some point in the future. You can populate a timer_list with the address and parameters of your function and register it using add_timer(), as shown in Listing One.
LISTING ONE: Scheduling a function to run in the future

#include <linux/timer.h>
struct timer_list my_timer;

/* … */
init_timer (&my_timer);
my_timer.expire = jiffies + n*HZ; /* n is the number of seconds */
my_timer.data = func_parameter; /* Parameter to be passed to timer_func */
my_timer.function = timer_func; /* Function to execute after n seconds */
add_timer (&my_timer);
/* … */

timer_func() runs once with parameter func_parameter. If you’d like to run timer_func() periodically, you can repeat the code in Listing One in timer_func() itself to reschedule the function over and over. The new function is shown in Listing Two.
LISTING TWO: To run a function repeatedly and periodically, reschedule the function call within the function itself

static void timer_func (unsigned long func_parameter)
/* Do work to be done periodically */
/* … */

init_timer (&my_timer);
my_timer.expire = jiffies + n*HZ;
my_timer.data = func_parameter;
my_timer.function = timer_func;
add_timer (&my_timer);

You can use mod_timer() to change the expiration of my_timer, and del_timer() to cancel my_timer. If you look at kernel/timer.c,, you can see that the schedule_timeout() function internally uses these same APIs.

Delaying Execution for Small Durations

In kernel terms, sub-jiffy delays qualify as small durations. Such delays are commonly requested from both process and interrupt contexts.
Since it’s not possible to use jiffy-based methods to implement sub-jiffy delays, the methods discussed in the previous section to sleep wait cannot be used for small timeouts. The only solution is to busy wait.
The kernel APIs to implement short delays are mdelay(), udelay(), and ndelay(), which support millisecond, microsecond and nanosecond delays, respectively. The actual implementations of these functions are architecture-specific and need not be supported on all platforms.
Busy waiting for short durations is accomplished by measuring the time that the processor takes to execute an instruction and looping for the necessary number of iterations. The busy loop calibration is done during system boot and the value is stored in a variable named loops_per_jiffy. The short-delay APIs use the loops_per_jiffy value to decide the number of times they need to busy loop. To achieve a one-microsecond delay during a handshake process, the USB host controller driver, drivers/usb/host/ehci-hcd.c, uses udelay():
do {
result = readl (ptr);
/* … */

if (result == done) return 0;
udelay (1); /* Internally uses loops_per_jiffy */
} while (usec > 0);

The Pentium Time Stamp Counter

The Time Stamp Counter (TSC) is a 64-bit timer register in Pentium-compatible processors that counts the number of clock cycles consumed by the processor. Since the TSC gets incremented at the rate of the processor cycle speed, it provides a high-resolution timer and is commonly used for profiling and instrumenting code. It can be accessed using the rdtsc instruction to measure execution time of intervening code with microsecond precision. TSC ticks can be converted to seconds by dividing by the CPU clock speed, which can be read from the kernel variable, cpu_khz.
In Listing Three, the low_tsc_ticks variables contain the lower 32 bits of the TSC, while the high_tsc_ticks variables contain the higher 32 bits. The lower 32 bits overflow in a few seconds depending on your processor speed, but is sufficient for many code instrumentation purposes.
LISTING THREE: Calculating the duration of a task using the Pentium Time Stamp Counter

unsigned long low_tsc_ticks0, high_tsc_ticks0;
unsigned long low_tsc_ticks1, high_tsc_ticks1;
unsigned long exec_time;

rdtsc (low_tsc_ticks0, high_tsc_ticks0); /* timestamp before */
printk (“Hello World\n”); /* Code to be profiled */
rdtsc (low_tsc_ticks1, high_tsc_ticks1); /* timestamp after */

exec_time = low_tsc_ticks1 – low_tsc_ticks0;

On a 1.8 GHz Pentium, exec_time measured 871 (or half a microsecond).
User space functions like nanosleep(), clock_settime(), and clock_gettime() can be used to access kernel timer services from user space. When used in tandem with real time scheduling, nanosleep() busy waits for requested delays of less than two milliseconds, using mdelay(). For larger delays, nanosleep() sleep waits. The getitimer() and setitimer() functions deliver a signal to the calling process when a specified timeout expires. The high-resolution timer patch (not part of the mainstream kernel) makes use of the TSC in Pentium-class machines to support high-precision for many of these functions.

The Real Time Clock

The Real Time Clock (RTC) tracks absolute time in nonvolatile memory. On x86 PCs, RTC registers constitute the top few locations of the CMOS. If you enable CONFIG_RTC during kernel configuration, the RTC can be accessed via the character device, /dev/rtc. Only one process is allowed to access this device at a time.
The RTC can be used to:
*Read and set the absolute clock and generate interrupts during clock updates.
*Generate periodic interrupts with frequencies ranging from 2 Hz to 8192 Hz.
*Set an alarm for a future time.
Many applications need the concept of absolute time or wall time. jiffies does not serve this purpose since it is relative to the time when the system booted. The kernel maintains wall time in a variable called xtime. During boot up, xtime is initialized to the current wall time by reading the RTC. When the system halts, the wall time is written back to the RTC.
You can use the do_gettimeofday() routine to read the wall time with the highest resolution supported by the hardware. This is shown in Listing Four.
LISTING FOUR: do_gettimeofday() returns a fine-resolution wall clock time

#include <linux/time.h>

static struct timeval curr_time;

do_gettimeofday (&curr_time);
my_timestamp = cpu_to_le32 (curr_time.tv_sec); /* Record timestamp */

There are a number of functions available to user code to access wall time, including:
*time(), which returns the calendar time, or number of seconds since the” Epoch” (00:00:00 on January 1, 1970).
*localtime(), a function that returns the calendar time in a decomposed format (see the definition of struct tm in time.h).
*mktime(), which does the reverse of localtime().
*And gettimeofday(), which can return the calendar time with microsecond precision, if your platform supports it.

Doing Periodic Work With Microsecond Precision

Many real-time applications must perform some work periodically in a time bound manner. The RTC is a timer source that can generate periodic interrupts with microsecond precision. Listing Five shows an example that reads /dev/rtc to get regular interrupt reports. The Pentium TSC is used to measure response times.
The real time nature of data collection can be affected due to preemptive scheduling. To guard against this indeterminism, the program in Listing Five changes its scheduling policy to SCHED_FIFO using the sched_setscheduler() function. (SCHED_OTHER is the default scheduling policy used by most processes.) SCHED_FIFO is a real time scheduling policy and always preempts a currently running SCHED_OTHER process. (The third scheduling policy supported on Linux is called SCHED_RR, or Round Robin scheduling, which is an enhancement of SCHED_FIFO.) SCHED_FIFO and SCHED_RR cannot prevent latencies due to interrupts.
Next, the mlockall() routine is invoked to lock all mapped pages in memory to ensure that swapping won’t interfere with deterministic timing. Only superuser processes are allowed to use mlockall() or change the scheduling policy.
Listing Five: Doing periodic work with microsecond precision

#include <stdio.h>
#include <linux/rtc.h>
#include <sys/ioctl.h>
#include <sys/time.h>
#include <fcntl.h>
#include <pthread.h>
#include <linux/mman.h>

/* Read the lower half of the Pentium Time Stamp Counter
* using the I<rdtsc> instruction
#define rdtscll(val) __asm__ __volatile__ (“rdtsc” : “=A” (val))

main ()
long long ts0,ts1; /* Store TSC ticks */
struct sched_param sched_p; /* Scheduling Policy related information */
long now, worst, data;
int fd, i=0;

/* Change the scheduling policy to SCHED_FIFO */
sched_getparam (getpid(), &sched_p);
sched_p.sched_priority = 50;
sched_setscheduler (getpid(), SCHED_FIFO, &sched_p);

/* Disable Paging */
mlockall (MCL_CURRENT);

/* Open the RTC device */
fd = open (“/dev/rtc”, O_RDONLY);

/* Set the periodic interrupt frequency to 8192 Hz
* This should give an interrupt rate of 122 uS
ioctl (fd, RTC_IRQP_SET, 8192);

/* Turn ON periodic interrupt */
ioctl (fd, RTC_PIE_ON, 0);
rdtscll (ts0);
worst = 0;

while (i++ < 10000) {
/* Block till the next periodic interrupt occurs */
read (fd, &data, sizeof (unsigned long));

/* Use the TSC to precisely measure the time consumed.
* Reading the lower half of the TSC is sufficient */
rdtscll (ts1);
now = (ts1-ts0);

/* Update the worst case latency */
if (now > worst) worst = now;
ts0 = ts1;

/* Do the work that is to be done periodically */
printf(“Worst latency was %8ld us\n”, worst);

/* Turn OFF periodic interrupts */
ioctl (fd, RTC_PIE_OFF, 0);

The code in Listing Five loops for 10,000 iterations and prints the worst case latency that occurred during execution. On a 1.8 GHz Pentium, the output was 240,899, which roughly corresponds to 133 microseconds. According to the data sheet of the RTC chipset, a timer frequency of 8,192 Hz should result in a periodic interrupt rate of 122 microseconds. That looks close enough. You can run the code using SCHED_OTHER instead of SCHED_FIFO and see the resultant drift.

Looking at the Sources

Timer services in the kernel consist of architecture-dependent portions that live in the arch/…/kernel/ subdirectory, and generic portions defined in kernel/timer.c/ For related definitions, look at the header files, include/linux/time*.h.
jiffies is defined in linux/jiffies.h. The value for HZ is processor-dependent and can be found in include/asm-…/param.h. Look at drivers/char/rtc.c for the Real Time Clock driver. To see how the kernel calibrates loops_per_jiffy, look at init/calibrate.c.

Sreekrishnan (Krishnan) Venkateswaran has been working for IBM India since 1996. His recent Linux projects include putting Linux onto a wristwatch, a cell phone, and a pacemaker programmer. You can reach Krishnan at class="emailaddress">krishhna@gmail.com.

Comments are closed.