Introduction to System Calls

Before anyone can write software for a particular operating system, they must first understand the programming interfaces the system provides. Many of Linux's APIs are defined by the POSIX standard; other APIs which Linux provides were originally introduced by groups like the X Consortium or by other operating systems. At the core of all of these APIs is Linux's system call interface, which every other system API is built around.

Below, I’ll introduce you to some of the basic concepts involved with system calls. This article
is intended for users with minimal familiarity with the C programming language, and little to no
experience programming on any POSIX system. As C is (by far) the most popular programming language
used on Linux systems, it’s the only language we’re going to mention here.

In any modern operating system, there is a basic dichotomy between code which runs in privileged
mode (which UNIX folks normally call kernel space) and code which executes in user space. Kernel
code has complete control over the machine; it can access any of the machine’s resources, such as
memory, network adapters, and disk drives. User space code has limited access to system resources.
In order to read from a disk drive or write to the network, for example, user code has to ask the
kernel to perform the work on the user code’s behalf. If user code tries to carry out an operation
which it doesn’t have permission to do, the microprocessor notifies the kernel, which normally kills
the user space process.

This split between kernel and user code allows computers to juggle many independent tasks. The
kernel allows a user space program to run for a while, and then stops it to let other tasks run.
Additionally, the kernel can instruct the microprocessor to prevent one program from interfering
with resources being used by another, thus preventing tasks from harming one another.

Whenever user space programs need to access system resources they don’t own, they have to ask
the kernel for help. File and network access, creating and destroying other processes, and
allocating additional memory are all areas where the kernel becomes involved. By being involved in
these types of operations, the kernel retains complete control over the system. One task can be
refused access to a file when another is accessing it, memory allocation requests can be denied if
the system is running low on resources, and users can be prevented from killing each other’s
processes. On any POSIX system, the kernel has the primary responsibility for protecting system

System calls allow user space programs to request services from the kernel. In C, system calls
look just like normal function calls, but they have a very different implementation. Rather than
simply transferring control of the program, system calls switch the system to kernel mode. Once the
kernel has control, it performs the requested service, returns the system to user mode, and then
transfers control back to the originating process.

Every Linux program can be thought of as a very simple loop:

1. Compute something

2. Make a system call

3. Go to step 1

In other words, all programs can do is make system calls and perform the computations which decide what system
call to make next (memory mapping makes this a bit of an oversimplification, but it’s accurate for
the vast majority of programs). In a very real sense, programs are defined by the sequence of system
calls they generate.

Most Linux distributions provide a utility called strace, which allows users to see
what system calls a program is making. This is an incredibly useful thing to be able to do.
Inexplicable “file not found” errors can be narrowed down quite quickly thanks to strace.
As an example, we’ll look at a portion of the output from strace /bin/echo hello world (if
you have a Linux machine handy, go and run this command).

The first thing you’ll notice is that for such a simple command, this generates a lot of system
calls (27 on my system). Most of these system calls are involved with loading the program itself
(which is done in user space by the /lib/ld-linux.so library) and initializing the C
library. In fact, the program itself only generates a single system call:

write(1, "hello world\n", 12) = 12

In this case, /bin/echo called the write() system call with three arguments.
The first was the number 1, which identifies which file should be written to (in this case, the
normal output terminal). The second argument was a pointer to a string containing the value “hello
world\n” (the \n represents a single newline character), and the third was the number 12, which is the
number of characters pointed to by the second argument (remember that \n is a single character). The
“= 12” reported by strace means that the write() system call returned the number
12, meaning (in this case) that 12 characters were successfully written to file number 1 (standard
output). In other words, the system call did everything we wanted /bin/echo to do for us in
the first place.

Now let’s look at what strace tells us about a system call which fails. On most
systems, cat /ABC will fail, so try running strace cat /ABC. After the same sort
of program initialization we saw when we ran /bin/echo, this line appears:

open("/ABC", O_RDONLY) = -1 ENOENT (No such file or directory)

Strace clearly shows that the open() system call is failing with an error
called ENOENT, which in this case means that the file wasn’t opened because it doesn’t exist
on the system. (O_RDONLY is a constant numeric value which indicates the file is to be
opened only for reading, not for writing.) The open() system call actually returned -1,
which is a generic way for the kernel to return an error. The actual error code, which is a small
integer (equivalent to the constant ENOENT in this case), is stored in the errno
global variable. Programs which need to check for an error from open() will include a
test something like Figure 1.

Figure 1

  returnCode = open(somefile, O_RDONLY);
  if (returnCode < 0) {
      /* Handle the error */
      printf("Error %d occurred\n", errno);
  }

A complete listing of error codes, and their symbolic equivalents, can be found in /usr/include/asm/errno.h on any Linux system. Note that the numeric values are not portable, and
vary between different architectures. Because of this, the symbolic names should always be used.

Small numbers are not terribly easy to comprehend. As users prefer less cryptic error messages,
C provides two easy ways of displaying better messages. One of them is the
strerror() function, which takes an error code as its sole parameter and returns a pointer
to a string which describes the error. The following code will display a reasonable description of
whatever error is currently contained in errno:

  printf("Error string: %s\n", strerror(errno));

The other method is perror(), which prints a string you supply (by convention, the
name of the system call), followed by a colon and the description of the error currently in errno (see Figure 2).

Figure 2

  returnCode = open(somefile, O_RDONLY);
  if (returnCode < 0) {
      /* Handle the error */
      perror("open");
  }

Now that you have a basic understanding of how system calls work, let’s examine a couple of
simple system calls. We’ll start with exit().

void exit(int errcode);

The exit() system call terminates the process which invokes it, and makes the errcode
value available to the process’s parent process. This is one of the simplest system calls.

The getpid() system call takes no arguments, but does return a value:

pid_t getpid(void);

It returns a pid_t (which, on Linux, is a 32-bit integer value) that holds the process
ID of the current process. A process ID is a number which uniquely identifies a process on the
system (when you use the “kill” command, the argument you specify is a process ID). A process can
display its process ID quite easily:

printf("I'm %d\n", getpid());

The final system call we’ll discuss is fork(), which is the basic mechanism Linux uses
to create new processes.

pid_t fork(void);

Like getpid(), fork() returns a pid_t. fork() is unique in that it
returns twice for each time it’s called; once in the process which called it and again in a newly
created process which is then known as the child of the creating process. Both processes are nearly
identical (they differ in areas such as their process ID). The original process’s memory map and
list of open files are copied so that the child has full access to the same resources the parent did.
The two processes do get different return codes from fork(), though; the parent gets the
process ID of the newly created child while the child process gets 0 (valid process IDs are always
greater than 0). After a fork(), the child may do whatever it likes. Common reasons for
using fork() are running a different program through yet another system call or performing
a background task.

Most system calls are well-documented in the section 2 man pages. If you don’t know how to read
those, try running man man.

While system calls provide all of the functionality Linux applications need to survive, there
are surprisingly few of them: version 2.0.36 of the Linux kernel provides fewer than 300! This set
of system calls is diverse enough to enable a vast array of applications to run on Linux, and
understanding them lets programmers know what kernel services are available to their applications.

Erik Troan is a developer for Red Hat Software, and co-author of the book Linux Application
Development. He can be reached at ewt@redhat.com.
