Open Files and Inodes

Last month, I wrote about Linux's file access API. For this month's column, I'm going to talk about some of the other important file-related system calls, and touch on how the kernel file implementation affects the system call interface.

In the POSIX world, it’s important to remember that a single file
can have more than one name, can have no name at all, and can also
have symbolic links pointing to it. Files exist separately from their
names as far as the filesystem is concerned.

Figure 1: A file that is opened by two
different processes will still have a single in-core inode, although the
processes each have different file descriptors and file structures.

The most obvious examples of this are symbolic links, also called
symlinks or soft links. Symlinks are special files that contain the name
of another file. Most system calls (including open())
“follow the symlink” when they are told to access one: they act on the
file that the symlink names instead (if you’re familiar with
HTTP, think of this as a bit like a server redirect). For example, the
file /etc/localtime is normally a symbolic link
to the file that describes the time zone the machine lives in (this is
/usr/share/zoneinfo/US/Eastern on my system). When
a program needs to discover something about the system’s time zone, it
opens /etc/localtime. As open() follows symlinks, the
program automatically
reads this information from /usr/share/zoneinfo/US/Eastern.

One potential pitfall of symlinks is that they can be circular;
there is nothing to stop the file /tmp/a from
pointing to /tmp/b, even when /tmp/b points back to
/tmp/a. When this happens, a naive lookup would never finish. It looks at
/tmp/a, which sends it to /tmp/b, which sends it back
to /tmp/a, and around and around it goes. To avoid this
problem, Linux will only follow a fixed number of symbolic links while
resolving a filename; if it runs past that limit (five in older kernels,
40 in current ones), the system call fails with ELOOP.

Symbolic or soft links are one way several path names can refer to
the same file; hard links are another. A hard link is the binding of a
file in a filesystem to a file name; creating a hard link “links” the
file into a directory.

Nothing stops a single file from being linked into more than one
directory (or into a single directory multiple times); in fact, the
ln command does exactly this. The “link count” is the
number of directory entries that refer to the file. When a file name is
removed (or “unlinked”), the entry for that file name is removed from the
directory which contains it. The file itself is kept around as long as
its link count is above 0, or a running process has the file open.

Hard links and symlinks differ in a few important ways.

1. When the file name referred to by a symlink is removed, the symbolic
link is left pointing to nothing. This is known as a “dangling link.”
This does not happen with hard links.

2. A file can only be hard linked within a single filesystem, while
symbolic links can point across filesystems.

3. A symbolic link may (and often does) point to a directory, while
hard-linking directories is strongly discouraged. It is not allowed by
many filesystems.

Once you get comfortable with the idea that a file may have more
than one path name associated with it, the next thing to remember is
that a file may have no path names associated with it. This can
happen when a file is opened by a process and then unlinked from the
filesystem. Files like TCP/IP sockets and unnamed pipes (which are used
for interprocess communication) are also nameless.
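
A short sketch makes the nameless case concrete. The helper nameless_file_demo() is hypothetical; it creates a temporary file, removes its only name, and then checks that the open descriptor can still be written and read.

```c
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns the link count of a file that is
 * still open after its only name has been removed, or -1 if any
 * step (including reading the data back) fails. */
long nameless_file_demo(void)
{
    char path[] = "/tmp/namelessXXXXXX";
    char buf[6] = "";
    struct stat sb;
    long nlink = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    if (unlink(path) != 0) {            /* the file now has no name */
        close(fd);
        return -1;
    }

    /* The descriptor still works: write, rewind, and read back. */
    if (write(fd, "hello", 5) == 5 &&
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, 5) == 5 &&
        strcmp(buf, "hello") == 0 &&
        fstat(fd, &sb) == 0)
        nlink = (long) sb.st_nlink;

    close(fd);                          /* now the file really goes away */
    return nlink;
}
```

On a typical local filesystem the reported link count is 0, even though the file is perfectly usable until the descriptor is closed.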

Rather than a filename, every open file on a system has a single
in-core inode. “Core” is an antiquated way of saying “memory,” and
“inode” is short for “information node”. So, in-core inode means “a bit
of information kept in memory.” This doesn’t sound nearly as fancy, but
it means the same thing. The “single” part of “single in-core inode” is
worth noting. It means that no matter how many times the file has been
opened, or how many processes are accessing it, only a single in-core
inode describes the open file.

As you may have already guessed, there is one other form of inode.
On-disk inodes describe a file in a filesystem. They are filesystem
specific. Every file in a filesystem has a single on-disk inode, and
they function similarly to in-core inodes.

If you’re wondering where I’m going with this, we need to understand
the concept of an inode for the kernel’s file model to make any sense. And
we need to talk about the kernel’s file model for some of the details of
the file API to make any sense at all.

When a process opens a file, it gets a small, non-negative integer known
as a file descriptor. That file descriptor is an index into a table of
pointers to file structures.

The file structure contains information specific to that particular
opening of the file. Specifically, the file structure contains the file
pointer (which determines where the next read()
or write() will happen within the file) and the
file’s access mode (this says whether the file is open for reading,
writing, or both).

The file structure also contains a pointer to the in-core inode.
While each open file has a single in-core inode, more than one file
structure may point to that inode. Figure 1 illustrates how
this happens for a file that has been open()ed
by two separate processes. Note that each process has its own file
pointer and access mode, but both share a single in-core inode.
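
The same thing can be observed inside one process by calling open() twice on the same path, which is a simplification of the two-process picture in Figure 1. In this sketch (two_opens_demo() is a made-up name), each descriptor gets its own file pointer while fstat() reports the same inode for both.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: opens one file twice and returns 1 if the two
 * descriptors have independent file pointers but the same inode,
 * 0 if not, and -1 if setup fails. */
int two_opens_demo(void)
{
    char path[] = "/tmp/opensXXXXXX";
    char buf[3];
    struct stat s1, s2;
    int tmp, fd1, fd2, ok = -1;

    tmp = mkstemp(path);
    if (tmp < 0)
        return -1;
    if (write(tmp, "abcdef", 6) != 6) {
        close(tmp);
        unlink(path);
        return -1;
    }
    close(tmp);

    fd1 = open(path, O_RDONLY);
    fd2 = open(path, O_RDONLY);
    if (fd1 >= 0 && fd2 >= 0 &&
        read(fd1, buf, sizeof(buf)) == 3 &&
        fstat(fd1, &s1) == 0 && fstat(fd2, &s2) == 0) {
        ok = lseek(fd1, 0, SEEK_CUR) == 3    /* fd1 moved forward */
          && lseek(fd2, 0, SEEK_CUR) == 0    /* fd2 did not */
          && s1.st_dev == s2.st_dev
          && s1.st_ino == s2.st_ino;         /* same inode */
    }

    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    unlink(path);
    return ok;
}
```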

Figure 2: If Process A executes dup2(2,5),
descriptors 2 and 5 refer to a single file structure, while Process B
still refers to a separate file structure.

There are a few ways for file descriptors to share a single
file structure. The most esoteric is for a file descriptor to be sent
from one process to another through a Unix domain socket. A second, less
complex method occurs after a fork(), when a
parent process and its child share a single file structure for every
file that was open before the fork() occurred.
This behavior is important to remember, as it has caused its fair share
of obscure bugs.
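
Here is a small sketch of the fork() case. The helper offset_after_child_seek() is hypothetical. The child moves the file pointer and exits; because parent and child share one file structure, the parent then sees the new position.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the parent's file pointer after a
 * child process has moved it to offset 100, or -1 on failure. */
long offset_after_child_seek(void)
{
    char path[] = "/tmp/forkXXXXXX";
    long off = -1;
    pid_t pid;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                       /* scratch file needs no name */

    pid = fork();
    if (pid == 0) {
        /* child: move the shared file pointer, then exit */
        lseek(fd, 100, SEEK_SET);
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        off = (long) lseek(fd, 0, SEEK_CUR);

    close(fd);
    return off;
}
```

The parent never calls lseek() with a new position itself, yet the value it reads back is the one the child set. This is exactly the kind of action at a distance behind the obscure bugs mentioned above.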

The last way for two or more file descriptors to refer to a single
file structure is when some of those file descriptors have been created
by the dup() or dup2()
system calls.

int dup(int fd);
int dup2(int fd, int targetfd);

Both the dup() and dup2() system calls return a
file descriptor that points to the same file structure as the first
parameter passed to either one. dup() is guaranteed to return
the smallest file descriptor available, while dup2() will
return the targetfd file descriptor. If targetfd
already refers to an open file, that descriptor is closed first.
Figure 2 shows what happens if Process A executes
dup2(2,5). In this case, Process A’s file descriptors 2 and 5
end up referring to a single file structure, but Process B’s file
descriptor still refers to a separate file structure.
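
Because a duplicated descriptor shares the file structure, it also shares the file pointer. The sketch below uses a made-up helper, offset_seen_through_dup(), which writes through the original descriptor and then asks the duplicate where the file pointer is.

```c
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: writes 5 bytes through one descriptor and
 * returns the file pointer as seen through its dup()ed copy,
 * or -1 on failure. */
long offset_seen_through_dup(void)
{
    char path[] = "/tmp/dupXXXXXX";
    long off = -1;
    int fd, copy;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                       /* scratch file needs no name */

    copy = dup(fd);                     /* second descriptor, same file structure */
    if (copy >= 0) {
        if (write(fd, "hello", 5) == 5)
            off = (long) lseek(copy, 0, SEEK_CUR);
        close(copy);
    }
    close(fd);
    return off;
}
```

The result is 5 rather than 0, which is the opposite of what happens when the same file is open()ed twice.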

dup() and dup2() are used by shells to redirect
standard input, output, and error for a process. For example, this code
fragment redirects the standard output of the running process to the
file “output”:

    fd = open("output", O_RDWR | O_CREAT, 0666);
    if (fd < 0) {
        perror("failed to open file output");
        exit(1);
    }
    dup2(fd, 1);    /* descriptor 1 (standard output) now refers to "output" */

Most of the file-related system calls that manipulate a file’s inode
(either in-core inode or on-disk inode) have two forms. One form expects
a file name, and the other expects a file descriptor. Since both file
names and file descriptors resolve to a single inode, both forms yield
equivalent results. chdir(), which changes the process’s
current working directory, is a good example of this.

int chdir(const char * dirname);
int fchdir(int fd);

The first form, chdir(), changes into the directory
specified by a path name, while the second makes the directory referenced
by the file descriptor fd the current directory. In many cases,
chdir() is easier to use, as it makes changing current
directories as simple as chdir("/tmp"). fchdir() has
its place too, however; one popular use is letting the process remember
the current directory and change back to it. For example:

    currDir = open(".", O_RDONLY);
    chdir("/tmp");
    /* do some real work */
    fchdir(currDir);

Not only is this a bit easier than using getcwd() to
remember the name of the original directory, but it is also guaranteed
to work even if a user removes the original directory while the process
is working in /tmp. As this process keeps a file descriptor
open to the original directory, that directory is always being used by
at least one process, so the system doesn’t remove it until the process
has finished with it.

The stat() family of functions actually comes in three
varieties: stat(), fstat(), and lstat(). All
of them return information stored in a file’s inode (either the in-core
inode or the on-disk inode is used, depending on the file type; for system
programs, it doesn’t really matter which). All of the
information returned by stat() is placed in a struct stat,
which looks like this:

    struct stat {
        dev_t st_dev;
        ino_t st_ino;
        mode_t st_mode;
        nlink_t st_nlink;
        uid_t st_uid;
        gid_t st_gid;
        dev_t st_rdev;
        off_t st_size;
        unsigned long st_blksize;
        unsigned long st_blocks;
        time_t st_atime;
        time_t st_mtime;
        time_t st_ctime;
    };

To get access to struct stat, be sure to include
<sys/stat.h>. Figure 3 explains what each of these fields means.
Most of these fields are straightforward. The most complicated of the
bunch is st_mode, which warrants a little more explanation.

Figure 3: Functions of Various Struct Stat Fields

st_dev      The major and minor number of the device the file resides
            on. Every device on the system has a unique (major, minor)
            tuple.
st_ino      The inode number of the file. The system guarantees that the
            (st_dev, st_ino) pair for a file is unique for that file;
            this is a handy way of telling whether two symlinks refer to
            the same file, for example.
st_mode     A combination of the permissions for the file and the file
            type.
st_nlink    The link count of the file (remember, this only includes
            hard links).
st_uid      The user id of the user who owns the file.
st_gid      The group id of the group that owns the file.
st_rdev     For files that represent devices, st_rdev contains the
            (major, minor) pair for the device the file represents.
st_size     The size of the file, in bytes.
st_blksize  The size of each block on the filesystem.
st_blocks   The number of blocks the file takes in the filesystem.
st_atime    The last time the file was accessed (normally opened).
st_mtime    The last time the file contents were modified.
st_ctime    The last time the file’s inode was updated (i.e. the user
            or group owning the file was changed).
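
The (st_dev, st_ino) comparison from Figure 3 takes only a few lines. This is a minimal sketch, and same_file() is a name I made up for it. It uses stat(), so symbolic links are followed before the comparison is made.

```c
#include <sys/stat.h>

/* Hypothetical helper: returns 1 if both paths resolve to the same
 * file, 0 if they resolve to different files, and -1 if either
 * stat() call fails. */
int same_file(const char *path1, const char *path2)
{
    struct stat a, b;

    if (stat(path1, &a) != 0 || stat(path2, &b) != 0)
        return -1;

    /* Inode numbers are only unique within one device, so both
     * fields have to match. */
    return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}
```

For example, same_file("/", "/.") returns 1, since both names lead to the root directory.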

Linux supports many different file types: pipes, sockets,
directories, and normal files. And each of these file types is
identified by a few unique bits in the file mode. Here’s the symbolic
name (a #define) for each file type:

S_IFSOCK    Socket
S_IFLNK     Symbolic link
S_IFREG     Regular file
S_IFBLK     Block device
S_IFCHR     Character device
S_IFDIR     Directory
S_IFIFO     Pipe (either named or unnamed)

There are macros that make it a bit easier to check a file’s type.
For example, to check if a file is a symlink, pass the file’s mode to
S_ISLNK().

To make this a bit more explicit, Figure 4 shows a small
program that tells you what type of files are listed on the command
line. An example of what the program does when it’s run can be seen in
Figure 5.

Figure 4: Identifying File Types Listed on the Command Line

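#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(int argc, char ** argv)
{
    struct stat sb;

    /* skip argv[0] */
    argv++;

    while (*argv) {
        /* stat() follows symlinks, so the S_ISLNK() branch below
           would only fire if lstat() were used here instead */
        if (stat(*argv, &sb)) {
            fprintf(stderr, "stat of %s failed: %s\n", *argv,
                    strerror(errno));
        } else {
            printf("%s ", *argv);

            if (S_ISSOCK(sb.st_mode))
                printf("socket\n");
            else if (S_ISLNK(sb.st_mode))
                printf("symlink\n");
            else if (S_ISREG(sb.st_mode))
                printf("regular\n");
            else if (S_ISBLK(sb.st_mode))
                printf("block\n");
            else if (S_ISCHR(sb.st_mode))
                printf("char\n");
            else if (S_ISDIR(sb.st_mode))
                printf("directory\n");
            else if (S_ISFIFO(sb.st_mode))
                printf("fifo (pipe)\n");
        }

        argv++;
    }

    return 0;
}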


Figure 5: Output from Figure 4

# ./filetype /dev/socket /etc/passwd /dev/log /dev/null /dev/hda /tmp
/etc/passwd regular
/dev/log socket
/dev/null char
/dev/hda block
/tmp directory

The rest of the file mode contains the file’s permissions and
permission modifiers. The permissions are the lowest 9 bits, and are the
same as the bits passed to chmod(). By logically ANDing
st_mode with 0x1FF (or 0777), you are left with just the
permission bits, which can be checked easily enough. If you aren’t
familiar with Unix permission bits, you will find that man 1 chmod
contains a simple explanation.

The file permission modifiers consist of the setuid bit,
setgid bit, and the sticky bit.

The setuid and setgid modifiers allow a process
running that file to masquerade as another user or member of another
group. The sticky bit has a long history, but is now rarely used.
Usually it is set only for directories, where it changes how permissions
are checked when a file gets removed.

Normally, users can remove a file from any directory where they have
write permission. For publicly accessible directories (like
/tmp), this means that any user can erase any file, since the
directory is world-writable. When the directory has the sticky bit set,
users can only remove files they own. If your system is properly
configured, you will see that the sticky bit is set on /tmp
[see Figure 6].

Figure 6: The Sticky Bit

# ls -ld /tmp /etc
drwxrwxrwt 5 root root 2048 May 4 21:46 /tmp
drwxr-xr-x 25 root root 3072 May 2 18:33 /etc

The ‘t’ at the end of the first field means that the sticky bit is
set for the /tmp directory; notice that it’s
missing from /etc.

The file permission modifiers are bits 12, 11, and 10 of the file
mode (octal 04000, 02000, and 01000), and have the same values they do
for the chmod command.
You can test for them by checking for
a non-zero result from logically ANDing a file mode with one of the
following constants:

S_ISUID     Setuid bit
S_ISGID     Setgid bit
S_ISVTX     Sticky bit

Now that we’ve explained the file mode (albeit quite briefly), let’s
look at the three forms of stat().

int fstat(int fd, struct stat * sb);
int stat(const char * filename, struct stat * sb);
int lstat(const char * filename, struct stat * sb);

First the similarities: All of these functions return 0 on success
(and -1 on failure) and fill in the struct stat pointed to by the
last parameter. Now, for the differences.

fstat() is the most noticeably different; it
returns information on the inode referred to by the passed-in file
descriptor. stat() returns information on the
file that is specified by the first parameter, and follows symbolic
links. This means that it will never return information on a symlink
itself, since symlinks are always de-referenced. lstat() works
a lot like stat(), but does not follow symlinks.
If you lstat() a symbolic link, a struct stat with a
file type of S_IFLNK is returned. For all other file types,
lstat() and stat() behave
identically. Note that in most cases, you would want to use stat()
rather than lstat().

So much for our whirlwind tour of inodes. We’ll continue next month
by talking about more file-related system calls. Now that we’ve
introduced all of the important concepts, we’ll be able to work through
the rest of the major filesystem calls in no time.

Erik Troan is a developer for Red Hat Software and co-author of
Linux Application Development. He can be reached at