
Using MOSIX

In the past few columns, we’ve discussed parallel computing, a methodology that divides and distributes the work of solving a significant computing task. Parallel computing can solve large problems more quickly because it focuses the processing power of many computers working in tandem (called a cluster) on a single task. A master node in the cluster breaks the problem into discrete parts, and passes tasks and computational data to slave nodes. The master node receives results from the slave nodes and generates a final answer. Coordination of the entire process is controlled with message-passing APIs.

However, these message-passing APIs have some significant drawbacks. Most algorithms require considerable design work and modification to be optimized for parallel execution. Differences in processor speed among the nodes of a cluster complicate matters further: if the slowest node forces the others to wait for it to finish its portion of an algorithm, the cluster is not being used to full effect.

Fortunately, there’s another way to allocate tasks in a cluster of disparate processors that does not require you to rethink your algorithms. It’s called MOSIX (http://www.mosix.com/), the Multicomputer Operating System for UnIX. It was developed by Professor Amnon Barak and collaborators at the Hebrew University of Jerusalem in Israel.

MOSIX: What’s It Good For?

MOSIX is a set of kernel patches and user programs (for Linux and other UNIX variants) that perform automatic and transparent process migration for clustered and even non-clustered computers. By moving CPU-intensive processes to faster or less-busy cluster nodes or workstations, MOSIX provides load balancing across distributed systems in a fashion similar to that used in multi-processor computers or symmetric multi-processing (SMP) systems.

Adaptive management algorithms monitor and respond to uneven distribution of CPU resources. After a new process is created on a node, MOSIX attempts to make the best use of available cycles by reassigning or migrating the process, as necessary, to the best available node. To maximize overall performance, MOSIX continually monitors and reassigns processes as the distribution of processing load on nodes changes.

MOSIX evenly distributes the load generated by many serial processes or by forked or threaded processes that don’t use shared memory. It’s scalable and can support large numbers of cluster nodes and/or workstations. It works on top of TCP and UDP using existing network infrastructure and has minimal impact on other parts of the kernel. Moreover, a user-level package is now available that distributes workload among nodes without the overhead of the process migration mechanism.
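
To make this concrete, consider the kind of job MOSIX handles well: a parent that forks several CPU-bound children that share no memory. The sketch below is our own illustration and contains nothing MOSIX-specific; on a MOSIX node, each child is simply a candidate for migration to a faster or less-busy machine.

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int n;

    /* Fork four independent, CPU-bound workers that share no memory. */
    for (n = 0; n < 4; n++) {
        if (fork() == 0) {
            double val = 0.0;
            long i;
            for (i = 0; i < 100000000L; i++)
                val += (double)i / (double)(i + 1);
            printf("worker %d done (val=%g)\n", n, val);
            exit(0);
        }
    }

    /* The parent just waits; on a MOSIX cluster, the scheduler decides
       where each child actually runs. */
    while (wait(NULL) > 0)
        ;
    return 0;
}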

MOSIX also supports a file system called MFS (the MOSIX File System) that allows every node access to the file systems on every other node. (The effect is like using NFS to mount every file system of every node on every other node.)

While MOSIX is of little use in an environment where explicit message passing programs are run on dedicated nodes, it’s very effective in a cluster that runs many independent serial simulations or application programs. A computing task can be initiated on any node in a MOSIX cluster and the workload will be distributed among all available nodes. This minimizes run time and is particularly useful in a cluster containing a variety of processors. MOSIX considers both the speed of the processor and the current load on the machine when deciding how to migrate processes.
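
For example, a batch of independent serial runs can simply be launched from any one node and left to MOSIX to distribute. In the hypothetical session below, the my-sim binary and its input and output files are placeholders:

[forrest@mosix1 forrest]$ # my-sim, input.*, and output.* are placeholder names
[forrest@mosix1 forrest]$ for i in 1 2 3 4 5 6 7 8; do
> ./my-sim input.$i > output.$i &
> done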

Process migration can be invaluable in a university department or research laboratory that’s accumulated a number of computers of varying ages and speeds. If the department owns one or two beefy servers and many smaller desktops and older servers, MOSIX migrates jobs to run on the best available computer given the other tasks being performed on the cluster. Moreover, MFS provides an easy way to access files and data spread across many machines.

MOSIX can also be used on scalable Web servers that require a lot of processing power to service requests. By forming clusters and distributing the processing among all available servers, the client’s request is answered in the shortest possible time (although there is some time, arguably negligible, lost with process migration).

Although MOSIX is very good at handling CPU-bound problems, it’s less efficient with I/O-intensive ones. File and network I/O occurs on the node where the task was initiated, so a migrated process must contact its “home” node to perform these operations. MOSIX’s Direct File-System Access (DFSA) addresses this by letting a migrated process perform most of its I/O and file system operations directly on the node where it’s currently running instead of “phoning home.” At present, DFSA works only with MFS file systems.

Getting Started with MOSIX

Installing MOSIX involves patching and building a new kernel, compiling and installing user-level programs, and making minor modifications to a number of system files so that certain daemons aren’t migrated away. Luckily, the MOSIX distribution contains pre-built Perl scripts that perform these tasks automatically. A list of minimal requirements for MOSIX is provided in the MOSIX README file along with instructions for manual installation should the Perl scripts fail to do the job.

To demonstrate the features of MOSIX, we’re using a cluster of five older computers with different CPU speeds all running Red Hat 7.2, as shown in Table One.




Table One: MOSIX testbed cluster


Machine   Hostname   Processor
1         mosix1     Pentium II 450 MHz
2         mosix2     Pentium 166 MHz
3         mosix3     Pentium 133 MHz
4         mosix4     Pentium 120 MHz
5         mosix5     Pentium 90 MHz

We downloaded MOSIX version 1.5.7 and the source code for kernel version 2.4.17. The kernel sources were patched, and a compatible kernel was built on the first machine and then copied onto the other nodes.
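
If the scripts fail and you need to do the kernel work by hand, the build itself is the ordinary 2.4-series procedure with the MOSIX patch applied first. The following is only a sketch; the directory and patch file names are illustrative, and the exact steps are spelled out in the MOSIX README.

# Sketch of a manual MOSIX kernel build (directory and file names are illustrative).
cd /usr/src/linux-2.4.17
zcat /tmp/MOSIX-1.5.7.patch.gz | patch -p1    # apply the MOSIX kernel patch
make menuconfig                               # enable the MOSIX options added by the patch
make dep bzImage modules modules_install      # standard 2.4-series build steps
cp arch/i386/boot/bzImage /boot/vmlinuz-2.4.17-mosix
cp System.map /boot/System.map-2.4.17-mosix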

We used the Perl scripts provided with MOSIX to build and install the user-level programs and to alter system files appropriately. We chose to use the GRUB boot loader (instead of LILO) on all five systems in the cluster and we found that the Perl scripts didn’t always make the correct modifications to the grub.conf file. The grub.conf files were updated manually. Finally, we used grub-install to update the master boot record on each node. (For more on GRUB see the April 2002 Guru Guidance column, available online at http://www.linux-mag.com/2002-04/guru_01.html.)
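
For reference, the hand edit to grub.conf amounts to adding a stanza along the lines of the one below; the kernel file name and device names are illustrative and depend entirely on your disk layout.

# Illustrative grub.conf stanza; adjust device names and paths for your system.
title Red Hat Linux (2.4.17-mosix)
        root (hd0,0)
        kernel /vmlinuz-2.4.17-mosix ro root=/dev/hda2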

The file /etc/mosix.map was created (and must reside on every node) to map nodes into the MOSIX cluster. Since our five machines have contiguous IP addresses, only a single entry is needed in the file, as shown in Listing One. If the nodes in your cluster have widely scattered IP addresses, you’ll need a line for each node, with the number “1” as the third argument (e.g., 1 hosta 1, 2 hostb 1, etc.).




Listing One: /etc/mosix.map


# MOSIX CONFIGURATION
# ===================
# Each line should contain 3 fields,
# mapping IP addresses to MOSIX node-numbers:
# 1) first MOSIX node-number in range.
# 2) IP address or host name of the above node
# 3) number of nodes in this range.
#
# MOSIX-# IP/hostname number-of-nodes
# ============================
1 mosix1 5
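
For comparison, a cluster whose nodes have scattered addresses would need one line per node, like the hypothetical map below (hosta, hostb, and hostc are made-up names):

# MOSIX-#   IP/hostname   number-of-nodes
1   hosta   1
2   hostb   1
3   hostc   1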

With the new kernels in place, the user programs installed, and the systems files modified, all systems were rebooted. After rebooting, all computers appeared as nodes in the MOSIX cluster. Node 4, a Pentium 120, suffered a catastrophic disk failure wholly unrelated to the MOSIX installation and was subsequently unavailable for the rest of our tests.

Testing MOSIX’s Process Migration

To test automatic process migration, we developed a short C program called time-waster that does nothing but burn CPU time (see Listing Two). The code executes a nested loop and calculates a useless value, printing timing information after every tenth pass through the outer loop.




Listing Two: time-waster.c


#include <stdio.h>
#include <time.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    int i, j, elapse = 0, prev_elapse;
    double val;
    time_t ts;

    ts = time((time_t *)NULL);
    for (i = 0; i < 101; i++) {
        /* Burn CPU: compute a useless value roughly ten million times. */
        for (j = 0; j < 9999999; j++)
            val = (double)(j+1) / (double)(i+1);
        /* Print timing information on every tenth pass through the outer loop. */
        if (!(i%10)) {
            prev_elapse = elapse;
            elapse = (int)(time((time_t *)NULL) - ts);
            printf("i=%d, val=%lg, %d s elapsed, %d s since last print\n",
                   i, val, elapse, elapse - prev_elapse);
        }
    }
    return 0;
}

Figure One shows the results of compiling and running this code on mosix1, the fastest box in our cluster. The program takes approximately 81 seconds to run on mosix1, with about 8 seconds spent in each group of ten passes through the outer loop. Figure Two shows the results of executing time-waster on mosix2, our Pentium 166, without process migration. The mosrun program allows node-allocation preferences to be established for a command; in this case, the -h flag (for “home”) is used to force time-waster to run on the home node (i.e., not migrated). It takes time-waster approximately 365 seconds to complete.




Figure One: Output from time-waster on mosix1


[forrest@mosix1 forrest]$ cc -o time-waster time-waster.c
[forrest@mosix1 forrest]$ ./time-waster
i=0, val=1e+07, 1 s elapsed, 1 s since last print
i=10, val=909091, 8 s elapsed, 7 s since last print
i=20, val=476190, 16 s elapsed, 8 s since last print
i=30, val=322581, 24 s elapsed, 8 s since last print
i=40, val=243902, 32 s elapsed, 8 s since last print
i=50, val=196078, 40 s elapsed, 8 s since last print
i=60, val=163934, 48 s elapsed, 8 s since last print
i=70, val=140845, 56 s elapsed, 8 s since last print
i=80, val=123457, 64 s elapsed, 8 s since last print
i=90, val=109890, 73 s elapsed, 9 s since last print
i=100, val=99009.9, 81 s elapsed, 8 s since last print




Figure Two: Output from time-waster on mosix2


[forrest@mosix2 forrest]$ mosrun -h ./time-waster
i=0, val=1e+07, 4 s elapsed, 4 s since last print
i=10, val=909091, 40 s elapsed, 36 s since last print
i=20, val=476190, 76 s elapsed, 36 s since last print
i=30, val=322581, 112 s elapsed, 36 s since last print
i=40, val=243902, 148 s elapsed, 36 s since last print
i=50, val=196078, 184 s elapsed, 36 s since last print
i=60, val=163934, 220 s elapsed, 36 s since last print
i=70, val=140845, 257 s elapsed, 37 s since last print
i=80, val=123457, 293 s elapsed, 36 s since last print
i=90, val=109890, 329 s elapsed, 36 s since last print
i=100, val=99009.9, 365 s elapsed, 36 s since last print

When executed on mosix2 without the mosrun command, the time-waster process is quickly migrated to mosix1, as can be seen in Figure Three. Because the delta from i=0 to i=10 is the same as it was when we ran on mosix1 (seven seconds), we can conclude that the process was migrated before i even reached 1.




Figure Three: Output from time-waster with process migration


[forrest@mosix2 forrest]$ ./time-waster
i=0, val=1e+07, 3 s elapsed, 3 s since last print
i=10, val=909091, 10 s elapsed, 7 s since last print
i=20, val=476190, 18 s elapsed, 8 s since last print
i=30, val=322581, 27 s elapsed, 9 s since last print
i=40, val=243902, 34 s elapsed, 7 s since last print
i=50, val=196078, 43 s elapsed, 9 s since last print
i=60, val=163934, 51 s elapsed, 8 s since last print
i=70, val=140845, 59 s elapsed, 8 s since last print
i=80, val=123457, 67 s elapsed, 8 s since last print
i=90, val=109890, 75 s elapsed, 8 s since last print
i=100, val=99009.9, 84 s elapsed, 9 s since last print

The mosrun program can also be used to allocate processes to particular sets of nodes or to provide tuning hints about the code to be run. Another useful utility, migrate, requests migration of a particular process, either sending it back to its home node or letting MOSIX decide where it should run based on current loads. Additional utilities, including mosctl, setpe, and tune, control process migration and calibrate kernel parameters for particular processor and network configurations.
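
As a rough illustration of migrate (the argument forms here are from our reading of the man page, so check migrate(1) on your own system), you can push a running process back to its home node or ask MOSIX to rebalance it:

[forrest@mosix2 forrest]$ ./time-waster &
[1] 1234                                         # example PID reported by the shell
[forrest@mosix2 forrest]$ migrate 1234 home      # send the process back to mosix2
[forrest@mosix2 forrest]$ migrate 1234 balance   # let MOSIX pick the best node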

Monitoring Tools








Figure Four: Sample output from the mon program

To monitor the state of a MOSIX cluster, use the mon program. As shown in Figure Four, mon provides a graphical view of system load. Note that node 4 (which failed earlier) is unavailable and does not appear in the graph. The mon utility can also display graphs of processor speed, total system memory, memory utilization, and processor availability.

Another monitoring tool available separately for MOSIX is Mosixview (http://www.mosixview.com/). It provides a graphical front end to mosctl and offers a better view of node utilization (see Figure Five). It simultaneously shows cluster node availability, processor speed, system load, memory utilization, total memory, and number of CPUs. In addition, there are tools for collecting run-time statistics of resource utilization that can be plotted and analyzed.








Figure Five: Sample output from the Mosixview program

The displays in Figure Four and Figure Five were obtained at approximately the same time after time-waster had been started on nodes 2, 3, and 5. As you can see in both figures, the processes were migrated from nodes 3 and 5 (the two slowest nodes) onto nodes 1 and 2 (the two fastest nodes).

MFS: The MOSIX File System

The MOSIX File System lets you easily access files on all cluster nodes. A look at /mfs on mosix1 (Figure Six) shows directories for nodes 1, 2, 3, and 5. Since node 4 is unavailable, no directory is shown for it. In addition to the directories that lead to the file systems on each node, there are special entries (here, home, lastexec, magic, and selected) that resolve at runtime to the appropriate node number; note that here in /mfs points to node 1, the node we’re logged in to.




Figure Six: A view into mosix1’s MFS


[forrest@mosix1 forrest]$ cd /mfs
[forrest@mosix1 mfs]$ ls -l
ls: 4: Object is remote
total 16
drwxr-xr-x 20 root root 4096 Mar 8 22:13 1
drwxr-xr-x 20 root root 4096 Mar 8 1996 2
drwxr-xr-x 21 root root 4096 Mar 8 23:25 3
drwxr-xr-x 20 root root 4096 Mar 8 22:59 5
lr-xr-xr-x 1 root root 1 Dec 31 1969 here -> 1
lr-xr-xr-x 1 root root 1 Dec 31 1969 home -> 1
lr-xr-xr-x 1 root root 1 Dec 31 1969 lastexec -> 1
lr-xr-xr-x 1 root root 1 Dec 31 1969 magic -> 1
lr-xr-xr-x 1 root root 1 Dec 31 1969 selected -> 1
[forrest@mosix1 mfs]$ ls 3
bin dev home lib mfs mnt proc sbin usr work
boot etc initrd lost+found misc opt root tmp var
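
Because MFS looks like any other mounted file system, programs need nothing special to reach another node’s files; an ordinary fopen() on an /mfs path is enough. The short sketch below reads a file from node 3 (the path is just an example):

#include <stdio.h>

int main(void)
{
    char line[256];
    /* Example path: /etc/redhat-release on node 3, reached through MFS. */
    FILE *fp = fopen("/mfs/3/etc/redhat-release", "r");

    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), fp) != NULL)
        fputs(line, stdout);
    fclose(fp);
    return 0;
}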

The Future of MOSIX

MOSIX is still being expanded by Professor Barak and his colleagues. Separately, the MOSIX code was recently forked to create a GPL-compliant clustering platform called openMosix (http://openmosix.sourceforge.net/). Check out both solutions to see which better meets your needs.



Forrest Hoffman is a computer modeling and simulation researcher at Oak Ridge National Laboratory. He can be reached at forrest@esd.ornl.gov.
