Processes, forks and executions

This is the second blog post in a short series about processes on UNIX-like systems. It is a followup to the previous post which focused on basic definitions, creation of processes and relations between them. This time we analyze the semantics of two closely related system calls that play major roles in process creation and program execution.

fork() and exec()

UNIX-based operating systems provide the fork() system call¹ to create a clone of an existing process and the execve() system call to start executing a program in a process. Windows, on the other hand, provides the CreateProcess() function, which starts a given program in a newly created process. Why do UNIX-based systems do things in a seemingly more complicated way? There are many reasons for that, some simply historical, as described in The Evolution of the Unix Time-sharing System:

Process control in its modern form was designed and implemented within a couple of days. It is astonishing how easily it fitted into the existing system; at the same time it is easy to see how some of the slightly unusual features of the design are present precisely because they represented small, easily-coded changes to what existed. A good example is the separation of the fork and exec functions.

In fact, the PDP-7's fork call required precisely 27 lines of assembly code. Of course, other changes in the operating system and user programs were required, and some of them were rather interesting and unexpected. But a combined fork-exec would have been considerably more complicated,…

Based on the above words one could say it was merely a coincidence or even a result of some laziness. Laziness of programmers² has over the decades definitely led to many great things (including great disasters). So this would not be something new or special. However, was it a coincidence that it was easy to add fork() and then exec() keeping them as two separate actions?

Finding the answer would make for a nice piece of research and analysis, but that is a topic for a separate blog post. One thing is definitely true, though: the semantics of fork() and exec() fit very well into the philosophy and design of UNIX-based operating systems. One of the key UNIX principles is Do One Thing And Do It Well. This approach alone requires the operating system to be able to create new processes quickly and efficiently. After all, there are many things to be done!

Creating a clone of an existing process with fork() only to throw most of it away again with exec() may sound inefficient. But a closer look, for example at the fork(2) man page on Linux, shows something else:

Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child.

In other words, fork() creates something like a shallow copy of an existing process: only the memory pages that are written to are actually copied, right before the write happens. This is very similar to the creation of a new thread which, unlike a new child process, stays in the same address space even for writes into memory. In fact, on Linux the fork() C function is implemented using the clone() system call, which is also used when creating new threads, only with different arguments.
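To make the "separate copy on write" behavior tangible, here is a minimal Python sketch (Python's os.fork() is a thin wrapper around the same mechanism). The function name fork_and_modify is made up for this illustration. The child changes a value and reports through its exit status whether its own copy changed, while the parent checks that its copy stayed untouched.

```python
import os


def fork_and_modify():
    """Fork, let the child modify a list, return the parent's view and the child's exit code."""
    data = ["original"]

    pid = os.fork()
    if pid == 0:
        # Child: this write only affects the child's own copy of the memory.
        code = 1
        try:
            data[0] = "changed in child"
            code = 0 if data[0] == "changed in child" else 1
        finally:
            # Leave immediately, without running the parent's cleanup handlers.
            os._exit(code)

    # Parent: wait for the child, then look at our own (unchanged) copy.
    _, status = os.waitpid(pid, 0)
    return data[0], os.WEXITSTATUS(status)


if __name__ == "__main__":
    parent_view, child_exit_code = fork_and_modify()
    print("parent still sees:", parent_view)
    print("child exit code:", child_exit_code)
```

The parent's list is unchanged after the child exits, even though both processes started from the same memory contents.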

If a fork() is followed by an exec(), which loads the new program image and initializes the process to be able to run the program, very little of the work done for fork() is wasted, so the combination is almost as efficient as doing everything at once. Of course, making two system calls generally requires a bit more resources than making a single one, but the overhead is very small compared to everything that needs to happen when a program is started. If, on the other hand, a process only needs to do something in a separate process (for practical or security reasons), there is no exec() at all. A lot of resources are then saved compared to starting a program, doing all the initialization, replicating the state of the original process and only then doing the work that had to run in a new process. In UNIX this is a common situation, again because of the UNIX philosophy. Even a simple subshell in a shell script, used to avoid worrying about restoring the current working directory, like this:

echo "Doing something in $PWD"
(cd /tmp; echo "Now in $PWD")
echo "Back to the original shell in $PWD!"

is using a fork() without an exec(). We can see that if we run it with strace:

$ strace -e clone,execve,write,chdir -ff bash /tmp/test.sh > /dev/null
execve("/usr/bin/bash", ["bash", "/tmp/test.sh"], 0x7ffe730ee4a0 /* 71 vars */) = 0
write(1, "Doing something in /home/vpodzim"..., 34) = 34
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLDstrace: Process 709650 attached
, child_tidptr=0x7f1b9cce0a10) = 709650
[pid 709650] chdir("/tmp")              = 0
[pid 709650] write(1, "Now in /tmp\n", 12) = 12
[pid 709650] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=709650, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
write(1, "Back to the original shell in /h"..., 46) = 46
+++ exited with 0 +++

There is one execve() system call when the bash program is started. Bash then executes the first echo command, resulting in a write() system call, and then there is a clone() as a result of the fork() function call, which creates a new (child) process. The new process changes its current working directory, writes a message to standard output and exits, about which the parent process is notified (signaled with SIGCHLD). Before the parent itself terminates, it writes one more message to standard output. There is no execve() system call from the child subshell process, and the change of the current working directory only affects the forked child process.
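The same subshell trick can be sketched directly with fork(), here in Python. This is only an illustration, not how bash is implemented, and the function name cwd_after_child_chdir is made up. The child changes its working directory and exits, and the parent's working directory is the same before and after.

```python
import os


def cwd_after_child_chdir(target="/tmp"):
    """Change directory in a forked child only; return the parent's cwd before/after and the child's exit code."""
    before = os.getcwd()

    pid = os.fork()
    if pid == 0:
        # Child: the chdir() below affects this process only.
        code = 1
        try:
            os.chdir(target)
            code = 0
        except OSError:
            code = 1
        finally:
            os._exit(code)

    # Parent: wait for the "subshell" and check where we are.
    _, status = os.waitpid(pid, 0)
    return before, os.getcwd(), os.WEXITSTATUS(status)


if __name__ == "__main__":
    before, after, child_code = cwd_after_child_chdir()
    print("parent before:", before)
    print("parent after: ", after)
    print("child exit code:", child_code)
```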

Another case where fork() without exec() makes a lot of sense is concurrent (and potentially parallel) processing. A process can do some initialization and then replicate itself N times to create N child processes as workers that concurrently process data or handle requests. This could also be done by starting the workers separately, each running the same program, but then the initialization would have to happen in every one of them, which would waste resources at best and might even be impossible (because they run concurrently). Threads can be used for such workers too, but processes provide better separation and make the workers truly independent: they cannot write into each other's address space and when, for example, one of the workers makes a bad memory access, only that one worker is stopped by the operating system for violating memory constraints.
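A rough Python sketch of this "initialize once, then fork N workers" pattern could look as follows. Everything here is illustrative: run_workers is a made-up name, the lookup table stands in for some expensive initialization, and each worker sends its result back over a pipe (pipes are covered later in this post).

```python
import os


def run_workers(chunks):
    """Fork one worker per chunk after a shared initialization; return the workers' results in order."""
    # Stand-in for expensive initialization, done once before forking.
    lookup = {n: n * n for n in range(1000)}

    children = []
    for chunk in chunks:
        read_end, write_end = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Worker: the initialized lookup table is already in our memory.
            code = 1
            try:
                os.close(read_end)
                total = sum(lookup[n] for n in chunk)
                os.write(write_end, str(total).encode())
                code = 0
            finally:
                os._exit(code)
        # Parent: keep only the read end for this worker.
        os.close(write_end)
        children.append((pid, read_end))

    results = []
    for pid, read_end in children:
        with os.fdopen(read_end) as pipe:
            results.append(int(pipe.read()))  # reads until the worker closes its end
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    print(run_workers([[1, 2, 3], [4, 5], [10]]))
```

A worker that crashes takes down only itself; in this sketch the parent would then fail to parse that worker's (empty) result, which a real program would have to handle.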

An execve() system call without a previous fork() (or clone()) makes sense too. When a program's code is done running in a process and all it wants to do is to start a different program, it can simply exec() the other program. We can again use strace with a simple, although very artificial, shell command to demonstrate this:

$ strace -e execve,clone,write -ff -- bash -c 'echo "Running"; exec echo Done' >/dev/null
execve("/usr/bin/bash", ["bash", "-c", "echo \"Running\"; exec echo Done"], 0x7ffcb2148ed0 /* 71 vars */) = 0
write(1, "Running\n", 8)                = 8
execve("/usr/bin/echo", ["echo", "Done"], 0x5562fb9cb250 /* 69 vars */) = 0
write(1, "Done\n", 5)                   = 5
+++ exited with 0 +++

This time there is no clone() system call. The process starts out running bash, and the exec builtin then tells bash to replace itself with echo, which prints its message in that same process.
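One way to convince ourselves that exec() replaces the program but keeps the process is to compare PIDs before and after. In the Python sketch below, a helper process prints its PID and then calls os.execv() to turn into a fresh interpreter that prints its PID again. The helper is started via subprocess only so that the calling program survives to collect the output; the names HELPER and pids_before_and_after_exec are made up for this example.

```python
import os
import subprocess
import sys

# Code run by the helper process: print our PID, then replace the running
# program with a new interpreter that prints its PID as well.
HELPER = r'''
import os, sys
print(os.getpid(), flush=True)  # flush, because exec() discards unwritten buffers
os.execv(sys.executable, [sys.executable, "-c", "import os; print(os.getpid())"])
'''


def pids_before_and_after_exec():
    """Return the helper's PID as seen before and after its exec() call."""
    output = subprocess.run(
        [sys.executable, "-c", HELPER],
        capture_output=True, text=True, check=True,
    ).stdout
    before, after = output.split()
    return int(before), int(after)


if __name__ == "__main__":
    before, after = pids_before_and_after_exec()
    print("PID before exec:", before)
    print("PID after exec: ", after)
```

Both numbers are the same: a different program, but still the same process.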

Between fork() and exec()

The last, but not least, benefit of the separate fork() and exec() is that there might be instructions and operations happening between the two system calls. Some people actually even consider this the biggest benefit and the most useful feature of the two-step approach. And some consider it the biggest drawback of the approach. Let's see why.

The sections above describe process creation on UNIX-like systems as the creation of clones of existing processes. However, that's a bit of a simplification because the newly created process is not a 1:1 copy of the process it was cloned from. As the fork(2) man page on Linux says:

The child process is an exact duplicate of the parent process except for the following points:

after which a list of 9 points follows, only to be followed by

The parent and child also differ with respect to the following Linux-specific process attributes:

with another list of 7 points. Some of those points are quite obvious: for example, the new (child) process has a different PID than the parent process and its parent PID is the PID of the parent process. The rest are more technical and potentially more subtle, but definitely not less important.

File descriptors

A bit further down, the fork(2) man page says:

Note the following further points:

and lists 5 points that are very important. First, let's look at the last three:

The child inherits copies of the parent's set of open file descriptors. Each file descriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two file descriptors share open file status flags, file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)).

The child inherits copies of the parent's set of open message queue descriptors (see mq_overview(7)). Each file descriptor in the child refers to the same open message queue description as the corresponding file descriptor in the parent. This means that the two file descriptors share the same flags (mq_flags).

The child inherits copies of the parent's set of open directory streams (see opendir(3)). POSIX.1 says that the corresponding directory streams in the parent and child may share the directory stream positioning; on Linux/glibc they do not.

Message queues are not that common, at least not the POSIX-defined ones mentioned in the second point. But file descriptors (FDs) and directory streams ("directory descriptors") are. As we can see, these are inherited, so in this regard the new process is an exact copy of the parent process. This basically means that whatever³ the parent process had open for reading or writing before it called fork() is available for reading and writing in the new (child) process.
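The shared open file description mentioned in the first quoted point can be observed with a small Python sketch. The function name shared_offset_demo and the file name are made up for this illustration. The parent opens a file, the child writes through the inherited descriptor, and the parent then writes through its own descriptor. Because both descriptors share one file offset, the parent's write lands after the child's data instead of overwriting it.

```python
import os
import tempfile


def shared_offset_demo():
    """Write to one file from child and parent via an inherited FD; return the file's content."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.txt")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)

        pid = os.fork()
        if pid == 0:
            # Child: fd refers to the same open file description as in the parent.
            code = 1
            try:
                os.write(fd, b"child\n")
                code = 0
            finally:
                os._exit(code)

        # Parent: wait for the child, then continue at the shared offset.
        os.waitpid(pid, 0)
        os.write(fd, b"parent\n")
        os.close(fd)

        with open(path) as f:
            return f.read()


if __name__ == "__main__":
    print(shared_offset_demo(), end="")
```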

And once again, this fits very well into the UNIX philosophy and way of doing things. The do one thing and do it well principle means that multiple programs (and processes) often need to operate on some data in a pipeline where the output of one step is the input of the next step. One way to achieve that is to use temporary files for the outputs (and thus inputs) of the individual steps. However, requiring each step to finish before the next one starts is, in most cases, a waste of resources because a single step rarely utilizes all of them fully. A common technique is to let the steps process the data in parallel, with one step processing a new chunk of data while the previous chunk is being processed by the next step. To achieve this, UNIX-like systems provide so-called pipes: unidirectional data channels with one end for writing and one end for reading. What do pipes have to do with forks and file descriptor inheritance? Let's take a look at another simple example (with a few lines omitted for clarity):

$ strace -ff -e clone,execve,pipe,dup2,read,write -- bash -c '/bin/echo Hello|grep -i hell' 2>&1 >/dev/null
execve("/usr/bin/bash", ["bash", "-c", "/bin/echo Hello|grep -i hell"], 0x7ffc78c1cc50 /* 72 vars */) = 0
pipe([3, 4])                            = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4c42b6ca10) = 19493
strace: Process 19493 attached
[pid 19492] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD
, child_tidptr=0x7f4c42b6ca10) = 19494
strace: Process 19494 attached
[pid 19493] dup2(4, 1)                  = 1
[pid 19494] dup2(3, 0)                  = 0
[pid 19493] execve("/bin/echo", ["/bin/echo", "Hello"], 0x559f63e212b0 /* 71 vars */) = 0
[pid 19494] execve("/usr/bin/grep", ["grep", "-i", "hell"], 0x559f63e212b0 /* 71 vars */) = 0
[pid 19493] write(1, "Hello\n", 6)      = 6
[pid 19493] +++ exited with 0 +++
[pid 19494] read(0, "Hello\n", 98304)   = 6
[pid 19494] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=19493, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
+++ exited with 0 +++

Just a few lines, but showing multiple ingenious things! We can see that the bash process (PID 19492) first creates a pipe with its ends being file descriptors 3 and 4.⁴ Note how there is no path associated with the pipe: it virtually only exists inside the bash process and, at this point, cannot be accessed by any other process.

Then, magic starts happening. The bash process forks (clone()), twice of course, to create processes for both echo and grep. However, keep in mind that the new cloned child processes still run the bash program! And so it's the bash code that then does dup2(4, 1) and dup2(3, 0) in the child processes, duplicating file descriptor 4 to 1 in process 19493 and 3 to 0 in process 19494. Duplicating a file descriptor practically means copying/mirroring: the descriptors then point to the same place, be it somewhere in a file or device, a socket,…, or a pipe, of course.

4 to 1 and 3 to 0, why these numbers? As mentioned above, descriptors 3 and 4 are the ends of the pipe, the read end and the write end, respectively. It is thanks to FD inheritance that these descriptors are available in the new forked/cloned processes even though the pipe() syscall was made in their parent process. 0 and 1 are the standard input and standard output descriptors. Put together, what happens in the dup2() system calls is that the standard input of process 19494 is "hooked up" to the read end of the pipe and the standard output of process 19493 is "hooked up" to the write end of the pipe. This means that if process 19493 writes something to its standard output, process 19494 will be able to read it from its standard input. So the two processes can now communicate, and they can do that without using any special system calls or anything else. They just read/write from/to their standard input/output as usual!

The pipe and file descriptor inheritance allow two processes to be used in a pipeline without the programs they run being modified or specially written for such a use case! And of course, there can be more steps in the pipeline; two is just the minimal length.
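The sequence from the trace (pipe(), two forks, dup2() in each child, then exec) can be sketched in Python roughly as follows. This is an illustration of the technique, not bash's actual code. The run_pipeline helper is a made-up name, and the second pipe is an addition that only exists so the caller can capture what the last program prints.

```python
import os


def run_pipeline(producer, consumer):
    """Run `producer | consumer` (argument lists) and return the consumer's output as bytes."""
    read_end, write_end = os.pipe()    # connects producer to consumer
    out_read, out_write = os.pipe()    # extra pipe, only to capture the final output

    producer_pid = os.fork()
    if producer_pid == 0:
        try:
            os.dup2(write_end, 1)      # stdout now refers to the pipe's write end
            for fd in (read_end, write_end, out_read, out_write):
                os.close(fd)
            os.execvp(producer[0], producer)
        finally:
            os._exit(127)              # only reached if exec failed

    consumer_pid = os.fork()
    if consumer_pid == 0:
        try:
            os.dup2(read_end, 0)       # stdin now refers to the pipe's read end
            os.dup2(out_write, 1)      # stdout goes to the capturing pipe
            for fd in (read_end, write_end, out_read, out_write):
                os.close(fd)
            os.execvp(consumer[0], consumer)
        finally:
            os._exit(127)

    # Parent: close our copies, otherwise the readers would never see end-of-file.
    for fd in (read_end, write_end, out_write):
        os.close(fd)
    with os.fdopen(out_read, "rb") as result:
        output = result.read()
    os.waitpid(producer_pid, 0)
    os.waitpid(consumer_pid, 0)
    return output


if __name__ == "__main__":
    print(run_pipeline(["echo", "Hello"], ["grep", "-i", "hell"]))
```

Neither echo nor grep knows anything about the pipe. All the wiring happens in the forked children before exec() is called.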

Following the dup2() system calls, we can see that only then do the processes actually start executing the echo (19493) and grep (19494) programs, which just do write() and read(), respectively, as usual. And there's one last interesting thing to notice: the echo process (19493) actually terminates before grep reads the data from it! This is possible because the pipe uses a buffer which holds the data. The write() syscall doesn't have to wait for the respective read() syscall. As long as the data fits into the buffer, the process can continue executing (including terminating) right after writing the data into the pipe.
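The buffering can be checked with a tiny Python sketch (the function name is made up). The child writes a short message into a pipe and exits, and the parent deliberately waits for the child to terminate before reading anything.

```python
import os


def writer_exits_before_read():
    """Let a child write to a pipe and exit; read the data only after the child is gone."""
    read_end, write_end = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: a small write fits into the pipe buffer, so it does not block.
        code = 1
        try:
            os.close(read_end)
            os.write(write_end, b"Hello\n")
            code = 0
        finally:
            os._exit(code)

    os.close(write_end)
    # Wait until the writer has terminated...
    _, status = os.waitpid(pid, 0)
    # ...and only then read what it left behind in the pipe buffer.
    data = os.read(read_end, 1024)
    os.close(read_end)
    return os.WEXITSTATUS(status), data


if __name__ == "__main__":
    print(writer_exits_before_read())
```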

Another of the many use cases for file descriptor inheritance and the ability to run code between fork() and exec() is related to security, in particular to permissions and privileges. A process running with high privileges and many permissions can do some initialization and even open some file descriptors that require those privileges and permissions, then fork itself, drop some of the privileges and permissions in the forked child process (usually by changing the user under which the process runs) and continue working in the child process. This way the child process has limited access, only to carefully chosen assets. Again, we can look at an example, though this time a bit more complicated and run as the root user (again with lines omitted for clarity):

# strace -ff -e openat,clone,execve,dup2,setuid,write -- bash -c 'runuser -u vpodzime -- /bin/echo Hello > /file.txt'
execve("/bin/bash", ["bash", "-c", "runuser -u vpodzime -- /bin/echo"...], 0x7ffff56872f0 /* 55 vars */) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f95123bda10) = 837086
strace: Process 837086 attached
[pid 837086] openat(AT_FDCWD, "/file.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 837086] dup2(3, 1)                 = 1
[pid 837086] execve("/sbin/runuser", ["runuser", "-u", "vpodzime", "--", "/bin/echo", "Hello"], 0x55bc8b5f7250 /* 54 vars */) = 0
[pid 837086] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fbf36c73a50) = 837087
strace: Process 837087 attached
[pid 837087] setuid(1000)               = 0
[pid 837087] execve("/bin/echo", ["/bin/echo", "Hello"], 0x7ffe92f922a0 /* 54 vars */) = 0
[pid 837087] write(1, "Hello\n", 6)     = 6
[pid 837087] +++ exited with 0 +++
[pid 837086] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=837086, si_uid=0, si_status=0, si_utime=0, si_stime=1} ---
+++ exited with 0 +++

This time we can see bash start and fork (clone()). In the forked child process (837086) it opens the /file.txt file for writing (creating it), duplicates the associated file descriptor (3) onto the standard output descriptor (1) and executes runuser. runuser then forks, and in the new child process (837087) it sets the user ID to 1000 (the UID of the user vpodzime) and executes echo, which then writes data to its standard output, i.e. its parent's (runuser's) standard output, i.e. the /file.txt file. If we check the permissions of the file:

$ ls -l /file.txt
-rw-r--r--. 1 root root 6 Jul 21 16:56 /file.txt

we can see it is owned by the root user and of course an attempt to append some data to the file as the vpodzime user fails:

$ echo World >> /file.txt
bash: /file.txt: Permission denied

So a process running as the user vpodzime (with UID 1000) was able to write data into the file without the user actually having (write) access to the file. File descriptor inheritance and actions happening between fork() and exec() made this possible.
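The underlying rule is that permissions are checked when a file is opened, not on every write. The Python sketch below illustrates that without requiring root, so it is only a stand-in for the runuser scenario. Instead of switching users, the parent removes all permission bits from the file after opening it. The function name and the exit code convention are made up for this example.

```python
import os
import tempfile


def write_through_inherited_fd():
    """Open a file, revoke its permissions, then let a forked child write via the inherited FD."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.txt")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        # Stand-in for "the child has no access to the file": drop all permission bits.
        os.chmod(path, 0)

        pid = os.fork()
        if pid == 0:
            code = 1                     # 1: writing failed
            try:
                os.write(fd, b"Hello\n")  # works, the FD was opened earlier
                try:
                    os.close(os.open(path, os.O_WRONLY))
                    code = 2             # 2: a fresh open was allowed (e.g. when run as root)
                except PermissionError:
                    code = 0             # 0: a fresh open was denied, as expected
            finally:
                os._exit(code)

        _, status = os.waitpid(pid, 0)
        os.close(fd)
        os.chmod(path, 0o600)            # restore access so we can read the result
        with open(path, "rb") as f:
            content = f.read()
        return content, os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    content, reopen_denied = write_through_inherited_fd()
    print("file content:", content)
    print("fresh open denied in child:", reopen_denied)
```

When run as a regular user, the child cannot open the file anew, yet its write through the inherited descriptor succeeds. When run as root, the fresh open succeeds too, because root bypasses the permission bits.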

Threads, locks, signals,…

Going back to the list of 5 points from the fork(2) man page, let's complete it by looking at the first two points:

The child process is created with a single thread—the one that called fork(). The entire virtual address space of the parent is replicated in the child, including the states of mutexes, condition variables, and other pthreads objects; the use of pthread_atfork(3) may be helpful for dealing with problems that this can cause.

After a fork() in a multithreaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2).

This tells us that the new process created by fork() only has one thread and it is the thread that called fork(), i.e. the new process continues to execute the instructions of that thread. But it also mentions that states of mutexes, condition variables and other objects are cloned (replicated) and that when using fork() in multi-threaded programs, special care needs to be taken.
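A small Python sketch can show both effects at once. It is an illustration under some assumptions: the function name is made up, threading.Lock stands in for a pthreads mutex, and recent Python versions emit a DeprecationWarning when os.fork() is called in a multi-threaded process, for exactly the reasons described here. A helper thread holds a lock while the main thread forks. In the child the helper thread does not exist, yet the lock is still locked, so nobody there will ever release it.

```python
import os
import threading


def lock_state_in_child():
    """Fork while another thread holds a lock; return 0 if the child saw one thread and a locked lock."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def hold_lock():
        with lock:
            holding.set()     # tell the main thread the lock is taken
            release.wait()    # keep holding it until told otherwise

    helper = threading.Thread(target=hold_lock)
    helper.start()
    holding.wait()            # from here on, the helper thread owns the lock

    pid = os.fork()
    if pid == 0:
        code = 1
        try:
            # Only the forking thread exists in the child...
            only_one_thread = threading.active_count() == 1
            # ...but the copied lock is still in its locked state.
            still_locked = not lock.acquire(blocking=False)
            code = 0 if (only_one_thread and still_locked) else 1
        finally:
            os._exit(code)

    _, status = os.waitpid(pid, 0)
    release.set()             # let the helper thread in the parent finish
    helper.join()
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print("child exit code:", lock_state_in_child())
```

A blocking lock.acquire() in the child would hang forever, which is the kind of problem the man page warns about.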

These are the first hints that the cloning behavior of fork() and the ability to execute instructions before calling exec() might cause issues if things are not handled carefully. This is especially true in multi-threaded programs, but even file descriptor inheritance itself can easily cause security issues, for example. More about that in an upcoming blog post focused on how CFEngine deals with these potential issues.

¹ Since glibc version 2.3.3, rather than invoking the kernel's fork() system call, the glibc fork() wrapper function invokes clone() with flags that provide the same effect as the traditional system call. So fork is both a system call and a library function, and in many cases the system call is not really used (depending on library, versions and platforms). In practice, this distinction does not really matter because the API and behavior are the same.
² Some people basically consider this one of the required qualities of a good programmer.
³ Many things are represented as file descriptors on UNIX-like systems, on Linux actually including the POSIX message queues.
⁴ Don't get confused by the FDs showing up as arguments to the system call. The real argument is an int fd[2] array, i.e. a place to store two integers (FDs).
