This is the second blog post in a short series about processes on UNIX-like systems. It is a follow-up to the previous post, which focused on basic definitions, the creation of processes and the relations between them. This time we analyze the semantics of two closely related system calls that play major roles in process creation and program execution.
fork() and exec()
UNIX-based operating systems provide the fork() system call [1] to create a clone of an existing process and the execve() system call to start executing a program in a process. Windows, on the other hand, provides the CreateProcess() function, which starts a given program in a newly created process. Why do UNIX-based systems do things in a more complicated way? There are many reasons for that, some simply historical, as described in The Evolution of the Unix Time-sharing System:
Process control in its modern form was designed and implemented within a couple of days. It is astonishing how easily it fitted into the existing system; at the same time it is easy to see how some of the slightly unusual features of the design are present precisely because they represented small, easily-coded changes to what existed. A good example is the separation of the fork and exec functions.
In fact, the PDP-7's fork call required precisely 27 lines of assembly code. Of course, other changes in the operating system and user programs were required, and some of them were rather interesting and unexpected. But a combined fork-exec would have been considerably more complicated,…
Based on the above, one could say it was merely a coincidence or even a result of some laziness. The laziness of programmers [2] has over the decades definitely led to many great things (including great disasters), so this would not be something new or special. However, was it a coincidence that it was easy to add fork() and then exec(), keeping them as two separate actions?
Finding the answer would make for a nice piece of research and analysis, but that is perhaps a topic for a separate blog post. One thing is definitely true, though: the semantics of fork() and exec() fit very well into the philosophy and design of UNIX-based operating systems. One of the key UNIX principles is Do One Thing And Do It Well. This approach alone requires the operating system to be able to create new processes quickly and efficiently. After all, there are many things to be done!
Creating a clone of an existing process with fork() only to throw most of it away again with exec() may sound inefficient. But a closer look, for example at the fork(2) man page on Linux, shows something else:
Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child.
In other words, fork() creates something like a shallow copy of an existing process, where a memory page is actually copied only when one of the processes first writes to it. This is very similar to the creation of a new thread which, unlike a new child process, stays in the same address space even for writes to memory. In fact, on Linux the fork() C function is implemented using the clone() system call, which is also used to create new threads, only with different arguments.
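The copy-on-write mechanism itself is not directly visible from user space, but its effect is: a write in the child never shows up in the parent. Here is a minimal sketch of that, assuming a POSIX system and Python's os.fork() wrapper; the function name fork_and_modify and the pipe used for reporting back are just illustrative choices.

```python
import os


def fork_and_modify():
    """Fork, modify a variable in the child, return (parent_view, child_view)."""
    data = ["original"]
    r, w = os.pipe()  # channel the child uses to report what it sees
    pid = os.fork()
    if pid == 0:
        # Child: this write only touches the child's own copy of the memory.
        try:
            os.close(r)
            data[0] = "changed in child"
            os.write(w, data[0].encode())
        finally:
            os._exit(0)  # leave right away, never return into the caller
    # Parent: its copy of the variable is untouched by the child's write.
    os.close(w)
    child_view = os.read(r, 1024).decode()
    os.close(r)
    os.waitpid(pid, 0)
    return data[0], child_view


if __name__ == "__main__":
    print(fork_and_modify())
```

The parent still sees the original value even though the child changed "the same" variable.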
If a fork() is followed by an exec(), which loads the new program image and initializes the process to be able to run the program, very little of the work done for fork() is wasted. Such a combination is thus almost as efficient as doing everything at once. Of course, making two system calls generally requires a bit more resources than making a single one, but the overhead is very small compared to everything that needs to happen when a program is started. If, on the other hand, a process only needs to do something in a separate process (for practical or security reasons), which means there is no exec(), a lot of resources are saved compared to starting a program, doing all the initialization, replicating the state of the original process and only then doing the work that had to run in a new process. Again, this is common on UNIX because of the UNIX philosophy. Even the simple use of a subshell in a shell script, to avoid having to restore the current working directory, like this:
echo "Doing something in $PWD"
(cd /tmp; echo "Now in $PWD")
echo "Back to the original shell in $PWD!"
uses a fork() without an exec(). We can see that if we run it with strace:
$ strace -e clone,execve,write,chdir -ff bash /tmp/test.sh > /dev/null
execve("/usr/bin/bash", ["bash", "/tmp/test.sh"], 0x7ffe730ee4a0 /* 71 vars */) = 0
write(1, "Doing something in /home/vpodzim"..., 34) = 34
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLDstrace: Process 709650 attached
, child_tidptr=0x7f1b9cce0a10) = 709650
[pid 709650] chdir("/tmp") = 0
[pid 709650] write(1, "Now in /tmp\n", 12) = 12
[pid 709650] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=709650, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
write(1, "Back to the original shell in /h"..., 46) = 46
+++ exited with 0 +++
There is one execve() system call when the bash program is started. Bash then executes the first echo command, resulting in a write() system call, followed by a clone() resulting from the fork() function call, which creates a new (child) process. The new process changes its current working directory, writes a message to standard output and exits (terminates), and the parent process is notified about that (signaled with SIGCHLD). Before the parent itself terminates, it again writes a message to standard output. There is no execve() system call from the child subshell process, and the change of the current working directory only affects the forked child process.
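The same subshell pattern can be sketched outside of the shell. The following is only an illustration, assuming a POSIX system and Python's os module; chdir_in_child is a made-up helper name, not something bash or the original script uses.

```python
import os


def chdir_in_child(path):
    """Change directory in a forked child only.

    Returns (parent cwd before, child cwd, parent cwd after, child exit code).
    """
    before = os.getcwd()
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: like the subshell, it changes its own working directory.
        try:
            os.close(r)
            os.chdir(path)
            os.write(w, os.getcwd().encode())
        finally:
            os._exit(0)
    os.close(w)
    child_cwd = os.read(r, 4096).decode()
    os.close(r)
    # The parent waits for the child, just like bash does after SIGCHLD.
    _, status = os.waitpid(pid, 0)
    return before, child_cwd, os.getcwd(), os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(chdir_in_child("/"))
```

There is no exec anywhere in this sketch, and the parent's working directory is the same before and after the child ran.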
Another example of when fork() without exec() makes a lot of sense is concurrent (and potentially parallel) processing. A process can do some initialization and then replicate itself N times to create N child processes as workers that then concurrently process some data or handle requests. Of course, this could be done by starting the workers separately, each running the same program, but then the initialization would have to happen in all of them, which would at least waste resources and might even be impossible (because they run concurrently). Threads can be used for such workers too, but processes provide better separation and make the workers really independent: they cannot write into each other's address space and when, for example, one of the workers makes a bad memory access, only that one worker is terminated by the operating system for violating memory constraints.
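A rough sketch of this "initialize once, then fork N workers" pattern might look as follows. It assumes a POSIX system and Python's os.fork(); the lookup table stands in for whatever expensive initialization a real program would do, and run_workers is a hypothetical name.

```python
import os


def run_workers(n, items):
    """Initialize once, fork n workers, and sum up their partial results."""
    # Stand-in for expensive initialization; done once, inherited by all workers.
    table = {x: x * x for x in items}
    workers = []
    for i in range(n):
        r, w = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Worker: handles every n-th item using the inherited table.
            try:
                os.close(r)
                partial = sum(table[x] for x in items[i::n])
                os.write(w, str(partial).encode())
            finally:
                os._exit(0)
        os.close(w)
        workers.append((pid, r))
    total = 0
    for pid, r in workers:
        total += int(os.read(r, 64).decode())
        os.close(r)
        os.waitpid(pid, 0)
    return total


if __name__ == "__main__":
    print(run_workers(3, list(range(10))))
```

Each worker gets the table for free through fork(), and a crash in one worker would not take the others down with it.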
An execve() system call without a previous fork() (or clone()) makes sense too. When a program's code is done running in a process and all it wants to do is start a different program, it can simply exec() the other program. We can again use strace with a simple, although very artificial, shell command to demonstrate this:
$ strace -e execve,clone,write -ff -- bash -c 'echo "Running"; exec echo Done' >/dev/null
execve("/usr/bin/bash", ["bash", "-c", "echo \"Running\"; exec echo Done"], 0x7ffcb2148ed0 /* 71 vars */) = 0
write(1, "Running\n", 8) = 8
execve("/usr/bin/echo", ["echo", "Done"], 0x5562fb9cb250 /* 69 vars */) = 0
write(1, "Done\n", 5) = 5
+++ exited with 0 +++
This time there is no clone() system call: the process runs bash, and then the exec command tells bash to execute echo with a message in the same process.
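The same exec-without-fork behavior can be sketched in Python, assuming a POSIX system with echo available on the PATH. The small program below prints a line and then replaces itself with echo. Note that subprocess.run() is only used here to launch and observe that program (it does its own fork and exec to start the interpreter); the interesting part is that the launched program itself never forks.

```python
import subprocess
import sys

# A tiny program that does some work and then replaces itself with echo.
# Anything after a successful exec is never executed.
PROGRAM = (
    "import os\n"
    "print('Running', flush=True)\n"
    "os.execvp('echo', ['echo', 'Done'])\n"
    "print('never reached')\n"
)


def run_exec_demo():
    """Run PROGRAM in a fresh interpreter and return everything it printed."""
    done = subprocess.run(
        [sys.executable, "-c", PROGRAM],
        capture_output=True,
        text=True,
        check=True,
    )
    return done.stdout


if __name__ == "__main__":
    print(run_exec_demo(), end="")
```

The explicit flush matters: output still sitting in the interpreter's buffer would be thrown away together with the rest of the old program image.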
Between fork() and exec()
Last but not least, a benefit of keeping fork() and exec() separate is that instructions and operations can happen between the two system calls. Some people even consider this the biggest benefit and the most useful feature of the two-step approach, and some consider it the biggest drawback. Let's see why.
The sections above describe process creation on UNIX-like systems as the creation of clones of existing processes. However, that is a bit of a simplification because the newly created process is not a 1:1 copy of the process it was cloned from. As the fork(2) man page on Linux says:
The child process is an exact duplicate of the parent process except for the following points:
after which a list of 9 points follows, only to be followed by
The parent and child also differ with respect to the following Linux-specific process attributes:
with another list of 7 points. Some of those points are quite obvious: for example, the new (child) process has a different PID than the parent process and its parent PID is the PID of the parent process. The rest are more technical and potentially more subtle, but definitely not less important.
File descriptors
Scrolling a bit further, the fork(2) man page says:
Note the following further points:
and lists 5 points that are very important. First, let's look at the last three:
The child inherits copies of the parent's set of open file descriptors. Each file descriptor in the child refers to the same open file description (see open(2)) as the corresponding file descriptor in the parent. This means that the two file descriptors share open file status flags, file offset, and signal-driven I/O attributes (see the description of F_SETOWN and F_SETSIG in fcntl(2)).
The child inherits copies of the parent's set of open message queue descriptors (see mq_overview(7)). Each file descriptor in the child refers to the same open message queue description as the corresponding file descriptor in the parent. This means that the two file descriptors share the same flags (mq_flags).
The child inherits copies of the parent's set of open directory streams (see opendir(3)). POSIX.1 says that the corresponding directory streams in the parent and child may share the directory stream positioning; on Linux/glibc they do not.
Message queues are not so common, at least not the POSIX-defined ones mentioned in the second point. But file descriptors (FDs) and directory streams ("directory descriptors") are. And as we can see, these are inherited, so in this regard the new process is an exact copy of the parent process. This basically means that whatever [3] the parent process had open for reading or writing before it called fork() is available for reading and writing in the child (new) process.
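To make the "same open file description" part of the man page quote a bit more tangible, here is a small sketch, assuming a POSIX system and Python's os and tempfile modules (shared_offset_demo is an illustrative name). The child writes through the inherited descriptor and the parent's next write lands right after the child's data, because both descriptors share one file offset.

```python
import os
import tempfile


def shared_offset_demo():
    """Write to one file from parent and child through an inherited FD."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"parent-1\n")
        pid = os.fork()
        if pid == 0:
            # Child: the FD was opened by the parent before fork().
            try:
                os.write(fd, b"child\n")
            finally:
                os._exit(0)
        os.waitpid(pid, 0)
        # The offset moved in the parent too, so nothing gets overwritten.
        os.write(fd, b"parent-2\n")
        os.lseek(fd, 0, os.SEEK_SET)
        return os.read(fd, 1024)
    finally:
        os.close(fd)
        os.unlink(path)


if __name__ == "__main__":
    print(shared_offset_demo())
```

If the two processes had opened the file separately, each would have had its own offset and the second parent write would have overwritten the child's data.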
And once again, this fits very well into the UNIX philosophy and way of doing things. The do one thing and do it well principle means that multiple programs (and processes) often need to operate on some data in a pipeline where the output of one step is the input of the next. One way to achieve that is to use temporary files for the outputs (and thus inputs) of the individual steps. However, requiring each step to finish before the next one starts is, in most cases, a waste of resources because a single step rarely utilizes all of them fully. A common technique is to let the steps process the data in parallel, with one step processing a new chunk of data while the next step processes the previous chunk.
To achieve this, UNIX-like systems provide so-called pipes: unidirectional data channels with one end for writing and one end for reading. What do pipes have to do with forks and file descriptor inheritance? Let's take a look at another simple example (with a few lines omitted for better clarity):
$ strace -ff -e clone,execve,pipe,dup2,read,write -- bash -c '/bin/echo Hello|grep -i hell' 2>&1 >/dev/null
execve("/usr/bin/bash", ["bash", "-c", "/bin/echo Hello|grep -i hell"], 0x7ffc78c1cc50 /* 72 vars */) = 0
pipe([3, 4]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4c42b6ca10) = 19493
strace: Process 19493 attached
[pid 19492] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD
, child_tidptr=0x7f4c42b6ca10) = 19494
strace: Process 19494 attached
[pid 19493] dup2(4, 1) = 1
[pid 19494] dup2(3, 0) = 0
[pid 19493] execve("/bin/echo", ["/bin/echo", "Hello"], 0x559f63e212b0 /* 71 vars */) = 0
[pid 19494] execve("/usr/bin/grep", ["grep", "-i", "hell"], 0x559f63e212b0 /* 71 vars */) = 0
[pid 19493] write(1, "Hello\n", 6) = 6
[pid 19493] +++ exited with 0 +++
[pid 19494] read(0, "Hello\n", 98304) = 6
[pid 19494] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=19493, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
+++ exited with 0 +++
Just a few lines, but showing multiple ingenious things! We can see that the bash process (PID 19492) first creates a pipe with its ends being file descriptors 3 and 4. [4] Note how there is no path associated with the pipe; it virtually only exists inside the bash process and cannot be accessed by any other process. Then, the magic starts happening. The bash process forks (clone()), twice of course, to create processes for both echo and grep. However, keep in mind that the newly cloned child processes still run the bash program! And so it is the bash code that then does dup2(4, 1) and dup2(3, 0) in the child processes, respectively, duplicating file descriptor 4 to 1 in process 19493 and 3 to 0 in process 19494. Duplicating a file descriptor practically means copying/mirroring: the descriptors then point to the same place, be it somewhere in a file or device, a socket,…, or a pipe, of course.
4 to 1 and 3 to 0, why these numbers? As mentioned above, descriptors 3 and 4 are the ends of the pipe, the read end and the write end, respectively. It is thanks to FD inheritance that these descriptors are available in the new forked/cloned processes even though the pipe() syscall was made in their parent process. 0 and 1 are the standard input and standard output descriptors. Put together, what happens in the dup2() system calls is that the standard input of process 19494 is "hooked up" to the read end of the pipe and the standard output of process 19493 is "hooked up" to the write end of the pipe. This means that if process 19493 writes something to its standard output, process 19494 will be able to read it from its standard input. So the two processes can now communicate, and they can do that without using any special system calls; they just read/write from/to their standard input/output as usual! The pipe and file descriptor inheritance allow two processes to be used in a pipeline without the programs they run being modified or specially written for such a use case! And of course, there can be more steps in the pipeline; two is just the minimal length.
Following the dup2() system calls, we can see that only then do the processes actually start executing the echo (19493) and grep (19494) programs, which just do write() and read(), respectively, as usual. And there is one last interesting thing to notice! The echo process (19493) actually terminates before grep reads the data from it! This is possible because the pipe uses a buffer which holds the data. So the write() syscall doesn't have to wait for the respective read() syscall: after writing the data into the pipe, as long as the data fits into the buffer, the process can continue executing (including terminating).
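The sequence seen in the trace (pipe, two forks, dup2 in each child, then exec) can be sketched roughly like this. It assumes a POSIX system with echo and grep on the PATH; run_pipeline and the second pipe (used only so the parent can capture grep's output) are additions for the sake of the illustration, not something bash does. One Python-specific detail: descriptors returned by os.pipe() are not inherited across exec by default, while the targets of os.dup2() are, which is exactly what is needed here.

```python
import os


def run_pipeline():
    """Run the equivalent of `echo Hello | grep -i hell`, return grep's output."""
    pipe_r, pipe_w = os.pipe()  # connects echo's stdout to grep's stdin
    out_r, out_w = os.pipe()    # lets the parent capture grep's stdout
    all_fds = (pipe_r, pipe_w, out_r, out_w)

    echo_pid = os.fork()
    if echo_pid == 0:
        try:
            os.dup2(pipe_w, 1)  # stdout -> write end of the pipe
            for fd in all_fds:
                os.close(fd)
            os.execvp("echo", ["echo", "Hello"])
        finally:
            os._exit(127)       # only reached if the exec failed

    grep_pid = os.fork()
    if grep_pid == 0:
        try:
            os.dup2(pipe_r, 0)  # stdin <- read end of the pipe
            os.dup2(out_w, 1)   # stdout -> capture pipe
            for fd in all_fds:
                os.close(fd)
            os.execvp("grep", ["grep", "-i", "hell"])
        finally:
            os._exit(127)

    # The parent must close its copies, otherwise grep would never see EOF.
    for fd in (pipe_r, pipe_w, out_w):
        os.close(fd)
    with os.fdopen(out_r, "rb") as out:
        output = out.read()
    os.waitpid(echo_pid, 0)
    os.waitpid(grep_pid, 0)
    return output


if __name__ == "__main__":
    print(run_pipeline())
```

Neither echo nor grep knows anything about the pipe; all the plumbing happens in the code that runs between fork() and exec().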
Another one of many use cases for file descriptor inheritance and the ability to run code between fork() and exec() is related to security, in particular to permissions and privileges. A process running with high privileges and many permissions can do some initialization and even open some file descriptors that require those privileges and permissions, then fork itself, drop some of the privileges and permissions in the forked child process (usually by changing the user under which the process runs) and continue its work there. This way the child process has limited access only to carefully chosen assets. Again, we can look at an example, though this time a bit more complicated and run as the root user (again with lines omitted for better clarity):
# strace -ff -e openat,clone,dup2,execve,setuid,write -- bash -c 'runuser -u vpodzime -- /bin/echo Hello > /file.txt'
execve("/bin/bash", ["bash", "-c", "runuser -u vpodzime -- /bin/echo"...], 0x7ffff56872f0 /* 55 vars */) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f95123bda10) = 837086
strace: Process 837086 attached
[pid 837086] openat(AT_FDCWD, "/file.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 837086] dup2(3, 1) = 1
[pid 837086] execve("/sbin/runuser", ["runuser", "-u", "vpodzime", "--", "/bin/echo", "Hello"], 0x55bc8b5f7250 /* 54 vars */) = 0
[pid 837086] clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fbf36c73a50) = 837087
strace: Process 837087 attached
[pid 837087] setuid(1000) = 0
[pid 837087] execve("/bin/echo", ["/bin/echo", "Hello"], 0x7ffe92f922a0 /* 54 vars */) = 0
[pid 837087] write(1, "Hello\n", 6) = 6
[pid 837087] +++ exited with 0 +++
[pid 837086] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=837086, si_uid=0, si_status=0, si_utime=0, si_stime=1} ---
+++ exited with 0 +++
This time we can see bash started, which forks (clone()). In the forked child process (837086) it opens the /file.txt file for writing (and creates it), copies the associated file descriptor (3) to the standard output descriptor (1) and executes runuser. runuser then forks and in the new child process (837087) it sets the user ID to 1000 (the UID of the user vpodzime) and executes echo, which then writes data to its standard output, i.e. its parent's (runuser) standard output, i.e. the /file.txt file. If we check the permissions of the file:
$ ls -l /file.txt
-rw-r--r--. 1 root root 6 Jul 21 16:56 /file.txt
we can see it is owned by the root user and of course an attempt to append some data to the file as the vpodzime user fails:
$ echo World >> /file.txt
bash: /file.txt: Permission denied
So a process running as the user vpodzime (with UID 1000) was able to write data into the file without the user actually having (write) access to the file. File descriptor inheritance and actions happening between fork() and exec() made this possible.
Threads, locks, signals,…
Going back to the list of 5 points from the fork(2) man page, to complete it, let's see the first two points:
The child process is created with a single thread—the one that called fork(). The entire virtual address space of the parent is replicated in the child, including the states of mutexes, condition variables, and other pthreads objects; the use of pthread_atfork(3) may be helpful for dealing with problems that this can cause.
After a fork() in a multithreaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2).
This tells us that the new process created by fork() has only one thread, the thread that called fork(), i.e. the new process continues to execute the instructions of that thread. But it also mentions that the states of mutexes, condition variables and other objects are cloned (replicated) and that special care needs to be taken when using fork() in multi-threaded programs.
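The single-thread rule is easy to observe. The sketch below assumes a POSIX system and Python's threading module; thread_counts_after_fork is a made-up name. It starts a helper thread, forks, and asks both processes how many threads they have. Newer Python versions may print a DeprecationWarning when a multi-threaded process forks, which is itself a reminder of the care mentioned above.

```python
import os
import threading


def thread_counts_after_fork():
    """Return (thread count in parent, thread count in forked child)."""
    stop = threading.Event()
    helper = threading.Thread(target=stop.wait)  # idles until told to stop
    helper.start()
    parent_count = threading.active_count()
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the thread that called fork() exists here.
        try:
            os.close(r)
            os.write(w, str(threading.active_count()).encode())
        finally:
            os._exit(0)
    os.close(w)
    child_count = int(os.read(r, 16).decode())
    os.close(r)
    os.waitpid(pid, 0)
    stop.set()
    helper.join()
    return parent_count, child_count


if __name__ == "__main__":
    print(thread_counts_after_fork())
```

The helper thread simply does not exist in the child, so anything it was holding or doing at the moment of the fork stays frozen in the child's copy of the memory.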
These are the first hints that the cloning behavior of fork() and the ability to execute instructions before calling exec() might cause issues if things are not handled carefully. This is especially true in multi-threaded programs, but even file descriptor inheritance itself can easily cause security issues, for example. However, more about that in an upcoming blog post focused on how CFEngine deals with these potential issues.
[1] Since version 2.3.3, rather than invoking the kernel's fork() system call, the glibc fork() wrapper function invokes clone() with flags. So, fork is both a system call and a library function, and in many cases the system call is not really used (depending on library, versions and platforms), but in practice, this ambiguity / distinction does not really matter because the API and behavior are the same.
[2] Some people basically consider this one of the required qualities of a good programmer.
[3] Many things are represented as file descriptors on UNIX-like systems, on Linux including the POSIX message queues, actually.
[4] Don't get confused by the FDs showing up as arguments to the system call; the real argument is an int fd[2] array, i.e. a place where to store two integers (FDs).