Editing and copying large files, or large numbers of files, is slow. For a configuration management tool, it is probably one of the slowest things we do, apart from waiting for other programs to finish or waiting for network communication. In this blog post, we look at how to copy files, and more specifically at the most performant approaches available on modern Linux systems. We are working on implementing these techniques so that CFEngine and all your policy will copy files more efficiently.
Traditional approach
Many programs need to copy data or whole files. Usually, this is something that doesn’t bother developers too much because it is straightforward to do with the following steps:
1. Open the source file
2. Open or create the destination file
3. Read a chunk of data from the source file
4. Write the chunk into the destination file
5. Repeat steps 3 and 4 until all data is read from the source and written into the destination
6. Close both files
And, if desired, fix up ownership and permissions on the destination file based on the source file.
The above approach is very easy, especially in high-level languages or with libraries that don’t require the developer to handle cases where the read() system call (syscall) is interrupted by a signal, or that even hide the fact that it can easily read fewer bytes than requested. A very naive approach that nevertheless often works well enough is to just read the whole source file into one big buffer and then write the data into the destination file. This, of course, doesn’t work for large files and is thus not a good approach for a generic CopyFile() function.
Then there are a couple of better alternatives using the select()/poll() syscalls, or non-blocking/asynchronous I/O with two buffers or a ring buffer, to better handle cases where the program could, for example, be writing into the destination file, but is instead stuck trying to read from the source file which is not ready for reading (due to slow storage hardware, for example), and vice versa.
Talking to the system
All the above approaches, no matter how complex and smart, have one disadvantage – data is shoveled from one file into another, but the information that it is actually being copied is lost. In other words, when the source file is read and the data is written into the destination file, the operating system doesn’t know it’s being copied. This prevents the operating system, and in particular the file system, from doing anything smarter than providing the requested data to the program and then writing some other data somewhere else. The information that it is the same data is lost.
When working on a C implementation of the mender-flash tool [1], I was thinking that there must be a better way. Based on some past knowledge of storage and how file systems work, it was quite obvious that if some file systems can create copy-on-write (COW) copies of files, where blocks of data are shared by multiple files as long as they all contain copies of the same blocks, then there must be some way to tell the operating system that data should be copied, not just read from one place and written to another.
Also, since the mender-flash tool is expected to run on platforms with very little memory, it made sense to keep the required buffers small or avoid them altogether. The size of the buffers used for the shoveling approach creates a trade-off between the number of system calls required to shovel the data and the amount of memory needed. The bigger the buffer, the fewer syscalls, but the more memory required. If memory is not a problem, the very naive approach mentioned above, with a single buffer as big as the source file, potentially requires only 2 syscalls to copy all the data – read() everything and then write() everything. [2]
However, the required size of the buffer for something like a system image (in the case of mender-flash) is very likely too large for a small device to handle. On the other hand, using a small buffer means a large number of syscalls are needed to shovel the data. And making system calls is not the cheapest operation a process can do. Usually this is not a problem, because the related I/O operations consume the biggest share of the required resources (time, power, …), but if the number of required syscalls can be lowered, it’s always nice.
sendfile() and splice()
The Linux kernel has provided better alternatives addressing the problems described above for a long time – even long before it started supporting any file systems with COW capabilities. Since version 2.2 it provides the sendfile() system call, originally designed for sending data from a file via a socket, but allowing copying of data from one file into another since version 2.6.33. [3] Its man page literally says:
sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.
where user space means the buffer in the address space of the particular process (which thus is not in the kernel’s address space).
There’s also a related system call splice() which
moves data between two file descriptors without copying between kernel address space and user address space. It transfers up to len bytes of data from the file descriptor fd_in to the file descriptor fd_out, where one of the file descriptors must refer to a pipe.
And, since version 2.6.31, both file descriptors can refer to pipes. [4] This similar system call allows efficient copying of data from a pipe, e.g., data produced by some other process.
copy_file_range() and FICLONE
sendfile() has the benefit of replacing repeated read() and write() syscalls with a single one [2], and since it doesn’t require the buffer, it avoids copying data out of the kernel and back. With it, the program talks to the kernel more precisely, telling it that it needs to send data from one file into another file instead of telling it that it needs to read some data and write some (unrelated) data. However, it doesn’t tell the kernel exactly that it needs to copy data from one file into another (keep in mind that its original purpose was to efficiently send data from a file via a socket).
To talk to the kernel even more precisely, a program can use the copy_file_range() system call, which is very similar to sendfile() at first glance, but has the following important note in its man page:
copy_file_range() gives filesystems an opportunity to implement “copy acceleration” techniques, such as the use of reflinks (i.e., two or more inodes that share pointers to the same copy-on-write disk blocks) or server-side-copy (in the case of NFS).
where reflinks means the copy-on-write mechanism mentioned above. This syscall has a short, but a bit wild, history (read the copy_file_range(2) man page for details), with version 5.19 providing the most recent and best implementation, and it only reliably works if the source and destination files are on the same file system. However, in those cases, it provides (almost) the best possible approach to copying data between two files.
Why only almost the best possible approach? Because it supports, as its name suggests, copying ranges of data between two files. Which is nice if that is required, and of course copying the whole file is just as easy as telling it to copy the range between the beginning and the end of the file. On the other hand, if the goal is to create a copy of a whole file, it should be possible to tell the kernel exactly that, right? And as the above note says, it "gives filesystems an opportunity…", which doesn’t sound very reassuring.
Then comes the last in the list of modern system calls for file/data copying. It’s not exactly a new syscall, but rather a new variant of the good old ioctl() system call that can be used for various I/O-related operations – FICLONE. Its declaration looks like this:
int ioctl(int dest_fd, FICLONE, int src_fd);
which shows that it’s exactly what we want to tell the kernel when copying a whole file – "clone this file (src_fd) here (dest_fd)". However, as its man page says:
If a filesystem supports files sharing physical storage between multiple files (“reflink”), this ioctl(2) operation can be used to make some of the data in the src_fd file appear in the dest_fd file by sharing the underlying storage, which is faster than making a separate physical copy of the data.
it only works inside one file system and only if the file system supports sharing physical storage (a different description of the COW mechanism described above). When that is exactly the case, nothing can beat this system call in performance, nor in the (e)quality of the result. It has existed since version 4.5, and the constant (macro) for the ioctl() call was actually originally called BTRFS_IOC_CLONE, revealing that it was introduced together with the Btrfs file system.
Sparse files
When talking about a generic function to copy a file from a source to a destination efficiently, special attention needs to be paid to sparse files – files that contain not only data, but also so-called holes, areas that contain no data, don’t take up disk space, and, when read, look like blocks of zeroes. There are many reasons why such files are useful, and they are quite commonly created and supported by modern file systems. It’s not uncommon for the amount of space occupied by a sparse file to be an order of magnitude smaller than its reported size. The stat utility (and related system calls and functions) reports both the size of a file and the number of blocks (usually 512 bytes big) it occupies.
In relation to copying, it should be quite obvious that the naive approach mentioned at the beginning of this post would result in a non-sparse copy of the file, because read() would simply return blocks of zeroes when reading across the holes in the source file and write() would simply write those zeroes into the destination file. And there is a big difference between a sparse file and a file containing blocks of zeroes, no matter how big those blocks are. Again, when a program wants to create a sparse file, it needs to tell the operating system instead of assuming that writing a block of zeroes will create a hole. The operating system doesn’t care about the data being written, of course. An optimization, which we have in our CFEngine code [5], is to detect such blocks of zeroes and create holes in the destination file instead of writing them there. However, it doesn’t avoid reading all those zeroes from the source file.
The easiest way to create a sparse file is to use the truncate utility and set a big size on a new empty file, or on an existing file containing some data, using the --size option, which creates a hole at the end of the file after the last data block. Another approach is to use the dd utility and its seek option to make it write some data at an offset larger than the end of the last data block in the file, which creates a hole between such an offset and the end of the last data block, with the data newly written by dd becoming the new last block of data.
Seek is the keyword when it comes to sparse files. As the example with dd shows, it provides one way to create holes in files. And since Linux 3.1, the lseek() syscall accepts two special values as its last argument whence – SEEK_DATA and SEEK_HOLE. These two special values, as their names suggest, allow a program to seek (reposition its file descriptor) in a file to the beginning of the next data block or hole (if any), respectively. This can be used to avoid reading all the zeroes from the holes in the source file, in favor of skipping the holes and creating the exact same holes in the destination file.
Of the modern system calls for copying data between files described above, only the FICLONE variant of the ioctl() syscall takes care of mirroring holes between the source and the destination. Which, after all, makes sense, as it literally creates a clone of the source file. Both sendfile() and copy_file_range() replicate holes as blocks of zeroes in the destination file, and lseek() with SEEK_DATA and SEEK_HOLE needs to be used to avoid that.
Conclusions
Although copying files, or more generally copying data from a source file into a destination file, looks like an easy thing to do, this blog post shows that the naive approach has multiple issues that can easily result in unnecessary consumption of various resources – CPU, memory, disk space and also time. Modern Linux systems provide mechanisms that allow programs to be more explicit and tell the operating system that they are copying data, and not just reading data here and writing data there. This blog post provides an overview of 3 specific system calls that serve such a purpose (plus splice() as a related one), all with their limitations and with varying availability across different versions of Linux. An ideal implementation of a generic file copying function should check what the conditions are and then decide which of the approaches mentioned above to choose. And a defensive way of doing things is to fall back to the very basic approach of shoveling data from the source file into the destination file in case the chosen more efficient approach fails.
At the time of writing this blog, we are working on using the collected knowledge summarized above in CFEngine by changing the implementation of our generic function FileSparseCopy() in our libntech utility library (now available under the Apache license).
It is not an easy task, because CFEngine supports a wide variety of platforms, including some exotic ones like HP-UX, Solaris and AIX, as well as some old versions of GNU/Linux distributions like RHEL 6 and Ubuntu 16.04. All the syscalls mentioned above are Linux-specific (although similar syscalls may exist in other operating systems, even with similar or the exact same names), so we just use the good old shoveling approach with hole detection on all of the non-Linux platforms, as well as on RHEL 6, which doesn’t have SEEK_DATA and SEEK_HOLE (or copy_file_range()). Once fully implemented, tested and merged, these new mechanisms for data copying will allow CFEngine to perform this operation more efficiently with lower resource usage, at least on modern Linux-based platforms. Although CFEngine doesn’t often copy large files and sparse files, it is nice to have an efficient generic implementation ready for such cases. And of course, the slower the hardware is, especially regarding I/O operations, the bigger the impact can be. For example, on a slow device using a micro-SD card for storage (like many of the ARM boards do), using the FICLONE ioctl() can yield very different results than simply shoveling the data – slowly reading it from the SD card and writing it even more slowly to the same storage – let alone in the case of large sparse files.
1. Which triggered a very long discussion about language choices and the necessary complexity of the tool, but that’s another story.
2. Up to a certain limit that the syscall has. If more data needs to be copied, it needs to be called multiple times, but the limit is 2 GiB according to the man page.
3. According to the HISTORY section of the sendfile(2) man page.
4. Our blog post about processes describes, among many other things, what pipes are and how they work.
5. Search for FileSparseWrite() in the file_lib.c file in the sources of our utility library.