Editing and copying large files, or large numbers of files, is slow. For a configuration management tool, it is probably one of the slowest things we do, apart from waiting for other programs to finish or waiting for network communication. In this blog post, we look at how to copy files, and more specifically at the most performant approaches available on modern Linux systems. We are working on implementing these techniques so that CFEngine and all your policy will copy files more efficiently.
Traditional approach
Many programs need to copy data or whole files. Usually, this is something that doesn’t bother developers too much because it is straightforward to do with the following steps:
1. Open the source file
2. Open or create the destination file
3. Read a chunk of data from the source file
4. Write the chunk into the destination file
5. Repeat steps 3 and 4 until all data is read from the source and written into the destination
6. Close both files
And, if desired, fix up ownership and permissions on the destination file based on the source file.
The above approach is very easy, especially in high-level languages or with libraries that don’t require the developer to handle cases where the read() system call (syscall) is interrupted by a signal, or that even hide the fact that it can easily read fewer bytes than requested. A very naive approach that nevertheless often works well enough is to just read the whole source file into one big buffer and then write the data into the destination file. This, of course, doesn’t work for large files and is thus not a good approach for a generic CopyFile() function.
Then there are a couple of better alternatives using the select()/poll() syscalls, or non-blocking/asynchronous I/O with two buffers or a ring buffer, to better handle cases where the program could, for example, be writing into the destination file, but is instead stuck trying to read from the source file which is not ready for reading (due to slow storage hardware, for example), and vice versa.
Talking to the system
All the above approaches, no matter how complex and smart, have one disadvantage – data is shoveled from one file into another, but the information that it is actually being copied is lost. In other words, when the source file is read and the data is written into the destination file, the operating system doesn’t know it’s being copied. This prevents the operating system, and in particular the file system, from doing anything smarter than providing the requested data to the program and then writing some other data somewhere else. The information that it is the same data is lost.
When working on a C implementation of the mender-flash tool [1], I was thinking that there must be a better way. Based on some past knowledge of storage and how file systems work, it was quite obvious that if some file systems can create copy-on-write (COW) copies of files, where blocks of data are shared by multiple files as long as they all contain copies of the same blocks, then there must be some way to tell the operating system that data should be copied, not just read from one place and written to another.
Also, since the mender-flash tool is expected to run on platforms with very little memory, it made sense to keep the required buffers small or avoid them altogether. The size of the buffers used for the shoveling approach creates a trade-off between the number of system calls required to shovel the data and the amount of memory needed. The bigger the buffer, the fewer syscalls, but the more memory required. If memory is not a problem, the very naive approach mentioned above, with a single buffer as big as the source file, potentially requires only 2 syscalls to copy all the data – read() everything and then write() everything. [2]
However, the required size of the buffer for something like a system image (in the case of mender-flash) is very likely too large for a small device to handle. On the other hand, using a small buffer means a large number of syscalls are needed to shovel the data. And making system calls is not the cheapest operation a process can do. Usually this is not a problem, because the related I/O operations consume the biggest share of the required resources (time, power, …), but if the number of required syscalls can be lowered, it’s always nice.
sendfile() and splice()
The Linux kernel has provided better alternatives addressing the problems described above for a long time – even long before it started supporting any file systems with COW capabilities. Since version 2.2 it provides the sendfile() system call, originally designed for sending data from a file via a socket, but allowing copying of data from one file into another since version 2.6.33. [3] Its man page literally says:
sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.
where user space means the buffer in the address space of the particular process (which thus is not in the kernel’s address space).
There’s also a related system call splice() which
moves data between two file descriptors without copying between kernel address space and user address space. It transfers up to len bytes of data from the file descriptor fd_in to the file descriptor fd_out, where one of the file descriptors must refer to a pipe.
And, since version 2.6.31, both file descriptors can refer to pipes. [4] This similar system call allows efficient copying of data from a pipe, e.g., data produced by some other process.
copy_file_range() and FICLONE
sendfile() has the benefit of replacing repeated read() and write() syscalls with a single one [2], and since it doesn’t require the buffer, it avoids copying data out of the kernel and back. With it, the program talks to the kernel more precisely, telling it that it needs to send data from one file into another file instead of telling it that it needs to read some data and write some (unrelated) data. However, it doesn’t tell the kernel exactly that it needs to copy data from one file into another (keep in mind that its original purpose was to efficiently send data from a file via a socket).
To talk to the kernel even more precisely, a program can use the copy_file_range() system call, which is very similar to sendfile() at first glance, but has the following important note in its man page:
copy_file_range() gives filesystems an opportunity to implement “copy acceleration” techniques, such as the use of reflinks (i.e., two or more inodes that share pointers to the same copy-on-write disk blocks) or server-side-copy (in the case of NFS).
where reflinks means the copy-on-write mechanism mentioned above. This syscall has a short, but a bit wild, history (read the copy_file_range(2) man page for details), with version 5.19 providing the most recent and best implementation, and it only reliably works if the source and destination files are on the same file system. However, in those cases, it provides (almost) the best possible approach to copying data between two files.
Why only almost the best possible approach? Because it supports, as its name suggests, copying ranges of data between two files. Which is nice if that is required, and of course copying the whole file is just as easy as telling it to copy the range between the beginning and the end of the file. On the other hand, if the goal is to create a copy of a whole file, it should be possible to tell the kernel exactly that, right? And as the above note says, it "gives filesystems an opportunity…", which doesn’t sound very reassuring.
Then comes the last in the list of modern system calls for file/data copying. It’s not exactly a new syscall, but rather a new variant of the good old ioctl() system call that can be used for various I/O-related operations – FICLONE. Its declaration looks like this:
int ioctl(int dest_fd, FICLONE, int src_fd);
which shows that it’s exactly what we want to tell the kernel when copying a whole file – "clone this file (src_fd) here (dest_fd)". However, as its man page says:
If a filesystem supports files sharing physical storage between multiple files (“reflink”), this ioctl(2) operation can be used to make some of the data in the src_fd file appear in the dest_fd file by sharing the underlying storage, which is faster than making a separate physical copy of the data.
it only works inside one file system and only if the file system supports sharing physical storage (a different description of the COW mechanism described above). When that is exactly the case, nothing can beat this system call in performance, nor in the (e)quality of the result. It has existed since version 4.5, and the constant (macro) for the ioctl() call was actually originally called BTRFS_IOC_CLONE, revealing that it was introduced together with the Btrfs file system.
Sparse files
When talking about a generic function to copy a file from a source to a destination efficiently, special attention needs to be paid to sparse files – files that contain not only data, but also so-called holes, areas that contain no data, don’t take up disk space, and, when read, look like blocks of zeroes. There are many reasons why such files are useful, and they are quite commonly created and supported by modern file systems. It’s not uncommon for the amount of space occupied by a sparse file to be an order of magnitude smaller than its reported size. The stat utility (and related system calls and functions) reports both the size of a file and the number of blocks (usually 512 bytes big) it occupies.
In relation to copying, it should be quite obvious that the naive approach mentioned at the beginning of this post would result in a non-sparse copy of the file, because read() would simply return blocks of zeroes when reading across the holes in the source file and write() would simply write those zeroes into the destination file. And there is a big difference between a sparse file and a file containing blocks of zeroes, no matter how big those blocks are. Again, when a program wants to create a sparse file, it needs to tell the operating system instead of assuming that writing a block of zeroes will create a hole. The operating system doesn’t care about the data being written, of course. An optimization, which we have in our CFEngine code [5], is to detect such blocks of zeroes and create holes in the destination file instead of writing them there. However, it doesn’t avoid reading all those zeroes from the source file.
The easiest way to create a sparse file is to use the truncate utility and set a big size on a new empty file, or on an existing file containing some data, using the --size option, which creates a hole at the end of the file after the last data block. Another approach is to use the dd utility and its seek option to make it write some data at an offset larger than the end of the last data block in the file, which creates a hole between such an offset and the end of the last data block, with the data newly written by dd becoming the new last block of data.
Seek is the keyword when it comes to sparse files. As the example with dd shows, it provides one way to create holes in files. And since Linux 3.1, the lseek() syscall accepts two special values as its last argument whence – SEEK_DATA and SEEK_HOLE. These two special values, as their names suggest, allow a program to seek (reposition its file descriptor) in a file to the beginning of the next data block or hole (if any), respectively. This can be used to avoid reading all the zeroes from the holes in the source file, in favor of skipping the holes and creating the exact same holes in the destination file.
Of the modern system calls for copying data between files described above, only the FICLONE variant of the ioctl() syscall takes care of mirroring holes between the source and the destination. Which, after all, makes sense, as it literally creates a clone of the source file. Both sendfile() and copy_file_range() replicate holes as blocks of zeroes in the destination file, and lseek() with SEEK_DATA and SEEK_HOLE needs to be used to avoid that.
Conclusions
Although copying files, or more generally copying data from a source file into a destination file, looks like an easy thing to do, this blog post shows that the naive approach has multiple issues that can easily result in unnecessary consumption of various resources – CPU, memory, disk space and also time. Modern Linux systems provide mechanisms that allow programs to be more explicit and tell the operating system that they are copying data, and not just reading data here and writing data there. This blog post provides an overview of 3 specific system calls that serve such a purpose (plus splice() as a related one), all with their limitations and with varying availability across different versions of Linux. An ideal implementation of a generic file copying function should check what the conditions are and then decide which of the approaches mentioned above to choose. And a defensive way of doing things is to fall back to the very basic approach of shoveling data from the source file into the destination file in case the chosen more efficient approach fails.
At the time of writing this blog, we are working on using the collected knowledge summarized above in CFEngine by changing the implementation of our generic function FileSparseCopy() in our libntech utility library (now available under the Apache license).
It is not an easy task, because CFEngine supports a wide variety of platforms, including some exotic ones like HP-UX, Solaris and AIX, as well as some old versions of GNU/Linux distributions like RHEL 6 and Ubuntu 16.04. All the syscalls mentioned above are Linux-specific (although similar syscalls may exist in other operating systems, even with similar or the exact same names), so we just use the good old shoveling approach with hole detection on all of the non-Linux platforms, as well as on RHEL 6, which doesn’t have SEEK_DATA and SEEK_HOLE (or copy_file_range()). Once fully implemented, tested and merged, these new mechanisms for data copying will allow CFEngine to perform this operation more efficiently with lower resource usage, at least on modern Linux-based platforms. Although CFEngine doesn’t often copy large files and sparse files, it is nice to have an efficient generic implementation ready for such cases. And of course, the slower the hardware is, especially regarding I/O operations, the bigger the impact can be. For example, on a slow device using a micro-SD card for storage (like many of the ARM boards do), using the FICLONE ioctl() can yield very different results than simply shoveling the data – slowly reading it from the SD card and writing it even more slowly to the same storage – let alone in the case of large sparse files.
1. Which triggered a very long discussion about language choices and the necessary complexity of the tool, but that’s another story.
2. Up to a certain limit that the syscall has. If more data needs to be copied, it needs to be called multiple times, but the limit is 2 GiB according to the man page.
3. According to the HISTORY section of the sendfile(2) man page.
4. Our blog post about processes describes, among many other things, what pipes are and how they work.
5. Search for FileSparseWrite() in the file_lib.c file in the sources of our utility library.