Introduction
In the CFEngine Core team, we have recently been working on a fix for
our WaitForCriticalSection()
function. In short, the function checks a
timestamp in a chunk of (lock) data stored in a local LMDB database and
if the timestamp is too old, it writes a new chunk of (lock) data with
the new timestamp. However, this used to be done in separate steps -
read the data from the DB and close DB, check the data and potentially
write the new data into the DB. So, there was a race condition because
if multiple processes did the same steps at the same time, they could
have read and checked the same timestamp value and then write their own
data with their new timestamps one after another. On the high-level
perspective that meant multiple processes could have entered the
critical section at the same time.
To fix the issue, we needed a function that would read, check and
potentially write data into an LMDB database in one transaction.
That way, multiple processes trying to enter the critical section at the
same time would do so one after another. To decide whether the current
value should be overwritten with a new value, the new function, called
OverwriteDB()
(complementary to ReadDB()
and WriteDB()
), takes as
one of its arguments a pointer to a function that gets the current value
and returns true
(overwrite) or false
(pass). Everything was working
fine with this new function. And then, in our CI, we started getting hit
by a (SIG)BUS. Tests started failing and hanging on our Solaris SPARC64
and HP-UX IA64 machines and investigation revealed that cf-agent
processes in the tests were sometimes killed with the SIGBUS signal
when there were multiple such processes running at the same time. And
fighting over the critical section, we figured.
Bad alignment
When we tried to debug the issue using truss (Solaris alternative of strace), we saw this:
16422: open("/var/cfengine/state/cf_lock.lmdb", O_RDWR|O_CREAT, 0644) = 10
16422: pread(10, "\0\0\0\0\0\0\0\b\0\0\0\0".., 92, 0) = 92
16422: pread(10, "\0\0\001\0\0\0\b\0\0\0\0".., 92, 8192) = 92
16422: mmap(0x00000000, 104857600, PROT_READ, MAP_SHARED, 10, 0) = 0xF8000000
16422: open("/var/cfengine/state/cf_lock.lmdb", O_WRONLY|O_DSYNC) = 11
16422: fcntl(11, F_GETFD, 0x00000000) = 0
16422: fcntl(11, F_SETFD, 0x00000001) = 0
16422: fcntl(9, F_SETLK, 0xFFBFA63C) = 0
16422: close(8) = 0
16422: time() = 1591194272
16422: getpid() = 16422 [16343]
16422: time() = 1591194272
16422: time() = 1591194272
16422: lwp_mutex_timedlock(0xFE860040, 0x00000000) = 0
16422: time() = 1591194272
16422: Incurred fault #5, FLTACCESS %pc = 0xFF28D24C
16422: siginfo: SIGBUS BUS_ADRALN addr=0xF800B426
16422: Received signal #10, SIGBUS [caught]
16422: siginfo: SIGBUS BUS_ADRALN addr=0xF800B426
So the process used mmap()
to map a file to its memory address space
starting at the address 0xF8000000
and then when it tried to access
the address 0xF800B426
it was killed with the SIGBUS signal
indicating that there was an invalid address alignment (BUS_ADRALN).
This was the first time we saw an issue like this. Memories from the university had some traces of alignment and alignment issues, but not too strong. And actually, it turns out that you will not get this error on x86 CPUs and some other architectures, at least not in a similar scenario. That was why everything worked on most of our machines except for Solaris 10 and 11 and HP-UX machines which use SPARC64 and IA64 CPUs, respectively.
Alignment error example
It is simple to write code that triggers this behavior:
#include <stdio.h>
int main() {
int nums[2] = {1, 2};
unsigned char *bytes = (unsigned char *) nums;
bytes++;
bytes++;
int wrong = *((int *)bytes);
printf ("wrong number: %d\n", wrong);
return 0;
}
When compiled and run on an x86_64 system, it produces this output:
wrong number: 131072
but on our Solaris 10 SPARC64 machine, it’s not so lucky:
Bus Error (core dumped)
The problem is that when trying to load
an integer from an address
that is 2 bytes after where the nums
array starts, there is an
alignment problem. The nums
array is properly aligned (otherwise it
wouldn’t be accessible at all!) and thus its address incremented by 2
is not divisible by 4 (sizeof(int)
). On x86_64, this is not a
problem, the address is in the address space of the process and so the 4
bytes from the middle of the nums
array are interpreted as an int
(why 131072 is left as an exercise to the reader).
But wait! CFEngine code is (hopefully) not doing anything crazy and
weird like that, right?! It’s not. However, that’s where mmap()
comes
into the game.
Alignment and mmap()
Let’s have a look at another, this time a bit longer, example:
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
struct Data {
int num;
char code[2];
};
int main () {
char chars[2] = {'a', 'b'};
struct Data data = { 1, { 'c', 'd' }};
printf("chars (%p): %c, %c\n", chars, chars[0], chars[1]);
printf("data (%p): {%d, {%c, %c}\n", &data, data.num, data.code[0], data.code[1]);
int fd = open("/tmp/file.dat", O_CREAT|O_TRUNC|O_WRONLY, S_IRUSR|S_IWUSR);
ssize_t written = write(fd, chars, sizeof(chars));
if (written != sizeof(chars)) {
printf("Failed to write 'chars': %m\n");
return 1;
}
written = write(fd, &data, sizeof(data));
if (written != sizeof(data)) {
printf("Failed to write 'chars: %m'\n");
return 1;
}
close(fd);
fd = open("/tmp/file.dat", O_RDONLY);
const size_t total_size = sizeof(chars) + sizeof(data);
unsigned char buf[total_size];
ssize_t n_read = read(fd, buf, total_size);
if (n_read != total_size) {
printf("Failed to read data back: %m\n");
return 1;
}
close(fd);
printf("File contents:");
for (size_t i = 0; i < total_size; i++)
printf(" %x", buf[i]);
printf("\n");
fd = open("/tmp/file.dat", O_RDONLY);
struct stat sb = {0};
fstat(fd, &sb);
unsigned char *addr = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
char *chars_in_file = addr;
struct Data *data_in_file = (struct Data *) (addr + sizeof(chars));
printf("chars_in_file (%p): %c, %c\n", chars_in_file, chars_in_file[0], chars_in_file[1]);
printf("data_in_file (%p): {%d, {%c, %c}\n", data_in_file, data_in_file->num, data_in_file->code[0], data_in_file->code[1]);
return 0;
}
The output on my x86_64 system is as one would expect:
chars (0x7ffd0f83a0de): a, b
data (0x7ffd0f83a0d4): {1, {c, d}
File contents: 61 62 1 0 0 0 63 64 0 0
chars_in_file (0x7f0665784000): a, b
data_in_file (0x7f0665784002): {1, {c, d}
and we can see multiple interesting things. chars
and data
are
stored on stack which, of course, grows down (in the address space). So
the address of data
is lower than the address of chars
. We can also
check that the address of chars
(0x7ffd0f83a0de) is divisible by 2,
but not by 4. The address of data
(0x7ffd0f83a0d4), however, is
divisible by 4. Which is good because it is a struct Data
variable
which starts with an int
that is 4 bytes big. The difference between
the addresses is 10 so there is padding between the two variables on the
stack (data
is only 6 bytes big).
If we write both chars
and data
into a file one right after the
other and then read the bytes back, we can see that there are the two
char
bytes for 'a'
(61) and 'b'
(62) followed by 4 bytes of the
little-endian representation of int
1 followed by char
bytes for
'b'
and 'c'
followed by two zero bytes. Where do these two zero
bytes come from? They are padding, of course. The struct Data
structure is 6 bytes big (4 bytes for the int num
+ 2 bytes for char code[2]
) so it is padded with 2 zero bytes to be 8 bytes, which is a
round (binary) number.
When the file is mapped to the address space of the process using
mmap()
, pointers to the particular offsets (0 and 2) of the address
segment can be created to get access to the copies of chars
and data
in the file. This time, however, the copy of data
(at address
0x7f0665784002) follows immediately after the copy of chars
(at
address 0x7f0665784000).
Any guesses what happens with the above code on SPARC64 Solaris 10 machine?
chars (ffbffbc0): a, b
data (ffbffbb8): {1, {c, d}
File contents: 61 62 0 0 0 1 63 64 0 0
chars_in_file (ff1e0000): a, b
Bus Error (core dumped)
(Also note that SPARC64 is big-endian, the 4B int
1 is stored as
0 0 0 1
.)
And the output from truss sheds some more light:
Received signal #10, SIGBUS [default]
siginfo: SIGBUS BUS_ADRALN addr=0xFF1E0002
Trying to access the copy of data
in the file results in the process
being killed with SIGBUS due to an invalid alignment. Of course, the
address 0xFF1E0002
is not divisible by 4 and so trying to load
the
int
in the struct attempts to do a non-aligned access.
Getting from under the (SIG)BUS
And the above is what was happening in our tests with the changes that
where fixing the WaitForCriticalSection()
function. Or, to be more
precise, in the OverwriteDB()
function, where the data from the
mdb_get()
function (getting a value from LMDB) was directly passed to
the function making the decision about whether to overwrite the data
with the new value or not. So while making the decision the function did
an unaligned access to the data passed to it which was actually a
pointer to a memory segment where the LMDB DB file was mapped before
with mmap()
. And of course, this didn’t happen always, because the
position of the data depended on the contents of the DB file.
The fix was, as usually, almost trivial compared to the complexity of
the investigation and debugging. All that was needed was to create a
local copy of the data from the mapped segment using the memcpy()
function which makes sure the copied data is accessed using aligned
addresses. And the destination was a local variable on the stack, which
is always properly aligned (by the compiler) for direct access.