Editing and copying large files or large numbers of files is slow. For a configuration management tool, it is probably one of the slowest things we do, apart from waiting for other programs to finish or waiting for network communication. In this blog post, we look at how to copy files, focusing on the most performant approaches available on modern Linux systems. We are working on implementing these techniques so that CFEngine, and all of your policy, will copy files more efficiently.
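As a taste of what "most performant" can mean here, the sketch below uses copy_file_range(2), one of the newer Linux system calls that lets the kernel copy data between two file descriptors without bouncing it through a user-space buffer (and, on supporting file systems, without copying the data at all). This is an illustrative example with minimal error handling, not the code used in CFEngine.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Copy src to dst using copy_file_range(2) (Linux >= 4.5, glibc >= 2.27). */
int copy_file(const char *src, const char *dst)
{
    int in = open(src, O_RDONLY);
    if (in < 0)
    {
        perror("open src");
        return -1;
    }

    struct stat st;
    if (fstat(in, &st) < 0)
    {
        perror("fstat");
        close(in);
        return -1;
    }

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, st.st_mode & 07777);
    if (out < 0)
    {
        perror("open dst");
        close(in);
        return -1;
    }

    off_t remaining = st.st_size;
    while (remaining > 0)
    {
        /* The kernel copies the data directly; no user-space buffer needed. */
        ssize_t copied = copy_file_range(in, NULL, out, NULL, remaining, 0);
        if (copied < 0)
        {
            perror("copy_file_range");
            close(in);
            close(out);
            return -1;
        }
        if (copied == 0)   /* unexpected end of input (file shrank) */
        {
            break;
        }
        remaining -= copied;
    }

    close(in);
    close(out);
    return 0;
}
```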
The license of our in-house C utility and compatibility library, libntech, was recently changed from GPLv3 to the Apache License, Version 2.0, which makes the library suitable for more projects thanks to the more permissive license. While GPLv3 practically required any project using libntech to be licensed under GPLv3 as well, the Apache License v2.0 allows both open source and proprietary software to use our utility library, as long as the copyright attributions are kept.
Opening and reading files may cause your program to block indefinitely. In this blog post we discuss how to work around this issue.
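One common workaround, sketched below (and not necessarily the exact approach from the post), is to open the file with O_NONBLOCK so that the open() itself cannot hang, for example on a FIFO with no writer, and then verify that it really is a regular file before reading from it.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Open path for reading without risking an indefinite block. */
int open_regular_file(const char *path)
{
    /* O_NONBLOCK prevents open() from hanging on e.g. a FIFO with no writer. */
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
    {
        return -1;
    }

    struct stat st;
    if (fstat(fd, &st) < 0 || !S_ISREG(st.st_mode))
    {
        /* Not a regular file (could be a FIFO or device that might block). */
        close(fd);
        errno = EINVAL;
        return -1;
    }

    /* Regular files do not block on read(2), so restore blocking mode. */
    int flags = fcntl(fd, F_GETFL);
    if (flags >= 0)
    {
        fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);
    }
    return fd;
}
```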
This is the second blog post in a short series about processes on UNIX-like systems. It is a follow-up to the previous post, which focused on basic definitions, the creation of processes, and the relations between them. This time we analyze the semantics of two closely related system calls that play major roles in process creation and program execution.
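As a quick refresher before diving in, here is a minimal, hypothetical sketch of the pattern these posts revolve around: fork() clones the calling process, and execve() replaces the child's program image with a new program. Nothing here is CFEngine code; /bin/ls is just an example program.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid < 0)
    {
        perror("fork");
        return 1;
    }
    if (pid == 0)
    {
        /* Child: replace this process's program image with a new program. */
        char *argv[] = { "ls", "-l", NULL };
        char *envp[] = { NULL };
        execve("/bin/ls", argv, envp);
        perror("execve");   /* only reached if execve() fails */
        _exit(127);
    }

    /* Parent: wait for the child to finish. */
    int status;
    waitpid(pid, &status, 0);
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}
```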
fork() and exec()

UNIX-based operating systems provide the fork() system call to create a clone of an existing process and the execve() system call to start executing a program in a process. Windows, on the other hand, provides the CreateProcess() function, which starts a given program in a newly created process. Why do UNIX-based systems do things in a more complicated way? There are many reasons for that, some simply historical, as described in The Evolution of the Unix Time-sharing System:
While working on the integration of CFEngine Build into Mission Portal, we came to the point where we needed to start executing separate tools from our recently added daemon, cf-reactor. Although it may seem like nothing special, knowing a bit about the specifics of process creation and program execution (and having had to fight some really hard-to-solve bugs in the past), we spent a lot of time and effort on this step. Now we want to share the story and the results of that effort. However, understanding the reasons behind the work, as well as how the implementation works, requires fairly deep knowledge of how processes are created and programs are started on UNIX-like systems, so we start with a series of blog posts focused on this seemingly simple area. They cover the basics as well as some advanced topics in two parts:
Databases are great for data processing and storage. However, in many cases it is better or easier to work with data in files on a file system; some tools cannot even access the data any other way. When a database (DB) is created in a database management system (DBMS) that uses a file system as its data storage, it of course uses files on the given file system to store the data. But working with those files outside of the DBMS, even for read-only access to the data stored in the DB, is practically impossible. So what can be done if a setup requires data in files while, at the same time, the data processing and storage requires the use of a DB(MS)? The answer is synchronization between the two storage places, the DB and the files. It can either go from the DB to the files, with the files then treated as read-only by the parties working with the data, or from the files to the DB, with modifications of the files being synchronized to the DB. In the former setup, the DB is the single source of truth: the data in the files may be out of sync, but the DB has the up-to-date version. In the latter setup, the DB provides a backup or alternative read-only access to the data primarily stored in the files, or the files provide an alternative write-only access to the DB. A two-way synchronization, and thus a combination of read and write access in both places, should be avoided, because it is very hard (one could even say impossible) to properly implement mechanisms ensuring data consistency, both between the two storages and even within each of them alone.
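To make the former setup a bit more concrete, here is a hypothetical sketch of a one-way DB-to-files sync using libpq. The table and column names (documents, name, content) are made up for illustration, and a real implementation would need to handle removals, encoding, and errors more carefully.

```c
#include <stdio.h>
#include <libpq-fe.h>

/* Dump every row of the (made-up) "documents" table into dir, one file per row. */
int sync_db_to_files(PGconn *conn, const char *dir)
{
    PGresult *res = PQexec(conn, "SELECT name, content FROM documents");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        return -1;
    }

    for (int i = 0; i < PQntuples(res); i++)
    {
        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dir, PQgetvalue(res, i, 0));

        FILE *f = fopen(path, "w");
        if (f == NULL)
        {
            continue;   /* skip files we cannot write; a real sync would log this */
        }
        fputs(PQgetvalue(res, i, 1), f);
        fclose(f);
    }

    PQclear(res);
    return 0;
}
```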
In this blog post we show how to run an arbitrary program or script, or execute arbitrary code, in reaction to changes and other events in a PostgreSQL database.
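One way to achieve this in PostgreSQL is the LISTEN/NOTIFY mechanism: the database sends a notification on a channel (for example from a trigger), and a small listener process reacts to it. Below is a hypothetical sketch of the listening side using libpq; the channel name, payload handling, and the script being run are all made-up examples.

```c
/* Compile with -lpq. A real daemon would multiplex on the connection's
 * socket with select()/poll() instead of sleeping in a loop. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=mydb");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* Subscribe to the (made-up) notification channel. */
    PGresult *res = PQexec(conn, "LISTEN file_changes");
    PQclear(res);

    for (;;)
    {
        if (!PQconsumeInput(conn))
        {
            fprintf(stderr, "lost connection: %s", PQerrorMessage(conn));
            break;
        }

        PGnotify *notify;
        while ((notify = PQnotifies(conn)) != NULL)
        {
            printf("notification on '%s', payload '%s'\n",
                   notify->relname, notify->extra);
            system("/usr/local/bin/handle-change.sh");   /* made-up script */
            PQfreemem(notify);
        }
        sleep(1);   /* crude polling to keep the sketch short */
    }

    PQfinish(conn);
    return 0;
}
```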
Triggers

Database management systems (DBMS) provide mechanisms for defining reactions to certain actions or, in other words, for defining that specific actions should trigger specific reactions. PostgreSQL, the DBMS used by CFEngine Enterprise, is no exception. Triggers can be used to ensure consistency between tables when changes in one table should be reflected in another, to record information about actions, and for many other things. PostgreSQL's Overview of Trigger Behavior describes the basics of triggers with the following sentences:
Software quality has been a topic and an area of interest since the dawn of software itself. And as software evolved, so did the techniques and approaches to assuring its high quality. Better computers providing more computing power, bigger storage, and faster communication have allowed software developers to detect issues in their code sooner and faster. And so we went from getting a syntax error after two days of waiting for the box of punch cards to go through the queue of boxes and get loaded into a computer running a compiler, to getting such errors from a compiler in seconds, or even in real time from the code editor. And we went from bugs being detected by actually seeing real bugs on punch cards with machine instructions, to operating systems providing bug reports with coredumps, tracebacks, and lots of information helping developers identify the problem, to tests detecting problems before the code gets into production, and to compilers and tooling detecting them before the code is even executed for the first time. We can afford to do things like fuzz testing, and we have enough computational power for compilers and special tools to analyze the code, check all possible paths through it, and much more. At the same time, software has become a part of almost everything we use or interact with every day, and with the incomparably greater amount of software potentially affecting our lives there is an incomparably greater number of bugs that need to be detected and fixed, or at least handled gracefully, with some software being more critical than other software and with bugs ranging from minor annoyances to losses of human lives. Many things have changed in this evolution, but one rule has always been key:
I recently had a minor task involving changing an option, on one of our command line tools, from taking a required argument to taking an optional argument. This should be easy, they said; just change the respective option struct to take an optional argument, add a colon to the optstring, and get on with your life.
Well, it proved to be easier said than done. My initial expectation was that a solution similar to the one below would just work. And it does work, just not in the way I expected.
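For reference, here is a minimal sketch of the kind of change described, assuming a getopt_long()-based parser and a made-up --foo option; note the double colon in the optstring and optional_argument in the long-option table.

```c
#include <stdio.h>
#include <getopt.h>

int main(int argc, char *argv[])
{
    static struct option long_options[] = {
        /* optional_argument makes the value to --foo optional */
        { "foo", optional_argument, NULL, 'f' },
        { NULL, 0, NULL, 0 },
    };

    int c;
    /* "f::" — the double colon marks -f's argument as optional */
    while ((c = getopt_long(argc, argv, "f::", long_options, NULL)) != -1)
    {
        switch (c)
        {
        case 'f':
            /* With GNU getopt the value must be given as --foo=value or
             * -fvalue; a space-separated "--foo value" leaves optarg NULL,
             * which is the surprise hinted at above. */
            printf("--foo given, argument: %s\n",
                   (optarg != NULL) ? optarg : "(none)");
            break;
        default:
            return 1;
        }
    }
    return 0;
}
```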
Introduction

In the CFEngine Core team, we have recently been working on a fix for our WaitForCriticalSection() function. In short, the function checks a timestamp in a chunk of (lock) data stored in a local LMDB database and, if the timestamp is too old, it writes a new chunk of (lock) data with a new timestamp. However, this used to be done in separate steps: read the data from the DB and close the DB, check the data, and potentially write the new data into the DB. So there was a race condition, because if multiple processes performed the same steps at the same time, they could have read and checked the same timestamp value and then written their own data, with their new timestamps, one after another. From a high-level perspective, that meant multiple processes could have entered the critical section at the same time.
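A natural way to close such a race is to perform the read, the check, and the write inside a single LMDB write transaction, so that no other process can interleave between the steps. The sketch below illustrates that idea; it is a simplified, hypothetical example rather than the actual CFEngine code, and the key name and data layout are made up.

```c
#include <lmdb.h>
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Try to take the critical section: if the stored timestamp is older than
 * max_age (or missing), write our own timestamp and return true. */
bool try_enter_critical_section(MDB_env *env, const char *lock_key, time_t max_age)
{
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key = { strlen(lock_key), (void *) lock_key };
    MDB_val val;
    time_t now = time(NULL);

    /* One write transaction: LMDB allows a single writer at a time, so the
     * check and the update below cannot be interleaved by another process. */
    if (mdb_txn_begin(env, NULL, 0, &txn) != MDB_SUCCESS)
    {
        return false;
    }
    if (mdb_dbi_open(txn, NULL, 0, &dbi) != MDB_SUCCESS)
    {
        mdb_txn_abort(txn);
        return false;
    }

    int rc = mdb_get(txn, dbi, &key, &val);
    if (rc == MDB_SUCCESS)
    {
        time_t stored;
        memcpy(&stored, val.mv_data, sizeof(stored));
        if (now - stored < max_age)
        {
            /* The lock is still fresh; someone else holds the critical section. */
            mdb_txn_abort(txn);
            return false;
        }
    }
    else if (rc != MDB_NOTFOUND)
    {
        mdb_txn_abort(txn);
        return false;
    }

    /* Stale or missing timestamp: write our own within the same transaction. */
    val.mv_size = sizeof(now);
    val.mv_data = &now;
    if (mdb_put(txn, dbi, &key, &val, 0) != MDB_SUCCESS)
    {
        mdb_txn_abort(txn);
        return false;
    }
    return mdb_txn_commit(txn) == MDB_SUCCESS;
}
```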