Ten Reasons for 5-minute configuration update and repair

Posted by Mark Burgess
March 19, 2012

How often should your configuration management system verify the integrity of your system? The default choices we’ve made in CFEngine are the result of almost 20 years of research into this area. Below you will find ten issues and references that explain how these choices are underpinned by science. These ten things really all amount to the same thing: if you are playing ping-pong against the adversary of change, you need to be as quick on your feet as your opponent – and preferably faster.

  1. Lost time is money:

    Downtime of services, security-related incidents and application misconfigurations that lead to incorrect behaviour can all mean lost revenue. A fast response to misconfiguration, or better still avoidance in the first place, can be accounted for in real money. With a 5 minute check on system integrity, coupled with the ability to make repairs to thousands of machines in parallel, a company could save hundreds of thousands of dollars in lost revenue.

    This one isn’t hard to understand, and doesn’t need much science to argue its case – but, at CFEngine, that doesn’t stop us from making sure! It is no surprise that the human factor turns out to be the major cause of downtime. (See, for instance, A Simple Way to Estimate the Cost of Downtime.)
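
    To make the exposure arithmetic concrete, here is a back-of-the-envelope sketch; the cost figure and incident count are invented for illustration, not CFEngine data:

    ```python
    # Back-of-the-envelope exposure estimate. All figures below are
    # illustrative assumptions, not measurements.

    cost_per_minute = 1000.0     # assumed revenue lost per minute of misbehaviour ($)
    incidents_per_year = 50      # assumed misconfigurations that would bite if left alone

    def expected_annual_loss(check_interval_minutes):
        # On average a fault sits undetected for half a check interval
        # before the agent notices it and makes the repair.
        mean_exposure_minutes = check_interval_minutes / 2.0
        return incidents_per_year * mean_exposure_minutes * cost_per_minute

    for interval in (5, 60, 24 * 60):
        print(f"check every {interval:4d} min -> ~${expected_annual_loss(interval):>12,.0f} at risk per year")
    ```

    With these made-up numbers, the difference between a 5 minute loop and a daily one is more than two orders of magnitude of exposure.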

  2. Minimize the risk of catastrophic loss before a change window:

    Many industries try to limit the scope for human error by having change-windows during which all alterations and repairs must be made. The obsession with tracking who to blame for deviations from policy (liability) often overshadows the need to avoid deviations in the first place.

    If you are worried about the risk that system maintenance poses to service performance, take heart. Studies show that you can optimize (i.e. minimize) the risk of damage to a system, even with a small change window, if you arrange for repairs to start just after the peak of fault activity. Better still is to keep the deficit of unrepaired faults low in the first place with regular policy-approved maintenance. CFEngine 3 can easily do this because it has very little impact on systems. See A Risk Analysis of Disk Backup or Repository Maintenance.

  3. A Red Queen Race - running to stand still:

    In biology and in IT management alike, there is the concept of running to stand still. Lewis Carroll parodied this with the help of the Red Queen in Through the Looking-Glass.

    “Well, in our country,” said Alice, still panting a little, “you’d generally get to somewhere else if you run very fast for a long time, as we’ve been doing.”

    “A slow sort of country!” said the Queen. “Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!”

    The point is that, in evolving dynamical environments, change and counter-change must be matched, and must co-evolve to maintain a dynamical stalemate. If a configuration system is too slow in matching the moves of a fault-ridden environment, it will fail to maintain a system.

    Today we continue to focus only on build and deployment; thanks to the ballooning scale of IT systems everywhere, it has become normal to pretend that system configurations remain constant after initial installation. In truth, a system is constantly degrading, like any paint job exposed to the weather. To maintain a dynamical balance (equilibrium) with the environment of users, we have to work fast enough to counter the ‘weather’ (maintenance and audit).

    If we then want to shift the balance for strategic gain, we need to make sure we can buy ourselves the time to do that, without being dragged back by firefighting (see the maintenance theorem in On the Theory of System Administration, or the more detailed discussion in Analytical Network and System Administration).

  4. Be proactive to extend the Mean Time Before Failure (MTBF):

    A few years ago it was popular to speak about ‘five-nines reliability’ for all kinds of systems, meaning that the Mean Time Before Failure was very long! Running preventative maintenance allows you to extend the MTBF almost indefinitely.

    Most problems can be avoided if we are quick. Wait too long and you allow a misconfiguration to occur; then an attempted roll-back might not even help you. Once the horse has bolted, it doesn’t help to lock the barn door. Proactive maintenance is the answer to living in a dynamical social network of users and machines. Like washing your car to prevent it from rusting: negligence is corrosive to the system. (See A Probabilistic Approach to Estimating Computer System Reliability)
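
    A toy model makes the point; the rates and grace period below are assumptions chosen for illustration. Precursor faults that are caught before they escalate never become failures at all, so the effective MTBF stretches as the repair loop tightens:

    ```python
    # Toy MTBF model. All numbers are assumptions, not measurements.
    # Precursor faults appear at some rate and escalate into real failures
    # only if they sit unrepaired for longer than a grace period.

    precursor_rate_per_hour = 0.2    # assumed arrival rate of precursor faults
    grace_period_minutes = 20.0      # assumed time before a precursor escalates

    def effective_mtbf_hours(check_interval_minutes):
        # A precursor appears at a uniformly random point in the check cycle,
        # so the chance that it waits longer than the grace period is:
        p_escalate = max(0.0, 1.0 - grace_period_minutes / check_interval_minutes)
        failure_rate = precursor_rate_per_hour * p_escalate
        return float("inf") if failure_rate == 0 else 1.0 / failure_rate

    for interval in (5, 30, 60, 240):
        print(f"check every {interval:3d} min -> effective MTBF ~ {effective_mtbf_hours(interval):.1f} h")
    ```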

  5. Simplify scheduling by eliminating cron:

    With 5 minute execution resolution, you can do away with configuring clunky cron altogether, for practically all applications.

    Instead of trying to configure as many schedulers as you have machines (something that requires changing every machine’s crontab individually), just use CFEngine 3’s universal scheduler to offer a single point of schedule control through policy, giving much greater precision in selecting jobs for the right machines at the right times, and adding features like load balancing to boot (see Scheduling and Event Management). A sketch of the idea follows below.
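
    This is only a toy model of the idea in Python, not CFEngine policy syntax: one policy document describes which classes of machine run which jobs at which times, and every host evaluates that same document against its own context instead of carrying its own crontab. The host classes and job names are invented.

    ```python
    from datetime import datetime

    # One shared policy instead of thousands of crontabs. Classes, hours
    # and job names here are invented for illustration.
    POLICY = [
        # (host class, hour to run, job); hour None means "every pass"
        ("webserver", 2,    "rotate_logs"),
        ("database",  3,    "vacuum_tables"),
        ("any",       None, "check_disk_space"),
    ]

    def jobs_for(host_classes, now):
        """Yield the jobs this host should run on this 5-minute pass."""
        for host_class, hour, job in POLICY:
            if host_class == "any" or host_class in host_classes:
                if hour is None or now.hour == hour:
                    yield job

    # Every host runs the same loop every 5 minutes; only its classes differ.
    print(list(jobs_for({"webserver"}, datetime(2012, 3, 19, 2, 10))))
    # -> ['rotate_logs', 'check_disk_space']
    ```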

  6. Tracking your true system behaviour:

    Most configuration tools assume that computer systems are actually static: i.e. once deployed, a machine remains constant until an error is detected. Naturally this is wrong. Computers change all the time, and quickly, thanks to logins and service usage.

    Suppose you want to audit change – to track what is going on, you can sample the state Q of the system every dT seconds. If you want to know the rate (speed) of system change, then you need to sample twice as fast (two points in each interval), i.e. dQ/dT ≈ (Q2 - Q1)/dT. Similarly, if you want to know whether the trend is increasing or decreasing, you need to sample twice as fast again: d2Q/dT2 ≈ (dQ2 - dQ1)/dT2, so you need four times the sampling, or four times the rate of configuration verification.

    Nyquist’s theorem says basically the same thing: to capture a variation properly, we have to sample the system at twice the frequency of the fastest Fourier component. This is why CFEngine 3’s monitoring daemon samples every 2.5 minutes to capture 5 minute changes. That still only allows us to estimate trends over 15 minute intervals. Imagine checking configuration at a slower rate! If you only sampled every hour, say, then you would have to wait 4 hours to see any pattern, and you would miss anything that happened in between (see the Nyquist-Shannon sampling theorem).
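
    A minimal sketch of the sampling arithmetic, with made-up sample values for Q:

    ```python
    # Q is some scalar system metric (e.g. a process count) sampled every
    # dT seconds. Differences of samples estimate the rate of change, and
    # differences of those differences estimate the trend.

    dT = 150.0                  # 2.5 minutes, in seconds
    Q = [412, 430, 439, 425]    # four consecutive samples (made-up values)

    rates = [(Q[i + 1] - Q[i]) / dT for i in range(len(Q) - 1)]               # dQ/dT
    trends = [(rates[i + 1] - rates[i]) / dT for i in range(len(rates) - 1)]  # d2Q/dT2

    print("rate estimates  (need 2 samples each):", rates)
    print("trend estimates (need 3 samples each):", trends)
    ```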

  7. Competition for system resources is a zero-sum game:

    One of the reasons for an inability to function correctly is that multiple processes compete for resources, like trying to find a parking space on a Saturday morning at the mall. If any process goes out of control, consuming system resources (too much memory, CPU, or disk), then regardless of the intent, your system’s integrity is basically under attack from the users. A resource consumed is a resource unavailable.

    The effects of this competition can be analyzed with game theory. Having a timely ‘tit for tat’ response is the best way to keep your share, just like in a duel or a gunfight. Garbage collection is an obvious example – something we take very much for granted until the garbage collectors go on strike.

    You need to be quick on the draw against a sudden attack, and you need to bail out buckets of water quickly on a sinking ship. Both of these are reasons to verify system integrity quickly and often (see On the Theory of System Administration, or the summary in Analytical Network and System Administration).
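
    As an illustration of the kind of check that pays to run often, here is a minimal garbage-collection sketch; the path, retention period and free-space threshold are assumptions, not a CFEngine implementation:

    ```python
    import os
    import shutil
    import time

    SCRATCH = "/tmp/scratch"     # assumed scratch area shared by many processes
    MAX_AGE_DAYS = 2             # assumed retention policy
    MIN_FREE_FRACTION = 0.10     # repair when less than 10% of the disk is free

    def collect_garbage(now=None):
        """Reclaim space before a filling disk starves everyone else's share."""
        now = now or time.time()
        usage = shutil.disk_usage(SCRATCH)
        if usage.free / usage.total >= MIN_FREE_FRACTION:
            return  # promise already kept, nothing to repair
        cutoff = now - MAX_AGE_DAYS * 86400
        for root, _dirs, files in os.walk(SCRATCH):
            for name in files:
                path = os.path.join(root, name)
                if os.path.getmtime(path) < cutoff:
                    os.remove(path)  # reclaim the resource promptly

    if __name__ == "__main__":
        collect_garbage()
    ```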

  8. Systems are getting faster and therefore go out of control faster:

    Unintended change occurs as a result of the processes that drive the system. Computers are getting faster every year, and thus errors can spiral out of control more explosively. Even if root causes only happen at the same rate, the avalanche of side-effects will expand at the fastest rate of the system.

    The fastest changes are the ones that we struggle to capture:

    • Servers: network traffic
    • Desktops: user behaviour and peripheral speeds
    • Mobiles: frequency of user activation
    • Compute nodes: disk, CPU and memory rates

    Network connections are throttled by TCP time waits of typically 5 minutes, so a 5 minute interval matches well – though web servers are often optimized for even faster turnaround cycles. Other system auto-correlation studies show significant resource changes at time scales of 20 minutes and less. Given the rates of change that researchers and engineers can measure, a response no slower than 5 minutes is a conservative choice for capturing faults in a mission-critical environment (see Measuring System Normality).

  9. Accurate because you can be (with CFEngine 3)!

    CFEngine’s non-imperative, self-organizing execution achieves higher throughput than an imperative script for certain resource-intensive operations. That’s because, by dropping a strict serial order, it has fewer constraints on its scheduling. Thus CFEngine can handle faster repair rates than serial tools.

    Intensive resource accesses, like scanning large filesystems and reading the process table (perhaps multiple times), are inherently serial in nature, but need only be executed once, in batch, for all similar configuration promises.

    It’s rather like walking only once around the supermarket to do your shopping – you can pick up what you need in any order, as you see it, instead of following your shopping list imperatives slavishly and having to walk back and forth across the store multiple times (see Shortest Path Problem).
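
    A rough sketch of the batching idea in Python (the process names and desired states are invented, and this is not CFEngine’s internal mechanism): take one snapshot of the process table and evaluate every process-related promise against it, in whatever order is convenient, instead of re-scanning once per check.

    ```python
    import subprocess

    def process_snapshot():
        # One expensive scan of the process table, reused by all checks below.
        out = subprocess.run(["ps", "-eo", "comm"], capture_output=True, text=True)
        return set(out.stdout.split())

    # Invented example promises about processes.
    PROCESS_PROMISES = {
        "sshd": "running",   # must be running
        "cupsd": "absent",   # should not be running on these hosts
    }

    snapshot = process_snapshot()
    for name, desired in PROCESS_PROMISES.items():
        running = name in snapshot
        kept = running if desired == "running" else not running
        print(f"{name}: promise {'kept' if kept else 'not kept -- repair needed'}")
    ```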

  10. Because, with CFEngine 3, 5 minute updates are possible for as many as 5000 hosts per server:

    CFEngine technology consumes very little CPU or memory, and scales with little hardware – i.e. scaling CFEngine is cheap. Even with a centralized architecture and a single hub/policy server, you can have 5 minute updating with a single server for up to 5000 machines. You can scale any technology you like by brute force (even fixing by hand!) – but you can’t do it as consistently, reliably or cheaply (in Total Cost of Ownership) as you can with lightweight automation (see Scale and Scalability).
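
    One way to see why a single server can cope with such a fan-out is the splay-time idea: each host derives a stable offset within the 5-minute window from a hash of its own name, so contacts are spread evenly rather than arriving all at once. The sketch below illustrates the principle only (hostnames are invented, and this is not CFEngine’s implementation):

    ```python
    import hashlib

    INTERVAL_SECONDS = 300   # the 5-minute cycle

    def splay_offset(hostname):
        """Deterministic per-host offset into the interval, from a hash of its name."""
        digest = hashlib.sha256(hostname.encode()).hexdigest()
        return int(digest, 16) % INTERVAL_SECONDS

    for host in ("web001.example.com", "web002.example.com", "db001.example.com"):
        print(f"{host} contacts the policy server {splay_offset(host):3d}s into each cycle")

    # With ~5000 hosts spread over 300 seconds, the server sees on the order
    # of 17 connections per second on average -- modest work for one machine.
    ```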

It’s simple engineering

The maintenance theorem is rather simple: it says that if something is attacking you quickly, you need to defend yourself at the same rate to stand a chance.

At CFEngine, we began with daily updates in 1993, then hourly updates in 2004, then we went to 15 minute updates around 2000, and we have been resting at 5 minute updates since around 2005. We think that 5 minute resolution is a compromise sufficient for most users at today’s system rates, but we also know of a few who schedule checks every minute. Of course, not every maintenance task needs to be carried out every 5 minutes. Performing a full system backup would not even be possible at this rate (without mirroring technologies), but simple tasks like garbage collection and process integrity checks need to run often to avoid catastrophic loss.

All of the points above are different ways of saying the same thing: “Speed Is The Product” (see What makes clouds float and developers operative? Agility!). Try it for yourself. Does your configuration system run fast enough?