Continuous Configuration Management: the value of 6 billion checks a day

Posted by Thomas Ryd
November 15, 2012

Software developers know that the quality of software projects tends to deteriorate over time unless strong measures are taken to prevent this. Software entropy accelerates once “software rot” has been allowed to set in, so the trick is to keep the software as clean as possible at all times. This is referred to as “The Broken Windows Theory” because the pattern is similar to what police departments learned about maintaining order in inner cities: fix the small things all the time, and so keep out the big problems.

If this pattern applies to law enforcement and software development, does it also apply to Configuration Management? At CFEngine we believe it does, and we design our software to support continuous verification of the state of IT infrastructure and immediate repair if systems drift from their desired state.
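The idea of continuous verification and immediate repair can be illustrated with a minimal Python sketch. This is not CFEngine's implementation or policy language; the `converge` function and the in-memory dictionaries standing in for a server's configuration are assumptions made purely for illustration. The two SSH option names are ordinary `sshd_config` settings used as sample data.

```python
def converge(desired, actual):
    """Check every desired setting and repair any that has drifted.

    `desired` and `actual` are plain dicts (hypothetical stand-ins for a
    policy and a server's real configuration). Returns the repaired keys.
    """
    repaired = []
    for key, want in desired.items():
        if actual.get(key) != want:   # one "check"
            actual[key] = want        # one "repair"
            repaired.append(key)
    return repaired


# Desired state, and a server where someone made an unplanned change.
desired = {"PermitRootLogin": "no", "PasswordAuthentication": "no"}
actual = {"PermitRootLogin": "yes", "PasswordAuthentication": "no"}

first_run = converge(desired, actual)   # repairs the drifted setting
second_run = converge(desired, actual)  # nothing left to repair
print(first_run, second_run)
```

Running the same checks again finds nothing to fix, which is why repeating them every few minutes is cheap: most passes only verify, and repairs stay small.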

To give an idea of the scale of this approach, let’s assume that we want to verify the settings of SSH across Linux servers. There are some 65 configuration settings that need to be decided and maintained. If they are deployed to 1,000 servers, then there are 65,000 checks to be made, just to verify the state of the infrastructure once. By default, CFEngine checks once every 5 minutes, or 288 times per day. In this example that means we do close to 19 million SSH checks per day (65,000 × 288 = 18,720,000).
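The arithmetic behind this example can be reproduced in a few lines of Python. The function name `checks_per_day` is a hypothetical helper introduced here; the inputs (65 settings, 1,000 servers, a 5-minute interval) come from the example above.

```python
def checks_per_day(settings, servers, interval_minutes=5):
    """Number of individual checks per day for a given check interval."""
    runs_per_day = (24 * 60) // interval_minutes  # 288 runs at 5 minutes
    return settings * servers * runs_per_day


checks_per_pass = 65 * 1_000          # one full verification pass
daily_checks = checks_per_day(65, 1_000)

print(checks_per_pass)  # 65000
print(daily_checks)     # 18720000
```

The total scales linearly in each factor, so doubling the number of servers or halving the interval doubles the daily check count.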

There are many reasons why making checks this frequently makes sense (see Mark Burgess’ “Ten Reasons for 5-minute configuration update and repair”). I received a very compelling argument from one of our customers, who reports that they run over 6 billion configuration checks per day, catching and automatically repairing more than 100,000 unplanned changes during a workday! Without these repairs an average of 5% of their servers would drift out of compliance every day.

At this level of granularity the repairs are easy to make automatically, and their cost is extremely low. When left to accumulate, these repairs are harder to make, root cause analysis becomes harder, and the impact of the drift becomes large: 70-90% of all IT outages are due to manual and unauthorized changes. If people are required to make these repairs, it is not only costly but also very error-prone.

Our customer reports very low rates of outages and the ability to apply changes and introduce new capabilities at high frequency. They are confident about their infrastructure as a business enabler.

Most agile IT organizations today have successfully implemented the concepts of continuous integration and continuous delivery to control quality and reduce the time to market. They have automated the build, test and delivery of new software. To close the loop, it is time for the operations side of the IT organization to get into the mindset of automation and “continuousness”. Bleeding-edge IT organizations like the one mentioned above have already proven the model. As we go through the technology adoption cycle, we should expect a rapid increase in the uptake of continuous configuration management through automation.