Mission Critical IT - can you afford it?

Posted by Mark Burgess
August 22, 2011

CFEngine is an inexpensive life-support system for complex and mission critical IT infrastructure…

Every year the world spends billions on risk avoidance – safety and backup equipment, security systems and even insurance policies against loss and liability. For many of us, the risk of serious loss is quite small (though still sufficient to keep insurance companies in business) but in some industries the consequences of loss are so serious that even a small risk is unacceptable.

Mission critical IT systems, i.e. those for which the absence of the system for even a moment could result in serious financial or human loss, are all around us: in power stations (e.g. nuclear), air-traffic control systems, flight navigation systems, medical life support, etc. Human lives, indeed society itself, depend on critical infrastructure to an ever-increasing degree.

Critical operations

How do you prevent systems from failing? Ultimately, we can’t. At least, we cannot prevent component parts from failing. Every technology, like our hearts and bodies, eventually fails as we battle through the challenges of the environment – but both biology and technology have come up with some ways to minimize the effects.

  • Redundancy – having multiple backup systems that can take over if one part fails.
  • Repair – proactive maintenance to avoid faults, and rapid detection and repair of faults before irreversible damage occurs.

Once a fault has occurred, speed is of the essence. The industry speaks of two simple estimates: Mean Time Between Failures (MTBF), or how long a system is likely to function before there is a fault, and Mean Time To Repair (MTTR), or roughly how long it will take to recover. Clearly, if you are on an aircraft flying towards a mountain, a thirty-minute repair cycle is not helpful.
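To make the two estimates concrete, here is a minimal sketch using the standard steady-state availability relation, availability = MTBF / (MTBF + MTTR). The function names and all figures are hypothetical, chosen only for illustration, and the redundancy calculation assumes that replicas fail independently, which real systems rarely do perfectly.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of independent replicas when one survivor is enough."""
    return 1.0 - (1.0 - single) ** replicas


# Hypothetical figures, for illustration only.
slow_repair = availability(mtbf_hours=1000.0, mttr_hours=0.5)   # 30-minute repair
fast_repair = availability(mtbf_hours=1000.0, mttr_hours=0.05)  # 3-minute repair
paired = redundant_availability(slow_repair, replicas=2)        # two replicas

print(f"slow repair: {slow_repair:.6f}")
print(f"fast repair: {fast_repair:.6f}")
print(f"two replicas, slow repair: {paired:.9f}")
```

Under these assumptions, shortening the repair time and adding a redundant replica both raise availability, which is why the two strategies are usually combined.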

In spite of the obvious importance of prevention and repair, nearly all IT management technologies are still about the initial deployment of machinery – just set up your computers and hope for the best. At CFEngine, we have been making automated prevention and repair systems for many years – indeed, many of the key principles used by the IT management industry were first introduced by CFEngine. Surprisingly though, automation is still not as widespread as it should be. Humans are notoriously unreliable under pressure, but some companies don’t like machinery because, if something goes wrong, it is unclear whom to blame. In some countries there is still a culture of repairing failure with a simple ‘you’re fired!’ – perhaps because a culture of liability lawsuits weighs more heavily in their minds than actual material failure?

At CFEngine, we are not afraid of automation. Indeed, our mission is about rehumanizing IT by judicious use of automation. As Alvin Toffler pointed out in the 1960s, replacing humans with technology is not dehumanizing; what is dehumanizing is asking humans to work like machines in the first place! In mission critical systems, automation is the only way to keep IT operations running 24/7/365, and there are some key technology strategies for doing this:

  • Define a clear model of the Mission.
  • Avoid relying on anything we do not control.
  • No ‘single points of failure’ (employ redundancy).
  • Be as swift and agile as possible when implementing change.
  • Provide trustworthy insight about the state of the system.
  • Avoid interacting with any part of the system unless it deviates from the mission.
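The last strategy in the list, touching the system only where it deviates from the mission, can be sketched as a small convergent repair loop. This is an illustration in Python, not CFEngine's policy language or implementation; the `converge` function, the service names and the states are all hypothetical.

```python
from typing import Dict, List


def converge(desired: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Repair only the items that deviate from the desired state.

    Mutates `actual` in place and returns the keys that were repaired.
    Items that already comply are never touched.
    """
    repaired = []
    for key, want in desired.items():
        if actual.get(key) != want:
            actual[key] = want  # stand-in for a real repair action
            repaired.append(key)
    return repaired


# Hypothetical desired state ("the mission") and an observed state
# containing a single deviation.
mission = {"sshd": "running", "ntp": "running"}
observed = {"sshd": "running", "ntp": "stopped"}

first_pass = converge(mission, observed)   # repairs only the deviating item
second_pass = converge(mission, observed)  # finds nothing left to repair

print(first_pass)
print(second_pass)
```

Because a compliant system produces no actions at all, the same check can be repeated continuously without disturbing anything that is already working.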

We work continuously with key industry leaders to make CFEngine ever faster, lighter and more reliable in the face of complex and unpredictable infrastructure. You don’t have to work in outer space to experience risk. Just having a complex organization with challenging communications can be reason enough for a serious system failure.

Recently we have added a solution for high availability insight into the CFEngine Nova 2.1 product, through our Mission Portal or CFDB; now humans can play their proper role in early warning of rising issues. By making a fully redundant architecture for information updating, we can make reliable promises about availability of status information. This might be the lightest redundant solution to system monitoring available to users today, with even greater performance improvements added recently. Users get reliable and predictable insight, with fault recovery times of only a few minutes in case of failure – something that cannot easily be matched by competing software.

A final point is that CFEngine gives engineers real insight into the accuracy of the measurements it presents. Most monitoring systems do not tell you how fresh the information you are seeing is; CFEngine does, so that humans can use their skills in forming technical judgements. Engineers have no use for pretty graphs that satisfy only their emotional needs.
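The idea of reporting freshness alongside a value can be sketched as follows. This is a hypothetical illustration, not how CFEngine or the Mission Portal store their data; the `Reading` class, its fields and the sixty-second threshold are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """A measurement that carries the time at which it was collected."""
    name: str
    value: float
    collected_at: float  # seconds since the epoch

    def age(self, now: float) -> float:
        """Seconds elapsed between collection and `now`."""
        return now - self.collected_at

    def is_stale(self, now: float, max_age: float) -> bool:
        """True when the reading is older than `max_age` seconds."""
        return self.age(now) > max_age


# Fixed, hypothetical timestamps keep the example deterministic.
reading = Reading("load_average", 0.42, collected_at=1_000.0)
age_seconds = reading.age(now=1_090.0)
stale = reading.is_stale(now=1_090.0, max_age=60.0)

print(f"{reading.name}={reading.value} (collected {age_seconds:.0f}s ago, stale={stale})")
```

Showing the age next to the value lets an engineer decide whether a number is recent enough to act on, rather than trusting it blindly.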

What is satisfying is that sound engineering principles have led to a solution with both low overhead and a relatively low price-tag. Today, the cost of creating mission critical infrastructure does not lie with the IT management technology – you will spend far more on the human management regulation.

Critical Cloud

For some, Cloud Computing is the answer to scaling – rent your computing power virtually on the web and recycle it without ever needing to own it, just like car rental. The Cloud offers rapid response to demand, and such ‘elastic’ resources are certainly an important part of responding to peak demand, but merely deploying virtual machines quickly is not enough. To accomplish mission critical stability and keep risk at a minimum, you have to plan for the unexpected, and you cannot escape the need for prevention and repair with a short turnaround.

Now with the release of CFEngine Nova 2.1 (September 2011) and its integrated solution for high availability management, users have software that is robust to network outages and integrates high quality insight for both system experts and higher level management, with failover redundancy and mission compliance monitoring. We believe that Nova 2.1 might well be the most robust cross-platform management technology, with the fastest response time and smallest impact that you can find in the industry today.

This will not be the last time we improve CFEngine’s robustness under pressure, even though we lead the Open Source technology race significantly in this area. Our continued R&D shows that we are committed to our own critical mission of making the impossible possible.