CFEngine and the future of monitoring

Posted by Mark Burgess
December 28, 2012

Since writing my earlier post on model-based monitoring, I have talked to many users who encouraged me to describe CFEngine’s simple capabilities in more detail. Although CFEngine is not intended as a traditional monitoring platform, it offers a considerable amount of human-friendly information, built on a model that could be a hint of the future. At CFEngine, we like to innovate, and this post sketches how we are thinking.

*

At what point did IT monitoring lose its way? For me, the answer is when it forgot to think about its audience.

Present-day monitoring systems are typically mere data-streaming appliances. They do little in the way of interpretation or analysis. Even with an increasing number of smart graphing tools, all the work of understanding and interpretation is left to the user. That makes it very hard to answer concrete questions about either the infrastructure or the applications running on it.

In the monitoring of complex systems, there is more than one kind of audience – not everyone should be looking at the same data. Forensic investigation of system behaviour, and scientific analysis that looks for meaning, are very different activities from needing to know your speed and direction, or whether you are floating or sinking. Most tools can’t decouple these, so they waste users’ time with too much or too little data. Scientists look for data; operations staff look for contextualized information.

Some monitoring is informational (for human analysis) and some is actionable (either by human or machine). CFEngine ensures that information can be acted upon as quickly as possible and that human time is not wasted by exposing users to raw data that is not part of a model.

What are the key day-to-day wishes and questions we should be asking about monitoring? Here are some examples:

  • Tell me about the status and current promise(s) the system should keep.
  • Tell me about security, performance, compliance, etc.
  • Where is the greatest/least activity?
  • How much spare capacity do I have?
  • Where are things changing fastest?
  • When do I need to think about X?
  • What do I need to know about X?
  • What are the most important things that happened to X?
  • Why is this object/rule here?
  • What can I do about X?
  • Who changed X last?
  • When can I expect X to happen?
  • Where can I find X?
  • What things affect X?
  • What promises have been made about X?
  • What stories lead to conclusion X?
  • What patterns can we see that might lead to efficiencies of scale?
  • And so on …

Most monitoring systems cannot begin to answer these questions without a lot of manual work from users, because they are not in any way related to the configuration systems that define the purpose and desired state of the system. This is where CFEngine has been innovative. In CFEngine, we only need one model – because each configuration item implies automated monitoring, as sketched below.
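
To make that concrete, here is a minimal sketch in Python – purely illustrative, with hypothetical names, and not CFEngine’s actual syntax or implementation – of how a single declared promise can carry both the desired state and the means to observe it, so that a monitoring check falls out of the configuration model for free:

    # Hypothetical sketch: one declarative "promise" drives both
    # configuration and monitoring, so the model used to configure the
    # system is the same one used to check it.
    import os
    import stat
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Promise:
        name: str
        desired: object                  # the declared desired state
        observe: Callable[[], object]    # how to measure the actual state

        def check(self) -> bool:
            # Monitoring falls out of the model: compare actual to desired.
            return self.observe() == self.desired

    # Example: a file-permissions promise implies a permissions check.
    shadow_perms = Promise(
        name="shadow_perms",
        desired=0o600,
        observe=lambda: stat.S_IMODE(os.stat("/etc/shadow").st_mode),
    )
    print(shadow_perms.name, "kept" if shadow_perms.check() else "not kept")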

In the rest of this essay, I sketch out some notes about the directions and dead ends I see in the story of IT monitoring, viewed from my perspective as a long-time researcher and practitioner. I have no doubt that this will provoke mixed reactions, so let’s get down to it.

Model-based monitoring

Present-day data-streaming tools know so little about the data they are streaming that the best they can do is report on thresholds. Unfortunately, thresholds have little long-term meaning. What is acceptable today might be quite unacceptable tomorrow. Context is key. We need a more extensive system-measurement philosophy if we want to understand the phenomena at work.

Most monitoring collects a raw snapshot of the current actual state. Some systems can also record history and calculate statistics. CFEngine belongs to the few that use machine learning to compute averages, deviations, and trends, allowing users to compare what is happening now with what came before. Moreover, CFEngine is probably unique in automatically building model-based results from system design plans, giving it the ability to predict and detect anomalies. In CFEngine, the model comes from the policy itself. It can do this because all infrastructure design is boiled down to an ‘assembly language’ of atomic promises.
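
As a minimal sketch of the general technique – not CFEngine’s exact algorithm, and with invented sample data – one can learn an exponentially weighted mean and variance for a metric, then flag samples that fall outside a three-sigma band of learned experience:

    # Learn "normal" behaviour incrementally, then classify new samples.
    import math

    class RunningModel:
        def __init__(self, weight=0.3):
            self.w = weight     # how quickly older history is forgotten
            self.mean = None
            self.var = 0.0

        def update(self, x):
            if self.mean is None:
                self.mean = x   # first sample: just learn it
                return "learning"
            sigma = math.sqrt(self.var)
            # Classify against experience *before* absorbing the new sample.
            if sigma > 0 and abs(x - self.mean) > 3 * sigma:
                status = "anomaly"
            else:
                status = "normal"
            dev = x - self.mean
            self.mean += self.w * dev
            self.var = (1 - self.w) * (self.var + self.w * dev * dev)
            return status

    model = RunningModel()
    for load in [0.9, 1.1, 1.0, 0.8, 1.0, 1.1, 6.0]:  # last value is anomalous
        print(load, model.update(load))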

Data, information and knowledge

We can’t expect users to be scientists, and we should not encourage them to stare at data they have no context for understanding, hoping to glean some meaning from traces. That is a job for a specialist.

An inexperienced user might watch a value wanting to see if it will change, without realizing that it can never change for the next hour. This is both stressful and time-wasting.

Is “big data” the answer? Well, the answer to what question? Collecting every clue might get you a spot on CSI, but it won’t help the average engineer do anything except waste time sifting for gold (assuming he or she even knows how to recognize gold). Analyzing big data requires specialized skills. For day-to-day operations, you want to minimize the amount of searching and re-thinking, not increase it. So “big data” is a dubious strategy for anyone but a system architect.

Metrics for business and pleasure

If we take seriously the purpose, or intent, of the IT system, then this has to be reflected in what we try to measure. It is fruitless to watch the system, hoping to see relevant phenomena, without any particular expectations. This is where human need must inform the machine.

It might still be useful to collect data over time so that a suitable ‘data scientist’ can analyze them with a view to building such an understanding, perhaps for forensic analysis.

The key thing lacking in monitoring is context (i.e. a model) that allows us to understand what we are seeing. We need to be able to group measurements according to their semantics, i.e. what they mean for the intended purpose of the system.

Currently, CFEngine’s promises are mostly about configuration. In CFEngine Enterprise, ‘measurements’ promises can also define the health of systems. Health can be modelled in terms of business relevance (e.g. a maximum acceptable response time for a sales query) or of more elemental items like ‘at least 10GB of available storage’. We have also begun to model services and high-level business objectives (goals), from component services (e.g. a web farm) to an end-to-end business service (web farm, database cluster, web services, etc.). Ideally, all of these metrics would be tied to automated responses that respect the human in the loop.
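
A minimal sketch (hypothetical names and figures, in plain Python rather than CFEngine policy syntax) of the idea that business-level and elemental health checks can share one promise model, each evaluated to ‘kept’ or ‘not kept’:

    # Health as promises: a threshold plus an observed value.
    health_promises = {
        # Business-level: a sales query should answer within 200 ms
        # (an assumed figure, not from the text).
        "sales_query_latency_ms": (lambda v: v <= 200.0, 150.0),
        # Elemental: at least 10 GB of storage available.
        "available_storage_gb":   (lambda v: v >= 10.0,  42.0),
    }

    for metric, (promised, observed) in health_promises.items():
        status = "kept" if promised(observed) else "not kept"
        print(f"{metric}: observed={observed} -> promise {status}")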

For example, consider the difference between a business measurement and a system one. You might handle 100 business transactions per second (each lasting 10ms), but the system is throttled by a 5-minute TCP FIN_WAIT timeout, so correlations between measurements stretch over at least 5 minutes. A novice should not need to know such technicalities to derive meaning from a system portal.

The idea of monitoring everything ‘just in case’ is a sign of hedging, which suggests a lack of understanding of the environment. That is normal in the beginning, of course, but if your system is reasonably under control it is possible to form a model for it that leads to significant simplifications (and savings in human hours).

At a recent customer visit, an operations centre was paying 40 people to watch ‘big screens’ around the clock, just in the hope of seeing something unusual. Not only are humans easily desensitized to micro-anomalies; they also tend to forget what they are looking for, because the task is boring.

We need to separate metrics into those whose behaviour we have a model for, and those whose behaviour we really don’t understand. Measures can probably be further divided into three main groups: business-result related, platform-performance related, and environmental ‘weather’ checks for calibration. These are the major separations that we hope to achieve in running a system. If there is no clean separation between these areas, then the system may be called “complex”.

What we hope to achieve with such measurements includes:

  • Capacity planning.
  • Cause-effect inference.
  • Performance tuning.
  • Fault diagnosis.
  • Drift assessment.

Finally, different consumers of the information will want to organize their metrics on customized dashboards in a simple way. We need to get beyond trying to watch everything at the same time, to having confidence in model-based summaries.

Data collection and aggregation

How we collect data is important. A distributed process allows us to minimize the impact on resources, but we also need to aggregate data in order to compare results on a calibrated scale for immediate action, to observe correlations, and so on. To do this efficiently, it makes sense to filter out irrelevant noise at the source, rather than burdening future searches with it.

Collection can be optimized by a pull process, to keep network traffic predictable. In push-based notification systems (like SNMP traps) we can end up with notification storms.
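
One common way to keep pulls predictable – a sketch of the general idea with assumed constants, not a description of any particular tool – is to ‘splay’ them: each host hashes its name to a stable offset within the polling interval, so a central collector never sees every host at once:

    # Deterministic per-host splay: spread pulls evenly over the window.
    import hashlib

    INTERVAL_SECONDS = 300  # poll every 5 minutes (an assumed interval)

    def splay_offset(hostname, interval=INTERVAL_SECONDS):
        # Stable offset in [0, interval), derived from the host's name.
        digest = hashlib.sha256(hostname.encode()).digest()
        return int.from_bytes(digest[:4], "big") % interval

    for host in ["web1", "web2", "db1", "db2"]:
        print(f"{host} pulls at +{splay_offset(host)}s into each window")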

Every dynamical process has a natural timescale, so it makes no sense to sample it faster than that limit. Research shows that many values change more slowly than users expect, while others exhibit meaningless noise that should not concern anyone. Left to themselves, users will generally oversample out of uncertainty. This adds cost without adding value.

In CFEngine, measurement promises allow users to sample data at an appropriate timescale, at most once every 2.5 minutes. That captures most use cases.
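
As a toy illustration of the scheduling rule (the constants are assumptions for the example, not CFEngine internals), an effective sampling period should respect both the monitor’s floor and the process’s own natural timescale:

    # Sample no faster than the monitor allows, and no faster than the
    # process actually changes.
    MIN_PERIOD_S = 150  # 2.5 minutes

    def effective_period(requested_s, natural_timescale_s):
        # Sampling faster than the natural timescale only records noise.
        return max(requested_s, natural_timescale_s, MIN_PERIOD_S)

    print(effective_period(requested_s=10, natural_timescale_s=600))  # 600
    print(effective_period(requested_s=10, natural_timescale_s=60))   # 150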

We can relieve a lot of the human burden by using unsupervised learning to detect context and timescales, and to summarize the main features of a change process. CFEngine pioneered this area in the 1990s, and RRDtool has added some features to support this too.

More science can be applied here. Techniques like Principal Component Analysis can help us to see large numbers of hosts through different lenses.
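
For instance, here is a sketch of that idea in plain numpy, with fabricated data: projecting a hosts-by-metrics matrix onto its top principal components gives a low-dimensional ‘lens’ in which an outlying host stands out:

    # PCA over a fleet: 50 hosts x 4 metrics, one misbehaving host.
    import numpy as np

    rng = np.random.default_rng(0)
    # Fabricated metrics (cpu, memory, disk I/O, network), mostly similar...
    X = rng.normal(loc=[50, 60, 30, 20], scale=[5, 5, 3, 2], size=(50, 4))
    X[7] = [95, 90, 80, 70]  # ...except host 7

    Xc = X - X.mean(axis=0)                  # centre each metric
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:2].T                   # coordinates on top 2 components

    outlier = int(np.argmax(np.abs(scores[:, 0])))
    print(f"host {outlier} lies furthest along the first principal component")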

Representing resource pools

With the rise in importance of the web as a front end to almost everything, web scaling increasingly dominates resource models. We care less about individual hosts and more about resource pools or clusters.

CFEngine naturally allows us to group hosts by their role in the system, and in the Enterprise portal we are making it easier to view such groups as services, measuring Quality of Service (QoS) as has been common in network management for years. We can use that in a variety of ways to represent information at the level of the relevant virtual container for each function. Creating more virtual-container concepts that make clusters and resource pools easier to handle will move monitoring forward in this area.
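
A minimal sketch of pool-level reporting (role names and figures invented for illustration): aggregate per-host samples by role and report the pool’s quality of service, not each machine:

    # Report QoS per resource pool rather than per host.
    from collections import defaultdict
    from statistics import quantiles

    # (host, role, response time in ms) -- fabricated samples
    samples = [
        ("web1", "web_farm", 120), ("web2", "web_farm", 135),
        ("web3", "web_farm", 480),               # one slow member
        ("db1", "db_cluster", 15), ("db2", "db_cluster", 22),
    ]

    pools = defaultdict(list)
    for host, role, ms in samples:
        pools[role].append(ms)

    for role, times in pools.items():
        median = quantiles(times, n=4)[1]        # middle quartile cut
        print(f"{role}: n={len(times)}, median={median}ms, worst={max(times)}ms")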

Understanding growth

Given that IT is currently dominated by growth, one of the important measurements is the number of machines we operate, and the size of our resources, at any given moment. We would expect this to correlate in some way with the operational signals reaching individual hosts.

Current tools do not permit this kind of holistic correlation: measuring the number of machines (in various states) as a function of time, so that we can correlate other changes with the installation or retirement of machines. The total number of machines can affect any individual machine through shared resources.
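
A sketch of the kind of holistic correlation meant here, with fabricated time series: compare fleet size over time against a signal on a single host, and measure how strongly they move together:

    # Correlate fleet size with a per-host signal over 24 hours.
    import numpy as np

    hours = np.arange(24)
    fleet_size = 100 + 5 * (hours // 6)      # machines grow during the day
    rng = np.random.default_rng(1)
    # Invented per-host load, coupled to fleet size via shared resources.
    host_load = 0.4 + 0.01 * fleet_size + rng.normal(0, 0.02, 24)

    r = np.corrcoef(fleet_size, host_load)[0, 1]
    print(f"correlation between fleet size and per-host load: r = {r:.2f}")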

Access to data, notification and escalation mechanisms

We are currently in a transition period where we are exploring the composition of different kinds of system, and there is therefore a love affair with APIs. “Just give us a REST API and we will build the world anew!” But I believe this is a fleeting trend that will pass. Imagine if we treated other infrastructure in the same way: “Today everyone is a carpenter, here are the tools – now build more desks, we need to expand!” Such an organization would be seen as immature, even irresponsible.

So we want to get users beyond raw data feeds, and toolboxes, to commodity contextualized information, backed up by a local knowledge of its meaning.

What happens when something out of the ordinary occurs, i.e. an anomaly that requires special attention? Summary notifications help. Users should not have to watch a big screen to spot important events in a sea of change – nor should an air-traffic or mission-control room be required for the normal functioning of a system. That was the best technology available in the 1960s.

Facebook is an impressive tool for displaying notifications and news about large numbers of relationships. A ‘facebook’ for hosts and services (a hostbook) would allow us to maintain a close relationship with key favourites. Intelligent notification of events we care about is what Facebook does well, and we can learn from it.

Summary/Postscript

Ask yourself this. What is the purpose of monitoring? Is it a vanity mirror for looking at beautiful systems? Is it an alarm for fire fighting, or a crystal ball for strategic planning? Probably, it needs to be all of the above. But, above all, what it needs to be is something that helps human involvement and doesn’t waste our time.

To make CFEngine into a credible part of a monitoring infrastructure (for simple and complex needs), we are starting with a basic core of classical approaches to representing monitoring data – a handle for people to latch onto and build a more modern view on top of. Users need to be able to drill down from any starting point to find useful reasoning (not just more data) around the condition of the system. This will build confidence in the information as a source of intelligence.

We hope that others in the industry will continue to follow our thinking in infrastructure engineering. It is time for the industry to move forward.

Acknowledgement

Over the past year, I have benefitted from discussions with (in no particular order) Jason Dixon, Sal Jamil, Gleico Moraes and Juliano Martinez, John Willis, Damon Edwards, Patrick Debois, Kris Buytaert, John Allspaw, Arjan Eriks, Dan Klein, Kent Skaar and Reynold Jabbour. None of these gentlemen should be assumed to endorse the content above. Apologies if I have forgotten to mention anyone.