Scale and scalability

Posted by Mark Burgess
May 22, 2012

If someone asks you about the scalability of your operations, don’t tell them about the number of machines you run; tell them rather about what it costs you to tend them each month. The total cost of that burden can be summed up from the cost of hardware, software, maintenance, people, lost revenue during downtime, time lost during maintenance, and time wasted from not managing knowledge well.
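
To make that sum concrete, here is a minimal sketch in Python. The class name `MonthlyCosts`, its field names and the helper `cost_per_host` are hypothetical, chosen only to mirror the list above; they are not part of CFEngine or any other tool, and any figures you feed in are your own.

```python
from dataclasses import dataclass, fields


@dataclass
class MonthlyCosts:
    """Hypothetical monthly cost buckets, one per item in the list above."""
    hardware: float
    software: float
    maintenance: float
    people: float
    downtime_revenue_loss: float
    maintenance_time_loss: float
    knowledge_waste: float

    def total(self) -> float:
        # Sum every declared cost field.
        return sum(getattr(self, f.name) for f in fields(self))


def cost_per_host(costs: MonthlyCosts, hosts: int) -> float:
    """Total monthly burden divided by the number of hosts tended."""
    if hosts <= 0:
        raise ValueError("hosts must be a positive integer")
    return costs.total() / hosts
```

Tracking `cost_per_host` as the host count grows is one rough way to see whether an operation scales: if the figure holds steady or falls, growth is not adding a proportional burden.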

At one customer site, a lightweight CFEngine automatic maintenance framework replaced an old second-wave push-package provisioning system from one of the major vendors, exchanging 200 hosts and a team of sysadmins in the push framework for 5 CFEngine servers run by a single man. The push framework had a response time of about an hour, while the CFEngine framework handled a 5-minute time resolution, all while easily supporting a modest 10,000 hosts with configuration and monitoring of state. What is the cost of scaling?

On a second occasion, we were part of a test in a virtual cloud to compare the performance of CFEngine against Ruby-based alternatives. The initial test had to be abandoned because we had provisioned the smallest-memory (cheapest) virtual machines for CFEngine, but these were not even able to load the Ruby sub-system. In the cloud, more memory means higher cost. What is the cost of scaling?

Data transfer is another thing we pay for in datacentres, real or virtual. If you are provisioning hosts across a wide area and you maintain a single point of control with chatty protocols, you can run up a large bill. CFEngine customers have abandoned software with expensive push-pull models due to greedy protocol bandwidth. The answer here is to federate and maximize the autonomy of agents, thus minimizing reliance on fragile shared resources. Agility is reduced when network latency and timeouts from failures add delays. CFEngine-managed hosts do not need to exchange much data, even to achieve full monitoring, thanks to core principles of lazy evaluation, compression by natural selection and distribution of control. If hosts manage themselves, there are large savings to be made in time and transfer costs. What is the cost of scaling?

What does a big mess cost?

One of the hidden costs of management is insufficient knowledge. Time spent figuring out the what, the how and the why of an installation can add up to both wasted time and wasted manpower. As a system grows in size and complexity, our knowledge of the system’s true nature has to scale too.

Getting hold of the person who designed a system (and expecting him or her to remember the details) might not even be possible if that person has already left the firm. The personnel cost of scaling is the highest one we have. True automation (the hands-free, not the remote control kind) uses models and design patterns to reduce the total amount of information we need to look at in order to govern our hosts. That makes it easier to know a system, because there is less to know. In Claude Shannon’s information theory, there is the concept of compression to the smallest amount of data needed to represent the true intent. If that information grows linearly with the size of your system, it is called noise (or maximal, incompressible entropy!).
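
The difference between patterned and pattern-free information can be illustrated with a rough sketch. It uses Python's standard `zlib` compressor as a stand-in for "the smallest amount of data needed"; that is an assumption made purely for illustration, not how CFEngine works and not the 2003 proof. The function names and the sample host descriptions are hypothetical.

```python
import random
import zlib


def compressed_size(lines):
    """Bytes needed to store all host descriptions after zlib compression."""
    blob = "\n".join(lines).encode("utf-8")
    return len(zlib.compress(blob, 9))


def patterned_hosts(n):
    # Deliberately extreme case: every host follows the same pattern,
    # so each description is identical.
    return ["role=webserver package=nginx state=running" for _ in range(n)]


def ad_hoc_hosts(n, seed=0):
    # Every host carries its own unrelated setting (modelled as random hex),
    # so there is no shared pattern for the compressor to exploit.
    rng = random.Random(seed)
    return [f"setting={rng.getrandbits(128):032x}" for _ in range(n)]


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, compressed_size(patterned_hosts(n)), compressed_size(ad_hoc_hosts(n)))
```

Running it prints, for each host count, the compressed size of the patterned description next to that of the ad hoc one. The ad hoc size should grow roughly in step with the number of hosts, while the patterned size stays small by comparison.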

CFEngine was designed with a promise language that is composed of compressed Shannon intentions, in the form of convergent promises. In 2003 I proved that this makes CFEngine’s language a minimum entropy model, which can significantly reduce the information burden by using patterns, thus making a system easier to know.

In addition, we can document relationships between the governed parts of a system, perceive general goals through details that are more than just noise, and make predictions that become increasingly costly for humans alone as a system scales. By separating what we intend from everything else that happens to our system, we keep a clarity of purpose even as the numbers grow. CFEngine learns as it goes, so that we always have a reference for what is normal, and hence ignorable, about behaviour.

Scalability is the ability to grow your datacentre without growing the amount of work, time or money needed to support it. I have mentioned some of the main factors here as a teaser. I challenge readers to look at their own systems with a critical eye: what does it cost you as you add more IT?

See also:

CFEngine Special Topics Guide on Scale and Scalability