Scale and Scalability



1 Principles of scalability




1.1 What is scalability?


By scalability we mean the intrinsic capacity of a system to handle growth. Growth in a system can occur in three ways: in the volume of input the system must handle, in the total size of its infrastructure, and in the complexity of the processes within it.

For a system to be called scalable, growth should proceed unhindered, i.e. the size and volume of processing may expand without significantly affecting the average service level per node.

Although most of us have an intuitive notion of what scalability means, a full understanding of it is a very complex issue, mainly because there are so many factors to take into account. One factor that is often forgotten in considering scalability is the human ability to comprehend the system as it grows. Limitations of comprehension often lead to over-simplification and lowest-common-denominator standardization. This ultimately causes systems to fail due to an information deficit.




1.2 How does CFEngine address scale?



CFEngine is a decentralized (or federated) agent-based system, with no single point of failure. It uses integrated Knowledge Management to present comprehensible views of how infrastructure complies with user intentions.


In this Special Topics Guide, we take a simple approach to gauging the scalability of CFEngine, considering the worst case scalability of the software as a management system for typical environments and network models.



1.3 What does scalability depend on?


CFEngine's scalability is not only a function of the CFEngine software, but also of the environment in which it operates and the choices that are made. Some relevant environmental factors include:

For managers, there are several challenges to scaling that go beyond the infrastructure:

CFEngine does all processing of configuration policy at the destination node (i.e. on the affected system). There is no centralization of computation. CFEngine Nova adds fault-tolerant multi-node orchestration, in which any node's state can be made available to other nodes on request.

In general, reliance on common or centralized information will limit the inherent scalability of the system. However, through opportunistic use of caching, CFEngine is able to avoid single points of reliance. CFEngine's asynchronous promise model makes the impact of resource sharing less significant.



1.4 A product strategy for scaling


As our product names suggest, CFEngine's scaling behaviour follows an `astronomical' hierarchy of scales. Our product range has been chosen to model the different issues of scale that occur in systems as they grow from tens to tens of thousands of machines. As indicated above, dealing with scale is not just about machine capacity, but also about knowledge management and comprehension.

A Nova installation is designed around a single star configuration (like a solar system) in which the hub machine is the star and the managed entities are the planets. Special planets can have their own satellites (customized environments within the single point of control), so this model does not imply complete uniformity.

A Constellation is designed around multiple star configurations and is designed for cases where it is desirable to maintain several points of control or independently managed entities. This is also called a federation of star networks. The reason for choosing this kind of architecture may or may not have to do with service capacity (i.e. for coping with a large number of systems): sometimes knowledge management and responsibilities scale better with federation.


[Figure nova_const.png: From a Nova star network to a Constellation of star networks.]



1.5 A user strategy for scaling

This section proposes a simple method for evaluating capacity and scalability at a site. You will need some numbers in order to do some simple calculations.

You will be able to use these numbers in the next chapter to work out how far a simple star network configuration (Nova starburst) will go in supporting your requirements. An environment that consists of multiple sub-environments, or that is of very large size, will have to be handled as a constellation configuration.



1.5.1 Scalable CFEngine architecture

Several architectural principles aid the ability to scale to large size:



1.5.2 Layout guidelines for scalability

The picture of a federated architecture is that of a number of smaller star networks loosely rather than tightly integrated together. This loose coupling avoids rigidity that can cause cascade failure.

This schematic architecture does not answer where these centres will be located, however. Will the division into local centres be based on geography, departmental lines, or some other virtual view of the organization?

It does not matter from an architectural point of view what criteria are used for dividing up an organization. The main criterion is the environment itself. If we think in terms of promises for a moment, a strong group culture forms when members of a community make a lot of promises to one another. Organizational entities are therefore clusters of promises. This might or might not coincide with the naming of institutional entities.

It is easy to understand the reason for a promise-oriented approach to scalability. Clusters of promises are also clusters where communication is likely to be required and take place. Since scalability is enhanced by limiting the amount and scope of communication, the clusters of promises mark out the areas (sub-networks, if you like) where communication is necessary. This makes for a natural encapsulation of policy issues.



1.6 Unexpected risks of scaling




2 Scalable policy strategy




2.1 Policy guidelines for comprehension

Part of the challenge of scale is comprehension. If every computer is identical, this is no problem, but when there is real growth in complexity, scale can lead to information and comprehension overload. Keeping that growth in complexity in check requires some user discipline.



2.1.1 Strategy for scaling policy

What if you have written down 250,000 promises to keep? How can you manage that? Here are some tips:

When writing the CFEngine bundles of promises, you will need to use a `meta-model' to organize the bundles. This is part of a Knowledge Management (comprehension) strategy. Your best friend here is someone with didactic or pedagogical skills, as he/she will maintain the visibility of high level goals throughout the technical challenges. Your worst enemy is a technician with his head in the machine, who will drag everyone down to the machine components, where goals and challenges are incomprehensible.
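A minimal sketch of what such a meta-model can look like in policy, assuming hypothetical bundle names: the top-level bundle does nothing but delegate to technical bundles through methods promises, so that the highest layer of the policy reads as a statement of goals rather than of mechanisms.

# goals.cf -- a hypothetical top layer of a policy meta-model

bundle agent business_goals
{
methods:

  "web service"
      usebundle => web_service_promises,
        comment => "Goal: keep the customer-facing web tier available";
}

bundle agent web_service_promises
{
reports:

  cfengine::
    "Technical promises implementing the web-service goal live here";
}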



2.1.2 Tactics for scaling policy

In writing policy, the thing most often forgotten by technicians is explaining why decisions have been made.
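The comment and handle attributes exist for exactly this purpose: every promise can carry its own statement of why it exists, and the handle gives it a stable name that reports and other promises can refer to. A minimal, hypothetical sketch (the file and the reason given are only illustrative):

bundle agent document_why
{
files:

  "/etc/motd"
       create  => "true",
       comment => "Why: the legal department requires a login banner (hypothetical ticket LEGAL-17)",
       handle  => "motd_exists_for_legal_banner";
}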



2.1.3 Caching classes that are expensive to compute

As of version 3.4.0 of the CFEngine core, persistent classes can be used to construct a simple time-saving cache of classes that depend on very large amounts of data. This feature can be used to avoid recomputing expensive class calculations on each invocation. If a discovered class is essentially constant or only slowly varying (like a hostname or alias from a non-standard naming facility), it can be evaluated once and remembered persistently for a number of minutes instead of being recomputed on every run.

For example, to create a conditional inclusion of costly class definitions, put them into a separate bundle in a file classes.cf.

# promises.cf

body common control
{
# Once the persistent flag class is set, skip the expensive setclasses bundle
cached_classes::
  bundlesequence => { "test" };

# Otherwise evaluate the costly class definitions first
!cached_classes::
  bundlesequence => { "setclasses", "test" };

# classes.cf only needs to be parsed when the cache has expired
!cached_classes::
  inputs => { "classes.cf" };
}
 

bundle agent test
{
reports:

  !my_cached_class::
   "no cached class";

  my_cached_class::
    "cached class defined";
}
 
Then create classes.cf:

# classes.cf

bundle common setclasses
{
classes:

  "cached_classes"                  # timer flag showing the cache is still valid
         expression => "any",
        persistence => "480";       # persist for 480 minutes (8 hours)

  "my_cached_class"
                 or => { ...long list or heavy function... },
        persistence => "480";
}



2.2 Performance impact of promises

In a large system it is natural to expect a large number of promises. This makes locating a specific promise difficult. The Copernicus Knowledge Map is a key strategy for locating promises.



2.3 Avoiding network traffic

Connecting to a CFEngine server process is one of the most time-consuming activities in centralized updating. Every bidirectional query that has to be made adds latency and processing time that limit scalability. The connection time and data transfer size when checking for updates can be minimized by making use of the cf_promises_validated cache file on the server. This file summarizes whether it is necessary to search for file updates (a search that can take a significant number of seconds per client). Since most checks do not result in a required update, this cache file can save a large amount of network traffic.

     files:
     
      "$(inputs_dir)/cf_promises_validated"
          comment => "Check whether new policy update to reduce the distributed load",
           handle => "check_valid_update",
        copy_from => u_dcp("$(master_location)/cf_promises_validated","$(sys.policy_hub)"),
           action => u_immediate,
          classes => u_if_repaired("validated_updates_ready");
     
     am_policy_hub|validated_updates_ready::
     
      "$(inputs_dir)"
          comment => "Copy policy updates from master source on policy server if a new validation was acquired",
           handle => "update_files_inputs_dir",
        copy_from => u_rcp("$(master_location)","$(sys.policy_hub)"),
     depth_search => u_recurse("inf"),
     file_select  => u_input_files,
       depends_on => { "grant_access_policy", "check_valid_update" },
           action => u_immediate,
          classes => u_if_repaired("update_report");
     



2.4 Defining classes

Modelling environments with classes is a powerful strategy for knowledge management, and is therefore encouraged.
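For example, environment classes can be derived from facts the agent already knows, such as hostnames or network addresses, so that the rest of the policy speaks in terms of roles rather than individual machines. A minimal sketch, with hypothetical names and an illustrative address range:

bundle common site_classes
{
classes:

  # Hosts whose unqualified hostname looks like www1, www2, ...
  "web_servers"  expression => classmatch("www[0-9]+");

  # Hosts with an interface in the illustrative DMZ range
  "dmz_hosts"    expression => iprange("192.168.10.1-50");

  # A role built from other roles
  "production"   or => { "web_servers", "dmz_hosts" };
}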



2.5 Using the Copernicus Knowledge Map

The Copernicus Knowledge Map is an integral part of the commercial CFEngine products. It is also a feature that is developing rapidly as part of the CFEngine commitment to research and development. It forms a browsable `mental model' of relationships between promises, goals and documents that describe them (including the manuals and other documentation sources). The map provides you with an overview of how parts of your policy relate to one another, and to other high level parts of your environment.

Use the knowledge map to:



2.6 System Tuning and Hub Optimization

CFEngine uses MongoDB as its repository of information for each Nova and Constellation hub. The performance of a hub depends on the combination of hardware and software. The CFEngine hub plays the role of a database server, and this calls for fairly specific optimizations. For example, NUMA architecture processing is known to lead to severe processing bottlenecks on database servers, and so NUMA kernel modules should be switched off. Below are some of the optimizations that should be considered for CFEngine hubs.



2.6.1 Tuning the linux kernel for thousands of hosts

Although CFEngine communicates with the Mongo database over a local socket, it still uses TCP as its connection protocol and is therefore subject to kernel optimizations.

Every write and read connection to the Mongo database makes a kernel TCP connection. With the extreme density of connections, this is somewhat like a high volume webserver. The standard `play safe' kernel settings are too conservative for this kind of performance. The main bottleneck eventually becomes the FIN_WAIT timeout, which leaves file descriptors occupied and non-recyclable for too long. We recommend reducing this waiting time to free up descriptors faster:

echo "1024 61000" > /proc/sys/net/ipv4/ip_local_port_range
echo "5" > /proc/sys/net/ipv4/tcp_fin_timeout
This will clear old connections faster, allowing new ones to be created and old threads to terminate faster.
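These settings can also be kept in place by the hub's own policy instead of being applied by hand. A minimal sketch, assuming the am_policy_hub class is defined on the hub (as in the update example earlier); only one of the two settings is shown, and the bundle name is hypothetical:

bundle agent tune_hub_kernel
{
commands:

  am_policy_hub::

    "/sbin/sysctl"
         args    => "-w net.ipv4.tcp_fin_timeout=5",
         comment => "Why: recycle sockets in FIN_WAIT faster on a busy hub",
         handle  => "hub_tcp_fin_timeout";
}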



2.6.2 Tuning the MongoDB for thousands of hosts

Some points to consider when scaling the MongoDB:



3 Internal and external scalability of the software


Scalability is about the internal and external architecture of the system, and the way that information flows around the highways and bottlenecks within it. As a distributed system, some of those are internal to the software, and some lie in the way it is used.

CFEngine allows users to build any external architecture, so the only intrinsic limitations are those internal to the software itself. Poor decisions about external architecture generally lead to greater problems than the internal limitations.

A number of techniques are used internally to bring efficiency and hence scalability in relation to policy.

Hashing and indexing
are used in many ways to coordinate the system without the need for communication. By basing decisions on predefined and publicly available information about a node (like its name and address), we avoid having to pass messages between nodes.
Classifying of patterns
is used to model patterns of similarity and difference in the intentions for a system. By describing policy for different categories of system parts, rather than for each individual part by name, we can reduce the amount of information we need to specify.
Persistent locking of execution
is used to place limits on the frequency and concurrency of operations undertaken by CFEngine. Using the ifelapsed timers on each promise, one can say how often work-intensive checks are made (a sketch follows this list).
Lazy evaluation
is a common approach to efficiency, but it is difficult to achieve in a dynamic environment. Necessary complexity makes for necessary work. CFEngine uses best-effort lazy evaluation to reduce processing, but errs on the side of correctness, reliability and security.
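For example, an expensive check can be rate-limited with an action body, so that the work is done at most once per hour no matter how often the agent runs. A minimal sketch; the command is a hypothetical placeholder for any heavy operation:

bundle agent rate_limited_check
{
commands:

  "/usr/local/bin/expensive_integrity_check"   # hypothetical heavy job
       action  => run_hourly,
       comment => "Why: this scan is costly, so verify the promise at most once per hour";
}

body action run_hourly
{
  ifelapsed => "60";   # minutes that must elapse before the promise is re-verified
}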

The greatest challenge for scalability is to reduce functional requirements to a description or model that uses patterns (i.e. general rules) to compress only what needs to be said about the policy into a comprehensible form. It is about identifying a necessary and sufficient level of complexity without over-simplification, and it is about not having to specify things that don't need to be specified. This is Knowledge Management.




3.1 Best case approach to external scalability


In the most scalable approach to management, each agent works 100% autonomously as a standalone system, requiring no communication with its peers, or with a central agency. Its policy is therefore constant and fixed.

If this model suits your organization, the operation of CFEngine is completely independent of the number of machines, so it exhibits perfect scaling with respect to the size of the infrastructure. However, this model is too simple for most sites, and its value lies in documenting the extreme end of the scalability scale, as an ideal to work towards when striving for efficiency.



3.2 Failover and redundancy

As a self-healing system, CFEngine carries a low risk of data loss. Most management data that are lost in an incident will recover automatically.

Failover due to unavailability.
The loss of a hub is not a critical failure in a CFEngine-managed network. Client machines continue to work with the last known version of policy until a hub returns to `online' status. During updates, hubs can come under load. To avoid this, splaying options should primarily be used [1] (see the sketch at the end of this section). CFEngine's failover servers for file-copying can be set up to offer immediate redundancy.
Shared storage for /var/cfengine/masterfiles.
The only place where shared storage makes sense is between the main policy server (hub) and any failover servers. In that case only this one directory can be shared.
Backup of /var/cfengine.
It is not strictly necessary to have a backup of the CFEngine work directory, but keeping a backup of this for the hub can save time when restoring a broken hub, since the records of known clients are stored in this workspace, and the current status of reports is also stored.

On clients, this workspace can be considered disposable, but many users save public/private key pairs to preserve the `identity' of hosts that have merely suffered disk failures, etc.

Multiple reporting hubs.
Although it is technically possible to have multiple reporting hosts, this is recommended against. Multiple hubs will only increase the overhead on all parts of the system for little return.

Our general recommendation is to introduce as few network relationships as possible (make every system as autonomous as possible). Shared storage is more of a liability than an asset in general networks, as it increases the number of possible failure modes. Backup of the hub workspace is desirable, but not a show-stopper.
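Splaying is configured on the client side, in the executor that schedules the agent; spreading start times over a few minutes is usually enough to flatten the peak load on the hub. A minimal sketch of the relevant control body:

body executor control
{
  splaytime => "4";   # hash each host into a start time within a 4-minute window,
                      # so that clients do not all contact the hub at the same moment

  schedule  => { "Min00", "Min15", "Min30", "Min45" };   # run four times an hour
}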



3.3 A simple worst case approach to scalability: the star network


CFEngine 3 Nova is designed around the concept of a simple star network, i.e. a number of `client' machines bound together by a central hub. This is a commonly used architecture that is easy to understand. The hub architecture introduces a bottleneck, so we expect it to have limited scalability, since we cannot increase the power and speed of the hub without limit.

To understand the external scalability of CFEngine management, we shall estimate how large a CFEngine system can grow using simple centralized management, in this star network pattern. This will force us to confront low level performance characteristics, where we try to extract the most from limited resources.

In a star network, all agents connect to a central hub to obtain possibly frequent updates. The challenge is to optimize this process [2]. This makes the central hub into a bottleneck. It becomes the weakest link in the chain of information, i.e. the star network has limitations that have nothing to do with CFEngine itself. The counterpoint is that the star network is very easy to comprehend, and thus it usually forms the starting point for most management frameworks.


[Figure hub.png: Centralized management with a hub.]


In the Community edition of CFEngine, updates travel only in one direction, from server to client. CFEngine Nova adds to this the collection of data for reporting and analysis (CFDB), from client to hub. Thus the expected load on the server is the sum of both processes.


We would like to answer two equivalent questions. Given a central hub with certain limitations, how many clients can it support? Conversely, given a number of hosts to support with a single hub, what capacity is required in system infrastructure to support it at a certain level?

There are actually two independent issues here. There is the scaling of the policy updates (served by the hub) and the scaling of the reporting updates (collected by the hub). Both of these processes compete for resources.



3.3.1 Typical CFEngine scales

We start by setting some fundamental scales.

We deliberately choose a low value for network capacity. Even if more capacity is available, one does not typically expect to use it all for management overhead.



3.3.2 Scaling of updates

To a first approximation, the process scaling relationships are linear, as long as we stay far away from the region of resource contention, in which a server's performance typically degrades very quickly. Scalability is thus about having safe margins for worst case behaviour.

The number of threads t supported by a server satisfies:

          s t \le M

where s is the memory footprint of each server thread (roughly the software build size) and M is the hub's RAM. This yields a value for t, given fixed values for the hub's RAM and software build size. For a server with 1 GB of memory, we have t \le 1024/12 \sim 80 threads. This will consume U \le 1 MB of data per agent for a community host, and U \le 5 MB for Nova, to account for reporting.

The expected time-to-update \tau_{\rm round} satisfies:

          U_{\min}/C \le \tau_{\rm round} \le U_{\max}/C

where C is the available network capacity. The actual values will normally be significantly less than this worst case, but one must plan for the possibility of update storms. The network capacity C = 1\,{\rm MB\,s}^{-1} suggests an expected time-to-update \tau_{\rm round} \le 2 s (there and back) with a server-processing overhead. We shall take a conservative value of \tau_{\rm round} \le 10 s for the maximum round-trip request time.

Let's consider the policy/reporting update process, as this dominates the scaling behaviour of the software in terms of the number of machines. Assuming that we can arrange for agents to distribute their updates over a single time interval \Delta t, with maximum possible entropy [3], then we should be able to achieve a maximum number of scheduling slots \sigma:

          \sigma = \Delta t / \tau_{\rm round}

Since the round-trip time can be higher or lower, depending on the size of the update, we can propose some approximate limits: \sigma_{\max} = 300/2 = 150 and \sigma_{\min} = 300/10 = 30 per interval \Delta t = 300 s.

The total number of updates that can complete in the \Delta t interval is thus, at fixed t:

          N = \sigma t = (\Delta t / \tau_{\rm round})\, t

If there are now t threads, then the total number of updates must lie between the upper and lower bounds determined by \sigma:

          2400 \le N \le 12,000.
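Substituting the illustrative values above (t \sim 80 threads, and \sigma between 30 and 150 slots per interval) makes the arithmetic explicit:

          N_{\min} = \sigma_{\min}\, t = 30 \times 80 = 2400
          N_{\max} = \sigma_{\max}\, t = 150 \times 80 = 12,000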

So we can gauge a reasonable lower bound of 2000 machines from a single hub, with extremely conservative estimates of the environment. Note that when exceeding several thousand connections over a short time interval, other limitations of the operating system will normally start to play a role, e.g. the maximum number of allowed file descriptors. Further, comprehending a system of more than a few thousand machines is a challenge unless the system is extremely uniform.

In making each of these estimates, we are assuming that the hub machine will be working almost to capacity. Increasing its resources will naturally lead to improvements in performance.



3.3.3 Limitations on the CFDB hub database

A `star network' architecture has a single concentration of processing, centred around registration of incoming and outgoing hosts on the hub. Read/write activity is focused on two databases:

The scalability of the hub depends on how efficiently reads and writes to these databases can be parallelized. Even with aggressive caching, databases are disk-intensive, and since disk access is typically the slowest or weakest link in the data-flow chain, disk accesses will throttle the scalability of the hub the most.

CFEngine tries to support efficient parallelization by using multiple threads to serve and collect data; however, access to a shared resource must always be serialized, so ultimately serial access to the disk will be the limiting factor. The CFEngine hub supports Linux systems. Simple SATA disks, USB and Firewire have very limited performance. To approach the maximum scaling limit of a few thousand hosts per hub, users can invest in faster interfaces and disk speeds, or even solid state disk devices.

Because of the serialization, a heavily loaded MongoDB will eat up most of the resources on the hub, dominating its performance. Some performance tuning strategies are discussed below. Clearly, maximizing the amount of RAM on the hub is one way to improve performance.

The time estimates for host updating used above include the latency of this database, but it is possible that there might be additional delays once the amount of data passes a certain limit. We currently have no data to support this assumption and await customer experiences either way.

Lookup times for data in the reports database increase with the number of host-keys, so the time required to generate certain reports (especially when searching through logs) must increase as the number of hosts increases.



3.3.4 Update storms

When changes are made, many hosts will start downloading updated files. This can have a sudden impact on the network, as a lot of unexpected traffic is suddenly concentrated over a short interval of time. The contention from these multiple downloads can therefore make each download longer than it otherwise might have been, and this in turn makes the problem worse. The situation is analogous to disk thrashing.

Once thrashing has started, it can cause greatly reduced performance, and hosts might pass their `blue horizon threshold', appearing to disappear from the CFEngine Mission Portal. This does not mean the hosts are dead or even `out of control'. It only means that updates are taking too long according to the tuning parameters. The default update horizon is.



3.4 Optimizations affecting scalability




3.4.1 The role of caching and indexing

To assist the speed of policy processing by cf-agent, choose class names spread evenly throughout the alphabet to make searching easier. CFEngine arranges for indexing and caching to be performed automatically.



3.4.2 Deterministic queue



3.4.3 Push versus pull



3.5 Redundancy and load balancing in the Nova hub

CFEngine does not need a `high availability' architecture to be an effective management system. The software was designed to work with very low availability, and the designers strongly recommend avoiding the introduction of dependencies that themselves require high availability. In normal operation, CFEngine will be able to continue to repair systems without any contact with the outside world, until actual policy changes are made.

It is nonetheless possible to balance load and account for failures by multiplying the number of hubs/policy servers in a Nova star network. This only makes sense for the policy servers and reporting hubs, which are in principle single points of failure.

Because CFEngine is a pull-based technology, both of these processes (clients pulling policy from a policy server, and the hub pulling reports from its clients) are `self-healing'. If a client is unable to connect to a policy server, it will continue to work with its old policy until the policy server becomes available again. If a report hub fails and reports are no longer accessible, no data are lost, because the actual data are sourced from the satellite nodes. When the hub is replaced or restored, it will update and continue as before.



3.5.1 Balancing multiple policy servers

A simple way to balance client hosts across multiple servers is to split them into classes using the select_class function [4]. Nothing else is needed to spread computers evenly into a set of named buckets.

classes:

  "selection" select_class => { "bucket1", "bucket2" };

This selects a particular class for each host from the list. A given host will always map to the same class, thus allowing the rule for policy updates to include copying from a personal `bucket' server.
body copy_from xyz
{
...

bucket1::
     servers     => { "server-abc-1.xyz.com", "failover.xyz.com"  };
bucket2::
     servers     => { "server-abc-2.xyz.com", "failover.xyz.com"  };

...
}

Note that this method is not suitable for bootstrapping hosts in a hub configuration, since that requires the one-to-one relationship recorded in the /var/cfengine/policy_server.dat resource.



3.5.2 Redundant reporting hubs and recovery

A CFEngine reporting hub is a host that aggregates reports from all managed hosts in a Nova starburst cluster. The purpose of aggregating reports is to have all the information in one place for searching, calibrating and comparing. The loss of a hub is an inconvenience rather than a problem, as the function of a hub is to make access to information about the system convenient and to enhance the knowledge of users.

Dividing hosts between multiple hubs should be a decision based on the absence of a `need to know' about all hosts in one place. By splitting the aggregation up into multiple hubs, one jettisons the very function of the hub, which is to aggregate information. This should normally be done in conjunction with a decision about knowledge boundaries in your organization.

It makes sense to back up some information from a reporting hub. Hub information is considered to be a mixture of two kinds of data:

In addition to these, the hub makes use of the `last seen' database of known hosts in order to know where to collect data from. For rapid recovery after a data-loss catastrophe, it is recommended to back up this database also.

[Figure hubs.png: Some data can be backed up from a hub for convenience.]

The procedure for backing up the ephemeral data is:



Footnotes

[1] We strongly recommend that users abandon the idea that it is possible to have `instant' or `immediate' updates. There is always some delay. It is more pragmatic to manage that delay by making it predictable than to leave it to chance.

[2] The estimates here are based on CFEngine core versions 3.1.2 and higher.

[3] Maximum entropy means the most even or flat distribution over the time interval, in this case. See, for instance, Mark Burgess, Analytical Network and System Administration.

[4] This feature was added in core 3.1.5.