Benefits of agent-based configuration management

Within the configuration management space, people often distinguish between agent-based and agentless approaches. In short, an agent-based solution means that you install a software agent which runs on the system, either continuously in the background or periodically. That agent makes changes to the system as desired, and commonly also communicates over the network to send and receive updates, policy, commands, scripts, data, etc. An agentless system, on the other hand, does not involve installing anything new. Instead, it relies on software which is (presumed to be) already installed, like the SSH server, which can be used to access and make changes to the system.

There are also more hybrid approaches (e.g. a system which involves using SSH in some cases, and installing an agent in other cases). In the real world, in large and complex infrastructures, organizations typically rely on a combination of tools for different purposes. Nevertheless, in this post, let’s take a look at the benefits of agent-based configuration management solutions, such as CFEngine.

Bandwidth - Limiting the amount of data transferred

When the software, data, and policy are stored on each host, we can track changes to them and transfer only what has changed, which minimizes bandwidth consumption. When there is new policy, we transfer it, but only the files / parts needed, not the whole directory. The same applies to data, both to and from the host. Most of the time, for most files and data, there is no change, and thus no reason to re-transmit the data.

By default, CFEngine reports useful information about a host to the hub, such as OS, CFEngine version, IP address, and MAC address. In total there are around 200 of these values (variables, classes, inventory attributes) on a newly set up (stock) system, and users add more through the modules they use and the policy they write. Transferring just a few of these values occasionally, when they change, uses significantly less bandwidth than transferring all of them every single time.
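To illustrate the idea of sending only what changed, here is a minimal Python sketch. It is not how CFEngine implements reporting; the function name `inventory_delta` and the example attribute names and values are assumptions made up for illustration.

```python
_MISSING = object()  # sentinel, so that a stored value of None still counts as present


def inventory_delta(previous: dict, current: dict) -> dict:
    """Compare two inventory snapshots and return only what needs to be sent."""
    changed = {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }
    removed = sorted(key for key in previous if key not in current)
    return {"changed": changed, "removed": removed}


# Hypothetical snapshots; a real host would have around 200 such values.
last_reported = {"os": "ubuntu", "agent_version": "1.0", "ip": "192.0.2.10"}
current_state = {"os": "ubuntu", "agent_version": "1.1", "ip": "192.0.2.10"}

delta = inventory_delta(last_reported, current_state)
print(delta)  # only the changed agent_version is included
```

With three values the saving is small, but with hundreds of values per host and thousands of hosts, sending only the delta keeps most reports close to empty.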

Decoupling - Doing different tasks at different frequencies

When you want to make changes to a remote system you need 3 things to happen:

1. Update: Transfer your desire to the remote system (can be a bash command, policy file, Python script, JSON data, or something else)
2. Enforce: Have that remote system actually enforce your desired state (run a command, evaluate policy, delete a file, etc.)
3. Report: Transfer some information back to you about the result, whether it was successful, what was changed, etc.

These 3 things do not have to happen with the same frequency, and at scale this matters. When and how often you run these things depends on your requirements, how busy each host is, and how much network bandwidth you have available. Let’s take an example:

You only make changes to policy a couple of times per day.
- Updating could happen once per hour. CFEngine would spread your host updates out evenly throughout the hour. For 10 000 hosts that’s still an average of around 167 hosts updating every minute. This means that you’d have several hosts updating immediately after your change, and all of them would be done after 1 hour.
Your policy is enforcing important security-related configuration like file permissions and local user accounts:
- Since enforcing does not rely on the network, it can happen much more frequently without consuming excessive bandwidth. By default, CFEngine runs things every 5 minutes. If your policy is small / fast, you could turn it down to 1 minute; if your policy is large or involves slower operations, you can increase it to, for example, 10 or 15 minutes.
You want to have as fresh reporting information as possible, but the network / bandwidth is the bottleneck.
- Sending reporting data every 10 minutes could be a reasonable schedule. For the case of 10 000 hosts, this averages out to around 17 hosts sending reporting data per second.
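The averages in the example above can be checked with a few lines of Python. The sketch also shows one possible way to spread hosts evenly across an interval: deriving a fixed offset from a hash of the hostname. That hashing scheme and the names `average_rate` and `splay_offset` are assumptions for illustration, not a description of CFEngine's actual algorithm.

```python
import hashlib


def average_rate(hosts: int, interval_seconds: int) -> float:
    """Average number of hosts acting per second when spread evenly over the interval."""
    return hosts / interval_seconds


def splay_offset(hostname: str, interval_seconds: int) -> int:
    """Deterministic per-host offset in [0, interval_seconds), derived from the hostname."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % interval_seconds


updates_per_minute = average_rate(10_000, 60 * 60) * 60  # hourly updates
reports_per_second = average_rate(10_000, 10 * 60)       # reports every 10 minutes

print(round(updates_per_minute))     # 167
print(round(reports_per_second, 1))  # 16.7
```

Because the offset depends only on the hostname, each host picks the same slot in every interval without any coordination with the hub or with other hosts.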

The conditions for when to do these steps do not have to be simple intervals as above. If desirable, you can set different intervals based on time of day, day of the week, month, or in general the state of the system. Updates, or parts of policy enforcement, could also be postponed until a condition is met.

Robustness - Autonomy and decentralization

CFEngine was designed to be decentralized. Each host runs an agent, makes decisions, and, if desirable, can serve and fetch policy and data from other hosts (not just the hub). However, even in CFEngine, some things are typically centralized; the policy server and reporting database typically reside on one host called the hub. To mitigate the single point of failure scenario when the hub goes offline, one can set up alternative hubs, for example using the High Availability setup (HA). But even if all your hubs go offline (or your only hub), the agents on each host are still autonomous and decentralized - they will keep enforcing the policy they have, making changes according to your intentions, even if there is no central hub to tell them what to do and when to do it. Relating to the example above - Enforce (step 2) will keep working, even if Update and Report (steps 1 and 3) are not currently possible.

Figure: Visualization of 5 decentralized agents / servers which keep working even if the hub in the center is offline.

If you have policy to uninstall unwanted packages like telnet, to limit who has superuser / sudo privileges, and to rotate log files when they grow to a certain size, those things will keep happening even if your centralized hub or network is having issues. Once the issues are resolved, the hosts will seamlessly start communicating with the hub again, and you get reporting data about what has happened on those hosts.
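The behavior described in this section can be sketched as a single agent cycle in Python. This is a simplified illustration, not CFEngine's implementation: the function `run_cycle`, its parameters, and the idea of keeping unsent reports in an in-memory list are all assumptions.

```python
def run_cycle(fetch_policy, cached_policy, enforce, send_report, backlog):
    """One agent cycle: try to update, always enforce, report when possible."""
    try:
        policy = fetch_policy()      # Update: needs the network
    except OSError:
        policy = cached_policy       # hub unreachable: keep the policy we already have
    backlog.append(enforce(policy))  # Enforce: purely local, always happens
    try:
        while backlog:               # Report: flush everything we have not sent yet
            send_report(backlog[0])
            backlog.pop(0)
    except OSError:
        pass                         # keep the backlog until the hub is back
    return policy


def hub_offline(*_args):
    raise ConnectionError("hub unreachable")


enforced, sent, backlog = [], [], []


def enforce_locally(policy):
    enforced.append(policy)
    return f"enforced {policy}"


# Hub is down: the cached policy is still enforced, and the report is kept for later.
policy = run_cycle(hub_offline, "policy-v1", enforce_locally, hub_offline, backlog)
backlog_while_offline = list(backlog)

# Hub is back: new policy is fetched and the queued reports are delivered.
policy = run_cycle(lambda: "policy-v2", policy, enforce_locally, sent.append, backlog)
```

The important property is that enforcement sits outside both `try` blocks, so a network failure can delay updates and reports but never stops the host from enforcing the policy it already has.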

Consistency - Preventing configuration drift

Agentless systems allow you to quickly deploy a change or piece of software. They’re often used in this way:

Deploy this software to all my hosts now.

However, that sentence doesn’t really describe what should happen in the future:

What about hosts which are currently unavailable / offline?
- Should there be some retries?
  - How many retries?
  - What if they’re offline for a long time (hours, days, weeks)?
What about hosts which don’t exist yet, but will be spawned in the future?
- Your automation should probably be a part of some provisioning step?
What about when a user / admin comes along and uninstalls the software?
- Maybe run your automation as often as possible? (as bandwidth / resources allow)

If the software relates to your security or reliability, it’s important to consider these questions. It’ll be appropriate to set up some periodic jobs which run the automation on some schedule, with retries. At that point, scalability becomes more important: you want a configuration management system which is able to enforce a lot of rules (we call them promises), frequently, on a lot of hosts. Additionally, everyone who uses the configuration management system and automates things with it will need to take the considerations above into account and ensure their desired state is consistently enforced.

We’ve seen that all of this poses a challenge for agentless systems in the real world ¹. In practice, they’re sometimes used in a deploy-once / fire-and-forget way, which leads to configuration drift: outliers and subsets of the infrastructure are missing certain files / software, or are otherwise not configured correctly.

In CFEngine, the default approach is to add the things you want into your policy set, which describes your desired state for the entire infrastructure (and each host in it). As mentioned above, CFEngine, on all the hosts where it is installed, will periodically pull and enforce this policy. This mitigates a lot of problems - offline hosts will fetch and enforce the policy when they come online, same for hosts spawned in the future, and all of your rules are enforced continuously (until you remove them).
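The difference between a one-off deployment and continuously enforced desired state can be sketched in Python. This is a toy model, assuming a host whose installed packages are represented as a set; the function `converge` and the package names are illustrative only and do not represent CFEngine policy syntax.

```python
def converge(desired: dict, installed: set) -> list:
    """Bring `installed` in line with `desired` and return the actions taken."""
    actions = []
    for package, state in sorted(desired.items()):
        if state == "present" and package not in installed:
            installed.add(package)
            actions.append(f"install {package}")
        elif state == "absent" and package in installed:
            installed.remove(package)
            actions.append(f"remove {package}")
    return actions


desired = {"nginx": "present", "telnet": "absent"}
installed = {"telnet"}

first_run = converge(desired, installed)   # fixes both deviations
second_run = converge(desired, installed)  # already compliant: nothing to do
installed.discard("nginx")                 # an admin uninstalls the software by hand
third_run = converge(desired, installed)   # the next run repairs the drift

print(first_run, second_run, third_run)
```

Because each run compares desired state with actual state and only acts on the difference, running it every few minutes is cheap when nothing has changed, and drift is corrected on the next run instead of going unnoticed.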

Flexibility & security - Programmable decisions at the edge

An agent-based solution like CFEngine allows you to do distributed programming and make decentralized decisions in a very real and practical way. In many cases there is no need to wait for a centralized server to give directions about what to do, or you simply don’t want to. Here are some useful ways to leverage this, based on stories from CFEngine users:

Cowboy mode: Some users have written policy so that administrators who log into a machine to make manual changes can temporarily tell CFEngine to “back off”. They create a file by running touch /var/cfengine/cowboy, and CFEngine goes into “warnings-only” mode. In case the admin forgets to remove the file, there is a report showing which hosts have this mode enabled, and a timeout which clears the file if it has been there for too long.
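The flag-file-with-timeout logic could look roughly like the Python sketch below. In practice this would be written as CFEngine policy; the function name `cowboy_mode_active`, the one-hour timeout, and the use of the file's modification time are assumptions for illustration. The demo uses a temporary directory in place of /var/cfengine.

```python
import os
import tempfile
import time


def cowboy_mode_active(flag_path, max_age_seconds, now=None):
    """Return True if the flag file exists and is recent; delete it once it has expired."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(flag_path)
    except FileNotFoundError:
        return False             # no flag: enforce policy as usual
    if age > max_age_seconds:
        os.remove(flag_path)     # timeout: the admin forgot to remove the flag
        return False
    return True                  # flag is fresh: warnings-only mode


with tempfile.TemporaryDirectory() as tmp:
    flag = os.path.join(tmp, "cowboy")  # stands in for /var/cfengine/cowboy
    before_touch = cowboy_mode_active(flag, 3600)
    open(flag, "w").close()             # equivalent of running touch
    after_touch = cowboy_mode_active(flag, 3600)
    two_hours_later = cowboy_mode_active(flag, 3600, now=time.time() + 7200)
    flag_removed = not os.path.exists(flag)
```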

Flag files: Similarly, you can make your policy react to users creating other flag files, for example to increase logging, to flag the host as problematic in an automated report, to temporarily allow a support team to SSH into the machine, or anything else you’d like to enable local admins to do.

Peer-to-peer distribution of files: Hosts can communicate with each other, not just the hub. After establishing trust (exchanging cryptographic keys) hosts can fetch policy and data files from each other. This can be used to ease network load, or as a backup / fallback mechanism in case you cannot fetch updated policy from the main policy server. Depending on your configuration, any CFEngine host can act as a policy server, or file server more generally.

Peer-based anomaly detection of file changes: If you have some files which are expected to be the same across your hosts, you can divide hosts into groups and have them check each other’s files. Outliers, where the file differs from the rest of the group, could be highlighted in a report, or have some other automatic remediation associated with them. Again, if your hosts are communicating with each other, and are relatively close in the network topology, you can reduce the load on your hub and network, and your hosts can react to changes and anomalies much more quickly than every host having to check with a centralized server.
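A minimal Python sketch of the outlier check: each host in a group contributes a digest of the file, and hosts whose digest differs from the most common one are flagged. The function `find_outliers`, the hostnames, and the file contents are hypothetical, and how the digests are exchanged between peers is left out.

```python
import hashlib
from collections import Counter


def find_outliers(digests: dict) -> list:
    """Return hosts whose file digest differs from the most common digest in the group."""
    if not digests:
        return []
    # With an exact tie there is no clear majority; this picks one side arbitrarily.
    majority, _count = Counter(digests.values()).most_common(1)[0]
    return sorted(host for host, digest in digests.items() if digest != majority)


# Hypothetical group of three hosts and the content of the same config file on each.
file_contents = {
    "web1": b"PermitRootLogin no\n",
    "web2": b"PermitRootLogin no\n",
    "web3": b"PermitRootLogin yes\n",
}
digests = {host: hashlib.sha256(data).hexdigest() for host, data in file_contents.items()}

outliers = find_outliers(digests)
print(outliers)  # ['web3']
```

Only the digests need to travel between peers, not the files themselves, which keeps the network cost of the check low.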

Decentralized decision on role and firewall rules: Policy can look at local files, installed packages, databases, and more, to make a decision about what role the host should have. CFEngine could see in a local database that this host should act as a webserver, and based on this information perform a number of steps, such as: install nginx and certbot, uninstall things you don’t want on your webservers (like compilers and development tools), configure the firewall to allow traffic on ports 443 and 80, render the appropriate configuration file to set up redirect from HTTP to HTTPS and to run certbot to get and renew the TLS certificate, etc.

Maintenance windows: Some users have special time windows when they want things to happen. Especially in finance / trading, where performance is really important, you might choose to have CFEngine not do anything during business hours on weekdays, and do your maintenance, updates and automation only at night or during weekends.

Skip parts of policy based on load: If the system is really busy (experiencing high load), you can check for this and choose to postpone some parts of your policy until later. You might want to prioritize fast and security-relevant changes, and postpone other slower and less critical tasks.
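Maintenance windows and load-based skipping are both conditions evaluated locally on each host, and can be combined. The Python sketch below assumes business hours of 08:00 to 18:00 on weekdays, a load threshold of 1.0 per CPU, and a rule that critical tasks always run; all of these, along with the function and task names, are illustrative choices rather than CFEngine defaults.

```python
from datetime import datetime


def in_maintenance_window(now: datetime) -> bool:
    """True at weekends, or on weekdays outside the assumed business hours (08:00-18:00)."""
    if now.weekday() >= 5:  # Saturday or Sunday
        return True
    return not (8 <= now.hour < 18)


def tasks_to_run(tasks, now, load_per_cpu, max_load=1.0):
    """Critical tasks always run; the rest wait for a maintenance window and low load."""
    allow_noncritical = in_maintenance_window(now) and load_per_cpu <= max_load
    return [name for name, critical in tasks if critical or allow_noncritical]


tasks = [("file_permissions", True), ("package_upgrades", False)]

weekday_noon = tasks_to_run(tasks, datetime(2024, 1, 3, 12, 0), load_per_cpu=0.2)
weekday_night = tasks_to_run(tasks, datetime(2024, 1, 3, 23, 0), load_per_cpu=0.2)
busy_night = tasks_to_run(tasks, datetime(2024, 1, 3, 23, 0), load_per_cpu=3.0)
saturday = tasks_to_run(tasks, datetime(2024, 1, 6, 12, 0), load_per_cpu=0.2)
```

On Unix-like systems, a value for `load_per_cpu` could be obtained with `os.getloadavg()[0] / os.cpu_count()`. It is passed in as a parameter here so that the decision logic stays easy to test.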

Fix SSH: If you are having problems with SSH or other remote access software, you can use CFEngine (which does not need SSH to communicate) to fix the SSH config. Taking this a step further, you could detect when SSH is not working and the CFEngine policy server is also unreachable, and then activate some kind of emergency remote access mode, for example by enabling a 3rd option for remote access.

Configure DB memory based on host specifications: Databases often have configuration values which can be tuned for better performance. When using CFEngine to render these configuration files, you can customize the database configuration based on the state and specs of the system. A high performance system with a lot of memory can naturally use more memory for the database and in general have “higher” tunable values. The config can also be adjusted based on other factors, such as day / time, for example if you need extra performance during business hours and want to use those same resources for something else at other times.
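As a rough Python sketch of this kind of tuning: the setting name `cache_size_mb`, the fractions (50% of memory during business hours, 25% otherwise), and the 128 MB floor are made-up values for illustration, not recommendations for any particular database.

```python
def db_cache_mb(total_memory_mb: int, business_hours: bool, floor_mb: int = 128) -> int:
    """Pick a cache size from the host's memory, with a larger share during business hours."""
    fraction = 0.5 if business_hours else 0.25
    return max(floor_mb, int(total_memory_mb * fraction))


def render_config(total_memory_mb: int, business_hours: bool) -> str:
    """Render one line of a hypothetical database configuration file."""
    return f"cache_size_mb = {db_cache_mb(total_memory_mb, business_hours)}\n"


print(render_config(16384, business_hours=True), end="")   # cache_size_mb = 8192
print(render_config(16384, business_hours=False), end="")  # cache_size_mb = 4096
print(render_config(256, business_hours=False), end="")    # cache_size_mb = 128
```

The same template then produces a suitable configuration on a small test VM and on a large production server, without anyone maintaining per-host values by hand.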

Redundancy - Alternate channel for remote access to a host

If SSH is down, you can fix it with CFEngine. If you have problems with NAT, DNS, or other networking issues which prevent you from opening an SSH connection from your machine to the desired host, there is a good chance that you can use CFEngine to access the host and/or to resolve the issues, since CFEngine opens a connection in the other direction, from the host to the hub, to fetch policy. All you’d have to do is put your fixes into the policy and wait for the hosts to fetch the new policy and resolve the issues. This means that if you are using something like SSH, you have CFEngine as a redundant and different option for remotely accessing your hosts, which is not affected by the same failure cases as SSH.

Conclusion

At scale, there are significant benefits to the agent-based, optimized, decentralized, and declarative approach CFEngine uses. As discussed above, these can impact security, robustness to failures, performance, bandwidth, and efficiency of staff making changes and observing the results. Contact us if you’re interested in discussing how to best use CFEngine or how CFEngine can help in your environment.

¹ Our customers often use both CFEngine and other tools at the same time.