Comparing Ansible and CFEngine

February 25, 2021

Generally speaking, CFEngine and Ansible can be used to solve the same problems, but their approaches are different. In this blog post I’d like to discuss the different approaches, their consequences, some advantages of each tool, and even using them together.

CFEngine’s autonomous agents

CFEngine works by installing and running an agent on every host in your infrastructure. It is distributed: each CFEngine agent evaluates its policy periodically and independently. The agents rely on a centralized hub for refreshing policy and reporting. Updating the policy, enforcing it, and reporting on the results are decoupled - each of these 3 steps can run with its own configuration and schedule.

The policy update is only needed when there is new policy available, so the agents won’t re-transfer all the policy files they already have. Comparing the versions of policy on the machine and avoiding re-transfers can save a lot of bandwidth on large infrastructures. If the CFEngine hub is unavailable, agents can be configured to copy policy from a backup server, or the update can be postponed until the hub is available.
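To make the idea of skipping unnecessary transfers concrete, here is a minimal sketch in Python. It is not how CFEngine implements the check; the function names and the idea of hashing the whole policy set into one fingerprint are assumptions made purely for illustration.

```python
import hashlib


def policy_fingerprint(files: dict) -> str:
    """Hash file names and contents into one version fingerprint.

    `files` maps a file name to its contents as bytes (hypothetical layout).
    """
    digest = hashlib.sha256()
    for name in sorted(files):  # sorted, so ordering never changes the result
        digest.update(name.encode())
        digest.update(b"\0")
        digest.update(files[name])
        digest.update(b"\0")
    return digest.hexdigest()


def needs_update(local_version: str, hub_version: str) -> bool:
    """Only transfer policy when the hub has something different."""
    return local_version != hub_version


# Hypothetical policy sets on the agent and on the hub.
local_policy = {"promises.cf": b"bundle agent main { }"}
hub_policy = {"promises.cf": b"bundle agent main { reports: }"}

if needs_update(policy_fingerprint(local_policy), policy_fingerprint(hub_policy)):
    print("New policy available, copying files")
else:
    print("Policy unchanged, nothing to transfer")
```

The point of the sketch is that the comparison is cheap: a host exchanges one small value with the hub and only pays for a full copy when that value differs.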

The actual enforcement of the policy, checking the state of the system and making changes, does not rely on the centralized hub. Enforcement can happen even if the host is offline or the hub is unavailable. Frequent policy enforcement is possible, even if reporting and policy updates are less frequent (to save network traffic).

Reporting necessarily involves the network and getting reporting data from hosts to the centralized hub. To reduce bandwidth, CFEngine only transmits the differences during each report collection, so if no variables change, the necessary data transfer is minimal. For the reporting aspect, the hub is a single point of failure, but this can be solved with the High Availability (HA) setup. Using HA, reporting will still work if the hub goes offline - another one will take over.
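The "only transmit the differences" idea can be sketched in a few lines of Python. This is an illustration of delta reporting in general, not CFEngine's actual wire format; the function name and the example variables are hypothetical.

```python
_MISSING = object()  # sentinel, so a stored value of None still counts as present


def report_delta(previous: dict, current: dict) -> dict:
    """Return only what changed since the last report collection."""
    changed = {
        key: value
        for key, value in current.items()
        if previous.get(key, _MISSING) != value
    }
    removed = sorted(key for key in previous if key not in current)
    return {"changed": changed, "removed": removed}


# Hypothetical variables reported by one host in two consecutive collections.
last_report = {"os": "ubuntu", "nginx_version": "1.14", "role": "webserver"}
this_report = {"os": "ubuntu", "nginx_version": "1.18", "role": "webserver"}

print(report_delta(last_report, this_report))
```

When nothing has changed, both parts of the delta are empty, which is why an idle host costs very little bandwidth per collection.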

Ansible’s orchestration

Ansible lets you orchestrate - tell it to do something and watch it go out to all hosts and make it happen. You don’t have to install an agent - as long as your hosts have ssh and Python (with some dependencies), you are good to go.

It uses ssh to connect to all your hosts and runs Python code there to perform the changes you want. Like ssh, you can use it from the command line and see the results as they happen. When working with multiple hosts, parallelization can speed up the process; this is configured with an option for the number of forks.
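As a rough picture of what a forks-style limit does, the Python sketch below caps how many hosts are worked on at the same time. It is not Ansible's implementation: Ansible uses worker processes, while threads are used here only to keep the example short, and `run_on_host` is a hypothetical stand-in for connecting over ssh and applying changes.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_host(host: str) -> str:
    """Hypothetical stand-in for 'connect over ssh and apply the changes'."""
    return f"{host}: ok"


def run_everywhere(hosts, forks=5):
    """Run the task on every host, with at most `forks` hosts in flight."""
    with ThreadPoolExecutor(max_workers=forks) as pool:
        # map() returns results in the same order as the input hosts.
        return list(pool.map(run_on_host, hosts))


for line in run_everywhere(["web1", "web2", "web3", "lb1"], forks=2):
    print(line)
```

With a cap like this, the control machine never has more than `forks` connections open at once, so a larger value only helps until the control machine or the network becomes the bottleneck.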

AWX / Ansible Tower gives you a nice GUI and allows you to run your Ansible playbooks from a dedicated machine. In the GUI, you can schedule periodic runs and access the results whenever you want.

Whether you are using the CLI or the GUI, playbooks let you perform common tasks by specifying only the necessary bits in a YAML file. There’s a great community of users writing them, so chances are somebody has already solved the problem you are facing and you can reuse their playbook.

Scalability comparison

We wanted to compare the performance of both tools, at various scales (number of hosts). The example use case is to install and configure web servers (nginx on Google Cloud VMs), with a load balancer running HAProxy. To make it interesting, the webservers are a mix of Ubuntu 18 and CentOS 7. How long does it take to run the same Ansible playbook / CFEngine policy?

Setup

We tried to make the comparison as fair as possible - in both cases, using the same infrastructure in Google Cloud, comparing the same host counts, and sticking to defaults. The previously mentioned forks option for Ansible was the only exception we made, as Ansible would be very slow with the default (5). On our setup, raising the value up to 20 gave better performance; beyond that, it did not improve further.

The setup is open source and can be found on GitLab. Anyone interested can reproduce the setup, and we encourage you to try it and make suggestions for improvements.

Results

Starting with just a few hosts, Ansible finishes quickly, within seconds. It is quite easy to see a linear relationship between the time and the number of hosts. On our setup it looks like each host adds approximately 4 seconds. We stopped at 500 hosts, as the relationship up to that point was clearly linear and each test run would exceed 30 minutes at that scale. Based on the linear relationship, we interpolated between the tested host counts and extrapolated from 500 to 1000 hosts, as can be seen in the graph above.
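The projection is simple arithmetic, and can be written out as a short Python sketch. The slope of roughly 4 seconds per host is the approximate figure from our setup; the function name is just for illustration, and the model ignores any fixed startup cost.

```python
SECONDS_PER_HOST = 4  # approximate slope observed on our setup


def projected_runtime_minutes(hosts: int, seconds_per_host: float = SECONDS_PER_HOST) -> float:
    """Linear projection of total runtime: hosts * seconds per host."""
    return hosts * seconds_per_host / 60


for hosts in (100, 500, 1000):
    print(f"{hosts} hosts: about {projected_runtime_minutes(hosts):.0f} minutes")
```

At 500 hosts this gives about 33 minutes, which matches the observation that runs at that scale exceed 30 minutes.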

In CFEngine, you don’t schedule a specific job, but rather update the policy and wait for hosts to fetch it and report the results back. For this reason, we observed the load of the hub to confirm whether CFEngine could finish the entire batch within the 5-minute interval. (A CPU load graph is shown on page 4 of this white paper.) By default, policy is updated and enforced every 5 minutes, and the hub collects reports every 5 minutes (independently). This gives us a worst-case round-trip time of 10 minutes, which we’ve plotted in the graph. We stopped at 3500 hosts, as this was our quota in Google Cloud. The CFEngine hub still had an idle period, with capacity for more hosts, within the 5-minute interval.
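The worst-case round trip follows directly from the two independent schedules: a change can just miss one policy run, and its result can just miss one report collection. A minimal Python sketch of that reasoning (the function name is hypothetical, and the time an agent run itself takes is ignored):

```python
def worst_case_round_trip(policy_interval_min: float, report_interval_min: float) -> float:
    """Worst case: wait one full policy interval, then one full report interval."""
    return policy_interval_min + report_interval_min


# Default schedules: policy every 5 minutes, report collection every 5 minutes.
print(worst_case_round_trip(5, 5))

# The same reasoning with both schedules set to 1 minute.
print(worst_case_round_trip(1, 1))
```

Note that this number does not depend on the host count, as long as the hub can finish collecting from all hosts within one interval.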

The trend here shows a clear pattern - adding more for Ansible to do (more hosts or playbooks) increases the total runtime. The Ansible approach is convenient and easy to understand, but because there is no agent, it adds some overhead: ssh connections must be established, files must be copied over, and commands must be run, all while the connection is held open and the Tower / CLI is waiting. This overhead adds up, especially at large scale, or if you want to run many of these on a regular schedule. For a small number of hosts, or one-off changes, the overhead might be worth it, since the convenience of direct (imperative) control and immediate feedback is more valuable.

Notes

The exact numbers here are not that important. They depend a lot on the use case, your hardware and network, and how you’ve configured the tool.

There are many improvements that could be made, and a second version, an optimized comparison, would be interesting. As an example, we could easily set CFEngine’s schedules to 1 minute, bringing the round-trip time down to 2 minutes. We could speed it up even further by requesting more agent runs with the cf-runagent functionality. Similarly, there are many strategies for improving the performance of Ansible.

Conclusions

Although CFEngine and Ansible can be used for the same things, which is the better option depends a lot on the situation. For most organizations, it makes sense to have both solutions available, and use the most appropriate one depending on the context.

When to use Ansible

In any situation where you’d SSH into a single machine (or perhaps a dozen with parallel ssh) and run a command, Ansible is great:

  • ssh / shell commands translate well into Ansible.
  • Ansible essentially automates the parallel ssh workflow for you.
  • As long as you are working with a limited number of hosts, your job will finish quickly and you get immediate feedback.
  • There’s a great library of playbooks available to perform common tasks.

When to use CFEngine

In some situations, you might have security / compliance requirements, or scale / complexity factors which make CFEngine a better option. For example, if you:

  • Want to make a change to a large part of your infrastructure - without being limited by the rate your Ansible Tower can connect to your hosts with SSH.
  • Need the policy to be continuously enforced and have up-to-date compliance reports to prove it.
  • Find that some things are just too slow to do in the orchestrated Ansible way - either due to a large number of hosts or complex playbooks.

Using both, together

Perhaps even more interestingly, there are situations where you can use both, together:

Aleksey Tsalolikhin at Vertical Sysadmin has some examples of how they use CFEngine and Ansible in a complementary fashion, in this blog post.

Additional reading

We have published two white papers which discuss CFEngine and Ansible:

  1. Ansible|CFEngine - Analysis and methodologies
  2. Ansible and CFEngine Scalability

If you have any questions or comments about the white papers, we’d love to have your feedback.