CFEngine in a High Performance Computing environment

11 Jun 2020

In High-Performance Computing (HPC), uptime and performance are critical. HPC often supports research and development teams working on extremely complex, computation-heavy mathematical problems, such as protein folding for vaccine development. 

HPC equipment is expensive, and the average customer has very high demands. Any downtime, performance degradation, misconfiguration, or unexpected behavior carries a financial cost and reduces the customers’ trust in the HPC provider. 

CFEngine is a configuration management tool built to manage such environments, and it excels in them. Its modular architecture, small size, fast executables, self-healing properties, and autonomous execution make it well suited to these tasks. 

In this short blog post, we will look at these aspects of CFEngine, and at how CFEngine users are saving money and improving the service they offer their own users. 

Minimal agent

CFEngine is an agent-based technology that uses a minimal agent deployed to all hosts in the infrastructure. The significance of the performance of the CFEngine agent compared to any competing solution cannot be overstated. 

Thanks to its architecture and implementation, the agent uses very few resources. It is written in C and is highly configurable to minimize its footprint. 

CFEngine has several daemons that can be running, but depending on the functionality required, most or even all of them can be turned off to reduce the footprint of CFEngine on the hosts. 

The agent can also be configured to run only during specific periods, so that it does not affect the performance of the environment during computation. Thinking in “time scales” like this is one of the things that can make CFEngine even more efficient. The agent can run often to execute tiny policy sets, and perform more demanding operations only at specific times or intervals. 
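To make the idea of time scales concrete, here is a minimal conceptual sketch in Python. It is not CFEngine policy and does not use any CFEngine API; the task names, the `heavy` flag, and the 02:00 to 04:00 window are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class Task:
    name: str
    heavy: bool  # True for demanding operations (hypothetical flag)


def in_window(now: datetime, start: time, end: time) -> bool:
    """Return True if `now` falls inside the daily window [start, end)."""
    t = now.time()
    if start <= end:
        return start <= t < end
    # The window wraps past midnight, for example 22:00 to 04:00.
    return t >= start or t < end


def tasks_for_run(tasks, now, start=time(2, 0), end=time(4, 0)):
    """Select the tasks a given agent run should execute.

    Lightweight tasks run every time; heavy tasks only inside the window.
    """
    allow_heavy = in_window(now, start, end)
    return [t.name for t in tasks if allow_heavy or not t.heavy]


# Hypothetical task list: one cheap check and one expensive operation.
tasks = [
    Task("check_sshd_config", heavy=False),
    Task("upgrade_packages", heavy=True),
]
print(tasks_for_run(tasks, datetime(2020, 6, 11, 14, 0)))  # mid-afternoon run
print(tasks_for_run(tasks, datetime(2020, 6, 11, 2, 30)))  # run inside the window
```

The frequent runs stay cheap because they only ever see the lightweight tasks, while the expensive work is confined to a period when it will not compete with computation.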

Minimal network dependency

There are “agent-less” approaches to configuration management available, but in the context of HPC, these typically do not come near meeting the requirements of a high-performance environment. 

For each run or invocation of such a technology, the node in question needs to open a network connection to the server and transfer the agent binary (it is not actually agentless, just not permanently installed), all the needed policy data, and potentially any dependencies before it can take any action. If the server is unavailable for any reason, the policy execution will not happen, which is clearly unacceptable. 

With an “agent-less” approach, if the admins or the policy server distribute or force a change, it will not be applied to any node that is unavailable on the network at that specific time. This can cause serious deviations in the infrastructure, and it makes it hard to know which servers have actually had a change applied. 

With CFEngine, a node does not need network access to the policy hub (or to any network at all, for that matter) for the agent to execute its policy. The policy is stored locally, and the node checks for policy updates at a set interval. The default is every five minutes, but this can be changed. For both distributing policy updates and reporting back, the protocol defaults to sending delta information in order to reduce bandwidth. 

This approach significantly reduces the network traffic and the dependence on network availability. Regardless of the network status or connection between the Hub and the Host, the agent will continue to function independently. 
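The behavior described above can be modeled in a few lines of Python. This is a conceptual sketch only, not how CFEngine is implemented: the `Agent` class, the `HubUnreachable` exception, and the dictionary-based policy are all hypothetical names introduced for illustration.

```python
class HubUnreachable(Exception):
    """Raised by a fetch function when the policy hub cannot be contacted."""


class Agent:
    def __init__(self, cached_policy):
        # The policy is stored locally, so it is usable without a network.
        self.policy = dict(cached_policy)

    def update(self, fetch_changes):
        """Try to pull changed policy entries; keep the cache on failure."""
        try:
            changes = fetch_changes()
        except HubUnreachable:
            return False
        self.policy.update(changes)  # apply only the entries that changed
        return True

    def run(self, fetch_changes):
        """One agent run: attempt an update, then always evaluate policy."""
        updated = self.update(fetch_changes)
        return updated, sorted(self.policy.items())


def hub_down():
    """Hypothetical fetch function simulating an unreachable hub."""
    raise HubUnreachable("no route to hub")


agent = Agent({"ntp": "v1"})
print(agent.run(hub_down))               # hub unreachable, cached policy still runs
print(agent.run(lambda: {"ssh": "v2"}))  # hub reachable, only the delta is merged
```

The point of the sketch is the ordering: the update is attempted first, but policy evaluation never depends on it succeeding.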

Large variety of hardware

It’s common for HPC systems to have several generations of hardware. However, to the users of the system, this should largely not be a concern. The management team needs to make sure that each generation of hardware has the correct configuration, software versions, and firmware. CFEngine excels at managing such a complex configuration, down to the smallest detail of these differences. 

One example is setting a specific sysctl value based on the video card detected in a host. 
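As a rough illustration of hardware-dependent configuration, the Python sketch below picks sysctl settings from the video cards detected on a host. It is not CFEngine policy; the card model names and the values are hypothetical, and `vm.nr_hugepages` is used only as an example of a sysctl key.

```python
# Hypothetical mapping from detected video card model to sysctl settings.
SYSCTL_BY_GPU = {
    "gpu-model-a": {"vm.nr_hugepages": "1024"},
    "gpu-model-b": {"vm.nr_hugepages": "4096"},
}

# Settings for hosts without a known video card (illustrative value).
DEFAULT_SYSCTL = {"vm.nr_hugepages": "0"}


def desired_sysctl(detected_gpus):
    """Return the sysctl settings for a host given its detected video cards.

    The first detected card with an entry in the table wins; hosts without
    a known card get the defaults.
    """
    for gpu in detected_gpus:
        if gpu in SYSCTL_BY_GPU:
            return dict(SYSCTL_BY_GPU[gpu])
    return dict(DEFAULT_SYSCTL)


print(desired_sysctl(["gpu-model-b"]))  # host from a newer hardware generation
print(desired_sysctl([]))               # host without a video card
```

Keeping the per-generation differences in one table means each hardware generation gets its own settings without the users of the system having to care which generation they landed on.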

Self-healing infrastructure 

A key feature of CFEngine is its self-healing capacity. The agent reacts to any change in the system by converging it back to the desired state. Because the agent executes its policy itself, this self-healing property does not depend on a network connection being available. 

Typically, as time passes, configurations drift. Sysadmins make necessary changes manually, or simply log in to check on the status of a machine, and in the process they are prone to changing or breaking configurations or other parameters: installing a new shell and setting it as the default, changing who is in the sudoers list, or taking other actions that lead to deviations from the desired state. This can have severe consequences over time. CFEngine makes sure all resources stay configured correctly. 
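The convergence idea can be sketched in Python as an idempotent repair step: compare the actual state with the desired state and fix only what has drifted. This is a simplified model under our own assumptions, not CFEngine's implementation; the keys and values below are hypothetical.

```python
def converge(actual, desired):
    """Bring `actual` in line with `desired` and return the keys repaired.

    Only settings that deviate are touched, so running this again on an
    already-correct system changes nothing.
    """
    repaired = []
    for key, want in desired.items():
        if actual.get(key) != want:
            actual[key] = want
            repaired.append(key)
    return repaired


# Hypothetical desired state and a host that has drifted from it.
desired = {"default_shell": "/bin/bash", "sudoers": ("alice", "bob")}
actual = {"default_shell": "/bin/zsh", "sudoers": ("alice", "bob")}

print(converge(actual, desired))  # first run repairs the drifted setting
print(converge(actual, desired))  # second run finds nothing to repair
```

Because every run ends in the same state regardless of where it started, it does not matter how or when the drift was introduced.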

Reporting

Knowledge is power, and CFEngine gives you detailed knowledge about your systems at all times. Reporting is key to knowing that the changes you roll out are applied correctly throughout the infrastructure. 

CFEngine can now collect reporting data from multiple Hubs, in different configurations, security zones or locations, to provide a single pane of glass into the entire infrastructure. 

Integrations

CFEngine reporting can easily be extended to integrate with a multitude of other tools. One important feature that makes CFEngine valuable in such an environment is the Jira integration.

CFEngine can automatically open a ticket in Jira (or another issue tracker) once a certain condition is observed in the infrastructure. Examples include a certain database growing too large or the permissions of a user being incorrect. Once the condition is no longer observed, CFEngine can automatically close the ticket again. This rapid response model lets system engineers work much more efficiently.
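The open-and-close lifecycle described above can be sketched as a small state machine in Python. Note that `InMemoryTracker` is a hypothetical stand-in and not a Jira client, and `ConditionWatcher` is our own illustrative name; the sketch makes no claim about how the actual integration is built.

```python
class InMemoryTracker:
    """Hypothetical stand-in for an issue tracker; not a real Jira client."""

    def __init__(self):
        self.tickets = {}  # ticket id -> {"summary": ..., "status": ...}
        self._next_id = 1

    def open(self, summary):
        ticket_id = self._next_id
        self._next_id += 1
        self.tickets[ticket_id] = {"summary": summary, "status": "open"}
        return ticket_id

    def close(self, ticket_id):
        self.tickets[ticket_id]["status"] = "closed"


class ConditionWatcher:
    """Keeps at most one open ticket for a single observed condition."""

    def __init__(self, tracker, summary):
        self.tracker = tracker
        self.summary = summary
        self.open_ticket = None

    def observe(self, condition_present):
        """Record one observation and return the open ticket id, if any."""
        if condition_present and self.open_ticket is None:
            self.open_ticket = self.tracker.open(self.summary)
        elif not condition_present and self.open_ticket is not None:
            self.tracker.close(self.open_ticket)
            self.open_ticket = None
        return self.open_ticket


tracker = InMemoryTracker()
watcher = ConditionWatcher(tracker, "database exceeds size limit")
for seen in (True, True, False):
    watcher.observe(seen)
print(tracker.tickets)
```

Tracking the open ticket in the watcher is what prevents a new ticket from being filed on every run while the condition persists.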

The value of CFEngine

One of our customers is a great showcase of the value of CFEngine in an HPC environment. 

They have two data centers: one with around 10 000 servers, and another approximately twice that size. The larger one is managed without CFEngine, using various ad-hoc tools and processes instead. The team for the non-automated data center is several times larger. Despite that, it spends significantly more time deploying new software, struggles with much more sprawl, and struggles to report correctly on the state of its infrastructure. None of these are problems that the CFEngine-based team faces. 

In the last year, the data center automated with CFEngine had no outages or severe issues, while the non-automated data center had three major outages caused by incorrect patching attempts. 

The costs of these outages are significant. They underscore not just the operational value of using CFEngine, but the direct monetary impact of not using it. That team has since started looking at a path to adopting CFEngine, and we are excited to bring them on board.

Nils Christian Roscher-Nielsen