Working with external data, a look at classfiltercsv()

Posted by Nick Anderson
October 21, 2021

When working with CFEngine, it’s common to hear advice about separating data from policy. Doing so allows for separation of concerns, delegation of responsibilities, and integration with other tooling. Each organization is different, and a strategy that works well in one environment may not work as well in a similar environment at another organization, so CFEngine provides several generic ways to leverage external data. For example, Augments (def.json) is useful for setting classes and defining variables very early in agent execution. Those classes and variables apply to the entire policy, can differ based on system characteristics, and can carry host-specific data.

Functions for parsing strings:

Functions for parsing files:

Let’s take a look at one of the newer additions, classfiltercsv(), which was introduced in 3.14.0, and at how it simplifies using a CSV file with class expressions to derive appropriate values.

CSV files have been a common pattern in CFEngine for a long time. A column containing a class expression is a common way to express the appropriate context for a data value. Historically, such a file would be parsed by something like readstringarray() or data_readstringarray(). The indices of the parsed data would then be retrieved (see getindices()) and used to iterate over the data, testing each row's class expression to find the rows that are relevant.

For example, let’s use this fact-based data to craft the content of a report.

#ClassExpression,Feeling,Reason
linux.Afternoon,Good,I am running my preferred platform and it's near luch. I hope it's Ramen!
linux.Morning,Good,I just had breakfast and I don't have broken windows nor rotten fruit.
windows,Bad,nothing is ever good if I am running Windows.
darwin,Bad,who want's to compute with rotten fruit?

This policy parses the file, figures out which rows are relevant and then uses the data.

bundle agent __main__
{
  vars:
      # Parse the file into a data container
      "_d"
        data => data_readstringarray("/tmp/my.csv",  # filename
                                     "#[^\n]*",      # comment
                                     ",",            # split
                                     inf,            # maxentries
                                     inf);           # maxbytes

      # Get an index to iterate with
      "_di" slist => getindices( _d );

  reports:
    # Here we iterate over the class expressions $(_di) and print out the data from the matching rows
      "I feel $(_d[$(_di)][0]) because $(_d[$(_di)][1])"
        if => "$(_di)";

    # Let's also take a look at the data structure resulting from data_readstringarray()
      "data_readstringarray() returned: $(with)" with => storejson( _d ),
        unless => "no_show_data";
}

Execution output:

R: I feel Good because I am running my preferred platform and it's near luch. I hope it's Ramen!
R: data_readstringarray() returned: {
  "darwin": [
    "Bad",
    "who want's to compute with rotten fruit?"
  ],
  "linux.Afternoon": [
    "Good",
    "I am running my preferred platform and it's near luch. I hope it's Ramen!"
  ],
  "linux.Morning": [
    "Good",
    "I just had breakfast and I don't have broken windows nor rotten fruit."
  ],
  "windows": [
    "Bad",
    "nothing is ever good if I am running Windows."
  ]
}

Now let’s use the same data with classfiltercsv().

bundle agent __main__
{
  vars:
    # Parse the file into a data container
      "_d"
        data => classfiltercsv("/tmp/my.csv",  # filename,
                               "no",           # has_header
                               "0");           # class_column
                                               # optional sort column

      # Get an index to iterate with
      "_di" slist => getindices( _d );

  reports:
    # Here we iterate over the row indices $(_di); classfiltercsv() has already kept only the matching rows
      "I feel $(_d[$(_di)][0]) because $(_d[$(_di)][1])";

    # Let's also take a look at the data structure resulting from classfiltercsv()
      "classfiltercsv() returned: $(with)" with => storejson( _d ),
        unless => "no_show_data";
}

Execution output:

R: I feel Good because I am running my preferred platform and it's near luch. I hope it's Ramen!
R: classfiltercsv() returned: [
  {
    "0": "Good",
    "1": "I am running my preferred platform and it's near luch. I hope it's Ramen!"
  }
]

While the implementation of these policies is very similar, classfiltercsv() can bring significant performance and efficiency improvements as the volume of data increases, since the policy no longer has to iterate over every row and test its class expression.
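To make the filtering step concrete, here is a rough shell emulation of what happens to each row. It is only a sketch and not how CFEngine implements it: the list of defined classes and the shortened sample rows are hypothetical, and it only understands `.` (AND) in class expressions, whereas real class expressions also support `|`, `!` and parentheses.

```shell
# Hypothetical set of classes defined on this host.
defined="linux Afternoon"

# Shortened sample rows: class expression, feeling, reason.
rows='linux.Afternoon,Good,near lunch
linux.Morning,Good,just had breakfast
windows,Bad,running Windows'

# Keep a row only when every class in its dot-separated expression is defined,
# then print the remaining columns without the class expression column.
matched="$(printf '%s\n' "$rows" | awk -F, -v defined="$defined" '
  BEGIN { n = split(defined, d, " "); for (i = 1; i <= n; i++) have[d[i]] = 1 }
  {
    ok = 1
    m = split($1, parts, "[.]")
    for (j = 1; j <= m; j++) if (!(parts[j] in have)) ok = 0
    if (ok) print $2 "," $3
  }')"
echo "$matched"
```

The point of the sketch is that the filtering happens once, while the file is read, so the consumer only ever sees the rows that apply to the current host.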

Let’s add some more rows to the data for consideration.

for i in $(seq 9000); do
    echo "Class$RANDOM,Feel$RANDOM,Reason$RANDOM" >> /tmp/my.csv
done
unix2dos /tmp/my.csv
unix2dos: converting file /tmp/my.csv to DOS format...

You might wonder why unix2dos is run on the .csv file. This is because classfiltercsv() requires CRLF line endings, per RFC 4180.
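If unix2dos is not installed, the same conversion can be approximated with awk, and the result can be checked by counting lines that end in a carriage return. This is a minimal sketch: the scratch file and its two sample rows are hypothetical stand-ins for /tmp/my.csv.

```shell
# Hypothetical scratch file, so the sketch does not touch /tmp/my.csv.
csv="$(mktemp)"

# Start from LF-terminated rows, like the ones appended by the loop above.
printf '%s\n' \
  '#ClassExpression,Feeling,Reason' \
  'linux.Morning,Good,I just had breakfast' > "$csv"

# Convert LF to CRLF; stripping an existing CR first keeps the step idempotent.
awk '{ sub(/\r$/, ""); printf "%s\r\n", $0 }' "$csv" > "$csv.crlf"

# Count the lines that end in a carriage return and compare with the total.
cr="$(printf '\r')"
crlf_lines="$(grep -c "${cr}\$" "$csv.crlf")"
total_lines="$(wc -l < "$csv.crlf" | tr -d ' ')"
echo "CRLF lines: $crlf_lines of $total_lines"
```

When both counts match, every line in the file ends in CRLF.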

Now, let’s compare the results of running each, first with data_readstringarray():

/bin/time cf-agent -KIf /tmp/data_readstringarray.cf | head -n 10

R: I feel Good because I am running my preferred platform and it's near luch. I hope it's Ramen!
R: data_readstringarray() returned: {
  "Class10000": [
    "Feel19773",
    "Reason27476"
  ],
  "Class1001": [
    "Feel15419",
    "Reason11564"
  ],
14.91user 0.00system 0:14.95elapsed 99%CPU (0avgtext+0avgdata 24424maxresident)k
0inputs+320outputs (8major+7905minor)pagefaults 0swaps

And again with classfiltercsv():

/bin/time cf-agent -KIf /tmp/classfiltercsv.cf | head -n 10

R: I feel Good because I am running my preferred platform and it's near luch. I hope it's Ramen!
R: classfiltercsv() returned: [
  {
    "0": "Good",
    "1": "I am running my preferred platform and it's near luch. I hope it's Ramen!"
  }
]
0.14user 0.00system 0:00.16elapsed 89%CPU (0avgtext+0avgdata 11504maxresident)k
0inputs+320outputs (8major+1879minor)pagefaults 0swaps

As we can see from the output above, for the given data set, classfiltercsv() is about 100 times faster: 0.16 seconds elapsed versus 14.95 seconds.
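For anyone who wants to check the arithmetic, the ratios can be worked out directly from the two /bin/time summaries. The variable names below are only illustrative; the numbers are copied from the output above.

```shell
# Ratios of the data_readstringarray() run to the classfiltercsv() run.
elapsed_ratio="$(LC_ALL=C awk 'BEGIN { printf "%.1f", 14.95 / 0.16 }')"
user_ratio="$(LC_ALL=C awk 'BEGIN { printf "%.1f", 14.91 / 0.14 }')"
memory_ratio="$(LC_ALL=C awk 'BEGIN { printf "%.1f", 24424 / 11504 }')"
echo "elapsed: ${elapsed_ratio}x, user CPU: ${user_ratio}x, peak memory: ${memory_ratio}x"
```

This gives roughly 93x for elapsed time, 106x for user CPU time, and 2.1x for peak resident memory, so "about 100 times" is a fair summary for this data set.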

I hope this motivates you to become familiar with the wide assortment of functions available and find other areas of your policy that can be improved.
