Facebook 'likes' CFEngine!

June 24, 2010

CFEngine has played a major role in allowing Facebook to scale from one dorm-room server to the largest data centers in the world. Tom Cook explained it all with high velocity.

“Velocity 2010 Conference”

It’s June at the Velocity 2010 conference in Santa Clara, California, and Facebook Systems Engineer Tom Cook is providing a glimpse inside the daily life of the Operations team. An important part of the job at Facebook is systems management. He tells us that Facebook chose CFEngine as its configuration management solution, and that it gives the company’s engineers peace of mind.

In his presentation, Cook traces Facebook’s humble beginnings, from its launch on a single server in founder Mark Zuckerberg’s Harvard University dorm room just over 6 years ago to today’s more than 400 million active users. (According to Facebook’s website, this figure passed 500 million this past July.) This explosive growth has Facebook building a new data center in Oregon, in addition to its other server sites in Virginia and California.

These centers manage millions of applications and pieces of content, with users uploading over 3 billion photos each month. Facebook users spend an average of 16 million minutes a day on the site and share 6 billion pieces of content per week. Since Facebook is customized for each individual user, a large volume of information is exchanged. Facebook engineers face the challenge of servicing requests quickly and making changes while avoiding pitfalls such as downtime, miscommunication and undocumented changes.

CFEngine updates all the servers while checking more than 1000 “promises” in a matter of seconds.

“We Use CFEngine”

CFEngine is probably the most popular data center automation solution available today. It is the universal glue that makes IT work. The larger the organization, the more complex the infrastructure. This is why CFEngine is especially popular among large organizations.

As a large, expanding company, Facebook has been able to cope with its extremely rapid growth thanks to CFEngine. Facebook updates its servers every 15 minutes and needs a program that is fast. “CFEngine is a great option”, says Cook. He mentions that, unlike other sites that do a slow roll or may experience downtime, Facebook does not work that way. “CFEngine evaluates a little over 100 policies and 1000 rules in around 30 seconds, which allows us to make changes quickly and push them into production”.

“No commit and quit”

Cook explains that serving information rapidly is a major concern for Facebook engineers. So much is going on that the site needs a combination of teams and automated processes to oversee its operation as a whole. How is this done?

First, the engineers see Facebook as a platform on which they can build; it acts as both a stable forum and an area to experiment with new features like Facebook Places. Second, Facebook engineers write, test, debug and deploy their own code. This means they are involved in the entire coding process from start to finish. Third, communication is key at Facebook, and engineers have an internal news portal to share information, give updates and keep up with what’s going on. Thus, the engineering culture at Facebook is intense, and engineers have a symbiotic relationship with one another and with the Operations team. Finally, they want the flexibility to make changes at any time of the day, logging everything.

CFEngine, like Facebook, is committed to the development process from start to finish. We base solutions on the principle of automation for complex or large installations, and develop new code for this purpose.

After years of research and development, CFEngine takes a pragmatic and long-term approach to automation. In our passion for developing new techniques and theoretical results, we have raised the industry bar for more than ten years.

“How do we actually manage all those systems?” Cook asks. “There’s a lot of churn on our network…we have to have some form of control at the end of the day to guarantee our people a stable place for development”. The solution is the site’s heavy use of automation, which saves time. Cook stresses that configuration management is “a must”, whether you have 2 or 2000 servers. Configuration management is one part of Facebook’s two-part system management; the other is “on demand tools”. For the latter, Facebook has written a program that suits its own needs. Cook says that configuration management systems “get you away from having to write shell scripts…error checking is handled for you.”
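
To make Cook’s point about shell scripts concrete, here is a minimal Python sketch of the check-then-repair idea behind a configuration “promise”. It is an illustration only, not CFEngine’s implementation or policy language; the function name promise_line_in_file, the file name example.conf and the setting it writes are all hypothetical.

```python
import os
import tempfile


def promise_line_in_file(path, line):
    """Ensure the file at `path` contains `line` (hypothetical helper).

    Returns "kept" if the line was already present, or "repaired" if it
    had to be appended. Running it repeatedly is safe: once the desired
    state is reached, later runs change nothing.
    """
    try:
        with open(path, encoding="utf-8") as f:
            text = f.read()
    except FileNotFoundError:
        # A missing file is treated as an empty one that needs repair.
        text = ""

    if line in text.splitlines():
        return "kept"

    # Avoid gluing the new line onto an unterminated last line.
    separator = "" if text == "" or text.endswith("\n") else "\n"
    with open(path, "a", encoding="utf-8") as f:
        f.write(separator + line + "\n")
    return "repaired"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "example.conf")  # hypothetical file
        print(promise_line_in_file(target, "max_connections = 100"))
        print(promise_line_in_file(target, "max_connections = 100"))
```

The point of the sketch is that the desired end state is described once, and the state check and the handling of the missing-file case live in one reusable place rather than being rewritten in every ad hoc script.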

Cook points out that Facebook has one engineer per 1.1 million users to rely on for support. It would be impossible for any human to tackle this alone.

Join the club

Facebook’s engineers are busy bees making the most communicative tool on the Internet run faster and smoother, with more than a little help from CFEngine.

CFEngine has ensured that Facebook’s service is reliably available to its millions of users. No other system scales to tens of thousands of machines and is just as ‘at home’ on laptops, desktops and embedded devices as it is on mainframes.

Do you like CFEngine? Then join the CFEngine Facebook page.

References

Facebook datacenter facts