Bushwhacking A New Path for Infrastructure Automation
“Automate all the things!” is a rallying cry you’ve certainly heard before; when you encounter repetitive work while you’re shipping a software service, your ultimate goal is to make the work the computer’s job, not yours.
Here on the Advertising Studio Platform Engineering team, we’ve been on this journey towards more automation for some time (like many of you). Unlike much of the Salesforce platform, though, our software runs on the .NET stack. So, in many respects, we have to break new ground as we push for more automation, and that brings a unique set of challenges to solve.
This is the first in a series of posts chronicling our journey to apply these industry best practices (like “infrastructure as code” and “immutable servers”) to our own world. Hopefully those of you in a similar position will get the benefit of our experience!
Our Vision
As we started down the path of automation, our goals were quite straightforward: standard operational requests, like provisioning and capacity addition, should require no involvement from infrastructure team members. All systems should be kept patched, consistent and up-to-date — without manual intervention.
Every element of the infrastructure should be able to be rebuilt quickly and with little effort. Simple, right?
My Journey
The first step — and the most time-consuming and laborious — was simply learning all the concepts and terminology. I was a software engineer joining an experienced operations team, not an expert in infrastructure. Coming from that background, with limited systems administration knowledge, it was a long (and at times, bewildering) slog to learn the basics. What’s a “snowflake server”? What does “server drift” mean? And so on.
Many books and internet research sessions later, I managed to get a comprehensive grounding in the concepts and tools we would need going forward. (For posterity, by far the best references we found were “Infrastructure as Code” and “The DevOps 2.0 Toolkit: Automating the Continuous Deployment Pipeline with Containerized Microservices”.)
Initially, most of the tooling and articles we found focused on Amazon’s cloud offering, AWS. Since our infrastructure runs on a different cloud provider, we didn’t find as much direct documentation available.
Ultimately, though, most of what we needed was available through the OpenStack API, an open source cloud API that’s relatively widespread. We also noticed that because we’re working with Windows, the path to automation wasn’t nearly as well-trod as it is in the Linux world; we ended up with a fairly long list of issues and workarounds (too long for this blog, but let us know in the comments if you want more detail on that).
In parallel, we learnt about configuration management. Luckily, there are only a handful of main contenders in that space: Chef, Ansible, Puppet and SaltStack. We gave each one a try and finally settled on Chef: it was already in use elsewhere at Salesforce, it was very well-documented and popular, and it was easily extensible through Ruby.
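To give a sense of why that Ruby extensibility matters on Windows: a Chef recipe is plain Ruby, so Windows-specific steps sit alongside the built-in resources. Here’s a minimal illustrative sketch, not a real cookbook of ours — the cookbook name, paths and port are invented, and windows_feature comes from the community “windows” cookbook of that era:

```ruby
# cookbooks/our_service/recipes/default.rb -- illustrative only;
# the cookbook name, paths and port below are hypothetical.

# windows_feature is provided by the community "windows" cookbook
windows_feature 'IIS-WebServerRole' do
  action :install
end

directory 'C:\\services\\our_service' do
  recursive true
  action :create
end

# Anything without a ready-made resource can drop down to PowerShell
powershell_script 'open port 8080 for our_service' do
  code 'New-NetFirewallRule -DisplayName "our_service" -LocalPort 8080 -Protocol TCP -Action Allow'
  guard_interpreter :powershell_script
  not_if '[bool](Get-NetFirewallRule -DisplayName "our_service" -ErrorAction SilentlyContinue)'
end
```

The same recipe runs unchanged on a local virtual machine and on a real server, which mattered a lot for the next step.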
A month later, we were ready to knuckle down and start coding.
Vagrant quickly proved a godsend. It’s pretty much the essential item in our toolbox from which everything else flows. We soon became comfortable experimenting with setup and configuration on local virtual machines. We then moved on to trying to automate provisioning of our servers.
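To make that concrete, here’s a minimal sketch of the kind of Vagrantfile we mean. The box and cookbook names are placeholders, but the winrm line is the important bit for Windows guests:

```ruby
# Vagrantfile -- a minimal sketch; box and cookbook names are placeholders.
Vagrant.configure('2') do |config|
  config.vm.box = 'windows-server-2012r2'  # hypothetical local base box
  config.vm.communicator = 'winrm'         # Windows guests speak WinRM, not SSH

  config.vm.provider 'virtualbox' do |vb|
    vb.memory = 4096
  end

  # Exercise the same Chef cookbooks locally that real servers will run
  config.vm.provision 'chef_solo' do |chef|
    chef.cookbooks_path = 'cookbooks'
    chef.add_recipe 'our_service::default'
  end
end
```

From there, vagrant up, vagrant provision and vagrant destroy give you a fast, disposable feedback loop for cookbook changes.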
Our first attempt involved writing our own infrastructure definition tool, since our needs seemed quite minimal. Pioneer (you win a prize if you can guess how we came up with the name; here’s a hint: Terraform) quickly devolved into a convoluted mess as we realized we were basically reinventing the wheel. Resource dependencies, orchestration, working with various cloud APIs — that’s a lot! We decided we might as well bite the bullet and learn to use the more fully-featured open source tools available!
Our first contender was Terraform by HashiCorp. (Other tools, such as Chef Provisioning, CloudFormation or OpenStack Heat, weren’t easy to incorporate on our cloud hosting platform.) We had the good luck of knowing a few of the developers working at HashiCorp, so after learning a bit about it, we enthusiastically jumped on the Terraform bandwagon.
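If you haven’t seen Terraform before, an infrastructure definition is just a declarative text file. As a hedged sketch against the stock OpenStack provider (every name and value below is a placeholder):

```hcl
# main.tf -- a sketch with placeholder names and values
provider "openstack" {
  auth_url = "https://identity.example.internal:5000/v2.0"
}

resource "openstack_compute_instance_v2" "web" {
  count       = 2
  name        = "web-${count.index}"
  image_name  = "windows-2012r2-base"
  flavor_name = "m1.large"
}
```

Running terraform plan shows what would change; terraform apply makes it so. That’s the “infrastructure as code” loop in a nutshell.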
After our initial enthusiasm, we discovered that our cloud hosting provider didn’t work well out of the box with Terraform (or any other tool we could find!). To fix this, we decided to extend Terraform by writing what it calls “providers” and “provisioners”. To our delight, this worked really well — HashiCorp produces well-engineered and extensible open source tools, and it only took us a few days to produce a first version of our custom provider. (And this despite having to learn the Go language at the same time, something we’d had no exposure to previously!)
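We can’t share our provider here, but the skeleton Terraform has you fill in looked roughly like this at the time (the ourcloud_server resource and its attributes are invented for illustration):

```go
package main

import (
	"github.com/hashicorp/terraform/helper/schema"
	"github.com/hashicorp/terraform/plugin"
	"github.com/hashicorp/terraform/terraform"
)

// resourceServer describes one resource type and its CRUD hooks.
// "ourcloud_server" and its attributes are invented for illustration.
func resourceServer() *schema.Resource {
	return &schema.Resource{
		Create: resourceServerCreate,
		Read:   resourceServerRead,
		Delete: resourceServerDelete,
		Schema: map[string]*schema.Schema{
			"name": {Type: schema.TypeString, Required: true, ForceNew: true},
		},
	}
}

func resourceServerCreate(d *schema.ResourceData, meta interface{}) error {
	// Call the cloud's API here, then record the new resource's ID
	// so Terraform can track it in its state file.
	d.SetId(d.Get("name").(string))
	return nil
}

func resourceServerRead(d *schema.ResourceData, meta interface{}) error   { return nil }
func resourceServerDelete(d *schema.ResourceData, meta interface{}) error { return nil }

func main() {
	// Terraform discovers and runs the provider as a plugin binary.
	plugin.Serve(&plugin.ServeOpts{
		ProviderFunc: func() terraform.ResourceProvider {
			return &schema.Provider{
				ResourcesMap: map[string]*schema.Resource{
					"ourcloud_server": resourceServer(),
				},
			}
		},
	})
}
```

Most of the real work lives in the Create/Read/Delete functions; the framework handles state, diffing and dependency ordering for you.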
The next natural step was working with Packer, another Hashicorp product, to produce machine images in order to speed up provisioning. Again we had to extend the core product to deal with the quirks of our cloud hosting provider, but it proved doable.
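For reference, a Packer template of that era was a JSON file pairing a builder (where to boot a temporary machine) with provisioners (how to configure it before snapshotting it into an image). A hedged sketch, with placeholder names throughout:

```json
{
  "builders": [
    {
      "type": "openstack",
      "image_name": "windows-2012r2-base-{{timestamp}}",
      "source_image": "REPLACE-with-upstream-image-id",
      "flavor": "m1.medium",
      "communicator": "winrm",
      "winrm_username": "Administrator"
    }
  ],
  "provisioners": [
    {
      "type": "chef-solo",
      "cookbook_paths": ["cookbooks"],
      "run_list": ["our_service::base"]
    }
  ]
}
```

The payoff: new servers boot from an image with most configuration already baked in, so provisioning drops from a long Chef converge to little more than a boot.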
Current State of Play
The past few months have been challenging, but also incredibly stimulating. The key learning is that sure, technical challenges exist … but they are surmountable with a bit of patience and determination. Taking a solid, battle-hardened system and transitioning it to full automation is not straightforward, but it is doable. We’ve come up with a workflow we’re happy with — but, naturally, there are still many unanswered questions we’ll have to address soon. (Not to mention other interesting approaches, like containerization, that we’ve barely scratched the surface of!)
Stay tuned for post #2, where we’ll talk more about the intricacies of Vagrant and how it worked wonders for us.
Olivier Kouame is a Principal Engineer on the Advertising Studio team. His twin passions are software engineering in all its forms and the study of organizational behaviour and occupational psychology. Previously a developer and software engineering manager, he’s now leading an infrastructure team focused on spreading some DevOps love while running the Code Academics group dedicated to bridging the gap between academic research and real-life software engineering.