On Transitioning to SRE

Fri, Mar 12, 2021 11-minute read

Introduction

I’ve been focused on platform engineering for the last 5 years of my career. This means that some of the topics I’ve been actively involved in are DevOps, infrastructure as code and SRE. In this post we’ll examine the latter, SRE (Site Reliability Engineering), and understand how it can be introduced into an IT operations group where software engineering is not part of the cultural DNA.


Team InfraOps and their SRE Journey

My favorite definition of SRE is “what happens when software engineers are asked to do ops”. And what are some of our main traits as software engineers? Well, we’re lazy. But we’re also experts at writing software to solve our pain points and over time we become very good at finding automated solutions for every challenge we face.

Let’s follow the adventures of Team InfraOps, an imaginary ops team responsible for managing and provisioning compute and storage resources in the cloud for WidgetMaker Corporation. Last year, the team provisioned 200 Azure virtual machines by hand or by running shell scripts individual engineers wrote to make their life easier. This year, they’re expecting to provision 2000 and have decided that, in the true SRE spirit, they need to automate this process in order to free themselves up for more value-add activities.

Their very first step was drawing out the sequence of activities required for provisioning a virtual machine. If you’ve never worked in a large enterprise, especially one that operates across multiple clouds and on-premises, you might be surprised at the amount of work involved - here is a simplified sequence of tasks that needed to be accomplished:

  • Step 1 - Create the virtual machine image, starting with a base image from a vendor and layering on top the organizational policies and standard configuration
  • Step 2 - Create a configuration item in the corporate CMDB to represent the virtual machine
  • Step 3 - Create a change request against the virtual machine
  • Step 4 - Get the change request approved
  • Step 5 - Obtain an IP address from the corporate IPAM systems
  • Step 6 - Provision the VM in the cloud environment
  • Step 7 - Run pre-installation checks to ensure the VM is configured as expected
  • Step 8 - Create functional accounts for use with the VM
  • Step 9 - Create IAM identities for use with the VM
  • Step 10 - Add the VM to the monitoring systems
  • Step 11 - Add the VM to the patching and release management systems
  • Step 12 - Add the appropriate backup policies to the VM
  • Step 13 - Install the software packages requested by the end-user
  • Step 14 - Run post-installation checks to ensure the VM is configured as per the corporate standard
  • Step 15 - Update the configuration item in the CMDB
  • Step 16 - Close the change request
  • Step 17 - Notify the requestor that their VM is now available for use and make any credentials or secrets securely available

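The sequence above maps naturally to a linear script. As a rough sketch (every step name below is a hypothetical placeholder, not WidgetMaker’s actual tooling), the happy path looks something like this:

```python
# Hypothetical sketch: the provisioning workflow as an ordered list of steps.
# The step names mirror the list above; the implementations are left as stubs.

PROVISIONING_STEPS = [
    "build_image",                 # Step 1
    "create_cmdb_item",            # Step 2
    "open_change_request",         # Step 3
    "approve_change_request",      # Step 4
    "reserve_ip_address",          # Step 5
    "create_cloud_vm",             # Step 6
    "run_pre_install_checks",      # Step 7
    "create_functional_accounts",  # Step 8
    "create_iam_identities",       # Step 9
    "register_with_monitoring",    # Step 10
    "register_with_patching",      # Step 11
    "apply_backup_policies",       # Step 12
    "install_requested_software",  # Step 13
    "run_post_install_checks",     # Step 14
    "update_cmdb_item",            # Step 15
    "close_change_request",        # Step 16
    "notify_requestor",            # Step 17
]

def provision_vm(request, steps=PROVISIONING_STEPS, run=lambda name, req: None):
    """Execute each step in order; `run` dispatches to the real implementation
    (a shell script, an SDK call, or - later in this story - an HTTP API call)."""
    completed = []
    for step in steps:
        run(step, request)
        completed.append(step)
    return completed
```

This strictly sequential shape is exactly what the team’s first automation pass reproduced - and, as we’ll see, exactly where its limitations came from.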
First Pass

This first pass at automation revolved around using Azure DevOps pipelines to execute Terraform plans and Ansible playbooks in order to deliver the workflow. It did not need to be Azure DevOps - it could have been Jenkins, Spinnaker or any other similar product. Azure DevOps was selected because it was familiar to the team as they had previously used it in other scenarios. This is an important point, and one that bears repeating. It was important that the team leverage as much of their prior knowledge as possible. Jumping straight to the bleeding edge of software engineering, such as GitOps or continuous deployment, would have all but guaranteed the failure of the project.

Once this first phase was delivered, the team had a single pipeline executing the process workflow. They had delivered the result they wanted and automated the virtual machine provisioning process. But were they done? I’m sure the mind of any software engineer reading this immediately jumps to topics such as maintainability, testing and observability. Let’s go through the potential challenges the team will face if they stopped here:

  • Lack of modularity and reuse
    The fact that tasks are defined inline means they cannot be reused across pipelines or for other use cases. Every single business process having to raise a change request - which is most of them in an enterprise environment - would have to copy-paste the underlying code wherever it was used. This would have very quickly led to an unmaintainable Big Ball of Mud.

  • Lack of error-handling
    Errors and failures happen, and they happen very often in the infrastructure world. Whatever automation the team put in place needed to be able to handle these challenges, and pipeline-oriented systems are not suited to accommodate these use cases. Error-handling frameworks would have needed to be designed from scratch and built into each individual pipeline.

  • Inability to mix manual tasks with automated tasks
    The tasks in the pipeline are sequential and automated. Many of the more complicated (and interesting) processes are a mix of manual and automated steps, so our approach should be able to handle these.

  • Lack of observability
    Pipelines provide logs, but that’s about the limit of the observability affordances they offer. Much of the information essential for debugging is never captured.

  • Lack of testing
    It’s not possible to put in place sufficient testing and quality checks for the code contained in the pipelines. Ideally, we would want to unit test specific components and then do integration testing and end-to-end testing around the entire workflow.

All of the above are issues that can be handled by approaching the work with a software engineering mindset - moving from an “operations” culture to an “SRE” culture. The key is to think in terms of reusable blocks instead of monolithic scripts. You want to leverage reusable APIs instead of chaining together sequential commands like you would in a shell script. The steps in the pipeline should be independent building blocks that any team can orchestrate to create the workflows they need. By taking this approach, we can introduce the usual testing and observability mechanisms into the codebase, as well as leverage third-party frameworks such as orchestration services to introduce better sequencing and allow for appropriate error-handling.
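To make the “reusable block” idea concrete, here is a minimal sketch of one building block: a generic JSON API caller with retry and backoff baked in, so error-handling lives in one place instead of being reinvented in every pipeline. The endpoint and payload in the comment are made up for the example:

```python
import json
import time
import urllib.request

def call_api(url, payload, attempts=3, backoff=1.0, send=None):
    """POST a JSON payload, retrying transient failures with exponential backoff.

    `send` defaults to a real HTTP POST; tests can inject a fake transport.
    """
    if send is None:
        def send(url, body):
            req = urllib.request.Request(
                url, data=body, headers={"Content-Type": "application/json"}
            )
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
    body = json.dumps(payload).encode()
    for attempt in range(1, attempts + 1):
        try:
            return send(url, body)
        except Exception:
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(backoff * 2 ** (attempt - 1))

# Any workflow can now reuse the same block, e.g. (hypothetical endpoint):
# change = call_api("https://itsm.example.com/api/change-requests",
#                   {"ci": "vm-001", "type": "standard"})
```

Because the retry policy and the transport are parameters rather than hard-coded, the same block is unit-testable in isolation - exactly the property the inline pipeline tasks lacked.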

Coming to this realization was a game changer for Team InfraOps. They knew they had deep expertise in ops and the actual processes which were being automated, but needed some software engineering support to successfully execute this transition. The team reached out to the software engineering teams and secured the assistance of a software architect who agreed to help them deliver the next part of the journey.

Plan


In order to move to this new world of orchestrated APIs, the team and the software architect arrived at the following plan:

  • Agree on an API framework and standardized tooling.
  • Break out the tasks into independent services abstracted behind HTTPS APIs.
  • Change the Azure DevOps pipeline to make HTTP calls to the services.
  • Migrate the pipeline to a more full-featured orchestrator in order to leverage advanced error-handling and workflow capabilities.

We’ll look at each of these steps in turn and explore the challenges and opportunities presented by each.

API Contracts

The first and most important step was agreeing on how the API contracts would be defined, and then drawing up the initial set of contracts. They ended up using OpenAPI, but any system of record that could be easily maintained and edited would have worked. The key driver was transparency - it allowed different individuals and teams to work together off the API contracts, treating the actual services as abstract black boxes whose internals they did not need to spend time and effort investigating.
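A contract for one of these services might look like the following OpenAPI fragment. The service name, path and field names are purely illustrative, not the team’s actual contracts:

```yaml
openapi: "3.0.3"
info:
  title: Change Request Service
  version: "1.0.0"
paths:
  /createChangeRequest:        # RPC-style path, matching the team's chosen API style
    post:
      operationId: createChangeRequest
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [configurationItem]
              properties:
                configurationItem:
                  type: string
      responses:
        "200":
          description: Change request created
          content:
            application/json:
              schema:
                type: object
                properties:
                  changeRequestId:
                    type: string
```

With contracts like this under review in a repository, consumers could build against the shape of the API before the service itself existed.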

To facilitate the creation of these contracts the team chose Apicurio Studio to help them move quickly, and stored the generated artifacts in a GitHub repository. The use of pull requests ensured all changes were quickly reviewed and approved by the designated leads.

API Development Standards

Given the team was relatively new to large-scale software engineering, an important part of laying the foundations was deciding on ways of working. The main consideration was the varying levels of expertise and skill across the team - many of the engineers were infrastructure engineers or system administrators by trade, so it was imperative that the approach selected allowed them to move quickly without having to learn a plethora of new technologies.

The team decided to start with CherryPy (a minimal, self-contained Python web framework with a built-in WSGI server) scripts running on Azure App Service Environments, Microsoft’s PaaS offering. A different choice could have been made, maybe starting straight away with Azure Functions or AKS, or using Flask or Django as web development frameworks.

The important outcome of this step is that the team selected a SDLC approach that suited their way of working.

API Architecture

This part of the process is where the software architect helping the team proved most useful. Designing an API ecosystem is not an easy task. There are many technical challenges such as security, logging, auditing, authorization, authentication and many other cross-cutting concerns which need to be resolved at an early stage. These foundational decisions dictate the ease with which progress will be made, so the team spent considerable time coming up with the right architecture for their environment and SDLC approach. They also involved the company security and risk teams, making sure these valuable stakeholders were involved from the start.

Some of the key decisions they ended up with were the following:

  • Use of an RPC API style rather than GraphQL or REST. This was the simplest and most familiar approach for them.
  • Use of the Microsoft Identity Platform for authentication. The company used Active Directory for IDAM, including AAD, so staying within the corporate standard made sense.
  • Use of JSON Web Tokens (JWT) for authorization.
  • Use of the ELK stack, maintained and provided by another team inside the organization, for logging, auditing and monitoring. This allowed them to focus purely on the value-add side of creating the APIs rather than having to spend time on the underlying supporting infrastructure and systems.
  • Use of the CI/CD generators provided by a sister team in the organization whose mandate was around DevOps and CI/CD systems. This meant they could move faster, without having to learn and recreate all of the infrastructure needed to push code from an engineer’s workstation to a production environment in a safe and reliable manner.
  • Use of the Azure API Management gateway to work across all the API endpoints.

Pipeline Decomposition

At this point, the team had everything in place to get started! Each step in the pipeline naturally mapped to a service which could be extracted. They went through the pipeline and created microservices implementing the same logic. Each service was tested in isolation, then integration tested to make sure it performed as expected.

Each of the services was built in such a way that it could be versioned and deployed independently, and was monitored and documented to a high standard. At the end of this step the team had a catalogue of close to 20 independent services which could be orchestrated at will to implement a business process.

Service Orchestration Through Azure DevOps Pipelines

Now that the services were available, the pipeline was migrated to simply be a sequence of HTTP API calls bringing together all of the extracted services. By following this approach, the team ensured the original behavior had not changed - or, if it had, they had minimized the number of variables and could quickly identify which refactoring had led to the issue.

Service Orchestration and Augmentation Through Camunda

The last part of the project involved moving the workflow to a standalone workflow engine. The team wanted a richer set of capabilities around error-handling, easily sequencing complex workflows (including manual steps) and better insight into the different steps of the workflow as they were being executed. They selected Camunda, an open-source workflow engine that seemed to meet their needs and provided both an execution engine and a visual BPMN designer.

Porting the workflow over proved to be remarkably easy. Yes, they now had an additional piece of infrastructure which needed to be managed, patched and all the usual IT activities - but the cost of doing so was greatly outweighed by the benefits. They looked at using managed services such as Azure Logic Apps, but unfortunately those proved a poor fit for the project.
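One common integration pattern with Camunda is the external task worker: the engine queues work on named topics and workers poll for it over REST. As a sketch, here is a helper that builds the request body for Camunda 7’s fetch-and-lock endpoint (the worker ID and topic names are hypothetical; consult the Camunda 7 REST API documentation for the full schema):

```python
import json

def fetch_and_lock_payload(worker_id, topics, max_tasks=1, lock_ms=60_000):
    """Build the JSON body for Camunda 7's POST /external-task/fetchAndLock."""
    return {
        "workerId": worker_id,
        "maxTasks": max_tasks,
        "topics": [{"topicName": t, "lockDuration": lock_ms} for t in topics],
    }

# A worker polling for VM provisioning steps might send this body to
# (hypothetical URL) http://camunda.example.com/engine-rest/external-task/fetchAndLock
body = fetch_and_lock_payload("infraops-worker-1", ["reserve-ip", "create-vm"])
print(json.dumps(body, indent=2))
```

Each microservice from the decomposition step can subscribe to its own topic, which is how the BPMN model ends up orchestrating the same building blocks the Azure DevOps pipeline called over HTTP.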

Conclusion

Team InfraOps, our fictional ops team in the WidgetMaker company, started their VM provisioning automation journey with an Azure DevOps pipeline. It delivered a successful outcome, but the team also knew they had sown the seeds of technical pain and complexity down the line. After taking a software engineering approach as part of their journey to becoming an SRE group, they delivered a set of tested, self-contained services abstracted behind APIs and orchestrated by a BPMN workflow engine. The automation is now testable and the steps reusable. The team has great visibility into the execution of the process, and the introduction of the workflow engine has brought thorough error-handling and the ability to address any scenario thrown at the team.

The project was successful, but it was only the first step on a longer journey. The key benefit gained from the project wasn’t the automation - it was starting the transition to thinking and acting like software engineers: looking to automate wherever possible, and having the framework and tooling to do so easily and quickly.