Site Reliability Engineering in the Cloud

Let us look at a day in the life of an operations team member working on multiple solutions deployed on a cloud platform. A financial job runs every night to notify all customers of their account balance at the end of the day. There is no exception to this process on weekends either; after all, a financial account operates even on weekends. After monitoring the financial job daily, the SRE team notices a pattern: every Monday, it takes almost 40% more time to complete. On further investigation, they find that the actual issue lies with the job running on Sundays. For some reason, the job fails every Sunday, leaving an extra data load on Mondays to clear the pending data. There could be a thousand possible causes for such a failure, and the team gets into action.
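A pattern like this can be surfaced by grouping job-run records by weekday. The following sketch uses entirely made-up run data (the dates, durations, and statuses are hypothetical) to show the idea:

```python
from collections import defaultdict
from datetime import date
import calendar

# Hypothetical run history: (run date, duration in minutes, status).
job_runs = [
    (date(2023, 4, 30), 15, "failed"),   # Sunday: silent failure
    (date(2023, 5, 1), 84, "success"),   # Monday: extra data load
    (date(2023, 5, 2), 60, "success"),
    (date(2023, 5, 3), 61, "success"),
    (date(2023, 5, 7), 14, "failed"),    # Sunday again
    (date(2023, 5, 8), 86, "success"),   # Monday again
]

durations = defaultdict(list)
failures = defaultdict(int)
for run_date, minutes, status in job_runs:
    day = calendar.day_name[run_date.weekday()]
    durations[day].append(minutes)
    if status == "failed":
        failures[day] += 1

# Compare each weekday's average duration and failure count.
for day in durations:
    avg = sum(durations[day]) / len(durations[day])
    print(f"{day}: avg {avg:.0f} min, failures: {failures[day]}")
```

Run over real job history, the same grouping would make the Sunday failures and the inflated Monday durations stand out side by side.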

In a separate case, another set of team members continues to fine-tune environment parameters and resource values to make sure that their application gets what it needs to serve the promised number of concurrent users in real time. They continuously monitor the application's performance and resource utilization, and have written scripts to update the resource allocations accordingly.
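A minimal sketch of such a tuning script follows; the thresholds, scaling factor, and floor value are all illustrative assumptions, not recommendations:

```python
def recommend_allocation(current_mb, avg_util, peak_util,
                         low=0.40, high=0.75, step=1.25):
    """Suggest a new memory allocation from observed utilization.

    low/high/step are illustrative thresholds, not prescriptive values.
    """
    if peak_util > high:
        # Scale up before the instance saturates.
        return int(current_mb * step)
    if avg_util < low:
        # Reclaim unused headroom, but keep a 512 MB floor.
        return max(512, int(current_mb / step))
    return current_mb

print(recommend_allocation(2048, avg_util=0.55, peak_util=0.82))  # 2560
```

A real version of this would feed its recommendation into the platform's resize API rather than just printing it.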

In a third case, another operations team is busy investigating an issue that killed one of the application instances. They collected resource utilization data and logs to determine the reason for the extra load on that particular instance. It could be a faulty configuration in the load balancer, or a race condition with some other application running in parallel.

All these are regular scenarios experienced by IT operations teams that handle multiple solutions, application instances, and infrastructure. The need of the hour is to introduce site reliability engineering (SRE), with its development mindset, into application operations. SRE is expected primarily to automate all the operational processes, monitor operations to identify potential risks and failures, and reduce the recovery time objectives (RTOs) and recovery point objectives (RPOs) of the solutions.

For solutions migrating to a cloud platform, the need for SRE is even greater. The simple reason is that every organization wants a smaller workforce that is capable of handling a large number of application and infrastructure operations. Building such a workforce with higher reliability requires strict processes and objective definitions. Defining service-level objectives (SLOs) is the first step towards achieving reliability. The important SLOs for a cloud application include:

  • Response Time: The time needed to respond to an incident and address it completely.
  • Observability: The levels and details of monitoring various aspects of the solution, such as the environmental parameters, the log files, and resource utilization.
  • Availability: Defining availability targets of the solution, such as unplanned downtime of less than 45 minutes per month (99.9% availability).
  • Performance: Setting up the application performance targets like ensuring that 99% of the transactions are completed in less than 3 seconds.
  • Throughput: Defining throughput targets, especially for batch processing, such as processing 100,000 transactions within 2 hours starting at midnight.
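As an illustration, such objectives can be captured as plain configuration data that tooling can read. The structure and target values below are hypothetical; the helper shows the downtime budget implied by an availability target:

```python
# Illustrative SLO definitions for a hypothetical cloud application.
slos = {
    "availability": {"target": 0.999, "window_days": 30},
    "performance": {"p99_latency_s": 3.0},            # 99% of txns under 3 s
    "throughput": {"batch_txns": 100_000, "batch_window_hrs": 2},
    "response_time": {"incident_resolution_hrs": 4},  # assumed target
}

def downtime_budget_minutes(target, window_days=30):
    """Unplanned-downtime budget implied by an availability target."""
    return (1 - target) * window_days * 24 * 60

print(f"{downtime_budget_minutes(slos['availability']['target']):.1f}")  # 43.2
```

A 99.9% monthly target therefore leaves roughly 43 minutes of unplanned downtime per month, which is where the "less than 45 minutes" figure above comes from.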

Once the SLOs are set, the operations team must regularly monitor and measure the corresponding indicators, called service-level indicators (SLIs). The team needs to employ appropriate tools and processes to monitor the SLIs. The SLI data is then analyzed with two key goals:

  • Determine the current health and performance of the application
  • Detect impending application failures early enough to prevent them

These SLIs are typically derived from log data generated by the application, platform, or infrastructure. They may also come from other metrics captured by specific tools, such as periodic thread dumps, periodic memory dumps, and more.
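A short sketch of how such raw data turns into SLIs, using a hypothetical structured request log with invented field names and values:

```python
# Hypothetical structured request log entries.
request_logs = [
    {"status": 200, "latency_s": 0.8},
    {"status": 200, "latency_s": 1.2},
    {"status": 500, "latency_s": 2.9},
    {"status": 200, "latency_s": 0.4},
]

total = len(request_logs)
# Availability SLI: fraction of requests that did not fail server-side.
good = sum(1 for r in request_logs if r["status"] < 500)
# Latency SLI: fraction of requests completed under the 3 s objective.
fast = sum(1 for r in request_logs if r["latency_s"] < 3.0)

availability_sli = good / total  # 0.75
latency_sli = fast / total       # 1.0
```

Each SLI is just a ratio of "good events" to "total events" over a measurement window, which is what makes it directly comparable to its SLO.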

For the financial job scenario explained above, the operations team kept collecting the application's processing logs, spotted the anomaly, and detected the problem with jobs running on Sundays. Likewise, in the second scenario, the team kept capturing memory utilization and thread counts; when these values crossed a certain threshold, the team immediately spawned a new instance of the application to handle the load.

As part of SRE practice, the same scenarios would be handled in a slightly different manner to achieve the same outcome in an easier, repeatable way.

The SRE team instead uses tools that analyze the data and perform activities such as automatically spawning a new instance of the application or sending an alert when the thread-count threshold is breached. The scripts to perform such activities are created and maintained by the SRE teams themselves; however, an orchestration tool is typically used to execute them at the right time.
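A minimal sketch of such an action script, with assumed threshold values and a stand-in `spawn_instance` function in place of a real cloud or orchestration API call:

```python
# Illustrative thresholds; real values depend on the application.
THREAD_LIMIT = 200
MEMORY_LIMIT_MB = 3072

def spawn_instance():
    """Placeholder for a real orchestration call (cloud API, script, etc.)."""
    print("spawning new application instance")
    return True

def check_and_scale(thread_count, memory_mb):
    """Scale out when either resource indicator breaches its threshold."""
    if thread_count > THREAD_LIMIT or memory_mb > MEMORY_LIMIT_MB:
        return spawn_instance()
    return False

check_and_scale(thread_count=250, memory_mb=2048)  # breach -> scales out
```

In practice the orchestration tool would invoke this check on a schedule or in response to an alert, rather than an operator running it by hand.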

The SRE teams also define a set of common metrics so they can evaluate the health of the entire solution ecosystem at a glance through simple visualizations. These metrics generally include, but are not limited to, the following:

  • Average and peak memory utilization
  • Average and peak concurrent user load
  • Average response time and throughput
  • Error rate and density
  • SLO violation rate
  • Network load and latency
  • Platform saturation graph
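As a small illustration of the first two items, raw utilization samples reduce to average/peak pairs; the sample values here are invented:

```python
# Hypothetical samples collected over one monitoring window.
memory_mb = [1024, 1400, 2200, 1800]
concurrent_users = [120, 340, 510, 280]

summary = {
    "memory_avg_mb": sum(memory_mb) / len(memory_mb),
    "memory_peak_mb": max(memory_mb),
    "users_avg": sum(concurrent_users) / len(concurrent_users),
    "users_peak": max(concurrent_users),
}
print(summary)
```

Dashboards typically plot both numbers together, since an average alone hides the short peaks that actually kill instances.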

Depending on the cloud service provider used in the solution ecosystem, the teams may use tools like CloudWatch on AWS, Azure Monitor, Google Cloud Operations (formerly Stackdriver) on GCP, Splunk, the ELK stack, and more. Depending on the tool's capabilities, simple action scripts, written in shell or Python, are plugged into the base tool.
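A sketch of the kind of action script such tools invoke; the alert payload shape here is entirely hypothetical, since each tool (CloudWatch, Azure Monitor, and so on) defines its own alert format and hook mechanism:

```python
import json

def handle_alert(raw_payload):
    """Decide an action from a (hypothetical) JSON alert payload."""
    alert = json.loads(raw_payload)
    if (alert.get("metric") == "thread_count"
            and alert.get("value", 0) > alert.get("threshold", 0)):
        # In a real setup this would trigger an orchestration job.
        return "scale_out"
    return "no_action"

payload = json.dumps({"metric": "thread_count", "value": 240, "threshold": 200})
print(handle_alert(payload))  # scale_out
```

Keeping the decision logic in a small, testable script like this is what makes the response repeatable across environments.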

SRE plays a key role in the delivery and operations of a solution on all types of infrastructure, especially on the cloud. The programmatic approach that SRE brings to the operations domain makes operations repeatable and reduces the time needed to detect and prevent operational issues.