The Principles of Site Reliability Engineering: Part 1
Site reliability engineering (SRE) is a relatively new discipline that describes the art of keeping large-scale computer systems up and running. It’s often defined as “the practice of building, monitoring, and maintaining applications and services to ensure performance, availability, and resilience.”
Photo by Kelli McClintock on Unsplash
Where operations focuses on keeping things running under normal conditions, SRE focuses on keeping things up when conditions conspire to bring them down. This includes working with developers to prevent outages through code robustness, testing, and deployment automation.
SREs also aim to improve the ability of infrastructure to withstand incidents, including natural disasters or other interruptions beyond the control of the team. The goal is to minimize downtime and keep things ticking along by identifying potential weaknesses in code or infrastructure before they pose a threat.
To succeed, SRE teams work closely with developers to build applications that can withstand failures. They also work closely with operations teams to respond quickly when incidents occur. In some cases, SREs may even take over responsibility for handling incidents from the operations group — for example, when an internal or external incident requires a coordinated response across several services.
The goal of any site reliability engineer (SRE) should be to create a service that can be safely relied upon. This requires a bit more than just avoiding outages. An SRE practitioner must also plan for scale, prevent failure, maintain cost-effectiveness, and cater to changing business needs.
But site reliability isn’t just an engineering discipline. It involves everyone in an organization, including product owners and designers who define service requirements; people who operate and maintain production systems; support teams; and anyone else involved in getting users what they need when they need it.
SRE principles vs DevOps principles
SRE and DevOps both operate based on a set of principles. Both sets of principles drive alignment towards business goals. Some of their principles overlap. When comparing SRE vs DevOps, the biggest difference is that DevOps principles describe goals. SRE principles describe processes to achieve goals. In this sense, SRE best practices are a way of implementing DevOps principles.
Now let's dive deep into the first 4 of the 7 principles of SRE. Here are all 7:
- Embracing Risk.
- Service Level Objectives.
- Eliminating Toil.
- Monitoring Distributed Systems.
- Automation.
- Release Engineering.
- Simplicity.
SRE Principle 1- Embracing Risk
Site reliability engineering seeks to balance the risk of unavailability with the goals of rapid innovation and efficient service operations, so that users' overall happiness with features, service, and performance is optimized. Managing risk is where we start, because unreliable systems quickly erode customer confidence. Managing risk also comes at a cost, since reliability (and building replicas of systems) gets expensive. Measuring risk and reliability is the key to embracing risk, and you have two options: time-based or aggregate-based availability.
Time-based: availability = uptime / (uptime + downtime)
Aggregate-based: availability = successful requests / total requests
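As a rough illustration of the two formulas above, here is a minimal Python sketch. The function names and the sample numbers are made up for the example; they are not from any particular monitoring tool.

```python
def time_based_availability(uptime_seconds: float, downtime_seconds: float) -> float:
    """availability = uptime / (uptime + downtime)"""
    total = uptime_seconds + downtime_seconds
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_seconds / total


def aggregate_availability(successful_requests: int, total_requests: int) -> float:
    """availability = successful requests / total requests"""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return successful_requests / total_requests


# Hypothetical month: 30 days with 43.2 minutes of downtime.
month_seconds = 30 * 24 * 60 * 60
downtime_seconds = 43.2 * 60
uptime_seconds = month_seconds - downtime_seconds
print(f"time-based: {time_based_availability(uptime_seconds, downtime_seconds):.3%}")

# Hypothetical traffic: 999,500 good responses out of 1,000,000 requests.
print(f"aggregate-based: {aggregate_availability(999_500, 1_000_000):.3%}")
```

The aggregate form tends to fit services that are never fully "down" but fail some fraction of requests, while the time-based form is easier to explain to people outside the team.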
Now that we can measure, we can define our tolerance and start to build what's called an error budget (the amount of unreliability you're willing to tolerate over a period). Having error budgets is important because it enables teams to make data-driven decisions when releasing updates, doing maintenance, and so on. It also helps with automation, as data-driven decisions are much easier to code, i.e. IF something is in this state THEN something can be pushed to make it better.
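Here is one way the IF/THEN idea could look in code. This is only a sketch under assumptions: the budget is counted in failed requests, and the `min_remaining` threshold of 10% is an arbitrary value picked for illustration, not a standard.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail while still meeting the target."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget <= 0:
        raise ValueError("error budget must be positive")
    return 1.0 - failed_requests / budget


def can_release(slo_target: float, total_requests: int, failed_requests: int,
                min_remaining: float = 0.1) -> bool:
    """IF enough budget is left THEN a risky change may go out."""
    return budget_remaining(slo_target, total_requests, failed_requests) >= min_remaining


# Hypothetical month: 99.9% target, 1,000,000 requests, so about 1,000 may fail.
print(can_release(0.999, 1_000_000, failed_requests=400))  # plenty of budget left
print(can_release(0.999, 1_000_000, failed_requests=950))  # budget nearly spent
```

The useful part is that the release decision becomes a number the whole team can see, rather than a debate between people who want to ship and people who want stability.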
SRE Principle 2- Service Level Objectives
In order to manage a service correctly, you have to understand what is measurable and what really matters to that service. Defining Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs) is a must when building distributed systems and services.
A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided. Latency, error rate, throughput, and other metrics are key SLIs within distributed systems. Ideally SLIs directly measure a service level of interest, but sometimes this can only be done by proxy due to complexity. The most important SLI to SREs is availability, or the fraction of time the service is available. High availability is always sought, ya hear me dawg :D
A Service Level Objective (SLO) is a target value or range of values for a service level that is measured by an SLI. A common SLO for an e-commerce website might be to have the site up and running over 99% of the time. If a site experiences downtime, the SLO helps to determine whether it is acceptable. For example, if the SLO allows less than 5 minutes of downtime per month, then a company may decide that experiencing 30 minutes of downtime in a single day is not acceptable, because that one day already exceeds the whole month's allowance.
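To make those numbers concrete, here is a small sketch that converts an availability target into allowed downtime and checks the 30-minute outage against the 5-minute monthly limit from the example above. The 30-day window and the function names are assumptions made for illustration.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime that a time-based availability target permits over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def meets_downtime_slo(observed_minutes: float, max_minutes: float) -> bool:
    """True when observed downtime stays under the agreed limit."""
    return observed_minutes < max_minutes


# A 99% target over an assumed 30-day month allows roughly 432 minutes.
print(f"{allowed_downtime_minutes(0.99):.0f} minutes allowed at 99%")

# The example from the text: a 5-minute limit versus a 30-minute outage.
print(meets_downtime_slo(observed_minutes=30, max_minutes=5))
```

Note how far apart the two targets are: "over 99%" sounds strict, yet it tolerates hours of downtime per month, while a 5-minute limit corresponds to roughly 99.99%.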
A Service Level Agreement (SLA) is an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs it contains. The consequences are most easily recognized when they are financial, e.g. credits or discounts on services that become unavailable.
SRE Principle 3- Eliminating Toil
Toil is the kind of operational work tied to running production systems that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly with service growth. So ask yourself these questions when trying to determine if a task is toil: Does this require manual intervention? Is this the first time performing this task, or have you done it before? Can a machine do this task? Does the service remain in the same state after this task? Does this task scale up linearly with service size, traffic volume, or user count? The goal of every SRE should be to keep operational work (toil) at 50% of their time or lower. Ideally SREs want to work on engineering new solutions, not maintaining existing ones.
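If your team logs time by category, the 50% ceiling is easy to check. The sketch below assumes a simple dictionary of hours per category; the category names and the hours are hypothetical.

```python
TOIL_CAP = 0.5  # the 50% ceiling described above


def toil_fraction(hours_by_category: dict, toil_categories: set) -> float:
    """Share of logged hours that went to categories counted as toil."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in toil_categories)
    return toil / total


# Hypothetical 40-hour week for one engineer.
week = {
    "ticket_queue": 12,
    "manual_deploys": 6,
    "manual_restarts": 4,
    "automation_project": 14,
    "design_review": 4,
}
toil = {"ticket_queue", "manual_deploys", "manual_restarts"}

fraction = toil_fraction(week, toil)
print(f"toil: {fraction:.0%}, over the cap: {fraction > TOIL_CAP}")
```

Which categories count as toil is a judgment call for each team; the questions above are the test to apply when deciding.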
There are two general strategies for reducing toil: automating tasks and eliminating them entirely. Automation is arguably the most attractive option. It’s relatively easy to automate manual processes and there are a lot of tools available for doing so. Unfortunately, it isn’t always possible to automate a process because too much data may be missing, or inputs may change too frequently. This can make automation impractical or impossible in some cases. In these situations, you have to eliminate the task altogether if you want to eliminate toil.
Reducing a task to its smallest possible form is key when eliminating it entirely. If a manual process takes 10 minutes and cannot be automated, the goal should be to find a way to complete that process in, say, 2 minutes. The exact 8 minutes saved isn't the point; rather, the point is finding a way to complete the task in as little time as possible so that energy and time can be shifted elsewhere.
SRE Principle 4- Monitoring
Monitoring and observability are the mechanisms used to alert humans when events happen within a distributed system. You have to be able to collect, process, aggregate, and display quantitative metrics about systems in order to understand how they work. Monitoring helps you determine the root cause of outages and security events, and also helps you analyze long-term trends. There are 4 golden signals that are constant in everyone's monitoring stack: Latency, Traffic, Errors, and Saturation.
Latency is the time it takes to service a request. It is important to understand latency on both successful and failed requests. For example, a 50X error caused by a failed database will have little to no latency because the failure is immediate. A slow 50X response could mean something even worse, as there are multiple hops within a request and the problem could be at any of them. Track error latency; don't filter it out.
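A minimal sketch of keeping error latency visible instead of filtering it out. It assumes requests arrive as `(http_status, latency_ms)` pairs and that any status of 500 or above counts as an error; the sample data is made up.

```python
from collections import defaultdict
from statistics import median


def latency_by_outcome(requests):
    """Summarize latency separately for successful and failed requests.

    `requests` is an iterable of (http_status, latency_ms) pairs.
    Failed requests (status >= 500) are kept, not filtered out.
    """
    buckets = defaultdict(list)
    for status, latency_ms in requests:
        outcome = "error" if status >= 500 else "success"
        buckets[outcome].append(latency_ms)
    return {
        outcome: {
            "count": len(values),
            "median_ms": median(values),
            "max_ms": max(values),
        }
        for outcome, values in buckets.items()
    }


# Made-up sample: one error fails fast, the other hangs for 30 seconds.
sample = [(200, 120), (200, 95), (503, 3), (200, 140), (504, 30_000)]
for outcome, stats in latency_by_outcome(sample).items():
    print(outcome, stats)
```

Mixing both groups into one average would hide both stories: the fast failures would drag the number down and the hanging one would drag it up.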
Traffic signals are those that measure how much demand is being placed on your system. For a web server this could be how many HTTP requests hit your system per second. For an audio streaming system, it could be network I/O rate or concurrent sessions. For databases you could track transactions or retrievals per second.
Error signals are the rate of requests that fail, both explicitly and implicitly. For example, a typical HTTP request that is successful will return a 200 response, but it might take more than a second. If you have a service level objective of delivering requests in under 1 second, then you have an implicit failure here even though your request was served successfully. Monitoring end-to-end failure signals can be very complex at times, and it's important that you have objectives defined.
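The explicit versus implicit distinction can be sketched like this. The `Request` type, the 1-second threshold taken from the example above, and the choice to treat only 5xx statuses as explicit failures are all assumptions for illustration.

```python
from dataclasses import dataclass

LATENCY_SLO_SECONDS = 1.0  # "under 1 second", as in the example above


@dataclass
class Request:
    status: int
    latency_seconds: float


def classify(request: Request) -> str:
    """Label a request as ok, an explicit failure, or an implicit failure."""
    if request.status >= 500:
        return "explicit_failure"  # the server said it failed
    if request.latency_seconds >= LATENCY_SLO_SECONDS:
        return "implicit_failure"  # served, but too slowly to count as good
    return "ok"


def error_rate(requests) -> float:
    """Fraction of requests that failed explicitly or implicitly."""
    requests = list(requests)
    if not requests:
        return 0.0
    failed = sum(1 for request in requests if classify(request) != "ok")
    return failed / len(requests)


# Made-up sample: one slow 200 and one fast 500 both count as failures.
sample = [Request(200, 0.2), Request(200, 1.4), Request(500, 0.05), Request(200, 0.6)]
print([classify(request) for request in sample])
print(f"error rate: {error_rate(sample):.0%}")
```

Counting only status codes would report a 25% error rate for this sample; counting against the objective reports 50%, which is closer to what users actually experienced.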
Saturation is how “full” your system is: a measure of how much of your system's capacity is in use, emphasizing the resources that are most constrained (e.g. memory for a memory-intensive system, CPU for a CPU-intensive one). Many systems degrade in performance before hitting 100% utilization, so it's key to have proper utilization targets. If every 1000 users equals 1 CPU, then you know you need 100 CPUs for 100k users. This is key for large-scale marketing events or things like Black Friday. Knowing what your system can handle can help you prepare for expected and unexpected surges of usage or traffic.
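The capacity math above, plus a utilization target, fits in a few lines. This is a sketch: the 1000 users per CPU figure comes from the example, while the 70% target is an arbitrary value chosen to show the effect of leaving headroom.

```python
import math


def cpus_needed(expected_users: int, users_per_cpu: int = 1000,
                target_utilization: float = 1.0) -> int:
    """CPUs required so that each one stays at or below the utilization target."""
    if users_per_cpu <= 0:
        raise ValueError("users_per_cpu must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    effective_users_per_cpu = users_per_cpu * target_utilization
    return math.ceil(expected_users / effective_users_per_cpu)


# The example from the text: 100k users at 1000 users per CPU.
print(cpus_needed(100_000))

# Same load, but planning to run each CPU at no more than 70%.
print(cpus_needed(100_000, target_utilization=0.7))
```

Planning against a target below 100% is what gives you room for the unexpected part of a surge, since performance often degrades before a resource is completely full.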
Having a solid monitoring posture is crucial to obtaining observability within your distributed system. Be careful not to overcomplicate things, and ask yourself these questions when building alerts:
Does this rule detect an otherwise undetected condition that is urgent, actionable, and actively or imminently user-visible?
Will I be able to ignore this alert, knowing it’s benign? When and why will I ignore this alert, and how can I avoid this?
Does this alert definitely indicate that users are being negatively affected? Are there detectable cases where users are not being affected?
Can I take action in response to this alert?
Are others getting paged for this same alert?
Having answers to these questions will help you build a better active monitoring strategy and help you eliminate noise that can be ignored. You want to build monitoring systems that are meant for the long term and that don't over-alert. Healthy monitoring should page primarily on symptoms, or on problems that require human intervention. Next week I will finish up with the last 3 SRE principles: Automation, Release Engineering, and Simplicity. Thanks for rocking with me on this Tuesday, cheers y'all!