Written by: David Strates
Why observability matters to business
The average cost of downtime is $5,600 to $9,000 per minute, depending on company scale and industry vertical. For small and medium businesses this translates to a more modest, but still alarming, $137 to $427 per minute. With Australian organisations experiencing an average of five major outages per month, the losses are significant. It can be a high-stakes game for on-call engineers.
It’s no surprise, then, that observability has become an integral practice in a world where brand reputation is tied to uptime and latency. In fact, studies have shown that 40% of people will leave a website if it takes longer than 3 seconds to load. Expectations around product and service availability are now at a point where businesses require dedicated 24x7 teams to manage always-on services. Forgoing this poses a significant risk when there are swathes of competitors ready to poach your customer base.
But not all businesses are able to invest in expert teams of Site Reliability Engineers (SREs), let alone the myriad tools required to keep their critical systems online. Many are also hindered by technical debt, and building out the ability to monitor an increasingly complex amalgamation of systems and tacked-on services becomes a major undertaking in itself.
Why observability matters to site reliability
The SRE title is often a misnomer – it rarely captures the breadth of the role. So what do SREs actually do?
SREs work across the software development lifecycle and are often the technical custodians of business applications. Strong SRE teams build solid relationships with their software development counterparts and establish themselves as bastions of engineering best practice, helping to improve the tools, workflows and principles around them. They partner with product developers to incorporate non-functional requirements (NFRs), such as Service Level Objectives (SLOs), Mean Time to Recovery (MTTR) and Recovery Point Objectives (RPOs), as engineered deliverables in active product portfolios.
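As a simple illustration of what an SLO looks like when treated as an engineered deliverable, the sketch below turns a hypothetical 99.9% availability objective into an error budget that a product team can spend. The target and the 30-day window are assumptions for the example, not figures from any particular engagement.

```python
# Illustrative sketch: turning an availability SLO into an error budget.
# The 99.9% target and 30-day window are assumed values for the example.

SLO_TARGET = 0.999               # 99.9% availability objective
WINDOW_MINUTES = 30 * 24 * 60    # 30-day rolling window

error_budget_minutes = (1 - SLO_TARGET) * WINDOW_MINUTES
print(f"Allowed downtime this window: {error_budget_minutes:.1f} minutes")

def budget_remaining(observed_downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative means the SLO is breached)."""
    return 1 - observed_downtime_minutes / error_budget_minutes

print(f"Budget remaining after 20 minutes of downtime: {budget_remaining(20):.1%}")
```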
In terms of platform engineering, SREs provision scalable cloud environments, maintain complex service meshes, perform capacity planning and ecosystem performance testing, oversee incident response and disaster recovery testing, validate security compliance and wear multiple hats on a daily basis. Reliability engineers are steeped in system design knowledge and are deeply familiar with architectural patterns, dependency mapping and troubleshooting points of failure. They work backward from the customer perspective and understand the importance of maintaining a service-oriented mindset.
Daily SRE work relies on unified monitoring and logging solutions to help product teams understand the performance characteristics of their applications. SREs may develop test plans or right-sizing estimates and assess a new application’s production readiness by inferring trends and behaviours from historical monitoring data. Once services are deployed, SREs observe them and set alerting thresholds that allow them to perform timely remediation. They collaborate with both engineering teams and the business to define useful and achievable reliability and performance standards.
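To make “alerting thresholds” concrete, here is a minimal, hypothetical sketch of a latency check that only pages when a breach is sustained rather than a one-off spike. The 500 ms threshold and the number of samples are illustrative values, not a recommendation.

```python
# Minimal sketch of a threshold-based alert check (names and values are illustrative).
from statistics import mean

LATENCY_THRESHOLD_MS = 500   # assumed SLO-aligned latency threshold
SUSTAINED_SAMPLES = 5        # require the breach to persist before paging

def should_alert(recent_latencies_ms: list[float]) -> bool:
    """Page only when the rolling average of the last few samples breaches the
    threshold, filtering out one-off spikes that don't need remediation."""
    window = recent_latencies_ms[-SUSTAINED_SAMPLES:]
    return len(window) == SUSTAINED_SAMPLES and mean(window) > LATENCY_THRESHOLD_MS

print(should_alert([120, 480, 510, 530, 560, 610]))  # True: sustained breach
print(should_alert([120, 130, 900, 140, 150, 160]))  # False: a single spike
```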
This means that SREs do not typically generate revenue streams – they actively protect revenue and perform cost optimisation measures through their work. The SRE’s core value proposition is that they uphold brand reputation by retaining user trust and ensuring ongoing compliance with relevant regulations. SREs balance reliability and cost concerns against change velocity and feature development, particularly through efforts such as reducing incident duration and impact, lessening service downtime and increasing development productivity through automation.
The difference between monitoring and observability
Observability has its roots in the engineering concept of control theory, which suggests that you can infer the internal state of a system by measuring its external outputs. This involves looking at the system holistically and not just its individual components.
At the core of any modern SRE team is the ability to monitor interconnected distributed systems, container-based applications and highly elastic microservices at scale. But more importantly, SREs need to construct meaningful feedback loops that provide insights into the states and behaviours of production workloads and customer interactions. By continuously interpreting system outputs and leveraging automated toolsets, SREs are able to reliably pinpoint health trends and catch problems before they escalate.
Most traditional monitoring tools are designed with homogeneous systems in mind: static infrastructure, purpose-built middleware and monolithic applications. Take white-box monitoring, for instance – can you imagine trying to apply it to an ephemeral Kubernetes cluster where containers last for minutes, or to a serverless web application? New paradigms demand new mechanisms for combining several abstraction layers into an entity that can be easily monitored. Remember, “the whole is greater than the sum of its parts”.
This is where end-to-end monitoring and management of heterogeneous applications in distributed systems fits in. For example, the performance of an online web application depends on any number of services: an application host, a file system for the application, a web server to access the application, a database for storing and retrieving information, a load balancing mechanism to handle traffic, an event bus, an API interface and so on. Heterogeneous monitoring platforms allow us to group these dependencies and interpret them as a single unit, as well as home in on their individual components.
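A rough sketch of that idea in code: the dependency names and health states below are made up, but they show how individual component states can be rolled up into a single service-level view and then drilled back into.

```python
# Illustrative sketch: modelling a web application's dependencies as one monitored unit.
# Service names and health states are invented for the example.

dependencies = {
    "application-host": "healthy",
    "web-server": "healthy",
    "database": "degraded",
    "load-balancer": "healthy",
    "event-bus": "healthy",
    "api-gateway": "healthy",
}

def composite_health(deps: dict[str, str]) -> str:
    """Roll individual component states up into a single service-level status."""
    if any(state == "down" for state in deps.values()):
        return "down"
    if any(state == "degraded" for state in deps.values()):
        return "degraded"
    return "healthy"

# Interpret the group as a single unit...
print("online-web-app:", composite_health(dependencies))
# ...then home in on the individual components dragging the status down.
print([name for name, state in dependencies.items() if state != "healthy"])
```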
With shifting customer tolerance levels, it’s becoming crucial that businesses are able to process granular insights and prioritise their response to potential issues in minutes, not hours or days. Near-real-time telemetry, anomaly detection and service-level distributed tracing are now considered foundational capabilities for collecting enriched metrics and enabling quick detection. This facilitates a more proactive approach to incident response than traditional practices, which aren’t always suitable for flexible and extensible cloud-native architectures.
Laying the foundations for an observability framework
Well-architected systems are fault-tolerant, highly available, performant and reliable. This doesn’t mean they aren’t prone to failure; rather, they are better equipped to deal with it. However, a notable consequence of these designs is that they typically fail at the interdependencies between multiple systems, rather than through individual component faults alone. This makes it difficult to troubleshoot failures by manually sifting through logs and dashboards, especially in an extensively fragmented environment with many unknown unknowns.
Strong observability platforms rely on centralised log aggregation, metric and event ingestion, distributed tracing, and threshold-based alerting. Pipelines help decouple the collection of data from its ingestion into various aggregation services, such as Sumo Logic and Datadog, and of course the inevitable archiving of logs in Amazon Glacier. This makes the observability data easily consumable and routable, since we have already figured out what data to send, where to send it and how to send it (i.e. using logging layers like Fluentd). From here we can apply machine learning and metadata processing to handle filtering and provide critical business intelligence.
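As a hypothetical illustration of that routing decision (in practice it usually lives in the logging layer’s configuration, e.g. Fluentd match rules, rather than in application code), the sketch below fans events out to an archive, an aggregation service and a security sink based on tag and severity. The tag conventions and sink names are assumptions.

```python
# Minimal sketch of the "what, where and how to send" routing step in a log pipeline.
# Tags, severities and sink names are illustrative assumptions.

def route(event: dict) -> list[str]:
    """Return the destinations an event should be shipped to."""
    sinks = ["archive"]                      # everything is archived for compliance
    if event.get("severity") in ("error", "critical"):
        sinks.append("aggregation-service")  # searchable, alertable store
    if event.get("tag", "").startswith("audit."):
        sinks.append("siem")                 # security analytics
    return sinks

print(route({"tag": "app.web", "severity": "error", "message": "5xx burst"}))
print(route({"tag": "audit.login", "severity": "info", "message": "user login"}))
```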
Anomaly detection works off this premise and allows companies to identify and predict abnormal patterns in data streams. Analysing web log traffic to gain real-time insights into customer behaviour or potential security compromises is a prime use case. Although at the mature end of the scale, real-time anomaly detection built on streaming analytics (e.g. Amazon Kinesis Data Analytics for Apache Flink) and serverless data processing pipelines can be incredibly valuable to enterprise organisations, and using a managed AWS service removes the toil associated with maintaining complicated batch-processing solutions. This contributes to predictive, behaviour-driven observability.
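For a feel of the underlying technique, here is a deliberately simple rolling z-score detector. A production system would run something like this inside a streaming engine rather than over an in-memory list, and the window size and threshold are assumptions for the example.

```python
# Illustrative rolling z-score anomaly detector for a metric stream.
# Window size and threshold are assumed values; this is a sketch, not production code.
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    def __init__(self, window: int = 60, threshold: float = 3.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        """Return True when the new value deviates sharply from recent history."""
        is_anomaly = False
        if len(self.samples) >= 10 and stdev(self.samples) > 0:
            z = abs(value - mean(self.samples)) / stdev(self.samples)
            is_anomaly = z > self.threshold
        self.samples.append(value)
        return is_anomaly

detector = RollingAnomalyDetector()
for rps in [100, 102, 98, 101, 99, 103, 100, 97, 102, 101, 100, 400]:
    if detector.observe(rps):
        print(f"Anomalous request rate: {rps}")
```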
Dashboarding ties it all together by providing a single-pane-of-glass view into the workings of an entire topology (from dev to prod). While dashboards are certainly helpful in gaining visibility, their fundamental limitation is that they are relatively passive and rarely catch novel issues; a new failure mode, by definition previously unknown, won’t be caught and alerted upon. However, we still see huge value in making visualised metrics meaningful, and have begun codifying Grafana dashboards, placing them in version control and automating their deployment across our customer base.
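By way of illustration (not our exact tooling), a codified dashboard can be as simple as a JSON definition kept in version control and pushed through Grafana’s HTTP API. The URL, token and dashboard body below are placeholders.

```python
# Sketch of deploying a version-controlled dashboard via Grafana's HTTP API.
# GRAFANA_URL, API_TOKEN and the dashboard body are placeholders for illustration.
import requests

GRAFANA_URL = "https://grafana.example.com"   # placeholder
API_TOKEN = "REPLACE_ME"                      # placeholder service-account token

payload = {
    "dashboard": {
        "id": None,                 # let Grafana match by uid instead of numeric id
        "uid": "svc-latency",       # stable uid so redeploys update in place
        "title": "Service latency (codified)",
        "panels": [],               # panel definitions would live in version control
    },
    "overwrite": True,              # make redeploys idempotent
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=payload,
    timeout=10,
)
resp.raise_for_status()
print("Deployed:", resp.json().get("url"))
```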
Our approach to observability
Mantel Group Solutions have encountered the full gamut of organisational maturity levels. In our experience, the vast majority of in-house IT teams rarely have the expertise or capacity to effectively monitor and triage issues in a complex cloud environment. Most teams rely on traditional methods or have corporate structures that make it difficult for them to adopt a more streamlined approach, be it Agile, DevOps or even the introduction of automation.
To combat this we’ve made observability a key function of our Managed Service – from developing bespoke automated solutions to instilling the principles of simplicity, toil elimination and an open post-mortem culture. We’ve collected best practices from a number of industry veterans and, with the aim of reducing operational overhead, have injected them into our monitoring and incident response processes.
In maturing our toolset and improving service reliability, we need to understand what we are trying to measure and what our customers’ capabilities and limitations are. Organisations that are further along their DevOps journey will likely benefit from monitoring tailored towards data pipelines, containerisation and serverless patterns, whereas other companies may require “synthetic monitoring”, heartbeat checks for API availability and actionable alerts in a more predictable internet-facing tenancy.
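To illustrate the simpler end of that spectrum, here is a minimal heartbeat check against a placeholder health endpoint, recording both availability and latency so the result can feed an actionable alert. The endpoint, expected status and latency budget are assumptions.

```python
# Minimal sketch of a synthetic heartbeat check for API availability.
# The endpoint and latency budget are placeholder assumptions.
import time
import requests

ENDPOINT = "https://api.example.com/health"   # placeholder health endpoint
LATENCY_BUDGET_S = 1.0

def heartbeat() -> dict:
    """Probe the endpoint and return a result suitable for alerting on."""
    start = time.monotonic()
    try:
        resp = requests.get(ENDPOINT, timeout=5)
        latency = time.monotonic() - start
        healthy = resp.status_code == 200 and latency <= LATENCY_BUDGET_S
        return {"healthy": healthy, "status": resp.status_code, "latency_s": round(latency, 3)}
    except requests.RequestException as exc:
        return {"healthy": False, "error": str(exc)}

print(heartbeat())
```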
Since we often architect our customers’ environments, we’re best placed to run and maintain them. Mantel Group’s reliability engineering team is deeply familiar with our customer environments, and we work closely with our colleagues and counterparts in a cross-functional capacity. We take the DevOps cultural paradigm of “you build it, you run it” and combine it with a drive for customer satisfaction, which boils down to an exceptional user experience supported by high uptime and low latency.