Why observability is the way forward


Technology is critical to the survival of nearly every modern business, and it is advancing at a whirlwind pace to continue supporting these vital digital infrastructures. One such change is the shift toward Observability. In the past year, log management, unified monitoring and event management vendors have adopted Observability as a way to understand the internal state of IT systems through those systems’ telemetry data outputs.

Why are vendors navigating toward Observability? Telemetry data, including logs, metrics and traces, allow DevOps and SRE teams to quickly understand service-disrupting incidents in their IT systems, analyze the root cause and mitigate the issue. Visibility into these systems is increasingly essential as IT infrastructures become more distributed, complex, interconnected and, as a result, fragile. In short, Observability helps IT teams work faster and smarter and improve service assurance in extremely complicated production environments.

But, as with any leading-edge technology, confusion, and even skepticism, has set in. The skeptics typically fall into two camps. IT Operations and Service Management (ITOSM) professionals question whether “Observability” is just another buzzword whipped up by marketing teams in an attempt to stay relevant in today’s competitive tech market. And DevOps-oriented Software Engineers (SWEs) and Site Reliability Engineers (SREs) are typically even more suspicious that Observability is just a refashioned legacy technology.

Let’s address this uncertainty by explaining exactly what constitutes an observable system and why it’s become essential to modern-day service assurance.

What constitutes an observable system?

Despite popular opinion, the term “Observability” isn’t a shiny new buzzword. It originated in the 1960s in Rudolf E. Kálmán’s work on Control Theory. In Control Theory, Observability is a property, and a system is deemed observable if it generates enough meaningful data that human operators can understand the system’s internal state based on the system’s outputs.

Now that we understand Observability, here is the historical context of how and why it relates to DevOps teams.

In the not too distant past — sometime between 2013 and 2015 — Application Performance Monitoring (APM) became popular in the ITOSM community. With the rise of business digitization, IT became a core component of a business’s success. And, because applications linked the core business with its customers and IT, tools monitoring these applications became critical, providing important application event data, customer latency metrics and transaction traces. Application managers would analyze all of this data and diagnose system failures.

But with great power comes great responsibility. Businesses put growing pressure on IT teams to improve business agility and quickly implement new digital capabilities while continuing to monitor application performance. As a result, changes ensued.

The most significant change was the exponential increase in system state changes. It didn’t take long for the DevOps community to realize that the ITOSM community’s APM products were too slow and too coarse to keep up with the systems SWEs were developing and SREs were managing. APM technologies could no longer make modern applications observable.

How can Observability improve modern-day systems?

Monitoring, log management and event management vendors already recognized that their monitoring tools needed real-time data feeds with ingestion rates that could keep pace with the system’s rapid state changes. This more advanced monitoring solution required more granular, low-level data feeds that came right from the underlying telemetry data, including metrics, logs and traces.

There was still one unresolved issue. Reports based on metrics, logs and traces are challenging, and often impossible, to interpret, even for the most experienced DevOps or ITOSM pros. As a result, artificial intelligence (AI) and machine learning (ML) became essential to deciphering patterns and making better use of the data. AI and ML could analyze this data quickly and without human error. ML-based anomaly detection could rapidly identify deviations from the norm, which might be caused by unexpected degradation or signal a DDoS attack.
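To make the idea of anomaly detection concrete, here is a minimal sketch in Python. It is not any vendor's method: it swaps a simple rolling z-score in place of a trained ML model, and the function name, window size, threshold and latency values are all assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=10, threshold=3.0):
    """Return (index, value) pairs that deviate sharply from the recent baseline.

    Assumption: a rolling z-score stands in for a real ML model here.
    """
    recent = deque(maxlen=window)  # sliding window of the latest "normal" samples
    anomalies = []
    for index, value in enumerate(samples):
        if len(recent) == window:
            baseline = mean(recent)
            spread = stdev(recent)
            # A perfectly flat window has zero spread; this sketch skips scoring then.
            if spread > 0 and abs(value - baseline) / spread > threshold:
                anomalies.append((index, value))
                continue  # keep the outlier out of the baseline
        recent.append(value)
    return anomalies


# Hypothetical request latencies in milliseconds, with one spike.
latencies_ms = [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 450, 101, 99]
print(detect_anomalies(latencies_ms))  # [(10, 450)]
```

A production system would learn seasonality and trends rather than rely on a fixed window, but the principle is the same: establish what normal looks like from the telemetry, then flag what departs from it.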

That brings us to the present day, where Observability and AI pair up to pick out significant alerts from a noisy event stream, identify correlations between alerts from different sources, and assemble the correct team of human specialists to diagnose and resolve a situation. The combination of metrics, events, logs and traces (MELT) is the foundation of an observable system. ML-based clustering, combined with Natural Language Processing, is particularly suited to identifying unanticipated patterns, which makes it ideal for zero-day scenarios. Intelligent Observability solutions even propose probable root causes and possible solutions based on past experiences.
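As a rough illustration of correlating alerts from different sources, the Python sketch below groups alerts that arrive close together and share wording. It is a deliberately simple stand-in for ML-based clustering with NLP: word-overlap (Jaccard) similarity replaces a language model, and the alert format, thresholds and sample messages are hypothetical.

```python
def tokens(message):
    """Lowercase a message and split it into a set of words."""
    return set(message.lower().split())


def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def correlate(alerts, max_gap_seconds=120, min_similarity=0.3):
    """Greedily group alerts that are close in time and similar in wording.

    Each alert is a (timestamp_seconds, source, message) tuple.
    A new alert is compared with the latest alert in each existing group.
    """
    clusters = []
    for alert in sorted(alerts, key=lambda a: a[0]):
        timestamp, _source, message = alert
        for cluster in clusters:
            last_timestamp, _, last_message = cluster[-1]
            close_in_time = timestamp - last_timestamp <= max_gap_seconds
            similar = jaccard(tokens(message), tokens(last_message)) >= min_similarity
            if close_in_time and similar:
                cluster.append(alert)
                break
        else:
            clusters.append([alert])  # no group matched, so start a new one
    return clusters


# Hypothetical alerts from three telemetry sources.
alerts = [
    (0, "metrics", "checkout service latency high"),
    (30, "logs", "checkout service timeout errors"),
    (45, "traces", "checkout service latency spike"),
    (4000, "metrics", "disk usage high on batch host"),
]
for cluster in correlate(alerts):
    print([source for _, source, _ in cluster])
```

With these sample alerts, the three checkout alerts land in one group and the unrelated disk alert in another, so an operator sees two situations instead of four separate notifications.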

So how do IT teams know if a monitoring tool provides true Observability to maximize data insights? The technology must include two components. It must rely on granular, low-level data feeds, and AI must surface the patterns in those data feeds.

Observability is not just a challenge for new DevOps-built systems. As discussed earlier, modern systems are intertwined with various other systems, including legacy systems that are coarser and slower. Consequently, even enterprise ITOSM communities need to adopt completely observable, AI-enabled monitoring tools, even if these systems aren’t targeted at new applications and infrastructures. Indeed, all technology now and into the future will rely on truly observable systems to keep pace with increasing digital innovation and continuous service availability.


As Moogsoft’s chief evangelist, Richard Whitehead brings a keen sense of what is required to build transformational solutions. A former CTO and technology VP, Richard brought new technologies to market and was responsible for strategy, partnerships and product research. Richard served on Splunk’s Technology Advisory Board through their Series A, providing product and market guidance. He served on the advisory boards of RedSeal and Meriton Networks, was a charter member of the TMF NGOSS architecture committee, chaired a DMTF Working Group, and recently co-chaired the ONUG Monitoring & Observability Working Group. Richard holds three patents and is considered dangerous with JavaScript.

Author: Martha Meyer