How to best serve your SRE and DevOps teams


Technology is critical to driving business growth and increasing revenue in our fast-growing digital economy. But the effectiveness of a technology is often determined by its performance and reliability.

DevOps and site reliability engineering (SRE) teams are the people behind that performance and reliability. These teams keep an organization’s apps and vital services performing at their best, especially as IT environments grow in complexity. Avoiding incidents and outages is only one aspect of the job, though.

In addition to maintaining continuous availability, DevOps practitioners and SREs enhance the user experience. As part of this mandate, teams develop the kinds of updates and improvements that delight customers and drive business value. But rolling out these innovations and incorporating them into the production environment can cause the very service-impacting incidents teams are trying to avoid.

DevOps and SRE teams walk a fine line: they must maintain a growing number of applications and increasingly elaborate IT infrastructures while seamlessly delivering faster, continuous development.

And there’s a lot on the line for these teams. Modern enterprises rely heavily on technology for business continuity and revenue growth. So, when systems crash, all eyes are on DevOps practitioners and SREs.

Enterprises have a responsibility to take care of the people safeguarding their critical apps and services, but taking proactive measures to attract and retain this talent is also in their best interest. According to the 2021 Upskilling Enterprise DevOps Skills Report, 60 percent of organizations are recruiting IT talent now or plan to in the future, with DevOps engineers making up 53 percent of these jobs and SREs making up 23 percent. And this talent isn’t easy to find. In fact, 64 percent of IT leaders say finding skilled people is the most challenging aspect of recruiting DevOps candidates.

How can IT leaders best serve their DevOps and SRE teams? I suggest a healthy mix of empathetic leadership and practical tools.

Lead with empathy

SREs and DevOps teams need empathy to improve the customer experience. They must constantly put themselves in consumers’ shoes, innovating based on their needs and desires. Shouldn’t IT leaders return the favor?

Being an empathetic leader means finding out what your team needs and wants. While each team’s desires will vary, most SREs and DevOps professionals choose their path over an IT Operations track because they value innovation. They care about the holistic user experience, not just fixing systems.

In a culture of innovation, there’s no room for risk aversion. That means when something goes wrong, IT leaders should empower their teams to fix the issue and discourage finger-pointing. After teams remediate the problem and tensions subside, managers should host a blameless post-mortem, a review of the causes and events of the incident that investigates the issue without punishing any individual. Creating this culture of shared accountability and learning helps teams continue innovating.

Of course, IT leaders can support this kind of culture and drastically decrease the likelihood of an outage with the right tools.

Provide advanced AIOps tools

Another way of leading with empathy is providing resources that make DevOps and SRE jobs less toilsome and more satisfying. AIOps tools automate toil out of these roles by correlating alerts across distributed services, clarifying and enriching data and adjusting thresholds for anomaly detection.
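To make "correlating alerts across distributed services" concrete, here is a minimal sketch of the simplest version of the idea: grouping alerts that arrive close together in time so a team sees one candidate incident instead of many separate pages. The Alert class, the correlate_alerts function and the 60-second window are all hypothetical names and values chosen for illustration. Real AIOps tools use far richer signals (topology, text similarity, learned patterns) than time proximity alone.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """A hypothetical alert record; field names are assumptions for this sketch."""
    service: str
    timestamp: float  # seconds since some reference point
    message: str


def correlate_alerts(alerts, window_seconds=60.0):
    """Group alerts into candidate incidents by time proximity.

    An alert joins the current group if it arrives within `window_seconds`
    of the previous alert in that group; otherwise it starts a new group.
    """
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if groups and alert.timestamp - groups[-1][-1].timestamp <= window_seconds:
            groups[-1].append(alert)
        else:
            groups.append([alert])
    return groups


if __name__ == "__main__":
    sample = [
        Alert("checkout", 0.0, "latency high"),
        Alert("payments", 30.0, "error rate high"),
        Alert("db", 50.0, "connection pool exhausted"),
        Alert("search", 500.0, "disk usage warning"),
    ]
    for group in correlate_alerts(sample):
        # Prints ['checkout', 'payments', 'db'] and then ['search']
        print([a.service for a in group])
```

In this sample, four raw alerts collapse into two groups, which is the kind of noise reduction that takes toil out of an on-call shift.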

With AIOps, DevOps and SRE teams can quickly receive, understand and prioritize significant events that could cause downtime, affect the user experience or lead to missed service level agreements (SLAs) and service level objectives (SLOs). And AIOps helps prevent incidents from happening again by applying machine learning to all ingested data.
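For readers less familiar with SLOs, the arithmetic behind a "missed" objective is simple enough to sketch. The example below assumes an availability SLO measured over a 30-day window; the function names and the sample numbers are illustrative assumptions, not figures from any particular tool or contract.

```python
def error_budget_minutes(slo_target, period_days=30):
    """Return the allowed downtime in minutes for an availability SLO.

    `slo_target` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_target)


def budget_remaining(slo_target, downtime_minutes, period_days=30):
    """Return the fraction of the error budget still unspent.

    A negative result means the SLO has already been missed for the period.
    """
    budget = error_budget_minutes(slo_target, period_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # A 99.9 percent target over 30 days allows roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # After 10.8 minutes of downtime, about three quarters of the budget is left.
    print(round(budget_remaining(0.999, 10.8), 2))
```

Seen this way, every significant event is a withdrawal from a small, fixed budget, which is why fast detection and prioritization matter so much.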

AIOps can also help DevOps and SRE teams collaborate more effectively, which is sometimes necessary when a high-priority incident occurs. But all-hands-on-deck meetings can be clumsy and unproductive. AIOps tools increase agility by giving involved parties all of the information needed to do their jobs, automating workflows and integrating with existing tools and systems.

The next generation of AIOps solutions is going even further, providing early detection that catches problems before they affect consumers, partners or internal operations. These modern tools automatically process massive amounts of data from across IT environments, converging metrics, traces, logs, changes and events. They operate on partial evidence to detect early indications of a problem.

This advanced technology is different from the brittle rules-based tools that frustrate DevOps and SRE teams. Legacy rules-based systems have a hidden complexity that can hinder incident detection and remediation, not to mention that their constant maintenance is a drain on time and money. And while rules are easy to create, they can’t scale to keep up with today’s complex systems, nor can they address “unknown unknowns.”
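The difference between a fixed rule and a threshold that adapts to recent behavior can be shown in a few lines. This is a deliberately simple sketch: the adaptive check below uses a plain statistical baseline (mean plus a multiple of the standard deviation), which is a stand-in for illustration and not the machine learning that AIOps products actually apply. The function names, the 500 ms limit and the multiplier of 3 are all assumptions.

```python
import statistics


def static_rule(value, limit=500.0):
    """A hand-written rule: alert only when the value exceeds a fixed limit."""
    return value > limit


def adaptive_rule(value, history, k=3.0):
    """Alert when the value exceeds the recent mean by more than k standard deviations.

    With fewer than two history points there is no baseline, so nothing is flagged.
    """
    if len(history) < 2:
        return False
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return value > mean + k * stdev


if __name__ == "__main__":
    recent_latency_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
    spike = 150.0
    # The fixed 500 ms rule stays silent, while the baseline-relative check
    # flags a 50 percent jump over normal latency.
    print(static_rule(spike))                       # False
    print(adaptive_rule(spike, recent_latency_ms))  # True
```

The fixed limit has to be chosen and maintained by hand for every service, while the baseline-relative check picks up a deviation nobody wrote a rule for. That gap widens as the number of services grows.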

On the other hand, next-generation solutions automate the entire AIOps workflow. And there’s a reason why DevOps practitioners, in particular, relentlessly pursue automation. By minimizing mundane work, DevOps and SRE teams can focus on fixing the things that matter and then moving on to tackle the things they love: iterating and improving customer-delighting, revenue-generating technologies.

It’s in an enterprise’s best interest to give SREs and DevOps the empathy they want and the tools they need to perform better. After all, high-performing teams lead to high-performing apps and services. And continuous availability is critical to every modern business.


As Moogsoft’s chief evangelist, Richard Whitehead brings a keen sense of what is required to build transformational solutions. A former CTO and technology VP, Richard brought new technologies to market and was responsible for strategy, partnerships and product research. Richard served on Splunk’s Technology Advisory Board through their Series A, providing product and market guidance. He served on the advisory boards of RedSeal and Meriton Networks, was a charter member of the TMF NGOSS architecture committee, chaired a DMTF Working Group, and recently co-chaired the ONUG Monitoring & Observability Working Group. Richard holds three patents and is considered dangerous with JavaScript.

Author: Martha Meyer