December 11, 2024

From Data Overload to Insight: How AI is Shaping the Next Era of Observability

Observability

Author

Andrew Smyth

Enterprise observability is the practice of measuring the state of an IT system by examining its outputs. Today, this means reliability and operations engineers manually stitch together disparate data sources across multiple software tools to understand the system's operating state. The goal is to maintain system performance, which is crucial for digital services like video streaming, social media, and banking apps. Businesses invest in reliability because it directly impacts financial performance. System outages can cost large enterprises up to $1.4 million per hour, and 25% of Gen Z consumers will switch to a competing service if an app or website is slow. Observability budgets typically account for 20-30% of total infrastructure spend, representing a $51 billion market opportunity.

Understanding observability:  

Observability has traditionally been an instrumentation and data management challenge. Reliability teams first needed to instrument their applications and then extract and store telemetry data. This data helps engineers troubleshoot and conduct root cause analysis, answering questions like “why is my website slow” or “how do I respond to a timeout error for my REST service.” Monitoring and visualization tools now allow engineers to easily ingest, visualize, and monitor this data and to create alerts from it. However, interpreting data across a growing number of charts, dashboards, and alerts has placed an unsustainable cognitive load on senior reliability engineers. What is needed is a higher-level framework that understands the context behind the data and automates troubleshooting by codifying the intuition, technical judgment, and institutional knowledge of the best reliability engineers.

Human capital & increased complexity: 

Individual operations and reliability engineers remain the single point of failure for maintaining system performance and avoiding outages. Burnout and high employee turnover are common. The most talented engineers are overworked, often enduring on-call rotations and weekend work. Observability teams struggle to attract younger engineers, resulting in an acute labor shortage. To bridge this gap, large enterprises spend millions annually on outsourced services for first-level, manual troubleshooting and alert triage.

Labor challenges are compounded by the increasingly complex nature of modern IT environments. Infrastructure services are continuously changing and emit huge volumes of heterogeneous performance data. A typical production-grade enterprise application can involve hundreds of hosts, thousands of containers, and millions of service requests. This complexity has grown with each platform shift, from on-prem to the cloud and now to AI-native infrastructure. The current approach to observability and reliability engineering is not compatible with the performance demands and uptime requirements of modern applications.

The future: 

Microsoft is at the forefront of developer automation and productivity. With generative AI, code and applications are being deployed faster than ever. The engineers maintaining these applications in production, however, cannot keep pace. At M12, we believe the next era of observability products, powered by AI, will revolutionize how we manage and maintain these systems. AI will help augment reliability engineers, automating inefficient, manual tasks and enabling efficient root cause analysis. We’re excited to announce our investment in Neubird.ai, a team building compound AI systems to codify the expertise of seasoned reliability engineers and productize troubleshooting workflows. Over the coming years, we expect the observability paradigm to shift from a system of record to a system of insight, redefining how software systems are designed, optimized, and maintained.