Data for DevOps: Part III

In the first and second parts of this series, we introduced the idea that data analytics can significantly improve IT outcomes and facilitate the transition to DevOps practices. We described the three data types available to IT departments (Operations Data, Monitoring Data, and Event Data), and we delved into more detail about how to use Operations Data and Monitoring Data in the form of continuous streams of metrics. This post focuses on the value of a special category of Monitoring Data: Event Data.

Diving Deeper: Event Data

At a fundamental level, we can understand the job of IT operations teams as delivering on operational specifications such as discoverability, availability, and persistence. In practice, this might mean meeting a specification for page-load times in a web application. Most IT departments already use monitoring data to automatically trigger alarms, alerting operations staff to issues as soon as they arise.

While this is a considerable improvement over waiting for user complaints, analyzing event logs alongside monitoring streams can help solve two major issues with this approach and open up a further opportunity:

  • False alarms – Tuning any automatic alert is difficult. IT departments often have to balance alerts sensitive enough to identify issues quickly against the risk of flooding their systems with false positives. Analyzing the monitoring streams that preceded each type of event can help tailor alerts to specific event types. Multi-variable analytics methods can also allow for more robust, and even self-learning, alerts.
  • Reactive response – Simple, rules-based alerts, such as those that trigger when a metric crosses a given threshold, certainly expedite IT department response times, but they remain inherently reactive. The systems still fail or, depending on the alert design, come close to failing. Analytics methods that model historical data and system behavior prior to events provide the opportunity to develop predictive models that allow for proactive attention. These methods can identify behavior patterns that have evolved into incidents in the past, providing a warning to perform service activities before a failure occurs.
  • Self-healing – The move from reactive to proactive service attention enables significant improvement in IT operations. Perhaps the most exciting prospect of this type of analysis, though, is the ability to remove the need for service attention altogether. By coupling the data-driven failure analysis enabled by Monitoring and Event Data with Operations Data such as service logs, IT organizations can begin to automate responses to specific behavior patterns, improving performance and efficiency.
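
To make the contrast between reactive and proactive alerting concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample values, the event times, the look-back length, and the function names are all made up for this example, and the "early warning level" is a deliberately crude heuristic (the average metric value in the windows preceding past events), not a real predictive model.

```python
from statistics import mean

# Hypothetical monitoring stream: (timestamp, metric value) pairs.
samples = [(0, 40), (1, 42), (2, 55), (3, 70), (4, 95),
           (5, 41), (6, 43), (7, 60), (8, 72), (9, 96)]
# Hypothetical event log: timestamps at which an incident was recorded.
event_times = [4, 9]


def threshold_alerts(samples, limit):
    """Reactive rule: timestamps where the metric exceeds a fixed limit."""
    return [t for t, v in samples if v > limit]


def pre_event_windows(samples, event_times, lookback):
    """Metric values observed in [event - lookback, event) for each event."""
    return {e: [v for t, v in samples if e - lookback <= t < e]
            for e in event_times}


def early_warning_level(samples, event_times, lookback):
    """Crude heuristic: mean metric value in the windows before past events."""
    windows = pre_event_windows(samples, event_times, lookback)
    values = [v for window in windows.values() for v in window]
    return mean(values) if values else None


reactive = threshold_alerts(samples, limit=90)             # [4, 9]
level = early_warning_level(samples, event_times, 2)       # 64.25
proactive = threshold_alerts(samples, limit=level)         # [3, 4, 8, 9]
```

With this made-up data, the fixed threshold fires only at the moment of each incident, while the level derived from pre-event behavior fires one step earlier. A real system would need far more history and a proper model, and would have to check how many false positives the lower level produces.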

Outside of direct operational improvements, Event Data also provide retrospective information for evaluating performance metrics. For example, event logs can allow a department to track the number of times a resource was unavailable when requested. Understanding patterns in these logs and correlating them with other operational changes provides an objective measure of the value of DevOps interventions.
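
As a rough illustration of this kind of retrospective measurement, the sketch below counts "unavailable" events per day and compares the average daily count before and after an operational change. The log format, the day numbering, the status labels, and the function names are assumptions made for this example; the data is invented.

```python
from collections import Counter

# Hypothetical event log: (day, resource, status) tuples.
events = [
    (1, "db", "unavailable"),
    (1, "db", "ok"),
    (1, "cache", "unavailable"),
    (2, "db", "unavailable"),
    (3, "db", "ok"),
    (4, "cache", "unavailable"),
]


def unavailable_per_day(events):
    """Count events with status 'unavailable' for each day."""
    return Counter(day for day, _resource, status in events
                   if status == "unavailable")


def mean_daily_rate(counts, days):
    """Average number of unavailability events per day over the given days."""
    return sum(counts.get(day, 0) for day in days) / len(days)


counts = unavailable_per_day(events)
# Assume a process change was introduced between day 2 and day 3.
before = mean_daily_rate(counts, [1, 2])   # 1.5
after = mean_daily_rate(counts, [3, 4])    # 0.5
```

A drop like this is only suggestive on its own. Correlating it with other operational changes over the same period, as described above, is what turns the count into a credible measure.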

To the extent that DevOps philosophy and practices aim to transform IT departments into proactive entities that can efficiently deliver services across a business, data analytics offers an obvious tool for designing and prioritizing new processes and practices. In the coming months, we will examine the ideas discussed above in more detail.