Self-Healing IT Operations Using AIOps and Incident Automation

Operations engineers rarely have a quiet day.

On a given day they may see 1,000, 5,000, 10,000, or more incident alarms, and it’s up to them to determine which are, in the vernacular, “noise” and which are meaningful. To do that, they have to work through a maze of data from clouds, virtualized applications, websites, and machine logs. Then they have to prioritize the identified incidents, resolving the most pressing immediately and the others soon after.

That’s why service operations need next-generation AIOps to support the automation of incident management. Automated incident management employs models such as k-means clustering and random forests, and it builds on the output of analytics techniques such as anomaly detection and alarm noise reduction. Once in place, next-generation AIOps makes short work of those 10,000 alarms, triaging, prioritizing, correlating, and resolving incidents in near real time.

Transforming Business, Many Incidents at a Time
Automated incident management is a must-have for any business that hopes to make a successful transition into digitization and digital operations.

Businesses are investing heavily in digital transformation, bringing in sensor-rich machinery and self-aware processes while trying to deliver “always-on” services to customers. Yet there is no single end-goal in view – just the knowledge that to pause is to risk losing competitive ground.

The resulting technology clash is predictable. New systems and processes, all sending signals to operations, are creating data complexities like never before. They’re creating problem incidents that have never been seen before and that are well beyond the reach of even the most skilled operations teams.

An example is a modern multiple system operator, or MSO, like Comcast or Cox Communications. Companies like these operate complex service-delivery frameworks to support different access technologies and devices and deliver a variety of voice, data, security, and video services.

Some services are deployed in the cloud; others come directly from third parties. Also, large network segments are virtualized, thus reducing direct visibility into network components and making it difficult to detect problems and determine which services – and customers – are being affected.

Prior Learning Helps
To do its job, automated incident management with AIOps makes use of information gleaned from anomaly detection, which finds anomaly patterns over time and detects process-dependent anomalies.
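As a rough illustration of finding anomalies over time, the sketch below flags points in a metric series that deviate sharply from their recent history. It is a minimal rolling z-score check, not the method any particular AIOps product uses; the function name, window size, and threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def rolling_zscore_anomalies(values, window=5, threshold=3.0):
    """Return indices of points that deviate from the preceding `window`
    samples by more than `threshold` standard deviations.

    `window` and `threshold` are illustrative defaults, not tuned values.
    """
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: treat any change at all as anomalous.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical per-minute error counts with one obvious spike.
error_counts = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]
spikes = rolling_zscore_anomalies(error_counts)
```

A production system would likely also account for seasonality and process-dependent baselines, which this sketch ignores.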

Automated incident management adds its own analytics functions by identifying event patterns that are creating problems. It then:

Correlates new anomalies with existing incidents to reduce alarm noise and perform root-cause analysis;
Prioritizes incidents according to the impact they are having, or may have, on service quality or customer activities. To do this, the analytics must be able to apply risk analysis and activate predictive logic in near real time;
Orchestrates resolutions and displays activities on an operations incident panel. Based on the nature of the identified risk, the analytics can trigger corrective automation through business process management systems or can direct repair by maintenance technicians;
Adds to its knowledge base over time, incorporating known resolutions and learned workarounds into its store of metadata.
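The steps above can be sketched in a few lines of code. The example below is a simplified illustration under stated assumptions: alarms are correlated only by service name and arrival time, priority is the worst severity multiplied by the number of customers on the service, and the knowledge base is a plain dictionary of known resolutions. The class names, the 1 to 5 severity scale, the 300-second window, and the scoring formula are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Alarm:
    timestamp: float  # seconds since some reference point
    service: str
    severity: int     # 1 (low) to 5 (critical); the scale is an assumption


@dataclass
class Incident:
    service: str
    alarms: list = field(default_factory=list)

    @property
    def last_seen(self):
        return self.alarms[-1].timestamp


def correlate(alarms, window=300.0):
    """Fold each alarm into the latest incident on the same service if it
    arrives within `window` seconds of that incident's most recent alarm;
    otherwise open a new incident."""
    incidents = []
    latest = {}  # service name -> most recent incident for that service
    for alarm in sorted(alarms, key=lambda a: a.timestamp):
        incident = latest.get(alarm.service)
        if incident is None or alarm.timestamp - incident.last_seen > window:
            incident = Incident(service=alarm.service)
            incidents.append(incident)
            latest[alarm.service] = incident
        incident.alarms.append(alarm)
    return incidents


def priority(incident, customers_by_service):
    """Illustrative impact score: worst severity times customers affected."""
    worst = max(a.severity for a in incident.alarms)
    return worst * customers_by_service.get(incident.service, 0)


def triage(alarms, customers_by_service, known_resolutions, window=300.0):
    """Correlate, rank by impact, and attach a known resolution if one
    exists; otherwise route the incident to a technician."""
    incidents = correlate(alarms, window)
    incidents.sort(key=lambda i: priority(i, customers_by_service), reverse=True)
    return [
        (i.service, len(i.alarms),
         known_resolutions.get(i.service, "escalate to technician"))
        for i in incidents
    ]


# Hypothetical data for a small run.
alarms = [
    Alarm(0, "video", 3), Alarm(60, "video", 4),
    Alarm(120, "voice", 5), Alarm(1000, "video", 2),
]
queue = triage(alarms, {"video": 1000, "voice": 100},
               {"video": "restart transcoder"})
```

Here four raw alarms collapse into three incidents, and the video incident with two alarms is ranked first because it touches the most customers. Real correlation would draw on topology, event patterns, and learned models rather than a service name and a fixed time window.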

In the MSO, for example, the automated incident management analytics might identify a service incident based on seemingly random events such as network packet anomalies and network switch signals. The analytics could then find that the cause of the incident is a misbehaving application – not the switches themselves.
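One simple way to picture this kind of root-cause reasoning is to walk a map of which components can affect which, and rank the candidates that sit upstream of the most symptoms. The sketch below assumes such a map is available; the component names and the `affected_by` structure are invented for illustration, and real root-cause analysis would weigh far more evidence than a shared-upstream count.

```python
from collections import Counter


def upstream(component, affected_by):
    """Return the component plus everything that can affect it,
    following `affected_by` edges transitively."""
    seen, stack = set(), [component]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(affected_by.get(node, []))
    return seen


def rank_root_causes(symptoms, affected_by):
    """Rank candidates by how many symptoms they sit upstream of.
    Ties are broken alphabetically so the ordering is deterministic."""
    counts = Counter()
    for symptom in symptoms:
        counts.update(upstream(symptom, affected_by))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical topology: both switches carry traffic from one application,
# and the packet stream runs through switch-1.
affected_by = {
    "switch-1": ["billing-app"],
    "switch-2": ["billing-app"],
    "packet-stream": ["switch-1"],
}
ranking = rank_root_causes(["switch-1", "switch-2", "packet-stream"], affected_by)
```

In this toy topology the application ranks above either switch because it is the only component upstream of all three symptoms.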

Expediting Resolutions
Besides keeping customers happy and off the helpline – no small achievement itself – automated incident management leveraging AIOps pays longer-term dividends to the transforming business by:

Improving resource utilization and operational performance – Automated change management is the best way to maximize new capital investment by keeping core technologies dependable in the face of transformation.

Removing geographic constraints – Automated change management gives operations personnel a common language for analyzing and resolving problems, regardless of geographic location. This becomes more important as organizations grow and operations staffers work in the field or out of different offices.

Putting humans higher up on the decision tree – Automated incident management reduces low-level labor costs and makes better use of analysts’ skills. Rather than asking, “What switch does this talk to?” the analyst can now ask, “Which customers might be affected by this?”

Facilitating organizational agility – CEOs and other executives want to know that their investments are paying off and not impeding operations. Automated incident management helps produce efficiency gains that can improve business agility, considered a critical KPI in the age of digital transformation.

AIOps: A Central Role
A wide-scope AIOps application embedded within IT operations service management can move IT much closer to a self-healing operating environment. AIOps can deliver proactive monitoring, anomaly detection, root cause analysis and discovery, and closed-loop automation. It can improve visibility across your service delivery ecosystem, identify and repair issues, produce a single ticket per problem so that multiple IT groups are not working on the same issue, and become a value multiplier by making your existing systems smarter through integration, effective analysis, and automation.

Vitria’s Blog
