Microsoft’s SoftNER AI uses unsupervised learning to help triage cloud service outages

Microsoft is using unsupervised learning techniques to extract knowledge about disruptions to cloud services. In a preprint paper, researchers at the company detail SoftNER, a framework that has been deployed internally at Microsoft to collate information regarding 400 storage, compute, and other cloud outages. They claim it eliminates the need to annotate a large amount of training data while scaling to a high volume of timeouts, slow connections, and other product interruptions.

Structured information has inherent value, particularly in the high-stakes cloud and web operations domains. Not only can it be used to build AI models tailored to tasks like triaging, but it can save time and effort for engineers by automating processes like running checks on resources.

The SoftNER framework attempts to extract knowledge by parsing unstructured text, detecting entities in outage descriptions, and classifying entities into categories. It employs components that identify structural patterns in the descriptions to bootstrap training data, as well as label propagation and a multi-task model to generalize data beyond the patterns and extract entities from the descriptions.

SoftNER begins each run with data de-noising. Drawing incident statements, conversations, stack traces, shell scripts, and summaries from sources including Microsoft customers, feature engineers, and automated monitoring systems, SoftNER normalizes descriptions by pruning tables with more than two columns and getting rid of extraneous tags (like HTML tags). It then segments the descriptions into sentences and tokenizes the sentences into words.
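
The normalization steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SoftNER's actual code: the function names, the regular expressions, and the sample description are hypothetical, and the table-pruning step is left out because the article does not say how tables are represented.

```python
import re
from html import unescape

# Hypothetical patterns chosen for this sketch; the paper's rules may differ.
TAG_RE = re.compile(r"<[^>]+>")                      # anything that looks like a markup tag
SENTENCE_RE = re.compile(r"(?<=[.!?])\s+")           # whitespace that follows end punctuation
TOKEN_RE = re.compile(r"\w+(?:[.:/-]\w+)*|[^\w\s]")  # keeps values like 192.0.2.10 in one token


def strip_tags(text):
    """Replace markup tags with spaces and decode HTML character references."""
    return unescape(TAG_RE.sub(" ", text))


def segment(text):
    """Split text into sentences using a naive punctuation rule."""
    return [s.strip() for s in SENTENCE_RE.split(text) if s.strip()]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN_RE.findall(sentence)


def normalize(description):
    """Strip tags, segment into sentences, then tokenize each sentence."""
    return [tokenize(s) for s in segment(strip_tags(description))]


if __name__ == "__main__":
    sample = "<p>Host 192.0.2.10 is down. Retry timed out!</p>"
    for tokens in normalize(sample):
        print(tokens)
```

The token pattern deliberately keeps dotted or slashed values together so that later tagging steps can treat an address or URL as a single unit. A production pipeline would likely use a more robust sentence splitter than a punctuation rule.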

After performing entity tagging (for things like problem types, exception messages, locations, and status codes) and data-type tagging (for IP addresses, URLs, subscription IDs, and more), SoftNER propagates the entity values’ types to all incident descriptions. For example, if a particular IP address is extracted as a “source IP” entity, SoftNER tags all un-tagged occurrences of that same address as “source IP.”
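
As a rough illustration of the propagation idea, the Python sketch below copies a label from a tagged value to every un-tagged occurrence of the same value. It makes several assumptions the article does not confirm: descriptions are lists of (token, label) pairs, the label "O" marks an un-tagged token, the first label seen for a value wins when there is a conflict, and the address 192.0.2.10 is a placeholder rather than a value from the paper.

```python
UNTAGGED = "O"  # hypothetical marker for tokens that have no entity label yet


def collect_known_labels(descriptions):
    """Map each tagged value to the first label it was seen with."""
    known = {}
    for description in descriptions:
        for token, label in description:
            if label != UNTAGGED:
                known.setdefault(token, label)
    return known


def propagate_labels(descriptions):
    """Label un-tagged tokens whose value was tagged somewhere in the corpus."""
    known = collect_known_labels(descriptions)
    return [
        [
            (token, label if label != UNTAGGED else known.get(token, UNTAGGED))
            for token, label in description
        ]
        for description in descriptions
    ]


if __name__ == "__main__":
    corpus = [
        [("from", "O"), ("192.0.2.10", "source IP")],
        [("192.0.2.10", "O"), ("unreachable", "O")],
    ]
    for description in propagate_labels(corpus):
        print(description)
```

Existing labels are never overwritten here; only tokens that were un-tagged can gain a label, which matches the article's description of tagging "un-tagged occurrences."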

In experiments, the researchers evaluated SoftNER’s performance by applying it to 41,000 outages recorded at Microsoft over a two-month span. The outages came from “large-scale online systems” with “a wide distribution of users,” and their descriptions averaged 472 words each. The researchers report that the framework managed to extract 77 valid entities per 100 from descriptions with over 96% accuracy (averaged over 70 distinct entity types). Moreover, they say that SoftNER is accurate enough in downstream tasks to handle automatic triaging at Microsoft.

The researchers say that in the future, they plan to use SoftNER to evaluate bug reports and improve existing incident reporting and management tools. “Incident management is a key part of building and operating largescale cloud services,” they wrote. “We show that the extracted knowledge can be used for building significantly more accurate models for critical incident management tasks.”

Microsoft isn’t the only tech giant using machine learning to weed out bugs. Amazon’s CodeGuru service, which was partly trained on code reviews and apps developed internally at Amazon, spots issues including resource leaks and wasted CPU cycles. As for Facebook, it developed a tool called SapFix that generates fixes for bugs before sending them to human engineers for approval, and another tool called Zoncolan that maps the behavior and functions of codebases and looks for potential problems in individual branches as well as in the interactions of various paths through the program.
