DOCC Lab Reading Group

LM-PACE: Confidence Estimation by Large Language Models for Effective Root Causing of Cloud Incidents

Paper Link

Research Question: For root cause analysis (RCA) of cloud incidents, how can developers use AI-assisted tools like Large Language Models (LLMs) to obtain trustworthy and useful predictions? More broadly, the paper tackles how to better use LLMs for open-ended tasks that require domain-specific knowledge.

Key Contributions: This paper proposes “confidence estimation” as the metric for quantifying the trustworthiness of root causes recommended by RCA tools. The authors identify two requirements for such a confidence estimation method: 1) it must be highly adaptable and broadly applicable, since RCA predictors vary widely across dimensions like time, services, and background knowledge; 2) it must assume no access to the LLM’s model parameters (weights), probabilities, or logits for output tokens, because such information is usually unavailable (GPT-3.5-Turbo and GPT-4 being cases in point). Given these two requirements, the paper introduces a general framework called LM-PACE that uses a retrieval-augmented two-step prompting procedure to perform black-box confidence calibration.

Given the description of a current incident, LM-PACE retrieves historical cloud incidents and their associated root causes. The first prompting step quantifies the strength of this evidence, i.e., to what extent the model’s evaluation of the root cause would rest on factual and reliable information, producing a score called Confidence-of-Evaluation (COE). The second prompting step evaluates the candidate root cause against the historical incidents, yielding another confidence score called Root Cause Evaluation (RCE). Both steps use chain-of-thought prompting for higher accuracy.
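To make the two-step procedure concrete, here is a minimal sketch. Note that `call_llm`, `retrieve_similar`, the prompt wording, and the 1–5 rating scale are all illustrative assumptions, not the paper’s exact implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class Incident:
    description: str
    root_cause: str

def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g., GPT-4).
    Plug in your provider's client here."""
    raise NotImplementedError

def retrieve_similar(current: str, history: list[Incident], k: int = 4) -> list[Incident]:
    """Toy retriever ranking by word overlap; a real retrieval-augmentation
    step would typically use embedding similarity."""
    cur_words = set(current.lower().split())
    return sorted(history,
                  key=lambda inc: len(cur_words & set(inc.description.lower().split())),
                  reverse=True)[:k]

def score_from(text: str) -> int:
    """Pull the final 1-5 rating out of the model's chain-of-thought output."""
    nums = re.findall(r"\b([1-5])\b", text)
    return int(nums[-1]) if nums else 1

def two_step_confidence(current: str, candidate: str, history: list[Incident]):
    evidence = retrieve_similar(current, history)
    context = "\n".join(f"- Incident: {i.description}\n  Root cause: {i.root_cause}"
                        for i in evidence)

    # Step 1: Confidence-of-Evaluation (COE) -- is the evidence reliable?
    coe = score_from(call_llm(
        f"Historical incidents:\n{context}\n\nCurrent incident: {current}\n\n"
        "Think step by step: do the historical incidents provide factual, "
        "reliable grounds for judging a root cause for the current incident? "
        "End with a rating from 1 (no grounds) to 5 (strong grounds)."))

    # Step 2: Root Cause Evaluation (RCE) -- is the candidate plausible?
    rce = score_from(call_llm(
        f"Historical incidents:\n{context}\n\nCurrent incident: {current}\n"
        f"Candidate root cause: {candidate}\n\n"
        "Think step by step, referencing the historical incidents, and end "
        "with a rating from 1 (implausible) to 5 (very likely correct)."))
    return coe, rce
```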

Finally, LM-PACE defines an optimization objective that maps these scores to a final confidence score for each root cause. Evaluated on 121,308 cloud incidents within Microsoft, LM-PACE is shown to be effective: human experts confirmed that over 80% of the cases assigned high confidence scores are similar to the ground-truth root causes.
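The exact objective is in the paper, but the idea can be sketched as fitting a calibrator that maps the two raw scores to a probability that the recommended root cause is correct, trained on historical incidents with known outcomes. The logistic fit below is an illustrative stand-in, not the paper’s formulation:

```python
import numpy as np

def fit_calibrator(scores, labels, lr=0.1, steps=2000):
    """Fit p(correct | COE, RCE) = sigmoid(w . [coe, rce] + b) by minimizing
    negative log-likelihood on labeled past incidents.
    scores: (n, 2) array of (COE, RCE); labels: (n,) 0/1 correctness."""
    X = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    w, b = np.zeros(2), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted confidence
        w -= lr * (X.T @ (p - y)) / len(y)       # NLL gradient w.r.t. w
        b -= lr * float(np.mean(p - y))          # NLL gradient w.r.t. b
    return lambda coe, rce: float(
        1.0 / (1.0 + np.exp(-(np.dot(w, [coe, rce]) + b))))
```

A high calibrated output would then correspond to the “high confidence” bucket that human experts audited in the paper’s evaluation.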

Opportunities for future work: This work demonstrates LLMs’ potential to calibrate confidence in root cause analysis of cloud incidents. One opportunity is to make the calibration even more accurate, for example by supplementing additional context. Another direction is to ask which other domain-specific tasks in cloud management LLMs can assist effectively.

Presenter: Max Liu