Unicorn: reasoning about configurable system performance through the lens of causality

February 21, 2024

Research Question(s): For complicated, composed systems with many configuration options across the system stack, how to create a performance model that can locate performance bugs, explain how configurable parameters affect performance, and find a configuration that achieves near-optimal performance?

Key Contributions: Most prior work on performance debugging rely on black-box statistical correlations, or “performance influence models” referred to by the paper, but such models suffer from incorrectness and bad explainability. Further, such models are specific to the system from which the observational data was sampled, and they cannot be easily transferred to another environment. To address these challenges, Unicorn creates causal performance models. A causal performance model is an instantiation of Probabilistic Graphical Models with new types and structural constraints to enable performance modeling and analyses. The new types Unicorn add include software options like “batch size” and intermediate causal mechanisms “cache misses” and performance objectives like “throughput” and “energy”.

Unicorn has five stages, Stage I specify performance query. Stage II learns a causal performance model by using observational data collected from the system and creates an acyclic directed mixed graph. In the end, between a node X and a node Y, there are only two possible edges/causal relations, X causes Y, or a confounder exists between X and Y. Stage III is iterative sampling which is an active learning phase. Unicorn selects top K paths using average causal effect to rank causal paths, and determines the next configuration to be measured. Stage IV updates causal performance models and Stage V estimates performance queries using do-calculus.

The technical novelty of Unicorn is that it selectively adopts and finely tunes many algorithms and statistical methods for the causal model to work accurately and efficiently. Also, the evaluation is very comprehensive and thorough.

Opportunities for future work: For Stage I of Unicorn, a grammar/domain-specific language would be great to automate the current manual translation of users’ performance queries. For faster convergence in Stage II and III, new algorithms are needed. Additionally, Unicorn can benefit from incorporating domain-specific knowledge, either from extracting constraints from the source code or involving human feedback in the causal model.

Presenter: Max Liu