Buffer-based End-to-end Request Event Monitoring in the Cloud

August 07, 2024

Research Question: How to accurately diagnose request latency anomalies (RLAs) in cloud environments by monitoring the end-to-end datapath of requests with consistent request-level semantics and low overhead?

Key Contributions: Key Contributions: This paper presents the design and usage of BufScope, A high-coverage request event monitoring system that models the end-to-end datapath of requests as a buffer chain and monitors buffer-related abnormal events. It consists of several noval approaches: 1) A buffer event modeling approach that defines a complete event library based on buffer properties, 2) A concise request-level semantic injection mechanism implemented on SmartNICs to achieve consistent semantics with low overhead, and 3) Implementation and evaluation on commodity programmable switches and SmartNICs, showing BufScope can diagnose 95% of RLAs with <0.07% network bandwidth overhead. A high-leve takeaway is this paper is similiar to Theo’s Foxhound, except Foxhound targets at real request ID in network packets and BufScope targets on RPC ID in packets.

However, some of assumption this paper build upon and arguments the authors conclude do not stand strongly. First, the request ID that author maniupluated in packet are actualy RPC ID which actually holds a many to one relationship to request ID unless forced to make one-to-one mapping and your application is simple enough to complete requests with just one RCP call. Based on what we learned from other Alibaba observability researches, their internal practice does not force one to one mapping between RCP ID and request ID. To maintain a map between RPC IDs and a request ID, it requires extra overhead to weaken scalability of BufScope. Second, evaulation section is not clear enough. 1) Some metrics evaulated are essentially meaningless but plotted as useful–the most presentative one is bandwidth overhead. More data a observability infra collects, more bandwidth it uses (basically liner). We don’t think it’s a good metric to measure in term of overhead. 2) Metris like collector overhead and data persistency cost are missing, which amplifies our concern about this method. Third, the author make some unreasonable arguments when talking about usage of BufScope. Accroding to the plentiful reserach context of distributed tracing, performance issue localization and root analysis in application layer is absurd and challenging–each small step involves a research paper. But the author fail to explain clearly about how to localize and diagnose issues after BufScpoe collecting observability data. Another weird arguement the authors make is contention in other resource could also be detected. We feel confused about 1) how to relate other resource’s event since BufScope works on an idea of connecting events across-layers with RPC ID in network packet, 2) contention diagnosis acutally involves understanding about concurrent requests while BufScope is incapable of.

Opportunities for future work:

Solid approach to link observability data across layers
Reasonable overhead control mechanism for observability tool anywhere

Presenter: Zhaoqi(Roy) Zhang