Fathom: Understanding Datacenter Application Network Performance

June 05, 2024

Research Question: How can one design a system for debugging and understanding network performance in cloud-scale datacenters that offers visibility, interpretability, and scalability?

Key Contributions: This paper presents the design and usage of Fathom, a system that identifies the network performance bottlenecks of any service running in the Google fleet. The paper offers little research novelty, but it provides empirical experience, engineering techniques, and case studies. Fathom’s design goal is to build upon existing telemetry data at Google to perform fine-grained network performance analysis across multiple layers of abstraction. Fathom breaks an RPC’s latency down into subcomponents: time spent in the client application, the RPC queue, buffers, and the network stack, from TCP queueing delay through the WAN rate limiter to the NIC, among others. Fathom achieves this granularity by 1) tracking the byte boundaries of an RPC in the serialization buffers in user space, 2) collecting kernel timestamps at various stages for the RPC’s payload along the end-to-end path, 3) using aggregation techniques that preserve data distributions, especially at the tail, and 4) using a Gaussian Mixture Model to project high-dimensional metrics data onto features of interest, yielding a few blobs that are easy to analyze. All the extended kernel timestamp changes have been upstreamed to Linux v3.17. Fathom’s overhead is only 0.4% of total fleet-wide RPC/TCP/kernel cycles.
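To make points 2) and 3) concrete, here is a minimal sketch of how per-stage timestamps can be turned into a latency breakdown, and how samples can be aggregated in a way that keeps the tail visible. The stage names, the `LogHistogram` class, and the bucket scheme are assumptions made for illustration; they are not Fathom’s actual schema or aggregation method.

```python
import math
from collections import Counter

# Hypothetical stage names, ordered along the send path (not Fathom's schema).
STAGES = ["app_send", "rpc_queue", "tcp_send", "nic_tx", "acked"]

def breakdown(timestamps):
    """Split one RPC's latency into per-stage components.

    `timestamps` maps each stage name to the time (e.g. microseconds)
    at which the RPC's payload reached that stage.
    """
    return {
        f"{a}->{b}": timestamps[b] - timestamps[a]
        for a, b in zip(STAGES, STAGES[1:])
    }

class LogHistogram:
    """Histogram with exponentially growing buckets.

    Relative bucket width is constant, so tail values keep the same
    relative resolution as typical values. Histograms from different
    hosts can be merged by adding bucket counts.
    """

    def __init__(self, buckets_per_doubling=8):
        self.k = buckets_per_doubling
        self.counts = Counter()

    def add(self, value):
        # value must be > 0
        self.counts[math.floor(math.log2(value) * self.k)] += 1

    def merge(self, other):
        # Counter.update adds counts rather than replacing them.
        self.counts.update(other.counts)

    def quantile(self, q):
        """Upper bound of the bucket that contains the q-th quantile."""
        total = sum(self.counts.values())
        if total == 0:
            raise ValueError("empty histogram")
        seen = 0
        for idx in sorted(self.counts):
            seen += self.counts[idx]
            if seen >= q * total:
                return 2 ** ((idx + 1) / self.k)
```

With 8 buckets per doubling, a reported quantile overestimates the true value by at most a factor of 2^(1/8), roughly 9%, regardless of how far into the tail the value lies. A plain mean or a fixed-width histogram would not offer that guarantee.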

The two major use cases of Fathom are 1) at the micro level, diagnosing performance issues for a specific application or service, and 2) at the macro level, characterizing applications’ network performance before and after a roll-out.
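As a rough illustration of the macro-level use case, the sketch below compares median and tail latency before and after a roll-out. The function names and the nearest-rank percentile method are assumptions chosen for simplicity; the summary does not specify how Fathom performs this comparison.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a non-empty list; p in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def compare_rollout(before, after, percentiles=(50, 99)):
    """Return {percentile: (before, after, relative_change)}."""
    report = {}
    for p in percentiles:
        b = percentile(before, p)
        a = percentile(after, p)
        report[p] = (b, a, (a - b) / b)
    return report
```

Comparing several percentiles rather than a single average matters here: a roll-out can leave the median unchanged while noticeably shifting the tail.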

Opportunities for future work: Google operates a set of internal monitoring systems, so one direction for future work is to combine Fathom data with switch data, topology data, and CPU profiling data to further pinpoint resource bottlenecks in Google’s datacenters.

Presenter: Max Liu