DOCC Lab Reading Group

Bootstrapping evolvability for inter-domain routing with D-BGP

Paper Link

Research Question: What are the challenges to “evolving” (incrementally deploying improvements to) internet routing protocol(s) like BGP, and how can those challenges be overcome?

(Meta-Question: What lessons can be learned by taking this paper as an exemplar of a well-stated intro/background/motivation?)

Key Contributions:

This paper states the following problems with BGP (reasons why you’d want to evolve):

Key Insight: Solving any or even all of these problems with a new protocol, without addressing the root causes of lack-of-evolvability, will simply result in a new protocol that frustrates future efforts to evolve/improve. I.e., Evolvability is actually the meta-feature to improve!

Motivation (2) structure:

  1. Narrative describing the structure of the present/future state
    • Introduces terminology: Islands, Gulfs, Multi-Network-Protocol Headers, Baseline Protocol, Routing Compliance (gets its own section?)
    • States the basis for generalization:
      • 14 (!) recently-proposed protocol improvements designed to mitigate specific problems
      • These will be sorted into “evolvability scenarios”
      • Each evolvability scenario produces a few requirements
      • The requirements are further affinitized into features
      • “use cases” -> generalized scenarios -> requirements -> features
        • specificity -> generality -> specificity -> generality
  2. The scenarios:
  3. baseline -> baseline w/ critical fix Requires:
    • (CF-R1) Disseminate critical fixes’ control information across gulfs
    • (CF-R2) Disseminate critical fixes’ control information in-band of baseline’s advertisements
  4. baseline -> baseline (in parallel with) custom protocol Requires:
    • (CP-R3) Facilitate across-gulf discovery of islands running custom protocols and how to negotiate use of their services
  5. baseline -> replacement protocol Requires:
    • (G-R4) Inform islands and gulf ASes of what protocols are used on routing paths
    • (G-R5) Avoid loops across all protocols used on routing paths
  6. The features:
    • Pass-through support: (CF-R1)
    • Multi-protocol data structure: (G-R4, CP-R3, CF-R2, G-R5)
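
To make the two features above concrete, here is a toy Python sketch of what a pass-through-friendly, multi-protocol advertisement might look like; the field names and API are illustrative assumptions, not D-BGP's actual wire format.

```python
# Toy sketch (not D-BGP's actual format): an advertisement that carries
# per-protocol control information side by side, so gulf ASes running only the
# baseline protocol can pass through fields they do not understand (CF-R1, CF-R2),
# while islands can discover which protocols appear on the path (G-R4, CP-R3)
# and detect loops across protocols (G-R5).
from dataclasses import dataclass, field

@dataclass
class MultiProtocolAdvertisement:
    prefix: str                              # destination being advertised
    baseline_path: list[str]                 # baseline (BGP-like) AS path
    # protocol name -> opaque control info; unknown entries are passed through unchanged
    per_protocol_info: dict[str, bytes] = field(default_factory=dict)

    def forward(self, local_asn: str, understood: set[str]) -> "MultiProtocolAdvertisement":
        """What a router might do when re-advertising: extend the baseline path,
        touch only the protocols it understands, and pass the rest through."""
        fwd = MultiProtocolAdvertisement(
            prefix=self.prefix,
            baseline_path=self.baseline_path + [local_asn],
            per_protocol_info=dict(self.per_protocol_info),  # pass-through by default
        )
        for proto in understood & self.per_protocol_info.keys():
            # a real implementation would update protocol-specific state here
            fwd.per_protocol_info[proto] = self.per_protocol_info[proto]
        return fwd

    def has_loop(self, local_asn: str) -> bool:
        # loop avoidance across protocols still relies on the baseline path (G-R5)
        return local_asn in self.baseline_path

# Example: an island running a custom protocol annotates the advertisement;
# a baseline-only gulf AS forwards it without understanding the annotation.
ad = MultiProtocolAdvertisement("10.0.0.0/8", ["AS1"], {"custom-proto": b"negotiation-info"})
ad = ad.forward("AS2", understood=set())           # gulf AS: pure pass-through
ad = ad.forward("AS3", understood={"custom-proto"})
print(ad.baseline_path, list(ad.per_protocol_info))
```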

Questions for Discussion:

Presenter: Tony Astolfi

read more

LM-PACE: Confidence Estimation by Large Language Models for Effective Root Causing of Cloud Incidents

Paper Link

read more

DBOS: A DBMS-oriented Operating System

Paper Link

Research Question: Is it feasible to implement a (distributed) operating system on top of a relational database (RDBMS)? Can we improve analytics/observability of cluster/cloud computing software while simplifying code complexity by doing this?

Key Contributions: This paper presents a vision for a new operating system stack approach, motivated by the shift to large-scale cloud/distributed computing and the increasing complexity of system design that comes with scale. The core insight at the heart of this idea is that a general-purpose, distributed DBMS must internally solve many if not all the major problems facing an operating system at cluster scale, so why not leverage that engineering effort as much as possible? Implicit in this is an assumption that we should be willing to break compatibility with existing system designs and interfaces, provided the net benefits are sufficiently high. Within the scope of this larger project and its assumptions, the goal of this paper is to offer an empirical basis for the feasibility of a DBOS in terms of performance and resource efficiency.

The authors spend much more time on the question of feasibility than potential benefit. On the question of benefit, they cite Kubernetes, XTrace, Dapper, and Prometheus: “Today’s distributed computing stacks … provide few abstractions [to developers]… [requiring them] to build or cobble together a wide range of external tools to monitor application metrics, collect logs, and enforce security policies. In contrast, in DBOS, the entire state of the OS and the application is available in structured tables that can simply be queried.” (2.1) The other benefit, reducing system complexity via a set of common abstractions (data model), interfaces (SQL), and mechanisms, is motivated by attempting to show how common OS concerns (task scheduling, IPC/RPC, and file system) can be mapped onto relational data models and SQL-like queries.
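
As a concrete illustration of the "OS state in tables" idea, here is a minimal sketch using SQLite; the schema and queries are hypothetical stand-ins (DBOS itself uses VoltDB), meant only to show scheduling and monitoring as queries over the same data.

```python
# Toy illustration of the DBOS idea (not the paper's actual schema or VoltDB code):
# OS state such as the task table lives in relational tables, so "scheduling" and
# "observability" are both just SQL queries over the same data.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE task (
    task_id   INTEGER PRIMARY KEY,
    node      TEXT,       -- which node the task is assigned to
    state     TEXT,       -- 'runnable', 'running', 'blocked'
    priority  INTEGER,
    cpu_ms    INTEGER     -- accumulated CPU time, for observability queries
)""")
db.executemany("INSERT INTO task VALUES (?,?,?,?,?)", [
    (1, "node-a", "runnable", 10, 120),
    (2, "node-a", "running",   5, 900),
    (3, "node-b", "runnable",  7,  40),
])

# "Scheduler": pick the highest-priority runnable task per node.
next_tasks = db.execute("""
    SELECT t.node, t.task_id
    FROM task t
    WHERE t.state = 'runnable'
      AND t.priority = (SELECT MAX(priority) FROM task
                        WHERE node = t.node AND state = 'runnable')
""").fetchall()

# "Monitoring": the same table answers observability questions directly.
hot_nodes = db.execute(
    "SELECT node, SUM(cpu_ms) FROM task GROUP BY node ORDER BY 2 DESC").fetchall()
print(next_tasks, hot_nodes)
```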

There is an intrinsic “chicken-and-egg” problem here: the effort to create a production-quality DBOS can only be justified through applications, and applications require the existence of an operating system. This paper offers two mitigations to this dilemma: 1. a paradigmatic break in operating system design organized around a single motif/principle is not without precedent: Unix (“everything is a file”) was similarly revolutionary at its conception, yet it succeeded nonetheless; and 2. a three-step incremental roadmap (“Straw,” “Wood,” and “Brick” - the naysayers (big bad wolves) will be proven wrong!), where each milestone requires more implementation effort but also promises more functionality, so that future investment is justified by the success of the previous iteration. For this paper, DBOS-Straw is implemented and evaluated.

“Straw” uses VoltDB, a distributed in-memory “New-SQL” relational database optimized for small online transactions. Applications are represented as task graphs (DAGs), whose structure represents functional/data dependencies. Many tables are partitioned by a primary key in such a way that this key determines the node on which operations for that row will execute. One interesting exception is the “parallel filesystem” which instead shards/partitions by block number to achieve higher throughput. Overall, the evaluation section does a good job at showing that, for the basic OS functionality chosen, the DBOS approach performs within an order of magnitude of current solutions. This is of course not good enough to supplant the current paradigm, but perhaps argues against DBOS being a completely crazy idea in practical (performance) terms.

Even though most of this paper focuses on the question of feasibility, it is worth asking: is it a good idea to “fuse” the mechanisms, data models, and interfaces used for OS and application functionality with those used for observability and debugging? Is the DBOS concept likely to replace the current landscape? If not, what are some ways that current mechanisms might be influenced by the DBOS philosophy on system design?

Opportunities for Future Work:

Presenter: Tony Astolfi

read more

Teamwork discussion

This week’s discussion is not on a particular research paper, but teamwork in scientific research in general. Reading materials include three sources.

“The Matthew Effect in Science” is the first reading. It was written by a sociologist and was published in Science in 1968. The major argument is that well-known scientists receive more recognition than lesser-known scientists for comparable work. For our own interest, we covered the section “Social and Psychological Bases of the Matthew Effect”. This part discusses several characteristics common among great scientists, among them self-confidence paired with self-criticism, good taste in selecting an important problem with risk rather than a problem with no risk at all, and high standards for what work is worth publishing versus not.

The second reading is a relatively short article from The Guardian, titled “In science today, a genius never works alone”. The author argues that whether there will be more or fewer individual heroic geniuses does not really matter; what matters is the collaborative nature of scientific progress: people become experts in their fields, then share and challenge existing scientific knowledge and each other’s ideas.

The last reading is from the author of “The Power of Habit”, Charles Duhigg. The article is titled “What Google Learned From Its Quest to Build the Perfect Team”. We read about lessons on teamwork from Julia Rozovsky, who was once a student at the Yale School of Management. Her experience at business school showed that work dynamics could vary drastically across different study groups. Julia was later hired by Google and was assigned to Project Aristotle, an internal study of Google’s teams to analyze why some teams work well together while others do not. Project Aristotle’s researchers found that the most important factor for a team to succeed is psychological safety. In particular, the researchers found two important phenomena. The first is what is referred to as “equality in distribution of conversational turn-taking”: each team member speaks for roughly the same amount of time during meetings or conversations. The second phenomenon is high “average social sensitivity”, which means that team members usually infer and understand others’ emotional states through nonverbal cues like eye contact or tone of voice.

Presenter: Max Liu

read more

Buffer-based End-to-end Request Event Monitoring in the Cloud

Research Question: How to accurately diagnose request latency anomalies (RLAs) in cloud environments by monitoring the end-to-end datapath of requests with consistent request-level semantics and low overhead?

Key Contributions: This paper presents the design and usage of BufScope, a high-coverage request event monitoring system that models the end-to-end datapath of requests as a buffer chain and monitors buffer-related abnormal events. It consists of several novel approaches: 1) a buffer event modeling approach that defines a complete event library based on buffer properties, 2) a concise request-level semantic injection mechanism implemented on SmartNICs to achieve consistent semantics with low overhead, and 3) an implementation and evaluation on commodity programmable switches and SmartNICs, showing BufScope can diagnose 95% of RLAs with <0.07% network bandwidth overhead. A high-level takeaway is that this paper is similar to Theo’s Foxhound, except that Foxhound targets real request IDs in network packets while BufScope targets RPC IDs.
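
A toy sketch of the buffer-chain framing might look like the following; the buffer names, thresholds, and event kinds are assumptions for illustration, not BufScope's actual event library.

```python
# A toy rendering of BufScope's framing (not its implementation): model the
# datapath as a chain of buffers, and emit buffer events tagged with the RPC ID
# carried in packets so events from different hops can be stitched together.
from dataclasses import dataclass

@dataclass
class BufferEvent:
    rpc_id: int
    buffer_name: str
    kind: str          # e.g. 'overflow', 'long_residency'
    detail: float

@dataclass
class Buffer:
    name: str
    capacity: int
    residency_limit_us: float
    occupancy: int = 0

    def enqueue(self, rpc_id: int, events: list) -> bool:
        if self.occupancy >= self.capacity:
            events.append(BufferEvent(rpc_id, self.name, "overflow", self.occupancy))
            return False
        self.occupancy += 1
        return True

    def dequeue(self, rpc_id: int, waited_us: float, events: list) -> None:
        self.occupancy -= 1
        if waited_us > self.residency_limit_us:
            events.append(BufferEvent(rpc_id, self.name, "long_residency", waited_us))

# A request's datapath is the chain of buffers it traverses; any abnormal event
# along the chain is attributable to the same rpc_id.
events: list[BufferEvent] = []
chain = [Buffer("nic_tx", 2, 50.0), Buffer("switch_queue", 1, 100.0), Buffer("rpc_rx", 4, 500.0)]
for rpc_id in (1, 2, 3):
    for buf in chain:
        if buf.enqueue(rpc_id, events):
            buf.dequeue(rpc_id, waited_us=120.0 if buf.name == "switch_queue" else 10.0, events=events)
print([(e.rpc_id, e.buffer_name, e.kind) for e in events])
```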

However, some of the assumptions this paper builds on and some of the arguments the authors draw do not stand up strongly. First, the request ID the authors manipulate in packets is actually an RPC ID, which has a many-to-one relationship to request IDs unless a one-to-one mapping is forced and the application is simple enough to complete a request with a single RPC call. Based on what we learned from other Alibaba observability research, their internal practice does not force a one-to-one mapping between RPC IDs and request IDs. Maintaining a map between RPC IDs and a request ID requires extra overhead, which weakens BufScope’s scalability. Second, the evaluation section is not clear enough. 1) Some of the metrics evaluated are essentially meaningless but are plotted as if useful; the most representative one is bandwidth overhead. The more data an observability infrastructure collects, the more bandwidth it uses (the relationship is basically linear), so we don’t think it is a good metric of overhead. 2) Metrics like collector overhead and data persistence cost are missing, which amplifies our concern about this method. Third, the authors make some unreasonable arguments when discussing the usage of BufScope. Given the plentiful research context of distributed tracing, performance issue localization and root cause analysis at the application layer are extremely challenging; each small step is a research paper in itself. Yet the authors fail to explain clearly how to localize and diagnose issues after BufScope collects its observability data. Another odd argument the authors make is that contention on other resources could also be detected. We are confused about 1) how events from other resources can be correlated, since BufScope works by connecting events across layers via the RPC ID in network packets, and 2) how contention can be diagnosed, since that requires understanding concurrent requests, which BufScope is incapable of.

read more

Autothrottle: A Practical Bi-Level Approach to Resource Management for SLO-Targeted Microservices

Research Question: How can current auto-scaling approaches be improved for microservice applications, in order to minimize cost and SLO violations in the face of dynamically changing workloads?

Key Contributions: The authors present Autothrottle, a “bi-level learning assisted framework” for automatically adjusting resource allocations dynamically per microservice instance. Autothrottle employs a global, application-wide component (Tower) and a per-microservice component (Captains). Tower uses an online learning approach (contextual bandits) to set throttle targets for clustered classes of microservices; these targets are periodically (per-minute) sent to the Captains, which monitor the number of CPU throttling events for their service and adjust the service’s cgroup quota to try to hit the target throttling rate. Captains are biased to react very quickly to a throttle rate above target by increasing CPU quota in order to avoid SLO violations. The use of CPU throttling events as a proxy metric for request latency was chosen because it was found empirically to have better correlation than the more typically used CPU utilization (5.3, Figure 7). Another interesting design choice is to use k-means clustering to group microservices by average CPU utilization, and then set throttle targets per group. It was found that the best choice for k is 2, after which returns diminish (5.3 “Number of Performance Targets”). Autothrottle was able to consistently outperform the K8s CPU and CPU-Fast autoscalers and Sinan (a state-of-the-art ML based autoscaler), with 25%-50% lower resource allocation to satisfy the SLO. Interestingly, it does this while maintaining a worse (higher) P99 latency (while still under SLO), indicating a much lower variance and more stable performance (5.4, Figure 9b).
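
A minimal sketch of a Captain-style control loop, assuming a cgroup-v2 container controlled via cpu.max and observed via cpu.stat's nr_throttled counter; the cgroup path, adjustment factors, and one-second period are illustrative, not Autothrottle's actual controller.

```python
# Minimal sketch of a Captain-style control loop (illustrative; not Autothrottle's
# exact controller). It assumes a cgroup-v2 container whose CPU limit is set via
# cpu.max and whose throttling events are counted in cpu.stat's nr_throttled field.
import time

CGROUP = "/sys/fs/cgroup/my-service"     # hypothetical cgroup path
PERIOD_US = 100_000                      # standard 100ms CFS period

def read_nr_throttled() -> int:
    with open(f"{CGROUP}/cpu.stat") as f:
        for line in f:
            key, value = line.split()
            if key == "nr_throttled":
                return int(value)
    return 0

def set_quota(quota_us: int) -> None:
    with open(f"{CGROUP}/cpu.max", "w") as f:
        f.write(f"{quota_us} {PERIOD_US}")

def captain_loop(target_throttles_per_sec: float, quota_us: int = 200_000) -> None:
    prev = read_nr_throttled()
    while True:
        time.sleep(1.0)
        cur = read_nr_throttled()
        rate = cur - prev
        prev = cur
        if rate > target_throttles_per_sec:
            # Above the throttle target set by Tower: react quickly and scale up
            # the quota to avoid SLO violations.
            quota_us = int(quota_us * 1.5)
        elif rate < target_throttles_per_sec:
            # Below target: reclaim CPU slowly.
            quota_us = max(10_000, int(quota_us * 0.95))
        set_quota(quota_us)
```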

The k=2 clustering approach seems to suggest the microservices in the benchmark applications (Social-Network, Train-Ticket, and Hotel-Reservation) sort roughly into CPU-intensive and non-CPU-intensive, and further that this classification strongly predicts how sensitive the end-to-end application latency is to rate of CPU throttling within each group.

Opportunities for future work:

  1. Applying the approach to different resource types like memory and storage
  2. Adding horizontal scaling (in addition to vertical-scaling) to the action model

Presenter: Tony Astolfi

read more

Fathom: Understanding Datacenter Application Network Performance

Research Question: How to design a system for debugging and understanding network performance in cloud-scale datacenters with visibility, interpretability and scalability?

Key Contributions: This paper presents the design and usage of Fathom, a system for identifying the network performance bottlenecks of any service running in the Google fleet. There is not much research novelty in this paper, but it provides empirical experience, engineering techniques, and case studies. The design goal of Fathom is to use and build upon existing telemetry data at Google to do fine-grained network performance analysis that covers multiple different layers of abstraction. Fathom breaks down an RPC’s latency into subcomponents: time spent in the client application, the RPC’s queues and buffers, and the network stack, from TCP queueing delay to the WAN rate limiter to the NIC, etc. Fathom achieves this fine granularity by 1) tracking the byte boundaries of an RPC in the serialization buffers in user space, 2) collecting kernel timestamps at various stages along the end-to-end path of the RPC’s payload, 3) using aggregation techniques that preserve data distributions, especially at the tail, and 4) using a Gaussian Mixture Model to project high-dimensional metric data onto features of interest to get a few blobs for easy analysis. All the extended kernel timestamp changes have been upstreamed to Linux since v3.17. Fathom incurs an overhead of only 0.4% of fleet-wide total RPC/TCP/kernel cycles.
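
The latency breakdown idea can be illustrated with a small sketch that turns per-stage timestamps into deltas; the stage names are assumptions loosely following the summary, not Fathom's actual instrumentation points.

```python
# Illustrative sketch (not Fathom's actual schema): given timestamps recorded at
# successive stages of an RPC's end-to-end path, the latency breaks down into
# per-stage deltas that can then be aggregated across the fleet.
def breakdown(timestamps_us: dict[str, int]) -> dict[str, int]:
    # Hypothetical stage order, loosely following the summary above.
    order = ["app_send", "rpc_queue", "tcp_enqueue", "nic_tx",
             "nic_rx", "tcp_dequeue", "rpc_deliver", "app_recv"]
    stages = [s for s in order if s in timestamps_us]
    return {f"{a}->{b}": timestamps_us[b] - timestamps_us[a]
            for a, b in zip(stages, stages[1:])}

sample = {"app_send": 0, "rpc_queue": 40, "tcp_enqueue": 55,
          "nic_tx": 300, "nic_rx": 350, "tcp_dequeue": 420,
          "rpc_deliver": 430, "app_recv": 500}
print(breakdown(sample))   # here the tcp_enqueue->nic_tx delta dominates
```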

The two major use cases of Fathom are 1) at micro-level, diagnosing application performance issues for a specific application/service and 2) at macro-level, characterizing applications’ network performance before and after a roll-out.

Opportunities for future work: With a set of internal monitoring systems at Google, a future direction is to synthesize and combine Fathom data, switch data, topology data, and CPU profiling data to further pinpoint resource bottlenecks in Google’s datacenters.

Presenter: Max Liu

read more

Trace-based Multi-Dimensional Root Cause Localization of Performance Issues in Microservice Systems

Research Question: Can we identify root causes of performance problems by analyzing differences in the critical paths of normal and anomalous requests?

Key Contributions: This paper presents TraceContrast, a framework for analyzing the critical paths of request traces from distributed systems, finding disparities between the patterns within anomalous traces and normal traces, and using that information to localize causes of performance problems within these workflows. In particular, the approach presented in this paper notes that performance problems may arise not just from a single component’s functionality but often from the interplay between various components, and thus the approach aims to identify sequential patterns that indicate poor performance.

By representing critical paths as sequences of events, the paper’s approach is able to group paths into sequences of identical structure, and then combine those sequences into prefix trees, where divergences in the sequences cause branching. These sequences include not only tracepoint IDs but also include information such as version information and key-value data. By identifying nodes in the prefix trees that correspond to a high prevalence of anomalous traces versus normal traces, the algorithm is able to identify relevant trace sequences that point towards anomalous behavior. The algorithm outputs a ranked set of features meant to capture the root cause of a performance problem.
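
A toy version of the prefix-tree scoring might look like the following; the event names and the simple anomaly-ratio score are illustrative, not the paper's exact sequential-pattern mining.

```python
# Toy version of the idea (not TraceContrast's algorithm): insert each critical
# path, represented as a sequence of events, into a prefix tree, then score each
# node by how over-represented anomalous traces are beneath it.
from collections import defaultdict

class Node:
    def __init__(self):
        self.children = defaultdict(Node)
        self.normal = 0
        self.anomalous = 0

def insert(root: Node, events: list[str], anomalous: bool) -> None:
    node = root
    for ev in events:
        node = node.children[ev]
        if anomalous:
            node.anomalous += 1
        else:
            node.normal += 1

def rank(root: Node):
    """Return (prefix, anomaly_ratio) pairs sorted by how anomalous the prefix is."""
    results = []
    def walk(node, prefix):
        for ev, child in node.children.items():
            total = child.normal + child.anomalous
            results.append((prefix + [ev], child.anomalous / total))
            walk(child, prefix + [ev])
    walk(root, [])
    return sorted(results, key=lambda r: -r[1])

root = Node()
insert(root, ["gateway", "auth", "db.query"], anomalous=False)
insert(root, ["gateway", "auth", "cache.miss", "db.query"], anomalous=True)
insert(root, ["gateway", "auth", "cache.miss", "db.query"], anomalous=True)
print(rank(root)[:3])   # the 'cache.miss' branch surfaces as the divergence
```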

Opportunities for future work: The approach presented in this paper focuses exclusively on critical paths of requests suffering from performance problems. However, this method appears to be extendable in order to compare broader workflow patterns and to identify workflow properties at large that may contribute to performance problems in a system, particularly for non-problematic requests whose execution can be causally related to the execution of problematic requests.

Presenter: Tomislav Zabcic-Matic

read more

Lifting the veil on Meta’s microservice architecture: Analyses of topology and request workflows

Research Question: The primary research question addressed is to understand and characterize the distinctive features of Meta’s large-scale microservice architecture. The study investigates how the topology and request workflows within this architecture differ from conventional microservice designs and the implications these differences have for developing and researching tools that utilize microservice topology and request traces.

Key Contributions: The paper makes several significant contributions to the understanding of large-scale microservice architectures. Firstly, it characterizes Meta’s microservice topology, revealing it as extremely heterogeneous and in constant flux, with numerous software entities that do not conform to traditional microservice definitions. This insight underscores the complexity and dynamic nature of Meta’s architecture. Secondly, the study analyzes request workflows, finding them highly dynamic but locally predictable using service and endpoint names. This predictability, despite overall complexity, provides valuable insights into the system’s behavior. Lastly, the research quantifies the impact of obfuscating factors on microservice measurement, highlighting the challenges of analyzing such systems and suggesting areas for improvement in tool development​.

Opportunities for future work: The paper identifies several promising avenues for future research. One key area is the development of advanced tools tailored to the dynamic and heterogeneous nature of large-scale microservice architectures. Enhancing existing tools and creating new ones could significantly improve system management and performance analysis. Another area for exploration is the detailed study of request workflows, particularly focusing on their predictability and management in complex systems. Additionally, further research could investigate methods to mitigate the impact of obfuscating factors on microservice measurement, leading to more accurate and reliable performance data. Finally, exploring the roles and behaviors of non-conforming software entities within the architecture could deepen the overall understanding of their impact and contribution to the system.

Presenter: Zhaoqi Zhang

read more

A Cloud-Scale Characterization of Remote Procedure Calls

Research Question(s): RPC is a key enabler for cloud-scale distributed applications, and its use is increasing rapidly. Characterizing RPC usage in the cloud helps build a better understanding of cloud applications, yielding insights for optimization. This paper presents the methodology and results of characterizing RPCs at Google.

read more

Pivot Tracing summary & discussion

Research Question(s): How to correlate and group events across distributed components? How to avoid expensive data collection during execution?

Key Contributions: 1) Introduced the “happened-before join” operator to group and filter events that causally precede each other in an execution, optimizing metric collections with the baggage abstraction. 2) Developed a prototype for Java-based systems and evaluated it on a heterogeneous Hadoop cluster, including HDFS, HBase, MapReduce, and YARN. 3) Achieved low execution overhead.
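
The flavor of the happened-before join can be shown with a toy sketch in which tuples from an upstream tracepoint ride in the baggage and are joined with a downstream metric; the tracepoint names and grouping are hypothetical, not Pivot Tracing's actual query engine.

```python
# A toy rendering of the happened-before join (not Pivot Tracing's query engine):
# tuples emitted at an upstream tracepoint are packed into the request's baggage,
# and a downstream tracepoint joins its own measurements against them, grouping
# by an upstream attribute.
from collections import defaultdict

def upstream_tracepoint(baggage: dict, client: str) -> None:
    # e.g. a "client request" tracepoint records who issued the request
    baggage.setdefault("client_tuples", []).append({"client": client})

def downstream_tracepoint(baggage: dict, bytes_read: int, sink: dict) -> None:
    # e.g. a "disk read" tracepoint: join its metric with every causally
    # preceding client tuple carried in the baggage (the happened-before join).
    for t in baggage.get("client_tuples", []):
        sink[t["client"]] += bytes_read

# Simulate three requests flowing through both tracepoints.
per_client_bytes = defaultdict(int)
for client, nbytes in [("tenant-a", 4096), ("tenant-b", 512), ("tenant-a", 8192)]:
    baggage = {}                       # baggage is propagated with the request
    upstream_tracepoint(baggage, client)
    downstream_tracepoint(baggage, nbytes, per_client_bytes)
print(dict(per_client_bytes))          # {'tenant-a': 12288, 'tenant-b': 512}
```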

Opportunities for future work: 1) Their current definition of causality does not capture all possible causal relationships, including cases where events in the processing of one request could influence another. Our work can extend beyond a simple happened-before relationship. 2) The propagation overhead is potentially unbounded (the number of packed tuples in a baggage can become enormous, although this is unlikely). In scenarios of excessive overhead, the system could switch to counting the number of tuples, but important information might still be missing when troubleshooting the root cause. More sampling strategies could be investigated to evaluate its scalability.

Presenter: Mona Ma

read more

Argus: Debugging performance issues in modern desktop applications with annotated causal tracing

Research question(s): How can we build accurate, powerful traces of behaviors in desktop applications to aid with debugging anomalous behaviors?

Summary: Argus proposes tracing for desktop applications. This form of tracing is different from our traditional notion of distributed traces: they build traces capturing causal relationships between segments of execution from system-level logs without relying on context propagation. They make minor changes to the Mach Kernel to support this trace generation. They capture logs for critical system behaviors that allow them to tease out causality after the fact. Their trace model has three types of edges: strong, weak and enhanced weak. They later use a beam-search algorithm to find the most important causal paths for anomalous events in traces. To find the root cause of anomalous events, they often need a ‘normal’ trace for the problematic behavior to identify what differed on the causal path to lead to the anomalous event. They evaluated Argus on many different performance problems in desktop applications, most of which were unsolved, and identified the root cause for all of the problems.
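
A toy beam search over a causal graph might look like the following; the graph, edge confidences, and scoring are invented for illustration and stand in for Argus's strong/weak/enhanced-weak edge semantics.

```python
# Toy beam search over a causal graph (illustrative; not Argus's algorithm or edge
# semantics). Edges carry a confidence score standing in for strong/weak edge
# types, and we search backwards from the anomalous event for the most plausible
# causal path, keeping only the top `beam_width` partial paths per step.
import math

# hypothetical causal graph: predecessor edges with confidence scores in (0, 1]
preds = {
    "ui_hang":             [("render_thread_block", 0.9), ("gc_pause", 0.3)],
    "render_thread_block": [("ipc_wait", 0.8)],
    "ipc_wait":            [("daemon_busy", 0.7), ("disk_io", 0.4)],
    "gc_pause":            [],
    "daemon_busy":         [],
    "disk_io":             [],
}

def beam_search(anomaly: str, beam_width: int = 2, max_depth: int = 4):
    # each candidate is (log-score, path ending at the anomaly)
    beam = [(0.0, [anomaly])]
    for _ in range(max_depth):
        expanded = []
        for score, path in beam:
            for prev, conf in preds.get(path[0], []) or [(None, None)]:
                if prev is None:
                    expanded.append((score, path))          # path cannot be extended
                else:
                    expanded.append((score + math.log(conf), [prev] + path))
        beam = sorted(expanded, key=lambda c: -c[0])[:beam_width]
    return beam[0]

print(beam_search("ui_hang"))  # daemon_busy -> ipc_wait -> render_thread_block -> ui_hang
```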

Argus is mainly suitable for correctness problems since they make the assumption that problems will appear as a structural difference in the trace. They also assume that we have a normal execution of problematic behaviors to compare with.

Opportunities for future work: It’d be interesting to see how Argus can be combined with existing distributed tracing frameworks to collect per-service detailed traces. Argus is currently built for macOS, but the ideas seem to be generally applicable (although work needs to be done to generalize the specific system behaviors that should be logged to infer causality).

Presenter: Darby Huye

read more

Understanding and Detecting Software Upgrade Failures in Distributed Systems

Research Question(s): How can we effectively detect software upgrade failures in distributed systems? How can we most effectively diagnose the causes of version compatibility issues that cause upgrade failures?

Key Contributions: This paper presents a framework for both detecting and diagnosing the causes of upgrade-related failures in distributed systems. It presents DUPTester, a version compatibility testing tool that systematically tests for problems between different software versions, and DUPChecker, a tool for discovering incompatibilities between the data formats used and required by different versions of the same software. The paper presents two key insights regarding upgrade failures - the first being that most upgrade failures can be detected by testing between at most 2 version upgrades, and the second being that most failures can be detected by running workloads already present in the provided test suites. Through a thorough study of upgrade failures, the authors identify that two major categories of upgrade failure causes are incompatibilities between versions and broken upgrade operations, and identify several major categories of incompatibilities - namely, incompatibilities in data syntax (in serialization libraries) and data semantics.

The work finds that typically only three nodes are needed to replicate most kinds of failures, and the diagnostic system presented in the work uses 3-node configurations in order to test combinations of versions that are up to two versions apart, in line with the findings from the study of upgrade failures. Additionally, the work presents a static analysis tool that is able to identify incongruities between the data syntax and semantics of different software versions.
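
The test matrix implied by these findings can be sketched in a few lines; the version numbers are hypothetical and this is only an illustration of the "3 nodes, at most two versions apart" idea, not DUPTester's actual logic.

```python
# Toy enumeration of the test matrix implied by the findings above (illustrative,
# not DUPTester's actual logic): for each pair of versions at most two releases
# apart, generate the 3-node mixed-version configurations to exercise.
from itertools import product

versions = ["3.0", "3.1", "3.2", "3.3"]          # hypothetical release train

def upgrade_configs(versions, nodes=3, max_gap=2):
    for i, old in enumerate(versions):
        for new in versions[i + 1 : i + 1 + max_gap]:
            # every assignment of {old, new} to the nodes that mixes both versions
            for combo in product((old, new), repeat=nodes):
                if old in combo and new in combo:
                    yield (old, new, combo)

for old, new, combo in upgrade_configs(versions):
    print(f"upgrade {old} -> {new}: nodes run {combo}")
```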

Opportunities for future work: One possible avenue for future work would be to look into the integration of a system for detecting and diagnosing upgrade failures into a broader problem diagnosis system in order to assist in the identification of root causes of both correctness and performance problems in distributed systems.

Presenter: Tomislav Zabcic-Matic

read more

Unicorn: reasoning about configurable system performance through the lens of causality

Research Question(s): For complicated, composed systems with many configuration options across the system stack, how to create a performance model that can locate performance bugs, explain how configurable parameters affect performance, and find a configuration that achieves near-optimal performance?

Key Contributions: Most prior work on performance debugging relies on black-box statistical correlations, or “performance influence models” as the paper calls them, but such models suffer from incorrectness and poor explainability. Further, such models are specific to the system from which the observational data was sampled, and they cannot be easily transferred to another environment. To address these challenges, Unicorn creates causal performance models. A causal performance model is an instantiation of Probabilistic Graphical Models with new types and structural constraints to enable performance modeling and analyses. The new types Unicorn adds include software options like “batch size”, intermediate causal mechanisms like “cache misses”, and performance objectives like “throughput” and “energy”.

Unicorn has five stages. Stage I specifies the performance query. Stage II learns a causal performance model from observational data collected from the system and creates an acyclic directed mixed graph; in the end, between a node X and a node Y there are only two possible edges/causal relations: X causes Y, or a confounder exists between X and Y. Stage III is iterative sampling, an active learning phase: Unicorn selects the top K paths, using average causal effect to rank causal paths, and determines the next configuration to be measured. Stage IV updates the causal performance model, and Stage V estimates the performance queries using do-calculus.
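
As a heavily simplified illustration of ranking options by average causal effect: under a (strong) no-confounding assumption, the ACE reduces to a difference of means, whereas Unicorn proper adjusts using the learned causal graph and do-calculus. The data and option names below are made up.

```python
# Heavily simplified toy of "average causal effect" ranking (not Unicorn's method,
# which adjusts for confounders via the causal graph and do-calculus). Under a
# no-confounding assumption, ACE(option -> latency) reduces to a difference of means.
def average_causal_effect(samples, option, outcome="latency_ms"):
    on  = [s[outcome] for s in samples if s[option] == 1]
    off = [s[outcome] for s in samples if s[option] == 0]
    return sum(on) / len(on) - sum(off) / len(off)

# hypothetical observations: configuration options and measured latency
samples = [
    {"cache_on": 1, "batch_large": 0, "latency_ms": 40},
    {"cache_on": 1, "batch_large": 1, "latency_ms": 55},
    {"cache_on": 0, "batch_large": 0, "latency_ms": 90},
    {"cache_on": 0, "batch_large": 1, "latency_ms": 110},
]
ranked = sorted(["cache_on", "batch_large"],
                key=lambda opt: abs(average_causal_effect(samples, opt)),
                reverse=True)
print(ranked)   # options ranked by estimated effect on latency
```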

The technical novelty of Unicorn is that it selectively adopts and finely tunes many algorithms and statistical methods for the causal model to work accurately and efficiently. Also, the evaluation is very comprehensive and thorough.

Opportunities for future work: For Stage I of Unicorn, a grammar/domain-specific language would be great to automate the current manual translation of users’ performance queries. For faster convergence in Stage II and III, new algorithms are needed. Additionally, Unicorn can benefit from incorporating domain-specific knowledge, either from extracting constraints from the source code or involving human feedback in the causal model.

Presenter: Max Liu

read more

Sieve: Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems

Research Question: How can we effectively reduce the volume of trace data in distributed microservice systems while preserving the quality of information necessary for monitoring, diagnosing, and analyzing system behavior?

Key Contributions: Sieve introduces an attention-based sampling mechanism that prioritizes and samples trace data by focusing on significant traces likely to provide valuable insights. It implements a dynamic sampling probability for each trace, determined by analyzing various features extracted from the trace data to assess its informativeness. This approach allows Sieve to significantly reduce the volume of trace data while maintaining the essential information necessary for effective system monitoring and analysis.
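
The general idea of biasing sampling toward uncommon traces can be sketched as follows; the signature counting and the 1/count probability rule are assumptions for illustration, not Sieve's attention mechanism.

```python
# Toy rarity-biased sampler (illustrative only; Sieve's actual mechanism is
# attention-based over extracted trace features). Traces whose feature signature
# has been seen often in a recent window get a low sampling probability; rare
# signatures are sampled with high probability.
import random
from collections import Counter, deque

class RaritySampler:
    def __init__(self, window_size=1000, base_rate=0.01):
        self.window = deque(maxlen=window_size)   # sliding window of signatures
        self.counts = Counter()
        self.base_rate = base_rate

    def observe(self, signature: tuple) -> bool:
        """Return True if this trace should be kept."""
        if len(self.window) == self.window.maxlen:
            self.counts[self.window[0]] -= 1      # expire the oldest observation
        self.window.append(signature)
        self.counts[signature] += 1
        # sampling probability decays with how common the signature currently is
        p = max(self.base_rate, 1.0 / self.counts[signature])
        return random.random() < p

sampler = RaritySampler()
common = ("GET /home", "200", "fast")
rare = ("GET /checkout", "500", "slow")
kept = sum(sampler.observe(common) for _ in range(1000)) + sampler.observe(rare)
print("kept roughly", kept, "of 1001 traces; the rare one is almost surely kept")
```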

We identified several areas where further clarifications would be helpful, highlighting potential weaknesses in their approach. Firstly, we questioned the timing of Sieve’s sampling rate changes. Given that traces cannot be processed until their collection is complete, the method appears tail-based rather than real-time, raising concerns about its classification as an “online sampler.” Secondly, we noted that Sieve employs a sliding window function for sampling rate adjustments, which means it may take time to reduce the sampling rate after encountering an uncommon trace. This delay could result in an excessive amount of data being sampled immediately afterward. Additionally, the paper frequently mentions API evaluations but lacks sufficient description, leading us to infer that sampling rate increases might be limited to specific APIs. Lastly, the evaluation section refers to API groups without detailed specifications, leaving us unclear about their definitions and implications. Addressing these points would enhance understanding of Sieve’s practical applications and limitations.

Opportunities for future work: Given the unresolved questions above and the quality of the paper, I’ll leave directions for future work empty.

Presenter: Zhaoqi Zhang

read more

LatenSeer summary & discussion

Research Question(s): Current methods for estimating latency in complex distributed systems are unrealistic. Challenges include handling diverse and complex traces (addressed using set nodes), dealing with clock skew and path-dependent executions that make aggregated causal relationships hard to identify (addressed using succession time), and accounting for mutual independence (addressed using joint latency profiling).

Key Contributions: 1) Developed an offline tool for conducting latency estimation from distributed traces. 2) Improved data structures to accurately represent causal relationships and model latency at scale. 3) Achieved easy deployment, enabling the framework to integrate with existing data collection systems built on top of Jaeger & OTel; suitable for versatile scenarios where the system estimates end-to-end latency given hypothetical latency changes in any of its constituent services; and provides realistic forecasts.

Opportunities for future work: 1) Latency for span A is described by the equation L(A) = L(B) + L(E) + L(G) + L(N), where the child services B, E, and G are on the latency-critical path and L(N) accounts for the network latency or other processing time. However, the internal dependencies of B, E, and G might not be accurately reflected in the actual traces, as they could overlap in time or run concurrently. 2) Synchronization points for nodes may be imprecise, and assumptions about serial and parallel processing are based solely on succession time. Our work can validate the accuracy of the generated invocation graph.
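
To illustrate the kind of what-if estimation the equation supports, here is a toy sketch; the residual term and span names are assumptions, and the real model also handles set nodes, succession time, and joint profiles.

```python
# Toy what-if estimator in the spirit of the equation above (not LatenSeer's model).
# A span's latency is the sum of its critical-path children plus a residual term L(N).
def latency(span, overrides=None):
    overrides = overrides or {}
    if span["name"] in overrides:
        return overrides[span["name"]]
    return sum(latency(c, overrides) for c in span.get("children", [])) + span["residual_ms"]

# hypothetical trace aggregate: A's critical path goes through B, E, G
A = {"name": "A", "residual_ms": 5, "children": [
        {"name": "B", "residual_ms": 30},
        {"name": "E", "residual_ms": 20},
        {"name": "G", "residual_ms": 45},
]}

print("baseline L(A):", latency(A))                       # 5 + 30 + 20 + 45 = 100
print("if G sped up to 10ms:", latency(A, {"G": 10}))     # 5 + 30 + 20 + 10 = 65
```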

Presenter: Mona Ma

read more

STEAM: Observability-Preserving Trace Sampling

Research Question(s): 1) How can we quantitatively measure the similarity between two traces? And 2) How can we sample traces such that we maximize the dissimilarity between the traces in the sampled subset?

Key Contributions: STEAM is a tail-based sampling framework that aims to maximize the coverage and entropy in the sampled traces. Coverage measures the unique behaviors in the system (trace shapes, latency, response codes, etc) and entropy measures the additional knowledge gained by sampling a trace. Their main contributions are 1) a way to measure the similarity between traces. They provide some predefined similarity metrics that compare trace characteristics like structure, latency, span names, etc. These metrics can be combined into logical clauses which are used to form triplets (A, B, C) that are interpreted as traces A and B are more similar to each other than they are to C. 2) A new way to train GNNs which takes trace triplets as input and aims to minimize the distance between A and B in vector space while maximizing the distance to C. The GNN implicitly learns trace characteristics that imply trace similarity (from the domain knowledge encoded in the logical clauses) since it knows input traces A and B are more similar than they are to C without having to be explicitly told why they are more similar. Finally, 3) they parallelize the determinantal point process (DPP) so they can quickly select the most dissimilar traces (via their representative vectors) given a sampling budget.
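
The triplet objective itself is easy to sketch on plain vectors; the real system learns the embeddings with a GNN over trace graphs, so the numbers below are purely illustrative.

```python
# Sketch of the triplet objective described above, on plain vectors (the real
# system learns the embeddings with a GNN over trace graphs; this only shows the
# loss shape: pull A and B together, push C away by at least a margin).
import numpy as np

def triplet_loss(emb_a, emb_b, emb_c, margin=1.0):
    d_ab = np.linalg.norm(emb_a - emb_b)     # distance between the "similar" pair
    d_ac = np.linalg.norm(emb_a - emb_c)     # distance to the dissimilar trace
    return max(0.0, d_ab - d_ac + margin)

a = np.array([0.1, 0.9, 0.0])
b = np.array([0.2, 0.8, 0.1])    # structurally similar trace -> near a
c = np.array([0.9, 0.1, 0.7])    # dissimilar trace -> should sit farther away
print(triplet_loss(a, b, c))     # 0.0: this triplet is already well separated
```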

Overall, they show that STEAM is able to capture the most unique and representative behaviors in the system by maximizing coverage and entropy in their sampled traces. They show that their solution is fast (can process 15,000 traces in 4 seconds for a single processor) and outperforms all prior work.

Opportunities for future work: Interestingly, they claim that the standard approach to training GNNs (graph contrastive learning), which randomly drops nodes in the input graph to create other similar graphs, is not suitable for this task. They say that these additional input graphs are not realistic, as they show traces we do not expect to see in our system. However, this implies they are assuming we have only complete traces (with no data loss). They use the trace triplet approach to avoid this method of training. They later compare STEAM to this approach (which they label as ‘contrast’) and show how it underperforms STEAM and other related work. This means STEAM is not suitable in an environment that experiences data loss of any kind. Further work should explore how to make STEAM more robust to messy input data. Additionally, future work should look into automatically generating logical clauses and potentially updating them dynamically given specific use cases.

Presenter: Darby Huye

read more

QoS-Aware and Resource Efficient Microservice Deployment in Cloud-Edge Continuum

Research Question(s): How can we design a microservice scheduler and mapper that efficiently schedules microservices to reduce cross-service communication overhead while maintaining low resource contention between services on the same node?

Key Contributions: This work presents Nautilus, a microservice deployment and scheduling tool which is able to assign microservices to host nodes at both cloud datacenters and edge datacenters in a manner which reduces QoS violations due to cross-node communication overheads and due to resource contention among co-located microservices. In order to reduce communication overhead, Nautilus subdivides the microservice topology into a number of subgraphs using a graph-cut algorithm which removes those edges which represent dependencies between microservices that do not pass much data between one another. The resulting subgraphs then represent the portions of the topology that should be co-located. Additionally, in order to avoid resource contention among co-located microservices, Nautilus employs a Reinforcement Learning algorithm with a dual reward structure - positive rewards for improving resource utilization, and penalties for QoS violations and throughput reductions. By employing both the graph-cut-based mapping and the RL algorithm in real time, Nautilus is able to reconfigure the deployment of a microservice stack in order to quickly address problems that may occur as load increases or as the average workload composition changes.
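
A toy version of the co-location step might look like the following, using a simple edge-weight threshold as a stand-in for the paper's graph-cut algorithm; the services and data volumes are invented.

```python
# Toy co-location grouping (illustrative; the paper uses a proper graph-cut
# algorithm). Edges are weighted by inter-service data volume; cutting the
# low-volume edges leaves connected components that should be placed together.
import networkx as nx

g = nx.Graph()
g.add_weighted_edges_from([
    ("frontend", "cart",       500),   # MB/s exchanged (hypothetical numbers)
    ("cart",     "inventory",  450),
    ("frontend", "ads",         20),
    ("ads",      "analytics",  150),
    ("inventory", "analytics",   5),
])

CUT_THRESHOLD = 100  # drop edges that carry little data
pruned = g.copy()
pruned.remove_edges_from([(u, v) for u, v, w in g.edges(data="weight") if w < CUT_THRESHOLD])

groups = [sorted(c) for c in nx.connected_components(pruned)]
print(groups)   # e.g. [['cart', 'frontend', 'inventory'], ['ads', 'analytics']]
```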

Opportunities for future work: The reinforcement learning algorithm used in this paper appears to try to preempt QoS violations due to resource contention, but does not encapsulate any true contention information into its reward structure. Future work could potentially consider adding such information and more directly targeting contention-based issues rather than ascribing QoS violations simply to the high-level metrics used in this paper.

Presenter: Tomislav Zabcic-Matic

read more

Watching the watchmen: Least privilege for managed network services

Delegating network management to third-party tools costs less money, but it is still a cost – managed service providers (MSPs) are in a unique position to use their services to exploit the networks they manage on behalf of companies that depend on them, like hospitals and banks. These providers have an inordinate amount of privilege when it comes to directly manipulating the systems they oversee, and a number of exploits have utilized this unique access to deploy ransomware, extract sensitive data, and execute more attacks. The authors are curious whether this access control can be bolstered by emulating these systems with a digital twin, and then revising the technician’s proposed changes before releasing them on the real thing, all with little overhead. To this end, they implement Heimdall, which presents such a digital twin that only mimics topology, not the sensitive data it holds, and then they present a verification framework for these decisions as well. They also found from preliminary experiments that Heimdall generated at most 42 seconds of latency for the most complex problems faced by their resident technicians. Some problem areas this paper stirred curiosity towards include exploiting the verification framework (since it is hosted in SGX, which has a reputation for leaking, among other vulnerabilities), as well as similar work with digital twins of microservice topology.

Presenter: Sarah Abowitz

read more

Blueprint: A Toolchain for Highly-Reconfigurable Microservices

Research Question(s): How to easily reconfigure microservice applications in order to 1) update microservice designs, 2) reproduce emergent phenomena in microservices, and 3) prototype and evaluate solutions for microservices?

Key Contributions: The design space of microservice systems is hard to explore and standardize, yet point solutions in this vast design space are the norm among microservice benchmarks. Hence, the generalizability of research results derived from such benchmarks is questionable. Additionally, microservice implementations tightly couple concerns at the source-code level, so a slight alteration of the design requires a lot of effort (hundreds to thousands of lines of code).

Blueprint’s key insight is that the design of a microservice application can be decoupled into three almost independent layers: (i) the application-level workflow that defines the APIs used by microservices, (ii) the underlying scaffolding components such as replication and auto-scaling frameworks, communication libraries, and storage backends, and (iii) the concrete instantiations of those components and their configuration. Using this insight, the Blueprint compiler first takes three inputs (a workflow spec, a wiring spec, and compiler plugins) and generates an intermediate representation (IR). An IR is a structured graph specifying the services, the scaffolding (e.g., distributed tracing), and the instantiation granularity (e.g., Linux process or Docker container). The compiler then uses the IR to generate runnable microservice applications. Using Blueprint, a researcher can change the design of a microservice by simply updating one or more of the workflow spec, the wiring spec, and the compiler plugins.
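
To show the separation of concerns (and nothing more), here is a toy sketch in Python; Blueprint's real workflow and wiring specs are written against its own toolchain, so the structures and names below are illustrative assumptions.

```python
# Not Blueprint's real API (its specs are written against the Blueprint toolchain);
# a toy in Python only to show the separation the paper describes: the workflow
# stays fixed while the wiring spec swaps scaffolding and instantiation choices.
workflow = {                        # (i) application-level workflow: services and calls
    "frontend": {"calls": ["reservation", "profile"]},
    "reservation": {"calls": ["rates_db"]},
    "profile": {"calls": []},
    "rates_db": {"calls": []},
}

wiring = {                          # (ii)+(iii) scaffolding and instantiation choices
    "default_runtime": "docker_container",
    "tracing": "opentelemetry",
    "overrides": {"rates_db": {"runtime": "linux_process", "replicas": 3}},
}

def compile_ir(workflow, wiring):
    """Produce a toy IR: one node per service with its scaffolding attached."""
    ir = {}
    for service, spec in workflow.items():
        override = wiring["overrides"].get(service, {})
        ir[service] = {
            "calls": spec["calls"],
            "runtime": override.get("runtime", wiring["default_runtime"]),
            "replicas": override.get("replicas", 1),
            "tracing": wiring["tracing"],
        }
    return ir

# Changing a design choice means editing the wiring spec, not the application code.
wiring["overrides"]["profile"] = {"runtime": "linux_process"}
print(compile_ir(workflow, wiring)["profile"])
```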

Blueprint is able to reduce the number of lines of code for implementing a microservice benchmark by 5-7X. The lines of code required for changing a design choice see a 17X reduction. Blueprint can reproduce emergent phenomena like metastable failures and cross-system inconsistency with 3-4 lines of code in the wiring spec. Blueprint can generate runnable small to medium sized microservice applications within seconds (1-3 sec).

read more

Understanding host network stack overheads

For this week, we cover Understanding host network stack overheads.

Research Question:

  1. What are the performance bottlenecks in the host network stack?
  2. What is the impact of various factors or techniques on network stack performance? Specifically, they covered the number of flows, in-network congestion, Data-Direct I/O (DDIO), the Input/Output Memory Management Unit (IOMMU), and different workload patterns.
  3. What are the implications of the findings on designing future operating systems, network protocols, and network hardware?

Key Contributions:

The contributions of this paper include studying the performance of host network stacks in the context of increasing datacenter access link bandwidths and limited host resources. The paper explores various solutions such as Linux network stack optimizations, hardware offloads, RDMA, clean-slate userspace network stacks, and specialized host network hardware. The paper also investigates the impact of in-network congestion, the use of DDIO, and the implications of reducing the gap between bandwidth-delay product and cache sizes. Additionally, the paper discusses the potential benefits of emerging zero-copy mechanisms and hardware offloads for improving CPU efficiency in network stacks.

This is not an orthodox systems paper; it reads more like detective fiction about network stack performance. The whole network stack is not something distributed tracing is capable of instrumenting without sophisticated effort (like Theo’s Foxhound), but we are still able to extend application-layer observability into the lower parts of the network stack. This paper provides an excellent guide for where and what to dive into more deeply.

Presenter: Zhaoqi(Roy) Zhang

read more

The Mystery Machine summary & discussion

Research Question(s): How to conduct performance analysis on complex, large-scale, heterogeneous distributed systems? How to construct causal relationships between components? How to obtain the true dependency?

Key Contributions: 1) The authors built a causal model without extensive instrumentation but used a large number of already-existing log messages. 2) They leveraged the large sample size and natural variation in the ordering of different observed request traces to infer causal dependencies between components. 3) They applied the strategy of eliminating contradictions to the domain of distributed tracing. 4) They verified that individual components can be well-optimized in isolation, whereas performance improvements usually involve the interaction of multiple components.
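
The contradiction-elimination step can be sketched in a few lines: hypothesize every happens-before edge, then drop any hypothesis some observed trace contradicts. The segments and traces below are invented, and the real system operates at a much larger scale.

```python
# Toy version of the core inference step (not the paper's full pipeline): start by
# hypothesizing that every segment pair has a happens-before relationship, then
# drop any hypothesis that some observed trace contradicts.
from itertools import permutations

traces = [                                # each trace: segments in their observed order
    ["dns", "tcp", "server", "render"],
    ["dns", "tcp", "render", "server"],   # render and server swap -> no edge between them
    ["dns", "tcp", "server", "render"],
]

segments = {s for t in traces for s in t}
hypotheses = set(permutations(segments, 2))      # all ordered pairs (a happens-before b)

for trace in traces:
    position = {seg: i for i, seg in enumerate(trace)}
    for a, b in list(hypotheses):
        if a in position and b in position and position[a] > position[b]:
            hypotheses.discard((a, b))           # contradiction: b was seen before a

print(sorted(hypotheses))   # surviving edges, e.g. ('dns', 'tcp'), ('tcp', 'server'), ...
```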

Opportunities for future work: 1) The happens-before relationship type introduced unnecessary complexities into their dependency model. Dependency edges between parallel workloads can be discarded since they only represent temporal orders instead of any significant causal relationships. 2) They defined the critical path to be “the set of segments for which a differential increase in segment execution time would result in the same differential increase in end-to-end latency” and performed longest-path analysis from the first service to the end, whereas recent papers define the critical path algorithm to walk backward and pick the last-returning service. It would be interesting to look into how different critical path algorithms contribute to the accuracy of identifying the longest series of sequential operations in a parallel computation system.

Presenter: Mona Ma

read more

CLP summary & discussion

Research Question(s): How can we losslessly compress large volumes of logs so that those logs can also be searched efficiently across many attributes? Can we leverage domain-specific attributes of text logs in order to perform more efficient log compression and search?

Key Contributions: In order to achieve better lossless compression of application logs, the authors create a scheme that leverages a set of attributes present across all logs in order to sort and search through logs. One key insight is that common attributes can be deduplicated across entire sets of logs and replaced by a pointer to an entry in a set of common values. This both greatly aids in compression and also leads to faster search due to the fact that the pointers are fixed-width. Additionally, the authors present a technique for improving search speeds by caching infrequent log types, leading to a reduced need for unnecessary decompression of many logs that are not relevant to queries which return uncommon log types. They evaluate CLP over datasets ranging from several terabytes to the petabyte scale, showing that CLP achieves better compression ratios than comparable tools, and also achieves significant speedup on search queries compared to similar tools.
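
A toy dictionary encoding conveys the core insight; the variable-detection rule and placeholder byte below are assumptions for illustration, not CLP's actual format or variable taxonomy.

```python
# Toy dictionary encoding in the spirit of CLP (not its actual format or variable
# taxonomy): each message is split into a deduplicated "log type" with variables
# replaced by a placeholder, plus the variable values, so repeated text is stored once.
import re

VAR = re.compile(r"\b\d+(?:\.\d+)?\b")   # toy rule: treat numbers as variables

log_types: dict[str, int] = {}           # log type -> fixed-width type id
encoded = []                             # (type_id, [variable values])

def encode(message: str) -> None:
    variables = VAR.findall(message)
    log_type = VAR.sub("\x11", message)  # placeholder marks where variables go
    type_id = log_types.setdefault(log_type, len(log_types))
    encoded.append((type_id, variables))

for line in [
    "task 17 finished in 358 ms",
    "task 21 finished in 4012 ms",
    "connection from 10.0.0.3 dropped",
]:
    encode(line)

print(log_types)   # two distinct log types for three messages
print(encoded)     # compact (type_id, variables) records

# A search for 'finished' only needs to scan the (tiny) log-type dictionary to
# find matching type ids, then decompress just the records with those ids.
```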

Opportunities for future work: This paper appears to evaluate the performance of CLP against other log storage and compression tools post-facto. Many real-world systems require fast diagnosis of bugs, meaning that developers need to be able to access logs quickly after they appear. Applying log compression in real-time to newly-generated logs appears to be a natural step in exploring the possibility of using such a tool for online debugging. Alternatively, this work opens up a possibility for exploring rich log sets which have not been too aggressively sampled, including storing historical data which may be useful in root cause analysis of bugs that appear later down the line.

Presenter: Tomislav Zabcic-Matic

read more

Detecting DoS Attacks in Microservice Applications: summary & discussion

For this week, we all read Detecting DoS Attacks in Microservice Applications: Approach and Case Study. This 2022 jaunt has three really clear research questions. They are:

This paper’s key contributions are also pretty clear: they do their due diligence investigating each question, and then they also have a predictable machine learning implementation based on RQ3.

When reading this, many avenues for future work emerged, so I was surprised when this workshop-size paper didn’t have any future work implications of its own. Anyway, three things I’d like to explore next are as such: First, I want to do more detailed analysis into how the thread metric contributes to DoS. Second, I am curious about how certain measures such as health checks and load balancers amplify DDOS attacks in microservices. I think what I’m most curious about though is how high and low volume DDOS attacks degrade TeaStore, DeathStarBench, TrainTicket, Unguard, and maybe other testbeds. Microservices architecture is already heavily varied, so degradation in one system may not occur the same way degradation in another does.

Presenter: Sarah Abowitz

read more

AutoArmor summary & discussion

Research Question: How to automatically generate least privileged inter-service access control policies for microservices and to keep them up to date as the application evolves?

Key Contributions: They assume that the source code encodes the expected normal behavior, or legal inter-service accesses, of a microservice application. 1) They develop a static analysis-based mechanism that uses backward taint propagation to extract inter-service invocation logic from source code. 2) They design a novel data structure called a permission graph to represent inter-service invocation permissions. A permission graph captures a key feature of microservices: multiple versions of a service can co-exist, and they may have different access policies. Hence, the permission graph contains two types of permission nodes, service nodes that describe the permissions common to all versions of the service and version nodes that describe the permissions specific only to that version. As a result of this design, there are two types of edges in a permission graph, one connecting a service node to a version node, and the other representing a possible inter-service invocation. Each permission node is represented by a hash-based skeleton tree that stores the details of inter-service invocations. This tree structure enables quick comparisons between permission nodes, so updating the permission graph is fast after a new (version of a) service is launched or an old (version of a) service is deprecated. They implement a prototype of the system, AUTOARMOR, on Kubernetes and Istio, and evaluate its effectiveness, analysis time, security, efficiency, scalability, and end-to-end performance on popular open-source microservice applications. (Fortunately or unfortunately, they used neither DeathStarBench nor TrainTicket. They did use Bookinfo, though.)
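
A toy rendering of the permission graph might look like the following; the services, endpoints, and the short hash used to compare nodes are illustrative assumptions, not AUTOARMOR's implementation.

```python
# Toy rendering of the permission-graph idea (not AUTOARMOR's implementation):
# service nodes hold permissions common to all versions, version nodes hold
# version-specific ones, and a hash over the invocation details lets two
# permission nodes be compared quickly when a version is launched or retired.
import hashlib, json

def skeleton_hash(invocations: set[tuple[str, str]]) -> str:
    payload = json.dumps(sorted(invocations)).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

class PermissionGraph:
    def __init__(self):
        self.services = {}   # service -> set of (callee, endpoint) common to all versions
        self.versions = {}   # (service, version) -> version-specific invocations

    def add_version(self, service: str, version: str, invocations: set[tuple[str, str]]):
        if service not in self.services:
            self.services[service] = set(invocations)
        else:
            common = self.services[service] & invocations
            # push anything no longer common down into the existing version nodes
            for (svc, ver), extra in self.versions.items():
                if svc == service:
                    extra |= self.services[service] - common
            self.services[service] = common
        self.versions[(service, version)] = invocations - self.services[service]

    def summary(self):
        return {s: skeleton_hash(p) for s, p in self.services.items()}

g = PermissionGraph()
g.add_version("checkout", "v1", {("payments", "POST /charge"), ("cart", "GET /items")})
g.add_version("checkout", "v2", {("payments", "POST /charge"), ("coupons", "GET /apply")})
print(g.services["checkout"])                 # common permission: payments POST /charge
print(g.versions[("checkout", "v2")])         # version-specific: coupons GET /apply
print(g.summary())                            # quick-to-compare skeleton hashes
```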

Opportunities for future work: 1) This work concerns the cases where each service is used by only one application. One direction for future work is to extend AUTOARMOR to effectively generate access policies for services used in multiple applications. 2) This work does static analysis on the source code; it could be extended with binary analysis for other use cases.

Presenter: Max Liu

read more

Iron: Isolating Network-based CPU in Container Environments

For this week, we cover Iron: Isolating Network-based CPU in Container Environments.

Research Question:

  1. What is the impact of network-intensive workloads on container performance? And how can network-based CPU processing be effectively isolated in container environments?
  2. How to provide sufficient isolation for containers with respect to network-intensive workloads? More specifically, how can the overhead of packet processing be accurately accounted for in resource management and scheduling?

Key Contributions:

It’s actually the second time I have led discussion on this paper, as I forgot that it was already presented last year. What benefits us most about this paper is the direct insight it gives into how performance degradation happens in container environments. As a member of a performance debugging lab, such insight enables me to further explore debugging performance at various isolation levels. Moreover, what I liked most about this paper is that it provides a detailed case study of an interesting performance issue.

Presenter: Zhaoqi(Roy) Zhang

read more

End-to-end summary & discussion

Research Question(s): How do we define end-to-end? Does it conflict with the reliability measures at lower levels? What degree of instrumentation is necessary at the edges and in between?

Key Contributions: 1) A high-level definition that confirms whether programs actually work, where end-to-end checks guarantee that success has been achieved. 2) With better failure-recovery mechanisms, we can tolerate more errors and reduce costs by avoiding building a perfect system. 3) Adding checking and recovery measures in the lower levels of the system is an engineering trade-off made for performance rather than correctness. It’s more a way of abstract thinking: with more attention to the end-to-end argument, we can avoid trying to create a perfect system or solution and instead focus on relatively inexpensive ways to achieve the same goal.

Opportunities for future work: How do these ancient theories apply to the cloud nowadays? How do cloud providers apply the end-to-end argument to their system designs? Does the end-to-end argument conflict in some ways with observability?

Presenter: Mona Ma

read more

µTune summary & discussion

Research Question(s): How can we dynamically identify the best threading-model for a microservice under varying loads? How can we dynamically modify microservices to use the best thread-model given the current load to the service?

Key Contributions: They create a taxonomy of threading models which includes 8 different models using a combination of: 1) synchronous (one thread per request) vs. asynchronous (no association between threads and requests), 2) in-line (one thread for the entire request processing) vs. dispatch (a thread for network processing and a separate worker thread for handling the RPC), and 3) poll-based (a thread constantly loops polling for new requests) vs. blocking (threads wait until a request arrives). They show that no model is always the best: the characteristics of services, the size of the thread pool, and the load all impact which model is best in a given circumstance. They build µTune, which dynamically monitors mid-tier microservices to change the threading model based on the current load. This work only focuses on mid-tier microservices, which propagate requests further through the system (often to leaves that do major processing before responding) and then aggregate results before responding upstream.
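
Since the taxonomy is just the cross product of three binary choices, enumerating it makes the eight models explicit; the labels below follow the summary's wording rather than necessarily µTune's exact names.

```python
# The taxonomy described above is the cross product of three binary choices;
# enumerating it makes the 8 threading models explicit.
from itertools import product

dimensions = [("synchronous", "asynchronous"),
              ("in-line", "dispatch"),
              ("poll-based", "blocking")]

for model in product(*dimensions):
    print(" / ".join(model))   # 2 x 2 x 2 = 8 threading models
```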

Opportunities for future work: Currently, the developer must choose between synchronous and asynchronous when using µTune, as their functions differ. Asynchronous will always be better than the best synchronous model, but is more difficult to program. Future work could look into automatic methods of converting synchronous code into an asynchronous version. µTune has an offline training phase where they learn the best threading model under various loads. Microservice applications are continuously updated, with services being changed and new services being added to the architecture. Future work could expand upon the training phase to learn the best threading model as the system is running (and potentially on only updated services).

Presenter: Darby Huye

read more

My first DEF CON: summary & discussion

Research Question(s): What is DEFCON like? What is this conference about? What can one learn from going to such a hacker conference? Can a member of DOCC Lab go to her first DEFCON and make friends while not getting hacked or stolen from?

Key Contributions: This is a three-part answer, and we’re going to start by summarizing my experience in Cloud Village. Not only did I learn about tools that modeled insecure microservice environments like CNAPPGoat and Unguard, but I even spent a lot of time hanging out with Unguard’s creators. Most of the talks I attended made it clear that there were distinct differences between logging as we know it in this lab, and security logging. The latter does not randomly sample as generally practiced in tracing, as attacks usually execute once, with less traces generated. In one talk, I learned how security logging outages and log delivery delays can both buy the attacker more time without detection. Another talk demonstrated a need for security monitoring in serverless environments, as it was recently discovered that an endpoint that turned on once a day was leaking Slack’s source code. A third one had some important things to say about how logs reporting system outages not directly related to security events made it harder for security teams to filter their logs for specific attacks. One may understand how this can get worse the larger your cloud is - as you add more things to your cloud, the chance of something failing in your cloud becomes more likely, and when folks are constantly building and updating new things, such failure from CI/CD workflows can make the security team’s search for legitimate security events harder. Now, I would have more snappy by-lines here for these good talks in Cloud Village, but in trying to stay for all the talks, I pretended to work on the Cloud Village CTF - until I wasn’t. By requesting some open-facing Google Cloud Platform queries with the right .json, I was able to secure an access token and go bucket fishing in a gnome-themed CTF. It was quite fun, and I can’t wait to return to DEFCON and actually throw more of myself into the CTF scene.

Next up is a recap of some of the policy talks I went to. Though policy does not always crop up in this lab’s work, I care about monitoring the evolution of technology, and policy has an important hand in shaping that. First, I learned about the Budapest Convention, and how Russia, China, and others are threatening to replace it with a far more draconian statute that hurts security researchers and goes against free speech on the internet. Consider the following: an unassuming security researcher in Country A breaches a database of trade records between authoritarian Countries B and C, and even though the researcher subsequently does all the right things, the country hosting this database notices the breach. Rather than let Country A decide what happens to this researcher under its own legal system, under the proposed statute it would be up to Countries B and C to determine whether or not the researcher deserves life in jail. The next talk was super relevant to work I did in undergrad. Even though safe harbor for grey hat disclosures is widespread, these two Canadian hackers talked about issues they ran into when disclosing to the government. I had a nice chat with Mr. Renderman himself after the talk about the paper I worked on about these issues, and I also vented about how “unauthorized access” is too nebulous a legal term for the kinds of things a hacker can be arrested for. Finally, I heard about the unique challenges women face when stealthily seeking abortions in our current legal landscape in the talk “Abortion Access in the Age of Surveillance.” Through this talk, I learned that most people reporting abortions to law enforcement are not doctors but close (or formerly close) contacts, that people are pressured into accepting device searches, that changing laws mean your phone could tell on you for an abortion that was legal when you had it, that a lot of internet child-protection laws don’t do what they claim to and actually make all this surveillance worse, and that privacy is super, super hard to correctly standardize. It is important to have one foot in policy / the greater field of responsible computing / the space where you can keep companies like the ones that host abortion-related data (and other data, who knows) in check, whether you are a software developer, a project manager, or somewhere on the path to academia. There are two reasons I say this as a privacy professional: one, work can be so divided amongst teams and distributed systems that what you work on and how you work on it may sometimes intentionally obscure the larger, possibly more oppressive thing that you may not be comfortable with contributing to, and two, knowing about different insider threats and privacy yikes can help one effectively build non-oppressive tools, or build a threat model for a tool in design that may have the capacity to be oppressive but can have countermeasures built to prevent such manipulation. That is not to say this is always the case, but sometimes you do not know the scope of your work and whom it affects without doing some digging.

Of course, though DEFCON is a skill and information share particularly centered around exploits of all kinds, it’s a hacker party, so fun was definitely to be had. In one fun talk, four high schoolers told the audience how they figured out they could clone CharlieCards by flipping them with slightly altered checksums, so that the new clone also carries the original card’s stored value, which was good for many joyrides on the MBTA. Although the MBTA had no vulnerability disclosure program at the time of this discovery, the high schoolers were taken seriously, and together they worked well with the MBTA to address the vulnerability, which I think is neat. Another talk I saw that day had an abstract that led me to believe it would have something to say about observability, but instead I was in for a summary of how a few guys exploited outdated Lexmark firmware to make one (1) printer sing the Super Mario Bros. theme song. I have three more things to add in the “fun” category, but each of them deserves its own paragraph.

DEFCON has had a reputation of attracting a lot of cis straight white guys, and because of how some guys who hack are, a slightly larger proportion of them are worse toward queer hackerwomen like me, and toward others unlike them (or me). That said, beyond the minimal opsec needed to traverse DEFCON without getting hacked, stolen from, or messed with, DEFCON has taken a lot of steps toward making the conference safe for people like us and toward fostering community spaces for minorities in cybersecurity, like QueerCon. I would find myself going to the QueerCon mixer when I was bored, but I’d always walk away with more friends like myself. The people I met and I were snowballed into a very big and gay and trans groupchat on Signal that we still ping regularly when one of us sees a dog, encounters a hacking conundrum, or needs support. Through that groupchat, we were able to organize shenanigans like a ten-person dinner at Guy Fieri Las Vegas before we all went our separate ways. There’s this thing the group Lesbians Who Tech always says about their organization: wherever you go, the lesbians will find you. Regardless of how many lesbians I met at DEFCON, it made me feel safer to venture into this somewhat risky conference, find friends like me, and get assimilated into the Borg, no, I mean a really good network where we all took care of and showed up for each other when it mattered, and some of us had crushes on the Borg Queen and that was fine. Still, to any LGBTQ+ folks considering whether or not to go to DEFCON: we’re here if you know where to find us (at the QueerCon mixer, which will probably be in Chillout again next year), and you can make friends who also know what it’s like being something like you, and feel stronger because you have buddies now.

Something else rather interesting happened to me when I was on the floor headed towards Cloud Village. I saw someone in a shiny-looking helmet that reminded me a ton of Guy-Manuel’s helmet from Daft Punk, so a compliment was in order. They were very nice and let me have some stickers, as well as a small porcelain duck. Right as I’m walking away, though, I notice that on the sleeve of their shirt are the words Cult of the Dead Cow. The cDc (no, not the health organization) is a hacktivist circle that has been around for quite a while, and since DEFCON originally was a hacker party, members of the cDc can usually be found walking amongst the other attendees. It is rumored that every year, each cDc member attending DEFCON is given one unique duck to hand off to someone at random. If this is true, I do not know how members decide whom to hand it off to or whether the duck has some secret significance, but like many a duck recipient, every time I see that little friend, I remember that it means I met a particularly nice member of the cDc.

Last but never least, there is the tale of the challenge coin. At some point during the CTF, I left Cloud Village in search of food, and while working on the CTF and eating a little something, a HackerOne person comes up and gives me this heavy challenge coin. He says that if I decode the message on the coin and post it to a website, I could be entered in a giveaway and win a hoodie. (I cracked the coin in time, but sadly they didn’t choose me for the hoodie list.) Messing around, I can tell that the front of the coin has a message that isn’t ROT13. It’s not until later that night that I realize there is ciphertext on the other side too - and not only is it ROT13, it says attending DEFCON is a BEAUtiful efFORT. So I gather that the front might be encoded in some scheme with BEAUFORT in the name, but I need sleep, so this becomes a breakfast problem…and I don’t know why it didn’t hit me sooner. The Beaufort cipher is a variant of the Vigenère cipher, which I should have known, especially since I was very much online when everyone was using the Vigenère cipher to decode secret messages in the second season of Gravity Falls. DEFCON was capitalized on the back, I thought, because it must be the key used in both Beaufort and Vigenère, so with some help from dcode.fr, I find that the message I then post on the website is “We have a new community website, looks dope, right?”. Sigh. Even if that’s what I ended up finding, the search was still pretty fun. All in all, my first DEFCON was pretty fun, I didn’t get hacked, stolen from, or otherwise messed with, and I learned what DEFCON was about while making friends.
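
For anyone who wants to skip dcode.fr next time, here is a small Python sketch of the Beaufort cipher using the key from the coin’s back side; the example plaintext is just the decoded message above, and the code is generic cipher code rather than anything specific to the coin.

```python
# Beaufort cipher: for each letter, output = (key - input) mod 26.
# Unlike Vigenère, Beaufort is reciprocal, so the same routine both
# encrypts and decrypts.
def beaufort(text: str, key: str) -> str:
    out, k = [], 0
    for ch in text.upper():
        if not ch.isalpha():
            out.append(ch)          # pass spaces/punctuation through unchanged
            continue
        key_ch = key.upper()[k % len(key)]
        out.append(chr((ord(key_ch) - ord(ch)) % 26 + ord("A")))
        k += 1                       # only advance the key on letters
    return "".join(out)

# Round trip with the key hinted at on the coin.
secret = beaufort("WE HAVE A NEW COMMUNITY WEBSITE", "DEFCON")
print(beaufort(secret, "DEFCON"))   # -> WE HAVE A NEW COMMUNITY WEBSITE
```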

Opportunities for future work: Personally, I know I will be going back next year. People have been known to amass knowledge from multiple DEFCONs like moss grows on a rock, and I aim to do that. I will also be serving as the lab’s unofficial DEFCON correspondent while I am still working on my doctorate. Whether you are more immersed in cybersecurity than I am, someone with more of a background in a related area like policy or aerospace, a thrill seeker, or someone not working in computer science at all who wishes they knew a little more, if you are at least a little interested or you see yourself benefitting from learning at DEFCON, it is definitely more than worth it to go at least once. Plus, play your cards right, and you will end up in group chats and make friends you will meet in Vegas year after year to learn and live together for a bit. Speaking of friends in Las Vegas, Tufts’ own Ming Chow has a history of running the Packet Hacking Village, and every DEFCON he will take any Jumbos attending out for dinner. (Greetings from the CS 116 pcap!) That said, some words of caution to anyone who has not gone before: while some sources on the internet will fear-monger with lines like “trust nobody, not even yourself” and “put your phone in a Faraday cage,” you don’t have to obsess over security to the point that it obstructs your enjoyment of the conference. Goons (the staff/moderators of DEFCON) have worked to foster a culture of friendliness around DEFCON, it is less likely than the fear-mongering suggests that something bad will happen to you on the conference floor, and if it does, Goons are there to help. Also, if you spend your time not trusting anyone, you may miss out on quality opportunities, like having breakfast with someone new every day because you both signed up for the same thing and got your egg sandwich there. I brought a door jammer, installed VPNs on my phone, and I was fine. When you go, plan to always have a buddy, and use your common sense whether you are up to something or not. Also, keep yourself boosted and safe: even with 10k of us there, wear a mask (especially in crowded areas) and be sure to isolate and test as needed as soon as you come home, as some of our groupchat still caught COVID after all of this. I am available over at sarah dot abowitz at tufts dot edu if you have more questions or comments about opsec or anything else DEFCON. Still, many somewhat unconventional learning opportunities abound at DEFCON, and I hope one day to see you there.

Presenter: Sarah Abowitz

ps: to that one attendee who jokingly called me a demon for using Vim…guess what I wrote this blog in. :)

read more

SampleHST summary & discussion

Research Question(s): 1) How can distributed traces be efficiently sampled on-the-fly in an unsupervised manner (to reduce the storage overhead of tracing)? 2) Can we design a sampling strategy that focuses on anomalous traces while also collecting normal traces?

Key Contributions: 1) The SampleHST algorithm is an online, mass-based clustering algorithm. It uses half-space trees (HSTs) to score traces, which is useful for anomaly detection. The HSTs are created by dividing traces along trace attributes (e.g., does the trace have an error?), and a forest of HSTs is built by selecting different attributes in each tree. Given the HST forest, they calculate each trace’s mean mass score (roughly, the fraction of traces that share the same set of attribute values) and its 5th-percentile mass score, and cluster traces based on these two values. They show that these two mass-based properties are sufficient for clustering similar traces (when compared to the standard DBSCAN). They then adjust the sampling rate per cluster, prioritizing clusters with anomalous traces. 2) They trade off between sampling normal and anomalous traces while remaining under a sampling budget. To do this, they only sample normal traces when the budget is higher than the percentage of anomalous traces (which is calculated using the HSTs). If the budget is lower than the percentage of anomalous traces, they emphasize sampling anomalous traces over normal ones.
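
To give a flavor of how mass scores can drive a budget-aware sampling decision, here is a toy Python sketch; it replaces real half-space trees with random attribute subsets and simplifies the budget accounting, so it is an illustration of the idea rather than SampleHST’s actual algorithm.

```python
import random
from statistics import mean

# A toy stand-in for the HST forest: each "tree" is just a random subset of
# trace attributes, and a trace's mass under that tree is the fraction of
# previously seen traces that share its values on that subset.
def mass_scores(trace, history, forest):
    scores = []
    for attrs in forest:
        key = tuple(trace.get(a) for a in attrs)
        same = sum(1 for t in history if tuple(t.get(a) for a in attrs) == key)
        scores.append((same + 1) / (len(history) + 1))   # add-one smoothing
    scores.sort()
    p5 = scores[int(0.05 * (len(scores) - 1))]           # 5th-percentile mass
    return mean(scores), p5

# Simplified decision rule: low-mass (rare) traces look anomalous and are
# always kept; common traces only spend whatever sampling budget is left.
def keep(trace, history, forest, budget=0.1, anomaly_mass=0.02):
    _, p5 = mass_scores(trace, history, forest)
    if p5 < anomaly_mass:
        return True                      # rare attribute combination: sample it
    return random.random() < budget      # normal trace: coin flip under budget

# Example setup: a forest of 25 "trees", each looking at 2 random attributes.
attributes = ["root_span", "has_error", "status_code", "service_count"]
forest = [random.sample(attributes, k=2) for _ in range(25)]
```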

Opportunities for future work: 1) This tool requires tail-based sampling, meaning we must collect all instrumentation before deciding which traces to store. Tail-based sampling is not feasible in large-scale systems, so future work could explore ways to reduce the amount of instrumentation collected before making storage decisions. 2) Scalability: they encode each trace as a count vector, where the length of the vector is the number of unique spans (plus any additional attributes) seen across all traces. This value can be extremely large and dynamic. Additionally, a count vector cannot capture variations such as the order in which calls were made or latency information, both of which can differ across traces with the same span counts. More accurate and flexible methods of encoding traces need to be explored for this approach to be useful in industrial systems.
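
A quick Python sketch of the count-vector encoding makes the limitation concrete; the span names and traces here are made up for the example.

```python
from collections import Counter

# Count-vector encoding: each trace becomes a vector of span-name counts over
# a global vocabulary of every span name seen so far.
def encode(trace_spans, vocabulary):
    counts = Counter(span["name"] for span in trace_spans)
    return [counts.get(name, 0) for name in vocabulary]

vocab = ["frontend/GET", "cart/AddItem", "db/Query"]
trace_a = [{"name": "frontend/GET"}, {"name": "db/Query"}, {"name": "cart/AddItem"}]
trace_b = [{"name": "frontend/GET"}, {"name": "cart/AddItem"}, {"name": "db/Query"}]

# Both traces map to [1, 1, 1]: the encoding drops call order and all latency
# information, which is exactly the limitation noted above.
assert encode(trace_a, vocab) == encode(trace_b, vocab) == [1, 1, 1]
```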

Presenter: Darby Huye

read more

OSDI & ATC '23 papers summary & discussion

This discussion introduced seven presentations that we attended at ATC, serving as a sampler from which to choose future reading group papers. These papers are closely aligned with DOCC Lab’s agenda on debugging distributed systems. Among the seven research projects, Nodens aligns most closely with our current research direction. It identified a gap in the field of microservice management: most tools are reactive, resulting in long Quality of Service (QoS) recovery times. Their research question is how to develop a proactive approach that reduces QoS recovery time, and their main contributions are a traffic-based load monitor and a blocking-aware load updater. They tested their approach using 200 microservice applications with 10 call graphs, simulating Alibaba’s production environment and datasets. Future work could involve examining their design decisions more closely to see where they can be improved.

Presenter: Mona Ma

read more

Rainmaker summary & discussion

Cloud-based environments present error-handling challenges that differ significantly from those in co-located application environments, where all errors can typically be handled by the same underlying programming framework. In cloud-based environments, the causes of errors on one machine are opaque to other machines, and miscommunication of these errors can lead to widespread problems in cloud-backed applications. The authors of Rainmaker ask whether it is possible to create a unified model for testing and identifying application failure scenarios caused by mis-handled errors across components, and they present a drop-in framework for automatically testing error mis-handling. By injecting HTTP errors and timeouts into cross-machine HTTP calls, Rainmaker creates failure scenarios that would otherwise be easy to miss given the vast space of possible failures and interactions among application components.
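
To give a rough sense of what injecting HTTP faults at the call boundary looks like, here is a minimal Python sketch that monkey-patches the requests library to occasionally substitute a 503 or raise a timeout; this is a generic illustration of the idea, not Rainmaker’s actual implementation, and the injection rates are arbitrary.

```python
import random
import requests

# Monkey-patch requests so that a fraction of cross-component HTTP calls
# come back as server errors or timeouts, exercising callers' error handling.
_real_request = requests.Session.request

def _faulty_request(self, method, url, **kwargs):
    roll = random.random()
    if roll < 0.1:                                   # inject an HTTP error
        fake = requests.Response()
        fake.status_code = 503
        fake.url = url
        return fake
    if roll < 0.2:                                   # inject a timeout
        raise requests.exceptions.Timeout(f"injected timeout for {url}")
    return _real_request(self, method, url, **kwargs)

# Enable injection for the duration of a test run.
requests.Session.request = _faulty_request
```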

read more

Unum summary & discussion

Ah, standalone orchestrators. If you want to build more complex serverless applications, the industry standard is to rent tenant space for an orchestrator that keeps crashes, retries, and exactly-once execution going according to plan. However, the cost of renting space for a standalone orchestrator works against the efficiency and frugality of serverless, and a logically centralized controller like this limits how many application patterns it can support without extensive modification. Unum’s authors ask whether decentralizing the FaaS control structure can 1) accommodate the broad needs of complex serverless applications with minimal modification and 2) offer developers a serverless computing model that actually comes closer to the price point providers claim. In this paper, they present the Unum system, a serverless library that bakes orchestration directly into each function’s instrumentation, rather than renting space for an orchestrator whose application-pattern coverage is fairly narrow. The authors also evaluate Unum’s performance and cost of use compared to traditional standalone orchestrators. Areas for future development of particular interest include developing a tracing system for Unum, adapting Unum to less statically defined control structures, and further investigating how Unum can scale without issues.
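
To illustrate the decentralized idea, here is a sketch of a Lambda handler that carries its own continuation and invokes the next stage directly instead of reporting back to a standalone orchestrator; the function names and continuation format are hypothetical, and this is not Unum’s real API.

```python
import json
import boto3

# Hypothetical illustration of the decentralized approach: each function
# carries its own "continuation" metadata and invokes the next stage directly,
# so no standalone orchestrator process is needed.
CONTINUATION = {"next": "resize-thumbnails", "invocation_type": "Event"}

def handler(event, context):
    result = do_work(event)                      # this stage's own logic

    # Instead of returning to a central orchestrator, fire the next stage.
    boto3.client("lambda").invoke(
        FunctionName=CONTINUATION["next"],
        InvocationType=CONTINUATION["invocation_type"],   # async hand-off
        Payload=json.dumps(result).encode(),
    )
    return result

def do_work(event):
    # Placeholder for this stage's actual computation.
    return {"input": event, "status": "ok"}
```

The real system also has to guarantee exactly-once execution across crashes and retries, which this sketch ignores entirely.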

read more

Foxhound summary & discussion

Foxhound: Server-grade Observability for Network-augmented Applications (EuroSys’23) presents a distributed tracing system that extends into the network, providing visibility into previously unseen portions of end-to-end request workflows. Foxhound provides two types of annotations for instrumenting a programmable data plane (PDP): store, which creates a span, and stats, which collects aggregate metrics.

Presenter: Darby Huye

read more