Research Question(s): How can we losslessly compress large volumes of logs while keeping them efficiently searchable across many attributes? Can we leverage domain-specific attributes of text logs to compress and search logs more efficiently?
Key Contributions: To achieve better lossless compression of application logs, the authors design a scheme that exploits attributes common to all log messages in order to organize, compress, and search them. One key insight is that values repeated across an entire set of logs can be deduplicated and replaced by fixed-width pointers into a dictionary of common values; this greatly improves compression and also speeds up search, because queries can compare fixed-width pointers instead of raw text (a minimal sketch of this idea follows below). The authors additionally present a technique that caches infrequent log types, reducing unnecessary decompression of irrelevant logs for queries that match only uncommon log types. They evaluate CLP on datasets ranging from several terabytes to the petabyte scale, showing that CLP achieves better compression ratios than comparable tools and significant speedups on search queries.
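As an illustration of the dictionary-encoding idea (a minimal sketch under my own assumptions, not CLP's actual on-disk format): each message is split into a static template plus variable values, repeated values are interned into dictionaries of fixed-width integer IDs, and an exact-match query reduces to one dictionary lookup followed by integer comparisons. The token pattern, the placeholder byte, and helper names such as intern and search_exact_variable are hypothetical choices for this sketch.

    import re
    from typing import Dict, List, Tuple

    # Dictionaries mapping repeated strings to small fixed-width integer IDs.
    logtype_dict: Dict[str, int] = {}   # static message templates
    variable_dict: Dict[str, int] = {}  # deduplicated variable values

    # Each message is stored as (logtype ID, list of variable IDs).
    encoded_messages: List[Tuple[int, List[int]]] = []

    # Assumption for this sketch: tokens containing a digit are variables.
    VAR_PATTERN = re.compile(r"\S*\d\S*")

    def intern(table: Dict[str, int], value: str) -> int:
        """Return the ID for value, adding it to the dictionary if new."""
        return table.setdefault(value, len(table))

    def encode(message: str) -> None:
        """Split a message into a template and variables; store only IDs."""
        variables = VAR_PATTERN.findall(message)
        # Placeholder byte marks each variable slot in the template.
        logtype = VAR_PATTERN.sub("\x11", message)
        logtype_id = intern(logtype_dict, logtype)
        var_ids = [intern(variable_dict, v) for v in variables]
        encoded_messages.append((logtype_id, var_ids))

    def search_exact_variable(value: str) -> List[int]:
        """Find messages containing value by comparing fixed-width IDs,
        without reconstructing any message text."""
        var_id = variable_dict.get(value)
        if var_id is None:  # value appears nowhere in the archive
            return []
        return [i for i, (_, vids) in enumerate(encoded_messages)
                if var_id in vids]

    encode("INFO connected to host 10.0.0.5 in 42 ms")
    encode("INFO connected to host 10.0.0.7 in 42 ms")
    encode("WARN retrying request id 9981")
    print(search_exact_variable("10.0.0.7"))  # -> [1]

Because every occurrence of a value collapses to the same integer ID, the query never touches message text, which is the intuition behind both the compression gains and the search speedups described above.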
Opportunities for future work: This paper evaluates CLP against other log storage and compression tools after the fact, on already-collected logs. Many real-world systems require fast bug diagnosis, meaning developers need access to logs soon after they are generated; applying CLP-style compression to newly generated logs in real time is a natural next step toward using such a tool for online debugging. Alternatively, this work opens the possibility of retaining richer log sets that have not been aggressively sampled, including historical data that may be useful for root-cause analysis of bugs that surface later.
Presenter: Tomislav Zabcic-Matic