Distributed Tracing Project Proposal

Here is the accepted proposal for the distributed tracing project I will be working on this summer. Pythia is written in Rust and Jaeger is in Go. As I write weekly updates for this project, I’ll continue to post some here under the tracing tag.

Project Title: Enabling an automated instrumentation framework in a distributed tracing platform

Abstract

Diagnosing performance problems in distributed systems is challenging. One reason is that it is hard to know in advance what instrumentation to enable to help diagnose problems. Pythia is an automated instrumentation framework that aims to solve this decision problem. Its key insight is that performance variation among workflows that are expected to perform similarly indicates where instrumentation should be enabled. This project will validate Pythia’s approach by integrating it with a mature distributed tracing system to automatically enable instrumentation that helps diagnose performance problems, and by evaluating the resulting system on representative distributed applications.

Background

Distributed applications are critical in many aspects of society. We use them when streaming movies [17], delivering packages [20], shopping online [1], and playing online games [22]. When Amazon.com’s website became inaccessible for approximately 13 minutes, Forbes estimated that the outage cost Amazon $66,240 per minute [15]. It is critical that when problems arise in distributed applications, engineers diagnose and fix them rapidly.
Traditional methods of profiling, debugging, and monitoring can be used to diagnose problems in monolithic (non-distributed) applications. Distributed applications, however, challenge these traditional methods in three major respects: incoherence, inconsistency, and decentralization [19]. As an application becomes more distributed, problems become less coherent, i.e., the distance between cause and effect increases. A single failure in a file storage component can propagate latency problems to numerous services, preventing an engineer from identifying the immediate cause. Components of a distributed application are also less consistent than those of a non-distributed one: each component is designed to be highly independent, so there is no single, consistent view of their state. Finally, vital information about a service’s performance is decentralized. It is difficult to identify failures in a service when each service may be composed of multiple components, and each component may be spread over tens or hundreds of machines, virtual machines, or containers. Distributed tracing is designed to address these key challenges in diagnosing problems in distributed applications.

Problem Statement

With distributed tracing, engineers gain visibility into the distributed application’s behavior to help them diagnose problems and identify potential root causes [23]. Distributed tracing requires that engineers add instrumentation to the code of the distributed applications and the environments they run on (e.g., with logs [23, 27] or performance counters [16]). When deploying distributed tracing, it is difficult to know a priori where instrumentation should be enabled, in which stack layer, and what instrumentation to add or enable to help diagnose problems that may occur in the future [2, 14, 27-29]. Enabling all possible instrumentation all the time would result in unacceptable overhead. Thus, engineers must spend time manually exploring the search space of possible instrumentation choices before identifying the root cause of a new, unanticipated problem [3, 14].
This MS project aims to explore the instrumentation decision problem. The project builds upon the Pythia automated instrumentation framework, which searches the space of instrumentation choices in response to an observed performance problem in a distributed application [2]. In addition, this project will enable the vision of the Pythia framework by extending the instrumentation capabilities of the Jaeger distributed tracing system and evaluating it in representative, large-scale distributed applications.

Prior Work

Pythia is an instrumentation framework that aims to help diagnose unanticipated problems by dynamically enabling instrumentation in distributed systems [2]. Pythia’s researchers presented two key insights that enable the framework. The first is that a collection of requests with similar workflows should have similar performance [25, 26]. If they do not, there may be problematic behavior in their workflows. Locating the source of this performance variation gives insight into where to enable instrumentation to explain the behavior [2]. Localizing the source of variation also makes it possible to use focused search strategies for exploring what instrumentation is needed and in which stack layer it should be enabled [2]. The second insight is that recent work on end-to-end or workflow-centric tracing makes it possible to capture requests’ workflows by inserting tracing instrumentation into the application [4, 6, 7, 9, 12, 13, 18, 21, 25, 26]. Workflow-centric tracing propagates context (e.g., request metadata) alongside individual requests as they execute in the distributed application.
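As an illustration of this context-propagation pattern, the sketch below uses the OpenTracing Go API (which Jaeger’s client libraries implement) to inject a caller’s span context into outgoing HTTP headers so the downstream service can continue the same workflow’s trace. The function and operation names are placeholders, not part of Pythia or Jaeger.

```go
package tracing

import (
	"net/http"

	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// callDownstream starts a client span and injects its context into the
// outgoing request's headers, so the downstream service can extract it and
// attach its own spans to the same workflow.
func callDownstream(tracer opentracing.Tracer, url string) error {
	span := tracer.StartSpan("call-downstream")
	defer span.Finish()

	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return err
	}

	// Record standard tags and propagate the trace context (trace ID,
	// span ID, baggage) in the request headers.
	ext.SpanKindRPCClient.Set(span)
	ext.HTTPUrl.Set(span, url)
	if err := tracer.Inject(
		span.Context(),
		opentracing.HTTPHeaders,
		opentracing.HTTPHeadersCarrier(req.Header),
	); err != nil {
		return err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}
```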
The Jaeger distributed tracing system can be enhanced by Pythia’s instrumentation framework. Jaeger was created at Uber in 2015 and is now used in production by more than 20 organizations [10, 21]. Jaeger uses sampling to reduce tracing overhead [11]. However, sampling addresses only part of the instrumentation decision problem: it decides which data from already-enabled instrumentation should be preserved, not where instrumentation should be enabled in the first place [5, 24]. Pythia can address this limitation in Jaeger by exploring the search space of instrumentation choices and dynamically enabling the instrumentation needed to diagnose a problem.
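For reference, sampling in Jaeger is typically configured on the client side. The minimal sketch below, assuming the jaeger-client-go library, sets up a probabilistic sampler that keeps about 1% of traces; the service name and rate are placeholders. Note that this controls only which traces are reported, not which trace points exist in the code.

```go
package tracing

import (
	"io"

	"github.com/opentracing/opentracing-go"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// newSampledTracer builds a Jaeger tracer that keeps roughly 1% of traces.
// Sampling reduces overhead, but it only decides which already-collected
// trace data to keep; it does not decide where instrumentation is enabled.
func newSampledTracer(service string) (opentracing.Tracer, io.Closer, error) {
	cfg := jaegercfg.Configuration{
		ServiceName: service,
		Sampler: &jaegercfg.SamplerConfig{
			Type:  "probabilistic", // sample a fixed fraction of traces
			Param: 0.01,            // keep ~1% of traces
		},
	}
	return cfg.NewTracer()
}
```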

Approach

The approach for completing the MS project has three components: analysis of Jaeger and Pythia, implementation, and evaluation.
We must first determine whether the instrumentation already present in Jaeger is sufficient to diagnose the problems Pythia targets. Analysis of the Jaeger source code will reveal potential areas for modification, such as the span and trace models. Trace points in Jaeger must be modified to express the granularity-marker and event-marker semantics required by Pythia. The Jaeger agent is the daemon that receives tracing information submitted by applications using the Jaeger client libraries; it must be modified to preserve the concurrency and synchronization information in input traces that Pythia requires. Rather than having the developer manually add instrumentation to the code of the distributed application, Jaeger should be able to enable or disable instrumentation based on a control signal generated by Pythia. This can be achieved by modifying the Jaeger client libraries and adding an instrumentation controller to Jaeger. The results of this analysis will drive key design decisions and requirements for implementation. These requirements will be translated into test cases for verifying Pythia’s functionality.
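One possible, purely illustrative way to carry such marker semantics on Jaeger spans is to reserve span tags for them, as sketched below; the tag keys and helper function are hypothetical and are not existing Pythia or Jaeger APIs.

```go
package instrumentation

import "github.com/opentracing/opentracing-go"

// Hypothetical tag keys for marking spans with Pythia's trace-point
// semantics. These names are invented for illustration only.
const (
	tagMarkerKind = "pythia.marker.kind" // e.g., "granularity" or "event"
	tagMarkerID   = "pythia.marker.id"   // identifies the emitting trace point
)

// markSpan annotates a span so downstream Pythia logic can tell which kind
// of marker a trace point represents and which trace point emitted it.
func markSpan(span opentracing.Span, kind, tracepointID string) {
	span.SetTag(tagMarkerKind, kind)
	span.SetTag(tagMarkerID, tracepointID)
}
```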
The implementation phase will involve programming and testing Pythia’s logic in Jaeger. The first step is developing a mechanism in Jaeger to enable or disable trace points. This can be achieved by creating an instrumentation controller that checks for the presence of a file or flag. The second step is implementing Pythia’s logic and the instrumentation controller in Jaeger. Pythia operates in a continuous cycle [2]. First, Pythia will require as input the initial expectations of which requests should perform similarly, along with workflow skeletons created from a set of always-enabled trace points. Pythia will extract each workflow skeleton’s critical path to create critical-path skeletons. Requests whose critical-path skeletons are similar, and are thus expected to perform similarly, will be grouped together. Pythia will then examine the response-time distribution of the requests in each group. If Pythia detects a problematic group (e.g., one with a high coefficient of variation, or one that is consistently slow), it will explore where to enable instrumentation and what instrumentation to enable or disable. A control signal will be sent to the instrumentation controller to act on Pythia’s decision. Finally, Pythia will refine its expectations to account for recently enabled or disabled instrumentation, and the cycle will repeat. The result of implementation will be the Jaeger distributed tracing system with enhanced, automated instrumentation capabilities from Pythia.
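To make the first step concrete, below is a minimal sketch of a file-based instrumentation controller of the kind described above: a trace point counts as enabled whenever a flag file named after it exists in a control directory. The type and method names are hypothetical, not an existing Jaeger API, and the real controller would also need to react to Pythia’s control signals.

```go
package instrumentation

import (
	"os"
	"path/filepath"
)

// Controller decides whether a given trace point should emit data, based on
// the presence of a per-trace-point flag file in a control directory.
type Controller struct {
	dir string // directory holding one flag file per enabled trace point
}

func NewController(dir string) *Controller {
	return &Controller{dir: dir}
}

// Enabled reports whether the trace point with the given ID is enabled.
// Instrumented code checks this before recording a span or annotation.
func (c *Controller) Enabled(tracepointID string) bool {
	_, err := os.Stat(filepath.Join(c.dir, tracepointID))
	return err == nil
}

// Enable turns a trace point on by creating its flag file, e.g. in response
// to a control signal from Pythia.
func (c *Controller) Enable(tracepointID string) error {
	f, err := os.Create(filepath.Join(c.dir, tracepointID))
	if err != nil {
		return err
	}
	return f.Close()
}

// Disable turns a trace point off by removing its flag file.
func (c *Controller) Disable(tracepointID string) error {
	err := os.Remove(filepath.Join(c.dir, tracepointID))
	if os.IsNotExist(err) {
		return nil // already disabled
	}
	return err
}
```

Guarding an optional trace point would then reduce to a check such as `if ctrl.Enabled("storage-read-detail") { ... }` before emitting a span.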
The evaluation phase involves using the Pythia-Jaeger system to diagnose and fix performance issues in distributed applications. Pythia’s functionality will be validated using a cloud microservices benchmark suite to generate traces [8]. Performance problems and bugs will be artificially injected into the benchmark applications to confirm whether the instrumentation enabled by the framework leads to a rapid diagnosis of the problem. We will also attempt to find unresolved problems in OpenShift distributed applications with the aid of the Pythia-Jaeger system. In addition to a qualitative report, we will collect quantitative metrics such as the overhead of Pythia’s instrumentation search, the final overhead of the instrumentation Pythia enables, the time to identify the needed instrumentation, and the amount of human guidance required to help the framework identify it. These metrics will be used to compare the Pythia-Jaeger system against Jaeger by itself. The result of this evaluation will be the validation of Pythia’s approach in the Jaeger distributed tracing system, tested on representative distributed applications.

Plan and Schedule

  • Mid-Late May: During this phase Jaeger and Pythia source code will be analyzed. Jaeger and Pythia will be built and tested locally. The analysis will generate documentation detailing the specification and requirements for integrating Pythia with Jaeger. Requirements will be translated into test cases for driving development.
  • Late May — Mid June: We will develop mechanisms in Jaeger that are required by Pythia, based on the previous analysis. Documentation will be produced which highlights how to use Pythia-specific functionality in Jaeger.
  • Mid June — Mid July: The Pythia-enhanced Jaeger system will be developed, tested and prepared for a potential open-source release. At this point, it will be possible to build, test, and deploy the Pythia-Jaeger system from source code. Testing with a microservices benchmark [8] will be used to validate functionality of the automated instrumentation framework.
  • Mid July — Early August: The tracing system will be iterated on to fix bugs and optimize performance. The enhanced Jaeger system will be tested in the Jaeger continuous integration environment. The aim of this phase is to work with Jaeger project maintainers to review and test the Pythia-Jaeger system.
  • Mid July — Early August: This period will involve an exploratory study for performance evaluation of Pythia-Jaeger. The tracing system will be evaluated using traces generated from the benchmark suite [8]. The metrics collected will be compared to the Jaeger system by itself. Deviations in performance will be explored and explained.
  • Early August — Mid August: Results of the evaluation will be used to improve the efficacy and performance of the instrumentation system. The final report describing the analysis, design, implementation, and overall evaluation will be written and edited. The work will be presented.

References

[1] Amazon.com. https://www.amazon.com/ .
[2] E. Ates, L. Sturmann, M. Toslali, O. Krieger, R. Megginson, A. K. Coskun, and R. R. Sambasivan. An automated, cross-layer instrumentation framework for diagnosing performance problems in distributed applications. In ACM Symposium on Cloud Computing (SoCC ’19), 2019.
[3] B. Cantrill, M. W. Shapiro, and A. H. Leventhal. Dynamic instrumentation of production systems. In ATC ’04: Proceedings of the 2004 USENIX Annual Technical Conference, 2004.
[4] M. Y. Chen, A. Accardi, E. Kiciman, J. Lloyd, D. Patterson, A. Fox, and E. Brewer. Path-based failure and evolution management. In NSDI ’04: Proceedings of the 1st USENIX Symposium on Networked Systems Design and Implementation, 2004.
[5] R. Ding, H. Zhou, J. Lou, H. Zhang, Q. Lin, Q. Fu, D. Zhang, and T. Xie. Log2: A cost-aware logging mechanism for performance diagnosis. In ATC ’15: Proceedings of the 2015 USENIX Annual Technical Conference, 2015.
[6] R. Fonseca, M. J. Freedman, and G. Porter. Experiences with tracing causality in networked services. In INM/WREN ’10: Proceedings of the 1st Internet Network Management Workshop/Workshop on Research on Enterprise Monitoring, 2010.
[7] R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica. X-Trace: a pervasive network tracing framework. In NSDI ’07: Proceedings of the 4th USENIX Symposium on Networked Systems Design and Implementation, 2007.
[8] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, K. Hu, M. Pancholi, Y. He, B. Clancy, C. Colen, F. Wen, C. Leung, S. Wang, L. Zaruvinsky, M. Espinosa, R. Lin, Z. Liu, J. Padilla, and C. Delimitrou. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge Systems. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019.
[9] Jaeger: open-source, end-to-end distributed tracing. https://www.jaegertracing.io .
[10] Jaeger Adopters. https://github.com/jaegertracing/jaeger/blob/master/ADOPTERS.md.
[11] Jaeger Sampling. https://www.jaegertracing.io/docs/1.22/sampling/.
[12] J. Kaldor, J. Mace, M. Bejda, E. Gao, W. Kuropatwa, J. O’Neill, K. W. Ong, B. Schaller, P. Shan, B. Viscomi, V. Venkataraman, K. Veeraraghavan, and Y. J. Song. Canopy: An end-to-end performance tracing and analysis system. In SOSP ’17: Proceedings of the 26th Symposium on Operating Systems Principles, 2017.
[13] J. Mace and R. Fonseca. Universal context propagation for distributed system instrumentation. In EuroSys’18: Proceedings of the Thirteenth EuroSys Conference, 2018.
[14] J. Mace, R. Roelke, and R. Fonseca. Pivot Tracing: dynamic causal monitoring for distributed systems. In SOSP ’15: Proceedings of the 25th Symposium on Operating Systems Principles, 2015.
[15] K. Clay. Amazon.com Goes Down, Loses $66,240 Per Minute. Forbes, 19-Aug-2013. https://www.forbes.com/sites/kellyclay/2013/08/19/amazon-com-goes-down-loses-66240-per-minute/ . Last accessed April 2021.
[16] M. L. Massie, B. N. Chun, and D. E. Culler. The Ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(7), July 2004.
[17] Netflix. Edgar: Solving Mysteries Faster with Observability. https://netflixtechblog.com/edgar-solving-mysteries-faster-with-observability-e1a76302c71f . Last accessed April 2021.
[18] OpenTracing website. http://opentracing.io/.
[19] A. Parker, D. Spoonhower, J. Mace, R. Isaacs, and B. Sigelman. Distributed tracing in practice: instrumenting, analyzing, and debugging microservices. Sebastopol, CA: O’Reilly Media, 2020.
[20] Red Hat. UPS streamlines tracking and delivery with DevOps and Red Hat. https://www.redhat.com/en/resources/ups-customer-case-study . Last accessed April 2021.
[21] P. Reynolds, C. Killian, J. L. Wiener, J. C. Mogul, M. Shah, and A. Vahdat. Pip: detecting the unexpected in distributed systems. In NSDI ’06: Proceedings of the 3rd USENIX Symposium on Networked Systems Design and Implementation, 2006.
[22] Roblox. The 100-million player platform. https://www.hashicorp.com/case-studies/roblox . Last accessed April 2021.
[23] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical Report dapper-2010-1, Google, Apr. 2010.
[24] R. R. Sambasivan, R. Fonseca, I. Shafer, and G. R. Ganger. So, you want to trace your distributed system? Key design insights from years of practical experience. Tech. Rep. CMU-PDL-14-102, Parallel Data Laboratory, Carnegie Mellon University, Pittsburgh, PA 15213-3890, April 2014.
[25] R. R. Sambasivan and G. R. Ganger. Automated diagnosis without predictability is a recipe for failure. In Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing, pages 21–21. USENIX Association, June 2012.
[26] R. R. Sambasivan, A. X. Zheng, M. De Rosa, E. Krevat, S. Whitman, M. Stroucken, W. Wang, L. Xu, and G. R. Ganger. Diagnosing performance changes by comparing request flows. In NSDI’11: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, 2011.
[27] D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. SherLog: error diagnosis by connecting clues from run-time logs. In ASPLOS ’10: Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, 2010.
[28] D. Yuan, S. Park, P. Huang, Y. Liu, M. M. Lee, X. Tang, Y. Zhou, and S. Savage. Be conservative: enhancing failure diagnosis with proactive logging. In OSDI ’12: Proceedings of the 10th Conference on Operating Systems Design and Implementation, 2012.
[29] X. Zhao, K. Rodrigues, Y. Luo, M. Stumm, D. Yuan, and Y. Zhou. Log20: Fully automated optimal placement of log printing statements under specified overhead threshold. In SOSP ’17: Proceedings of the 26th Symposium on Operating Systems Principles, 2017.