Description: |
Microservices have emerged as a popular pattern for developing large-scale applications in cloud environments because of their flexibility, scalability, and agility. However, their scale, use of multiple languages, and distributed nature make them more complex to manage. Orchestration and automation tools such as Kubernetes help deploy many microservices running simultaneously, but it can still be difficult for an operator to understand their behaviors, interdependencies, and interactions. In such a complex and dynamic environment, diagnosing and recovering from performance problems (e.g., slow application responses and high resource usage) requires significant human effort. Moreover, manual diagnosis of cloud microservices tends to be tedious, time-consuming, and impractical. Effective, automated performance analysis and anomaly detection require an observable system, that is, one whose internal state can be inferred by observing and tracking metrics, traces, and logs. Traditional application performance monitoring (APM) uses libraries and SDKs to improve application monitoring and tracing, but incurs the additional overhead of rewriting, recompiling, and redeploying the application's code base. Therefore, there is a critical need for a standardized, automated microservices observability solution that does not require rewriting or redeploying the application, so that it can keep up with the agility of microservices. This thesis studies observability for microservices and implements an automated observability solution based on the extended Berkeley Packet Filter (eBPF). eBPF is a Linux feature that allows extensions to the Linux kernel to be written for security and observability use cases. eBPF does not require modifying the application layer or instrumenting individual microservices. Instead, it instruments the kernel-level API calls, which are common across all hosts in the cluster.
eBPF programs provide observability information from the lowest-level system calls and can export data without additional performance overhead. The Prometheus time-series database is used to store all captured metrics and traces for analysis. With the help of our tool, a DevOps engineer can easily identify abnormal behavior in microservices and apply appropriate countermeasures. Using Chaos Mesh, we inject anomalies at the network and host layers, which we can then identify, together with their root causes, using the proposed solution. The Chameleon cloud testbed is used to deploy our solution and test its capabilities and limitations. |