Service Level Objectives: What's working and what's not?

Author: Felipe Henrique Bastos e Bastos
Language: English
Year of publication: 2020
Subject:
DOI: 10.5281/zenodo.4153496
Description: As IT services grow more complex, a more sophisticated monitoring approach is required to guarantee their normal operation and reliability. Services need to be monitored from the perspective of the user experience, not only by watching the parameters of a specific host or operating system. A website might be up and running, but if it responds slowly, users may never try to open it again. This problem can be addressed by implementing the latest Service Level Objective (SLO) monitoring practices, which are inspired by the Site Reliability Engineering (SRE) discipline. This project covers the work required to study and implement an SLI/SLO dashboard for the needs of the IT Monitoring Service at CERN. It describes the main concepts and identifies appropriate Service Level Indicator (SLI) metrics for monitoring the main service parameters. As next steps, it implements a mechanism for gathering the required metrics and calculating SLI and error budget results. All the information is made available on a single dashboard that gives the service manager immediate feedback on the short- and long-term reliability status of the provided service. In addition, a set of alerts is configured to notify of a sudden drop in service reliability, in order to improve awareness and reduce reaction time in case of incidents.
Database: OpenAIRE