Alerting vs Reporting vs Dashboarding
A recurring topic in observability discussions (at least in the author's) is the distinction between what should be an alert and what should be a report. Somewhere between these two extremes sits what should be a dashboard. Some information falls on a spectrum between these points, but in most cases bucketing it into alerts, reports and dashboards provides enough clarity to decide where to draw the line.
What is an alert
An alert should meet three criteria:
- It is causing an immediate effect on a service level
- It is specifically able to be acted on
- Lack of immediate action will worsen the impact on the service level
If it doesn't meet all three of these criteria, it is not something that merits an alert.
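The three criteria can be read as a simple all-or-nothing check. The following is a minimal sketch of that idea, not a definitive implementation: the `Signal` class, its field names and the sample signals are hypothetical, and in practice each flag would be derived from real telemetry rather than set by hand.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Hypothetical description of a condition a monitoring system has detected."""
    name: str
    affects_service_level: bool   # criterion 1: immediate effect on a service level
    actionable: bool              # criterion 2: the responder can specifically act on it
    worsens_without_action: bool  # criterion 3: impact grows if nobody acts now


def should_alert(signal: Signal) -> bool:
    """Return True only when all three criteria hold."""
    return (
        signal.affects_service_level
        and signal.actionable
        and signal.worsens_without_action
    )


# Illustrative signals only; the flag values are assumptions for the example.
signals = [
    Signal("checkout error rate breaching SLO", True, True, True),
    Signal("third-party datacenter power outage", True, False, True),
    Signal("failover to DR instance completed", False, True, False),
]
alerts = [s.name for s in signals if should_alert(s)]
print(alerts)
```

Only the first signal passes all three checks; the other two would belong on a dashboard or in a report instead.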
It is causing an immediate effect on a service level
This is the first hurdle any alert should clear. If the condition is not affecting a service level commitment or metric, it is not necessary to alert on it. Critical services have service levels, so if no service level is being breached, nothing critical is being affected.
It is specifically able to be acted on
An alert should include enough information that the person responding can immediately understand what is happening and use that to start the investigation and remediation process. If the alert cannot be acted on because it is too broad (such as "the network is slow"), or because the fix is outside the responder's control ("the third-party datacenter had a power outage"), then there is no use in having it, as the person responding cannot rapidly remediate the issue.
Lack of immediate action will worsen the impact on the service level
If there is a temporary increase in service latency because a service is scaling up, or a system crashed and is rebooting, that is not something to alert on. Unless the impact on the service level is continuing to worsen, no alert needs to be generated. If a server fails over to its DR instance, for example, that is not something that needs an alert, because there is no immediate impact on the service level of the system.
What is a report
A report is a scheduled or manually triggered point-in-time query listing actions that should be taken to maintain a system but that are not causing immediate issues. This can be information such as the number of systems on older software versions, security vulnerabilities and so on: items that are important but not necessarily time-critical. That is not to say that there is an open-ended timeframe for these to be resolved, but that the identification and application of these actions can be completed over a period of hours or days, subject to remediation windows. Reports can also inform status over time, such as counts of scaling changes or test failures, where the information describes the overall health of a system or service but is not necessarily tied directly to its service levels.
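As an illustration of a point-in-time report, the sketch below lists hosts running older software versions as a set of actions to take. It is an example under stated assumptions: the `Host` record, the `MINIMUM_VERSION` value and the in-memory `inventory` are all hypothetical stand-ins for whatever inventory source a real report would query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical inventory record; a real report would query an inventory source."""
    name: str
    version: tuple  # (major, minor, patch)


# Assumed minimum supported version, chosen only for illustration.
MINIMUM_VERSION = (2, 4, 0)


def format_version(version):
    """Render a version tuple such as (2, 4, 0) as '2.4.0'."""
    return ".".join(str(part) for part in version)


def outdated_hosts_report(hosts, minimum=MINIMUM_VERSION):
    """Return one action line per host running a version older than `minimum`."""
    outdated = sorted(
        (h for h in hosts if h.version < minimum),
        key=lambda h: (h.version, h.name),  # oldest versions first
    )
    return [
        f"{h.name}: upgrade {format_version(h.version)} -> {format_version(minimum)}"
        for h in outdated
    ]


inventory = [
    Host("web-01", (2, 4, 1)),
    Host("web-02", (2, 3, 9)),
    Host("db-01", (1, 9, 0)),
]
report_lines = outdated_hosts_report(inventory)
for line in report_lines:
    print(line)
```

Nothing here pages anyone: the output is a list of work to schedule within a remediation window, which is what separates it from an alert.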
What is a dashboard
A dashboard should have two characteristics. The first is that it surfaces information on a near-real-time basis that requires consistent updating. The second is that it provides a visual indicator of the health of a system, or a subset of the system, to highlight areas that are potentially unhealthy. This will include some service level information such as SLO target achievement, as well as forward-looking indicators that might point to a problem, such as traffic trending or 99th-percentile latency over time. A dashboard should serve one primary function - "Can I use this to identify the health of a service and potential problems". A second function of "Can I use this as an entry point to triage an issue" could be argued, but I would counter that this is just a subset of the primary function.
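To make the two kinds of dashboard figure concrete, here is a minimal sketch that computes SLO target achievement and a 99th-percentile latency for one window of requests. The function names, the 300 ms threshold, the 99% target and the sample data are assumptions made for illustration, and the nearest-rank percentile is only one of several common percentile definitions.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding surprises.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def slo_attainment(latencies_ms, threshold_ms):
    """Fraction of requests that completed within the latency threshold."""
    good = sum(1 for value in latencies_ms if value <= threshold_ms)
    return good / len(latencies_ms)


def service_panel(latencies_ms, threshold_ms=300, slo_target=0.99):
    """Build the values a single dashboard panel might display for one window."""
    attainment = slo_attainment(latencies_ms, threshold_ms)
    return {
        "p99_latency_ms": percentile(latencies_ms, 99),
        "slo_attainment": attainment,
        "healthy": attainment >= slo_target,
    }


# Hypothetical window of 100 request latencies in milliseconds.
window = [100] * 98 + [400, 900]
panel = service_panel(window)
print(panel)
```

Recomputing this for each new window is what gives the dashboard its near-real-time character, and a falling attainment or rising p99 is the kind of forward-looking indicator described above.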
What to remove from your observability/monitoring tool views
It's always best to remove as much as possible and only add in what is necessary. Some of the components that can or should be removed from your observability platform are:
- "Warning" or "Medium Severity" alerts. If it is not critical, it is not an alert
- Dashboards or alerts for systems and services that are not supporting a service level
- "Single Pane of Glass" dashboards. Start broad and narrow down, don't include all levels of detail in one location
- Uptime metrics
- Server/host/container count metrics