Terminology
Acronyms and References
MELT - Metrics, Events, Logs, Traces
USE Method - Utilisation, Saturation, Errors
RED Method - Rate, Errors, Duration
"Golden Signals" - Latency, Errors, Traffic, Saturation
"Core Web Vitals" - Largest Contentful Paint, First Input Delay, Cumulative Layout Shift
Apdex - A relative score between 0 and 1 that provides an indicator of user satisfaction with a service
Apdex-T - The threshold at which a user moves from a 'good' experience to one they 'tolerate'
MELT
MELT refers to the four ways that systems can be observed: Metrics, Events, Logs and Traces.
Metrics
Metrics are numeric representations from a system, such as CPU Usage %, Concurrent User Sessions, Database Transactions per Second and so on. They are usually highly efficient to store and query as they are numbers instead of text and so don't require as much processing to present information.
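As a rough sketch (the metric names and values here are purely illustrative), metric data can be thought of as timestamped numeric samples that are cheap to store and aggregate:

```python
# A minimal sketch of metrics as timestamped numeric samples.
# Metric names and values are illustrative only.
from statistics import mean
from time import time

samples = [
    {"name": "cpu_usage_percent", "timestamp": time(), "value": 42.5},
    {"name": "cpu_usage_percent", "timestamp": time() + 60, "value": 57.1},
    {"name": "db_transactions_per_second", "timestamp": time(), "value": 1310},
]

# Because the samples are just numbers, aggregation is cheap: no text parsing required.
cpu = [s["value"] for s in samples if s["name"] == "cpu_usage_percent"]
print(f"avg cpu usage: {mean(cpu):.1f}%")
```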
Events
An event is a collated set of activities that happened at a point in time. For example, a user logging into a web portal, a backup process starting, or an alert being sent from the system. Events and logs are closely related and can be very similar; my own differentiation is that a log is a more granular substep that happened as part of an event. As an example, an event may be that a vending machine purchase was made for $1.40. The individual logs might record the specific denominations that were inserted, any change given, transaction attempts to a payment provider, incrementing inventory and so on.
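The vending machine example could be represented something like the sketch below, with a single event and its more granular log substeps (all field names and values are illustrative only):

```python
# Hypothetical representation of one event and its related, more granular logs.
purchase_event = {
    "event": "vending_machine_purchase",
    "timestamp": "2024-01-01T10:15:00Z",
    "amount": 1.40,
}

related_logs = [
    {"timestamp": "2024-01-01T10:14:55Z", "message": "coin inserted: $1.00"},
    {"timestamp": "2024-01-01T10:14:57Z", "message": "coin inserted: $0.50"},
    {"timestamp": "2024-01-01T10:14:58Z", "message": "change dispensed: $0.10"},
    {"timestamp": "2024-01-01T10:14:59Z", "message": "payment provider transaction accepted"},
    {"timestamp": "2024-01-01T10:15:00Z", "message": "inventory updated for slot A3"},
]

print(purchase_event["event"], "with", len(related_logs), "log entries")
```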
Logs
A log is a text-based record of something done by the system. This can have several levels of granularity and structure but will tend to provide a historic record of what happened in a system. Because logs can have varying structures and content, they usually require some level of processing to provide consistency before reporting.
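As a rough illustration of that processing step, an unstructured log line can be parsed into consistent fields before reporting. The log format and field names below are assumptions for the example, not any particular product's schema:

```python
import re

# Hypothetical unstructured log line; the format is an assumption for illustration.
raw_line = "2024-01-01 10:15:00 ERROR payment-service Timeout contacting provider after 30s"

# Parse into consistent fields so logs with varying structure can be reported on together.
pattern = re.compile(
    r"(?P<timestamp>\S+ \S+) (?P<level>\w+) (?P<service>\S+) (?P<message>.+)"
)
match = pattern.match(raw_line)
structured = match.groupdict() if match else {"message": raw_line}
print(structured)
```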
Traces
A trace is a record of an end-to-end transaction across multiple systems; it is most often used in distributed environments but can also be used in single-system software. A trace tracks the activity of a workflow across different components of the software and infrastructure stack and represents the full lifecycle of that transaction. For example, a user logon may have a trace from the load balancer to the web server, through to the third-party identity provider and the resulting page that is then generated and sent back to the user. This can help to identify areas across the whole service that may be causing problems, such as delays between the web server and database layer that may not be visible if looking at the two systems individually.
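A trace is commonly modelled as a tree of spans, one per component the request passes through. The sketch below is a simplified, hand-rolled illustration of the user logon example rather than any specific tracing library's API:

```python
# Simplified, hypothetical spans for the user logon example above; real tracing
# systems such as OpenTelemetry use a similar parent/child span model.
spans = {
    1: {"parent": None, "component": "load-balancer",     "duration_ms": 310},
    2: {"parent": 1,    "component": "web-server",        "duration_ms": 290},
    3: {"parent": 2,    "component": "identity-provider", "duration_ms": 180},
    4: {"parent": 2,    "component": "page-render",       "duration_ms": 90},
}

def depth(span_id):
    # Walk up the parent chain to work out how deep this span sits in the trace.
    d, parent = 0, spans[span_id]["parent"]
    while parent is not None:
        d, parent = d + 1, spans[parent]["parent"]
    return d

# Viewing the whole trace shows where time went across the service, which may
# not be visible when each system is inspected on its own.
for span_id, span in spans.items():
    print(f"{'  ' * depth(span_id)}{span['component']}: {span['duration_ms']}ms")
```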
More Information
Splunk - MELT Explained: Metrics, Events, Logs & Traces
New Relic - Melt 101 - An introduction to the four essential telemetry data types
The Four Golden Signals
"The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four."
- https://sre.google/sre-book/monitoring-distributed-systems/
The Golden Signals should be considered a starting point only. Note the qualifier in the quote above: "If you can only measure four metrics of your user-facing system". Usually we are able to measure more than four metrics, and should incorporate as many as is reasonable for whatever we are measuring.
Latency
Latency measures the delay between a request and its response, or the time taken for a request to traverse a given path. This could be the movement of packets across the network or the time between a user initiating an action and receiving a response that the action has been taken. Because latency can and does vary based on what is measured and where, it's important to monitor latency from areas that are within your control. This typically means latency from when a request enters your network until it exits. Latency from end-user devices can vary based on physical location, connectivity and interference, other workloads running on the device and so on; these factors all impact latency but are not necessarily evidence of an issue with your service.
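As a minimal sketch, assuming we only want the portion of latency within our control, each request can be timed from entry to exit and then reported as percentiles rather than a single average (the handler below is a stand-in):

```python
import time
from statistics import quantiles

def handle_request():
    # Stand-in for the real work done by the service.
    time.sleep(0.01)

# Measure only the portion of latency within our control: from when a request
# enters our system until the response leaves it.
latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    handle_request()
    latencies_ms.append((time.perf_counter() - start) * 1000)

# Percentiles give a better picture of user experience than the average alone.
cuts = quantiles(latencies_ms, n=100)
print(f"p50={cuts[49]:.1f}ms  p95={cuts[94]:.1f}ms")
```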
Traffic
Traffic represents the amount of demand being placed on the system. For web-facing systems this would usually be something like the number of page requests per second or total bytes served/received. Because systems can scale dynamically, a high absolute amount of traffic does not necessarily mean the system is close to being overloaded, but it can correlate to other service issues if there are overall constraints.
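Traffic is usually tracked as a rate over a recent window rather than a raw total. A simple, hypothetical sliding-window sketch:

```python
from collections import deque
import time

WINDOW_SECONDS = 60
request_times = deque()

def record_request():
    request_times.append(time.monotonic())

def requests_per_second():
    # Drop anything older than the window, then average over the window length.
    cutoff = time.monotonic() - WINDOW_SECONDS
    while request_times and request_times[0] < cutoff:
        request_times.popleft()
    return len(request_times) / WINDOW_SECONDS

for _ in range(120):
    record_request()
print(f"~{requests_per_second():.1f} requests/sec over the last minute")
```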
Errors
Errors are the number of requests that fail where the failure is a result of your system. For example, a 404 error for a web service is not necessarily an error in your service, but a 5xx error could be. Dropped packets, timeouts and invalid responses are all errors that might be tracked and reported on as part of an error metric.
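For an HTTP-facing service, one way to express this distinction is to count only responses that indicate a failure of your own system (5xx) towards the error rate. A small sketch with made-up status codes:

```python
# Hypothetical sample of response status codes from a web service.
status_codes = [200, 200, 404, 200, 500, 200, 503, 200, 200, 404]

# 5xx responses indicate a failure within our service; 404s usually do not.
server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
error_rate = server_errors / len(status_codes)

print(f"error rate: {error_rate:.1%} ({server_errors} of {len(status_codes)} requests)")
```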
Saturation
Saturation measures how busy your system is relative to its total capacity to respond to requests. High levels of saturation on an individual service can correlate to increased failures or latency for that component, even if the overall traffic to the application is unchanged. Saturation can also be used as a measure for persistent data, such as the amount of disk capacity remaining in a system.
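As a minimal sketch of saturation as usage relative to capacity, the disk example above might look something like this (the path and threshold are illustrative only):

```python
import shutil

# Disk capacity as a saturation measure: how full is the volume relative to its total size?
usage = shutil.disk_usage("/")
saturation = usage.used / usage.total

print(f"disk saturation: {saturation:.1%}")
if saturation > 0.9:  # Illustrative threshold, not a recommendation.
    print("disk is highly saturated; failures or latency may follow")
```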
USE Method
RED Method
Core Web Vitals
"Core Web Vitals is a set of metrics that measure real-world user experience for loading performance, interactivity, and visual stability of the page. This, along with other page experience aspects, aligns with what our core ranking systems seek to reward."
- https://developers.google.com/search/docs/appearance/core-web-vitals
Largest Contentful Paint
Largest Contentful Paint (LCP) is an important, stable Core Web Vital metric for measuring perceived load speed because it marks the point in the page load timeline when the page's main content has likely loaded. A fast LCP helps reassure the user that the page is useful and usable. Google notes a "good" user experience will have an LCP of less than 2.5 seconds. LCP is measured from the first byte delivered. Depending on the application, this may or may not be a useful metric in itself, as delays in sending the first byte from the server can impact the user experience even when LCP is within the threshold.
Interaction to Next Paint
Interaction to Next Paint (INP) replaced an earlier metric called First Input Delay (FID) and measures the overall responsiveness of a website. As a user interacts with objects on a page, INP measures the time between the user interacting and the result of that interaction appearing on the page. The interactions observed across the page are distilled into a single responsiveness value, and Google notes a "good" INP is 200ms or less.
Cumulative Layout Shift
Cumulative Layout Shift (CLS) measures the overall 'unexpected' layout shift during the lifecycle of the page. Changes that impact layout but occur within 500ms of a user input will tend not to count towards layout shift. The CLS score is a calculated score where a lower score is better. Google notes a "good" user experience will have a CLS of less than 0.1.
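For reference, each individual layout shift score is calculated as the impact fraction multiplied by the distance fraction, and CLS then reflects the largest burst of these scores during the page's lifecycle. A minimal sketch of the per-shift calculation, with made-up fractions:

```python
# Each individual layout shift score is impact fraction * distance fraction.
# CLS then takes the largest "burst" (session window) of these scores; this
# sketch only shows the per-shift calculation.
def layout_shift_score(impact_fraction: float, distance_fraction: float) -> float:
    return impact_fraction * distance_fraction

# Hypothetical example: an element covering 50% of the viewport shifts by 14%
# of the viewport height after load.
print(layout_shift_score(0.5, 0.14))  # 0.07, within the "good" CLS range
```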
Apdex
Apdex is a simplified version of Service Level that is used to indicate user satisfaction with a service. For user-facing web services this is usually based on the response time of the service, although more complex methodologies can be used. Apdex defines a threshold at which a user's experience moves from "good" to "tolerating" and then "bad", and uses the ratio of "good" experiences to produce a score between zero and one as a simple indicator of overall user experience. While it has largely been replaced by more granular and relevant options, it is important to include within observability platform discussions as it can be used as a trigger to collect or activate various functionality, such as increasing sampling rates at certain score levels, or only collecting certain information when customer experience is outside of the "good" range. Information about Apdex, including the original whitepaper, can be found at https://www.apdex.org/
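The score calculation itself is straightforward: responses at or below the threshold count as satisfied, responses up to four times the threshold count as tolerating (weighted at half), and anything slower counts for nothing. A minimal sketch, assuming a 0.5 second threshold:

```python
def apdex(response_times, t):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= t
    tolerating: response time <= 4 * t
    frustrated: anything slower (counts as zero)
    """
    satisfied = sum(1 for r in response_times if r <= t)
    tolerating = sum(1 for r in response_times if t < r <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# Hypothetical response times in seconds, with an Apdex-T of 0.5s.
samples = [0.2, 0.3, 0.4, 0.6, 0.7, 1.1, 2.5, 0.45, 0.35, 3.0]
print(f"Apdex score: {apdex(samples, t=0.5):.2f}")  # 0.65 for this sample
```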
Apdex-T
Apdex-T is the underlying number that is used to calculate an Apdex score; it is the point at which a user's experience tips from "good" to "tolerating". In most platforms where Apdex scores and Apdex-T exist, a default value is usually set. Google's SRE book suggests a default Apdex-T value of 500ms for web-facing frontends. In most cases this number will need to be adjusted for the specific environment.