
Telemetry Metrics

Metrics allow operators to understand the internal state of a system by observing its outputs.

In TriggerMesh, telemetry is achieved by exposing a variety of time-based numeric measurements — also referred to as time-series — that can be collected and analyzed by third-party monitoring solutions.

This guide provides an overview of the nature and format of the telemetry metrics exposed by the TriggerMesh platform, as well as detailed examples of approaches for collecting and analyzing them.

Exposed Telemetry Data

Metrics Categories

Here is an overview of the types of metrics that are exposed by TriggerMesh components. The list is deliberately broad and generic, as some categories of metrics may only be exposed by certain types of components.

  • Processing of events
    • Successes and errors
    • Latency distribution
  • Delivery of events
    • Successes and errors
    • Latency distribution
  • Software runtime
    • Heap memory usage
    • Garbage collection

Later in this document, we will explore how metrics in these different categories can be collected, analyzed and visualized.

Data Model

Metrics are exposed by TriggerMesh in a line-oriented text-based format popularized by the Prometheus open source monitoring toolkit.

A Prometheus metric is typically represented as a single line of UTF-8 characters, optionally preceded by HELP and TYPE comment lines, starting with the name of the metric and ending with its value:

# HELP event_processing_success_count The number of events successfully processed
# TYPE event_processing_success_count counter
event_processing_success_count{event_source="my.cloud.storage", event_type="object.created"} 129389

The HELP comment provides a description of what the metric represents, whereas the TYPE comment carries information about the type of the metric (one of the four core metric types offered by Prometheus: counter, gauge, histogram, and summary).

As illustrated in the example above, the metric name may be directly followed by a list of comma-separated key-value pairs between curly brackets called labels. Labels allow differentiating the characteristics of what is being measured. Each unique combination of labels identifies a particular dimension of a metric. Labels, and metric dimensions by extension, vary from metric to metric.
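
For illustration, the metric from the example above could be reported with two label combinations, each forming its own time series (the second set of label values is hypothetical):

# two dimensions of the same metric, with illustrative values
event_processing_success_count{event_source="my.cloud.storage", event_type="object.created"} 129389
event_processing_success_count{event_source="my.cloud.storage", event_type="object.deleted"} 4521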

Enabling Metrics

By default, TriggerMesh does not enable any metrics backend. Observability of TriggerMesh can be configured via the config-observability ConfigMap object.

For example, the following ConfigMap object enables the Prometheus metrics exporter in all TriggerMesh components:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: triggermesh
data:
  metrics.backend-destination: prometheus
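
Assuming the manifest above is saved to a file (the file name below is arbitrary), it can be applied to the cluster with kubectl:

kubectl apply -f config-observability.yaml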

For a description of the configuration settings which are currently supported, please refer to the observability.yaml manifest files inside the Knative Eventing source repository.

Access to Metrics

Every TriggerMesh component exposes a set of telemetry metrics on the local HTTP endpoint :9092/metrics. Available measurements can be retrieved via a simple GET request. The returned values are an instant view of each measurement at the time of the request.
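
As a quick sketch (the Pod name below is hypothetical), the endpoint of a given component can be queried by forwarding its metrics port locally and issuing a plain GET request:

# forward local port 9092 to a TriggerMesh component Pod, then fetch its metrics
kubectl port-forward pod/awss3source-adapter-abc123 9092
# in a second terminal:
curl http://localhost:9092/metrics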

This model is particularly suitable for a pull-based retrieval of metrics ("scrape") on a fixed interval by monitoring software, and for storage in a time-series database. Prometheus itself is a popular choice for this job, as it includes both a server for scraping telemetry metrics, and a time-series database optimized for the Prometheus data model.

Metrics scraping

Most of the available metrics fall into the categories described previously in this document, whenever applicable based on the type of component. Some components may expose additional, application-specific metrics. The data model presented in the previous section makes these metrics easily discoverable by consumers.

Collection and Analysis with Prometheus and Grafana

This section provides examples of configurations for collecting metrics from TriggerMesh components using the Prometheus monitoring toolkit, and visualizing them using the observability platform Grafana.

The rest of this document assumes that Prometheus and Grafana are both available in the Kubernetes cluster where TriggerMesh is deployed.

Tip

If you don't already have Prometheus and Grafana set up in your TriggerMesh cluster, we recommend installing a pre-configured Prometheus-Grafana stack using the Helm application manager for Kubernetes by executing the commands below:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install -n monitoring prometheus-stack prometheus-community/kube-prometheus-stack

Detailed installation instructions are available in the kube-prometheus-stack chart's documentation.

Scraping Metrics via Prometheus

The Prometheus Operator — which is included in the aforementioned kube-prometheus-stack and is commonly used to manage Prometheus installations on Kubernetes — allows configuring Prometheus' scrape targets using familiar Kubernetes API objects.

The manifest below contains a PodMonitor that instructs Prometheus to automatically discover and scrape TriggerMesh components:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: triggermesh-components
  namespace: monitoring
spec:
  selector:
    matchLabels:  # (1)
      app.kubernetes.io/part-of: triggermesh
  namespaceSelector:  # (2)
    any: true
  podMetricsEndpoints:  # (3)
  - port: metrics
  - port: user-port
    relabelings:
    - action: replace
      targetLabel: __address__
      sourceLabels:
      - __meta_kubernetes_pod_ip
      replacement: $1:9092
  jobLabel: app.kubernetes.io/name  # (4)
  1. Selects all targets (Pods) that are labeled as being managed by TriggerMesh.
  2. Looks up targets (Pods) matching the above selector in all Kubernetes namespaces.
  3. For targets that matched the above selectors, scrapes the port named metrics if it exists, or falls back to TCP port 9092.
  4. Sets the value of the job label in collected metrics to the name of the component.

After applying this configuration to the Kubernetes cluster using the kubectl apply -f command, a list of targets matching the name of the PodMonitor should be reported by Prometheus with the state UP.

Tip

If you installed Prometheus via the kube-prometheus-stack Helm chart — as suggested in the introduction to this section — you should be able to forward the local port 9090 to the Prometheus instance running in the Kubernetes cluster:

kubectl -n monitoring port-forward svc/kube-prometheus-stack-prometheus 9090

Then, open your web browser at http://localhost:9090/targets.

Prometheus scrape targets

Visualizing Metrics in a Grafana Dashboard

By combining the visualization power of Grafana with the query language of Prometheus, we are able to analyze trends in collected metrics using different types of graphs and charts.

In the following example, we will create a dashboard like the one below where we analyze the latency distribution of a given measure — such as the processed or delivered events — over a selected period of time.

Grafana dashboard overview

Tip

If you installed Grafana via the kube-prometheus-stack Helm chart — as suggested in the introduction to this section — you should be able to forward the local port 3000 to the Grafana instance running in the Kubernetes cluster:

kubectl -n monitoring port-forward svc/kube-prometheus-stack-grafana 3000:80

Then, open your web browser at http://localhost:3000.

Number of Events Grouped by Latency Bucket

In this panel, we will use a bar gauge to display the distribution of the time spent by a component processing events over a selected period of time, organized in latency buckets.

The chart will be based on a Prometheus metric of type histogram with a pre-configured bucket distribution in milliseconds, defined by TriggerMesh as follows:

# HELP event_processing_latencies Time spent in the CloudEvents handler processing events
# TYPE event_processing_latencies histogram

In a histogram metric, cumulative counters for each bucket are exposed as sub-metrics of the main histogram metric, with _bucket appended to the name, and a le ("less than or equal") label indicating the upper bound of the bucket, such as in the example below:

event_processing_latencies_bucket{le="1"} 0
event_processing_latencies_bucket{le="2"} 0
event_processing_latencies_bucket{le="5"} 1541
event_processing_latencies_bucket{le="10"} 6161
event_processing_latencies_bucket{le="20"} 6776
event_processing_latencies_bucket{le="50"} 6865
event_processing_latencies_bucket{le="100"} 6868

We can start by charting the evolution of the raw counters corresponding to each latency bucket over time in a standard time series graph, before switching to bar gauges, in order to understand the metric's trend:

sum(
  event_processing_latencies_bucket
) by (le)

Warning

Always select the Heatmap format when working with Prometheus metrics of type histogram. This enables Grafana's built-in handling of Prometheus histograms, which results in buckets being sorted by le value and distinct counts being shown for each bucket, as one would expect.

Grafana processing latency - Raw

We can observe that counter values are cumulative. Some buckets have a steeper evolution than others, which already hints at a trend about the distribution of the component's processing latency.

When switching from a time series chart to a bar gauge chart, we will want the summary value to correspond to the total number of events processed in each latency bucket. We could be tempted to simply use the last value of each bucket in the selected time period as that total, but

  1. This would show the overall total of events processed by the component prior to the end of the time period, not the total of events processed during the selected time period.
  2. This might result in skewed calculations if the counters were reset during that time period, as illustrated at the very end of the range in the previous graph. Such a situation isn't unusual and can occur for various reasons, such as rollouts of new versions, horizontal scaling, Pod relocations, etc.

We will instead calculate the increase of the time series over the selected time range, using a query function which adjusts for breaks in monotonicity, such as counter resets:

sum(
  increase(event_processing_latencies_bucket[$__range])
) by (le)

Grafana processing latency - Range

Selecting Bar Gauge in the list of visualizations instead of Time Series now shows a histogram whose summary values accurately represent the number of events processed in each latency bucket for the selected time period:

Grafana processing latency - Bar gauge 1

The visualization options can be adjusted to one's preferences in order to obtain the desired result:

Grafana processing latency - Bar gauge 2

Rate of Processed Events by Latency Bucket

In this panel, we will use a time series graph to display the rate of events processed by a component over time, and distributed in latency buckets.

The chart will be based on the same Prometheus metric as in the previous example.

This time around, we will stick with the default Time Series visualization, but replace the increase() function used previously with the irate() function:

sum(
  irate(event_processing_latencies_bucket[$__rate_interval])
) by (le)

The interval used to calculate the rate of change of the metric can be either

  1. Selected manually using an explicit time duration such as "[3m]".
  2. Delegated to Grafana using Grafana's $__rate_interval variable, which calculates an optimal value based on the number of data points in the graph.

The result of this calculation is a series of metrics ("instant vectors"), each representing the number of events per second processed by the component over time in a given latency bucket:

Grafana processing latency - Time series

90th Percentile of the Processing Latency Over Time

In this panel, we will use a time series graph to display the 90th percentile of processing durations of events handled by a component over time. In other words, for each time interval on the graph, this measure indicates the duration within which 90% of all processed events were handled.

The chart will be based on the same Prometheus metric as in the previous example.

We can reuse the query from the previous example, and pass it to the histogram_quantile() function with the desired quantile:

histogram_quantile(0.90,
  sum(
    irate(event_processing_latencies_bucket[$__rate_interval])
  ) by (le)
)

The result of this calculation is a single metric ("instant vector"). An interesting exercise could be to compare this graph with the rate of processed events measured in the previous example, and observe whether or not spikes in the 90th latency percentile can be correlated with higher processing rates:

Grafana processing latency - 90th percentile

Additional Notes and Takeaways

Note that, in the previous examples, we generated three valuable types of visualizations from a single metric, simply by changing the Prometheus queries to perform different calculations and aggregations.

Thanks to the power of time series and the multi-dimensional aspect of Prometheus metrics, we could explore other types of visualizations based on this same metric, for example by filtering or grouping metrics by labels.

For instance, the following query could be used to graph the event processing rate over time broken down by event type:

sum(
  irate(event_processing_latencies_count[$__rate_interval])
) by (event_type)

This other query could be used to graph the event processing rate over time for a specific event type:

sum(
  irate(event_processing_latencies_count{event_type="object.created"}[$__rate_interval])
)

The same idea can be applied to visualizing metrics pertaining to a specific instance of a given component, or, for example, to all instances of that component within a specific Kubernetes namespace. There are many possibilities to explore; the only limit is the set of labels available for each metric.
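
For example, assuming the namespace label attached to collected samples by the scrape configuration above (the namespace name below is hypothetical), the following query would graph the processing rate of all TriggerMesh components deployed in a single namespace, broken down by component:

sum(
  irate(event_processing_latencies_count{namespace="my-namespace"}[$__rate_interval])
) by (job)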

Grafana processing latency - Label filter