OpenTelemetry Observability v0.0.15
Klio provides built-in support for OpenTelemetry, enabling comprehensive observability through distributed tracing and metrics collection. This allows you to monitor backup operations, performance characteristics, and system health across your Klio deployment.
Available Telemetry
Klio automatically collects the following:
- Traces
  - Distributed WAL streaming and processing
  - Backup lifecycle (backup, backup run, verification, maintenance)
- Metrics
  - Server
    - Backup operation metrics
      - Number of snapshots
      - Number of files in the latest snapshot
      - Number of directories in the latest snapshot
      - Size of the latest snapshot
      - Age of the latest snapshot
      - Age of the oldest snapshot
    - WAL processing metrics
      - Number of WAL files written
      - Bytes written
      - Timestamp of the most recently written WAL file
    - Queue metrics
      - Number of messages in the queue
      - Number of bytes in the queue
    - gRPC metrics
    - Go runtime statistics
    - Host metrics
    - Controller runtime metrics
  - Sidecar
    - Backup lifecycle metrics
      - Whether a backup is currently running
      - Timestamp of the most recent backup start
      - Timestamp of the most recent successful completion
      - Timestamp of the most recent failure
      - Duration of the most recent backup
      - Total number of successful backups
      - Total number of failed backups
    - Backup verification metrics
    - gRPC metrics
    - Go runtime statistics
    - Host metrics
    - Controller runtime metrics
Note
Log exporters are not currently supported.
Traces Reference
Backup lifecycle spans
When a backup is triggered through CNPG-I, Klio creates
the following spans under the klio.backup tracer:
| Span Name | Description |
|---|---|
| backup | Root span covering the entire backup operation (run + verify + maintenance) |
| backup_run | Child span for the actual data backup execution |
| backup_verify | Child span for post-backup verification |
| backup_maintenance | Child span for post-backup maintenance |
The backup span includes the following attributes:
| Attribute | Type | Description |
|---|---|---|
| backup.name | string | Name assigned to the backup |
On failure, the span records the error and sets its status to
ERROR.
Metrics Reference
Backup lifecycle metrics (sidecar)
These metrics are emitted by the sidecar and track backup operations on each PostgreSQL instance:
| Metric Name | Type | Unit | Description |
|---|---|---|---|
| klio.backup.running | Gauge | - | Whether a backup is currently running (1) or not (0) |
| klio.backup.latest_start_time | Gauge | s | Unix epoch timestamp when the most recent backup started |
| klio.backup.latest_completion_time | Gauge | s | Unix epoch timestamp when the most recent backup completed successfully |
| klio.backup.latest_failure_time | Gauge | s | Unix epoch timestamp when the most recent backup failed |
| klio.backup.latest_duration_seconds | Gauge | s | Duration of the most recent backup in seconds |
| klio.backup.successes | Counter | - | Total number of successful backups |
| klio.backup.failures | Counter | - | Total number of failed backups |
| klio.backup.verifications | Counter | - | Total number of backup verification attempts |
| klio.backup.verification_failures | Counter | - | Total number of backup verification failures |
WAL server metrics (server)
These metrics are emitted by the Klio server WAL component and track WAL file reception from PostgreSQL instances:
| Metric Name | Type | Unit | Description |
|---|---|---|---|
| klio.wal.written_size | Counter | By | Number of bytes written to disk for WAL files |
| klio.wal.written | Counter | - | Number of WAL files written |
| klio.wal.latest_written_time | Gauge | s | Unix epoch timestamp of the most recently written WAL file to disk |
WAL consumer metrics (server)
These metrics are emitted by the Klio server WAL consumer and track WAL archival to Tier 2 storage:
| Metric Name | Type | Unit | Description |
|---|---|---|---|
| klio.consumer.written_size | Counter | By | Number of bytes written to Tier 2 for WAL files |
| klio.consumer.written | Counter | - | Number of WAL files written to Tier 2 |
| klio.consumer.latest_written_time | Gauge | s | Unix epoch timestamp of the most recently written WAL file to Tier 2 |
| klio.consumer.backup_verification_success | Counter | - | Number of successful backup verifications |
| klio.consumer.backup_verification_failure | Counter | - | Number of failed backup verifications (corruption detected) |
Alerting on stalled WAL processing
Despite sharing a similar name, klio.wal.latest_written_time and
klio.consumer.latest_written_time track two distinct stages of the
WAL pipeline and signal different failure scenarios:
- klio.wal.latest_written_time reflects when the Klio server last received a WAL file from PostgreSQL streaming replication (Tier 1). A stale value means PostgreSQL is no longer shipping WALs to Klio, which may indicate a replication problem.
- klio.consumer.latest_written_time reflects when the WAL consumer last archived a WAL file to Tier 2 object storage (S3). A stale value means the S3 backend is no longer receiving WALs, even though PostgreSQL replication may still be working.
Both metrics carry a cluster_name attribute label identifying the
PostgreSQL cluster the WAL event belongs to.
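As an illustration of alerting on the two stages separately, the following is a sketch of a Prometheus alerting rule file, not a configuration shipped with Klio. It assumes the metrics reach Prometheus under the names klio_wal_latest_written_time_seconds and klio_consumer_latest_written_time_seconds (dots replaced by underscores and the unit appended, as OpenTelemetry-to-Prometheus translation commonly does); check the names actually exposed in your environment before using it. The alert names, the 900-second threshold, and the 5-minute hold time are hypothetical values chosen for the example.

```yaml
groups:
  - name: klio-wal-staleness
    rules:
      # Tier 1: the Klio server has not received a WAL file recently.
      # Metric name is an assumption; verify it against your scrape output.
      - alert: KlioWalReceptionStalled
        expr: time() - klio_wal_latest_written_time_seconds > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No WAL received from PostgreSQL for cluster {{ $labels.cluster_name }}"
      # Tier 2: the WAL consumer has not archived a WAL file recently.
      - alert: KlioWalArchivalStalled
        expr: time() - klio_consumer_latest_written_time_seconds > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No WAL archived to Tier 2 for cluster {{ $labels.cluster_name }}"
```

Keeping the two alerts separate preserves the distinction described above: the first points at replication, the second at the object storage backend. The threshold should be tuned to how often the cluster normally produces WAL files.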
Base backup metrics (server)
These metrics are emitted by the Klio server base backup component and track Kopia snapshot statistics:
| Metric Name | Type | Unit | Description |
|---|---|---|---|
| klio.base.snapshots | Gauge | - | Total number of base snapshots |
| klio.base.latest_snapshot_size | Gauge | By | Size of latest base snapshot in bytes (ignoring compression and deduplication) |
| klio.base.latest_snapshot_files | Gauge | - | Number of files in latest base snapshot |
| klio.base.latest_snapshot_dirs | Gauge | - | Number of directories in latest base snapshot |
| klio.base.latest_snapshot_age | Gauge | s | Age of latest base snapshot in seconds |
| klio.base.oldest_snapshot_age | Gauge | s | Age of oldest base snapshot in seconds |
Queue metrics (server)
These metrics are emitted by the Klio server and track the state of the embedded NATS JetStream queue used for asynchronous Tier 2 offloading of WAL files and backups:
| Metric Name | Type | Unit | Description |
|---|---|---|---|
| klio.queue.messages | Gauge | - | Number of messages currently stored in the embedded NATS JetStream queue |
| klio.queue.bytes | Gauge | By | Number of bytes currently stored in the embedded NATS JetStream queue |
Configuration
Klio automatically detects OpenTelemetry configuration through standard environment variables. If no OpenTelemetry environment variables are present, Klio will use no-op providers that don't collect any telemetry data.
Traces and metrics exporters can be configured independently through the
autoexport package.
General Settings
The following environment variables are used to configure OpenTelemetry:
- OTEL_SERVICE_NAME: (required) Name of the service, e.g., klio-server
- OTEL_RESOURCE_ATTRIBUTES: Comma-separated list of resource attributes (e.g., deployment.environment=production,service.namespace=klio-system)
- OTEL_RESOURCE_DETECTORS: Comma-separated list of resource detectors from the autodetect package, used to automatically populate resource attributes
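For illustration, these general settings could be expressed as a container env list like the fragment below. This is only a sketch: the attribute values repeat the examples given above, and the detector selection is an arbitrary subset chosen for the example.

```yaml
# Fragment of a container spec; where it is placed depends on how Klio is deployed.
env:
  - name: OTEL_SERVICE_NAME            # required
    value: "klio-server"
  - name: OTEL_RESOURCE_ATTRIBUTES     # comma-separated key=value pairs
    value: "deployment.environment=production,service.namespace=klio-system"
  - name: OTEL_RESOURCE_DETECTORS      # comma-separated detector names
    value: "host,os.type"
```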
Traces exporter
To enable the traces exporter, set the OTEL_TRACES_EXPORTER environment
variable to one of the supported exporters:
- otlp: OpenTelemetry Protocol (OTLP) exporter
- console: Console exporter (useful for debugging)
- none: No-op exporter (disables tracing)
You can define the OTLP protocol using the OTEL_EXPORTER_OTLP_TRACES_PROTOCOL
variable, or the general OTEL_EXPORTER_OTLP_PROTOCOL. Supported protocols
include:
- http/protobuf (default)
- grpc
Additional configuration options for trace exporters are described in the documentation of the respective exporters.
Metrics Exporter
To enable the metrics exporter, set the OTEL_METRICS_EXPORTER environment
variable to one of the supported exporters:
- otlp: OpenTelemetry Protocol (OTLP) exporter
- prometheus: Prometheus exporter + HTTP server
- console: Console exporter (useful for debugging)
- none: No-op exporter (disables metrics)
You can define the OTLP protocol using the OTEL_EXPORTER_OTLP_METRICS_PROTOCOL
variable, or the general OTEL_EXPORTER_OTLP_PROTOCOL. Supported protocols
include:
- http/protobuf (default)
- grpc
Additional configuration options for metrics exporters are described in the documentation of the respective exporters.
For the Prometheus exporter, you can configure the host and port of the HTTP server using the following environment variables:
- OTEL_EXPORTER_PROMETHEUS_HOST (default: localhost)
- OTEL_EXPORTER_PROMETHEUS_PORT (default: 9464)
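The fragment below sketches how the Prometheus exporter settings might be combined with a matching scrape job. Several details are assumptions rather than facts from this page: binding to 0.0.0.0 is shown because a server listening on localhost is generally not reachable from outside its pod, the target hostname my-klio-server is hypothetical, and the /metrics path is the conventional default and should be confirmed for your version.

```yaml
# ConfigMap data consumed by the Klio container (illustrative values)
data:
  OTEL_SERVICE_NAME: "klio-server"
  OTEL_METRICS_EXPORTER: "prometheus"
  OTEL_EXPORTER_PROMETHEUS_HOST: "0.0.0.0"   # assumption: listen on all interfaces
  OTEL_EXPORTER_PROMETHEUS_PORT: "9464"      # same as the documented default
---
# Prometheus scrape configuration (hypothetical target name)
scrape_configs:
  - job_name: klio-server
    metrics_path: /metrics                   # assumption: conventional path
    static_configs:
      - targets: ["my-klio-server:9464"]
```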
Exporters and receivers
The OTLP exporter pushes telemetry to any OTLP-compatible receiver. Common options include:
- An OpenTelemetry Collector, which can receive OTLP data and fan it out to multiple backends (Prometheus, Jaeger, Grafana, etc.). In Kubernetes, the OpenTelemetry Operator manages collectors via the OpenTelemetryCollector CRD and can expose a stable in-cluster OTLP endpoint for Klio to target.
- Any backend with native OTLP support.
The Prometheus exporter starts a local HTTP server that Prometheus scrapes directly, with no intermediate collector required.
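For the OTLP path, a minimal OpenTelemetry Collector configuration might look like the sketch below. It is not a recommended production setup: it accepts OTLP over gRPC and HTTP on the conventional ports and writes everything to the Collector's debug exporter, which is useful only to confirm that telemetry from Klio arrives. TLS is omitted for brevity, so a client configured with https endpoints and certificates would need matching TLS settings on the receiver.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # conventional OTLP/gRPC port
      http:
        endpoint: 0.0.0.0:4318   # conventional OTLP/HTTP port
processors:
  batch: {}
exporters:
  debug: {}                      # prints received telemetry to the Collector log
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]
```

Replacing the debug exporter with the exporters for your backends is what turns this into the fan-out arrangement described above.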
Configuring Klio with OpenTelemetry in Kubernetes
When running in a Kubernetes environment, Klio will automatically define
CONTAINER_NAME, POD_NAME and NAMESPACE_NAME environment variables.
When any of these environment variables are set, Klio will automatically add
the corresponding resource attributes (k8s.container.name, k8s.pod.name,
k8s.namespace.name) to all telemetry data. Each attribute is added
independently; you don't need all three environment variables to be present.
Important
If you have already defined any of these attributes in
OTEL_RESOURCE_ATTRIBUTES, Klio will not override them. Only missing
attributes will be added from the environment variables. This allows you to
customize the values while still benefiting from automatic defaults for any
attributes you don't explicitly set.
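As a worked illustration of this precedence rule, consider the fragment below. The namespace value backup-prod is hypothetical, and the comments describe the outcome expected from the behavior stated above rather than captured output.

```yaml
env:
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "k8s.namespace.name=backup-prod"   # explicitly set, so Klio keeps it
# Expected resource attributes, given the rule above:
#   k8s.namespace.name  = backup-prod             (from OTEL_RESOURCE_ATTRIBUTES)
#   k8s.pod.name        = value of POD_NAME       (added because it was missing)
#   k8s.container.name  = value of CONTAINER_NAME (added because it was missing)
```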
Klio server with OpenTelemetry
When deploying a Klio Server, you can configure OpenTelemetry by specifying
the necessary settings in the template section of the Server spec:
- Set the required environment variables for OpenTelemetry configuration in the server container.
- Mount any necessary TLS certificates for secure communication with the OpenTelemetry Collector.
For simpler management, use a ConfigMap to store the OpenTelemetry configuration:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: klio-otel-config
    data:
      OTEL_SERVICE_NAME: "klio-server"
      OTEL_RESOURCE_DETECTORS: "telemetry.sdk,host,os.type,process.executable.name"
      OTEL_TRACES_EXPORTER: "otlp"
      OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: "grpc"
      OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: "https://otel-collector:4317"
      OTEL_EXPORTER_OTLP_TRACES_COMPRESSION: "gzip"
      OTEL_EXPORTER_OTLP_TRACES_TIMEOUT: "10000"
      OTEL_EXPORTER_OTLP_TRACES_INSECURE: "false"
      OTEL_EXPORTER_OTLP_TRACES_CERTIFICATE: "/otel/ca.crt"
      OTEL_EXPORTER_OTLP_TRACES_CLIENT_CERTIFICATE: "/otel/tls.crt"
      OTEL_EXPORTER_OTLP_TRACES_CLIENT_KEY: "/otel/tls.key"
      OTEL_METRICS_EXPORTER: "otlp"
      OTEL_METRIC_EXPORT_INTERVAL: "60000"
      OTEL_EXPORTER_OTLP_METRICS_PROTOCOL: "grpc"
      OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: "https://otel-collector:4317"
      OTEL_EXPORTER_OTLP_METRICS_TIMEOUT: "60000"
      OTEL_EXPORTER_OTLP_METRICS_INSECURE: "false"
      OTEL_EXPORTER_OTLP_METRICS_CERTIFICATE: "/otel/ca.crt"
      OTEL_EXPORTER_OTLP_METRICS_CLIENT_CERTIFICATE: "/otel/tls.crt"
      OTEL_EXPORTER_OTLP_METRICS_CLIENT_KEY: "/otel/tls.key"
    ---
    apiVersion: klio.enterprisedb.io/v1alpha1
    kind: Server
    metadata:
      name: my-klio-server
    spec:
      # ... other configuration ...
      template:
        spec:
          containers:
            - name: server
              envFrom:
                - configMapRef:
                    name: klio-otel-config
              volumeMounts:
                - mountPath: /otel
                  name: otel
          volumes:
            - name: otel
              projected:
                sources:
                  - secret:
                      name: otel-collector-tls
                      items:
                        - key: ca.crt
                          path: ca.crt
                  - secret:
                      name: otel-client-cert
                      items:
                        - key: tls.crt
                          path: tls.crt
                        - key: tls.key
                          path: tls.key
Klio plugins with OpenTelemetry
When deploying Klio as a CNPG Cluster plugin, configure OpenTelemetry by
specifying the necessary environment variables in the containers section of
the PluginConfiguration spec. The available container names are:
- klio-plugin: Main plugin sidecar for backup management
- klio-restore: Restore operations sidecar
Create a ConfigMap for the shared OpenTelemetry configuration:
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-klio-otel-config
    data:
      OTEL_RESOURCE_DETECTORS: "telemetry.sdk,host,os.type,process.executable.name"
      OTEL_TRACES_EXPORTER: "otlp"
      OTEL_METRICS_EXPORTER: "otlp"
      OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
      OTEL_EXPORTER_OTLP_ENDPOINT: "https://otel-collector:4317"
      OTEL_EXPORTER_OTLP_COMPRESSION: "gzip"
      OTEL_EXPORTER_OTLP_TIMEOUT: "10000"
      OTEL_EXPORTER_OTLP_INSECURE: "false"
      OTEL_EXPORTER_OTLP_CERTIFICATE: "/projected/ca.crt"
      OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE: "/projected/tls.crt"
      OTEL_EXPORTER_OTLP_CLIENT_KEY: "/projected/tls.key"
Configure the PluginConfiguration to inject the environment variables into
each sidecar container:
    apiVersion: klio.enterprisedb.io/v1alpha1
    kind: PluginConfiguration
    metadata:
      name: client-config-cluster-example
    spec:
      serverAddress: klio.default
      clientSecretName: cluster-example-klio-user
      serverSecretName: klio-server-tls
      clusterName: cluster-example
      containers:
        - name: klio-plugin
          env:
            - name: OTEL_SERVICE_NAME
              value: "klio-plugin"
          envFrom:
            - configMapRef:
                name: cluster-klio-otel-config
        - name: klio-restore
          env:
            - name: OTEL_SERVICE_NAME
              value: "klio-restore"
          envFrom:
            - configMapRef:
                name: cluster-klio-otel-config
Mount the OpenTelemetry certificates using the Cluster's projectedVolumeTemplate.
The projected volume is mounted at /projected/ and is accessible to all
sidecar containers:
    apiVersion: postgresql.cnpg.io/v1
    kind: Cluster
    metadata:
      name: cluster-example
    spec:
      instances: 3
      projectedVolumeTemplate:
        sources:
          - secret:
              name: otel-collector-tls
              items:
                - key: ca.crt
                  path: ca.crt
          - secret:
              name: otel-client-cert
              items:
                - key: tls.crt
                  path: tls.crt
                - key: tls.key
                  path: tls.key
      plugins:
        - name: klio.enterprisedb.io
          enabled: true
          parameters:
            pluginConfigurationRef: client-config-cluster-example
      storage:
        size: 10Gi