OpenTelemetry Observability v0.0.15

Klio provides built-in support for OpenTelemetry, enabling comprehensive observability through distributed tracing and metrics collection. This allows you to monitor backup operations, performance characteristics, and system health across your Klio deployment.

Available Telemetry

Klio automatically collects the following:

  • Traces
    • Distributed WAL streaming and processing
    • Backup lifecycle (backup, backup run, verification, maintenance)
  • Metrics
    • Server
      • Backup operation metrics
        • Number of snapshots
        • Number of files in the latest snapshot
        • Number of directories in the latest snapshot
        • Size of the latest snapshot
        • Age of the latest snapshot
        • Age of the oldest snapshot
      • WAL processing metrics
        • Number of WAL files written
        • Bytes written
        • Timestamp of the most recently written WAL file
      • Queue metrics
        • Number of messages in the queue
        • Number of bytes in the queue
      • gRPC metrics
      • Go runtime statistics
      • Host metrics
      • Controller runtime metrics
    • Sidecar
      • Backup lifecycle metrics
        • Whether a backup is currently running
        • Timestamp of the most recent backup start
        • Timestamp of the most recent successful completion
        • Timestamp of the most recent failure
        • Duration of the most recent backup
        • Total number of successful backups
        • Total number of failed backups
      • Backup verification metrics
      • gRPC metrics
      • Go runtime statistics
      • Host metrics
      • Controller runtime metrics

Note

Log exporters are not currently supported.

Traces Reference

Backup lifecycle spans

When a backup is triggered through CNPG-I, Klio creates the following spans under the klio.backup tracer:

| Span Name | Description |
| --- | --- |
| backup | Root span covering the entire backup operation (run + verify + maintenance) |
| backup_run | Child span for the actual data backup execution |
| backup_verify | Child span for post-backup verification |
| backup_maintenance | Child span for post-backup maintenance |

The backup span includes the following attributes:

| Attribute | Type | Description |
| --- | --- | --- |
| backup.name | string | Name assigned to the backup |

On failure, the span records the error and sets its status to ERROR.

Metrics Reference

Backup lifecycle metrics (sidecar)

These metrics are emitted by the sidecar and track backup operations on each PostgreSQL instance:

| Metric Name | Type | Unit | Description |
| --- | --- | --- | --- |
| klio.backup.running | Gauge | - | Whether a backup is currently running (1) or not (0) |
| klio.backup.latest_start_time | Gauge | s | Unix epoch timestamp when the most recent backup started |
| klio.backup.latest_completion_time | Gauge | s | Unix epoch timestamp when the most recent backup completed successfully |
| klio.backup.latest_failure_time | Gauge | s | Unix epoch timestamp when the most recent backup failed |
| klio.backup.latest_duration_seconds | Gauge | s | Duration of the most recent backup in seconds |
| klio.backup.successes | Counter | - | Total number of successful backups |
| klio.backup.failures | Counter | - | Total number of failed backups |
| klio.backup.verifications | Counter | - | Total number of backup verification attempts |
| klio.backup.verification_failures | Counter | - | Total number of backup verification failures |
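
As an illustration of how these metrics might be used, the following is a minimal sketch of a Prometheus alerting rule file. It assumes the metrics reach Prometheus with the usual OpenTelemetry-to-Prometheus name translation (dots become underscores, counters gain a `_total` suffix, gauges with unit `s` gain a `_seconds` suffix). The resulting metric names, the alert names, and the thresholds are assumptions and not part of Klio, so check them against what your exporter actually exposes.

```yaml
# Hypothetical Prometheus rule file; metric names assume the standard
# OpenTelemetry-to-Prometheus naming translation and should be verified.
groups:
  - name: klio-backup
    rules:
      # Fires when at least one backup failed during the last hour.
      - alert: KlioBackupFailed
        expr: increase(klio_backup_failures_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "A Klio backup failed within the last hour"
      # Fires when the last successful backup is older than two days
      # (illustrative threshold; align it with your backup schedule).
      - alert: KlioBackupTooOld
        expr: time() - klio_backup_latest_completion_time_seconds > 2 * 24 * 3600
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "No successful Klio backup for more than two days"
```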

WAL server metrics (server)

These metrics are emitted by the Klio server WAL component and track WAL file reception from PostgreSQL instances:

| Metric Name | Type | Unit | Description |
| --- | --- | --- | --- |
| klio.wal.written_size | Counter | By | Number of bytes written to disk for WAL files |
| klio.wal.written | Counter | - | Number of WAL files written |
| klio.wal.latest_written_time | Gauge | s | Unix epoch timestamp of the most recently written WAL file to disk |

WAL consumer metrics (server)

These metrics are emitted by the Klio server WAL consumer and track WAL archival to Tier 2 storage:

| Metric Name | Type | Unit | Description |
| --- | --- | --- | --- |
| klio.consumer.written_size | Counter | By | Number of bytes written to Tier 2 for WAL files |
| klio.consumer.written | Counter | - | Number of WAL files written to Tier 2 |
| klio.consumer.latest_written_time | Gauge | s | Unix epoch timestamp of the most recently written WAL file to Tier 2 |
| klio.consumer.backup_verification_success | Counter | - | Number of successful backup verifications |
| klio.consumer.backup_verification_failure | Counter | - | Number of failed backup verifications (corruption detected) |

Alerting on stalled WAL processing

Despite their similar names, klio.wal.latest_written_time and klio.consumer.latest_written_time track two distinct stages of the WAL pipeline and signal different failure scenarios:

  • klio.wal.latest_written_time reflects when the Klio server last received a WAL file from PostgreSQL streaming replication (Tier 1). A stale value means PostgreSQL is no longer shipping WALs to Klio, which may indicate a replication problem.

  • klio.consumer.latest_written_time reflects when the WAL consumer last archived a WAL file to Tier 2 object storage (S3). A stale value means the S3 backend is no longer receiving WALs, even though PostgreSQL replication may still be working.

Both metrics carry a cluster_name attribute identifying the PostgreSQL cluster that the WAL event belongs to.
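
The two failure scenarios above could be turned into separate alerts, one per pipeline stage. The sketch below is a Prometheus rule file that compares each timestamp with the current time. The Prometheus-side metric names (with a `_seconds` suffix derived from the unit), the alert names, and the 15-minute threshold are assumptions for illustration. On a mostly idle database, WAL files may legitimately be produced less often, so choose a threshold that matches your workload.

```yaml
# Hypothetical Prometheus rule file; metric names and thresholds are
# assumptions and should be checked against the exporter's real output.
groups:
  - name: klio-wal
    rules:
      # Tier 1: the Klio server has stopped receiving WAL files from PostgreSQL.
      - alert: KlioWalReceptionStalled
        expr: time() - klio_wal_latest_written_time_seconds > 900
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No WAL received from cluster {{ $labels.cluster_name }} for over 15 minutes"
      # Tier 2: the WAL consumer has stopped archiving to object storage.
      - alert: KlioWalArchivalStalled
        expr: time() - klio_consumer_latest_written_time_seconds > 900
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No WAL archived to Tier 2 for cluster {{ $labels.cluster_name }} for over 15 minutes"
```

If only the second alert fires, replication into Klio is still working and the problem lies between the server and Tier 2 storage.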

Base backup metrics (server)

These metrics are emitted by the Klio server base backup component and track Kopia snapshot statistics:

| Metric Name | Type | Unit | Description |
| --- | --- | --- | --- |
| klio.base.snapshots | Gauge | - | Total number of base snapshots |
| klio.base.latest_snapshot_size | Gauge | By | Size of latest base snapshot in bytes (ignoring compression and deduplication) |
| klio.base.latest_snapshot_files | Gauge | - | Number of files in latest base snapshot |
| klio.base.latest_snapshot_dirs | Gauge | - | Number of directories in latest base snapshot |
| klio.base.latest_snapshot_age | Gauge | s | Age of latest base snapshot in seconds |
| klio.base.oldest_snapshot_age | Gauge | s | Age of oldest base snapshot in seconds |

Queue metrics (server)

These metrics are emitted by the Klio server and track the state of the embedded NATS JetStream queue used for asynchronous Tier 2 offloading of WAL files and backups:

| Metric Name | Type | Unit | Description |
| --- | --- | --- | --- |
| klio.queue.messages | Gauge | - | Number of messages currently stored in the embedded NATS JetStream queue |
| klio.queue.bytes | Gauge | By | Number of bytes currently stored in the embedded NATS JetStream queue |

Configuration

Klio automatically detects OpenTelemetry configuration through standard environment variables. If no OpenTelemetry environment variables are present, Klio will use no-op providers that don't collect any telemetry data.

Traces and metrics exporters can be configured independently through the autoexport package.

General Settings

The following environment variables are used to configure OpenTelemetry:

  • OTEL_SERVICE_NAME: (required) Name of the service, e.g., klio-server
  • OTEL_RESOURCE_ATTRIBUTES: Comma-separated list of resource attributes (e.g., deployment.environment=production,service.namespace=klio-system)
  • OTEL_RESOURCE_DETECTORS: Comma-separated list of resource detectors from the autodetect package, used to automatically populate resource attributes
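
For illustration, these three variables might be set on a container as in the fragment below. It is a sketch only: the values are taken from the examples in this document, and where the `env` list goes depends on the resource you are configuring.

```yaml
# Illustrative container env fragment; values are examples, not defaults.
env:
  # Required: identifies the emitting service in every trace and metric.
  - name: OTEL_SERVICE_NAME
    value: "klio-server"
  # Static resource attributes, as comma-separated key=value pairs.
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production,service.namespace=klio-system"
  # Detectors that populate further resource attributes automatically.
  - name: OTEL_RESOURCE_DETECTORS
    value: "telemetry.sdk,host,os.type,process.executable.name"
```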

Traces exporter

To enable the traces exporter, set the OTEL_TRACES_EXPORTER environment variable to one of the supported exporters:

  • otlp: OpenTelemetry Protocol (OTLP) exporter
  • console: Console exporter (useful for debugging)
  • none: No-op exporter (disables tracing)

You can define the OTLP protocol using the OTEL_EXPORTER_OTLP_TRACES_PROTOCOL variable, or the general OTEL_EXPORTER_OTLP_PROTOCOL. Supported protocols include:

  • http/protobuf (default)
  • grpc

Additional configuration options for trace exporters can be found in the documentation of the respective exporters.

Metrics exporter

To enable the metrics exporter, set the OTEL_METRICS_EXPORTER environment variable to one of the supported exporters:

  • otlp: OpenTelemetry Protocol (OTLP) exporter
  • prometheus: Prometheus exporter + HTTP server
  • console: Console exporter (useful for debugging)
  • none: No-op exporter (disables metrics)

You can define the OTLP protocol using the OTEL_EXPORTER_OTLP_METRICS_PROTOCOL variable, or the general OTEL_EXPORTER_OTLP_PROTOCOL. Supported protocols include:

  • http/protobuf (default)
  • grpc

Additional configuration options for metrics exporters can be found in the documentation of the respective exporters.

For the Prometheus exporter, you can configure the host and port of the HTTP server using the following environment variables:

  • OTEL_EXPORTER_PROMETHEUS_HOST (default: localhost)
  • OTEL_EXPORTER_PROMETHEUS_PORT (default: 9464)
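
As a sketch of a Prometheus-based setup, the ConfigMap below enables the Prometheus exporter. The ConfigMap name is hypothetical. Because the default host is localhost, the endpoint would only be reachable from inside the pod, so this example assumes you want it scraped from elsewhere in the cluster and binds it to all interfaces.

```yaml
# Hypothetical ConfigMap; the name is an assumption for illustration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: klio-otel-prometheus
data:
  OTEL_SERVICE_NAME: "klio-server"
  # Serve metrics over HTTP instead of pushing them via OTLP.
  OTEL_METRICS_EXPORTER: "prometheus"
  # Listen on all interfaces so a scraper outside the pod can connect.
  OTEL_EXPORTER_PROMETHEUS_HOST: "0.0.0.0"
  # Same as the default port, stated explicitly for clarity.
  OTEL_EXPORTER_PROMETHEUS_PORT: "9464"
```

Prometheus then needs a scrape target pointing at port 9464 of the pod or of a Service in front of it. Consider restricting access with a NetworkPolicy, since the endpoint is no longer limited to localhost.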

Exporters and receivers

The OTLP exporter pushes telemetry to any OTLP-compatible receiver. Common options include:

  • An OpenTelemetry Collector, which can receive OTLP data and fan it out to multiple backends (Prometheus, Jaeger, Grafana, etc.). In Kubernetes, the OpenTelemetry Operator manages collectors via the OpenTelemetryCollector CRD and can expose a stable in-cluster OTLP endpoint for Klio to target.
  • Any backend with native OTLP support.

The Prometheus exporter starts a local HTTP server that Prometheus scrapes directly, with no intermediate collector required.
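
To make the collector option more concrete, here is a minimal sketch of an OpenTelemetryCollector resource for the OpenTelemetry Operator. It assumes the operator's v1beta1 API is installed. The resource name and the use of the debug exporter are illustrative choices, and receiver-side TLS is left out for brevity, so this fragment alone does not match the https endpoints used in the examples elsewhere in this document.

```yaml
# Minimal sketch; assumes the OpenTelemetry Operator (v1beta1 API) is installed.
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  # The operator typically exposes this as a Service named "otel-collector".
  name: otel
spec:
  mode: deployment
  config:
    receivers:
      otlp:
        protocols:
          grpc:
            # OTLP/gRPC listener that Klio would target on port 4317.
            endpoint: 0.0.0.0:4317
    exporters:
      # Writes received telemetry to the collector log; replace with real backends.
      debug: {}
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [debug]
        metrics:
          receivers: [otlp]
          exporters: [debug]
```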

Configuring Klio with OpenTelemetry in Kubernetes

When running in a Kubernetes environment, Klio automatically defines the CONTAINER_NAME, POD_NAME, and NAMESPACE_NAME environment variables. When any of these variables is set, Klio adds the corresponding resource attribute (k8s.container.name, k8s.pod.name, or k8s.namespace.name) to all telemetry data. Each attribute is added independently, so you don't need all three environment variables to be present.

Important

If you have already defined any of these attributes in OTEL_RESOURCE_ATTRIBUTES, Klio will not override them. Only missing attributes will be added from the environment variables. This allows you to customize the values while still benefiting from automatic defaults for any attributes you don't explicitly set.
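
As a small illustration of this precedence, the fragment below sets only the namespace attribute explicitly. The value backup-prod is a made-up example. Under the behavior described above, k8s.namespace.name would keep this value, while k8s.pod.name and k8s.container.name would still be filled in from POD_NAME and CONTAINER_NAME.

```yaml
# Illustrative env fragment; "backup-prod" is a hypothetical value.
env:
  - name: OTEL_RESOURCE_ATTRIBUTES
    # Explicit value wins over the NAMESPACE_NAME-derived default.
    value: "k8s.namespace.name=backup-prod"
```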

Klio server with OpenTelemetry

When deploying a Klio Server, you can configure OpenTelemetry by specifying the necessary settings in the template section of the Server spec:

  1. Set the required environment variables for OpenTelemetry configuration in the server container.
  2. Mount any necessary TLS certificates for secure communication with the OpenTelemetry Collector.

For simpler management, use a ConfigMap to store the OpenTelemetry configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: klio-otel-config
data:
  OTEL_SERVICE_NAME: "klio-server"
  OTEL_RESOURCE_DETECTORS: "telemetry.sdk,host,os.type,process.executable.name"
  OTEL_TRACES_EXPORTER: "otlp"
  OTEL_EXPORTER_OTLP_TRACES_PROTOCOL: "grpc"
  OTEL_EXPORTER_OTLP_TRACES_ENDPOINT: "https://otel-collector:4317"
  OTEL_EXPORTER_OTLP_TRACES_COMPRESSION: "gzip"
  OTEL_EXPORTER_OTLP_TRACES_TIMEOUT: "10000"
  OTEL_EXPORTER_OTLP_TRACES_INSECURE: "false"
  OTEL_EXPORTER_OTLP_TRACES_CERTIFICATE: "/otel/ca.crt"
  OTEL_EXPORTER_OTLP_TRACES_CLIENT_CERTIFICATE: "/otel/tls.crt"
  OTEL_EXPORTER_OTLP_TRACES_CLIENT_KEY: "/otel/tls.key"
  OTEL_METRICS_EXPORTER: "otlp"
  OTEL_METRIC_EXPORT_INTERVAL: "60000"
  OTEL_EXPORTER_OTLP_METRICS_PROTOCOL: "grpc"
  OTEL_EXPORTER_OTLP_METRICS_ENDPOINT: "https://otel-collector:4317"
  OTEL_EXPORTER_OTLP_METRICS_TIMEOUT: "60000"
  OTEL_EXPORTER_OTLP_METRICS_INSECURE: "false"
  OTEL_EXPORTER_OTLP_METRICS_CERTIFICATE: "/otel/ca.crt"
  OTEL_EXPORTER_OTLP_METRICS_CLIENT_CERTIFICATE: "/otel/tls.crt"
  OTEL_EXPORTER_OTLP_METRICS_CLIENT_KEY: "/otel/tls.key"
---
apiVersion: klio.enterprisedb.io/v1alpha1
kind: Server
metadata:
  name: my-klio-server
spec:
  # ... other configuration ...
  template:
    spec:
      containers:
        - name: server
          envFrom:
            - configMapRef:
                name: klio-otel-config
          volumeMounts:
            - mountPath: /otel
              name: otel
      volumes:
        - name: otel
          projected:
            sources:
              - secret:
                  name: otel-collector-tls
                  items:
                    - key: ca.crt
                      path: ca.crt
              - secret:
                  name: otel-client-cert
                  items:
                    - key: tls.crt
                      path: tls.crt
                    - key: tls.key
                      path: tls.key

Klio plugins with OpenTelemetry

When deploying Klio as a CNPG Cluster plugin, configure OpenTelemetry by specifying the necessary environment variables in the containers section of the PluginConfiguration spec. The available container names are:

  • klio-plugin: Main plugin sidecar for backup management
  • klio-restore: Restore operations sidecar

Create a ConfigMap for the shared OpenTelemetry configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-klio-otel-config
data:
  OTEL_RESOURCE_DETECTORS: "telemetry.sdk,host,os.type,process.executable.name"
  OTEL_TRACES_EXPORTER: "otlp"
  OTEL_METRICS_EXPORTER: "otlp"
  OTEL_EXPORTER_OTLP_PROTOCOL: "grpc"
  OTEL_EXPORTER_OTLP_ENDPOINT: "https://otel-collector:4317"
  OTEL_EXPORTER_OTLP_COMPRESSION: "gzip"
  OTEL_EXPORTER_OTLP_TIMEOUT: "10000"
  OTEL_EXPORTER_OTLP_INSECURE: "false"
  OTEL_EXPORTER_OTLP_CERTIFICATE: "/projected/ca.crt"
  OTEL_EXPORTER_OTLP_CLIENT_CERTIFICATE: "/projected/tls.crt"
  OTEL_EXPORTER_OTLP_CLIENT_KEY: "/projected/tls.key"

Configure the PluginConfiguration to inject the environment variables into each sidecar container:

apiVersion: klio.enterprisedb.io/v1alpha1
kind: PluginConfiguration
metadata:
  name: client-config-cluster-example
spec:
  serverAddress: klio.default
  clientSecretName: cluster-example-klio-user
  serverSecretName: klio-server-tls
  clusterName: cluster-example
  containers:
    - name: klio-plugin
      env:
        - name: OTEL_SERVICE_NAME
          value: "klio-plugin"
      envFrom:
        - configMapRef:
            name: cluster-klio-otel-config
    - name: klio-restore
      env:
        - name: OTEL_SERVICE_NAME
          value: "klio-restore"
      envFrom:
        - configMapRef:
            name: cluster-klio-otel-config

Mount the OpenTelemetry certificates using the Cluster's projectedVolumeTemplate. The projected volume is mounted at /projected/ and is accessible to all sidecar containers:

apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: cluster-example
spec:
  instances: 3

  projectedVolumeTemplate:
    sources:
      - secret:
          name: otel-collector-tls
          items:
            - key: ca.crt
              path: ca.crt
      - secret:
          name: otel-client-cert
          items:
            - key: tls.crt
              path: tls.crt
            - key: tls.key
              path: tls.key

  plugins:
    - name: klio.enterprisedb.io
      enabled: true
      parameters:
        pluginConfigurationRef: client-config-cluster-example

  storage:
    size: 10Gi