Controller metrics

The kube-controller Deployment installed by apoxy k8s install exposes a Prometheus /metrics endpoint that covers cert lifecycle, reconciliation, mirror sync, and the controller's uplink to Apoxy. This page lists the metrics under stability contract and shows how to scrape them.

Endpoint


Path	`/metrics`
Container port	`8083` (named `metrics`)
Protocol	HTTP (cluster-internal)
Registry	`sigs.k8s.io/controller-runtime/pkg/metrics`

The metrics port is on the pod, not on the kube-controller Service (the Service exposes only the aggregated APIService on 443). Scrape via a PodMonitor (or equivalent), not a ServiceMonitor.

Scraping with the Prometheus Operator

$terminalYAML

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: apoxy-kube-controller
  namespace: apoxy
spec:
  selector:
    matchLabels:
      app: kube-controller
  podMetricsEndpoints:
    - port: metrics
      interval: 30s

Apply with kubectl apply -f in the cluster running the controller. The metrics port is the named container port on the kube-controller Deployment.

For plain Prometheus (no Operator) add a scrape config that targets pods with label app=kube-controller in the apoxy namespace on port 8083.

Stable metrics

These metrics are part of the customer-facing contract. Names and labels won't change without a release note. The endpoint exposes additional debug metrics (see Debug metrics) that have no stability guarantee.

Cert lifecycle

Metric	Type	Labels	Description
`apoxy_kube_controller_cert_expiry_seconds`	Gauge	-	Unix-seconds expiry (`NotAfter`) of the live upstream client cert.
`apoxy_kube_controller_cert_reloads_total`	Counter	`result` (`success`, `failure`)	Hot-reload attempts after a kubelet projection of the `apiz-cert` Secret. Failure means the new cert material didn't parse or the chain didn't validate; the controller keeps serving the previous cert.
`apoxy_kube_controller_cert_renewals_total`	Counter	`result` (`success`, `failure`)	Auto-renewal attempts. The controller checks its own cert hourly and re-issues against cosmos when validity drops below 30 days. Failure means cosmos returned an error or the Secret write conflicted with a manual rotation.
`apoxy_kube_controller_cert_renewal_skipped_total`	Counter	-	Auto-renewal ticks that found the cert above the renewal threshold. A flat rate over hours means the renewer loop is stuck or hasn't acquired leadership.

Reconciliation

The controller's per-resource reconcilers run on top of sigs.k8s.io/controller-runtime, which emits the standard metrics below. The controller label is the reconciler name (e.g. gateway, httproute, tunnelnode).

Metric	Type	Labels	Description
`controller_runtime_reconcile_total`	Counter	`controller`, `result` (`success`, `error`, `requeue`, `requeue_after`)	Per-reconciler outcome counter.
`controller_runtime_reconcile_errors_total`	Counter	`controller`	Convenience counter for the `result="error"` slice.
`controller_runtime_active_workers`	Gauge	`controller`	Workers currently inside `Reconcile`. Saturation signal.

Mirror sync

The mirror loop pushes Kubernetes resources (Gateway API, Ingress) from the local cluster up to the Apoxy control plane. Metrics here are prefixed tunnel_mirror_* for historical reasons - the mirror originally lived in the tunnel agent.

Metric	Type	Labels	Description
`tunnel_mirror_synced_resources_total`	Counter	`resource_type`	Successful mirror operations. A flat rate proves the mirror is moving.
`tunnel_mirror_sync_errors_total`	Counter	`resource_type`	Mirror sync failures. Alert on sustained non-zero rate.
`tunnel_mirror_heartbeat_failures_total`	Counter	-	Lease renewals the controller failed to write. Sustained increase means the controller is about to lose its shard membership and stop mirroring.

Uplink

Metric	Type	Labels	Description
`tunnel_agent_info`	Gauge (`1`)	`version`, `build_date`, `commit`	Build info - for dashboards that pin the controller version.
`tunnel_connections_active`	Gauge	-	Live tunnel connections to the Apoxy control plane. `0` means the controller is disconnected.
`tunnel_connection_reconnects_total`	Counter	-	Reconnect attempts. Steady-state should be zero; sustained rate indicates flapping.
`tunnel_connection_failures_total`	Counter	`reason`	Connection failures by reason.

Recommended alerts

Starting point. Tune thresholds to your environment.

Cert expiring soon

$terminalTXT

apoxy_kube_controller_cert_expiry_seconds - time() < 14 * 24 * 3600

Fire when the cert has less than 14 days remaining. Use apoxy k8s certs rotate to roll it.

Hot-reload failing

$terminalTXT

rate(apoxy_kube_controller_cert_reloads_total{result="failure"}[10m]) > 0

A non-zero failure rate means the kubelet projected a Secret the controller couldn't load. The previous cert keeps serving, so this is a warning, not an outage - but the next rotation won't take effect until it's resolved.

Auto-renewal failing

$terminalTXT

rate(apoxy_kube_controller_cert_renewals_total{result="failure"}[1h]) > 0

The controller's hourly auto-renewer is failing to issue a new cert against cosmos. Fires as a warning the first time, then escalates: the live cert keeps working until expiry, but if this stays non-zero for days you'll silently expire. Common causes: cosmos unreachable from the cluster, ext_authz revoking the controller's cert mid-renewal, or a Secret write conflict with a manual rotation.

Auto-renewer stuck

$terminalTXT

increase(apoxy_kube_controller_cert_renewal_skipped_total[6h]) == 0
  and on() apoxy_kube_controller_cert_expiry_seconds - time() < 60 * 24 * 3600

The renewer loop hasn't ticked at all in 6 hours despite the live cert having less than 60 days left. Either the renewer goroutine crashed (controller log will say) or leader election is broken - the second pod thinks it's leader and the right pod thinks it isn't. Rare in single-replica installs.

Reconciler erroring

$terminalTXT

sum by (controller) (rate(controller_runtime_reconcile_errors_total[10m])) > 0.1

Sustained reconcile errors on any controller. The most common cause is a transient cosmos outage; fire as a warning, not a page.

Mirror failing

$terminalTXT

rate(tunnel_mirror_sync_errors_total[10m]) > 0

The mirror can't push resources to Apoxy. Gateway API changes in the cluster are not reaching the control plane.

Mirror losing shard

$terminalTXT

rate(tunnel_mirror_heartbeat_failures_total[5m]) > 0

Page-grade. The controller is about to drop out of its shard, which stops the mirror entirely.

Controller disconnected

$terminalTXT

tunnel_connections_active == 0

Page-grade if sustained > 2 minutes. Brief flaps during cosmos rollouts are expected.

Debug metrics

The /metrics endpoint also serves lower-level metrics for the tunnel agent (tunnel_packets_*, tunnel_bytes_*, per-protocol breakdowns), SOCKS proxy (socks_*), BFD heartbeats (bfd_*), DNS resolver (dns_*), and the Go runtime / controller-runtime workqueue / client-go REST client. These are useful in support tickets but have no stability guarantee - names, labels, and presence can change between releases. Don't build alerts on them.