
Controller metrics

Prometheus metrics exposed by the in-cluster Apoxy controller, the stable subset customers can scrape and alert on, and how to wire them into Prometheus.

The kube-controller Deployment installed by apoxy k8s install exposes a Prometheus /metrics endpoint that covers cert lifecycle, reconciliation, mirror sync, and the controller's uplink to Apoxy. This page lists the metrics covered by the stability contract and shows how to scrape them.

Endpoint

Path: /metrics
Container port: 8083 (named metrics)
Protocol: HTTP (cluster-internal)
Registry: sigs.k8s.io/controller-runtime/pkg/metrics

The metrics port is on the pod, not on the kube-controller Service (the Service exposes only the aggregated APIService on 443). Scrape via a PodMonitor (or equivalent), not a ServiceMonitor.

Scraping with the Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: apoxy-kube-controller
  namespace: apoxy
spec:
  selector:
    matchLabels:
      app: kube-controller
  podMetricsEndpoints:
    - port: metrics
      interval: 30s

Apply with kubectl apply -f in the cluster running the controller. The metrics port is the named container port on the kube-controller Deployment.

For plain Prometheus (no Operator) add a scrape config that targets pods with label app=kube-controller in the apoxy namespace on port 8083.
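A scrape config along these lines is one way to do it. This is a minimal sketch, assuming Prometheus runs inside the same cluster and has RBAC permission to list pods in the apoxy namespace. The job name is illustrative, and the 30s interval is copied from the PodMonitor example rather than required.

```yaml
scrape_configs:
  - job_name: apoxy-kube-controller   # hypothetical job name
    scrape_interval: 30s
    metrics_path: /metrics
    kubernetes_sd_configs:
      # Discover pods, limited to the apoxy namespace.
      - role: pod
        namespaces:
          names:
            - apoxy
    relabel_configs:
      # Keep only pods labelled app=kube-controller.
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: kube-controller
      # Keep only the container port named "metrics" (8083).
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        action: keep
        regex: metrics
      # Carry the pod name over as a label for dashboards.
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Filtering on the port name rather than the number keeps the config working if the port number changes while the name stays the same.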

Stable metrics

These metrics are part of the customer-facing contract. Names and labels won't change without a release note. The endpoint exposes additional debug metrics (see Debug metrics) that have no stability guarantee.

Cert lifecycle

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| apoxy_kube_controller_cert_expiry_seconds | Gauge | (none) | Unix-seconds expiry (NotAfter) of the live upstream client cert. |
| apoxy_kube_controller_cert_reloads_total | Counter | result (success, failure) | Hot-reload attempts after a kubelet projection of the apiz-cert Secret. Failure means the new cert material didn't parse or the chain didn't validate; the controller keeps serving the previous cert. |
| apoxy_kube_controller_cert_renewals_total | Counter | result (success, failure) | Auto-renewal attempts. The controller checks its own cert hourly and re-issues against cosmos when validity drops below 30 days. Failure means cosmos returned an error or the Secret write conflicted with a manual rotation. |
| apoxy_kube_controller_cert_renewal_skipped_total | Counter | (none) | Auto-renewal ticks that found the cert above the renewal threshold. A counter that stays flat for hours means the renewer loop is stuck or hasn't acquired leadership. |

Reconciliation

The controller's per-resource reconcilers run on top of sigs.k8s.io/controller-runtime, which emits the standard metrics below. The controller label is the reconciler name (e.g. gateway, httproute, tunnelnode).

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| controller_runtime_reconcile_total | Counter | controller, result (success, error, requeue, requeue_after) | Per-reconciler outcome counter. |
| controller_runtime_reconcile_errors_total | Counter | controller | Convenience counter for the result="error" slice. |
| controller_runtime_active_workers | Gauge | controller | Workers currently inside Reconcile. Saturation signal. |

Mirror sync

The mirror loop pushes Kubernetes resources (Gateway API, Ingress) from the local cluster up to the Apoxy control plane. Metrics here are prefixed tunnel_mirror_* for historical reasons — the mirror originally lived in the tunnel agent.

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| tunnel_mirror_synced_resources_total | Counter | resource_type | Successful mirror operations. A non-zero rate shows the mirror is moving. |
| tunnel_mirror_sync_errors_total | Counter | resource_type | Mirror sync failures. Alert on a sustained non-zero rate. |
| tunnel_mirror_heartbeat_failures_total | Counter | (none) | Lease renewals the controller failed to write. A sustained increase means the controller is about to lose its shard membership and stop mirroring. |

Uplink

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| tunnel_agent_info | Gauge (always 1) | version, build_date, commit | Build info, for dashboards that pin the controller version. |
| tunnel_connections_active | Gauge | (none) | Live tunnel connections to the Apoxy control plane. 0 means the controller is disconnected. |
| tunnel_connection_reconnects_total | Counter | (none) | Reconnect attempts. Steady state should be zero; a sustained rate indicates flapping. |
| tunnel_connection_failures_total | Counter | reason | Connection failures by reason. |

Alerts

The expressions below are a starting point. Tune thresholds to your environment.

Cert expiring soon

apoxy_kube_controller_cert_expiry_seconds - time() < 14 * 24 * 3600

Fires when the cert has less than 14 days remaining. Run apoxy k8s certs rotate to roll it.
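With the Prometheus Operator, an expression like this can be packaged as a PrometheusRule. The following is a sketch under assumptions: the object name, group name, alert name, for duration, and severity label are all hypothetical, and your Prometheus instance may require extra metadata labels on the object to match its ruleSelector.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: apoxy-kube-controller-cert        # hypothetical name
  namespace: apoxy
spec:
  groups:
    - name: apoxy-kube-controller.cert    # hypothetical group name
      rules:
        - alert: ApoxyControllerCertExpiringSoon   # hypothetical alert name
          # Seconds remaining until NotAfter, compared against 14 days.
          expr: apoxy_kube_controller_cert_expiry_seconds - time() < 14 * 24 * 3600
          # Assumed hold time so a single odd scrape does not fire the alert.
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: Apoxy controller client cert expires in less than 14 days.
```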

Hot-reload failing

rate(apoxy_kube_controller_cert_reloads_total{result="failure"}[10m]) > 0

A non-zero failure rate means the kubelet projected a Secret the controller couldn't load. The previous cert keeps serving, so this is a warning, not an outage — but the next rotation won't take effect until it's resolved.

Auto-renewal failing

rate(apoxy_kube_controller_cert_renewals_total{result="failure"}[1h]) > 0

The controller's hourly auto-renewer is failing to issue a new cert against cosmos. Treat it as a warning at first and escalate if it persists: the live cert keeps working until expiry, but if the failure rate stays non-zero for days the cert will silently expire. Common causes: cosmos unreachable from the cluster, ext_authz revoking the controller's cert mid-renewal, or a Secret write conflict with a manual rotation.
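One way to express the warn-then-escalate behaviour is a pair of rules in a Prometheus rule group. This is only a sketch: the group name, alert names, severity values, the 2d hold, and the wider 2h window on the second rule are assumptions chosen for illustration. The wider window is there so that failures arriving roughly once an hour keep the expression continuously true while the for clause counts down.

```yaml
groups:
  - name: apoxy-kube-controller.cert-renewal   # hypothetical group name
    rules:
      # First signal: any renewal failure in the last hour.
      - alert: ApoxyCertRenewalFailing          # hypothetical alert name
        expr: rate(apoxy_kube_controller_cert_renewals_total{result="failure"}[1h]) > 0
        labels:
          severity: warning
      # Escalation: failures have continued without a gap for two days.
      - alert: ApoxyCertRenewalFailingSustained # hypothetical alert name
        expr: rate(apoxy_kube_controller_cert_renewals_total{result="failure"}[2h]) > 0
        for: 2d
        labels:
          severity: critical
```

The same groups list can be placed under spec.groups of a PrometheusRule when the Prometheus Operator is in use.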

Auto-renewer stuck

increase(apoxy_kube_controller_cert_renewal_skipped_total[6h]) == 0 and on() apoxy_kube_controller_cert_expiry_seconds - time() < 60 * 24 * 3600

The renewer loop hasn't ticked at all in 6 hours even though the live cert has less than 60 days left. Either the renewer goroutine crashed (the controller log will say so) or leader election is broken: a pod that isn't the leader thinks it is, and the pod that should be thinks it isn't. Rare in single-replica installs.

Reconciler erroring

sum by (controller) (rate(controller_runtime_reconcile_errors_total[10m])) > 0.1

Sustained reconcile errors on any controller. The most common cause is a transient cosmos outage; fire as a warning, not a page.

Mirror failing

rate(tunnel_mirror_sync_errors_total[10m]) > 0

The mirror can't push resources to Apoxy. Gateway API changes in the cluster are not reaching the control plane.

Mirror losing shard

rate(tunnel_mirror_heartbeat_failures_total[5m]) > 0

Page-grade. The controller is about to drop out of its shard, which stops the mirror entirely.

Controller disconnected

tunnel_connections_active == 0

Page-grade if sustained for more than 2 minutes. Brief flaps during cosmos rollouts are expected.
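The 2-minute tolerance maps naturally onto a rule's for clause, which holds the alert in pending state until the expression has been true for that long. A minimal sketch of a Prometheus rule group, with a hypothetical group name, alert name, and severity label:

```yaml
groups:
  - name: apoxy-kube-controller.uplink      # hypothetical group name
    rules:
      - alert: ApoxyControllerDisconnected  # hypothetical alert name
        # No live tunnel connections to the Apoxy control plane.
        expr: tunnel_connections_active == 0
        # Ignore brief flaps; fire only after 2 minutes of continuous disconnection.
        for: 2m
        labels:
          severity: page
        annotations:
          summary: Apoxy controller has had no active tunnel connections for 2 minutes.
```

Note that this expression only matches while the series is still being scraped. If the pod itself is gone, the gauge disappears instead of reading 0, so an additional absence or target-down check may be worth considering.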

Debug metrics

The /metrics endpoint also serves lower-level metrics for the tunnel agent (tunnel_packets_*, tunnel_bytes_*, per-protocol breakdowns), SOCKS proxy (socks_*), BFD heartbeats (bfd_*), DNS resolver (dns_*), and the Go runtime / controller-runtime workqueue / client-go REST client. These are useful in support tickets but have no stability guarantee — names, labels, and presence can change between releases. Don't build alerts on them.