Controller metrics
Prometheus metrics exposed by the in-cluster Apoxy controller, the stable subset customers can scrape and alert on, and how to wire them into Prometheus.
The kube-controller Deployment installed by apoxy k8s install exposes a Prometheus /metrics endpoint that covers cert lifecycle, reconciliation, mirror sync, and the controller's uplink to Apoxy. This page lists the metrics covered by the stability contract and shows how to scrape them.
Endpoint
| Property | Value |
|---|---|
| Path | /metrics |
| Container port | 8083 (named metrics) |
| Protocol | HTTP (cluster-internal) |
| Registry | sigs.k8s.io/controller-runtime/pkg/metrics |
The metrics port is on the pod, not on the kube-controller Service (the Service exposes only the aggregated APIService on 443). Scrape via a PodMonitor (or equivalent), not a ServiceMonitor.
Scraping with the Prometheus Operator
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: apoxy-kube-controller
  namespace: apoxy
spec:
  selector:
    matchLabels:
      app: kube-controller
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
```

Apply with kubectl apply -f in the cluster running the controller. The metrics port is the named container port on the kube-controller Deployment.
For plain Prometheus (no Operator), add a scrape config that targets pods with the label app=kube-controller in the apoxy namespace on port 8083.
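A scrape job along those lines might look like the sketch below. The job name is a placeholder and the 30s interval simply mirrors the PodMonitor example; the relabel rules assume the pod label and container port listed in the Endpoint table. It uses Prometheus's standard Kubernetes pod service discovery, so the Prometheus service account needs permission to list and watch pods in the apoxy namespace.

```yaml
# Sketch only: job_name is a placeholder, not something Apoxy ships.
scrape_configs:
  - job_name: apoxy-kube-controller
    metrics_path: /metrics
    scrape_interval: 30s
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - apoxy
    relabel_configs:
      # Keep only pods labeled app=kube-controller.
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: kube-controller
      # Pod discovery yields one target per declared container port;
      # keep only the metrics port.
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: "8083"
```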
Stable metrics
These metrics are part of the customer-facing contract. Names and labels won't change without a release note. The endpoint exposes additional debug metrics (see Debug metrics) that have no stability guarantee.
Cert lifecycle
| Metric | Type | Labels | Description |
|---|---|---|---|
| apoxy_kube_controller_cert_expiry_seconds | Gauge | none | Expiry (NotAfter) of the live upstream client cert, as a Unix timestamp in seconds. |
| apoxy_kube_controller_cert_reloads_total | Counter | result (success, failure) | Hot-reload attempts after a kubelet projection of the apiz-cert Secret. Failure means the new cert material didn't parse or the chain didn't validate; the controller keeps serving the previous cert. |
| apoxy_kube_controller_cert_renewals_total | Counter | result (success, failure) | Auto-renewal attempts. The controller checks its own cert hourly and re-issues against cosmos when validity drops below 30 days. Failure means cosmos returned an error or the Secret write conflicted with a manual rotation. |
| apoxy_kube_controller_cert_renewal_skipped_total | Counter | none | Auto-renewal ticks that found the cert above the renewal threshold. A counter that stays flat for hours means the renewer loop is stuck or hasn't acquired leadership. |
Reconciliation
The controller's per-resource reconcilers run on top of sigs.k8s.io/controller-runtime, which emits the standard metrics below. The controller label is the reconciler name (e.g. gateway, httproute, tunnelnode).
| Metric | Type | Labels | Description |
|---|---|---|---|
| controller_runtime_reconcile_total | Counter | controller, result (success, error, requeue, requeue_after) | Per-reconciler outcome counter. |
| controller_runtime_reconcile_errors_total | Counter | controller | Convenience counter for the result="error" slice. |
| controller_runtime_active_workers | Gauge | controller | Workers currently inside Reconcile. Saturation signal. |
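As one way to use these counters on a dashboard, a recording rule can precompute a per-reconciler error ratio. The following is a sketch in the plain Prometheus rule-file format; the group name and record name are hypothetical, and the 5m window is an arbitrary choice.

```yaml
groups:
  - name: apoxy-kube-controller-reconcile   # hypothetical group name
    rules:
      # Fraction of reconciles that ended in error, per reconciler.
      # The denominator sums over every value of the result label.
      - record: controller:controller_runtime_reconcile_errors:ratio_rate5m   # hypothetical record name
        expr: |
          sum by (controller) (rate(controller_runtime_reconcile_errors_total[5m]))
          /
          sum by (controller) (rate(controller_runtime_reconcile_total[5m]))
```

The ratio is NaN for a reconciler that handled no requests in the window, so treat gaps as "idle" rather than "healthy".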
Mirror sync
The mirror loop pushes Kubernetes resources (Gateway API, Ingress) from the local cluster up to the Apoxy control plane. Metrics here are prefixed tunnel_mirror_* for historical reasons — the mirror originally lived in the tunnel agent.
| Metric | Type | Labels | Description |
|---|---|---|---|
| tunnel_mirror_synced_resources_total | Counter | resource_type | Successful mirror operations. A non-zero rate shows the mirror is moving. |
| tunnel_mirror_sync_errors_total | Counter | resource_type | Mirror sync failures. Alert on a sustained non-zero rate. |
| tunnel_mirror_heartbeat_failures_total | Counter | none | Lease renewals the controller failed to write. A sustained increase means the controller is about to lose its shard membership and stop mirroring. |
Uplink
| Metric | Type | Labels | Description |
|---|---|---|---|
| tunnel_agent_info | Gauge (1) | version, build_date, commit | Build info, for dashboards that pin the controller version. |
| tunnel_connections_active | Gauge | none | Live tunnel connections to the Apoxy control plane. 0 means the controller is disconnected. |
| tunnel_connection_reconnects_total | Counter | none | Reconnect attempts. Steady-state should be zero; a sustained rate indicates flapping. |
| tunnel_connection_failures_total | Counter | reason | Connection failures by reason. |
Recommended alerts
These alerts are a starting point. Tune thresholds to your environment.
Cert expiring soon
```promql
apoxy_kube_controller_cert_expiry_seconds - time() < 14 * 24 * 3600
```

Fire when the cert has less than 14 days remaining. Use apoxy k8s certs rotate to roll it.
Hot-reload failing
```promql
rate(apoxy_kube_controller_cert_reloads_total{result="failure"}[10m]) > 0
```

A non-zero failure rate means the kubelet projected a Secret the controller couldn't load. The previous cert keeps serving, so this is a warning, not an outage, but the next rotation won't take effect until it's resolved.
Auto-renewal failing
```promql
rate(apoxy_kube_controller_cert_renewals_total{result="failure"}[1h]) > 0
```

The controller's hourly auto-renewer is failing to issue a new cert against cosmos. Fire as a warning the first time, then escalate: the live cert keeps working until expiry, but if this stays non-zero for days the cert will silently expire. Common causes: cosmos unreachable from the cluster, ext_authz revoking the controller's cert mid-renewal, or a Secret write conflict with a manual rotation.
Auto-renewer stuck
```promql
increase(apoxy_kube_controller_cert_renewal_skipped_total[6h]) == 0
and on() apoxy_kube_controller_cert_expiry_seconds - time() < 60 * 24 * 3600
```

The renewer loop hasn't ticked at all in 6 hours despite the live cert having less than 60 days left. Either the renewer goroutine crashed (the controller log will say) or leader election is broken: the wrong pod thinks it's the leader and the right pod thinks it isn't. Rare in single-replica installs.
Reconciler erroring
```promql
sum by (controller) (rate(controller_runtime_reconcile_errors_total[10m])) > 0.1
```

Sustained reconcile errors on any controller. The most common cause is a transient cosmos outage; fire as a warning, not a page.
Mirror failing
```promql
rate(tunnel_mirror_sync_errors_total[10m]) > 0
```

The mirror can't push resources to Apoxy. Gateway API changes in the cluster are not reaching the control plane.
Mirror losing shard
```promql
rate(tunnel_mirror_heartbeat_failures_total[5m]) > 0
```

Page-grade. The controller is about to drop out of its shard, which stops the mirror entirely.
Controller disconnected
```promql
tunnel_connections_active == 0
```

Page-grade if sustained for more than 2 minutes. Brief flaps during cosmos rollouts are expected.
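For clusters running the Prometheus Operator, the expressions in this section can be packaged as a PrometheusRule. The sketch below wraps two of them; the resource name, alert names, severity values, and annotation text are all hypothetical, and the 2m hold on the disconnect alert follows the guidance above. Depending on how your Prometheus resource selects rules, the object may also need matching labels.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: apoxy-kube-controller-alerts   # hypothetical name
  namespace: apoxy
spec:
  groups:
    - name: apoxy-kube-controller
      rules:
        - alert: ApoxyCertExpiringSoon           # hypothetical alert name
          expr: apoxy_kube_controller_cert_expiry_seconds - time() < 14 * 24 * 3600
          labels:
            severity: warning                    # hypothetical severity scheme
          annotations:
            summary: Controller cert has less than 14 days remaining.
        - alert: ApoxyControllerDisconnected     # hypothetical alert name
          expr: tunnel_connections_active == 0
          for: 2m                                # ride out brief flaps during rollouts
          labels:
            severity: page
          annotations:
            summary: Controller has no live tunnel connections to Apoxy.
```

The remaining expressions follow the same shape: one rule per alert, with `for` set to however long the condition should hold before firing.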
Debug metrics
The /metrics endpoint also serves lower-level metrics for the tunnel agent (tunnel_packets_*, tunnel_bytes_*, per-protocol breakdowns), SOCKS proxy (socks_*), BFD heartbeats (bfd_*), DNS resolver (dns_*), and the Go runtime / controller-runtime workqueue / client-go REST client. These are useful in support tickets but have no stability guarantee — names, labels, and presence can change between releases. Don't build alerts on them.