# Controller metrics

> Prometheus metrics exposed by the in-cluster Apoxy controller, the stable subset customers can scrape and alert on, and how to wire them into Prometheus.

The `kube-controller` Deployment installed by `apoxy k8s install` exposes a Prometheus `/metrics` endpoint that covers cert lifecycle, reconciliation, mirror sync, and the controller's uplink to Apoxy. This page lists the metrics covered by the stability contract and shows how to scrape them.

## Endpoint

| | |
|---|---|
| Path | `/metrics` |
| Container port | `8083` (named `metrics`) |
| Protocol | HTTP (cluster-internal) |
| Registry | `sigs.k8s.io/controller-runtime/pkg/metrics` |

The metrics port is on the pod, not on the `kube-controller` Service (the Service exposes only the aggregated APIService on `443`). Scrape via a `PodMonitor` (or equivalent), not a `ServiceMonitor`.

## Scraping with the Prometheus Operator

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: apoxy-kube-controller
  namespace: apoxy
spec:
  selector:
    matchLabels:
      app: kube-controller
  podMetricsEndpoints:
    - port: metrics
      interval: 30s
```

Save the manifest and apply it with `kubectl apply -f <file>` in the cluster running the controller. `port: metrics` refers to the named container port on the `kube-controller` Deployment.

For plain Prometheus (no Operator) add a scrape config that targets pods with label `app=kube-controller` in the `apoxy` namespace on port `8083`.
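
The scrape config described above might look like the following sketch. It assumes Prometheus runs inside the cluster with RBAC permission to list pods in the `apoxy` namespace. The job name and the `namespace`/`pod` target labels are illustrative choices, not something the controller requires.

```yaml
scrape_configs:
  - job_name: apoxy-kube-controller   # hypothetical job name
    scrape_interval: 30s
    kubernetes_sd_configs:
      # Discover pods, restricted to the apoxy namespace.
      - role: pod
        namespaces:
          names:
            - apoxy
    relabel_configs:
      # Keep only pods labelled app=kube-controller.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: kube-controller
        action: keep
      # Keep only the container port named "metrics" (8083).
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: metrics
        action: keep
      # Optional: carry namespace and pod name onto the scraped series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Filtering on the port name rather than the number keeps the config working if the port number ever changes while the name stays `metrics`.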

## Stable metrics

These metrics are part of the customer-facing contract. Names and labels won't change without a release note. The endpoint exposes additional debug metrics (see [Debug metrics](#debug-metrics)) that have no stability guarantee.

### Cert lifecycle

| Metric | Type | Labels | Description |
|---|---|---|---|
| `apoxy_kube_controller_cert_expiry_seconds` | Gauge | — | Unix-seconds expiry (`NotAfter`) of the live upstream client cert. |
| `apoxy_kube_controller_cert_reloads_total` | Counter | `result` (`success`, `failure`) | Hot-reload attempts after a kubelet projection of the `apiz-cert` Secret. Failure means the new cert material didn't parse or the chain didn't validate; the controller keeps serving the previous cert. |
| `apoxy_kube_controller_cert_renewals_total` | Counter | `result` (`success`, `failure`) | Auto-renewal attempts. The controller checks its own cert hourly and re-issues against cosmos when validity drops below 30 days. Failure means cosmos returned an error or the Secret write conflicted with a manual rotation. |
| `apoxy_kube_controller_cert_renewal_skipped_total` | Counter | — | Auto-renewal ticks that found the cert above the renewal threshold. A counter that stays flat for hours means the renewer loop is stuck or hasn't acquired leadership. |

### Reconciliation

The controller's per-resource reconcilers run on top of `sigs.k8s.io/controller-runtime`, which emits the standard metrics below. The `controller` label is the reconciler name (e.g. `gateway`, `httproute`, `tunnelnode`).

| Metric | Type | Labels | Description |
|---|---|---|---|
| `controller_runtime_reconcile_total` | Counter | `controller`, `result` (`success`, `error`, `requeue`, `requeue_after`) | Per-reconciler outcome counter. |
| `controller_runtime_reconcile_errors_total` | Counter | `controller` | Convenience counter for the `result="error"` slice. |
| `controller_runtime_active_workers` | Gauge | `controller` | Workers currently inside `Reconcile`. Saturation signal. |
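
Because `controller_runtime_reconcile_total` carries a `result` label, a per-reconciler error ratio can be derived from it. The rule file below is a sketch for plain Prometheus: the group name and the recorded series name `controller:reconcile_error_ratio:rate10m` are hypothetical, and the 10-minute window is an assumption.

```yaml
groups:
  - name: apoxy-kube-controller-recording   # hypothetical group name
    rules:
      # Fraction of reconciles that ended in an error, per reconciler.
      # Produces no sample for a reconciler that has no result="error" series yet.
      - record: controller:reconcile_error_ratio:rate10m
        expr: |
          sum by (controller) (rate(controller_runtime_reconcile_total{result="error"}[10m]))
          /
          sum by (controller) (rate(controller_runtime_reconcile_total[10m]))
```

A ratio is easier to compare across reconcilers than a raw error rate, because busy reconcilers naturally produce more errors per second.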

### Mirror sync

The mirror loop pushes Kubernetes resources (Gateway API, Ingress) from the local cluster up to the Apoxy control plane. Metrics here are prefixed `tunnel_mirror_*` for historical reasons — the mirror originally lived in the tunnel agent.

| Metric | Type | Labels | Description |
|---|---|---|---|
| `tunnel_mirror_synced_resources_total` | Counter | `resource_type` | Successful mirror operations. A non-zero rate shows the mirror is moving. |
| `tunnel_mirror_sync_errors_total` | Counter | `resource_type` | Mirror sync failures. Alert on sustained non-zero rate. |
| `tunnel_mirror_heartbeat_failures_total` | Counter | — | Lease renewals the controller failed to write. Sustained increase means the controller is about to lose its shard membership and stop mirroring. |

### Uplink

| Metric | Type | Labels | Description |
|---|---|---|---|
| `tunnel_agent_info` | Gauge (`1`) | `version`, `build_date`, `commit` | Build info — for dashboards that pin the controller version. |
| `tunnel_connections_active` | Gauge | — | Live tunnel connections to the Apoxy control plane. `0` means the controller is disconnected. |
| `tunnel_connection_reconnects_total` | Counter | — | Reconnect attempts. Steady-state should be zero; sustained rate indicates flapping. |
| `tunnel_connection_failures_total` | Counter | `reason` | Connection failures by reason. |

## Recommended alerts

These rules are a starting point. Tune thresholds to your environment.

### Cert expiring soon

```
apoxy_kube_controller_cert_expiry_seconds - time() < 14 * 24 * 3600
```

Fire when the cert has less than 14 days remaining. Use [`apoxy k8s certs rotate`](/docs/guides/rotating-kube-controller-cert.md) to roll it.

### Hot-reload failing

```
rate(apoxy_kube_controller_cert_reloads_total{result="failure"}[10m]) > 0
```

A non-zero failure rate means the kubelet projected a Secret the controller couldn't load. The previous cert keeps serving, so this is a warning, not an outage — but the next rotation won't take effect until it's resolved.

### Auto-renewal failing

```
rate(apoxy_kube_controller_cert_renewals_total{result="failure"}[1h]) > 0
```

The controller's hourly auto-renewer is failing to issue a new cert against cosmos. Treat the first occurrence as a warning and escalate if it persists: the live cert keeps working until expiry, but if the failure rate stays non-zero for days, the cert will silently expire. Common causes: cosmos unreachable from the cluster, ext_authz revoking the controller's cert mid-renewal, or a Secret write conflict with a manual rotation.

### Auto-renewer stuck

```
increase(apoxy_kube_controller_cert_renewal_skipped_total[6h]) == 0
  and on() apoxy_kube_controller_cert_expiry_seconds - time() < 60 * 24 * 3600
```

The renewer loop hasn't ticked at all in 6 hours despite the live cert having less than 60 days left. Either the renewer goroutine crashed (the controller log will show it) or leader election is broken: the wrong pod thinks it's leader and the right pod thinks it isn't. Rare in single-replica installs.
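
With the Prometheus Operator, the four cert alerts above can be packaged as a `PrometheusRule`. This is a sketch, not a shipped manifest: the object name, alert names, `for` durations, and `severity` values are assumptions, and your Prometheus resource's `ruleSelector` may require extra labels on the object before it is picked up.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: apoxy-kube-controller-certs   # hypothetical name
  namespace: apoxy
spec:
  groups:
    - name: apoxy-kube-controller-certs
      rules:
        - alert: ApoxyCertExpiringSoon
          expr: apoxy_kube_controller_cert_expiry_seconds - time() < 14 * 24 * 3600
          for: 1h            # assumption: ignore brief blips around a rotation
          labels:
            severity: warning
        - alert: ApoxyCertHotReloadFailing
          expr: rate(apoxy_kube_controller_cert_reloads_total{result="failure"}[10m]) > 0
          labels:
            severity: warning
        - alert: ApoxyCertAutoRenewalFailing
          expr: rate(apoxy_kube_controller_cert_renewals_total{result="failure"}[1h]) > 0
          labels:
            severity: warning
        - alert: ApoxyCertAutoRenewerStuck
          expr: |
            increase(apoxy_kube_controller_cert_renewal_skipped_total[6h]) == 0
              and on() apoxy_kube_controller_cert_expiry_seconds - time() < 60 * 24 * 3600
          labels:
            severity: warning
```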

### Reconciler erroring

```
sum by (controller) (rate(controller_runtime_reconcile_errors_total[10m])) > 0.1
```

Sustained reconcile errors on any controller. The most common cause is a transient cosmos outage; fire as a warning, not a page.

### Mirror failing

```
rate(tunnel_mirror_sync_errors_total[10m]) > 0
```

The mirror can't push resources to Apoxy. Gateway API changes in the cluster are not reaching the control plane.

### Mirror losing shard

```
rate(tunnel_mirror_heartbeat_failures_total[5m]) > 0
```

Page-grade. The controller is about to drop out of its shard, which stops the mirror entirely.

### Controller disconnected

```
tunnel_connections_active == 0
```

Page-grade if sustained for more than 2 minutes. Brief flaps during cosmos rollouts are expected.
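
The reconciler, mirror, and uplink alerts above could be expressed as a `PrometheusRule` along these lines. The object name, alert names, and `severity` label values are assumptions. So are the `for` durations, except `for: 2m` on the disconnect alert, which encodes the 2-minute guidance above.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: apoxy-kube-controller-health   # hypothetical name
  namespace: apoxy
spec:
  groups:
    - name: apoxy-kube-controller-health
      rules:
        - alert: ApoxyReconcilerErroring
          expr: sum by (controller) (rate(controller_runtime_reconcile_errors_total[10m])) > 0.1
          for: 10m           # assumption: require the error rate to be sustained
          labels:
            severity: warning
        - alert: ApoxyMirrorFailing
          expr: rate(tunnel_mirror_sync_errors_total[10m]) > 0
          for: 10m           # assumption
          labels:
            severity: warning
        - alert: ApoxyMirrorLosingShard
          expr: rate(tunnel_mirror_heartbeat_failures_total[5m]) > 0
          labels:
            severity: page
        - alert: ApoxyControllerDisconnected
          expr: tunnel_connections_active == 0
          for: 2m            # tolerate brief flaps during cosmos rollouts
          labels:
            severity: page
```

Routing on the `severity` label in Alertmanager keeps the warning-versus-page distinction out of the expressions themselves.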

## Debug metrics

The `/metrics` endpoint also serves lower-level metrics for the tunnel agent (`tunnel_packets_*`, `tunnel_bytes_*`, per-protocol breakdowns), SOCKS proxy (`socks_*`), BFD heartbeats (`bfd_*`), DNS resolver (`dns_*`), and the Go runtime / controller-runtime workqueue / client-go REST client. These are useful in support tickets but have **no stability guarantee** — names, labels, and presence can change between releases. Don't build alerts on them.
