Prometheus Setup

Attune relies on Prometheus for historical CPU and memory usage data. This guide covers which metrics are required, how to configure the Prometheus address, and how to verify the integration is working.

Required Prometheus metrics

The operator queries these metrics, all exposed by cAdvisor (built into the kubelet):

| Metric | Query | What it measures |
| --- | --- | --- |
| container_cpu_usage_seconds_total | rate(...[5m]) | CPU cores consumed per container |
| container_memory_working_set_bytes | instant | Memory actively used per container |
| container_cpu_cfs_throttled_periods_total | rate(...[5m]) / rate(cfs_periods[5m]) | CPU throttle ratio (safety monitor) |
| container_cpu_cfs_periods_total | used in throttle ratio denominator | Total CPU scheduling periods |

The first two metrics are required for recommendations. The CFS throttle metrics are used by the safety monitor when autoRevert: true (default) to detect CPU under-provisioning after a resize. If these metrics are missing, throttle detection is silently skipped.
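
As a rough illustration of the throttle ratio in the table above, the sketch below computes it from two samples of each counter. Dividing two `rate()` values over the same window is approximately the ratio of the counter increases. The function name and the handling of counter resets are assumptions for illustration, not the operator's actual code.

```python
from typing import Optional


def throttle_ratio(
    throttled_start: float,
    throttled_end: float,
    periods_start: float,
    periods_end: float,
) -> Optional[float]:
    """Fraction of CFS periods in which the container was throttled.

    Inputs are samples of container_cpu_cfs_throttled_periods_total and
    container_cpu_cfs_periods_total at the start and end of one window.
    """
    delta_periods = periods_end - periods_start
    delta_throttled = throttled_end - throttled_start
    # No periods observed, or a counter reset: no ratio can be computed.
    if delta_periods <= 0 or delta_throttled < 0:
        return None
    return delta_throttled / delta_periods


if __name__ == "__main__":
    # 30 throttled periods out of 300 scheduling periods in the window.
    print(throttle_ratio(100, 130, 1000, 1300))
```

A `None` result here mirrors the case where the CFS metrics are missing and throttle detection is skipped.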

These metrics are available out of the box in any Prometheus installation that scrapes the kubelet's /metrics/cadvisor endpoint. No additional exporters or recording rules are needed.

Note

The queries filter by namespace, pod (regex prefix match), and container name. If your Prometheus configuration renames or drops these labels, the queries will return empty results.
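
To make the label filter concrete, here is a minimal sketch that builds a CPU usage query of the shape described above. The function name, the `container` label name, and the exact selector layout are assumptions; the operator's real query may differ. The pod prefix is assumed to contain no regex metacharacters.

```python
def cpu_usage_query(namespace: str, pod_prefix: str, container: str,
                    window: str = "5m") -> str:
    """Build a PromQL rate() query filtered by namespace, pod prefix, container."""
    selector = (
        f'namespace="{namespace}",'
        f'pod=~"{pod_prefix}.*",'      # regex prefix match on the pod name
        f'container="{container}"'
    )
    # Doubled braces emit literal { and } around the label selector.
    return f"rate(container_cpu_usage_seconds_total{{{selector}}}[{window}])"


if __name__ == "__main__":
    print(cpu_usage_query("production", "web", "app"))
```

If any of these three labels is missing from the scraped series, a selector like this matches nothing, which is why relabeling leads to empty results.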

Prometheus address resolution

The operator resolves the Prometheus address in this order:

flowchart TD
    A[Policy spec<br/>metricsSource.prometheus.address] -->|set?| Z[Use it]
    A -->|not set| B{Namespace defaults exist?}
    B -->|yes| C[AttuneNamespaceDefaults<br/>metricsSource.prometheus.address]
    C --> D{Address set?}
    D -->|yes| Z
    D -->|no| G[Auto-discovery:<br/>Prometheus Operator CRD]
    B -->|no| E[AttuneDefaults<br/>metricsSource.prometheus.address]
    E --> F{Address set?}
    F -->|yes| Z
    F -->|no| G
    G -->|found?| Z
    G -->|not found| H[Auto-discovery:<br/>well-known service names]
    H -->|found?| Z
    H -->|not found| I[PrometheusUnavailable<br/>condition set]
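
The flowchart can be read as the following minimal sketch. It is an illustration only: the function name, the use of plain dicts for the defaults objects, and the two discovery callbacks are hypothetical stand-ins, not the operator's API.

```python
from typing import Callable, Optional


def resolve_address(
    policy_address: Optional[str],
    namespace_defaults: Optional[dict],
    cluster_defaults: Optional[dict],
    discover_operator_crd: Callable[[], Optional[str]],
    discover_well_known: Callable[[], Optional[str]],
) -> Optional[str]:
    """Return the Prometheus address, or None (PrometheusUnavailable)."""
    # 1. A policy-level address always wins.
    if policy_address:
        return policy_address
    # 2/3. Only one defaults source is consulted per namespace: namespace
    # defaults shadow cluster defaults whenever the object exists.
    if namespace_defaults is not None:
        defaults = namespace_defaults
    else:
        defaults = cluster_defaults
    if defaults and defaults.get("address"):
        return defaults["address"]
    # 4/5. Auto-discovery: Prometheus Operator CRD, then well-known services.
    return discover_operator_crd() or discover_well_known()


if __name__ == "__main__":
    # Namespace defaults exist but set no address: cluster defaults are skipped.
    print(resolve_address(None, {}, {"address": "http://c:80"},
                          lambda: None, lambda: "http://wk:80"))
```

Passing an empty dict for the namespace defaults shows the shadowing behavior: the cluster-wide address is ignored and resolution moves on to auto-discovery.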

1. Policy-level address (highest priority)

spec:
  metricsSource:
    prometheus:
      address: http://prometheus-server.monitoring:80

Use this when different namespaces use different Prometheus instances.

If you configure metricsSource.prometheus.bearerTokenSecret, the Secret must live in the same namespace as the AttunePolicy. Cross-namespace Secret references are rejected.

Use an in-cluster address

The operator validates metricsSource.prometheus.address to block loopback and cloud metadata endpoints. http://127.0.0.1:9090, http://[::1]:9090, http://169.254.169.254/..., and metadata hostnames are rejected. Do not point a policy at a local port-forward or a workstation URL. Use a Service DNS name or ClusterIP that the operator can reach from inside the cluster, such as http://prometheus-server.monitoring:80. Private cluster IPs are allowed.
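
The kind of check described above can be approximated with the standard library. This is a sketch under assumptions: the function name and the metadata hostname set are illustrative, and rejecting `localhost` and the whole link-local range (rather than only 169.254.169.254) is a broader rule than the text states.

```python
import ipaddress
from urllib.parse import urlsplit

# Assumption: an example metadata hostname; the operator's real list is not shown here.
METADATA_HOSTS = {"metadata.google.internal"}


def is_allowed_address(url: str) -> bool:
    """Reject loopback and metadata-style targets; allow in-cluster names and IPs."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.scheme not in ("http", "https") or not host:
        return False
    if host == "localhost" or host in METADATA_HOSTS:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return True  # Not an IP literal: treat as a DNS name such as a Service.
    # 127.0.0.0/8 and ::1 are loopback; 169.254.0.0/16 is link-local.
    return not (ip.is_loopback or ip.is_link_local)


if __name__ == "__main__":
    for candidate in (
        "http://prometheus-server.monitoring:80",
        "http://10.96.0.10:9090",
        "http://127.0.0.1:9090",
        "http://[::1]:9090",
        "http://169.254.169.254/latest",
    ):
        print(candidate, is_allowed_address(candidate))
```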

2. Namespace defaults

apiVersion: attune.io/v1alpha1
kind: AttuneNamespaceDefaults
metadata:
  name: team-defaults
  namespace: production
spec:
  metricsSource:
    prometheus:
      address: http://prometheus-server.monitoring:80

Policies that omit metricsSource.prometheus.address inherit from this first. Use namespace defaults when different teams or environments need different Prometheus backends.

Because the controller resolves a single defaults source per namespace, an AttuneNamespaceDefaults object shadows cluster defaults for Prometheus config too. If it exists but omits metricsSource.prometheus.address, the controller falls through to auto-discovery, not to AttuneDefaults.

3. Cluster-wide defaults

apiVersion: attune.io/v1alpha1
kind: AttuneDefaults
metadata:
  name: default
spec:
  metricsSource:
    prometheus:
      address: http://prometheus-server.monitoring:80

Policies that omit metricsSource.prometheus.address inherit from this when no AttuneNamespaceDefaults exists in the same namespace. This is the recommended baseline for most clusters.

4. Auto-discovery (Prometheus Operator)

If the Prometheus Operator is installed, Attune lists monitoring.coreos.com/v1 Prometheus resources and constructs the address from the first one found:

http://prometheus-<name>.<namespace>:<port>

No configuration needed. This works with kube-prometheus-stack and any Prometheus Operator deployment.

5. Auto-discovery (well-known services)

As a last resort, the operator checks for services with well-known names in common namespaces:

| Namespace | Service name |
| --- | --- |
| monitoring | prometheus-server |
| monitoring | prometheus-kube-prometheus-prometheus |
| prometheus | prometheus-server |
| kube-prometheus-stack | prometheus-kube-prometheus-prometheus |

If found, the actual port from the Service spec is used (falls back to 9090 if no ports are defined).

Service port vs process port

The Prometheus process usually listens on port 9090, but the Kubernetes Service may expose a different port (e.g., port 80 in the prometheus-community chart). Well-known service auto-discovery uses the Service port from the Service spec and only falls back to 9090 when the Service declares no ports. Set the address explicitly only if you use a non-standard Service name or namespace, or if you want to bypass auto-discovery.
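
The port fallback can be sketched as follows. The function name is hypothetical, and picking the first declared port when a Service exposes several is an assumption; the text only says the Service port is used.

```python
DEFAULT_PROMETHEUS_PORT = 9090


def service_address(name: str, namespace: str, ports: list) -> str:
    """Build an in-cluster URL from a Service name, namespace, and spec.ports."""
    # Use the Service port (not the container port); fall back to 9090
    # only when the Service declares no ports at all.
    port = ports[0]["port"] if ports else DEFAULT_PROMETHEUS_PORT
    return f"http://{name}.{namespace}:{port}"


if __name__ == "__main__":
    # prometheus-community chart: Service port 80, process port 9090.
    print(service_address("prometheus-server", "monitoring",
                          [{"port": 80, "targetPort": 9090}]))
    print(service_address("prometheus-server", "monitoring", []))
```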

Common Prometheus installations

prometheus-community/prometheus (Helm)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  --set server.persistentVolume.enabled=true

The Service is prometheus-server.monitoring on port 80 (not 9090):

# AttuneDefaults
spec:
  metricsSource:
    prometheus:
      address: http://prometheus-server.monitoring:80

kube-prometheus-stack (Helm)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

The Service is prometheus-kube-prometheus-prometheus.monitoring on port 9090:

spec:
  metricsSource:
    prometheus:
      address: http://prometheus-kube-prometheus-prometheus.monitoring:9090

Auto-discovery (both Prometheus Operator CRD and well-known service name) works out of the box with this stack.

Prometheus Operator (standalone)

If you deploy Prometheus via the Prometheus Operator's Prometheus CRD, auto-discovery finds it automatically. No address configuration needed.

Verifying the integration

Step 1: Check the Prometheus Service port

kubectl get svc -n monitoring prometheus-server
# NAME                TYPE        CLUSTER-IP     PORT(S)
# prometheus-server   ClusterIP   10.96.x.x      80/TCP

Use the PORT(S) column value, not 9090.

Step 2: Verify cadvisor metrics exist

kubectl run prom-check --image=curlimages/curl --restart=Never --rm --attach --command -- \
  curl -s 'http://prometheus-server.monitoring:80/api/v1/query?query=container_cpu_usage_seconds_total' \
  | head -c 200

You should see "status":"success" with result data. If you see "resultType":"vector","result":[], cadvisor scraping is not configured.
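
If you prefer to check the response programmatically, the sketch below applies the same rule to the JSON body returned by the Prometheus HTTP API. The function name is an assumption; the `status` and `data.result` fields follow the standard query response format.

```python
import json


def has_samples(body: str) -> bool:
    """True when the query succeeded and returned at least one series."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        return False
    # An empty result list means the metric is not being scraped.
    return len(payload.get("data", {}).get("result", [])) > 0


if __name__ == "__main__":
    empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'
    print(has_samples(empty))
```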

Step 3: Test a namespace-scoped query

Replace <namespace> and <pod-prefix> with a real workload:

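# -G with --data-urlencode keeps curl from treating {} and [] as URL globs
# and URL-encodes the quotes in the query.
kubectl run prom-check --image=curlimages/curl --restart=Never --rm --attach --command -- \
  curl -sG 'http://prometheus-server.monitoring:80/api/v1/query' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total{namespace="<namespace>",pod=~"<pod-prefix>.*"}[5m])'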

Non-empty results confirm Attune can query metrics for that workload.

Step 4: Check policy conditions

kubectl get rsp -A

| Condition | Meaning |
| --- | --- |
| Ready: True, Reason: Monitoring | Prometheus reachable, recommendations computed |
| Ready: False, Reason: InsufficientData | Prometheus reachable but not enough history yet |
| Ready: False, Reason: PrometheusUnavailable | Prometheus could not be used for this reconcile. Check the condition message and Troubleshooting for address, auth/TLS, timeout, or query failures. |

If the condition is InsufficientData, wait for enough samples to accumulate. By default, recommendations need minimumDataPoints: 48 Prometheus range-query samples. With the default queryStep: 5m, that is about 4 hours of data within the default historyWindow: 168h.
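
The arithmetic behind the 4-hour figure, as a small sketch you can adapt if you change minimumDataPoints or queryStep. The helper names are hypothetical and the calculation assumes one sample per query step with no gaps.

```python
from datetime import timedelta


def minimum_history(min_data_points: int, query_step: timedelta) -> timedelta:
    """Shortest span of data that can yield the required number of samples."""
    return min_data_points * query_step


def max_samples(history_window: timedelta, query_step: timedelta) -> int:
    """Upper bound on samples a full history window can contain."""
    return history_window // query_step


if __name__ == "__main__":
    step = timedelta(minutes=5)
    print(minimum_history(48, step))            # 48 * 5m
    print(max_samples(timedelta(hours=168), step))
```

With the defaults, 48 samples at a 5-minute step need 240 minutes of data, well inside the 168-hour window.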

Operator metrics (what Attune exposes)

Attune itself exposes Prometheus metrics on its :8080/metrics endpoint. To scrape these, either:

  • Enable the Helm chart's ServiceMonitor (metrics.serviceMonitor.enabled: true), or
  • Add a scrape annotation to the operator pod

See Metrics Reference for the full list of attune_* metrics.

Grafana dashboard

Enable the Helm chart's dashboard ConfigMap to auto-provision a Grafana dashboard:

helm upgrade attune oci://ghcr.io/attune-io/charts/attune \
  --set grafanaDashboard.enabled=true

The dashboard covers resizes, reverts, savings, recommendations, confidence scores, reconcile latency, and Prometheus query health. See deploy/grafana/dashboard.json for the raw JSON.

Alerting with PrometheusRule

Enable the Helm chart's PrometheusRule to get out-of-the-box alerts:

helm upgrade attune oci://ghcr.io/attune-io/charts/attune \
  --set metrics.prometheusRule.enabled=true

This creates four alerts:

| Alert | Fires when | Default severity |
| --- | --- | --- |
| AttuneReconcileErrors | Reconcile error rate > 0 sustained for 10m | warning |
| AttunePrometheusUnreachable | Prometheus query errors sustained for 10m | warning |
| AttuneDegraded | More than 3 reverts in 15m for a workload | critical |
| AttuneReconcileStale | No reconcile completes within 30m | warning |

Individual alerts can be disabled or tuned:

metrics:
  prometheusRule:
    enabled: true
    rules:
      reconcileErrors:
        severity: critical   # escalate to critical
      degraded:
        enabled: false       # disable this alert
      reconcileStale:
        staleDuration: 1h    # fire after 1 hour instead of 30m

See the Helm chart README for the full list of configurable parameters.