Prometheus Setup

Attune relies on Prometheus for historical CPU and memory usage data. This guide covers which metrics are required, how to configure the Prometheus address, and how to verify the integration is working.

Required Prometheus metrics

The operator queries these metrics, all of which are exposed by cAdvisor (built into the kubelet):

Metric	Query	What it measures
`container_cpu_usage_seconds_total`	`rate(...[5m])`	CPU cores consumed per container
`container_memory_working_set_bytes`	instant	Memory actively used per container
`container_cpu_cfs_throttled_periods_total`	`rate(...[5m]) / rate(container_cpu_cfs_periods_total[5m])`	CPU throttle ratio (safety monitor)
`container_cpu_cfs_periods_total`	used in throttle ratio denominator	Total CPU scheduling periods

The first two metrics are required for recommendations. The CFS throttle metrics are used by the safety monitor when autoRevert: true (default) to detect CPU under-provisioning after a resize. If these metrics are missing, throttle detection is silently skipped.
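To make the throttle ratio concrete, here is a minimal sketch that computes it from two raw counter samples per metric. It approximates the ratio of the two `rate()` expressions by dividing counter deltas over the same window; the function name and sample values are hypothetical and are not taken from the operator's code.

```python
def throttle_ratio(throttled_start, throttled_end, periods_start, periods_end):
    """Fraction of CFS periods in which the container was throttled.

    Inputs are raw counter samples taken at the start and end of one window.
    Returns None when no periods elapsed, so a missing signal is not
    mistaken for "zero throttling".
    """
    elapsed_periods = periods_end - periods_start
    if elapsed_periods <= 0:
        return None
    return (throttled_end - throttled_start) / elapsed_periods


# Hypothetical samples: 150 of 3000 scheduling periods were throttled.
ratio = throttle_ratio(1000, 1150, 20000, 23000)
print(ratio)
```

Returning `None` rather than `0.0` when the denominator is empty mirrors the behavior described above, where missing CFS metrics cause throttle detection to be skipped instead of reporting a healthy value.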

These metrics are available out of the box in any Prometheus installation that scrapes the kubelet's /metrics/cadvisor endpoint. No additional exporters or recording rules are needed.

Note

The queries filter by the namespace, pod (regex prefix match), and container labels. If your Prometheus relabeling renames or drops any of these labels, the queries will return empty results.
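The label filtering described in the note can be sketched as a small query builder. The exact PromQL the operator issues is not shown in this guide, so the selector layout, function names, and example values below are assumptions for illustration only.

```python
def label_selector(namespace, pod_prefix, container):
    # Exact match on namespace and container, regex prefix match on pod.
    # Assumes pod_prefix contains no regex metacharacters.
    return f'namespace="{namespace}",pod=~"{pod_prefix}.*",container="{container}"'


def cpu_usage_query(namespace, pod_prefix, container, window="5m"):
    selector = label_selector(namespace, pod_prefix, container)
    return f"rate(container_cpu_usage_seconds_total{{{selector}}}[{window}])"


def memory_usage_query(namespace, pod_prefix, container):
    selector = label_selector(namespace, pod_prefix, container)
    return f"container_memory_working_set_bytes{{{selector}}}"


# Hypothetical workload: pods named web-... in the production namespace.
print(cpu_usage_query("production", "web-", "app"))
print(memory_usage_query("production", "web-", "app"))
```

If any of the three labels is missing from the scraped series, a selector like this matches nothing, which is why relabeling is the first thing to check when results come back empty.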

Prometheus address resolution

The operator resolves the Prometheus address in this order:

flowchart TD
    A[Policy spec<br/>metricsSource.prometheus.address] -->|set| Z[Use it]
    A -->|not set| B{Namespace defaults exist?}
    B -->|yes| C[AttuneNamespaceDefaults<br/>metricsSource.prometheus.address]
    C --> D{Address set?}
    D -->|yes| Z
    D -->|no| G[Auto-discovery:<br/>Prometheus Operator CRD]
    B -->|no| E[AttuneDefaults<br/>metricsSource.prometheus.address]
    E --> F{Address set?}
    F -->|yes| Z
    F -->|no| G
    G -->|found| Z
    G -->|not found| H[Auto-discovery:<br/>well-known service names]
    H -->|found| Z
    H -->|not found| I[PrometheusUnavailable<br/>condition set]
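The same resolution order can be written as a short function. This is a sketch of the flowchart only, not the operator's implementation: the function name, the use of plain dicts for the defaults objects, and the pre-computed discovery results are all assumptions.

```python
from typing import Optional


def resolve_prometheus_address(
    policy_address: Optional[str],
    namespace_defaults: Optional[dict],  # None means no AttuneNamespaceDefaults exists
    cluster_defaults: Optional[dict],    # None means no AttuneDefaults exists
    crd_address: Optional[str],          # result of Prometheus Operator CRD discovery
    well_known_address: Optional[str],   # result of well-known service discovery
) -> Optional[str]:
    # 1. An address on the policy always wins.
    if policy_address:
        return policy_address
    # 2./3. Exactly one defaults source is consulted: namespace defaults if
    # the object exists, otherwise cluster-wide defaults.
    defaults = namespace_defaults if namespace_defaults is not None else cluster_defaults
    if defaults and defaults.get("address"):
        return defaults["address"]
    # 4./5. Auto-discovery, CRD first. None maps to PrometheusUnavailable.
    return crd_address or well_known_address


# Namespace defaults exist but set no address: cluster defaults are skipped.
print(resolve_prometheus_address(None, {}, {"address": "http://cluster:9090"}, "http://crd:9090", None))
```

Note the case in the usage line: an empty namespace defaults object sends resolution straight to auto-discovery, matching the `D -->|no| G` edge in the flowchart.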

1. Policy-level address (highest priority)

spec:
  metricsSource:
    prometheus:
      address: http://prometheus-server.monitoring:80

Use this when different namespaces use different Prometheus instances.

If you configure metricsSource.prometheus.bearerTokenSecret, the Secret must live in the same namespace as the AttunePolicy. Cross-namespace Secret references are rejected.

Use an in-cluster address

The operator validates metricsSource.prometheus.address to block loopback and cloud metadata endpoints. http://127.0.0.1:9090, http://[::1]:9090, http://169.254.169.254/..., and metadata hostnames are rejected. Do not point a policy at a local port-forward or a workstation URL. Use a Service DNS name or ClusterIP that the operator can reach from inside the cluster, such as http://prometheus-server.monitoring:80. Private cluster IPs are allowed.
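As a rough illustration of this kind of check, the sketch below rejects loopback and link-local (cloud metadata) targets using only the Python standard library. The operator's actual validation logic is not shown in this guide; the function name, the treatment of `localhost`, and the single metadata hostname listed are assumptions.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical deny-list; the guide only says "metadata hostnames".
METADATA_HOSTNAMES = {"metadata.google.internal"}


def is_allowed_address(url: str) -> bool:
    host = urlparse(url).hostname
    if not host:
        return False
    if host == "localhost" or host in METADATA_HOSTNAMES:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so treat it as a DNS name such as a Service name.
        return True
    # 169.254.0.0/16 is link-local, which covers 169.254.169.254.
    return not (ip.is_loopback or ip.is_link_local)


for candidate in (
    "http://127.0.0.1:9090",
    "http://[::1]:9090",
    "http://169.254.169.254/",
    "http://prometheus-server.monitoring:80",
    "http://10.96.0.10:80",
):
    print(candidate, is_allowed_address(candidate))
```

Private ranges such as `10.0.0.0/8` pass this check, consistent with the statement that private cluster IPs are allowed.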

2. Namespace defaults

apiVersion: attune.io/v1alpha1
kind: AttuneNamespaceDefaults
metadata:
  name: team-defaults
  namespace: production
spec:
  metricsSource:
    prometheus:
      address: http://prometheus-server.monitoring:80

Policies that omit metricsSource.prometheus.address inherit from this first. Use namespace defaults when different teams or environments need different Prometheus backends.

Because the controller resolves a single defaults source per namespace, an AttuneNamespaceDefaults object shadows cluster defaults for Prometheus config too. If it exists but omits metricsSource.prometheus.address, the controller falls through to auto-discovery, not to AttuneDefaults.

3. Cluster-wide defaults

apiVersion: attune.io/v1alpha1
kind: AttuneDefaults
metadata:
  name: default
spec:
  metricsSource:
    prometheus:
      address: http://prometheus-server.monitoring:80

Policies that omit metricsSource.prometheus.address inherit from this when no AttuneNamespaceDefaults exists in the same namespace. This is the recommended baseline for most clusters.

4. Auto-discovery (Prometheus Operator)

If the Prometheus Operator is installed, Attune lists monitoring.coreos.com/v1 Prometheus resources and constructs the address from the first one found:

http://prometheus-<name>.<namespace>:<port>

No configuration is needed. This works with kube-prometheus-stack and any other Prometheus Operator deployment.
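The address pattern above can be illustrated with a few lines of Python. The dict shape used for a Prometheus resource and the way the port is supplied are assumptions made for this example; only the `http://prometheus-<name>.<namespace>:<port>` pattern and the "first one found" rule come from this guide.

```python
from typing import Optional


def address_from_prometheus_resources(resources: list) -> Optional[str]:
    # Use the first resource in the list, as described above.
    if not resources:
        return None
    first = resources[0]
    return f"http://prometheus-{first['name']}.{first['namespace']}:{first['port']}"


# Hypothetical resource named "k8s" in the "monitoring" namespace.
print(address_from_prometheus_resources(
    [{"name": "k8s", "namespace": "monitoring", "port": 9090}]
))
```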

5. Auto-discovery (well-known services)

As a last resort, the operator checks for services with well-known names in common namespaces:

Namespace	Service name
`monitoring`	`prometheus-server`
`monitoring`	`prometheus-kube-prometheus-prometheus`
`prometheus`	`prometheus-server`
`kube-prometheus-stack`	`prometheus-kube-prometheus-prometheus`

If a matching Service is found, the operator uses the port declared in its Service spec, falling back to 9090 only if the Service declares no ports.

Service port vs process port

The Prometheus process usually listens on port 9090, but the Kubernetes Service may expose a different port (e.g., port 80 in the prometheus-community chart). Well-known service auto-discovery uses the Service port from the Service spec and only falls back to 9090 when the Service declares no ports. Set the address explicitly only if you use a non-standard Service name or namespace, or if you want to bypass auto-discovery.
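The lookup and port fallback can be sketched as follows. The candidate list comes from the table above, but the lookup order, the choice of the first declared port, and the in-memory `services` mapping (standing in for Kubernetes API calls) are assumptions.

```python
from typing import Optional

WELL_KNOWN = [
    ("monitoring", "prometheus-server"),
    ("monitoring", "prometheus-kube-prometheus-prometheus"),
    ("prometheus", "prometheus-server"),
    ("kube-prometheus-stack", "prometheus-kube-prometheus-prometheus"),
]
FALLBACK_PORT = 9090


def discover_well_known(services: dict) -> Optional[str]:
    """services maps (namespace, name) to the list of ports in the Service spec."""
    for namespace, name in WELL_KNOWN:
        if (namespace, name) in services:
            ports = services[(namespace, name)]
            # Prefer the Service port; fall back only when none is declared.
            port = ports[0] if ports else FALLBACK_PORT
            return f"http://{name}.{namespace}:{port}"
    return None


# The prometheus-community chart exposes the server on Service port 80.
print(discover_well_known({("monitoring", "prometheus-server"): [80]}))
```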

Common Prometheus installations

prometheus-community/prometheus (Helm)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  --set server.persistentVolume.enabled=true

The Service is prometheus-server.monitoring on port 80 (not 9090):

# AttuneDefaults
spec:
  metricsSource:
    prometheus:
      address: http://prometheus-server.monitoring:80

kube-prometheus-stack (Helm)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install kube-prom prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace

The Service is prometheus-kube-prometheus-prometheus.monitoring on port 9090:

spec:
  metricsSource:
    prometheus:
      address: http://prometheus-kube-prometheus-prometheus.monitoring:9090

Auto-discovery (both Prometheus Operator CRD and well-known service name) works out of the box with this stack.

Prometheus Operator (standalone)

If you deploy Prometheus via the Prometheus Operator's Prometheus CRD, auto-discovery finds it automatically. No address configuration is needed.

Verifying the integration

Step 1: Check the Prometheus Service port

kubectl get svc -n monitoring prometheus-server
# NAME                TYPE        CLUSTER-IP     PORT(S)
# prometheus-server   ClusterIP   10.96.x.x      80/TCP

Use the port shown in the PORT(S) column (80 in this example), not the Prometheus process port 9090.

Step 2: Verify cAdvisor metrics exist

kubectl run prom-check --image=curlimages/curl --restart=Never --rm --attach --command -- \
  curl -s 'http://prometheus-server.monitoring:80/api/v1/query?query=container_cpu_usage_seconds_total' \
  | head -c 200

You should see "status":"success" with a non-empty result. If you see "resultType":"vector","result":[], Prometheus is reachable but cAdvisor scraping is not configured.
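If you want to script this check rather than read the output by eye, the response can be parsed as JSON. The sketch below relies only on the standard Prometheus instant-query response shape (`status`, `data.resultType`, `data.result`); the function name and the sample bodies are hypothetical.

```python
import json


def has_series(body: str) -> bool:
    """Return True if a Prometheus query response succeeded and has data."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        return False
    return len(payload.get("data", {}).get("result", [])) > 0


# Reachable Prometheus, but no cAdvisor series were scraped.
empty = '{"status":"success","data":{"resultType":"vector","result":[]}}'
# Reachable Prometheus with at least one series.
populated = (
    '{"status":"success","data":{"resultType":"vector","result":'
    '[{"metric":{"pod":"web-1"},"value":[1700000000,"12.5"]}]}}'
)
print(has_series(empty), has_series(populated))
```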

Step 3: Test a namespace-scoped query

Replace <namespace> and <pod-prefix> with a real workload:

# -G with --data-urlencode stops curl from globbing the {} and [] in the query and URL-encodes it
kubectl run prom-check --image=curlimages/curl --restart=Never --rm --attach --command -- \
  curl -sG 'http://prometheus-server.monitoring:80/api/v1/query' \
  --data-urlencode 'query=rate(container_cpu_usage_seconds_total{namespace="<namespace>",pod=~"<pod-prefix>.*"}[5m])'

Non-empty results confirm Attune can query metrics for that workload.

Step 4: Check policy conditions

kubectl get attunepolicy -A

Condition	Meaning
`Ready: True, Reason: Monitoring`	Prometheus reachable, recommendations computed
`Ready: False, Reason: InsufficientData`	Prometheus reachable but not enough history yet
`Ready: False, Reason: PrometheusUnavailable`	Prometheus could not be used for this reconcile. Check the condition message and Troubleshooting for address, auth/TLS, timeout, or query failures.

If the condition is InsufficientData, wait for enough samples to accumulate. By default, recommendations need minimumDataPoints: 48 Prometheus range-query samples. With the default queryStep: 5m, that is about 4 hours of data within the default historyWindow: 168h.
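The arithmetic behind the "about 4 hours" figure can be checked with a few lines. The function name is hypothetical, and the calculation treats each sample as covering one full query step, which is the same approximation used in the paragraph above.

```python
from datetime import timedelta


def minimum_history(minimum_data_points: int, query_step: timedelta) -> timedelta:
    # Approximate wall-clock time needed to accumulate the required samples.
    return minimum_data_points * query_step


needed = minimum_history(48, timedelta(minutes=5))
history_window = timedelta(hours=168)

print(needed)                    # 4:00:00
print(needed <= history_window)  # True
```

Lowering `minimumDataPoints` or `queryStep` shortens the wait proportionally, as long as the result still fits inside `historyWindow`.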

Operator metrics (what Attune exposes)

Attune itself exposes Prometheus metrics on its :8080/metrics endpoint. To scrape these, either:

Enable the Helm chart's ServiceMonitor (metrics.serviceMonitor.enabled: true), or
Add a scrape annotation to the operator pod

See Metrics Reference for the full list of attune_* metrics.

Grafana dashboard

Option A: Auto-provision via sidecar (kube-prometheus-stack)

If you use the Grafana sidecar from kube-prometheus-stack, enable the dashboard ConfigMap:

helm upgrade attune oci://ghcr.io/attune-io/charts/attune \
  --set grafanaDashboard.enabled=true

This creates a ConfigMap with the grafana_dashboard: "1" label. The Grafana sidecar automatically detects it and imports the dashboard.

Option B: Manual import (standalone Grafana)

If your Grafana is not using the sidecar pattern:

Download the dashboard JSON from deploy/grafana/dashboard.json
In Grafana, go to Dashboards > New > Import
Paste the JSON or upload the file
Select your Prometheus data source when prompted

The dashboard covers resizes, reverts, savings, recommendations, confidence scores, reconcile latency, and Prometheus query health.

Alerting with PrometheusRule

Enable the Helm chart's PrometheusRule to get out-of-the-box alerts:

helm upgrade attune oci://ghcr.io/attune-io/charts/attune \
  --set metrics.prometheusRule.enabled=true

This creates 10 alerts covering the key operational signals:

Alert	Fires when	Default severity
`AttuneReconcileErrors`	Reconcile error rate > 0 sustained for 10m	warning
`AttunePrometheusUnreachable`	Prometheus query errors sustained for 10m	warning
`AttuneDegraded`	High revert rate sustained for 5m	critical
`AttuneHighRevertRate`	More than 50% of resizes reverted in 15m	critical
`AttuneReconcileStale`	No reconcile completes within 30m	warning
`AttuneBudgetExhausted`	Resize budget exhausted for 30m	warning
`AttuneDataQuality`	NaN/Inf values from Prometheus for 30m	warning
`AttuneRequestsClamped`	Recommendations capped at limits for 1h	info
`AttuneStaleRecommendations`	Recommendations stale due to data gaps for 1h	warning
`AttuneRevertFailures`	Resize revert failed for 5m	critical

Individual alerts can be disabled or tuned:

metrics:
  prometheusRule:
    enabled: true
    rules:
      reconcileErrors:
        severity: critical   # escalate to critical
      degraded:
        enabled: false       # disable this alert
      reconcileStale:
        staleDuration: 1h    # fire after 1 hour instead of 30m

See the Helm chart README for the full list of configurable parameters.

Full observability stack

To enable metrics scraping, alerts, and the Grafana dashboard in one command:

helm upgrade attune oci://ghcr.io/attune-io/charts/attune \
  --set metrics.enabled=true \
  --set metrics.serviceMonitor.enabled=true \
  --set metrics.prometheusRule.enabled=true \
  --set grafanaDashboard.enabled=true

Or in a values file:

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
  prometheusRule:
    enabled: true
grafanaDashboard:
  enabled: true