Troubleshooting¶
Common conditions¶
Check the policy's conditions for a quick diagnosis:
kubectl get rsp <name> -o jsonpath='{.status.conditions}' | jq .
PrometheusUnavailable¶
Symptom: Ready condition is False with reason PrometheusUnavailable.
Cause: PrometheusUnavailable means the controller could not use
Prometheus for this reconcile. The condition message tells you which step
failed:
Cannot resolve Prometheus configmeans address resolution failed. The operator checks (in order): policy spec, one defaults source (AttuneNamespaceDefaultsif present, otherwiseAttuneDefaults), Prometheus Operator CRD, then well-known service names.Cannot create metrics collector,reading secret, or transport errors likeTLS handshake timeoutmean the address was found but auth, headers, bearer token secret, CA bundle, or TLS setup failed.Prometheus query timeout exceededmeans the reconcile-level timeout expired before all Prometheus queries completed.Prometheus query errors (means Prometheus answered, but one or more metric queries failed. This can still happen when Prometheus is reachable.
If the condition message includes Cannot resolve Prometheus config: SSRF blocked,
the configured address points at localhost, 127.0.0.1, ::1, or a
cloud metadata endpoint. Replace it with the in-cluster Prometheus Service
DNS name or ClusterIP. A local kubectl port-forward URL on your workstation
will not work.
Fix address resolution failures:
-
Set the address explicitly in a
AttuneDefaultsresource:apiVersion: attune.io/v1alpha1 kind: AttuneDefaults metadata: name: default spec: metricsSource: prometheus: address: http://prometheus-server.monitoring:80 -
Verify the Prometheus Service exists and note its port:
kubectl get svc -n monitoring # Check the PORT(S) column: "80/TCP" means use :80, not :9090 -
Test connectivity from inside the cluster:
kubectl run prom-test --image=curlimages/curl --restart=Never --rm --attach --command -- \ curl -sf http://prometheus-server.monitoring:80/-/healthy
If the condition message includes Cannot create metrics collector,
reading secret, or a transport error like TLS handshake timeout,
verify the credentials and connection details before changing timeouts:
- Check the referenced Secret exists in the policy namespace and contains the expected bearer token.
- Re-check custom headers, CA bundle, and
insecureSkipVerifysettings. - Test the exact Prometheus URL from inside the cluster with the same auth mechanism the operator uses.
If the condition message includes Prometheus query timeout exceeded, the
operator's reconcile-level timeout expired before all workload queries
completed. This typically happens when Prometheus is slow to respond
(not down, just overloaded) or when a policy targets many workloads.
Fix query timeouts:
- Increase the timeout: set Helm
prometheusTimeout: "10m"(or--prometheus-timeout=10m). - Reduce per-query cost: decrease
historyWindowor increasequeryStepon the AttunePolicy or AttuneDefaults. - Check Prometheus health: high query latency often indicates Prometheus itself needs more resources or recording rules.
If the condition message includes Prometheus query errors (, Prometheus was
reachable but one or more metric queries still failed.
Fix query errors:
- Check the operator logs for the exact failing query and backend error.
- Replay the failing query directly against Prometheus to confirm whether the backend rejects it or returns partial data.
- If the backend is overloaded, reduce query cost with a shorter
historyWindowor a largerqueryStep.
See the Prometheus Setup guide for full details on address resolution and common installations.
Prometheus reachable but queries return no data¶
Symptom: Ready condition is InsufficientData even after days of running.
Operator logs show "cpuPoints":0,"memPoints":0.
Cause: Prometheus is reachable but cadvisor metrics are not being scraped, or label names have been relabeled.
Fix:
-
Verify cadvisor metrics exist in Prometheus:
kubectl run prom-check --image=curlimages/curl --restart=Never --rm --attach --command -- \ curl -s 'http://prometheus-server.monitoring:80/api/v1/query?query=container_cpu_usage_seconds_total' \ | head -c 200 -
If the result is empty (
"result":[]), cadvisor scraping is not configured. Check your Prometheus scrape configuration for akubernetes-nodes-cadvisoror equivalent job. -
If the result has data but the operator still reports 0 data points, check that the
namespace,pod, andcontainerlabel names match. Some Prometheus configurations relabel these.
NoWorkloadsFound¶
Symptom: Ready condition is False with reason NoWorkloadsFound.
Cause: The policy's targetRef does not match any workloads in the
namespace. This is usually a typo in the workload name or an incorrect
kind (e.g., targeting a Deployment when the workload is a StatefulSet).
Fix:
-
Verify the workload exists:
kubectl get deploy,sts,ds -n <namespace> -
Check the
targetRef.namespelling in your policy. If using a label selector, verify the labels exist on the target workload:kubectl get deploy <name> -n <namespace> --show-labels -
Ensure the
targetRef.kindmatches the workload type (Deployment,StatefulSet,DaemonSet,ReplicaSet,Job, orCronJob).
InsufficientData¶
Symptom: Ready condition is False with reason InsufficientData.
Cause: Not enough Prometheus data points to generate recommendations.
The default minimum is 48 Prometheus range-query samples. With the default
queryStep: 5m, that is about 4 hours of data.
Fix: Wait for more data to accumulate, or adjust these settings:
minimumDataPoints: Lower for faster (but less confident) recommendations.historyWindow: If too short (e.g.1h), Prometheus may not have enough samples within the window. The default is168h(7 days). Ensure the window is long enough for your scrape interval to produce at leastminimumDataPointsdata points.
spec:
metricsSource:
minimumDataPoints: 48 # ~4 hours of data at the default queryStep: 5m
historyWindow: 168h # query the last 7 days of metrics
InvalidConfig¶
Symptom: Ready condition is False with reason InvalidConfig.
Cause: The controller could not fetch or apply defaults cleanly before
continuing. The condition message includes the failing step, such as
Failed to fetch defaults: listing AttuneNamespaceDefaults ....
Fix:
- Check whether the operator can list
AttuneDefaultsandAttuneNamespaceDefaults. - Verify the defaults objects themselves are valid and that only the expected objects exist in the namespace.
- Check operator logs for the exact failing API call or validation error.
WorkloadDiscoveryFailed¶
Symptom: Ready condition is False with reason WorkloadDiscoveryFailed.
Cause: The operator could not resolve the policy's targetRef into the
workloads it should inspect. The condition message includes the failing step,
for example an unsupported kind, an invalid selector, or a client/list error.
Fix:
- Verify
spec.targetRef.kindis one ofDeployment,StatefulSet,DaemonSet,CronJob,Job, orReplicaSet. - If you use
targetRef.name, confirm the workload exists in the same namespace as the policy. - If you use
targetRef.selector, confirm it matches at least one workload and includes realmatchLabelsormatchExpressionsentries. - Check operator logs for the exact discovery error if the target still looks correct.
Paused¶
Symptom: Ready condition is False with reason Paused.
Cause: spec.paused is set to true on the policy. The operator skips
all reconciliation: no metrics collection, no recommendations, no resizes.
Existing resizes are not reverted.
Fix: Set spec.paused: false or remove the field entirely. The operator
will resume reconciliation on the next cycle.
CooldownActive¶
Symptom: The operator logs "Cooldown active, skipping resize" and no pods are resized.
Cause: A resize was performed recently and the cooldown period has not elapsed.
Fix: Wait for the cooldown to expire, or shorten it:
kubectl patch rsp <name> --type merge \
-p '{"spec":{"updateStrategy":{"cooldown":"30m"}}}'
Webhook / cert-manager issues¶
Webhook connection refused¶
Symptom: kubectl apply -f policy.yaml returns:
Error from server (InternalError): Internal error occurred: failed calling
webhook "vattunepolicy.kb.io": Post "https://...": dial tcp ...: connection refused
Cause: The webhook server is not running or the TLS certificate is not ready. This typically means cert-manager is missing or broken.
Fix:
-
Verify cert-manager is installed and running:
kubectl get pods -n cert-manager # All 3 pods (cert-manager, cainjector, webhook) should be Running -
If cert-manager is not installed, install it:
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.17.2/cert-manager.yaml kubectl wait --for=condition=Available deployment/cert-manager-webhook -n cert-manager --timeout=120s -
Check the Certificate status:
kubectl get certificate -n attune-system # Status should be True (Ready) -
If the Certificate is not ready, check the cert-manager logs:
kubectl logs -n cert-manager deploy/cert-manager --tail=20
Webhook timeout¶
Symptom: Policy creation takes 30 seconds then fails with timeout.
Cause: The webhook pod is running but the cainjector has not patched the CA bundle into the webhook configuration yet.
Fix: Wait for cainjector to inject the CA bundle (usually resolves within 1-2 minutes after cert-manager is ready):
kubectl get validatingwebhookconfiguration -o yaml | grep caBundle | head -1
# If empty, cainjector has not run yet. Wait and retry.
Resize failures¶
Resize subresource not found (K8s 1.32)¶
Symptom: Operator logs contain the server does not allow this method
on the requested resource or pod resize subresource is not enabled when
attempting a resize.
Cause: On Kubernetes 1.32, the In-Place Pod Resize feature is alpha and
disabled by default. The /resize subresource is only available when the
InPlacePodVerticalScaling feature gate is enabled on all control plane
components and kubelets.
Fix: Enable the feature gate on all components. For managed clusters, check your provider's documentation. For self-managed clusters:
# API server, controller-manager, and scheduler flags:
--feature-gates=InPlacePodVerticalScaling=true
# Kubelet config (on every node):
featureGates:
InPlacePodVerticalScaling: true
On Kubernetes 1.33+, this feature gate is enabled by default and no action is needed.
Infeasible resize¶
Symptom: Resize history shows result: Failed and operator logs contain
resize infeasible.
Cause: The node cannot accommodate the new resource values. Common when increasing resources on a node that is already at capacity.
Fix: Ensure the cluster has sufficient allocatable resources, or tighten bounds to stay within node capacity:
spec:
cpu:
maxAllowed: "2000m" # reduce max to fit on nodes
QoS class change blocked¶
Symptom: Operator logs Skipping resize: would change QoS class.
Cause: For Guaranteed-class pods, requests must equal limits. If the policy would set different values for requests and limits, the resize is skipped.
Fix: Set controlledValues: RequestsAndLimits so both are updated
together, or switch to RequestsOnly if the pod should be Burstable.
ResourceQuota exceeded¶
Symptom: Operator logs Skipping resize: quota/limitrange violation
with a message mentioning exceed ResourceQuota.
Cause: The resize would increase CPU or memory requests beyond the remaining headroom in the namespace's ResourceQuota.
Fix:
-
Check current quota usage:
kubectl get resourcequota -n <namespace> -
Either increase the quota limits, or tighten the policy's resource bounds so recommendations stay within quota.
Revert issues¶
High revert rate¶
Symptom: Degraded condition is True with reason HighRevertRate, or
multiple entries in .status.resizeHistory show result: Reverted.
Cause: 3+ of the last 5 resize operations were reverted due to safety violations. The controller applies exponential backoff (2x cooldown per consecutive revert, capped at 16x).
Check the current backoff state:
kubectl get rsp <name> -o jsonpath='{.status.cooldown}'
# Example: {"backoffMultiplier":8,"consecutiveReverts":3,"effectiveCooldown":"8h0m0s"}
Fix: Investigate the revert reasons:
kubectl get rsp <name> -o jsonpath='{.status.resizeHistory}' | \
jq '[.[] | select(.result=="Reverted")]'
Common causes:
- oomkill: overhead is too low for memory. Increase
memory.overhead. - throttle: CPU throttle ratio exceeded 50% post-resize. Increase
cpu.overhead. - restart: the application crashes at the new resource level. Check application logs.
- notready: readiness probe fails post-resize. Verify probe configuration.
Resizes not happening during expected window¶
Symptom: Operator logs "Outside resize window, skipping resize" even though you expect the window to be open.
Cause: The schedule.timezone does not match your local time.
Windows are evaluated in the configured timezone (default: UTC).
Fix: Verify your timezone is correct:
schedule:
windows:
- start: "02:00"
end: "06:00"
timezone: "America/New_York" # not UTC
Check the current time in the configured timezone:
TZ="America/New_York" date "+%H:%M %A"
Budget exhausted¶
Symptom: Operator logs "Budget exhausted, deferring resize to next cycle" and some pods are not resized.
Cause: The total CPU or memory increase across all pods exceeds the
configured maxTotalCpuIncrease or maxTotalMemoryIncrease.
Fix: Either increase the budget or accept that resizes are spread across multiple reconcile cycles (this is the intended behavior for gradual rollout):
updateStrategy:
maxTotalCpuIncrease: "4000m" # 4 cores per cycle
maxTotalMemoryIncrease: "8Gi" # 8 GiB per cycle
Policy rejected: invalid schedule timezone¶
Symptom: kubectl apply fails with:
admission webhook "validation.attune.io" denied the request:
updateStrategy.schedule.timezone "PST" is not a valid IANA timezone
Cause: The timezone must be a valid IANA timezone name from the
tz database.
Common mistakes include using abbreviations that Go's time.LoadLocation
does not recognize.
Fix: Use the canonical IANA region/city name:
| Invalid | Valid alternative |
|---|---|
PST |
America/Los_Angeles |
IST |
Asia/Kolkata |
Note: US/Eastern, EST, and CET are valid IANA timezone links and
will be accepted, but the canonical forms (America/New_York,
Europe/Berlin) are recommended for clarity.
# List all valid timezones on your system:
timedatectl list-timezones
Policy rejected: invalid day of week¶
Symptom: kubectl apply fails with:
admission webhook "validation.attune.io" denied the request:
updateStrategy.schedule.daysOfWeek contains invalid day "Wed"
Cause: Day names must be the full English name. Abbreviations and non-English names are not accepted.
Fix: Use the full name (case-insensitive):
schedule:
daysOfWeek: ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"]
Valid values: Monday, Tuesday, Wednesday, Thursday, Friday,
Saturday, Sunday.
Deleting a policy¶
When you delete a AttunePolicy, the operator uses a
attune.io/cleanup finalizer to clean up before the resource is
garbage-collected:
- Annotations removed: all tracking annotations (
attune.io/resized-at,attune.io/policy, etc.) and theattune.io/trackedlabel are removed from pods managed by that policy. - Resources retained: pods keep their current (resized) CPU and memory values. The operator does not revert resources to pre-resize values.
- Gauges cleaned: Prometheus gauge metrics for the policy are removed.
- Finalizer removed: only after cleanup succeeds. If a pod update fails, the finalizer remains and the controller retries on the next reconcile cycle.
If the policy appears stuck in Terminating, check the operator logs for
pod update errors during cleanup:
kubectl logs -n attune-system deploy/attune-controller-manager | grep "deletion cleanup"
Large cluster performance¶
Stale recommendations (slow reconciliation)¶
If workqueue_depth is consistently > 0 and
workqueue_longest_running_processor_seconds climbs, the operator cannot
keep up with the reconcile queue. Solutions (in order of impact):
- Increase
maxConcurrentReconciles(or use aclusterSizepreset). - Scope with
--watch-namespacesto reduce informer cache size. - Policies targeting many workloads via label selector now process up to 10 workloads in parallel per reconcile cycle.
See the Scaling Guide for tuning details and preset values.
High memory usage¶
If the operator pod is OOMKilled or uses unexpectedly high memory, the
informer cache may be caching too many objects. Use --watch-namespaces
to limit the cache to the namespaces where your policies exist.
Resizes skipped due to stale recommendations¶
When Prometheus does not return fresh data during a reconcile cycle, the operator marks the recommendation as stale and skips the resize to avoid acting on outdated metrics. You will see this in the operator logs:
Skipping resize for workload with stale recommendation workload=my-app
The attune_stale_recommendations_total counter increments each
time this happens. Common causes:
- Prometheus is temporarily unavailable or responding slowly.
- The
historyWindowis too short for the workload's scrape interval, so range queries return no data. - Pod label changes caused the PromQL regex to stop matching.
To diagnose, enable debug logging and check the Prometheus query results:
kubectl logs -n attune-system deploy/attune-controller-manager \
| grep -E "stale|Prometheus query returned no data"
Resizes resume automatically once fresh data is available.
Deployment-owned ReplicaSet targeting¶
If a AttunePolicy targets a ReplicaSet that is owned by a Deployment,
the operator rejects it with an error:
ReplicaSet my-ns/my-rs is owned by a Deployment; target the Deployment instead
Deployment-owned ReplicaSets are also automatically filtered from selector-based discovery to prevent double-resizing (the Deployment and its child ReplicaSet would both match). To right-size the workload, target the parent Deployment instead.
Known limitations¶
Maximum Prometheus addresses¶
The operator caches at most 64 unique Prometheus collector connections. Clusters with more than 64 distinct Prometheus addresses across all policies will see errors on additional addresses. In practice this is rarely hit since most clusters use 1-2 Prometheus instances.
Minimum cooldown floor¶
The operator enforces a minimum cooldown of 1 minute regardless of the
configured cooldown value. Setting cooldown: 10s effectively becomes
cooldown: 1m. This prevents accidental resource churn.
Enabling debug logs¶
The operator supports multiple log verbosity levels. By default it runs
at info level. To enable debug logging:
# Enable debug logs (V(1): queries, pod selection, cache, recommendations)
helm upgrade attune attune/attune \
--set logging.level=debug
# Enable verbose trace logs (V(2): per-sample data, full recommendation chain)
helm upgrade attune attune/attune \
--set logging.level=2
You can also switch to human-readable text format for local debugging:
helm upgrade attune attune/attune \
--set logging.level=debug --set logging.format=text
Revert to normal after debugging:
helm upgrade attune attune/attune \
--set logging.level=info
Debug commands¶
Operator logs:
kubectl -n attune-system logs -l app.kubernetes.io/name=attune --tail=100
List all policies with status:
kubectl get rsp --all-namespaces -o wide
Inspect a specific policy in detail:
kubectl describe rsp <name>
Check operator metrics:
kubectl -n attune-system port-forward svc/attune-metrics 8080:8080 >/tmp/attune-metrics-pf.log 2>&1 &
PF_PID=$!
trap 'kill "$PF_PID" 2>/dev/null || true' EXIT
curl -s localhost:8080/metrics | grep attune