Core Concepts
Custom Resource Definitions
Attune introduces three CRDs:
AttunePolicy (namespaced, short name rsp) is the primary resource.
Each policy targets one or more workloads in a namespace, configures the
recommendation parameters, and controls how resizes are applied.
AttuneDefaults (cluster-scoped, short name rsd) sets global default
values for metrics source, resource config, and update strategy.
AttuneNamespaceDefaults (namespaced, short name rsnd) sets
per-namespace defaults for policies in the same namespace. If a namespace
has an AttuneNamespaceDefaults, the controller uses it instead of the
cluster-scoped AttuneDefaults. Fields omitted there fall back to the
operator's built-in defaults. If multiple defaults objects exist at one
scope, the controller deterministically picks the lexicographically
smallest metadata.name.
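The precedence rules above can be sketched in a few lines. This is a minimal illustration, not the controller's code: objects are modelled as plain dicts, the function names and the BUILT_IN values are hypothetical, and it reads "fall back to the operator's built-in defaults" as meaning that a namespace-scoped object replaces the cluster-scoped one wholesale.

```python
from typing import Optional

# Hypothetical built-in defaults; the real values live in the operator.
BUILT_IN = {"metricsSource": "prometheus", "updateMode": "Recommend"}


def pick(objects: list) -> Optional[dict]:
    """Return the object with the lexicographically smallest name, if any."""
    return min(objects, key=lambda o: o["name"]) if objects else None


def effective_defaults(namespace_defaults: list, cluster_defaults: list) -> dict:
    """Prefer the namespace scope; use the cluster scope only if it is empty."""
    chosen = pick(namespace_defaults) or pick(cluster_defaults)
    spec = chosen["spec"] if chosen else {}
    # Omitted fields fall back to the built-ins, not to the other scope.
    return {**BUILT_IN, **spec}
```

With two namespace-scoped objects named a and b, the one named a wins, and any field it omits comes from BUILT_IN even if the cluster-scoped object sets it.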
Update modes
Modes are graduated from safe observation to full automation:
| Mode | Reads metrics | Writes recommendations | Resizes pods |
|---|---|---|---|
| Observe | Yes | No (data collection only) | No |
| Recommend | Yes | Yes (status only) | No |
| OneShot | Yes | Yes | One pod per cycle |
| Canary | Yes | Yes | A percentage of pods, then the rest after observation |
| Auto | Yes | Yes | All eligible pods |
Observe vs Recommend
Observe collects metrics and tracks data-point progress but does not
surface recommendations or savings estimates. Use it as a zero-footprint
warm-up phase. Switch to Recommend when you want to see what the
operator would suggest.
Batch workloads (Job / CronJob)
Jobs and CronJobs are supported as targetRef.kind values. Batch
workloads are always recommend-only regardless of the mode setting,
since completed pods cannot be resized in-place. Use the recommendations
to update your Job/CronJob template for future runs.
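The mode table and the batch-workload rule can be combined into a single check. The sketch below is illustrative only: the Capabilities type and may_resize function are hypothetical names, and the table is encoded by hand rather than read from the operator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    writes_recommendations: bool
    resizes_pods: bool


# One entry per row of the mode table (every mode reads metrics).
MODES = {
    "Observe": Capabilities(writes_recommendations=False, resizes_pods=False),
    "Recommend": Capabilities(writes_recommendations=True, resizes_pods=False),
    "OneShot": Capabilities(writes_recommendations=True, resizes_pods=True),
    "Canary": Capabilities(writes_recommendations=True, resizes_pods=True),
    "Auto": Capabilities(writes_recommendations=True, resizes_pods=True),
}

BATCH_KINDS = {"Job", "CronJob"}


def may_resize(kind: str, mode: str) -> bool:
    """Batch workloads are never resized, whatever the configured mode."""
    return MODES[mode].resizes_pods and kind not in BATCH_KINDS
```

For example, may_resize("Deployment", "Auto") is true, while may_resize("CronJob", "Auto") is false.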
Warning
Start with Recommend in production. Promote to Canary only after reviewing recommendations and verifying confidence scores.
The estimator chain
Recommendations are produced by a chain of composable estimators. Each stage wraps the previous one:
flowchart LR
A[Percentile] --> B[Margin]
B --> C[Burst]
C --> D[Confidence]
D --> E[Bounds]
E --> F[Change Filter]
- Percentile selects the configured percentile (e.g. p95) from 24 hourly buckets and takes the maximum across all hours.
- Margin multiplies by a safety factor (e.g. 1.2 for 20% headroom).
- Burst applies extra headroom when max greatly exceeds the selected percentile.
- Confidence widens the recommendation when data is sparse.
- Bounds clamps the result to user-defined min/max values.
- Change Filter suppresses changes below 10% and caps changes above the configured maximum percentage per cycle.
See Algorithm for formulas and details.
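To make the stage order concrete, here is a rough sketch of the chain as a single function. Only the stage order, the p95 and 1.2 examples, and the 10% suppression threshold come from the text above. Everything else is an assumption made for illustration: the nearest-rank percentile method, the form of the burst and confidence rules, all parameter names, and the 50% default cap. The Algorithm page remains the reference for the real formulas.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile of a non-empty list (q in 0..100). Assumed method."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def recommend(buckets, current, *, q=95, margin=1.2,
              burst_ratio=2.0, burst_headroom=0.1,    # assumed burst rule
              confidence=1.0, sparse_widening=0.5,    # assumed confidence rule
              lower=0.0, upper=math.inf,
              min_change=0.10, max_change=0.50):
    """buckets maps hour-of-day to usage samples; current must be > 0."""
    filled = [s for s in buckets.values() if s]
    # Percentile: per-hour percentile, then the maximum across hours.
    base = max(percentile(s, q) for s in filled)
    # Margin: multiply by the safety factor.
    value = base * margin
    # Burst (assumed form): extra headroom when max dwarfs the percentile.
    if max(max(s) for s in filled) > burst_ratio * base:
        value *= 1 + burst_headroom
    # Confidence (assumed form): widen as confidence drops below 1.0.
    value *= 1 + (1 - confidence) * sparse_widening
    # Bounds: clamp to the user-defined range.
    value = min(max(value, lower), upper)
    # Change Filter: suppress small changes, cap large ones.
    change = (value - current) / current
    if abs(change) < min_change:
        return current
    if abs(change) > max_change:
        return current * (1 + math.copysign(max_change, change))
    return value
```

Writing each stage as one step in a fixed sequence mirrors the "each stage wraps the previous one" description; a real implementation would more likely compose separate estimator objects.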
In-Place Pod Resize
Kubernetes 1.32 added the /resize subresource for in-place pod resize
(alpha, requires the InPlacePodVerticalScaling feature gate). Kubernetes 1.33 graduated the feature to
beta (enabled by default). The kubelet adjusts cgroup limits without
restarting the container. Attune calls UpdateResize on each pod,
then polls the container status until the new resources are reported or an
Infeasible condition appears.
QoS class preservation
The operator refuses a resize if it would change the pod's QoS class. For Guaranteed pods, requests must always equal limits.
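The QoS check can be illustrated with a simplified classifier. This sketch follows the standard Kubernetes rules for CPU and memory only, assumes quantities are already normalized to numbers (cores, bytes), and uses hypothetical function names; it is not the operator's implementation.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """containers: list of {"requests": {...}, "limits": {...}} dicts."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    for c in containers:
        limits = c.get("limits") or {}
        requests = c.get("requests") or {}
        for r in RESOURCES:
            # An omitted request defaults to the limit, as in Kubernetes.
            if r not in limits or requests.get(r, limits[r]) != limits[r]:
                return "Burstable"
    return "Guaranteed"


def preserves_qos(before, after):
    """A resize is acceptable only if the QoS class stays the same."""
    return qos_class(before) == qos_class(after)
```

Lowering only the CPU request of a Guaranteed pod would make it Burstable, so preserves_qos returns False for that change; the request and limit have to move together.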
Safety system
Every resize is guarded by the safety monitor:
- OOMKill detection: reverts if the container is OOMKilled after resize.
- CPU throttle detection: reverts if the CPU throttle ratio exceeds 50% post-resize (queries Prometheus for container_cpu_cfs_throttled_periods_total).
- Restart spike: reverts if the container restarts 2+ times post-resize.
- NotReady detection: reverts if the pod loses its Ready condition.
- Exponential backoff: consecutive reverts double the cooldown (capped at 16x).
- LimitRange/ResourceQuota guard: skips resizes that would violate namespace LimitRange min/max constraints or exceed ResourceQuota headroom.
- Degraded condition: when 3+ of the last 5 resizes are reverted, the controller sets a Degraded condition with reason HighRevertRate.
- Kubernetes Events: emits Normal/Resized and Warning/Reverted events on the policy for visibility via kubectl describe.
- Auto-revert: when enabled (default), the operator restores the original resources via the /resize subresource.
Cooldown enforcement prevents repeated resize attempts. See Safety System for the full design.
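The backoff and Degraded rules from the list above can be sketched as a small tracker. The class name, the base cooldown value, and the choice to reset the multiplier after a successful resize are assumptions for illustration; only the doubling, the 16x cap, and the 3-of-5 threshold come from the text.

```python
from collections import deque


class RevertTracker:
    def __init__(self, base_cooldown_s=300):
        self.base = base_cooldown_s        # hypothetical base cooldown
        self.consecutive_reverts = 0
        self.recent = deque(maxlen=5)      # outcomes of the last 5 resizes

    def record(self, reverted):
        """Record one resize outcome (True means it was reverted)."""
        self.recent.append(reverted)
        # Assumed: a successful resize resets the backoff multiplier.
        self.consecutive_reverts = self.consecutive_reverts + 1 if reverted else 0

    def cooldown(self):
        """Each consecutive revert doubles the cooldown, capped at 16x."""
        return self.base * min(2 ** self.consecutive_reverts, 16)

    def degraded(self):
        """True when 3 or more of the last 5 resizes were reverted."""
        return sum(self.recent) >= 3
```

With a 300-second base, four consecutive reverts already reach the 16x cap of 4800 seconds, and further reverts do not extend it.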
Cost savings estimation
The operator computes EstimatedMonthlySavings based on the difference
between current and recommended resource requests. Pricing is configurable
via AttuneDefaults or AttuneNamespaceDefaults:
spec:
  costPricing:
    cpuPerCoreHour: "0.031"   # default: $0.031
    memoryPerGiBHour: "0.004" # default: $0.004
The formula is: (cpuCoresSaved * cpuPrice + memGiBSaved * memPrice) * 730 hours/month.
View savings via kubectl attune savings or the Grafana dashboard.
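As a worked example of the formula, the snippet below plugs in the default prices. The function name is hypothetical; the constants are the defaults quoted above.

```python
HOURS_PER_MONTH = 730


def estimated_monthly_savings(cpu_cores_saved, mem_gib_saved,
                              cpu_price=0.031, mem_price=0.004):
    """(cpuCoresSaved * cpuPrice + memGiBSaved * memPrice) * 730."""
    hourly = cpu_cores_saved * cpu_price + mem_gib_saved * mem_price
    return hourly * HOURS_PER_MONTH


# Reducing requests by 0.5 cores and 1 GiB:
# (0.5 * 0.031 + 1 * 0.004) * 730 = 0.0195 * 730 = 14.235
savings = estimated_monthly_savings(0.5, 1.0)
```

At the default prices, trimming half a core and one GiB from a single pod is worth roughly 14 dollars a month, so the figure mostly becomes interesting when multiplied across replicas.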
Multi-container support
By default, the operator computes recommendations for every container in a pod.
For pods with sidecar containers managed by a service mesh (e.g., istio-proxy,
linkerd-proxy), use excludedContainers to skip them:
spec:
  excludedContainers:
    - istio-proxy
Before executing a resize, the operator also checks that the total resource requests across all containers (with the new target applied) do not exceed the node's allocatable resources.
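The allocatable check described above amounts to summing per-container requests with the new target substituted in. The sketch below implements exactly that comparison under illustrative assumptions: quantities are plain numbers, and the function and parameter names are hypothetical.

```python
def fits_on_node(requests, target, allocatable):
    """Check pod-level request totals against node allocatable.

    requests:    {container: {"cpu": cores, "memory": bytes}} as currently set
    target:      {container: {resource: new_value}} for containers being resized
    allocatable: {"cpu": cores, "memory": bytes} reported by the node
    """
    totals = {}
    for name, current in requests.items():
        # Apply the new target on top of the container's current requests.
        effective = {**current, **target.get(name, {})}
        for resource, amount in effective.items():
            totals[resource] = totals.get(resource, 0) + amount
    return all(amount <= allocatable.get(resource, 0)
               for resource, amount in totals.items())
```

Note that this compares only the pod's own total with the node's allocatable capacity, as the text states; it does not account for other pods already scheduled on the node.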
Prometheus auto-discovery
The Prometheus address is resolved in order:
1. spec.metricsSource.prometheus.address on the AttunePolicy
2. spec.metricsSource.prometheus.address on an AttuneNamespaceDefaults resource in the same namespace
3. spec.metricsSource.prometheus.address on an AttuneDefaults resource
4. Auto-discovery: Prometheus Operator CRD (monitoring.coreos.com/v1 Prometheus)
5. Auto-discovery: well-known service names (prometheus-server, prometheus-kube-prometheus-prometheus) in common namespaces
If all five fail, the policy enters PrometheusUnavailable status.
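The resolution order is a first-match-wins chain: explicit addresses first, then discovery. A minimal sketch, with a hypothetical function name and with the two auto-discovery steps passed in as callables so the example stays self-contained:

```python
def resolve_prometheus(policy_addr, namespace_defaults_addr,
                       cluster_defaults_addr, discoverers=()):
    """Return the first address found, or None if every source fails."""
    # Explicit configuration: policy, then namespace, then cluster scope.
    for addr in (policy_addr, namespace_defaults_addr, cluster_defaults_addr):
        if addr:
            return addr
    # Auto-discovery steps run only when nothing is configured explicitly.
    for discover in discoverers:
        addr = discover()
        if addr:
            return addr
    return None  # the caller would report PrometheusUnavailable
```

The address strings in any usage of this sketch, such as "http://prometheus-server.monitoring:9090", are placeholders and not values the operator is known to produce.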
Conflict detection
The operator detects:
- VPA conflicts: warns when a VPA targets the same workload.
- HPA coexistence: logs a notice and adjusts only requests (not replicas).
- Policy overlap: higher-weight policies take precedence when multiple AttunePolicies match the same workload.
- Active rollouts: skips resizing during an in-progress deployment rollout.
- Opt-out annotation: workloads with attune.io/skip: "true" are ignored.
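Three of these checks reduce to simple decisions, sketched below. The function names, the dict shapes, and the returned reason strings are hypothetical, and breaking weight ties by name is an assumption that the text above does not specify.

```python
SKIP_ANNOTATION = "attune.io/skip"


def winning_policy(policies):
    """Highest weight wins; ties fall back to the smallest name (assumed)."""
    return min(policies, key=lambda p: (-p["weight"], p["name"]))


def skip_reason(annotations, rollout_in_progress):
    """Return why a workload should not be resized now, or None."""
    if annotations.get(SKIP_ANNOTATION) == "true":
        return "OptOut"                 # hypothetical reason string
    if rollout_in_progress:
        return "RolloutInProgress"      # hypothetical reason string
    return None
```

Checking the opt-out annotation first means an opted-out workload is ignored regardless of its rollout state.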