Core Concepts

Custom Resource Definitions

Attune introduces three CRDs:

AttunePolicy (namespaced, short name ap) is the primary resource. Each policy targets one or more workloads in a namespace, configures the recommendation parameters, and controls how resizes are applied.

AttuneDefaults (cluster-scoped, short name ad) sets global default values for metrics source, resource config, and update strategy.

AttuneNamespaceDefaults (namespaced, short name and) sets per-namespace defaults for policies in the same namespace. If a namespace has an AttuneNamespaceDefaults object, the controller uses it instead of the cluster-scoped AttuneDefaults. Fields omitted from the namespace object fall back to the operator's built-in defaults, not to the cluster-scoped object. If multiple defaults objects exist at one scope, the controller deterministically picks the one with the lexicographically smallest metadata.name.
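The precedence rules above can be sketched as follows. This is a minimal illustration, not the controller's code: the objects are plain dictionaries, and the field names and values in BUILTIN_DEFAULTS are hypothetical placeholders for whatever the operator ships with.

```python
from typing import Optional

# Hypothetical built-in defaults; the real values live in the operator.
BUILTIN_DEFAULTS = {"updateMode": "Recommend", "percentile": 95}


def pick_defaults(objects: list[dict]) -> Optional[dict]:
    """Return the object with the lexicographically smallest metadata.name."""
    if not objects:
        return None
    return min(objects, key=lambda o: o["metadata"]["name"])


def resolve_defaults(namespace_defaults: list[dict],
                     cluster_defaults: list[dict]) -> dict:
    """Pick one defaults object and fill the gaps from the built-ins.

    A namespace-scoped object replaces the cluster-scoped one entirely,
    so fields it omits come from the built-ins, not from the cluster object.
    """
    chosen = pick_defaults(namespace_defaults) or pick_defaults(cluster_defaults)
    spec = chosen["spec"] if chosen else {}
    return {**BUILTIN_DEFAULTS, **spec}
```

Note that a namespace object that sets only one field still hides every field of the cluster-scoped object.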

Update modes

Modes are graduated from safe observation to full automation:

Mode	Reads metrics	Writes recommendations	Resizes pods
Observe	Yes	No (data collection only)	No
Recommend	Yes	Yes (status only)	No
OneShot	Yes	Yes	One pod per cycle
Canary	Yes	Yes	A percentage of pods, then the rest after observation
Auto	Yes	Yes	All eligible pods

Observe vs Recommend

Observe collects metrics and tracks data-point progress but does not surface recommendations or savings estimates. Use it as a zero-footprint warm-up phase. Switch to Recommend when you want to see what the operator would suggest.

Batch workloads (Job / CronJob)

Job and CronJob are supported as targetRef.kind values. Batch workloads are always recommend-only regardless of the mode setting, since completed pods cannot be resized in place. Use the recommendations to update your Job/CronJob template for future runs.
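The mode table and the batch rule can be condensed into two predicates. This is a sketch for illustration; the function names are hypothetical and only the mode and kind names come from the text above.

```python
BATCH_KINDS = {"Job", "CronJob"}
RESIZING_MODES = {"OneShot", "Canary", "Auto"}


def writes_recommendations(mode: str) -> bool:
    """Every mode except Observe surfaces recommendations."""
    return mode != "Observe"


def may_resize(target_kind: str, mode: str) -> bool:
    """Pods are resized only in a resizing mode and never for batch kinds."""
    return mode in RESIZING_MODES and target_kind not in BATCH_KINDS
```

For example, `may_resize("CronJob", "Auto")` is false even though Auto normally resizes all eligible pods.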

Warning

Start with Recommend in production. Promote to Canary only after reviewing recommendations and verifying confidence scores.

The estimator chain

Recommendations are produced by a chain of composable estimators. Each stage wraps the previous one:

flowchart LR
  A[Percentile] --> B[Margin]
  B --> C[Burst]
  C --> D[Confidence]
  D --> E[Bounds]
  E --> F[Change Filter]

Percentile selects the configured percentile (e.g. p95) from 24 hourly buckets and takes the maximum across all hours.
Margin multiplies by a safety factor (e.g. 1.2 for 20% headroom).
Burst applies extra headroom when max greatly exceeds the selected percentile.
Confidence widens the recommendation when data is sparse.
Bounds clamps the result to user-defined min/max values.
Change Filter suppresses changes below 10% and caps changes above the configured maximum percentage per cycle.

See Algorithm for formulas and details.
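To make the chain concrete, here is a rough sketch of the six stages as plain functions. The Percentile, Margin, Bounds, and Change Filter stages follow the descriptions above; the Burst and Confidence rules, and every default parameter (burst ratio, widening factor, required data points, 50% change cap, bounds), are assumptions made for illustration. The Algorithm page remains the reference for the real formulas.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def peak_hourly_percentile(buckets, pct):
    """Percentile within each hourly bucket, then the maximum across hours."""
    return max(percentile(b, pct) for b in buckets if b)


def apply_margin(value, factor=1.2):
    """Multiply by a safety factor (1.2 gives 20% headroom)."""
    return value * factor


def apply_burst(value, base, observed_max, ratio=2.0, extra=0.1):
    """Assumed rule: add `extra` headroom when max exceeds ratio x percentile."""
    return value * (1 + extra) if observed_max > ratio * base else value


def apply_confidence(value, data_points, required=168, max_widen=0.25):
    """Assumed rule: widen linearly as the data gets sparser."""
    confidence = min(1.0, data_points / required)
    return value * (1 + max_widen * (1 - confidence))


def clamp(value, lo, hi):
    """Keep the value inside the user-defined bounds."""
    return max(lo, min(hi, value))


def change_filter(current, target, min_change=0.10, max_change=0.50):
    """Suppress changes below min_change; cap changes above max_change."""
    delta = (target - current) / current
    if abs(delta) < min_change:
        return current
    if abs(delta) > max_change:
        return current * (1 + math.copysign(max_change, delta))
    return target


def recommend(buckets, current, pct=95, lo=0.05, hi=8.0):
    """Run all six stages in order for one resource (for example CPU cores)."""
    base = peak_hourly_percentile(buckets, pct)
    observed_max = max(max(b) for b in buckets if b)
    data_points = sum(len(b) for b in buckets)
    value = apply_margin(base)
    value = apply_burst(value, base, observed_max)
    value = apply_confidence(value, data_points)
    value = clamp(value, lo, hi)
    return change_filter(current, value)
```

With a current request of 1.0 core and a peak hourly p95 of 0.2 cores, the chain computes a target of 0.24 cores, and the change filter then limits this cycle's step to 0.5 cores under the assumed 50% cap.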

In-Place Pod Resize

Kubernetes 1.32 added the /resize subresource for in-place pod resize (alpha, behind the InPlacePodVerticalScaling feature gate). Kubernetes 1.33 graduated the feature to beta, enabled by default. The kubelet adjusts cgroup limits without restarting the container. Attune calls UpdateResize on each pod, then polls the container status until the new resources are reported or an Infeasible condition appears.
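The polling step can be pictured as a small loop. The sketch below is not the operator's implementation: the status dictionary shape (`resources`, `resizeReason`) is a hypothetical simplification of the pod status, the status fetcher is injected so that no cluster is needed, and the timeout is an assumption that the text above does not mention.

```python
import time
from typing import Callable


def wait_for_resize(get_status: Callable[[], dict],
                    desired: dict,
                    timeout_s: float = 60.0,
                    interval_s: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep,
                    clock: Callable[[], float] = time.monotonic) -> str:
    """Poll until the container reports the desired resources.

    Returns "Resized", "Infeasible", or "TimedOut" (the timeout is assumed).
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status.get("resources") == desired:
            return "Resized"
        if status.get("resizeReason") == "Infeasible":
            return "Infeasible"
        if clock() >= deadline:
            return "TimedOut"
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the loop easy to exercise in a unit test without real waiting.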

QoS class preservation

The operator refuses a resize if it would change the pod's QoS class. For Guaranteed pods, requests must always equal limits.
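A simplified version of this check, assuming containers are represented as dictionaries with `requests` and `limits` maps. It follows the standard Kubernetes QoS rules for CPU and memory but compares quantities as raw strings, so "500m" and "0.5" would wrongly count as different; a real implementation would parse quantities first.

```python
def qos_class(containers: list[dict]) -> str:
    """Simplified QoS classification over CPU and memory only."""
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for res in ("cpu", "memory"):
            limit = limits.get(res)
            # Kubernetes defaults a missing request to the limit.
            request = requests.get(res, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


def preserves_qos(current: list[dict], proposed: list[dict]) -> bool:
    """A resize is acceptable only if the QoS class stays the same."""
    return qos_class(current) == qos_class(proposed)
```

Raising only the CPU request of a Guaranteed pod, without raising the limit to match, would turn it Burstable and is therefore rejected by this check.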

Safety system

Every resize is guarded by the safety monitor:

OOMKill detection: reverts if the container is OOMKilled after resize.
CPU throttle detection: reverts if the CPU throttle ratio exceeds 50% post-resize (queries Prometheus for container_cpu_cfs_throttled_periods_total).
Restart spike: reverts if the container restarts 2+ times post-resize.
NotReady detection: reverts if the pod loses its Ready condition.
Exponential backoff: consecutive reverts double the cooldown (capped at 16x).
LimitRange/ResourceQuota guard: skips resizes that would violate namespace LimitRange min/max constraints or exceed ResourceQuota headroom.
Degraded condition: when 3+ of the last 5 resizes are reverted, the controller sets a Degraded condition with reason HighRevertRate.
Kubernetes Events: emits Normal/Resized and Warning/Reverted events on the policy for visibility via kubectl describe.
Auto-revert: when enabled (default), the operator restores the original resources via the /resize subresource.

Cooldown enforcement prevents repeated resize attempts in quick succession. See Safety System for the full design.
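Several of the numeric rules above fit in a few lines. The thresholds (50% throttle ratio, 2 restarts, 3 of the last 5, 16x cap) come from the list; the function names are hypothetical, and treating the first revert as an immediate doubling is an assumption.

```python
def cooldown_seconds(base: float, consecutive_reverts: int) -> float:
    """Double the cooldown per consecutive revert, capped at 16x the base."""
    return base * min(2 ** consecutive_reverts, 16)


def throttle_ratio(throttled_periods: float, total_periods: float) -> float:
    """Share of CFS periods in which the container was throttled."""
    return throttled_periods / total_periods if total_periods else 0.0


def should_revert(oom_killed: bool, throttle: float,
                  restarts: int, ready: bool) -> bool:
    """Any one of the post-resize signals triggers a revert."""
    return oom_killed or throttle > 0.5 or restarts >= 2 or not ready


def is_degraded(reverted: list[bool]) -> bool:
    """reverted holds one flag per resize, newest last; True means reverted."""
    return sum(reverted[-5:]) >= 3
```

With a 300-second base cooldown, four consecutive reverts yield 4800 seconds, and further reverts stay at that cap.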

Cost savings estimation

The operator computes EstimatedMonthlySavings based on the difference between current and recommended resource requests. Pricing is configurable via AttuneDefaults or AttuneNamespaceDefaults:

spec:
  costPricing:
    cpuPerCoreHour: "0.031"     # default: $0.031
    memoryPerGiBHour: "0.004"   # default: $0.004

The formula is: (cpuCoresSaved * cpuPrice + memGiBSaved * memPrice) * 730 hours/month. View savings via kubectl attune savings or the Grafana dashboard.
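As a worked example of the formula, the sketch below uses the default prices shown above. The function name is illustrative and not part of the operator's API.

```python
HOURS_PER_MONTH = 730


def estimated_monthly_savings(cpu_cores_saved: float,
                              mem_gib_saved: float,
                              cpu_price: float = 0.031,
                              mem_price: float = 0.004) -> float:
    """(cpuCoresSaved * cpuPrice + memGiBSaved * memPrice) * 730 hours/month."""
    hourly = cpu_cores_saved * cpu_price + mem_gib_saved * mem_price
    return hourly * HOURS_PER_MONTH


# Freeing 0.5 cores and 1 GiB: (0.5 * 0.031 + 1 * 0.004) * 730
savings = estimated_monthly_savings(0.5, 1.0)
```

Here the hourly saving is 0.0195, so the monthly estimate is about $14.24.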

Multi-container support

By default, the operator computes recommendations for every container in a pod, except well-known mesh and sidecar names (istio-proxy, linkerd-proxy, consul-dataplane, kuma-dp, vault-agent, Cloud SQL proxy names, and similar). That list is applied when excludeKnownSidecars is true (the default). Add more names with excludedContainers (union with the known list).

spec:
  # excludeKnownSidecars: true  # default
  excludedContainers:
    - my-company-agent   # extra skips beyond the known list

To include known sidecars in right-sizing again (the behavior before automatic exclusion was introduced):

spec:
  excludeKnownSidecars: false
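The way the two settings combine can be sketched as a set union. The KNOWN_SIDECARS set below contains only the names listed in this section and is deliberately incomplete; the function name is hypothetical.

```python
# Partial list taken from the text above; the operator's list is longer.
KNOWN_SIDECARS = {
    "istio-proxy", "linkerd-proxy", "consul-dataplane", "kuma-dp", "vault-agent",
}


def containers_to_size(names, exclude_known_sidecars=True, excluded_containers=()):
    """Return the container names that still get recommendations."""
    skip = set(excluded_containers)
    if exclude_known_sidecars:
        skip |= KNOWN_SIDECARS  # union with the user-supplied names
    return [n for n in names if n not in skip]
```

Setting excludeKnownSidecars to false leaves excludedContainers in effect, so explicitly named containers are still skipped.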

Before executing a resize, the operator also checks that the total resource requests across all containers (with the new target applied) do not exceed the node's allocatable resources.
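A literal reading of this check, as a sketch: requests are summed across the pod's containers with the new target substituted in, then compared with the node's allocatable amounts. The units (cores, GiB) and the dictionary layout are assumptions, and whether the operator also accounts for other pods on the node is not stated here, so this sketch does not.

```python
def fits_on_node(container_requests: dict, target_container: str,
                 target: dict, allocatable: dict) -> bool:
    """Check pod-level request totals against node allocatable resources."""
    # Substitute the proposed requests for the container being resized.
    updated = {**container_requests, target_container: target}
    for resource, capacity in allocatable.items():
        total = sum(r.get(resource, 0.0) for r in updated.values())
        if total > capacity:
            return False
    return True
```

For a pod with a 0.5-core sidecar on a node with 4 allocatable cores, a 3-core target for the main container passes, while a 3.6-core target does not.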

Prometheus auto-discovery

The Prometheus address is resolved in order:

1. spec.metricsSource.prometheus.address on the AttunePolicy
2. spec.metricsSource.prometheus.address on an AttuneNamespaceDefaults resource in the same namespace
3. spec.metricsSource.prometheus.address on an AttuneDefaults resource
4. Auto-discovery: Prometheus Operator CRD (monitoring.coreos.com/v1 Prometheus)
5. Auto-discovery: well-known service names (prometheus-server, prometheus-kube-prometheus-prometheus) in common namespaces

If all five fail, the policy enters PrometheusUnavailable status.
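The resolution order amounts to a first-match search. In this sketch the three explicit addresses are passed in as optional strings and the two auto-discovery steps are injected as callables; all names are hypothetical, and the discovery logic itself is left out.

```python
from typing import Callable, Optional, Sequence


def resolve_prometheus_address(
    policy_addr: Optional[str],
    namespace_defaults_addr: Optional[str],
    cluster_defaults_addr: Optional[str],
    discoverers: Sequence[Callable[[], Optional[str]]] = (),
) -> Optional[str]:
    """Return the first address found, or None if every source fails."""
    for addr in (policy_addr, namespace_defaults_addr, cluster_defaults_addr):
        if addr:
            return addr
    # Discovery runs only when no explicit address is configured.
    for discover in discoverers:
        addr = discover()
        if addr:
            return addr
    return None  # the caller would report PrometheusUnavailable
```

Because discovery callables are invoked lazily, an explicit address on the policy short-circuits any cluster lookups.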

Conflict detection

The operator detects:

VPA conflicts: warns when a VPA targets the same workload.
HPA coexistence: logs a notice and adjusts only requests (not replicas).
Policy overlap: higher-weight policies take precedence when multiple AttunePolicies match the same workload.
Active rollouts: skips resizing during an in-progress deployment rollout.
Opt-out annotation: workloads with attune.io/skip: "true" are ignored.
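Three of these checks are simple enough to sketch directly. Policies are modeled as dictionaries with `name` and `weight` keys, which is an assumed shape, and breaking weight ties by name is an assumption that the list above does not specify.

```python
from typing import Optional

SKIP_ANNOTATION = "attune.io/skip"


def is_opted_out(annotations: dict) -> bool:
    """Workloads annotated attune.io/skip: "true" are ignored."""
    return annotations.get(SKIP_ANNOTATION) == "true"


def should_skip(annotations: dict, rollout_in_progress: bool) -> bool:
    """Skip opted-out workloads and workloads in the middle of a rollout."""
    return is_opted_out(annotations) or rollout_in_progress


def winning_policy(policies: list[dict]) -> Optional[dict]:
    """Highest weight wins; ties fall back to the smallest name (assumed)."""
    if not policies:
        return None
    return min(policies, key=lambda p: (-p["weight"], p["name"]))
```

Only the exact string "true" opts a workload out in this sketch, so a value such as "yes" would not.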