Metrics

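All metrics are exposed on the operator's metrics endpoint (default port 8080). Operator-defined metrics use the attune_ prefix; the controller-runtime workqueue metrics listed further down keep their standard workqueue_ prefix.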

Counters

attune_resize_total

Total number of in-place resize operations performed.

Label Description
namespace Workload namespace
workload Workload name
resource cpu or memory
result success, failed, or reverted
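
As a rough sketch of how the resource and result labels can be combined, the queries below break resize outcomes down by label and estimate the share of attempts per workload that did not succeed. The 24h and 1h windows are illustrative choices, not recommended values.

```promql
# Resize outcomes over the last 24 hours, split by resource and result
sum by (resource, result) (increase(attune_resize_total[24h]))

# Share of resize attempts per workload that ended in failed or reverted
sum by (namespace, workload) (rate(attune_resize_total{result!="success"}[1h]))
/
sum by (namespace, workload) (rate(attune_resize_total[1h]))
```

The second query returns no series for workloads that had no resize attempts in the window.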

attune_reverts_total

Total number of resize reverts triggered by the safety monitor.

Label Description
namespace Workload namespace
workload Workload name
reason oomkill, throttle, restart, notready, re-fetch-failed, or annotation-persist-failed
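
One possible way to use the reason label is to see which causes dominate and which workloads revert most often. The window and the topk limit of 5 are arbitrary values chosen for illustration.

```promql
# Reverts over the last 24 hours, grouped by cause
sum by (reason) (increase(attune_reverts_total[24h]))

# The five workloads with the most reverts in the same window
topk(5, sum by (namespace, workload) (increase(attune_reverts_total[24h])))
```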

attune_prometheus_query_errors_total

Total number of failed Prometheus queries.

Label Description
namespace Workload namespace where the query originated
query_type cpu_grouped or memory_grouped

attune_reconcile_errors_total

Total number of reconciliation errors by type.

Label Description
error_type fetch, fetch_defaults, prometheus_config, collector_options, collector_create, discover_workloads, get_pods, compute_recommendations, status_update, or safety_observation
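
Because this counter carries only the error_type label, a per-type rate is usually the most direct view. A minimal sketch, with an illustrative 5m window:

```promql
# Reconciliation errors per second, by type; only types that are currently erroring are returned
sum by (error_type) (rate(attune_reconcile_errors_total[5m])) > 0
```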

attune_webhook_validation_total

Total number of webhook admission decisions.

Label Description
operation validate_create, validate_update, defaulting, defaults_validate_create, defaults_validate_update, namespace_defaults_validate_create, or namespace_defaults_validate_update
result allowed or rejected
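
The result label makes it possible to estimate how often each webhook operation rejects a request. The following is a sketch only; the 15m window is an assumption and may need widening on clusters with little admission traffic.

```promql
# Fraction of admission decisions that were rejections, per operation
sum by (operation) (rate(attune_webhook_validation_total{result="rejected"}[15m]))
/
sum by (operation) (rate(attune_webhook_validation_total[15m]))
```

Operations with no rejections in the window produce no series on the left-hand side and therefore drop out of the result.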

attune_schedule_skipped_total

Total resize cycles skipped because the current time is outside the configured schedule window.

Label Description
namespace Policy namespace
policy Policy name

attune_budget_exhausted_total

Total resize operations deferred because the per-cycle budget cap (maxTotalCpuIncrease / maxTotalMemoryIncrease) was exhausted.

Label Description
namespace Policy namespace
policy Policy name

attune_eviction_total

Total eviction attempts when resizeMethod: InPlaceOrRecreate falls back to pod eviction after an in-place resize fails or is marked Infeasible.

Label Description
namespace Workload namespace
workload Workload name
result success or denied

attune_throttle_deferred_total

Total number of throttle safety checks deferred because the grace period for the Prometheus rate window has not yet elapsed. The counter is incremented when a pod has been observed for less than 5 minutes, which is too little data for a reliable throttle check.

Label Description
namespace Workload namespace
workload Workload name

attune_startup_boost_total

Total startup boost lifecycle events (boost applied, expired, or failed).

Label Description
namespace Workload namespace
workload Workload name
action applied, expired, or failed

attune_infeasible_skipped_total

Total pods skipped because kubelet marked the in-place resize as Infeasible and resizeMethod is InPlaceOnly (no eviction fallback).

Label Description
namespace Workload namespace
workload Workload name

attune_stale_recommendations_total

Total times recommendations were marked stale due to Prometheus data gaps.

Label Description
namespace Policy namespace
policy Policy name

Gauges

attune_recommendation_cpu_cores

Recommended CPU cores for each workload container.

Label Description
namespace Workload namespace
workload Workload name
container Container name

attune_recommendation_memory_bytes

Recommended memory (bytes) for each workload container.

Label Description
namespace Workload namespace
workload Workload name
container Container name
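
Since both recommendation gauges are reported per container, a workload-level view needs an aggregation over the container label. A minimal sketch:

```promql
# Recommended CPU cores per workload, summed over its containers
sum by (namespace, workload) (attune_recommendation_cpu_cores)

# Recommended memory per workload, converted from bytes to GiB
sum by (namespace, workload) (attune_recommendation_memory_bytes) / 2^30
```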

attune_savings_cpu_cores_total

Total CPU cores saved per namespace.

Label Description
namespace Namespace

attune_savings_memory_bytes_total

Total memory bytes saved per namespace.

Label Description
namespace Namespace

attune_savings_estimated_monthly_dollars

Estimated monthly cost savings in USD per namespace, computed from configured or default pricing ($0.031/vCPU-hour, $0.004/GiB-hour).

Label Description
namespace Namespace
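
To illustrate how the default pricing relates to the two savings gauges, the query below recomputes an approximate monthly figure from them. It assumes 730 hours per month, which is not stated in this reference, and it uses the default prices quoted above; the operator's own calculation (for example with configured pricing) may produce a different value.

```promql
# Approximate monthly savings in USD per namespace
# Assumptions: 730 hours per month, default prices of $0.031/vCPU-hour and $0.004/GiB-hour
  sum by (namespace) (attune_savings_cpu_cores_total) * 0.031 * 730
+
  sum by (namespace) (attune_savings_memory_bytes_total) / 2^30 * 0.004 * 730
```

As a worked example under the same assumptions, 10 saved cores and 20 GiB of saved memory come to 10 × 0.031 × 730 + 20 × 0.004 × 730 = 226.30 + 58.40 = 284.70 USD per month. Namespaces that report only one of the two gauges are dropped by the addition.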

attune_confidence

Recommendation confidence score (0-1) per workload container.

Label Description
namespace Workload namespace
workload Workload name
container Container name

attune_burst_factor

Burst detection multiplier applied to recommendations. A value of 1.0 means no burst detected; values above 1.0 indicate the recommendation was inflated to accommodate a detected usage burst.

Label Description
namespace Workload namespace
workload Workload name
container Container name
resource cpu or memory
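
Given that 1.0 means no burst was detected, filtering on values above 1 is one simple way to find inflated recommendations. A sketch:

```promql
# Containers whose recommendation is currently inflated by burst detection
attune_burst_factor > 1

# Number of inflated container/resource pairs per namespace
count by (namespace) (attune_burst_factor > 1)
```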

Histograms

attune_resize_duration_seconds

Duration of individual pod resize operations.

Label Description
namespace Workload namespace
workload Workload name

attune_reconcile_duration_seconds

Duration of each reconciliation loop.

Label Description
controller Controller name
namespace Policy namespace
policy Policy name

attune_prometheus_query_duration_seconds

Duration of each Prometheus query.

Label Description
query_type cpu_grouped or memory_grouped
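
One way to compare the two query types is their mean latency, derived from the histogram's _sum and _count series. A minimal sketch with an illustrative 5m window:

```promql
# Mean Prometheus query duration in seconds, per query type
sum by (query_type) (rate(attune_prometheus_query_duration_seconds_sum[5m]))
/
sum by (query_type) (rate(attune_prometheus_query_duration_seconds_count[5m]))
```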

attune_webhook_duration_seconds

Duration of webhook validation and defaulting operations.

Label Description
operation validate_create, validate_update, defaulting, defaults_validate_create, defaults_validate_update, namespace_defaults_validate_create, or namespace_defaults_validate_update

Controller-runtime Workqueue Metrics

These metrics are auto-registered by controller-runtime for the AttunePolicy reconciler workqueue. They are critical for diagnosing reconcile backlog and throughput at scale.

workqueue_depth

Current depth of the reconcile workqueue (number of items waiting to be processed).

Label Description
name Queue name (e.g. attunepolicy)

workqueue_adds_total

Total number of items added to the workqueue.

Label Description
name Queue name

workqueue_queue_duration_seconds

Time an item spends waiting in the queue before being processed (histogram).

Label Description
name Queue name

workqueue_work_duration_seconds

Time spent processing an item from the queue (histogram).

Label Description
name Queue name

workqueue_retries_total

Total number of item retries (requeue after error).

Label Description
name Queue name

workqueue_longest_running_processor_seconds

Duration of the longest currently running processor (gauge). A sustained high value indicates a stuck or very slow reconciliation.

Label Description
name Queue name

workqueue_unfinished_work_seconds

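Total seconds of work that is currently in progress and has not yet been recorded by workqueue_work_duration_seconds (gauge). Complements workqueue_longest_running_processor_seconds: a large or steadily growing value suggests stuck workers.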

Label Description
name Queue name
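
The gauges and counters above can be combined to flag a stuck reconciler or a queue dominated by retries. The following is a sketch only; the 60-second threshold and 5m window are assumptions and should be tuned to typical reconcile times.

```promql
# A reconcile that has been running for more than 60 seconds
workqueue_longest_running_processor_seconds{name="attunepolicy"} > 60

# Retries relative to all items added to the queue
rate(workqueue_retries_total{name="attunepolicy"}[5m])
/
rate(workqueue_adds_total{name="attunepolicy"}[5m])
```

A ratio that stays high over time suggests that most reconcile work is re-processing items that previously failed.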

Example PromQL queries

Total successful in-place resizes in the last 24 hours:

sum(increase(attune_resize_total{result="success"}[24h]))

Total successful eviction fallbacks in the last 24 hours:

sum(increase(attune_eviction_total{result="success"}[24h]))

Revert rate as a percentage of successful in-place resizes:

sum(rate(attune_reverts_total[1h]))
/
sum(rate(attune_resize_total{result="success"}[1h]))
* 100

Total CPU cores saved cluster-wide:

sum(attune_savings_cpu_cores_total)

Low-confidence recommendations (below 0.5):

attune_confidence < 0.5

P99 reconciliation latency:

histogram_quantile(0.99, rate(attune_reconcile_duration_seconds_bucket[5m]))

Prometheus query error rate:

sum(rate(attune_prometheus_query_errors_total[5m]))

Reconcile queue depth (backlog indicator):

workqueue_depth{name="attunepolicy"}

P99 time items wait in the queue before processing:

histogram_quantile(0.99, rate(workqueue_queue_duration_seconds_bucket{name="attunepolicy"}[5m]))
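
If a mean is wanted rather than a percentile, it can be derived from the same histogram's _sum and _count series. A minimal sketch:

```promql
# Mean time in seconds that items wait in the queue before processing
rate(workqueue_queue_duration_seconds_sum{name="attunepolicy"}[5m])
/
rate(workqueue_queue_duration_seconds_count{name="attunepolicy"}[5m])
```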

Reconcile enqueue rate (items added to queue per second):

rate(workqueue_adds_total{name="attunepolicy"}[5m])