Monitoring Your Cluster After Enablement

After successfully setting up Cluster Orchestrator, you gain access to monitoring screens and dashboards that provide real-time insights into your cluster's performance, cost, and optimization opportunities. These dashboards are designed to help you track the effectiveness of your optimization settings and make data-driven decisions about your infrastructure.

Cluster Orchestrator provides four specialized views, each focusing on different aspects of your cluster:

1. Overview

This is your central page for monitoring overall cluster health, performance, and cost metrics. The dashboard is divided into several key sections:

  • Cluster Spend: Track your total cluster costs over time, with breakdowns by instance type and spot vs. on-demand usage
  • Cluster Details: View essential information about your cluster, including name, region, and identifier
  • Nodes Breakdown: Visualize your node distribution by fulfillment method (spot vs. on-demand)
  • CPU Breakdown: Track CPU allocation, usage, and available capacity across your cluster
  • Memory Breakdown: Monitor memory allocation, usage, and available capacity
  • Pod Distribution: See how many pods are scheduled on spot nodes, scheduled on on-demand nodes, or unscheduled

2. Workloads screen

This view focuses on the applications running in your cluster, helping you identify optimization opportunities at the workload level:

  • Namespace Organization: View workloads grouped by namespace for logical organization
  • Replica Count: Track the number of replicas for each workload

3. Nodes screen

This view provides insights into your cluster's infrastructure. The table displays the following information for each node:

Column | Description
Node Name | The full hostname of the node
Workloads | Number of workloads running on the node (e.g., 11)
Instance Type | The AWS instance type (e.g., m5.2xlarge)
Fulfillment | Whether the node is running as spot or on-demand
CPU | Current CPU usage and total capacity (e.g., 7.91/8)
Memory (GiB) | Current memory usage and total capacity (e.g., 29.92/30.89)
Age | How long the node has been running (e.g., 2h)
Status | Current node status (e.g., Ready or not ready)
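
The CPU and Memory columns show values as used/capacity pairs. As a rough illustration of how to read them, the sketch below converts the two example values from the table into utilization percentages. The `Usage` class and `parse_usage` helper are hypothetical names introduced only for this example; they are not part of Cluster Orchestrator.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """A used/capacity pair as shown in the CPU and Memory columns."""
    used: float
    capacity: float

    @property
    def percent(self) -> float:
        # Return 0 for a zero capacity instead of dividing by it.
        return 100.0 * self.used / self.capacity if self.capacity else 0.0


def parse_usage(cell: str) -> Usage:
    """Parse a 'used/capacity' cell such as '7.91/8' (hypothetical helper)."""
    used, capacity = cell.split("/")
    return Usage(float(used), float(capacity))


cpu = parse_usage("7.91/8")          # example CPU value from the table
memory = parse_usage("29.92/30.89")  # example Memory (GiB) value from the table

print(f"CPU: {cpu.percent:.1f}%")        # CPU: 98.9%
print(f"Memory: {memory.percent:.1f}%")  # Memory: 96.9%
```

A node reading close to 100% on both columns, as in this example, is tightly packed and has little headroom for additional pods.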

4. Logs

The Logs screen provides a chronological record of all cluster events and optimization activities. You can filter events by time period (e.g., Last 90 days) and event type.

Column | Description
Logged On | Timestamp when the event occurred (e.g., Apr 23, 2025 5:11 PM)
Event Type | Category of the event (see types below)
Event Details | Description of what happened
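
If you copy log rows out of the screen for your own analysis, the Logged On value can be parsed with a standard date format. The sketch below assumes timestamps always follow the pattern of the example above (Apr 23, 2025 5:11 PM) and reproduces a "Last 90 days" style filter locally. The function names and the `now` value are illustrative assumptions, not a Cluster Orchestrator API.

```python
from datetime import datetime, timedelta

# Assumed to match the example "Apr 23, 2025 5:11 PM".
# %b relies on English month abbreviations (the default C locale).
LOG_TIME_FORMAT = "%b %d, %Y %I:%M %p"


def parse_logged_on(text: str) -> datetime:
    """Parse a Logged On value into a naive datetime (hypothetical helper)."""
    return datetime.strptime(text, LOG_TIME_FORMAT)


def within_last_days(logged_on: datetime, days: int, now: datetime) -> bool:
    """Return True if the event falls inside the trailing window ending at now."""
    return now - timedelta(days=days) <= logged_on <= now


ts = parse_logged_on("Apr 23, 2025 5:11 PM")
print(ts.isoformat())  # 2025-04-23T17:11:00

now = datetime(2025, 6, 1, 12, 0)      # fixed reference time for the example
print(within_last_days(ts, 90, now))   # True
```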

Event Types

The logs track several types of events that help you understand cluster activity:

  • NodeProvisioned: When new instances are launched (e.g., "Instance i-0138fc81dc9c82e31 launched")
  • ScaleUpPlan: When the system generates plans to scale up resources (e.g., "Generated scale-up plans for 5 launch templates")
  • EvictPodsFromCandidateNode: When pods are evicted as part of optimization activities
  • Fallback: When workloads transition from spot to on-demand instances because spot capacity is unavailable
  • ReverseFallback: When workloads transition back from on-demand to spot instances
  • InstanceReplacementDueToInterruption: When instances are replaced due to spot interruptions
  • SpotInterruptionNotice: When AWS sends a notification that a spot instance will be interrupted