# GKE Scaling

## Description
Scaling in GKE encompasses multiple dimensions: horizontal pod scaling (adding more pod replicas), vertical pod scaling (increasing pod resources), and cluster scaling (adding more nodes). GKE provides automated scaling mechanisms to handle variable workloads efficiently while optimizing costs.
Concept: Automatically adjust resources (pods and nodes) based on demand to maintain performance while optimizing costs.
## Types of Scaling

### Horizontal Pod Autoscaler (HPA)
Scales the number of pod replicas based on observed metrics.
### Vertical Pod Autoscaler (VPA)
Adjusts CPU and memory requests/limits for containers.
### Cluster Autoscaler
Adds or removes nodes based on pod resource requirements.
### Multidimensional Pod Autoscaler (MPA)

Scales pod replicas and pod resources together, for example horizontally on CPU while vertically on memory (a GKE-specific feature).
## Horizontal Pod Autoscaler (HPA)

### Description
HPA automatically scales the number of pods in a deployment, replica set, or stateful set based on observed CPU utilization, memory usage, or custom metrics.
### Key Features
- CPU-based Scaling: Scale based on CPU utilization (most common)
- Memory-based Scaling: Scale based on memory usage
- Custom Metrics: Scale on application-specific metrics (Pub/Sub queue length, HTTP requests/sec)
- External Metrics: Scale based on metrics from external systems
- Multiple Metrics: Combine different metrics for scaling decisions
- Configurable Behavior: Set min/max replicas, scaling velocity
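A minimal HPA manifest illustrating several of these options; the target Deployment name `web` and the thresholds are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # scale out above 70% average CPU
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling in
```

Apply with `kubectl apply -f`, then watch current vs. desired replicas with `kubectl get hpa`.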
### When to Use HPA
✅ Use HPA When:
- Traffic varies throughout the day
- Want automatic scaling based on load
- Cost optimization through dynamic scaling
- Stateless applications that can scale horizontally
❌ Don’t Use HPA When:
- Stateful applications that can’t easily add replicas
- Applications with long startup times (scale-up lag)
- Need vertical scaling (use VPA instead)
## Vertical Pod Autoscaler (VPA)

### Description
VPA automatically adjusts CPU and memory requests and limits for containers based on historical usage patterns.
### Key Features
- Right-Sizing: Automatically set appropriate resource requests
- Historical Analysis: Based on actual resource usage patterns
- Update Modes: Recommend, auto-update, or initial-only
- Container-Level: Can configure per-container in a pod
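As a sketch, a VPA object targeting a hypothetical Deployment `web`, with a per-container policy that bounds the recommendations:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical Deployment
  updatePolicy:
    updateMode: "Auto"       # one of the update modes described below
  resourcePolicy:
    containerPolicies:
      - containerName: app   # hypothetical container name
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
```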
### VPA Update Modes

Off (Recommendation Only):

```yaml
updatePolicy:
  updateMode: "Off"
```

- VPA only generates recommendations
- No automatic updates
- Use for analysis before implementing
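In this mode the recommendations can be read from the object's status, assuming the VPA components are installed (the name `web-vpa` is hypothetical):

```shell
# Show VPA recommendations: target, lower bound, and upper bound per container
kubectl describe vpa web-vpa
```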
Initial:

```yaml
updatePolicy:
  updateMode: "Initial"
```

- Set resources only when pods are created
- No updates to running pods
- Good for stateful workloads

Auto:

```yaml
updatePolicy:
  updateMode: "Auto"
```

- Automatically update running pods
- Evicts and recreates pods to apply new resources
- Best for stateless workloads
Recreate:

```yaml
updatePolicy:
  updateMode: "Recreate"
```

- Currently equivalent to Auto: pods are evicted and recreated to apply changes
- Guarantees recreation even if Auto later adopts less disruptive in-place updates
### Important Considerations

VPA Limitations:

- Cannot be combined with HPA on the same CPU/memory metrics (pairing VPA with an HPA driven by custom or external metrics is supported)
- Requires pod eviction to apply changes (except in Off and Initial modes)
- May cause brief downtime during updates
### When to Use VPA
✅ Use VPA When:
- Resource requests are incorrect or unknown
- Want automatic right-sizing based on usage
- Applications with varying resource needs over time
- Optimizing cost by eliminating over-provisioning
❌ Don’t Use VPA When:
- Already using HPA on CPU/memory
- Cannot tolerate pod evictions
- Resource requirements are well-known and stable
- Application startup time is very long
## Cluster Autoscaler

### Description
Cluster Autoscaler automatically adjusts the number of nodes in a cluster based on pod resource requests that cannot be scheduled on existing nodes.
### How It Works

1. Pods become unschedulable due to insufficient resources
2. Cluster Autoscaler detects the pending pods
3. New nodes are added to accommodate them
4. When nodes become underutilized, they are removed
### Key Features
- Automatic Scale-Up: Add nodes when pods can’t be scheduled
- Automatic Scale-Down: Remove nodes when underutilized
- Node Pool Awareness: Scale specific node pools
- Cost Optimization: Reduce costs by removing unused nodes
- Configurable Behavior: Set min/max nodes, scale-down delays
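In GKE Standard, cluster autoscaling is configured per node pool; a hedged sketch using `gcloud` (the cluster name, pool name, bounds, and zone are all hypothetical):

```shell
# Enable autoscaling on an existing node pool (names and values are hypothetical)
gcloud container clusters update my-cluster \
  --enable-autoscaling \
  --node-pool my-pool \
  --min-nodes 0 \
  --max-nodes 10 \
  --zone us-central1-a
```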
### Important Limits
| Limit | Value | Notes |
|---|---|---|
| Min nodes per pool | 0 | Can scale to zero |
| Max nodes per pool | 1000 | Configurable |
| Max nodes per cluster | 15,000 | Total limit |
| Scale-up time | ~2-5 minutes | Node provisioning time |
| Scale-down time | 10 minutes (default) | Configurable delay |
### When to Use Cluster Autoscaler
✅ Use Cluster Autoscaler When:
- Workload varies significantly
- Want automatic infrastructure scaling
- Cost optimization important
- Using HPA (complement to pod autoscaling)
❌ Don’t Use Cluster Autoscaler When:
- Workload is stable and predictable
- Cannot tolerate 2-5 minute scale-up delay
- Using Autopilot (handles this automatically)
## Autopilot Scaling (GKE Autopilot)

### Description
In Autopilot mode, Google manages all scaling automatically. No cluster autoscaler configuration needed.
### How It Works
- Nodes provisioned automatically based on pod requests
- Scales to zero when no workloads running
- Right-sized nodes for pod requirements
- No over-provisioning
### Autopilot Scaling Features
Automatic:
- Node provisioning and removal
- Efficient bin-packing of pods onto nodes
- Cost optimization
- No configuration needed
You Still Configure:
- HPA for pod replica scaling
- VPA for resource recommendations (optional)
- Pod resource requests (required)
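Because Autopilot provisions capacity from pod requests, every container should declare them; a minimal sketch, with hypothetical names, image, and values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: us-docker.pkg.dev/my-project/repo/app:latest  # hypothetical image
          resources:
            requests:        # Autopilot sizes nodes (and bills) from these
              cpu: 500m
              memory: 1Gi
```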
Set up Alerts:
- HPA at max replicas
- Cluster at max nodes
- High pod pending time
- Scaling failures
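A few `kubectl` commands that surface these conditions (the HPA name `web-hpa` is hypothetical):

```shell
# Current vs. desired replicas, and whether an HPA is pinned at its max
kubectl get hpa
# Scaling events and conditions for a specific HPA
kubectl describe hpa web-hpa
# Pods stuck in Pending, which can indicate node-scaling limits or failures
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```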