# GKE Scaling

## Description
Scaling in GKE encompasses multiple dimensions: horizontal pod scaling (adding more pod replicas), vertical pod scaling (increasing pod resources), and cluster scaling (adding more nodes). GKE provides automated scaling mechanisms to handle variable workloads efficiently while optimizing costs.
Concept: Automatically adjust resources (pods and nodes) based on demand to maintain performance while optimizing costs.
## Types of Scaling

### Horizontal Pod Autoscaler (HPA)
Scales the number of pod replicas based on observed metrics.
### Vertical Pod Autoscaler (VPA)
Adjusts CPU and memory requests/limits for containers.
### Cluster Autoscaler
Adds or removes nodes based on pod resource requirements.
### Multidimensional Pod Autoscaler (MPA)

Scales pod replicas and pod resources together, for example horizontally on CPU while vertically on memory (a GKE-specific feature).
## Horizontal Pod Autoscaler (HPA)

### Description
HPA automatically scales the number of pods in a deployment, replica set, or stateful set based on observed CPU utilization, memory usage, or custom metrics.
### Key Features
- CPU-based Scaling: Scale based on CPU utilization (most common)
- Memory-based Scaling: Scale based on memory usage
- Custom Metrics: Scale on application-specific metrics (Pub/Sub queue length, HTTP requests/sec)
- External Metrics: Scale based on metrics from external systems
- Multiple Metrics: Combine different metrics for scaling decisions
- Configurable Behavior: Set min/max replicas, scaling velocity
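A minimal HPA manifest illustrating several of these options; the target Deployment name `web` and the thresholds are hypothetical:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa              # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # scale out above 70% average CPU
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling in
```

Apply with `kubectl apply -f`, then watch current vs. desired replicas with `kubectl get hpa`.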
### When to Use HPA
✅ Use HPA When:
- Traffic varies throughout the day
- Want automatic scaling based on load
- Cost optimization through dynamic scaling
- Stateless applications that can scale horizontally
❌ Don’t Use HPA When:
- Stateful applications that can’t easily add replicas
- Applications with long startup times (scale-up lag)
- Need vertical scaling (use VPA instead)
## Vertical Pod Autoscaler (VPA)

### Description
VPA automatically adjusts CPU and memory requests and limits for containers based on historical usage patterns.
### Key Features
- Right-Sizing: Automatically set appropriate resource requests
- Historical Analysis: Based on actual resource usage patterns
- Update Modes: Recommend, auto-update, or initial-only
- Container-Level: Can configure per-container in a pod
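As a sketch, a VPA object targeting a hypothetical Deployment `web`, with a per-container policy that bounds the recommendations:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-vpa              # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                # hypothetical Deployment
  updatePolicy:
    updateMode: "Auto"       # one of the update modes described below
  resourcePolicy:
    containerPolicies:
      - containerName: app   # hypothetical container name
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: "2"
          memory: 2Gi
```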
### VPA Update Modes

Off (Recommendation Only):

```yaml
updatePolicy:
  updateMode: "Off"
```

- VPA only generates recommendations
- No automatic updates
- Use for analysis before implementing
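In this mode the recommendations can be read from the object's status, assuming the VPA components are installed (the name `web-vpa` is hypothetical):

```shell
# Show VPA recommendations: target, lower bound, and upper bound per container
kubectl describe vpa web-vpa
```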
Initial:

```yaml
updatePolicy:
  updateMode: "Initial"
```

- Set resources only when pods are created
- No updates to running pods
- Good for stateful workloads

Auto:

```yaml
updatePolicy:
  updateMode: "Auto"
```

- Automatically update running pods
- Evicts and recreates pods to apply new resources
- Best for stateless workloads
Recreate:

```yaml
updatePolicy:
  updateMode: "Recreate"
```

- Currently equivalent to Auto: pods are evicted and recreated to apply changes
- Guarantees recreation even if Auto later adopts less disruptive in-place updates
### Important Considerations

VPA Limitations:

- Cannot be combined with HPA on the same CPU/memory metrics (pairing VPA with an HPA driven by custom or external metrics is supported)
- Requires pod eviction to apply changes (except in Off and Initial modes)
- May cause brief downtime during updates
### When to Use VPA
✅ Use VPA When:
- Resource requests are incorrect or unknown
- Want automatic right-sizing based on usage
- Applications with varying resource needs over time
- Optimizing cost by eliminating over-provisioning
❌ Don’t Use VPA When:
- Already using HPA on CPU/memory
- Cannot tolerate pod evictions
- Resource requirements are well-known and stable
- Application startup time is very long
## Cluster Autoscaler

### Description
Cluster Autoscaler automatically adjusts the number of nodes in a cluster based on pod resource requests that cannot be scheduled on existing nodes.
### How It Works

1. Pods become unschedulable due to insufficient resources
2. Cluster Autoscaler detects the pending pods
3. New nodes are added to accommodate them
4. When nodes become underutilized, they are removed
### Key Features
- Automatic Scale-Up: Add nodes when pods can’t be scheduled
- Automatic Scale-Down: Remove nodes when underutilized
- Node Pool Awareness: Scale specific node pools
- Cost Optimization: Reduce costs by removing unused nodes
- Configurable Behavior: Set min/max nodes, scale-down delays
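In GKE Standard, cluster autoscaling is configured per node pool; a hedged sketch using `gcloud` (the cluster name, pool name, bounds, and zone are all hypothetical):

```shell
# Enable autoscaling on an existing node pool (names and values are hypothetical)
gcloud container clusters update my-cluster \
  --enable-autoscaling \
  --node-pool my-pool \
  --min-nodes 0 \
  --max-nodes 10 \
  --zone us-central1-a
```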
### Important Limits
| Limit | Value | Notes |
|---|---|---|
| Min nodes per pool | 0 | Can scale to zero |
| Max nodes per pool | 1000 | Configurable |
| Max nodes per cluster | 15,000 | Total limit |
| Scale-up time | ~2-5 minutes | Node provisioning time |
| Scale-down time | 10 minutes (default) | Configurable delay |
### When to Use Cluster Autoscaler
✅ Use Cluster Autoscaler When:
- Workload varies significantly
- Want automatic infrastructure scaling
- Cost optimization important
- Using HPA (complement to pod autoscaling)
❌ Don’t Use Cluster Autoscaler When:
- Workload is stable and predictable
- Cannot tolerate 2-5 minute scale-up delay
- Using Autopilot (handles this automatically)
## Autopilot Scaling (GKE Autopilot)

### Description
In Autopilot mode, Google manages all scaling automatically. No cluster autoscaler configuration needed.
### How It Works
- Nodes provisioned automatically based on pod requests
- Scales to zero when no workloads running
- Right-sized nodes for pod requirements
- No over-provisioning
### Autopilot Scaling Features
Automatic:
- Node provisioning and removal
- Efficient bin-packing of pods onto nodes
- Cost optimization
- No configuration needed
You Still Configure:
- HPA for pod replica scaling
- VPA for resource recommendations (optional)
- Pod resource requests (required)
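Because Autopilot provisions capacity from pod requests, every container should declare them; a minimal sketch, with hypothetical names, image, and values:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                  # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: us-docker.pkg.dev/my-project/repo/app:latest  # hypothetical image
          resources:
            requests:        # Autopilot sizes nodes (and bills) from these
              cpu: 500m
              memory: 1Gi
```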
Set up Alerts:
- HPA at max replicas
- Cluster at max nodes
- High pod pending time
- Scaling failures
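A few `kubectl` commands that surface these conditions (the HPA name `web-hpa` is hypothetical):

```shell
# Current vs. desired replicas, and whether an HPA is pinned at its max
kubectl get hpa
# Scaling events and conditions for a specific HPA
kubectl describe hpa web-hpa
# Pods stuck in Pending, which can indicate node-scaling limits or failures
kubectl get pods --all-namespaces --field-selector=status.phase=Pending
```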