Section 3 – Site Reliability Engineering (SRE) Practices

3.1 – Balancing Change, Velocity, and Reliability

SLI / SLO / SLA / Error Budget

SLI → measures reliability
SLO → target for that measure
Error Budget → tolerance for unreliability (100% - SLO)
SLA → contract with users about SLO (with consequences for breach)

SLI (Service Level Indicator)

A quantitative measure of a service’s behavior from the user’s perspective.

SLI Type	Example
Availability	% of HTTP requests returning 2xx (success)
Latency	% of requests served in < 200ms
Error rate	% of requests returning 5xx
Throughput	requests per second
Freshness	% of data updated within 1 hour
Durability	% of data successfully stored and retrieved
Correctness	% of responses with correct data

Good SLI formula:

SLI = good events / valid events × 100%
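As a minimal sketch of the ratio above, the function below computes a request-based SLI from raw event counts. The function name and the choice to return `None` for an empty window are illustrative assumptions, not part of any GCP API.

```python
from typing import Optional


def sli_percent(good_events: int, valid_events: int) -> Optional[float]:
    """Return good/valid as a percentage, or None if there were no valid events."""
    if good_events < 0 or valid_events < 0 or good_events > valid_events:
        raise ValueError("counts must satisfy 0 <= good_events <= valid_events")
    if valid_events == 0:
        return None  # assumption: an empty window has no defined SLI
    return good_events / valid_events * 100.0


# Hypothetical window: 999,412 successful requests out of 1,000,000 valid ones.
print(sli_percent(999_412, 1_000_000))
```

Invalid events (for example, health checks or requests rejected before reaching the service) would be excluded from both counts before calling the function.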

SLO (Service Level Objective)

A target value for an SLI over a compliance period.

SLO: 99.9% availability over a 30-day rolling window
    → allows 43.2 minutes of downtime per 30-day window

SLO: 95% of requests served in < 300ms
    → 5% of requests may be slow

SLO: 99.99% availability
    → allows ~4.4 minutes downtime per month
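The allowed downtime depends on the length of the window, so a fixed 30-day window and an average calendar month give slightly different figures. The sketch below shows the arithmetic; the function name and the 365.25 / 12 definition of an average month are assumptions made for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of full downtime an availability SLO permits over the window."""
    if not 0.0 < slo_percent <= 100.0:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


AVERAGE_MONTH_DAYS = 365.25 / 12  # assumption: about 30.44 days

for slo in (99.9, 99.99):
    fixed = allowed_downtime_minutes(slo, 30)
    average = allowed_downtime_minutes(slo, AVERAGE_MONTH_DAYS)
    print(f"SLO {slo}%: {fixed:.2f} min per 30 days, {average:.2f} min per average month")
```

For 99.9% this gives roughly 43.2 minutes over 30 days and 43.8 minutes over an average month, which is why both numbers show up in SRE material.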

Aspirational vs. achievable SLOs:
- Don’t set SLOs tighter than needed: creates toil
- Don’t set SLOs looser than users need: erodes trust
- Start with current baseline + slight improvement target

Error Budget

Error Budget = 100% - SLO target
Example: SLO = 99.9% → Error Budget = 0.1% of 30 days = 43.2 minutes

Burn rate = how fast the error budget is being consumed, relative to the rate that would spend it exactly over the SLO period
  - Burn rate 1 = consuming exactly at budget rate
  - Burn rate 2 = consuming budget 2x faster than allowed
  - Fast burn (> 14.4x) = alert immediately
  - Slow burn (~1x over long window) = alert before budget exhaustion
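The burn-rate figures above can be reproduced with two small formulas. This is a sketch under stated assumptions: the function names are illustrative, and the "2% of the budget in one hour" input is the commonly cited example from Google's SRE Workbook, which is presumably where the 14.4 threshold in these notes comes from.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows (1 - slo)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return error_ratio / (1.0 - slo)


def burn_rate_threshold(budget_fraction: float, window_hours: float,
                        period_hours: float) -> float:
    """Burn rate that spends budget_fraction of the budget within window_hours."""
    return budget_fraction * period_hours / window_hours


# 1.44% of requests failing against a 99.9% SLO.
print(burn_rate(0.0144, 0.999))

# Spending 2% of a 30-day budget within 1 hour.
print(burn_rate_threshold(0.02, 1, 30 * 24))
```

Both calls evaluate to approximately 14.4: an error ratio 14.4 times the allowed one consumes 2% of a 30-day budget every hour.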

Error budget policy:
- Budget healthy → dev teams free to ship features
- Budget 50% consumed → begin reliability focus
- Budget exhausted (0% remaining) → freeze non-critical deployments until budget is available again
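The policy tiers can be expressed as a small decision function. The sketch below assumes a request-based SLO and uses hypothetical function names and request counts; the thresholds (50% consumed, 100% consumed) are the ones listed in the policy.

```python
def budget_consumed(bad_events: int, total_events: int, slo: float) -> float:
    """Fraction of the error budget spent so far (1.0 means fully spent)."""
    if total_events <= 0:
        return 0.0  # assumption: no traffic spends no budget
    allowed_bad = (1.0 - slo) * total_events
    return bad_events / allowed_bad


def policy_action(consumed: float) -> str:
    """Map budget consumption to the three policy tiers."""
    if consumed >= 1.0:
        return "freeze non-critical deployments"
    if consumed >= 0.5:
        return "begin reliability focus"
    return "ship features"


# Hypothetical period so far: 10,000,000 requests, 6,200 failures, SLO 99.9%.
consumed = budget_consumed(6_200, 10_000_000, 0.999)
print(f"{consumed:.0%} consumed -> {policy_action(consumed)}")
```

In practice the thresholds and actions are agreed between product and SRE teams in advance, so the function only encodes a decision that was already made.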

SLA (Service Level Agreement)

Legal/contractual commitment to users about SLOs
SLA target should be weaker (less strict) than the internal SLO, leaving a buffer to respond before the contract is breached
Example: Internal SLO = 99.9%, SLA = 99.5%
SLA breach → financial penalty, credit, etc.

GCP Cloud Monitoring SLO Setup

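# Create a request-based SLO via the Cloud Monitoring API v3
# (projects.services.serviceLevelObjectives.create); 2592000s = 30 days.
# response_code_class is the numeric class (200, 300, 400, 500).
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT/services/my-service/serviceLevelObjectives" \
  -d '{
    "displayName": "Availability SLO 99.9%",
    "goal": 0.999,
    "rollingPeriod": "2592000s",
    "serviceLevelIndicator": {
      "requestBased": {
        "goodTotalRatio": {
          "goodServiceFilter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\" resource.type=\"https_lb_rule\" metric.labels.response_code_class=\"200\"",
          "totalServiceFilter": "metric.type=\"loadbalancing.googleapis.com/https/request_count\" resource.type=\"https_lb_rule\""
        }
      }
    }
  }'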

Burn Rate Alerting

Fast burn alert:   burn rate > 14.4×, last 1 hour  → page immediately
Slow burn alert:   burn rate > 1×, last 6 hours    → ticket/investigate

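# Alert on fast burn: burn rate over a 60-minute lookback exceeds 14.4
# (flags as in `gcloud alpha monitoring policies create`; confirm with --help)
gcloud alpha monitoring policies create \
  --display-name="Error Budget Fast Burn" \
  --condition-display-name="Burn rate > 14.4x over 1h" \
  --condition-filter='select_slo_burn_rate("projects/PROJECT/services/svc/serviceLevelObjectives/slo", "60m")' \
  --if="> 14.4" \
  --duration=0s \
  --combiner=OR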
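To see why a fast burn warrants a page while a slow burn only warrants a ticket, it helps to convert a burn rate into time until the budget runs out. The sketch below assumes a 30-day SLO period and a constant burn rate; the function name is illustrative.

```python
def hours_to_exhaustion(burn_rate: float, period_days: float = 30.0,
                        budget_remaining: float = 1.0) -> float:
    """Hours until the remaining budget is gone if the burn rate holds steady."""
    if burn_rate <= 0:
        return float("inf")  # nothing is being consumed
    return budget_remaining * period_days * 24.0 / burn_rate


for rate in (1.0, 6.0, 14.4):
    print(f"burn rate {rate}: budget lasts {hours_to_exhaustion(rate):.0f} h")
```

At 14.4x a full 30-day budget lasts about 50 hours, so waiting for business hours is risky; at 1x it lasts the whole period, so a ticket is enough.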

3.2 – Managing Service Lifecycle

Service Management Stages

Stage	Key Activities
Planning	Define SLOs, design for reliability, capacity planning
Deployment	Progressive rollout, canary, feature flags
Maintenance	Patching, upgrades, cert rotation, dependency updates
Retirement	Drain traffic, deprecation notices, data migration

Capacity Planning

Quotas and Limits

GCP enforces quotas (adjustable on request) and system limits (fixed)
Check quotas: Cloud Console → IAM & Admin → Quotas
Request quota increase: Quotas page in the Cloud Console, or the Cloud Quotas API (large increases may need a support case)

# Check current quota usage
gcloud compute regions describe us-central1 --format="yaml(quotas)"

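# Request quota increase via the Cloud Quotas API (beta CLI; confirm quota IDs with
# `gcloud beta quotas info list --service=compute.googleapis.com --project=PROJECT`)
gcloud beta quotas preferences create \
  --project=PROJECT \
  --service=compute.googleapis.com \
  --quota-id=CPUS-per-project-region \
  --dimensions=region=us-central1 \
  --preferred-value=500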

Reservations

Reserve specific VM capacity in a zone — guaranteed availability
Use when: predictable demand spikes, capacity-sensitive workloads

gcloud compute reservations create my-reservation \
  --machine-type=n2-standard-8 \
  --vm-count=10 \
  --zone=us-central1-a

Dynamic Workload Scheduler (DWS)

Request batch GPU/TPU capacity for future time windows
GCP schedules the resources when capacity is available
Use for: ML training jobs that don’t need immediate, persistent VMs

Autoscaling

Managed Instance Groups (MIGs) – Compute Engine

gcloud compute instance-groups managed set-autoscaling my-mig \
  --min-num-replicas=2 \
  --max-num-replicas=20 \
  --target-cpu-utilization=0.7 \
  --cool-down-period=60

Scaling signals: CPU, HTTP load balancing, Pub/Sub queue depth, custom metrics

GKE Autoscaling

Autoscaler	Scales	Based on
HPA (Horizontal Pod Autoscaler)	Pod replicas	CPU, memory, custom metrics
VPA (Vertical Pod Autoscaler)	Pod resource requests	Historical usage
Cluster Autoscaler	Nodes in node pools	Pending pods (unschedulable)
KEDA (event-driven)	Pod replicas	Pub/Sub, HTTP, custom events

# HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 2
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
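The HPA manifest above targets 70% average CPU utilization. The Kubernetes documentation describes the core scaling rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (10% by default). The sketch below applies that rule with the manifest's bounds; it is a simplification that ignores pod readiness, missing metrics, and stabilization windows, and the function name is illustrative.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_utilization: float,
                         target_utilization: float = 70.0,
                         min_replicas: int = 2, max_replicas: int = 50,
                         tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single utilization metric."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, desired))


print(hpa_desired_replicas(4, 90.0))    # above target: scale out
print(hpa_desired_replicas(10, 72.0))   # within tolerance: unchanged
print(hpa_desired_replicas(40, 140.0))  # capped at max_replicas
```

Because utilization is measured against each pod's CPU request, the calculation only makes sense when the Deployment's containers declare CPU requests.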

Cloud Run Autoscaling

Scales to zero by default; set --min-instances to avoid cold starts
Set --max-instances to cap cost
Scales out based on concurrent requests per instance (--concurrency) and CPU utilization

gcloud run services update myapp \
  --min-instances=2 \
  --max-instances=100 \
  --concurrency=80
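A rough way to sanity-check the --concurrency and --max-instances values is Little's law: requests in flight ≈ request rate × average latency. The sketch below is a back-of-the-envelope estimate under that assumption, not Cloud Run's actual autoscaling algorithm; the function name and the traffic figures are hypothetical.

```python
import math


def estimate_instances(requests_per_second: float, avg_latency_s: float,
                       concurrency: int = 80, min_instances: int = 2,
                       max_instances: int = 100) -> int:
    """Estimate instances needed to hold the in-flight requests at full concurrency."""
    in_flight = requests_per_second * avg_latency_s  # Little's law
    needed = math.ceil(in_flight / concurrency)
    # Clamp to the configured --min-instances / --max-instances bounds.
    return max(min_instances, min(max_instances, needed))


# Hypothetical load: 1,000 requests/s at 500 ms average latency.
print(estimate_instances(1000, 0.5))
```

If the estimate sits at the --max-instances cap for expected peak traffic, requests would queue or be rejected, which is a signal to raise the cap or the concurrency setting.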

3.3 – Mitigating Incident Impact

Traffic Management During Incidents

Draining / Redirecting Traffic

# GKE: Cordon node (mark unschedulable; existing pods keep running)
kubectl cordon NODE_NAME

# GKE: Drain node (cordons it, then gracefully evicts its pods, honoring PodDisruptionBudgets)
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data

# Cloud Run: Redirect 100% traffic to stable revision
gcloud run services update-traffic myapp --to-revisions=stable=100

# Load Balancer: Stop sending new traffic to one backend (capacity scaler 0 drains it)
gcloud compute backend-services update-backend my-backend \
  --global \
  --instance-group=my-mig \
  --instance-group-zone=us-central1-a \
  --capacity-scaler=0

Adding Capacity

# GKE: Scale deployment immediately
kubectl scale deployment app --replicas=20

# GKE: Resize node pool
gcloud container clusters resize my-cluster \
  --node-pool=my-pool \
  --num-nodes=10

# MIG: Set target size
gcloud compute instance-groups managed resize my-mig --size=20

Rollback Strategies

Platform	Rollback Method
GKE Deployment	`kubectl rollout undo deployment/app`
Cloud Run	`gcloud run services update-traffic SERVICE --to-revisions=PREV=100`
Cloud Deploy	`gcloud deploy targets rollback TARGET --delivery-pipeline=PIPELINE`
GKE node pool	Blue/green node pool switch
Terraform	`git revert` the change, then `terraform apply`

Incident Response Process (SRE Model)

1. Detect: alert fires; on-call paged
2. Mitigate first: restore service before root-causing (rollback, add capacity, redirect traffic)
3. Communicate: update status page, stakeholders
4. Investigate: root cause analysis while service is stable
5. Resolve: permanent fix
6. Post-mortem: blameless, action items to prevent recurrence

Availability Table (Nines)

Nines	Availability	Downtime/month	Downtime/year
99% (2 nines)	99%	~7.3 hours	~3.65 days
99.5%	99.5%	~3.65 hours	~1.83 days
99.9% (3 nines)	99.9%	~43.8 min	~8.76 hours
99.95%	99.95%	~21.9 min	~4.38 hours
99.99% (4 nines)	99.99%	~4.4 min	~52.6 min
99.999% (5 nines)	99.999%	~26 sec	~5.26 min

Toil

Definition (SRE): Repetitive, manual, automatable work that scales linearly with service growth and produces no lasting value.

Signs of toil: manual steps in deployments, manual alert triage, manual scaling, manual cert rotation.

Goal: Keep toil < 50% of SRE team time. Remainder = engineering work that reduces future toil.

Exam Tips

SLI = measurement; SLO = target; SLA = contract; Error Budget = tolerance
Error budget = 100% - SLO%, not SLO - actual uptime
Burn rate > 14.4x = alert immediately (fast burn); < 1x = no alert needed
SLA target should always be weaker than internal SLO
Canary deployment = reduces blast radius = respects error budget
Cloud Run scales to zero; use --min-instances for latency-sensitive services
GKE Cluster Autoscaler scales nodes; HPA scales pods — both may be needed
kubectl drain = graceful pod eviction; kubectl cordon = prevent scheduling only