# Cloud Operations
## Core Concepts
Cloud Operations (formerly Stackdriver) is a suite of tools for monitoring, logging, debugging, and managing applications and infrastructure. It provides unified observability across GCP and other clouds.
Key Principle: Observability through metrics, logs, and traces; proactive monitoring rather than reactive troubleshooting.
## Cloud Operations Suite
| Service | Purpose | Key Feature |
|---|---|---|
| Cloud Monitoring | Metrics, dashboards, alerts | Time-series data, SLOs |
| Cloud Logging | Log collection and analysis | Centralized logs, filters |
| Cloud Trace | Distributed tracing | Request latency analysis |
| Cloud Profiler | CPU/memory profiling | Production profiling |
| Cloud Debugger | Live debugging | No code changes, snapshots |
| Error Reporting | Error aggregation | Smart grouping, notifications |
## Cloud Monitoring
### Purpose
Collect, visualize, and alert on metrics from GCP resources, applications, and external sources.
### Key Features
Metrics Collection:
- Automatic (GCP resources)
- Custom (application metrics; see the sketch below)
- Agent-based (VM detailed metrics)
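For custom application metrics, a write can be as small as the following sketch with the google-cloud-monitoring Python client; the project ID and the metric type `custom.googleapis.com/checkout_latency` are illustrative placeholders, not names from this document.

```python
# Minimal sketch: write one data point of a custom metric
# (pip install google-cloud-monitoring). Names below are placeholders.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project-id"  # placeholder project

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/checkout_latency"  # placeholder
series.resource.type = "global"

now = time.time()
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
)
point = monitoring_v3.Point(
    {"interval": interval, "value": {"double_value": 0.237}}
)
series.points = [point]

# A single call can carry several series; here we send one point.
client.create_time_series(name=project_name, time_series=[series])
```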
Dashboards:
- Predefined (GCE, GKE, etc.)
- Custom (MQL, PromQL)
- Charts, tables, heatmaps
Alerting:
- Metric-based (CPU > 80%)
- Log-based (error rate spikes)
- Uptime checks (availability)
- Multi-condition policies
SLIs and SLOs:
- Service-level indicators (latency, availability)
- Service-level objectives (99.9% uptime)
- Error budget tracking
### Common Patterns
Infrastructure Monitoring:
- Alert: CPU > 80% for 5 minutes → page on-call
- Alert: Disk usage > 90% → auto-expand or alert
- Uptime check: HTTP endpoint unavailable → alert
Application Performance:
- SLO: 99.9% of requests < 500ms
- Error budget: calculate remaining tolerance
- Alert: error budget burn rate too high (see the sketch below)
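A burn-rate alert compares how fast errors are consuming the budget against the rate that would exhaust it exactly at the end of the SLO window. A minimal sketch of the arithmetic, with illustrative numbers only:

```python
# Error-budget burn rate: observed error rate / allowed error rate.
# All figures below are hypothetical.
slo_target = 0.999                      # 99.9% success SLO
error_budget = 1 - slo_target           # 0.1% of requests may fail

requests_last_hour = 100_000            # hypothetical traffic
errors_last_hour = 300                  # hypothetical failures
observed_error_rate = errors_last_hour / requests_last_hour  # 0.003

burn_rate = observed_error_rate / error_budget  # 3.0
# 1.0 means the budget lasts exactly one SLO window; >1 burns faster.
if burn_rate > 2.0:                     # hypothetical paging threshold
    print(f"Page on-call: burn rate {burn_rate:.1f}x")
```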
### Monitoring Agent
Purpose: Collect system and application metrics from VMs
Installation: Optional but recommended for detailed metrics
Benefits: Memory, disk I/O, process metrics
## Cloud Logging
### Purpose
Centralized log management: collection, storage, analysis, and alerting.
### Log Types
Platform Logs (automatic):
- Admin Activity (who did what)
- Data Access (who accessed data, must enable)
- System Events (GCP actions)
- Access Transparency (Google admin access)
Application Logs:
- Stdout/stderr (automatic for many services)
- Structured logging (recommended)
- Custom log writes
### Log Routing
Default behavior: Logs stored in Cloud Logging for 30 days
Sinks: Export logs to destinations
- Cloud Storage (long-term archival)
- BigQuery (analysis, SQL queries)
- Pub/Sub (streaming to external systems)
- Other projects (centralized logging)
Filters: Route specific logs (e.g., only errors)
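As an example, a sink that routes only error-level entries to BigQuery might be created as in this sketch with the google-cloud-logging Python client; the sink name, project, and dataset are placeholders.

```python
# Minimal sketch: create a sink exporting ERROR-and-above entries to
# BigQuery (pip install google-cloud-logging). Names are placeholders.
from google.cloud import logging

client = logging.Client()
sink = client.sink(
    "error-logs-to-bq",            # placeholder sink name
    filter_="severity>=ERROR",     # route only errors
    destination="bigquery.googleapis.com/projects/my-project-id"
                "/datasets/error_logs",  # placeholder dataset
)
sink.create()
# The sink's writer identity still needs write access to the destination.
```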
### Log-Based Metrics
Purpose: Create metrics from log entries
Use cases:
- Count error occurrences
- Track specific events
- Custom business metrics
- Alert on log patterns
Example: Alert when 5XX errors > 10/minute
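The counter metric behind such an alert could be defined as in this sketch (metric name and filter are placeholders); an alerting policy in Cloud Monitoring then watches the metric.

```python
# Minimal sketch: create a log-based counter metric for 5XX responses
# (pip install google-cloud-logging). Name and filter are placeholders.
from google.cloud import logging

client = logging.Client()
metric = client.metric(
    "http_5xx_count",  # placeholder metric name
    filter_='resource.type="http_load_balancer" AND httpRequest.status>=500',
    description="Count of 5XX responses",
)
metric.create()
# In Cloud Monitoring, alert when this counter exceeds 10/minute.
```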
### Best Practices
Structured Logging:
```json
{
  "severity": "ERROR",
  "message": "Payment failed",
  "userId": "123",
  "amount": 99.99
}
```
Benefits: Searchable fields, better analysis
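One way to emit that kind of entry from Python is `log_struct` in the google-cloud-logging client, sketched below; the logger name "payments" is a placeholder.

```python
# Minimal sketch: write the structured entry shown above
# (pip install google-cloud-logging). Logger name is a placeholder.
from google.cloud import logging

client = logging.Client()
logger = client.logger("payments")  # placeholder logger name

logger.log_struct(
    {"message": "Payment failed", "userId": "123", "amount": 99.99},
    severity="ERROR",
)
```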
Log Sampling: Reduce volume for high-traffic apps (sample 10%)
Retention: Default 30 days, export to Storage for longer
## Cloud Trace
### Purpose
Distributed tracing for understanding request latency across services.
### How It Works
```
Request → Service A → Service B → Service C
          └────── trace spans collected ──────┘
```
Trace: end-to-end journey of a request
Span: a single operation within a trace
### Use Cases
- Identify slow services in request path
- Understand service dependencies
- Optimize critical paths
- Troubleshoot latency issues
### Automatic Instrumentation
Supported:
- App Engine (automatic)
- Cloud Run (automatic)
- GKE (with service mesh)
Manual:
- Compute Engine (use client libraries)
- Other services (OpenTelemetry)
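For manual instrumentation, an OpenTelemetry setup that exports to Cloud Trace might look like this sketch (assumes the opentelemetry-sdk and opentelemetry-exporter-gcp-trace packages; span names and the attribute are placeholders):

```python
# Minimal sketch: manual tracing with OpenTelemetry exported to Cloud Trace
# (pip install opentelemetry-sdk opentelemetry-exporter-gcp-trace).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Each "with" block becomes one span; nesting builds the waterfall view.
with tracer.start_as_current_span("checkout") as span:   # placeholder name
    span.set_attribute("user_id", "123")                 # placeholder attr
    with tracer.start_as_current_span("charge-card"):    # placeholder name
        pass  # call the downstream service here
```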
### Analysis
Features:
- Latency distribution
- Request waterfall view
- Service dependency graph
- Comparison across traces
Example: “95% of requests to Service A take 200ms, but 5% take 5s due to Service B dependency”
## Cloud Profiler
### Purpose
Continuous CPU and memory profiling in production with minimal overhead.
### Key Features
- Minimal performance impact (typically <1% overhead)
- Always-on profiling
- Multiple languages (Java, Python, Go, Node.js)
- Compare time periods
### Use Cases
- Identify performance bottlenecks
- Optimize resource usage
- Reduce compute costs
- Memory leak detection
### Best Practice
Enable in production: Profiler is designed for production use and is safe to run continuously.
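Enabling it is typically a one-time call at process startup, as in this sketch with the google-cloud-profiler agent; the service name and version are placeholders.

```python
# Minimal sketch: start the Cloud Profiler agent at startup
# (pip install google-cloud-profiler). Names are placeholders.
import googlecloudprofiler

googlecloudprofiler.start(
    service="checkout-service",  # placeholder service name
    service_version="1.0.0",     # placeholder version
)
# The application then runs normally; profiles upload in the background.
```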
## Cloud Debugger
### Purpose
Debug production applications without stopping or restarting them. (Note: Google has since deprecated and shut down Cloud Debugger; the concepts below are kept for reference.)
### How It Works
Snapshots: Capture variable state at specific line
Logpoints: Inject log statement without code changes
No downtime: Debug live production apps
### Limitations
- Snapshots fire only once and expire if never hit; re-set them for each new capture
- Not all languages supported
- Cannot modify state (read-only)
### Use Case
Troubleshooting production issues without redeployment
## Error Reporting
### Purpose
Aggregate and display errors from applications, with smart grouping and notifications.
### Features
Smart Grouping: Similar errors grouped together
Stack Trace: Full stack traces for debugging
Notifications: Email, mobile alerts on new errors
Integration: Works with Cloud Logging automatically
### Supported Services
- App Engine, Cloud Functions, Cloud Run (automatic)
- Compute Engine, GKE (via Logging agent)
- External applications (via API)
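For an external application, reporting through the API can be as small as this sketch with the google-cloud-error-reporting client; the error itself is a stand-in.

```python
# Minimal sketch: report an exception via the Error Reporting API
# (pip install google-cloud-error-reporting).
from google.cloud import error_reporting

client = error_reporting.Client()

try:
    raise ValueError("payment amount must be positive")  # stand-in error
except ValueError:
    # Sends the current exception and stack trace to Error Reporting,
    # where similar errors are grouped together.
    client.report_exception()
```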
## Unified Observability
### The Three Pillars
Metrics (Cloud Monitoring):
- What is happening (CPU, memory, requests)
- When did it happen
- Historical trends
Logs (Cloud Logging):
- Why it happened
- Detailed context
- Debugging information
Traces (Cloud Trace):
- How requests flow through system
- Where latency occurs
- Service dependencies
Together: Complete picture of system health
## SLI, SLO, and SLA
### Definitions
SLI (Service Level Indicator):
- Quantitative measure of service level
- Examples: Latency, availability, error rate
SLO (Service Level Objective):
- Target value for SLI
- Examples: 99.9% availability, 95th percentile latency < 200ms
SLA (Service Level Agreement):
- Contractual commitment
- Penalties if SLO not met
- Example: 99.95% uptime or refund
### Relationship
SLI (measurement) → SLO (internal target) → SLA (customer contract)
### Error Budget
Concept: Allowed downtime based on SLO
Example: 99.9% availability SLO = 43.2 minutes downtime/month allowed
Use: Prioritize features vs reliability
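The 43.2-minute figure falls directly out of the SLO arithmetic, sketched here for a 30-day month:

```python
# Allowed downtime implied by an availability SLO (30-day month).
slo = 0.999                          # 99.9% availability
month_minutes = 30 * 24 * 60         # 43,200 minutes

allowed_downtime = (1 - slo) * month_minutes
print(f"{allowed_downtime:.1f} minutes/month")  # -> 43.2

# Stricter SLOs shrink the budget quickly: 99.99% leaves ~4.3 minutes.
```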
## Monitoring Strategy
### What to Monitor
Golden Signals (Google SRE):
- Latency: Request duration
- Traffic: Request rate
- Errors: Failed requests
- Saturation: Resource utilization
Infrastructure:
- CPU, memory, disk, network
- Service health
- Resource quotas
Application:
- Business metrics (orders, payments)
- User experience (page load time)
- Custom KPIs
### Alert Design
Good Alerts:
- Actionable (can fix)
- Represent real problems
- Rarely false positives
Bad Alerts:
- Noisy (too many)
- Not actionable
- Alert fatigue
Best Practice: Alert on SLO burn rate, not arbitrary thresholds
## Cost Optimization
Logging:
- Default 30-day retention (free)
- Export to Storage for cheaper long-term
- Use log exclusion filters (reduce volume)
- Sample high-volume logs
Monitoring:
- Free tier: first 150 MiB of chargeable metrics ingestion/month
- Custom metrics charged beyond free tier
- Use sampling for high-cardinality metrics
Trace/Profiler/Debugger: Profiler and Debugger are free; Trace has a monthly free quota of ingested spans, with charges beyond it
## Integration Patterns
### Multi-Cloud Monitoring
Ops Agent: Monitor GCP, AWS, on-premises from single dashboard
Use case: Unified monitoring across hybrid/multi-cloud
### Centralized Logging
Pattern: All projects route logs to a central project
Project A logs → Sink → Central Logging Project
Project B logs → Sink → Central Logging Project
Benefits: Single pane of glass, better analysis
### Alert Routing
Integration:
- PagerDuty, Slack (notifications)
- Cloud Functions (automated remediation)
- Cloud Tasks (queued follow-up actions)
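A remediation hook can be a Pub/Sub-triggered Cloud Function attached to a Monitoring notification channel; the sketch below assumes the alert payload carries an incident object with state and policy_name fields, so verify the actual message format before relying on it.

```python
# Minimal sketch: automated remediation from an alert delivered via a
# Pub/Sub notification channel to a 1st-gen Cloud Function.
# The payload fields accessed below are assumptions about the alert JSON.
import base64
import json

def handle_alert(event, context):
    """Pub/Sub-triggered entry point (1st-gen signature)."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    incident = payload.get("incident", {})   # assumed field
    if incident.get("state") == "open":      # assumed field
        # Placeholder remediation: restart a service, resize a group, etc.
        print(f"Remediating: {incident.get('policy_name')}")
```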
## Compliance and Audit
### Audit Logs
Types:
- Admin Activity: Always enabled, 400-day retention
- Data Access: Must enable, 30-day default retention
- System Events: Automatic
- Access Transparency: Google employee access
Use for: Compliance evidence, security investigations
### Log Retention
Compliance requirements:
- HIPAA: 6 years
- SOX: 7 years
- PCI-DSS: 1 year
Implementation: Export logs to Cloud Storage, set retention policy
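On the Cloud Storage side, the retention policy can be set on the export bucket as in this sketch with the google-cloud-storage client; the bucket name is a placeholder and the 7-year period matches the SOX row above.

```python
# Minimal sketch: enforce compliance retention on the log-export bucket
# (pip install google-cloud-storage). Bucket name is a placeholder.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-audit-log-archive")  # placeholder bucket

bucket.retention_period = 7 * 365 * 24 * 60 * 60    # ~7 years, in seconds
bucket.patch()
# Optionally, bucket.lock_retention_policy() makes the policy irreversible.
```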
## Best Practices
### Monitoring
- Define SLOs for critical services
- Alert on SLO burn rate
- Use dashboards for visibility
- Regular review of alert policies
### Logging
- Use structured logging
- Sample high-volume logs
- Export for long-term retention
- Enable data access logs for sensitive resources
### Tracing
- Enable for all services
- Use for performance optimization
- Trace critical request paths
- Set sampling rate appropriately
### Observability
- Implement all three pillars (metrics, logs, traces)
- Correlate across pillars (same request ID)
- Monitor business metrics, not just infrastructure
- Proactive monitoring, not reactive
## Exam Focus
### Core Concepts
- Observability pillars (metrics, logs, traces)
- SLI vs SLO vs SLA
- Golden signals (latency, traffic, errors, saturation)
- Error budgets
### Service Purpose
- Cloud Monitoring: Metrics, alerts, dashboards
- Cloud Logging: Centralized logs, sinks
- Cloud Trace: Distributed tracing, latency
- Cloud Profiler: Production profiling
- Error Reporting: Error aggregation
### Architecture
- Log routing (sinks to Storage, BigQuery, Pub/Sub)
- Centralized logging pattern
- Multi-cloud monitoring
- Alert routing and automation
### Best Practices
- SLO-based alerting
- Structured logging
- Log retention for compliance
- Enable data access logs for sensitive resources
- Sample high-volume logs
### Integration
- Automatic (App Engine, Cloud Run, GKE)
- Agent-based (Compute Engine)
- API/SDK (custom applications)
- Multi-cloud (Ops Agent)