4.3 – Metrics, Dashboards & Alerts | 4.4 – Distributed Tracing | 4.5 – Troubleshooting¶
4.3 – Managing Metrics, Dashboards, and Alerts¶
Metrics Explorer¶
- Explore, visualize, and query any metric in Cloud Monitoring
- Supports: MQL (Monitoring Query Language), PromQL (for Prometheus-style queries)
- Can overlay multiple metrics, apply aggregation, ratio queries
PromQL in Cloud Monitoring¶
# Request rate
rate(http_requests_total[5m])
# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
MQL (Monitoring Query Language)¶
# CPU utilization for GKE pods
fetch k8s_container
| metric 'kubernetes.io/container/cpu/core_usage_time'
| filter resource.cluster_name = 'prod-cluster'
| align rate(1m)
| every 1m
| group_by [resource.namespace_name], [value: mean(value)]
Dashboards¶
Creating Dashboards¶
# Create dashboard from JSON
gcloud monitoring dashboards create --config-from-file=dashboard.json
# List dashboards
gcloud monitoring dashboards list
Dashboard JSON structure:
{
"displayName": "Application Overview",
"gridLayout": {
"columns": 2,
"widgets": [
{
"title": "Request Rate",
"xyChart": {
"dataSets": [{
"timeSeriesQuery": {
"prometheusQuery": "rate(http_requests_total[5m])"
}
}]
}
}
]
}
}
Dashboard Best Practices¶
- Include: golden signals (latency, traffic, errors, saturation)
- Group by: service, environment, region
- Add playbooks as text widgets — guides during incidents
- Use alert annotations to show alert events on time series charts
- Share dashboards via URL — accessible to anyone with Cloud Monitoring viewer role
Alerting Policies¶
Alert Policy Components¶
- Condition: what metric/log to evaluate + threshold
- Notification channels: where to send alerts (email, Slack, PagerDuty, webhook, Pub/Sub)
- Documentation: runbook URL, custom message for on-call engineer
# Create alert via gcloud
gcloud monitoring policies create \
--display-name="High Error Rate" \
--condition-display-name="Error rate > 5%" \
--condition-filter='resource.type="k8s_container" AND metric.type="prometheus.googleapis.com/http_request_errors_total/counter"' \
--condition-threshold-value=0.05 \
--condition-comparison=COMPARISON_GT \
--condition-duration=300s \
--notification-channels=CHANNEL_ID
Notification Channels¶
# Create email notification channel
gcloud monitoring channels create \
--display-name="OpsTeam Email" \
--type=email \
--channel-labels=email_address=ops@company.com
# List channels
gcloud monitoring channels list
# Third-party integrations: webhook (for PagerDuty, Rootly, Opsgenie)
gcloud monitoring channels create \
--display-name="PagerDuty" \
--type=webhook_tokenauth \
--channel-labels=url=https://events.pagerduty.com/integration/KEY/enqueue
SLO-Based Alerting¶
# Alert on SLO burn rate
gcloud monitoring policies create \
--display-name="SLO Fast Burn" \
--condition-filter='select_slo_burn_rate("projects/PROJECT/services/svc/serviceLevelObjectives/slo")' \
--condition-threshold-value=14.4 \
--condition-comparison=COMPARISON_GT \
--condition-duration=3600s \
--notification-channels=CHANNEL_ID
Cost Control Alerting¶
# Budget alert
gcloud billing budgets create \
--billing-account=BILLING_ACCOUNT_ID \
--display-name="Monthly Budget Alert" \
--budget-amount=1000USD \
--threshold-rule=percent=80,basis=CURRENT_SPEND \
--threshold-rule=percent=100,basis=FORECASTED_SPEND
Third-Party Alerting Integrations¶
| Tool | Integration method |
|---|---|
| PagerDuty | Webhook notification channel → PagerDuty Events API |
| Rootly | Webhook notification channel |
| Opsgenie | Webhook notification channel |
| Slack | Slack notification channel (built-in) |
| Pub/Sub | Route alerts to Pub/Sub → custom handler |
Gemini Cloud Assist for Metrics¶
- In Metrics Explorer: “Show me CPU usage for my production GKE cluster”
- Interpret anomalies: “Why is my error rate spiking?”
- Generate alert conditions from natural language
4.4 – Distributed Tracing¶
Core Concepts¶
- A trace represents an end-to-end request across services
- A span represents a single operation within a trace (one service call, one DB query)
- Trace ID: unique ID propagated across all services for a request
- Parent span ID: links child spans to parent, forming the trace tree
Trace (single request):
span: API Gateway (100ms total)
span: Auth Service (10ms)
span: User Service (50ms)
span: Database Query (30ms)
span: Cache Lookup (5ms)
OpenTelemetry (OTel)¶
GCP’s preferred instrumentation framework — vendor-neutral, widely supported.
Instrumentation (Python example)¶
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Setup
provider = TracerProvider()
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317") # Ops Agent OTLP receiver
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("my-service")
# Instrument a function
with tracer.start_as_current_span("process-order") as span:
span.set_attribute("order.id", order_id)
span.set_attribute("user.id", user_id)
result = process_order(order_id)
Auto-instrumentation (no code changes)¶
# Python auto-instrumentation
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install
opentelemetry-instrument python app.py
Cloud Trace¶
- Managed distributed tracing service
- Collects spans from: Ops Agent OTLP receiver, direct API, auto-instrumented GCP services
- GCP services that auto-generate traces: Cloud Run, App Engine, GKE (with mesh), Cloud Functions
Sending Traces to Cloud Trace¶
Option 1: Ops Agent OTLP receiver (recommended for VMs)
# ops-agent config.yaml
traces:
receivers:
otlp:
type: otlp
grpc_endpoint: 0.0.0.0:4317
exporters:
google:
type: google_cloud_trace
service:
pipelines:
trace_pipeline:
receivers: [otlp]
exporters: [google]
Option 2: Direct Cloud Trace exporter
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
exporter = CloudTraceSpanExporter(project_id="my-project")
Analyzing Traces in Cloud Trace¶
Trace Waterfall¶
- Visualize span hierarchy and timing
- Identify bottlenecks: longest spans in the critical path
- Look for: unexpected fan-out, sequential calls that could be parallel, slow DB queries
API Gateway [100ms] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Auth Service [10ms] ━━━
User Service [50ms] ━━━━━━━━━━━━━━━
DB Query [30ms] ━━━━━━━━━
Cache [5ms] ━━
What to look for:¶
- Long tail latency: p99 >> p50 — investigate outlier spans
- Missing spans: broken trace propagation between services
- High span count: excessive calls (N+1 problem, chatty microservices)
- Errors in spans:
span.status = ERROR
Correlating Traces with Logs¶
Structured Logging with Trace Context¶
import logging
from opentelemetry import trace
# Get current trace context
span = trace.get_current_span()
ctx = span.get_span_context()
# Add trace fields to structured log
logging.info({
"message": "Processing order",
"trace": f"projects/PROJECT/traces/{format(ctx.trace_id, '032x')}",
"spanId": format(ctx.span_id, '016x'),
"traceSampled": True
})
In Logs Explorer: when a log has trace and spanId fields in the correct format, the UI shows a link to the trace → click through directly.
Special log fields for trace correlation¶
{
"message": "Order processed",
"logging.googleapis.com/trace": "projects/PROJECT/traces/TRACE_ID",
"logging.googleapis.com/spanId": "SPAN_ID",
"logging.googleapis.com/traceSampled": true
}
Gemini Cloud Assist for Traces¶
- Analyze trace anomalies: “Why is this trace 10x slower than average?”
- Suggest which service is the bottleneck
- Explain unusual patterns in trace waterfall
4.5 – Troubleshooting Issues¶
Troubleshooting Framework¶
- Define the symptom — user impact, not internal metric
- Check recent changes — deployments, config changes, infra changes
- Examine the golden signals — latency, traffic, errors, saturation
- Drill down — logs → traces → metrics for the affected service
- Isolate the component — narrow to service → pod → node → dependency
- Mitigate — rollback, add capacity, redirect traffic (fix user impact first)
- Root cause — after user impact resolved
Infrastructure Issues¶
# Node issues in GKE
kubectl get nodes
kubectl describe node NODE_NAME # Events section
# Pod scheduling failures
kubectl get events --field-selector reason=FailedScheduling
kubectl describe pod POD_NAME
# Resource exhaustion
kubectl top nodes
kubectl top pods --all-namespaces
# GCE VM issues
gcloud compute instances describe INSTANCE --zone=ZONE
gcloud compute ssh INSTANCE --zone=ZONE # Check /var/log/syslog
CI/CD Pipeline Issues¶
# Cloud Build failure
gcloud builds list --filter="status=FAILURE" --limit=10
gcloud builds log BUILD_ID
# Cloud Deploy rollout failure
gcloud deploy rollouts describe ROLLOUT_NAME --delivery-pipeline=PIPELINE --region=REGION
gcloud logging read 'resource.type="clouddeploy.googleapis.com/DeliveryPipeline"' --limit=50
Application Issues¶
# Find errors in GKE workloads
gcloud logging read \
'resource.type="k8s_container" AND severity>=ERROR AND resource.labels.cluster_name="prod"' \
--limit=100 --project=PROJECT
# Trace slow requests
# Cloud Trace → Trace List → Sort by latency → Examine waterfalls
# Check pod logs directly
kubectl logs DEPLOYMENT_NAME --tail=100 -f
kubectl logs POD_NAME -c CONTAINER_NAME --previous # Previous crash logs
Observability Issues¶
| Issue | Check |
|---|---|
| Metrics not appearing | SA permissions (roles/monitoring.metricWriter), Ops Agent running |
| Logs not ingested | SA permissions (roles/logging.logWriter), check Ops Agent status |
| Traces not showing | OTel exporter config, SA permissions (roles/cloudtrace.agent) |
| Alerts not firing | Alert condition, notification channel, alert policy status |
| Uptime check failing | Check endpoint accessibility from global probers, firewall rules |
Performance and Latency Issues¶
# Check for resource throttling in GKE
kubectl top pods -n production
kubectl describe pod POD --namespace=production | grep -A5 "Limits:"
# Check Cloud Run cold starts
gcloud logging read \
'resource.type="cloud_run_revision" AND textPayload:"Cold Start"'
# Check for GCE network performance
gcloud compute instances get-serial-port-output INSTANCE --zone=ZONE
Common Performance Root Causes¶
| Symptom | Likely cause | Fix |
|---|---|---|
| p99 latency spike | GC pauses, cold starts | JVM tuning, --min-instances |
| Gradual latency increase | Memory leak, connection pool exhaustion | Rolling restart, connection pool tuning |
| Sudden error rate spike | New deployment, dependency failure | Rollback, circuit breaker |
| Periodic latency spikes | Cron jobs, batch processing | Schedule off-peak, resource limits |
| High CPU throttling (K8s) | CPU limits too low | Increase limits or remove (VPA) |
Exam Tips¶
- Golden signals: Latency, Traffic, Errors, Saturation (L-T-E-S)
- PromQL supported natively in Cloud Monitoring (GMP) — use for Kubernetes metrics
- Trace correlation in logs requires:
logging.googleapis.com/traceandlogging.googleapis.com/spanIdfields - Cloud Trace auto-instruments: Cloud Run, App Engine, some GKE configs
- Ops Agent OTLP receiver = unified way to collect app traces/metrics on VMs
kubectl logs --previous= logs from the previous container instance (after crash)- Gemini Cloud Assist available in: Logs Explorer, Metrics Explorer, Cloud Trace UI