4.3 – Metrics, Dashboards & Alerts | 4.4 – Distributed Tracing | 4.5 – Troubleshooting¶

4.3 – Managing Metrics, Dashboards, and Alerts¶

Metrics Explorer¶

Explore, visualize, and query any metric in Cloud Monitoring
Supports: MQL (Monitoring Query Language), PromQL (for Prometheus-style queries)
Can overlay multiple metrics, apply aggregation, ratio queries

PromQL in Cloud Monitoring¶

# Request rate
rate(http_requests_total[5m])

# Error ratio
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

MQL (Monitoring Query Language)¶

# CPU utilization for GKE pods
fetch k8s_container
| metric 'kubernetes.io/container/cpu/core_usage_time'
| filter resource.cluster_name = 'prod-cluster'
| align rate(1m)
| every 1m
| group_by [resource.namespace_name], [value: mean(value)]

Dashboards¶

Creating Dashboards¶

# Create dashboard from JSON
gcloud monitoring dashboards create --config-from-file=dashboard.json

# List dashboards
gcloud monitoring dashboards list

Dashboard JSON structure:

{
  "displayName": "Application Overview",
  "gridLayout": {
    "columns": 2,
    "widgets": [
      {
        "title": "Request Rate",
        "xyChart": {
          "dataSets": [{
            "timeSeriesQuery": {
              "prometheusQuery": "rate(http_requests_total[5m])"
            }
          }]
        }
      }
    ]
  }
}

Dashboard Best Practices¶

Include: golden signals (latency, traffic, errors, saturation)
Group by: service, environment, region
Add playbooks as text widgets — guides during incidents
Use alert annotations to show alert events on time series charts
Share dashboards via URL — accessible to anyone with Cloud Monitoring viewer role

Alerting Policies¶

Alert Policy Components¶

Condition: what metric/log to evaluate + threshold
Notification channels: where to send alerts (email, Slack, PagerDuty, webhook, Pub/Sub)
Documentation: runbook URL, custom message for on-call engineer

# Create alert via gcloud
gcloud monitoring policies create \
  --display-name="High Error Rate" \
  --condition-display-name="Error rate > 5%" \
  --condition-filter='resource.type="k8s_container" AND metric.type="prometheus.googleapis.com/http_request_errors_total/counter"' \
  --condition-threshold-value=0.05 \
  --condition-comparison=COMPARISON_GT \
  --condition-duration=300s \
  --notification-channels=CHANNEL_ID

Notification Channels¶

# Create email notification channel
gcloud monitoring channels create \
  --display-name="OpsTeam Email" \
  --type=email \
  --channel-labels=email_address=ops@company.com

# List channels
gcloud monitoring channels list

# Third-party integrations: webhook (for PagerDuty, Rootly, Opsgenie)
gcloud monitoring channels create \
  --display-name="PagerDuty" \
  --type=webhook_tokenauth \
  --channel-labels=url=https://events.pagerduty.com/integration/KEY/enqueue

SLO-Based Alerting¶

# Alert on SLO burn rate
gcloud monitoring policies create \
  --display-name="SLO Fast Burn" \
  --condition-filter='select_slo_burn_rate("projects/PROJECT/services/svc/serviceLevelObjectives/slo")' \
  --condition-threshold-value=14.4 \
  --condition-comparison=COMPARISON_GT \
  --condition-duration=3600s \
  --notification-channels=CHANNEL_ID

Cost Control Alerting¶

# Budget alert
gcloud billing budgets create \
  --billing-account=BILLING_ACCOUNT_ID \
  --display-name="Monthly Budget Alert" \
  --budget-amount=1000USD \
  --threshold-rule=percent=80,basis=CURRENT_SPEND \
  --threshold-rule=percent=100,basis=FORECASTED_SPEND

Third-Party Alerting Integrations¶

Tool	Integration method
PagerDuty	Webhook notification channel → PagerDuty Events API
Rootly	Webhook notification channel
Opsgenie	Webhook notification channel
Slack	Slack notification channel (built-in)
Pub/Sub	Route alerts to Pub/Sub → custom handler

Gemini Cloud Assist for Metrics¶

In Metrics Explorer: “Show me CPU usage for my production GKE cluster”
Interpret anomalies: “Why is my error rate spiking?”
Generate alert conditions from natural language

4.4 – Distributed Tracing¶

Core Concepts¶

A trace represents an end-to-end request across services
A span represents a single operation within a trace (one service call, one DB query)
Trace ID: unique ID propagated across all services for a request
Parent span ID: links child spans to parent, forming the trace tree

Trace (single request):
  span: API Gateway (100ms total)
    span: Auth Service (10ms)
    span: User Service (50ms)
      span: Database Query (30ms)
    span: Cache Lookup (5ms)

OpenTelemetry (OTel)¶

GCP’s preferred instrumentation framework — vendor-neutral, widely supported.

Instrumentation (Python example)¶

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
otlp_exporter = OTLPSpanExporter(endpoint="http://localhost:4317")  # Ops Agent OTLP receiver
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("my-service")

# Instrument a function
with tracer.start_as_current_span("process-order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("user.id", user_id)
    result = process_order(order_id)

Auto-instrumentation (no code changes)¶

# Python auto-instrumentation
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap --action=install
opentelemetry-instrument python app.py

Cloud Trace¶

Managed distributed tracing service
Collects spans from: Ops Agent OTLP receiver, direct API, auto-instrumented GCP services
GCP services that auto-generate traces: Cloud Run, App Engine, GKE (with mesh), Cloud Functions

Sending Traces to Cloud Trace¶

Option 1: Ops Agent OTLP receiver (recommended for VMs)

# ops-agent config.yaml
traces:
  receivers:
    otlp:
      type: otlp
      grpc_endpoint: 0.0.0.0:4317
  exporters:
    google:
      type: google_cloud_trace
  service:
    pipelines:
      trace_pipeline:
        receivers: [otlp]
        exporters: [google]

Option 2: Direct Cloud Trace exporter

from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
exporter = CloudTraceSpanExporter(project_id="my-project")

Analyzing Traces in Cloud Trace¶

Trace Waterfall¶

Visualize span hierarchy and timing
Identify bottlenecks: longest spans in the critical path
Look for: unexpected fan-out, sequential calls that could be parallel, slow DB queries

API Gateway [100ms] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  Auth Service [10ms] ━━━
  User Service [50ms]     ━━━━━━━━━━━━━━━
    DB Query [30ms]           ━━━━━━━━━
  Cache [5ms]                              ━━

What to look for:¶

Long tail latency: p99 >> p50 — investigate outlier spans
Missing spans: broken trace propagation between services
High span count: excessive calls (N+1 problem, chatty microservices)
Errors in spans: span.status = ERROR

Correlating Traces with Logs¶

Structured Logging with Trace Context¶

import logging
from opentelemetry import trace

# Get current trace context
span = trace.get_current_span()
ctx = span.get_span_context()

# Add trace fields to structured log
logging.info({
    "message": "Processing order",
    "trace": f"projects/PROJECT/traces/{format(ctx.trace_id, '032x')}",
    "spanId": format(ctx.span_id, '016x'),
    "traceSampled": True
})

In Logs Explorer: when a log has trace and spanId fields in the correct format, the UI shows a link to the trace → click through directly.

Special log fields for trace correlation¶

{
  "message": "Order processed",
  "logging.googleapis.com/trace": "projects/PROJECT/traces/TRACE_ID",
  "logging.googleapis.com/spanId": "SPAN_ID",
  "logging.googleapis.com/traceSampled": true
}

Gemini Cloud Assist for Traces¶

Analyze trace anomalies: “Why is this trace 10x slower than average?”
Suggest which service is the bottleneck
Explain unusual patterns in trace waterfall

4.5 – Troubleshooting Issues¶

Troubleshooting Framework¶

Define the symptom — user impact, not internal metric
Check recent changes — deployments, config changes, infra changes
Examine the golden signals — latency, traffic, errors, saturation
Drill down — logs → traces → metrics for the affected service
Isolate the component — narrow to service → pod → node → dependency
Mitigate — rollback, add capacity, redirect traffic (fix user impact first)
Root cause — after user impact resolved

Infrastructure Issues¶

# Node issues in GKE
kubectl get nodes
kubectl describe node NODE_NAME  # Events section

# Pod scheduling failures
kubectl get events --field-selector reason=FailedScheduling
kubectl describe pod POD_NAME

# Resource exhaustion
kubectl top nodes
kubectl top pods --all-namespaces

# GCE VM issues
gcloud compute instances describe INSTANCE --zone=ZONE
gcloud compute ssh INSTANCE --zone=ZONE  # Check /var/log/syslog

CI/CD Pipeline Issues¶

# Cloud Build failure
gcloud builds list --filter="status=FAILURE" --limit=10
gcloud builds log BUILD_ID

# Cloud Deploy rollout failure
gcloud deploy rollouts describe ROLLOUT_NAME --delivery-pipeline=PIPELINE --region=REGION
gcloud logging read 'resource.type="clouddeploy.googleapis.com/DeliveryPipeline"' --limit=50

Application Issues¶

# Find errors in GKE workloads
gcloud logging read \
  'resource.type="k8s_container" AND severity>=ERROR AND resource.labels.cluster_name="prod"' \
  --limit=100 --project=PROJECT

# Trace slow requests
# Cloud Trace → Trace List → Sort by latency → Examine waterfalls

# Check pod logs directly
kubectl logs DEPLOYMENT_NAME --tail=100 -f
kubectl logs POD_NAME -c CONTAINER_NAME --previous  # Previous crash logs

Observability Issues¶

Issue	Check
Metrics not appearing	SA permissions (`roles/monitoring.metricWriter`), Ops Agent running
Logs not ingested	SA permissions (`roles/logging.logWriter`), check Ops Agent status
Traces not showing	OTel exporter config, SA permissions (`roles/cloudtrace.agent`)
Alerts not firing	Alert condition, notification channel, alert policy status
Uptime check failing	Check endpoint accessibility from global probers, firewall rules

Performance and Latency Issues¶

# Check for resource throttling in GKE
kubectl top pods -n production
kubectl describe pod POD --namespace=production | grep -A5 "Limits:"

# Check Cloud Run cold starts
gcloud logging read \
  'resource.type="cloud_run_revision" AND textPayload:"Cold Start"'

# Check for GCE network performance
gcloud compute instances get-serial-port-output INSTANCE --zone=ZONE

Common Performance Root Causes¶

Symptom	Likely cause	Fix
p99 latency spike	GC pauses, cold starts	JVM tuning, `--min-instances`
Gradual latency increase	Memory leak, connection pool exhaustion	Rolling restart, connection pool tuning
Sudden error rate spike	New deployment, dependency failure	Rollback, circuit breaker
Periodic latency spikes	Cron jobs, batch processing	Schedule off-peak, resource limits
High CPU throttling (K8s)	CPU limits too low	Increase limits or remove (VPA)

Exam Tips¶

Golden signals: Latency, Traffic, Errors, Saturation (L-T-E-S)
PromQL supported natively in Cloud Monitoring (GMP) — use for Kubernetes metrics
Trace correlation in logs requires: logging.googleapis.com/trace and logging.googleapis.com/spanId fields
Cloud Trace auto-instruments: Cloud Run, App Engine, some GKE configs
Ops Agent OTLP receiver = unified way to collect app traces/metrics on VMs
kubectl logs --previous = logs from the previous container instance (after crash)
Gemini Cloud Assist available in: Logs Explorer, Metrics Explorer, Cloud Trace UI