4.1 – Instrumenting & Collecting Telemetry | 4.2 – Managing & Analyzing Logs¶

4.1 – Instrumenting and Collecting Telemetry¶

The Three Pillars of Observability¶

Pillar	GCP Service	Data type
Metrics	Cloud Monitoring + Google Cloud Managed Service for Prometheus	Time-series numbers
Logs	Cloud Logging	Structured/unstructured text events
Traces	Cloud Trace	Request path + timing across services

Collecting Logs¶

Ops Agent (Compute Engine)¶

Primary agent for Compute Engine VMs
Replaces legacy Stackdriver Logging agent + Monitoring agent
Single agent: combines Fluent Bit (logs) and the OpenTelemetry Collector (metrics/traces)
Supports: OTLP metrics and traces from instrumented apps

# Install Ops Agent on VM
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

# Check status
sudo systemctl status google-cloud-ops-agent

Custom log collection (/etc/google-cloud-ops-agent/config.yaml; restart the agent after editing):

logging:
  receivers:
    my_app_logs:
      type: files
      include_paths:
        - /var/log/myapp/*.log
  service:
    pipelines:
      my_pipeline:
        receivers: [my_app_logs]

GKE Logging¶

System logs: automatically collected (kubelet, container runtime, system namespaces); control plane component logs (API server, scheduler, controller manager) are opt-in
Application logs: stdout/stderr from containers automatically ingested
Cloud Logging agent (Fluentd/Fluent Bit DaemonSet) runs by default in GKE

# Enable system and workload logging on a GKE cluster
# (--logging replaces the legacy --enable-cloud-logging flag)
gcloud container clusters update CLUSTER \
  --logging=SYSTEM,WORKLOAD  # WORKLOAD = app logs
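
Because container stdout/stderr is ingested automatically, an application only needs to print well-formed lines. The following is a minimal sketch, assuming the logging agent parses single-line JSON into `jsonPayload` and treats the `severity` field specially; the helper name `emit_log` and the extra fields are hypothetical.

```python
import json
import sys


def emit_log(severity, message, **fields):
    """Write one JSON log line to stdout and return it.

    Assumption: the GKE logging agent parses each JSON line into a
    structured entry and maps "severity" to the entry's severity.
    """
    entry = {"severity": severity, "message": message, **fields}
    line = json.dumps(entry)
    print(line, file=sys.stdout, flush=True)
    return line


# Hypothetical application event with two custom fields.
line = emit_log("ERROR", "payment failed", order_id="A-123", latency_ms=842)
```

Keeping each entry on a single line matters here, since multi-line output is generally treated as separate log entries.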

Cloud Audit Logs¶

Type	Contains	Default
Admin Activity	Who created/modified resources	Always on, free
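Data Access	Who read/wrote data	Off by default (except BigQuery), billable
System Event	GCP-automated resource changes	Always on, free
Policy Denied	Access denied by a security policy (e.g., VPC Service Controls)	On by default, billable; cannot be disabled but can be excluded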

# Enable Data Access logs for a service
gcloud projects get-iam-policy PROJECT > policy.yaml
# Add auditConfigs section for service
# ...
gcloud projects set-iam-policy PROJECT policy.yaml
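
The manual edit between `get-iam-policy` and `set-iam-policy` can be sketched as a small transformation on the policy document. This is an illustration only: the function name is hypothetical, the policy is handled as a plain dict (for example after loading the exported file), and the `etag` value is a placeholder.

```python
import copy


def enable_data_access_logs(policy, service, log_types=("DATA_READ", "DATA_WRITE")):
    """Return a copy of an IAM policy dict with Data Access logs enabled for one service."""
    updated = copy.deepcopy(policy)
    configs = updated.setdefault("auditConfigs", [])
    # Reuse the service's existing audit config if there is one.
    entry = next((c for c in configs if c.get("service") == service), None)
    if entry is None:
        entry = {"service": service, "auditLogConfigs": []}
        configs.append(entry)
    log_configs = entry.setdefault("auditLogConfigs", [])
    existing = {c["logType"] for c in log_configs}
    for log_type in log_types:
        if log_type not in existing:
            log_configs.append({"logType": log_type})
    return updated


# Placeholder policy; a real one comes from `gcloud projects get-iam-policy`.
policy = {"bindings": [], "etag": "PLACEHOLDER"}
new_policy = enable_data_access_logs(policy, "storage.googleapis.com")
```

The function is idempotent, so running it twice for the same service does not duplicate log types.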

VPC Flow Logs¶

Record network flow samples for VPC subnets
Useful for: network troubleshooting, security monitoring, traffic analysis
Enable per subnet; configurable flow sampling rate (default 0.5, i.e., 50% of collected flows are logged)

gcloud compute networks subnets update my-subnet \
  --region=us-central1 \
  --enable-flow-logs \
  --logging-flow-sampling=0.5 \
  --logging-aggregation-interval=interval-5-sec

Cloud Service Mesh Logs¶

Access logs, audit logs from Envoy sidecars
Trace context propagated via headers (X-B3-TraceId, etc.)
Configure via Telemetry CRD in Istio/Cloud Service Mesh
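
Sidecars can only stitch spans into one trace if the application forwards the trace headers from each inbound request to its outbound calls. A minimal sketch of that forwarding step follows; the header list goes beyond the `X-B3-TraceId` named above and is taken from the B3 propagation convention plus `x-request-id`, and the function name is hypothetical.

```python
# Headers to copy from the inbound request to outbound requests (lowercased).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def propagate_trace_headers(incoming):
    """Return only the trace-context headers, matching names case-insensitively."""
    wanted = set(TRACE_HEADERS)
    return {k: v for k, v in incoming.items() if k.lower() in wanted}


inbound = {
    "Host": "checkout.example.com",
    "X-B3-TraceId": "463ac35c9f6413ad48485a3953bb6124",
    "X-B3-SpanId": "a2fb4a1d1a96d312",
    "X-B3-Sampled": "1",
}
outbound = propagate_trace_headers(inbound)
```

The resulting dict would be merged into the headers of any downstream HTTP call made while handling the request.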

Collecting Metrics¶

GCP Platform Metrics (auto-collected)¶

Most GCP services emit metrics to Cloud Monitoring automatically
Prefix pattern: compute.googleapis.com/, container.googleapis.com/, run.googleapis.com/
No configuration needed — always available

Google Cloud Managed Service for Prometheus (GMP)¶

Drop-in managed Prometheus for GKE and other workloads
No Prometheus server to manage — GCP runs it
Uses standard Prometheus scraping + PromQL
Data stored in Cloud Monitoring (not separate Prometheus storage)

# PodMonitoring CRD — scrape app metrics
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s

# Enable GMP on GKE cluster
gcloud container clusters update CLUSTER \
  --enable-managed-prometheus
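
What a scrape actually collects is the plain-text payload the application serves on its metrics port. The sketch below renders a counter in the Prometheus text exposition format; the function name and sample values are hypothetical, and label values are not escaped, which a real client library would handle.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in Prometheus text exposition format.

    samples: list of (labels_dict, value) pairs.
    Note: label values are assumed to need no escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 42)],
)
```

In practice an official Prometheus client library would generate this output; the sketch only shows what the scraper reads.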

Application Metrics (Custom)¶

Use OpenTelemetry SDK in your app → export to Cloud Monitoring via Ops Agent OTLP receiver
Or use Cloud Monitoring client libraries directly
Custom metrics are billed by ingested volume beyond the free monthly allotment (Managed Service for Prometheus metrics are billed per sample)

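from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Export over OTLP/gRPC; assumes the Ops Agent OTLP receiver listens on localhost:4317
reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint="localhost:4317", insecure=True))
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("my-app")
request_counter = meter.create_counter("http.requests")
request_counter.add(1, {"method": "GET", "status": "200"})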

Hybrid/Multi-Cloud Metrics¶

Ops Agent on non-GCP VMs (AWS EC2, on-prem) → sends to Cloud Monitoring
OpenTelemetry Collector can collect from any source → forward to GCP
Use metric labels to identify source environment

Synthetic Monitoring¶

Uptime checks: proactively test HTTP/HTTPS/TCP endpoints from multiple global locations
Synthetic monitors: run custom scripts (Node.js) to test complex user journeys

# Create uptime check (display name is positional; --period is in minutes)
gcloud monitoring uptime create "App Health" \
  --resource-type=uptime-url \
  --resource-labels=host=app.example.com,project_id=PROJECT \
  --protocol=https \
  --port=443 \
  --path=/healthz \
  --period=1

Synthetic monitor (Cloud Monitoring):
Runs a Cloud Function that executes test scripts
Can test multi-step workflows (login → checkout → confirm)
Results appear as metrics → alerting on failure

Custom Metrics and Log-Based Metrics¶

Custom Metrics¶

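# Write a custom metric via the Monitoring API (gcloud has no command for writing time series)
# endTime must be a recent timestamp; the one below is a placeholder
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  "https://monitoring.googleapis.com/v3/projects/PROJECT/timeSeries" \
  -d '{"timeSeries": [{
    "metric": {"type": "custom.googleapis.com/myapp/queue_depth"},
    "resource": {"type": "global", "labels": {"project_id": "PROJECT"}},
    "metricKind": "GAUGE",
    "valueType": "INT64",
    "points": [{"interval": {"endTime": "2024-01-01T00:00:00Z"}, "value": {"int64Value": "42"}}]
  }]}'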

Log-Based Metrics¶

Extract metrics from log entries — count errors, extract latency values from logs
Counter metric: count log entries matching a filter
Distribution metric: extract numeric values (e.g., latency) from structured logs

# Create counter metric for 5xx errors
gcloud logging metrics create http_5xx_errors \
  --description="Count of 5xx HTTP errors" \
  --log-filter='resource.type="k8s_container" AND httpRequest.status>=500'

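# Distribution metric for latency: needs a config file, e.g. request_latency.yaml containing
#   description: Request latency distribution
#   filter: resource.type="gce_instance" AND jsonPayload.latency!=""
#   valueExtractor: EXTRACT(jsonPayload.latency)
#   metricDescriptor: {metricKind: DELTA, valueType: DISTRIBUTION}
#   bucketOptions: {exponentialBuckets: {numFiniteBuckets: 64, growthFactor: 2, scale: 0.01}}  # example values
gcloud logging metrics create request_latency \
  --config-from-file=request_latency.yaml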
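
To make the difference between the two metric kinds concrete, the sketch below reproduces them locally over a handful of in-memory entries: a counter counts entries that match a predicate, and a distribution sorts an extracted numeric field into buckets. The entries, function names, and bucket bounds are all hypothetical; Cloud Logging does this server-side.

```python
from bisect import bisect_right


def count_matching(entries, predicate):
    """Counter metric: number of entries that match the filter."""
    return sum(1 for e in entries if predicate(e))


def latency_distribution(entries, bounds):
    """Distribution metric: histogram of jsonPayload.latency.

    Bucket 0 holds values below bounds[0]; bucket i holds values in
    [bounds[i-1], bounds[i]); the last bucket holds values >= bounds[-1].
    """
    counts = [0] * (len(bounds) + 1)
    for e in entries:
        latency = e.get("jsonPayload", {}).get("latency")
        if latency is None:
            continue  # entries without the field contribute nothing
        counts[bisect_right(bounds, latency)] += 1
    return counts


entries = [
    {"httpRequest": {"status": 200}, "jsonPayload": {"latency": 0.03}},
    {"httpRequest": {"status": 503}, "jsonPayload": {"latency": 1.20}},
    {"httpRequest": {"status": 500}, "jsonPayload": {"latency": 0.40}},
    {"httpRequest": {"status": 404}, "jsonPayload": {}},
]
errors_5xx = count_matching(entries, lambda e: e["httpRequest"]["status"] >= 500)
histogram = latency_distribution(entries, bounds=[0.1, 0.5, 1.0])
```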

4.2 – Managing and Analyzing Logs¶

Cloud Logging Architecture¶

Log Sources → Cloud Logging API → Log Router → _Required sink → _Required bucket (400d; Admin Activity, System Event)
                                              → _Default sink  → _Default bucket (30d retention)
                                              → Custom sinks   → Custom buckets (your retention) / BigQuery / Pub/Sub / GCS
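
Routing can be pictured as each sink independently deciding whether to take an entry: the entry must match the sink's inclusion filter and none of its exclusions, and one entry may land in several destinations. The sketch below models that with Python predicates standing in for real filters; the sink definitions and entry fields are hypothetical.

```python
def route(entry, sinks):
    """Return the names of all sinks that would receive the entry."""
    matched = []
    for sink in sinks:
        if not sink["filter"](entry):
            continue  # inclusion filter did not match
        if any(excluded(entry) for excluded in sink.get("exclusions", [])):
            continue  # dropped by one of this sink's exclusion filters
        matched.append(sink["name"])
    return matched


sinks = [
    {"name": "_Required", "filter": lambda e: e["log"] == "admin_activity"},
    {
        "name": "_Default",
        "filter": lambda e: e["log"] != "admin_activity",
        "exclusions": [lambda e: e.get("path") == "/healthz"],
    },
    {"name": "bq-warnings", "filter": lambda e: e["severity"] in {"WARNING", "ERROR", "CRITICAL"}},
]

app_error = route({"log": "app", "severity": "ERROR", "path": "/checkout"}, sinks)
health_check = route({"log": "app", "severity": "INFO", "path": "/healthz"}, sinks)
admin_change = route({"log": "admin_activity", "severity": "NOTICE"}, sinks)
```

An exclusion on one sink does not affect the others, which is why the health-check entry would still reach any sink whose filter it matched.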

Logs Explorer & Logging Query Language (LQL)¶

Query Syntax¶

# Filter by resource type
resource.type="k8s_container"

# Filter by severity
severity>=ERROR

# Filter by label
resource.labels.cluster_name="prod-cluster"

# Filter by log name
logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Factivity"

# Full text search
"NullPointerException"

# Time range (combined with UI picker)
timestamp >= "2024-01-01T00:00:00Z" AND timestamp <= "2024-01-01T23:59:59Z"

# Structured field access
jsonPayload.status_code=500

# Boolean logic
(severity=ERROR OR severity=CRITICAL) AND resource.labels.namespace_name="production"

# Exclusion (negate)
NOT httpRequest.status=200
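
When filters are generated by scripts or tooling, the clauses above can be composed programmatically. A minimal sketch, assuming values contain no double quotes (no escaping is done); the function and parameter names are hypothetical.

```python
def build_filter(resource_type=None, min_severity=None, labels=None, text=None):
    """Join individual query clauses with AND into one filter string."""
    clauses = []
    if resource_type:
        clauses.append(f'resource.type="{resource_type}"')
    if min_severity:
        clauses.append(f"severity>={min_severity}")
    for key, value in sorted((labels or {}).items()):
        clauses.append(f'resource.labels.{key}="{value}"')
    if text:
        clauses.append(f'"{text}"')  # full-text search term
    return " AND ".join(clauses)


query = build_filter(
    resource_type="k8s_container",
    min_severity="ERROR",
    labels={"cluster_name": "prod-cluster"},
    text="NullPointerException",
)
```

The resulting string can be pasted into Logs Explorer or passed to a command that accepts a log filter.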

Useful Filters for DevOps¶

# GKE container logs for a specific pod
resource.type="k8s_container"
resource.labels.pod_name:"myapp-"
severity>=WARNING

# Cloud Build failures
resource.type="build"
jsonPayload.status="FAILURE"

# Cloud Deploy rollout events
resource.type="clouddeploy.googleapis.com/DeliveryPipeline"

# K8s audit logs
logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.resourceName:"namespaces/production"

# Firewall Rules Logging - denied connections (VPC Flow Logs carry no disposition field)
resource.type="gce_subnetwork"
logName:"compute.googleapis.com%2Ffirewall"
jsonPayload.disposition="DENIED"

Log Sinks (Export and Routing)¶

# Export logs to BigQuery
gcloud logging sinks create my-bq-sink \
  bigquery.googleapis.com/projects/PROJECT/datasets/logs_dataset \
  --log-filter='severity>=WARNING' \
  --description="Warning+ logs to BigQuery"

# Export to GCS (archival)
gcloud logging sinks create archive-sink \
  storage.googleapis.com/my-log-archive-bucket \
  --log-filter='logName:"cloudaudit"' \
  --billing-project=PROJECT

# Export to Pub/Sub (streaming)
gcloud logging sinks create stream-sink \
  pubsub.googleapis.com/projects/PROJECT/topics/log-events \
  --log-filter='resource.type="k8s_container"'

Grant sink SA write access¶

# After creating sink, grant SA permission to write to destination
SINK_SA=$(gcloud logging sinks describe my-bq-sink --format="value(writerIdentity)")
gcloud projects add-iam-policy-binding PROJECT \
  --member="$SINK_SA" \
  --role="roles/bigquery.dataEditor"
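
The role to grant depends on the destination type. The sketch below derives it from the sink destination string; the mapping reflects the roles commonly documented for each destination (BigQuery Data Editor, Storage Object Creator, Pub/Sub Publisher), and the function name is hypothetical.

```python
# Role the sink's writer identity needs on each destination type.
ROLE_BY_SERVICE = {
    "bigquery.googleapis.com": "roles/bigquery.dataEditor",
    "storage.googleapis.com": "roles/storage.objectCreator",
    "pubsub.googleapis.com": "roles/pubsub.publisher",
}


def writer_role_for(destination):
    """Return the IAM role to grant for a sink destination such as
    'pubsub.googleapis.com/projects/PROJECT/topics/log-events'."""
    service = destination.split("/", 1)[0]
    try:
        return ROLE_BY_SERVICE[service]
    except KeyError:
        raise ValueError(f"unsupported sink destination: {destination}") from None


bq_role = writer_role_for("bigquery.googleapis.com/projects/PROJECT/datasets/logs_dataset")
gcs_role = writer_role_for("storage.googleapis.com/my-log-archive-bucket")
pubsub_role = writer_role_for("pubsub.googleapis.com/projects/PROJECT/topics/log-events")
```

Granting the role on the specific dataset, bucket, or topic rather than the whole project keeps the writer identity's access narrower.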

Log Retention¶

Bucket	Default retention	Notes
`_Default`	30 days	Configurable (1-3650 days)
`_Required`	400 days	Fixed — admin activity, system events; cannot reduce
Custom	You define	Create for long-term specific needs

# Update retention on _Default bucket
gcloud logging buckets update _Default \
  --location=global \
  --retention-days=90

Handling Sensitive Data (PII/PHI)¶

Log Redaction Options¶

Data Access Audit Log: don’t enable for services with PII in request/response body
Sensitive Data Protection (formerly Cloud DLP) + Dataflow: real-time de-identification of log streams before storage
Cloud Logging log processors (beta): filter/redact fields at ingestion time
Structured logging: emit logs without PII fields in the application layer (best)

# Exclusion filter: drop logs with 16-digit runs (possible card numbers); exclusion name is an example
gcloud logging sinks update _Default \
  --add-exclusion='name=drop-potential-pan,description=Drop potential PAN data,filter=jsonPayload.message=~"[0-9]{16}"'
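
Dropping an entire entry loses the surrounding context, so the application-layer option is often preferable: mask the sensitive value before the message is ever logged. A minimal sketch, assuming a bare 16-digit run is a good enough signal for a card number; the pattern, placeholder text, and function name are illustrative.

```python
import re

# Exactly 16 consecutive digits, not part of a longer digit run.
PAN_PATTERN = re.compile(r"\b\d{16}\b")


def redact(message):
    """Replace anything that looks like a 16-digit card number."""
    return PAN_PATTERN.sub("[REDACTED]", message)


safe = redact("charge failed for card 4111111111111111 (order 42)")
```

A production redactor would likely also handle separators (spaces, dashes) and validate candidates, for example with a Luhn check, to cut false positives.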

Gemini Cloud Assist for Logs¶

Available in Logs Explorer UI
Can: explain a log entry, generate a query from natural language, suggest next steps
Example prompts: “Show me all errors in the payment service in the last hour”, “Explain this stack trace”

Log Costs Optimization¶

Technique	Description
Exclusion filters	Drop high-volume low-value logs (e.g., health checks, 200s) before ingestion
Log sampling	VPC Flow Logs: reduce sampling rate (0.1 = 10%)
Log-based metrics	Extract signal from logs → use metrics instead of querying raw logs
Routing to cheaper storage	Route cold logs to GCS (cheaper than Logging storage)
Retention tuning	Reduce _Default retention to minimum needed

# Exclude health check logs (save $$$); exclusion name is an example
gcloud logging sinks update _Default \
  --add-exclusion='name=exclude-health-checks,filter=httpRequest.requestUrl:"/healthz" AND httpRequest.status=200'

Exam Tips¶

Ops Agent = Fluent Bit (logs) + OTel Collector (metrics/traces) — single agent
VPC Flow Logs = network-level telemetry; not app logs
_Required bucket retention = 400 days, cannot be changed
Log-based metrics = extract metrics from logs → use in alerting and dashboards
Sinks route to: BigQuery (analysis), Pub/Sub (streaming), GCS (archival)
Exclusion filters applied at ingestion → don’t pay for unwanted logs
Gemini Cloud Assist can explain log entries and generate LQL queries
Data Access audit logs = OFF by default (except BigQuery); must explicitly enable per service