4.1 – Instrumenting & Collecting Telemetry | 4.2 – Managing & Analyzing Logs
4.1 – Instrumenting and Collecting Telemetry
The Three Pillars of Observability
| Pillar | GCP Service | Data type |
| Metrics | Cloud Monitoring + Google Cloud Managed Service for Prometheus | Time-series numbers |
| Logs | Cloud Logging | Structured/unstructured text events |
| Traces | Cloud Trace | Request path + timing across services |
Collecting Logs
Ops Agent (Compute Engine)
- Primary agent for Compute Engine VMs
- Replaces legacy Stackdriver Logging agent + Monitoring agent
- Single agent: combines Fluent Bit (logs) + OpenTelemetry Collector (metrics/traces)
- Supports: OTLP metrics and traces from instrumented apps
# Install Ops Agent on VM
curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install
# Check status
sudo systemctl status google-cloud-ops-agent
Custom log collection (config.yaml):
logging:
  receivers:
    my_app_logs:
      type: files
      include_paths:
        - /var/log/myapp/*.log
  service:
    pipelines:
      my_pipeline:
        receivers: [my_app_logs]
GKE Logging
- System logs: collected automatically (kubelet, container runtime, system Pods); control plane component logs (API server, scheduler, controller manager) are opt-in
- Application logs: stdout/stderr from containers automatically ingested
- Cloud Logging agent (Fluentd/Fluent Bit DaemonSet) runs by default in GKE
# Enable logging on GKE cluster
gcloud container clusters update CLUSTER \
--logging=SYSTEM,WORKLOAD # WORKLOAD = app logs
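Since container stdout/stderr is ingested automatically, a common pattern is to print one JSON object per line so the entry lands as a structured payload. A minimal sketch, assuming the GKE logging agent maps the `severity` and `message` keys of a JSON line onto the log entry; `JsonFormatter` and `build_logger` are illustrative names, not part of any GCP library:

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record):
        entry = {
            "severity": record.levelname,  # e.g. INFO, WARNING, ERROR
            "message": record.getMessage(),
            "logger": record.name,
        }
        return json.dumps(entry)


def build_logger(name, stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger


if __name__ == "__main__":
    build_logger("myapp").error("payment failed for order %s", "A-1001")
```

Extra keys such as `logger` would typically show up under jsonPayload, which is what makes filters like jsonPayload.status_code=500 possible.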
Cloud Audit Logs
| Type | Contains | Default |
| Admin Activity | Who created/modified resources | Always on, free |
| Data Access | Who read/wrote data | Off by default (except BigQuery), billable |
| System Event | GCP-automated resource changes | Always on, free |
| Policy Denied | Access denied by a security policy (e.g., VPC Service Controls) | On by default, billable; can be excluded but not disabled |
# Enable Data Access logs for a service
gcloud projects get-iam-policy PROJECT > policy.yaml
# Add auditConfigs section for service
# ...
gcloud projects set-iam-policy PROJECT policy.yaml
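The "add auditConfigs section" step can be sketched in Python, assuming the policy was exported as JSON (for example with --format=json) and loaded into a dict. The `auditConfigs` / `auditLogConfigs` / `logType` field names follow the IAM policy schema; the function name and the sample etag are hypothetical:

```python
import copy
import json


def enable_data_access_logs(policy, service, log_types=("DATA_READ", "DATA_WRITE")):
    """Return a copy of an IAM policy dict with Data Access logs enabled for one service."""
    updated = copy.deepcopy(policy)
    audit_configs = updated.setdefault("auditConfigs", [])
    # Reuse the existing entry for this service if there is one.
    entry = next((c for c in audit_configs if c.get("service") == service), None)
    if entry is None:
        entry = {"service": service, "auditLogConfigs": []}
        audit_configs.append(entry)
    existing = {c["logType"] for c in entry["auditLogConfigs"]}
    for log_type in log_types:
        if log_type not in existing:
            entry["auditLogConfigs"].append({"logType": log_type})
    return updated


if __name__ == "__main__":
    policy = {"bindings": [], "etag": "hypothetical-etag"}
    print(json.dumps(enable_data_access_logs(policy, "storage.googleapis.com"), indent=2))
```

The function leaves bindings and etag untouched, so the result can be written back to a file and passed to set-iam-policy.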
VPC Flow Logs
- Record network flow samples for VPC subnets
- Useful for: network troubleshooting, security monitoring, traffic analysis
- Enable per subnet; the sampling rate of collected flows is configurable from 0.0 to 1.0 (gcloud default 0.5)
gcloud compute networks subnets update my-subnet \
--region=us-central1 \
--enable-flow-logs \
--logging-flow-sampling=0.5 \
--logging-aggregation-interval=interval-5-sec
Cloud Service Mesh Logs
- Access logs, audit logs from Envoy sidecars
- Trace context propagated via headers (X-B3-TraceId, etc.)
- Configure via Telemetry CRD in Istio/Cloud Service Mesh
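Sidecars cannot link an inbound request to the outbound calls it triggers, so the application usually has to copy the trace headers itself. A minimal sketch of that forwarding step; the header list follows the B3 convention mentioned above plus x-request-id, and the function name is illustrative:

```python
# Headers commonly forwarded so the mesh can stitch spans into one trace.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def propagate_trace_headers(incoming):
    """Pick the trace-context headers out of an inbound request's headers.

    Header names are matched case-insensitively; the result uses lowercase names
    and is meant to be merged into the headers of outbound requests.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}


if __name__ == "__main__":
    inbound = {"X-B3-TraceId": "463ac35c9f6413ad", "X-B3-SpanId": "a2fb4a1d1a96d312", "Accept": "*/*"}
    print(propagate_trace_headers(inbound))
```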
Collecting Metrics
- Most GCP services emit system metrics to Cloud Monitoring automatically
- Prefix pattern: compute.googleapis.com/, container.googleapis.com/, run.googleapis.com/
- No configuration needed — always available
Google Cloud Managed Service for Prometheus (GMP)
- Drop-in managed Prometheus for GKE and other workloads
- No Prometheus server to manage — GCP runs it
- Uses standard Prometheus scraping + PromQL
- Data stored in Cloud Monitoring (not separate Prometheus storage)
# PodMonitoring CRD — scrape app metrics
apiVersion: monitoring.googleapis.com/v1
kind: PodMonitoring
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s
# Enable GMP on GKE cluster
gcloud container clusters update CLUSTER \
--enable-managed-prometheus
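For context on what a PodMonitoring scrape actually reads: the pod's metrics port serves plain text in the Prometheus exposition format. A rough sketch of that format for a single counter, using only the standard library; a real service would normally use a Prometheus client library instead, and `render_counter` is a made-up helper:

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the current counter value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    samples = {
        (("method", "GET"), ("status", "200")): 3,
        (("method", "POST"), ("status", "500")): 1,
    }
    print(render_counter("http_requests_total", "Total HTTP requests.", samples), end="")
```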
Application Metrics (Custom)
- Use OpenTelemetry SDK in your app → export to Cloud Monitoring via Ops Agent OTLP receiver
- Or use Cloud Monitoring client libraries directly
- Custom metrics are billable after the free tier (by ingested volume; Prometheus-format metrics by sample count)
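from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
# Export over OTLP/gRPC; assumes an OTLP receiver (e.g., Ops Agent) listening on localhost:4317
reader = PeriodicExportingMetricReader(OTLPMetricExporter(endpoint="localhost:4317", insecure=True))
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("my-app")
request_counter = meter.create_counter("http.requests")
request_counter.add(1, {"method": "GET", "status": "200"})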
Hybrid/Multi-Cloud Metrics
- Ops Agent on non-GCP VMs (AWS EC2, on-prem) → sends to Cloud Monitoring
- OpenTelemetry Collector can collect from any source → forward to GCP
- Use metric labels to identify source environment
Synthetic Monitoring
- Uptime checks: proactively test HTTP/HTTPS/TCP endpoints from multiple global locations
- Synthetic monitors: run custom scripts (Node.js) to test complex user journeys
# Create uptime check
gcloud monitoring uptime create "App Health" \
--resource-type=uptime-url \
--resource-labels=host=app.example.com,project_id=PROJECT \
--protocol=https \
--port=443 \
--path=/healthz \
--period=1 # minutes
Synthetic monitor (Cloud Monitoring):
- Runs a Cloud Function that executes test scripts
- Can test multi-step workflows (login → checkout → confirm)
- Results appear as metrics → alerting on failure
Custom Metrics and Log-Based Metrics
Custom Metrics
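# Write a custom metric via the Monitoring API (timeSeries.create); gcloud has no command for writing points
# endTime is a placeholder; the API expects a recent timestamp
curl -X POST "https://monitoring.googleapis.com/v3/projects/PROJECT/timeSeries" \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
-d '{"timeSeries": [{
"metric": {"type": "custom.googleapis.com/myapp/queue_depth"},
"resource": {"type": "global", "labels": {"project_id": "PROJECT"}},
"metricKind": "GAUGE", "valueType": "INT64",
"points": [{"interval": {"endTime": "2024-01-01T00:00:00Z"}, "value": {"int64Value": "42"}}]
}]}'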
Log-Based Metrics
- Extract metrics from log entries — count errors, extract latency values from logs
- Counter metric: count log entries matching a filter
- Distribution metric: extract numeric values (e.g., latency) from structured logs
# Create counter metric for 5xx errors
gcloud logging metrics create http_5xx_errors \
--description="Count of 5xx HTTP errors" \
--log-filter='resource.type="k8s_container" AND httpRequest.status>=500'
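# Distribution metric for latency (defined in a config file; bucket values below are illustrative)
gcloud logging metrics create request_latency \
--config-from-file=request_latency.yaml
# request_latency.yaml (LogMetric fields):
#   description: Request latency distribution
#   filter: resource.type="gce_instance" AND jsonPayload.latency:*
#   valueExtractor: EXTRACT(jsonPayload.latency)
#   metricDescriptor: {metricKind: DELTA, valueType: DISTRIBUTION}
#   bucketOptions: {exponentialBuckets: {numFiniteBuckets: 20, growthFactor: 2, scale: 1}}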
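To make the two metric kinds concrete, here is a small local simulation over log entries held as dicts. It only mimics the idea (count matches, bucket extracted values) and is not how Cloud Logging computes them; the field names mirror the filters above and the bucket bounds are arbitrary:

```python
from bisect import bisect_right


def count_matching(entries, predicate):
    """Counter metric: number of log entries matching a filter."""
    return sum(1 for e in entries if predicate(e))


def latency_distribution(entries, bounds):
    """Distribution metric: bucket the latency values extracted from entries.

    bounds are ascending bucket edges; bucket i holds values in
    [bounds[i-1], bounds[i]) and the last bucket is the overflow.
    """
    buckets = [0] * (len(bounds) + 1)
    for e in entries:
        latency = e.get("jsonPayload", {}).get("latency")
        if latency is None:
            continue  # entries without the field contribute nothing
        buckets[bisect_right(bounds, float(latency))] += 1
    return buckets


if __name__ == "__main__":
    logs = [
        {"httpRequest": {"status": 500}, "jsonPayload": {"latency": 0.05}},
        {"httpRequest": {"status": 200}, "jsonPayload": {"latency": 0.3}},
        {"httpRequest": {"status": 503}, "jsonPayload": {"latency": 2.0}},
    ]
    print(count_matching(logs, lambda e: e["httpRequest"]["status"] >= 500))
    print(latency_distribution(logs, [0.1, 1.0]))
```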
4.2 – Managing and Analyzing Logs
Cloud Logging Architecture
Log Sources → Cloud Logging API → Log Router → Sinks → _Required bucket (400d, admin activity)
                                                     → _Default bucket (30d retention)
                                                     → Custom buckets (your retention)
                                                     → BigQuery / Pub/Sub / GCS
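The Log Router checks every entry against each sink independently, so one entry can reach several destinations or none. A toy model of that decision, with sinks represented as dicts holding plain Python predicates in place of real filter expressions (all names are illustrative):

```python
def route(entry, sinks):
    """Return the names of the sinks that would receive this entry.

    Each sink is a dict with a 'name', an inclusion 'filter' predicate,
    and an optional list of 'exclusions' predicates.
    """
    matched = []
    for sink in sinks:
        if not sink["filter"](entry):
            continue  # inclusion filter does not match
        if any(excluded(entry) for excluded in sink.get("exclusions", [])):
            continue  # an exclusion filter drops it for this sink only
        matched.append(sink["name"])
    return matched


if __name__ == "__main__":
    sinks = [
        {"name": "_Default", "filter": lambda e: True,
         "exclusions": [lambda e: e.get("path") == "/healthz"]},
        {"name": "bq-warnings", "filter": lambda e: e["severity"] in ("WARNING", "ERROR")},
    ]
    print(route({"severity": "ERROR", "path": "/pay"}, sinks))
```

Note that an exclusion on one sink does not stop another sink from exporting the same entry.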
Logs Explorer & Logging Query Language (LQL)
Query Syntax
# Filter by resource type
resource.type="k8s_container"
# Filter by severity
severity>=ERROR
# Filter by label
resource.labels.cluster_name="prod-cluster"
# Filter by log name
logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
# Full text search
"NullPointerException"
# Time range (combined with UI picker)
timestamp >= "2024-01-01T00:00:00Z" AND timestamp <= "2024-01-01T23:59:59Z"
# Structured field access
jsonPayload.status_code=500
# Boolean logic
(severity=ERROR OR severity=CRITICAL) AND resource.labels.namespace_name="production"
# Exclusion (negate)
NOT httpRequest.status=200
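When queries are generated by scripts rather than typed into Logs Explorer, the clauses above can be assembled programmatically. A small sketch that joins clauses with AND; `build_filter` is a hypothetical helper, and it does not escape quotes inside values:

```python
def build_filter(resource_type=None, min_severity=None, labels=None, text=None):
    """Compose a Logging query string by AND-ing the supplied clauses."""
    clauses = []
    if resource_type:
        clauses.append(f'resource.type="{resource_type}"')
    if min_severity:
        clauses.append(f"severity>={min_severity}")
    for key, value in sorted((labels or {}).items()):
        clauses.append(f'resource.labels.{key}="{value}"')
    if text:
        clauses.append(f'"{text}"')  # full-text search term
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_filter("k8s_container", "ERROR", {"cluster_name": "prod-cluster"}))
```

The resulting string can be pasted into Logs Explorer or passed as the filter argument of a CLI or client-library call.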
Useful Filters for DevOps
# GKE container logs for a specific pod
resource.type="k8s_container"
resource.labels.pod_name:"myapp-"
severity>=WARNING
# Cloud Build failures
resource.type="build"
jsonPayload.status="FAILURE"
# Cloud Deploy rollout events
resource.type="clouddeploy.googleapis.com/DeliveryPipeline"
# K8s audit logs
logName="projects/PROJECT/logs/cloudaudit.googleapis.com%2Factivity"
protoPayload.resourceName:"namespaces/production"
# Firewall Rules Logging - denied connections (VPC Flow Logs have no disposition field)
resource.type="gce_subnetwork"
jsonPayload.disposition="DENIED"
Log Sinks (Export and Routing)
# Export logs to BigQuery
gcloud logging sinks create my-bq-sink \
bigquery.googleapis.com/projects/PROJECT/datasets/logs_dataset \
--log-filter='severity>=WARNING' \
--description="Warning+ logs to BigQuery"
# Export to GCS (archival)
gcloud logging sinks create archive-sink \
storage.googleapis.com/my-log-archive-bucket \
--log-filter='logName:"cloudaudit"' \
--billing-project=PROJECT
# Export to Pub/Sub (streaming)
gcloud logging sinks create stream-sink \
pubsub.googleapis.com/projects/PROJECT/topics/log-events \
--log-filter='resource.type="k8s_container"'
Grant sink SA write access
# After creating sink, grant SA permission to write to destination
SINK_SA=$(gcloud logging sinks describe my-bq-sink --format="value(writerIdentity)")
gcloud projects add-iam-policy-binding PROJECT \
--member="$SINK_SA" \
--role="roles/bigquery.dataEditor"
Log Retention
| Bucket | Default retention | Notes |
| _Default | 30 days | Configurable (1-3650 days) |
| _Required | 400 days | Fixed; admin activity, system events; cannot reduce |
| Custom | You define | Create for long-term specific needs |
# Update retention on _Default bucket
gcloud logging buckets update _Default \
--location=global \
--retention-days=90
Handling Sensitive Data (PII/PHI)
Log Redaction Options
- Data Access Audit Log: don’t enable for services with PII in request/response body
- Cloud DLP + Dataflow: real-time de-identification of log streams before storage
- Cloud Logging log processors (beta): filter/redact fields at ingestion time
- Structured logging: emit logs without PII fields in the application layer (best)
# Exclusion filter: drop logs with credit card numbers (exclusion name is illustrative)
gcloud logging sinks update _Default \
--add-exclusion='name=drop-pan,description=Drop potential PAN data,filter=jsonPayload.message=~"[0-9]{16}"'
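For the application-layer option, masking can happen before a message ever leaves the process. A minimal sketch using a standard-library logging filter; the 13-19 digit pattern is a crude stand-in for card-number detection (no Luhn check), and the class and function names are illustrative:

```python
import logging
import re

# 13-19 consecutive digits: a rough stand-in for a card number (PAN).
PAN_PATTERN = re.compile(r"\b\d{13,19}\b")


def redact(text):
    """Mask digit runs that look like card numbers."""
    return PAN_PATTERN.sub("[REDACTED]", text)


class RedactingFilter(logging.Filter):
    """Logging filter that masks the formatted message before handlers emit it."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = None  # message is already formatted
        return True


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("payments")
    logger.addFilter(RedactingFilter())
    logger.info("charge declined for card %s", "4111111111111111")
```

Unlike an exclusion filter, this keeps the log entry and removes only the sensitive value.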
Gemini Cloud Assist for Logs
- Available in Logs Explorer UI
- Can: explain a log entry, generate a query from natural language, suggest next steps
- Example prompts: “Show me all errors in the payment service in the last hour”, “Explain this stack trace”
Log Costs Optimization
| Technique | Description |
| Exclusion filters | Drop high-volume low-value logs (e.g., health checks, 200s) before ingestion |
| Log sampling | VPC Flow Logs: reduce sampling rate (0.1 = 10%) |
| Log-based metrics | Extract signal from logs → use metrics instead of querying raw logs |
| Routing to cheaper storage | Route cold logs to GCS (cheaper than Logging storage) |
| Retention tuning | Reduce _Default retention to minimum needed |
# Exclude health check logs (save $$$)
gcloud logging sinks update _Default \
--add-exclusion='name=exclude-healthz,filter=httpRequest.requestUrl:"/healthz" AND httpRequest.status=200'
Exam Tips
- Ops Agent = Fluent Bit (logs) + OTel Collector (metrics/traces) — single agent
- VPC Flow Logs = network-level telemetry; not app logs
- _Required bucket retention = 400 days, cannot be changed
- Log-based metrics = extract metrics from logs → use in alerting and dashboards
- Sinks route to: BigQuery (analysis), Pub/Sub (streaming), GCS (archival)
- Exclusion filters applied at ingestion → don’t pay for unwanted logs
- Gemini Cloud Assist can explain log entries and generate LQL queries
- Data Access audit logs = OFF by default — must explicitly enable per service