BigQuery

Core Concepts

BigQuery is a serverless, fully managed data warehouse for analytics at petabyte scale. It provides SQL-based querying with automatic scaling and high performance.

Key Principle: Designed for analytics (OLAP), not transactions (OLTP); pay for storage and queries, not infrastructure.

When to Use BigQuery

✅ Use When

  • Analytics and reporting (data warehouse)
  • Large-scale data analysis (terabytes to petabytes)
  • Ad-hoc SQL queries
  • Business intelligence and dashboards
  • Log analysis at scale
  • Machine learning on structured data

❌ Don’t Use When

  • Transactional workload (OLTP) → Cloud SQL, Spanner
  • Low-latency reads/writes → Firestore, Bigtable
  • Blob storage → Cloud Storage
  • Small datasets (<100 GB) → Cloud SQL may be cheaper
  • Real-time streaming with millisecond latency → Bigtable

Architecture

Serverless

No infrastructure management:

  • No servers to provision
  • Automatic scaling
  • No capacity planning
  • Pay per query

Separation of storage and compute:

  • Storage: Pay for data stored
  • Compute: Pay for queries executed
  • Scale independently

Data Organization

Structure:

Project
└── Dataset (location, access control)
    └── Tables (schema, partitioning, clustering)
        └── Rows and columns

Dataset: Logical container, regional or multi-regional

Table types:

  • Native tables (managed by BigQuery)
  • External tables (data in Cloud Storage/Drive/Bigtable)
  • Views (virtual tables from queries)
  • Materialized views (precomputed results)
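
Materialized views are created directly in SQL. A minimal sketch, assuming a hypothetical sales.orders table with order_ts and amount columns:

  -- Precompute a daily aggregate; BigQuery keeps it refreshed incrementally
  CREATE MATERIALIZED VIEW sales.daily_revenue AS
  SELECT
    DATE(order_ts) AS order_date,
    SUM(amount)    AS revenue
  FROM sales.orders
  GROUP BY order_date;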

Performance Optimization

Partitioning

Purpose: Divide large tables into smaller segments

Types:

  • Time-based (daily, hourly, monthly, yearly)
  • Integer range
  • Ingestion time

Benefits:

  • Faster queries (scan less data)
  • Lower costs (process less data)
  • Easier data management (delete old partitions)

Example: Partition logs table by date, query only last 7 days
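
A minimal sketch of that pattern (dataset, table, and column names are hypothetical):

  -- Daily partitions derived from a TIMESTAMP column
  CREATE TABLE analytics.events
  (
    event_ts TIMESTAMP,
    user_id  STRING,
    payload  JSON
  )
  PARTITION BY DATE(event_ts);

  -- Only the last 7 daily partitions are scanned (and billed)
  SELECT user_id, COUNT(*) AS events
  FROM analytics.events
  WHERE DATE(event_ts) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  GROUP BY user_id;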

Clustering

Purpose: Organize data within partitions by column values

Benefits:

  • Faster queries on clustered columns
  • Lower costs (better pruning)
  • Automatic (BigQuery manages)

Use with: Up to 4 columns commonly queried together

Pattern: Partition by date, cluster by user_id and region
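
A sketch of that pattern in DDL (names are hypothetical):

  -- Daily partitions; within each partition, rows are organized
  -- by user_id, then region, for better block pruning
  CREATE TABLE analytics.requests
  (
    event_ts   TIMESTAMP,
    user_id    STRING,
    region     STRING,
    latency_ms INT64
  )
  PARTITION BY DATE(event_ts)
  CLUSTER BY user_id, region;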

Query Optimization

Best practices:

  • Select only needed columns (avoid SELECT *)
  • Filter early (WHERE clause reduces data scanned)
  • Use partitioned tables
  • Avoid self-joins (use window functions)
  • Use approximate aggregation (HyperLogLog for COUNT DISTINCT)
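
For example, the approximate-aggregation tip trades a small, bounded error for much lower cost (reusing the hypothetical table from the partitioning sketch above):

  -- Exact: accurate, but memory- and shuffle-intensive at scale
  SELECT COUNT(DISTINCT user_id) FROM analytics.events;

  -- HyperLogLog++-based: typically ~1% error, far cheaper
  SELECT APPROX_COUNT_DISTINCT(user_id) FROM analytics.events;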

Execution details: The console shows each query's execution plan and bytes processed

Storage Options

Active Storage

Default: Standard storage

Price: ~$0.020/GB/month (first 10 GB free)

Use for: Frequently queried data

Long-Term Storage

Automatic: Applies to tables or partitions not modified for 90 consecutive days

Price: ~$0.010/GB/month (50% discount)

Benefit: Cost savings for archival data

Important: Query pricing is the same regardless of storage tier

Data Loading

Batch Loading

Methods:

  • Load jobs (Cloud Storage, local files)
  • SQL INSERT statements
  • Dataflow pipelines
  • Third-party ETL tools

Cost: Free (load jobs use a shared pool of slots)

Use for: Large bulk loads, scheduled ETL
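
Batch loads can also be expressed in SQL with the LOAD DATA statement. A sketch with a placeholder bucket path:

  -- Append Parquet files from Cloud Storage into an existing table
  LOAD DATA INTO analytics.events
  FROM FILES (
    format = 'PARQUET',
    uris   = ['gs://my-bucket/events/*.parquet']
  );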

Streaming Inserts

Purpose: Real-time data ingestion

Latency: Streamed rows are typically available for query within seconds

Cost: $0.010 per 200 MB (higher than batch)

Limits: 100,000 rows/sec per table

Use for: Real-time dashboards, log streaming

Alternative: Storage Write API (cheaper, better performance)

Data Export

Destinations:

  • Cloud Storage (CSV, JSON, Avro, Parquet)
  • Other BigQuery tables
  • Google Sheets (small datasets)

Limits: 1 GB per exported file (use a wildcard URI to shard larger exports)

Cost: Export is free, storage charges apply
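
The EXPORT DATA statement covers the Cloud Storage case. A sketch with placeholder names; the wildcard lets BigQuery shard output past the 1 GB per-file limit:

  EXPORT DATA OPTIONS (
    uri       = 'gs://my-bucket/exports/events-*.parquet',
    format    = 'PARQUET',
    overwrite = true
  ) AS
  SELECT *
  FROM analytics.events
  WHERE DATE(event_ts) = CURRENT_DATE();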

Access Control

Dataset-Level

IAM Roles:

  • bigquery.dataViewer: Read data and metadata
  • bigquery.dataEditor: Create, update, delete tables
  • bigquery.dataOwner: Full control including permissions

Scope: All tables in dataset

Table-Level

Authorized views: Allow access to specific views, not underlying tables

Use case: Row-level security, column masking

Row-Level Security

Purpose: Filter rows based on user identity

Implementation: Create row access policies (CREATE ROW ACCESS POLICY)

Example: Users only see their own organization’s data
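
A minimal sketch of such a policy (table, column, and group are hypothetical; SESSION_USER() returns the querying user's email):

  -- Analysts see only rows whose owner_email matches their identity
  CREATE ROW ACCESS POLICY per_user_filter
  ON analytics.events
  GRANT TO ('group:analysts@example.com')
  FILTER USING (owner_email = SESSION_USER());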

Data Loss Prevention (DLP) API

Purpose

Discover, classify, and protect sensitive data in BigQuery (PII, PHI, financial data).

Capabilities

Discovery:

  • Scan datasets for sensitive data
  • Identify PII (SSN, credit cards, emails, phone numbers)
  • Custom info types (regex patterns)

Classification:

  • Likelihood scores (very likely, likely, possible, unlikely)
  • Info types (PERSON_NAME, CREDIT_CARD_NUMBER, etc.)
  • Statistical analysis

De-identification:

  • Masking (replace with *****)
  • Tokenization (consistent hash)
  • Date shifting (preserve relative dates)
  • Generalization (age → age range)

Common Patterns

PII Discovery:

DLP API scans BigQuery table → Identifies sensitive columns → Generate report

De-identification Pipeline:

Source table → DLP API (de-identify) → Destination table (masked data)

Continuous Monitoring:

Cloud Functions + DLP API → Scan new data → Alert if PII found

Use Cases

  • GDPR compliance (identify PII)
  • HIPAA compliance (protect PHI)
  • PCI-DSS (find credit card data)
  • Data lake governance
  • Safe analytics on sensitive data

Integration

Native integration: Scan BigQuery tables directly via the API

Patterns:

  • Scheduled scans (Cloud Scheduler + Cloud Functions)
  • Pre-export scanning (ensure no PII exported)
  • CI/CD validation (check for accidental PII)

Cost

DLP API pricing: Per GB scanned

Optimization:

  • Scan samples, not entire tables
  • Use column-level scans
  • Cache results, don’t rescan unchanged data

BigQuery ML

Purpose: Build and deploy ML models using SQL

Supported models:

  • Linear regression
  • Logistic regression
  • K-means clustering
  • Time series forecasting
  • Deep neural networks
  • Import TensorFlow models

Benefits: No data export, SQL-based, integrated

Use case: Predictions on BigQuery data without separate ML platform
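
A sketch of the SQL-only workflow (model name, table, and feature columns are hypothetical):

  -- Train a logistic regression classifier in place
  CREATE OR REPLACE MODEL analytics.churn_model
  OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
  SELECT churned, tenure_days, plan, region
  FROM analytics.customers;

  -- Score rows with ML.PREDICT
  SELECT *
  FROM ML.PREDICT(
    MODEL analytics.churn_model,
    (SELECT tenure_days, plan, region FROM analytics.customers)
  );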

Federated Queries

External Data Sources

Supported:

  • Cloud Storage (CSV, JSON, Avro, Parquet, ORC)
  • Cloud Bigtable
  • Cloud SQL (read-only)
  • Google Sheets

Use cases:

  • Query data without loading (see the sketch below)
  • Join BigQuery with external data
  • One-time analysis
  • Data lake pattern (data in Storage, query in BigQuery)

Limitations:

  • Slower than native tables
  • No caching
  • Subject to external source performance
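
A sketch of the data-lake pattern mentioned above (bucket and names are placeholders):

  -- Define a table over files that stay in Cloud Storage; queries read them in place
  CREATE EXTERNAL TABLE analytics.raw_events
  OPTIONS (
    format = 'PARQUET',
    uris   = ['gs://my-bucket/raw/*.parquet']
  );

  -- Queries and joins against native tables work as usual, at external-source speed
  SELECT COUNT(*) FROM analytics.raw_events;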

BI Engine

Purpose: In-memory analysis for fast dashboards

Benefits:

  • Sub-second query response
  • Interactive dashboards
  • No query charges for cached data
  • Automatic caching

Capacity: Pay for memory reservation (per GB/hour)

Use for: Frequently accessed dashboards, real-time analytics

High Availability

Automatic

Built-in: Multi-zone replication within region

No configuration needed: Highly available by default

SLA: 99.99% monthly uptime

Multi-Region

Datasets: US, EU (multi-region) or specific regions

Benefits:

  • Geographic redundancy
  • Lower latency for global users
  • Higher availability

Limitation: Cannot change location after creation

Disaster Recovery

Backup Strategy

Table snapshots: Point-in-time copies

Table clones: Lightweight copies (copy-on-write)

Cross-region copy: Export and re-import in another region

Time travel: Query data from up to 7 days ago

Pattern: Periodic exports to Cloud Storage in multi-region
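
Snapshots and clones are one statement each (names are hypothetical):

  -- Read-only, point-in-time snapshot
  CREATE SNAPSHOT TABLE analytics.events_snapshot
  CLONE analytics.events;

  -- Writable lightweight clone (copy-on-write: storage is billed only for changes)
  CREATE TABLE analytics.events_clone
  CLONE analytics.events;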

Time Travel

Purpose: Query historical data without explicit backups

Duration: 7 days default (configurable 2-7 days)

Use cases:

  • Recover accidentally deleted/modified data
  • Compare data over time
  • Audit changes

Query syntax: FOR SYSTEM_TIME AS OF timestamp
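
For example, to read a table as it was an hour ago (table name is hypothetical; the timestamp must fall inside the time-travel window):

  SELECT *
  FROM analytics.events
  FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);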

Deleted Table Recovery

Duration: Restore deleted tables within 7 days

Limitation: Dataset deletion is permanent (cannot recover)

Scalability

Automatic Scaling

Query processing: Thousands of concurrent queries

Storage: Exabyte scale

No tuning: Automatic resource allocation

Slots: Units of computational capacity (automatic or reserved)

Reservation and Slots

On-demand (default): Pay per query, automatic capacity

Flat-rate pricing: Reserve slots, predictable cost

Flex slots: Commit for 60 seconds minimum

Use reserved when: Predictable workload, >$10K/month queries

Cost Optimization

Query Costs

Pricing: $5 per TB processed (first 1 TB/month free)

Optimization:

  • Partition and cluster tables
  • Select specific columns (not *)
  • Use table preview (free)
  • Cache query results (24 hours free)
  • Approximate aggregation functions
  • Prefer batch loading over streaming inserts (streaming is billed per MB)

Storage Costs

Active: $0.020/GB/month

Long-term (90+ days): $0.010/GB/month (automatic)

Optimization:

  • Delete unused tables
  • Use table expiration (see the sketch below)
  • Export to Cloud Storage (cheaper long-term)
  • Partition pruning
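
A sketch of the table-expiration option from the list above (names are placeholders):

  -- Auto-delete a temporary table 30 days from now
  ALTER TABLE analytics.tmp_events
  SET OPTIONS (
    expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  );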

Monitoring

Metrics:

  • Bytes scanned per query
  • Slot utilization (reserved pricing)
  • Storage trends

Set budgets: Alert when costs exceed threshold

Security

Encryption

At rest: Automatic (Google-managed or CMEK)

In transit: TLS by default

Column-level encryption: Encrypt at the application level before load, or use AEAD SQL functions

Access Logging

Audit logs:

  • Admin Activity (dataset/table changes)
  • Data Access (query and table read logs; enabled by default for BigQuery)

Use for: Compliance, security investigations

VPC Service Controls

Purpose: Prevent data exfiltration

Pattern: BigQuery inside perimeter, cannot export outside

Best Practices

Schema Design

  • Denormalize for performance (avoid JOINs)
  • Use nested/repeated fields (STRUCT, ARRAY; see the sketch below)
  • Partition by date/timestamp
  • Cluster by high-cardinality columns
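
A sketch of the nested/repeated pattern noted above (names are hypothetical):

  -- One row per order; line items live in a repeated STRUCT, so no join is needed
  CREATE TABLE sales.orders_nested
  (
    order_id STRING,
    order_ts TIMESTAMP,
    items    ARRAY<STRUCT<sku STRING, qty INT64, price NUMERIC>>
  )
  PARTITION BY DATE(order_ts);

  -- UNNEST flattens the repeated field at query time
  SELECT o.order_id, i.sku, i.qty * i.price AS line_total
  FROM sales.orders_nested AS o, UNNEST(o.items) AS i;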

Query Patterns

  • Filter early in WHERE clause
  • Avoid SELECT * in production
  • Use materialized views for common aggregations
  • Sample large datasets for exploration (TABLESAMPLE)
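
For example, TABLESAMPLE reads only a subset of storage blocks, so exploration stays cheap:

  -- Roughly 1% of the table's data blocks are scanned and billed
  SELECT * FROM analytics.events TABLESAMPLE SYSTEM (1 PERCENT);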

Data Lifecycle

  • Set table expiration for temporary data
  • Archive to Cloud Storage for long-term
  • Use partitioned tables for time-series
  • Rely on time travel (up to 7 days) for short-term recovery

Cost Management

  • Estimate before running (dry run)
  • Use query validator
  • Set maximum bytes billed
  • Monitor spending with billing exports

Exam Focus

Core Concepts

  • Serverless data warehouse (no infrastructure)
  • Separation of storage and compute
  • Analytics (OLAP), not transactions (OLTP)
  • Pay per query (on-demand) or slots (reserved)

Use Cases

  • Data warehouse and BI
  • Log analysis at scale
  • Ad-hoc analytics
  • BigQuery ML for predictions
  • NOT for: OLTP, low-latency K/V, small databases

Performance

  • Partitioning (reduce data scanned)
  • Clustering (faster queries)
  • Materialized views (precompute)
  • Query optimization patterns

DLP API

  • Discover sensitive data (PII, PHI)
  • De-identification techniques
  • Compliance (GDPR, HIPAA, PCI-DSS)
  • Automated scanning patterns

Cost Optimization

  • Partition and cluster
  • Avoid SELECT *
  • Use caching
  • Long-term storage (automatic after 90 days)
  • Reserved slots for predictable workload

Integration

  • Federated queries (Cloud Storage, Bigtable, Cloud SQL)
  • Streaming inserts vs batch loading
  • Export to Cloud Storage
  • BigQuery ML

Disaster Recovery

  • Time travel (7 days)
  • Table snapshots
  • Cross-region exports
  • Deleted table recovery (7 days)