BigQuery¶
Core Concepts¶
BigQuery is a serverless, fully managed data warehouse for analytics at petabyte scale. It provides SQL-based querying with automatic scaling and high performance.
Key Principle: Designed for analytics (OLAP), not transactions (OLTP); pay for storage and queries, not infrastructure.
When to Use BigQuery¶
✅ Use When¶
- Analytics and reporting (data warehouse)
- Large-scale data analysis (terabytes to petabytes)
- Ad-hoc SQL queries
- Business intelligence and dashboards
- Log analysis at scale
- Machine learning on structured data
❌ Don’t Use When¶
- Transactional workload (OLTP) → Cloud SQL, Spanner
- Low-latency reads/writes → Firestore, Bigtable
- Blob storage → Cloud Storage
- Small datasets (<100 GB) → Cloud SQL may be cheaper
- Real-time streaming with millisecond latency → Bigtable
Architecture¶
Serverless¶
No infrastructure management:
- No servers to provision
- Automatic scaling
- No capacity planning
- Pay per query
Separation of storage and compute:
- Storage: Pay for data stored
- Compute: Pay for queries executed
- Scale independently
Data Organization¶
Structure:
Project
└── Dataset (location, access control)
    └── Tables (schema, partitioning, clustering)
        └── Rows and columns
Dataset: Logical container, regional or multi-regional
Table types:
- Native tables (managed by BigQuery)
- External tables (data in Cloud Storage/Drive/Bigtable)
- Views (virtual tables from queries)
- Materialized views (precomputed results)
Performance Optimization¶
Partitioning¶
Purpose: Divide large tables into smaller segments
Types:
- Time-based (daily, hourly, monthly, yearly)
- Integer range
- Ingestion time
Benefits:
- Faster queries (scan less data)
- Lower costs (process less data)
- Easier data management (delete old partitions)
Example: Partition logs table by date, query only last 7 days
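A minimal sketch of this pattern with the google-cloud-bigquery Python client; the dataset, table, and column names (my_dataset.logs, event_time) are illustrative, not from the source:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Create a table partitioned by the date of a TIMESTAMP column
client.query("""
    CREATE TABLE IF NOT EXISTS my_dataset.logs (
        event_time TIMESTAMP,
        user_id STRING,
        message STRING
    )
    PARTITION BY DATE(event_time)
""").result()

# Filtering on the partitioning column prunes partitions, so only
# ~7 days of data is scanned (and billed)
last_week = """
    SELECT user_id, COUNT(*) AS events
    FROM my_dataset.logs
    WHERE DATE(event_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY user_id
"""
for row in client.query(last_week).result():
    print(row.user_id, row.events)
```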
Clustering¶
Purpose: Organize data within partitions by column values
Benefits:
- Faster queries on clustered columns
- Lower costs (better pruning)
- Automatic re-clustering (BigQuery re-sorts data in the background at no extra charge)
Use with: Up to 4 columns commonly queried together
Pattern: Partition by date, cluster by user_id and region
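A sketch of that pattern in DDL (table and column names are illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Partition by date, then cluster rows within each partition
# by the columns most often used in filters
client.query("""
    CREATE TABLE IF NOT EXISTS my_dataset.events (
        event_time TIMESTAMP,
        user_id STRING,
        region STRING,
        payload STRING
    )
    PARTITION BY DATE(event_time)
    CLUSTER BY user_id, region
""").result()
```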
Query Optimization¶
Best practices:
- Select only needed columns (avoid SELECT *)
- Filter early (WHERE clause reduces data scanned)
- Use partitioned tables
- Avoid self-joins (use window functions)
- Use approximate aggregation (HyperLogLog for COUNT DISTINCT)
Execution details tab: shows the query execution plan and bytes processed
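As a sketch of the approximate-aggregation tip above (table name illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()

# APPROX_COUNT_DISTINCT is backed by HyperLogLog++: a small estimation
# error in exchange for far less memory than exact COUNT(DISTINCT ...)
sql = """
    SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
    FROM my_dataset.logs
    WHERE DATE(event_time) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""
row = list(client.query(sql).result())[0]
print("Distinct users (approx):", row.approx_users)
```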
Storage Options¶
Active Storage¶
Default: Standard storage
Price: ~$0.020/GB/month (first 10 GB free)
Use for: Frequently queried data
Long-Term Storage¶
Automatic: Applies to tables or partitions not modified for 90 consecutive days
Price: ~$0.010/GB/month (50% discount)
Benefit: Cost savings for archival data
Important: Query pricing is the same regardless of storage tier
Data Loading¶
Batch Loading¶
Methods:
- Load jobs (Cloud Storage, local files)
- SQL INSERT statements
- Dataflow pipelines
- Third-party ETL tools
Free: Batch loading is free
Use for: Large bulk loads, scheduled ETL
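A sketch of a load job from Cloud Storage (bucket path and table are illustrative); batch loads like this incur no query charges:

```python
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/logs/*.parquet",   # wildcard loads many files at once
    "my_dataset.logs",
    job_config=job_config,
)
load_job.result()  # block until the load job completes
print(f"Loaded {load_job.output_rows} rows")
```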
Streaming Inserts¶
Purpose: Real-time data ingestion
Latency: Rows are available for query within seconds of insertion
Cost: $0.010 per 200 MB (higher than batch)
Limits: 100,000 rows/sec per table
Use for: Real-time dashboards, log streaming
Alternative: Storage Write API (cheaper, better performance)
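A sketch of the legacy streaming path via the Python client (table and row contents illustrative); the Storage Write API is the recommended replacement, but this shows the older insertAll flow:

```python
from google.cloud import bigquery

client = bigquery.Client()

rows = [
    {"event_time": "2024-01-01T12:00:00Z", "user_id": "u1", "message": "login"},
    {"event_time": "2024-01-01T12:00:01Z", "user_id": "u2", "message": "click"},
]

# insert_rows_json wraps tabledata.insertAll; rows become queryable
# within seconds but sit briefly in the streaming buffer
errors = client.insert_rows_json("my_dataset.logs", rows)
if errors:
    print("Failed inserts:", errors)
```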
Data Export¶
Destinations:
- Cloud Storage (CSV, JSON, Avro, Parquet)
- Other BigQuery tables
- Google Sheets (small datasets)
Limits: 1 GB per file (use wildcards for larger)
Cost: Export is free, storage charges apply
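A sketch of an extract job (bucket and format illustrative); the wildcard in the destination URI lets BigQuery shard output across files, since each exported file is capped at 1 GB:

```python
from google.cloud import bigquery

client = bigquery.Client()

extract_job = client.extract_table(
    "my_dataset.logs",
    "gs://my-bucket/exports/logs-*.avro",   # wildcard => multiple shards
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.AVRO
    ),
)
extract_job.result()  # the export itself is free; GCS storage is billed
```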
Access Control¶
Dataset-Level¶
IAM Roles:
- bigquery.dataViewer: Read data and metadata
- bigquery.dataEditor: Create, update, delete tables
- bigquery.dataOwner: Full control including permissions
Scope: All tables in dataset
Table-Level¶
Authorized views: Allow access to specific views, not underlying tables
Use case: Row-level security, column masking
Row-Level Security¶
Purpose: Filter rows based on user identity
Implementation: Create row-level policies
Example: Users only see their own organization’s data
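A sketch of a row access policy (policy, table, grantee, and column names are illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Analysts can only read rows whose owner_email matches the
# identity running the query
client.query("""
    CREATE ROW ACCESS POLICY owner_only
    ON my_dataset.sales
    GRANT TO ('group:analysts@example.com')
    FILTER USING (owner_email = SESSION_USER())
""").result()
```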
Data Loss Prevention (DLP) API¶
Purpose¶
Discover, classify, and protect sensitive data in BigQuery (PII, PHI, financial data).
Capabilities¶
Discovery:
- Scan datasets for sensitive data
- Identify PII (SSN, credit cards, emails, phone numbers)
- Custom info types (regex patterns)
Classification:
- Likelihood scores (very likely, likely, possible, unlikely)
- Info types (PERSON_NAME, CREDIT_CARD_NUMBER, etc.)
- Statistical analysis
De-identification:
- Masking (replace with *****)
- Tokenization (consistent hash)
- Date shifting (preserve relative dates)
- Generalization (age → age range)
Common Patterns¶
PII Discovery:
DLP API scans BigQuery table → Identifies sensitive columns → Generate report
De-identification Pipeline:
Source table → DLP API (de-identify) → Destination table (masked data)
Continuous Monitoring:
Cloud Functions + DLP API → Scan new data → Alert if PII found
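A minimal sketch of the discovery step with the google-cloud-dlp client, inspecting a text sample for built-in info types (project ID and sample content are illustrative):

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project"

response = dlp.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {
            "info_types": [
                {"name": "EMAIL_ADDRESS"},
                {"name": "CREDIT_CARD_NUMBER"},
            ],
            "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
            "include_quote": True,
        },
        "item": {"value": "Contact jane@example.com, card 4111-1111-1111-1111"},
    }
)

# Each finding carries the detected info type and a likelihood score
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)
```

For whole BigQuery tables, the same inspect_config is typically attached to a DLP inspection job that points at the table, rather than passing content inline.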
Use Cases¶
- GDPR compliance (identify PII)
- HIPAA compliance (protect PHI)
- PCI-DSS (find credit card data)
- Data lake governance
- Safe analytics on sensitive data
Integration¶
Automatic: Scan tables directly via API
Patterns:
- Scheduled scans (Cloud Scheduler + Cloud Functions)
- Pre-export scanning (ensure no PII exported)
- CI/CD validation (check for accidental PII)
Cost¶
DLP API pricing: Per GB scanned
Optimization:
- Scan samples, not entire tables
- Use column-level scans
- Cache results, don’t rescan unchanged data
BigQuery ML¶
Purpose: Build and deploy ML models using SQL
Supported models:
- Linear regression
- Logistic regression
- K-means clustering
- Time series forecasting
- Deep neural networks
- Import TensorFlow models
Benefits: No data export, SQL-based, integrated
Use case: Predictions on BigQuery data without separate ML platform
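A sketch of training and prediction in SQL (dataset, table, and column names are illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a logistic regression classifier directly on warehouse data
client.query("""
    CREATE OR REPLACE MODEL my_dataset.churn_model
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned'])
    AS
    SELECT churned, tenure_months, plan_type, support_tickets
    FROM my_dataset.customers
""").result()

# Predict with ML.PREDICT; no data leaves BigQuery
predictions = client.query("""
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(
        MODEL my_dataset.churn_model,
        (SELECT customer_id, tenure_months, plan_type, support_tickets
         FROM my_dataset.new_customers))
""").result()
for row in predictions:
    print(row.customer_id, row.predicted_churned)
```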
Federated Queries¶
External Data Sources¶
Supported:
- Cloud Storage (CSV, JSON, Avro, Parquet, ORC)
- Cloud Bigtable
- Cloud SQL (read-only)
- Google Sheets
Use cases:
- Query data without loading
- Join BigQuery with external data
- One-time analysis
- Data lake pattern (data in Storage, query in BigQuery)
Limitations:
- Slower than native tables
- No caching
- Subject to external source performance
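A sketch of an external table over files in Cloud Storage (URIs illustrative); the data stays in GCS and is read at query time:

```python
from google.cloud import bigquery

client = bigquery.Client()

client.query("""
    CREATE OR REPLACE EXTERNAL TABLE my_dataset.ext_logs
    OPTIONS (
        format = 'PARQUET',
        uris = ['gs://my-bucket/logs/*.parquet']
    )
""").result()

# Queries read directly from Cloud Storage: convenient, but slower
# than native tables, and results are not cached
sql = "SELECT COUNT(*) AS n FROM my_dataset.ext_logs"
print(list(client.query(sql).result())[0].n)
```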
BI Engine¶
Purpose: In-memory analysis for fast dashboards
Benefits:
- Sub-second query response
- Interactive dashboards
- No query charges for cached data
- Automatic caching
Capacity: Pay for memory reservation (per GB/hour)
Use for: Frequently accessed dashboards, real-time analytics
High Availability¶
Automatic¶
Built-in: Multi-zone replication within region
No configuration needed: Highly available by default
SLA: 99.99% monthly uptime
Multi-Region¶
Datasets: US, EU (multi-region) or specific regions
Benefits:
- Geographic redundancy
- Lower latency for global users
- Higher availability
Limitation: Cannot change location after creation
Disaster Recovery¶
Backup Strategy¶
Table snapshots: Point-in-time copies
Table clones: Lightweight copies (copy-on-write)
Cross-region copy: Export and re-import in another region
Time travel: Query data from up to 7 days ago
Pattern: Periodic exports to Cloud Storage in multi-region
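A sketch of a snapshot and a clone in DDL (all names illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Point-in-time snapshot: read-only, cheap (stores only deltas)
client.query("""
    CREATE SNAPSHOT TABLE my_dataset.logs_snap
    CLONE my_dataset.logs
""").result()

# Table clone: a writable, copy-on-write copy of the base table
client.query("""
    CREATE TABLE my_dataset.logs_dev
    CLONE my_dataset.logs
""").result()
```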
Time Travel¶
Purpose: Query historical data without explicit backups
Duration: 7 days default (configurable 2-7 days)
Use cases:
- Recover accidentally deleted/modified data
- Compare data over time
- Audit changes
Query syntax: FOR SYSTEM_TIME AS OF timestamp
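A sketch (table name illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Read the table as it existed one hour ago (within the 2-7 day window)
sql = """
    SELECT COUNT(*) AS row_count
    FROM my_dataset.logs
    FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
print(list(client.query(sql).result())[0].row_count)
```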
Deleted Table Recovery¶
Duration: Restore deleted tables within 7 days
Limitation: Dataset deletion is permanent (cannot recover)
Scalability¶
Automatic Scaling¶
Query processing: Thousands of concurrent queries
Storage: Exabyte scale
No tuning: Automatic resource allocation
Slots: Units of computational capacity (automatic or reserved)
Reservation and Slots¶
On-demand (default): Pay per query, automatic capacity
Flat-rate pricing: Reserve slots, predictable cost
Flex slots: Commit for 60 seconds minimum
Use reserved when: Predictable workload, >$10K/month queries
Cost Optimization¶
Query Costs¶
Pricing: $5 per TB processed (first 1 TB/month free)
Optimization:
- Partition and cluster tables
- Select specific columns (not *)
- Use table preview (free)
- Cache query results (24 hours free)
- Approximate aggregation functions
- Prefer free batch loading over streaming inserts (streaming incurs per-GB charges)
Storage Costs¶
Active: ~$0.020/GB/month
Long-term (90+ days): ~$0.010/GB/month (automatic)
Optimization:
- Delete unused tables
- Use table expiration
- Export to Cloud Storage (cheaper long-term)
- Partition pruning
Monitoring¶
Metrics:
- Bytes scanned per query
- Slot utilization (reserved pricing)
- Storage trends
Set budgets: Alert when costs exceed threshold
Security¶
Encryption¶
At rest: Automatic (Google-managed or CMEK)
In transit: TLS by default
Column-level encryption: Application-level before load
Access Logging¶
Audit logs:
- Admin Activity (dataset/table changes)
- Data Access (includes query logs; enabled by default for BigQuery, unlike most services)
Use for: Compliance, security investigations
VPC Service Controls¶
Purpose: Prevent data exfiltration
Pattern: BigQuery inside perimeter, cannot export outside
Best Practices¶
Schema Design¶
- Denormalize for performance (avoid JOINs)
- Use nested/repeated fields (STRUCT, ARRAY; see the sketch after this list)
- Partition by date/timestamp
- Cluster by high-cardinality columns
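A sketch of a denormalized schema with nested and repeated fields (all names illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()

# One order row carries its line items as a repeated STRUCT,
# avoiding a join against a separate items table
client.query("""
    CREATE TABLE IF NOT EXISTS my_dataset.orders (
        order_id STRING,
        created_at TIMESTAMP,
        customer STRUCT<id STRING, region STRING>,
        items ARRAY<STRUCT<sku STRING, qty INT64, price NUMERIC>>
    )
    PARTITION BY DATE(created_at)
    CLUSTER BY order_id
""").result()
```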
Query Patterns¶
- Filter early in WHERE clause
- Avoid SELECT * in production
- Use materialized views for common aggregations
- Sample large datasets for exploration (TABLESAMPLE)
Data Lifecycle¶
- Set table expiration for temporary data
- Archive to Cloud Storage for long-term
- Use partitioned tables for time-series
- Rely on time travel (up to 7 days) for short-term recovery
Cost Management¶
- Estimate before running (dry run; see the sketch after this list)
- Use query validator
- Set maximum bytes billed
- Monitor spending with billing exports
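A sketch of both the dry run and the bytes-billed cap (query and threshold illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT user_id
    FROM my_dataset.logs
    WHERE DATE(event_time) = CURRENT_DATE()
"""

# Dry run: validates the query and reports bytes it would scan, at no cost
dry = client.query(
    sql, job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
)
print(f"Would process {dry.total_bytes_processed / 1e9:.2f} GB")

# Hard cap: the job fails instead of billing more than ~1 GB
capped = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
client.query(sql, job_config=capped).result()
```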
Exam Focus¶
Core Concepts¶
- Serverless data warehouse (no infrastructure)
- Separation of storage and compute
- Analytics (OLAP), not transactions (OLTP)
- Pay per query (on-demand) or slots (reserved)
Use Cases¶
- Data warehouse and BI
- Log analysis at scale
- Ad-hoc analytics
- BigQuery ML for predictions
- NOT for: OLTP, low-latency K/V, small databases
Performance¶
- Partitioning (reduce data scanned)
- Clustering (faster queries)
- Materialized views (precompute)
- Query optimization patterns
DLP API¶
- Discover sensitive data (PII, PHI)
- De-identification techniques
- Compliance (GDPR, HIPAA, PCI-DSS)
- Automated scanning patterns
Cost Optimization¶
- Partition and cluster
- Avoid SELECT *
- Use caching
- Long-term storage (automatic after 90 days)
- Reserved slots for predictable workload
Integration¶
- Federated queries (Cloud Storage, Bigtable, Cloud SQL)
- Streaming inserts vs batch loading
- Export to Cloud Storage
- BigQuery ML
Disaster Recovery¶
- Time travel (7 days)
- Table snapshots
- Cross-region exports
- Deleted table recovery (7 days)