Athena¶

Overview¶

Amazon Athena is a powerful serverless query service that enables users to analyze data stored in Amazon S3 using standard SQL. Built on the Presto framework, Athena eliminates the need for complex ETL jobs when querying data, offering a simplified yet robust solution for data analysis.

Core Features and Functionality¶

Query Service Capabilities¶

Athena operates as a serverless query engine, requiring no infrastructure management while providing the ability to analyze data directly from S3 storage. The service uses standard SQL language (built on Presto), making it accessible to users familiar with SQL syntax.

Supported File Formats¶

Athena supports multiple data formats, including:

CSV
JSON
ORC
Avro
Parquet

Cost Structure¶

The pricing model is straightforward:

$5.00 per TB of data scanned
Users only pay for the data they query

Integration with QuickSight¶

Athena seamlessly integrates with Amazon QuickSight, enabling:

Creation of comprehensive reports
Building interactive dashboards
Visual data exploration and analysis

Common Use Cases¶

Athena serves various analytical needs, including:

Business intelligence and analytics
Reporting and data analysis
VPC Flow Logs analysis
ELB Logs examination
CloudTrail trails investigation

Performance Optimization¶

Data Format Recommendations¶

To optimize performance and reduce costs, consider the following recommendations:

Use columnar data formats:
Apache Parquet or ORC are highly recommended
These formats provide significant performance improvements
AWS Glue can be used to convert existing data to Parquet or ORC
Implement data compression:
Supported compression formats include:
- bzip2
- gzip
- lz4
- snappy
- zlib
- zstd

Data Organization and Storage¶

For optimal performance:

Implement partitioning:
Organize datasets in S3 using partition columns

Follow the structure:

s3://yourBucket/pathToTable/<PARTITION_COLUMN_NAME>=<VALUE>/

Example:

s3://athena-examples/flight/parquet/year=1991/month=1/day=1/

File size optimization:
Maintain file sizes larger than 128 MB
This approach minimizes query overhead

Federated Query Capabilities¶

Overview¶

Federated queries enable SQL query execution across diverse data sources, including:

Relational databases
Non-relational databases
Object stores
Custom data sources

Architecture Components¶

The federated query system consists of: 1. Data Source Connectors running on AWS Lambda 2. Support for various data sources:

CloudWatch Logs
DynamoDB
RDS
ElastiCache
DocumentDB
Redshift
HBase in EMR
MySQL
Aurora
SQL Server
On-premises databases

Functionality¶

Enables querying across multiple data sources simultaneously
Results can be stored back in Amazon S3
Provides unified access to distributed data sources

Best Practices¶

Use columnar formats (Parquet or ORC) for cost optimization
Implement appropriate compression methods
Partition data effectively
Optimize file sizes
Utilize federated queries when dealing with multiple data sources