The decision matrix. Pick the row that matches your problem.
TL;DR
Managed Spark behind an API. You write PySpark, AWS runs the cluster. Includes a Hive-compatible Data Catalog (central metastore for your lake).
Architecture
⚠ The Trap
Cold start: 5-10 min. DPU billing rounds up. Crawlers misclassify schemas silently. Glue 4.0 still runs Spark 3.3. At scale, cost per DPU-hour ($0.44) adds up fast.
Cheat Snippet
aws glue start-job-run \
  --job-name my-etl \
  --arguments '{"--input":"s3://raw/","--output":"s3://clean/"}'
Pricing
$0.44/DPU-hour. 1 DPU = 4 vCPU + 16GB. Min 2 DPUs. Crawlers billed separately.
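A quick sanity check on the pricing line above: job cost is just DPUs × hours × the DPU rate. The 10-DPU, 2-hour job below is a made-up example.

```shell
# Estimated cost of a hypothetical 10-DPU Glue job running 2 hours
# at $0.44/DPU-hour. Work in cents so integer shell arithmetic suffices.
DPUS=10
HOURS=2
RATE_CENTS=44
COST_CENTS=$((DPUS * HOURS * RATE_CENTS))
printf 'Estimated job cost: $%d.%02d\n' $((COST_CENTS / 100)) $((COST_CENTS % 100))
# Estimated job cost: $8.80
```

Remember that DPU-hours round up, so a 61-minute run bills closer to the 2-hour figure than you'd like.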
TL;DR
Raw Hadoop/Spark clusters you control. Pick instance types, install custom JARs, tune spark.executor.memory. EMR Serverless exists but you lose fine control.
Use Case
⚠ The Trap
You ARE the ops team. Cluster sizing is black magic. Over-provision = burn cash. Under-provision = OOM. Spot interruptions kill long Spark stages. No auto-scaling by default.
Cheat Snippet
aws emr create-cluster \
  --release-label emr-7.0.0 \
  --applications Name=Spark Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 5
Pricing
EC2 cost + EMR surcharge (~20%). Use spot for task nodes (up to 90% savings, risk of interruption).
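To see how the surcharge lands, compute hourly cluster cost as the EC2 price plus ~20%. The $0.20/node-hour rate below is a placeholder, not a real instance quote.

```shell
# Hourly cost of a hypothetical 5-node cluster at a placeholder
# $0.20/node-hr EC2 rate, with the ~20% EMR surcharge applied on top.
NODES=5
EC2_CENTS_PER_NODE_HR=20
TOTAL_CENTS=$((NODES * EC2_CENTS_PER_NODE_HR * 120 / 100))
printf 'Cluster cost: $%d.%02d/hr\n' $((TOTAL_CENTS / 100)) $((TOTAL_CENTS % 100))
# Cluster cost: $1.20/hr
```

The surcharge applies to every node-hour, so spot discounts on task nodes compound: you cut the EC2 base that the 20% is computed on.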
TL;DR
Presto/Trino-as-a-service. Point at S3, write SQL, get results. No infrastructure. Technically ELT (query-time transform) but everyone searches "ETL" for it.
Cost Comparison
⚠ The Trap
$5 per TB scanned. Unpartitioned CSV can cost 10-100x more per query than partitioned Parquet. No indexes. Complex joins on huge datasets are slow. Always use CTAS to materialize cleaned results.
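The scan-cost gap is easy to quantify: at $5/TB, per-query cost scales linearly with bytes scanned. The 500 GB CSV vs 5 GB pruned-Parquet figures below are illustrative.

```shell
# Per-query Athena cost at $5/TB scanned ($5 = 500 cents per 1024 GB).
scan_cost_cents() {
  echo $(( $1 * 500 / 1024 ))
}
echo "Full scan, 500 GB CSV:      $(scan_cost_cents 500) cents"   # 244 cents
echo "Pruned Parquet, 5 GB read:  $(scan_cost_cents 5) cents"     # 2 cents
```

Run the same query a few hundred times a day and that per-query difference becomes the whole Athena bill.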
Cheat Snippet
CREATE TABLE clean
WITH (format = 'PARQUET', partitioned_by = ARRAY['dt']) AS
SELECT user_id, event, dt
FROM raw
WHERE dt > '2026-01-01';
TL;DR
Data Streams: Managed Kafka-like shards. You read with consumers (Lambda, KCL).
Firehose: Zero-admin delivery pipe. Buffers → dumps to S3/Redshift/OpenSearch.
Data Analytics: Managed Apache Flink for streaming SQL/Java (since renamed Amazon Managed Service for Apache Flink).
Streaming Architecture
⚠ The Trap
Firehose min buffer = 60s (NOT real-time). Lambda transform timeout = 3 min, 6MB payload max. No deduplication. Shard splitting is manual and disruptive.
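Since shard splitting is manual, size Data Streams up front: each shard ingests at most 1 MB/s and 1,000 records/s, so provision for whichever limit binds first. The 12 MB/s / 25,000 records/s load below is an example.

```shell
# Shards needed: take the larger of the byte-bound and record-bound counts.
MB_PER_SEC=12
RECORDS_PER_SEC=25000
BY_BYTES=$MB_PER_SEC                                # 1 MB/s ingest per shard
BY_RECORDS=$(( (RECORDS_PER_SEC + 999) / 1000 ))    # 1,000 records/s per shard
SHARDS=$(( BY_BYTES > BY_RECORDS ? BY_BYTES : BY_RECORDS ))
echo "Provision $SHARDS shards"
# Provision 25 shards
```

Here the record rate binds, not the byte rate: lots of small records need more shards than the MB/s math alone suggests.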
Cheat Snippet
aws firehose create-delivery-stream \
  --delivery-stream-name clicks \
  --s3-destination-configuration \
    RoleARN=$ROLE,BucketARN=arn:aws:s3:::lake,CompressionFormat=SNAPPY
TL;DR
Reads the source DB's transaction log (binlog/WAL/redo) and replays it on the target. Full-load plus ongoing CDC replication. NOT a transformation engine: it moves data.
Migration Flow
⚠ The Trap
Replication instance = single EC2 (no auto-scaling). LOB columns are brutally slow. Monitor replication lag: if the source purges its logs before DMS reads them, you lose changes. NOT an ETL tool: minimal transformation only.
Cheat Snippet
aws dms create-replication-task \
  --replication-task-identifier ora-to-pg \
  --replication-instance-arn $RI \
  --source-endpoint-arn $SRC \
  --target-endpoint-arn $TGT \
  --migration-type full-load-and-cdc \
  --table-mappings file://mappings.json
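Note that create-replication-task will not run without a --table-mappings document. A minimal selection rule that replicates every table in one schema (the HR schema name is a placeholder) looks like this:

```shell
# Write a minimal DMS table-mappings file: include all tables in schema HR.
cat > mappings.json <<'EOF'
{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-hr",
      "object-locator": { "schema-name": "HR", "table-name": "%" },
      "rule-action": "include"
    }
  ]
}
EOF
```

Transformation rules (rename schema, drop columns) go in the same file, but keep them minimal: DMS is a mover, not a transformer.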
| Service | Arch | Engine | Mode | Pricing |
|---|---|---|---|---|
| Glue | Serverless | Spark | Batch | $0.44/DPU-hr |
| EMR | Provisioned | Spark/Hadoop | Batch | EC2 + 20% |
| Athena | Serverless | Trino | Ad-hoc | $5/TB scan |
| Kinesis | Serverless | Custom | Stream | $0.029/GB |
| DMS | Provisioned | CDC | Migration | Instance-hr |
| Lambda | Serverless | Any runtime | Event | Per invoke |
| Step Fn | Serverless | State machine | Orchestr. | Per transition |
| Redshift | Provisioned | Columnar | Warehouse | Node-hr |
What is AWS Glue? Managed Spark for batch ETL. Includes a Data Catalog shared by Athena, EMR, Redshift.
Hub-and-Spoke Architecture
What is Amazon EMR? Full Hadoop/Spark/Presto clusters. You pick instances, install JARs, tune memory.
Cluster Processing (50TB/day)
What is Amazon Athena? Serverless Presto/Trino. Point at S3, write SQL, get results. Pay per TB scanned.
Query → Scan → Results
Analyst submits a SELECT; Athena scans S3 at $5/TB scanned; results come back.
CTAS materialization
CREATE TABLE clean_events
WITH (format = 'PARQUET', partitioned_by = ARRAY['dt']) AS
SELECT user_id, action, ts, dt
FROM raw_events
WHERE dt > '2026-01-01' AND action IS NOT NULL;
What is Amazon Kinesis? Streaming family. Data Streams = managed shards. Firehose = delivery. Flink = streaming SQL.
Multi-shard Fan-out
What is AWS DMS? Reads source DB transaction logs and replays on target. Full-load + ongoing CDC.
Zero-Downtime Migration
Standard nightly ETL: raw → transform → load
Crawler discovers schema → PySpark transforms → COPY loads → analysts query
Real-time clickstream / IoT ingestion
Buffer 60s → Snappy compress → partition by dt → SQL analytics
Lake Formation: fine-grained access control on your data lake. Column/row-level security on Glue Catalog tables.
Glue Data Quality: DQDL rules on Glue jobs. Fail pipelines on bad data.
Rules = [
  ColumnValues "age" between 0 and 150,
  Completeness "email" > 0.95,
  IsUnique "user_id"
]
What you actually pay (us-east-1, 2026)
Pro tip: Glue + Athena on Parquet is the cheapest serverless stack. EMR on spot is cheapest at petabyte scale.
This cheat sheet for AWS ETL tools is the fastest way to compare AWS Glue, Amazon EMR, Amazon Athena, Kinesis Firehose, and AWS DMS. Each tool card includes the TL;DR (what it actually is under the hood), the exact use case, the hidden traps that will burn your budget, and a copy-paste CLI snippet.
Whether you are building a batch data pipeline, a real-time streaming architecture, or migrating databases, this AWS ETL tools cheat sheet gives you the decision matrix, architecture diagrams, and pricing comparison to pick the right tool in 30 seconds.
Unlike AWS documentation, this AWS ETL tools cheat sheet is brutally honest about limitations. We tell you about Glue's cold start, Athena's scan costs, EMR's operational overhead, Firehose's 60-second buffer, and DMS's single-instance bottleneck.
A cheat sheet for AWS ETL tools is a quick-reference guide that compares AWS data processing services — AWS Glue, Amazon EMR, Amazon Athena, Kinesis Firehose, and AWS DMS — by architecture, pricing, use cases, and limitations. This page helps data engineers pick the right tool without reading 50 pages of documentation.
AWS Glue is serverless Spark — you submit a job and AWS manages the cluster. Amazon EMR gives you full control over Hadoop/Spark clusters (instance types, tuning, custom JARs). Use Glue for standard batch ETL under 10TB. Use EMR for petabyte-scale jobs where you need to tune executor memory and install custom libraries.
Use Athena for ad-hoc SQL queries on data already in S3 — no infrastructure needed, pay per TB scanned. Use Glue when you need to transform and move data (ETL jobs, crawlers, data catalog). Athena is technically ELT (query-time transformation). The key cost trap: unpartitioned CSV data in Athena will 10x your bill vs partitioned Parquet.
AWS Glue: DPU-hours round up, 5-10 min cold start burns money. Athena: $5/TB scanned — CSV vs Parquet is a 10x cost difference. EMR: over-provisioned clusters and spot interruptions. Kinesis Firehose: not true real-time (60s buffer minimum). DMS: replication instance is a single EC2 with no auto-scaling.
Kinesis Firehose for managed ingestion (buffer → S3/Redshift/OpenSearch). Kinesis Data Streams + Lambda for record-level processing. Kinesis Data Analytics (Managed Apache Flink) for complex streaming SQL/Java. Firehose is simplest but has a minimum 60-second buffer — it's near-real-time, not true real-time.
DMS is for database-to-database migration with CDC (change data capture). It reads transaction logs and replays them. AWS Glue is for ETL — extract, transform, load with Spark. If you're migrating Oracle to Aurora with zero downtime, use DMS. If you're transforming and loading data into a data lake, use Glue.
Practice with an AI interviewer that tests your ability to design data pipelines and articulate trade-offs in real time.
Try Crackr free