AWS ETL Tools Cheat Sheet

crackr.dev

1. When to Use What

The decision matrix. Pick the row that matches your problem.

Glue - Serverless batch ETL, schema discovery, data catalog
EMR - Petabyte-scale Spark/Hadoop, full cluster control
Athena - Ad-hoc SQL on S3, no infra, pay per scan
Kinesis - Real-time streaming ingestion + transforms
DMS - Database migration with CDC, zero downtime
Step Functions - Orchestrate multi-step ETL workflows
Lambda - Lightweight, event-driven micro-transforms
Redshift - Columnar data warehouse for analytics queries
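The matrix above can be encoded as a tiny lookup helper (a hypothetical sketch; the problem phrases are paraphrased from the table, not an official taxonomy):

```python
# Hypothetical helper mirroring the decision matrix above.
ETL_MATRIX = {
    "serverless batch ETL": "Glue",
    "petabyte-scale Spark with full control": "EMR",
    "ad-hoc SQL on S3": "Athena",
    "real-time streaming ingestion": "Kinesis",
    "database migration with CDC": "DMS",
    "multi-step workflow orchestration": "Step Functions",
    "event-driven micro-transforms": "Lambda",
    "columnar analytics warehouse": "Redshift",
}

def pick_service(problem: str) -> str:
    """Return the matching AWS service; raises KeyError for unknown problems."""
    return ETL_MATRIX[problem]

print(pick_service("ad-hoc SQL on S3"))  # Athena
```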

2. AWS Glue

Serverless · Apache Spark · Batch

TL;DR

Managed Spark behind an API. You write PySpark, AWS runs the cluster. Includes a Hive-compatible Data Catalog (central metastore for your lake).

Architecture

S3 Raw → Crawler → Data Catalog → ETL Job → S3 Clean

⚠ The Trap

Cold start: 5-10 min. DPU billing rounds up. Crawlers misclassify schemas silently. Glue 4.0 still runs Spark 3.3. At scale, cost per DPU-hour ($0.44) adds up fast.

Cheat Snippet

aws glue start-job-run \
  --job-name my-etl \
  --arguments='--input=s3://raw/,--output=s3://clean/'

Pricing

$0.44/DPU-hour. 1 DPU = 4 vCPU + 16GB. Min 2 DPUs. Crawlers billed separately.
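The pricing rules above fit in a quick cost estimator (a simplified model, assuming per-second billing with a 1-minute minimum as in Glue 2.0+, the 2-DPU floor, and the $0.44/DPU-hour rate):

```python
# Simplified Glue cost model (assumption: per-second billing with a
# 1-minute minimum, 2-DPU floor, at the $0.44/DPU-hour rate above).
def glue_job_cost(dpus: int, runtime_seconds: float,
                  rate_per_dpu_hour: float = 0.44,
                  min_billed_seconds: int = 60) -> float:
    """Estimate job cost: DPUs x billed hours x hourly rate."""
    dpus = max(dpus, 2)                       # Glue enforces a 2-DPU minimum
    billed = max(runtime_seconds, min_billed_seconds)
    return round(dpus * (billed / 3600) * rate_per_dpu_hour, 4)

# 10 DPUs for a 15-minute job:
print(glue_job_cost(10, 15 * 60))  # 1.1  (10 x 0.25h x $0.44)
```

Note how a 1-second job on 1 DPU still bills as 2 DPUs for a full minute.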

3. Amazon EMR

Provisioned · Spark / Hadoop / Presto

TL;DR

Raw Hadoop/Spark clusters you control. Pick instance types, install custom JARs, tune spark.executor.memory. EMR Serverless exists but you lose fine control.

Use Case

  • 50TB+ daily clickstream processing
  • ML model training on Spark MLlib
  • Interactive Presto analytics on massive datasets

⚠ The Trap

You ARE the ops team. Cluster sizing is black magic. Over-provision = burn cash. Under-provision = OOM. Spot interruptions kill long Spark stages. No auto-scaling by default.

Cheat Snippet

aws emr create-cluster \
  --release-label emr-7.0.0 \
  --applications Name=Spark Name=Hive \
  --instance-type m5.xlarge --instance-count 5 \
  --use-default-roles

Pricing

EC2 cost + EMR surcharge (~20%). Use spot for task nodes (up to 90% savings, risk of interruption).
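That pricing can be sanity-checked with a back-of-envelope calculator (assumptions: ~20% EMR surcharge on every node, spot discount applied only to the EC2 portion of task nodes; the m5.xlarge rate is illustrative, not current pricing):

```python
# Back-of-envelope EMR hourly cost (assumptions: ~20% EMR surcharge on
# every node; spot discount applies only to the EC2 portion of task nodes).
def emr_hourly_cost(nodes: int, ec2_rate: float,
                    emr_surcharge: float = 0.20,
                    spot_nodes: int = 0, spot_discount: float = 0.70) -> float:
    on_demand = nodes - spot_nodes
    ec2_cost = on_demand * ec2_rate + spot_nodes * ec2_rate * (1 - spot_discount)
    surcharge = nodes * ec2_rate * emr_surcharge
    return round(ec2_cost + surcharge, 4)

# 5 x m5.xlarge (illustrative $0.192/hr on-demand), 2 of them on spot:
print(emr_hourly_cost(5, 0.192, spot_nodes=2))  # 0.8832
```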

4. Amazon Athena

Serverless · Presto / Trino · Ad-hoc SQL

TL;DR

Presto-as-a-service. Point at S3, write SQL, get results. No infrastructure. Technically ELT (query-time transform) but everyone searches "ETL" for it.

Cost Comparison

CSV (unpartitioned): $50/query
Parquet (partitioned): $0.50/query

⚠ The Trap

$5 per TB scanned. Unpartitioned CSV = 100x cost vs partitioned Parquet. No indexes. Complex joins on huge datasets are slow. Always use CTAS to materialize.
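The $50 vs $0.50 comparison falls straight out of scan math at $5/TB (the ~100x reduction from Parquet compression plus partition pruning is illustrative; Athena also applies a 10MB per-query minimum):

```python
# Athena bills $5 per TB scanned, with a 10MB minimum per query.
def athena_query_cost(scanned_tb: float, rate_per_tb: float = 5.0) -> float:
    scanned_mb = max(scanned_tb * 1024 * 1024, 10)  # 10MB minimum charge
    return round(scanned_mb / (1024 * 1024) * rate_per_tb, 6)

# 10TB of raw CSV vs ~0.1TB after Parquet compression + partition pruning:
print(athena_query_cost(10))   # 50.0
print(athena_query_cost(0.1))  # 0.5
```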

Cheat Snippet

CREATE TABLE clean WITH (
  format='PARQUET',
  partitioned_by=ARRAY['dt']
) AS SELECT user_id, event, dt
FROM raw WHERE dt > '2026-01-01';

5. Kinesis (Streams + Firehose)

Serverless · Streaming

TL;DR

Data Streams: Managed Kafka-like shards. You read with consumers (Lambda, KCL).

Firehose: Zero-admin delivery pipe. Buffers → dumps to S3/Redshift/OpenSearch.

Data Analytics: Managed Apache Flink for streaming SQL/Java.

Streaming Architecture

Producers → Data Streams → Firehose → S3 Lake

⚠ The Trap

Firehose min buffer = 60s (NOT real-time). Lambda transform timeout = 3 min, 6MB payload max. No deduplication. Shard splitting is manual and disruptive.
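Because shard splitting is manual, size the stream up front. Shard counts follow directly from the per-shard write limits (1MB/s or 1,000 records/s, whichever is hit first); a sketch:

```python
import math

# A shard accepts 1 MB/s or 1,000 records/s of writes, whichever
# limit is hit first, so size for the stricter of the two.
def shards_needed(write_mb_per_sec: float, records_per_sec: float) -> int:
    return max(math.ceil(write_mb_per_sec / 1.0),
               math.ceil(records_per_sec / 1000.0))

print(shards_needed(4.5, 3000))  # 5 (throughput-bound)
```

Small records can make a stream record-bound long before it is throughput-bound.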

Cheat Snippet

aws firehose create-delivery-stream \
  --delivery-stream-name clicks \
  --s3-destination-configuration \
    RoleARN=$ROLE,\
    BucketARN=arn:aws:s3:::lake,\
    CompressionFormat=Snappy

6. AWS DMS

Provisioned · CDC · Migration

TL;DR

Reads the source DB's transaction log (binlog/WAL/redo) and replays it on the target. Full-load + ongoing CDC replication. NOT a transformation engine - it moves data.
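The log-replay idea can be sketched in a few lines (a toy in-memory target; real DMS reads binlog/WAL/redo and writes through endpoint drivers):

```python
# Toy in-memory replay of ordered change events (op, key, row),
# mimicking the insert/update/delete stream DMS derives from the log.
def apply_cdc(target: dict, events: list) -> dict:
    for op, key, row in events:
        if op == "delete":
            target.pop(key, None)
        else:                      # insert and update both upsert
            target[key] = row
    return target

tgt = apply_cdc({}, [
    ("insert", 1, {"name": "ada"}),
    ("update", 1, {"name": "ada l."}),
    ("insert", 2, {"name": "grace"}),
    ("delete", 2, None),
])
print(tgt)  # {1: {'name': 'ada l.'}}
```

Ordering is the whole game: replaying the same events out of order produces a different target, which is why DMS replication lag matters.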

Migration Flow

Oracle → SCT (schema conversion) → DMS → Aurora

⚠ The Trap

Replication instance = single EC2 (no auto-scaling). LOB columns are brutally slow. Monitor replication lag or you lose data. NOT an ETL tool - minimal transformation only.

Cheat Snippet

aws dms create-replication-task \
  --replication-task-identifier ora-to-pg \
  --source-endpoint-arn $SRC \
  --target-endpoint-arn $TGT \
  --replication-instance-arn $RI \
  --migration-type full-load-and-cdc \
  --table-mappings file://mappings.json

7. Service Comparison

Service     Arch         Engine          Mode        Pricing
Glue        Serverless   Spark           Batch       $0.44/DPU-hr
EMR         Provisioned  Spark/Hadoop    Batch       EC2 + ~20%
Athena      Serverless   Trino           Ad-hoc      $5/TB scanned
Kinesis     Serverless   Custom          Stream      $0.029/GB
DMS         Provisioned  CDC             Migration   Instance-hr
Lambda      Serverless   Any runtime     Event       Per invoke
Step Fn     Serverless   State machine   Orchestr.   Per transition
Redshift    Provisioned  Columnar        Warehouse   Node-hr

8. Glue: Data Lake ETL

What is AWS Glue? Managed Spark for batch ETL. Includes a Data Catalog shared by Athena, EMR, Redshift.

Hub-and-Spoke Architecture

RDS / S3 CSV / DynamoDB → Data Catalog (central metastore) → Athena / Redshift / EMR

9. EMR: Big Data at Scale

What is Amazon EMR? Full Hadoop/Spark/Presto clusters. You pick instances, install JARs, tune memory.

Cluster Processing (50TB/day)

50TB Raw → EMR Spark Cluster (master node + core nodes + spot task nodes) → Parquet

10. Athena: SQL on S3

What is Amazon Athena? Serverless Presto/Trino. Point at S3, write SQL, get results. Pay per TB scanned.

Query → Scan → Results

Analyst → SELECT... → Athena → S3 partitioned Parquet (dt=1 ... dt=4), billed at $5/TB scanned

CTAS materialization

CREATE TABLE clean_events
WITH (format='PARQUET', partitioned_by=ARRAY['dt'])
AS SELECT user_id, action, ts, dt
FROM raw_events
WHERE dt > '2026-01-01' AND action IS NOT NULL;

11. Kinesis: Streaming Pipelines

What is Amazon Kinesis? Streaming family. Data Streams = managed shards. Firehose = delivery. Flink = streaming SQL.

Multi-shard Fan-out

Producers (P1, P2, P3) → Data Streams (Shard 0, Shard 1, Shard 2) → Lambda / Firehose / Flink → S3 Lake → Query

12. DMS: Database Migration

What is AWS DMS? Reads source DB transaction logs and replays on target. Full-load + ongoing CDC.

Zero-Downtime Migration

Oracle (source) → binlog → DMS (full-load + CDC) → Aurora PG (target), replication lag: 0ms

13. Batch ETL Pattern

Standard nightly ETL: raw → transform → load

S3 Raw → Glue Job → S3 Clean → Redshift

Crawler discovers schema → PySpark transforms → COPY loads → analysts query

14. Streaming ETL Pattern

Real-time clickstream / IoT ingestion

IoT → Kinesis → Firehose → S3 Parquet → Athena

Buffer 60s → Snappy compress → partition by dt → SQL analytics
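Partitioning by dt works because the transform writes Hive-style key prefixes that Athena can prune; a sketch of the path scheme (bucket and table names are placeholders):

```python
from datetime import datetime, timezone

# Hive-style dt= prefixes are what let Athena prune partitions.
# Bucket/table names here are placeholders.
def partition_prefix(bucket: str, table: str, epoch_seconds: float) -> str:
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"s3://{bucket}/{table}/dt={dt}/"

print(partition_prefix("lake", "clicks", 0))  # s3://lake/clicks/dt=1970-01-01/
```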

15. Key Limits & Quotas

Glue - Max 100 concurrent job runs. Cold start 5-10 min. Max 2000 DPUs per account.
EMR - Default 20 instances per cluster. EBS storage limit per node. Spot can terminate mid-stage.
Athena - 20 concurrent queries default. 100 databases. Results expire in S3. No transactions.
Kinesis - 1MB record size. 1MB/s write per shard. 2MB/s read per shard. Retention 24h by default, extendable up to 365 days (extra cost).
Firehose - 60s min buffer. 3 min Lambda timeout. 6MB payload. No dedup. Dynamic partitioning adds cost.
DMS - LOBs limited to 32KB (default). No parallel full-load for some engines. Single-AZ replication instance.
Lambda - 15 min timeout. 10GB memory max. 250MB deployment package. 6MB sync payload.

16. Lake Formation & Governance

Lake Formation - Fine-grained access control on your data lake. Column/row-level security on Glue Catalog tables.

  • Manages permissions instead of S3 bucket policies
  • Tag-based access control (LF-Tags)
  • Cross-account data sharing without S3 replication
  • Works with Glue, Athena, Redshift Spectrum, EMR

Glue Data Quality - DQDL rules on Glue jobs. Fail pipelines on bad data.

Rules = [
  ColumnValues "age" between 0 and 150,
  Completeness "email" > 0.95,
  IsUnique "user_id"
]
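The three DQDL rules above check a value range, completeness, and uniqueness; the same logic in plain Python (a local approximation for testing samples, not Glue's engine):

```python
# Local approximation of the three DQDL rules above (not Glue's engine):
# age in [0, 150], email completeness > 0.95, user_id unique.
def check_quality(rows: list) -> dict:
    completeness = sum(1 for r in rows if r.get("email")) / len(rows)
    ids = [r["user_id"] for r in rows]
    return {
        "age_range": all(0 <= r["age"] <= 150 for r in rows),
        "email_completeness": completeness > 0.95,
        "user_id_unique": len(ids) == len(set(ids)),
    }

ok = check_quality([
    {"age": 30, "email": "a@x.com", "user_id": 1},
    {"age": 40, "email": "b@x.com", "user_id": 2},
])
print(ok)  # all three rules pass
```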

17. Cost Cheat Sheet

What you actually pay (us-east-1, 2026)

Glue: $0.44/DPU-hr
EMR: EC2 + ~20%
Athena: $5/TB scanned
Firehose: $0.029/GB
DMS: ~$0.18/hr
Lambda: $0.20/1M invocations

Pro tip: Glue + Athena on Parquet is the cheapest serverless stack. EMR on spot is cheapest at petabyte scale.

The Complete Cheat Sheet for AWS ETL Tools

This cheat sheet for AWS ETL tools is the fastest way to compare AWS Glue, Amazon EMR, Amazon Athena, Kinesis Firehose, and AWS DMS. Each tool card includes the TL;DR (what it actually is under the hood), the exact use case, the hidden traps that will burn your budget, and a copy-paste CLI snippet.

Whether you are building a batch data pipeline, a real-time streaming architecture, or migrating databases, this AWS ETL tools cheat sheet gives you the decision matrix, architecture diagrams, and pricing comparison to pick the right tool in 30 seconds.

Unlike the AWS documentation, this reference is brutally honest about limitations. We tell you about Glue's cold start, Athena's scan costs, EMR's operational overhead, Firehose's 60-second buffer, and DMS's single-instance bottleneck.

AWS ETL Tools Cheat Sheet FAQ

What is a cheat sheet for AWS ETL tools?

A cheat sheet for AWS ETL tools is a quick-reference guide that compares AWS data processing services — AWS Glue, Amazon EMR, Amazon Athena, Kinesis Firehose, and AWS DMS — by architecture, pricing, use cases, and limitations. It helps data engineers pick the right tool without reading 50 pages of documentation.

What is the difference between AWS Glue and Amazon EMR?

AWS Glue is serverless Spark — you submit a job and AWS manages the cluster. Amazon EMR gives you full control over Hadoop/Spark clusters (instance types, tuning, custom JARs). Use Glue for standard batch ETL under 10TB. Use EMR for petabyte-scale jobs where you need to tune executor memory and install custom libraries.

When should I use Amazon Athena instead of AWS Glue?

Use Athena for ad-hoc SQL queries on data already in S3 — no infrastructure needed, pay per TB scanned. Use Glue when you need to transform and move data (ETL jobs, crawlers, data catalog). Athena is technically ELT (query-time transformation). The key cost trap: unpartitioned CSV data in Athena can cost 100x more than partitioned Parquet.

What are the hidden costs of AWS ETL tools?

AWS Glue: DPU-hours round up, and the 5-10 min cold start burns money. Athena: $5/TB scanned — CSV vs partitioned Parquet can be a 100x cost difference. EMR: over-provisioned clusters and spot interruptions. Kinesis Firehose: not true real-time (60s buffer minimum). DMS: the replication instance is a single EC2 with no auto-scaling.

What is the best AWS ETL tool for real-time streaming?

Kinesis Firehose for managed ingestion (buffer → S3/Redshift/OpenSearch). Kinesis Data Streams + Lambda for record-level processing. Kinesis Data Analytics (Managed Apache Flink) for complex streaming SQL/Java. Firehose is simplest but has a minimum 60-second buffer — it's near-real-time, not true real-time.

How do I choose between AWS DMS and AWS Glue for data migration?

DMS is for database-to-database migration with CDC (change data capture). It reads transaction logs and replays them. AWS Glue is for ETL — extract, transform, load with Spark. If you're migrating Oracle to Aurora with zero downtime, use DMS. If you're transforming and loading data into a data lake, use Glue.

Ready to practice data engineering interviews?

Practice with an AI interviewer that tests your ability to design data pipelines and articulate trade-offs in real time.

Try Crackr free
