AWS ETL Tools Cheat Sheet

crackr.dev

1. When to Use What

The decision matrix. Pick the row that matches your problem.

Glue - Serverless batch ETL, schema discovery, data catalog
EMR - Petabyte-scale Spark/Hadoop, full cluster control
Athena - Ad-hoc SQL on S3, no infra, pay per scan
Kinesis - Real-time streaming ingestion + transforms
DMS - Database migration with CDC, zero downtime
Step Functions - Orchestrate multi-step ETL workflows
Lambda - Lightweight, event-driven micro-transforms
Redshift - Columnar data warehouse for analytics queries
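The matrix above can be encoded as a tiny lookup helper (a hypothetical sketch; the problem phrases are paraphrased from the table, not an official taxonomy):

```python
# Hypothetical helper mirroring the decision matrix above.
ETL_MATRIX = {
    "serverless batch ETL": "Glue",
    "petabyte-scale Spark with full control": "EMR",
    "ad-hoc SQL on S3": "Athena",
    "real-time streaming ingestion": "Kinesis",
    "database migration with CDC": "DMS",
    "multi-step workflow orchestration": "Step Functions",
    "event-driven micro-transforms": "Lambda",
    "columnar analytics warehouse": "Redshift",
}

def pick_service(problem: str) -> str:
    """Return the matching AWS service; raises KeyError for unknown problems."""
    return ETL_MATRIX[problem]

print(pick_service("ad-hoc SQL on S3"))  # Athena
```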

2. AWS Glue

Serverless · Apache Spark · Batch

TL;DR

Managed Spark behind an API. You write PySpark, AWS runs the cluster. Includes a Hive-compatible Data Catalog (central metastore for your lake).

Architecture

S3 Raw → Crawler → Data Catalog → ETL Job → S3 Clean

⚠ The Trap

Cold start: 5-10 min. DPU billing rounds up. Crawlers misclassify schemas silently. Glue 4.0 still runs Spark 3.3. At scale, cost per DPU-hour ($0.44) adds up fast.

Cheat Snippet

aws glue start-job-run \
  --job-name my-etl \
  --arguments='--input=s3://raw/,--output=s3://clean/'

Pricing

$0.44/DPU-hour. 1 DPU = 4 vCPU + 16GB. Min 2 DPUs. Crawlers billed separately.
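The pricing rules above fit in a quick cost estimator (a simplified model, assuming per-second billing with a 1-minute minimum as in Glue 2.0+, the 2-DPU floor, and the $0.44/DPU-hour rate):

```python
# Simplified Glue cost model (assumption: per-second billing with a
# 1-minute minimum, 2-DPU floor, at the $0.44/DPU-hour rate above).
def glue_job_cost(dpus: int, runtime_seconds: float,
                  rate_per_dpu_hour: float = 0.44,
                  min_billed_seconds: int = 60) -> float:
    """Estimate job cost: DPUs x billed hours x hourly rate."""
    dpus = max(dpus, 2)                       # Glue enforces a 2-DPU minimum
    billed = max(runtime_seconds, min_billed_seconds)
    return round(dpus * (billed / 3600) * rate_per_dpu_hour, 4)

# 10 DPUs for a 15-minute job:
print(glue_job_cost(10, 15 * 60))  # 1.1  (10 x 0.25h x $0.44)
```

Note how a 1-second job on 1 DPU still bills as 2 DPUs for a full minute.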

3. Amazon EMR

Provisioned · Spark / Hadoop / Presto

TL;DR

Raw Hadoop/Spark clusters you control. Pick instance types, install custom JARs, tune spark.executor.memory. EMR Serverless exists but you lose fine control.

Use Case

  • 50TB+ daily clickstream processing
  • ML model training on Spark MLlib
  • Interactive Presto analytics on massive datasets

⚠ The Trap

You ARE the ops team. Cluster sizing is black magic. Over-provision = burn cash. Under-provision = OOM. Spot interruptions kill long Spark stages. No auto-scaling by default.

Cheat Snippet

aws emr create-cluster \
  --release-label emr-7.0.0 \
  --applications Name=Spark Name=Hive \
  --instance-type m5.xlarge --instance-count 5 \
  --use-default-roles

Pricing

EC2 cost + EMR surcharge (~20%). Use spot for task nodes (up to 90% savings, risk of interruption).
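That pricing can be sanity-checked with a back-of-envelope calculator (assumptions: ~20% EMR surcharge on every node, spot discount applied only to the EC2 portion of task nodes; the m5.xlarge rate is illustrative, not current pricing):

```python
# Back-of-envelope EMR hourly cost (assumptions: ~20% EMR surcharge on
# every node; spot discount applies only to the EC2 portion of task nodes).
def emr_hourly_cost(nodes: int, ec2_rate: float,
                    emr_surcharge: float = 0.20,
                    spot_nodes: int = 0, spot_discount: float = 0.70) -> float:
    on_demand = nodes - spot_nodes
    ec2_cost = on_demand * ec2_rate + spot_nodes * ec2_rate * (1 - spot_discount)
    surcharge = nodes * ec2_rate * emr_surcharge
    return round(ec2_cost + surcharge, 4)

# 5 x m5.xlarge (illustrative $0.192/hr on-demand), 2 of them on spot:
print(emr_hourly_cost(5, 0.192, spot_nodes=2))  # 0.8832
```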

4. Amazon Athena

Serverless · Presto / Trino · Ad-hoc SQL

TL;DR

Presto-as-a-service. Point at S3, write SQL, get results. No infrastructure. Technically ELT (query-time transform) but everyone searches "ETL" for it.

Cost Comparison

CSV (unpartitioned): $50/query
Parquet (partitioned): $0.50/query

⚠ The Trap

$5 per TB scanned. Unpartitioned CSV = 100x cost vs partitioned Parquet. No indexes. Complex joins on huge datasets are slow. Always use CTAS to materialize.
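The $50 vs $0.50 comparison falls straight out of scan math at $5/TB (the ~100x reduction from Parquet compression plus partition pruning is illustrative; Athena also applies a 10MB per-query minimum):

```python
# Athena bills $5 per TB scanned, with a 10MB minimum per query.
def athena_query_cost(scanned_tb: float, rate_per_tb: float = 5.0) -> float:
    scanned_mb = max(scanned_tb * 1024 * 1024, 10)  # 10MB minimum charge
    return round(scanned_mb / (1024 * 1024) * rate_per_tb, 6)

# 10TB of raw CSV vs ~0.1TB after Parquet compression + partition pruning:
print(athena_query_cost(10))   # 50.0
print(athena_query_cost(0.1))  # 0.5
```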

Cheat Snippet

CREATE TABLE clean WITH (
  format='PARQUET',
  partitioned_by=ARRAY['dt']
) AS SELECT user_id, event, dt
FROM raw WHERE dt > '2026-01-01';

5. Kinesis (Streams + Firehose)

Serverless · Streaming

TL;DR

Data Streams: Managed Kafka-like shards. You read with consumers (Lambda, KCL).

Firehose: Zero-admin delivery pipe. Buffers → dumps to S3/Redshift/OpenSearch.

Data Analytics: Managed Apache Flink for streaming SQL/Java.

Streaming Architecture

Producers → Data Streams → Firehose → S3 Lake

⚠ The Trap

Firehose min buffer = 60s (NOT real-time). Lambda transform timeout = 3 min, 6MB payload max. No deduplication. Shard splitting is manual and disruptive.
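Because shard splitting is manual, size the stream up front. Shard counts follow directly from the per-shard write limits (1MB/s or 1,000 records/s, whichever is hit first); a sketch:

```python
import math

# A shard accepts 1 MB/s or 1,000 records/s of writes, whichever
# limit is hit first, so size for the stricter of the two.
def shards_needed(write_mb_per_sec: float, records_per_sec: float) -> int:
    return max(math.ceil(write_mb_per_sec / 1.0),
               math.ceil(records_per_sec / 1000.0))

print(shards_needed(4.5, 3000))  # 5 (throughput-bound)
```

Small records can make a stream record-bound long before it is throughput-bound.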

Cheat Snippet

aws firehose create-delivery-stream \
  --delivery-stream-name clicks \
  --s3-destination-configuration \
    RoleARN=$ROLE,\
    BucketARN=arn:aws:s3:::lake,\
    CompressionFormat=Snappy

6. AWS DMS

Provisioned · CDC · Migration

TL;DR

Reads the source DB's transaction log (binlog/WAL/redo) and replays it on the target. Full-load + ongoing CDC replication. NOT a transformation engine - it moves data.
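The log-replay idea can be sketched in a few lines (a toy in-memory target; real DMS reads binlog/WAL/redo and writes through endpoint drivers):

```python
# Toy in-memory replay of ordered change events (op, key, row),
# mimicking the insert/update/delete stream DMS derives from the log.
def apply_cdc(target: dict, events: list) -> dict:
    for op, key, row in events:
        if op == "delete":
            target.pop(key, None)
        else:                      # insert and update both upsert
            target[key] = row
    return target

tgt = apply_cdc({}, [
    ("insert", 1, {"name": "ada"}),
    ("update", 1, {"name": "ada l."}),
    ("insert", 2, {"name": "grace"}),
    ("delete", 2, None),
])
print(tgt)  # {1: {'name': 'ada l.'}}
```

Ordering is the whole game: replaying the same events out of order produces a different target, which is why DMS replication lag matters.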

Migration Flow

Oracle → SCT (schema conversion) → DMS → Aurora

⚠ The Trap

Replication instance = single EC2 (no auto-scaling). LOB columns are brutally slow. Monitor replication lag or you lose data. NOT an ETL tool - minimal transformation only.

Cheat Snippet

aws dms create-replication-task \
  --replication-task-identifier ora-to-pg \
  --source-endpoint-arn $SRC \
  --target-endpoint-arn $TGT \
  --replication-instance-arn $RI \
  --migration-type full-load-and-cdc \
  --table-mappings file://mappings.json

7. Service Comparison

Service     Arch         Engine          Mode        Pricing
Glue        Serverless   Spark           Batch       $0.44/DPU-hr
EMR         Provisioned  Spark/Hadoop    Batch       EC2 + ~20%
Athena      Serverless   Trino           Ad-hoc      $5/TB scanned
Kinesis     Serverless   Custom          Stream      $0.029/GB
DMS         Provisioned  CDC             Migration   Instance-hr
Lambda      Serverless   Any runtime     Event       Per invoke
Step Fn     Serverless   State machine   Orchestr.   Per transition
Redshift    Provisioned  Columnar        Warehouse   Node-hr

8. Glue: Data Lake ETL

What is AWS Glue? Managed Spark for batch ETL. Includes a Data Catalog shared by Athena, EMR, Redshift.

Hub-and-Spoke Architecture

RDS / S3 CSV / DynamoDB → Data Catalog (central metastore) → Athena / Redshift / EMR

9. EMR: Big Data at Scale

What is Amazon EMR? Full Hadoop/Spark/Presto clusters. You pick instances, install JARs, tune memory.

Cluster Processing (50TB/day)

50TB Raw → EMR Spark Cluster (master node + core nodes + spot task nodes) → Parquet

10. Athena: SQL on S3

What is Amazon Athena? Serverless Presto/Trino. Point at S3, write SQL, get results. Pay per TB scanned.

Query → Scan → Results

Analyst → SELECT... → Athena → S3 partitioned Parquet (dt=1 ... dt=4), billed at $5/TB scanned

CTAS materialization

CREATE TABLE clean_events
WITH (format='PARQUET', partitioned_by=ARRAY['dt'])
AS SELECT user_id, action, ts, dt
FROM raw_events
WHERE dt > '2026-01-01' AND action IS NOT NULL;

11. Kinesis: Streaming Pipelines

What is Amazon Kinesis? Streaming family. Data Streams = managed shards. Firehose = delivery. Flink = streaming SQL.

Multi-shard Fan-out

Producers (P1, P2, P3) → Data Streams (Shard 0, Shard 1, Shard 2) → Lambda / Firehose / Flink → S3 Lake → Query

12. DMS: Database Migration

What is AWS DMS? Reads source DB transaction logs and replays on target. Full-load + ongoing CDC.

Zero-Downtime Migration

Oracle (source) → binlog → DMS (full-load + CDC) → Aurora PG (target), replication lag: 0ms

13. Batch ETL Pattern

Standard nightly ETL: raw → transform → load

S3 Raw → Glue Job → S3 Clean → Redshift

Crawler discovers schema → PySpark transforms → COPY loads → analysts query

14. Streaming ETL Pattern

Real-time clickstream / IoT ingestion

IoT → Kinesis → Firehose → S3 Parquet → Athena

Buffer 60s → Snappy compress → partition by dt → SQL analytics
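Partitioning by dt works because the transform writes Hive-style key prefixes that Athena can prune; a sketch of the path scheme (bucket and table names are placeholders):

```python
from datetime import datetime, timezone

# Hive-style dt= prefixes are what let Athena prune partitions.
# Bucket/table names here are placeholders.
def partition_prefix(bucket: str, table: str, epoch_seconds: float) -> str:
    dt = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"s3://{bucket}/{table}/dt={dt}/"

print(partition_prefix("lake", "clicks", 0))  # s3://lake/clicks/dt=1970-01-01/
```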

15. Key Limits & Quotas

Glue - Max 100 concurrent job runs. Cold start 5-10 min. Max 2000 DPUs per account.
EMR - Default 20 instances per cluster. EBS storage limit per node. Spot can terminate mid-stage.
Athena - 20 concurrent queries default. 100 databases. Results expire in S3. No transactions.
Kinesis - 1MB record size. 1MB/s write per shard. 2MB/s read per shard. Retention 24h by default, extendable up to 365 days (extra cost).
Firehose - 60s min buffer. 3 min Lambda timeout. 6MB payload. No dedup. Dynamic partitioning adds cost.
DMS - LOBs limited to 32KB (default). No parallel full-load for some engines. Single-AZ replication instance.
Lambda - 15 min timeout. 10GB memory max. 250MB deployment package. 6MB sync payload.

16. Lake Formation & Governance

Lake Formation - Fine-grained access control on your data lake. Column/row-level security on Glue Catalog tables.

  • Manages permissions instead of S3 bucket policies
  • Tag-based access control (LF-Tags)
  • Cross-account data sharing without S3 replication
  • Works with Glue, Athena, Redshift Spectrum, EMR

Glue Data Quality - DQDL rules on Glue jobs. Fail pipelines on bad data.

Rules = [
  ColumnValues "age" between 0 and 150,
  Completeness "email" > 0.95,
  IsUnique "user_id"
]
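The three DQDL rules above check a value range, completeness, and uniqueness; the same logic in plain Python (a local approximation for testing samples, not Glue's engine):

```python
# Local approximation of the three DQDL rules above (not Glue's engine):
# age in [0, 150], email completeness > 0.95, user_id unique.
def check_quality(rows: list) -> dict:
    completeness = sum(1 for r in rows if r.get("email")) / len(rows)
    ids = [r["user_id"] for r in rows]
    return {
        "age_range": all(0 <= r["age"] <= 150 for r in rows),
        "email_completeness": completeness > 0.95,
        "user_id_unique": len(ids) == len(set(ids)),
    }

ok = check_quality([
    {"age": 30, "email": "a@x.com", "user_id": 1},
    {"age": 40, "email": "b@x.com", "user_id": 2},
])
print(ok)  # all three rules pass
```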

17. Cost Cheat Sheet

What you actually pay (us-east-1, 2026)

Glue: $0.44/DPU-hr
EMR: EC2 + ~20%
Athena: $5/TB scanned
Firehose: $0.029/GB
DMS: ~$0.18/hr
Lambda: $0.20/1M invocations

Pro tip: Glue + Athena on Parquet is the cheapest serverless stack. EMR on spot is cheapest at petabyte scale.

The Complete Cheat Sheet for AWS ETL Tools

This cheat sheet for AWS ETL tools is the fastest way to compare AWS Glue, Amazon EMR, Amazon Athena, Kinesis Firehose, and AWS DMS. Each tool card includes the TL;DR (what it actually is under the hood), the exact use case, the hidden traps that will burn your budget, and a copy-paste CLI snippet.

Whether you are building a batch data pipeline, a real-time streaming architecture, or migrating databases, this AWS ETL tools cheat sheet gives you the decision matrix, architecture diagrams, and pricing comparison to pick the right tool in 30 seconds.

Unlike the AWS documentation, this reference is brutally honest about limitations. We tell you about Glue's cold start, Athena's scan costs, EMR's operational overhead, Firehose's 60-second buffer, and DMS's single-instance bottleneck.

AWS ETL Tools Cheat Sheet FAQ

What is a cheat sheet for AWS ETL tools?

A cheat sheet for AWS ETL tools is a quick-reference guide that compares AWS data processing services — AWS Glue, Amazon EMR, Amazon Athena, Kinesis Firehose, and AWS DMS — by architecture, pricing, use cases, and limitations. It helps data engineers pick the right tool without reading 50 pages of documentation.

What is the difference between AWS Glue and Amazon EMR?

AWS Glue is serverless Spark — you submit a job and AWS manages the cluster. Amazon EMR gives you full control over Hadoop/Spark clusters (instance types, tuning, custom JARs). Use Glue for standard batch ETL under 10TB. Use EMR for petabyte-scale jobs where you need to tune executor memory and install custom libraries.

When should I use Amazon Athena instead of AWS Glue?

Use Athena for ad-hoc SQL queries on data already in S3 — no infrastructure needed, pay per TB scanned. Use Glue when you need to transform and move data (ETL jobs, crawlers, data catalog). Athena is technically ELT (query-time transformation). The key cost trap: unpartitioned CSV data in Athena can cost 100x more than partitioned Parquet.

What are the hidden costs of AWS ETL tools?

AWS Glue: DPU-hours round up, and the 5-10 min cold start burns money. Athena: $5/TB scanned — CSV vs partitioned Parquet can be a 100x cost difference. EMR: over-provisioned clusters and spot interruptions. Kinesis Firehose: not true real-time (60s buffer minimum). DMS: the replication instance is a single EC2 with no auto-scaling.

What is the best AWS ETL tool for real-time streaming?

Kinesis Firehose for managed ingestion (buffer → S3/Redshift/OpenSearch). Kinesis Data Streams + Lambda for record-level processing. Kinesis Data Analytics (Managed Apache Flink) for complex streaming SQL/Java. Firehose is simplest but has a minimum 60-second buffer — it's near-real-time, not true real-time.

How do I choose between AWS DMS and AWS Glue for data migration?

DMS is for database-to-database migration with CDC (change data capture). It reads transaction logs and replays them. AWS Glue is for ETL — extract, transform, load with Spark. If you're migrating Oracle to Aurora with zero downtime, use DMS. If you're transforming and loading data into a data lake, use Glue.

Ready to practice data engineering interviews?

Practice with an AI interviewer that tests your ability to design data pipelines and articulate trade-offs in real time.

Try Crackr free
