Building ETL Pipelines That Actually Scale

ETL (Extract, Transform, Load) sounds simple until it isn't. You're pulling data from three APIs, two databases, and a CSV file that someone uploads manually on Fridays. The schemas don't match. The APIs have rate limits. The CSV sometimes has extra columns. And it all needs to be in the warehouse by 6 AM.

I've built ETL systems for networking infrastructure (eBay), financial data (equity research platforms), and healthcare — each with different scale requirements but the same fundamental challenges.

Idempotency Is Non-Negotiable

Every ETL pipeline will fail. Network timeouts, API changes, malformed data — failures are not edge cases; they're the normal state. The pipeline needs to handle re-runs gracefully.

I design every pipeline operation as idempotent: running it twice produces the same result as running it once. This means using INSERT ... ON CONFLICT DO UPDATE instead of plain inserts, and tracking which records have been processed with watermarks or checksums.
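A minimal sketch of the upsert pattern. This uses SQLite's ON CONFLICT clause (same syntax as PostgreSQL's) so the example is self-contained; the `metrics` table and its columns are hypothetical:

```python
import sqlite3

def upsert_records(conn, records):
    # Idempotent load: re-running the same batch leaves the table
    # in the same state, because conflicts update in place.
    conn.executemany(
        """
        INSERT INTO metrics (record_id, value, loaded_at)
        VALUES (:record_id, :value, :loaded_at)
        ON CONFLICT(record_id) DO UPDATE SET
            value = excluded.value,
            loaded_at = excluded.loaded_at
        """,
        records,
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics (record_id TEXT PRIMARY KEY, value REAL, loaded_at TEXT)"
)
batch = [
    {"record_id": "a", "value": 1.0, "loaded_at": "2024-01-01"},
    {"record_id": "b", "value": 2.0, "loaded_at": "2024-01-01"},
]
upsert_records(conn, batch)
upsert_records(conn, batch)  # second run: no duplicates, same end state
count = conn.execute("SELECT COUNT(*) FROM metrics").fetchone()[0]
```

The point is that the pipeline never needs to know whether a previous attempt got halfway through — replaying the batch is always safe.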

Schema-on-Read vs Schema-on-Write

For volatile data sources (APIs that change their response format, CSVs with inconsistent headers), I land raw data first and transform second. Store the raw JSON or CSV in a staging table, then apply transformations in a separate step. This way, when the source schema changes — and it will — you only need to update the transform step, not re-extract.
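A sketch of the land-then-transform split, again with SQLite standing in for the warehouse. The table names (`staging_raw`, `users`) and the payload fields are hypothetical; the key property is that extract stores bytes untouched and transform ignores fields it doesn't know about:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_raw (source TEXT, ingested_at TEXT, payload TEXT)"
)
conn.execute("CREATE TABLE users (user_id TEXT PRIMARY KEY, email TEXT)")

def extract(source, raw_payload, ingested_at):
    # Land the payload verbatim: no parsing, no schema assumptions.
    conn.execute(
        "INSERT INTO staging_raw VALUES (?, ?, ?)",
        (source, ingested_at, raw_payload),
    )

def transform():
    # Parse staged payloads; unexpected extra fields are simply ignored,
    # so a source adding columns only ever touches this step.
    for (payload,) in conn.execute("SELECT payload FROM staging_raw"):
        row = json.loads(payload)
        conn.execute(
            "INSERT INTO users (user_id, email) VALUES (?, ?) "
            "ON CONFLICT(user_id) DO UPDATE SET email = excluded.email",
            (row["id"], row["email"]),
        )
    conn.commit()

# A payload with a surprise field the transform has never seen:
extract("crm_api",
        json.dumps({"id": "u1", "email": "a@example.com", "new_field": 42}),
        "2024-01-01")
transform()
email = conn.execute(
    "SELECT email FROM users WHERE user_id = 'u1'"
).fetchone()[0]
```

Because the raw payload is still in staging, a fixed transform can be replayed over history without touching the source systems.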

Monitoring That Actually Helps

The most important metric for an ETL pipeline isn't throughput — it's data freshness. How old is the newest record in the destination? If the answer is "I don't know," you have a monitoring problem.

Every pipeline I build emits three signals: records processed, records failed, and last successful run timestamp. When the freshness metric exceeds a threshold, someone gets paged — not when the pipeline crashes, but when the data is stale. A silently failing pipeline is worse than a loudly crashing one.
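The freshness check itself is a few lines. A minimal sketch, assuming timestamps are timezone-aware UTC and that paging is whatever your alerting stack does with a boolean; the function name and threshold are illustrative:

```python
from datetime import datetime, timedelta, timezone

def freshness_check(newest_record_ts, threshold, now=None):
    # Alert on stale data, not on crashes: a pipeline that "succeeds"
    # while loading nothing still trips this check.
    now = now or datetime.now(timezone.utc)
    lag = now - newest_record_ts
    return lag, lag > threshold

# 6 AM deadline, newest record landed at 1:30 AM, 4-hour tolerance:
now = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
newest = datetime(2024, 1, 1, 1, 30, tzinfo=timezone.utc)
lag, stale = freshness_check(newest, threshold=timedelta(hours=4), now=now)
# stale is True here: 4h30m of lag exceeds the 4h threshold, so page someone
```

Measuring lag from the newest record, rather than from the last run, is what catches the silent failures: the run timestamp advances even when zero rows land.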

Go vs Python for ETL

I use both. Python (with pandas) excels at ad-hoc transformations, data exploration, and pipelines where the logic changes frequently. Go wins when throughput matters — processing millions of records with low memory usage and high concurrency.

My typical architecture: Go services for high-volume extraction and loading, Python for complex transformation logic, PostgreSQL as the warehouse. Simple, battle-tested, and maintainable by a small team.