Timezone Bugs in Data Pipelines: Normalize at Ingest or Suffer Later
If one source sends local time and another sends UTC, downstream metrics drift silently. Normalize all timestamps at ingest and keep timezone metadata explicit.
Step 1: parse source timestamp with declared source timezone
from zoneinfo import ZoneInfo
from datetime import datetime
def parse_local(ts: str, tz: str) -> datetime:
    dt = datetime.fromisoformat(ts)
    # Attach the declared zone only if the timestamp is naive; an
    # explicit offset in the string takes precedence over tz.
    return dt if dt.tzinfo else dt.replace(tzinfo=ZoneInfo(tz))
Step 2: convert immediately to UTC for storage
utc_dt = local_dt.astimezone(ZoneInfo("UTC"))
Step 3: store original timezone in audit column
row["source_tz"] = "America/New_York"
row["event_time_utc"] = utc_dt.isoformat()
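The three steps above can be combined into one small ingest helper. This is a minimal sketch; `normalize_event` and the exact dict field values are illustrative, not part of any specific pipeline.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def normalize_event(ts: str, source_tz: str) -> dict:
    # Step 1: attach the declared source timezone to the parsed timestamp.
    local_dt = datetime.fromisoformat(ts).replace(tzinfo=ZoneInfo(source_tz))
    # Step 2: convert immediately to UTC for storage.
    utc_dt = local_dt.astimezone(timezone.utc)
    # Step 3: keep the original timezone in an audit column.
    return {"source_tz": source_tz, "event_time_utc": utc_dt.isoformat()}

row = normalize_event("2024-06-01T09:30:00", "America/New_York")
# 09:30 EDT (UTC-4) normalizes to 13:30 UTC.
assert row["event_time_utc"] == "2024-06-01T13:30:00+00:00"
```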
Pitfall
Storing naive datetimes in warehouse tables and assuming every consumer interprets them in the same timezone. Two different instants from different zones can serialize to identical naive values.
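A short sketch of how the pitfall bites: once `tzinfo` is stripped, two events five hours apart become indistinguishable. The timestamps here are made up for illustration.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Two sources record the same wall-clock time in different zones.
ny = datetime(2024, 3, 1, 9, 0, tzinfo=ZoneInfo("America/New_York"))
utc = datetime(2024, 3, 1, 9, 0, tzinfo=ZoneInfo("UTC"))

# Stripping tzinfo makes two different instants look identical...
assert ny.replace(tzinfo=None) == utc.replace(tzinfo=None)

# ...even though 09:00 EST is 14:00 UTC, five hours after 09:00 UTC.
assert (utc - ny).total_seconds() == -5 * 3600
```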
Verification
- DST transitions do not produce duplicate or missing hourly buckets.
- All downstream transforms consume UTC fields only.
- Audit columns trace original source timezone.
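The first verification check can be exercised directly: bucket hourly events spanning a DST transition and confirm UTC buckets stay gap-free while local wall-clock hours do not. A sketch using the 2024-03-10 US spring-forward transition (variable names are illustrative):

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Hourly events spanning the US spring-forward transition (2024-03-10),
# starting at midnight Eastern (05:00 UTC).
start = datetime(2024, 3, 10, 5, 0, tzinfo=timezone.utc)
events = [start + timedelta(hours=h) for h in range(8)]

# Bucketing by UTC hour yields no duplicate and no missing buckets...
utc_buckets = [e.strftime("%Y-%m-%d %H") for e in events]
assert len(set(utc_buckets)) == len(events)

# ...while local wall-clock hours skip 02:xx entirely on this date.
local_hours = [e.astimezone(ZoneInfo("America/New_York")).hour for e in events]
assert 2 not in local_hours
```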