From Demo Scraper to Reliable Collector

Simple scrapers usually work in week one and fail in week two because they ignore production realities: selectors drift, servers throttle, transient errors go unretried, and storage is not idempotent.

Step 1: Separate fetch, parse, and persist stages

def run(urls, fetch, parse, save):
    # Each stage is injected, so it can be swapped or tested in isolation.
    for url in urls:
        html = fetch(url)      # transport only
        records = parse(html)  # extraction only
        save(url, records)     # persistence only
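The payoff of the split is that each stage can be a stub in tests. A minimal runnable sketch (repeating run() so the snippet stands alone), with hypothetical stand-in stages rather than any specific library's API:

```python
def run(urls, fetch, parse, save):
    for url in urls:
        html = fetch(url)
        records = parse(html)
        save(url, records)

saved = {}
run(
    ["https://example.com/a"],
    fetch=lambda url: "<li>item-1</li><li>item-2</li>",        # stub HTTP fetch
    parse=lambda html: html.replace("</li>", "").split("<li>")[1:],  # toy parser
    save=lambda url, records: saved.setdefault(url, records),  # in-memory sink
)
print(saved)  # {'https://example.com/a': ['item-1', 'item-2']}
```

Swapping the fetch stub for a real HTTP client, or the save stub for a database writer, requires no change to run() itself.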

Step 2: Add bounded retry and jitter

import random, time

def with_retry(fn, attempts=4):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # attempts exhausted: surface the last error
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter,
            # so concurrent workers do not retry in lockstep.
            time.sleep((2 ** i) + random.random())
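To see the bounded retry in action, here is a runnable sketch (repeating with_retry so the snippet stands alone) with a deliberately flaky function that fails once before succeeding:

```python
import random, time

def with_retry(fn, attempts=4):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep((2 ** i) + random.random())

calls = {"n": 0}

def flaky_fetch():
    # Simulates one transient transport error, then success.
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient")
    return "ok"

result = with_retry(flaky_fetch)
print(result, calls["n"])  # ok 2
```

Wrapping only the fetch stage (with_retry(lambda: fetch(url))) keeps parse bugs from being retried pointlessly; a permanent error still surfaces after the final attempt.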

Step 3: Persist with idempotency key

def upsert_record(db, source_url, item_id, payload):
    # The key is deterministic per (source_url, item_id), so re-running
    # the same URL set overwrites existing rows instead of appending.
    key = f"{source_url}:{item_id}"
    db.upsert(key=key, payload=payload)
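A runnable sketch of the idempotency property, using a hypothetical in-memory stand-in for the db handle (a real store would be e.g. SQLite or Postgres with a UNIQUE constraint on the key column):

```python
class MemoryDB:
    # Hypothetical stand-in for the db handle used by upsert_record().
    def __init__(self):
        self.rows = {}

    def upsert(self, key, payload):
        self.rows[key] = payload  # overwrite by key, never append

def upsert_record(db, source_url, item_id, payload):
    key = f"{source_url}:{item_id}"
    db.upsert(key=key, payload=payload)

db = MemoryDB()
for _ in range(2):  # simulate re-running the same URL set
    upsert_record(db, "https://example.com/a", "item-1", {"price": 10})
print(len(db.rows))  # 1 row, not 2
```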

Pitfalls

  • Selector changes silently producing empty datasets.
  • No retry policy for intermittent transport errors.
  • Append-only writes creating duplicate records.
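The first pitfall is the most insidious, because nothing crashes. One defensive sketch: treat zero records from a page that normally yields data as an error rather than saving an empty result (parse here is a hypothetical stand-in for your real parser):

```python
class EmptyParseError(RuntimeError):
    """Raised when a page that should yield records parses to nothing."""

def parse_or_fail(parse, html, url):
    # Guard against selector drift: empty output becomes a loud failure
    # instead of a silently empty dataset.
    records = parse(html)
    if not records:
        raise EmptyParseError(f"no records parsed from {url}; selectors may have drifted")
    return records

rows = parse_or_fail(lambda h: ["row-1"], "<html>...</html>", "https://example.com/a")
print(rows)  # ['row-1']
```

Pages that legitimately have zero records need an explicit allowance, but that is a deliberate decision rather than a silent default.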

Verification

  • Retry logs clearly show attempt counts and final result.
  • Parser tests catch selector drift quickly.
  • Re-running the same URL set does not duplicate rows.
