From Demo Scraper to Reliable Collector
Simple scrapers usually work in week one and fail in week two because they ignore production realities: selectors drift, servers throttle, transient errors go unretried, and storage is not idempotent.
Step 1: Separate fetch, parse, and persist stages
```python
def run(urls, fetch, parse, save):
    for url in urls:
        html = fetch(url)       # fetch: url -> raw HTML
        records = parse(html)   # parse: HTML -> list of records
        save(url, records)      # persist: write records keyed by url
```
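Because each stage is an injected callable, tests can substitute fakes without touching the pipeline. A minimal sketch, where the stand-in fetch, parse, and save callables are hypothetical:

```python
def run(urls, fetch, parse, save):  # pipeline as defined above
    for url in urls:
        html = fetch(url)
        records = parse(html)
        save(url, records)

# Hypothetical stand-ins: each stage is a plain callable, so a test
# can exercise the pipeline without network or database access.
saved = []
run(
    ["https://example.com/a"],
    fetch=lambda url: "<li>item</li>",  # fake fetch: canned HTML
    parse=lambda html: [html.replace("<li>", "").replace("</li>", "")],
    save=lambda url, records: saved.append((url, records)),
)
```

The same seam later lets a retry wrapper or an upserting store slot in without rewriting the loop.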
Step 2: Add bounded retry and jitter
```python
import random, time

def with_retry(fn, attempts=4):
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the final error
            # Exponential backoff (1s, 2s, 4s, ...) plus jitter so
            # concurrent workers do not retry in lockstep.
            time.sleep((2 ** i) + random.random())
```
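One way to exercise the wrapper: a hypothetical callable that fails once with a transient error before succeeding, so the happy path and the backoff path are both covered.

```python
import random, time

def with_retry(fn, attempts=4):  # as defined above
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep((2 ** i) + random.random())

# Hypothetical flaky callable: fails on the first attempt only.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:
        raise ConnectionError("transient")
    return "ok"

result = with_retry(flaky)  # retries once, then returns "ok"
```

In the real pipeline the URL is bound with a closure or `functools.partial`, e.g. `with_retry(lambda: fetch(url))`, so `with_retry` only ever sees a zero-argument callable.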
Step 3: Persist with idempotency key
```python
def upsert_record(db, source_url, item_id, payload):
    # Deterministic key: the same (url, item) pair always maps to the
    # same row, so re-runs overwrite instead of appending.
    key = f"{source_url}:{item_id}"
    db.upsert(key=key, payload=payload)
```
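The `db.upsert` call is an assumption about the storage layer; any store with put-by-key semantics works (for SQL, an `INSERT ... ON CONFLICT DO UPDATE`). A toy in-memory sketch makes the idempotency visible:

```python
class MemoryStore:
    """Toy stand-in for a real database with upsert semantics."""
    def __init__(self):
        self.rows = {}

    def upsert(self, key, payload):
        # Same key overwrites in place instead of appending a duplicate.
        self.rows[key] = payload

def upsert_record(db, source_url, item_id, payload):  # as defined above
    key = f"{source_url}:{item_id}"
    db.upsert(key=key, payload=payload)

db = MemoryStore()
upsert_record(db, "https://example.com/a", "42", {"title": "first"})
upsert_record(db, "https://example.com/a", "42", {"title": "second"})  # re-run
# Still one row; the second run updated it rather than duplicating.
```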
Pitfalls
- Selector changes silently producing empty datasets.
- No retry policy for intermittent transport errors.
- Append-only writes creating duplicate records.
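The first pitfall can be turned into a loud failure with a cheap guard; a sketch, where `parse_or_fail` and the `min_records` threshold are hypothetical additions:

```python
def parse_or_fail(html, parse, min_records=1):
    # When a selector silently stops matching, parse() returns an
    # empty list; fail loudly instead of persisting an empty dataset.
    records = parse(html)
    if len(records) < min_records:
        raise ValueError(
            f"parse produced {len(records)} records "
            f"(expected at least {min_records}); selector drift?"
        )
    return records
```

A reasonable `min_records` is conservative (often just 1): the goal is to catch "everything vanished", not to validate exact counts.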
Verification
- Retry logs clearly show attempt counts and final result.
- Parser tests catch selector drift quickly.
- Re-running the same URL set does not duplicate rows.
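The parser-drift check can be a plain unit test against a saved HTML fixture; a sketch with a hypothetical `parse_titles` and an inline fixture (in practice the fixture is a saved copy of a real page, refreshed when the site legitimately changes):

```python
import re

FIXTURE = "<h2 class='title'>Alpha</h2><h2 class='title'>Beta</h2>"

def parse_titles(html):
    # Hypothetical parser; regex stands in for a real selector.
    return re.findall(r"<h2 class='title'>(.*?)</h2>", html)

def test_parser_matches_fixture():
    titles = parse_titles(FIXTURE)
    # A drifted selector typically yields [] here, failing fast.
    assert titles == ["Alpha", "Beta"]

test_parser_matches_fixture()
```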