Web Scraping at Scale: Adaptive Pacing Without Getting Blocked
Fixed-rate scraping fails in production because server tolerance changes over time. Adaptive pacing uses feedback signals, such as timeouts and 429 responses, to keep the crawl stable and respectful of the target server.
Step 1: Track per-domain response quality
stats = {
    "ok": 0,
    "timeouts": 0,
    "status_429": 0,
}
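A minimal sketch of feeding that dict from live responses, assuming a requests-style response object and one stats dict per domain (the `record` helper and `defaultdict` wiring are illustrative, not from the article):

```python
from collections import defaultdict

def new_stats():
    # One counter set per domain, matching the shape above.
    return {"ok": 0, "timeouts": 0, "status_429": 0}

domain_stats = defaultdict(new_stats)

def record(domain, status=None, timed_out=False):
    # Classify one completed request into the per-domain counters.
    s = domain_stats[domain]
    if timed_out:
        s["timeouts"] += 1
    elif status == 429:
        s["status_429"] += 1
    elif status is not None and 200 <= status < 300:
        s["ok"] += 1

record("example.com", status=200)
record("example.com", status=429)
record("example.com", timed_out=True)
```

Resetting or decaying these counters periodically keeps the pacing responsive to recent behavior rather than the whole crawl's history.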
Step 2: Adjust delay window based on recent errors
def next_delay(base, stats):
    penalty = stats["timeouts"] * 0.3 + stats["status_429"] * 0.6
    return min(8.0, base + penalty)
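In the crawl loop this looks roughly like the sketch below (the jitter amount is an assumption; the function from Step 2 is repeated so the example runs on its own):

```python
import random

def next_delay(base, stats):
    # Each recent timeout adds 0.3s, each 429 adds 0.6s, capped at 8s.
    penalty = stats["timeouts"] * 0.3 + stats["status_429"] * 0.6
    return min(8.0, base + penalty)

stats = {"ok": 40, "timeouts": 2, "status_429": 1}
delay = next_delay(1.0, stats)  # 1.0 + 2*0.3 + 1*0.6 = 2.2 seconds

# Add jitter before sleeping so parallel workers don't align
# on the same schedule: time.sleep(delay + jitter)
sleep_for = delay + random.uniform(0.0, 0.5)
```

The cap matters: without it, a burst of throttling could push the delay so high the crawl effectively stalls.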
Step 3: Persist crawl cursor for resumability
state["last_url"] = current_url
save_state(state)
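One way to implement `save_state` (the function is named in the article but not defined; this JSON-plus-atomic-rename version is an assumption) is to write to a temp file and rename, so a crash never leaves a half-written checkpoint:

```python
import json
import os
import tempfile

def save_state(state, path="crawl_state.json"):
    # Write to a temp file in the same directory, then atomically
    # replace the checkpoint so it is never partially written.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_state(path="crawl_state.json"):
    # Missing checkpoint means a fresh crawl.
    if not os.path.exists(path):
        return {"last_url": None}
    with open(path) as f:
        return json.load(f)

state = load_state()
state["last_url"] = "https://example.com/page/42"
save_state(state)
```

On restart, `load_state` returns the saved cursor and the crawl resumes from `last_url` instead of the beginning.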
Pitfalls
- Constant request interval regardless of server signals.
- No checkpointing, forcing full restart after errors.
- Ignoring Retry-After on throttled responses.
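On that last pitfall: Retry-After may be either a number of seconds or an HTTP-date, so a parser needs to handle both. A sketch (the fallback default of 5 seconds is an assumption):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, default=5.0):
    # Returns how long to wait, honoring the server's Retry-After
    # header; falls back to `default` when absent or unparseable.
    if not header_value:
        return default
    try:
        # Delta-seconds form, e.g. "120".
        return max(0.0, float(header_value))
    except ValueError:
        pass
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2026 07:28:00 GMT".
        when = parsedate_to_datetime(header_value)
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default

retry_after_seconds("120")  # → 120.0
```

When the header is present, its value should override whatever `next_delay` would otherwise choose.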
Validation
- Throttle events decrease after adaptive pacing is enabled.
- Interrupted jobs resume from saved cursor.
- Error budget per domain stays within target.
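The last check can be automated with a small helper over the Step 1 counters (the 5% budget here is an illustrative assumption, not a value from the article):

```python
def within_budget(stats, budget=0.05):
    # True when the share of failed requests (timeouts + 429s)
    # stays at or below the budget for this domain.
    total = stats["ok"] + stats["timeouts"] + stats["status_429"]
    if total == 0:
        return True  # no traffic yet, nothing to judge
    errors = stats["timeouts"] + stats["status_429"]
    return errors / total <= budget

within_budget({"ok": 97, "timeouts": 2, "status_429": 1})  # 3% → True
```

Domains that fail this check are candidates for a larger base delay or a temporary pause.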