The Notebook Trap
Many data science learners build promising notebooks, but the moment they need repeatable experiments, nothing reproduces: results depend on cell execution order, hidden state, and hard-coded paths. The missing piece is project structure discipline.
Step 1: Split exploration and production paths
```text
project/
  notebooks/
  src/
  data/
  features/
  models/
  tests/
  configs/
```
Step 2: Move feature logic into importable modules
```python
def add_ratio_features(df):
    out = df.copy()  # never mutate the caller's frame
    # clip(lower=1) guards against division by zero when revenue is 0
    out["profit_ratio"] = out["profit"] / out["revenue"].clip(lower=1)
    out["cost_ratio"] = out["cost"] / out["revenue"].clip(lower=1)
    return out
```
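Once the logic lives in a module (say `src/features.py`), it can be unit tested without opening a notebook. A minimal sketch, with the function inlined here so the example is self-contained (the file layout and test name are assumptions):

```python
import pandas as pd

def add_ratio_features(df):
    out = df.copy()
    out["profit_ratio"] = out["profit"] / out["revenue"].clip(lower=1)
    out["cost_ratio"] = out["cost"] / out["revenue"].clip(lower=1)
    return out

def test_add_ratio_features():
    df = pd.DataFrame({
        "profit": [50.0, 10.0],
        "revenue": [100.0, 0.0],  # second row exercises the zero-revenue guard
        "cost": [20.0, 5.0],
    })
    result = add_ratio_features(df)
    assert result["profit_ratio"].tolist() == [0.5, 10.0]
    assert result["cost_ratio"].tolist() == [0.2, 5.0]
    assert "profit_ratio" not in df.columns  # input frame left untouched

test_add_ratio_features()
```

With `pytest` on the path, the same function would be picked up automatically from `tests/` by its `test_` prefix.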
Step 3: Make experiments config-driven
```yaml
experiment:
  model: random_forest
  train_split: 0.8
  random_seed: 42
features:
  - profit_ratio
  - cost_ratio
```
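A runner then reads this file and seeds everything before training. A minimal sketch, assuming PyYAML is installed and with the config inlined as a string for self-containment (in practice it would be read from `configs/`; the `load_experiment` name is an assumption):

```python
import random

import yaml  # PyYAML, assumed available

CONFIG = """
experiment:
  model: random_forest
  train_split: 0.8
  random_seed: 42
features:
  - profit_ratio
  - cost_ratio
"""

def load_experiment(text: str) -> dict:
    """Parse the YAML config and seed the RNG before any training happens."""
    cfg = yaml.safe_load(text)
    random.seed(cfg["experiment"]["random_seed"])
    return cfg
```

Because every knob lives in the config, changing the model or the feature list never requires editing code, only committing a new YAML file.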
Pitfalls
- Hard-coded file paths in notebooks.
- No seed control for model training.
- Feature engineering mixed with plotting code.
Verification
- A fresh clone can run one experiment command end-to-end.
- Two runs with the same config produce consistent metrics.
- Feature functions are unit tested outside notebooks.
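The second check above can itself be automated: run the experiment twice with the same seed and compare the metric. The sketch below stands in for real training with a seeded random computation (the `run_experiment` function is hypothetical):

```python
import random

def run_experiment(seed: int) -> float:
    """Stand-in for a training run: any computation driven only by the seed."""
    random.seed(seed)
    return sum(random.random() for _ in range(100))

# Same config (seed) twice must yield the same metric, bit for bit.
assert run_experiment(42) == run_experiment(42)
```

If this assertion ever fails in a real project, some source of randomness (data shuffling, model init, a library's internal RNG) is escaping the config.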