A Practical PCA Pipeline That Avoids Data Leakage
PCA is useful, but many pipelines leak test information by fitting transforms on the full dataset before splitting. That inflates evaluation metrics and gives false confidence.
1) Split first, then fit transforms on train only
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Split before any fitting so test rows never influence the transforms.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

# Fit the scaler on the training split only; apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
```
2) Fit PCA on scaled train data only
```python
# Keep enough components to explain 95% of the training variance.
pca = PCA(n_components=0.95).fit(X_train_s)
Xt_train = pca.transform(X_train_s)
Xt_test = pca.transform(X_test_s)
```
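If the downstream model is known, an alternative that makes this discipline automatic is to chain the scaler and PCA in a scikit-learn Pipeline, which refits both steps on the training portion of every split. A minimal sketch, using a logistic-regression classifier as a placeholder downstream model (not part of the original pipeline):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The pipeline refits scaler and PCA inside each CV fold, so test-fold
# statistics can never shape the transforms.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
    ("clf", LogisticRegression(max_iter=1000)),  # placeholder model
])
scores = cross_val_score(pipe, X, y, cv=5)
```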
3) Log explained variance with run metadata
```python
# Record what the fitted PCA actually kept, for later review.
artifact = {
    "variance_ratio": pca.explained_variance_ratio_.tolist(),
    "n_components": int(pca.n_components_),
}
```
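To make the artifact reviewable later, it can be written to disk together with run metadata. A minimal sketch; the file path and metadata fields here are illustrative, not part of the original pipeline:

```python
import json
from datetime import datetime, timezone

import sklearn

# Hypothetical metadata fields; adapt to your experiment-tracking setup.
artifact["run_metadata"] = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "sklearn_version": sklearn.__version__,
    "split_random_state": 7,         # the seed used in train_test_split above
    "pca_params": pca.get_params(),  # default PCA params are JSON-serializable
}

with open("pca_artifact.json", "w") as f:  # illustrative path
    json.dump(artifact, f, indent=2)
```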
Failure patterns
- Calling `fit_transform` on the entire dataset before the split (a deliberately wrong sketch follows this list).
- Comparing models with different preprocessing but shared metrics.
- No artifact trail for PCA version and parameters.
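For contrast, the first failure pattern looks like this. A deliberately wrong sketch; do not copy it:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# ANTI-PATTERN: fitting on all rows lets test-set means, variances, and
# principal axes shape the transforms before the split happens.
X_leaky = PCA(n_components=0.95).fit_transform(
    StandardScaler().fit_transform(X)  # test rows influence scaling and axes
)
X_train_bad, X_test_bad = train_test_split(X_leaky, test_size=0.2, random_state=7)
```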
What to verify
- Train/test boundaries are respected by every transform.
- Pipeline reruns produce identical component counts.
- Variance metrics are logged and reviewable.
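A minimal sketch of the second check: refit on the same scaled training data and assert the retained component count is stable. PCA's full SVD solver is deterministic; a randomized solver would additionally need a fixed random_state:

```python
from sklearn.decomposition import PCA

# Refit with identical data and parameters; the retained component
# count should match the original fit exactly.
pca_rerun = PCA(n_components=0.95).fit(X_train_s)
assert pca_rerun.n_components_ == pca.n_components_, (
    f"component count drifted: {pca_rerun.n_components_} != {pca.n_components_}"
)
```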