A Practical PCA Pipeline That Avoids Data Leakage

PCA is useful, but many pipelines leak test information by fitting transforms on the full dataset before splitting. That inflates metrics and gives false confidence.

1) Split first, then fit transforms on train only

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

2) Fit PCA on scaled train data only

pca = PCA(n_components=0.95).fit(X_train_s)
Xt_train = pca.transform(X_train_s)
Xt_test = pca.transform(X_test_s)
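A quick sanity check on the fit above (a sketch; synthetic data stands in for the scaled training matrix): with a float `n_components`, scikit-learn keeps the smallest number of components whose cumulative explained variance reaches that threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X_train_s = rng.normal(size=(200, 10))  # stand-in for the scaled train data above

pca = PCA(n_components=0.95).fit(X_train_s)

# The retained components' explained variance ratios must sum to at least
# the requested threshold, and never exceed the feature count.
cum = np.cumsum(pca.explained_variance_ratio_)
assert cum[-1] >= 0.95
assert pca.n_components_ <= X_train_s.shape[1]
```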

3) Log explained variance with run metadata

artifact = {
    "variance_ratio": pca.explained_variance_ratio_.tolist(),
    "n_components": int(pca.n_components_),
}
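The step title mentions run metadata, which the dict above does not yet carry. One hedged way to persist it alongside the variance figures (the field names and file path here are illustrative, not a fixed schema):

```python
import json
import platform
from datetime import datetime, timezone

import sklearn

# Illustrative artifact: "variance_ratio" and "n_components" would come from
# the fitted PCA above; placeholder values are used here. The remaining
# fields are run metadata (field names are hypothetical).
artifact = {
    "variance_ratio": [0.6, 0.2, 0.1, 0.05],
    "n_components": 4,
    "random_state": 7,  # the split seed used earlier
    "sklearn_version": sklearn.__version__,
    "python_version": platform.python_version(),
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

with open("pca_artifact.json", "w") as f:
    json.dump(artifact, f, indent=2)
```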

Failure patterns

  • Calling fit_transform on the entire dataset before split.
  • Comparing models with different preprocessing but shared metrics.
  • No artifact trail for PCA version and parameters.
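One way to make the first failure hard to commit is to bundle the transforms in a scikit-learn Pipeline, so fitting on anything other than the training split takes deliberate effort. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
X = rng.normal(size=(300, 12))
y = rng.integers(0, 2, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),
])

# fit_transform touches only the training split; transform applies the
# frozen train-derived statistics to the test split.
Xt_train = pipe.fit_transform(X_train)
Xt_test = pipe.transform(X_test)

assert Xt_train.shape[1] == Xt_test.shape[1]
```

Inside cross-validation, passing the pipeline to `cross_val_score` refits the scaler and PCA per fold, which keeps the fold boundaries honest as well.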

What to verify

  • Train/test boundaries are respected by every transform.
  • Pipeline reruns produce identical component counts.
  • Variance metrics are logged and reviewable.
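The second check above can be automated as a small assertion. A sketch that refits twice on the same data and compares component counts (synthetic data stands in for the scaled train matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
X_train_s = rng.normal(size=(200, 10))  # stand-in for scaled training data

n1 = PCA(n_components=0.95).fit(X_train_s).n_components_
n2 = PCA(n_components=0.95).fit(X_train_s).n_components_

# PCA's full SVD is deterministic for identical input, so refitting on the
# same scaled training matrix must select the same number of components.
assert n1 == n2
```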
