
What Is Model Drift and Why Does It Sink Production ML?

Kevin Nakamura August 11, 2025 8 min read


You trained a fraud detection model last quarter. It hit 94% accuracy on your holdout set. You deployed it. Six weeks later, you're in a post-mortem because the model let through $40,000 in fraudulent transactions that the old rule-based system would have caught.

What happened? Almost certainly: model drift. But which kind, and where?

The terms "data drift," "concept drift," and "model decay" get used interchangeably in engineering conversations, and that sloppiness causes real problems — because each failure mode has a different cause, a different detection signal, and a different remediation path. Treating them as synonyms means you apply the wrong fix and wonder why your accuracy doesn't recover.

This article draws precise boundaries between each type and gives you the statistical intuition to know which one you're looking at.

Data drift: the input distribution shifts

Data drift (also called covariate shift or feature drift) happens when the statistical distribution of model inputs changes between training and production. The relationship between features and labels hasn't changed — your model's learned function is still correct in principle — but it's now being asked to operate outside the distribution it was trained on.

A concrete example: you train a customer churn model on data from Q1 through Q3. In Q4, your company runs a large acquisition campaign that brings in a cohort of customers with significantly shorter tenures and lower historical LTV than your existing base. The feature distribution for account_age_days and total_spend_90d shifts substantially. Your model, which learned that young, low-spend accounts churn at rate X, now sees those accounts making up 40% of your inference traffic instead of 15%.

The model's predictions won't necessarily be wrong, but its confidence calibration can degrade, and if the new cohort behaves differently from what the training data implied, accuracy drops. What makes this failure mode insidious is that the model never "breaks" in a technical sense. The serving endpoint keeps returning predictions. Error rates in your application monitoring stay flat. The only signal is a quiet, gradual slide in business outcomes, which your stakeholders will notice before your monitoring does unless you've instrumented input distribution tracking.

A second concrete scenario that comes up regularly: an e-commerce recommendation model trained on 2024 purchase behavior runs into early 2025 when a supply chain disruption significantly shifts inventory availability. The product_category feature distribution shifts — categories that previously represented 8% of purchases now represent 22%, while high-velocity categories drop out due to stockouts. PSI on product_category climbs from 0.04 (stable) to 0.31 over six weeks. The model keeps recommending the previously-popular-but-now-unavailable items, degrading both recommendation click-through and customer satisfaction. The input data didn't change in structure — the real-world distribution it represents did.

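Detection method: Population Stability Index (PSI) is the standard tool here. PSI bins both distributions and sums, over the bins, the difference in bin proportions weighted by the log of their ratio, which makes it a symmetrized KL divergence. Expected_i is the share of the reference (training) sample that falls in bin i, and Actual_i is the share of the production sample in the same bin: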

PSI = sum((Actual_i - Expected_i) * ln(Actual_i / Expected_i))

# Interpretation thresholds:
# PSI < 0.10  → stable
# PSI 0.10–0.20 → moderate shift, investigate
# PSI > 0.20  → significant drift, action required

PSI is sensitive to the number of bins you use; 10 bins is a common default for continuous features, with fewer for low-cardinality categoricals. The Kolmogorov-Smirnov (KS) test is an alternative for continuous features: it is a nonparametric test of whether two samples come from the same distribution, and it gives you a p-value.
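To make the formula concrete, here is a minimal standard-library sketch of PSI for one continuous feature. The function names, the quantile-based binning, and the `eps` floor for empty bins are assumptions made for illustration; they are not taken from any particular monitoring library.

```python
import math
from bisect import bisect_right


def quantile_edges(reference, n_bins=10):
    """Interior bin edges at evenly spaced quantiles of the reference sample."""
    ordered = sorted(reference)
    edges = []
    for k in range(1, n_bins):
        idx = min(len(ordered) - 1, (k * len(ordered)) // n_bins)
        edges.append(ordered[idx])
    # De-duplicate so heavily tied data does not produce repeated edges.
    return sorted(set(edges))


def bin_proportions(sample, edges, eps=1e-6):
    """Share of the sample in each bin, floored at eps (an assumed smoothing choice)."""
    counts = [0] * (len(edges) + 1)
    for x in sample:
        counts[bisect_right(edges, x)] += 1
    total = len(sample)
    # The floor keeps ln() and the ratio defined when a bin is empty.
    return [max(c / total, eps) for c in counts]


def psi(reference, production, n_bins=10):
    """PSI = sum((Actual_i - Expected_i) * ln(Actual_i / Expected_i))."""
    edges = quantile_edges(reference, n_bins)
    expected = bin_proportions(reference, edges)
    actual = bin_proportions(production, edges)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Hypothetical data: production values shifted upward relative to training.
training_values = list(range(1000))
production_values = [v + 300 for v in range(1000)]
score = psi(training_values, production_values)
```

Binning on the reference sample's quantiles gives every bin roughly the same expected share, which keeps a single sparse bin from dominating the sum. The size of `eps` does affect the result when bins are empty, so treat it as a tunable rather than a fixed constant.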

For high-dimensional feature spaces, you can't run PSI on every feature individually without drowning in false alarms: with hundreds of features, some will cross a threshold by chance alone. Feature importance weighting, meaning you run PSI only on the top-N features by Shapley value from the training set, is a practical approach.
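A sketch of that top-N approach, assuming you already have a per-feature importance score (for example mean absolute Shapley value) and pre-binned proportions for the reference and production windows. All feature names and numbers below are hypothetical.

```python
import math


def psi_from_proportions(expected, actual, eps=1e-6):
    """PSI from two aligned lists of bin (or category) proportions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # assumed floor for empty bins
        total += (a - e) * math.log(a / e)
    return total


def top_n_drift_report(importance, reference, production, top_n=2):
    """Compute PSI only for the top_n features ranked by importance."""
    ranked = sorted(importance, key=importance.get, reverse=True)[:top_n]
    return {
        name: psi_from_proportions(reference[name], production[name])
        for name in ranked
    }


# Hypothetical importance scores and binned proportions.
importance = {"account_age_days": 0.42, "total_spend_90d": 0.31, "signup_channel": 0.05}
reference = {
    "account_age_days": [0.15, 0.35, 0.50],
    "total_spend_90d": [0.30, 0.40, 0.30],
    "signup_channel": [0.50, 0.50],
}
production = {
    "account_age_days": [0.40, 0.30, 0.30],
    "total_spend_90d": [0.32, 0.38, 0.30],
    "signup_channel": [0.10, 0.90],
}
report = top_n_drift_report(importance, reference, production, top_n=2)
```

Here `signup_channel` shifts sharply but is never scored because it ranks third by importance. That is the trade-off of this approach: fewer alerts, at the cost of missing drift on features the model leans on less.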

Concept drift: the relationship between features and labels shifts

Concept drift is different in kind from data drift. Here, the input distribution may be perfectly stable, but the relationship between features and the target variable has changed. Your model learned P(Y|X) on historical data; the real P(Y|X) has since changed.
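One way to keep the two failure modes apart is to factor the joint distribution of features and labels and ask which factor moved. The train/prod subscripts are notation introduced here for illustration:

```latex
\begin{align*}
P(X, Y) &= P(Y \mid X)\, P(X) \\
\text{Data drift:}\quad & P_{\text{train}}(X) \neq P_{\text{prod}}(X), \quad P_{\text{train}}(Y \mid X) = P_{\text{prod}}(Y \mid X) \\
\text{Concept drift:}\quad & P_{\text{train}}(Y \mid X) \neq P_{\text{prod}}(Y \mid X), \quad \text{with } P(X) \text{ possibly unchanged}
\end{align*}
```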

Example: you built a credit default model in 2023 using features like debt-to-income ratio and employment tenure. In 2024, a regional economic contraction changes the default rate among high-DTI borrowers significantly — your model's learned relationship between DTI and default probability is now miscalibrated, even though DTI values in your inference traffic look exactly like they did in training.

Concept drift is harder to detect because it requires ground truth labels to confirm. You can't measure it directly from input distributions alone — you need to observe actual outcomes and compare model predictions against them.

The reason this distinction matters in practice: if you apply a data drift fix (retraining on a shifted input distribution) to a problem that's actually concept drift, the new model trains on inputs that look "correct" but learns the wrong P(Y|X) because the true relationship has changed. You retrain, deploy, watch the model pass holdout metrics, and then watch it degrade again within weeks — because your holdout set was sampled from a historical distribution that no longer reflects the current P(Y|X) either. The fix requires not just new data, but labeled data from the new environment that lets the model learn the updated relationship.

Detection methods:

Delayed ground truth monitoring: For models where labels arrive with a lag (churn, default, conversion), log predictions and compare to outcomes when they arrive. A widening gap between predicted and actual rates signals concept drift. This requires a prediction logging infrastructure keyed by a stable request ID that you can join against outcome tables when labels arrive — if you're not logging predictions with IDs today, concept drift monitoring is not possible until you are.
Output distribution monitoring: If your model's prediction score distribution shifts, with more predictions clustering near 0.5 instead of polarizing toward 0 or 1, the model is becoming less confident. When input distributions are stable, that often indicates concept drift even before you have labels. This works as an early warning and doesn't require labeled outcomes.
Wasserstein distance on prediction outputs: Track the Wasserstein distance (earth mover's distance) between your training-time prediction distribution and your current production prediction distribution. A growing Wasserstein distance suggests the model is behaving differently without necessarily having shifted inputs — a pattern consistent with concept drift rather than covariate shift.
Calibration drift: Track the relationship between predicted probabilities and actual outcomes in probability deciles. A model that used to be well-calibrated (predicted 0.7 → ~70% actual positive rate) becoming miscalibrated in a specific decile (predicted 0.7 → 45% actual positive rate) often reflects concept drift in that segment before aggregate AUC shows any degradation.
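The delayed ground truth and calibration checks above can be sketched together: join logged predictions to outcomes by request ID, then compare mean predicted probability with the observed positive rate in each probability decile. The dictionary-based log format and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def calibration_by_decile(predictions, outcomes):
    """Join predictions to outcomes by request ID and summarize each decile.

    predictions: {request_id: predicted probability in [0, 1]}
    outcomes:    {request_id: 0 or 1}, present only once the label has arrived
    Returns {decile: (mean_predicted, observed_positive_rate, n_labeled)}.
    """
    buckets = defaultdict(list)
    for request_id, score in predictions.items():
        if request_id not in outcomes:
            continue  # label has not arrived yet; skip until it does
        decile = min(int(score * 10), 9)  # a score of exactly 1.0 goes in the top decile
        buckets[decile].append((score, outcomes[request_id]))

    table = {}
    for decile, rows in sorted(buckets.items()):
        n = len(rows)
        mean_predicted = sum(score for score, _ in rows) / n
        observed_rate = sum(label for _, label in rows) / n
        table[decile] = (mean_predicted, observed_rate, n)
    return table


# Hypothetical logs: request "e" has no outcome yet.
logged_predictions = {"a": 0.72, "b": 0.78, "c": 0.75, "d": 0.12, "e": 0.71}
arrived_outcomes = {"a": 1, "b": 0, "c": 0, "d": 0}
table = calibration_by_decile(logged_predictions, arrived_outcomes)
```

A decile where the mean predicted probability sits well above or below the observed rate is the segment to investigate first. With real traffic you would also want a minimum count per decile before trusting the comparison.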

Model decay: gradual degradation from aging

Model decay is the umbrella term for accuracy degradation over time, typically caused by a combination of data drift and concept drift that accumulates gradually. Unlike an abrupt distribution shift, decay is slow — often 2–5% accuracy loss per quarter — which makes it the most dangerous form because it flies under the radar of ad hoc monitoring.

The canonical production ML anti-pattern: a team deploys a model, monitors it for two weeks, observes stable metrics, declares it healthy, and then doesn't look at it for six months. By month three, the model has drifted enough to meaningfully degrade. By month six, users are visibly affected. By the time the team investigates, they can't reconstruct which model version was running when, what the input distribution looked like at deployment time, or what changed in the data pipeline. They're debugging blind.

Model decay has an additional characteristic that distinguishes it from acute drift: it's often invisible in your system's error monitoring. The model server returns HTTP 200. Prediction latency is nominal. The application works. The only signal is in the business outcomes downstream — churn is slightly higher, conversion rates are slightly lower, fraud losses are slightly up — and those signals get attributed to market conditions, seasonal effects, or A/B test noise rather than model degradation. This is why monitoring model health requires explicit ML-aware instrumentation, not just application observability.

Preventing decay requires two things operating in parallel: continuous monitoring (not periodic audits) and immutable version records so you can always compare "now" against "then." If you can't reconstruct the training data hash and feature schema that produced the currently-deployed model version, your debugging options when decay surfaces are severely limited.
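A minimal sketch of such a version record, assuming a frozen dataclass that stores a SHA-256 of the training data and of the feature schema. The class, its fields, and the helper names are hypothetical, not an existing tool's API.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ModelVersionRecord:
    model_name: str
    version: str
    training_data_sha256: str
    feature_schema_sha256: str


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def schema_fingerprint(schema: dict) -> str:
    """Hash a canonical JSON form so key order cannot change the fingerprint."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical.encode("utf-8"))


def make_record(model_name, version, training_bytes, schema):
    return ModelVersionRecord(
        model_name=model_name,
        version=version,
        training_data_sha256=sha256_hex(training_bytes),
        feature_schema_sha256=schema_fingerprint(schema),
    )


# Hypothetical usage with a tiny in-memory "training file".
record = make_record(
    "churn-model",
    "2025-08-01",
    b"account_age_days,total_spend_90d\n30,12.5\n",
    {"account_age_days": "int", "total_spend_90d": "float"},
)
```

Hashing the schema separately from the data lets you tell "the data changed" apart from "the feature definitions changed" when you compare two records.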

Which type are you actually seeing?

In practice, a debugging checklist:

Run PSI on your top features. If PSI > 0.20 on any high-importance feature, you have data drift. Investigate upstream data pipelines first — feature extraction bugs, schema changes, population shifts. Before touching the model, fix the data.
Check prediction score distribution. If PSI is below 0.10 on all features but the Wasserstein distance on your prediction outputs has increased, suspect concept drift. You need labeled outcomes to confirm.
Check for abrupt vs. gradual onset. Abrupt changes (PSI jumps overnight) usually indicate a data pipeline bug — feature extraction broken, upstream table schema changed. Gradual changes usually indicate real distributional or concept shift in the world.
Compare model versions. If you can't reconstruct what version was running when a degradation started, you have a versioning problem compounding a drift problem.
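The first two checks in this list can be sketched as a small triage function, paired with a one-dimensional Wasserstein distance for prediction scores. The `output_shift_baseline` parameter (what counts as a normal output shift for your model) and the returned messages are assumptions made for illustration.

```python
from bisect import bisect_right


def wasserstein_1d(a, b):
    """Earth mover's distance between two 1-D samples: the area between their empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    points = sorted(set(a_sorted) | set(b_sorted))
    distance = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, left) / len(b_sorted)
        distance += abs(cdf_a - cdf_b) * (right - left)
    return distance


def triage(feature_psi, output_shift, output_shift_baseline,
           psi_stable=0.10, psi_critical=0.20):
    """Map drift measurements to the first two steps of the checklist."""
    drifted = sorted(name for name, value in feature_psi.items() if value > psi_critical)
    if drifted:
        return "data drift: check upstream pipelines for " + ", ".join(drifted)
    inputs_stable = all(value < psi_stable for value in feature_psi.values())
    if inputs_stable and output_shift > output_shift_baseline:
        return "suspect concept drift: confirm with labeled outcomes"
    return "no clear signal: keep observing"


# Hypothetical prediction scores at training time and in production.
training_scores = [0.05, 0.10, 0.90, 0.95]
production_scores = [0.40, 0.45, 0.55, 0.60]
shift = wasserstein_1d(training_scores, production_scores)
verdict = triage({"account_age_days": 0.04}, shift, output_shift_baseline=0.05)
```

The function deliberately returns "no clear signal" for the middle band (PSI between the two thresholds), since the checklist treats that range as something to investigate rather than a diagnosis.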

The threshold question

The right thresholds for PSI alerts depend on your model's sensitivity to distribution shift, which varies significantly by model type and deployment context. A credit scoring model serving 10,000 decisions per day has a very different tolerance for drift than a product recommendation model where accuracy degradation costs you click-through rate. Setting PSI thresholds too low generates alert fatigue (your team learns to ignore drift notifications). Setting them too high means you catch genuine signals too late.

The practical recommendation: don't set production thresholds on day one. Run your monitoring in "observe only" mode for 60–90 days and build a PSI baseline that captures your data's natural variation — seasonal fluctuations, day-of-week patterns, batch vs. real-time traffic differences. Then set your alert threshold at 1.5–2× your observed natural variation ceiling. This is how you distinguish "this feature moves by 0.08 PSI every Friday" from "this feature just moved by 0.22 PSI and that's unusual."
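As a sketch of that rule, assume the "natural variation ceiling" is simply the largest PSI seen during the observation window. A high percentile would be a more outlier-resistant choice. The function names and the sample history are hypothetical.

```python
def alert_threshold(psi_history, multiplier=1.5):
    """Alert threshold as a multiple of the largest PSI seen while observing."""
    if not psi_history:
        raise ValueError("need at least one observed PSI value")
    ceiling = max(psi_history)  # assumed definition of the natural variation ceiling
    return multiplier * ceiling


def is_unusual(psi_value, psi_history, multiplier=1.5):
    return psi_value > alert_threshold(psi_history, multiplier)


# Hypothetical observe-only history, including the regular Friday bump to 0.08.
observed_psi = [0.03, 0.05, 0.08, 0.04, 0.08, 0.02]
friday_alert = is_unusual(0.08, observed_psi)
spike_alert = is_unusual(0.22, observed_psi)
```

With this history the Friday value stays below the threshold and the 0.22 reading crosses it, which matches the distinction the paragraph draws.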

Inferpathio's starting configuration for teams without PSI history:

# Starting thresholds — tune from these based on your model's behavior
monitor:
  model: my-model
  methods:
    psi:
      threshold: 0.10      # alert, review
      critical: 0.20       # alert + trigger retraining consideration
    ks_test:
      p_value_threshold: 0.05
    wasserstein:
      enabled: true        # use for prediction output monitoring

Track your PSI history for 60–90 days before setting production thresholds. You'll see natural variation ranges for your data, and that history lets you distinguish signal from noise.

What Inferpathio tracks

Inferpathio is not a generic monitoring system. It tracks drift specifically in the context of ML model lifecycle management — which means the drift signals it surfaces are linked to the model version that was active when drift occurred, the training run that produced that version, and the retraining policies that can respond when thresholds are crossed. The distinction matters: drift as an isolated metric is less useful than drift as an event in a model's operational history.

Inferpathio is not a feature store, not a model serving layer, and not a general-purpose observability platform. It does one thing: keep production models accurate by closing the detect-retrain loop. For the monitoring primitives described in this article — PSI, KS-test, Wasserstein distance — see the drift configuration docs.
