The Production ML Monitoring Checklist: 12 Things to Watch Before Your Users Notice

Most teams monitor accuracy. If you're monitoring anything in production ML, it's probably a metric dashboard showing prediction accuracy, AUC, or F1 over time. That's a start. It's also insufficient for the way production ML actually fails.

The problem with accuracy-only monitoring is that it's a lagging indicator — and the lag varies by model type. For a fraud detection model where labels arrive within 72 hours, accuracy degradation is visible within a week of a distribution shift. For a churn prediction model where labels arrive after a 30-day billing period, by the time your accuracy metric moves, the model has been serving degraded predictions for a month. Monitoring only accuracy in that context is like reviewing accident reports to prevent accidents instead of monitoring road conditions in real time.

The signals that precede accuracy drops are detectable earlier — from input distributions, from model output patterns, from operational metrics, from infrastructure health. They require monitoring additional layers of your ML system beyond the accuracy metric. This checklist covers the twelve things your production ML monitoring should include, in approximate order of implementation priority.

Input monitoring

1. Feature distribution shift (PSI per feature)

Run Population Stability Index on every high-importance feature, comparing a rolling production window against your training baseline. PSI captures when your model is being asked to make predictions on inputs it wasn't trained on — the earliest detectable signal of potential accuracy degradation, detectable without ground truth labels.

Typical thresholds: PSI < 0.10 stable, 0.10–0.20 investigate, > 0.20 alert + consider retraining. Track PSI history over 90 days so you can distinguish one-time spikes from sustained shifts. A spike that returns to baseline within 48 hours is almost certainly transient (a batch job that briefly skewed your traffic distribution, a holiday weekend with unusual purchase patterns). A shift that persists for more than 5–7 days is worth treating as a real signal.

Implementation note: don't run PSI on every feature with equal weight. For models with 50+ features, this generates too many alerts. Prioritize by the feature importance ranking computed on your training set: monitor the top 10–15 features by mean absolute Shapley value. An alert on a low-importance feature is usually noise; an alert on a top-5 feature warrants immediate investigation.
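The PSI calculation described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes bins are cut at baseline quantiles, floors empty bins at a small epsilon to avoid log(0), and uses the hypothetical function names psi and psi_status. PSI here is the sum over bins of (actual share - expected share) * ln(actual share / expected share).

```python
import math
from bisect import bisect_right


def psi(baseline, production, n_bins=10, eps=1e-6):
    """Population Stability Index of `production` against `baseline`.

    Assumption: bin edges are baseline quantiles, so each bin holds
    roughly 1/n_bins of the baseline sample.
    """
    ref = sorted(baseline)
    # n_bins - 1 interior edges taken at baseline quantiles.
    edges = [ref[(len(ref) * i) // n_bins] for i in range(1, n_bins)]

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the log term stays finite.
        return [max(c / len(values), eps) for c in counts]

    expected = bin_shares(baseline)
    actual = bin_shares(production)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


def psi_status(value):
    """Map a PSI value onto the thresholds quoted in the checklist."""
    if value < 0.10:
        return "stable"
    if value <= 0.20:
        return "investigate"
    return "alert"
```

In practice you would call psi once per monitored feature per rolling window and store the result, so that the 90-day history mentioned above is available for telling spikes from sustained shifts.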

2. Missing value rate per feature

Track the percentage of inference requests where each feature is null or imputed. An upstream schema change or pipeline bug that suddenly makes transaction_amount null on 40% of requests won't immediately cause an accuracy alert — your imputation strategy fills in a value. But your model is now running on imputed data at a rate it was never tested for.

GET /monitors/fraud-detector/feature-health
# returns: per-feature null rates, out-of-range rates, type mismatch counts
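If you are computing null rates yourself from logged request payloads rather than reading them from a monitoring endpoint, a sketch might look like the following. The function names and the 5-point absolute increase threshold are assumptions for illustration; requests are assumed to be dictionaries in which a feature is either absent or set to None when missing.

```python
def null_rates(requests, features):
    """Fraction of requests in which each feature is absent or None."""
    if not requests:
        raise ValueError("need at least one request to compute a rate")
    total = len(requests)
    return {
        f: sum(1 for r in requests if r.get(f) is None) / total
        for f in features
    }


def null_rate_alerts(current, baseline, max_increase=0.05):
    """Features whose null rate rose by more than `max_increase` (absolute).

    `max_increase` is a hypothetical default; tune it per feature.
    """
    return sorted(
        f for f, rate in current.items()
        if rate - baseline.get(f, 0.0) > max_increase
    )
```

Comparing against a baseline rate, rather than a fixed ceiling, keeps features that are legitimately sparse from alerting constantly.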

3. Feature value range violations

For numerical features, monitor the fraction of production values that fall outside the range seen in training. A credit_score feature that suddenly has values below 300 or above 900 (outside your training range) signals either a bug or a population shift. The model will extrapolate outside its learned distribution — usually poorly.
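A range check of this kind is small enough to sketch directly. The function name is hypothetical, and the training minimum and maximum are assumed to have been recorded when the model was trained.

```python
def out_of_range_rate(values, train_min, train_max):
    """Fraction of non-null values outside [train_min, train_max]."""
    observed = [v for v in values if v is not None]
    if not observed:
        return 0.0
    outside = sum(1 for v in observed if v < train_min or v > train_max)
    return outside / len(observed)
```

For example, out_of_range_rate(credit_scores, 300, 900) on a production window gives the share of requests on which the model is extrapolating for that feature.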

4. Categorical cardinality change

For categorical features encoded at training time, track whether new category values are appearing in production that weren't seen in training. These get mapped to unknown-category encodings, and if a new category represents 20% of your inference traffic, you have a coverage gap that no amount of retraining on old data will fix.
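One way to quantify that coverage gap is to measure what share of production traffic carries a category absent from the training vocabulary. The sketch below assumes the training vocabulary was saved alongside the encoder; the function name is hypothetical.

```python
from collections import Counter


def unseen_category_share(production_values, training_vocabulary):
    """Return (share of traffic with unseen categories, Counter of those values)."""
    known = set(training_vocabulary)
    unseen = Counter(v for v in production_values if v not in known)
    if not production_values:
        return 0.0, unseen
    return sum(unseen.values()) / len(production_values), unseen
```

Returning the counter as well as the share shows which new values are driving the gap, which is usually the first question after the alert fires.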

Output monitoring

5. Prediction score distribution (Wasserstein distance)

Monitor the distribution of model output scores — probability estimates, regression values, logits — against a reference distribution from shortly after deployment. Use Wasserstein distance (earth mover's distance) to track how the output distribution is shifting.

Why this matters: prediction distribution shift often precedes accuracy degradation and is detectable without ground truth labels. If your fraud model's prediction scores start clustering near 0.5 instead of polarizing toward 0 or 1 as they did in the reference window, the model is becoming less certain, which can signal concept drift before you have outcome data to confirm it.
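For one-dimensional scores, the first Wasserstein distance is the area between the two empirical CDFs. The sketch below computes it with the standard library only; if SciPy is available, scipy.stats.wasserstein_distance computes the same quantity. The function name wasserstein_1d is an assumption for illustration.

```python
from bisect import bisect_right


def wasserstein_1d(u, v):
    """First Wasserstein distance between two empirical 1-D samples.

    Integrates |CDF_u - CDF_v| over the merged, sorted sample points.
    """
    u_sorted, v_sorted = sorted(u), sorted(v)
    points = sorted(u_sorted + v_sorted)
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDF values are constant on [left, right).
        u_cdf = bisect_right(u_sorted, left) / len(u_sorted)
        v_cdf = bisect_right(v_sorted, left) / len(v_sorted)
        distance += abs(u_cdf - v_cdf) * (right - left)
    return distance
```

Because the result is in the same units as the scores, a distance of 0.05 on probability outputs reads as "the average score moved by about five points", which makes thresholds easier to reason about than a unitless statistic.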

6. Prediction class balance (for classifiers)

If you're running a binary or multiclass classifier, track the fraction of predictions in each class over time. A churn model that suddenly predicts 80% churn (up from a historical 15%) is either responding to a real business crisis or has drifted. You want the alert to be automated and immediate so you can investigate which one.
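A class balance check can be sketched as a comparison of per-class prediction shares against a historical baseline. The function names and the 10-point absolute change threshold are assumptions; choose a threshold that reflects normal week-to-week variation for your model.

```python
from collections import Counter


def class_shares(predictions):
    """Share of predictions falling in each class."""
    counts = Counter(predictions)
    total = len(predictions)
    return {label: n / total for label, n in counts.items()}


def class_balance_alerts(current, baseline, max_abs_change=0.10):
    """Classes whose share moved by more than `max_abs_change` (absolute).

    Returns {label: signed change in share}.
    """
    alerts = {}
    for label in set(current) | set(baseline):
        change = current.get(label, 0.0) - baseline.get(label, 0.0)
        if abs(change) > max_abs_change:
            alerts[label] = change
    return alerts
```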

Model performance monitoring

7. Delayed ground truth accuracy

For models where outcome labels arrive with a lag (credit default at 30/60/90 days, churn at end of billing period, conversion at end of attribution window), systematically compare predictions to outcomes when they arrive. Track accuracy, AUC, or calibration against cohorts by prediction date. This is your primary signal for concept drift detection.

The engineering requirement: you need a prediction logging table keyed by a request ID that you can join against outcome tables when labels arrive. If you're not logging predictions with IDs, you can't do this.
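The join described above can be sketched as follows, assuming a prediction log with request_id, prediction_date, and score fields and an outcome lookup keyed by the same request ID. All field and function names here are hypothetical, and the 0.5 decision threshold is only a placeholder.

```python
from collections import defaultdict


def accuracy_by_prediction_date(prediction_log, outcomes, threshold=0.5):
    """Accuracy per prediction-date cohort, using only matured labels.

    prediction_log: iterable of dicts with request_id, prediction_date, score.
    outcomes: dict mapping request_id to the true label (0 or 1).
    Returns {prediction_date: (accuracy, n_labeled)}.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for row in prediction_log:
        label = outcomes.get(row["request_id"])
        if label is None:
            continue  # label has not arrived yet; skip rather than guess
        predicted = 1 if row["score"] >= threshold else 0
        seen[row["prediction_date"]] += 1
        hits[row["prediction_date"]] += int(predicted == label)
    return {day: (hits[day] / seen[day], seen[day]) for day in seen}
```

Returning the labeled count next to the accuracy matters: a cohort with three matured labels should not trigger the same response as one with three thousand.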

8. Calibration drift

Model calibration (whether a predicted probability of 0.7 actually corresponds to a roughly 70% actual positive rate) can degrade over time even when discrimination metrics such as AUC stay stable. Calibrated probabilities matter wherever your model outputs feed decision thresholds or downstream systems. A fraud model with an AUC of 0.93 but poor calibration may be systematically over-predicting fraud probability in a specific customer segment, leading to unnecessary friction for legitimate users even as the aggregate discrimination metric looks healthy.

Monitor using Brier score or reliability diagrams. A growing gap between predicted and actual rates in a specific probability decile often signals concept drift in that segment before it shows up in aggregate accuracy. This is especially useful for detecting drift that's localized to a subpopulation — when aggregate metrics look fine but a specific user segment is being poorly served.
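Both measures are straightforward to compute once labels are joined to predictions. The sketch below assumes binary labels and equal-width probability bins (ten bins gives the deciles mentioned above); the function names are hypothetical.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def reliability_table(probs, labels, n_bins=10):
    """Per-bin (mean predicted, observed positive rate, count).

    Bins are equal-width over [0, 1]; empty bins are omitted.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    table = {}
    for idx, rows in enumerate(bins):
        if rows:
            mean_pred = sum(p for p, _ in rows) / len(rows)
            observed = sum(y for _, y in rows) / len(rows)
            table[idx] = (mean_pred, observed, len(rows))
    return table
```

The gap between the first two entries of each tuple is the per-bin calibration error. Computing the table separately per customer segment is one way to surface the localized drift described above.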

Operational monitoring

9. Inference latency P50/P95/P99

Track prediction latency percentiles. P99 latency spikes often indicate a model size or infrastructure issue that precedes service degradation. For SLA-bound models (real-time scoring in a transactional system), P99 latency is a first-class production metric, not an afterthought.

GET /monitors/fraud-detector/latency
# returns: p50_ms, p95_ms, p99_ms, max_ms for rolling window
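If you need to compute these percentiles yourself from raw latency samples, a nearest-rank sketch is shown below. The function names are assumptions, and the output keys simply mirror the field names listed above; other percentile definitions (for example, linear interpolation) give slightly different values on small windows.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list (pct is an integer 1..100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(latencies_ms):
    """Summarize one rolling window of latency samples in milliseconds."""
    ordered = sorted(latencies_ms)
    return {
        "p50_ms": percentile(ordered, 50),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "max_ms": ordered[-1],
    }
```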

10. Inference volume and request rate

A sudden drop in inference volume can indicate an upstream bug that's silently stopping requests from reaching your model — which means your drift monitoring windows are getting thinner and less statistically reliable. A sudden spike can indicate a traffic anomaly that will skew your baseline comparisons.

Alert on deviations from the expected request rate (more than 3 standard deviations from the 7-day average) in both directions.
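The three-sigma rule above can be sketched as follows. The function name is hypothetical, and the history is assumed to hold request counts for comparable past intervals (for instance, daily or hourly counts over the last 7 days).

```python
from statistics import mean, stdev


def request_rate_anomaly(history, current, n_sigmas=3.0):
    """True if `current` deviates from the historical mean by more than
    `n_sigmas` sample standard deviations, in either direction."""
    mu = mean(history)
    sigma = stdev(history)  # requires at least two historical points
    if sigma == 0:
        return current != mu
    return abs(current - mu) > n_sigmas * sigma
```

Traffic with strong daily or weekly seasonality will inflate the standard deviation; comparing each interval against the same hour on previous days is a common refinement.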

Infrastructure and pipeline monitoring

11. Training pipeline health (for auto-retrained models)

If you have automated retraining configured, monitor whether triggered training jobs are completing successfully. The failure mode to catch: a retraining trigger fires, the job fails silently (out-of-memory, upstream data unavailable, feature schema mismatch), and your model keeps serving production traffic on the old version while your monitoring dashboard says retraining is "running."

Monitor: job success/failure rate, job duration (outliers indicate data issues), and time since last successful retrain. Alert when a triggered training job fails.
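A sketch of those checks over a list of job records follows. The record shape (finished_at, status), the status strings, and the 14-day staleness default are all assumptions for illustration; adapt them to whatever your orchestrator actually reports.

```python
from datetime import datetime, timedelta


def pipeline_health(jobs, now, max_age=timedelta(days=14)):
    """Summarize retraining job history.

    jobs: list of dicts with 'finished_at' (datetime) and
          'status' ('succeeded' or 'failed'); hypothetical schema.
    """
    ordered = sorted(jobs, key=lambda j: j["finished_at"])
    successes = [j for j in ordered if j["status"] == "succeeded"]
    last_success = successes[-1]["finished_at"] if successes else None
    return {
        "failure_rate": 1 - len(successes) / len(ordered) if ordered else 0.0,
        "latest_job_failed": bool(ordered) and ordered[-1]["status"] != "succeeded",
        "time_since_success": (now - last_success) if last_success else None,
        # Stale if there has never been a success or the last one is too old.
        "stale": last_success is None or now - last_success > max_age,
    }
```

Alerting on latest_job_failed covers the per-job failure alert; the stale flag catches the quieter case where jobs have stopped being triggered at all.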

12. Retraining queue depth and lag

If you're batching retraining triggers or have a rate limit on training job launches, monitor queue depth. A long queue means your model is drifting and you're not keeping up with remediation. This is the production ML equivalent of a deployment queue building up during a high-velocity engineering sprint.

Also track time since last retrain per model. If a model that normally retrains every 14 days hasn't been retrained in 90 days, either it isn't drifting (possible, worth checking) or your retraining trigger is misconfigured (more likely).
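Queue lag and retrain staleness can both be derived from timestamps you likely already store. In the sketch below, the function names, the input shapes, and the tolerance of twice the expected cadence are assumptions for illustration.

```python
from datetime import datetime


def queue_stats(pending_triggers, now):
    """Return (queue depth, wait of the oldest pending trigger or None).

    pending_triggers: datetimes at which not-yet-started triggers fired.
    """
    oldest_wait = max((now - t for t in pending_triggers), default=None)
    return len(pending_triggers), oldest_wait


def overdue_models(last_retrained, cadence_days, now, tolerance=2.0):
    """Models whose days since last retrain exceed tolerance x expected cadence.

    last_retrained: {model_name: datetime of last successful retrain}
    cadence_days:   {model_name: expected days between retrains}
    """
    overdue = []
    for model, retrained_at in last_retrained.items():
        age_days = (now - retrained_at).days
        if age_days > tolerance * cadence_days[model]:
            overdue.append((model, age_days))
    return sorted(overdue)
```

With a 14-day cadence and the default tolerance, a model last retrained 90 days ago is flagged, while one retrained a week ago is not.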

Implementing the checklist

Items 1–4 (input monitoring) and 9–10 (operational) can be implemented without ground truth labels — they run on inference traffic in real time and don't require any outcome data. Start there. PSI-based feature drift monitoring and latency P99 tracking are the two highest-ROI items on this list per hour of implementation time, and neither requires label infrastructure.

Items 5–6 (output monitoring) also don't require labels and are high-value leading indicators of concept drift. Implement these alongside input monitoring. Wasserstein distance on prediction outputs is particularly underused — most teams track their input distributions but not their output distributions, which means they're missing an early warning signal that doesn't require any additional data collection.

Items 7–8 (model performance with ground truth) require label collection infrastructure. If you don't have a prediction logging table that joins against outcome tables, build it before implementing these items. The monitoring is only as good as your label pipeline. One implementation note: prediction logs need a stable request ID or session ID that persists until the outcome is available, not just a timestamp. Timestamp-based joins work until you have multiple predictions in the same window; then the join becomes ambiguous and breaks.

Items 11–12 (pipeline monitoring) only apply if you have automated retraining. Implement them when you automate retraining, not before. A retraining queue metric for a manual retraining process is just noise.

We're not saying you need all twelve in place before shipping a model. We're saying that every item you're missing is a category of failure you won't detect until your users do. The checklist is a gap analysis: work through it, find what's missing, and prioritize by model risk level. A fraud detection model serving 50,000 decisions per day needs tighter coverage than an internal analytics model that influences weekly reporting.

A monitoring system that covers all twelve areas will tell you about a problem before your users do. That's the only standard worth targeting.