Retraining Triggers Explained: Metric Threshold, Drift Score, or Schedule?

Kevin Nakamura, October 30, 2025

Retraining a model costs GPU time and engineering attention, and every new deployment carries risk. Train too infrequently and the model degrades. Train too frequently and you spend real compute on runs that don't meaningfully improve production accuracy, while taking on that deployment risk every cycle.

The right trigger strategy for your model depends on how it fails — and more specifically, how quickly you can observe that failure. That's the core distinction between the three trigger types. Using the wrong one means you're either responding to a signal that doesn't yet exist (waiting for accuracy to drop when you could have caught drift two weeks earlier) or firing on a signal that doesn't map to your actual failure mode (running PSI-based drift triggers on a model where the input distribution never moves but concept drift is constant).

Trigger type 1: Performance threshold

A performance threshold trigger fires when a measured accuracy metric — F1, AUC, RMSE, precision, recall, or a custom business metric — drops below a defined floor. This is the most direct signal: the model is provably worse than your acceptable threshold, and you know it because you have outcome labels to measure against.

This trigger is ideal when ground truth labels arrive quickly. Fraud detection is the canonical example: a fraud label typically arrives within 24–72 hours of a transaction being flagged (chargeback, customer report, manual review). That means that this morning you can score the model's recent predictions against the outcomes confirmed over the last few days. If precision on confirmed fraud has dropped from 0.91 to 0.83, you fire a retrain.

# Performance threshold policy
policy:
  model: fraud-detector
  trigger:
    type: performance
    metric: precision_at_threshold_0.5
    threshold: 0.85          # retrain when precision drops below 85%
    evaluation_window: 3d    # compute over rolling 3-day labeled window
    min_labeled_samples: 500 # require at least 500 confirmed labels
  job:
    type: airflow_dag
    dag_id: retrain_fraud_v4
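
As a rough illustration of what evaluating a policy like this could involve, here is a minimal Python sketch. The function name `evaluate_precision_trigger`, the `PerformanceDecision` type, and the input shape (one pair of predicted and confirmed outcomes per labeled transaction in the evaluation window) are assumptions made for the example, not the API of any particular tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class PerformanceDecision:
    fire: bool                  # True when a retrain should be triggered
    precision: Optional[float]  # None when precision could not be computed
    reason: str


def evaluate_precision_trigger(
    labeled: Iterable[Tuple[bool, bool]],
    threshold: float = 0.85,
    min_labeled_samples: int = 500,
) -> PerformanceDecision:
    """Decide whether precision over the labeled window is below the floor.

    Each item is (predicted_fraud, confirmed_fraud) for one labeled transaction.
    """
    pairs = list(labeled)
    # Too few confirmed labels: the estimate would be too noisy to act on.
    if len(pairs) < min_labeled_samples:
        return PerformanceDecision(False, None, "insufficient_labels")

    predicted_positive = sum(1 for pred, _ in pairs if pred)
    if predicted_positive == 0:
        return PerformanceDecision(False, None, "no_positive_predictions")

    true_positive = sum(1 for pred, actual in pairs if pred and actual)
    precision = true_positive / predicted_positive
    if precision < threshold:
        return PerformanceDecision(True, precision, "below_threshold")
    return PerformanceDecision(False, precision, "ok")
```

Returning a reason alongside the boolean is a design choice for the sketch: it makes "did not fire because there were too few labels" distinguishable from "did not fire because the model is healthy."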

The limitation is also the defining characteristic: you need labeled data in production, and it needs to arrive soon enough to be actionable. A churn model where labels arrive 90 days after the billing period ends cannot use performance threshold triggers as a primary mechanism — by the time you observe performance degradation, the model has been serving degraded predictions for months. For long-label-lag models, you need a different primary trigger.

There's a second structural limitation worth being explicit about: performance threshold triggers require a prediction logging infrastructure. You need a table of production predictions keyed by a stable request ID, which you can join against outcome tables when labels arrive. If your current production serving setup doesn't log predictions with persistent IDs, performance threshold monitoring isn't possible until you add it. This is often the hidden prerequisite that teams miss — they try to implement performance-based retraining triggers and discover that their serving layer doesn't persist the predictions they need to compute the signal.
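
To make the prerequisite concrete, here is a small Python sketch of the join that prediction logging enables. It assumes both the prediction log and the outcome table can be represented as mappings keyed by the same stable request ID; the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def join_predictions_to_labels(
    prediction_log: Dict[str, bool],
    outcome_labels: Dict[str, bool],
) -> List[Tuple[str, bool, bool]]:
    """Return (request_id, predicted, actual) for requests whose label has arrived."""
    return [
        (request_id, predicted, outcome_labels[request_id])
        for request_id, predicted in prediction_log.items()
        if request_id in outcome_labels
    ]


def label_coverage(
    prediction_log: Dict[str, bool],
    outcome_labels: Dict[str, bool],
) -> float:
    """Fraction of logged predictions that currently have a matching label."""
    if not prediction_log:
        return 0.0
    matched = sum(1 for request_id in prediction_log if request_id in outcome_labels)
    return matched / len(prediction_log)
```

If the serving layer never persisted `prediction_log`, there is nothing to join against, which is exactly the gap described above.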

Trigger type 2: Drift score

A drift score trigger fires when a statistical divergence measure on input features or prediction outputs exceeds a threshold. PSI (Population Stability Index) is the most common metric for feature drift; KS-test and Jensen-Shannon divergence are alternatives. The key difference from a performance trigger: you don't need ground truth labels. You're measuring whether the inputs your model receives today look like the inputs it was trained on.
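
For reference, PSI sums, over every bin or category, the difference between the current and baseline proportions multiplied by the natural log of their ratio. The Python sketch below computes it for a categorical feature. The function name `psi` and the `eps` floor used to avoid taking the log of zero for empty categories are assumptions of this example; production implementations differ in how they bin numeric features and smooth empty bins.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def psi(
    expected: Sequence[Hashable],
    actual: Sequence[Hashable],
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between a baseline sample and a current sample.

    Both samples must be non-empty sequences of category values.
    """
    expected_counts = Counter(expected)
    actual_counts = Counter(actual)
    categories = set(expected_counts) | set(actual_counts)

    total = 0.0
    for category in categories:
        # Floor each proportion at eps so unseen categories don't produce log(0).
        e = max(expected_counts[category] / len(expected), eps)
        a = max(actual_counts[category] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total
```

With a 50/50 split across two categories in the baseline and an 80/20 split in the current window, this returns roughly 0.42.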

This is the right trigger for models with slow-arriving labels. Consider an e-commerce recommendation engine: conversion labels for a recommendation don't arrive until a customer completes or abandons a purchase funnel, which could take days. If your inventory mix shifts — say a supply chain disruption in Q1 2025 suddenly makes 30% of previously high-velocity SKUs unavailable — the PSI on your product_category feature will exceed 0.25 within days of the disruption, while conversion-based accuracy won't reflect the degradation for another week or two.

# Drift score policy
policy:
  model: recommendation-engine
  trigger:
    type: drift
    metric: psi
    feature_group: product_features
    threshold: 0.20          # significant drift — PSI > 0.20
    window: 7d               # rolling 7-day window vs. training baseline
    top_features_only: true  # monitor top-10 features by training importance
  job:
    type: sagemaker_pipeline
    pipeline_arn: arn:aws:sagemaker:us-west-2:...

The honest limitation of drift score triggers: not all drift is harmful. Input distributions can shift without degrading model accuracy, especially for models that generalize well across the shifted range. A PSI alert on a feature that your model barely uses is noise, not signal. This is why weighting drift scores by feature importance matters; raw per-feature PSI across all features generates too many false positives.
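
One way to apply that weighting is an importance-weighted average of per-feature PSI, sketched below in Python. The aggregation rule and the function name `weighted_drift_score` are assumptions for illustration; the importance values would come from whatever attribution method you already use at training time.

```python
from typing import Dict


def weighted_drift_score(
    psi_by_feature: Dict[str, float],
    importance_by_feature: Dict[str, float],
) -> float:
    """Importance-weighted average of per-feature PSI values.

    Features missing from importance_by_feature are treated as weight 0.
    """
    total_importance = sum(
        importance_by_feature.get(feature, 0.0) for feature in psi_by_feature
    )
    if total_importance <= 0:
        raise ValueError("no monitored feature has positive importance")
    return sum(
        value * importance_by_feature.get(feature, 0.0) / total_importance
        for feature, value in psi_by_feature.items()
    )
```

Under this scheme, a PSI of 0.5 on a feature carrying 5% of total importance contributes only 0.025 to the aggregate, so it would not cross a 0.20 threshold by itself.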

We're not saying drift score triggers are inferior to performance threshold triggers — they serve different models. The right framing is: drift score = early warning system (fires before accuracy visibly degrades); performance threshold = definitive signal (fires when degradation is confirmed with labels). The combination is more powerful than either alone.

Trigger type 3: Schedule

A schedule trigger fires on a cron expression, independent of any data or performance signal. This is the simplest trigger, and also the most commonly overused one as a substitute for real monitoring.

Schedule-based retraining is genuinely appropriate in specific contexts:

- Models with known seasonal periodicity where retraining aligned to the business cycle makes more sense than drift-triggered retraining (e.g., a demand forecast model that retrains at the start of each fiscal quarter when promotional planning data is refreshed)
- Regulatory or compliance requirements that mandate periodic model validation and retraining on a defined cadence, regardless of measured performance
- Cases where training runs are cheap enough that the cost of unnecessary retraining is lower than the cost of maintaining drift monitoring infrastructure

# Schedule-based policy — floor trigger, not primary
policy:
  model: demand-forecast-q
  trigger:
    type: schedule
    cron: "0 2 1 */3 *"    # first day of each quarter, 2am UTC
  cooldown: 72h             # prevent double-triggers at quarter boundaries
  job:
    type: github_actions
    workflow: retrain_demand_quarterly.yml

Where schedule triggers consistently fail: using a weekly cron as a substitute for drift monitoring. A model can degrade significantly within 48 hours of a major distribution shift. If that shift lands just after a scheduled run, the next weekly retrain is still five days away by the time the degradation is significant, so five days of degraded predictions get served before anything responds. Schedules should be a floor, a backstop that ensures retraining happens at minimum every N days, not the primary detection mechanism for drift or performance degradation.
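
The "schedule as a floor" idea can be sketched as a single decision function in Python. The name `should_retrain`, the seven-day default for the floor, and the choice to measure the cooldown from the last training run are assumptions of this example.

```python
from datetime import datetime, timedelta


def should_retrain(
    now: datetime,
    last_trained_at: datetime,
    signal_fired: bool,
    max_age: timedelta = timedelta(days=7),
    cooldown: timedelta = timedelta(hours=72),
) -> bool:
    """Fire on a monitoring signal, or on the schedule floor when no signal fired.

    signal_fired is the result of the primary drift or performance check.
    """
    age = now - last_trained_at
    # Inside the cooldown window nothing fires, which prevents back-to-back runs.
    if age < cooldown:
        return False
    # The primary signal can fire at any time; the floor fires only on staleness.
    return signal_fired or age >= max_age
```

The monitoring signal decides when retraining happens in the common case; the age check only guarantees an upper bound on how stale the model can get.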

The other failure mode of schedule-only retraining: it trains when it doesn't need to. If your model is stable and the schedule fires, you've spent GPU compute, incurred deployment risk from promoting a new version to production, and possibly introduced a regression (yes, a retrained model can be worse than the current one if the validation gates aren't tight). Scheduled retraining without validation gating is one of the most reliable ways to accidentally degrade a production model that was working fine.

Composing triggers: AND / OR logic

Real-world models often need combinations. The two most useful patterns:

"Drift AND data freshness" (AND logic): Fire only when a drift signal is present AND enough new labeled data has accumulated to train on. This prevents a common failure mode: drift triggers fire, a retrain runs on the same stale dataset (because no new data has arrived yet), and the retrained model doesn't improve because the new data distribution isn't represented in training.

# AND composition — both conditions must be true
trigger:
  all:
    - type: drift
      metric: psi
      threshold: 0.20
    - type: data_freshness
      min_new_examples: 10000    # require 10K new examples since last train

"Performance OR schedule" (OR logic): Fire when accuracy drops below threshold, but also fire on schedule as a fallback for models where labels arrive sporadically. The schedule ensures retraining happens at least every N weeks; the performance trigger fires immediately when degradation is confirmed without waiting for the schedule.

# OR composition — either condition fires
trigger:
  any:
    - type: performance
      metric: f1
      threshold: 0.88
    - type: schedule
      cron: "0 3 * * 0"    # Sunday 3am UTC as fallback floor
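
The `all` and `any` compositions shown above can be evaluated with a short recursive function. The Python sketch below assumes each leaf trigger has already been evaluated to a boolean and stored in a `signals` mapping keyed by trigger type; that mapping and the function name `evaluate_trigger` are hypothetical.

```python
from typing import Any, Dict


def evaluate_trigger(trigger: Dict[str, Any], signals: Dict[str, bool]) -> bool:
    """Evaluate a trigger tree built from `all` / `any` nodes and typed leaves."""
    if "all" in trigger:
        # AND composition: every child condition must be true.
        return all(evaluate_trigger(child, signals) for child in trigger["all"])
    if "any" in trigger:
        # OR composition: a single true child is enough.
        return any(evaluate_trigger(child, signals) for child in trigger["any"])
    # Leaf: look up the pre-computed result for this trigger type.
    return bool(signals[trigger["type"]])
```

Because the function recurses, nested policies such as "schedule OR (drift AND data freshness)" work without extra code.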

Validation gating: a retrain trigger is not a deploy trigger

A retraining trigger isn't a deployment trigger. When a policy fires and a training run completes, the new model should pass a validation gate before touching production traffic. This is where a surprising number of automated retraining systems fail silently: they train automatically but deploy without gates, and a run that regressed on the holdout set (possible when the trigger fires on a distribution shift that hasn't settled into new labeled data yet) goes directly to production.

# Validation gate on every retraining policy
validation:
  min_f1: 0.88              # absolute floor — must achieve this
  compare_to: production    # must also match current production, within tolerance
  tolerance: -0.005         # allow up to 0.005 absolute F1 regression (evaluation noise)
  promote_on_pass: staging  # promote to staging, NOT production
  notify_on_fail:
    - slack: "#ml-ops-alerts"
  cooldown_on_fail: 48h     # don't re-trigger immediately after a failed validation

The two-stage promotion pattern — trigger fires, model trains and goes to Staging, a human or automated gate promotes it to Production — gives you the benefits of automated retraining without the blast radius of fully autonomous promotion. For most teams managing production models, this is the correct default posture.
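
A validation gate of this shape reduces to two comparisons. The Python sketch below mirrors the `min_f1` and `tolerance` parameters from the policy above; the function name `validation_gate` and the returned decision strings are assumptions for the example.

```python
from typing import Tuple


def validation_gate(
    candidate_f1: float,
    production_f1: float,
    min_f1: float = 0.88,
    tolerance: float = -0.005,
) -> Tuple[str, str]:
    """Return (decision, reason) for a freshly trained candidate model."""
    # Absolute floor: the candidate must be acceptable regardless of production.
    if candidate_f1 < min_f1:
        return ("reject", "below_absolute_floor")
    # Relative check: the candidate may trail production only within tolerance.
    if candidate_f1 - production_f1 < tolerance:
        return ("reject", "regressed_vs_production")
    # Passing the gate promotes to staging only; production needs a second step.
    return ("promote_to_staging", "passed")
```

Note that the gate never returns a production decision. That second promotion step stays outside the automated retraining path.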

Matching trigger type to model type

The practical decision tree: if your ground truth labels arrive within 24–72 hours, a performance threshold trigger is your primary signal — it's the most direct and most defensible. If labels lag by more than a week, lead with drift score as your primary and add a schedule fallback. For models with known seasonal cycles, schedule-based retraining as a first-class mechanism (not a fallback) is legitimate.
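
That decision tree can be written down as a small Python function. The article leaves two points open, and the sketch fills them with labeled assumptions: label lags between 72 hours and one week are treated like the long-lag case, and a known seasonal cycle is checked first. The function name and return shape are hypothetical.

```python
from typing import Dict, Optional


def recommend_triggers(
    label_lag_hours: float,
    seasonal: bool = False,
) -> Dict[str, Optional[str]]:
    """Suggest a primary trigger and a floor from label lag and seasonality."""
    if seasonal:
        # Known business cycle: the schedule is the primary mechanism.
        return {"primary": "schedule", "floor": None}
    if label_lag_hours <= 72:
        # Fast labels: measure degradation directly.
        return {"primary": "performance", "floor": "schedule"}
    # Slow labels: lead with drift. Lags between 72 hours and a week land
    # here too, which is an assumption of this sketch.
    return {"primary": "drift", "floor": "schedule"}
```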

Most production models benefit from two triggers composed with OR logic: a fast-signal trigger (performance threshold or drift score) as the primary mechanism, and a schedule as a floor that catches degradation even when your primary signal doesn't fire. The combination costs almost nothing to configure and closes the false-negative gap of a missed drift event. The opposite failure, the false positive of a noise-triggered unnecessary retrain, is controlled by calibrated thresholds and the validation gate rather than by the OR itself.

Starting from scratch: the practical sequence

If you have no automated retraining today, the right sequence is not "pick the best trigger and implement it." It's: set up prediction logging first, then implement drift monitoring and collect 60–90 days of PSI baseline, then configure a drift-score trigger with conservative thresholds validated against that baseline, then add a validation gate, and only then consider performance threshold triggers once your label infrastructure is in place.

Teams that skip the baseline collection phase and go straight to thresholds almost universally set their PSI thresholds too low, get alert fatigue in the first two weeks, and either disable the monitoring or start ignoring alerts. The baseline is what separates a trigger that fires on signal from a trigger that fires on noise.
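
One simple way to turn a collected baseline into a threshold is to take a high percentile of the PSI values observed while the model was known to be healthy. The Python sketch below uses a nearest-rank percentile. The function name, the 95th-percentile default, and the 0.10 lower bound are assumptions to be tuned against your own baseline, not recommended values.

```python
from typing import Sequence


def calibrate_psi_threshold(
    baseline_psi: Sequence[float],
    percentile: int = 95,
    floor: float = 0.10,
) -> float:
    """Pick a PSI threshold from PSI values observed during the baseline period."""
    if not baseline_psi:
        raise ValueError("baseline_psi must contain at least one observation")
    if not 1 <= percentile <= 100:
        raise ValueError("percentile must be between 1 and 100")

    ordered = sorted(baseline_psi)
    # Nearest-rank percentile, computed with integer arithmetic (ceiling division).
    rank = (percentile * len(ordered) + 99) // 100
    # Never go below the floor, so a very quiet baseline can't yield a hair trigger.
    return max(ordered[rank - 1], floor)
```

A threshold derived this way sits above the model's normal day-to-day PSI variation by construction, which is the property a threshold picked without a baseline lacks.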

A useful mental model: your retraining trigger is a hypothesis test. The null hypothesis is "the model is still adequate." The trigger is your rejection criterion. Just as a statistical test needs a calibrated significance threshold (not just "p < 0.05 because everyone uses that"), your trigger needs a calibrated threshold based on your model's observed behavior — not a default value from documentation.

For YAML syntax, parameter references, and example policies across common model types, see the retraining policies documentation.
