CI/CD for ML Models: Adapting Software Delivery Principles to Model Pipelines

In software engineering, CI/CD is solved. You push code, a pipeline runs lint and tests, a build artifact is produced, it deploys to staging, integration tests pass, it deploys to production. The whole thing is well-understood: tooling, best practices, failure modes, rollback strategies — there's a 15-year corpus of knowledge behind every GitHub Actions workflow you've written.

ML CI/CD is not solved. The tooling exists in pieces, the best practices are still being worked out, and the seams between the stages break in ways that pure software pipelines don't. This guide maps the classical CI/CD pipeline to its ML equivalent stage-by-stage and shows you where the analogies hold, where they break, and what to do about the breaks.

Stage 1: Source control — code is only half the artifact

In software CI/CD, git is the single source of truth. Code changes trigger the pipeline. The artifact produced is deterministic given the code version — the same source code compiled with the same toolchain produces the same binary, reliably and reproducibly.

In ML, this breaks immediately: the artifact (a trained model) is not deterministic given only the training code. It depends on the code + the training data + the random seed + the framework version + the hardware. Two runs of the same training script on different data produce different models. Two runs with the same data on different GPUs may produce slightly different floating-point values due to thread scheduling and floating-point accumulation order. The same training run on PyTorch 2.0 vs. PyTorch 2.1 may produce subtly different weights because of changes to backend operations.

The implication: ML CI/CD needs two source-of-truth systems — git for code and a model registry for artifacts. The registry stores the weight file, the training data hash, the hyperparameters, and the framework environment. A "version" in ML CI/CD is a tuple: (code commit, data version, training run metadata). If your source control system doesn't capture all three components, you can't reproduce your artifacts, you can't debug regressions, and your "versioning" is actually just file naming.

The practical consequence of conflating these: a team using git as their only source of truth checks out the commit that produced their best model, runs the training script, gets a model with different weights, and has no way to know whether the difference matters. They're in an unreproducible state with no systematic way out of it. This is the condition most teams are in before they add a dedicated model registry.

# A registered model version captures all three
with ifp.track("revenue-forecast") as run:
    run.log_param("learning_rate", 0.001)
    run.log_param("data_version", "v14.3")    # explicit data version
    run.log_param("framework", "pytorch-2.1.0")
    model.fit(train_loader)
    run.register(tags=["commit:a4f9c22"])
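
To make the "version is a tuple" idea concrete without depending on any particular registry, here is a minimal standard-library sketch. The `ModelVersion` class, its field names, and the 12-character identifier are illustrative assumptions, not part of any SDK shown above.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ModelVersion:
    """Hypothetical version record: code + data + run metadata."""
    code_commit: str
    data_version: str
    seed: int
    framework: str

    def version_id(self) -> str:
        # Serialize with sorted keys so the same fields always hash the same way.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


v1 = ModelVersion("a4f9c22", "v14.3", seed=42, framework="pytorch-2.1.0")
v2 = ModelVersion("a4f9c22", "v14.4", seed=42, framework="pytorch-2.1.0")

# Same commit, different data version: these are different model versions.
print(v1.version_id() != v2.version_id())
```

The point of deriving the identifier from all fields is that a git commit alone can never collide two runs that used different data.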

Stage 2: CI — testing ML code is not the same as testing ML models

Software CI tests behavior: given input X, function F returns Y. The tests are fast, deterministic, and run on every commit.

ML "CI" has two distinct phases that most teams conflate:

Code CI: Unit tests, linting, type checking, integration tests for data preprocessing functions. This is regular software CI and should run on every commit in under 5 minutes. Test your feature extraction functions, your data validation logic, your model loading code. Do not run a full training job in code CI — it's too slow and too resource-intensive.

Model CI: A full training run on a representative data sample, evaluation against held-out data, metric threshold gate, artifact registration. This is expensive (minutes to hours) and should run on a schedule or when a retraining trigger fires, not on every code commit.

# Code CI — fast, on every commit
# .github/workflows/code-ci.yml
- name: Run unit tests
  run: pytest tests/unit/

- name: Lint and type check
  run: flake8 src/ && mypy src/

- name: Test data pipeline
  run: pytest tests/integration/test_data_pipeline.py --data-sample small

The mistake teams make: running the full training job in code CI and calling it "model testing." This gives you slow CI, with 45-minute wall times that nobody waits for, and no meaningful model quality assurance in return. A developer committing a one-line change to a data preprocessing utility now triggers a full GPU training run. Engineers start skipping CI or pushing directly to avoid the wait. The pipeline that was supposed to catch regressions becomes the obstacle that makes regressions more likely.

Separate the two concerns. Code CI is fast, cheap, and runs on CPU. Model CI is slow, expensive, and runs on GPU. They trigger on different events and serve different purposes. Conflating them breaks both.
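
The separation can be expressed as a small routing rule. This is a sketch only: the `Event` type, the event kinds, and the pipeline names `code-ci` and `model-ci` are hypothetical labels chosen for illustration, not the configuration of any specific CI system.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical pipeline event; the kinds below are illustrative."""
    kind: str                      # "push", "schedule", or "drift_alert"
    changed_paths: list = field(default_factory=list)


def pipelines_for(event: Event) -> list:
    """Return which pipelines an event should start."""
    if event.kind == "push":
        # Every commit gets fast CPU-only checks, never a training run.
        return ["code-ci"]
    if event.kind in ("schedule", "drift_alert"):
        # Expensive GPU training and evaluation only on explicit triggers.
        return ["model-ci"]
    return []


print(pipelines_for(Event("push", ["src/features.py"])))
print(pipelines_for(Event("drift_alert")))
```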

Stage 3: Build — the training run as build step

In software CI/CD, the build step compiles source code into a binary artifact. The artifact is immutable: once built, its content doesn't change.

In ML, the training run is the build step. It takes code + data + hyperparameters and produces a model artifact. Like a compiled binary, the artifact should be immutable after registration — you don't modify the weights of a registered model, you train a new version.

Key principle: the artifact should be identifiable by content hash, not by filename or timestamp. With content-addressed storage, two runs that produce byte-identical artifacts are stored once instead of being duplicated, and a given hash can never silently point at different weights.
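
A minimal sketch of content addressing using only the standard library is shown here. The `store_artifact` helper and the flat directory layout are assumptions for illustration; a real registry would typically sit on object storage.

```python
import hashlib
import shutil
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_artifact(src: Path, store_dir: Path) -> Path:
    """Copy src into store_dir under its content hash; skip if already present."""
    store_dir.mkdir(parents=True, exist_ok=True)
    dest = store_dir / file_sha256(src)
    if not dest.exists():
        shutil.copyfile(src, dest)
    return dest
```

Because the destination name is derived from the bytes, storing the same weights twice is a no-op, and nothing can be overwritten under an existing name.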

Practical requirements for the build step:

  • Reproducibility: log the exact random seed, framework version, and hardware spec. Full reproducibility across different hardware is often not achievable for GPU-trained models due to floating-point non-determinism — but you can get close enough to reproduce within a tolerance that matters for debugging.
  • Data provenance: hash the training dataset and include it in the run record. If you're training on a database view or a partitioned data lake, the hash should capture the exact query or partition spec used — not just a date range, which is ambiguous.
  • Environment capture: record Python environment (requirements.txt or conda env hash). The framework version matters more than most teams realize — behavior differences between PyTorch 2.0 and 2.1 can produce measurable metric differences on some architectures.
  • Artifact registration: don't just save to S3 — register in the model registry with full metadata. A file in S3 without metadata is an orphan: you can't tell what produced it, what it was trained on, or whether it was ever validated.
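
The four requirements above could be gathered into one run record along these lines. This is a sketch under stated assumptions: the function names, the record keys, and the example partition spec are all hypothetical, and the spec hash identifies the query or partition definition, not the underlying rows.

```python
import hashlib
import json
import platform
import sys


def spec_hash(spec: dict) -> str:
    """Hash the exact query or partition spec, not just a date range."""
    canonical = json.dumps(spec, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def build_run_record(seed: int, data_spec: dict, requirements: str,
                     framework: str, artifact_sha256: str) -> dict:
    """Assemble the metadata registered alongside the model artifact."""
    return {
        "seed": seed,
        "data_spec": data_spec,
        "data_spec_hash": spec_hash(data_spec),
        "env_hash": hashlib.sha256(requirements.encode("utf-8")).hexdigest(),
        "framework": framework,
        "python": sys.version.split()[0],
        "machine": platform.machine(),
        "artifact_sha256": artifact_sha256,
    }


record = build_run_record(
    seed=42,
    data_spec={"table": "sales", "partitions": ["2024-01", "2024-02"]},
    requirements="torch==2.1.0\nnumpy==1.26.0\n",
    framework="pytorch-2.1.0",
    artifact_sha256="0" * 64,  # placeholder digest for illustration
)
print(json.dumps(record, indent=2))
```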

Stage 4: Testing — model evaluation as quality gate

Software CI/CD uses automated tests as a gate: if tests fail, the pipeline stops. The gate is binary: pass or fail.

ML has the same concept but more nuance. Model evaluation gates are metric thresholds: the new model must achieve a minimum F1, AUC, RMSE, or custom business metric on a holdout test set. But there's a second gate that software CI/CD doesn't have: the new model should also be compared against the current production model. A model that passes the absolute threshold but regresses relative to production should not deploy.

# Validation gate in Inferpathio retraining policy
validation:
  min_f1: 0.88            # absolute threshold
  compare_to: production  # second gate: compare against current production version
  tolerance: -0.005       # allow up to 0.005 absolute F1 regression (noise)
  promote_on_pass: staging
  notify_on_fail:
    - slack: "#ml-ops-alerts"

Shadow deployment as an integration test: for high-stakes models, move the validated candidate to Shadow stage before Production. Shadow mode runs the model against live traffic and logs predictions, but doesn't surface them to users. Compare shadow predictions against production predictions for distributional alignment and latency profiling. This is the ML equivalent of a staging environment integration test — you're observing behavior against real production-distribution inputs before committing to serving users.

The shadow stage catches a class of regression that holdout test sets miss: prediction behavior on low-frequency edge cases in your production traffic that aren't well-represented in your test set. A model that passes holdout metrics but makes dramatically different predictions on 5% of production requests (which happened to not appear in your holdout set) is a regression in practice even if it isn't one on paper. Shadow mode surfaces this before users see it.
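
One simple way to quantify this kind of divergence from shadow logs is a disagreement rate. The sketch below assumes numeric predictions logged per request for both models; the function name and the threshold value are hypothetical choices for illustration.

```python
def shadow_disagreement_rate(production_preds, shadow_preds,
                             abs_threshold: float) -> float:
    """Fraction of requests where the two models differ by more than abs_threshold."""
    if len(production_preds) != len(shadow_preds):
        raise ValueError("prediction logs must cover the same requests")
    if not production_preds:
        return 0.0
    disagreements = sum(
        1 for p, s in zip(production_preds, shadow_preds)
        if abs(p - s) > abs_threshold
    )
    return disagreements / len(production_preds)


production = [100.0, 102.0, 98.0, 101.0]
shadow = [100.5, 101.0, 150.0, 100.0]
rate = shadow_disagreement_rate(production, shadow, abs_threshold=5.0)
print(f"{rate:.0%} of requests disagree")
```

A rate above whatever the team considers acceptable is a reason to inspect the disagreeing requests before promotion, even when holdout metrics pass.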

Stage 5: Deployment — promotion, not redeployment

Software CI/CD deploys a new binary to infrastructure. The deployment artifact changes; the infrastructure stays the same.

In ML, deployment is often cleaner: you promote a version in the model registry from Staging to Production. Your serving infrastructure polls the registry and loads the new version. The infrastructure doesn't change; only the registry pointer does. This is sometimes called "hot swap" deployment, and it is the pattern that tools such as MLflow and Seldon are commonly used to implement for production promotion.

The deployment equivalent of a rollback is re-promoting the previous version. Because model artifacts are immutable and permanently retained (Archived, not deleted), rollback is always available and always safe. There's no "the old binary was overwritten" problem.
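
Promotion and rollback as pointer moves can be sketched with a toy in-memory registry. The `ModelRegistry` class here is a hypothetical illustration, not the API of MLflow or any other product.

```python
class ModelRegistry:
    """In-memory toy registry: artifacts are immutable, only the pointer moves."""

    def __init__(self):
        self._artifacts = {}      # version -> artifact hash, never overwritten
        self._history = []        # versions promoted to production, oldest first

    def register(self, version: str, artifact_sha256: str) -> None:
        if version in self._artifacts:
            raise ValueError(f"version {version} is already registered")
        self._artifacts[version] = artifact_sha256

    def promote(self, version: str) -> None:
        if version not in self._artifacts:
            raise KeyError(f"unknown version {version}")
        self._history.append(version)

    def production(self):
        return self._history[-1] if self._history else None

    def rollback(self) -> str:
        """Re-promote the version that was serving before the current one."""
        if len(self._history) < 2:
            raise RuntimeError("no previous production version to roll back to")
        previous = self._history[-2]
        self.promote(previous)
        return previous
```

Rollback never touches the stored artifacts; it appends one more promotion to the history, which also leaves an audit trail of what served when.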

Stage 6: Monitoring — the stage software CI/CD doesn't have

Here's where the analogy breaks completely. Software deployments don't degrade over time by themselves. A deployed binary runs the same code next month as it does this month, assuming the infrastructure doesn't change. You don't need to "monitor" whether your payment processing library is still computing arithmetic correctly. The only failure modes are infrastructure failures and bugs introduced by new code — both of which are detectable with standard application monitoring.

ML models degrade on their own, with no code change and no infrastructure failure. The world changes, input distributions shift, and a model that was accurate at deployment time becomes inaccurate over time. This is the fundamental difference between software and ML production operations, and it's why a monitoring stage has no counterpart inside the software CI/CD pipeline but is load-bearing in ML CI/CD.

Post-deployment monitoring is not optional in ML CI/CD; it's the stage that drives the next training run. The monitoring system watches for drift signals, and when thresholds are crossed, it feeds back into Stage 1: source control (data) and Stage 3: build (trigger a new training run). Without this feedback loop, you have a deployed model with a cron job on top, not a CI/CD pipeline.

# The complete loop
# Stage 6 monitoring → Stage 3 build → Stage 4 test → Stage 5 deploy
monitor:
  model: revenue-forecast
  methods:
    psi:
      threshold: 0.10
  on_trigger:
    action: retrain
    pipeline: revenue-forecast-training-dag

This is what makes ML pipelines fundamentally different from software pipelines: they're loops, not lines. A software CI/CD pipeline terminates after deployment. An ML CI/CD pipeline has a feedback cycle that can fire days or weeks after the last deployment.

What good ML CI/CD actually looks like

A mature ML CI/CD setup has these characteristics:

  • Two source-of-truth systems: git for code, a model registry for artifacts. Neither is sufficient alone.
  • Fast code CI on every commit; expensive model CI only when a trigger fires.
  • Two-gate validation: absolute metric threshold + comparison to current production.
  • Promotion-based deployment, not redeployment. Rollback is re-promotion, always available.
  • Continuous drift monitoring as the input to the next pipeline run.
  • Immutable artifact storage with content addressing — no overwriting, no "final" filenames.

Most ML teams have the majority of this in place. The part that's consistently missing is the loop closure: monitoring is set up, but it doesn't automatically feed back into training. Engineers are still manually deciding when to retrain based on Slack notifications from the monitoring system. That means a drift alert at 11pm on a Friday doesn't get acted on until Monday morning, which is a full weekend of degraded production predictions for a model that could have been retrained overnight if the trigger were automated.

Closing that loop — from drift detection to training trigger to validation gate to promotion — is what separates an ML pipeline from a deployed model with a cron job on top. It's also the harder engineering problem. Most of the components exist as products and open-source tools (Airflow, Kubeflow Pipelines, SageMaker Pipelines, GitHub Actions for smaller teams). The gap is usually not tooling but integration: the monitoring system doesn't speak to the training orchestration system, or the validation gate isn't wired to the registry promotion API. Building those connections is the work that turns a collection of ML infrastructure pieces into an actual CI/CD loop.
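
The integration work amounts to wiring four stages together so that no human sits between them. The sketch below passes each stage in as a plain callable to keep it tool-agnostic; the function name, the return labels, and the callables are hypothetical stand-ins for whatever orchestration and registry APIs a team actually uses.

```python
from typing import Callable


def close_the_loop(drift_score: float, drift_threshold: float,
                   train: Callable[[], str],
                   validate: Callable[[str], bool],
                   promote: Callable[[str], None]) -> str:
    """Run one pass of monitor -> train -> validate -> promote."""
    if drift_score <= drift_threshold:
        return "no_drift"
    candidate = train()                 # Stage 3: build a new version
    if not validate(candidate):         # Stage 4: quality gate
        return "rejected"
    promote(candidate)                  # Stage 5: move the registry pointer
    return "promoted"


promoted = []
outcome = close_the_loop(
    drift_score=0.25,
    drift_threshold=0.10,
    train=lambda: "v2",
    validate=lambda version: True,
    promote=promoted.append,
)
print(outcome, promoted)
```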