
Model Versioning Best Practices: What Git Taught Us and What It Didn't

By Kevin Nakamura | September 18, 2025 | 7 min read


When machine learning teams start taking model versioning seriously, the first instinct is almost always the same: "we'll just use git." It's the right instinct. Git is excellent version control. The problem is that ML models aren't code.

Git stores text diffs. Model weights are binary blobs — a 500MB PyTorch checkpoint doesn't diff meaningfully. More critically, the meaning of a model version isn't captured by the weight file alone. It's the weight file + the training data version + the feature schema + the hyperparameters + the training environment. A model trained on data-v3 with learning_rate=1e-4 and a model trained on data-v3 with learning_rate=5e-4 that happen to produce identical validation AUC are still different versions — they may behave differently at distribution boundaries.

Git taught us how to think about versioning. It didn't give us the right primitives for models. Here's what good model versioning actually looks like — and where teams consistently get it wrong even after they've started using a dedicated registry.

Why Git alone fails for models

The fundamental mismatch is this: Git's diff engine is built for text. It compresses and stores changes efficiently because most code changes are small relative to the total codebase. A model weight file is a binary blob: 100MB to 10GB of floating-point values. Git can't produce a meaningful diff for it, and because retraining changes nearly every byte, each new checkpoint is effectively stored as a full copy. Git LFS addresses the storage problem by keeping large files in remote storage and only a pointer in the repository, but LFS doesn't solve the metadata coupling problem: the model's meaning isn't captured by the weight file alone.

More critically, a Git-based workflow assumes that an artifact is fully determined by its source files: commit A produces binary B, always. Models violate this assumption. The same training script run twice on the same data can produce slightly different floating-point weights because of GPU non-determinism, thread scheduling variance, or differences in floating-point accumulation order. A git commit hash therefore doesn't uniquely identify a model; it only identifies the code that was run. Two runs of the same commit on the same data can yield models that behave measurably differently at distribution boundaries.
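The accumulation-order effect is easy to see even without a GPU. The snippet below is a minimal, CPU-only illustration in plain Python, not a training run: it adds the same three numbers in two different groupings and gets two different floats. A parallel reduction that reorders its additions from run to run is exposed to the same effect.

```python
# Floating-point addition is not associative, so the order in which
# partial sums are accumulated can change the last bits of the result.
a, b, c = 0.1, 0.2, 0.3

left_to_right = (a + b) + c
right_to_left = a + (b + c)

print(left_to_right == right_to_left)      # False
print(abs(left_to_right - right_to_left))  # tiny, but nonzero
```

In a real network, millions of such differences feed into each other over many training steps, which is why two runs can end with weights that are close but not identical.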

These aren't theoretical limitations. They bite teams in practice when they try to reproduce a model from git history alone and can't, or when they try to figure out why two models with the same code commit behave differently in production.

The four artifacts that define a model version

A production model version is meaningfully defined by four things:

The weight file. The serialized parameters produced by training. Could be a .pkl, a model.pt, an ONNX export, or a SavedModel directory. This is the obvious part.
The training data hash. A cryptographic hash (SHA-256) of the training dataset, or at minimum a stable pointer to the dataset version used. Without this, you can't reproduce the model or understand why two versions behave differently.
The feature schema. The exact list of input features, their types, and their expected ranges. Deployed models break when the serving layer presents features in a different order, or when a feature that existed during training is absent at inference time.
The training run record. Hyperparameters, framework versions, compute environment, random seed, evaluation metrics. The metadata that explains what produced the artifact.

A versioning system that captures all four can answer the question: "what was running in production on November 14th, and exactly how was it trained?" A system that only stores weight files cannot.
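As a rough sketch of how the four pieces can be bound together, the record below stores a hash or description of each and derives a single version ID from all of them. The class name, field names, and example values are assumptions made for illustration; they are not Inferpathio's schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_bytes(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ModelVersion:
    weights_sha256: str   # hash of the serialized weight file
    data_sha256: str      # hash of (or stable pointer to) the training data
    feature_schema: list  # ordered [name, dtype] pairs
    run_record: dict      # hyperparameters, seed, framework versions, metrics

    @property
    def version_id(self) -> str:
        # Canonical JSON (sorted keys) so equal records always hash equally.
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return sha256_bytes(payload)


# Placeholder bytes stand in for a real checkpoint and dataset.
base = dict(
    weights_sha256=sha256_bytes(b"fake weight bytes"),
    data_sha256=sha256_bytes(b"data-v3"),
    feature_schema=[["amount", "float"], ["account_age_days", "int"]],
)
# Identical except for the learning rate (in practice the weights would differ too).
v_a = ModelVersion(**base, run_record={"learning_rate": 1e-4, "seed": 0})
v_b = ModelVersion(**base, run_record={"learning_rate": 5e-4, "seed": 0})

print(v_a.version_id != v_b.version_id)  # True
```

Because the ID covers the training record as well as the weights, two versions are distinguishable even when a single summary metric makes them look the same.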

Content-addressed storage: why filenames are the wrong unit

If you version models by filename — fraud_model_v1.pkl, fraud_model_v2_retrained.pkl, fraud_model_v2_retrained_final.pkl — you've already lost. Filenames are mutable labels. They don't guarantee content identity. Two files named v3 can contain different weights.

Content-addressed storage (CAS) flips the model: every artifact is identified by a hash of its content, computed at registration time. If you register the same model twice, you get the same hash and Inferpathio stores it once, so deduplication is automatic. More importantly, when you reference a version, you're making an immutable commitment: the content at that hash will never change.

# Every registered version gets a content hash
# Registration output:
# fraud-detector@sha256:a4f9c22e...
# Stage: staging | Run: run-8832
# Artifact: s3://models/fraud-detector/a4f9c22e.pkl (stored once)

# Later — even after 50 retrains — you can still access this exact artifact:
run = ifp.get_run("run-8832")
model = run.load_artifact()   # fetches by hash, not filename
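The idea behind content addressing fits in a few lines. The following is a toy in-memory store, not Inferpathio's implementation; `ContentStore`, `put`, and `get` are hypothetical names. It shows the two properties described above: registering identical bytes twice yields one stored object, and a hash always resolves to the same content.

```python
import hashlib


def content_hash(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


class ContentStore:
    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = content_hash(data)
        # setdefault keeps the existing entry, so duplicates are stored once.
        self._blobs.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        data = self._blobs[digest]
        # Re-hash on read to detect corruption of the stored bytes.
        if content_hash(data) != digest:
            raise ValueError(f"stored content does not match {digest}")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


store = ContentStore()
first = store.put(b"serialized model weights")
second = store.put(b"serialized model weights")  # same bytes, registered again
other = store.put(b"retrained model weights")

print(first == second)  # True
print(len(store))       # 2
```

Human-readable names can then be kept as a separate, mutable mapping from label to hash, without ever changing what a hash points to.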

Stages: the production promotion gate

A model doesn't go directly from training to production. It should pass through a set of states that enforce validation discipline:

Staging: freshly registered candidate. Not yet evaluated on holdout data.
Shadow: running alongside production, receiving production traffic, but predictions are not surfaced to users. Used to compare candidate vs. current production behavior on real distribution before committing.
Production: serving live users. Only one version per model should be in Production stage at a time.
Archived: retired from service, but artifact and metadata permanently retained.

The key rule: promotion from Staging to Production should require a passing validation check against a held-out test set, such as a minimum F1 or AUC, or a maximum RMSE. Automating this gate prevents the common failure of deploying a retrained model that completed training but regressed on the evaluation metric.

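# f1_score is assumed to be scikit-learn's; `ifp`, `model`, and the
# train/validation splits are assumed to be defined earlier.
from sklearn.metrics import f1_score

with ifp.track("fraud-detector") as run:
    model.fit(X_train, y_train)
    f1 = f1_score(y_val, model.predict(X_val))
    run.log_metric("f1", f1)
    run.register(
        stage="staging",
        tags=["q1-retrain"]
    )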

# Promotion gate — enforced in retraining policy:
# validation:
#   min_f1: 0.90
#   promote_on_pass: production   # auto-promote if gate passes
#   notify_on_fail: slack:#ml-alerts
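The check implied by a policy like the one above is simple. Here is a minimal sketch in plain Python, where `passes_gate` and the `min_`/`max_` key prefixes are assumptions for illustration rather than the registry's actual policy format. Higher-is-better metrics such as F1 and AUC get a floor; lower-is-better metrics such as RMSE get a ceiling.

```python
def passes_gate(metrics: dict[str, float], policy: dict[str, float]) -> bool:
    """Check metrics against thresholds named 'min_<metric>' or 'max_<metric>'."""
    for key, threshold in policy.items():
        bound, _, name = key.partition("_")
        if bound not in ("min", "max"):
            raise ValueError(f"unsupported policy key: {key}")
        if name not in metrics:
            return False  # a missing metric fails the gate instead of passing silently
        value = metrics[name]
        if bound == "min" and value < threshold:
            return False
        if bound == "max" and value > threshold:
            return False
    return True


policy = {"min_f1": 0.90}
print(passes_gate({"f1": 0.93}, policy))  # True
print(passes_gate({"f1": 0.87}, policy))  # False
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a run that never logged the metric should not be promoted by default.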

Branching: handling parallel experiments

Production model registries often need to track parallel experiments — you're testing a transformer architecture alongside your existing gradient boosting model, both on the same production task. You need version lineage that reflects the branching structure, not just a flat list of versions.

MLflow models this as "model name + version number" (flat list). A DAG-based approach captures it better: model versions are nodes in a directed acyclic graph, with edges representing "derived from" relationships. A retrained version derived from a shadow experiment is linked back to the experiment run and the production version it forked from.

Why does this matter in practice? When a retrained model starts underperforming in production and you need to roll back, a flat version list tells you "go back to v7." A lineage DAG tells you "v8 was trained on data-v12 derived from run-4432, which forked from run-3901 at the point where you added feature X — here's what changed." The difference between these two debugging experiences is the difference between hours and days of investigation time during a production incident.

Lineage also matters for regulatory compliance in some industries. If you're operating in financial services or healthcare and need to explain why a model made a specific prediction on a specific date, you need to be able to answer: which model version served that request, what data was it trained on, and who approved its promotion to production. A flat version list with no metadata association makes this audit reconstruction extremely difficult. A DAG with full provenance makes it a simple query.
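To make the lineage idea concrete, here is a small sketch that stores "derived from" edges in a dictionary and walks them back from a given version. The IDs echo the ones used in the rollback example above, but the edges themselves (including v7's parent) are invented for illustration, and the structure and function name are hypothetical rather than any specific registry's data model.

```python
# Each version maps to the versions it was derived from (edges of the DAG).
derived_from: dict[str, list[str]] = {
    "v8": ["run-4432"],
    "run-4432": ["run-3901"],
    "v7": ["run-3901"],
    "run-3901": [],
}


def ancestors(version: str, graph: dict[str, list[str]]) -> list[str]:
    """Return all ancestors of a version, nearest first (breadth-first walk)."""
    seen: list[str] = []
    queue = list(graph.get(version, []))
    while queue:
        parent = queue.pop(0)
        if parent in seen:
            continue  # a node reachable by two paths is reported once
        seen.append(parent)
        queue.extend(graph.get(parent, []))
    return seen


print(ancestors("v8", derived_from))  # ['run-4432', 'run-3901']

# Where did v7 and v8 diverge? Intersect their ancestor sets.
common = set(ancestors("v7", derived_from)) & set(ancestors("v8", derived_from))
print(common)  # {'run-3901'}
```

With training data hashes and approval records attached to each node, the same walk answers the audit question of what a given production version was built from.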

Rollback semantics: what rollback actually means

In software deployment, rollback means "deploy the previous artifact." In model registry terms, it's slightly different: rollback means "promote a previous version back to Production stage." The previous artifact was never deleted — it was just demoted to Archived. Rolling back is a promotion event for the previous version, logged with a timestamp and a reason code.

# Rollback via SDK:
registry = ifp.registry("fraud-detector")
registry.promote("run-7291", to_stage="production", reason="accuracy regression in v8")

# Equivalent via API:
PATCH /models/fraud-detector/versions/run-7291
{"stage": "production", "rollback_reason": "accuracy regression in v8"}

The current production version is automatically demoted to Archived. The event is logged in the audit trail. You can reconstruct the full production history of a model — every promotion, demotion, and rollback — from the event log.
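A toy in-memory model of these semantics, for illustration only: `Registry`, its `promote` method, and the event fields below are hypothetical stand-ins and do not reproduce the `ifp` SDK. Promoting any version to production demotes the current production version to archived and appends an event, so a rollback is just another promotion.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Registry:
    stages: dict[str, str] = field(default_factory=dict)  # run_id -> stage
    events: list[dict] = field(default_factory=list)      # append-only audit log

    def register(self, run_id: str) -> None:
        self.stages[run_id] = "staging"

    def promote(self, run_id: str, reason: str) -> None:
        if run_id not in self.stages:
            raise KeyError(f"unknown version: {run_id}")
        # Enforce a single production version: demote the current one first.
        for other, stage in self.stages.items():
            if stage == "production" and other != run_id:
                self.stages[other] = "archived"
        self.stages[run_id] = "production"
        self.events.append({
            "run_id": run_id,
            "to_stage": "production",
            "reason": reason,
            "at": datetime.now(timezone.utc).isoformat(),
        })


registry = Registry()
registry.register("run-7291")
registry.register("run-8832")
registry.promote("run-7291", reason="initial release")
registry.promote("run-8832", reason="retrain passed validation gate")
registry.promote("run-7291", reason="accuracy regression in v8")  # the rollback

print(registry.stages)
# {'run-7291': 'production', 'run-8832': 'archived'}
```

Nothing is deleted at any point, and the three entries in `events` are enough to replay which version was in production at any moment.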

One important caveat about rollback semantics: rolling back the model doesn't roll back the world. If model v8 made predictions that drove downstream business logic — auto-approved loans, sent emails, modified inventory orders — those actions aren't undone by promoting v7. Rollback only stops future harm; it doesn't reverse past decisions made by the degraded model. Teams managing high-stakes production models should design their systems with this in mind: model rollback is necessary but rarely sufficient when model degradation has already caused real-world effects.

We're not saying rollback is useless — it's essential. We're saying rollback is a damage-control mechanism, not a complete recovery strategy. Early detection (monitoring that catches drift before it causes significant harm) is worth more than fast rollback (responding quickly after harm has accumulated). Both matter; the order of priority is detection first, rollback second.

The naming convention question

Semantic versioning (1.0.0, 1.1.0, 2.0.0) doesn't map well to model versions — it implies breaking changes and API compatibility, concepts that don't translate directly to ML models. More useful conventions:

Date-stamped: fraud-detector-2025-Q4. Legible in Slack alerts. Easy to reason about temporal ordering. Doesn't convey architecture changes.
Run ID based: The internal registry uses content hashes and run IDs as primary keys. Human-readable names are labels on top.
Tag-based: Use tags to encode purpose — ["production-candidate", "transformer-arch", "data-v14"]. More expressive than version numbers alone.

What to avoid: fraud_model_final_v2_FIXED.pkl. You know what that file is. You've created it. Don't commit it to the registry.

What good versioning isn't

Model versioning is not model serving. Inferpathio is not an inference endpoint manager — it doesn't run your models or route traffic. The registry records what versions exist, their provenance, and their stage. Your serving infrastructure (SageMaker, KServe, Seldon, or a custom FastAPI endpoint) still handles the actual deployment.

The separation matters: the registry can tell you that a particular version should be in Production, but promotion doesn't automatically deploy it. Your deployment pipeline reads from the registry and handles the actual rollout. Inferpathio provides the source of truth; your infra acts on it.
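The split between "source of truth" and "acts on it" resembles a reconcile step. A minimal sketch, assuming the deployment pipeline can read both the version the registry marks as Production and the version currently being served for each model (passed in here as plain dictionaries; the function name and action strings are hypothetical):

```python
def plan_rollout(desired: dict[str, str], deployed: dict[str, str]) -> list[str]:
    """Compare registry state with serving state and list the actions needed."""
    actions = []
    for model, version in sorted(desired.items()):
        current = deployed.get(model)
        if current is None:
            actions.append(f"deploy {model}@{version}")
        elif current != version:
            actions.append(f"replace {model}@{current} with {model}@{version}")
        # If current == version, serving already matches the registry.
    return actions


desired = {"fraud-detector": "run-7291"}   # what the registry says
deployed = {"fraud-detector": "run-8832"}  # what serving is running

print(plan_rollout(desired, deployed))
# ['replace fraud-detector@run-8832 with fraud-detector@run-7291']
```

The function only plans; executing the actions stays with whatever serving stack is in use.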
