MLOps · Machine Learning · Model Deployment · AI Engineering · Feature Store

MLOps in 2026: The Stack That Gets Models to Production Without Drama

ML models that never reach production are experiments, not products. MLOps is the engineering discipline that closes the gap between a model that works in a notebook and one that runs reliably in production for months without silent degradation. This post covers the 2026 stack we use and why each piece earns its place.

Codecanis Admin

9 min read
[Figure: live retraining pipeline view — model deploys, canaries, and drift alerts in one place.]

The uncomfortable truth about ML in most organisations is that the model is the easy part. The hard parts are: reproducible experiments, consistent feature computation between training and serving, reliable deployment pipelines, and catching the slow degradation that happens when the world changes and your model doesn't.

This post describes the MLOps stack we run in production in 2026. It's opinionated — we've tried other tools, and these are the ones that survived contact with real engineering teams.

Experiment Tracking: MLflow vs Weights & Biases

If your team writes Python and already lives in Jupyter/VS Code, MLflow is the lowest-friction experiment tracker. It's open source, self-hostable, and integrates with every major framework (sklearn, PyTorch, HuggingFace, LightGBM). The autologging feature means you can get useful tracking with two lines of code:

import mlflow
from xgboost import XGBClassifier

mlflow.autolog()

# Everything below is automatically tracked: params, metrics, model artifact
with mlflow.start_run(run_name="xgb-fraud-v2"):
    model = XGBClassifier(**params)
    model.fit(X_train, y_train)
    # Logged: all XGBoost params, eval metrics (when an eval_set is passed), feature importance

Weights & Biases wins when you have a team doing deep learning at scale — its visualisation for training curves, gradient histograms, and hyperparameter sweeps is significantly better than MLflow's. The collaboration features (tagging runs, inline comments) are also better for larger teams. Pricing for the Team tier is a per-user monthly subscription.
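
For comparison, the equivalent W&B setup is also a handful of lines. A minimal sketch, assuming the same params dict as above and a held-out validation split (X_val, y_val); the project and run names are illustrative:

import wandb
from xgboost import XGBClassifier
from sklearn.metrics import f1_score

# config records the hyperparameters alongside the run
run = wandb.init(project="fraud-detection", name="xgb-fraud-v2", config=params)

model = XGBClassifier(**params)
model.fit(X_train, y_train)

# Log whichever evaluation metrics matter for the model
run.log({"val_f1": f1_score(y_val, model.predict(X_val))})
run.finish()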

Recommendation: MLflow for self-hosted simplicity and cost control; W&B for research-heavy or deep-learning-focused teams.

Feature Store: Why Feast Still Wins

Training-serving skew — where features computed at training time differ subtly from features computed at serving time — is the single most common cause of "the model worked perfectly in evaluation but underperforms in production." Feature stores solve this by ensuring both pipelines consume the same feature definitions from the same registry.

Feast is our default open-source feature store. It separates the feature registry (definitions, metadata, lineage) from the feature serving layer (Redis for online, BigQuery/Redshift/S3 for offline), which means you can point it at infrastructure you already have.

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Training: pull historical features (offline store)
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:transaction_count_7d",
               "user_stats:avg_transaction_amount_30d"],
).to_df()

# Serving: pull real-time features (online store — Redis)
features = store.get_online_features(
    features=["user_stats:transaction_count_7d",
               "user_stats:avg_transaction_amount_30d"],
    entity_rows=[{"user_id": user_id}],
).to_dict()

The same feature definitions serve both environments. Training-serving skew disappears because the computation is identical.
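
For reference, the definition side that both calls share lives in the feature repo. A sketch of what the user_stats view could look like (the parquet source, entity, and TTL here are assumptions, and constructor details vary slightly between Feast releases; the view and field names match the features above):

from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float64, Int64

user = Entity(name="user_id", join_keys=["user_id"])

# Stand-in batch source; in practice this points at your BigQuery/Redshift/S3 offline store
user_stats_source = FileSource(
    path="data/user_stats.parquet",
    timestamp_field="event_timestamp",
)

user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=30),
    schema=[
        Field(name="transaction_count_7d", dtype=Int64),
        Field(name="avg_transaction_amount_30d", dtype=Float64),
    ],
    source=user_stats_source,
)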

Model Registry and CI/CD for ML

A model registry is the single source of truth for model versions, their metrics, and their promotion state. We use MLflow's model registry (it ships with MLflow; no additional infrastructure) with a four-stage lifecycle: None → Staging → Production → Archived.

Our CI/CD pipeline for ML looks like this:

  1. Feature branch: Experiments run freely. No registry entries.
  2. PR to main: CI triggers a training run on the full dataset. Metrics are compared against the currently promoted Production model using a comparison script. If the new model doesn't beat the champion on key metrics (F1, AUC-ROC, or whatever's appropriate), the PR requires manual override to merge.
  3. Main merge: Model is registered as Staging.
  4. Manual promotion: A human (ML engineer or tech lead) reviews the evaluation report and promotes to Production. This is the human-in-the-loop checkpoint before anything touches live traffic.
# CI comparison script (simplified)
import sys

import mlflow

client = mlflow.tracking.MlflowClient()

# new_run_id comes from the CI training job; model_name and prod_version from the registry
new_run_metrics   = client.get_run(new_run_id).data.metrics
prod_model_run_id = client.get_model_version(model_name, prod_version).run_id
prod_run_metrics  = client.get_run(prod_model_run_id).data.metrics

# Small tolerance (0.005 F1) so metric noise alone doesn't block a merge
if new_run_metrics["f1_score"] < prod_run_metrics["f1_score"] - 0.005:
    print("FAIL: New model does not beat champion.")
    sys.exit(1)
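
Steps 3 and 4 map onto a couple of registry calls. A sketch, reusing the client, model_name, and new_run_id from the comparison script above ("model" is autolog's default artifact path; recent MLflow releases also offer aliases, but the stage API below matches the lifecycle described here):

# On merge to main: register the new run's model and move it to Staging
version = mlflow.register_model(model_uri=f"runs:/{new_run_id}/model", name=model_name)
client.transition_model_version_stage(
    name=model_name, version=version.version, stage="Staging"
)

# After human review of the evaluation report: promote, archiving the previous champion
client.transition_model_version_stage(
    name=model_name,
    version=version.version,
    stage="Production",
    archive_existing_versions=True,
)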

Canary Deployments for Models

Never switch 100% of traffic to a new model in one step. We use a canary deployment pattern mirrored from standard service deployments:

  • 5% of traffic to new model, 95% to current production model.
  • Hold for 1 hour. Monitor model-specific metrics: prediction distribution, null rates, latency.
  • If metrics are healthy, roll to 20%, then 50%, then 100% in stages.
  • If any metric degrades beyond threshold, automated rollback to 100% previous model.

In Kubernetes, this is a standard weighted ingress or Istio traffic split. The key addition for ML is a shadow logging system that logs both the old and new model's predictions for the same input — letting you compare them offline before committing to the canary.
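
The shadow logger itself can be a thin wrapper in the serving path. A minimal sketch with hypothetical model objects and a plain stdlib logger:

import json
import logging
import time

shadow_log = logging.getLogger("model_shadow")

def predict_with_shadow(features: dict, current_model, candidate_model) -> float:
    """Serve the current model's prediction; record both predictions for the same input."""
    served = float(current_model.predict(features))
    shadow = float(candidate_model.predict(features))
    shadow_log.info(json.dumps({
        "ts": time.time(),
        "features": features,
        "served": served,
        "shadow": shadow,
    }))
    return served

In practice the candidate call and the log write should happen off the request path (a queue or background task) so the candidate's latency and failures never touch live traffic.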

Model Drift Detection

Models degrade silently. Data distributions shift, user behaviour changes, upstream data pipelines change schema. Without monitoring, you find out about drift from a business metric (revenue down, churn up) rather than an ML metric.

We monitor two types of drift:

  • Data drift (input drift): The distribution of input features has shifted relative to the training distribution. We use the Population Stability Index (PSI) for continuous features and chi-squared test for categoricals, computed daily on a rolling sample.
  • Concept drift (output drift): The relationship between inputs and the true label has changed. Detectable when you have ground-truth labels; use windowed accuracy/F1 metrics compared to baseline.
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# reference_data: sample of the training set; current_data: last 7 days of production inputs
report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=training_sample, current_data=last_7_days_data)
report.save_html("drift_report.html")

# If more than 30% of features drifted, trigger an alert
result = report.as_dict()
drift_share = result["metrics"][0]["result"]["share_of_drifted_columns"]  # DatasetDriftMetric result
if drift_share > 0.3:
    send_alert(f"Data drift detected: {drift_share:.0%} of features shifted")

We use Evidently for drift reports, integrated into a nightly Airflow DAG. Reports are published to an S3 bucket and linked from a simple internal dashboard.
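
A skeleton of that nightly DAG, with an illustrative schedule and callable name (Airflow 2.x API):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def build_and_publish_drift_report():
    # Run the Evidently report on a rolling sample, upload the HTML to S3, alert on drift
    ...

with DAG(
    dag_id="nightly_drift_report",
    start_date=datetime(2026, 1, 1),
    schedule_interval="0 3 * * *",
    catchup=False,
):
    PythonOperator(
        task_id="drift_report",
        python_callable=build_and_publish_drift_report,
    )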

Key Takeaways

  • MLflow for self-hosted experiment tracking; W&B for deep learning teams that need richer viz.
  • Feast feature store eliminates training-serving skew — the most insidious production ML bug.
  • CI for ML should include an automated champion-challenger comparison; failing models shouldn't merge.
  • Canary-deploy all model promotions with shadow logging to compare old and new predictions.
  • Monitor both data drift (PSI) and concept drift (windowed accuracy) — ideally with Evidently in a nightly pipeline.