# The Gap Between a Working Model and a Working System
Your data scientist has built a model that performs well in the notebook. It predicts customer churn with 87% accuracy, the stakeholders are excited, and now someone asks the obvious question: "When can we put this in production?"
That question is where most AI projects stall.
The model works. The infrastructure to support it reliably does not exist yet. You need versioning, monitoring, automated retraining pipelines, serving endpoints, and rollback capability - and your team has three people, two of whom learned Python eighteen months ago.
This is the gap that MLOps addresses. Not the research side of machine learning, but the operational side: how do you take a model from a Jupyter notebook and turn it into something your organisation can actually depend on? The good news is that you do not need a dedicated platform engineering team or a PhD in distributed systems to get there. You need a clear sequence of decisions and a realistic understanding of what each tool does.
## What MLOps Actually Means in Practice
MLOps - machine learning operations - borrows from DevOps the idea that software delivery is a repeatable, automated process rather than a manual handoff. Applied to machine learning, it covers the full lifecycle: experiment tracking, model packaging, deployment, monitoring, and retraining.
The phrase gets used loosely, which causes confusion. Some vendors use it to mean an entire platform. Some teams use it to mean a CI/CD pipeline that happens to include a model. For practical purposes, think of MLOps deployment as the set of practices that let you answer three questions reliably:
- What version of the model is running in production right now?
- Is it performing as expected?
- How do you update it safely when it is not?
If you can answer those three questions with confidence, you have functional MLOps. Everything else is refinement.
## Start With Experiment Tracking Before You Touch Deployment
The most common mistake teams make is jumping straight to deployment infrastructure before they have any discipline around experiments. You cannot manage what you have not tracked.
MLflow is the most practical starting point for most teams. It is open source, runs locally or on a server you control, and integrates with scikit-learn, PyTorch, TensorFlow, and most other common frameworks with minimal code changes. You log parameters, metrics, and artefacts - including the model itself - and you get a UI that lets you compare runs.
A concrete example: a retail client we worked with had three data scientists running experiments on the same dataset independently. They were using different feature engineering approaches and could not reliably reproduce each other's results. Within a week of standing up MLflow on a single EC2 instance, every experiment was logged with its parameters, the training data version it used, and the resulting metrics. When they eventually deployed a model, they could point to exactly which experiment it came from and reproduce it from scratch.
This matters for MLOps deployment because deployment is not a one-time event. Models degrade. You will need to retrain. If you cannot reproduce your training process, you are starting from scratch every time.
## Packaging Models for Deployment
Once you have a model worth deploying, the next challenge is packaging it in a way that is environment-agnostic. The model that runs on your data scientist's MacBook needs to run identically on a Linux server in a data centre.
Docker is the standard answer here, and it is worth learning even if your team finds containers unfamiliar at first. The pattern is straightforward:
- Define your Python environment in a `requirements.txt` or `pyproject.toml`
- Write a Dockerfile that installs those dependencies and copies your model artefact
- Expose an API endpoint using FastAPI or Flask that loads the model and serves predictions
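A minimal Dockerfile following that pattern might look like this; the file names (`model.pkl`, `serve.py`) and the choice of uvicorn as the server are illustrative assumptions, not from the source:

```dockerfile
# Sketch: package a model and its serving code into one image.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY model.pkl serve.py ./
EXPOSE 8000
CMD ["uvicorn", "serve:app", "--host", "0.0.0.0", "--port", "8000"]
```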
FastAPI is worth choosing over Flask for new projects. It handles async requests better, generates automatic API documentation, and validates request and response schemas using Pydantic, which catches a surprising number of integration bugs before they reach production.
For the API itself, keep the interface simple. Accept JSON, return JSON, and version your endpoint from the start (`/v1/predict` rather than `/predict`). This costs nothing upfront and saves significant pain when you need to deploy a new model version alongside the existing one.
If your organisation is already using Kubernetes, you can deploy these containers into your existing cluster. If not, managed services like AWS ECS, Google Cloud Run, or Azure Container Apps let you run containers without managing the underlying infrastructure. Cloud Run in particular is well-suited to models with variable traffic - it scales to zero when idle and scales up quickly under load.
## Monitoring Is Not Optional
A deployed model that nobody is watching is a liability. Models degrade in production for two distinct reasons: data drift and concept drift.
Data drift means the statistical properties of your input data have changed. If you trained a model on customer transactions from 2022 and the spending patterns of 2024 customers look meaningfully different, your model is operating outside the conditions it was trained on.
Concept drift means the relationship between your inputs and the target variable has changed. A fraud detection model trained before a new fraud vector emerged will miss that fraud, even if the input data looks similar.
Detecting both requires logging your production inputs and, where possible, the eventual ground truth. For a churn model, you can log which customers the model predicted would churn and then check six weeks later whether they actually did. That gives you a feedback loop to measure real-world accuracy over time.
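That feedback loop can be as simple as joining logged predictions to observed outcomes once they arrive. A sketch with invented records:

```python
# Sketch: measure realised accuracy by joining logged predictions to
# ground truth that arrives weeks later. All records are illustrative.
predictions = [  # logged at prediction time
    {"customer_id": 1, "predicted_churn": True},
    {"customer_id": 2, "predicted_churn": False},
    {"customer_id": 3, "predicted_churn": True},
]
outcomes = {1: True, 2: False, 3: False}  # observed six weeks later

matched = [(p["predicted_churn"], outcomes[p["customer_id"]])
           for p in predictions if p["customer_id"] in outcomes]
realised_accuracy = sum(pred == actual for pred, actual in matched) / len(matched)
```

Tracking this number over time is what tells you when offline accuracy has stopped reflecting reality.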
Evidently AI is a practical open source tool for drift detection. It takes your training data as a reference distribution and compares incoming production data against it, generating reports that flag when distributions have shifted beyond a defined threshold. You can run this as a scheduled job - daily or weekly depending on your data volume - and alert your team when drift exceeds your threshold.
Set up basic monitoring before you deploy, not after. At minimum, log every prediction request with a timestamp, the input features, and the model version that served it. This data is what you will use to diagnose problems when they occur.
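A minimal version of that log, written as JSON lines using only the standard library; the field names are illustrative:

```python
# Sketch: append one JSON line per prediction request, capturing the
# timestamp, input features, and the model version that served it.
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

log_path = Path(tempfile.gettempdir()) / "predictions.jsonl"

def log_prediction(features: dict, prediction: float, model_version: str) -> dict:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "features": features,
        "prediction": prediction,
    }
    with log_path.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

record = log_prediction({"tenure_months": 12}, 0.81, "v1")
```

JSON lines in object storage is a perfectly adequate starting point; you can query it with Athena or load it into Postgres later.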
## Automate Retraining Without Overengineering It
Automated retraining pipelines are where teams tend to over-invest early. A fully automated system that detects drift, triggers retraining, evaluates the new model, and promotes it to production with no human involvement is genuinely useful - but it is also complex to build and maintain, and it is not where you should start.
A more practical progression:
1. **Manual retraining with automated evaluation.** You trigger retraining manually, but the evaluation pipeline runs automatically and produces a report comparing the new model to the current production model on a held-out test set. A human reviews the report and makes the promotion decision.
2. **Scheduled retraining with manual promotion.** Retraining runs on a schedule (weekly, monthly) using fresh data. Evaluation is automatic. Promotion is still a human decision.
3. **Fully automated with guardrails.** Retraining and promotion happen automatically, but only when the new model exceeds the current model by a defined margin on your evaluation metrics, and only after passing integration tests.
Most organisations doing practical MLOps deployment operate at level one or two. That is fine. The goal is a reliable, repeatable process - not the most automated possible process.
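The guardrail at level three reduces to a comparison function that gates promotion. A sketch; the metric name and margin are illustrative:

```python
# Sketch: promote a retrained model only if it beats the current
# production model by a defined margin on the held-out evaluation set.
def should_promote(candidate_metrics: dict, production_metrics: dict,
                   metric: str = "auc", margin: float = 0.01) -> bool:
    """Return True only when the candidate clearly wins."""
    return candidate_metrics[metric] >= production_metrics[metric] + margin

decision = should_promote({"auc": 0.89}, {"auc": 0.87})
```

The margin matters: a candidate that wins by a rounding error is not worth the risk of a production change.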
Apache Airflow and Prefect are both reasonable choices for orchestrating these pipelines. Prefect has a gentler learning curve and better local development experience. If your team is already using dbt for data transformation, Prefect integrates cleanly with it.
## Choosing Tools That Match Your Team's Capacity
The MLOps tool landscape is large and growing. Kubeflow, SageMaker, Vertex AI, Azure ML, MLflow, DVC, BentoML, Seldon, Ray Serve - the options are genuinely overwhelming, and most of them solve problems at a scale your team is unlikely to face.
A practical stack for a small team with modest infrastructure requirements:
| Component | Tool |
|---|---|
| Experiment tracking | MLflow |
| Model packaging | Docker + FastAPI |
| Container registry | AWS ECR or Docker Hub |
| Serving | Cloud Run or ECS |
| Monitoring | Evidently AI + CloudWatch or Grafana |
| Orchestration | Prefect |
This stack covers the full MLOps deployment lifecycle, runs on infrastructure most organisations already have access to, and does not require specialist knowledge to operate. You can stand up a working version of this in two to three weeks.
Resist the pull toward managed ML platforms until you have outgrown this approach. SageMaker and Vertex AI are powerful, but they introduce abstraction layers that make debugging harder and create vendor lock-in that is expensive to undo. Build on primitives first.
## What to Do Next
If you are starting from zero, the sequence matters. Do not try to build everything at once.
**Week one:** Stand up MLflow and require every experiment to be logged before it is discussed in a meeting. This is a cultural change as much as a technical one, and it needs to happen before anything else.
**Weeks two to three:** Containerise your best-performing model using Docker and FastAPI. Deploy it to a managed container service. Test it with real traffic, even if that traffic is internal.
**Week four:** Add logging to every prediction request. Store the logs somewhere queryable - S3 with Athena, or a Postgres table if your volumes are low. Set up a weekly job that runs Evidently against the last seven days of production data.
**Month two:** Build a retraining script that can be triggered manually and produces a comparison report. Run it. Evaluate the output. Decide whether the new model is better.
That sequence gets you to functional MLOps deployment without requiring your team to learn ten new tools simultaneously. Each step builds on the last, and each step delivers something useful before the next one begins.
If you want to talk through what this looks like for your specific models and infrastructure, the team at Exponential Tech works through exactly these problems with Australian organisations. The contact details are on the website.