Introduction to MLOps
Why do ML projects fail? MLOps bridges the gap between data science experimentation and reliable production systems.
Models that work in Jupyter but can't scale to real-time systems
Real-world data changes cause silent model degradation
Optimizing for accuracy instead of latency or cost
Losing track of data, dependencies, and hyperparameters
The Core Idea
Only a small fraction of a real-world ML system is ML code. The surrounding infrastructure (data pipelines, serving, monitoring, CI/CD) is far larger. MLOps applies DevOps best practices to the ML lifecycle.
Classic Problems (No MLOps)
- No version control for data, models, or dependencies
- Hardcoded paths (data.csv, model.pkl)
- No reproducibility: which sklearn version? Which random seed?
- Manual execution: not automatable
- No logging: cannot audit
- Works in Jupyter, breaks in production
With MLOps
- Modular, testable, reusable code
- Central config: flexible deployment
- Version control for data and models
- Automated CI/CD pipelines
- Logging and monitoring built-in
- Reproducible experiments and audits
Fraud Detection Failures
- Real-time decisions on high-velocity data
- Changing fraud patterns over time (concept drift)
- Trained without holiday peak-season data, leading to a high false-positive rate for travel/flight transactions
- Clients without recent history have their first transactions rejected more often
- Quality of data ingestion is critical: missing data leads to wrong decisions
Cybersecurity ML Failures
- Attack patterns and threat vectors evolve constantly
- Lack of monitoring means threats go undetected
- Models trained on old patterns miss novel attack types
- Requires continuous retraining and real-time detection pipelines
Recommendation System Failures
- Words and definitions change over time in NLP-based systems
- Lack of unit tests: when a word's meaning changes, what should the model consider?
- Automobile sales model built for a specific market could not scale globally
- Healthcare cost correlated with race, leading to biased patient care predictions
- Post-COVID behavioral shift: online vs. on-premise customer profiling
LLMOps Failures
- ~35% Runaway Costs: Token consumption scales with queries. Without semantic routing or caching, budget spikes can bankrupt a project.
- ~30% Hallucination & Jailbreaks: Models ignore system instructions or are manipulated via prompt injection.
- ~25% RAG Scale Gap: Pipeline works on 10 test docs, collapses with 100k messy enterprise PDFs.
- ~10% Data Privacy: Internal wikis with PII accidentally leaked through the LLM.
Model Development
Feature management, data validation with Great Expectations, feature engineering strategies, and experiment tracking with MLflow.
Large Data Volume
Handling big datasets efficiently without re-computing features from scratch
Feature Reusability
Different models need the same features, so avoid duplicate computation
Standardized Definitions
Features must mean the same thing across teams, models, and time
Train/Serve Consistency
Features computed at training time must be identical at serving time
Data Validation Layer
Centralized Feature Repository
Features are computed once and shared, ensuring consistency between training and serving
Core Concepts
- Expectation Suite: Collection of rules your data must satisfy
- Datasource: Connection to your data (DB, CSV, API)
- Validator: Runs expectations against data
- Checkpoint: Reusable validation workflow
Useful Expectations
- Null value checks on mandatory columns
- Type enforcement (numeric, string)
- Range and outlier checks
- Unique key validation
- Business rules (e.g. a withdrawal implies no beneficiary)
- Statistical drift from reference distribution
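To make the expectations above concrete, here is a minimal sketch of the same kinds of rules in plain Python. It deliberately does not use the Great Expectations API; the record fields (`txn_id`, `amount`, `type`, `beneficiary`), the amount range, and the `validate_rows` helper are hypothetical and chosen only for illustration.

```python
from typing import Any


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable rule violations; an empty list means the data passed."""
    errors: list[str] = []
    seen_ids: set[Any] = set()
    for i, row in enumerate(rows):
        # Null check on mandatory columns
        for col in ("txn_id", "amount", "type"):
            if row.get(col) is None:
                errors.append(f"row {i}: {col} is null")

        # Type enforcement and range check (the range is an assumed example)
        amount = row.get("amount")
        if amount is not None:
            if isinstance(amount, bool) or not isinstance(amount, (int, float)):
                errors.append(f"row {i}: amount is not numeric")
            elif not 0 < amount <= 1_000_000:
                errors.append(f"row {i}: amount out of range")

        # Unique key validation
        txn_id = row.get("txn_id")
        if txn_id is not None:
            if txn_id in seen_ids:
                errors.append(f"row {i}: duplicate txn_id {txn_id}")
            seen_ids.add(txn_id)

        # Business rule: a withdrawal must not carry a beneficiary
        if row.get("type") == "withdrawal" and row.get("beneficiary") is not None:
            errors.append(f"row {i}: withdrawal has a beneficiary")
    return errors
```

In a real pipeline, a non-empty result would typically stop the run before any features are computed.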
EntitySet: container for your data that represents related tables/entities and their structure
Relationships: define one-to-many links between parent and child entities
Primitives: operations applied to data, such as aggregations (SUM, COUNT, MEAN) and transformations
Deep Feature Synthesis: automatically stacks primitives across relationships
Key Capabilities
- Tracking: Log parameters, metrics, artifacts per run
- Model Registry: Manage model versions and lifecycle stages
- UI: Compare runs, visualize metrics over time
- Reproducibility: Capture environment, requirements, seeds
- Serving: Serve models locally or on cloud
What is the "training-serving skew" problem?
When features computed during training are calculated differently at serving time, the model receives different input distributions than it was trained on, leading to degraded predictions in production.
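One common way to reduce this skew is to route both paths through a single feature function. The sketch below illustrates that idea only; the field names (`amount`, `country`, `home_country`) and all function names are hypothetical.

```python
import math


def compute_features(raw: dict) -> list[float]:
    """Single source of truth for feature logic, shared by training and serving."""
    amount = float(raw["amount"])
    return [
        math.log1p(amount),  # log-scaled amount
        1.0 if raw["country"] != raw["home_country"] else 0.0,  # foreign-transaction flag
    ]


def build_training_matrix(records: list[dict]) -> list[list[float]]:
    # Offline path: batch over historical records
    return [compute_features(r) for r in records]


def serve_features(request: dict) -> list[float]:
    # Online path: one request at a time, using the same function
    return compute_features(request)
```

Because both paths call `compute_features`, a change to the feature definition cannot reach training without also reaching serving.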
Production & Deployment
Structuring ML code for production: modularity, version control, CI/CD integration, and reproducible pipelines.
Common Anti-Patterns
- Hard-coded paths: Cannot generalize across environments
- No version control for data/model: Can't track changes
- No separation of concerns: Hard to test, debug, or scale
- Manual execution: Not automatable
- No logging: Cannot audit
Production-Ready Design
- Modular design: Easy to test and reuse
- Central config/paths: Flexible deployment
- Separate concerns: Scalable architecture
- Parameterization: Supports experimentation
- CI/CD integration: Automatable pipelines
- Version data/model: Auditability
Model Quality Metrics
Assertions on model performance: accuracy, precision, recall. E.g., "fail deployment if accuracy drops below 90%"
Computational Metrics
Latency, throughput. E.g., "fail if the 95th percentile of scoring latency exceeds X ms"
Subpopulation Analysis
Check for model fairness: overall metrics may be good while false positives concentrate on a specific demographic segment.
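A minimal sketch of such a check computes the false positive rate separately for each segment. The function name and the use of binary 0/1 labels are assumptions made for illustration.

```python
from collections import defaultdict


def false_positive_rate_by_group(y_true, y_pred, groups):
    """FPR = FP / (FP + TN), computed separately for each group label."""
    fp = defaultdict(int)
    tn = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 0:  # only actual negatives contribute to the FPR
            if pred == 1:
                fp[group] += 1
            else:
                tn[group] += 1
    # Groups with no actual negatives are omitted (their FPR is undefined)
    return {g: fp[g] / (fp[g] + tn[g]) for g in set(fp) | set(tn)}
```

A large gap between the per-group rates is the signal to investigate, even when the overall rate looks acceptable.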
Current production model
New model versions are validated against the champion on quality metrics, computational performance, creative edge cases, and subpopulation fairness before promotion to production.
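The promotion decision can be sketched as an explicit gate. This is an illustration under stated assumptions: the 90% accuracy floor comes from the example above, while the `Evaluation` fields, the use of a 95th-percentile latency, and the latency and fairness thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    accuracy: float
    p95_latency_ms: float   # assumed latency summary statistic
    worst_group_fpr: float  # highest false positive rate across segments


def should_promote(
    champion: Evaluation,
    challenger: Evaluation,
    min_accuracy: float = 0.90,         # from the "below 90%" example
    max_p95_latency_ms: float = 100.0,  # hypothetical budget
    max_group_fpr: float = 0.10,        # hypothetical fairness bound
) -> tuple[bool, list[str]]:
    """Return (decision, reasons for rejection)."""
    reasons: list[str] = []
    if challenger.accuracy < min_accuracy:
        reasons.append("accuracy below absolute floor")
    if challenger.accuracy < champion.accuracy:
        reasons.append("accuracy below champion")
    if challenger.p95_latency_ms > max_p95_latency_ms:
        reasons.append("latency budget exceeded")
    if challenger.worst_group_fpr > max_group_fpr:
        reasons.append("subpopulation false positive rate too high")
    return (not reasons, reasons)
```

Returning the reasons alongside the decision keeps a failed CI/CD run auditable.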
Deployment Requirements
Containerization with Docker, virtual machines vs. containers, and best practices for reproducible ML deployments.
Virtual Machines
- Access hardware via a hypervisor
- Include full OS + application stack
- More resource-intensive
- Better isolation: good for testing in different OS environments
- Good for public cloud and hybrid solutions
Containers
- Share the host OS kernel
- Package executable + dependencies only
- Lightweight and portable
- Fast to start and deploy
- Good for microservices, web services, CI/CD
Key Rules
- Use official Docker images from hub.docker.com
- Use .dockerignore to exclude unnecessary files
- Order Dockerfile layers from least to most frequently changed to leverage the build cache
- One process per container: easier debugging and orchestration
- Container lifetime = app lifetime
Docker Compose lets you define and run multi-container applications with a simple YAML file. Useful for running training, serving, and monitoring containers together.
Base image (cached) → System dependencies (cached) → Python requirements (cached) → Application code (rebuilt)
If you change app.py, Docker rebuilds only the last layer, not the requirements layer.
Model Serving
Real-time vs batch inference, Kubernetes orchestration, and deployment strategies: canary, shadow testing, and A/B testing.
REST API (FastAPI)
For on-demand single predictions. Low latency, synchronous. E.g., credit card fraud detection, which must respond in milliseconds.
Kubernetes + REST
For high-traffic, auto-scaling scenarios. Multiple model replicas with load balancing. Best for business-critical APIs.
Serverless (Cloud)
Robust, scalable with minimal infrastructure management. Cloud providers handle scaling. Good for variable traffic.
Real-Time Inference
- Individual predictions on-demand
- Focus: low latency, high availability
- Infrastructure: FastAPI, Load Balancers
- Example: Credit card fraud detection, where the card must be blocked immediately
Batch Inference
- Score many records at once, periodically
- Focus: throughput, parallel processing, big data
- Infrastructure: Spark, cloud compute clusters
- Example: Monthly churn prediction (scoring all customers overnight)
Control plane: coordinates the cluster and exposes the API server that kubectl talks to.
Nodes: worker VMs or machines that run containers.
Pods: the smallest deployable unit in Kubernetes. Kubernetes creates and destroys pods, not individual containers, and a pod can hold multiple containers.
Canary Release
Keep the champion model in production, but redirect a small percentage of traffic to the new challenger model. Monitor results before full rollout.
Champion → 10% Challenger → Monitor → Promoted
Low risk, gradual rollout. Kubernetes handles this natively.
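The traffic split itself can be sketched at the application level in a few lines. This only illustrates the idea; in a Kubernetes setup the split is normally configured in the infrastructure rather than written in application code. The `route_request` name and the 10% default are assumptions for illustration.

```python
import random


def route_request(rng: random.Random, canary_fraction: float = 0.10) -> str:
    """Send roughly `canary_fraction` of requests to the challenger."""
    return "challenger" if rng.random() < canary_fraction else "champion"


def challenger_share(n_requests: int, canary_fraction: float, seed: int = 0) -> float:
    """Simulate routing and report the observed challenger share."""
    rng = random.Random(seed)
    hits = sum(
        route_request(rng, canary_fraction) == "challenger" for _ in range(n_requests)
    )
    return hits / n_requests
```

Raising `canary_fraction` step by step while monitoring the challenger's metrics gives the gradual rollout described above.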
Shadow Testing
The challenger model receives the same production traffic as the champion but its decisions are not acted upon. Results are logged and compared.
Critical requirement: if the challenger fails, production must not experience any degradation in response time.
A/B Testing
Split traffic between champion and challenger so that each user is assigned to one model exclusively. Each model serves its assigned users.
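Exclusive assignment is often implemented by hashing a stable user identifier, so the same user always lands on the same model. The sketch below assumes string user IDs; the `assign_variant` name and the `salt` value are hypothetical.

```python
import hashlib


def assign_variant(
    user_id: str, challenger_share: float = 0.5, salt: str = "exp-001"
) -> str:
    """Deterministically map a user to 'champion' or 'challenger'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "challenger" if bucket < challenger_share else "champion"
```

Changing the salt reshuffles users for a new experiment without storing any assignment table.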
Blue/Green Deployment
Set up the new system (green) alongside the stable one (blue). When the new version is functional, switch all traffic to it instantly, with zero downtime.
Blue (stable): 100% traffic → Green (new): set up + test → Green: 100% traffic
Kubernetes handles this natively. No downtime for real-time scoring.
Monitoring & Drift Detection
Detecting when your model degrades in production through data drift, concept drift, and statistical testing.
1. Resource Level
Is the model running correctly? CPU, memory, uptime, infrastructure health.
2. Performance Level
Is the model still accurate? Monitoring degradation and triggering retraining when needed.
3. Prediction Explanation
Which features drive predictions? Log Shapley values to identify potential model issues.
Data Drift / Covariate Shift
P(X) changes but P(Y|X) stays the same. The distribution of input features shifts, e.g., seasonal temperature changes or new product categories. The model receives different inputs than it was trained on.
Concept Drift
P(Y|X) changes. The relationship between features and the target changes, e.g., fraud patterns evolve or customer behavior shifts after COVID. The model's learned associations become stale.
Label / Prediction Drift
The distribution of predicted labels shifts. E.g., the model starts predicting fraud much more often even though fraud rates are stable, a sign of covariate shift downstream.
Business Drift
Business goals or definitions change. E.g., what counts as "churn" is redefined, regulatory requirements shift, or new product lines change the target population.
Sudden drift: a new concept occurs in a short period of time (e.g., at the start of COVID-19 in March 2020, stock prices suddenly changed).
Population Stability Index (PSI)
Measures the difference between two discrete distributions. Widely used in credit risk scorecards.
| PSI Value | Interpretation | Action |
|---|---|---|
| < 0.1 | No significant change | Continue monitoring |
| 0.1 to 0.2 | Moderate change | Investigate features |
| ≥ 0.2 | Significant change | Retrain model |
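As a sketch of how the index is computed, the function below bins both samples on equal-width bins derived from the reference sample and sums (actual − expected) × ln(actual / expected) over the bins. The bin count and the small `eps` floor that avoids empty bins are assumptions; practitioners also commonly use quantile bins.

```python
import math
from bisect import bisect_right


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a reference and a current sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins
    # Interior bin edges; values outside [lo, hi] fall into the first or last bin
    edges = [lo + width * i for i in range(1, n_bins)]

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(edges, v)] += 1
        # Floor at eps so that ln() is defined for empty bins
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))
```

The result is compared against the thresholds in the table above to decide between monitoring, investigation, and retraining.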
Kullback-Leibler (KL) Divergence
Measures how distribution Q diverges from true distribution P. Based on information theory (entropy).
- Not symmetric: KL(P‖Q) ≠ KL(Q‖P)
- Measures entropy increase due to approximation
- Useful for GANs and synthetic data generation
- Problems with zero-probability events (log(0) undefined)
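A minimal sketch for discrete distributions makes the asymmetry and the zero-probability issue visible. The `eps` floor on Q is an assumed workaround: strictly, KL(P‖Q) is infinite when Q assigns zero probability to an event that P does not.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_i p_i * ln(p_i / q_i), in nats, for discrete distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # convention: a zero-probability event under P contributes 0
        total += pi * math.log(pi / max(qi, eps))  # eps guards against q_i = 0
    return total
```

For example, with P = [0.5, 0.5] and Q = [0.9, 0.1], `kl_divergence(P, Q)` is about 0.511 while `kl_divergence(Q, P)` is about 0.368.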
Jensen-Shannon (JS) Divergence
A symmetrized version of KL divergence. Also called "information radius."
- Symmetric: JS(P‖Q) = JS(Q‖P)
- Handles zero-probability bins naturally (using the convention 0 × ln(0) = 0)
- Applied independently per feature (univariate)
- Applied to binned/discretized data in practice
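The symmetrization can be sketched by comparing both distributions with their midpoint M = (P + Q) / 2. Because M is non-zero wherever P or Q is non-zero, no smoothing term is needed. The function name is an assumption; inputs are expected to be binned probabilities for a single feature.

```python
import math


def js_divergence(p, q):
    """JS(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2."""

    def kl(a, b):
        # Terms with a_i = 0 contribute 0 by convention
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

With natural logarithms the value is bounded by ln 2, which is reached when the two distributions do not overlap at all.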
Two-Sample Kolmogorov-Smirnov (KS) Test
Non-parametric test measuring the maximum distance between two empirical cumulative distribution functions.
Other tests: Anderson-Darling, Wasserstein Distance, Chi-Squared (for categorical), Fisher's Exact Test.
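The two-sample KS statistic described above can be sketched directly from its definition: evaluate both empirical CDFs at every observed value and take the largest gap. This sketch returns only the statistic, not a p-value; in practice a library routine such as `scipy.stats.ks_2samp` provides both.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute distance between two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The largest gap between two step functions occurs at an observed value
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A value of 0 means the empirical distributions coincide, and a value of 1 means the samples do not overlap.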
Model Governance
Responsible AI, GDPR compliance, model explainability, and the governance frameworks that make ML trustworthy at scale.
Process Governance
GDPR, industry regulations (pharma, finance, model risk management)
MLOps
Modular code, model versions, logging of all activities
Responsible AI
Explainability techniques for interpretability, bias testing for auditors
- Lawfulness, fairness, and transparency
- Purpose limitation
- Data minimization
- Accuracy
- Storage limitation
- Integrity and confidentiality (security)
- Accountability
- Right to explanation for automated decisions
Interpretability
Looking at the inner mechanics: understanding the model's weights, decision tree splits, and feature coefficients. Possible only for glass-box models (linear regression, decision trees).
Explainability
Explaining a black-box model's behavior in human terms by relating input attributions to model outputs. Post-hoc methods applied to any model.
SHAP (SHapley Additive exPlanations)
Game-theoretic approach using Shapley values for optimal credit allocation. Works on any black-box model, most efficient on tree ensembles. Supports both global and local explanations.
LIME (Local Interpretable Model-Agnostic Explanations)
Fits a surrogate glass-box model around a specific prediction's neighborhood. Perturbs data points to generate synthetic samples. Designed for local explanations only.
Permutation Importance
Measures feature importance by shuffling each feature and measuring the increase in prediction error. If shuffling a feature doesn't change the error, it's unimportant. Global only.
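The shuffling procedure can be sketched without any ML library. The sketch assumes a `predict` callable that scores one row, rows stored as lists, and mean squared error as the error measure; all names are hypothetical.

```python
import random


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)

    def mse(rows):
        return sum((predict(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

    baseline = mse(X)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other feature untouched
            column = [row[j] for row in X]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(shuffled) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature that the model ignores gets an importance of exactly zero, because shuffling it leaves every prediction unchanged.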
Partial Dependence Plot (PDP)
Shows the marginal effect of one or two features on the model output. Reveals if relationships are linear, monotonic, or complex. Assumes feature independence, so it can mislead when features are correlated. Global only.
Tree Surrogates
An interpretable model trained to approximate a black-box model's predictions. Easy to visualize and interpret. Supports both global and local use.
Explainable Boosting Machine (EBM)
Tree-based cyclic gradient boosting GAM from Microsoft Research. Often comparably accurate to black-box models while remaining fully interpretable. Fast at prediction time. Supports both global and local explanations.
1. Accountability
When a model makes a wrong decision, knowing what caused it is essential for troubleshooting and responsibility.
2. Trust
In high-risk domains (healthcare, finance), domain experts will challenge your model, so you need evidence it works.
3. Compliance
Critical for auditors and regulators: GDPR's right to explanation requires understanding automated decisions.
4. Performance
Understanding which features matter most guides hyperparameter tuning and feature selection.
What is the difference between SHAP and LIME?
SHAP uses game-theoretic Shapley values to compute both global and local feature attributions on any model. LIME fits a local surrogate model around one specific prediction. It is also model-agnostic, but it provides only local explanations and is less theoretically grounded.