MLOps Learning Hub

Week 1

Introduction to MLOps

Why do ML projects fail? MLOps bridges the gap between data science experimentation and reliable production systems.

Why MLOps?

~40%

Notebook-to-Production Gap
Models that work in Jupyter but can't scale to real-time systems

~30%

Data & Concept Drift
Real-world data changes cause silent model degradation

~20%

Business Misalignment
Optimizing for accuracy instead of latency or cost

~10%

Governance Deficit
Losing track of data, dependencies, and hyperparameters

MLOps = ML + DEV + OPS

The Core Idea

Only a small fraction of a real-world ML system is ML code. The surrounding infrastructure (data pipelines, serving, monitoring, CI/CD) is far larger. MLOps applies DevOps best practices to the ML lifecycle.

📊 Business Goal → 🗃️ Data Collection → 🔧 Feature Eng. → 🤖 Model Training → 🚀 Deployment → 📡 Monitoring

🔴 Classic Problems (No MLOps)

No version control for data, models, or dependencies
Hardcoded paths (data.csv, model.pkl)
No reproducibility — which sklearn version? what random seed?
Manual execution — not automatable
No logging — cannot audit
Works in Jupyter, breaks in production

🟢 With MLOps

Modular, testable, reusable code
Central config — flexible deployment
Version control for data and models
Automated CI/CD pipelines
Logging and monitoring built-in
Reproducible experiments and audits

Real-World ML Failure Cases

💳 Fraud Detection Failures

Real-time decisions on high-velocity data
Changing fraud patterns over time (concept drift)
Trained without holiday peak-season data → high false-positive rate for travel/flight transactions
Clients without recent history see more of their first transactions rejected
Quality of data ingestion is critical — missing data leads to wrong decisions

🔐 Cybersecurity ML Failures

Attack patterns and threat vectors evolve constantly
Without monitoring, threats go undetected
Models trained on old patterns miss novel attack types
Requires continuous retraining and real-time detection pipelines

🛒 Recommendation System Failures

Words and definitions change over time in NLP-based systems
Lack of unit tests: when a word's meaning changes, what should the model consider?
Automobile sales model built for specific market — couldn't scale globally
Healthcare cost correlated with race → biased patient care predictions
Post-COVID behavioral shift: online vs. on-premise customer profiling

🧠 LLMOps Failures

~35% Runaway Costs: Token consumption scales with queries. Without semantic routing or caching, budget spikes can bankrupt a project.
~30% Hallucination & Jailbreaks: Models ignore system instructions or are manipulated via prompt injection.
~25% RAG Scale Gap: Pipeline works on 10 test docs, collapses with 100k messy enterprise PDFs.
~10% Data Privacy: Internal wikis with PII accidentally leaked through the LLM.

Week 2

Model Development

Feature management, data validation with Great Expectations, feature engineering strategies, and experiment tracking with MLflow.

Feature Challenges

📦

Large Data Volume

Handling big datasets efficiently without re-computing features from scratch

♻️

Feature Reusability

Different models need the same features — avoid duplicate computation

📐

Standardized Definitions

Features must mean the same thing across teams, models, and time

🔄

Train/Serve Consistency

Features computed at training time must be identical at serving time

Feature Store Architecture

Raw DB

Batch Data

Real-time Data

↓

🔍 Great Expectations
Data Validation Layer

↓

🏪 Feature Store
Centralized Feature Repository

↓ ↓

🤖 Training

🚀 Serving

Features are computed once and shared — ensuring consistency between training and serving

Great Expectations — Data Validation

Core Concepts

Expectation Suite: Collection of rules your data must satisfy
Datasource: Connection to your data (DB, CSV, API)
Validator: Runs expectations against data
Checkpoint: Reusable validation workflow

Useful Expectations

Null value checks on mandatory columns
Type enforcement (numeric, string)
Range and outlier checks
Unique key validation
Business rules (e.g. withdrawal → no beneficiary)
Statistical drift from reference distribution
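
A few of the checks listed above can be sketched in plain Python. This sketch does not use the Great Expectations API; the column names (`txn_id`, `amount`, `type`, `beneficiary`), the amount bounds, and the `validate_rows` helper are hypothetical and chosen only for illustration.

```python
from typing import Any


def validate_rows(rows: list[dict[str, Any]]) -> list[str]:
    """Return human-readable rule violations (an empty list means valid)."""
    errors: list[str] = []
    seen_ids: set[Any] = set()
    for i, row in enumerate(rows):
        # Null check on mandatory columns
        for col in ("txn_id", "amount", "type"):
            if row.get(col) is None:
                errors.append(f"row {i}: {col} is null")
        # Type enforcement, then a range check (hypothetical bounds)
        amount = row.get("amount")
        if amount is not None:
            if isinstance(amount, bool) or not isinstance(amount, (int, float)):
                errors.append(f"row {i}: amount is not numeric")
            elif not 0 <= amount <= 1_000_000:
                errors.append(f"row {i}: amount out of range")
        # Unique key validation
        key = row.get("txn_id")
        if key is not None:
            if key in seen_ids:
                errors.append(f"row {i}: duplicate txn_id")
            seen_ids.add(key)
        # Business rule: a withdrawal must not carry a beneficiary
        if row.get("type") == "withdrawal" and row.get("beneficiary") is not None:
            errors.append(f"row {i}: withdrawal has a beneficiary")
    return errors


rows = [
    {"txn_id": 1, "amount": 50.0, "type": "transfer", "beneficiary": "B-7"},
    {"txn_id": 1, "amount": -5, "type": "withdrawal", "beneficiary": "B-9"},
]
problems = validate_rows(rows)
```

In a validation tool, each of these rules would typically be declared as a named expectation in a suite rather than hand-coded, so that the same suite can be rerun as a checkpoint.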

Feature Engineering Strategies

📐 Derivatives

Infer new information from existing data — e.g., extracting "day of week" from a date column, computing "age" from birth date.

➕ Enrichment

Add external information — e.g., is this day a public holiday? Weather data, exchange rates, geographic data.

🔄 Encoding

Present the same information differently — e.g., one-hot encoding, ordinal encoding, weekday vs. weekend binary flag.

🔗 Combination

Link features together — e.g., weighting backlog size by item complexity, multiplying two signals to capture interaction effects.
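
The four strategies can be illustrated on a single record. This is a minimal sketch using only the standard library; the `engineer` function, the `HOLIDAYS` set, and the feature names are hypothetical.

```python
from datetime import date

HOLIDAYS = {date(2024, 12, 25)}  # hypothetical external holiday calendar


def engineer(order_date: date, birth_date: date, backlog: int, complexity: float) -> dict:
    weekday = order_date.weekday()  # Monday = 0 ... Sunday = 6
    # Age in whole years: subtract one if the birthday has not occurred yet this year
    had_birthday = (order_date.month, order_date.day) >= (birth_date.month, birth_date.day)
    age = order_date.year - birth_date.year - (0 if had_birthday else 1)
    features = {
        "day_of_week": weekday,                    # derivative
        "age": age,                                # derivative
        "is_holiday": order_date in HOLIDAYS,      # enrichment
        "is_weekend": weekday >= 5,                # encoding (binary flag)
        "weighted_backlog": backlog * complexity,  # combination
    }
    # One-hot encoding of the weekday
    for d in range(7):
        features[f"dow_{d}"] = int(d == weekday)
    return features


row = engineer(date(2024, 12, 25), date(1990, 12, 26), backlog=4, complexity=1.5)
```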

⚠️ Feature Engineering Trade-offs: More features → more maintenance, higher compute cost, reduced stability, and potential privacy concerns. Use automated tools (Featuretools) but always apply critical thinking to your domain.

Featuretools — Deep Feature Synthesis

01 EntitySet

Container for your data — represents related tables/entities and their structure

02 Relationships

Define one-to-many links between parent and child entities

03 Primitives

Operations applied to data: aggregations (SUM, COUNT, MEAN) and transformations

04 DFS

Deep Feature Synthesis — automatically stacks primitives across relationships
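
To make the idea concrete, here is a hand-written sketch of what a single level of aggregation across one parent-child relationship produces. It does not call the Featuretools API; the tables, the `aggregate_children` helper, and the feature naming are assumptions for illustration. Deep Feature Synthesis automates this step and stacks it across several relationships.

```python
from collections import defaultdict
from statistics import mean

# Parent table (customers) and child table (transactions), linked by customer_id
customers = [{"customer_id": 1}, {"customer_id": 2}]
transactions = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 1, "amount": 30.0},
    {"customer_id": 2, "amount": 5.0},
]

# Aggregation primitives: each maps a list of child values to one parent-level value
PRIMITIVES = {"SUM": sum, "COUNT": len, "MEAN": mean}


def aggregate_children(parents, children, key, value):
    grouped = defaultdict(list)
    for child in children:
        grouped[child[key]].append(child[value])
    features = []
    for parent in parents:
        values = grouped.get(parent[key], [])
        row = dict(parent)
        for name, fn in PRIMITIVES.items():
            # Simplification: parents without children get None for every primitive
            row[f"{name}(transactions.{value})"] = fn(values) if values else None
        features.append(row)
    return features


feature_matrix = aggregate_children(customers, transactions, "customer_id", "amount")
```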

MLflow — Experiment Tracking

Key Capabilities

Tracking: Log parameters, metrics, artifacts per run
Model Registry: Manage model versions and lifecycle stages
UI: Compare runs, visualize metrics over time
Reproducibility: Capture environment, requirements, seeds
Serving: Serve models locally or on cloud

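import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real training set
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
params = {"max_depth": 5, "n_estimators": 100, "random_state": 42}

with mlflow.start_run():
    # Log parameters (including the seed, for reproducibility)
    mlflow.log_params(params)

    # Train and evaluate
    model = RandomForestClassifier(**params).fit(X_train, y_train)
    preds = model.predict(X_test)

    # Log metrics
    mlflow.log_metric("accuracy", accuracy_score(y_test, preds))
    mlflow.log_metric("f1_score", f1_score(y_test, preds))

    # Log model
    mlflow.sklearn.log_model(model, "model")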

Flashcards — Feature Management

What is the "training-serving skew" problem?

When features computed during training are calculated differently at serving time, causing the model to receive different input distributions than it was trained on — leading to degraded predictions in production.

Week 3

Production & Deployment

Structuring ML code for production: modularity, version control, CI/CD integration, and reproducible pipelines.

From Notebook to Production

🔴 Common Anti-Patterns

Hard-coded paths: Cannot generalize across environments
No version control for data/model: Can't track changes
No separation of concerns: Hard to test, debug, or scale
Manual execution: Not automatable
No logging: Cannot audit

🟢 Production-Ready Design

Modular design: Easy to test and reuse
Central config/paths: Flexible deployment
Separate concerns: Scalable architecture
Parameterization: Supports experimentation
CI/CD integration: Automatable pipelines
Version data/model: Auditability

Recommended Project Structure

📁 ml-project/
├── 📁 data/          # Versioned datasets, saved models, predictions
│   ├── raw/
│   ├── processed/
│   └── models/
├── 📁 src/           # Modular Python source code
│   ├── data_ingestion.py
│   ├── feature_engineering.py
│   ├── model_training.py
│   └── evaluation.py
├── 📁 tests/         # Unit and integration tests
├── 📁 configs/       # YAML configs — no hardcoded values
├── requirements.txt  # Pinned dependencies
├── Dockerfile        # Reproducible environment
└── .github/ci.yml    # CI/CD pipeline
    

Quality Assurance in MLOps

💡 Unlike software engineering, QA methodologies in MLOps are still maturing. The entire pipeline — data collection, ingestion, transformation, feature engineering, training, and evaluation — must be documented and validated.

📊 Model Quality Metrics

Assertions on model performance: accuracy, precision, recall. E.g., "fail deployment if accuracy drops below 90%"

⚡ Computational Metrics

Latency, throughput. E.g., "fail if the 95th percentile of scoring latency exceeds X ms"

👥 Subpopulation Analysis

Check for model fairness — overall metrics may be good while false positives concentrate on a specific demographic segment.
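
The three kinds of check can be combined into one automated gate. The sketch below is only an illustration: the `deployment_gate` function, the thresholds, the choice of a high latency percentile (the 95th here), and the segment names are all assumptions.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def deployment_gate(accuracy, latencies_ms, fp_rate_by_segment,
                    min_accuracy=0.90, max_p95_ms=200.0, max_fp_gap=0.05):
    """Return the list of failed checks (an empty list means the model may ship)."""
    failures = []
    # Model quality metric
    if accuracy < min_accuracy:
        failures.append("accuracy below threshold")
    # Computational metric
    if percentile(latencies_ms, 95) > max_p95_ms:
        failures.append("p95 latency above threshold")
    # Subpopulation analysis: compare false-positive rates across segments
    rates = fp_rate_by_segment.values()
    if max(rates) - min(rates) > max_fp_gap:
        failures.append("false-positive rate differs too much across segments")
    return failures


failures = deployment_gate(
    accuracy=0.93,
    latencies_ms=[20, 25, 30, 40, 500],
    fp_rate_by_segment={"segment_a": 0.02, "segment_b": 0.11},
)
```

In this example the overall accuracy passes, yet the gate still fails on tail latency and on the gap between segments, which is the situation the subpopulation check is meant to catch.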

Champion / Challenger Framework

🏆 Champion
Current production model

vs

🥊 Challenger 1

🥊 Challenger 2

🥊 Challenger N

New model versions are validated against the champion on quality metrics, computational performance, creative edge cases, and subpopulation fairness before promotion to production.
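
A promotion rule along these lines might look as follows. This is a hedged sketch: the metric names (`f1`, `p95_ms`, `max_fp_gap`), the 10% latency slack, and the minimum gain are hypothetical, and edge-case testing is left out.

```python
def should_promote(champion: dict, challenger: dict, min_gain: float = 0.01) -> bool:
    """Promote only if the challenger is better on quality and no worse elsewhere."""
    better_quality = challenger["f1"] >= champion["f1"] + min_gain
    latency_ok = challenger["p95_ms"] <= champion["p95_ms"] * 1.10  # allow 10% slack
    fairness_ok = challenger["max_fp_gap"] <= champion["max_fp_gap"]
    return better_quality and latency_ok and fairness_ok


champion = {"f1": 0.81, "p95_ms": 40.0, "max_fp_gap": 0.03}
challengers = {
    "challenger_1": {"f1": 0.84, "p95_ms": 42.0, "max_fp_gap": 0.02},
    "challenger_2": {"f1": 0.88, "p95_ms": 95.0, "max_fp_gap": 0.02},
}
promotable = [name for name, m in challengers.items() if should_promote(champion, m)]
```

Here the second challenger has the best F1 score but is rejected because its latency is more than double the champion's.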

Week 4

Deployment Requirements

Containerization with Docker, virtual machines vs. containers, and best practices for reproducible ML deployments.

VMs vs Containers

🖥️ Virtual Machines

Access hardware via a hypervisor
Include full OS + application stack
More resource-intensive
Better isolation — good for testing in different OS environments
Good for public cloud and hybrid solutions

📦 Containers

Share the host OS kernel
Package executable + dependencies only
Lightweight and portable
Fast to start and deploy
Good for microservices, web services, CI/CD

Docker Best Practices

✅ Key Rules

Use official Docker images from hub.docker.com
Use .dockerignore to exclude unnecessary files
Order Dockerfile layers from least → most frequently changed to leverage cache
One process per container: easier debugging and orchestration
Container lifetime = app lifetime

# train/Dockerfile
FROM python:3.10-slim
WORKDIR /app
# Install dependencies first so this layer stays cached when only the code changes
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY train.py ./
CMD ["python", "train.py"]

# serve/Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
# Code and model artifact change most often, so they are copied last
COPY serve.py model.pkl ./
CMD ["python", "serve.py"]

Docker Compose — Multi-Container Apps

Docker Compose lets you define and run multi-container applications with a simple YAML file. Useful for running training, serving, and monitoring containers together.

version: "3.8"
services:
  api:
    image: myregistry.com/ml-api:1.2.0
    ports: ["8080:8080"]
  monitor:
    image: myregistry.com/ml-monitor:latest
    depends_on: [api]
    

Docker Layer Caching

Order layers from least to most frequently changed:

Base image → System dependencies → Python requirements → Application code

If you change app.py, Docker rebuilds only the last layer and reuses the cached dependency installation.

FROM python:3.10 (cached ✓) → COPY requirements (cached ✓) → RUN pip install (cached ✓) → COPY app.py (rebuilt ↺)

Week 5

Model Serving

Real-time vs batch inference, Kubernetes orchestration, and deployment strategies: canary, shadow testing, and A/B testing.

Serving Patterns

⚡ REST API (FastAPI)

For on-demand single predictions. Low latency, synchronous. E.g., credit card fraud detection — must respond in milliseconds.

🔧 Kubernetes + REST

For high-traffic, auto-scaling scenarios. Multiple model replicas with load balancing. Best for business-critical APIs.

☁️ Serverless (Cloud)

Robust, scalable with minimal infrastructure management. Cloud providers handle scaling. Good for variable traffic.

⚡ Real-Time Inference

Individual predictions on-demand
Focus: low latency, high availability
Infrastructure: FastAPI, Load Balancers
Example: Credit card fraud detection — block the card NOW

📦 Batch Inference

Score many records at once, periodically
Focus: throughput, parallel processing, big data
Infrastructure: Spark, cloud compute clusters
Example: Monthly churn prediction (scoring all customers overnight)

Kubernetes Architecture

🧠 Master Node
Control plane: API server, scheduler, controllers (kubectl is the client that talks to the API server)

↓ coordinates ↓

📦 Node 1

Pod A

Pod B

📦 Node 2

Pod C

Pod D

📦 Node N

Pod E

Master

Coordinates the cluster. kubectl uses its API.

Nodes

Worker VMs/computers that run containers.

Pods

Smallest deployable unit in Kubernetes. Kubernetes creates and destroys pods, not individual containers. A pod can hold multiple containers.

Deployment Strategies

🐤 Canary Release

Keep the champion model in production, but redirect a small percentage of traffic to the new challenger model. Monitor results before full rollout.

100% Champion → 90% Champion / 10% Challenger → 50% / 50% (monitor) → 100% Challenger (promoted)

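✅ Low risk, gradual rollout. Plain Kubernetes can approximate the split through replica ratios; exact traffic percentages usually need an ingress controller or service mesh.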
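
The routing logic behind a canary release can be sketched in a few lines. This is an application-level illustration under stated assumptions, not how any particular platform implements it: the `route` and `next_share` functions, the stage values, and the error-rate threshold are hypothetical.

```python
import random

STAGES = (0.10, 0.50, 1.00)  # hypothetical rollout stages for the challenger


def route(challenger_share: float, rng: random.Random) -> str:
    """Send roughly `challenger_share` of requests to the challenger."""
    return "challenger" if rng.random() < challenger_share else "champion"


def next_share(current: float, error_rate: float, max_error_rate: float = 0.02) -> float:
    """Advance to the next stage if healthy, otherwise roll back to 0."""
    if error_rate > max_error_rate:
        return 0.0
    later = [s for s in STAGES if s > current]
    return later[0] if later else current


rng = random.Random(0)  # seeded only so the sketch is repeatable
counts = {"champion": 0, "challenger": 0}
for _ in range(10_000):
    counts[route(0.10, rng)] += 1
```

The rollback branch is the important design choice: a canary is only low risk if an unhealthy challenger is taken out of traffic automatically.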

👥 Shadow Testing

The challenger model receives the same production traffic as the champion but its decisions are not acted upon. Results are logged and compared.

✅ No business risk — users are not affected by the challenger's decisions.

⚠️ More expensive — both models process every request. Also impossible when ground truth requires action (e.g., calling the client to verify).

Critical requirement: if the challenger fails, production must not experience any degradation in response time.
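
A minimal sketch of the shadow pattern, assuming both models are plain callables; `predict_with_shadow`, the toy models, and the in-memory log are hypothetical. The `try`/`except` covers challenger failures only.

```python
import logging

logger = logging.getLogger("shadow")


def predict_with_shadow(features, champion, challenger, shadow_log):
    decision = champion(features)  # only this result is returned to the caller
    try:
        shadow_log.append({
            "features": features,
            "champion": decision,
            "challenger": challenger(features),
        })
    except Exception:
        # A failing challenger must never affect the production response
        logger.exception("challenger failed")
    return decision


def champion_model(f):
    return f["amount"] > 1000


def challenger_model(f):
    return f["amount"] > 800


def broken_challenger(f):
    raise RuntimeError("model not loaded")


log = []
first = predict_with_shadow({"amount": 900}, champion_model, challenger_model, log)
second = predict_with_shadow({"amount": 900}, champion_model, broken_challenger, log)
```

This sketch calls the challenger synchronously, so a slow challenger would still add latency. To meet the response-time requirement, a real service would more likely hand the shadow call to a queue or background worker.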

🔀 A/B Testing

Split traffic between champion and challenger — users are assigned to one model exclusively. Each model serves its assigned users.

✅ Cost-efficient — total predictions stays the same. Resources not doubled.

⚠️ Riskier — a portion of real users receive challenger predictions.
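
Exclusive assignment is commonly achieved by hashing a stable user identifier, so the same user always lands on the same model. The sketch below assumes string user IDs; the `assign_variant` function and the `salt` value are hypothetical.

```python
import hashlib


def assign_variant(user_id: str, challenger_share: float = 0.5, salt: str = "exp-1") -> str:
    """Deterministically map a user to one model for the whole experiment."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000  # uniform value in [0, 1)
    return "challenger" if bucket < challenger_share else "champion"


assignments = {f"user-{i}": assign_variant(f"user-{i}") for i in range(1000)}
```

Changing the salt reshuffles users between groups, which is useful when a new experiment should not inherit the previous split.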

🔵🟢 Blue/Green Deployment

Set up the new system (green) alongside the stable one (blue). When the new version is functional, switch all traffic to it instantly — zero downtime.

🔵 Blue (Current)
100% traffic

→

🟢 Green
Set up + test

→

🟢 Green (New)
100% traffic

Kubernetes handles this natively. No downtime for real-time scoring.

Week 6

Monitoring & Drift Detection

Detecting when your model degrades in production through data drift, concept drift, and statistical testing.

Three Levels of Monitoring

🖥️

1. Resource Level

Is the model running correctly? CPU, memory, uptime, infrastructure health.

📊

2. Performance Level

Is the model still accurate? Monitoring degradation and triggering retraining when needed.

🔍

3. Prediction Explanation

Which features drive predictions? Log Shapley values to identify potential model issues.

Types of Drift

📥 Data Drift / Covariate Shift

P(X) changes but P(Y|X) stays the same. The distribution of input features shifts — e.g., seasonal temperature changes, new product categories. The model receives different inputs than it was trained on.

🎯 Concept Drift

P(Y|X) changes. The relationship between features and the target changes — e.g., fraud patterns evolve, customer behavior after COVID. The model's learned associations become stale.

🏷️ Label / Prediction Drift

The distribution of predicted labels shifts. E.g., the model starts predicting fraud much more often even though actual fraud rates are stable, which often points to covariate shift in the inputs.

💼 Business Drift

Business goals or definitions change. E.g., what counts as "churn" is redefined, regulatory requirements shift, or new product lines change the target population.

Drift Patterns Over Time

Sudden drift — A new concept occurs in a short period of time (e.g., at the start of COVID-19 in March 2020, stock prices suddenly changed).

Statistical Tests for Drift

Population Stability Index (PSI)

Measures the difference between two discrete distributions. Widely used in credit risk scorecards.

PSI = Σ (actual% − expected%) × ln(actual% / expected%)

PSI Value	Interpretation	Action
< 0.1	✅ No significant change	Continue monitoring
0.1 – 0.2	⚠️ Moderate change	Investigate features
≥ 0.2	🚨 Significant change	Retrain model

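⚠️ As defined above, PSI is symmetric: swapping reference and production data gives the same value, because PSI equals KL(actual‖expected) + KL(expected‖actual). It is sensitive to the binning, however, and undefined when a bin is empty in either dataset, so keep the bin edges fixed and use a consistent reference dataset (e.g., last 3 months of training data).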
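
The PSI formula can be computed directly from binned counts. This is a minimal sketch: the `psi` function, the `eps` clamp for empty bins, and the example counts are assumptions, and both inputs are expected to use the same bin edges.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts (same bin edges for both)."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp proportions away from zero so the logarithm stays defined
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total


reference = [100, 200, 400, 200, 100]  # hypothetical training-time bin counts
production = [60, 150, 380, 260, 150]  # hypothetical production bin counts
score = psi(reference, production)
```

The resulting score is then read against the interpretation table above; identical distributions give a score of 0.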

Kullback-Leibler (KL) Divergence

Measures how distribution Q diverges from true distribution P. Based on information theory (entropy).

KL(P‖Q) = Σ P(x) × ln(P(x) / Q(x))   [discrete]
KL(P‖Q) = ∫ P(x) × ln(P(x) / Q(x)) dx  [continuous]

Not symmetric: KL(P‖Q) ≠ KL(Q‖P)
Measures entropy increase due to approximation
Useful for GANs and synthetic data generation
Breaks down when Q(x) = 0 for an outcome where P(x) > 0 (division by zero, so the divergence is infinite)
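
The discrete formula and two of the properties above (asymmetry and the zero-probability problem) can be seen in a short sketch. The `kl_divergence` function and the example distributions are illustrative assumptions.

```python
import math


def kl_divergence(p, q):
    """KL(P‖Q) in nats for two discrete distributions over the same bins."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # 0 * ln(0 / q) is taken as 0 by convention
        if q_i == 0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


p = [0.5, 0.5, 0.0]
q = [0.25, 0.5, 0.25]
forward = kl_divergence(p, q)   # finite
backward = kl_divergence(q, p)  # infinite: q has mass in a bin where p has none
```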

Jensen-Shannon (JS) Divergence

A symmetrized version of KL divergence. Also called "information radius."

JS(P‖Q) = ½ KL(P‖M) + ½ KL(Q‖M)
where M = ½(P + Q)  [mixture distribution]

Symmetric: JS(P‖Q) = JS(Q‖P)
Handles zero-probability bins naturally (0 × ln(0) = 0)
Applied independently per feature (univariate)
Applied to binned/discretized data in practice
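
A short sketch of the definition, assuming two discrete distributions over the same bins (the function names and example values are illustrative). Because the mixture M is positive wherever P or Q is, the inner KL terms never divide by zero.

```python
import math


def kl(p, q):
    """KL(P‖Q) in nats, skipping bins where P is zero (0 * ln 0 = 0)."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence via the mixture distribution M = (P + Q) / 2."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p = [0.5, 0.5, 0.0]
q = [0.25, 0.5, 0.25]
js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)
```

With the natural logarithm, the result is bounded above by ln 2, which makes per-feature scores easier to compare than raw KL values.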

Two-Sample Kolmogorov-Smirnov (KS) Test

Non-parametric test measuring the maximum distance between two empirical cumulative distribution functions.

Works well for continuous features. Does not require binning — applied directly to the raw distribution. Returns a p-value to determine statistical significance of the difference.
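
The test statistic itself is simple to compute by walking both sorted samples. The sketch below covers only the statistic (the `ks_statistic` name is illustrative); for the p-value, a library routine such as `scipy.stats.ks_2samp` is the usual choice.

```python
def ks_statistic(sample_a, sample_b):
    """Maximum absolute distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every observation equal to x in both samples
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        # i / len(a) and j / len(b) are now the two ECDF values at x
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


training = [1, 2, 3, 4]    # hypothetical reference feature values
production = [3, 4, 5, 6]  # hypothetical production feature values
statistic = ks_statistic(training, production)
```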

Other tests: Anderson-Darling, Wasserstein Distance, Chi-Squared (for categorical), Fisher's Exact Test.

Week 7

Model Governance

Responsible AI, GDPR compliance, model explainability, and the governance frameworks that make ML trustworthy at scale.

Why Governance?

A machine learning model is not isolated from society's rules and laws. The past — where the model was trained — cannot anticipate evolving future problems. Governance ensures financial, legal, and ethical obligations are met.

⚖️

Process Governance

GDPR, industry regulations (pharma, finance, model risk management)

🔧

MLOps

Modular code, model versions, logging of all activities

🤖

Responsible AI

Explainability techniques for interpretability, bias testing for auditors

GDPR Principles

✅ Lawfulness, fairness, and transparency
✅ Purpose limitation
✅ Data minimization
✅ Accuracy

✅ Storage limitation
✅ Integrity and confidentiality (security)
✅ Accountability
✅ Right to explanation for automated decisions (a data-subject right tied to Article 22, not one of the seven core principles)

8-Step Governance Framework

1. Understand and classify the analytics use cases (Foundation)
2. Establish an ethical position (Ethics)
3. Establish responsibilities (People)
4. Determine governance policies (Policy)
5. Integrate policies into the MLOps process (Process)
6. Select tools for centralized governance management (Tools)
7. Engage and educate (People)
8. Monitor and refine (Continuous)

Interpretability vs Explainability

🔬 Interpretability

Looking at the inner mechanics — understanding the model's weights, decision tree splits, and feature coefficients. Possible only for glass-box models (linear regression, decision trees).

💬 Explainability

Explaining a black-box model's behavior in human terms by relating input attributions to model outputs. Uses post-hoc methods that can be applied to any model.

Explainability Methods

🎮 SHAP (SHapley Additive exPlanations)

Game-theoretic approach using Shapley values for optimal credit allocation. Works on any black-box model, most efficient on tree ensembles. Supports both global and local explanations.

🔍 LIME (Local Interpretable Model-Agnostic Explanations)

Fits a surrogate glass-box model around a specific prediction's neighborhood. Perturbs data points to generate synthetic samples. Designed for local explanations only.

🎲 Permutation Importance

Measures feature importance by shuffling each feature and measuring prediction error increase. If shuffling a feature doesn't change error, it's unimportant. Global only.
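
The procedure can be sketched without any ML library. In this illustration the `permutation_importance` function, the mean-absolute-error metric, and the toy model are assumptions; rows are plain lists of floats.

```python
import random


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Mean increase in mean absolute error after shuffling each feature column."""
    rng = random.Random(seed)

    def mae(rows):
        return sum(abs(predict(r) - t) for r, t in zip(rows, y)) / len(y)

    baseline = mae(X)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle one column while leaving all other columns untouched
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mae(permuted) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Hypothetical model that uses only the first feature and ignores the second
def model(row):
    return 3.0 * row[0]


X = [[float(i), float(i % 3)] for i in range(20)]
y = [model(row) for row in X]
scores = permutation_importance(model, X, y)
```

Shuffling the ignored second feature leaves the error unchanged, so its importance is zero, while shuffling the first feature increases the error.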

📈 Partial Dependence Plot (PDP)

Shows the marginal effect of one or two features on the model output. Reveals if relationships are linear, monotonic, or complex. Assumes feature independence — can mislead if correlated. Global only.

🌳 Tree Surrogates

An interpretable model trained to approximate a black-box model's predictions. Easy to visualize and interpret. Supports both global and local use.

🚀 Explainable Boosting Machine (EBM)

Tree-based cyclic gradient boosting GAM from Microsoft Research. Often comparable in accuracy to black-box models while remaining fully interpretable. Fast at prediction time. Supports both global and local explanations.

5 Reasons for Explainability

1. Accountability

When a model makes a wrong decision, knowing what caused it is essential for troubleshooting and responsibility.

2. Trust

In high-risk domains (healthcare, finance), domain experts will challenge your model — you need evidence it works.

3. Compliance

Critical for auditors and regulators — GDPR's right to explanation requires understanding automated decisions.

4. Performance

Understanding which features matter most guides hyperparameter tuning and feature selection.

Flashcards — Governance

What is the difference between SHAP and LIME?

SHAP uses game-theoretic Shapley values to compute global AND local feature attributions on any model. LIME fits a local surrogate model around one specific prediction — it only provides local explanations and is model-agnostic but less theoretically grounded.
