Week 1

Introduction to MLOps

Why do ML projects fail? MLOps bridges the gap between data science experimentation and reliable production systems.

Why MLOps?
~40%
Notebook-to-Production Gap
Models that work in Jupyter but can't scale to real-time systems
~30%
Data & Concept Drift
Real-world data changes cause silent model degradation
~20%
Business Misalignment
Optimizing for accuracy instead of latency or cost
~10%
Governance Deficit
Losing track of data, dependencies, and hyperparameters
MLOps = ML + DEV + OPS

The Core Idea

Only a small fraction of a real-world ML system is ML code. The surrounding infrastructure (data pipelines, serving, monitoring, CI/CD) is far larger. MLOps applies DevOps best practices to the ML lifecycle.

📊 Business Goal → 🗃️ Data Collection → 🔧 Feature Eng. → 🤖 Model Training → 🚀 Deployment → 📡 Monitoring

🔴 Classic Problems (No MLOps)

  • No version control for data, models, or dependencies
  • Hardcoded paths (data.csv, model.pkl)
  • No reproducibility: which sklearn version? Which random seed?
  • Manual execution: not automatable
  • No logging: cannot audit
  • Works in Jupyter, breaks in production

🟢 With MLOps

  • Modular, testable, reusable code
  • Central config: flexible deployment
  • Version control for data and models
  • Automated CI/CD pipelines
  • Logging and monitoring built-in
  • Reproducible experiments and audits
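As a rough illustration of the contrast above, the sketch below replaces hardcoded paths and an unknown random seed with one central config object, and logs what it does. The names `TrainConfig`, `split_indices`, and `run`, as well as the paths and default values, are hypothetical and chosen only for this example.

```python
import json
import logging
import random
from dataclasses import asdict, dataclass

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("training")


@dataclass(frozen=True)
class TrainConfig:
    # Hypothetical settings; a real project would load these from a config file.
    data_path: str = "data/raw/train.csv"
    model_path: str = "data/models/model.pkl"
    random_seed: int = 42
    test_fraction: float = 0.2


def split_indices(n_rows, cfg):
    """Shuffle row indices with a fixed seed so the split is reproducible."""
    rng = random.Random(cfg.random_seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = int(n_rows * cfg.test_fraction)
    return indices[n_test:], indices[:n_test]


def run(cfg, n_rows=10):
    # Log the full config so every run can be audited and repeated.
    logger.info("config=%s", json.dumps(asdict(cfg), sort_keys=True))
    train_idx, test_idx = split_indices(n_rows, cfg)
    logger.info("train_rows=%d test_rows=%d", len(train_idx), len(test_idx))
    return {"train": train_idx, "test": test_idx}
```

Calling `run(TrainConfig())` twice yields the same split, which is the property a reproducible experiment needs.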
Real-World ML Failure Cases

💳 Fraud Detection Failures

  • Real-time decisions on high-velocity data
  • Changing fraud patterns over time (concept drift)
  • Trained without holiday peak-season data → high false-positive rate for travel/flight transactions
  • Clients without recent history have higher first-transaction rejection rates
  • Quality of data ingestion is critical: missing data leads to wrong decisions

🔐 Cybersecurity ML Failures

  • Attack patterns and threat vectors evolve constantly
  • Lack of monitoring means threats go undetected
  • Models trained on old patterns miss novel attack types
  • Requires continuous retraining and real-time detection pipelines

🛒 Recommendation System Failures

  • Words and definitions change over time in NLP-based systems
  • Lack of unit tests: when a word's meaning changes, what should the model consider?
  • Automobile sales model built for a specific market: couldn't scale globally
  • Healthcare cost correlated with race → biased patient care predictions
  • Post-COVID behavioral shift: online vs. on-premise customer profiling

🧠 LLMOps Failures

  • ~35% Runaway Costs: Token consumption scales with queries. Without semantic routing or caching, budget spikes can bankrupt a project.
  • ~30% Hallucination & Jailbreaks: Models ignore system instructions or are manipulated via prompt injection.
  • ~25% RAG Scale Gap: Pipeline works on 10 test docs, collapses with 100k messy enterprise PDFs.
  • ~10% Data Privacy: Internal wikis with PII accidentally leaked through the LLM.
Week 2

Model Development

Feature management, data validation with Great Expectations, feature engineering strategies, and experiment tracking with MLflow.

Feature Challenges
📦 Large Data Volume

Handling big datasets efficiently without re-computing features from scratch

♻️ Feature Reusability

Different models need the same features: avoid duplicate computation

📏 Standardized Definitions

Features must mean the same thing across teams, models, and time

🔄 Train/Serve Consistency

Features computed at training time must be identical at serving time

Feature Store Architecture
Raw DB | Batch Data | Real-time Data
↓
🔍 Great Expectations: Data Validation Layer
↓
🏪 Feature Store: Centralized Feature Repository
↓ ↓
🤖 Training | 🚀 Serving

Features are computed once and shared, ensuring consistency between training and serving
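One way to picture the "computed once and shared" idea is a single feature function that both the training path and the serving path call. This is a minimal sketch, not a feature store: the function names and the input fields (`signup_date`, `as_of_date`, `amount`, `n_items`) are hypothetical.

```python
from datetime import date


def compute_features(raw):
    """Single source of truth for feature logic, used offline and online."""
    signup = date.fromisoformat(raw["signup_date"])
    as_of = date.fromisoformat(raw["as_of_date"])
    return {
        "account_age_days": (as_of - signup).days,
        # Guard against division by zero for empty baskets.
        "amount_per_item": raw["amount"] / max(raw["n_items"], 1),
    }


def build_training_rows(history):
    # Batch path: apply the shared logic to every historical record.
    return [compute_features(record) for record in history]


def build_serving_row(request):
    # Online path: apply the same shared logic to one incoming request.
    return compute_features(request)
```

Because both paths go through `compute_features`, a change to a feature definition reaches training and serving together.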

Great Expectations: Data Validation

Core Concepts

  • Expectation Suite: Collection of rules your data must satisfy
  • Datasource: Connection to your data (DB, CSV, API)
  • Validator: Runs expectations against data
  • Checkpoint: Reusable validation workflow

Useful Expectations

  • Null value checks on mandatory columns
  • Type enforcement (numeric, string)
  • Range and outlier checks
  • Unique key validation
  • Business rules (e.g. withdrawal → no beneficiary)
  • Statistical drift from reference distribution
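To make the kinds of expectations listed above concrete, here is a plain-Python sketch of a few of them (null, type, range, unique key, and one business rule). It deliberately does not use the Great Expectations API; the function name, column names, and the amount limit are assumptions made up for the example.

```python
def validate_transactions(rows):
    """Return a list of failure messages; an empty list means the batch passed."""
    failures = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Null checks on mandatory columns
        for col in ("transaction_id", "amount", "type"):
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")

        amount = row.get("amount")
        if amount is not None:
            # Type enforcement first, then a range check (limit is hypothetical)
            if isinstance(amount, bool) or not isinstance(amount, (int, float)):
                failures.append(f"row {i}: amount is not numeric")
            elif not 0 < amount <= 1_000_000:
                failures.append(f"row {i}: amount {amount} out of range")

        # Unique key validation
        tid = row.get("transaction_id")
        if tid is not None:
            if tid in seen_ids:
                failures.append(f"row {i}: duplicate transaction_id {tid}")
            seen_ids.add(tid)

        # Business rule: a withdrawal must not carry a beneficiary
        if row.get("type") == "withdrawal" and row.get("beneficiary") is not None:
            failures.append(f"row {i}: withdrawal has a beneficiary")
    return failures
```

A pipeline step could refuse to publish features whenever the returned list is non-empty.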
Feature Engineering Strategies
๐Ÿ“ Derivatives โ–ผ
Infer new information from existing data โ€” e.g., extracting "day of week" from a date column, computing "age" from birth date.
โž• Enrichment โ–ผ
Add external information โ€” e.g., is this day a public holiday? Weather data, exchange rates, geographic data.
๐Ÿ”„ Encoding โ–ผ
Present the same information differently โ€” e.g., one-hot encoding, ordinal encoding, weekday vs. weekend binary flag.
๐Ÿ”— Combination โ–ผ
Link features together โ€” e.g., weighting backlog size by item complexity, multiplying two signals to capture interaction effects.
โš ๏ธ Feature Engineering Trade-offs: More features โ†’ more maintenance, higher compute cost, reduced stability, and potential privacy concerns. Use automated tools (Featuretools) but always apply critical thinking to your domain.
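The four strategies can be sketched on a single record. The field names (`order_date`, `backlog_size`, `item_complexity`) and the holiday calendar below are hypothetical and exist only to show one feature per strategy.

```python
from datetime import date

# Hypothetical external calendar used for enrichment.
PUBLIC_HOLIDAYS = {date(2024, 12, 25), date(2025, 1, 1)}


def engineer_features(order):
    d = order["order_date"]
    return {
        # Derivative: infer day of week from the date (Monday = 0 ... Sunday = 6)
        "day_of_week": d.weekday(),
        # Enrichment: join against external information
        "is_public_holiday": d in PUBLIC_HOLIDAYS,
        # Encoding: the same information as a weekday vs. weekend flag
        "is_weekend": int(d.weekday() >= 5),
        # Combination: weight backlog size by item complexity
        "weighted_backlog": order["backlog_size"] * order["item_complexity"],
    }
```

For example, `engineer_features({"order_date": date(2024, 12, 28), "backlog_size": 4, "item_complexity": 2.0})` flags a weekend order that is not on the holiday list.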
Featuretools: Deep Feature Synthesis
01 EntitySet

Container for your data: represents related tables/entities and their structure

02 Relationships

Define one-to-many links between parent and child entities

03 Primitives

Operations applied to data: aggregations (SUM, COUNT, MEAN) and transformations

04 DFS

Deep Feature Synthesis: automatically stacks primitives across relationships
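To give a feel for what DFS automates, the sketch below hand-rolls a single level of aggregation across one parent/child relationship. It is not the Featuretools API: the function name and the `customers`/`transactions` tables are hypothetical, and the feature names are only loosely modeled on the PRIMITIVE(child.column) style.

```python
from collections import defaultdict
from statistics import mean

# Aggregation primitives, applied to the child rows of each parent.
AGG_PRIMITIVES = {"SUM": sum, "COUNT": len, "MEAN": mean}


def synthesize_features(customers, transactions):
    """One level of aggregation over a one-to-many customers -> transactions link."""
    amounts_by_customer = defaultdict(list)
    for t in transactions:
        amounts_by_customer[t["customer_id"]].append(t["amount"])

    rows = []
    for c in customers:
        amounts = amounts_by_customer.get(c["customer_id"], [])
        row = {"customer_id": c["customer_id"]}
        for name, fn in AGG_PRIMITIVES.items():
            # COUNT is defined for an empty child set; SUM and MEAN are left missing.
            has_value = bool(amounts) or name == "COUNT"
            row[f"{name}(transactions.amount)"] = fn(amounts) if has_value else None
        rows.append(row)
    return rows
```

DFS goes further by stacking such primitives across several relationships, which is where the automation pays off.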

MLflow: Experiment Tracking

Key Capabilities

  • Tracking: Log parameters, metrics, artifacts per run
  • Model Registry: Manage model versions and lifecycle stages
  • UI: Compare runs, visualize metrics over time
  • Reproducibility: Capture environment, requirements, seeds
  • Serving: Serve models locally or on cloud
import mlflow
import mlflow.sklearn

# Assumes `model`, `acc`, and `f1` come from your own training and evaluation code.
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("max_depth", 5)
    mlflow.log_param("n_estimators", 100)

    # Log metrics
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("f1_score", f1)

    # Log model
    mlflow.sklearn.log_model(model, "model")
Flashcards: Feature Management

What is the "training-serving skew" problem?

When features computed during training are calculated differently at serving time, causing the model to receive different input distributions than it was trained on, which degrades predictions in production.

Week 3

Production & Deployment

Structuring ML code for production: modularity, version control, CI/CD integration, and reproducible pipelines.

From Notebook to Production

🔴 Common Anti-Patterns

  • Hard-coded paths: Cannot generalize across environments
  • No version control for data/model: Can't track changes
  • No separation of concerns: Hard to test, debug, or scale
  • Manual execution: Not automatable
  • No logging: Cannot audit

🟢 Production-Ready Design

  • Modular design: Easy to test and reuse
  • Central config/paths: Flexible deployment
  • Separate concerns: Scalable architecture
  • Parameterization: Supports experimentation
  • CI/CD integration: Automatable pipelines
  • Version data/model: Auditability
Recommended Project Structure
๐Ÿ“ ml-project/ โ”œโ”€โ”€ ๐Ÿ“ data/ # Versioned datasets, saved models, predictions โ”‚ โ”œโ”€โ”€ raw/ โ”‚ โ”œโ”€โ”€ processed/ โ”‚ โ””โ”€โ”€ models/ โ”œโ”€โ”€ ๐Ÿ“ src/ # Modular Python source code โ”‚ โ”œโ”€โ”€ data_ingestion.py โ”‚ โ”œโ”€โ”€ feature_engineering.py โ”‚ โ”œโ”€โ”€ model_training.py โ”‚ โ””โ”€โ”€ evaluation.py โ”œโ”€โ”€ ๐Ÿ“ tests/ # Unit and integration tests โ”œโ”€โ”€ ๐Ÿ“ configs/ # YAML configs โ€” no hardcoded values โ”œโ”€โ”€ requirements.txt # Pinned dependencies โ”œโ”€โ”€ Dockerfile # Reproducible environment โ””โ”€โ”€ .github/ci.yml # CI/CD pipeline
Quality Assurance in MLOps
💡 Unlike software engineering, QA methodologies in MLOps are still maturing. The entire pipeline (data collection, ingestion, transformation, feature engineering, training, and evaluation) must be documented and validated.

📊 Model Quality Metrics

Assertions on model performance: accuracy, precision, recall. E.g., "fail deployment if accuracy drops below 90%"

⚡ Computational Metrics

Latency, throughput. E.g., "fail if the 95th percentile of scoring latency exceeds X ms"

👥 Subpopulation Analysis

Check for model fairness: overall metrics may be good while false positives concentrate on a specific demographic segment.
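The three kinds of checks above can be combined into one deployment gate. The sketch below is one possible shape under stated assumptions: binary 0/1 labels, a tail-latency check on the 95th percentile, and a fairness check on the gap in false-positive rate between groups. The function names and all thresholds are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def false_positive_rate(y_true, y_pred):
    preds_on_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    return sum(preds_on_negatives) / len(preds_on_negatives) if preds_on_negatives else 0.0


def deployment_gate(y_true, y_pred, groups, latencies_ms,
                    min_accuracy=0.90, max_p95_ms=50.0, max_fpr_gap=0.10):
    """Return the list of failed checks; an empty list means the model may deploy."""
    failures = []

    # Model quality metric
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    if accuracy < min_accuracy:
        failures.append(f"accuracy {accuracy:.3f} below {min_accuracy}")

    # Computational metric
    p95 = percentile(latencies_ms, 95)
    if p95 > max_p95_ms:
        failures.append(f"p95 latency {p95} ms above {max_p95_ms} ms")

    # Subpopulation analysis: compare false-positive rates across groups
    fpr_by_group = {}
    for g in set(groups):
        idx = [i for i, x in enumerate(groups) if x == g]
        fpr_by_group[g] = false_positive_rate([y_true[i] for i in idx],
                                              [y_pred[i] for i in idx])
    gap = max(fpr_by_group.values()) - min(fpr_by_group.values())
    if gap > max_fpr_gap:
        failures.append(f"false-positive rate gap {gap:.2f} across groups")
    return failures
```

A CI job could call `deployment_gate` on a held-out set and exit with a non-zero status when the list is not empty.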

Champion / Challenger Framework
๐Ÿ† Champion
Current production model
vs
๐ŸฅŠ Challenger 1
๐ŸฅŠ Challenger 2
๐ŸฅŠ Challenger N

New model versions are validated against the champion on quality metrics, computational performance, creative edge cases, and subpopulation fairness before promotion to production.

Week 4

Deployment Requirements

Containerization with Docker, virtual machines vs. containers, and best practices for reproducible ML deployments.

VMs vs Containers

🖥️ Virtual Machines

  • Access hardware via a hypervisor
  • Include full OS + application stack
  • More resource-intensive
  • Better isolation: good for testing in different OS environments
  • Good for public cloud and hybrid solutions

📦 Containers

  • Share the host OS kernel
  • Package executable + dependencies only
  • Lightweight and portable
  • Fast to start and deploy
  • Good for microservices, web services, CI/CD
Docker Best Practices

✅ Key Rules

  • Use official Docker images from hub.docker.com
  • Use .dockerignore to exclude unnecessary files
  • Order Dockerfile layers from least → most frequently changed to leverage cache
  • One process per container: easier debugging and orchestration
  • Container lifetime = app lifetime
# train/Dockerfile
FROM python:3.10-slim
WORKDIR /app
# Copy requirements first so the dependency layer stays cached when only code changes
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY train.py ./
CMD ["python", "train.py"]

# serve/Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
COPY serve.py model.pkl ./
CMD ["python", "serve.py"]
Docker Compose: Multi-Container Apps

Docker Compose lets you define and run multi-container applications with a simple YAML file. Useful for running training, serving, and monitoring containers together.

version: "3.8"
services:
  api:
    image: myregistry.com/ml-api:1.2.0
    ports: ["8080:8080"]
  monitor:
    image: myregistry.com/ml-monitor:latest
    depends_on: [api]
Docker Layer Caching
Cache layers from least to most frequently changed:
Base image → System dependencies → Python requirements → Application code

If you change app.py, Docker rebuilds only the last layer, not the requirements layers.
FROM python:3.10 (cached ✓) → COPY requirements (cached ✓) → RUN pip install (cached ✓) → COPY app.py (rebuilt ↺)
Week 5

Model Serving

Real-time vs batch inference, Kubernetes orchestration, and deployment strategies: canary, shadow testing, and A/B testing.

Serving Patterns

⚡ REST API (FastAPI)

For on-demand single predictions. Low latency, synchronous. E.g., credit card fraud detection, which must respond in milliseconds.

🔧 Kubernetes + REST

For high-traffic, auto-scaling scenarios. Multiple model replicas with load balancing. Best for business-critical APIs.

☁️ Serverless (Cloud)

Robust, scalable with minimal infrastructure management. Cloud providers handle scaling. Good for variable traffic.

⚡ Real-Time Inference

  • Individual predictions on-demand
  • Focus: low latency, high availability
  • Infrastructure: FastAPI, Load Balancers
  • Example: Credit card fraud detection (block the card NOW)

📦 Batch Inference

  • Score many records at once, periodically
  • Focus: throughput, parallel processing, big data
  • Infrastructure: Spark, cloud compute clusters
  • Example: Monthly churn prediction (scoring all customers overnight)
Kubernetes Architecture
🧠 Master Node: Coordinator, kubectl, API Server
↓ coordinates ↓
📦 Node 1: Pod A, Pod B
📦 Node 2: Pod C, Pod D
📦 Node N: Pod E

Master

Coordinates the cluster. kubectl talks to its API server.

Nodes

Worker VMs/computers that run containers.

Pods

Smallest deployable unit in Kubernetes. Kubernetes creates and destroys pods, not individual containers. A pod can hold multiple containers.

Deployment Strategies

๐Ÿค Canary Release

Keep the champion model in production, but redirect a small percentage of traffic to the new challenger model. Monitor results before full rollout.

100% Traffic
Champion
โ†’
90% Champion
10% Challenger
โ†’
50% / 50%
Monitor
โ†’
100% New
Promoted

โœ… Low risk, gradual rollout. Kubernetes handles this natively.
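At the application level, the traffic split behind a canary release can be pictured as a weighted random router. This is a simplified sketch, not how a cluster ingress implements it; `CanaryRouter` and its parameters are hypothetical names, and the models are assumed to be plain callables.

```python
import random


class CanaryRouter:
    """Send a configurable share of requests to the challenger model."""

    def __init__(self, champion, challenger, challenger_share=0.10, seed=None):
        if not 0.0 <= challenger_share <= 1.0:
            raise ValueError("challenger_share must be between 0 and 1")
        self.champion = champion
        self.challenger = challenger
        self.challenger_share = challenger_share
        self._rng = random.Random(seed)

    def predict(self, features):
        # random() is uniform on [0, 1), so a share of 0.10 routes about 10% of calls.
        use_challenger = self._rng.random() < self.challenger_share
        model = self.challenger if use_challenger else self.champion
        label = "challenger" if use_challenger else "champion"
        # Return the label too, so monitoring can compare the two models.
        return label, model(features)
```

Raising `challenger_share` step by step (0.10, 0.50, 1.0) mirrors the rollout stages shown above.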

👥 Shadow Testing

The challenger model receives the same production traffic as the champion, but its decisions are not acted upon. Results are logged and compared.

✅ No business risk: users are not affected by the challenger's decisions.
⚠️ More expensive: both models process every request. Also impossible when obtaining ground truth requires acting on the prediction (e.g., calling the client to verify).

Critical requirement: if the challenger fails, production must not experience any degradation in response time.

🔀 A/B Testing

Split traffic between champion and challenger: each user is assigned to exactly one model, which serves all of that user's requests.

✅ Cost-efficient: the total number of predictions stays the same, so resources are not doubled.
⚠️ Riskier: a portion of real users receive challenger predictions.
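Exclusive assignment is commonly done by hashing a stable user identifier, so that the same user always lands on the same model without storing any state. The sketch below shows one such scheme; `assign_variant`, the salt value, and the 50/50 default are assumptions for illustration.

```python
import hashlib


def assign_variant(user_id, challenger_share=0.5, salt="experiment-1"):
    """Deterministically map a user to 'champion' or 'challenger'."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    # Use the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "challenger" if bucket < challenger_share else "champion"
```

Changing the salt reshuffles users for a new experiment, while keeping it fixed keeps assignments stable across requests and servers.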

🔵🟢 Blue/Green Deployment

Set up the new system (green) alongside the stable one (blue). When the new version is functional, switch all traffic to it at once, with zero downtime.

🔵 Blue (current): 100% traffic → 🟢 Green: set up + test → 🟢 Green (new): 100% traffic

Kubernetes handles this natively. No downtime for real-time scoring.

Week 6

Monitoring & Drift Detection

Detecting when your model degrades in production through data drift, concept drift, and statistical testing.

Three Levels of Monitoring
🖥️ 1. Resource Level

Is the model running correctly? CPU, memory, uptime, infrastructure health.

📊 2. Performance Level

Is the model still accurate? Monitoring degradation and triggering retraining when needed.

🔍 3. Prediction Explanation

Which features drive predictions? Log Shapley values to identify potential model issues.

Types of Drift

📥 Data Drift / Covariate Shift

P(X) changes but P(Y|X) stays the same. The distribution of input features shifts, e.g., seasonal temperature changes or new product categories. The model receives different inputs than it was trained on.

🎯 Concept Drift

P(Y|X) changes. The relationship between features and the target changes, e.g., fraud patterns evolve or customer behavior shifts after COVID. The model's learned associations become stale.

🏷️ Label / Prediction Drift

The distribution of predicted labels shifts. E.g., the model starts predicting fraud much more often even though fraud rates are stable, which is often a downstream symptom of covariate shift.

💼 Business Drift

Business goals or definitions change. E.g., what counts as "churn" is redefined, regulatory requirements shift, or new product lines change the target population.

Drift Patterns Over Time

Sudden drift: a new concept appears within a short period of time (e.g., at the start of COVID-19 in March 2020, stock prices suddenly changed).

Statistical Tests for Drift

Population Stability Index (PSI)

Measures the difference between two discrete distributions. Widely used in credit risk scorecards.

PSI = Σ (actual% − expected%) × ln(actual% / expected%)
PSI Value | Interpretation | Action
< 0.1 | ✅ No significant change | Continue monitoring
0.1 – 0.2 | ⚠️ Moderate change | Investigate features
≥ 0.2 | 🚨 Significant change | Retrain model
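⚠️ The PSI formula itself is symmetric in the two distributions, but the result depends on the bin edges, which are normally derived from the reference data, so do not swap the roles of reference and production data. Use a consistent reference dataset (e.g., last 3 months of training data).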
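The PSI formula can be computed directly from per-bin counts. In the sketch below, `expected_counts` stands for the reference data and `actual_counts` for production data, both already binned with the same edges; the small `eps` floor for empty bins is a common workaround and an assumption of this example, as are the function names.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over pre-binned counts."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both inputs must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    value = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor the shares at eps so empty bins do not produce ln(0).
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        value += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return value


def interpret_psi(value):
    """Map a PSI value to the bands in the table above."""
    if value < 0.1:
        return "no significant change"
    if value < 0.2:
        return "moderate change"
    return "significant change"
```

For instance, a feature whose two bins move from a 50/50 split to an 80/20 split lands well inside the "significant change" band.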

Kullback-Leibler (KL) Divergence

Measures how distribution Q diverges from true distribution P. Based on information theory (entropy).

KL(P‖Q) = Σ P(x) × ln(P(x) / Q(x))   [discrete]
KL(P‖Q) = ∫ P(x) × ln(P(x) / Q(x)) dx   [continuous]
  • Not symmetric: KL(P‖Q) ≠ KL(Q‖P)
  • Measures the extra information (relative entropy) incurred when Q is used to approximate P
  • Useful for GANs and synthetic data generation
  • Breaks down on zero-probability events: infinite when Q(x) = 0 but P(x) > 0

Jensen-Shannon (JS) Divergence

A symmetrized version of KL divergence. Also called "information radius."

JS(P‖Q) = ½ KL(P‖M) + ½ KL(Q‖M), where M = ½(P + Q)   [mixture distribution]
  • Symmetric: JS(P‖Q) = JS(Q‖P)
  • Handles zero-probability bins naturally (M(x) > 0 wherever P(x) or Q(x) is, and 0 × ln(0) is taken as 0)
  • Applied independently per feature (univariate)
  • Applied to binned/discretized data in practice
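Both divergences are short to write down for discrete (binned) distributions. The sketch below assumes `p` and `q` are lists of probabilities over the same bins that each sum to 1; the function names are chosen for this example.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same bins."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # 0 * ln(0 / q) is taken as 0
        if q_i == 0:
            return math.inf  # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total


def js_divergence(p, q):
    """JS(P || Q): average KL of P and Q against their mixture M."""
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

With `p = [0.5, 0.5, 0.0]` and `q = [0.25, 0.25, 0.5]`, `kl_divergence(q, p)` is infinite because `q` has mass in a bin where `p` has none, while `js_divergence(p, q)` stays finite.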

Two-Sample Kolmogorov-Smirnov (KS) Test

Non-parametric test measuring the maximum distance between two empirical cumulative distribution functions.

Works well for continuous features. Does not require binning: it is applied directly to the raw values. Returns a p-value to determine the statistical significance of the difference.
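The KS statistic is simply the largest gap between the two empirical CDFs, which the sketch below computes by walking both sorted samples. The p-value uses the standard asymptotic Kolmogorov series and is only an approximation for small samples; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used. The function name and the cut-off for very small statistics are assumptions of this example.

```python
import math


def ks_two_sample(sample_a, sample_b):
    """Return (D, approximate p-value) for the two-sample KS test."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both ECDFs past every value <= x (this handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))

    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.3:
        # The series converges slowly here and the true value is very close to 1.
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(1.0, max(0.0, p))
```

A small p-value suggests the production sample no longer follows the reference distribution for that feature.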

Other tests: Anderson-Darling, Wasserstein Distance, Chi-Squared (for categorical), Fisher's Exact Test.

Week 7

Model Governance

Responsible AI, GDPR compliance, model explainability, and the governance frameworks that make ML trustworthy at scale.

Why Governance?
A machine learning model is not isolated from society's rules and laws. The historical data the model was trained on cannot anticipate evolving future problems. Governance ensures financial, legal, and ethical obligations are met.
⚖️ Process Governance

GDPR, industry regulations (pharma, finance, model risk management)

🔧 MLOps

Modular code, model versions, logging of all activities

🤖 Responsible AI

Explainability techniques for interpretability, bias testing for auditors

GDPR Principles
  • ✅ Lawfulness, fairness, and transparency
  • ✅ Purpose limitation
  • ✅ Data minimization
  • ✅ Accuracy
  • ✅ Storage limitation
  • ✅ Integrity and confidentiality (security)
  • ✅ Accountability
  • ✅ Right to explanation for automated decisions
8-Step Governance Framework
1. Understand and classify the analytics use cases (Foundation)
2. Establish an ethical position (Ethics)
3. Establish responsibilities (People)
4. Determine governance policies (Policy)
5. Integrate policies into the MLOps process (Process)
6. Select tools for centralized governance management (Tools)
7. Engage and educate (People)
8. Monitor and refine (Continuous)
Interpretability vs Explainability

🔬 Interpretability

Looking at the inner mechanics: understanding the model's weights, decision tree splits, and feature coefficients. Possible only for glass-box models (linear regression, decision trees).

💬 Explainability

Explaining a black-box model's behavior in human terms by relating input features to model outputs. Post-hoc methods can be applied to any model.

Explainability Methods

🎮 SHAP (SHapley Additive exPlanations)

Game-theoretic approach using Shapley values for optimal credit allocation. Works on any black-box model, most efficient on tree ensembles. Supports both global and local explanations.

🔍 LIME (Local Interpretable Model-Agnostic Explanations)

Fits a surrogate glass-box model around a specific prediction's neighborhood. Perturbs data points to generate synthetic samples. Designed for local explanations only.

🎲 Permutation Importance

Measures feature importance by shuffling each feature and measuring prediction error increase. If shuffling a feature doesn't change error, it's unimportant. Global only.
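The shuffling procedure is model-agnostic and fits in a few lines. The sketch below assumes a regression setting with mean squared error, rows stored as Python lists, and a `predict` callable that scores one row; all names are illustrative, and libraries such as scikit-learn ship their own implementation.

```python
import random


def mean_squared_error(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in error when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the dataset with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            error = mean_squared_error(y, [predict(row) for row in X_perm])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances
```

A feature the model ignores gets an importance of exactly zero, because shuffling it leaves every prediction unchanged.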

📈 Partial Dependence Plot (PDP)

Shows the marginal effect of one or two features on the model output. Reveals if relationships are linear, monotonic, or complex. Assumes feature independence, so it can mislead when features are correlated. Global only.

🌳 Tree Surrogates

An interpretable model trained to approximate a black-box model's predictions. Easy to visualize and interpret. Supports both global and local use.

🚀 Explainable Boosting Machine (EBM)

Tree-based cyclic gradient boosting GAM from Microsoft Research. Often comparable in accuracy to black-box models while remaining fully interpretable. Fast at prediction time. Supports both global and local explanations.

5 Reasons for Explainability

1. Accountability

When a model makes a wrong decision, knowing what caused it is essential for troubleshooting and responsibility.

2. Trust

In high-risk domains (healthcare, finance), domain experts will challenge your model โ€” you need evidence it works.

3. Compliance

Critical for auditors and regulators: GDPR's right to explanation requires understanding automated decisions.

4. Performance

Understanding which features matter most guides hyperparameter tuning and feature selection.

Flashcards: Governance

What is the difference between SHAP and LIME?

SHAP uses game-theoretic Shapley values to compute global AND local feature attributions on any model. LIME fits a local surrogate model around one specific prediction: it only provides local explanations and is model-agnostic but less theoretically grounded.
