AI Ops Ch 8 / 9 Advanced
🧪

Meera runs the experiment

A/B testing ML models, statistical significance and shadow deployment

⏱ 10 min 5 commands 5 takeaways
In this chapter
Meera
Senior ML engineer, fintech product team
The story

Meera had trained a new loan approval model showing 15% better accuracy on the test set. Her manager was excited. The engineering team was ready to deploy.

But Meera had a question nobody had asked: Better accuracy on test data — does that mean better outcomes in production?

A model can have excellent offline metrics and still perform worse in production because the test data is not representative, users behave differently when the model changes, and the business metric is not exactly what the model optimises for.

She insisted on a proper A/B test.

Shadow deployment: the safe first step. Run the new model in shadow mode — it receives the same inputs as the production model but its outputs go nowhere. You just log them.

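import structlog  # assumption: the keyword-argument log calls below match structlog's API

log = structlog.get_logger()


class ModelRouter:
    """Serve production predictions while logging shadow predictions for comparison."""

    def __init__(self, production_model, shadow_model):
        self.production = production_model
        self.shadow = shadow_model

    def predict(self, features):
        production_result = self.production.predict(features)
        try:
            # The shadow output is only logged, never returned to the caller.
            shadow_result = self.shadow.predict(features)
            log.info("shadow_prediction",
                     production=production_result,
                     shadow=shadow_result)
        except Exception as e:
            # A shadow failure must never affect the production response.
            log.warning("shadow_model_error", error=str(e))
        return production_result  # always return production result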

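Shadow mode lets you verify that the new model runs without errors and, once you time each call, measure its latency before any user sees it. Note that the router above calls the shadow model synchronously, so a slow shadow model delays the production response; in practice the shadow call is usually moved off the request path.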
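The latency and agreement checks described above can be sketched as follows. This is a minimal illustration, not the chapter's router: `TimedShadowRouter`, its `records` list (a stand-in for a real metrics backend), and the `_ConstantModel` stub are all hypothetical names introduced here. It assumes each model exposes a `predict(features)` method and runs the shadow call on a background thread so it cannot slow down the production response.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TimedShadowRouter:
    """Hypothetical shadow router that records latency and agreement per request."""

    def __init__(self, production_model, shadow_model, max_workers=2):
        self.production = production_model
        self.shadow = shadow_model
        self.records = []  # stand-in for a real logging or metrics backend
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    @staticmethod
    def _timed(model, features):
        # Return the prediction together with the wall-clock time it took.
        start = time.perf_counter()
        result = model.predict(features)
        return result, time.perf_counter() - start

    def _run_shadow(self, features, production_result, production_seconds):
        try:
            shadow_result, shadow_seconds = self._timed(self.shadow, features)
            self.records.append({
                "production": production_result,
                "shadow": shadow_result,
                "agree": production_result == shadow_result,
                "production_seconds": production_seconds,
                "shadow_seconds": shadow_seconds,
            })
        except Exception as exc:
            # Shadow failures are recorded, never raised to the caller.
            self.records.append({"error": str(exc)})

    def predict(self, features):
        production_result, production_seconds = self._timed(self.production, features)
        # Queue the shadow call on a background thread, off the request path.
        self._pool.submit(self._run_shadow, features, production_result, production_seconds)
        return production_result

    def close(self):
        # Wait for any queued shadow calls to finish.
        self._pool.shutdown(wait=True)


class _ConstantModel:
    """Stub model used only for this illustration."""

    def __init__(self, value):
        self.value = value

    def predict(self, features):
        return self.value


router = TimedShadowRouter(_ConstantModel("approve"), _ConstantModel("reject"))
result = router.predict({"income": 50_000})
router.close()

compared = [r for r in router.records if "agree" in r]
agreement_rate = sum(r["agree"] for r in compared) / len(compared) if compared else None
```

Comparing `shadow_seconds` with `production_seconds` over real traffic gives a latency picture before rollout, and a low `agreement_rate` flags cases worth inspecting by hand.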

A/B testing: send X percent of traffic to the new model, the rest to the old model. Measure the business metric on both groups.

import hashlib


def get_model_for_user(user_id, experiment_fraction=0.10):
    # Deterministic: same user always sees the same model
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = hash_value % 100  # 0..99
    # round() avoids float drift such as 0.07 * 100 == 7.000000000000001
    threshold = round(experiment_fraction * 100)
    return 'model_b' if bucket < threshold else 'model_a'

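Deterministic assignment is critical. A user must see the same model on every request, rather than a fresh random choice each time; otherwise one user's outcomes get mixed across both groups and the comparison is contaminated.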
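Before trusting the split, it is worth checking two properties on sample IDs: that assignment is repeatable, and that the share of users landing on the new model is close to the intended fraction. The sketch below does this with synthetic user IDs. The helper names `assign_bucket` and `assign_model` and the `user-<n>` ID format are illustrative assumptions; the hashing scheme is the MD5 bucketing shown in this chapter.

```python
import hashlib
from collections import Counter


def assign_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in 0..99 using MD5 bucketing."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100


def assign_model(user_id: str, experiment_percent: int = 10) -> str:
    """Users in the lowest `experiment_percent` buckets get the new model."""
    return "model_b" if assign_bucket(user_id) < experiment_percent else "model_a"


# Synthetic IDs; real IDs should be checked the same way.
user_ids = [f"user-{i}" for i in range(20_000)]

first_pass = [assign_model(u) for u in user_ids]
second_pass = [assign_model(u) for u in user_ids]
is_deterministic = first_pass == second_pass

share_b = Counter(first_pass)["model_b"] / len(user_ids)
print(f"deterministic: {is_deterministic}, share on model_b: {share_b:.3f}")
```

If the observed share drifts far from the target, the ID format or the hashing step deserves a closer look before the experiment starts.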

Calculate required sample size before you start:

import numpy as np
from statsmodels.stats.power import TTestIndPower  # lives in statsmodels, not scipy.stats


def required_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    # Standardised effect: absolute change divided by the baseline standard deviation
    effect_size = mde / np.sqrt(baseline_rate * (1 - baseline_rate))
    n = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=power)
    return int(np.ceil(n))


# Example: baseline default rate 12%, want to detect 2pp improvement
n = required_sample_size(0.12, 0.02)
print(f"Need {n} samples per group")  # a little over 4,100 with these inputs

Most teams run A/B tests too short and mistake noise for signal. Always calculate sample size first.
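To see where the sample size comes from, and how it translates into test duration, here is a rough sketch using only the standard library. It uses the usual normal-approximation formula n = 2 (z_(1-alpha/2) + z_power)^2 p(1-p) / mde^2 per group, which is a swapped-in approximation rather than the chapter's power solver. The function names and the traffic figure of 500 users per day are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, mde, alpha=0.05, power=0.80):
    """Normal-approximation sample size to detect an absolute change `mde`."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = baseline_rate * (1 - baseline_rate)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * variance / mde ** 2)


def days_to_run(n_per_group, daily_users, experiment_fraction):
    """Days until the smaller arm of the split reaches n_per_group users."""
    smaller_arm = min(experiment_fraction, 1 - experiment_fraction)
    return math.ceil(n_per_group / (daily_users * smaller_arm))


# Baseline rate 12%, minimum detectable effect 2 percentage points.
n = sample_size_per_group(0.12, 0.02)

# Hypothetical traffic: 500 new users per day, 10% routed to the new model.
days = days_to_run(n, daily_users=500, experiment_fraction=0.10)

print(f"{n} users per group, about {days} days at this traffic level")
```

The duration is driven by the smaller arm, so a 10 percent split takes far longer to reach the target than a 50/50 split on the same traffic.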

Analysing results:

from scipy import stats

# Rows are models, columns are [successes, failures]
chi2, p_value, dof, expected = stats.chi2_contingency([
    [180, 1620],   # model A: 180 successes out of 1800
    [210, 1590],   # model B: 210 successes out of 1800
])
if p_value < 0.05:
    print("Statistically significant - check the effect size before shipping")
else:
    print("Not significant yet - need more data")
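A p-value says whether a difference is likely real, not whether it is large enough to matter. One way to check practical significance is to put a confidence interval around the difference in rates. The sketch below uses the unpooled (Wald) interval for two proportions with the same counts as the example above. The function name `diff_in_proportions` and the 1 percentage point business threshold are hypothetical.

```python
import math
from statistics import NormalDist


def diff_in_proportions(success_a, n_a, success_b, n_b, confidence=0.95):
    """Return (absolute lift, lower bound, upper bound) for rate_b - rate_a."""
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    lift = rate_b - rate_a
    # Unpooled (Wald) standard error of the difference between two proportions
    se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return lift, lift - z * se, lift + z * se


# Same counts as the chi-squared example: 180/1800 vs 210/1800.
lift, lower, upper = diff_in_proportions(180, 1800, 210, 1800)
print(f"lift: {lift:+.4f}, 95% CI: [{lower:+.4f}, {upper:+.4f}]")

# Hypothetical business threshold: only ship if the lift is at least 1pp.
min_worthwhile_lift = 0.01
practically_significant = lower >= min_worthwhile_lift
```

Requiring the lower bound of the interval to clear the business threshold is a deliberately conservative choice; a team could reasonably compare the point estimate instead.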

Gradual rollout even after significance:

- Week 1: 10 percent to new model — watch for errors

- Week 2: 25 percent — check business metrics

- Week 3: 50/50 — final comparison

- Week 4: 100 percent if all metrics healthy
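The schedule above can be encoded as data plus a small gate that decides whether to advance. This is a sketch under stated assumptions: the stage percentages come from the list above, but `next_stage`, the single `metrics_healthy` flag, and the roll-back-to-zero policy are hypothetical choices, not something the chapter prescribes. It also checks a useful property of hash bucketing: users exposed at 10 percent stay exposed at 25 percent, so nobody flips back to the old model as the rollout widens.

```python
import hashlib

# Percent of users on the new model at each stage of the schedule.
ROLLOUT_STAGES = [10, 25, 50, 100]


def bucket_for(user_id: str) -> int:
    """Stable bucket in 0..99 from MD5 bucketing."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100


def on_new_model(user_id: str, percent: int) -> bool:
    """True when the user's bucket falls inside the current rollout percentage."""
    return bucket_for(user_id) < percent


def next_stage(current_percent: int, metrics_healthy: bool) -> int:
    """Advance one stage when healthy; otherwise roll back to 0 percent.

    Rolling back to zero is an assumed policy for this sketch.
    """
    if not metrics_healthy:
        return 0
    idx = ROLLOUT_STAGES.index(current_percent)
    return ROLLOUT_STAGES[min(idx + 1, len(ROLLOUT_STAGES) - 1)]


# Widening the rollout only adds users; it never removes anyone already exposed.
users = [f"user-{i}" for i in range(5_000)]  # synthetic IDs
exposed_at_10 = {u for u in users if on_new_model(u, 10)}
exposed_at_25 = {u for u in users if on_new_model(u, 25)}
stays_exposed = exposed_at_10 <= exposed_at_25
```

In a real system the `metrics_healthy` flag would be derived from error rates, latency, and the business metric at each stage.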

Meera ran the test for 3 weeks. The new model showed a statistically significant improvement in approval rate without increasing the default rate. She shipped it with confidence.

Key takeaways

Shadow deployment runs the new model without showing results to users — catch errors safely first

Always calculate required sample size before starting an A/B test — most teams run tests too short

User assignment must be deterministic (same user = same model on every request), not random per request

p-value under 0.05 is not enough alone — also check practical significance and effect size

Gradual rollout: 10% then 25% then 50% then 100% with monitoring at each stage

Commands from this chapter
$ int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
Deterministic user bucketing for A/B assignment
$ stats.chi2_contingency([[a,b],[c,d]])
Chi-squared test for comparing conversion rates
$ TTestIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.80)
Calculate required sample size (requires: from statsmodels.stats.power import TTestIndPower)
$ from scipy import stats
Import scipy stats for A/B test analysis
$ pip install scipy statsmodels
Install statistical testing libraries