
AI Ops Ch 8 / 9 Advanced

🧪

Meera runs the experiment

A/B testing ML models, statistical significance and shadow deployment

⏱ 10 min 5 commands 5 takeaways


In this chapter

Meera

Senior ML engineer, fintech product team

The story

Meera had trained a new loan approval model showing 15% better accuracy on the test set. Her manager was excited. The engineering team was ready to deploy.

But Meera had a question nobody had asked: does better accuracy on test data mean better outcomes in production?

A model can have excellent offline metrics and still perform worse in production, for three reasons: the test data may not be representative of live traffic, users behave differently once the model changes, and the business metric is rarely exactly what the model optimises for.

She insisted on a proper A/B test.

Shadow deployment: the safe first step. Run the new model in shadow mode: it receives the same inputs as the production model, but its outputs never reach users. They are only logged for later comparison.

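import logging
import time

log = logging.getLogger("model_router")

class ModelRouter:
    """Serve the production model; run the shadow model on the same input and log it."""

    def __init__(self, production_model, shadow_model):
        self.production = production_model
        self.shadow = shadow_model

    def predict(self, features):
        production_result = self.production.predict(features)

        # A shadow failure must never affect the user, so catch and log everything.
        # Note: this call runs inline, so shadow latency adds to the request time.
        try:
            start = time.perf_counter()
            shadow_result = self.shadow.predict(features)
            shadow_ms = (time.perf_counter() - start) * 1000
            log.info("shadow_prediction production=%s shadow=%s shadow_ms=%.1f",
                     production_result, shadow_result, shadow_ms)
        except Exception as e:
            log.warning("shadow_model_error error=%s", e)

        return production_result  # always return the production result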

Shadow mode lets you verify the new model runs without errors and measure its latency before any user sees it.
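Once shadow predictions are logged, two numbers are worth checking before moving on: how often the shadow model agrees with production, and how slow it is at the tail. The sketch below is one way to summarise that, assuming each logged record has been parsed into a dict with the hypothetical keys `production`, `shadow` and `shadow_ms`. Those names and the sample records are illustrative, not part of any logging library.

```python
from math import ceil

def summarise_shadow_log(records):
    """Summarise parsed shadow-log records (hypothetical record format)."""
    records = list(records)
    if not records:
        raise ValueError("no shadow records to summarise")

    # Share of requests where both models gave the same answer
    agreements = sum(1 for r in records if r["production"] == r["shadow"])

    # Nearest-rank 95th percentile of shadow latency
    latencies = sorted(r["shadow_ms"] for r in records)
    p95_index = max(0, ceil(0.95 * len(latencies)) - 1)

    return {
        "n": len(records),
        "agreement_rate": agreements / len(records),
        "p95_shadow_ms": latencies[p95_index],
    }

# Made-up records, for illustration only
sample = [
    {"production": "approve", "shadow": "approve", "shadow_ms": 12.0},
    {"production": "approve", "shadow": "reject", "shadow_ms": 15.0},
    {"production": "reject", "shadow": "reject", "shadow_ms": 11.0},
    {"production": "approve", "shadow": "approve", "shadow_ms": 40.0},
]
summary = summarise_shadow_log(sample)
print(summary)
```

A low agreement rate is not a failure in itself, since the new model is supposed to differ, but the disagreements are the cases worth reading by hand before any user is exposed.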

A/B testing: send X percent of traffic to the new model, the rest to the old model. Measure the business metric on both groups.

import hashlib

def get_model_for_user(user_id, experiment_fraction=0.10):
    # Deterministic: same user always sees the same model
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    bucket = hash_value % 100
    return 'model_b' if bucket < experiment_fraction * 100 else 'model_a'

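Deterministic assignment is critical. A user must get the same model on every request rather than a fresh random draw each time; otherwise one user's outcomes are spread across both groups and cannot be attributed to either model.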
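A quick sanity check on the bucketing is to assign a batch of synthetic users twice and confirm that the assignment is stable and the split is close to the target. The sketch below does that. It uses an integer percentage to avoid floating-point comparisons and adds an experiment name as a salt so that different experiments get independent splits; both the salt and the `loan_model_v2` name are assumptions for illustration.

```python
import hashlib

def assign_bucket(user_id: str, experiment: str = "loan_model_v2") -> int:
    """Map a user to a stable bucket in [0, 100).

    `experiment` is a hypothetical salt: changing it reshuffles users,
    so two experiments do not share the same 10% of users.
    """
    key = f"{experiment}:{user_id}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % 100

def assign_model(user_id: str, experiment_percent: int = 10) -> str:
    """Return 'model_b' for the experiment group, 'model_a' otherwise."""
    return "model_b" if assign_bucket(user_id) < experiment_percent else "model_a"

# Synthetic user ids, for illustration only
user_ids = [f"user-{i}" for i in range(10_000)]

first_pass = [assign_model(u) for u in user_ids]
second_pass = [assign_model(u) for u in user_ids]

stable = first_pass == second_pass           # same user, same model
share_b = first_pass.count("model_b") / len(first_pass)
print(f"stable={stable}, share on model_b={share_b:.3f}")
```

MD5 is used here only to spread users evenly across buckets, not for security.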

Calculate required sample size before you start:

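from statsmodels.stats.power import TTestIndPower  # power analysis lives in statsmodels, not scipy
import numpy as np

def required_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    # Standardised effect size: the absolute change (mde) divided by the
    # standard deviation of a yes/no outcome at the baseline rate
    effect_size = mde / np.sqrt(baseline_rate * (1 - baseline_rate))
    # Solve for the number of observations needed in each group
    n = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=power)
    return int(np.ceil(n))

# Example: baseline default rate 12%, want to detect a 2pp improvement
n = required_sample_size(0.12, 0.02)
print(f"Need {n} samples per group")  # a little over 4,100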

Most teams run A/B tests too short and mistake noise for signal. Always calculate sample size first.
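As a cross-check on any power calculation, the textbook normal-approximation formula for comparing two proportions can be computed with the standard library alone. The sketch below is that swapped-in formula, not the chapter's own function, and it treats the 2pp improvement as a default rate moving from 12% to 10%, which is an assumption about the direction of the change.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided test of two proportions.

    n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# Default rate 12% under the old model, hoping for 10% under the new one
n_per_group = sample_size_two_proportions(0.12, 0.10)
print(f"Need about {n_per_group} users per group")

# Halving the detectable difference roughly quadruples the requirement
n_smaller_effect = sample_size_two_proportions(0.12, 0.11)
print(f"For a 1pp change: about {n_smaller_effect} per group")
```

The result lands in the thousands per group. Dividing that figure by the daily number of eligible users in each group gives a rough minimum duration for the test.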

Analysing results:

from scipy import stats

# Each row is [successes, failures] for one group
chi2, p_value, dof, expected = stats.chi2_contingency([
    [180, 1620],   # model A: 180 successes out of 1800
    [210, 1590],   # model B: 210 successes out of 1800
])

# With these counts p_value is about 0.12, so the second branch runs
if p_value < 0.05:
    print("Statistically significant - check the effect size, then ship")
else:
    print("Not significant yet - need more data")
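A p-value says whether a difference is likely to be real, not how large it is. One way to look at size is a confidence interval on the difference in rates. The sketch below uses the unpooled normal-approximation (Wald) interval from the standard library; it is an illustrative addition, and the counts are the same example counts as in the chi-squared call.

```python
from math import sqrt
from statistics import NormalDist

def diff_in_rates_ci(success_a, n_a, success_b, n_b, confidence=0.95):
    """Return (diff, low, high) for rate_b - rate_a using a Wald interval."""
    rate_a = success_a / n_a
    rate_b = success_b / n_b
    diff = rate_b - rate_a
    # Unpooled standard error of the difference between two proportions
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se

diff, low, high = diff_in_rates_ci(180, 1800, 210, 1800)
print(f"Model B minus model A: {diff:+.4f} (95% CI {low:+.4f} to {high:+.4f})")

# If the interval includes zero, the data are still consistent with no effect
includes_zero = low <= 0 <= high
print(f"Interval includes zero: {includes_zero}")
```

An interval that excludes zero but sits entirely below the smallest change the business cares about is statistically significant yet not worth shipping for.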

Gradual rollout even after significance:

- Week 1: 10 percent to new model — watch for errors

- Week 2: 25 percent — check business metrics

- Week 3: 50/50 — final comparison

- Week 4: 100 percent if all metrics healthy
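The schedule above can be sketched as a small lookup that feeds the same hash-based bucketing used for the A/B split. Because each user's bucket is fixed and the threshold only grows, anyone already on the new model stays on it as the rollout widens. The function names and the week-number input are assumptions for illustration; in practice the percentage would more likely come from a config or feature-flag system.

```python
import hashlib

# Week number -> percent of users on the new model (from the schedule above)
ROLLOUT_PERCENT_BY_WEEK = {1: 10, 2: 25, 3: 50, 4: 100}

def rollout_percent(week: int) -> int:
    """Percent of traffic for the new model in a given rollout week."""
    if week < 1:
        return 0                       # rollout has not started
    return ROLLOUT_PERCENT_BY_WEEK[min(week, 4)]

def bucket_for(user_id: str) -> int:
    """Stable bucket in [0, 100) for a user."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100

def model_for_user(user_id: str, week: int) -> str:
    return "model_b" if bucket_for(user_id) < rollout_percent(week) else "model_a"

# Synthetic user ids, for illustration only
users = [f"user-{i}" for i in range(2_000)]

# Check that nobody is moved back to the old model as the rollout widens
never_moved_back = all(
    model_for_user(u, week + 1) == "model_b"
    for u in users
    for week in (1, 2, 3)
    if model_for_user(u, week) == "model_b"
)
print(f"never moved back: {never_moved_back}")
```

Rolling back is the reverse operation: lowering the percentage returns the most recently added buckets to the old model first.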

Meera ran the test for 3 weeks. The new model showed statistically significant improvement in approval rate without increasing default rate. She shipped it with confidence.

Key takeaways

Shadow deployment runs the new model without showing results to users — catch errors safely first

Always calculate required sample size before starting an A/B test — most teams run tests too short

User assignment must be deterministic (same user = same model every request) not random per request

p-value under 0.05 is not enough alone — also check practical significance and effect size

Gradual rollout: 10% then 25% then 50% then 100% with monitoring at each stage

Commands from this chapter

$ int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100

Deterministic user bucketing for A/B assignment

$ stats.chi2_contingency([[a,b],[c,d]])

Chi-squared test for comparing conversion rates

$ TTestIndPower().solve_power(effect_size, alpha=0.05, power=0.80)

Calculate required sample size (TTestIndPower is imported from statsmodels.stats.power, not scipy)

$ from scipy import stats

Import scipy stats for A/B test analysis

$ pip install scipy statsmodels

Install statistical testing libraries