Meera runs the experiment
A/B testing ML models, statistical significance and shadow deployment
Meera had trained a new loan approval model showing 15% better accuracy on the test set. Her manager was excited. The engineering team was ready to deploy.
But Meera had a question nobody had asked: does better accuracy on test data mean better outcomes in production?
A model can have excellent offline metrics and still perform worse in production because the test data is not representative, users behave differently when the model changes, and the business metric is not exactly what the model optimises for.
She insisted on a proper A/B test.
Shadow deployment: the safe first step. Run the new model in shadow mode: it receives the same inputs as the production model, but its outputs are only logged and never returned to users.
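
    import structlog

    # Assumption: a structlog-style logger, which accepts keyword fields
    log = structlog.get_logger()


    class ModelRouter:
        def __init__(self, production_model, shadow_model):
            self.production = production_model
            self.shadow = shadow_model

        def predict(self, features):
            production_result = self.production.predict(features)
            try:
                shadow_result = self.shadow.predict(features)
                log.info("shadow_prediction",
                         production=production_result,
                         shadow=shadow_result)
            except Exception as e:
                # A shadow failure must never affect the user-facing response
                log.warning("shadow_model_error", error=str(e))
            return production_result  # always return production result

Shadow mode lets you verify the new model runs without errors and measure its latency before any user sees it. In this simple version the shadow call runs inline, so a slow shadow model slows every request; moving it to a background worker avoids that.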
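To make the latency and agreement checks concrete, here is a minimal sketch that times both models and runs the shadow call on a background thread. The names TimedShadowRouter, records and agreement_rate are hypothetical, and the in-memory list stands in for whatever log sink you actually use.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TimedShadowRouter:
    """Hypothetical variant of ModelRouter that times both models."""

    def __init__(self, production_model, shadow_model, max_workers=2):
        self.production = production_model
        self.shadow = shadow_model
        self.records = []  # in-memory stand-in for a real log sink
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _run_shadow(self, features, production_result, production_ms):
        try:
            start = time.perf_counter()
            shadow_result = self.shadow.predict(features)
            shadow_ms = (time.perf_counter() - start) * 1000
            self.records.append({
                "production_ms": production_ms,
                "shadow_ms": shadow_ms,
                "agree": shadow_result == production_result,
            })
        except Exception as exc:
            # Shadow errors are recorded, never raised to the caller
            self.records.append({"error": str(exc)})

    def predict(self, features):
        start = time.perf_counter()
        production_result = self.production.predict(features)
        production_ms = (time.perf_counter() - start) * 1000
        # Queue the shadow call so it does not delay the response
        self._pool.submit(self._run_shadow, features,
                          production_result, production_ms)
        return production_result

    def close(self):
        # Wait for queued shadow calls to finish before reading records
        self._pool.shutdown(wait=True)


def agreement_rate(records):
    """Share of successful shadow calls that matched production."""
    compared = [r for r in records if "error" not in r]
    if not compared:
        return None
    return sum(r["agree"] for r in compared) / len(compared)
```

A low agreement rate is not a failure in itself, since the new model is supposed to decide differently in some cases, but it tells you how many users the change would actually touch.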
A/B testing: send X percent of traffic to the new model, the rest to the old model. Measure the business metric on both groups.

    import hashlib

    def get_model_for_user(user_id, experiment_fraction=0.10):
        # Deterministic: same user always sees the same model
        # user_id must be a string; md5 is used for bucketing only, not security
        hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        bucket = hash_value % 100
        return 'model_b' if bucket < experiment_fraction * 100 else 'model_a'

Deterministic assignment is critical. A user must see the same model on every request, not a random one each time; otherwise that user's outcome is influenced by both models and cannot be attributed to either.
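One refinement worth considering is to salt the hash with an experiment name, so that the same users do not land in the treatment group of every experiment you run. The sketch below is an illustration under that assumption; assign_variant, observed_fraction and the experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment_name, experiment_fraction=0.10,
                   buckets=10_000):
    """Deterministic assignment, salted per experiment."""
    key = f"{experiment_name}:{user_id}".encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % buckets
    threshold = round(experiment_fraction * buckets)
    return "model_b" if bucket < threshold else "model_a"


def observed_fraction(n_users, experiment_name, experiment_fraction=0.10):
    """Sanity check: share of synthetic users routed to model_b."""
    hits = sum(
        assign_variant(f"user-{i}", experiment_name, experiment_fraction) == "model_b"
        for i in range(n_users)
    )
    return hits / n_users
```

Running observed_fraction over a large batch of real or synthetic IDs before launch is a cheap way to confirm the split is close to the fraction you intended.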
Calculate required sample size before you start:

    import numpy as np
    from statsmodels.stats.power import TTestIndPower  # power analysis lives in statsmodels, not scipy

    def required_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
        # Standardised effect size: absolute change divided by the baseline standard deviation
        effect_size = mde / np.sqrt(baseline_rate * (1 - baseline_rate))
        n = TTestIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=power)
        return int(np.ceil(n))

    # Example: baseline default rate 12%, want to detect a 2pp improvement
    n = required_sample_size(0.12, 0.02)
    print(f"Need {n} samples per group")  # roughly 4,100

Most teams run A/B tests too short and mistake noise for signal. Always calculate sample size first.
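As a cross-check that needs only the standard library, the usual normal-approximation formula for comparing two proportions can be sketched as follows. For a 12% baseline and a 2pp drop to 10% it comes out a little under 4,000 per group. The helper days_to_fill and its daily_users parameter are hypothetical, and it assumes each user counts once.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Per-group n for a two-sided two-proportion z-test (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


def days_to_fill(n_per_group, daily_users, experiment_fraction):
    """Days until the smaller group reaches n_per_group users."""
    smaller_share = min(experiment_fraction, 1 - experiment_fraction)
    return math.ceil(n_per_group / (daily_users * smaller_share))
```

The second function is the practical one: with only 10 percent of traffic on the new model, the treatment group fills slowly, and that is what sets the minimum test duration.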
Analysing results:

    from scipy import stats

    # Illustrative counts: one row per model, columns are [successes, failures]
    chi2, p_value, dof, expected = stats.chi2_contingency([
        [180, 1620],  # model A: 180 successes out of 1800
        [210, 1590],  # model B: 210 successes out of 1800
    ])

    if p_value < 0.05:
        print("Statistically significant - check effect size, then roll out gradually")
    else:
        print("Not significant at the planned sample size")

Gradual rollout even after significance:
- Week 1: 10 percent to new model — watch for errors
- Week 2: 25 percent — check business metrics
- Week 3: 50/50 — final comparison
- Week 4: 100 percent if all metrics healthy
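One way to implement the schedule above is to keep the hash bucketing and only raise the threshold each week. Because the threshold only grows, anyone already on the new model stays on it as the rollout widens. This is a sketch, not a full rollout system; ROLLOUT_STAGES, model_for_user and next_stage are hypothetical names, and metrics_healthy stands in for whatever checks your team defines.

```python
import hashlib

# Fractions taken from the weekly schedule; operators change the stage index
ROLLOUT_STAGES = [0.10, 0.25, 0.50, 1.00]


def bucket_for(user_id):
    """Stable bucket in [0, 100), same hashing scheme as get_model_for_user."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100


def model_for_user(user_id, stage_index):
    """Route a user according to the current rollout stage."""
    threshold = round(ROLLOUT_STAGES[stage_index] * 100)
    return "model_b" if bucket_for(user_id) < threshold else "model_a"


def next_stage(stage_index, metrics_healthy):
    """Advance one stage when the health check passes, otherwise hold."""
    if not metrics_healthy:
        return stage_index
    return min(stage_index + 1, len(ROLLOUT_STAGES) - 1)
```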
Meera ran the test for 3 weeks. The new model showed statistically significant improvement in approval rate without increasing default rate. She shipped it with confidence.
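A significant p-value does not say how large the improvement is. A minimal sketch for reporting the effect size is a confidence interval for the difference in rates, shown here with a simple Wald interval. The function names and the min_effect threshold are hypothetical; the threshold is a business decision, not a statistical one.

```python
import math
from statistics import NormalDist


def diff_in_proportions_ci(successes_a, n_a, successes_b, n_b, confidence=0.95):
    """Difference in rates (B minus A) with a Wald confidence interval."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


def practically_significant(ci_low, min_effect):
    """True only if even the pessimistic end of the interval clears min_effect."""
    return ci_low >= min_effect


# Illustrative counts from the analysis example: 180/1800 vs 210/1800
diff, low, high = diff_in_proportions_ci(180, 1800, 210, 1800)
print(f"difference = {diff:.4f}, 95% CI = ({low:.4f}, {high:.4f})")
```

For those illustrative counts the interval spans zero, which matches a chi-square test that does not reach significance. An interval that excludes zero but sits below the smallest improvement worth shipping is statistically significant and still not a reason to switch models.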
- Shadow deployment runs the new model without showing results to users, so you catch errors safely first
- Always calculate required sample size before starting an A/B test; most teams run tests too short
- User assignment must be deterministic (same user, same model on every request), not random per request
- A p-value under 0.05 is not enough on its own; also check practical significance and effect size
- Gradual rollout: 10%, then 25%, then 50%, then 100%, with monitoring at each stage