AI Ops Ch 7 / 9 Intermediate

Vikram teaches the model to see

Feature engineering, data pipelines and why raw data is never model-ready

⏱ 11 min 5 commands 5 takeaways

In this chapter

Vikram

ML engineer, e-commerce recommendation team

The story

Vikram had a model predicting whether a user would click on a product recommendation. Accuracy was 62%. His manager wanted 75%.

He spent a week trying different algorithms — Random Forest, XGBoost, LightGBM. None broke 65%.

Then a senior colleague asked: What features are you using?

Vikram listed them: user_id, product_id, category, price.

The colleague said: You have transaction history for every user. You have time of day. You have product views in the last 7 days. You are feeding the model almost nothing and expecting it to learn everything.

That conversation changed how Vikram thought about machine learning.

Feature engineering is the process of transforming raw data into inputs that make it easy for a model to learn patterns.

Raw: user_id, product_id, timestamp, clicked=True

Engineered features:

- user_purchase_count_last_30d (how active is this user?)

- product_view_count_last_7d (is this product trending?)

- hour_of_day (people shop differently at 2pm vs 11pm)

- days_since_last_purchase (is this user churning?)

- category_affinity_score (does this user love electronics?)

- price_vs_user_avg_spend (is this product within their budget?)

The model cannot invent these features. You have to create them.
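As a rough illustration of two of the features listed above, here is a minimal sketch that derives product_view_count_last_7d and price_vs_user_avg_spend with pandas. The toy tables, their column names (viewed_at, amount, price) and the reference date are assumptions made up for this example, not Vikram's actual schema.

```python
import pandas as pd

# Hypothetical toy data; table and column names are assumptions for illustration.
views = pd.DataFrame({
    "product_id": ["p1", "p1", "p2", "p1", "p2"],
    "viewed_at": pd.to_datetime(
        ["2024-03-01", "2024-03-08", "2024-03-09", "2024-03-10", "2024-02-20"]
    ),
})
orders = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "amount": [500.0, 1500.0, 200.0],
})
candidates = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "product_id": ["p1", "p2"],
    "price": [2000.0, 100.0],
})
reference_date = pd.Timestamp("2024-03-10")

# product_view_count_last_7d: views in the 7 days ending at reference_date
window_start = reference_date - pd.Timedelta(days=7)
recent = views[(views["viewed_at"] > window_start) & (views["viewed_at"] <= reference_date)]
view_counts = (
    recent.groupby("product_id").size().rename("product_view_count_last_7d").reset_index()
)

# Average historical spend per user, used as the denominator of the price ratio
avg_spend = orders.groupby("user_id")["amount"].mean().rename("user_avg_spend").reset_index()

features = (
    candidates
    .merge(view_counts, on="product_id", how="left")
    .merge(avg_spend, on="user_id", how="left")
)
# Products with no recent views get a count of 0 rather than NaN
features["product_view_count_last_7d"] = (
    features["product_view_count_last_7d"].fillna(0).astype(int)
)
# price_vs_user_avg_spend: above 1 means pricier than this user's typical order
features["price_vs_user_avg_spend"] = features["price"] / features["user_avg_spend"]
```

Left joins keep every candidate row even when a user or product has no history, which is usually what a recommendation model needs at scoring time.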

Building a feature pipeline:

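import pandas as pd
from datetime import datetime, timedelta

def create_user_features(df, reference_date):
    """Aggregate order-level rows into one row of features per user.

    Assumes df has a unique index and a datetime 'date' column.
    """
    window_start = reference_date - timedelta(days=30)
    return df.groupby('user_id').agg(
        # Orders placed on or after window_start (the last 30 days)
        purchase_count_30d=('order_id', lambda x: x[
            df.loc[x.index, 'date'] >= window_start
        ].count()),
        avg_order_value=('amount', 'mean'),
        days_since_last_purchase=('date', lambda x: (reference_date - x.max()).days),
        # Most frequent category; ties resolve to the first mode
        favourite_category=('category', lambda x: x.mode().iloc[0])
    ).reset_index()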
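Per-group lambdas like the ones above can become slow on large transaction tables. One possible alternative, sketched here under the assumption that df has user_id, amount and a datetime date column, is to flag the 30-day window once and then use built-in aggregations. The function name create_user_features_fast and the toy data are hypothetical, and this variant counts rows rather than non-null order_id values.

```python
import pandas as pd

def create_user_features_fast(df, reference_date):
    """Vectorized sketch of the per-user aggregation (no per-group lambdas)."""
    window_start = reference_date - pd.Timedelta(days=30)
    # Flag orders inside the 30-day window once, for the whole table
    flagged = df.assign(in_30d=(df["date"] >= window_start).astype(int))
    out = flagged.groupby("user_id").agg(
        purchase_count_30d=("in_30d", "sum"),
        avg_order_value=("amount", "mean"),
        last_purchase=("date", "max"),
    ).reset_index()
    # Recency in whole days, relative to the reference date
    out["days_since_last_purchase"] = (reference_date - out["last_purchase"]).dt.days
    return out.drop(columns="last_purchase")

# Hypothetical toy orders for illustration
orders = pd.DataFrame({
    "user_id": ["u1", "u1", "u2"],
    "order_id": [1, 2, 3],
    "amount": [100.0, 300.0, 50.0],
    "date": pd.to_datetime(["2024-01-01", "2024-03-01", "2024-03-05"]),
})
user_features = create_user_features_fast(orders, pd.Timestamp("2024-03-10"))
```

Passing reference_date explicitly, rather than calling the current time inside the function, keeps the features reproducible when the pipeline is re-run on historical data.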

Time-based features are often the most powerful:

def add_time_features(df):
    df = df.copy()  # avoid mutating the caller's DataFrame
    ts = pd.to_datetime(df['timestamp'])  # parse once, reuse below
    df['hour'] = ts.dt.hour
    df['day_of_week'] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
    df['is_evening'] = df['hour'].between(18, 23).astype(int)  # 18:00 to 23:59
    return df

In Indian e-commerce, payday timing matters enormously. Orders spike on the 1st and 15th of the month. Days until payday is a genuinely predictive feature.
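A days-until-payday feature could be computed along these lines. This is a minimal sketch that takes the 1st and 15th from the text as the assumed paydays; real pay cycles vary by employer, and the function name and paydays parameter are hypothetical.

```python
import pandas as pd

def days_until_payday(ts, paydays=(1, 15)):
    """Days from ts to the next assumed payday; 0 if ts falls on one."""
    ts = pd.Timestamp(ts).normalize()  # drop the time-of-day component
    first_of_month = ts.replace(day=1)
    first_of_next_month = first_of_month + pd.offsets.MonthBegin(1)
    # Candidate paydays in the current month and the following month
    candidates = [
        month_start.replace(day=day)
        for month_start in (first_of_month, first_of_next_month)
        for day in paydays
    ]
    return min((c - ts).days for c in candidates if c >= ts)

# Hypothetical toy events for illustration
events = pd.DataFrame({
    "timestamp": ["2024-03-10 14:00", "2024-03-15 09:30", "2024-03-20 23:10"],
})
events["days_until_payday"] = pd.to_datetime(events["timestamp"]).map(days_until_payday)
```

The paydays tuple only supports day numbers that exist in every month (1 to 28), which is enough for the 1st-and-15th pattern described here.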

Handling categorical features. Models understand numbers, not strings:

from category_encoders import TargetEncoder

# For low-cardinality (under 10 unique values): one-hot encoding
df = pd.get_dummies(df, columns=['day_of_week'])

# For high-cardinality (product IDs, cities): target encoding
# In practice, fit on training rows only; fitting on rows you later evaluate on leaks the target
# cols is set explicitly so numeric IDs are encoded too
encoder = TargetEncoder(cols=['product_id'])
df['product_id_encoded'] = encoder.fit_transform(df[['product_id']], df['clicked'])['product_id']
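To make the idea behind target encoding concrete, and to show why the encoder should be fitted on training rows only, here is a hand-rolled sketch in plain pandas. It uses simple additive smoothing, which is an assumption chosen for illustration and not necessarily what category_encoders does internally; the function names and toy data are hypothetical.

```python
import pandas as pd

def fit_target_encoding(train, col, target, smoothing=10.0):
    """Learn smoothed per-category target means from training rows only."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    # Shrink categories with few rows toward the global mean
    encoded = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return encoded, global_mean

def apply_target_encoding(df, col, encoded, global_mean):
    """Map categories to learned values; unseen categories get the global mean."""
    return df[col].map(encoded).fillna(global_mean)

# Hypothetical toy split for illustration
train = pd.DataFrame({"product_id": ["a", "a", "a", "b"], "clicked": [1, 1, 0, 0]})
test = pd.DataFrame({"product_id": ["a", "c"]})

encoded, global_mean = fit_target_encoding(train, "product_id", "clicked", smoothing=1.0)
train["product_id_encoded"] = apply_target_encoding(train, "product_id", encoded, global_mean)
test["product_id_encoded"] = apply_target_encoding(test, "product_id", encoded, global_mean)
```

Because the mapping is learned from train alone, the test rows never contribute their own labels to their encoded values, and a product that was never seen in training ("c" here) falls back to the global click rate.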

Feature stores manage features at scale. Once you have hundreds of features, a feature store serves them consistently between training and production:

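pip install feast

from datetime import timedelta
from feast import Entity, FeatureStore, FeatureView, Field, FileSource
from feast.types import Float64, Int64

# Uses the Field/schema API from recent Feast releases; older releases used Feature/features
user = Entity(name="user", join_keys=["user_id"])

# Hypothetical offline source; point this at your own feature table
user_stats_source = FileSource(
    path="data/user_stats.parquet",
    timestamp_field="event_timestamp",
)

user_stats = FeatureView(
    name="user_purchase_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="purchase_count_30d", dtype=Int64),
        Field(name="avg_order_value", dtype=Float64),
    ],
    source=user_stats_source,
)

# Online serving (assumes the feature repo has been applied and materialized)
store = FeatureStore(repo_path=".")
feature_vector = store.get_online_features(
    features=["user_purchase_stats:purchase_count_30d"],
    entity_rows=[{"user_id": "user_123"}]
).to_dict()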
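The consistency a feature store provides can be illustrated without any library: define the feature once and call the same definition from both the training path and the serving path. The sketch below is plain Python and is not how Feast is implemented; the function name and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta

def purchase_count_30d(purchase_dates, as_of):
    """Single definition of the feature, shared by training and serving."""
    window_start = as_of - timedelta(days=30)
    return sum(1 for d in purchase_dates if window_start <= d <= as_of)

# Hypothetical purchase history for one user
purchases = [datetime(2024, 1, 5), datetime(2024, 2, 20), datetime(2024, 3, 1)]

# Training: the value as it would have looked at the time of a historical event
train_value = purchase_count_30d(purchases, as_of=datetime(2024, 2, 25))

# Serving: the same function, evaluated at request time
serve_value = purchase_count_30d(purchases, as_of=datetime(2024, 3, 10))
```

Skew typically creeps in when the training job and the serving code each reimplement the window logic separately and drift apart; sharing one definition, or delegating it to a feature store, removes that risk.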

After adding 15 new engineered features, Vikram's model accuracy jumped to 78%. The algorithm had barely changed. The data changed.

Key takeaways

Feature engineering often improves model accuracy more than changing the algorithm

Time-based features (hour, day of week, days since event) are frequently the most predictive

Target encoding handles high-cardinality categoricals (product IDs, city names) better than one-hot

A feature store serves the same feature definitions to training and production, which guards against training-serving skew

In Indian e-commerce: payday timing, festival seasons, and regional patterns are powerful signals

Commands from this chapter

$ df.groupby('user_id').agg({'order_id':'count','amount':'mean'})

Aggregate user-level features from transaction history

$ pd.get_dummies(df, columns=['category'])

One-hot encode low-cardinality categorical features

$ pip install feast category_encoders

Install feature store and encoding libraries

$ pd.to_datetime(df['timestamp']).dt.hour

Extract hour of day as a time-based feature

$ df['days_since'] = (pd.Timestamp.now() - df['event_date']).dt.days

Calculate recency feature