Vikram teaches the model to see
Feature engineering, data pipelines and why raw data is never model-ready
Vikram had a model predicting whether a user would click on a product recommendation. Accuracy was 62%. His manager wanted 75%.
He spent a week trying different algorithms — Random Forest, XGBoost, LightGBM. None broke 65%.
Then a senior colleague asked: What features are you using?
Vikram listed them: user_id, product_id, category, price.
The colleague said: You have transaction history for every user. You have time of day. You have product views in the last 7 days. You are feeding the model almost nothing and expecting it to learn everything.
That conversation changed how Vikram thought about machine learning.
Feature engineering is the process of transforming raw data into inputs that make it easy for a model to learn patterns.
Raw: user_id, product_id, timestamp, clicked=True
Engineered features:
- user_purchase_count_last_30d (how active is this user?)
- product_view_count_last_7d (is this product trending?)
- hour_of_day (people shop differently at 2pm vs 11pm)
- days_since_last_purchase (is this user churning?)
- category_affinity_score (does this user love electronics?)
- price_vs_user_avg_spend (is this product within their budget?)
The model cannot invent these features. You have to create them.
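As a rough illustration of what "creating" a feature means, the sketch below derives two of the features listed above, hour_of_day and price_vs_user_avg_spend, from raw rows. The function name, the two input tables and their column names (price, amount, timestamp) are assumptions made for this example, not Vikram's actual schema.

```python
import pandas as pd

def build_example_features(events: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    """Attach hour_of_day and price_vs_user_avg_spend to each recommendation event.

    events: one row per recommendation shown (user_id, product_id, price, timestamp)
    orders: one row per past order (user_id, amount)
    Both layouts are hypothetical and only serve this example.
    """
    out = events.copy()
    out["hour_of_day"] = pd.to_datetime(out["timestamp"]).dt.hour

    # Average historical spend per user, computed from past orders
    avg_spend = (
        orders.groupby("user_id", as_index=False)["amount"]
        .mean()
        .rename(columns={"amount": "user_avg_spend"})
    )
    out = out.merge(avg_spend, on="user_id", how="left")

    # Ratio above 1 means the product costs more than the user usually spends;
    # users with no order history get NaN and need a separate default.
    out["price_vs_user_avg_spend"] = out["price"] / out["user_avg_spend"]
    return out

events = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "product_id": ["p1", "p2"],
    "price": [500.0, 300.0],
    "timestamp": ["2024-01-05 14:30:00", "2024-01-05 23:10:00"],
})
orders = pd.DataFrame({"user_id": ["u1", "u1"], "amount": [200.0, 300.0]})

features = build_example_features(events, orders)
```

Here u1 has an average spend of 250, so a 500 product gets a ratio of 2.0, while u2 has no orders and is left as NaN. How to fill that gap (global average, a flag column, or leaving it to the model) is a design choice the sketch does not make.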
Building a feature pipeline:
    import pandas as pd
    from datetime import datetime, timedelta

    def create_user_features(df, reference_date):
        # df is expected to hold one row per order, with a datetime 'date' column
        cutoff = reference_date - timedelta(days=30)
        return df.groupby('user_id').agg(
            # Orders placed in the 30 days before reference_date
            purchase_count_30d=('date', lambda x: (x >= cutoff).sum()),
            avg_order_value=('amount', 'mean'),
            days_since_last_purchase=('date', lambda x: (reference_date - x.max()).days),
            # mode() returns every tied value; keep the first
            favourite_category=('category', lambda x: x.mode()[0])
        ).reset_index()

Time-based features are often the most powerful:
    def add_time_features(df):
        ts = pd.to_datetime(df['timestamp'])  # parse once, reuse below
        df['hour'] = ts.dt.hour
        df['day_of_week'] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
        df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
        df['is_evening'] = df['hour'].between(18, 23).astype(int)  # 6pm to midnight
        return df

In Indian e-commerce, payday timing matters enormously. Orders spike on the 1st and 15th of the month. Days until payday is a genuinely predictive feature.
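A days-until-payday feature could be computed along these lines. This is a minimal sketch that assumes paydays fall on the 1st and 15th, as stated above; the function name and the PAYDAYS constant are made up for the example.

```python
import calendar
from datetime import date

# Assumption taken from the text: salaries land on the 1st and 15th
PAYDAYS = (1, 15)

def days_until_payday(d: date, paydays=PAYDAYS) -> int:
    """Days from d to the next payday (0 if d is itself a payday)."""
    remaining = [p - d.day for p in paydays if p >= d.day]
    if remaining:
        return min(remaining)
    # No payday left this month: count through to the first payday of next month
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    return days_in_month - d.day + min(paydays)

print(days_until_payday(date(2024, 1, 10)))  # 5 days to the 15th
print(days_until_payday(date(2024, 1, 20)))  # 12 days to 1 February
print(days_until_payday(date(2024, 2, 29)))  # 1 day to 1 March
```

On a DataFrame, something like pd.to_datetime(df['timestamp']).dt.date.map(days_until_payday) would add the column. Real pay cycles vary by employer, so the payday list is best treated as a parameter to tune against your own order data.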
Handling categorical features. Models understand numbers, not strings:
    from category_encoders import TargetEncoder

    # For low-cardinality (under 10 unique values): one-hot encoding
    df = pd.get_dummies(df, columns=['day_of_week'])

    # For high-cardinality (product IDs, cities): target encoding
    # Fit the encoder on training rows only, otherwise the target leaks into the feature
    encoder = TargetEncoder(cols=['product_id'])
    df['product_id_encoded'] = encoder.fit_transform(df[['product_id']], df['clicked'])['product_id']

Feature stores manage features at scale. Once you have hundreds of features, a feature store serves them consistently between training and production:
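    pip install feast

    from datetime import timedelta
    from feast import FeatureStore, FeatureView, Field
    from feast.types import Float64, Int64

    # Recent Feast releases declare features with schema=[Field(...)].
    # `user` (an Entity keyed on user_id) and `user_stats_source` (the data source)
    # are placeholders, assumed to be defined elsewhere in the feature repo.
    user_stats = FeatureView(
        name="user_purchase_stats",
        entities=[user],
        ttl=timedelta(days=1),
        schema=[
            Field(name="purchase_count_30d", dtype=Int64),
            Field(name="avg_order_value", dtype=Float64),
        ],
        source=user_stats_source,
    )

    # Online serving (repo_path is an assumption: the current directory)
    store = FeatureStore(repo_path=".")
    feature_vector = store.get_online_features(
        features=["user_purchase_stats:purchase_count_30d"],
        entity_rows=[{"user_id": "user_123"}]
    ).to_dict()

After adding 15 new engineered features, Vikram's model accuracy jumped to 78%. The algorithm had barely changed. The data changed.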
Key takeaways:
- Feature engineering often improves model accuracy more than changing the algorithm
- Time-based features (hour, day of week, days since event) are frequently among the most predictive
- Target encoding usually handles high-cardinality categoricals (product IDs, city names) better than one-hot encoding, which would add one column per distinct value
- A feature store ensures training and production use identical feature computation, which prevents training-serving skew
- In Indian e-commerce, payday timing, festival seasons, and regional patterns are powerful signals
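To make the target-encoding takeaway concrete, here is a hand-rolled sketch of the idea: replace each category with its average click rate, pulled towards the global rate when the category has few rows. It uses simple additive smoothing, which is an assumption chosen for readability and not necessarily the formula category_encoders applies. The function name and the smoothing parameter are hypothetical.

```python
import pandas as pd

def smoothed_target_encode(train: pd.DataFrame, column: str, target: str, smoothing: float = 10.0):
    """Return (category -> smoothed target mean, global target mean).

    Categories with few rows are pulled towards the global mean, so a product
    seen once does not get an extreme encoding of 0.0 or 1.0.
    """
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(["mean", "count"])
    encoding = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return encoding, global_mean

# Learn the encoding from training rows only
train = pd.DataFrame({
    "product_id": ["p1", "p1", "p1", "p2"],
    "clicked": [1, 1, 0, 0],
})
encoding, global_mean = smoothed_target_encode(train, "product_id", "clicked", smoothing=2.0)

# Apply it to new rows; unseen products fall back to the global click rate
new_rows = pd.DataFrame({"product_id": ["p1", "p9"]})
new_rows["product_id_encoded"] = new_rows["product_id"].map(encoding).fillna(global_mean)
```

With these four rows the global click rate is 0.5, p1 encodes to 0.6 and the unseen p9 falls back to 0.5. The result is a single numeric column however many products exist, which is why this approach tends to scale better than one-hot encoding for product IDs.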