Overview
A complete machine learning pipeline for predicting residential property prices: data collection, preprocessing, model training, evaluation, and inference. The goal was to build something end-to-end rather than training on a pre-cleaned Kaggle dataset — which meant dealing with real-world messiness.
Data Collection
The dataset was self-collected from publicly available real estate listings in a target geographic region. Features captured include:
- Square footage (gross living area)
- Lot size
- Number of bedrooms and bathrooms
- Year built / year renovated
- Neighbourhood (categorical)
- Property type (house, townhouse, condo)
- Distance to transit nodes
Collecting data manually surfaces problems that pre-packaged datasets hide: inconsistent units, missing fields, duplicate listings, and listings with obvious data entry errors.
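A minimal cleaning sketch for the issues above, using hypothetical column names (`listing_id`, `sqft`, `price` are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical listing data; column names are illustrative.
listings = pd.DataFrame({
    'listing_id': [101, 102, 102, 103],
    'sqft': [1400, 2100, 2100, 90000],   # 90000 is an obvious entry error
    'price': [450_000, 610_000, 610_000, 500_000],
})

# Drop exact duplicate listings (e.g. the same property reposted).
listings = listings.drop_duplicates(subset='listing_id')

# Flag implausible values for manual review rather than silently dropping them.
plausible = listings['sqft'].between(200, 20_000)
flagged = listings[~plausible]
clean = listings[plausible]
```

Keeping flagged rows in a separate frame makes the cleaning auditable: every exclusion can be inspected later.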
Preprocessing Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Feature lists mirror the collected fields above (names are illustrative).
numeric_features = ['sqft', 'lot_size', 'bedrooms', 'bathrooms',
                    'year_built', 'year_renovated', 'transit_distance']
categorical_features = ['neighbourhood', 'property_type']

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features),
])
```
Imputation strategy: Numeric features use median imputation (robust to outliers). Categorical features use the mode. Missing lot size values — common for condos — are imputed separately with a condo-specific median.
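The condo-specific median can be computed with a group-wise fill before the data enters the sklearn pipeline. A sketch with illustrative values:

```python
import pandas as pd

# Illustrative frame; lot_size is often missing for condos.
df = pd.DataFrame({
    'property_type': ['house', 'condo', 'condo', 'condo', 'house'],
    'lot_size': [5000.0, 800.0, None, 1200.0, 6000.0],
})

# Fill each missing lot size with the median of the same property type,
# so condos are imputed from condos rather than from houses.
df['lot_size'] = df.groupby('property_type')['lot_size'].transform(
    lambda s: s.fillna(s.median())
)
```

`transform` keeps the original row order, so the result drops straight back into the frame.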
Encoding: One-hot encoding for neighbourhood and property type. The handle_unknown='ignore' flag ensures inference on new neighbourhoods doesn't crash the pipeline.
Model: Random Forest Regressor
```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=1000,      # validation error plateaus past ~500 trees
    max_features='sqrt',    # random feature subsets decorrelate the trees
    min_samples_leaf=4,     # smooths leaf predictions against outliers
    n_jobs=-1,
    random_state=42,
)
```
Random Forest was chosen over linear regression for its ability to capture non-linear interactions (e.g., square footage matters more in certain neighbourhoods) and its robustness to outliers without requiring explicit feature scaling.
n_estimators=1000 balances variance reduction against training time. Beyond ~500 trees, validation error plateaus — 1000 was chosen conservatively.
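The plateau can be checked directly with a warm-started forest and out-of-bag scoring; this sketch uses synthetic data as a stand-in for the listings matrix:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the real pipeline uses the listings dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Grow one forest incrementally and watch out-of-bag R² level off.
rf = RandomForestRegressor(warm_start=True, oob_score=True,
                           max_features='sqrt', random_state=42)
oob = {}
for n in (50, 200, 500, 1000):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)            # adds trees instead of refitting from scratch
    oob[n] = rf.oob_score_
```

With `warm_start=True`, each `fit` call only grows the additional trees, so the whole sweep costs roughly as much as a single 1000-tree fit.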
Evaluation
5-fold cross-validation on the training set:
| Metric | Value |
|---|---|
| MAE | within competitive range for dataset size |
| R² | demonstrates meaningful predictive power |
| Feature importance (top 3) | square footage, neighbourhood, year built |
Feature importance analysis confirmed the intuitive hierarchy: location and size dominate, with renovation year providing meaningful signal for older properties.
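The evaluation above can be reproduced with `cross_validate` plus the fitted forest's importances; synthetic data stands in for the preprocessed listings matrix:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the preprocessed listings matrix.
X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)

# 5-fold CV scoring both MAE and R² in one pass.
scores = cross_validate(model, X, y, cv=5,
                        scoring=('neg_mean_absolute_error', 'r2'))
mae = -scores['test_neg_mean_absolute_error'].mean()
r2 = scores['test_r2'].mean()

# Refit on the full set to read impurity-based feature importances.
model.fit(X, y)
top3 = model.feature_importances_.argsort()[::-1][:3]
```

sklearn reports MAE as a negative score (larger is better), hence the sign flip when averaging.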
Lessons Learned
The most time-consuming part was data cleaning, not model training — consistent with the "80% data prep" rule of thumb. The biggest data quality issue was inconsistent square footage reporting (some listings used gross area, others used main floor only).
The model performs well within the training distribution but generalizes poorly to luxury properties (>3× median price) and rural lots, which were underrepresented in the dataset. A production system would need stratified sampling or separate models for distinct market segments.
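Stratified sampling by price segment could look like the following sketch; the prices and two-bin split are illustrative, and a production system would choose segments from domain knowledge rather than quantiles alone:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative prices; quantile bins stand in for market segments.
prices = pd.Series([300_000, 320_000, 350_000, 400_000, 420_000,
                    450_000, 500_000, 900_000, 1_500_000, 2_000_000])
segment = pd.qcut(prices, q=2, labels=['mainstream', 'upper'])

# Stratifying on segment keeps rarer price bands represented in both splits.
train_idx, test_idx = train_test_split(prices.index, test_size=0.5,
                                       random_state=0, stratify=segment)
```

The same segment labels could also drive separate per-segment models instead of a single forest.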