Overview
A complete machine learning pipeline for predicting residential property prices: data collection, preprocessing, model training, evaluation, and inference. The goal was to build something end-to-end rather than training on a pre-cleaned Kaggle dataset — which meant dealing with real-world messiness.
Data Collection
The dataset was self-collected from publicly available real estate listings in a target geographic region. Features captured include:
- Square footage (gross living area)
- Lot size
- Number of bedrooms and bathrooms
- Year built / year renovated
- Neighbourhood (categorical)
- Property type (house, townhouse, condo)
- Distance to transit nodes
Collecting data manually surfaces problems that pre-packaged datasets hide: inconsistent units, missing fields, duplicate listings, and listings with obvious data entry errors.
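A minimal cleaning sketch for the issues above, using hypothetical column names (`listing_id`, `sqft`, `price` are illustrative, not the project's actual schema):

```python
import pandas as pd

# Hypothetical listing data; column names are illustrative.
listings = pd.DataFrame({
    'listing_id': [101, 102, 102, 103],
    'sqft': [1400, 2100, 2100, 90000],   # 90000 is an obvious entry error
    'price': [450_000, 610_000, 610_000, 500_000],
})

# Drop exact duplicate listings (e.g. the same property reposted).
listings = listings.drop_duplicates(subset='listing_id')

# Flag implausible values for manual review rather than silently dropping them.
plausible = listings['sqft'].between(200, 20_000)
flagged = listings[~plausible]
clean = listings[plausible]
```

Keeping flagged rows in a separate frame makes the cleaning auditable: every exclusion can be inspected later.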
Preprocessing Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Feature lists mirror the collected fields above (names are illustrative).
numeric_features = ['sqft', 'lot_size', 'bedrooms', 'bathrooms',
                    'year_built', 'year_renovated', 'transit_distance']
categorical_features = ['neighbourhood', 'property_type']

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
])

categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False)),
])

preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features),
])
```
Imputation strategy: Numeric features use median imputation (robust to outliers). Categorical features use the mode. Missing lot size values — common for condos — are imputed separately with a condo-specific median.
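The condo-specific median can be computed with a group-wise fill before the data enters the sklearn pipeline. A sketch with illustrative values:

```python
import pandas as pd

# Illustrative frame; lot_size is often missing for condos.
df = pd.DataFrame({
    'property_type': ['house', 'condo', 'condo', 'condo', 'house'],
    'lot_size': [5000.0, 800.0, None, 1200.0, 6000.0],
})

# Fill each missing lot size with the median of the same property type,
# so condos are imputed from condos rather than from houses.
df['lot_size'] = df.groupby('property_type')['lot_size'].transform(
    lambda s: s.fillna(s.median())
)
```

`transform` keeps the original row order, so the result drops straight back into the frame.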
Encoding: One-hot encoding for neighbourhood and property type. The handle_unknown='ignore' flag ensures inference on new neighbourhoods doesn't crash the pipeline.
Model: Random Forest Regressor
```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=1000,      # validation error plateaus past ~500 trees
    max_features='sqrt',    # random feature subsets decorrelate the trees
    min_samples_leaf=4,     # smooths leaf predictions against outliers
    n_jobs=-1,
    random_state=42,
)
```
Random Forest was chosen over linear regression for its ability to capture non-linear interactions (e.g., square footage matters more in certain neighbourhoods) and its robustness to outliers without requiring explicit feature scaling.
n_estimators=1000 balances variance reduction against training time. Beyond ~500 trees, validation error plateaus — 1000 was chosen conservatively.
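The plateau can be checked directly with a warm-started forest and out-of-bag scoring; this sketch uses synthetic data as a stand-in for the listings matrix:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data; the real pipeline uses the listings dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# Grow one forest incrementally and watch out-of-bag R² level off.
rf = RandomForestRegressor(warm_start=True, oob_score=True,
                           max_features='sqrt', random_state=42)
oob = {}
for n in (50, 200, 500, 1000):
    rf.set_params(n_estimators=n)
    rf.fit(X, y)            # adds trees instead of refitting from scratch
    oob[n] = rf.oob_score_
```

With `warm_start=True`, each `fit` call only grows the additional trees, so the whole sweep costs roughly as much as a single 1000-tree fit.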
Evaluation
5-fold cross-validation on the training set:
| Metric | Value |
|---|---|
| MAE | within competitive range for dataset size |
| R² | demonstrates meaningful predictive power |
| Feature importance (top 3) | square footage, neighbourhood, year built |
Feature importance analysis confirmed the intuitive hierarchy: location and size dominate, with renovation year providing meaningful signal for older properties.
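The evaluation above can be reproduced with `cross_validate` plus the fitted forest's importances; synthetic data stands in for the preprocessed listings matrix:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the preprocessed listings matrix.
X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)

# 5-fold CV scoring both MAE and R² in one pass.
scores = cross_validate(model, X, y, cv=5,
                        scoring=('neg_mean_absolute_error', 'r2'))
mae = -scores['test_neg_mean_absolute_error'].mean()
r2 = scores['test_r2'].mean()

# Refit on the full set to read impurity-based feature importances.
model.fit(X, y)
top3 = model.feature_importances_.argsort()[::-1][:3]
```

sklearn reports MAE as a negative score (larger is better), hence the sign flip when averaging.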
Lessons Learned
The most time-consuming part was data cleaning, not model training — consistent with the "80% data prep" rule of thumb. The biggest data quality issue was inconsistent square footage reporting (some listings used gross area, others used main floor only).
The model performs well within the training distribution but generalizes poorly to luxury properties (>3× median price) and rural lots, which were underrepresented in the dataset. A production system would need stratified sampling or separate models for distinct market segments.
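Stratified sampling by price segment could look like the following sketch; the prices and two-bin split are illustrative, and a production system would choose segments from domain knowledge rather than quantiles alone:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative prices; quantile bins stand in for market segments.
prices = pd.Series([300_000, 320_000, 350_000, 400_000, 420_000,
                    450_000, 500_000, 900_000, 1_500_000, 2_000_000])
segment = pd.qcut(prices, q=2, labels=['mainstream', 'upper'])

# Stratifying on segment keeps rarer price bands represented in both splits.
train_idx, test_idx = train_test_split(prices.index, test_size=0.5,
                                       random_state=0, stratify=segment)
```

The same segment labels could also drive separate per-segment models instead of a single forest.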