Learnings from Kaggle House Prices Competition - Part 1
I wanted to get familiar with sklearn’s machine learning pipeline, so I started applying it to the Kaggle House Prices Competition. The data consists of housing prices in Ames, Iowa, described by 79 features. I ended up learning a lot through the whole process.
Data Exploration and Cleaning
This was a great dataset (probably purposely so, being a “Getting Started” competition) for practicing a data analysis workflow, since there were so many features. It definitely reminded me that I need to have the patience to look at all the data. I ended up going through a couple of rounds of data cleaning.
To make data cleaning more streamlined in the future, I came up with this workflow:
Steps for reviewing data:
- Review data explanations if available!
- Review missing data
  - Purpose: to omit columns that have too many missing values, and to identify a data cleaning approach (e.g., imputation or another method).
  - Make sure to understand why data is missing. For example, in this dataset a value is NA if the house does not have a certain feature (e.g., a fence). This is valuable information and should be cleaned to represent that, not tossed out (see the missing-data sketch after this list).
- Visualization
  - Purpose: to help understand correlations between the features and the dependent variable, and to surface any quirks of the dataset.
  - Look at the distribution of the output variable - is it normally distributed? If not, consider transforming it (see the visualization sketch after this list).
  - Make pair scatter plots of all the numeric variables.
  - Make boxplots of categorical variables with respect to the dependent variable.
  - Any other plots that might be helpful.
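As promised above, here’s a minimal sketch of the missing-data review. Fence and LotFrontage are actual columns from the Ames data dictionary; the specific cleaning choices are just illustrative:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # the competition's training file

# Rank columns by their fraction of missing values to spot candidates
# for dropping or imputing.
missing_frac = df.isna().mean().sort_values(ascending=False)
print(missing_frac.head(10))

# For features where NA means "the house doesn't have this" (e.g., Fence),
# encode that explicitly instead of tossing the column out.
df["Fence"] = df["Fence"].fillna("NoFence")

# Genuinely missing numeric values (e.g., LotFrontage) are better imputed
# inside the pipeline, so the test set gets the same treatment.
```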
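And a sketch of the visualization checks, continuing from the snippet above. SalePrice is the competition’s dependent variable; the feature columns shown are just a few of the 79:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Distribution of the output variable: SalePrice is right-skewed,
# and a log transform brings it much closer to normal.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df["SalePrice"], kde=True, ax=axes[0]).set_title("SalePrice")
sns.histplot(np.log1p(df["SalePrice"]), kde=True, ax=axes[1]).set_title("log(1 + SalePrice)")

# Pair scatter plots of a few numeric variables against the target.
sns.pairplot(df[["SalePrice", "GrLivArea", "TotalBsmtSF", "YearBuilt"]])

# Boxplot of a categorical variable with respect to the target.
plt.figure()
sns.boxplot(data=df, x="OverallQual", y="SalePrice")
plt.show()
```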
The sklearn pipeline
First, some terminology
One thing I learned is that many words in machine learning sound scary but describe very simple concepts. For example, “pipeline” sounds intimidating, but it’s just a fancy word for a way to apply a series of functions to your data, whether to clean it, transform it, or model it.
Below are some of the words I’ll be using throughout:
| Terminology | In “English” |
|---|---|
| Pipeline | An object you can use to apply many functions to your data (for data cleaning, transforming, or modeling) |
| Transformer | A function that “transforms” your data, which can be as simple as scaling it |
| Custom Transformer | A transformer you write yourself that can be plugged into your pipeline |
| Feature Engineering | Using the variables (features) you have to come up with new ones that may be helpful. For example, if you have square footage and number of rooms, maybe you want to know the square footage per room. |
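To make these terms concrete, here’s a minimal sketch of a two-step pipeline (the step names and model choice are just illustrative):

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A pipeline is just a named series of steps: zero or more transformers,
# then (optionally) a model as the final step.
model = Pipeline(steps=[
    ("scale", StandardScaler()),      # transformer: standardizes each feature
    ("regress", LinearRegression()),  # final estimator
])

# Calling fit() pushes the data through every step in order:
# model.fit(X_train, y_train)
```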
The Pipeline
Using sklearn’s pipeline was super streamlined! Below is a sketch of what the pipeline looks like. The ColumnTransformer() function allows you to apply different pipelines to the data by column name. In this case, I had 3 separate pipelines, one per data type. Each pipeline is composed of sklearn functions or custom transformers.
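The steps and column lists below are a sketch, not the literal code; the column names come from the Ames data dictionary, and the imputation/encoding choices are illustrative:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# One pipeline per data type.
numeric_pipeline = Pipeline(steps=[
    # ("add_home_area", ...),  # custom transformer, defined in the next section
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

ordinal_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OrdinalEncoder()),
])

nominal_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="constant", fill_value="None")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# ColumnTransformer routes each set of columns to its own pipeline by name.
preprocessor = ColumnTransformer(transformers=[
    ("numeric", numeric_pipeline, ["GrLivArea", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]),
    ("ordinal", ordinal_pipeline, ["ExterQual", "KitchenQual"]),
    ("nominal", nominal_pipeline, ["Neighborhood", "Fence"]),
])
```

The preprocessor can then be dropped into an outer Pipeline with a model as the final step, just like the minimal example earlier.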
Custom Transformers
A useful thing I learned is that you can create custom transformers and add them to your pipeline. For example, I created an add_home_area custom transformer that adds the total home area to the dataset as a feature engineering step. This is used in the numeric pipeline above.
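Here’s a minimal sketch of what that can look like, using sklearn’s FunctionTransformer. The columns summed are my assumption about what “total home area” means, using Ames column names:

```python
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

def add_home_area(X: pd.DataFrame) -> pd.DataFrame:
    """Feature engineering: add a total home area column."""
    X = X.copy()  # don't mutate the caller's DataFrame
    # Assumed definition of "home area": basement plus both above-ground floors.
    X["TotalHomeArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
    return X

# Wrap the function so it can slot into a pipeline as a transformer.
home_area_transformer = FunctionTransformer(add_home_area)
```

Because this step reads raw columns by name, it should go first in the numeric pipeline, before the imputer converts the DataFrame to a plain array.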