{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.036229,
"end_time": "2021-10-27T20:04:18.236097",
"exception": false,
"start_time": "2021-10-27T20:04:18.199868",
"status": "completed"
},
"tags": []
},
"source": [
"# Introduction #\n",
"\n",
"Welcome to the feature engineering project for the [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) competition! This competition uses nearly the same data you used in the exercises of the [Feature Engineering](https://www.kaggle.com/learn/feature-engineering) course. \n",
"\n",
"\n",
"# Step 1 - Preliminaries #\n",
"## Imports and Configuration ##\n",
"\n",
"We'll start by importing the packages we used in the exercises and setting some notebook defaults. Unhide this cell if you'd like to see the libraries we'll use:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"_kg_hide-input": true,
"execution": {
"iopub.execute_input": "2021-10-27T20:04:18.320753Z",
"iopub.status.busy": "2021-10-27T20:04:18.319486Z",
"iopub.status.idle": "2021-10-27T20:04:20.118899Z",
"shell.execute_reply": "2021-10-27T20:04:20.117383Z"
},
"papermill": {
"duration": 1.84978,
"end_time": "2021-10-27T20:04:20.119141",
"exception": false,
"start_time": "2021-10-27T20:04:18.269361",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"\n",
"import os\n",
"import warnings\n",
"from pathlib import Path\n",
"\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"from IPython.display import display\n",
"from pandas.api.types import CategoricalDtype\n",
"\n",
"from category_encoders import MEstimateEncoder\n",
"from sklearn.cluster import KMeans\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.feature_selection import mutual_info_regression\n",
"from sklearn.model_selection import KFold, cross_val_score\n",
"from xgboost import XGBRegressor\n",
"\n",
"\n",
"# Set Matplotlib defaults\n",
"plt.style.use(\"seaborn-whitegrid\")\n",
"plt.rc(\"figure\", autolayout=True)\n",
"plt.rc(\n",
" \"axes\",\n",
" labelweight=\"bold\",\n",
" labelsize=\"large\",\n",
" titleweight=\"bold\",\n",
" titlesize=14,\n",
" titlepad=10,\n",
")\n",
"\n",
"# Mute warnings\n",
"warnings.filterwarnings('ignore')\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.03292,
"end_time": "2021-10-27T20:04:20.184209",
"exception": false,
"start_time": "2021-10-27T20:04:20.151289",
"status": "completed"
},
"tags": []
},
"source": [
"## Data Preprocessing ##\n",
"\n",
"Before we can do any feature engineering, we need to *preprocess* the data to get it in a form suitable for analysis. The data we used in the course was a bit simpler than the competition data. For the *Ames* competition dataset, we'll need to:\n",
"- **Load** the data from CSV files\n",
"- **Clean** the data to fix any errors or inconsistencies\n",
"- **Encode** the statistical data type (numeric, categorical)\n",
"- **Impute** any missing values\n",
"\n",
"We'll wrap all these steps up in a function, which will make easy for you to get a fresh dataframe whenever you need. After reading the CSV file, we'll apply three preprocessing steps, `clean`, `encode`, and `impute`, and then create the data splits: one (`df_train`) for training the model, and one (`df_test`) for making the predictions that you'll submit to the competition for scoring on the leaderboard."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:20.256708Z",
"iopub.status.busy": "2021-10-27T20:04:20.255666Z",
"iopub.status.idle": "2021-10-27T20:04:20.259278Z",
"shell.execute_reply": "2021-10-27T20:04:20.258676Z"
},
"papermill": {
"duration": 0.043063,
"end_time": "2021-10-27T20:04:20.259451",
"exception": false,
"start_time": "2021-10-27T20:04:20.216388",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"def load_data():\n",
" # Read data\n",
" data_dir = Path(\"../input/house-prices-advanced-regression-techniques/\")\n",
" df_train = pd.read_csv(data_dir / \"train.csv\", index_col=\"Id\")\n",
" df_test = pd.read_csv(data_dir / \"test.csv\", index_col=\"Id\")\n",
" # Merge the splits so we can process them together\n",
" df = pd.concat([df_train, df_test])\n",
" # Preprocessing\n",
" df = clean(df)\n",
" df = encode(df)\n",
" df = impute(df)\n",
" # Reform splits\n",
" df_train = df.loc[df_train.index, :]\n",
" df_test = df.loc[df_test.index, :]\n",
" return df_train, df_test\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.031365,
"end_time": "2021-10-27T20:04:20.325080",
"exception": false,
"start_time": "2021-10-27T20:04:20.293715",
"status": "completed"
},
"tags": []
},
"source": [
"### Clean Data ###\n",
"\n",
"Some of the categorical features in this dataset have what are apparently typos in their categories:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:20.394532Z",
"iopub.status.busy": "2021-10-27T20:04:20.393763Z",
"iopub.status.idle": "2021-10-27T20:04:20.450417Z",
"shell.execute_reply": "2021-10-27T20:04:20.450991Z"
},
"papermill": {
"duration": 0.094558,
"end_time": "2021-10-27T20:04:20.451203",
"exception": false,
"start_time": "2021-10-27T20:04:20.356645",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"array(['VinylSd', 'MetalSd', 'Wd Shng', 'HdBoard', 'Plywood', 'Wd Sdng',\n",
" 'CmentBd', 'BrkFace', 'Stucco', 'AsbShng', 'Brk Cmn', 'ImStucc',\n",
" 'AsphShn', 'Stone', 'Other', 'CBlock'], dtype=object)"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data_dir = Path(\"../input/house-prices-advanced-regression-techniques/\")\n",
"df = pd.read_csv(data_dir / \"train.csv\", index_col=\"Id\")\n",
"\n",
"df.Exterior2nd.unique()"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.031778,
"end_time": "2021-10-27T20:04:20.516295",
"exception": false,
"start_time": "2021-10-27T20:04:20.484517",
"status": "completed"
},
"tags": []
},
"source": [
"Comparing these to `data_description.txt` shows us what needs cleaning. We'll take care of a couple of issues here, but you might want to evaluate this data further."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:20.588327Z",
"iopub.status.busy": "2021-10-27T20:04:20.587260Z",
"iopub.status.idle": "2021-10-27T20:04:20.590341Z",
"shell.execute_reply": "2021-10-27T20:04:20.589702Z"
},
"papermill": {
"duration": 0.042177,
"end_time": "2021-10-27T20:04:20.590486",
"exception": false,
"start_time": "2021-10-27T20:04:20.548309",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"def clean(df):\n",
" df[\"Exterior2nd\"] = df[\"Exterior2nd\"].replace({\"Brk Cmn\": \"BrkComm\"})\n",
" # Some values of GarageYrBlt are corrupt, so we'll replace them\n",
" # with the year the house was built\n",
" df[\"GarageYrBlt\"] = df[\"GarageYrBlt\"].where(df.GarageYrBlt <= 2010, df.YearBuilt)\n",
" # Names beginning with numbers are awkward to work with\n",
" df.rename(columns={\n",
" \"1stFlrSF\": \"FirstFlrSF\",\n",
" \"2ndFlrSF\": \"SecondFlrSF\",\n",
" \"3SsnPorch\": \"Threeseasonporch\",\n",
" }, inplace=True,\n",
" )\n",
" return df\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.032069,
"end_time": "2021-10-27T20:04:20.654729",
"exception": false,
"start_time": "2021-10-27T20:04:20.622660",
"status": "completed"
},
"tags": []
},
"source": [
"### Encode the Statistical Data Type ###\n",
"\n",
"Pandas has Python types corresponding to the standard statistical types (numeric, categorical, etc.). Encoding each feature with its correct type helps ensure each feature is treated appropriately by whatever functions we use, and makes it easier for us to apply transformations consistently. This hidden cell defines the `encode` function:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"_kg_hide-input": true,
"execution": {
"iopub.execute_input": "2021-10-27T20:04:20.737930Z",
"iopub.status.busy": "2021-10-27T20:04:20.736900Z",
"iopub.status.idle": "2021-10-27T20:04:20.740552Z",
"shell.execute_reply": "2021-10-27T20:04:20.740028Z"
},
"papermill": {
"duration": 0.052778,
"end_time": "2021-10-27T20:04:20.740714",
"exception": false,
"start_time": "2021-10-27T20:04:20.687936",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"\n",
"# The numeric features are already encoded correctly (`float` for\n",
"# continuous, `int` for discrete), but the categoricals we'll need to\n",
"# do ourselves. Note in particular, that the `MSSubClass` feature is\n",
"# read as an `int` type, but is actually a (nominative) categorical.\n",
"\n",
"# The nominative (unordered) categorical features\n",
"features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
"\n",
"\n",
"# The ordinal (ordered) categorical features \n",
"\n",
"# Pandas calls the categories \"levels\"\n",
"five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
"ten_levels = list(range(10))\n",
"\n",
"ordered_levels = {\n",
" \"OverallQual\": ten_levels,\n",
" \"OverallCond\": ten_levels,\n",
" \"ExterQual\": five_levels,\n",
" \"ExterCond\": five_levels,\n",
" \"BsmtQual\": five_levels,\n",
" \"BsmtCond\": five_levels,\n",
" \"HeatingQC\": five_levels,\n",
" \"KitchenQual\": five_levels,\n",
" \"FireplaceQu\": five_levels,\n",
" \"GarageQual\": five_levels,\n",
" \"GarageCond\": five_levels,\n",
" \"PoolQC\": five_levels,\n",
" \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
" \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
" \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
" \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
" \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
" \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
" \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
" \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
" \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
" \"CentralAir\": [\"N\", \"Y\"],\n",
" \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
" \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
"}\n",
"\n",
"# Add a None level for missing values\n",
"ordered_levels = {key: [\"None\"] + value for key, value in\n",
" ordered_levels.items()}\n",
"\n",
"\n",
"def encode(df):\n",
" # Nominal categories\n",
" for name in features_nom:\n",
" df[name] = df[name].astype(\"category\")\n",
" # Add a None category for missing values\n",
" if \"None\" not in df[name].cat.categories:\n",
" df[name].cat.add_categories(\"None\", inplace=True)\n",
" # Ordinal categories\n",
" for name, levels in ordered_levels.items():\n",
" df[name] = df[name].astype(CategoricalDtype(levels,\n",
" ordered=True))\n",
" return df\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.032488,
"end_time": "2021-10-27T20:04:20.805249",
"exception": false,
"start_time": "2021-10-27T20:04:20.772761",
"status": "completed"
},
"tags": []
},
"source": [
"### Handle Missing Values ###\n",
"\n",
"Handling missing values now will make the feature engineering go more smoothly. We'll impute `0` for missing numeric values and `\"None\"` for missing categorical values. You might like to experiment with other imputation strategies. In particular, you could try creating \"missing value\" indicators: `1` whenever a value was imputed and `0` otherwise."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:20.876211Z",
"iopub.status.busy": "2021-10-27T20:04:20.875480Z",
"iopub.status.idle": "2021-10-27T20:04:20.878433Z",
"shell.execute_reply": "2021-10-27T20:04:20.877838Z"
},
"papermill": {
"duration": 0.041181,
"end_time": "2021-10-27T20:04:20.878580",
"exception": false,
"start_time": "2021-10-27T20:04:20.837399",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"def impute(df):\n",
" for name in df.select_dtypes(\"number\"):\n",
" df[name] = df[name].fillna(0)\n",
" for name in df.select_dtypes(\"category\"):\n",
" df[name] = df[name].fillna(\"None\")\n",
" return df\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.032117,
"end_time": "2021-10-27T20:04:20.942911",
"exception": false,
"start_time": "2021-10-27T20:04:20.910794",
"status": "completed"
},
"tags": []
},
"source": [
"## Load Data ##\n",
"\n",
"And now we can call the data loader and get the processed data splits:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:21.014073Z",
"iopub.status.busy": "2021-10-27T20:04:21.012915Z",
"iopub.status.idle": "2021-10-27T20:04:21.211421Z",
"shell.execute_reply": "2021-10-27T20:04:21.211946Z"
},
"papermill": {
"duration": 0.236978,
"end_time": "2021-10-27T20:04:21.212142",
"exception": false,
"start_time": "2021-10-27T20:04:20.975164",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"df_train, df_test = load_data()"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.031984,
"end_time": "2021-10-27T20:04:21.276325",
"exception": false,
"start_time": "2021-10-27T20:04:21.244341",
"status": "completed"
},
"tags": []
},
"source": [
"Uncomment and run this cell if you'd like to see what they contain. Notice that `df_test` is\n",
"missing values for `SalePrice`. (`NA`s were willed with 0's in the imputation step.)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:21.347212Z",
"iopub.status.busy": "2021-10-27T20:04:21.346514Z",
"iopub.status.idle": "2021-10-27T20:04:21.349604Z",
"shell.execute_reply": "2021-10-27T20:04:21.350153Z"
},
"papermill": {
"duration": 0.039737,
"end_time": "2021-10-27T20:04:21.350317",
"exception": false,
"start_time": "2021-10-27T20:04:21.310580",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"# Peek at the values\n",
"#display(df_train)\n",
"#display(df_test)\n",
"\n",
"# Display information about dtypes and missing values\n",
"#display(df_train.info())\n",
"#display(df_test.info())"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.031973,
"end_time": "2021-10-27T20:04:21.414475",
"exception": false,
"start_time": "2021-10-27T20:04:21.382502",
"status": "completed"
},
"tags": []
},
"source": [
"## Establish Baseline ##\n",
"\n",
"Finally, let's establish a baseline score to judge our feature engineering against.\n",
"\n",
"Here is the function we created in Lesson 1 that will compute the cross-validated RMSLE score for a feature set. We've used XGBoost for our model, but you might want to experiment with other models.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"_kg_hide-input": true,
"execution": {
"iopub.execute_input": "2021-10-27T20:04:21.482947Z",
"iopub.status.busy": "2021-10-27T20:04:21.482249Z",
"iopub.status.idle": "2021-10-27T20:04:21.489005Z",
"shell.execute_reply": "2021-10-27T20:04:21.489545Z"
},
"papermill": {
"duration": 0.043045,
"end_time": "2021-10-27T20:04:21.489734",
"exception": false,
"start_time": "2021-10-27T20:04:21.446689",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"\n",
"def score_dataset(X, y, model=XGBRegressor()):\n",
" # Label encoding for categoricals\n",
" #\n",
" # Label encoding is good for XGBoost and RandomForest, but one-hot\n",
" # would be better for models like Lasso or Ridge. The `cat.codes`\n",
" # attribute holds the category levels.\n",
" for colname in X.select_dtypes([\"category\"]):\n",
" X[colname] = X[colname].cat.codes\n",
" # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)\n",
" log_y = np.log(y)\n",
" score = cross_val_score(\n",
" model, X, log_y, cv=5, scoring=\"neg_mean_squared_error\",\n",
" )\n",
" score = -1 * score.mean()\n",
" score = np.sqrt(score)\n",
" return score\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.032884,
"end_time": "2021-10-27T20:04:21.555620",
"exception": false,
"start_time": "2021-10-27T20:04:21.522736",
"status": "completed"
},
"tags": []
},
"source": [
"We can reuse this scoring function anytime we want to try out a new feature set. We'll run it now on the processed data with no additional features and get a baseline score:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:21.624797Z",
"iopub.status.busy": "2021-10-27T20:04:21.624101Z",
"iopub.status.idle": "2021-10-27T20:04:23.666249Z",
"shell.execute_reply": "2021-10-27T20:04:23.666863Z"
},
"papermill": {
"duration": 2.078872,
"end_time": "2021-10-27T20:04:23.667100",
"exception": false,
"start_time": "2021-10-27T20:04:21.588228",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Baseline score: 0.14351 RMSLE\n"
]
}
],
"source": [
"X = df_train.copy()\n",
"y = X.pop(\"SalePrice\")\n",
"\n",
"baseline_score = score_dataset(X, y)\n",
"print(f\"Baseline score: {baseline_score:.5f} RMSLE\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.033877,
"end_time": "2021-10-27T20:04:23.736448",
"exception": false,
"start_time": "2021-10-27T20:04:23.702571",
"status": "completed"
},
"tags": []
},
"source": [
"This baseline score helps us to know whether some set of features we've assembled has actually led to any improvement or not.\n",
"\n",
"# Step 2 - Feature Utility Scores #\n",
"\n",
"In Lesson 2 we saw how to use mutual information to compute a *utility score* for a feature, giving you an indication of how much potential the feature has. This hidden cell defines the two utility functions we used, `make_mi_scores` and `plot_mi_scores`: "
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"_kg_hide-input": true,
"execution": {
"iopub.execute_input": "2021-10-27T20:04:23.813707Z",
"iopub.status.busy": "2021-10-27T20:04:23.812779Z",
"iopub.status.idle": "2021-10-27T20:04:23.815905Z",
"shell.execute_reply": "2021-10-27T20:04:23.816437Z"
},
"papermill": {
"duration": 0.046446,
"end_time": "2021-10-27T20:04:23.816666",
"exception": false,
"start_time": "2021-10-27T20:04:23.770220",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"\n",
"def make_mi_scores(X, y):\n",
" X = X.copy()\n",
" for colname in X.select_dtypes([\"object\", \"category\"]):\n",
" X[colname], _ = X[colname].factorize()\n",
" # All discrete features should now have integer dtypes\n",
" discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]\n",
" mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)\n",
" mi_scores = pd.Series(mi_scores, name=\"MI Scores\", index=X.columns)\n",
" mi_scores = mi_scores.sort_values(ascending=False)\n",
" return mi_scores\n",
"\n",
"\n",
"def plot_mi_scores(scores):\n",
" scores = scores.sort_values(ascending=True)\n",
" width = np.arange(len(scores))\n",
" ticks = list(scores.index)\n",
" plt.barh(width, scores)\n",
" plt.yticks(width, ticks)\n",
" plt.title(\"Mutual Information Scores\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.033927,
"end_time": "2021-10-27T20:04:23.887054",
"exception": false,
"start_time": "2021-10-27T20:04:23.853127",
"status": "completed"
},
"tags": []
},
"source": [
"Let's look at our feature scores again:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:23.964339Z",
"iopub.status.busy": "2021-10-27T20:04:23.963474Z",
"iopub.status.idle": "2021-10-27T20:04:26.380546Z",
"shell.execute_reply": "2021-10-27T20:04:26.381068Z"
},
"papermill": {
"duration": 2.4575,
"end_time": "2021-10-27T20:04:26.381279",
"exception": false,
"start_time": "2021-10-27T20:04:23.923779",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"OverallQual 0.571457\n",
"Neighborhood 0.526220\n",
"GrLivArea 0.430395\n",
"YearBuilt 0.407974\n",
"LotArea 0.394468\n",
" ... \n",
"PoolQC 0.000000\n",
"MiscFeature 0.000000\n",
"MiscVal 0.000000\n",
"MoSold 0.000000\n",
"YrSold 0.000000\n",
"Name: MI Scores, Length: 79, dtype: float64"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = df_train.copy()\n",
"y = X.pop(\"SalePrice\")\n",
"\n",
"mi_scores = make_mi_scores(X, y)\n",
"mi_scores"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.0332,
"end_time": "2021-10-27T20:04:26.448237",
"exception": false,
"start_time": "2021-10-27T20:04:26.415037",
"status": "completed"
},
"tags": []
},
"source": [
"You can see that we have a number of features that are highly informative and also some that don't seem to be informative at all (at least by themselves). As we talked about in Tutorial 2, the top scoring features will usually pay-off the most during feature development, so it could be a good idea to focus your efforts on those. On the other hand, training on uninformative features can lead to overfitting. So, the features with 0.0 scores we'll drop entirely:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:26.519381Z",
"iopub.status.busy": "2021-10-27T20:04:26.518372Z",
"iopub.status.idle": "2021-10-27T20:04:26.522233Z",
"shell.execute_reply": "2021-10-27T20:04:26.522699Z"
},
"papermill": {
"duration": 0.041247,
"end_time": "2021-10-27T20:04:26.522889",
"exception": false,
"start_time": "2021-10-27T20:04:26.481642",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"def drop_uninformative(df, mi_scores):\n",
" return df.loc[:, mi_scores > 0.0]\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.033805,
"end_time": "2021-10-27T20:04:26.590279",
"exception": false,
"start_time": "2021-10-27T20:04:26.556474",
"status": "completed"
},
"tags": []
},
"source": [
"Removing them does lead to a modest performance gain:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:26.661877Z",
"iopub.status.busy": "2021-10-27T20:04:26.660922Z",
"iopub.status.idle": "2021-10-27T20:04:28.664621Z",
"shell.execute_reply": "2021-10-27T20:04:28.665234Z"
},
"papermill": {
"duration": 2.041129,
"end_time": "2021-10-27T20:04:28.665440",
"exception": false,
"start_time": "2021-10-27T20:04:26.624311",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0.14338026718687277"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = df_train.copy()\n",
"y = X.pop(\"SalePrice\")\n",
"X = drop_uninformative(X, mi_scores)\n",
"\n",
"score_dataset(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.034251,
"end_time": "2021-10-27T20:04:28.735154",
"exception": false,
"start_time": "2021-10-27T20:04:28.700903",
"status": "completed"
},
"tags": []
},
"source": [
"Later, we'll add the `drop_uninformative` function to our feature-creation pipeline.\n",
"\n",
"# Step 3 - Create Features #\n",
"\n",
"Now we'll start developing our feature set.\n",
"\n",
"To make our feature engineering workflow more modular, we'll define a function that will take a prepared dataframe and pass it through a pipeline of transformations to get the final feature set. It will look something like this:\n",
"\n",
"```\n",
"def create_features(df):\n",
" X = df.copy()\n",
" y = X.pop(\"SalePrice\")\n",
" X = X.join(create_features_1(X))\n",
" X = X.join(create_features_2(X))\n",
" X = X.join(create_features_3(X))\n",
" # ...\n",
" return X\n",
"```\n",
"\n",
"Let's go ahead and define one transformation now, a [label encoding](https://www.kaggle.com/alexisbcook/categorical-variables) for the categorical features:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:28.808709Z",
"iopub.status.busy": "2021-10-27T20:04:28.808032Z",
"iopub.status.idle": "2021-10-27T20:04:28.812616Z",
"shell.execute_reply": "2021-10-27T20:04:28.813299Z"
},
"papermill": {
"duration": 0.043595,
"end_time": "2021-10-27T20:04:28.813531",
"exception": false,
"start_time": "2021-10-27T20:04:28.769936",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"def label_encode(df):\n",
" X = df.copy()\n",
" for colname in X.select_dtypes([\"category\"]):\n",
" X[colname] = X[colname].cat.codes\n",
" return X\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.042891,
"end_time": "2021-10-27T20:04:28.892074",
"exception": false,
"start_time": "2021-10-27T20:04:28.849183",
"status": "completed"
},
"tags": []
},
"source": [
"A label encoding is okay for any kind of categorical feature when you're using a tree-ensemble like XGBoost, even for unordered categories. If you wanted to try a linear regression model (also popular in this competition), you would instead want to use a one-hot encoding, especially for the features with unordered categories.\n",
"\n",
"## Create Features with Pandas ##\n",
"\n",
"This cell reproduces the work you did in Exercise 3, where you applied strategies for creating features in Pandas. Modify or add to these functions to try out other feature combinations."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"_kg_hide-input": true,
"execution": {
"iopub.execute_input": "2021-10-27T20:04:28.978917Z",
"iopub.status.busy": "2021-10-27T20:04:28.977945Z",
"iopub.status.idle": "2021-10-27T20:04:28.987672Z",
"shell.execute_reply": "2021-10-27T20:04:28.988259Z"
},
"papermill": {
"duration": 0.054368,
"end_time": "2021-10-27T20:04:28.988444",
"exception": false,
"start_time": "2021-10-27T20:04:28.934076",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"\n",
"def mathematical_transforms(df):\n",
" X = pd.DataFrame() # dataframe to hold new features\n",
" X[\"LivLotRatio\"] = df.GrLivArea / df.LotArea\n",
" X[\"Spaciousness\"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd\n",
" # This feature ended up not helping performance\n",
" # X[\"TotalOutsideSF\"] = \\\n",
" # df.WoodDeckSF + df.OpenPorchSF + df.EnclosedPorch + \\\n",
" # df.Threeseasonporch + df.ScreenPorch\n",
" return X\n",
"\n",
"\n",
"def interactions(df):\n",
" X = pd.get_dummies(df.BldgType, prefix=\"Bldg\")\n",
" X = X.mul(df.GrLivArea, axis=0)\n",
" return X\n",
"\n",
"\n",
"def counts(df):\n",
" X = pd.DataFrame()\n",
" X[\"PorchTypes\"] = df[[\n",
" \"WoodDeckSF\",\n",
" \"OpenPorchSF\",\n",
" \"EnclosedPorch\",\n",
" \"Threeseasonporch\",\n",
" \"ScreenPorch\",\n",
" ]].gt(0.0).sum(axis=1)\n",
" return X\n",
"\n",
"\n",
"def break_down(df):\n",
" X = pd.DataFrame()\n",
" X[\"MSClass\"] = df.MSSubClass.str.split(\"_\", n=1, expand=True)[0]\n",
" return X\n",
"\n",
"\n",
"def group_transforms(df):\n",
" X = pd.DataFrame()\n",
" X[\"MedNhbdArea\"] = df.groupby(\"Neighborhood\")[\"GrLivArea\"].transform(\"median\")\n",
" return X\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.034589,
"end_time": "2021-10-27T20:04:29.057590",
"exception": false,
"start_time": "2021-10-27T20:04:29.023001",
"status": "completed"
},
"tags": []
},
"source": [
"Here are some ideas for other transforms you could explore:\n",
"- Interactions between the quality `Qual` and condition `Cond` features. `OverallQual`, for instance, was a high-scoring feature. You could try combining it with `OverallCond` by converting both to integer type and taking a product.\n",
"- Square roots of area features. This would convert units of square feet to just feet.\n",
"- Logarithms of numeric features. If a feature has a skewed distribution, applying a logarithm can help normalize it.\n",
"- Interactions between numeric and categorical features that describe the same thing. You could look at interactions between `BsmtQual` and `TotalBsmtSF`, for instance.\n",
"- Other group statistics in `Neighboorhood`. We did the median of `GrLivArea`. Looking at `mean`, `std`, or `count` could be interesting. You could also try combining the group statistics with other features. Maybe the *difference* of `GrLivArea` and the median is important?\n",
"\n",
"## k-Means Clustering ##\n",
"\n",
"The first unsupervised algorithm we used to create features was k-means clustering. We saw that you could either use the cluster labels as a feature (a column with `0, 1, 2, ...`) or you could use the *distance* of the observations to each cluster. We saw how these features can sometimes be effective at untangling complicated spatial relationships."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"_kg_hide-input": true,
"execution": {
"iopub.execute_input": "2021-10-27T20:04:29.131372Z",
"iopub.status.busy": "2021-10-27T20:04:29.130553Z",
"iopub.status.idle": "2021-10-27T20:04:29.141326Z",
"shell.execute_reply": "2021-10-27T20:04:29.141884Z"
},
"lines_to_next_cell": 2,
"papermill": {
"duration": 0.049786,
"end_time": "2021-10-27T20:04:29.142064",
"exception": false,
"start_time": "2021-10-27T20:04:29.092278",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"\n",
"cluster_features = [\n",
" \"LotArea\",\n",
" \"TotalBsmtSF\",\n",
" \"FirstFlrSF\",\n",
" \"SecondFlrSF\",\n",
" \"GrLivArea\",\n",
"]\n",
"\n",
"\n",
"def cluster_labels(df, features, n_clusters=20):\n",
" X = df.copy()\n",
" X_scaled = X.loc[:, features]\n",
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
" kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)\n",
" X_new = pd.DataFrame()\n",
" X_new[\"Cluster\"] = kmeans.fit_predict(X_scaled)\n",
" return X_new\n",
"\n",
"\n",
"def cluster_distance(df, features, n_clusters=20):\n",
" X = df.copy()\n",
" X_scaled = X.loc[:, features]\n",
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
" kmeans = KMeans(n_clusters=20, n_init=50, random_state=0)\n",
" X_cd = kmeans.fit_transform(X_scaled)\n",
" # Label features and join to dataset\n",
" X_cd = pd.DataFrame(\n",
" X_cd, columns=[f\"Centroid_{i}\" for i in range(X_cd.shape[1])]\n",
" )\n",
" return X_cd\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.034369,
"end_time": "2021-10-27T20:04:29.273011",
"exception": false,
"start_time": "2021-10-27T20:04:29.238642",
"status": "completed"
},
"tags": []
},
"source": [
"## Principal Component Analysis ##\n",
"\n",
"PCA was the second unsupervised model we used for feature creation. We saw how it could be used to decompose the variational structure in the data. The PCA algorithm gave us *loadings* which described each component of variation, and also the *components* which were the transformed datapoints. The loadings can suggest features to create and the components we can use as features directly.\n",
"\n",
"Here are the utility functions from the PCA lesson:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"_kg_hide-input": true,
"execution": {
"iopub.execute_input": "2021-10-27T20:04:29.355089Z",
"iopub.status.busy": "2021-10-27T20:04:29.349195Z",
"iopub.status.idle": "2021-10-27T20:04:29.362003Z",
"shell.execute_reply": "2021-10-27T20:04:29.362685Z"
},
"papermill": {
"duration": 0.053188,
"end_time": "2021-10-27T20:04:29.362908",
"exception": false,
"start_time": "2021-10-27T20:04:29.309720",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"\n",
"def apply_pca(X, standardize=True):\n",
" # Standardize\n",
" if standardize:\n",
" X = (X - X.mean(axis=0)) / X.std(axis=0)\n",
" # Create principal components\n",
" pca = PCA()\n",
" X_pca = pca.fit_transform(X)\n",
" # Convert to dataframe\n",
" component_names = [f\"PC{i+1}\" for i in range(X_pca.shape[1])]\n",
" X_pca = pd.DataFrame(X_pca, columns=component_names)\n",
" # Create loadings\n",
" loadings = pd.DataFrame(\n",
" pca.components_.T, # transpose the matrix of loadings\n",
" columns=component_names, # so the columns are the principal components\n",
" index=X.columns, # and the rows are the original features\n",
" )\n",
" return pca, X_pca, loadings\n",
"\n",
"\n",
"def plot_variance(pca, width=8, dpi=100):\n",
" # Create figure\n",
" fig, axs = plt.subplots(1, 2)\n",
" n = pca.n_components_\n",
" grid = np.arange(1, n + 1)\n",
" # Explained variance\n",
" evr = pca.explained_variance_ratio_\n",
" axs[0].bar(grid, evr)\n",
" axs[0].set(\n",
" xlabel=\"Component\", title=\"% Explained Variance\", ylim=(0.0, 1.0)\n",
" )\n",
" # Cumulative Variance\n",
" cv = np.cumsum(evr)\n",
" axs[1].plot(np.r_[0, grid], np.r_[0, cv], \"o-\")\n",
" axs[1].set(\n",
" xlabel=\"Component\", title=\"% Cumulative Variance\", ylim=(0.0, 1.0)\n",
" )\n",
" # Set up figure\n",
" fig.set(figwidth=8, dpi=100)\n",
" return axs\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.034582,
"end_time": "2021-10-27T20:04:29.434038",
"exception": false,
"start_time": "2021-10-27T20:04:29.399456",
"status": "completed"
},
"tags": []
},
"source": [
"And here are transforms that produce the features from the Exercise 5. You might want to change these if you came up with a different answer.\n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"_kg_hide-input": true,
"execution": {
"iopub.execute_input": "2021-10-27T20:04:29.508704Z",
"iopub.status.busy": "2021-10-27T20:04:29.507802Z",
"iopub.status.idle": "2021-10-27T20:04:29.515195Z",
"shell.execute_reply": "2021-10-27T20:04:29.515726Z"
},
"papermill": {
"duration": 0.047013,
"end_time": "2021-10-27T20:04:29.515945",
"exception": false,
"start_time": "2021-10-27T20:04:29.468932",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"\n",
"def pca_inspired(df):\n",
" X = pd.DataFrame()\n",
" X[\"Feature1\"] = df.GrLivArea + df.TotalBsmtSF\n",
" X[\"Feature2\"] = df.YearRemodAdd * df.TotalBsmtSF\n",
" return X\n",
"\n",
"\n",
"def pca_components(df, features):\n",
" X = df.loc[:, features]\n",
" _, X_pca, _ = apply_pca(X)\n",
" return X_pca\n",
"\n",
"\n",
"pca_features = [\n",
" \"GarageArea\",\n",
" \"YearRemodAdd\",\n",
" \"TotalBsmtSF\",\n",
" \"GrLivArea\",\n",
"]"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.034769,
"end_time": "2021-10-27T20:04:29.586674",
"exception": false,
"start_time": "2021-10-27T20:04:29.551905",
"status": "completed"
},
"tags": []
},
"source": [
"These are only a couple ways you could use the principal components. You could also try clustering using one or more components. One thing to note is that PCA doesn't change the distance between points -- it's just like a rotation. So clustering with the full set of components is the same as clustering with the original features. Instead, pick some subset of components, maybe those with the most variance or the highest MI scores.\n",
"\n",
"For further analysis, you might want to look at a correlation matrix for the dataset:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:29.660684Z",
"iopub.status.busy": "2021-10-27T20:04:29.659882Z",
"iopub.status.idle": "2021-10-27T20:04:31.368870Z",
"shell.execute_reply": "2021-10-27T20:04:31.369432Z"
},
"papermill": {
"duration": 1.747951,
"end_time": "2021-10-27T20:04:31.369653",
"exception": false,
"start_time": "2021-10-27T20:04:29.621702",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 720x720 with 4 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def corrplot(df, method=\"pearson\", annot=True, **kwargs):\n",
" sns.clustermap(\n",
" df.corr(method),\n",
" vmin=-1.0,\n",
" vmax=1.0,\n",
" cmap=\"icefire\",\n",
" method=\"complete\",\n",
" annot=annot,\n",
" **kwargs,\n",
" )\n",
"\n",
"\n",
"corrplot(df_train, annot=None)"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.038428,
"end_time": "2021-10-27T20:04:31.446965",
"exception": false,
"start_time": "2021-10-27T20:04:31.408537",
"status": "completed"
},
"tags": []
},
"source": [
"Groups of highly correlated features often yield interesting loadings.\n",
"\n",
"### PCA Application - Indicate Outliers ###\n",
"\n",
"In Exercise 5, you applied PCA to determine houses that were **outliers**, that is, houses having values not well represented in the rest of the data. You saw that there was a group of houses in the `Edwards` neighborhood having a `SaleCondition` of `Partial` whose values were especially extreme.\n",
"\n",
"Some models can benefit from having these outliers indicated, which is what this next transform will do."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:31.531467Z",
"iopub.status.busy": "2021-10-27T20:04:31.530625Z",
"iopub.status.idle": "2021-10-27T20:04:31.532229Z",
"shell.execute_reply": "2021-10-27T20:04:31.532751Z"
},
"papermill": {
"duration": 0.04703,
"end_time": "2021-10-27T20:04:31.532997",
"exception": false,
"start_time": "2021-10-27T20:04:31.485967",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"def indicate_outliers(df):\n",
" X_new = pd.DataFrame()\n",
" X_new[\"Outlier\"] = (df.Neighborhood == \"Edwards\") & (df.SaleCondition == \"Partial\")\n",
" return X_new\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.03965,
"end_time": "2021-10-27T20:04:31.612676",
"exception": false,
"start_time": "2021-10-27T20:04:31.573026",
"status": "completed"
},
"tags": []
},
"source": [
"You could also consider applying some sort of robust scaler from scikit-learn's `sklearn.preprocessing` module to the outlying values, especially those in `GrLivArea`. [Here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html) is a tutorial illustrating some of them. Another option could be to create a feature of \"outlier scores\" using one of scikit-learn's [outlier detectors](https://scikit-learn.org/stable/modules/outlier_detection.html)."
]
},
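{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of both ideas, the snippet below uses a small made-up frame (`demo` and its values are illustrative, not the competition data) with `RobustScaler` and `IsolationForest`. The choice of detector and its settings are assumptions to tune, not recommendations:\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.ensemble import IsolationForest\n",
"from sklearn.preprocessing import RobustScaler\n",
"\n",
"# Made-up living areas: mostly typical houses plus two extreme ones\n",
"rng = np.random.default_rng(0)\n",
"demo = pd.DataFrame(\n",
"    {\"GrLivArea\": np.r_[rng.normal(1500, 300, size=98), [4700.0, 5600.0]]}\n",
")\n",
"\n",
"# RobustScaler centers on the median and scales by the interquartile range,\n",
"# so the two extreme values have little influence on the scaling itself\n",
"scaled = RobustScaler().fit_transform(demo)\n",
"\n",
"# IsolationForest.score_samples: lower scores indicate more anomalous rows\n",
"detector = IsolationForest(random_state=0).fit(demo)\n",
"demo[\"OutlierScore\"] = detector.score_samples(demo[[\"GrLivArea\"]])\n",
"\n",
"print(demo.sort_values(\"OutlierScore\").head())\n",
"```\n",
"\n",
"In the pipeline, a function like this could return the scores as a new column, the same way `indicate_outliers` returns its indicator."
]
},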
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.039386,
"end_time": "2021-10-27T20:04:31.692256",
"exception": false,
"start_time": "2021-10-27T20:04:31.652870",
"status": "completed"
},
"tags": []
},
"source": [
"## Target Encoding ##\n",
"\n",
"Needing a separate holdout set to create a target encoding is rather wasteful of data. In *Tutorial 6* we used 25% of our dataset just to encode a single feature, `Zipcode`. The data from the other features in that 25% we didn't get to use at all.\n",
"\n",
"There is, however, a way you can use target encoding without having to use held-out encoding data. It's basically the same trick used in cross-validation:\n",
"1. Split the data into folds, each fold having two splits of the dataset.\n",
"2. Train the encoder on one split but transform the values of the other.\n",
"3. Repeat for all the splits.\n",
"\n",
"This way, training and transformation always take place on independent sets of data, just like when you use a holdout set but without any data going to waste.\n",
"\n",
"In the next hidden cell is a wrapper you can use with any target encoder:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"_kg_hide-input": true,
"execution": {
"iopub.execute_input": "2021-10-27T20:04:31.774532Z",
"iopub.status.busy": "2021-10-27T20:04:31.773838Z",
"iopub.status.idle": "2021-10-27T20:04:31.787301Z",
"shell.execute_reply": "2021-10-27T20:04:31.787998Z"
},
"papermill": {
"duration": 0.056889,
"end_time": "2021-10-27T20:04:31.788260",
"exception": false,
"start_time": "2021-10-27T20:04:31.731371",
"status": "completed"
},
"tags": []
},
"outputs": [],
"source": [
"\n",
"class CrossFoldEncoder:\n",
" def __init__(self, encoder, **kwargs):\n",
" self.encoder_ = encoder\n",
" self.kwargs_ = kwargs # keyword arguments for the encoder\n",
" self.cv_ = KFold(n_splits=5)\n",
"\n",
" # Fit an encoder on one split and transform the feature on the\n",
" # other. Iterating over the splits in all folds gives a complete\n",
" # transformation. We also now have one trained encoder on each\n",
" # fold.\n",
" def fit_transform(self, X, y, cols):\n",
" self.fitted_encoders_ = []\n",
" self.cols_ = cols\n",
" X_encoded = []\n",
" for idx_encode, idx_train in self.cv_.split(X):\n",
" fitted_encoder = self.encoder_(cols=cols, **self.kwargs_)\n",
" fitted_encoder.fit(\n",
" X.iloc[idx_encode, :], y.iloc[idx_encode],\n",
" )\n",
" X_encoded.append(fitted_encoder.transform(X.iloc[idx_train, :])[cols])\n",
" self.fitted_encoders_.append(fitted_encoder)\n",
" X_encoded = pd.concat(X_encoded)\n",
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
" return X_encoded\n",
"\n",
" # To transform the test data, average the encodings learned from\n",
" # each fold.\n",
" def transform(self, X):\n",
" from functools import reduce\n",
"\n",
" X_encoded_list = []\n",
" for fitted_encoder in self.fitted_encoders_:\n",
" X_encoded = fitted_encoder.transform(X)\n",
" X_encoded_list.append(X_encoded[self.cols_])\n",
" X_encoded = reduce(\n",
" lambda x, y: x.add(y, fill_value=0), X_encoded_list\n",
" ) / len(X_encoded_list)\n",
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
" return X_encoded\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.038236,
"end_time": "2021-10-27T20:04:31.864997",
"exception": false,
"start_time": "2021-10-27T20:04:31.826761",
"status": "completed"
},
"tags": []
},
"source": [
"Use it like:\n",
"\n",
"```\n",
"encoder = CrossFoldEncoder(MEstimateEncoder, m=1)\n",
"X_encoded = encoder.fit_transform(X, y, cols=[\"MSSubClass\"]))\n",
"```\n",
"\n",
"You can turn any of the encoders from the [`category_encoders`](http://contrib.scikit-learn.org/category_encoders/) library into a cross-fold encoder. The [`CatBoostEncoder`](http://contrib.scikit-learn.org/category_encoders/catboost.html) would be worth trying. It's similar to `MEstimateEncoder` but uses some tricks to better prevent overfitting. Its smoothing parameter is called `a` instead of `m`.\n",
"\n",
"## Create Final Feature Set ##\n",
"\n",
"Now let's combine everything together. Putting the transformations into separate functions makes it easier to experiment with various combinations. The ones I left uncommented I found gave the best results. You should experiment with you own ideas though! Modify any of these transformations or come up with some of your own to add to the pipeline."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:31.947406Z",
"iopub.status.busy": "2021-10-27T20:04:31.946606Z",
"iopub.status.idle": "2021-10-27T20:04:37.147607Z",
"shell.execute_reply": "2021-10-27T20:04:37.148278Z"
},
"papermill": {
"duration": 5.245116,
"end_time": "2021-10-27T20:04:37.148559",
"exception": false,
"start_time": "2021-10-27T20:04:31.903443",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0.1381925629969659"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def create_features(df, df_test=None):\n",
" X = df.copy()\n",
" y = X.pop(\"SalePrice\")\n",
" mi_scores = make_mi_scores(X, y)\n",
"\n",
" # Combine splits if test data is given\n",
" #\n",
" # If we're creating features for test set predictions, we should\n",
" # use all the data we have available. After creating our features,\n",
" # we'll recreate the splits.\n",
" if df_test is not None:\n",
" X_test = df_test.copy()\n",
" X_test.pop(\"SalePrice\")\n",
" X = pd.concat([X, X_test])\n",
"\n",
" # Lesson 2 - Mutual Information\n",
" X = drop_uninformative(X, mi_scores)\n",
"\n",
" # Lesson 3 - Transformations\n",
" X = X.join(mathematical_transforms(X))\n",
" X = X.join(interactions(X))\n",
" X = X.join(counts(X))\n",
" # X = X.join(break_down(X))\n",
" X = X.join(group_transforms(X))\n",
"\n",
" # Lesson 4 - Clustering\n",
" # X = X.join(cluster_labels(X, cluster_features, n_clusters=20))\n",
" # X = X.join(cluster_distance(X, cluster_features, n_clusters=20))\n",
"\n",
" # Lesson 5 - PCA\n",
" X = X.join(pca_inspired(X))\n",
" # X = X.join(pca_components(X, pca_features))\n",
" # X = X.join(indicate_outliers(X))\n",
"\n",
" X = label_encode(X)\n",
"\n",
" # Reform splits\n",
" if df_test is not None:\n",
" X_test = X.loc[df_test.index, :]\n",
" X.drop(df_test.index, inplace=True)\n",
"\n",
" # Lesson 6 - Target Encoder\n",
" encoder = CrossFoldEncoder(MEstimateEncoder, m=1)\n",
" X = X.join(encoder.fit_transform(X, y, cols=[\"MSSubClass\"]))\n",
" if df_test is not None:\n",
" X_test = X_test.join(encoder.transform(X_test))\n",
"\n",
" if df_test is not None:\n",
" return X, X_test\n",
" else:\n",
" return X\n",
"\n",
"\n",
"df_train, df_test = load_data()\n",
"X_train = create_features(df_train)\n",
"y_train = df_train.loc[:, \"SalePrice\"]\n",
"\n",
"score_dataset(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.040196,
"end_time": "2021-10-27T20:04:37.230158",
"exception": false,
"start_time": "2021-10-27T20:04:37.189962",
"status": "completed"
},
"tags": []
},
"source": [
"# Step 4 - Hyperparameter Tuning #\n",
"\n",
"At this stage, you might like to do some hyperparameter tuning with XGBoost before creating your final submission."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:37.316945Z",
"iopub.status.busy": "2021-10-27T20:04:37.312732Z",
"iopub.status.idle": "2021-10-27T20:04:56.016465Z",
"shell.execute_reply": "2021-10-27T20:04:56.017086Z"
},
"papermill": {
"duration": 18.747635,
"end_time": "2021-10-27T20:04:56.017354",
"exception": false,
"start_time": "2021-10-27T20:04:37.269719",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"0.12414985267470383"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train = create_features(df_train)\n",
"y_train = df_train.loc[:, \"SalePrice\"]\n",
"\n",
"xgb_params = dict(\n",
" max_depth=6, # maximum depth of each tree - try 2 to 10\n",
" learning_rate=0.01, # effect of each tree - try 0.0001 to 0.1\n",
" n_estimators=1000, # number of trees (that is, boosting rounds) - try 1000 to 8000\n",
" min_child_weight=1, # minimum number of houses in a leaf - try 1 to 10\n",
" colsample_bytree=0.7, # fraction of features (columns) per tree - try 0.2 to 1.0\n",
" subsample=0.7, # fraction of instances (rows) per tree - try 0.2 to 1.0\n",
" reg_alpha=0.5, # L1 regularization (like LASSO) - try 0.0 to 10.0\n",
" reg_lambda=1.0, # L2 regularization (like Ridge) - try 0.0 to 10.0\n",
" num_parallel_tree=1, # set > 1 for boosted random forests\n",
")\n",
"\n",
"xgb = XGBRegressor(**xgb_params)\n",
"score_dataset(X_train, y_train, xgb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.040633,
"end_time": "2021-10-27T20:04:56.100142",
"exception": false,
"start_time": "2021-10-27T20:04:56.059509",
"status": "completed"
},
"tags": []
},
"source": [
"Just tuning these by hand can give you great results. However, you might like to try using one of scikit-learn's automatic [hyperparameter tuners](https://scikit-learn.org/stable/modules/grid_search.html). Or you could explore more advanced tuning libraries like [Optuna](https://optuna.readthedocs.io/en/stable/index.html) or [scikit-optimize](https://scikit-optimize.github.io/stable/).\n",
"\n",
"Here is how you can use Optuna with XGBoost:\n",
"\n",
"```\n",
"import optuna\n",
"\n",
"def objective(trial):\n",
" xgb_params = dict(\n",
" max_depth=trial.suggest_int(\"max_depth\", 2, 10),\n",
" learning_rate=trial.suggest_float(\"learning_rate\", 1e-4, 1e-1, log=True),\n",
" n_estimators=trial.suggest_int(\"n_estimators\", 1000, 8000),\n",
" min_child_weight=trial.suggest_int(\"min_child_weight\", 1, 10),\n",
" colsample_bytree=trial.suggest_float(\"colsample_bytree\", 0.2, 1.0),\n",
" subsample=trial.suggest_float(\"subsample\", 0.2, 1.0),\n",
" reg_alpha=trial.suggest_float(\"reg_alpha\", 1e-4, 1e2, log=True),\n",
" reg_lambda=trial.suggest_float(\"reg_lambda\", 1e-4, 1e2, log=True),\n",
" )\n",
" xgb = XGBRegressor(**xgb_params)\n",
" return score_dataset(X_train, y_train, xgb)\n",
"\n",
"study = optuna.create_study(direction=\"minimize\")\n",
"study.optimize(objective, n_trials=20)\n",
"xgb_params = study.best_params\n",
"```\n",
"\n",
"Copy this into a code cell if you'd like to use it, but be aware that it will take quite a while to run. After it's done, you might enjoy using some of [Optuna's visualizations](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/005_visualization.html).\n",
"\n",
"# Step 5 - Train Model and Create Submissions #\n",
"\n",
"Once you're satisfied with everything, it's time to create your final predictions! This cell will:\n",
"- create your feature set from the original data\n",
"- train XGBoost on the training data\n",
"- use the trained model to make predictions from the test set\n",
"- save the predictions to a CSV file"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"execution": {
"iopub.execute_input": "2021-10-27T20:04:56.189072Z",
"iopub.status.busy": "2021-10-27T20:04:56.188321Z",
"iopub.status.idle": "2021-10-27T20:05:02.624682Z",
"shell.execute_reply": "2021-10-27T20:05:02.625438Z"
},
"papermill": {
"duration": 6.484914,
"end_time": "2021-10-27T20:05:02.625667",
"exception": false,
"start_time": "2021-10-27T20:04:56.140753",
"status": "completed"
},
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Your submission was successfully saved!\n"
]
}
],
"source": [
"X_train, X_test = create_features(df_train, df_test)\n",
"y_train = df_train.loc[:, \"SalePrice\"]\n",
"\n",
"xgb = XGBRegressor(**xgb_params)\n",
"# XGB minimizes MSE, but competition loss is RMSLE\n",
"# So, we need to log-transform y to train and exp-transform the predictions\n",
"xgb.fit(X_train, np.log(y))\n",
"predictions = np.exp(xgb.predict(X_test))\n",
"\n",
"output = pd.DataFrame({'Id': X_test.index, 'SalePrice': predictions})\n",
"output.to_csv('my_submission.csv', index=False)\n",
"print(\"Your submission was successfully saved!\")"
]
},
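{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a small numeric sketch of the log transform used in the cell above, assuming the competition metric is RMSE on log prices as the code comment describes. The prices and predictions are made up for illustration:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Made-up prices and predictions, for illustration only\n",
"y_true = np.array([100_000.0, 150_000.0, 200_000.0])\n",
"y_pred = np.array([110_000.0, 140_000.0, 210_000.0])\n",
"\n",
"# RMSE computed on log prices: what a model trained on log(y) with\n",
"# squared error is working to reduce\n",
"rmse_of_logs = np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))\n",
"\n",
"# Equivalent form: the error depends only on the ratio of prediction to truth\n",
"rmse_of_log_ratio = np.sqrt(np.mean(np.log(y_pred / y_true) ** 2))\n",
"\n",
"# Round trip used for the submission: exp undoes log\n",
"round_trip = np.exp(np.log(y_true))\n",
"\n",
"print(rmse_of_logs, rmse_of_log_ratio)\n",
"```\n",
"\n",
"Because the error depends on ratios, being off by 10% costs about the same on a cheap house as on an expensive one."
]
},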
{
"cell_type": "markdown",
"metadata": {
"papermill": {
"duration": 0.039199,
"end_time": "2021-10-27T20:05:02.705132",
"exception": false,
"start_time": "2021-10-27T20:05:02.665933",
"status": "completed"
},
"tags": []
},
"source": [
"To submit these predictions to the competition, follow these steps:\n",
"\n",
"1. Begin by clicking on the blue **Save Version** button in the top right corner of the window. This will generate a pop-up window.\n",
"2. Ensure that the **Save and Run All** option is selected, and then click on the blue **Save** button.\n",
"3. This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the **Save Version** button. This pulls up a list of versions on the right of the screen. Click on the ellipsis **(...)** to the right of the most recent version, and select **Open in Viewer**. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.\n",
"4. Click on the **Output** tab on the right of the screen. Then, click on the file you would like to submit, and click on the blue **Submit** button to submit your results to the leaderboard.\n",
"\n",
"You have now successfully submitted to the competition!\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"jupytext": {
"cell_metadata_filter": "-all",
"formats": "ipynb"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
},
"papermill": {
"default_parameters": {},
"duration": 55.549463,
"end_time": "2021-10-27T20:05:03.556919",
"environment_variables": {},
"exception": null,
"input_path": "__notebook__.ipynb",
"output_path": "__notebook__.ipynb",
"parameters": {},
"start_time": "2021-10-27T20:04:08.007456",
"version": "2.3.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}