{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Kaggle Getting Started Competition : House Prices - Advanced Regression Techniques "
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"The notebook is based on the [notebook](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) provided for [House prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition. The notebook is a buildup of hands-on-exercises presented in Kaggle Learn courses of [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Feature Engineering](https://www.kaggle.com/learn/feature-engineering)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Install necessary packages\n",
"\n",
"We can install the necessary package by either running `pip install --user <package_name>` or include everything in a `requirements.txt` file and run `pip install --user -r requirements.txt`. We have put the dependencies in a `requirements.txt` file so we will use the former method.\n",
"\n",
"> NOTE: Do not forget to use the `--user` argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"!pip install --user -r requirements.txt"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"imports"
]
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"from IPython.display import display\n",
|
|
"from pandas.api.types import CategoricalDtype\n",
|
|
"\n",
|
|
"from category_encoders import MEstimateEncoder\n",
|
|
"from sklearn.cluster import KMeans\n",
|
|
"from sklearn.decomposition import PCA\n",
|
|
"from sklearn.feature_selection import mutual_info_regression\n",
|
|
"from sklearn.model_selection import KFold, cross_val_score\n",
|
|
"from sklearn.metrics import (r2_score,mean_squared_error,\n",
|
|
" mean_squared_log_error,make_scorer)\n",
|
|
"from xgboost import XGBRegressor"
|
|
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Load Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:load_and_preprocess_data"
]
},
"outputs": [],
"source": [
"def load_and_preprocess_data():\n",
|
|
" # Read data\n",
|
|
" path = \"data/\"\n",
|
|
" df_train = pd.read_csv(path + \"train.csv\", index_col=\"Id\")\n",
|
|
" df_test = pd.read_csv(path + \"test.csv\", index_col=\"Id\")\n",
|
|
" # Merge the splits so we can process them together\n",
|
|
" df = pd.concat([df_train, df_test])\n",
|
|
" # Preprocessing\n",
|
|
" df = clean(df)\n",
|
|
" df = encode(df)\n",
|
|
" df = impute(df)\n",
|
|
" # Reform splits\n",
|
|
" df_train = df.loc[df_train.index, :]\n",
|
|
" df_test = df.loc[df_test.index, :]\n",
|
|
" return df_train, df_test \n",
|
|
"\n",
|
|
"df_train, df_test = load_and_preprocess_data()"
|
|
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Exploring the dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"df.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As we can see in the output of the previous column, the max entries should be 2919 for each column but few of them don't have it which indicates missing values. Some of them even have entries less than 500 such as columns PoolQC, MiscFeature and Alley. In the later part of the code we would look into how much potential each feature has."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Now lets check some of the categorical features in the dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"df.Exterior2nd.unique()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"\n",
"Comparing these to description.txt, we can observe that there are some typos in the categories. For example 'Brk Cmn' should be 'Brk Comm'. Let's start the data preprocessing with cleaning of the data"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Preprocessing the data"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Clean the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"functions",
"prev:load_and_preprocess"
]
},
"outputs": [],
"source": [
"def clean(df):\n",
|
|
" df[\"Exterior2nd\"] = df[\"Exterior2nd\"].replace({\"Brk Cmn\": \"BrkComm\"})\n",
|
|
" # Some values of GarageYrBlt are corrupt, so we'll replace them\n",
|
|
" # with the year the house was built\n",
|
|
" df[\"GarageYrBlt\"] = df[\"GarageYrBlt\"].where(df.GarageYrBlt <= 2010, df.YearBuilt)\n",
|
|
" # Names beginning with numbers are awkward to work with\n",
|
|
" df.rename(columns={\n",
|
|
" \"1stFlrSF\": \"FirstFlrSF\",\n",
|
|
" \"2ndFlrSF\": \"SecondFlrSF\",\n",
|
|
" \"3SsnPorch\": \"Threeseasonporch\",\n",
|
|
" }, inplace=True,\n",
|
|
" )\n",
|
|
" return df"
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Encode the Statistical Data Type"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Pandas has Python types corresponding to the standard statistical types (numeric, categorical, etc.). Encoding each feature with its correct type helps ensure each feature is treated appropriately by whatever functions we use, and makes it easier for us to apply transformations consistently"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:"
]
},
"outputs": [],
"source": [
"# The numeric features are already encoded correctly (`float` for\n",
|
|
"# continuous, `int` for discrete), but the categoricals we'll need to\n",
|
|
"# do ourselves. Note in particular, that the `MSSubClass` feature is\n",
|
|
"# read as an `int` type, but is actually a (nominative) categorical.\n",
|
|
"\n",
|
|
"# The nominative (unordered) categorical features\n",
|
|
"features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
|
|
"\n",
|
|
"# Pandas calls the categories \"levels\"\n",
|
|
"five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
|
|
"ten_levels = list(range(10))\n",
|
|
"\n",
|
|
"ordered_levels = {\n",
|
|
" \"OverallQual\": ten_levels,\n",
|
|
" \"OverallCond\": ten_levels,\n",
|
|
" \"ExterQual\": five_levels,\n",
|
|
" \"ExterCond\": five_levels,\n",
|
|
" \"BsmtQual\": five_levels,\n",
|
|
" \"BsmtCond\": five_levels,\n",
|
|
" \"HeatingQC\": five_levels,\n",
|
|
" \"KitchenQual\": five_levels,\n",
|
|
" \"FireplaceQu\": five_levels,\n",
|
|
" \"GarageQual\": five_levels,\n",
|
|
" \"GarageCond\": five_levels,\n",
|
|
" \"PoolQC\": five_levels,\n",
|
|
" \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
|
|
" \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
|
|
" \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
|
|
" \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
|
|
" \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
|
|
" \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
|
|
" \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
|
|
" \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
|
|
" \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
|
|
" \"CentralAir\": [\"N\", \"Y\"],\n",
|
|
" \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
|
|
" \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
|
|
"}\n",
|
|
"\n",
|
|
"# Add a None level for missing values\n",
|
|
"ordered_levels = {key: [\"None\"] + value for key, value in\n",
|
|
" ordered_levels.items()}\n",
|
|
"\n",
|
|
"\n",
|
|
"def encode(df):\n",
|
|
" # Nominal categories\n",
|
|
" for name in features_nom:\n",
|
|
" df[name] = df[name].astype(\"category\")\n",
|
|
" # Add a None category for missing values\n",
|
|
" if \"None\" not in df[name].cat.categories:\n",
|
|
" df[name].cat.add_categories(\"None\", inplace=True)\n",
"    # Ordinal categories\n",
"    for name, levels in ordered_levels.items():\n",
"        df[name] = df[name].astype(CategoricalDtype(levels,\n",
"                                                    ordered=True))\n",
"    return df"
]
},
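{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As a quick sanity check (a minimal sketch, not part of the pipeline), the ordered dtype means `.cat.codes` now reflects the quality scale, with the added `None` level mapping to code 0. `ExterQual` uses the `five_levels` scale defined above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Sketch: spot-check the ordered encoding on one quality column\n",
"display(df_train[\"ExterQual\"].head())\n",
"display(df_train[\"ExterQual\"].cat.codes.head())"
]
},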
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Handle missing values"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Handling missing values now will make the feature engineering go more smoothly. We'll impute 0 for missing numeric values and \"None\" for missing categorical values. You might like to experiment with other imputation strategies. In particular, you could try creating \"missing value\" indicators: 1 whenever a value was imputed and 0 otherwise."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:"
]
},
"outputs": [],
"source": [
"def impute(df):\n",
|
|
" for name in df.select_dtypes(\"number\"):\n",
|
|
" df[name] = df[name].fillna(0)\n",
|
|
" for name in df.select_dtypes(\"category\"):\n",
|
|
" df[name] = df[name].fillna(\"None\")\n",
|
|
" return df"
|
|
]
|
|
},
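{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As suggested above, an alternative is to add missing-value indicators alongside the imputed values. A minimal sketch of that idea (the `impute_with_indicators` helper is hypothetical and not used in the pipeline below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"def impute_with_indicators(df):\n",
"    # Hypothetical variant of impute(): flag where a numeric value was\n",
"    # missing before filling it, so the model can learn from missingness\n",
"    for name in df.select_dtypes(\"number\"):\n",
"        df[name + \"_missing\"] = df[name].isna().astype(int)\n",
"        df[name] = df[name].fillna(0)\n",
"    for name in df.select_dtypes(\"category\"):\n",
"        df[name] = df[name].fillna(\"None\")\n",
"    return df"
]
},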
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Creating Features"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"To understand the potential of each feature, we calculate utility score for each feature. The utility score would help us eliminate the low-scoring features. Training on top features would help us better our model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"functions"
]
},
"outputs": [],
"source": [
"def make_mi_scores(X, y):\n",
|
|
" X = X.copy()\n",
|
|
" for colname in X.select_dtypes([\"object\", \"category\"]):\n",
|
|
" X[colname], _ = X[colname].factorize()\n",
|
|
" # All discrete features should now have integer dtypes\n",
|
|
" discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]\n",
|
|
" mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)\n",
|
|
" mi_scores = pd.Series(mi_scores, name=\"MI Scores\", index=X.columns)\n",
|
|
" mi_scores = mi_scores.sort_values(ascending=False)\n",
|
|
" return mi_scores"
|
|
]
|
|
},
|
|
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"def plot_mi_scores(scores):\n",
"    scores = scores.sort_values(ascending=True)\n",
"    width = np.arange(len(scores))\n",
"    ticks = list(scores.index)\n",
"    plt.barh(width, scores)\n",
"    plt.yticks(width, ticks)\n",
"    plt.title(\"Mutual Information Scores\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Let's analyse mutual information scores"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"X = df_train.copy()\n",
|
|
"y = X.pop(\"SalePrice\")\n",
|
|
"mi_scores = make_mi_scores(X, y)\n",
|
|
"mi_scores.head()"
|
|
]
|
|
},
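{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"We can also visualize the ranking with the `plot_mi_scores` helper defined earlier, for example on the twenty highest-scoring features:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"plt.figure(dpi=100, figsize=(8, 5))\n",
"plot_mi_scores(mi_scores.head(20))"
]
},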
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"We will focus on the top scoring features and drop the features that have 0.0 score"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def drop_uninformative(df, mi_scores):\n",
"    return df.loc[:, mi_scores > 0.0]"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Label encoding"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def label_encode(df):\n",
"    X = df.copy()\n",
"    for colname in X.select_dtypes([\"category\"]):\n",
"        X[colname] = X[colname].cat.codes\n",
"    return X"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Defining functions for feature creation using pandas"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def mathematical_transforms(df):\n",
|
|
" X = pd.DataFrame() # dataframe to hold new features\n",
|
|
" X[\"LivLotRatio\"] = df.GrLivArea / df.LotArea\n",
|
|
" X[\"Spaciousness\"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd\n",
|
|
" return X\n",
|
|
"\n",
|
|
"def interactions(df):\n",
|
|
" X = pd.get_dummies(df.BldgType, prefix=\"Bldg\")\n",
|
|
" X = X.mul(df.GrLivArea, axis=0)\n",
|
|
" return X\n",
|
|
"\n",
|
|
"def counts(df):\n",
|
|
" X = pd.DataFrame()\n",
|
|
" X[\"PorchTypes\"] = df[[\n",
|
|
" \"WoodDeckSF\",\n",
|
|
" \"OpenPorchSF\",\n",
|
|
" \"EnclosedPorch\",\n",
|
|
" \"Threeseasonporch\",\n",
|
|
" \"ScreenPorch\",\n",
|
|
" ]].gt(0.0).sum(axis=1)\n",
|
|
" return X\n",
|
|
"\n",
|
|
"def break_down(df):\n",
|
|
" X = pd.DataFrame()\n",
|
|
" X[\"MSClass\"] = df.MSSubClass.str.split(\"_\", n=1, expand=True)[0]\n",
|
|
" return X\n",
|
|
"\n",
|
|
"def group_transforms(df):\n",
|
|
" X = pd.DataFrame()\n",
|
|
" X[\"MedNhbdArea\"] = df.groupby(\"Neighborhood\")[\"GrLivArea\"].transform(\"median\")\n",
|
|
" return X"
|
|
]
|
|
},
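{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As a quick sketch of what these transforms produce (they are applied for real inside `create_features` below), here are two of them evaluated on the training split:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Sketch: preview two engineered feature sets on the training split\n",
"display(mathematical_transforms(df_train).head())\n",
"display(counts(df_train).head())"
]
},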
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Defining functions for feature creation using k-Means Clustering algorithm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"cluster_features = [\n",
|
|
" \"LotArea\",\n",
|
|
" \"TotalBsmtSF\",\n",
|
|
" \"FirstFlrSF\",\n",
|
|
" \"SecondFlrSF\",\n",
|
|
" \"GrLivArea\",\n",
|
|
"]\n",
|
|
"\n",
|
|
"\n",
|
|
"def cluster_labels(df, features, n_clusters=20):\n",
|
|
" X = df.copy()\n",
|
|
" X_scaled = X.loc[:, features]\n",
|
|
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
|
|
" kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)\n",
|
|
" X_new = pd.DataFrame()\n",
|
|
" X_new[\"Cluster\"] = kmeans.fit_predict(X_scaled)\n",
|
|
" return X_new\n",
|
|
"\n",
|
|
"\n",
|
|
"def cluster_distance(df, features, n_clusters=20):\n",
|
|
" X = df.copy()\n",
|
|
" X_scaled = X.loc[:, features]\n",
|
|
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
|
|
" kmeans = KMeans(n_clusters=20, n_init=50, random_state=0)\n",
"    X_cd = kmeans.fit_transform(X_scaled)\n",
"    # Label features and join to dataset\n",
"    X_cd = pd.DataFrame(\n",
"        X_cd, columns=[f\"Centroid_{i}\" for i in range(X_cd.shape[1])]\n",
"    )\n",
"    return X_cd"
]
},
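{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Neither function is wired into the pipeline below (the corresponding calls in `create_features` are commented out), but as a quick sketch the cluster labels can be previewed on the training split:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Sketch: preview k-means cluster labels for the area-related features\n",
"cluster_labels(df_train, cluster_features, n_clusters=20).head()"
]
},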
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Defining functions for feature creation using PCA algorithm"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def apply_pca(X, standardize=True):\n",
|
|
" # Standardize\n",
|
|
" if standardize:\n",
|
|
" X = (X - X.mean(axis=0)) / X.std(axis=0)\n",
|
|
" # Create principal components\n",
|
|
" pca = PCA()\n",
|
|
" X_pca = pca.fit_transform(X)\n",
|
|
" # Convert to dataframe\n",
|
|
" component_names = [f\"PC{i+1}\" for i in range(X_pca.shape[1])]\n",
|
|
" X_pca = pd.DataFrame(X_pca, columns=component_names)\n",
|
|
" # Create loadings\n",
|
|
" loadings = pd.DataFrame(\n",
|
|
" pca.components_.T, # transpose the matrix of loadings\n",
|
|
" columns=component_names, # so the columns are the principal components\n",
|
|
" index=X.columns, # and the rows are the original features\n",
|
|
" )\n",
|
|
" return pca, X_pca, loadings\n",
|
|
"\n",
|
|
"def pca_inspired(df):\n",
|
|
" X = pd.DataFrame()\n",
|
|
" X[\"Feature1\"] = df.GrLivArea + df.TotalBsmtSF\n",
|
|
" X[\"Feature2\"] = df.YearRemodAdd * df.TotalBsmtSF\n",
|
|
" return X\n",
|
|
"\n",
|
|
"\n",
|
|
"def pca_components(df, features):\n",
|
|
" X = df.loc[:, features]\n",
|
|
" _, X_pca, _ = apply_pca(X)\n",
|
|
" return X_pca\n",
|
|
"\n",
|
|
"\n",
|
|
"pca_features = [\n",
|
|
" \"GarageArea\",\n",
|
|
" \"YearRemodAdd\",\n",
|
|
" \"TotalBsmtSF\",\n",
|
|
" \"GrLivArea\",\n",
|
|
"]"
|
|
]
|
|
},
|
|
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"def plot_variance(pca, width=8, dpi=100):\n",
|
|
" # Create figure\n",
|
|
" fig, axs = plt.subplots(1, 2)\n",
|
|
" n = pca.n_components_\n",
|
|
" grid = np.arange(1, n + 1)\n",
|
|
" # Explained variance\n",
|
|
" evr = pca.explained_variance_ratio_\n",
|
|
" axs[0].bar(grid, evr)\n",
|
|
" axs[0].set(\n",
|
|
" xlabel=\"Component\", title=\"% Explained Variance\", ylim=(0.0, 1.0)\n",
|
|
" )\n",
|
|
" # Cumulative Variance\n",
|
|
" cv = np.cumsum(evr)\n",
|
|
" axs[1].plot(np.r_[0, grid], np.r_[0, cv], \"o-\")\n",
|
|
" axs[1].set(\n",
|
|
" xlabel=\"Component\", title=\"% Cumulative Variance\", ylim=(0.0, 1.0)\n",
|
|
" )\n",
|
|
" # Set up figure\n",
|
|
" fig.set(figwidth=8, dpi=100)\n",
|
|
" return axs"
|
|
]
|
|
},
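{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"For example (a quick sketch on the training split), we can apply PCA to the `pca_features` list, inspect the loadings, and plot the explained variance:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Sketch: PCA on the four area/age features defined above\n",
"pca, X_pca, loadings = apply_pca(df_train.loc[:, pca_features])\n",
"plot_variance(pca)\n",
"loadings"
]
},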
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Correlation matrix"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"import seaborn as sns\n",
|
|
"def corrplot(df, method=\"pearson\", annot=True, **kwargs):\n",
|
|
" sns.clustermap(\n",
|
|
" df.corr(method),\n",
|
|
" vmin=-1.0,\n",
|
|
" vmax=1.0,\n",
|
|
" cmap=\"icefire\",\n",
|
|
" method=\"complete\",\n",
|
|
" annot=annot,\n",
|
|
" **kwargs,\n",
|
|
" )"
|
|
]
|
|
},
|
|
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"corrplot(df_train, annot=None)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Target Encoding"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Using a separate holdout to create a target encoding would be a waste of data being used, instead we can use a trick similar to cross-validation:\n",
|
|
"1. Split the data into folds, each fold having two splits of the dataset.\n",
|
|
"2. Train the encoder on one split but transform the values of the other.\n",
|
|
"3. Repeat for all the splits.\n",
|
|
"This way, training and transformation always take place on independent sets of data, just like when you use a holdout set but without any data going to waste."
|
|
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"class CrossFoldEncoder:\n",
|
|
" def __init__(self, encoder, **kwargs):\n",
|
|
" self.encoder_ = encoder\n",
|
|
" self.kwargs_ = kwargs # keyword arguments for the encoder\n",
|
|
" self.cv_ = KFold(n_splits=5)\n",
|
|
"\n",
|
|
" # Fit an encoder on one split and transform the feature on the\n",
|
|
" # other. Iterating over the splits in all folds gives a complete\n",
|
|
" # transformation. We also now have one trained encoder on each\n",
|
|
" # fold.\n",
|
|
" def fit_transform(self, X, y, cols):\n",
|
|
" self.fitted_encoders_ = []\n",
|
|
" self.cols_ = cols\n",
|
|
" X_encoded = []\n",
|
|
" for idx_encode, idx_train in self.cv_.split(X):\n",
|
|
" fitted_encoder = self.encoder_(cols=cols, **self.kwargs_)\n",
|
|
" fitted_encoder.fit(\n",
|
|
" X.iloc[idx_encode, :], y.iloc[idx_encode],\n",
|
|
" )\n",
|
|
" X_encoded.append(fitted_encoder.transform(X.iloc[idx_train, :])[cols])\n",
|
|
" self.fitted_encoders_.append(fitted_encoder)\n",
|
|
" X_encoded = pd.concat(X_encoded)\n",
|
|
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
|
|
" return X_encoded\n",
|
|
"\n",
|
|
" # To transform the test data, average the encodings learned from\n",
|
|
" # each fold.\n",
|
|
" def transform(self, X):\n",
|
|
" from functools import reduce\n",
|
|
"\n",
|
|
" X_encoded_list = []\n",
|
|
" for fitted_encoder in self.fitted_encoders_:\n",
|
|
" X_encoded = fitted_encoder.transform(X)\n",
|
|
" X_encoded_list.append(X_encoded[self.cols_])\n",
|
|
" X_encoded = reduce(\n",
|
|
" lambda x, y: x.add(y, fill_value=0), X_encoded_list\n",
|
|
" ) / len(X_encoded_list)\n",
|
|
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
|
|
" return X_encoded"
|
|
]
|
|
},
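{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As a standalone sketch of how `CrossFoldEncoder` combines with the imported `MEstimateEncoder` (the same usage appears, commented out, in `create_features` below):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Sketch: cross-fold target encoding of a single categorical column\n",
"X = df_train.copy()\n",
"y = X.pop(\"SalePrice\")\n",
"encoder = CrossFoldEncoder(MEstimateEncoder, m=1)\n",
"encoder.fit_transform(X, y, cols=[\"MSSubClass\"]).head()"
]
},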
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Creating the final feature set"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Now let's combine everything together. Putting the transformations into separate functions makes it easier to experiment with various combinations "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def create_features(df, df_test=None):\n",
"    X = df.copy()\n",
"    y = X.pop(\"SalePrice\")\n",
"    mi_scores = make_mi_scores(X, y)\n",
"\n",
"    # Combine splits if test data is given\n",
"    #\n",
"    # If we're creating features for test set predictions, we should\n",
"    # use all the data we have available. After creating our features,\n",
"    # we'll recreate the splits.\n",
"    if df_test is not None:\n",
"        X_test = df_test.copy()\n",
"        X_test.pop(\"SalePrice\")\n",
"        X = pd.concat([X, X_test])\n",
"\n",
"    # Mutual Information\n",
"    X = drop_uninformative(X, mi_scores)\n",
"\n",
"    X.info()\n",
"\n",
" # Experimentation can be done with uncommented codes amd transformations and che\n",
"\n",
"    # Transformations\n",
"    X = X.join(mathematical_transforms(X))\n",
"    X = X.join(interactions(X))\n",
"    X = X.join(counts(X))\n",
"    # X = X.join(break_down(X))\n",
"    X = X.join(group_transforms(X))\n",
"\n",
"    # Clustering\n",
"    # X = X.join(cluster_labels(X, cluster_features, n_clusters=20))\n",
"    # X = X.join(cluster_distance(X, cluster_features, n_clusters=20))\n",
"\n",
"    # PCA\n",
"    X = X.join(pca_inspired(X))\n",
"    # X = X.join(pca_components(X, pca_features))\n",
"    # X = X.join(indicate_outliers(X))\n",
"\n",
"    X = label_encode(X)\n",
"\n",
"    X.info()\n",
"\n",
"    # Reform splits\n",
"    if df_test is not None:\n",
"        X_test = X.loc[df_test.index, :]\n",
"        X.drop(df_test.index, inplace=True)\n",
"\n",
"    # Target Encoder\n",
"    # encoder = CrossFoldEncoder(MEstimateEncoder, m=1)\n",
"    # X = X.join(encoder.fit_transform(X, y, cols=[\"MSSubClass\"]))\n",
"    # if df_test is not None:\n",
"    #     X_test = X_test.join(encoder.transform(X_test))\n",
"\n",
"    if df_test is not None:\n",
"        return X, X_test\n",
"    else:\n",
"        return X"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Build Models"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def score_dataset(X, y, model=XGBRegressor()):\n",
|
|
" # Label encoding for categoricals\n",
|
|
" #\n",
|
|
" # Label encoding is good for XGBoost and RandomForest, but one-hot\n",
|
|
" # would be better for models like Lasso or Ridge. The `cat.codes`\n",
|
|
" # attribute holds the category levels.\n",
|
|
" for colname in X.select_dtypes([\"category\"]):\n",
|
|
" X[colname] = X[colname].cat.codes\n",
|
|
" # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)\n",
|
|
" log_y = np.log(y)\n",
|
|
" score = cross_val_score(\n",
|
|
" model, X, log_y, cv=5, scoring=\"neg_mean_squared_error\",\n",
|
|
" )\n",
|
|
" score = -1 * score.mean()\n",
|
|
" score = np.sqrt(score)\n",
|
|
" return score"
|
|
]
|
|
},
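{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Note that `score_dataset` reports RMSE on log-prices, which is the competition's RMSLE. As a toy sanity check (a sketch; sklearn's `mean_squared_log_error`, imported earlier, uses `log1p` rather than `log`, so the two values agree only approximately, very closely at these price magnitudes):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Toy example: RMSLE via sklearn vs. RMSE of log-prices\n",
"y_true = np.array([200000.0, 150000.0])\n",
"y_pred = np.array([210000.0, 140000.0])\n",
"print(np.sqrt(mean_squared_log_error(y_true, y_pred)))\n",
"print(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))"
]
},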
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Baseline training model"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As we are trying to understand what features would provide a better model, we need to establish a baseline model which would act as a reference point for comparison of the models with different combination of features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"X = df_train.copy()\n",
"y = X.pop(\"SalePrice\")\n",
"\n",
"baseline_score = score_dataset(X, y)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"print(f\"Baseline score: {baseline_score:.5f} RMSLE\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Features training model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip",
"prev:load_and_preprocess"
]
},
"outputs": [],
"source": [
"X_train = create_features(df_train)\n",
"y_train = df_train.loc[:, \"SalePrice\"]\n",
"feature_model_score = score_dataset(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"print(f\"Feature model score: {feature_model_score:.5f} RMSLE\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As we can see feature, engineering has help us improve score"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Final Model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:featured_data",
"prev:load_and_preprocess_data"
]
},
"outputs": [],
"source": [
"X_train, X_test = create_features(df_train, df_test)\n",
"y_train = df_train.loc[:, \"SalePrice\"]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"X_train.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:train_data",
"prev:featured_data"
]
},
"outputs": [],
"source": [
"xgb_params = dict(\n",
|
|
" max_depth=6, # maximum depth of each tree - try 2 to 10\n",
|
|
" learning_rate=0.01, # effect of each tree - try 0.0001 to 0.1\n",
|
|
" n_estimators=1000, # number of trees (that is, boosting rounds) - try 1000 to 8000\n",
|
|
" min_child_weight=1, # minimum number of houses in a leaf - try 1 to 10\n",
|
|
" colsample_bytree=0.7, # fraction of features (columns) per tree - try 0.2 to 1.0\n",
|
|
" subsample=0.7, # fraction of instances (rows) per tree - try 0.2 to 1.0\n",
|
|
" reg_alpha=0.5, # L1 regularization (like LASSO) - try 0.0 to 10.0\n",
|
|
" reg_lambda=1.0, # L2 regularization (like Ridge) - try 0.0 to 10.0\n",
|
|
" num_parallel_tree=1, # set > 1 for boosted random forests\n",
|
|
")\n",
|
|
"\n",
|
|
"xgb = XGBRegressor(**xgb_params)\n",
|
|
"# XGB minimizes MSE, but competition loss is RMSLE\n",
|
|
"# So, we need to log-transform y to train and exp-transform the predictions\n",
|
|
"xgb.fit(X_train, np.log(y_train))"
|
|
]
|
|
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:eval_data",
"prev:train_data"
]
},
"outputs": [],
"source": [
"predictions = np.exp(xgb.predict(X_test))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"print(predictions)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Submission"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"output = pd.DataFrame({'Id': X_test.index, 'SalePrice': predictions})\n",
"output.to_csv('data/my_submission.csv', index=False)\n",
"print(\"Your submission was successfully saved!\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kubeflow_notebook": {
"autosnapshot": true,
"docker_image": "",
"experiment": {
"id": "new",
"name": "house-prices"
},
"experiment_name": "house-prices",
"katib_metadata": {
"algorithm": {
"algorithmName": "grid"
},
"maxFailedTrialCount": 3,
"maxTrialCount": 12,
"objective": {
"objectiveMetricName": "",
"type": "minimize"
},
"parallelTrialCount": 3,
"parameters": []
},
"katib_run": false,
"pipeline_description": "",
"pipeline_name": "predict-house-prices-kale",
"snapshot_volumes": true,
"steps_defaults": [
"label:access-ml-pipeline:true",
"label:access-rok:true"
],
"volume_access_mode": "rwm",
"volumes": [
{
"annotations": [],
"mount_point": "/home/jovyan/data",
"name": "data-rbdrx",
"size": 5,
"size_type": "Gi",
"snapshot": false,
"type": "clone"
},
{
"annotations": [],
"mount_point": "/home/jovyan",
"name": "house-prices-workspace-q2pv2",
"size": 5,
"size_type": "Gi",
"snapshot": false,
"type": "clone"
}
]
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}