examples/house-prices-kaggle-competi.../house-prices-kfp.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "#  Kaggle Getting Started Competition : House Prices - Advanced Regression Techniques "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "The notebook is based on the [notebook](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) provided for [House prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition. The notebook is a buildup of hands-on-exercises presented in Kaggle Learn courses of [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Feature Engineering](https://www.kaggle.com/learn/feature-engineering)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "Following are the imports required to build the pipeline and pass the data between components for building up the kubeflow pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the kfp \n",
    "# !pip install kfp --upgrade "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import kfp\n",
    "from kfp.components import func_to_container_op\n",
    "import kfp.components as comp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "All the essential imports required in a pipeline component are put together in a list which then is passed on to each pipeline component. Though this might not be efficient when you are dealing with lot of packages, so in cases with many packages and dependencies you can go for docker image which then can be passed to each pipeline component"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import_packages = ['pandas', 'sklearn', 'category_encoders', 'xgboost', 'numpy']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "In the following implementation of kubeflow pipeline we are making use of [lightweight python function components](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/) to build up the pipeline. The data is passed between component instances(tasks) using InputPath and OutputPath. This doesn't require use of defining external volume and attaching to the tasks as the system takes care of storing the data. Further details and examples of it can be found in the following [link](https://github.com/Ark-kun/kfp_samples/blob/65a98da2d4d2bd27a803ee58213b4cfd8a84825e/2019-10%20Kubeflow%20summit/104%20-%20Passing%20data%20for%20python%20components/104%20-%20Passing%20data%20for%20python%20components.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "The pipeline is divided into five components\n",
    "1. Download data zip file from url\n",
    "2. Load data\n",
    "3. Creating data with features\n",
    "4. Train data\n",
    "5. Evaluating data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Download Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "For the purpose of this, we are using an existing yaml file available from kubeflow/pipelines for 'Download Data' component to download data from URLs. In our case, we are getting it from github."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "web_downloader_op = kfp.components.load_component_from_url(\n",
    "    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/contrib/web/Download/component.yaml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Load and Preprocess Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_and_preprocess_data(file_path : comp.InputPath() , train_output_csv: comp.OutputPath(), test_output_csv: comp.OutputPath()):\n",
    "    \n",
    "    # Read data\n",
    "    import pandas as pd\n",
    "    from pandas.api.types import CategoricalDtype\n",
    "    from zipfile import ZipFile   \n",
    "    \n",
    "    # Extracting from zip file \n",
    "    with ZipFile(file_path, 'r') as zip:\n",
    "        zip.extractall()\n",
    "        \n",
    "    # Load the training and test data\n",
    "    train_file_dir = 'data/train.csv'\n",
    "    test_file_dir = 'data/test.csv'\n",
    "    df_train = pd.read_csv(train_file_dir, index_col=\"Id\")\n",
    "    df_test = pd.read_csv( test_file_dir , index_col=\"Id\")\n",
    "    \n",
    "    # Merge the splits so we can process them together\n",
    "    df = pd.concat([df_train, df_test])\n",
    "        \n",
    "    # Clean data\n",
    "    df[\"Exterior2nd\"] = df[\"Exterior2nd\"].replace({\"Brk Cmn\": \"BrkComm\"})\n",
    "    # Some values of GarageYrBlt are corrupt, so we'll replace them\n",
    "    # with the year the house was built\n",
    "    df[\"GarageYrBlt\"] = df[\"GarageYrBlt\"].where(df.GarageYrBlt <= 2010, df.YearBuilt)\n",
    "    # Names beginning with numbers are awkward to work with\n",
    "    df.rename(columns={\n",
    "        \"1stFlrSF\": \"FirstFlrSF\",\n",
    "        \"2ndFlrSF\": \"SecondFlrSF\",\n",
    "        \"3SsnPorch\": \"Threeseasonporch\",\n",
    "    }, inplace=True,\n",
    "    )\n",
    "    \n",
    "    # Encode data\n",
    "    \n",
    "    # Nominal categories\n",
    "    # The numeric features are already encoded correctly (`float` for\n",
    "    # continuous, `int` for discrete), but the categoricals we'll need to\n",
    "    # do ourselves. Note in particular, that the `MSSubClass` feature is\n",
    "    # read as an `int` type, but is actually a (nominative) categorical.\n",
    "\n",
    "    # The nominative (unordered) categorical features\n",
    "    features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
    "\n",
    "    # Pandas calls the categories \"levels\"\n",
    "    five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
    "    ten_levels = list(range(10))\n",
    "\n",
    "    ordered_levels = {\n",
    "        \"OverallQual\": ten_levels,\n",
    "        \"OverallCond\": ten_levels,\n",
    "        \"ExterQual\": five_levels,\n",
    "        \"ExterCond\": five_levels,\n",
    "        \"BsmtQual\": five_levels,\n",
    "        \"BsmtCond\": five_levels,\n",
    "        \"HeatingQC\": five_levels,\n",
    "        \"KitchenQual\": five_levels,\n",
    "        \"FireplaceQu\": five_levels,\n",
    "        \"GarageQual\": five_levels,\n",
    "        \"GarageCond\": five_levels,\n",
    "        \"PoolQC\": five_levels,\n",
    "        \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
    "        \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
    "        \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
    "        \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
    "        \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
    "        \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
    "        \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
    "        \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
    "        \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
    "        \"CentralAir\": [\"N\", \"Y\"],\n",
    "        \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
    "        \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
    "    }\n",
    "\n",
    "    # Add a None level for missing values\n",
    "    ordered_levels = {key: [\"None\"] + value for key, value in\n",
    "                      ordered_levels.items()}\n",
    "\n",
    "\n",
    "    for name in features_nom:\n",
    "        df[name] = df[name].astype(\"category\")\n",
    "        # Add a None category for missing values\n",
    "        if \"None\" not in df[name].cat.categories:\n",
    "            df[name].cat.add_categories(\"None\", inplace=True)\n",
    "    # Ordinal categories\n",
    "    for name, levels in ordered_levels.items():\n",
    "        df[name] = df[name].astype(CategoricalDtype(levels,\n",
    "                                                    ordered=True))\n",
    "        \n",
    "    \n",
    "    # Impute data\n",
    "    for name in df.select_dtypes(\"number\"):\n",
    "        df[name] = df[name].fillna(0)\n",
    "    for name in df.select_dtypes(include = [\"category\"]):\n",
    "        df[name] = df[name].fillna(\"None\")\n",
    "        \n",
    "    # Reform splits        \n",
    "    df_train = df.loc[df_train.index, :]\n",
    "    df_test = df.loc[df_test.index, :]\n",
    "    \n",
    "    # passing the data as csv files to outputs\n",
    "    df_train.to_csv(train_output_csv)\n",
    "    df_test.to_csv(test_output_csv)        \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "load_and_preprocess_data_op = func_to_container_op(load_and_preprocess_data,packages_to_install = import_packages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Creating data with features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def featured_data(train_path: comp.InputPath(), test_path : comp.InputPath(), feat_train_output_csv: comp.OutputPath(), feat_test_output_csv: comp.OutputPath()):\n",
    "    \n",
    "    import pandas as pd\n",
    "    from pandas.api.types import CategoricalDtype\n",
    "    from category_encoders import MEstimateEncoder\n",
    "    from sklearn.feature_selection import mutual_info_regression\n",
    "    from sklearn.cluster import KMeans\n",
    "    from sklearn.decomposition import PCA\n",
    "    from sklearn.model_selection import KFold, cross_val_score\n",
    "    \n",
    "    df_train = pd.read_csv(train_path, index_col=\"Id\")\n",
    "    df_test = pd.read_csv(test_path, index_col=\"Id\")\n",
    "    \n",
    "    def make_mi_scores(X, y):\n",
    "        X = X.copy()\n",
    "        for colname in X.select_dtypes([\"object\",\"category\"]):\n",
    "            X[colname], _ = X[colname].factorize()\n",
    "        # All discrete features should now have integer dtypes\n",
    "        discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]\n",
    "        mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)\n",
    "        mi_scores = pd.Series(mi_scores, name=\"MI Scores\", index=X.columns)\n",
    "        mi_scores = mi_scores.sort_values(ascending=False)\n",
    "        return mi_scores\n",
    "    \n",
    "    def drop_uninformative(df, mi_scores):\n",
    "        return df.loc[:, mi_scores > 0.0]\n",
    "    \n",
    "    def label_encode(df):\n",
    "        \n",
    "        X = df.copy()   \n",
    "        for colname in X.select_dtypes([\"category\"]):\n",
    "            X[colname] = X[colname].cat.codes\n",
    "        return X\n",
    "\n",
    "    def mathematical_transforms(df):\n",
    "        X = pd.DataFrame()  # dataframe to hold new features\n",
    "        X[\"LivLotRatio\"] = df.GrLivArea / df.LotArea\n",
    "        X[\"Spaciousness\"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd\n",
    "        return X\n",
    "\n",
    "    def interactions(df):\n",
    "        X = pd.get_dummies(df.BldgType, prefix=\"Bldg\")\n",
    "        X = X.mul(df.GrLivArea, axis=0)\n",
    "        return X\n",
    "\n",
    "    def counts(df):\n",
    "        X = pd.DataFrame()\n",
    "        X[\"PorchTypes\"] = df[[\n",
    "            \"WoodDeckSF\",\n",
    "            \"OpenPorchSF\",\n",
    "            \"EnclosedPorch\",\n",
    "            \"Threeseasonporch\",\n",
    "            \"ScreenPorch\",\n",
    "        ]].gt(0.0).sum(axis=1)\n",
    "        return X\n",
    "\n",
    "    def break_down(df):\n",
    "        X = pd.DataFrame()\n",
    "        X[\"MSClass\"] = df.MSSubClass.str.split(\"_\", n=1, expand=True)[0]\n",
    "        return X\n",
    "\n",
    "    def group_transforms(df):\n",
    "        X = pd.DataFrame()\n",
    "        X[\"MedNhbdArea\"] = df.groupby(\"Neighborhood\")[\"GrLivArea\"].transform(\"median\")\n",
    "        return X\n",
    "    \n",
    "    cluster_features = [\n",
    "        \"LotArea\",\n",
    "        \"TotalBsmtSF\",\n",
    "        \"FirstFlrSF\",\n",
    "        \"SecondFlrSF\",\n",
    "        \"GrLivArea\",\n",
    "        ]\n",
    "\n",
    "    def cluster_labels(df, features, n_clusters=20):\n",
    "        X = df.copy()\n",
    "        X_scaled = X.loc[:, features]\n",
    "        X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
    "        kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)\n",
    "        X_new = pd.DataFrame()\n",
    "        X_new[\"Cluster\"] = kmeans.fit_predict(X_scaled)\n",
    "        return X_new\n",
    "\n",
    "    def cluster_distance(df, features, n_clusters=20):\n",
    "        X = df.copy()\n",
    "        X_scaled = X.loc[:, features]\n",
    "        X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
    "        kmeans = KMeans(n_clusters=20, n_init=50, random_state=0)\n",
    "        X_cd = kmeans.fit_transform(X_scaled)\n",
    "        # Label features and join to dataset\n",
    "        X_cd = pd.DataFrame(\n",
    "            X_cd, columns=[f\"Centroid_{i}\" for i in range(X_cd.shape[1])]\n",
    "        )\n",
    "        return X_cd\n",
    "    \n",
    "    def apply_pca(X, standardize=True):\n",
    "        # Standardize\n",
    "        if standardize:\n",
    "            X = (X - X.mean(axis=0)) / X.std(axis=0)\n",
    "        # Create principal components\n",
    "        pca = PCA()\n",
    "        X_pca = pca.fit_transform(X)\n",
    "        # Convert to dataframe\n",
    "        component_names = [f\"PC{i+1}\" for i in range(X_pca.shape[1])]\n",
    "        X_pca = pd.DataFrame(X_pca, columns=component_names)\n",
    "        # Create loadings\n",
    "        loadings = pd.DataFrame(\n",
    "            pca.components_.T,  # transpose the matrix of loadings\n",
    "            columns=component_names,  # so the columns are the principal components\n",
    "            index=X.columns,  # and the rows are the original features\n",
    "        )\n",
    "        return pca, X_pca, loadings\n",
    "\n",
    "    def pca_inspired(df):\n",
    "        X = pd.DataFrame()\n",
    "        X[\"Feature1\"] = df.GrLivArea + df.TotalBsmtSF\n",
    "        X[\"Feature2\"] = df.YearRemodAdd * df.TotalBsmtSF\n",
    "        return X\n",
    "\n",
    "\n",
    "    def pca_components(df, features):\n",
    "        X = df.loc[:, features]\n",
    "        _, X_pca, _ = apply_pca(X)\n",
    "        return X_pca\n",
    "\n",
    "\n",
    "    pca_features = [\n",
    "        \"GarageArea\",\n",
    "        \"YearRemodAdd\",\n",
    "        \"TotalBsmtSF\",\n",
    "        \"GrLivArea\",\n",
    "    ]\n",
    "    \n",
    "    class CrossFoldEncoder:\n",
    "        def __init__(self, encoder, **kwargs):\n",
    "            self.encoder_ = encoder\n",
    "            self.kwargs_ = kwargs  # keyword arguments for the encoder\n",
    "            self.cv_ = KFold(n_splits=5)\n",
    "\n",
    "        # Fit an encoder on one split and transform the feature on the\n",
    "        # other. Iterating over the splits in all folds gives a complete\n",
    "        # transformation. We also now have one trained encoder on each\n",
    "        # fold.\n",
    "        def fit_transform(self, X, y, cols):\n",
    "            self.fitted_encoders_ = []\n",
    "            self.cols_ = cols\n",
    "            X_encoded = []\n",
    "            for idx_encode, idx_train in self.cv_.split(X):\n",
    "                fitted_encoder = self.encoder_(cols=cols, **self.kwargs_)\n",
    "                fitted_encoder.fit(\n",
    "                    X.iloc[idx_encode, :], y.iloc[idx_encode],\n",
    "                )\n",
    "                X_encoded.append(fitted_encoder.transform(X.iloc[idx_train, :])[cols])\n",
    "                self.fitted_encoders_.append(fitted_encoder)\n",
    "            X_encoded = pd.concat(X_encoded)\n",
    "            X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
    "            return X_encoded\n",
    "\n",
    "        # To transform the test data, average the encodings learned from\n",
    "        # each fold.\n",
    "        def transform(self, X):\n",
    "            from functools import reduce\n",
    "\n",
    "            X_encoded_list = []\n",
    "            for fitted_encoder in self.fitted_encoders_:\n",
    "                X_encoded = fitted_encoder.transform(X)\n",
    "                X_encoded_list.append(X_encoded[self.cols_])\n",
    "            X_encoded = reduce(\n",
    "                lambda x, y: x.add(y, fill_value=0), X_encoded_list\n",
    "            ) / len(X_encoded_list)\n",
    "            X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
    "            return X_encoded\n",
    "        \n",
    "    X = df_train.copy()\n",
    "    y = X.pop(\"SalePrice\") \n",
    "    \n",
    "    X_test = df_test.copy()\n",
    "    X_test.pop(\"SalePrice\")\n",
    "    \n",
    "    # Get the mutual information scores\n",
    "    mi_scores = make_mi_scores(X, y)\n",
    "    \n",
    "    # Concat the training and test dataset before restoring categorical encoding\n",
    "    X = pd.concat([X, X_test])\n",
    "    \n",
    "    # Restore the categorical encoding removed during csv conversion\n",
    "    # The nominative (unordered) categorical features\n",
    "    features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
    "\n",
    "    # Pandas calls the categories \"levels\"\n",
    "    five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
    "    ten_levels = list(range(10))\n",
    "\n",
    "    ordered_levels = {\n",
    "        \"OverallQual\": ten_levels,\n",
    "        \"OverallCond\": ten_levels,\n",
    "        \"ExterQual\": five_levels,\n",
    "        \"ExterCond\": five_levels,\n",
    "        \"BsmtQual\": five_levels,\n",
    "        \"BsmtCond\": five_levels,\n",
    "        \"HeatingQC\": five_levels,\n",
    "        \"KitchenQual\": five_levels,\n",
    "        \"FireplaceQu\": five_levels,\n",
    "        \"GarageQual\": five_levels,\n",
    "        \"GarageCond\": five_levels,\n",
    "        \"PoolQC\": five_levels,\n",
    "        \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
    "        \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
    "        \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
    "        \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
    "        \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
    "        \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
    "        \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
    "        \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
    "        \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
    "        \"CentralAir\": [\"N\", \"Y\"],\n",
    "        \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
    "        \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
    "    }\n",
    "\n",
    "#     Add a None level for missing values\n",
    "    ordered_levels = {key: [\"None\"] + value for key, value in\n",
    "                      ordered_levels.items()}\n",
    "    \n",
    "    for name in features_nom:\n",
    "        X[name] = X[name].astype(\"category\")\n",
    "        if \"None\" not in X[name].cat.categories:\n",
    "            X[name].cat.add_categories(\"None\", inplace=True)\n",
    "        \n",
    "    # Ordinal categories\n",
    "    for name, levels in ordered_levels.items():\n",
    "        X[name] = X[name].astype(CategoricalDtype(levels,\n",
    "                                                    ordered=True))\n",
    "           \n",
    "    # Drop features with less mutual information scores\n",
    "    X = drop_uninformative(X, mi_scores)\n",
    "    \n",
    "\n",
    "    # Transformations\n",
    "    X = X.join(mathematical_transforms(X))\n",
    "    X = X.join(interactions(X))\n",
    "    X = X.join(counts(X))\n",
    "    # X = X.join(break_down(X))\n",
    "    X = X.join(group_transforms(X))\n",
    "\n",
    "    # Clustering\n",
    "    # X = X.join(cluster_labels(X, cluster_features, n_clusters=20))\n",
    "    # X = X.join(cluster_distance(X, cluster_features, n_clusters=20))\n",
    "\n",
    "    # PCA\n",
    "    X = X.join(pca_inspired(X))\n",
    "    # X = X.join(pca_components(X, pca_features))\n",
    "    # X = X.join(indicate_outliers(X))\n",
    "    \n",
    "    # Label encoding\n",
    "    X = label_encode(X)\n",
    "    \n",
    "    # Reform splits\n",
    "    X_test = X.loc[df_test.index, :]\n",
    "    X.drop(df_test.index, inplace=True)\n",
    "\n",
    "    # Target Encoder\n",
    "    encoder = CrossFoldEncoder(MEstimateEncoder, m=1)\n",
    "    X = X.join(encoder.fit_transform(X, y, cols=[\"MSSubClass\"]))\n",
    "    \n",
    "    X_test = X_test.join(encoder.transform(X_test))\n",
    "    \n",
    "    X.to_csv(feat_train_output_csv)\n",
    "    X_test.to_csv(feat_test_output_csv)\n",
    "\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "featured_data_op = func_to_container_op(featured_data, packages_to_install = import_packages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Train data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def train_data(train_path: comp.InputPath(), feat_train_path: comp.InputPath(), feat_test_path : comp.InputPath(), model_path : comp.OutputPath('XGBoostModel')):\n",
    "    \n",
    "    import pandas as pd\n",
    "    import numpy as np\n",
    "    from xgboost.sklearn import XGBRegressor\n",
    "    from pathlib import Path\n",
    "    \n",
    "    df_train = pd.read_csv(train_path, index_col=\"Id\")\n",
    "    X_train = pd.read_csv(feat_train_path, index_col=\"Id\")\n",
    "    X_test = pd.read_csv(feat_test_path, index_col=\"Id\")\n",
    "    y_train = df_train.loc[:, \"SalePrice\"]\n",
    "    \n",
    "    xgb_params = dict(\n",
    "    max_depth=6,           # maximum depth of each tree - try 2 to 10\n",
    "    learning_rate=0.01,    # effect of each tree - try 0.0001 to 0.1\n",
    "    n_estimators=1000,     # number of trees (that is, boosting rounds) - try 1000 to 8000\n",
    "    min_child_weight=1,    # minimum number of houses in a leaf - try 1 to 10\n",
    "    colsample_bytree=0.7,  # fraction of features (columns) per tree - try 0.2 to 1.0\n",
    "    subsample=0.7,         # fraction of instances (rows) per tree - try 0.2 to 1.0\n",
    "    reg_alpha=0.5,         # L1 regularization (like LASSO) - try 0.0 to 10.0\n",
    "    reg_lambda=1.0,        # L2 regularization (like Ridge) - try 0.0 to 10.0\n",
    "    num_parallel_tree=1,   # set > 1 for boosted random forests\n",
    "    )\n",
    "\n",
    "    xgb = XGBRegressor(**xgb_params)\n",
    "    # XGB minimizes MSE, but competition loss is RMSLE\n",
    "    # So, we need to log-transform y to train and exp-transform the predictions\n",
    "    xgb.fit(X_train, np.log(y_train))\n",
    "\n",
    "    Path(model_path).parent.mkdir(parents=True, exist_ok=True)\n",
    "    xgb.save_model(model_path)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_data_op = func_to_container_op(train_data, packages_to_install= import_packages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Evaluate data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "def eval_data(test_data_path: comp.InputPath(), model_path: comp.InputPath('XGBoostModel')):\n",
    "    \n",
    "    import pandas as pd\n",
    "    import numpy as np\n",
    "    from xgboost.sklearn import XGBRegressor\n",
    "    \n",
    "    X_test = pd.read_csv(test_data_path, index_col=\"Id\")\n",
    "    \n",
    "    xgb = XGBRegressor()\n",
    "    \n",
    "    \n",
    "    xgb.load_model(model_path)\n",
    "    \n",
    "    predictions = np.exp(xgb.predict(X_test))\n",
    "    \n",
    "    print(predictions)\n",
    "       \n",
    "#     output = pd.DataFrame({'Id': X_test.index, 'SalePrice': predictions})\n",
    "#     output.to_csv('data/my_submission.csv', index=False)\n",
    "#     print(\"Your submission was successfully saved!\")\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_data_op = func_to_container_op(eval_data, packages_to_install= import_packages)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Defining function that implements the pipeline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "def vanilla_pipeline(url):\n",
    "    \n",
    "    web_downloader_task = web_downloader_op(url=url)\n",
    "\n",
    "    load_and_preprocess_data_task = load_and_preprocess_data_op(file = web_downloader_task.outputs['data'])\n",
    "\n",
    "    featured_data_task = featured_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'], test = load_and_preprocess_data_task.outputs['test_output_csv'])\n",
    "    \n",
    "    train_eval_task = train_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'] , feat_train = featured_data_task.outputs['feat_train_output_csv'],\n",
    "                                                 feat_test = featured_data_task.outputs['feat_test_output_csv'])\n",
    "    \n",
    "    eval_data_task = eval_data_op(test_data = featured_data_task.outputs['feat_test_output_csv'],model = train_eval_task.output)\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href=\"/pipeline/#/experiments/details/246b31c7-909a-446b-8152-0f429a0e745c\" target=\"_blank\" >Experiment details</a>."
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<a href=\"/pipeline/#/runs/details/66011ba0-a465-4d5b-beba-f081ab3002b4\" target=\"_blank\" >Run details</a>."
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/plain": [
       "RunPipelineResult(run_id=66011ba0-a465-4d5b-beba-f081ab3002b4)"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Using kfp.Client() to run the pipeline from notebook itself\n",
    "client = kfp.Client() # change arguments accordingly\n",
    "\n",
    "# Running the pipeline\n",
    "client.create_run_from_pipeline_func(\n",
    "    vanilla_pipeline,\n",
    "    arguments={\n",
    "        # Github url to fetch the data. This would change when you clone the repo. Please update the url as per that.\n",
    "        'url': 'https://github.com/kubeflow/examples/raw/master/house-prices-kaggle-competition/data.zip'\n",
    "    })"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kubeflow_notebook": {
   "autosnapshot": true,
   "docker_image": "",
   "experiment": {
    "id": "",
    "name": ""
   },
   "experiment_name": "",
   "katib_metadata": {
    "algorithm": {
     "algorithmName": "grid"
    },
    "maxFailedTrialCount": 3,
    "maxTrialCount": 12,
    "objective": {
     "objectiveMetricName": "",
     "type": "minimize"
    },
    "parallelTrialCount": 3,
    "parameters": []
   },
   "katib_run": false,
   "pipeline_description": "",
   "pipeline_name": "",
   "snapshot_volumes": true,
   "steps_defaults": [
    "label:access-ml-pipeline:true",
    "label:access-rok:true"
   ],
   "volume_access_mode": "rwm",
   "volumes": [
    {
     "annotations": [],
     "mount_point": "/home/jovyan/data",
     "name": "data-g2n6k",
     "size": 5,
     "size_type": "Gi",
     "snapshot": false,
     "type": "clone"
    },
    {
     "annotations": [],
     "mount_point": "/home/jovyan",
     "name": "house-prices-vanilla-workspace-2wscr",
     "size": 5,
     "size_type": "Gi",
     "snapshot": false,
     "type": "clone"
    }
   ]
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}