mirror of https://github.com/kubeflow/examples.git
Example of converting a Kaggle Notebook to a Kubeflow pipeline (#945)
* Files added to github
* Correcting image paths for all images
* Modified image paths
* Centering Images
* Updated Github Download URL link in -kfp.ipynb file
* Updated the kfp_client.PNG image
* Updated README.md
* Fixed a grammatical error in README.md
* Fixed issues with README.md and house-prices-kale notebook
* Fixed minor grammatical errors/typos
* Fixed couple of grammatical mistakes
* Modified the notebooks and README.md
* Added the notebook server docker image to README.md
* Updated README.md with KFaaS references
This commit is contained in:
parent c55eb667a7
commit 7454117305
# Predicting House Prices

In this repo we convert a [notebook](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) provided for the [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition into a Kubeflow pipeline. The notebook builds on the hands-on exercises presented in the Kaggle Learn courses [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Feature Engineering](https://www.kaggle.com/learn/feature-engineering).

## Prerequisites for Building the Kubeflow Pipeline

If you don't already have Kubeflow up and running, we recommend signing up for a free trial of Arrikto's [Kubeflow as a Service](https://www.arrikto.com/kubeflow-as-a-service/). The following example uses Kubeflow as a Service, but you should be able to run it on any Kubeflow distribution.

## Testing environment

| Name          | Version |
| ------------- |:-------:|
| Kubeflow      | v1.4    |
| kfp           | 1.7.1   |
| kubeflow-kale | 0.6.0   |

## Initial Steps

1. Follow the Prerequisites section to get Kubeflow running.
2. Create and connect to a new Jupyter Notebook server.
3. Clone this repo so you have access to this directory, then work through the kfp and Kale steps explained below.

## KFP version

To start building a Kubeflow pipeline, first get acquainted with the Kubeflow Pipelines [documentation](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/) to understand what a pipeline is, what its components are, and what goes into those components. There are different ways to build a pipeline component, as described [here](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/#building-pipeline-components). In the following example, we use lightweight Python function-based components to build the pipeline.

### Step 1: Install the Kubeflow Pipelines SDK and import the required kfp packages

From kfp, we will use [func_to_container_op](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.func_to_container_op), which builds a factory function from a Python function, and [InputPath](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.InputPath) and [OutputPath](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.OutputPath) from the components package, which pass the paths of files or models between tasks. [Data is passed](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/#pass-data) between components using kfp's supported data-passing mechanism: InputPath and OutputPath are how you hand data or models from one component to the next.
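
The path-based contract can be illustrated without kfp at all: each step reads from and writes to file paths it is handed, which is what InputPath and OutputPath arrange for a task. A minimal stdlib-only sketch (function names here are illustrative, not part of the kfp API):

```python
import os
import tempfile

# Each "component" works on file paths instead of in-memory objects,
# mirroring how kfp's InputPath/OutputPath pass data between tasks.
def produce_numbers(output_path: str):
    with open(output_path, "w") as f:
        f.write("1,2,3")

def sum_numbers(input_path: str, output_path: str):
    with open(input_path) as f:
        total = sum(int(x) for x in f.read().split(","))
    with open(output_path, "w") as f:
        f.write(str(total))

# Locally we wire the paths by hand; in a pipeline the SDK provides them.
tmp = tempfile.mkdtemp()
raw = os.path.join(tmp, "raw.csv")
result = os.path.join(tmp, "sum.txt")
produce_numbers(raw)
sum_numbers(raw, result)
print(open(result).read())  # -> 6
```

Because each step only sees paths, the SDK is free to back them with whatever storage the cluster provides.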

### Step 2: Build the pipeline components

Our Kubeflow pipeline is broken down into five components:

- Download data
- Load and preprocess data
- Create features
- Train data
- Evaluate data

We convert each Python function to a factory function using func_to_container_op; each factory is then used to create a pipeline task inside our pipeline function.
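
To make the factory idea concrete, here is a toy, stdlib-only analogy (this is not how kfp is implemented; func_to_container_op actually packages the function into a container): wrapping a plain function yields a factory that, when called, returns a task description to be executed later by the runner.

```python
# Toy stand-in for func_to_container_op: wrap a plain function into a
# factory that returns a task description instead of running immediately.
def to_task_factory(func):
    def factory(*args, **kwargs):
        return {"name": func.__name__, "func": func,
                "args": args, "kwargs": kwargs}
    return factory

def train(data: str) -> str:
    return f"model trained on {data}"

train_task = to_task_factory(train)
task = train_task("train.csv")                          # a task spec, not yet run
result = task["func"](*task["args"], **task["kwargs"])  # the runner executes it
print(result)  # -> model trained on train.csv
```

The separation between describing a task and executing it is what lets the pipeline engine schedule components on the cluster.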

### Step 3: Create the pipeline function

After building all the pipeline components, we define a pipeline function that connects them with the appropriate inputs and outputs. Running it generates the pipeline graph.

Our pipeline function takes a GitHub URL as input to start the first pipeline task, download_data_task. For this we used the [load_component_from_url](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html?highlight=load_component_from_url#kfp.components.load_component_from_url) method to create the pipeline task.
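
Stripped of the SDK, a pipeline function is just dependency wiring: each task's output path feeds the next task's input. A stdlib-only sketch of that idea (step and parameter names are illustrative; no real download happens):

```python
import os
import tempfile

# Stand-ins for two pipeline tasks; no network access is performed here.
def download_data(url: str, out_path: str):
    with open(out_path, "w") as f:
        f.write(f"data from {url}")

def preprocess(in_path: str, out_path: str):
    with open(in_path) as f, open(out_path, "w") as g:
        g.write(f.read().upper())

# The "pipeline function": wire each task's output into the next task's input.
def pipeline(url: str) -> str:
    tmp = tempfile.mkdtemp()
    raw = os.path.join(tmp, "raw.txt")
    clean = os.path.join(tmp, "clean.txt")
    download_data(url, raw)   # plays the role of download_data_task
    preprocess(raw, clean)    # downstream task consumes its output
    return open(clean).read()

print(pipeline("https://example.com/data.zip"))
```

In the real pipeline, the SDK reads these input/output connections to build the dependency graph shown below the function.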

Pipeline function:

<p align="center">
  <img src="images/kfp_pipeline_func.PNG">
</p>

### Step 4: Run the pipeline using a kfp.Client instance

There are different ways to run the pipeline function, as described in the [documentation](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/#compile-and-run-your-pipeline). We run the pipeline using the Kubeflow Pipelines SDK client.

<p align="center">
  <img src="images/kfp_client.PNG">
</p>

Once all the cells have executed successfully, you should see two hyperlinks, 'Experiment details' and 'Run details'. Click the 'Run details' link to observe the pipeline run.

The final pipeline graph looks as follows:

<p align="center">
  <img src="images/kfp_pipeline_graph.PNG">
</p>
## Kale version
|
||||
|
||||
For the Kaggle notebook example, we are using [Kubeflow as a Service](https://www.arrikto.com/kubeflow-as-a-service/). If you are using Kubeflow as a Service then Kale comes preinstalled. For users with different Kubeflow setup, you can refer to the [GitHub link](https://github.com/kubeflow-kale/kale#getting-started) for installing the Kale JupyterLab extension on your setup.
|
||||
|
||||
### Step 1: Annotate the notebook with Kale tags
|
||||
|
||||
The Kale notebook in the directory is already annotated. To see the annotations, open up the Kale Deployment panel and click on the Enable switch button. Once you have it switched on, you should see the following:
|
||||
|
||||
<p align="center">
|
||||
<img src="images/kale_deployment_panel.PNG">
|
||||
</p>
|
||||
|
||||
Please take time to understand how each cell is annotated by clicking on the cell and checking out the tag being used and what are is its dependencies. Kale provides us with six tags for annotations:

- Imports
- Functions
- Pipeline Parameters
- Pipeline Metrics
- Pipeline Step
- Skip Cell

You can also see the tags being created by checking the cell metadata: click the Property Inspector above the Kale Deployment panel button.

<p align="center">
  <img src="images/kale_cell_metadata.PNG">
</p>
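
For orientation, the Kale annotations are stored as tags in each cell's metadata. A rough sketch of what the metadata of an annotated code cell can look like (the step and dependency names here are illustrative, and the exact schema may vary between Kale versions):

```json
{
  "tags": [
    "block:load_and_preprocess_data",
    "prev:download_data"
  ]
}
```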

### Step 2: Run the Kubeflow Pipeline

Once you've tagged your notebook, click the "Compile and Run" button in the Kale widget. Kale will perform the following tasks for you:

- Validate the notebook
- Take a snapshot
- Compile the notebook
- Upload the pipeline
- Run the pipeline

In the "Running pipeline" output, click the "View" hyperlink. This will take you directly to the runtime execution graph, where you can watch your pipeline execute and update in real time.

<p align="center">
  <img src="images/kale_pipeline_graph.PNG">
</p>

## Note

Both notebooks have been tested. If you run into an error, try the following notebook server Docker image:

gcr.io/arrikto/jupyter-kale-py36@sha256:dd3f92ca66b46d247e4b9b6a9d84ffbb368646263c2e3909473c3b851f3fe198

If the error persists, please raise an issue.
Binary file not shown.

MSSubClass: Identifies the type of dwelling involved in the sale.

        20   1-STORY 1946 & NEWER ALL STYLES
        30   1-STORY 1945 & OLDER
        40   1-STORY W/FINISHED ATTIC ALL AGES
        45   1-1/2 STORY - UNFINISHED ALL AGES
        50   1-1/2 STORY FINISHED ALL AGES
        60   2-STORY 1946 & NEWER
        70   2-STORY 1945 & OLDER
        75   2-1/2 STORY ALL AGES
        80   SPLIT OR MULTI-LEVEL
        85   SPLIT FOYER
        90   DUPLEX - ALL STYLES AND AGES
        120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
        150  1-1/2 STORY PUD - ALL AGES
        160  2-STORY PUD - 1946 & NEWER
        180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
        190  2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

        A    Agriculture
        C    Commercial
        FV   Floating Village Residential
        I    Industrial
        RH   Residential High Density
        RL   Residential Low Density
        RP   Residential Low Density Park
        RM   Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

        Grvl  Gravel
        Pave  Paved

Alley: Type of alley access to property

        Grvl  Gravel
        Pave  Paved
        NA    No alley access

LotShape: General shape of property

        Reg  Regular
        IR1  Slightly irregular
        IR2  Moderately Irregular
        IR3  Irregular

LandContour: Flatness of the property

        Lvl  Near Flat/Level
        Bnk  Banked - Quick and significant rise from street grade to building
        HLS  Hillside - Significant slope from side to side
        Low  Depression

Utilities: Type of utilities available

        AllPub  All public Utilities (E,G,W,& S)
        NoSewr  Electricity, Gas, and Water (Septic Tank)
        NoSeWa  Electricity and Gas Only
        ELO     Electricity only

LotConfig: Lot configuration

        Inside   Inside lot
        Corner   Corner lot
        CulDSac  Cul-de-sac
        FR2      Frontage on 2 sides of property
        FR3      Frontage on 3 sides of property

LandSlope: Slope of property

        Gtl  Gentle slope
        Mod  Moderate Slope
        Sev  Severe Slope

Neighborhood: Physical locations within Ames city limits

        Blmngtn  Bloomington Heights
        Blueste  Bluestem
        BrDale   Briardale
        BrkSide  Brookside
        ClearCr  Clear Creek
        CollgCr  College Creek
        Crawfor  Crawford
        Edwards  Edwards
        Gilbert  Gilbert
        IDOTRR   Iowa DOT and Rail Road
        MeadowV  Meadow Village
        Mitchel  Mitchell
        Names    North Ames
        NoRidge  Northridge
        NPkVill  Northpark Villa
        NridgHt  Northridge Heights
        NWAmes   Northwest Ames
        OldTown  Old Town
        SWISU    South & West of Iowa State University
        Sawyer   Sawyer
        SawyerW  Sawyer West
        Somerst  Somerset
        StoneBr  Stone Brook
        Timber   Timberland
        Veenker  Veenker

Condition1: Proximity to various conditions

        Artery  Adjacent to arterial street
        Feedr   Adjacent to feeder street
        Norm    Normal
        RRNn    Within 200' of North-South Railroad
        RRAn    Adjacent to North-South Railroad
        PosN    Near positive off-site feature--park, greenbelt, etc.
        PosA    Adjacent to positive off-site feature
        RRNe    Within 200' of East-West Railroad
        RRAe    Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present)

        Artery  Adjacent to arterial street
        Feedr   Adjacent to feeder street
        Norm    Normal
        RRNn    Within 200' of North-South Railroad
        RRAn    Adjacent to North-South Railroad
        PosN    Near positive off-site feature--park, greenbelt, etc.
        PosA    Adjacent to positive off-site feature
        RRNe    Within 200' of East-West Railroad
        RRAe    Adjacent to East-West Railroad

BldgType: Type of dwelling

        1Fam    Single-family Detached
        2FmCon  Two-family Conversion; originally built as one-family dwelling
        Duplx   Duplex
        TwnhsE  Townhouse End Unit
        TwnhsI  Townhouse Inside Unit

HouseStyle: Style of dwelling

        1Story  One story
        1.5Fin  One and one-half story: 2nd level finished
        1.5Unf  One and one-half story: 2nd level unfinished
        2Story  Two story
        2.5Fin  Two and one-half story: 2nd level finished
        2.5Unf  Two and one-half story: 2nd level unfinished
        SFoyer  Split Foyer
        SLvl    Split Level

OverallQual: Rates the overall material and finish of the house

        10  Very Excellent
        9   Excellent
        8   Very Good
        7   Good
        6   Above Average
        5   Average
        4   Below Average
        3   Fair
        2   Poor
        1   Very Poor

OverallCond: Rates the overall condition of the house

        10  Very Excellent
        9   Excellent
        8   Very Good
        7   Good
        6   Above Average
        5   Average
        4   Below Average
        3   Fair
        2   Poor
        1   Very Poor

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

        Flat     Flat
        Gable    Gable
        Gambrel  Gambrel (Barn)
        Hip      Hip
        Mansard  Mansard
        Shed     Shed

RoofMatl: Roof material

        ClyTile  Clay or Tile
        CompShg  Standard (Composite) Shingle
        Membran  Membrane
        Metal    Metal
        Roll     Roll
        Tar&Grv  Gravel & Tar
        WdShake  Wood Shakes
        WdShngl  Wood Shingles

Exterior1st: Exterior covering on house

        AsbShng  Asbestos Shingles
        AsphShn  Asphalt Shingles
        BrkComm  Brick Common
        BrkFace  Brick Face
        CBlock   Cinder Block
        CemntBd  Cement Board
        HdBoard  Hard Board
        ImStucc  Imitation Stucco
        MetalSd  Metal Siding
        Other    Other
        Plywood  Plywood
        PreCast  PreCast
        Stone    Stone
        Stucco   Stucco
        VinylSd  Vinyl Siding
        Wd Sdng  Wood Siding
        WdShing  Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

        AsbShng  Asbestos Shingles
        AsphShn  Asphalt Shingles
        BrkComm  Brick Common
        BrkFace  Brick Face
        CBlock   Cinder Block
        CemntBd  Cement Board
        HdBoard  Hard Board
        ImStucc  Imitation Stucco
        MetalSd  Metal Siding
        Other    Other
        Plywood  Plywood
        PreCast  PreCast
        Stone    Stone
        Stucco   Stucco
        VinylSd  Vinyl Siding
        Wd Sdng  Wood Siding
        WdShing  Wood Shingles

MasVnrType: Masonry veneer type

        BrkCmn   Brick Common
        BrkFace  Brick Face
        CBlock   Cinder Block
        None     None
        Stone    Stone

MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior

        Ex  Excellent
        Gd  Good
        TA  Average/Typical
        Fa  Fair
        Po  Poor

ExterCond: Evaluates the present condition of the material on the exterior

        Ex  Excellent
        Gd  Good
        TA  Average/Typical
        Fa  Fair
        Po  Poor

Foundation: Type of foundation

        BrkTil  Brick & Tile
        CBlock  Cinder Block
        PConc   Poured Concrete
        Slab    Slab
        Stone   Stone
        Wood    Wood

BsmtQual: Evaluates the height of the basement

        Ex  Excellent (100+ inches)
        Gd  Good (90-99 inches)
        TA  Typical (80-89 inches)
        Fa  Fair (70-79 inches)
        Po  Poor (<70 inches)
        NA  No Basement

BsmtCond: Evaluates the general condition of the basement

        Ex  Excellent
        Gd  Good
        TA  Typical - slight dampness allowed
        Fa  Fair - dampness or some cracking or settling
        Po  Poor - Severe cracking, settling, or wetness
        NA  No Basement

BsmtExposure: Refers to walkout or garden level walls

        Gd  Good Exposure
        Av  Average Exposure (split levels or foyers typically score average or above)
        Mn  Minimum Exposure
        No  No Exposure
        NA  No Basement

BsmtFinType1: Rating of basement finished area

        GLQ  Good Living Quarters
        ALQ  Average Living Quarters
        BLQ  Below Average Living Quarters
        Rec  Average Rec Room
        LwQ  Low Quality
        Unf  Unfinished
        NA   No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

        GLQ  Good Living Quarters
        ALQ  Average Living Quarters
        BLQ  Below Average Living Quarters
        Rec  Average Rec Room
        LwQ  Low Quality
        Unf  Unfinished
        NA   No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

        Floor  Floor Furnace
        GasA   Gas forced warm air furnace
        GasW   Gas hot water or steam heat
        Grav   Gravity furnace
        OthW   Hot water or steam heat other than gas
        Wall   Wall furnace

HeatingQC: Heating quality and condition

        Ex  Excellent
        Gd  Good
        TA  Average/Typical
        Fa  Fair
        Po  Poor

CentralAir: Central air conditioning

        N  No
        Y  Yes

Electrical: Electrical system

        SBrkr  Standard Circuit Breakers & Romex
        FuseA  Fuse Box over 60 AMP and all Romex wiring (Average)
        FuseF  60 AMP Fuse Box and mostly Romex wiring (Fair)
        FuseP  60 AMP Fuse Box and mostly knob & tube wiring (poor)
        Mix    Mixed

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Bedroom: Bedrooms above grade (does NOT include basement bedrooms)

Kitchen: Kitchens above grade

KitchenQual: Kitchen quality

        Ex  Excellent
        Gd  Good
        TA  Typical/Average
        Fa  Fair
        Po  Poor

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality (Assume typical unless deductions are warranted)

        Typ   Typical Functionality
        Min1  Minor Deductions 1
        Min2  Minor Deductions 2
        Mod   Moderate Deductions
        Maj1  Major Deductions 1
        Maj2  Major Deductions 2
        Sev   Severely Damaged
        Sal   Salvage only

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

        Ex  Excellent - Exceptional Masonry Fireplace
        Gd  Good - Masonry Fireplace in main level
        TA  Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
        Fa  Fair - Prefabricated Fireplace in basement
        Po  Poor - Ben Franklin Stove
        NA  No Fireplace

GarageType: Garage location

        2Types   More than one type of garage
        Attchd   Attached to home
        Basment  Basement Garage
        BuiltIn  Built-In (Garage part of house - typically has room above garage)
        CarPort  Car Port
        Detchd   Detached from home
        NA       No Garage

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

        Fin  Finished
        RFn  Rough Finished
        Unf  Unfinished
        NA   No Garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

        Ex  Excellent
        Gd  Good
        TA  Typical/Average
        Fa  Fair
        Po  Poor
        NA  No Garage

GarageCond: Garage condition

        Ex  Excellent
        Gd  Good
        TA  Typical/Average
        Fa  Fair
        Po  Poor
        NA  No Garage

PavedDrive: Paved driveway

        Y  Paved
        P  Partial Pavement
        N  Dirt/Gravel

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

        Ex  Excellent
        Gd  Good
        TA  Average/Typical
        Fa  Fair
        NA  No Pool

Fence: Fence quality

        GdPrv  Good Privacy
        MnPrv  Minimum Privacy
        GdWo   Good Wood
        MnWw   Minimum Wood/Wire
        NA     No Fence

MiscFeature: Miscellaneous feature not covered in other categories

        Elev  Elevator
        Gar2  2nd Garage (if not described in garage section)
        Othr  Other
        Shed  Shed (over 100 SF)
        TenC  Tennis Court
        NA    None

MiscVal: $Value of miscellaneous feature

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)

SaleType: Type of sale

        WD     Warranty Deed - Conventional
        CWD    Warranty Deed - Cash
        VWD    Warranty Deed - VA Loan
        New    Home just constructed and sold
        COD    Court Officer Deed/Estate
        Con    Contract 15% Down payment regular terms
        ConLw  Contract Low Down payment and low interest
        ConLI  Contract Low Interest
        ConLD  Contract Low Down
        Oth    Other

SaleCondition: Condition of sale

        Normal   Normal Sale
        Abnorml  Abnormal Sale - trade, foreclosure, short sale
        AdjLand  Adjoining Land Purchase
        Alloca   Allocation - two linked properties with separate deeds, typically condo with a garage unit
        Family   Sale between family members
        Partial  Home was not completed when last assessed (associated with New Homes)

File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Kaggle Getting Started Competition: House Prices - Advanced Regression Techniques"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "This notebook is based on the [notebook](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) provided for the [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition. The notebook builds on the hands-on exercises presented in the Kaggle Learn courses [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Feature Engineering](https://www.kaggle.com/learn/feature-engineering)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "The following imports are required to build the Kubeflow pipeline and pass data between its components."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the kfp package\n",
    "# !pip install kfp --upgrade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import kfp\n",
    "from kfp.components import func_to_container_op\n",
    "import kfp.components as comp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "All the packages required in the pipeline components are put together in a list, which is then passed to each pipeline component. This may not be efficient when you are dealing with a lot of packages; in cases with many packages and dependencies, you can instead build a Docker image and pass that to each pipeline component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import_packages = ['pandas', 'sklearn', 'category_encoders', 'xgboost', 'numpy']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "In the following implementation of the Kubeflow pipeline we make use of [lightweight Python function components](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/) to build up the pipeline. The data is passed between component instances (tasks) using InputPath and OutputPath. This doesn't require defining an external volume and attaching it to the tasks, as the system takes care of storing the data. Further details and examples can be found at the following [link](https://github.com/Ark-kun/kfp_samples/blob/65a98da2d4d2bd27a803ee58213b4cfd8a84825e/2019-10%20Kubeflow%20summit/104%20-%20Passing%20data%20for%20python%20components/104%20-%20Passing%20data%20for%20python%20components.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "The pipeline is divided into five components:\n",
    "1. Download the data zip file from a URL\n",
    "2. Load and preprocess the data\n",
    "3. Create data with features\n",
    "4. Train the model\n",
    "5. Evaluate the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Download Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "Here we use an existing YAML file available from kubeflow/pipelines for the 'Download Data' component to download data from URLs. In our case, we are getting it from GitHub."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "web_downloader_op = kfp.components.load_component_from_url(\n",
    "    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/contrib/web/Download/component.yaml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Load and Preprocess Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_and_preprocess_data(file_path: comp.InputPath(), train_output_csv: comp.OutputPath(), test_output_csv: comp.OutputPath()):\n",
    "\n",
    "    import pandas as pd\n",
    "    from pandas.api.types import CategoricalDtype\n",
    "    from zipfile import ZipFile\n",
    "\n",
    "    # Extract the csv files from the zip file\n",
    "    with ZipFile(file_path, 'r') as zip:\n",
    "        zip.extractall()\n",
    "\n",
    "    # Load the training and test data\n",
    "    train_file_dir = 'data/train.csv'\n",
    "    test_file_dir = 'data/test.csv'\n",
    "    df_train = pd.read_csv(train_file_dir, index_col=\"Id\")\n",
    "    df_test = pd.read_csv(test_file_dir, index_col=\"Id\")\n",
    "\n",
    "    # Merge the splits so we can process them together\n",
    "    df = pd.concat([df_train, df_test])\n",
    "\n",
    "    # Clean data\n",
    "    df[\"Exterior2nd\"] = df[\"Exterior2nd\"].replace({\"Brk Cmn\": \"BrkComm\"})\n",
    "    # Some values of GarageYrBlt are corrupt, so we'll replace them\n",
    "    # with the year the house was built\n",
    "    df[\"GarageYrBlt\"] = df[\"GarageYrBlt\"].where(df.GarageYrBlt <= 2010, df.YearBuilt)\n",
    "    # Names beginning with numbers are awkward to work with\n",
    "    df.rename(columns={\n",
    "        \"1stFlrSF\": \"FirstFlrSF\",\n",
    "        \"2ndFlrSF\": \"SecondFlrSF\",\n",
    "        \"3SsnPorch\": \"Threeseasonporch\",\n",
    "    }, inplace=True)\n",
    "\n",
    "    # Encode data\n",
    "\n",
    "    # The numeric features are already encoded correctly (`float` for\n",
    "    # continuous, `int` for discrete), but the categoricals we'll need to\n",
    "    # do ourselves. Note in particular, that the `MSSubClass` feature is\n",
    "    # read as an `int` type, but is actually a (nominative) categorical.\n",
    "\n",
    "    # The nominative (unordered) categorical features\n",
    "    features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
    "\n",
    "    # Pandas calls the categories \"levels\"\n",
    "    five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
    "    ten_levels = list(range(10))\n",
    "\n",
    "    ordered_levels = {\n",
    "        \"OverallQual\": ten_levels,\n",
    "        \"OverallCond\": ten_levels,\n",
    "        \"ExterQual\": five_levels,\n",
    "        \"ExterCond\": five_levels,\n",
    "        \"BsmtQual\": five_levels,\n",
    "        \"BsmtCond\": five_levels,\n",
    "        \"HeatingQC\": five_levels,\n",
    "        \"KitchenQual\": five_levels,\n",
    "        \"FireplaceQu\": five_levels,\n",
    "        \"GarageQual\": five_levels,\n",
    "        \"GarageCond\": five_levels,\n",
    "        \"PoolQC\": five_levels,\n",
    "        \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
    "        \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
    "        \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
    "        \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
    "        \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
    "        \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
    "        \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
    "        \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
    "        \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
    "        \"CentralAir\": [\"N\", \"Y\"],\n",
    "        \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
    "        \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
    "    }\n",
    "\n",
    "    # Add a None level for missing values\n",
    "    ordered_levels = {key: [\"None\"] + value for key, value in ordered_levels.items()}\n",
    "\n",
    "    for name in features_nom:\n",
    "        df[name] = df[name].astype(\"category\")\n",
    "        # Add a None category for missing values\n",
    "        if \"None\" not in df[name].cat.categories:\n",
    "            df[name].cat.add_categories(\"None\", inplace=True)\n",
    "    # Ordinal categories\n",
    "    for name, levels in ordered_levels.items():\n",
    "        df[name] = df[name].astype(CategoricalDtype(levels, ordered=True))\n",
    "\n",
    "    # Impute data\n",
    "    for name in df.select_dtypes(\"number\"):\n",
    "        df[name] = df[name].fillna(0)\n",
    "    for name in df.select_dtypes(include=[\"category\"]):\n",
    "        df[name] = df[name].fillna(\"None\")\n",
    "\n",
    "    # Reform the train/test splits\n",
    "    df_train = df.loc[df_train.index, :]\n",
    "    df_test = df.loc[df_test.index, :]\n",
    "\n",
    "    # Pass the data to the outputs as csv files\n",
    "    df_train.to_csv(train_output_csv)\n",
    "    df_test.to_csv(test_output_csv)\n"
   ]
  },
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"load_and_preprocess_data_op = func_to_container_op(load_and_preprocess_data,packages_to_install = import_packages)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"### Creating data with features"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def featured_data(train_path: comp.InputPath(), test_path : comp.InputPath(), feat_train_output_csv: comp.OutputPath(), feat_test_output_csv: comp.OutputPath()):\n",
|
||||
" \n",
|
||||
" import pandas as pd\n",
|
||||
" from pandas.api.types import CategoricalDtype\n",
|
||||
" from category_encoders import MEstimateEncoder\n",
|
||||
" from sklearn.feature_selection import mutual_info_regression\n",
|
||||
" from sklearn.cluster import KMeans\n",
|
||||
" from sklearn.decomposition import PCA\n",
|
||||
" from sklearn.model_selection import KFold, cross_val_score\n",
|
||||
" \n",
|
||||
" df_train = pd.read_csv(train_path, index_col=\"Id\")\n",
|
||||
" df_test = pd.read_csv(test_path, index_col=\"Id\")\n",
|
||||
" \n",
|
||||
" def make_mi_scores(X, y):\n",
|
||||
" X = X.copy()\n",
|
||||
" for colname in X.select_dtypes([\"object\",\"category\"]):\n",
|
||||
" X[colname], _ = X[colname].factorize()\n",
|
||||
" # All discrete features should now have integer dtypes\n",
|
||||
" discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]\n",
|
||||
" mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)\n",
|
||||
" mi_scores = pd.Series(mi_scores, name=\"MI Scores\", index=X.columns)\n",
|
||||
" mi_scores = mi_scores.sort_values(ascending=False)\n",
|
||||
" return mi_scores\n",
|
||||
" \n",
|
||||
" def drop_uninformative(df, mi_scores):\n",
|
||||
" return df.loc[:, mi_scores > 0.0]\n",
|
||||
" \n",
|
||||
" def label_encode(df):\n",
|
||||
" \n",
|
||||
" X = df.copy() \n",
|
||||
" for colname in X.select_dtypes([\"category\"]):\n",
|
||||
" X[colname] = X[colname].cat.codes\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def mathematical_transforms(df):\n",
|
||||
" X = pd.DataFrame() # dataframe to hold new features\n",
|
||||
" X[\"LivLotRatio\"] = df.GrLivArea / df.LotArea\n",
|
||||
" X[\"Spaciousness\"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def interactions(df):\n",
|
||||
" X = pd.get_dummies(df.BldgType, prefix=\"Bldg\")\n",
|
||||
" X = X.mul(df.GrLivArea, axis=0)\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def counts(df):\n",
|
||||
" X = pd.DataFrame()\n",
|
||||
" X[\"PorchTypes\"] = df[[\n",
|
||||
" \"WoodDeckSF\",\n",
|
||||
" \"OpenPorchSF\",\n",
|
||||
" \"EnclosedPorch\",\n",
|
||||
" \"Threeseasonporch\",\n",
|
||||
" \"ScreenPorch\",\n",
|
||||
" ]].gt(0.0).sum(axis=1)\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def break_down(df):\n",
|
||||
" X = pd.DataFrame()\n",
|
||||
" X[\"MSClass\"] = df.MSSubClass.str.split(\"_\", n=1, expand=True)[0]\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def group_transforms(df):\n",
|
||||
" X = pd.DataFrame()\n",
|
||||
" X[\"MedNhbdArea\"] = df.groupby(\"Neighborhood\")[\"GrLivArea\"].transform(\"median\")\n",
|
||||
" return X\n",
|
||||
" \n",
|
||||
" cluster_features = [\n",
|
||||
" \"LotArea\",\n",
|
||||
" \"TotalBsmtSF\",\n",
|
||||
" \"FirstFlrSF\",\n",
|
||||
" \"SecondFlrSF\",\n",
|
||||
" \"GrLivArea\",\n",
|
||||
" ]\n",
|
||||
"\n",
|
||||
" def cluster_labels(df, features, n_clusters=20):\n",
|
||||
" X = df.copy()\n",
|
||||
" X_scaled = X.loc[:, features]\n",
|
||||
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
|
||||
" kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)\n",
|
||||
" X_new = pd.DataFrame()\n",
|
||||
" X_new[\"Cluster\"] = kmeans.fit_predict(X_scaled)\n",
|
||||
" return X_new\n",
|
||||
"\n",
|
||||
" def cluster_distance(df, features, n_clusters=20):\n",
|
||||
" X = df.copy()\n",
|
||||
" X_scaled = X.loc[:, features]\n",
|
||||
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
|
||||
" kmeans = KMeans(n_clusters=20, n_init=50, random_state=0)\n",
|
||||
" X_cd = kmeans.fit_transform(X_scaled)\n",
|
||||
" # Label features and join to dataset\n",
|
||||
" X_cd = pd.DataFrame(\n",
|
||||
" X_cd, columns=[f\"Centroid_{i}\" for i in range(X_cd.shape[1])]\n",
|
||||
" )\n",
|
||||
" return X_cd\n",
|
||||
" \n",
|
||||
" def apply_pca(X, standardize=True):\n",
|
||||
" # Standardize\n",
|
||||
" if standardize:\n",
|
||||
" X = (X - X.mean(axis=0)) / X.std(axis=0)\n",
|
||||
" # Create principal components\n",
|
||||
" pca = PCA()\n",
|
||||
" X_pca = pca.fit_transform(X)\n",
|
||||
" # Convert to dataframe\n",
|
||||
" component_names = [f\"PC{i+1}\" for i in range(X_pca.shape[1])]\n",
|
||||
" X_pca = pd.DataFrame(X_pca, columns=component_names)\n",
|
||||
" # Create loadings\n",
|
||||
" loadings = pd.DataFrame(\n",
|
||||
" pca.components_.T, # transpose the matrix of loadings\n",
|
||||
" columns=component_names, # so the columns are the principal components\n",
|
||||
" index=X.columns, # and the rows are the original features\n",
|
||||
" )\n",
|
||||
" return pca, X_pca, loadings\n",
|
||||
"\n",
|
||||
" def pca_inspired(df):\n",
|
||||
" X = pd.DataFrame()\n",
|
||||
" X[\"Feature1\"] = df.GrLivArea + df.TotalBsmtSF\n",
|
||||
" X[\"Feature2\"] = df.YearRemodAdd * df.TotalBsmtSF\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
"\n",
|
||||
" def pca_components(df, features):\n",
|
||||
" X = df.loc[:, features]\n",
|
||||
" _, X_pca, _ = apply_pca(X)\n",
|
||||
" return X_pca\n",
|
||||
"\n",
|
||||
"\n",
|
||||
" pca_features = [\n",
|
||||
" \"GarageArea\",\n",
|
||||
" \"YearRemodAdd\",\n",
|
||||
" \"TotalBsmtSF\",\n",
|
||||
" \"GrLivArea\",\n",
|
||||
" ]\n",
|
||||
" \n",
|
||||
" class CrossFoldEncoder:\n",
|
||||
" def __init__(self, encoder, **kwargs):\n",
|
||||
" self.encoder_ = encoder\n",
|
||||
" self.kwargs_ = kwargs # keyword arguments for the encoder\n",
|
||||
" self.cv_ = KFold(n_splits=5)\n",
|
||||
"\n",
|
||||
" # Fit an encoder on one split and transform the feature on the\n",
|
||||
" # other. Iterating over the splits in all folds gives a complete\n",
|
||||
" # transformation. We also now have one trained encoder on each\n",
|
||||
" # fold.\n",
|
||||
" def fit_transform(self, X, y, cols):\n",
|
||||
" self.fitted_encoders_ = []\n",
|
||||
" self.cols_ = cols\n",
|
||||
" X_encoded = []\n",
|
||||
" for idx_encode, idx_train in self.cv_.split(X):\n",
|
||||
" fitted_encoder = self.encoder_(cols=cols, **self.kwargs_)\n",
|
||||
" fitted_encoder.fit(\n",
|
||||
" X.iloc[idx_encode, :], y.iloc[idx_encode],\n",
|
||||
" )\n",
|
||||
" X_encoded.append(fitted_encoder.transform(X.iloc[idx_train, :])[cols])\n",
|
||||
" self.fitted_encoders_.append(fitted_encoder)\n",
|
||||
" X_encoded = pd.concat(X_encoded)\n",
|
||||
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
|
||||
" return X_encoded\n",
|
||||
"\n",
|
||||
" # To transform the test data, average the encodings learned from\n",
|
||||
" # each fold.\n",
|
||||
" def transform(self, X):\n",
|
||||
" from functools import reduce\n",
|
||||
"\n",
|
||||
" X_encoded_list = []\n",
|
||||
" for fitted_encoder in self.fitted_encoders_:\n",
|
||||
" X_encoded = fitted_encoder.transform(X)\n",
|
||||
" X_encoded_list.append(X_encoded[self.cols_])\n",
|
||||
" X_encoded = reduce(\n",
|
||||
" lambda x, y: x.add(y, fill_value=0), X_encoded_list\n",
|
||||
" ) / len(X_encoded_list)\n",
|
||||
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
|
||||
" return X_encoded\n",
|
||||
" \n",
|
||||
" X = df_train.copy()\n",
|
||||
" y = X.pop(\"SalePrice\") \n",
|
||||
" \n",
|
||||
" X_test = df_test.copy()\n",
|
||||
" X_test.pop(\"SalePrice\")\n",
|
||||
" \n",
|
||||
" # Get the mutual information scores\n",
|
||||
" mi_scores = make_mi_scores(X, y)\n",
|
||||
" \n",
|
||||
" # Concat the training and test dataset before restoring categorical encoding\n",
|
||||
" X = pd.concat([X, X_test])\n",
|
||||
" \n",
|
||||
" # Restore the categorical encoding removed during csv conversion\n",
|
||||
" # The nominative (unordered) categorical features\n",
|
||||
" features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
|
||||
"\n",
|
||||
" # Pandas calls the categories \"levels\"\n",
|
||||
" five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
|
||||
" ten_levels = list(range(10))\n",
|
||||
"\n",
|
||||
" ordered_levels = {\n",
|
||||
" \"OverallQual\": ten_levels,\n",
|
||||
" \"OverallCond\": ten_levels,\n",
|
||||
" \"ExterQual\": five_levels,\n",
|
||||
" \"ExterCond\": five_levels,\n",
|
||||
" \"BsmtQual\": five_levels,\n",
|
||||
" \"BsmtCond\": five_levels,\n",
|
||||
" \"HeatingQC\": five_levels,\n",
|
||||
" \"KitchenQual\": five_levels,\n",
|
||||
" \"FireplaceQu\": five_levels,\n",
|
||||
" \"GarageQual\": five_levels,\n",
|
||||
" \"GarageCond\": five_levels,\n",
|
||||
" \"PoolQC\": five_levels,\n",
|
||||
" \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
|
||||
" \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
|
||||
" \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
|
||||
" \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
|
||||
" \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
|
||||
" \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
|
||||
" \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
|
||||
" \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
|
||||
" \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
|
||||
" \"CentralAir\": [\"N\", \"Y\"],\n",
|
||||
" \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
|
||||
" \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
"# Add a None level for missing values\n",
|
||||
" ordered_levels = {key: [\"None\"] + value for key, value in\n",
|
||||
" ordered_levels.items()}\n",
|
||||
" \n",
|
||||
" for name in features_nom:\n",
|
||||
" X[name] = X[name].astype(\"category\")\n",
|
||||
" if \"None\" not in X[name].cat.categories:\n",
|
||||
" X[name].cat.add_categories(\"None\", inplace=True)\n",
|
||||
" \n",
|
||||
" # Ordinal categories\n",
|
||||
" for name, levels in ordered_levels.items():\n",
|
||||
" X[name] = X[name].astype(CategoricalDtype(levels,\n",
|
||||
" ordered=True))\n",
|
||||
" \n",
|
||||
" # Drop features with less mutual information scores\n",
|
||||
" X = drop_uninformative(X, mi_scores)\n",
|
||||
" \n",
|
||||
"\n",
|
||||
" # Transformations\n",
|
||||
" X = X.join(mathematical_transforms(X))\n",
|
||||
" X = X.join(interactions(X))\n",
|
||||
" X = X.join(counts(X))\n",
|
||||
" # X = X.join(break_down(X))\n",
|
||||
" X = X.join(group_transforms(X))\n",
|
||||
"\n",
|
||||
" # Clustering\n",
|
||||
" # X = X.join(cluster_labels(X, cluster_features, n_clusters=20))\n",
|
||||
" # X = X.join(cluster_distance(X, cluster_features, n_clusters=20))\n",
|
||||
"\n",
|
||||
" # PCA\n",
|
||||
" X = X.join(pca_inspired(X))\n",
|
||||
" # X = X.join(pca_components(X, pca_features))\n",
|
||||
" # X = X.join(indicate_outliers(X))\n",
|
||||
" \n",
|
||||
" # Label encoding\n",
|
||||
" X = label_encode(X)\n",
|
||||
" \n",
|
||||
" # Reform splits\n",
|
||||
" X_test = X.loc[df_test.index, :]\n",
|
||||
" X.drop(df_test.index, inplace=True)\n",
|
||||
"\n",
|
||||
" # Target Encoder\n",
|
||||
" encoder = CrossFoldEncoder(MEstimateEncoder, m=1)\n",
|
||||
" X = X.join(encoder.fit_transform(X, y, cols=[\"MSSubClass\"]))\n",
|
||||
" \n",
|
||||
" X_test = X_test.join(encoder.transform(X_test))\n",
|
||||
" \n",
|
||||
" X.to_csv(feat_train_output_csv)\n",
|
||||
" X_test.to_csv(feat_test_output_csv)\n",
|
||||
"\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"featured_data_op = func_to_container_op(featured_data, packages_to_install = import_packages)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"### Train data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def train_data(train_path: comp.InputPath(), feat_train_path: comp.InputPath(), feat_test_path : comp.InputPath(), model_path : comp.OutputPath('XGBoostModel')):\n",
|
||||
" \n",
|
||||
" import pandas as pd\n",
|
||||
" import numpy as np\n",
|
||||
" from xgboost.sklearn import XGBRegressor\n",
|
||||
" from pathlib import Path\n",
|
||||
" \n",
|
||||
" df_train = pd.read_csv(train_path, index_col=\"Id\")\n",
|
||||
" X_train = pd.read_csv(feat_train_path, index_col=\"Id\")\n",
|
||||
" X_test = pd.read_csv(feat_test_path, index_col=\"Id\")\n",
|
||||
" y_train = df_train.loc[:, \"SalePrice\"]\n",
|
||||
" \n",
|
||||
" xgb_params = dict(\n",
|
||||
" max_depth=6, # maximum depth of each tree - try 2 to 10\n",
|
||||
" learning_rate=0.01, # effect of each tree - try 0.0001 to 0.1\n",
|
||||
" n_estimators=1000, # number of trees (that is, boosting rounds) - try 1000 to 8000\n",
|
||||
" min_child_weight=1, # minimum number of houses in a leaf - try 1 to 10\n",
|
||||
" colsample_bytree=0.7, # fraction of features (columns) per tree - try 0.2 to 1.0\n",
|
||||
" subsample=0.7, # fraction of instances (rows) per tree - try 0.2 to 1.0\n",
|
||||
" reg_alpha=0.5, # L1 regularization (like LASSO) - try 0.0 to 10.0\n",
|
||||
" reg_lambda=1.0, # L2 regularization (like Ridge) - try 0.0 to 10.0\n",
|
||||
" num_parallel_tree=1, # set > 1 for boosted random forests\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" xgb = XGBRegressor(**xgb_params)\n",
|
||||
" # XGB minimizes MSE, but competition loss is RMSLE\n",
|
||||
" # So, we need to log-transform y to train and exp-transform the predictions\n",
|
||||
" xgb.fit(X_train, np.log(y_train))\n",
|
||||
"\n",
|
||||
" Path(model_path).parent.mkdir(parents=True, exist_ok=True)\n",
|
||||
" xgb.save_model(model_path)\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train_data_op = func_to_container_op(train_data, packages_to_install= import_packages)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"### Evaluate data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def eval_data(test_data_path: comp.InputPath(), model_path: comp.InputPath('XGBoostModel')):\n",
|
||||
" \n",
|
||||
" import pandas as pd\n",
|
||||
" import numpy as np\n",
|
||||
" from xgboost.sklearn import XGBRegressor\n",
|
||||
" \n",
|
||||
" X_test = pd.read_csv(test_data_path, index_col=\"Id\")\n",
|
||||
" \n",
|
||||
" xgb = XGBRegressor()\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" xgb.load_model(model_path)\n",
|
||||
" \n",
|
||||
" predictions = np.exp(xgb.predict(X_test))\n",
|
||||
" \n",
|
||||
" print(predictions)\n",
|
||||
" \n",
|
||||
"# output = pd.DataFrame({'Id': X_test.index, 'SalePrice': predictions})\n",
|
||||
"# output.to_csv('data/my_submission.csv', index=False)\n",
|
||||
"# print(\"Your submission was successfully saved!\")\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"eval_data_op = func_to_container_op(eval_data, packages_to_install= import_packages)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"### Defining function that implements the pipeline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def vanilla_pipeline(url):\n",
|
||||
" \n",
|
||||
" web_downloader_task = web_downloader_op(url=url)\n",
|
||||
"\n",
|
||||
" load_and_preprocess_data_task = load_and_preprocess_data_op(file = web_downloader_task.outputs['data'])\n",
|
||||
"\n",
|
||||
" featured_data_task = featured_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'], test = load_and_preprocess_data_task.outputs['test_output_csv'])\n",
|
||||
" \n",
|
||||
" train_eval_task = train_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'] , feat_train = featured_data_task.outputs['feat_train_output_csv'],\n",
|
||||
" feat_test = featured_data_task.outputs['feat_test_output_csv'])\n",
|
||||
" \n",
|
||||
" eval_data_task = eval_data_op(test_data = featured_data_task.outputs['feat_test_output_csv'],model = train_eval_task.output)\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<a href=\"/pipeline/#/experiments/details/246b31c7-909a-446b-8152-0f429a0e745c\" target=\"_blank\" >Experiment details</a>."
|
||||
],
|
||||
"text/plain": [
|
||||
"<IPython.core.display.HTML object>"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<a href=\"/pipeline/#/runs/details/66011ba0-a465-4d5b-beba-f081ab3002b4\" target=\"_blank\" >Run details</a>."
|
||||
],
|
||||
"text/plain": [
|
||||
"<IPython.core.display.HTML object>"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"RunPipelineResult(run_id=66011ba0-a465-4d5b-beba-f081ab3002b4)"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Using kfp.Client() to run the pipeline from notebook itself\n",
|
||||
"client = kfp.Client() # change arguments accordingly\n",
|
||||
"\n",
|
||||
"# Running the pipeline\n",
|
||||
"client.create_run_from_pipeline_func(\n",
|
||||
" vanilla_pipeline,\n",
|
||||
" arguments={\n",
|
||||
" # Github url to fetch the data. This would change when you clone the repo. Please update the url as per that.\n",
|
||||
" 'url': 'https://github.com/NeoKish/examples/raw/master/house-prices-kaggle-competition/data.zip'\n",
|
||||
" })"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"kubeflow_notebook": {
|
||||
"autosnapshot": true,
|
||||
"docker_image": "gcr.io/arrikto/jupyter-kale-py36@sha256:dd3f92ca66b46d247e4b9b6a9d84ffbb368646263c2e3909473c3b851f3fe198",
|
||||
"experiment": {
|
||||
"id": "",
|
||||
"name": ""
|
||||
},
|
||||
"experiment_name": "",
|
||||
"katib_metadata": {
|
||||
"algorithm": {
|
||||
"algorithmName": "grid"
|
||||
},
|
||||
"maxFailedTrialCount": 3,
|
||||
"maxTrialCount": 12,
|
||||
"objective": {
|
||||
"objectiveMetricName": "",
|
||||
"type": "minimize"
|
||||
},
|
||||
"parallelTrialCount": 3,
|
||||
"parameters": []
|
||||
},
|
||||
"katib_run": false,
|
||||
"pipeline_description": "",
|
||||
"pipeline_name": "",
|
||||
"snapshot_volumes": true,
|
||||
"steps_defaults": [
|
||||
"label:access-ml-pipeline:true",
|
||||
"label:access-rok:true"
|
||||
],
|
||||
"volume_access_mode": "rwm",
|
||||
"volumes": [
|
||||
{
|
||||
"annotations": [],
|
||||
"mount_point": "/home/jovyan/data",
|
||||
"name": "data-g2n6k",
|
||||
"size": 5,
|
||||
"size_type": "Gi",
|
||||
"snapshot": false,
|
||||
"type": "clone"
|
||||
},
|
||||
{
|
||||
"annotations": [],
|
||||
"mount_point": "/home/jovyan",
|
||||
"name": "house-prices-vanilla-workspace-2wscr",
|
||||
"size": 5,
|
||||
"size_type": "Gi",
|
||||
"snapshot": false,
|
||||
"type": "clone"
|
||||
}
|
||||
]
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
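The `CrossFoldEncoder` class in the notebook above computes out-of-fold target encodings: each fold's rows are encoded by an encoder fitted on the *other* folds, so no row's encoding ever sees its own target value. A minimal sketch of the same idea, using a plain per-category mean in place of `MEstimateEncoder` (the toy frame, column names, and fold count here are illustrative, not from the competition data):

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame: one categorical column and a numeric target
df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "a", "b"],
    "y":   [1.0, 3.0, 2.0, 4.0, 5.0, 6.0],
})

kf = KFold(n_splits=3, shuffle=False)
encoded = pd.Series(index=df.index, dtype=float)
for fit_idx, enc_idx in kf.split(df):
    # Learn per-category mean of y on the fitting folds only
    means = df.iloc[fit_idx].groupby("cat")["y"].mean()
    # Encode the held-out rows; fall back to the global mean for unseen categories
    fallback = df.iloc[fit_idx]["y"].mean()
    encoded.iloc[enc_idx] = (
        df.iloc[enc_idx]["cat"].map(means).fillna(fallback).values
    )

df["cat_encoded"] = encoded
print(df["cat_encoded"].tolist())  # each row encoded without its own target
```

At test time the notebook averages the encoders trained on each fold; here a single refit on the full training data would play the same role.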
|
||||
@ -0,0 +1,7 @@
|
|||
numpy
|
||||
pandas
|
||||
matplotlib
|
||||
scikit-learn
|
||||
seaborn
|
||||
category_encoders
|
||||
xgboost
|
||||