Example of converting a Kaggle Notebook to a Kubeflow pipeline (#945)

* Files added to github

* Correcting image paths for all images

* Modified image paths

* Centering Images

* Updated Github Download URL link in -kfp.ipynb file

* Updated the kfp_client.PNG image

* Updated README.md

* Fixed a grammatical error in README.md

* Fixed issues with README.md and house-prices-kale notebook

* Fixed minor grammatical errors/typos

* Fixed a couple of grammatical mistakes

* Modified the notebooks and README.md

* Added the notebook server docker image to README.md

* Updated README.md with KFaaS references
Kishan Savant 2022-05-20 06:07:23 +05:30 committed by GitHub
parent c55eb667a7
commit 7454117305
16 changed files with 8834 additions and 0 deletions


@@ -0,0 +1,120 @@
# Predicting House-Prices
In this repo, we convert a [notebook](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) provided for the [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition into a Kubeflow pipeline. The notebook builds on the hands-on exercises presented in the Kaggle Learn courses [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Feature Engineering](https://www.kaggle.com/learn/feature-engineering).
## Prerequisites for Building the Kubeflow Pipeline
If you don't already have Kubeflow up and running, we recommend signing up for a free trial of Arrikto's [Kubeflow as a Service](https://www.arrikto.com/kubeflow-as-a-service/). In the following example we use Kubeflow as a Service, but you should be able to run this example on any Kubeflow distribution.
## Testing environment
| Name | Version |
| ------------- |:-------------:|
| Kubeflow | v1.4 |
| kfp | 1.7.1 |
| kubeflow-kale | 0.6.0 |
## Initial Steps
1. Please follow the Prerequisites section to get Kubeflow running.
2. Create and connect to a new Jupyter Notebook server.
3. Clone this repo so that you have access to this directory, then work through the KFP and Kale steps explained below.
## KFP version
To start building a Kubeflow pipeline, first get acquainted with the Kubeflow Pipelines [documentation](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/) to understand what pipelines are, what their components are, and what goes into those components. There are different ways to build a pipeline component, as described [here](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/#building-pipeline-components). In the following example, we use lightweight Python function-based components to build up the pipeline.
### Step 1: Install Kubeflow Pipeline SDK and import the required kfp packages to run the pipeline
From kfp, we use [func_to_container_op](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.func_to_container_op), which builds a factory function from a Python function, along with [InputPath](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.InputPath) and [OutputPath](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.OutputPath) from the components package to pass the paths of files or models between tasks. [Passing data](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/#pass-data) this way uses kfp's supported data-passing mechanism: InputPath and OutputPath are how data and models move between components. A minimal sketch of this setup follows.
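A minimal sketch of the setup, with the kfp version pinned to the one in the testing table above (the `comp` alias matches the accompanying notebook):

```python
# Install the Kubeflow Pipelines SDK (uncomment on a fresh notebook server).
# !pip install kfp==1.7.1

import kfp
import kfp.components as comp
from kfp.components import func_to_container_op, InputPath, OutputPath
```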
### Step 2: Build out the pipeline components
Our Kubeflow pipeline is broken down into five pipeline components:
- Download data
- Load and Preprocess data
- Create Features
- Train data
- Evaluate data
We convert each Python function to a factory function with func_to_container_op; calling that factory function inside the pipeline function then creates a pipeline task. A condensed sketch of one component follows.
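The sketch below is abridged from the accompanying notebook: a lightweight component is just a typed Python function handed to `func_to_container_op`.

```python
# Abridged from the notebook: file arguments are declared with
# InputPath/OutputPath so kfp takes care of the data passing.
def load_and_preprocess_data(file_path: comp.InputPath(),
                             train_output_csv: comp.OutputPath(),
                             test_output_csv: comp.OutputPath()):
    import pandas as pd  # lightweight components import inside the function
    # ... load, clean, and encode the data, then write out both splits:
    # df_train.to_csv(train_output_csv); df_test.to_csv(test_output_csv)

# The returned factory function is later called inside the pipeline
# function to create a pipeline task.
load_and_preprocess_data_op = func_to_container_op(
    load_and_preprocess_data,
    packages_to_install=['pandas', 'sklearn', 'category_encoders', 'xgboost', 'numpy'])
```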
### Step 3: Create the pipeline function
After building all the pipeline components, we define a pipeline function that connects the components with appropriate inputs and outputs; running it generates the pipeline graph.
Our pipeline function takes a GitHub URL as input and starts with the first pipeline task, download_data_task. To create that task, we used the [load_component_from_url](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html?highlight=load_component_from_url#kfp.components.load_component_from_url) method, as sketched below.
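From the notebook, the download component is loaded from an existing, reusable component definition:

```python
# Reuse the maintained 'Download data' component from kubeflow/pipelines.
web_downloader_op = kfp.components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/contrib/web/Download/component.yaml')
```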
Pipeline function:
<p align="center">
<img src="images/kfp_pipeline_func.PNG">
</p>
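For readers who prefer text to the screenshot, here is the pipeline function, lightly condensed from the accompanying notebook (kfp derives argument names such as `file` and `train` from the components' `*_path` parameters):

```python
def vanilla_pipeline(url):
    web_downloader_task = web_downloader_op(url=url)
    load_task = load_and_preprocess_data_op(file=web_downloader_task.outputs['data'])
    featured_task = featured_data_op(
        train=load_task.outputs['train_output_csv'],
        test=load_task.outputs['test_output_csv'])
    train_task = train_data_op(
        train=load_task.outputs['train_output_csv'],
        feat_train=featured_task.outputs['feat_train_output_csv'],
        feat_test=featured_task.outputs['feat_test_output_csv'])
    eval_data_op(test_data=featured_task.outputs['feat_test_output_csv'],
                 model=train_task.output)
```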
### Step 4: Run the pipeline using a kfp.Client instance
There are different ways to run the pipeline function, as described in the [documentation](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/#compile-and-run-your-pipeline). Here we run the pipeline using the Kubeflow Pipelines SDK client:
<p align="center">
<img src="images/kfp_client.PNG">
</p>
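In code, the run is kicked off as follows (as in the accompanying notebook; adjust the `kfp.Client()` arguments for your own setup):

```python
client = kfp.Client()  # change arguments according to your Kubeflow setup

client.create_run_from_pipeline_func(
    vanilla_pipeline,
    arguments={
        # GitHub URL for the data; update it if you run from your own fork.
        'url': 'https://github.com/NeoKish/examples/raw/master/house-prices-kaggle-competition/data.zip'
    })
```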
Once all the cells execute successfully, you should see two hyperlinks, “Experiment details” and “Run details”. Click the “Run details” link to observe the pipeline running.
The final pipeline graph should look as follows:
<p align="center">
<img src="images/kfp_pipeline_graph.PNG">
</p>
## Kale version
For the Kaggle notebook example, we are using [Kubeflow as a Service](https://www.arrikto.com/kubeflow-as-a-service/), where Kale comes preinstalled. If you are on a different Kubeflow setup, refer to the [GitHub link](https://github.com/kubeflow-kale/kale#getting-started) for installing the Kale JupyterLab extension.
### Step 1: Annotate the notebook with Kale tags
The Kale notebook in this directory is already annotated. To see the annotations, open the Kale Deployment panel and click the Enable switch. Once it is switched on, you should see the following:
<p align="center">
<img src="images/kale_deployment_panel.PNG">
</p>
Please take time to understand how each cell is annotated by clicking on the cell and checking which tag is used and what its dependencies are. Kale provides six tags for annotations:
- Imports
- Functions
- Pipeline Parameters
- Pipeline Metrics
- Pipeline Step
- Skip Cell
You can also inspect the tags in the cell metadata by clicking the Property Inspector above the Kale Deployment Panel button; a hypothetical sketch of that metadata follows the screenshot.
<p align="center">
<img src="images/kale_cell_metadata.PNG">
</p>
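As a hypothetical illustration (Kale generates the exact tag names through its UI, so treat this as a sketch rather than a spec), the metadata of an annotated cell carries tags along these lines:

```python
# Sketch of the cell metadata Kale writes for an annotated cell:
# "block:<step>" names the pipeline step, "prev:<step>" declares a dependency.
cell_metadata = {
    "tags": [
        "block:load_data",
        "prev:download_data",
    ]
}
```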
### Step 2: Run the Kubeflow Pipeline
Once you've tagged your notebook, click the “Compile and Run” button in the Kale widget. Kale will perform the following tasks for you:
- Validate the notebook
- Take a snapshot
- Compile the notebook
- Upload the pipeline
- Run the pipeline
In the “Running pipeline” output, click the “View” hyperlink. This takes you directly to the runtime execution graph, where you can watch your pipeline execute and update in real time.
<p align="center">
<img src="images/kale_pipeline_graph.PNG">
</p>
## Note
Both notebooks have been tested. If you run into an error, try the notebook server Docker image they were tested with:
gcr.io/arrikto/jupyter-kale-py36@sha256:dd3f92ca66b46d247e4b9b6a9d84ffbb368646263c2e3909473c3b851f3fe198
If the error persists, please raise an issue.

Binary file not shown.


@@ -0,0 +1,523 @@
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access to property
Grvl Gravel
Pave Paved
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
LandContour: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Utilities: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Neighborhood: Physical locations within Ames city limits
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
NAmes North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Condition1: Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Condition2: Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
BldgType: Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
HouseStyle: Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
OverallQual: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
OverallCond: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
RoofStyle: Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
RoofMatl: Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Exterior1st: Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
MasVnrArea: Masonry veneer area in square feet
ExterQual: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
ExterCond: Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Foundation: Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Concrete
Slab Slab
Stone Stone
Wood Wood
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
BsmtCond: Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
BsmtExposure: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
BsmtFinType1: Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
HeatingQC: Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
CentralAir: Central air conditioning
N No
Y Yes
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Bedrooms above grade (does NOT include basement bedrooms)
Kitchen: Kitchens above grade
KitchenQual: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
GarageType: Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
GarageCond: Garage condition
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
PavedDrive: Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold (MM)
YrSold: Year Sold (YYYY)
SaleType: Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
SaleCondition: Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large


@@ -0,0 +1,815 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Kaggle Getting Started Competition : House Prices - Advanced Regression Techniques "
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"The notebook is based on the [notebook](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) provided for [House prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition. The notebook is a buildup of hands-on-exercises presented in Kaggle Learn courses of [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Feature Engineering](https://www.kaggle.com/learn/feature-engineering)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Following are the imports required to build the pipeline and pass the data between components for building up the kubeflow pipeline"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Install the kfp \n",
"# !pip install kfp --upgrade "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import kfp\n",
"from kfp.components import func_to_container_op\n",
"import kfp.components as comp"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"All the essential imports required in a pipeline component are put together in a list which then is passed on to each pipeline component. Though this might not be efficient when you are dealing with lot of packages, so in cases with many packages and dependencies you can go for docker image which then can be passed to each pipeline component"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import_packages = ['pandas', 'sklearn', 'category_encoders', 'xgboost', 'numpy']"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"In the following implementation of kubeflow pipeline we are making use of [lightweight python function components](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/) to build up the pipeline. The data is passed between component instances(tasks) using InputPath and OutputPath. This doesn't require use of defining external volume and attaching to the tasks as the system takes care of storing the data. Further details and examples of it can be found in the following [link](https://github.com/Ark-kun/kfp_samples/blob/65a98da2d4d2bd27a803ee58213b4cfd8a84825e/2019-10%20Kubeflow%20summit/104%20-%20Passing%20data%20for%20python%20components/104%20-%20Passing%20data%20for%20python%20components.ipynb)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"The pipeline is divided into five components\n",
"1. Download data zip file from url\n",
"2. Load data\n",
"3. Creating data with features\n",
"4. Train data\n",
"5. Evaluating data"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Download Data"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"For the purpose of this, we are using an existing yaml file available from kubeflow/pipelines for 'Download Data' component to download data from URLs. In our case, we are getting it from github."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"web_downloader_op = kfp.components.load_component_from_url(\n",
" 'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/contrib/web/Download/component.yaml')"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Load and Preprocess Data"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"def load_and_preprocess_data(file_path : comp.InputPath() , train_output_csv: comp.OutputPath(), test_output_csv: comp.OutputPath()):\n",
" \n",
" # Read data\n",
" import pandas as pd\n",
" from pandas.api.types import CategoricalDtype\n",
" from zipfile import ZipFile \n",
" \n",
" # Extracting from zip file \n",
" with ZipFile(file_path, 'r') as zip:\n",
" zip.extractall()\n",
" \n",
" # Load the training and test data\n",
" train_file_dir = 'data/train.csv'\n",
" test_file_dir = 'data/test.csv'\n",
" df_train = pd.read_csv(train_file_dir, index_col=\"Id\")\n",
" df_test = pd.read_csv( test_file_dir , index_col=\"Id\")\n",
" \n",
" # Merge the splits so we can process them together\n",
" df = pd.concat([df_train, df_test])\n",
" \n",
" # Clean data\n",
" df[\"Exterior2nd\"] = df[\"Exterior2nd\"].replace({\"Brk Cmn\": \"BrkComm\"})\n",
" # Some values of GarageYrBlt are corrupt, so we'll replace them\n",
" # with the year the house was built\n",
" df[\"GarageYrBlt\"] = df[\"GarageYrBlt\"].where(df.GarageYrBlt <= 2010, df.YearBuilt)\n",
" # Names beginning with numbers are awkward to work with\n",
" df.rename(columns={\n",
" \"1stFlrSF\": \"FirstFlrSF\",\n",
" \"2ndFlrSF\": \"SecondFlrSF\",\n",
" \"3SsnPorch\": \"Threeseasonporch\",\n",
" }, inplace=True,\n",
" )\n",
" \n",
" # Encode data\n",
" \n",
" # Nominal categories\n",
" # The numeric features are already encoded correctly (`float` for\n",
" # continuous, `int` for discrete), but the categoricals we'll need to\n",
" # do ourselves. Note in particular, that the `MSSubClass` feature is\n",
" # read as an `int` type, but is actually a (nominative) categorical.\n",
"\n",
" # The nominative (unordered) categorical features\n",
" features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
"\n",
" # Pandas calls the categories \"levels\"\n",
" five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
" ten_levels = list(range(10))\n",
"\n",
" ordered_levels = {\n",
" \"OverallQual\": ten_levels,\n",
" \"OverallCond\": ten_levels,\n",
" \"ExterQual\": five_levels,\n",
" \"ExterCond\": five_levels,\n",
" \"BsmtQual\": five_levels,\n",
" \"BsmtCond\": five_levels,\n",
" \"HeatingQC\": five_levels,\n",
" \"KitchenQual\": five_levels,\n",
" \"FireplaceQu\": five_levels,\n",
" \"GarageQual\": five_levels,\n",
" \"GarageCond\": five_levels,\n",
" \"PoolQC\": five_levels,\n",
" \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
" \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
" \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
" \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
" \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
" \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
" \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
" \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
" \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
" \"CentralAir\": [\"N\", \"Y\"],\n",
" \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
" \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
" }\n",
"\n",
" # Add a None level for missing values\n",
" ordered_levels = {key: [\"None\"] + value for key, value in\n",
" ordered_levels.items()}\n",
"\n",
"\n",
" for name in features_nom:\n",
" df[name] = df[name].astype(\"category\")\n",
" # Add a None category for missing values\n",
" if \"None\" not in df[name].cat.categories:\n",
" df[name].cat.add_categories(\"None\", inplace=True)\n",
" # Ordinal categories\n",
" for name, levels in ordered_levels.items():\n",
" df[name] = df[name].astype(CategoricalDtype(levels,\n",
" ordered=True))\n",
" \n",
" \n",
" # Impute data\n",
" for name in df.select_dtypes(\"number\"):\n",
" df[name] = df[name].fillna(0)\n",
" for name in df.select_dtypes(include = [\"category\"]):\n",
" df[name] = df[name].fillna(\"None\")\n",
" \n",
" # Reform splits \n",
" df_train = df.loc[df_train.index, :]\n",
" df_test = df.loc[df_test.index, :]\n",
" \n",
" # passing the data as csv files to outputs\n",
" df_train.to_csv(train_output_csv)\n",
" df_test.to_csv(test_output_csv) \n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"load_and_preprocess_data_op = func_to_container_op(load_and_preprocess_data,packages_to_install = import_packages)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Creating data with features"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"def featured_data(train_path: comp.InputPath(), test_path : comp.InputPath(), feat_train_output_csv: comp.OutputPath(), feat_test_output_csv: comp.OutputPath()):\n",
" \n",
" import pandas as pd\n",
" from pandas.api.types import CategoricalDtype\n",
" from category_encoders import MEstimateEncoder\n",
" from sklearn.feature_selection import mutual_info_regression\n",
" from sklearn.cluster import KMeans\n",
" from sklearn.decomposition import PCA\n",
" from sklearn.model_selection import KFold, cross_val_score\n",
" \n",
" df_train = pd.read_csv(train_path, index_col=\"Id\")\n",
" df_test = pd.read_csv(test_path, index_col=\"Id\")\n",
" \n",
" def make_mi_scores(X, y):\n",
" X = X.copy()\n",
" for colname in X.select_dtypes([\"object\",\"category\"]):\n",
" X[colname], _ = X[colname].factorize()\n",
" # All discrete features should now have integer dtypes\n",
" discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]\n",
" mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)\n",
" mi_scores = pd.Series(mi_scores, name=\"MI Scores\", index=X.columns)\n",
" mi_scores = mi_scores.sort_values(ascending=False)\n",
" return mi_scores\n",
" \n",
" def drop_uninformative(df, mi_scores):\n",
" return df.loc[:, mi_scores > 0.0]\n",
" \n",
" def label_encode(df):\n",
" \n",
" X = df.copy() \n",
" for colname in X.select_dtypes([\"category\"]):\n",
" X[colname] = X[colname].cat.codes\n",
" return X\n",
"\n",
" def mathematical_transforms(df):\n",
" X = pd.DataFrame() # dataframe to hold new features\n",
" X[\"LivLotRatio\"] = df.GrLivArea / df.LotArea\n",
" X[\"Spaciousness\"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd\n",
" return X\n",
"\n",
" def interactions(df):\n",
" X = pd.get_dummies(df.BldgType, prefix=\"Bldg\")\n",
" X = X.mul(df.GrLivArea, axis=0)\n",
" return X\n",
"\n",
" def counts(df):\n",
" X = pd.DataFrame()\n",
" X[\"PorchTypes\"] = df[[\n",
" \"WoodDeckSF\",\n",
" \"OpenPorchSF\",\n",
" \"EnclosedPorch\",\n",
" \"Threeseasonporch\",\n",
" \"ScreenPorch\",\n",
" ]].gt(0.0).sum(axis=1)\n",
" return X\n",
"\n",
" def break_down(df):\n",
" X = pd.DataFrame()\n",
" X[\"MSClass\"] = df.MSSubClass.str.split(\"_\", n=1, expand=True)[0]\n",
" return X\n",
"\n",
" def group_transforms(df):\n",
" X = pd.DataFrame()\n",
" X[\"MedNhbdArea\"] = df.groupby(\"Neighborhood\")[\"GrLivArea\"].transform(\"median\")\n",
" return X\n",
" \n",
" cluster_features = [\n",
" \"LotArea\",\n",
" \"TotalBsmtSF\",\n",
" \"FirstFlrSF\",\n",
" \"SecondFlrSF\",\n",
" \"GrLivArea\",\n",
" ]\n",
"\n",
" def cluster_labels(df, features, n_clusters=20):\n",
" X = df.copy()\n",
" X_scaled = X.loc[:, features]\n",
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
" kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)\n",
" X_new = pd.DataFrame()\n",
" X_new[\"Cluster\"] = kmeans.fit_predict(X_scaled)\n",
" return X_new\n",
"\n",
" def cluster_distance(df, features, n_clusters=20):\n",
" X = df.copy()\n",
" X_scaled = X.loc[:, features]\n",
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
" kmeans = KMeans(n_clusters=20, n_init=50, random_state=0)\n",
" X_cd = kmeans.fit_transform(X_scaled)\n",
" # Label features and join to dataset\n",
" X_cd = pd.DataFrame(\n",
" X_cd, columns=[f\"Centroid_{i}\" for i in range(X_cd.shape[1])]\n",
" )\n",
" return X_cd\n",
" \n",
" def apply_pca(X, standardize=True):\n",
" # Standardize\n",
" if standardize:\n",
" X = (X - X.mean(axis=0)) / X.std(axis=0)\n",
" # Create principal components\n",
" pca = PCA()\n",
" X_pca = pca.fit_transform(X)\n",
" # Convert to dataframe\n",
" component_names = [f\"PC{i+1}\" for i in range(X_pca.shape[1])]\n",
" X_pca = pd.DataFrame(X_pca, columns=component_names)\n",
" # Create loadings\n",
" loadings = pd.DataFrame(\n",
" pca.components_.T, # transpose the matrix of loadings\n",
" columns=component_names, # so the columns are the principal components\n",
" index=X.columns, # and the rows are the original features\n",
" )\n",
" return pca, X_pca, loadings\n",
"\n",
" def pca_inspired(df):\n",
" X = pd.DataFrame()\n",
" X[\"Feature1\"] = df.GrLivArea + df.TotalBsmtSF\n",
" X[\"Feature2\"] = df.YearRemodAdd * df.TotalBsmtSF\n",
" return X\n",
"\n",
"\n",
" def pca_components(df, features):\n",
" X = df.loc[:, features]\n",
" _, X_pca, _ = apply_pca(X)\n",
" return X_pca\n",
"\n",
"\n",
" pca_features = [\n",
" \"GarageArea\",\n",
" \"YearRemodAdd\",\n",
" \"TotalBsmtSF\",\n",
" \"GrLivArea\",\n",
" ]\n",
" \n",
" class CrossFoldEncoder:\n",
" def __init__(self, encoder, **kwargs):\n",
" self.encoder_ = encoder\n",
" self.kwargs_ = kwargs # keyword arguments for the encoder\n",
" self.cv_ = KFold(n_splits=5)\n",
"\n",
" # Fit an encoder on one split and transform the feature on the\n",
" # other. Iterating over the splits in all folds gives a complete\n",
" # transformation. We also now have one trained encoder on each\n",
" # fold.\n",
" def fit_transform(self, X, y, cols):\n",
" self.fitted_encoders_ = []\n",
" self.cols_ = cols\n",
" X_encoded = []\n",
" for idx_encode, idx_train in self.cv_.split(X):\n",
" fitted_encoder = self.encoder_(cols=cols, **self.kwargs_)\n",
" fitted_encoder.fit(\n",
" X.iloc[idx_encode, :], y.iloc[idx_encode],\n",
" )\n",
" X_encoded.append(fitted_encoder.transform(X.iloc[idx_train, :])[cols])\n",
" self.fitted_encoders_.append(fitted_encoder)\n",
" X_encoded = pd.concat(X_encoded)\n",
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
" return X_encoded\n",
"\n",
" # To transform the test data, average the encodings learned from\n",
" # each fold.\n",
" def transform(self, X):\n",
" from functools import reduce\n",
"\n",
" X_encoded_list = []\n",
" for fitted_encoder in self.fitted_encoders_:\n",
" X_encoded = fitted_encoder.transform(X)\n",
" X_encoded_list.append(X_encoded[self.cols_])\n",
" X_encoded = reduce(\n",
" lambda x, y: x.add(y, fill_value=0), X_encoded_list\n",
" ) / len(X_encoded_list)\n",
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
" return X_encoded\n",
" \n",
" X = df_train.copy()\n",
" y = X.pop(\"SalePrice\") \n",
" \n",
" X_test = df_test.copy()\n",
" X_test.pop(\"SalePrice\")\n",
" \n",
" # Get the mutual information scores\n",
" mi_scores = make_mi_scores(X, y)\n",
" \n",
" # Concat the training and test dataset before restoring categorical encoding\n",
" X = pd.concat([X, X_test])\n",
" \n",
" # Restore the categorical encoding removed during csv conversion\n",
" # The nominative (unordered) categorical features\n",
" features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
"\n",
" # Pandas calls the categories \"levels\"\n",
" five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
" ten_levels = list(range(10))\n",
"\n",
" ordered_levels = {\n",
" \"OverallQual\": ten_levels,\n",
" \"OverallCond\": ten_levels,\n",
" \"ExterQual\": five_levels,\n",
" \"ExterCond\": five_levels,\n",
" \"BsmtQual\": five_levels,\n",
" \"BsmtCond\": five_levels,\n",
" \"HeatingQC\": five_levels,\n",
" \"KitchenQual\": five_levels,\n",
" \"FireplaceQu\": five_levels,\n",
" \"GarageQual\": five_levels,\n",
" \"GarageCond\": five_levels,\n",
" \"PoolQC\": five_levels,\n",
" \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
" \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
" \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
" \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
" \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
" \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
" \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
" \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
" \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
" \"CentralAir\": [\"N\", \"Y\"],\n",
" \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
" \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
" }\n",
"\n",
"# Add a None level for missing values\n",
" ordered_levels = {key: [\"None\"] + value for key, value in\n",
" ordered_levels.items()}\n",
" \n",
" for name in features_nom:\n",
" X[name] = X[name].astype(\"category\")\n",
" if \"None\" not in X[name].cat.categories:\n",
" X[name].cat.add_categories(\"None\", inplace=True)\n",
" \n",
" # Ordinal categories\n",
" for name, levels in ordered_levels.items():\n",
" X[name] = X[name].astype(CategoricalDtype(levels,\n",
" ordered=True))\n",
" \n",
" # Drop features with less mutual information scores\n",
" X = drop_uninformative(X, mi_scores)\n",
" \n",
"\n",
" # Transformations\n",
" X = X.join(mathematical_transforms(X))\n",
" X = X.join(interactions(X))\n",
" X = X.join(counts(X))\n",
" # X = X.join(break_down(X))\n",
" X = X.join(group_transforms(X))\n",
"\n",
" # Clustering\n",
" # X = X.join(cluster_labels(X, cluster_features, n_clusters=20))\n",
" # X = X.join(cluster_distance(X, cluster_features, n_clusters=20))\n",
"\n",
" # PCA\n",
" X = X.join(pca_inspired(X))\n",
" # X = X.join(pca_components(X, pca_features))\n",
" # X = X.join(indicate_outliers(X))\n",
" \n",
" # Label encoding\n",
" X = label_encode(X)\n",
" \n",
" # Reform splits\n",
" X_test = X.loc[df_test.index, :]\n",
" X.drop(df_test.index, inplace=True)\n",
"\n",
" # Target Encoder\n",
" encoder = CrossFoldEncoder(MEstimateEncoder, m=1)\n",
" X = X.join(encoder.fit_transform(X, y, cols=[\"MSSubClass\"]))\n",
" \n",
" X_test = X_test.join(encoder.transform(X_test))\n",
" \n",
" X.to_csv(feat_train_output_csv)\n",
" X_test.to_csv(feat_test_output_csv)\n",
"\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"featured_data_op = func_to_container_op(featured_data, packages_to_install = import_packages)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Train data"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"def train_data(train_path: comp.InputPath(), feat_train_path: comp.InputPath(), feat_test_path : comp.InputPath(), model_path : comp.OutputPath('XGBoostModel')):\n",
" \n",
" import pandas as pd\n",
" import numpy as np\n",
" from xgboost.sklearn import XGBRegressor\n",
" from pathlib import Path\n",
" \n",
" df_train = pd.read_csv(train_path, index_col=\"Id\")\n",
" X_train = pd.read_csv(feat_train_path, index_col=\"Id\")\n",
" X_test = pd.read_csv(feat_test_path, index_col=\"Id\")\n",
" y_train = df_train.loc[:, \"SalePrice\"]\n",
" \n",
" xgb_params = dict(\n",
" max_depth=6, # maximum depth of each tree - try 2 to 10\n",
" learning_rate=0.01, # effect of each tree - try 0.0001 to 0.1\n",
" n_estimators=1000, # number of trees (that is, boosting rounds) - try 1000 to 8000\n",
" min_child_weight=1, # minimum number of houses in a leaf - try 1 to 10\n",
" colsample_bytree=0.7, # fraction of features (columns) per tree - try 0.2 to 1.0\n",
" subsample=0.7, # fraction of instances (rows) per tree - try 0.2 to 1.0\n",
" reg_alpha=0.5, # L1 regularization (like LASSO) - try 0.0 to 10.0\n",
" reg_lambda=1.0, # L2 regularization (like Ridge) - try 0.0 to 10.0\n",
" num_parallel_tree=1, # set > 1 for boosted random forests\n",
" )\n",
"\n",
" xgb = XGBRegressor(**xgb_params)\n",
" # XGB minimizes MSE, but competition loss is RMSLE\n",
" # So, we need to log-transform y to train and exp-transform the predictions\n",
" xgb.fit(X_train, np.log(y_train))\n",
"\n",
" Path(model_path).parent.mkdir(parents=True, exist_ok=True)\n",
" xgb.save_model(model_path)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"train_data_op = func_to_container_op(train_data, packages_to_install= import_packages)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Evaluate data"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def eval_data(test_data_path: comp.InputPath(), model_path: comp.InputPath('XGBoostModel')):\n",
" \n",
" import pandas as pd\n",
" import numpy as np\n",
" from xgboost.sklearn import XGBRegressor\n",
" \n",
" X_test = pd.read_csv(test_data_path, index_col=\"Id\")\n",
" \n",
" xgb = XGBRegressor()\n",
" \n",
" \n",
" xgb.load_model(model_path)\n",
" \n",
" predictions = np.exp(xgb.predict(X_test))\n",
" \n",
" print(predictions)\n",
" \n",
"# output = pd.DataFrame({'Id': X_test.index, 'SalePrice': predictions})\n",
"# output.to_csv('data/my_submission.csv', index=False)\n",
"# print(\"Your submission was successfully saved!\")\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"eval_data_op = func_to_container_op(eval_data, packages_to_install= import_packages)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"### Defining function that implements the pipeline"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"def vanilla_pipeline(url):\n",
" \n",
" web_downloader_task = web_downloader_op(url=url)\n",
"\n",
" load_and_preprocess_data_task = load_and_preprocess_data_op(file = web_downloader_task.outputs['data'])\n",
"\n",
" featured_data_task = featured_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'], test = load_and_preprocess_data_task.outputs['test_output_csv'])\n",
" \n",
" train_eval_task = train_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'] , feat_train = featured_data_task.outputs['feat_train_output_csv'],\n",
" feat_test = featured_data_task.outputs['feat_test_output_csv'])\n",
" \n",
" eval_data_task = eval_data_op(test_data = featured_data_task.outputs['feat_test_output_csv'],model = train_eval_task.output)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<a href=\"/pipeline/#/experiments/details/246b31c7-909a-446b-8152-0f429a0e745c\" target=\"_blank\" >Experiment details</a>."
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<a href=\"/pipeline/#/runs/details/66011ba0-a465-4d5b-beba-f081ab3002b4\" target=\"_blank\" >Run details</a>."
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": [
"RunPipelineResult(run_id=66011ba0-a465-4d5b-beba-f081ab3002b4)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Using kfp.Client() to run the pipeline from notebook itself\n",
"client = kfp.Client() # change arguments accordingly\n",
"\n",
"# Running the pipeline\n",
"client.create_run_from_pipeline_func(\n",
" vanilla_pipeline,\n",
" arguments={\n",
" # Github url to fetch the data. This would change when you clone the repo. Please update the url as per that.\n",
" 'url': 'https://github.com/NeoKish/examples/raw/master/house-prices-kaggle-competition/data.zip'\n",
" })"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kubeflow_notebook": {
"autosnapshot": true,
"docker_image": "gcr.io/arrikto/jupyter-kale-py36@sha256:dd3f92ca66b46d247e4b9b6a9d84ffbb368646263c2e3909473c3b851f3fe198",
"experiment": {
"id": "",
"name": ""
},
"experiment_name": "",
"katib_metadata": {
"algorithm": {
"algorithmName": "grid"
},
"maxFailedTrialCount": 3,
"maxTrialCount": 12,
"objective": {
"objectiveMetricName": "",
"type": "minimize"
},
"parallelTrialCount": 3,
"parameters": []
},
"katib_run": false,
"pipeline_description": "",
"pipeline_name": "",
"snapshot_volumes": true,
"steps_defaults": [
"label:access-ml-pipeline:true",
"label:access-rok:true"
],
"volume_access_mode": "rwm",
"volumes": [
{
"annotations": [],
"mount_point": "/home/jovyan/data",
"name": "data-g2n6k",
"size": 5,
"size_type": "Gi",
"snapshot": false,
"type": "clone"
},
{
"annotations": [],
"mount_point": "/home/jovyan",
"name": "house-prices-vanilla-workspace-2wscr",
"size": 5,
"size_type": "Gi",
"snapshot": false,
"type": "clone"
}
]
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}

File diff suppressed because one or more lines are too long

Binary image files (6) not shown. Sizes: 18, 92, 18, 17, 72, and 36 KiB.


@@ -0,0 +1,7 @@
numpy
pandas
matplotlib
sklearn
seaborn
category_encoders
xgboost