mirror of https://github.com/kubeflow/examples.git
Example of converting a Kaggle Notebook to a Kubeflow pipeline (#945)
* Files added to github
* Correcting image paths for all images
* Modified image paths
* Centering Images
* Updated Github Download URL link in -kfp.ipynb file
* Updated the kfp_client.PNG image
* Updated README.md
* Fixed a grammatical error in README.md
* Fixed issues with README.md and house-prices-kale notebook
* Fixed minor grammatical errors/typos
* Fixed couple of grammatical mistakes
* Modified the notebooks and README.md
* Added the notebook server docker image to README.md
* Updated README.md with KFaaS references
This commit is contained in:
parent c55eb667a7
commit 7454117305
# Predicting House Prices

In this repo we convert a [notebook](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) provided for the [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition into a Kubeflow pipeline. The notebook builds on the hands-on exercises presented in the Kaggle Learn courses [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Feature Engineering](https://www.kaggle.com/learn/feature-engineering).

## Prerequisites for Building the Kubeflow Pipeline

If you don't already have Kubeflow up and running, we recommend signing up for a free trial of Arrikto's [Kubeflow as a Service](https://www.arrikto.com/kubeflow-as-a-service/). The following example uses Kubeflow as a Service, but you should be able to run it on any Kubeflow distribution.

## Testing environment

| Name          | Version |
| ------------- |:-------:|
| Kubeflow      | v1.4    |
| kfp           | 1.7.1   |
| kubeflow-kale | 0.6.0   |

## Initial Steps

1. Follow the Prerequisites section to get Kubeflow running.
2. Create and connect to a new Jupyter Notebook server.
3. Clone this repo so you have access to this directory, then work through the kfp and Kale steps explained below.

## KFP version

To start building a Kubeflow pipeline, first get acquainted with the Kubeflow Pipelines [documentation](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/) to understand what a pipeline is, what its components are, and what goes into those components. There are different ways to build a pipeline component, as described [here](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/#building-pipeline-components). In the following example, we use lightweight Python function-based components to build the pipeline.

### Step 1: Install the Kubeflow Pipelines SDK and import the required kfp packages

From kfp, we will use [func_to_container_op](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.func_to_container_op), which builds a factory function from a Python function, and [InputPath](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.InputPath) and [OutputPath](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html#kfp.components.OutputPath) from the components package, which pass the paths of files or models between tasks. [Data is passed](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/#pass-data) between components using kfp's supported data-passing mechanism: InputPath and OutputPath are how you hand data or models from one component to the next.
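
The path-based contract can be illustrated without kfp at all: each step reads from and writes to file paths it is handed, which is what InputPath and OutputPath arrange for a task. A minimal stdlib-only sketch (function names here are illustrative, not part of the kfp API):

```python
import os
import tempfile

# Each "component" works on file paths instead of in-memory objects,
# mirroring how kfp's InputPath/OutputPath pass data between tasks.
def produce_numbers(output_path: str):
    with open(output_path, "w") as f:
        f.write("1,2,3")

def sum_numbers(input_path: str, output_path: str):
    with open(input_path) as f:
        total = sum(int(x) for x in f.read().split(","))
    with open(output_path, "w") as f:
        f.write(str(total))

# Locally we wire the paths by hand; in a pipeline the SDK provides them.
tmp = tempfile.mkdtemp()
raw = os.path.join(tmp, "raw.csv")
result = os.path.join(tmp, "sum.txt")
produce_numbers(raw)
sum_numbers(raw, result)
print(open(result).read())  # -> 6
```

Because each step only sees paths, the SDK is free to back them with whatever storage the cluster provides.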

### Step 2: Build the pipeline components

Our Kubeflow pipeline is broken down into five components:

- Download data
- Load and preprocess data
- Create features
- Train data
- Evaluate data

We convert each Python function to a factory function using func_to_container_op; each factory is then used to create a pipeline task inside our pipeline function.
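
To make the factory idea concrete, here is a toy, stdlib-only analogy (this is not how kfp is implemented; func_to_container_op actually packages the function into a container): wrapping a plain function yields a factory that, when called, returns a task description to be executed later by the runner.

```python
# Toy stand-in for func_to_container_op: wrap a plain function into a
# factory that returns a task description instead of running immediately.
def to_task_factory(func):
    def factory(*args, **kwargs):
        return {"name": func.__name__, "func": func,
                "args": args, "kwargs": kwargs}
    return factory

def train(data: str) -> str:
    return f"model trained on {data}"

train_task = to_task_factory(train)
task = train_task("train.csv")                          # a task spec, not yet run
result = task["func"](*task["args"], **task["kwargs"])  # the runner executes it
print(result)  # -> model trained on train.csv
```

The separation between describing a task and executing it is what lets the pipeline engine schedule components on the cluster.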

### Step 3: Create the pipeline function

After building all the pipeline components, we define a pipeline function that connects them with the appropriate inputs and outputs. Running it generates the pipeline graph.

Our pipeline function takes a GitHub URL as input to start the first pipeline task, download_data_task. For this we used the [load_component_from_url](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.components.html?highlight=load_component_from_url#kfp.components.load_component_from_url) method to create the pipeline task.
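
Stripped of the SDK, a pipeline function is just dependency wiring: each task's output path feeds the next task's input. A stdlib-only sketch of that idea (step and parameter names are illustrative; no real download happens):

```python
import os
import tempfile

# Stand-ins for two pipeline tasks; no network access is performed here.
def download_data(url: str, out_path: str):
    with open(out_path, "w") as f:
        f.write(f"data from {url}")

def preprocess(in_path: str, out_path: str):
    with open(in_path) as f, open(out_path, "w") as g:
        g.write(f.read().upper())

# The "pipeline function": wire each task's output into the next task's input.
def pipeline(url: str) -> str:
    tmp = tempfile.mkdtemp()
    raw = os.path.join(tmp, "raw.txt")
    clean = os.path.join(tmp, "clean.txt")
    download_data(url, raw)   # plays the role of download_data_task
    preprocess(raw, clean)    # downstream task consumes its output
    return open(clean).read()

print(pipeline("https://example.com/data.zip"))
```

In the real pipeline, the SDK reads these input/output connections to build the dependency graph shown below the function.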

Pipeline function:

<p align="center">
  <img src="images/kfp_pipeline_func.PNG">
</p>

### Step 4: Run the pipeline using a kfp.Client instance

There are different ways to run the pipeline function, as described in the [documentation](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/#compile-and-run-your-pipeline). We run the pipeline using the Kubeflow Pipelines SDK client.

<p align="center">
  <img src="images/kfp_client.PNG">
</p>

Once all the cells have executed successfully, you should see two hyperlinks, 'Experiment details' and 'Run details'. Click the 'Run details' link to observe the pipeline run.

The final pipeline graph looks as follows:

<p align="center">
  <img src="images/kfp_pipeline_graph.PNG">
</p>
## Kale version
|
||||
|
||||
For the Kaggle notebook example, we are using [Kubeflow as a Service](https://www.arrikto.com/kubeflow-as-a-service/). If you are using Kubeflow as a Service then Kale comes preinstalled. For users with different Kubeflow setup, you can refer to the [GitHub link](https://github.com/kubeflow-kale/kale#getting-started) for installing the Kale JupyterLab extension on your setup.
|
||||
|
||||
### Step 1: Annotate the notebook with Kale tags
|
||||
|
||||
The Kale notebook in the directory is already annotated. To see the annotations, open up the Kale Deployment panel and click on the Enable switch button. Once you have it switched on, you should see the following:
|
||||
|
||||
<p align="center">
|
||||
<img src="images/kale_deployment_panel.PNG">
|
||||
</p>
|
||||
|
||||
Please take time to understand how each cell is annotated by clicking on the cell and checking out the tag being used and what are is its dependencies. Kale provides us with six tags for annotations:

- Imports
- Functions
- Pipeline Parameters
- Pipeline Metrics
- Pipeline Step
- Skip Cell

You can also see the tags being created by checking the cell metadata: click the Property Inspector above the Kale Deployment panel button.

<p align="center">
  <img src="images/kale_cell_metadata.PNG">
</p>
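
For orientation, the Kale annotations are stored as tags in each cell's metadata. A rough sketch of what the metadata of an annotated code cell can look like (the step and dependency names here are illustrative, and the exact schema may vary between Kale versions):

```json
{
  "tags": [
    "block:load_and_preprocess_data",
    "prev:download_data"
  ]
}
```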

### Step 2: Run the Kubeflow Pipeline

Once you've tagged your notebook, click the "Compile and Run" button in the Kale widget. Kale will perform the following tasks for you:

- Validate the notebook
- Take a snapshot
- Compile the notebook
- Upload the pipeline
- Run the pipeline

In the "Running pipeline" output, click the "View" hyperlink. This will take you directly to the runtime execution graph, where you can watch your pipeline execute and update in real time.

<p align="center">
  <img src="images/kale_pipeline_graph.PNG">
</p>

## Note

Both notebooks have been tested. If you run into an error, try the following notebook server Docker image:

gcr.io/arrikto/jupyter-kale-py36@sha256:dd3f92ca66b46d247e4b9b6a9d84ffbb368646263c2e3909473c3b851f3fe198

If the error persists, please raise an issue.
Binary file not shown.

MSSubClass: Identifies the type of dwelling involved in the sale.

        20   1-STORY 1946 & NEWER ALL STYLES
        30   1-STORY 1945 & OLDER
        40   1-STORY W/FINISHED ATTIC ALL AGES
        45   1-1/2 STORY - UNFINISHED ALL AGES
        50   1-1/2 STORY FINISHED ALL AGES
        60   2-STORY 1946 & NEWER
        70   2-STORY 1945 & OLDER
        75   2-1/2 STORY ALL AGES
        80   SPLIT OR MULTI-LEVEL
        85   SPLIT FOYER
        90   DUPLEX - ALL STYLES AND AGES
        120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
        150  1-1/2 STORY PUD - ALL AGES
        160  2-STORY PUD - 1946 & NEWER
        180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
        190  2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

        A    Agriculture
        C    Commercial
        FV   Floating Village Residential
        I    Industrial
        RH   Residential High Density
        RL   Residential Low Density
        RP   Residential Low Density Park
        RM   Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

        Grvl  Gravel
        Pave  Paved

Alley: Type of alley access to property

        Grvl  Gravel
        Pave  Paved
        NA    No alley access

LotShape: General shape of property

        Reg  Regular
        IR1  Slightly irregular
        IR2  Moderately Irregular
        IR3  Irregular

LandContour: Flatness of the property

        Lvl  Near Flat/Level
        Bnk  Banked - Quick and significant rise from street grade to building
        HLS  Hillside - Significant slope from side to side
        Low  Depression

Utilities: Type of utilities available

        AllPub  All public Utilities (E,G,W,& S)
        NoSewr  Electricity, Gas, and Water (Septic Tank)
        NoSeWa  Electricity and Gas Only
        ELO     Electricity only

LotConfig: Lot configuration

        Inside   Inside lot
        Corner   Corner lot
        CulDSac  Cul-de-sac
        FR2      Frontage on 2 sides of property
        FR3      Frontage on 3 sides of property

LandSlope: Slope of property

        Gtl  Gentle slope
        Mod  Moderate Slope
        Sev  Severe Slope

Neighborhood: Physical locations within Ames city limits

        Blmngtn  Bloomington Heights
        Blueste  Bluestem
        BrDale   Briardale
        BrkSide  Brookside
        ClearCr  Clear Creek
        CollgCr  College Creek
        Crawfor  Crawford
        Edwards  Edwards
        Gilbert  Gilbert
        IDOTRR   Iowa DOT and Rail Road
        MeadowV  Meadow Village
        Mitchel  Mitchell
        Names    North Ames
        NoRidge  Northridge
        NPkVill  Northpark Villa
        NridgHt  Northridge Heights
        NWAmes   Northwest Ames
        OldTown  Old Town
        SWISU    South & West of Iowa State University
        Sawyer   Sawyer
        SawyerW  Sawyer West
        Somerst  Somerset
        StoneBr  Stone Brook
        Timber   Timberland
        Veenker  Veenker

Condition1: Proximity to various conditions

        Artery  Adjacent to arterial street
        Feedr   Adjacent to feeder street
        Norm    Normal
        RRNn    Within 200' of North-South Railroad
        RRAn    Adjacent to North-South Railroad
        PosN    Near positive off-site feature--park, greenbelt, etc.
        PosA    Adjacent to positive off-site feature
        RRNe    Within 200' of East-West Railroad
        RRAe    Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present)

        Artery  Adjacent to arterial street
        Feedr   Adjacent to feeder street
        Norm    Normal
        RRNn    Within 200' of North-South Railroad
        RRAn    Adjacent to North-South Railroad
        PosN    Near positive off-site feature--park, greenbelt, etc.
        PosA    Adjacent to positive off-site feature
        RRNe    Within 200' of East-West Railroad
        RRAe    Adjacent to East-West Railroad

BldgType: Type of dwelling

        1Fam    Single-family Detached
        2FmCon  Two-family Conversion; originally built as one-family dwelling
        Duplx   Duplex
        TwnhsE  Townhouse End Unit
        TwnhsI  Townhouse Inside Unit

HouseStyle: Style of dwelling

        1Story  One story
        1.5Fin  One and one-half story: 2nd level finished
        1.5Unf  One and one-half story: 2nd level unfinished
        2Story  Two story
        2.5Fin  Two and one-half story: 2nd level finished
        2.5Unf  Two and one-half story: 2nd level unfinished
        SFoyer  Split Foyer
        SLvl    Split Level

OverallQual: Rates the overall material and finish of the house

        10  Very Excellent
        9   Excellent
        8   Very Good
        7   Good
        6   Above Average
        5   Average
        4   Below Average
        3   Fair
        2   Poor
        1   Very Poor

OverallCond: Rates the overall condition of the house

        10  Very Excellent
        9   Excellent
        8   Very Good
        7   Good
        6   Above Average
        5   Average
        4   Below Average
        3   Fair
        2   Poor
        1   Very Poor

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

        Flat     Flat
        Gable    Gable
        Gambrel  Gambrel (Barn)
        Hip      Hip
        Mansard  Mansard
        Shed     Shed

RoofMatl: Roof material

        ClyTile  Clay or Tile
        CompShg  Standard (Composite) Shingle
        Membran  Membrane
        Metal    Metal
        Roll     Roll
        Tar&Grv  Gravel & Tar
        WdShake  Wood Shakes
        WdShngl  Wood Shingles

Exterior1st: Exterior covering on house

        AsbShng  Asbestos Shingles
        AsphShn  Asphalt Shingles
        BrkComm  Brick Common
        BrkFace  Brick Face
        CBlock   Cinder Block
        CemntBd  Cement Board
        HdBoard  Hard Board
        ImStucc  Imitation Stucco
        MetalSd  Metal Siding
        Other    Other
        Plywood  Plywood
        PreCast  PreCast
        Stone    Stone
        Stucco   Stucco
        VinylSd  Vinyl Siding
        Wd Sdng  Wood Siding
        WdShing  Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

        AsbShng  Asbestos Shingles
        AsphShn  Asphalt Shingles
        BrkComm  Brick Common
        BrkFace  Brick Face
        CBlock   Cinder Block
        CemntBd  Cement Board
        HdBoard  Hard Board
        ImStucc  Imitation Stucco
        MetalSd  Metal Siding
        Other    Other
        Plywood  Plywood
        PreCast  PreCast
        Stone    Stone
        Stucco   Stucco
        VinylSd  Vinyl Siding
        Wd Sdng  Wood Siding
        WdShing  Wood Shingles

MasVnrType: Masonry veneer type

        BrkCmn   Brick Common
        BrkFace  Brick Face
        CBlock   Cinder Block
        None     None
        Stone    Stone

MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior

        Ex  Excellent
        Gd  Good
        TA  Average/Typical
        Fa  Fair
        Po  Poor

ExterCond: Evaluates the present condition of the material on the exterior

        Ex  Excellent
        Gd  Good
        TA  Average/Typical
        Fa  Fair
        Po  Poor

Foundation: Type of foundation

        BrkTil  Brick & Tile
        CBlock  Cinder Block
        PConc   Poured Concrete
        Slab    Slab
        Stone   Stone
        Wood    Wood

BsmtQual: Evaluates the height of the basement

        Ex  Excellent (100+ inches)
        Gd  Good (90-99 inches)
        TA  Typical (80-89 inches)
        Fa  Fair (70-79 inches)
        Po  Poor (<70 inches)
        NA  No Basement

BsmtCond: Evaluates the general condition of the basement

        Ex  Excellent
        Gd  Good
        TA  Typical - slight dampness allowed
        Fa  Fair - dampness or some cracking or settling
        Po  Poor - Severe cracking, settling, or wetness
        NA  No Basement

BsmtExposure: Refers to walkout or garden level walls

        Gd  Good Exposure
        Av  Average Exposure (split levels or foyers typically score average or above)
        Mn  Minimum Exposure
        No  No Exposure
        NA  No Basement

BsmtFinType1: Rating of basement finished area

        GLQ  Good Living Quarters
        ALQ  Average Living Quarters
        BLQ  Below Average Living Quarters
        Rec  Average Rec Room
        LwQ  Low Quality
        Unf  Unfinished
        NA   No Basement

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

        GLQ  Good Living Quarters
        ALQ  Average Living Quarters
        BLQ  Below Average Living Quarters
        Rec  Average Rec Room
        LwQ  Low Quality
        Unf  Unfinished
        NA   No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

        Floor  Floor Furnace
        GasA   Gas forced warm air furnace
        GasW   Gas hot water or steam heat
        Grav   Gravity furnace
        OthW   Hot water or steam heat other than gas
        Wall   Wall furnace

HeatingQC: Heating quality and condition

        Ex  Excellent
        Gd  Good
        TA  Average/Typical
        Fa  Fair
        Po  Poor

CentralAir: Central air conditioning

        N  No
        Y  Yes

Electrical: Electrical system

        SBrkr  Standard Circuit Breakers & Romex
        FuseA  Fuse Box over 60 AMP and all Romex wiring (Average)
        FuseF  60 AMP Fuse Box and mostly Romex wiring (Fair)
        FuseP  60 AMP Fuse Box and mostly knob & tube wiring (poor)
        Mix    Mixed

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Bedroom: Bedrooms above grade (does NOT include basement bedrooms)

Kitchen: Kitchens above grade

KitchenQual: Kitchen quality

        Ex  Excellent
        Gd  Good
        TA  Typical/Average
        Fa  Fair
        Po  Poor

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality (Assume typical unless deductions are warranted)

        Typ   Typical Functionality
        Min1  Minor Deductions 1
        Min2  Minor Deductions 2
        Mod   Moderate Deductions
        Maj1  Major Deductions 1
        Maj2  Major Deductions 2
        Sev   Severely Damaged
        Sal   Salvage only

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

        Ex  Excellent - Exceptional Masonry Fireplace
        Gd  Good - Masonry Fireplace in main level
        TA  Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
        Fa  Fair - Prefabricated Fireplace in basement
        Po  Poor - Ben Franklin Stove
        NA  No Fireplace

GarageType: Garage location

        2Types   More than one type of garage
        Attchd   Attached to home
        Basment  Basement Garage
        BuiltIn  Built-In (Garage part of house - typically has room above garage)
        CarPort  Car Port
        Detchd   Detached from home
        NA       No Garage

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

        Fin  Finished
        RFn  Rough Finished
        Unf  Unfinished
        NA   No Garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

        Ex  Excellent
        Gd  Good
        TA  Typical/Average
        Fa  Fair
        Po  Poor
        NA  No Garage

GarageCond: Garage condition

        Ex  Excellent
        Gd  Good
        TA  Typical/Average
        Fa  Fair
        Po  Poor
        NA  No Garage

PavedDrive: Paved driveway

        Y  Paved
        P  Partial Pavement
        N  Dirt/Gravel

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

        Ex  Excellent
        Gd  Good
        TA  Average/Typical
        Fa  Fair
        NA  No Pool

Fence: Fence quality

        GdPrv  Good Privacy
        MnPrv  Minimum Privacy
        GdWo   Good Wood
        MnWw   Minimum Wood/Wire
        NA     No Fence

MiscFeature: Miscellaneous feature not covered in other categories

        Elev  Elevator
        Gar2  2nd Garage (if not described in garage section)
        Othr  Other
        Shed  Shed (over 100 SF)
        TenC  Tennis Court
        NA    None

MiscVal: $Value of miscellaneous feature

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)

SaleType: Type of sale

        WD     Warranty Deed - Conventional
        CWD    Warranty Deed - Cash
        VWD    Warranty Deed - VA Loan
        New    Home just constructed and sold
        COD    Court Officer Deed/Estate
        Con    Contract 15% Down payment regular terms
        ConLw  Contract Low Down payment and low interest
        ConLI  Contract Low Interest
        ConLD  Contract Low Down
        Oth    Other

SaleCondition: Condition of sale

        Normal   Normal Sale
        Abnorml  Abnormal Sale - trade, foreclosure, short sale
        AdjLand  Adjoining Land Purchase
        Alloca   Allocation - two linked properties with separate deeds, typically condo with a garage unit
        Family   Sale between family members
        Partial  Home was not completed when last assessed (associated with New Homes)

File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large
File diff suppressed because it is too large

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "# Kaggle Getting Started Competition: House Prices - Advanced Regression Techniques"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "This notebook is based on the [notebook](https://www.kaggle.com/code/ryanholbrook/feature-engineering-for-house-prices) provided for the [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques) Kaggle competition. The notebook builds on the hands-on exercises presented in the Kaggle Learn courses [Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning) and [Feature Engineering](https://www.kaggle.com/learn/feature-engineering)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "The following imports are required to build the Kubeflow pipeline and pass data between its components."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Install the kfp package\n",
    "# !pip install kfp --upgrade"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import kfp\n",
    "from kfp.components import func_to_container_op\n",
    "import kfp.components as comp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "All the packages required in the pipeline components are put together in a list, which is then passed to each pipeline component. This may not be efficient when you are dealing with a lot of packages; in cases with many packages and dependencies, you can instead build a Docker image and pass that to each pipeline component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import_packages = ['pandas', 'sklearn', 'category_encoders', 'xgboost', 'numpy']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "In the following implementation of the Kubeflow pipeline we make use of [lightweight Python function components](https://www.kubeflow.org/docs/components/pipelines/sdk/python-function-components/) to build up the pipeline. The data is passed between component instances (tasks) using InputPath and OutputPath. This doesn't require defining an external volume and attaching it to the tasks, as the system takes care of storing the data. Further details and examples can be found at the following [link](https://github.com/Ark-kun/kfp_samples/blob/65a98da2d4d2bd27a803ee58213b4cfd8a84825e/2019-10%20Kubeflow%20summit/104%20-%20Passing%20data%20for%20python%20components/104%20-%20Passing%20data%20for%20python%20components.ipynb)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "The pipeline is divided into five components:\n",
    "1. Download the data zip file from a URL\n",
    "2. Load and preprocess the data\n",
    "3. Create data with features\n",
    "4. Train the model\n",
    "5. Evaluate the model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Download Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "Here we use an existing YAML file available from kubeflow/pipelines for the 'Download Data' component to download data from URLs. In our case, we are getting it from GitHub."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "web_downloader_op = kfp.components.load_component_from_url(\n",
    "    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/contrib/web/Download/component.yaml')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": []
   },
   "source": [
    "### Load and Preprocess Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_and_preprocess_data(file_path: comp.InputPath(), train_output_csv: comp.OutputPath(), test_output_csv: comp.OutputPath()):\n",
    "\n",
    "    import pandas as pd\n",
    "    from pandas.api.types import CategoricalDtype\n",
    "    from zipfile import ZipFile\n",
    "\n",
    "    # Extract the csv files from the zip file\n",
    "    with ZipFile(file_path, 'r') as zip:\n",
    "        zip.extractall()\n",
    "\n",
    "    # Load the training and test data\n",
    "    train_file_dir = 'data/train.csv'\n",
    "    test_file_dir = 'data/test.csv'\n",
    "    df_train = pd.read_csv(train_file_dir, index_col=\"Id\")\n",
    "    df_test = pd.read_csv(test_file_dir, index_col=\"Id\")\n",
    "\n",
    "    # Merge the splits so we can process them together\n",
    "    df = pd.concat([df_train, df_test])\n",
    "\n",
    "    # Clean data\n",
    "    df[\"Exterior2nd\"] = df[\"Exterior2nd\"].replace({\"Brk Cmn\": \"BrkComm\"})\n",
    "    # Some values of GarageYrBlt are corrupt, so we'll replace them\n",
    "    # with the year the house was built\n",
    "    df[\"GarageYrBlt\"] = df[\"GarageYrBlt\"].where(df.GarageYrBlt <= 2010, df.YearBuilt)\n",
    "    # Names beginning with numbers are awkward to work with\n",
    "    df.rename(columns={\n",
    "        \"1stFlrSF\": \"FirstFlrSF\",\n",
    "        \"2ndFlrSF\": \"SecondFlrSF\",\n",
    "        \"3SsnPorch\": \"Threeseasonporch\",\n",
    "    }, inplace=True)\n",
    "\n",
    "    # Encode data\n",
    "\n",
    "    # The numeric features are already encoded correctly (`float` for\n",
    "    # continuous, `int` for discrete), but the categoricals we'll need to\n",
    "    # do ourselves. Note in particular, that the `MSSubClass` feature is\n",
    "    # read as an `int` type, but is actually a (nominative) categorical.\n",
    "\n",
    "    # The nominative (unordered) categorical features\n",
    "    features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
    "\n",
    "    # Pandas calls the categories \"levels\"\n",
    "    five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
    "    ten_levels = list(range(10))\n",
    "\n",
    "    ordered_levels = {\n",
    "        \"OverallQual\": ten_levels,\n",
    "        \"OverallCond\": ten_levels,\n",
    "        \"ExterQual\": five_levels,\n",
    "        \"ExterCond\": five_levels,\n",
    "        \"BsmtQual\": five_levels,\n",
    "        \"BsmtCond\": five_levels,\n",
    "        \"HeatingQC\": five_levels,\n",
    "        \"KitchenQual\": five_levels,\n",
    "        \"FireplaceQu\": five_levels,\n",
    "        \"GarageQual\": five_levels,\n",
    "        \"GarageCond\": five_levels,\n",
    "        \"PoolQC\": five_levels,\n",
    "        \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
    "        \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
    "        \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
    "        \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
    "        \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
    "        \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
    "        \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
    "        \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
    "        \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
    "        \"CentralAir\": [\"N\", \"Y\"],\n",
    "        \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
    "        \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
    "    }\n",
    "\n",
    "    # Add a None level for missing values\n",
    "    ordered_levels = {key: [\"None\"] + value for key, value in ordered_levels.items()}\n",
    "\n",
    "    for name in features_nom:\n",
    "        df[name] = df[name].astype(\"category\")\n",
    "        # Add a None category for missing values\n",
    "        if \"None\" not in df[name].cat.categories:\n",
    "            df[name].cat.add_categories(\"None\", inplace=True)\n",
    "    # Ordinal categories\n",
    "    for name, levels in ordered_levels.items():\n",
    "        df[name] = df[name].astype(CategoricalDtype(levels, ordered=True))\n",
    "\n",
    "    # Impute data\n",
    "    for name in df.select_dtypes(\"number\"):\n",
    "        df[name] = df[name].fillna(0)\n",
    "    for name in df.select_dtypes(include=[\"category\"]):\n",
    "        df[name] = df[name].fillna(\"None\")\n",
    "\n",
    "    # Reform the train/test splits\n",
    "    df_train = df.loc[df_train.index, :]\n",
    "    df_test = df.loc[df_test.index, :]\n",
    "\n",
    "    # Pass the data to the outputs as csv files\n",
    "    df_train.to_csv(train_output_csv)\n",
    "    df_test.to_csv(test_output_csv)\n"
   ]
  },
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 6,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"load_and_preprocess_data_op = func_to_container_op(load_and_preprocess_data,packages_to_install = import_packages)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"### Creating data with features"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 7,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def featured_data(train_path: comp.InputPath(), test_path : comp.InputPath(), feat_train_output_csv: comp.OutputPath(), feat_test_output_csv: comp.OutputPath()):\n",
|
||||
" \n",
|
||||
" import pandas as pd\n",
|
||||
" from pandas.api.types import CategoricalDtype\n",
|
||||
" from category_encoders import MEstimateEncoder\n",
|
||||
" from sklearn.feature_selection import mutual_info_regression\n",
|
||||
" from sklearn.cluster import KMeans\n",
|
||||
" from sklearn.decomposition import PCA\n",
|
||||
" from sklearn.model_selection import KFold, cross_val_score\n",
|
||||
" \n",
|
||||
" df_train = pd.read_csv(train_path, index_col=\"Id\")\n",
|
||||
" df_test = pd.read_csv(test_path, index_col=\"Id\")\n",
|
||||
" \n",
|
||||
" def make_mi_scores(X, y):\n",
|
||||
" X = X.copy()\n",
|
||||
" for colname in X.select_dtypes([\"object\",\"category\"]):\n",
|
||||
" X[colname], _ = X[colname].factorize()\n",
|
||||
" # All discrete features should now have integer dtypes\n",
|
||||
" discrete_features = [pd.api.types.is_integer_dtype(t) for t in X.dtypes]\n",
|
||||
" mi_scores = mutual_info_regression(X, y, discrete_features=discrete_features, random_state=0)\n",
|
||||
" mi_scores = pd.Series(mi_scores, name=\"MI Scores\", index=X.columns)\n",
|
||||
" mi_scores = mi_scores.sort_values(ascending=False)\n",
|
||||
" return mi_scores\n",
|
||||
" \n",
|
||||
" def drop_uninformative(df, mi_scores):\n",
|
||||
" return df.loc[:, mi_scores > 0.0]\n",
|
||||
" \n",
|
||||
" def label_encode(df):\n",
|
||||
" \n",
|
||||
" X = df.copy() \n",
|
||||
" for colname in X.select_dtypes([\"category\"]):\n",
|
||||
" X[colname] = X[colname].cat.codes\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def mathematical_transforms(df):\n",
|
||||
" X = pd.DataFrame() # dataframe to hold new features\n",
|
||||
" X[\"LivLotRatio\"] = df.GrLivArea / df.LotArea\n",
|
||||
" X[\"Spaciousness\"] = (df.FirstFlrSF + df.SecondFlrSF) / df.TotRmsAbvGrd\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def interactions(df):\n",
|
||||
" X = pd.get_dummies(df.BldgType, prefix=\"Bldg\")\n",
|
||||
" X = X.mul(df.GrLivArea, axis=0)\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def counts(df):\n",
|
||||
" X = pd.DataFrame()\n",
|
||||
" X[\"PorchTypes\"] = df[[\n",
|
||||
" \"WoodDeckSF\",\n",
|
||||
" \"OpenPorchSF\",\n",
|
||||
" \"EnclosedPorch\",\n",
|
||||
" \"Threeseasonporch\",\n",
|
||||
" \"ScreenPorch\",\n",
|
||||
" ]].gt(0.0).sum(axis=1)\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def break_down(df):\n",
|
||||
" X = pd.DataFrame()\n",
|
||||
" X[\"MSClass\"] = df.MSSubClass.str.split(\"_\", n=1, expand=True)[0]\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
" def group_transforms(df):\n",
|
||||
" X = pd.DataFrame()\n",
|
||||
" X[\"MedNhbdArea\"] = df.groupby(\"Neighborhood\")[\"GrLivArea\"].transform(\"median\")\n",
|
||||
" return X\n",
|
||||
" \n",
|
||||
" cluster_features = [\n",
|
||||
" \"LotArea\",\n",
|
||||
" \"TotalBsmtSF\",\n",
|
||||
" \"FirstFlrSF\",\n",
|
||||
" \"SecondFlrSF\",\n",
|
||||
" \"GrLivArea\",\n",
|
||||
" ]\n",
|
||||
"\n",
|
||||
" def cluster_labels(df, features, n_clusters=20):\n",
|
||||
" X = df.copy()\n",
|
||||
" X_scaled = X.loc[:, features]\n",
|
||||
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
|
||||
" kmeans = KMeans(n_clusters=n_clusters, n_init=50, random_state=0)\n",
|
||||
" X_new = pd.DataFrame()\n",
|
||||
" X_new[\"Cluster\"] = kmeans.fit_predict(X_scaled)\n",
|
||||
" return X_new\n",
|
||||
"\n",
|
||||
" def cluster_distance(df, features, n_clusters=20):\n",
|
||||
" X = df.copy()\n",
|
||||
" X_scaled = X.loc[:, features]\n",
|
||||
" X_scaled = (X_scaled - X_scaled.mean(axis=0)) / X_scaled.std(axis=0)\n",
|
||||
" kmeans = KMeans(n_clusters=20, n_init=50, random_state=0)\n",
|
||||
" X_cd = kmeans.fit_transform(X_scaled)\n",
|
||||
" # Label features and join to dataset\n",
|
||||
" X_cd = pd.DataFrame(\n",
|
||||
" X_cd, columns=[f\"Centroid_{i}\" for i in range(X_cd.shape[1])]\n",
|
||||
" )\n",
|
||||
" return X_cd\n",
|
||||
" \n",
|
||||
" def apply_pca(X, standardize=True):\n",
|
||||
" # Standardize\n",
|
||||
" if standardize:\n",
|
||||
" X = (X - X.mean(axis=0)) / X.std(axis=0)\n",
|
||||
" # Create principal components\n",
|
||||
" pca = PCA()\n",
|
||||
" X_pca = pca.fit_transform(X)\n",
|
||||
" # Convert to dataframe\n",
|
||||
" component_names = [f\"PC{i+1}\" for i in range(X_pca.shape[1])]\n",
|
||||
" X_pca = pd.DataFrame(X_pca, columns=component_names)\n",
|
||||
" # Create loadings\n",
|
||||
" loadings = pd.DataFrame(\n",
|
||||
" pca.components_.T, # transpose the matrix of loadings\n",
|
||||
" columns=component_names, # so the columns are the principal components\n",
|
||||
" index=X.columns, # and the rows are the original features\n",
|
||||
" )\n",
|
||||
" return pca, X_pca, loadings\n",
|
||||
"\n",
|
||||
" def pca_inspired(df):\n",
|
||||
" X = pd.DataFrame()\n",
|
||||
" X[\"Feature1\"] = df.GrLivArea + df.TotalBsmtSF\n",
|
||||
" X[\"Feature2\"] = df.YearRemodAdd * df.TotalBsmtSF\n",
|
||||
" return X\n",
|
||||
"\n",
|
||||
"\n",
|
||||
" def pca_components(df, features):\n",
|
||||
" X = df.loc[:, features]\n",
|
||||
" _, X_pca, _ = apply_pca(X)\n",
|
||||
" return X_pca\n",
|
||||
"\n",
|
||||
"\n",
|
||||
" pca_features = [\n",
|
||||
" \"GarageArea\",\n",
|
||||
" \"YearRemodAdd\",\n",
|
||||
" \"TotalBsmtSF\",\n",
|
||||
" \"GrLivArea\",\n",
|
||||
" ]\n",
|
||||
" \n",
|
||||
" class CrossFoldEncoder:\n",
|
||||
" def __init__(self, encoder, **kwargs):\n",
|
||||
" self.encoder_ = encoder\n",
|
||||
" self.kwargs_ = kwargs # keyword arguments for the encoder\n",
|
||||
" self.cv_ = KFold(n_splits=5)\n",
|
||||
"\n",
|
||||
" # Fit an encoder on one split and transform the feature on the\n",
|
||||
" # other. Iterating over the splits in all folds gives a complete\n",
|
||||
" # transformation. We also now have one trained encoder on each\n",
|
||||
" # fold.\n",
|
||||
" def fit_transform(self, X, y, cols):\n",
|
||||
" self.fitted_encoders_ = []\n",
|
||||
" self.cols_ = cols\n",
|
||||
" X_encoded = []\n",
|
||||
" for idx_encode, idx_train in self.cv_.split(X):\n",
|
||||
" fitted_encoder = self.encoder_(cols=cols, **self.kwargs_)\n",
|
||||
" fitted_encoder.fit(\n",
|
||||
" X.iloc[idx_encode, :], y.iloc[idx_encode],\n",
|
||||
" )\n",
|
||||
" X_encoded.append(fitted_encoder.transform(X.iloc[idx_train, :])[cols])\n",
|
||||
" self.fitted_encoders_.append(fitted_encoder)\n",
|
||||
" X_encoded = pd.concat(X_encoded)\n",
|
||||
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
|
||||
" return X_encoded\n",
|
||||
"\n",
|
||||
" # To transform the test data, average the encodings learned from\n",
|
||||
" # each fold.\n",
|
||||
" def transform(self, X):\n",
|
||||
" from functools import reduce\n",
|
||||
"\n",
|
||||
" X_encoded_list = []\n",
|
||||
" for fitted_encoder in self.fitted_encoders_:\n",
|
||||
" X_encoded = fitted_encoder.transform(X)\n",
|
||||
" X_encoded_list.append(X_encoded[self.cols_])\n",
|
||||
" X_encoded = reduce(\n",
|
||||
" lambda x, y: x.add(y, fill_value=0), X_encoded_list\n",
|
||||
" ) / len(X_encoded_list)\n",
|
||||
" X_encoded.columns = [name + \"_encoded\" for name in X_encoded.columns]\n",
|
||||
" return X_encoded\n",
|
||||
" \n",
|
||||
" X = df_train.copy()\n",
|
||||
" y = X.pop(\"SalePrice\") \n",
|
||||
" \n",
|
||||
" X_test = df_test.copy()\n",
|
||||
" X_test.pop(\"SalePrice\")\n",
|
||||
" \n",
|
||||
" # Get the mutual information scores\n",
|
||||
" mi_scores = make_mi_scores(X, y)\n",
|
||||
" \n",
|
||||
" # Concat the training and test dataset before restoring categorical encoding\n",
|
||||
" X = pd.concat([X, X_test])\n",
|
||||
" \n",
|
||||
" # Restore the categorical encoding removed during csv conversion\n",
|
||||
" # The nominative (unordered) categorical features\n",
|
||||
" features_nom = [\"MSSubClass\", \"MSZoning\", \"Street\", \"Alley\", \"LandContour\", \"LotConfig\", \"Neighborhood\", \"Condition1\", \"Condition2\", \"BldgType\", \"HouseStyle\", \"RoofStyle\", \"RoofMatl\", \"Exterior1st\", \"Exterior2nd\", \"MasVnrType\", \"Foundation\", \"Heating\", \"CentralAir\", \"GarageType\", \"MiscFeature\", \"SaleType\", \"SaleCondition\"]\n",
|
||||
"\n",
|
||||
" # Pandas calls the categories \"levels\"\n",
|
||||
" five_levels = [\"Po\", \"Fa\", \"TA\", \"Gd\", \"Ex\"]\n",
|
||||
" ten_levels = list(range(10))\n",
|
||||
"\n",
|
||||
" ordered_levels = {\n",
|
||||
" \"OverallQual\": ten_levels,\n",
|
||||
" \"OverallCond\": ten_levels,\n",
|
||||
" \"ExterQual\": five_levels,\n",
|
||||
" \"ExterCond\": five_levels,\n",
|
||||
" \"BsmtQual\": five_levels,\n",
|
||||
" \"BsmtCond\": five_levels,\n",
|
||||
" \"HeatingQC\": five_levels,\n",
|
||||
" \"KitchenQual\": five_levels,\n",
|
||||
" \"FireplaceQu\": five_levels,\n",
|
||||
" \"GarageQual\": five_levels,\n",
|
||||
" \"GarageCond\": five_levels,\n",
|
||||
" \"PoolQC\": five_levels,\n",
|
||||
" \"LotShape\": [\"Reg\", \"IR1\", \"IR2\", \"IR3\"],\n",
|
||||
" \"LandSlope\": [\"Sev\", \"Mod\", \"Gtl\"],\n",
|
||||
" \"BsmtExposure\": [\"No\", \"Mn\", \"Av\", \"Gd\"],\n",
|
||||
" \"BsmtFinType1\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
|
||||
" \"BsmtFinType2\": [\"Unf\", \"LwQ\", \"Rec\", \"BLQ\", \"ALQ\", \"GLQ\"],\n",
|
||||
" \"Functional\": [\"Sal\", \"Sev\", \"Maj1\", \"Maj2\", \"Mod\", \"Min2\", \"Min1\", \"Typ\"],\n",
|
||||
" \"GarageFinish\": [\"Unf\", \"RFn\", \"Fin\"],\n",
|
||||
" \"PavedDrive\": [\"N\", \"P\", \"Y\"],\n",
|
||||
" \"Utilities\": [\"NoSeWa\", \"NoSewr\", \"AllPub\"],\n",
|
||||
" \"CentralAir\": [\"N\", \"Y\"],\n",
|
||||
" \"Electrical\": [\"Mix\", \"FuseP\", \"FuseF\", \"FuseA\", \"SBrkr\"],\n",
|
||||
" \"Fence\": [\"MnWw\", \"GdWo\", \"MnPrv\", \"GdPrv\"],\n",
|
||||
" }\n",
|
||||
"\n",
|
||||
"# Add a None level for missing values\n",
|
||||
" ordered_levels = {key: [\"None\"] + value for key, value in\n",
|
||||
" ordered_levels.items()}\n",
|
||||
" \n",
|
||||
" for name in features_nom:\n",
|
||||
" X[name] = X[name].astype(\"category\")\n",
|
||||
" if \"None\" not in X[name].cat.categories:\n",
|
||||
" X[name].cat.add_categories(\"None\", inplace=True)\n",
|
||||
" \n",
|
||||
" # Ordinal categories\n",
|
||||
" for name, levels in ordered_levels.items():\n",
|
||||
" X[name] = X[name].astype(CategoricalDtype(levels,\n",
|
||||
" ordered=True))\n",
|
||||
" \n",
|
||||
" # Drop features with less mutual information scores\n",
|
||||
" X = drop_uninformative(X, mi_scores)\n",
|
||||
" \n",
|
||||
"\n",
|
||||
" # Transformations\n",
|
||||
" X = X.join(mathematical_transforms(X))\n",
|
||||
" X = X.join(interactions(X))\n",
|
||||
" X = X.join(counts(X))\n",
|
||||
" # X = X.join(break_down(X))\n",
|
||||
" X = X.join(group_transforms(X))\n",
|
||||
"\n",
|
||||
" # Clustering\n",
|
||||
" # X = X.join(cluster_labels(X, cluster_features, n_clusters=20))\n",
|
||||
" # X = X.join(cluster_distance(X, cluster_features, n_clusters=20))\n",
|
||||
"\n",
|
||||
" # PCA\n",
|
||||
" X = X.join(pca_inspired(X))\n",
|
||||
" # X = X.join(pca_components(X, pca_features))\n",
|
||||
" # X = X.join(indicate_outliers(X))\n",
|
||||
" \n",
|
||||
" # Label encoding\n",
|
||||
" X = label_encode(X)\n",
|
||||
" \n",
|
||||
" # Reform splits\n",
|
||||
" X_test = X.loc[df_test.index, :]\n",
|
||||
" X.drop(df_test.index, inplace=True)\n",
|
||||
"\n",
|
||||
" # Target Encoder\n",
|
||||
" encoder = CrossFoldEncoder(MEstimateEncoder, m=1)\n",
|
||||
" X = X.join(encoder.fit_transform(X, y, cols=[\"MSSubClass\"]))\n",
|
||||
" \n",
|
||||
" X_test = X_test.join(encoder.transform(X_test))\n",
|
||||
" \n",
|
||||
" X.to_csv(feat_train_output_csv)\n",
|
||||
" X_test.to_csv(feat_test_output_csv)\n",
|
||||
"\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 8,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"featured_data_op = func_to_container_op(featured_data, packages_to_install = import_packages)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"### Train data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 9,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def train_data(train_path: comp.InputPath(), feat_train_path: comp.InputPath(), feat_test_path : comp.InputPath(), model_path : comp.OutputPath('XGBoostModel')):\n",
|
||||
" \n",
|
||||
" import pandas as pd\n",
|
||||
" import numpy as np\n",
|
||||
" from xgboost.sklearn import XGBRegressor\n",
|
||||
" from pathlib import Path\n",
|
||||
" \n",
|
||||
" df_train = pd.read_csv(train_path, index_col=\"Id\")\n",
|
||||
" X_train = pd.read_csv(feat_train_path, index_col=\"Id\")\n",
|
||||
" X_test = pd.read_csv(feat_test_path, index_col=\"Id\")\n",
|
||||
" y_train = df_train.loc[:, \"SalePrice\"]\n",
|
||||
" \n",
|
||||
" xgb_params = dict(\n",
|
||||
" max_depth=6, # maximum depth of each tree - try 2 to 10\n",
|
||||
" learning_rate=0.01, # effect of each tree - try 0.0001 to 0.1\n",
|
||||
" n_estimators=1000, # number of trees (that is, boosting rounds) - try 1000 to 8000\n",
|
||||
" min_child_weight=1, # minimum number of houses in a leaf - try 1 to 10\n",
|
||||
" colsample_bytree=0.7, # fraction of features (columns) per tree - try 0.2 to 1.0\n",
|
||||
" subsample=0.7, # fraction of instances (rows) per tree - try 0.2 to 1.0\n",
|
||||
" reg_alpha=0.5, # L1 regularization (like LASSO) - try 0.0 to 10.0\n",
|
||||
" reg_lambda=1.0, # L2 regularization (like Ridge) - try 0.0 to 10.0\n",
|
||||
" num_parallel_tree=1, # set > 1 for boosted random forests\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" xgb = XGBRegressor(**xgb_params)\n",
|
||||
" # XGB minimizes MSE, but competition loss is RMSLE\n",
|
||||
" # So, we need to log-transform y to train and exp-transform the predictions\n",
|
||||
" xgb.fit(X_train, np.log(y_train))\n",
|
||||
"\n",
|
||||
" Path(model_path).parent.mkdir(parents=True, exist_ok=True)\n",
|
||||
" xgb.save_model(model_path)\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 10,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"train_data_op = func_to_container_op(train_data, packages_to_install= import_packages)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"### Evaluate data"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 11,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def eval_data(test_data_path: comp.InputPath(), model_path: comp.InputPath('XGBoostModel')):\n",
|
||||
" \n",
|
||||
" import pandas as pd\n",
|
||||
" import numpy as np\n",
|
||||
" from xgboost.sklearn import XGBRegressor\n",
|
||||
" \n",
|
||||
" X_test = pd.read_csv(test_data_path, index_col=\"Id\")\n",
|
||||
" \n",
|
||||
" xgb = XGBRegressor()\n",
|
||||
" \n",
|
||||
" \n",
|
||||
" xgb.load_model(model_path)\n",
|
||||
" \n",
|
||||
" predictions = np.exp(xgb.predict(X_test))\n",
|
||||
" \n",
|
||||
" print(predictions)\n",
|
||||
" \n",
|
||||
"# output = pd.DataFrame({'Id': X_test.index, 'SalePrice': predictions})\n",
|
||||
"# output.to_csv('data/my_submission.csv', index=False)\n",
|
||||
"# print(\"Your submission was successfully saved!\")\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 12,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"eval_data_op = func_to_container_op(eval_data, packages_to_install= import_packages)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {
|
||||
"tags": []
|
||||
},
|
||||
"source": [
|
||||
"### Defining function that implements the pipeline"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 13,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def vanilla_pipeline(url):\n",
|
||||
" \n",
|
||||
" web_downloader_task = web_downloader_op(url=url)\n",
|
||||
"\n",
|
||||
" load_and_preprocess_data_task = load_and_preprocess_data_op(file = web_downloader_task.outputs['data'])\n",
|
||||
"\n",
|
||||
" featured_data_task = featured_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'], test = load_and_preprocess_data_task.outputs['test_output_csv'])\n",
|
||||
" \n",
|
||||
" train_eval_task = train_data_op(train = load_and_preprocess_data_task.outputs['train_output_csv'] , feat_train = featured_data_task.outputs['feat_train_output_csv'],\n",
|
||||
" feat_test = featured_data_task.outputs['feat_test_output_csv'])\n",
|
||||
" \n",
|
||||
" eval_data_task = eval_data_op(test_data = featured_data_task.outputs['feat_test_output_csv'],model = train_eval_task.output)\n",
|
||||
" "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"outputs": [
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<a href=\"/pipeline/#/experiments/details/246b31c7-909a-446b-8152-0f429a0e745c\" target=\"_blank\" >Experiment details</a>."
|
||||
],
|
||||
"text/plain": [
|
||||
"<IPython.core.display.HTML object>"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/html": [
|
||||
"<a href=\"/pipeline/#/runs/details/66011ba0-a465-4d5b-beba-f081ab3002b4\" target=\"_blank\" >Run details</a>."
|
||||
],
|
||||
"text/plain": [
|
||||
"<IPython.core.display.HTML object>"
|
||||
]
|
||||
},
|
||||
"metadata": {},
|
||||
"output_type": "display_data"
|
||||
},
|
||||
{
|
||||
"data": {
|
||||
"text/plain": [
|
||||
"RunPipelineResult(run_id=66011ba0-a465-4d5b-beba-f081ab3002b4)"
|
||||
]
|
||||
},
|
||||
"execution_count": 14,
|
||||
"metadata": {},
|
||||
"output_type": "execute_result"
|
||||
}
|
||||
],
|
||||
"source": [
|
||||
"# Using kfp.Client() to run the pipeline from notebook itself\n",
|
||||
"client = kfp.Client() # change arguments accordingly\n",
|
||||
"\n",
|
||||
"# Running the pipeline\n",
|
||||
"client.create_run_from_pipeline_func(\n",
|
||||
" vanilla_pipeline,\n",
|
||||
" arguments={\n",
|
||||
" # Github url to fetch the data. This would change when you clone the repo. Please update the url as per that.\n",
|
||||
" 'url': 'https://github.com/NeoKish/examples/raw/master/house-prices-kaggle-competition/data.zip'\n",
|
||||
" })"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"kubeflow_notebook": {
|
||||
"autosnapshot": true,
|
||||
"docker_image": "gcr.io/arrikto/jupyter-kale-py36@sha256:dd3f92ca66b46d247e4b9b6a9d84ffbb368646263c2e3909473c3b851f3fe198",
|
||||
"experiment": {
|
||||
"id": "",
|
||||
"name": ""
|
||||
},
|
||||
"experiment_name": "",
|
||||
"katib_metadata": {
|
||||
"algorithm": {
|
||||
"algorithmName": "grid"
|
||||
},
|
||||
"maxFailedTrialCount": 3,
|
||||
"maxTrialCount": 12,
|
||||
"objective": {
|
||||
"objectiveMetricName": "",
|
||||
"type": "minimize"
|
||||
},
|
||||
"parallelTrialCount": 3,
|
||||
"parameters": []
|
||||
},
|
||||
"katib_run": false,
|
||||
"pipeline_description": "",
|
||||
"pipeline_name": "",
|
||||
"snapshot_volumes": true,
|
||||
"steps_defaults": [
|
||||
"label:access-ml-pipeline:true",
|
||||
"label:access-rok:true"
|
||||
],
|
||||
"volume_access_mode": "rwm",
|
||||
"volumes": [
|
||||
{
|
||||
"annotations": [],
|
||||
"mount_point": "/home/jovyan/data",
|
||||
"name": "data-g2n6k",
|
||||
"size": 5,
|
||||
"size_type": "Gi",
|
||||
"snapshot": false,
|
||||
"type": "clone"
|
||||
},
|
||||
{
|
||||
"annotations": [],
|
||||
"mount_point": "/home/jovyan",
|
||||
"name": "house-prices-vanilla-workspace-2wscr",
|
||||
"size": 5,
|
||||
"size_type": "Gi",
|
||||
"snapshot": false,
|
||||
"type": "clone"
|
||||
}
|
||||
]
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.6.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 4
|
||||
}
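The `CrossFoldEncoder` class in the notebook above computes out-of-fold target encodings: each fold's rows are encoded by an encoder fitted on the *other* folds, so no row's encoding ever sees its own target value. A minimal sketch of the same idea, using a plain per-category mean in place of `MEstimateEncoder` (the toy frame, column names, and fold count here are illustrative, not from the competition data):

```python
import pandas as pd
from sklearn.model_selection import KFold

# Toy frame: one categorical column and a numeric target
df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "a", "b"],
    "y":   [1.0, 3.0, 2.0, 4.0, 5.0, 6.0],
})

kf = KFold(n_splits=3, shuffle=False)
encoded = pd.Series(index=df.index, dtype=float)
for fit_idx, enc_idx in kf.split(df):
    # Learn per-category mean of y on the fitting folds only
    means = df.iloc[fit_idx].groupby("cat")["y"].mean()
    # Encode the held-out rows; fall back to the global mean for unseen categories
    fallback = df.iloc[fit_idx]["y"].mean()
    encoded.iloc[enc_idx] = (
        df.iloc[enc_idx]["cat"].map(means).fillna(fallback).values
    )

df["cat_encoded"] = encoded
print(df["cat_encoded"].tolist())  # each row encoded without its own target
```

At test time the notebook averages the encoders trained on each fold; here a single refit on the full training data would play the same role.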
|
||||
@ -0,0 +1,7 @@
|
|||
numpy
|
||||
pandas
|
||||
matplotlib
|
||||
scikit-learn
|
||||
seaborn
|
||||
category_encoders
|
||||
xgboost
|
||||