{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"# 🪙 American Express - Default Prediction Competition Kale Pipeline\n",
|
|
"\n",
|
|
"\n",
|
|
"---\n",
|
|
"\n",
|
|
"In this [Kaggle competition](https://www.kaggle.com/competitions/g-research-crypto-forecasting/overview), you'll use your machine learning expertise to predict credit default. This competition is hosted by American Express. \n",
|
|
"\n",
|
|
"> American Express is a globally integrated payments company. The largest payment card issuer in the world, they provide customers with access to products, insights, and experiences that enrich lives and build business success.\n",
|
|
"\n",
|
|
"The dataset provided is an industrial scale data set of about 5.5 million rows. It has been pre-processed and converted to a lightweight version by raddar for ease of training and better result. This dataset is available in a [parquet format][1].\n",
|
|
"\n",
|
|
"[1]: https://www.kaggle.com/datasets/raddar/amex-data-integer-dtypes-parquet-format"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Install necessary packages\n",
|
|
"\n",
|
|
"We can install the necessary package by either running pip install --user <package_name> or include everything in a requirements.txt file and run pip install --user -r requirements.txt. We have put the dependencies in a requirements.txt file so we will use the former method.\n",
|
|
"\n",
|
|
"NOTE: After installing python packages, restart notebook kernel before proceeding."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"metadata": {
|
|
"tags": [
|
|
"skip"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"!pip install -r requirements.txt --user --quiet"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Imports\n",
|
|
"\n",
|
|
"In this section we import the packages we need for this example. Make it a habit to gather your imports in a single place. It will make your life easier if you are going to transform this notebook into a Kubeflow pipeline using Kale."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"metadata": {
|
|
"tags": [
|
|
"imports"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import numpy as np\n",
|
|
"import pandas as pd\n",
|
|
"import os, subprocess\n",
|
|
"import random, zipfile, joblib\n",
|
|
"import scipy.stats\n",
|
|
"import warnings\n",
|
|
"import gc, wget\n",
|
|
"\n",
|
|
"from sklearn.model_selection import StratifiedKFold\n",
|
|
"from lightgbm import LGBMClassifier\n",
|
|
"\n",
|
|
"warnings.filterwarnings(\"ignore\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Project hyper-parameters\n",
|
|
"\n",
|
|
"In this cell, we define the different hyper-parameters. Defining them in one place makes it easier to experiment with their values and also facilitates the execution of HP Tuning experiments using Kale and Katib."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"metadata": {
|
|
"tags": [
|
|
"pipeline-parameters"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Hyper-parameters\n",
|
|
"N_EST = 30\n",
|
|
"LR = 0.1"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"Set random seed for reproducibility"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"metadata": {
|
|
"tags": [
|
|
"skip"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"def fix_all_seeds(seed):\n",
|
|
" np.random.seed(seed)\n",
|
|
" random.seed(seed)\n",
|
|
" os.environ['PYTHONHASHSEED'] = str(seed)\n",
|
|
"\n",
|
|
"fix_all_seeds(2022)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Download data\n",
|
|
"\n",
|
|
"In this section, we download the data from kaggle using the Kaggle API credentials"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"block:download_data"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# setup kaggle environment for data download\n",
|
|
"dataset = \"amex-data-integer-dtypes-parquet-format\"\n",
|
|
"\n",
|
|
"# setup kaggle environment for data download\n",
|
|
"with open('/secret/kaggle-secret/password', 'r') as file:\n",
|
|
" kaggle_key = file.read().rstrip()\n",
|
|
"with open('/secret/kaggle-secret/username', 'r') as file:\n",
|
|
" kaggle_user = file.read().rstrip()\n",
|
|
"\n",
|
|
"os.environ['KAGGLE_USERNAME'], os.environ['KAGGLE_KEY'] = kaggle_user, kaggle_key\n",
|
|
"\n",
|
|
"# download kaggle's Amex-credit-prediction data\n",
|
|
"subprocess.run([\"kaggle\",\"datasets\", \"download\", \"-d\", f'raddar/{dataset}'])"
|
|
]
|
|
},
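{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As an aside, the same download can be done with the kaggle Python package instead of shelling out to the CLI. The next cell is an illustrative sketch only (it assumes the kaggle package is installed and that KAGGLE_USERNAME and KAGGLE_KEY are already exported, as done above); the pipeline step itself keeps the subprocess call."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# illustrative alternative to the CLI call above (not used by the pipeline step)\n",
"# assumes KAGGLE_USERNAME and KAGGLE_KEY are already set, as in the previous cell\n",
"import kaggle\n",
"\n",
"# download and unzip the dataset directly into the data directory\n",
"kaggle.api.dataset_download_files(f'raddar/{dataset}', path='data', unzip=True)"
]
},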
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"block:"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# path to download to\n",
|
|
"data_path = 'data'\n",
|
|
"\n",
|
|
"# extract Amex-credit-prediction.zip to data_path\n",
|
|
"with zipfile.ZipFile(f\"{dataset}.zip\",\"r\") as zip_ref:\n",
|
|
" zip_ref.extractall(data_path)\n",
|
|
" \n",
|
|
"# download kaggle's Amex-credit-prediction train_labels.zip\n",
|
|
"download_link = \"https://github.com/kubeflow/examples/blob/master/american-express-default-kaggle-competition/data/train_labels.zip?raw=true\"\n",
|
|
"wget.download(download_link, f'{data_path}/train_labels.zip')\n",
|
|
"\n",
|
|
"# extract Amex-credit-prediction.zip to data_path\n",
|
|
"with zipfile.ZipFile(f'{data_path}/train_labels.zip','r') as zip_ref:\n",
|
|
" zip_ref.extractall(data_path)\n",
|
|
" \n",
|
|
"# delete zipfiles\n",
|
|
"subprocess.run(['rm', f'{dataset}.zip'])\n",
|
|
"subprocess.run(['rm', f'{data_path}/train_labels.zip'])"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Load the dataset\n",
|
|
"\n",
|
|
"First, let us load and analyze the data.\n",
|
|
"\n",
|
|
"The data is in csv format, thus, we use the handy read_csv pandas method."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"metadata": {
|
|
"tags": [
|
|
"block:load_data",
|
|
"prev:download_data"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"TRAIN_CSV = (f'{data_path}/train.parquet')\n",
|
|
"TEST_CSV = f'{data_path}/test.parquet'\n",
|
|
"TARGET_CSV = f'{data_path}/train_labels.csv'"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"target shape: (458913,)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"df_train = pd.read_parquet(TRAIN_CSV)\n",
|
|
"df_test = pd.read_parquet(TEST_CSV)\n",
|
|
"target = pd.read_csv(TARGET_CSV).target.values\n",
|
|
"print(f\"target shape: {target.shape}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"(5531451, 190)"
|
|
]
|
|
},
|
|
"execution_count": 13,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df_train.shape"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 14,
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"customer_ID 0\n",
|
|
"S_2 0\n",
|
|
"P_2 45985\n",
|
|
"D_39 0\n",
|
|
"B_1 0\n",
|
|
" ... \n",
|
|
"D_141 101548\n",
|
|
"D_142 4587043\n",
|
|
"D_143 0\n",
|
|
"D_144 40727\n",
|
|
"D_145 0\n",
|
|
"Length: 190, dtype: int64"
|
|
]
|
|
},
|
|
"execution_count": 14,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"df_train.isna().sum()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"### Define Helper Functions"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"metadata": {
|
|
"tags": [
|
|
"functions"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# @yunchonggan's fast metric implementation\n",
|
|
"# From https://www.kaggle.com/competitions/amex-default-prediction/discussion/328020\n",
|
|
"def amex_metric(y_true: np.array, y_pred: np.array) -> float:\n",
|
|
"\n",
|
|
" # count of positives and negatives\n",
|
|
" n_pos = y_true.sum()\n",
|
|
" n_neg = y_true.shape[0] - n_pos\n",
|
|
"\n",
|
|
" # sorting by descring prediction values\n",
|
|
" indices = np.argsort(y_pred)[::-1]\n",
|
|
" preds, target = y_pred[indices], y_true[indices]\n",
|
|
"\n",
|
|
" # filter the top 4% by cumulative row weights\n",
|
|
" weight = 20.0 - target * 19.0\n",
|
|
" cum_norm_weight = (weight / weight.sum()).cumsum()\n",
|
|
" four_pct_filter = cum_norm_weight <= 0.04\n",
|
|
"\n",
|
|
" # default rate captured at 4%\n",
|
|
" d = target[four_pct_filter].sum() / n_pos\n",
|
|
"\n",
|
|
" # weighted gini coefficient\n",
|
|
" lorentz = (target / n_pos).cumsum()\n",
|
|
" gini = ((lorentz - cum_norm_weight) * weight).sum()\n",
|
|
"\n",
|
|
" # max weighted gini coefficient\n",
|
|
" gini_max = 10 * n_neg * (1 - 19 / (n_pos + 20 * n_neg))\n",
|
|
"\n",
|
|
" # normalized weighted gini coefficient\n",
|
|
" g = gini / gini_max\n",
|
|
"\n",
|
|
" return 0.5 * (g + d)\n",
|
|
"\n",
|
|
"def lgb_amex_metric(y_true, y_pred):\n",
|
|
" \"\"\"The competition metric with lightgbm's calling convention\"\"\"\n",
|
|
" return ('amex_metric_score',\n",
|
|
" amex_metric(y_true, y_pred),\n",
|
|
" True)"
|
|
]
|
|
},
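{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As a quick sanity check of the metric (an illustrative sketch; the tiny arrays below are made-up values, not competition data), a prediction that ranks all positives above all negatives should score close to 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# made-up toy arrays, used only to sanity-check amex_metric's behaviour\n",
"y_true_demo = np.array([0, 0, 1, 1, 0, 1, 0, 0])\n",
"y_pred_demo = np.array([0.10, 0.20, 0.90, 0.80, 0.30, 0.70, 0.40, 0.05])\n",
"\n",
"# positives are ranked above every negative, so the score should be close to 1\n",
"print(amex_metric(y_true_demo, y_pred_demo))"
]
},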
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Feature Engineering"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 17,
|
|
"metadata": {
|
|
"tags": [
|
|
"block:feature_engineering",
|
|
"prev:load_data"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# feature engineering gotten from https://www.kaggle.com/code/ambrosm/amex-lightgbm-quickstart\n",
|
|
"def get_features(df, \n",
|
|
" features_avg, \n",
|
|
" features_min, \n",
|
|
" features_max, \n",
|
|
" features_last\n",
|
|
" ):\n",
|
|
" '''\n",
|
|
" This function takes a dataframe with all features and returns the aggregated feature grouped by the customer id.\n",
|
|
" \n",
|
|
" df - dataframe\n",
|
|
" '''\n",
|
|
" cid = pd.Categorical(df.pop('customer_ID'), ordered=True) # get customer id\n",
|
|
" last = (cid != np.roll(cid, -1)) # mask for last statement of every customer\n",
|
|
" \n",
|
|
" df_avg = (df\n",
|
|
" .groupby(cid)\n",
|
|
" .mean()[features_avg]\n",
|
|
" .rename(columns={f: f\"{f}_avg\" for f in features_avg})\n",
|
|
" ) \n",
|
|
" \n",
|
|
" df_min = (df\n",
|
|
" .groupby(cid)\n",
|
|
" .min()[features_min]\n",
|
|
" .rename(columns={f: f\"{f}_min\" for f in features_min})\n",
|
|
" )\n",
|
|
" gc.collect()\n",
|
|
" print('Computed min')\n",
|
|
" \n",
|
|
" df_max = (df\n",
|
|
" .groupby(cid)\n",
|
|
" .max()[features_max]\n",
|
|
" .rename(columns={f: f\"{f}_max\" for f in features_max})\n",
|
|
" )\n",
|
|
" gc.collect()\n",
|
|
" print('Computed max')\n",
|
|
" \n",
|
|
" df = (df.loc[last, features_last]\n",
|
|
" .rename(columns={f: f\"{f}_last\" for f in features_last})\n",
|
|
" .set_index(np.asarray(cid[last]))\n",
|
|
" )\n",
|
|
" gc.collect()\n",
|
|
" print('Computed last')\n",
|
|
" \n",
|
|
" df_ = pd.concat([df, df_min, df_max, df_avg], axis=1, )\n",
|
|
" \n",
|
|
" del df, df_avg, df_min, df_max, cid, last\n",
|
|
" \n",
|
|
" return df_"
|
|
]
|
|
},
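{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"The last mask in get_features relies on each customer's statements being stored contiguously; the small sketch below uses a made-up customer_ID column purely to illustrate how the mask flags the final row of each customer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# made-up example, used only to illustrate the last-statement mask from get_features\n",
"demo = pd.DataFrame({'customer_ID': ['A', 'A', 'A', 'B', 'B', 'C'],\n",
"                     'B_1': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})\n",
"cid_demo = pd.Categorical(demo.pop('customer_ID'), ordered=True)\n",
"\n",
"# True wherever the next row belongs to a different customer, i.e. each customer's last statement\n",
"last_demo = (cid_demo != np.roll(cid_demo, -1))\n",
"print(last_demo)  # the last statement of customers A, B and C is flagged True"
]
},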
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"features_avg = ['B_1', 'B_2', 'B_3', 'B_4', 'B_5', 'B_6', 'B_8', 'B_9', 'B_10', 'B_11', 'B_12', 'B_13', 'B_14', 'B_15', \n",
|
|
" 'B_16', 'B_17', 'B_18', 'B_19', 'B_20', 'B_21', 'B_22', 'B_23', 'B_24', 'B_25', 'B_28', 'B_29', 'B_30', \n",
|
|
" 'B_32', 'B_33', 'B_37', 'B_38', 'B_39', 'B_40', 'B_41', 'B_42', 'D_39', 'D_41', 'D_42', 'D_43', 'D_44', \n",
|
|
" 'D_45', 'D_46', 'D_47', 'D_48', 'D_50', 'D_51', 'D_53', 'D_54', 'D_55', 'D_58', 'D_59', 'D_60', 'D_61', \n",
|
|
" 'D_62', 'D_65', 'D_66', 'D_69', 'D_70', 'D_71', 'D_72', 'D_73', 'D_74', 'D_75', 'D_76', 'D_77', 'D_78', \n",
|
|
" 'D_80', 'D_82', 'D_84', 'D_86', 'D_91', 'D_92', 'D_94', 'D_96', 'D_103', 'D_104', 'D_108', 'D_112', 'D_113', \n",
|
|
" 'D_114', 'D_115', 'D_117', 'D_118', 'D_119', 'D_120', 'D_121', 'D_122', 'D_123', 'D_124', 'D_125', 'D_126', \n",
|
|
" 'D_128', 'D_129', 'D_131', 'D_132', 'D_133', 'D_134', 'D_135', 'D_136', 'D_140', 'D_141', 'D_142', 'D_144', \n",
|
|
" 'D_145', 'P_2', 'P_3', 'P_4', 'R_1', 'R_2', 'R_3', 'R_7', 'R_8', 'R_9', 'R_10', 'R_11', 'R_14', 'R_15', 'R_16', \n",
|
|
" 'R_17', 'R_20', 'R_21', 'R_22', 'R_24', 'R_26', 'R_27', 'S_3', 'S_5', 'S_6', 'S_7', 'S_9', 'S_11', 'S_12', 'S_13', \n",
|
|
" 'S_15', 'S_16', 'S_18', 'S_22', 'S_23', 'S_25', 'S_26']\n",
|
|
"features_min = ['B_2', 'B_4', 'B_5', 'B_9', 'B_13', 'B_14', 'B_15', 'B_16', 'B_17', 'B_19', 'B_20', 'B_28', 'B_29', 'B_33', 'B_36', \n",
|
|
" 'B_42', 'D_39', 'D_41', 'D_42', 'D_45', 'D_46', 'D_48', 'D_50', 'D_51', 'D_53', 'D_55', 'D_56', 'D_58', 'D_59', \n",
|
|
" 'D_60', 'D_62', 'D_70', 'D_71', 'D_74', 'D_75', 'D_78', 'D_83', 'D_102', 'D_112', 'D_113', 'D_115', 'D_118', 'D_119', \n",
|
|
" 'D_121', 'D_122', 'D_128', 'D_132', 'D_140', 'D_141', 'D_144', 'D_145', 'P_2', 'P_3', 'R_1', 'R_27', 'S_3', 'S_5', \n",
|
|
" 'S_7', 'S_9', 'S_11', 'S_12', 'S_23', 'S_25']\n",
|
|
"features_max = ['B_1', 'B_2', 'B_3', 'B_4', 'B_5', 'B_6', 'B_7', 'B_8', 'B_9', 'B_10', 'B_12', 'B_13', 'B_14', 'B_15', 'B_16', 'B_17', \n",
|
|
" 'B_18', 'B_19', 'B_21', 'B_23', 'B_24', 'B_25', 'B_29', 'B_30', 'B_33', 'B_37', 'B_38', 'B_39', 'B_40', 'B_42', 'D_39', \n",
|
|
" 'D_41', 'D_42', 'D_43', 'D_44', 'D_45', 'D_46', 'D_47', 'D_48', 'D_49', 'D_50', 'D_52', 'D_55', 'D_56', 'D_58', 'D_59', \n",
|
|
" 'D_60', 'D_61', 'D_63', 'D_64', 'D_65', 'D_70', 'D_71', 'D_72', 'D_73', 'D_74', 'D_76', 'D_77', 'D_78', 'D_80', 'D_82', \n",
|
|
" 'D_84', 'D_91', 'D_102', 'D_105', 'D_107', 'D_110', 'D_111', 'D_112', 'D_115', 'D_116', 'D_117', 'D_118', 'D_119', \n",
|
|
" 'D_121', 'D_122', 'D_123', 'D_124', 'D_125', 'D_126', 'D_128', 'D_131', 'D_132', 'D_133', 'D_134', 'D_135', 'D_136', \n",
|
|
" 'D_138', 'D_140', 'D_141', 'D_142', 'D_144', 'D_145', 'P_2', 'P_3', 'P_4', 'R_1', 'R_3', 'R_5', 'R_6', 'R_7', 'R_8', \n",
|
|
" 'R_10', 'R_11', 'R_14', 'R_17', 'R_20', 'R_26', 'R_27', 'S_3', 'S_5', 'S_7', 'S_8', 'S_11', 'S_12', 'S_13', 'S_15', 'S_16', \n",
|
|
" 'S_22', 'S_23', 'S_24', 'S_25', 'S_26', 'S_27']\n",
|
|
"features_last = ['B_1', 'B_2', 'B_3', 'B_4', 'B_5', 'B_6', 'B_7', 'B_8', 'B_9', 'B_10', 'B_11', 'B_12', 'B_13', 'B_14', 'B_15', 'B_16', \n",
|
|
" 'B_17', 'B_18', 'B_19', 'B_20', 'B_21', 'B_22', 'B_23', 'B_24', 'B_25', 'B_26', 'B_28', 'B_29', 'B_30', 'B_32', 'B_33', \n",
|
|
" 'B_36', 'B_37', 'B_38', 'B_39', 'B_40', 'B_41', 'B_42', 'D_39', 'D_41', 'D_42', 'D_43', 'D_44', 'D_45', 'D_46', 'D_47', \n",
|
|
" 'D_48', 'D_49', 'D_50', 'D_51', 'D_52', 'D_53', 'D_54', 'D_55', 'D_56', 'D_58', 'D_59', 'D_60', 'D_61', 'D_62', 'D_63', \n",
|
|
" 'D_64', 'D_65', 'D_69', 'D_70', 'D_71', 'D_72', 'D_73', 'D_75', 'D_76', 'D_77', 'D_78', 'D_79', 'D_80', 'D_81', 'D_82', \n",
|
|
" 'D_83', 'D_86', 'D_91', 'D_96', 'D_105', 'D_106', 'D_112', 'D_114', 'D_119', 'D_120', 'D_121', 'D_122', 'D_124', 'D_125', \n",
|
|
" 'D_126', 'D_127', 'D_130', 'D_131', 'D_132', 'D_133', 'D_134', 'D_138', 'D_140', 'D_141', 'D_142', 'D_145', 'P_2', 'P_3', \n",
|
|
" 'P_4', 'R_1', 'R_2', 'R_3', 'R_4', 'R_5', 'R_6', 'R_7', 'R_8', 'R_9', 'R_10', 'R_11', 'R_12', 'R_13', 'R_14', 'R_15', \n",
|
|
" 'R_19', 'R_20', 'R_26', 'R_27', 'S_3', 'S_5', 'S_6', 'S_7', 'S_8', 'S_9', 'S_11', 'S_12', 'S_13', 'S_16', 'S_19', 'S_20', \n",
|
|
" 'S_22', 'S_23', 'S_24', 'S_25', 'S_26', 'S_27']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# apply feature engineering function\n",
|
|
"train = get_features(df_train, features_avg, features_min, features_max, features_last)\n",
|
|
"test = get_features(df_test, features_avg, features_min, features_max, features_last)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# check null values\n",
|
|
"train.isna().any()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Modelling: StratifiedKFold\n",
|
|
"\n",
|
|
"We cross-validate with a six-fold StratifiedKFold to handle the imbalanced nature of the target.\n",
|
|
"\n",
|
|
"Lightgbm handles null values efficiently."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"block:modelling",
|
|
"prev:feature_engineering"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Cross-validation\n",
|
|
"\n",
|
|
"features = [f for f in train.columns if f != 'customer_ID' and f != 'target']\n",
|
|
"\n",
|
|
"print(f\"{len(features)} features\")\n",
|
|
"\n",
|
|
"score_list = [] # lgbm score per fold\n",
|
|
"y_pred_list = [] # fold predictions list\n",
|
|
"\n",
|
|
"# init StratifiedKFold\n",
|
|
"kf = StratifiedKFold(n_splits=4)\n",
|
|
"\n",
|
|
"for fold, (idx_tr, idx_va) in enumerate(kf.split(train, target)):\n",
|
|
" \n",
|
|
" X_tr, X_va, y_tr, y_va, model = None, None, None, None, None\n",
|
|
"\n",
|
|
" X_tr = train.iloc[idx_tr][features]\n",
|
|
" X_va = train.iloc[idx_va][features]\n",
|
|
" y_tr = target[idx_tr]\n",
|
|
" y_va = target[idx_va]\n",
|
|
" \n",
|
|
" # init model\n",
|
|
" model = LGBMClassifier(n_estimators=int(N_EST),\n",
|
|
" learning_rate=float(LR), \n",
|
|
" random_state=2022)\n",
|
|
" # fit model\n",
|
|
" model.fit(X_tr, y_tr,\n",
|
|
" eval_set = [(X_va, y_va)], \n",
|
|
" eval_metric=[lgb_amex_metric],\n",
|
|
" verbose = 20,\n",
|
|
" early_stopping_rounds=30)\n",
|
|
" \n",
|
|
" X_tr, y_tr = None, None\n",
|
|
" \n",
|
|
" # fold validation set predictions\n",
|
|
" y_va_pred = model.predict_proba(X_va, raw_score=True)\n",
|
|
" \n",
|
|
" # model score\n",
|
|
" score = amex_metric(y_va, y_va_pred)\n",
|
|
"\n",
|
|
" print(f\"Score = {score}\")\n",
|
|
" score_list.append(score)\n",
|
|
" \n",
|
|
" # test set predictions\n",
|
|
" y_pred_list.append(model.predict_proba(test[features], raw_score=True))\n",
|
|
" \n",
|
|
" print(f\"Fold {fold}\") \n",
|
|
"\n",
|
|
"# save model\n",
|
|
"joblib.dump(model, 'lgb.jl')\n",
|
|
"print(f\"OOF Score: {np.mean(score_list):.5f}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# feature importance for top 30 features\n",
|
|
"fea_imp = pd.DataFrame({'imp':model.feature_importances_, 'col': features})\n",
|
|
"fea_imp = fea_imp.sort_values(['imp', 'col'], ascending=[True, False]).iloc[-30:]\n",
|
|
"_ = fea_imp.plot(kind='barh', x='col', y='imp', figsize=(20, 10))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Evaluation"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"block:evaluation_result",
|
|
"prev:modelling"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"model = joblib.load('lgb.jl')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"binary_logloss = model.booster_.best_score.get('valid_0').get('binary_logloss')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"amex_metric_score = model.booster_.best_score.get('valid_0').get('amex_metric_score')"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"pipeline-metrics"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(binary_logloss)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"pipeline-metrics"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"print(amex_metric_score)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"## Submission"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"skip"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"sub = pd.DataFrame({'customer_ID': test.index,\n",
|
|
" 'prediction': np.mean(y_pred_list, axis=0)})\n",
|
|
"sub.to_csv('submission.csv', index=False)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"metadata": {
|
|
"tags": [
|
|
"skip"
|
|
]
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"sub"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"kubeflow_notebook": {
|
|
"autosnapshot": true,
|
|
"experiment": {
|
|
"id": "new",
|
|
"name": "american-express-defaul-prediction"
|
|
},
|
|
"experiment_name": "american-express-defaul-prediction",
|
|
"katib_metadata": {
|
|
"algorithm": {
|
|
"algorithmName": "grid"
|
|
},
|
|
"maxFailedTrialCount": 3,
|
|
"maxTrialCount": 12,
|
|
"objective": {
|
|
"objectiveMetricName": "",
|
|
"type": "minimize"
|
|
},
|
|
"parallelTrialCount": 3,
|
|
"parameters": []
|
|
},
|
|
"katib_run": false,
|
|
"pipeline_description": "predicting credit default",
|
|
"pipeline_name": "american-express-defaul-prediction-pipeline",
|
|
"snapshot_volumes": true,
|
|
"steps_defaults": [
|
|
"label:access-ml-pipeline:true",
|
|
"label:kaggle-secret:true",
|
|
"label:access-rok:true"
|
|
],
|
|
"volume_access_mode": "rwm",
|
|
"volumes": [
|
|
{
|
|
"annotations": [],
|
|
"mount_point": "/home/jovyan",
|
|
"name": "test-workspace-qtvmt",
|
|
"size": 32,
|
|
"size_type": "Gi",
|
|
"snapshot": false,
|
|
"type": "clone"
|
|
}
|
|
]
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.6.9"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 4
|
|
}
|