{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Digit Recognizer Kale Pipeline\n",
"\n",
"In this [Kaggle competition](https://www.kaggle.com/competitions/digit-recognizer/overview) \n",
"\n",
">MNIST (\"Modified National Institute of Standards and Technology\") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.\n",
"\n",
">In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Install necessary packages\n",
"\n",
"We can install the necessary package by either running `pip install --user <package_name>` or include everything in a `requirements.txt` file and run `pip install --user -r requirements.txt`. We have put the dependencies in a `requirements.txt` file so we will use the former method.\n",
"\n",
"> NOTE: Do not forget to use the `--user` argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline.",
"\n",
"After installing python packages, restart notebook kernel before proceeding.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"!pip install --user -r requirements.txt --quiet"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Imports\n",
"\n",
"In this section we import the packages we need for this example. Make it a habit to gather your imports in a single place. It will make your life easier if you are going to transform this notebook into a Kubeflow pipeline using Kale."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"imports"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"tensorflow version: 2.3.0\n"
]
}
],
"source": [
"import os\n",
"import datetime\n",
"import numpy as np\n",
"import pandas as pd\n",
"import pickle\n",
"import zipfile\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.metrics import confusion_matrix\n",
"\n",
"import tensorflow as tf\n",
"from tensorflow import keras, optimizers\n",
"from tensorflow.keras.metrics import SparseCategoricalAccuracy\n",
"from tensorflow.keras.losses import SparseCategoricalCrossentropy\n",
"from tensorflow.keras import layers\n",
"print(\"tensorflow version: \", tf.__version__)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Project hyper-parameters\n",
"\n",
"In this cell, we define the different hyper-parameters. Defining them in one place makes it easier to experiment with their values and also facilitates the execution of HP Tuning experiments using Kale and Katib."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"tags": [
"pipeline-parameters"
]
},
"outputs": [],
"source": [
"# Hyper-parameters\n",
"LR = 1e-3\n",
"EPOCHS = 2\n",
"BATCH_SIZE = 64\n",
"CONV_DIM1 = 56\n",
"CONV_DIM2 = 100"
]
},
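{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"When launching a Katib hyper-parameter tuning experiment through Kale, each of the pipeline parameters above can be given a search range instead of a fixed value. The cell below is a purely illustrative sketch of such a search space; the actual ranges are configured in the Kale deployment panel, and the bounds shown here are assumptions, not part of the original example. The cell is tagged `skip` so Kale ignores it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Hypothetical Katib search ranges for the pipeline parameters above\n",
"# (illustrative only; real ranges are entered in the Kale deployment panel).\n",
"KATIB_SEARCH_SPACE = {\n",
"    'LR': {'type': 'double', 'min': 1e-4, 'max': 1e-2},\n",
"    'BATCH_SIZE': {'type': 'int', 'min': 32, 'max': 128},\n",
"    'CONV_DIM1': {'type': 'int', 'min': 32, 'max': 64},\n",
"    'CONV_DIM2': {'type': 'int', 'min': 64, 'max': 128},\n",
"}\n",
"KATIB_SEARCH_SPACE"
]
},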
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Set random seed for reproducibility and ignore warning messages."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"tf.random.set_seed(42)\n",
"np.random.seed(42)\n",
"\n",
"tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)\n",
"\n",
"# Setting the graph style\n",
"plt.rc('figure', autolayout=True)\n",
"plt.rc('axes', titleweight='bold', \n",
" titlesize=15)\n",
"\n",
"plt.rc('font', size=12)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Download data"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"tags": [
"block:download_data"
]
},
"outputs": [],
"source": [
"import zipfile\n",
"import wget\n",
"\n",
"# path to download to\n",
"data_path = 'data'\n",
"\n",
"# data link\n",
"train_link = 'https://github.com/kubeflow/examples/blob/master/digit-recognition-kaggle-competition/data/train.csv.zip?raw=true'\n",
"test_link = 'https://github.com/kubeflow/examples/blob/master/digit-recognition-kaggle-competition/data/test.csv.zip?raw=true'\n",
"sample_submission = 'https://raw.githubusercontent.com/kubeflow/examples/master/digit-recognition-kaggle-competition/data/sample_submission.csv'\n", "\n",
"# download data\n",
"wget.download(train_link, f'{data_path}/train_csv.zip')\n",
"wget.download(test_link, f'{data_path}/test_csv.zip')\n",
"wget.download(sample_submission, f'{data_path}/sample_submission.csv')\n",
"\n",
"with zipfile.ZipFile(f\"{data_path}/train_csv.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(data_path)\n",
"\n",
"with zipfile.ZipFile(f\"{data_path}/test_csv.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(data_path)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Load and preprocess data\n",
"\n",
"In this section, we load and process the dataset to get it in a ready-to-use form by the model. First, let us load and analyze the data."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Load data"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"The data are in `csv` format, thus, we use the handy `read_csv` pandas method. There is one train data set and two test sets (one public and one private)."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": [
"block:load_data",
"prev:download_data"
]
},
"outputs": [],
"source": [
"data_path = 'data'\n",
"\n",
"# Data Path\n",
"train_data_path = data_path + '/train.csv'\n",
"test_data_path = data_path + '/test.csv'\n",
"sample_submission_path = data_path + '/sample_submission.csv'\n",
"\n",
"\n",
"# Loading dataset into pandas \n",
"train_df = pd.read_csv(train_data_path)\n",
"test_df = pd.read_csv(test_data_path)\n",
"ss = pd.read_csv(sample_submission_path)"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Let us now explore the data\n",
"To this end, we use the pandas `head` method to visualize the 1st five rows of our data set."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>label</th>\n",
" <th>pixel0</th>\n",
" <th>pixel1</th>\n",
" <th>pixel2</th>\n",
" <th>pixel3</th>\n",
" <th>pixel4</th>\n",
" <th>pixel5</th>\n",
" <th>pixel6</th>\n",
" <th>pixel7</th>\n",
" <th>pixel8</th>\n",
" <th>...</th>\n",
" <th>pixel774</th>\n",
" <th>pixel775</th>\n",
" <th>pixel776</th>\n",
" <th>pixel777</th>\n",
" <th>pixel778</th>\n",
" <th>pixel779</th>\n",
" <th>pixel780</th>\n",
" <th>pixel781</th>\n",
" <th>pixel782</th>\n",
" <th>pixel783</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 785 columns</p>\n",
"</div>"
],
"text/plain": [
" label pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7 \\\n",
"0 1 0 0 0 0 0 0 0 0 \n",
"1 0 0 0 0 0 0 0 0 0 \n",
"2 1 0 0 0 0 0 0 0 0 \n",
"3 4 0 0 0 0 0 0 0 0 \n",
"4 0 0 0 0 0 0 0 0 0 \n",
"\n",
" pixel8 ... pixel774 pixel775 pixel776 pixel777 pixel778 pixel779 \\\n",
"0 0 ... 0 0 0 0 0 0 \n",
"1 0 ... 0 0 0 0 0 0 \n",
"2 0 ... 0 0 0 0 0 0 \n",
"3 0 ... 0 0 0 0 0 0 \n",
"4 0 ... 0 0 0 0 0 0 \n",
"\n",
" pixel780 pixel781 pixel782 pixel783 \n",
"0 0 0 0 0 \n",
"1 0 0 0 0 \n",
"2 0 0 0 0 \n",
"3 0 0 0 0 \n",
"4 0 0 0 0 \n",
"\n",
"[5 rows x 785 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.head()"
]
},
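{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"As an optional sanity check (not part of the Kale pipeline), we can also look at how the ten digit classes are distributed in the training set. The sketch below assumes the `label` column loaded above and is tagged `skip` so Kale ignores it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Optional: inspect the class balance of the training labels\n",
"print(train_df['label'].value_counts().sort_index())\n",
"\n",
"plt.figure(figsize=(6, 4))\n",
"sns.countplot(x=train_df['label'])\n",
"plt.title('Training label distribution')\n",
"plt.show()"
]
},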
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Data dimension\n",
"lets check train and test dimensions"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/plain": [
"((42000, 785), (28000, 784))"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df.shape, test_df.shape"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"all_data size is : (70000, 785)\n"
]
}
],
"source": [
"# join train and test together\n",
"ntrain = train_df.shape[0]\n",
"ntest = test_df.shape[0]\n",
"\n",
"all_data = pd.concat((train_df, test_df)).reset_index(drop=True)\n",
"print(\"all_data size is : {}\".format(all_data.shape))"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Preprocess data"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"We are now ready to transform the data set and split the dataset into features and the target variables."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"tags": [
"block:preprocess_data",
"prev:load_data"
]
},
"outputs": [],
"source": [
"all_data_X = all_data.drop('label', axis=1)\n",
"all_data_y = all_data.label"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Reshape image in 3 dimensions (height = 28px, width = 28px , channel = 1)\n",
"all_data_X = all_data_X.values.reshape(-1,28,28,1)\n",
"\n",
"# Normalize the data\n",
"all_data_X = all_data_X / 255.0"
]
},
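{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"To verify that the reshaping and normalization worked as expected, we can plot one of the images. This is an optional sanity check, a minimal sketch assuming the `all_data_X` array and labels defined above; it is tagged `skip` so Kale ignores it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Optional: display the first training image and its label\n",
"plt.figure(figsize=(3, 3))\n",
"plt.imshow(all_data_X[0].reshape(28, 28), cmap='gray')\n",
"plt.title(f\"label: {all_data_y[0]}\")\n",
"plt.axis('off')\n",
"plt.show()"
]
},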
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"#Get the new dataset\n",
"X = all_data_X[:ntrain].copy()\n",
"y = all_data_y[:ntrain].copy()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# split the dataset\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)"
]
},
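{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"A quick look at the sizes of the resulting splits; this optional cell is only a sanity check, tagged `skip` so Kale ignores it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Optional: shapes of the train/validation splits created above\n",
"X_train.shape, X_test.shape, y_train.shape, y_test.shape"
]
},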
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Define and train the model\n",
"\n",
"we define models with convoolution and dropout layers in our model architecture"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"tags": [
"block:modeling",
"prev:preprocess_data"
]
},
"outputs": [],
"source": [
"def build_model(hidden_dim1=int(CONV_DIM1), hidden_dim2=int(CONV_DIM2), DROPOUT=0.5):\n",
" model = tf.keras.Sequential([\n",
" tf.keras.layers.Conv2D(filters = hidden_dim1, kernel_size = (5,5),padding = 'Same', \n",
" activation ='relu'),\n",
" tf.keras.layers.Dropout(DROPOUT),\n",
" tf.keras.layers.Conv2D(filters = hidden_dim2, kernel_size = (3,3),padding = 'Same', \n",
" activation ='relu'),\n",
" tf.keras.layers.Dropout(DROPOUT),\n",
" tf.keras.layers.Conv2D(filters = hidden_dim2, kernel_size = (3,3),padding = 'Same', \n",
" activation ='relu'),\n",
" tf.keras.layers.Dropout(DROPOUT),\n",
" tf.keras.layers.Flatten(),\n",
" tf.keras.layers.Dense(10, activation = \"softmax\")\n",
" ])\n",
"\n",
" model.build(input_shape=(None,28,28,1))\n",
" \n",
" return model"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"model = build_model()"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Model: \"sequential\"\n",
"_________________________________________________________________\n",
"Layer (type) Output Shape Param # \n",
"=================================================================\n",
"conv2d (Conv2D) (None, 28, 28, 56) 1456 \n",
"_________________________________________________________________\n",
"dropout (Dropout) (None, 28, 28, 56) 0 \n",
"_________________________________________________________________\n",
"conv2d_1 (Conv2D) (None, 28, 28, 100) 50500 \n",
"_________________________________________________________________\n",
"dropout_1 (Dropout) (None, 28, 28, 100) 0 \n",
"_________________________________________________________________\n",
"conv2d_2 (Conv2D) (None, 28, 28, 100) 90100 \n",
"_________________________________________________________________\n",
"dropout_2 (Dropout) (None, 28, 28, 100) 0 \n",
"_________________________________________________________________\n",
"flatten (Flatten) (None, 78400) 0 \n",
"_________________________________________________________________\n",
"dense (Dense) (None, 10) 784010 \n",
"=================================================================\n",
"Total params: 926,066\n",
"Trainable params: 926,066\n",
"Non-trainable params: 0\n",
"_________________________________________________________________\n"
]
}
],
"source": [
"# display the model summary\n",
"model.summary()"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"We are now ready to compile and fit the model."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"model.compile(optimizers.Adam(learning_rate=float(LR)), \n",
" loss=SparseCategoricalCrossentropy(), \n",
" metrics=SparseCategoricalAccuracy(name='accuracy'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/2\n",
" 37/532 [=>............................] - ETA: 1:41 - loss: 0.9405 - accuracy: 0.7078"
]
}
],
"source": [
"history = model.fit(np.array(X_train), np.array(y_train), \n",
" validation_split=.1, batch_size=int(BATCH_SIZE), epochs=int(EPOCHS))"
]
},
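{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Keras returns the per-epoch metrics in the `history` object, and plotting them is a quick way to spot over- or under-fitting. The cell below is an optional sketch (tagged `skip` so Kale ignores it) that assumes the `history` variable returned by the `fit` call above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Optional: plot training vs. validation curves from the fit history\n",
"hist_df = pd.DataFrame(history.history)\n",
"\n",
"fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n",
"hist_df[['loss', 'val_loss']].plot(ax=axes[0], title='Loss')\n",
"hist_df[['accuracy', 'val_accuracy']].plot(ax=axes[1], title='Accuracy')\n",
"plt.show()"
]
},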
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"## Evaluate the model\n",
"\n",
"Evaluate the model and print the results"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:"
]
},
"outputs": [],
"source": [
"# Evaluate the model and print the results\n",
"test_loss, test_acc = model.evaluate(np.array(X_test), np.array(y_test), verbose=0)\n",
"print(\"Test_loss: {}, Test_accuracy: {} \".format(test_loss,test_acc))"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Confusion matrix"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:prediction",
"prev:modeling"
]
},
"outputs": [],
"source": [
"y_pred = np.argmax(model.predict(X_test), axis=-1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"block:"
]
},
"outputs": [],
"source": [
"cm = confusion_matrix(y_test, y_pred)\n",
"\n",
"plt.figure(figsize=(7,7))\n",
"sns.heatmap(cm, fmt='g', cbar=False, annot=True, cmap='Blues')\n",
"plt.title('confusion_matrix')\n",
"plt.ylabel('True label')\n",
"plt.xlabel('Predicted label')\n",
"plt.show()"
]
},
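{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Beyond the aggregate confusion matrix, it can be instructive to inspect a few individual mistakes. The cell below is an optional sketch (tagged `skip` so Kale ignores it) that assumes the `X_test`, `y_test`, and `y_pred` arrays defined above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"# Optional: show up to 8 misclassified test digits\n",
"wrong_idx = np.where(y_pred != np.array(y_test))[0][:8]\n",
"\n",
"plt.figure(figsize=(12, 3))\n",
"for i, idx in enumerate(wrong_idx):\n",
"    plt.subplot(1, 8, i + 1)\n",
"    plt.imshow(X_test[idx].reshape(28, 28), cmap='gray')\n",
"    plt.title(f\"true {np.array(y_test)[idx]}, pred {y_pred[idx]}\")\n",
"    plt.axis('off')\n",
"plt.show()"
]
},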
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Submission\n",
"\n",
"Last but note least, we create our submission to the Kaggle competition. The submission is just a `csv` file with the specified columns."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"test = all_data_X[ntrain:].copy()\n",
"submission_file = np.argmax(model.predict(test), axis=-1)\n",
"ss['Label'] = submission_file"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"skip"
]
},
"outputs": [],
"source": [
"ss.to_csv('sub.csv', index=False)\n",
"ss.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kubeflow_notebook": {
"autosnapshot": true,
"experiment": {
"id": "new",
"name": "digit-recognizer-kale"
},
"experiment_name": "digit-recognizer-kale",
"katib_metadata": {
"algorithm": {
"algorithmName": "grid"
},
"maxFailedTrialCount": 3,
"maxTrialCount": 12,
"objective": {
"objectiveMetricName": "",
"type": "minimize"
},
"parallelTrialCount": 3,
"parameters": []
},
"katib_run": false,
"pipeline_description": "Performs Preprocessing, training and prediction of digits",
"pipeline_name": "digit-recognizer-kale",
"snapshot_volumes": true,
"steps_defaults": [
"label:access-ml-pipeline:true",
"label:access-rok:true"
],
"volume_access_mode": "rwm",
"volumes": [
{
"annotations": [],
"mount_point": "/home/jovyan",
"name": "demoo-workspace-44q2m",
"size": 5,
"size_type": "Gi",
"snapshot": false,
"type": "clone"
}
]
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}