{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Digit Recognizer Kubeflow Pipeline\n",
"\n",
"In this [Kaggle competition](https://www.kaggle.com/competitions/digit-recognizer/overview) \n",
"\n",
">MNIST (\"Modified National Institute of Standards and Technology\") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of handwritten images has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.\n",
"\n",
">In this competition, your goal is to correctly identify digits from a dataset of tens of thousands of handwritten images."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Install relevant libraries\n",
"\n",
"\n",
">Update pip `pip install --user --upgrade pip`\n",
"\n",
">Install and upgrade kubeflow sdk `pip install kfp --upgrade --user --quiet`\n",
"\n",
"You may need to restart your notebook kernel after installing the kfp sdk"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "38y13drotnXK",
"outputId": "61184254-57cb-4f29-c0e5-dba30df0c914",
"tags": [
"imports"
]
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Requirement already satisfied: pip in /usr/local/lib/python3.6/dist-packages (21.3.1)\n"
]
}
],
"source": [
"!pip install --user --upgrade pip"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "6a8V8LN9ttJT"
},
"outputs": [],
"source": [
"!pip install kfp --upgrade --user --quiet"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "taqx2u69tnXS",
"outputId": "32ec75b1-7d1e-434e-8834-aa841fecd4d5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name: kfp\n",
"Version: 1.8.11\n",
"Summary: KubeFlow Pipelines SDK\n",
"Home-page: https://github.com/kubeflow/pipelines\n",
"Author: The Kubeflow Authors\n",
"Author-email: \n",
"License: UNKNOWN\n",
"Location: /home/jovyan/.local/lib/python3.6/site-packages\n",
"Requires: absl-py, click, cloudpickle, dataclasses, Deprecated, docstring-parser, fire, google-api-python-client, google-auth, google-cloud-storage, jsonschema, kfp-pipeline-spec, kfp-server-api, kubernetes, protobuf, pydantic, PyYAML, requests-toolbelt, strip-hints, tabulate, typer, typing-extensions, uritemplate\n",
"Required-by: kubeflow-kale\n"
]
}
],
"source": [
"# confirm the kfp sdk\n",
"! pip show kfp"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "4B2CjdRcuRgT"
},
"source": [
"## Import kubeflow pipeline libraries"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"id": "_8P4-rCDtnXT"
},
"outputs": [],
"source": [
"import kfp\n",
"import kfp.components as comp\n",
"import kfp.dsl as dsl\n",
"from kfp.components import InputPath, OutputPath\n",
"from typing import NamedTuple\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "NWPLyw_DuzSl"
},
"source": [
"## Kubeflow pipeline component creation"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "98g9LoIcuaPB"
},
"source": [
"Component 1: Download the digits Dataset"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"id": "lGK3hlXdtnXV"
},
"outputs": [],
"source": [
"# download data step\n",
"def download_data(download_link: str, data_path: OutputPath(str)):\n",
" import zipfile\n",
" import sys, subprocess;\n",
" subprocess.run([\"python\", \"-m\", \"pip\", \"install\", \"--upgrade\", \"pip\"])\n",
" subprocess.run([sys.executable, \"-m\", \"pip\", \"install\", \"wget\"])\n",
" import wget\n",
" import os\n",
"\n",
" if not os.path.exists(data_path):\n",
" os.makedirs(data_path)\n",
"\n",
" # download files\n",
" wget.download(download_link.format(file='train'), f'{data_path}/train_csv.zip')\n",
" wget.download(download_link.format(file='test'), f'{data_path}/test_csv.zip')\n",
" \n",
" with zipfile.ZipFile(f\"{data_path}/train_csv.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(data_path)\n",
" \n",
" with zipfile.ZipFile(f\"{data_path}/test_csv.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(data_path)\n",
" \n",
" return(print('Done!'))"
]
},
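{
"cell_type": "markdown",
"metadata": {},
"source": [
"The download component can be sanity-checked outside the pipeline. The cell below is a minimal local sketch (not a pipeline step): it builds a tiny zip archive laid out like the Kaggle ones and confirms that the same `extractall` logic used in `download_data` recovers the CSV. No network access or extra packages are assumed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Local sketch: verify the unzip logic used in download_data on a toy archive.\n",
"import os, tempfile, zipfile\n",
"\n",
"tmp_dir = tempfile.mkdtemp()\n",
"csv_path = os.path.join(tmp_dir, 'train.csv')\n",
"with open(csv_path, 'w') as f:\n",
"    f.write('label,pixel0\\n1,0\\n')\n",
"\n",
"zip_path = os.path.join(tmp_dir, 'train_csv.zip')\n",
"with zipfile.ZipFile(zip_path, 'w') as zf:\n",
"    zf.write(csv_path, arcname='train.csv')\n",
"\n",
"os.remove(csv_path)  # keep only the archive, as download_data would see it\n",
"with zipfile.ZipFile(zip_path, 'r') as zip_ref:\n",
"    zip_ref.extractall(tmp_dir)\n",
"\n",
"print(os.path.exists(os.path.join(tmp_dir, 'train.csv')))  # expected: True"
]
},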
{
"cell_type": "markdown",
"metadata": {
"id": "J9Yhdk3xudcn"
},
"source": [
"Component 2: load the digits Dataset"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"id": "oEdsQpH2tnXX"
},
"outputs": [],
"source": [
"# load data\n",
"\n",
"def load_data(data_path: InputPath(str), \n",
" load_data_path: OutputPath(str)):\n",
" \n",
" # import Library\n",
" import sys, subprocess;\n",
" subprocess.run([\"python\", \"-m\", \"pip\", \"install\", \"--upgrade\", \"pip\"])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','pandas'])\n",
" # import Library\n",
" import os, pickle;\n",
" import pandas as pd\n",
" import numpy as np\n",
"\n",
" #importing the data\n",
" # Data Path\n",
" train_data_path = data_path + '/train.csv'\n",
" test_data_path = data_path + '/test.csv'\n",
"\n",
" # Loading dataset into pandas \n",
" train_df = pd.read_csv(train_data_path)\n",
" test_df = pd.read_csv(test_data_path)\n",
" \n",
" # join train and test together\n",
" ntrain = train_df.shape[0]\n",
" ntest = test_df.shape[0]\n",
" all_data = pd.concat((train_df, test_df)).reset_index(drop=True)\n",
" print(\"all_data size is : {}\".format(all_data.shape))\n",
" \n",
" #creating the preprocess directory\n",
" os.makedirs(load_data_path, exist_ok = True)\n",
" \n",
" #Save the combined_data as a pickle file to be used by the preprocess component.\n",
" with open(f'{load_data_path}/all_data', 'wb') as f:\n",
" pickle.dump((ntrain, all_data), f)\n",
" \n",
" return(print('Done!'))"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Component 3: Preprocess the digits Dataset"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"# preprocess data\n",
"\n",
"def preprocess_data(load_data_path: InputPath(str), \n",
" preprocess_data_path: OutputPath(str)):\n",
" \n",
" # import Library\n",
" import sys, subprocess;\n",
" subprocess.run([\"python\", \"-m\", \"pip\", \"install\", \"--upgrade\", \"pip\"])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','pandas'])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','scikit-learn'])\n",
" import os, pickle;\n",
" import pandas as pd\n",
" import numpy as np\n",
" from sklearn.model_selection import train_test_split\n",
"\n",
" #loading the train data\n",
" with open(f'{load_data_path}/all_data', 'rb') as f:\n",
" ntrain, all_data = pickle.load(f)\n",
" \n",
" # split features and label\n",
" all_data_X = all_data.drop('label', axis=1)\n",
" all_data_y = all_data.label\n",
" \n",
" # Reshape image in 3 dimensions (height = 28px, width = 28px , channel = 1)\n",
" all_data_X = all_data_X.values.reshape(-1,28,28,1)\n",
"\n",
" # Normalize the data\n",
" all_data_X = all_data_X / 255.0\n",
" \n",
" #Get the new dataset\n",
" X = all_data_X[:ntrain].copy()\n",
" y = all_data_y[:ntrain].copy()\n",
" \n",
" # split into train and test\n",
" X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)\n",
" \n",
" #creating the preprocess directory\n",
" os.makedirs(preprocess_data_path, exist_ok = True)\n",
" \n",
" #Save the train_data as a pickle file to be used by the modelling component.\n",
" with open(f'{preprocess_data_path}/train', 'wb') as f:\n",
" pickle.dump((X_train, y_train), f)\n",
" \n",
" #Save the test_data as a pickle file to be used by the predict component.\n",
" with open(f'{preprocess_data_path}/test', 'wb') as f:\n",
" pickle.dump((X_test, y_test), f)\n",
" \n",
" return(print('Done!'))"
]
},
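{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the reshape and normalization performed in `preprocess_data`, the cell below (a local sketch, not a pipeline step, assuming pandas and numpy are available in the notebook) builds a two-row dummy frame with 784 pixel columns and confirms it reshapes to `(2, 28, 28, 1)` with values in `[0, 1]`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Local sketch: verify the reshape/normalize logic from preprocess_data on dummy data.\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"dummy = pd.DataFrame(np.random.randint(0, 256, size=(2, 784)),\n",
"                     columns=['pixel{}'.format(i) for i in range(784)])\n",
"dummy_X = dummy.values.reshape(-1, 28, 28, 1) / 255.0\n",
"\n",
"print(dummy_X.shape)                 # expected: (2, 28, 28, 1)\n",
"print(dummy_X.min(), dummy_X.max())  # expected: both within [0.0, 1.0]"
]
},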
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Component 4: ML modeling"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"id": "8KGMGMEmtnXZ"
},
"outputs": [],
"source": [
"def modeling(preprocess_data_path: InputPath(str), \n",
" model_path: OutputPath(str)):\n",
" \n",
" # import Library\n",
" import sys, subprocess;\n",
" subprocess.run([\"python\", \"-m\", \"pip\", \"install\", \"--upgrade\", \"pip\"])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','pandas'])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','tensorflow'])\n",
" import os, pickle;\n",
" import numpy as np\n",
" import tensorflow as tf\n",
" from tensorflow import keras, optimizers\n",
" from tensorflow.keras.metrics import SparseCategoricalAccuracy\n",
" from tensorflow.keras.losses import SparseCategoricalCrossentropy\n",
" from tensorflow.keras import layers\n",
"\n",
" #loading the train data\n",
" with open(f'{preprocess_data_path}/train', 'rb') as f:\n",
" train_data = pickle.load(f)\n",
" \n",
" # Separate the X_train from y_train.\n",
" X_train, y_train = train_data\n",
" \n",
" #initializing the classifier model with its input, hidden and output layers\n",
" hidden_dim1=56\n",
" hidden_dim2=100\n",
" DROPOUT=0.5\n",
" model = tf.keras.Sequential([\n",
" tf.keras.layers.Conv2D(filters = hidden_dim1, kernel_size = (5,5),padding = 'Same', \n",
" activation ='relu'),\n",
" tf.keras.layers.Dropout(DROPOUT),\n",
" tf.keras.layers.Conv2D(filters = hidden_dim2, kernel_size = (3,3),padding = 'Same', \n",
" activation ='relu'),\n",
" tf.keras.layers.Dropout(DROPOUT),\n",
" tf.keras.layers.Conv2D(filters = hidden_dim2, kernel_size = (3,3),padding = 'Same', \n",
" activation ='relu'),\n",
" tf.keras.layers.Dropout(DROPOUT),\n",
" tf.keras.layers.Flatten(),\n",
" tf.keras.layers.Dense(10, activation = \"softmax\")\n",
" ])\n",
"\n",
" model.build(input_shape=(None,28,28,1))\n",
" \n",
" #Compiling the classifier model with Adam optimizer\n",
" model.compile(optimizers.Adam(learning_rate=0.001), \n",
" loss=SparseCategoricalCrossentropy(), \n",
" metrics=SparseCategoricalAccuracy(name='accuracy'))\n",
"\n",
" # model fitting\n",
" history = model.fit(np.array(X_train), np.array(y_train),\n",
" validation_split=.1, epochs=1, batch_size=64)\n",
" \n",
" #loading the X_test and y_test\n",
" with open(f'{preprocess_data_path}/test', 'rb') as f:\n",
" test_data = pickle.load(f)\n",
" # Separate the X_test from y_test.\n",
" X_test, y_test = test_data\n",
" \n",
" # Evaluate the model and print the results\n",
" test_loss, test_acc = model.evaluate(np.array(X_test), np.array(y_test), verbose=0)\n",
" print(\"Test_loss: {}, Test_accuracy: {} \".format(test_loss,test_acc))\n",
" \n",
" #creating the preprocess directory\n",
" os.makedirs(model_path, exist_ok = True)\n",
" \n",
" #saving the model\n",
" model.save(f'{model_path}/model.h5') "
]
},
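{
"cell_type": "markdown",
"metadata": {},
"source": [
"The CNN defined inside `modeling` can be inspected before running the pipeline. The cell below is a sketch that rebuilds the same architecture locally and checks that it maps a 28x28x1 image to 10 class probabilities; it assumes TensorFlow is importable in the notebook image, which may not be the case for every notebook server."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Local sketch (assumes TensorFlow is available in the notebook image):\n",
"# rebuild the architecture from modeling() and confirm the output shape.\n",
"import numpy as np\n",
"import tensorflow as tf\n",
"\n",
"hidden_dim1, hidden_dim2, DROPOUT = 56, 100, 0.5\n",
"sketch_model = tf.keras.Sequential([\n",
"    tf.keras.layers.Conv2D(hidden_dim1, (5, 5), padding='Same', activation='relu'),\n",
"    tf.keras.layers.Dropout(DROPOUT),\n",
"    tf.keras.layers.Conv2D(hidden_dim2, (3, 3), padding='Same', activation='relu'),\n",
"    tf.keras.layers.Dropout(DROPOUT),\n",
"    tf.keras.layers.Conv2D(hidden_dim2, (3, 3), padding='Same', activation='relu'),\n",
"    tf.keras.layers.Dropout(DROPOUT),\n",
"    tf.keras.layers.Flatten(),\n",
"    tf.keras.layers.Dense(10, activation='softmax')\n",
"])\n",
"sketch_model.build(input_shape=(None, 28, 28, 1))\n",
"\n",
"dummy_batch = np.zeros((1, 28, 28, 1), dtype='float32')\n",
"print(sketch_model(dummy_batch).shape)  # expected: (1, 10)"
]
},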
{
"cell_type": "markdown",
"metadata": {
"id": "XwXJuoHQui3d"
},
"source": [
"Component 5: Prediction and evaluation"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"def prediction(model_path: InputPath(str), \n",
" preprocess_data_path: InputPath(str), \n",
" mlpipeline_ui_metadata_path: OutputPath(str)) -> NamedTuple('conf_m_result', [('mlpipeline_ui_metadata', 'UI_metadata')]):\n",
" \n",
" # import Library\n",
" import sys, subprocess;\n",
" subprocess.run([\"python\", \"-m\", \"pip\", \"install\", \"--upgrade\", \"pip\"])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','scikit-learn'])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','pandas'])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','tensorflow'])\n",
" import pickle, json;\n",
" import pandas as pd\n",
" import numpy as np\n",
" from collections import namedtuple\n",
" from sklearn.metrics import confusion_matrix\n",
" from tensorflow.keras.models import load_model\n",
"\n",
" #loading the X_test and y_test\n",
" with open(f'{preprocess_data_path}/test', 'rb') as f:\n",
" test_data = pickle.load(f)\n",
" # Separate the X_test from y_test.\n",
" X_test, y_test = test_data\n",
" \n",
" #loading the model\n",
" model = load_model(f'{model_path}/model.h5')\n",
" \n",
" # prediction\n",
" y_pred = np.argmax(model.predict(X_test), axis=-1)\n",
" \n",
" # confusion matrix\n",
" cm = confusion_matrix(y_test, y_pred)\n",
" vocab = list(np.unique(y_test))\n",
" \n",
" # confusion_matrix pair dataset \n",
" data = []\n",
" for target_index, target_row in enumerate(cm):\n",
" for predicted_index, count in enumerate(target_row):\n",
" data.append((vocab[target_index], vocab[predicted_index], count))\n",
" \n",
" # convert confusion_matrix pair dataset to dataframe\n",
" df = pd.DataFrame(data,columns=['target','predicted','count'])\n",
" \n",
" # change 'target', 'predicted' to integer strings\n",
" df[['target', 'predicted']] = (df[['target', 'predicted']].astype(int)).astype(str)\n",
" \n",
" # create kubeflow metric metadata for UI\n",
" metadata = {\n",
" \"outputs\": [\n",
" {\n",
" \"type\": \"confusion_matrix\",\n",
" \"format\": \"csv\",\n",
" \"schema\": [\n",
" {\n",
" \"name\": \"target\",\n",
" \"type\": \"CATEGORY\"\n",
" },\n",
" {\n",
" \"name\": \"predicted\",\n",
" \"type\": \"CATEGORY\"\n",
" },\n",
" {\n",
" \"name\": \"count\",\n",
" \"type\": \"NUMBER\"\n",
" }\n",
" ],\n",
" \"source\": df.to_csv(header=False, index=False),\n",
" \"storage\": \"inline\",\n",
" \"labels\": [\n",
" \"0\",\n",
" \"1\",\n",
" \"2\",\n",
" \"3\",\n",
" \"4\",\n",
" \"5\",\n",
" \"6\",\n",
" \"7\",\n",
" \"8\",\n",
" \"9\",\n",
" ]\n",
" }\n",
" ]\n",
" }\n",
" \n",
" with open(mlpipeline_ui_metadata_path, 'w') as metadata_file:\n",
" json.dump(metadata, metadata_file)\n",
"\n",
" conf_m_result = namedtuple('conf_m_result', ['mlpipeline_ui_metadata'])\n",
" \n",
" return conf_m_result(json.dumps(metadata))"
]
},
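{
"cell_type": "markdown",
"metadata": {},
"source": [
"The nested loop in `prediction` flattens the confusion matrix into `(target, predicted, count)` rows, which is the inline-CSV layout the Kubeflow Pipelines UI expects for its confusion-matrix visualization. The cell below is an illustrative sketch of that flattening on a toy 2x2 matrix; it is not part of the pipeline."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: how prediction() flattens a confusion matrix into\n",
"# (target, predicted, count) rows for the KFP UI visualization.\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"cm = np.array([[5, 1],\n",
"               [2, 7]])   # toy 2-class confusion matrix\n",
"vocab = [0, 1]\n",
"\n",
"rows = []\n",
"for target_index, target_row in enumerate(cm):\n",
"    for predicted_index, count in enumerate(target_row):\n",
"        rows.append((vocab[target_index], vocab[predicted_index], count))\n",
"\n",
"df = pd.DataFrame(rows, columns=['target', 'predicted', 'count'])\n",
"print(df.to_csv(header=False, index=False))  # the inline CSV stored in the metadata"
]
},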
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# create light weight components\n",
"download_op = comp.create_component_from_func(download_data,base_image=\"python:3.7.1\")\n",
"load_op = comp.create_component_from_func(load_data,base_image=\"python:3.7.1\")\n",
"preprocess_op = comp.create_component_from_func(preprocess_data,base_image=\"python:3.7.1\")\n",
"modeling_op = comp.create_component_from_func(modeling, base_image=\"tensorflow/tensorflow:latest\")\n",
"predict_op = comp.create_component_from_func(prediction, base_image=\"tensorflow/tensorflow:latest\")"
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"Create kubeflow pipeline components from images"
]
},
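{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides building component factories in memory, `create_component_from_func` can also write a reusable component spec to YAML, which can later be reloaded without the original Python function. The cell below is an optional sketch using the kfp v1 SDK; the file name `download_component.yaml` is just an example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sketch: persist a component spec to YAML and reload it later (kfp v1 SDK).\n",
"# The file name below is only an example.\n",
"comp.create_component_from_func(\n",
"    download_data,\n",
"    base_image=\"python:3.7.1\",\n",
"    output_component_file=\"download_component.yaml\",\n",
")\n",
"\n",
"reloaded_download_op = comp.load_component_from_file(\"download_component.yaml\")"
]
},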
{
"cell_type": "markdown",
"metadata": {
"id": "v3bNFpBOuuG-"
},
"source": [
"## Kubeflow pipeline creation"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"id": "LVbUms_ptnXc",
"tags": []
},
"outputs": [],
"source": [
"# create client that would enable communication with the Pipelines API server \n",
"client = kfp.Client()"
]
},
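{
"cell_type": "markdown",
"metadata": {},
"source": [
"Inside a Kubeflow notebook server, `kfp.Client()` with no arguments picks up the in-cluster connection details automatically. If you run this notebook outside the cluster, you typically have to point the client at the Pipelines endpoint yourself; the host below is only a placeholder (for example a port-forwarded `ml-pipeline-ui` service), so the line is left commented out."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch for running outside the cluster (placeholder host, e.g. after\n",
"# `kubectl port-forward svc/ml-pipeline-ui 8080:80 -n kubeflow`):\n",
"# client = kfp.Client(host=\"http://localhost:8080\")"
]
},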
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"id": "ECZRaIgCtnXd",
"tags": []
},
"outputs": [],
"source": [
"# define pipeline\n",
"@dsl.pipeline(name=\"digit-recognizer-pipeline\", \n",
" description=\"Performs Preprocessing, training and prediction of digits\")\n",
"\n",
"# Define parameters to be fed into pipeline\n",
"def digit_recognize_pipeline(download_link: str,\n",
" data_path: str,\n",
" load_data_path: str, \n",
" preprocess_data_path: str,\n",
" model_path:str\n",
" ):\n",
"\n",
"\n",
" # Create download container.\n",
" download_container = download_op(download_link)\n",
" # Create load container.\n",
" load_container = load_op(download_container.output)\n",
" # Create preprocess container.\n",
" preprocess_container = preprocess_op(load_container.output)\n",
" # Create modeling container.\n",
" modeling_container = modeling_op(preprocess_container.output)\n",
" # Create prediction container.\n",
" predict_container = predict_op(modeling_container.output, preprocess_container.output)\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"# replace download_link with the repo link where the data is stored https:github-repo/data-dir/{file}.csv.zip?raw=true\n",
"download_link = 'https://github.com/josepholaide/KfaaS/blob/main/kale/data/{file}.csv.zip?raw=true'\n",
"data_path = \"/mnt\"\n",
"load_data_path = \"load\"\n",
"preprocess_data_path = \"preprocess\"\n",
"model_path = \"model\""
]
},
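{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `{file}` placeholder in `download_link` is filled in by the download component with `str.format`, once for the train archive and once for the test archive; the quick check below shows the resulting URLs."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Quick check of how download_data expands the {file} placeholder.\n",
"print(download_link.format(file='train'))\n",
"print(download_link.format(file='test'))"
]
},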
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"id": "Jq2M3chhtnXd",
"outputId": "cd75c395-f0d1-415f-8205-9ea23d52fdb5"
},
"outputs": [
{
"data": {
"text/html": [
"<a href=\"/pipeline/#/experiments/details/cd1408ad-670b-4fc2-90a0-f9fa177b5570\" target=\"_blank\" >Experiment details</a>."
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/html": [
"<a href=\"/pipeline/#/runs/details/178ac72f-97e3-455a-af7b-37af66f7b8e8\" target=\"_blank\" >Run details</a>."
],
"text/plain": [
"<IPython.core.display.HTML object>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pipeline_func = digit_recognize_pipeline\n",
"\n",
"experiment_name = 'digit_recognizer_lightweight'\n",
"run_name = pipeline_func.__name__ + ' run'\n",
"\n",
"arguments = {\"download_link\": download_link,\n",
" \"data_path\": data_path,\n",
" \"load_data_path\": load_data_path,\n",
" \"preprocess_data_path\": preprocess_data_path,\n",
" \"model_path\":model_path}\n",
"\n",
"# Compile pipeline to generate compressed YAML definition of the pipeline.\n",
"kfp.compiler.Compiler().compile(pipeline_func, \n",
" '{}.zip'.format(experiment_name))\n",
"\n",
"# Submit pipeline directly from pipeline function\n",
"run_result = client.create_run_from_pipeline_func(pipeline_func, \n",
" experiment_name=experiment_name, \n",
" run_name=run_name, \n",
" arguments=arguments\n",
" )\n"
]
},
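{
"cell_type": "markdown",
"metadata": {},
"source": [
"After submission, the run can also be monitored from the notebook instead of the UI. The cell below is a sketch assuming the kfp v1 SDK: `wait_for_run_completion` blocks until the run finishes (or the timeout expires) and returns the run details; the one-hour timeout is an arbitrary example value."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: block until the run finishes (kfp v1 SDK), then print its final status.\n",
"run_detail = client.wait_for_run_completion(run_result.run_id, timeout=3600)\n",
"print(run_detail.run.status)"
]
},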
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"colab": {
"name": "kubepipe.ipynb",
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"kubeflow_notebook": {
"autosnapshot": true,
"docker_image": "gcr.io/arrikto/jupyter-kale-py36@sha256:dd3f92ca66b46d247e4b9b6a9d84ffbb368646263c2e3909473c3b851f3fe198",
"experiment": {
"id": "6f6c9b81-54e3-414b-974a-6fe8b445a59e",
"name": "digit_recognize_lightweight"
},
"experiment_name": "digit_recognize_lightweight",
"katib_metadata": {
"algorithm": {
"algorithmName": "grid"
},
"maxFailedTrialCount": 3,
"maxTrialCount": 12,
"objective": {
"objectiveMetricName": "",
"type": "minimize"
},
"parallelTrialCount": 3,
"parameters": []
},
"katib_run": false,
"pipeline_description": "Performs Preprocessing, training and prediction of digits",
"pipeline_name": "digit-recognizer-kfp",
"snapshot_volumes": true,
"steps_defaults": [
"label:access-ml-pipeline:true",
"label:access-rok:true"
],
"volume_access_mode": "rwm",
"volumes": [
{
"annotations": [],
"mount_point": "/home/jovyan",
"name": "arikkto-workspace-7xzjm",
"size": 5,
"size_type": "Gi",
"snapshot": false,
"type": "clone"
}
]
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}