Conversion of Telco Churn and JPX Stock Market Kaggle notebooks to KFP Pipeline (#964)
Squashed commit summary:

* Convert the digit recognizer, JPX Tokyo Stock Exchange prediction, and Telco Customer Churn Kaggle notebooks into Kubeflow pipelines, each with a Vanilla KFP and a Kale variant alongside the original notebook
* Add per-example READMEs with pipeline, pipeline-metrics, and Kaggle API setup screenshots
* Add Kubernetes secret and PodDefault setup so the pipelines can download Kaggle data without plain-text credentials
* Rename digit_recognition/ to digit-recognition-kaggle-competition/, standardize notebook file names, and update data download links and requirements.txt

@@ -178,10 +178,9 @@
"data_path = 'data'\n",
"\n",
"# data link\n",
"train_link = 'https://github.com/josepholaide/examples/blob/master/digit-recognition-kaggle-competition/data/train.csv.zip?raw=true'\n",
"test_link = 'https://github.com/josepholaide/examples/blob/master/digit-recognition-kaggle-competition/data/test.csv.zip?raw=true'\n",
"sample_submission = 'https://raw.githubusercontent.com/josepholaide/examples/master/digit-recognition-kaggle-competition/data/sample_submission.csv'\n",
"\n",
"train_link = 'https://github.com/kubeflow/examples/blob/master/digit-recognition-kaggle-competition/data/train.csv.zip?raw=true'\n",
"test_link = 'https://github.com/kubeflow/examples/blob/master/digit-recognition-kaggle-competition/data/test.csv.zip?raw=true'\n",
"sample_submission = 'https://raw.githubusercontent.com/kubeflow/examples/master/digit-recognition-kaggle-competition/data/sample_submission.csv'\n",
"\n",
"# download data\n",
"wget.download(train_link, f'{data_path}/train_csv.zip')\n",
"wget.download(test_link, f'{data_path}/test_csv.zip')\n",

@@ -569,7 +569,7 @@
"metadata": {},
"outputs": [],
"source": [
"download_link = 'https://github.com/josepholaide/examples/blob/master/digit-recognition-kaggle-competition/data/{file}.csv.zip?raw=true'\n",
"download_link = 'https://github.com/kubeflow/examples/blob/master/digit-recognition-kaggle-competition/data/{file}.csv.zip?raw=true'\n",
"data_path = \"/mnt\"\n",
"load_data_path = \"load\"\n",
"preprocess_data_path = \"preprocess\"\n",

@@ -1,5 +1,7 @@
# Objective
Here we convert the https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction code to a Kubeflow pipeline

In this example we are going to convert this generic [notebook](https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/jpx-tokyo-stock-exchange-prediction-orig.ipynb) based on the [Kaggle JPX Tokyo Stock Exchange Prediction](https://www.kaggle.com/competitions/jpx-tokyo-stock-exchange-prediction) competition into a Kubeflow pipeline.

The objective of this task is to correctly model real future returns of around 2,000 stocks. The stocks are ranked from highest
to lowest expected returns and they are evaluated on the difference in returns between the top and bottom 200 stocks.

@@ -12,90 +14,149 @@ Environment:
| kfp | 1.8.11 |
| kubeflow-kale | 0.6.0 |
| pip | 21.3.1 |
| kaggle | 1.5.12 |

The KFP version used for testing can be installed as `pip install kfp==1.8.11`

## Section 1: Overview

# Section 1: KFP Pipeline
1. Vanilla KFP Pipeline: Kubeflow lightweight component method

## Kubeflow lightweight component method
Here, a python function is created to carry out a certain task and the python function is passed to the kfp component method `create_component_from_func`.
To get started, visit the Kubeflow Pipelines [documentation](https://www.kubeflow.org/docs/components/pipelines/sdk/)
to get acquainted with what pipelines are, their components, pipeline metrics, and how to pass data between components in a pipeline.
There are different ways to build out a pipeline component as mentioned [here](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/#building-pipeline-components).
In the following example, we are going to use the lightweight Python function-based components for building our Kubeflow pipeline.

2. Kale KFP Pipeline

## Kubeflow pipelines
A Kubeflow pipeline connects all components together to create a directed acyclic graph (DAG). The kfp `dsl.pipeline` method was used to create a pipeline function.
The kfp component methods `InputPath` and `OutputPath` were used to pass data between components.
To get started, visit Kale's [documentation](https://docs.arrikto.com/user/kale/index.html) to get acquainted with the
Kale user interface (UI) from a Jupyter Notebook, [notebook cell annotation](https://docs.arrikto.com/user/kale/jupyterlab/annotate.html)
and how to create a machine learning pipeline using Kale.
In the following example, we are going to use the Kale JupyterLab Extension to build our Kubeflow pipeline.

Finally, the `create_run_from_pipeline_func` was used to submit the pipeline directly from the pipeline function

## To create pipeline on KFP
## Section 2: Prepare environment for data download

1. Open your Kubeflow Cluster, create a Notebook Server and connect to it.

2. Clone this repo and navigate to this directory

3. Download the JPX dataset using Kaggle's API. To do this, do the following:
2. Download the JPX dataset using Kaggle's API. To do this, do the following:

* Login to Kaggle and click on your user profile picture.
* Click on ‘Account’.

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/kaggle-click-account.PNG?raw=true" alt="kaggle-click-account"/>
</p>

* Under ‘Account’, navigate to the ‘API’ section.

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/kaggle-create-new-api-token.PNG?raw=true" alt="kaggle-create-new-api-token"/>
</p>

* Click ‘Create New API token’.
* After creating a new API token, a kaggle.json file is automatically downloaded,
and the json file contains the ‘api-key’ and ‘username’ needed to download the dataset.
* After creating a new API token, a kaggle.json file is automatically downloaded, and the json file contains the ‘api-key’ and ‘username’ needed to download the dataset.
* Create a Kubernetes secret to handle the sensitive API credentials and to prevent you from passing your credentials in plain text to the pipeline notebook.
```
!kubectl create secret generic -n kubeflow-user kaggle-secret --from-literal=username=<"username"> --from-literal=password=<"api-key">
```
* Create a secret PodDefault YAML file in your Kubeflow namespace.
```
apiVersion: "kubeflow.org/v1alpha1"
kind: PodDefault
metadata:
  name: kaggle-secret
  namespace: kubeflow-user
spec:
  selector:
    matchLabels:
      kaggle-secret: "true"
  desc: "kaggle-secret"
  volumeMounts:
  - name: secret-volume
    mountPath: /secret/kaggle-secret
    readOnly: false
  volumes:
  - name: secret-volume
    secret:
      secretName: kaggle-secret
```
* Apply the pod YAML file
`kubectl apply -f kaggle_pod.yaml`
* After successfully deploying the PodDefault, create a new Notebook Server and add the `kaggle-secret` configuration to the new Notebook Server
that runs the kale or kfp pipeline.
<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/notebook-ui-kaggle-config.png?raw=true" alt="notebook-ui-kaggle-config"/>
</p>
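
Once the Notebook Server starts with the `kaggle-secret` configuration attached, the credentials are mounted as plain files under the PodDefault's mount path. As a minimal sketch (taken from the pattern the notebooks in this example actually use), the pipeline code consumes them like this:

```python
import os

# read the Kaggle credentials mounted by the kaggle-secret PodDefault
with open('/secret/kaggle-secret/username', 'r') as file:
    kaggle_user = file.read().rstrip()
with open('/secret/kaggle-secret/password', 'r') as file:
    kaggle_key = file.read().rstrip()

# the Kaggle CLI picks its credentials up from these environment variables
os.environ['KAGGLE_USERNAME'], os.environ['KAGGLE_KEY'] = kaggle_user, kaggle_key
```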

4. Open the digit-recognizer-kfp notebook and pass the ‘api-key’ and ‘username’ in the following cells.
## Section 3: Vanilla KFP Pipeline

* enter username
### Kubeflow lightweight component method
Here, a python function is created to carry out a certain task and the python function is passed to the kfp component method [`create_component_from_func`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.create_component_from_func).

<p>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/enter-username.PNG?raw=true" alt="enter username" width="700" height="300"/>
</p>
The different components used in this example are:

* enter api key
- Load data
- Transform data
- Feature Engineering
- Modelling
- Prediction

<p>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/enter-api-key.PNG?raw=true" alt="enter api key" width="700" height="250"/>
</p>
## Kubeflow pipelines
A Kubeflow pipeline connects all components together to create a directed acyclic graph (DAG). The kfp [`dsl.pipeline`](https://www.kubeflow.org/docs/components/pipelines/sdk/sdk-overview/) decorator was used to create a pipeline function.
The kfp component methods [`InputPath`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.InputPath) and [`OutputPath`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.OutputPath) were used to pass data between components in the pipeline, as shown in the sketch below.

5. Run the digit-recognizer-kfp notebook from start to finish
Finally, the [`create_run_from_pipeline_func`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.client.html) from the KFP SDK Client was used to submit the pipeline directly from the pipeline function.

6. View run details immediately after submitting pipeline.
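
Putting the pieces above together, a minimal sketch of how these methods combine (kfp v1 SDK assumed; the component body is illustrative, not the full download logic used in this example):

```python
import kfp
import kfp.components as comp
import kfp.dsl as dsl
from kfp.components import OutputPath

# lightweight component: a plain python function plus path annotations
def load_data(dataset: str, data_path: OutputPath(str)):
    import os
    # illustrative body; the real component downloads and unzips the Kaggle data here
    os.makedirs(data_path, exist_ok=True)
    with open(f'{data_path}/marker.txt', 'w') as f:
        f.write(dataset)

# wrap the function into a reusable pipeline component
load_op = comp.create_component_from_func(load_data, base_image="python:3.7.1")

# connect components into a DAG; passing one op's output to the next
# wires the InputPath/OutputPath arguments together
@dsl.pipeline(name="jpx-tokyo-stock-exchange",
              description="Predicting real future returns of around 2,000 stocks.")
def tokyo_stock_exchange_pipeline(dataset: str):
    load_container = load_op(dataset)

# submit the pipeline directly from the pipeline function
kfp.Client().create_run_from_pipeline_func(
    tokyo_stock_exchange_pipeline,
    arguments={"dataset": "jpx-tokyo-stock-exchange-prediction"})
```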

## To create pipeline using Vanilla KFP

1. Open your Kubeflow Cluster, create a new Notebook Server and add the `kaggle-secret` configuration to the new Notebook Server.

2. Create a new Terminal and clone this repo. After cloning, navigate to this directory.

3. Open the [jpx-tokyo-stock-exchange-prediction-kfp](https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/jpx-tokyo-stock-exchange-prediction-kfp.ipynb) notebook

4. Run the jpx-tokyo-stock-exchange-prediction-kfp notebook from start to finish

5. View run details immediately after submitting pipeline.

### View Pipeline

<p>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/kfp-pipeline.PNG?raw=true" alt="kubeflow pipeline" width="600" height="700"/>
<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/kfp-pipeline.PNG?raw=true" alt="kubeflow pipeline"/>
</p>

### View Pipeline Metric

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/kfp-metrics.PNG?raw=true" alt="kubeflow pipeline metrics"/>
</p>

# Section 2: Kale Pipeline
## Section 4: Kale KFP Pipeline

To create pipeline using the Kale JupyterLab extension
To create a KFP pipeline using the Kale JupyterLab extension

1. Clone the GitHub repo and navigate to this directory
1. Open your Kubeflow Cluster, create a new Notebook Server and add the `kaggle-secret` configuration to the new Notebook Server.

2. Install the requirements.txt file
2. Create a new Terminal and clone this repo. After cloning, navigate to this directory.

3. Launch the digit-recognizer-kale.ipynb Notebook
3. Launch the [jpx-tokyo-stock-exchange-prediction-kale.ipynb](https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/jpx-tokyo-stock-exchange-prediction-kale.ipynb) Notebook

4. Enable the Kale extension in JupyterLab
4. Install the requirements.txt file. After installation, restart the kernel.

5. Download the JPX dataset using Kaggle's API. To do this, do the following:
5. Enable the Kale extension in JupyterLab

* Login to Kaggle and click on your user profile picture.
* Click on ‘Account’.
* Under ‘Account’, navigate to the ‘API’ section.
* Click ‘Create New API token’.
* After creating a new API token, a kaggle.json file is automatically downloaded,
and the json file contains the ‘api-key’ and ‘username’ needed to download the dataset.
* Upload the JSON file to the Jupyter notebook instance
* Pass the JSON file directory into the following cell.
<p>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/pass-kaggle-json-path.PNG?raw=true" alt="pass kaggle json path" width="850" height="250"/>
</p>
6. The notebook's cells are automatically annotated with Kale tags

5. The notebook's cells are automatically annotated with Kale tags
To fully understand the different Kale tags available, visit the Kale [documentation](https://docs.arrikto.com/user/kale/jupyterlab/cell-types.html?highlight=pipeline%20metrics#annotate-pipeline-step-cells)

The following Kale tags were used in this example:

* Imports
* Pipeline Parameters
* Pipeline Metrics
* Pipeline Step
* Skip Cell

With the use of Kale tags we define the following:

@@ -105,11 +166,34 @@ To create pipeline using the Kale JupyterLab extension
* Cell dependencies are defined between the different pipeline steps with the "depends on" flag
* Pipeline metrics are assigned using the "pipeline metrics" tag
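
For reference, Kale stores these annotations in each cell's metadata. The load-data step in this example's notebook is tagged like this (excerpted from the notebook JSON in this PR; the `block:` prefix marks a pipeline step):

```
"metadata": {
    "tags": [
        "block:load_data"
    ]
}
```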

6. Compile and run Notebook using Kale
The pipeline steps created in this example:

* Load data
* Transform data
* Feature Engineering
* Modelling
* Prediction

7. Compile and run the Notebook by hitting the "Compile & Run" button in Kale's left panel

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/jpx-kale-deployment-panel.PNG?raw=true" alt="jpx-kale-deployment-panel"/>
</p>

### View Pipeline

<p>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/kale-pipeline.PNG?raw=true" alt="kubeflow pipeline" width="600" height="700"/>
View Pipeline by clicking "View" in Kale's left panel

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/jpx-view-pipeline.PNG?raw=true" alt="jpx-view-pipeline"/>
</p>

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/kale-pipeline.PNG?raw=true" alt="kale-pipeline"/>
</p>

### View Pipeline Metric

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/jpx-tokyo-stock-exchange-kaggle-competition/images/kale-metrics.PNG?raw=true" alt="kale-metrics"/>
</p>

(binary image diffs: 7 new images added, 14 KiB–371 KiB)

@@ -15,6 +15,21 @@
"> In this competition, you will model real future returns of around 2,000 stocks. The competition will involve building portfolios from the stocks eligible for predictions. The stocks are ranked from highest to lowest expected returns and they are evaluated on the difference in returns between the top and bottom 200 stocks."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"# Install necessary packages\n",
"\n",
"We can install the necessary package by either running `pip install --user <package_name>` or including everything in a `requirements.txt` file and running `pip install --user -r requirements.txt`. We have put the dependencies in a `requirements.txt` file, so we will use the latter method.\n",
"\n",
"> NOTE: Do not forget to use the `--user` argument. It is necessary if you want to use Kale to transform this notebook into a Kubeflow pipeline.",
"\n",
"After installing python packages, restart the notebook kernel before proceeding.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,

@@ -32,6 +47,7 @@
},
"outputs": [],
"source": [
"# After installation, restart the kernel.\n",
"!pip install -r requirements.txt --user --quiet"
]
},

@@ -48,7 +64,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 3,
"metadata": {
"tags": [
"imports"

@@ -83,7 +99,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 4,
"metadata": {
"tags": [
"pipeline-parameters"

@@ -107,7 +123,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 5,
"metadata": {
"tags": [
"skip"

@@ -135,25 +151,12 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 7,
"metadata": {
"tags": [
"block:load_data"
]
},
"outputs": [],
"source": [
"# setup kaggle environment for data download\n",
"# set kaggle.json path\n",
"os.environ['KAGGLE_CONFIG_DIR'] = \"/home/jovyan/examples/jpx-tokyo-stock-exchange-kaggle-competition\""
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"tags": []
},
"outputs": [
{
"data": {

@@ -161,22 +164,30 @@
"CompletedProcess(args=['kaggle', 'competitions', 'download', '-c', 'jpx-tokyo-stock-exchange-prediction'], returncode=0)"
]
},
"execution_count": 6,
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# grant rw permission to .kaggle/kaggle.json\n",
"subprocess.run([\"chmod\",\"600\", f\"{os.environ['KAGGLE_CONFIG_DIR']}/kaggle.json\"])\n",
"# setup kaggle environment for data download\n",
"dataset = \"jpx-tokyo-stock-exchange-prediction\"\n",
"\n",
"# setup kaggle environment for data download\n",
"with open('/secret/kaggle-secret/password', 'r') as file:\n",
" kaggle_key = file.read().rstrip()\n",
"with open('/secret/kaggle-secret/username', 'r') as file:\n",
" kaggle_user = file.read().rstrip()\n",
"\n",
"os.environ['KAGGLE_USERNAME'], os.environ['KAGGLE_KEY'] = kaggle_user, kaggle_key\n",
"\n",
"# download kaggle's jpx-tokyo-stock-exchange-prediction data\n",
"subprocess.run([\"kaggle\",\"competitions\", \"download\", \"-c\", \"jpx-tokyo-stock-exchange-prediction\"])"
"subprocess.run([\"kaggle\",\"competitions\", \"download\", \"-c\", dataset])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"execution_count": 8,
"metadata": {
"tags": []
},

@@ -186,13 +197,13 @@
"data_path = 'data'\n",
"\n",
"# extract jpx-tokyo-stock-exchange-prediction.zip to load_data_path\n",
"with zipfile.ZipFile(\"jpx-tokyo-stock-exchange-prediction.zip\",\"r\") as zip_ref:\n",
"with zipfile.ZipFile(f\"{dataset}.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(data_path)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 9,
"metadata": {
"tags": []
},

@@ -204,7 +215,7 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 10,
"metadata": {
"tags": []
},

@@ -215,7 +226,7 @@
"Timestamp('2021-12-03 00:00:00')"
]
},
"execution_count": 9,
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}

@@ -2212,6 +2223,7 @@
"snapshot_volumes": true,
"steps_defaults": [
"label:access-ml-pipeline:true",
"label:kaggle-secret:true",
"label:access-rok:true"
],
"volume_access_mode": "rwm",

@@ -2219,7 +2231,7 @@
{
"annotations": [],
"mount_point": "/home/jovyan",
"name": "jpx-workspace-lp2ng",
"name": "dem-workspace-snqdc",
"size": 5,
"size_type": "Gi",
"snapshot": false,

@@ -16,28 +16,22 @@
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"papermill": {
"duration": 1.321604,
"end_time": "2022-04-17T07:17:04.141763",
"exception": false,
"start_time": "2022-04-17T07:17:02.820159",
"status": "completed"
},
"tags": [
"skip"
]
},
"outputs": [],
"cell_type": "markdown",
"metadata": {},
"source": [
"!pip install -r requirements.txt --user --quiet"
"# Install relevant libraries\n",
"\n",
"\n",
">Update pip `pip install --user --upgrade pip`\n",
"\n",
">Install and upgrade kubeflow sdk `pip install kfp --upgrade --user --quiet`\n",
"\n",
"You may need to restart your notebook kernel after installing the kfp sdk"
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"metadata": {},
"outputs": [
{

@@ -54,7 +48,7 @@
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [

@@ -63,7 +57,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"outputs": [
{

@@ -90,7 +84,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 4,
"metadata": {
"tags": [
"imports"

@@ -102,8 +96,7 @@
"import kfp.components as comp\n",
"import kfp.dsl as dsl\n",
"from kfp.components import InputPath, OutputPath\n",
"from typing import NamedTuple\n",
"import getpass"
"from typing import NamedTuple"
]
},
{

@@ -112,7 +105,64 @@
"tags": []
},
"source": [
"# Download and load the dataset"
"# Kubeflow pipeline component creation\n",
"\n",
"## Download and load the dataset"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# load data step\n",
"def load_data(dataset: str, data_path: OutputPath(str)):\n",
" \n",
" # install the necessary libraries\n",
" import os, sys, subprocess, zipfile, pickle;\n",
" subprocess.run([\"python\", \"-m\", \"pip\", \"install\", \"--upgrade\", \"pip\"])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','pandas'])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','kaggle'])\n",
" \n",
" # import libraries\n",
" import pandas as pd\n",
"\n",
" # setup kaggle environment for data download\n",
" with open('/secret/kaggle-secret/password', 'r') as file:\n",
" kaggle_key = file.read().rstrip()\n",
" with open('/secret/kaggle-secret/username', 'r') as file:\n",
" kaggle_user = file.read().rstrip()\n",
" \n",
" os.environ['KAGGLE_USERNAME'], os.environ['KAGGLE_KEY'] = kaggle_user, kaggle_key\n",
" \n",
" # create data_path directory\n",
" if not os.path.exists(data_path):\n",
" os.makedirs(data_path)\n",
" \n",
" # download kaggle's jpx-tokyo-stock-exchange-prediction data\n",
" subprocess.run([\"kaggle\",\"competitions\", \"download\", \"-c\", dataset])\n",
" \n",
" # extract jpx-tokyo-stock-exchange-prediction.zip to data_path\n",
" with zipfile.ZipFile(f\"{dataset}.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(data_path)\n",
" \n",
" # read train_files/stock_prices.csv\n",
" df_prices = pd.read_csv(f\"{data_path}/train_files/stock_prices.csv\", parse_dates=['Date'])\n",
" \n",
" # Save the loaded data as a pickle file to be used by the transform_data component.\n",
" with open(f'{data_path}/df_prices', 'wb') as f:\n",
" pickle.dump(df_prices, f)\n",
"\n",
" \n",
" return(print('Done!'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform data"
]
},
{

@@ -120,64 +170,10 @@
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# load data step\n",
"def load_data(api_key: str, load_data_path: OutputPath(str)):\n",
" \n",
" # install the necessary libraries\n",
" import sys, subprocess;\n",
" subprocess.run([\"python\", \"-m\", \"pip\", \"install\", \"--upgrade\", \"pip\"])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','pandas'])\n",
" subprocess.run([sys.executable, '-m', 'pip', 'install','kaggle'])\n",
" \n",
" # import libraries\n",
" import os, json, zipfile, pickle;\n",
" import pandas as pd\n",
" # setup kaggle environment for data download\n",
" os.environ['KAGGLE_CONFIG_DIR'] = \"/.kaggle\"\n",
" subprocess.call([\"mkdir\",\".kaggle\"])\n",
" \n",
" # kaggle api token\n",
" # enter only username here. Do not enter your api_key\n",
" api_token = {\"username\": \"olaidejoseph10\", \"key\": api_key}\n",
" \n",
" with open('.kaggle/kaggle.json', 'w') as file:\n",
" json.dump(api_token, file)\n",
" \n",
" # grant rw permission to .kaggle/kaggle.json\n",
" subprocess.run([\"chmod\",\"600\", \".kaggle/kaggle.json\"])\n",
" \n",
" # download kaggle's jpx-tokyo-stock-exchange-prediction data\n",
" subprocess.run([\"kaggle\",\"competitions\", \"download\", \"-c\", \"jpx-tokyo-stock-exchange-prediction\"])\n",
" \n",
" # create load_data_path directory\n",
" if not os.path.exists(load_data_path):\n",
" os.makedirs(load_data_path)\n",
"\n",
" # extract jpx-tokyo-stock-exchange-prediction.zip to load_data_path\n",
" with zipfile.ZipFile(\"jpx-tokyo-stock-exchange-prediction.zip\",\"r\") as zip_ref:\n",
" zip_ref.extractall(load_data_path)\n",
" \n",
" # read train_files/stock_prices.csv\n",
" df_prices = pd.read_csv(f\"{load_data_path}/train_files/stock_prices.csv\", parse_dates=['Date'])\n",
" \n",
" \n",
" # Save the loaded data as a pickle file to be used by the transform_data component.\n",
" with open(f'{load_data_path}/df_prices', 'wb') as f:\n",
" pickle.dump(df_prices, f)\n",
" \n",
" return(print('Done!'))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"# transform data step\n",
"\n",
"def transform_data(load_data_path: InputPath(str), \n",
"def transform_data(data_path: InputPath(str), \n",
" transform_data_path: OutputPath(str)):\n",
" \n",
" # install the necessary libraries\n",

@@ -193,7 +189,7 @@
" from scipy import stats\n",
" \n",
" # load the df_prices data from load_data_path\n",
" with open(f'{load_data_path}/df_prices', 'rb') as f:\n",
" with open(f'{data_path}/df_prices', 'rb') as f:\n",
" df_prices = pickle.load(f)\n",
"\n",
" # sort data by 'Date' and 'SecuritiesCode'\n",

@@ -229,12 +225,12 @@
"tags": []
},
"source": [
"<h1>Feature Engineering"
"## Feature Engineering"
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [

@@ -319,13 +315,13 @@
"tags": []
},
"source": [
"<h1>Modelling\n",
"## Modelling\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [

@@ -407,12 +403,12 @@
"tags": []
},
"source": [
"<h1> Evaluation and Prediction"
"## Evaluation and Prediction"
]
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [

@@ -466,9 +462,18 @@
" return output_tuple(json.dumps(metrics))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create pipeline components \n",
"\n",
"using `create_component_from_func`"
]
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [

@@ -480,6 +485,50 @@
"predict_op = comp.create_component_from_func(prediction, base_image=\"python:3.7.1\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kubeflow pipeline creation"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"# define pipeline\n",
"@dsl.pipeline(name=\"jpx-tokyo-stock-exchange\", \n",
" description=\"Predicting real future returns of around 2,000 stocks.\")\n",
"\n",
"# Define parameters to be fed into pipeline\n",
"def tokyo_stock_exchange_pipeline(\n",
" dataset: str,\n",
" data_path: str,\n",
" transform_data_path: str, \n",
" feat_eng_data_path: str,\n",
" model_path:str\n",
" ):\n",
"\n",
" vop = dsl.VolumeOp(\n",
" name=\"create_volume\",\n",
" resource_name=\"data-volume\", \n",
" size=\"2Gi\", \n",
" modes=dsl.VOLUME_MODE_RWO)\n",
" \n",
" # Create load container.\n",
" load_container = load_op(dataset).add_pvolumes({\"/mnt\": vop.volume}).add_pod_label(\"kaggle-secret\", \"true\")\n",
" # Create transform container.\n",
" transform_container = transform_op(load_container.output)\n",
" # Create feature engineering container.\n",
" feature_eng_container = feature_eng_op(transform_container.output)\n",
" # Create modeling container.\n",
" modeling_container = modeling_op(feature_eng_container.output)\n",
" # Create prediction container.\n",
" predict_container = predict_op(modeling_container.output)"
]
},
{
"cell_type": "code",
"execution_count": 12,

@@ -496,64 +545,23 @@
"metadata": {},
"outputs": [],
"source": [
"# define pipeline\n",
"@dsl.pipeline(name=\"jpx-tokyo-stock-exchange\", \n",
" description=\"Predicting real future returns of around 2,000 stocks.\")\n",
"\n",
"# Define parameters to be fed into pipeline\n",
"def tokyo_stock_exchange_pipeline(\n",
" api_key: str,\n",
" load_data_path: str,\n",
" transform_data_path: str, \n",
" feat_eng_data_path: str,\n",
" model_path:str\n",
" ):\n",
"\n",
"\n",
" # Create load container.\n",
" load_container = load_op(api_key)\n",
" # Create transform container.\n",
" transform_container = transform_op(load_container.output)\n",
" # Create feature engineering container.\n",
" feature_eng_container = feature_eng_op(transform_container.output)\n",
" # Create modeling container.\n",
" modeling_container = modeling_op(feature_eng_container.output)\n",
" # Create prediction container.\n",
" predict_container = predict_op(modeling_container.output)\n",
" "
"# arguments\n",
"dataset = \"jpx-tokyo-stock-exchange-prediction\"\n",
"data_path = \"mnt/data\"\n",
"transform_data_path = \"tdp\"\n",
"feat_eng_data_path = \"feat\"\n",
"model_path = \"model\""
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdin",
"output_type": "stream",
"text": [
" ································\n"
]
}
],
"source": [
"# arguments\n",
"api_key = getpass.getpass() # enter only username here. Do not enter your api_key here\n",
"load_data_path = \"load\"\n",
"transform_data_path = \"tdp\"\n",
"feat_eng_data_path = \"feat\"\n",
"model_path = \"model\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<a href=\"/pipeline/#/experiments/details/c3070350-f6e1-4df2-a91a-2ac7f8be929b\" target=\"_blank\" >Experiment details</a>."
"<a href=\"/pipeline/#/experiments/details/124926a8-cc3d-4726-a356-5169e84ed762\" target=\"_blank\" >Experiment details</a>."
],
"text/plain": [
"<IPython.core.display.HTML object>"

@@ -565,7 +573,7 @@
{
"data": {
"text/html": [
"<a href=\"/pipeline/#/runs/details/befd3f20-224d-4b2d-84d6-30197c8adc60\" target=\"_blank\" >Run details</a>."
"<a href=\"/pipeline/#/runs/details/0fa09cdf-c976-4887-bcfa-a24b3968e294\" target=\"_blank\" >Run details</a>."
],
"text/plain": [
"<IPython.core.display.HTML object>"

@@ -582,8 +590,8 @@
"run_name = pipeline_func.__name__ + ' run1'\n",
"\n",
"arguments = {\n",
" \"api_key\": api_key,\n",
" \"load_data_path\": load_data_path,\n",
" \"dataset\": dataset,\n",
" \"data_path\": data_path,\n",
" \"transform_data_path\": transform_data_path,\n",
" \"feat_eng_data_path\": feat_eng_data_path,\n",
" \"model_path\":model_path\n",

@@ -642,6 +650,7 @@
"snapshot_volumes": true,
"steps_defaults": [
"label:access-ml-pipeline:true",
"label:kaggle-secret:true",
"label:access-rok:true"
],
"volume_access_mode": "rwm",

@@ -649,7 +658,7 @@
{
"annotations": [],
"mount_point": "/home/jovyan",
"name": "demoo-workspace-6rm6j",
"name": "dem-workspace-snqdc",
"size": 5,
"size_type": "Gi",
"snapshot": false,

@@ -0,0 +1,141 @@
# Objective
In this example we are going to convert this generic [notebook](https://github.com/josepholaide/examples/blob/telco/telco-customer-churn-kaggle-competition/telco-customer-churn-orig.ipynb)
based on the [Telco Customer Churn Prediction](https://www.kaggle.com/datasets/blastchar/telco-customer-churn) competition into a Kubeflow pipeline.

The objective of this task is to analyze customer behavior in the telecommunication sector and to predict their tendency to churn.

# Testing Environment

Environment:
| Name | version |
| ------------- |:-------------:|
| Kubeflow | v1.4 |
| kfp | 1.8.11 |
| kubeflow-kale | 0.6.0 |
| pip | 21.3.1 |

## Section 1: Overview

1. Vanilla KFP Pipeline: Kubeflow lightweight component method

To get started, visit the Kubeflow Pipelines [documentation](https://www.kubeflow.org/docs/components/pipelines/sdk/)
to get acquainted with what pipelines are, their components, pipeline metrics, and how to pass data between components in a pipeline.
There are different ways to build out a pipeline component as mentioned [here](https://www.kubeflow.org/docs/components/pipelines/sdk/build-pipeline/#building-pipeline-components).
In the following example, we are going to use the lightweight Python function-based components for building our Kubeflow pipeline.

2. Kale KFP Pipeline

To get started, visit Kale's [documentation](https://docs.arrikto.com/user/kale/index.html) to get acquainted with the
Kale user interface (UI) from a Jupyter Notebook, [notebook cell annotation](https://docs.arrikto.com/user/kale/jupyterlab/annotate.html)
and how to create a machine learning pipeline using Kale.
In the following example, we are going to use the Kale JupyterLab Extension to build our Kubeflow pipeline.

## Section 2: Vanilla KFP Pipeline

### Kubeflow lightweight component method
Here, a python function is created to carry out a certain task and the python function is passed to the kfp component method [`create_component_from_func`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.create_component_from_func).

The different components used in this example are:

- Load data
- Transform data
- Feature Engineering
- Catboost Modeling
- Xgboost Modeling
- Lightgbm Modeling
- Ensembling

## Kubeflow pipelines
A Kubeflow pipeline connects all components together to create a directed acyclic graph (DAG). The kfp [`dsl.pipeline`](https://www.kubeflow.org/docs/components/pipelines/sdk/sdk-overview/) decorator was used to create a pipeline function.
The kfp component methods [`InputPath`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.InputPath) and [`OutputPath`](https://kubeflow-pipelines.readthedocs.io/en/latest/source/kfp.components.html#kfp.components.OutputPath) were used to pass data between components in the pipeline.

Finally, the [`create_run_from_pipeline_func`](https://kubeflow-pipelines.readthedocs.io/en/stable/source/kfp.client.html) from the KFP SDK Client was used to submit the pipeline directly from the pipeline function.
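
A minimal submission sketch (kfp v1 SDK; the pipeline function name and arguments here are illustrative placeholders for the ones defined in the notebook):

```python
import kfp

# submit the pipeline function as a run; no manual compile step is needed
kfp.Client().create_run_from_pipeline_func(
    telco_customer_churn_pipeline,        # hypothetical pipeline function name
    arguments={"data_path": "/mnt"},      # illustrative pipeline parameters
    experiment_name="telco-customer-churn",
)
```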

## To create pipeline using Vanilla KFP

1. Open your Kubeflow Cluster, create a Notebook Server and connect to it.

2. Clone this repo and navigate to this directory.
3. Open the [telco-customer-churn-kfp](https://github.com/josepholaide/examples/blob/telco/telco-customer-churn-kaggle-competition/telco-customer-churn-kfp.ipynb) notebook
4. Run the telco-customer-churn-kfp notebook from start to finish
5. View run details immediately after submitting pipeline.

### View Pipeline

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/telco-customer-churn-kaggle-competition/images/telco-kfp-pipeline.PNG?raw=true" alt="telco-kfp-pipeline"/>
</p>

### View Pipeline Visualization

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/telco-customer-churn-kaggle-competition/images/telco-kfp-pipeline-visualization.PNG?raw=true" alt="telco-kfp-pipeline-visualization"/>
</p>

## Section 3: Kale KFP Pipeline

To create pipeline using the Kale JupyterLab extension

1. Clone the GitHub repo and navigate to this directory

2. Launch the [telco-customer-churn-kale](https://github.com/josepholaide/examples/blob/telco/telco-customer-churn-kaggle-competition/telco-customer-churn-kale.ipynb) Notebook

3. Install the requirements.txt file. After installation, restart the kernel.

4. Enable the Kale extension in JupyterLab

5. The notebook's cells are automatically annotated with Kale tags

To fully understand the different Kale tags available, visit the Kale [documentation](https://docs.arrikto.com/user/kale/jupyterlab/cell-types.html?highlight=pipeline%20metrics#annotate-pipeline-step-cells)

The following Kale tags were used in this example:

* Imports
* Pipeline Step
* Skip Cell

With the use of Kale tags we define the following:

* Pipeline parameters are assigned using the "pipeline parameters" tag
* The necessary libraries that need to be used throughout the Pipeline are passed through the "imports" tag
* Notebook cells are assigned to specific Pipeline components (download data, load data, etc.) using the "pipeline step" tag
* Cell dependencies are defined between the different pipeline steps with the "depends on" flag
* Pipeline metrics are assigned using the "pipeline metrics" tag

The pipeline steps created in this example:

* Load data
* Transform data
* Feature Engineering
* Catboost Modeling
* Xgboost Modeling
* Lightgbm Modeling
* Ensembling

6. Compile and run the Notebook by hitting the "Compile & Run" button in Kale's left panel

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/telco-customer-churn-kaggle-competition/images/kale-deployment-panel.PNG?raw=true" alt="kale-deployment-panel"/>
</p>

### View Pipeline

View Pipeline by clicking "View" in Kale's left panel

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/telco-customer-churn-kaggle-competition/images/view-pipeline.PNG?raw=true" alt="view-pipeline"/>
</p>

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/telco-customer-churn-kaggle-competition/images/telco-kale-pipeline.PNG?raw=true" alt="telco-kale-pipeline"/>
</p>

### View Pipeline Visualization

<p align=center>
<img src="https://github.com/josepholaide/examples/blob/master/telco-customer-churn-kaggle-competition/images/telco-kale-pipeline-visualization.PNG?raw=true" alt="telco-kale-pipeline-visualization"/>
</p>

@@ -0,0 +1 @@

(binary image diffs: 6 new images added, 21 KiB–50 KiB)

@@ -0,0 +1,6 @@
pandas
seaborn
lightgbm
catboost
xgboost
wget