Kimonas Sotirchos 067ba59439 Update the examples with correct image paths and packages (#1016): fix ipynb images to be file paths, and not relevant urls; don't explicitly set the kale image; update packages. Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>		2022-11-22 21:49:42 +00:00
images	Conversion of G-research-crypto-forecasting and American express Kaggle notebooks to KFP Pipeline (#986)	2022-08-17 01:23:50 +00:00
README.md	Use kubeflow/examples as base link for G-Research (#1006)	2022-11-01 17:26:45 +00:00
g-research-crypto-forecast-kale.ipynb	Update the examples with correct image paths and packages (#1016)	2022-11-22 21:49:42 +00:00
g-research-crypto-forecast-kfp.ipynb	Update the examples with correct image paths and packages (#1016)	2022-11-22 21:49:42 +00:00
g-research-crypto-forecast-orig.ipynb	Update the examples with correct image paths and packages (#1016)	2022-11-22 21:49:42 +00:00
requirements.txt	Update the examples with correct image paths and packages (#1016)	2022-11-22 21:49:42 +00:00

README.md

Objective

In this example we are going to convert this generic notebook based on the Kaggle G-Research Crypto Forecast competition into a Kubeflow pipeline.

The objective of this task is to correctly forecast short-term returns in 14 popular cryptocurrencies. The dataset contains information on historic trades for several cryptoassets, such as Bitcoin and Ethereum.

Testing Environment

Environment:

Name	Version
Kubeflow	v1.4
kfp	1.8.11
kubeflow-kale	0.6.0
pip	21.3.1
kaggle	1.5.12
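
A quick way to compare a Notebook Server against the versions in this table is sketched below. It is a minimal check, not part of the original example: the function name find_mismatches is hypothetical, and Kubeflow itself is left out because it is a cluster deployment rather than a pip package.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the table above. Kubeflow is a cluster deployment,
# not a pip package, so it is not checked here.
EXPECTED = {
    "kfp": "1.8.11",
    "kubeflow-kale": "0.6.0",
    "pip": "21.3.1",
    "kaggle": "1.5.12",
}


def find_mismatches(expected, lookup=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in find_mismatches(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means the environment matches the table; other versions may still work, but they were not the ones tested here.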

Section 1: Overview

Vanilla KFP Pipeline: Kubeflow lightweight component method

To get started, visit the Kubeflow Pipelines documentation to get acquainted with what pipelines are, their components, pipeline metrics, and how to pass data between components in a pipeline. There are different ways to build a pipeline component, as the documentation describes. In the following example, we use lightweight Python function-based components to build our Kubeflow pipeline.

Kale KFP Pipeline

To get started, visit Kale's documentation to get acquainted with the Kale user interface (UI) in a Jupyter Notebook, notebook cell annotation, and how to create a machine learning pipeline using Kale. In the following example, we use the Kale JupyterLab extension to build our Kubeflow pipeline.

Section 2: Prepare environment for data download

Open your Kubeflow Cluster, create a Notebook Server and connect to it.
- Fill in your Notebook server name and leave the rest as default.
- Launch Notebook Server.
Download the G-Research dataset using Kaggle's API. To do this:
- Login to Kaggle and click on your user profile picture.
- Click on ‘Account’.
- Under ‘Account’, navigate to the ‘API’ section.
- Click ‘Create New API token’.
- After you create the token, a kaggle.json file is downloaded automatically. It contains the ‘username’ and ‘api-key’ needed to download the dataset.
- Create a Kubernetes secret to handle the sensitive API credentials and to prevent you from passing your credentials in plain text to the pipeline notebook.
```
# Run from a notebook cell; replace <username> and <api-key> with the values from kaggle.json.
!kubectl create secret generic -n kubeflow-user kaggle-secret --from-literal=username="<username>" --from-literal=password="<api-key>"
```
- Create a secret PodDefault YAML file named kaggle_pod.yaml in your Kubeflow namespace.
```
apiVersion: "kubeflow.org/v1alpha1"
kind: PodDefault
metadata:
  name: kaggle-secret
  namespace: kubeflow-user
spec:
  # Pods carrying this label get the secret volume mounted.
  selector:
    matchLabels:
      kaggle-secret: "true"
  desc: "kaggle-secret"
  volumeMounts:
    - name: secret-volume
      mountPath: /secret/kaggle-secret
      readOnly: false
  volumes:
    - name: secret-volume
      secret:
        secretName: kaggle-secret
```
- Apply the PodDefault YAML file by running kubectl apply -f kaggle_pod.yaml.
- After successfully deploying the PodDefault, create a new Notebook Server and add the kaggle-secret configuration to the new Notebook Server that runs the Kale or KFP pipeline.
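
Once the kaggle-secret configuration is attached to a Notebook Server, Kubernetes exposes each key of the secret as a file under the mount path from kaggle_pod.yaml. The sketch below shows one way to turn those files into the environment variables the Kaggle client reads (KAGGLE_USERNAME and KAGGLE_KEY). The helper name load_kaggle_credentials is hypothetical, and the notebooks in this directory may read the secret differently.

```python
import os
from pathlib import Path

# Mount path taken from kaggle_pod.yaml; the secret keys "username" and
# "password" appear as files inside this directory.
SECRET_DIR = "/secret/kaggle-secret"


def load_kaggle_credentials(secret_dir=SECRET_DIR, env=os.environ):
    """Copy the mounted secret into the variables the Kaggle client reads."""
    secret_path = Path(secret_dir)
    env["KAGGLE_USERNAME"] = (secret_path / "username").read_text().strip()
    env["KAGGLE_KEY"] = (secret_path / "password").read_text().strip()
    return env["KAGGLE_USERNAME"]
```

Calling load_kaggle_credentials() before importing the kaggle package keeps the credentials out of the notebook source.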

Section 3: Vanilla KFP Pipeline

Kubeflow lightweight component method

Here, each task is written as a Python function, and that function is then passed to the kfp component method create_component_from_func.

The different components used in this example are:

- Download data
- Load data
- Feature Engineering
- Merge Assets and Features
- Modelling
- Functions
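
As an illustration of the lightweight component method, the sketch below wraps one self-contained function with create_component_from_func from the kfp 1.8 SDK. The function count_rows, the factory make_count_rows_op, and the base image are assumptions made for this example; they are not the components defined in the notebook.

```python
def count_rows(csv_path: str) -> int:
    """Hypothetical step: count the data rows in a CSV file.

    Lightweight components are rebuilt from the function's source code,
    so every import the function needs lives inside the function body.
    """
    import csv

    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def make_count_rows_op():
    """Wrap count_rows as a reusable KFP component (needs kfp 1.8.x)."""
    # Imported lazily so count_rows can be tested without kfp installed.
    from kfp.components import create_component_from_func

    return create_component_from_func(
        count_rows,
        base_image="python:3.8",  # assumed image, adjust to your cluster
    )
```

Keeping the function free of module-level dependencies means it can be checked locally before it is turned into a component.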

Kubeflow pipelines

A Kubeflow pipeline connects all components together to create a directed acyclic graph (DAG). The kfp dsl.pipeline decorator was used to create a pipeline function. The kfp.dsl.VolumeOp method was used to create a PersistentVolumeClaim that requests data storage. This storage is used to pass data between components in the pipeline.

Finally, create_run_from_pipeline_func from the KFP SDK Client was used to submit the pipeline directly from the pipeline function.
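
The pieces named above fit together roughly as in the following two-step sketch for kfp 1.8.x. The step functions, the pipeline name, the volume name and size, and the /data mount path are all assumptions for illustration; the real pipeline in the notebook has more steps.

```python
def download_data(data_dir: str) -> str:
    """Hypothetical stand-in for the download step: write a tiny CSV."""
    import os

    os.makedirs(data_dir, exist_ok=True)
    path = os.path.join(data_dir, "train.csv")
    with open(path, "w") as handle:
        handle.write("timestamp,Close\n1,10.0\n2,11.0\n")
    return path


def load_data(csv_path: str) -> int:
    """Hypothetical stand-in for the load step: count the data rows."""
    with open(csv_path) as handle:
        return len(handle.read().splitlines()) - 1  # minus the header


def build_pipeline():
    """Return a pipeline function wiring the two steps (needs kfp 1.8.x)."""
    from kfp import dsl
    from kfp.components import create_component_from_func

    download_op = create_component_from_func(download_data, base_image="python:3.8")
    load_op = create_component_from_func(load_data, base_image="python:3.8")

    @dsl.pipeline(name="g-research-sketch", description="Two-step sketch")
    def pipeline(data_dir: str = "/data"):
        # VolumeOp creates the PersistentVolumeClaim shared by the steps.
        vop = dsl.VolumeOp(
            name="create-pvc",
            resource_name="data-volume",
            size="1Gi",
            modes=dsl.VOLUME_MODE_RWO,
        )
        download = download_op(data_dir).add_pvolumes({"/data": vop.volume})
        # Consuming download.output makes load run after download.
        load_op(download.output).add_pvolumes({"/data": vop.volume})

    return pipeline


def submit(arguments=None):
    """Submit the pipeline from inside a Kubeflow Notebook Server."""
    import kfp

    client = kfp.Client()
    return client.create_run_from_pipeline_func(
        build_pipeline(), arguments=arguments or {}
    )
```

Both steps mount the same volume, so the file written by the first step is visible to the second.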

To create pipeline using Vanilla KFP

Open your Kubeflow Cluster and do the following:
- Create a new Notebook Server
- Set the CPU specification to 8 and RAM to 16 GB
- Add the kaggle-secret configuration to the new Notebook Server.
Create a new Terminal and clone this repo. After cloning, navigate to this directory.
Open the g-research-crypto-forecast-kfp notebook
Run the g-research-crypto-forecast-kfp notebook from start to finish.
View the run details immediately after submitting the pipeline.

View Pipeline

[Figure: kubeflow pipeline]

View Pipeline Metric

[Figure: kubeflow pipeline metrics]
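
For context, the KFP v1 UI picks up metrics from a component output named mlpipeline_metrics that holds a small JSON document. The sketch below follows the pattern from the Kubeflow Pipelines documentation; the function name evaluate and the metric name validation-score are hypothetical and may differ from what the notebook reports.

```python
from typing import NamedTuple


def evaluate(score: float) -> NamedTuple("Outputs", [("mlpipeline_metrics", "Metrics")]):
    """Hypothetical evaluation step that reports one pipeline metric."""
    import json

    metrics = {
        "metrics": [
            {
                "name": "validation-score",  # lowercase letters, digits, dashes
                "numberValue": float(score),
                "format": "RAW",  # plain number rather than a percentage
            }
        ]
    }
    return [json.dumps(metrics)]
```

Wrapped with create_component_from_func, a step like this makes the value appear in the run's metrics view.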

Section 4: Kale KFP Pipeline

To create a KFP pipeline using the Kale JupyterLab extension

Open your Kubeflow Cluster and do the following:
- Create a new Notebook Server
- Set the CPU specification to 8 and RAM to 16 GB
- Increase the Workspace Volume to 15Gi
- Add the kaggle-secret configuration to the new Notebook Server.
Create a new Terminal and clone this repo. After cloning, navigate to this directory.
Launch the g-research-crypto-forecast-orig.ipynb Notebook
Install the packages listed in requirements.txt. After installation, restart the kernel.
Enable the Kale extension in JupyterLab
Ensure the notebook cells are annotated with Kale tags, as they are in the g-research-crypto-forecast-kale.ipynb Notebook

To fully understand the different Kale tags available, visit the Kale documentation.

The following Kale tags were used in this example:
- Imports
- Pipeline Parameters
- Functions
- Pipeline Metrics
- Pipeline Step
- Skip Cell
With the use of Kale tags we define the following:
- Pipeline parameters are assigned using the "pipeline parameters" tag
- The necessary libraries that need to be used throughout the Pipeline are passed through the "imports" tag
- Notebook cells are assigned to specific Pipeline components (download data, load data, etc.) using the "pipeline step" tag
- Cell dependencies are defined between the different pipeline steps with the "depends on" flag
- Pipeline metrics are assigned using the "pipeline metrics" tag
The pipeline steps created in this example:
- Download data
- Load data
- Feature Engineering
- Merge Assets and Features
- Modelling
- Evaluation
Compile and run the Notebook by clicking the "Compile & Run" button in Kale's left panel.
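
Kale stores its annotations as ordinary cell tags in the notebook's JSON, so the step layout described above can be inspected without opening JupyterLab. The sketch below assumes that pipeline steps are tagged block:<step> and dependencies prev:<step>; confirm these tag strings against the Kale documentation for your version. The helper names are hypothetical.

```python
import json


def kale_tags_by_cell(notebook: dict) -> list:
    """Return the tag list of each code cell, in notebook order."""
    return [
        cell.get("metadata", {}).get("tags", [])
        for cell in notebook.get("cells", [])
        if cell.get("cell_type") == "code"
    ]


def step_dependencies(notebook: dict) -> dict:
    """Map each step name to the steps it depends on.

    Assumes Kale's "block:<step>" and "prev:<step>" tag conventions.
    """
    deps = {}
    for tags in kale_tags_by_cell(notebook):
        steps = [t.split(":", 1)[1] for t in tags if t.startswith("block:")]
        prevs = [t.split(":", 1)[1] for t in tags if t.startswith("prev:")]
        for step in steps:
            if step:  # ignore tags with an empty step name
                deps.setdefault(step, []).extend(prevs)
    return deps


def load_notebook(path: str) -> dict:
    """Read an .ipynb file, which is plain JSON on disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, step_dependencies(load_notebook("g-research-crypto-forecast-kale.ipynb")) lists the steps and their upstream dependencies as Kale would see them.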