
Instructions for Demo Setup

To set up your environment for running the demo, complete the following steps. These steps should only need to be completed once.

  1. Install tools locally
  2. Set environment variables
  3. Setup GCP project permissions
  4. Create a minikube cluster
  5. Create a GKE cluster
  6. Prepare the ksonnet app
  7. Generate and store artifacts
  8. Troubleshooting

1. Install tools locally

Ensure that you have at least the versions of these tools listed below (latest as of 2018-09-02). If you already do, skip to the next step.
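
If the tools are already installed, a quick way to check (assuming each is on your PATH) is to print its version:

docker --version
gcloud --version
ks version
kubectl version --client
minikube version
conda --version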

Install docker

The latest version for MacOS can be found here.

Install gcloud

The Google Cloud SDK can be found here.

Install kfctl

Clone the Kubeflow GitHub repository, create a symlink to kfctl.sh, and add the directory to your $PATH:

export KUBEFLOW_TAG=v0.3.1
git clone git@github.com:kubeflow/kubeflow.git
cd kubeflow/scripts
git checkout ${KUBEFLOW_TAG}
ln -s kfctl.sh kfctl
export PATH=${PATH}:`pwd`

Install kfp

Create a clean python environment for installing Kubeflow Pipelines:

conda create --name kfp python=3.6
source activate kfp

Install the Kubeflow Pipelines SDK:

pip install https://storage.googleapis.com/ml-pipeline/release/0.1.3-rc.2/kfp.tar.gz --upgrade

Troubleshooting

If you encounter any errors, run this before repeating the previous command:

pip uninstall kfp
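
As a quick sanity check, confirm that the SDK imports from the active environment:

python -c "import kfp; print('kfp imported successfully')"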

Install ksonnet

Download the correct binary based on your OS distro. The latest release can be found here.

# Linux
# export KS_VER=ks_0.12.0_linux_amd64
# MacOS
export KS_VER=ks_0.12.0_darwin_amd64
wget -O /tmp/$KS_VER.tar.gz https://github.com/ksonnet/ksonnet/releases/download/v0.12.0/$KS_VER.tar.gz
mkdir -p ${HOME}/bin
tar -xvf /tmp/$KS_VER.tar.gz -C ${HOME}/bin
export PATH=$PATH:${HOME}/bin/$KS_VER

Install kubectl

After installing the Google Cloud SDK, install the kubectl CLI by running this command:

gcloud components install kubectl

Install miniconda

Installation of Miniconda for MacOS:

INSTALL_FILE=Miniconda2-latest-MacOSX-x86_64.sh
wget -O /tmp/${INSTALL_FILE} https://repo.continuum.io/miniconda/${INSTALL_FILE}
chmod 744 /tmp/${INSTALL_FILE}
bash /tmp/${INSTALL_FILE}

Installation of conda for Ubuntu:

curl -O https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
sudo apt-get install -y bzip2 # Not installed by default on GCP VMs
chmod 744 Anaconda3-5.0.1-Linux-x86_64.sh
bash ./Anaconda3-5.0.1-Linux-x86_64.sh

Create a new Python 2.7 environment:

conda create -y -n kfdemo python=2 pip scipy gevent sympy
source activate kfdemo

Install minikube

For troubleshooting tips, see the Kubeflow user guide.

The below instructions install Minikube:

Linux

curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/

macOS

brew cask install minikube

Install VirtualBox

VirtualBox is required for minikube. To install, follow the instructions for your OS distro in the link.

Install TensorFlow

These instructions install TensorFlow. Choose the package based on whether you have GPUs locally:

# CPU only
pip install tensorflow==1.7.0
# With local GPUs
pip install tensorflow-gpu==1.7.0
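
To verify the installation in the active environment:

python -c "import tensorflow as tf; print(tf.__version__)"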

Install Tensor2Tensor

These instructions install tensor2tensor from source at tag v1.6.3:

git clone git@github.com:tensorflow/tensor2tensor.git
cd tensor2tensor
git checkout tags/v1.6.3
python setup.py install
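
To confirm that the t2t command-line tools used later in this guide are on your PATH:

which t2t-datagen t2t-exporter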

2. Set environment variables

Create a bash file with all the environment variables for this setup:

export DEMO_PROJECT=<your-project-name>
echo "source kubeflow-demo-base.env" >> ${DEMO_PROJECT}.env

Override any environment variables from kubeflow-demo-base.env and add any additional variables as required, then source the file:

source ${DEMO_PROJECT}.env
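
As an illustration, a ${DEMO_PROJECT}.env that overrides a few of the variables used later in this guide might look like the following (the values shown are placeholders, not defaults from kubeflow-demo-base.env):

# ${DEMO_PROJECT}.env
source kubeflow-demo-base.env

# Example overrides; adjust for your project
export DEMO_PROJECT=<your-project-name>
export CLUSTER=<your-cluster-name>
export ZONE=us-central1-b
export NAMESPACE=kubeflow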

If you have not set the GITHUB_TOKEN environment variable, follow these instructions to prevent rate-limiting errors from the GitHub API when installing ksonnet packages. Using a personal access token authorizes you as an individual rather than an anonymous user, which grants higher API rate limits.

Navigate to https://github.com/settings/tokens and generate a new token with no permissions. Save it somewhere safe; if you lose it, you will need to delete it and create a new one. Set the GITHUB_TOKEN environment variable, preferably in your .bash_profile file:

export GITHUB_TOKEN=<token>

3. Setup GCP project permissions

The GCP project kubeflow-demo-base has been created as part of kubeflow.org. It has quota for GPUs and TPUs and should be sufficient for most purposes. To access this project, create an issue in this repo and tag any of the approvers.

To create an entirely new project of your own, complete the following steps:

  1. Create a Google group
  2. Create an owners project
  3. Create a demo project
  4. Setup GKE service account permissions
  5. Setup minikube service account permissions

Create a Google group

To easily maintain access to GCP resources, create a Google group. Members can be added and removed over time as needed.

Set the following environment variables:

export GROUP_NAME=kubeflow-demos
export ORG_NAME=<your-org-name>
export DEMO_OWNERS_PROJECT_NAME=<unique_project_name>

Using the GAM CLI, execute the following command:

~/bin/gam/gam create group ${GROUP_NAME}@${ORG_NAME} who_can_join \
  invited_can_join name ${GROUP_NAME} \
  description "Group members with access to demos" \
  allow_external_members true

Create an owners project

Create a master project that allows creation of new projects with Deployment Manager.

gcloud projects create ${DEMO_OWNERS_PROJECT_NAME} \
  --organization=${ORG_NAME}

Add permissions to an owners project

Grant access to the Google group so that only members of ${GROUP_NAME}@${ORG_NAME} can create projects with Deployment Manager and register DNS records:

gcloud projects add-iam-policy-binding ${DEMO_OWNERS_PROJECT_NAME} \
  --member group:${GROUP_NAME}@${ORG_NAME} \
  --role=roles/deploymentmanager.editor

gcloud projects add-iam-policy-binding ${DEMO_OWNERS_PROJECT_NAME} \
  --member group:${GROUP_NAME}@${ORG_NAME} \
  --role=roles/kubeflow-dns

Create a demo project

Use Deployment Manager to easily create new projects for the demo.

To create a new project for use during demos:

  1. Create a config file:
cp project_creation/config-kubeflow-demo-base.yaml project_creation/config-${DEMO_PROJECT}.yaml
  • For ${DEMO_PROJECT}, use whatever name you want that isn't already taken. This name must be unique across all organizations, not just kubeflow.org.
  2. Modify project_creation/config-${DEMO_PROJECT}.yaml:
  • Change resources.name to ${DEMO_PROJECT}
  • Populate resources.properties.organization-id or resources.properties.parent-folder-id
  • Populate resources.properties.billing-account-name
  • Populate resources.properties.iam-policy-patch.add.members (both array elements)
  3. Run:
cd project_creation
gcloud deployment-manager deployments create ${DEMO_PROJECT} \
  --project=${DEMO_OWNERS_PROJECT_NAME} \
  --config=config-${DEMO_PROJECT}.yaml

After creating the deployment, it can be changed later with this command:

gcloud deployment-manager deployments update ${DEMO_PROJECT} \
  --project=${DEMO_OWNERS_PROJECT_NAME} \
  --config=config-${DEMO_PROJECT}.yaml
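
To confirm that the project was created, inspect the deployment:

gcloud deployment-manager deployments describe ${DEMO_PROJECT} \
  --project=${DEMO_OWNERS_PROJECT_NAME}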

Update Resource Quotas for the Project

Currently this has to be done via the UI. Change the project name in this [URL](https://console.cloud.google.com/iam-admin/quotas?project=kubeflow-demo-base&metric=Backend%20services,CPUs,CPUs%20(all%20regions%29,Health%20checks,NVIDIA%20K80%20GPUs,Persistent%20Disk%20Standard%20(GB%29&location=GLOBAL,us-central1,us-east1).

Suggested quota usages:

  • In regions us-east1 & us-central1:
    • 100 CPUs per region
    • 100,000 GB of Persistent Disk Standard per region
    • 10 K80 GPUs per region
  • 200 CPUs (all regions)
  • 10 backend services
  • 100 health checks

Quota increase requests are usually auto-approved quickly.

Setup GKE service account permissions

Create a service account for the GKE cluster:

SERVICE_ACCOUNT=${CLUSTER}@${DEMO_PROJECT}.iam.gserviceaccount.com
gcloud iam service-accounts create ${CLUSTER} --display-name=${CLUSTER}

Issue permissions to the service account:

gcloud projects add-iam-policy-binding ${DEMO_PROJECT} \
  --member=serviceAccount:${SERVICE_ACCOUNT} \
  --role=roles/storage.admin

Create a private key for the service account:

gcloud iam service-accounts keys create ${HOME}/.ssh/${CLUSTER}_key.json \
  --iam-account=${SERVICE_ACCOUNT}

Setup minikube service account permissions

To run from a cluster outside of GKE, such as minikube or Docker EE, Kubeflow needs access to credentials for a service account. To create one, issue the following commands:

SERVICE_ACCOUNT=minikube@${DEMO_PROJECT}.iam.gserviceaccount.com
gcloud iam service-accounts create minikube --display-name=minikube

Issue permissions to the service account:

gcloud projects add-iam-policy-binding ${DEMO_PROJECT} \
  --member=serviceAccount:${SERVICE_ACCOUNT} \
  --role=roles/storage.admin

Create a private key for the service account:

gcloud iam service-accounts keys create ${HOME}/.ssh/minikube_key.json \
  --iam-account=${SERVICE_ACCOUNT}

4. Create a minikube cluster

To start a minikube instance:

minikube start \
  --cpus 4 \
  --memory 8192 \
  --disk-size=50g \
  --kubernetes-version v1.10.7
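
Once the cluster is up, a quick check that kubectl can reach it:

minikube status
kubectl get nodes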

Create k8s secrets

Since our project is private, we need to provide access to resources through service accounts. Two different types of secrets are needed to store these credentials: one of type docker-registry for pulling images from GCR, and one of type generic for accessing private assets.

kubectl create namespace ${NAMESPACE}

kubectl -n ${NAMESPACE} create secret docker-registry gcp-registry-credentials \
  --docker-server=gcr.io \
  --docker-username=_json_key \
  --docker-password="$(cat ${HOME}/.ssh/minikube_key.json)" \
  --docker-email=minikube@${DEMO_PROJECT}.iam.gserviceaccount.com

kubectl -n ${NAMESPACE} create secret generic gcp-credentials \
  --from-file=key.json="${HOME}/.ssh/minikube_key.json"
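
To verify that both secrets were created:

kubectl -n ${NAMESPACE} get secrets gcp-registry-credentials gcp-credentials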

Setup context to include namespace

This allows kubectl to be used without specifying -n ${NAMESPACE}:

./create_context.sh minikube ${NAMESPACE}
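
If the helper script is not available, roughly the same effect can be achieved with kubectl directly (a sketch; the script may do more than this):

kubectl config set-context $(kubectl config current-context) --namespace=${NAMESPACE}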

Prepare the ksonnet app

Create the minikube environment:

cd ../demo
ks env add minikube --namespace=${NAMESPACE}

Set parameter values for training:

ks param set --env minikube t2tcpu \
  dataDir ${GCS_TRAINING_DATA_DIR}
ks param set --env minikube t2tcpu \
  outputGCSPath ${GCS_TRAINING_OUTPUT_DIR_LOCAL}
ks param set --env minikube t2tcpu \
  cpuImage gcr.io/${DEMO_PROJECT}/kubeflow-yelp-demo-cpu:latest
ks param set --env minikube t2tcpu \
  gpuImage gcr.io/${DEMO_PROJECT}/kubeflow-yelp-demo-gpu:latest

Set parameter values for serving component:

ks param set --env minikube serving modelPath ${GCS_TRAINING_OUTPUT_DIR_LOCAL}/export/Servo
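
To double-check the values that were just set:

ks param list --env minikube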

5. Create a GKE cluster

Choose one of the following options for creating a cluster and installing Kubeflow with pipelines:

  • Click-to-deploy
  • CLI (kfctl)

Click-to-deploy

This is the recommended path if you do not require access to GKE beta features such as TPUs and node auto-provisioning (NAP).

Generate a web app Client ID and Client Secret by following the instructions here. Save these as environment variables for easy access.

In the browser, navigate to the Click-to-deploy app. Enter the project name, along with the Client ID and Client Secret previously generated. Select the desired ${ZONE} and latest version of Kubeflow, then click Create Deployment.

In the GCP Console, navigate to the Kubernetes Engine panel to watch the cluster creation process. This results in a full cluster with Kubeflow installed.

CLI (kfctl)

If you require GKE beta features such as TPUs and node autoprovisioning (NAP), these instructions describe manual cluster creation and Kubeflow installation with kfctl.

Create service accounts

Create service accounts, add permissions, and download credentials:

ADMIN_EMAIL=${CLUSTER}-admin@${PROJECT}.iam.gserviceaccount.com
ADMIN_FILE=${HOME}/.ssh/${ADMIN_EMAIL}.json
USER_EMAIL=${CLUSTER}-user@${PROJECT}.iam.gserviceaccount.com
USER_FILE=${HOME}/.ssh/${USER_EMAIL}.json

gcloud iam service-accounts create ${CLUSTER}-admin --display-name=${CLUSTER}-admin
gcloud iam service-accounts create ${CLUSTER}-user --display-name=${CLUSTER}-user

gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${ADMIN_EMAIL} \
  --role=roles/source.admin
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${ADMIN_EMAIL} \
  --role=roles/servicemanagement.admin
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${ADMIN_EMAIL} \
  --role=roles/compute.networkAdmin
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${ADMIN_EMAIL} \
  --role=roles/storage.admin

gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${USER_EMAIL} \
  --role=roles/cloudbuild.builds.editor
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${USER_EMAIL} \
  --role=roles/viewer
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${USER_EMAIL} \
  --role=roles/source.admin
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${USER_EMAIL} \
  --role=roles/storage.admin
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${USER_EMAIL} \
  --role=roles/bigquery.admin
gcloud projects add-iam-policy-binding ${PROJECT} \
  --member=serviceAccount:${USER_EMAIL} \
  --role=roles/dataflow.admin

gcloud iam service-accounts keys create ${ADMIN_FILE} \
  --project ${PROJECT} \
  --iam-account ${ADMIN_EMAIL}
gcloud iam service-accounts keys create ${USER_FILE} \
  --project ${PROJECT} \
  --iam-account ${USER_EMAIL}

Create the cluster with gcloud

To create a GKE cluster for use with TPUs and node auto-provisioning (NAP), run the following commands (estimated time: 30 minutes):

gcloud beta container clusters create ${CLUSTER} \
  --project ${DEMO_PROJECT} \
  --zone ${ZONE} \
  --cluster-version 1.11 \
  --enable-ip-alias \
  --enable-tpu \
  --machine-type n1-highmem-8 \
  --num-nodes=5 \
  --scopes cloud-platform,compute-rw,storage-rw \
  --verbosity error

# Scale the cluster down to 3 nodes (the initial 5 just prevents master restarts due to upscaling).
# We cannot use 0 because the cluster autoscaler then treats the cluster as unhealthy.
# A few small non-GPU nodes are also needed to handle system pods.
gcloud container clusters resize ${CLUSTER} \
  --project ${DEMO_PROJECT} \
  --zone ${ZONE} \
  --size=3 \
  --node-pool=default-pool

# enable node auto-provisioning
gcloud beta container clusters update ${CLUSTER} \
  --project ${DEMO_PROJECT} \
  --zone ${ZONE} \
  --enable-autoprovisioning \
  --max-cpu 48 \
  --max-memory 312 \
  --max-accelerator=type=nvidia-tesla-k80,count=8

Once the cluster has been created, install GPU drivers:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml

Add RBAC permissions, which allow your user to install Kubeflow components on the cluster:

kubectl create clusterrolebinding cluster-admin-binding-${USER} \
  --clusterrole cluster-admin \
  --user $(gcloud config get-value account)

Setup kubectl access:

kubectl create namespace kubeflow
./create_context.sh gke ${NAMESPACE}

Set up the OAuth environment variables ${CLIENT_ID} and ${CLIENT_SECRET} using the instructions here, then create the following secrets:

kubectl create secret generic kubeflow-oauth \
  --from-literal=client_id=${CLIENT_ID} \
  --from-literal=client_secret=${CLIENT_SECRET}

kubectl create secret generic admin-gcp-sa \
  --from-file=admin-gcp-sa.json=${ADMIN_FILE}

kubectl create secret generic user-gcp-sa \
  --from-file=user-gcp-sa.json=${USER_FILE}

Install Kubeflow with kfctl

kfctl init ${CLUSTER} --platform gcp
cd ${CLUSTER}
kfctl generate k8s
kfctl apply k8s

To change the settings for any component, apply the change in ks_app/components/params.libsonnet, then delete and recreate the component. For example, to make changes to jupyterhub:

cd ks_app
sed -i "" "s/jupyterHubAuthenticator: 'iap'/jupyterHubAuthenticator: 'null'/" components/params.libsonnet
ks delete default -c jupyterhub
ks apply default -c jupyterhub

View the installed components in the GCP Console. In the Kubernetes Engine section, you will see a new cluster ${CLUSTER}. Under Workloads, you will see all the default Kubeflow and pipeline components.

6. Prepare the ksonnet app

The kfctl tool created a new ksonnet app in the directory ks_app. The ksonnet application files specific to this demo can be found in the ks_app directory of this repo.

Set parameter values for training components

cd ks_app

ks param set --env ${ENV} t2tcpu \
  dataDir ${GCS_TRAINING_DATA_DIR}
ks param set --env ${ENV} t2tcpu \
  outputGCSPath ${GCS_TRAINING_OUTPUT_DIR_CPU}
ks param set --env ${ENV} t2tcpu \
  cpuImage gcr.io/${DEMO_PROJECT}/kubeflow-yelp-demo-cpu:latest
ks param set --env ${ENV} t2tcpu \
  gpuImage gcr.io/${DEMO_PROJECT}/kubeflow-yelp-demo-gpu:latest

ks param set --env ${ENV} t2tgpu \
  dataDir ${GCS_TRAINING_DATA_DIR}
ks param set --env ${ENV} t2tgpu \
  outputGCSPath ${GCS_TRAINING_OUTPUT_DIR_GPU}
ks param set --env ${ENV} t2tgpu \
  cpuImage gcr.io/${DEMO_PROJECT}/kubeflow-yelp-demo-cpu:latest
ks param set --env ${ENV} t2tgpu \
  gpuImage gcr.io/${DEMO_PROJECT}/kubeflow-yelp-demo-gpu:latest

ks param set --env ${ENV_TPU} t2ttpu \
  dataDir ${GCS_TRAINING_DATA_DIR}
ks param set --env ${ENV_TPU} t2ttpu \
  outputGCSPath ${GCS_TRAINING_OUTPUT_DIR_TPU}
ks param set --env ${ENV_TPU} t2ttpu \
  cpuImage gcr.io/${DEMO_PROJECT}/kubeflow-yelp-demo-cpu:latest
ks param set --env ${ENV_TPU} t2ttpu \
  gpuImage gcr.io/${DEMO_PROJECT}/kubeflow-yelp-demo-gpu:latest

Set parameter values for serving component

Set modelPath to the output directory of the model you want to serve (CPU, GPU, or TPU). For example:

ks param set --env ${ENV} serving modelPath ${GCS_TRAINING_OUTPUT_DIR_GPU}/export/Servo
ks param set --env ${ENV_TPU} serving modelPath ${GCS_TRAINING_OUTPUT_DIR_TPU}/export/Servo

7. Generate and store artifacts

To safeguard against potential failures while running a live demo, pre-generate artifacts and squirrel them away for use in a break-glass scenario.

The following artifacts are useful to have ready:

  1. Training data
  2. Training & UI images
  3. Trained model files

Generate training data

Ensure that you are using the right conda environment:

source activate kfdemo

Set the following environment variable temporarily:

export MAX_CASES=0

In the ./yelp/yelp_sentiment/yelp_problem.py file, set the constant YELP_DATASET_URL to the full dataset (i.e. yelp-dataset.zip).

Generate a dataset for training and store it in GCS. ${GOOGLE_APPLICATION_CREDENTIALS} must be set properly for this to work.

Warning: this command takes around 45-60 mins to complete on the full Yelp dataset. Smaller versions are available for faster processing (yelp_review_10000.zip).

cd ../yelp/

t2t-datagen \
  --t2t_usr_dir=${USR_DIR} \
  --problem=${PROBLEM} \
  --data_dir=${GCS_TRAINING_DATA_DIR} \
  --tmp_dir=${TMP_DIR}-${MAX_CASES} \
  --max_cases=${MAX_CASES}
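
To confirm that the generated files landed in GCS:

gsutil ls ${GCS_TRAINING_DATA_DIR}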

Clean up the local data files:

rm -rf ${TMP_DIR}-${MAX_CASES}

Generate training and UI images

Generate all necessary Docker images and store them in GCR. This builds a CPU image, a GPU image, and a UI image.

cd ..
make PROJECT=${DEMO_PROJECT} set-image
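
To confirm that the images were pushed to GCR:

gcloud container images list --repository=gcr.io/${DEMO_PROJECT}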

Generate trained model files

Warning: this command takes 8+ hours to complete.

cd demo
ks param set --env ${ENV} t2tcpu trainSteps 20000
ks param set --env ${ENV} t2tcpu dataDir ${GCS_TRAINING_DATA_DIR}
ks param set --env ${ENV} t2tcpu outputGCSPath ${GCS_TRAINING_OUTPUT_DIR_CPU}
ks param set --env ${ENV} t2tcpu cpuImage gcr.io/${DEMO_PROJECT}/kubeflow-yelp-demo-cpu:latest
ks apply ${ENV} -c t2tcpu
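
Training runs on the cluster (as a TFJob, assuming the t2tcpu component creates one); to check on its progress:

kubectl -n ${NAMESPACE} get tfjobs
kubectl -n ${NAMESPACE} get pods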

Export the trained model

This will export the model to an export/ directory under --output_dir (here, ${GCS_TRAINING_OUTPUT_DIR_CPU}).

cd ../yelp
t2t-exporter \
  --t2t_usr_dir=${USR_DIR} \
  --model=${MODEL} \
  --hparams_set=${HPARAMS_SET} \
  --problem=${PROBLEM} \
  --data_dir=${GCS_TRAINING_DATA_DIR} \
  --output_dir=${GCS_TRAINING_OUTPUT_DIR_CPU}
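
To confirm that a SavedModel was exported:

gsutil ls ${GCS_TRAINING_OUTPUT_DIR_CPU}/export/Servo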

8. Troubleshooting

Updating node pools in CPU/GPU clusters

The update method for node pools does not allow arbitrary fields to be changed. To make a change to node pools, do the following:

  • Make any changes to the node pool config
  • Bump the property pool-version
    • This causes the existing pool to be deleted and new ones to be created with a different name.
  • Issue an update command:
gcloud deployment-manager deployments update gke-${CLUSTER} \
  --project=${DEMO_PROJECT} \
  --config=gke/cluster-${DEMO_PROJECT}.yaml