# Training the model using TFJob Kubeflow offers a TensorFlow job controller for Kubernetes. This allows you to run your distributed Tensorflow training job on a Kubernetes cluster. For this training job, we will read our training data from Google Cloud Storage (GCS) and write our output model back to GCS. ## Create the image for training The [notebooks](notebooks) directory contains the necessary files to create an image for training. The [train.py](notebooks/train.py) file contains the training code. Here is how you can create an image and push it to Google Container Registry (GCR): ```bash cd notebooks/ make PROJECT=${PROJECT} set-image ``` ## Train Using PVC If you don't have access to GCS or do not wish to use GCS, you can use a Persistent Volume Claim (PVC) to store the data and model. Note: your cluster must have a default storage class defined for this to work. Create a PVC: ``` ks apply --env=${KF_ENV} -c data-pvc ``` Run the job to download the data to the PVC: ``` ks apply --env=${KF_ENV} -c data-downloader ``` Submit the training job ``` ks apply --env=${KF_ENV} -c tfjob-pvc ``` The resulting model will be stored on the PVC, so to access it you will need to run a pod and attach the PVC. For serving, you can just attach it to the pod serving the model. ## Training Using GCS If you are using GCS, you can train using GCS to store the input and the resulting model. ### GCS service account * Create a service account that will be used to read and write data from the GCS bucket. * Give the storage account `roles/storage.admin` role so that it can access GCS buckets. * Download its key as a json file and create a secret named `user-gcp-sa` with the key `user-gcp-sa.json` ```bash SERVICE_ACCOUNT=github-issue-summarization PROJECT=kubeflow-example-project # The GCP Project name gcloud iam service-accounts --project=${PROJECT} create ${SERVICE_ACCOUNT} \ --display-name "GCP Service Account for use with kubeflow examples" gcloud projects add-iam-policy-binding ${PROJECT} --member \ serviceAccount:${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com --role=roles/storage.admin KEY_FILE=/home/agwl/secrets/${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com.json gcloud iam service-accounts keys create ${KEY_FILE} \ --iam-account ${SERVICE_ACCOUNT}@${PROJECT}.iam.gserviceaccount.com kubectl --namespace=${NAMESPACE} create secret generic user-gcp-sa --from-file=user-gcp-sa.json="${KEY_FILE}" ``` ### Run the TFJob using your image [ks_app](ks_app) contains a ksonnet app to deploy the TFJob. Set the appropriate params for the tfjob component: ```bash cd ks_app ks param set tfjob namespace ${NAMESPACE} --env=${KF_ENV} # The image pushed in the previous step ks param set tfjob image "gcr.io/agwl-kubeflow/tf-job-issue-summarization:latest" --env=${KF_ENV} # Sample Size for training ks param set tfjob sample_size 100000 --env=${KF_ENV} # Set the input and output GCS Bucket locations ks param set tfjob input_data_gcs_bucket "kubeflow-examples" --env=${KF_ENV} ks param set tfjob input_data_gcs_path "github-issue-summarization-data/github-issues.zip" --env=${KF_ENV} ks param set tfjob output_model_gcs_bucket "kubeflow-examples" --env=${KF_ENV} ks param set tfjob output_model_gcs_path "github-issue-summarization-data/output_model.h5" --env=${KF_ENV} ``` Deploy the app: ```bash ks apply ${KF_ENV} -c tfjob ``` In a while you should see a new pod with the label `tf_job_name=tf-job-issue-summarization` ```bash kubectl get pods -n=${NAMESPACE} tfjob-issue-summarization-master-0 ``` You can view the training logs using ```bash kubectl logs -f -n=${NAMESPACE} tfjob-issue-summarization-master-0 ``` You can view the logs of the tf-job operator using ```bash kubectl logs -f -n=${NAMESPACE} $(kubectl get pods -n=${NAMESPACE} -lname=tf-job-operator -o=jsonpath='{.items[0].metadata.name}') ``` _(Optional)_ You can also perform training with two alternate methods: - [Training the model with a notebook](02_training_the_model.md) - [Distributed training using Estimator](02_distributed_training.md) *Next*: [Serving the model](03_serving_the_model.md) *Back*: [Setup a kubeflow cluster](01_setup_a_kubeflow_cluster.md)