* A notebook to run the mnist E2E example on GCP.
This fixes a number of issues with the example
* Use Istio instead of Ambassador to add reverse proxy routes
* The training job needs to be updated to run in a profile created namespace in order to have the required service accounts
* See kubeflow/examples#713
* Running from a notebook on Kubeflow should ensure the user
is working in an appropriately set up namespace
* With Istio, the default RBAC rules prevent the web UI from sending requests to the model server
* A short-term fix was to not include the Istio sidecar
* In the future we can add an appropriate Istio RBAC policy
* Using a notebook allows us to eliminate the use of kustomize
* This resolves kubeflow/examples#713, which required people to use
an old version of kustomize
* Rather than using kustomize we can use Python f-strings to
write the YAML specs and then easily substitute in user-specific
values (see the sketch below)
* This should be more informative; it avoids introducing kustomize and
users can see the resource specs.
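For example, a minimal sketch of the f-string approach; the namespace, image, and bucket values are hypothetical:

```python
# Illustrative sketch: substitute user-specific values into a YAML spec
# with a Python f-string instead of a kustomize overlay. All values here
# (namespace, image, bucket) are placeholders.
namespace = "my-profile"                      # profile-created namespace
image = "gcr.io/my-project/mnist:latest"      # image built by fairing
model_dir = "gs://my-bucket/mnist/model"

train_spec = f"""apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train
  namespace: {namespace}
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: {image}
            command: ["python", "/opt/model.py", "--model_dir={model_dir}"]
"""
```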
* I've opted to make the notebook GCP-specific. I think it's less confusing
to users to have separate notebooks focused on specific platforms rather
than having one notebook with a lot of caveats about what to do under
different conditions
* I've deleted the kustomize overlays for GCS since we don't want users to
use them anymore
* I used fairing and kaniko to eliminate the use of Docker to build the images,
so that everything can run from a notebook running inside the cluster
(see the sketch below).
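A minimal sketch of the in-cluster build, assuming the kubeflow-fairing SDK; the registry, base image, and input files are placeholders:

```python
# Sketch: build the image in-cluster with kaniko via Kubeflow Fairing,
# so no local Docker daemon is needed. Registry and base image are
# assumptions, not values from the example.
from kubeflow.fairing.builders.cluster import cluster, gcs_context
from kubeflow.fairing.preprocessors import base as base_preprocessor

preprocessor = base_preprocessor.BasePreProcessor(input_files=["model.py"])

builder = cluster.ClusterBuilder(
    registry="gcr.io/my-project",               # hypothetical registry
    base_image="tensorflow/tensorflow:1.15.2-py3",
    preprocessor=preprocessor,
    context_source=gcs_context.GCSContextSource())  # kaniko context in GCS
builder.build()
print("Pushed image:", builder.image_tag)
```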
* k8s_utils.py has some reusable functions that hide low-level details
from users (e.g. direct calls to the K8s APIs).
* Change the mnist test to just run the notebook
* Copy the notebook test infra for xgboost_synthetic to py/kubeflow/examples/notebook_test to make it more reusable
* Fix lint.
* Update for lint.
* A notebook to run the mnist E2E example.
Related to: kubeflow/website#1553
* 1. Use fairing to build the model. 2. Construct the YAML spec directly in the notebook. 3. Use the TFJob Python SDK (sketched below).
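For step 3, a minimal sketch assuming the kubeflow-tfjob SDK; `train_spec` and `namespace` refer to the f-string sketch above, and the job name is illustrative:

```python
# Sketch: submit and wait on the TFJob with the TFJob Python SDK.
import yaml
from kubeflow.tfjob.api import tf_job_client as tf_job_client_module

tf_job_client = tf_job_client_module.TFJobClient()
tf_job_client.create(yaml.safe_load(train_spec), namespace=namespace)

# Block until the job reaches a terminal state.
tf_job_client.wait_for_condition(
    "mnist-train",                              # hypothetical job name
    expected_condition=["Succeeded", "Failed"],
    namespace=namespace)
```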
* Fix the ISTIO rule.
* Fix UI and serving; need to update TF serving to match version trained on.
* Get the IAP endpoint.
* Start writing some helper python functions for K8s.
* Commit before switching from replace to delete.
* Create a library to bulk create objects.
* Cleanup.
* Add back k8s_util.py
* Delete train.yaml; this shouldn't have been added.
* Update the notebook image.
* Refactor code into k8s_util; print out links.
* Clean up the notebook. Should be working E2E.
* Added a section to get logs from Stackdriver (see the sketch below).
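A minimal sketch of what that section does, assuming the google-cloud-logging client; the project and label filters are hypothetical:

```python
# Sketch: read the training pod's container logs from Stackdriver.
from google.cloud import logging as gcp_logging

client = gcp_logging.Client(project="my-project")   # hypothetical project
log_filter = (
    'resource.type="k8s_container" '
    'resource.labels.namespace_name="my-profile" '
    'resource.labels.pod_name:"mnist-train"')
for entry in client.list_entries(filter_=log_filter, page_size=20):
    print(entry.timestamp, entry.payload)
```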
* Add comment about profile.
* Latest.
* Override mnist_gcp.ipynb with mnist.ipynb
I accidentally put my latest changes in mnist.ipynb even though that file
was deleted.
* More fixes.
* Resolve some conflicts from the rebase; override with changes on remote branch.
* Update xgboost_synthetic test infra to use pytest and pyfunc.
* Related to #655: update xgboost_synthetic to use workload identity
* Related to #665: no signal about xgboost_synthetic
* We need to update the xgboost_synthetic example to work with 0.7.0;
e.g. workload identity
* This PR focuses on updating the test infra and some preliminary
updates to the notebook
* More fixes to the test and the notebook are probably needed in order
to get it to actually pass
* Update job spec for 0.7; remove the secret and set the default service
account.
* This is to make it work with workload identity
* Instead of using kustomize to define the job to run the notebook, we can just modify the YAML spec using Python.
* Use the Python K8s API to create the job rather than shelling out (see the sketch below).
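A minimal sketch covering both bullets, assuming a base `job.yaml` and the `default-editor` service account that profiles typically bind for workload identity:

```python
# Sketch: load the job spec, adapt it for workload identity, and create it
# with the K8s Python client instead of kubectl. The file name, namespace,
# and service account are assumptions.
import yaml
from kubernetes import client, config

with open("job.yaml") as f:
    job = yaml.safe_load(f)

pod_spec = job["spec"]["template"]["spec"]
# Workload identity: use the namespace's service account rather than a
# mounted GCP credentials secret.
pod_spec["serviceAccountName"] = "default-editor"
pod_spec.pop("volumes", None)
for c in pod_spec["containers"]:
    c["env"] = [e for e in c.get("env", [])
                if e.get("name") != "GOOGLE_APPLICATION_CREDENTIALS"]

config.load_incluster_config()   # running inside the cluster
client.BatchV1Api().create_namespaced_job("my-profile", job)
```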
* The notebook should do a 0.7-compatible check for credentials (sketched below)
* We don't want to assume GOOGLE_APPLICATION_CREDENTIALS is set
because we will be using workload identity.
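A minimal sketch of such a check, assuming the google-auth package:

```python
# Sketch: 0.7-compatible credentials check. With workload identity there is
# no GOOGLE_APPLICATION_CREDENTIALS key file; application default
# credentials come from the metadata server instead.
import os
import google.auth

if os.getenv("GOOGLE_APPLICATION_CREDENTIALS"):
    print("Using key file:", os.environ["GOOGLE_APPLICATION_CREDENTIALS"])
else:
    credentials, project = google.auth.default()
    print("Using application default credentials for project:", project)
```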
* Take in repos as an argument akin to what checkout_repos.sh requires
* Convert xgboost_test.py to a pytest.
* This allows us to mark it as expected to fail so we can start to get
signal without blocking
* We also need to emit junit files so results show up in testgrid (see the sketch below).
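A minimal sketch of the test shape; the notebook-running helper is a stub:

```python
# Sketch: mark the test as expected-to-fail so it produces signal without
# blocking presubmits.
import pytest

def run_notebook():
    # Placeholder for the logic that executes the notebook and raises on
    # any cell failure.
    raise NotImplementedError

@pytest.mark.xfail(reason="xgboost_synthetic is not passing yet; see #665")
def test_xgboost_synthetic():
    run_notebook()
```

Running it with `pytest xgboost_test.py --junitxml=/tmp/junit.xml` emits the junit file that testgrid consumes.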
* Convert the jsonnet workflow for the E2E test to a Python function that
defines the workflow (sketched below).
* Remove the old jsonnet workflow.
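A minimal sketch of the idea; the image, command, and names are illustrative rather than the actual workflow definition:

```python
# Sketch: define the Argo workflow as a plain Python function returning a
# dict, replacing the jsonnet definition. Everything here is illustrative.
def create_workflow(name, namespace):
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "entrypoint": "e2e",
            "templates": [{
                "name": "e2e",
                "container": {
                    "image": "gcr.io/my-project/test-worker:latest",
                    "command": ["pytest", "xgboost_test.py",
                                "--junitxml=/tmp/junit.xml"],
                },
            }],
        },
    }
```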
* Address comments.
* Fix issues with the notebook
* Install pip packages in user space
* 0.7.0 images are based on TF images and they have different permissions
* Install a newer version of fairing sdk that works with workload identity
* Split pip installing dependencies out of util.py and into notebook_setup.py
* That's because util.py could depend on the packages being installed by
notebook_setup.py
* After pip installing the modules into user space, we need to add the local
pip package path to the Python path, otherwise we get import-not-found
errors (see the sketch below).
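A minimal sketch of the path fix:

```python
# Sketch: after `pip install --user ...`, make sure the user site-packages
# directory is importable in the current kernel.
import site
import sys

user_site = site.getusersitepackages()
if user_site not in sys.path:
    sys.path.insert(0, user_site)
```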
* Remove modules from .pylintrc
* Add lint inline exceptions
* Add inline lint exceptions, as the specific exception is not available for Pylint 1.8
* Fix string formatting logging message and remove unnecessary Pylint exception
* Update app.yaml with correct environment details
* Add object detection gRPC client
Fixes: #377
* Fix kubeflow-examples-presubmit error
object_detection_grpc_client.py depends on other files in
https://github.com/tensorflow/models.git that need to be generated
manually, so pylint fails on them.
Since mnist_DDP.py has a similar dependency, just follow the
mnist_DDP.py approach and skip lint checking for this file.
* Default to model trained with CPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models
* Check out the 1.0rc1 release, as the latest PyTorch master seems to have broken MPI backend detection
* Track changes in pytorch_mnist/training/ddp/mnist folder to trigger test jobs
* Repoint to pull images from gcr.io/kubeflow-ci built during pre-submit
* Fix image webui name
* Fix logging
* Add GCFS to CPU train
* Fix logging
* Add GCFS to CPU train
* Default to model trained with GPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models
* Fix Predict() method as Seldon expects 3 arguments (see the sketch below)
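A minimal sketch of the expected method shape; the inference body is a stub:

```python
# Sketch: Seldon's Python wrapper invokes predict(self, X, features_names),
# i.e. three arguments including self. Inference here is a placeholder.
import numpy as np

class MnistModel:
    def predict(self, X, features_names):
        X = np.asarray(X)
        # Replace with real inference over the trained PyTorch model.
        return np.zeros((X.shape[0], 10))   # placeholder class scores
```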
* Fix x reference
* Add Pytorch MNIST example
* Fix link to PyTorch MNIST example
* Fix indentation in README
* Fix lint errors
* Fix lint errors
Add prediction proto files
* Add build_image.sh script to build image and push to gcr.io
* Add pytorch-mnist-webui-release release through automatic ksonnet package
* Fix lint errors
* Add pytorch-mnist-webui-release release through automatic ksonnet package
* Add PB2 autogenerated files to ignore with Pylint
* Fix lint errors
* Add official Pytorch DDP examples to ignore with Pylint
* Fix lint errors
* Update component to web-ui release
* Update mount point to kubeflow-gcfs as the example is GCP specific
* 01_setup_a_kubeflow_cluster document complete
* Test release job while PR is WIP
* Reduce workflow name to avoid Argo error:
"must be no more than 63 characters"
* Fix extra_repos to pull worker image
* Fix testing_image using kubeflow-ci rather than kubeflow-releasing
* Fix extra_repo, only needs kubeflow/testing
* Set build_image.sh executable
* Update build_image.sh from CentralDashboard component
* Remove old reference to centraldashboard in echo message
* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md
* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md
* Add releases for the training and serving images
* Add releases for the training and serving images
* Fix testing_image using kubeflow-ci rather than kubeflow-releasing
* Fix path to Seldon-wrapper build_image.sh
* Fix image name in ksonnet parameter
* Add 02 distributed training documentation
* Add 03 serving the model documentation
Update shared persistent reference in 02 distributed training documentation
* Add 05 teardown documentation
* Add section to test the model is deployed correctly in 03 serving the model
* Add 04 querying the model documentation
* Fix ks-app to ks_app
* Set prow jobs back to postsubmit
* Set prow jobs to trigger presubmit to kubeflow-ci and postsubmit to
kubeflow-images-public
* Change to kubeflow-ci project
* Increase timeout limit during image build to compile Pytorch
* Increase timeout limit during image build to compile Pytorch
* Change build machine type to compile Pytorch for training image
* Change build machine type to compile Pytorch for training image
* Add OWNERS file to Pytorch example
* Fix typo in documentation
* Remove checking docker daemon as we are using gcloud build instead
* Use logging module rather than print()
* Remove empty file, replace with .gitignore to keep tmp folder
* Add ksonnet application to deploy model server and web-ui
Delete model server JSON manifest
* Refactor ks-app to ks_app
* Parametrise serving_model ksonnet component
Default web-ui to use the Ambassador route to Seldon
Remove form section in web-ui
* Remove default environment from ksonnet application
* Update documentation to use ksonnet application
* Fix component name in documentation
* Consolidate Pytorch train module and build_image.sh script
* Consolidate Pytorch train module
* Consolidate Pytorch train module
* Consolidate Pytorch train module and build_image.sh script
* Revert build_image.sh scripts
* Remove duplicates
* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
* Fix docker build command
* Fix docker build command
* Fix image name for cpu and gpu train
* Consolidate Pytorch train module
* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
* Add .pylintrc
* Resolve lint complaints in agents/trainer/task.py
* Resolve lint complaints with flask app.py
* Resolve linting issues
Remove duplicate seq2seq_utils.py from workflow/workspace/src
* Use Python 3.5.2 with pylint to match Prow
Put the pybullet import back into agents/trainer/task.py with a pylint ignore statement
Use main(_) to ensure it works with tf.app.run (see the sketch below)
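A minimal sketch of the last two points, assuming TF 1.x:

```python
# Sketch: tf.app.run() parses flags and calls main(argv), so main(_) accepts
# and ignores the argument; the unused pybullet import gets an inline
# pylint exception.
import pybullet  # pylint: disable=unused-import
import tensorflow as tf

def main(_):
    tf.logging.info("starting training")

if __name__ == "__main__":
    tf.app.run()
```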