This is important because this is an E2E tutorial. Moreover, those docs mention the catch that the GCP Free Tier and the 12-month trial period with $300 credit do not offer enough resources to run the default GCP installation of Kubeflow.
Base image `FROM tensorflow/tensorflow:1.15.2-py3` uses Python 3, so the python binary location is `/usr/bin/python3`. However, the [tensorflow base image creates a symlink](e5bf8de410/tensorflow/tools/dockerfiles/dockerfiles/cpu.Dockerfile (L45)) to the current python binary at `/usr/local/bin/python`, regardless of whether that is Python 2 or Python 3. That binary location should therefore be used in the *ENTRYPOINT* of `Dockerfile.model` instead of `/usr/bin/python`, which is customary for Python 2.x installations.
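Concretely, the relevant `Dockerfile.model` lines would look roughly like this (the script name `model.py` is a placeholder):

```dockerfile
FROM tensorflow/tensorflow:1.15.2-py3

ADD model.py /opt/model.py

# Use the /usr/local/bin/python symlink provided by the base image
# rather than /usr/bin/python, which may not exist in the -py3 image.
ENTRYPOINT ["/usr/local/bin/python", "/opt/model.py"]
```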
* Remove kustomize from mnist example.
* The mnist E2E guide has been updated to use notebooks and get rid
of kustomize
* We have notebooks for AWS, GCP, and Vanilla K8s.
* As such we no longer need the old, outdated kustomization files or
Docker containers.
* The notebooks handle parameterizing the K8s resources using Python
f-strings.
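A minimal sketch of that pattern (the names and spec fields here are illustrative, not the exact spec the notebooks emit):

```python
import yaml

# Hypothetical user-specific values substituted into the spec.
name = "mnist-train"
namespace = "user-namespace"
image = "gcr.io/my-project/mnist-model:latest"

# With an f-string, {name}, {namespace}, and {image} are filled in
# directly, so no kustomize overlay is needed.
train_spec = f"""apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: {name}
  namespace: {namespace}
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: {image}
"""

job = yaml.safe_load(train_spec)  # dict ready to hand to the K8s API
```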
* Update the README to remove the old instructions.
* Cleanup more references.
* Add method to get ALB hostname for aws users
* Revoke setup based on the platform
* Add AWS notebook for mnist e2e example
* Remove legacy kustomize manifests for mnist example
* Address feedback from reviewers
* A notebook to run the mnist E2E example on GCP.
This fixes a number of issues with the example
* Use ISTIO instead of Ambassador to add reverse proxy routes
* The training job needs to be updated to run in a profile created namespace in order to have the required service accounts
* See kubeflow/examples#713
* Running inside a notebook hosted on Kubeflow should ensure the user
is working inside an appropriately set up namespace
* With ISTIO the default RBAC rules prevent the web UI from sending requests to the model server
* A short term fix was to not include the ISTIO side car
* In the future we can add an appropriate ISTIO rbac policy
* Using a notebook allows us to eliminate the use of kustomize
* This resolves kubeflow/examples#713, which required people to use
an old version of kustomize
* Rather than using kustomize we can use Python f-strings to
write the YAML specs and then easily substitute in user-specific values
* This should be more informative; it avoids introducing kustomize and
users can see the resource specs.
* I've opted to make the notebook GCP specific. I think it's less confusing
to users to have separate notebooks focused on specific platforms rather
than having one notebook with a lot of caveats about what to do under
different conditions
* I've deleted the kustomize overlays for GCS since we don't want users to
use them anymore
* I used fairing and kaniko to eliminate the use of docker to build the images
so that everything can run from a notebook running inside the cluster.
* k8s_utils.py has some reusable functions that hide low-level details from users
(e.g. low level calls to K8s APIs.)
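For illustration, a sketch of the kind of helper this refers to (not the actual k8s_utils.py code):

```python
import yaml
from kubernetes import client, config, utils

def apply_specs(yaml_text, namespace="default"):
    """Bulk-create every object found in a multi-document YAML string."""
    config.load_incluster_config()  # assumes we run inside the cluster
    api_client = client.ApiClient()
    for doc in yaml.safe_load_all(yaml_text):
        if not doc:
            continue
        # create_from_dict handles built-in kinds (Deployment, Service, ...);
        # custom resources such as TFJob need the CustomObjectsApi instead.
        utils.create_from_dict(api_client, doc, namespace=namespace)
```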
* Change the mnist test to just run the notebook
* Copy the notebook test infra for xgboost_synthetic to py/kubeflow/examples/notebook_test to make it more reusable
* Fix lint.
* Update for lint.
* A notebook to run the mnist E2E example.
Related to: kubeflow/website#1553
* 1. Use fairing to build the model. 2. Construct the YAML spec directly in the notebook. 3. Use the TFJob python SDK.
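The notebook itself goes through the TFJob python SDK; the equivalent raw call with the plain kubernetes client looks roughly like this (the group/version/plural values are the standard TFJob coordinates; `train_spec` is an f-string TFJob YAML like the one sketched earlier):

```python
import yaml
from kubernetes import client, config

config.load_incluster_config()  # running from a notebook in the cluster
custom_api = client.CustomObjectsApi()

namespace = "user-namespace"        # illustrative
job = yaml.safe_load(train_spec)    # TFJob spec built with an f-string

# TFJob is a custom resource, so it is created via the CustomObjectsApi.
custom_api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace=namespace,
    plural="tfjobs", body=job)
```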
* Fix the ISTIO rule.
* Fix UI and serving; need to update TF serving to match version trained on.
* Get the IAP endpoint.
* Start writing some helper python functions for K8s.
* Commit before switching from replace to delete.
* Create a library to bulk create objects.
* Cleanup.
* Add back k8s_util.py
* Delete train.yaml; this shouldn't have been added.
* update the notebook image.
* Refactor code into k8s_util; print out links.
* Clean up the notebook. Should be working E2E.
* Added section to get logs from stackdriver.
* Add comment about profile.
* Latest.
* Override mnist_gcp.ipynb with mnist.ipynb
I accidentally put my latest changes in mnist.ipynb even though that file
was deleted.
* More fixes.
* Resolve some conflicts from the rebase; override with changes on remote branch.
* [mnist] Add support for S3 in TensorBoard component; Update docs.
* [mnist] reverted autonumbering in README
* [mnist] add expected fail for predict_test, until it's fixed
* Add e2e test for xgboost housing example
* fix typo
add ks apply
add [
modify example to trigger tests
add prediction test
add xgboost ks param
rename the job name without _
use - instead of _
libsonnet params
rm redundant component
rename component in prow config
add ames-hoursing-env
use - for all names
use _ for params names
use xgboost_ames_accross
rename component name
shorten the name
change deploy-test command
change to xgboost-
namespace
init ks app
fix type
add conftest.py
change path
change deploy command
change dep
change the query URL for seldon
add ks_app with seldon lib
update ks_app
use ks init only
rerun
change to kf-v0-4-n00 cluster
add ks_app
use ks-13
remove --namespace
use kubeflow as namespace
delete seldon deployment
simplify ks_app
retry on 503
fix typo
query 1285
move deletion after prediction
wait 10s
always retry till 10 mins
move check to retry
fix pylint
move clean-up to the delete template
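Several of the steps above ("retry on 503", "wait 10s", "always retry till 10 mins") boil down to polling the prediction endpoint until the model is up. A minimal sketch of that pattern, with a placeholder URL and payload:

```python
import time
import requests

def predict_with_retry(url, payload, timeout_s=600, interval_s=10):
    """POST payload to url, retrying on 503s/errors for up to timeout_s."""
    deadline = time.time() + timeout_s
    while True:
        try:
            resp = requests.post(url, json=payload, timeout=30)
            if resp.status_code == 200:
                return resp.json()
        except requests.RequestException:
            pass  # service not reachable yet; fall through and retry
        if time.time() > deadline:
            raise RuntimeError(f"no successful prediction within {timeout_s}s")
        time.sleep(interval_s)
```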
* set up xgboost component
* check in ks component & run it directly
* change comments
* add comment on why use 'ks delete'
* add two modules to pylint whitelist
* ignore tf_operator/py
* disable pylint per line
* reorder import
* Create an E2E test for TFServing using the rest API
* We use the pytest framework because
1. it has really good support for using command line arguments
2. it can emit a junit XML file to report results to prow.
Related to #270: Create a generic test runner
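A sketch of the shape of such a test (two files shown together; the `--service-url` option and the model name "mnist" are illustrative, while `--junitxml` is pytest's built-in flag that prow consumes):

```python
# conftest.py
def pytest_addoption(parser):
    parser.addoption(
        "--service-url", action="store", default="http://localhost:8501",
        help="Base URL of the TFServing REST endpoint under test.")

# test_predict.py
import pytest
import requests

@pytest.fixture
def service_url(request):
    return request.config.getoption("--service-url")

def test_predict(service_url):
    # TFServing REST predict call; input shape is an assumption.
    resp = requests.post(f"{service_url}/v1/models/mnist:predict",
                         json={"instances": [[0.0] * 784]})
    assert resp.status_code == 200
```

Invoked as e.g. `pytest test_predict.py --service-url=http://mnist-service:8501 --junitxml=junit.xml` so prow can pick up the report.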
* Address comments.
* Fix lint.
* Add retries to the prediction.
* Add some comments.
* Fix model path.
* Fix the workflow labels
* Set the K8s service name correctly on the test.
* Fix the workflow.
* Fix lint.
* Add the web-ui for the mnist example
Copy the mnist web app from
https://github.com/googlecodelabs/kubeflow-introduction
* Update the web app
* Change "server-name" argument to "model-name" because this is what
is.
* Update the prediction client code; The prediction code was copied
from https://github.com/googlecodelabs/kubeflow-introduction and
that model used slightly different values for the input names
and outputs.
* Add a test for the mnist_client code; currently it needs to be run
manually.
* Fix the label selector for the mnist service so that it matches the
TFServing deployment.
* Delete the old copy of mnist_client.py; we will go with the copy in web-ui from https://github.com/googlecodelabs/kubeflow-introduction
* Delete model-deploy.yaml, model-train.yaml, and tf-user.yaml.
The K8s resources for training and deploying the model are now in ks_app.
* Fix tensorboard; tensorboard only partially works behind Ambassador. It seems like some requests don't work behind a reverse proxy.
* Fix lint.
* Add the TFServing component
* Create TFServing components.
* The model.py code doesn't appear to be exporting a model in saved model
format; it was missing a call to export.
* I'm not sure how this ever worked.
* It also looks like there is a bug in the code in that it's using the cnn input fn even if the model is the linear one. I'm going to leave that as is for now.
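For reference, the missing export in TF 1.x estimator code looks roughly like this (the input tensor name and shape are assumptions about the mnist model):

```python
import tensorflow as tf

def serving_input_receiver_fn():
    # Input name "x" and shape are assumptions about the mnist model.
    inputs = {"x": tf.placeholder(tf.float32, [None, 28, 28, 1], name="x")}
    return tf.estimator.export.ServingInputReceiver(inputs, inputs)

# After estimator.train(...), export in SavedModel format so TFServing
# can load it:
# estimator.export_saved_model(export_dir_base, serving_input_receiver_fn)
```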
* Create a namespace for each test run; delete the namespace on teardown
* We need to copy the GCP service account key to the new namespace.
* Add a shell script to do that.
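A sketch of what such a script does (the secret name `user-gcp-sa` and the key filename are assumptions):

```bash
# Pull the service-account key out of the kubeflow namespace and
# recreate the secret in the per-test namespace.
kubectl get secret user-gcp-sa -n kubeflow \
  -o jsonpath='{.data.user-gcp-sa\.json}' | base64 -d > /tmp/key.json
kubectl create secret generic user-gcp-sa -n "${TEST_NAMESPACE}" \
  --from-file=user-gcp-sa.json=/tmp/key.json
```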
* Update training to use Kubeflow 0.4 and add testing.
* To support testing we need to create a ksonnet template to train
the model so we can easily substitute in different parameters during
training.
* We create a ksonnet component for just training; we don't use Argo.
This makes the example much simpler.
* To support S3 we add a generic ksonnet parameter to take environment
variables as a comma separated list of variables. This should make it
easy for users to set the environment variables needed to talk to S3.
This is compatible with the existing Argo workflow which supports S3.
* By default the training job runs non-distributed; this is because to
run distributed the user needs a shared filesystem (e.g. S3/GCS/NFS).
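For example (the component and parameter names here are illustrative):

```bash
# Comma-separated env vars let users point training at S3 without
# editing the component; actual credentials belong in K8s secrets.
ks param set train envVariables \
  "S3_ENDPOINT=s3.us-west-2.amazonaws.com,AWS_REGION=us-west-2"
```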
* Update the mnist workflow to correctly build the images.
* We didn't update the workflow in the previous example to actually
build the correct images.
* Update the workflow to run the tfjob_test
* Related to #460 E2E test for mnist.
* Add a parameter to specify a secret that can be used to mount
a secret such as the GCP service account key.
* Update the README with instructions for GCS and S3.
* Remove the instructions about Argo; the Argo workflow is outdated.
Using Argo adds complexity to the example and the thinking is to remove
that to provide a simpler example and to mirror the pytorch example.
* Add a TOC to the README
* Update prerequisite instructions.
* Delete instructions for installing Kubeflow; just link to the
getting started guide.
* Argo CLI should no longer be needed.
* GitHub token shouldn't be needed; I think that was only needed
for ksonnet to pull the registry.
* Fix instructions; access keys shouldn't be stored as ksonnet parameters
as these will get checked into source control.
* This is the first step in adding E2E tests for the mnist example.
* Add a Makefile and .jsonnet file to build the Docker images using GCB
* Define an Argo workflow to trigger the image builds on pre & post submit.
Related to: #460