* Create a PV for pets-pv
On many user k8s clusters, dynamic volume provisioning isn't
enabled, so newcomers may get blocked because pets-pv stays
Pending. We can guide them to create an NFS PV as an option (see the
sketch after this list).
* Tell the user how to check whether a default storage class is defined
* Add a link about how to create a PV
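For illustration, a minimal sketch of both steps using the Python kubernetes client; the NFS server address, export path, and capacity are placeholders, not values from this repo:

```python
# Sketch: create an NFS-backed PV named pets-pv so the claim can bind
# when dynamic provisioning is disabled. Server/path/size are placeholders.
from kubernetes import client, config

config.load_kube_config()
pv = client.V1PersistentVolume(
    metadata=client.V1ObjectMeta(name="pets-pv"),
    spec=client.V1PersistentVolumeSpec(
        capacity={"storage": "10Gi"},
        access_modes=["ReadWriteMany"],
        nfs=client.V1NFSVolumeSource(server="10.0.0.2", path="/exports/pets"),
    ),
)
client.CoreV1Api().create_persistent_volume(pv)

# Check whether a default storage class is defined.
for sc in client.StorageV1Api().list_storage_class().items:
    annotations = sc.metadata.annotations or {}
    if annotations.get("storageclass.kubernetes.io/is-default-class") == "true":
        print("Default storage class:", sc.metadata.name)
```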
* Create a script to count lines of code.
* This is used in the presentation to get an estimate of where the human effort is involved.
* Fix lint issues.
* We need to set the parameters for the model and index.
* It looks like when we split up the web app into its own ksonnet app
we forgot to set the parameters.
* Since the web app is being deployed in a separate namespace, we need to
copy the GCP credential secret to that namespace. Add instructions to the
demo README.md on how to do that (see the sketch after this list).
* It looks like the pods were never getting started because the secret
couldn't be mounted.
* We need to disable TLS (it's handled by the ingress); leaving it on leads to
endless redirects.
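A sketch of copying the credential with the Python kubernetes client; the secret name (user-gcp-sa) and target namespace are assumptions:

```python
# Sketch: copy the GCP credential secret from the kubeflow namespace
# into the namespace where the web app is deployed. Names are assumed.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
secret = v1.read_namespaced_secret("user-gcp-sa", "kubeflow")
copy = client.V1Secret(
    metadata=client.V1ObjectMeta(name=secret.metadata.name, namespace="web-app"),
    data=secret.data,
    type=secret.type,
)
v1.create_namespaced_secret("web-app", copy)
```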
* ArgoCD is running in namespace argo-cd but Ambassador is running in a
different namespace and currently only configured with RBAC to monitor
a single namespace.
* So we add a service in namespace kubeflow just to define the Ambassador mapping.
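Roughly what such a mapping-only Service could look like, sketched with the Python kubernetes client; the mapping name, prefix, target service, and ports are illustrative:

```python
# Sketch: a Service in namespace kubeflow whose only job is to carry
# the Ambassador mapping annotation pointing at the ArgoCD server.
from kubernetes import client, config

AMBASSADOR_CONFIG = """
---
apiVersion: ambassador/v0
kind: Mapping
name: argocd-mapping
prefix: /argocd/
service: argocd-server.argo-cd
"""

config.load_kube_config()
svc = client.V1Service(
    metadata=client.V1ObjectMeta(
        name="argocd-mapping",
        namespace="kubeflow",
        annotations={"getambassador.io/config": AMBASSADOR_CONFIG},
    ),
    spec=client.V1ServiceSpec(
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
client.CoreV1Api().create_namespaced_service("kubeflow", svc)
```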
* Add object detection grpc client
Fixes: #377
* Fix the kubeflow-examples-presubmit error
object_detection_grpc_client.py depends on other files in
https://github.com/tensorflow/models.git that need to be generated
manually, so pylint fails on them.
Since mnist_DDP.py has a similar dependency, follow the mnist_DDP.py
approach and skip linting this file.
* Update instructions and setup for yelp demo
Update kubeflow version to v0.3.4-rc.1
Add pipelines version v0.1.3-rc.2
Add simple pipelines example using GPUs
Conform cluster name, secrets, and ks app directory name to click-to-deploy standard
Update ks_app directory to v0.3.4-rc.1
Pin bokeh package to v0.13.0 in yelp notebook
Fix bug in secret creation
* Port-forward to svcs instead of pods
Add clarification for using kfctl & updating component params
* Dataflow job should support writing embeddings to a different location (Fixes #366).
* The Dataflow job that computes code embeddings needs parameters controlling
the location of the outputs independent of the inputs. Prior to this fix, the
same table in the dataset was always written and the files were always created
in the data dir.
* This made it very difficult to rerun the embeddings job on the latest GitHub
data (e.g., to regularly update the code embeddings) without overwriting
the current embeddings.
* Refactor how we create BQ sinks and sources in this pipeline
* Rather than create a wrapper class that bundles together a sink and schema
we should have a separate helper class for creating BQ schemas and then
use WriteToBigQuery directly.
* Similarly for ReadTransforms we don't need a wrapper class that bundles
a query and source. We can just create a class/constant to represent
queries and pass them directly to the appropriate source.
* Change the BQ write disposition to WRITE_EMPTY so we don't overwrite existing data (sketched below).
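A minimal sketch of writing with WriteToBigQuery and a WRITE_EMPTY disposition; the table, schema, and input data are illustrative:

```python
# Sketch: write embeddings with WriteToBigQuery; WRITE_EMPTY makes the
# job fail instead of overwriting a non-empty table. Names are illustrative.
import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

with beam.Pipeline() as p:
    _ = (p
         | "CreateEmbeddings" >> beam.Create(
             [{"nwo": "org/repo", "function_embedding": "0.1,0.2"}])
         | "WriteToBQ" >> WriteToBigQuery(
             table="my-project:code_search.function_embeddings",
             schema="nwo:STRING,function_embedding:STRING",
             create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
             write_disposition=BigQueryDisposition.WRITE_EMPTY))
```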
* Fixes #390: worker setup fails because requirements.dataflow.txt is not found
* Dataflow always uses the local file requirements.txt, regardless of which
local file is used as the source.
* When the job is submitted, it will also try to build an sdist package on
the client, which invokes setup.py.
* So in setup.py we always refer to requirements.txt.
* If trying to install the package in other contexts,
requirements.dataflow.txt should be renamed to requirements.txt
* We do this in the Dockerfile.
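A sketch of the setup.py arrangement described above; the package name and version are placeholders:

```python
# Sketch: setup.py always reads requirements.txt; the Dockerfile renames
# requirements.dataflow.txt to requirements.txt first. Metadata is illustrative.
from setuptools import find_packages, setup

with open("requirements.txt") as f:
    install_requires = f.read().splitlines()

setup(
    name="code-search",
    version="0.1.0",
    packages=find_packages(),
    install_requires=install_requires,
)
```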
* Refactor the CreateFunctionEmbeddings code so that writing to BQ
is not part of the compute function embeddings code
(this will make it easier to test).
* Fix typo in jsonnet with output dir; missing an "=".
In tensorflow/models/research/object_detection/, only
tensorflow/models/research/object_detection/legacy/train.py
supports kubeflow so far (it constructs the cluster by reading the
TF_CONFIG environment variable).
Fixes: #277
Remove separate pipelines installation
Update kfp version to 0.1.3-rc.2
Clarify difference in installation paths (click-to-deploy vs CLI)
Use set_gpu_limit() and remove generated yaml with resource limits
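Roughly how set_gpu_limit() replaces hand-edited YAML resource limits in the pipelines SDK; the pipeline and image names are placeholders:

```python
# Sketch: request a GPU via the SDK instead of editing generated YAML.
# Pipeline/image names are placeholders.
import kfp.dsl as dsl

@dsl.pipeline(name="yelp-demo", description="Simple pipeline using GPUs")
def gpu_pipeline():
    train = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/yelp-train:latest",
    )
    train.set_gpu_limit(1)
```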
* Follow argocd instructions
https://github.com/argoproj/argo-cd/blob/master/docs/getting_started.md
to install ArgoCD on the cluster
* Download the argocd manifest and update the namespace to argocd.
* Check it in so ArgoCD can be deployed declaratively.
* Update README.md with the instructions for deploying ArgoCD.
Move the web app components into their own ksonnet app.
* We do this because we want to be able to sync the web app components using
Argo CD
* ArgoCD doesn't allow us to apply autosync with granularity less than the
app. We don't want to sync any of the components except the servers.
* Rename the t2t-code-search-serving component to query-embed-server because
this is more descriptive.
* Check in a YAML spec defining the ksonnet application for the web UI.
Update the instructions in the notebook code-search.ipynb
* Provide updated instructions for deploying the web app, because the
web app is now a separate component.
* Improve code-search.ipynb
* Use gcloud to get sensible defaults for parameters like the project.
* Provide more information about what the variables mean.
* This script will be the last step in a pipeline to continuously update
the index for serving.
* The script updates the parameters of the search index server to point
to the supplied index files. It then commits the changes and creates a PR
to push those commits (see the sketch after this list).
* Restructure the parameters for the search index server so that we can use
ks param set to override the indexFile and lookupFile.
* We do this because we want to be able to push a new index by doing
ks param set in a continuously running pipeline
* Remove default parameters from search-index-server
* Create a dockerfile suitable for running this script.
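A hedged sketch of the update step, shelling out to ks and git; the app directory and file paths are placeholders:

```python
# Sketch: point the search-index-server at new index files via
# `ks param set`, then commit the change for a PR. Paths are placeholders.
import subprocess

def update_index_params(app_dir, index_file, lookup_file):
    for param, value in [("indexFile", index_file), ("lookupFile", lookup_file)]:
        subprocess.check_call(
            ["ks", "param", "set", "search-index-server", param, value],
            cwd=app_dir)
    subprocess.check_call(
        ["git", "commit", "-a", "-m", "Update the search index"], cwd=app_dir)

update_index_params("ks-web-app", "gs://bucket/index.nmslib", "gs://bucket/lookup.csv")
```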
* The latest changes to the ksonnet components require certain values
to be defined as defaults.
* This is part of the move away from using a fake component to define
parameters that should be reused across different modules.
see #308
* Verify we can run ks show on a new environment and can evaluate the ksonnet.
Fixes #353
* Upgrade and fix the serving components.
* Install a new version of the TFServing package so we can use the new template.
* Fix the UI image. Use the same requirements file as for Dataflow so we are
consistent w.r.t. the versions of TF and Tensor2Tensor.
* remove nms.libsonnet; move all the manifests into the actual component
files rather than using a shared library.
* Fix the name of the TFServing service and deployment; need to use the same
name as used by the front end server.
* Change the port of TFServing; we are now using the built-in http server
in TFServing, which uses port 8500, as opposed to our custom http proxy.
* We encountered an error importing nmslib; moving it to the top of the file
appears to fix this.
* Fix lint.
* Default to model trained with CPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models
* Check out the 1.0rc1 release, as the latest Pytorch master seems to have MPI backend detection broken
* Track changes in pytorch_mnist/training/ddp/mnist folder to trigger test jobs
* Repoint to pull images from gcr.io/kubeflow-ci built during pre-submit
* Fix web UI image name
* Fix logging
* Add GCFS to CPU train
* Fix logging
* Add GCFS to CPU train
* Default to model trained with GPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models
* Fix the Predict() method, as Seldon expects 3 arguments (see the sketch below)
* Fix x reference
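A minimal sketch of the Seldon model class with the expected three-argument predict(); the class name and model path are assumptions:

```python
# Sketch: Seldon's Python wrapper calls predict(self, X, features_names),
# i.e. three arguments including self. Class name and model path are assumed.
import torch

class MnistModel(object):
    def __init__(self):
        self.model = torch.load("/mnt/kubeflow-gcfs/model.pt")  # assumed path
        self.model.eval()

    def predict(self, X, features_names):
        with torch.no_grad():
            return self.model(torch.from_numpy(X).float()).numpy()
```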
* Install nmslib in the Dataflow container so it's suitable for running
the index creation job.
* Use command not args in the job specs.
* Dockerfile.dataflow should install nmslib so that we can use that Docker
image to create the index.
* build.jsonnet should tag images as latest. We will use the latest
images as a layer cache to speed up builds.
* Set logging level to info for start_search_server.py and
create_search_index.py
* The create-search-index pod kept getting evicted because the node runs out of
memory.
* Add a new node pool consisting of n1-standard-32 nodes to the demo cluster.
These have 120 GB of RAM compared to 30 GB in our default pool of n1-standard-8 machines.
* Set requests and limits on the search-index-creator pod (sketched below).
* Move all the config for the search-index-creator job into the
search-index-creator.jsonnet file. We need to customize the memory resources,
so there's not much value in trying to share config with other components.
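For illustration, the kind of requests/limits involved, expressed with the Python kubernetes client rather than the actual jsonnet; the numbers and image are placeholders:

```python
# Sketch: explicit memory requests/limits for the index-creator container
# so it lands on the big node pool and isn't evicted. Values are placeholders.
from kubernetes import client

container = client.V1Container(
    name="search-index-creator",
    image="gcr.io/my-project/dataflow:latest",  # illustrative image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "32Gi"},
        limits={"memory": "100Gi"},
    ),
)
```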
* Add Pytorch MNIST example
* Fix link to Pytorch MNIST example
* Fix indentation in README
* Fix lint errors
* Fix lint errors
Add prediction proto files
* Add build_image.sh script to build image and push to gcr.io
* Add pytorch-mnist-webui-release release through automatic ksonnet package
* Fix lint errors
* Add pytorch-mnist-webui-release release through automatic ksonnet package
* Add PB2 autogenerated files to ignore with Pylint
* Fix lint errors
* Add official Pytorch DDP examples to ignore with Pylint
* Fix lint errors
* Update component to web-ui release
* Update mount point to kubeflow-gcfs as the example is GCP specific
* Complete the 01_setup_a_kubeflow_cluster document
* Test release job while PR is WIP
* Reduce workflow name to avoid Argo error:
"must be no more than 63 characters"
* Fix extra_repos to pull worker image
* Fix testing_image using kubeflow-ci rather than kubeflow-releasing
* Fix extra_repo, only needs kubeflow/testing
* Set build_image.sh executable
* Update build_image.sh from CentralDashboard component
* Remove old reference to centraldashboard in echo message
* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md
* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md
* Add releases for the training and serving images
* Add releases for the training and serving images
* Fix testing_image using kubeflow-ci rather than kubeflow-releasing
* Fix path to Seldon-wrapper build_image.sh
* Fix image name in ksonnet parameter
* Add 02 distributed training documentation
* Add 03 serving the model documentation
Update shared persistent reference in 02 distributed training documentation
* Add 05 teardown documentation
* Add section to test the model is deployed correctly in 03 serving the model
* Add 04 querying the model documentation
* Fix ks-app to ks_app
* Set prow jobs back to postsubmit
* Set prow jobs to trigger presubmit to kubeflow-ci and postsubmit to
kubeflow-images-public
* Change to kubeflow-ci project
* Increase timeout limit during image build to compile Pytorch
* Increase timeout limit during image build to compile Pytorch
* Change build machine type to compile Pytorch for training image
* Change build machine type to compile Pytorch for training image
* Add OWNERS file to Pytorch example
* Fix typo in documentation
* Remove the docker daemon check, as we are using gcloud build instead
* Use the logging module rather than print() (see the sketch after this list)
* Remove the empty file; replace it with a .gitignore to keep the tmp folder
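The logging switch, as a minimal sketch:

```python
# Sketch: use logging instead of print() so messages carry levels/timestamps.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s")
logging.info("Training for %d epochs", 10)
```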
* Add ksonnet application to deploy model server and web-ui
Delete model server JSON manifest
* Refactor ks-app to ks_app
* Parametrise serving_model ksonnet component
Default web-ui to use ambassador route to seldon
Remove form section in web-ui
* Remove default environment from ksonnet application
* Update documentation to use ksonnet application
* Fix component name in documentation
* Consolidate Pytorch train module and build_image.sh script
* Consolidate Pytorch train module
* Consolidate Pytorch train module
* Consolidate Pytorch train module and build_image.sh script
* Revert back build_image.sh scripts
* Remove duplicates
* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
* Fix docker build command
* Fix docker build command
* Fix image name for cpu and gpu train
* Consolidate Pytorch train module
* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
* Add simple pipeline demo
* Add hyperparameter tuning & GPU autoprovisioning
Use pipelines v0.1.2
* Resolve lint issues
* Disable lint warning
Correct SDK syntax that labels the name of the pipeline step
* Add postprocessing step
A basically empty step, just to show more than one step (see the sketch below)
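A sketch of a two-step pipeline where the name argument labels each step; images and names are placeholders:

```python
# Sketch: two steps; the `name` argument is the label the UI shows for
# each step. Images are placeholders.
import kfp.dsl as dsl

@dsl.pipeline(name="demo-pipeline", description="Train plus a no-op postprocess step")
def demo_pipeline():
    train = dsl.ContainerOp(
        name="train",
        image="gcr.io/my-project/train:latest",
    )
    postprocess = dsl.ContainerOp(
        name="postprocess",
        image="alpine:3.8",
        command=["echo", "done"],
    )
    postprocess.after(train)
```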
* Add clarity to instructions
* Update pipelines install to release v0.1.2
* Add repo cloning with release versions
Remove katib patch
Use kubeflow v0.3.3
Add PROJECT to env var override file
Further clarification of instructions
In order to build a pipeline that can run ksonnet commands, the ksonnet
registry needs to be containerized.
Remove it from .dockerignore to unblock the work.