546 Commits

Author SHA1 Message Date
Hougang Liu 1ed74b274c create pv for pets-pv (#439)
* create pv for pets-pv

On many users' k8s clusters, dynamic volume provisioning isn't
enabled, so newcomers may be blocked because pets-pv will stay
Pending. We can guide them to create an NFS PV as an option.

* tell user how to check if a default storage class is defined

* add link about how to create PV
2018-12-21 06:05:11 -08:00
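
The commit above tells users how to check whether a default storage class is defined. A minimal sketch of that check, assuming the input is the JSON printed by `kubectl get storageclass -o json`; the helper name `default_storage_classes` and the sample data are hypothetical and not part of the commit:

```python
import json

# Annotation keys Kubernetes uses to mark a StorageClass as the default
# (the "beta" key is the older spelling).
DEFAULT_KEYS = (
    "storageclass.kubernetes.io/is-default-class",
    "storageclass.beta.kubernetes.io/is-default-class",
)


def default_storage_classes(storageclass_list_json):
    """Return the names of StorageClasses annotated as the default."""
    doc = json.loads(storageclass_list_json)
    names = []
    for item in doc.get("items", []):
        meta = item.get("metadata", {})
        annotations = meta.get("annotations") or {}
        if any(annotations.get(key) == "true" for key in DEFAULT_KEYS):
            names.append(meta.get("name"))
    return names


# Hypothetical sample shaped like `kubectl get storageclass -o json` output.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "standard", "annotations": {
            "storageclass.kubernetes.io/is-default-class": "true"}}},
        {"metadata": {"name": "nfs"}},
    ]
})

if __name__ == "__main__":
    print(default_storage_classes(SAMPLE))
```

An empty result suggests that no default storage class exists, which is the case where a claim such as pets-pv would stay Pending until a PV is created by hand.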
Jeremy Lewi 2e6e891a5b Update the ArgoCD app to use the kubeflow/examples repo (#440)
* We were using jlewi's fork because PRs hadn't been committed but
  all the relevant PRs have been merged and master is the source of truth.
2018-12-19 21:26:49 -08:00
Jeremy Lewi ba9af34805 Create a script to count lines of code. (#379)
* Create a script to count lines of code.

* This is used in the presentation to get an estimate of where the human effort is involved.

* Fix lint issues.
2018-12-19 09:42:25 -08:00
Guang Ya Liu 345e69ab4c Removed empty application centric section. (#375) 2018-12-14 18:36:18 -08:00
Svendegroote91 a2e8a08e11 remove obsolete PS in GPU jsonnet (#407) 2018-12-12 18:08:15 -08:00
Jeremy Lewi 9f061a0554 Update the central dashboard UI image to one that includes pipelines. (#430) 2018-12-12 09:34:21 -08:00
Jeremy Lewi 1b643c2b81 Fix the web app. (#432)
* We need to set the parameters for the model and index.

  * It looks like when we split up the web app into its own ksonnet app
    we forgot to set the parameters.

* Since the web app is being deployed in a separate namespace we need to
  copy the GCP credential to that namespace. Add instructions to the
  demo README.md on how to do that.

* It looks like the pods were never getting started because the secret
  couldn't be mounted.
2018-12-12 09:24:40 -08:00
IronPan 4a7e2c868c fix bq table duplication (#418)
* fix bq table duplication

* fix bq table duplication

* update

* update image

* use index for placeholder
2018-12-10 18:50:28 -08:00
David Sabater Dinter d408ae09f0 Point images back to gcr.io/kubeflow-examples (#421) 2018-12-09 16:02:24 -08:00
Jeremy Lewi b26f7e9a48 Add pods/logs permission to the jupyter notebook role. (#419)
* This is needed so that fairing can tail the logs.
2018-12-09 15:53:46 -08:00
Azmi Kamis 6cdc461b50 fix directories in Dockerfile (#416) 2018-12-08 14:53:10 -08:00
Azmi Kamis c234f18b0b Fix curl command when sending request to seldon-served model on GKE (#415)
* fix curl command when sending request to seldon-served model on GKE

* modified response
2018-12-08 14:44:06 -08:00
Hougang Liu fc5a85b948 reconcile tensorflow serving version (#409)
Since default OBJ_DETECTION_IMAGE tensorflow version is 1.10.0, we
pin consistent version 1.10.0 of TF across the example.

Fixes: #408
2018-12-08 14:32:46 -08:00
Jeremy Lewi 67d42c4661 Expose ArgoCD UI behind Ambassador. (#413)
* We need to disable TLS (it's handled by ingress) because leaving it on leads to
  endless redirects.

* ArgoCD is running in namespace argo-cd but Ambassador is running in a
  different namespace and currently only configured with RBAC to monitor
  a single namespace.

* So we add a service in namespace kubeflow just to define the Ambassador mapping.
2018-12-08 12:49:34 -08:00
IronPan 4c970876dc add notebook for code search pipeline (#410) 2018-12-07 10:29:02 -08:00
Sam Shi b2e6aa231c Save the batch-predict package in the image; Create a separate Dockerfile for GPU (#383)
* Save the batch-predict package in the image; create a separate Dockerfile for gpu

* remove commented code
2018-12-07 10:28:57 -08:00
IronPan 0d2f5b6342 Clean up code search pipeline (#406)
* update pipeline to use out of box gcp credential support

* Update index_update_pipeline.py
2018-12-07 10:13:12 -08:00
Hougang Liu 9994b57497 add object detection grpc client (#378)
* add object detection grpc client

Fixes: #377

* fix kubeflow-examples-presubmit error

object_detection_grpc_client.py depends on other files in
https://github.com/tensorflow/models.git; pylint will fail
because those files need to be compiled manually.
Since mnist_DDP.py has a similar dependency, we follow
mnist_DDP.py here and skip checking this file.
2018-12-06 18:51:24 -08:00
Karthic Rao b69cf36a39 Fixing broken links (#403)
- Fix broken links for the install instructions.
- Minor modifications to the instructions.
- Minor formatting fixes.
2018-12-05 18:42:11 -08:00
IronPan 206ad8fda4 Add preprocess github data step to code search pipeline (#396)
* refactor ks

* remove unnecessary params

* update ks

* address comments

* add preprocess step

* update images

* update preprocess code

* reformat

* minor fix

* reuse function embedding pipeline to preprocess

* add preprocess

* update pipeline

* propagate failed token table

* format code

* copy vocabulary

* address comments

* address comments

* update

* fix

* fix format

* Update arguments.py
2018-12-05 18:06:06 -08:00
Michelle Casbon 5e395c1a88 Add components (#402)
Replace files that were mistakenly removed in #376
2018-12-05 15:06:42 -08:00
govind cs 60ba49c68d fixed "setting persistent disk" link
Fixed the link to the advanced customization page on kubeflow, which currently redirects to a non-existent page.
2018-12-04 16:02:53 +05:30
Michelle Casbon fa1311833c Update instructions and setup for yelp demo (#376)
* Update instructions and setup for yelp demo

Update kubeflow version to v0.3.4-rc.1
Add pipelines version v0.1.3-rc.2
Add simple pipelines example using GPUs
Conform cluster name, secrets, and ks app directory name to click-to-deploy standard
Update ks_app directory to v0.3.4-rc.1
Pin bokeh package to v0.13.0 in yelp notebook
Fix bug in secret creation

* Port-forward to svcs instead of pods

Add clarification for using kfctl & updating component params
2018-12-03 22:39:51 -08:00
IronPan cea0ffde0d Update the ks parameter (#394)
* refactor ks

* remove unnecessary params

* update ks

* address comments
2018-12-02 22:14:11 -08:00
Jeremy Lewi 78fdc74b56 Dataflow job should support writing embeddings to a different location (Fix #366). (#388)
* Dataflow job should support writing embeddings to a different location (Fix #366).

* Dataflow job to compute code embeddings needs to have parameters controlling
  the location of the outputs independent of the inputs. Prior to this fix the
  same table in the dataset was always written and the files were always created
  in the data dir.

* This made it very difficult to rerun the embeddings job on the latest GitHub
  data (e.g. to regularly update the code embeddings) without overwriting
  the current embeddings.

* Refactor how we create BQ sinks and sources in this pipeline

  * Rather than create a wrapper class that bundles together a sink and schema
    we should have a separate helper class for creating BQ schemas and then
    use WriteToBigQuery directly.

  * Similarly for ReadTransforms we don't need a wrapper class that bundles
    a query and source. We can just create a class/constant to represent
    queries and pass them directly to the appropriate source.

* Change BQ write disposition to if empty so we don't overwrite existing data.

* Fix #390 worker setup fails because requirements.dataflow.txt not found

  * Dataflow always uses the local file requirements.txt regardless of the
    local file used as the source.

  * When job is submitted it will also try to build a sdist package on
    the client which invokes setup.py

  * So in setup.py we always refer to requirements.txt

  * If trying to install the package in other contexts,
    requirements.dataflow.txt should be renamed to requirements.txt

  * We do this in the Dockerfile.

* Refactor the CreateFunctionEmbeddings code so that writing to BQ
  is not part of the compute function embeddings code;
  (will make it easier to test.)

* Fix typo in jsonnet with output dir; missing an "=".
2018-12-02 09:51:27 -08:00
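
The commit above notes that setup.py always refers to requirements.txt, because Dataflow builds an sdist on the client when the job is submitted. A minimal sketch of how a setup.py could read that file; the helper `read_requirements`, the `demo` function, and the package names in the sample file are assumptions for illustration, not the repository's actual code:

```python
import os
import tempfile


def read_requirements(path="requirements.txt"):
    """Return requirement specifiers, skipping blank lines and comments."""
    requirements = []
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line and not line.startswith("#"):
                requirements.append(line)
    return requirements


def demo():
    """Write a throwaway requirements.txt and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "requirements.txt")
        with open(path, "w") as handle:
            # Hypothetical contents; the real file is not shown in the log.
            handle.write("# worker dependencies\nnmslib\n\ntensor2tensor\n")
        return read_requirements(path)


if __name__ == "__main__":
    print(demo())
```

In a setup.py the result would typically be passed as `install_requires=read_requirements()`. This only works if requirements.dataflow.txt has been renamed to requirements.txt first, as the commit describes for the Dockerfile.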
IronPan e8cf9c58ce add pipeline step to push to git (#387)
* add push to git

* small fixes

* work around .after()

* format
2018-12-02 09:37:21 -08:00
IronPan 494fc05f16 Add IronPan to code_search owner (#386) 2018-11-30 17:37:57 -08:00
IronPan b807843031 add pipeline environment to code search web app (#372)
* add pipeline

* Update app.yaml
2018-11-30 07:51:00 -08:00
IronPan 3799bac22c Update the update_index.sh (#373)
* add search index creator container

* add pipeline

* update op name

* update readme

* update scripts

* typo fix

* Update Makefile

* Update Makefile

* address comments

* fix ks

* update pipeline

* restructure the images

* remove echo

* update image

* add code embedding launcher

* small fixes

* format

* format

* address comments

* add flag

* Update arguments.py

* update parameter

* revert to use --wait_until_finished. --wait_until_finish never works

* update image

* update git script

* update script

* update readme
2018-11-29 00:53:09 -08:00
Hougang Liu 6855802aa1 tf-training-job doesn't complete (#367)
In tensorflow/models/research/object_detection/, only
tensorflow/models/research/object_detection/legacy/train.py
supports kubeflow so far (it constructs the cluster by reading the
TF_CONFIG environment var).

Fixes: #277
2018-11-28 22:48:21 -08:00
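
The commit above relies on the training script constructing its cluster from the TF_CONFIG environment variable. A minimal sketch of that parsing step using only the standard library; the function name `cluster_from_tf_config` and the sample host names are hypothetical, and the real legacy/train.py may read the variable differently:

```python
import json
import os


def cluster_from_tf_config(environ=None):
    """Parse TF_CONFIG into (cluster, job_name, task_index)."""
    environ = os.environ if environ is None else environ
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return cluster, task.get("type"), task.get("index", 0)


# Hypothetical environment for the second worker of a small cluster.
SAMPLE_ENV = {
    "TF_CONFIG": json.dumps({
        "cluster": {
            "ps": ["ps-0:2222"],
            "worker": ["worker-0:2222", "worker-1:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })
}

if __name__ == "__main__":
    print(cluster_from_tf_config(SAMPLE_ENV))
```

A script that never reads this variable has no way to learn about its peers, which is consistent with the job not completing when a different entry point is used.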
David Sabater Dinter f9a707ee85 [pytorch_mnist] Point images back to gcr.io/kubeflow-examples (#360)
* Point images back to gcr.io/kubeflow-images-public

* Point images back to gcr.io/kubeflow-examples

* Point images back to gcr.io/kubeflow-examples
2018-11-28 22:48:16 -08:00
Guang Ya Liu db8f4f4b37 Highlight the kubectl command. (#369) 2018-11-28 22:41:40 -08:00
IronPan 7ffc50e0ee Add dataflow launcher script (#364)
* add search index creator container

* add pipeline

* update op name

* update readme

* update scripts

* typo fix

* Update Makefile

* Update Makefile

* address comments

* fix ks

* update pipeline

* restructure the images

* remove echo

* update image

* add code embedding launcher

* small fixes

* format

* format

* address comments

* add flag

* Update arguments.py

* update parameter

* revert to use --wait_until_finished. --wait_until_finish never works

* update image
2018-11-27 19:23:54 -08:00
IronPan 760ba7b9e8 Cleanup build directory before code search GCB build (#370)
The build directory caches stale deleted files; without cleaning up the folder, those stale files are carried over to the new image.
2018-11-27 12:54:57 -08:00
IronPan c0345dec90 Update setup.py to point to the new requirement file (#371) 2018-11-27 12:45:07 -08:00
Michelle Casbon 6fcb28bc26 Use latest kubeflow release branch v0.3.4-rc.1 (#365)
Remove separate pipelines installation
Update kfp version to 0.1.3-rc.2
Clarify difference in installation paths (click-to-deploy vs CLI)
Use set_gpu_limit() and remove generated yaml with resource limits
2018-11-27 09:27:34 -08:00
IronPan 31390d39a0 Add update search index pipeline (#361)
* add search index creator container

* add pipeline

* update op name

* update readme

* update scripts

* typo fix

* Update Makefile

* Update Makefile

* address comments

* fix ks

* update pipeline

* restructure the images

* remove echo

* update image

* format

* format

* address comments
2018-11-27 04:43:55 -08:00
Hougang Liu 15007fdeea Add ks env configuration guideline and directory (#346) (#347) 2018-11-26 22:05:36 -08:00
Jeremy Lewi e1e1422da4 Setup ArgoCD to synchronize the code search web app with the demo cluster. (#359)
* Follow argocd instructions
  https://github.com/argoproj/argo-cd/blob/master/docs/getting_started.md
  to install ArgoCD on the cluster

  * Download the argocd manifest and update the namespace to argocd.
  * Check it in so ArgoCD can be deployed declaratively.

* Update README.md with the instructions for deploying ArgoCD.

Move the web app components into their own ksonnet app.

* We do this because we want to be able to sync the web app components using
  Argo CD

* ArgoCD doesn't allow us to apply autosync with granularity less than the
  app. We don't want to sync any of the components except the servers.

* Rename the t2t-code-search-serving component to query-embed-server because
  this is more descriptive.

* Check in a YAML spec defining the ksonnet application for the web UI.

Update the instructions in notebook code-search.ipynb

  * Provided updated instructions for deploying the web app due to the
  fact that the web app is now a separate component.

  * Improve code-search.ipynb
    * Use gcloud to get sensible defaults for parameters like the project.
    * Provide more information about what the variables mean.
2018-11-26 18:19:19 -08:00
IronPan 7924fa7fd0 parameterize search index job name (#358)
* parameterize search index job name

* change namespace

* Update search-index-creator.jsonnet
2018-11-26 12:03:30 -08:00
Jeremy Lewi 5d6a4e9d71 Create a script to update the index and lookup file used to serve predictions. (#352)
* This script will be the last step in a pipeline to continuously update
  the index for serving.

* The script updates the parameters of the search index server to point
  to the supplied index files. It then commits them and creates a PR
  to push those commits.

* Restructure the parameters for the search index server so that we can use
  ks param set to override the indexFile and lookupFile.

* We do this because we want to be able to push a new index by doing
  ks param set in a continuously running pipeline
* Remove default parameters from search-index-server

* Create a dockerfile suitable for running this script.
2018-11-26 06:35:27 -08:00
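
The commit above restructures the search index server parameters so that `ks param set` can override indexFile and lookupFile. A minimal sketch of how a script might assemble those commands, assuming the component is named search-index-server as in the commit message; the helper name, the environment name, and the gs:// paths are hypothetical:

```python
def ks_param_set_commands(component, params, env=None):
    """Build one `ks param set` argument list per parameter."""
    commands = []
    for name, value in sorted(params.items()):
        command = ["ks", "param", "set", component, name, value]
        if env is not None:
            command += ["--env", env]
        commands.append(command)
    return commands


# Hypothetical index locations; the real paths come from the pipeline.
COMMANDS = ks_param_set_commands(
    "search-index-server",
    {
        "indexFile": "gs://example-bucket/index.nmslib",
        "lookupFile": "gs://example-bucket/lookup.csv",
    },
    env="demo",
)

if __name__ == "__main__":
    for command in COMMANDS:
        print(" ".join(command))
```

Each list could be handed to `subprocess.check_call` from inside the ksonnet app directory before the script commits the change and opens a PR.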
IronPan 4f95e85e63 add pipeline component (#356)
* add pipeline component

* update pipeline component
2018-11-26 06:21:07 -08:00
Sarah Maddox 62c2e4c249 Updated example and demo READMEs (#344)
* Explained purpose of demos vs examples and added pipelines demo to README.

* Fixed some rendering in list items.
2018-11-24 17:27:52 -08:00
Jeremy Lewi a32227f371 Fix the ksonnet by defining globals. (#354)
* The latest changes to the ksonnet components require certain values
  to be defined as defaults.

* This is part of the move away from using a fake component to define
  parameters that should be reused across different modules.

  see #308

* Verify we can run ks show on a new environment and can evaluate the ksonnet.

Fix #353
2018-11-24 14:36:43 -08:00
Jeremy Lewi de17011066 Upgrade and fix the serving components. (#348)
* Upgrade and fix the serving components.

* Install a new version of the TFServing package so we can use the new template.

* Fix the UI image. Use the same requirements file as for Dataflow so we are
consistent w.r.t. the version of TF and Tensor2Tensor.

* remove nms.libsonnet; move all the manifests into the actual component
  files rather than using a shared library.

* Fix the name of the TFServing service and deployment; need to use the same
  name as used by the front end server.

* Change the port of TFServing; we are now using the built in http server
  in TFServing which uses port 8500 as opposed to our custom http proxy.

* We encountered an error importing nmslib; moving it to the top of the file
  appears to fix this.

* Fix lint.
2018-11-24 13:22:34 -08:00
David Sabater Dinter a630fcea34 [mnist_pytorch] fix train image (#342)
* Default to model trained with CPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models

* Checkout 1.0rc1 release as latest Pytorch master seems to have MPI backend detection broken

* Track changes in pytorch_mnist/training/ddp/mnist folder to trigger test jobs

* Repoint to pull images from gcr.io/kubeflow-ci built during pre-submit

* Fix image webui name

* Fix logging

* Add GCFS to CPU train

* Fix logging

* Add GCFS to CPU train

* Default to model trained with GPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models

* Fix Predict() method as Seldon expects 3 arguments

* Fix x reference
2018-11-24 13:22:28 -08:00
Jeremy Lewi d2b68f15d7 Fix the K8s job to create the nmslib index. (#338)
* Install nmslib in the Dataflow container so it's suitable for running
  the index creation job.

* Use command not args in the job specs.

* Dockerfile.dataflow should install nmslib so that we can use that Docker
  image to create the index.

* build.jsonnet should tag images as latest. We will use this to use
  the latest images as a layer cache to speed up builds.

* Set logging level to info for start_search_server.py and
  create_search_index.py

* The create search index pod kept getting evicted because the node ran out of
  memory

* Add a new node pool consisting of n1-standard-32 nodes to the demo cluster.
 These have 120 GB of RAM compared to 30GB in our default pool of n1-standard-8

* Set requests and limits on the creator search index pod.

* Move all the config for the search-index-creator job into the
  search-index-creator.jsonnet file. We need to customize the memory resources
  so there's not much value in trying to share config with other components.
2018-11-20 12:53:09 -08:00
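
The commit above uses `command` rather than `args` in the job spec and sets memory requests and limits on the index creation pod. A minimal sketch of such a Job manifest built as a Python dict; the function name, image, command line, and memory value are hypothetical placeholders, and the actual spec lives in search-index-creator.jsonnet:

```python
def index_job_manifest(name, image, command, memory):
    """Build a Kubernetes Job manifest with an explicit command and memory bounds."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": name,
                        "image": image,
                        # `command` replaces the image entrypoint, whereas
                        # `args` would only be appended to it.
                        "command": list(command),
                        "resources": {
                            "requests": {"memory": memory},
                            "limits": {"memory": memory},
                        },
                    }],
                },
            },
        },
    }


# Hypothetical values for illustration only.
MANIFEST = index_job_manifest(
    name="search-index-creator",
    image="example.invalid/code-search-dataflow:latest",
    command=["python", "create_search_index.py"],
    memory="100Gi",
)
```

Setting the request equal to the limit asks the scheduler for a node with that much memory available, which is one way to steer the pod toward the larger n1-standard-32 pool mentioned in the commit.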
David Sabater Dinter a402db1ccc E2E Pytorch mnist example (#274)
* Add Pytorch MNIST example

* Fix link to Pytorch MNIST example

* Fix indentation in README

* Fix lint errors

* Fix lint errors
Add prediction proto files

* Add build_image.sh script to build image and push to gcr.io

* Add pytorch-mnist-webui-release release through automatic ksonnet package

* Fix lint errors

* Add pytorch-mnist-webui-release release through automatic ksonnet package

* Add PB2 autogenerated files to ignore with Pylint

* Fix lint errors

* Add official Pytorch DDP examples to ignore with Pylint

* Fix lint errors

* Update component to web-ui release

* Update mount point to kubeflow-gcfs as the example is GCP specific

* 01_setup_a_kubeflow_cluster document complete

* Test release job while PR is WIP

* Reduce workflow name to avoid Argo error:
"must be no more than 63 characters"

* Fix extra_repos to pull worker image

* Fix testing_image using kubeflow-ci rather than kubeflow-releasing

* Fix extra_repo, only needs kubeflow/testing

* Set build_image.sh executable

* Update build_image.sh from CentralDashboard component

* Remove old reference to centraldashboard in echo message

* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md

* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md

* Add releases for the training and serving images

* Add releases for the training and serving images

* Fix testing_image using kubeflow-ci rather than kubeflow-releasing

* Fix path to Seldon-wrapper build_image.sh

* Fix image name in ksonnet parameter

* Add 02 distributed training documentation

* Add 03 serving the model documentation
Update shared persistent reference in 02 distributed training documentation

* Add 05 teardown documentation

* Add section to test the model is deployed correctly in 03 serving the model

* Add 04 querying the model documentation

* Fix ks-app to ks_app

* Set prow jobs back to postsubmit

* Set prow jobs to trigger presubmit to kubeflow-ci and postsubmit to
kubeflow-images-public

* Change to kubeflow-ci project

* Increase timeout limit during image build to compile Pytorch

* Increase timeout limit during image build to compile Pytorch

* Change build machine type to compile Pytorch for training image

* Change build machine type to compile Pytorch for training image

* Add OWNERS file to Pytorch example

* Fix typo in documentation

* Remove checking docker daemon as we are using gcloud build instead

* Use logging module rather than print()

* Remove empty file, replace with .gitignore to keep tmp folder

* Add ksonnet application to deploy model server and web-ui
Delete model server JSON manifest

* Refactor ks-app to ks_app

* Parametrise serving_model ksonnet component
Default web-ui to use ambassador route to seldon
Remove form section in web-ui

* Remove default environment from ksonnet application

* Update documentation to use ksonnet application

* Fix component name in documentation

* Consolidate Pytorch train module and build_image.sh script

* Consolidate Pytorch train module

* Consolidate Pytorch train module

* Consolidate Pytorch train module and build_image.sh script

* Revert back build_image.sh scripts

* Remove duplicates

* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud

* Fix docker build command

* Fix docker build command

* Fix image name for cpu and gpu train

* Consolidate Pytorch train module

* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
2018-11-18 14:24:43 -08:00
Michelle Casbon 4bbc0c8fd8 Simple pipeline demo (#322)
* Add simple pipeline demo

* Add hyperparameter tuning & GPU autoprovisioning

Use pipelines v0.1.2

* Resolve lint issues

* Disable lint warning

Correct SDK syntax that labels the name of the pipeline step

* Add postprocessing step

Basically empty step just to show more than one step

* Add clarity to instructions

* Update pipelines install to release v0.1.2

* Add repo cloning with release versions

Remove katib patch
Use kubeflow v0.3.3
Add PROJECT to env var override file
Further clarification of instructions
2018-11-16 11:16:12 -08:00
Yang Pan 60a7413cc5 Remove ksonnet registry from dockerignore file (#333)
In order to build a pipeline that can run ksonnet commands, the ksonnet registry needs to be containerized.
Remove it from dockerignore to unblock the work.
2018-11-14 13:45:15 -08:00