277 Commits

Author SHA1 Message Date
Oleg Shepetyuk ea86a41172 Updated mnist example README with AWS credentials setting 2019-01-25 17:26:56 +02:00
Oleg Shepetyuk f85a8e970f Made SecretRefs more generic and fixed failed test 2019-01-24 18:40:56 +02:00
Oleg Shepetyuk f89af01e2c Add support for AWS access/secret keys in train component (#466) 2019-01-23 09:58:00 +02:00
Jeremy Lewi 2b0eec34c3 Enable periodic tests for mnist & GH issue examples. (#486)
* Add a link to the E2E testing guide to the contributing page.

Related to #485 - enable periodic mnist E2E testing.
2019-01-22 16:10:17 -08:00
Zhenghui Wang 22715c4900 build image for ames-housing-serving (#484) 2019-01-18 18:19:44 -08:00
Jeremy Lewi 5b797c871e Create an E2E test for TFServing using the rest API (#479)
* Create an E2E test for TFServing using the rest API

* We use the pytest framework because
  1. it has really good support for using command line arguments
  2. can emit junit xml file to report results to prow.

Related to #270: Create a generic test runner

* Address comments.

* Fix lint.

* Add retries to the prediction.

* Add some comments.

* Fix model path.

* Fix the workflow labels
* Set the K8s service name correctly on the test.

* Fix the workflow.

* Fix lint.
2019-01-18 16:29:42 -08:00
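
The "Add retries to the prediction" step in commit 5b797c871e can be illustrated with a minimal sketch. The commit message does not say how the retries were implemented, so everything here is an assumption: the helper names `call_with_retries` and `build_predict_url`, the attempt count, and the fixed delay are hypothetical. Only the URL shape is taken from TensorFlow Serving's REST predict endpoint.

```python
import time


def build_predict_url(host, port, model_name):
    """Build a TF Serving REST predict URL (host/port/model are caller-supplied)."""
    return f"http://{host}:{port}/v1/models/{model_name}:predict"


def call_with_retries(fn, attempts=3, delay_seconds=0.0, retry_on=(Exception,)):
    """Call fn() until it succeeds or `attempts` calls have failed.

    Hypothetical helper: the real test may use a different retry policy.
    The last exception is re-raised once the attempts are exhausted.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on as err:
            last_error = err
            # Sleep only between attempts, not after the final failure.
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error
```

In a test, `fn` would typically wrap the HTTP POST to the prediction URL, so that a model server that is still loading does not fail the test on the first request.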
govind cs b71a14396a optimized apt-get to reduce image size (#482)
* optimized apt-get to reduce image size

* More verbose logging

* minor fix

removed no install recommends
2019-01-18 06:00:18 -08:00
cliveseldon 8d728f0b06 GitHub Summarization Seldon Update (#472)
* Update model inference wrapping to use S2I and update docs

* Add s2i reference in docs

* Fix typo highlighted in review

* Add pyLint annotation to allow protected-access on keras make predict function method
2019-01-17 16:07:34 -08:00
David Sabater Dinter 152c38b386 [mnist_pytorch] Optimise build and switch backend from MPI to GLOO (#480)
* Refactor Python module:
- Replace MPI by GLOO as backend to avoid having to recompile Pytorch
- Replace DistributedDataParallel() class with official version when using GPUs
- Remove unnecessary method to disable logs in workers
- Refactor run()

* Simplify Dockerfile by using Pytorch 0.4 official image with Cuda and remove mpirun call
2019-01-16 11:38:52 -08:00
Zhenghui Wang 1ed08b9af2 Fix model serving part of xgboost_ams_housing example. (#478)
* Fix model serving for ames house example

* change the step instructions

* add public image
2019-01-15 12:30:51 -08:00
Jeremy Lewi 46a795693a Minor fixes to the notebook. (#427)
* Need to fix the import and compile commands.

* Check if an experiment with the name already exists.
2019-01-15 08:33:19 -08:00
Zhenghui Wang 8f32202a36 Add richard and zhenghui to approvers for kubeflow/examples (#470)
* add richard and zhenghui as approvers for examples

* add owner file to xgboost example

* reduce approvers

* update
2019-01-14 17:50:30 -08:00
Richard Liu 64c3889071
Merge pull request #476 from richardsliu/hp_tuning
Fix xgboost example for hyperparameter tuning
2019-01-14 17:41:07 -08:00
Hung-Ting Wen c83ed09a77 revert back removed v1alpha2 yaml manifests (#475)
* revert back removed v1alpha2 yaml manifests

* Add documentation

* Fix format
2019-01-14 17:08:29 -08:00
Richard Liu 3859564422 Fix pylint and log fmt 2019-01-14 17:01:27 -08:00
Richard Liu 1b29c2176e Merge remote-tracking branch 'upstream/master' into hp_tuning 2019-01-14 16:00:30 -08:00
Richard Liu 8437ec9e5c Fix logging 2019-01-14 15:54:25 -08:00
Jeremy Lewi 6770b4adcc Add the web-ui for the mnist example (#473)
* Add the web-ui for the mnist example

Copy the mnist web app from
https://github.com/googlecodelabs/kubeflow-introduction

* Update the web app

   * Change "server-name" argument to "model-name" because that is what
     it is.

   * Update the prediction client code; The prediction code was copied
     from https://github.com/googlecodelabs/kubeflow-introduction and
     that model used slightly different values for the input names
     and outputs.

  * Add a test for the mnist_client code; currently it needs to be run
    manually.

* Fix the label selector for the mnist service so that it matches the
  TFServing deployment.

* Delete the old copy of mnist_client.py; we will go with the copy in web-ui from https://github.com/googlecodelabs/kubeflow-introduction

* Delete model-deploy.yaml, model-train.yaml, and tf-user.yaml.
  The K8s resources for training and deploying the model are now in ks_app.

* Fix tensorboard; tensorboard only partially works behind Ambassador. It seems like some requests don't work behind a reverse proxy.

* Fix lint.
2019-01-14 13:56:39 -08:00
Richard Liu 9e1ee20512 Fix xgboost for hp tuning 2019-01-14 11:50:13 -08:00
Zhenghui Wang b3f06c204d Fix the model training of ames-housing example (#468)
* correct the image path

* fix training part

* rm downloading from github
2019-01-11 17:08:22 -08:00
Jeremy Lewi 2494fdf8c5 Update serving in mnist example; use 0.4 and add testing. (#469)
* Add the TFServing component
* Create TFServing components.

* The model.py code doesn't appear to be exporting a model in saved model
  format; it was missing a call to export.

  * I'm not sure how this ever worked.

* It also looks like there is a bug in the code in that it is using the cnn input fn even if the model is the linear one. I'm going to leave that as is for now.

* Create a namespace for each test run; delete the namespace on teardown
* We need to copy the GCP service account key to the new namespace.
* Add a shell script to do that.
2019-01-11 14:36:43 -08:00
Jeremy Lewi ef108dbbcc Update training to use Kubeflow 0.4 and add testing. (#465)
* Update training to use Kubeflow 0.4 and add testing.

* To support testing we need to create a ksonnet template to train
  the model so we can easily substitute in different parameters during
  training.

* We create a ksonnet component for just training; we don't use Argo.
  This makes the example much simpler.

* To support S3 we add a generic ksonnet parameter to take environment
  variables as a comma separated list of variables. This should make it
  easy for users to set the environment variables needed to talk to S3.
  This is compatible with the existing Argo workflow which supports S3.

* By default the training job runs non-distributed; this is because to
  run distributed the user needs a shared filesystem (e.g. S3/GCS/NFS).

* Update the mnist workflow to correctly build the images.

  * We didn't update the workflow in the previous example to actually
    build the correct images.

* Update the workflow to run the tfjob_test

* Related to #460 E2E test for mnist.

* Add a parameter to specify a secret that can be used to mount
  a secret such as the GCP service account key.

* Update the README with instructions for GCS and S3.

* Remove the instructions about Argo; the Argo workflow is outdated.

  Using Argo adds complexity to the example and the thinking is to remove
  that to provide a simpler example and to mirror the pytorch example.

* Add a TOC to the README

* Update prerequisite instructions.

  * Delete instructions for installing Kubeflow; just link to the
    getting started guide.

  * Argo CLI should no longer be needed.

  * GitHub token shouldn't be needed; I think that was only needed
    for ksonnet to pull the registry.

* Fix instructions; access keys shouldn't be stored as ksonnet parameters
  as these will get checked into source control.
2019-01-10 12:42:45 -08:00
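
Commit ef108dbbcc describes a generic ksonnet parameter that takes environment variables as a comma-separated list. A minimal sketch of how such a value could be expanded into container `env` entries follows. The `NAME=VALUE` item format and the function name `parse_env_list` are assumptions made for illustration; the actual ksonnet template may parse the parameter differently.

```python
def parse_env_list(env_variables):
    """Turn 'A=1,B=2' into a list of Kubernetes-style env entries.

    Assumes each item has the form NAME=VALUE (hypothetical format).
    Empty items, e.g. from a trailing comma, are skipped.
    """
    entries = []
    for item in env_variables.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        name, sep, value = item.partition("=")
        name = name.strip()
        if not sep or not name:
            raise ValueError(f"expected NAME=VALUE, got {item!r}")
        entries.append({"name": name, "value": value})
    return entries
```

As the same commit notes, access keys themselves should not be passed this way, because ksonnet parameters get checked into source control; a list like this suits non-secret settings such as a region or endpoint.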
Hung-Ting Wen 4dda73afbf Update pytorch_mnist example to use v1beta1 (#445)
* Add job_mnist_DDP_CPU for v1beta1

* Add job_mnist_DDP_GPU for v1beta1

* Update 02_distributed_training.md to use v1beta1

* Remove pytorch v1alpha2 config

* Add missing CPU training config
2019-01-09 05:27:35 -08:00
David Sabater Dinter 38daafa0c3 [mnist_pytorch] Update documentation (#463)
* Fix link to next section, training the model

* Added links to next and previous sections in training the model README

* Fix link to previous section, training the model

* Remove TODO list
2019-01-08 15:32:51 -08:00
Jeremy Lewi d28ba7c4db Continuously build the docker images used by mnist. (#462)
* This is the first step in adding E2E tests for the mnist example.

* Add a Makefile and .jsonnet file to build the Docker images using GCB

* Define an Argo workflow to trigger the image builds on pre & post submit.

Related to: #460
2019-01-08 15:21:49 -08:00
Jeremy Lewi 1cc4550b7d GIS E2E test verify the TFJob runs successfully (#456)
* Create a test for submitting the TFJob for the GitHub issue summarization example.

* This test needs to be run manually right now. In a follow on PR we will
  integrate it into CI.

* We use the image built from Dockerfile.estimator because that is the image
  we are running train_test.py in.

  * Note: The current version of the code now requires Python3 (I think this
    is due to an earlier PR which refactored the code into a shared
    implementation for using TF estimator and not TF estimator).

* Create a TFJob component for TFJob v1beta1; this is the version
  in KF 0.4.

TFJob component
  * Upgrade to v1beta to work with 0.4
  * Update command line arguments to match the versions in the current code
      * input & output are now single parameters rather than separate parameters
        for bucket and name

  * change default input to a CSV file because the current version of the
    code doesn't handle unzipping it.

* Use ks_util from kubeflow/testing

* Address comments.
2019-01-08 15:06:49 -08:00
Jeremy Lewi 959d072e68 Setup continuous building of Docker images for GH Issue Summarization Example (#449)
* Setup continuous building of Docker images and testing for GH Issue Summarization Example.

* This is the first step in setting up a continuously running CI test.

* Add support for building the Docker images using GCB; we will use GCB
  to trigger the builds from our CI system.

  * Make the Makefile top level (at root of GIS example) so that we can
    easily access all the different resources.

* Add a .gitignore file to avoid checking in the build directory used by
  the Makefile.

* Define an Argo workflow to use as the E2E test.

Related to #92: E2E test & CI for github issue summarization

* Trigger the test on pre & post submit

* Dockerfile.estimator don't install the data_download.sh script
  * It doesn't look like we are currently using data_download.sh in the
    Docker image
  * It looks like it only gets used via the ksonnet job which mounts the
    script via a config map

  * Copying data_download.sh to the Docker image is currently weird
    given the organization of the Dockerfile and context.

* Copy the test_data to the Docker images so that we can run the test
  inside the images.

* Invoke the python unittest for training from our CI system.

  * In a follow on PR we will update the test to emit a JUnit XML file to
    report results to prow.

* Fix image build.
2019-01-04 17:02:24 -08:00
Michelle Casbon 70a22d6d7b [GH Issue Summarization] Upgrade to kf v0.4.0-rc.2 (#450)
* Update tfjob components to v1beta1

Remove old version of tensor2tensor component

* Combine UI into a single jsonnet file

* Upgrade GH issue summarization to kf v0.4.0-rc.2

Use latest ksonnet v0.13.1
Use latest seldon v1alpha2
Remove ksonnet app with full kubeflow platform & replace with components specific to this example.
Remove outdated scripts
Add cluster creation links to Click-to-deploy & kfctl
Add warning not to use the Training with an Estimator guide
Replace commandline with bash for better syntax highlighting
Replace messy port-forwarding commands with svc/ambassador
Add modelUrl param to ui component
Modify teardown instructions to remove the deployment
Fix grammatical mistakes

* Rearrange tfjob instructions
2018-12-30 20:05:29 -08:00
Jeremy Lewi 7990408207 Delete obsolete HP tuning code. (#451)
* Katib no longer uses custom go programs. Instead it uses the new
  StudyJobController custom resource.

* This code is no longer needed so delete it.
2018-12-29 19:00:14 -08:00
Hung-Ting Wen 37dd52f49d Fix example documentation (#447) 2018-12-28 18:11:33 -08:00
Jeremy Lewi e15bfffca4 An Argo workflow to use as the E2E test for code_search example. (#446)
* An Argo workflow to use as the E2E test for code_search example.

* The workflow builds the Docker images and then runs the python test
  to train and export a model

* Move common utilities into util.libsonnet.

* Add the workflow to the set of triggered workflows.

* Update the test environment used by the test ksonnet app; we've since
  changed the location of the app.

Related to #295

* Refactor the jsonnet file defining the GCB build workflow

  * Use an external variable to conditionally pull and use a previous
    Docker image as a cache

  * Reduce code duplication by building a shared template for all the different
    workflows.

* BUILD_ID needs to be defined in the default parameters otherwise we get an error when adding a new environment.

* Define suitable defaults.
2018-12-28 16:12:32 -08:00
David Sabater Dinter a1f0d6dfec Fixed some outdated comments to trigger pushing web-ui and model serve images to gcr.io/kubeflow-examples (#444) 2018-12-26 15:05:42 -08:00
Hougang Liu 1ed74b274c create pv for pets-pv (#439)
* create pv for pets-pv

On many users' k8s clusters, dynamic volume provisioning isn't
enabled, so newcomers may be blocked because pets-pv stays
Pending. We can guide them to create an NFS PV as an option.

* tell user how to check if a default storage class is defined

* add link about how to create PV
2018-12-21 06:05:11 -08:00
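
Commit 1ed74b274c mentions telling users how to check whether a default storage class is defined. One way to do that check programmatically is sketched below, assuming the input is the parsed JSON from `kubectl get storageclass -o json`. The function name `default_storage_classes` is hypothetical, and the commit itself may document a different, manual check.

```python
# Annotations Kubernetes uses to mark a StorageClass as the default;
# the beta key is kept for older clusters.
DEFAULT_ANNOTATIONS = (
    "storageclass.kubernetes.io/is-default-class",
    "storageclass.beta.kubernetes.io/is-default-class",
)


def default_storage_classes(storage_class_list):
    """Return the names of storage classes annotated as default.

    `storage_class_list` is the dict parsed from the JSON list output.
    An empty result means a PVC without an explicit class may stay Pending.
    """
    names = []
    for item in storage_class_list.get("items", []):
        metadata = item.get("metadata", {})
        annotations = metadata.get("annotations") or {}
        if any(annotations.get(key) == "true" for key in DEFAULT_ANNOTATIONS):
            names.append(metadata.get("name"))
    return names
```

If the returned list is empty, creating a PV by hand (for example the NFS option the commit suggests) is one way to let the claim bind.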
Jeremy Lewi 2e6e891a5b Update the ArgoCD app to use the kubeflow/examples repo (#440)
* We were using jlewi's fork because PRs hadn't been committed but
  all the relevant PRs have been merged and master is the source of truth.
2018-12-19 21:26:49 -08:00
Jeremy Lewi ba9af34805 Create a script to count lines of code. (#379)
* Create a script to count lines of code.

* This is used in the presentation to get an estimate of where the human effort is involved.

* Fix lint issues.
2018-12-19 09:42:25 -08:00
Guang Ya Liu 345e69ab4c Removed empty application centric section. (#375) 2018-12-14 18:36:18 -08:00
Svendegroote91 a2e8a08e11 remove obsolete PS in GPU jsonnet (#407) 2018-12-12 18:08:15 -08:00
Jeremy Lewi 9f061a0554 Update the central dashboard UI image to one that includes pipelines. (#430) 2018-12-12 09:34:21 -08:00
Jeremy Lewi 1b643c2b81 Fix the web app. (#432)
* We need to set the parameters for the model and index.

  * It looks like when we split up the web app into its own ksonnet app
    we forgot to set the parameters.

* Since the web app is being deployed in a separate namespace we need to
  copy the GCP credential to that namespace. Add instructions to the
  demo README.md on how to do that.

* It looks like the pods were never getting started because the secret
  couldn't be mounted.
2018-12-12 09:24:40 -08:00
IronPan 4a7e2c868c fix bq table duplication (#418)
* fix bq table duplication

* fix bq table duplication

* update

* update image

* use index for placeholder
2018-12-10 18:50:28 -08:00
David Sabater Dinter d408ae09f0 Point images back to gcr.io/kubeflow-examples (#421) 2018-12-09 16:02:24 -08:00
Jeremy Lewi b26f7e9a48 Add pods/logs permission to the jupyter notebook role. (#419)
* This is needed so that fairing can tail the logs.
2018-12-09 15:53:46 -08:00
Azmi Kamis 6cdc461b50 fix directories in Dockerfile (#416) 2018-12-08 14:53:10 -08:00
Azmi Kamis c234f18b0b Fix curl command when sending request to seldon-served model on GKE (#415)
* fix curl command when sending request to seldon-served model on GKE

* modified response
2018-12-08 14:44:06 -08:00
Hougang Liu fc5a85b948 reconcile tensorflow serving version (#409)
Since default OBJ_DETECTION_IMAGE tensorflow version is 1.10.0, we
pin consistent version 1.10.0 of TF across the example.

Fixes: #408
2018-12-08 14:32:46 -08:00
Jeremy Lewi 67d42c4661 Expose ArgoCD UI behind Ambassador. (#413)
* We need to disable TLS (it's handled by ingress) because that leads to
  endless redirects.

* ArgoCD is running in namespace argo-cd but Ambassador is running in a
  different namespace and currently only configured with RBAC to monitor
  a single namespace.

* So we add a service in namespace kubeflow just to define the Ambassador mapping.
2018-12-08 12:49:34 -08:00
IronPan 4c970876dc add notebook for code search pipeline (#410) 2018-12-07 10:29:02 -08:00
Sam Shi b2e6aa231c Save the batch-predict package in the image; Create a separate Dockerfile for GPU (#383)
* Save the batch-predict package in the image; create a separate Dockerfile for gpu

* remove commented code
2018-12-07 10:28:57 -08:00
IronPan 0d2f5b6342 Clean up code search pipeline (#406)
* update pipeline to use out of box gcp credential support

* Update index_update_pipeline.py
2018-12-07 10:13:12 -08:00
Hougang Liu 9994b57497 add object detection grpc client (#378)
* add object detection grpc client

Fixes: #377

* fix kubeflow-examples-presubmit error

object_detection_grpc_client.py depends on other files in
https://github.com/tensorflow/models.git, pylint will fail
because those files need to be compiled manually.
Since mnist_DDP.py has similar dependency, here just follow
mnist_DDP.py and ignore checking this file.
2018-12-06 18:51:24 -08:00