Commit Graph

25 Commits

Author SHA1 Message Date
Jeremy Lewi cc93a80420
Create a notebook for mnist E2E on GCP (#723)
* A notebook to run the mnist E2E example on GCP.

This fixes a number of issues with the example
* Use ISTIO instead of Ambassador to add reverse proxy routes
* The training job needs to be updated to run in a profile-created namespace in order to have the required service accounts
     * See kubeflow/examples#713
     * Running the notebook on Kubeflow should ensure the user
       is in an appropriately set up namespace
* With ISTIO the default RBAC rules prevent the web UI from sending requests to the model server
     * A short-term fix was to not include the ISTIO sidecar
     * In the future we can add an appropriate ISTIO RBAC policy

* Using a notebook allows us to eliminate the use of kustomize
  * This resolves kubeflow/examples#713 which required people to use
    an old version of kustomize

  * Rather than using kustomize we can use Python f-strings to
    write the YAML specs and then easily substitute in user-specific
    values (see the sketch below)

  * This should be more informative; it avoids introducing kustomize and
    users can see the resource specs.
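  A minimal sketch of that approach (the namespace, image, and TFJob spec
  below are illustrative, not the notebook's actual values):

```python
import yaml

# Hypothetical user-specific values to substitute into the spec.
namespace = "my-profile"
image = "gcr.io/my-project/mnist-train:latest"

# Build the resource spec with an f-string instead of kustomize.
train_spec = f"""apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-train
  namespace: {namespace}
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: {image}
"""

# Parse into a dict so it can be handed to the K8s APIs.
job = yaml.safe_load(train_spec)
```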

* I've opted to make the notebook GCP specific. I think it's less confusing
  to users to have separate notebooks focused on specific platforms rather
  than having one notebook with a lot of caveats about what to do under
  different conditions

* I've deleted the kustomize overlays for GCS since we don't want users to
  use them anymore

* I used fairing and kaniko to eliminate the use of docker to build the images
  so that everything can run from a notebook running inside the cluster.
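  A sketch of what that looks like, assuming the kubeflow-fairing config API
  (the registry, base image, and input files are illustrative):

```python
from kubeflow import fairing

# Package model.py and build the image in-cluster with kaniko; no
# local docker daemon is required.
fairing.config.set_preprocessor("python", input_files=["model.py"])
fairing.config.set_builder(
    "cluster",  # the cluster builder uses kaniko under the hood
    registry="gcr.io/my-project",
    base_image="gcr.io/kubeflow-images-public/tensorflow-1.15.2-notebook-cpu:1.0.0")
# run() executes the configured preprocess/build/deploy steps.
fairing.config.run()
```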

* k8s_utils.py has some reusable functions that abstract away details for
  users (e.g. low-level calls to K8s APIs)
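  For illustration, a helper in this spirit might look like the following
  (the function name and the naive kind-to-plural mapping are hypothetical,
  not the actual k8s_util API):

```python
from kubernetes import client, config

def create_objects(specs, namespace):
    """Bulk-create custom objects from parsed YAML dicts (illustrative)."""
    config.load_incluster_config()  # the notebook runs inside the cluster
    api = client.CustomObjectsApi()
    for spec in specs:
        group, version = spec["apiVersion"].split("/")
        plural = spec["kind"].lower() + "s"  # naive pluralization
        api.create_namespaced_custom_object(
            group=group, version=version, namespace=namespace,
            plural=plural, body=spec)
```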

* Change the mnist test to just run the notebook
  * Copy the notebook test infra for xgboost_synthetic to py/kubeflow/examples/notebook_test to make it more reusable

* Fix lint.

* Update for lint.

* A notebook to run the mnist E2E example.

Related to: kubeflow/website#1553

* 1. Use fairing to build the model. 2. Construct the YAML spec directly in the notebook. 3. Use the TFJob python SDK.
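  A rough sketch of step 3, assuming the kubeflow-tfjob SDK's TFJobClient
  (names, namespace, and image are illustrative):

```python
from kubeflow.tfjob import TFJobClient

# Minimal TFJob spec; in the notebook this dict comes from the
# f-string YAML.
job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "TFJob",
    "metadata": {"name": "mnist-train", "namespace": "my-profile"},
    "spec": {"tfReplicaSpecs": {"Worker": {
        "replicas": 1,
        "template": {"spec": {"containers": [{
            "name": "tensorflow",
            "image": "gcr.io/my-project/mnist-train:latest"}]}}}}},
}

tfjob_client = TFJobClient()
tfjob_client.create(job, namespace="my-profile")
tfjob_client.wait_for_job("mnist-train", namespace="my-profile")
```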

* Fix the ISTIO rule.

* Fix UI and serving; need to update TF serving to match version trained on.

* Get the IAP endpoint.

* Start writing some helper python functions for K8s.

* Commit before switching from replace to delete.

* Create a library to bulk create objects.

* Cleanup.

* Add back k8s_util.py

* Delete train.yaml; this shouldn't have been added.

* update the notebook image.

* Refactor code into k8s_util; print out links.

* Clean up the notebook. Should be working E2E.

* Added section to get logs from stackdriver.

* Add comment about profile.

* Latest.

* Override mnist_gcp.ipynb with mnist.ipynb

I accidentally put my latest changes in mnist.ipynb even though that file
was deleted.

* More fixes.

* Resolve some conflicts from the rebase; override with changes on remote branch.
2020-02-16 19:15:28 -08:00
Jeremy Lewi e91e9c0df3
Remove the lint tests because they are using python2 (#728)
* Lint is failing because we are still running python2 for lint

* kubeflow/testing#560 is related to building an updated image with a
  python3.8-compatible version of lint so we can support f-strings.

* However, the unittests for kubeflow examples are still written in
  ksonnet. It's not worth trying to update that, so we just
  remove that test for now. The test was only running lint.

* We should really see about using Tekton to write the workflows

  see kubeflow/testing#425
2020-02-11 18:16:08 -08:00
Jeremy Lewi 7e28cd6b23 Update xgboost_synthetic test infra; preliminary updates to work with 0.7.0 (#666)
* Update xgboost_synthetic test infra to use pytest and pyfunc.

* Related to #655 update xgboost_synthetic to use workload identity

* Related to #665 no signal about xgboost_synthetic

* We need to update the xgboost_synthetic example to work with 0.7.0;
  e.g. workload identity

* This PR focuses on updating the test infra and some preliminary
  updates to the notebook

* More fixes to the test and the notebook are probably needed in order
  to get it to actually pass

* Update job spec for 0.7; remove the secret and set the default service
  account.

  * This is to make it work with workload identity

* Instead of using kustomize to define the job to run the notebook we can just modify the YAML spec using python.
* Use the python API for K8s to create the job rather than shelling out.
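A minimal sketch of both points ("job.yaml", the names, and the service
account are illustrative):

```python
import yaml
from kubernetes import client, config

# Load a checked-in Job manifest and patch it in Python instead of
# layering kustomize overlays.
with open("job.yaml") as f:
    job = yaml.safe_load(f)
job["metadata"]["name"] = "run-notebook-test"
job["spec"]["template"]["spec"]["serviceAccountName"] = "default-editor"

# Create the Job through the K8s python client rather than shelling out.
config.load_kube_config()
client.BatchV1Api().create_namespaced_job(namespace="kubeflow-test", body=job)
```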

* Notebook should do a 0.7 compatible check for credentials

  * We don't want to assume GOOGLE_APPLICATION_CREDENTIALS is set
    because we will be using workload identity.
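  A sketch of such a check; google.auth.default() resolves credentials from
  the key file when GOOGLE_APPLICATION_CREDENTIALS is set and from the
  metadata server under workload identity:

```python
import os
import google.auth

if not os.getenv("GOOGLE_APPLICATION_CREDENTIALS"):
    print("No key file set; assuming workload identity")
# Works in both cases; raises DefaultCredentialsError if neither is available.
credentials, project = google.auth.default()
```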

* Take in repos as an argument akin to what checkout_repos.sh requires

* Convert xgboost_test.py to a pytest.

  * This allows us to mark it as expected to fail so we can start to get
    signal without blocking

  * We also need to emit junit files to show up in test grid.
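  For illustration, marking the test xfail might look like this (the
  notebook name and the way the notebook is executed are assumptions):

```python
import subprocess
import pytest

@pytest.mark.xfail(reason="notebook not yet passing against 0.7.0")
def test_xgboost_synthetic():
    # A failure is reported as xfail, so we get signal without blocking.
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "build-train-deploy.ipynb"],
        check=True)
```

  Running it as `pytest xgboost_test.py --junitxml=junit.xml` emits the
  JUnit file for test grid.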

* Convert the jsonnet workflow for the E2E test to a python function to
  define the workflow.

  * Remove the old jsonnet workflow.

* Address comments.

* Fix issues with the notebook
* Install pip packages in user space
  * 0.7.0 images are based on TF images and they have different permissions
* Install a newer version of fairing sdk that works with workload identity

* Split pip installing dependencies out of util.py and into notebook_setup.py

  * That's because util.py could depend on the packages being installed by
    notebook_setup.py

* After pip installing the modules into user space, we need to add the local
  path for pip packages to the Python path; otherwise we get import-not-found
  errors.
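
  A sketch of the fix using the standard library:

```python
import site
import sys

# "pip install --user" puts packages in the per-user site dir, which
# may not be on sys.path inside the notebook kernel.
user_site = site.getusersitepackages()
if user_site not in sys.path:
    sys.path.insert(0, user_site)
```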
2019-10-24 19:53:38 -07:00
Jin Chi He 4f8cf87d4f add testing for xgboost_synthetic (#633) 2019-09-16 15:28:24 -07:00
Jin Chi He 41765a5830 fix e2e test failed problem (#594) 2019-07-12 14:29:05 -07:00
Jin Chi He 871895c544 recommend using kustomize v2.0.3 for mnist (#584) 2019-07-04 19:20:35 -07:00
David Sabater Dinter 7a6dc7b911 [pytorch_mnist] Automate image build (#490)
* Add build and test presubmit jobs for Pytorch mnist example
Keep postsubmit jobs as original release job to push images to examples registry

* Refactor all jobs like mnist and GIS, will drop using release jobs

* Implement test scripts and Ksonnet artifacts from mnist example to enable E2E tests

* Remove release components as they are no longer used

* Refactor YAML manifests as Ksonnet components

* Update documentation to submit training jobs from Ksonnet

* Updated to point to correct component and refactor to PytorchJob

* Add seldon image build
Add train CPU and GPU in jsonnet to build workflow
Add Dockerfile.ksonnet and entrypoint

* Commented out calls to tf-util until
https://github.com/kubeflow/pytorch-operator/issues/108 is implemented

* Refactor to PytorchJob

* Add seldon image build
Add train CPU and GPU in jsonnet to build workflow
Add Dockerfile.ksonnet and entrypoint

* Refactor to PytorchJob

* Rename workflow to avoid dns issue with "_"

* Add TODO note to convert to GRPC

* Rename workflow to avoid dns issue with "_"

* Rename workflow to avoid dns issue with "_"

* Fix path to build Seldon image in Makefile

* Fix tabs in Makefile

* Fix tabs in Makefile

* Fix rule in Makefile

* Add sleep in Makefile to wait for docker ps

* Change node worker image to have docker

* Remove seldon image step from Makefile
Add steps to wrap model with Seldon
Add boolean flag to build Seldon steps

* Add step id build- in jsonnet

* Skip pull step for Seldon

* Fix wait for in Seldon build

* Fix lint errors

* Set useimagecache to false the first time the pipeline is executed to avoid an error

* Set contextDir as absolute path for Seldon step

* Remove unnecessary argument and Dockerfile in Seldon step

* Add absolute path for build in Seldon steps

* Include absolute path inside jsonnet hardcoded to GCB /workspace/
Remove setting rootDir from Makefile

* Update images with new naming from E2E tests

* Change test-worker image version

* Update images with new naming from E2E tests

* Set useimagecache to true now that we have first images built

* Fix cachelist in Seldon build

* Fix cachelist in Seldon build

* Leverage tf-operator test framework for test_runner
As per https://github.com/kubeflow/pytorch-operator/issues/108

* Consolidate testing imports
Rename testing package per https://github.com/kubeflow/tf-operator/pull/945
Added correct path to import test framework from tf-operator

* Add test framework in PYTHONPATH in build_template

* Remove old release jobs to build images

* Update stepimage to same as GIS example

* Bump up supported Pytorch operator versions from v1alpha2/v1beta1 to v1beta1/v1beta2 to support Kubeflow 0.5
- Refactor training manifests from v1alpha2 to v1beta2
- Update documents

* Update KF cluster version to latest to run tests

* Update KF cluster zone

* Add pylint exception while importing test_runner class from tf-operator

* Pass dummy tests to train, deploy and predict
Remove no longer used test_data and conftest

* Pass dummy tests to train, deploy and predict
Remove no longer used test_data and conftest
2019-06-14 16:20:09 -07:00
Jin Chi He 5fac627725 drop_ksonnet_from_mnist (#546) 2019-05-07 19:54:32 -07:00
Amy b23adc1f0b import of Pipelines Github issue summarization examples & tutorial (#507)
* initial import of Pipelines Github issue summarization examples & lab

* more linting/cleanup, fix tf version to 1.12

* bit more linting; pin some lib versions

* last? lint fixes

* another attempt to fix linting issues

* ughh

* changed test cluster config info

* update ktext package in a test docker image

* hmm, retrying fix for the ktext package update
2019-04-18 17:57:54 -07:00
Jin Chi He fb0c5eb115 fix import issue in the mnist e2e testing (#531) 2019-04-05 18:36:27 -07:00
zabbasi 7924e0fe21 Fixed tf_operator import for github_issue_summarization example (#527)
* fixed tf_operator import

* updated tf-operator import path

* small change

* updated PYTHONPATH

* fixed syntax error

* formatting issue
2019-03-14 18:36:58 -07:00
Zhenghui Wang 74378a2990 Add end2end test for Xgboost housing example (#493)
* Add e2e test for xgboost housing example

* fix typo
  * add ks apply
  * add [
  * modify example to trigger tests
  * add prediction test
  * add xgboost ks param
  * rename the job name without _
  * use - instead of _
  * libsonnet params
  * rm redundant component
  * rename component in prow config
  * add ames-hoursing-env
  * use - for all names
  * use _ for params names
  * use xgboost_ames_accross
  * rename component name
  * shorten the name
  * change deploy-test command
  * change to xgboost-namespace
  * init ks app
  * fix typo
  * add conftest.py
  * change path
  * change deploy command
  * change dep
  * change the query URL for seldon
  * add ks_app with seldon lib
  * update ks_app
  * use ks init only
  * rerun
  * change to kf-v0-4-n00 cluster
  * add ks_app
  * use ks-13
  * remove --namespace
  * use kubeflow as namespace
  * delete seldon deployment
  * simplify ks_app
  * retry on 503
  * fix typo
  * query 1285
  * move deletion after prediction
  * wait 10s
  * always retry till 10 mins
  * move check to retry
  * fix pylint
  * move clean-up to the delete template

* set up xgboost component

* check in ks component & run it directly

* change comments

* add comment on why use 'ks delete'

* add two modules to pylint whitelist

* ignore tf_operator/py

* disable pylint per line

* reorder import
2019-02-12 06:37:05 -08:00
Jeremy Lewi 5b797c871e Create an E2E test for TFServing using the rest API (#479)
* Create an E2E test for TFServing using the rest API

* We use the pytest framework because
  1. it has really good support for command line arguments (see the
     conftest.py sketch below)
  2. it can emit a junit xml file to report results to prow

Related to #270: Create a generic test runner
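
A minimal conftest.py sketch of the command line support (the flag name and
default are illustrative):

```python
import pytest

def pytest_addoption(parser):
    # Test parameters arrive as ordinary pytest flags.
    parser.addoption("--service_name", default="mnist-service",
                     help="Name of the TFServing K8s service to test.")

@pytest.fixture
def service_name(request):
    return request.config.getoption("--service_name")
```

Invoking pytest with `--junitxml=junit.xml` then covers the second point:
results get reported to prow.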

* Address comments.

* Fix lint.

* Add retries to the prediction.
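
  A sketch of the retry, assuming the requests library (the model server can
  return 503 while the model loads):

```python
import time
import requests

def predict_with_retry(url, instances, timeout_seconds=600):
    """Keep retrying the prediction until it succeeds or the deadline hits."""
    deadline = time.time() + timeout_seconds
    while True:
        response = requests.post(url, json={"instances": instances})
        if response.ok:
            return response.json()
        if time.time() > deadline:
            response.raise_for_status()
        time.sleep(10)
```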

* Add some comments.

* Fix model path.

* Fix the workflow labels
* Set the K8s service name correctly on the test.

* Fix the workflow.

* Fix lint.
2019-01-18 16:29:42 -08:00
Jeremy Lewi 2494fdf8c5 Update serving in mnist example; use 0.4 and add testing. (#469)
* Add the TFServing component
* Create TFServing components.

* The model.py code doesn't appear to be exporting a model in saved model
  format; it was missing a call to export.

  * I'm not sure how this ever worked.

* It also looks like there is a bug in the code in that it's using the CNN input fn even if the model is the linear one. I'm going to leave that as is for now.

* Create a namespace for each test run; delete the namespace on teardown
* We need to copy the GCP service account key to the new namespace.
* Add a shell script to do that.
2019-01-11 14:36:43 -08:00
Jeremy Lewi ef108dbbcc Update training to use Kubeflow 0.4 and add testing. (#465)
* Update training to use Kubeflow 0.4 and add testing.

* To support testing we need to create a ksonnet template to train
  the model so we can easily substitute in different parameters during
  training.

* We create a ksonnet component for just training; we don't use Argo.
  This makes the example much simpler.

* To support S3 we add a generic ksonnet parameter to take environment
  variables as a comma-separated list (see the sketch below). This should
  make it easy for users to set the environment variables needed to talk
  to S3. This is compatible with the existing Argo workflow which supports S3.
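
  Illustratively, the parsing the template has to do amounts to the
  following (the function name is hypothetical):

```python
def parse_env_param(env_param):
    """Turn "A=1,B=2" into the env list a container spec expects."""
    env = []
    for pair in env_param.split(","):
        name, value = pair.split("=", 1)
        env.append({"name": name, "value": value})
    return env

# e.g. for S3:
parse_env_param("S3_ENDPOINT=s3.us-west-2.amazonaws.com,AWS_REGION=us-west-2")
```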

* By default the training job runs non-distributed; this is because to
  run distributed the user needs a shared filesystem (e.g. S3/GCS/NFS).

* Update the mnist workflow to correctly build the images.

  * We didn't update the workflow in the previous example to actually
    build the correct images.

* Update the workflow to run the tfjob_test

* Related to #460 E2E test for mnist.

* Add a parameter to specify a secret that can be used to mount
  a secret such as the GCP service account key.

* Update the README with instructions for GCS and S3.

* Remove the instructions about Argo; the Argo workflow is outdated.

  Using Argo adds complexity to the example and the thinking is to remove
  that to provide a simpler example and to mirror the pytorch example.

* Add a TOC to the README

* Update prerequisite instructions.

  * Delete instructions for installing Kubeflow; just link to the
    getting started guide.

  * Argo CLI should no longer be needed.

  * GitHub token shouldn't be needed; I think that was only needed
    for ksonnet to pull the registry.

* Fix instructions; access keys shouldn't be stored as ksonnet parameters
  as these will get checked into source control.
2019-01-10 12:42:45 -08:00
Jeremy Lewi d28ba7c4db Continuously build the docker images used by mnist. (#462)
* This is the first step in adding E2E tests for the mnist example.

* Add a Makefile and .jsonnet file to build the Docker images using GCB

* Define an Argo workflow to trigger the image builds on pre & post submit.

Related to: #460
2019-01-08 15:21:49 -08:00
Jeremy Lewi 1cc4550b7d GIS E2E test verify the TFJob runs successfully (#456)
* Create a test for submitting the TFJob for the GitHub issue summarization example.

* This test needs to be run manually right now. In a follow on PR we will
  integrate it into CI.

* We use the image built from Dockerfile.estimator because that is the image
  we are running train_test.py in.

  * Note: The current version of the code now requires Python3 (I think this
    is due to an earlier PR which refactored the code into a shared
    implementation covering both the TF estimator and non-estimator versions).

* Create a TFJob component for TFJob v1beta1; this is the version
  in KF 0.4.

TFJob component
  * Upgrade to v1beta1 to work with 0.4
  * Update command line arguments to match the versions in the current code
      * input & output are now single parameters rather than separate parameters
        for bucket and name

  * change default input to a CSV file because the current version of the
    code doesn't handle unzipping it.

* Use ks_util from kubeflow/testing

* Address comments.
2019-01-08 15:06:49 -08:00
Jeremy Lewi 959d072e68 Setup continuous building of Docker images for GH Issue Summarization Example (#449)
* Setup continuous building of Docker images and testing  for GH Issue Summarization Example.

* This is the first step in setting up a continuously running CI test.

* Add support for building the Docker images using GCB; we will use GCB
  to trigger the builds from our CI system.

  * Make the Makefile top level (at root of GIS example) so that we can
    easily access all the different resources.

* Add a .gitignore file to avoid checking in the build directory used by
  the Makefile.

* Define an Argo workflow to use as the E2E test.

Related to #92: E2E test & CI for github issue summarization

* Trigger the test on pre & post submit

* Dockerfile.estimator: don't install the data_download.sh script
  * It doesn't look like we are currently using data_download.sh in the
    Docker image
  * It looks like it only gets used via the ksonnet job which mounts the
    script via a config map

  * Copying data_download.sh to the Docker image is currently weird
    given the organization of the Dockerfile and context.

* Copy the test_data to the Docker images so that we can run the test
  inside the images.

* Invoke the python unittest for training from our CI system.

  * In a follow on PR we will update the test to emit a JUnit XML file to
    report results to prow.

* Fix image build.
2019-01-04 17:02:24 -08:00
Jeremy Lewi e15bfffca4 An Argo workflow to use as the E2E test for code_search example. (#446)
* An Argo workflow to use as the E2E test for code_search example.

* The workflow builds the Docker images and then runs the python test
  to train and export a model

* Move common utilities into util.libsonnet.

* Add the workflow to the set of triggered workflows.

* Update the test environment used by the test ksonnet app; we've since
  changed the location of the app.

Related to #295

* Refactor the jsonnet file defining the GCB build workflow

  * Use an external variable to conditionally pull and use a previous
    Docker image as a cache

  * Reduce code duplication by building a shared template for all the different
    workflows.

* BUILD_ID needs to be defined in the default parameters otherwise we get an error when adding a new environment.

* Define suitable defaults.
2018-12-28 16:12:32 -08:00
David Sabater Dinter a402db1ccc E2E Pytorch mnist example (#274)
* Add Pytorch MNIST example

* Fix link to Pytorch MNIST example

* Fix indentation in README

* Fix lint errors

* Fix lint errors
Add prediction proto files

* Add build_image.sh script to build image and push to gcr.io

* Add pytorch-mnist-webui-release release through automatic ksonnet package

* Fix lint errors

* Add pytorch-mnist-webui-release release through automatic ksonnet package

* Add PB2 autogenerated files to ignore with Pylint

* Fix lint errors

* Add official Pytorch DDP examples to ignore with Pylint

* Fix lint errors

* Update component to web-ui release

* Update mount point to kubeflow-gcfs as the example is GCP specific

* 01_setup_a_kubeflow_cluster document complete

* Test release job while PR is WIP

* Reduce workflow name to avoid Argo error:
"must be no more than 63 characters"

* Fix extra_repos to pull worker image

* Fix testing_image using kubeflow-ci rather than kubeflow-releasing

* Fix extra_repo, only needs kubeflow/testing

* Set build_image.sh executable

* Update build_image.sh from CentralDashboard component

* Remove old reference to centraldashboard in echo message

* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md

* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md

* Add releases for the training and serving images

* Add releases for the training and serving images

* Fix testing_image using kubeflow-ci rather than kubeflow-releasing

* Fix path to Seldon-wrapper build_image.sh

* Fix image name in ksonnet parameter

* Add 02 distributed training documentation

* Add 03 serving the model documentation
Update shared persistent reference in 02 distributed training documentation

* Add 05 teardown documentation

* Add section to test the model is deployed correctly in 03 serving the model

* Add 04 querying the model documentation

* Fix ks-app to ks_app

* Set prow jobs back to postsubmit

* Set prow jobs to trigger presubmit to kubeflow-ci and postsubmit to
kubeflow-images-public

* Change to kubeflow-ci project

* Increase timeout limit during image build to compile Pytorch

* Increase timeout limit during image build to compile Pytorch

* Change build machine type to compile Pytorch for training image

* Change build machine type to compile Pytorch for training image

* Add OWNERS file to Pytorch example

* Fix typo in documentation

* Remove checking docker daemon as we are using gcloud build instead

* Use logging module rather than print()

* Remove empty file, replace with .gitignore to keep tmp folder

* Add ksonnet application to deploy model server and web-ui
Delete model server JSON manifest

* Refactor ks-app to ks_app

* Parametrise serving_model ksonnet component
Default web-ui to use ambassador route to seldon
Remove form section in web-ui

* Remove default environment from ksonnet application

* Update documentation to use ksonnet application

* Fix component name in documentation

* Consolidate Pytorch train module and build_image.sh script

* Consolidate Pytorch train module

* Consolidate Pytorch train module

* Consolidate Pytorch train module and build_image.sh script

* Revert back build_image.sh scripts

* Remove duplicates

* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud

* Fix docker build command

* Fix docker build command

* Fix image name for cpu and gpu train

* Consolidate Pytorch train module

* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
2018-11-18 14:24:43 -08:00
Jeremy Lewi acd8007717 Use conditionals and add test for code search (#291)
* Fix model export, loss function, and add some manual tests.

Fix Model export to support computing code embeddings: Fix #260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.
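
  The real model is the T2T SimilarityTransformer; as a toy illustration of
  the conditional routing (the towers and sizes here are invented):

```python
import tensorflow as tf

code_tower = tf.keras.layers.Dense(128, name="code_embedding")
query_tower = tf.keras.layers.Dense(128, name="query_embedding")

def embed(tokens, embed_code):
    # Route the same input through the code tower or the query tower
    # depending on the embed_code feature.
    return tf.cond(
        tf.reduce_all(tf.equal(embed_code, 1)),
        lambda: code_tower(tokens),
        lambda: query_tower(tokens))
```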

* Originally based on #233 by @activatedgeek

Loss function improvements

* See #259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in #233
  and this pulls in some of those changes.

Add manual tests

* Related to #258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.

* Skip the test when running under prow because it's a manual test.
* Fix some lint errors.

* Fix lint and skip tests.

* Fix lint.

* Fix lint
* Revert loss function changes; we can do that in a follow on PR.

* Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can overwrite the number of shards
  used easily to facilitate testing.

* Comment out py-test for now.
2018-11-02 09:52:11 -07:00
Ankush Agarwal a5d808cc88 Fix failing test due to https://github.com/kubeflow/testing/pull/111 (#95) 2018-04-24 12:11:00 -07:00
Ankush Agarwal 1c72cf942f Move from mlkube-testing to kubeflow-ci for test-infra (#65)
Fixes https://github.com/kubeflow/examples/issues/63
2018-03-29 15:25:03 -07:00
Ankush Agarwal 96c11b03cc ks upgrade test/workflows and agents/app (#49) 2018-03-15 14:24:24 -07:00
Michelle Casbon a855d666d8 Skeleton testing framework (#18)
* First stab at adding tests to this repo

* Add prow_config.yaml & remove test-infra dir

* Add .gitignore

* Add components.workflows.prow to params.libsonnet

Change ksonnet app name

* Add package names & EXTRA_REPOS, remove steps

* Put steps back

* Remove build step

* Remove cluster setup & teardown
2018-03-01 21:30:50 -08:00