Commit Graph

77 Commits

Author SHA1 Message Date
Amy 443f4bd2a3
deprecating gis e2e example until it is fixed. (#736) 2020-02-18 20:58:25 -08:00
dependabot[bot] c20eafc4fc Bump nltk from 3.2.5 to 3.4.5 in /github_issue_summarization (#698)
Bumps [nltk](https://github.com/nltk/nltk) from 3.2.5 to 3.4.5.
- [Release notes](https://github.com/nltk/nltk/releases)
- [Changelog](https://github.com/nltk/nltk/blob/develop/ChangeLog)
- [Commits](https://github.com/nltk/nltk/compare/3.2.5...3.4.5)

Signed-off-by: dependabot[bot] <support@github.com>
2019-12-17 15:54:03 -08:00
dependabot[bot] f6a7adb2fc Bump tensorflow-gpu from 1.3.0 to 1.15.0 in /github_issue_summarization (#697)
Bumps [tensorflow-gpu](https://github.com/tensorflow/tensorflow) from 1.3.0 to 1.15.0.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v1.3.0...v1.15.0)

Signed-off-by: dependabot[bot] <support@github.com>
2019-12-16 12:59:39 -08:00
Amy 91374e6d27 notebook cleanup (#679) 2019-11-11 16:42:06 -08:00
Amy 67041ec4d5 updates to reflect changed node credentials, minor cleanup, update component URLs, (#675)
update serving-only pipeline as well
2019-11-05 18:29:01 -08:00
Amy 452aa428b6
Updates to the pipelines GH summarization lab to demonstrate component input/output (#669)
* copy and training step params, remove unused args,
use google-samples images

* update notebook to reflect new pipeline

* type definition change

* fix typo, use kfp.dsl.RUN_ID_PLACEHOLDER

* change 'serve' step to use gcp secret; req'd for 0.7
2019-10-27 04:55:14 -07:00
Amy ad55b0a246 change gpu limit (#651) 2019-10-03 09:12:09 -07:00
Jin Chi He cfe166f73f update to kubeflow-metadata in examples (#646) 2019-09-26 16:13:34 -07:00
Amy c20ebb5c0f fix component URLs in pipeline now that primary PR is in (#642) 2019-09-19 14:46:58 -07:00
Amy b5349df27d Update to KFP pipelines codelab code (GH summarization) (#638)
* checkpointing

* checkpointing

* refactored pipeline that uses pre-emptible VMs

* checkpointing. istio routing for the webapp.

* checkpointing

* - temp testing components
- initial v of metadata logging 'component'
- new dirs; file rename

* public md log image; add md server connect retry

* update pipeline to include md logging steps

* - file rename, notebook updates
- update compiled pipeline; fix component name typo

- change DAG to allow md logging concurrently; update pre-emptible VMS PL

* pylint cleanup, readme/tutorial update/deprecation, minor tweaks

* file cleanup

* update the tfjob api version for an (unrelated) test to address presubmit issues

* try annotating test_train in github_issue_summarization/testing/tfjob_test.py with @unittest.expectedFailure

* try commenting out a (likely) problematic unittest unrelated to the code changes in this PR

* try adding @test_util.expectedFailure annotation instead of commenting out test

* update the codelab shortlink; revert to commenting out a problematic unit test
2019-09-19 08:47:00 -07:00
Nick Harvey 9675480997 minor update to the pachyderm seldon example (#562)
* minor update to the pachyderm seldon example

* Another minor update to the pipeline
2019-07-04 13:24:19 -07:00
Amy 767ecd240d Use the client libs to do a GCS copy instead of gsutil (#558)
* use gcs client libs to copy checkpoint dir

* more minor cleanup, use tagged image, use newer pipeline param spec. syntax.
pylint cleanup.
added set_memory_limit() to notebook pipeline training steps.
modified the pipelines definitions to use the user-defined params as defaults.

* put a retry loop around the copy_blob
2019-05-17 14:00:11 -07:00
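
The retry loop around the blob copy mentioned in #558 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the commit: `copy_with_retry`, its parameters, and the `flaky_copy` stand-in are hypothetical names, and the real copy call (for example a GCS client `copy_blob`) would be passed in as `copy_fn`.

```python
import time


def copy_with_retry(copy_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call copy_fn() until it succeeds or max_attempts is exhausted.

    copy_fn is a zero-argument callable that performs one copy attempt
    (hypothetical; in practice it would wrap the storage client call).
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return copy_fn()
        except Exception as err:  # a real caller should catch narrower errors
            last_error = err
            if attempt < max_attempts:
                # Linear backoff: wait a little longer after each failure.
                sleep(base_delay * attempt)
    raise RuntimeError(
        "copy failed after %d attempts" % max_attempts
    ) from last_error


# Usage with a stand-in copy function that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_copy():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("transient failure")
    return "copied"


result = copy_with_retry(flaky_copy, sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays.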
Michelle Casbon 1ebcd33881 Add memory limit (#552)
This aids GKE autoprovisioning in creation of appropriate node sizes
2019-05-14 17:54:18 -07:00
Amy ed38a1a35c re-add file that I'd missed in porting over; minor comment phrasing change in notebook (#556) 2019-05-13 12:28:50 -07:00
Amy b23adc1f0b import of Pipelines Github issue summarization examples & tutorial (#507)
* initial import of Pipelines Github issue summarization examples & lab

* more linting/cleanup, fix tf version to 1.12

* bit more linting; pin some lib versions

* last? lint fixes

* another attempt to fix linting issues

* ughh

* changed test cluster config info

* update ktext package in a test docker image

* hmm, retrying fix for the ktext package update
2019-04-18 17:57:54 -07:00
Nick Harvey 52795bcaf5 Adding Pachyderm Example (squashed) (#522)
* Adding Pachyderm Example (squashed)

* Add Dan Sanche to OWNERS (#520)

Fixed tf_operator import for github_issue_summarization example (#527)

* fixed tf_operator import

* updated tf-operator import path

* small change

* updated PYTHONPATH

* fixed syntax error

* formatting issue

Mnist pipelines (#524)

* added mnist pipelines sample

* fixed lint issues
2019-03-20 08:41:00 -07:00
zabbasi 7924e0fe21 Fixed tf_operator import for github_issue_summarization example (#527)
* fixed tf_operator import

* updated tf-operator import path

* small change

* updated PYTHONPATH

* fixed syntax error

* formatting issue
2019-03-14 18:36:58 -07:00
Michelle Casbon 692c78550e
Merge pull request #399 from govindKAG/patch-1
fixed "setting persistent disk" link
2019-02-17 09:04:42 -08:00
Zhenghui Wang 74378a2990 Add end2end test for Xgboost housing example (#493)
* Add e2e test for xgboost housing example

* fix typo

add ks apply

add [

modify example to trigger tests

add prediction test

add xgboost ks param

rename the job name without _

use - instead of _

libson params

rm redundant component

rename component in prow config

add ames-hoursing-env

use - for all names

use _ for params names

use xgboost_ames_accross

rename component name

shorten the name

change deploy-test command

change to xgboost-
namespace

init ks app

fix type

add conftest.py

change path

change deploy command

change dep

change the query URL for seldon

add ks_app with seldon lib

update ks_app

use ks init only

rerun

change to kf-v0-4-n00 cluster

add ks_app

use ks-13

remove --namespace

use kubeflow as namespace

delete seldon deployment

simplify ks_app

retry on 503

fix typo

query 1285

move deletion after prediction

wait 10s

always retry till 10 mins

move check to retry

fix pylint

move clean-up to the delete template

* set up xgboost component

* check in ks component& run it directly

* change comments

* add comment on why use 'ks delete'

* add two modules to pylint whitelist

* ignore tf_operator/py

* disable pylint per line

* reorder import
2019-02-12 06:37:05 -08:00
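
Several of the squashed steps in #493 ("retry on 503", "wait 10s", "always retry till 10 mins") describe a polling pattern. One reading of it is sketched below; it is not the test code from the commit. `poll_until_ready` and `send_request` are hypothetical names, and `send_request` is assumed to return a `(status_code, body)` pair.

```python
import time


def poll_until_ready(send_request, timeout_s=600, wait_s=10,
                     clock=time.monotonic, sleep=time.sleep):
    """Repeat send_request() while it returns HTTP 503, up to timeout_s.

    Returns the first (status, body) whose status is not 503.
    Raises TimeoutError if the service is still unavailable at the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        status, body = send_request()
        if status != 503:
            return status, body
        if clock() >= deadline:
            raise TimeoutError("still getting 503 after %s seconds" % timeout_s)
        sleep(wait_s)


# Usage with canned responses and a fake clock, so nothing actually sleeps.
fake_time = {"now": 0.0}
responses = iter([(503, ""), (503, ""), (200, "prediction")])


def advance(seconds):
    fake_time["now"] += seconds


outcome = poll_until_ready(
    lambda: next(responses),
    clock=lambda: fake_time["now"],
    sleep=advance,
)
```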
govind cs 225a7e9f90
Merge branch 'master' into patch-1 2019-01-21 09:49:12 +05:30
govind cs bf5e18a34e
Update 01_setup_a_kubeflow_cluster.md 2019-01-21 09:46:04 +05:30
cliveseldon 8d728f0b06 GitHub Summarization Seldon Update (#472)
* Update model inference wrapping to use S2I and update docs

* Add s2i reference in docs

* Fix typo highlighted in review

* Add pyLint annotation to allow protected-access on keras make predict function method
2019-01-17 16:07:34 -08:00
Jeremy Lewi 1cc4550b7d GIS E2E test verify the TFJob runs successfully (#456)
* Create a test for submitting the TFJob for the GitHub issue summarization example.

* This test needs to be run manually right now. In a follow on PR we will
  integrate it into CI.

* We use the image built from Dockerfile.estimator because that is the image
  we are running train_test.py in.

  * Note: The current version of the code now requires Python3 (I think this
    is due to an earlier PR which refactored the code into a shared
    implementation for using TF estimator and not TF estimator).

* Create a TFJob component for TFJob v1beta1; this is the version
  in KF 0.4.

TFJob component
  * Upgrade to v1beta to work with 0.4
  * Update command line arguments to match the versions in the current code
      * input & output are now single parameters rather than separate parameters
        for bucket and name

  * change default input to a CSV file because the current version of the
    code doesn't handle unzipping it.

* Use ks_util from kubeflow/testing

* Address comments.
2019-01-08 15:06:49 -08:00
Jeremy Lewi 959d072e68 Setup continuous building of Docker images for GH Issue Summarization Example (#449)
* Setup continuous building of Docker images and testing for GH Issue Summarization Example.

* This is the first step in setting up a continuously running CI test.

* Add support for building the Docker images using GCB; we will use GCB
  to trigger the builds from our CI system.

  * Make the Makefile top level (at root of GIS example) so that we can
    easily access all the different resources.

* Add a .gitignore file to avoid checking in the build directory used by
  the Makefile.

* Define an Argo workflow to use as the E2E test.

Related to #92: E2E test & CI for github issue summarization

* Trigger the test on pre & post submit

* Dockerfile.estimator don't install the data_download.sh script
  * It doesn't look like we are currently using data_download.sh in the
    DockerImage
  * It looks like it only gets used via the ksonnet job which mounts the
    script via a config map

  * Copying data_download.sh to the Docker image is currently weird
    given the organization of the Dockerfile and context.

* Copy the test_data to the Docker images so that we can run the test
  inside the images.

* Invoke the python unittest for training from our CI system.

  * In a follow on PR we will update the test to emit a JUnit XML file to
    report results to prow.

* Fix image build.
2019-01-04 17:02:24 -08:00
Michelle Casbon 70a22d6d7b [GH Issue Summarization] Upgrade to kf v0.4.0-rc.2 (#450)
* Update tfjob components to v1beta1

Remove old version of tensor2tensor component

* Combine UI into a single jsonnet file

* Upgrade GH issue summarization to kf v0.4.0-rc.2

Use latest ksonnet v0.13.1
Use latest seldon v1alpha2
Remove ksonnet app with full kubeflow platform & replace with components specific to this example.
Remove outdated scripts
Add cluster creation links to Click-to-deploy & kfctl
Add warning not to use the Training with an Estimator guide
Replace commandline with bash for better syntax highlighting
Replace messy port-forwarding commands with svc/ambassador
Add modelUrl param to ui component
Modify teardown instructions to remove the deployment
Fix grammatical mistakes

* Rearrange tfjob instructions
2018-12-30 20:05:29 -08:00
Jeremy Lewi 7990408207 Delete obsolete HP tuning code. (#451)
* Katib no longer uses custom go programs. Instead it uses the new
  StudyJobController custom resource.

* This code is no longer needed so delete it.
2018-12-29 19:00:14 -08:00
Karthic Rao b69cf36a39 Fixing broken links (#403)
- Fix broken links for the install instructions.
- Minor modifications to the instructions.
- Minor formatting fixes.
2018-12-05 18:42:11 -08:00
govind cs 60ba49c68d
fixed "setting persistent disk" link
Fixed the link to the advanced customization page on kubeflow, which currently redirects to a non-existent page.
2018-12-04 16:02:53 +05:30
Jeremy Lewi 1043bc0c26 A bunch of changes to support distributed training using tf.estimator (#265)
* Unify the code for training with Keras and TF.Estimator

Create a single train.py and trainer.py which uses Keras inside TensorFlow
Provide options to either train with Keras or TF.Estimator
The code to train with TF.estimator doesn't work

See #196
The original PR (#203) worked around a blocking issue with Keras and TF.Estimator by commenting
certain layers in the model architecture leading to a model that wouldn't generate meaningful
predictions
We weren't able to get TF.Estimator working but this PR should make it easier to troubleshoot further

We've unified the existing code so that we don't duplicate the code just to train with TF.estimator
We've added unittests that can be used to verify training with TF.estimator works. This test
can also be used to reproduce the current errors with TF.estimator.
Add a Makefile to build the Docker image

Add a NFS PVC to our Kubeflow demo deployment.

Create a tfjob-estimator component in our ksonnet component.

changes to distributed/train.py as part of merging with notebooks/train.py
* Add command line arguments to specify paths rather than hard coding them.
* Remove the code at the start of train.py to wait until the input data
becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing
job and just block until the data is available
* That should be unnecessary since we can just run the preprocessing job as a separate job.

Fix notebooks/train.py (#186)

The code wasn't actually calling Model Fit
Add a unittest to verify we can invoke fit and evaluate without throwing exceptions.

* Address comments.
2018-11-07 16:23:59 -08:00
Jeremy Lewi 90044d24c4 Remove v1alpha1 TFJobs from the GH issue summarization example. (#264)
* We should be using v1alpha2 exclusively now.
2018-10-15 09:52:01 -07:00
Jeremy Lewi 4ea761630d Fix gh-demo.kubeflow.org and make it easy to setup. (#261)
* Fix gh-demo.kubeflow.org and make it easy to setup.

* Our public demo of the GitHub issue summarization example
  (gh-demo.kubeflow.org) is down. It was running in one of our dev
   clusters and with the churn in dev clusters it ended up getting deleted.

* To make it more stable let's move it to project kubecon-gh-demo-1
  and create a separate cluster for running it.
  This cluster can also serve as a readily available Kubeflow cluster
  setup for giving demos.

* Create the directory demo within the github_issue_summarization example
  to contain all the required files.

* Add a makefile to make building the image work.

* The ksonnet app for the public demo was previously stored here
  https://github.com/kubeflow/testing/tree/master/deployment/ks-app

* Fix the uiservice account.

* Address comments.
2018-10-15 08:36:11 -07:00
Akado2009 5329bfa59b docs updated (#240) 2018-09-24 15:07:27 -07:00
Katsunori Kanda 1b7df0c141 Fixed broken link in github issue summarization example (#235) 2018-08-26 18:01:31 -07:00
Michał Jastrzębski 35786ed9cb Add estimator example for github issues (#203)
* Add estimator example for github issues

This is code input for doc about writing Keras for tfjob.

There are few todos:

1. bug in dataset injection, can't raise number of steps
2. instead of adding hostpath for data, we should have quick job + pvc
for this

* pylint

* wip

* confirmed working on minikube

* pylint

* remove t2t, add documentation

* add note about storageclass

* fix link

* remove code redundancy

* address review

* small language fix
2018-08-24 18:10:27 -07:00
Pete MacKinnon d2c5e949e5 Update PVC to /home/jovyan (#119) 2018-07-13 14:39:26 -07:00
Jeremy Lewi eaf0298590 Create a deployment to run the HP/Katib controller for the GitHub issue example. (#161)
* Some of the code is copied over from https://github.com/kubeflow/katib/tree/master/examples/GKEDemo

  * I think it makes sense to centralize all the code in a single place.

* Update the controller program (git-issue-summarize-demo.go) so that we can
  specify the Docker image containing the training code.

* Create a ksonnet deployment for running the controller on the cluster.

* The HP tuning job isn't functional; here's an incomplete list of issues:

  * The training jobs launched fail because they don't have GCP credentials
    so they can't download the data.

  * We don't actually extract and report metrics back to Katib.

Related to: kubeflow/katib#116
2018-07-11 08:46:25 -07:00
Michelle Casbon 836ad70421 Fix model file upload (#160)
* Add component parameters

Add model_url & port arguments to flask app
Add service_type, image, and model_url parameters to ui component
Fix problem argument in tensor2tensor component

* Fix broken UI component

Fix broken UI component structure by adding all, service, & deployment parts
Add parameter defaults for tfjob to resolve failures deploying other components

* Add missing imports in flask app

Fix syntax error in argument parsing
Remove underscores from parameter names to workaround ksonnet bug #554: https://github.com/ksonnet/ksonnet/issues/554

* Fix syntax errors in t2t instructions

Add CPU image build arg to docker build command for t2t-training
Fix link to ksonnet app dir
Correct param names for tensor2tensor component
Add missing params for tensor2tensor component
Fix apply command syntax
Swap out log view pod for t2t-master instead of tf-operator
Fix link to training with tfjob

* Fix model file upload

Update default params for tfjob-v1alpha2
Fix build directory path in Makefile

* Resolve lint issues

Lines too long

* Add specific image tag to tfjob-v1alpha2 default

* Fix defaults for training output files

Update image tag
Add UI image tag

* Revert service account secret details

Update associated readme
2018-06-29 18:41:20 -07:00
Jeremy Lewi 98ed4b4a69 Fix v1alpha2 version of the T2T training job. (#158)
* Update the Docker image for T2T to use a newer version of T2T library

* Add parameters to set the GCP secret; we need GCP credentials to
  read from GCS even if reading a public bucket. We default
  to the parameters that are created automatically in the case of a GKE
  deployment.

* Create a v1alpha2 template for the job that uses PVC.
2018-06-29 12:26:18 -07:00
Jeremy Lewi 93db7e369e Update the GH summarization example to Kubeflow 0.2 and TFJob v1alpha2. (#157)
* Update the GH summarization example to Kubeflow 0.2 and TFJob v1alpha2.

* Upgrade the ksonnet app to Kubeflow 0.2 rc.1
* Add the examples package.
* Add a .gitignore file and ignore all environments so that we won't pick
  up people's testing environments.

* Add tfjob-v1alpha2 component; this trains the model using Keras using
  TFJob v1alpha2.
  * Update the parameters so that we use the GCP secrets created as part
    of the Kubeflow deployment.

* Remove jlewi environment.

* Verified that training ran successfully and outputted a model to GCS

  * There was an error about some missing arguments to a logging statement
    but this can be ignored although it would be good to fix.

* Started working on T2T v1alpha2. Seems to be messing up the app.

* Update the v1alpha2 template for the tensor2tensor job but it looks like
there is an error

2018-06-29 17:45:23,369] Found unknown flag: --problem=github_issue_summarization_problem
Traceback (most recent call last):
  File "/home/jovyan/.conda/bin/t2t-trainer", line 32, in <module>
    tf.app.run()
  File "/home/jovyan/.conda/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/home/jovyan/.conda/bin/t2t-trainer", line 28, in main
    t2t_trainer.main(argv)
  File "/home/jovyan/.conda/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py", line 334, in main
    exp_fn = create_experiment_fn()
  File "/home/jovyan/.conda/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py", line 158, in create_experiment_fn
    problem_name=get_problem_name(),
  File "/home/jovyan/.conda/lib/python3.6/site-packages/tensor2tensor/bin/t2t_trainer.py", line 115, in get_problem_name
    problems = FLAGS.problems.split("-")
AttributeError: 'NoneType' object has no attribute 'split'
2018-06-29 11:31:23 -07:00
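
The traceback in #157 shows the trainer rejecting `--problem=...` as an unknown flag and then failing on `FLAGS.problems.split("-")` because `problems` was never set. The sketch below reproduces that situation with the standard library's `argparse` and adds a guard that fails with a clearer message. It is only an illustration: it does not use tensor2tensor's own flag library, and `get_problem_names` is a hypothetical helper.

```python
import argparse

# Hypothetical stand-in for the trainer's flag definitions: only the plural
# --problems flag is known. allow_abbrev=False stops argparse from silently
# treating --problem as an abbreviation of --problems.
parser = argparse.ArgumentParser(allow_abbrev=False)
parser.add_argument("--problems", default=None)


def get_problem_names(argv):
    """Return the list of problem names, or fail clearly if none was given."""
    flags, unknown = parser.parse_known_args(argv)
    if unknown:
        print("Found unknown flag:", " ".join(unknown))
    if flags.problems is None:
        # Without this check, flags.problems.split("-") raises
        # AttributeError: 'NoneType' object has no attribute 'split'.
        raise ValueError("--problems was not set; unknown flags: %s" % unknown)
    return flags.problems.split("-")


names = get_problem_names(["--problems=github_issue_summarization_problem"])
```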
Michelle Casbon 11b75edfd9 Add component parameters (#155)
* Add component parameters

Add model_url & port arguments to flask app
Add service_type, image, and model_url parameters to ui component
Fix problem argument in tensor2tensor component

* Fix broken UI component

Fix broken UI component structure by adding all, service, & deployment parts
Add parameter defaults for tfjob to resolve failures deploying other components

* Add missing imports in flask app

Fix syntax error in argument parsing
Remove underscores from parameter names to workaround ksonnet bug #554: https://github.com/ksonnet/ksonnet/issues/554

* Fix syntax errors in t2t instructions

Add CPU image build arg to docker build command for t2t-training
Fix link to ksonnet app dir
Correct param names for tensor2tensor component
Add missing params for tensor2tensor component
Fix apply command syntax
Swap out log view pod for t2t-master instead of tf-operator
Fix link to training with tfjob
2018-06-28 13:52:21 -07:00
Puneith Kaul 174d6602ac Update README.md (#116) 2018-05-20 17:43:48 -07:00
Jeremy Lewi 002119010f Fix data-downloader; parameters are in the wrong order. (#115)
* URL should be the first argument; data dir should be the second.
2018-05-17 11:16:51 -07:00
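
The argument order described in #115 (URL first, data directory second) can be made explicit with named positional arguments. The downloader in the example is a shell script; the Python sketch below only illustrates the ordering, and `parse_downloader_args` and the sample values are hypothetical.

```python
import argparse


def parse_downloader_args(argv):
    """Parse downloader arguments: URL first, data directory second."""
    parser = argparse.ArgumentParser(description="Download a dataset.")
    # Positional arguments are matched in the order they are declared.
    parser.add_argument("url", help="location of the data to download")
    parser.add_argument("data_dir", help="directory to store the data in")
    return parser.parse_args(argv)


args = parse_downloader_args(["https://example.com/data.zip", "/data"])
```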
Carol Willing 0b303e70f1 Edit navigation and markdown for github example (#93)
* edit TF example readme

* prefix tutorial steps with a number for nicer display in repo

* fix typo

* edit steps 4 and 5

* edit docs

* add navigation and formatting edits to example
2018-05-09 12:12:54 -07:00
Jeremy Lewi 79aa2074cd Improvements to the tensor2tensor trainer for the GitHub summarization example. (#109)
* Improvements to the tensor2tensor trainer for the GitHub summarization example.

* Simplify the launcher; we can just pass through most command line arguments and not
  use environment variables and command line arguments.

  * This makes it easier to control the job just by setting the parameters in the template
    rather than having to rebuild the images.

* Add a Makefile to build the image.

* Replace the tensor2tensor jsonnet with a newer version of the jsonnet used with T2T.

* Address reviewer comments.

* Install pip packages as user Jovyan
* Rely on implicit string conversion with concatenation in template file.
2018-04-29 20:39:16 -07:00
Jeremy Lewi afdd4c544e Add a component to run TensorBoard. (#110)
* Add a component to run TensorBoard.

* Autoformat jsonnet file.

* Set a default of "" for logDir; there's not a really good default location
  because it will depend on where the data is stored.
2018-04-29 20:34:16 -07:00
Jeremy Lewi e12231bae3 Make it easier to demo serving and run in Katacoda (#107)
* Make it easier to demo serving and run in Katacoda

* Allow the model path to be specified via environment variables so that
  we could potentially load the model from PVC.

* Continue to bake the model into the image so that we don't need to train
  in order to serve.

* Parameterize download_data.sh so we could potentially fetch different sources.

* Update the Makefile so that we can build and set the image for the serving
  component.

* Fix lint.

* Update the serving docs.
2018-04-28 08:11:18 -07:00
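
The approach in #107, where the model path can come from an environment variable while a model baked into the image remains the fallback, might look like the sketch below. The variable name `MODEL_PATH`, the default path, and `resolve_model_path` are assumptions made for illustration; the commit does not name them.

```python
import os

# Hypothetical location of the model baked into the serving image.
DEFAULT_MODEL_PATH = "/model/baked_in_model.h5"


def resolve_model_path(environ=os.environ):
    """Prefer MODEL_PATH from the environment, else use the baked-in model."""
    # An unset or empty variable falls back to the default.
    return environ.get("MODEL_PATH") or DEFAULT_MODEL_PATH


# Usage with explicit mappings instead of the real process environment.
from_pvc = resolve_model_path({"MODEL_PATH": "/mnt/pvc/model.h5"})
baked_in = resolve_model_path({})
```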
Ankush Agarwal 26d68ead6c Replace kubeflow-images-staging with kubeflow-images-public (#99)
Fixes https://github.com/kubeflow/kubeflow/issues/534
2018-04-27 11:46:20 -07:00
Jeremy Lewi 4b33d44af6 Support training using a PVC for the data. (#98)
* Support training using a PVC for the data.

* This will make it easier to run the example on Katacoda and non-GCP platforms.

* Modify train.py so we can use a GCS location or local file paths.

* Update the Dockerfile. The jupyter Docker images had a bunch of
  dependencies removed and the latest images don't have the dependencies
  needed to run the examples.

* Create a tfjob-pvc component that trains reading/writing using PVC
  and not GCP.

* Address reviewer comments

* Ignore changes to the ksonnet parameters when determining whether to include
  dirty and sha of the diff in the image. This way we can update the
  ksonnet app with the newly built image without it leading to subsequent
  images being marked dirty.

* Fix lint issues.

* Fix lint import issue.
2018-04-27 04:08:19 -07:00
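
Accepting either a GCS location or a local file path, as #98 describes for train.py, comes down to inspecting the scheme of the location string. The sketch below shows one way to do that with the standard library; `split_location` is a hypothetical helper, not a function from the example's code.

```python
from urllib.parse import urlparse


def split_location(location):
    """Classify a location as GCS or local.

    Returns ("gcs", bucket, object_path) for gs:// URLs and
    ("local", None, path) for anything else.
    """
    parsed = urlparse(location)
    if parsed.scheme == "gs":
        # gs://bucket/dir/file -> bucket="bucket", object path="dir/file"
        return ("gcs", parsed.netloc, parsed.path.lstrip("/"))
    return ("local", None, location)


gcs_location = split_location("gs://my-bucket/data/issues.csv")
local_location = split_location("/data/issues.csv")
```

A caller can then branch on the first element to decide whether to download the file with a storage client or open it directly.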
Jeremy Lewi 34d6f8809d Add a job to download the data to PVC. (#97)
* This is the first step to doing training and serving using a PV as opposed
  to GCS.

* This will make the sample easier to run anywhere and in particular on Katacoda.

* This currently would work as follows

User creates a PVC

ks apply ${ENV} -c data-pvc

User runs a K8s job to download the data to PVC

ks apply ${ENV} -c data-downloader

In subsequent PRs we will update the train and serve steps to load the
model from the PVC as opposed to GCS.

Related to #91
2018-04-25 10:36:02 -07:00
Michelle Casbon 1a4f4dc1ea Remove vendor from .gitignore (#94)
* Remove vendor from .gitignore

* Tell pylint to ignore generated file
2018-04-24 15:25:01 -07:00