Commit Graph

343 Commits

Author SHA1 Message Date
Jeremy Lewi 2487194fbd Modify K8s models to export the models; tensorboard manifests (#320)
* Modify K8s models to export the models; tensorboard manifests

* Use a K8s job not a TFJob to export the model.
* Start an experiments.libsonnet file to define groups of parameters for
  different experiments that should be reused

* Need to install tensorflow_hub in the Docker image because it is
  required by t2t exporter.

* * Address review comments.
2018-11-11 19:09:42 -08:00
Yang Pan c6ff5dbef8 Change dataflow default workdir to /src (#330)
Otherwise, when I try to execute the Dataflow code
```
python2 -m code_search.dataflow.cli.create_function_embeddings \
```
it complains that there is no setup.py.

I could work around this by setting the working directory through the container API, but making it the default would be more convenient.
2018-11-11 15:37:59 -08:00
Jeremy Lewi 65e89a599b code search example make distributed training work; Create some components to train models (#317)
* Make distributed training work; Create some components to train models

* Check in a ksonnet component to train a model using the tinyparam
  hyperparameter set.

* We want to check in the ksonnet component to facilitate reproducibility.
  We need a better way to separate the particular experiments used for
  the CS search demo effort from the jobs we want customers to try.

   Related to #239 train a high quality model.

* Check in the cs_demo ks environment; this was being ignored as a result of
  .gitignore

Make distributed training work #208

* We got distributed synchronous training to work with Tensor2Tensor 1.10
* This required creating a simple python script to start the TF standard
  server and run it as a sidecar of the chief pod and as the main container
  for the workers/ps.

* Rename the model to kf_similarity_transformer to be consistent with other
  code.
  * We don't want to use the default name because we don't want to inadvertently
  use the SimilarityTransformer model defined in the Tensor2Tensor project.

* replace build.sh by a Makefile. Makes it easier to add variant commands
  * Use the GitHash not a random id as the tag.
  * Add a label to the docker image to indicate the git version.

* Put the Makefile at the top of the code_search tree; makes it easier
  to pull all the different sources for the Docker images.

* Add an option to build the Docker images with GCB; this is more efficient
  when you are on a poor network connection because you don't have to download
  images locally.
    * Use jsonnet to define and parameterize the GCB workflow.

* Build separate docker images for running Dataflow and for running the trainer.
  This helps avoid versioning conflicts caused by different versions of protobuf
  pulled in by the TF version used as the base image vs. the version used
  with apache beam.

      Fix #310 - Training fails with GPUs.

* Changes to support distributed training.
* Simplify t2t-entrypoint.sh so that all we do is parse TF_CONFIG
  and pass requisite config information as command line arguments;
  everything else can be set in the K8s spec.

* Upgrade to T2T 1.10.

* * Add ksonnet prototypes for tensorboard.
2018-11-08 16:13:01 -08:00
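The entrypoint simplification described in 65e89a599b (parse TF_CONFIG, pass the result on as command-line arguments) can be sketched in Python as follows. This is not the contents of t2t-entrypoint.sh: the function name and the flag names (`--task_type`, `--task_index`, `--num_workers`, `--num_ps`) are hypothetical, and the only thing assumed from TensorFlow is the usual TF_CONFIG layout, a JSON object with `cluster` and `task` keys.

```python
import json
import os


def tf_config_to_args(environ=None):
    """Translate TF_CONFIG into a list of command-line arguments.

    The flag names below are hypothetical placeholders, not real trainer flags.
    """
    environ = os.environ if environ is None else environ
    # An unset TF_CONFIG is treated as an empty configuration.
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return [
        # Assumption: default to a single "chief" task when nothing is set.
        "--task_type={}".format(task.get("type", "chief")),
        "--task_index={}".format(task.get("index", 0)),
        "--num_workers={}".format(len(cluster.get("worker", []))),
        "--num_ps={}".format(len(cluster.get("ps", []))),
    ]
```

Keeping the wrapper this thin matches the stated goal of leaving everything else to the K8s spec.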
Jeremy Lewi 1043bc0c26 A bunch of changes to support distributed training using tf.estimator (#265)
* Unify the code for training with Keras and TF.Estimator

Create a single train.py and trainer.py which uses Keras inside TensorFlow
Provide options to either train with Keras or TF.Estimator
The code to train with TF.estimator doesn't work

See #196
The original PR (#203) worked around a blocking issue with Keras and TF.Estimator by commenting
certain layers in the model architecture leading to a model that wouldn't generate meaningful
predictions
We weren't able to get TF.Estimator working but this PR should make it easier to troubleshoot further

We've unified the existing code so that we don't duplicate the code just to train with TF.estimator
We've added unittests that can be used to verify training with TF.estimator works. This test
can also be used to reproduce the current errors with TF.estimator.
Add a Makefile to build the Docker image

Add a NFS PVC to our Kubeflow demo deployment.

Create a tfjob-estimator component in our ksonnet component.

changes to distributed/train.py as part of merging with notebooks/train.py
* Add command line arguments to specify paths rather than hard coding them.
* Remove the code at the start of train.py to wait until the input data
becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing
job and just block until the data is available
* That should be unnecessary since we can just run the preprocessing job as a separate job.

Fix notebooks/train.py (#186)

The code wasn't actually calling Model Fit
Add a unittest to verify we can invoke fit and evaluate without throwing exceptions.

* Address comments.
2018-11-07 16:23:59 -08:00
Jeremy Lewi d01b76b6f9 Update ksonnet for datagen (#309)
* Update the datagen component.

* We should use a K8s job rather than a TFJob. We can also simplify the
  ksonnet by just putting the spec into the jsonnet file rather than trying
  to share various bits of the spec with the TFJob for training.

Related to kubeflow/examples#308 use globals to allow parameters to be shared
across components (e.g. working directory.)

* Update the README with information about data.

* Fix table markdown.
2018-11-07 14:28:16 -08:00
Yang Pan 11879e2ff1 wait on create function embedding (#311) 2018-11-06 14:37:11 -08:00
Jeremy Lewi df278567f0 Fix performance of dataflow preprocessing job. (#302)
* Fix performance of dataflow preprocessing job.

* Fix #300; Dataflow job for preprocessing is really slow.

  * The problem is we are loading the spacy tokenization model on every
    invocation of the tokenization function and this is really expensive.
  * We should be doing this once per module import.

* After fixing this issue; the job completed in approximately 20 minutes using
  5 workers.

  * We can process all 1.3 million records in ~20 minutes (elapsed time) using five 32-CPU workers and about 1 hour of CPU time altogether.

* Add options to the Dataflow job to read from files as opposed to BigQuery
  and to skip BigQuery writes. This is useful for testing.

* Add a "unittest" that verifies the Dataflow preprocessing job can run
  successfully using the DirectRunner.

* Update the Docker image and a ksonnet component for a K8s job that
  can be used to submit the Dataflow job.

* Fix #299; Add logging to the Dataflow preprocessing job to indicate that
  a Dataflow job was submitted.

* Add an option to the preprocessing Dataflow job to read an entire
  BigQuery table as the input rather than running a query to get the input.
  This is useful in the case where the user wants to run a different
  query to select the repo paths and contents to process and write them
  to some table to be processed by the Dataflow job.

* Fix lint.

* More lint fixes.
2018-11-06 14:14:28 -08:00
Yang Pan aa0061dae2 update instruction with proper namespace (#307) 2018-11-05 20:47:46 -08:00
Yang Pan 1f82dc41cd [code search] add flag to wait till code search job finish (#306)
* add flag to wait till job finish

* wait till -> wait until
2018-11-05 19:04:20 -08:00
Jeremy Lewi f87dfd8e53 Create a demo cluster for the code search example. (#298) 2018-11-05 06:07:52 -08:00
Jeremy Lewi acd8007717 Use conditionals and add test for code search (#291)
* Fix model export, loss function, and add some manual tests.

Fix Model export to support computing code embeddings: Fix #260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.

* Originally based on #233 by @activatedgeek

Loss function improvements

* See #259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in #233
  and this pulls in some of those changes.

Add manual tests

* Related to #258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.

* * Skip the test when running under prow because it's a manual test.
* Fix some lint errors.

* * Fix lint and skip tests.

* Fix lint.

* * Fix lint
* Revert loss function changes; we can do that in a follow on PR.

* * Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can overwrite the number of shards
  used easily to facilitate testing.

* Comment out py-test for now.
2018-11-02 09:52:11 -07:00
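The `embed_code` switch described in acd8007717 can be illustrated with a small framework-free sketch. The real exported model uses conditional ops inside the TensorFlow graph; here both encoders are hypothetical toy functions, and `embed` only shows how one extra input feature selects which embedding is computed for the same input.

```python
def query_encoder(tokens):
    """Hypothetical toy encoder for search queries."""
    mean_len = sum(len(t) for t in tokens) / len(tokens)
    return [mean_len, float(len(tokens))]


def code_encoder(tokens):
    """Hypothetical toy encoder for code; deliberately differs from the query one."""
    underscores = sum(t.count("_") for t in tokens)
    return [float(len(tokens)), float(underscores)]


def embed(tokens, embed_code):
    """Pick the encoder based on the embed_code feature."""
    if embed_code:
        return code_encoder(tokens)
    return query_encoder(tokens)
```

The sanity check mentioned in the commit corresponds to asserting that `embed(x, True)` and `embed(x, False)` differ for the same `x`.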
Jeremy Lewi 07483c2dff Remove inactive reviewers/approvers. (#296)
https://devstats.kubeflow.org/d/46/user-reviews-repository-groups?orgId=1&var-period=d7&var-repo_name=All&var-repo=all&var-reviewers=DjangoPeng&var-reviewers=nkashy1

This will help blunderbuss assign better reviewers.
2018-11-02 08:34:20 -07:00
Karthik Ramasamy 847ecb414e Delete readme (#294) 2018-11-01 19:41:55 -07:00
Karthik Ramasamy 04f4c0767d Create OWNERS file (#289)
Adding my username to the owners file
2018-10-31 12:38:57 -07:00
Yu-Han Liu 266316bfd5 add pipelines/components (#285) 2018-10-30 13:27:02 -07:00
Michelle Casbon dde7d3ee8e Upgrade demo to KF v0.3.1 (#278)
* Upgrade demo to KF v0.3.1

Update env variable names and values in base file
Cleanup ambassador metadata for UI component
Add kfctl installation instructions
Tighten minikube setup instructions and update k8s version
Move environment variable setup to very beginning
Replace cluster creation commands with links to the appropriate section in demo_setup/README.md
Replace deploy.sh with kfctl
Replace kubeflow-core component with individual components
Remove connection to UI pod directly & connect via ambassador instead
Add cleanup commands

* Clarify wording

* Update parameter file

Resolve python error with file open
Consolidate kubeflow install command
2018-10-26 12:58:00 -07:00
Konstantinos Samaras-Tsakiris 5c38c96fae Fix #272 (#273)
* Fix #272

Fix #272 where the `create-pet-record-job` pod produces this error: `models/research/object_detection/data/pet_label_map.pbtxt; No such file or directory`

* Update create-pet-record-job.jsonnet
2018-10-22 14:57:24 -07:00
Konstantinos Samaras-Tsakiris 6edf7915f5 Fix #275 (#276)
Fix #275 by changing the default mount path for the training data.
2018-10-22 12:14:13 -07:00
Konstantinos Samaras-Tsakiris b0f9b4cfd0 Fix bash (#271)
Remove spaces around a bash variable declaration.
2018-10-22 12:02:04 -07:00
Svendegroote91 bc0380dda6 minor fixes for instructions (#267) 2018-10-15 10:02:17 -07:00
Jeremy Lewi 90044d24c4 Remove v1alpha1 TFJobs from the GH issue summarization example. (#264)
* We should be using v1alpha2 exclusively now.
2018-10-15 09:52:01 -07:00
Jeremy Lewi 4ea761630d Fix gh-demo.kubeflow.org and make it easy to setup. (#261)
* Fix gh-demo.kubeflow.org and make it easy to setup.

* Our public demo of the GitHub issue summarization example
  (gh-demo.kubeflow.org) is down. It was running in one of our dev
   clusters and with the churn in dev clusters it ended up getting deleted.

* To make it more stable let's move it to project kubecon-gh-demo-1
  and create a separate cluster for running it.
  This cluster can also serve as a readily available Kubeflow cluster
  setup for giving demos.

* Create the directory demo within the github_issue_summarization example
  to contain all the required files.

* Add a makefile to make building the image work.

* The ksonnet app for the public demo was previously stored here
  https://github.com/kubeflow/testing/tree/master/deployment/ks-app

* Fix the uiservice account.

* Address comments.
2018-10-15 08:36:11 -07:00
Svendegroote91 d3e1731d7f add financial time series example (#252)
* add financial time series example

* fix ReadMe comments

* fix PyLint remarks

* clean up based on PR remarks

* Completing docstrings and fixing PR remarks
2018-10-12 08:04:07 -07:00
Jeremy Lewi adf614fc5f Add tensorboard and check in vendor for the code search example. (#255)
* Add tensorboard and check in vendor for the code search example.

* * Remove the default env; when I ran ks show I got errors but
  removing it and adding a fresh env worked. It also won't point to
  the correct cluster for users.
2018-10-04 10:18:58 -07:00
Ankush Agarwal 2064b43def Ankush Signing Out (#253) 2018-09-28 16:17:20 -07:00
Michelle Casbon 5c2d8aefc2 Remove reviewers who are already approvers (#247)
* Remove reviewers who are already approvers

Remove ScorpioCPH and zjj2wry due to inactivity (no PRs or comments on PRs).

* Add zjj2wry back on request
2018-09-24 17:25:32 -07:00
Akado2009 5329bfa59b docs updated (#240) 2018-09-24 15:07:27 -07:00
Michelle Casbon 42592fed4a Update demo script & add notebook (#248)
* Update demo script

Update demo script to include deploy script and notebook created by @drscott173
Simplify by removing unnecessary commands
Use default namespace instead of kubeflow

* Add yelp notebook readme

* Add cluster creation commands

Add instructions for highlighting changes resulting from each command
2018-09-11 11:17:02 -07:00
Inki Hwang 8e30631c54 example mnist upgrade to v1alpha2 (#246)
* example mnist upgrade to v1alpha2

* Remove cleanPodPolicy

* Fix kubeflow branch to v0.2.4
2018-09-09 13:01:21 -07:00
Michelle Casbon d878462bc5 Upgrade demo to use latest versions of kubeflow, tfjob, ksonnet, & gke (#242)
* Upgrade ks dir to 0.12.0

* Upgrade kubeflow to v0.2.0-rc.1

Use https://github.com/kubeflow/kubeflow/blob/master/scripts/upgrade_ks_app.py
to upgrade ks registry
Add t2tcpu-v1alpha2 component

* Rename t2tcpu-v1alpha2 -> t2tcpu

Rename t2tcpu -> t2tcpu-v1alpha1 and t2tcpu-v1alpha2 -> t2tcpu
Update demo_setup/README.md to reflect ks v0.12.0
Update REPO_PATH in demo_setup/kubeflow-demo-base.env
Update initialClusterVersion in k8s cluster creation script to 1.10.6-gke.2
Remove quotation marks from serving.deployHttpProxy so that it is parsed as a boolean instead of string

* Rename t2tgpu & t2ttpu

Rename t2tgpu -> t2tgpu-v1alpha1 and add t2tgpu-v1alpha2 as t2tgpu
Rename t2ttpu -> t2ttpu-v1alpha1 and add t2ttpu-v1alpha2 as t2ttpu
Resolve jsonnet parsing issues

* Upgrade kubeflow to v0.2.4

Add gke environment

* Add instructions for creating TPU clusters

* Replace hard-coded value with env var

* Update kf version to v0.2.4 in env var file

* Add non-gke requirements to t2tcpu component

Sync t2tgpu with t2tcpu
Remove non-gke statements from t2ttpu component
Add k8s v1.10.6 to minikube start command

* Fix bug with non-gke environment setup in t2t

Add service account setup and k8s secret creation instructions for serving & UI

* Single cluster with GPU & TPU

Add creation script for single cluster with access to CPU, GPU, & TPU
Update GPU driver installation to k8s-1.10

* Remove v1alpha1 components

* Update parameter values for t2t components

Increase disk size for minikube cluster creation since 0.2.4 is larger
Update gke cluster creation command

* Update TPU annotation to TF 1.9

* Update kf version to v0.2.5

Update tfJobImage version to v20180809-d2509aa
2018-09-05 05:46:33 -07:00
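The `serving.deployHttpProxy` change in d878462bc5 (dropping the quotation marks so the value is a boolean, not a string) matters because a quoted value is a non-empty string. The sketch below uses Python's `json` module only to illustrate that distinction; it does not reproduce how ksonnet evaluates parameters, and `is_enabled` is a hypothetical helper.

```python
import json

# The same parameter written with and without quotation marks.
quoted = json.loads('{"deployHttpProxy": "false"}')["deployHttpProxy"]
unquoted = json.loads('{"deployHttpProxy": false}')["deployHttpProxy"]


def is_enabled(value):
    """Hypothetical check that treats the parameter as a simple flag."""
    return bool(value)
```

Here `quoted` is the string "false", which a truthiness check treats as enabled, while `unquoted` is a real boolean.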
Katsunori Kanda 1b7df0c141 Fixed broken link in github issue summarization example (#235) 2018-08-26 18:01:31 -07:00
Michał Jastrzębski 35786ed9cb Add estimator example for github issues (#203)
* Add estimator example for github issues

This is code input for doc about writing Keras for tfjob.

There are a few todos:

1. bug in dataset injection, can't raise number of steps
2. instead of adding hostpath for data, we should have quick job + pvc
for this

* pylint

* wip

* confirmed working on minikube

* pylint

* remove t2t, add documentation

* add note about storageclass

* fix link

* remove code redundancy

* address review

* small language fix
2018-08-24 18:10:27 -07:00
Puneith Kaul 1d5ddf560b Merge pull request #236 from kubeflow/xgboost_readme
Update README.md
2018-08-24 15:35:07 -07:00
Puneith Kaul ab61a75373 Update README.md 2018-08-24 15:34:48 -07:00
Puneith Kaul 7b7d671b87 Update README.md 2018-08-24 07:49:18 -07:00
Puneith Kaul e7996c33a2 Update README.md 2018-08-24 07:48:18 -07:00
Puneith Kaul bd07a2f84e new PR for XGBoost due to problems with history rewrite (#232)
* new PR for XGBoost due to problems with history rewrite

* Update housing.py

* Update HousingServe.py

* Update housing.py

* added bitly

* removed test function

* reorder imports

* fix spaces

* fix spaces

* fixed lint errors

* renamed to xgboost_ames_housing
2018-08-22 06:01:36 -07:00
Daniel Castellanos e6b6730650 Updated object detection training example (#228)
* Updated Dockerfile.traning to use latest tensorflow
  and tensorflow object detection api.
* Updated tf-training-job component and added a chief
  replica spec
* Corrected some typos and updated some instructions
2018-08-20 19:32:12 -07:00
Sanyam Kapoor f9873e6ac4 Upgrade notebook commands and other relevant changes (#229)
* Replace double quotes for field values (ks convention)

* Recreate the ksonnet application from scratch

* Fix pip commands to find requirements and redo installation, fix ks param set

* Use sed replace instead of ks param set.

* Add cells to first show JobSpec and then apply

* Upgrade T2T, fix conflicting problem types

* Update docker images

* Reduce to 200k samples for vocab

* Use Jupyter notebook service account

* Add illustrative gsutil commands to show output files, specify index files glob explicitly

* List files after index creation step

* Use the model in current repository and not upstream t2t

* Update Docker images

* Expose TF Serving Rest API at 9001

* Spawn terminal from the notebooks ui, no need to go to lab
2018-08-20 16:35:07 -07:00
Michelle Casbon 0843cdad66 Add Yelp restaurant review demo files (#220)
* Add Yelp restaurant review demo files

* Add video links

* Resolve lint issues
2018-08-15 22:49:00 -07:00
Sanyam Kapoor 4e015e76a3 Cherry pick changes to PredictionDoFn (#226)
* Cherry pick changes to PredictionDoFn

* Disable lint checks for cherry picked file

* Update TODO and notebook install instructions

* Restore CUSTOM_COMMANDS todo
2018-08-15 06:21:00 -07:00
Sanyam Kapoor 18829159b0 Add a new github function docstring extended problem (#225)
* Add a new github function docstring extended problem

* Fix lint errors

* Update images
2018-08-14 15:41:47 -07:00
Sanyam Kapoor 8fce4a7799 Allow ks param set for Code Search Ksonnet Application (#224)
* Allow ks param set for t2t-code-search

* Update notebook with working directory param set

* Abstract out common variables for easy ks param set
2018-08-14 15:29:04 -07:00
Lun-Kai Hsu f3806d0bac Small fix to TF serving gpu (#221)
* Small fix to TF serving gpu

* fix

* fix

* fix
2018-08-14 14:27:35 -07:00
Sanyam Kapoor a687c51036 Add a Jupyter notebook to be used for Kubeflow codelabs (#217)
* Add a Jupyter notebook to be used for Kubeflow codelabs

* Add help command for create_function_embeddings module

* Update README to point to Jupyter Notebook

* Add prerequisites to readme

* Update README and getting started with notebook guide

* [wip]

* Update notebook with BigQuery previews

* Update notebook to automatically select the latest MODEL_VERSION
2018-08-13 21:43:26 -07:00
Ankush Agarwal a80c15b50e Merge pull request #213 from activatedgeek/search-server-kubeflow
Update Search Index server spec
2018-08-09 14:57:49 -07:00
Sanyam Kapoor 6e9150bad6 Parametrize volumes and ports for nmslib containers 2018-08-09 10:53:23 -07:00
Sanyam Kapoor 133e054033 Refactor job and deployment specs into different functions 2018-08-09 10:53:23 -07:00
Sanyam Kapoor e34f9aca75 Build just one image with the correct tag instead of double the number 2018-08-09 10:53:23 -07:00
Sanyam Kapoor c86f306d79 Use kind Job instead of Pod 2018-08-09 10:53:23 -07:00