* Modify K8s manifests to export the models; add tensorboard manifests
* Use a K8s job not a TFJob to export the model.
* Start an experiments.libsonnet file to define groups of parameters for
different experiments that should be reused
* Need to install tensorflow_hub in the Docker image because it is
required by the t2t exporter.
* Address review comments.
Otherwise, when I want to execute Dataflow code, e.g.
```
python2 -m code_search.dataflow.cli.create_function_embeddings \
```
it complains that there is no setup.py.
I could work around this by setting workingDir in the container spec, but making it the default would be more convenient.
* Make distributed training work; Create some components to train models
* Check in a ksonnet component to train a model using the tinyparam
hyperparameter set.
* We want to check in the ksonnet component to facilitate reproducibility.
We need a better way to separate the particular experiments used for
the CS search demo effort from the jobs we want customers to try.
Related to #239 (train a high quality model).
* Check in the cs_demo ks environment; this was being ignored as a result of
.gitignore
Make distributed training work #208
* We got distributed synchronous training to work with Tensor2Tensor 1.10
* This required creating a simple python script to start the TF standard
server and run it as a sidecar of the chief pod and as the main container
for the workers/ps.
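A minimal sketch of such a sidecar script, assuming TF_CONFIG is set by the TFJob operator (the script name and exact layout in the repo may differ):
```
# Minimal sketch of a sidecar that starts the TF standard server from
# the TF_CONFIG environment variable set by the TFJob operator.
import json
import os

import tensorflow as tf


def main():
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})

    server = tf.train.Server(
        tf.train.ClusterSpec(cluster),
        job_name=task.get("type"),
        task_index=task.get("index", 0))
    # Block forever; the training process on the chief talks to this server.
    server.join()


if __name__ == "__main__":
    main()
```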
* Rename the model to kf_similarity_transformer to be consistent with other
code.
* We don't want to use the default name because we don't want to inadvertently
use the SimilarityTransformer model defined in the Tensor2Tensor project.
* Replace build.sh with a Makefile; this makes it easier to add variant commands.
* Use the git hash, not a random id, as the tag.
* Add a label to the docker image to indicate the git version.
* Put the Makefile at the top of the code_search tree; makes it easier
to pull all the different sources for the Docker images.
* Add an option to build the Docker images with GCB; this is more efficient
when you are on a poor network connection because you don't have to download
images locally.
* Use jsonnet to define and parameterize the GCB workflow.
* Build separate docker images for running Dataflow and for running the trainer.
This helps avoid versioning conflicts caused by different versions of protobuf
pulled in by the TF version used as the base image vs. the version used
with Apache Beam.
Fix #310 - Training fails with GPUs.
* Changes to support distributed training.
* Simplify t2t-entrypoint.sh so that all we do is parse TF_CONFIG
and pass the requisite config information as command line arguments;
everything else can be set in the K8s spec (see the sketch below).
* Upgrade to T2T 1.10.
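For reference, a hedged sketch of the TF_CONFIG parsing mentioned above; the flag names emitted here are placeholders for whatever the training binary actually accepts:
```
# Illustrative only: turn TF_CONFIG into command line arguments instead
# of doing this work in a shell script.  Flag names are placeholders.
import json
import os


def tf_config_to_args():
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {"type": "chief", "index": 0})

    job_name = task["type"]
    task_index = task["index"]
    master = "grpc://" + cluster.get(job_name, [""])[task_index]

    return [
        "--master=%s" % master,
        "--job_name=%s" % job_name,
        "--task_index=%d" % task_index,
        "--ps_replicas=%d" % len(cluster.get("ps", [])),
    ]


if __name__ == "__main__":
    print(" ".join(tf_config_to_args()))
```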
* Add ksonnet prototypes for tensorboard.
* Unify the code for training with Keras and TF.Estimator
Create a single train.py and trainer.py which use Keras inside TensorFlow.
Provide options to train with either Keras or TF.Estimator.
The code to train with TF.Estimator doesn't work; see #196.
The original PR (#203) worked around a blocking issue with Keras and TF.Estimator by commenting out
certain layers in the model architecture, leading to a model that wouldn't generate meaningful
predictions.
We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.
We've unified the existing code so that we don't duplicate it just to train with TF.Estimator
(a rough sketch of the dispatch appears below).
We've added unit tests that can be used to verify that training with TF.Estimator works. This test
can also be used to reproduce the current errors with TF.Estimator.
Add a Makefile to build the Docker image
Add a NFS PVC to our Kubeflow demo deployment.
Create a tfjob-estimator component in our ksonnet app.
Changes to distributed/train.py as part of merging with notebooks/train.py:
* Add command line arguments to specify paths rather than hard coding them.
* Remove the code at the start of train.py to wait until the input data
becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing
job and just block until the data is available
* That should be unnecessary since we can just run the preprocessing job as a separate job.
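To illustrate the unification described above, here is a rough sketch of how a single entrypoint can dispatch between plain Keras fit() and the same compiled model wrapped with tf.keras.estimator.model_to_estimator; build_model and the flags are placeholders, not the repo's actual train.py:
```
# Sketch (not the repo's actual train.py): one entrypoint that either
# calls Keras fit() directly or wraps the same model as an Estimator.
import tensorflow as tf


def train(build_model, x, y, use_estimator=False, train_steps=100):
    model = build_model()  # placeholder; must return a compiled Keras model
    if use_estimator:
        estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
        # The dict key must match the name of the Keras model's input layer.
        input_fn = tf.estimator.inputs.numpy_input_fn(
            x={"input": x}, y=y, shuffle=True)
        estimator.train(input_fn=input_fn, steps=train_steps)
    else:
        model.fit(x, y, epochs=1)
```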
Fix notebooks/train.py (#186)
The code wasn't actually calling model.fit.
Add a unit test to verify we can invoke fit and evaluate without throwing exceptions.
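A minimal version of such a test, with build_model standing in for the model defined in notebooks/train.py:
```
# Minimal sketch of a unit test that exercises fit() and evaluate() on a
# tiny Keras model; build_model stands in for the example's real model.
import unittest

import numpy as np
from tensorflow import keras


def build_model():
    model = keras.models.Sequential([
        keras.layers.Dense(8, activation="relu", input_shape=(4,)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


class TrainTest(unittest.TestCase):

    def test_fit_and_evaluate(self):
        x = np.random.rand(16, 4).astype("float32")
        y = np.random.rand(16, 1).astype("float32")
        model = build_model()
        model.fit(x, y, epochs=1, batch_size=8, verbose=0)
        model.evaluate(x, y, verbose=0)


if __name__ == "__main__":
    unittest.main()
```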
* Address comments.
* Update the datagen component.
* We should use a K8s job rather than a TFJob. We can also simplify the
ksonnet by just putting the spec into the jsonnet file rather than trying
to share various bits of the spec with the TFJob for training.
Related to kubeflow/examples#308: use globals to allow parameters to be shared
across components (e.g. the working directory).
* Update the README with information about data.
* Fix table markdown.
* Fix performance of dataflow preprocessing job.
* Fix #300; the Dataflow job for preprocessing is really slow.
* The problem is that we are loading the spaCy tokenization model on every
invocation of the tokenization function, and this is really expensive.
* We should be doing this once per module import (see the sketch below).
* After fixing this issue, the job completed in approximately 20 minutes using
5 workers.
* We can process all 1.3 million records in ~20 minutes (elapsed time) using five 32-CPU workers and about 1 hour of CPU time altogether.
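The fix follows the usual pattern shown below: load the spaCy model once at module import rather than inside the per-record function (the model and function names here are illustrative):
```
# Simplified illustration: pay the spaCy model load cost once per worker
# process (at module import) instead of once per record.
import spacy

# Loaded once when the module is imported on each Dataflow worker.
_NLP = spacy.load("en_core_web_sm")


def tokenize_docstring(text):
    """Tokenize a docstring using the module-level spaCy pipeline."""
    return [token.text.lower() for token in _NLP.tokenizer(text)]
```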
* Add options to the Dataflow job to read from files as opposed to BigQuery
and to skip BigQuery writes. This is useful for testing.
* Add a "unittest" that verifies the Dataflow preprocessing job can run
successfully using the DirectRunner.
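A stripped-down version of such a test might look like this; the pipeline body below is a stand-in for the real preprocessing transforms in code_search.dataflow:
```
# Sketch of a DirectRunner-based test for the Beam preprocessing job;
# the pipeline body is a stand-in for the real transforms.
import unittest

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class PreprocessPipelineTest(unittest.TestCase):

    def test_runs_with_direct_runner(self):
        options = PipelineOptions(["--runner=DirectRunner"])
        with beam.Pipeline(options=options) as p:
            _ = (p
                 | beam.Create([{"function": "def f(): pass",
                                 "docstring": "does nothing"}])
                 | beam.Map(lambda row: row["docstring"]))


if __name__ == "__main__":
    unittest.main()
```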
* Update the Docker image and a ksonnet component for a K8s job that
can be used to submit the Dataflow job.
* Fix #299; add logging to the Dataflow preprocessing job to indicate that
a Dataflow job was submitted.
* Add an option to the preprocessing Dataflow job to read an entire
BigQuery table as the input rather than running a query to get the input.
This is useful in the case where the user wants to run a different
query to select the repo paths and contents to process and write them
to some table to be processed by the Dataflow job.
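In Beam terms, the option roughly amounts to switching between a table source and a query source (the flag and attribute names below are illustrative, not the job's actual options):
```
# Illustrative: read an entire BigQuery table when --input_table is set,
# otherwise run a query.  Flag and attribute names are examples only.
import apache_beam as beam


def make_source(options):
    if options.input_table:
        # e.g. a table the user populated with their own selection query.
        return beam.io.Read(beam.io.BigQuerySource(table=options.input_table))
    return beam.io.Read(
        beam.io.BigQuerySource(query=options.query, use_standard_sql=True))
```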
* Fix lint.
* More lint fixes.
* Fix model export, loss function, and add some manual tests.
Fix model export to support computing code embeddings; fix #260
* The previous exported model was always using the embeddings trained for
the search query.
* But we need to be able to compute embedding vectors for both the query
and code.
* To support this we add a new input feature "embed_code" and conditional
ops (sketched below). The exported model uses the value of the embed_code feature to determine
whether to treat the inputs as a query string or code and computes
the embeddings appropriately.
* Originally based on #233 by @activatedgeek
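Conceptually the conditional ops look like the sketch below; encode_query and encode_code stand in for the model's real embedding ops:
```
# Conceptual sketch of the export-time conditional: the embed_code
# feature selects whether the inputs are embedded as a query or as code.
import tensorflow as tf


def compute_embeddings(features, encode_query, encode_code):
    embed_code = tf.cast(tf.reshape(features["embed_code"], [-1])[0], tf.bool)
    return tf.cond(
        embed_code,
        lambda: encode_code(features["inputs"]),
        lambda: encode_query(features["inputs"]))
```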
Loss function improvements
* See #259 for a long discussion about different loss functions.
* @activatedgeek was experimenting with different loss functions in #233
and this pulls in some of those changes.
Add manual tests
* Related to #258
* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
input based on which embeddings we are computing.
Change Problem/Model name
* Register the problem github_function_docstring with a different name
to distinguish it from the version inside the Tensor2Tensor library.
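With Tensor2Tensor's registry, that amounts to registering the subclass under its own (snake-cased) class name, roughly as below; the class name shown is illustrative:
```
# Rough illustration: registering the problem under a distinct name so it
# cannot collide with the copy bundled inside Tensor2Tensor.
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry


@registry.register_problem
class KfGithubFunctionDocstring(text_problems.Text2TextProblem):
  """Registered as "kf_github_function_docstring"; the real class also
  overrides the data generators."""
```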
* Skip the test when running under prow because it's a manual test.
* Fix some lint errors.
* Fix lint and skip tests.
* Fix lint.
* Fix lint
* Revert loss function changes; we can do that in a follow on PR.
* Run generate_data as part of the test rather than reusing a cached
vocab and processed input file.
* Modify SimilarityTransformer so we can overwrite the number of shards
used easily to facilitate testing.
* Comment out py-test for now.
* Upgrade demo to KF v0.3.1
Update env variable names and values in base file
Cleanup ambassador metadata for UI component
Add kfctl installation instructions
Tighten minikube setup instructions and update k8s version
Move environment variable setup to very beginning
Replace cluster creation commands with links to the appropriate section in demo_setup/README.md
Replace deploy.sh with kfctl
Replace kubeflow-core component with individual components
Remove connection to UI pod directly & connect via ambassador instead
Add cleanup commands
* Clarify wording
* Update parameter file
Resolve python error with file open
Consolidate kubeflow install command
* Fix #272, where the `create-pet-record-job` pod produces this error: `models/research/object_detection/data/pet_label_map.pbtxt; No such file or directory`
* Update create-pet-record-job.jsonnet
* Fix gh-demo.kubeflow.org and make it easy to set up.
* Our public demo of the GitHub issue summarization example
(gh-demo.kubeflow.org) is down. It was running in one of our dev
clusters, and with the churn in dev clusters it ended up getting deleted.
* To make it more stable let's move it to project kubecon-gh-demo-1
and create a separate cluster for running it.
This cluster can also serve as a readily available Kubeflow cluster
setup for giving demos.
* Create the directory demo within the github_issue_summarization example
to contain all the required files.
* Add a makefile to make building the image work.
* The ksonnet app for the public demo was previously stored here
https://github.com/kubeflow/testing/tree/master/deployment/ks-app
* Fix the UI service account.
* Address comments.
* Add financial time series example
* Fix README comments
* Fix pylint remarks
* Clean up based on PR remarks
* Complete docstrings and fix PR remarks
* Add tensorboard and check in vendor for the code search example.
* Remove the default env; when I ran `ks show` I got errors, but
removing it and adding a fresh env worked. It also won't point to
the correct cluster for users.
* Remove reviewers who are already approvers
Remove ScorpioCPH and zjj2wry due to inactivity (no PRs or comments on PRs).
* Add zjj2wry back on request
* Update demo script
Update demo script to include deploy script and notebook created by @drscott173
Simplify by removing unnecessary commands
Use default namespace instead of kubeflow
* Add yelp notebook readme
* Add cluster creation commands
Add instructions for highlighting changes resulting from each command
* Upgrade ks dir to 0.12.0
* Upgrade kubeflow to v0.2.0-rc.1
Use https://github.com/kubeflow/kubeflow/blob/master/scripts/upgrade_ks_app.py
to upgrade ks registry
Add t2tcpu-v1alpha2 component
* Rename t2tcpu-v1alpha2 -> t2tcpu
Rename t2tcpu -> t2tcpu-v1alpha1 and t2tcpu-v1alpha2 -> t2tcpu
Update demo_setup/README.md to reflect ks v0.12.0
Update REPO_PATH in demo_setup/kubeflow-demo-base.env
Update initialClusterVersion in k8s cluster creation script to 1.10.6-gke.2
Remove quotation marks from serving.deployHttpProxy so that it is parsed as a boolean instead of string
* Rename t2tgpu & t2ttpu
Rename t2tgpu -> t2tgpu-v1alpha1 and add t2tgpu-v1alpha2 as t2tgpu
Rename t2ttpu -> t2ttpu-v1alpha1 and add t2ttpu-v1alpha2 as t2ttpu
Resolve jsonnet parsing issues
* Upgrade kubeflow to v0.2.4
Add gke environment
* Add instructions for creating TPU clusters
* Replace hard-coded value with env var
* Update kf version to v0.2.4 in env var file
* Add non-gke requirements to t2tcpu component
Sync t2tgpu with t2tcpu
Remove non-gke statements from t2ttpu component
Add k8s v1.10.6 to minikube start command
* Fix bug with non-gke environment setup in t2t
Add service account setup and k8s secret creation instructions for serving & UI
* Single cluster with GPU & TPU
Add creation script for single cluster with access to CPU, GPU, & TPU
Update GPU driver installation to k8s-1.10
* Remove v1alpha1 components
* Update parameter values for t2t components
Increase disk size for minikube cluster creation since 0.2.4 is larger
Update gke cluster creation command
* Update TPU annotation to TF 1.9
* Update kf version to v0.2.5
Update tfJobImage version to v20180809-d2509aa
* Add estimator example for github issues
This is the code input for a doc about writing Keras for TFJob.
There are a few todos:
1. Bug in dataset injection; can't raise the number of steps.
2. Instead of adding a hostPath for data, we should have a quick job + PVC
for this.
* pylint
* wip
* confirmed working on minikube
* pylint
* remove t2t, add documentation
* add note about storageclass
* fix link
* remove code redundancy
* address review
* small language fix
* new PR for XGBoost due to problems with history rewrite
* Update housing.py
* Update HousingServe.py
* Update housing.py
* added bitly
* removed test function
* reorder imports
* fix spaces
* fix spaces
* fixed lint errors
* renamed to xgboost_ames_housing
* Updated Dockerfile.training to use the latest TensorFlow
and TensorFlow Object Detection API.
* Updated tf-training-job component and added a chief
replica spec
* Corrected some typos and updated some instructions
* Replace double quotes for field values (ks convention)
* Recreate the ksonnet application from scratch
* Fix pip commands to find requirements and redo installation, fix ks param set
* Use sed replace instead of ks param set.
* Add cells to first show JobSpec and then apply
* Upgrade T2T, fix conflicting problem types
* Update docker images
* Reduce to 200k samples for vocab
* Use Jupyter notebook service account
* Add illustrative gsutil commands to show output files, specify index files glob explicitly
* List files after index creation step
* Use the model in current repository and not upstream t2t
* Update Docker images
* Expose TF Serving Rest API at 9001
* Spawn terminal from the notebooks ui, no need to go to lab
* Cherry pick changes to PredictionDoFn
* Disable lint checks for cherry picked file
* Update TODO and notebook install instructions
* Restore CUSTOM_COMMANDS todo
* Add a Jupyter notebook to be used for Kubeflow codelabs
* Add help command for create_function_embeddings module
* Update README to point to Jupyter Notebook
* Add prerequisites to readme
* Update README and getting started with notebook guide
* [wip]
* Update notebook with BigQuery previews
* Update notebook to automatically select the latest MODEL_VERSION