* Dataflow job should support writing embeddings to a different location (Fix #366).
* The Dataflow job to compute code embeddings needs parameters controlling
the location of the outputs independently of the inputs. Prior to this fix, the
same table in the dataset was always written and the files were always created
in the data dir.
* This made it very difficult to rerun the embeddings job on the latest GitHub
data (e.g. to regularly update the code embeddings) without overwriting
the current embeddings.
* Refactor how we create BQ sinks and sources in this pipeline
* Rather than create a wrapper class that bundles together a sink and schema,
we should have a separate helper class for creating BQ schemas and then
use WriteToBigQuery directly; see the sketch after this list.
* Similarly for ReadTransforms we don't need a wrapper class that bundles
a query and source. We can just create a class/constant to represent
queries and pass them directly to the appropriate source.
* Change the BQ write disposition to write-if-empty so we don't overwrite existing data.
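A minimal sketch of the refactored pattern, assuming a hypothetical schema helper
and query constant; the project, dataset, table, and field names are placeholders:
```
import apache_beam as beam

# A query kept as a plain constant instead of a ReadTransform wrapper class.
FUNCTION_EMBEDDINGS_QUERY = (
    "SELECT nwo, path, function_name, function_embeddings "
    "FROM [my-project:my_dataset.function_docstrings]")

def function_embeddings_schema():
  """Hypothetical helper that only builds the BQ schema string."""
  return ("nwo:STRING,path:STRING,function_name:STRING,"
          "function_embeddings:STRING")

def read_rows(pipeline):
  # Pass the query constant directly to the source.
  return pipeline | "ReadRows" >> beam.io.Read(
      beam.io.BigQuerySource(query=FUNCTION_EMBEDDINGS_QUERY))

def write_embeddings(rows, output_project, output_dataset, output_table):
  # Use WriteToBigQuery directly rather than a sink wrapper class.
  return rows | "WriteEmbeddings" >> beam.io.WriteToBigQuery(
      table=output_table,
      dataset=output_dataset,
      project=output_project,
      schema=function_embeddings_schema(),
      # Write-if-empty fails instead of overwriting existing data.
      write_disposition=beam.io.BigQueryDisposition.WRITE_EMPTY,
      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
```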
* Fix #390: worker setup fails because requirements.dataflow.txt is not found.
* Dataflow always uses the local file requirements.txt regardless of the
name of the local file used as the source.
* When the job is submitted it will also try to build an sdist package on
the client, which invokes setup.py.
* So in setup.py we always refer to requirements.txt (see the sketch below).
* If trying to install the package in other contexts,
requirements.dataflow.txt should be renamed to requirements.txt
* We do this in the Dockerfile.
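A minimal sketch of the setup.py pattern described above; the package name and
version are placeholders:
```
# setup.py always reads requirements.txt; in non-Dataflow contexts (e.g. the
# Dockerfile) requirements.dataflow.txt is renamed to requirements.txt first.
from setuptools import find_packages, setup

with open("requirements.txt") as f:
  install_requires = f.read().splitlines()

setup(
    name="code-search",   # hypothetical package name
    version="0.1.0",      # placeholder version
    packages=find_packages(),
    install_requires=install_requires,
)
```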
* Refactor the CreateFunctionEmbeddings code so that writing to BQ
is not part of the compute function embeddings code
(this will make it easier to test).
* Fix typo in jsonnet with the output dir; it was missing an "=".
In tensorflow/models/research/object_detection/, only
tensorflow/models/research/object_detection/legacy/train.py
supports Kubeflow so far (it constructs the cluster by reading the
TF_CONFIG environment variable); see the sketch below.
Fixes: #277
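A rough sketch of what constructing the cluster from TF_CONFIG looks like with
the TF 1.x API; the exact wiring inside train.py may differ:
```
import json
import os

import tensorflow as tf

# TF_CONFIG is set on each replica by the TFJob operator.
tf_config = json.loads(os.environ.get("TF_CONFIG") or "{}")
cluster_def = tf_config.get("cluster", {})
task = tf_config.get("task", {})

if cluster_def:
  cluster = tf.train.ClusterSpec(cluster_def)
  server = tf.train.Server(cluster,
                           job_name=task.get("type"),
                           task_index=task.get("index", 0))
  if task.get("type") == "ps":
    # Parameter servers only serve variables.
    server.join()
```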
Remove separate pipelines installation
Update kfp version to 0.1.3-rc.2
Clarify difference in installation paths (click-to-deploy vs CLI)
Use set_gpu_limit() and remove generated yaml with resource limits
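A sketch of how set_gpu_limit() replaces the hand-generated resource-limit YAML,
written against the kfp 0.1.x DSL; the image and arguments are placeholders:
```
import kfp.dsl as dsl

@dsl.pipeline(name="training", description="Example of a GPU-limited step")
def training_pipeline(epochs=1):
  train = dsl.ContainerOp(
      name="train",                               # labels the pipeline step
      image="gcr.io/my-project/trainer:latest",   # placeholder image
      arguments=["--epochs", epochs])
  # Replaces the generated YAML patch that set resource limits.
  train.set_gpu_limit(1)
```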
* Follow argocd instructions
https://github.com/argoproj/argo-cd/blob/master/docs/getting_started.md
to install ArgoCD on the cluster
* Download the argocd manifest and update the namespace to argocd.
* Check it in so ArgoCD can be deployed declaratively.
* Update README.md with the instructions for deploying ArgoCD.
Move the web app components into their own ksonnet app.
* We do this because we want to be able to sync the web app components using
Argo CD
* ArgoCD doesn't allow us to apply autosync with granularity less than the
app. We don't want to sync any of the components except the servers.
* Rename the t2t-code-search-serving component to query-embed-server because
this is more descriptive.
* Check in a YAML spec defining the ksonnet application for the web UI.
Update the instructions in the notebook code-search.ipynb
* Provide updated instructions for deploying the web app now that the web app
is a separate component.
* Improve code-search.ipynb
* Use gcloud to get sensible defaults for parameters like the project.
* Provide more information about what the variables mean.
* This script will be the last step in a pipeline to continuously update
the index for serving.
* The script updates the parameters of the search index server to point
to the supplied index files. It then commits them and creates a PR
to push those commits.
* Restructure the parameters for the search index server so that we can use
ks param set to override the indexFile and lookupFile.
* We do this because we want to be able to push a new index by doing
ks param set in a continuously running pipeline; see the sketch after this list.
* Remove default parameters from search-index-server
* Create a dockerfile suitable for running this script.
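A rough sketch of what that script might do, assuming the search-index-server
component and the indexFile/lookupFile parameters named above; PR creation is
omitted and the paths are placeholders:
```
import subprocess

def update_index_params(ks_app_dir, index_file, lookup_file):
  """Point the search index server at new index files and commit the change."""
  for param, value in [("indexFile", index_file), ("lookupFile", lookup_file)]:
    subprocess.check_call(
        ["ks", "param", "set", "search-index-server", param, value],
        cwd=ks_app_dir)
  subprocess.check_call(["git", "add", "."], cwd=ks_app_dir)
  subprocess.check_call(
      ["git", "commit", "-m", "Update the search index files"], cwd=ks_app_dir)
  # Opening the PR (e.g. via the GitHub API) is left out of this sketch.
```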
* The latest changes to the ksonnet components require certain values
to be defined as defaults.
* This is part of the move away from using a fake component to define
parameters that should be reused across different modules.
see #308
* Verify we can run ks show on a new environment and can evaluate the ksonnet.
Fix #353
* Upgrade and fix the serving components.
* Install a new version of the TFServing package so we can use the new template.
* Fix the UI image. Use the same requirements file as for Dataflow so we are
consistent w.r.t. the version of TF and Tensor2Tensor.
* Remove nms.libsonnet; move all the manifests into the actual component
files rather than using a shared library.
* Fix the name of the TFServing service and deployment; need to use the same
name as used by the front end server.
* Change the port of TFServing; we are now using the built in http server
in TFServing which uses port 8500 as opposed to our custom http proxy.
* We encountered an error importing nmslib; moving the import to the top of the file
appears to fix this.
* Fix lint.
* Default to model trained with CPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models
* Check out the 1.0rc1 release as the latest Pytorch master seems to have MPI backend detection broken
* Track changes in pytorch_mnist/training/ddp/mnist folder to trigger test jobs
* Repoint to pull images from gcr.io/kubeflow-ci built during pre-submit
* Fix webui image name
* Fix logging
* Add GCFS to CPU train
* Fix logging
* Add GCFS to CPU train
* Default to model trained with GPUs
TODO: Enable A/B testing with Seldon to load GPU and CPU models
* Fix Predict() method as Seldon expects 3 arguments; see the sketch below.
* Fix x reference
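The Seldon Python wrapper invokes a predict method of this shape (three
arguments counting self); a minimal sketch, with the model loading and file
name purely illustrative:
```
import torch

class MnistModel(object):
  def __init__(self):
    # Placeholder path; the real model file name may differ.
    self.model = torch.load("model.pt")
    self.model.eval()

  def predict(self, X, feature_names):
    # Seldon passes the request data and the feature names.
    data = torch.from_numpy(X).float()
    with torch.no_grad():
      return self.model(data).numpy()
```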
* Install nmslib in the Dataflow container so it's suitable for running
the index creation job.
* Use command not args in the job specs.
* Dockerfile.dataflow should install nmslib so that we can use that Docker
image to create the index.
* build.jsonnet should tag images as latest. We will use this to use
the latest images as a layer cache to speed up builds.
* Set logging level to info for start_search_server.py and
create_search_index.py
* The create-search-index pod kept getting evicted because the node ran out of
memory.
* Add a new node pool consisting of n1-standard-32 nodes to the demo cluster.
These have 120 GB of RAM compared to 30 GB in our default pool of n1-standard-8.
* Set requests and limits on the creator search index pod.
* Move all the config for the search-index-creator job into the
search-index-creator.jsonnet file. We need to customize the memory resources
so there's not much value in trying to share config with other components.
* Add Pytorch MNIST example
* Fix link to Pytorch MNIST example
* Fix indentation in README
* Fix lint errors
* Fix lint errors
Add prediction proto files
* Add build_image.sh script to build image and push to gcr.io
* Add pytorch-mnist-webui-release release through automatic ksonnet package
* Fix lint errors
* Add pytorch-mnist-webui-release release through automatic ksonnet package
* Add PB2 autogenerated files to ignore with Pylint
* Fix lint errors
* Add official Pytorch DDP examples to ignore with Pylint
* Fix lint errors
* Update component to web-ui release
* Update mount point to kubeflow-gcfs as the example is GCP specific
* 01_setup_a_kubeflow_cluster document complete
* Test release job while PR is WIP
* Reduce workflow name to avoid Argo error:
"must be no more than 63 characters"
* Fix extra_repos to pull worker image
* Fix testing_image using kubeflow-ci rather than kubeflow-releasing
* Fix extra_repo, only needs kubeflow/testing
* Set build_image.sh executable
* Update build_image.sh from CentralDashboard component
* Remove old reference to centraldashboard in echo message
* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md
* Build Pytorch serving image using Python Docker Seldon wrapper rather than s2i:
https://github.com/SeldonIO/seldon-core/blob/master/docs/wrappers/python-docker.md
* Add releases for the training and serving images
* Add releases for the training and serving images
* Fix testing_image using kubeflow-ci rather than kubeflow-releasing
* Fix path to Seldon-wrapper build_image.sh
* Fix image name in ksonnet parameter
* Add 02 distributed training documentation
* Add 03 serving the model documentation
Update shared persistent reference in 02 distributed training documentation
* Add 05 teardown documentation
* Add section to test the model is deployed correctly in 03 serving the model
* Add 04 querying the model documentation
* Fix ks-app to ks_app
* Set prow jobs back to postsubmit
* Set prow jobs to trigger presubmit to kubeflow-ci and postsubmit to
kubeflow-images-public
* Change to kubeflow-ci project
* Increase timeout limit during image build to compile Pytorch
* Increase timeout limit during image build to compile Pytorch
* Change build machine type to compile Pytorch for training image
* Change build machine type to compile Pytorch for training image
* Add OWNERS file to Pytorch example
* Fix typo in documentation
* Remove the Docker daemon check as we are using gcloud build instead
* Use the logging module rather than print()
* Remove empty file, replace with .gitignore to keep tmp folder
* Add ksonnet application to deploy model server and web-ui
Delete model server JSON manifest
* Refactor ks-app to ks_app
* Parametrise serving_model ksonnet component
Default web-ui to use ambassador route to seldon
Remove form section in web-ui
* Remove default environment from ksonnet application
* Update documentation to use ksonnet application
* Fix component name in documentation
* Consolidate Pytorch train module and build_image.sh script
* Consolidate Pytorch train module
* Consolidate Pytorch train module
* Consolidate Pytorch train module and build_image.sh script
* Revert back build_image.sh scripts
* Remove duplicates
* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
* Fix docker build command
* Fix docker build command
* Fix image name for cpu and gpu train
* Consolidate Pytorch train module
* Consolidate train Dockerfiles and build_image.sh script using docker build rather than gcloud
* Add simple pipeline demo
* Add hyperparameter tuning & GPU autoprovisioning
Use pipelines v0.1.2
* Resolve lint issues
* Disable lint warning
Correct SDK syntax that labels the name of the pipeline step
* Add postprocessing step
Basically an empty step, just to show more than one step
* Add clarity to instructions
* Update pipelines install to release v0.1.2
* Add repo cloning with release versions
Remove katib patch
Use kubeflow v0.3.3
Add PROJECT to env var override file
Further clarification of instructions
In order to build a pipeline that can run ksonnet commands, the ksonnet registry needs to be containerized.
Remove it from .dockerignore to unblock the work.
* Create a component to submit the Dataflow job to compute embeddings for code search.
* Update Beam to 2.8.0
* Remove nmslib from the Apache Beam requirements.txt; it's not needed and appears
to have problems installing on the Dataflow workers.
* Spacy download was failing on Dataflow workers; reinstalling the spacy
package as a pip package appears to fix this.
* Fix some bugs in the workflow for building the Docker images.
* Split requirements.txt into separate requirements for the Dataflow
workers and the UI.
* We don't want to install unnecessary dependencies in the Dataflow workers.
Some unnecessary dependencies (e.g. nmslib) were also having problems
being installed on the workers.
* Modify K8s models to export the models; tensorboard manifests
* Use a K8s job not a TFJob to export the model.
* Start an experiments.libsonnet file to define groups of parameters for
different experiments that should be reused
* Need to install tensorflow_hub in the Docker image because it is
required by t2t exporter.
* Address review comments.
Otherwise, when I want to execute Dataflow code such as
```
python2 -m code_search.dataflow.cli.create_function_embeddings \
```
it complains that there is no setup.py.
I could work around this by using the container workingDir API, but setting it to the default would be more convenient.
* Make distributed training work; Create some components to train models
* Check in a ksonnet component to train a model using the tinyparam
hyperparameter set.
* We want to check in the ksonnet component to facilitate reproducibility.
We need a better way to separate the particular experiments used for
the CS search demo effort from the jobs we want customers to try.
Related to #239 (train a high-quality model).
* Check in the cs_demo ks environment; this was being ignored as a result of
.gitignore
Make distributed training work #208
* We got distributed synchronous training to work with Tensor2Tensor 1.10
* This required creating a simple python script to start the TF standard
server and run it as a sidecar of the chief pod and as the main container
for the workers/ps.
* Rename the model to kf_similarity_transformer to be consistent with other
code.
* We don't want to use the default name because we don't want to inadvertently
use the SimilarityTransformer model defined in the Tensor2Tensor project.
* Replace build.sh with a Makefile; this makes it easier to add variant commands.
* Use the GitHash not a random id as the tag.
* Add a label to the docker image to indicate the git version.
* Put the Makefile at the top of the code_search tree; makes it easier
to pull all the different sources for the Docker images.
* Add an option to build the Docker images with GCB; this is more efficient
when you are on a poor network connection because you don't have to download
images locally.
* Use jsonnet to define and parameterize the GCB workflow.
* Build separate docker images for running Dataflow and for running the trainer.
This helps avoid versioning conflicts caused by different versions of protobuf
pulled in by the TF version used as the base image vs. the version used
with apache beam.
Fix #310: training fails with GPUs.
* Changes to support distributed training.
* Simplify t2t-entrypoint.sh so that all we do is parse TF_CONFIG
and pass requisite config information as command line arguments;
everything else can be set in the K8s spec.
* Upgrade to T2T 1.10.
* Add ksonnet prototypes for tensorboard.
* Unify the code for training with Keras and TF.Estimator
Create a single train.py and trainer.py which use Keras inside TensorFlow
Provide options to train with either Keras or TF.Estimator
The code to train with TF.Estimator doesn't work;
see #196.
The original PR (#203) worked around a blocking issue with Keras and TF.Estimator by commenting
out certain layers in the model architecture, leading to a model that wouldn't generate meaningful
predictions.
We weren't able to get TF.Estimator working, but this PR should make it easier to troubleshoot further.
We've unified the existing code so that we don't duplicate code just to train with TF.Estimator.
We've added unittests that can be used to verify that training with TF.Estimator works. This test
can also be used to reproduce the current errors with TF.Estimator.
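A sketch of the unified shape: build one tf.keras model and either call fit()
directly or convert it to an Estimator. The model, input handling, and
hyperparameters are illustrative only, not the example's actual architecture:
```
import tensorflow as tf

def build_model():
  # Stand-in architecture; the real model is defined elsewhere.
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
      tf.keras.layers.Dense(1),
  ])
  model.compile(optimizer="adam", loss="mse")
  return model

def train(use_estimator, x, y):
  model = build_model()
  if use_estimator:
    # TF.Estimator path: wrap the same Keras model.
    estimator = tf.keras.estimator.model_to_estimator(keras_model=model)
    input_fn = tf.estimator.inputs.numpy_input_fn(
        {model.input_names[0]: x}, y, shuffle=True)
    estimator.train(input_fn=input_fn, steps=10)
  else:
    # Plain Keras path.
    model.fit(x, y, epochs=1)
  return model
```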
Add a Makefile to build the Docker image
Add a NFS PVC to our Kubeflow demo deployment.
Create a tfjob-estimator component in our ksonnet component.
Changes to distributed/train.py as part of merging with notebooks/train.py:
* Add command line arguments to specify paths rather than hard coding them.
* Remove the code at the start of train.py to wait until the input data
becomes available.
* I think the original intent was to allow the TFJob to be started simultaneously with the preprocessing
job and just block until the data is available
* That should be unnecessary since we can just run the preprocessing job as a separate job.
Fix notebooks/train.py (#186)
The code wasn't actually calling model.fit().
Add a unittest to verify we can invoke fit and evaluate without throwing exceptions.
* Address comments.
* Update the datagen component.
* We should use a K8s job rather than a TFJob. We can also simplify the
ksonnet by just putting the spec into the jsonnet file rather than trying
to share various bits of the spec with the TFJob for training.
Related to kubeflow/examples#308: use globals to allow parameters to be shared
across components (e.g. the working directory).
* Update the README with information about data.
* Fix table markdown.
* Fix performance of dataflow preprocessing job.
* Fix #300; the Dataflow job for preprocessing is really slow.
* The problem is that we are loading the spacy tokenization model on every
invocation of the tokenization function, and this is really expensive.
* We should be doing this once per module import; see the sketch after this list.
* After fixing this issue, the job completed in approximately 20 minutes using
5 workers.
* We can process all 1.3 million records in ~20 minutes (elapsed time) using 5 32-CPU workers and about 1 hour of CPU time altogether.
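A sketch of the fix; the spacy model name and function name are illustrative:
```
import spacy

# Load the spacy model once per module import (i.e. once per worker process)
# instead of on every call to the tokenization function.
_NLP = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def tokenize(text):
  return [token.text.lower() for token in _NLP(text)]
```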
* Add options to the Dataflow job to read from files as opposed to BigQuery
and to skip BigQuery writes. This is useful for testing.
* Add a "unittest" that verifies the Dataflow preprocessing job can run
successfully using the DirectRunner.
* Update the Docker image and a ksonnet component for a K8s job that
can be used to submit the Dataflow job.
* Fix #299; add logging to the Dataflow preprocessing job to indicate that
a Dataflow job was submitted.
* Add an option to the preprocessing Dataflow job to read an entire
BigQuery table as the input rather than running a query to get the input.
This is useful in the case where the user wants to run a different
query to select the repo paths and contents to process and write them
to some table to be processed by the Dataflow job.
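A sketch of how that option might look in the pipeline code; the flag handling
and table name are placeholders:
```
import apache_beam as beam

def make_input_source(input_table=None, query=None):
  """Read a whole BigQuery table when one is given, otherwise run a query."""
  if input_table:
    # e.g. "my-project:my_dataset.preselected_files" (placeholder name)
    return beam.io.BigQuerySource(table=input_table)
  return beam.io.BigQuerySource(query=query)
```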
* Fix lint.
* More lint fixes.
* Fix model export, loss function, and add some manual tests.
Fix model export to support computing code embeddings: Fix #260
* The previous exported model was always using the embeddings trained for
the search query.
* But we need to be able to compute embedding vectors for both the query
and code.
* To support this we add a new input feature "embed_code" and conditional
ops. The exported model uses the value of the embed_code feature to determine
whether to treat the inputs as a query string or code and computes
the embeddings appropriately.
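A rough sketch of the conditional-op idea; encode_query/encode_code stand in
for the real Tensor2Tensor encoder ops, and shapes are simplified:
```
import tensorflow as tf

def select_embeddings(inputs, embed_code, encode_query, encode_code):
  # embed_code is a per-example flag from the serving input.
  query_embedding = encode_query(inputs)
  code_embedding = encode_code(inputs)
  use_code = tf.cast(tf.reshape(embed_code, [-1]), tf.bool)
  # Pick, per example, which embedding the exported model returns.
  return tf.where(use_code, code_embedding, query_embedding)
```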
* Originally based on #233 by @activatedgeek
Loss function improvements
* See #259 for a long discussion about different loss functions.
* @activatedgeek was experimenting with different loss functions in #233
and this pulls in some of those changes.
Add manual tests
* Related to #258
* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
input based on which embeddings we are computing.
Change Problem/Model name
* Register the problem github_function_docstring with a different name
to distinguish it from the version inside the Tensor2Tensor library.
* Skip the test when running under prow because it's a manual test.
* Fix some lint errors.
* Fix lint and skip tests.
* Fix lint.
* Fix lint.
* Revert loss function changes; we can do that in a follow on PR.
* Run generate_data as part of the test rather than reusing a cached
vocab and processed input file.
* Modify SimilarityTransformer so we can easily override the number of shards
used, to facilitate testing.
* Comment out py-test for now.
* Upgrade demo to KF v0.3.1
Update env variable names and values in base file
Cleanup ambassador metadata for UI component
Add kfctl installation instructions
Tighten minikube setup instructions and update k8s version
Move environment variable setup to very beginning
Replace cluster creation commands with links to the appropriate section in demo_setup/README.md
Replace deploy.sh with kfctl
Replace kubeflow-core component with individual components
Remove connection to UI pod directly & connect via ambassador instead
Add cleanup commands
* Clarify wording
* Update parameter file
Resolve python error with file open
Consolidate kubeflow install command
* Fix #272, where the `create-pet-record-job` pod produces this error: `models/research/object_detection/data/pet_label_map.pbtxt; No such file or directory`
* Update create-pet-record-job.jsonnet