Commit Graph

14 Commits

Author SHA1 Message Date
Jeremy Lewi d2b68f15d7 Fix the K8s job to create the nmslib index. (#338)
* Install nmslib in the Dataflow container so it's suitable for running
  the index creation job.

* Use command, not args, in the job specs.

* Dockerfile.dataflow should install nmslib so that we can use that Docker
  image to create the index.

* build.jsonnet should tag images as latest; we will use the latest images
  as a layer cache to speed up builds.

* Set the logging level to info for start_search_server.py and
  create_search_index.py (see the sketch at the end of this entry).

* The create-search-index pod was getting evicted because the node was
  running out of memory.

* Add a new node pool consisting of n1-standard-32 nodes to the demo cluster.
  These have 120 GB of RAM compared to 30 GB in our default pool of
  n1-standard-8 nodes.

* Set requests and limits on the search-index-creator pod.

* Move all the config for the search-index-creator job into the
  search-index-creator.jsonnet file. We need to customize the memory resources,
  so there's not much value in trying to share config with other components.
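A minimal sketch of the logging change above; the format string and message are assumptions:

```
# Hedged sketch: raise the default logging level to INFO at startup, as
# described for start_search_server.py and create_search_index.py.
import logging

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(message)s",  # assumed format
    )
    logging.info("Starting index creation")  # assumed message
```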
2018-11-20 12:53:09 -08:00
Jeremy Lewi 26c400a4cd Create a component to submit the Dataflow job to compute embeddings for code search (#324)
* Create a component to submit the Dataflow job to compute embeddings for code search.

* Update Beam to 2.8.0
* Remove nmslib from the Apache Beam requirements.txt; it's not needed and
  appears to have problems installing on the Dataflow workers.

* Spacy download was failing on Dataflow workers; reinstalling the spacy
  package as a pip package appears to fix this.

* Fix some bugs in the workflow for building the Docker images.

* Split requirements.txt into separate requirements for the Dataflow
  workers and the UI (see the sketch below).

* We don't want to install unnecessary dependencies in the Dataflow workers.
  Some unnecessary dependencies, e.g. nmslib, were also having problems
  being installed on the workers.
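A hedged sketch of what the requirements split enables when submitting the Beam job; the file name and GCP values are assumptions:

```
# Point the Dataflow workers at their own slimmer requirements file so that
# UI-only (and problematic, e.g. nmslib) dependencies stay off the workers.
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",            # assumed project
    temp_location="gs://my-bucket/tmp",  # assumed bucket
)
options.view_as(SetupOptions).requirements_file = "requirements.dataflow.txt"  # assumed name
```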
2018-11-14 13:45:09 -08:00
Yang Pan 6c976342a3 exit if t2t job failed (#327) 2018-11-11 21:35:44 -08:00
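A minimal sketch of the exit-on-failure behavior in the commit above, assuming the t2t job is launched as a subprocess; the command line is a placeholder:

```
# Propagate a failing t2t job's exit code instead of silently continuing.
import subprocess
import sys

retcode = subprocess.call(["t2t-trainer"])  # placeholder command line
if retcode != 0:
    sys.exit(retcode)
```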
Jeremy Lewi 2487194fbd Modify K8s models to export the models; tensorboard manifests (#320)
* Modify K8s models to export the models; tensorboard manifests

* Use a K8s job, not a TFJob, to export the model (see the sketch below).
* Start an experiments.libsonnet file to define groups of parameters for
  different experiments that should be reused

* Need to install tensorflow_hub in the Docker image because it is
  required by t2t exporter.

* Address review comments.
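A hedged sketch of the export job, written with the kubernetes Python client rather than the ksonnet manifests the commit actually adds; the image, namespace, and exporter flags are assumptions:

```
# A plain K8s Job (not a TFJob) that runs the t2t exporter once to completion.
from kubernetes import client, config

config.load_kube_config()
container = client.V1Container(
    name="exporter",
    image="gcr.io/my-project/code-search:latest",  # assumed image
    command=["t2t-exporter"],                      # real flags omitted
)
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="export-model"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container])
        )
    ),
)
client.BatchV1Api().create_namespaced_job(namespace="kubeflow", body=job)
```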
2018-11-11 19:09:42 -08:00
Yang Pan c6ff5dbef8 Change dataflow default workdir to /src (#330)
Otherwise, when I want to execute Dataflow code
```
python2 -m code_search.dataflow.cli.create_function_embeddings \
```
it complains that there is no setup.py.

I could work around this with the container workingDir API, but setting the default would be more convenient.
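A sketch of why the working directory matters here: Beam resolves a relative setup_file against the current working directory, so launching from /src (where setup.py lives) lets the default resolve; an absolute path is the defensive alternative:

```
import os

from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

options = PipelineOptions()
# Making the path absolute removes the dependence on the working directory.
options.view_as(SetupOptions).setup_file = os.path.abspath("setup.py")
```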
2018-11-11 15:37:59 -08:00
Jeremy Lewi 65e89a599b Code search example: make distributed training work; create some components to train models (#317)
* Make distributed training work; Create some components to train models

* Check in a ksonnet component to train a model using the tinyparam
  hyperparameter set.

* We want to check in the ksonnet component to facilitate reproducibility.
  We need a better way to separate the particular experiments used for
  the code search demo effort from the jobs we want customers to try.

   Related to #239: train a high-quality model.

* Check in the cs_demo ks environment; this was being ignored as a result of
  .gitignore

Make distributed training work (#208)

* We got distributed synchronous training to work with Tensor2Tensor 1.10.
* This required creating a simple python script to start the TF standard
  server and run it as a sidecar of the chief pod and as the main container
  for the workers/ps (see the TF_CONFIG sketch below).

* Rename the model to kf_similarity_transformer to be consistent with other
  code.
  * We don't want to use the default name because we don't want to inadvertently
  use the SimilarityTransformer model defined in the Tensor2Tensor project.

* Replace build.sh with a Makefile; makes it easier to add variant commands.
  * Use the git hash, not a random id, as the tag.
  * Add a label to the docker image to indicate the git version.

* Put the Makefile at the top of the code_search tree; makes it easier
  to pull all the different sources for the Docker images.

* Add an option to build the Docker images with GCB; this is more efficient
  when you are on a poor network connection because you don't have to download
  images locally.
    * Use jsonnet to define and parameterize the GCB workflow.

* Build separate docker images for running Dataflow and for running the trainer.
  This helps avoid versioning conflicts caused by different versions of protobuf
  pulled in by the TF version used as the base image vs. the version used
  with Apache Beam.

  Fixes #310: training fails with GPUs.

* Changes to support distributed training.
* Simplify t2t-entrypoint.sh so that all we do is parse TF_CONFIG
  and pass requisite config information as command line arguments;
  everything else can be set in the K8s spec.

* Upgrade to T2T 1.10.

* Add ksonnet prototypes for tensorboard.
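A hedged sketch of the sidecar script described above, using the TF 1.10-era API: parse TF_CONFIG (set by the TFJob operator) and start the standard TensorFlow server:

```
import json
import os

import tensorflow as tf

tf_config = json.loads(os.environ["TF_CONFIG"])
cluster = tf.train.ClusterSpec(tf_config["cluster"])
task = tf_config["task"]

server = tf.train.Server(cluster,
                         job_name=task["type"],
                         task_index=task["index"])
server.join()  # serve the cluster; training itself runs in the chief container
```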
2018-11-08 16:13:01 -08:00
Jeremy Lewi df278567f0 Fix performance of dataflow preprocessing job. (#302)
* Fix performance of dataflow preprocessing job.

* Fix #300; Dataflow job for preprocessing is really slow.

  * The problem is that we load the spacy tokenization model on every
    invocation of the tokenization function, and this is really expensive.
  * We should be doing this once per module import (see the sketch below).

* After fixing this issue, the job completed in approximately 20 minutes using
  5 workers.

  * We can process all 1.3 million records in ~20 minutes (elapsed time) using five 32-CPU workers and about 1 hour of CPU time altogether.

* Add options to the Dataflow job to read from files as opposed to BigQuery
  and to skip BigQuery writes. This is useful for testing.

* Add a "unittest" that verifies the Dataflow preprocessing job can run
  successfully using the DirectRunner.

* Update the Docker image and a ksonnet component for a K8s job that
  can be used to submit the Dataflow job.

* Fix #299: add logging to the Dataflow preprocessing job to indicate that
  a Dataflow job was submitted.

* Add an option to the preprocessing Dataflow job to read an entire
  BigQuery table as the input rather than running a query to get the input.
  This is useful in the case where the user wants to run a different
  query to select the repo paths and contents to process and write them
  to some table to be processed by the Dataflow job.

* Fix lint.

* More lint fixes.
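A minimal sketch of the fix: load the spacy model once at module import rather than on every call; the model name is an assumption:

```
import spacy

# Loaded once per module import (i.e. once per worker process),
# not on every invocation of the tokenization function.
_NLP = spacy.load("en_core_web_sm")  # assumed model name

def tokenize(text):
    return [token.text for token in _NLP(text)]
```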
2018-11-06 14:14:28 -08:00
Jeremy Lewi acd8007717 Use conditionals and add test for code search (#291)
* Fix model export, loss function, and add some manual tests.

Fix model export to support computing code embeddings; fixes #260.

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code, and computes
  the embeddings appropriately (see the sketch below).

* Originally based on #233 by @activatedgeek

Loss function improvements

* See #259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in #233
  and this pulls in some of those changes.

Add manual tests

* Related to #258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.

* Skip the test when running under prow because it's a manual test.
* Fix some lint errors.

* Fix lint and skip tests.

* Fix lint.

* Fix lint.
* Revert loss function changes; we can do that in a follow-on PR.

* Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can easily override the number of
  shards used, to facilitate testing.

* Comment out py-test for now.
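A hedged sketch (TF 1.x) of the embed_code conditional; the encoder callables and feature shapes are assumptions standing in for the model's real ops:

```
import tensorflow as tf

def embed(features, encode_query, encode_code):
    # embed_code arrives as an integer feature; treat nonzero as "code".
    embed_code = tf.cast(tf.reshape(features["embed_code"], []), tf.bool)
    return tf.cond(embed_code,
                   lambda: encode_code(features["input"]),
                   lambda: encode_query(features["input"]))
```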
2018-11-02 09:52:11 -07:00
Sanyam Kapoor f9873e6ac4 Upgrade notebook commands and other relevant changes (#229)
* Replace double quotes for field values (ks convention)

* Recreate the ksonnet application from scratch

* Fix pip commands to find requirements and redo installation, fix ks param set

* Use sed replace instead of ks param set.

* Add cells to first show JobSpec and then apply

* Upgrade T2T, fix conflicting problem types

* Update docker images

* Reduce to 200k samples for vocab

* Use Jupyter notebook service account

* Add illustrative gsutil commands to show output files, specify index files glob explicitly

* List files after index creation step

* Use the model in the current repository and not the upstream t2t one

* Update Docker images

* Expose the TF Serving REST API at 9001 (see the sketch below)

* Spawn a terminal from the notebooks UI; no need to go to lab
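A hedged sketch of hitting that endpoint; the host, model name, and payload shape are assumptions:

```
import requests

# TF Serving's REST predict API: POST /v1/models/<name>:predict
resp = requests.post(
    "http://localhost:9001/v1/models/t2t-code-search:predict",       # assumed host/model
    json={"instances": [{"input": "def add(a, b): return a + b"}]},  # assumed payload
)
print(resp.json())
```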
2018-08-20 16:35:07 -07:00
Sanyam Kapoor e34f9aca75 Build just one image with the correct tag instead of twice as many 2018-08-09 10:53:23 -07:00
Sanyam Kapoor 6527aba7c1 Upgrade JS app to be served at any path prefix 2018-08-09 10:53:23 -07:00
Sanyam Kapoor f2151f66fc Merge UI and Search Server (#209)
* Use the nicer tf.gfile interface for search index creation (see the sketch below)

* Update documentation and more maintainable interface to search server

* Add ability to control number of outputs

* Serve React UI from the Flask server

* Update Dockerfile for the unified server and UI
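A minimal sketch of the tf.gfile interface (TF 1.x), which reads local and GCS paths through one interface and is what keeps the index-creation code storage-agnostic; the path is an assumption:

```
import tensorflow as tf

with tf.gfile.GFile("gs://my-bucket/embeddings.csv", "r") as f:  # assumed path
    for line in f:
        print(line.strip())  # the real code parses rows into the nmslib index
```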
2018-08-03 15:56:09 -07:00
Sanyam Kapoor e9e844022e Disable Distributed Training (#207)
* Upgrade TFJob and Ksonnet app

* Container name should be tensorflow. See #563.

* Working single node training and serving on Kubeflow

* Add issue link for FIXME

* Remove redundant secret creation and use Kubeflow-provided secrets
2018-08-02 23:02:05 -07:00
Sanyam Kapoor 636cf1c3d0 Integrate batch prediction (#184)
* Refactor the dataflow package

* Create placeholder for new prediction pipeline

* [WIP] Add DoFn for encoding (see the sketch at the end of this entry)

* Merge all modules under single package

* Pipeline dataflow complete; WIP prediction values

* Fall back to custom commands for extra dependencies

* Working Dataflow runner installs, separate docker-related folder

* [WIP] Updated local user journey in README, fully working commands, easy container translation

* Working Batch Predictions.

* Remove docstring embeddings

* Complete batch prediction pipeline

* Update Dockerfiles and T2T Ksonnet components

* Fix linting

* Downgrade runtime to Python 2; WIP memory issues, so use less data

* Pin master to index 0.

* Working batch prediction pipeline

* Modular GitHub batch prediction pipeline; stores results back to BigQuery

* Fix lint errors

* Fix module-wide imports, pin batch-prediction version

* Fix relative import, update docstrings

* Add references to issue and current workaround for Batch Prediction dependency.
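A hedged Beam sketch of the shape of the pipeline: an encoding DoFn with results written back to BigQuery. The encoder stub, query, table, and schema are all assumptions:

```
import apache_beam as beam

def encode(text):
    # stand-in for the real T2T/batch-prediction encoding step
    return text[::-1]

class EncodeExample(beam.DoFn):
    def process(self, row):
        yield {"content": row["content"], "encoded": encode(row["content"])}

with beam.Pipeline() as p:
    (p
     | beam.io.Read(beam.io.BigQuerySource(query="SELECT content FROM ..."))  # assumed query
     | beam.ParDo(EncodeExample())
     | beam.io.WriteToBigQuery("my-project:dataset.predictions",   # assumed table
                               schema="content:STRING,encoded:STRING"))
```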
2018-07-23 16:26:23 -07:00