89 Commits

Author SHA1 Message Date
Jeremy Lewi f87dfd8e53 Create a demo cluster for the code search example. (#298) 2018-11-05 06:07:52 -08:00
Jeremy Lewi acd8007717 Use conditionals and add test for code search (#291)
* Fix model export, loss function, and add some manual tests.

Fix Model export to support computing code embeddings: Fix #260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.

* Originally based on #233 by @activatedgeek
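
A minimal sketch of the conditional described above, written in plain Python rather than TensorFlow ops. Only the `embed_code` feature name comes from this commit message; the two encoder functions and `compute_embedding` are hypothetical stand-ins for the trained query and code encoders, not the exported model's actual code.

```python
def embed_query(tokens):
    # Hypothetical stand-in for the query-string encoder.
    return [float(len(t)) for t in tokens]


def embed_code_tokens(tokens):
    # Hypothetical stand-in for the code encoder.
    return [float(len(t)) * 2.0 + 1.0 for t in tokens]


def compute_embedding(tokens, embed_code):
    """Choose the encoder from the value of the embed_code feature."""
    if embed_code:
        return embed_code_tokens(tokens)
    return embed_query(tokens)


tokens = ["def", "add", "a", "b"]
query_vec = compute_embedding(tokens, embed_code=False)
code_vec = compute_embedding(tokens, embed_code=True)

# Sanity check: the same input yields different embeddings depending on
# which encoder was selected.
assert query_vec != code_vec
```

In a TensorFlow graph the same selection would likely be a conditional op keyed on the feature value, so a single exported model can serve both request types.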

Loss function improvements

* See #259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in #233
  and this pulls in some of those changes.

Add manual tests

* Related to #258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.

* Skip the test when running under prow because it's a manual test.
* Fix some lint errors.

* Fix lint and skip tests.

* Fix lint.

* Fix lint.
* Revert loss function changes; we can do that in a follow-on PR.

* Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can overwrite the number of shards
  used easily to facilitate testing.

* Comment out py-test for now.
2018-11-02 09:52:11 -07:00
Jeremy Lewi adf614fc5f Add tensorboard and check in vendor for the code search example. (#255)
* Add tensorboard and check in vendor for the code search example.

* Remove the default env; when I ran ks show I got errors but
  removing it and adding a fresh env worked. It also won't point to
  the correct cluster for users.
2018-10-04 10:18:58 -07:00
Sanyam Kapoor f9873e6ac4 Upgrade notebook commands and other relevant changes (#229)
* Replace double quotes for field values (ks convention)

* Recreate the ksonnet application from scratch

* Fix pip commands to find requirements and redo installation, fix ks param set

* Use sed replace instead of ks param set.

* Add cells to first show JobSpec and then apply

* Upgrade T2T, fix conflicting problem types

* Update docker images

* Reduce to 200k samples for vocab

* Use Jupyter notebook service account

* Add illustrative gsutil commands to show output files, specify index files glob explicitly

* List files after index creation step

* Use the model in current repository and not upstream t2t

* Update Docker images

* Expose TF Serving Rest API at 9001

* Spawn terminal from the notebook UI, no need to go to lab
2018-08-20 16:35:07 -07:00
Sanyam Kapoor 4e015e76a3 Cherry pick changes to PredictionDoFn (#226)
* Cherry pick changes to PredictionDoFn

* Disable lint checks for cherry picked file

* Update TODO and notebook install instructions

* Restore CUSTOM_COMMANDS todo
2018-08-15 06:21:00 -07:00
Sanyam Kapoor 18829159b0 Add a new github function docstring extended problem (#225)
* Add a new github function docstring extended problem

* Fix lint errors

* Update images
2018-08-14 15:41:47 -07:00
Sanyam Kapoor 8fce4a7799 Allow ks param set for Code Search Ksonnet Application (#224)
* Allow ks param set for t2t-code-search

* Update notebook with working directory param set

* Abstract out common variables for easy ks param set
2018-08-14 15:29:04 -07:00
Sanyam Kapoor a687c51036 Add a Jupyter notebook to be used for Kubeflow codelabs (#217)
* Add a Jupyter notebook to be used for Kubeflow codelabs

* Add help command for create_function_embeddings module

* Update README to point to Jupyter Notebook

* Add prerequisites to readme

* Update README and getting started with notebook guide

* [wip]

* Update notebook with BigQuery previews

* Update notebook to automatically select the latest MODEL_VERSION
2018-08-13 21:43:26 -07:00
Sanyam Kapoor 6e9150bad6 Parametrize volumes and ports for nmslib containers 2018-08-09 10:53:23 -07:00
Sanyam Kapoor 133e054033 Refactor job and deployment specs into different functions 2018-08-09 10:53:23 -07:00
Sanyam Kapoor e34f9aca75 Build just one image with the correct tag instead of double the number 2018-08-09 10:53:23 -07:00
Sanyam Kapoor c86f306d79 Use kind Job instead of Pod 2018-08-09 10:53:23 -07:00
Sanyam Kapoor 6527aba7c1 Upgrade JS app to be served at any path prefix 2018-08-09 10:53:23 -07:00
Sanyam Kapoor 9ce23d9fc6 Working search index server 2018-08-09 10:53:23 -07:00
Sanyam Kapoor 02db0065c1 Make search index creation a one-off job 2018-08-09 10:53:23 -07:00
Sanyam Kapoor d4669467d8 Update Search Index server spec with new commands 2018-08-09 10:53:23 -07:00
Sanyam Kapoor f2151f66fc Merge UI and Search Server (#209)
* Use the nicer tf.gfile interface for search index creation

* Update documentation and more maintainable interface to search server

* Add ability to control number of outputs

* Serve React UI from the Flask server

* Update Dockerfile for the unified server and ui
2018-08-03 15:56:09 -07:00
Sanyam Kapoor e9e844022e Disable Distributed Training (#207)
* Upgrade TFJob and Ksonnet app

* Container name should be tensorflow. See #563.

* Working single node training and serving on Kubeflow

* Add issue link for fixme

* Remove redundant create secrets and use Kubeflow provided secrets
2018-08-02 23:02:05 -07:00
Sanyam Kapoor fd2e750990 Fix T2T memory problem (#205)
* Update T2T problems to work around memory limitations

* Add max_samples_for_vocab to prevent memory overflow

* Fix a base URL to download data from, sweet spot for max samples

* Convert class variables to class properties

* Fix lint errors

* Use Python2/3 compatible code for StringIO

* Fix lint errors

* Fix source data files format

* Move to Text2TextProblem instead of TranslateProblem

* Update details for num_shards and T2T problem dataset
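
One common way to get a Python 2/3 compatible `StringIO` import, as mentioned in the list above, is a guarded import. This is a sketch of the general pattern only; the commit may have used a different approach (for example a compatibility library), and `buf` and `text` are illustrative names.

```python
try:
    # Python 2: StringIO lives in its own module.
    from StringIO import StringIO
except ImportError:
    # Python 3: StringIO is part of the io module.
    from io import StringIO

buf = StringIO()
buf.write(u"hello")
text = buf.getvalue()
```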
2018-08-01 13:37:41 -07:00
Sanyam Kapoor 767c90ff20 Refactor dataflow pipelines (#197)
* Update to a new dataflow package

* [WIP] updating docstrings, fixing redundancies

* Limit the scope of Github Transform pipeline, make everything unicode

* Add ability to start github pipelines from transformed bigquery dataset

* Upgrade batch prediction pipeline to be modular

* Fix lint errors

* Add write disposition to BigQuery transform

* Update documentation format

* Nicer names for modules

* Add unicode encoding to parsed function docstring tuples

* Use Apache Beam options parser to expose all CLI arguments
2018-07-27 06:26:56 -07:00
Sanyam Kapoor 994fdf82c0 Integrate nmslib (#194)
* Integrate NMSLib server with new data file

* Integrate UI with query URL of search server
2018-07-23 17:17:24 -07:00
Sanyam Kapoor 636cf1c3d0 Integrate batch prediction (#184)
* Refactor the dataflow package

* Create placeholder for new prediction pipeline

* [WIP] add dofn for encoding

* Merge all modules under single package

* Pipeline data flow complete, wip prediction values

* Fallback to custom commands for extra dependency

* Working Dataflow runner installs, separate docker-related folder

* [WIP] Updated local user journey in README, fully working commands, easy container translation

* Working Batch Predictions.

* Remove docstring embeddings

* Complete batch prediction pipeline

* Update Dockerfiles and T2T Ksonnet components

* Fix linting

* Downgrade runtime to Python2; memory issues are WIP, so use less data

* Pin master to index 0.

* Working batch prediction pipeline

* Modular Github Batch Prediction Pipeline, stores back to BigQuery

* Fix lint errors

* Fix module-wide imports, pin batch-prediction version

* Fix relative import, update docstrings

* Add references to issue and current workaround for Batch Prediction dependency.
2018-07-23 16:26:23 -07:00
Sanyam Kapoor 2adbb7ace4 Fix transformer export (#169)
* Add auto-downloads for the data

* Make top() a no-op, working export

* Fix lint errors

* Integrate NMSlib server with TF Serving

* Clarify data URLs purpose
2018-07-16 14:06:52 -07:00
Sanyam Kapoor d692db36e8 Search UI Components (#168)
* Initialize search UI. Needs connection to search service

* Fix page title

* Add component for code search results, dummy values for now

* Fix title and manifest

* Add mock loading UI. Need to fill in real API results

* Wrap application into Dockerfile
2018-07-10 20:08:25 -07:00
Sanyam Kapoor c5f13464b4 Add negative sampling to Transformer network (#167)
* Add negative sampling to Transformer network

* Add generate data flag, can skip t2t-datagen step
2018-07-04 20:14:22 -07:00
Sanyam Kapoor 5a9748bf8f Add similarity transformer body (#159)
* Add similarity transformer body

* Update pipeline to write a single CSV file

* Fix lint errors

* Use CSV writer to handle formatting rows

* Use direct transformer encoding methods with variable scopes

* Complete end-to-end training with new model and problem

* Read from multiple CSV files
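
As an illustration of why a CSV writer helps with formatting rows, the sketch below uses Python's standard `csv` module, which quotes fields containing commas automatically. The two-column layout (function code, docstring) and the sample row are assumptions for illustration, not the pipeline's actual schema.

```python
import csv
import io

# Hypothetical row: (function source, docstring). Both fields contain commas.
rows = [("def add(a, b): return a + b", "Add two numbers, a and b.")]

buf = io.StringIO()
writer = csv.writer(buf)
for row in rows:
    # csv.writer quotes any field containing the delimiter, so embedded
    # commas do not split the field into extra columns.
    writer.writerow(row)
output = buf.getvalue()

# Reading the text back recovers the original two fields per row.
parsed = list(csv.reader(io.StringIO(output)))
```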
2018-07-03 11:14:19 -07:00
Sanyam Kapoor c1b2802313 Add new TF-Serving component with sample task (#152)
* Add new TF-Serving component with sample task

* Unify nmslib and t2t packages, need to be cohesive

* [WIP] update references to the package

* Replace old T2T problem

* Add representative code for encoding/decoding from tf serving service

* Add rest API port to TF serving (replaces custom http proxy)

* Fix linting

* Add NMSLib creator and server components

* Add docs to CLI module
2018-06-28 20:37:21 -07:00
Sanyam Kapoor f20161167e Add a new similarity transformer model, register new problem (#146)
* Add a new similarity transformer model, register new problem

* Remove useless constructor
2018-06-27 11:00:18 -07:00
Sanyam Kapoor 656e1e3e7c Extension of T2T Ksonnet component (#149)
* Add jobs derived from t2t component, GCP credentials assumed

* Add script to create IAM role bindings for Docker container to use

* Fix names to hyphens

* Add t2t-exporter wrapper

* Fix typos

* A temporary workaround for tensorflow/tensor2tensor#879

* Complete working pipeline of datagen, trainer and exporter

* Add docstring to create_secrets.sh
2018-06-25 15:09:22 -07:00
Sanyam Kapoor 21506ffc51 Python package for indexing and serving the index (#150)
* Add a utility python package for indexing and serving the index

* Add CLI arguments, conditional GCS download

* Complete skeleton CLIs for serving and index creation

* Fix lint issues
2018-06-20 15:34:05 -07:00
Sanyam Kapoor 4bd30a1e68 Language task on kubeflow (#143)
* [WIP] initialize ksonnet app

* Push images to GCR

* Upgrade Docker container to run T2T entrypoint with appropriate env vars

* Add a tf-job based t2t-job

* Fix GPU parameters
2018-06-15 18:16:34 -07:00
Sanyam Kapoor 242c2e6d20 Add custom metrics, write raw tokens to GCS (#141)
* Add custom metrics, write raw tokens to GCS

* Change number of output file shards to 1
2018-06-13 12:03:27 -07:00
Sanyam Kapoor 3bff3339f7 Isolate t2t execution into docker (#131)
* Isolate t2t execution into a docker

* Add image build script, update run interface

* Fix grammar typo
2018-06-12 12:53:29 -07:00
Sanyam Kapoor d3c781772c Language modeling using Transformer Networks (#129)
* Add Github language modeling problem

* Rename folders, update README with datagen and train scripts

* Fix linting
2018-06-07 06:31:22 -07:00
Sanyam Kapoor f4c8b7f80d Add error handling to Dataflow (#128)
* Add error handling to dataflow

* Fix lint issues

* Update pipeline with error handling on tokenization and info splitting
2018-06-06 21:46:24 -07:00
Sanyam Kapoor 6220907044 New tensor2tensor problem datagen for function summarization (#127)
* New tensor2tensor problem for function summarization

* Consolidate README with improved docs

* Remove old readme

* Add T2T Trainer using Transformer Networks

* Fix missing requirement for t2t-trainer
2018-06-06 00:38:58 -07:00
Sanyam Kapoor 17dd02b803 Add num workers options to Dataflow (#125) 2018-06-05 17:05:56 -07:00
Sanyam Kapoor e26a290f0f Fix utf-8 encoding issues (#122) 2018-06-01 10:35:56 -07:00
Sanyam Kapoor 26ff66d747 Semantic Code Search Example Data Ingestion (#120)
* Code Search Preprocessing Pipeline

* Add missing pipeline execution to git tree

* Move the preprocessing step into its own package

* Add docstrings

* Fix pylint errors
2018-05-31 15:28:56 -07:00