* Fix model export, loss function, and add some manual tests.
Fix model export to support computing code embeddings: Fixes #260
* The previously exported model always used the embeddings trained for
the search query.
* But we need to be able to compute embedding vectors for both the query
and code.
* To support this, we add a new input feature "embed_code" and conditional
ops. The exported model uses the value of the embed_code feature to decide
whether to treat the inputs as a query string or as code, and computes the
embeddings accordingly (a minimal sketch follows this list).
* Originally based on #233 by @activatedgeek
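A minimal sketch of the conditional, assuming hypothetical encode_query/encode_code helpers that return the embedding for each branch (only the embed_code feature name comes from this change):

```python
import tensorflow as tf

def compute_embedding(features, encode_query, encode_code):
  """Pick the query or the code encoder based on the embed_code feature.

  encode_query / encode_code are hypothetical callables returning the
  embedding tensor for their branch; features is the serving input dict.
  """
  # Assume a single scalar flag per request: 1 means "treat inputs as code".
  embed_code = tf.cast(tf.reshape(features["embed_code"], [-1])[0], tf.bool)
  return tf.cond(embed_code,
                 lambda: encode_code(features["inputs"]),
                 lambda: encode_query(features["inputs"]))
```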
Loss function improvements
* See #259 for a long discussion about different loss functions.
* @activatedgeek was experimenting with different loss functions in #233
and this pulls in some of those changes.
Add manual tests
* Related to #258
* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TF Serving.
* We add a sanity check to ensure we get different values for the same
input depending on which embeddings we are computing (sketched below).
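The sanity check boils down to something like the following, where predict_fn is an illustrative wrapper around the exported model that returns the embedding vector:

```python
import numpy as np

def test_query_and_code_embeddings_differ(predict_fn):
  """The same input must map to different vectors for query vs. code embeddings."""
  text = "def add(a, b): return a + b"
  query_vec = np.asarray(predict_fn(text, embed_code=0))
  code_vec = np.asarray(predict_fn(text, embed_code=1))
  assert not np.allclose(query_vec, code_vec)
```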
Change Problem/Model name
* Register the problem github_function_docstring with a different name
to distinguish it from the version inside the Tensor2Tensor library.
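In Tensor2Tensor the registered name is derived from the class name (or passed to the decorator), so registering under a distinct name is roughly the following; the class and base shown here are illustrative:

```python
from tensor2tensor.data_generators import text_problems
from tensor2tensor.utils import registry

@registry.register_problem
class GithubFunctionDocstringLocal(text_problems.Text2TextProblem):
  """Illustrative subclass: the snake_cased class name becomes the problem
  name, so it no longer collides with the github_function_docstring problem
  that ships inside the Tensor2Tensor library."""
```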
* Skip the test when running under Prow because it's a manual test.
* Fix some lint errors.
* Fix lint and skip tests.
* Fix lint.
* Fix lint
* Revert loss function changes; we can do that in a follow on PR.
* Run generate_data as part of the test rather than reusing a cached
vocab and processed input file.
* Modify SimilarityTransformer so we can overwrite the number of shards
used easily to facilitate testing.
* Comment out py-test for now.
* Add tensorboard and check in vendor for the code search example.
* Remove the default env; when I ran ks show I got errors, but
removing it and adding a fresh env worked. The default env also won't point to
the correct cluster for users.
* Replace double quotes for field values (ks convention)
* Recreate the ksonnet application from scratch
* Fix pip commands to find requirements and redo installation, fix ks param set
* Use sed replace instead of ks param set.
* Add cells to first show JobSpec and then apply
* Upgrade T2T, fix conflicting problem types
* Update docker images
* Reduce to 200k samples for vocab
* Use Jupyter notebook service account
* Add illustrative gsutil commands to show output files, specify index files glob explicitly
* List files after index creation step
* Use the model in current repository and not upstream t2t
* Update Docker images
* Expose TF Serving REST API at 9001
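With the REST port open, predictions can be fetched over plain HTTP; a rough sketch, where the host, model name, and request body are all assumptions (the body depends on the exported serving signature):

```python
import requests

# TF Serving's REST predict endpoint has the form /v1/models/<model_name>:predict.
url = "http://tf-serving:9001/v1/models/t2t_code_search:predict"
payload = {"instances": [{"inputs": [12, 5, 8, 1]}]}  # placeholder token ids
resp = requests.post(url, json=payload)
resp.raise_for_status()
print(resp.json()["predictions"])
```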
* Spawn terminal from the notebooks ui, no need to go to lab
* Cherry pick changes to PredictionDoFn
* Disable lint checks for cherry picked file
* Update TODO and notebook install instructions
* Restore CUSTOM_COMMANDS todo
* Add a Jupyter notebook to be used for Kubeflow codelabs
* Add help command for create_function_embeddings module
* Update README to point to Jupyter Notebook
* Add prerequisites to readme
* Update README and getting started with notebook guide
* [WIP]
* Update notebook with BigQuery previews
* Update notebook to automatically select the latest MODEL_VERSION
* Use the nicer tf.gfile interface for search index creation
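tf.gfile gives one interface for local paths and gs:// URIs, which keeps the index-creation code agnostic to where the embedding files live; a small sketch with illustrative paths:

```python
import csv
import tensorflow as tf

# tf.gfile.Open handles local files and GCS objects alike.
with tf.gfile.Open("gs://my-bucket/embeddings/part-00000.csv") as csv_file:
  rows = list(csv.reader(csv_file))

with tf.gfile.Open("gs://my-bucket/index/files.txt", "w") as out_file:
  out_file.write("\n".join(row[0] for row in rows))
```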
* Update documentation and more maintainable interface to search server
* Add ability to control number of outputs
* Serve React UI from the Flask server
* Update Dockerfile for the unified server and ui
* Upgrade TFJob and Ksonnet app
* Container name should be tensorflow. See #563.
* Working single node training and serving on Kubeflow
* Add issue link for fixme
* Remove redundant create secrets and use Kubeflow provided secrets
* Update T2T problems to work around memory limitations
* Add max_samples_for_vocab to prevent memory overflow
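max_samples_for_vocab is a Text2TextProblem property in Tensor2Tensor that caps how many samples the vocabulary generator consumes; combined with the 200k figure mentioned earlier, the override is roughly:

```python
from tensor2tensor.data_generators import text_problems

class GithubFunctionDocstring(text_problems.Text2TextProblem):  # abridged, name illustrative
  @property
  def max_samples_for_vocab(self):
    # Stop feeding the vocab generator after 200k samples to bound memory use.
    return int(2e5)
```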
* Fix the base URL to download data from; find a sweet spot for max samples
* Convert class variables to class properties
* Fix lint errors
* Use Python2/3 compatible code for StringIO
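A minimal illustration of the compatible import; whether this repo reaches for io or six here is not shown by the commit:

```python
from io import StringIO  # available on Python 2 and 3; accepts only unicode on Python 2

buf = StringIO()
buf.write(u"def add(a, b):\n    return a + b\n")
contents = buf.getvalue()
```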
* Fix lint errors
* Fix source data files format
* Move to Text2TextProblem instead of TranslateProblem
* Update details for num_shards and T2T problem dataset
* Update to a new dataflow package
* [WIP] updating docstrings, fixing redundancies
* Limit the scope of Github Transform pipeline, make everything unicode
* Add ability to start github pipelines from transformed bigquery dataset
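Starting from the transformed dataset turns the pipeline's source into a BigQuery read rather than the raw GitHub extraction; a hedged sketch with an illustrative query and table:

```python
import apache_beam as beam

def read_transformed_rows(pipeline):
  """Read already-transformed function/docstring rows back out of BigQuery."""
  query = ("SELECT nwo, path, function_tokens, docstring_tokens "
           "FROM [my_project:my_dataset.transformed_functions]")  # illustrative
  return pipeline | "ReadTransformed" >> beam.io.Read(beam.io.BigQuerySource(query=query))
```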
* Upgrade batch prediction pipeline to be modular
* Fix lint errors
* Add write disposition to BigQuery transform
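In the Beam Python SDK the disposition is an argument to WriteToBigQuery; a sketch with an illustrative table and schema:

```python
import apache_beam as beam

def write_rows(rows):
  """rows: a PCollection of dicts matching the schema below (all names illustrative)."""
  return rows | "WriteToBQ" >> beam.io.WriteToBigQuery(
      table="my_project:my_dataset.function_embeddings",
      schema="nwo:STRING,path:STRING,function_embedding:STRING",
      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
      write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE)
```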
* Update documentation format
* Nicer names for modules
* Add unicode encoding to parsed function docstring tuples
* Use Apache Beam options parser to expose all CLI arguments
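Beam lets a pipeline declare its own flags by subclassing PipelineOptions, so every custom argument shows up next to the standard --runner/--project ones; flag names below are illustrative:

```python
from apache_beam.options.pipeline_options import PipelineOptions

class PreprocessOptions(PipelineOptions):
  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_argument("--input_table", help="BigQuery table to read from.")
    parser.add_argument("--output_dir", help="Where to write processed files.")

options = PreprocessOptions(["--input_table=my_project:my_dataset.table",
                             "--output_dir=gs://my-bucket/out"])
```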
* Refactor the dataflow package
* Create placeholder for new prediction pipeline
* [WIP] add dofn for encoding
* Merge all modules under single package
* Pipeline data flow complete, wip prediction values
* Fallback to custom commands for extra dependency
* Working Dataflow runner installs, separate docker-related folder
* [WIP] Update local user journey in README: fully working commands, easy container translation
* Working Batch Predictions.
* Remove docstring embeddings
* Complete batch prediction pipeline
* Update Dockerfiles and T2T Ksonnet components
* Fix linting
* Downgrade runtime to Python 2; WIP memory issues, so use less data
* Pin master to index 0.
* Working batch prediction pipeline
* Modular Github Batch Prediction Pipeline, stores back to BigQuery
* Fix lint errors
* Fix module-wide imports, pin batch-prediction version
* Fix relative import, update docstrings
* Add references to issue and current workaround for Batch Prediction dependency.
* Add auto-downloads for the data
* Make top() a no-op, working export
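In Tensor2Tensor, T2TModel.top normally projects the body output to logits; making it a pass-through keeps the raw embedding in the exported model. A rough sketch of the idea, with the base class and everything else abridged:

```python
from tensor2tensor.utils import t2t_model

class SimilarityTransformer(t2t_model.T2TModel):  # heavily abridged
  def top(self, body_output, features):
    # No projection to logits: the export should emit the embedding from body() as-is.
    return body_output
```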
* Fix lint errors
* Integrate NMSLib server with TF Serving
* Clarify data URLs purpose
* Initialize search UI. Needs connection to search service
* Fix page title
* Add component for code search results, dummy values for now
* Fix title and manifest
* Add mock loading UI. Need to fill in real API results
* Wrap application into Dockerfile
* Add similarity transformer body
* Update pipeline to Write a single CSV file
* Fix lint errors
* Use CSV writer to handle formatting rows
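Delegating formatting to the csv module avoids hand-rolled quoting and escaping; a sketch of turning one record into a CSV line (field names illustrative, Python 3 shown for brevity):

```python
import csv
import io

def to_csv_row(element):
  """Serialize one record dict to a single CSV-formatted line."""
  buf = io.StringIO()
  csv.writer(buf).writerow([element["nwo"], element["path"], element["function_embedding"]])
  return buf.getvalue().rstrip("\r\n")
```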
* Use direct transformer encoding methods with variable scopes
* Complete end-to-end training with new model and problem
* Read from multiple CSV files
* Add new TF-Serving component with sample task
* Unify nmslib and t2t packages, need to be cohesive
* [WIP] update references to the package
* Replace old T2T problem
* Add representative code for encoding/decoding from tf serving service
* Add REST API port to TF Serving (replaces custom http proxy)
* Fix linting
* Add NMSLib creator and server components
* Add docs to CLI module
* Add jobs derived from t2t component, GCP credentials assumed
* Add script to create IAM role bindings for Docker container to use
* Fix names to use hyphens
* Add t2t-exporter wrapper
* Fix typos
* A temporary workaround for tensorflow/tensor2tensor#879
* Complete working pipeline of datagen, trainer and exporter
* Add docstring to create_secrets.sh
* Add a utility python package for indexing and serving the index
* Add CLI arguments, conditional GCS download
* Complete skeleton CLIs for serving and index creation
* Fix lint issues
* [WIP] initialize ksonnet app
* Push images to GCR
* Upgrade Docker container to run T2T entrypoint with appropriate env vars
* Add a tf-job based t2t-job
* Fix GPU parameters
* New tensor2tensor problem for function summarization
* Consolidate README with improved docs
* Remove old readme
* Add T2T Trainer using Transformer Networks
* Fix missing requirement for t2t-trainer
* Code Search Preprocessing Pipeline
* Add missing pipeline execution to git tree
* Move the preprocessing step into its own package
* Add docstrings
* Fix pylint errors