Commit Graph

80 Commits

Author SHA1 Message Date
Jeremy Lewi 46a795693a Minor fixes to the notebook. (#427)
* Need to fix the import and compile commands.

* Check if an experiment with the name already exists.
2019-01-15 08:33:19 -08:00
Jeremy Lewi e15bfffca4 An Argo workflow to use as the E2E test for code_search example. (#446)
* An Argo workflow to use as the E2E test for code_search example.

* The workflow builds the Docker images and then runs the python test
  to train and export a model

* Move common utilities into util.libsonnet.

* Add the workflow to the set of triggered workflows.

* Update the test environment used by the test ksonnet app; we've since
  changed the location of the app.

Related to #295

* Refactor the jsonnet file defining the GCB build workflow

  * Use an external variable to conditionally pull and use a previous
    Docker image as a cache

  * Reduce code duplication by building a shared template for all the different
    workflows.

* BUILD_ID needs to be defined in the default parameters; otherwise we get an error when adding a new environment.

* Define suitable defaults.
2018-12-28 16:12:32 -08:00
Jeremy Lewi 2e6e891a5b Update the ArgoCD app to use the kubeflow/examples repo (#440)
* We were using jlewi's fork because some PRs hadn't yet been merged, but
  all the relevant PRs have since been merged and master is the source of truth.
2018-12-19 21:26:49 -08:00
Jeremy Lewi ba9af34805 Create a script to count lines of code. (#379)
* Create a script to count lines of code.

* This is used in the presentation to get an estimate of where the human effort is involved.

* Fix lint issues.
2018-12-19 09:42:25 -08:00
Jeremy Lewi 9f061a0554 Update the central dashboard UI image to one that includes pipelines. (#430) 2018-12-12 09:34:21 -08:00
Jeremy Lewi 1b643c2b81 Fix the web app. (#432)
* We need to set the parameters for the model and index.

  * It looks like when we split up the web app into its own ksonnet app
    we forgot to set the parameters.

* Since the web app is being deployed in a separate namespace, we need to
  copy the GCP credential to that namespace. Add instructions to the
  demo README.md on how to do that.

* It looks like the pods were never getting started because the secret
  couldn't be mounted.
2018-12-12 09:24:40 -08:00
IronPan 4a7e2c868c fix bq table duplication (#418)
* fix bq table duplication

* fix bq table duplication

* update

* update image

* use index for placeholder
2018-12-10 18:50:28 -08:00
Jeremy Lewi b26f7e9a48 Add pods/logs permission to the jupyter notebook role. (#419)
* This is needed so that fairing can tail the logs.
2018-12-09 15:53:46 -08:00
Jeremy Lewi 67d42c4661 Expose ArgoCD UI behind Ambassador. (#413)
* We need to disable TLS (it's handled by the ingress) because that leads to
  endless redirects.

* ArgoCD is running in namespace argo-cd but Ambassador is running in a
  different namespace and currently only configured with RBAC to monitor
  a single namespace.

* So we add a service in namespace kubeflow just to define the Ambassador mapping.
2018-12-08 12:49:34 -08:00
IronPan 4c970876dc add notebook for code search pipeline (#410) 2018-12-07 10:29:02 -08:00
IronPan 0d2f5b6342 Clean up code search pipeline (#406)
* update pipeline to use out-of-the-box gcp credential support

* Update index_update_pipeline.py
2018-12-07 10:13:12 -08:00
IronPan 206ad8fda4 Add preprocess github data step to code search pipeline (#396)
* refactor ks

* remove unnecessary params

* update ks

* address comments

* add preprocess step

* update images

* update preprocess code

* reformat

* minor fix

* reuse function embedding pipeline to preprocess

* add preprocess

* update pipeline

* propagate failed token table

* format code

* copy vocabulary

* address comments

* address comments

* update

* fix

* fix format

* Update arguments.py
2018-12-05 18:06:06 -08:00
IronPan cea0ffde0d Update the ks parameter (#394)
* refactor ks

* remove unnecessary params

* update ks

* address comments
2018-12-02 22:14:11 -08:00
Jeremy Lewi 78fdc74b56 Dataflow job should support writing embeddings to a different location (Fix #366). (#388)
* Dataflow job should support writing embeddings to a different location (Fix #366).

* Dataflow job to compute code embeddings needs to have parameters controlling
  the location of the outputs independent of the inputs. Prior to this fix the
  same table in the dataset was always written and the files were always created
  in the data dir.

* This made it very difficult to rerun the embeddings job on the latest GitHub
  data (e.g. to regularly update the code embeddings) without overwriting
  the current embeddings.

* Refactor how we create BQ sinks and sources in this pipeline

  * Rather than create a wrapper class that bundles together a sink and schema
    we should have a separate helper class for creating BQ schemas and then
    use WriteToBigQuery directly.

  * Similarly for ReadTransforms we don't need a wrapper class that bundles
    a query and source. We can just create a class/constant to represent
    queries and pass them directly to the appropriate source.

* Change the BQ write disposition to "write if empty" so we don't overwrite existing data (see the sketch below).

* Fix #390 worker setup fails because requirements.dataflow.txt not found

  * Dataflow always uses the local file requirements.txt regardless of the
    local file used as the source.

  * When the job is submitted, it will also try to build an sdist package on
    the client, which invokes setup.py.

  * So in setup.py we always refer to requirements.txt.

  * If trying to install the package in other contexts,
    requirements.dataflow.txt should be renamed to requirements.txt

  * We do this in the Dockerfile.

* Refactor the CreateFunctionEmbeddings code so that writing to BQ
  is not part of the compute function embeddings code;
  (will make it easier to test.)

* * Fix typo in jsonnet with output dir; missing an "=".
2018-12-02 09:51:27 -08:00
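A minimal sketch of the write pattern described in the commit above, assuming Apache Beam's WriteToBigQuery with a write-if-empty disposition and a plain schema helper; the table and field names are illustrative, not the example's actual schema.

```python
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery


def code_embeddings_schema():
  # Plain helper returning a BQ schema string instead of a sink/schema wrapper.
  # Field names are hypothetical placeholders.
  return "nwo:STRING,path:STRING,function_embedding:STRING"


def write_embeddings(embeddings, project, dataset, table):
  """Write embedding rows to an output table chosen independently of the inputs."""
  return embeddings | "WriteEmbeddings" >> WriteToBigQuery(
      table=table,
      dataset=dataset,
      project=project,
      schema=code_embeddings_schema(),
      # Fail rather than append or truncate if the table already has data, so a
      # rerun can never silently overwrite existing embeddings.
      write_disposition=BigQueryDisposition.WRITE_EMPTY)
```

Keeping the schema in a standalone helper, rather than bundled with a sink wrapper, mirrors the refactoring the commit describes and makes the sink easy to swap out in tests.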
IronPan e8cf9c58ce add pipeline step to push to git (#387)
* add push to git

* small fixes

* work around .after()

* format
2018-12-02 09:37:21 -08:00
IronPan 494fc05f16 Add IronPan to code_search owner (#386) 2018-11-30 17:37:57 -08:00
IronPan b807843031 add pipeline environment to code search web app (#372)
* add pipeline

* Update app.yaml
2018-11-30 07:51:00 -08:00
IronPan 3799bac22c Update the update_index.sh (#373)
* add search index creator container

* add pipeline

* update op name

* update readme

* update scripts

* typo fix

* Update Makefile

* Update Makefile

* address comments

* fix ks

* update pipeline

* restructure the images

* remove echo

* update image

* add code embedding launcher

* small fixes

* format

* format

* address comments

* add flag

* Update arguments.py

* update parameter

* revert to use --wait_until_finished. --wait_until_finish never works

* update image

* update git script

* update script

* update readme
2018-11-29 00:53:09 -08:00
IronPan 7ffc50e0ee Add dataflow launcher script (#364)
* add search index creator container

* add pipeline

* update op name

* update readme

* update scripts

* typo fix

* Update Makefile

* Update Makefile

* address comments

* fix ks

* update pipeline

* restructure the images

* remove echo

* update image

* add code embedding launcher

* small fixes

* format

* format

* address comments

* add flag

* Update arguments.py

* update parameter

* revert to use --wait_until_finished. --wait_until_finish never works

* update image
2018-11-27 19:23:54 -08:00
IronPan 760ba7b9e8 Cleanup build directory before code search GCB build (#370)
The build directory cached stale, deleted files; without cleaning up the folder, those stale files were carried over to the new image.
2018-11-27 12:54:57 -08:00
IronPan c0345dec90 Update setup.py to point to the new requirement file (#371) 2018-11-27 12:45:07 -08:00
IronPan 31390d39a0 Add update search index pipeline (#361)
* add search index creator container

* add pipeline

* update op name

* update readme

* update scripts

* typo fix

* Update Makefile

* Update Makefile

* address comments

* fix ks

* update pipeline

* restructure the images

* remove echo

* update image

* format

* format

* address comments
2018-11-27 04:43:55 -08:00
Jeremy Lewi e1e1422da4 Set up ArgoCD to synchronize the code search web app with the demo cluster. (#359)
* Follow argocd instructions
  https://github.com/argoproj/argo-cd/blob/master/docs/getting_started.md
  to install ArgoCD on the cluster

  * Download the argocd manifest and update the namespace to argocd.
  * Check it in so ArgoCD can be deployed declaratively.

* Update README.md with the instructions for deploying ArgoCD.

Move the web app components into their own ksonnet app.

* We do this because we want to be able to sync the web app components using
  Argo CD

* ArgoCD doesn't allow us to apply autosync with granularity less than the
  app. We don't want to sync any of the components except the servers.

* Rename the t2t-code-search-serving component to query-embed-server because
  this is more descriptive.

* Check in a YAML spec defining the ksonnet application for the web UI.

Update the instructions in the notebook code-search.ipynb

  * Provide updated instructions for deploying the web app due to the
    fact that the web app is now a separate component.

  * Improve code-search.ipynb
    * Use gcloud to get sensible defaults for parameters like the project.
    * Provide more information about what the variables mean.
2018-11-26 18:19:19 -08:00
IronPan 7924fa7fd0 parameterize search index job name (#358)
* parameterize search index job name

* change namespace

* Update search-index-creator.jsonnet
2018-11-26 12:03:30 -08:00
Jeremy Lewi 5d6a4e9d71 Create a script to update the index and lookup file used to serve predictions. (#352)
* This script will be the last step in a pipeline to continuously update
  the index for serving.

* The script updates the parameters of the search index server to point
  to the supplied index files. It then commits them and creates a PR
  to push those commits.

* Restructure the parameters for the search index server so that we can use
  ks param set to override the indexFile and lookupFile.

* We do this because we want to be able to push a new index by running
  ks param set in a continuously running pipeline (see the sketch below).

* Remove default parameters from search-index-server.

* Create a dockerfile suitable for running this script.
2018-11-26 06:35:27 -08:00
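A rough sketch, not the actual script, of the update flow the commit above describes: override the search-index-server parameters with `ks param set`, then commit the change so a PR can be opened. The function and argument names are assumptions.

```python
import subprocess


def update_index_params(ks_app_dir, index_file, lookup_file):
  """Point search-index-server at new index/lookup files and commit the change."""
  for param, value in [("indexFile", index_file), ("lookupFile", lookup_file)]:
    subprocess.check_call(
        ["ks", "param", "set", "search-index-server", param, value],
        cwd=ks_app_dir)
  # Commit the parameter change; the real script goes on to open a PR.
  subprocess.check_call(["git", "add", "-A"], cwd=ks_app_dir)
  subprocess.check_call(
      ["git", "commit", "-m", "Update the search index and lookup file"],
      cwd=ks_app_dir)
```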
IronPan 4f95e85e63 add pipeline component (#356)
* add pipeline component

* update pipeline component
2018-11-26 06:21:07 -08:00
Jeremy Lewi a32227f371 Fix the ksonnet by defining globals. (#354)
* The latest changes to the ksonnet components require certain values
  to be defined as defaults.

* This is part of the move away from using a fake component to define
  parameters that should be reused across different modules.

  see #308

* Verify we can run ks show on a new environment and can evaluate the ksonnet.

Fix #353
2018-11-24 14:36:43 -08:00
Jeremy Lewi de17011066 Upgrade and fix the serving components. (#348)
* Upgrade and fix the serving components.

* Install a new version of the TFServing package so we can use the new template.

* Fix the UI image. Use the same requirements file as for Dataflow so we are
  consistent w.r.t. the version of TF and Tensor2Tensor.

* remove nms.libsonnet; move all the manifests into the actual component
  files rather than using a shared library.

* Fix the name of the TFServing service and deployment; need to use the same
  name as used by the front end server.

* Change the port of TFServing; we are now using the built in http server
  in TFServing which uses port 8500 as opposed to our custom http proxy.

* We encountered an error importing nmslib; moving it to the top of the file
  appears to fix this.

* Fix lint.
2018-11-24 13:22:34 -08:00
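For illustration only, a hedged sketch of what calling TF Serving's built-in HTTP server on port 8500 (as configured above) might look like from the front-end server; the service name, model name, and request body shape are assumptions, not the example's actual API.

```python
import requests

# Hypothetical service and model names; the real front end reads these from config.
SERVING_URL = "http://query-embed-server:8500/v1/models/code_search:predict"


def predict(instances):
  """POST instances to TF Serving's REST predict endpoint and return predictions."""
  resp = requests.post(SERVING_URL, json={"instances": instances}, timeout=10)
  resp.raise_for_status()
  return resp.json()["predictions"]
```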
Jeremy Lewi d2b68f15d7 Fix the K8s job to create the nmslib index. (#338)
* Install nmslib in the Dataflow container so its suitable for running
  the index creation job.

* Use command not args in the job specs.

* Dockerfile.dataflow should install nmslib so that we can use that Docker
  image to create the index.

* build.jsonnet should tag images as latest. We will use this to use
  the latest images as a layer cache to speed up builds.

* Set logging level to info for start_search_server.py and
  create_search_index.py

* The create-search-index pod kept getting evicted because the node ran out of
  memory.

* Add a new node pool consisting of n1-standard-32 nodes to the demo cluster.
  These have 120 GB of RAM compared to 30 GB in our default pool of n1-standard-8 nodes.

* Set requests and limits on the creator search index pod.

* Move all the config for the search-index-creator job into the
  search-index-creator.jsonnet file. We need to customize the memory resources
  so there's not much value to try to sharing config with other components.
2018-11-20 12:53:09 -08:00
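A hedged sketch of the memory-hungry step that commit provisions for: building an nmslib index over the embedding vectors and saving it to disk. The method and space parameters are assumptions, not necessarily what create_search_index.py uses.

```python
import logging

import nmslib
import numpy as np


def build_index(embeddings, index_path):
  """Build an HNSW index over embedding vectors and save it to index_path."""
  logging.info("Indexing %d embeddings", len(embeddings))
  index = nmslib.init(method="hnsw", space="cosinesimil")
  index.addDataPointBatch(np.asarray(embeddings, dtype=np.float32))
  # Index construction holds all vectors in memory, which is why the pod
  # needs the generous memory requests/limits mentioned above.
  index.createIndex({"post": 2}, print_progress=True)
  index.saveIndex(index_path)
  return index
```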
Yang Pan 60a7413cc5 Remove ksonnet registry from dockerignore file (#333)
In order to build a pipeline that can run ksonnet commands, the ksonnet registry needs to be containerized.
Remove it from .dockerignore to unblock the work.
2018-11-14 13:45:15 -08:00
Jeremy Lewi 26c400a4cd Create a component to submit the Dataflow job to compute embeddings for code search (#324)
* Create a component to submit the Dataflow job to compute embeddings for code search.

* Update Beam to 2.8.0
* Remove nmslib from the Apache Beam requirements.txt; it's not needed and appears
  to have problems installing on the Dataflow workers.

* Spacy download was failing on Dataflow workers; reinstalling the spacy
  package as a pip package appears to fix this.

* Fix some bugs in the workflow for building the Docker images.

* * Split requirements.txt into separate requirements for the Dataflow
  workers and the UI.

* We don't want to install unnecessary dependencies in the Dataflow workers.
  Some unnecessary dependencies; e.g. nmslib were also having problems
  being installed in the workers.
2018-11-14 13:45:09 -08:00
Yang Pan 6c976342a3 exit if t2t job failed (#327) 2018-11-11 21:35:44 -08:00
Yang Pan ee74868bec Fix build-dataflow makefile rule (#325) 2018-11-11 21:26:35 -08:00
Jeremy Lewi 2487194fbd Modify K8s models to export the models; tensorboard manifests (#320)
* Modify K8s models to export the models; tensorboard manifests

* Use a K8s job not a TFJob to export the model.
* Start an experiments.libsonnet file to define groups of parameters for
  different experiments that should be reused

* Need to install tensorflow_hub in the Docker image because it is
  required by t2t exporter.

* * Address review comments.
2018-11-11 19:09:42 -08:00
Yang Pan c6ff5dbef8 Change dataflow default workdir to /src (#330)
Otherwise, when I want to execute the Dataflow code
```
python2 -m code_search.dataflow.cli.create_function_embeddings \
```
it complains that there is no setup.py.

I could work around this by using the workingDir container API, but setting the default working directory is more convenient.
2018-11-11 15:37:59 -08:00
Jeremy Lewi 65e89a599b code search example make distributed training work; Create some components to train models (#317)
* Make distributed training work; Create some components to train models

* Check in a ksonnet component to train a model using the tinyparam
  hyperparameter set.

* We want to check in the ksonnet component to facilitate reproducibility.
  We need a better way to separate the particular experiments used for
  the CS search demo effort from the jobs we want customers to try.

   Related to #239 train a high quality model.

* Check in the cs_demo ks environment; this was being ignored as a result of
  .gitignore

Make distributed training work #208

* We got distributed synchronous training to work with Tensor2Tensor 1.10
* This required creating a simple python script to start the TF standard
  server and run it as a sidecar of the chief pod and as the main container
  for the workers/ps.

* Rename the model to kf_similarity_transformer to be consistent with other
  code.
  * We don't want to use the default name because we don't want to inadvertently
  use the SimilarityTransformer model defined in the Tensor2Tensor project.

* replace build.sh by a Makefile. Makes it easier to add variant commands
  * Use the GitHash not a random id as the tag.
  * Add a label to the docker image to indicate the git version.

* Put the Makefile at the top of the code_search tree; makes it easier
  to pull all the different sources for the Docker images.

* Add an option to build the Docker images with GCB; this is more efficient
  when you are on a poor network connection because you don't have to download
  images locally.
    * Use jsonnet to define and parameterize the GCB workflow.

* Build separate docker images for running Dataflow and for running the trainer.
  This helps avoid versioning conflicts caused by different versions of protobuf
  pulled in by the TF version used as the base image vs. the version used
  with apache beam.

      Fix #310 - Training fails with GPUs.

* Changes to support distributed training.
* Simplify t2t-entrypoint.sh so that all we do is parse TF_CONFIG
  and pass requisite config information as command line arguments;
  everything else can be set in the K8s spec.

* Upgrade to T2T 1.10.

* * Add ksonnet prototypes for tensorboard.
2018-11-08 16:13:01 -08:00
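A rough sketch, using the TF 1.x API, of the sidecar idea described above: parse the TF_CONFIG environment variable injected by TFJob and start a standard TensorFlow server. The real t2t-entrypoint.sh also translates this config into t2t-trainer command-line arguments, which is omitted here.

```python
import json
import os

import tensorflow as tf


def run_std_server():
  """Start a standard TF server for the ps/worker task described in TF_CONFIG."""
  tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
  cluster = tf.train.ClusterSpec(tf_config.get("cluster", {}))
  task = tf_config.get("task", {})
  server = tf.train.Server(
      cluster, job_name=task.get("type"), task_index=task.get("index", 0))
  # Block forever; the chief runs this as a sidecar, workers/ps as the main container.
  server.join()


if __name__ == "__main__":
  run_std_server()
```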
Jeremy Lewi d01b76b6f9 Update ksonnet for datagen (#309)
* Update the datagen component.

* We should use a K8s job rather than a TFJob. We can also simplify the
  ksonnet by just putting the spec into the jsonnet file rather than trying
  to share various bits of the spec with the TFJob for training.

Related to kubeflow/examples#308 use globals to allow parameters to be shared
across components (e.g. working directory.)

* Update the README with information about data.

* Fix table markdown.
2018-11-07 14:28:16 -08:00
Yang Pan 11879e2ff1 wait on create function embedding (#311) 2018-11-06 14:37:11 -08:00
Jeremy Lewi df278567f0 Fix performance of dataflow preprocessing job. (#302)
* Fix performance of dataflow preprocessing job.

* Fix #300; Dataflow job for preprocessing is really slow.

  * The problem is we are loading the spacy tokenization model on every
    invocation of the tokenization function and this is really expensive.
  * We should be doing this once per module import.

* After fixing this issue; the job completed in approximately 20 minutes using
  5 workers.

  * We can process all 1.3 million records in ~20 minutes (elapsed time) using five 32-CPU workers and about 1 hour of CPU time altogether.

* Add options to the Dataflow job to read from files as opposed to BigQuery
  and to skip BigQuery writes. This is useful for testing.

* Add a "unittest" that verifies the Dataflow preprocessing job can run
  successfully using the DirectRunner.

* Update the Docker image and a ksonnet component for a K8s job that
  can be used to submit the Dataflow job.

* Fix #299; Add logging to the Dataflow preprocessing job to indicate that
  a Dataflow job was submitted.

* Add an option to the preprocessing Dataflow job to read an entire
  BigQuery table as the input rather than running a query to get the input.
  This is useful in the case where the user wants to run a different
  query to select the repo paths and contents to process and write them
  to some table to be processed by the Dataflow job.

* Fix lint.

* More lint fixes.
2018-11-06 14:14:28 -08:00
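A sketch of the fix described above, assuming a spacy tokenizer: the model is loaded once at module import (i.e. once per worker process) instead of on every call. The model name and tokenize logic are illustrative.

```python
import spacy

# Loaded once per module import, not on every invocation of tokenize();
# reloading the model per call was the source of the slowdown.
_NLP = spacy.load("en_core_web_sm")


def tokenize(text):
  """Return lower-cased tokens for a docstring or code snippet."""
  return [token.text.lower() for token in _NLP(text) if not token.is_space]
```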
Yang Pan aa0061dae2 update instruction with proper namespace (#307) 2018-11-05 20:47:46 -08:00
Yang Pan 1f82dc41cd [code search] add flag to wait till code search job finish (#306)
* add flag to wait till job finish

* wait till -> wait until
2018-11-05 19:04:20 -08:00
Jeremy Lewi f87dfd8e53 Create a demo cluster for the code search example. (#298) 2018-11-05 06:07:52 -08:00
Jeremy Lewi acd8007717 Use conditionals and add test for code search (#291)
* Fix model export, loss function, and add some manual tests.

Fix Model export to support computing code embeddings: Fix #260

* The previous exported model was always using the embeddings trained for
  the search query.

* But we need to be able to compute embedding vectors for both the query
  and code.

* To support this we add a new input feature "embed_code" and conditional
  ops. The exported model uses the value of the embed_code feature to determine
  whether to treat the inputs as a query string or code and computes
  the embeddings appropriately.

* Originally based on #233 by @activatedgeek

Loss function improvements

* See #259 for a long discussion about different loss functions.

* @activatedgeek was experimenting with different loss functions in #233
  and this pulls in some of those changes.

Add manual tests

* Related to #258

* We add a smoke test for T2T steps so we can catch bugs in the code.
* We also add a smoke test for serving the model with TFServing.
* We add a sanity check to ensure we get different values for the same
  input based on which embeddings we are computing.

Change Problem/Model name

* Register the problem github_function_docstring with a different name
  to distinguish it from the version inside the Tensor2Tensor library.

* * Skip the test when running under prow because it's a manual test.
* Fix some lint errors.

* * Fix lint and skip tests.

* Fix lint.

* * Fix lint
* Revert loss function changes; we can do that in a follow on PR.

* * Run generate_data as part of the test rather than reusing a cached
  vocab and processed input file.

* Modify SimilarityTransformer so we can overwrite the number of shards
  used easily to facilitate testing.

* Comment out py-test for now.
2018-11-02 09:52:11 -07:00
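A hedged illustration (TF 1.x graph ops) of the conditional described in that commit: the exported model reads the embed_code input feature and routes inputs through either the query encoder or the code encoder. The encoder functions here are hypothetical stand-ins for the model's real ones.

```python
import tensorflow as tf


def conditional_encode(features, encode_query_fn, encode_code_fn):
  """Pick the code or query embedding path based on the embed_code feature."""
  embed_code = tf.reshape(tf.cast(features["embed_code"], tf.int32), [])
  return tf.cond(
      tf.equal(embed_code, 1),
      lambda: encode_code_fn(features["input"]),
      lambda: encode_query_fn(features["input"]))
```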
Jeremy Lewi adf614fc5f Add tensorboard and check in vendor for the code search example. (#255)
* Add tensorboard and check in vendor for the code search example.

* * Remove the default env; when I ran ks show I got errors but
  removing it and adding a fresh env worked. It also won't point to
  the correct cluster for users.
2018-10-04 10:18:58 -07:00
Sanyam Kapoor f9873e6ac4 Upgrade notebook commands and other relevant changes (#229)
* Replace double quotes for field values (ks convention)

* Recreate the ksonnet application from scratch

* Fix pip commands to find requirements and redo installation, fix ks param set

* Use sed replace instead of ks param set.

* Add cells to first show JobSpec and then apply

* Upgrade T2T, fix conflicting problem types

* Update docker images

* Reduce to 200k samples for vocab

* Use Jupyter notebook service account

* Add illustrative gsutil commands to show output files, specify index files glob explicitly

* List files after index creation step

* Use the model in current repository and not upstream t2t

* Update Docker images

* Expose TF Serving Rest API at 9001

* Spawn terminal from the notebooks ui, no need to go to lab
2018-08-20 16:35:07 -07:00
Sanyam Kapoor 4e015e76a3 Cherry pick changes to PredictionDoFn (#226)
* Cherry pick changes to PredictionDoFn

* Disable lint checks for cherry picked file

* Update TODO and notebook install instructions

* Restore CUSTOM_COMMANDS todo
2018-08-15 06:21:00 -07:00
Sanyam Kapoor 18829159b0 Add a new github function docstring extended problem (#225)
* Add a new github function docstring extended problem

* Fix lint errors

* Update images
2018-08-14 15:41:47 -07:00
Sanyam Kapoor 8fce4a7799 Allow ks param set for Code Search Ksonnet Application (#224)
* Allow ks param set for t2t-code-search

* Update notebook with working directory param set

* Abstract out common variables for easy ks param set
2018-08-14 15:29:04 -07:00
Sanyam Kapoor a687c51036 Add a Jupyter notebook to be used for Kubeflow codelabs (#217)
* Add a Jupyter notebook to be used for Kubeflow codelabs

* Add help command for create_function_embeddings module

* Update README to point to Jupyter Notebook

* Add prerequisites to readme

* Update README and getting started with notebook guide

* [wip]

* Update notebook with BigQuery previews

* Update notebook to automatically select the latest MODEL_VERSION
2018-08-13 21:43:26 -07:00
Sanyam Kapoor 6e9150bad6 Parametrize volumes and ports for nmslib containers 2018-08-09 10:53:23 -07:00