14 Commits

Author SHA1 Message Date
Jeremy Lewi 2e6e891a5b Update the ArgoCD app to use the kubeflow/examples repo (#440)
* We were using jlewi's fork because PRs hadn't been committed but
  all the relevant PRs have been merged and master is the source of truth.
2018-12-19 21:26:49 -08:00
Jeremy Lewi ba9af34805 Create a script to count lines of code. (#379)
* Create a script to count lines of code.

* This is used in the presentation to get an estimate of where the human effort is involved.

* Fix lint issues.
2018-12-19 09:42:25 -08:00
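To illustrate the kind of line-counting script described in #379, here is a minimal sketch. It is not the script from the commit: the function name `count_lines_by_extension` and the choice to group totals by file extension are assumptions made for illustration.

```python
import collections
import os


def count_lines_by_extension(root):
    """Walk `root` and return {extension: total line count} for readable text files."""
    counts = collections.Counter()
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata so it does not inflate the totals.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            ext = os.path.splitext(name)[1] or "(none)"
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8") as handle:
                    counts[ext] += sum(1 for _ in handle)
            except (UnicodeDecodeError, OSError):
                # Binary or unreadable files are ignored in this sketch.
                continue
    return dict(counts)
```

Calling `count_lines_by_extension(".")` from the repository root gives a rough per-language breakdown, which is one way to estimate where the human effort went.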
Jeremy Lewi 9f061a0554 Update the central dashboard UI image to one that includes pipelines. (#430) 2018-12-12 09:34:21 -08:00
Jeremy Lewi 1b643c2b81 Fix the web app. (#432)
* We need to set the parameters for the model and index.

  * It looks like when we split up the web app into its own ksonnet app
    we forgot to set the parameters.

* Since the web app is being deployed in a separate namespace, we need to
  copy the GCP credential to that namespace. Add instructions to the
  demo README.md on how to do that.

* It looks like the pods were never getting started because the secret
  couldn't be mounted.
2018-12-12 09:24:40 -08:00
Jeremy Lewi b26f7e9a48 Add pods/logs permission to the jupyter notebook role. (#419)
* This is needed so that fairing can tail the logs.
2018-12-09 15:53:46 -08:00
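A minimal sketch of the permission added in #419, written as a Python dict rather than the ksonnet used in the repo. The role name, namespace and verb list are assumptions. Note that Kubernetes names the log subresource `pods/log`.

```python
import json


def notebook_role(name="jupyter-notebook-role", namespace="kubeflow"):
    """Return a Role manifest (as a dict) that can read pods and their logs.

    The default name and namespace are hypothetical placeholders.
    """
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [
            {
                "apiGroups": [""],  # "" is the core API group
                # The log subresource is what a client needs to tail pod logs.
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],  # assumed verb set
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(notebook_role(), indent=2))
```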
Jeremy Lewi 67d42c4661 Expose ArgoCD UI behind Ambassador. (#413)
* We need to disable TLS (it's handled by ingress) because leaving it enabled leads to
  endless redirects.

* ArgoCD is running in namespace argo-cd but Ambassador is running in a
  different namespace and currently only configured with RBAC to monitor
  a single namespace.

* So we add a service in namespace kubeflow just to define the Ambassador mapping.
2018-12-08 12:49:34 -08:00
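The approach in #413 can be sketched as a Service in the namespace Ambassador watches, carrying a Mapping annotation that points at the real service in another namespace. This is a sketch under assumptions, not the manifest from the commit: the service name `argocd-server`, the port, the prefix and the Mapping name are hypothetical, and the annotation format assumes the `ambassador/v0` annotation style.

```python
def ambassador_mapping_service(name, prefix, target_service, target_namespace,
                               port, namespace="kubeflow"):
    """Return a Service manifest whose only job is to carry an Ambassador Mapping."""
    mapping = "\n".join([
        "---",
        "apiVersion: ambassador/v0",
        "kind: Mapping",
        "name: {}-mapping".format(name),
        "prefix: {}".format(prefix),
        # Cross-namespace target, addressed as <service>.<namespace>:<port>.
        "service: {}.{}:{}".format(target_service, target_namespace, port),
    ])
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"getambassador.io/config": mapping},
        },
        # No selector: the Service exists only so Ambassador sees the annotation.
        "spec": {"type": "ClusterIP", "ports": [{"port": port}]},
    }
```

For example, `ambassador_mapping_service("argocd-ui", "/argocd/", "argocd-server", "argo-cd", 80)` would route `/argocd/` to a service in the `argo-cd` namespace without widening Ambassador's RBAC.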
Jeremy Lewi e1e1422da4 Setup ArgoCD to synchronize the code search web app with the demo cluster. (#359)
* Follow argocd instructions
  https://github.com/argoproj/argo-cd/blob/master/docs/getting_started.md
  to install ArgoCD on the cluster

  * Download the argocd manifest and update the namespace to argocd.
  * Check it in so ArgoCD can be deployed declaratively.

* Update README.md with the instructions for deploying ArgoCD.

Move the web app components into their own ksonnet app.

* We do this because we want to be able to sync the web app components using
  Argo CD

* ArgoCD doesn't allow us to apply autosync with granularity less than the
  app. We don't want to sync any of the components except the servers.

* Rename the t2t-code-search-serving component to query-embed-server because
  this is more descriptive.

* Check in a YAML spec defining the ksonnet application for the web UI.

Update the instructions in notebook code-search.ipynb

  * Provided updated instructions for deploying the web app due to the
  fact that the web app is now a separate component.

  * Improve code-search.ipynb
    * Use gcloud to get sensible defaults for parameters like the project.
    * Provide more information about what the variables mean.
2018-11-26 18:19:19 -08:00
IronPan 4f95e85e63 add pipeline component (#356)
* add pipeline component

* update pipeline component
2018-11-26 06:21:07 -08:00
Jeremy Lewi de17011066 Upgrade and fix the serving components. (#348)
* Upgrade and fix the serving components.

* Install a new version of the TFServing package so we can use the new template.

* Fix the UI image. Use the same requirements file as for Dataflow so we are
consistent w.r.t the version of TF and Tensor2Tensor.

* remove nms.libsonnet; move all the manifests into the actual component
  files rather than using a shared library.

* Fix the name of the TFServing service and deployment; need to use the same
  name as used by the front end server.

* Change the port of TFServing; we are now using the built in http server
  in TFServing which uses port 8500 as opposed to our custom http proxy.

* We encountered an error importing nmslib; moving it to the top of the file
  appears to fix this.

* Fix lint.
2018-11-24 13:22:34 -08:00
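As a sketch of what talking to TFServing's built-in HTTP server looks like after #348, the helper below builds a REST predict request against port 8500 as stated in the commit. The host name, model name and input payload are hypothetical, and the request is only constructed, not sent.

```python
import json
import urllib.request


def build_predict_request(host, model_name, instances, port=8500):
    """Build (but do not send) a TF Serving REST predict request."""
    url = "http://{}:{}/v1/models/{}:predict".format(host, port, model_name)
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

A caller such as the front end server could pass the result to `urllib.request.urlopen`. The service name it uses has to match the TFServing service name, which is the mismatch this commit fixes.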
Jeremy Lewi d2b68f15d7 Fix the K8s job to create the nmslib index. (#338)
* Install nmslib in the Dataflow container so it's suitable for running
  the index creation job.

* Use command not args in the job specs.

* Dockerfile.dataflow should install nmslib so that we can use that Docker
  image to create the index.

* build.jsonnet should tag images as latest. We will use this to use
  the latest images as a layer cache to speed up builds.

* Set logging level to info for start_search_server.py and
  create_search_index.py

* The create search index pod kept getting evicted because the node ran out of
  memory

* Add a new node pool consisting of n1-standard-32 nodes to the demo cluster.
  These have 120 GB of RAM, compared to 30 GB in our default pool of n1-standard-8 nodes.

* Set requests and limits on the creator search index pod.

* Move all the config for the search-index-creator job into the
  search-index-creator.jsonnet file. We need to customize the memory resources
  so there's not much value in trying to share config with other components.
2018-11-20 12:53:09 -08:00
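Two of the changes in #338, using `command` instead of `args` and setting memory requests and limits, can be sketched as a Job manifest built in Python. The repo defines this in jsonnet; the container name, image and memory values below are hypothetical.

```python
def index_creator_job(image, command, memory_request="40Gi", memory_limit="100Gi"):
    """Return a batch/v1 Job manifest (as a dict) for an index-creation job.

    The memory defaults are placeholders, not the values from the commit.
    """
    container = {
        "name": "index-creator",
        "image": image,
        # `command` replaces the image ENTRYPOINT; `args` alone would only
        # replace CMD and still run whatever ENTRYPOINT the image defines.
        "command": list(command),
        "resources": {
            # The request steers scheduling toward a node with enough RAM;
            # the limit caps what the container may use.
            "requests": {"memory": memory_request},
            "limits": {"memory": memory_limit},
        },
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": "search-index-creator"},
        "spec": {
            "template": {
                "spec": {"restartPolicy": "Never", "containers": [container]}
            }
        },
    }
```

A memory request larger than what an n1-standard-8 node offers is one way to keep the pod off the default pool and on the larger nodes.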
Jeremy Lewi 65e89a599b code search example make distributed training work; Create some components to train models (#317)
* Make distributed training work; Create some components to train models

* Check in a ksonnet component to train a model using the tinyparam
  hyperparameter set.

* We want to check in the ksonnet component to facilitate reproducibility.
  We need a better way to separate the particular experiments used for
  the CS search demo effort from the jobs we want customers to try.

   Related to #239 train a high quality model.

* Check in the cs_demo ks environment; this was being ignored as a result of
  .gitignore

Make distributed training work #208

* We got distributed synchronous training to work with Tensor2Tensor 1.10
* This required creating a simple python script to start the TF standard
  server and run it as a sidecar of the chief pod and as the main container
  for the workers/ps.

* Rename the model to kf_similarity_transformer to be consistent with other
  code.
  * We don't want to use the default name because we don't want to inadvertently
  use the SimilarityTransformer model defined in the Tensor2Tensor project.

* replace build.sh by a Makefile. Makes it easier to add variant commands
  * Use the GitHash not a random id as the tag.
  * Add a label to the docker image to indicate the git version.

* Put the Makefile at the top of the code_search tree; makes it easier
  to pull all the different sources for the Docker images.

* Add an option to build the Docker images with GCB; this is more efficient
  when you are on a poor network connection because you don't have to download
  images locally.
    * Use jsonnet to define and parameterize the GCB workflow.

* Build separate docker images for running Dataflow and for running the trainer.
  This helps avoid versioning conflicts caused by different versions of protobuf
  pulled in by the TF version used as the base image vs. the version used
  with apache beam.

      Fix #310 - Training fails with GPUs.

* Changes to support distributed training.
* Simplify t2t-entrypoint.sh so that all we do is parse TF_CONFIG
  and pass requisite config information as command line arguments;
  everything else can be set in the K8s spec.

* Upgrade to T2T 1.10.

* Add ksonnet prototypes for tensorboard.
2018-11-08 16:13:01 -08:00
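The idea behind the simplified entrypoint in #317, parsing TF_CONFIG and passing the result on as command-line arguments, can be sketched in Python. The actual entrypoint is the shell script t2t-entrypoint.sh. The flag names below are hypothetical and are not claimed to be Tensor2Tensor flags.

```python
import json
import os


def tf_config_to_args(tf_config_json):
    """Translate a TF_CONFIG JSON string into a list of command-line arguments."""
    config = json.loads(tf_config_json or "{}")
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    args = []
    # One flag per job type (e.g. chief, worker, ps) listing its host:port pairs.
    for job_name in sorted(cluster):
        args.append("--{}_hosts={}".format(job_name, ",".join(cluster[job_name])))
    if task:
        args.append("--task_type={}".format(task.get("type", "")))
        args.append("--task_index={}".format(task.get("index", 0)))
    return args


if __name__ == "__main__":
    # The operator that launches each replica sets TF_CONFIG in its environment.
    print(" ".join(tf_config_to_args(os.environ.get("TF_CONFIG", ""))))
```

Keeping the entrypoint this small means everything else, such as images, resources and sidecars, stays in the K8s spec.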
Jeremy Lewi d01b76b6f9 Update ksonnet for datagen (#309)
* Update the datagen component.

* We should use a K8s job rather than a TFJob. We can also simplify the
  ksonnet by just putting the spec into the jsonnet file rather than trying
  to share various bits of the spec with the TFJob for training.

Related to kubeflow/examples#308 use globals to allow parameters to be shared
across components (e.g. working directory.)

* Update the README with information about data.

* Fix table markdown.
2018-11-07 14:28:16 -08:00
Jeremy Lewi df278567f0 Fix performance of dataflow preprocessing job. (#302)
* Fix performance of dataflow preprocessing job.

* Fix #300; Dataflow job for preprocessing is really slow.

  * The problem is we are loading the spacy tokenization model on every
    invocation of the tokenization function and this is really expensive.
  * We should be doing this once per module import.

* After fixing this issue; the job completed in approximately 20 minutes using
  5 workers.

  * We can process all 1.3 million records in ~20 minutes (elapsed time) using five 32-CPU workers and about 1 hour of CPU time altogether.

* Add options to the Dataflow job to read from files as opposed to BigQuery
  and to skip BigQuery writes. This is useful for testing.

* Add a "unittest" that verifies the Dataflow preprocessing job can run
  successfully using the DirectRunner.

* Update the Docker image and a ksonnet component for a K8s job that
  can be used to submit the Dataflow job.

* Fix #299; Add logging to the Dataflow preprocessing job to indicate that
  a Dataflow job was submitted.

* Add an option to the preprocessing Dataflow job to read an entire
  BigQuery table as the input rather than running a query to get the input.
  This is useful in the case where the user wants to run a different
  query to select the repo paths and contents to process and write them
  to some table to be processed by the Dataflow job.

* Fix lint.

* More lint fixes.
2018-11-06 14:14:28 -08:00
Jeremy Lewi f87dfd8e53 Create a demo cluster for the code search example. (#298) 2018-11-05 06:07:52 -08:00