examples/code_search/demo/cs-demo-1103/ks_app/environments
Jeremy Lewi df278567f0 Fix performance of dataflow preprocessing job. (#302)
* Fix #300; Dataflow job for preprocessing is really slow.

  * The problem is that we load the spaCy tokenization model on every
    invocation of the tokenization function, which is really expensive.
  * We should load it once per module import instead.

* After fixing this issue, the job completed in approximately 20 minutes using
  5 workers.

  * We can process all 1.3 million records in ~20 minutes (elapsed time) using
    five 32-CPU workers and about 1 hour of CPU time altogether.
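
As a rough back-of-the-envelope check, the elapsed-time figures above imply the following throughput. The variable names are illustrative, and the inputs are the approximate numbers quoted in this message.

```python
records = 1_300_000          # approximate number of records processed
elapsed_seconds = 20 * 60    # ~20 minutes of wall-clock time
workers = 5

# Overall and per-worker throughput in records per second.
records_per_second = records / elapsed_seconds
records_per_second_per_worker = records_per_second / workers

print(f"{records_per_second:.0f} records/s overall")
print(f"{records_per_second_per_worker:.0f} records/s per worker")
```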

* Add options to the Dataflow job to read from files as opposed to BigQuery
  and to skip BigQuery writes. This is useful for testing.
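
One way such options could be exposed is sketched below. The flag names `--input_files` and `--skip_bigquery_write` are hypothetical and chosen only for illustration; the actual option names in the job may differ.

```python
import argparse


def build_parser():
    """Build an argument parser with illustrative (hypothetical) flag names."""
    parser = argparse.ArgumentParser(
        description="Preprocessing job options (illustrative sketch)")
    parser.add_argument(
        "--input_files", default=None,
        help="Read input from these files instead of BigQuery.")
    parser.add_argument(
        "--skip_bigquery_write", action="store_true",
        help="Do not write results to BigQuery (useful for tests).")
    return parser


def describe_plan(args):
    """Return (source, sink) labels describing what the job would do."""
    source = "files" if args.input_files else "bigquery"
    sink = "none" if args.skip_bigquery_write else "bigquery"
    return source, sink
```

A local test run could then pass both flags and never touch BigQuery, while a production run passes neither.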

* Add a "unittest" that verifies the Dataflow preprocessing job can run
  successfully using the DirectRunner.

* Update the Docker image and a ksonnet component for a K8s job that
  can be used to submit the Dataflow job.

* Fix #299; Add logging to the Dataflow preprocessing job to indicate that
  a Dataflow job was submitted.

* Add an option to the preprocessing Dataflow job to read an entire
  BigQuery table as the input rather than running a query to get the input.
  This is useful in the case where the user wants to run a different
  query to select the repo paths and contents to process and write them
  to some table to be processed by the Dataflow job.
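
The choice between reading a whole table and running a query might be modeled like this. It is a sketch under stated assumptions: `select_input`, its parameters, and the placeholder default query are all hypothetical and are not taken from the job's code.

```python
# Placeholder only; the job's real default query is not shown here.
DEFAULT_QUERY = "SELECT repo_path, content FROM `some_dataset.some_table`"


def select_input(table=None, query=None):
    """Decide whether the job reads a whole table or runs a query.

    Returns a (kind, value) pair where kind is "table" or "query".
    """
    if table and query:
        raise ValueError("Specify either a table or a query, not both.")
    if table:
        # The user already ran their own query and stored the results here.
        return ("table", table)
    return ("query", query or DEFAULT_QUERY)
```

For example, `select_input(table="project:dataset.my_table")` would read the whole table, while calling it with no arguments falls back to the default query.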

* Fix lint.

* More lint fixes.
2018-11-06 14:14:28 -08:00
default Fix performance of dataflow preprocessing job. (#302) 2018-11-06 14:14:28 -08:00
base.libsonnet Fix performance of dataflow preprocessing job. (#302) 2018-11-06 14:14:28 -08:00