mirror of https://github.com/kubeflow/examples.git
* Fix performance of the Dataflow preprocessing job (fixes #300; the Dataflow job for preprocessing was very slow).
* The problem was that the spacy tokenization model was being loaded on every invocation of the tokenization function, which is very expensive; it should be loaded once per module import (see the tokenizer sketch after this list).
* After this fix, the job completed in approximately 20 minutes: all 1.3 million records are processed in ~20 minutes of elapsed time using five 32-CPU workers and about 1 hour of CPU time altogether.
* Add options to the Dataflow job to read from files as opposed to BigQuery and to skip the BigQuery writes; this is useful for testing (see the input-options sketch below).
* Add a "unittest" that verifies the Dataflow preprocessing job can run successfully using the DirectRunner (see the test sketch below).
* Update the Docker image and a ksonnet component for a K8s job that can be used to submit the Dataflow job.
* Fix #299: add logging to the Dataflow preprocessing job to indicate that a Dataflow job was submitted.
* Add an option to the preprocessing Dataflow job to read an entire BigQuery table as the input rather than running a query to get the input. This is useful when the user wants to run their own query to select the repo paths and contents to process, write the results to a table, and have the Dataflow job read from that table.
* Fix lint.
* More lint fixes.
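The root cause described above lends itself to a short illustration. Below is a minimal sketch of the fix, assuming a tokenization helper named `tokenize_docstring` and the `en_core_web_sm` model; the actual function and model used in the repo may differ.

```python
import spacy

# Loading the spacy model is expensive. Doing it inside the tokenization
# function meant paying that cost for every record; a module-level load
# means each worker process pays it only once, at import time.
_SPACY_MODEL = spacy.load('en_core_web_sm')


def tokenize_docstring(text):
  """Split a docstring into lowercase tokens using the shared spacy model."""
  return [token.text.lower() for token in _SPACY_MODEL.tokenizer(text)]
```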
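The new input and output options could be wired up roughly as follows. This is a sketch only: the flag names (`--input_files`, `--input_table`, `--query`, `--skip_bigquery_write`, `--output_table`) are illustrative, not the job's actual interface, and schema/write-disposition details are omitted.

```python
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def build_read_transform(args):
  """Choose the input: local/GCS files, a whole BigQuery table, or a query."""
  if args.input_files:
    return beam.io.ReadFromText(args.input_files)
  if args.input_table:
    # Reading a whole table lets users run their own selection query up front,
    # stage the results in a table, and point the job at that table.
    return beam.io.Read(beam.io.BigQuerySource(table=args.input_table))
  return beam.io.Read(
      beam.io.BigQuerySource(query=args.query, use_standard_sql=True))


def run(argv=None):
  parser = argparse.ArgumentParser()
  parser.add_argument('--input_files')
  parser.add_argument('--input_table')
  parser.add_argument('--query')
  parser.add_argument('--output_table', default='project:dataset.output')
  parser.add_argument('--skip_bigquery_write', action='store_true')
  args, pipeline_args = parser.parse_known_args(argv)

  with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
    records = p | 'Read' >> build_read_transform(args)
    # ... tokenization and the rest of the preprocessing transforms go here ...
    # Skipping the BigQuery write keeps test runs from touching real tables.
    if not args.skip_bigquery_write:
      records | 'Write' >> beam.io.WriteToBigQuery(args.output_table)
```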
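A DirectRunner smoke test along the lines of the "unittest" bullet might look like the following; it reuses the hypothetical `run()` entry point and flags from the previous sketch, so the real test in the repo may be structured differently.

```python
import unittest

# `run` is the entry point sketched above (e.g. from preprocess import run).


class PreprocessPipelineTest(unittest.TestCase):

  def test_runs_with_direct_runner(self):
    # Read from local test files and skip the BigQuery write so the test is
    # hermetic; DirectRunner executes the whole pipeline in-process.
    run([
        '--input_files=test_data/records-*.json',
        '--skip_bigquery_write',
        '--runner=DirectRunner',
    ])


if __name__ == '__main__':
  unittest.main()
```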
Directory listing:

* default
* base.libsonnet