* Fix performance of the Dataflow preprocessing job.
* Fix #300; the Dataflow job for preprocessing is really slow.
* The problem is that we load the spaCy tokenization model on every
invocation of the tokenization function, which is very expensive.
* We should do this once per module import, as sketched below.
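
A minimal sketch of the pattern, assuming the tokenizer lives in a small module of its own (the function and model names below are illustrative, not necessarily the ones used in the job):

```python
import spacy

# Load the expensive spaCy model once, at module import time, instead of on
# every call to the tokenization function. The model name is illustrative.
_NLP = spacy.load("en_core_web_sm")


def tokenize(text):
  """Tokenizes a single document with the module-level spaCy model."""
  return [token.text for token in _NLP(text)]
```
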
* After fixing this issue, the job processes all 1.3 million records in
roughly 20 minutes of elapsed time using 5 32-CPU workers and about 1 hour
of CPU time altogether.
* Add options to the Dataflow job to read from files as opposed to BigQuery
and to skip BigQuery writes. This is useful for testing.
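
A sketch of what these options could look like; the flag names and file format below are assumptions, not necessarily what the job uses:

```python
import argparse

import apache_beam as beam


def parse_args(argv):
  parser = argparse.ArgumentParser()
  parser.add_argument(
      "--input_files", default="",
      help="If set, read newline-delimited records from these files instead "
           "of BigQuery.")
  parser.add_argument(
      "--skip_bigquery_write", action="store_true",
      help="Do not write results back to BigQuery; useful for testing.")
  return parser.parse_known_args(argv)


def read_records(pipeline, args, query):
  # Read from local/GCS files when requested; otherwise read from BigQuery.
  if args.input_files:
    return pipeline | "ReadFiles" >> beam.io.ReadFromText(args.input_files)
  return pipeline | "ReadBigQuery" >> beam.io.ReadFromBigQuery(query=query)
```
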
* Add a "unittest" that verifies the Dataflow preprocessing job can run
successfully using the DirectRunner.
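
A rough shape of such a test; the module and flag names are hypothetical:

```python
import unittest

import preprocess  # Hypothetical module containing the preprocessing pipeline.


class PreprocessTest(unittest.TestCase):

  def test_runs_with_direct_runner(self):
    # Run the full pipeline locally on a small fixture; flag names are
    # illustrative.
    preprocess.run([
        "--runner=DirectRunner",
        "--input_files=test_data/records.jsonl",
        "--skip_bigquery_write",
    ])


if __name__ == "__main__":
  unittest.main()
```
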
* Update the Docker image and a ksonnet component for a K8s job that
can be used to submit the Dataflow job.
* Fix #299; add logging to the Dataflow preprocessing job to indicate that
a Dataflow job was submitted.
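
One way the logging could look; this is a sketch, and the job-id lookup assumes the result object returned by the DataflowRunner:

```python
import logging

import apache_beam as beam


def run(pipeline_options):
  p = beam.Pipeline(options=pipeline_options)
  # ... construct the preprocessing pipeline here ...
  result = p.run()
  # The DataflowRunner's result object exposes the job id; other runners may
  # not, so fall back to logging the pipeline state.
  try:
    logging.info("Submitted Dataflow job: %s", result.job_id())
  except (AttributeError, NotImplementedError):
    logging.info("Submitted pipeline; current state: %s", result.state)
  return result
```
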
* Add an option to the preprocessing Dataflow job to read an entire
BigQuery table as the input rather than running a query to get the input.
This is useful when the user wants to run a different query to select the
repo paths and contents to process, write them to some table, and have the
Dataflow job process that table.
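
A sketch of the table-vs-query branch, with assumed flag names:

```python
import apache_beam as beam


def read_from_bigquery(pipeline, args):
  # When --input_table is given (e.g. "project:dataset.table"), read the whole
  # table; otherwise fall back to running the configured query.
  if args.input_table:
    return pipeline | "ReadTable" >> beam.io.ReadFromBigQuery(
        table=args.input_table)
  return pipeline | "ReadQuery" >> beam.io.ReadFromBigQuery(
      query=args.query, use_standard_sql=True)
```
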
* Fix lint.
* More lint fixes.