mirror of https://github.com/kubeflow/examples.git
* Fix performance of the Dataflow preprocessing job (fixes #300; the Dataflow job for preprocessing was very slow).
* The problem was that the spacy tokenization model was being loaded on every invocation of the tokenization function, which is very expensive; it should be loaded once per module import (see the tokenizer sketch after this list).
* After this fix, the job completed in approximately 20 minutes: all 1.3 million records are processed in ~20 minutes of elapsed time using five 32-CPU workers and about 1 hour of CPU time altogether.
* Add options to the Dataflow job to read from files as opposed to BigQuery and to skip the BigQuery writes; this is useful for testing (see the input-options sketch below).
* Add a "unittest" that verifies the Dataflow preprocessing job can run successfully using the DirectRunner (see the test sketch below).
* Update the Docker image and a ksonnet component for a K8s job that can be used to submit the Dataflow job.
* Fix #299: add logging to the Dataflow preprocessing job to indicate that a Dataflow job was submitted.
* Add an option to the preprocessing Dataflow job to read an entire BigQuery table as the input rather than running a query to get the input. This is useful when the user wants to run their own query to select the repo paths and contents to process, write the results to a table, and have the Dataflow job read from that table.
* Fix lint.
* More lint fixes.
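The root cause described above lends itself to a short illustration. Below is a minimal sketch of the fix, assuming a tokenization helper named `tokenize_docstring` and the `en_core_web_sm` model; the actual function and model used in the repo may differ.

```python
import spacy

# Loading the spacy model is expensive. Doing it inside the tokenization
# function meant paying that cost for every record; a module-level load
# means each worker process pays it only once, at import time.
_SPACY_MODEL = spacy.load('en_core_web_sm')


def tokenize_docstring(text):
  """Split a docstring into lowercase tokens using the shared spacy model."""
  return [token.text.lower() for token in _SPACY_MODEL.tokenizer(text)]
```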
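The new input and output options could be wired up roughly as follows. This is a sketch only: the flag names (`--input_files`, `--input_table`, `--query`, `--skip_bigquery_write`, `--output_table`) are illustrative, not the job's actual interface, and schema/write-disposition details are omitted.

```python
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def build_read_transform(args):
  """Choose the input: local/GCS files, a whole BigQuery table, or a query."""
  if args.input_files:
    return beam.io.ReadFromText(args.input_files)
  if args.input_table:
    # Reading a whole table lets users run their own selection query up front,
    # stage the results in a table, and point the job at that table.
    return beam.io.Read(beam.io.BigQuerySource(table=args.input_table))
  return beam.io.Read(
      beam.io.BigQuerySource(query=args.query, use_standard_sql=True))


def run(argv=None):
  parser = argparse.ArgumentParser()
  parser.add_argument('--input_files')
  parser.add_argument('--input_table')
  parser.add_argument('--query')
  parser.add_argument('--output_table', default='project:dataset.output')
  parser.add_argument('--skip_bigquery_write', action='store_true')
  args, pipeline_args = parser.parse_known_args(argv)

  with beam.Pipeline(options=PipelineOptions(pipeline_args)) as p:
    records = p | 'Read' >> build_read_transform(args)
    # ... tokenization and the rest of the preprocessing transforms go here ...
    # Skipping the BigQuery write keeps test runs from touching real tables.
    if not args.skip_bigquery_write:
      records | 'Write' >> beam.io.WriteToBigQuery(args.output_table)
```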
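A DirectRunner smoke test along the lines of the "unittest" bullet might look like the following; it reuses the hypothetical `run()` entry point and flags from the previous sketch, so the real test in the repo may be structured differently.

```python
import unittest

# `run` is the entry point sketched above (e.g. from preprocess import run).


class PreprocessPipelineTest(unittest.TestCase):

  def test_runs_with_direct_runner(self):
    # Read from local test files and skip the BigQuery write so the test is
    # hermetic; DirectRunner executes the whole pipeline in-process.
    run([
        '--input_files=test_data/records-*.json',
        '--skip_bigquery_write',
        '--runner=DirectRunner',
    ])


if __name__ == '__main__':
  unittest.main()
```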
Directory listing:

* default
* base.libsonnet