Commit Graph

5 Commits

Author SHA1 Message Date
Jeremy Lewi 9f061a0554 Update the central dashboard UI image to one that includes pipelines. (#430) 2018-12-12 09:34:21 -08:00
Jeremy Lewi b26f7e9a48 Add pods/logs permission to the jupyter notebook role. (#419)
* This is needed so that fairing can tail the logs.
2018-12-09 15:53:46 -08:00
IronPan 4f95e85e63 add pipeline component (#356)
* add pipeline component

* update pipeline component
2018-11-26 06:21:07 -08:00
Jeremy Lewi df278567f0 Fix performance of dataflow preprocessing job. (#302)
* Fix performance of dataflow preprocessing job.

* Fix #300; Dataflow job for preprocessing is really slow.

  * The problem is we are loading the spacy tokenization model on every
    invocation of the tokenization function and this is really expensive.
  * We should be doing this once per module import.

* After fixing this issue; the job completed in approximately 20 minutes using
  5 workers.

  * We can process all 1.3 million records in ~ 20 minutes (elapsed time) using 5 32 CPU workers and about 1 hour of CPU time altogether.

* Add options to the Dataflow job to read from files as opposed to BigQuery
  and to skip BigQuery writes. This is useful for testing.

* Add a "unittest" that verifies the Dataflow preprocessing job can run
  successfully using the DirectRunner.

* Update the Docker image and a ksonnet component for a K8s job that
  can be used to submit the Dataflow job.

* Fix #299; Add logging to the Dataflow preprocessing job to indicate that
  a Dataflow job was submitted.

* Add an option to the preprocessing Dataflow job to read an entire
  BigQuery table as the input rather than running a query to get the input.
  This is useful in the case where the user wants to run a different
  query to select the repo paths and contents to process and write them
  to some table to be processed by the Dataflow job.

* Fix lint.

* More lint fixes.
2018-11-06 14:14:28 -08:00
Jeremy Lewi f87dfd8e53 Create a demo cluster for the code search example. (#298) 2018-11-05 06:07:52 -08:00