53 Commits

Author SHA1 Message Date
Yinan Li 7a34c91f6c Use a Kubernetes Job to run spark-submit for multi-version support 2020-04-18 11:03:26 -07:00
Shiqi Sun e1d70afe9c
Add total SparkApplication count metric (#856)
* Add total SparkApplication count metric

Total SparkApplication count is the total number of SparkApplications
that have been processed by the operator. This metric can be used to
track how many SparkApplications users have submitted to the K8s API
server, and can also serve as the denominator when computing the job
success rate, for example.

* Export SparkApp count metric in sparkapp_metrics.go

Invoke the export of the SparkApp count metric in exportMetrics() in
sparkapp_metrics.go instead of syncSparkApplication() in controller.go,
to align with the metric exporting convention in the code base.
2020-04-05 18:03:59 -07:00
Shiqi Sun 0e0867f0b4
Add job start latency metrics and add namespace tag in metrics (#852)
* Add new metric for job start latency

* Add job latency histogram metric and namespace tag

Job start latency is defined as the time difference between when the
job is submitted by the user and when the job reaches the running state
or any of the terminal states. We use a histogram with configurable
boundaries so that users can provide the boundaries they are interested
in. They can use one of them as their SLO/SLA and use the histogram
values to compute the percentage of jobs that meet the SLA. We also
added the namespace label to all the metrics where applicable, when
the users specify it in the command line option. In addition, we fixed
the controller state machine diagram.

* Add start latency metrics doc, fix based on review

Added start latency summary and histogram metrics doc in
quick-start-guide.md. Added fixes based on the code review comments in
the PR.

Co-authored-by: Vaishnavi Giridaran <vgiridaran@salesforce.com>
2020-04-01 13:13:36 -07:00
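
The commit above (0e0867f0b4) describes using histogram values to compute the percentage of jobs that meet an SLA. The following is a minimal sketch of that arithmetic in Go, assuming cumulative bucket counts in the Prometheus style. The type `bucket`, the function `fractionWithinSLA`, and all boundaries and counts are hypothetical illustrations, not taken from the operator's code.

```go
package main

import "fmt"

// bucket is a hypothetical cumulative histogram bucket: Count is the number
// of jobs whose start latency was at most UpperBound seconds.
type bucket struct {
	UpperBound float64
	Count      uint64
}

// fractionWithinSLA returns the fraction of jobs that started within sla
// seconds. The answer is exact only when sla matches a configured bucket
// boundary; otherwise ok is false.
func fractionWithinSLA(buckets []bucket, total uint64, sla float64) (frac float64, ok bool) {
	if total == 0 {
		return 0, false
	}
	for _, b := range buckets {
		if b.UpperBound == sla {
			return float64(b.Count) / float64(total), true
		}
	}
	return 0, false
}

func main() {
	// Made-up sample data: 100 jobs, boundaries at 30s, 60s, and 120s.
	buckets := []bucket{{30, 40}, {60, 85}, {120, 97}}
	frac, ok := fractionWithinSLA(buckets, 100, 60)
	fmt.Println(frac, ok)
}
```

This is why the SLA threshold should be one of the configured boundaries: a histogram only records how many observations fell at or below each boundary, so a threshold between two boundaries can only be estimated.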
Yinan Li c8ae9e3520
Upgraded to Spark 2.4.5 (#798) 2020-02-10 09:01:06 -08:00
akhurana001 85180bc593 Update Ingress details in quick-start-guide.md (#667)
* Update quick-start-guide.md

* Update Ingress Setup docs
2019-10-21 09:15:38 -07:00
Yinan Li 86ee076aab
Upgraded default Spark version from 2.4.0 to 2.4.4 (#625) 2019-09-20 11:03:55 -07:00
kevin hogeland 55a1eebc0c Generate CRD specs, bump to v1beta2 (#578)
* Generate CRD specs, bump to v1beta2

* Add short/singular CRD names

* Merge upstream/master

* Tweak Cores validation

* Fix typo, merge upstream

* Update remaining docs for v1beta2
2019-09-13 10:37:21 -07:00
Michael Marshall fb59a3aed6 Improve documentation for SparkJobNamespace (#608)
* Improve documentation for SparkJobNamespace

* Modifications after review
2019-09-11 08:14:43 -07:00
Ramya Rajasekaran b7361477c5 Updated prefixes in documentation to match produced metrics. (#574)
* Modified metrics documentation to reflect code

* Updated metrics prefix

* Updated prefixes to match produced metrics
2019-08-29 11:51:26 -07:00
Yinan Li c5d04ab4a4
Added documentation for leader election and HA mode (#575) 2019-08-19 15:54:43 -07:00
Ramya Rajasekaran f9370be2a3 Modified metrics documentation to reflect code (#570) 2019-08-15 13:36:53 -07:00
kevin hogeland 249860f0cc Webhook enhancements (#543)
* Cert configuration and reloading

* Add support for strict webhook error handling

* Improve webhook error handling

* Don't deregister the webhook when failure policy is strict

* standard error message capitalization

* have the webhook parse its own configuration from flags

* clean up cert provider code

* Add explanation for skipping deregistration

* Tests and fixes
2019-08-15 10:46:44 -07:00
Martijn Zwennes 79ec720f7e Update docs with GKE private cluster info (#568) 2019-08-13 13:23:33 -07:00
akhurana001 6fc512301d Replace NodePort Service with ClusterIP (#520)
* Remove NodePort Service

* Docs update
2019-06-17 12:05:35 -07:00
Xingjun Wang 566d917a64
Fix a typo
2019-04-28 16:14:57 +08:00
Alex Glikson 86bfeaf994
typo in quick-start-guide.md 2019-03-01 16:25:55 -05:00
Chaoran Yu cc70d3c979 Fixed doc to remove createSparkJobNamespace 2019-01-25 09:51:27 +01:00
Yinan Li 2db484a82e Updated the quick start guide 2019-01-23 11:22:18 -08:00
Yinan Li c41576b5ff Updated to use the v1beta1 version of the APIs 2019-01-17 10:45:54 -08:00
Yinan Li 8a84d6f39f
Merge pull request #343 from lightbend/fixes
OpenShift image and fixed bug on running multiple operators/webhooks
2019-01-14 12:38:29 -08:00
Yinan Li e802f66ea5 Some refactoring to main.go and the quick start guide 2018-12-18 11:30:46 -08:00
Chaoran Yu 73d09c8c4d Added support for running the containers using a non-root user; fixed an issue with running multiple instances of operator; fixed doc to reflect webhook new default in the chart 2018-12-11 23:53:07 +08:00
Yinan Li d909c246cd Added yamls for CRDs 2018-11-27 14:59:23 -08:00
akhurana001 8c7fdbb306 Operator State Management + Ingress Creation (#291)
* SparkOperator: Prometheus Metrics Integration

* Prometheus Metric Update

* Spark Operator:Prometheus Metric Integration

* PositiveGauge rework

* remove unwanted dependencies

* Propagating ScheduledSpark App Labels

* Doc update

* Metric Description update

* fix app wait

* SparkOperator: Prometheus Metrics Integration

* Spark Operator metrics:PR Comments

* SparkOperator: Set completion time for Failed App

* Operator Metrics: PR comments

* Spark Operator: PR Comments

* Controller Update

* PR Comments

* Docs Update

* Driver State Transition Check Update

* Operator State Management

* Clean-up

* Exposing Spark Application Id in Operator

* SparkAppId updates

* Add Lyft as a user and contributor to operator

* Spark Operator Rework

* Reworking Restart-Policy

* Documentation update

* PR comments

* PR comments

* Ingress impl

* Ingress Tests + Updates

* go fmt

* PR Comments

* missing files

* AppId removal: Doc Update

* Doc update

* Delete UI/Ingress + Other minor changes

* Add PENDING_RETRY State

* PR comments

* PR comments

* Clean-up

* Update controller.go

* Add Terminal State

* Terminal State

* Spark improvements

* event type

* Events update

* Update controller.go

* Update controller.go

* PR Comments

* PR comments

* Support Best-effort Spec updates

* New State

* PR comments

* PR comments

* go fmt

* Docs update

* PR feedback

* PR Feedback

* PR comments
2018-11-19 15:25:55 -08:00
Chaoran Yu 61663a2ce6 Documentation updates 2018-11-06 11:49:22 -05:00
Chaoran Yu 77772a1987 Refactored tests to use generated client instead of kubectl 2018-11-06 11:49:22 -05:00
Chaoran Yu 83cb8dd90e Documentation update corresponding to Helm chart v0.1.3 2018-11-06 11:49:22 -05:00
Yinan Li b575f307ad Upgraded to use the Spark 2.4.0 image 2018-11-05 07:51:32 -08:00
Sergey Samsonov f350a23efe Add support for tolerations 2018-10-20 17:21:18 -07:00
Piotr Mrowczynski bc9bf22a73 Add batch job to initialize the webhook secret to manifest, remove requirement of manual step 2018-10-05 10:16:41 -07:00
Yinan Li d277088c65 Added a developer guide and notes on a code gen issue 2018-09-28 14:08:52 -07:00
Sarjeet Singh 18dafddd64 Fix the namespace for sparkoperator deployment
2018-09-25 19:28:21 -07:00
Yinan Li 33f8cd1031 Fixed documentation after the go import path change 2018-09-23 21:28:25 -07:00
Yinan Li c43465e832 Used "Kubernetes Operator for Apache Spark" in documentation 2018-09-19 09:11:59 -07:00
Yinan Li 29a4329ea4 Make the base spark image an argument 2018-09-10 11:51:38 -07:00
Chaoran Yu 71e5048a14 Moved to the install section 2018-09-06 21:54:26 -07:00
Chaoran Yu cb51a5e7f2 Explain additional k8s resources that get created for webhook 2018-09-06 21:54:26 -07:00
Chaoran Yu d60cf87f8c Updated doc to use Helm chart 2018-09-06 21:54:26 -07:00
akhurana001 ceb230481e SparkOperator: Prometheus Metrics Integration (#227)
* SparkOperator: Prometheus Metrics Integration

* Spark Operator metrics:PR Comments

* SparkOperator: Set completion time for Failed App

* Operator Metrics: PR comments

* Spark Operator: PR Comments

* Controller Update

* PR Comments

* Docs Update

* Driver State Transition Check Update
2018-08-03 13:56:58 -07:00
barney-s 7e878578ad
typo 2018-07-19 14:48:54 -07:00
Yinan Li 9a08ae361c Get rid of the initializer (replaced by the mutating webhook) 2018-07-16 12:05:12 -07:00
Yinan Li 368b4a5ac3 Add mutating admission webhook (#211)
* Preparation to support affinity/anti-affinity

* Initial commit for the mutating admission webhook
2018-07-16 11:43:46 -07:00
Yinan Li 971de3b542 Made DNS check optional (on by default) 2018-06-09 22:28:45 -07:00
scotthew1 21567725f2 use multi-stage Dockerfile for reliable builds (#174)
* use multi-stage Dockerfile for reliable builds

* fixed date in codegen boilerplate

* clean up whitespace in quick-start-guide.md

* update quick-start-guide.md with new docker build
2018-06-01 10:15:06 -07:00
Scott Reisor e941c18f11 added namespace to initializer, some updates from review 2018-06-01 07:23:09 -07:00
Jirka Kremser 7d78a21655 Fix the typos in quick start guide and add the namespace explicitly 2018-05-10 09:32:21 -07:00
Palak Dalal b5344d3ad5 docs changes 2018-05-03 13:59:37 -07:00
Yinan Li 74af880633 Updated spark-operator.yaml 2018-04-25 09:50:27 -07:00
Yinan Li eff4c59cb2 Add cron support 2018-04-19 12:01:37 -07:00
Yinan Li 3624ad80de
Update quick-start-guide.md 2018-04-17 10:28:10 -07:00