231 Commits

Author SHA1 Message Date
urihoenig 0ac8b878d6
Support nodeSelector in spark-submit pod (#1250) 2021-05-03 16:30:29 -07:00
urihoenig 6f35a11526
Higher spark-submit pod memory limit (#984) 2020-07-23 08:46:20 -07:00
Yinan Li 0600e96b1c Fixed issues after merging 2020-04-18 11:21:12 -07:00
Yinan Li 804266f35a Added support for using envFrom 2020-04-18 11:13:47 -07:00
XsWack e86a2b051d fix bug that when submit job fails but sparkapp status do not change (#720) 2020-04-18 11:04:47 -07:00
XsWack 9836597026 fix when set onFailureRetries and restartPolicy to onFailure but the sparkapp are always restart (#713) 2020-04-18 11:04:47 -07:00
XsWack a7028ead0d fix bug when delete job, the job pod are not deleted (#712) 2020-04-18 11:04:47 -07:00
akhurana001 0dd54e56c9 Add ImagePullSecrets/ImagePullPolicy for submit job (#571) 2020-04-18 11:04:47 -07:00
akhurana001 8583270093 Multi Version Support via Job: Add Tests + other changes (#553)
* Add resource request and limits for spark-submit pod

* Minor changes for submission-job-manager + add tests

* Remove un-used fields from spec

* deepcopy funcs

* Update auto-generated code

* PR feedback

* PR comments

* Handling submission failure
2020-04-18 11:04:38 -07:00
Yinan Li 7a34c91f6c Use a Kubernetes Job to run spark-submit for multi-version support 2020-04-18 11:03:26 -07:00
Sébastien Maintrot 2bd7a00e98
update code to fix error on termination time when using sidecar (#867)
* update code to fix error on termination time when using sidecar

* fix and add new tests

* fixup! update code to fix error on termination time when using sidecar
2020-04-11 09:40:07 -07:00
Thi Nguyen 2e4559eac0
Re-enable scheduled app unit test (TestSyncScheduledSparkApplication_Forbid) (#865)
* Fix scheduled app test

* fix typo

Co-authored-by: Thi Nguyen <duongnt@users.noreply.github.com>
Co-authored-by: Thi Nguyen <thi.nguyen@cloudkitchens.com>
2020-04-09 09:07:17 -07:00
Shiqi Sun e1d70afe9c
Add total SparkApplication count metric (#856)
* Add total SparkApplication count metric

Total SparkApplication count is the total number of SparkApplications
that have been processed by the operator. This metric can be used to
track how many SparkApplications the users have submitted to the K8s API
server, and also can be used as denominator when computing job success
rate, for example.

* Export SparkApp count metric in sparkapp_metric.go

Invoking the export of SparkApp count metric in exportMetrics() in
sparkapp_metrics.go instead of syncSparkApplication() in controller.go,
in order to align with the metric exporting convention in the code base.
2020-04-05 18:03:59 -07:00
Yinan Li 8a1e259931
Refactored how status update is logged (#859) 2020-04-05 12:14:03 -07:00
Anders Chen 53a3b003c4
ScheduledSparkApplications: NextRun should be recalculated if desired schedule changes (#857)
* scheduledSparkApplications: NextRun should be recalculated whenever schedule changes

* updatedScheduleRuntime -> updatedNextRunTime
2020-04-03 11:55:50 -07:00
Shiqi Sun 0e0867f0b4
Add job start latency metrics and add namespace tag in metrics (#852)
* Add new metric for job start latency

* Add job latency histogram metric and namespace tag

Job start latency is defined as the time difference between when the
job is submitted by the user and when the job is in a running or any of
the terminal states. We use histogram with configurable boundaries
because users can provide different boundaries that they are interested
in. They can use one of them as their SLO/SLA and use the histogram
values to compute the percentage of number of jobs that meet the SLA. We
also added the namespace label into all the metrics when applicable when
the users specify it in the command line option. In addition, we fixed
the controller state machine diagram.

* Add start latency metrics doc, fix based on review

Added start latency summary and histogram metrics doc in
quick-start-guide.md. Added fixes based on the code review comments in
the PR.

Co-authored-by: Vaishnavi Giridaran <vgiridaran@salesforce.com>
2020-04-01 13:13:36 -07:00
jinxingwang 5afcce2919
add fix for metricsProperties when HasPrometheusConfigFile is true. (#847)
* add fix for metricsProperties when HasPrometheusConfigFile is true.

* add new config MetricsPropertiesFile.

* add missing auto-generated code from previous PRs.

* fix monitoring_config_test.go test condition, redo the configmap logic in monitoring_config.go.

* redo the configmap & javaOption logic in monitoring_config.go.

* set back the configmap & javaOption logic in monitoring_config.go

* update log.
2020-03-31 09:14:10 -07:00
Yinan Li aae36546e5
Fix for #826 and some refactoring (#832) 2020-03-11 10:32:42 -07:00
Yinan Li 18572fa33b
Added terminationGracePeriodSeconds and pod/container lifecycle hook to driver pods (#811)
* feat: delete driver pods with a grace period

* feat: adding lifecycle pod spec for driver pods

* adding tests for grace period and lifecycle

* fix: adding user guide for termination grace period and container hooks
2020-03-05 14:29:25 -08:00
Yinan Li 3df7030989
Revert "Fixed the spec update issue in #795 (#804)" (#805)
This reverts commit 1687c0647c.
2020-02-11 20:31:50 -08:00
Yinan Li 1687c0647c
Fixed the spec update issue in #795 (#804) 2020-02-11 17:11:31 -08:00
Yinan Li 1c4486cdda
Removed legacy init-container related fields (#750) 2019-12-20 10:08:05 -08:00
Yinan Li 52dc9a6412
Added support for separate key for driver physical CPU request (#748) 2019-12-19 15:15:18 -08:00
Jiaxin Shan 0bd7592035 Add volumes support for Spark scratch space spark.local.dir (#707) 2019-12-15 15:05:40 -08:00
Jiaxin Shan 83312b9c47 Add deleteOnTermination option to keep executor pods after a job is completed. (#709) 2019-12-10 19:30:02 -08:00
Arsenii Venherak aba591088c Driver may still be in running state after executors has terminated (#716)
(#715)
2019-12-09 09:33:42 -08:00
XsWack 4990c026d0 if ApplicationState is SUCCEEDING and we could found the executor pod, assume the executor pod has been completed (#672) 2019-10-28 09:29:09 -07:00
akhurana001 61ce189fc8 Fix v1beta2: Use UpdateStatus to update subresource (#645)
* Use UpdateStatus for subresource update

* Remove terminationTime check

* Remove status validation

* Use UpdateStatus for ScheduledSparkApplications
2019-10-02 13:59:37 -07:00
Yinan Li 5d9e8e924f
Use ExponentialBackoff when updating SparkApplication status (#638) 2019-09-25 13:31:10 -07:00
Hu Sheng 001644602c Fix integration issue with volcano (#632)
See above
2019-09-24 16:41:13 -07:00
Yinan Li bae9d14be3
Derive driver pod name if it's not found in status (#624) 2019-09-18 18:44:09 -07:00
Yinan Li a4f77b7c32
Check application expiration on OnUpdate (#618) 2019-09-16 10:03:41 -07:00
Yinan Li e704c7b15d
Added TTL for SparkApplications (#615) 2019-09-13 14:21:58 -07:00
kevin hogeland 55a1eebc0c Generate CRD specs, bump to v1beta2 (#578)
* Generate CRD specs, bump to v1beta2

* Add short/singular CRD names

* Merge upstream/master

* Tweak Cores validation

* Fix typo, merge upstream

* Update remaining docs for v1beta2
2019-09-13 10:37:21 -07:00
Hu Sheng 403cf0a7e8 Support batch scheduler manager (#585)
* Support batch scheduler manager

* Address comments

* Fix comment issues
2019-09-04 19:06:21 -07:00
Hu Sheng 023ed60c0c Support volcano podgroup (#567)
* Support volcano schedule

* Refactor codes

* Fix debug issue

* Address comments

* Add minresource support

* Use webhook to patch pods

* Fix comment issues

* Fix comment issues

* Address comment issue
2019-08-25 21:10:32 -07:00
jinxingwang 7260f70daa sparkapplication status will report succeed/fail after driver contain… (#576)
* sparkapplication status will report succeed/fail after driver container in spark driver has terminated even there is long running sidecar container in spark driver pod

* rerun fmt. rewrite check pod status.

* fix go test -v ./...

* add break back & container fail will also fail the pod.
2019-08-22 13:46:24 -07:00
Yinan Li 4439d47e47
Removed unused annotations (used to be used by the webhook) (#556) 2019-07-30 18:35:13 -07:00
kevin hogeland 2c6e9781be Avoid enqueuing non-updated applications (#541) 2019-07-03 18:28:57 -07:00
kevin hogeland f418560500 Use semantic DeepEqual for object comparison (#542) 2019-07-03 18:02:28 -07:00
Yinan Li 595b3ee173
Reset ExecutionAttempts/SubmissionAttempts properly (#527) 2019-06-23 22:02:58 -07:00
akhurana001 6fc512301d Replace NodePort Service with ClusterIP (#520)
* Remove NodePort Service

* Docs update
2019-06-17 12:05:35 -07:00
Jose Luis Pedrosa c597fba2c5 Added node selector to pod level (#504) 2019-06-09 19:21:15 -07:00
Yinan Li df9487159c
Record driver pod name during submission (#506)
* Record driver pod name during submission

and try getting the driver pod from the API server when it's not found in the informer cache.

* Addressed comments
2019-06-02 12:52:01 -07:00
Chaoran Yu 0966040950 Updated Go and dep versions (#500) 2019-05-20 07:25:40 -07:00
akhurana001 4c916e010a Expose Driver Container Error Code on Pod Failure (#494)
* Expose Container Error Code/Status on Driver Failure

* Test update

* PR comments
2019-05-13 18:58:18 -07:00
Chaoran Yu 0fa38e4fc9 Fixed job restart and upgraded Prometheus exporter jar and Go (#486)
* Fixed job restart; upgraded Prometheus agent jar

* Code cleanup

* Addressed review comments

* Refactored app state update

* Handled errors

* Handled errors

* Break once driver pod found

* Break once driver pod found
2019-05-01 18:39:28 -07:00
Piotr Mrowczynski e6584862e9 For DriverInfo.WebUIAddress, return internal ip of node if external ip not found 2019-04-29 14:43:21 +02:00
Yinan Li bd89732d0d Fixed mounting of Prometheus configuration ConfigMap 2019-04-19 15:46:14 -07:00
Yinan Li 7614e2d1a9 Some minor refactoring 2019-04-02 10:23:37 -07:00