* update code to fix error on termination time when using sidecar
* fix and add new tests
* fixup! update code to fix error on termination time when using sidecar
* Add total SparkApplication count metric
Total SparkApplication count is the total number of SparkApplications
that have been processed by the operator. This metric can be used to
track how many SparkApplications users have submitted to the K8s API
server, and can also be used as the denominator when computing, for
example, the job success rate.
* Export SparkApp count metric in sparkapp_metric.go
Invoking the export of SparkApp count metric in exportMetrics() in
sparkapp_metrics.go instead of syncSparkApplication() in controller.go,
in order to align with the metric exporting convention in the code base.
* Add new metric for job start latency
* Add job latency histogram metric and namespace tag
Job start latency is defined as the time difference between when the
job is submitted by the user and when the job reaches the running state
or any of the terminal states. We use a histogram with configurable
boundaries because users can provide the different boundaries they are
interested in. They can use one of them as their SLO/SLA and use the
histogram values to compute the percentage of jobs that meet the SLA. We
also added the namespace label to all the metrics, where applicable,
when users specify it in the command-line option. In addition, we fixed
the controller state machine diagram.
* Add start latency metrics doc, fix based on review
Added start latency summary and histogram metrics doc in
quick-start-guide.md. Added fixes based on the code review comments in
the PR.
Co-authored-by: Vaishnavi Giridaran <vgiridaran@salesforce.com>
* add fix for metricsProperties when HasPrometheusConfigFile is true.
* add new config MetricsPropertiesFile.
* add missing auto-generated code from previous PRs.
* fix monitoring_config_test.go test condition, redo the configmap logic in monitoring_config.go.
* redo the configmap & javaOption logic in monitoring_config.go.
* set back the configmap & javaOption logic in monitoring_config.go
* update log.
* feat: delete driver pods with a grace period
* feat: adding lifecycle pod spec for driver pods
* adding tests for grace period and lifecycle
* fix: adding user guide for termination grace period and container hooks
* SparkApplication status will report succeeded/failed once the driver container in the Spark driver pod has terminated, even if a long-running sidecar container is still running in the pod
* rerun fmt; rewrite the pod status check.
* fix go test -v ./...
* add break back; a failed container will also fail the pod.
* Record driver pod name during submission
and try getting the driver pod from the API server when it's not found in the informer cache.
* Addressed comments
* Fixed job restart; upgraded Prometheus agent jar
* Code cleanup
* Addressed review comments
* Refactored app state update
* Handled errors
* Handled errors
* Break once driver pod found
* Break once driver pod found