* update code to fix error on termination time when using sidecar
* fix and add new tests
* fixup! update code to fix error on termination time when using sidecar
* Add total SparkApplication count metric
Total SparkApplication count is the total number of SparkApplications
that have been processed by the operator. This metric can be used to
track how many SparkApplications users have submitted to the K8s API
server, and can also be used as the denominator when computing, for
example, the job success rate.
* Export SparkApp count metric in sparkapp_metric.go
Invoking the export of SparkApp count metric in exportMetrics() in
sparkapp_metrics.go instead of syncSparkApplication() in controller.go,
in order to align with the metric exporting convention in the code base.
* Add new metric for job start latency
* Add job latency histogram metric and namespace tag
Job start latency is defined as the time difference between when the
job is submitted by the user and when the job reaches the running state
or any of the terminal states. We use a histogram with configurable
boundaries because users can provide the different boundaries they are
interested in. They can use one of them as their SLO/SLA and use the
histogram values to compute the percentage of jobs that meet the SLA. We
also added the namespace label to all the metrics, where applicable,
when users specify it in the command-line option. In addition, we fixed
the controller state machine diagram.
* Add start latency metrics doc, fix based on review
Added start latency summary and histogram metrics doc in
quick-start-guide.md. Added fixes based on the code review comments in
the PR.
Co-authored-by: Vaishnavi Giridaran <vgiridaran@salesforce.com>
* add fix for metricsProperties when HasPrometheusConfigFile is true.
* add new config MetricsPropertiesFile.
* add missing auto-generated code from previous PRs.
* fix monitoring_config_test.go test condition, redo the configmap logic in monitoring_config.go.
* redo the configmap & javaOption logic in monitoring_config.go.
* set back the configmap & javaOption logic in monitoring_config.go
* update log.
* feat: delete driver pods with a grace period
* feat: adding lifecycle pod spec for driver pods
* adding tests for grace period and lifecycle
* fix: adding user guide for termination grace period and container hooks
* SparkApplication status will report succeeded/failed once the driver container in the Spark driver pod has terminated, even if a long-running sidecar container is still running in the pod
* rerun fmt; rewrite the pod status check.
* fix go test -v ./...
* add break back; a failed container will also fail the pod.
* Record driver pod name during submission
and try getting the driver pod from the API server when it's not found in the informer cache.
* Addressed comments
* Fixed job restart; upgraded Prometheus agent jar
* Code cleanup
* Addressed review comments
* Refactored app state update
* Handled errors
* Handled errors
* Break once driver pod found
* Break once driver pod found