* Add total SparkApplication count metric
Total SparkApplication count is the total number of SparkApplications
that the operator has processed. This metric can be used to track how
many SparkApplications users have submitted to the K8s API server, and
it can also serve as the denominator when computing, for example, the
job success rate.
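A minimal sketch of the success-rate use case, assuming the total count
and a success count have already been read from the metrics backend.
The function name and the sample values are hypothetical and are not
part of the operator:

```go
package main

import "fmt"

// successRate divides the number of successful jobs by the total
// SparkApplication count. Both inputs are assumed to be plain counter
// values; ok is false when no application has been counted yet.
func successRate(succeeded, total uint64) (rate float64, ok bool) {
	if total == 0 {
		return 0, false
	}
	return float64(succeeded) / float64(total), true
}

func main() {
	// Hypothetical sample values: 45 successes out of 50 applications.
	if rate, ok := successRate(45, 50); ok {
		fmt.Printf("success rate: %.1f%%\n", rate*100)
	}
}
```

Returning ok=false for a zero total avoids reporting a NaN rate before
any application has been counted.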
* Export SparkApp count metric in sparkapp_metrics.go
Invoke the export of the SparkApp count metric in exportMetrics() in
sparkapp_metrics.go instead of syncSparkApplication() in controller.go,
to align with the metric exporting convention in the code base.
* Add new metric for job start latency
* Add job latency histogram metric and namespace tag
Job start latency is defined as the time between when the user submits
the job and when the job reaches the running state or any terminal
state. We use a histogram with configurable boundaries so that users
can supply the boundaries they are interested in. They can use one
boundary as their SLO/SLA and use the histogram values to compute the
percentage of jobs that meet it. We also added the namespace label to
all applicable metrics when users request it through the command line
option. In addition, we fixed the controller state machine diagram.
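The SLA computation described above can be sketched as follows. This is
a stdlib-only illustration, not the operator's code: the type, the
method names, and the boundary values are all hypothetical, and the
real metric is exported through the metrics library rather than
computed in process.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// latencyHistogram is a hypothetical cumulative histogram with
// user-supplied upper bounds, expressed in seconds.
type latencyHistogram struct {
	bounds []float64 // sorted upper bounds
	counts []uint64  // counts[i] = observations <= bounds[i]
	total  uint64
}

func newLatencyHistogram(bounds []float64) *latencyHistogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &latencyHistogram{bounds: b, counts: make([]uint64, len(b))}
}

// observeSeconds records one start latency in every bucket whose upper
// bound is at least the observed value.
func (h *latencyHistogram) observeSeconds(secs float64) {
	h.total++
	for i, ub := range h.bounds {
		if secs <= ub {
			h.counts[i]++
		}
	}
}

// observe derives the latency from the submission time and the time the
// job first reached a running or terminal state.
func (h *latencyHistogram) observe(submitted, started time.Time) {
	h.observeSeconds(started.Sub(submitted).Seconds())
}

// fractionWithin returns the share of jobs that started within the
// given boundary, which must be one of the configured bounds.
func (h *latencyHistogram) fractionWithin(bound float64) (float64, bool) {
	if h.total == 0 {
		return 0, false
	}
	for i, ub := range h.bounds {
		if ub == bound {
			return float64(h.counts[i]) / float64(h.total), true
		}
	}
	return 0, false
}

func main() {
	h := newLatencyHistogram([]float64{30, 60, 120}) // assumed boundaries
	t0 := time.Unix(0, 0)
	for _, s := range []int{10, 45, 90, 200} {
		h.observe(t0, t0.Add(time.Duration(s)*time.Second))
	}
	if frac, ok := h.fractionWithin(60); ok {
		fmt.Printf("jobs within 60s SLA: %.0f%%\n", frac*100)
	}
}
```

Because the buckets are cumulative, the SLA percentage is a single
bucket count divided by the total, which is why the SLA threshold has
to coincide with one of the configured boundaries.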
* Add start latency metrics doc, fix based on review
Added documentation for the start latency summary and histogram metrics
to quick-start-guide.md. Applied fixes based on the code review
comments in the PR.
Co-authored-by: Vaishnavi Giridaran <vgiridaran@salesforce.com>
* add fix for metricsProperties when HasPrometheusConfigFile is true.
* add new config MetricsPropertiesFile.
* add missing auto-generated code from previous PRs.
* fix monitoring_config_test.go test condition, redo the configmap logic in monitoring_config.go.
* redo the configmap & javaOption logic in monitoring_config.go.
* set back the configmap & javaOption logic in monitoring_config.go
* update log.
* feat: delete driver pods with a grace period
* feat: adding lifecycle pod spec for driver pods
* adding tests for grace period and lifecycle
* fix: adding user guide for termination grace period and container hooks
* Cert configuration and reloading
* Add support for strict webhook error handling
* Improve webhook error handling
* Don't deregister the webhook when failure policy is strict
* standard error message capitalization
* have the webhook parse its own configuration from flags
* clean up cert provider code
* Add explanation for skipping deregistration
* Resource Quota enforcement webhook
* Fix bad merge
* Cleanup, fixes
* Cleanup
* Document the quota enforcer
* Tests and fixes