---
kep-number: 26
title: TTL After Finished
authors:
  - "@janetkuo"
owning-sig: sig-apps
participating-sigs:
  - sig-api-machinery
reviewers:
  - "@enisoc"
  - "@tnozicka"
approvers:
  - "@kow3ns"
editor: TBD
creation-date: 2018-08-16
last-updated: 2018-08-16
status: provisional
see-also:
  - n/a
replaces:
  - n/a
superseded-by:
  - n/a
---

# TTL After Finished Controller

## Table of Contents

A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template. [Tools for generating][] a table of contents from markdown are
available.

* [TTL After Finished Controller](#ttl-after-finished-controller)
  * [Table of Contents](#table-of-contents)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
  * [Proposal](#proposal)
    * [Concrete Use Cases](#concrete-use-cases)
    * [Detailed Design](#detailed-design)
      * [Feature Gate](#feature-gate)
      * [API Object](#api-object)
        * [Validation](#validation)
    * [User Stories](#user-stories)
    * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
      * [TTL Controller](#ttl-controller)
      * [Finished Jobs](#finished-jobs)
      * [Finished Pods](#finished-pods)
      * [Owner References](#owner-references)
    * [Risks and Mitigations](#risks-and-mitigations)
  * [Graduation Criteria](#graduation-criteria)
  * [Implementation History](#implementation-history)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc

## Summary

We propose a TTL mechanism to limit the lifetime of finished resource objects,
including Jobs and Pods, to make it easy for users to clean up old Jobs/Pods
after they finish. The TTL timer starts when the Job/Pod finishes, and the
finished Job/Pod will be cleaned up after the TTL expires.

## Motivation

In Kubernetes, finishable resources, such as Jobs and Pods, are often
frequently created and short-lived. If a Job or Pod isn't controlled by a
higher-level resource (e.g. a CronJob for Jobs or a Job for Pods), or owned by
some other resource, it's difficult for users to clean them up automatically,
and those Jobs and Pods can easily accumulate and overload a Kubernetes
cluster. Even if we can avoid the overload issue by implementing a cluster-wide
(global) resource quota, users won't be able to create new resources without
cleaning up old ones first. See [#64470][].

The design of this proposal can later be generalized to other finishable,
frequently created, short-lived resources, such as completed Pods or finished
custom resources.

[#64470]: https://github.com/kubernetes/kubernetes/issues/64470

### Goals

Make it easy for users to specify a time-based clean-up mechanism for finished
resource objects.

* It's configurable at resource creation time and after the resource is created.

## Proposal

[K8s Proposal: TTL controller for finished Jobs and Pods][]

[K8s Proposal: TTL controller for finished Jobs and Pods]: https://docs.google.com/document/d/1U6h1DrRJNuQlL2_FYY_FdkQhgtTRn1kEylEOHRoESTc/edit

### Concrete Use Cases

* [Kubeflow][] needs to clean up old finished Jobs (K8s Jobs, TF Jobs, Argo
  workflows, etc.), see [#718][].

* [Prow][] needs to clean up old completed Pods & finished Jobs. This is
  currently implemented with Prow's sinker.

* [Apache Spark on Kubernetes][] needs proper cleanup of terminated Spark
  executor Pods.

* The Jenkins Kubernetes plugin creates slave pods that execute builds. It
  needs a better way to clean up old completed Pods.

[Kubeflow]: https://github.com/kubeflow
[#718]: https://github.com/kubeflow/tf-operator/issues/718
[Prow]: https://github.com/kubernetes/test-infra/tree/master/prow
[Apache Spark on Kubernetes]: http://spark.apache.org/docs/latest/running-on-kubernetes.html

### Detailed Design

#### Feature Gate

This will be launched as an alpha feature first, behind the feature gate
`TTLAfterFinished`.

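For illustration only, the sketch below shows how such a gate might be
registered and checked using the usual Kubernetes feature-gate helpers
(`k8s.io/component-base/featuregate` and `k8s.io/apiserver/pkg/util/feature`);
the constant name and registration are assumptions, not the final
implementation:

```go
// Sketch: register the alpha gate and check it before starting the new
// controller. Names are illustrative; error handling is omitted for brevity.
package main

import (
	"fmt"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// TTLAfterFinished is the assumed feature-gate key for this proposal.
const TTLAfterFinished featuregate.Feature = "TTLAfterFinished"

func init() {
	// Alpha features default to disabled.
	_ = utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		TTLAfterFinished: {Default: false, PreRelease: featuregate.Alpha},
	})
}

func main() {
	if utilfeature.DefaultFeatureGate.Enabled(TTLAfterFinished) {
		fmt.Println("starting TTL-after-finished controller")
	} else {
		fmt.Println("TTLAfterFinished is disabled; skipping the controller")
	}
}
```

Cluster operators would typically opt in with something like
`--feature-gates=TTLAfterFinished=true` on the kube-apiserver and
kube-controller-manager.
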
#### API Object

We will add the following API field to `JobSpec` (`Job`'s `.spec`):

```go
type JobSpec struct {
	// ttlSecondsAfterFinished limits the lifetime of a Job that has finished
	// execution (either Complete or Failed). If this field is set, once the Job
	// finishes, it will be deleted after ttlSecondsAfterFinished expires. When
	// the Job is being deleted, its lifecycle guarantees (e.g. finalizers) will
	// be honored. If this field is unset, ttlSecondsAfterFinished will not
	// expire. If this field is set to zero, ttlSecondsAfterFinished expires
	// immediately after the Job finishes.
	// This field is alpha-level and is only honored by servers that enable the
	// TTLAfterFinished feature.
	// +optional
	TTLSecondsAfterFinished *int32
}
```

This allows Jobs to be cleaned up after they finish and provides time for
asynchronous clients to observe Jobs' final states before they are deleted.

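As a usage illustration only, a client could set the field when creating a Job.
The snippet below is a minimal sketch assuming a recent client-go and the
generated `batch/v1` types; the Job name, image, and error handling are
placeholders:

```go
// Sketch: create a Job whose finished instance is cleaned up one hour after
// it completes. Illustrative only.
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ttl := int32(3600) // delete the Job one hour after it finishes
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "pi"},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "pi", Image: "perl", Command: []string{"perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"}},
					},
				},
			},
		},
	}

	if _, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Setting `ttl` to 0 would make the Job eligible for deletion as soon as it
finishes.
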
Similarly, we will add the following API field to `PodSpec` (`Pod`'s `.spec`):

```go
type PodSpec struct {
	// ttlSecondsAfterFinished limits the lifetime of a Pod that has finished
	// execution (either Succeeded or Failed). If this field is set, once the Pod
	// finishes, it will be deleted after ttlSecondsAfterFinished expires. When
	// the Pod is being deleted, its lifecycle guarantees (e.g. finalizers) will
	// be honored. If this field is unset, ttlSecondsAfterFinished will not
	// expire. If this field is set to zero, ttlSecondsAfterFinished expires
	// immediately after the Pod finishes.
	// This field is alpha-level and is only honored by servers that enable the
	// TTLAfterFinished feature.
	// +optional
	TTLSecondsAfterFinished *int32
}
```

##### Validation

Because the Job controller depends on its Pods existing in order to work
correctly, Job validation will reject a Job whose pod template sets
`ttlSecondsAfterFinished`, to prevent users from breaking their own Jobs. Users
should set TTL seconds on a Job, not on the Pods owned by a Job.

It is common for higher-level resources to call generic PodSpec validation;
therefore, in PodSpec validation, `ttlSecondsAfterFinished` is only allowed to
be set on a PodSpec whose `restartPolicy` is either `OnFailure` or `Never`
(i.e. not `Always`).

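A minimal sketch of how these rules might be expressed, assuming the proposed
fields exist on the internal `PodSpec`/`JobSpec` types and using the standard
`field.ErrorList` helpers; the function names are illustrative, not the actual
kube-apiserver validation code:

```go
// Sketch of the two validation rules described above. Illustrative only.
package validation

import (
	"k8s.io/apimachinery/pkg/util/validation/field"
	batch "k8s.io/kubernetes/pkg/apis/batch"
	api "k8s.io/kubernetes/pkg/apis/core"
)

// validateTTLSecondsAfterFinished allows the field only on PodSpecs that can
// actually finish, i.e. restartPolicy OnFailure or Never.
func validateTTLSecondsAfterFinished(spec *api.PodSpec, fldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}
	if spec.TTLSecondsAfterFinished != nil && spec.RestartPolicy == api.RestartPolicyAlways {
		allErrs = append(allErrs, field.Forbidden(fldPath.Child("ttlSecondsAfterFinished"),
			"may only be set when restartPolicy is OnFailure or Never"))
	}
	return allErrs
}

// validateJobPodTemplate rejects a TTL on the Job's pod template; the TTL
// belongs on the Job itself.
func validateJobPodTemplate(spec *batch.JobSpec, fldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}
	if spec.Template.Spec.TTLSecondsAfterFinished != nil {
		allErrs = append(allErrs, field.Forbidden(
			fldPath.Child("template", "spec", "ttlSecondsAfterFinished"),
			"set ttlSecondsAfterFinished on the Job, not on its pod template"))
	}
	return allErrs
}
```
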
### User Stories

A user keeps creating Jobs in a small Kubernetes cluster with 4 nodes. The Jobs
accumulate over time, and a year later the cluster ends up with more than 100k
old Jobs. This causes etcd hiccups and long, high-latency etcd requests, and
eventually makes the cluster unavailable.

This problem could easily have been avoided with the TTL controller for Jobs.

The steps are as easy as:

1. When creating Jobs, the user sets the Jobs' `.spec.ttlSecondsAfterFinished`
   to 3600 (i.e. 1 hour).
1. The user deploys Jobs as usual.
1. After a Job finishes, the result is observed asynchronously within an hour
   and stored elsewhere.
1. The TTL controller cleans up Jobs 1 hour after they complete.

### Implementation Details/Notes/Constraints

#### TTL Controller

We will add a TTL controller for finished Jobs and finished Pods. We considered
adding it to the Job controller, but decided not to, for the following reasons:

1. The Job controller should focus on managing Pods based on the Job's spec and
   pod template, not on cleaning up Jobs.
1. We also need the TTL controller to clean up finished Pods, and we are
   considering generalizing the TTL controller to custom resources later.

The TTL controller uses the informer framework, watches all Jobs and Pods, and
reads Jobs and Pods from a local cache, as sketched below.

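For illustration, a minimal sketch of that wiring using client-go shared
informers and a rate-limited workqueue; only the Job path is shown, the names
are placeholders, and the actual TTL/delete logic is omitted:

```go
// Sketch: watch Jobs via a shared informer, read them from the local cache,
// and feed keys through a rate-limited workqueue. Illustrative only.
package ttl

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func runTTLAfterFinishedController(client kubernetes.Interface, stopCh <-chan struct{}) {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	jobInformer := factory.Batch().V1().Jobs()

	jobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { enqueue(queue, obj) },
		UpdateFunc: func(_, newObj interface{}) { enqueue(queue, newObj) },
	})

	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, jobInformer.Informer().HasSynced)

	for {
		item, quit := queue.Get()
		if quit {
			return
		}
		ns, name, err := cache.SplitMetaNamespaceKey(item.(string))
		if err == nil {
			// Read from the informer's local cache; the expiry check and
			// deletion would happen here (see "Finished Jobs" below).
			if job, err := jobInformer.Lister().Jobs(ns).Get(name); err == nil {
				fmt.Printf("would check TTL for Job %s/%s\n", ns, job.Name)
			}
		}
		queue.Done(item)
	}
}

func enqueue(queue workqueue.RateLimitingInterface, obj interface{}) {
	if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
		queue.Add(key)
	}
}
```
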
#### Finished Jobs

When a Job is created or updated:

1. Check its `.status.conditions` to see whether it has finished (`Complete` or
   `Failed`). If it hasn't finished, do nothing.
1. Otherwise, if the Job has finished, check whether the Job's
   `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is
   not set.
1. Otherwise, if the TTL field is set, check whether the TTL has expired, i.e.
   whether now is later than the Job's finish time (the `lastTransitionTime` of
   its `Complete` or `Failed` condition) plus `.spec.ttlSecondsAfterFinished`
   (see the sketch after this list).
1. If the TTL hasn't expired, re-enqueue the Job to be processed again after the
   remaining time, computed as (the Job's finish time +
   `.spec.ttlSecondsAfterFinished` - now).
1. If the TTL has expired, `GET` the Job from the API server to do final sanity
   checks before deleting it.
1. Check whether the freshly fetched Job's TTL has expired. The field may have
   been updated before the TTL controller observed the new value in its local
   cache.
   * If it hasn't expired, it is not safe to delete the Job. Re-enqueue the Job
     to be processed again after the remaining time until it expires.
1. Delete the Job if it passes the sanity checks.

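For illustration, the expiry check above reduces to a small time computation.
The sketch below assumes the `batch/v1` Job types; the helper names are
placeholders, not the actual controller code:

```go
// Sketch: compute how long until a finished Job's TTL expires. A zero or
// negative result means the TTL has already expired. Illustrative only.
package ttl

import (
	"errors"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFinishTime returns the lastTransitionTime of the Complete or Failed condition.
func jobFinishTime(job *batchv1.Job) (time.Time, error) {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
			return c.LastTransitionTime.Time, nil
		}
	}
	return time.Time{}, errors.New("job has not finished")
}

// timeLeft returns the remaining time until the Job's TTL expires.
func timeLeft(job *batchv1.Job, now time.Time) (time.Duration, error) {
	if job.Spec.TTLSecondsAfterFinished == nil {
		return 0, errors.New("ttlSecondsAfterFinished is not set")
	}
	finishAt, err := jobFinishTime(job)
	if err != nil {
		return 0, err
	}
	expireAt := finishAt.Add(time.Duration(*job.Spec.TTLSecondsAfterFinished) * time.Second)
	return expireAt.Sub(now), nil
}
```

A positive result is the delay to use when re-enqueuing the Job; a zero or
negative result means the Job can be deleted.
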
#### Finished Pods

When a Pod is created or updated:

1. Check its `.status.phase` to see whether it has finished (`Succeeded` or
   `Failed`). If it hasn't finished, do nothing.
1. Otherwise, if the Pod has finished, check whether the Pod's
   `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is
   not set.
1. Otherwise, if the TTL field is set, check whether the TTL has expired, i.e.
   whether now is later than the Pod's finish time (the maximum of its
   containers' termination times,
   `.status.containerStatuses[*].state.terminated.finishedAt`) plus
   `.spec.ttlSecondsAfterFinished` (see the sketch after this list).
1. If the TTL hasn't expired, re-enqueue the Pod to be processed again after the
   remaining time, computed as (the Pod's finish time +
   `.spec.ttlSecondsAfterFinished` - now).
1. If the TTL has expired, `GET` the Pod from the API server to do final sanity
   checks before deleting it.
1. Check whether the freshly fetched Pod's TTL has expired. The field may have
   been updated before the TTL controller observed the new value in its local
   cache.
   * If it hasn't expired, it is not safe to delete the Pod. Re-enqueue the Pod
     to be processed again after the remaining time until it expires.
1. Delete the Pod if it passes the sanity checks.

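For illustration, the Pod finish-time lookup (the latest `finishedAt` across
container statuses) could look like the following sketch, assuming the
`core/v1` types; the helper name is a placeholder:

```go
// Sketch: a finished Pod's finish time is the latest terminated container's
// finishedAt. Illustrative only; not the actual controller code.
package ttl

import (
	"errors"
	"time"

	corev1 "k8s.io/api/core/v1"
)

func podFinishTime(pod *corev1.Pod) (time.Time, error) {
	if pod.Status.Phase != corev1.PodSucceeded && pod.Status.Phase != corev1.PodFailed {
		return time.Time{}, errors.New("pod has not finished")
	}
	var finishedAt time.Time
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Terminated == nil {
			continue
		}
		if t := cs.State.Terminated.FinishedAt.Time; t.After(finishedAt) {
			finishedAt = t
		}
	}
	if finishedAt.IsZero() {
		return time.Time{}, errors.New("no terminated container status recorded")
	}
	return finishedAt, nil
}
```
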
#### Owner References

We considered making the TTL controller leave a Job/Pod around even after its
TTL expires, if the Job/Pod has any owner specified in its
`.metadata.ownerReferences`.

We decided not to block deletion on owners, because the purpose of
`.metadata.ownerReferences` is cascading deletion, not keeping an owner's
dependents alive. If a Job is owned by a CronJob, the Job can be cleaned up
based on the CronJob's history limit (i.e. the number of dependent Jobs to
keep), or the CronJob can leave the history limit unset and instead set a TTL
in its Job template so that Jobs are cleaned up when the TTL expires rather
than by history limit capacity.

Therefore, a Job/Pod can be deleted after its TTL expires, even if it still has
owners.

Similarly, the TTL won't block deletion by the generic garbage collector. This
means that when a Job's or Pod's owners are gone, the generic garbage collector
will delete it, even if it hasn't finished or its TTL hasn't expired.

### Risks and Mitigations

Risks:

* Time skew may cause the TTL controller to clean up resource objects at the
  wrong time.

Mitigations:

* In Kubernetes, it's required to run NTP on all nodes ([#6159][]) to avoid
  time skew. We will also document this risk.

[#6159]: https://github.com/kubernetes/kubernetes/issues/6159#issuecomment-93844058

## Graduation Criteria

We want to implement this feature for Pods/Jobs first to gather feedback, and
then decide whether to generalize it to custom resources. This feature can be
promoted to beta after we finalize that decision, and when it satisfies users'
need for cleaning up finished resource objects without regressions.

This will be promoted to GA once it has been in beta for a sufficient amount of
time with no changes.

## Implementation History

TBD