---
kep-number: 26
title: TTL After Finished
authors:
  - "@janetkuo"
owning-sig: sig-apps
participating-sigs:
  - sig-api-machinery
reviewers:
  - "@enisoc"
  - "@tnozicka"
approvers:
  - "@kow3ns"
editor: TBD
creation-date: 2018-08-16
last-updated: 2018-08-16
status: provisional
see-also:
  - n/a
replaces:
  - n/a
superseded-by:
  - n/a
---

# TTL After Finished Controller

## Table of Contents

A table of contents is helpful for quickly jumping to sections of a KEP and for
highlighting any additional information provided beyond the standard KEP
template. [Tools for generating][] a table of contents from markdown are
available.

* [TTL After Finished Controller](#ttl-after-finished-controller)
  * [Table of Contents](#table-of-contents)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
  * [Proposal](#proposal)
    * [Concrete Use Cases](#concrete-use-cases)
    * [Detailed Design](#detailed-design)
      * [Feature Gate](#feature-gate)
      * [API Object](#api-object)
        * [Validation](#validation)
    * [User Stories](#user-stories)
    * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
      * [TTL Controller](#ttl-controller)
      * [Finished Jobs](#finished-jobs)
      * [Finished Pods](#finished-pods)
      * [Owner References](#owner-references)
    * [Risks and Mitigations](#risks-and-mitigations)
  * [Graduation Criteria](#graduation-criteria)
  * [Implementation History](#implementation-history)

[Tools for generating]: https://github.com/ekalinin/github-markdown-toc

## Summary

We propose a TTL mechanism to limit the lifetime of finished resource objects,
including Jobs and Pods, to make it easy for users to clean up old Jobs/Pods
after they finish. The TTL timer starts when the Job/Pod finishes, and the
finished Job/Pod will be cleaned up after the TTL expires.

## Motivation

In Kubernetes, finishable resources, such as Jobs and Pods, are often
frequently created and short-lived. If a Job or Pod isn't controlled by a
higher-level resource (e.g. a CronJob for Jobs or a Job for Pods), or owned by
some other resource, it's difficult for users to clean them up automatically,
and those Jobs and Pods can easily accumulate and overload a Kubernetes
cluster. Even if we can avoid the overload issue by implementing a cluster-wide
(global) resource quota, users won't be able to create new resources without
cleaning up old ones first. See [#64470][].

The design of this proposal can later be generalized to other finishable,
frequently created, short-lived resources, such as completed Pods or finished
custom resources.

[#64470]: https://github.com/kubernetes/kubernetes/issues/64470

### Goals

Make it easy for users to specify a time-based clean-up mechanism for finished
resource objects.

* It's configurable at resource creation time and after the resource is created.

## Proposal

[K8s Proposal: TTL controller for finished Jobs and Pods][]

[K8s Proposal: TTL controller for finished Jobs and Pods]: https://docs.google.com/document/d/1U6h1DrRJNuQlL2_FYY_FdkQhgtTRn1kEylEOHRoESTc/edit

### Concrete Use Cases

* [Kubeflow][] needs to clean up old finished Jobs (K8s Jobs, TF Jobs, Argo
  workflows, etc.), see [#718][].

* [Prow][] needs to clean up old completed Pods & finished Jobs. This is
  currently implemented with Prow's sinker.

* [Apache Spark on Kubernetes][] needs proper cleanup of terminated Spark
  executor Pods.

* The Jenkins Kubernetes plugin creates slave pods that execute builds. It
  needs a better way to clean up old completed Pods.

[Kubeflow]: https://github.com/kubeflow
[#718]: https://github.com/kubeflow/tf-operator/issues/718
[Prow]: https://github.com/kubernetes/test-infra/tree/master/prow
[Apache Spark on Kubernetes]: http://spark.apache.org/docs/latest/running-on-kubernetes.html

### Detailed Design

#### Feature Gate

This will be launched as an alpha feature first, behind the feature gate
`TTLAfterFinished`.

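For illustration only, the sketch below shows how such a gate might be
registered and checked using the usual Kubernetes feature-gate helpers
(`k8s.io/component-base/featuregate` and `k8s.io/apiserver/pkg/util/feature`);
the constant name and registration are assumptions, not the final
implementation:

```go
// Sketch: register the alpha gate and check it before starting the new
// controller. Names are illustrative; error handling is omitted for brevity.
package main

import (
	"fmt"

	utilfeature "k8s.io/apiserver/pkg/util/feature"
	"k8s.io/component-base/featuregate"
)

// TTLAfterFinished is the assumed feature-gate key for this proposal.
const TTLAfterFinished featuregate.Feature = "TTLAfterFinished"

func init() {
	// Alpha features default to disabled.
	_ = utilfeature.DefaultMutableFeatureGate.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		TTLAfterFinished: {Default: false, PreRelease: featuregate.Alpha},
	})
}

func main() {
	if utilfeature.DefaultFeatureGate.Enabled(TTLAfterFinished) {
		fmt.Println("starting TTL-after-finished controller")
	} else {
		fmt.Println("TTLAfterFinished is disabled; skipping the controller")
	}
}
```

Cluster operators would typically opt in with something like
`--feature-gates=TTLAfterFinished=true` on the kube-apiserver and
kube-controller-manager.
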
#### API Object

We will add the following API field to `JobSpec` (`Job`'s `.spec`):

```go
type JobSpec struct {
	// ttlSecondsAfterFinished limits the lifetime of a Job that has finished
	// execution (either Complete or Failed). If this field is set, once the Job
	// finishes, it will be deleted after ttlSecondsAfterFinished expires. When
	// the Job is being deleted, its lifecycle guarantees (e.g. finalizers) will
	// be honored. If this field is unset, ttlSecondsAfterFinished will not
	// expire. If this field is set to zero, ttlSecondsAfterFinished expires
	// immediately after the Job finishes.
	// This field is alpha-level and is only honored by servers that enable the
	// TTLAfterFinished feature.
	// +optional
	TTLSecondsAfterFinished *int32
}
```

This allows Jobs to be cleaned up after they finish and provides time for
asynchronous clients to observe Jobs' final states before they are deleted.

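As a usage illustration only, a client could set the field when creating a Job.
The snippet below is a minimal sketch assuming a recent client-go and the
generated `batch/v1` types; the Job name, image, and error handling are
placeholders:

```go
// Sketch: create a Job whose finished instance is cleaned up one hour after
// it completes. Illustrative only.
package main

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	ttl := int32(3600) // delete the Job one hour after it finishes
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "pi"},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyNever,
					Containers: []corev1.Container{
						{Name: "pi", Image: "perl", Command: []string{"perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"}},
					},
				},
			},
		},
	}

	if _, err := client.BatchV1().Jobs("default").Create(context.TODO(), job, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```

Setting `ttl` to 0 would make the Job eligible for deletion as soon as it
finishes.
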
Similarly, we will add the following API field to `PodSpec` (`Pod`'s `.spec`):

```go
type PodSpec struct {
	// ttlSecondsAfterFinished limits the lifetime of a Pod that has finished
	// execution (either Succeeded or Failed). If this field is set, once the Pod
	// finishes, it will be deleted after ttlSecondsAfterFinished expires. When
	// the Pod is being deleted, its lifecycle guarantees (e.g. finalizers) will
	// be honored. If this field is unset, ttlSecondsAfterFinished will not
	// expire. If this field is set to zero, ttlSecondsAfterFinished expires
	// immediately after the Pod finishes.
	// This field is alpha-level and is only honored by servers that enable the
	// TTLAfterFinished feature.
	// +optional
	TTLSecondsAfterFinished *int32
}
```

##### Validation

Because the Job controller depends on its Pods existing in order to work
correctly, Job validation will reject a Job whose pod template sets
`ttlSecondsAfterFinished`, to prevent users from breaking their own Jobs. Users
should set TTL seconds on a Job, not on the Pods owned by a Job.

It is common for higher-level resources to call generic PodSpec validation;
therefore, in PodSpec validation, `ttlSecondsAfterFinished` is only allowed to
be set on a PodSpec whose `restartPolicy` is either `OnFailure` or `Never`
(i.e. not `Always`).

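A minimal sketch of how these rules might be expressed, assuming the proposed
fields exist on the internal `PodSpec`/`JobSpec` types and using the standard
`field.ErrorList` helpers; the function names are illustrative, not the actual
kube-apiserver validation code:

```go
// Sketch of the two validation rules described above. Illustrative only.
package validation

import (
	"k8s.io/apimachinery/pkg/util/validation/field"
	batch "k8s.io/kubernetes/pkg/apis/batch"
	api "k8s.io/kubernetes/pkg/apis/core"
)

// validateTTLSecondsAfterFinished allows the field only on PodSpecs that can
// actually finish, i.e. restartPolicy OnFailure or Never.
func validateTTLSecondsAfterFinished(spec *api.PodSpec, fldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}
	if spec.TTLSecondsAfterFinished != nil && spec.RestartPolicy == api.RestartPolicyAlways {
		allErrs = append(allErrs, field.Forbidden(fldPath.Child("ttlSecondsAfterFinished"),
			"may only be set when restartPolicy is OnFailure or Never"))
	}
	return allErrs
}

// validateJobPodTemplate rejects a TTL on the Job's pod template; the TTL
// belongs on the Job itself.
func validateJobPodTemplate(spec *batch.JobSpec, fldPath *field.Path) field.ErrorList {
	allErrs := field.ErrorList{}
	if spec.Template.Spec.TTLSecondsAfterFinished != nil {
		allErrs = append(allErrs, field.Forbidden(
			fldPath.Child("template", "spec", "ttlSecondsAfterFinished"),
			"set ttlSecondsAfterFinished on the Job, not on its pod template"))
	}
	return allErrs
}
```
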
### User Stories

A user keeps creating Jobs in a small Kubernetes cluster with 4 nodes. The Jobs
accumulate over time, and a year later the cluster ends up with more than 100k
old Jobs. This causes etcd hiccups and long, high-latency etcd requests, and
eventually makes the cluster unavailable.

This problem could easily have been avoided with the TTL controller for Jobs.

The steps are as easy as:

1. When creating Jobs, the user sets the Jobs' `.spec.ttlSecondsAfterFinished`
   to 3600 (i.e. 1 hour).
1. The user deploys Jobs as usual.
1. After a Job finishes, the result is observed asynchronously within an hour
   and stored elsewhere.
1. The TTL controller cleans up Jobs 1 hour after they complete.

### Implementation Details/Notes/Constraints

#### TTL Controller

We will add a TTL controller for finished Jobs and finished Pods. We considered
adding it to the Job controller, but decided not to, for the following reasons:

1. The Job controller should focus on managing Pods based on the Job's spec and
   pod template, not on cleaning up Jobs.
1. We also need the TTL controller to clean up finished Pods, and we are
   considering generalizing the TTL controller to custom resources later.

The TTL controller uses the informer framework, watches all Jobs and Pods, and
reads Jobs and Pods from a local cache, as sketched below.

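For illustration, a minimal sketch of that wiring using client-go shared
informers and a rate-limited workqueue; only the Job path is shown, the names
are placeholders, and the actual TTL/delete logic is omitted:

```go
// Sketch: watch Jobs via a shared informer, read them from the local cache,
// and feed keys through a rate-limited workqueue. Illustrative only.
package ttl

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func runTTLAfterFinishedController(client kubernetes.Interface, stopCh <-chan struct{}) {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	jobInformer := factory.Batch().V1().Jobs()

	jobInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { enqueue(queue, obj) },
		UpdateFunc: func(_, newObj interface{}) { enqueue(queue, newObj) },
	})

	factory.Start(stopCh)
	cache.WaitForCacheSync(stopCh, jobInformer.Informer().HasSynced)

	for {
		item, quit := queue.Get()
		if quit {
			return
		}
		ns, name, err := cache.SplitMetaNamespaceKey(item.(string))
		if err == nil {
			// Read from the informer's local cache; the expiry check and
			// deletion would happen here (see "Finished Jobs" below).
			if job, err := jobInformer.Lister().Jobs(ns).Get(name); err == nil {
				fmt.Printf("would check TTL for Job %s/%s\n", ns, job.Name)
			}
		}
		queue.Done(item)
	}
}

func enqueue(queue workqueue.RateLimitingInterface, obj interface{}) {
	if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
		queue.Add(key)
	}
}
```
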
#### Finished Jobs

When a Job is created or updated:

1. Check its `.status.conditions` to see whether it has finished (`Complete` or
   `Failed`). If it hasn't finished, do nothing.
1. Otherwise, if the Job has finished, check whether the Job's
   `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is
   not set.
1. Otherwise, if the TTL field is set, check whether the TTL has expired, i.e.
   whether now is later than the Job's finish time (the `lastTransitionTime` of
   its `Complete` or `Failed` condition) plus `.spec.ttlSecondsAfterFinished`
   (see the sketch after this list).
1. If the TTL hasn't expired, re-enqueue the Job to be processed again after the
   remaining time, computed as (the Job's finish time +
   `.spec.ttlSecondsAfterFinished` - now).
1. If the TTL has expired, `GET` the Job from the API server to do final sanity
   checks before deleting it.
1. Check whether the freshly fetched Job's TTL has expired. The field may have
   been updated before the TTL controller observed the new value in its local
   cache.
   * If it hasn't expired, it is not safe to delete the Job. Re-enqueue the Job
     to be processed again after the remaining time until it expires.
1. Delete the Job if it passes the sanity checks.

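For illustration, the expiry check above reduces to a small time computation.
The sketch below assumes the `batch/v1` Job types; the helper names are
placeholders, not the actual controller code:

```go
// Sketch: compute how long until a finished Job's TTL expires. A zero or
// negative result means the TTL has already expired. Illustrative only.
package ttl

import (
	"errors"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobFinishTime returns the lastTransitionTime of the Complete or Failed condition.
func jobFinishTime(job *batchv1.Job) (time.Time, error) {
	for _, c := range job.Status.Conditions {
		if (c.Type == batchv1.JobComplete || c.Type == batchv1.JobFailed) && c.Status == corev1.ConditionTrue {
			return c.LastTransitionTime.Time, nil
		}
	}
	return time.Time{}, errors.New("job has not finished")
}

// timeLeft returns the remaining time until the Job's TTL expires.
func timeLeft(job *batchv1.Job, now time.Time) (time.Duration, error) {
	if job.Spec.TTLSecondsAfterFinished == nil {
		return 0, errors.New("ttlSecondsAfterFinished is not set")
	}
	finishAt, err := jobFinishTime(job)
	if err != nil {
		return 0, err
	}
	expireAt := finishAt.Add(time.Duration(*job.Spec.TTLSecondsAfterFinished) * time.Second)
	return expireAt.Sub(now), nil
}
```

A positive result is the delay to use when re-enqueuing the Job; a zero or
negative result means the Job can be deleted.
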
#### Finished Pods

When a Pod is created or updated:

1. Check its `.status.phase` to see whether it has finished (`Succeeded` or
   `Failed`). If it hasn't finished, do nothing.
1. Otherwise, if the Pod has finished, check whether the Pod's
   `.spec.ttlSecondsAfterFinished` field is set. Do nothing if the TTL field is
   not set.
1. Otherwise, if the TTL field is set, check whether the TTL has expired, i.e.
   whether now is later than the Pod's finish time (the maximum of its
   containers' termination times,
   `.status.containerStatuses[*].state.terminated.finishedAt`) plus
   `.spec.ttlSecondsAfterFinished` (see the sketch after this list).
1. If the TTL hasn't expired, re-enqueue the Pod to be processed again after the
   remaining time, computed as (the Pod's finish time +
   `.spec.ttlSecondsAfterFinished` - now).
1. If the TTL has expired, `GET` the Pod from the API server to do final sanity
   checks before deleting it.
1. Check whether the freshly fetched Pod's TTL has expired. The field may have
   been updated before the TTL controller observed the new value in its local
   cache.
   * If it hasn't expired, it is not safe to delete the Pod. Re-enqueue the Pod
     to be processed again after the remaining time until it expires.
1. Delete the Pod if it passes the sanity checks.

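For illustration, the Pod finish-time lookup (the latest `finishedAt` across
container statuses) could look like the following sketch, assuming the
`core/v1` types; the helper name is a placeholder:

```go
// Sketch: a finished Pod's finish time is the latest terminated container's
// finishedAt. Illustrative only; not the actual controller code.
package ttl

import (
	"errors"
	"time"

	corev1 "k8s.io/api/core/v1"
)

func podFinishTime(pod *corev1.Pod) (time.Time, error) {
	if pod.Status.Phase != corev1.PodSucceeded && pod.Status.Phase != corev1.PodFailed {
		return time.Time{}, errors.New("pod has not finished")
	}
	var finishedAt time.Time
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.State.Terminated == nil {
			continue
		}
		if t := cs.State.Terminated.FinishedAt.Time; t.After(finishedAt) {
			finishedAt = t
		}
	}
	if finishedAt.IsZero() {
		return time.Time{}, errors.New("no terminated container status recorded")
	}
	return finishedAt, nil
}
```
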
#### Owner References

We considered making the TTL controller leave a Job/Pod around even after its
TTL expires, if the Job/Pod has any owner specified in its
`.metadata.ownerReferences`.

We decided not to block deletion on owners, because the purpose of
`.metadata.ownerReferences` is cascading deletion, not keeping an owner's
dependents alive. If a Job is owned by a CronJob, the Job can be cleaned up
based on the CronJob's history limit (i.e. the number of dependent Jobs to
keep), or the CronJob can leave the history limit unset and instead set a TTL
in its Job template so that Jobs are cleaned up when the TTL expires rather
than by history limit capacity.

Therefore, a Job/Pod can be deleted after its TTL expires, even if it still has
owners.

Similarly, the TTL won't block deletion by the generic garbage collector. This
means that when a Job's or Pod's owners are gone, the generic garbage collector
will delete it, even if it hasn't finished or its TTL hasn't expired.

### Risks and Mitigations

Risks:

* Time skew may cause the TTL controller to clean up resource objects at the
  wrong time.

Mitigations:

* In Kubernetes, it's required to run NTP on all nodes ([#6159][]) to avoid
  time skew. We will also document this risk.

[#6159]: https://github.com/kubernetes/kubernetes/issues/6159#issuecomment-93844058

## Graduation Criteria

We want to implement this feature for Pods/Jobs first to gather feedback, and
then decide whether to generalize it to custom resources. This feature can be
promoted to beta after we finalize that decision, and when it satisfies users'
need for cleaning up finished resource objects without regressions.

This will be promoted to GA once it has been in beta for a sufficient amount of
time with no changes.

## Implementation History

TBD