# Provisioning Request CRD

author: kisieland

## Background

Currently CA does not provide any way to express that a group of pods would
like to have capacity available. This is caused by the fact that each CA loop
picks a group of unschedulable pods and works on provisioning capacity for
them, meaning that the grouping is random (as it depends on the kube-scheduler
and CA loop interactions). This is especially problematic in a couple of
cases:

- Users would like to have all-or-nothing semantics for their workloads.
  Currently CA will try to provision this capacity, and if it is only
  partially successful it will leave the partial capacity in the cluster until
  the user removes the workload.
- Users would like to lower e2e scale-up latency for huge scale-ups (100+
  nodes). Due to CA's nature and kube-scheduler throughput, CA will create
  partial scale-ups, e.g. `0->200->400->600` rather than one `0->600`. This
  significantly increases the e2e latency, as there is a non-negligible time
  tax on each scale-up operation.

## Proposal

### High level

Provisioning Request (abbr. ProvReq) is a new namespaced Custom Resource that
aims to allow users to ask CA for capacity for groups of pods. It allows users
to express the fact that a group of pods is connected and should be treated as
one entity. This AEP proposes an API that can have multiple provisioning
classes and can be extended by cloud-provider-specific ones. This object is
meant as a one-shot request to CA, so if CA fails to provision the capacity,
it is up to users to retry (such retry functionality can be added later on).

### ProvisioningRequest CRD

The following code snippets assume [kubebuilder](https://book.kubebuilder.io/)
is used to generate the CRD:
```go
// ProvisioningRequest is a way to express additional capacity
// that we would like to provision in the cluster. Cluster Autoscaler
// can use this information in its calculations and signal if the capacity
// is available in the cluster or actively add capacity if needed.
type ProvisioningRequest struct {
    metav1.TypeMeta `json:",inline"`
    // Standard object metadata. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
    //
    // +optional
    metav1.ObjectMeta `json:"metadata,omitempty"`

    // Spec contains the specification of the ProvisioningRequest object.
    // More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status.
    //
    // +kubebuilder:validation:Required
    Spec ProvisioningRequestSpec `json:"spec"`

    // Status of the ProvisioningRequest. CA constantly reconciles this field.
    //
    // +optional
    Status ProvisioningRequestStatus `json:"status,omitempty"`
}

// ProvisioningRequestList is a list of ProvisioningRequest objects.
type ProvisioningRequestList struct {
    metav1.TypeMeta `json:",inline"`
    // Standard list metadata.
    //
    // +optional
    metav1.ListMeta `json:"metadata"`

    // Items is the list of ProvisioningRequest objects returned from the API.
    //
    // +optional
    Items []ProvisioningRequest `json:"items"`
}

// ProvisioningRequestSpec is a specification of additional pods for which we
// would like to provision additional resources in the cluster.
type ProvisioningRequestSpec struct {
    // PodSets lists groups of pods for which we would like to provision
    // resources.
    //
    // +kubebuilder:validation:Required
    // +kubebuilder:validation:MinItems=1
    // +kubebuilder:validation:MaxItems=32
    // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
    PodSets []PodSet `json:"podSets"`

    // ProvisioningClass describes the different modes of provisioning the resources.
    // Supported values:
    // * check-capacity.kubernetes.io - check if the current cluster state can fulfill
    //   this request, do not reserve the capacity.
    // * atomic-scale-up.kubernetes.io - provision the resources in an atomic manner.
    // * ... - potential other classes that are specific to the cloud providers.
    //
    // +kubebuilder:validation:Required
    // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
    ProvisioningClass string `json:"provisioningClass"`

    // AdditionalParameters contains all other parameters custom classes may require.
    //
    // +optional
    // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
    AdditionalParameters map[string]string `json:"additionalParameters"`
}

type PodSet struct {
    // PodTemplateRef is a reference to a PodTemplate object representing the pods
    // that will consume this reservation (it must be within the same namespace).
    // Users need to make sure that the fields relevant to the scheduler (e.g. node
    // selector, tolerations) are consistent between this template and the actual
    // pods consuming the Provisioning Request.
    //
    // +kubebuilder:validation:Required
    PodTemplateRef Reference `json:"podTemplateRef"`
    // Count contains the number of pods that will be created with the given
    // template.
    //
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=16384
    Count int32 `json:"count"`
}

type Reference struct {
    // Name of the referenced object.
    // More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names#names
    //
    // +kubebuilder:validation:Required
    Name string `json:"name,omitempty"`
}

// ProvisioningRequestStatus represents the status of the resource reservation.
type ProvisioningRequestStatus struct {
    // Conditions represent the observations of a Provisioning Request's
    // current state. Those will contain information whether the capacity
    // was found/created or whether there were any issues. The condition types
    // may differ between different provisioning classes.
    //
    // +listType=map
    // +listMapKey=type
    // +patchStrategy=merge
    // +patchMergeKey=type
    // +optional
    Conditions []metav1.Condition `json:"conditions"`

    // AdditionalStatus contains all other status values custom provisioning classes may require.
    //
    // +optional
    // +kubebuilder:validation:MaxProperties=64
    AdditionalStatus map[string]string `json:"additionalStatus"`
}
```
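
For illustration, a ProvisioningRequest built from this CRD could look like
the manifest below. The API group/version and all object names are assumptions
made for the example; this AEP does not pin down the API group:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1  # assumed group/version, not fixed by this AEP
kind: ProvisioningRequest
metadata:
  name: batch-capacity
  namespace: default
spec:
  provisioningClass: check-capacity.kubernetes.io
  podSets:
  - podTemplateRef:
      name: batch-worker-template  # hypothetical PodTemplate in the same namespace
    count: 100
```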

### Provisioning Classes

#### check-capacity.kubernetes.io class

The `check-capacity.kubernetes.io` class is a one-off check that verifies
whether there is enough capacity in the cluster to provision a given set of
pods.

Note: If two such objects are created around the same time, CA will consider
them independently and place no guards on the capacity. Also, the capacity is
not reserved in any manner, so it may be scaled down.

#### atomic-scale-up.kubernetes.io class

The `atomic-scale-up.kubernetes.io` class aims to provision the resources
required for the specified pods in an atomic way. The proposed logic is to:

1. Try to provision the required VMs in one loop.
2. If that fails, remove the partially provisioned VMs and back off.
3. Stop the back-off after a given duration (optional), which would be passed
   via the `AdditionalParameters` field, using the `ValidUntilSeconds` key,
   and would contain a string denoting the duration for which we should retry
   (measured since creation of the CR).
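
As a sketch, assuming the same API group as in the earlier example, passing
the retry duration could look as follows (the `7200` value mirrors the `2h`
retry mentioned in the Future Expansions section):

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1  # assumed group/version
kind: ProvisioningRequest
metadata:
  name: atomic-batch
  namespace: default
spec:
  provisioningClass: atomic-scale-up.kubernetes.io
  additionalParameters:
    ValidUntilSeconds: "7200"  # retry for up to 2h, measured since creation of the CR
  podSets:
  - podTemplateRef:
      name: batch-worker-template  # hypothetical PodTemplate
    count: 600
```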

Note that the VMs created in this mode are subject to the scale-down logic,
so the window during which users need to create the pods is equal to the
value of the `--scale-down-unneeded-time` flag.

### Adding pods that consume a given ProvisioningRequest

To avoid generating double scale-ups and to exclude pods that are meant to
consume given capacity, CA should be able to differentiate those pods from all
other pods. To do so, users need to specify the following pod annotation (it
is not required in the ProvReq's template, though it can be specified there):

```yaml
annotations:
  "cluster-autoscaler.kubernetes.io/consume-provisioning-request": "provreq-name"
```
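
For example, a pod consuming a ProvisioningRequest named `batch-capacity`
(a hypothetical name) would carry the annotation in its metadata:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-worker-0
  namespace: default  # must be the ProvReq's namespace
  annotations:
    "cluster-autoscaler.kubernetes.io/consume-provisioning-request": "batch-capacity"
spec:
  containers:
  - name: worker
    image: registry.example.com/batch-worker:latest  # placeholder image
```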

If this annotation is provided on pods that consume a ProvReq with the
`check-capacity.kubernetes.io` class, CA will not provision the capacity even
if it is needed (as some other pods might have been scheduled on it), and
will instead emit visibility events to the ProvReq and the pods. If it is not
passed, CA will behave normally and provision the capacity if it is needed.

Note: CA will match all pods with this annotation to the corresponding ProvReq
and ignore them when executing a scale-up loop (so it is up to users to make
sure that the ProvReq count matches the number of created pods). If the
ProvReq is missing, all of the pods that consume it will remain unschedulable
indefinitely.

### CRD lifecycle

1. A ProvReq will be created either by the end user or by a framework. At this
   point the needed PodTemplate objects should also be created (a sample
   PodTemplate is sketched after this list).
2. CA will pick it up, choose a node pool (or create a new one if NAP is
   enabled), and try to create nodes.
3. If CA successfully creates capacity, the ProvReq will receive information
   about this fact in its `Conditions` field.
4. At this moment, users can create pods that will consume the ProvReq (in the
   same namespace); those will be scheduled on the capacity that was created
   by CA.
5. Once all of the pods are scheduled, users can delete the ProvReq object;
   otherwise it will be garbage collected after some time.
6. When the pods finish their work and nodes become unused, CA will scale them
   down.
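
A minimal PodTemplate matching the earlier examples might look like this
(names, image, and resources are illustrative):

```yaml
apiVersion: v1
kind: PodTemplate
metadata:
  name: batch-worker-template
  namespace: default  # same namespace as the ProvReq
template:
  spec:
    containers:
    - name: worker
      image: registry.example.com/batch-worker:latest  # placeholder image
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
    restartPolicy: Never
```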

Note: Users can create a ProvReq and the pods consuming it at the same time
(in a "fire and forget" manner), but this may result in the pods being
unschedulable and triggering user-configured alerts.

### Canceling the requests

To cancel a pending Provisioning Request with the atomic class, all that users
need to do is delete the Provisioning Request object. After that, CA will no
longer guard the nodes from deletion and will proceed with its standard
scale-down logic.

### Conditions

The following Condition types should encode the states of the ProvReq:

- Provisioned - the VMs were created successfully (atomic class)
- CapacityAvailable - the cluster contains enough capacity to schedule pods
  (check class)
  * `CapacityAvailable=true` denotes that the cluster contains enough capacity
    to schedule the pods
  * `CapacityAvailable=false` denotes that the cluster does not contain enough
    capacity to schedule the pods
- Failed - failed to create or check the capacity (both classes)

The Reasons and Messages will contain more details about why the specific
condition was triggered.
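
As an illustration, the status of a successfully checked ProvReq might look
roughly like this; the `reason` and `message` values are examples, not
prescribed by this AEP:

```yaml
status:
  conditions:
  - type: CapacityAvailable
    status: "True"
    lastTransitionTime: "2024-01-01T00:00:00Z"
    reason: CapacityFound  # example reason
    message: "Cluster can accommodate all pods from podSets."
```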

Providers of custom classes should reuse these conditions where applicable,
or create their own if items from the above list cannot be used to denote a
specific situation.

### CA implementation details

The proposed implementation is to handle each ProvReq in a separate scale-up
loop. This will require changes in multiple parts of CA:

1. Listing unschedulable pods, where:
   - pods that consume a ProvReq need to be filtered out
   - pods that are represented by the ProvReq need to be injected (we need to
     ensure those are treated as one group by the sharding logic)
2. Scale-up logic, which as of now has no notion of atomicity or grouping of
   pods. This is simplified as the ScaleUp logic was recently put [behind an
   interface](https://github.com/kubernetes/autoscaler/pull/5597).
   - This is where the biggest part of the change will be made. Here many
     parts of the logic assume best-effort semantics, and the scale-up size is
     lowered in many situations:
     - Estimation logic, which stops after some time-out or number of
       pods/nodes.
     - Size limiting, which caps the scale-up to match the size restrictions
       (on the node group or cluster level).
3. Node creation, which needs to support atomic resize, either via native
   cloud provider APIs or best-effort with node removal if CA is unable to
   fulfill the scale-up.
   - This is also a quite substantial change; we can provide a generic
     best-effort implementation that will try to scale up and clean up nodes
     if it is unsuccessful, but it is up to cloud providers to integrate with
     provider-specific APIs.
4. The scale-down path is not expected to change much, but users should follow
   [best practices](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node)
   to avoid CA disturbing their workloads.

## Testing

The following e2e test scenarios will be created to check whether ProvReq
handling works as expected:

1. A new ProvReq with the `check-capacity.kubernetes.io` provisioning class is
   created; CA checks whether there is enough capacity in the cluster to
   provision the specified pods.
2. A new ProvReq with the `atomic-scale-up.kubernetes.io` provisioning class
   is created; CA picks an appropriate node group and scales it up atomically.
3. A new atomic ProvReq is created for which NAP needs to provision a new node
   group. NAP creates it and CA scales it up atomically.
   - Here we should cover some of the different reasons why NAP may be
     required.
4. An atomic ProvReq fails due to node group size limits and NAP CPU and/or
   RAM limits.
5. Scalability tests.
   - A scenario in which many small ProvReqs are created (strain on the number
     of scale-up loops).
   - A scenario in which one big ProvReq is created (strain on a single
     scale-up loop).

## Limitations

The current Cluster Autoscaler implementation does not take
[Resource Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/)
into account. The current proposal is to not include handling of Resource
Quotas, but it could be added later on.

## Future Expansions

### ProvisioningClass CRD

One possible expansion of this approach is to introduce a ProvisioningClass
CRD, which follows the same approach as the
[StorageClass object](https://kubernetes.io/docs/concepts/storage/storage-classes/).
Such an approach would allow cluster administrators to introduce a list of
allowed ProvisioningClasses. Such a CRD can also contain a preset
configuration, i.e. administrators may specify that
`atomic-scale-up.kubernetes.io` should retry for up to `2h`.

Possible CRD definition:
```go
// ProvisioningClass is a way to express provisioning classes available in the cluster.
type ProvisioningClass struct {
    // Name denotes the name of the object, which is to be used in the
    // ProvisioningClass field of the Provisioning Request CRD.
    //
    // +kubebuilder:validation:Required
    Name string `json:"name"`

    // AdditionalParameters contains all other parameters custom classes may require.
    //
    // +optional
    AdditionalParameters map[string]string `json:"additionalParameters"`
}
```
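
Assuming the same API group as in the earlier examples, an administrator
could then define a class like the one below. Note that the sketch above
carries no standard object metadata, so the YAML mirrors its `name` field
directly; the exact shape is not fixed by this AEP:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1  # assumed group/version
kind: ProvisioningClass
name: atomic-scale-up.kubernetes.io
additionalParameters:
  ValidUntilSeconds: "7200"  # retry atomic scale-ups for up to 2h
```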