# Provisioning Request CRD

author: kisieland

## Background

Currently CA does not provide any way to express that a group of pods would like
to have a capacity available.
This is caused by the fact that each CA loop picks a group of unschedulable pods
and works on provisioning capacity for them, meaning that the grouping is random
(as it depends on the kube-scheduler and CA loop interactions).
This is especially problematic in a couple of cases:

- Users would like to have all-or-nothing semantics for their workloads.
  Currently CA will try to provision this capacity and, if it is only partially
  successful, it will leave it in the cluster until the user removes the workload.
- Users would like to lower e2e scale-up latency for huge scale-ups (100+
  nodes). Due to the CA's nature and kube-scheduler throughput, CA will create
  partial scale-ups, e.g. `0->200->400->600` rather than one `0->600`. This
  significantly increases the e2e latency as there is a non-negligible time tax
  on each scale-up operation.

## Proposal

### High level

Provisioning Request (abbr. ProvReq) is a new namespaced Custom Resource that
aims to allow users to ask CA for capacity for groups of pods.
It allows users to express the fact that a group of pods is connected and should
be treated as one entity.
This AEP proposes an API that can have multiple provisioning classes and can be
extended with cloud-provider-specific ones.
This object is meant as a one-shot request to CA, so if CA fails to provision
the capacity it is up to users to retry (such retry functionality can be added
later on).

### ProvisioningRequest CRD

The following code snippets assume [kubebuilder](https://book.kubebuilder.io/)
is used to generate the CRD:

```go
// ProvisioningRequest is a way to express additional capacity
// that we would like to provision in the cluster. Cluster Autoscaler
// can use this information in its calculations and signal if the capacity
// is available in the cluster or actively add capacity if needed.
type ProvisioningRequest struct {
	metav1.TypeMeta `json:",inline"`
	// Standard object metadata. More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#metadata
	//
	// +optional
	metav1.ObjectMeta `json:"metadata,omitempty"`
	// Spec contains the specification of the ProvisioningRequest object.
	// More info: https://git.k8s.io/community/contributors/devel/api-conventions.md#spec-and-status.
	//
	// +kubebuilder:validation:Required
	Spec ProvisioningRequestSpec `json:"spec"`
	// Status of the ProvisioningRequest. CA constantly reconciles this field.
	//
	// +optional
	Status ProvisioningRequestStatus `json:"status,omitempty"`
}

// ProvisioningRequestList is a list of ProvisioningRequest objects.
type ProvisioningRequestList struct {
	metav1.TypeMeta `json:",inline"`
	// Standard list metadata.
	//
	// +optional
	metav1.ListMeta `json:"metadata"`
	// Items is the list of ProvisioningRequest objects returned from the API.
	//
	// +optional
	Items []ProvisioningRequest `json:"items"`
}

// ProvisioningRequestSpec is a specification of additional pods for which we
// would like to provision additional resources in the cluster.
type ProvisioningRequestSpec struct {
	// PodSets lists groups of pods for which we would like to provision
	// resources.
	//
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:MinItems=1
	// +kubebuilder:validation:MaxItems=32
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
	PodSets []PodSet `json:"podSets"`

	// ProvisioningClass describes the different modes of provisioning the resources.
	// Supported values:
	// * check-capacity.kubernetes.io - check if the current cluster state can fulfill this request,
	//   do not reserve the capacity.
	// * atomic-scale-up.kubernetes.io - provision the resources in an atomic manner.
	// * ... - potential other classes that are specific to the cloud providers.
	//
	// +kubebuilder:validation:Required
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
	ProvisioningClass string `json:"provisioningClass"`

	// AdditionalParameters contains all other parameters that custom classes may require.
	//
	// +optional
	// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
	AdditionalParameters map[string]string `json:"additionalParameters"`
}

type PodSet struct {
	// PodTemplateRef is a reference to a PodTemplate object representing the pods
	// that will consume this reservation (it must be in the same namespace).
	// Users need to make sure that the fields relevant to the scheduler (e.g. node selector, tolerations)
	// are consistent between this template and the actual pods consuming the Provisioning Request.
	//
	// +kubebuilder:validation:Required
	PodTemplateRef Reference `json:"podTemplateRef"`
	// Count contains the number of pods that will be created with a given
	// template.
	//
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:validation:Maximum=16384
	Count int32 `json:"count"`
}

type Reference struct {
	// Name of the referenced object.
	// More info: https://kubernetes.io/docs/concepts/overview/working-with-objects/names#names
	//
	// +kubebuilder:validation:Required
	Name string `json:"name,omitempty"`
}

// ProvisioningRequestStatus represents the status of the resource reservation.
type ProvisioningRequestStatus struct {
	// Conditions represent the observations of a Provisioning Request's
	// current state. These will contain information on whether the capacity
	// was found/created or if there were any issues. The condition types
	// may differ between different provisioning classes.
	//
	// +listType=map
	// +listMapKey=type
	// +patchStrategy=merge
	// +patchMergeKey=type
	// +optional
	Conditions []metav1.Condition `json:"conditions"`

	// AdditionalStatus contains all other status values that custom provisioning classes may require.
	//
	// +optional
	// +kubebuilder:validation:MaxProperties=64
	AdditionalStatus map[string]string `json:"additionalStatus"`
}
```
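
For illustration, a ProvisioningRequest asking CA to check capacity for 100 pods
described by a PodTemplate could look as follows. The API group/version, the object
names, and the namespace are placeholders, not something this proposal defines:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1  # illustrative group/version
kind: ProvisioningRequest
metadata:
  name: my-provreq
  namespace: my-namespace
spec:
  provisioningClass: check-capacity.kubernetes.io
  podSets:
    - podTemplateRef:
        name: workload-template  # PodTemplate in the same namespace
      count: 100
```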

### Provisioning Classes

#### check-capacity.kubernetes.io class

The `check-capacity.kubernetes.io` class is a one-off check that verifies whether
the cluster has enough capacity to provision the given set of pods.

Note: If two such objects are created around the same time, CA will consider
them independently and place no guards on the capacity.
Also, the capacity is not reserved in any manner, so it may be scaled down.

#### atomic-scale-up.kubernetes.io class

The `atomic-scale-up.kubernetes.io` class aims to provision the resources required
for the specified pods in an atomic way. The proposed logic is to:
1. Try to provision the required VMs in one loop.
2. If that fails, remove the partially provisioned VMs and back off.
3. Stop the back-off after a given duration (optional), which would be passed
   via the `AdditionalParameters` field, using the `ValidUntilSeconds` key, and
   would contain a string denoting the duration for which we should retry
   (measured since creation of the CR).

Note that the VMs created in this mode are subject to the scale-down logic.
So the duration during which users need to create the Pods is equal to the
value of the `--scale-down-unneeded-time` flag.
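
As a sketch, an atomic request with a bounded retry window could look like the
example below. The names are placeholders, the API group/version is illustrative,
and the exact value format of `ValidUntilSeconds` is an assumption, not defined by
this proposal:

```yaml
apiVersion: autoscaling.x-k8s.io/v1beta1  # illustrative group/version
kind: ProvisioningRequest
metadata:
  name: batch-provreq
  namespace: my-namespace
spec:
  provisioningClass: atomic-scale-up.kubernetes.io
  additionalParameters:
    ValidUntilSeconds: "7200"  # assumed format: retry for up to 2h after CR creation
  podSets:
    - podTemplateRef:
        name: batch-worker-template
      count: 600
```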

### Adding pods that consume a given ProvisioningRequest

To avoid generating double scale-ups and to exclude pods that are meant to consume
a given capacity, CA should be able to differentiate those pods from all other pods.
To do so, users need to specify the following pod annotation (it is not required
in the ProvReq's template, though it can be specified there):

```yaml
annotations:
  "cluster-autoscaler.kubernetes.io/consume-provisioning-request": "provreq-name"
```

If it is provided on pods that consume a ProvReq with the `check-capacity.kubernetes.io` class,
the CA will not provision the capacity, even if it is needed (as some other pods might have been
scheduled on it), and will instead pass visibility events to the ProvReq and the pods.
If it is not provided, the CA will behave normally and provision the capacity if it is needed.

Note: CA will match all pods with this annotation to a corresponding ProvReq and
ignore them when executing a scale-up loop (so it is up to users to make sure
that the ProvReq count matches the number of created pods).
If the ProvReq is missing, all of the pods that consume it will remain unschedulable indefinitely.
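
For illustration, a Job whose pods consume a hypothetical ProvReq named `my-provreq`
could set the annotation in its pod template. The Job name, image, and resource
requests are placeholders; the scheduling-relevant fields should match the
PodTemplate referenced by the ProvReq:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-workload
  namespace: my-namespace
spec:
  parallelism: 100
  completions: 100
  template:
    metadata:
      annotations:
        "cluster-autoscaler.kubernetes.io/consume-provisioning-request": "my-provreq"
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: registry.example.com/worker:latest  # placeholder image
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
```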

### CRD lifecycle

1. A ProvReq will be created either by the end user or by a framework.
   At this point the needed PodTemplate objects should also be created
   (a minimal example is sketched at the end of this section).
2. CA will pick it up, choose a nodepool (or create a new one if NAP is
   enabled), and try to create nodes.
3. If CA successfully creates capacity, the ProvReq will receive information about
   this fact in the `Conditions` field.
4. At this moment, users can create the pods that will consume the ProvReq (in
   the same namespace); those will be scheduled on the capacity that was
   created by the CA.
5. Once all of the pods are scheduled, users can delete the ProvReq object,
   otherwise it will be garbage collected after some time.
6. When the pods finish their work and the nodes become unused, the CA will scale
   them down.

Note: Users can create a ProvReq and the pods consuming it at the same time (in a
"fire and forget" manner), but this may result in the pods being unschedulable
and triggering user-configured alerts.
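
To make step 1 concrete, a minimal PodTemplate that a PodSet could reference might
look like the sketch below. The names, image, and resource requests are illustrative
only:

```yaml
apiVersion: v1
kind: PodTemplate
metadata:
  name: workload-template
  namespace: my-namespace
template:
  spec:
    restartPolicy: Never
    containers:
      - name: worker
        image: registry.example.com/worker:latest  # placeholder image
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
```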

### Canceling the requests

To cancel a pending Provisioning Request with the atomic class, all users need to do
is delete the Provisioning Request object.
After that, the CA will no longer guard the nodes from deletion and will proceed with
the standard scale-down logic.

### Conditions

The following Condition types should encode the states of the ProvReq:

- Provisioned - VMs were created successfully (Atomic class)
- CapacityAvailable - cluster contains enough capacity to schedule pods (Check
  class)
  * `CapacityAvailable=true` will denote that the cluster contains enough capacity to schedule pods
  * `CapacityAvailable=false` will denote that the cluster does not contain enough capacity to schedule pods
- Failed - failed to create or check capacity (both classes)

The Reasons and Messages will contain more details about why the specific
condition was triggered.

Providers of custom classes should reuse these conditions where applicable or create
their own if items from the above list cannot be used to denote a specific situation.
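
As an illustration, the status of a check-capacity request for which capacity was
found might carry a condition like the one below. The reason, message, and timestamp
are placeholders, not values defined by this proposal:

```yaml
status:
  conditions:
    - type: CapacityAvailable
      status: "True"
      reason: CapacityFound  # illustrative reason
      message: "Capacity for the requested pod sets is available in the cluster"
      lastTransitionTime: "2024-01-01T12:00:00Z"
```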

### CA implementation details

The proposed implementation is to handle each ProvReq in a separate scale-up
loop. This will require changes in multiple parts of CA:

1. Listing unschedulable pods, where:
   - pods that consume a ProvReq need to be filtered out
   - pods that are represented by the ProvReq need to be injected (we need to
     ensure those are treated as one group by the sharding logic)
2. Scale-up logic, which as of now has no notion of atomicity or grouping of
   pods. This is simplified as the ScaleUp logic was recently put [behind an
   interface](https://github.com/kubernetes/autoscaler/pull/5597).
   - This is where the biggest part of the change will be made. Here many parts
     of the logic assume best-effort semantics and the scale-up size is lowered
     in many situations:
     - Estimation logic, which stops after some time-out or number of
       pods/nodes.
     - Size limiting, which caps the scale-up to match the size
       restrictions (on node group or cluster level).
3. Node creation, which needs to support atomic resize, either via native cloud
   provider APIs or best-effort with node removal if CA is unable to fulfill
   the scale-up.
   - This is also quite a substantial change; we can provide a generic
     best-effort implementation that will try to scale up and clean up nodes
     if it is unsuccessful, but it is up to cloud providers to integrate with
     provider-specific APIs.
4. The scale-down path is not expected to change much, but users should follow
   [best
   practices](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node)
   to avoid CA disturbing their workloads.

## Testing

The following e2e test scenarios will be created to check whether ProvReq
handling works as expected:

1. A new ProvReq with the `check-capacity.kubernetes.io` provisioning class is created; CA
   checks if there is enough capacity in the cluster to provision the specified pods.
2. A new ProvReq with the `atomic-scale-up.kubernetes.io` provisioning class is created; CA
   picks an appropriate node group and scales it up atomically.
3. A new atomic ProvReq is created for which NAP needs to provision a new
   node group. NAP creates it and CA scales it up atomically.
   - Here we should cover some of the different reasons why NAP may be
     required.
4. An atomic ProvReq fails due to node group size limits and NAP CPU and/or RAM
   limits.
5. Scalability tests.
   - Scenario in which many small ProvReqs are created (strain on the number
     of scale-up loops).
   - Scenario in which a big ProvReq is created (strain on a single scale-up
     loop).

## Limitations

The current Cluster Autoscaler implementation does not take [Resource Quotas](https://kubernetes.io/docs/concepts/policy/resource-quotas/) into account. \
The current proposal is to not include handling of Resource Quotas, but it could be added later on.

## Future Expansions

### ProvisioningClass CRD

One possible expansion of this approach is to introduce a ProvisioningClass CRD,
which follows the same approach as the
[StorageClass object](https://kubernetes.io/docs/concepts/storage/storage-classes/).
Such an approach would allow cluster administrators to introduce a list of allowed
ProvisioningClasses. Such a CRD can also contain a preset configuration, e.g.
administrators may set that `atomic-scale-up.kubernetes.io` would retry for up to `2h`.

Possible CRD definition:
```go
// ProvisioningClass is a way to express provisioning classes available in the cluster.
type ProvisioningClass struct {
	// Name denotes the name of the object, which is to be used in the ProvisioningClass
	// field in the Provisioning Request CRD.
	//
	// +kubebuilder:validation:Required
	Name string `json:"name"`

	// AdditionalParameters contains all other parameters that custom classes may require.
	//
	// +optional
	AdditionalParameters map[string]string `json:"additionalParameters"`
}
```
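
As a rough sketch of how this could be used, an administrator-defined class object
following the fields above might look like this. The value format of
`ValidUntilSeconds` and the omission of the standard type/object metadata are
assumptions for illustration only:

```yaml
# Hypothetical ProvisioningClass object (type and object metadata omitted)
name: atomic-scale-up.kubernetes.io
additionalParameters:
  ValidUntilSeconds: "7200"  # assumed format: retry atomic scale-ups for up to 2h
```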