| title | authors | reviewers | approvers | creation-date |
|-------|---------|-----------|-----------|---------------|
| Introduce a rebalance mechanism to actively trigger rescheduling of resource. | | | | 2024-01-30 |
# Introduce a mechanism to actively trigger rescheduling

## Background

In the current Karmada scheduler, once the replicas of a workload are scheduled, the scheduling result stays inert and the replica distribution will not change. Even if rescheduling is triggered by modifying replicas or placement, the scheduler maintains the existing replica distribution as closely as possible, only making minimal adjustments when necessary, which minimizes disruption and preserves the balance across clusters.

However, in some scenarios, users want an approach to actively trigger a fresh rescheduling, one that disregards the previous assignment entirely and seeks to establish an entirely new replica distribution across clusters.
## Motivation

Assuming the user has propagated workloads to member clusters, in some scenarios the current replica distribution is not the most desirable, such as:

- replicas were migrated due to cluster failover, but the cluster has now recovered.
- replicas were migrated due to application-level failover, but now each cluster has sufficient resources to run the replicas.
- with the `Aggregated` schedule strategy, replicas were initially distributed across multiple clusters due to resource constraints, but now one cluster is enough to accommodate all replicas.

Therefore, the user desires an approach to trigger rescheduling so that the replica distribution can be rebalanced.
### Goals

Introduce a rebalance mechanism to actively trigger rescheduling of resources.
## Proposal

- Introduce a configurable field into ResourceBinding; when it changes, the scheduler will perform a `Fresh` mode rescheduling.

  In contrast to the existing assignment mode of rescheduling, such as that triggered by a modification of replicas or placement, which maintains the existing replica distribution as closely as possible, the assignment mode of this rescheduling disregards the previous assignment entirely and seeks to establish an entirely new replica distribution across clusters. We call the former assignment mode `Steady` mode and the latter `Fresh` mode.

- Introduce a new API by which users can actively adjust workload balance.

  Since directly manipulating bindings is not a recommended or user-friendly approach, it is better to design a new API specifically for adjusting workload balance. Currently it mainly targets the rescheduling scenario; in the future it may expand to more workload rebalance scenarios, such as migration, rollback and so on, with different assignment modes and rolling modes specified.
### User story

#### Story 1

In the cluster failover scenario, replicas are distributed across two clusters, member1 and member2; however, they would all migrate to the member2 cluster if the member1 cluster fails.

As a cluster administrator, I hope the replicas are redistributed across both clusters when the member1 cluster recovers, so that the resources of the member1 cluster are re-utilized, and for the sake of high availability.
#### Story 2

In application-level failover, low-priority applications may be preempted and shrink from multiple clusters to a single cluster because cluster resources are in short supply (refer to Application-level Failover).

As a user, I hope the replicas of low-priority applications can be redistributed to multiple clusters when cluster resources are sufficient, to ensure the high availability of the application.
#### Story 3

With the `Aggregated` schedule type, replicas may still be distributed across multiple clusters due to resource constraints.

As a user, I hope the replicas are redistributed in an aggregated strategy when any single cluster has sufficient resources to accommodate all replicas, so that the application better meets actual business requirements.
#### Story 4

In the disaster-recovery scenario, replicas migrated from the primary cluster to the backup cluster when the primary cluster failed.

As a cluster administrator, I hope that replicas can migrate back when the primary cluster is restored, so that:

- the federation returns to disaster-recovery mode, ensuring its reliability and stability.
- the cost of the backup cluster is saved.
### Notes/Constraints/Caveats

This ability is limited to triggering a workload rebalance; the schedule result will be recalculated according to the `Placement` in the current ResourceBinding. That means:

- Take story 1 as an example: rescheduling happens when the cluster recovers, but the new schedule result is not guaranteed to be exactly the same as before the cluster failure; it is only guaranteed that the new schedule result satisfies the current `Placement`.
- Rebalancing is based on the `Placement` in the current ResourceBinding, not the PropagationPolicy. So if the activation preference of your PropagationPolicy is `Lazy`, the rescheduling is still based on the previous `ResourceBinding`, even if the current policy has been changed.
## Design Details

### API change

- As for introducing a configurable field into ResourceBinding, the detailed description is as follows:
  ```go
  // ResourceBindingSpec represents the expectation of ResourceBinding.
  type ResourceBindingSpec struct {
      ...

      // RescheduleTriggeredAt is a timestamp representing when the referenced resource was triggered for rescheduling.
      // When this field is updated, it means a rescheduling is manually triggered by the user, and the expected behavior
      // of this action is to do a complete recalculation without referring to the last scheduling results.
      // It works with the status.lastScheduledTime field: the rescheduling actually executes only when this timestamp
      // is later than the timestamp in status.lastScheduledTime; otherwise, it is ignored.
      //
      // It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
      // +optional
      RescheduleTriggeredAt *metav1.Time `json:"rescheduleTriggeredAt,omitempty"`

      ...
  }

  // ResourceBindingStatus represents the overall status of the strategy as well as the referenced resources.
  type ResourceBindingStatus struct {
      ...

      // LastScheduledTime represents the latest timestamp when the scheduler successfully finished a scheduling.
      // It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
      // +optional
      LastScheduledTime *metav1.Time `json:"lastScheduledTime,omitempty"`

      ...
  }
  ```
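  The interaction between the two timestamps reduces to a small predicate. Below is a minimal sketch of that rule; the helper name `rescheduleIsPending` is hypothetical, not the actual scheduler code:

  ```go
  package scheduler

  import (
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  )

  // rescheduleIsPending illustrates the trigger rule: a Fresh-mode rescheduling
  // executes only when spec.rescheduleTriggeredAt is strictly later than
  // status.lastScheduledTime.
  func rescheduleIsPending(triggeredAt, lastScheduled *metav1.Time) bool {
      if triggeredAt == nil {
          // The user has never triggered a manual rescheduling.
          return false
      }
      if lastScheduled == nil {
          // Triggered, but no successful scheduling has been recorded yet.
          return true
      }
      return triggeredAt.Time.After(lastScheduled.Time)
  }
  ```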
- As for introducing a new API by which users can actively adjust workload balance, we define a new API named `WorkloadRebalancer` in a new apiGroup `apps.karmada.io/v1alpha1`:
  ```go
  // +genclient
  // +genclient:nonNamespaced
  // +kubebuilder:resource:path=workloadrebalancers,scope="Cluster"
  // +kubebuilder:subresource:status
  // +kubebuilder:storageversion
  // +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

  // WorkloadRebalancer represents the desired behavior and status of a job which can enforce a resource rebalance.
  type WorkloadRebalancer struct {
      metav1.TypeMeta   `json:",inline"`
      metav1.ObjectMeta `json:"metadata,omitempty"`

      // Spec represents the specification of the desired behavior of WorkloadRebalancer.
      // +required
      Spec WorkloadRebalancerSpec `json:"spec"`

      // Status represents the status of WorkloadRebalancer.
      // +optional
      Status WorkloadRebalancerStatus `json:"status,omitempty"`
  }

  // WorkloadRebalancerSpec represents the specification of the desired behavior of WorkloadRebalancer.
  type WorkloadRebalancerSpec struct {
      // Workloads used to specify the list of expected resources.
      // Nil or an empty list is not allowed.
      // +kubebuilder:validation:MinItems=1
      // +required
      Workloads []ObjectReference `json:"workloads"`

      // TTLSecondsAfterFinished limits the lifetime of a WorkloadRebalancer that has finished execution (meaning each
      // target workload has finished with a result of Successful or Failed).
      // If this field is set, the WorkloadRebalancer is eligible to be automatically deleted ttlSecondsAfterFinished
      // seconds after it finishes.
      // If this field is unset, the WorkloadRebalancer won't be automatically deleted.
      // If this field is set to zero, the WorkloadRebalancer becomes eligible to be deleted immediately after it finishes.
      // +optional
      TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
  }

  // ObjectReference the expected resource.
  type ObjectReference struct {
      // APIVersion represents the API version of the target resource.
      // +required
      APIVersion string `json:"apiVersion"`

      // Kind represents the Kind of the target resource.
      // +required
      Kind string `json:"kind"`

      // Name of the target resource.
      // +required
      Name string `json:"name"`

      // Namespace of the target resource.
      // Default is empty, which means the resource is cluster-scoped.
      // +optional
      Namespace string `json:"namespace,omitempty"`
  }

  // WorkloadRebalancerStatus contains information about the current status of a WorkloadRebalancer,
  // updated periodically by the schedule trigger controller.
  type WorkloadRebalancerStatus struct {
      // ObservedWorkloads contains information about the execution states and messages of target resources.
      // +optional
      ObservedWorkloads []ObservedWorkload `json:"observedWorkloads,omitempty"`

      // ObservedGeneration is the generation (.metadata.generation) observed by the controller.
      // If ObservedGeneration is less than the generation in metadata, the controller hasn't confirmed
      // the rebalance result or hasn't done the rebalance yet.
      // +optional
      ObservedGeneration int64 `json:"observedGeneration,omitempty"`

      // FinishTime represents the finish time of the rebalancer.
      // +optional
      FinishTime *metav1.Time `json:"finishTime,omitempty"`
  }

  // ObservedWorkload the observed resource.
  type ObservedWorkload struct {
      // Workload the observed resource.
      // +required
      Workload ObjectReference `json:"workload"`

      // Result the observed rebalance result of the resource.
      // +optional
      Result RebalanceResult `json:"result,omitempty"`

      // Reason represents a machine-readable description of why the rebalance of this resource failed.
      // +optional
      Reason RebalanceFailedReason `json:"reason,omitempty"`
  }

  // RebalanceResult the specific extent to which the resource has been rebalanced.
  type RebalanceResult string

  const (
      // RebalanceFailed the rebalance of the resource failed.
      RebalanceFailed RebalanceResult = "Failed"
      // RebalanceSuccessful the resource has been successfully rebalanced.
      RebalanceSuccessful RebalanceResult = "Successful"
  )

  // RebalanceFailedReason represents a machine-readable description of why the rebalance of this resource failed.
  type RebalanceFailedReason string

  const (
      // RebalanceObjectNotFound the binding referenced by the resource is not found.
      RebalanceObjectNotFound RebalanceFailedReason = "ReferencedBindingNotFound"
  )
  ```
### Interpretation of Realization by an Example

#### Step 1: apply the WorkloadRebalancer resource YAML

Assuming there are two Deployments named `demo-deploy-1` and `demo-deploy-2` and a ClusterRole named `demo-role`, and the user wants to trigger their rescheduling, they just need to apply the following YAML:
```yaml
apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
  name: demo
spec:
  workloads:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-deploy-1
      namespace: default
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      name: demo-role
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-deploy-2
      namespace: default
```
Notes for the `workloads` field:

- the `name` sub-field is required;
- the `namespace` sub-field is required when the resource is namespace-scoped, and must be left empty when it is cluster-scoped.

This API specifies a batch of resources which need rescheduling. The user will get a `workloadrebalancer.apps.karmada.io/demo created` result, which means the API object was created successfully.
#### Step 2: the controller listens for the new API resource and does the rescheduling work

The controller will then trigger rescheduling for each resource by writing the `CreationTimestamp` of the `WorkloadRebalancer` to the new field `spec.rescheduleTriggeredAt` of each resource binding. Taking `deployment/demo-deploy-1` as an example, you will see its resource binding modified to:
```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: demo-deploy-1-deployment
  namespace: default
spec:
  rescheduleTriggeredAt: "2024-04-17T15:04:05Z"  # this field is updated to the CreationTimestamp of the WorkloadRebalancer
  ...
status:
  lastScheduledTime: "2024-04-17T15:00:05Z"
```
Since the field `rescheduleTriggeredAt` has been updated and is later than the field `lastScheduledTime`, rescheduling is triggered. If it succeeds, the `lastScheduledTime` field is updated again, which indicates that the scheduler finished a rescheduling (if it fails, the scheduler will retry); in detail:
```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: demo-deploy-1-deployment
  namespace: default
spec:
  rescheduleTriggeredAt: "2024-04-17T15:04:05Z"
  ...
status:
  lastScheduledTime: "2024-04-17T15:04:05Z"
  conditions:
    - ...
    - lastTransitionTime: "2024-04-17T15:00:05Z"
      message: Binding has been scheduled successfully.
      reason: Success
      status: "True"
      type: Scheduled
    - lastTransitionTime: "2024-04-17T15:04:05Z"
      message: All works have been successfully applied
      reason: FullyAppliedSuccess
      status: "True"
      type: FullyApplied
```
Finally, once all works have been successfully applied, the user will observe changes in the actual replica distribution of the resource template; the user can also see several recorded events on the resource template, like:
```bash
$ kubectl --context karmada-apiserver describe deployment demo-deploy-1
...
Events:
  Type    Reason                  Age                From                                Message
  ----    ------                  ---                ----                                -------
  ...
  Normal  ScheduleBindingSucceed  31s                default-scheduler                   Binding has been scheduled successfully. Result: {member2:2, member1:1}
  Normal  GetDependenciesSucceed  31s                dependencies-distributor            Get dependencies([]) succeed.
  Normal  SyncSucceed             31s                execution-controller                Successfully applied resource(default/demo-deploy-1) to cluster member1
  Normal  AggregateStatusSucceed  31s (x4 over 31s)  resource-binding-status-controller  Update resourceBinding(default/demo-deploy-1-deployment) with AggregatedStatus successfully.
  Normal  SyncSucceed             31s                execution-controller                Successfully applied resource(default/demo-deploy-1) to cluster member2
```
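For illustration, here is a minimal sketch of the triggering step from this section, assuming a controller-runtime client; the helper name `triggerReschedule` and its structure are assumptions, not the actual controller code:

```go
package rebalancer

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"

	workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// triggerReschedule stamps the WorkloadRebalancer's creation timestamp onto the
// target binding's spec.rescheduleTriggeredAt, which the scheduler compares
// against status.lastScheduledTime to decide whether to reschedule.
func triggerReschedule(ctx context.Context, c client.Client, key types.NamespacedName, createdAt metav1.Time) error {
	binding := &workv1alpha2.ResourceBinding{}
	if err := c.Get(ctx, key, binding); err != nil {
		return err // a NotFound error here maps to the ReferencedBindingNotFound reason
	}
	// Skip if an equal or later trigger is already recorded, making retries idempotent.
	if binding.Spec.RescheduleTriggeredAt != nil && !createdAt.Time.After(binding.Spec.RescheduleTriggeredAt.Time) {
		return nil
	}
	base := client.MergeFrom(binding.DeepCopy())
	binding.Spec.RescheduleTriggeredAt = &createdAt
	return c.Patch(ctx, binding, base)
}
```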
#### Step 3: check the status of the WorkloadRebalancer

The user can observe the rebalance result in `status.observedWorkloads` of `workloadrebalancer/demo`, like:
```yaml
apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
  creationTimestamp: "2024-04-17T15:04:05Z"
  name: demo
spec:
  workloads:
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-deploy-1
      namespace: default
    - apiVersion: rbac.authorization.k8s.io/v1
      kind: ClusterRole
      name: demo-role
    - apiVersion: apps/v1
      kind: Deployment
      name: demo-deploy-2
      namespace: default
status:
  observedWorkloads:
    - result: Successful
      workload:
        apiVersion: apps/v1
        kind: Deployment
        name: demo-deploy-1
        namespace: default
    - reason: ReferencedBindingNotFound
      result: Failed
      workload:
        apiVersion: apps/v1
        kind: Deployment
        name: demo-deploy-2
        namespace: default
    - result: Successful
      workload:
        apiVersion: rbac.authorization.k8s.io/v1
        kind: ClusterRole
        name: demo-role
```
Notes:

- `observedWorkloads` is sorted in increasing dictionary order of the combined string `apiVersion/kind/namespace/name` (see the sketch after this list).
- if the binding referenced by a workload is not found, the workload is marked as `Failed` without retry.
- if a workload rebalance fails due to an occasional network error, the controller will retry, and the `result` and `reason` fields are left empty until it succeeds.
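The ordering rule can be made concrete with a short sketch, reusing the `ObjectReference` and `ObservedWorkload` types defined above; the sort key construction is an assumption consistent with the note, not the actual implementation:

```go
package rebalancer

import (
	"fmt"
	"sort"
)

// sortObservedWorkloads keeps status.observedWorkloads in increasing dictionary
// order of "apiVersion/kind/namespace/name" (illustrative sketch).
func sortObservedWorkloads(observed []ObservedWorkload) {
	key := func(w ObjectReference) string {
		return fmt.Sprintf("%s/%s/%s/%s", w.APIVersion, w.Kind, w.Namespace, w.Name)
	}
	sort.Slice(observed, func(i, j int) bool {
		return key(observed[i].Workload) < key(observed[j].Workload)
	})
}
```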
### How to update this resource

When the `spec` field of a WorkloadRebalancer is updated, we shall refresh the workload list in `status.observedWorkloads` as follows (see the sketch after this list):

- if a new workload is added to the spec list, add it to the status list too and do the rebalance.
- if a workload is deleted from the previous spec list, keep it in the status list if it already succeeded, and remove it if not.
- if a workload is modified, regard it as deleting an old one and inserting a new one.
- if the modification only involves a list order adjustment, no additional action is needed, since `observedWorkloads` is arranged in increasing dictionary order.
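A minimal sketch of this refresh rule, building on the types and the `sortObservedWorkloads` helper above; the function name and structure are illustrative assumptions, not the actual controller code:

```go
// refreshObservedWorkloads recomputes status.observedWorkloads from the latest
// spec list: new workloads are added as pending, removed workloads are kept
// only if they already succeeded, and the result stays sorted.
func refreshObservedWorkloads(spec []ObjectReference, old []ObservedWorkload) []ObservedWorkload {
	inSpec := make(map[ObjectReference]struct{}, len(spec))
	for _, w := range spec {
		inSpec[w] = struct{}{}
	}

	result := make([]ObservedWorkload, 0, len(spec)+len(old))
	kept := make(map[ObjectReference]struct{}, len(old))
	for _, ow := range old {
		// Keep entries still present in spec; keep removed entries only if already Successful.
		if _, ok := inSpec[ow.Workload]; ok || ow.Result == RebalanceSuccessful {
			result = append(result, ow)
			kept[ow.Workload] = struct{}{}
		}
	}
	for _, w := range spec {
		// Newly added (or modified) workloads start with an empty result, i.e. pending rebalance.
		if _, ok := kept[w]; !ok {
			result = append(result, ObservedWorkload{Workload: w})
		}
	}

	// Order adjustments in spec therefore need no extra handling.
	sortObservedWorkloads(result)
	return result
}
```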
### How to auto clean resource

Referring to Automatic Cleanup for Finished Jobs, we introduce the field `ttlSecondsAfterFinished`, which limits the lifetime of a WorkloadRebalancer that has finished execution (finished execution means each target workload has finished with a result of `Successful` or `Failed`):

- If this field is set, the WorkloadRebalancer is eligible to be automatically deleted `ttlSecondsAfterFinished` seconds after it finishes.
- If this field is unset, the WorkloadRebalancer won't be automatically deleted.
- If this field is set to zero, the WorkloadRebalancer becomes eligible to be deleted immediately after it finishes.
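A minimal sketch of the expiry bookkeeping, assuming a standard client-go delaying workqueue; the helper name `processTTL` is hypothetical:

```go
package rebalancer

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// processTTL decides whether a finished WorkloadRebalancer has outlived its TTL.
// If not, it re-enqueues the key so a worker re-checks it exactly at expiry time.
func processTTL(queue workqueue.DelayingInterface, key string, finishTime time.Time, ttlSeconds int32) (expired bool) {
	expireAt := finishTime.Add(time.Duration(ttlSeconds) * time.Second)
	if remaining := time.Until(expireAt); remaining > 0 {
		queue.AddAfter(key, remaining)
		return false
	}
	return true // eligible for deletion now
}
```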
Several corner cases to consider:

- case 1: if a new target workload is added to the `WorkloadRebalancer` before `ttlSecondsAfterFinished` expires, the finish time of the `WorkloadRebalancer` is refreshed, so the `delete` action is deferred since the expiry time is refreshed too.
- case 2: if `ttlSecondsAfterFinished` is modified before it expires, the `delete` action should be performed according to the latest `ttlSecondsAfterFinished`.
- case 3: if a modification to the `WorkloadRebalancer` occurs right between fetching and checking the latest object and attempting to delete it, the previous `delete` action should be interrupted.
Several key implementation points (see the sketch after this list):

- A `WorkloadRebalancer` is judged as finished when it meets two requirements:
  - all expected workloads have finished with a result of `Successful` or `Failed`;
  - the new `ObservedGeneration` field in the `Status` of the WorkloadRebalancer equals `.metadata.generation`, which prevents treating a rebalancer as finished when it has been updated but the controller hasn't yet refreshed its `Status`.
- When a `WorkloadRebalancer` is `Created` or `Updated`, add it to the workqueue, calculate its expiry time, and call the `workqueue.AddAfter()` function to re-enqueue it once more if it hasn't expired.
- Before deleting the `WorkloadRebalancer`, do a final sanity check: use the latest `WorkloadRebalancer` fetched directly from the API server, rather than the object from the lister cache, to see whether the TTL has truly expired.
- When deleting the `WorkloadRebalancer`, confirm that the `resourceVersion` of the deleted object is as expected, to guard against corner case 3 above.
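The last two points can be sketched as follows; the typed clientset (`karmadaclientset`), its import path, and the `ttlExpired` helper are assumptions for illustration, not the actual implementation:

```go
package rebalancer

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// Assumed import path for the generated Karmada clientset.
	karmadaclientset "github.com/karmada-io/karmada/pkg/generated/clientset/versioned"
)

// deleteExpiredRebalancer re-fetches the WorkloadRebalancer from the API server
// (not the lister cache), re-checks the TTL, and deletes it with a
// resourceVersion precondition so that a concurrent update (corner case 3)
// interrupts the deletion with a conflict error instead of removing a
// refreshed rebalancer.
func deleteExpiredRebalancer(ctx context.Context, client karmadaclientset.Interface, name string) error {
	fresh, err := client.AppsV1alpha1().WorkloadRebalancers().Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if !ttlExpired(fresh) { // hypothetical helper: finishTime + TTL vs. now
		return nil
	}
	rv := fresh.ResourceVersion
	return client.AppsV1alpha1().WorkloadRebalancers().Delete(ctx, name, metav1.DeleteOptions{
		Preconditions: &metav1.Preconditions{ResourceVersion: &rv},
	})
}
```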
### How to prevent application from being out-of-service

As for the disaster-recovery scenario mentioned in story 4 above, after the primary cluster recovers and rescheduling has been triggered, if the new replicas in the primary cluster become ready later than the old replicas are removed from the backup cluster, there may be no ready replica in the cluster federation and the application will be out of service. So, how do we prevent the application from being out of service?
This will be discussed and implemented separately in another proposal.