Merge pull request #4698 from chaosi-zju/reschedule

Proposal of introducing a rebalance mechanism to actively trigger rescheduling of resource
---
title: Introduce a rebalance mechanism to actively trigger rescheduling of resource.
authors:
- "@chaosi-zju"
reviewers:
- "@RainbowMango"
- "@chaunceyjiang"
- "@wu0407"
approvers:
- "@RainbowMango"
creation-date: 2024-01-30
---
# Introduce a mechanism to actively trigger rescheduling
## Background
With the current Karmada scheduler, once the replicas of a workload are scheduled, the scheduling result is kept inert
and the replica distribution will not change. Even if rescheduling is triggered by modifying replicas or placement,
the scheduler maintains the existing replica distribution as closely as possible, only making minimal adjustments when necessary,
which minimizes disruptions and preserves the balance across clusters.
However, in some scenarios, users want an approach to actively trigger a fresh rescheduling, one that disregards the
previous assignment entirely and seeks to establish an entirely new replica distribution across clusters.
### Motivation
Assuming the user has propagated workloads to member clusters, in some scenarios the current replica distribution
is not the most desirable, for example:
* replicas were migrated due to cluster failover, but the failed cluster has now recovered.
* replicas were migrated due to application-level failover, but now each cluster has sufficient resources to run the replicas.
* with the `Aggregated` schedule strategy, replicas were initially distributed across multiple clusters due to resource
constraints, but now a single cluster is enough to accommodate all replicas.
Therefore, users desire an approach to trigger rescheduling so that the replica distribution can be rebalanced.
### Goals
Introduce a rebalance mechanism to actively trigger rescheduling of resources.
## Proposal
* **Introduce a configurable field into resource binding, and when it changes, the scheduler will perform a `Fresh` mode
rescheduling.**
> In contrast to the existing assignment mode of rescheduling (such as that triggered by a modification of replicas or
> placement), which maintains the existing replica distribution as closely as possible, the assignment mode of this
> rescheduling disregards the previous assignment entirely and seeks to establish an entirely new replica distribution across clusters.
>
> We call the former assignment mode `Steady` and the latter `Fresh`.
* **Introduce a new API, by which the users can actively adjust workload balance.**
> Since directly manipulating bindings is not a recommended or user-friendly approach, it is better to design a new API
> specifically for adjusting workload balance. Currently, it mainly targets the rescheduling scenario.
> In the future, it may expand to cover more workload rebalance scenarios, such as migration and rollback,
> with different assignment modes and rolling modes specified.
### User Stories
#### Story 1
In a cluster failover scenario, replicas are distributed across two clusters, member1 and member2; however, they would all migrate to
the member2 cluster if the member1 cluster fails.
As a cluster administrator, I hope the replicas are redistributed across both clusters when the member1 cluster recovers, so that
the resources of the member1 cluster are re-utilized and high availability is preserved.
#### Story 2
In application-level failover, low-priority applications may be preempted, shrinking from multiple clusters
to a single cluster when cluster resources are in short supply
(refer to [Application-level Failover](https://karmada.io/docs/next/userguide/failover/application-failover#why-application-level-failover-is-required)).
As a user, I hope the replicas of low-priority applications can be redistributed to multiple clusters when
cluster resources are sufficient, to ensure the high availability of the application.
#### Story 3
With the `Aggregated` schedule strategy, replicas may still be distributed across multiple clusters due to resource constraints.
As a user, I hope the replicas are redistributed in an aggregated manner when any single cluster has
sufficient resources to accommodate all replicas, so that the application better meets actual business requirements.
#### Story 4
In a disaster-recovery scenario, replicas are migrated from the primary cluster to the backup cluster when the primary cluster fails.
As a cluster administrator, I hope that replicas can migrate back when the primary cluster is restored, so that:
1. the disaster-recovery mode is restored, ensuring the reliability and stability of the cluster federation.
2. the cost of the backup cluster is saved.
### Notes/Constraints/Caveats
This ability is limited to triggering a workload rebalance; the schedule result will be recalculated according to the
`Placement` in the current ResourceBinding. That means:
* Take [story 1](#story-1) as an example: rescheduling happens when the cluster recovers, but the new schedule result is not
guaranteed to be exactly the same as before the cluster failure; it is only guaranteed that the new schedule result meets the
current `Placement`.
* Rebalancing is based on the `Placement` in the current ResourceBinding, not the PropagationPolicy. So if the activation preference
of your PropagationPolicy is `Lazy`, rescheduling is still based on the previous `ResourceBinding` even if the current Policy has been changed.
## Design Details
### API change
* As for *Introduce a configurable field into resource binding*, the detailed description is as follows (a sketch of how the scheduler may use these two fields appears after the API definitions):
```go
// ResourceBindingSpec represents the expectation of ResourceBinding.
type ResourceBindingSpec struct {
...
// RescheduleTriggeredAt is a timestamp representing when the referenced resource was triggered for rescheduling.
// When this field is updated, it means a rescheduling is manually triggered by the user, and the expected behavior
// of this action is to do a complete recalculation without referring to the last scheduling results.
// It works with the status.lastScheduledTime field, and only when this timestamp is later than the timestamp in
// status.lastScheduledTime will the rescheduling actually execute; otherwise, it is ignored.
//
// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
// +optional
RescheduleTriggeredAt *metav1.Time `json:"rescheduleTriggeredAt,omitempty"`
...
}
// ResourceBindingStatus represents the overall status of the strategy as well as the referenced resources.
type ResourceBindingStatus struct {
...
// LastScheduledTime represents the latest timestamp when the scheduler successfully finished a scheduling.
// It is represented in RFC3339 form (like '2006-01-02T15:04:05Z') and is in UTC.
// +optional
LastScheduledTime *metav1.Time `json:"lastScheduledTime,omitempty"`
...
}
```
* As for *Introduce a new API, by which the users can actively adjust workload balance*, we define a new API
named `WorkloadRebalancer` in a new apiGroup `apps.karmada.io/v1alpha1`:
```go
// +genclient
// +genclient:nonNamespaced
// +kubebuilder:resource:path=workloadrebalancers,scope="Cluster"
// +kubebuilder:subresource:status
// +kubebuilder:storageversion
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
// WorkloadRebalancer represents the desired behavior and status of a job which can enforce a resource rebalance.
type WorkloadRebalancer struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
// Spec represents the specification of the desired behavior of WorkloadRebalancer.
// +required
Spec WorkloadRebalancerSpec `json:"spec"`
// Status represents the status of WorkloadRebalancer.
// +optional
Status WorkloadRebalancerStatus `json:"status,omitempty"`
}
// WorkloadRebalancerSpec represents the specification of the desired behavior of a WorkloadRebalancer.
type WorkloadRebalancerSpec struct {
// Workloads is used to specify the list of expected resources.
// A nil or empty list is not allowed.
// +kubebuilder:validation:MinItems=1
// +required
Workloads []ObjectReference `json:"workloads"`
// TTLSecondsAfterFinished limits the lifetime of a WorkloadRebalancer that has finished execution (meaning each
// target workload has finished with a result of Successful or Failed).
// If this field is set, ttlSecondsAfterFinished after the WorkloadRebalancer finishes, it is eligible to be automatically deleted.
// If this field is unset, the WorkloadRebalancer won't be automatically deleted.
// If this field is set to zero, the WorkloadRebalancer becomes eligible to be deleted immediately after it finishes.
// +optional
TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
}
// ObjectReference represents the expected resource.
type ObjectReference struct {
// APIVersion represents the API version of the target resource.
// +required
APIVersion string `json:"apiVersion"`
// Kind represents the Kind of the target resource.
// +required
Kind string `json:"kind"`
// Name of the target resource.
// +required
Name string `json:"name"`
// Namespace of the target resource.
// Default is empty, which means the target is a cluster-scoped resource.
// +optional
Namespace string `json:"namespace,omitempty"`
}
// WorkloadRebalancerStatus contains information about the current status of a WorkloadRebalancer,
// updated periodically by the schedule trigger controller.
type WorkloadRebalancerStatus struct {
// ObservedWorkloads contains information about the execution states and messages of target resources.
// +optional
ObservedWorkloads []ObservedWorkload `json:"observedWorkloads,omitempty"`
// ObservedGeneration is the generation (.metadata.generation) observed by the controller.
// If ObservedGeneration is less than the generation in metadata, it means the controller hasn't confirmed
// the rebalance result or hasn't performed the rebalance yet.
// +optional
ObservedGeneration int64 `json:"observedGeneration,omitempty"`
// FinishTime represents the finish time of rebalancer.
// +optional
FinishTime *metav1.Time `json:"finishTime,omitempty"`
}
// ObservedWorkload represents the observed resource.
type ObservedWorkload struct {
// Workload is the observed resource.
// +required
Workload ObjectReference `json:"workload"`
// Result is the observed rebalance result of the resource.
// +optional
Result RebalanceResult `json:"result,omitempty"`
// Reason represents a machine-readable description of why the rebalance of this resource failed.
// +optional
Reason RebalanceFailedReason `json:"reason,omitempty"`
}
// RebalanceResult describes the extent to which the resource has been rebalanced.
type RebalanceResult string
const (
// RebalanceFailed indicates that the rebalance of the resource failed.
RebalanceFailed RebalanceResult = "Failed"
// RebalanceSuccessful indicates that the resource has been successfully rebalanced.
RebalanceSuccessful RebalanceResult = "Successful"
)
// RebalanceFailedReason represents a machine-readable description of why the rebalance of a resource failed.
type RebalanceFailedReason string
const (
// RebalanceObjectNotFound indicates that the binding referenced by the resource was not found.
RebalanceObjectNotFound RebalanceFailedReason = "ReferencedBindingNotFound"
)
```
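To make the interaction between `spec.rescheduleTriggeredAt` and `status.lastScheduledTime` concrete, here is a minimal sketch of the check the scheduler could perform before choosing between the `Steady` and `Fresh` assignment modes. The helper name and mode constants are hypothetical illustrations, not actual Karmada code.
```go
// Sketch only: how a scheduler might derive the assignment mode from the two
// timestamp fields proposed above. All names here are hypothetical.
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type assignmentMode string

const (
	steadyMode assignmentMode = "Steady" // keep the previous replica distribution as much as possible
	freshMode  assignmentMode = "Fresh"  // recalculate the replica distribution from scratch
)

// decideAssignmentMode returns Fresh only when a rescheduling was triggered
// after the last successful scheduling; an older trigger is ignored.
func decideAssignmentMode(rescheduleTriggeredAt, lastScheduledTime *metav1.Time) assignmentMode {
	if rescheduleTriggeredAt == nil {
		return steadyMode
	}
	if lastScheduledTime == nil || rescheduleTriggeredAt.After(lastScheduledTime.Time) {
		return freshMode
	}
	return steadyMode
}

func main() {
	triggered := metav1.NewTime(time.Now())
	scheduled := metav1.NewTime(triggered.Add(-time.Minute))
	fmt.Println(decideAssignmentMode(&triggered, &scheduled)) // prints "Fresh"
}
```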
### Implementation Walkthrough with an Example
#### Step 1: Apply the WorkloadRebalancer resource YAML.
Assume there are two Deployments named `demo-deploy-1` and `demo-deploy-2`, and a ClusterRole named `demo-role`.
If the user wants to trigger their rescheduling, they just need to apply the following YAML:
```yaml
apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
name: demo
spec:
workloads:
- apiVersion: apps/v1
kind: Deployment
name: demo-deploy-1
namespace: default
- apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
name: demo-role
- apiVersion: apps/v1
kind: Deployment
name: demo-deploy-2
namespace: default
```
> Notes: as for the `workloads` field:
> 1. the `name` sub-field is required;
> 2. the `namespace` sub-field is required when it is a namespace-scoped resource, and must be left empty when it is a cluster-scoped resource.
This API specifies a batch of resources that need rescheduling, and the user will get a `workloadrebalancer.apps.karmada.io/demo created`
result, which means the resource was created successfully.
#### Step 2: The controller watches the new API resource and does the rescheduling work.
The controller then triggers rescheduling for each resource by writing the `CreationTimestamp` of the WorkloadRebalancer
into each resource binding's new field `spec.rescheduleTriggeredAt` (a sketch of this flow is shown at the end of this step).
Take `deployment/demo-deploy-1` as an example; you will see its resource binding modified to:
```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
name: demo-deploy-1-deployment
namespace: default
spec:
rescheduleTriggeredAt: "2024-04-17T15:04:05Z" # this field would be updated to CreationTimestamp of WorkloadRebalancer
...
status:
lastScheduledTime: "2024-04-17T15:00:05Z"
```
Since the `rescheduleTriggeredAt` field has been updated and is later than the `lastScheduledTime` field, rescheduling is triggered.
If it succeeds, the `lastScheduledTime` field will be updated again, which indicates that the scheduler finished a rescheduling
(if it fails, the scheduler will retry), as detailed below:
```yaml
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
name: demo-deploy-1-deployment
namespace: default
spec:
rescheduleTriggeredAt: "2024-04-17T15:04:05Z"
...
status:
lastScheduledTime: "2024-04-17T15:04:05Z"
conditions:
- ...
- lastTransitionTime: "2024-04-17T15:00:05Z"
message: Binding has been scheduled successfully.
reason: Success
status: "True"
type: Scheduled
- lastTransitionTime: "2024-04-17T15:04:05Z"
message: All works have been successfully applied
reason: FullyAppliedSuccess
status: "True"
type: FullyApplied
```
Finally, once all works have been successfully applied, the user will observe changes in the actual distribution of the resource
template; the user can also see several recorded events on the resource template, such as:
```shell
$ kubectl --context karmada-apiserver describe deployment demo-deploy-1
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
...
Normal ScheduleBindingSucceed 31s default-scheduler Binding has been scheduled successfully. Result: {member2:2, member1:1}
Normal GetDependenciesSucceed 31s dependencies-distributor Get dependencies([]) succeed.
Normal SyncSucceed 31s execution-controller Successfully applied resource(default/demo-deploy-1) to cluster member1
Normal AggregateStatusSucceed 31s (x4 over 31s) resource-binding-status-controller Update resourceBinding(default/demo-deploy-1-deployment) with AggregatedStatus successfully.
Normal SyncSucceed 31s execution-controller Successfully applied resource(default/demo-deploy-1) to cluster member2
```
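The triggering flow of this step can be sketched as follows. This is an illustration only, not the actual controller code: the binding-name helper and dynamic-client wiring are hypothetical simplifications, and cluster-scoped workloads (whose bindings are ClusterResourceBindings) are omitted for brevity.
```go
// Sketch only: how a controller might write the WorkloadRebalancer's
// CreationTimestamp into each target binding's spec.rescheduleTriggeredAt.
// bindingNameForWorkload is a hypothetical helper; the real controller would
// resolve the binding via labels or owner references.
package rebalancer

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

var bindingGVR = schema.GroupVersionResource{
	Group: "work.karmada.io", Version: "v1alpha2", Resource: "resourcebindings",
}

type targetWorkload struct {
	Kind, Name, Namespace string
}

// bindingNameForWorkload derives the binding name from the workload name and
// kind, e.g. "demo-deploy-1" + "Deployment" -> "demo-deploy-1-deployment".
func bindingNameForWorkload(w targetWorkload) string {
	return fmt.Sprintf("%s-%s", w.Name, strings.ToLower(w.Kind))
}

// triggerRescheduling patches spec.rescheduleTriggeredAt of every target
// binding with the rebalancer's creation timestamp via a JSON merge patch.
func triggerRescheduling(ctx context.Context, dc dynamic.Interface,
	creationTimestamp metav1.Time, workloads []targetWorkload) error {
	for _, w := range workloads {
		patch := []byte(fmt.Sprintf(`{"spec":{"rescheduleTriggeredAt":%q}}`,
			creationTimestamp.UTC().Format("2006-01-02T15:04:05Z")))
		_, err := dc.Resource(bindingGVR).Namespace(w.Namespace).Patch(
			ctx, bindingNameForWorkload(w), types.MergePatchType, patch, metav1.PatchOptions{})
		if err != nil {
			// A failed workload is recorded in status.observedWorkloads and retried.
			return err
		}
	}
	return nil
}
```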
#### Step 3: Check the status of the WorkloadRebalancer.
The user can observe the rebalance result in `status.observedWorkloads` of `workloadrebalancer/demo`, for example:
```yaml
apiVersion: apps.karmada.io/v1alpha1
kind: WorkloadRebalancer
metadata:
creationTimestamp: "2024-04-17T15:04:05Z"
name: demo
spec:
workloads:
- apiVersion: apps/v1
kind: Deployment
name: demo-deploy-1
namespace: default
- apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
name: demo-role
- apiVersion: apps/v1
kind: Deployment
name: demo-deploy-2
namespace: default
status:
observedWorkloads:
- result: Successful
workload:
apiVersion: apps/v1
kind: Deployment
name: demo-deploy-1
namespace: default
- reason: ReferencedBindingNotFound
result: Failed
workload:
apiVersion: apps/v1
kind: Deployment
name: demo-deploy-2
namespace: default
- result: Successful
workload:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
name: demo-role
```
> Notes:
> 1. `observedWorkloads` is sorted in increasing dictionary order of the combined string `apiVersion/kind/namespace/name`.
> 2. if the binding referenced by a workload is not found, the workload is marked as `Failed` without retry.
> 3. if a workload rebalance fails due to an occasional network error, the controller will retry, and its `result` and `reason`
> fields are left empty until it succeeds.
### How to update this resource
When the `spec` field of a WorkloadRebalancer is updated, the workload list in `status.observedWorkloads` is refreshed as follows (see the sketch after this list):
* if a new workload is added to the spec list, it is also added to the status list and rebalanced.
* if a workload is deleted from the previous spec list, it is kept in the status list if it has already succeeded, and removed otherwise.
* if a workload is modified, it is treated as deleting the old entry and inserting a new one.
* if the modification only involves a list order adjustment, no additional action is needed, since `observedWorkloads` is kept in increasing dictionary order.
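Below is a minimal sketch of this refresh logic. It reuses the `ObjectReference`/`ObservedWorkload` types defined earlier in this proposal; the helper names are hypothetical and serve only to illustrate the rules above.
```go
// Sketch only: illustrative refresh of status.observedWorkloads when the spec
// list changes. It assumes the API types defined in this proposal (ObjectReference,
// ObservedWorkload, RebalanceSuccessful) are available in the same package.
package rebalancer

import "sort"

// workloadKey is the combined string used for identity and ordering.
func workloadKey(w ObjectReference) string {
	return w.APIVersion + "/" + w.Kind + "/" + w.Namespace + "/" + w.Name
}

// refreshObservedWorkloads rebuilds the status list from the new spec list,
// keeping entries removed from spec only if they have already succeeded.
func refreshObservedWorkloads(spec []ObjectReference, old []ObservedWorkload) []ObservedWorkload {
	inSpec := make(map[string]bool, len(spec))
	for _, w := range spec {
		inSpec[workloadKey(w)] = true
	}
	result := make([]ObservedWorkload, 0, len(spec)+len(old))
	seen := make(map[string]bool)
	// Keep old entries that are still in spec, plus removed-but-successful ones.
	for _, ow := range old {
		key := workloadKey(ow.Workload)
		if inSpec[key] || ow.Result == RebalanceSuccessful {
			result = append(result, ow)
			seen[key] = true
		}
	}
	// Newly added (or modified) workloads start with an empty result until rebalanced.
	for _, w := range spec {
		if !seen[workloadKey(w)] {
			result = append(result, ObservedWorkload{Workload: w})
		}
	}
	// observedWorkloads stays in increasing dictionary order, so pure reordering
	// of the spec list changes nothing here.
	sort.Slice(result, func(i, j int) bool {
		return workloadKey(result[i].Workload) < workloadKey(result[j].Workload)
	})
	return result
}
```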
### How to auto clean resource
Referring to [Automatic Cleanup for Finished Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/),
we introduce the field `ttlSecondsAfterFinished`, which limits the lifetime of a WorkloadRebalancer that has finished execution
(finished execution means each target workload has finished with a result of `Successful` or `Failed`).
* If this field is set, `ttlSecondsAfterFinished` seconds after the WorkloadRebalancer finishes, it becomes eligible to be automatically deleted.
* If this field is unset, the WorkloadRebalancer won't be automatically deleted.
* If this field is set to zero, the WorkloadRebalancer becomes eligible to be deleted immediately after it finishes.
Considering several corner cases:
* case 1: if a new target workload is added to the `WorkloadRebalancer` before `ttlSecondsAfterFinished` expires,
the finish time of the `WorkloadRebalancer` is refreshed, so the `delete` action is deferred since the expiration time is refreshed too.
* case 2: if `ttlSecondsAfterFinished` is modified before it expires,
the `delete` action should be performed according to the latest `ttlSecondsAfterFinished`.
* case 3: when the controller has fetched and checked the latest `WorkloadRebalancer` object and tries to delete it,
if a modification to the `WorkloadRebalancer` occurs right between these two points in time, the pending `delete` action should be interrupted.
Several key implementation points (see the sketch after this list):
* A `WorkloadRebalancer` is judged as finished only when it meets two requirements:
  * all expected workloads are finished with a result of `Successful` or `Failed`.
  * a new field named `ObservedGeneration` is introduced into the `Status` of the WorkloadRebalancer, and it must be equal to `.metadata.generation`,
    to guard against the case where the WorkloadRebalancer has been updated but the controller hasn't yet refreshed its `Status`.
* When a `WorkloadRebalancer` is `Created` or `Updated`, add it to the workqueue, calculate its expiration time, and
  call `workqueue.AddAfter()` to re-enqueue it if it hasn't expired yet.
* Before deleting the `WorkloadRebalancer`, do a final sanity check: use the latest `WorkloadRebalancer` fetched
  directly from the API server, rather than the object from the lister cache, to verify that the TTL has truly expired.
* When deleting the `WorkloadRebalancer`, confirm that the `resourceVersion` of the deleted object is as expected,
  to guard against corner case 3 above.
### How to prevent application from being out-of-service
As for the disaster-recovery scenario mentioned in [story 4](#story-4) above, after the primary cluster recovers and rescheduling
has been triggered, if the new replicas in the primary cluster become ready later than the old replicas are removed from the backup cluster,
there may be no ready replica in the cluster federation and the application will be out of service. So, how do we prevent
the application from being out of service?
This will be discussed and implemented separately in another proposal.