---
title: Multiple scheduling group
authors:
- "@RainbowMango"
reviewers:
- "@chaunceyjiang"
- "@Garrybest"
- "@lonelyCZ"
- "@Poor12"
- "@XiShanYongYe-Chang"
approvers:
- "@kevin-wangzefeng"
creation-date: 2023-02-06
---
# Multiple scheduling group
## Summary
The current PropagationPolicy supports declaring only one group of clusters, that
is, the `.spec.placement.clusterAffinity`, e.g.
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
name: foo
spec:
resourceSelectors:
- apiVersion: apps/v1
kind: Deployment
name: foo
placement:
clusterAffinity:
clusterNames:
- member1
- member2
```
The `clusterAffinity` supplies a group of candidate clusters to the `karmada-scheduler`.
The `karmada-scheduler` makes scheduling decisions among the candidate clusters
according to relevant restrictions (like spreadConstraint, filter plugins, and so on),
and the scheduling result is one of the following:
- successful: a group of clusters (probably a subset of the candidates) was successfully selected for the referencing resource.
- failure: no group of clusters satisfying the restrictions could be selected.
This proposal introduces a strategy by which multiple `clusterAffinity` groups can be declared,
and the `karmada-scheduler` makes its decision by evaluating each `clusterAffinity`
in the specified order.
## Motivation
Cluster administrators usually classify their clusters into different groups by
category (like provider, usage, and so on), and hope to deploy workloads to the
preferred groups while keeping a backup group in case the preferred group doesn't
satisfy scheduling restrictions, such as a lack of resources.
The Karmada community has received a lot of feedback about this from end users, such as
issues [#780](https://github.com/karmada-io/karmada/issues/780) and
[#2085](https://github.com/karmada-io/karmada/issues/2085).
### Goals
- Extend the API of PropagationPolicy to hold multiple affinity groups declaration.
- Extend the API of ResourceBinding to annotate the affinity group that the scheduler is inspecting.
- Propose the implementation ideas for involved components, including `karmada-controller-manager`, `karmada-webhook` and `karmada-scheduler`.
### Non-Goals
- **The relative priority of clusters in the same group**
This proposal focuses on introducing multiple affinity groups; the relative
priority of clusters within the same group is out of scope. Such requirements
could be addressed by `weightPreference`(.spec.placement.replicaScheduling.weightPreference) or
by defining a strategy for the `karmada-scheduler` to score specified clusters.
- **Scheduling re-balance**
Indeed, people often want the `karmada-scheduler` to perform an extra schedule
for different purposes, like the question addressed by discussion
[#3069](https://github.com/karmada-io/karmada/discussions/3069); these kinds of
issues should be tracked by another proposal.
## Proposal
### User Stories (Optional)
#### As a user, I prefer to deploy my applications on clusters in my local data center to save costs.
I have some clusters in my data center as well as some managed clusters from cloud
providers (like AWS and Google Cloud). Since the cost of the managed clusters is higher
than that of the private clusters, I prefer to deploy applications on the private
clusters and keep the managed clusters as a backup.
#### As a user, I want to deploy applications on the primary cluster and leave a backup cluster for disaster recovery cases.
I have two clusters, a primary cluster and a backup cluster. I want Karmada to deploy
my applications on the primary cluster and migrate them to the backup cluster
when the primary cluster becomes unavailable, for example due to a data center power
outage or a maintenance task.
### Notes/Constraints/Caveats (Optional)
This proposal mainly focuses on the changes in `PropagationPolicy` and `ResourceBinding`,
but the changes also apply to `ClusterPropagationPolicy` and `ClusterResourceBinding`.
### Risks and Mitigations
This proposal maintains backward compatibility; a system built with previous
versions of Karmada can be seamlessly migrated to the new version.
Previous configurations (YAMLs) can be applied to the new version of Karmada
without any behavior change.
## Design Details
### API change
#### PropagationPolicy API change
This proposal introduces a new field `ClusterAffinities` for declaring
multiple affinity terms in `.spec.placement` of `PropagationPolicy`.
```go
// Placement represents the rule for selecting clusters.
type Placement struct {
// ClusterAffinity represents scheduling restrictions to a certain set of clusters.
// If not set, any cluster can be a scheduling candidate.
// +optional
ClusterAffinity *ClusterAffinity `json:"clusterAffinity,omitempty"`
// ClusterAffinities represents scheduling restrictions to multiple cluster
// groups that are indicated by ClusterAffinityTerm.
//
// The scheduler will evaluate these groups one by one in the order they
// appear in the spec; a group that does not satisfy the scheduling restrictions
// will be ignored, which means none of the clusters in this group will be
// selected unless they also belong to the next group (a cluster could belong
// to multiple groups).
//
// If none of the groups satisfy the scheduling restrictions, then scheduling
// fails, which means no cluster will be selected.
//
// Note:
// 1. ClusterAffinities can not co-exist with ClusterAffinity.
// 2. If both ClusterAffinity and ClusterAffinities are not set, any cluster
// can be a scheduling candidate.
//
// Potential use case 1:
// The private clusters in the local data center could be the main group, and
// the managed clusters provided by cluster providers could be the secondary
// group. The Karmada scheduler would prefer to schedule workloads to the main
// group, and the secondary group will only be considered in case the main
// group does not satisfy restrictions (like a lack of resources).
//
// Potential use case 2:
// For the disaster recovery scenario, the clusters could be organized into
// primary and backup groups. The workloads would be scheduled to the primary
// clusters first, and when a primary cluster fails (like a data center power
// outage), the Karmada scheduler could migrate workloads to the backup clusters.
//
// +optional
ClusterAffinities []ClusterAffinityTerm `json:"clusterAffinities,omitempty"`
// ClusterTolerations represents the tolerations.
// +optional
ClusterTolerations []corev1.Toleration `json:"clusterTolerations,omitempty"`
// SpreadConstraints represents a list of the scheduling constraints.
// +optional
SpreadConstraints []SpreadConstraint `json:"spreadConstraints,omitempty"`
// ReplicaScheduling represents the scheduling policy on dealing with the number of replicas
// when propagating resources that have replicas in spec (e.g. deployments, statefulsets) to member clusters.
// +optional
ReplicaScheduling *ReplicaSchedulingStrategy `json:"replicaScheduling,omitempty"`
}
// ClusterAffinityTerm selects a set of clusters.
type ClusterAffinityTerm struct {
// AffinityName is the name of the cluster group.
// +required
AffinityName string `json:"affinityName"`
ClusterAffinity `json:",inline"`
}
```
Each affinity term is essentially a named `ClusterAffinity`. During the scheduling
phase, the scheduler evaluates the terms one by one in the order they appear in the
spec, and moves on to the next term if the current term does not satisfy the restrictions.
The following configuration sample declares 3 affinity terms:
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
name: nginx
spec:
resourceSelectors:
- apiVersion: apps/v1
kind: Deployment
name: nginx
placement:
clusterAffinities:
- affinityName: dc-shanghai
clusterNames:
- unavailable
- affinityName: dc-beijing
clusterNames:
- member1
- affinityName: dc-hongkong
clusterNames:
- member2
```
During the scheduling phase, the scheduler will first look at the affinity term
named `dc-shanghai` and try to select a feasible cluster set from it.
If no feasible cluster is found in that term, the scheduler will try the
next term, which is the one named `dc-beijing`. Once the scheduler has successfully
selected a feasible cluster set from an affinity term, it will not
continue to look at the following terms.
In case cluster `member1` becomes unavailable, by leveraging the
[Failover feature](https://karmada.io/docs/userguide/failover/failover-overview)
the scheduler will start looking for alternatives `in the current affinity term`, and
if that fails, it will look at the term named `dc-hongkong`.
**Note:** Each affinity term is completely independent, and the scheduler will
only select one term at a time during scheduling. However, the same cluster is
allowed to be present in multiple affinity terms.
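The per-term evaluation described above can be sketched as follows. This is a minimal, illustrative sketch only: the simplified types stand in for the proposed API types, and `scheduleOneTerm` is a hypothetical callback representing a full scheduling pass (filtering, scoring, spread constraints); it is not the actual karmada-scheduler code.
```go
package scheduler

import "fmt"

// Simplified stand-ins for the proposed API types, for illustration only.
type ClusterAffinity struct {
	ClusterNames []string
}

type ClusterAffinityTerm struct {
	AffinityName string
	ClusterAffinity
}

// selectClusters evaluates the affinity terms in the order they appear and
// returns the clusters selected from the first feasible term, together with
// that term's name. Later terms are not considered once a term succeeds.
func selectClusters(
	terms []ClusterAffinityTerm,
	scheduleOneTerm func(ClusterAffinityTerm) ([]string, error),
) ([]string, string, error) {
	for _, term := range terms {
		selected, err := scheduleOneTerm(term)
		if err == nil && len(selected) > 0 {
			return selected, term.AffinityName, nil
		}
		// This term did not satisfy the restrictions; fall through to the
		// next term. A cluster that also appears in a later term can still
		// be selected there.
	}
	return nil, "", fmt.Errorf("no affinity term satisfies the scheduling restrictions")
}
```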
#### ResourceBinding API change
In order to track the affinity term that the scheduler last evaluated,
this proposal adds a new field named `SchedulerObservedAffinityName` to
the `.status` of `ResourceBinding`, so that the scheduler can continue the previous
scheduling cycle in any re-schedule case.
```go
// ResourceBindingStatus represents the overall status of the strategy as well as the referenced resources.
type ResourceBindingStatus struct {
// SchedulerObservedGeneration is the generation(.metadata.generation) observed by the scheduler.
// If SchedulerObservedGeneration is less than the generation in metadata means the scheduler hasn't confirmed
// the scheduling result or hasn't done the schedule yet.
// +optional
SchedulerObservedGeneration int64 `json:"schedulerObservedGeneration,omitempty"`
// SchedulerObservedAffinityName is the name of the affinity term that
// the scheduler is currently looking at.
// +optional
SchedulerObservedAffinityName string `json:"schedulerObservedAffinityName,omitempty"`
// Conditions contain the different condition statuses.
// +optional
Conditions []metav1.Condition `json:"conditions,omitempty"`
// AggregatedStatus represents status list of the resource running in each member cluster.
// +optional
AggregatedStatus []AggregatedStatusItem `json:"aggregatedStatus,omitempty"`
}
```
E.g:
```yaml
status:
aggregatedStatus:
- applied: true
clusterName: member1
health: Healthy
status:
availableReplicas: 2
readyReplicas: 2
replicas: 2
updatedReplicas: 2
conditions:
- lastTransitionTime: "2023-02-04T09:38:20Z"
message: All works have been successfully applied
reason: FullyAppliedSuccess
status: "True"
type: FullyApplied
- lastTransitionTime: "2023-02-04T09:38:20Z"
message: Binding has been scheduled
reason: BindingScheduled
status: "True"
type: Scheduled
schedulerObservedGeneration: 2
schedulerObservedAffinityName: dc-hongkong
```
The `schedulerObservedAffinityName: dc-hongkong` means the scheduling result is based on
the affinity term named `dc-hongkong`; in case of re-scheduling, the scheduler should
continue to evaluate from this affinity term.
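On re-schedule, the recorded name lets the scheduler skip the terms it has already evaluated. A minimal sketch, reusing the simplified `ClusterAffinityTerm` from the earlier sketch; the fallback to the first term when the name is missing or stale is an assumption made here for illustration, not something specified by this proposal.
```go
// resumeIndex returns the index of the affinity term the scheduler should
// continue from on a re-schedule. An empty or no-longer-existing observed
// name (for example, after the policy was edited) falls back to the first term.
func resumeIndex(terms []ClusterAffinityTerm, observedAffinityName string) int {
	if observedAffinityName == "" {
		return 0
	}
	for i, term := range terms {
		if term.AffinityName == observedAffinityName {
			return i
		}
	}
	return 0
}
```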
### Components change
#### karmada-controller-manager
When creating or updating `ResourceBinding`/`ClusterResourceBinding`, the newly added
`ClusterAffinities` field in `PropagationPolicy`/`ClusterPropagationPolicy` should
be synced.
#### karmada-scheduler
Currently, the karmada-scheduler runs only a single scheduling pass that accepts one affinity
term, that is, the [ScheduleAlgorithm interface](https://github.com/karmada-io/karmada/blob/32b8f21b79017e2f4154fbe0677cb63fb18b120c/pkg/scheduler/core/generic_scheduler.go#L22).
With this proposal, the `ScheduleAlgorithm` interface will probably be invoked multiple
times, fed a different affinity term each time, until the schedule succeeds.
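One way to drive the existing single-pass algorithm with minimal change is to normalize both affinity forms into an ordered term list and run the pass once per term, as in the loop sketched earlier. The helper below is illustrative and reuses the simplified types from the earlier sketches; representing the legacy affinity as a single unnamed term is an assumption of this sketch, not a decision made by the proposal.
```go
// normalizedTerms converts a placement's affinity declaration into an ordered
// list of terms so the scheduler can treat both forms uniformly: the new
// ClusterAffinities is used as-is, the legacy ClusterAffinity is wrapped as a
// single unnamed term, and the absence of both means "no restriction".
func normalizedTerms(legacy *ClusterAffinity, terms []ClusterAffinityTerm) []ClusterAffinityTerm {
	if len(terms) > 0 {
		return terms
	}
	if legacy != nil {
		return []ClusterAffinityTerm{{ClusterAffinity: *legacy}}
	}
	// Neither form is set: any cluster can be a scheduling candidate, which
	// is represented here as a single term with no restrictions.
	return []ClusterAffinityTerm{{}}
}
```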
#### karmada-webhook
Since it doesn't make sense for the newly introduced `ClusterAffinities` to
co-exist with the previous `ClusterAffinity`, the webhook should perform extra
validation to prevent misleading configurations.
In addition, the `affinityName` of each affinity term should be unique among all terms.
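The extra validation could look roughly like the sketch below, reusing the simplified types from the earlier sketches. The function name and error messages are illustrative; this is not the actual karmada-webhook code.
```go
// validateAffinities rejects misleading configurations: the legacy
// ClusterAffinity must not co-exist with ClusterAffinities, and every
// affinityName must be unique among the declared terms.
func validateAffinities(legacy *ClusterAffinity, terms []ClusterAffinityTerm) error {
	if legacy != nil && len(terms) > 0 {
		return fmt.Errorf("clusterAffinities can not co-exist with clusterAffinity")
	}
	seen := make(map[string]struct{}, len(terms))
	for _, term := range terms {
		if _, duplicated := seen[term.AffinityName]; duplicated {
			return fmt.Errorf("duplicated affinityName %q among affinity terms", term.AffinityName)
		}
		seen[term.AffinityName] = struct{}{}
	}
	return nil
}
```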
### Test Plan
- All current tests should pass; no breaking change is involved in this feature.
- Add new E2E tests to cover the feature; the scope should include:
* Workload propagation with scheduling type `Duplicated`.
* Workload propagation with scheduling type `Divided`.
* Failover scenario.
## Alternatives
Introducing a new field to specify cluster priority was one of the first suggested
ideas (tracked by [#842](https://github.com/karmada-io/karmada/pull/842)).
This approach tried to reuse the terms of Kubernetes Pod affinity.
An example of this approach is as follows:
```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
name: foo
spec:
resourceSelectors:
- apiVersion: apps/v1
kind: Deployment
name: foo
placement:
clusterAffinity:
clusterNames:
- member1
- member2
propagatePriority:
- weight: 30
preference:
matchExpressions:
- key: topology
operator: In
values:
- us
- weight: 20
preference:
matchExpressions:
- key: topology
operator: In
values:
- cn
```
The `.spec.placement.clusterAffinity.propagatePriority` field is used to specify the
cluster preference by reusing the [PreferredSchedulingTerm of Kubernetes](https://github.com/kubernetes/kubernetes/blob/fc002b2f07250a462bb1b471807708b542472c18/staging/src/k8s.io/api/core/v1/types.go#L3028-L3035).
However, `PreferredSchedulingTerm` relies on grouping clusters by label selector
and field selector, and the `matchExpressions` involve too many nested layers, which
would probably make the configuration hard to maintain.