Add multiple scheduling group proposal
Signed-off-by: RainbowMango <qdurenhongcai@gmail.com>
---
title: Multiple scheduling group

authors:
- "@RainbowMango"

reviewers:
- "@chaunceyjiang"
- "@Garrybest"
- "@lonelyCZ"
- "@Poor12"
- "@XiShanYongYe-Chang"

approvers:
- "@kevin-wangzefeng"

creation-date: 2023-02-06

---

# Multiple scheduling group

## Summary

The current PropagationPolicy supports declaring **only one** group of clusters,
that is, the `.spec.placement.clusterAffinity`, e.g.:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: foo
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
```

The `clusterAffinity` supplies a group of candidate clusters to the `karmada-scheduler`.
The `karmada-scheduler` makes scheduling decisions among the candidate clusters
according to the relevant restrictions (like spread constraints, filter plugins, and so on),
and the scheduling result is either a success or a failure:

- success: a group of clusters (probably a subset of the candidates) was successfully selected for the referenced resource.
- failure: no group of clusters satisfying the restrictions could be selected.

This proposal introduces a strategy that allows multiple `clusterAffinity` groups to
be declared, so that the `karmada-scheduler` can make its decision by evaluating each
`clusterAffinity` in the specified order.

## Motivation

Cluster administrators usually classify their clusters into groups by category
(for example, by provider or usage) and want to deploy workloads to the preferred
groups while keeping a backup group in case the preferred group does not satisfy
the scheduling restrictions (for example, due to a lack of resources).

The Karmada community has received a lot of feedback about this from end users,
for example in issues [#780](https://github.com/karmada-io/karmada/issues/780) and
[#2085](https://github.com/karmada-io/karmada/issues/2085).

### Goals

- Extend the API of PropagationPolicy to hold multiple affinity group declarations.
- Extend the API of ResourceBinding to annotate the affinity group that the scheduler is currently inspecting.
- Propose the implementation ideas for the involved components, including `karmada-controller-manager`, `karmada-webhook`, and `karmada-scheduler`.

### Non-Goals

- **The relative priority of clusters in the same group**

  This proposal focuses on introducing multiple affinity groups; the relative
  priority of clusters within the same group is out of scope. Such requirements
  can be met by `weightPreference` (`.spec.placement.replicaScheduling.weightPreference`)
  or by defining a strategy for the `karmada-scheduler` to score specified clusters.

- **Scheduling re-balance**

  Oftentimes people want the `karmada-scheduler` to perform an extra schedule for
  different purposes, like the question addressed by discussion
  [#3069](https://github.com/karmada-io/karmada/discussions/3069). These kinds of
  issues should be tracked by another proposal.

## Proposal

### User Stories (Optional)

#### As a user, I prefer to deploy my applications on clusters in the local data center to save costs.

I have some clusters in my data center as well as some managed clusters from cloud
providers (like AWS and Google Cloud). Since the cost of the managed clusters is
higher than that of the private clusters, I prefer to deploy applications on the
private clusters and keep the managed clusters as a backup.

#### As a user, I want to deploy applications on the primary cluster and leave a backup cluster for disaster recovery cases.

I have two clusters, a primary cluster and a backup cluster. I want Karmada to deploy
my applications on the primary cluster and migrate them to the backup cluster when
the primary cluster becomes unavailable, for example due to a data center power
outage or a maintenance task.

### Notes/Constraints/Caveats (Optional)

This proposal mainly focuses on the changes to `PropagationPolicy` and `ResourceBinding`,
but it also applies to `ClusterPropagationPolicy` and `ClusterResourceBinding`.

### Risks and Mitigations

This proposal maintains backward compatibility; a system built with a previous
version of Karmada can be seamlessly migrated to the new version. Previous
configurations (YAMLs) can be applied to the new version of Karmada without any
behavior change.

## Design Details

### API change

#### PropagationPolicy API change

This proposal proposes a new field `ClusterAffinities` for declaring
multiple affinity terms in `.spec.placement` of `PropagationPolicy`.

```go
// Placement represents the rule for selecting clusters.
type Placement struct {
    // ClusterAffinity represents scheduling restrictions to a certain set of clusters.
    // If not set, any cluster can be a scheduling candidate.
    // +optional
    ClusterAffinity *ClusterAffinity `json:"clusterAffinity,omitempty"`

    // ClusterAffinities represents scheduling restrictions to multiple cluster
    // groups that are indicated by ClusterAffinityTerm.
    //
    // The scheduler will evaluate these groups one by one in the order they
    // appear in the spec; a group that does not satisfy the scheduling restrictions
    // will be ignored, which means all clusters in this group will not be selected
    // unless they also belong to the next group (a cluster could belong to multiple
    // groups).
    //
    // If none of the groups satisfy the scheduling restrictions, then scheduling
    // fails, which means no cluster will be selected.
    //
    // Note:
    //   1. ClusterAffinities can not co-exist with ClusterAffinity.
    //   2. If both ClusterAffinity and ClusterAffinities are not set, any cluster
    //      can be a scheduling candidate.
    //
    // Potential use case 1:
    // The private clusters in the local data center could be the main group, and
    // the managed clusters provided by cluster providers could be the secondary
    // group. The Karmada scheduler would prefer to schedule workloads to the main
    // group, and the secondary group will only be considered in case the main
    // group does not satisfy the restrictions (like a lack of resources).
    //
    // Potential use case 2:
    // For the disaster recovery scenario, the clusters could be organized into
    // primary and backup groups. The workloads would be scheduled to the primary
    // clusters first, and when a primary cluster fails (like a data center power
    // outage), the Karmada scheduler could migrate workloads to the backup clusters.
    //
    // +optional
    ClusterAffinities []ClusterAffinityTerm `json:"clusterAffinities,omitempty"`

    // ClusterTolerations represents the tolerations.
    // +optional
    ClusterTolerations []corev1.Toleration `json:"clusterTolerations,omitempty"`

    // SpreadConstraints represents a list of the scheduling constraints.
    // +optional
    SpreadConstraints []SpreadConstraint `json:"spreadConstraints,omitempty"`

    // ReplicaScheduling represents the scheduling policy on dealing with the number of replicas
    // when propagating resources that have replicas in spec (e.g. deployments, statefulsets) to member clusters.
    // +optional
    ReplicaScheduling *ReplicaSchedulingStrategy `json:"replicaScheduling,omitempty"`
}

// ClusterAffinityTerm selects a set of clusters.
type ClusterAffinityTerm struct {
    // AffinityName is the name of the cluster group.
    // +required
    AffinityName string `json:"affinityName"`

    ClusterAffinity `json:",inline"`
}
```

Each affinity term is essentially a named `ClusterAffinity`. During the scheduling
phase, the scheduler evaluates the terms one by one in the order they appear in the
spec, and turns to the next term if the current term does not satisfy the restrictions.

The following configuration sample declares 3 affinity terms:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinities:
      - affinityName: dc-shanghai
        clusterNames:
          - unavailable
      - affinityName: dc-beijing
        clusterNames:
          - member1
      - affinityName: dc-hongkong
        clusterNames:
          - member2
```

During the scheduling phase, the scheduler will first look at the affinity term
named `dc-shanghai` and try to select a feasible set of clusters from it.

If no feasible cluster is found in that term, the scheduler will try the next term,
which is the one named `dc-beijing`. Once the scheduler successfully selects a
feasible set of clusters from an affinity term, it will not continue to look at the
following terms.

In addition, in case cluster `member1` becomes unavailable, by leveraging the
[Failover feature](https://karmada.io/docs/userguide/failover/failover-overview)
the scheduler will first look for alternatives *in the current affinity term*, and
if that fails, it will move on to the term named `dc-hongkong`.

**Note:** Each affinity term is completely independent, and the scheduler will only
select one term at a time during scheduling. However, the same cluster is allowed to
appear in multiple affinity terms.

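The selection semantics above can be sketched as a small routine that walks the terms in order and stops at the first feasible one. This is only an illustrative sketch: `affinityTerm`, `selectByAffinityTerms`, and the `feasible` callback are hypothetical stand-ins for the scheduler's real types and filter/score pipeline, not part of the proposed API.

```go
package main

import "fmt"

// affinityTerm is a simplified, hypothetical stand-in for ClusterAffinityTerm.
type affinityTerm struct {
    Name     string
    Clusters []string
}

// selectByAffinityTerms walks the terms in the order they appear and returns the
// name of the first term that yields a feasible result, together with the clusters
// selected from it. The feasible callback stands in for the filter/score pipeline.
func selectByAffinityTerms(terms []affinityTerm, feasible func([]string) []string) (string, []string, error) {
    for _, term := range terms {
        if selected := feasible(term.Clusters); len(selected) > 0 {
            // Later terms are not evaluated once one term succeeds.
            return term.Name, selected, nil
        }
    }
    return "", nil, fmt.Errorf("no affinity term satisfies the scheduling restrictions")
}

func main() {
    terms := []affinityTerm{
        {Name: "dc-shanghai", Clusters: []string{"unavailable"}},
        {Name: "dc-beijing", Clusters: []string{"member1"}},
        {Name: "dc-hongkong", Clusters: []string{"member2"}},
    }
    // Pretend only member1 and member2 pass the filter plugins.
    available := map[string]bool{"member1": true, "member2": true}
    feasible := func(candidates []string) []string {
        var out []string
        for _, c := range candidates {
            if available[c] {
                out = append(out, c)
            }
        }
        return out
    }
    name, clusters, _ := selectByAffinityTerms(terms, feasible)
    fmt.Println(name, clusters) // Output: dc-beijing [member1]
}
```
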
#### ResourceBinding API change

In order to track the affinity term that the scheduler last evaluated, this proposal
proposes a new field named `SchedulerObservedAffinityName` in the `.status` of
`ResourceBinding`, so that the scheduler can continue the previous scheduling cycle
in any case of re-scheduling.

```go
// ResourceBindingStatus represents the overall status of the strategy as well as the referenced resources.
type ResourceBindingStatus struct {
    // SchedulerObservedGeneration is the generation(.metadata.generation) observed by the scheduler.
    // If SchedulerObservedGeneration is less than the generation in metadata, it means the scheduler
    // hasn't confirmed the scheduling result or hasn't done the schedule yet.
    // +optional
    SchedulerObservedGeneration int64 `json:"schedulerObservedGeneration,omitempty"`

    // SchedulerObservedAffinityName is the name of the affinity term that
    // the scheduler is currently looking at.
    // +optional
    SchedulerObservedAffinityName string `json:"schedulerObservingAffinityName,omitempty"`

    // Conditions contain the different condition statuses.
    // +optional
    Conditions []metav1.Condition `json:"conditions,omitempty"`

    // AggregatedStatus represents status list of the resource running in each member cluster.
    // +optional
    AggregatedStatus []AggregatedStatusItem `json:"aggregatedStatus,omitempty"`
}
```

E.g.:

```yaml
status:
  aggregatedStatus:
  - applied: true
    clusterName: member1
    health: Healthy
    status:
      availableReplicas: 2
      readyReplicas: 2
      replicas: 2
      updatedReplicas: 2
  conditions:
  - lastTransitionTime: "2023-02-04T09:38:20Z"
    message: All works have been successfully applied
    reason: FullyAppliedSuccess
    status: "True"
    type: FullyApplied
  - lastTransitionTime: "2023-02-04T09:38:20Z"
    message: Binding has been scheduled
    reason: BindingScheduled
    status: "True"
    type: Scheduled
  schedulerObservedGeneration: 2
  schedulerObservingAffinityName: ds-hongkong
```

A `SchedulerObservedAffinityName` of `ds-hongkong` means the scheduling result is
based on the affinity term named `ds-hongkong`; in case of re-scheduling, the
scheduler should continue evaluating from this affinity term.

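As a rough illustration of how the recorded name could be used, the scheduler might locate the term to resume from as follows; `startingTermIndex` is a hypothetical helper, not part of the proposal.

```go
// startingTermIndex returns the index of the affinity term from which a
// re-schedule should continue, based on the affinity name recorded in
// the binding status. Hypothetical helper for illustration only.
func startingTermIndex(termNames []string, observedAffinityName string) int {
    if observedAffinityName == "" {
        return 0 // first schedule: start from the first term
    }
    for i, name := range termNames {
        if name == observedAffinityName {
            return i // continue from the term the last result was based on
        }
    }
    // The observed term no longer exists (e.g. the policy was updated);
    // fall back to evaluating from the beginning.
    return 0
}
```
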
### Components change

#### karmada-controller-manager

When creating or updating `ResourceBinding`/`ClusterResourceBinding`, the newly
added `ClusterAffinities` in `PropagationPolicy`/`ClusterPropagationPolicy` should
be synced to it.

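A minimal sketch of the sync, assuming the binding spec carries the placement (including the new `ClusterAffinities`) in a `Placement` field and that the usual generated `DeepCopy` helpers exist; the real controller logic (change detection, requeueing, and so on) is omitted.

```go
import (
    policyv1alpha1 "github.com/karmada-io/karmada/pkg/apis/policy/v1alpha1"
    workv1alpha2 "github.com/karmada-io/karmada/pkg/apis/work/v1alpha2"
)

// syncPlacement copies the placement (including the new ClusterAffinities) from
// the matched PropagationPolicy into the ResourceBinding so that the scheduler
// can see all declared affinity terms. Simplified sketch: it assumes the binding
// spec exposes the placement as a Placement field and that DeepCopy is generated.
func syncPlacement(policy *policyv1alpha1.PropagationPolicy, binding *workv1alpha2.ResourceBinding) {
    binding.Spec.Placement = policy.Spec.Placement.DeepCopy()
}
```
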
#### karmada-scheduler

Currently, the karmada-scheduler runs a single scheduling loop that accepts one
affinity term, that is, the [ScheduleAlgorithm interface](https://github.com/karmada-io/karmada/blob/32b8f21b79017e2f4154fbe0677cb63fb18b120c/pkg/scheduler/core/generic_scheduler.go#L22).

With this proposal, the `ScheduleAlgorithm` interface will probably be invoked
multiple times, being fed a different affinity term each time until the scheduling
succeeds.

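One possible way to reuse the existing single-affinity scheduling logic is to derive a temporary single-term placement for each attempt. The sketch below is illustrative only: `placementForTerm` is a hypothetical helper, and it assumes the generated `DeepCopy` method on `Placement`.

```go
import (
    policyv1alpha1 "github.com/karmada-io/karmada/pkg/apis/policy/v1alpha1"
)

// placementForTerm derives a single-affinity Placement from the i-th affinity
// term so that the existing single-term scheduling logic can be reused without
// modification. Hypothetical helper; assumes the generated DeepCopy on Placement.
func placementForTerm(placement *policyv1alpha1.Placement, i int) *policyv1alpha1.Placement {
    p := placement.DeepCopy()
    term := p.ClusterAffinities[i]
    // Collapse the chosen term into the legacy single-affinity field and clear
    // the multi-term field before handing the placement to the algorithm.
    p.ClusterAffinity = &term.ClusterAffinity
    p.ClusterAffinities = nil
    return p
}
```
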
#### karmada-webhook

Since it doesn't make sense for the newly introduced `ClusterAffinities` to co-exist
with the previous `ClusterAffinity`, the webhook should perform extra validation to
prevent misleading configurations.

In addition, the `affinityName` of each affinity term should be unique among all terms.

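A minimal sketch of the extra validation, based on the `Placement` fields proposed above; `validatePlacement` is a hypothetical helper name and the wiring into the admission handlers is omitted.

```go
import (
    "fmt"

    policyv1alpha1 "github.com/karmada-io/karmada/pkg/apis/policy/v1alpha1"
)

// validatePlacement rejects placements that set both ClusterAffinity and
// ClusterAffinities, and enforces that affinityName is unique among all terms.
// Hypothetical helper; the admission-handler wiring is omitted.
func validatePlacement(placement *policyv1alpha1.Placement) error {
    if placement == nil {
        return nil
    }
    // ClusterAffinities can not co-exist with ClusterAffinity.
    if placement.ClusterAffinity != nil && len(placement.ClusterAffinities) > 0 {
        return fmt.Errorf("clusterAffinities can not co-exist with clusterAffinity")
    }
    // The affinityName of each affinity term must be unique among all terms.
    seen := make(map[string]struct{}, len(placement.ClusterAffinities))
    for _, term := range placement.ClusterAffinities {
        if _, duplicated := seen[term.AffinityName]; duplicated {
            return fmt.Errorf("duplicated affinityName: %q", term.AffinityName)
        }
        seen[term.AffinityName] = struct{}{}
    }
    return nil
}
```
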
### Test Plan

- All existing tests should pass; no breaking change is introduced by this feature.
- Add new E2E tests to cover the feature. The scope should include:
  * Workload propagation with scheduling type `Duplicated`.
  * Workload propagation with scheduling type `Divided`.
  * Failover scenarios.

## Alternatives

Introducing a new field to specify cluster priority was one of the first suggested
ideas (tracked by [#842](https://github.com/karmada-io/karmada/pull/842)).
This approach tried to reuse the terms from Kubernetes Pod affinity.
An example of this approach is as follows:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: foo
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: foo
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
      propagatePriority:
        - weight: 30
          preference:
            matchExpressions:
              - key: topology
                operator: In
                values:
                  - us
        - weight: 20
          preference:
            matchExpressions:
              - key: topology
                operator: In
                values:
                  - cn
```

The `.spec.placement.clusterAffinity.propagatePriority` field is used to specify the
cluster preference by reusing the [PreferredSchedulingTerm of Kubernetes](https://github.com/kubernetes/kubernetes/blob/fc002b2f07250a462bb1b471807708b542472c18/staging/src/k8s.io/api/core/v1/types.go#L3028-L3035).

However, `PreferredSchedulingTerm` relies on grouping clusters by label selectors and
field selectors, and the `matchExpressions` involves so many nested layers that it
would probably make the configuration hard to maintain.

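For reference, the Kubernetes types this alternative reuses look roughly like the following (abbreviated from `k8s.io/api/core/v1`, with protobuf tags and comments dropped); the three levels of nesting are what makes the resulting configuration verbose.

```go
// Abbreviated from k8s.io/api/core/v1 for illustration. Selecting clusters this
// way pulls in three nested levels:
// PreferredSchedulingTerm -> NodeSelectorTerm -> NodeSelectorRequirement.
type NodeSelectorOperator string

type PreferredSchedulingTerm struct {
    Weight     int32            `json:"weight"`
    Preference NodeSelectorTerm `json:"preference"`
}

type NodeSelectorTerm struct {
    MatchExpressions []NodeSelectorRequirement `json:"matchExpressions,omitempty"`
    MatchFields      []NodeSelectorRequirement `json:"matchFields,omitempty"`
}

type NodeSelectorRequirement struct {
    Key      string               `json:"key"`
    Operator NodeSelectorOperator `json:"operator"`
    Values   []string             `json:"values,omitempty"`
}
```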