--- title: Service discovery with native Kubernetes naming and resolution authors: - "@bivas" - "@XiShanYongYe-Chang" - "@jwcesign" reviewers: - "@RainbowMango" - "@GitHubxsy" - "@Rains6" - "@chaunceyjiang" approvers: - "@RainbowMango" creation-date: 2023-06-22 update-date: 2023-08-19 --- # Service discovery with native Kubernetes naming and resolution ## Summary In multi-cluster scenarios, there is a need to access services across clusters. Currently, Karmada support this by creating derived service(with `derived-` prefix, ) in other clusters to access the service. This Proposal propose a method for multi-cluster service discovery using Kubernetes native Service, to modify the current implementation of Karmada's MCS. This approach does not add a `derived-` prefix when accessing services across clusters. ## Motivation Having a `derived-` prefix for `Service` resources seems counterintuitive when thinking about service discovery: - Assuming the pod is exported as the service `foo` - Another pod that wishes to access it on the same cluster will simply call `foo` and Kubernetes will bind to the correct one - If that pod is scheduled to another cluster, the original service discovery will fail as there's no service by the name `foo` - To find the original pod, the other pod is required to know it is in another cluster and use `derived-foo` to work properly If Karmada supports service discovery using native Kubernetes naming and resolution (without the `derived-` prefix), users can access the service using its original name without needing to modify their code to accommodate services with the `derived-` prefix. ### Goals - Remove the "derived-" prefix from the service - User-friendly and native service discovery ### Non-Goals - Multi cluster connectivity ## Proposal Following are flows to support the service import proposal: 1. `Deployment` and `Service` are created on cluster member1 and the `Service` imported to cluster member2 using `MultiClusterService` (described below as [user story 1](#story-1)) 2. `Deployment` and `Service` are created on cluster member1 and both propagated to cluster member2. `Service` from cluster member1 is imported to cluster member2 using `MultiClusterService` (described below as [user story 2](#story-2)) The proposal for this flow is what can be referred to as local-and-remote service discovery. In the process handling, it can be simply distinguished into the following scenarios: 1. **Local** only - In case there's a local service by the name `foo` Karmada never attempts to import the remote service and doesn't create an `EndPointSlice` 2. **Local** and **Remote** - Users accessing the `foo` service will reach either member1 or member2 3. **Remote** only - in case there's a local service by the name `foo` Karmada will remove the local `EndPointSlice` and will create an `EndPointSlice` pointing to the other cluster (e.g. instead of resolving member2 cluster is will reach member1) Based on the above three scenarios, we think there is two reasonable strategies(Users can utilize PP to propagate the Service and implement the `Local` scenario, it's not necessary to implement it with `MultiClusterService`): - **RemoteAndLocal** - When accessing Service, the traffic will be evenly distributed between the local cluster and remote cluster's Service. - **LocalFirst** - When accessing Services, if the local cluster Service can provide services, it will directly access the Service of the local cluster. If a failure occurs in the Service on the local cluster, it will access the Service on remote clusters. > Note: How can we detect the failure? > Maybe we need to watch the EndpointSlices resources of the relevant Services in the member cluster. If the EndpointSlices resource becomes non-existent or the statue become not ready, we need to synchronize it with other clusters. This proposal suggests using the [MultiClusterService API](https://github.com/karmada-io/karmada/blob/24bb5829500658dd1caeea16eeace8252bcef682/pkg/apis/networking/v1alpha1/service_types.go#L30) to enable cross-cluster service discovery. This proposal focuses on the `RemoteAndLocal` strategy, and we will subsequently iterate on the `LocalFirst` strategy. ### User Stories (Optional) #### Story 1 As a user of a Kubernetes cluster, I want to be able to access a service whose corresponding pods are located in another cluster. I hope to communicate with the service using its original name. **Scenario**: 1. Given that the `Service` named `foo` exists on cluster member1 1. When I try to access the service inside member2, I can access the service using the name `foo.myspace.svc.cluster.local` #### Story 2 As a user of a Kubernetes cluster, I want to access a service that has pods located in both this cluster and another. I expect to communicate with the service using its original name, and have the requests routed to the appropriate pods across clusters. **Scenario**: 1. Given that the `Service` named `foo` exists on cluster member1 2. And there is already a conflicting `Service` named foo on cluster member2 3. When I attempt to access the service in cluster member2 using `foo.myspace.svc.cluster.local` 4. Then the requests round-robin between the local `foo` service and the imported `foo` service (member1 and member2) ### Notes/Constraints/Caveats (Optional) ### Risks and Mitigations Adding a `Service` that resolve to a remote cluster will add a network latency of communication between clusters. ## Design Details ### API changes This proposal proposes two new fields `ServiceProvisionClusters` and `ServiceConsumptionClusters` in `MultiClusterService` API. ```go type MultiClusterServiceSpec struct { ... // ServiceProvisionClusters specifies the clusters which will provision the service backend. // If leave it empty, we will collect the backend endpoints from all clusters and sync // them to the ServiceConsumptionClusters. // +optional ServiceProvisionClusters []string `json:"serviceProvisionClusters,omitempty"` // ServiceConsumptionClusters specifies the clusters where the service will be exposed, for clients. // If leave it empty, the service will be exposed to all clusters. // +optional ServiceConsumptionClusters []string `json:"serviceConsumptionClusters,omitempty"` } ``` With this API, we will: * Use `ServiceProvisionClusters` to specify the member clusters which will provision the service backend, if leave it empty, we will collect the backend endpoints from all clusters and sync them to the `ServiceConsumptionClusters`. * Use `ServiceConsumptionClusters` to specify the clusters where the service will be exposed. If leave it empty, the service will be exposed to all clusters. For example, if we want access `foo`` service which are located in member2 from member3 , we can use the following yaml: ```yaml apiVersion: v1 kind: Service metadata: name: foo spec: ports: - port: 80 targetPort: 8080 selector: app: foo --- apiVersion: networking.karmada.io/v1alpha1 kind: MultiClusterService metadata: name: foo spec: types: - CrossCluster serviceProvisionClusters: - member2 serviceConsumptionClusters: - member3 ``` ### Implementation workflow #### Service propagation The process of propagating Service from Karmada control plane to member clusters is as follows:

1. `multiclusterservice` controller will list&watch `Service` and `MultiClusterService` resources from Karmada control plane. 1. Once there is same name `MultiClusterService` and `Service`, `multiclusterservice` will create the Work(corresponding to `Service`), the target cluster namespace is all the clusters in filed `spec.serviceProvisionClusters` and `spec.serviceConsumptionClusters`. 1. The Work will be synchronized with the member clusters. After synchronization, `EndpointSlice` will be created in member clusters. #### `EndpointSlice` synchronization The process of synchronizing `EndpointSlice` from `ServiceProvisionClusters` to `ServiceConsumptionClusters` is as follows:

1. `endpointsliceCollect` controller will list&watch `MultiClusterService`. 1. `endpointsliceCollect` controller will build the informer to list&watch the target service's EndpointSlice from `ServiceProvisionClusters`. 1. `endpointsliceCollect` controller will create the corresponding Work for each `EndpointSlice` in the cluster namespace. When creating the Work, in order to delete the corresponding work when `MultiClusterService` deletion, we should add following labels: * `endpointslice.karmada.io/name`: the service name of the original `EndpointSlice`. * `endpointslice.karmada.io/namespace`: the service namespace of the original `EndpointSlice`.

1. `endpointsliceDispatch` controller will list&watch `MultiClusterService`. 1. `endpointsliceDispatch` controller will list&watch `EndpointSlice` from `MultiClusterService`'s `spec.serviceProvisionClusters`. 1. `endpointsliceDispatch` controller will create the corresponding Work for each `EndpointSlice` in the cluster namespace of `MultiClusterService`'s `spec.serviceConsumptionClusters`. When creating the Work, in order to facilitate problem investigation, we should add following annotation to record the original `EndpointSlice` information: * `endpointslice.karmada.io/work-provision-cluster`: the cluster name of the original `EndpointSlice`. Also, we should add the following annotation to the synced `EndpointSlice` record the original information: * `endpointslice.karmada.io/endpointslice-generation`: the resource generation of the `EndpointSlice`, it could be used to check whether the `EndpointSlice` is the newest version. * `endpointslice.karmada.io/provision-cluster`: the cluster location of the original `EndpointSlice`. 1. Karmada will sync the `EndpointSlice`'s work to the member clusters. But, there is one point to note that, assume I have following configuration: ```yaml apiVersion: v1 kind: Service metadata: name: foo spec: ports: - port: 80 targetPort: 8080 selector: app: foo --- apiVersion: networking.karmada.io/v1alpha1 kind: MultiClusterService metadata: name: foo spec: types: - CrossCluster serviceProvisionClusters: - member1 - member2 serviceConsumptionClusters: - member2 ``` When create the corresponding Work, Karmada should only sync the exists `EndpointSlice` in `member1` to `member2`. ### Components change #### karmada-controller * Add `multiclusterservice` controller to support reconcile `MultiClusterService` and Clusters, including creation/deletion/updating. * Add `endpointsliceCollect` controller to support reconcile `MultiClusterService` and Clusters, collect `EndpointSlice` from `ServerClusters` as work. * Add `endpointsliceDispatch` controller to support reconcile `MultiClusterService` and Clusters, dispatch `EndpointSlice` work from `serviceProvisionClusters` to `serviceConsumptionClusters`. ### Status Record We should have following Condition in `MultiClusterService`: ```go MCSServiceAppliedConditionType = "ServiceApplied" MCSEndpointSliceCollectedConditionType = "EndpointSliceCollected" MCSEndpointSliceAppliedConditionType = "EndpointSliceApplied" ``` `MCSServiceAppliedConditionType` is used to record the status of `Service` propagation, for example: ```yaml status: conditions: - lastTransitionTime: "2023-11-20T02:30:49Z" message: Service is propagated to target clusters. reason: ServiceAppliedSuccess status: "True" type: ServiceApplied ``` `MCSEndpointSliceCollectedConditionType` is used to record the status of `EndpointSlice` collection, for example: ```yaml status: conditions: - lastTransitionTime: "2023-11-20T02:30:49Z" message: Failed to list&watch EndpointSlice in member3. reason: EndpointSliceCollectedFailed status: "False" type: EndpointSliceCollected ``` `MCSEndpointSliceAppliedConditionType` is used to record the status of `EndpointSlice` synchronization, for example: ```yaml status: conditions: - lastTransitionTime: "2023-11-20T02:30:49Z" message: EndpointSlices are propagated to target clusters. reason: EndpointSliceAppliedSuccess status: "True" type: EndpointSliceApplied ``` ### Metrics Record For better monitoring, we should have following metrics: * `mcs_sync_svc_duration_seconds` - The duration of syncing `Service` from Karmada control plane to member clusters. * `mcs_sync_eps_duration_seconds` - The time it takes from detecting the EndpointSlice to creating/updating the corresponding Work in a specific namespace. ### Development Plan * API definition, including API files, CRD files, and generated code. (1d) * For `multiclusterservice` controller, List&watch mcs and service, reconcile the work in execution namespace. (5d) * For `multiclusterservice` controller, List&watch cluster creation/deletion, reconcile the work in corresponding cluster execution namespace. (10) * For `endpointsliceCollect` controller, List&watch mcs, collect the corresponding EndpointSlice from `serviceProvisionClusters`, and `endpointsliceDispatch` controller should sync the corresponding Work. (5d) * For `endpointsliceCollect` controller, List&watch cluster creation/deletion, reconcile the EndpointSlice's work in corresponding cluster execution namespace. (10d) * If cluster gets unhealthy, mcs-eps-controller should delete the EndpointSlice from all the cluster execution namespace. (5d) ### Test Plan - UT cover for new add code - E2E cover for new add case ## Alternatives One alternative approach to service discovery with native Kubernetes naming and resolution is to rely on external DNS-based service discovery mechanisms. However, this approach may require additional configuration and management overhead, as well as potential inconsistencies between different DNS implementations. By leveraging the native Kubernetes naming and resolution capabilities, the proposed solution simplifies service discovery and provides a consistent user experience. Another alternative approach could be to enforce a strict naming convention for imported services, where a specific prefix or suffix is added to the service name to differentiate it from local services. However, this approach may introduce complexity for users and require manual handling of naming collisions. The proposed solution aims to provide a more user-friendly experience by removing the "derived-" prefix and allowing services to be accessed using their original names.