
# Network programming latency SLIs/SLOs details

## Definition

| Status | SLI | SLO |
| --- | --- | --- |
| WIP | Latency of programming the in-cluster load balancing mechanism (e.g. iptables), measured from when the service spec or the list of its `Ready` pods changes to when it is reflected in the load balancing mechanism, measured as the 99th percentile over the last 5 minutes aggregated across all programmers [1] | In default Kubernetes installation, 99th percentile per cluster-day <= X |

[1] Aggregation across all programmers means that all samples from all programmers go into one large pool, and the SLI is the percentile computed over all of them.

## User stories

- As a user of vanilla Kubernetes, I want a guarantee of how quickly new backends of my service become targets of in-cluster load-balancing
- As a user of vanilla Kubernetes, I want a guarantee of how quickly deleted (or unhealthy) backends of my service are removed from in-cluster load-balancing
- As a user of vanilla Kubernetes, I want a guarantee of how quickly changes to the service specification (including creation) are reflected in in-cluster load-balancing

## Other notes

- We are consciously focusing on in-cluster load-balancing for the purpose of this SLI, as external load-balancing is clearly provider-specific (which makes it hard to set an SLO for it).
- However, in the future it should be possible to formulate the SLI for external load-balancing in much the same way, for consistency.
- An SLI measuring the end-to-end time from pod creation was also considered, but rejected because it would be application-specific, which would make introducing an SLO impossible.

## Caveats

- The SLI is aggregated across all "programmers", which is what is interesting for the end user. It may happen that a small percentage of programmers are completely unresponsive (while all others are fast), but that is intentional: we need to allow for slower or unresponsive nodes, because at some scale this will happen. The reason for doing it this way is the feasibility of computing the SLI efficiently (a sketch of the pooling semantics follows this list):
  - If we were aggregating at the SLI level (i.e. if the SLI were formulated as "... reflected in the in-cluster load-balancing mechanism and visible from 99% of programmers"), computing it would be extremely difficult. In order to decide, for example, whether a pod transition to the `Ready` state has been reflected, we would have to know exactly when it was reflected in 99% of programmers (e.g. in iptables). That requires tracking metrics on a per-change basis, which we cannot do efficiently.
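To make the pooling semantics concrete, below is a minimal Go sketch with hypothetical data and helper names. It only illustrates how samples from all programmers end up in a single pool before the percentile is taken; the real aggregation would of course be done over exported metrics (e.g. Prometheus histograms), not in-process.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the pooled samples
// using the nearest-rank method. How the real pipeline computes percentiles
// (e.g. from histogram buckets) is an implementation detail; this function
// only illustrates the pooling semantics.
func percentile(samples []time.Duration, p float64) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := int(math.Ceil(p/100*float64(len(sorted)))) - 1
	return sorted[rank]
}

func main() {
	// Hypothetical per-programmer samples (e.g. one slice per kube-proxy instance).
	perProgrammer := map[string][]time.Duration{
		"node-a": {50 * time.Millisecond, 70 * time.Millisecond},
		"node-b": {60 * time.Millisecond, 30 * time.Second}, // one slow/unresponsive node
	}

	// All samples from all programmers go into one large pool...
	var pool []time.Duration
	for _, samples := range perProgrammer {
		pool = append(pool, samples...)
	}
	// ...and the SLI is the percentile computed over that single pool.
	fmt.Println("99th percentile:", percentile(pool, 99))
}
```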

## How to measure the SLI

The method of measuring this SLI is not obvious, so for completeness we describe here how it will be implemented, together with all caveats.

1. We assume that Kubernetes `Endpoints` objects are used for programming the in-cluster load-balancing.
2. We will introduce a dedicated annotation for the `Endpoints` object (name TBD).
3. The Endpoints controller (while updating a given `Endpoints` object) will set the value of that annotation to the timestamp of the change that triggered the update:
   - for a pod transition between the `Ready` and `NotReady` states, the timestamp is simply part of the pod condition;
   - TBD for service updates (ideally we will add a `LastUpdateTimestamp` field in object metadata, next to the already existing `CreationTimestamp`; the data is already present at the storage layer, so it won't be hard to propagate it).
4. The in-cluster load-balancing programmer will export a Prometheus metric once it is done with programming. The latency of the operation is defined as the difference between the timestamp of when the operation is done and the timestamp recorded in the newly introduced annotation. A rough sketch of both sides of this flow follows.
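For illustration only, here is a minimal Go sketch of both sides of that flow. The annotation key, metric name, and function names below are placeholders (the actual names are TBD, as noted above), and error handling is reduced to a minimum.

```go
package networkprogramming

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	v1 "k8s.io/api/core/v1"
)

// Placeholder annotation key; the real name is TBD.
const triggerTimeAnnotation = "endpoints.kubernetes.io/last-change-trigger-time"

// Histogram exported by the in-cluster load-balancing programmer (e.g. kube-proxy).
// The metric name and buckets are illustrative only.
var networkProgrammingLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "network_programming_duration_seconds",
	Help:    "Latency from the triggering change to the load-balancing rules being programmed.",
	Buckets: prometheus.ExponentialBuckets(0.001, 2, 20),
})

func init() {
	prometheus.MustRegister(networkProgrammingLatency)
}

// Endpoints-controller side: while updating an Endpoints object, record the
// timestamp of the change that triggered the update (e.g. the pod Ready
// condition transition time) in the dedicated annotation.
func setTriggerTime(endpoints *v1.Endpoints, triggerTime time.Time) {
	if endpoints.Annotations == nil {
		endpoints.Annotations = map[string]string{}
	}
	endpoints.Annotations[triggerTimeAnnotation] = triggerTime.Format(time.RFC3339Nano)
}

// Programmer side: once programming for the Endpoints object is done, export
// the latency as (time the operation finished) - (timestamp from the annotation).
func recordProgrammingLatency(endpoints *v1.Endpoints, doneTime time.Time) {
	value, ok := endpoints.Annotations[triggerTimeAnnotation]
	if !ok {
		return // nothing to measure for this update
	}
	triggerTime, err := time.Parse(time.RFC3339Nano, value)
	if err != nil {
		return
	}
	networkProgrammingLatency.Observe(doneTime.Sub(triggerTime).Seconds())
}
```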

### Caveats

There are a couple of caveats to that measurement method:

1. A single `Endpoints` object may batch multiple pod state transitions.
   In that case, we simply choose the oldest one (and do not expose all timestamps, to avoid theoretically unbounded growth of the object). That makes the metric imprecise, but the batching period should be relatively small compared to the whole end-to-end flow.
2. A single pod may transition its state multiple times within the batching period.
   For that case, we will add an additional cache in the Endpoints controller, caching the first observed transition timestamp for each pod. The cache will be cleared when the controller picks up a pod into an `Endpoints` object update. This is consistent with choosing the oldest update in the point above (see the sketch after this list).
   Initially, we may consider simply ignoring this fact.
3. Components may fall out of the watch history window and thus miss some watch events.
   This may happen to both the Endpoints controller and kube-proxy (or other network programmers, if used instead). It becomes a problem only when a single object changed multiple times in the meantime (otherwise informers deliver the events to handlers on relisting). Additionally, this can happen only when components are too slow in processing events (which would already be reflected in the metrics) or (sometimes) after a kube-apiserver restart. Given that, we are going to neglect this problem to avoid unnecessary complications for little or no gain.
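For the first two caveats, the bookkeeping in the Endpoints controller could look roughly like the sketch below. The type and method names are hypothetical; the sketch only illustrates the "keep the first observed transition per pod, hand out the oldest one on pickup, then clear the cache" behavior.

```go
package endpointscontroller

import (
	"sync"
	"time"
)

// triggerTimeTracker keeps, per pod, the first observed Ready/NotReady
// transition timestamp that has not yet been reflected in an Endpoints update.
type triggerTimeTracker struct {
	mu              sync.Mutex
	firstTransition map[string]time.Time // keyed by pod name
}

func newTriggerTimeTracker() *triggerTimeTracker {
	return &triggerTimeTracker{firstTransition: map[string]time.Time{}}
}

// observe records a pod state transition; if the pod transitions several times
// within one batching period, only the first timestamp is kept (caveat 2).
func (t *triggerTimeTracker) observe(pod string, transitionTime time.Time) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, ok := t.firstTransition[pod]; !ok {
		t.firstTransition[pod] = transitionTime
	}
}

// pickUp is called when the controller batches the given pods into a single
// Endpoints object update. It returns the oldest pending timestamp (caveat 1)
// to be written into the annotation and clears the cache for those pods.
func (t *triggerTimeTracker) pickUp(pods []string) (time.Time, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	var oldest time.Time
	found := false
	for _, pod := range pods {
		ts, ok := t.firstTransition[pod]
		if !ok {
			continue
		}
		if !found || ts.Before(oldest) {
			oldest = ts
			found = true
		}
		delete(t.firstTransition, pod)
	}
	return oldest, found
}
```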

## Test scenario

TODO: Describe test scenario.