## DNS programming latency SLIs/SLOs details
### Definition
| Status | SLI | SLO |
| --- | --- | --- |
| __WIP__ | Latency of programming a DNS instance, measured from when the service spec or the list of its `Ready` pods changes to when the change is reflected in that DNS instance, measured as the 99th percentile over the last 5 minutes aggregated across all DNS instances<sup>[1](#footnote1)</sup> | In default Kubernetes installation, 99th percentile per cluster-day <= X |
<a name="footnote1">[1\]</a>Aggregation across all programmers means that all
|
|
samples from all programmers go into one large pool, and SLI is percentile
|
|
from all of them.
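
To make the pooling concrete, here is a minimal Go sketch of a pooled
percentile across per-instance samples. It is purely illustrative: the
instance names and samples are made up, and no Kubernetes component computes
the SLI exactly this way.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// pooledPercentile merges latency samples from all instances into one large
// pool (as described in footnote [1]) and returns the requested percentile.
func pooledPercentile(perInstance map[string][]time.Duration, p float64) time.Duration {
	var pool []time.Duration
	for _, samples := range perInstance {
		pool = append(pool, samples...)
	}
	if len(pool) == 0 {
		return 0
	}
	sort.Slice(pool, func(i, j int) bool { return pool[i] < pool[j] })
	// Nearest-rank style index; good enough for a sketch.
	return pool[int(float64(len(pool)-1)*p)]
}

func main() {
	// Hypothetical programming-latency samples from three DNS instances,
	// collected over the last 5 minutes.
	samples := map[string][]time.Duration{
		"dns-0": {90 * time.Millisecond, 120 * time.Millisecond},
		"dns-1": {110 * time.Millisecond, 200 * time.Millisecond},
		"dns-2": {3 * time.Second}, // an unresponsive instance still only contributes samples
	}
	fmt.Println("pooled p99:", pooledPercentile(samples, 0.99))
}
```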
|
### User stories
- As a user of vanilla Kubernetes, I want some guarantee of how quickly in-cluster
  DNS will start resolving a service name to its newly started backends
  (see the probe sketch after this list).
- As a user of vanilla Kubernetes, I want some guarantee of how quickly in-cluster
  DNS will stop resolving a service name to its removed (or unhealthy) backends.
- As a user of vanilla Kubernetes, I want some guarantee of how quickly newly
  created services will be resolvable via in-cluster DNS.
- As a user of vanilla Kubernetes, I want some guarantee of how quickly in-cluster
  DNS will start resolving a headless service's hostnames to its newly started backends.
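
All four stories reduce to the same probe pattern: note when the cluster state
changes, then poll in-cluster DNS until the answer reflects it. Below is a
minimal sketch of such a probe in Go, assuming a hypothetical service name and
backend IP; the actual measurement is done by the methodology linked under
"How to measure the SLI" below.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForBackend polls in-cluster DNS until the service name resolves to the
// expected backend IP, returning the observed programming latency.
func waitForBackend(ctx context.Context, service, expectedIP string) (time.Duration, error) {
	resolver := &net.Resolver{}
	start := time.Now() // ideally: the instant the pod became Ready
	for {
		ips, err := resolver.LookupHost(ctx, service)
		if err == nil {
			for _, ip := range ips {
				if ip == expectedIP {
					return time.Since(start), nil
				}
			}
		}
		select {
		case <-ctx.Done():
			return 0, ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	// Hypothetical service DNS name and backend pod IP.
	latency, err := waitForBackend(ctx, "my-service.default.svc.cluster.local", "10.0.0.42")
	if err != nil {
		fmt.Println("backend never became resolvable:", err)
		return
	}
	fmt.Println("DNS programming latency:", latency)
}
```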
### Other notes
- We are consciously focusing on in-cluster DNS for the purpose of this SLI,
  as external DNS resolution clearly depends on the cloud provider or environment
  in which the cluster is running (which makes it hard to set an SLO for it).
### Caveats
- The SLI is aggregated across all DNS instances, which is what is interesting
  for the end user. It may happen that a small percentage of DNS instances are
  completely unresponsive (while all others are fast), but that is desired - we need
  to allow for slower/unresponsive ones because at some scale that will happen.
  The reason for doing it this way is the feasibility of computing it efficiently:
  - if we aggregated at the SLI level (i.e. the SLI were
    formulated like "... reflected in in-cluster DNS and
    visible from 99% of DNS instances"), computing that SLI would be extremely
    difficult. That is because in order to decide e.g. whether a pod's transition to
    the Ready state is reflected, we would have to know when exactly it was reflected
    in 99% of DNS instances. That requires tracking metrics on a
    per-change basis (which we can't do efficiently).
- The SLI for DNS publishing should remain constant independent of the number of records.
  For example, in a headless service with thousands of pods, the time between a pod being
  assigned an IP and the time DNS makes that IP available in the service's A/AAAA record(s)
  should be statistically consistent for the first pod and the last pod
  (see the sketch below).
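
A hedged, test-style illustration of checking that property (the service name
and pod IPs are hypothetical, and the real SLI is not computed like this):
record when each expected pod IP first appears in the headless service's
A records, then compare the first and last pods' latencies.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// recordAppearance polls a headless service's A records and reports when each
// expected pod IP first shows up, so first-pod and last-pod programming
// latencies can be compared for consistency.
func recordAppearance(ctx context.Context, service string, expected []string) map[string]time.Duration {
	want := make(map[string]bool, len(expected))
	for _, ip := range expected {
		want[ip] = true
	}
	resolver := &net.Resolver{}
	start := time.Now()
	seen := make(map[string]time.Duration, len(expected))
	for len(seen) < len(expected) && ctx.Err() == nil {
		ips, err := resolver.LookupHost(ctx, service)
		if err == nil {
			now := time.Since(start)
			for _, ip := range ips {
				if want[ip] {
					if _, ok := seen[ip]; !ok {
						seen[ip] = now
					}
				}
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
	return seen
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	// Hypothetical headless service and the pod IPs assigned to it, in creation order.
	pods := []string{"10.0.0.1", "10.0.0.2", "10.0.3.232"}
	seen := recordAppearance(ctx, "my-headless.default.svc.cluster.local", pods)
	fmt.Printf("first pod: %v, last pod: %v\n", seen[pods[0]], seen[pods[len(pods)-1]])
}
```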
### How to measure the SLI
The [network programming latency](./network_programming_latency.md) SLI is
formulated in almost exactly the same way. As a result, the methodology for
measuring the SLI here is exactly the same and can be found
[here](./network_programming_latency.md#how-to-measure-the-sli).
### Test scenario
__TODO: Describe test scenario.__