DNS latency SLI

wojtekt 2018-11-28 15:47:34 +01:00
parent f8c08b18ff
commit 0cc993ebfc
2 changed files with 60 additions and 0 deletions


@@ -0,0 +1,55 @@
## In-cluster DNS latency SLIs/SLOs details
### Definition
| Status | SLI | SLO |
| --- | --- | --- |
| __WIP__ | In-cluster DNS latency from a single prober pod, measured as latency of a per-second DNS lookup<sup>[1](#footnote)</sup> for "null service" from that pod, measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day <= X |
<a name="footnote">\[1\]</a> In fact two DNS lookups: (1) to nameserver IP from
/etc/resolv.conf (2) to kube-system/kube-dns service IP and track them as two
separate SLIs.
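
To make the per-probe measurement defined above concrete, below is a minimal Go
sketch of the intended lookup pattern: one DNS query per second against each of
the two nameserver IPs, timing each lookup separately. The hardcoded IP addresses,
the `null-service` name, and logging instead of exporting metrics are illustrative
assumptions only; a real prober would discover the nameserver from /etc/resolv.conf
and the kube-dns service IP, and would export latencies to a metrics pipeline.

```go
package main

import (
	"context"
	"log"
	"net"
	"time"
)

// lookupLatency resolves name against a specific nameserver IP and returns
// how long the lookup took. Forcing the pure-Go resolver with a custom Dial
// lets us target a chosen nameserver directly instead of the system default.
func lookupLatency(ctx context.Context, nameserverIP, name string) (time.Duration, error) {
	resolver := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, net.JoinHostPort(nameserverIP, "53"))
		},
	}
	start := time.Now()
	_, err := resolver.LookupHost(ctx, name)
	return time.Since(start), err
}

func main() {
	// Illustrative targets: (1) the nameserver IP from /etc/resolv.conf and
	// (2) the kube-system/kube-dns service IP. Hardcoded here only for the sketch.
	nameservers := []string{"169.254.20.10", "10.0.0.10"}
	// Hypothetical "null service" name; in tests it would point at the prober pods.
	const probedName = "null-service.default.svc.cluster.local"

	// One lookup per second against each nameserver, tracked as two separate SLIs.
	for range time.Tick(time.Second) {
		for _, ns := range nameservers {
			ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
			latency, err := lookupLatency(ctx, ns, probedName)
			cancel()
			if err != nil {
				log.Printf("nameserver=%s name=%s error=%v", ns, probedName, err)
				continue
			}
			log.Printf("nameserver=%s name=%s latency=%v", ns, probedName, latency)
		}
	}
}
```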
### User stories
- As a user of vanilla Kubernetes, I want some guarantee of how fast my in-cluster
DNS requests are resolved.
### Other notes
- We obviously can't give any guarantees in the general case, because cluster
administrators may configure their clusters however they want.
- As a result, we define the SLI very generically (it applies no matter how your
cluster is set up), but we provide an SLO only for default installations, with the
additional requirement that the low-level RTT between nodes is lower than Y.
- DNS latency is one of the most crucial aspects of application performance,
especially in the microservices world. As a result, to meet user expectations,
we need to provide some guarantees around it.
- We are introducing two SLIs (for two IP addresses) to enable measuring the
impact of node-local caching.
### Caveats
- The SLI is formulated for a single prober pod, even though users are mostly
interested in the aggregation across all pods (which happens only at the SLO
level; see the sketch after this list). However, the per-pod formulation provides
very similar guarantees and is fairly easy to measure.
- The RTT between nodes may differ significantly if nodes are in different
topologies (e.g. GCP zones). Given that topology-aware service routing is not yet
natively supported in Kubernetes, we explicitly acknowledge that results may
differ significantly depending on the queried endpoint when nodes span multiple
topologies.
- The prober reporting this metric is fairly trivial and itself needs only a
negligible amount of resources. Unfortunately, there isn't any existing component
to which we can attach that functionality (e.g. KubeProxy is running in the host
network), so **we will create a dedicated set of prober pods**, with their number
proportional to cluster size.
- We don't have any "null service" running in the cluster, so an administrator has
to set one up to make the SLI measurable in a real cluster. In tests, we will
create a service on top of the prober pods.
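
As a rough illustration of the two levels of aggregation mentioned above (per-pod
SLI vs. cluster-level SLO), here is a minimal Go sketch that computes the 99th
percentile of each pod's 5-minute sample window and then the 99th percentile across
pods. The sample data and the nearest-rank percentile implementation are assumptions
for illustration; the real values would come from the prober pods' exported metrics.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 1) of the samples,
// using the nearest-rank method on a sorted copy.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p*float64(len(sorted)))) - 1
	if rank < 0 {
		rank = 0
	}
	return sorted[rank]
}

func main() {
	// Hypothetical per-pod windows: one DNS lookup latency (in ms) per second
	// over the last 5 minutes (~300 samples per prober pod in practice).
	perPodWindows := map[string][]float64{
		"prober-a": {3.1, 2.8, 4.0, 3.5},
		"prober-b": {5.2, 3.9, 3.4, 6.1},
	}

	// SLI: 99th percentile over the last 5 minutes, computed per prober pod.
	var perPodP99 []float64
	for pod, window := range perPodWindows {
		p99 := percentile(window, 0.99)
		fmt.Printf("%s: p99 over last 5m = %.1f ms\n", pod, p99)
		perPodP99 = append(perPodP99, p99)
	}

	// SLO: 99th percentile across the per-pod values; per cluster-day this
	// aggregate would be compared against the threshold X.
	fmt.Printf("cluster: p99 over all prober pods = %.1f ms\n", percentile(perPodP99, 0.99))
}
```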
### TODOs
- DNS latency is only one of the critical metrics, the other being "drop rate"
or "timeout rate". Given that it seems harder to measure/sample, we would like
to address it separately to avoid blocking this SLI on that resolution.
### Test scenario
__TODO: Describe test scenario.__


@@ -109,10 +109,15 @@ Prerequisite: Kubernetes cluster is available and serving.
| __WIP__ | Latency of programming a single (e.g. iptables on a given node) in-cluster load balancing mechanism, measured from when service spec or list of its `Ready` pods change to when it is reflected in load balancing mechanism, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all programmers (e.g. iptables)) per cluster-day<sup>[1](#footnote1)</sup> <= X | [Details](./network_programming_latency.md) |
| __WIP__ | Latency of programming a single in-cluster dns instance, measured from when service spec or list of its `Ready` pods change to when it is reflected in that dns instance, measured as 99th percentile over last 5 minutes | In default Kubernetes installation, 99th percentile of (99th percentiles across all dns instances) per cluster-day <= X | [Details](./dns_programming_latency.md) |
| __WIP__ | In-cluster network latency from a single prober pod, measured as latency of a per-second ping from that pod to "null service", measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day <= X | [Details](./network_latency.md) |
| __WIP__ | In-cluster DNS latency from a single prober pod, measured as latency of a per-second DNS lookup<sup>[2](#footnote2)</sup> for "null service" from that pod, measured as 99th percentile over last 5 minutes. | In default Kubernetes installation with RTT between nodes <= Y, 99th percentile of (99th percentile over all prober pods) per cluster-day <= X |
<a name="footnote1">\[1\]</a> For the purpose of visualization it will be a
sliding window. However, for the purpose of reporting the SLO, it means one
point per day (whether SLO was satisfied on a given day or not).
<a name="footnote2">\[2\]</a> In fact two DNS lookups: (1) to nameserver IP from
/etc/resolv.conf (2) to kube-system/kube-dns service IP and track them as two
separate SLIs.
### Burst SLIs/SLOs