## In-cluster network latency SLIs/SLOs details

### Definition

| Status | SLI | SLO |
| --- | --- | --- |
| __WIP__ | In-cluster network latency from a single prober pod, measured as the latency of a per-second ping from that pod to a "null service", aggregated as the 99th percentile over the last 5 minutes. | In a default Kubernetes installation with RTT between nodes <= Y, the 99th percentile of (99th percentile over all prober pods) per cluster-day <= X |

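
The nested aggregation in the definition above can be sketched as follows. This is a minimal illustration, not an agreed implementation: the nearest-rank percentile method, the function names, and the millisecond unit are all assumptions, and the thresholds X and Y are left open exactly as in the table (the "RTT between nodes <= Y" precondition is not checked here).

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of samples <= it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank; round() guards against floating-point noise before ceil().
    rank = math.ceil(round(q * len(ordered) / 100, 9))
    return ordered[rank - 1]


def sli_for_pod(window_latencies_ms: Sequence[float]) -> float:
    """SLI for one prober pod: p99 of the per-second pings in one 5-minute window."""
    return percentile(window_latencies_ms, 99)


def cluster_value(per_pod_sli_ms: Dict[str, float]) -> float:
    """Inner SLO aggregation: p99 over the SLI values of all prober pods."""
    return percentile(list(per_pod_sli_ms.values()), 99)


def slo_met(cluster_values_ms: Sequence[float], x_ms: float) -> bool:
    """Outer SLO aggregation: p99 of the cluster values in one cluster-day <= X."""
    return percentile(cluster_values_ms, 99) <= x_ms
```

With one ping per second, a 5-minute window holds about 300 samples, so the per-pod SLI is roughly the third-slowest ping in that window.
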
### User stories
- As a user of vanilla Kubernetes, I want some guarantee of how quickly my HTTP
request to a Kubernetes service reaches its endpoint.

### Other notes
- We can't give any guarantee in the general case, because cluster
administrators may configure their clusters however they want.
- As a result, we define the SLI to be very generic (it applies no matter how
your cluster is set up), but we provide the SLO only for default installations,
with the additional requirement that the low-level RTT between nodes is lower
than Y.
- Network latency is one of the most crucial aspects of application
performance, especially in the microservices world. As a result, to
meet user expectations, we need to provide some guarantees around it.
- We chose the SLI definition formulated above because:
  - it represents a user-oriented end-to-end flow: among other things, it
    involves the latency of the in-cluster network programming mechanism
    (e.g. iptables). <br/>
    __TODO:__ We considered making DNS resolution part of it, but decided not
    to mix them. However, longer term we should consider joining them.
  - it is easily measurable in all running clusters in which we can run probers
    (by contrast, measuring request latencies coming from all pods on a given
    node would require additional instrumentation, such as a sidecar for
    each of them, and that overhead may not be acceptable in many cases)
  - it is not application-specific

### Caveats
- The SLI is formulated for a single prober pod, even though users are mostly
interested in the aggregation across all pods (which is done only at the SLO
level). However, that provides very similar guarantees and makes it fairly
easy to measure.
- The RTT between nodes may differ significantly if nodes are in different
topologies (e.g. GCP zones). However, given that topology-aware service routing
is not natively supported in Kubernetes yet, we explicitly acknowledge that,
depending on the pinged endpoint, results may differ significantly if nodes
span multiple topologies.
- The prober reporting is fairly trivial and itself needs only a negligible
amount of resources. Unfortunately, there isn't any component to which we can
attach that functionality (e.g. KubeProxy is running in the host network), so
**we will create a dedicated set of prober pods**, with their number
proportional to cluster size.
- We don't have any "null service" running in the cluster, so an administrator
has to set one up to make the SLI measurable in a real cluster. In tests, we
will create a service on top of the prober pods.

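
To make the prober described in the caveats more concrete, here is a rough sketch of its measurement loop. It is only an illustration under stated assumptions: the service URL, the function names, and the use of plain HTTP via `urllib` are hypothetical, and handling of failed or timed-out pings is omitted because the text does not specify it.

```python
import time
import urllib.request
from collections import deque
from typing import Callable, Deque

# Hypothetical in-cluster address of the "null service"; not defined by this document.
NULL_SERVICE_URL = "http://null-service.default.svc.cluster.local/"
WINDOW_SECONDS = 5 * 60  # the SLI looks at the last 5 minutes of per-second pings


def http_ping(url: str = NULL_SERVICE_URL, timeout_s: float = 1.0) -> None:
    """Send one request to the null service and discard the response body."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        response.read()


def timed_ms(ping: Callable[[], None],
             clock: Callable[[], float] = time.monotonic) -> float:
    """Return how long one ping took, in milliseconds."""
    start = clock()
    ping()
    return (clock() - start) * 1000.0


def run_prober(
    ping: Callable[[], None],
    iterations: int,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Deque[float]:
    """Ping repeatedly and keep only the samples that fit in the 5-minute window."""
    window: Deque[float] = deque(maxlen=int(WINDOW_SECONDS / interval_s))
    for _ in range(iterations):
        window.append(timed_ms(ping, clock))
        # Simplification: sleeping after each ping makes a cycle slightly longer
        # than interval_s; a real prober would schedule against absolute time.
        sleep(interval_s)
    return window
```

Inside a prober pod this could be started with `run_prober(http_ping, iterations=300)`. The `ping`, `sleep`, and `clock` parameters are injectable so that the loop can be exercised without a cluster.
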
### Test scenario

__TODO: Describe test scenario.__