Kubernetes Scalability/Performance Regressions - Case Studies & Insights
by Shyam JVS, Google Inc
February 2018
Overview
This document compiles some interesting scalability/performance regression stories from the past. They were identified, studied, and fixed largely by sig-scalability. We begin by listing them, along with succinct explanations, the features/components involved, and the relevant SIGs (besides sig-scalability). For each one we also record the smallest scale, for both real and simulated (i.e. kubemark) clusters, that managed to catch the regression. At the end of the document we draw some useful insights from the case studies.
Case Studies
| Issue | Brief Description | Details | Relevant feature(s)/component(s) | Relevant SIG(s) | Smallest real cluster affected | Smallest kubemark cluster affected |
|---|---|---|---|---|---|---|
| #60035 | Kubemark-scale fails with a couple of hollow-nodes getting pre-empted due to higher memory usage | A few hollow-nodes started getting pre-empted by kubelets due to a memory shortage for running critical pods. The increase in memory usage of hollow-nodes (more specifically, hollow kube-proxy) was due to resolving a recent bug with endpoints in kubemark (#59823). | | - | - | 5000 |
| #59823 | Endpoints objects in kubemark are empty, leading to misleading performance results | Endpoints objects weren't getting populated with more than a single entry, due to conflicting node names for the same pod IP. The pod IPs were the same because of a bug in our mock docker-client, which assigned a constant IP to all fake pods. This is probably a regression that didn't exist about a year back. It had significant performance implications (see the bug). | | sig-network | - | 100 |
| #56061 | Apiserver memory usage increased by 10-20% after addition of admission metrics | A number of admission metrics were added to the apiserver for monitoring admission plugins, webhooks, etc. Soon after that change we started seeing a 100-200 MB increase in apiserver memory usage on a 100-node cluster. Thanks to the resource usage checks in our performance tests, we were able to spot the regression. It was fixed later by making those metrics lighter (i.e. removing some SummaryVec metrics and reducing histogram buckets). | | sig-api-machinery sig-instrumentation | 100 | - |
| #55695 | Metadata-proxy not able to handle too many pods per node | Metadata-proxy, a newly enabled node agent for proxying metadata requests coming from pods on the node, was unable to handle load from >70 pods due to memory starvation. This violated our official k8s support for 110 pods/node. | | sig-auth sig-node | - | 500 |
| #55060 | Increase in pod startup latency due to Duplicate Address Detection in CNI plugin | An update to the Container Network Interface (CNI) library introduced a new Duplicate Address Detection (DAD) step, which delayed the CNI plugins while they waited for it to finish. Since this was on the code path for container creation, it increased pod startup latency on the kubelet side by more than a second. As a result, we saw violations of our 5s pod-startup latency SLO on reasonably large clusters (where we were already close to the SLO earlier). | | sig-node sig-network | 2000 (though some effect was also seen at 100) | - |
| #54164 | Kube-dns pods coming up very slowly in large clusters due to inter-pod anti-affinity | Kube-dns, a default deployment for k8s clusters, introduced node-level soft inter-pod anti-affinity in order to spread those pods across different nodes. However, the O(pods^2) implementation of anti-affinity in the scheduler made their scheduling extremely slow. As a result, cluster creation was timing out. | | sig-scheduling sig-network | 2000 | - |
| #53327 (part) | Performance tests seeing a huge drop in scheduler throughput due to one predicate slowing down | One of the scheduler predicates was changed to compute a random 32-character string. That made the predicate extremely slow, as it started starving for randomness (especially with the predicate running for each of 1000s of pods), and reduced scheduler throughput by ~10 times. This was fixed after a few optimizations to the random package (eventually getting rid of the rand() call). | | sig-scheduling | 100 (mild signal) | 500 (strong signal) |
| #53327 (part) | Kubemark performance tests fail with timeout during pod deletion due to bug in kubelet mock | The kubelet mock (hollow-kubelet) started behaving differently from the real kubelet due to some changes in the latter. As a result, the hollow-kubelet never deleted pods in one corner case: a "DELETE pod" event arriving for a pod while the kubelet is in the middle of its container creation. This was a tricky regression that took considerable hunting before we could set the mock right. | | sig-node | - | 5000 (also 500, but flakily) |
| #52284 | CIDR allocation very slow with IP aliases | This was a pre-existing performance issue that got exposed as a regression when we turned on IP aliasing for large clusters. The CIDR-allocator (part of controller-manager) performed poorly due to bad design, mainly a lack of concurrency and synchronous processing of events from shared informers. A number of optimizations later (#52292) fixed its performance. | | sig-network | 2000 | - |
| #51903 | A few nodes failing to start in kubemark due to reduced PID limit for docker in newer COS image | When the COS m60 image was introduced, some of the kubemark hollow-node pods started failing to start because docker on the host node was crossing the PID limit. This was a risky regression in terms of the damage it could've caused if rolled out to production, and our scalability tests caught it. Besides the low PID threshold, it also helped catch another issue with containerd-shim starting too many threads. | | sig-node | - | 500 |
| #51899 (part) | "PATCH node-status" calls seeing high latency due to blocking on audit-logging | Those calls are made by kubelets once every X seconds, which adds up to a significant QPS for large clusters. Part of handling those calls is audit-logging them. When a change moved the default audit-log format to JSON, a performance issue with the design was exposed. The update handler for those calls was writing the audit log synchronously (instead of buffering and writing asynchronously), which slowed down those calls by an order of magnitude. | | sig-auth sig-instrumentation sig-api-machinery | 2000 | - |
| #51899 (part) | "DELETE pods" API call latencies shot up on large cluster tests due to kubelet thundering herd | A change to kubelet pod deletion resulted in delete-pod API calls from kubelets being concentrated immediately after container garbage collection. When deleting large numbers (O(10k)) of pods across large numbers (O(1k)) of nodes, the resulting concentrated delete calls from the kubelets increased the latency of "DELETE pods" API calls (above our target SLO of 1s). | | sig-node | 2000 | - |
| #51099 | gRPC update causing failure of API calls with large responses | When the vendored gRPC library was updated to v1.5.1, the default limit on response size between apiserver and etcd changed to 4MB. This could only be caught by scalability tests, as our regular tests run at a much smaller scale and so don't actually encounter such large responses. | | sig-api-machinery | 100 | 100 |
| #50854 | Route-controller timing out while listing routes from cloud-provider | Route-controller was failing to list routes from the cloud-provider API and in turn failed to create routes for the nodes. The reason was that the project in which the cluster was being created had started hosting another huge cluster (with O(5k) routes), which interfered with the list-routes call for this cluster due to cloud-provider side issues. | | sig-network sig-gcp | - | 5000 (running beside a real 5000 cluster) |
| #50366 | Failing to fit some pods on cluster due to accidentally increased fluentd resource request | A change to how fluentd resource requests are set accidentally doubled its CPU request. This was caught by our kubemark scalability test, where we tightly fit our hollow-node pods onto a small set of nodes. With the fluentd increase, some of those pods couldn't be scheduled due to CPU shortage. This bug was risky for production, as it could've preempted some of the users' pods for fluentd (a critical pod). | | sig-instrumentation | - | 500 |
| #48700 | Apiserver panic while logging a request in TooManyRequests handler | A change in the ordering of apiserver request handlers (one of which is the TooManyRequests handler) caused a panic while instrumenting the request. Though this is not a scalability regression per se, it is a scenario that was exposed only by our scale tests, where we actually see 429s (TooManyRequests) due to the scale at which we run the clusters (unlike normal-scale tests). | | sig-api-machinery | 100 | 500 |
| #47419 | Performance tests failing due to newly exposed high LIST API latencies | After fixing a notorious bug in the instrumentation code for the 'API request latency' metric, we started seeing performance test failures due to high LIST call latencies. Though it seemed like a regression at first, it was actually a hidden performance issue that was brought to light by the fix. We then realized that LIST calls were not actually satisfying our 1s API latency SLO and tuned the SLO for them appropriately. | | sig-api-machinery sig-instrumentation | 2000 | 5000 |
| #45216 | Upgrade to Go 1.8 resulted in a significant performance regression | When k8s was upgraded to go-1.8, we saw timeouts in our kubemark-scale tests due to a ~2x increase in the time taken to create services. After some experimenting/profiling, it seemed to originate from changes to the net/http.(*http2serverConn).serve library function, which had some extra cases added to a select statement. One of them added some logic for gracefulShutdown, which slowed down the function a lot. It was eventually fixed in a patch release by the golang team. | | - | - | 5000 |
| #42000 | Kube-proxy backlog processing causing CPU starvation for kubelet to start new pods | Kube-proxies were slow in processing endpoints updates. As a result, they built up a backlog of work while the load test (which creates many services) was running. Later, when the density test ran (where we create 1000s of pods), the kube-proxies were still busy processing the backlog from the load test and hence consuming high memory. This memory-starved the kubelets from creating the density pods after cgroups were enabled. Before cgroups, this issue was hidden. | | sig-network sig-node | - | 500 |
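
The audit-logging case (#51899) contrasts writing synchronously on the request path with buffering and writing asynchronously. The following is a minimal Python sketch of the buffered, asynchronous pattern only. It is not the apiserver's implementation: the class name `BufferedAuditWriter`, the `write_fn` callback, the buffer size, and the drop-on-full policy are all assumptions made for illustration.

```python
import queue
import threading


class BufferedAuditWriter:
    """Hypothetical audit sink: handlers enqueue events, one thread writes them."""

    def __init__(self, write_fn, buffer_size=1000):
        self._write_fn = write_fn  # the slow backend write (assumed, e.g. JSON to disk)
        self._events = queue.Queue(maxsize=buffer_size)
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def log(self, event):
        """Called on the request path; returns without waiting for the write."""
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            # Assumed policy: drop the event rather than block the request.
            return False

    def _drain(self):
        # Single consumer: events are written in the order they were queued.
        while True:
            event = self._events.get()
            if event is None:  # sentinel pushed by close()
                return
            self._write_fn(event)

    def close(self):
        """Write everything queued so far, then stop the worker thread."""
        self._events.put(None)
        self._worker.join()
```

With this shape, the cost a request handler pays is one queue insertion, and the slow write happens on the worker thread. The trade-off is that events can be lost if the buffer fills up or the process exits before `close()` is called.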
Insights
- On many occasions, our scalability tests caught critical/risky bugs that were missed by most other tests. If not caught, those could've seriously jeopardized the production-readiness of k8s.
- SIG-Scalability has caught/fixed several important issues that span various components, features and SIGs.
- About 60% of the time (possibly even more), we catch scalability regressions with just our medium-scale (and fast) tests, i.e. gce-100 and kubemark-500. Making them run as presubmits should act as a strong shield against regressions.
- The majority of the remaining ones are caught by our large-scale (and slow) tests, i.e. kubemark-5k and gce-2k. Making them post-submit blockers (given they're "usually" quite healthy) should act as a second layer of protection against regressions.
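
As a rough cross-check of the "about 60%" figure, the sketch below tallies the smallest-cluster columns of the case-study table. It is one possible reading of the data, not the authors' method. It assumes that "caught at medium scale" means a real cluster of at most 100 nodes (gce-100) or a kubemark cluster of at most 500 nodes (kubemark-500), and the suffixes on the two-part issues are labels added here for readability.

```python
# Smallest affected cluster sizes, copied from the case-study table.
# Each entry: (issue, smallest real cluster, smallest kubemark cluster,
#              weaker signal also noted at medium scale?)
# None stands for "-" in the table.
CASES = [
    ("#60035", None, 5000, False),
    ("#59823", None, 100, False),
    ("#56061", 100, None, False),
    ("#55695", None, 500, False),
    ("#55060", 2000, None, True),           # "some effect was also seen at 100"
    ("#54164", 2000, None, False),
    ("#53327 (scheduler)", 100, 500, False),
    ("#53327 (kubelet mock)", None, 5000, True),  # "also 500, but flakily"
    ("#52284", 2000, None, False),
    ("#51903", None, 500, False),
    ("#51899 (audit)", 2000, None, False),
    ("#51899 (delete)", 2000, None, False),
    ("#51099", 100, 100, False),
    ("#50854", None, 5000, False),
    ("#50366", None, 500, False),
    ("#48700", 100, 500, False),
    ("#47419", 2000, 5000, False),
    ("#45216", None, 5000, False),
    ("#42000", None, 500, False),
]

MEDIUM_REAL = 100      # gce-100 (assumed threshold)
MEDIUM_KUBEMARK = 500  # kubemark-500 (assumed threshold)


def caught_at_medium_scale(real, kubemark, weak_signal, count_weak=False):
    """True if the regression showed up at or below the medium-scale sizes."""
    strong = (real is not None and real <= MEDIUM_REAL) or (
        kubemark is not None and kubemark <= MEDIUM_KUBEMARK
    )
    return strong or (count_weak and weak_signal)


def medium_scale_share(cases, count_weak=False):
    """Return (number caught at medium scale, total number of cases)."""
    caught = sum(
        caught_at_medium_scale(real, kubemark, weak, count_weak)
        for _, real, kubemark, weak in cases
    )
    return caught, len(cases)


if __name__ == "__main__":
    for count_weak in (False, True):
        caught, total = medium_scale_share(CASES, count_weak)
        print(f"count_weak={count_weak}: {caught}/{total} = {caught / total:.0%}")
```

Under these assumptions, 9 of the 19 cases show a clear signal at medium scale, and 11 of 19 (about 58%) do once the two weaker signals noted in the table are counted, which is in line with the estimate above.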