This means that PreFilters are run once per pod during binpacking,
instead of #pods * #nodes times. This makes a huge performance
difference in very large clusters.
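As an illustration of the pattern (a minimal sketch with hypothetical
names and types, not the actual simulator code), the node-independent
PreFilter work is computed once per pod, and only the per-node Filter
runs inside the node loop:
```
// Hypothetical sketch: the node-independent "PreFilter" pass runs once per
// pod; only the per-node "Filter" pass runs inside the node loop, so
// PreFilters run O(#pods) times instead of O(#pods * #nodes).

type pod struct{ cpuRequest int64 }
type node struct{ cpuFree int64 }

// preFilter stands in for the per-pod, node-independent plugin pass.
func preFilter(p pod) (state int64, feasible bool) {
	return p.cpuRequest, p.cpuRequest > 0
}

// filter stands in for the per-node plugin pass, reusing the precomputed state.
func filter(state int64, n node) bool {
	return n.cpuFree >= state
}

// firstFit returns the index of the first node the pod fits on.
func firstFit(p pod, nodes []node) (int, bool) {
	state, ok := preFilter(p) // once per pod, not once per (pod, node) pair
	if !ok {
		return -1, false
	}
	for i, n := range nodes {
		if filter(state, n) {
			return i, true
		}
	}
	return -1, false
}
```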
- Leverage --cloud-config to allow providing a separate kubeconfig for Cluster API management and workload cluster resources
- Allow falling back to the previous behavior when --cloud-config is not specified, for backward compatibility
- Provide a --clusterapi-cloud-config-authoritative flag to disable the above fallback and have both the management and workload cluster clients use the in-cluster config (see the example invocation after this list)
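An illustrative invocation (paths are placeholders): the workload
cluster is reached via --kubeconfig, while --cloud-config points at a
kubeconfig for the management cluster:
```
cluster-autoscaler \
    --cloud-provider=clusterapi \
    --kubeconfig=/mnt/workload.kubeconfig \
    --cloud-config=/mnt/management.kubeconfig
```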
On (re)start, cluster-autoscaler will refresh all VMSS instance caches
at once, and set those caches' TTL to 5 minutes. All VMSS VM List calls
(for VMSS discovered at boot) will then continuously hit the ARM API at
the same time, potentially causing regular throttling bursts.
Exposing an optional jitter, subtracted from the first scheduled
refresh delay, will splay those calls (except for the very first one,
at start), while keeping the predictable (max. 5 minutes, unless the
VMSS changed) refresh interval after the first refresh.
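A minimal sketch of the idea (hypothetical names; the actual flag and
wiring differ), where a random jitter is subtracted from the first
scheduled refresh delay only:
```
import (
	"math/rand"
	"time"
)

const vmssCacheTTL = 5 * time.Minute

// nextRefreshDelay returns the delay before the next scheduled refresh.
// For the first scheduled refresh, a random jitter in [0, maxJitter) is
// subtracted so that VMSS discovered together don't all hit ARM at the
// same instant; later refreshes keep the plain, predictable TTL.
func nextRefreshDelay(maxJitter time.Duration, firstScheduled bool) time.Duration {
	if firstScheduled && maxJitter > 0 {
		return vmssCacheTTL - time.Duration(rand.Int63n(int64(maxJitter)))
	}
	return vmssCacheTTL
}
```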
`fetchAutoAsgs()` is called at regular intervals, fetches a list of
VMSS, then calls `Register()` to cache each of them. That registration
function tells the caller whether that VMSS's cache is outdated (when
the provided VMSS, supposedly fresh, differs from the one held in
cache) and replaces the existing cache entry with the provided VMSS
(which in effect forces a refresh, since that ScaleSet struct is passed
by fetchAutoAsgs with a nil lastRefresh time and an empty
instanceCache).
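Schematically, the registration path looks like this (simplified
sketch with assumed field names, not the actual azure_manager code):
```
import (
	"sync"
	"time"
)

type scaleSet struct {
	name             string
	minSize, maxSize int64
	curSize          int64
	// Enriched at runtime, and lost whenever the entry is replaced:
	lastRefresh   time.Time
	instanceCache []string
}

type asgCache struct {
	mu         sync.Mutex
	registered map[string]*scaleSet
}

// register caches the provided (fresh, blank) scale set and reports whether
// the cached view changed. Replacing an entry drops lastRefresh and
// instanceCache, which forces a full VMSS VM List on the next access.
func (c *asgCache) register(fresh *scaleSet, changed func(a, b *scaleSet) bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	old, ok := c.registered[fresh.name]
	if ok && !changed(old, fresh) {
		return false // unchanged: keep the enriched cache entry
	}
	c.registered[fresh.name] = fresh
	return true
}
```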
To detect changes, `Register()` uses a `reflect.DeepEqual()` between
the provided and the cached VMSS, which always finds them different:
cached VMSS were enriched with instance lists (while the provided one
is blank, fresh from a simple vmss.list call). That DeepEqual is also
fragile because the compared structs contain mutexes (that may be held
or not) and refresh timestamps, attributes that shouldn't be relevant
to the comparison.
As a consequence, every Register() call causes an indirect cache
invalidation and a costly refresh (VMSS VM List). The number of
Register() calls is directly proportional to the number of VMSS
attached to the cluster, and can easily trigger ARM API throttling.
With a large number of VMSS, that throttling prevents `fetchAutoAsgs`
from ever succeeding (and cluster-autoscaler from starting), e.g.:
```
I0807 16:55:25.875907 153 azure_scale_set.go:344] GetScaleSetVms: starts
I0807 16:55:25.875915 153 azure_scale_set.go:350] GetScaleSetVms: scaleSet.Name: a-testvmss-10, vmList: []
E0807 16:55:25.875919 153 azure_scale_set.go:352] VirtualMachineScaleSetVMsClient.List failed for a-testvmss-10: &{true 0 2020-08-07 17:10:25.875447854 +0000 UTC m=+913.985215807 azure cloud provider throttled for operation VMSSVMList with reason "client throttled"}
E0807 16:55:25.875928 153 azure_manager.go:538] Failed to regenerate ASG cache: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
F0807 16:55:25.875934 153 azure_cloud_provider.go:167] Failed to create Azure Manager: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
goroutine 28 [running]:
```
Of the [`ScaleSet` struct attributes](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L74-L89)
(manager, sizes, mutexes, refresh timestamps), only the sizes are
relevant to that comparison. `curSize` is not strictly necessary, but
comparing it provides early instance cache refreshes.
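Restricted to those fields, the comparison might look like this (a
sketch reusing the scaleSet type from the earlier sketch; the upstream
field names may differ):
```
// scaleSetChanged compares only identity and sizes. Including curSize is
// optional, but lets a detected size drift trigger an early instance-cache
// refresh.
func scaleSetChanged(old, fresh *scaleSet) bool {
	return old.name != fresh.name ||
		old.minSize != fresh.minSize ||
		old.maxSize != fresh.maxSize ||
		old.curSize != fresh.curSize
}
```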
k8s Azure clients keep track of previous HTTP 429 responses and their
Retry-After cool-down periods. On subsequent calls, they will notice
the ongoing throttling window and return a synthetic error (without an
HTTPStatusCode) rather than submitting a throttled request to the ARM
API:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssvmclient/azure_vmssvmclient.go#L154-L158
https://github.com/kubernetes/autoscaler/blob/a5ed2cc3fe/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/retry/azure_error.go#L118-L123
Some CA components can cope with a temporarily outdated object view
when throttled. They call into `isAzureRequestsThrottled()` on client
errors to return stale objects from cache (if any) and extend the
object's refresh period (if any).
But this only works for the first API call (the one returning HTTP
429). Subsequent calls in the same throttling window (per the
Retry-After header) won't be identified as throttled by
`isAzureRequestsThrottled`, due to their zero `HTTPStatusCode`.
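A sketch of a broadened check (the struct fields mirror the linked
retry.Error; the function itself is an illustration, not the exact
upstream patch):
```
import (
	"net/http"
	"time"
)

// throttlingError mirrors the relevant fields of the Azure clients'
// retry.Error (sketch).
type throttlingError struct {
	Retriable      bool
	HTTPStatusCode int
	RetryAfter     time.Time
}

// isThrottled reports whether the error denotes ARM throttling: either a
// real HTTP 429 response, or a synthetic client-side error (zero status
// code) emitted while a previous Retry-After window is still open.
func isThrottled(err *throttlingError) bool {
	if err == nil {
		return false
	}
	if err.HTTPStatusCode == http.StatusTooManyRequests {
		return true
	}
	return err.HTTPStatusCode == 0 && err.Retriable && time.Now().Before(err.RetryAfter)
}
```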
This can make the CA panic during startup due to a failing cache init,
when more than one VMSS call hits throttling. We've seen this causing
early restart loops, re-scanning every VMSS due to the cold cache on
start, and keeping the subscription throttled.
Practically, this change allows the three call sites
(`scaleSet.Nodes()`, `scaleSet.getCurSize()`, and
`AgentPool.getVirtualMachinesFromCache()`) to serve from cache (and
extend the object's next refresh deadline) as they would on the first
HTTP 429 hit, rather than returning an error.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries`
must also be set to a value > 0, otherwise the cluster-autoscaler will
instantiate a vmssclient with 0 retry Steps, which will cause
`doBackoffRetry()` to return a nil response and a nil error on
requests. The ARM client can't cope with those and will segfault.
See https://github.com/kubernetes/kubernetes/pull/94078
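One way to express the constraint as a validation guard (a sketch, not
the upstream fix):
```
import "fmt"

// validateBackoffConfig sketches the invariant: with backoff enabled,
// retries must be > 0, otherwise doBackoffRetry() can return a nil
// response with a nil error, which the ARM client then dereferences.
func validateBackoffConfig(backoffEnabled bool, retries int) error {
	if backoffEnabled && retries <= 0 {
		return fmt.Errorf("cloudProviderBackoffRetries must be > 0 when cloudProviderBackoff is enabled, got %d", retries)
	}
	return nil
}
```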
The README.md needed a small update, because the documented defaults
are misleading: they don't apply when the cluster-autoscaler is
provided a config file, due to:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308
... which also causes all environment variables to be ignored when a
configuration file is provided.
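The shape of the linked code is roughly the following (a paraphrased
sketch): the config-file branch returns early, so the env-var branch,
where the documented defaults live, never runs:
```
import (
	"encoding/json"
	"io"
	"os"
)

type azureConfig struct {
	Location       string `json:"location"`
	SubscriptionID string `json:"subscriptionId"`
	// ... remaining settings elided
}

// buildConfig paraphrases the linked logic: a provided config file is the
// sole source of settings; environment variables (and the documented
// defaults) only apply when no file is given.
func buildConfig(configReader io.Reader) (*azureConfig, error) {
	cfg := &azureConfig{}
	if configReader != nil {
		body, err := io.ReadAll(configReader)
		if err != nil {
			return nil, err
		}
		if err := json.Unmarshal(body, cfg); err != nil {
			return nil, err
		}
		return cfg, nil // env vars below are never consulted
	}
	cfg.Location = os.Getenv("LOCATION")
	cfg.SubscriptionID = os.Getenv("ARM_SUBSCRIPTION_ID")
	// ... more env-derived settings and defaults
	return cfg, nil
}
```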