Commit Graph

4188 Commits

Author SHA1 Message Date
Maciek Pytel dadb68fb8b Use FitsAnyNode in binpacking
This means that PreFilters are run once per pod in binpacking
instead of #pods*#nodes times. This makes a huge performance
difference in very large clusters.
2021-04-21 19:18:14 -07:00
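The optimization in the commit above can be sketched in hedged form: run the per-pod PreFilter work once, then reuse its result for every candidate node. The types and function names below are illustrative stand-ins, not the actual scheduler-framework API.

```go
package main

import "fmt"

// Hypothetical stand-ins for scheduler framework concepts.
type Pod struct{ Name string }
type Node struct{ Name string }

// preFilterState holds per-pod data that does not depend on any node,
// so it can be computed once and reused for every node checked.
type preFilterState struct{}

// runPreFilter is assumed to do the node-independent setup for a pod.
func runPreFilter(p Pod) preFilterState { return preFilterState{} }

// fitsNode is assumed to check one pod against one node using the
// precomputed state.
func fitsNode(p Pod, state preFilterState, n Node) bool { return true }

// fitsAnyNode runs PreFilter once per pod before the node loop, so the
// PreFilter cost is O(#pods) instead of O(#pods * #nodes).
func fitsAnyNode(p Pod, nodes []Node, preFilterCalls *int) bool {
	*preFilterCalls++
	state := runPreFilter(p) // once per pod, not once per pod-node pair
	for _, n := range nodes {
		if fitsNode(p, state, n) {
			return true
		}
	}
	return false
}

func main() {
	nodes := []Node{{"n1"}, {"n2"}, {"n3"}}
	calls := 0
	for _, p := range []Pod{{"p1"}, {"p2"}} {
		fitsAnyNode(p, nodes, &calls)
	}
	fmt.Println(calls) // 2: one PreFilter run per pod, regardless of node count
}
```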
Kubernetes Prow Robot 42afc33d59
Merge pull request #3788 from yaroslava-serdiuk/cluster-autoscaler-release-1.18
Cherry-pick #3722: update generic labels for GCE
2020-12-30 00:41:49 -08:00
Yaroslava Serdiuk 37abbbd12b update generic labels for GCE 2020-12-29 14:51:09 +00:00
Michelle Au 4b776b822c Add GCE PD CSI zone topology key to node template. CSI drivers use their own topology keys instead of Kubernetes labels 2020-12-29 14:50:49 +00:00
Kubernetes Prow Robot f15b175828
Merge pull request #3696 from zxh326/cluster-autoscaler-release-1.18
Fix pricing endpoint in AWS China Region
2020-11-16 10:03:05 -08:00
Hanfei Shen 34b750b1cc Fix pricing endpoint in AWS China Region
(cherry picked from commit b5f95e0d73)
2020-11-16 21:05:55 +08:00
Kubernetes Prow Robot d21f0edc45
Merge pull request #3661 from MaciekPytel/ca-1.18.3
Cluster Autoscaler 1.18.3
2020-11-02 08:06:54 -08:00
Maciek Pytel 0760cdbfe0 Cluster Autoscaler 1.18.3 2020-11-02 14:55:45 +01:00
Kubernetes Prow Robot 9dcf938c2b
Merge pull request #3655 from detiber/backportEM-1.18
[cluster-autoscaler] Backport fixes for packet provider to release-1.18
2020-11-02 05:54:52 -08:00
v-pap e629617910
Add price support in Packet 2020-10-30 15:10:05 -04:00
v-pap 7304a81b84
Add support for multiple nodepools in Packet 2020-10-30 15:08:19 -04:00
v-pap 39c0540c8a
Add support for scaling up/down from/to 0 nodes in Packet 2020-10-30 15:07:30 -04:00
Marques Johansson c33b7380f9
add Packet cloudprovider owners
Signed-off-by: Marques Johansson <marques@packet.com>
2020-10-30 13:31:04 -04:00
Kubernetes Prow Robot 61507f15ad
Merge pull request #3625 from nilo19/cleanup/cherry-pick-3532-1.18
Cherry-pick #3532 onto 1.18: Azure: support allocatable resources overrides via VMSS tags
2020-10-18 20:54:14 -07:00
qini 02306ad991 Azure: support allocatable resources overrides via VMSS tags 2020-10-19 10:55:34 +08:00
Kubernetes Prow Robot b2962ef323
Merge pull request #3598 from ryaneorth/cherry-pick-3570-1.18
Merge pull request #3570 from towca/jtuznik/scale-down-after-delete-fix
2020-10-15 01:36:24 -07:00
Ryan Orth 221d03217e Merge remote-tracking branch 'upstream/cluster-autoscaler-release-1.18' into cherry-pick-3570-1.18 2020-10-14 14:14:40 -07:00
Kubernetes Prow Robot d282d74286
Merge pull request #3612 from ryaneorth/cherry-pick-3441-1.18
Merge pull request #3441 from detiber/fixCAPITests
2020-10-14 13:53:50 -07:00
Kubernetes Prow Robot 48c9d68a42 Merge pull request #3441 from detiber/fixCAPITests
Improve Cluster API tests to work better with constrained resources
2020-10-14 13:23:32 -07:00
Kubernetes Prow Robot 4e0c2ff8c5 Merge pull request #3570 from towca/jtuznik/scale-down-after-delete-fix
Remove ScaleDownNodeDeleted status since we no longer delete nodes synchronously
2020-10-09 14:22:20 -07:00
Kubernetes Prow Robot f926cf5085
Merge pull request #3581 from nitrag/cherry-pick-3308-1.18
Cherry pick 3308 onto 1.18 - Fix priority expander falling back to random although higher priority matches
2020-10-07 15:36:16 -07:00
Kubernetes Prow Robot 51379c2794 Merge pull request #3308 from bruecktech/fix-fallback
Fix priority expander falling back to random although higher priority matches
2020-10-05 09:53:31 -04:00
Kubernetes Prow Robot 486f7b25ea
Merge pull request #3551 from benmoss/capi-backports-1.18
[CA-1.18] CAPI backports for autoscaling workload clusters
2020-10-01 06:12:53 -07:00
Kubernetes Prow Robot ecbead813f
Merge pull request #3560 from marwanad/cherry-pick-3558-1.18
Cherry pick #3558 onto 1.18 - Add missing stable labels in the azure template
2020-09-30 01:50:26 -07:00
Marwan Ahmed f04cd2ef63 fix imports 2020-09-29 22:12:16 -07:00
Marwan Ahmed 4c6137f4ad add stable labels to the azure template 2020-09-29 20:39:26 -07:00
Marwan Ahmed f3cfe3d4cb move template-related code to its own file 2020-09-29 20:38:55 -07:00
Jason DeTiberus 1e0fe7c85d
Update group identifier to use for Cluster API annotations
- Also add backwards compatibility for the previously used deprecated annotations
2020-09-28 13:31:37 -04:00
Jason DeTiberus 09accf65d1
[cluster-autoscaler] Support using --cloud-config for clusterapi provider
- Leverage --cloud-config to allow for providing a separate kubeconfig for Cluster API management and workload cluster resources
- Allow for fallback to previous behavior when --cloud-config is not specified for backward compatibility
- Provides a --clusterapi-cloud-config-authoritative flag to disable the above fallback behavior and allow for both the management and workload cluster clients to use the in-cluster config
2020-09-28 13:31:37 -04:00
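A hedged sketch of how the flags listed above might be combined, pointing the autoscaler at a management cluster for Cluster API objects; the kubeconfig path is an illustrative placeholder, only the flag names come from the commit message.

```shell
# Hypothetical invocation: Cluster API objects live in a separate
# management cluster, reached via the kubeconfig passed to --cloud-config.
cluster-autoscaler \
  --cloud-provider=clusterapi \
  --cloud-config=/etc/kubernetes/management-kubeconfig \
  --clusterapi-cloud-config-authoritative   # disable the in-cluster fallback
```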
Jason DeTiberus 5753f3f2ab
Add node autodiscovery to cluster-autoscaler clusterapi provider 2020-09-28 13:31:36 -04:00
Jason DeTiberus 9dc30d538c
Convert clusterapi provider to use unstructured
Remove internal types for Cluster API and replace with unstructured access
2020-09-28 13:31:35 -04:00
Jason DeTiberus 72526e5a91
Update vendor to pull in necessary new paths for client-go 2020-09-28 13:31:34 -04:00
Kubernetes Prow Robot 6c243fe2ea
Merge pull request #2950 from enxebre/skip-machinedeployment
Let the controller move on if machineDeployments are not available
2020-09-28 11:28:45 -04:00
Kubernetes Prow Robot f4ba55e1e7
Merge pull request #3523 from marwanad/cherry-pick-3440-1.18
Cherry-pick #3440 onto 1.18 - optional jitter on initial VMSS VM cache refresh
2020-09-16 18:16:45 -07:00
Benjamin Pineau 1e35781b5a Azure: optional jitter on initial VMSS VM cache refresh
On (re)start, cluster-autoscaler refreshes all VMSS instance caches
at once, and sets those caches' TTL to 5 min. All VMSS VM List calls (for VMSS
discovered at boot) will then continuously hit the ARM API at the same time,
potentially causing regular throttling bursts.

Exposing an optional jitter, subtracted from the first scheduled
refresh delay, will splay those calls (except for the first one, at start),
while keeping the predictable (max. 5 min, unless the VMSS changed) refresh
interval after the first refresh.
2020-09-16 17:55:50 -07:00
Kubernetes Prow Robot b24a5beb8b
Merge pull request #3519 from marwanad/cherry-pick-3484-1.18
Cherry pick #3484 onto 1.18: Serve stale on ongoing throttling
2020-09-16 17:10:45 -07:00
Kubernetes Prow Robot 3e51cc7f5a
Merge pull request #3521 from marwanad/cherry-pick-3437-1.18
Cherry pick #3437 onto 1.18 - Avoid unwanted VMSS VMs caches invalidation
2020-09-16 17:08:45 -07:00
Marwan Ahmed e146e3ee84 call in the nodegroup API to avoid type assertion errors 2020-09-16 12:17:56 -07:00
Benjamin Pineau c153a63df8 Avoid unwanted VMSS VMs caches invalidations
`fetchAutoAsgs()` is called at regular intervals, fetches a list of VMSS,
then calls `Register()` to cache each of them. That registration function
tells the caller whether that VMSS' cache is outdated (when the provided
VMSS, supposedly fresh, differs from the one held in cache) and replaces
the existing cache entry with the provided VMSS (which in effect forces a
refresh, since that ScaleSet struct is passed by fetchAutoAsgs with a nil
lastRefresh time and an empty instanceCache).

To detect changes, `Register()` uses a `reflect.DeepEqual()` between the
provided and the cached VMSS, which always finds them different: cached
VMSS were enriched with instance lists (while the provided one is blank,
fresh from a simple vmss.list call). That DeepEqual is also fragile, as
the compared structs contain mutexes (that may be held or not) and
refresh timestamps, attributes that shouldn't be relevant to the comparison.

As a consequence, every Register() call causes an indirect cache
invalidation and a costly refresh (VMSS VMs List). The number of
Register() calls is directly proportional to the number of VMSS attached
to the cluster, and can easily trigger ARM API throttling.

With a large number of VMSS, that throttling prevents `fetchAutoAsgs` from
ever succeeding (and cluster-autoscaler from starting), e.g.:

```
I0807 16:55:25.875907     153 azure_scale_set.go:344] GetScaleSetVms: starts
I0807 16:55:25.875915     153 azure_scale_set.go:350] GetScaleSetVms: scaleSet.Name: a-testvmss-10, vmList: []
E0807 16:55:25.875919     153 azure_scale_set.go:352] VirtualMachineScaleSetVMsClient.List failed for a-testvmss-10: &{true 0 2020-08-07 17:10:25.875447854 +0000 UTC m=+913.985215807 azure cloud provider throttled for operation VMSSVMList with reason "client throttled"}
E0807 16:55:25.875928     153 azure_manager.go:538] Failed to regenerate ASG cache: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
F0807 16:55:25.875934     153 azure_cloud_provider.go:167] Failed to create Azure Manager: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
goroutine 28 [running]:
```

Of the [`ScaleSet` struct attributes](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L74-L89)
(manager, sizes, mutexes, refresh timestamps), only the sizes are relevant
to that comparison. `curSize` is not strictly necessary, but comparing it
provides early instance cache refreshes.
2020-09-16 12:17:46 -07:00
Benjamin Pineau 6eca014c7e Azure: serve stale on ongoing throttling
k8s Azure clients keep track of previous HTTP 429 and Retry-After cool
down periods. On subsequent calls, they will notice the ongoing throttling
window and return synthetic errors (without an HTTPStatusCode) rather
than submitting a throttled request to the ARM API:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssvmclient/azure_vmssvmclient.go#L154-L158
a5ed2cc3fe/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/retry/azure_error.go (L118-L123)

Some CA components can cope with a temporarily outdated object view
when throttled. They call into `isAzureRequestsThrottled()` on client
errors to return stale objects from cache (if any) and extend the object's
refresh period (if any).

But this only works for the first API call (the one returning HTTP 429).
Subsequent calls in the same throttling window (per the Retry-After header)
won't be identified as throttled by `isAzureRequestsThrottled`, due
to their zero `HTTPStatusCode`.

This can make the CA panic during startup due to a failing cache init, when
more than one VMSS call hits throttling. We've seen this cause early
restart loops, re-scanning every VMSS due to the cold cache on start,
keeping the subscription throttled.

Practically, this change allows the 3 call sites (`scaleSet.Nodes()`,
`scaleSet.getCurSize()`, and `AgentPool.getVirtualMachinesFromCache()`) to
serve from cache (and extend the object's next refresh deadline) as they
would on the first HTTP 429 hit, rather than returning an error.
2020-09-16 11:20:11 -07:00
Kubernetes Prow Robot 45b905c78e
Merge pull request #3452 from nilo19/bug/cherry-pick-3418-1-18
Cherry pick the bug fix in #2418 onto 1.18
2020-08-23 05:53:41 -07:00
Kubernetes Prow Robot 3ecf85c359
Merge pull request #3450 from DataDog/backoff-needs-retries-release-1.18
Cherry-pick onto 1.18: Backoff needs retries
2020-08-23 05:51:41 -07:00
niqi 83529110e9 Fix the bug where the nicName in the if block shadows its counterpart outside. 2020-08-23 18:03:53 +08:00
Benjamin Pineau 09a08c220f Azure cloud provider: backoff needs retries
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries`
must also be set to a value > 0, otherwise the cluster-autoscaler
will instantiate a vmssclient with 0 Steps retries, which will
cause `doBackoffRetry()` to return a nil response and nil error on
requests. The ARM client can't cope with those and will then segfault.
See https://github.com/kubernetes/kubernetes/pull/94078

The README.md needed a small update, because the documented defaults
are a bit misleading: they don't apply when the cluster-autoscaler
is provided a config file, due to:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308
... which also causes all environment variables to be ignored
when a configuration file is provided.
2020-08-23 11:05:23 +02:00
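The failure mode above suggests a defensive default: if backoff is enabled with zero retries, force at least one. A minimal sketch; the struct and function names are illustrative, only the config field names come from the commit message.

```go
package main

import "fmt"

// azureConfig is a minimal stand-in holding the two config fields named
// in the commit message.
type azureConfig struct {
	CloudProviderBackoff        bool
	CloudProviderBackoffRetries int
}

// normalizeBackoff guards against the segfault scenario: a backoff client
// built with 0 Steps returns a nil response and nil error, which the ARM
// client can't handle, so force at least one retry when backoff is on.
func normalizeBackoff(cfg *azureConfig) {
	if cfg.CloudProviderBackoff && cfg.CloudProviderBackoffRetries <= 0 {
		cfg.CloudProviderBackoffRetries = 1
	}
}

func main() {
	cfg := &azureConfig{CloudProviderBackoff: true} // retries left at 0
	normalizeBackoff(cfg)
	fmt.Println(cfg.CloudProviderBackoffRetries) // 1
}
```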
Kubernetes Prow Robot a86950d5bb
Merge pull request #3444 from marwanad/cherry-pick-new-instances
Cherry-pick #3311: Add various azure instance types now available
2020-08-20 17:27:39 -07:00
Nicholas Kiraly 07036f9177 Add various azure instance types now available 2020-08-20 16:57:18 -07:00
Kubernetes Prow Robot 72178ad66c
Merge pull request #3345 from detiber/backport3177
[CA-1.18] #3177 cherry-pick: Fix stale replicas issue with cluster-autoscaler CAPI provider
2020-07-29 04:55:48 -07:00
Kubernetes Prow Robot 0baddce016
Merge pull request #3346 from detiber/backport3034
[CA-1.18] #3034 cherry-pick: Improve delete node mechanisms for cluster-api autoscaler provider #3034
2020-07-29 04:35:49 -07:00
Kubernetes Prow Robot bc7b29e346
Merge pull request #3361 from MaciekPytel/1_18_2
Cluster Autoscaler 1.18.2
2020-07-27 06:26:17 -07:00
Maciek Pytel 7906c7fed0 Cluster Autoscaler 1.18.2 2020-07-27 14:32:56 +02:00