This means that PreFilters are run once per pod during binpacking,
instead of #pods * #nodes times. This makes a huge performance
difference in very large clusters.
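As an illustration of the pattern (a minimal sketch with hypothetical
names and types, not the actual simulator code), the node-independent
PreFilter work is computed once per pod, and only the per-node Filter
runs inside the node loop:
```
// Hypothetical sketch: the node-independent "PreFilter" pass runs once per
// pod; only the per-node "Filter" pass runs inside the node loop, so
// PreFilters run O(#pods) times instead of O(#pods * #nodes).

type pod struct{ cpuRequest int64 }
type node struct{ cpuFree int64 }

// preFilter stands in for the per-pod, node-independent plugin pass.
func preFilter(p pod) (state int64, feasible bool) {
	return p.cpuRequest, p.cpuRequest > 0
}

// filter stands in for the per-node plugin pass, reusing the precomputed state.
func filter(state int64, n node) bool {
	return n.cpuFree >= state
}

// firstFit returns the index of the first node the pod fits on.
func firstFit(p pod, nodes []node) (int, bool) {
	state, ok := preFilter(p) // once per pod, not once per (pod, node) pair
	if !ok {
		return -1, false
	}
	for i, n := range nodes {
		if filter(state, n) {
			return i, true
		}
	}
	return -1, false
}
```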
- Leverage --cloud-config to allow providing a separate kubeconfig for Cluster API management and workload cluster resources
- Allow falling back to the previous behavior when --cloud-config is not specified, for backward compatibility
- Provide a --clusterapi-cloud-config-authoritative flag to disable the above fallback and have both the management and workload cluster clients use the in-cluster config (see the example invocation after this list)
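An illustrative invocation (paths are placeholders): the workload
cluster is reached via --kubeconfig, while --cloud-config points at a
kubeconfig for the management cluster:
```
cluster-autoscaler \
    --cloud-provider=clusterapi \
    --kubeconfig=/mnt/workload.kubeconfig \
    --cloud-config=/mnt/management.kubeconfig
```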
On (re)start, cluster-autoscaler will refresh all VMSS instance caches
at once, and set those caches' TTL to 5 minutes. All VMSS VM List calls
(for VMSS discovered at boot) will then continuously hit the ARM API at
the same time, potentially causing regular throttling bursts.
Exposing an optional jitter, subtracted from the first scheduled
refresh delay, will splay those calls (except for the very first one,
at start), while keeping the predictable (max. 5 minutes, unless the
VMSS changed) refresh interval after the first refresh.
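A minimal sketch of the idea (hypothetical names; the actual flag and
wiring differ), where a random jitter is subtracted from the first
scheduled refresh delay only:
```
import (
	"math/rand"
	"time"
)

const vmssCacheTTL = 5 * time.Minute

// nextRefreshDelay returns the delay before the next scheduled refresh.
// For the first scheduled refresh, a random jitter in [0, maxJitter) is
// subtracted so that VMSS discovered together don't all hit ARM at the
// same instant; later refreshes keep the plain, predictable TTL.
func nextRefreshDelay(maxJitter time.Duration, firstScheduled bool) time.Duration {
	if firstScheduled && maxJitter > 0 {
		return vmssCacheTTL - time.Duration(rand.Int63n(int64(maxJitter)))
	}
	return vmssCacheTTL
}
```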
`fetchAutoAsgs()` is called at regular intervals, fetches a list of
VMSS, then calls `Register()` to cache each of them. That registration
function tells the caller whether that VMSS's cache is outdated (when
the provided VMSS, supposedly fresh, differs from the one held in
cache) and replaces the existing cache entry with the provided VMSS
(which in effect forces a refresh, since that ScaleSet struct is passed
by fetchAutoAsgs with a nil lastRefresh time and an empty
instanceCache).
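Schematically, the registration path looks like this (simplified
sketch with assumed field names, not the actual azure_manager code):
```
import (
	"sync"
	"time"
)

type scaleSet struct {
	name             string
	minSize, maxSize int64
	curSize          int64
	// Enriched at runtime, and lost whenever the entry is replaced:
	lastRefresh   time.Time
	instanceCache []string
}

type asgCache struct {
	mu         sync.Mutex
	registered map[string]*scaleSet
}

// register caches the provided (fresh, blank) scale set and reports whether
// the cached view changed. Replacing an entry drops lastRefresh and
// instanceCache, which forces a full VMSS VM List on the next access.
func (c *asgCache) register(fresh *scaleSet, changed func(a, b *scaleSet) bool) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	old, ok := c.registered[fresh.name]
	if ok && !changed(old, fresh) {
		return false // unchanged: keep the enriched cache entry
	}
	c.registered[fresh.name] = fresh
	return true
}
```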
To detect changes, `Register()` uses a `reflect.DeepEqual()` between
the provided and the cached VMSS, which always finds them different:
cached VMSS were enriched with instance lists (while the provided one
is blank, fresh from a simple vmss.list call). That DeepEqual is also
fragile because the compared structs contain mutexes (that may be held
or not) and refresh timestamps, attributes that shouldn't be relevant
to the comparison.
As a consequence, every Register() call causes an indirect cache
invalidation and a costly refresh (VMSS VM List). The number of
Register() calls is directly proportional to the number of VMSS
attached to the cluster, and can easily trigger ARM API throttling.
With a large number of VMSS, that throttling prevents `fetchAutoAsgs`
from ever succeeding (and cluster-autoscaler from starting), e.g.:
```
I0807 16:55:25.875907 153 azure_scale_set.go:344] GetScaleSetVms: starts
I0807 16:55:25.875915 153 azure_scale_set.go:350] GetScaleSetVms: scaleSet.Name: a-testvmss-10, vmList: []
E0807 16:55:25.875919 153 azure_scale_set.go:352] VirtualMachineScaleSetVMsClient.List failed for a-testvmss-10: &{true 0 2020-08-07 17:10:25.875447854 +0000 UTC m=+913.985215807 azure cloud provider throttled for operation VMSSVMList with reason "client throttled"}
E0807 16:55:25.875928 153 azure_manager.go:538] Failed to regenerate ASG cache: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
F0807 16:55:25.875934 153 azure_cloud_provider.go:167] Failed to create Azure Manager: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
goroutine 28 [running]:
```
Of the [`ScaleSet` struct attributes](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L74-L89)
(manager, sizes, mutexes, refresh timestamps), only the sizes are
relevant to that comparison. `curSize` is not strictly necessary, but
comparing it provides early instance cache refreshes.
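Restricted to those fields, the comparison might look like this (a
sketch reusing the scaleSet type from the earlier sketch; the upstream
field names may differ):
```
// scaleSetChanged compares only identity and sizes. Including curSize is
// optional, but lets a detected size drift trigger an early instance-cache
// refresh.
func scaleSetChanged(old, fresh *scaleSet) bool {
	return old.name != fresh.name ||
		old.minSize != fresh.minSize ||
		old.maxSize != fresh.maxSize ||
		old.curSize != fresh.curSize
}
```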
k8s Azure clients keep track of previous HTTP 429 responses and their
Retry-After cool-down periods. On subsequent calls, they will notice
the ongoing throttling window and return a synthetic error (without an
HTTPStatusCode) rather than submitting a throttled request to the ARM
API:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/clients/vmssvmclient/azure_vmssvmclient.go#L154-L158
https://github.com/kubernetes/autoscaler/blob/a5ed2cc3fe/cluster-autoscaler/vendor/k8s.io/legacy-cloud-providers/azure/retry/azure_error.go#L118-L123
Some CA components can cope with a temporarily outdated object view
when throttled. They call into `isAzureRequestsThrottled()` on client
errors to return stale objects from cache (if any) and extend the
object's refresh period (if any).
But this only works for the first API call (the one returning HTTP
429). Subsequent calls in the same throttling window (per the
Retry-After header) won't be identified as throttled by
`isAzureRequestsThrottled`, due to their zero `HTTPStatusCode`.
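A sketch of a broadened check (the struct fields mirror the linked
retry.Error; the function itself is an illustration, not the exact
upstream patch):
```
import (
	"net/http"
	"time"
)

// throttlingError mirrors the relevant fields of the Azure clients'
// retry.Error (sketch).
type throttlingError struct {
	Retriable      bool
	HTTPStatusCode int
	RetryAfter     time.Time
}

// isThrottled reports whether the error denotes ARM throttling: either a
// real HTTP 429 response, or a synthetic client-side error (zero status
// code) emitted while a previous Retry-After window is still open.
func isThrottled(err *throttlingError) bool {
	if err == nil {
		return false
	}
	if err.HTTPStatusCode == http.StatusTooManyRequests {
		return true
	}
	return err.HTTPStatusCode == 0 && err.Retriable && time.Now().Before(err.RetryAfter)
}
```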
This can make the CA panic during startup due to a failing cache init,
when more than one VMSS call hits throttling. We've seen this causing
early restart loops, re-scanning every VMSS due to the cold cache on
start, and keeping the subscription throttled.
Practically, this change allows the three call sites
(`scaleSet.Nodes()`, `scaleSet.getCurSize()`, and
`AgentPool.getVirtualMachinesFromCache()`) to serve from cache (and
extend the object's next refresh deadline) as they would on the first
HTTP 429 hit, rather than returning an error.
When `cloudProviderBackoff` is configured, `cloudProviderBackoffRetries`
must also be set to a value > 0, otherwise the cluster-autoscaler will
instantiate a vmssclient with 0 retry Steps, which will cause
`doBackoffRetry()` to return a nil response and a nil error on
requests. The ARM client can't cope with those and will segfault.
See https://github.com/kubernetes/kubernetes/pull/94078
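One way to express the constraint as a validation guard (a sketch, not
the upstream fix):
```
import "fmt"

// validateBackoffConfig sketches the invariant: with backoff enabled,
// retries must be > 0, otherwise doBackoffRetry() can return a nil
// response with a nil error, which the ARM client then dereferences.
func validateBackoffConfig(backoffEnabled bool, retries int) error {
	if backoffEnabled && retries <= 0 {
		return fmt.Errorf("cloudProviderBackoffRetries must be > 0 when cloudProviderBackoff is enabled, got %d", retries)
	}
	return nil
}
```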
The README.md needed a small update, because the documented defaults
are misleading: they don't apply when the cluster-autoscaler is
provided a config file, due to:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_manager.go#L299-L308
... which also causes all environment variables to be ignored when a
configuration file is provided.
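The shape of the linked code is roughly the following (a paraphrased
sketch): the config-file branch returns early, so the env-var branch,
where the documented defaults live, never runs:
```
import (
	"encoding/json"
	"io"
	"os"
)

type azureConfig struct {
	Location       string `json:"location"`
	SubscriptionID string `json:"subscriptionId"`
	// ... remaining settings elided
}

// buildConfig paraphrases the linked logic: a provided config file is the
// sole source of settings; environment variables (and the documented
// defaults) only apply when no file is given.
func buildConfig(configReader io.Reader) (*azureConfig, error) {
	cfg := &azureConfig{}
	if configReader != nil {
		body, err := io.ReadAll(configReader)
		if err != nil {
			return nil, err
		}
		if err := json.Unmarshal(body, cfg); err != nil {
			return nil, err
		}
		return cfg, nil // env vars below are never consulted
	}
	cfg.Location = os.Getenv("LOCATION")
	cfg.SubscriptionID = os.Getenv("ARM_SUBSCRIPTION_ID")
	// ... more env-derived settings and defaults
	return cfg, nil
}
```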