When Azure fails to provision a node for a nodegroup due to an instance capacity issue ((Zonal)AllocationFailed) or another reason, the VMSS size increase is still reflected, but the new instance gets the status `ProvisioningStateFailed`. This change bubbles the error up to the `cloudprovider.Instance`, where `clusterstate` can use it to put the nodegroup into backoff sooner.
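For reference, a minimal sketch (not the actual provider code) of how a failed provisioning state can be surfaced through `cloudprovider.Instance`, assuming the usual `InstanceStatus`/`InstanceErrorInfo` types; the error code string is illustrative:

```go
import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
)

// instanceStatusFromProvisioningState maps a VMSS VM whose provisioning
// failed into an Instance carrying error details, so clusterstate can
// back the node group off early instead of waiting for a timeout.
func instanceStatusFromProvisioningState(id, provisioningState string) cloudprovider.Instance {
	status := &cloudprovider.InstanceStatus{State: cloudprovider.InstanceCreating}
	if provisioningState == "Failed" { // e.g. after (Zonal)AllocationFailed
		status.ErrorInfo = &cloudprovider.InstanceErrorInfo{
			ErrorClass:   cloudprovider.OutOfResourcesErrorClass,
			ErrorCode:    "provisioning-state-failed", // illustrative code
			ErrorMessage: "Azure failed to provision a node for this node group",
		}
	}
	return cloudprovider.Instance{Id: id, Status: status}
}
```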
The skewer library's cache is re-created on every call, which puts
pressure on the Azure API and slows down cluster-autoscaler startup
by two minutes on my small (120 nodes, 300 VMSS) test cluster.
This was also hitting the API twice on cache miss to look for non-promo
instance types (even when the instance name doesn't end with "_Promo").
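For illustration, a minimal sketch of the fix's direction, with hypothetical names (`skuCache` and `skuInfo` are not the provider's identifiers): keep a single SKU cache for the provider's lifetime and normalize the "_Promo" suffix before the lookup so a miss doesn't trigger a second API query:

```go
import (
	"strings"
	"sync"
)

type skuInfo struct{ VCPUs, MemoryGB int64 }

// skuCache is built once (from a single SKU list call) and reused for
// the provider's lifetime instead of being re-created on every lookup.
type skuCache struct {
	mu   sync.Mutex
	skus map[string]skuInfo // keyed by lower-cased instance type name
}

func (c *skuCache) get(instanceType string) (skuInfo, bool) {
	// Promo SKUs share the non-promo SKU's shape, so normalize the name
	// up front instead of issuing a second lookup on cache miss.
	name := strings.TrimSuffix(strings.ToLower(instanceType), "_promo")
	c.mu.Lock()
	defer c.mu.Unlock()
	sku, ok := c.skus[name]
	return sku, ok
}
```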
Currently, cluster-autoscaler uses a hard-coded (static) list of instanceTypes to scale from zero, as there is no node to build the blueprint information from. This static list needs to be updated every time a new VMSS is added, which is not feasible.
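A hedged sketch of the intended direction (names like `fetchCapacityFromSKUAPI` and `staticInstanceTypes` are illustrative, not the provider's actual identifiers): resolve instance type capacity dynamically, and only fall back to the hard-coded table when the lookup fails:

```go
import "fmt"

type instanceCapacity struct{ VCPUs, MemoryMB int64 }

// staticInstanceTypes stands in for the current hard-coded table.
var staticInstanceTypes = map[string]instanceCapacity{
	"Standard_D2s_v3": {VCPUs: 2, MemoryMB: 8192},
}

// fetchCapacityFromSKUAPI is a placeholder for a dynamic lookup against
// the Azure SKUs API (for instance through a cached skewer client).
func fetchCapacityFromSKUAPI(instanceType string) (instanceCapacity, bool) {
	return instanceCapacity{}, false // stubbed out for the sketch
}

func getInstanceCapacity(instanceType string) (instanceCapacity, error) {
	if c, ok := fetchCapacityFromSKUAPI(instanceType); ok {
		return c, nil
	}
	if c, ok := staticInstanceTypes[instanceType]; ok {
		return c, nil // fallback: legacy static list
	}
	return instanceCapacity{}, fmt.Errorf("unknown instance type %q", instanceType)
}
```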
This is the first step of implementing
https://github.com/kubernetes/autoscaler/issues/3583#issuecomment-743215343.
A new method was added to the cloudprovider interface. All existing
providers were updated with a no-op stub implementation, which results
in no behavior change.
The config values specified per NodeGroup are not yet applied.
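The new method isn't quoted here; assuming it is the per-NodeGroup options getter discussed in the linked issue (method name and signature shown for illustration), a no-op stub on a provider's NodeGroup implementation looks roughly like this:

```go
import (
	"k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
	"k8s.io/autoscaler/cluster-autoscaler/config"
)

// GetOptions would return per-NodeGroup autoscaling options; returning
// ErrNotImplemented keeps the current behavior (global defaults apply).
func (ng *NodeGroup) GetOptions(defaults config.NodeGroupAutoscalingOptions) (*config.NodeGroupAutoscalingOptions, error) {
	return nil, cloudprovider.ErrNotImplemented
}
```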
When a `vmssVmsCacheJitter` is provided, API calls (after start)
will be randomly spread over the provided time range, then happen
at regular intervals (for a given VMSS). This prevents API call
spikes.
But we noticed that the various VMSS refreshes will progressively
converge and agglomerate over time (in particular after a few large
throttling windows affected the autoscaler), which defeats the
purpose.
Re-randomizing the next refresh deadline every time (rather than
just at autoscaler start) keeps the calls properly spread.
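A minimal sketch of the re-randomization (function and parameter names are illustrative): the random offset is re-drawn on every refresh rather than only once at start:

```go
import (
	"math/rand"
	"time"
)

// nextRefreshIn re-draws the jitter on every call, so each VMSS
// schedules its next refresh somewhere in [ttl-jitter, ttl] instead of
// keeping the single offset drawn at autoscaler start.
func nextRefreshIn(ttl, jitter time.Duration) time.Duration {
	if jitter <= 0 {
		return ttl
	}
	return ttl - time.Duration(rand.Int63n(int64(jitter)))
}
```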
Configuring `vmssVmsCacheJitter` and `vmssVmsCacheTTL` lets users
control the average and worst-case refresh interval (and the average
API call rate). We can still count on VMSS size change detection
to trigger early refreshes when needed.
That's a small behaviour change, but now is probably still a good time
for it, as `vmssVmsCacheJitter` was introduced recently and wasn't part
of any release yet.
This allows specifying effective node resource capacity using Scale
Set tags, preventing wrong CA decisions and infinite upscales when pod
requests fit within the instance type's capacity but exceed the
Kubernetes node's allocatable (which may include system and kubelet
reservations), when node-infos are built from instance templates
(i.e. when scaling from 0).
This is similar to what the AWS (with launch configuration tags) and
GCP (with instance template metadata) cloud providers offer; the tag
format follows AWS' for consistency.
See also:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/min_at_zero_gcp.md
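For illustration, a sketch of deriving node capacity from Scale Set tags; the tag prefix shown here is an assumption (the real format mirrors AWS' node-template resource tags, adapted to Azure tag-name restrictions):

```go
import (
	"strings"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// Assumed prefix, for illustration only.
const resourcesTagPrefix = "k8s.io_cluster-autoscaler_node-template_resources_"

// resourcesFromTags builds a ResourceList from Scale Set tags, so the
// template-based node-info can report allocatable rather than raw
// instance type capacity.
func resourcesFromTags(tags map[string]*string) apiv1.ResourceList {
	capacity := apiv1.ResourceList{}
	for name, value := range tags {
		if !strings.HasPrefix(name, resourcesTagPrefix) || value == nil {
			continue
		}
		if quantity, err := resource.ParseQuantity(*value); err == nil {
			capacity[apiv1.ResourceName(strings.TrimPrefix(name, resourcesTagPrefix))] = quantity
		}
	}
	return capacity
}
```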
On (re)start, cluster-autoscaler refreshes all VMSS instance caches
at once, and sets those caches' TTL to 5 minutes. All VMSS VM List calls
(for VMSS discovered at boot) will then keep hitting the ARM API at the
same time, potentially causing regular throttling bursts.
Exposing an optional jitter, subtracted from the first scheduled refresh
delay, splays those calls (except for the very first one, at start),
while keeping the predictable refresh interval (at most 5 minutes, unless
the VMSS changed) after that first refresh.
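A minimal sketch of the initial splay, assuming a per-ScaleSet `lastRefresh` timestamp as described above (field and constructor names are illustrative):

```go
import (
	"math/rand"
	"time"
)

type scaleSetCache struct {
	lastRefresh time.Time
}

// newScaleSetCache backdates the initial refresh timestamp by a random
// splay, so the cache of each VMSS discovered at start expires somewhere
// in [TTL-jitter, TTL] rather than all at once; later refreshes then
// happen at the regular TTL.
func newScaleSetCache(cacheJitter time.Duration) *scaleSetCache {
	splay := time.Duration(0)
	if cacheJitter > 0 {
		splay = time.Duration(rand.Int63n(int64(cacheJitter)))
	}
	return &scaleSetCache{lastRefresh: time.Now().Add(-splay)}
}
```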
`fetchAutoAsgs()` is called at regular intervals, fetches a list of VMSS,
then calls `Register()` to cache each of them. That registration function
tells the caller whether that VMSS's cache is outdated (when the provided
VMSS, supposedly fresh, differs from the one held in cache) and replaces
the existing cache entry with the provided VMSS (which in effect requires
a forced refresh, since that ScaleSet struct is passed by fetchAutoAsgs
with a nil lastRefresh time and an empty instanceCache).
To detect changes, `Register()` uses a `reflect.DeepEqual()` between the
provided and the cached VMSS, which always finds them different: cached
VMSS were enriched with instance lists (while the provided one is blank,
fresh from a simple vmss.list call). That DeepEqual is also fragile
because the compared structs contain mutexes (that may or may not be
held) and refresh timestamps, attributes that shouldn't be relevant to
the comparison.
As a consequence, every Register() call causes an indirect cache
invalidation and a costly refresh (VMSS VMs List). The number of
Register() calls is directly proportional to the number of VMSS attached
to the cluster, and can easily trigger ARM API throttling.
With a large number of VMSS, that throttling prevents `fetchAutoAsgs`
from ever succeeding (and cluster-autoscaler from starting), e.g.:
```
I0807 16:55:25.875907 153 azure_scale_set.go:344] GetScaleSetVms: starts
I0807 16:55:25.875915 153 azure_scale_set.go:350] GetScaleSetVms: scaleSet.Name: a-testvmss-10, vmList: []
E0807 16:55:25.875919 153 azure_scale_set.go:352] VirtualMachineScaleSetVMsClient.List failed for a-testvmss-10: &{true 0 2020-08-07 17:10:25.875447854 +0000 UTC m=+913.985215807 azure cloud provider throttled for operation VMSSVMList with reason "client throttled"}
E0807 16:55:25.875928 153 azure_manager.go:538] Failed to regenerate ASG cache: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
F0807 16:55:25.875934 153 azure_cloud_provider.go:167] Failed to create Azure Manager: Retriable: true, RetryAfter: 899s, HTTPStatusCode: 0, RawError: azure cloud provider throttled for operation VMSSVMList with reason "client throttled"
goroutine 28 [running]:
```
Of the [`ScaleSet` struct attributes](https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L74-L89)
(manager, sizes, mutexes, refresh timestamps), only the sizes are relevant
to that comparison. `curSize` is not strictly necessary, but comparing it
provides early instance cache refreshes.
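A sketch of the comparison this implies (field names follow the linked `ScaleSet` struct; treat the exact field set as illustrative rather than the literal patch):

```go
// scaleSetsDiffer compares only the size-related fields of the linked
// ScaleSet struct, instead of reflect.DeepEqual over the whole struct
// (manager, mutexes and refresh timestamps excluded).
func scaleSetsDiffer(existing, discovered *ScaleSet) bool {
	return existing.minSize != discovered.minSize ||
		existing.maxSize != discovered.maxSize ||
		existing.curSize != discovered.curSize
}
```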
The following things changed in the scheduler and needed to be fixed
(see the sketch after this list):
* NodeInfo was moved to schedulerframework
* Some fields on NodeInfo are now exposed directly instead of via getters
* NodeInfo.Pods is now a list of *schedulerframework.PodInfo, not *apiv1.Pod
* SharedLister and NodeInfoLister were moved to schedulerframework
* PodLister was removed
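A hedged sketch of the kind of call-site update this implies (the exact import path depends on the vendored scheduler version, and the helper function is illustrative):

```go
import (
	apiv1 "k8s.io/api/core/v1"
	schedulerframework "k8s.io/kubernetes/pkg/scheduler/framework"
)

// buildNodeInfo shows the new shape: NodeInfo lives in the scheduler
// framework package, wraps pods into PodInfo, and exposes Pods as a
// field rather than through a getter.
func buildNodeInfo(node *apiv1.Node, pods []*apiv1.Pod) *schedulerframework.NodeInfo {
	nodeInfo := schedulerframework.NewNodeInfo(pods...)
	nodeInfo.SetNode(node)
	for _, podInfo := range nodeInfo.Pods { // []*schedulerframework.PodInfo now
		_ = podInfo.Pod // the underlying *apiv1.Pod, no longer a bare pod list
	}
	return nodeInfo
}
```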