- this is so that scale-down is not blocked on local storage volumes
- for pods where it is okay to ignore local storage volumes
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: tests failing
- there was a problem in the logic
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add unit test for `IgnoreLocalStorageVolumeKey`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: use `IgnoreLocalStorageVolumeKey` in tests instead of hardcoding the annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: wording for test name
- `pod with EmptyDir but IgnoreLocalStorageVolumeKey annotation` -> `pod with EmptyDir and IgnoreLocalStorageVolumeKey annotation`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: simulator drain tests failing
- set local storage vol name (required)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: add support for multiple vals in `safe-to-evict-local-volume` annotation
- add more unit tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename ignore local vol key `safe-to-evict-local-volume` -> `safe-to-evict-local-volumes`
- abstract the code that processes the annotation into a separate fn
- shorten names for test cases
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
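As a concrete illustration of the multi-value support added here, a minimal sketch of comma-separated annotation parsing; the helper name and exact semantics are assumptions, not the autoscaler's actual implementation (the full key is assumed to live under the `cluster-autoscaler.kubernetes.io` prefix):

```go
package main

import (
	"fmt"
	"strings"
)

// Assumed full annotation key for the `safe-to-evict-local-volumes` name above.
const safeToEvictLocalVolumesKey = "cluster-autoscaler.kubernetes.io/safe-to-evict-local-volumes"

// ignorableLocalVolumes parses the annotation's comma-separated value into a
// set of local volume names that scale-down may ignore.
func ignorableLocalVolumes(annotations map[string]string) map[string]bool {
	ignored := map[string]bool{}
	for _, name := range strings.Split(annotations[safeToEvictLocalVolumesKey], ",") {
		if name = strings.TrimSpace(name); name != "" {
			ignored[name] = true
		}
	}
	return ignored
}

func main() {
	fmt.Println(ignorableLocalVolumes(map[string]string{
		safeToEvictLocalVolumesKey: "scratch, cache",
	})) // map[cache:true scratch:true]
}
```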
docs: update FAQ with info about `safe-to-evict-local-volumes` annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: add the FAQ for `safe-to-evict-local-volumes` annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: fix formatting for `safe-to-evict-local-volumes` in FAQ
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: format the `safe-to-evict-local-volumes` as a bullet
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: fix `Unless` -> `unless` to make it consistent with other lines
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add an extra test for mismatching local vol value in annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: make the wording clearer
- for `safe-to-evict-local-volumes` annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit 144a64a402)
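To make the documented behavior concrete, a hedged example of the kind of pod the FAQ entry is about: its EmptyDir volume is listed in the annotation, so scale-down is not blocked by that local storage volume (check the FAQ for the exact annotation key in your release):

```go
// assumes apiv1 "k8s.io/api/core/v1" and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
pod := &apiv1.Pod{
	ObjectMeta: metav1.ObjectMeta{
		Name: "example",
		Annotations: map[string]string{
			"cluster-autoscaler.kubernetes.io/safe-to-evict-local-volumes": "scratch",
		},
	},
	Spec: apiv1.PodSpec{
		Volumes: []apiv1.Volume{{
			Name:         "scratch", // must match a name listed in the annotation
			VolumeSource: apiv1.VolumeSource{EmptyDir: &apiv1.EmptyDirVolumeSource{}},
		}},
	},
}
```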
fix: set `replicated` to true if the controller ref's `Controller` field is `true`
- forgot to add this in the last commit
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit f8f458295d)
fix: remove `checkReferences`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit 5df6e31f8b)
test(drain): add test for custom controller pod
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: add flag to allow scale down on custom controller pods
- set to `false` by default
- the default will be flipped to `true` in the future
- right now, we want to ensure backwards compatibility and make the feature available only if the flag is explicitly set to `true`
- TODO: this code might need unit tests; look into adding them
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
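For context, a hedged sketch of what a "custom controller pod" means here: the pod's controller OwnerReference points at a kind the autoscaler has no built-in support for (all names below are made up):

```go
// assumes apiv1 "k8s.io/api/core/v1" and metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
func customControllerPod() *apiv1.Pod {
	isController := true
	return &apiv1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "example",
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "example.com/v1", // hypothetical CRD group/version
				Kind:       "CustomWorkload", // not a built-in kind like ReplicaSet or StatefulSet
				Name:       "my-app",
				Controller: &isController, // marks the pod as controller-owned
			}},
		},
	}
}
```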
fix: remove `at` symbol in prefix of `vadasambar`
- to keep it consistent with previous such mentions in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(utils): run all drain tests twice
- once for `allowScaleDownOnCustomControllerOwnedPods=false`
- and once for `allowScaleDownOnCustomControllerOwnedPods=true`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs(utils): add description for `testOpts` struct
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: update FAQ with info about `allow-scale-down-on-custom-controller-owned-pods` flag
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename `allow-scale-down-on-custom-controller-owned-pods` -> `skip-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename `allowScaleDownOnCustomControllerOwnedPods` -> `skipNodesWithCustomControllerPods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(utils/drain): fix failing tests
- refactor code to add custom controller pod test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: fix long code comments
- clean up print statements
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: move `expectFatal` right above where it is used
- makes the code easier to read
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: fix code comment wording
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: address PR comments
- abstract legacy code to check for replicated pods into a separate function so that it's easier to remove in the future
- fix param info in the FAQ.md
- simplify tests and remove the global variable used in the tests
- rename `--skip-nodes-with-custom-controller-pods` -> `--scale-down-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename flag `--scale-down-nodes-with-custom-controller-pods` -> `--skip-nodes-with-custom-controller-pods`
- refactor tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: update flag info
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: forgot to change flag name on a line in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
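A minimal sketch, assuming Go's standard `flag` package, of how such a boolean flag could be wired; the autoscaler has its own flag plumbing, and the default shown is only what the surrounding commits suggest (skipping preserves the legacy behavior):

```go
// assumes the standard library "flag" package
var skipNodesWithCustomControllerPods = flag.Bool(
	"skip-nodes-with-custom-controller-pods",
	true, // assumed default: keep skipping such nodes for backwards compatibility
	"If true, scale-down skips nodes running pods owned by custom controllers",
)
```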
refactor: use `ControllerRef()` directly instead of `controllerRef`
- we don't need an extra variable
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: create consolidated test cases
- by looping over and tweaking shared test cases
- so that we don't have to duplicate the shared test cases
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: append test flag to shared test description
- so that the failed test is easy to identify
- shallow copy tests and add comments so that others do the same
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
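Putting the last few test refactors together, a sketch of the resulting test shape; `drainTestCase` and the commented-out `runDrainTest` are illustrative names, not the real helpers:

```go
package drain

import (
	"fmt"
	"testing"
)

// drainTestCase is a stand-in for the shared test case struct.
type drainTestCase struct {
	description                       string
	skipNodesWithCustomControllerPods bool
}

func TestDrainSharedCases(t *testing.T) {
	sharedTests := []drainTestCase{
		{description: "pod owned by a custom controller"},
	}
	for _, skip := range []bool{false, true} {
		for _, shared := range sharedTests {
			tc := shared // shallow copy so per-flag tweaks don't leak across runs
			tc.skipNodesWithCustomControllerPods = skip
			// Append the flag value so a failing case is easy to identify.
			tc.description = fmt.Sprintf("%s (skipNodesWithCustomControllerPods=%v)",
				tc.description, skip)
			t.Run(tc.description, func(t *testing.T) {
				// runDrainTest(t, tc) // real assertions would go here
			})
		}
	}
}
```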
* Added the new `resource_name` field to `scaled_up/down_gpu_nodes_total`,
representing the resource name of the GPU.
* Changed metrics registrations to use GpuConfig
* Changed the `utilization.Calculate()` function to use GpuConfig
instead of the GPU label.
* Started using GpuConfig in utilization threshold calculations.
* Added GetNodeGpuConfig to cloud provider which returns a GpuConfig
struct containing the gpu label, type and resource name if the node
has a GPU.
* Added an initial implementation of GetNodeGpuConfig to all cloud
providers.
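A rough shape of the GpuConfig these bullets describe; the field set follows the bullets (label, type, resource name), but the exact field names and types are assumptions:

```go
// assumes apiv1 "k8s.io/api/core/v1"

// GpuConfig describes a node's GPU for metrics and utilization accounting.
type GpuConfig struct {
	Label        string             // node label that identifies the GPU
	Type         string             // GPU type (typically the label's value)
	ResourceName apiv1.ResourceName // extended resource name, e.g. "nvidia.com/gpu"
}

// GetNodeGpuConfig, as described above, would return nil for nodes without a GPU:
// func (p *provider) GetNodeGpuConfig(node *apiv1.Node) *GpuConfig
```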
Node state is refreshed and checked again before deleting the node.
This gives kube-scheduler time to acknowledge that the nodes' state has
changed and to stop scheduling pods on them.
This change brings in a new command line flag,
`--record-duplicated-events`, which allows a user to enable the
duplication of events, bypassing the 5-minute de-duplication window.
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as the Kubernetes API is concerned, but CA will still report
some of them as unready if they are missing GPU resources. Explicitly calling
them out in the status ConfigMap points in the right direction.
Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now use
the same shared DeepCopyTemplateNode function and inherit its naming
pattern, which is great, as that fixes a long-standing bug.
Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshots with
generated nodeinfos and nodes having predictable names (using template name
+ an incremental ordinal starting at 0) for upcoming nodes.
Later, when it looks for nodes that fit unschedulable pods (when upcoming
nodes don't satisfy those, e.g. FitsAnyNodeMatching failing due to node
capacity or pod anti-affinity, ...), the binpacking estimator will also build
virtual nodes and place them in a snapshot fork to evaluate scheduler predicates.
Those temporary virtual nodes are built using the same pattern (template name
and an index ordinal also starting at 0) as the one previously used by
`getUpcomingNodeInfos`, which means it will generate the same nodeinfos/nodes
names for nodegroups having upcoming nodes.
But adding nodes with the same name to an existing cluster snapshot isn't
allowed, and the evaluation attempt will fail.
Practically this blocks re-upscales for nodegroups having upcoming nodes,
which can cause a significant delay.
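A self-contained toy sketch of the collision mechanics described above: both code paths derive names from the template plus an ordinal starting at 0, and the snapshot rejects a second node with an existing name (the map-based snapshot is a stand-in for the real one):

```go
package main

import "fmt"

// toySnapshot stands in for the cluster snapshot: node names must be unique.
type toySnapshot map[string]bool

func (s toySnapshot) AddNode(name string) error {
	if s[name] {
		return fmt.Errorf("node %q already in snapshot", name)
	}
	s[name] = true
	return nil
}

func main() {
	snapshot := toySnapshot{}
	// getUpcomingNodeInfos adds upcoming nodes: template name + ordinal from 0.
	_ = snapshot.AddNode("template-ng1-0")
	// The binpacking estimator later builds virtual nodes with the same
	// pattern, so the very first one collides and the evaluation fails.
	if err := snapshot.AddNode("template-ng1-0"); err != nil {
		fmt.Println(err) // node "template-ng1-0" already in snapshot
	}
}
```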
treated as unready. Deprecated LongNotStarted
In cases where node n1 would:
1) be created at t=0min
2) have its Ready condition turn true at t=2.5min
3) have its not-ready taint removed at t=3min
the ready node was counted as unready.
Tested cases after fix:
1) Case described above
2) Nodes not starting even after 15 mins are still
treated as unready
3) Nodes created long ago that suddenly become unready are
counted as unready.
The function copying the template node for upcoming nodes was not
changing the hostname label, meaning that features relying on this
label (e.g. pod anti-affinity on hostname topology) would treat all
upcoming nodes as a single node.
This resulted in triggering too many scale-ups for pods using such
features. The analogous function in binpacking didn't have the same
bug (but it didn't set a unique UID or pod names).
I extracted the functionality to a util function used in both
places to avoid the two functions getting out of sync again.
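A hedged sketch of what the extracted util function needs to do, per the description above: give each copy a unique name, UID, and hostname label so scheduler features see distinct nodes (the function name and naming pattern are illustrative):

```go
// assumes apiv1 "k8s.io/api/core/v1" and "k8s.io/apimachinery/pkg/types"
func copyTemplateNode(template *apiv1.Node, ordinal int) *apiv1.Node {
	node := template.DeepCopy()
	node.Name = fmt.Sprintf("%s-%d", template.Name, ordinal)
	node.UID = types.UID(node.Name) // unique UID per copy
	if node.Labels == nil {
		node.Labels = map[string]string{}
	}
	// Without this, all copies share one hostname and pod anti-affinity on
	// hostname topology would treat them as a single node.
	node.Labels["kubernetes.io/hostname"] = node.Name
	return node
}
```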