Commit Graph

265 Commits

Author SHA1 Message Date
Kubernetes Prow Robot 0c83e28f76
Merge pull request #5689 from x13n/pluggable-drain
Move mirror pods check to a dedicated rule
2023-05-19 06:20:30 -07:00
Daniel Kłobuszewski ebaa81e9cd Move mirror pods check to a dedicated rule 2023-05-19 14:45:44 +02:00
Kubernetes Prow Robot 114a35961a
Merge pull request #5705 from damikag/fix-race-condition-between-ca-fetching
bugfix: fix race condition between CA fetching list of scheduled pods…
2023-05-12 05:23:01 -07:00
Damika Gamlath 3b4d6d62b9 bugfix: fix race condition between CA fetching list of scheduled pods and pods being scheduled 2023-05-12 11:53:50 +00:00
Jayant Jain fbf0c64ddb Refactor taints.go to support taint values 2023-05-11 14:19:17 +02:00
Bartłomiej Wróblewski b8d40fdd3c Add status taints option to template creation 2023-04-19 13:55:38 +00:00
Kubernetes Prow Robot 1009797f55
Merge pull request #5594 from vadasambar/feat/3947/ignore-some-local-storage-volumes
feat: add annotation to ignore local storage volume during scale down
2023-04-17 02:16:44 -07:00
vadasambar b663f138a4 feat: add annotation to ignore local storage volume during scale down
- this is so that scale down is not blocked on local storage volume
- for pods where it is okay to ignore local storage volume
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: tests failing
- there was a problem in the logic
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add unit test for `IgnoreLocalStorageVolumeKey`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `IgnoreLocalStorageVolumeKey` in tests instead of hardcoding the annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: wording for test name
- `pod with EmptyDir but IgnoreLocalStorageVolumeKey annotation` -> `pod with EmptyDir and IgnoreLocalStorageVolumeKey annotation`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: simulator drain tests failing
- set local storage vol name (required)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: add support for multiple vals in `safe-to-evict-local-volume` annotation
- add more unit tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename ignore local vol key `safe-to-evict-local-volume` -> `safe-to-evict-local-volumes`
- abstract code to process annotation into a separate fn
- shorten name for test cases
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: update FAQ with info about `safe-to-evict-local-volumes` annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: add the FAQ for `safe-to-evict-local-volumes` annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: fix formatting for `safe-to-evict-local-volumes` in FAQ
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: format the `safe-to-evict-local-volumes` as a bullet
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: fix `Unless` -> `unless` to make it consistent with other lines
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test: add an extra test for mismatching local vol value in annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: make the wording clearer
- for `safe-to-evict-local-volumes` annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-04-17 09:53:19 +05:30
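The annotation check described in the commit above can be sketched roughly as follows. This is an illustration only, not the autoscaler's actual code: the function name `localVolumesSafeToEvict`, the comma-separated value format, and the plain string slices standing in for pod volumes are all assumptions. The commit message only states that the annotation is named `safe-to-evict-local-volumes` and accepts multiple values.

```go
package main

import (
	"fmt"
	"strings"
)

// Short annotation name as it appears in the commit message. Any domain
// prefix the real key may carry is deliberately left out here.
const safeToEvictLocalVolumesKey = "safe-to-evict-local-volumes"

// localVolumesSafeToEvict is a hypothetical helper. It reports whether every
// local volume of a pod is listed in the annotation value. The
// comma-separated format is an assumption made for this sketch.
func localVolumesSafeToEvict(annotations map[string]string, localVolumes []string) bool {
	allowed := map[string]bool{}
	for _, name := range strings.Split(annotations[safeToEvictLocalVolumesKey], ",") {
		if name = strings.TrimSpace(name); name != "" {
			allowed[name] = true
		}
	}
	for _, v := range localVolumes {
		if !allowed[v] {
			// One unlisted local volume is enough to keep blocking scale down.
			return false
		}
	}
	return true
}

func main() {
	ann := map[string]string{safeToEvictLocalVolumesKey: "scratch, cache"}
	fmt.Println(localVolumesSafeToEvict(ann, []string{"scratch", "cache"})) // true
	fmt.Println(localVolumesSafeToEvict(ann, []string{"scratch", "data"}))  // false
}
```

The second call mirrors the "mismatching local vol value" test mentioned in the commit: a volume missing from the annotation still blocks eviction in this sketch.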
Bartłomiej Wróblewski d5d0a3c7b7 Fix drain logic when skipNodesWithCustomControllerPods=false, set NodeDeleteOptions correctly 2023-04-04 09:50:26 +00:00
Kubernetes Prow Robot dcf8f822f5
Merge pull request #5551 from askoriy/memory-volumes-evictable
Consider pods with emptydir volume in memory to be evictable
2023-03-24 08:14:32 -07:00
vadasambar ff6fe5833d feat: check only controller ref to decide if a pod is replicated
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit 144a64a402)

fix: set `replicated` to true if controller ref is set to `true`
- forgot to add this in the last commit

Signed-off-by: vadasambar <surajrbanakar@gmail.com>
(cherry picked from commit f8f458295d)

fix: remove `checkReferences`
- not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

(cherry picked from commit 5df6e31f8b)

test(drain): add test for custom controller pod
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

feat: add flag to allow scale down on custom controller pods
- set to `false` by default
- `false` will be set to `true` by default in the future
- right now, we want to ensure backwards compatibility and make the feature available if the flag is explicitly set to `true`
- TODO: this code might need some unit tests. Look into adding unit tests.
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: remove `at` symbol in prefix of `vadasambar`
- to keep it consistent with previous such mentions in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(utils): run all drain tests twice
- once for `allowScaleDownOnCustomControllerOwnedPods=false`
- and once for `allowScaleDownOnCustomControllerOwnedPods=true`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs(utils): add description for `testOpts` struct
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: update FAQ with info about `allow-scale-down-on-custom-controller-owned-pods` flag
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `allow-scale-down-on-custom-controller-owned-pods` -> `skip-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename `allowScaleDownOnCustomControllerOwnedPods` -> `skipNodesWithCustomControllerPods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

test(utils/drain): fix failing tests
- refactor code to add custom controller pod test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: fix long code comments
- clean-up print statements
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: move `expectFatal` right above where it is used
- makes the code easier to read
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: fix code comment wording
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: address PR comments
- abstract legacy code to check for replicated pods into a separate function so that it's easier to remove in the future
- fix param info in the FAQ.md
- simplify tests and remove the global variable used in the tests
- rename `--skip-nodes-with-custom-controller-pods` -> `--scale-down-nodes-with-custom-controller-pods`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: rename flag `--scale-down-nodes-with-custom-controller-pods` -> `--skip-nodes-with-custom-controller-pods`
- refactor tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

docs: update flag info
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

fix: forgot to change flag name on a line in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: use `ControllerRef()` directly instead of `controllerRef`
- we don't need an extra variable
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: create consolidated test cases
- from looping over and tweaking shared test cases
- so that we don't have to duplicate shared test cases
Signed-off-by: vadasambar <surajrbanakar@gmail.com>

refactor: append test flag to shared test description
- so that the failed test is easy to identify
- shallow copy tests and add comments so that others do the same
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
2023-03-22 10:51:07 +05:30
Oleksandr Skoryi 946189f81b
Consider pods with emptydir volume in memory to be evictable 2023-02-28 19:18:59 +01:00
Kubernetes Prow Robot b516e808c6
Merge pull request #5477 from BigDarkClown/taint
Merge taint utils into one package, make taint modifying methods public
2023-02-23 04:13:34 -08:00
Hakan Bostan 2ea2fb66f6 Add "resource_name" to scaled_up_gpu_nodes_total and scaled_down_gpu_nodes_total metrics
* Added the new resource_name field to scaled_up/down_gpu_nodes_total,
  representing the resource name for the gpu.
* Changed metrics registrations to use GpuConfig
2023-02-22 10:09:45 +00:00
Hakan Bostan 2b602fca9f Use GpuConfig in utilization calculations for scale-down
* Changed the `utilization.Calculate()` function to use GpuConfig
  instead of GPU label.
* Started using GpuConfig in utilization threshold calculations.
2023-02-15 08:28:24 +00:00
Hakan Bostan 1f646e4095 Add GetNodeGpuConfig to cloud provider
* Added GetNodeGpuConfig to cloud provider which returns a GpuConfig
  struct containing the gpu label, type and resource name if the node
  has a GPU.
* Added initial implementation of the GetNodeGpuConfig to all cloud
  providers.
2023-02-14 14:08:29 +00:00
Bartłomiej Wróblewski b5ead036a8 Merge taint utils into one package, make taint modifying methods public 2023-02-13 11:29:45 +00:00
Bartłomiej Wróblewski 0470fdfc35 Clean up DS utils: remove unused cluster snapshot and predicate checker 2023-01-23 14:14:53 +00:00
Kubernetes Prow Robot af23e6187e
Merge pull request #5276 from pacoxu/master
Stop applying the beta.kubernetes.io/os and arch
2022-12-16 03:10:17 -08:00
Bartłomiej Wróblewski c3d8e81b98 Don't add pods from drained nodes in scale-down 2022-12-09 16:26:54 +00:00
Paco Xu 8dec2025f8 Stop applying the beta.kubernetes.io/os and arch 2022-10-27 12:20:04 +08:00
Daniel Kłobuszewski accf58f36c Base parallel scale down implementation 2022-10-24 20:14:48 +02:00
Daniel Kłobuszewski 18f2e67c4f Split out code from simulator package 2022-10-18 11:51:44 +02:00
Alexandru Matei 0ee2a359e7 Add option to wait for a period of time after node tainting/cordoning
Node state is refreshed and checked again before deleting the node.
It gives kube-scheduler time to acknowledge that the nodes' state has
changed and to stop scheduling pods on them.
2022-10-13 10:37:56 +03:00
Flavian f1b6d4ded6 handle directx nodes the same as gpu nodes 2022-09-23 09:55:14 +02:00
Daniel Kłobuszewski c2a0329668 Limit amount of node utilization logging 2022-09-01 15:16:56 +02:00
Aleksandra Gacek ab2cc2fb8a Bump k/k dependencies to v1.25.0 together with go.mod go version. 2022-08-26 13:38:07 +02:00
mikelo c127763a45 switched policy for PodDisruptionBudget from v1beta1 to v1 in time for 1.25 2022-06-24 19:13:03 +02:00
Michael McCune 8c27f76933 add a flag to allow event duplication
this change brings in a new command line flag,
`--record-duplicated-events`, which allows a user to enable the
duplication of events bypassing the 5 minute de-duplication window.
2022-06-03 14:26:38 -04:00
Yaroslava Serdiuk 1cbcfbcbe7 Add ephemeral storage price to PodPrice 2022-06-03 16:12:17 +00:00
Yaroslava Serdiuk 581f1d7bc6 Add ephemeral storage price to NodePrice 2022-06-03 16:12:17 +00:00
Daniel Kłobuszewski c550b77020 Make NodeDeletionTracker implement ActuationStatus interface 2022-04-28 17:08:10 +02:00
Bartłomiej Wróblewski 9ef91eb71e Handle daemonsets using the daemonset controller logic 2022-03-23 12:10:44 +00:00
Daniel Kłobuszewski 109765b844 Skip pod hostname when comparing PodSpecs
Hostname doesn't affect scheduling and different hostnames prevent
caching of similar pods in CA.
2022-03-16 15:46:49 +01:00
Daniel Kłobuszewski 26769e4c1b Expose nodes with unready GPU in CA status
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as Kubernetes API is concerned, but CA will still report
some of them as unready if they are missing the GPU resource. Explicitly calling
them out in the status ConfigMap will point in the right direction.
2022-03-03 14:59:31 +01:00
Marwan Ahmed 286f44e351 fix pod equivalency checks for pods with projected volumes 2021-12-21 17:02:30 +02:00
Aleksandra Gacek 8939cb700c Use custom spam filtering function to event recorder. 2021-09-14 14:42:24 +02:00
Bartłomiej Wróblewski 1e4cb1eafe Move UpdateDeprecatedTemplateLabels function
This is a useful function, we will benefit from
having it more accessible than it is currently.
2021-08-04 14:32:39 +00:00
Daniel Kłobuszewski 44b8d67d50 Allow DaemonSet pods to opt in/out from eviction 2021-06-29 11:58:14 +02:00
Kubernetes Prow Robot 5ab7792a20
Merge pull request #4089 from DataDog/templates-names-collisions
Fix templated nodeinfo names collisions in BinpackingNodeEstimator
2021-05-24 03:23:38 -07:00
Brett Elliott 5cf64a2b3c Update vendor to v1.22.0-alpha.1 2021-05-20 22:02:41 +02:00
Benjamin Pineau 030a2152b0 Fix templated nodeinfo names collisions in BinpackingNodeEstimator
Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now use
the same shared DeepCopyTemplateNode function and inherit its naming
pattern, which is great as that fixes a long-standing bug.

Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshots with
generated nodeinfos and nodes having predictable names (using template name
+ an incremental ordinal starting at 0) for upcoming nodes.

Later, when it looks for fitting nodes for unschedulable pods (when upcoming
nodes don't satisfy those, e.g. FitsAnyNodeMatching failing due to node capacity
or pod anti-affinity), the binpacking estimator will also build virtual
nodes and place them in a snapshot fork to evaluate scheduler predicates.

Those temporary virtual nodes are built using the same pattern (template name
and an index ordinal also starting at 0) as the one previously used by
`getUpcomingNodeInfos`, which means it will generate the same nodeinfos/nodes
names for nodegroups having upcoming nodes.

But adding nodes by the same name in an existing cluster snapshot isn't
allowed, and the evaluation attempt will fail.

Practically this blocks re-upscales for nodegroups having upcoming nodes,
which can cause a significant delay.
2021-05-19 12:05:40 +02:00
Kubernetes Prow Robot f4c4a77940
Merge pull request #3989 from brett-elliott/useragent
Set cluster autoscaler-specific user agent.
2021-04-09 05:49:05 -07:00
Kubernetes Prow Robot 6432771415
Merge pull request #3971 from BigDarkClown/feat/resource-processor
Separate and refactor custom resources logic
2021-04-07 04:41:52 -07:00
Bartłomiej Wróblewski 1698e0e583 Separate and refactor custom resources logic 2021-04-07 10:31:11 +00:00
Brett Elliott 3b48a3193f Set cluster autoscaler-specific user agent.
Refactored mocks to remove redundancy.
2021-04-06 17:49:35 +02:00
Brett Elliott 013fa19be3 Log failed scale up metric based on string of AutoscalerErrorType. 2021-03-23 15:37:04 +01:00
Brett Elliott 4cddaed2f2 Support for reporting authorization errors during scale up 2021-03-17 14:56:03 +01:00
Vivek Bagade 8c592f0c04 Fix bug where a node that becomes ready after 2 mins can be
treated as unready. Deprecated LongNotStarted

 In cases where node n1 would:
 1) Be created at t=0min
 2) Ready condition is true at t=2.5min
 3) Not ready taint is removed at t=3min
 the ready node is counted as unready

 Tested cases after fix:
 1) Case described above
 2) Nodes not starting even after 15mins still
 treated as unready
 3) Nodes created long ago that suddenly become unready are
 counted as unready.
2021-03-11 18:32:51 +01:00
Maciek Pytel 9831623810 Set different hostname label for upcoming nodes
The function copying the template node to use for upcoming nodes was
not changing the hostname label, meaning that features relying on
this label (ex. pod antiaffinity on hostname topology) would
treat all upcoming nodes as a single node.
This resulted in triggering too many scale-ups for pods
using such features. Analogous function in binpacking didn't
have the same bug (but it didn't set unique UID or pod names).
I extracted the functionality to a util function used in both
places to avoid the two functions getting out of sync again.
2021-02-12 19:41:04 +01:00
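The fix described in the commit above amounts to giving each copy of a template node its own hostname label. A minimal sketch of that idea follows. The `node` struct, the `copyTemplateNode` helper, and the suffix-based naming are assumptions for illustration only. `kubernetes.io/hostname` is the well-known Kubernetes hostname label, while the real utility function works on Kubernetes API objects and also sets other fields mentioned in the commit, such as the UID and pod names.

```go
package main

import "fmt"

// hostnameLabel is the well-known Kubernetes node hostname label.
const hostnameLabel = "kubernetes.io/hostname"

// node is a hypothetical, heavily simplified stand-in for a Kubernetes Node.
type node struct {
	Name   string
	Labels map[string]string
}

// copyTemplateNode returns a copy of tmpl with a unique name and a matching
// hostname label, so that hostname-based topology sees distinct nodes.
// The labels map is copied so the template itself is never mutated.
func copyTemplateNode(tmpl node, suffix string) node {
	labels := make(map[string]string, len(tmpl.Labels))
	for k, v := range tmpl.Labels {
		labels[k] = v
	}
	name := fmt.Sprintf("%s-%s", tmpl.Name, suffix)
	labels[hostnameLabel] = name
	return node{Name: name, Labels: labels}
}

func main() {
	tmpl := node{
		Name:   "template-ng1",
		Labels: map[string]string{hostnameLabel: "template-ng1", "pool": "ng1"},
	}
	a := copyTemplateNode(tmpl, "upcoming-0")
	b := copyTemplateNode(tmpl, "upcoming-1")
	fmt.Println(a.Labels[hostnameLabel], b.Labels[hostnameLabel])
}
```

With distinct hostname labels, a pod anti-affinity rule keyed on hostname no longer treats all upcoming nodes as one node, which is the over-scaling symptom the commit describes.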