3004 Commits

Author SHA1 Message Date
Daniel Kłobuszewski 26769e4c1b Expose nodes with unready GPU in CA status
This change simplifies debugging GPU issues: without it, all nodes can
be Ready as far as Kubernetes API is concerned, but CA will still report
some of them as unready if they are missing the GPU resource. Explicitly calling
them out in the status ConfigMap will point in the right direction.
2022-03-03 14:59:31 +01:00
Kubernetes Prow Robot 994fbac99f
Merge pull request #4661 from olagacek/master
Remove disable scale down callback if schedulable pods are found in filter_out_schedulable.
2022-02-03 09:37:46 -08:00
Jayant Jain a906da2c6e mig_info_provider.go:fillMigInstances will now use locking when calling the gce api.
This is to avoid multiple gce calls for the same mig during scale down (which is done in parallel).
2022-02-03 12:25:53 +00:00
Aleksandra Gacek 834d02b2d5 Remove disable scale down callback if schedulable pods are found in
filter_out_schedulable.
2022-02-02 15:23:31 +01:00
Maciek Pytel a8f4981f4f Update import paths to clock utils library 2022-01-28 16:56:21 -08:00
Marwan Ahmed b0da013ec2 update vendor directory 2022-01-28 16:47:06 -08:00
Marwan Ahmed 4d4ecbef02 increase azure clients polling delay to 30s 2022-01-28 13:58:23 -08:00
Marwan Ahmed 6689f92cbc update delete async calls in scale sets 2022-01-28 13:58:15 -08:00
Marwan Ahmed 82c480d221 bump az cloudprovider version 2022-01-28 13:58:11 -08:00
Jayant Jain a3db650c26 CA: Debugging snapshotter locking optimisation for better transactions 2022-01-27 11:36:19 +00:00
Kubernetes Prow Robot b64d2949a5
Merge pull request #4633 from jayantjain93/debugging-snapshot-1
CA: Debugging snapshot adding a new field for TemplateNode.
2022-01-27 03:02:25 -08:00
Kubernetes Prow Robot f508212e9d
Merge pull request #4641 from x13n/nodeinfocache
Don't cache NodeInfo for recently Ready nodes
2022-01-27 01:56:10 -08:00
Daniel Kłobuszewski 9944137fae Don't cache NodeInfo for recently Ready nodes
There's a race condition between DaemonSet pods getting scheduled to a
new node and Cluster Autoscaler caching that node for the sake of
predicting future nodes in a given node group. We can reduce the risk of
missing some DaemonSet by providing a grace period before accepting nodes in the
cache. 1 minute should be more than enough, except for some pathological
edge cases.
2022-01-26 20:18:53 +01:00
Kubernetes Prow Robot 44170bc038
Merge pull request #4648 from marwanad/moar-instances
update azure instances and template with np-series SKU
2022-01-25 16:38:26 -08:00
Marwan Ahmed 24537b1ab7 properly set FPGA capacity 2022-01-25 15:51:08 -08:00
Kubernetes Prow Robot 28f549e4d1
Merge pull request #4636 from nxtlytics/fix-aws-asg-tags
Allow colon in AWS ASG autodiscovery tag keys
2022-01-25 15:35:42 -08:00
Marwan Ahmed 21a758c635 update azure instances with np-series 2022-01-25 14:36:52 -08:00
Jayant Jain 537e07fdb1 CA: Debugging snapshot adding a new field for TemplateNode. This captures all the templates for nodegroups present 2022-01-24 17:12:57 +00:00
Tyler Montgomery afc835a5dd allow colon in aws asg discovery tag names, update documentation 2022-01-21 10:34:20 -06:00
Joel Speed 9f670d4ea8
Ensure ClusterAPI DeleteNodes accounts for out-of-band scale changes
Because the autoscaler assumes it can delete nodes in parallel, it
fetches nodegroups for each node in separate goroutines and then
instructs each nodegroup to delete a single node.
Because we don't share the nodegroup across goroutines, the cached
replica count in the scalableresource can become stale and as such, if
the autoscaler attempts to scale down multiple nodes at a time, the
cluster api provider only actually removes a single node.

To prevent this, we must ensure we have a fresh replica count for every 
scale down attempt.
2022-01-21 16:08:00 +00:00
Kubernetes Prow Robot 5c741c881d
Merge pull request #4626 from lzhecheng/remove-deleteblob-ut
Remove TestDeleteBlob UT
2022-01-19 17:53:52 -08:00
Zhecheng Li 5b99b58ba1 Remove TestDeleteBlob UT
Signed-off-by: Zhecheng Li <zhechengli@microsoft.com>
2022-01-20 09:28:18 +08:00
Kubernetes Prow Robot f8266a5101
Merge pull request #4627 from yaroslava-serdiuk/templates
GCE: Add m2-megamem-416 price
2022-01-19 07:06:06 -08:00
Yaroslava Serdiuk abacf124ad GCE: Add m2-megamem-416 price 2022-01-19 14:51:22 +00:00
Kubernetes Prow Robot 698c02b17c
Merge pull request #4603 from yaroslava-serdiuk/templates
Introduce gce image types and remove *_containerd gce os distributions
2022-01-19 04:56:04 -08:00
Yaroslava Serdiuk 5380a9dd83 Cluster-Autoscaler: Introduce gce image types and remove *_containerd gce os distributions. 2022-01-19 12:26:36 +00:00
Kubernetes Prow Robot 91e8f8e40c
Merge pull request #4617 from kisieland/add_context_to_scale_down_processor
Add AutoscalingContext to the scale-down post-processor
2022-01-18 03:07:08 -08:00
Maciek Pytel 217d780160 Add FAQ entry about the go version used 2022-01-18 10:22:57 +01:00
Maciek Pytel 24f896cd9d Add go:build tags matching existing +build tags
As of go1.17 both tags are expected to exist simultaneously.
Added tags in all cluster autoscaler files. Added verify-gomod.sh
exceptions for non-compliant autogenerated VPA files.
2022-01-18 10:22:57 +01:00
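For reference, a file carrying both constraint styles looks roughly like the sketch below. The `!windows` constraint and the `buildConstraintNote` function are placeholders chosen for illustration; the tags used in the actual cluster autoscaler files are not shown in this log.

```go
//go:build !windows
// +build !windows

package main

// buildConstraintNote is a hypothetical function that exists only so the
// file has some content guarded by the constraints above.
func buildConstraintNote() string {
	return "compiled only on non-Windows platforms"
}
```

The `//go:build` line comes first, the legacy `// +build` line follows, and a blank line separates both from the `package` clause; gofmt from go1.17 onward keeps the two lines in sync.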
Daniel Gutowski a230b47fec Add AutoscalingContext to the scale-down post-processor 2022-01-18 07:58:53 +00:00
Benjamin Pineau 1aca77527a azure: change a flaky test
It seems that test gets varying error messages, which prompted
Bartłomiej's previous fix, but I'm now seeing the original error
message string back ("Server failed to authenticate [...]"),
so the `TestDeleteBlob` test is failing again (other PRs' test
failures suggest it's not just my laptop).

Let's assume this was meant to check for an error until someone
can confirm; that might be better than potentially hiding other
PRs' real test failures.
2022-01-17 19:01:05 +01:00
Kubernetes Prow Robot f5de590bea
Merge pull request #4580 from cprivite/Rename_Packet_to_Equinix_Metal
Rename packet to equinix metal
2022-01-13 08:04:30 -08:00
Kubernetes Prow Robot 441d7968fa
Merge pull request #4519 from kisieland/scale_down_candidate_select_processor
Introduce the scale down processor that picks the final scale down candidates
2022-01-13 08:02:30 -08:00
Kubernetes Prow Robot b9bfdc1bbc
Merge pull request #4579 from randomvariable/remove-randomvariable-owners
Cluster API OWNERS: Remove randomvariable
2022-01-13 07:12:30 -08:00
Kubernetes Prow Robot 80574ca166
Merge pull request #4508 from aledbf/done-error
Cluster Autoscaler: GCE: check the result of the operation
2022-01-13 07:08:30 -08:00
Kubernetes Prow Robot 00721caf97
Merge pull request #4582 from cprivite/Use_Current_cluster-autoscaler_image_In_Example
use gcr hosted cluster-autoscaler image
2022-01-13 06:18:30 -08:00
Bartłomiej Wróblewski f0a9ede345 Fix constant used in azure unit tests 2022-01-11 16:05:16 +00:00
Kubernetes Prow Robot b3576e0cdc
Merge pull request #4507 from ByteAlex/hetzner-node-name
Shorten Hetzners node names with hex repr
2022-01-09 19:09:12 -08:00
Chris Privitere a220224889 use gcr hosted cluster-autoscaler image
Signed-off-by: Chris Privitere <cprivite@users.noreply.github.com>
2022-01-06 20:59:59 +00:00
Chris Privitere c4e1aa247e Add note to readme about the rename of Packet.
Signed-off-by: Chris Privitere <cprivite@users.noreply.github.com>
2022-01-05 20:26:16 +00:00
Chris Privitere 8f8d071b9e Update example facility and machine plans to current versions.
Signed-off-by: Chris Privitere <cprivite@users.noreply.github.com>
2022-01-05 18:20:31 +00:00
Chris Privitere 0396f5c3c9 Rename packet to Equinix Metal 2022-01-05 17:45:48 +00:00
Naadir Jeewa ee761bdc24
Cluster API OWNERS: Remove randomvariable
Signed-off-by: Naadir Jeewa <jeewan@vmware.com>
2022-01-05 15:11:21 +00:00
Daniel Gutowski 8064d6d1fd Introduce the scale down processor that picks the final scale down candidates. 2022-01-03 16:05:36 +00:00
Jayant Jain 729038ff2d Adding support for Debugging Snapshot 2021-12-30 09:08:05 +00:00
Qi Ni dc64e41104 chore: remove a time-consuming unit test in provider azure 2021-12-27 10:37:52 +08:00
Kubernetes Prow Robot 6d19e3ddb9
Merge pull request #4441 from marwanad/fix-pod-equivalence-perf
fix pod equivalency checks for pods with projected volumes
2021-12-24 04:12:15 -08:00
Kubernetes Prow Robot fca1dc0513
Merge pull request #4550 from marwanad/csi-topology-label-ignore-scale-from-zero
ignore azure csi topology label for similarity checks and populate it for scale from zero
2021-12-23 03:30:37 -08:00
Marwan Ahmed fd089c2d15 avoid double wrapping scale up error 2021-12-22 15:47:05 +02:00
Kubernetes Prow Robot 7b19d33de7
Merge pull request #4345 from sergelogvinov/create-timeout
Increase server create timeout
2021-12-22 03:23:35 -08:00