This allows more consistent error handling with other GCE API calls.
Also removed error caching for the MachineType API in cache.go, since
it was never used anyway and it's inconsistent with error handling
for other APIs.
PR #4973 changed ToSystemArchitecture behavior to return DefaultArch
instead of UnknownArch for invalid architectures. This kind of defaulting
makes sense while parsing KUBE_ENV, but prevents using the function
in contexts where an invalid architecture should result in an error.
This commit reverts ToSystemArchitecture to previous behavior, and
moves defaulting to the callsite.
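
A minimal sketch of the split described above, assuming a string-based
architecture type. Only ToSystemArchitecture, DefaultArch and UnknownArch
are names from this commit; the constant values and the helpers
archFromKubeEnv and validateArch are hypothetical and only illustrate
where the defaulting now lives.

```go
package main

import "fmt"

// SystemArchitecture is an illustrative stand-in for the real type;
// the concrete values below are assumptions.
type SystemArchitecture string

const (
	UnknownArch SystemArchitecture = ""
	Amd64       SystemArchitecture = "amd64"
	Arm64       SystemArchitecture = "arm64"
	DefaultArch                    = Amd64
)

// ToSystemArchitecture returns UnknownArch for anything it does not
// recognize, leaving the decision about defaulting to the caller.
func ToSystemArchitecture(arch string) SystemArchitecture {
	switch SystemArchitecture(arch) {
	case Amd64:
		return Amd64
	case Arm64:
		return Arm64
	default:
		return UnknownArch
	}
}

// archFromKubeEnv (hypothetical) is a lenient callsite: while parsing
// KUBE_ENV, an invalid value falls back to DefaultArch.
func archFromKubeEnv(raw string) SystemArchitecture {
	if arch := ToSystemArchitecture(raw); arch != UnknownArch {
		return arch
	}
	return DefaultArch
}

// validateArch (hypothetical) is a strict callsite: an invalid value
// is reported as an error instead of being silently defaulted.
func validateArch(raw string) (SystemArchitecture, error) {
	arch := ToSystemArchitecture(raw)
	if arch == UnknownArch {
		return UnknownArch, fmt.Errorf("invalid architecture %q", raw)
	}
	return arch, nil
}

func main() {
	fmt.Println(archFromKubeEnv("mips"))
	_, err := validateArch("mips")
	fmt.Println(err)
}
```

With the defaulting at the callsite, both the lenient and the strict
caller can share the same conversion function.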
Previously, we just assumed that a pod would always fit on a newly added
node during binpacking, because we had already checked that the pod fits
on an empty template node earlier in the scale-up logic.
This assumption is incorrect, as it doesn't take into account the
potential impact of other scheduling decisions made during binpacking.
For pods using zonal Filters (such as PodTopologySpreading with a zonal
topology key), the pod may no longer be able to schedule even on an empty
node as a result of earlier decisions we've made in binpacking.
The binpacking algorithm is O(#pending_pods * #new_nodes) and
calculating a very large scale-up can get stuck for minutes or even
hours, leading to CA failing its health check and going down.
The new limiting prevents this scenario by stopping binpacking after
reaching a specified threshold. Any pods that remain pending as a result
of shorter binpacking will be processed in the next autoscaler loop.
The thresholds used can be controlled with newly introduced flags:
--max-nodes-per-scaleup and --max-nodegroup-binpacking-duration. The
limiting can be disabled by setting both flags to 0 (not recommended,
especially for --max-nodegroup-binpacking-duration).
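
The limiting can be sketched roughly as follows. This is a toy model, not
the actual estimator: the type binpackingLimiter, its fields and the
estimate function are hypothetical, and each node is assumed to hold a
fixed number of pods (podsPerNode must be positive). The two limits stand
in for --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration,
with 0 meaning "disabled" as described above.

```go
package main

import (
	"fmt"
	"time"
)

// binpackingLimiter (hypothetical) tracks the two thresholds.
type binpackingLimiter struct {
	maxNodes    int           // 0 disables the node-count limit
	maxDuration time.Duration // 0 disables the time limit
	start       time.Time
}

func newBinpackingLimiter(maxNodes int, maxDuration time.Duration) *binpackingLimiter {
	return &binpackingLimiter{maxNodes: maxNodes, maxDuration: maxDuration, start: time.Now()}
}

// shouldStop reports whether either enabled threshold has been reached.
func (l *binpackingLimiter) shouldStop(newNodes int) bool {
	if l.maxNodes > 0 && newNodes >= l.maxNodes {
		return true
	}
	if l.maxDuration > 0 && time.Since(l.start) >= l.maxDuration {
		return true
	}
	return false
}

// estimate adds nodes until all pods are placed or the limiter says
// stop. It returns the number of new nodes and the pods left pending,
// which would be handled in the next loop.
func estimate(pendingPods, podsPerNode int, l *binpackingLimiter) (nodes, remaining int) {
	remaining = pendingPods
	for remaining > 0 {
		if l.shouldStop(nodes) {
			break
		}
		nodes++
		if remaining < podsPerNode {
			remaining = 0
		} else {
			remaining -= podsPerNode
		}
	}
	return nodes, remaining
}

func main() {
	nodes, remaining := estimate(10, 2, newBinpackingLimiter(3, time.Minute))
	fmt.Println(nodes, remaining)
}
```

Checking the limits before adding each node keeps the result usable even
when binpacking is cut short: the nodes computed so far are still a valid
partial scale-up.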
Part of the test verifies that all taint updates happened as expected.
The taints are verified asynchronously, and the test waits for exactly
as many taint updates as defined in the test case. A couple of test
cases were missing some expected updates (clearing the taint if
drain/deletion fails). The test could randomly fail if one of the
missing updates happened to appear before one of the expected updates.
This commit adds the missing expected updates, so all of them should be
accounted for now.
This commit also adds a sync point to wait for all expected node
deletion results before asserting them. Without it, the test would
sometimes move to the assertion before the results were actually
reported.
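
The idea of such a sync point can be sketched as below. This is not the
test code itself: collectResults and failingDelete are hypothetical
helpers, and a sync.WaitGroup is only one way to wait for all results
before asserting.

```go
package main

import (
	"fmt"
	"sync"
)

// collectResults (hypothetical) runs deleteNode for every node in its
// own goroutine and returns only after every result has been reported.
func collectResults(nodes []string, deleteNode func(string) error) map[string]error {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results = make(map[string]error, len(nodes))
	)
	for _, n := range nodes {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			err := deleteNode(name)
			mu.Lock()
			results[name] = err
			mu.Unlock()
		}(n)
	}
	// Sync point: without this wait, a caller could inspect results
	// before some goroutines have written theirs.
	wg.Wait()
	return results
}

// failingDelete (hypothetical) returns a delete function that fails
// for exactly one node name.
func failingDelete(bad string) func(string) error {
	return func(name string) error {
		if name == bad {
			return fmt.Errorf("failed to delete %s", name)
		}
		return nil
	}
}

func main() {
	results := collectResults([]string{"n1", "n2", "n3"}, failingDelete("n2"))
	fmt.Println(len(results), results["n2"])
}
```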
This change brings in a new command line flag,
`--record-duplicated-events`, which allows a user to record duplicated
events, bypassing the 5-minute de-duplication window.