A few of the unit test structures did not have field name keys when
using literal structs. This change adds the fields to make this code a
little more future-proof.
Use shell syntax instead of pipe so error messages are printed out. Before
this change, if it could not download a module, it would just exit without
printing an error message due to the pipe. With this change it now prints
out the error message(unable to download) from the underlying process.
Both upscale's `getUpcomingNodeInfos` and the binpacking estimator now uses
the same shared DeepCopyTemplateNode function and inherits its naming
pattern, which is great as that fixes a long standing bug.
Due to that, `getUpcomingNodeInfos` will enrich the cluster snapshots with
generated nodeinfos and nodes having predictable names (using template name
+ an incremental ordinal starting at 0) for upcoming nodes.
Later, when it looks for fitting nodes for unschedulable pods (when upcoming
nodes don't satisfy those (FitsAnyNodeMatching failing due to nodes capacity,
or pods antiaffinity, ...), the binpacking estimator will also build virtual
nodes and place them in a snapshot fork to evaluate scheduler predicates.
Those temporary virtual nodes are built using the same pattern (template name
and an index ordinal also starting at 0) as the one previously used by
`getUpcomingNodeInfos`, which means it will generate the same nodeinfos/nodes
names for nodegroups having upcoming nodes.
But adding nodes by the same name in an existing cluster snapshot isn't
allowed, and the evaluation attempt will fail.
Practically this blocks re-upscales for nodegroups having upcoming nodes,
which can cause a significant delay.
FetchAllMigs (unfiltered InstanceGroupManagers.List) is costly as it's not
bounded to MIGs attached to the current cluster, but rather lists all MIGs
in the project/zone, and therefore equally affects all clusters in that
project/zone. Running the calls concurrently over the region's zones (so at
most, 4 concurrent API calls, about once per minute) contains that impact.
findMigsInRegion might be scoped to the current cluster (name pattern),
but also benefits from the same improvement, as it's also costly and
called at each refreshInterval (1mn).
Also: we're calling out GCE mig.Get() API again for each MIG (at ~300ms per
API call, in my tests), sequentially and with the global cache lock held
(when updateClusterState -> ...-> GetMigForInstance kicks in). Yet we
already get that bit of information (MIG's basename) from any other
mig.Get or mig.List call, like the one fetching target sizes. Leveraging
this helps significantly on large fleets (for instance this shaves 8mn
startup time on the large cluster I tested on).
The current implementation assumes MIG ids have the
"https://content.googleapis.com" prefix, while the
canonical id format seems to begin with "https://www.googleapis.com".
Both formats work while talking to the GCE API, but
the API returns the latter and other GCP services seem to
assume it as well.