Enhancing the logs:
- [cluster-name (cluster-id)] > [cluster-name (id:cluster-id)]
- Use structs instead of pointers to print data
- Remove upstream data from info log
- Add debug info
- Update the duplicate nodepool name error message
Signed-off-by: Parthvi <parthvi.vala@suse.com>
Reverts 3244dfa5a and instead catches the fingerprint mismatch error and
downgrades it to a debug log. The reasoning behind that commit still
applies - the upstream GKE cluster is slightly delayed in processing the
label updates - but the controller will naturally retry the update, so
we don't need to block on retrying ourselves. This way the equality
check will also be done again and so the cluster won't be updated twice.
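For illustration, a minimal sketch of that downgrade, assuming the
mismatch is detected by error text; the helper name and the exact GKE
error wording are assumptions rather than the operator's actual code:

```go
// Hypothetical helper; not the operator's actual code.
package sketch

import (
    "strings"

    "github.com/sirupsen/logrus"
)

// downgradeFingerprintError turns a label-fingerprint mismatch from the GKE
// API into a debug log and a nil error, so the controller's normal requeue
// retries the update once upstream has processed the previous change.
func downgradeFingerprintError(clusterName string, err error) error {
    if err == nil {
        return nil
    }
    // The exact wording GKE uses for a stale fingerprint is assumed here.
    if strings.Contains(err.Error(), "fingerprint") {
        logrus.Debugf("fingerprint mismatch updating labels for cluster [%s], will retry on next reconcile: %v", clusterName, err)
        return nil
    }
    return err
}
```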
53dbce90 was an attempt to address the occurrence of resource conflict
errors during an UpdateStatus call to set the resource phase to
"updating". It helped reduce the frequency of this happening but it did
not fully address the issue because it was not the only place where the
resource state is updated.
The resource conflict error always occurs after an error has occurred
during the reconcile loop. Multiple status updates happen in quick
succession during a single loop, and upon entering the next loop, the
shared informer cache is not fully synced and the controller worker
fetches an out of date version of the resource object.
Without this change, when this happens, the recordError handler function
tries to keep persisting the error state to the resource object with
UpdateStatus, but since a failed UpdateStatus returns an empty struct,
there is no object it can actually update at that point, and another
error is logged.
If this happens, just return the error right away instead of trying to
update an empty object.
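A rough sketch of that early return, with a pared-down stand-in for the
GKEClusterConfig type so it compiles on its own; the wiring around the
handler is an assumption:

```go
// Sketch only; GKEClusterConfig stands in for the operator's CRD type.
package sketch

type GKEClusterConfig struct {
    Status GKEClusterConfigStatus
}

type GKEClusterConfigStatus struct {
    FailureMessage string
}

// recordError persists a handler error to the resource status. If the status
// update fails (typically a conflict caused by a stale informer cache), the
// object returned by UpdateStatus is empty, so return the error immediately
// instead of attempting any further updates; the next reconcile starts from
// a fresh copy of the resource.
func recordError(config *GKEClusterConfig, handlerErr error,
    updateStatus func(*GKEClusterConfig) (*GKEClusterConfig, error)) (*GKEClusterConfig, error) {
    if handlerErr == nil {
        return config, nil
    }
    config.Status.FailureMessage = handlerErr.Error()
    updated, err := updateStatus(config)
    if err != nil {
        // Don't try to update the empty object returned by a failed
        // UpdateStatus; just surface the error right away.
        return config, err
    }
    return updated, handlerErr
}
```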
The max pods setting is only available for VPC-native (alias IP)
clusters, so requiring it to be set when UseIPAliases is false causes a
validation error from GKE. This change loosens the requirement that
MaxPodsConstraint be non-nil on create and fixes the upstreamSpec
builder to tolerate it not being set on the upstream cluster.
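A short sketch of the tolerant upstreamSpec reading, using the real
container/v1 node pool types; the helper name is an assumption:

```go
// Sketch only; the helper name is an assumption, the types come from the
// real container/v1 SDK.
package sketch

import (
    container "google.golang.org/api/container/v1"
)

// maxPodsFromUpstream reads the max-pods setting off an upstream node pool,
// returning nil when the cluster is not VPC-native and GKE omits the
// constraint entirely.
func maxPodsFromUpstream(np *container.NodePool) *int64 {
    if np == nil || np.MaxPodsConstraint == nil {
        return nil
    }
    maxPods := np.MaxPodsConstraint.MaxPodsPerNode
    return &maxPods
}
```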
If labels are not set on the gkeapi cluster object, they will appear as
`nil`, but they should be converted to an empty map so that they are
comparable to the applied cluster state.
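A minimal sketch of that normalization, assuming the labels arrive as a
possibly-nil map[string]string:

```go
// Sketch of the normalization; not the operator's actual code.
package sketch

// normalizeLabels converts nil labels from the upstream cluster into an empty
// map so that a deep-equality comparison against the applied cluster state
// (which always carries a map) does not report a spurious difference.
func normalizeLabels(labels map[string]string) map[string]string {
    if labels == nil {
        return map[string]string{}
    }
    return labels
}
```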
It is very important for the status update to succeed here, otherwise
the update loop will be entered again unnecessarily. If this happens
very quickly, there may be an inconsistency between the upstreamSpec
state, the config.Spec state, and the actual upstream cluster state. It
is best to ensure that this status update does not compound on other
problems.
Previously, if a user tried to create two nodepools with the same name,
an update loop would commence where gke-operator would continually try
to upgrade a nodepool in GKE that had two configs in Rancher.
After this change, the collision will be detected and no update will
happen until the collision is fixed.
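A minimal sketch of the collision check; the NodePool stand-in and the
error wording are assumptions:

```go
// Sketch only; NodePool stands in for the operator's node pool config type.
package sketch

import "fmt"

type NodePool struct {
    Name string
}

// checkDuplicateNodePoolNames rejects a spec containing two node pools with
// the same name, so the operator never reconciles one upstream node pool
// against two conflicting configs.
func checkDuplicateNodePoolNames(nodePools []NodePool) error {
    seen := make(map[string]struct{}, len(nodePools))
    for _, np := range nodePools {
        if _, ok := seen[np.Name]; ok {
            return fmt.Errorf("node pool name [%s] is used more than once", np.Name)
        }
        seen[np.Name] = struct{}{}
    }
    return nil
}
```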
The validateUpdate method was adopted from eks-provider and mimicked its
kubernetes version validation, which ensures the provided version is
valid Semver and that the control plane and node versions are within a
constrained range of each other. This was causing a problem with GKE
versions as it was not always parsing the Semver components correctly
and would result in a confusing and incorrect error message like:
versions for cluster [1.18.16-gke.300] and nodegroup [1.18.15-gke.1501] not compatible: all nodegroup kubernetes versionsmust be equal to or one minor version lower than the cluster kubernetes version
Rather than fix the version parsing, this change simply removes the
validation. GKE does not place such strict constraints on the delta
between the control plane and node pool versions; you can even create a
cluster that is up to seven minor versions apart. The UI queries GKE for
valid versions to offer, so as long as it continues to do so there is no
danger of requesting an invalid version. This is also an awkward place
to do this validation, since it only validates particular attributes and
not the full update request, so if validation is needed it should be
done elsewhere.
The operator logs an info message when cluster creation is started, but
gives no indication when a cluster is being imported. Add a log for when
a cluster import is starting.
When clusters are imported, there may be a race condition where the
GKEClusterConfig Status does not get saved successfully on the first try
and the controller retries on its own. Without this patch, that retry
would reattempt to create the CA secret and fail because the secret
already exists. This change ensures the same CA secret is never created
twice, which also ensures that a legitimate failure to update the
cluster config's Status isn't clobbered by this unrelated error.
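A minimal sketch of tolerating the existing secret on a retried import;
the helper and its callbacks are hypothetical, but IsAlreadyExists is
the real apimachinery check:

```go
// Sketch only; the helper and create callback are hypothetical.
package sketch

import (
    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// ensureCASecret creates the CA secret for an imported cluster, treating
// "already exists" as success so that a retried status update does not
// surface an unrelated error.
func ensureCASecret(secret *corev1.Secret, create func(*corev1.Secret) (*corev1.Secret, error)) error {
    if _, err := create(secret); err != nil && !apierrors.IsAlreadyExists(err) {
        return err
    }
    return nil
}
```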
PrivateClusterConfig.PublicEndpoint and
PrivateClusterConfig.PrivateEndpoint are read-only informational
parameters in the GKE SDK[1]. They should never have been exposed as
configurable options here. Configuration of the public and private
endpoints is done via the EnablePrivateEndpoint toggle.
[1] https://pkg.go.dev/google.golang.org/api/container/v1#PrivateClusterConfig
GKE introduced a new mode called Autopilot[1] in which node pools no
longer need to be managed by the user. This change allows users to opt
into this mode. There are caveats regarding the flexibility of this
mode[2] so it is not necessarily suitable for all users and standard
mode should still be supported.
The cluster must have clusterAddons.horizontalPodAutoscaling and
clusterAddons.httpLoadBalancing enabled. The cluster must be created for
a region, not a zone, so zone must be unset. Any configured node pools
will be ignored, and setting node pools to an empty list should be
allowed in Rancher. The autopilot setting is immutable.
[1] https://cloud.google.com/blog/products/containers-kubernetes/introducing-gke-autopilot
[2] https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview
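For illustration, a rough sketch of those prerequisites as a validation
step, against a pared-down stand-in spec whose field names are
illustrative rather than the CRD's actual schema:

```go
// Sketch only; field names are illustrative, not the CRD's actual schema.
package sketch

import "fmt"

type ClusterAddons struct {
    HorizontalPodAutoscaling bool
    HTTPLoadBalancing        bool
}

type Spec struct {
    AutopilotEnabled bool
    Region           string
    Zone             string
    ClusterAddons    ClusterAddons
    NodePools        []string
}

// validateAutopilot enforces the prerequisites above: regional (not zonal)
// placement and the required addons. Node pools are managed by GKE in
// Autopilot mode, so any configured pools are simply ignored and an empty
// list is valid.
func validateAutopilot(spec Spec) error {
    if !spec.AutopilotEnabled {
        return nil
    }
    if spec.Zone != "" || spec.Region == "" {
        return fmt.Errorf("autopilot clusters must be regional: set region and leave zone unset")
    }
    if !spec.ClusterAddons.HorizontalPodAutoscaling || !spec.ClusterAddons.HTTPLoadBalancing {
        return fmt.Errorf("autopilot clusters require the horizontalPodAutoscaling and httpLoadBalancing addons")
    }
    return nil
}
```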
Rancher's machine driver creates a dynamic schema with name "nodeconfig"
that collides with the static schema generated for the GKE operator's
NodeConfig struct. The result is that repeated calls to
/v3/schemas/nodeConfig may alternately return either the NodeConfig
schema from this operator or a different NodeConfig schema containing
the googleConfig schema for GCP cloud credentials.
This change renames the NodeConfig struct to avoid this collision, and
also renames every other struct defined for the operator in order to
prevent potential collisions in the future.
The function that was supposed to set the FailureMessage status
attribute was not doing that. This change ensures the error from the
onChange handler is actually used and the cluster status is updated.
It is valid to create a GKE cluster with the kubernetes version, node
pool version, or node pool image type set to "". GKE will set defaults
on the backend. However, this causes problems during the update cycle:
the validateAndUpdate function on the controller cannot check whether ""
is semver compliant, and the Update* functions cannot set "" as a
desired value to update to. This change adds allowances for certain
string parameters to be empty. It also restructures
UpdateMasterKubernetesVersion to be more Go-idiomatic by returning early
where possible.
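A condensed sketch of that early-return shape, using the real
container/v1 ClusterUpdate type; the update callback and surrounding
wiring are assumptions:

```go
// Sketch only; the update callback is an assumption.
package sketch

import (
    container "google.golang.org/api/container/v1"
)

// updateMasterKubernetesVersion returns early when no version is requested
// ("" means "let GKE choose a default") or when the upstream cluster already
// runs the desired version, and only otherwise issues an update.
func updateMasterKubernetesVersion(desired, upstream string,
    update func(*container.ClusterUpdate) error) error {
    if desired == "" || desired == upstream {
        return nil
    }
    return update(&container.ClusterUpdate{
        DesiredMasterVersion: desired,
    })
}
```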
Add a field MaintenanceWindow which represents the start time
at which automatic maintenance is allowed to be performed. The duration
of the window is not settable. Setting MaintenanceWindow to "" means
maintenance can occur at any time, which is the default.
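A brief sketch of how the field could map onto the container/v1
maintenance policy types; the translation helper itself is an
assumption:

```go
// Sketch only; the helper is an assumption, the policy types are from the
// real container/v1 SDK.
package sketch

import (
    container "google.golang.org/api/container/v1"
)

// maintenancePolicyFrom maps the spec's MaintenanceWindow start time (for
// example "03:00") to a daily maintenance window; an empty string yields an
// empty policy, leaving GKE free to schedule maintenance at any time.
func maintenancePolicyFrom(startTime string) *container.MaintenancePolicy {
    if startTime == "" {
        return &container.MaintenancePolicy{}
    }
    return &container.MaintenancePolicy{
        Window: &container.MaintenanceWindow{
            DailyMaintenanceWindow: &container.DailyMaintenanceWindow{
                StartTime: startTime,
            },
        },
    }
}
```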
For parity with the KEv1 implementation, add the ability to set and
update the Locations field for a cluster. This represents additional
zones (within a single region) in which node pools can be deployed for
greater availability.
Add a struct called Management as an attribute for NodePool which
supports toggling auto-repair and auto-upgrade for a node pool. Also
update the examples.
GKE defaults these to true on the backend, so before this patch, because
the Management pointer was never set on a node pool, node pools were
created with auto-repair and auto-upgrade enabled. Now that it is set
explicitly, both default to disabled.
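A compact sketch of that mapping, with an illustrative stand-in for the
Management attribute and the real container/v1 NodeManagement type:

```go
// Sketch only; NodePoolManagement stands in for the operator's Management
// attribute.
package sketch

import (
    container "google.golang.org/api/container/v1"
)

type NodePoolManagement struct {
    AutoRepair  bool
    AutoUpgrade bool
}

// nodeManagementFrom always returns an explicit NodeManagement, so GKE's
// backend default of enabling auto-repair and auto-upgrade no longer applies
// silently; with Management unset, both now default to disabled.
func nodeManagementFrom(mgmt *NodePoolManagement) *container.NodeManagement {
    if mgmt == nil {
        return &container.NodeManagement{}
    }
    return &container.NodeManagement{
        AutoRepair:  mgmt.AutoRepair,
        AutoUpgrade: mgmt.AutoUpgrade,
    }
}
```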
In GKE, a cluster is created either in a Region or a Zone. The CRD
supports setting either, and the controller validates that exactly one
of them is set. But without this patch, the operator only respected the
Region setting and ignored the Zone. The examples also incorrectly used
a Zone identifier in the Region setting. This patch ensures the Zone is
used to identify the cluster when it is set and fixes the examples to
make sense.
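A minimal sketch of resolving the location from either setting; the
helper name is an assumption:

```go
// Sketch only; the helper name is an assumption.
package sketch

import "fmt"

// clusterLocation builds the location portion of GKE API calls from whichever
// of region or zone the spec has set; the CRD validation guarantees exactly
// one of them is non-empty.
func clusterLocation(project, region, zone string) string {
    location := region
    if zone != "" {
        location = zone
    }
    return fmt.Sprintf("projects/%s/locations/%s", project, location)
}
```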
The CredentialContent attribute does not actually hold the contents of
the credential, but a reference to a Cloud Credential Secret which does
contain the credential. Rename it to make its purpose clearer and to be
consistent with the credential field in eks- and aks-operator.
There is no displayName attribute in GKEClusterConfigSpec. Also, the
function creates a Secret as a side effect, which is worth mentioning in
the comment.
Rename the alpha-feature toggle to be consistent with the Go SDK and
improve the readability of the translation between CRD attributes and
GKE API attributes.
The GKE handler in Rancher needs to be able to build an initial admin
kubeconfig in order to generate the first service account in the
cluster. Since gke-operator already knows how to get the credential
Secret and convert it to an OAuth2 token, it is convenient to expose
this as a public function that Rancher can use to authenticate to the
cluster.
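A condensed sketch of such a helper; the exported name is an assumption,
while CredentialsFromJSON and the cloud-platform scope come from the
real golang.org/x/oauth2/google package:

```go
// Sketch only; the exported name is an assumption.
package sketch

import (
    "context"

    "golang.org/x/oauth2"
    "golang.org/x/oauth2/google"
)

// GetTokenSource turns the service account JSON stored in the cloud
// credential Secret into an OAuth2 token source that Rancher can use to
// build the initial admin kubeconfig for the cluster.
func GetTokenSource(ctx context.Context, credentialJSON []byte) (oauth2.TokenSource, error) {
    creds, err := google.CredentialsFromJSON(ctx, credentialJSON,
        "https://www.googleapis.com/auth/cloud-platform")
    if err != nil {
        return nil, err
    }
    return creds.TokenSource, nil
}
```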