When the TB controller attempts to schedule an RWO PVC, it checks the
accessModes list in the PVC status. The controller panics if the list
is empty.
This commit adds a check to ensure the list is not empty.
Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
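A minimal sketch of the guard described above, written in Python over a
plain dict rather than the controller's own types. The function name
is_rwo_pvc and the choice to inspect only the first access mode are
assumptions made for illustration, not the code in this commit.

```python
def is_rwo_pvc(pvc: dict) -> bool:
    """Return True when the PVC status lists ReadWriteOnce first.

    Hypothetical helper: the real controller works on typed objects.
    """
    access_modes = (pvc.get("status") or {}).get("accessModes") or []
    if not access_modes:
        # Guard: with an empty list, access_modes[0] would raise,
        # so report "not RWO" instead of crashing.
        return False
    return access_modes[0] == "ReadWriteOnce"
```

With this guard, a PVC whose status has no accessModes is simply treated
as not RWO, so the scheduling logic is skipped for it.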
* Add Tensorboard controller permissions for managing resources
The pod running the Tensorboard controller didn't have permissions
to manage the deployments, services, and VirtualServices needed
so that the Tensorboard servers would function properly.
In order for the deployed Tensorboard controller to run properly,
permissions to 'get', 'list', 'watch', 'create' and 'update'
are given to the Tensorboard controller pod so that the necessary
deployments, services and VirtualServices are created and managed
as expected. Also, permissions to 'get', 'list', 'watch' PVCs and
pods were added.
* Add namespace of Tensorboard CR to VirtualService prefix
In order to avoid creating two VirtualServices that have the same
prefix in different namespaces, the namespace of the corresponding
Tensorboard CR was added to the prefix of the generated
VirtualService.
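To illustrate why the namespace removes the collision, here is a small
sketch. The "/tensorboard/<namespace>/<name>/" layout and the function
name are assumptions; the commit only states that the namespace is part
of the prefix.

```python
def virtual_service_prefix(namespace: str, name: str) -> str:
    """Build a URL prefix that is unique per (namespace, name) pair.

    The exact path layout is assumed for illustration.
    """
    return f"/tensorboard/{namespace}/{name}/"
```

Two Tensorboard CRs that share a name but live in different namespaces
now map to different prefixes.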
* Fix directory bug in Makefile
* Add README.md
* Extend Tensorboard CRD with status.readyReplicas field
The Tensorboard CRD didn't contain any information about the
Tensorboard server being ready or not. So, the status of the
Tensorboard resource is extended so that it contains a
readyReplicas field, similar to the status.readyReplicas of
the deployment of the Tensorboard server.
* Extend Tensorboard controller to update status of Tensorboard CR
The frontend of the Tensorboard web-app will need information
about whether the Tensorboard servers are ready to connect or not.
As a result, the Tensorboard controller now copies the value of the
status.readyReplicas field of the Tensorboard deployment to the
status.readyReplicas of the Tensorboard CR.
Also, a Deployment() function was added for applying and updating
Tensorboard server deployments.
* Update tensorboard.status.phase of TWA backend response
The frontend of the TWA will need information about the status
of the Tensorboard server, so that it can inform the user about
the server being ready to connect or not.
As a result, the backend sets the status.phase field of the response
to "ready", if tensorboard.status.readyReplicas == 1. Otherwise, the
status.phase field of the response is set to "unavailable".
Also, the getPVCName() function was added, which extracts the name
of a given PVC object.
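The phase rule above can be sketched as follows. The function name is
hypothetical, and treating a missing readyReplicas field as 0 is an
assumption.

```python
def tensorboard_phase(tensorboard: dict) -> str:
    """Map status.readyReplicas of a Tensorboard object to a phase string."""
    status = tensorboard.get("status") or {}
    # Assumption: an absent readyReplicas field counts as zero replicas.
    ready_replicas = status.get("readyReplicas", 0)
    return "ready" if ready_replicas == 1 else "unavailable"
```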
* Add GET route for PVCs
The Tensorboard web-app frontend will use an autocomplete dropdown
to show the user the PVCs that live in a specific namespace.
These PVCs can be used as log storage for the Tensorboard server.
So, a PVC GET route was added to the Tensorboard web-app backend.
* Add message to Tensorboard response object in TWA backend
The frontend of the TWA will need to output a response message for
every Tensorboard object. This response message will inform the
user about the current state of the Tensorboard server.
* Use status.STATUS_PHASE for backend response
* Add requirements.txt to TWA backend
* Use status.create_status() for backend response
* Add indexers as custom field selectors for list requests to cache
The Tensorboard controller must be able to list the pods that have
mounted a PVC with a specific ClaimName.
For this list request to the cache to work properly, custom field
selectors are added. These selectors index the
"pod.spec.volumes.persistentvolumeclaim.claimname" field so that
pods that do not mount the PVC are filtered out.
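The idea behind the index can be sketched in Python as a lookup table
keyed by namespace and claim name. This is an illustration of the
concept on plain dicts, not the controller's cache API; all function
names here are hypothetical.

```python
from collections import defaultdict


def claim_names(pod: dict) -> list:
    """Extract every PVC claimName referenced by the pod's volumes."""
    names = []
    for volume in (pod.get("spec") or {}).get("volumes") or []:
        claim = volume.get("persistentVolumeClaim") or {}
        if claim.get("claimName"):
            names.append(claim["claimName"])
    return names


def build_claim_index(pods: list) -> dict:
    """Index pod names by (namespace, claimName)."""
    index = defaultdict(list)
    for pod in pods:
        meta = pod["metadata"]
        for name in claim_names(pod):
            index[(meta["namespace"], name)].append(meta["name"])
    return index
```

Listing the pods that mount a given claim then becomes a single lookup,
for example index[("team-a", "logs")], instead of a scan over all pods.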
* Set pod's nodeAffinity if log files exist in a PVC
In the case of using a PVC as a logdir for Tensorboard Server, if
the PVC had a ReadWriteOnce access mode and was already mounted by
another running pod X, then the Tensorboard Server pod would not
always be scheduled on the same node as X. As a result, the
Tensorboard Server pod would be blocked since multi-node access
is prohibited on ReadWriteOnce volumes.
In order for the Tensorboard Server pod to run successfully,
nodeAffinity was added to the spec.template.spec.affinity field
of the returned deployment.
As a result, both X and the Tensorboard Server pod are now scheduled
on the same node.
Resolves kubernetes/kubernetes#26567
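A sketch of the affinity stanza that such a deployment could carry,
built as a plain dict. Using a preferred (soft) rule with weight 100,
matching on the kubernetes.io/hostname label, and assuming that label
equals pod X's spec.nodeName are all assumptions for illustration; the
commit does not spell out these details.

```python
def node_affinity_for(pod_x: dict) -> dict:
    """Build spec.template.spec.affinity steering a pod to pod X's node."""
    node_name = pod_x["spec"]["nodeName"]
    return {
        "nodeAffinity": {
            # Assumed: a soft preference rather than a hard requirement.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": 100,
                    "preference": {
                        "matchExpressions": [
                            {
                                # Assumed: hostname label matches nodeName.
                                "key": "kubernetes.io/hostname",
                                "operator": "In",
                                "values": [node_name],
                            }
                        ]
                    },
                }
            ]
        }
    }
```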
* Set Tensorboard Server scheduling feature to 'off' by default
In the case that the Tensorboard Server used an RWO PVC (as log
storage) that was already mounted by another pod, nodeAffinity
was used so that the Tensorboard Server would be scheduled
(if possible) on the same node as that pod.
Now, this added functionality is used only if the
'RWO_PVC_SCHEDULING' environment variable is set to "true"
when running the Tensorboard controller.
This scheduling functionality is disabled by default.
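The opt-in check can be sketched as below. The function name and the
optional environ parameter (added so the check can be exercised without
touching the process environment) are hypothetical; only the variable
name and the "true" value come from this commit.

```python
import os


def rwo_pvc_scheduling_enabled(environ=None) -> bool:
    """Return True only when RWO_PVC_SCHEDULING is exactly "true"."""
    env = os.environ if environ is None else environ
    # Unset or any other value keeps the feature off (the default).
    return env.get("RWO_PVC_SCHEDULING", "") == "true"
```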
* Remove duplicate package import
Package "k8s.io/api/core/v1" was imported twice with names "v1"
and "corev1".
* Mount GCP secret only when accessing Google storage
The Tensorboard controller used to create pods (running the Tensorboard
server) that always mounted the user-gcp-sa secret, regardless of
whether the log storage was a Google Cloud bucket. If the user-gcp-sa
secret didn't exist on the cluster, pods that used other cloud services
(or PVCs) as log storage never started properly.
In order for the Tensorboard server pods to run properly, the
user-gcp-sa secret is now mounted only when a Google Cloud bucket is
used as log storage.
Fixes kubeflow/kubeflow#5065
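A sketch of the conditional mount, assuming a Google Cloud bucket is
recognised by a "gs://" prefix on the logs path. The function names and
the volume name "gcp-creds" are hypothetical; only the secret name
user-gcp-sa comes from this commit.

```python
def is_google_cloud_path(logs_path: str) -> bool:
    """Assumed detection rule: Google Cloud buckets use the gs:// scheme."""
    return logs_path.startswith("gs://")


def secret_volumes(logs_path: str) -> list:
    """Return the secret volumes the Tensorboard server pod should mount."""
    volumes = []
    if is_google_cloud_path(logs_path):
        # Mount the GCP service-account secret only for Google storage.
        volumes.append(
            {"name": "gcp-creds", "secret": {"secretName": "user-gcp-sa"}}
        )
    return volumes
```

For any other logs path the list stays empty, so the pod no longer
depends on the user-gcp-sa secret existing in the cluster.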