Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
review changes
* build images with the latest tag only when a PR
is merged to master branch
* revert changes in manifests/workflows for the
notebook-server images
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Introduce intergration test workflow for tensorboard-controller
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Publish Docker image only when PR is merged
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Remove kind & manifest gh-action workflows
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Update tag in manifests to v1.6.0
This change is required as images with v1.5.0 do not
exist in Dockerhub.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* kind: Introduce config file for 1.25
* Add a new KinD configuration file for testing with K8s 1.25.3
* Install kind v0.17.0 for testing with K8s 1.25.3
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* gh-actions: Use 1.25 for testing
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* testing: Install Istio 1.16 for testing
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Test commit for enabling the tests
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* notebook-controller: Fix Makefile
Remove the test rule as a prerequisite for running docker-build
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* updated base images in Volume Web component for multiple arch
* updated base images in Tensorboard Web component for multiple arch
* updated base images in Jupyter Web component for multiple arch
* updated admission webhook component for multiple arch
* removed goarch depedency for multiarch building
* removed goarch depedency for multiarch building in admission webhook component
* removed goarch depedency & added powerPC case for multiarch building in access-management component
* removed goarch depedency for multiarch building in tensorboard controller
* removed goarch depedency for multiarch building in notebook Controller
* Removing empty computation to resolve future build issues
* build: Update components makefiles for building
We'll create a top-level Makefile under components/ dir
that has the following rules:
* build-all:
To build all images locally
* push-all:
* We can use a specific REGISTRY and retag the images
* Push all the images
This top-level Makefile will run the sub-Makefiles that every
component has for building and pushing the images.
We modified every sub-Makefile as follows:
* We don't use a registry in images by default
* Removed unused rules and vars
* Use the --dirty flag of git describe in TAG
--dirty[=<mark>]
Describe the working tree. It means describe HEAD and
appends <mark> (-dirty by default) if the working tree
is dirty.
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Create makefiles for notebook servers
The common starting point of building the notebook-server
images are the following 4 *base* images:
- base
- codeserver
- jupyter
- rstudio
All other server images need to build on top of them. We'll
dynamically pass the base images in every Dockerfile by
using an ARG IMG. We can set the value of this ARG during
docker build with the --build-arg CLI argument.
This way we build both the base images with a tag locally,
and then we pass that image as arg via the Makefile and build the rest
So we modified our building procedure as follows:
1. Build the base image since everything starts from there
2. Pass the base image as an ARG in the Dockerfiles of
jupyter, codeserver, rstudio images and build on top
3. Pass the base images in all other server images and build
on top
For that we will:
1. Create a Makefile for each of the notebook servers, in each folder
a. Each makefile will be responsible for building the bases and use args for passing them on
2. Use the central Makefile to call each Makefile from above
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* cherry-pick: Notebook server upstream fixes
Relevant upstream PR: https://github.com/kubeflow/kubeflow/pull/6466/files
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Fix Makefiles
- Remove build-gcp and build-gcr rules as we don't use them anywhere in
the project
- Fix code conficts
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Fix Dockerfiles for notbook-server images
We remove the previous logic of using already built images as bases.
The users must use only the Makefiles to pass the appropriate BASE_IMG
and build the images correctly.
Thus, we have Makefiles everywhere that:
- Can build any base image
- If an image requires another notebook base, then we first build that one using its makefile,
and then use it as docker ARG for building the next one
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Fix notebook-controller Makefile
Removed a misplaced "|" char that breaks the Makefile
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Update GH action workflows
* Update workflow for notebook-server images:
- Add a step for building all images by using the
central-Makefile under components/example-notebook-servers/ dir.
- Add a step for pushing all images by using the
central-Makefile under components/example-notebook-servers/ dir.
* Update workflow for all Kubeflow images:
- Add a step for building & pushing all images by using the
top-level Makefile under components/ dir.
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Remove completely ECR references from images
Replace everywhere the "public.ecr.aws/j1r0q0g6/notebooks/notebook-servers"
prefix with "kubeflownotebookswg"
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Fix GH actions for Kubeflow components
Fix GH actions to use the updated make rules
when building the Kubeflow component images.
Remove the "docker.io" prefix when building with
GH action workflows
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* Update the releasing version tag
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Run automated script for updating versions
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* kfam: Upgrade go to 1.17
Update to a more recent docker image that has a newer version of
openssl.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* poddefaults: Upgrade go to 1.17
Update to a more recent docker image that has a newer version of
openssl.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* tensorboards: Upgrade go to 1.17
Update to a more recent docker image that has a newer version of
openssl.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
When the TB controller attempts to schedule a RWO PVC it checks its
accessModes in the PVC status. The controller panics if the list is
empty.
This commit adds a check to ensure the list is not empty.
Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
* Add RWO_PVC_SCHEDULING env var to Tensorboard controller deployment
The value of the 'RWO_PVC_SCHEDULING' env var is set to "false" by
default. The user will be able to change the value of the env var
manually by modifying the 'config/manager/manager.yaml' file.
* Update README.md
* Add Tensorboard controller permissions for managing resources
The pod running the Tensorboard controller didn't have permissions
to manage the deployments, services, and VirtualServices needed
so that the Tensorboard servers would function properly.
In order for the deployed Tensorboard controller to run properly,
permissions to 'get', 'list', 'watch', 'create' and 'update'
are given to the Tensorboard controller pod so that the necessary
deployments, services and VirtualServices are created and managed
as expected. Also, permissions to 'get', 'list', 'watch' PVCs and
pods were added.
* Add namespace of Tensorboard CR to VirtualService prefix
In order to avoid creating 2 virtual services that have the same
prefix in different namespaces, the namespace of the corresponding
Tensorboard CR was added in the prefix of the generated Virtual
Service.
* Fix directory bug in Makefile
* Add README.md
* Extend Tensorboard CRD with status.readyReplicas field
The Tensorboard CRD didn't contain any information about the
Tensorboard server being ready or not. So, the status of the
Tensorboard resource is extended so that it contains a
readyReplicas field, similar to the status.readyReplicas of
the deployment of the Tensorboard server.
* Extend Tensorboard controller to update status of Tensorboard CR
The frontend of the Tensorboard web-app will need information
about whether the Tensorboard servers are ready to connect or not.
As a result, the Tensorboard controller now copies the value of the
status.readyReplicas field of the Tensorboard deployment to the
status.readyReplicas of the Tensorboard CR.
Also, a Deployment() function was added for applying and updating
Tensorboard server deployments.
* Update tensorboard.status.phase of TWA backend response
The frontend of the TWA will need information about the status
of the Tensorboard server, so that it can inform the user about
the server being ready being ready to connect or not.
As a result, the backend sets the status.phase field of the response
to "ready", if tensorboard.status.readyReplicas == 1. Otherwise, the
status.phase field of the response is set to "unavailable".
Also, the getPVCName() function was added, which extracts the name
of a given PVC object.
* Add GET route for PVCs
The Tensorboard web-app frontend will be using an autocomplete
drop-bar to show user the PVCs that live in a specific namespace.
These PVCs could be used as log storages for the Tensorboard server.
So, a PVC GET route was added to the Tensorboard web-app backend.
* Add message to Tensorboard response object in TWA backend
The frontend of the TWA will need to output a response message for
every Tensorboard object. This response message will inform the
user about the current state of the Tensorboard server.
* Use status.STATUS_PHASE for backend response
* Add requirements.txt to TWA backend
* Use status.create_status() for backend response
* Add indexers as custom field selectors for list requests to cache
The tensorboard controller must be able to list pods that have
mounted a PVC with a specific ClaimName.
In order for this list request to cache to work properly, custom
field selectors are added. These selectors are used to index the
"pod.spec.volumes.persistentvolumeclaim.claimname" field so that
unneeded pods can be filtered out.
* Set pod's nodeAffinity if log files exist in a PVC
In the case of using a PVC as a logdir for Tensorboard Server, if
the PVC had a ReadWriteOnce access mode and was alread mounted by
another running pod X, then the Tensorboard Server pod would not
always be scheduled on the same node as X. As a result, the
Tensorboard Server pod would be blocked since multi-node access
is prohibited on ReadWriteOnce volumes.
In order for the Tensorboard Server pod to run successfully,
nodeAffinity was added to the spec.template.spec.affinity field
of the returned deployment.
As a result, both X and the Tensorboard
Server pod are now scheduled on the same node.
Resolveskubernetes/kubernetes#26567
* Set Tensorboard Server scheduling feature to 'off' by default
In the case that the Tensorboard Server used a RWO PVC (as a log
storage) that was already mounted by another pod, nodeAffinity
was used so that the Tensorboard Server would be scheduled
(if possible) on the same node as that pod.
Now, this added functionality is used only if the
'RWO_PVC_SCHEDULING' environmental variable is set to "true"
when running the Tensorboard controller.
This scheduling functionality is disabled by default.
* Remove duplicate package import
Package "k8s.io/api/core/v1" was imported twice with names "v1"
and "corev1".
* Mount GCP secret only when accessing Google storage
The Tensorboard controller used to create pods (running the Tensorboard
server) that would always mount user-gcp-sa secret, regardless of the
logs storage being a Google cloud bucket or not. This would lead to pods
never starting properly in the case of using other cloud services (or
PVCs) as log storages, if the user-gcp-sa secret didn't exist on the
cluster.
In order for the Tensorboard server pods to run properly, user-gcp-sa
secret is now mounted only when Google cloud buckets are used as log
storages.
Fixeskubeflow/kubeflow#5065
* Fix docker builds of notebook and tensorboard controller
* The notebook-controllers and tensorboard-controllers now depend on
the go package components/common
* We need to rewrite the Dockerfiles so that the context is now
${KUBEfLOW_REPO}/common
* so that components/common can be included in the context and copied
to the Dockerfile
* Create skaffold configs to make it easier to do remote builds with Kaniko
* The skaffold configs are currently written assuming the kubeflow-ci cluster
is used to build the images. This could be generalized in the future.
* Remove the code to build the notebook-controller with GCB; we can just
use skaffold and kaniko to do efficient remote builds.
* Related to #4582 - Jupyter image doesn't build.
* Fix docker build rule.
* The jupyter docker image isn't building because it now depends on code
in components/common
* To make this work we need to configure it as a multi module package
and modify go.mod to redirect to a local path.
* Ref: https://github.com/golang/go/wiki/Modules#when-should-i-use-the-replace-directive
* Replaces PR #4583
Related to #4582 - Jupyter image doesn't build.