* Implement a culling controller for notebooks
Changes:
* Move the idleness/culling logic into a separate controller
as part of the Notebooks Controller/Operator.
* Introduce an "notebooks.kubeflow.org/last_activity_check_timestamp".
annotation in each Notebook CR to keep the timestamp of the last
performed idleness check
The controller can then compare this timestamp with the current time to
ensure that notebooks will get reconciled every IDLENESS_CHECK_PERIOD
minutes.
The culling-controller will:
* reconcile only notebooks CRs
* set/update culling annotations
- 'notebooks.kubeflow.org/last_activity'
- 'notebooks.kubeflow.org/last_activity_check_timestamp'
* perform idleness checks every 'IDLENESS_CHECK_PERIOD' minutes
and set the 'kubeflow-resource-stopped' annotation, if a notebook
needs to be culled.
Refs: kubeflow/kubeflow#6767
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* review: Remove culling annotations when Pod is not found
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* review: Improve logs
Add a log message at the beginning of the reconciliation loop
to make it clear that a Reconcile was called for a notebook.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Run the controller locally
* Introduce make rule for running the controller locally with
culling enabled
* Introduce a dev_culling_authorization_policy which must be
applied when testing the culling-controller locally
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Update README instructions
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
review changes
* build images with the latest tag only when a PR
is merged to master branch
* revert changes in manifests/workflows for the
notebook-server images
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Introduce intergration test workflow for notebook-controller
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Publish Docker image only when PR is merged
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Remove kind & manifest gh-action workflows
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Update tag in manifests to v1.6.0
This change is required as images with v1.5.0 do not
exist in Dockerhub.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* kind: Introduce config file for 1.25
* Add a new KinD configuration file for testing with K8s 1.25.3
* Install kind v0.17.0 for testing with K8s 1.25.3
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* gh-actions: Use 1.25 for testing
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* testing: Install Istio 1.16 for testing
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Test commit for enabling the tests
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* notebook-controller: Fix Makefile
Remove the test rule as a prerequisite for running docker-build
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* updated base images in Volume Web component for multiple arch
* updated base images in Tensorboard Web component for multiple arch
* updated base images in Jupyter Web component for multiple arch
* updated admission webhook component for multiple arch
* removed goarch depedency for multiarch building
* removed goarch depedency for multiarch building in admission webhook component
* removed goarch depedency & added powerPC case for multiarch building in access-management component
* removed goarch depedency for multiarch building in tensorboard controller
* removed goarch depedency for multiarch building in notebook Controller
* Removing empty computation to resolve future build issues
The notebook controller writes the last-activity annotation
before culling the Notebook, however, doesn't remove this
annotation before start. This causes the Notebook to be culled
again before is has a chance to start.
Fix:
* calculate correctly the podFound variable and ensure its value
its true only if the Pod is actually found. This way the culling
annotation will be removed when there is no Pod.
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
Signed-off-by: Apostolos Gerakaris <apoger@arrikto.com>
* Fix#6056: Update Notebook status properly
Signed-off-by: Apostolos Gerakaris apoger@arrikto.com
* Added suggested code changes
Signed-off-by: Apostolos Gerakaris apoger@arrikto.com
* notebook-controller: Add unit tests
*Introduce basic unit tests for "createNotebookStatus" function
*Add GH action for unit tests
Signed-off-by: Apostolos Gerakaris apoger@arrikto.com
* Fix PodCoditionsMirroringToNotebook & Unit-tests
We encountered an error during testing. It seems that
the pod.status.conditions.condition.LastProbeTime remains
always null and so the controller ends up applying a Notebook
CR instance with null condition values.
Relevant Issues:
*https://github.com/kubernetes/kubernetes/issues/109958
*https://github.com/kubernetes/kubernetes/issues/79402
*https://github.com/kubernetes/kubernetes/issues/14393
Fix: Check if the Pod's condition.LastProbeTime
and condition.LastTransitionTime timestamp fields are null.
If so, initialize them so we dont end up applying
a Notebook instance with null condition values.
Other changes:
*Fix basic unit tests
*Introduced a unit test for the case where Notebook's Pod
is unschedulable
Signed-off-by: Apostolos Gerakaris apoger@arrikto.com
Signed-off-by: Apostolos Gerakaris apoger@arrikto.com
* Fix#6528: Mirroring Pod conditions to Notebook
* Added missing fields which are part of PodConditions into NotebookConditions
* Added suggested changes
* build: Update components makefiles for building
We'll create a top-level Makefile under components/ dir
that has the following rules:
* build-all:
To build all images locally
* push-all:
* We can use a specific REGISTRY and retag the images
* Push all the images
This top-level Makefile will run the sub-Makefiles that every
component has for building and pushing the images.
We modified every sub-Makefile as follows:
* We don't use a registry in images by default
* Removed unused rules and vars
* Use the --dirty flag of git describe in TAG
--dirty[=<mark>]
Describe the working tree. It means describe HEAD and
appends <mark> (-dirty by default) if the working tree
is dirty.
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Create makefiles for notebook servers
The common starting point of building the notebook-server
images are the following 4 *base* images:
- base
- codeserver
- jupyter
- rstudio
All other server images need to build on top of them. We'll
dynamically pass the base images in every Dockerfile by
using an ARG IMG. We can set the value of this ARG during
docker build with the --build-arg CLI argument.
This way we build both the base images with a tag locally,
and then we pass that image as arg via the Makefile and build the rest
So we modified our building procedure as follows:
1. Build the base image since everything starts from there
2. Pass the base image as an ARG in the Dockerfiles of
jupyter, codeserver, rstudio images and build on top
3. Pass the base images in all other server images and build
on top
For that we will:
1. Create a Makefile for each of the notebook servers, in each folder
a. Each makefile will be responsible for building the bases and use args for passing them on
2. Use the central Makefile to call each Makefile from above
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* cherry-pick: Notebook server upstream fixes
Relevant upstream PR: https://github.com/kubeflow/kubeflow/pull/6466/files
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Fix Makefiles
- Remove build-gcp and build-gcr rules as we don't use them anywhere in
the project
- Fix code conficts
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Fix Dockerfiles for notbook-server images
We remove the previous logic of using already built images as bases.
The users must use only the Makefiles to pass the appropriate BASE_IMG
and build the images correctly.
Thus, we have Makefiles everywhere that:
- Can build any base image
- If an image requires another notebook base, then we first build that one using its makefile,
and then use it as docker ARG for building the next one
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Fix notebook-controller Makefile
Removed a misplaced "|" char that breaks the Makefile
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Update GH action workflows
* Update workflow for notebook-server images:
- Add a step for building all images by using the
central-Makefile under components/example-notebook-servers/ dir.
- Add a step for pushing all images by using the
central-Makefile under components/example-notebook-servers/ dir.
* Update workflow for all Kubeflow images:
- Add a step for building & pushing all images by using the
top-level Makefile under components/ dir.
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Remove completely ECR references from images
Replace everywhere the "public.ecr.aws/j1r0q0g6/notebooks/notebook-servers"
prefix with "kubeflownotebookswg"
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
* build: Fix GH actions for Kubeflow components
Fix GH actions to use the updated make rules
when building the Kubeflow component images.
Remove the "docker.io" prefix when building with
GH action workflows
Signed-off-by: Apotolos Gerakaris <apoger@arrikto.com>
Fix https://github.com/kubeflow/kubeflow/issues/6366
Migrating to Kubebuilder v3 leads to the following changes:
- Add .dockerignore file.
- Upgrade Go version from v1.15 to v1.17.
- Adapt Makefile.
- Add image (build + push) target to makefile.
- Upgrade EnvTest to use K8s v1.22.
- Update PROJECT template.
- Migrate CRD apiVersion from v1beta to v1.
- Add livenessProbe and readinessProbe to controller manager.
- Upgrade controller-runtime from v0.2.0 to v0.11.0.
Other changes:
- Build image using public.ecr.aws registry instead of gcr.io.
- Update README.md documentation.
- Update 3rd party licences.
- Fix notebook.spec description.
- Add 3 sample notebooks (v1, v1alpha1 and v1beta1).
Signed-off-by: Samuel Veloso <svelosol@redhat.com>
The controller should not trigger the reconcile loop when an Event is
deleted. Previously the controller would run the reconciliation loop on
any event deletion.
This commit updates it to not run the loop for ANY event.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* notebooks: Update notebook if timestamp changed
We don't want to be updating the spec of the notebook if the timestamp
hasn't changed, since this will lead to constant updates and
reconciliation loops.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* notebooks: Use a deep-copy of the notebook spec
The controller should use a deep-copy of the notebook spec when
calculating the spec for the StatefulSet. If not then we could
update the notebook object without wanting it, since the spec could have
been changed when calculating the STS spec.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* notebooks: Add prefix env var only if missing
The controller should be setting OR updating the NB_PREFIX env var.
Previously it would always blindly append it to the spec, which could
result in double entries for the same env var.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* notebooks: Handle events gracefully
The controller is not exiting the reconciliation loop after it has
re-emitted a Pod/STS Event as a Notebook Event. This results in the
controller to later on try and GET a Notebook with the name of the Event
that triggered the reconciliation loop.
The controller should exit the reconciliation function once it has
emitted the event.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* notebooks: Don't reconcile on deleted events
We don't want to trigger the reconciliation function when an event gets
deleted.
If a Notebook would be deleted then the underlying events would
be deleted as well, which results in the reconcile function to get
triggered and try to GET Events and Notebooks with the name of the
deleted event.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* notebooks: Update image's tag in make
Modify Makefile to update properly the TAG
based on the git TAG.
Signed-off-by: Athanasios Markou <athamark@arrikto.com>
Reviewed-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* notebooks: Expose last-activity
Extend the notebook-controller to:
* cull idle Notebook Servers based on their new `last-activity`
annotation
* expose the last activity of each Notebook Server as an annotation
on the metadata of the corresponding CR object
Modify notebook_controller.go to:
* update the Last Activity of each Notebook Server that has a
Running pod
* delete the Last Activity Annotation for every Notebook Server
that does not have a Running pod
Extend culler.go to:
* perform culling based on the new `last-activity` annotation and
not based on the `/api/status` endpoint.
* update the last activity of a Notebook Server, based on the
kernels' execution states.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
Reviewed-by: Athanasios Markou <athamark@arrikto.com>
* notebooks: Introduce a DEV env var
We introduce a DEV ENV var to allow admins
develop and test on their local machine their
custom Notebook Controller.
We provide information and instructions inside
the components/notebook-controller/README.md.
Signed-off-by: Athanasios Markou <athamark@arrikto.com>
Reviewed-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* notebooks: Add unit tests for last-activity
* Introduce new tests for allKernelsAreIdle()
* Extend the tests for NotebookIsIdle() and for
NotebookNeedsCulling().
Signed-off-by: Athanasios Markou <athamark@arrikto.com>
Reviewed-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* review: UpdateNotebookLastActivityAnnotation()
Ensure that UpdateNotebookLastActivityAnnotation() does not return
"true". This function should not return any value.
Signed-off-by: Athanasios Markou <athamark@arrikto.com>
* Update the releasing version tag
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Run automated script for updating versions
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
In the existing version, the 'timeout: 300s' added to the notebook's virtual service would cause websockets to disconnect at the 5 minute mark, causing the Jupyter Notebook web terminal function to hang. This is described in https://github.com/kubeflow/kubeflow/issues/6124.
As part of the work of wg-manifests for 1.3
(https://github.com/kubeflow/manifests/issues/1735), we are moving manifests
development in upstream repos. This gives the application developers full
ownership of their manifests, tracked in a single place.
This commit copies the manifests for application `Notebook Controller`
from path `apps/jupyter/notebook-controller/upstream` of kubeflow/manifests to path
`components/notebook-controller/config` of the upstream repo (https://github.com/kubeflow/kubeflow).
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
Upgrade go version of the notebook-controller to 1.15, across the
Dockerfile, Makefile and README. We used the same Golang version as our Kubernetes
dependency, after @Jeffwan's suggestion.
* Correct ContainerStatus of Notebook CR
The Notebook Controller doesn't set the State of the CR correctly. In some cases
the first container is the istio-sidecar which results in an incorrect state being
shown to the Notebook CR. This is fix now by showing the Notebook container
ContainerState to the Notebook CR ContainerState
* Changed log statement and added a comment
Implemented remarks of @yanniszark and @kimwnasptd
* Small reorganization of some if statements
We use the local `../common` module to build `notebook-controller`. We
also need to specify a valid pseudo-version for `common` to support
importing the Notebook API in other modules. This is because according
to the `go.mod` docs [1]:
> exclude and replace directives only operate on the current (“main”)
> module. exclude and replace directives in modules other than the main
> module are ignored when building the main module.
If we don't replace the default "zero version" for `common` that is
generated in our require directive, then then builds fail for modules
that require the Notebook API. They will encounter an an "invalid
version" error for `common` at commit hash "000000000000".
[1]: https://github.com/golang/go/wiki/Modules#gomod
* Implemented functional tests using ginkgo
The notebook controller can be tested using sigs.k8s.io/controller-runtime/pkg/envtest which comes as part of kubebuilder. With this we should be able to measurable test coverage.
* Fixed the incorrect test condition and included fix to download the envtest binaries.
Fixed the incorrect test condition and included fix to download the envtest binaries.
* Some tweaks based on review.
* Removed the check-license as it was blocking the test.
Included some of the tweaked yaml's files that were being generated.
The default leader election ID is controller-leader-election-helper which could conflict when multiple controllers run within the same namespace. This is a required field in later versions of controller-runtime.