* web-apps(front): Update the README
Update the readme with detailed commands on how to consume the library
as well as developer guidelines.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* web-apps(front): Fix typo in README
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* web-apps(front): Update the common library
Add new components to the library. These components will enhance:
* The current common table for visualizing objects
* The components we can use for a details-page for each object
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* web-apps(front): Add unit tests to common lib
Fix and introduce new unit tests for most of the components in the
library. We expect the developers to always run `ng test` before any PR
to ensure that the existing functionality is not broken.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* jwa(front): Add required packages for common lib
The common library will expect extra npm modules to be installed in each
app that consumes it.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
We use the local `../common` module to build `notebook-controller`. We
also need to specify a valid pseudo-version for `common` to support
importing the Notebook API in other modules. This is because according
to the `go.mod` docs [1]:
> exclude and replace directives only operate on the current (“main”)
> module. exclude and replace directives in modules other than the main
> module are ignored when building the main module.
If we don't replace the default "zero version" for `common` that is
generated in our require directive, then builds fail for modules
that require the Notebook API. They will encounter an "invalid
version" error for `common` at commit hash "000000000000".
[1]: https://github.com/golang/go/wiki/Modules#gomod
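As a rough illustration of the setup described above, the `go.mod` of
`notebook-controller` could look like the fragment below. The module paths
are assumed from the repository layout, and the pseudo-version is a
placeholder, not a real commit.

```
require (
	// Placeholder pseudo-version: use the timestamp and hash of a real commit.
	github.com/kubeflow/kubeflow/components/common v0.0.0-20200101000000-abcdefabcdef
)

// Only honored when notebook-controller is the main module being built.
replace github.com/kubeflow/kubeflow/components/common => ../common
```

Modules that import the Notebook API ignore the `replace` line, so they
resolve `common` from the version in the `require` directive.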
* Implemented functional tests using ginkgo
The notebook controller can be tested using sigs.k8s.io/controller-runtime/pkg/envtest, which comes as part of kubebuilder. With this we should be able to get measurable test coverage.
* Fixed the incorrect test condition and included fix to download the envtest binaries.
* Some tweaks based on review.
* Removed the check-license as it was blocking the test.
Included some of the tweaked YAML files that were being generated.
The default leader election ID is controller-leader-election-helper which could conflict when multiple controllers run within the same namespace. This is a required field in later versions of controller-runtime.
* Update the backend
For the frontend to work properly we will need to add the following
changes to jupyter web app's backend as well as to the common backend
code:
* rename the references from `flask_rest_backend` to `crud_backend` in
the web app's backend code
* add a route for exposing GPU info. This way the UI will block users
from creating Notebooks with a GPU type that is not installed at all
in the cluster
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Update the common frontend library
New functionality added:
* An `Advanced Settings` button that can expand and shrink to
expose/hide more options in the form
* All validators will have a debounce time to make the input of
characters smoother
* Extend the Status types to allow start/stopped resources
* Extend the main table config to support a button (e.g. CONNECT for
the Jupyter web app)
* The http services should use relative URLs
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Update the frontend to utilize the common code
The bulk of the new frontend code. The folder structure is changed to
make it clearer which pages the app uses.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* {Make,Docker}files
Add Makefile and Dockerfiles. Note that GCB build process needs to be
updated.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* README.md
Add a readme that explains how to build the app and have a development
environment.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* WA: Backend common: update the library
Update the common python wheel wrt:
* How to distinguish between dev and prod mode
* Extra routes for handling Notebooks
* Serving the index.html for every non api route (SPA)
* Add a STOPPED state to the possible Status values
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* JWA: Add the refactored backend
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* JWA: Backend: Add support for Affinity/Tolerations
* Extend the configuration yaml with default form values for the
affinity/tolerations
* Set them accordingly when the user submits a notebook
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
Reviewed-by: Ilias Katsakioris <elikatsis@arrikto.com>
* Add frontend for the Tensorboard Web-app
This commit contains the code for the frontend of the Tensorboard
web-app. It completes the GSoC 2020 project for building the
standalone TWA for Kubeflow.
The app is not yet fully integrated to the Kubeflow dashboard, so
the README.md file contains documentation on how to build, run and
use the web-app locally.
Also, a Dockerfile was added in order to build a playground image
of the web-app. The 'deploy' folder contains manifests that will
enable the TWA to properly run on the cluster in the future.
* Update README.md
* Add RWO_PVC_SCHEDULING env var to Tensorboard controller deployment
The value of the 'RWO_PVC_SCHEDULING' env var is set to "false" by
default. The user will be able to change the value of the env var
manually by modifying the 'config/manager/manager.yaml' file.
* Update README.md
* Add Tensorboard controller permissions for managing resources
The pod running the Tensorboard controller didn't have permissions
to manage the deployments, services, and VirtualServices needed
so that the Tensorboard servers would function properly.
In order for the deployed Tensorboard controller to run properly,
permissions to 'get', 'list', 'watch', 'create' and 'update'
are given to the Tensorboard controller pod so that the necessary
deployments, services and VirtualServices are created and managed
as expected. Also, permissions to 'get', 'list', 'watch' PVCs and
pods were added.
* Add namespace of Tensorboard CR to VirtualService prefix
In order to avoid creating 2 virtual services that have the same
prefix in different namespaces, the namespace of the corresponding
Tensorboard CR was added in the prefix of the generated Virtual
Service.
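A minimal sketch of how such a prefix could be built. The function name and
the "/tensorboard/<namespace>/<name>/" layout are assumptions made for
illustration and may not match the controller's actual code.

```go
package main

import "fmt"

// virtualServicePrefix builds the URL prefix for a Tensorboard's
// VirtualService. Including the namespace keeps two Tensorboards with the
// same name in different namespaces from sharing a prefix.
// The exact path layout is an assumption of this sketch.
func virtualServicePrefix(namespace, name string) string {
	return fmt.Sprintf("/tensorboard/%s/%s/", namespace, name)
}

func main() {
	fmt.Println(virtualServicePrefix("team-a", "my-tb"))
	fmt.Println(virtualServicePrefix("team-b", "my-tb"))
}
```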
* Fix directory bug in Makefile
* Add README.md
* Extend Tensorboard CRD with status.readyReplicas field
The Tensorboard CRD didn't contain any information about the
Tensorboard server being ready or not. So, the status of the
Tensorboard resource is extended so that it contains a
readyReplicas field, similar to the status.readyReplicas of
the deployment of the Tensorboard server.
* Extend Tensorboard controller to update status of Tensorboard CR
The frontend of the Tensorboard web-app will need information
about whether the Tensorboard servers are ready to connect or not.
As a result, the Tensorboard controller now copies the value of the
status.readyReplicas field of the Tensorboard deployment to the
status.readyReplicas of the Tensorboard CR.
Also, a Deployment() function was added for applying and updating
Tensorboard server deployments.
* Update tensorboard.status.phase of TWA backend response
The frontend of the TWA will need information about the status
of the Tensorboard server, so that it can inform the user about
the server being ready to connect or not.
As a result, the backend sets the status.phase field of the response
to "ready", if tensorboard.status.readyReplicas == 1. Otherwise, the
status.phase field of the response is set to "unavailable".
Also, the getPVCName() function was added, which extracts the name
of a given PVC object.
* Add GET route for PVCs
The Tensorboard web-app frontend will be using an autocomplete
drop-bar to show the user the PVCs that live in a specific namespace.
These PVCs could be used as log storages for the Tensorboard server.
So, a PVC GET route was added to the Tensorboard web-app backend.
* Add message to Tensorboard response object in TWA backend
The frontend of the TWA will need to output a response message for
every Tensorboard object. This response message will inform the
user about the current state of the Tensorboard server.
* Use status.STATUS_PHASE for backend response
* Add requirements.txt to TWA backend
* Use status.create_status() for backend response
Create an Angular Library with common frontend code. Our crud web apps
should use this library to share common functionality like:
* Talking to Central Dashboard for the Namespace selection
* Making http calls
* Surfacing and showing error messages and warnings
* Form utilities
* Showing a table with entries and actions
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Add indexers as custom field selectors for list requests to cache
The tensorboard controller must be able to list pods that have
mounted a PVC with a specific ClaimName.
For this list request to the cache to work properly, custom
field selectors are added. These selectors are used to index the
"pod.spec.volumes.persistentvolumeclaim.claimname" field so that
unneeded pods can be filtered out.
* Set pod's nodeAffinity if log files exist in a PVC
In the case of using a PVC as a logdir for Tensorboard Server, if
the PVC had a ReadWriteOnce access mode and was already mounted by
another running pod X, then the Tensorboard Server pod would not
always be scheduled on the same node as X. As a result, the
Tensorboard Server pod would be blocked since multi-node access
is prohibited on ReadWriteOnce volumes.
In order for the Tensorboard Server pod to run successfully,
nodeAffinity was added to the spec.template.spec.affinity field
of the returned deployment.
As a result, both X and the Tensorboard
Server pod are now scheduled on the same node.
Resolves kubernetes/kubernetes#26567
* Set Tensorboard Server scheduling feature to 'off' by default
In the case that the Tensorboard Server used a RWO PVC (as a log
storage) that was already mounted by another pod, nodeAffinity
was used so that the Tensorboard Server would be scheduled
(if possible) on the same node as that pod.
Now, this added functionality is used only if the
'RWO_PVC_SCHEDULING' environment variable is set to "true"
when running the Tensorboard controller.
This scheduling functionality is disabled by default.
* Create Tensorboard web-app backend
Create the code for the Tensorboard web-app backend which
includes routes for GET, POST and DELETE requests.
The backend is created with Python/Flask, so it also uses
the common code from 'kubeflow.kubeflow.crud_backend'.
* Add 'get_age(k8s_object)' function to 'crud_backend' common code
It would be useful for all web apps of the 'crud-web-apps' folder
to return age information to their frontends.
As a result, 'get_age(k8s_object)' was added to the common code,
so that all web apps can use it.
Create a python module under the kubeflow.kubeflow package that will
be exposing common code and a base app that takes care of:
* Exceptions handling
* Common routes for serving static files and their cache control policy
* Authorization checks with SubjectAccessReview
* Authentication checks on the Kubeflow headers
* Common helper functions for dates, yaml parsing etc
* health/liveness probes
Backends that are written with Python/Flask should use this common code
in order for us to reduce code duplication and have our backends align
with our accepted practices.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Create a new directory in components for web apps
Since we want to also have some common code between our web apps we
should create a parent dir for any future web app we want to develop.
The code for the web apps, common or not, should be organized under this
directory.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* remove the reviewers
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Remove duplicate package import
Package "k8s.io/api/core/v1" was imported twice with names "v1"
and "corev1".
* Mount GCP secret only when accessing Google storage
The Tensorboard controller used to create pods (running the Tensorboard
server) that would always mount user-gcp-sa secret, regardless of the
logs storage being a Google cloud bucket or not. This would lead to pods
never starting properly in the case of using other cloud services (or
PVCs) as log storages, if the user-gcp-sa secret didn't exist on the
cluster.
In order for the Tensorboard server pods to run properly, user-gcp-sa
secret is now mounted only when Google cloud buckets are used as log
storages.
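The conditional mount could be sketched as follows. Detecting a Google cloud
bucket through the "gs://" scheme and the helper names are assumptions of
this sketch. Only the `user-gcp-sa` secret name comes from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// isGoogleCloudPath reports whether the logs path points to a Google cloud
// bucket. Checking the "gs://" scheme is an assumption of this sketch.
func isGoogleCloudPath(logsPath string) bool {
	return strings.HasPrefix(logsPath, "gs://")
}

// secretsToMount returns the names of the secrets the Tensorboard server pod
// should mount for the given logs path (hypothetical helper).
func secretsToMount(logsPath string) []string {
	if isGoogleCloudPath(logsPath) {
		return []string{"user-gcp-sa"}
	}
	return nil
}

func main() {
	for _, p := range []string{"gs://bucket/logs", "s3://bucket/logs", "pvc://my-pvc/logs"} {
		fmt.Printf("%s -> %v\n", p, secretsToMount(p))
	}
}
```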
Fixes kubeflow/kubeflow#5065
* Allowing for an env var ADD_FSGROUP to be set to false to suppress the automatic addition of fsGroup: 100 in the pod's security context.
This addresses issue #4617.
* Adding note in README regarding ADD_FSGROUP.
This commit fixes the event filtering check, so it doesn't crash when
the Pod name doesn't contain a dash ("-").
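A sketch of a dash-safe check, assuming the owner's name is derived by
stripping the last "-<suffix>" from the Pod name. The function name is
hypothetical and the controller's actual check may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// ownerNameFromPod strips the trailing "-<suffix>" from a Pod name.
// It returns ok=false instead of slicing with a negative index when the
// name contains no dash.
func ownerNameFromPod(podName string) (name string, ok bool) {
	idx := strings.LastIndex(podName, "-")
	if idx < 0 {
		return "", false
	}
	return podName[:idx], true
}

func main() {
	for _, n := range []string{"my-notebook-0", "standalone"} {
		owner, ok := ownerNameFromPod(n)
		fmt.Printf("%q -> %q (ok=%v)\n", n, owner, ok)
	}
}
```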
Signed-off-by: Yannis Zarkadas <yanniszark@arrikto.com>
* Fix docker builds of notebook and tensorboard controller
* The notebook-controllers and tensorboard-controllers now depend on
the go package components/common
* We need to rewrite the Dockerfiles so that the context is now
${KUBEfLOW_REPO}/common, so that components/common can be included in
the context and copied to the Dockerfile
* Create skaffold configs to make it easier to do remote builds with Kaniko
* The skaffold configs are currently written assuming the kubeflow-ci cluster
is used to build the images. This could be generalized in the future.
* Remove the code to build the notebook-controller with GCB; we can just
use skaffold and kaniko to do efficient remote builds.
* Related to #4582 - Jupyter image doesn't build.
* Fix docker build rule.
* The jupyter docker image isn't building because it now depends on code
in components/common
* To make this work we need to configure it as a multi module package
and modify go.mod to redirect to a local path.
* Ref: https://github.com/golang/go/wiki/Modules#when-should-i-use-the-replace-directive
* Replaces PR #4583
Related to #4582 - Jupyter image doesn't build.
* Delete all the Tekton pipelines and scripts for continuous delivery
of Kubeflow applications because they are moving into kubeflow/testing
* kubeflow/testing#551 is the PR moving the code into kubeflow/testing
Related to: kubeflow/testing#544 redo how we use kustomize and Tekton
to parameterize the pipelines
* Migrate to kustomize3: Phase 1. Update kustomization.yaml
* Migrate to kustomize3: Phase 2: Update kustomize.go
- Update kustomize.go to match new package structure.
- Update module dependencies.
* Migrate to kustomize3: Phase 3: Implements code review
- As per request, revert kustomization.yaml back to deprecated syntax.
- As per request, revert kustomize.go to use deprecated .Bases field.
- Note: patchesStrategicMerge: will be turned into a deprecated field pretty soon.
- Rerun go mod tidy
* Migrate to kustomize3: Phase 4: Activate legacy order transformer
* Create a culler as a package
Helper functions for culling resources. Assumes that Istio is
installed on the system and queries Prometheus to get metrics.
Specifically, requests/{configurable time}.
If the resource should be culled, then it should be done by setting an
annotation. This way the UIs can also show that the Resource is stopping
and also easily stop a resource by making a PATCH request.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Culling logic enhancements
Add the necessary env vars. Culling won't happen by default. To enable it
the user will need to set ENABLE_CULLING=true.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Misc fixes in logging and comment cleanup
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Fix typo
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Add Notebooks specific culling
Query the /api/status endpoint of each Server
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Remove the generic culling logic
We need to discuss if it would make sense to have this logic as a go
library, or use knative.
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Add unit tests
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Remove unused code
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Review changes #1
* rename `getEnvDef` to `getEnvDefault`
* Add a comment to describe how the STOP_ANNOTATION gets used
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>
* Make cluster domain configurable
Signed-off-by: Kimonas Sotirchos <kimwnasptd@arrikto.com>