* feat(deployment): marketplace - allow specifying gcs bucket directly
* Switch tfx default bucket to user specified one
* Update schema description
* Update version to 0.5.1 to match marketplace expectation
* Fix gcsBucketName var
* Remove gcp secret credentials
* refactor(deployment): separate metadata-writer and metadata-grpc folders
* refactor(deployment): move kustomization.yaml images to the lowest level package
* format
* refactor(manifests): move minio artifact secret to minio package
* let api server and ui use minio artifact secret instead of default value
* Update kustomization.yaml
* fix name
* reduce ttl of persisted final workflow to 1 day
* add comment
* enable pagination when expanding experiment in both the home page and the archive page
* Revert "enable pagination when expanding experiment in both the home page and the archive page"
This reverts commit 5b672739dd.
* Address comments
* Use configMapKeyRef for env vars
* Allow easy customization of cluster-scoped resources namespace
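A hedged sketch of that customization point (the file layout and names are assumed for illustration, not taken from the repo): a kustomization for the cluster-scoped package can hoist the namespace into a single field.
```yaml
# kustomization.yaml for the cluster-scoped package (illustrative):
# editing this one namespace field retargets namespaced references
# such as ClusterRoleBinding subjects across the whole package.
namespace: kubeflow
resources:
  - clusterrole.yaml
  - clusterrolebinding.yaml
```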
* clean up
* Clean up
* Simplify var replacement with direct configmap value ref
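A minimal sketch of the configMapKeyRef pattern these commits describe; the configmap name, key, and variable are hypothetical, not the exact ones in the manifests.
```yaml
# Env var taken directly from a configmap; no var-replacement pass needed.
env:
  - name: BUCKET_NAME
    valueFrom:
      configMapKeyRef:
        name: pipeline-install-config   # hypothetical configmap name
        key: bucketName                 # hypothetical key
```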
* clean up params.env
* Refactor components/release.sh to provide a new components/release-branch.sh that updates release branch directly
* Release components as version tag instead of commit SHA
* Publish component images in release.cloudbuild.yaml
* Include script that updates version tag for component sdk
* [Manifest] Use kustomize native image transformer to override image
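Kustomize's built-in image transformer replaces patch-based overrides; a minimal example, with the image name and tag assumed for illustration:
```yaml
# kustomization.yaml: kustomize rewrites any matching container image
# to the new tag at build time, no strategic-merge patch required.
images:
  - name: gcr.io/ml-pipeline/api-server
    newTag: 0.5.1
```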
* Revert unintended changes
* Fix kustomization.yaml location
* Fix inverse proxy image
* Add release script for kustomize manifest
* Add release scripts for marketplace manifest and sdk
* Add global release.sh
* Fix sdk release script
* Clean up release scripts
* Fix release script
* Fix release scripts
* fix
* fix
* Fix: uppercase vars cannot be used in cloudbuild.yaml
* Add old components release script back
* Add a RELEASE.md doc
* probes for ml-pipeline-ui
* clean up comments
* Use wget instead of curl, because wget is included in alpine
* Also update marketplace manifest
* Add readiness/liveness probe for api server
* Add probes for python vis server
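A sketch of the probe shape these commits describe, using wget since it ships in the alpine base image; the port, path, and timings are assumptions, not the shipped values.
```yaml
# exec-based probes avoid a curl dependency in alpine-based images
readinessProbe:
  exec:
    command: ['wget', '-q', '-S', '-O', '-', 'http://localhost:3000/']
  initialDelaySeconds: 3
  periodSeconds: 5
livenessProbe:
  exec:
    command: ['wget', '-q', '-S', '-O', '-', 'http://localhost:3000/']
  initialDelaySeconds: 3
  periodSeconds: 5
```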
* Upgraded Argo to v2.7.4
* Downgraded the Argo CLI version to 2.4.3
See https://github.com/argoproj/argo/issues/2793
* Removed the Argo CLI arg that had been removed upstream
* Updated to Argo 2.7.5
* Added workflowtemplates and cronworkflows to the Role
* Added the new Argo CRDs
* Add kfp-container-builder sa
* Allow service account to be configurable
* Fix tests
* Fix test
* Use documentation for service account to introduce compatibility with different types of installation
* updated doc
* clean up
* Update container_builder_test.py
* Update _build_image_api.py
* Update kustomization.yaml
* Add executable permission for presubmit tests mkp.sh
* Initial execution cache
This commit adds the initial execution cache service, including the HTTP
service and execution key generation.
* fix master
* Add cache manifests for mkp deployment
* revert go.sum
* Add helm on delete policy for cache deployer job
* Change cache deployer job to statefulset
* remove unnecessary cluster role
* separate clusterrole and role
* add role and rolebinding to mkp
* change secret role to clusterrole
* Add cloudsql support to cache
* fix comma
* Change cache secret clusterrole to role
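The back-and-forth above lands on a namespaced Role for the cache deployer's secret access; a hedged sketch, with the name and verb list illustrative:
```yaml
# A Role instead of a ClusterRole, scoping the cache deployer's
# certificate-secret access to its own namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: kubeflow-pipelines-cache-deployer-role
rules:
  - apiGroups: ['']
    resources: ['secrets']
    verbs: ['create', 'get', 'patch']
```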
* Adjust sequences of resources
* Update values and schema
* remove extra tab
* Change statefulset to job
* Add pod delete permission to cache deployer role
* Test changing cache deployer job to deployment
* remove extra permission
* remove statefulset check
* enable CloudSQL+GCSObjStore without default credential
* refresh document
* fix schema
* minio project ID is required
* fix several
* self-throttle GitHub requests to keep builds stable
* can work now
* upsize and lowercase for bucket name
Co-authored-by: Renmin Gu <renming@google.com>
* Initial execution cache
This commit adds the initial execution cache service, including the HTTP
service and execution key generation.
* fix master
* Change cache deployer job to stateful set
* Delete cache deployer job
* Delete cache deployer job after it completes
* minor fix
* fix indentation
* Change cache deployer job to statefulset
* Remove extra cluster role for cache deployer
* remove cache in base kustomize file for upgrade test
* minor fix
* Enable cache and cache-deployer in base kustomization file
* fix
* fix
* test
* test
* test
* Refactor cluster scope resources
* refactor
* Add namespace for sa
* Fix
* Add crds folder to cluster kustomization yaml
* namespace change
* fix
* fix
* fix
* update test
* Rename cluster to cluster-scoped-resource
* test adding namespace in kustomization file
* revert namespace for clusterrolebinding
* fix
* Add db_name in cache_deployment manifest
* rename
* change secret cluster role to role
* Manifests: Rename metadata gRPC server's resources to metadata-grpc-*
The metadata service deployed is a gRPC server.
A proper KF installation deploys both an HTTP server, naming the required
resources 'metadata-deployment' and 'metadata-service', and a gRPC
server, naming the corresponding resources 'metadata-grpc-deployment'
and 'metadata-grpc-service'.
KFP standalone installation manifests deploy solely the gRPC server, but
use naming identical to that of KF's HTTP server.
Applying them on top of an existing KF cluster breaks the Metadata service.
In this PR we change the naming so it no longer diverges from a proper KF
installation. We also make MetadataWriter aware of that change.
Closes #2889.
Signed-off-by: Ilias Katsakioris <elikatsis@arrikto.com>
* Fix ConfigMaps' label
* metadata-configmap
* metadata-mysql-configmap
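After the rename, the gRPC resources follow a metadata-grpc-* prefix; roughly the shape below, with the port and selector being assumptions:
```yaml
# Post-rename naming, so the gRPC service no longer collides with the
# resources of KF's HTTP metadata server.
apiVersion: v1
kind: Service
metadata:
  name: metadata-grpc-service
spec:
  selector:
    component: metadata-grpc-server
  ports:
    - name: grpc-api
      port: 8080
      protocol: TCP
```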
* README: Link to KF installation & reference KFP version
* Initial execution cache
This commit adds the initial execution cache service, including the HTTP
service and execution key generation.
* Add initial server logic
* Add const
* Change folder name
* Change execution key name
* Fix unit test
* Add Dockerfile and OWNERS file
This commit adds a Dockerfile for building the source code and an OWNERS
file for easy review. It also renames some functions.
* fix go.sum
This PR fixes the go.sum changes.
* Add local deployment scripts
This commit adds local deployment scripts which can deploy cache service
to an existing cluster with KFP installed.
* refactor src code
* Add standalone deployment scripts and yamls
This commit adds execution cache deployment scripts and yaml files for
KFP standalone deployment, including a deployer which generates the
certificate, the mutatingwebhookconfiguration, and the execution cache
deployment.
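Per the body above, the deployer mints a cert and registers the webhook; a sketch of the registration it would produce, with the service name, namespace, and path being assumptions:
```yaml
# Pod creation is routed through the cache server for mutation.
apiVersion: admissionregistration.k8s.io/v1beta1
kind: MutatingWebhookConfiguration
metadata:
  name: cache-webhook-kubeflow
webhooks:
  - name: cache-server.kubeflow.svc
    clientConfig:
      service:
        name: cache-server
        namespace: kubeflow
        path: /mutate
      caBundle: CA_BUNDLE   # base64 cert generated and injected by the deployer
    rules:
      - operations: ['CREATE']
        apiGroups: ['']
        apiVersions: ['v1']
        resources: ['pods']
```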
* Minor fix
* Add execution cache image build in test folder
* fix test cloudbuild
* Fix cloudbuild
* Add execution cache deployer image to test folder
* Add copyright
* Fix deployer build
* Add license for execution cache and cloudbuild for execution cache
images
This commit adds licenses for the execution cache source code, and adds a
cloud build step for building the cache image and cache deployer image.
It also changes the manifest name based on the changed image.
* Refactor license intermediate data
* Fix execution cache image manifest
* Typo fix for cache and cache deployer images
* Add arguments in ca generation scripts and change deployer base image to google/cloud
* minor fix
* fix arg
* Mirror source code with MPL in execution_cache image
* Minor fix
* minor refactor on error handling
* Refactor cache source code, Docker image and manifest
* Fix variable names
* Add images in .release.cloudbuild.yaml
* Change execution_cache to generic name
* revise readme
* Move deployer job out of upgrade script
* fix tests
* fix tests
* Separate cache service and cache deployer job
* mysql set up
* wip
* WIP
* WIP
* work mysql connection
* initial cache logic
* watcher
* WIP pod watching with mysql
* worked crud
* Add sql unit test
* fix manifest
* Add copyright
* Add watcher check and update cache key generation logic
* test replace container images
* work cache service
* Add configmap for cache service
* refactor
* fix manifest
* Add unit tests
* Remove delete table
* Fix sql dialect
* Add cached step log
* Add metadata execution id
* minor fix
* revert go.mod and go.sum
* revert go.sum and go.mod
* revert go.sum and go.mod
* revert go.mod and go.sum
* [UI Server] Pod info handler
* [UI] Pod info tab in run details page
* Change pod info preview to use yaml editor
* Fix namespace
* Adds error handling for PodInfo
* Adjust to warning message
* [UI] Pod events in RunDetails page
* Adjust error message
* Refactor k8s helper to get rid of in cluster limit
* Tests for pod info handler
* Tests for pod event list handler
* Move pod yaml viewer related components to separate file.
* Unit tests for PodYaml component
* Fix react unit tests
* Fix error message
* Address CR comments
* Add permission to ui role
* Delete cache service in manifest, only test in presubmit tests
* fix
* fix presubmit tests
* fix
* fix
* revert unnecessary change
* fix cache image tag
* change image gcr to ml-pipeline-test
* Remove namespace in standalone manifest and add to test manifest
* minio: Set secure=true to enable TLS by default
Not using TLS is a security concern, especially when using cloud storage
like S3. This should default to secure so people don't unknowingly skip
TLS.
To make the bundled minio still work, I've submitted
https://github.com/kubeflow/manifests/pull/950 to set secure=false in
this case explicitly.
* minio: secure=false in GCP & standalone manifests
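So the default flips to TLS-on, with explicit opt-outs where the bundled minio has no certs. A sketch of such an opt-out as a deployment env var; the variable name is an assumption about the server's config surface, not verified against the code:
```yaml
# Pin the in-cluster minio client back to plain HTTP for the bundled minio.
env:
  - name: MINIO_SSL    # assumed variable name, for illustration only
    value: 'false'
```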
* bump version
* Reduce resource requests, since the MKP side will request more anyway
* done
Co-authored-by: renmingu <40223865+renmingu@users.noreply.github.com>
* Implement getting started page.
* Add feature flag to only show getting started page on hosted pipelines
* Add tests
* Fix format
* Implement requested layout in getting started page
* Minor adjust layout
* Fix tests
* Fix snapshots
* Update page title
* Metadata writer
* Added sleeper-based metadata writer
* Sleeper
* First working draft
* Added properties to Executions, Artifacts and Contexts
Also added attributions.
context_id is now stored as a label.
* Prefix the execution type names
* Ignoring TFX pods
* Fixed the deployment container spec
* Cleaned up the file and added deployment spec
* Added the Kubernetes deployment
* Added startup logging
* Made python output unbuffered
* Fixed None exception
* Formatting exceptions
* Prefixing the log message
* Improved handling of non-S3 artifacts
* Logging input artifacts
* Extracted code to the link_execution_to_input_artifact function
* Setting execution's pipeline_name to workflow name
* Adding annotation with input artifact IDs
* Running infinitely
* Added component version to execution type name
* Marking metadata as written even for failed pods
* Cleaned up some comments
* Do not fail when upstream artifact is missing
* Change the completion detection logic
Waiting for Argo's "completed=true" instead of Kubernetes' "phase: Completed" introduced delays that led to problems with missing input artifacts.
This change allows us to log the output artifacts earlier.
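The two signals in question, side by side on a finished workflow pod; the writer now keys off the pod phase rather than waiting for Argo's label to appear (snippet is illustrative):
```yaml
metadata:
  labels:
    workflows.argoproj.io/completed: 'true'   # Argo sets this late
status:
  phase: Succeeded                            # the pod phase arrives earlier
```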
* Added Dockerfile
* Added release deployment manifest
* Added OWNERS
* Switching to using MLMD service instead of direct DB access
* Adding licenses to the image
* Pinned Python's minor version
* Moved code to /backend/metadata_writer
Moved manifest to /manifests
* Added image building to CloudBuild
* Added Metadata Writer to release CloudBuild
* Added Metadata Writer to test scripts
* Finished the kustomization manifests
* Added Metadata Writer to marketplace manifests
* Added ServiceAccount, Role and RoleBinding for MW
* Fixed merge conflict
* Removed the debug deployment
* Forgot to add the chart templates for the SA and roles
* Specified the service account
* Switched to watching a single namespace
* Resolved feedback
Removed dev deployment comment from python code.
Added license.
Fixed the range of kubernetes package versions.
* More review fixes
* Extracted the metadata helper functions
* Improved the error message when context type is unexpected
* Fixed the import
* Checking the connection to MLMD
The latest tests started to have connection problems - "failed to connect to all addresses" and "Failed to pick subchannel".
* Improved the MLMD connection error logging
* Try creating MLMD client on each retry and using a different request
* Changed the MLMD connection check request
All get requests fail when the DB is empty, so we have to use a put request.
See https://github.com/google/ml-metadata/issues/28
* Using unbuffered IO to improve the logging latency
* Changed the URI scheme for the artifacts
* Cleanup
* Simplified the kubernetes config loading code
* Resolving the feedback
* add label and namespace to resource created inside pod
* fix
* done
* update existing configmap for better GC
* update dependency to make sure configmap got created before run script
* self fix
* use latest deployer base
* deployer
* all done
* upgrade deployer to 0.1.37 for bug bash
* all done
* done
* fix issue for MLMD
* done
* done
Co-authored-by: renmingu <40223865+renmingu@users.noreply.github.com>
* all done
* reuse existing
* reuse
Co-authored-by: renmingu <40223865+renmingu@users.noreply.github.com>
* visualization server needs a Kubernetes service account too
* add ksa for visualization server and use this ksa for standalone and
hosted deployment of visualization server
* Use server name as its KSA name
* add sa to pipeline.yaml
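Putting the three bullets together: a KSA named after the server, referenced from its deployment. A sketch; the shipped manifest may differ.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ml-pipeline-visualizationserver
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-pipeline-visualizationserver
spec:
  template:
    spec:
      serviceAccountName: ml-pipeline-visualizationserver
      # ...container spec elided
```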
* Updated component images to version bae654dc5c
* Updated components to version ff116b6f1a
* 0.1.40
* append old items
* fix line
Co-authored-by: renmingu <40223865+renmingu@users.noreply.github.com>
* Script to set up workload identity for standalone deployment
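Workload identity replaces exported key secrets by binding a KSA to a GSA; the KSA half of that binding looks roughly like this, with the account and project names hypothetical. The GSA side additionally needs a roles/iam.workloadIdentityUser binding for this KSA.
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pipeline-runner
  annotations:
    # GSA this KSA impersonates (hypothetical account/project)
    iam.gke.io/gcp-service-account: kfp-standalone@my-project.iam.gserviceaccount.com
```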
* Migrate tests to run on standalone + workload identity
* Fix test script
* Switch to static GSAs for testing, because they have a name length limit
* Add workload identity binding for argo
* Fix argo workload identity bindings
* Remove user-gcp-sa from tests
* Remove use_gcp_secret from xgboost sample
* Allow debugging tests locally
* Wait for policies to take effect
* Update deploy-pipeline-lite.sh
* Update deploy-pipeline-lite.sh
* [WIP] test gcloud auth list with test-runner sa
* Add namespace
* test again
* Use new image builder
* test again
* Remove debug code
* Remove usages of use_gcp_secret
* Fix unit test and tensorboard pod template
* Add debug code again to test
* Try waiting until workload identity bindings are ready
* Fix some other samples
* Fix parameterized tfx oss sample
* Add retry to image building
* Try fixing tfx oss sample
* Fix compiled tfx oss sample
* Update all google/cloud-sdk to latest
* Try fixing parameterized tfx oss sample again
* Also verify pipeline-runner ksa is working
* Fix parameterized_tfx_oss sample
* Update gcp-workload-identity-setup.sh
* Revert unneeded change
* Pin to new google/cloud-sdk
* Remove wrongly committed binaries
This change adds the necessary config-map related to the gRPC MLMD server.
To make the names clearer, this change also renames the existing
'metadata-configmap', which provides mysql configurations, to
'metadata-mysql-configmap'.
* Added manifest for deploying on aws using s3
* Revert "Added manifest for deploying on aws using s3"
This reverts commit 6a9c498c2c.
* Added readme and link to kubeflow-aws on how to deploy lightweight pipeline on AWS
* updated readme on how to deploy on aws
* Update README.md
* Update README.md
* add namespace to some run APIs
* update only the create run api
* add resourcereference for namespace runs
* add variables in const
* add types to toModel func
* bug fix
* strip the namespace resource reference when mapping to the db model
* add unit tests
* use gofmt
* replace belonging relationship reference with owner
* put a todo for further investigation of using namespace or uuid
* apply gofmt
* revert minor change
* Update model_converter.go
* update component sdk version
* bump python SDK and manifest version
* Revert "update component sdk version" to prevent conflict
This reverts commit 1fd6ddc8.
* Without version bump
* fix the delete caller
* return after delete
* reconciler removes old viewer CRD file that lacks the image specification
* add frontend comment
* remove accidental changes that are irrelevant
* Revise log message
* Add error handling
* add test
* tensorflow image check only applies to the tensorboard viewer type, so move it after the type check.
* Use of default image instead of validation
When switching to GKE workload identity, pods can't access the metadata server anymore by default, due to metadata concealment.
This can be unlocked by explicitly enabling hostNetwork for the pod.
https://cloud.google.com/kubernetes-engine/docs/how-to/protecting-cluster-metadata#concealment
This should be OK, as the proxy is an optional component; any user who considers this insecure can opt out of it.
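A sketch of the opt-in described above; hostNetwork is the documented mechanism, while the dnsPolicy companion setting is an assumption here:
```yaml
# Host networking lets the proxy agent reach the GCE metadata server
# despite metadata concealment.
spec:
  template:
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet   # usual companion setting; assumed
```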
* Retrieve pod logs from argo archive
* Added AWS instance profile IAM credential support for the minio client. Read the workflow status to find the Argo archive location for pod logs.
* fix minor typo, and enforce typing for minio client options
* Update helm chart for pipelines ui role with permission to access secret and workflow crd
* remove unnecessary type cast
* Fix bug: s3client should be a callable, so that iam token is refreshed
* add tfx permission guide
* add xgboost guide
* add a direct link to dataproc API
* Add tip
* Update README.md
* Update and apply suggestions
* Fix
* Update README.md
* Make gcs bucket name configurable + fix marketplace managed storage not running pipelines successfully
* gcsBucketName is computed from cloudsqlInstanceConnectionName + avoid code duplication
* Update README for MKP development. Remove managed SQL part and update version tag.
* Update guide for MKP deployment.
* Small fix.
* fix one missing thing
* add owner
* updated owners per comment, keep at least one from SHA for easy co-operation
* limit OWNERS scope first
* refine doc for MKP
* fix James comments for wording
* fix doc and mask pwd
* temp disable managed storage
* also update images to 0.1.31
* Add concatenated third party license to argo and minio images
* Add MPL dependencies source code in argoexec docker image
* Include source code of MPL dependencies of minio in its image
* Add source code for argo dependencies with MPL in argo images
* Updated workflow to release manually, included cloudbuild config to build each image, also added README for instructions
* docker env naming consistency
* Include release scripts and instructions in third_party/README.md
* Update README.md
* undo cloudbuild.yaml changes, update README
* Change argo and minio image tags in manifests
* Remove unneeded code
* Fix copyright year
* pass in secret
* fix
* use application name by default for database prefix
* bug fixes and bump kfp version
* Update application.yaml
* fix objectstore name
* fix objectstore name
* store db pwd as secret
* fix
* fix
* fix
* fix
* Working, though the request seems malformed
* Working with grpc-web. trying to push to cluster
* WIP
* With great hax come great success
* Begin moving some metadata UI pages to KFP
* Artifact list and details pages work! A lot of clean up is needed. Look for console.log and TODO
* Clean up
* Fixes filtering of artifact list
* More cleanup
* Revert ui deployment
* Updates tests
* Update envoy deployment
This change introduces the metadata component to pipeline-lite
installation. This installation:
1. Does not include metadata-ui
2. mysql installation is used instead of metadata-db
3. Replica count has been reduced to 1 instead of 3
* restructure
* working example
* working example
* move mysql
* moving minio and mysql out
* add gcp
* add files
* fix test
* extract parameters to single place
* update
* update readme
* update readme
* address pr comment
* Add visualization-server service to lightweight deployment
* Addressed PR suggestions
* Added field to determine if visualization service is active and fixed unit tests for visualization_server.go
* Additional small fixes
* port change from 88888 -> 8888
* version change from 0.1.15 -> 0.1.26
* removed visualization-server from base/kustomization.yaml
* Fixed visualization_server_test.go to reflect new changes
* Changed implementation to fail fast
* Changed host name to a constant provided by the environment
* Added retry and extracted isVisualizationServiceAlive logic to function
* Fixed deployment.yaml file
* Fixed serviceURL configuration issue
serviceURL is now properly obtained from the environment; the service IP address and port are used rather than the service name and namespace
* Added log message to indicate when visualization service is unreachable
* Addressed PR comments
* Removed _HTTP
* viewer controller is now namespaced so no need for cluster role
* our default namespaced install (kubeflow namespace) can also use Role instead of ClusterRole
* Viewer CRD controller running under namespace
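With the controller namespaced, a Role in the install namespace suffices; a hedged sketch, with the name, namespace, and verb list illustrative:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-pipeline-viewer-controller-role
  namespace: kubeflow
rules:
  - apiGroups: ['kubeflow.org']
    resources: ['viewers']
    verbs: ['create', 'get', 'list', 'watch', 'delete']
```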
* Change docker file and add manifest deployment yaml to support the new flag namespace
* Change docker file to support new flag namespace for viewer crd controller
* Modify kustomization.yaml and namespaced-install.yaml
* Change file name from ml-pipeline-viewer-crd-deployment to ml-pipeline-viewer-crd-deployment-patch
* Fix typo
* Remove some duplicate configs in namespaced-install
* clean up
* argo
* expose configuration for max number of viewers
* add sample how to configure
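A sketch of exposing that cap via a controller argument; the flag name is a guess for illustration, and the sample referenced in the commit shows the real knob:
```yaml
containers:
  - name: ml-pipeline-viewer-crd
    image: gcr.io/ml-pipeline/viewer-crd-controller   # illustrative
    args: ['-max_num_viewers=50']                     # assumed flag name
```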
* Revert "argo"
This reverts commit 3ff0d07679.
* update namespaced-install.yaml
* kfp frontend API service can configure minio client params through env vars
* minio endpoint is composed from host and namespace to support k8s yaml
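The composition the bullet above describes, as deployment env; the variable names are assumed for illustration, with the endpoint resolving to `<host>.<namespace>`:
```yaml
env:
  - name: MINIO_HOST         # assumed variable names
    value: minio-service
  - name: MINIO_NAMESPACE
    value: kubeflow
# client endpoint becomes minio-service.kubeflow
```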
* Added kustomize patch for pipeline-ui deploy