feat: wip implement `CloudProvider` interface boilerplate for `kwok` provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: add builder for `kwok`
- add logic to scale up and scale down nodes in `kwok` provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: wip parse node templates from file
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: add short README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: implement remaining things
- to get the provider in a somewhat working state
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: add in-cluster `kwok` as pre-requisite in the README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: templates file not correctly marshalling into node list
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: `invalid leading UTF-8 octet` error during template parsing
- remove encoding using `gob`
- not required
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: use lister to get and list
- instead of uncached kube client
- add lister as a field on the provider and nodegroup struct
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: `did not find nodegroup annotation` error
- CA was thinking the annotation is not present even though it is
- fix a bug with parsing annotation
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: CA node recognizing fake nodegroups
- add provider ID to nodes in the format `kwok:<node-name>`
- fix invalid `KwokManagedAnnotation`
- sanitize template nodes (remove `resourceVersion` etc.,)
- not sanitizing the node leads to error during creation of new nodes
- abstract code to get NG name into a separate function `getNGNameFromAnnotation`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: node not getting deleted
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add empty test file
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: add OWNERS file
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: wip kwok provider config
- add samples for static and dynamic template nodes
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: wip implement pulling node templates from cluster
- add status field to kwok provider config
- this is to capture how the nodes would be grouped by (can be annotation or label)
- use kwok provider config status to get ng name from the node template
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: syntax error in calling `loadNodeTemplatesFromCluster`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: first draft of dynamic node templates
- this allows node templates to be pulled from the cluster
- instead of having to specify static templates manually
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: syntax error
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: abstract out related code into separate files
- use named constants instead of hardcoded values
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: cleanup kwok nodes when CA is exiting
- so that the user doesn't have to cleanup the fake nodes themselves
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: return `nil` instead of err for `HasInstance`
- because there is no underlying cloud provider (hence no reason to return `cloudprovider.ErrNotImplemented`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: start working on tests for kwok provider config
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: add `gpuLabelKey` under `nodes` field in kwok provider config
- fix validation for kwok provider config
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: add motivation doc
- update README with more details
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: update kwok provider config example to support pulling gpu labels and types from existing providers
- still needs to be implemented in the code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: wip update kwok provider config to get gpu label and available types
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: wip read gpu label and available types from specified provider
- add available gpu types in kwok provider config status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: add validation for gpu fields in kwok provider config
- load gpu related fields in kwok provider config status
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: implement `GetAvailableGPUTypes`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: add support to install and uninstall kwok
- add option to disable installation
- add option to manually specify kwok release tag
- add future scope in readme
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: add future scope 'evaluate adding support to check if kwok controller already exists'
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: vendor conflict and cyclic import
- remove support to get gpu config from the specified provider (can't be used because leads to cyclic import)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: add a TODO 'get gpu config from other providers'
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename `file` -> `configmap`
- load config and templates from configmap instead of file
- move `nodes` and `nodegroups` config to top level
- add helper to encode configmap data into `[]bytes`
- add helper to get current pod namespace
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: add new options to the kwok provider config
- auto install kwok only if the version is >= v0.4.0
- add test for `GPULabel()`
- use `kubectl apply` way of installing kwok instead of kustomize
- add test for kwok helpers
- add test for kwok config
- inject service account name in CA deployment
- add example configmap for node templates and kwok provider config in CA helm chart
- add permission to create `clusterrolebinding` (so that kwok provider can create a clusterrolebinding with `cluster-admin` role and create/delete upstream manifests)
- update kwok provider sample configs
- update `README`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: update go.mod to use v1.28 packages
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: `go mod tidy` and `go mod vendor` (again)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: kwok installation code
- add functions to create and delete clusterrolebinding to create kwok resources
- refactor kwok install and uninstall fns
- delete manifests in the opposite order of install ]
- add cleaning up left-over kwok installation to future scope
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: nil ptr error
- add `TODO` in README for adding docs around kwok config fields
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: remove code to automatically install and uninstall `kwok`
- installing/uninstalling requires strong permissions to be granted to `kwok`
- granting strong permissions to `kwok` means granting strong permissions to the entire CA codebase
- this can pose a security risk
- I have removed the code related to install and uninstall for now
- will proceed after discussion with the community
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: run `go mod tidy` and `go mod vendor`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: add permission to create nodes
- to fix permissions error for kwok provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add more unit tests
- add tests for kwok helpers
- fix and update kwok config tests
- fix a bug where gpu label was getting assigned to `kwokConfig.status.key`
- expose `loadConfigFile` -> `LoadConfigFile`
- throw error if templates configmap does not have `templates` key (value of which is node templates)
- finish test for `GPULabel()`
- add tests for `NodeGroupForNode()`
- expose `loadNodeTemplatesFromConfigMap` -> `LoadNodeTemplatesFromConfigMap`
- fix `KwokCloudProvider`'s kwok config was empty (this caused `GPULabel()` to return empty)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: abstract provider ID code into `getProviderID` fn
- fix provider name in test `kwok` -> `kwok:kind-worker-xxx`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: run `go mod vendor` and `go mod tidy
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs(cloudprovider/kwok): update info on creating nodegroups based on `hostname/label`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor(charts): replace fromLabelKey value `"kubernetes.io/hostname"` -> `"kwok-nodegroup"`
- `"kubernetes.io/hostname"` leads to infinite scale-up
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
feat: support running CA with kwok provider locally
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: use global informer factory
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: use `fromNodeLabelKey: "kwok-nodegroup"` in test templates
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: `Cleanup()` logic
- clean up only nodes managed by the kwok provider
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix/refactor: nodegroup creation logic
- fix issue where fake node was getting created which caused fatal error
- use ng annotation to keep track of nodegroups
- (when creating nodegroups) don't process nodes which don't have the right ng nabel
- suffix ng name with unix timestamp
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor/test(cloudprovider/kwok): write tests for `BuildKwokProvider` and `Cleanup`
- pass only the required node lister to cloud provider instead of the entire informer factory
- pass the required configmap name to `LoadNodeTemplatesFromConfigMap` instead of passing the entire kwok provider config
- implement fake node lister for testing
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add test case for dynamic templates in `TestNodeGroupForNode`
- remove non-required fields from template node
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add tests for `NodeGroups()`
- add extra node template without ng selector label to add more variability in the test
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: write tests for `GetNodeGpuConfig()`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add test for `GetAvailableGPUTypes`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test: add test for `GetResourceLimiter()`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(cloudprovider/kwok): add tests for nodegroup's `IncreaseSize()`
- abstract error msgs into variables to use them in tests
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(cloudprovider/kwok): add test for ng `DeleteNodes()` fn
- add check for deleting too many nodes
- rename err msg var names to make them consistent
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(cloudprovider/kwok): add tests for ng `DecreaseTargetSize()`
- abstract error msgs into variables (for easy use in tests)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(cloudprovider/kwok): add test for ng `Nodes()`
- add extra test case for `DecreaseTargetSize()` to check lister error
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(cloudprovider/kwok): add test for ng `TemplateNodeInfo`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(cloudprovider/kwok): improve tests for `BuildKwokProvider()`
- add more test cases
- refactor lister for `TestBuildKwokProvider()` and `TestCleanUp()`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(cloudprovider/kwok): add test for ng `GetOptions`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
test(cloudprovider/kwok): unset `KWOK_CONFIG_MAP_NAME` at the end of the test
- not doing so leads to failure in other tests
- remove `kwokRelease` field from kwok config (not used anymore) - this was causing the tests to fail
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: bump CA chart version
- this is because of changes made related to kwok
- fix type `everwhere` -> `everywhere`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: fix linting checks
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: address CI lint errors
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: generate helm docs for `kwokConfigMapName`
- remove `KWOK_CONFIG_MAP_KEY` (not being used in the code)
- bump helm chart version
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: revise the outline for README
- add AEP link to the motivation doc
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: wip create an outline for the README
- remove `kwok` field from examples (not needed right now)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: add outline for ascii gifs
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: rename env variable `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: update README with info around installation and benefits of using kwok provider
- add `Kwok` as a provider in main CA README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: run `go mod vendor`
- remove TODOs that are not needed anymore
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: finish first draft of README
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
fix: env variable in chart `KWOK_CONFIG_MAP_NAME` -> `KWOK_PROVIDER_CONFIGMAP`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: remove redundant/deprecated code
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: bump chart version `9.30.1` -> `9.30.2`
- because of kwok provider related changes
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: fix typo `offical` -> `official`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: remove debug log msg
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: add links for getting help
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: fix type in log `external cluster` -> `cluster`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
chore: add newline in chart.yaml to fix CI lint
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: fix mistake `sig-kwok` -> `sig-scheduling`
- kwok is a part if sig-scheduling (there is no sig-kwok)
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
docs: fix type `release"` -> `"release"`
Signed-off-by: vadasambar <surajrbanakar@gmail.com>
refactor: pass informer instead of lister to cloud provider builder fn
Signed-off-by: vadasambar <surajrbanakar@gmail.com>