145 Commits

Author SHA1 Message Date
Jakub Tużnik b92f971326 Provide ScaleDownStatusProcessor with more info about scale-down results 2019-04-30 13:49:06 +02:00
Jakub Tużnik 402c643851 Modify the info passed to ScaleDownStatusProcessor when empty nodes are deleted
Previously, if any of the nodes fails to delete, the processor gets
a ScaleDownError status. After this commit, it will get the list of
nodes that were successfully deleted.
2019-04-26 15:54:11 +02:00
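The behaviour described in commit 402c643851 (reporting the nodes that were deleted even when some deletions fail) might look roughly like the sketch below. This is only an illustration: `deleteEmptyNodes`, `errDeleteFailed`, and the callback signature are hypothetical names, not the autoscaler's actual API, and sequential deletion is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errDeleteFailed is a hypothetical error used only by the demo below.
var errDeleteFailed = errors.New("cloud provider refused deletion")

// deleteEmptyNodes tries to delete every node in order. It returns the nodes
// that were deleted successfully and the first error encountered, if any,
// so a caller can report partial success instead of a bare error status.
func deleteEmptyNodes(nodes []string, deleteNode func(string) error) ([]string, error) {
	var deleted []string
	var firstErr error
	for _, n := range nodes {
		if err := deleteNode(n); err != nil {
			if firstErr == nil {
				firstErr = fmt.Errorf("deleting %s: %w", n, err)
			}
			continue
		}
		deleted = append(deleted, n)
	}
	return deleted, firstErr
}

func main() {
	// Hypothetical deleter that fails for exactly one node.
	failing := func(name string) error {
		if name == "node-b" {
			return errDeleteFailed
		}
		return nil
	}
	deleted, err := deleteEmptyNodes([]string{"node-a", "node-b", "node-c"}, failing)
	fmt.Println(deleted, err)
}
```

Returning both the partial result and the error lets a status processor distinguish "nothing was deleted" from "some nodes were deleted before a failure".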
Jiaxin Shan 83ae66cebc Consider GPU utilization in scaling down 2019-04-04 01:12:51 -07:00
Jiaxin Shan 90666881d3 Move GPULabel and GPUTypes to cloud provider 2019-03-25 13:03:01 -07:00
Marcin Wielgus 99f1dcf9d2 Merge branch 'master' into crc-fix-error-format 2019-02-01 17:22:57 +01:00
Vivek Bagade 79ef3a6940 unexporting methods in utils.go 2019-01-25 00:06:03 +05:30
Jacek Kaniuk 0c64e0932a Tainting unneeded nodes as PreferNoSchedule 2019-01-21 13:06:50 +01:00
CodeLingo Bot c0603afdeb Fix error format strings according to best practices from CodeReviewComments

Reverted incorrect change to error format string

Signed-off-by: CodeLingo Bot <hello@codelingo.io>
Signed-off-by: CodeLingoBot <hello@codelingo.io>
Signed-off-by: CodeLingo Bot <hello@codelingo.io>
Signed-off-by: CodeLingo Bot <bot@codelingo.io>

Resolve conflict

Signed-off-by: CodeLingo Bot <hello@codelingo.io>
Signed-off-by: CodeLingoBot <hello@codelingo.io>
Signed-off-by: CodeLingo Bot <hello@codelingo.io>
Signed-off-by: CodeLingo Bot <bot@codelingo.io>

Fix error strings in test cases to remedy failing tests

Signed-off-by: CodeLingo Bot <bot@codelingo.io>

Fix more error strings to remedy failing tests

Signed-off-by: CodeLingo Bot <bot@codelingo.io>
2019-01-11 09:10:31 +13:00
Maciej Pytel 9060014992 Use listers in scale-down 2018-12-31 14:55:38 +01:00
lsytj0413 672dddd23a refactor(*): fix golint warning 2018-12-19 10:04:08 +08:00
Andrew McDermott fd3fd85f26 UPSTREAM: <carry>: handle nil nodeGroup in calculateScaleDownGpusTotal
Explicitly handle nil as a return value for nodeGroup in
`calculateScaleDownGpusTotal()` when `NodeGroupForNode()` is called
for GPU nodes that don't exist. The current logic generates a runtime
exception:

    "reflect: call of reflect.Value.IsNil on zero Value"

Looking through the rest of the tree, all the other places that use
this pattern also explicitly check whether `nodeGroup == nil`
first.

This change now completes the pattern in
`calculateScaleDownGpusTotal()`.

Looking at the other occurrences of this pattern we see:

```
File: clusterstate/clusterstate.go
488:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {

File: core/utils.go
231:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
322:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
394:27:			if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
461:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {

File: core/scale_down.go
185:6:		if reflect.ValueOf(nodeGroup).IsNil() {
608:27:			if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
747:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
1010:25:	if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
```

with the notable exception at core/scale_down.go:185 which is
`calculateScaleDownGpusTotal()`.

With this change, and invoking the autoscaler with:

```
...
      --max-nodes-total=24 \
      --cores-total=8:128 \
      --memory-total=4:256 \
      --gpu-total=nvidia.com/gpu:0:16 \
      --gpu-total=amd.com/gpu:0:4 \
...
```

I no longer see a runtime exception.
2018-12-05 18:54:07 +00:00
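The pattern that commit fd3fd85f26 completes can be reproduced in isolation. The sketch below is a minimal illustration, assuming a hypothetical `NodeGroup` interface and `fakeGroup` type that stand in for the real cloud provider types; only the `nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil()` check is taken from the commit message.

```go
package main

import (
	"fmt"
	"reflect"
)

// NodeGroup is a hypothetical stand-in for the cloud provider's node group interface.
type NodeGroup interface {
	ID() string
}

// fakeGroup is a hypothetical implementation used only for this illustration.
type fakeGroup struct{ id string }

func (g *fakeGroup) ID() string { return g.id }

// isNilNodeGroup reports whether ng is an untyped nil interface or an
// interface value holding a nil pointer. The `ng == nil` test must come
// first: reflect.ValueOf(nil) yields the zero Value, and calling IsNil on
// the zero Value panics.
func isNilNodeGroup(ng NodeGroup) bool {
	return ng == nil || reflect.ValueOf(ng).IsNil()
}

func main() {
	var untyped NodeGroup         // nil interface
	var typed *fakeGroup          // nil pointer
	var wrapped NodeGroup = typed // non-nil interface holding a nil pointer

	fmt.Println(isNilNodeGroup(untyped))
	fmt.Println(isNilNodeGroup(wrapped))
	fmt.Println(isNilNodeGroup(&fakeGroup{id: "group-1"}))
}
```

The short-circuit `||` is what makes the check safe: the reflection call only runs once the interface is known to hold a value.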
Łukasz Osipiuk 016bf7fc2c Use k8s.io/klog instead of github.com/golang/glog 2018-11-26 17:30:31 +01:00
Alex Price 4ae7acbacc add flags to ignore daemonsets and mirror pods when calculating resource utilization of a node
Adds the flags --ignore-daemonsets-utilization and --ignore-mirror-pods-utilization
(both default to false). When enabled, they factor DaemonSet and mirror pods out of
the resource utilization calculated for a node.
2018-11-23 15:24:25 +11:00
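The effect of the two flags from commit 4ae7acbacc can be sketched with a simplified CPU-only calculation. Everything here is an assumption made for illustration: the `pod` struct, its boolean fields, and `cpuUtilization` are hypothetical, and the real implementation may treat allocatable resources differently.

```go
package main

import "fmt"

// pod is a hypothetical, simplified pod record; real code would inspect
// owner references and annotations rather than boolean fields.
type pod struct {
	cpuMilli    int64
	isDaemonSet bool
	isMirror    bool
}

// cpuUtilization returns requested CPU divided by allocatable CPU,
// optionally leaving DaemonSet and mirror pods out of the sum.
func cpuUtilization(pods []pod, allocatableMilli int64, ignoreDaemonSets, ignoreMirrorPods bool) float64 {
	if allocatableMilli <= 0 {
		return 0
	}
	var requested int64
	for _, p := range pods {
		if ignoreDaemonSets && p.isDaemonSet {
			continue
		}
		if ignoreMirrorPods && p.isMirror {
			continue
		}
		requested += p.cpuMilli
	}
	return float64(requested) / float64(allocatableMilli)
}

func main() {
	pods := []pod{
		{cpuMilli: 500},
		{cpuMilli: 250, isDaemonSet: true},
		{cpuMilli: 250, isMirror: true},
	}
	fmt.Println(cpuUtilization(pods, 2000, false, false)) // all pods counted
	fmt.Println(cpuUtilization(pods, 2000, true, true))   // only the regular pod counted
}
```

With both flags enabled, a node that runs mostly DaemonSet and mirror pods reports lower utilization and is therefore more likely to be considered for scale-down.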
Łukasz Osipiuk 55fc1e2f00 Store NodeGroup in ScaleUpRequest and ScaleDownRequest 2018-10-30 18:03:04 +01:00
Jakub Tużnik 71111da20c Add a scale down status processor, refactor so that there's more scale down info available to it 2018-09-12 14:52:20 +02:00
Pengfei Ni 1dd0147d9e Add more events for CA 2018-07-09 15:42:05 +08:00
Aleksandra Malinowska 800ee56b34 Refactor and extend GPU metrics error types 2018-07-05 13:13:11 +02:00
Karol Gołąb aae4d1270a Make GetGpuTypeForMetrics more robust 2018-06-26 21:35:16 +02:00
Marcin Wielgus f2e76e2592 Merge pull request #1008 from krzysztof-jastrzebski/master
Move removing unneeded autoprovisioned node groups to node group manager
2018-06-22 21:01:36 +02:00
Karol Gołąb 5eb7021f82 Add GPU-related scaled_up & scaled_down metrics (#974)
* Add GPU-related scaled_up & scaled_down metrics

* Fix name to match SD naming convention

* Fix import after master rebase

* Change the logic to include GPU-being-installed nodes
2018-06-22 21:00:52 +02:00
Krzysztof Jastrzebski 2df2568841 Move removing unneeded autoprovisioned node groups to node group manager 2018-06-22 14:26:12 +02:00
Nic Doye ebadbda2b2 issues/933 Consider making UnremovableNodeRecheckTimeout configurable 2018-06-18 11:54:14 +01:00
Łukasz Osipiuk b7323bc0d1 Respect GPU limits in scale_up 2018-06-14 15:46:58 +02:00
Łukasz Osipiuk 9f75099d2c Restructure checking resource limits in scale_up.go
Preparatory work before introducing GPU limits
2018-06-13 19:00:37 +02:00
Łukasz Osipiuk 087a5cc9a9 Respect GPU limits in scale_down 2018-06-13 14:19:59 +02:00
Łukasz Osipiuk 1fa44a4d3a Fix bug resulting in resource limits not being enforced in scale_down 2018-06-11 16:39:07 +02:00
Łukasz Osipiuk 519064e1ec Extract isNodeBeingDeleted function 2018-06-11 14:21:07 +02:00
Łukasz Osipiuk 6c57a01fc9 Restructure checking resource limits in scale_down.go 2018-06-11 14:02:40 +02:00
Łukasz Osipiuk 9c61477d25 Do not return error when getting cpu/memory capacity of node 2018-06-08 15:04:57 +02:00
Krzysztof Jastrzebski adad14c2c9 Delete autoprovisioned node pool after all nodes are deleted. 2018-05-28 14:22:18 +02:00
Karol Gołąb 4c710950de Move ClusterStateRegistry to StaticAutoscaler
AutoscalingContext is basically configuration plus a few static helpers
and API handles.
ClusterStateRegistry is state, so it moves alongside the other
state-keeping objects.
2018-05-24 13:03:01 +02:00
Aleksandra Malinowska ffeebde8d8 Add support for rescheduled pods with the same name in drain 2018-05-10 12:00:56 +02:00
Marcin Wielgus 9c5728fd74 Merge pull request #836 from kgolab/kg-clean-up-004
Use timestamp argument
2018-05-08 20:24:37 +02:00
Karol Gołąb 53b1c6a394 Use timestamp argument 2018-05-08 13:08:30 +02:00
Karol Gołąb da16642bcf Make the code slightly more idiomatic go 2018-05-08 11:35:01 +02:00
Beata Skiba 054f6d8650 Merge pull request #794 from krzysztof-jastrzebski/pods
Refactor cluster autoscaler builder and add pod list processor.
2018-04-26 13:08:56 +02:00
Krzysztof Jastrzebski 88b769b324 Refactor cluster autoscaler builder and add pod list processor. 2018-04-26 12:37:51 +02:00
Aleksandra Malinowska 3d599bfabe Rephrase unremovable node warning 2018-04-18 13:43:32 +02:00
Aleksandra Malinowska 4c594db7f8 Run spellchecker 2018-03-15 15:47:49 +01:00
anniedy bf59e3daa5 Typo fix unneded->[unneeded] (#623)
* Update clusterstate.md

* Update scale_down.go

* Update static_autoscaler.go
2018-02-07 17:36:58 +01:00
Marcin Wielgus 439fd3c9ec Merge pull request #411 from krzysztof-jastrzebski/priority
Adds priority preemption support to cluster autoscaler.
2017-11-08 09:09:26 +01:00
Edward Tsang 4104a91991 more spelling fixes 2017-11-02 14:21:36 -07:00
Maciej Pytel c376ef3c87 Add metrics for autoprovisioning 2017-10-31 17:42:58 +01:00
Maciej Pytel 9c2ebccbfe Write events when autoprovisioned nodegroup is created / deleted 2017-10-25 17:39:30 +02:00
Krzysztof Jastrzebski 56ac572666 Adds resource limits to cloud provider. 2017-10-23 16:06:56 +02:00
Krzysztof Jastrzebski d9c00e5ce1 Adds priority preemption support to cluster autoscaler. 2017-10-23 09:54:56 +02:00
Aleksandra Malinowska 4c31a57374 fix leaking taints in case of cloud provider error on node deletion 2017-09-22 17:55:48 +02:00
Marcin Wielgus f04113d746 Remove TargetSize() from loops iterating over nodes 2017-09-13 22:33:17 +02:00
Aleksandra Malinowska 197b05b180 respect minimum cores/memory limit during scale down 2017-09-13 10:10:47 +02:00
Aleksandra Malinowska 187c02693e Taint empty nodes to be deleted 2017-09-12 17:40:05 +02:00
Marcin Wielgus 3039a0e813 Merge pull request #319 from krzysztof-jastrzebski/core-test
Core/static_autoscaler.go unit tests.
2017-09-12 13:11:11 +02:00
Beata Skiba eba0fa2f95 Remove nodes that are not in the cluster from unremovableNodes 2017-09-11 20:01:02 +02:00
Krzysztof Jastrzebski 0aec68a46d Core/static_autoscaler.go unit tests. Current time usage refactoring. 2017-09-11 15:07:21 +02:00
Marcin Wielgus db63ac3a18 Merge pull request #324 from aleksandra-malinowska/scale-down-pod-not-found
Add checking for pod not found error on eviction
2017-09-11 15:10:08 +05:30
Beata Skiba 6e5784a519 Always add empty nodes to unneeded nodes 2017-09-08 15:55:18 +02:00
Aleksandra Malinowska fbc8462b10 Add checking for not found error 2017-09-08 15:45:44 +02:00
Marcin Wielgus f9cabf3a1a Merge pull request #297 from bskiba/additional-k
Only consider up to 10% of the nodes as additional candidates for scale down
2017-09-07 04:34:23 +05:30
Sergey Lanzman 415f53cdea Change from deprecated Core to CoreV1 for kube client 2017-09-04 22:16:21 +03:00
Beata Skiba a6c18b87d2 Only consider up to 10% of the nodes as additional candidates for scale down. 2017-09-04 17:37:02 +02:00
Marcin Wielgus bcc8cded64 Clean up empty autoprovisioned node groups 2017-09-04 13:53:07 +02:00
Marcin Wielgus c0b48e4a15 Merge pull request #285 from mwielgus/loglevel
Set verbosity for each of the glog.Info logs
2017-09-01 16:42:11 +05:30
Marcin Wielgus 2d8f59e23d Set verbosity for each of the glog.Info logs 2017-09-01 12:34:29 +02:00
Beata Skiba 576e4105db Make ScaleDownNonEmptyCandidatesCount a flag. 2017-08-31 15:05:06 +02:00
Beata Skiba 4560cc0a85 Keep maximum 30 candidates for scale down with drain 2017-08-31 14:58:40 +02:00
Marcin Wielgus 191d140107 Don't increase pod graceful termination 2017-08-28 16:54:19 +02:00
Marcin Wielgus 6ad7ca21e8 Merge pull request #265 from MaciekPytel/ignore_unneded_if_min_size
Skip nodes in min-sized groups in scale-down simulation
2017-08-28 19:40:53 +05:30
Maciej Pytel 2f6dd8aefc Skip nodes in min-sized groups in scale-down simulation
Currently we track whether those nodes can be removed and only
skip them at the execution step. Since checking whether a node is
unneeded is pretty expensive, it's better to filter them out
early.
2017-08-28 15:48:41 +02:00
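The early filtering described in commit 2f6dd8aefc can be sketched as a cheap pre-pass before the expensive simulation. The `group` struct, the node-to-group map, and `filterCandidates` are hypothetical simplifications, and dropping nodes with no known group is an assumption of this sketch.

```go
package main

import "fmt"

// group is a hypothetical, simplified node group with its current and minimum size.
type group struct {
	size, minSize int
}

// filterCandidates drops nodes whose group is already at (or below) its
// minimum size, so the expensive "is this node unneeded?" check never runs for them.
func filterCandidates(nodeToGroup map[string]group, nodes []string) []string {
	var out []string
	for _, n := range nodes {
		g, ok := nodeToGroup[n]
		if !ok {
			continue // assumption: nodes without a known group are not candidates
		}
		if g.size <= g.minSize {
			continue // removing a node would violate the group's minimum size
		}
		out = append(out, n)
	}
	return out
}

func main() {
	nodeToGroup := map[string]group{
		"node-a": {size: 3, minSize: 1},
		"node-b": {size: 1, minSize: 1},
	}
	fmt.Println(filterCandidates(nodeToGroup, []string{"node-a", "node-b", "node-c"}))
}
```

The size comparison is a constant-time check, so running it first keeps the costly per-node simulation limited to nodes that could actually be removed.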
Marcin Wielgus 718e5db78e Run node drain/delete in a separate goroutine 2017-08-28 12:12:31 +02:00
Maciej Pytel fa53e52ed9 Skip node in scale-down if it was recently found unremovable 2017-08-25 17:21:08 +02:00
Beata Skiba 44f69c6706 Extract deleting empty nodes to a separate function. 2017-08-22 16:09:42 +02:00
Beata Skiba 14df1b808b Drill down scale down metrics
Split scale down duration into three parts:
1. Find nodes to remove
2. Node deletion
3. Misc operations
2017-08-18 14:17:02 +02:00
Marcin Wielgus 9116e4c08c Compilation fix for CA after godeps update 2017-08-11 17:56:47 +02:00
Marcin Wielgus 4580e1dc45 Fix getEmptyNodes function in CA 2017-08-07 22:21:41 +02:00
Aleksandra Malinowska ab8323e8dc fix some logs in scale down 2017-07-20 10:33:42 +02:00
fate-grand-order 5b230a45ee correct some misspells for cluster-autoscaler/core 2017-07-13 17:53:59 +08:00
Aleksandra Malinowska 9f54934229 add annotation 2017-07-06 14:47:32 +02:00
Marcin Wielgus fc43808149 Godeps bump for CA 2017-07-03 22:05:11 +02:00
Marcin Wielgus 2cd532ebfe Don't calculate utilization and run scale down simulations for unmanaged nodes 2017-06-20 16:57:30 +02:00
Maciej Pytel 767367c866 Fix typos related to max-graceful-termination-sec 2017-06-14 14:14:21 +02:00
Marcin Wielgus 69c77791a2 Fix error types 2017-06-12 21:26:50 +02:00
Maciej Pytel 3f8ca51768 Use typed errors in scale down 2017-05-18 14:09:15 +02:00
Maciej Pytel 7a21a68b56 Add metrics counting CA operations 2017-05-15 13:03:00 +02:00
Marcin Wielgus 42c177b68f Add deletion safety margin to node drain 2017-05-08 11:47:33 +02:00
Marcin Wielgus 34eb4973f8 Fix imports in cluster autoscaler after migrating it from contrib 2017-04-18 15:42:04 +02:00
Maciej Pytel 0b74a3bd25 Cluster-Autoscaler: update event name 2017-04-10 14:03:21 +02:00
Maciej Pytel 72c885b800 Cluster-Autoscaler: reset scale-down on unready cluster 2017-03-22 17:17:59 +01:00
Maciej Pytel 39162f0860 Cluster-Autoscaler: evict pods instead of deleting them 2017-03-10 16:18:47 +01:00
Maciej Pytel 5d2c675c8e Cluster-Autoscaler: update scale down status 2017-03-08 11:51:20 +01:00
Marcin Wielgus 27b797f541 Cluster-Autoscaler: skip nodes currently under deletion in scale down 2017-03-07 14:59:15 +01:00
Kubernetes Submit Queue 39fa783ad7 Merge pull request https://github.com/kubernetes/contrib/pull/2451 from mwielgus/pdb-ca
Automatic merge from submit-queue

Cluster-autoscaler: include PodDisruptionBudget in drain - part 1/2

In part 1 of 2 we skip nodes that have a pod with 0 poddisruptionallowed. Part 2/2 will delete pods using evict.

cc: @jszczepkowski @MaciekPytel @davidopp @fgrzadkowski
2017-03-06 09:27:50 -08:00
Marcin Wielgus 5b4441083a Cluster-autoscaler: include PodDisruptionBudget in drain - part 1/2 2017-03-06 17:15:04 +01:00
Maciej Pytel d3bf5d3d51 Cluster-Autoscaler: log events on status configmap 2017-03-06 12:21:24 +01:00
Marcin Wielgus 2ffaddb7c0 Cluster-autoscaler: lint 2017-03-02 15:15:07 +01:00
Marcin Wielgus 72a47dc2b2 Cluster-autoscaler: update code for 1.6 k8s sync 2017-03-02 14:34:49 +01:00
Yusuke Kuoka baee799524 cluster-autoscaler: Dynamic Reconfiguration via ConfigMaps
Adds a new optional flag named `configmap` to specify the name of a configmap containing node group specs.

The configmap is polled every `scan-interval` seconds to reconfigure cluster-autoscaler dynamically at runtime.

Example usage:

```
./cluster-autoscaler --v=4 --cloud-provider=aws --skip-nodes-with-local-storage=false --logtostderr --leader-elect=false --configmap=cluster-autoscaler
```

The configmap would look like:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: cluster-autoscaler
  namespace: kube-system
data:
  settings: |-
    {
      "nodeGroups": [
        {
          "minSize": 1,
          "maxSize": 2,
          "name": "kubeawstest-nodepool1-AutoScaleWorker-1VWD4GAVG35L5"
        }
      ]
    }
```

Other notes:

* Make namespace default to "kube-system"
according to https://github.com/kubernetes/contrib/pull/2226#discussion_r94144267

* Trigger a full-recreate on a configuration change
according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-269617410

* Introduced `autoscaler/` and moved all the dynamic/recreatable-at-runtime parts of autoscaler into there (Update: the package is now named `core` according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-273071663)

* Extracted the core of CA(=`func Run()` in `main.go`) into `Autoscaler`

* `DynamicAutoscaler` is a wrapper around `Autoscaler` which achieves reconfiguration of CA by recreating an `Autoscaler` instance on a configmap change.

* Moved `scale_down*.go`, `scale_up*.go` and `utils*.go` into the `autoscaler` package accordingly because they seemed to be meant to be collocated in the same package as the core of CA (which is now implemented as `Autoscaler`)

* Moved the `createEventRecorder` func from the `main` package to the `utils/kubernetes` package to make it importable from both `main` and `autoscaler`
2017-02-24 20:36:47 +09:00
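The `settings` payload shown in commit baee799524 is plain JSON, so decoding it might look like the sketch below. The struct and function names (`nodeGroupSpec`, `settings`, `parseSettings`) are hypothetical, and the `minSize`/`maxSize` sanity check is an added assumption rather than something the commit message describes; only the JSON field names come from the example ConfigMap.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeGroupSpec mirrors one entry of the "nodeGroups" array in the example ConfigMap.
type nodeGroupSpec struct {
	MinSize int    `json:"minSize"`
	MaxSize int    `json:"maxSize"`
	Name    string `json:"name"`
}

// settings mirrors the top-level JSON object stored under the "settings" key.
type settings struct {
	NodeGroups []nodeGroupSpec `json:"nodeGroups"`
}

// parseSettings decodes the JSON document and rejects groups whose minimum
// size exceeds their maximum size (a validation assumed for this sketch).
func parseSettings(raw string) (settings, error) {
	var s settings
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return settings{}, fmt.Errorf("parsing settings: %w", err)
	}
	for _, g := range s.NodeGroups {
		if g.MinSize > g.MaxSize {
			return settings{}, fmt.Errorf("node group %q: minSize %d exceeds maxSize %d", g.Name, g.MinSize, g.MaxSize)
		}
	}
	return s, nil
}

func main() {
	raw := `{"nodeGroups":[{"minSize":1,"maxSize":2,"name":"kubeawstest-nodepool1-AutoScaleWorker-1VWD4GAVG35L5"}]}`
	s, err := parseSettings(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", s.NodeGroups[0])
}
```

A polling loop could call a function like this on every `scan-interval` tick and recreate the autoscaler only when the decoded settings differ from the previous ones.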