445 Commits

Author SHA1 Message Date
Andrew McDermott 5bc77f051c UPSTREAM: <carry>: fix calculation of max cluster size
When scaling up, the calculation for the maximum size of the cluster
based on `--max-nodes-total` doesn't take into account any nodes that
are in the process of coming up. This allows the cluster to grow
beyond the size specified.

With this change I now see:

scale_up.go:266] 21 other pods are also unschedulable
scale_up.go:423] Best option to resize: openshift-cluster-api/amcdermo-ca-worker-us-east-2b
scale_up.go:427] Estimated 18 nodes needed in openshift-cluster-api/amcdermo-ca-worker-us-east-2b
scale_up.go:432] Capping size to max cluster total size (23)
static_autoscaler.go:275] Failed to scale up: max node total count already reached
2018-12-18 17:05:19 +00:00
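
The capping described in commit 5bc77f051c above can be sketched as follows. This is a minimal illustration, not the autoscaler's code: `capNewNodes` and its parameters are hypothetical names, the numbers in `main` are made up, and treating a limit of 0 as "no limit" is an assumption.

```go
package main

import "fmt"

// capNewNodes returns how many of the estimated new nodes may be added
// without exceeding maxNodesTotal. Nodes that are still coming up count
// against the limit, which is the point of the fix described above.
// A maxNodesTotal of 0 or less is treated as "no limit" (assumption).
func capNewNodes(estimated, registered, upcoming, maxNodesTotal int) int {
	if maxNodesTotal <= 0 {
		return estimated
	}
	room := maxNodesTotal - (registered + upcoming)
	if room < 0 {
		room = 0
	}
	if estimated > room {
		return room
	}
	return estimated
}

func main() {
	// Hypothetical cluster: 5 registered nodes, 18 more already requested.
	fmt.Println(capNewNodes(18, 5, 18, 23))
	// Same cluster with nothing in flight: the request is capped to the
	// remaining room under the limit.
	fmt.Println(capNewNodes(30, 5, 0, 23))
}
```

Counting only registered nodes would let the first call return 18 and push the cluster to 41 nodes, which is the overshoot the commit message describes.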
Zhenhai Gao df10e5f5c2 Fix log output detailed warning info
Signed-off-by: Zhenhai Gao <gaozh1988@live.com>
2018-12-07 17:25:54 +08:00
Andrew McDermott fd3fd85f26 UPSTREAM: <carry>: handle nil nodeGroup in calculateScaleDownGpusTotal
Explicitly handle nil as a return value for nodeGroup in
`calculateScaleDownGpusTotal()` when `NodeGroupForNode()` is called
for GPU nodes that don't exist. The current logic generates a runtime
exception:

    "reflect: call of reflect.Value.IsNil on zero Value"

Looking through the rest of the tree all the other places that use
this pattern additionally and explicitly check whether `nodeGroup ==
nil` first.

This change now completes the pattern in
`calculateScaleDownGpusTotal()`.

Looking at the other occurrences of this pattern we see:

```
File: clusterstate/clusterstate.go
488:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {

File: core/utils.go
231:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
322:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
394:27:			if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
461:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {

File: core/scale_down.go
185:6:		if reflect.ValueOf(nodeGroup).IsNil() {
608:27:			if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
747:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
1010:25:	if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
```

with the notable exception at core/scale_down.go:185 which is
`calculateScaleDownGpusTotal()`.

With this change, and invoking the autoscaler with:

```
...
      --max-nodes-total=24 \
      --cores-total=8:128 \
      --memory-total=4:256 \
      --gpu-total=nvidia.com/gpu:0:16 \
      --gpu-total=amd.com/gpu:0:4 \
...
```

I no longer see a runtime exception.
2018-12-05 18:54:07 +00:00
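
The panic quoted in commit fd3fd85f26 above comes from calling `reflect.ValueOf(x).IsNil()` on a nil interface value. A small, self-contained sketch of why the `nodeGroup == nil ||` guard matters is shown below; `NodeGroup`, `fakeGroup`, `isNilGroup` and `uncheckedIsNil` are stand-ins invented for this illustration and are not the autoscaler's types.

```go
package main

import (
	"fmt"
	"reflect"
)

// NodeGroup is a stand-in interface for illustration only.
type NodeGroup interface{ Id() string }

type fakeGroup struct{ id string }

func (g *fakeGroup) Id() string { return g.id }

// isNilGroup reports whether ng is a nil interface or wraps a nil pointer.
// The plain `ng == nil` test must come first: reflect.ValueOf on a nil
// interface yields the zero Value, and IsNil panics on the zero Value.
func isNilGroup(ng NodeGroup) bool {
	return ng == nil || reflect.ValueOf(ng).IsNil()
}

// uncheckedIsNil omits the first check. It recovers from the resulting
// panic so the demo can report it instead of crashing.
func uncheckedIsNil(ng NodeGroup) (isNil bool, panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	return reflect.ValueOf(ng).IsNil(), false
}

func main() {
	var none NodeGroup      // nil interface
	var typedNil *fakeGroup // nil pointer stored in a non-nil interface

	fmt.Println(isNilGroup(none), isNilGroup(typedNil), isNilGroup(&fakeGroup{id: "a"}))

	_, panicked := uncheckedIsNil(none)
	fmt.Println("unchecked call panicked:", panicked)
}
```

Both checks are needed: `ng == nil` catches the nil interface, while the reflect call catches a nil pointer wrapped in a non-nil interface.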
Thomas Hartland d0dd00c602 Fix logged error in static autoscaler 2018-12-04 16:59:57 +01:00
Łukasz Osipiuk 016bf7fc2c Use k8s.io/klog instead of github.com/golang/glog 2018-11-26 17:30:31 +01:00
Łukasz Osipiuk 991873c237 Fix gofmt errors 2018-11-26 15:39:59 +01:00
Alex Price 4ae7acbacc add flags to ignore daemonsets and mirror pods when calculating resource utilization of a node
Adds the flag --ignore-daemonsets-utilization and --ignore-mirror-pods-utilization
(defaults to false) and when enabled, factors DaemonSet and mirror pods out when
calculating the resource utilization of a node.
2018-11-23 15:24:25 +11:00
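
One way to picture the effect of the two flags from commit 4ae7acbacc above is the sketch below. It is an assumption-laden simplification: the `pod` struct, `cpuUtilization` and the millicore numbers are hypothetical, and the real calculation in the autoscaler covers more resources and may differ in detail.

```go
package main

import "fmt"

// pod is a simplified, hypothetical stand-in for a Kubernetes pod.
type pod struct {
	cpuMilli    int64
	isDaemonSet bool
	isMirror    bool
}

// cpuUtilization returns requested CPU divided by allocatable CPU,
// optionally leaving DaemonSet and mirror pods out of the sum.
func cpuUtilization(pods []pod, allocatableMilli int64, skipDaemonSets, skipMirror bool) float64 {
	if allocatableMilli <= 0 {
		return 0
	}
	var requested int64
	for _, p := range pods {
		if (skipDaemonSets && p.isDaemonSet) || (skipMirror && p.isMirror) {
			continue
		}
		requested += p.cpuMilli
	}
	return float64(requested) / float64(allocatableMilli)
}

func main() {
	pods := []pod{
		{cpuMilli: 500},                    // regular workload
		{cpuMilli: 250, isDaemonSet: true}, // e.g. a logging agent
		{cpuMilli: 250, isMirror: true},    // e.g. a static pod
	}
	fmt.Println(cpuUtilization(pods, 1000, false, false))
	fmt.Println(cpuUtilization(pods, 1000, true, false))
	fmt.Println(cpuUtilization(pods, 1000, true, true))
}
```

With both flags enabled, a node that only runs system pods looks much emptier, so it is more likely to fall under the scale-down utilization threshold.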
Łukasz Osipiuk 5962354c81 Inject Backoff instance to ClusterStateRegistry on creation 2018-11-13 14:25:16 +01:00
k8s-ci-robot 7008fb50be
Merge pull request #1380 from losipiuk/lo/backoff
Make Backoff interface
2018-11-07 05:13:43 -08:00
Aleksandra Malinowska 6febc1ddb0 Fix formatted log messages 2018-11-06 14:51:43 +01:00
Aleksandra Malinowska bf6ff4be8e Clean up estimators 2018-11-06 14:15:42 +01:00
Łukasz Osipiuk 0e2c3739b7 Use NodeGroup as key in Backoff 2018-10-30 18:17:26 +01:00
Łukasz Osipiuk 55fc1e2f00 Store NodeGroup in ScaleUpRequest and ScaleDownRequest 2018-10-30 18:03:04 +01:00
Maciej Pytel 6f5e6aab6f Move node group balancing to processor
The goal is to allow customization of this logic
for different use cases and cloud providers.
2018-10-25 14:04:05 +02:00
Łukasz Osipiuk a266420f6a Recalculate clusterStateRegistry after adding multiple node groups 2018-10-02 17:15:20 +02:00
Łukasz Osipiuk 437efe4af6 If possible use nodeInfo based on created node group 2018-10-02 15:46:45 +02:00
Jakub Tużnik 8179e4e716 Refactor the scale-(up|down) status processors so that they have more info available
Replace the simple boolean ScaledUp property of ScaleUpStatus with a more
comprehensive ScaleUpResult. Add more possible values to ScaleDownResult.
Refactor the processors execution so that they are always executed every
iteration, even if RunOnce exits earlier.
2018-09-20 17:12:02 +02:00
k8s-ci-robot 556029ad8d
Merge pull request #1255 from towca/feat/jtuznik/original-reasons
Add the ability to retrieve the original reasons from a PredicateError
2018-09-20 07:12:37 -07:00
Jakub Tużnik 8a7338e6d8 Add the ability to retrieve the original reasons from a PredicateError 2018-09-19 17:31:34 +02:00
Łukasz Osipiuk bf8cfef10b NodeGroupManager.CreateNodeGroup can return extra created node groups. 2018-09-19 13:55:51 +02:00
k8s-ci-robot d56bb24b71
Merge pull request #1244 from losipiuk/lo/muzon
Call CheckPodsSchedulableOnNode in scale_up.go via caching layer
2018-09-18 02:16:35 -07:00
Steve Scaffidi 88d857222d Renamed one more variable for consistency
Change-Id: Idf42fd58089a1e75f3291ab7cc583735c68735f2
2018-09-17 14:08:10 -04:00
Steve Scaffidi 56b5456269 Fixing nits: renamed newPodScaleUpBuffer -> newPodScaleUpDelay, deleted redundant comment
Change-Id: I7969194d8e07e2fb34029d0d7990341c891d0623
2018-09-17 10:38:28 -04:00
Łukasz Osipiuk 705a6d87e2 fixup! Call CheckPodsSchedulableOnNode in scale_up.go via caching layer 2018-09-17 13:01:19 +02:00
Steve Scaffidi 33b93cbc5f Add configurable delay for pod age before considering for scale-up
- This is intended to address the issue described in https://github.com/kubernetes/autoscaler/issues/923
  - the delay is configurable via a CLI option
  - in production (on AWS) we set this to a value of 2m
  - the delay could possibly be set as low as 30s and still be effective depending on your workload and environment
  - the default of 0 for the CLI option results in no change to the CA's behavior from defaults.

Change-Id: I7e3f36bb48641faaf8a392cca01a12b07fb0ee35
2018-09-14 13:55:09 -04:00
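
The pod-age delay from commit 33b93cbc5f above amounts to filtering pending pods by how long they have existed. The sketch below shows that idea under stated assumptions: `pendingPod`, `filterOldEnough` and `demo` are hypothetical names, and the pod set is invented.

```go
package main

import (
	"fmt"
	"time"
)

// pendingPod is a hypothetical, minimal view of an unschedulable pod.
type pendingPod struct {
	name    string
	created time.Time
}

// filterOldEnough keeps only pods that have existed for at least delay.
// A delay of 0 (the default mentioned above) keeps every pod.
func filterOldEnough(pods []pendingPod, now time.Time, delay time.Duration) []pendingPod {
	if delay <= 0 {
		return pods
	}
	var kept []pendingPod
	for _, p := range pods {
		if now.Sub(p.created) >= delay {
			kept = append(kept, p)
		}
	}
	return kept
}

// demo runs the filter over a fixed, made-up set of pods and returns the
// names that would be considered for scale-up.
func demo(delaySeconds int) []string {
	now := time.Date(2018, 9, 14, 12, 0, 0, 0, time.UTC)
	pods := []pendingPod{
		{name: "fresh", created: now.Add(-10 * time.Second)},
		{name: "older", created: now.Add(-3 * time.Minute)},
	}
	var names []string
	for _, p := range filterOldEnough(pods, now, time.Duration(delaySeconds)*time.Second) {
		names = append(names, p.name)
	}
	return names
}

func main() {
	fmt.Println(demo(0))   // no delay: both pods are considered
	fmt.Println(demo(120)) // 2m delay: only the older pod is considered
}
```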
Łukasz Osipiuk 0ad4efe920 Call CheckPodsSchedulableOnNode in scale_up.go via caching layer 2018-09-13 17:01:15 +02:00
Jakub Tużnik 71111da20c Add a scale down status processor, refactor so that there's more scale down info available to it 2018-09-12 14:52:20 +02:00
mikeweiwei 7ed0599b42 Fix delete node event (#1229)
* Add more events. When a node is deleted, add an event

* move eventf above return and change type to warning
2018-09-07 14:31:57 +02:00
Łukasz Osipiuk 84d8f6fd31 Remove obsolete implementations of node-related processors 2018-09-05 11:58:46 +02:00
Aleksandra Malinowska b88e6019f7 code review fixes 3 2018-08-28 18:11:04 +02:00
Aleksandra Malinowska 5620f76c62 Pass NoScaleUpInfo to ScaleUpStatus processor 2018-08-28 14:26:03 +02:00
Aleksandra Malinowska cd9808185e Report reason why pod didn't trigger scale-up 2018-08-28 14:11:36 +02:00
Aleksandra Malinowska f5690aab96 Make CheckPredicates return predicateError 2018-08-28 14:11:35 +02:00
Jakub Tużnik 054f0b3b90 Add AutoscalingStatusProcessor 2018-08-07 14:47:06 +02:00
Aleksandra Malinowska 90e8a7a2d9 Move initializing defaults out of main 2018-08-02 14:04:03 +02:00
Aleksandra Malinowska 6f9b6f8290 Move ListerRegistry to context 2018-07-26 13:31:49 +02:00
Aleksandra Malinowska 7225a0fcab Move all Kubernetes API clients to AutoscalingKubeClients 2018-07-26 13:31:48 +02:00
Aleksandra Malinowska 07e52e6c79 Move creating cloud provider out of context 2018-07-25 13:43:47 +02:00
Aleksandra Malinowska 0976d2aa07 Move autoscaling options out of static 2018-07-25 10:52:37 +02:00
Aleksandra Malinowska 6b94d7172d Move AutoscalingOptions to config/static 2018-07-23 15:52:27 +02:00
Aleksandra Malinowska f7352500d7
Merge pull request #1080 from aleksandra-malinowska/refactor-cp-3
Remove not-so-useful type check test
2018-07-23 12:00:10 +02:00
Aleksandra Malinowska 1c09fdfe6a Remove not-so-useful type check test 2018-07-23 11:32:24 +02:00
Aleksandra Malinowska 398a1ac153 Fix error on node info not found for group 2018-07-23 11:16:12 +02:00
Aleksandra Malinowska 3b90694191 Remove autoscaler builder 2018-07-19 15:22:30 +02:00
Aleksandra Malinowska 54f8497079 Remove unused dynamic.Config 2018-07-19 14:53:09 +02:00
Pengfei Ni 1dd0147d9e Add more events for CA 2018-07-09 15:42:05 +08:00
Aleksandra Malinowska 800ee56b34 Refactor and extend GPU metrics error types 2018-07-05 13:13:11 +02:00
Karol Gołąb aae4d1270a Make GetGpuTypeForMetrics more robust 2018-06-26 21:35:16 +02:00
Marcin Wielgus f2e76e2592
Merge pull request #1008 from krzysztof-jastrzebski/master
Move removing unneeded autoprovisioned node groups to node group manager
2018-06-22 21:01:36 +02:00
Karol Gołąb 5eb7021f82 Add GPU-related scaled_up & scaled_down metrics (#974)
* Add GPU-related scaled_up & scaled_down metrics

* Fix name to match SD naming convention

* Fix import after master rebase

* Change the logic to include GPU-being-installed nodes
2018-06-22 21:00:52 +02:00
Krzysztof Jastrzebski 2df2568841 Move removing unneeded autoprovisioned node groups to node group manager 2018-06-22 14:26:12 +02:00
Nic Doye ebadbda2b2 issues/933 Consider making UnremovableNodeRecheckTimeout configurable 2018-06-18 11:54:14 +01:00
Aleksandra Malinowska ed5e82d85d
Merge pull request #956 from krzysztof-jastrzebski/master
Create NodeGroupManager which is responsible for creating…
2018-06-14 17:25:32 +02:00
Łukasz Osipiuk 51d628c2f1 Add test to check if nodes from not autoscaled groups are used in max-nodes limit 2018-06-14 16:17:51 +02:00
Krzysztof Jastrzebski 99c8c51bb3 Create NodeGroupManager which is responsible for creating/deleting node groups. 2018-06-14 16:11:32 +02:00
Łukasz Osipiuk b7323bc0d1 Respect GPU limits in scale_up 2018-06-14 15:46:58 +02:00
Łukasz Osipiuk dfcbedb41f Take into consideration nodes from not autoscaled groups when enforcing resource limits 2018-06-14 15:31:40 +02:00
Łukasz Osipiuk b1db155c50 Remove duplicated test case 2018-06-13 19:00:37 +02:00
Łukasz Osipiuk 9f75099d2c Restructure checking resource limits in scale_up.go
Preparatory work before introducing GPU limits
2018-06-13 19:00:37 +02:00
Łukasz Osipiuk 087a5cc9a9 Respect GPU limits in scale_down 2018-06-13 14:19:59 +02:00
Łukasz Osipiuk 1fa44a4d3a Fix bug resulting in resource limits not being enforced in scale_down 2018-06-11 16:39:07 +02:00
Łukasz Osipiuk 519064e1ec Extract isNodeBeingDeleted function 2018-06-11 14:21:07 +02:00
Łukasz Osipiuk 6c57a01fc9 Restructure checking resource limits in scale_down.go 2018-06-11 14:02:40 +02:00
Pengfei Ni be3dd85503 Update scheduler cache package 2018-06-11 13:54:12 +08:00
Łukasz Osipiuk 9c61477d25 Do not return error when getting cpu/memory capacity of node 2018-06-08 15:04:57 +02:00
MaciekPytel c41dc43704
Merge pull request #495 from aleksandra-malinowska/resource-limiter-bytes
Use bytes instead of MB for memory limits
2018-06-08 14:47:22 +02:00
Beata Skiba b8ae6df5d3 Add post scale up status processor. 2018-06-06 13:34:49 +02:00
Maciej Pytel 856855987b Move some GKE-specific logic outside core
No change in actual logic being executed. Added a new
NodeGroupListProcessor interface to encapsulate the existing logic.
Moved PodListProcessor and refactor how it's passed around
to make it consistent and easy to add similar interfaces.
2018-05-29 12:57:19 +02:00
Maciej Pytel 5faa41e683 Move PodListProcessor to new directory
It's not really a util and with more processors
coming it makes more sense to keep them in dedicated place.
2018-05-29 12:00:47 +02:00
Krzysztof Jastrzebski 6761d7f354 Execute predicates only for similar pods. 2018-05-29 09:36:11 +02:00
Krzysztof Jastrzebski adad14c2c9 Delete autoprovisioned node pool after all nodes are deleted. 2018-05-28 14:22:18 +02:00
Karol Gołąb 4c710950de Move ClusterStateRegistry to StaticAutoscaler
AutoscalingContext is basically a configuration and few static helpers
and API handles.
ClusterStateRegistry is state and thus moved to other state-keeping
objects.
2018-05-24 13:03:01 +02:00
Marcin Wielgus 494c2aff1b
Merge pull request #883 from kgolab/kg-clean-up-016
Reorder & extract initial parts of RunOnce
2018-05-22 10:06:27 +02:00
Karol Gołąb 5bfab7d9b2 Return value moved to the caller 2018-05-18 14:59:15 +02:00
Joachim Bartosik bfb70e40ee Allow passing taints to Node Group creation. 2018-05-18 14:33:33 +02:00
Karol Gołąb fa6f25a70a Extract ClusterStateRegistry update with its soft dependency 2018-05-18 10:25:15 +02:00
Karol Gołąb dc34b43a40 Extract another tiny method 2018-05-18 10:10:51 +02:00
Karol Gołąb 34f6a45a04 Extract method to hide a tiny bit of complexity 2018-05-18 10:01:52 +02:00
Aleksandra Malinowska 3ccfa5be23 Move universal constants to separate module 2018-05-17 18:36:43 +02:00
Aleksandra Malinowska fcc3d004f5 Use bytes instead of MB for memory limits 2018-05-17 17:35:39 +02:00
Aleksandra Malinowska d7dc3616f7
Merge pull request #868 from kgolab/kg-clean-up-010
Move metrics update to proper place
2018-05-17 14:52:18 +02:00
Karol Gołąb e31bf0bb58 Move metrics.Autoscaling after all Node-level operations & checks 2018-05-17 14:37:43 +02:00
Aleksandra Malinowska 3b6cfc7c2b
Merge pull request #870 from kgolab/kg-clean-up-012
Set lastScaleDownFailTime properly
2018-05-17 12:09:15 +02:00
MaciekPytel 444201d1e7
Merge pull request #871 from kgolab/kg-clean-up-013
Extract duplicate code into a single method
2018-05-17 11:49:49 +02:00
Karol Gołąb 400147a075 Extract duplicate code into a single method 2018-05-17 10:01:04 +02:00
Karol Gołąb b8cbdf4178 Set lastScaleDownFailTime properly - the ScaleDownError check was unreachable 2018-05-17 09:50:22 +02:00
Karol Gołąb 38a5951e22 Check glog.V once 2018-05-17 09:47:52 +02:00
Karol Gołąb ccca078a2b Move metrics update to proper place 2018-05-17 09:46:25 +02:00
Łukasz Osipiuk eb6eff282a Add gpu related tests to scale_up_test 2018-05-15 22:43:31 +02:00
Łukasz Osipiuk c406da4174 Support gpus in nodes and pods definitions in UT 2018-05-15 22:43:31 +02:00
Łukasz Osipiuk be381facfb Introduce asserting expanding strategy for scale_up_test 2018-05-15 17:01:31 +02:00
Łukasz Osipiuk c1073fe23a Model expected scale up in scale_up_test with struct 2018-05-15 17:01:30 +02:00
Łukasz Osipiuk 8bdc6a1bdc Move commons structs from scale_up_test.go to scale_test_common.go 2018-05-15 17:00:45 +02:00
Karol Gołąb 74b540fdab Remove DynamicAutoscaler since it's unused (#851)
* Remove DynamicAutoscaler since it's unused

* Remove configmap flag with its unused-elsewhere dependencies

* gofmt
2018-05-14 20:22:42 +02:00
MaciekPytel bc39d4dcd5
Merge pull request #842 from kgolab/kg-clean-up-008
Merge two variables into one.
2018-05-14 10:54:43 +02:00
Aleksandra Malinowska b52ec59b05 Fix cleaning up taints 2018-05-11 12:00:48 +02:00
Karol Gołąb f1f92f065e Merge two variables into one. 2018-05-10 14:32:37 +02:00
Aleksandra Malinowska ffeebde8d8 Add support for rescheduled pods with the same name in drain 2018-05-10 12:00:56 +02:00
Marcin Wielgus 9c5728fd74
Merge pull request #836 from kgolab/kg-clean-up-004
Use timestamp argument
2018-05-08 20:24:37 +02:00
Karol Gołąb 53b1c6a394 Use timestamp argument 2018-05-08 13:08:30 +02:00
MaciekPytel e5659e7c57
Merge pull request #835 from kgolab/kg-clean-up-003
Make the code slightly more idiomatic go
2018-05-08 12:58:14 +02:00
Karol Gołąb da16642bcf Make the code slightly more idiomatic go 2018-05-08 11:35:01 +02:00
Karol Gołąb ae203ed517 Removed unused CloudProvider() method. 2018-05-08 11:23:55 +02:00
Karol Gołąb 854fcc1ff8 Remove implementation details (CleanUp) from the interface.
The CleanUp method is instead called directly from the implementation,
when required.
Test updated in a quick way since the mock we're using does not support
AtLeast(1) - thus Times(2).
2018-05-07 15:24:14 +02:00
Beata Skiba 054f6d8650
Merge pull request #794 from krzysztof-jastrzebski/pods
Refactor cluster autoscaler builder and add pod list processor.
2018-04-26 13:08:56 +02:00
Krzysztof Jastrzebski 88b769b324 Refactor cluster autoscaler builder and add pod list processor. 2018-04-26 12:37:51 +02:00
Aleksandra Malinowska 3d599bfabe Rephrase unremovable node warning 2018-04-18 13:43:32 +02:00
Aleksandra Malinowska 7e1353a865 Ignore TPU resource in simulations 2018-04-11 12:26:22 +02:00
Aleksandra Malinowska feb4ad9e14 Add utility for limiting logging 2018-03-22 12:57:22 +01:00
Marcin Wielgus 04bec08e84 Compilation fix 2018-03-20 20:11:36 +01:00
Aleksandra Malinowska 4c594db7f8 Run spellchecker 2018-03-15 15:47:49 +01:00
Aleksandra Malinowska f98e953eb4 Add regional flag 2018-03-12 14:15:56 +01:00
Maciej Pytel abbc45da2e Delay scale-up including GPU request
Nodes with GPU are expensive and it's likely a bunch of pods
using them will be created in a batch. In this case we can
wait a bit for all pods to be created to make more efficient
scale-up decision.
2018-03-02 15:55:04 +01:00
Aleksandra Malinowska 9cc322a61d Disable checking inter pod affinity predicate if only preferred or node affinity used 2018-02-14 14:40:02 +01:00
anniedy bf59e3daa5 Typo fix unneded->[unneeded] (#623)
* Update clusterstate.md

* Update scale_down.go

* Update static_autoscaler.go
2018-02-07 17:36:58 +01:00
Beata Skiba 346a5c26a9 Remove old unregistered nodes before checking cluster healthiness 2018-02-01 16:34:50 +01:00
Aleksandra Malinowska b17b6c3ec5 Wait before publishing no nodes ready after start 2018-01-16 19:04:38 +01:00
Aleksandra Malinowska 3894ecb470 Export unregistered node count metric 2018-01-16 16:56:40 +01:00
Aleksandra Malinowska 27efa05b1d Publish ClusterUnhealthy events 2018-01-16 16:56:36 +01:00
Aleksandra Malinowska 1b728d411b Publish status and metrics for empty cluster 2018-01-16 16:07:29 +01:00
Aleksandra Malinowska 3d33b64599 Export long unregistered node count metric 2018-01-16 16:07:24 +01:00
Marcin Wielgus d5f091a886
Merge pull request #508 from mwielgus/wait-for-pods
Skip iteration if pending pods are too new
2017-12-28 17:22:38 +01:00
Marcin Wielgus 15b10c8f67 Skip iteration if pending pods are too new 2017-12-28 16:55:44 +01:00
Nic Cope 19607bd285 Remove the Polling Autoscaler. 2017-12-11 13:09:56 -08:00
Nic Cope 982f9e41a3 Support autodetection of GCE managed instance groups by name prefix
This commit adds a new usage of the --node-group-auto-discovery flag intended
for use with the GCE cloud provider. GCE instance groups can be automatically
discovered based on a prefix of their group name. Example usage:

--node-group-auto-discovery=mig:prefix=k8s-mig,minNodes=0,maxNodes=10

Note that unlike the existing AWS ASG autodetection functionality we must
specify the min and max nodes in the flag. This is because MIGs store only
a target size in the GCE API - they do not have a min and max size we can
infer via the API.

In order to alleviate this limitation a little we allow multiple uses of the
autodiscovery flag. For example to discover two classes (big and small) of
instance groups with different size limits:

./cluster-autoscaler \
  --node-group-auto-discovery=mig:prefix=k8s-a-small,minNodes=1,maxNodes=10 \
  --node-group-auto-discovery=mig:prefix=k8s-a-big,minNodes=1,maxNodes=100

Zonal clusters (i.e. multizone = false in the cloud config) will detect all
managed instance groups within the cluster's zone. Regional clusters will
detect all matching (zonal) managed instance groups within any of that region's
zones.
2017-12-11 13:09:56 -08:00
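
The flag format shown in commit 982f9e41a3 above (`mig:prefix=...,minNodes=...,maxNodes=...`) can be parsed along the lines of the sketch below. This is not the autoscaler's parser: `migSpec` and `parseMIGSpec` are hypothetical names, and the validation rules (required prefix, non-negative sizes, max not below min) are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// migSpec holds the fields of one --node-group-auto-discovery value.
type migSpec struct {
	Prefix   string
	MinNodes int
	MaxNodes int
}

// parseMIGSpec parses "mig:prefix=<p>,minNodes=<n>,maxNodes=<m>".
func parseMIGSpec(spec string) (migSpec, error) {
	var out migSpec
	parts := strings.SplitN(spec, ":", 2)
	if len(parts) != 2 || parts[0] != "mig" {
		return out, fmt.Errorf("unsupported discovery spec %q", spec)
	}
	for _, kv := range strings.Split(parts[1], ",") {
		pair := strings.SplitN(kv, "=", 2)
		if len(pair) != 2 {
			return out, fmt.Errorf("malformed key=value pair %q", kv)
		}
		key, val := pair[0], pair[1]
		switch key {
		case "prefix":
			out.Prefix = val
		case "minNodes", "maxNodes":
			n, err := strconv.Atoi(val)
			if err != nil || n < 0 {
				return out, fmt.Errorf("invalid %s %q", key, val)
			}
			if key == "minNodes" {
				out.MinNodes = n
			} else {
				out.MaxNodes = n
			}
		default:
			return out, fmt.Errorf("unknown key %q", key)
		}
	}
	if out.Prefix == "" {
		return out, fmt.Errorf("spec %q has no prefix", spec)
	}
	if out.MaxNodes < out.MinNodes {
		return out, fmt.Errorf("maxNodes is below minNodes in %q", spec)
	}
	return out, nil
}

func main() {
	// The flag may be repeated, one spec per class of instance group.
	for _, s := range []string{
		"mig:prefix=k8s-a-small,minNodes=1,maxNodes=10",
		"mig:prefix=k8s-a-big,minNodes=1,maxNodes=100",
	} {
		spec, err := parseMIGSpec(s)
		fmt.Println(spec, err)
	}
}
```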
Maciej Pytel b7f8622eb2 Create node groups with GPU in scale-up.go
This is still not implemented in cloudprovider.
Extended NewNodeGroup interface to have a way of passing
parameters for more complex resources.
2017-12-11 13:12:22 +01:00
Marcin Wielgus f8c0e20ad9 Source fix after godep update 2017-11-28 14:01:43 +01:00
Marcin Wielgus 2589c43a61
Merge pull request #469 from aleksandra-malinowska/single-unregistered-flag
Remove --unregistered-node-removal-time flag
2017-11-16 13:07:52 +01:00
Krzysztof Jastrzebski 6c8d3aa37d Fix unit static autoscaler unit tests. 2017-11-15 16:13:18 +01:00
Aleksandra Malinowska 2ff962e53e Remove --unregistered-node-removal-time flag 2017-11-15 11:11:30 +01:00
Marcin Wielgus ded016dfd8
Merge pull request #461 from MaciekPytel/gpu_unready_fix
Consider GPU nodes unready until allocatable GPU is > 0
2017-11-13 15:29:27 +01:00
Maciej Pytel d81dca5991 Mark nodes with uninitialized GPUs as unready 2017-11-10 17:56:10 +01:00
Marcin Wielgus 439fd3c9ec
Merge pull request #411 from krzysztof-jastrzebski/priority
Adds priority preemption support to cluster autoscaler.
2017-11-08 09:09:26 +01:00
Beata Skiba 2b28ac1a04 Add a workaround for scaling of VMs with GPUs
When a machine with GPU becomes ready it can take
up to 15 minutes before it reports that GPU is allocatable.
This can cause Cluster Autoscaler to trigger a second
unnecessary scale up.
The workaround sets allocatable to capacity for GPU so that
a node that waits for GPUs to become ready to use will be
considered as a place where pods requesting GPUs can be
scheduled.
2017-11-06 16:04:22 +01:00
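
The workaround described in commit 2b28ac1a04 above, setting allocatable GPUs to capacity while the device is still initializing, might look roughly like the sketch below. The `node` struct, `assumeGPUsReady` and the use of plain maps are simplifications invented for this example; the resource name is only illustrative.

```go
package main

import "fmt"

// gpuResource is the resource name used for illustration.
const gpuResource = "nvidia.com/gpu"

// node is a hypothetical, reduced view of a node's resource status.
type node struct {
	capacity    map[string]int64
	allocatable map[string]int64
}

// assumeGPUsReady returns a copy of n whose allocatable GPU count is set to
// its GPU capacity when the node has GPUs that are not yet allocatable.
// The input node is left unmodified.
func assumeGPUsReady(n node) node {
	out := node{
		capacity:    n.capacity,
		allocatable: make(map[string]int64, len(n.allocatable)+1),
	}
	for k, v := range n.allocatable {
		out.allocatable[k] = v
	}
	if c := n.capacity[gpuResource]; c > 0 && out.allocatable[gpuResource] == 0 {
		out.allocatable[gpuResource] = c
	}
	return out
}

func main() {
	starting := node{
		capacity:    map[string]int64{"cpu": 8, gpuResource: 2},
		allocatable: map[string]int64{"cpu": 8}, // GPUs not reported yet
	}
	fmt.Println(assumeGPUsReady(starting).allocatable[gpuResource])
}
```

With the adjusted copy, a scheduling simulation can place GPU pods on the node that is still starting, so no second scale-up is requested for them.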
Edward Tsang 4104a91991 more spelling fixes 2017-11-02 14:21:36 -07:00
mmerrill3 3d043f73cb Renaming the interface function to Cleanup() for CloudProvider type 2017-11-01 12:41:13 -04:00
mmerrill3 77aa30a5c1 Fixing for issue 252 by implementing a channel to stop the go routine 2017-11-01 11:00:00 -04:00
Maciej Pytel c376ef3c87 Add metrics for autoprovisioning 2017-10-31 17:42:58 +01:00
Maciej Pytel 9c2ebccbfe Write events when autoprovisioned nodegroup is created / deleted 2017-10-25 17:39:30 +02:00
Maciej Pytel 07511f444a Add Refresh method to cloud provider
This can be used to dynamically update cloud provider
config (in particular list of managed NodeGroups and their
min/max constraints).
Add GKE implementation.
2017-10-24 18:36:29 +02:00
Marcin Wielgus 596f478e63 Merge pull request #414 from krzysztof-jastrzebski/resource_limit
Adds resource limits to cloud provider.
2017-10-23 20:38:04 +02:00
Krzysztof Jastrzebski 56ac572666 Adds resource limits to cloud provider. 2017-10-23 16:06:56 +02:00
Maciej Pytel 7b95e71315 Use GKE alpha client when autoprovisioning is enabled 2017-10-23 15:21:02 +02:00
Krzysztof Jastrzebski d9c00e5ce1 Adds priority preemption support to cluster autoscaler. 2017-10-23 09:54:56 +02:00
Maciej Pytel 02ccba3338 Update clusterstate after scale-up 2017-10-17 16:11:25 +02:00
Maciej Pytel 3498507220 Handle nodegroup id changing upon creation 2017-10-17 14:02:46 +02:00
Marcin Wielgus f658450b16 Merge pull request #379 from MaciekPytel/long_unregistered_node
Keep track of nodes that failed to register for a long time
2017-09-28 15:01:32 +02:00
Maciej Pytel ff21b0b00c Keep track of nodes that failed to register for a long time
Previously a node that failed to register and couldn't be deleted
basically broke CA.
2017-09-27 16:32:04 +02:00
Marcin Wielgus 9631f0f136 Merge pull request #375 from MaciekPytel/failed_scale_up_reason
Add failed scale-up reason in metric
2017-09-26 19:23:47 +02:00
Maciej Pytel e12ee88f5f Add failed scale-up reason in metric 2017-09-26 13:40:34 +02:00
Krzysztof Jastrzebski 16e9106c07 Fix setting target size for group in core/static_autoscaler_test.go. 2017-09-26 10:58:00 +02:00
Krzysztof Jastrzebski 80a7577399 Unit tests. 2017-09-25 11:37:24 +02:00
Maciej Pytel 098ebbee09 Log event when removing unregistered node 2017-09-22 22:48:07 +02:00
Marcin Wielgus 32c4a7ba5c Merge pull request #360 from aleksandra-malinowska/leaking-taints
Fix leaking taints in case of cloud provider error on node deletion
2017-09-22 21:43:55 +01:00
Maciej Pytel 5e05c84cf0 Add metric counting failed scale-ups
A minor refactor was required to avoid cyclic imports
2017-09-22 18:12:50 +02:00
Aleksandra Malinowska 4c31a57374 fix leaking taints in case of cloud provider error on node deletion 2017-09-22 17:55:48 +02:00
Matt Terry 63310ef41a Introduce new flags to control scale down behavior: scale-down-delay-after-delete and scale-down-delay-after-failure, replacing scale-down-trial-interval. scale-down-delay-after-add replaces scale-down-delay 2017-09-18 17:09:44 -07:00
Marcin Wielgus f04113d746 Remove TargetSize() from loops iterating over nodes 2017-09-13 22:33:17 +02:00
Marcin Wielgus 303f86c163 Merge pull request #336 from electronicarts/feature/matt/unneeded-check-fix
Move calculateUnneededOnly check after unneeded calculations
2017-09-13 11:14:51 +02:00
Marcin Wielgus 4bed50d290 Merge pull request #331 from aleksandra-malinowska/min-cluster-cpu-memory
Respect minimum cores/memory limit during scale down
2017-09-13 11:12:29 +02:00
Aleksandra Malinowska 197b05b180 respect minimum cores/memory limit during scale down 2017-09-13 10:10:47 +02:00
Krzysztof Jastrzebski d8db14701e Core/static_autoscaler_test.go unit tests. 2017-09-13 09:52:07 +02:00
Matt Terry 43943cdeb4 Move calculateUnneededOnly check after unneeded calculations, add log message to main loop start 2017-09-12 21:38:29 -07:00
Aleksandra Malinowska 187c02693e Taint empty nodes to be deleted 2017-09-12 17:40:05 +02:00
Marcin Wielgus ef730e19c5 Merge pull request #332 from krzysztof-jastrzebski/scale_up2
Fix filtering for autoprovisioned node groups and add unit test.
2017-09-12 16:40:30 +02:00
Krzysztof Jastrzebski b1396c3cd1 Fix filtering for autoprovisioned node groups and add unit test. 2017-09-12 16:20:23 +02:00
Marcin Wielgus 738fb640e1 Merge pull request #330 from krzysztof-jastrzebski/core-test4
Core/autoscaling_context_test.go unit tests.
2017-09-12 15:07:22 +02:00
Marcin Wielgus 9d3e52551c Merge pull request #329 from krzysztof-jastrzebski/scale_down2
Core/scale_down.go unit tests.
2017-09-12 13:12:46 +02:00
Marcin Wielgus 3039a0e813 Merge pull request #319 from krzysztof-jastrzebski/core-test
Core/static_autoscaler.go unit tests.
2017-09-12 13:11:11 +02:00
Krzysztof Jastrzebski 001ade48c9 Core/autoscaling_context_test.go unit tests. 2017-09-12 11:04:18 +02:00
Krzysztof Jastrzebski 1db2513f1f Core/scale_down.go unit tests. 2017-09-12 10:41:19 +02:00
Beata Skiba eba0fa2f95 Remove nodes that are not in the cluster from unremovableNodes 2017-09-11 20:01:02 +02:00
Krzysztof Jastrzebski 0aec68a46d Core/static_autoscaler.go unit tests. Current time usage refactoring. 2017-09-11 15:07:21 +02:00
Marcin Wielgus db63ac3a18 Merge pull request #324 from aleksandra-malinowska/scale-down-pod-not-found
Add checking for pod not found error on eviction
2017-09-11 15:10:08 +05:30
Clayton Coleman e84807e828
Do not include ToBeDeleted taint when constructing a template
This results in the simulator being unable to place candidate pods
because the taint blocks all scheduling.
2017-09-10 22:31:39 -04:00
Beata Skiba 1d10a14aa0 Merge pull request #318 from bskiba/fix-empty
Always add empty nodes to unneeded nodes
2017-09-08 16:31:19 +02:00
Beata Skiba 6e5784a519 Always add empty nodes to unneeded nodes 2017-09-08 15:55:18 +02:00
Aleksandra Malinowska fbc8462b10 Add checking for not found error 2017-09-08 15:45:44 +02:00
Aleksandra Malinowska d43029c180 implement blocking scale up beyond max cores & memory 2017-09-08 12:50:00 +02:00
Marcin Wielgus fc599bd08c Merge pull request #310 from krzysztof-jastrzebski/core-test
Core/utils.go unit tests
2017-09-07 17:15:58 +05:30
Krzysztof Jastrzebski 2295d9bcc4 Core/utils.go unit tests 2017-09-07 13:24:12 +02:00
Marcin Wielgus f9cabf3a1a Merge pull request #297 from bskiba/additional-k
Only consider up to 10% of the nodes as additional candidates for scale down
2017-09-07 04:34:23 +05:30
Marcin Wielgus e85e94510d Tests for add autoprovisioned node groups 2017-09-06 02:44:16 +02:00
Marcin Wielgus 1ad8d9e10c Build template NodeInfo for node autoprovisioning 2017-09-05 17:28:49 +02:00
Sergey Lanzman 437a3f60e1 Small code optimization 2017-09-04 23:50:45 +03:00
Sergey Lanzman 44195b39a2 Fix small typos 2017-09-04 22:18:07 +03:00
Sergey Lanzman 415f53cdea Change from deprecated Core to CoreV1 for kube client 2017-09-04 22:16:21 +03:00
Beata Skiba a6c18b87d2 Only consider up to 10% of the nodes as additional candidates for scale down. 2017-09-04 17:37:02 +02:00
Aleksandra Malinowska 7ae64de0af Merge pull request #291 from mwielgus/nap-cleanup
Clean up empty autoprovisioned node groups
2017-09-04 15:03:26 +02:00
Marcin Wielgus bcc8cded64 Clean up empty autoprovisioned node groups 2017-09-04 13:53:07 +02:00
Marcin Wielgus ae00f0544b Merge pull request #290 from mwielgus/max-nap-groups
Limit autoprovisioned groups to 15
2017-09-01 23:49:33 +05:30
Marcin Wielgus de524a6688 Limit autoprovisioned groups to 15 2017-09-01 18:25:28 +02:00
Maciej Pytel a440d92a60 Log event on scale-up timeout 2017-09-01 14:19:14 +02:00
Maciej Pytel a86268f114 Write event on scale-up failure 2017-09-01 13:34:20 +02:00
Marcin Wielgus c0b48e4a15 Merge pull request #285 from mwielgus/loglevel
Set verbosity for each of the glog.Info logs
2017-09-01 16:42:11 +05:30
Marcin Wielgus 021a2fdf5d Merge pull request #286 from mwielgus/exist-no-error
Do not return error from exist
2017-09-01 16:05:52 +05:30
Marcin Wielgus 2d8f59e23d Set verbosity for each of the glog.Info logs 2017-09-01 12:34:29 +02:00
Marcin Wielgus f217d4ac93 Do not return error from exist 2017-09-01 00:24:01 +02:00
Beata Skiba 576e4105db Make ScaleDownNonEmptyCandidatesCount a flag. 2017-08-31 15:05:06 +02:00
Beata Skiba 4560cc0a85 Keep maximum 30 candidates for scale down with drain 2017-08-31 14:58:40 +02:00