Andrew McDermott
5bc77f051c
UPSTREAM: <carry>: fix calculation of max cluster size
...
When scaling up, the calculation for the maximum size of the cluster
based on `--max-nodes-total` doesn't take into account any nodes that
are in the process of coming up. This allows the cluster to grow
beyond the size specified.
With this change I now see:
scale_up.go:266] 21 other pods are also unschedulable
scale_up.go:423] Best option to resize: openshift-cluster-api/amcdermo-ca-worker-us-east-2b
scale_up.go:427] Estimated 18 nodes needed in openshift-cluster-api/amcdermo-ca-worker-us-east-2b
scale_up.go:432] Capping size to max cluster total size (23)
static_autoscaler.go:275] Failed to scale up: max node total count already reached
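The capping shown in the log above can be sketched roughly as follows. This is a minimal illustration, not the autoscaler's actual code: `capNewNodes`, its parameters, and `errMaxNodesReached` are hypothetical names, and the sketch assumes that nodes still coming up are counted together with existing nodes against `--max-nodes-total`, with a non-positive limit treated as "no limit".

```go
package main

import (
	"errors"
	"fmt"
)

// errMaxNodesReached is a hypothetical error used only in this sketch.
var errMaxNodesReached = errors.New("max node total count already reached")

// capNewNodes returns how many of the requested nodes may be added without
// exceeding maxTotal. Upcoming nodes (requested from the cloud provider but
// not yet registered) count against the limit just like existing nodes.
func capNewNodes(requested, existing, upcoming, maxTotal int) (int, error) {
	if maxTotal <= 0 {
		// Assumption of this sketch: a non-positive limit disables the cap.
		return requested, nil
	}
	current := existing + upcoming
	if current >= maxTotal {
		return 0, errMaxNodesReached
	}
	if current+requested > maxTotal {
		// Cap the request at the remaining headroom.
		return maxTotal - current, nil
	}
	return requested, nil
}

func main() {
	// 20 registered nodes plus 3 coming up already fill a limit of 23.
	n, err := capNewNodes(18, 20, 3, 23)
	fmt.Println(n, err)

	// With only 13 nodes accounted for, the request is capped to the headroom.
	n, err = capNewNodes(18, 10, 3, 23)
	fmt.Println(n, err)
}
```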
2018-12-18 17:05:19 +00:00
Zhenhai Gao
df10e5f5c2
Fix log output detailed warning info
...
Signed-off-by: Zhenhai Gao <gaozh1988@live.com>
2018-12-07 17:25:54 +08:00
Andrew McDermott
fd3fd85f26
UPSTREAM: <carry>: handle nil nodeGroup in calculateScaleDownGpusTotal
...
Explicitly handle nil as a return value for nodeGroup in
`calculateScaleDownGpusTotal()` when `NodeGroupForNode()` is called
for GPU nodes that don't exist. The current logic generates a runtime
exception:
"reflect: call of reflect.Value.IsNil on zero Value"
Looking through the rest of the tree all the other places that use
this pattern additionally and explicitly check whether `nodeGroup ==
nil` first.
This change now completes the pattern in
`calculateScaleDownGpusTotal()`.
Looking at the other occurrences of this pattern we see:
```
File: clusterstate/clusterstate.go
488:26: if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
File: core/utils.go
231:26: if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
322:26: if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
394:27: if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
461:26: if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
File: core/scale_down.go
185:6: if reflect.ValueOf(nodeGroup).IsNil() {
608:27: if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
747:26: if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
1010:25: if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
```
with the notable exception at core/scale_down.go:185 which is
`calculateScaleDownGpusTotal()`.
With this change, and invoking the autoscaler with:
```
...
--max-nodes-total=24 \
--cores-total=8:128 \
--memory-total=4:256 \
--gpu-total=nvidia.com/gpu:0:16 \
--gpu-total=amd.com/gpu:0:4 \
...
```
I no longer see a runtime exception.
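The difference between the guarded and unguarded checks can be reproduced in isolation. The sketch below uses stand-in types (`nodeGroup`, `fakeGroup`) and helper names that are hypothetical and not taken from the autoscaler; it only demonstrates why `reflect.ValueOf(x).IsNil()` panics for a nil interface value while the `x == nil ||` guard avoids the panic.

```go
package main

import (
	"fmt"
	"reflect"
)

// nodeGroup is a stand-in interface for this sketch.
type nodeGroup interface{ Id() string }

// fakeGroup is a hypothetical implementation used only for illustration.
type fakeGroup struct{ id string }

func (g *fakeGroup) Id() string { return g.id }

// isNilGroup applies the guarded pattern: the plain nil comparison catches a
// nil interface, and the reflect check catches a typed nil pointer wrapped
// in a non-nil interface.
func isNilGroup(ng nodeGroup) bool {
	return ng == nil || reflect.ValueOf(ng).IsNil()
}

// unguardedIsNil omits the first comparison. For a nil interface,
// reflect.ValueOf returns the zero Value and IsNil panics; the deferred
// recover reports that instead of crashing the program.
func unguardedIsNil(ng nodeGroup) (isNil bool, panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	return reflect.ValueOf(ng).IsNil(), false
}

func main() {
	var none nodeGroup      // nil interface value
	var typedNil *fakeGroup // nil pointer; becomes a non-nil interface when passed

	fmt.Println(isNilGroup(none))
	fmt.Println(isNilGroup(typedNil))
	fmt.Println(isNilGroup(&fakeGroup{id: "a"}))

	_, panicked := unguardedIsNil(none)
	fmt.Println("unguarded check panicked:", panicked)
}
```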
2018-12-05 18:54:07 +00:00
Thomas Hartland
d0dd00c602
Fix logged error in static autoscaler
2018-12-04 16:59:57 +01:00
Łukasz Osipiuk
016bf7fc2c
Use k8s.io/klog instead of github.com/golang/glog
2018-11-26 17:30:31 +01:00
Łukasz Osipiuk
991873c237
Fix gofmt errors
2018-11-26 15:39:59 +01:00
Alex Price
4ae7acbacc
add flags to ignore daemonsets and mirror pods when calculating resource utilization of a node
...
Adds the flags --ignore-daemonsets-utilization and --ignore-mirror-pods-utilization
(both default to false). When enabled, they factor DaemonSet and mirror pods out when
calculating the resource utilization of a node.
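As a rough illustration of what factoring these pods out means, the sketch below computes CPU utilization as requested over allocatable and skips the flagged pod kinds. All names (`pod`, `cpuUtilization`, the boolean parameters) are hypothetical, and the sketch assumes the simplest interpretation, in which ignored pods are left out of the requested sum while allocatable is unchanged; the real implementation may differ.

```go
package main

import "fmt"

// pod is a simplified, hypothetical view of a pod for this sketch.
type pod struct {
	cpuMilli    int64 // CPU request in millicores
	isDaemonSet bool
	isMirror    bool
}

// cpuUtilization returns requested/allocatable CPU for a node, optionally
// leaving DaemonSet and mirror pods out of the requested sum.
func cpuUtilization(pods []pod, allocatableMilli int64, ignoreDaemonSets, ignoreMirrorPods bool) float64 {
	if allocatableMilli <= 0 {
		return 0
	}
	var requested int64
	for _, p := range pods {
		if ignoreDaemonSets && p.isDaemonSet {
			continue
		}
		if ignoreMirrorPods && p.isMirror {
			continue
		}
		requested += p.cpuMilli
	}
	return float64(requested) / float64(allocatableMilli)
}

func main() {
	pods := []pod{
		{cpuMilli: 500},
		{cpuMilli: 250, isDaemonSet: true},
		{cpuMilli: 250, isMirror: true},
	}
	fmt.Println(cpuUtilization(pods, 2000, false, false))
	fmt.Println(cpuUtilization(pods, 2000, true, true))
}
```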
2018-11-23 15:24:25 +11:00
Łukasz Osipiuk
5962354c81
Inject Backoff instance to ClusterStateRegistry on creation
2018-11-13 14:25:16 +01:00
k8s-ci-robot
7008fb50be
Merge pull request #1380 from losipiuk/lo/backoff
...
Make Backoff interface
2018-11-07 05:13:43 -08:00
Aleksandra Malinowska
6febc1ddb0
Fix formatted log messages
2018-11-06 14:51:43 +01:00
Aleksandra Malinowska
bf6ff4be8e
Clean up estimators
2018-11-06 14:15:42 +01:00
Łukasz Osipiuk
0e2c3739b7
Use NodeGroup as key in Backoff
2018-10-30 18:17:26 +01:00
Łukasz Osipiuk
55fc1e2f00
Store NodeGroup in ScaleUpRequest and ScaleDownRequest
2018-10-30 18:03:04 +01:00
Maciej Pytel
6f5e6aab6f
Move node group balancing to processor
...
The goal is to allow customization of this logic
for different use cases and cloud providers.
2018-10-25 14:04:05 +02:00
Łukasz Osipiuk
a266420f6a
Recalculate clusterStateRegistry after adding multiple node groups
2018-10-02 17:15:20 +02:00
Łukasz Osipiuk
437efe4af6
If possible use nodeInfo based on created node group
2018-10-02 15:46:45 +02:00
Jakub Tużnik
8179e4e716
Refactor the scale-(up|down) status processors so that they have more info available
...
Replace the simple boolean ScaledUp property of ScaleUpStatus with a more
comprehensive ScaleUpResult. Add more possible values to ScaleDownResult.
Refactor the processors execution so that they are always executed every
iteration, even if RunOnce exits earlier.
2018-09-20 17:12:02 +02:00
k8s-ci-robot
556029ad8d
Merge pull request #1255 from towca/feat/jtuznik/original-reasons
...
Add the ability to retrieve the original reasons from a PredicateError
2018-09-20 07:12:37 -07:00
Jakub Tużnik
8a7338e6d8
Add the ability to retrieve the original reasons from a PredicateError
2018-09-19 17:31:34 +02:00
Łukasz Osipiuk
bf8cfef10b
NodeGroupManager.CreateNodeGroup can return extra created node groups.
2018-09-19 13:55:51 +02:00
k8s-ci-robot
d56bb24b71
Merge pull request #1244 from losipiuk/lo/muzon
...
Call CheckPodsSchedulableOnNode in scale_up.go via caching layer
2018-09-18 02:16:35 -07:00
Steve Scaffidi
88d857222d
Renamed one more variable for consistency
...
Change-Id: Idf42fd58089a1e75f3291ab7cc583735c68735f2
2018-09-17 14:08:10 -04:00
Steve Scaffidi
56b5456269
Fixing nits: renamed newPodScaleUpBuffer -> newPodScaleUpDelay, deleted redundant comment
...
Change-Id: I7969194d8e07e2fb34029d0d7990341c891d0623
2018-09-17 10:38:28 -04:00
Łukasz Osipiuk
705a6d87e2
fixup! Call CheckPodsSchedulableOnNode in scale_up.go via caching layer
2018-09-17 13:01:19 +02:00
Steve Scaffidi
33b93cbc5f
Add configurable delay for pod age before considering for scale-up
...
- This is intended to address the issue described in https://github.com/kubernetes/autoscaler/issues/923
- the delay is configurable via a CLI option
- in production (on AWS) we set this to a value of 2m
- the delay could possibly be set as low as 30s and still be effective depending on your workload and environment
- the default of 0 for the CLI option results in no change to the CA's behavior from defaults.
Change-Id: I7e3f36bb48641faaf8a392cca01a12b07fb0ee35
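The idea of the delay can be sketched as a filter over pending pods. This is an illustrative sketch only: `pendingPod`, `filterOldEnough`, and `demoEligible` are hypothetical names, and the sketch assumes that pods younger than the delay are simply left out of the current scale-up consideration, with a delay of 0 leaving the list unchanged.

```go
package main

import (
	"fmt"
	"time"
)

// pendingPod is a simplified, hypothetical view of an unschedulable pod.
type pendingPod struct {
	name    string
	created time.Time
}

// filterOldEnough keeps only pods that have existed for at least delay.
// A non-positive delay disables the filter, matching the "default of 0
// changes nothing" behaviour described above.
func filterOldEnough(pods []pendingPod, now time.Time, delay time.Duration) []pendingPod {
	if delay <= 0 {
		return pods
	}
	var out []pendingPod
	for _, p := range pods {
		if now.Sub(p.created) >= delay {
			out = append(out, p)
		}
	}
	return out
}

// demoEligible runs the filter on fixed example data with a 2m delay and
// returns the names of the pods that would be considered for scale-up.
func demoEligible() []string {
	now := time.Date(2018, 9, 14, 12, 0, 0, 0, time.UTC)
	pods := []pendingPod{
		{name: "fresh", created: now.Add(-10 * time.Second)},
		{name: "old", created: now.Add(-3 * time.Minute)},
	}
	var names []string
	for _, p := range filterOldEnough(pods, now, 2*time.Minute) {
		names = append(names, p.name)
	}
	return names
}

func main() {
	fmt.Println(demoEligible())
}
```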
2018-09-14 13:55:09 -04:00
Łukasz Osipiuk
0ad4efe920
Call CheckPodsSchedulableOnNode in scale_up.go via caching layer
2018-09-13 17:01:15 +02:00
Jakub Tużnik
71111da20c
Add a scale down status processor, refactor so that there's more scale down info available to it
2018-09-12 14:52:20 +02:00
mikeweiwei
7ed0599b42
Fix delete node event ( #1229 )
...
* Add more events: emit an event when a node is deleted
* move eventf above return and change type to warning
2018-09-07 14:31:57 +02:00
Łukasz Osipiuk
84d8f6fd31
Remove obsolete implementations of node-related processors
2018-09-05 11:58:46 +02:00
Aleksandra Malinowska
b88e6019f7
code review fixes 3
2018-08-28 18:11:04 +02:00
Aleksandra Malinowska
5620f76c62
Pass NoScaleUpInfo to ScaleUpStatus processor
2018-08-28 14:26:03 +02:00
Aleksandra Malinowska
cd9808185e
Report reason why pod didn't trigger scale-up
2018-08-28 14:11:36 +02:00
Aleksandra Malinowska
f5690aab96
Make CheckPredicates return predicateError
2018-08-28 14:11:35 +02:00
Jakub Tużnik
054f0b3b90
Add AutoscalingStatusProcessor
2018-08-07 14:47:06 +02:00
Aleksandra Malinowska
90e8a7a2d9
Move initializing defaults out of main
2018-08-02 14:04:03 +02:00
Aleksandra Malinowska
6f9b6f8290
Move ListerRegistry to context
2018-07-26 13:31:49 +02:00
Aleksandra Malinowska
7225a0fcab
Move all Kubernetes API clients to AutoscalingKubeClients
2018-07-26 13:31:48 +02:00
Aleksandra Malinowska
07e52e6c79
Move creating cloud provider out of context
2018-07-25 13:43:47 +02:00
Aleksandra Malinowska
0976d2aa07
Move autoscaling options out of static
2018-07-25 10:52:37 +02:00
Aleksandra Malinowska
6b94d7172d
Move AutoscalingOptions to config/static
2018-07-23 15:52:27 +02:00
Aleksandra Malinowska
f7352500d7
Merge pull request #1080 from aleksandra-malinowska/refactor-cp-3
...
Remove not-so-useful type check test
2018-07-23 12:00:10 +02:00
Aleksandra Malinowska
1c09fdfe6a
Remove not-so-useful type check test
2018-07-23 11:32:24 +02:00
Aleksandra Malinowska
398a1ac153
Fix error on node info not found for group
2018-07-23 11:16:12 +02:00
Aleksandra Malinowska
3b90694191
Remove autoscaler builder
2018-07-19 15:22:30 +02:00
Aleksandra Malinowska
54f8497079
Remove unused dynamic.Config
2018-07-19 14:53:09 +02:00
Pengfei Ni
1dd0147d9e
Add more events for CA
2018-07-09 15:42:05 +08:00
Aleksandra Malinowska
800ee56b34
Refactor and extend GPU metrics error types
2018-07-05 13:13:11 +02:00
Karol Gołąb
aae4d1270a
Make GetGpuTypeForMetrics more robust
2018-06-26 21:35:16 +02:00
Marcin Wielgus
f2e76e2592
Merge pull request #1008 from krzysztof-jastrzebski/master
...
Move removing unneeded autoprovisioned node groups to node group manager
2018-06-22 21:01:36 +02:00
Karol Gołąb
5eb7021f82
Add GPU-related scaled_up & scaled_down metrics ( #974 )
...
* Add GPU-related scaled_up & scaled_down metrics
* Fix name to match SD naming convention
* Fix import after master rebase
* Change the logic to include GPU-being-installed nodes
2018-06-22 21:00:52 +02:00
Krzysztof Jastrzebski
2df2568841
Move removing unneeded autoprovisioned node groups to node group manager
2018-06-22 14:26:12 +02:00
Nic Doye
ebadbda2b2
issues/933 Consider making UnremovableNodeRecheckTimeout configurable
2018-06-18 11:54:14 +01:00
Aleksandra Malinowska
ed5e82d85d
Merge pull request #956 from krzysztof-jastrzebski/master
...
Create NodeGroupManager which is responsible for creating…
2018-06-14 17:25:32 +02:00
Łukasz Osipiuk
51d628c2f1
Add test to check if nodes from not autoscaled groups are used in max-nodes limit
2018-06-14 16:17:51 +02:00
Krzysztof Jastrzebski
99c8c51bb3
Create NodeGroupManager which is responsible for creating/deleting node groups.
2018-06-14 16:11:32 +02:00
Łukasz Osipiuk
b7323bc0d1
Respect GPU limits in scale_up
2018-06-14 15:46:58 +02:00
Łukasz Osipiuk
dfcbedb41f
Take into consideration nodes from not autoscaled groups when enforcing resource limits
2018-06-14 15:31:40 +02:00
Łukasz Osipiuk
b1db155c50
Remove duplicated test case
2018-06-13 19:00:37 +02:00
Łukasz Osipiuk
9f75099d2c
Restructure checking resource limits in scale_up.go
...
Preparatory work before introducing GPU limits
2018-06-13 19:00:37 +02:00
Łukasz Osipiuk
087a5cc9a9
Respect GPU limits in scale_down
2018-06-13 14:19:59 +02:00
Łukasz Osipiuk
1fa44a4d3a
Fix bug resulting in resource limits not being enforced in scale_down
2018-06-11 16:39:07 +02:00
Łukasz Osipiuk
519064e1ec
Extract isNodeBeingDeleted function
2018-06-11 14:21:07 +02:00
Łukasz Osipiuk
6c57a01fc9
Restructure checking resource limits in scale_down.go
2018-06-11 14:02:40 +02:00
Pengfei Ni
be3dd85503
Update scheduler cache package
2018-06-11 13:54:12 +08:00
Łukasz Osipiuk
9c61477d25
Do not return error when getting cpu/memory capacity of node
2018-06-08 15:04:57 +02:00
MaciekPytel
c41dc43704
Merge pull request #495 from aleksandra-malinowska/resource-limiter-bytes
...
Use bytes instead of MB for memory limits
2018-06-08 14:47:22 +02:00
Beata Skiba
b8ae6df5d3
Add post scale up status processor.
2018-06-06 13:34:49 +02:00
Maciej Pytel
856855987b
Move some GKE-specific logic outside core
...
No change in actual logic being executed. Added a new
NodeGroupListProcessor interface to encapsulate the existing logic.
Moved PodListProcessor and refactor how it's passed around
to make it consistent and easy to add similar interfaces.
2018-05-29 12:57:19 +02:00
Maciej Pytel
5faa41e683
Move PodListProcessor to new directory
...
It's not really a util, and with more processors
coming it makes more sense to keep them in a dedicated place.
2018-05-29 12:00:47 +02:00
Krzysztof Jastrzebski
6761d7f354
Execute predicates only for similar pods.
2018-05-29 09:36:11 +02:00
Krzysztof Jastrzebski
adad14c2c9
Delete autoprovisioned node pool after all nodes are deleted.
2018-05-28 14:22:18 +02:00
Karol Gołąb
4c710950de
Move ClusterStateRegistry to StaticAutoscaler
...
AutoscalingContext is basically configuration plus a few static helpers
and API handles.
ClusterStateRegistry is state, and is thus moved alongside the other
state-keeping objects.
2018-05-24 13:03:01 +02:00
Marcin Wielgus
494c2aff1b
Merge pull request #883 from kgolab/kg-clean-up-016
...
Reorder & extract initial parts of RunOnce
2018-05-22 10:06:27 +02:00
Karol Gołąb
5bfab7d9b2
Return value moved to the caller
2018-05-18 14:59:15 +02:00
Joachim Bartosik
bfb70e40ee
Allow passing taints to Node Group creation.
2018-05-18 14:33:33 +02:00
Karol Gołąb
fa6f25a70a
Extract ClusterStateRegistry update with its soft dependency
2018-05-18 10:25:15 +02:00
Karol Gołąb
dc34b43a40
Extract another tiny method
2018-05-18 10:10:51 +02:00
Karol Gołąb
34f6a45a04
Extract method to hide a tiny bit of complexity
2018-05-18 10:01:52 +02:00
Aleksandra Malinowska
3ccfa5be23
Move universal constants to separate module
2018-05-17 18:36:43 +02:00
Aleksandra Malinowska
fcc3d004f5
Use bytes instead of MB for memory limits
2018-05-17 17:35:39 +02:00
Aleksandra Malinowska
d7dc3616f7
Merge pull request #868 from kgolab/kg-clean-up-010
...
Move metrics update to proper place
2018-05-17 14:52:18 +02:00
Karol Gołąb
e31bf0bb58
Move metrics.Autoscaling after all Node-level operations & checks
2018-05-17 14:37:43 +02:00
Aleksandra Malinowska
3b6cfc7c2b
Merge pull request #870 from kgolab/kg-clean-up-012
...
Set lastScaleDownFailTime properly
2018-05-17 12:09:15 +02:00
MaciekPytel
444201d1e7
Merge pull request #871 from kgolab/kg-clean-up-013
...
Extract duplicate code into a single method
2018-05-17 11:49:49 +02:00
Karol Gołąb
400147a075
Extract duplicate code into a single method
2018-05-17 10:01:04 +02:00
Karol Gołąb
b8cbdf4178
Set lastScaleDownFailTime properly - the ScaleDownError check was unreachable
2018-05-17 09:50:22 +02:00
Karol Gołąb
38a5951e22
Check glog.V once
2018-05-17 09:47:52 +02:00
Karol Gołąb
ccca078a2b
Move metrics update to proper place
2018-05-17 09:46:25 +02:00
Łukasz Osipiuk
eb6eff282a
Add gpu related tests to scale_up_test
2018-05-15 22:43:31 +02:00
Łukasz Osipiuk
c406da4174
Support gpus in nodes and pods definitions in UT
2018-05-15 22:43:31 +02:00
Łukasz Osipiuk
be381facfb
Introduce asserting expanding strategy for scale_up_test
2018-05-15 17:01:31 +02:00
Łukasz Osipiuk
c1073fe23a
Model expected scale up in scale_up_test with struct
2018-05-15 17:01:30 +02:00
Łukasz Osipiuk
8bdc6a1bdc
Move common structs from scale_up_test.go to scale_test_common.go
2018-05-15 17:00:45 +02:00
Karol Gołąb
74b540fdab
Remove DynamicAutoscaler since it's unused ( #851 )
...
* Remove DynamicAutoscaler since it's unused
* Remove configmap flag with its unused-elsewhere dependencies
* gofmt
2018-05-14 20:22:42 +02:00
MaciekPytel
bc39d4dcd5
Merge pull request #842 from kgolab/kg-clean-up-008
...
Merge two variables into one.
2018-05-14 10:54:43 +02:00
Aleksandra Malinowska
b52ec59b05
Fix cleaning up taints
2018-05-11 12:00:48 +02:00
Karol Gołąb
f1f92f065e
Merge two variables into one.
2018-05-10 14:32:37 +02:00
Aleksandra Malinowska
ffeebde8d8
Add support for rescheduled pods with the same name in drain
2018-05-10 12:00:56 +02:00
Marcin Wielgus
9c5728fd74
Merge pull request #836 from kgolab/kg-clean-up-004
...
Use timestamp argument
2018-05-08 20:24:37 +02:00
Karol Gołąb
53b1c6a394
Use timestamp argument
2018-05-08 13:08:30 +02:00
MaciekPytel
e5659e7c57
Merge pull request #835 from kgolab/kg-clean-up-003
...
Make the code slightly more idiomatic go
2018-05-08 12:58:14 +02:00
Karol Gołąb
da16642bcf
Make the code slightly more idiomatic go
2018-05-08 11:35:01 +02:00
Karol Gołąb
ae203ed517
Removed unused CloudProvider() method.
2018-05-08 11:23:55 +02:00
Karol Gołąb
854fcc1ff8
Remove implementation details (CleanUp) from the interface.
...
The CleanUp method is instead called directly from the implementation,
when required.
Test updated in a quick way since the mock we're using does not support
AtLeast(1) - thus Times(2).
2018-05-07 15:24:14 +02:00
Beata Skiba
054f6d8650
Merge pull request #794 from krzysztof-jastrzebski/pods
...
Refactor cluster autoscaler builder and add pod list processor.
2018-04-26 13:08:56 +02:00
Krzysztof Jastrzebski
88b769b324
Refactor cluster autoscaler builder and add pod list processor.
2018-04-26 12:37:51 +02:00
Aleksandra Malinowska
3d599bfabe
Rephrase unremovable node warning
2018-04-18 13:43:32 +02:00
Aleksandra Malinowska
7e1353a865
Ignore TPU resource in simulations
2018-04-11 12:26:22 +02:00
Aleksandra Malinowska
feb4ad9e14
Add utility for limiting logging
2018-03-22 12:57:22 +01:00
Marcin Wielgus
04bec08e84
Compilation fix
2018-03-20 20:11:36 +01:00
Aleksandra Malinowska
4c594db7f8
Run spellchecker
2018-03-15 15:47:49 +01:00
Aleksandra Malinowska
f98e953eb4
Add regional flag
2018-03-12 14:15:56 +01:00
Maciej Pytel
abbc45da2e
Delay scale-up including GPU request
...
Nodes with GPUs are expensive and it's likely that a bunch of pods
using them will be created in a batch. In this case we can
wait a bit for all pods to be created in order to make a more
efficient scale-up decision.
2018-03-02 15:55:04 +01:00
Aleksandra Malinowska
9cc322a61d
Disable checking inter pod affinity predicate if only preferred or node affinity used
2018-02-14 14:40:02 +01:00
anniedy
bf59e3daa5
Typo fix unneded->[unneeded] ( #623 )
...
* Update clusterstate.md
* Update scale_down.go
* Update static_autoscaler.go
2018-02-07 17:36:58 +01:00
Beata Skiba
346a5c26a9
Remove old unregistered nodes before checking cluster healthiness
2018-02-01 16:34:50 +01:00
Aleksandra Malinowska
b17b6c3ec5
Wait before publishing no nodes ready after start
2018-01-16 19:04:38 +01:00
Aleksandra Malinowska
3894ecb470
Export unregistered node count metric
2018-01-16 16:56:40 +01:00
Aleksandra Malinowska
27efa05b1d
Publish ClusterUnhealthy events
2018-01-16 16:56:36 +01:00
Aleksandra Malinowska
1b728d411b
Publish status and metrics for empty cluster
2018-01-16 16:07:29 +01:00
Aleksandra Malinowska
3d33b64599
Export long unregistered node count metric
2018-01-16 16:07:24 +01:00
Marcin Wielgus
d5f091a886
Merge pull request #508 from mwielgus/wait-for-pods
...
Skip iteration if pending pods are too new
2017-12-28 17:22:38 +01:00
Marcin Wielgus
15b10c8f67
Skip iteration if pending pods are too new
2017-12-28 16:55:44 +01:00
Nic Cope
19607bd285
Remove the Polling Autoscaler.
2017-12-11 13:09:56 -08:00
Nic Cope
982f9e41a3
Support autodetection of GCE managed instance groups by name prefix
...
This commit adds a new usage of the --node-group-auto-discovery flag intended
for use with the GCE cloud provider. GCE instance groups can be automatically
discovered based on a prefix of their group name. Example usage:
--node-group-auto-discovery=mig:prefix=k8s-mig,minNodes=0,maxNodes=10
Note that unlike the existing AWS ASG autodetection functionality we must
specify the min and max nodes in the flag. This is because MIGs store only
a target size in the GCE API - they do not have a min and max size we can
infer via the API.
In order to alleviate this limitation a little we allow multiple uses of the
autodiscovery flag. For example to discover two classes (big and small) of
instance groups with different size limits:
./cluster-autoscaler \
--node-group-auto-discovery=mig:prefix=k8s-a-small,minNodes=1,maxNodes=10 \
--node-group-auto-discovery=mig:prefix=k8s-a-big,minNodes=1,maxNodes=100
Zonal clusters (i.e. multizone = false in the cloud config) will detect all
managed instance groups within the cluster's zone. Regional clusters will
detect all matching (zonal) managed instance groups within any of that region's
zones.
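To make the flag format concrete, here is a minimal sketch of how a value such as `mig:prefix=k8s-mig,minNodes=0,maxNodes=10` could be parsed. The `migDiscoveryConfig` type and `parseMIGSpec` function are hypothetical and not the cloud provider's actual parser; the sketch assumes all three fields are required, as the message above states for min and max nodes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// migDiscoveryConfig is a hypothetical holder for one auto-discovery spec.
type migDiscoveryConfig struct {
	Prefix   string
	MinNodes int
	MaxNodes int
}

// parseMIGSpec parses "mig:prefix=<p>,minNodes=<n>,maxNodes=<m>".
func parseMIGSpec(spec string) (migDiscoveryConfig, error) {
	var cfg migDiscoveryConfig
	parts := strings.SplitN(spec, ":", 2)
	if len(parts) != 2 || parts[0] != "mig" {
		return cfg, fmt.Errorf("unsupported discovery spec %q", spec)
	}
	seen := map[string]bool{}
	for _, field := range strings.Split(parts[1], ",") {
		kv := strings.SplitN(field, "=", 2)
		if len(kv) != 2 {
			return cfg, fmt.Errorf("malformed field %q", field)
		}
		key, val := kv[0], kv[1]
		switch key {
		case "prefix":
			cfg.Prefix = val
		case "minNodes", "maxNodes":
			n, err := strconv.Atoi(val)
			if err != nil || n < 0 {
				return cfg, fmt.Errorf("invalid %s value %q", key, val)
			}
			if key == "minNodes" {
				cfg.MinNodes = n
			} else {
				cfg.MaxNodes = n
			}
		default:
			return cfg, fmt.Errorf("unknown field %q", key)
		}
		seen[key] = true
	}
	// MIGs expose only a target size, so both limits must come from the flag.
	if !seen["prefix"] || !seen["minNodes"] || !seen["maxNodes"] {
		return cfg, fmt.Errorf("spec %q must set prefix, minNodes and maxNodes", spec)
	}
	if cfg.MaxNodes < cfg.MinNodes {
		return cfg, fmt.Errorf("maxNodes %d is below minNodes %d", cfg.MaxNodes, cfg.MinNodes)
	}
	return cfg, nil
}

func main() {
	// The flag may be repeated, so each value is parsed independently.
	for _, spec := range []string{
		"mig:prefix=k8s-a-small,minNodes=1,maxNodes=10",
		"mig:prefix=k8s-a-big,minNodes=1,maxNodes=100",
	} {
		cfg, err := parseMIGSpec(spec)
		fmt.Println(cfg, err)
	}
}
```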
2017-12-11 13:09:56 -08:00
Maciej Pytel
b7f8622eb2
Create node groups with GPU in scale-up.go
...
This is still not implemented in cloudprovider.
Extended NewNodeGroup interface to have a way of passing
parameters for more complex resources.
2017-12-11 13:12:22 +01:00
Marcin Wielgus
f8c0e20ad9
Source fix after godep update
2017-11-28 14:01:43 +01:00
Marcin Wielgus
2589c43a61
Merge pull request #469 from aleksandra-malinowska/single-unregistered-flag
...
Remove --unregistered-node-removal-time flag
2017-11-16 13:07:52 +01:00
Krzysztof Jastrzebski
6c8d3aa37d
Fix static autoscaler unit tests.
2017-11-15 16:13:18 +01:00
Aleksandra Malinowska
2ff962e53e
Remove --unregistered-node-removal-time flag
2017-11-15 11:11:30 +01:00
Marcin Wielgus
ded016dfd8
Merge pull request #461 from MaciekPytel/gpu_unready_fix
...
Consider GPU nodes unready until allocatable GPU is > 0
2017-11-13 15:29:27 +01:00
Maciej Pytel
d81dca5991
Mark nodes with uninitialized GPUs as unready
2017-11-10 17:56:10 +01:00
Marcin Wielgus
439fd3c9ec
Merge pull request #411 from krzysztof-jastrzebski/priority
...
Adds priority preemption support to cluster autoscaler.
2017-11-08 09:09:26 +01:00
Beata Skiba
2b28ac1a04
Add a workaround for scaling of VMs with GPUs
...
When a machine with GPU becomes ready it can take
up to 15 minutes before it reports that GPU is allocatable.
This can cause Cluster Autoscaler to trigger a second
unnecessary scale up.
The workaround sets allocatable to capacity for GPU so that
a node that waits for GPUs to become ready to use will be
considered as a place where pods requesting GPUs can be
scheduled.
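The workaround can be pictured with a small sketch. The types and the function name below (`resources`, `node`, `assumeGPUsAllocatable`) are hypothetical, as is the `example.com/gpu` resource name; the sketch only shows the idea of copying GPU capacity into allocatable on a working copy of the node so that simulations treat the GPUs as usable.

```go
package main

import "fmt"

// resources maps a resource name to a quantity (simplified for this sketch).
type resources map[string]int64

// node is a hypothetical, minimal stand-in for a node's resource status.
type node struct {
	capacity    resources
	allocatable resources
}

// assumeGPUsAllocatable returns a copy of n whose allocatable GPU count
// equals its GPU capacity. The input node is left unmodified.
func assumeGPUsAllocatable(n node, gpuResource string) node {
	out := node{capacity: n.capacity, allocatable: resources{}}
	for name, qty := range n.allocatable {
		out.allocatable[name] = qty
	}
	if c, ok := n.capacity[gpuResource]; ok && c > 0 {
		out.allocatable[gpuResource] = c
	}
	return out
}

func main() {
	// A freshly started GPU node: capacity is known, allocatable is still 0.
	n := node{
		capacity:    resources{"cpu": 8, "example.com/gpu": 2},
		allocatable: resources{"cpu": 8, "example.com/gpu": 0},
	}
	fixed := assumeGPUsAllocatable(n, "example.com/gpu")
	fmt.Println(fixed.allocatable["example.com/gpu"], n.allocatable["example.com/gpu"])
}
```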
2017-11-06 16:04:22 +01:00
Edward Tsang
4104a91991
more spelling fixes
2017-11-02 14:21:36 -07:00
mmerrill3
3d043f73cb
Renaming the interface function to Cleanup() for CloudProvider type
2017-11-01 12:41:13 -04:00
mmerrill3
77aa30a5c1
Fix for issue 252 by implementing a channel to stop the goroutine
2017-11-01 11:00:00 -04:00
Maciej Pytel
c376ef3c87
Add metrics for autoprovisioning
2017-10-31 17:42:58 +01:00
Maciej Pytel
9c2ebccbfe
Write events when autoprovisioned nodegroup is created / deleted
2017-10-25 17:39:30 +02:00
Maciej Pytel
07511f444a
Add Refresh method to cloud provider
...
This can be used to dynamically update cloud provider
config (in particular list of managed NodeGroups and their
min/max constraints).
Add GKE implementation.
2017-10-24 18:36:29 +02:00
Marcin Wielgus
596f478e63
Merge pull request #414 from krzysztof-jastrzebski/resource_limit
...
Adds resource limits to cloud provider.
2017-10-23 20:38:04 +02:00
Krzysztof Jastrzebski
56ac572666
Adds resource limits to cloud provider.
2017-10-23 16:06:56 +02:00
Maciej Pytel
7b95e71315
Use GKE alpha client when autoprovisioning is enabled
2017-10-23 15:21:02 +02:00
Krzysztof Jastrzebski
d9c00e5ce1
Adds priority preemption support to cluster autoscaler.
2017-10-23 09:54:56 +02:00
Maciej Pytel
02ccba3338
Update clusterstate after scale-up
2017-10-17 16:11:25 +02:00
Maciej Pytel
3498507220
Handle nodegroup id changing upon creation
2017-10-17 14:02:46 +02:00
Marcin Wielgus
f658450b16
Merge pull request #379 from MaciekPytel/long_unregistered_node
...
Keep track of nodes that failed to register for a long time
2017-09-28 15:01:32 +02:00
Maciej Pytel
ff21b0b00c
Keep track of nodes that failed to register for a long time
...
Previously a node that failed to register and couldn't be deleted
basically broke CA.
2017-09-27 16:32:04 +02:00
Marcin Wielgus
9631f0f136
Merge pull request #375 from MaciekPytel/failed_scale_up_reason
...
Add failed scale-up reason in metric
2017-09-26 19:23:47 +02:00
Maciej Pytel
e12ee88f5f
Add failed scale-up reason in metric
2017-09-26 13:40:34 +02:00
Krzysztof Jastrzebski
16e9106c07
Fix setting target size for group in core/static_autoscaler_test.go.
2017-09-26 10:58:00 +02:00
Krzysztof Jastrzebski
80a7577399
Unit tests.
2017-09-25 11:37:24 +02:00
Maciej Pytel
098ebbee09
Log event when removing unregistered node
2017-09-22 22:48:07 +02:00
Marcin Wielgus
32c4a7ba5c
Merge pull request #360 from aleksandra-malinowska/leaking-taints
...
Fix leaking taints in case of cloud provider error on node deletion
2017-09-22 21:43:55 +01:00
Maciej Pytel
5e05c84cf0
Add metric counting failed scale-ups
...
A minor refactor was required to avoid cyclic imports
2017-09-22 18:12:50 +02:00
Aleksandra Malinowska
4c31a57374
fix leaking taints in case of cloud provider error on node deletion
2017-09-22 17:55:48 +02:00
Matt Terry
63310ef41a
Introduce new flags to control scale down behavior: scale-down-delay-after-delete and scale-down-delay-after-failure, replacing scale-down-trial-interval. scale-down-delay-after-add replaces scale-down-delay
2017-09-18 17:09:44 -07:00
Marcin Wielgus
f04113d746
Remove TargetSize() from loops iterating over nodes
2017-09-13 22:33:17 +02:00
Marcin Wielgus
303f86c163
Merge pull request #336 from electronicarts/feature/matt/unneeded-check-fix
...
Move calculateUnneededOnly check after unneeded calculations
2017-09-13 11:14:51 +02:00
Marcin Wielgus
4bed50d290
Merge pull request #331 from aleksandra-malinowska/min-cluster-cpu-memory
...
Respect minimum cores/memory limit during scale down
2017-09-13 11:12:29 +02:00
Aleksandra Malinowska
197b05b180
respect minimum cores/memory limit during scale down
2017-09-13 10:10:47 +02:00
Krzysztof Jastrzebski
d8db14701e
Core/static_autoscaler_test.go unit tests.
2017-09-13 09:52:07 +02:00
Matt Terry
43943cdeb4
Move calculateUnneededOnly check after unneeded calculations, add log message to main loop start
2017-09-12 21:38:29 -07:00
Aleksandra Malinowska
187c02693e
Taint empty nodes to be deleted
2017-09-12 17:40:05 +02:00
Marcin Wielgus
ef730e19c5
Merge pull request #332 from krzysztof-jastrzebski/scale_up2
...
Fix filtering for autoprovisioned node groups and add unit test.
2017-09-12 16:40:30 +02:00
Krzysztof Jastrzebski
b1396c3cd1
Fix filtering for autoprovisioned node groups and add unit test.
2017-09-12 16:20:23 +02:00
Marcin Wielgus
738fb640e1
Merge pull request #330 from krzysztof-jastrzebski/core-test4
...
Core/autoscaling_context_test.go unit tests.
2017-09-12 15:07:22 +02:00
Marcin Wielgus
9d3e52551c
Merge pull request #329 from krzysztof-jastrzebski/scale_down2
...
Core/scale_down.go unit tests.
2017-09-12 13:12:46 +02:00
Marcin Wielgus
3039a0e813
Merge pull request #319 from krzysztof-jastrzebski/core-test
...
Core/static_autoscaler.go unit tests.
2017-09-12 13:11:11 +02:00
Krzysztof Jastrzebski
001ade48c9
Core/autoscaling_context_test.go unit tests.
2017-09-12 11:04:18 +02:00
Krzysztof Jastrzebski
1db2513f1f
Core/scale_down.go unit tests.
2017-09-12 10:41:19 +02:00
Beata Skiba
eba0fa2f95
Remove nodes that are not in the cluster from unremovableNodes
2017-09-11 20:01:02 +02:00
Krzysztof Jastrzebski
0aec68a46d
Core/static_autoscaler.go unit tests. Current time usage refactoring.
2017-09-11 15:07:21 +02:00
Marcin Wielgus
db63ac3a18
Merge pull request #324 from aleksandra-malinowska/scale-down-pod-not-found
...
Add checking for pod not found error on eviction
2017-09-11 15:10:08 +05:30
Clayton Coleman
e84807e828
Do not include ToBeDeleted taint when constructing a template
...
This results in the simulator being unable to place candidate pods
because the taint blocks all scheduling.
2017-09-10 22:31:39 -04:00
Beata Skiba
1d10a14aa0
Merge pull request #318 from bskiba/fix-empty
...
Always add empty nodes to unneeded nodes
2017-09-08 16:31:19 +02:00
Beata Skiba
6e5784a519
Always add empty nodes to unneeded nodes
2017-09-08 15:55:18 +02:00
Aleksandra Malinowska
fbc8462b10
Add checking for not found error
2017-09-08 15:45:44 +02:00
Aleksandra Malinowska
d43029c180
implement blocking scale up beyond max cores & memory
2017-09-08 12:50:00 +02:00
Marcin Wielgus
fc599bd08c
Merge pull request #310 from krzysztof-jastrzebski/core-test
...
Core/utils.go unit tests
2017-09-07 17:15:58 +05:30
Krzysztof Jastrzebski
2295d9bcc4
Core/utils.go unit tests
2017-09-07 13:24:12 +02:00
Marcin Wielgus
f9cabf3a1a
Merge pull request #297 from bskiba/additional-k
...
Only consider up to 10% of the nodes as additional candidates for scale down
2017-09-07 04:34:23 +05:30
Marcin Wielgus
e85e94510d
Tests for add autoprovisioned node groups
2017-09-06 02:44:16 +02:00
Marcin Wielgus
1ad8d9e10c
Build template NodeInfo for node autoprovisioning
2017-09-05 17:28:49 +02:00
Sergey Lanzman
437a3f60e1
Small code optimization
2017-09-04 23:50:45 +03:00
Sergey Lanzman
44195b39a2
Fix small typos
2017-09-04 22:18:07 +03:00
Sergey Lanzman
415f53cdea
Change from deprecated Core to CoreV1 for kube client
2017-09-04 22:16:21 +03:00
Beata Skiba
a6c18b87d2
Only consider up to 10% of the nodes as additional candidates for scale down.
2017-09-04 17:37:02 +02:00
Aleksandra Malinowska
7ae64de0af
Merge pull request #291 from mwielgus/nap-cleanup
...
Clean up empty autoprovisioned node groups
2017-09-04 15:03:26 +02:00
Marcin Wielgus
bcc8cded64
Clean up empty autoprovisioned node groups
2017-09-04 13:53:07 +02:00
Marcin Wielgus
ae00f0544b
Merge pull request #290 from mwielgus/max-nap-groups
...
Limit autoprovisioned groups to 15
2017-09-01 23:49:33 +05:30
Marcin Wielgus
de524a6688
Limit autoprovisioned groups to 15
2017-09-01 18:25:28 +02:00
Maciej Pytel
a440d92a60
Log event on scale-up timeout
2017-09-01 14:19:14 +02:00
Maciej Pytel
a86268f114
Write event on scale-up failure
2017-09-01 13:34:20 +02:00
Marcin Wielgus
c0b48e4a15
Merge pull request #285 from mwielgus/loglevel
...
Set verbosity for each of the glog.Info logs
2017-09-01 16:42:11 +05:30
Marcin Wielgus
021a2fdf5d
Merge pull request #286 from mwielgus/exist-no-error
...
Do not return error from exist
2017-09-01 16:05:52 +05:30
Marcin Wielgus
2d8f59e23d
Set verbosity for each of the glog.Info logs
2017-09-01 12:34:29 +02:00
Marcin Wielgus
f217d4ac93
Do not return error from exist
2017-09-01 00:24:01 +02:00
Beata Skiba
576e4105db
Make ScaleDownNonEmptyCandidatesCount a flag.
2017-08-31 15:05:06 +02:00
Beata Skiba
4560cc0a85
Keep maximum 30 candidates for scale down with drain
2017-08-31 14:58:40 +02:00