445 Commits

Author SHA1 Message Date
Krzysztof Jastrzebski 2df2568841 Move removing unneeded autoprovisioned node groups to node group manager 2018-06-22 14:26:12 +02:00
Nic Doye ebadbda2b2 issues/933 Consider making UnremovableNodeRecheckTimeout configurable 2018-06-18 11:54:14 +01:00
Aleksandra Malinowska ed5e82d85d
Merge pull request #956 from krzysztof-jastrzebski/master
Create NodeGroupManager which is responsible for creating…
2018-06-14 17:25:32 +02:00
Łukasz Osipiuk 51d628c2f1 Add test to check if nodes from not autoscaled groups are used in max-nodes limit 2018-06-14 16:17:51 +02:00
Krzysztof Jastrzebski 99c8c51bb3 Create NodeGroupManager which is responsible for creating/deleting node groups. 2018-06-14 16:11:32 +02:00
Łukasz Osipiuk b7323bc0d1 Respect GPU limits in scale_up 2018-06-14 15:46:58 +02:00
Łukasz Osipiuk dfcbedb41f Take into consideration nodes from not autoscaled groups when enforcing resource limits 2018-06-14 15:31:40 +02:00
Łukasz Osipiuk b1db155c50 Remove duplicated test case 2018-06-13 19:00:37 +02:00
Łukasz Osipiuk 9f75099d2c Restructure checking resource limits in scale_up.go
Preparatory work before introducing GPU limits
2018-06-13 19:00:37 +02:00
Łukasz Osipiuk 087a5cc9a9 Respect GPU limits in scale_down 2018-06-13 14:19:59 +02:00
Łukasz Osipiuk 1fa44a4d3a Fix bug resulting in resource limits not being enforced in scale_down 2018-06-11 16:39:07 +02:00
Łukasz Osipiuk 519064e1ec Extract isNodeBeingDeleted function 2018-06-11 14:21:07 +02:00
Łukasz Osipiuk 6c57a01fc9 Restructure checking resource limits in scale_down.go 2018-06-11 14:02:40 +02:00
Pengfei Ni be3dd85503 Update scheduler cache package 2018-06-11 13:54:12 +08:00
Łukasz Osipiuk 9c61477d25 Do not return error when getting cpu/memory capacity of node 2018-06-08 15:04:57 +02:00
MaciekPytel c41dc43704
Merge pull request #495 from aleksandra-malinowska/resource-limiter-bytes
Use bytes instead of MB for memory limits
2018-06-08 14:47:22 +02:00
Beata Skiba b8ae6df5d3 Add post scale up status processor. 2018-06-06 13:34:49 +02:00
Maciej Pytel 856855987b Move some GKE-specific logic outside core
No change in actual logic being executed. Added a new
NodeGroupListProcessor interface to encapsulate the existing logic.
Moved PodListProcessor and refactor how it's passed around
to make it consistent and easy to add similar interfaces.
2018-05-29 12:57:19 +02:00
Maciej Pytel 5faa41e683 Move PodListProcessor to new directory
It's not really a util and with more processors
coming it makes more sense to keep them in dedicated place.
2018-05-29 12:00:47 +02:00
Krzysztof Jastrzebski 6761d7f354 Execute predicates only for similar pods. 2018-05-29 09:36:11 +02:00
Krzysztof Jastrzebski adad14c2c9 Delete autoprovisioned node pool after all nodes are deleted. 2018-05-28 14:22:18 +02:00
Karol Gołąb 4c710950de Move ClusterStateRegistry to StaticAutoscaler
AutoscalingContext is basically a configuration and a few static helpers
and API handles.
ClusterStateRegistry is state and thus moved to other state-keeping
objects.
2018-05-24 13:03:01 +02:00
Marcin Wielgus 494c2aff1b
Merge pull request #883 from kgolab/kg-clean-up-016
Reorder & extract initial parts of RunOnce
2018-05-22 10:06:27 +02:00
Karol Gołąb 5bfab7d9b2 Return value moved to the caller 2018-05-18 14:59:15 +02:00
Joachim Bartosik bfb70e40ee Allow passing taints to Node Group creation. 2018-05-18 14:33:33 +02:00
Karol Gołąb fa6f25a70a Extract ClusterStateRegistry update with its soft dependency 2018-05-18 10:25:15 +02:00
Karol Gołąb dc34b43a40 Extract another tiny method 2018-05-18 10:10:51 +02:00
Karol Gołąb 34f6a45a04 Extract method to hide a tiny bit of complexity 2018-05-18 10:01:52 +02:00
Aleksandra Malinowska 3ccfa5be23 Move universal constants to separate module 2018-05-17 18:36:43 +02:00
Aleksandra Malinowska fcc3d004f5 Use bytes instead of MB for memory limits 2018-05-17 17:35:39 +02:00
Aleksandra Malinowska d7dc3616f7
Merge pull request #868 from kgolab/kg-clean-up-010
Move metrics update to proper place
2018-05-17 14:52:18 +02:00
Karol Gołąb e31bf0bb58 Move metrics.Autoscaling after all Node-level operations & checks 2018-05-17 14:37:43 +02:00
Aleksandra Malinowska 3b6cfc7c2b
Merge pull request #870 from kgolab/kg-clean-up-012
Set lastScaleDownFailTime properly
2018-05-17 12:09:15 +02:00
MaciekPytel 444201d1e7
Merge pull request #871 from kgolab/kg-clean-up-013
Extract duplicate code into a single method
2018-05-17 11:49:49 +02:00
Karol Gołąb 400147a075 Extract duplicate code into a single method 2018-05-17 10:01:04 +02:00
Karol Gołąb b8cbdf4178 Set lastScaleDownFailTime properly - the ScaleDownError check was unreachable 2018-05-17 09:50:22 +02:00
Karol Gołąb 38a5951e22 Check glog.V once 2018-05-17 09:47:52 +02:00
Karol Gołąb ccca078a2b Move metrics update to proper place 2018-05-17 09:46:25 +02:00
Łukasz Osipiuk eb6eff282a Add gpu related tests to scale_up_test 2018-05-15 22:43:31 +02:00
Łukasz Osipiuk c406da4174 Support gpus in nodes and pods definitions in UT 2018-05-15 22:43:31 +02:00
Łukasz Osipiuk be381facfb Introduce asserting expanding strategy for scale_up_test 2018-05-15 17:01:31 +02:00
Łukasz Osipiuk c1073fe23a Model expected scale up in scale_up_test with struct 2018-05-15 17:01:30 +02:00
Łukasz Osipiuk 8bdc6a1bdc Move common structs from scale_up_test.go to scale_test_common.go 2018-05-15 17:00:45 +02:00
Karol Gołąb 74b540fdab Remove DynamicAutoscaler since it's unused (#851)
* Remove DynamicAutoscaler since it's unused

* Remove configmap flag with its unused-elsewhere dependencies

* gofmt
2018-05-14 20:22:42 +02:00
MaciekPytel bc39d4dcd5
Merge pull request #842 from kgolab/kg-clean-up-008
Merge two variables into one.
2018-05-14 10:54:43 +02:00
Aleksandra Malinowska b52ec59b05 Fix cleaning up taints 2018-05-11 12:00:48 +02:00
Karol Gołąb f1f92f065e Merge two variables into one. 2018-05-10 14:32:37 +02:00
Aleksandra Malinowska ffeebde8d8 Add support for rescheduled pods with the same name in drain 2018-05-10 12:00:56 +02:00
Marcin Wielgus 9c5728fd74
Merge pull request #836 from kgolab/kg-clean-up-004
Use timestamp argument
2018-05-08 20:24:37 +02:00
Karol Gołąb 53b1c6a394 Use timestamp argument 2018-05-08 13:08:30 +02:00
MaciekPytel e5659e7c57
Merge pull request #835 from kgolab/kg-clean-up-003
Make the code slightly more idiomatic go
2018-05-08 12:58:14 +02:00
Karol Gołąb da16642bcf Make the code slightly more idiomatic go 2018-05-08 11:35:01 +02:00
Karol Gołąb ae203ed517 Removed unused CloudProvider() method. 2018-05-08 11:23:55 +02:00
Karol Gołąb 854fcc1ff8 Remove implementation details (CleanUp) from the interface.
The CleanUp method is instead called directly from the implementation,
when required.
Test updated in a quick way since the mock we're using does not support
AtLeast(1) - thus Times(2).
2018-05-07 15:24:14 +02:00
Beata Skiba 054f6d8650
Merge pull request #794 from krzysztof-jastrzebski/pods
Refactor cluster autoscaler builder and add pod list processor.
2018-04-26 13:08:56 +02:00
Krzysztof Jastrzebski 88b769b324 Refactor cluster autoscaler builder and add pod list processor. 2018-04-26 12:37:51 +02:00
Aleksandra Malinowska 3d599bfabe Rephrase unremovable node warning 2018-04-18 13:43:32 +02:00
Aleksandra Malinowska 7e1353a865 Ignore TPU resource in simulations 2018-04-11 12:26:22 +02:00
Aleksandra Malinowska feb4ad9e14 Add utility for limiting logging 2018-03-22 12:57:22 +01:00
Marcin Wielgus 04bec08e84 Compilation fix 2018-03-20 20:11:36 +01:00
Aleksandra Malinowska 4c594db7f8 Run spellchecker 2018-03-15 15:47:49 +01:00
Aleksandra Malinowska f98e953eb4 Add regional flag 2018-03-12 14:15:56 +01:00
Maciej Pytel abbc45da2e Delay scale-up including GPU request
Nodes with GPU are expensive and it's likely a bunch of pods
using them will be created in a batch. In this case we can
wait a bit for all pods to be created to make more efficient
scale-up decision.
2018-03-02 15:55:04 +01:00
Aleksandra Malinowska 9cc322a61d Disable checking inter pod affinity predicate if only preferred or node affinity used 2018-02-14 14:40:02 +01:00
anniedy bf59e3daa5 Typo fix unneded->[unneeded] (#623)
* Update clusterstate.md

* Update scale_down.go

* Update static_autoscaler.go
2018-02-07 17:36:58 +01:00
Beata Skiba 346a5c26a9 Remove old unregistered nodes before checking cluster healthiness 2018-02-01 16:34:50 +01:00
Aleksandra Malinowska b17b6c3ec5 Wait before publishing no nodes ready after start 2018-01-16 19:04:38 +01:00
Aleksandra Malinowska 3894ecb470 Export unregistered node count metric 2018-01-16 16:56:40 +01:00
Aleksandra Malinowska 27efa05b1d Publish ClusterUnhealthy events 2018-01-16 16:56:36 +01:00
Aleksandra Malinowska 1b728d411b Publish status and metrics for empty cluster 2018-01-16 16:07:29 +01:00
Aleksandra Malinowska 3d33b64599 Export long unregistered node count metric 2018-01-16 16:07:24 +01:00
Marcin Wielgus d5f091a886
Merge pull request #508 from mwielgus/wait-for-pods
Skip iteration if pending pods are too new
2017-12-28 17:22:38 +01:00
Marcin Wielgus 15b10c8f67 Skip iteration if pending pods are too new 2017-12-28 16:55:44 +01:00
Nic Cope 19607bd285 Remove the Polling Autoscaler. 2017-12-11 13:09:56 -08:00
Nic Cope 982f9e41a3 Support autodetection of GCE managed instance groups by name prefix
This commit adds a new usage of the --node-group-auto-discovery flag intended
for use with the GCE cloud provider. GCE instance groups can be automatically
discovered based on a prefix of their group name. Example usage:

--node-group-auto-discovery=mig:prefix=k8s-mig,minNodes=0,maxNodes=10

Note that unlike the existing AWS ASG autodetection functionality we must
specify the min and max nodes in the flag. This is because MIGs store only
a target size in the GCE API - they do not have a min and max size we can
infer via the API.

In order to alleviate this limitation a little we allow multiple uses of the
autodiscovery flag. For example to discover two classes (big and small) of
instance groups with different size limits:

./cluster-autoscaler \
  --node-group-auto-discovery=mig:prefix=k8s-a-small,minNodes=1,maxNodes=10 \
  --node-group-auto-discovery=mig:prefix=k8s-a-big,minNodes=1,maxNodes=100

Zonal clusters (i.e. multizone = false in the cloud config) will detect all
managed instance groups within the cluster's zone. Regional clusters will
detect all matching (zonal) managed instance groups within any of that region's
zones.
2017-12-11 13:09:56 -08:00
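The flag format described in the commit message above can be illustrated with a small parser. This is only a sketch under assumptions: the type `migDiscoverySpec` and the function `parseMIGSpec` are hypothetical names chosen for illustration and are not taken from the cluster-autoscaler source. The sketch requires `minNodes` and `maxNodes` to be present, mirroring the note that MIG size limits cannot be inferred from the GCE API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// migDiscoverySpec is a hypothetical holder for one parsed flag value.
type migDiscoverySpec struct {
	Prefix   string
	MinNodes int
	MaxNodes int
}

// parseMIGSpec parses a value such as
// "mig:prefix=k8s-mig,minNodes=0,maxNodes=10".
func parseMIGSpec(value string) (migDiscoverySpec, error) {
	var spec migDiscoverySpec
	if !strings.HasPrefix(value, "mig:") {
		return spec, fmt.Errorf("unsupported discoverer in %q", value)
	}
	seen := map[string]bool{}
	for _, pair := range strings.Split(strings.TrimPrefix(value, "mig:"), ",") {
		kv := strings.SplitN(pair, "=", 2)
		if len(kv) != 2 {
			return spec, fmt.Errorf("malformed key=value pair %q", pair)
		}
		key, val := kv[0], kv[1]
		switch key {
		case "prefix":
			spec.Prefix = val
		case "minNodes", "maxNodes":
			n, err := strconv.Atoi(val)
			if err != nil {
				return spec, fmt.Errorf("invalid %s %q: %v", key, val, err)
			}
			if key == "minNodes" {
				spec.MinNodes = n
			} else {
				spec.MaxNodes = n
			}
		default:
			return spec, fmt.Errorf("unknown key %q", key)
		}
		seen[key] = true
	}
	// All three keys are mandatory: size limits cannot be looked up elsewhere.
	for _, key := range []string{"prefix", "minNodes", "maxNodes"} {
		if !seen[key] {
			return spec, fmt.Errorf("missing %s in %q", key, value)
		}
	}
	if spec.MinNodes > spec.MaxNodes {
		return spec, fmt.Errorf("minNodes exceeds maxNodes in %q", value)
	}
	return spec, nil
}

func main() {
	// The flag may be repeated, so each value is parsed independently.
	values := []string{
		"mig:prefix=k8s-a-small,minNodes=1,maxNodes=10",
		"mig:prefix=k8s-a-big,minNodes=1,maxNodes=100",
	}
	for _, v := range values {
		spec, err := parseMIGSpec(v)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%+v\n", spec)
	}
}
```

Parsing each flag occurrence separately matches the multiple-flag usage shown in the commit message, where each instance group class carries its own limits.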
Maciej Pytel b7f8622eb2 Create node groups with GPU in scale-up.go
This is still not implemented in cloudprovider.
Extended NewNodeGroup interface to have a way of passing
parameters for more complex resources.
2017-12-11 13:12:22 +01:00
Marcin Wielgus f8c0e20ad9 Source fix after godep update 2017-11-28 14:01:43 +01:00
Marcin Wielgus 2589c43a61
Merge pull request #469 from aleksandra-malinowska/single-unregistered-flag
Remove --unregistered-node-removal-time flag
2017-11-16 13:07:52 +01:00
Krzysztof Jastrzebski 6c8d3aa37d Fix static autoscaler unit tests. 2017-11-15 16:13:18 +01:00
Aleksandra Malinowska 2ff962e53e Remove --unregistered-node-removal-time flag 2017-11-15 11:11:30 +01:00
Marcin Wielgus ded016dfd8
Merge pull request #461 from MaciekPytel/gpu_unready_fix
Consider GPU nodes unready until allocatable GPU is > 0
2017-11-13 15:29:27 +01:00
Maciej Pytel d81dca5991 Mark nodes with uninitialized GPUs as unready 2017-11-10 17:56:10 +01:00
Marcin Wielgus 439fd3c9ec
Merge pull request #411 from krzysztof-jastrzebski/priority
Adds priority preemption support to cluster autoscaler.
2017-11-08 09:09:26 +01:00
Beata Skiba 2b28ac1a04 Add a workaround for scaling of VMs with GPUs
When a machine with GPU becomes ready it can take
up to 15 minutes before it reports that GPU is allocatable.
This can cause Cluster Autoscaler to trigger a second
unnecessary scale up.
The workaround sets allocatable to capacity for GPU so that
a node that waits for GPUs to become ready to use will be
considered as a place where pods requesting GPUs can be
scheduled.
2017-11-06 16:04:22 +01:00
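The workaround described in the commit message above (treating GPU allocatable as equal to GPU capacity) might look roughly like the following. This is a simplified sketch, not the actual change: `nodeResources`, `withGPUAllocatableFromCapacity`, and the resource name in `gpuResource` are assumptions made for illustration, and plain maps stand in for the Kubernetes node status types.

```go
package main

import "fmt"

// gpuResource is an assumed resource name used only for this illustration.
const gpuResource = "nvidia.com/gpu"

// nodeResources is a hypothetical, simplified stand-in for a node's
// reported capacity and allocatable resources.
type nodeResources struct {
	Capacity    map[string]int64
	Allocatable map[string]int64
}

// withGPUAllocatableFromCapacity returns a copy of n in which the
// allocatable GPU count is set to the GPU capacity, if the node has GPUs.
// The input is left unmodified.
func withGPUAllocatableFromCapacity(n nodeResources) nodeResources {
	out := nodeResources{
		Capacity:    make(map[string]int64, len(n.Capacity)),
		Allocatable: make(map[string]int64, len(n.Allocatable)),
	}
	for k, v := range n.Capacity {
		out.Capacity[k] = v
	}
	for k, v := range n.Allocatable {
		out.Allocatable[k] = v
	}
	// Only nodes that report GPU capacity are adjusted.
	if gpus, ok := n.Capacity[gpuResource]; ok && gpus > 0 {
		out.Allocatable[gpuResource] = gpus
	}
	return out
}

func main() {
	// A freshly started GPU node: capacity is known, allocatable is not yet.
	node := nodeResources{
		Capacity:    map[string]int64{"cpu": 8, gpuResource: 2},
		Allocatable: map[string]int64{"cpu": 8},
	}
	fmt.Println(withGPUAllocatableFromCapacity(node).Allocatable[gpuResource])
}
```

Working on a copy keeps the adjustment local to the scheduling simulation, so the node's real reported status is not altered.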
Edward Tsang 4104a91991 more spelling fixes 2017-11-02 14:21:36 -07:00
mmerrill3 3d043f73cb Renaming the interface function to Cleanup() for CloudProvider type 2017-11-01 12:41:13 -04:00
mmerrill3 77aa30a5c1 Fixing for issue 252 by implementing a channel to stop the go routine 2017-11-01 11:00:00 -04:00
Maciej Pytel c376ef3c87 Add metrics for autoprovisioning 2017-10-31 17:42:58 +01:00
Maciej Pytel 9c2ebccbfe Write events when autoprovisioned nodegroup is created / deleted 2017-10-25 17:39:30 +02:00
Maciej Pytel 07511f444a Add Refresh method to cloud provider
This can be used to dynamically update cloud provider
config (in particular list of managed NodeGroups and their
min/max constraints).
Add GKE implementation.
2017-10-24 18:36:29 +02:00
Marcin Wielgus 596f478e63 Merge pull request #414 from krzysztof-jastrzebski/resource_limit
Adds resource limits to cloud provider.
2017-10-23 20:38:04 +02:00
Krzysztof Jastrzebski 56ac572666 Adds resource limits to cloud provider. 2017-10-23 16:06:56 +02:00
Maciej Pytel 7b95e71315 Use GKE alpha client when autoprovisioning is enabled 2017-10-23 15:21:02 +02:00
Krzysztof Jastrzebski d9c00e5ce1 Adds priority preemption support to cluster autoscaler. 2017-10-23 09:54:56 +02:00
Maciej Pytel 02ccba3338 Update clusterstate after scale-up 2017-10-17 16:11:25 +02:00
Maciej Pytel 3498507220 Handle nodegroup id changing upon creation 2017-10-17 14:02:46 +02:00
Marcin Wielgus f658450b16 Merge pull request #379 from MaciekPytel/long_unregistered_node
Keep track of nodes that failed to register for a long time
2017-09-28 15:01:32 +02:00
Maciej Pytel ff21b0b00c Keep track of nodes that failed to register for a long time
Previously a node that failed to register and couldn't be deleted
basically broke CA.
2017-09-27 16:32:04 +02:00
Marcin Wielgus 9631f0f136 Merge pull request #375 from MaciekPytel/failed_scale_up_reason
Add failed scale-up reason in metric
2017-09-26 19:23:47 +02:00
Maciej Pytel e12ee88f5f Add failed scale-up reason in metric 2017-09-26 13:40:34 +02:00
Krzysztof Jastrzebski 16e9106c07 Fix setting target size for group in core/static_autoscaler_test.go. 2017-09-26 10:58:00 +02:00
Krzysztof Jastrzebski 80a7577399 Unit tests. 2017-09-25 11:37:24 +02:00
Maciej Pytel 098ebbee09 Log event when removing unregistered node 2017-09-22 22:48:07 +02:00
Marcin Wielgus 32c4a7ba5c Merge pull request #360 from aleksandra-malinowska/leaking-taints
Fix leaking taints in case of cloud provider error on node deletion
2017-09-22 21:43:55 +01:00
Maciej Pytel 5e05c84cf0 Add metric counting failed scale-ups
A minor refactor was required to avoid cyclic imports
2017-09-22 18:12:50 +02:00
Aleksandra Malinowska 4c31a57374 fix leaking taints in case of cloud provider error on node deletion 2017-09-22 17:55:48 +02:00
Matt Terry 63310ef41a Introduce new flags to control scale down behavior: scale-down-delay-after-delete and scale-down-delay-after-failure, replacing scale-down-trial-interval. scale-down-delay-after-add replaces scale-down-delay 2017-09-18 17:09:44 -07:00
Marcin Wielgus f04113d746 Remove TargetSize() from loops iterating over nodes 2017-09-13 22:33:17 +02:00
Marcin Wielgus 303f86c163 Merge pull request #336 from electronicarts/feature/matt/unneeded-check-fix
Move calculateUnneededOnly check after unneeded calculations
2017-09-13 11:14:51 +02:00
Marcin Wielgus 4bed50d290 Merge pull request #331 from aleksandra-malinowska/min-cluster-cpu-memory
Respect minimum cores/memory limit during scale down
2017-09-13 11:12:29 +02:00
Aleksandra Malinowska 197b05b180 respect minimum cores/memory limit during scale down 2017-09-13 10:10:47 +02:00
Krzysztof Jastrzebski d8db14701e Core/static_autoscaler_test.go unit tests. 2017-09-13 09:52:07 +02:00
Matt Terry 43943cdeb4 Move calculateUnneededOnly check after unneeded calculations, add log message to main loop start 2017-09-12 21:38:29 -07:00
Aleksandra Malinowska 187c02693e Taint empty nodes to be deleted 2017-09-12 17:40:05 +02:00
Marcin Wielgus ef730e19c5 Merge pull request #332 from krzysztof-jastrzebski/scale_up2
Fix filtering for autoprovisioned node groups and add unit test.
2017-09-12 16:40:30 +02:00
Krzysztof Jastrzebski b1396c3cd1 Fix filtering for autoprovisioned node groups and add unit test. 2017-09-12 16:20:23 +02:00
Marcin Wielgus 738fb640e1 Merge pull request #330 from krzysztof-jastrzebski/core-test4
Core/autoscaling_context_test.go unit tests.
2017-09-12 15:07:22 +02:00
Marcin Wielgus 9d3e52551c Merge pull request #329 from krzysztof-jastrzebski/scale_down2
Core/scale_down.go unit tests.
2017-09-12 13:12:46 +02:00
Marcin Wielgus 3039a0e813 Merge pull request #319 from krzysztof-jastrzebski/core-test
Core/static_autoscaler.go unit tests.
2017-09-12 13:11:11 +02:00
Krzysztof Jastrzebski 001ade48c9 Core/autoscaling_context_test.go unit tests. 2017-09-12 11:04:18 +02:00
Krzysztof Jastrzebski 1db2513f1f Core/scale_down.go unit tests. 2017-09-12 10:41:19 +02:00
Beata Skiba eba0fa2f95 Remove nodes that are not in the cluster from unremovableNodes 2017-09-11 20:01:02 +02:00
Krzysztof Jastrzebski 0aec68a46d Core/static_autoscaler.go unit tests. Current time usage refactoring. 2017-09-11 15:07:21 +02:00
Marcin Wielgus db63ac3a18 Merge pull request #324 from aleksandra-malinowska/scale-down-pod-not-found
Add checking for pod not found error on eviction
2017-09-11 15:10:08 +05:30
Clayton Coleman e84807e828
Do not include ToBeDeleted taint when constructing a template
This results in the simulator being unable to place candidate pods
because the taint blocks all scheduling.
2017-09-10 22:31:39 -04:00
Beata Skiba 1d10a14aa0 Merge pull request #318 from bskiba/fix-empty
Always add empty nodes to unneeded nodes
2017-09-08 16:31:19 +02:00
Beata Skiba 6e5784a519 Always add empty nodes to unneeded nodes 2017-09-08 15:55:18 +02:00
Aleksandra Malinowska fbc8462b10 Add checking for not found error 2017-09-08 15:45:44 +02:00
Aleksandra Malinowska d43029c180 implement blocking scale up beyond max cores & memory 2017-09-08 12:50:00 +02:00
Marcin Wielgus fc599bd08c Merge pull request #310 from krzysztof-jastrzebski/core-test
Core/utils.go unit tests
2017-09-07 17:15:58 +05:30
Krzysztof Jastrzebski 2295d9bcc4 Core/utils.go unit tests 2017-09-07 13:24:12 +02:00
Marcin Wielgus f9cabf3a1a Merge pull request #297 from bskiba/additional-k
Only consider up to 10% of the nodes as additional candidates for scale down
2017-09-07 04:34:23 +05:30
Marcin Wielgus e85e94510d Tests for add autoprovisioned node groups 2017-09-06 02:44:16 +02:00
Marcin Wielgus 1ad8d9e10c Build template NodeInfo for node autoprovisioning 2017-09-05 17:28:49 +02:00
Sergey Lanzman 437a3f60e1 Small optimize code 2017-09-04 23:50:45 +03:00
Sergey Lanzman 44195b39a2 Fix small typos 2017-09-04 22:18:07 +03:00
Sergey Lanzman 415f53cdea Change from deprecated Core to CoreV1 for kube client 2017-09-04 22:16:21 +03:00
Beata Skiba a6c18b87d2 Only consider up to 10% of the nodes as additional candidates for scale down. 2017-09-04 17:37:02 +02:00
Aleksandra Malinowska 7ae64de0af Merge pull request #291 from mwielgus/nap-cleanup
Clean up empty autoprovisioned node groups
2017-09-04 15:03:26 +02:00
Marcin Wielgus bcc8cded64 Clean up empty autoprovisioned node groups 2017-09-04 13:53:07 +02:00
Marcin Wielgus ae00f0544b Merge pull request #290 from mwielgus/max-nap-groups
Limit autoprovisioned groups to 15
2017-09-01 23:49:33 +05:30
Marcin Wielgus de524a6688 Limit autoprovisioned groups to 15 2017-09-01 18:25:28 +02:00
Maciej Pytel a440d92a60 Log event on scale-up timeout 2017-09-01 14:19:14 +02:00
Maciej Pytel a86268f114 Write event on scale-up failure 2017-09-01 13:34:20 +02:00
Marcin Wielgus c0b48e4a15 Merge pull request #285 from mwielgus/loglevel
Set verbosity for each of the glog.Info logs
2017-09-01 16:42:11 +05:30
Marcin Wielgus 021a2fdf5d Merge pull request #286 from mwielgus/exist-no-error
Do not return error from exist
2017-09-01 16:05:52 +05:30
Marcin Wielgus 2d8f59e23d Set verbosity for each of the glog.Info logs 2017-09-01 12:34:29 +02:00
Marcin Wielgus f217d4ac93 Do not return error from exist 2017-09-01 00:24:01 +02:00
Beata Skiba 576e4105db Make ScaleDownNonEmptyCandidatesCount a flag. 2017-08-31 15:05:06 +02:00
Beata Skiba 4560cc0a85 Keep maximum 30 candidates for scale down with drain 2017-08-31 14:58:40 +02:00
Marcin Wielgus e9261a249c Merge pull request #284 from mwielgus/nap-5
Node autoprovisioning in scale up
2017-08-31 17:47:25 +05:30
Marcin Wielgus 22f856d4da Small refactoring in ScaleUp 2017-08-31 13:21:20 +02:00
Marcin Wielgus 6b9e56f0f9 Node autoprovisioning in scale up 2017-08-31 01:33:52 +02:00
Marcin Wielgus 19507aa0de Node autoprovisioning flag 2017-08-31 00:48:54 +02:00
Maciej Pytel 69c5ea03ce Disable MatchInterPodAffinity if there are no pods using affinity 2017-08-30 16:18:31 +02:00
Marcin Wielgus fbf0d6f499 Merge pull request #271 from aleksandra-malinowska/creator-ref
Use OwnerReferences in place of deprecated created by annotation
2017-08-30 04:21:58 +05:30
Aleksandra Malinowska ac0d8388bc use OwnerReferences instead of deprecated created by annotation 2017-08-29 17:26:38 +02:00
Maciej Pytel 281afa7147 precompute predicateMetadata in scale-down 2017-08-29 16:29:45 +02:00
Marcin Wielgus 51a5ad58c0 GKE NodePool support for NAP - get NP/Migs via api - part 1 2017-08-28 20:50:02 +02:00
Marcin Wielgus 191d140107 Don't increase pod graceful termination 2017-08-28 16:54:19 +02:00
Marcin Wielgus 6ad7ca21e8 Merge pull request #265 from MaciekPytel/ignore_unneded_if_min_size
Skip nodes in min-sized groups in scale-down simulation
2017-08-28 19:40:53 +05:30
Marcin Wielgus 9e2c76551f Merge pull request #263 from mwielgus/delete-in-goroutine
Run node drain/delete in a separate goroutine
2017-08-28 19:39:57 +05:30
Maciej Pytel 2f6dd8aefc Skip nodes in min-sized groups in scale-down simulation
Currently we track if those nodes can be removed and only
skip them at the execution step. Since checking if node is
unneeded is pretty expensive it's better to filter them out
early.
2017-08-28 15:48:41 +02:00
Marcin Wielgus 718e5db78e Run node drain/delete in a separate goroutine 2017-08-28 12:12:31 +02:00
Marcin Wielgus 71b4ca5461 Don't block scale downs if no nodes can be removed 2017-08-26 16:29:50 +02:00
Maciej Pytel fa53e52ed9 Skip node in scale-down if it was recently found unremovable 2017-08-25 17:21:08 +02:00
Maciej Pytel fb6ef75d12 Don't create verbose errors in predicates if we ignore them
Turns out all this string formatting is pretty damn expensive.
2017-08-24 15:18:38 +02:00
Beata Skiba edeb522274 Add measuring of FilterOutSchedulable 2017-08-22 18:36:13 +02:00
Beata Skiba 2ae609b93a Merge pull request #237 from bskiba/split_scale_down
Drill down scale down metrics
2017-08-22 16:41:55 +02:00
Beata Skiba 43c9b6b06b Add cleaner function labels for metrics exporting. 2017-08-22 16:09:42 +02:00
Beata Skiba 44f69c6706 Extract deleting empty nodes to a separate function. 2017-08-22 16:09:42 +02:00
Maciej Pytel d2faf11482 Re-use results for similar pods in FilterOutSchedulable 2017-08-21 16:32:14 +02:00
Beata Skiba 14df1b808b Drill down scale down metrics
Split scale down duration into three parts:
1. Find nodes to remove
2. Node deletion
3. Misc operations
2017-08-18 14:17:02 +02:00
Maciej Pytel 95b5b4be94 Remove --verify-unschedulable-pods flag
This flag was true in default setups for every platform,
we haven't heard about any user changing it to false and
after removing check on PodScheduled condition setting it
to false would basically break CA.
2017-08-16 17:31:59 +02:00
Maciej Pytel ef1241b3c6 Remove checking and resetting PodSchedulable condition
The performance cost was too high and the pods should
be filtered out by follow up checks anyway.
Check out https://github.com/kubernetes/autoscaler/issues/187
for details.
2017-08-16 17:30:11 +02:00
Marcin Wielgus 998b3f1acd Merge pull request #198 from MaciekPytel/support_zone_failures
Backoff for node group after failed scale-up
2017-08-16 20:46:45 +05:30
Marcin Wielgus 9116e4c08c Compilation fix for CA after godeps update 2017-08-11 17:56:47 +02:00
Marcin Wielgus 4580e1dc45 Fix getEmptyNodes function in CA 2017-08-07 22:21:41 +02:00
Maciej Pytel 6aacbb5bf7 Backoff for node group after failed scale-up 2017-08-04 15:40:23 +02:00
Ivan Towlson 902d2414b7 Fixed typos of name 'Kubernetes' 2017-08-03 14:20:23 +12:00
Marcin Wielgus 55d750196c Add a flag to turn off pod status condition resetting for performance tests 2017-07-24 15:53:45 +02:00
Aleksandra Malinowska ab8323e8dc fix some logs in scale down 2017-07-20 10:33:42 +02:00
Aleksandra Malinowska 2de8ccc8e1 Change scope of scaleUp metric 2017-07-18 12:17:51 +02:00
Hanfei Shen 2dff7466f8 fix typo for logging 2017-07-14 13:14:27 +08:00
MaciekPytel 2ac2535a48 Merge pull request #169 from aleksandra-malinowska/test-provider-package-name
Rename testprovider package
2017-07-13 12:20:30 +02:00
fate-grand-order 5b230a45ee correct some misspells for cluster-autoscaler/core 2017-07-13 17:53:59 +08:00
Aleksandra Malinowska d9eed646f1 add taints to GCE node template 2017-07-11 16:05:30 +02:00
Aleksandra Malinowska aa1771107e change scope of findUnneeded metric 2017-07-07 16:30:59 +02:00
Aleksandra Malinowska c159a90f04 rename test provider package 2017-07-06 16:23:15 +02:00
Aleksandra Malinowska 9f54934229 add annotation 2017-07-06 14:47:32 +02:00
Marcin Wielgus 7cbf295b7f Merge pull request #161 from mwielgus/godeps-020717
Godeps bump for CA
2017-07-04 11:41:00 +02:00
Marcin Wielgus fc43808149 Godeps bump for CA 2017-07-03 22:05:11 +02:00
Maciej Pytel 39dfced56b Strip rescheduler taint from node templates 2017-07-03 14:57:17 +02:00
Yusuke Kuoka 7697d5345a cluster-autoscaler: Fix scale-down when the node group auto-discovery feature is enabled
Fix CA so that it no longer resets `StaticAutoscaler` state before each iteration. It now remembers the last scale-up/down time, which is used to throttle scale-down; resetting that state was causing the issue.
2017-06-22 10:25:37 +09:00
Marcin Wielgus 2cd532ebfe Don't calculate utilization and run scale down simulations for unmanaged nodes 2017-06-20 16:57:30 +02:00
Marcin Wielgus 63e679a74f Merge pull request #120 from MaciekPytel/fix_graceful_flag
Fix typos related to max-graceful-termination-sec
2017-06-14 14:42:35 +02:00
Maciej Pytel 767367c866 Fix typos related to max-graceful-termination-sec 2017-06-14 14:14:21 +02:00
Maciej Pytel fe514ed75d Make status configmap respect namespace parameter 2017-06-14 14:07:13 +02:00
Marcin Wielgus 1bedee5707 Update GODEPS 2017-06-13 14:48:24 +02:00
Marcin Wielgus 69c77791a2 Fix error types 2017-06-12 21:26:50 +02:00
Marcin Wielgus e2e171b7b7 Enable pricing in expander factory 2017-06-09 11:09:43 -07:00
Marcin Wielgus be0d16a57f Move Autoscaler Builder to a new file 2017-06-09 10:02:44 -07:00
Maciej Pytel cd186f3ebc Balance sizes of similar nodegroups in scale-up 2017-06-06 00:52:38 +02:00
Maciej Pytel 58cdfa1702 Updated log levels in main loop 2017-05-18 14:09:15 +02:00
Maciej Pytel 3f8ca51768 Use typed errors in scale down 2017-05-18 14:09:15 +02:00
Maciej Pytel 7f5c7ed3a2 Used typed errors in scale up code
Updated some of the functions called by scale up
to return new errors as required.
2017-05-18 14:09:15 +02:00
Maciej Pytel f716a7e496 Add typed errors; add errors_total metric
To keep reasonable commit size only top-level files use
new errors. Will add them in other files in next commits.
2017-05-18 14:09:15 +02:00
Marcin Wielgus ea7bd81681 Prefer using ready nodes and cloudprovider template nodes over unready/unschedulable nodes in scale-up 2017-05-16 13:06:19 +02:00
Marcin Wielgus d9bf5aacd7 Use TemplateNodeInfo in scale up 2017-05-16 11:45:05 +02:00
Maciej Pytel 7a21a68b56 Add metrics counting CA operations 2017-05-15 13:03:00 +02:00
Maciej Pytel 4cdf06ea94 Added CA metrics related to autoscaler execution 2017-05-11 14:51:04 +02:00
Maciej Pytel 83ef3d2be3 Added CA metrics related to cluster state 2017-05-11 13:54:04 +02:00
Marcin Wielgus 0a0129f511 Daemonset listers 2017-05-11 12:30:27 +02:00
Marcin Wielgus 30cb7a52e5 Merge pull request #11 from mumoshu/node-group-auto-discovery-with-asg-tag
cluster-autoscaler: Re: AWS Autoscaler autodiscover ASG names and sizes
2017-05-10 11:07:58 +02:00
Yusuke Kuoka 5304e9af21 cluster-autoscaler: Fix typos in comments 2017-05-10 11:22:15 +09:00
Yusuke Kuoka e9c7cd0733 cluster-autoscaler: Re: AWS Autoscaler autodiscover ASG names and sizes
This is an alternative implementation of https://github.com/kubernetes/contrib/pull/1982

Notable differences from the original PR are:

* A new flag named `--node-group-auto-discovery` is introduced for opting in to enable the auto-discovery feature.
  * For example, specifying `--cloud-provider aws --node-group-auto-discovery asg:tag=k8s.io/cluster-autoscaler/enabled` instructs CA to auto-discover ASGs tagged with `k8s.io/cluster-autoscaler/enabled` to be used as target node groups
* The new code path introduced by this PR is executed only when `node-group-auto-discovery` is specified. There is relatively less chance to break existing features by introducing this change

Resolves https://github.com/kubernetes/contrib/issues/1956

---

Other notes:

* We rely mainly on the `DescribeTags` API rather than `DescribeAutoScalingGroups` so that AWS can filter out, for us, unnecessary ASGs which don't belong to the k8s cluster.
  * If we relied on `DescribeAutoScalingGroups` here, as it doesn't support `Filter`ing, we'd need to iterate over ALL the ASGs available in an AWS account, which isn't desirable due to unnecessary excessive API calls and network usages

* Update cloudprovider/aws/README for the new configuration

* Warn about invalid combination of flags
according to the review comment https://github.com/kubernetes/autoscaler/pull/11#discussion_r113713138

* Emit a validation error when both --nodes and --node-group-auto-discovery are specified
according to the review comment https://github.com/kubernetes/autoscaler/pull/11#discussion_r113958080

TODO/Possible future improvements before recommending this to everyone:

* Cache the result of an auto-discovery for a configurable period, so that we won't invoke DescribeTags and DescribeAutoScalingGroup APIs too many times
2017-05-10 08:36:02 +09:00
Marcin Wielgus 42c177b68f Add deletion safety margin to node drain 2017-05-08 11:47:33 +02:00
Marcin Wielgus 6f5d52e3a7 Overwrite pod.spec.nodename and node.name in template nodes for scale up 2017-04-28 17:57:02 +02:00
Marcin Wielgus 6bafa2a940 Merge pull request #25 from mwielgus/label-fix
Override hostname label when building a template node
2017-04-27 17:25:43 +02:00
Marcin Wielgus e1c89f8fe2 Override hostname label when building a template node 2017-04-27 17:17:01 +02:00
Maciej Pytel 7e4212478a Fix error handling for updating node status 2017-04-25 17:34:23 +02:00
Maciej Pytel 6b2ea76973 Added UT for CA simulator 2017-04-19 19:12:30 +02:00
Maciej Pytel 4d40222b63 Fix gofmt 2017-04-18 16:45:27 +02:00
Marcin Wielgus 34eb4973f8 Fix imports in cluster autoscaler after migrating it from contrib 2017-04-18 15:42:04 +02:00
Maciej Pytel 0b74a3bd25 Cluster-Autoscaler: update event name 2017-04-10 14:03:21 +02:00
Marcin Wielgus eb3e6173d1 Cluster-autoscaler: Fix isNodeStarting 2017-03-27 23:27:14 +02:00
Maciej Pytel 72c885b800 Cluster-Autoscaler: reset scale-down on unready cluster 2017-03-22 17:17:59 +01:00
Maciej Pytel c71668a8d8 Cluster-Autoscaler: update status configmap on errors
Previously it would only update after successfully completing the main
loop, meaning the status wouldn't get updated unless cluster was
healthy.
2017-03-15 13:22:24 +01:00
Kubernetes Submit Queue ac5f7634d8 Merge pull request https://github.com/kubernetes/contrib/pull/2464 from MaciekPytel/ca_drain_evictions
Automatic merge from submit-queue

Cluster-Autoscaler: evict pods instead of deleting them

This should make CA respect PodDisruptionBudget.
2017-03-15 04:27:27 -07:00
Maciej Pytel 7d5488898c Cluster-autoscaler: fix NotTriggerScaleUp event
This should fix a failing e2e test
2017-03-14 14:54:36 +01:00
Maciej Pytel 10d560dae6 Cluster-Autoscaler: handle nil node group
In a few places we assumed it's non-nil, leading
to segfaults.
2017-03-13 14:46:11 +01:00
Maciej Pytel 39162f0860 Cluster-Autoscaler: evict pods instead of deleting them 2017-03-10 16:18:47 +01:00
Maciej Pytel 5d2c675c8e Cluster-Autoscaler: update scale down status 2017-03-08 11:51:20 +01:00
Marcin Wielgus 27b797f541 Cluster-Autoscaler: skip nodes currently under deletion in scale down 2017-03-07 14:59:15 +01:00
Kubernetes Submit Queue 39fa783ad7 Merge pull request https://github.com/kubernetes/contrib/pull/2451 from mwielgus/pdb-ca
Automatic merge from submit-queue

Cluster-autoscaler: include PodDisruptionBudget in drain - part 1/2

In part 1 of 2 we skip nodes that have a pod with 0 poddisruptionallowed. Part 2/2 will delete pods using evict.

cc: @jszczepkowski @MaciekPytel @davidopp @fgrzadkowski
2017-03-06 09:27:50 -08:00
Marcin Wielgus 5b4441083a Cluster-autoscaler: include PodDisruptionBudget in drain - part 1/2 2017-03-06 17:15:04 +01:00
Maciej Pytel d3bf5d3d51 Cluster-Autoscaler: log events on status configmap 2017-03-06 12:21:24 +01:00
Maciej Pytel 84f19c1e1e Cluster-Autoscaler: add map to disable status configmap 2017-03-02 15:35:00 +01:00
Marcin Wielgus 2ffaddb7c0 Cluster-autoscaler: lint 2017-03-02 15:15:07 +01:00
Marcin Wielgus 72a47dc2b2 Cluster-autoscaler: update code for 1.6 k8s sync 2017-03-02 14:34:49 +01:00
Maciej Pytel d0196c9e1b Cluster-Autoscaler: Delete status configmap on exit 2017-02-28 17:19:23 +01:00
Maciej Pytel 497d2800ea Cluster-Autoscaler: Write status to configmap 2017-02-28 09:59:40 +01:00
Maciej Pytel 637e750246 Cluster-Autoscaler: fix segfault
StaticAutoscaler.kubeClient was uninitialized,
leading to segfaults when trying to use it. It was
also a duplicate since the client is already available
through AutoscalingContext.
2017-02-27 14:13:54 +01:00
Marcin Wielgus 83fdeb184f Cluster-autoscaler: use listers from ListersRegistry 2017-02-24 20:40:53 +01:00
Yusuke Kuoka baee799524 cluster-autoscaler: Dynamic Reconfiguration via ConfigMaps
Adds a new optional flag named `configmap` to specify the name of a configmap containing node group specs.

The configmap is polled every `scan-interval` seconds to reconfigure cluster-autoscaler dynamically at runtime.

Example usage:

```
./cluster-autoscaler --v=4 --cloud-provider=aws --skip-nodes-with-local-storage=false --logtostderr --leader-elect=false --configmap=cluster-autoscaler --logtostderr
```

The configmap would look like:

```yaml
kind: ConfigMap
apiVersion: v1
metadata:
  name: cluster-autoscaler
  namespace: kube-system
data:
  settings: |-
    {
      "nodeGroups": [
        {
          "minSize": 1,
          "maxSize": 2,
          "name": "kubeawstest-nodepool1-AutoScaleWorker-1VWD4GAVG35L5"
        }
      ]
    }
 ```

Other notes:

* Make namespace default to "kube-system"
according to https://github.com/kubernetes/contrib/pull/2226#discussion_r94144267

* Trigger a full-recreate on a configuration change

according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-269617410

* Introduced `autoscaler/` and moved  all the dynamic/recreatable-at-runtime parts of autoscaler into there (Update: the package is now named `core` according to https://github.com/kubernetes/contrib/pull/2226#issuecomment-273071663)

* Extracted the core of CA(=`func Run()` in `main.go`) into `Autoscaler`

* `DynamicAutoscaler` is a wrapper around `Autoscaler` which achieves reconfiguration of CA by recreating an `Autoscaler` instance on a configmap change.

* Moved `scale_down*.go`, `scale_up*.go` and `utils*.go` into the `autoscaler` package accordingly because they seemed to be meant to be collocated in the same package as the core of CA (which is now implemented as `Autoscaler`)

* Moved the `createEventRecorder` func from the `main` package to the `utils/kubernetes` package to make it importable from both `main` and `autoscaler`
2017-02-24 20:36:47 +09:00
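The `settings` payload shown in the configmap example above is plain JSON, so reading it could be sketched as follows. The names `nodeGroupSpec`, `settings`, and `parseSettings` are hypothetical and chosen for illustration; only the JSON field names (`nodeGroups`, `minSize`, `maxSize`, `name`) come from the commit message. The validation rules are an assumption, not documented behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// nodeGroupSpec mirrors one entry of the "nodeGroups" array in the example.
type nodeGroupSpec struct {
	Name    string `json:"name"`
	MinSize int    `json:"minSize"`
	MaxSize int    `json:"maxSize"`
}

// settings mirrors the top-level JSON object stored under the "settings" key.
type settings struct {
	NodeGroups []nodeGroupSpec `json:"nodeGroups"`
}

// parseSettings decodes the JSON payload and applies basic sanity checks
// (the checks themselves are assumed for this sketch).
func parseSettings(raw string) (settings, error) {
	var s settings
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return s, fmt.Errorf("invalid settings JSON: %v", err)
	}
	for _, g := range s.NodeGroups {
		if g.Name == "" {
			return s, fmt.Errorf("node group without a name")
		}
		if g.MinSize < 0 || g.MinSize > g.MaxSize {
			return s, fmt.Errorf("node group %q has invalid size range %d..%d",
				g.Name, g.MinSize, g.MaxSize)
		}
	}
	return s, nil
}

func main() {
	raw := `{
	  "nodeGroups": [
	    {"minSize": 1, "maxSize": 2,
	     "name": "kubeawstest-nodepool1-AutoScaleWorker-1VWD4GAVG35L5"}
	  ]
	}`
	s, err := parseSettings(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, g := range s.NodeGroups {
		fmt.Printf("%s: min=%d max=%d\n", g.Name, g.MinSize, g.MaxSize)
	}
}
```

Rejecting a malformed payload before acting on it would let a poller keep the previous configuration instead of recreating the autoscaler with bad limits.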