445 Commits

Author SHA1 Message Date
Kubernetes Prow Robot 8871f1702d
Merge pull request #2521 from losipiuk/lo/rename-stockout
Rename STOCKOUT to RESOURCE_POOL_EXHAUSTED
2019-11-12 06:00:07 -08:00
Łukasz Osipiuk 7b499aa4c9 Rename STOCKOUT to RESOURCE_POOL_EXHAUSTED
We came to the conclusion that using STOCKOUT as an error code is too
specific. Migrating to the more general term RESOURCE_POOL_EXHAUSTED.
2019-11-12 14:39:51 +01:00
Vivek Bagade 910e75365c remove temporary nodes logic 2019-11-12 11:58:29 +01:00
Kubernetes Prow Robot 19dcfbd25e
Merge pull request #2476 from tghartland/fix-scale-down-errorf
CA: Make error message in scale down node draining consistent
2019-11-04 01:01:40 -08:00
Jarvis-Zhou 7c9d6e3518 Do not assign return values to variables when not needed 2019-10-25 19:28:00 +08:00
Thomas Hartland 229fc959b4 Make error message in scale down consistent 2019-10-23 15:28:09 +02:00
Łukasz Osipiuk 7f083d2393 Move core/utils.go to separate package and split into multiple files 2019-10-22 14:23:40 +02:00
Łukasz Osipiuk 41e9271b9e Remove unused GetCandidatesForScaleDown 2019-10-22 14:23:38 +02:00
Kubernetes Prow Robot 3f137fde4f
Merge pull request #2448 from hectorj2f/hectorj2f/chore_typos
cluster-autoscaler: fix some typos in the code
2019-10-21 00:33:37 -07:00
Łukasz Osipiuk 288d4107b2 Rename GetCreatedNodesWithOutOfResourcesErrors to GetCreatedNodesWithErrors 2019-10-14 10:56:56 +02:00
Hector Fernandez 24401b373f
cluster-autoscaler: fix some typos in the code 2019-10-13 12:52:53 +02:00
Thomas Hartland c51b7ee72a Update TestRemoveOldUnregisteredNodes to pass cluster state registry 2019-09-30 14:29:02 +02:00
Thomas Hartland 474eef6d47 Invalidate node instances cache after deleting unregistered nodes 2019-09-30 14:29:02 +02:00
Thomas Hartland 7c17d52ec8 Invalidate node instances cache after deleting failed nodes 2019-09-30 13:56:33 +02:00
Kubernetes Prow Robot 791f0d8355
Merge pull request #2281 from DataDog/JulienBalestra/mig-block
cluster-autoscaler: blocked if an instance is detached from MIG
2019-09-11 05:03:22 -07:00
Julien Balestra 3441f616e1 cluster-autoscaler/skip-node: unblock cluster autoscaler when having a single nodegroup for node error
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2019-09-11 13:40:23 +02:00
Krzysztof Jastrzebski 839cdaaa09 Stop disabling Cluster Autoscaler when there are no ready nodes. 2019-09-06 14:45:34 +02:00
Julien Balestra 6d707a08ac cluster-autoscaler/metrics: expose the scale down cooldown
Signed-off-by: Julien Balestra <julien.balestra@datadoghq.com>
2019-08-27 18:12:33 +02:00
Kubernetes Prow Robot 9aac43e237
Merge pull request #2235 from piontec/fix/aws_spots_squashed
correctly handle lack of capacity of AWS spot ASGs
2019-08-19 04:27:30 -07:00
Kubernetes Prow Robot 4c056fb8ba
Merge pull request #2259 from towca/jtuznik/rejected-node-groups-more-info
Provide ScaleUpStatusProcessor with info about all rejected node groups
2019-08-19 04:05:31 -07:00
Kubernetes Prow Robot 3f0a5fa3c2
Merge pull request #2233 from vivekbagade/surge
Adding ScaleDownNodeProcessor
2019-08-19 03:59:32 -07:00
Jakub Tużnik 43466ff837 Provide ScaleUpStatusProcessor with info about all rejected node groups
Previously, it had info only about the ones that actually exist.

The changes to the eventing processor are done to keep its previous
behavior the same.
2019-08-19 12:48:10 +02:00
Łukasz Piątkowski 8d9b81caaa correctly handle lack of capacity of AWS spot ASGs 2019-08-19 12:43:53 +02:00
Kubernetes Prow Robot 60bdca087d
Merge pull request #2255 from towca/jtuznik/create-node-group-result
Provide more info to ScaleUpStatusProcessor
2019-08-13 06:51:41 -07:00
Vivek Bagade dc64d0aab2 Adding ScaleDownNodeProcessor 2019-08-12 20:19:55 +02:00
Jakub Tużnik 935476a7e2 Provide more info to ScaleUpStatusProcessor
Add info about considered and created nodegroups to
ScaleUpStatusProcessor
2019-08-12 17:20:09 +02:00
Jakub Tużnik 44ae89dd09 Communicate the result of RemoveUnneededNodeGroups to ScaleDownStatusProcessor 2019-08-12 17:03:51 +02:00
t-qini f7c563ab06 Modify the code to follow the simple solution proposed by MaciekPytel. 2019-07-18 23:58:05 +08:00
t-qini 622a838c2c Modify nodal similarity rules. 2019-07-09 16:04:40 +08:00
Kubernetes Prow Robot c6067574c1
Merge pull request #2160 from aleksandra-malinowska/scale-up-events-fix
Add resource limit type to NotTriggerScaleUp event
2019-07-05 05:48:38 -07:00
Aleksandra Malinowska 0d0c9440f6 Add no scale up test 2019-07-03 16:38:53 +02:00
Aleksandra Malinowska 7b80f4e8b8 Separate running scale up test from checking results 2019-07-03 16:38:52 +02:00
Aleksandra Malinowska c27ae4eb24 Add resource limit type to NotTriggerScaleUp event 2019-07-03 16:38:46 +02:00
Aleksandra Malinowska d01a2392db Make scale down unit tests faster 2019-07-03 13:12:48 +02:00
Pengfei Ni d45fee06da Ensure upcoming nodes are different 2019-07-02 16:52:19 +08:00
silenceper 478660a6bb fix error 2019-06-28 18:49:58 +08:00
Vivek Bagade 0a75333e1b Potential performance improvement in bin packing unschedulable pods 2019-06-19 18:39:47 +02:00
Vivek Bagade 90aa28a077 Move pod packing in upcoming nodes to RunOnce from Estimator for performance improvements 2019-06-19 14:48:47 +02:00
Kubernetes Prow Robot da36677d04
Merge pull request #2108 from losipiuk/lo/other-error-ut
Add unit test case for OTHER error handling
2019-06-10 05:29:08 -07:00
Łukasz Osipiuk 0bcf5315a7 Do not fail loop iteration if unregistered nodes cannot be removed
The mechanism of unregistered nodes removal is not the primary
responsibility of Cluster Autoscaler. We do not want to render CA
unusable (disable scale-up and scale-down) if removing unregistered nodes
cannot be done for a prolonged period of time.
2019-06-10 13:45:54 +02:00
Łukasz Osipiuk be68d06b40 Add unit test case for OTHER error handling 2019-06-07 16:54:01 +02:00
Jakub Tużnik bb382f47f9 Retain information about scale-up failures in CSR
This will provide the AutoscalingStatusProcessor with information
about failed scale-ups.
2019-06-05 16:53:30 +02:00
Krzysztof Jastrzebski 22b4a6283e Optimize building node infos by using map with pods for nodes. 2019-06-03 13:24:09 +02:00
Kubernetes Prow Robot a0853bcc80
Merge pull request #2071 from losipiuk/lo/predicate-checker-speedup
Precompute inter pod equivalence groups in checkPodsSchedulableOnNode
2019-06-03 03:52:16 -07:00
Krzysztof Jastrzebski 4831d76288 Cache cloud provider node instances in cluster state. 2019-05-31 10:11:51 +02:00
Łukasz Osipiuk a849ead286 Precompute inter pod equivalence groups in checkPodsSchedulableOnNode 2019-05-29 18:05:52 +02:00
Krzysztof Jastrzebski 6944f3fc56 Delete zero values from deletionsInProgress map in NodeDeletionTracker. 2019-05-28 14:34:56 +02:00
Krzysztof Jastrzebski da82f831a3 Use fakeNodeLister instead of mocks. 2019-05-27 15:10:31 +02:00
Kubernetes Prow Robot cb4e60f8d4
Merge pull request #2031 from krzysztof-jastrzebski/master
Add functionality which delays node deletion to let other components prepare for deletion.
2019-05-20 00:57:13 -07:00
Kubernetes Prow Robot 8d2ec08b2c
Merge pull request #2015 from losipiuk/lo/pass-via-context
Add methods for passing arbitrary object via autoscaling context
2019-05-17 08:12:07 -07:00
Łukasz Osipiuk e76558c65f Add methods for passing arbitrary object via autoscaling context
Change-Id: I066e58010a0aef4989bfc1f73b90bc69c773b26e
2019-05-17 16:38:12 +02:00
Krzysztof Jastrzebski 4247c8b032 Implement functionality which delays node deletion when a node has
an annotation with the prefix
'delay-deletion.cluster-autoscaler.kubernetes.io/'.
2019-05-17 16:06:17 +02:00
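The annotation check described in the commit above can be illustrated with a minimal Go sketch. Only the prefix string comes from the commit message; the helper name hasDelayDeletionAnnotation and the plain map standing in for node annotations are assumptions made for illustration, not the autoscaler's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// delayDeletionPrefix is the annotation prefix quoted in the commit message.
const delayDeletionPrefix = "delay-deletion.cluster-autoscaler.kubernetes.io/"

// hasDelayDeletionAnnotation is a hypothetical helper: it reports whether any
// annotation key on a node starts with the delay-deletion prefix.
func hasDelayDeletionAnnotation(annotations map[string]string) bool {
	for key := range annotations {
		if strings.HasPrefix(key, delayDeletionPrefix) {
			return true
		}
	}
	return false
}

func main() {
	// "backup-agent" is an invented suffix, used only for this example.
	annotated := map[string]string{delayDeletionPrefix + "backup-agent": "true"}
	plain := map[string]string{"example.com/owner": "team-a"}

	fmt.Println(hasDelayDeletionAnnotation(annotated))
	fmt.Println(hasDelayDeletionAnnotation(plain))
}
```

A prefix match rather than an exact key lets several components each register their own delay annotation on the same node.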
Kubernetes Prow Robot c756ed3953
Merge pull request #1963 from cjbradfield/ignore-taints
add --ignore-taint flag and ignore taints added by TaintNodesByCondition
2019-05-15 02:18:21 -07:00
Chris Bradfield 92ea680f1a Implement an --ignore-taint flag
This change adds support for a user to specify taints to ignore when
considering a node as a template for a node group.
2019-05-14 10:22:59 -07:00
Chris Bradfield 54773da830 Ignore taints added from TaintNodesByCondition when considering a node as a Node Group template 2019-05-14 10:22:59 -07:00
Kubernetes Prow Robot a6c109f8f5
Merge pull request #1967 from towca/jtuznik/delete-empty-nodes-behaviour-fix
Modify the info passed to ScaleDownStatusProcessor when empty nodes a…
2019-04-30 05:25:37 -07:00
Jakub Tużnik b92f971326 Provide ScaleDownStatusProcessor with more info about scale-down results 2019-04-30 13:49:06 +02:00
Jakub Tużnik 402c643851 Modify the info passed to ScaleDownStatusProcessor when empty nodes are deleted
Previously, if any of the nodes failed to delete, the processor got
a ScaleDownError status. After this commit, it will get the list of
nodes that were successfully deleted.
2019-04-26 15:54:11 +02:00
Łukasz Osipiuk c9811e87b4 Include pods with NominatedNodeName in scheduledPods list used for scale-up considerations
Change-Id: Ie4c095b30bf0cd1f160f1ac4b8c1fcb8c0524096
2019-04-15 16:59:13 +02:00
Łukasz Osipiuk db4c6f1133 Migrate filter out schedulable to PodListProcessor 2019-04-15 16:59:13 +02:00
Łukasz Osipiuk 5c09c50774 Pass ready nodes list to PodListProcessor 2019-04-15 16:59:13 +02:00
Łukasz Osipiuk c6115b826e Define ProcessorCallbacks interface 2019-04-15 16:59:13 +02:00
Jiaxin Shan 83ae66cebc Consider GPU utilization in scaling down 2019-04-04 01:12:51 -07:00
Jiaxin Shan 90666881d3 Move GPULabel and GPUTypes to cloud provider 2019-03-25 13:03:01 -07:00
Lukasz Piatkowski c5ba4b3068 priority expander 2019-03-22 10:43:20 +01:00
Łukasz Osipiuk 2474dc2fd5 Call CloudProvider.Refresh before getNodeInfosForGroups
We need to call refresh before getNodeInfosForGroups. If we have
stale state, getNodeInfosForGroups may fail and we will end up crash looping indefinitely.
2019-03-12 12:07:49 +01:00
Aleksandra Malinowska 62a28f3005 Soft taint when there are no candidates 2019-03-11 14:05:09 +01:00
Andrew McDermott 5ae76ea66e UPSTREAM: <carry>: fix max cluster size calculation on scale up
When scaling up, the calculation for computing the maximum cluster size
does not take into account the number of upcoming nodes, so it is
possible to grow the cluster beyond the maximum cluster
size (--max-nodes-total).

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1670695
2019-03-08 13:28:58 +00:00
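The arithmetic behind the fix above can be sketched as follows. This is a simplified illustration, not the patched code: the function capNewNodes and its parameters are hypothetical, and treating a non-positive limit as "no limit" is an assumption of this sketch.

```go
package main

import "fmt"

// capNewNodes is a hypothetical helper: it returns how many of the requested
// new nodes can be added without exceeding maxNodesTotal, counting nodes that
// are already requested but not yet registered (upcoming).
// Assumption for this sketch: maxNodesTotal <= 0 means "no limit".
func capNewNodes(requested, existing, upcoming, maxNodesTotal int) int {
	if maxNodesTotal <= 0 {
		return requested
	}
	// Upcoming nodes consume room too; ignoring them is the bug described above.
	room := maxNodesTotal - existing - upcoming
	if room <= 0 {
		return 0
	}
	if requested > room {
		return room
	}
	return requested
}

func main() {
	// 8 existing + 1 upcoming with a limit of 10 leaves room for one more node.
	fmt.Println(capNewNodes(5, 8, 1, 10)) // prints 1
}
```

Without the upcoming term, the same call would allow two new nodes and the cluster could reach eleven nodes once the upcoming one registers.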
Uday Ruddarraju 91b7bc08a1 Fixing minor error handling bug in static autoscaler 2019-03-07 15:16:27 -08:00
Kubernetes Prow Robot 8944afd901
Merge pull request #1720 from aleksandra-malinowska/events-client
Use separate client for events
2019-02-26 12:00:19 -08:00
Aleksandra Malinowska a824e87957 Only soft taint nodes if there's no scale down to do 2019-02-25 17:11:15 +01:00
Aleksandra Malinowska f304722a1f Use separate client for events 2019-02-25 13:58:54 +01:00
Pengfei Ni 2546d0d97c Move leaderelection options to new packages 2019-02-21 13:45:46 +08:00
Pengfei Ni 128729bae9 Move schedulercache to package nodeinfo 2019-02-21 12:41:08 +08:00
Jacek Kaniuk d969baff22 Cache exemplar ready node for each node group 2019-02-11 17:40:58 +01:00
Jacek Kaniuk f054c53c46 Account for kernel reserved memory in capacity calculations 2019-02-08 17:04:07 +01:00
Marcin Wielgus 99f1dcf9d2
Merge branch 'master' into crc-fix-error-format 2019-02-01 17:22:57 +01:00
Kubernetes Prow Robot bd84757b7e
Merge pull request #1596 from vivekbagade/improve-filterout-logic
Added better checks for filterSchedulablePods and added a tunable fla…
2019-01-27 13:00:31 -08:00
Vivek Bagade c6b87841ce Added a new method that uses pod packing to filter schedulable pods
filterOutSchedulableByPacking is an alternative to the older
filterOutSchedulable. filterOutSchedulableByPacking sorts pods in
unschedulableCandidates by priority and filters out pods that can be
scheduled on free capacity on existing nodes. It uses a basic packing
approach to do this. Pods with nominatedNodeName set are always
filtered out.

filterOutSchedulableByPacking is used by default, but this can be
toggled off by setting the filter-out-schedulable-pods-uses-packing flag
to false, which activates the older and more lenient
filterOutSchedulable (now called filterOutSchedulableSimple).

Added test cases for both methods.
2019-01-25 16:09:51 +05:30
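The packing approach described in the commit above can be sketched in a few lines of Go. This is a rough illustration under strong assumptions, not the actual filterOutSchedulableByPacking: the pod type, the single CPU dimension, the first-fit placement, and the name filterByPacking are all invented for the example, and the real implementation checks scheduler predicates rather than one resource.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a simplified stand-in for a Kubernetes pod (hypothetical type).
type pod struct {
	name          string
	priority      int
	cpuMilli      int64
	nominatedNode string // non-empty when a nominated node is set
}

// filterByPacking returns the pods that still need new capacity. Pods are
// considered in descending priority order and placed first-fit into the free
// CPU of existing nodes; pods with a nominated node are always filtered out.
func filterByPacking(pods []pod, freeCPUMilli []int64) []pod {
	sorted := append([]pod(nil), pods...) // copy so the caller's slice is untouched
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].priority > sorted[j].priority
	})
	free := append([]int64(nil), freeCPUMilli...)

	var unschedulable []pod
	for _, p := range sorted {
		if p.nominatedNode != "" {
			continue // treated as already scheduled
		}
		placed := false
		for i := range free {
			if free[i] >= p.cpuMilli {
				free[i] -= p.cpuMilli // reserve the capacity for this pod
				placed = true
				break
			}
		}
		if !placed {
			unschedulable = append(unschedulable, p)
		}
	}
	return unschedulable
}

func main() {
	pods := []pod{
		{name: "low", priority: 1, cpuMilli: 300},
		{name: "high", priority: 10, cpuMilli: 600},
		{name: "mid", priority: 5, cpuMilli: 600},
		{name: "nominated", priority: 0, cpuMilli: 900, nominatedNode: "node-1"},
	}
	for _, p := range filterByPacking(pods, []int64{1000}) {
		fmt.Println(p.name)
	}
}
```

Because capacity is reserved as each pod is placed, a second pod cannot reuse space the first one already claimed, which is what makes this stricter than checking each pod against the free capacity independently.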
Jacek Kaniuk d05dbb9ec4 Refactor tests of tainting
Refactor scale down and deletetaint tests
Speed up deletetaint tests
2019-01-25 09:21:41 +01:00
Vivek Bagade 8fff0f6556 Removing nominatedNodeName annotation and moving to pod.Status.NominatedNodeName 2019-01-25 00:06:03 +05:30
Vivek Bagade 79ef3a6940 unexporting methods in utils.go 2019-01-25 00:06:03 +05:30
Jacek Kaniuk d00af2373c Tainting nodes - update first, refresh on conflict 2019-01-24 16:57:27 +01:00
Jacek Kaniuk 0c64e0932a Tainting unneeded nodes as PreferNoSchedule 2019-01-21 13:06:50 +01:00
CodeLingo Bot c0603afdeb Fix error format strings according to best practices from CodeReviewComments
Reverted incorrect change to with error format string

Signed-off-by: CodeLingo Bot <hello@codelingo.io>
Signed-off-by: CodeLingoBot <hello@codelingo.io>
Signed-off-by: CodeLingo Bot <hello@codelingo.io>
Signed-off-by: CodeLingo Bot <bot@codelingo.io>

Resolve conflict

Signed-off-by: CodeLingo Bot <hello@codelingo.io>
Signed-off-by: CodeLingoBot <hello@codelingo.io>
Signed-off-by: CodeLingo Bot <hello@codelingo.io>
Signed-off-by: CodeLingo Bot <bot@codelingo.io>

Fix error strings in test cases to remedy failing tests

Signed-off-by: CodeLingo Bot <bot@codelingo.io>

Fix more error strings to remedy failing tests

Signed-off-by: CodeLingo Bot <bot@codelingo.io>
2019-01-11 09:10:31 +13:00
Łukasz Osipiuk 85a83b62bd Pass nodeGroup->NodeInfo map to ClusterStateRegistry
Change-Id: Ie2a51694b5731b39c8a4135355a3b4c832c26801
2019-01-08 15:52:00 +01:00
Kubernetes Prow Robot 4002559a4c
Merge pull request #1516 from frobware/fix-max-nodes-total-upstream
fix calculation of max cluster size
2019-01-03 10:02:38 -08:00
Maciej Pytel 3f0da8947a Use listers in scale-up 2019-01-02 15:56:01 +01:00
Kubernetes Prow Robot f960f95d28
Merge pull request #1542 from JoeWrightss/patch-7
Fix typo in comment
2019-01-02 05:24:14 -08:00
JoeWrightss 9f87523de9 Fix typo in comment
Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>
2019-01-01 15:10:43 +08:00
Maciej Pytel 9060014992 Use listers in scale-down 2018-12-31 14:55:38 +01:00
Kubernetes Prow Robot ab7f1e69be
Merge pull request #1464 from losipiuk/lo/stockouts2
Better quota-exceeded/stockout handling
2018-12-31 05:28:08 -08:00
Łukasz Osipiuk ddbe05b279 Add unit test for stockouts handling 2018-12-28 17:17:07 +01:00
Łukasz Osipiuk 2fbae197f4 Handle possible stockout/quota scale-up errors 2018-12-28 17:17:07 +01:00
Łukasz Osipiuk 9689b30ee4 Do not use time.Now() in RegisterFailedScaleUp 2018-12-28 17:17:07 +01:00
Łukasz Osipiuk da5bef307b Allow updating Increase for ScaleUpRequest in ClusterStateRegistry 2018-12-28 17:17:07 +01:00
Maciej Pytel 60babe7158 Use kubernetes lister for daemonset instead of custom one
Also migrate to using apps/v1.DaemonSet instead of old
extensions/v1beta1.
2018-12-28 13:55:41 +01:00
Maciej Pytel 40811c2f8b Add listers for more controllers 2018-12-28 13:31:21 +01:00
Kubernetes Prow Robot 62c492cb1f
Merge pull request #1518 from lsytj0413/fix-golint
refactor(*): fix golint warning
2018-12-21 06:05:20 -08:00
lsytj0413 672dddd23a refactor(*): fix golint warning 2018-12-19 10:04:08 +08:00