Commit Graph

381 Commits

Author SHA1 Message Date
Lukasz Piatkowski c5ba4b3068 priority expander 2019-03-22 10:43:20 +01:00
Łukasz Osipiuk 2474dc2fd5 Call CloudProvider.Refresh before getNodeInfosForGroups
We need to call Refresh before getNodeInfosForGroups. If we have
stale state, getNodeInfosForGroups may fail and we will end up in an infinite crash loop.
2019-03-12 12:07:49 +01:00
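For illustration, a minimal sketch of the ordering this commit enforces; the types below are hypothetical stand-ins for the autoscaler's real CloudProvider and node-info code:

```go
package main

import "fmt"

// CloudProvider is a stand-in for the real cloudprovider.CloudProvider.
type CloudProvider interface {
	Refresh() error
}

type staticProvider struct{}

func (staticProvider) Refresh() error { return nil }

// getNodeInfosForGroups stands in for the real function; with freshly
// refreshed state it cannot act on node groups that no longer exist.
func getNodeInfosForGroups(cp CloudProvider) error { return nil }

func runOnce(cp CloudProvider) error {
	// Refresh first: with stale state, the call below can fail on every
	// iteration, leaving the autoscaler in an infinite crash loop.
	if err := cp.Refresh(); err != nil {
		return fmt.Errorf("failed to refresh cloud provider: %v", err)
	}
	return getNodeInfosForGroups(cp)
}

func main() {
	fmt.Println(runOnce(staticProvider{}))
}
```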
Aleksandra Malinowska 62a28f3005 Soft taint when there are no candidates 2019-03-11 14:05:09 +01:00
Andrew McDermott 5ae76ea66e UPSTREAM: <carry>: fix max cluster size calculation on scale up
When scaling up, the calculation of the maximum cluster size
does not take into account any upcoming nodes, so it is
possible to grow the cluster beyond the maximum cluster
size (--max-nodes-total).

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1670695
2019-03-08 13:28:58 +00:00
Uday Ruddarraju 91b7bc08a1 Fixing minor error handling bug in static autoscaler 2019-03-07 15:16:27 -08:00
Kubernetes Prow Robot 8944afd901
Merge pull request #1720 from aleksandra-malinowska/events-client
Use separate client for events
2019-02-26 12:00:19 -08:00
Aleksandra Malinowska a824e87957 Only soft taint nodes if there's no scale down to do 2019-02-25 17:11:15 +01:00
Aleksandra Malinowska f304722a1f Use separate client for events 2019-02-25 13:58:54 +01:00
Pengfei Ni 2546d0d97c Move leaderelection options to new packages 2019-02-21 13:45:46 +08:00
Pengfei Ni 128729bae9 Move schedulercache to package nodeinfo 2019-02-21 12:41:08 +08:00
Jacek Kaniuk d969baff22 Cache exemplar ready node for each node group 2019-02-11 17:40:58 +01:00
Jacek Kaniuk f054c53c46 Account for kernel reserved memory in capacity calculations 2019-02-08 17:04:07 +01:00
Marcin Wielgus 99f1dcf9d2
Merge branch 'master' into crc-fix-error-format 2019-02-01 17:22:57 +01:00
Kubernetes Prow Robot bd84757b7e
Merge pull request #1596 from vivekbagade/improve-filterout-logic
Added better checks for filterSchedulablePods and added a tunable fla…
2019-01-27 13:00:31 -08:00
Vivek Bagade c6b87841ce Added a new method that uses pod packing to filter schedulable pods
filterOutSchedulableByPacking is an alternative to the older
filterOutSchedulable. filterOutSchedulableByPacking sorts pods in
unschedulableCandidates by priority and filters out pods that can be
scheduled on free capacity on existing nodes. It uses a basic packing
approach to do this. Pods with nominatedNodeName set are always
filtered out.

filterOutSchedulableByPacking is used by default, but this
can be toggled off by setting the filter-out-schedulable-pods-uses-packing
flag to false, which activates the older and more lenient
filterOutSchedulable (now called filterOutSchedulableSimple).

Added test cases for both methods.
2019-01-25 16:09:51 +05:30
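For illustration, a toy version of the packing idea (the real filterOutSchedulableByPacking runs scheduler predicates against full node state; capacity is collapsed to a single CPU figure here):

```go
package main

import (
	"fmt"
	"sort"
)

type pod struct {
	name     string
	priority int32
	cpuMilli int64
}

// Toy packing filter: highest-priority pods claim free capacity first;
// pods that fit somewhere are filtered out of the unschedulable set.
func filterOutSchedulableByPacking(pods []pod, freeCPUMilli []int64) []pod {
	sort.Slice(pods, func(i, j int) bool { return pods[i].priority > pods[j].priority })
	var stillUnschedulable []pod
	for _, p := range pods {
		placed := false
		for i := range freeCPUMilli {
			if p.cpuMilli <= freeCPUMilli[i] {
				freeCPUMilli[i] -= p.cpuMilli // pack onto this node
				placed = true
				break
			}
		}
		if !placed {
			stillUnschedulable = append(stillUnschedulable, p)
		}
	}
	return stillUnschedulable
}

func main() {
	pods := []pod{{"a", 10, 500}, {"b", 100, 800}, {"c", 1, 900}}
	// One node with 1000m free: "b" (highest priority) fits and is
	// filtered out; "a" and "c" remain unschedulable.
	fmt.Println(filterOutSchedulableByPacking(pods, []int64{1000}))
}
```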
Jacek Kaniuk d05dbb9ec4 Refactor tests of tainting
Refactor scale down and deletetaint tests
Speed up deletetaint tests
2019-01-25 09:21:41 +01:00
Vivek Bagade 8fff0f6556 Removing nominatedNodeName annotation and moving to pod.Status.NominatedNodeName 2019-01-25 00:06:03 +05:30
Vivek Bagade 79ef3a6940 unexporting methods in utils.go 2019-01-25 00:06:03 +05:30
Jacek Kaniuk d00af2373c Tainting nodes - update first, refresh on conflict 2019-01-24 16:57:27 +01:00
Jacek Kaniuk 0c64e0932a Tainting unneeded nodes as PreferNoSchedule 2019-01-21 13:06:50 +01:00
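For context, a hedged sketch of what a "soft" PreferNoSchedule taint looks like on a node object; the taint key below is illustrative, not a confirmed copy of the autoscaler's constant:

```go
package main

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
)

func main() {
	node := &v1.Node{}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		// Illustrative key; the autoscaler defines its own constant.
		Key:   "DeletionCandidateOfClusterAutoscaler",
		Value: fmt.Sprint(time.Now().Unix()),
		// PreferNoSchedule only discourages new pods, so an unneeded
		// node can drain naturally without hard-blocking the scheduler.
		Effect: v1.TaintEffectPreferNoSchedule,
	})
	fmt.Println(node.Spec.Taints)
}
```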
CodeLingo Bot c0603afdeb Fix error format strings according to best practices from CodeReviewComments
Reverted incorrect change to an error format string

Resolve conflict

Fix error strings in test cases to remedy failing tests

Fix more error strings to remedy failing tests

Signed-off-by: CodeLingo Bot <hello@codelingo.io>
Signed-off-by: CodeLingo Bot <bot@codelingo.io>
2019-01-11 09:10:31 +13:00
Łukasz Osipiuk 85a83b62bd Pass nodeGroup->NodeInfo map to ClusterStateRegistry
Change-Id: Ie2a51694b5731b39c8a4135355a3b4c832c26801
2019-01-08 15:52:00 +01:00
Kubernetes Prow Robot 4002559a4c
Merge pull request #1516 from frobware/fix-max-nodes-total-upstream
fix calculation of max cluster size
2019-01-03 10:02:38 -08:00
Maciej Pytel 3f0da8947a Use listers in scale-up 2019-01-02 15:56:01 +01:00
Kubernetes Prow Robot f960f95d28
Merge pull request #1542 from JoeWrightss/patch-7
Fix typo in comment
2019-01-02 05:24:14 -08:00
JoeWrightss 9f87523de9 Fix typo in comment
Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>
2019-01-01 15:10:43 +08:00
Maciej Pytel 9060014992 Use listers in scale-down 2018-12-31 14:55:38 +01:00
Kubernetes Prow Robot ab7f1e69be
Merge pull request #1464 from losipiuk/lo/stockouts2
Better quota-exceeded/stockout handling
2018-12-31 05:28:08 -08:00
Łukasz Osipiuk ddbe05b279 Add unit test for stockouts handling 2018-12-28 17:17:07 +01:00
Łukasz Osipiuk 2fbae197f4 Handle possible stockout/quota scale-up errors 2018-12-28 17:17:07 +01:00
Łukasz Osipiuk 9689b30ee4 Do not use time.Now() in RegisterFailedScaleUp 2018-12-28 17:17:07 +01:00
Łukasz Osipiuk da5bef307b Allow updating Increase for ScaleUpRequest in ClusterStateRegistry 2018-12-28 17:17:07 +01:00
Maciej Pytel 60babe7158 Use kubernetes lister for daemonset instead of custom one
Also migrate to using apps/v1.DaemonSet instead of the old
extensions/v1beta1.
2018-12-28 13:55:41 +01:00
Maciej Pytel 40811c2f8b Add listers for more controllers 2018-12-28 13:31:21 +01:00
Kubernetes Prow Robot 62c492cb1f
Merge pull request #1518 from lsytj0413/fix-golint
refactor(*): fix golint warning
2018-12-21 06:05:20 -08:00
lsytj0413 672dddd23a refactor(*): fix golint warning 2018-12-19 10:04:08 +08:00
Andrew McDermott 5bc77f051c UPSTREAM: <carry>: fix calculation of max cluster size
When scaling up, the calculation for the maximum size of the cluster
based on `--max-nodes-total` doesn't take into account any nodes that
are in the process of coming up. This allows the cluster to grow
beyond the size specified.

With this change I now see:

scale_up.go:266] 21 other pods are also unschedulable
scale_up.go:423] Best option to resize: openshift-cluster-api/amcdermo-ca-worker-us-east-2b
scale_up.go:427] Estimated 18 nodes needed in openshift-cluster-api/amcdermo-ca-worker-us-east-2b
scale_up.go:432] Capping size to max cluster total size (23)
static_autoscaler.go:275] Failed to scale up: max node total count already reached
2018-12-18 17:05:19 +00:00
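A sketch of the capping logic as the commit describes it, counting upcoming nodes toward the current total before capping new nodes against --max-nodes-total; the function and parameter names are illustrative, not the real code:

```go
package main

import "fmt"

// capNewNodeCount limits how many nodes a scale-up may add so that
// ready + upcoming + new never exceeds maxNodesTotal.
func capNewNodeCount(newCount, readyNodes, upcomingNodes, maxNodesTotal int) int {
	if maxNodesTotal <= 0 { // treat 0 as "no limit"
		return newCount
	}
	room := maxNodesTotal - readyNodes - upcomingNodes
	if room < 0 {
		room = 0
	}
	if newCount > room {
		fmt.Printf("Capping size to max cluster total size (%d)\n", maxNodesTotal)
		return room
	}
	return newCount
}

func main() {
	// 23-node cap, 5 ready and 3 already coming up: only 15 more allowed,
	// even though 18 were estimated as needed.
	fmt.Println(capNewNodeCount(18, 5, 3, 23))
}
```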
Zhenhai Gao df10e5f5c2 Fix detailed warning info in log output
Signed-off-by: Zhenhai Gao <gaozh1988@live.com>
2018-12-07 17:25:54 +08:00
Andrew McDermott fd3fd85f26 UPSTREAM: <carry>: handle nil nodeGroup in calculateScaleDownGpusTotal
Explicitly handle nil as a return value for nodeGroup in
`calculateScaleDownGpusTotal()` when `NodeGroupForNode()` is called
for GPU nodes that don't exist. The current logic generates a runtime
exception:

    "reflect: call of reflect.Value.IsNil on zero Value"

Looking through the rest of the tree, all the other places that use
this pattern also explicitly check whether `nodeGroup == nil` first.

This change now completes the pattern in
`calculateScaleDownGpusTotal()`.

Looking at the other occurrences of this pattern we see:

```
File: clusterstate/clusterstate.go
488:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {

File: core/utils.go
231:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
322:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
394:27:			if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
461:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {

File: core/scale_down.go
185:6:		if reflect.ValueOf(nodeGroup).IsNil() {
608:27:			if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
747:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
1010:25:	if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
```

with the notable exception of core/scale_down.go:185, which is
`calculateScaleDownGpusTotal()`.

With this change, and invoking the autoscaler with:

```
...
      --max-nodes-total=24 \
      --cores-total=8:128 \
      --memory-total=4:256 \
      --gpu-total=nvidia.com/gpu:0:16 \
      --gpu-total=amd.com/gpu:0:4 \
...
```

I no longer see a runtime exception.
2018-12-05 18:54:07 +00:00
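The Go pitfall behind that panic is easy to reproduce in isolation; a self-contained illustration of why both halves of the guard are needed (the NodeGroup types are stand-ins, and calling reflect.ValueOf(nil).IsNil() directly would panic with exactly the message quoted above):

```go
package main

import (
	"fmt"
	"reflect"
)

// NodeGroup and gceNodeGroup are stand-ins for the autoscaler's types.
type NodeGroup interface{ Id() string }

type gceNodeGroup struct{ id string }

func (g *gceNodeGroup) Id() string { return g.id }

// isNilNodeGroup checks the interface itself first, then the pointer
// stored inside it, so neither case reaches reflect with a zero Value.
func isNilNodeGroup(ng NodeGroup) bool {
	return ng == nil || reflect.ValueOf(ng).IsNil()
}

func main() {
	var typedNil *gceNodeGroup
	var asInterface NodeGroup = typedNil

	fmt.Println(asInterface == nil)          // false: the interface carries a type
	fmt.Println(isNilNodeGroup(asInterface)) // true: the inner pointer is nil
	fmt.Println(isNilNodeGroup(nil))         // true, and no panic: == nil short-circuits
}
```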
Thomas Hartland d0dd00c602 Fix logged error in static autoscaler 2018-12-04 16:59:57 +01:00
Łukasz Osipiuk 016bf7fc2c Use k8s.io/klog instead github.com/golang/glog 2018-11-26 17:30:31 +01:00
Łukasz Osipiuk 991873c237 Fix gofmt errors 2018-11-26 15:39:59 +01:00
Alex Price 4ae7acbacc add flags to ignore daemonsets and mirror pods when calculating resource utilization of a node
Adds the flags --ignore-daemonsets-utilization and --ignore-mirror-pods-utilization
(both default to false); when enabled, they factor DaemonSet and mirror pods out
when calculating the resource utilization of a node.
2018-11-23 15:24:25 +11:00
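For illustration, a simplified sketch of how such flags could factor pods out of the utilization sum; the pod fields here stand in for the real DaemonSet-ownership and mirror-pod annotation checks:

```go
package main

import "fmt"

type podInfo struct {
	cpuRequestMilli  int64
	ownedByDaemonSet bool
	isMirror         bool // static pods carry a mirror-pod annotation
}

func nodeCPUUtilization(pods []podInfo, allocatableMilli int64,
	ignoreDaemonSets, ignoreMirrorPods bool) float64 {
	var requested int64
	for _, p := range pods {
		if ignoreDaemonSets && p.ownedByDaemonSet {
			continue // factored out by --ignore-daemonsets-utilization
		}
		if ignoreMirrorPods && p.isMirror {
			continue // factored out by --ignore-mirror-pods-utilization
		}
		requested += p.cpuRequestMilli
	}
	return float64(requested) / float64(allocatableMilli)
}

func main() {
	pods := []podInfo{{200, true, false}, {100, false, true}, {400, false, false}}
	fmt.Println(nodeCPUUtilization(pods, 1000, false, false)) // 0.7
	fmt.Println(nodeCPUUtilization(pods, 1000, true, true))   // 0.4
}
```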
Łukasz Osipiuk 5962354c81 Inject Backoff instance to ClusterStateRegistry on creation 2018-11-13 14:25:16 +01:00
k8s-ci-robot 7008fb50be
Merge pull request #1380 from losipiuk/lo/backoff
Make Backoff interface
2018-11-07 05:13:43 -08:00
Aleksandra Malinowska 6febc1ddb0 Fix formatted log messages 2018-11-06 14:51:43 +01:00
Aleksandra Malinowska bf6ff4be8e Clean up estimators 2018-11-06 14:15:42 +01:00
Łukasz Osipiuk 0e2c3739b7 Use NodeGroup as key in Backoff 2018-10-30 18:17:26 +01:00
Łukasz Osipiuk 55fc1e2f00 Store NodeGroup in ScaleUpRequest and ScaleDownRequest 2018-10-30 18:03:04 +01:00
Maciej Pytel 6f5e6aab6f Move node group balancing to processor
The goal is to allow customization of this logic
for different use cases and cloud providers.
2018-10-25 14:04:05 +02:00
Łukasz Osipiuk a266420f6a Recalculate clusterStateRegistry after adding multiple node groups 2018-10-02 17:15:20 +02:00
Łukasz Osipiuk 437efe4af6 If possible use nodeInfo based on created node group 2018-10-02 15:46:45 +02:00
Jakub Tużnik 8179e4e716 Refactor the scale-(up|down) status processors so that they have more info available
Replace the simple boolean ScaledUp property of ScaleUpStatus with a more
comprehensive ScaleUpResult. Add more possible values to ScaleDownResult.
Refactor the processors' execution so that they always run every
iteration, even if RunOnce exits early.
2018-09-20 17:12:02 +02:00
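For illustration, a sketch of the shape of this change; the enum values below are illustrative rather than the exact set the commit introduces:

```go
package main

import "fmt"

// ScaleUpResult replaces the old boolean ScaledUp field with a richer
// outcome; the values here are a hypothetical subset.
type ScaleUpResult int

const (
	ScaleUpSuccessful ScaleUpResult = iota
	ScaleUpError
	ScaleUpNoOptionsAvailable
	ScaleUpNotNeeded
)

// ScaleUpStatus is what a status processor now receives.
type ScaleUpStatus struct {
	Result ScaleUpResult
}

func main() {
	s := ScaleUpStatus{Result: ScaleUpNoOptionsAvailable}
	// Processors can now distinguish "nothing to do" from "tried and failed".
	fmt.Println(s.Result == ScaleUpSuccessful) // false
}
```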
k8s-ci-robot 556029ad8d
Merge pull request #1255 from towca/feat/jtuznik/original-reasons
Add the ability to retrieve the original reasons from a PredicateError
2018-09-20 07:12:37 -07:00
Jakub Tużnik 8a7338e6d8 Add the ability to retrieve the original reasons from a PredicateError 2018-09-19 17:31:34 +02:00
Łukasz Osipiuk bf8cfef10b NodeGroupManager.CreateNodeGroup can return extra created node groups. 2018-09-19 13:55:51 +02:00
k8s-ci-robot d56bb24b71
Merge pull request #1244 from losipiuk/lo/muzon
Call CheckPodsSchedulableOnNode in scale_up.go via caching layer
2018-09-18 02:16:35 -07:00
Steve Scaffidi 88d857222d Renamed one more variable for consistency
Change-Id: Idf42fd58089a1e75f3291ab7cc583735c68735f2
2018-09-17 14:08:10 -04:00
Steve Scaffidi 56b5456269 Fixing nits: renamed newPodScaleUpBuffer -> newPodScaleUpDelay, deleted redundant comment
Change-Id: I7969194d8e07e2fb34029d0d7990341c891d0623
2018-09-17 10:38:28 -04:00
Łukasz Osipiuk 705a6d87e2 fixup! Call CheckPodsSchedulableOnNode in scale_up.go via caching layer 2018-09-17 13:01:19 +02:00
Steve Scaffidi 33b93cbc5f Add configurable delay for pod age before considering for scale-up
- This is intended to address the issue described in https://github.com/kubernetes/autoscaler/issues/923
  - the delay is configurable via a CLI option
  - in production (on AWS) we set this to a value of 2m
  - the delay could possibly be set as low as 30s and still be effective depending on your workload and environment
  - the default of 0 for the CLI option leaves the CA's default behavior unchanged.

Change-Id: I7e3f36bb48641faaf8a392cca01a12b07fb0ee35
2018-09-14 13:55:09 -04:00
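For illustration, a minimal sketch of such an age gate, assuming a delay option along the lines of the newPodScaleUpDelay named in the commits above:

```go
package main

import (
	"fmt"
	"time"
)

type pod struct {
	name      string
	createdAt time.Time
}

// filterOldEnough drops pods younger than delay; with delay == 0 (the
// default in the commit) every pod passes and behavior is unchanged.
func filterOldEnough(pods []pod, now time.Time, delay time.Duration) []pod {
	var eligible []pod
	for _, p := range pods {
		if now.Sub(p.createdAt) >= delay {
			eligible = append(eligible, p)
		}
	}
	return eligible
}

func main() {
	now := time.Now()
	pods := []pod{
		{"fresh", now.Add(-30 * time.Second)},
		{"aged", now.Add(-3 * time.Minute)},
	}
	// With the 2m production setting mentioned above, only "aged" can
	// trigger a scale-up in this iteration.
	for _, p := range filterOldEnough(pods, now, 2*time.Minute) {
		fmt.Println(p.name)
	}
}
```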
Łukasz Osipiuk 0ad4efe920 Call CheckPodsSchedulableOnNode in scale_up.go via caching layer 2018-09-13 17:01:15 +02:00
Jakub Tużnik 71111da20c Add a scale down status processor, refactor so that there's more scale down info available to it 2018-09-12 14:52:20 +02:00
mikeweiwei 7ed0599b42 Fix delete node event (#1229)
* Add more events: when a node is deleted, add an event

* Move eventf above return and change type to warning
2018-09-07 14:31:57 +02:00
Łukasz Osipiuk 84d8f6fd31 Remove obsolete implementations of node-related processors 2018-09-05 11:58:46 +02:00
Aleksandra Malinowska b88e6019f7 code review fixes 3 2018-08-28 18:11:04 +02:00
Aleksandra Malinowska 5620f76c62 Pass NoScaleUpInfo to ScaleUpStatus processor 2018-08-28 14:26:03 +02:00
Aleksandra Malinowska cd9808185e Report reason why pod didn't trigger scale-up 2018-08-28 14:11:36 +02:00
Aleksandra Malinowska f5690aab96 Make CheckPredicates return predicateError 2018-08-28 14:11:35 +02:00
Jakub Tużnik 054f0b3b90 Add AutoscalingStatusProcessor 2018-08-07 14:47:06 +02:00
Aleksandra Malinowska 90e8a7a2d9 Move initializing defaults out of main 2018-08-02 14:04:03 +02:00
Aleksandra Malinowska 6f9b6f8290 Move ListerRegistry to context 2018-07-26 13:31:49 +02:00
Aleksandra Malinowska 7225a0fcab Move all Kubernetes API clients to AutoscalingKubeClients 2018-07-26 13:31:48 +02:00
Aleksandra Malinowska 07e52e6c79 Move creating cloud provider out of context 2018-07-25 13:43:47 +02:00
Aleksandra Malinowska 0976d2aa07 Move autoscaling options out of static 2018-07-25 10:52:37 +02:00
Aleksandra Malinowska 6b94d7172d Move AutoscalingOptions to config/static 2018-07-23 15:52:27 +02:00
Aleksandra Malinowska f7352500d7
Merge pull request #1080 from aleksandra-malinowska/refactor-cp-3
Remove not-so-useful type check test
2018-07-23 12:00:10 +02:00
Aleksandra Malinowska 1c09fdfe6a Remove not-so-useful type check test 2018-07-23 11:32:24 +02:00
Aleksandra Malinowska 398a1ac153 Fix error on node info not found for group 2018-07-23 11:16:12 +02:00
Aleksandra Malinowska 3b90694191 Remove autoscaler builder 2018-07-19 15:22:30 +02:00
Aleksandra Malinowska 54f8497079 Remove unused dynamic.Config 2018-07-19 14:53:09 +02:00
Pengfei Ni 1dd0147d9e Add more events for CA 2018-07-09 15:42:05 +08:00
Aleksandra Malinowska 800ee56b34 Refactor and extend GPU metrics error types 2018-07-05 13:13:11 +02:00
Karol Gołąb aae4d1270a Make GetGpuTypeForMetrics more robust 2018-06-26 21:35:16 +02:00
Marcin Wielgus f2e76e2592
Merge pull request #1008 from krzysztof-jastrzebski/master
Move removing unneeded autoprovisioned node groups to node group manager
2018-06-22 21:01:36 +02:00
Karol Gołąb 5eb7021f82 Add GPU-related scaled_up & scaled_down metrics (#974)
* Add GPU-related scaled_up & scaled_down metrics

* Fix name to match SD naming convention

* Fix import after master rebase

* Change the logic to include GPU-being-installed nodes
2018-06-22 21:00:52 +02:00
Krzysztof Jastrzebski 2df2568841 Move removing unneeded autoprovisioned node groups to node group manager 2018-06-22 14:26:12 +02:00
Nic Doye ebadbda2b2 issues/933 Consider making UnremovableNodeRecheckTimeout configurable 2018-06-18 11:54:14 +01:00
Aleksandra Malinowska ed5e82d85d
Merge pull request #956 from krzysztof-jastrzebski/master
Create NodeGroupManager which is responsible for creating…
2018-06-14 17:25:32 +02:00
Łukasz Osipiuk 51d628c2f1 Add test to check if nodes from not autoscaled groups are used in max-nodes limit 2018-06-14 16:17:51 +02:00
Krzysztof Jastrzebski 99c8c51bb3 Create NodeGroupManager which is responsible for creating/deleting node groups. 2018-06-14 16:11:32 +02:00
Łukasz Osipiuk b7323bc0d1 Respect GPU limits in scale_up 2018-06-14 15:46:58 +02:00
Łukasz Osipiuk dfcbedb41f Take into consideration nodes from not autoscaled groups when enforcing resource limits 2018-06-14 15:31:40 +02:00
Łukasz Osipiuk b1db155c50 Remove duplicated test case 2018-06-13 19:00:37 +02:00
Łukasz Osipiuk 9f75099d2c Restructure checking resource limits in scale_up.go
Preparatory work before introducing GPU limits
2018-06-13 19:00:37 +02:00
Łukasz Osipiuk 087a5cc9a9 Respect GPU limits in scale_down 2018-06-13 14:19:59 +02:00
Łukasz Osipiuk 1fa44a4d3a Fix bug resulting in resource limits not being enforced in scale_down 2018-06-11 16:39:07 +02:00
Łukasz Osipiuk 519064e1ec Extract isNodeBeingDeleted function 2018-06-11 14:21:07 +02:00
Łukasz Osipiuk 6c57a01fc9 Restructure checking resource limits in scale_down.go 2018-06-11 14:02:40 +02:00
Pengfei Ni be3dd85503 Update scheduler cache package 2018-06-11 13:54:12 +08:00