Commit Graph

331 Commits

Author SHA1 Message Date
Lukasz Piatkowski c5ba4b3068 priority expander 2019-03-22 10:43:20 +01:00
Łukasz Osipiuk 2474dc2fd5 Call CloudProvider.Refresh before getNodeInfosForGroups
We need to call Refresh before getNodeInfosForGroups. If we have
stale state, getNodeInfosForGroups may fail and we will end up in an infinite crash loop.
2019-03-12 12:07:49 +01:00
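A minimal Go sketch of the ordering this commit enforces; the `CloudProvider` interface and the node-info builder below are simplified stand-ins for the real cluster-autoscaler types, not the actual code:

```go
package sketch

// CloudProvider is a simplified stand-in for the real cluster-autoscaler
// interface; Refresh reloads the provider's cached view of node groups.
type CloudProvider interface {
	Refresh() error
	NodeGroups() []string
}

// NodeInfoBuilder stands in for getNodeInfosForGroups, which can fail when
// its view of the node groups is stale.
type NodeInfoBuilder func(groups []string) (map[string]string, error)

// runOnce shows the ordering the fix enforces: refresh the cloud provider
// *before* building per-group node infos. With the old ordering, a stale
// view could make the build step fail on every loop iteration, leaving the
// autoscaler in a permanent crash loop.
func runOnce(cloud CloudProvider, buildNodeInfos NodeInfoBuilder) error {
	if err := cloud.Refresh(); err != nil {
		return err
	}
	_, err := buildNodeInfos(cloud.NodeGroups())
	return err
}
```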
Aleksandra Malinowska 62a28f3005 Soft taint when there are no candidates 2019-03-11 14:05:09 +01:00
Andrew McDermott 5ae76ea66e UPSTREAM: <carry>: fix max cluster size calculation on scale up
When scaling up, the calculation of the maximum cluster size does not
take into account any upcoming nodes, so it is possible to grow the
cluster beyond the configured maximum size (--max-nodes-total).

Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1670695
2019-03-08 13:28:58 +00:00
Uday Ruddarraju 91b7bc08a1 Fixing minor error handling bug in static autoscaler 2019-03-07 15:16:27 -08:00
Kubernetes Prow Robot 8944afd901
Merge pull request #1720 from aleksandra-malinowska/events-client
Use separate client for events
2019-02-26 12:00:19 -08:00
Aleksandra Malinowska a824e87957 Only soft taint nodes if there's no scale down to do 2019-02-25 17:11:15 +01:00
Aleksandra Malinowska f304722a1f Use separate client for events 2019-02-25 13:58:54 +01:00
Pengfei Ni 2546d0d97c Move leaderelection options to new packages 2019-02-21 13:45:46 +08:00
Pengfei Ni 128729bae9 Move schedulercache to package nodeinfo 2019-02-21 12:41:08 +08:00
Jacek Kaniuk d969baff22 Cache exemplar ready node for each node group 2019-02-11 17:40:58 +01:00
Jacek Kaniuk f054c53c46 Account for kernel reserved memory in capacity calculations 2019-02-08 17:04:07 +01:00
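Loosely, the idea behind this commit; the names and the reserved-memory figure below are made-up placeholders, not the autoscaler's actual estimate:

```go
package sketch

// predictedMemoryCapacity sketches accounting for kernel-reserved memory
// when estimating the capacity of a node that does not exist yet: memory the
// kernel keeps for itself is never visible to kubelet, so it must be
// subtracted up front. The 64 MiB figure is an illustrative placeholder.
func predictedMemoryCapacity(physicalMemoryBytes int64) int64 {
	const assumedKernelReservedBytes = 64 * 1024 * 1024
	capacity := physicalMemoryBytes - assumedKernelReservedBytes
	if capacity < 0 {
		return 0
	}
	return capacity
}
```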
Marcin Wielgus 99f1dcf9d2
Merge branch 'master' into crc-fix-error-format 2019-02-01 17:22:57 +01:00
Kubernetes Prow Robot bd84757b7e
Merge pull request #1596 from vivekbagade/improve-filterout-logic
Added better checks for filterSchedulablePods and added a tunable fla…
2019-01-27 13:00:31 -08:00
Vivek Bagade c6b87841ce Added a new method that uses pod packing to filter schedulable pods
filterOutSchedulableByPacking is an alternative to the older
filterOutSchedulable. filterOutSchedulableByPacking sorts pods in
unschedulableCandidates by priority and filters out pods that can be
scheduled on free capacity on existing nodes. It uses a basic packing
approach to do this. Pods with nominatedNodeName set are always
filtered out.

filterOutSchedulableByPacking is set to be used by default, but, this
can be toggled off by setting filter-out-schedulable-pods-uses-packing
flag to false, which would then activate the older and more lenient
filterOutSchedulable(now called filterOutSchedulableSimple).

Added test cases for both methods.
2019-01-25 16:09:51 +05:30
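A condensed Go sketch of the packing filter described above, with `Pod` and `Node` reduced to toy single-resource types; the real code checks full scheduler predicates rather than one CPU dimension:

```go
package sketch

import "sort"

// Pod and Node are toy stand-ins for the Kubernetes objects.
type Pod struct {
	Name              string
	Priority          int32
	CPURequest        int64
	NominatedNodeName string
}

type Node struct {
	Name    string
	FreeCPU int64
}

// filterOutSchedulableByPacking returns the pods that still need a scale-up.
// Pods are considered in descending priority order and greedily packed onto
// the remaining free capacity; whatever fits is filtered out, as is any pod
// that already has a nominated node.
func filterOutSchedulableByPacking(pods []Pod, nodes []Node) []Pod {
	sort.Slice(pods, func(i, j int) bool { return pods[i].Priority > pods[j].Priority })

	free := make(map[string]int64, len(nodes))
	for _, n := range nodes {
		free[n.Name] = n.FreeCPU
	}

	var stillUnschedulable []Pod
	for _, p := range pods {
		if p.NominatedNodeName != "" {
			continue // capacity is already reserved for this pod
		}
		placed := false
		for name, cpu := range free {
			if cpu >= p.CPURequest {
				free[name] = cpu - p.CPURequest // the "packing": reserve capacity
				placed = true
				break
			}
		}
		if !placed {
			stillUnschedulable = append(stillUnschedulable, p)
		}
	}
	return stillUnschedulable
}
```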
Jacek Kaniuk d05dbb9ec4 Refactor tests of tainting
Refactor scale down and deletetaint tests
Speed up deletetaint tests
2019-01-25 09:21:41 +01:00
Vivek Bagade 8fff0f6556 Removing nominatedNodeName annotation and moving to pod.Status.NominatedNodeName 2019-01-25 00:06:03 +05:30
Vivek Bagade 79ef3a6940 unexporting methods in utils.go 2019-01-25 00:06:03 +05:30
Jacek Kaniuk d00af2373c Tainting nodes - update first, refresh on conflict 2019-01-24 16:57:27 +01:00
Jacek Kaniuk 0c64e0932a Tainting unneeded nodes as PreferNoSchedule 2019-01-21 13:06:50 +01:00
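A rough Go sketch of the "update first, refresh on conflict" pattern from these two commits; the types stand in for the Kubernetes node object and client, and `isConflict` stands in for client-go's conflict check:

```go
package sketch

import "errors"

// Node and NodeClient are simplified stand-ins for the real object and client.
type Node struct {
	Name   string
	Taints []string
}

type NodeClient interface {
	Get(name string) (*Node, error)
	Update(node *Node) error
}

// addSoftTaint optimistically writes the taint (PreferNoSchedule in the real
// code, so pods are discouraged rather than blocked) and re-fetches the node
// only when the API server rejects the write with an optimistic-concurrency
// conflict, instead of re-fetching before every attempt.
func addSoftTaint(client NodeClient, node *Node, taint string, isConflict func(error) bool) error {
	for attempt := 0; attempt < 5; attempt++ {
		node.Taints = append(node.Taints, taint)
		err := client.Update(node)
		if err == nil {
			return nil
		}
		if !isConflict(err) {
			return err
		}
		// Another writer changed the node; refresh and try again.
		fresh, getErr := client.Get(node.Name)
		if getErr != nil {
			return getErr
		}
		node = fresh
	}
	return errors.New("too many conflicts while tainting node")
}
```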
CodeLingo Bot c0603afdeb Fix error format strings according to best practices from CodeReviewComments
Fix error format strings according to best practices from CodeReviewComments

Reverted an incorrect change to an error format string

Resolve conflict

Fix error strings in test cases to remedy failing tests

Fix more error strings to remedy failing tests

Signed-off-by: CodeLingo Bot <bot@codelingo.io>
2019-01-11 09:10:31 +13:00
Łukasz Osipiuk 85a83b62bd Pass nodeGroup->NodeInfo map to ClusterStateRegistry
Change-Id: Ie2a51694b5731b39c8a4135355a3b4c832c26801
2019-01-08 15:52:00 +01:00
Kubernetes Prow Robot 4002559a4c
Merge pull request #1516 from frobware/fix-max-nodes-total-upstream
fix calculation of max cluster size
2019-01-03 10:02:38 -08:00
Maciej Pytel 3f0da8947a Use listers in scale-up 2019-01-02 15:56:01 +01:00
Kubernetes Prow Robot f960f95d28
Merge pull request #1542 from JoeWrightss/patch-7
Fix typo in comment
2019-01-02 05:24:14 -08:00
JoeWrightss 9f87523de9 Fix typo in comment
Signed-off-by: JoeWrightss <zhoulin.xie@daocloud.io>
2019-01-01 15:10:43 +08:00
Maciej Pytel 9060014992 Use listers in scale-down 2018-12-31 14:55:38 +01:00
Kubernetes Prow Robot ab7f1e69be
Merge pull request #1464 from losipiuk/lo/stockouts2
Better quota-exceeded/stockout handling
2018-12-31 05:28:08 -08:00
Łukasz Osipiuk ddbe05b279 Add unit test for stockouts handling 2018-12-28 17:17:07 +01:00
Łukasz Osipiuk 2fbae197f4 Handle possible stockout/quota scale-up errors 2018-12-28 17:17:07 +01:00
Łukasz Osipiuk 9689b30ee4 Do not use time.Now() in RegisterFailedScaleUp 2018-12-28 17:17:07 +01:00
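A sketch of the testability pattern behind this commit: the timestamp is injected rather than read inside the function, so tests can pass a fixed time and assert on backoff behaviour deterministically. The struct and method body here are illustrative only:

```go
package sketch

import "time"

// ClusterStateRegistry is reduced to the one field this sketch needs.
type ClusterStateRegistry struct {
	failedScaleUps map[string]time.Time
}

// RegisterFailedScaleUp takes the current time as an argument instead of
// calling time.Now() internally; production callers pass time.Now(), tests
// pass a fixed timestamp.
func (r *ClusterStateRegistry) RegisterFailedScaleUp(nodeGroupID string, currentTime time.Time) {
	r.failedScaleUps[nodeGroupID] = currentTime
}
```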
Łukasz Osipiuk da5bef307b Allow updating Increase for ScaleUpRequest in ClusterStateRegistry 2018-12-28 17:17:07 +01:00
Maciej Pytel 60babe7158 Use kubernetes lister for daemonset instead of custom one
Also migrate to using apps/v1.DaemonSet instead of the old
extensions/v1beta1.
2018-12-28 13:55:41 +01:00
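A minimal sketch of the replacement using standard client-go shared informers; this mirrors the approach, not the autoscaler's actual wiring:

```go
package sketch

import (
	"time"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	appslisters "k8s.io/client-go/listers/apps/v1"
)

// newDaemonSetLister builds a shared-informer-backed lister typed against
// apps/v1 rather than the deprecated extensions/v1beta1.
func newDaemonSetLister(client kubernetes.Interface, stopCh <-chan struct{}) appslisters.DaemonSetLister {
	factory := informers.NewSharedInformerFactory(client, time.Hour)
	lister := factory.Apps().V1().DaemonSets().Lister()
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	return lister
}

// allDaemonSets lists every DaemonSet from the local cache, not the API server.
func allDaemonSets(lister appslisters.DaemonSetLister) ([]*appsv1.DaemonSet, error) {
	return lister.List(labels.Everything())
}
```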
Maciej Pytel 40811c2f8b Add listers for more controllers 2018-12-28 13:31:21 +01:00
Kubernetes Prow Robot 62c492cb1f
Merge pull request #1518 from lsytj0413/fix-golint
refactor(*): fix golint warning
2018-12-21 06:05:20 -08:00
lsytj0413 672dddd23a refactor(*): fix golint warning 2018-12-19 10:04:08 +08:00
Andrew McDermott 5bc77f051c UPSTREAM: <carry>: fix calculation of max cluster size
When scaling up, the calculation for the maximum size of the cluster
based on `--max-nodes-total` doesn't take into account any nodes that
are in the process of coming up. This allows the cluster to grow
beyond the size specified.

With this change I now see:

scale_up.go:266] 21 other pods are also unschedulable
scale_up.go:423] Best option to resize: openshift-cluster-api/amcdermo-ca-worker-us-east-2b
scale_up.go:427] Estimated 18 nodes needed in openshift-cluster-api/amcdermo-ca-worker-us-east-2b
scale_up.go:432] Capping size to max cluster total size (23)
static_autoscaler.go:275] Failed to scale up: max node total count already reached
2018-12-18 17:05:19 +00:00
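A sketch of the corrected capping arithmetic; all names are illustrative:

```go
package sketch

// capNewNodeCount sketches the corrected check: the headroom under
// --max-nodes-total must subtract nodes that are already in the process of
// coming up, not just nodes that are currently registered.
func capNewNodeCount(requestedNewNodes, registeredNodes, upcomingNodes, maxNodesTotal int) int {
	headroom := maxNodesTotal - registeredNodes - upcomingNodes
	if headroom < 0 {
		headroom = 0
	}
	if requestedNewNodes > headroom {
		// Corresponds to the "Capping size to max cluster total size" log line above.
		return headroom
	}
	return requestedNewNodes
}
```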
Zhenhai Gao df10e5f5c2 Fix detailed warning info in log output
Signed-off-by: Zhenhai Gao <gaozh1988@live.com>
2018-12-07 17:25:54 +08:00
Andrew McDermott fd3fd85f26 UPSTREAM: <carry>: handle nil nodeGroup in calculateScaleDownGpusTotal
Explicitly handle nil as a return value for nodeGroup in
`calculateScaleDownGpusTotal()` when `NodeGroupForNode()` is called
for GPU nodes that don't exist. The current logic generates a runtime
exception:

    "reflect: call of reflect.Value.IsNil on zero Value"

Looking through the rest of the tree, all the other places that use
this pattern also explicitly check whether `nodeGroup == nil` first.

This change now completes the pattern in
`calculateScaleDownGpusTotal()`.

Looking at the other occurrences of this pattern we see:

```
File: clusterstate/clusterstate.go
488:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {

File: core/utils.go
231:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
322:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
394:27:			if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
461:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {

File: core/scale_down.go
185:6:		if reflect.ValueOf(nodeGroup).IsNil() {
608:27:			if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
747:26:		if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
1010:25:	if nodeGroup == nil || reflect.ValueOf(nodeGroup).IsNil() {
```

with the notable exception at core/scale_down.go:185 which is
`calculateScaleDownGpusTotal()`.

With this change, and invoking the autoscaler with:

```
...
      --max-nodes-total=24 \
      --cores-total=8:128 \
      --memory-total=4:256 \
      --gpu-total=nvidia.com/gpu:0:16 \
      --gpu-total=amd.com/gpu:0:4 \
...
```

I no longer see a runtime exception.
2018-12-05 18:54:07 +00:00
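A self-contained Go example of the typed-nil pitfall this commit guards against, showing why both halves of the quoted check are needed:

```go
package main

import (
	"fmt"
	"reflect"
)

// NodeGroup stands in for the cloudprovider.NodeGroup interface.
type NodeGroup interface{ Id() string }

type gceNodeGroup struct{ name string }

func (g *gceNodeGroup) Id() string { return g.name }

// nodeGroupForNode mimics a provider that returns a typed nil pointer for a
// node it does not manage: the interface value it returns is non-nil even
// though the pointer stored inside it is nil.
func nodeGroupForNode() NodeGroup {
	var g *gceNodeGroup
	return g
}

// isNilNodeGroup is the two-part check the commit completes. The plain
// comparison alone misses the typed-nil case; the reflect call alone panics
// with "reflect: call of reflect.Value.IsNil on zero Value" when ng is a
// true nil interface. Both checks, in this order, cover both cases.
func isNilNodeGroup(ng NodeGroup) bool {
	return ng == nil || reflect.ValueOf(ng).IsNil()
}

func main() {
	ng := nodeGroupForNode()
	fmt.Println(ng == nil)          // false: the typed nil slips through
	fmt.Println(isNilNodeGroup(ng)) // true
}
```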
Thomas Hartland d0dd00c602 Fix logged error in static autoscaler 2018-12-04 16:59:57 +01:00
Łukasz Osipiuk 016bf7fc2c Use k8s.io/klog instead github.com/golang/glog 2018-11-26 17:30:31 +01:00
Łukasz Osipiuk 991873c237 Fix gofmt errors 2018-11-26 15:39:59 +01:00
Alex Price 4ae7acbacc add flags to ignore daemonsets and mirror pods when calculating resource utilization of a node
Adds the flags --ignore-daemonsets-utilization and --ignore-mirror-pods-utilization
(both defaulting to false); when enabled, they factor DaemonSet and mirror pods
out when calculating the resource utilization of a node.
2018-11-23 15:24:25 +11:00
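A toy sketch of the effect of these flags on the utilization calculation, with `Pod` reduced to the fields that matter here:

```go
package sketch

// Pod is a toy stand-in; the real code inspects owner references and the
// mirror-pod annotation to classify pods.
type Pod struct {
	CPURequest  int64
	IsDaemonSet bool
	IsMirror    bool
}

// nodeUtilization excludes DaemonSet-owned and mirror pods from the numerator
// when the corresponding flag is set, since those pods do not free up
// capacity elsewhere if the node is drained and removed.
func nodeUtilization(pods []Pod, allocatableCPU int64, ignoreDaemonSets, ignoreMirrorPods bool) float64 {
	var requested int64
	for _, p := range pods {
		if ignoreDaemonSets && p.IsDaemonSet {
			continue
		}
		if ignoreMirrorPods && p.IsMirror {
			continue
		}
		requested += p.CPURequest
	}
	return float64(requested) / float64(allocatableCPU)
}
```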
Łukasz Osipiuk 5962354c81 Inject Backoff instance to ClusterStateRegistry on creation 2018-11-13 14:25:16 +01:00
k8s-ci-robot 7008fb50be
Merge pull request #1380 from losipiuk/lo/backoff
Make Backoff interface
2018-11-07 05:13:43 -08:00
Aleksandra Malinowska 6febc1ddb0 Fix formatted log messages 2018-11-06 14:51:43 +01:00
Aleksandra Malinowska bf6ff4be8e Clean up estimators 2018-11-06 14:15:42 +01:00
Łukasz Osipiuk 0e2c3739b7 Use NodeGroup as key in Backoff 2018-10-30 18:17:26 +01:00
Łukasz Osipiuk 55fc1e2f00 Store NodeGroup in ScaleUpRequest and ScaleDownRequest 2018-10-30 18:03:04 +01:00
Maciej Pytel 6f5e6aab6f Move node group balancing to processor
The goal is to allow customization of this logic
for different use cases and cloud providers.
2018-10-25 14:04:05 +02:00
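A sketch of what such a processor extension point can look like; the interface below is illustrative, not the exact upstream definition:

```go
package sketch

// NodeGroupSetProcessor sketches the extension point this commit introduces:
// node group balancing lives behind an interface, so a fork or a cloud
// provider can swap in its own notion of "similar" groups and its own way of
// splitting new nodes between them.
type NodeGroupSetProcessor interface {
	// FindSimilarNodeGroups returns the node groups a scale-up may be
	// balanced across, given the group originally chosen.
	FindSimilarNodeGroups(nodeGroup string) ([]string, error)
	// BalanceScaleUpBetweenGroups distributes newNodes across the given
	// groups, returning the per-group increase.
	BalanceScaleUpBetweenGroups(groups []string, newNodes int) (map[string]int, error)
}
```

Keeping the default balancing behaviour as one implementation of this interface lets downstream users replace it without patching the core scale-up loop.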