mirror of https://github.com/rancher/dartboard.git
add docs for 20221003 test
Signed-off-by: Silvio Moioli <silvio@moioli.net>
# 2022-10-03 - 300 pods per node test

## Results

- a 6-node RKE2 cluster is set up with a limit of 300 pods per node (1800 pods maximum cluster capacity)
- an `nginx` workload at 90% of pod capacity (270 pods per node / 1620 pods total) is scheduled and health checks stay green
- some day-2 operations are performed on the cluster, with the workload staying healthy (after a transitory period) at every step:
  - addition (uncordoning) of one node
  - full workload redeployment
  - removal (draining) of one node
  - upgrade of RKE2 on all cluster nodes
- CPU overhead at rest (kubelet, container runtime, etc.): ~15% (of a 4-vCPU 3.1 GHz Intel Xeon)

The test is completely automated, takes ~1 hour, and costs ~2 USD in AWS resources.

[Screen captures and logs](https://mysuse-my.sharepoint.com/:f:/g/personal/moio_suse_com/Esqw92qeMS5AkAwn6PaV-O0B8tADbJ9rK9zRjqL-yKIGMQ?e=OUbMaz) are available to SUSE employees.
## Methodology notes

- 300 pods per node was chosen because it is higher than 254 pods per node, the maximum possible with the default /24 per-node CIDR range
- a /22 per-node CIDR range was assigned, as [Google recommends](https://cloud.google.com/kubernetes-engine/docs/how-to/flexible-pod-cidr#cidr_ranges_for_clusters) having at least twice as many available IP addresses as pods (see the sanity check after this list)
- workload pods were scheduled up to 90% of the maximum number (270 pods per node). That still leaves 30 pods per node for system components and is still higher than 254
- the downstream cluster is imported because AWS provisioning [is not supported with temporary session tokens at the time of writing](https://github.com/rancher/rancher/issues/15962)
- the downstream cluster RKE2 upgrade is performed via the [installation script method](https://github.com/rancher/rke2/blob/v1.25.2%2Brke2r1/docs/upgrade/basic_upgrade.md#upgrade-rke2-using-the-installation-script) because it is the only option for imported clusters at the time of writing
- upgrades via installation scripts imply draining the nodes being upgraded; one extra node is added for that purpose
- bugs not related to scalability have been worked around
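
The per-node numbers can be cross-checked directly on the downstream cluster. The following is a minimal sketch (not part of the automated test), assuming `kubectl` is pointed at the downstream cluster via the kubeconfig produced by Terraform:

```shell
# Show the pod limit and the pod CIDR assigned to each node:
# a /22 per-node CIDR yields 1024 addresses, i.e. more than 2 x 300 pods,
# while the default /24 tops out at 254 usable pod IPs.
export KUBECONFIG=./config/downstream.yaml
kubectl get nodes -o custom-columns='NAME:.metadata.name,MAX_PODS:.status.allocatable.pods,POD_CIDR:.spec.podCIDR'
```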
## Test outline

- infrastructure is set up:
  - AWS hardware (VMs, network devices...) is deployed
  - RKE2 is installed on the upstream cluster nodes, and Rancher is installed on top of it
  - RKE2 is installed on the downstream cluster
- test is conducted (a `kubectl` equivalent of the day-2 steps is sketched after this list):
  - initial admin user is set up ([detail](../cypress/cypress/e2e/users.cy.js))
  - downstream cluster is imported into Rancher ([detail](../cypress/cypress/e2e/imported-clusters.cy.js))
  - an nginx deployment ([detail](../cypress/cypress/e2e/deployment.yaml)) is applied ([detail](../cypress/cypress/e2e/workloads.cy.js))
  - deployment replicas are scaled up to 1620
  - one worker node is drained
  - deployment is recycled
  - worker node is uncordoned
  - deployment is recycled again
  - RKE2 is updated, node by node ([detail](../cypress/cypress/e2e/rke2-update.cy.js))
  - logs are collected
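
For reference, the day-2 operations driven through the Rancher UI by the Cypress tests roughly correspond to the `kubectl` commands below. This is an illustrative sketch only: the deployment name (`nginx`) and `<node-name>` are hypothetical placeholders, not names taken from the test files.

```shell
# scale the workload to 90% of cluster pod capacity
kubectl scale deployment/nginx --replicas=1620

# drain one worker node (its pods get rescheduled elsewhere), then bring it back
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl uncordon <node-name>

# recycle the deployment (recreate all pods) and wait for it to settle
kubectl rollout restart deployment/nginx
kubectl rollout status deployment/nginx
```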
## AWS Hardware configuration outline

- bastion host (for SSH tunnelling only): `t4g.small`, 50 GiB EBS `gp3` root volume
- Rancher cluster: 3 nodes, `t3.large`, 50 GiB EBS `gp3` root volume
- downstream cluster: 6 to 7 nodes, `t3.xlarge`, 50 GiB EBS `gp3` root volume (see the check after this list)
- networking: one /16 AWS VPC with two /24 subnets
  - public subnet: contains only the bastion host, which exposes port 22 to the Internet via security groups
  - private subnet: contains all other nodes; traffic is allowed only internally and to/from the bastion via SSH
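
As a quick way to confirm what was provisioned, the instances can be listed with the AWS CLI. This is a hypothetical check (the AWS CLI is not otherwise required by this test), using the same credentials exported for Terraform:

```shell
# list the running downstream cluster nodes by instance type
aws ec2 describe-instances \
  --filters "Name=instance-type,Values=t3.xlarge" "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{Id:InstanceId,Type:InstanceType,PrivateIp:PrivateIpAddress}' \
  --output table
```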
References:

- [instance types](https://aws.amazon.com/ec2/instance-types/)
- [EBS](https://aws.amazon.com/ebs/)
- [VPC](https://aws.amazon.com/vpc/)
## Software configuration outline

- bastion host: SLES 15 SP4
- Rancher cluster: Rancher 2.6.5 on a 3-node RKE2 v1.23.10+rke2r1 cluster
  - all nodes based on Rocky Linux 8.6
- downstream cluster: RKE2 v1.22.13+rke2r1, 3 server nodes and 4 agent nodes
  - all nodes based on Rocky Linux 8.6

The 300 pods per node limit is configured by:

- adding a `kube-controller-manager-arg: "node-cidr-mask-size=22"` line to `/etc/rancher/rke2/config.yaml`
- adding a `kubelet-arg: "config=/etc/rancher/rke2/kubelet-custom.config"` line to `/etc/rancher/rke2/config.yaml`
- creating a `/etc/rancher/rke2/kubelet-custom.config` file with the following content:

  ```yaml
  apiVersion: kubelet.config.k8s.io/v1beta1
  kind: KubeletConfiguration
  maxPods: 300
  ```

- restarting the `rke2-server` service

See [the rke2 installation script in this repo](../rke2/install_rke2.sh) for details.
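
In shell form, the configuration steps above amount to something like the following sketch, run as root on each downstream server node (agent nodes would analogously restart `rke2-agent`). It mirrors the intent of the installation script rather than reproducing it verbatim:

```shell
# append the extra RKE2 options to the main RKE2 configuration file
cat >> /etc/rancher/rke2/config.yaml <<'EOF'
kube-controller-manager-arg: "node-cidr-mask-size=22"
kubelet-arg: "config=/etc/rancher/rke2/kubelet-custom.config"
EOF

# write the custom kubelet configuration raising the pod limit
cat > /etc/rancher/rke2/kubelet-custom.config <<'EOF'
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 300
EOF

# restart RKE2 so the new settings take effect
systemctl restart rke2-server
```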
## Full configuration details

All infrastructure is defined via [Terraform](https://www.terraform.io/) files in the [20221003_300_pods_per_node](https://github.com/moio/scalability-tests/tree/20221003_300_pods_per_node) branch. Note in particular [inputs.tf](../inputs.tf) for the main parameters.

All tests are driven by [Cypress](https://www.cypress.io/) files in the [cypress](https://github.com/moio/scalability-tests/tree/20221003_300_pods_per_node/cypress) directory.
## Reproduction Instructions

### Requirements

- API access to EC2 configured for your terminal
  - for SUSE Engineering:
    - [have "AWS Landing Zone" added to your Okta account](https://confluence.suse.com/display/CCOE/Requesting+AWS+Access)
    - open [Okta](https://suse.okta.com/) -> "AWS Landing Zone"
    - click on "AWS Account" -> your account -> "Command line or programmatic access" -> click to copy the commands under "Option 1: Set AWS environment variables"
    - paste the contents into your terminal (see the sketch after this list for the expected form)
- [Terraform](https://www.terraform.io/downloads)
- `git`
- `node`
- `kubectl`
- `nc` (netcat)
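
The copied credentials are plain environment variable exports. As a rough sketch (the values below are placeholders, not real credentials), the pasted block looks like this, and can be followed by a quick check that the other tools are installed:

```shell
# temporary AWS session credentials copied from Okta (placeholder values)
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_SESSION_TOKEN="..."

# confirm the rest of the toolchain is available
terraform version
git --version
node --version
kubectl version --client
nc -h 2>&1 | head -1
```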
### Setup

- clone this project:

  ```shell
  git clone https://github.com/moio/scalability-tests.git
  cd scalability-tests
  git checkout 20221003_300_pods_per_node
  ```

- initialize Terraform and Cypress:

  ```shell
  terraform init
  pushd cypress
  npm install cypress --save-dev
  popd  # return to the repository root for the next steps
  ```
### Run

```shell
# deploys all infrastructure
terraform apply -auto-approve

# runs tests
cd cypress; ./node_modules/cypress/bin/cypress run
```
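
If a single step needs to be repeated, individual Cypress specs can also be run on their own. A hypothetical example, reusing one of the spec files referenced in the test outline:

```shell
cd cypress
# re-run only the workload test (applying and scaling the nginx deployment)
./node_modules/cypress/bin/cypress run --spec cypress/e2e/workloads.cy.js
```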
### Inspect results

- `cypress/cypress/logs` contains log tarballs collected from all downstream nodes
- `cypress/cypress/videos` contains screen captures of test runs
- Terraform output will contain instructions to access the newly created clusters (see the check after this list), e.g.:

  ```
  UPSTREAM CLUSTER ACCESS:
  export KUBECONFIG=./config/upstream.yaml

  RANCHER UI:
  https://upstream.local.gd:3000

  DOWNSTREAM CLUSTER ACCESS:
  export KUBECONFIG=./config/downstream.yaml
  ```

- `./config` contains scripts to SSH into any VM and reopen SSH tunnels to ports 3000 (Web UI), 6443 (upstream cluster API) and 6444 (downstream cluster API)
- Cypress tests can be run interactively via:

  ```shell
  cd cypress
  npx cypress open
  ```
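
Once the kubeconfig printed by Terraform is exported, the clusters can be inspected directly. A minimal sketch, assuming the SSH tunnels opened by Terraform are still up:

```shell
# check the downstream cluster: all nodes Ready and the workload fully available
export KUBECONFIG=./config/downstream.yaml
kubectl get nodes -o wide
kubectl get deployments --all-namespaces
```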
### Cleanup

All created infrastructure can be destroyed via:

```shell
terraform destroy -auto-approve
```
## Troubleshooting

- if the error below is produced:

  ```
  Error: creating EC2 Instance: VcpuLimitExceeded: You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit.
  ```

  then you need to request a higher vCPU limit for your account from AWS. This can be done by visiting the [Service Quotas](https://console.aws.amazon.com/servicequotas/home) page and filling in the details to request an increase of the "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances" limit.
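
The current limit and the increase request can also be handled from the command line. A hypothetical sketch using the AWS CLI (not otherwise required by this test); the quota code below is the one commonly associated with the Standard On-Demand instances limit and should be double-checked in the Service Quotas console:

```shell
# show the current "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances" quota
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-1216C47A

# request an increase to 64 vCPUs (adjust the value as needed)
aws service-quotas request-service-quota-increase \
  --service-code ec2 \
  --quota-code L-1216C47A \
  --desired-value 64
```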