Archive meeting notes and ci subgroup notes.
This commit is contained in:
parent
8a99192b37
commit
72fe46651c
|
@ -0,0 +1,859 @@
|
|||
# Kubernetes SIG-Node CI subgroup notes
|
||||
|
||||
## 12/29/2021
|
||||
|
||||
**Cancelled** \- year-end holiday break
|
||||
|
||||
## 12/22/2021
|
||||
|
||||
- Device plugin: [https://github.com/kubernetes/test-infra/issues/24557](https://github.com/kubernetes/test-infra/issues/24557)
|
||||
- Put on hold to be able to repro. But can repro locally now. So can skip tests for now
|
||||
- Memory (kubelet) down again:
|
||||
![][image12]
|
||||
|
||||
## 12/15/2021
|
||||
|
||||
* [https://github.com/kubernetes/test-infra/issues/24618\#issuecomment-993808136](https://github.com/kubernetes/test-infra/issues/24618#issuecomment-993808136)
|
||||
* Job to move image-config: [https://github.com/kubernetes/test-infra/blob/master/jobs/e2e\_node/swap/image-config-swap.yaml](https://github.com/kubernetes/test-infra/blob/master/jobs/e2e_node/swap/image-config-swap.yaml)
|
||||
* Other jobs: [https://github.com/kubernetes/test-infra/blob/master/jobs/e2e\_node/containerd/containerd-release-1.5/image-config.yaml](https://github.com/kubernetes/test-infra/blob/master/jobs/e2e_node/containerd/containerd-release-1.5/image-config.yaml)
|
||||
* \[Alukiano\] Same with hugepages. Need to provide the init that will include both. Examples:
|
||||
* [https://github.com/kubernetes/test-infra/pull/24673](https://github.com/kubernetes/test-infra/pull/24673)
|
||||
* Small follow-up fix [https://github.com/kubernetes/test-infra/pull/24682](https://github.com/kubernetes/test-infra/pull/24682)
|
||||
* Follow up:
|
||||
* GCEPD: follow up to exclude from more tabs
|
||||
* [https://github.com/kubernetes/kubernetes/issues/106720](https://github.com/kubernetes/kubernetes/issues/106720),
|
||||
* [https://github.com/kubernetes/kubernetes/issues/106719](https://github.com/kubernetes/kubernetes/issues/106719),
|
||||
* [https://testgrid.k8s.io/sig-node-containerd\#containerd-e2e-ubuntu](https://testgrid.k8s.io/sig-node-containerd#containerd-e2e-ubuntu)
|
||||
* [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv2-containerd-e2e](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-e2e)
|
||||
* [https://testgrid.k8s.io/sig-node-containerd\#image-validation-cos-e2e](https://testgrid.k8s.io/sig-node-containerd#image-validation-cos-e2e)
|
||||
* [https://testgrid.k8s.io/sig-node-containerd\#image-validation-ubuntu-e2e](https://testgrid.k8s.io/sig-node-containerd#image-validation-ubuntu-e2e)
|
||||
* [https://testgrid.k8s.io/sig-node-containerd\#e2e-ubuntu](https://testgrid.k8s.io/sig-node-containerd#e2e-ubuntu)
|
||||
* [https://testgrid.k8s.io/sig-node-cos\#soak-cos-gce](https://testgrid.k8s.io/sig-node-cos#soak-cos-gce)
|
||||
* [https://testgrid.k8s.io/sig-node-cos\#e2e-cos](https://testgrid.k8s.io/sig-node-cos#e2e-cos)
|
||||
* [https://testgrid.k8s.io/sig-node-cos\#e2e-cos-ip-alias](https://testgrid.k8s.io/sig-node-cos#e2e-cos-ip-alias)
|
||||
* [https://testgrid.k8s.io/sig-node-cos\#e2e-cos-proto](https://testgrid.k8s.io/sig-node-cos#e2e-cos-proto)
|
||||
* [https://testgrid.k8s.io/sig-node-cos\#e2e-cos-serial](https://testgrid.k8s.io/sig-node-cos#e2e-cos-serial)
|
||||
* [https://testgrid.k8s.io/sig-node-cos\#e2e-cos-slow](https://testgrid.k8s.io/sig-node-cos#e2e-cos-slow)
|
||||
* Create issue for: [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv2-containerd-node-e2e-serial](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-node-e2e-serial) ([https://github.com/kubernetes/kubernetes/issues/107062](https://github.com/kubernetes/kubernetes/issues/107062))
|
||||
* Create issue for: [https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-eviction](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-eviction) [https://github.com/kubernetes/kubernetes/issues/107063](https://github.com/kubernetes/kubernetes/issues/107063)
|
||||
* NPD: [https://testgrid.k8s.io/sig-node-node-problem-detector\#ci-npd-e2e-node](https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-e2e-node) [https://github.com/kubernetes/kubernetes/issues/107067](https://github.com/kubernetes/kubernetes/issues/107067)
|
||||
* Check for [https://testgrid.k8s.io/sig-node-cos\#e2e-cos-alpha-features](https://testgrid.k8s.io/sig-node-cos#e2e-cos-alpha-features)
|
||||
* Check if we have an issue for this: [https://testgrid.k8s.io/sig-node-presubmits\#pr-node-kubelet-serial](https://testgrid.k8s.io/sig-node-presubmits#pr-node-kubelet-serial) [https://github.com/kubernetes/test-infra/issues/24557](https://github.com/kubernetes/test-infra/issues/24557)
|
||||
|
||||
## 12/09/2021
|
||||
|
||||
- \[ehashman\] Dockershim removal \- cleanup [https://github.com/kubernetes/test-infra/issues/24592](https://github.com/kubernetes/test-infra/issues/24592)
|
||||
- PR: [https://github.com/kubernetes/test-infra/pull/24595](https://github.com/kubernetes/test-infra/pull/24595)
|
||||
- There are some tests with special configs (e.g. CPU manager, hugepages, etc.) that I don’t want to migrate as part of this PR; they are failing anyway, so there is no point in running them
|
||||
- Ideally, since these are serial, would like to see all tests using the same config moved under a single job
|
||||
- Presubmits need to be done separately
|
||||
- Will file issues for split work (presubmits)
|
||||
- \[ruiwen-zhao\] [https://github.com/kubernetes/kubernetes/issues/106895](https://github.com/kubernetes/kubernetes/issues/106895)
|
||||
- Summary API test flaky on kubelet-gce-e2e-swap-ubuntu since the beginning of test job history
|
||||
- Same test passes on \-fedora testgrid
|
||||
- Agreed: cancelling Thursday alternate time meeting due to lack of attendance.
|
||||
- We will only meet Wednesdays from now on.
|
||||
- [https://github.com/kubernetes/community/pull/6285](https://github.com/kubernetes/community/pull/6285)
|
||||
|
||||
## 12/01/2021
|
||||
|
||||
- \[ehashman\] Milestone 1.23
|
||||
- [https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.23+label%3Asig%2Fnode](https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.23+label%3Asig%2Fnode)
|
||||
- Serial lane is green\!\!\!
|
||||
- Perf (memory chart) \- let’s see if trend continues next week
|
||||
- We looked and it looks stable since code freeze (\~11/17)
|
||||
- Next week is our alternate Thursday time
|
||||
- If we do not see substantial attendance compared to the Wednesday meeting, we will revert to Wednesdays only going forward
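The milestone burndown link above is a plain GitHub issue search. A minimal sketch of how such a link can be rebuilt for another milestone, assuming only Python's standard library; the function name `milestone_query_url` and its defaults are illustrative and not part of any Kubernetes tooling:

```python
from urllib.parse import urlencode


def milestone_query_url(milestone: str,
                        label: str = "sig/node",
                        repo: str = "kubernetes/kubernetes") -> str:
    """Build a GitHub issue-search URL for open issues in a milestone.

    The search qualifiers (is:open, milestone:, label:) mirror the link
    recorded in the notes; the helper itself is hypothetical.
    """
    query = f"is:open milestone:{milestone} label:{label}"
    # urlencode percent-encodes ':' and '/' and turns spaces into '+'.
    return f"https://github.com/{repo}/issues?" + urlencode({"q": query})


print(milestone_query_url("v1.23"))
```

Changing the `milestone` argument (for example to `v1.22`) yields the equivalent burndown query for another release.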
|
||||
|
||||
## 11/24/2021
|
||||
|
||||
- \[ehashman\] Milestone 1.23
|
||||
- [https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.23+label%3Asig%2Fnode](https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.23+label%3Asig%2Fnode)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/106635](https://github.com/kubernetes/kubernetes/issues/106635)
|
||||
- Still failing, all fixes need 1.23 backports at this point, so let’s remove these from the serial lane until we can fix them next release
|
||||
|
||||
Perf (memory chart) \- let’s see if trend continues next week
|
||||
![][image13]
|
||||
|
||||
## 11/17/2021
|
||||
|
||||
* \[ehashman\] Node-kubelet-serial to release-informing
|
||||
* GPU tests started failing again, sigh
|
||||
* Danielle has a PR that will hopefully prevent this in the future
|
||||
* [https://github.com/kubernetes/kubernetes/pull/106348](https://github.com/kubernetes/kubernetes/pull/106348)
|
||||
* We probably shouldn’t run with real GPUs (i.e. special hardware) in our regular serial lane
|
||||
* Is it off docker?
|
||||
* Containerd job currently broken \- moved to community infra but has a config issue
|
||||
* Cut a bug for [https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-serial-containerd](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd)
|
||||
* Do we have an equivalent job for crio?
|
||||
* Don’t think so, we should definitely add one, add both to release-informing when sufficiently green
|
||||
* [https://github.com/kubernetes/test-infra/issues/24451](https://github.com/kubernetes/test-infra/issues/24451)
|
||||
* \[mmiranda96\] [https://github.com/kubernetes/kubernetes/issues/106469](https://github.com/kubernetes/kubernetes/issues/106469)
|
||||
* Probably safe now, only noticed two failures yesterday.
|
||||
* \[aditi\] [https://github.com/kubernetes/kubernetes/pull/106449](https://github.com/kubernetes/kubernetes/pull/106449)
|
||||
* Just added the log to be sure about the reason for flake
|
||||
* Can increase the grace period/decrease sleep time based on the finding
|
||||
* \[aditi\] should we remove 1.19 jobs?
|
||||
* [https://testgrid.k8s.io/sig-node-kubelet\#node-kubelet-features-1.19](https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-features-1.19)
|
||||
* [https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-1.19](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-1.19)
|
||||
* Let’s wait for SIG Release to remove their job first
|
||||
* \[ehashman\] refactor kubelet config validation
|
||||
* Okay to merge during test freeze?
|
||||
* [https://github.com/kubernetes/kubernetes/pull/105360](https://github.com/kubernetes/kubernetes/pull/105360)
|
||||
* \+1’s from Danielle and David
|
||||
* \+1 from Sergey.
|
||||
|
||||
## 11/11/2021
|
||||
|
||||
* [https://github.com/kubernetes/kubernetes/issues/106204](https://github.com/kubernetes/kubernetes/issues/106204)
|
||||
* Release blocker?
|
||||
* \[danielle\] [https://github.com/kubernetes/kubernetes/pull/106348](https://github.com/kubernetes/kubernetes/pull/106348)
|
||||
* Let’s drop the dependency on GPUs and reduce some maintenance for ourselves. Also, apparently the device tests were broken in the same way the GPU tests originally were (but we missed it because of \[Flaky\])
|
||||
* \[aditi\] [https://github.com/kubernetes/kubernetes/pull/106252](https://github.com/kubernetes/kubernetes/pull/106252)
|
||||
* Thoughts on adding credential provider to node e2e?
|
||||
* Status of node serial
|
||||
* Let’s make it green by moving the flakes out
|
||||
|
||||
|
||||
|
||||
|
||||
## 10/26/2021
|
||||
|
||||
* \[manugupt1\] I have been working with sig-storage to speed up mount/unmount performance. See PR: [https://github.com/kubernetes/kubernetes/pull/105833/files](https://github.com/kubernetes/kubernetes/pull/105833/files). The change in this PR works only on kernel 5.6+. The ask on this PR is to write tests verifying that the behavior does not change with my changes. While thinking about this, I can think of a couple of options:
|
||||
* Re-run all the tests through the pod-spec.
|
||||
* Move all the unit-tests that I ran as an e2e test without pod-spec.
|
||||
* \[SergeyKanzhelev\] cos-81-lts is out of support. I suggest: [https://github.com/kubernetes/test-infra/search?q=cos-stable1](https://github.com/kubernetes/test-infra/search?q=cos-stable1) \-\> `cos-89-lts`
|
||||
[`https://github.com/kubernetes/test-infra/search?q=cos-stable2`](https://github.com/kubernetes/test-infra/search?q=cos-stable2) `-> cos-93-lts`
|
||||
|
||||
|
||||
## 10/20/2021
|
||||
|
||||
* \[fromani\]\[could be postponed if not enough time\] some tests have implicit dependencies on the node state
|
||||
* Memory manager tests have implicit dep on memory fragmentation (or lack thereof)
|
||||
* ContainerRuntimeRestart \- fails for timeout, but we WANT to saturate the node with pods (that’s the whole point of the test\!)
|
||||
* How do we keep these tests while keeping it reliable (no false negatives)?
|
||||
* Separate lanes seem a bit excessive, any other idea?
|
||||
* Wait for the PR to reduce amount of allocated hugepages under the test
|
||||
* If the PR will fix flakes, remove the separated lane, otherwise remove the test from serial lane with notes why it needs a separate lane
|
||||
* \[imran\] Lock contention tests: updated the job with \`NodeSpecialFeature:LockContention\`; all tests are passing and CI is green.
|
||||
Updated the test-infra PR to include a skip value with \`--skip="\\\[Flaky\\\]|\\\[Serial\\\]"\`
|
||||
Just need to merge and proceed with the next set of steps.
|
||||
* \[mmiranda96\] Adding new alpha features to jobs e2e-gce-alpha-features ([https://github.com/kubernetes/test-infra/issues/23642](https://github.com/kubernetes/test-infra/issues/23642))
|
||||
* \[jlebon\] Allow running e2e tests on non-GCE nodes [https://github.com/kubernetes/kubernetes/pull/105764](https://github.com/kubernetes/kubernetes/pull/105764)
|
||||
* Danielle will take a look
|
||||
* Fromani will have a look as well
|
||||
* What about cloud-init?
|
||||
* Cloud ignition is an alternative \- needs to be adapted
|
||||
* Have a way to prepare all the binaries upfront?
|
||||
* \[mmiranda96\] Updating job ci-kubernetes-node-kubelet-eviction to use swap (for [https://github.com/kubernetes/kubernetes/issues/105023\#issuecomment-947145748](https://github.com/kubernetes/kubernetes/issues/105023#issuecomment-947145748))
|
||||
* PR: [https://github.com/kubernetes/test-infra/pull/24064](https://github.com/kubernetes/test-infra/pull/24064)
|
||||
* Not sure if the current swap config will work on COS.
|
||||
* [https://github.com/kubernetes/test-infra/blob/cc714da33d7ba85672aa4c7f58e0b3993155176d/jobs/e2e\_node/swap/crio\_swap1g.ign](https://github.com/kubernetes/test-infra/blob/cc714da33d7ba85672aa4c7f58e0b3993155176d/jobs/e2e_node/swap/crio_swap1g.ign)
|
||||
* \[ehashman\] attendance for alt time
|
||||
* Let’s hold another one in November and then, if attendance is poor again, cancel
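The `--skip` value recorded above for the lock contention job is a regular expression matched against test names. A rough sketch of that filtering, assuming Python's `re` module behaves the same as the test runner for this simple pattern; the test names below are hypothetical and only illustrate the tag convention:

```python
import re

# Same pattern as the --skip value in the notes, with Markdown escaping removed.
SKIP = re.compile(r"\[Flaky\]|\[Serial\]")


def should_skip(test_name: str) -> bool:
    """Return True when the test name carries a [Flaky] or [Serial] tag."""
    return SKIP.search(test_name) is not None


def select(test_names):
    """Keep only the tests that the skip pattern does not exclude."""
    return [name for name in test_names if not should_skip(name)]


# Hypothetical test names, for illustration only.
tests = [
    "Lock contention [NodeSpecialFeature:LockContention] holds the lock",
    "Eviction [Serial] evicts pods under memory pressure",
    "Summary API [Flaky] reports metrics",
]
print(select(tests))
```

Only the first name survives the filter, since the other two carry one of the skipped tags.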
|
||||
|
||||
## 10/14/2021
|
||||
|
||||
* \[mmiranda96\] Troubleshooting containerd 1.4 canaries ([https://github.com/kubernetes/test-infra/issues/23915](https://github.com/kubernetes/test-infra/issues/23915))
|
||||
* From the logs, it appears that the cluster is never created. Node operations are no-ops. Is the cluster expected to exist before the tests run?
|
||||
* \[mmiranda96\] MemoryPressure testing with swap enabled.
|
||||
* How can we run tests in machines with swap enabled?
|
||||
* [https://github.com/kubernetes/test-infra/tree/master/jobs/e2e\_node/swap](https://github.com/kubernetes/test-infra/tree/master/jobs/e2e_node/swap)
|
||||
|
||||
## 10/06/2021
|
||||
|
||||
* \[bobbypage\] Kubetest2 migration plan
|
||||
* Amit @amwat will provide a bit of context on kubetest2 migration plans and how it relates to node e2e testing
|
||||
* ref: [https://github.com/kubernetes/enhancements/issues/2464](https://github.com/kubernetes/enhancements/issues/2464)
|
||||
|
||||
The very first layer of a CI job is the image for the Prow job, which already has many tools on it
|
||||
\- All these tools are deprecated and in maintenance mode. They evolved from bash scripts and became unmaintainable.
|
||||
\- kubetest2 is designed to be extensible and will replace them.
|
||||
\- Pluggable on where to test (GCE, AWS, Kind, etc.)
|
||||
\- Pluggable on what to test
|
||||
|
||||
Thousands of jobs using old tools. The process of switching all the jobs will be slow. Some of the jobs will be moved to kubetest2. Presubmits and release blocking jobs are the first target.
|
||||
|
||||
Mainly: awareness of the project. Feature requests must go to kubetest2 now.
|
||||
|
||||
The most significant change is the node tester. kubetest2 will use the Makefile as the source of truth. make test\_e2e lets you run tests, but kubetest is not using it. kubetest2 will change this and will only use the Makefile. Dealing with test-infra will mostly be needed only when bugs are encountered; otherwise there is no need to deal with it any longer.
|
||||
|
||||
Danielle: some tests need more features
|
||||
Amwat: yes, can be added in node tester
|
||||
|
||||
Timeline?
|
||||
Amwat: scoped to presubmits and release blocking \- 1.24 is the target version. At least the jobs will start running. No timeline for other jobs.
|
||||
|
||||
* \[SergeyKanzhelev\] Image for presubmits: [https://github.com/kubernetes/kubernetes/issues/105381](https://github.com/kubernetes/kubernetes/issues/105381)
|
||||
* File an issue for the future
|
||||
* Revert to image family for now
|
||||
* Find somebody to investigate
|
||||
* \[SergeyKanzhelev\] [https://docs.google.com/document/d/19HqSyrS-4pyubqTvQV0hJKt\_97nbCSt\_aD0soL-RGGE/edit\#heading=h.veqp9g4ihszu](https://docs.google.com/document/d/19HqSyrS-4pyubqTvQV0hJKt_97nbCSt_aD0soL-RGGE/edit#heading=h.veqp9g4ihszu)
|
||||
* \[mmiranda96\] A little off-topic, do we plan to participate in Hacktoberfest? Maintainers guide link: [https://hacktoberfest.digitalocean.com/resources/maintainers](https://hacktoberfest.digitalocean.com/resources/maintainers)
|
||||
* Issue on kops for last year’s event: [https://github.com/kubernetes/kops/issues/9920](https://github.com/kubernetes/kops/issues/9920)
|
||||
* No \- Kubernetes has explicitly opted out of Hacktoberfest project-wide. The quality of contributions we’ve historically gotten has been very low and has created a lot of extra cleanup work for maintainers.
|
||||
|
||||
|
||||
## 09/29/2021
|
||||
|
||||
* \[SergeyKanzhelev\] NodeConformance updates
|
||||
|
||||
`./_output/local/go/bin/ginkgo --dryRun -v ./_output/local/go/bin/e2e_node.test | sed $'s,\x1b\\[[0-9;]*[a-zA-Z],,g' > ./tmp/e2e_node.test.txt`
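The `sed` expression in the command above strips ANSI color codes from the dry-run listing before it is saved. A minimal sketch of the same cleanup in Python, assuming the output only contains simple CSI sequences (ESC, `[`, digits or semicolons, one letter); the sample line is hypothetical:

```python
import re

# Same shape as the sed pattern: ESC, "[", digits or semicolons, then a
# single terminating letter.
ANSI_CSI = re.compile(r"\x1b\[[0-9;]*[a-zA-Z]")


def strip_ansi(text: str) -> str:
    """Return text with ANSI color/style escape sequences removed."""
    return ANSI_CSI.sub("", text)


# Hypothetical sample line; real dry-run output will differ.
sample = "\x1b[1m[sig-node] Example spec\x1b[0m"
print(strip_ansi(sample))
```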
|
||||
|
||||
[https://github.com/kubernetes/community/blob/32a1c14d04ff78684d78b827ac7c49f70352d509/contributors/devel/sig-testing/e2e-tests.md\#kinds-of-tests](https://github.com/kubernetes/community/blob/32a1c14d04ff78684d78b827ac7c49f70352d509/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests)
|
||||
|
||||
* \[bobbypage\] Kubetest2 migration plan
|
||||
* Moved to 10/06
|
||||
* \[arnaud\] Migration to k8s-infra: [https://github.com/kubernetes/k8s.io/issues/1469](https://github.com/kubernetes/k8s.io/issues/1469)
|
||||
* Migrate away from GCP projects:
|
||||
* k8s-jkns-pr-node-e2e: [https://cs.k8s.io/?q=k8s-jkns-pr-node-e2e\&i=nope\&files=\&excludeFiles=\&repos=](https://cs.k8s.io/?q=k8s-jkns-pr-node-e2e&i=nope&files=&excludeFiles=&repos=)
|
||||
* cri-containerd-node-e2e: [https://cs.k8s.io/?q=cri-containerd-node-e2e\&i=nope\&files=\&excludeFiles=\&repos=](https://cs.k8s.io/?q=cri-containerd-node-e2e&i=nope&files=&excludeFiles=&repos=)
|
||||
* K8s-cri-containerd: house of bucket gs://cri-containerd-testing : [https://cs.k8s.io/?q=cri-containerd-staging\&i=nope\&files=\&excludeFiles=\&repos=](https://cs.k8s.io/?q=cri-containerd-staging&i=nope&files=&excludeFiles=&repos=)
|
||||
* **Action:** ehashman to create an issue for SIG Node in test-infra for the SIG to move the project with steps for a new contributor to pick up, cc Arnaud
|
||||
* [https://github.com/kubernetes/test-infra/issues/23822](https://github.com/kubernetes/test-infra/issues/23822)
|
||||
* Example PR: [https://github.com/kubernetes/test-infra/pull/23777/files](https://github.com/kubernetes/test-infra/pull/23777/files)
|
||||
* Why does serial run on every PR?
|
||||
* [https://github.com/kubernetes/test-infra/pull/23823](https://github.com/kubernetes/test-infra/pull/23823)
|
||||
|
||||
## 09/22/2021
|
||||
|
||||
* \[mmiranda96\] Created [https://github.com/kubernetes/test-infra/issues/23642](https://github.com/kubernetes/test-infra/issues/23642) for keeping track of alpha feature jobs with non-alpha features.
|
||||
* Update list of features manually to make progress on the alpha job cleanup.
|
||||
* Work on updating tags for jobs
|
||||
* \[ehashman\] status of [https://github.com/kubernetes/k8s.io/issues/956](https://github.com/kubernetes/k8s.io/issues/956) ?
|
||||
* GCP accounts for node contributors \- need a list of use cases
|
||||
* Dani to drive when she returns from PTO?
|
||||
* \[Sergey\] memory spike [https://github.com/kubernetes/kubernetes/issues/105053](https://github.com/kubernetes/kubernetes/issues/105053)
|
||||
|
||||
## 09/15/2021
|
||||
|
||||
* \[alukiano\] \- I cannot attend (Israel holidays), but I would like to know people’s opinions on moving the DynamicKubeletConfig tests out of the serial lane
|
||||
* The feature should be deprecated under 1.23
|
||||
* The feature’s tests take most of the serial lane’s time: for comparison, the serial lane takes \~3h with DynamicKubeletConfig and \~1h without
|
||||
* We can create a separate lane for deprecated features (I know we do not like the idea of a separate lane)
|
||||
* Let’s prioritize removing the feature
|
||||
* Need unified approach for kubelet configuration
|
||||
* DynamicKubeletConfig flakes are often due to the feature being unreliable/kubelet not restarting
|
||||
* **Action:** Elana to file an issue to detail work that needs to be done to move tests off DynamicKubeletConfig, assign Danielle and cc Sergey
|
||||
* [https://github.com/kubernetes/kubernetes/issues/105047](https://github.com/kubernetes/kubernetes/issues/105047)
|
||||
* \[arnaud\] Migrate prowjobs to community infra
|
||||
* In-scope :
|
||||
* Sig-node-presubmits : [https://testgrid.k8s.io/sig-node-presubmits](https://testgrid.k8s.io/sig-node-presubmits)
|
||||
* Sig-node-kubelet : [https://testgrid.k8s.io/sig-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet)
|
||||
* Node [https://github.com/kubernetes/k8s.io/issues/1527](https://github.com/kubernetes/k8s.io/issues/1527)
|
||||
* [One question](https://github.com/kubernetes/k8s.io/issues/1527#issuecomment-855111265) remaining; Danielle thinks these haven’t been used in 2+ years
|
||||
* **Action:** Danielle to comment on issue with update on Arnaud’s question and recent findings
|
||||
* \[ehashman\] status of [https://github.com/kubernetes/test-infra/issues/23291](https://github.com/kubernetes/test-infra/issues/23291)
|
||||
* Danielle: there are some optimizations we can do, including eviction tests, GPU tests, etc. to be less wasteful
|
||||
* Fromani: agreed. Only a few tests actually benefit from the availability of GPU devices (and they need to be changed accordingly)
|
||||
* **Action:** Sergey to follow up on combining node pool with the rest of the project
|
||||
* \[mmiranda96\] [https://github.com/kubernetes/kubernetes/issues/104556](https://github.com/kubernetes/kubernetes/issues/104556)
|
||||
* **Action:** Elana to add a comment discussing the periodic/presubmit split and suggested steps forward for the job
|
||||
* [https://testgrid.k8s.io/sig-node-kubelet\#node-kubelet-alpha](https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-alpha)
|
||||
* Bugs triage: [https://github.com/orgs/kubernetes/projects/59](https://github.com/orgs/kubernetes/projects/59)
|
||||
|
||||
## 09/09/2021
|
||||
|
||||
- \[SergeyKanzhelev\] Feedback from liggitt: [https://github.com/kubernetes/kubernetes/pull/103674\#discussion\_r704446017](https://github.com/kubernetes/kubernetes/pull/103674#discussion_r704446017) on NodeFeature-\>Feature transition
|
||||
- \[aditi\] Looking for some context on kubelet resource usage tracking tests
|
||||
[https://github.com/kubernetes/kubernetes/issues/36621](https://github.com/kubernetes/kubernetes/issues/36621)
|
||||
The node performance testing doc points to these tests
|
||||
[https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/node-performance-testing.md](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/node-performance-testing.md)
|
||||
Interested in refactoring the doc with current status of node perf tests and refactoring of tests as well.
|
||||
- Why are the tests in [https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/kubelet\_perf.go](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node/kubelet_perf.go) not run in kubelet-serial?
|
||||
- Do we need these ^^^, or will [https://testgrid.k8s.io/sig-node-kubelet\#node-performance-test](https://testgrid.k8s.io/sig-node-kubelet#node-performance-test) be enough?
|
||||
- \[Imran Pochi\] Further steps on Lock contention tests ([https://github.com/kubernetes/kubernetes/pull/104334](https://github.com/kubernetes/kubernetes/pull/104334))
|
||||
- Since the test requires a special config, we want a separate tab. Let’s add a \[Special\] tag and make sure “features” does not pick up tests with the “\[Special\]” tag
|
||||
- \[ehashman\] Is alternate time working?
|
||||
- Some folks definitely can’t attend the Wednesday time
|
||||
- Many people don’t realize this meeting exists
|
||||
- **Action:** ehashman to set up reminder for the next Thursday meeting
|
||||
- Done \- for the rest of 2021
|
||||
|
||||
## 09/01/2021
|
||||
|
||||
- \[mmiranda96\] [kubernetes/test-infra\#23202](https://github.com/kubernetes/test-infra/issues/23202) Job fails while building the image, mostly related to CGO\_ENABLED. Any recommendations for this?
|
||||
- \[danielle\] FYI \- Node Problem Detector usually builds on top of Debian \- [https://github.com/kubernetes/node-problem-detector/blob/master/Makefile\#L77](https://github.com/kubernetes/node-problem-detector/blob/master/Makefile#L77)
|
||||
- ~~\[mmiranda96\] Requesting review for [kubernetes/test-infra\#23215](https://github.com/kubernetes/test-infra/pull/23215)~~
|
||||
- \[ehashman\] Test failure emails to main mailing list
|
||||
- Seems like an oversight [https://groups.google.com/g/kubernetes-sig-node](https://groups.google.com/g/kubernetes-sig-node)
|
||||
- AI: create a task in github
|
||||
- \[danielle\] [https://github.com/kubernetes/kubernetes/pull/104304](https://github.com/kubernetes/kubernetes/pull/104304) \- Fixing eviction tests, needs reviews please :)
|
||||
- \[rphillips\] [https://github.com/kubernetes/kubernetes/issues/104648](https://github.com/kubernetes/kubernetes/issues/104648) [PR\#104712](https://github.com/kubernetes/kubernetes/pull/104712)
|
||||
- The e2e test is coming later as a separate PR so that the fix can make this week’s release
|
||||
|
||||
## 08/25/2021 Cancelled due to host unavailability
|
||||
|
||||
## 08/18/2021
|
||||
|
||||
- \[mmiranda96\] Requesting review for [kubernetes/node-problem-detector\#607](https://github.com/kubernetes/node-problem-detector/pull/607).
|
||||
- AI: create issue to review other NPD tests.
|
||||
- \[mmiranda96\] Failures in [pull-kubernetes-node-e2e-alpha](https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-node-e2e-alpha)
|
||||
- Mike: investigate and find the test grid. \+ create an issue.
|
||||
- \[SergeyKanzhelev\] Quota ([https://github.com/kubernetes/test-infra/issues/23232](https://github.com/kubernetes/test-infra/issues/23232)): [https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?panelId=9\&fullscreen\&orgId=1\&from=now-90d\&to=now](https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?panelId=9&fullscreen&orgId=1&from=now-90d&to=now)
|
||||
- 18 is the number of projects used.
|
||||
- The best way to improve utilization is fewer test jobs or faster tests inside each job
|
||||
- \[alukiano\] Orphans jobs
|
||||
- Runs serial and conformance tests
|
||||
- Prepare a PR to skip serial and conformance tests [https://github.com/kubernetes/test-infra/pull/23295](https://github.com/kubernetes/test-infra/pull/23295)
|
||||
|
||||
## 08/12/2021
|
||||
|
||||
- ~~\[mkmir\] Working on [kubernetes/test-infra\#23131](https://github.com/kubernetes/test-infra/issues/23131), but I can’t seem to find a way to locally run a ProwJob without connecting to GCE. Any suggestions?~~
|
||||
- Lock contention tests failing on features-master, reverted [https://github.com/kubernetes/kubernetes/issues/104307](https://github.com/kubernetes/kubernetes/issues/104307)
|
||||
- Probably need to label this \[Serial\], remove the separate job
|
||||
- Test labels: [https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md\#kinds-of-tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests)
|
||||
- AI: Imran:
|
||||
- Add \[Serial\] tag
|
||||
- Validate that test is passing in Serial [https://testgrid.k8s.io/sig-node-kubelet\#pr-node-kubelet-serial](https://testgrid.k8s.io/sig-node-kubelet#pr-node-kubelet-serial)
|
||||
- PR that adds a job needs to be reverted
|
||||
- \[ehashman\] testgrid cleanup follow-up
|
||||
- Didn’t get time to submit a PR, good thing for a new contributor to work/pair on
|
||||
- We want to move “pr-\*” jobs out of the sig-node-kubelet tab to a new sig-node-presubmits tab
|
||||
- Maybe look at any other tabs that could be consolidated (e.g. sig-node-containerd and sig-node-containerd-io)
|
||||
- Remove release-blocking jobs
|
||||
- Any volunteers?
|
||||
- Imran may reach out
|
||||
- Aditi \+ matthyx to volunteer
|
||||
- **Action:** Elana to create an issue
|
||||
- [https://github.com/kubernetes/test-infra/issues/23231](https://github.com/kubernetes/test-infra/issues/23231)
|
||||
- \[ehashman\] test-infra resource utilization
|
||||
- [https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?panelId=9\&fullscreen\&orgId=1\&from=now-90d\&to=now](https://monitoring.prow.k8s.io/d/wSrfvNxWz/boskos-resource-usage?panelId=9&fullscreen&orgId=1&from=now-90d&to=now)
|
||||
- Do we even need single-test jobs like [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-lock-contention](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-lock-contention) ?
|
||||
- Suggested action: someone should audit all sig-node periodics and suggest jobs that can be consolidated/removed
|
||||
- AI: Sergey to find out what this quota means
|
||||
- Action: Elana to create issue to track
|
||||
- [https://github.com/kubernetes/test-infra/issues/23232](https://github.com/kubernetes/test-infra/issues/23232)
|
||||
- CI signal:
|
||||
- [https://github.com/kubernetes/kubernetes/issues/104173](https://github.com/kubernetes/kubernetes/issues/104173)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/99437](https://github.com/kubernetes/kubernetes/issues/99437)
|
||||
- [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-serial-gce-e2e-graceful-node-shutdown](https://testgrid.k8s.io/sig-node-kubelet#kubelet-serial-gce-e2e-graceful-node-shutdown) Sergey to create an issue ([https://github.com/kubernetes/kubernetes/issues/104344](https://github.com/kubernetes/kubernetes/issues/104344))
|
||||
- \[ehashman\] New bug triage board: [https://github.com/orgs/kubernetes/projects/59](https://github.com/orgs/kubernetes/projects/59)
|
||||
- Hoping to triage and assign new bugs regularly as part of weekly triage meeting
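The testgrid cleanup discussed above (moving “pr-\*” jobs out of the sig-node-kubelet tab into a new sig-node-presubmits tab) amounts to splitting job names on a prefix pattern. A small sketch of that split, assuming job names are plain strings; the helper `split_tabs` is hypothetical, and the job list is just a sample of names that appear in these notes:

```python
from fnmatch import fnmatchcase


def split_tabs(jobs, pattern: str = "pr-*"):
    """Partition job names into the proposed presubmits tab and the rest."""
    presubmits = [job for job in jobs if fnmatchcase(job, pattern)]
    remaining = [job for job in jobs if not fnmatchcase(job, pattern)]
    return {"sig-node-presubmits": presubmits, "sig-node-kubelet": remaining}


# Sample of job names mentioned in the notes; not a complete list.
jobs = [
    "pr-node-kubelet-serial",
    "node-kubelet-alpha",
    "kubelet-gce-e2e-lock-contention",
    "node-performance-test",
]
print(split_tabs(jobs))
```

The actual move happens in the test-infra job and testgrid configuration; this only illustrates which names the “pr-\*” rule would select.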
|
||||
|
||||
## 08/04/2021
|
||||
|
||||
- \[dims\] Missing 1.5 branch containerd canaries [https://kubernetes.slack.com/archives/C0BP8PW9G/p1628039263053300](https://kubernetes.slack.com/archives/C0BP8PW9G/p1628039263053300)
|
||||
- \[ehashman\] cleaning up the presubmits in the kubelet-serial dash
|
||||
- Pr- tests \- move to separate dashboard
|
||||
- Remove release blocking ones (previous releases) from kubelet
|
||||
- \[ehashman\] why is [https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-node/sig-node-config.yaml](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-node/sig-node-config.yaml) separate?
|
||||
- \[mmiranda96\] Following [kubernetes/kubernetes\#94289](https://github.com/kubernetes/kubernetes/issues/94289), I need to fix some more test tags in kubernetes/kubernetes (I’ll create a PR similar to [this](https://github.com/kubernetes/kubernetes/pull/103674)). Anything to consider for backport?
|
||||
- \[ehashman\] [https://github.com/kubernetes/kubernetes/pull/103257](https://github.com/kubernetes/kubernetes/pull/103257) test changes
|
||||
- \[Thomas and Qiutong\] [https://github.com/kubernetes/kubernetes/issues/100467](https://github.com/kubernetes/kubernetes/issues/100467) The fix doesn’t work.
|
||||
- [https://github.com/kubernetes/kubernetes/issues/93338](https://github.com/kubernetes/kubernetes/issues/93338)
|
||||
- Action: Qiutong to investigate a reproducer test
|
||||
- \[ehashman\] reenabling flaky tests [https://github.com/kubernetes/test-infra/pull/19352](https://github.com/kubernetes/test-infra/pull/19352)
|
||||
- We are not missing a lot of signal (very few are flaky)
|
||||
- Manu will take it
|
||||
- Serial tests updates?
|
||||
- A couple of PRs are out \- de-flaking tests that were temporarily marked as flaky
|
||||
- AI on fromanirh to rebase and remove the “Flake” label
|
||||
- [https://github.com/kubernetes/kubernetes/pull/103408](https://github.com/kubernetes/kubernetes/pull/103408)
|
||||
- [https://github.com/kubernetes/kubernetes/pull/103297](https://github.com/kubernetes/kubernetes/pull/103297)
|
||||
- Not running to completion [https://github.com/kubernetes/kubernetes/issues/104038](https://github.com/kubernetes/kubernetes/issues/104038)
|
||||
-
|
||||
- We really shouldn’t call klog.Fatalf in the tests; it’s causing the massive stack traces
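A minimal sketch of the alternative to a fatal log call in test code, assuming the helper can simply return an error to its caller. `startFakeService` is a hypothetical helper invented for illustration; it is not part of the Kubernetes test framework.

```go
package main

import (
	"errors"
	"fmt"
)

// startFakeService is a hypothetical test helper. Instead of calling a
// fatal logger (which exits the process and dumps goroutine stacks), it
// returns an error and lets the caller decide how to fail.
func startFakeService(addr string) error {
	if addr == "" {
		return errors.New("empty listen address")
	}
	return nil
}

func main() {
	if err := startFakeService(""); err != nil {
		// A real test would report this through its test framework
		// rather than printing it.
		fmt.Println("setup failed:", err)
	}
}
```

With this shape, a failure is attributed to the single test that hit it, and the test process keeps running the remaining tests.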
|
||||
|
||||
## 07/28/2021
|
||||
|
||||
- \[mmiranda96\] Should I backport [this](https://github.com/kubernetes/kubernetes/pull/103827)?
|
||||
- \[ehashman\] 1.22 burndown
|
||||
- [https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.22+label%3Asig%2Fnode+](https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.22+label%3Asig%2Fnode+)
|
||||
- **Action:** mark known failures in serial tests as Flaky so we can try to get the job green
|
||||
- [https://github.com/kubernetes/kubernetes/pull/103982](https://github.com/kubernetes/kubernetes/pull/103982)
|
||||
- \[alukiano\] all managers are broken by PLEG refactoring. 1.22 will have issues with high-performance workloads \- pinning won’t work. Urgent, but not release blocking
|
||||
- Need to talk to release team to add a known issue to the release notes
|
||||
|
||||
## 07/21/2021
|
||||
|
||||
- \[haircommander\] Adding presubmit/release-blocking CRI-O jobs
|
||||
- [https://hackmd.io/kEd86GlmTD-BloMBwVXQHQ](https://hackmd.io/kEd86GlmTD-BloMBwVXQHQ)
|
||||
- \[fromani\] using device plugins in the e2e suites running on CI
|
||||
- The k8s e2e test suite has a fair number of tests which need a device plugin, because they exercise the device manager \- directly or indirectly.
|
||||
We mostly use SRIOV devices, because SRIOV devices are just the cheapest and easiest supported device to get, so this is why we wrote the tests in k8s to consume them.
|
||||
But we don't have device plugin support on CI. We do have GPU-enabled machines, but they are a subset and should be used sparingly (e.g. not every PR should use them. Or can we just use GPUs every time? I expect not, for cost reasons, but worth mentioning).
|
||||
So today a large number of tests just skip on CI. This is especially evident in the serial lane and in the resource management area.
|
||||
In RH we actually have machines with SRIOV devices which run the e2e test suite, but this is of course suboptimal for a number of reasons; a much better state for everyone would be to actually have some device plugins in upstream CI.
|
||||
There are some options we can discuss as a community:
|
||||
- 1\. use sample plugin
|
||||
- 2\. fake SRIOV devices (I can elaborate on this if there is interest)
|
||||
- 3\. just use GPUs?
|
||||
- 4\. Just bump the spec of the CI machines to have SRIOV devices?
|
||||
|
||||
- \[mkmir\] [E2E sysctl test](https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/node/sysctl.go) is marked as conformance. However, it does not respect some of the [requirements](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md#conformance-test-requirements):
|
||||
- it tests only GA, non-optional features or APIs (uses feature sysctl)
|
||||
- it works for all providers (doesn’t work for Windows and other non-sysctl OS)
|
||||
- [https://github.com/kubernetes/kubernetes/pull/101190](https://github.com/kubernetes/kubernetes/pull/101190)
|
||||
- \[ehashman\] 1.22 burndown (includes some of the topics above)
|
||||
- [https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.22+label%3Asig%2Fnode+](https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.22+label%3Asig%2Fnode+)
|
||||
- \[SergeyKanzhelev\] 1.21 vs 1.20 perf degradation: [https://github.com/kubernetes/kubernetes/issues/101989](https://github.com/kubernetes/kubernetes/issues/101989)
|
||||
|
||||
## 07/14/2021
|
||||
|
||||
- \[Sergey\] NodeConformance writeup [https://docs.google.com/document/d/1ezJPfItuhZvwyP\_RtiWTNcjCM3gi94vu1nw6uVNHKgM/edit?usp=sharing](https://docs.google.com/document/d/1ezJPfItuhZvwyP_RtiWTNcjCM3gi94vu1nw6uVNHKgM/edit?usp=sharing)
|
||||
- NodeConformance historically tried to be two things:
|
||||
- A set of e2e tests that you just needed a single node to run
|
||||
- Conformance-like test for nodes
|
||||
- \[ehashman\] Suggestion: we get rid of “NodeConformance” because the name is confusing
|
||||
- For CRI conformance-like tests, label them CRIValidation \-- NodeAgnostic?
|
||||
- For the set of e2e tests you just need a single node to run, let’s come up with a new name \-- needs a name (anything in test/e2e\_node) (SingleNodeTest or KubeletLocal)
|
||||
- Note: some tests may overlap between both sets
|
||||
- **Action:** add a plan to cover splitting the use cases for current tests
|
||||
- **Action:** send out NodeConformance plan, soliciting feedback
|
||||
- \[ehashman\] 1.22 burndown
|
||||
- \[fromani\]\[status update\]\[serial lane\] looks like [https://github.com/kubernetes/kubernetes/issues/100145](https://github.com/kubernetes/kubernetes/issues/100145) eventually broke, will look into it ASAP
|
||||
|
||||
## 07/08/2021
|
||||
|
||||
- \[ehashman\] Report on NodeConformance from SIG Arch discussion on 07/01
|
||||
- [SIG-Architecture Agenda and Meeting Notes](https://docs.google.com/document/d/1BlmHq5uPyBUDlppYqAAzslVbAO8hilgjqZUTaNXUhKM/edit#bookmark=id.ln4uxm9twb2r)
|
||||
- Historical issue [https://github.com/kubernetes/kubernetes/issues/59001](https://github.com/kubernetes/kubernetes/issues/59001)
|
||||
- Suggestion: Remove “Conformance” from the name
|
||||
- Not just CRI.
|
||||
- Conformance requires an entire cluster. NodeConformance just requires the kubelet \- need to discuss what we want in scope of NodeConformance tests.
|
||||
- NodeConformance run as presubmits
|
||||
- \[ehashman\] remaining burndown for CI signal (4 tests tracked by RT)
|
||||
- [https://groups.google.com/g/kubernetes-dev/c/u1LMXHcKhbg/m/Lp81VX7eAgAJ](https://groups.google.com/g/kubernetes-dev/c/u1LMXHcKhbg/m/Lp81VX7eAgAJ)
|
||||
- \#100788 [https://github.com/kubernetes/kubernetes/issues/100788](https://github.com/kubernetes/kubernetes/issues/100788) \[sig-node\]\[NodeConformance\] when querying /stats/summary should report resource usage through the stats api
|
||||
- \#99437 [https://github.com/kubernetes/kubernetes/issues/99437](https://github.com/kubernetes/kubernetes/issues/99437) \[Flake\]\[sig-node\] Pods should run through the lifecycle of Pods and PodStatus
|
||||
- \#75355 [https://github.com/kubernetes/kubernetes/issues/75355](https://github.com/kubernetes/kubernetes/issues/75355) \[Flaky test\] \[[k8s.io](http://k8s.io)\] Pods should support pod readiness gates \[NodeFeature:PodReadinessGate\]
|
||||
- \#99979 [https://github.com/kubernetes/kubernetes/issues/99979](https://github.com/kubernetes/kubernetes/issues/99979) \[flaky test\]: \[sig-node\] Probing container should be ready immediately after startupProbe succeeds
|
||||
- \[fromani\] cannot join (conflict \- reach me on slack if needed) \- status only:
|
||||
- same status as last week (please review the pending fixes to the serial lane\! :) )
|
||||
- Critical pod test PR: [https://github.com/kubernetes/kubernetes/pull/103408](https://github.com/kubernetes/kubernetes/pull/103408)
|
||||
- Need to investigate the DynamicConfig failures we experienced recently
|
||||
- Acknowledge the podreadinessgate flake, but still dunno how to reproduce, suggestions welcome\!
|
||||
- \[alukiano\] A lot of serial jobs broke because of DynamicConfig; a fix is already proposed: [https://github.com/kubernetes/kubernetes/pull/103580](https://github.com/kubernetes/kubernetes/pull/103580) (needs approval)
|
||||
|
||||
## 06/30/2021
|
||||
|
||||
- \[fromani\] update only, no agenda item:
|
||||
- Managed to find time to fix the serial lane:
|
||||
- Recent serial lane run: [https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/103297/pull-kubernetes-node-kubelet-serial/1410231408982495232/](https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/103297/pull-kubernetes-node-kubelet-serial/1410231408982495232/)
|
||||
- Merged: [https://github.com/kubernetes/kubernetes/pull/103265](https://github.com/kubernetes/kubernetes/pull/103265)
|
||||
- In review:
|
||||
- [https://github.com/kubernetes/kubernetes/pull/103297](https://github.com/kubernetes/kubernetes/pull/103297)
|
||||
- [https://github.com/kubernetes/kubernetes/pull/103408](https://github.com/kubernetes/kubernetes/pull/103408)
|
||||
- Next up: eviction test lane (no ETA atm)
|
||||
- More async work to be done here
|
||||
- Bug scrub follow-up
|
||||
- Added some new bugs to the board after having scrubbed bugs, including some issues for adding test coverage
|
||||
- Issues are now in a more manageable state, but we have so many
|
||||
- Suggestion: bug board \+ everything else board will be optimal, need to figure out the columns (triage/waiting/accepted/in progress/done?)
|
||||
- Hopefully moving forward we can do regular incoming issue triage as part of these meetings
|
||||
- Bot to help with automation for boards?
|
||||
- GitHub still doesn’t have support
|
||||
- Contribex is working on it
|
||||
- NodeConformance status
|
||||
- Assigned to Sergey, worked on bug scrub so hasn’t had a chance to look since last meeting
|
||||
- NodeFeature status?
|
||||
- mmiranda has submitted PR: [https://github.com/kubernetes/test-infra/pull/22677](https://github.com/kubernetes/test-infra/pull/22677)
|
||||
- Starting with duplicating selectors in test-infra, then we can start making test changes
|
||||
- \[Sergey\] Soak tests
|
||||
- [https://github.com/kubernetes/kubernetes/issues/64523](https://github.com/kubernetes/kubernetes/issues/64523)
|
||||
- Is there logic we can reuse to automatically detect this?
|
||||
- Not afaik \- usually determined by querying debugging endpoint and looking at the memory dumps
|
||||
- Resource utilization regressions now being tracked in perf-tests: [https://github.com/kubernetes/perf-tests/issues/1789](https://github.com/kubernetes/perf-tests/issues/1789)
|
||||
- How to find reviewers for various PRs?
|
||||
- **Action:** Swati to add item to next week’s SIG Node meeting to discuss with wider SIG
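For the NodeFeature status item above (duplicating selectors before changing test tags), here is a rough sketch of a focus-style regular expression that accepts both the old and the new tag. It assumes tags appear in test names as `[NodeFeature:<name>]` and `[Feature:<name>]`; `featureSelector` is a hypothetical helper, not something defined in test-infra.

```go
package main

import (
	"fmt"
	"regexp"
)

// featureSelector builds a regular expression that matches a test name
// tagged with either [NodeFeature:<name>] or [Feature:<name>].
func featureSelector(name string) *regexp.Regexp {
	return regexp.MustCompile(`\[(Node)?Feature:` + regexp.QuoteMeta(name) + `\]`)
}

func main() {
	sel := featureSelector("PodReadinessGate")
	for _, testName := range []string{
		"Pods should support pod readiness gates [NodeFeature:PodReadinessGate]",
		"Pods should support pod readiness gates [Feature:PodReadinessGate]",
		"Pods should run through the lifecycle of Pods and PodStatus",
	} {
		fmt.Printf("%v  %s\n", sel.MatchString(testName), testName)
	}
}
```

A selector like this keeps a job picking up the same tests while the tags are renamed, so the job configuration and the test changes do not have to land at the same time.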
|
||||
|
||||
## 06/23/2021
|
||||
|
||||
- Discuss removing NodeFeature flags in favour of Feature: [https://github.com/kubernetes/kubernetes/issues/94289](https://github.com/kubernetes/kubernetes/issues/94289)
|
||||
- Mike Miranda to drive
|
||||
- Need to update labels in both test-infra and k/k
|
||||
- What is the difference between NodeConformance and Conformance?
|
||||
- Filed bug for documentation: [https://github.com/kubernetes/community/issues/5859](https://github.com/kubernetes/community/issues/5859)
|
||||
- Serial tests
|
||||
- No progress
|
||||
- \[ehashman\] Volunteers for bug scrub in a week\!
|
||||
- NASA needs more mentors: [https://docs.google.com/spreadsheets/d/1y6HKIsThphzpaG2a-Vgsc66b36ckM\_TIPqGJ6GeVUGI/edit\#gid=0](https://docs.google.com/spreadsheets/d/1y6HKIsThphzpaG2a-Vgsc66b36ckM_TIPqGJ6GeVUGI/edit#gid=0)
|
||||
|
||||
Attendees:
|
||||
![][image14]
|
||||
|
||||
# 06/16/2021
|
||||
|
||||
Attendees:
|
||||
![][image15]
|
||||
|
||||
- Discuss removing NodeFeature flags in favour of Feature: [https://github.com/kubernetes/kubernetes/issues/94289](https://github.com/kubernetes/kubernetes/issues/94289)
|
||||
-
|
||||
- What is the difference between NodeConformance and Conformance?
|
||||
- See also [https://kubernetes.slack.com/archives/C78F00H99/p1623803503028500](https://kubernetes.slack.com/archives/C78F00H99/p1623803503028500)
|
||||
- \[ehashman\] Volunteers for bug scrub in a week\!
|
||||
- Serial tests
|
||||
- Eviction tests were holding everything up. They are now split out and need attention. Odin has a PR ready \- increase open files limit
|
||||
- Some serial tests are failing
|
||||
- AI: Francesco: follow up from serial tests failure investigation
|
||||
|
||||
# 06/10/2021
|
||||
|
||||
Attendees:
|
||||
|
||||
- \[Sergey\] I cannot join this time, but this is a one-off conflict
|
||||
-
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[pacoxu\]
|
||||
- [x] ~~Ci-kubernetes-node-kubelet-conformance keeps failing: [https://github.com/kubernetes/kubernetes/issues/97130](https://github.com/kubernetes/kubernetes/issues/97130) (should be fixed by [\[sig-node\] Remove failing, unused NodeConformance job kubernetes/test-infra\#22454](https://github.com/kubernetes/test-infra/pull/22454) )~~
|
||||
- Serial tests update [https://github.com/kubernetes/kubernetes/issues/102148](https://github.com/kubernetes/kubernetes/issues/102148)
|
||||
- \[fromani\] One huge serial test is too much, thinks it makes sense to split
|
||||
- \[paco\] I will help with this and work on the timeout tests (eviction?)
|
||||
- Filed [https://github.com/kubernetes/kubernetes/issues/102782](https://github.com/kubernetes/kubernetes/issues/102782) to track eviction
|
||||
- Failing Node Conformance tests?
|
||||
- New issue filed as [https://github.com/kubernetes/test-infra/issues/22250](https://github.com/kubernetes/test-infra/issues/22250) \- possible duplicate of the existing 2 issues
|
||||
- PR up to remove the tests at [https://github.com/kubernetes/test-infra/pull/22454](https://github.com/kubernetes/test-infra/pull/22454)
|
||||
- Swati to further investigate
|
||||
|
||||
# 06/02/2021
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[fromani \- cannot attend the mtg \- just status update\] “Serial” updates
|
||||
- Kubelet \+ test logs uploaded [https://github.com/fromanirh/k8smisc/tree/main/e2e\_node](https://github.com/fromanirh/k8smisc/tree/main/e2e_node) (lacking a better place; suggestions?)
|
||||
- Spending time analyzing the logs, will update about findings (and send PRs :) )
|
||||
- Everyone is welcome to ping me anytime on slack to talk about this (/run more tests/upload logs)
|
||||
|
||||
- APAC friendly time: [https://doodle.com/poll/sfeh699qt44mrzv6](https://doodle.com/poll/sfeh699qt44mrzv6) Cannot see responses...
|
||||
|
||||
# 05/25/2021
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[fromani\] updates about “Serial” [https://testgrid.k8s.io/sig-node-kubelet\#node-kubelet-serial](https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial)
|
||||
- Got CPU time to run tests locally and collect logs
|
||||
- First experimental run done (failed with timeout)
|
||||
- Doing runs with the same parameters the node-kubelet-serial lane is using
|
||||
- Will collect and publish logs on [https://github.com/fromanirh/k8smisc/tree/main/e2e\_node](https://github.com/fromanirh/k8smisc/tree/main/e2e_node) (lacking a better place; suggestions?)
|
||||
- Will deep dive into the logs once they are published \- so we can go in parallel
|
||||
- Ping me on k8s/cncf slack chans (@fromani) for any question/comment/chat about the issue
|
||||
- Ping me or file issues against the repo above so we don’t miss/forget
|
||||
- PRs potentially helping with them:
|
||||
- File handles limit increase [https://github.com/kubernetes/kubernetes/pull/102169](https://github.com/kubernetes/kubernetes/pull/102169)
|
||||
- Fix that allows uploading artifacts [https://github.com/kubernetes/kubernetes/pull/102209](https://github.com/kubernetes/kubernetes/pull/102209)
|
||||
- More logs are coming (kubelet.log)
|
||||
- Add ability to run tests on PR
|
||||
- [https://github.com/kubernetes/test-infra/pull/22284](https://github.com/kubernetes/test-infra/pull/22284)
|
||||
|
||||
# 05/19/2021
|
||||
|
||||
Attendees:
|
||||
![][image16]
|
||||
|
||||
Agenda:
|
||||
|
||||
- Meeting time for Asia?
|
||||
- Code coverage dashboard in OSS: [https://testgrid.k8s.io/sig-testing-canaries\#ci-kubernetes-coverage-unit\&include-filter-by-regex=pkg%2Fkubelet](https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit&include-filter-by-regex=pkg%2Fkubelet)
|
||||
- Examples:
|
||||
- pkg/kubelet/container/runtime.go
|
||||
- pkg/kubelet/certificate/bootstrap/bootstrap.go
|
||||
- \[Elana\] Some files don’t have targeted unit tests
|
||||
- \[Matthias\] Some files very hard to write tests for
|
||||
- \[Elana\] Rearchitecting some features will be needed
|
||||
- \[Francesco\] These are backlog items. How will we prioritize?
|
||||
- Discuss “Serial” [https://testgrid.k8s.io/sig-node-kubelet\#node-kubelet-serial](https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial)
|
||||
- Timeouts under tests may need adjustment
|
||||
- We already adjusted global timeout
|
||||
- Just run locally with higher timeout and see what is failing
|
||||
- **Action** \[Artyom\] file issue [https://github.com/kubernetes/kubernetes/issues/102148](https://github.com/kubernetes/kubernetes/issues/102148)
|
||||
- Francesco \- best effort \- can run tests locally
|
||||
- Aditi, Mike, Matthias to help
|
||||
- \[Aditi\] Flags for kubelet? Or kubelet config?
|
||||
- Policy is no new flags, add to KubeletConfig
|
||||
- Raise at full SIG Node
|
||||
|
||||
# 05/12/2021
|
||||
|
||||
Agenda:
|
||||
|
||||
- Triage: [https://github.com/orgs/kubernetes/projects/43?card\_filter\_query=no%3Aassignee](https://github.com/orgs/kubernetes/projects/43?card_filter_query=no%3Aassignee)
|
||||
|
||||
- \[Artyom\]Need more good first issues
|
||||
- Writing e2e test may be a good (difficult) first issue
|
||||
- \[Francesco\] please note good first issues and we can review them together later
|
||||
- \[A\] increasing code coverage is a good issue
|
||||
- Still failing:
|
||||
- Node conformance (docker)
|
||||
- \[S\] find an issue and include on the project board.
|
||||
- Serial jobs are still failing
|
||||
- \[A\] when we increased memory \- it is timing out. Pod is killed so no logs
|
||||
- Artyom to open a new issue
|
||||
- Orphans clean up
|
||||
- [https://github.com/kubernetes/kubernetes/issues/98265](https://github.com/kubernetes/kubernetes/issues/98265)
|
||||
-
|
||||
|
||||
# 05/04/2021 Cancel for KubeCon
|
||||
|
||||
# 04/28/2021 Short sync up
|
||||
|
||||
Attendees:
|
||||
![][image17]
|
||||
|
||||
Agenda:
|
||||
|
||||
- Follow ups \- need to move to the next week:
|
||||
- Kubetest2 updates David
|
||||
- [NodeConformance](https://testgrid.k8s.io/sig-node-critical#kubelet-NodeConformance) Qiutong
|
||||
- Serial clean up \- Francesco, actually there is a PR: [https://github.com/kubernetes/test-infra/pull/21828](https://github.com/kubernetes/test-infra/pull/21828)
|
||||
- \[fromani\] Managing kubelet state in e2e tests: overriding the system default (/var/lib/kubelet)
|
||||
- Shared global state between tests
|
||||
- Any objections to move away from it?
|
||||
- Storing state of the kubelet
|
||||
- Ideally each e2e test has its own state
|
||||
- \[ehashman\] looking for volunteers for [https://github.com/kubernetes/perf-tests/issues/1789](https://github.com/kubernetes/perf-tests/issues/1789)
|
||||
- Currently don’t have any perf/scalability tests for upstream kubelet resource utilization (CPU/memory)
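On the kubelet state item above (moving away from the shared /var/lib/kubelet default so each e2e test has its own state), one possible shape is a per-test temporary directory. This is only a sketch: `newStateDir`, the `kubelet-state-` prefix and the state file name in `main` are illustrative assumptions, not the mechanism the e2e framework uses.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// newStateDir creates a fresh, private directory for one test and returns
// its path plus a cleanup function. The "kubelet-state-" prefix is a
// hypothetical naming choice.
func newStateDir(testName string) (string, func(), error) {
	dir, err := os.MkdirTemp("", "kubelet-state-"+testName+"-")
	if err != nil {
		return "", nil, err
	}
	cleanup := func() { _ = os.RemoveAll(dir) }
	return dir, cleanup, nil
}

func main() {
	dir, cleanup, err := newStateDir("cpumanager")
	if err != nil {
		fmt.Println("cannot create state dir:", err)
		return
	}
	defer cleanup()
	// A test would keep its state files here instead of /var/lib/kubelet.
	// The file name is only an example.
	fmt.Println("state file:", filepath.Join(dir, "cpu_manager_state"))
}
```

Because every call returns a new directory, two tests never see each other's leftover state, which is the property the discussion is after.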
|
||||
|
||||
# 04/21/2021
|
||||
|
||||
Attendees:
|
||||
|
||||
- Sergey
|
||||
- Elana
|
||||
- Artyom
|
||||
|
||||
Agenda:
|
||||
|
||||
- triage
|
||||
|
||||
# 04/14/2021
|
||||
|
||||
Attendees:
|
||||
|
||||
- fromani
|
||||
- ehashman
|
||||
- [Sergey Kanzhelev](mailto:skanzhelev@google.com)
|
||||
- David Porter
|
||||
- Amim Knabben
|
||||
- Qiutong Song
|
||||
- Madhav Jivrajani
|
||||
- Jiaming Xu
|
||||
|
||||
Agenda:
|
||||
|
||||
- Kubetest2 updates: [https://github.com/kubernetes-sigs/kubetest2/pull/103/](https://github.com/kubernetes-sigs/kubetest2/pull/103/)
|
||||
- David: currently using kubetest1, moving to kubetest2. Feedback: kubetest has some flags that do not match the “make” way of running tests. The PR moves to the “make” way of running tests. David’s concern: all tests would need to be migrated, but he does not see much value in doing so.
|
||||
- Elana: it will be useful to have a writeup with the justification.
|
||||
- Francesco: want to take a look and learn more as a consumer of tests.
|
||||
- Q: is this part of a process to make the node e2e test more like the other e2e tests?
|
||||
- A: no, it’s just a new interface to call the same tests
|
||||
- Action: David to ask for justification, estimation of migration effort
|
||||
- Should we move \[sig-storage\] tests from [https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-master](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-master) to some place under [https://testgrid.k8s.io/sig-storage-kubernetes](https://testgrid.k8s.io/sig-storage-kubernetes)?
|
||||
- Amim: prefer to keep tests, maybe remove flaky tests
|
||||
- Sergey: need a mechanism to notify those teams
|
||||
- Move \[sig-network\] tests from [https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-master](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-master) to some place under [https://testgrid.k8s.io/sig-network](https://testgrid.k8s.io/sig-network)
|
||||
- Fix infra issue at: [https://testgrid.k8s.io/sig-node-critical\#kubelet-NodeConformance](https://testgrid.k8s.io/sig-node-critical#kubelet-NodeConformance)
|
||||
- Qiutong to take a look
|
||||
- Clean-up [https://testgrid.k8s.io/sig-node-kubelet\#node-kubelet-serial](https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial)
|
||||
- Francesco: will try.
|
||||
- Multi-zone tests: [https://github.com/kubernetes/test-infra/pull/21777](https://github.com/kubernetes/test-infra/pull/21777)
|
||||
- Elana: looks like cloud provider specific
|
||||
- Sergey: to look at what these tests are supposed to cover
|
||||
|
||||
# 04/07/2021
|
||||
|
||||
Attendees:
|
||||
|
||||
- Elana Hashman
|
||||
- Sergey Kanzhelev
|
||||
- Alukiano
|
||||
- Tao
|
||||
- Qiutong
|
||||
|
||||
Agenda:
|
||||
|
||||
- Artem will start looking at fake NUMA flag.
|
||||
|
||||
# 03/30/2021 Cancelled
|
||||
|
||||
# 03/24/2021
|
||||
|
||||
Agenda
|
||||
|
||||
[https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.21+label%3Asig%2Fnode](https://github.com/kubernetes/kubernetes/issues?q=is%3Aopen+milestone%3Av1.21+label%3Asig%2Fnode)
|
||||
|
||||
Discussing [https://github.com/kubernetes/kubernetes/pull/99336](https://github.com/kubernetes/kubernetes/pull/99336):
|
||||
|
||||
- Overall feeling is that it’s too late for unknown unknowns introduced by this PR.
|
||||
- 1.18 cherry-picking is not a Node team problem, more a release team problem. Maybe the release team will need an exception.
|
||||
|
||||
# 03/17/2021
|
||||
|
||||
Attendees
|
||||
|
||||
Agenda
|
||||
|
||||
- \[ehashman\] Discuss the future of node presubmit tests
|
||||
- Context: [https://github.com/kubernetes/test-infra/pull/21278](https://github.com/kubernetes/test-infra/pull/21278) [https://kubernetes.slack.com/archives/C0BP8PW9G/p1615470977302300](https://kubernetes.slack.com/archives/C0BP8PW9G/p1615470977302300)
|
||||
|
||||
The long-term direction is not to test the whole matrix on presubmits, but to have a good signal with failures that are easy for contributors to investigate. Maybe PR needs to be replaced with a single job with both runtimes.
|
||||
|
||||
**Action:** determine a long-term plan to merge all node presubmits into one job, using a name that doesn’t reveal the underlying runtimes. (e.g. cleanup ubuntu-containerd\* tests)
|
||||
Timeframe: dependent on a presubmit policy, maybe will happen next release cycle (1.22)
|
||||
|
||||
- [https://github.com/kubernetes/kubernetes/issues/94289](https://github.com/kubernetes/kubernetes/issues/94289)
|
||||
- Sergey: list all the tags and decide what to do about it
|
||||
- [https://github.com/kubernetes/kubernetes/issues/96524](https://github.com/kubernetes/kubernetes/issues/96524)
|
||||
- Sergey’s todo.
|
||||
|
||||
# 03/10/2021
|
||||
|
||||
Attendees
|
||||
|
||||
* ehashman
|
||||
* fromanirh
|
||||
* swsehgal
|
||||
*
|
||||
|
||||
Agenda
|
||||
|
||||
- Deflake of sig-node-alpha tab [https://testgrid.k8s.io/sig-node-kubelet\#node-kubelet-alpha](https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-alpha)
|
||||
- It’s done\!\!
|
||||
|
||||
# 03/03/2021
|
||||
|
||||
Attendees
|
||||
|
||||
Agenda
|
||||
|
||||
- Containerd tests. TODO: insert links
|
||||
- Triage
|
||||
- \[alukiano\] Serial job \- timeouts because of OOMs. Artem will switch to 2 GB \- log [https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1365962673736388608/artifacts/tmp-node-e2e-a988a9a1-cos-81-12871-1245-10/kern.log](https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1365962673736388608/artifacts/tmp-node-e2e-a988a9a1-cos-81-12871-1245-10/kern.log)
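When checking a kern.log like the one linked above for OOM kills, a small filter can help narrow things down. The sketch below assumes the kernel log marks these events with the strings "Out of memory" or "oom-killer"; the sample lines in `main` are synthetic and are not taken from the linked log.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// findOOMLines returns the lines of a kernel log that mention the OOM
// killer. The two marker strings are assumptions about typical Linux
// kernel messages.
func findOOMLines(log string) []string {
	var hits []string
	scanner := bufio.NewScanner(strings.NewReader(log))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "Out of memory") || strings.Contains(line, "oom-killer") {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Synthetic, abbreviated sample lines for illustration only.
	sample := "kernel: test-binary invoked oom-killer: order=0\n" +
		"kernel: eth0: link up\n" +
		"kernel: Out of memory: Killed process 1234 (test-binary)\n"
	for _, line := range findOOMLines(sample) {
		fmt.Println(line)
	}
}
```

Matching on substrings keeps the filter tolerant of differing timestamp and prefix formats across images.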
|
||||
|
||||
# 02/24/2021
|
||||
|
||||
Attendees (7 on call):
|
||||
![][image18]
|
||||
|
||||
- Triage [https://github.com/orgs/kubernetes/projects/43](https://github.com/orgs/kubernetes/projects/43)
|
||||
|
||||
1. Containerd plan [https://github.com/kubernetes/test-infra/issues/18570](https://github.com/kubernetes/test-infra/issues/18570)
|
||||
2. Questions about [https://github.com/kubernetes/test-infra/pull/20937](https://github.com/kubernetes/test-infra/pull/20937) and node-kubelet-serial tests ([https://testgrid.k8s.io/sig-node-kubelet\#node-kubelet-serial](https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial))
|
||||
3. Announcement: node n-2 version skew tests to be discussed at SIG Arch tomorrow: [https://groups.google.com/g/kubernetes-sig-architecture/c/QX-4qq2krm0/m/998T3cJUBQAJ](https://groups.google.com/g/kubernetes-sig-architecture/c/QX-4qq2krm0/m/998T3cJUBQAJ)
|
||||
|
||||
Product triage: [https://github.com/orgs/kubernetes/projects/49](https://github.com/orgs/kubernetes/projects/49)
|
||||
|
||||
- Feature PRs missing from board that happen to have sig/testing label
|
||||
- Action: needs-rebase isn’t auto-applied, bot needs to be pestered. File issue to proactively apply without resetting stale counter
|
||||
- [https://github.com/kubernetes/test-infra/issues/21006](https://github.com/kubernetes/test-infra/issues/21006)
|
||||
|
||||
# 02/17/2021
|
||||
|
||||
- Triage [https://github.com/orgs/kubernetes/projects/43](https://github.com/orgs/kubernetes/projects/43)
|
||||
- SIG-labels for all tests
|
||||
- Morgan as approver
|
||||
- \[fromani\] CPU manager e2e tests need improvement.
|
||||
|
||||
Product triage: [https://github.com/orgs/kubernetes/projects/49](https://github.com/orgs/kubernetes/projects/49)
|
||||
|
||||
# 02/08/2021
|
||||
|
||||
\[Sergey\] New time for the meeting? It looks like 10 AM Mon is very inconvenient. Is Monday 9AM better?
|
||||
|
||||
[https://doodle.com/poll/ii5vyde6wpp3migm?utm\_campaign=poll\_update\_participant\_admin\&utm\_medium=email\&utm\_source=poll\_transactional\&utm\_content=gotopoll-cta\#table](https://doodle.com/poll/ii5vyde6wpp3migm?utm_campaign=poll_update_participant_admin&utm_medium=email&utm_source=poll_transactional&utm_content=gotopoll-cta#table)
|
||||
|
||||
![][image19]
|
||||
|
||||
Triage: [https://github.com/orgs/kubernetes/projects/43](https://github.com/orgs/kubernetes/projects/43)
|
||||
|
||||
- Suggest MHBauer as approver. Elana to reach out
|
||||
-
|
||||
|
||||
# 02/01/2021 \[skipping\]
|
||||
|
||||
# 01/25/2021
|
||||
|
||||
Agenda:
|
||||
\[Aditi\]
|
||||
|
||||
- Test plan for containerd [https://github.com/kubernetes/test-infra/issues/18570](https://github.com/kubernetes/test-infra/issues/18570#issuecomment-764576223)
|
||||
Some test analysis here [https://docs.google.com/spreadsheets/d/1mN1fG0dq6t7dZTzl-g9fFNwFqD8XdbbKlr5D3R7BRzc/edit\#gid=0](https://docs.google.com/spreadsheets/d/1mN1fG0dq6t7dZTzl-g9fFNwFqD8XdbbKlr5D3R7BRzc/edit#gid=0)
|
||||
Some answers we want [https://github.com/kubernetes/test-infra/issues/18570\#issuecomment-764576223](https://github.com/kubernetes/test-infra/issues/18570#issuecomment-764576223)
|
||||
- Aditi: Updated action items here [https://github.com/kubernetes/test-infra/issues/18570\#issuecomment-767024601](https://github.com/kubernetes/test-infra/issues/18570#issuecomment-767024601)
|
||||
|
||||
\[Sergey\] Do we need this:
|
||||
[https://github.com/kubernetes/k8s.io/issues/956\#issuecomment-764847132](https://github.com/kubernetes/k8s.io/issues/956#issuecomment-764847132)?
|
||||
|
||||
- Need for deflaking tests. “Pause” job and SSH access
|
||||
- Sometimes it is hard to understand what failed and why PR validation failed
|
||||
|
||||
\[knabben\]
|
||||
|
||||
- Deflake of these tabs:
|
||||
- [Node-kubelet-orphans](https://github.com/kubernetes/kubernetes/issues/98265) (partial)
|
||||
- [Node-kubelet-alpha](https://github.com/kubernetes/kubernetes/issues/98220)
|
||||
|
||||
\[ehashman\]
|
||||
|
||||
* Non-CI PR triage: Node board [https://github.com/orgs/kubernetes/projects/49](https://github.com/orgs/kubernetes/projects/49)
|
||||
* Action (ehashman): will add note cards to columns of project board to explain how it works
|
||||
* Action (ehashman): will draft doc on HOW-TO node review for community repo, tag attendees, and leave open for lazy consensus/feedback
|
||||
|
||||
\[Meeting time\]
|
||||
|
||||
* Action: Sergey to send doodle for maybe moving this meeting? Mondays frequently conflict.
|
||||
|
||||
# 01/11/2021
|
||||
|
||||
Attendees (7 on the call):
|
||||
|
||||
Agenda:
|
||||
|
||||
- triage
|
||||
|
||||
# 01/04/2021
|
||||
|
||||
Attendees:
|
||||
|
||||
Agenda:
|
||||
|
||||
\[knabben\]
|
||||
|
||||
- [Node s/gci/cos/g tab rename](https://docs.google.com/document/d/1KlnfvQ_OPkrty5DvKHmeEMTKMpaVK-ioHgDOpVFRrkA/) \- [https://github.com/kubernetes/test-infra/pull/20351](https://github.com/kubernetes/test-infra/pull/20351/files)
|
||||
- Adds a **\--restart-kubelet** flag on Node E2E tests \- PTAL [https://github.com/kubernetes/kubernetes/pull/97028/](https://github.com/kubernetes/kubernetes/pull/97028/)
|
||||
|
||||
\[Sergey\] Triage
|
||||
|
||||
# Volunteers to help with this effort
|
||||
|
||||
Victor Pickard (Red Hat), nick=vpickard, vpickard@redhat.com
|
||||
Jay Pipes (AWS), nick=jaypipes, gh=jaypipes, jaypipes@gmail.com
|
||||
Balaji (AWS), nick=srisaranbalaji, gh=SaranBalaji90, srisaranbalaji@gmail.com
|
||||
Morgan Bauer (IBM), slack=mhb, gh=mhbauer, mail=mbauer@us.ibm.com
|
||||
Ning Liao (Google), nick=nliao, mail=ningliao@google.com
|
||||
David Porter (Google), nick=davidporter; mail=[porterdavid@google.com](mailto:porterdavid@google.com)
|
||||
Hanfei Lin (Google), nick=hanfeil; mail=hanfeil@google.com
|
||||
Hugo Huang (Google), nick=tangent; mail=[tangent@google.com](mailto:tangent@google.com)
|
||||
Roy Yang(Google), nick=roy; mail=royyang@google.com
|
||||
Aaron Crickenberger (Google), nick=spiffxp, spiffxp@gmail.com
|
||||
nick=Archer
|
||||
Ed Bartosh (Intel), slack=Ed, github=bart0sh [eduard.bartosh@intel.com](mailto:eduard.bartosh@intel.com)
|
||||
Daniel Mangum (upbound.io), nick=hasheddan
|
||||
Chirag Tayal (PayPal) nick=ctayal, [chiragtayal@gmail.com](mailto:chiragtayal@gmail.com)
|
||||
Zhi Feng(Airbnb), nick=Zhi, helenfengzhi@gmail.com
|
||||
Dims, nick=dims, [davanum@gmail.com](mailto:davanum@gmail.com)
|
||||
Jacob Blain Christen (Rancher Labs), nick=dweomer; mail=[dweomer5@gmail.com](mailto:dweomer5@gmail.com)
|
||||
Artyom Lukianov(Red Hat), nick(github)=cynepco3hahue,nick(slack)=alukiano,mail=[alukiano@redhat.com](mailto:alukiano@redhat.com)
|
||||
Swati Sehgal (Red Hat), slack=swsehgal; mail [swsehgal@redhat.com](mailto:swsehgal@redhat.com)
|
||||
Jorge Alarcon, nick=alejandrox1, [alarcj137@gmail.com](mailto:alarcj137@gmail.com)
|
||||
Sascha Grunert (SUSE), nick=sascha, [sgrunert@suse.com](mailto:sgrunert@suse.com)
|
||||
Srini Brahmaroutu (IBM), slack=srbrahma, gh=brahmaroutu, mail=[srbrahma@us.ibm.com](mailto:srbrahma@us.ibm.com)
|
||||
Tim Pepper (VMware), slack=tpepper, gh=tpepper, mail=[tpepper@vmware.com](mailto:tpepper@vmware.com)
|
||||
John Taylor (IBM), mail=[jtaylor1@uk.ibm.com](mailto:jtaylor1@uk.ibm.com)
|
||||
Francesco Romani (Red Hat), nick=fromani; mail=[fromani@redhat.com](mailto:fromani@redhat.com)
|
||||
Karan Goel (Google), nick=karan; mail=[karangoel@google.com](mailto:karangoel@google.com)
|
||||
Sergey Kanzhelev (Google), nick=SergeyKanzhelev; mail=[skanzhelev@google.com](mailto:skanzhelev@google.com)
|
||||
Mike Carlise (Salesforce), nick=micarlise, mail=[micarlise@gmail.com](mailto:micarlise@gmail.com)
|
||||
Matt Merkes (AWS), nick=merkes, mail=[matt.merkes@gmail.com](mailto:matt.merkes@gmail.com)
|
||||
Amim Knabben (Loadsmart), nick=knabben, mail=[amim.knabben@gmail.com](mailto:amim.knabben@gmail.com)
|
||||
Swati Sehgal (Red Hat), nick(slack)=swsehgal, nick(github)= swatisehgal, mail \= [swsehgal@redhat.com](mailto:swsehgal@redhat.com)
|
||||
Harshal Patil (Red Hat), slack=Harshal, gh=harche, mail=[harpatil@redhat.com](mailto:harpatil@redhat.com)
|
||||
Elana Hashman (Red Hat), nick=ehashman, mail=[ehashman@redhat.com](mailto:ehashman@redhat.com)
|
||||
Paco Xu (DaoCloud), nick=paco, mail=[paco.xu@daocloud.io](mailto:paco.xu@daocloud.io)
|
|
@ -0,0 +1,536 @@
|
|||
# Kubernetes SIG-Node CI subgroup notes
|
||||
|
||||
## 2022/12/14
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=drlQWZiMj6o](https://www.youtube.com/watch?v=drlQWZiMj6o)
|
||||
|
||||
- [https://github.com/kubernetes/test-infra/issues/28211](https://github.com/kubernetes/test-infra/issues/28211)
|
||||
- \[Francesco\] Let’s also use a new label for multi-numa
|
||||
- \[Swati\] will initiate POC and then we can clean up by adding labels and doing other optimizations
|
||||
|
||||
PRs:
|
||||
|
||||
* [Block ephemeral containers for Static Pods](https://github.com/kubernetes/kubernetes/pull/114086)
|
||||
* [More backport registry move to 1.23](https://github.com/kubernetes/kubernetes/pull/114377)
|
||||
|
||||
## 2022/12/07
|
||||
|
||||
* \[swsehgal\] Topology Manager e2e tests currently do not run on K8s CI due to the lack of multi-NUMA systems in the CI infrastructure. This could be a blocker for its GA graduation.
|
||||
* Planning to bring this to the main SIG Node meeting next week but was wondering if this group has any suggestions on how this can be handled?
|
||||
* Do we have [Compute optimized](https://cloud.google.com/compute/docs/compute-optimized-machines) nodes in the CI infrastructure? C2-standard-60 (referenced [here](https://cloud.google.com/architecture/best-practices-for-using-mpi-on-compute-engine)) provides multi-NUMA VMs, but I don’t think we have them in our infra.
|
||||
* Any pointers?
|
||||
* Let’s talk to k8s infra before taking an expensive machine like C2-standard-60
|
||||
|
||||
\[Sergey\] Update:
|
||||
Two cheapest options with NUMA on GCP:
|
||||
n2-standard-32 $908.47
|
||||
skanzhelev@n2-standard-32:\~$ grep NUMA=y /boot/config-\`uname \-r\`
|
||||
lscpu | grep \-i numa
|
||||
CONFIG\_NUMA=y
|
||||
CONFIG\_X86\_64\_ACPI\_NUMA=y
|
||||
CONFIG\_ACPI\_NUMA=y
|
||||
NUMA node(s): 2
|
||||
NUMA node0 CPU(s): 0-7,16-23
|
||||
NUMA node1 CPU(s): 8-15,24-31
|
||||
n2d-standard-32 $790.49
|
||||
skanzhelev@n2d-standard-32:\~$ grep NUMA=y /boot/config-\`uname \-r\`
|
||||
lscpu | grep \-i numa
|
||||
CONFIG\_NUMA=y
|
||||
CONFIG\_X86\_64\_ACPI\_NUMA=y
|
||||
CONFIG\_ACPI\_NUMA=y
|
||||
NUMA node(s): 2
|
||||
NUMA node0 CPU(s): 0-7,16-23
|
||||
NUMA node1 CPU(s): 8-15,24-31
|
||||
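To compare candidate machine types quickly, the `lscpu` output above can be parsed mechanically. A minimal sketch, assuming the `NUMA node(s):` and `NUMA nodeN CPU(s):` line format shown in these notes; the function names are made up for illustration and are not part of any Kubernetes tooling:

```python
import re


def expand_cpu_list(spec):
    """Expand a CPU list such as '0-7,16-23' into a list of CPU ids."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def parse_numa(lscpu_text):
    """Return (node_count, {node_id: [cpu ids]}) from `lscpu` output."""
    node_count = None
    nodes = {}
    for line in lscpu_text.splitlines():
        m = re.match(r"\s*NUMA node\(s\):\s*(\d+)", line)
        if m:
            node_count = int(m.group(1))
            continue
        m = re.match(r"\s*NUMA node(\d+) CPU\(s\):\s*(\S+)", line)
        if m:
            nodes[int(m.group(1))] = expand_cpu_list(m.group(2))
    return node_count, nodes


# Sample copied from the n2-standard-32 output in these notes.
SAMPLE = """NUMA node(s): 2
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
"""

count, nodes = parse_numa(SAMPLE)
is_multi_numa = count is not None and count > 1
print(count, {node: len(cpus) for node, cpus in nodes.items()}, is_multi_numa)
```

On a real host the text would come from running `lscpu` rather than from a pasted sample.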
|
||||
* \[SergeyKanzhelev\] cgroup mismatch between kubelet and runtime work is ongoing
|
||||
|
||||
## 2022/11/23 \[Canceled \- short week in US\]
|
||||
|
||||
## 2022/11/16
|
||||
|
||||
- Nothing release blocking: [https://github.com/kubernetes/kubernetes/pulls?q=is%3Apr+is%3Aopen+milestone%3Av1.26+](https://github.com/kubernetes/kubernetes/pulls?q=is%3Apr+is%3Aopen+milestone%3Av1.26+)
|
||||
-
|
||||
- Containerd 1.5: [https://kubernetes.slack.com/archives/C0BP8PW9G/p1668511893118389](https://kubernetes.slack.com/archives/C0BP8PW9G/p1668511893118389)
|
||||
- Mismatch of driver in kubelet and runtime is not easily discoverable
|
||||
- Cgroupv2 will be primary test target
|
||||
-
|
||||
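A rough way to make the kubelet/runtime driver mismatch mentioned above visible is to read both configs and compare. This is only a sketch: it assumes the kubelet config carries a `cgroupDriver:` field, that containerd's `config.toml` carries `SystemdCgroup = true|false` under the runc options, and that a missing value means `cgroupfs`. The helper names and sample texts are hypothetical, and the regexes are not a substitute for a real YAML/TOML parser.

```python
import re


def kubelet_cgroup_driver(kubelet_config_text, default="cgroupfs"):
    """Extract cgroupDriver from kubelet config text (assumed default: cgroupfs)."""
    m = re.search(r'^\s*cgroupDriver:\s*"?(\w+)"?\s*$', kubelet_config_text, re.MULTILINE)
    return m.group(1) if m else default


def containerd_cgroup_driver(containerd_config_text):
    """Map containerd's SystemdCgroup flag to a driver name (assumed default: cgroupfs)."""
    m = re.search(r"^\s*SystemdCgroup\s*=\s*(true|false)\s*$", containerd_config_text, re.MULTILINE)
    return "systemd" if m and m.group(1) == "true" else "cgroupfs"


# Hypothetical config fragments, for illustration only.
KUBELET_SAMPLE = """kind: KubeletConfiguration
cgroupDriver: systemd
"""
CONTAINERD_SAMPLE = """[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = false
"""

kubelet_driver = kubelet_cgroup_driver(KUBELET_SAMPLE)
runtime_driver = containerd_cgroup_driver(CONTAINERD_SAMPLE)
if kubelet_driver != runtime_driver:
    print(f"cgroup driver mismatch: kubelet={kubelet_driver}, runtime={runtime_driver}")
```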
|
||||
## 2022/11/09
|
||||
|
||||
- Release blocking? Code freeze yesterday
|
||||
- [https://github.com/kubernetes/kubernetes/issues/113791](https://github.com/kubernetes/kubernetes/issues/113791) may affect pod vertical scaling tests
|
||||
- [https://github.com/kubernetes/kubernetes/issues/113781](https://github.com/kubernetes/kubernetes/issues/113781) Francesco to take a look
|
||||
-
|
||||
-
|
||||
|
||||
## 2022/11/02 \[cancelled due to host unavailability\]
|
||||
|
||||
## 2022/10/26
|
||||
|
||||
## 2022/10/19
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[Sergey, Swati\] Report feedback? [https://docs.google.com/document/d/1vfqqFtN4Ke2JtB9O4wjoKvMChW2Ptizmsom1\_gRGauU/edit\#heading=h.eawmmxfxo8vq](https://docs.google.com/document/d/1vfqqFtN4Ke2JtB9O4wjoKvMChW2Ptizmsom1_gRGauU/edit#heading=h.eawmmxfxo8vq)
|
||||
- \[Brian\] [https://github.com/kubernetes/kubernetes/pull/113012](https://github.com/kubernetes/kubernetes/pull/113012)
|
||||
- \[Mike\] Removing COS jobs and testgrid: [https://github.com/kubernetes/test-infra/pull/27636](https://github.com/kubernetes/test-infra/pull/27636)
|
||||
|
||||
Bugs triage: 6 bugs
|
||||
|
||||
## 2022/10/12
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[Brian\] Performance tests fix update
|
||||
|
||||
## 2022/10/05
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[Sergey\] reports for the main sig node meeting
|
||||
- Perma failures, new improvements,
|
||||
- Swati Sehgal can help
|
||||
- Updated the tests tags improvements KEP: [https://github.com/kubernetes/enhancements/pull/3042](https://github.com/kubernetes/enhancements/pull/3042)
|
||||
- Please review if you are interested
|
||||
- \[Brian McQueen\] [https://github.com/kubernetes/kubernetes/issues/109295](https://github.com/kubernetes/kubernetes/issues/109295)
|
||||
- \[Sergey\] Soak test: todo add link
|
||||
|
||||
## 2022/09/21
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[Sergey\] Triage continued
|
||||
- \[Sergey\] Dashboards names and location:
|
||||
- [https://testgrid.k8s.io/sig-node-release-blocking](https://testgrid.k8s.io/sig-node-release-blocking) 1.22 is tested, 1.24 and 1.25 are not
|
||||
- [https://testgrid.k8s.io/sig-node-kubelet](https://testgrid.k8s.io/sig-node-kubelet) kubelet node conformance? Features on master?
|
||||
|
||||
I will resurrect this: [https://github.com/kubernetes/test-infra/issues/24641](https://github.com/kubernetes/test-infra/issues/24641)
|
||||
|
||||
## 2022/09/13
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[Sergey\] Mostly triage
|
||||
|
||||
## 2022/09/07
|
||||
|
||||
Attendees:
|
||||
|
||||
-
|
||||
|
||||
Agenda:
|
||||
|
||||
- (mmiranda96) [https://github.com/kubernetes/test-infra/pull/27406](https://github.com/kubernetes/test-infra/pull/27406)
|
||||
|
||||
## 2022/08/10
|
||||
|
||||
Attendees:
|
||||
|
||||
-
|
||||
|
||||
Agenda:
|
||||
|
||||
-
|
||||
|
||||
## 2022/08/03
|
||||
|
||||
Attendees:
|
||||
|
||||
-
|
||||
|
||||
Agenda:
|
||||
|
||||
-
|
||||
|
||||
## 2022/07/27 \[Cancelled due to code freeze\]
|
||||
|
||||
Attendees:
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[danielle\] containerd PidEviction tests are failing because cadvisor is reporting back 0 values, so pid eviction priority is random. Potentially a reason for others failing too. Hoping to get David to take a look because I’m unfamiliar with cadvisor’s codebase.
|
||||
|
||||
## 2022/06/29
|
||||
|
||||
Attendees:
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[paco\] [https://github.com/kubernetes/kubernetes/pull/108958](https://github.com/kubernetes/kubernetes/pull/108958) PR to fix density test on pod creation, needs approval
|
||||
- \[paco\] [https://testgrid.k8s.io/sig-node-cos\#soak-cos-gce](https://testgrid.k8s.io/sig-node-cos#soak-cos-gce) keeps failing for NPD after [https://github.com/kubernetes/kubernetes/pull/109396](https://github.com/kubernetes/kubernetes/pull/109396) was merged. (The CI is not using the latest code, if I understand the log correctly.)
|
||||
|
||||
## 2022/06/22 \[Cancelled, Zoom 2FA issues\]
|
||||
|
||||
Attendees:
|
||||
|
||||
Agenda:
|
||||
|
||||
-
|
||||
|
||||
## 2022/06/15 \[starts 15 minutes late\]
|
||||
|
||||
Attendees:
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[Sergey\] availability announcement \- Sergey out till September for baby bonding leave.
|
||||
- Triage mostly
|
||||
- Brian to take the performance test (xmcqueen)
|
||||
|
||||
- [https://github.com/kubernetes/kubernetes/issues/109621](https://github.com/kubernetes/kubernetes/issues/109621)
|
||||
|
||||
## 2022/06/08
|
||||
|
||||
Attendees:
|
||||
|
||||
Agenda:
|
||||
|
||||
- Triage mostly
|
||||
|
||||
## 2022/06/01
|
||||
|
||||
Attendees:
|
||||
|
||||
Agenda:
|
||||
\[fromani\] quick status update about [https://github.com/kubernetes/kubernetes/pull/109820](https://github.com/kubernetes/kubernetes/pull/109820) \- re-enabling device plugins tests. Mostly good news\!
|
||||
|
||||
## 2022/05/25
|
||||
|
||||
Attendees:
|
||||
|
||||
Agenda:
|
||||
|
||||
- (Vaibhav) Why are EvictionHard's imagefs.available and ImageGCHighThresholdPercent the same by default?
|
||||
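To make the question concrete: the two settings are expressed in opposite directions (free space vs. used space), so they only look different. A small worked example, assuming the documented Linux kubelet defaults of `imagefs.available<15%` for hard eviction and `imageGCHighThresholdPercent: 85`; the constant and function names are invented for this sketch:

```python
# Assumed kubelet defaults (check the kubelet configuration reference for your version).
EVICTION_HARD_IMAGEFS_AVAILABLE_PCT = 15  # evict when less than 15% of imagefs is free
IMAGE_GC_HIGH_THRESHOLD_PCT = 85          # start image GC when imagefs usage exceeds 85%


def usage_pct_at_eviction(available_pct):
    """Convert an 'available below X%' threshold into the equivalent 'used above Y%'."""
    return 100 - available_pct


eviction_usage = usage_pct_at_eviction(EVICTION_HARD_IMAGEFS_AVAILABLE_PCT)
print(f"hard eviction starts at {eviction_usage}% imagefs usage")
print(f"image GC starts at {IMAGE_GC_HIGH_THRESHOLD_PCT}% imagefs usage")
print("same trigger point:", eviction_usage == IMAGE_GC_HIGH_THRESHOLD_PCT)
```

Under these defaults both mechanisms fire at the same disk usage, which is the overlap the question is about.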
|
||||
## 2022/05/04
|
||||
|
||||
Attendees:
|
||||
![][image1]
|
||||
|
||||
Agenda:
|
||||
|
||||
- (mmiranda96) [https://github.com/kubernetes/enhancements/pull/3042](https://github.com/kubernetes/enhancements/pull/3042)
|
||||
- (danielle) [https://github.com/kubernetes/kubernetes/issues/108028](https://github.com/kubernetes/kubernetes/issues/108028)
|
||||
- David to take a look
|
||||
|
||||
## 04/27/2022
|
||||
|
||||
Attendees:
|
||||
|
||||
![][image2]
|
||||
|
||||
Agenda:
|
||||
|
||||
- [https://kubernetes.slack.com/archives/C0BP8PW9G/p1650995429992899](https://kubernetes.slack.com/archives/C0BP8PW9G/p1650995429992899)
|
||||
|
||||
\[Thread to start discussion on planning reliability/maintainability improvements\]
|
||||
|
||||
Danielle will send something to the mailing list tomorrow.
|
||||
|
||||
Francesco:
|
||||
|
||||
- Yes, good initiative and want to help
|
||||
- Adding tests and the necessary refactoring compete for the same reviewer bandwidth
|
||||
- Can we address that?
|
||||
|
||||
Danielle:
|
||||
|
||||
- This is why it was brought up at the main meeting, as we need to raise the priority of this work.
|
||||
- PR adding tests to container manager sat open for a long time as it touched code.
|
||||
- With Derek back it may be easier
|
||||
|
||||
Francesco:
|
||||
|
||||
- Maybe even statically allocate more time for this.
|
||||
|
||||
Danielle:
|
||||
|
||||
- Will be as loud as needed for this to happen. Kubelet PRs will be thoroughly reviewed for proper coverage. Improved tests are a prerequisite for new code.
|
||||
|
||||
Some areas we are lacking coverage:
|
||||
|
||||
- Failure modes testing
|
||||
- Sometimes expected behavior is not specified clearly.
|
||||
|
||||
- [https://kubernetes.slack.com/archives/C0BP8PW9G/p1651068114756659](https://kubernetes.slack.com/archives/C0BP8PW9G/p1651068114756659)
|
||||
|
||||
[@danielle](https://kubernetes.slack.com/team/U8C4ZRN83) could you please see if we can find people who can look into and fix problems in SIG-node related CI jobs (guessing at the next Node CI meeting perhaps?) There's a bunch of CRI jobs there too (cc [@mrunal](https://kubernetes.slack.com/team/U1A24MU2Z) [@sascha](https://kubernetes.slack.com/team/U53SUDBD4))
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-eviction](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-eviction)
|
||||
|
||||
The problem is mostly known; how to fix it is not.
|
||||
|
||||
The problem is tests interacting with each other and with things on the host. For the disk tests, maybe use a separate disk.
|
||||
|
||||
|
||||
The tests try to fill up the whole disk, but a 30Gi disk is very slow to fill, so the timeout may be caused by this.
|
||||
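A back-of-the-envelope estimate shows why disk size matters for the timeout. The throughput figure below is purely an assumption for illustration (no measurement from the CI hosts is recorded in these notes), and the 512Mi comparison size is likewise hypothetical:

```python
MIB_PER_GIB = 1024

# Hypothetical sustained write throughput; real CI disks may be faster or slower.
ASSUMED_THROUGHPUT_MIB_PER_S = 100


def fill_seconds(size_mib, throughput_mib_per_s):
    """Time needed to write `size_mib` at a constant throughput."""
    return size_mib / throughput_mib_per_s


full_disk = fill_seconds(30 * MIB_PER_GIB, ASSUMED_THROUGHPUT_MIB_PER_S)
small_volume = fill_seconds(512, ASSUMED_THROUGHPUT_MIB_PER_S)
print(f"30Gi disk: {full_disk:.0f}s, 512Mi volume: {small_volume:.1f}s")
```

Under this assumption the full disk takes minutes to fill while a small dedicated volume takes seconds, which is the motivation for pointing the disk-pressure tests at a smaller filesystem.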
|
||||
|
||||
David will leave a comment on this ^^^.
|
||||
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-performance-test](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-performance-test)
|
||||
|
||||
[https://github.com/kubernetes/kubernetes/pull/109551](https://github.com/kubernetes/kubernetes/pull/109551)
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-cos\#e2e-cos-flaky](https://testgrid.k8s.io/sig-node-cos#e2e-cos-flaky)
|
||||
|
||||
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-cos\#soak-cos-gce](https://testgrid.k8s.io/sig-node-cos#soak-cos-gce)
|
||||
|
||||
https://github.com/kubernetes/kubernetes/pull/109396
|
||||
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-cri-o\#ci-crio-cgroupv1-node-e2e-alpha](https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-node-e2e-alpha)
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-cri-o\#ci-crio-cgroupv1-node-e2e-eviction](https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-node-e2e-eviction)
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-cri-o\#ci-crio-cgroupv1-node-e2e-flaky](https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-node-e2e-flaky)
|
||||
|
||||
Peter will try to take a look \- after two weeks will have more time.
|
||||
|
||||
\+fromanirh will also take a look
|
||||
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora-serial)
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu-serial)
|
||||
|
||||
There is an open PR for these
|
||||
|
||||
## 04/20/2022
|
||||
|
||||
Attendees:
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[ehashman\] Announcements
|
||||
- Elana taking a break during 1.25, stepping down as CI subproject lead
|
||||
- Nominating Danielle to step up as new lead
|
||||
- \[Sergey\] [https://storage.googleapis.com/k8s-metrics/failures-latest.json](https://storage.googleapis.com/k8s-metrics/failures-latest.json)
|
||||
- Arnaud: focus on jobs failing for more than a year
|
||||
- Everything under 90 days is not relevant
|
||||
- \[Sergey\] [https://github.com/kubernetes-sigs/cri-tools/pull/914\#issuecomment-1101102538](https://github.com/kubernetes-sigs/cri-tools/pull/914#issuecomment-1101102538)
|
||||
- \[Sergey\] [https://github.com/kubernetes/kubernetes/pull/109472](https://github.com/kubernetes/kubernetes/pull/109472)
|
||||
- [https://github.com/kubernetes/test-infra/pull/26000](https://github.com/kubernetes/test-infra/pull/26000)
|
||||
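Arnaud's suggestion for the failures report (focus on jobs failing for more than a year, ignore anything under 90 days) amounts to a simple filter. A sketch, assuming the JSON maps each job name to an object with a `failing_days` field; that layout, the sample job names, and the function name are assumptions, so check the real file before relying on this:

```python
import json

# Hypothetical sample in the assumed shape of failures-latest.json.
SAMPLE = json.loads("""
{
  "ci-example-job-a": {"failing_days": 412},
  "ci-example-job-b": {"failing_days": 35},
  "ci-example-job-c": {"failing_days": 730}
}
""")


def long_failing(failures, min_days=365):
    """Return (job, failing_days) pairs failing for more than `min_days`, worst first."""
    rows = [
        (job, info["failing_days"])
        for job, info in failures.items()
        if info["failing_days"] > min_days
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for job, days in long_failing(SAMPLE):
    print(f"{job}: failing for {days} days")
```

Passing `min_days=90` gives the wider list that the notes consider still relevant.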
|
||||
## 04/13/2022 \[Canceled due to lack of quorum and being in test freeze\]
|
||||
|
||||
Please help with the release blocking: [https://github.com/kubernetes/kubernetes/issues/109082](https://github.com/kubernetes/kubernetes/issues/109082) if you have cycles\!
|
||||
|
||||
## 04/06/2022
|
||||
|
||||
Attendees:
|
||||
![][image3]
|
||||
|
||||
\[arnaud\]: Switch to registry.k8s.io for cri-o:
|
||||
[https://github.com/cri-o/cri-o/pull/5777](https://github.com/cri-o/cri-o/pull/5777)
|
||||
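For anyone auditing job configs for this migration, the change is a registry prefix swap on image references. A minimal sketch, assuming the legacy registry being replaced is `k8s.gcr.io` (not named in these notes) and using a hypothetical helper name:

```python
OLD_REGISTRY = "k8s.gcr.io/"        # assumed legacy registry prefix
NEW_REGISTRY = "registry.k8s.io/"   # target registry named in the notes


def migrate_image(ref):
    """Rewrite an image reference to the new registry; leave other images untouched."""
    if ref.startswith(OLD_REGISTRY):
        return NEW_REGISTRY + ref[len(OLD_REGISTRY):]
    return ref


for image in ["k8s.gcr.io/pause:3.6", "docker.io/library/busybox:1.35"]:
    print(image, "->", migrate_image(image))
```

Matching on the prefix rather than doing a blanket string replace avoids touching references that merely contain the old hostname elsewhere in the string.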
|
||||
## 03/30/2022
|
||||
|
||||
Attendees:
|
||||
![][image4]
|
||||
|
||||
- \[arnaud\] Migrate away from custom images: [https://github.com/kubernetes/k8s.io/issues/1527](https://github.com/kubernetes/k8s.io/issues/1527)
|
||||
- Likely not used by us, but Danielle will check
|
||||
- [https://kubernetes.slack.com/archives/C7J9RP96G/p1648564612846839?thread\_ts=1648557863.010859\&cid=C7J9RP96G](https://kubernetes.slack.com/archives/C7J9RP96G/p1648564612846839?thread_ts=1648557863.010859&cid=C7J9RP96G)
|
||||
- Perf regression: [http://perf-dash.k8s.io/\#/?jobname=gce-100Nodes-master\&metriccategoryname=E2E\&metricname=LoadResources\&PodName=e2e-big-minion-group%2Fkubelet\&Resource=CPU](http://perf-dash.k8s.io/#/?jobname=gce-100Nodes-master&metriccategoryname=E2E&metricname=LoadResources&PodName=e2e-big-minion-group%2Fkubelet&Resource=CPU) seems to be taken care of here: [https://kubernetes.slack.com/archives/C09QZTRH7/p1648636053781389](https://kubernetes.slack.com/archives/C09QZTRH7/p1648636053781389)
|
||||
|
||||
## 03/23/2022
|
||||
|
||||
Attendees:
|
||||
![][image5]
|
||||
|
||||
\- \[arnaud\] The Great Migration to registry.k8s.io
|
||||
\- FYI : change containerd config:
|
||||
[https://github.com/kubernetes/test-infra/pull/25739](https://github.com/kubernetes/test-infra/pull/25739)
|
||||
[https://github.com/kubernetes/test-infra/pull/25742](https://github.com/kubernetes/test-infra/pull/25742)
|
||||
I’ll not be around for the meeting. If you see any failures related to this change, please revert\!
|
||||
\- \[mmiranda96\] Need review on [https://github.com/kubernetes/kubernetes/pull/108862](https://github.com/kubernetes/kubernetes/pull/108862)
|
||||
\- Fedora’s job is passing now ([https://github.com/kubernetes/kubernetes/issues/104292\#issuecomment-1074417968](https://github.com/kubernetes/kubernetes/issues/104292#issuecomment-1074417968))
|
||||
|
||||
## 03/16/2022
|
||||
|
||||
Attendees:
|
||||
![][image6]
|
||||
|
||||
- \[ipochi\] Next steps on getting the lock contention flags to KubeletConfiguration [PR](https://github.com/kubernetes/kubernetes/pull/104302) merged.
|
||||
- \[arnaud\] The Great migration to registry.k8s.io
|
||||
- [https://github.com/kubernetes/k8s.io/issues/3411](https://github.com/kubernetes/k8s.io/issues/3411)
|
||||
- How can I test the change of default sandbox image: [https://github.com/kubernetes/kubernetes/blob/f9be590b25abf3921ffffc2a6b31e941ad9ab8fc/cmd/kubelet/app/options/container\_runtime.go\#L26](https://github.com/kubernetes/kubernetes/blob/f9be590b25abf3921ffffc2a6b31e941ad9ab8fc/cmd/kubelet/app/options/container_runtime.go#L26)
|
||||
- File a bug to investigate what changed the kubelet CPU/memory usage
|
||||
|
||||
[http://perf-dash.k8s.io/\#/?jobname=gce-100Nodes-master\&metriccategoryname=E2E\&metricname=LoadResources\&PodName=e2e-big-minion-group%2Fkubelet\&Resource=CPU](http://perf-dash.k8s.io/#/?jobname=gce-100Nodes-master&metriccategoryname=E2E&metricname=LoadResources&PodName=e2e-big-minion-group%2Fkubelet&Resource=CPU)
|
||||
|
||||
![][image7]
|
||||
[http://perf-dash.k8s.io/\#/?jobname=gce-100Nodes-master\&metriccategoryname=E2E\&metricname=LoadResources\&PodName=e2e-big-minion-group%2Fkubelet\&Resource=memory](http://perf-dash.k8s.io/#/?jobname=gce-100Nodes-master&metriccategoryname=E2E&metricname=LoadResources&PodName=e2e-big-minion-group%2Fkubelet&Resource=memory)
|
||||
|
||||
![][image8]
|
||||
|
||||
## 03/09/2022 \[Cancelled\]
|
||||
|
||||
## 03/02/2022
|
||||
|
||||
Do we have an issue tracking this failure?
|
||||
[https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-performance-test](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-performance-test)
|
||||
still fails even though [https://github.com/kubernetes/test-infra/pull/25385](https://github.com/kubernetes/test-infra/pull/25385) is merged
|
||||
[https://github.com/kubernetes/test-infra/issues/25430](https://github.com/kubernetes/test-infra/issues/25430)
|
||||
|
||||
Infra flakes a lot, do we have a bug?
|
||||
Kubernetes Presubmits blocking
|
||||
[https://testgrid.k8s.io/presubmits-kubernetes-blocking](https://testgrid.k8s.io/presubmits-kubernetes-blocking)
|
||||
|
||||
Kubelet memory increase:
|
||||
|
||||
![][image9]
|
||||
|
||||
![][image10]
|
||||
2022-02-24 UTC start of the spike
|
||||
|
||||
## 02/23/2022
|
||||
|
||||
- \[matthyx\] Remaining presubmits using dockershim [kubernetes/test-infra/issues/24620](https://github.com/kubernetes/test-infra/issues/24620#issuecomment-1012938110)
|
||||
- continue cleaning them, separate PR for gce
|
||||
|
||||
## 02/16/2022
|
||||
|
||||
- \[ipochi\] Bring back the test job for lock contention tests which was removed earlier.
|
||||
[kubernetes/test-infra\#23243](https://github.com/kubernetes/test-infra/pull/23243).
|
||||
Next steps on [kubernetes/kubernetes\#104334](https://github.com/kubernetes/kubernetes/pull/104334)
|
||||
|
||||
|
||||
## 02/09/2022
|
||||
|
||||
- \[Sergey\] [https://github.com/kubernetes/contributor-site/pull/288](https://github.com/kubernetes/contributor-site/pull/288)
|
||||
- Please send comments publicly or privately by EOW at the latest (Feb. 11\)
|
||||
- \[danielle\] Starting to add unit tests to various parts of pkg/kubelet/… [https://github.com/kubernetes/kubernetes/pull/108024](https://github.com/kubernetes/kubernetes/pull/108024)
|
||||
- Reviews appreciated\!
|
||||
- \[arnaud\] just need a LGTM/Approval : [https://github.com/kubernetes/test-infra/pull/25080](https://github.com/kubernetes/test-infra/pull/25080)
|
||||
- Done
|
||||
- **Action:** Remove sig-node-critical tab from testgrid \- only 2 jobs there, not a useful view
|
||||
- Gcloud auth login failed:
|
||||
- Tests failing to run, e.g. [https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-node-e2e/1491454142533603328](https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-containerd-node-e2e/1491454142533603328)
|
||||
- Seems critical-urgent
|
||||
|
||||
Status of CI: [https://docs.google.com/spreadsheets/d/1IwONkeXSc2SG\_EQMYGRSkfiSWNk8yWLpVhPm-LOTbGM/edit\#gid=1187923038](https://docs.google.com/spreadsheets/d/1IwONkeXSc2SG_EQMYGRSkfiSWNk8yWLpVhPm-LOTbGM/edit#gid=1187923038)
|
||||
|
||||
## 02/02/2022
|
||||
|
||||
- \[Mike\] [https://github.com/kubernetes/kubernetes/pull/107913](https://github.com/kubernetes/kubernetes/pull/107913)
|
||||
- \[Danielle\] Is looking at eviction tests.
|
||||
- for the disk pressure, might create a small tmpfs that's easier to fill up
|
||||
- Perf dashboard: see a small bump in runtime memory; let's check next week whether it follows the same pattern and goes back down
|
||||
|
||||
![][image11]
|
||||
|
||||
## 01/26/2022
|
||||
|
||||
- \[Mike\]: [https://github.com/kubernetes/kubernetes/pull/107768](https://github.com/kubernetes/kubernetes/pull/107768)
|
||||
- Wouldn't start:
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#containerd-e2e-ubuntu](https://testgrid.k8s.io/sig-node-containerd#containerd-e2e-ubuntu)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#containerd-node-e2e-1.4](https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-1.4)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#e2e-cos-device-plugin-gpu](https://testgrid.k8s.io/sig-node-containerd#e2e-cos-device-plugin-gpu)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#e2e-ubuntu](https://testgrid.k8s.io/sig-node-containerd#e2e-ubuntu)
|
||||
- Create a bug and Danielle will take a look
|
||||
- [https://github.com/kubernetes/kubernetes/issues/107800](https://github.com/kubernetes/kubernetes/issues/107800)
|
||||
- Separate issue, partially failing:
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#containerd-node-e2e-features-1.4](https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-features-1.4)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#containerd-node-e2e-features-1.5](https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-features-1.5)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv2-containerd-e2e](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-e2e)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/107801](https://github.com/kubernetes/kubernetes/issues/107801)
|
||||
|
||||
## 01/19/2022
|
||||
|
||||
- Arnaud: [https://github.com/kubernetes/test-infra/issues/23822](https://github.com/kubernetes/test-infra/issues/23822)
|
||||
- Moving to Boskos is blocked by: [https://github.com/kubernetes/test-infra/issues/24798](https://github.com/kubernetes/test-infra/issues/24798)
|
||||
- Arnaud will take a look at the SSH failure for CRI-O jobs.
|
||||
- Testgrid analysis notes:
|
||||
- CRI-O alpha tab is running conformance
|
||||
- No Flaky tab for containerd
|
||||
|
||||
## 01/12/2022
|
||||
|
||||
- Test naming consistency
|
||||
- Discuss [https://github.com/kubernetes/test-infra/issues/24641](https://github.com/kubernetes/test-infra/issues/24641)
|
||||
- Don’t give the impression that we test every platform
|
||||
- E.g. don’t include the OS name when it is not important
|
||||
- Container runtime?
|
||||
- \[Elana\] Container runtime is important to know, at least two container runtimes need to be tested.
|
||||
- How do we decide which two to test?
|
||||
- \[Sergey\] Maybe runtime included in names in sig-node tab, but not in release blocking tests
|
||||
- Scheme of the name needs to be aligned
|
||||
- E2e vs. node\_e2e, OS, container runtime, container runtime version, what is being tested (features, serial, unlabeled, etc.), specific feature name (hugepages, etc.), “ci-” prefix to indicate build/deploy, “pull-”, “pr-” for presubmits (do we need to merge to one prefix?)
|
||||
- Do we need many versions of Containerd tested? If not, where will those tests run?
|
||||
- Distinguish e2e/node and e2e\_node and if so, how to name those?
|
||||
- \[Elana\] Use node-only for when we don't spin a full cluster?
|
||||
- Tao: propose to use [Kubernetes SIG-Node CI Testgrid Tracker](https://docs.google.com/spreadsheets/d/1IwONkeXSc2SG_EQMYGRSkfiSWNk8yWLpVhPm-LOTbGM/edit#gid=0) to track the test grid.
|
||||
- \[Sergey\] maybe a single tab? To track continuation
|
||||
- \[Elana\] perhaps push vs. pull to get the status
|
||||
- \[Mike\] tab-to-test mapping is important; it also provides a longer history
|
||||
- \[David\] also helpful to understand if anybody is working on issues.
|
||||
- \[Tao\] it may be a short term solution unless the test grid becomes very stable.
|
||||
- \[Elana\] engage with sig testing?
|
||||
- \[Artyom\] who will update the sheet?
|
||||
- \[Sergey\] [https://github.com/kubernetes/kubernetes/issues/107469](https://github.com/kubernetes/kubernetes/issues/107469)
|
||||
- GPU tests gce-device-plugin-gpu-master? Were tests removed?
|
||||
- \[matthyx\] will open an [issue](https://github.com/kubernetes/test-infra/issues/24851) for sig-storage regarding the same problem.
|
||||
- \[mmiranda96\] [https://github.com/kubernetes/kubernetes/issues/107412](https://github.com/kubernetes/kubernetes/issues/107412)
|
||||
- \[matthyx\] [https://github.com/kubernetes/test-infra/pull/24793/files](https://github.com/kubernetes/test-infra/pull/24793/files) Was it enough?
|
||||
- Reopen issues and see what else needs removal/migration
|
||||
- Test grid review (copied from the last meeting):
|
||||
- Remove:
|
||||
- [https://testgrid.k8s.io/sig-node-kubelet\#conformance-node-rhel](https://testgrid.k8s.io/sig-node-kubelet#conformance-node-rhel)
|
||||
- [https://testgrid.k8s.io/sig-node-kubelet\#conformance-node-containerized-rhel](https://testgrid.k8s.io/sig-node-kubelet#conformance-node-containerized-rhel)
|
||||
- [https://github.com/kubernetes/test-infra/pull/24852](https://github.com/kubernetes/test-infra/pull/24852)
|
||||
- Cgroupv2: [https://github.com/kubernetes/kubernetes/issues/107062](https://github.com/kubernetes/kubernetes/issues/107062): [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv2-containerd-node-e2e-serial](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-node-e2e-serial)
|
||||
- Benchmark: [https://testgrid.k8s.io/sig-node-containerd\#node-e2e-benchmark](https://testgrid.k8s.io/sig-node-containerd#node-e2e-benchmark) find owner/file bug: [https://github.com/kubernetes/kubernetes/issues/36621](https://github.com/kubernetes/kubernetes/issues/36621)
|
||||
- Eviction: [https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-eviction](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-eviction) [https://github.com/kubernetes/kubernetes/issues/107063](https://github.com/kubernetes/kubernetes/issues/107063)
|
||||
- [https://testgrid.k8s.io/sig-node-cri-o\#node-kubelet-serial-crio](https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-serial-crio)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/107343](https://github.com/kubernetes/kubernetes/issues/107343)
|
||||
- [https://testgrid.k8s.io/sig-node-cri-o\#ci-crio-cgroupv1-node-e2e-eviction](https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-node-e2e-eviction)
|
||||
- Ask COS (@bsdnet):
|
||||
- [https://testgrid.k8s.io/sig-node-cos\#soak-cos-gce](https://testgrid.k8s.io/sig-node-cos#soak-cos-gce)
|
||||
- [https://testgrid.k8s.io/sig-node-cos\#e2e-cos-flaky](https://testgrid.k8s.io/sig-node-cos#e2e-cos-flaky)
|
||||
- NPD: [https://testgrid.k8s.io/sig-node-node-problem-detector\#ci-npd-e2e-node](https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-e2e-node) [https://github.com/kubernetes/kubernetes/issues/107067](https://github.com/kubernetes/kubernetes/issues/107067)
|
||||
-
|
||||
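The naming-scheme discussion in this meeting (prefix, e2e vs. node e2e, OS, container runtime and version, what is being tested, specific feature) could be captured as a small structure that renders names consistently. Everything below is hypothetical: the field order, the separator, and the rule of omitting unimportant components are one possible reading of the discussion, not an agreed convention.

```python
from dataclasses import dataclass


@dataclass
class JobName:
    """Hypothetical components of a SIG-Node CI job name."""
    prefix: str                # "ci" for periodic jobs, "pull" for presubmits
    scope: str                 # "e2e" (full cluster) or "node-e2e" (node only)
    runtime: str = ""          # e.g. "containerd", "crio"
    runtime_version: str = ""  # only when several versions are tested
    os: str = ""               # omitted when the OS is not important
    suite: str = ""            # e.g. "serial", "features"
    feature: str = ""          # e.g. "hugepages"

    def render(self):
        parts = [self.prefix, self.runtime, self.runtime_version,
                 self.os, self.scope, self.suite, self.feature]
        # Skip empty components so unimportant details do not leak into the name.
        return "-".join(part for part in parts if part)


print(JobName("ci", "node-e2e", runtime="containerd", suite="serial").render())
print(JobName("pull", "node-e2e", runtime="crio", suite="features", feature="hugepages").render())
```

Generating names from one structure would also make it easier to merge the "pull-" and "pr-" prefixes if the group decides to do so.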
|
||||
|
||||
## 01/05/2022
|
||||
|
||||
- No alternative time.
|
||||
- (imran) \- Lock file contention tests. Refresh why it was added, discuss plans to move forward to completion.
|
||||
Last comment after addressing the review feedback \- [https://github.com/kubernetes/kubernetes/pull/104334\#discussion\_r760761261](https://github.com/kubernetes/kubernetes/pull/104334#discussion_r760761261)
|
||||
- Presubmit jobs change after dockershim removal:
|
||||
- Maybe delete all and only bring back those we need
|
||||
- [https://github.com/kubernetes/test-infra/issues/24620](https://github.com/kubernetes/test-infra/issues/24620)
|
||||
- Should we make a goal for 1.24:
|
||||
- migration to kubetest2: [https://github.com/kubernetes/enhancements/issues/2464\#issuecomment-1013490836](https://github.com/kubernetes/enhancements/issues/2464#issuecomment-1013490836)
|
||||
- Migration to common pool of projects: [https://github.com/kubernetes/test-infra/issues/7769](https://github.com/kubernetes/test-infra/issues/7769)
|
||||
- Test grid review:
|
||||
- AppArmor tests failing [https://github.com/kubernetes/kubernetes/issues/107342](https://github.com/kubernetes/kubernetes/issues/107342)
|
||||
- [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-ubuntu](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu) file a bug
|
||||
- Maybe related to ^^^ [https://testgrid.k8s.io/sig-node-containerd\#containerd-node-features](https://testgrid.k8s.io/sig-node-containerd#containerd-node-features)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv2-containerd-node-e2e](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-node-e2e)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#image-validation-node-features](https://testgrid.k8s.io/sig-node-containerd#image-validation-node-features)
|
||||
- Remove:
|
||||
- [https://testgrid.k8s.io/sig-node-kubelet\#conformance-node-rhel](https://testgrid.k8s.io/sig-node-kubelet#conformance-node-rhel)
|
||||
- [https://testgrid.k8s.io/sig-node-kubelet\#conformance-node-containerized-rhel](https://testgrid.k8s.io/sig-node-kubelet#conformance-node-containerized-rhel)
|
||||
- Cgroupv2: [https://github.com/kubernetes/kubernetes/issues/107062](https://github.com/kubernetes/kubernetes/issues/107062): [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv2-containerd-node-e2e-serial](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-node-e2e-serial)
|
||||
- Benchmark: [https://testgrid.k8s.io/sig-node-containerd\#node-e2e-benchmark](https://testgrid.k8s.io/sig-node-containerd#node-e2e-benchmark) find owner/file bug: [https://github.com/kubernetes/kubernetes/issues/36621](https://github.com/kubernetes/kubernetes/issues/36621)
|
||||
- Eviction: [https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-eviction](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-eviction) [https://github.com/kubernetes/kubernetes/issues/107063](https://github.com/kubernetes/kubernetes/issues/107063)
|
||||
- [https://testgrid.k8s.io/sig-node-cri-o\#node-kubelet-serial-crio](https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-serial-crio)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/107343](https://github.com/kubernetes/kubernetes/issues/107343)
|
||||
- [https://testgrid.k8s.io/sig-node-cri-o\#ci-crio-cgroupv1-node-e2e-eviction](https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-node-e2e-eviction)
|
||||
- Ask COS (@bsdnet):
|
||||
- [https://testgrid.k8s.io/sig-node-cos\#soak-cos-gce](https://testgrid.k8s.io/sig-node-cos#soak-cos-gce)
|
||||
- [https://testgrid.k8s.io/sig-node-cos\#e2e-cos-flaky](https://testgrid.k8s.io/sig-node-cos#e2e-cos-flaky)
|
||||
- NPD: [https://testgrid.k8s.io/sig-node-node-problem-detector\#ci-npd-e2e-node](https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-e2e-node) [https://github.com/kubernetes/kubernetes/issues/107067](https://github.com/kubernetes/kubernetes/issues/107067)
|
||||
-
|
||||
-
|
|
@ -0,0 +1,646 @@
|
|||
# Kubernetes SIG-Node CI subgroup notes
|
||||
|
||||
## 2023/12/06
|
||||
|
||||
Recording: [https://youtu.be/TFp7tv72854](https://youtu.be/TFp7tv72854)
|
||||
Hosts:
|
||||
|
||||
- Tests: mmiranda96
|
||||
- Bugs:
|
||||
|
||||
Thaw will start Dec 13th.
|
||||
Next meeting on Jan 3rd.
|
||||
|
||||
## 2023/11/29
|
||||
|
||||
Recording: [https://youtu.be/utMfJzEBcvQ](https://youtu.be/utMfJzEBcvQ)
|
||||
|
||||
- Hosts:
|
||||
- Tests: Sergey
|
||||
- Bugs: ndixita
|
||||
- New failure, possible regression: E2eNode Suite.\[It\] \[sig-node\] \[Serial\] Containers Lifecycle should restart the containers in right order after the node reboot
|
||||
- \[ruiwen\] [https://github.com/kubernetes/kubernetes/pull/122095](https://github.com/kubernetes/kubernetes/pull/122095)
|
||||
- \[harshal\] [https://github.com/kubernetes/kubernetes/issues/121349\#issuecomment-1813029991](https://github.com/kubernetes/kubernetes/issues/121349#issuecomment-1813029991) closer to the root cause
|
||||
- Possibly need a bug for this: [https://testgrid.k8s.io/sig-node-presubmits\#pr-crio-cgroupv2-imagefs-e2e-diskpressure](https://testgrid.k8s.io/sig-node-presubmits#pr-crio-cgroupv2-imagefs-e2e-diskpressure)
|
||||
|
||||
## 2023/11/15
|
||||
|
||||
Recording [https://youtu.be/aoT5PnEBdus](https://youtu.be/aoT5PnEBdus)
|
||||
|
||||
- Hosts:
|
||||
- Tests: mkmir
|
||||
- Bugs: Sergey
|
||||
- Mostly triage
|
||||
- [https://kubernetes.slack.com/archives/CN0K3TE2C/p1700060944364139](https://kubernetes.slack.com/archives/CN0K3TE2C/p1700060944364139)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/121349](https://github.com/kubernetes/kubernetes/issues/121349)
|
||||
|
||||
## 2023/11/01
|
||||
|
||||
Recording: [https://youtu.be/PclxEBV1awI](https://youtu.be/PclxEBV1awI)
|
||||
Hosts:
|
||||
|
||||
- Tests:
|
||||
- Bugs: ndixita
|
||||
|
||||
Agenda:
|
||||
|
||||
* ~~Kubelet/CRIO/Containerd logs are missing from jobs that have been migrated to community cluster~~
|
||||
* [~~https://github.com/kubernetes/kubernetes/issues/121444~~](https://github.com/kubernetes/kubernetes/issues/121444)
|
||||
* [https://github.com/kubernetes/kubernetes/pull/119496\#issuecomment-1653172666](https://github.com/kubernetes/kubernetes/pull/119496#issuecomment-1653172666)
|
||||
* Fedora swap job
|
||||
* [https://github.com/kubernetes/kubernetes/pull/121671](https://github.com/kubernetes/kubernetes/pull/121671)
|
||||
* Job config looks broken:
|
||||
* [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv2-containerd-e2e\&width=5](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-e2e&width=5) (check with mmiranda96)
|
||||
* [https://github.com/kubernetes/kubernetes/issues/121309](https://github.com/kubernetes/kubernetes/issues/121309)
|
||||
* [https://testgrid.k8s.io/sig-node-containerd\#e2e-cos-device-plugin-gpu\&width=5](https://testgrid.k8s.io/sig-node-containerd#e2e-cos-device-plugin-gpu&width=5)
|
||||
*
|
||||
* Busted job config:
|
||||
* [https://testgrid.k8s.io/sig-node-node-problem-detector\#ci-npd-e2e-node\&width=5](https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-e2e-node&width=5)
|
||||
|
||||
## 2023/10/25 \[Cancelled\]
|
||||
|
||||
Canceled due to host availability. Review PRs for freeze next week.
|
||||
|
||||
## 2023/10/18
|
||||
|
||||
Recording:
|
||||
|
||||
Hosts:
|
||||
|
||||
- Tests: ndixita
|
||||
- Bugs: kannon92
|
||||
|
||||
Agenda:
|
||||
|
||||
* Issue link for [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora-serial) failures
|
||||
* [https://github.com/kubernetes/kubernetes/issues/121220](https://github.com/kubernetes/kubernetes/issues/121220) for existing issue related to Fedora
|
||||
* \[Tzneal\] For the OOM Swap issue, there is a discussion on \#sig-node-swap that I think is related: [https://kubernetes.slack.com/archives/C02UCH9N02J/p1695207853069709](https://kubernetes.slack.com/archives/C02UCH9N02J/p1695207853069709)
|
||||
* Create an issue for PriorityPidEvictionOrdering
|
||||
*
|
||||
* Issue link for containerd-cgroupv1, GLIBC\_2.34 not found
|
||||
* [https://github.com/kubernetes/kubernetes/issues/121309](https://github.com/kubernetes/kubernetes/issues/121309)
|
||||
- Remind mike about COS and dbus
|
||||
- \[Sergey\] found the right people to look at it
|
||||
- Reach out to sig testing: [https://github.com/kubernetes/kubernetes/issues/121220](https://github.com/kubernetes/kubernetes/issues/121220)
|
||||
|
||||
Issues Created for bugs:
|
||||
|
||||
\- [https://github.com/kubernetes/kubernetes/issues/121309](https://github.com/kubernetes/kubernetes/issues/121309)
|
||||
\- [https://github.com/kubernetes/kubernetes/issues/121220](https://github.com/kubernetes/kubernetes/issues/121220)
|
||||
\- [https://github.com/kubernetes/kubernetes/issues/121124](https://github.com/kubernetes/kubernetes/issues/121124)
|
||||
\- [https://github.com/kubernetes/node-problem-detector/issues/831](https://github.com/kubernetes/node-problem-detector/issues/831)
|
||||
\-
|
||||
|
||||
PR to fix jobs due to configuration:
|
||||
\- [https://github.com/kubernetes/test-infra/pull/31054](https://github.com/kubernetes/test-infra/pull/31054)
|
||||
\- https://github.com/kubernetes/test-infra/pull/31051
|
||||
|
||||
## 2023/10/11
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=MZNSFlJGnMw](https://www.youtube.com/watch?v=MZNSFlJGnMw)
|
||||
Hosts:
|
||||
|
||||
- Tests: mmiranda96
|
||||
- Bugs: ndixita
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[mmiranda96\]: file an issue for swap Fedora job
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-hugepages](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-hugepages)
|
||||
- [https://testgrid.k8s.io/sig-node-cri-o\#ci-crio-cgroupv1-node-e2e-hugepages](https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-node-e2e-hugepages)
|
||||
- [https://testgrid.k8s.io/sig-node-node-problem-detector\#ci-npd-e2e-kubernetes-gce-gci](https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-e2e-kubernetes-gce-gci)
|
||||
- [https://testgrid.k8s.io/sig-node-node-problem-detector\#ci-npd-e2e-kubernetes-gce-gci-custom-flags](https://testgrid.k8s.io/sig-node-node-problem-detector#ci-npd-e2e-kubernetes-gce-gci-custom-flags)
|
||||
- Kubelet behavior change in 1.27 related to multiple containers in Pod
|
||||
|
||||
## 2023/10/04
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=yx7iXyohVDU](https://www.youtube.com/watch?v=yx7iXyohVDU)
|
||||
|
||||
Hosts:
|
||||
|
||||
- Tests: mmiranda96
|
||||
- Bugs: skanzhelev
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[Kevin Hannon\] Presubmit Jobs and crio-community-cluster migration
|
||||
- [Reducing flakes in the periodics tests will become easier if it is possible to kick off tests as presubmits](https://docs.google.com/document/d/1QrLBy-v6B3sits0Tu5uqQFHkofejorcqw73KLxQVWsg/edit?usp=sharing)
|
||||
- mmiranda96: Create issue for fixing sig-node-containerd containerd-branch jobs (change branch name).
|
||||
-
|
||||
|
||||
## 2023/09/27
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=ehD9x7mYcIQ](https://www.youtube.com/watch?v=ehD9x7mYcIQ)
|
||||
Hosts:
|
||||
|
||||
- Tests: ndixita
|
||||
- Bugs: mmiranda96
|
||||
|
||||
|
||||
Agenda:
|
||||
|
||||
- **\[Kevin Hannon** [9:36 AM](https://kubernetes.slack.com/archives/C05KQLJEWHX/p1695659778256519)\]
|
||||
It looks like there was a large refactor of the node jobs to drop bootstrap.py. I'd suggest for our weekly meeting to look over the sig-node jobs and make sure they are all running.
|
||||
[https://github.com/kubernetes/kubernetes/pull/120831\#issuecomment-1732281761](https://github.com/kubernetes/kubernetes/pull/120831#issuecomment-1732281761)
|
||||
Is a list of some jobs that are failing
|
||||
|
||||
|
||||
- Topology manager
|
||||
- Swati to look into why tests are being skipped: TODO
|
||||
- Issue: [https://github.com/kubernetes/kubernetes/issues/120725](https://github.com/kubernetes/kubernetes/issues/120725)
|
||||
- Previous Issue where we were tracking Resource Manager testing on multi-numa: [https://github.com/kubernetes/kubernetes/issues/119601](https://github.com/kubernetes/kubernetes/issues/119601) . We had fixed tests to run on multi-NUMA nodes ([https://github.com/kubernetes/test-infra/pull/30545](https://github.com/kubernetes/test-infra/pull/30545) and [https://github.com/kubernetes/test-infra/pull/30629](https://github.com/kubernetes/test-infra/pull/30629))
|
||||
- Swap:
|
||||
- [https://github.com/kubernetes/kubernetes/pull/120139](https://github.com/kubernetes/kubernetes/pull/120139)
|
||||
- Create an issue for [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora-serial)
|
||||
- https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv2-containerd-node-e2e
|
||||
- GracefulNodeShutdownOnPodPriority failing tests: [https://github.com/kubernetes/kubernetes/issues/120726](https://github.com/kubernetes/kubernetes/issues/120726)
|
||||
- [https://testgrid.k8s.io/sig-node-cri-o\#node-kubelet-serial-crio](https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-serial-crio)
|
||||
- kubelet-serial-containerd: [https://github.com/kubernetes/kubernetes/issues/120913](https://github.com/kubernetes/kubernetes/issues/120913)
|
||||
-
|
||||
|
||||
## 2023/09/20
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=pNCzeo3J7oQ](https://www.youtube.com/watch?v=pNCzeo3J7oQ)
|
||||
|
||||
Hosts:
|
||||
|
||||
- Tests: mmiranda96
|
||||
- Bugs: ndixita
|
||||
|
||||
Agenda:
|
||||
|
||||
* \[kannon92\] Investigating eviction test cases
|
||||
* Notice there is no test coverage for eviction with a separate image fs
|
||||
* Can we consider adding an e2e test for crio/containerd eviction with a separate image fs?
|
||||
* Could use some help on how to do this
|
||||
* \[kannon92\]
|
||||
* bootstrap.py issue caused problems with prow jobs (not just the periodics but also the blocking PRs)
|
||||
* [https://github.com/kubernetes/test-infra/issues/30759](https://github.com/kubernetes/test-infra/issues/30759)
|
||||
* Should consider migrating the required presubmits to use decorate
|
||||
* \[fromani\] totally\! [https://github.com/kubernetes/kubernetes/issues/120609](https://github.com/kubernetes/kubernetes/issues/120609) (perhaps should have been filed against test-infra?)
|
||||
|
||||
## 2023/09/13
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=kk2GCdZRJBs](https://www.youtube.com/watch?v=kk2GCdZRJBs)
|
||||
Hosts:
|
||||
|
||||
- Tests: ndixita
|
||||
- Bugs: mmiranda96
|
||||
|
||||
Agenda:
|
||||
|
||||
* \[ndixita\] Triage OOM Kill test in [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu-serial)
|
||||
* \[ndixita\] Create issue for cos-cgroupv1-containerd-node-e2e-serial and cgroupv2 failing job
|
||||
* \[mmiranda96\] fixing [https://testgrid.k8s.io/sig-node-containerd\#node-e2e-features](https://testgrid.k8s.io/sig-node-containerd#node-e2e-features)
|
||||
[https://testgrid.k8s.io/sig-node-containerd\#node-e2e-features](https://testgrid.k8s.io/sig-node-containerd#node-e2e-features)
|
||||
|
||||
## 2023/09/06
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=lDTQSHeB-v4](https://www.youtube.com/watch?v=lDTQSHeB-v4)
|
||||
|
||||
Hosts:
|
||||
|
||||
- Tests: skanzhelev
|
||||
- Bugs: skanzhelev
|
||||
|
||||
Agenda:
|
||||
|
||||
## 2023/08/30
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=HmY5ID9LlNM](https://www.youtube.com/watch?v=HmY5ID9LlNM)
|
||||
Hosts:
|
||||
|
||||
- Tests: mmiranda96
|
||||
- Bugs: ndixita
|
||||
|
||||
|
||||
Agenda:
|
||||
\[Harshal\] Performance issue: many pods in the same namespace take more CPU than the same number of pods in multiple namespaces. Will file an issue with observations.
|
||||
|
||||
* \[mahamed\] [https://github.com/kubernetes/test-infra/pull/29944](https://github.com/kubernetes/test-infra/pull/29944)
|
||||
|
||||
## 2023/08/23
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=637\_82vBhqM](https://www.youtube.com/watch?v=637_82vBhqM)
|
||||
Hosts:
|
||||
|
||||
- Tests: ~~ndixita~~ mmiranda96
|
||||
- Bugs: mmiranda96
|
||||
|
||||
|
||||
Agenda:
|
||||
|
||||
- Todd seems to have found the reason for OOMKilled not being reported as a status: [https://github.com/kubernetes/kubernetes/issues/119600](https://github.com/kubernetes/kubernetes/issues/119600)
|
||||
- CPUManager/Topology manager: try \`decorate:true\` on the job definition
|
||||
|
||||
## 2023/08/16
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=PIqJyLT\_D\_E](https://www.youtube.com/watch?v=PIqJyLT_D_E)
|
||||
Hosts:
|
||||
|
||||
- Tests: mmiranda96
|
||||
- Bugs: ndixita
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[tzneal\] Looking for review
|
||||
- [~~https://github.com/kubernetes/kubernetes/pull/119765~~](https://github.com/kubernetes/kubernetes/pull/119765) ~~Use the ‘/’ mount path for NFS tests that works everywhere~~ Merged
|
||||
- [~~https://github.com/kubernetes/kubernetes/pull/119890~~](https://github.com/kubernetes/kubernetes/pull/119890) ~~crio: increase test buffer to eliminate test flakes~~ Merged
|
||||
- [~~https://github.com/kubernetes/kubernetes/pull/119974~~](https://github.com/kubernetes/kubernetes/pull/119974) ~~Update tests to use the latest busybox test image~~ \- Approved, waiting on merge
|
||||
- \[mmiranda96\] We need to confirm if perf tests are running on cgroup v1/v2 (v2 is expected to behave better under stress conditions)
|
||||
- \[ndixita\] [https://github.com/kubernetes/kubernetes/issues/119960](https://github.com/kubernetes/kubernetes/issues/119960) Link to other related bugs
|
||||
|
||||
## 2023/08/09
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=OOgdMe0TJWU](https://www.youtube.com/watch?v=OOgdMe0TJWU)
|
||||
Hosts:
|
||||
|
||||
- Tests: mmiranda96
|
||||
- Bugs: ~~haircommander~~ mmiranda96
|
||||
|
||||
Agenda:
|
||||
|
||||
- [https://github.com/kubernetes/test-infra/pull/30249](https://github.com/kubernetes/test-infra/pull/30249)
|
||||
- \[tzneal\] \- [https://github.com/kubernetes/kubernetes/issues/119611](https://github.com/kubernetes/kubernetes/issues/119611)
|
||||
- \[mmiranda96\] Create node conformance release branch jobs for 1.27 and 1.28 (self-assign issue)
|
||||
|
||||
## 2023/08/02
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=jxr4iMYzH2E](https://www.youtube.com/watch?v=jxr4iMYzH2E)
|
||||
Hosts:
|
||||
|
||||
- Tests: ndixita
|
||||
- Bugs: tzneal
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[ndixita\] Looking for more people to engage in the meeting by leading bug triage or test failures. Please reach out to \[SergeyKanzhelev\] if you are interested.
|
||||
- ndixita doing test failures triage today.
|
||||
- tzneal doing bug triage today.
|
||||
- \[ndixita\] Test Failures
|
||||
- \[fromani\] gce device plugin presubmit jobs seem to be broken. Any insights, or tips to debug?
|
||||
- Issue: [https://github.com/kubernetes/kubernetes/issues/119730](https://github.com/kubernetes/kubernetes/issues/119730)
|
||||
- Testing PR: [https://github.com/kubernetes/kubernetes/pull/119590](https://github.com/kubernetes/kubernetes/pull/119590)
|
||||
- Multi-numa still failing: [https://github.com/kubernetes/test-infra/pull/29717\#issuecomment-1584967574](https://github.com/kubernetes/test-infra/pull/29717#issuecomment-1584967574) [https://github.com/kubernetes/kubernetes/issues/119601](https://github.com/kubernetes/kubernetes/issues/119601)
|
||||
[https://github.com/kubernetes/test-infra/pull/30249](https://github.com/kubernetes/test-infra/pull/30249)
|
||||
- Arm failures fixed: [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-arm64-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial)
|
||||
- Fix: https://github.com/kubernetes/kubernetes/pull/119603
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#containerd-node-e2e-1.7](https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-1.7)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/119600](https://github.com/kubernetes/kubernetes/issues/119600)
|
||||
- Fix for oom kill test: [https://github.com/kubernetes/kubernetes/pull/119670](https://github.com/kubernetes/kubernetes/pull/119670)
|
||||
- “Summary API \[NodeConformance\] when querying /stats/summary should report resource usage through the stats api” still failing.
|
||||
- \[tzneal\] Odd segfault in /bin/top that only occurs with old busybox images on arm64. Fixed with newer images, PR to update and fix some arm64 tests is at [https://github.com/kubernetes/kubernetes/pull/119636](https://github.com/kubernetes/kubernetes/pull/119636)
|
||||
- Related, some test images are built on centos which is no longer updated (e.g. [https://github.com/kubernetes/kubernetes/blob/master/test/images/volume/nfs/BASEIMAGE](https://github.com/kubernetes/kubernetes/blob/master/test/images/volume/nfs/BASEIMAGE) )
|
||||
-
|
||||
|
||||
## 2023/07/26
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=iJUsH4BfYTY](https://www.youtube.com/watch?v=iJUsH4BfYTY)
|
||||
Hosts:
|
||||
|
||||
- Tests: SergeyKanzhelev
|
||||
- Bugs: mmiranda96
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[SergeyKanzhelev\] More people engaging with leading the meeting. Please reach out to me if you are interested.
|
||||
- Mike will do bug triage today.
|
||||
- \[fromani\] it seems the job [pull-kubernetes-e2e-gce-device-plugin-gpu](https://prow.k8s.io/job-history/gs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce-device-plugin-gpu) is broken for stable releases (e.g. 1.27). Xref: [https://prow.k8s.io/pr-history/?org=kubernetes\&repo=kubernetes\&pr=119590](https://prow.k8s.io/pr-history/?org=kubernetes&repo=kubernetes&pr=119590)
|
||||
- \[SergeyKanzhelev\] Multi-numa still failing: [https://github.com/kubernetes/test-infra/pull/29717\#issuecomment-1584967574](https://github.com/kubernetes/test-infra/pull/29717#issuecomment-1584967574) [https://github.com/kubernetes/kubernetes/issues/119601](https://github.com/kubernetes/kubernetes/issues/119601)
|
||||
- Arm failing: [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-arm64-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial) \[mention @upodroid\] [**https://github.com/kubernetes/kubernetes/issues/119599**](https://github.com/kubernetes/kubernetes/issues/119599)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#containerd-node-e2e-1.7](https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-1.7) [https://github.com/kubernetes/kubernetes/issues/119600](https://github.com/kubernetes/kubernetes/issues/119600)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/119602](https://github.com/kubernetes/kubernetes/issues/119602)
|
||||
- \[Harshal\] is it related to [https://github.com/kubernetes/kubernetes/pull/119486](https://github.com/kubernetes/kubernetes/pull/119486) ?
|
||||
- \[ndixita\] suggestion for adding dashboards for each architecture and cloud provider [https://github.com/kubernetes/test-infra/pull/29969](https://github.com/kubernetes/test-infra/pull/29969)
|
||||
- \[mahamed\] [https://github.com/kubernetes/test-infra/issues/29946](https://github.com/kubernetes/test-infra/issues/29946) node prowjob overhaul
|
||||
|
||||
## 2023/07/19
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=PvaIQwVaCEs](https://www.youtube.com/watch?v=PvaIQwVaCEs)
|
||||
|
||||
\[swsehgal\]: Multi-numa failing: [https://github.com/kubernetes/test-infra/pull/29717\#issuecomment-1584967574](https://github.com/kubernetes/test-infra/pull/29717#issuecomment-1584967574)
|
||||
|
||||
* Ran CPU manager, Memory Manager and Topology Manager e2e tests locally on a multi-numa system. All are passing.
|
||||
* Topology Manager metric test is failing in the RedHat environment ( not the same failure in u/s CI; tests are skipped in u/s CI): Looking into this. Wasn’t able to reproduce it locally.
|
||||
* For CPU Manager and memory manager, kubelet fails in upstream CI environment. Perhaps issue with job config.
|
||||
* PR with a potential fix (update node test args): [https://github.com/kubernetes/test-infra/pull/30072](https://github.com/kubernetes/test-infra/pull/30072)
|
||||
|
||||
## 2023/07/07 canceled due to host availability
|
||||
|
||||
## 2023/07/05
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=G6j0hnwsEXQ](https://www.youtube.com/watch?v=G6j0hnwsEXQ)
|
||||
|
||||
- Review the umbrella issue: [https://github.com/kubernetes/kubernetes/issues/118441\#issuecomment-1578138427](https://github.com/kubernetes/kubernetes/issues/118441#issuecomment-1578138427)
|
||||
- Multi-numa failing: [https://github.com/kubernetes/test-infra/pull/29717\#issuecomment-1584967574](https://github.com/kubernetes/test-infra/pull/29717#issuecomment-1584967574)
|
||||
|
||||
## 2023/06/28
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=U9yophjx23Q](https://www.youtube.com/watch?v=U9yophjx23Q)
|
||||
|
||||
- \[SergeyKanzhelev\] prepull: [https://github.com/kubernetes/kubernetes/pull/118747](https://github.com/kubernetes/kubernetes/pull/118747)
|
||||
-
|
||||
|
||||
## 2023/06/21 \[Cancelled\]
|
||||
|
||||
## 2023/06/14
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=TDBLb4\_sj7w](https://www.youtube.com/watch?v=TDBLb4_sj7w)
|
||||
|
||||
Agenda:
|
||||
|
||||
- ARM64 failures: [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-arm64-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial)
|
||||
- Need CI test as well
|
||||
- Multi-numa failing: [https://github.com/kubernetes/test-infra/pull/29717\#issuecomment-1584967574](https://github.com/kubernetes/test-infra/pull/29717#issuecomment-1584967574)
|
||||
- Updating the base image may be relevant here (cos 93\)
|
||||
- Mike to send a PR today to bump the version
|
||||
- Cos-containerd-node-e2e-serial failure : [https://github.com/kubernetes/kubernetes/issues/118660](https://github.com/kubernetes/kubernetes/issues/118660)
|
||||
|
||||
[https://github.com/kubernetes/kubernetes/issues/118441](https://github.com/kubernetes/kubernetes/issues/118441)
|
||||
|
||||
## 2023/06/07
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=1u6--D2Y9LU](https://www.youtube.com/watch?v=1u6--D2Y9LU)
|
||||
|
||||
- \[pacoxu\] Dims opened an umbrella issue: [https://github.com/kubernetes/kubernetes/issues/118441\#issuecomment-1578138427](https://github.com/kubernetes/kubernetes/issues/118441#issuecomment-1578138427) Some updates from my side. Some may need platform permissions, and I am not sure if we can be granted them.
|
||||
- AI: ownership and the purpose of this: prowjob\_name: ci-kubernetes-e2e-node-canary prowjob\_config\_url: https://git.k8s.io/test-infra/config/jobs/kubernetes/sig-testing/kubetest-canaries.yaml
|
||||
-
|
||||
-
|
||||
- ARM and multi-numa periodics status
|
||||
- Multi NUMA PR: [https://github.com/kubernetes/test-infra/pull/29717](https://github.com/kubernetes/test-infra/pull/29717)
|
||||
- ARM is still failing: [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-arm64-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial)
|
||||
|
||||
## 2023/05/31
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=4PZi50cP-UY](https://www.youtube.com/watch?v=4PZi50cP-UY)
|
||||
|
||||
Agenda:
|
||||
|
||||
- StandaloneMode: green now: [https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-containerd-standalone-mode-all-alpha](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-containerd-standalone-mode-all-alpha) Ideally need to add more tests
|
||||
- ARM64 still failing:
|
||||
- [https://github.com/kubernetes/test-infra/pull/29617\#issuecomment-1570293715](https://github.com/kubernetes/test-infra/pull/29617#issuecomment-1570293715)
|
||||
- Create issue for periodics of multi-NUMA
|
||||
- Periodics vs. presubmits jobs:
|
||||
- presubmits are degrading
|
||||
- presubmits diverge from periodics
|
||||
- \[ffromani\] an example is the topology tests that are being investigated now. Will ask sig testing if there are any established approaches to keep those green
|
||||
- \[Dixita\] First draft for guidance around test coverage [https://docs.google.com/document/d/1P1X9Jr2PYFiC6xNF-9RtgrPuc9PsSmikCngGTfGVxjY/edit?usp=sharing](https://docs.google.com/document/d/1P1X9Jr2PYFiC6xNF-9RtgrPuc9PsSmikCngGTfGVxjY/edit?usp=sharing)
|
||||
- Please take a look at it and drop feedback.
|
||||
\- grant permission to [kubernetes-sig-node-test-failures](https://groups.google.com/forum/#!forum/kubernetes-sig-node-test-failures) please ([kubernetes-sig-node-test-failures@googlegroups.com](mailto:kubernetes-sig-node-test-failures@googlegroups.com))
|
||||
-
|
||||
|
||||
## 2023/05/24
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=X87Lmqmf3QQ](https://www.youtube.com/watch?v=X87Lmqmf3QQ)
|
||||
Agenda:
|
||||
|
||||
- \[tzneal\] Added periodics for standalone kubelet tests, but they are failing [https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-containerd-standalone-mode](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-containerd-standalone-mode)
|
||||
- [https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-e2e-containerd-standalone-mode-all-alpha/1661213193298513920](https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-e2e-containerd-standalone-mode-all-alpha/1661213193298513920)
|
||||
- W0524 03:37:35.874\] 2023/05/24 03:37:35 main.go:328: Something went wrong: failed to prepare test environment: \--provider=gce boskos failed to acquire project: resources not found
|
||||
- Asked in \#sig-k8s-infra: [https://kubernetes.slack.com/archives/CCK68P2Q2/p1684932723841139](https://kubernetes.slack.com/archives/CCK68P2Q2/p1684932723841139)
|
||||
- Try comparing with [https://testgrid.k8s.io/sig-node-containerd\#containerd-node-e2e-features-1.7](https://testgrid.k8s.io/sig-node-containerd#containerd-node-e2e-features-1.7)
|
||||
- Containerd features tests are not running: [https://testgrid.k8s.io/sig-node-containerd\#node-e2e-features](https://testgrid.k8s.io/sig-node-containerd#node-e2e-features)
|
||||
|
||||
## 2023/05/17
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=5IC3JLbrk-A](https://www.youtube.com/watch?v=5IC3JLbrk-A)
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[SergeyKanzhelev\] Stress test: [https://github.com/kubernetes/kubernetes/pull/117439](https://github.com/kubernetes/kubernetes/pull/117439)
|
||||
- More test areas:
|
||||
- ARM tests: [https://github.com/kubernetes/test-infra/pull/29192](https://github.com/kubernetes/test-infra/pull/29192) and [https://github.com/kubernetes/kubernetes/pull/117017](https://github.com/kubernetes/kubernetes/pull/117017)
|
||||
- Run StandaloneMode kubelet tests in periodics: [https://github.com/kubernetes/test-infra/pull/29042](https://github.com/kubernetes/test-infra/pull/29042)
|
||||
- Tzneal can do it
|
||||
- Run multi-numa in periodics: [https://github.com/kubernetes/test-infra/blob/master/jobs/e2e\_node/image-config-serial-multi-numa.yaml](https://github.com/kubernetes/test-infra/blob/master/jobs/e2e_node/image-config-serial-multi-numa.yaml)
|
||||
- Swati can do it
|
||||
- Jobs to release branches for hotfix validation
|
||||
- Fork-per-release: [https://github.com/kubernetes/test-infra/pull/29483/commits/9ef5f916f78323f99857ff16cbb6b03f665eed60](https://github.com/kubernetes/test-infra/pull/29483/commits/9ef5f916f78323f99857ff16cbb6b03f665eed60)
|
||||
- \[Todd\] [https://github.com/kubernetes/test-infra/blob/master/releng/config-forker/README.md](https://github.com/kubernetes/test-infra/blob/master/releng/config-forker/README.md)
|
||||
- Images may be used there that are “too fresh”
|
||||
- \[mmiranda96\] in these cases we can fork manually
|
||||
- Previous releases periodics with specific images
|
||||
|
||||
## 2023/05/10
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=JsjBNgVb6Po](https://www.youtube.com/watch?v=JsjBNgVb6Po)
|
||||
|
||||
Agenda:
|
||||
|
||||
\[swsehgal\] Would like help with [https://github.com/kubernetes/test-infra/pull/29483](https://github.com/kubernetes/test-infra/pull/29483)
|
||||
|
||||
* Trying to enable jobs for 1.25, 1.26 and 1.27 release branches so that device manager tests can execute in the CI
|
||||
* Unit tests are failing
|
||||
* How can I ensure that these jobs are properly added to testgrid?
|
||||
* \[mmiranda96\] Previous PR that contains release branch SIG-Node tests: [https://github.com/kubernetes/test-infra/pull/29038](https://github.com/kubernetes/test-infra/pull/29038)
|
||||
|
||||
## 2023/05/03
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=SAUmnsx\_sG8](https://www.youtube.com/watch?v=SAUmnsx_sG8)
|
||||
|
||||
Triage only
|
||||
|
||||
## 2023/04/26
|
||||
|
||||
Recording: [Kubernetes SIG Node CI 20230426](https://www.youtube.com/watch?v=-Aq6ZB7Bb0o)
|
||||
|
||||
- \[mmiranda96\] [https://github.com/kubernetes/test-infra/issues/29308](https://github.com/kubernetes/test-infra/issues/29308)
|
||||
- \[ndixita\] Guidance doc for Sig Node CI Test Coverage
|
||||
|
||||
## 2023/04/18 Canceled for KubeCon
|
||||
|
||||
## 2023/03/29
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=TaUK0qYwtwA](https://www.youtube.com/watch?v=TaUK0qYwtwA)
|
||||
|
||||
- \[akhil\] [https://github.com/kubernetes/kubernetes/issues/116944](https://github.com/kubernetes/kubernetes/issues/116944)
|
||||
|
||||
## 2023/03/22
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=d6TZ5AywdnM](https://www.youtube.com/watch?v=d6TZ5AywdnM)
|
||||
|
||||
- \[tzneal\] Eviction Flakes \- [https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-eviction](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-eviction)
|
||||
- Fix for Pid Pressure by Todd \- will be posted soon
|
||||
- Test: [https://github.com/kubernetes/kubernetes/blob/master/test/e2e\_node/eviction\_test.go\#L500](https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/eviction_test.go#L500)
|
||||
- PR: [https://github.com/kubernetes/kubernetes/pull/116862](https://github.com/kubernetes/kubernetes/pull/116862)
|
||||
- Update\[tzneal\]: I was wrong on this, I think it’s actually caused by [https://github.com/kubernetes/kubernetes/issues/115215](https://github.com/kubernetes/kubernetes/issues/115215) .
|
||||
- New failure:
|
||||
- E2eNode Suite.\[It\] \[sig-node\] MirrorPod when create a mirror pod without changes should successfully recreate when file is removed and recreated \[NodeConformance\]
|
||||
-
|
||||
- [https://testgrid.k8s.io/sig-node-release-blocking\#ci-crio-cgroupv1-node-e2e-conformance](https://testgrid.k8s.io/sig-node-release-blocking#ci-crio-cgroupv1-node-e2e-conformance)
|
||||
- AI: Ryan to take a look
|
||||
- [https://github.com/kubernetes/kubernetes/issues/116714](https://github.com/kubernetes/kubernetes/issues/116714)
|
||||
|
||||
- [https://github.com/kubernetes/kubernetes/issues/116874](https://github.com/kubernetes/kubernetes/issues/116874)
|
||||
|
||||
- Flakes/failure:
|
||||
- E2eNode Suite.\[It\] \[sig-node\] MirrorPodWithGracePeriod when create a mirror pod and the container runtime is temporarily down during pod termination \[NodeConformance\] \[Serial\] \[Disruptive\] the mirror pod should terminate successfully
|
||||
-
|
||||
- [https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-serial-containerd\&width=20](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd&width=20)
|
||||
|
||||
Perma failure here:
|
||||
|
||||
E2eNode Suite.\[It\] \[sig-node\] MirrorPodWithGracePeriod when create a mirror pod and the container runtime is temporarily down during pod termination \[NodeConformance\] \[Serial\] \[Disruptive\] the mirror pod should terminate successfully
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-cri-o\#node-kubelet-serial-crio](https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-serial-crio)
|
||||
|
||||
|
||||
Introduced here: [https://github.com/kubernetes/kubernetes/pull/113145/](https://github.com/kubernetes/kubernetes/pull/113145/)
|
||||
|
||||
- Not working:
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#containerd-e2e-ubuntu](https://testgrid.k8s.io/sig-node-containerd#containerd-e2e-ubuntu)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/116873](https://github.com/kubernetes/kubernetes/issues/116873)
|
||||
|
||||
- Pod Overhead related
|
||||
|
||||
E2eNode Suite.\[It\] \[sig-node\] Kubelet PodOverhead handling \[LinuxOnly\] PodOverhead cgroup accounting On running pod with PodOverhead defined Pod cgroup should be sum of overhead and resource limits
|
||||
|
||||
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-cri-o\#ci-crio-cgroupv1-node-e2e-unlabelled\&width=20](https://testgrid.k8s.io/sig-node-cri-o#ci-crio-cgroupv1-node-e2e-unlabelled&width=20)
|
||||
|
||||
AI: Todd to investigate
|
||||
|
||||
- [https://github.com/kubernetes/kubernetes/issues/116864](https://github.com/kubernetes/kubernetes/issues/116864)
|
||||
- Doesn’t appear to be related to the changes to pod resource calculations. This uses a custom runtime class and support for setting up the test infra with that runtime class is not implemented for cri-o.
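The failing PodOverhead test asserts that the pod cgroup limit equals the sum of the container resource limits plus the overhead. A minimal Python sketch of that arithmetic follows. The function name and the MiB values are hypothetical and chosen only for illustration; this is not the test's actual code.

```python
def expected_pod_cgroup_limit(container_limits, overhead):
    """Return the pod-level limit: the sum of container limits plus the overhead."""
    return sum(container_limits) + overhead


# Hypothetical memory values in MiB, used only to illustrate the sum.
container_limits_mib = [128, 256]
overhead_mib = 64
print(expected_pod_cgroup_limit(container_limits_mib, overhead_mib))
```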
|
||||
|
||||
|
||||
|
||||
## 2023/03/15
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=D\_5tx6lkxY4](https://www.youtube.com/watch?v=D_5tx6lkxY4)
|
||||
|
||||
- \[SergeyKanzhelev\] Standalone tests:
|
||||
- [https://github.com/kubernetes/kubernetes/pull/116628](https://github.com/kubernetes/kubernetes/pull/116628)
|
||||
- [https://github.com/kubernetes/kubernetes/pull/116631](https://github.com/kubernetes/kubernetes/pull/116631)
|
||||
- \[SergeyKanzhelev\] testing limits in e2e: TODO
|
||||
- \[SergeyKanzhelev\] there is a flake: [https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-serial-containerd](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd)
|
||||
|
||||
|
||||
## 2023/03/08
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=luwWBmxyngk](https://www.youtube.com/watch?v=luwWBmxyngk)
|
||||
|
||||
- \[SergeyKanzhelev\] [https://github.com/kubernetes/test-infra/issues/28888](https://github.com/kubernetes/test-infra/issues/28888)
|
||||
- \[SergeyKanzhelev\] [https://github.com/kubernetes/test-infra/pull/28919](https://github.com/kubernetes/test-infra/pull/28919#discussion_r1128352969) Apply USE\_TEST\_INFRA\_LOG\_DUMPING to sig node jobs
|
||||
- @xmcqueen
|
||||
- \[fromani\] not urgent, mostly to socialize the idea:
|
||||
- promoting a subset of podresources e2e tests to NodeConformance
|
||||
- To be done around the timeframe of podresources GA (should not block the GA’ing)
|
||||
  - Context: podresources [endpoint on Windows](https://github.com/kubernetes/kubernetes/pull/115133/) needs e2e tests. On Windows, Conformance and NodeConformance tests are run by default.
|
||||
- \[SergeyKanzhelev\] Testing previous releases [https://testgrid.k8s.io/sig-node-containerd\#node-conformance-release-1.24](https://testgrid.k8s.io/sig-node-containerd#node-conformance-release-1.24)
|
||||
- \[vinaykul\] In-place pod resize CI tests merged yesterday.
|
||||
- Multiple failures from ‘Insufficient cpu’ on small node (2000m allocatable)
|
||||
- Potential fix [https://github.com/kubernetes/kubernetes/pull/116372](https://github.com/kubernetes/kubernetes/pull/116372)
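The 'Insufficient cpu' failures come down to request accounting: a pod fits only if the CPU already requested on the node plus its own request stays within the node's allocatable CPU. A minimal Python sketch under that assumption follows. Only the 2000m allocatable figure comes from the notes; the function name and the per-pod requests are hypothetical.

```python
def fits_on_node(allocatable_millicpu, existing_requests, new_request):
    """True if existing requests plus the new request stay within allocatable CPU."""
    return sum(existing_requests) + new_request <= allocatable_millicpu


ALLOCATABLE_MILLICPU = 2000  # small node from the notes (2000m allocatable)

# Hypothetical requests (in millicores) already placed on the node.
existing = [500, 700]

print(fits_on_node(ALLOCATABLE_MILLICPU, existing, 500))   # within allocatable
print(fits_on_node(ALLOCATABLE_MILLICPU, existing, 1000))  # exceeds allocatable
```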
|
||||
|
||||
AI:
|
||||
|
||||
- Ping sig storage on [https://github.com/kubernetes/kubernetes/issues/116357](https://github.com/kubernetes/kubernetes/issues/116357)
|
||||
|
||||
## 2023/03/01
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=BIhQ69FNYJA](https://www.youtube.com/watch?v=BIhQ69FNYJA)
|
||||
|
||||
- \[mmiranda96\] [https://github.com/kubernetes/test-infra/issues/28627](https://github.com/kubernetes/test-infra/issues/28627)
|
||||
- Start with NodeConformance on supported k8s release branches.
|
||||
- Add @xmcqueen as reviewer
|
||||
- [https://github.com/kubernetes/kubernetes/pull/115984](https://github.com/kubernetes/kubernetes/pull/115984)
|
||||
|
||||
## 2023/02/22 \[cancelled\]
|
||||
|
||||
## 2023/02/15 \[cancelled\]
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[fromani\] Promoting some node e2e tests to NodeConformance: best way forward
|
||||
  - This will provide test coverage for the podresources API on Windows
|
||||
- Fromani to discuss this offline (not very urgent, no worries)
|
||||
|
||||
## 2023/02/08
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=\_4K8JD7Zejo](https://www.youtube.com/watch?v=_4K8JD7Zejo)
|
||||
Agenda:
|
||||
|
||||
- \[pacoxu\] I want to add some CI for [quota monitoring of ephemeral storage](https://github.com/kubernetes/enhancements/issues/1029). There is already an e2e test that covers configmap editing; what we need to provide is a cluster with XFS project quotas enabled: [https://github.com/kubernetes/test-infra/issues/28614](https://github.com/kubernetes/test-infra/issues/28614)
|
||||
  - The correct bug fix for the fsquota issue is [https://github.com/kubernetes/kubernetes/pull/115314](https://github.com/kubernetes/kubernetes/pull/115314).
|
||||
- \[SergeyKanzhelev\] Check on last meeting's AIs:
|
||||
- [https://github.com/kubernetes/test-infra/issues/28627](https://github.com/kubernetes/test-infra/issues/28627)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/115372](https://github.com/kubernetes/kubernetes/issues/115372)
|
||||
|
||||
## 2023/02/01
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=t9brdyaGo3Y](https://www.youtube.com/watch?v=t9brdyaGo3Y)
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[SergeyKanzhelev\] Test coverage for previous releases
|
||||
  - Let’s add some NodeConformance tests to sig-node-release-blocking for supported versions
|
||||
- Mike is to open an issue for this: [https://github.com/kubernetes/test-infra/issues/28627](https://github.com/kubernetes/test-infra/issues/28627)
|
||||
- Also let’s lock the OS to the specific release
|
||||
- Mike to decide on new OS for 1.27
|
||||
|
||||
- \[Mike Miranda\] swap on fedora: [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-fedora](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora)
|
||||
- Tracking issue: [https://github.com/kubernetes/kubernetes/issues/115372](https://github.com/kubernetes/kubernetes/issues/115372)
|
||||
- Peter will take a look
|
||||
  - Mike may look into adding a memory swap presubmit
|
||||
|
||||
## 2023/01/25
|
||||
|
||||
Recording: [https://youtu.be/t1kbFCaeSSE](https://youtu.be/t1kbFCaeSSE)
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[SergeyKanzhelev\] no stress tests: [https://github.com/kubernetes/kubernetes/pull/115143](https://github.com/kubernetes/kubernetes/pull/115143)
|
||||
  - \[Brian\] soak tests exist but have been broken for a long time
|
||||
  - \[Ryan\] gRPC \- does it also need to be fixed? Which tests need to run?
|
||||
- \[Brian\] qq: is there already a place where it will fit in? Maybe we can add a place first and clear the path for contributors to add test cases.
|
||||
- \[Sergey\] curious if evented PLEG has been looking into stress testing
|
||||
  - \[Harshal\] adding e2e CI test jobs for containerd and cri-o. Have a presubmit job now, debugging it.
|
||||
    - Stress testing and different scenarios are the next step
|
||||
|
||||
## 2023/01/18
|
||||
|
||||
Recording: [https://youtu.be/hep7StWT8u0](https://youtu.be/hep7StWT8u0)
|
||||
Agenda:
|
||||
|
||||
- \[swsehgal\] Need some guidance on the process of publishing images that are used in node e2e tests
|
||||
- Context: Device Manager Bug [https://github.com/kubernetes/kubernetes/pull/114640](https://github.com/kubernetes/kubernetes/pull/114640)
|
||||
    - Scenario: on node reboot or kubelet restart, the device plugin pod is not recovered before an application pod that consumes the device.
|
||||
- To reproduce the issue and for e2e testing, sample device plugin (which is a device plugin implemented in tree for testing) was modified to control its registration process
|
||||
    - Changes related to the sample device plugin were split into a separate PR [https://github.com/kubernetes/kubernetes/pull/115107](https://github.com/kubernetes/kubernetes/pull/115107)
|
||||
- How can we get this image pushed to the Kubernetes registry so it can be correctly consumed in 114640? [Kubernetes image promotion process](https://github.com/kubernetes/enhancements/tree/master/keps/sig-release/1734-k8s-image-promoter#promotion-process) indicates that maintainer involvement might be needed here.
|
||||
- Examples:
|
||||
- [https://github.com/kubernetes/kubernetes/pull/109551/files](https://github.com/kubernetes/kubernetes/pull/109551/files)
|
||||
- [https://github.com/kubernetes/k8s.io/pull/4391](https://github.com/kubernetes/k8s.io/pull/4391)
|
||||
- [https://github.com/kubernetes/k8s.io/blob/71231519d8f36b71b2c218ed3a993c64d63d0882/k8s.gcr.io/images/k8s-staging-e2e-test-images/images.yaml\#L149](https://github.com/kubernetes/k8s.io/blob/71231519d8f36b71b2c218ed3a993c64d63d0882/k8s.gcr.io/images/k8s-staging-e2e-test-images/images.yaml#L149)
|
||||
|
||||
## 2023/01/11
|
||||
|
||||
Recording: [https://youtu.be/N97z4wGoIl0](https://youtu.be/N97z4wGoIl0)
|
||||
Agenda:
|
||||
|
||||
- \[Francesco\] device plugins to use in e2e tests
|
||||
- Why we need them
|
||||
- Current issues
|
||||
- Discussion about future plans
|
||||
|
||||
- Synthetic devices are better than nothing. They would work on any machine and are vendor-neutral
|
||||
- Device plugin from kubevirt was good as it was vendor neutral and didn’t require hardware: [https://kubevirt.io/2018/KVM-Using-Device-Plugins.html](https://kubevirt.io/2018/KVM-Using-Device-Plugins.html)
|
||||
- In search of real world devices:
|
||||
- Investigate GPU devices \- can we run with them periodically?
|
||||
  - \[Ryan\] a good approach may be a combination \- running presubmits with synthetic devices and running with GPU devices periodically
|
||||
- \[Francesco\] will investigate
|
||||
- [**https://github.com/kubernetes/test-infra/pull/28369**](https://github.com/kubernetes/test-infra/pull/28369) **enable multi-numa tests**
|
||||
- [**https://github.com/kubernetes/community/pull/7021**](https://github.com/kubernetes/community/pull/7021) **how to write good tests**
|
||||
|
||||
## 2023/01/04
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=wyc4k1ERDEA](https://www.youtube.com/watch?v=wyc4k1ERDEA)
|
||||
|
||||
Agenda:
|
||||
|
||||
- Triage
|
||||
- All CRI-O tests are red: [https://testgrid.k8s.io/sig-node-cri-o](https://testgrid.k8s.io/sig-node-cri-o)
|
||||
-
|
|
@ -0,0 +1,744 @@
|
|||
# Kubernetes SIG-Node CI subgroup notes
|
||||
|
||||
## Dec 18, 2024
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging
|
||||
|
||||
Agenda:
|
||||
|
||||
- Test-Infra Cleanup
|
||||
- Dropped NodeSpecialFeature / NodeAlphaFeature
|
||||
- Fall out: [https://github.com/kubernetes/test-infra/pull/33996](https://github.com/kubernetes/test-infra/pull/33996)
|
||||
  - Aiming to deprecate NodeFeature by replacing it with Feature
|
||||
- [https://github.com/kubernetes/kubernetes/pull/129166](https://github.com/kubernetes/kubernetes/pull/129166)
|
||||
-
|
||||
|
||||
## Dec 11, 2024
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
- [https://github.com/kubernetes/enhancements/tree/master/keps/sig-testing/3041-node-conformance-and-features\#goals](https://github.com/kubernetes/enhancements/tree/master/keps/sig-testing/3041-node-conformance-and-features#goals)
|
||||
- [https://github.com/kubernetes/kubernetes/pull/128923](https://github.com/kubernetes/kubernetes/pull/128923)
|
||||
- [https://github.com/kubernetes/test-infra/pull/33828](https://github.com/kubernetes/test-infra/pull/33828)
|
||||
- TODO:
|
||||
- [Nodefeature](http://github.com/kubernetes/kubernetes/tree/master/test/e2e/nodefeature) to feature
|
||||
- Convert to label filter instead of using skip/focus
|
||||
- [https://github.com/kubernetes/kubernetes/pull/128880](https://github.com/kubernetes/kubernetes/pull/128880)
|
||||
- [https://github.com/kubernetes/kubernetes/pull/128889](https://github.com/kubernetes/kubernetes/pull/128889)
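The move from skip/focus to a label filter can be illustrated with a small Python sketch. Focus/skip selection matches regular expressions against the free-text test name, while label selection checks membership in a set of labels attached to the test. The test records, label names, and helper functions below are hypothetical, and the sketch does not reproduce the actual filter syntax used by the test framework.

```python
import re

# Hypothetical test records: a free-text name plus a set of labels.
TESTS = [
    {"name": "[sig-node] MirrorPod should recreate [NodeConformance]",
     "labels": {"sig-node", "NodeConformance"}},
    {"name": "[sig-node] Swap should use swap [Serial]",
     "labels": {"sig-node", "Serial"}},
    {"name": "[sig-node] Eviction under PID pressure [Serial] [Disruptive]",
     "labels": {"sig-node", "Serial", "Disruptive"}},
]


def select_by_focus_skip(tests, focus, skip):
    """Select tests whose name matches the focus regex and not the skip regex."""
    return [t["name"] for t in tests
            if re.search(focus, t["name"]) and not re.search(skip, t["name"])]


def select_by_labels(tests, required, excluded):
    """Select tests that carry every required label and none of the excluded ones."""
    return [t["name"] for t in tests
            if required <= t["labels"] and not (excluded & t["labels"])]


# Both calls select the same test, but the label version does not depend on
# how the tags happen to be spelled inside the test name.
print(select_by_focus_skip(TESTS, r"\[Serial\]", r"\[Disruptive\]"))
print(select_by_labels(TESTS, {"Serial"}, {"Disruptive"}))
```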
|
||||
|
||||
Action items:
|
||||
- Follow up on https://github.com/kubernetes/test-infra/issues/32567
|
||||
|
||||
## Dec 4, 2024 \[just 3 people joined. Will do triage offline\]
|
||||
|
||||
## Nov 27, 2024
|
||||
Cancelled due to U.S. Holiday
|
||||
|
||||
## Nov 20, 2024
|
||||
|
||||
Hosts:
|
||||
Tests: Kevin
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
Need to create tickets for this
|
||||
|
||||
- Swap tests are broken. Need to create a ticket to investigate.
|
||||
- Huge Page Test Failures
|
||||
- Device Plugin
|
||||
|
||||
## Nov 13, 2024 \[Canceled for KubeCon\]
|
||||
|
||||
## Nov 6, 2024
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[anish\] Why are the eviction and CPU, memory, and topology manager e2e tests not considered release blocking, even though these are stable features?
|
||||
|
||||
## Oct 30, 2024 \[Canceled\]
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
## Oct 23, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=Y8bUTXy3FGs](https://www.youtube.com/watch?v=Y8bUTXy3FGs)
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
- Combine sig node/sig node CI board?
|
||||
- Originally it was separated to onboard new members to be able to do reviews without needing to worry about production code
|
||||
- Generally, this meeting should be focused on CI, so maybe defer PR triage
|
||||
  - Add a special label to PRs; when it’s present, remove the PR from one board and add it to the other
|
||||
- CRI proxy PR merged, now more tests can be added to test different CRI scenarios
|
||||
- [https://github.com/kubernetes/kubernetes/pull/127495](https://github.com/kubernetes/kubernetes/pull/127495)
|
||||
- How it can be used: [https://github.com/kubernetes/kubernetes/pull/121604](https://github.com/kubernetes/kubernetes/pull/121604)
|
||||
|
||||
## Oct 16, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=Y8bUTXy3FGs](https://www.youtube.com/watch?v=Y8bUTXy3FGs)
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[KubeTest\] Migration to kubetest
|
||||
- https://github.com/kubernetes/test-infra/issues/32567
|
||||
- Presubmits first
|
||||
- A lot of problems
|
||||
- [https://github.com/elieser1101](https://github.com/elieser1101) is owner for this
|
||||
- EventedPleg: [https://github.com/kubernetes/test-infra/issues/33666](https://github.com/kubernetes/test-infra/issues/33666)
|
||||
-
|
||||
|
||||
## Oct 9, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=GKrlW1LDXz0](https://www.youtube.com/watch?v=GKrlW1LDXz0)
|
||||
|
||||
Hosts:
|
||||
Tests: Anish
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
## Oct 2, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=80260g3EEv8](https://www.youtube.com/watch?v=80260g3EEv8)
|
||||
|
||||
## Sep 25, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=gcX6sDoibM4](https://www.youtube.com/watch?v=gcX6sDoibM4)
|
||||
|
||||
Agenda:
|
||||
|
||||
- [https://github.com/kubernetes/kubernetes/issues/127610](https://github.com/kubernetes/kubernetes/issues/127610)
|
||||
- \[ffromani\] (can attend only 1st half) Chicken and egg: [https://github.com/kubernetes/kubernetes/pull/120661](https://github.com/kubernetes/kubernetes/pull/120661) and [https://github.com/kubernetes/kubernetes/pull/127506](https://github.com/kubernetes/kubernetes/pull/127506)
|
||||
- CRI proxy work started: [https://github.com/kubernetes/kubernetes/pull/127495](https://github.com/kubernetes/kubernetes/pull/127495) Mostly FYI
|
||||
|
||||
## Sep 18, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=Y6slxvO6Hv8](https://www.youtube.com/watch?v=Y6slxvO6Hv8)
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triage:
|
||||
|
||||
Agenda:
|
||||
|
||||
- Another ping about the flake: [https://kubernetes.slack.com/archives/C0BP8PW9G/p1726622928011249?thread\_ts=1718369837.055379\&cid=C0BP8PW9G](https://kubernetes.slack.com/archives/C0BP8PW9G/p1726622928011249?thread_ts=1718369837.055379&cid=C0BP8PW9G)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/122270](https://github.com/kubernetes/kubernetes/issues/122270)
|
||||
- CRI proxy
|
||||
- Injecting failures is a good idea
|
||||
- If easy to set up \- maybe set up everywhere. If not \- let’s only do it per test
|
||||
  - The main concern is to keep tests from leaking into each other
|
||||
-
|
||||
|
||||
## Sep 11, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=vmH-6iWjWPM](https://www.youtube.com/watch?v=vmH-6iWjWPM)
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
## Sep 4, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=1DiDFkYhpi4](https://www.youtube.com/watch?v=1DiDFkYhpi4)
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging: anish
|
||||
|
||||
Agenda:
|
||||
|
||||
## Aug 28, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=tE4uO6Gj4sM](https://www.youtube.com/watch?v=tE4uO6Gj4sM)
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[anish\] will join in the second half of the meeting. Taking a shot at deflaking eviction tests \- [https://github.com/kubernetes/kubernetes/issues/123591](https://github.com/kubernetes/kubernetes/issues/123591)
|
||||
- Cadvisor cache seems to be in sync
|
||||
|
||||
## Aug 21, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=EgqFB0PDb0g](https://www.youtube.com/watch?v=EgqFB0PDb0g)
|
||||
|
||||
## Aug 7, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=g6U40nR\_tRU](https://www.youtube.com/watch?v=g6U40nR_tRU)
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
- Kevin for approver: [https://github.com/kubernetes/test-infra/pull/33255](https://github.com/kubernetes/test-infra/pull/33255)
|
||||
- Add test lanes for 1.31 \- Kevin will take it
|
||||
- Sergey: add workflows to auto-populate issues
|
||||
|
||||
## Jul 31, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=yfSd6ezXWIs](https://www.youtube.com/watch?v=yfSd6ezXWIs)
|
||||
|
||||
Hosts:
|
||||
Tests: Peter
|
||||
Bugs triaging: Anish
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[harche\] \- are we identifying cgroup v1 and v2 specific CI jobs?
|
||||
|
||||
## Jul 24, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=Wz0Dzo\_f4jg](https://www.youtube.com/watch?v=Wz0Dzo_f4jg)
|
||||
|
||||
Hosts:
|
||||
Tests: Kevin
|
||||
Bugs triaging: Peter
|
||||
|
||||
Agenda:
|
||||
|
||||
- AI: migrate to the new project boards.
|
||||
- AI: ask about perf dashboard
|
||||
- \[harche\] \- [https://github.com/kubernetes/kubernetes/issues/125720](https://github.com/kubernetes/kubernetes/issues/125720)
|
||||
- Also: [https://github.com/kubernetes/kubernetes/issues/125409](https://github.com/kubernetes/kubernetes/issues/125409)
|
||||
  - Also a potential fix for the behavior: [https://biriukov.dev/docs/page-cache/6-cgroup-v2-and-page-cache/\#writeback-and-io](https://biriukov.dev/docs/page-cache/6-cgroup-v2-and-page-cache/#writeback-and-io)
|
||||
-
|
||||
|
||||
## Jul 17, 2024
|
||||
|
||||
## Jul 10, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=see5mwuN0YA](https://www.youtube.com/watch?v=see5mwuN0YA)
|
||||
|
||||
Host:
|
||||
Tests:
|
||||
Bugs triaging: anish
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[anishshah\] New tests failing:
|
||||
- Podspidlimit \- [https://github.com/kubernetes/kubernetes/issues/126007](https://github.com/kubernetes/kubernetes/issues/126007)
|
||||
- OOMKiller tests in EC2 jobs \- [https://github.com/kubernetes/kubernetes/issues/126009](https://github.com/kubernetes/kubernetes/issues/126009)
|
||||
- Due to testsuite timeout \- [https://github.com/kubernetes/kubernetes/issues/126008](https://github.com/kubernetes/kubernetes/issues/126008)
|
||||
- \[sotiris\] [https://github.com/kubernetes/kubernetes/pull/124296](https://github.com/kubernetes/kubernetes/pull/124296)
|
||||
- \[Sergey\] Device plugin failure injection tests: [https://github.com/kubernetes/kubernetes/pull/125753](https://github.com/kubernetes/kubernetes/pull/125753)
|
||||
|
||||
## Jul 3, 2024 \[Cancelled\]
|
||||
|
||||
Host:
|
||||
|
||||
Agenda:
|
||||
|
||||
-
|
||||
|
||||
## Jun 26, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=Cn-1k0U1kGw](https://www.youtube.com/watch?v=Cn-1k0U1kGw)
|
||||
|
||||
Agenda:
|
||||
|
||||
- [https://kubernetes.slack.com/archives/C0BP8PW9G/p1718369837055379](https://kubernetes.slack.com/archives/C0BP8PW9G/p1718369837055379)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/122270](https://github.com/kubernetes/kubernetes/issues/122270)
|
||||
- \[fromani\] Annotating pod to detect leftovers: [https://github.com/kubernetes/kubernetes/pull/125434](https://github.com/kubernetes/kubernetes/pull/125434)
|
||||
- Driven by: [https://github.com/kubernetes/kubernetes/pull/123468](https://github.com/kubernetes/kubernetes/pull/123468) (PTAL\!)
|
||||
- \[alex\] Test guidance compliance work
|
||||
- [https://github.com/kubernetes/test-infra/pull/32752](https://github.com/kubernetes/test-infra/pull/32752)
|
||||
- \[Sotiris\] Can we do triage for [https://github.com/kubernetes/test-infra/pull/32765](https://github.com/kubernetes/test-infra/pull/32765)
|
||||
-
|
||||
|
||||
Triage:
|
||||
[https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-serial-containerd](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd)
|
||||
|
||||
E2eNode Suite.\[It\] \[sig-node\] CriticalPod \[Serial\] \[Disruptive\] \[NodeFeature:CriticalPod\] when we need to admit a critical pod should add DisruptionTarget condition to the preempted pod \[NodeFeature:PodDisruptionConditions\]
|
||||
E2eNode Suite.\[It\] \[sig-node\] CriticalPod \[Serial\] \[Disruptive\] \[NodeFeature:CriticalPod\] when we need to admit a critical pod should be able to create and delete a critical pod
|
||||
E2eNode Suite.\[It\] \[sig-node\] MirrorPodWithGracePeriod when create a mirror pod and the container runtime is temporarily down during pod termination \[NodeConformance\] \[Serial\] \[Disruptive\] the mirror pod should terminate successfully
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu-serial)
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-containerd\#pull-e2e-serial-ec2-canary](https://testgrid.k8s.io/sig-node-containerd#pull-e2e-serial-ec2-canary)
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv1-containerd-node-e2e-serial](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv1-containerd-node-e2e-serial)
|
||||
|
||||
Not working
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-cri-o\#node-kubelet-cgroupv1-serial-crio](https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-cgroupv1-serial-crio)
|
||||
|
||||
Not working
|
||||
|
||||
[https://testgrid.k8s.io/sig-node-cri-o\#pr-node-kubelet-serial-crio-cgroupv2](https://testgrid.k8s.io/sig-node-cri-o#pr-node-kubelet-serial-crio-cgroupv2)
|
||||
[https://testgrid.k8s.io/sig-node-presubmits\#pr-node-kubelet-serial-containerd](https://testgrid.k8s.io/sig-node-presubmits#pr-node-kubelet-serial-containerd)
|
||||
|
||||
Should not run the NodeSwap
|
||||
|
||||
## Jun 19, 2024 \[Cancelled for holidays\]
|
||||
|
||||
## Jun 12, 2024 \[Cancelled for KEP freeze reviews\]
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
-
|
||||
|
||||
Follow up Items:
|
||||
|
||||
## Jun 5, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=L1rXfz5pJgQ](https://www.youtube.com/watch?v=L1rXfz5pJgQ)
|
||||
|
||||
Hosts:
|
||||
Tests: Anish
|
||||
Bugs triage: Peter Hunt
|
||||
|
||||
Agenda:
|
||||
|
||||
- Release blocking?:
|
||||
- [https://github.com/orgs/kubernetes/projects/68/views/35?sliceBy%5BcolumnId%5D=Labels\&sliceBy%5Bvalue%5D=sig%2Fnode](https://github.com/orgs/kubernetes/projects/68/views/35?sliceBy%5BcolumnId%5D=Labels&sliceBy%5Bvalue%5D=sig%2Fnode)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/125264](https://github.com/kubernetes/kubernetes/issues/125264)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/125264\#issuecomment-2148446172](https://github.com/kubernetes/kubernetes/issues/125264#issuecomment-2148446172)
|
||||
-
|
||||
- [https://github.com/kubernetes/kubernetes/issues/125183](https://github.com/kubernetes/kubernetes/issues/125183)
|
||||
|
||||
|
||||
|
||||
## May 29, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=PSq4VpMSlQ0](https://www.youtube.com/watch?v=PSq4VpMSlQ0)
|
||||
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging:
|
||||
|
||||
Agenda:
|
||||
|
||||
- ~~\[follow-up\] [https://testgrid.k8s.io/sig-node-containerd\#ci-cgroupv1-containerd-node-arm64-e2e-serial-ec2-eks](https://testgrid.k8s.io/sig-node-containerd#ci-cgroupv1-containerd-node-arm64-e2e-serial-ec2-eks)~~
|
||||
- ~~Filed [https://github.com/kubernetes/kubernetes/issues/125173](https://github.com/kubernetes/kubernetes/issues/125173)~~
|
||||
    - ~~Looks like the swap feature was enabled in cgroupv1 jobs, but it is a cgroupv2-only feature?~~
|
||||
- Help to review [https://github.com/kubernetes/kubernetes/pull/124617](https://github.com/kubernetes/kubernetes/pull/124617)
|
||||
|
||||
Follow up:
|
||||
|
||||
- This should have been done: [https://testgrid.k8s.io/sig-node-release-blocking\#node-kubelet-serial-containerd](https://testgrid.k8s.io/sig-node-release-blocking#node-kubelet-serial-containerd) but still failing
|
||||
- [https://github.com/kubernetes/kubernetes/pull/125027](https://github.com/kubernetes/kubernetes/pull/125027)
|
||||
- [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-arm64-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-arm64-ubuntu-serial)
|
||||
- [https://testgrid.k8s.io/sig-node-cri-o\#node-kubelet-cgrpv2-serial-crio\&width=5](https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-cgrpv2-serial-crio&width=5)
|
||||
- Peter will fix by adding more skips
|
||||
- Maybe file a bug
|
||||
- Follow up on sidecar meeting: E2eNode Suite.\[It\] \[sig-node\] \[NodeFeature:SidecarContainers\] Containers Lifecycle should terminate sidecars simultaneously if prestop doesn't exit
|
||||
-
|
||||
-
|
||||
- The test is broken completely: [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv1-containerd-node-e2e-serial\&width=5](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv1-containerd-node-e2e-serial&width=5)
|
||||
|
||||
- Was green with no tests, now failing with timeout: [https://testgrid.k8s.io/sig-node-cri-o\#node-kubelet-cgroupv1-serial-crio\&width=5](https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-cgroupv1-serial-crio&width=5)
|
||||
|
||||
## May 22, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=rPoI3HrTkiM](https://www.youtube.com/watch?v=rPoI3HrTkiM)
|
||||
|
||||
Hosts:
|
||||
Host: Peter
|
||||
Bugs triaging: Peter
|
||||
|
||||
Agenda:
|
||||
|
||||
- [https://github.com/kubernetes/kubernetes/pull/125027](https://github.com/kubernetes/kubernetes/pull/125027) could use approval
|
||||
- [https://github.com/kubernetes/kubernetes/issues/124743](https://github.com/kubernetes/kubernetes/issues/124743) still failing, need to bump cri-o version
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#ci-cgroupv1-containerd-node-arm64-e2e-serial-ec2-eks](https://testgrid.k8s.io/sig-node-containerd#ci-cgroupv1-containerd-node-arm64-e2e-serial-ec2-eks) Has additional failures, needs an issue
|
||||
|
||||
Follow up Items:
|
||||
|
||||
- \[Sergey\] [https://kubernetes.slack.com/archives/C0BP8PW9G/p1716308390271449](https://kubernetes.slack.com/archives/C0BP8PW9G/p1716308390271449)
|
||||
- Looks like something we changed for sidecars
|
||||
- Matthyx was planning to work on a fix
|
||||
|
||||
## May 15, 2024
|
||||
|
||||
* **No meeting, no items on the agenda.**
|
||||
|
||||
## May 8, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=ZlL0yVKJ\_o8](https://www.youtube.com/watch?v=ZlL0yVKJ_o8)
|
||||
|
||||
Hosts:
|
||||
Host: Peter
|
||||
Bugs triaging: Dixita
|
||||
|
||||
Agenda:
|
||||
|
||||
Follow up Items:
|
||||
|
||||
* \[Peter\] Open an issue for [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-fedora](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora) failures
|
||||
* \[Dixita\] Memory usage beyond node allocatable tests failing again: [https://github.com/kubernetes/kubernetes/issues/120646](https://github.com/kubernetes/kubernetes/issues/120646)
|
||||
* [https://github.com/kubernetes/kubernetes/issues/124345](https://github.com/kubernetes/kubernetes/issues/124345) follow up with Swati and Francesco
|
||||
|
||||
## May 1, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=h89s\_z-YmIU](https://www.youtube.com/watch?v=h89s_z-YmIU)
|
||||
|
||||
Hosts:
|
||||
Host:
|
||||
Bugs triaging: Anish
|
||||
|
||||
Agenda:
|
||||
|
||||
Follow up Items:
|
||||
|
||||
## Apr 24, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=MlltvJWa1so](https://www.youtube.com/watch?v=MlltvJWa1so)
|
||||
Hosts:
|
||||
Host: Sergey
|
||||
Bugs triaging: Anish
|
||||
|
||||
Agenda:
|
||||
|
||||
## Apr 17, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=MhMZJvLx3sg](https://www.youtube.com/watch?v=MhMZJvLx3sg)
|
||||
Hosts:
|
||||
Host: Sergey
|
||||
Bugs triaging: Anish
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[Sotiris\] Test PR needing approval
|
||||
- [https://github.com/kubernetes/kubernetes/pull/124097](https://github.com/kubernetes/kubernetes/pull/124097)
|
||||
|
||||
- \[Kevin Hannon\] NVIDIA K80 out of support in May
|
||||
- [https://github.com/kubernetes/test-infra/issues/32242](https://github.com/kubernetes/test-infra/issues/32242)
|
||||
- \[Anish\] [https://github.com/kubernetes/kubernetes/issues/116965](https://github.com/kubernetes/kubernetes/issues/116965)
|
||||
- IIUC, pod status is not updated during graceful node shutdown. Does anyone have historical context on why the pod status is not updated?
|
||||
  - Ryan to reply on the issue to explain which part of this behavior is expected
|
||||
- \[Ed\] ideally we need to extend the e2e test.
|
||||
- \[ryan\] kubelet must be killed before networking is shut down
|
||||
|
||||
Followup
|
||||
|
||||
- \[Sotiris\] It seems worthwhile to improve CPU manager test coverage: [https://github.com/kubernetes/kubernetes/issues/100145](https://github.com/kubernetes/kubernetes/issues/100145). What do you think? How should we proceed with this?
|
||||
|
||||
|
||||
|
||||
- \[anishshah\] \- v1.30 release report
|
||||
\- [github.com/AnishShah/sig-node-flaky-tests/tree/main](https://github.com/AnishShah/sig-node-flaky-tests/tree/main)
|
||||
\- 22/249 sig-node release blocking tests are flaky.
|
||||
|
||||
Apr 10, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=NUzCEC4WuL0](https://www.youtube.com/watch?v=NUzCEC4WuL0)
|
||||
Hosts:
|
||||
Host: ndixita
|
||||
Bugs triaging: Peter Hunt
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[kannon92\] Test PRs needing approval
|
||||
- [https://github.com/kubernetes/kubernetes/pull/123950](https://github.com/kubernetes/kubernetes/pull/123950)
|
||||
- [https://github.com/kubernetes/kubernetes/pull/123386](https://github.com/kubernetes/kubernetes/pull/123386)
|
||||
- [https://github.com/kubernetes/test-infra/pull/32271](https://github.com/kubernetes/test-infra/pull/32271)
|
||||
|
||||
- Cgroup v2 crio jobs
|
||||
- Deprecating cgroup v1 means that we should have 1:1 coverage for cgroup v1 and cgroup v2
|
||||
- [Add corresponding cgroups v2 for node-crio-e2e-features and node-crio-flaky](https://github.com/kubernetes/test-infra/pull/32409)
|
||||
- [crio huge pages cgroup v2](https://github.com/kubernetes/test-infra/pull/32407)
|
||||
- [Resource managers crio cgroup v2](https://github.com/kubernetes/test-infra/pull/32406)
|
||||
|
||||
- \[Ed\] Can we consider triaging [SIG-Node PRs](https://github.com/orgs/kubernetes/projects/49) in this meeting?
|
||||
|
||||
Followup
|
||||
|
||||
* Check which tests need to have coverage for cgroupv2
|
||||
* Consider [Sig Node PRs](https://github.com/orgs/kubernetes/projects/49) triaging : maybe once per month?
|
||||
* [https://github.com/kubernetes/kubernetes/pull/124220](https://github.com/kubernetes/kubernetes/pull/124220)
|
||||
* Sig node : [https://github.com/kubernetes/kubernetes/pull/124229](https://github.com/kubernetes/kubernetes/pull/124229)
|
||||
|
||||
Apr 3, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=wZeHdf3PtMQ](https://www.youtube.com/watch?v=wZeHdf3PtMQ)
|
||||
|
||||
Hosts:
|
||||
Host: skanzhelev
|
||||
Bugs triaging: anish
|
||||
|
||||
Agenda:
|
||||
|
||||
- [Bugs dashboard](https://github.com/orgs/kubernetes/projects/59)
|
||||
- \[ndixita\] Questions to get some context to help deflake the tests and cleanup
|
||||
- [https://github.com/kubernetes/test-infra/pull/32271](https://github.com/kubernetes/test-infra/pull/32271) why did we remove manager jobs from serial tests
|
||||
- Duplicate coverage so fine to remove
|
||||
- Presubmit and periodics are already running these tests
|
||||
- Kubeadm version skew tests in sig-node-kubelet POC
|
||||
- The tests are in sig-cluster-lifecycle and sig-node. Send a PR to remove them from sig-node?
|
||||
- [https://github.com/kubernetes/test-infra/blob/d3f9ee6f4d5b185a7b784533d6a36fab9c8409dc/config/jobs/kubernetes/sig-cluster-lifecycle/kubeadm-kinder-kubelet-x-on-y.yaml\#L356](https://github.com/kubernetes/test-infra/blob/d3f9ee6f4d5b185a7b784533d6a36fab9c8409dc/config/jobs/kubernetes/sig-cluster-lifecycle/kubeadm-kinder-kubelet-x-on-y.yaml#L356)
|
||||
- Swap serial tests are flaky while parallel are not
|
||||
- Ideally, effort should be put into deflaking these tests
|
||||
- History of node-kubernetes-containerd-flaky dashboard \- [https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-flaky](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-flaky) ?
|
||||
|
||||
Follow up Items:
|
||||
|
||||
Mar 27, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=fkWV\_mqcZzs](https://www.youtube.com/watch?v=fkWV_mqcZzs)
|
||||
|
||||
Hosts:
|
||||
Host:
|
||||
Bugs triaging: peter hunt
|
||||
|
||||
- \[SergeyKanzhelev\] [https://github.com/kubernetes/kubernetes/pull/124009\#issuecomment-2013142716](https://github.com/kubernetes/kubernetes/pull/124009#issuecomment-2013142716)
|
||||
- sig-node CI v1.30 release report
|
||||
- [\[Flaking Test\] \[sig-node\] ☂️ node-kubelet-serial-containerd job multiple flakes🌂 · Issue \#120913](https://github.com/kubernetes/kubernetes/issues/120913#issuecomment-1996670691)
|
||||
- These tests are flaky in these dashboards:
|
||||
- \[sig-node-release-blocking\]\[node-kubelet-serial-containerd\]
|
||||
- \[sig-node-kubelet\]
|
||||
- \[sig-node-containerd\]
|
||||
- \[sig-node-crio\]
|
||||
- Manager’s tests \- let’s remove them from the Serial lane
|
||||
- [https://github.com/kubernetes/test-infra/pull/32271](https://github.com/kubernetes/test-infra/pull/32271)
|
||||
- Check CI jobs are working for managers
|
||||
|
||||
- \[Sotiris\] oomkiller\_linux\_test: fix warnings
|
||||
|
||||
- [https://github.com/kubernetes/kubernetes/pull/123908](https://github.com/kubernetes/kubernetes/pull/123908)
|
||||
- Let’s wait until the branch reopens
|
||||
|
||||
Mar 20, 2024
|
||||
|
||||
* Canceled due to KubeCon week
|
||||
|
||||
Mar 13, 2024
|
||||
Recording: [https://www.youtube.com/watch?v=itj3vxg23nk](https://www.youtube.com/watch?v=itj3vxg23nk)
|
||||
Hosts:
|
||||
Host: Dixita (Dixi)
|
||||
Bugs triaging: Anish
|
||||
|
||||
Agenda:
|
||||
|
||||
* \[Dixi\] Removing huge pages from allocatable/capacity [https://github.com/kubernetes/kubernetes/pull/119173](https://github.com/kubernetes/kubernetes/pull/119173)
|
||||
* [Bugs with no priority](https://github.com/kubernetes/kubernetes/issues?q=is%3Aissue+label%3Akind%2Fbug+is%3Aopen+label%3Asig%2Fnode+-label%3Apriority%2Fimportant-soon+-label%3Apriority%2Fimportant-longterm+-label%3Apriority%2Fbacklog+-label%3Apriority%2Fcritical-urgent+-label%3Atriage%2Fneeds-information)
|
||||
* Seeking help to debug Serial crio jobs failures
|
||||
* [https://github.com/kubernetes/kubernetes/pull/123908](https://github.com/kubernetes/kubernetes/pull/123908) (from Sotiris)
|
||||
|
||||
Follow up Items:
|
||||
|
||||
* Talk about the [https://github.com/kubernetes/kubernetes/pull/119173](https://github.com/kubernetes/kubernetes/pull/119173) in Sig node.
|
||||
* Why is /proc/meminfo used to report capacity?
|
||||
* Change to priority/important-soon after assessing the impact.
|
||||
|
||||
## 2024/03/06
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=wCoCEAQqMOY](https://www.youtube.com/watch?v=wCoCEAQqMOY)
|
||||
|
||||
Hosts:
|
||||
Host: Dixita
|
||||
Bugs triaging: Anish
|
||||
|
||||
Agenda:
|
||||
|
||||
- Serial Jobs Failures
|
||||
- OOM
|
||||
- [https://github.com/kubernetes/kubernetes/issues/123589](https://github.com/kubernetes/kubernetes/issues/123589)
|
||||
- Jobs are OOMing due to dd oom.
|
||||
- They also run twice
|
||||
-
|
||||
- \[harche\] \- [https://github.com/kubernetes/kubernetes/issues/123027\#issuecomment-1971147830](https://github.com/kubernetes/kubernetes/issues/123027#issuecomment-1971147830)
|
||||
- Not sure if this is really a bug.
|
||||
- Follow-up from last week:
|
||||
- [https://github.com/kubernetes/kubernetes/issues/122160](https://github.com/kubernetes/kubernetes/issues/122160)
|
||||
- Triaging bugs since we are close to the 1.30 code freeze:
|
||||
- Bugs with [critical-urgent priority](https://github.com/orgs/kubernetes/projects/59?card_filter_query=label%3Apriority%2Fcritical-urgent).
|
||||
- Bugs with [important-soon priority](https://github.com/orgs/kubernetes/projects/59?card_filter_query=label%3Apriority%2Fimportant-soon).
|
||||
- Bugs with [no priority labels and no owner](https://github.com/orgs/kubernetes/projects/59?card_filter_query=-label%3Apriority%2Fimportant-soon+-label%3Apriority%2Fimportant-longterm+-label%3Apriority%2Fcritical-urgent+-label%3Apriority%2Fbacklog+-label%3Apriority%2Fawaiting-more-evidence+no%3Aassignee).
|
||||
|
||||
Follow up Items:
|
||||
|
||||
*
|
||||
|
||||
## 2024/02/28
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=2fqfYwYwRkk](https://www.youtube.com/watch?v=2fqfYwYwRkk)
|
||||
|
||||
Hosts:
|
||||
|
||||
- Tests: ndixita
|
||||
- Bugs: anish
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[esotsal\]
|
||||
- ndixita@ [https://github.com/kubernetes/kubernetes/issues/123313](https://github.com/kubernetes/kubernetes/issues/123313) : \[Failing test\] pull-kubernetes-local-e2e
|
||||
- PR [https://github.com/kubernetes/test-infra/pull/32025](https://github.com/kubernetes/test-infra/pull/32025)
|
||||
- \[pehunt\] quick review request [https://github.com/kubernetes/test-infra/pull/32096](https://github.com/kubernetes/test-infra/pull/32096)
|
||||
|
||||
Follow up
|
||||
|
||||
- ndixita@
|
||||
- [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-ubuntu-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-ubuntu-serial)
|
||||
- OOM killer test
|
||||
- Reach out to Ed: [https://testgrid.k8s.io/sig-node-containerd\#e2e-cos-device-plugin-gpu](https://testgrid.k8s.io/sig-node-containerd#e2e-cos-device-plugin-gpu)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/123491](https://github.com/kubernetes/kubernetes/issues/123491)
|
||||
- Create an issue if it doesn’t exist: [https://testgrid.k8s.io/sig-node-cri-o\#node-kubelet-serial-crio](https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-serial-crio)
|
||||
- Prioritize: Eviction tests: pid returning 0 process count issues: Find related issues
|
||||
- David Porter: [https://github.com/kubernetes/kubernetes/pull/123369](https://github.com/kubernetes/kubernetes/pull/123369)
|
||||
- https://github.com/kubernetes/test-infra/pull/32031 : Add the labels doc
|
||||
- [https://github.com/kubernetes/kubernetes/pull/122927](https://github.com/kubernetes/kubernetes/pull/122927)
|
||||
|
||||
## 2024/02/21
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=e4eRbWiIPN4](https://www.youtube.com/watch?v=e4eRbWiIPN4)
|
||||
|
||||
Hosts:
|
||||
|
||||
- Tests: ndixita
|
||||
- Bugs: ndixita
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[kevin\] A few PRs to review/approve
|
||||
- [Split Disk Presubmits](https://github.com/kubernetes/test-infra/pull/31502)
|
||||
- [Flaky Kubelet Serial Label](https://github.com/kubernetes/test-infra/pull/32031)
|
||||
- [CRIO Eviction cgroupv2](https://github.com/kubernetes/test-infra/pull/32006)
|
||||
- \[ndixita\]
|
||||
- Ubuntu-test-e2e failures: dims@: WIP [https://github.com/kubernetes/kubernetes/issues/123236](https://github.com/kubernetes/kubernetes/issues/123236)
|
||||
- PR on https://github.com/kubernetes/kubernetes/pull/123390
|
||||
- Follow up with sig testing infra
|
||||
[https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv1-containerd-node-e2e-serial](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv1-containerd-node-e2e-serial)
|
||||
[https://testgrid.k8s.io/sig-node-containerd\#node-e2e-features](https://testgrid.k8s.io/sig-node-containerd#node-e2e-features)
|
||||
@ndixita: I don’t think it’s a test-infra issue anymore. Both tests look flaky, but they’re not failing because of test-infra misconfiguration anymore. That issue seems to be fixed starting Feb 15\.
|
||||
- Bugs follow up
|
||||
[https://github.com/kubernetes/kubernetes/issues/122903](https://github.com/kubernetes/kubernetes/issues/122903): Do we provide support for forked repos?
|
||||
Todd Neal: [https://github.com/kubernetes/kubernetes/issues/122902](https://github.com/kubernetes/kubernetes/issues/122902) find and assign
|
||||
- \[esotsal\]
|
||||
- ndixita@ [https://github.com/kubernetes/kubernetes/issues/123313](https://github.com/kubernetes/kubernetes/issues/123313) : \[Failing test\] pull-kubernetes-local-e2e
|
||||
-
|
||||
|
||||
## 2024/02/14
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=g2PgAwVXHwA](https://www.youtube.com/watch?v=g2PgAwVXHwA)
|
||||
|
||||
Hosts:
|
||||
|
||||
- Tests: ndixita
|
||||
- Bugs: ndixita
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[chris\] Adding prow jobs for e2e tests with containerd v2.0
|
||||
- (ndixita): RC already shipped, in March release
|
||||
- Issues: can’t start using containerd straightaway
|
||||
- New features require containerd v2.0, so it is better to add new test tabs and keep both old and new versions running
|
||||
|
||||
- (ndixita) Monitor Ubuntu-test-e2e failures:
|
||||
|
||||
-
|
||||
|
||||
- [https://github.com/kubernetes/kubernetes/issues/123236](https://github.com/kubernetes/kubernetes/issues/123236)
|
||||
|
||||
- Ndixita: SIG Node release testing registry-related failure: follow up with sig testing infra
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv1-containerd-node-e2e-serial](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv1-containerd-node-e2e-serial)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#node-e2e-features](https://testgrid.k8s.io/sig-node-containerd#node-e2e-features)
|
||||
- Ed Bartosh: Device plugin test [https://testgrid.k8s.io/sig-node-kubelet\#kubelet-gce-e2e-swap-fedora-serial](https://testgrid.k8s.io/sig-node-kubelet#kubelet-gce-e2e-swap-fedora-serial)
|
||||
- Ndixita [https://github.com/kubernetes/kubernetes/issues/122903](https://github.com/kubernetes/kubernetes/issues/122903)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/122902](https://github.com/kubernetes/kubernetes/issues/122902) find and assign
|
||||
|
||||
## 2024/02/07
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=3sCGp\_3uU2k](https://www.youtube.com/watch?v=3sCGp_3uU2k)
|
||||
Hosts:
|
||||
|
||||
- Tests: ndixita
|
||||
- Bugs: ndixita
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[Sergey\] Are there serial tests for e2e node? Question is from Sidecar WG meeting
|
||||
- e2e/node: missing tests need to be added
|
||||
- Check [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv1-inplace-pod-resize-containerd-e2e-serial](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv1-inplace-pod-resize-containerd-e2e-serial)
|
||||
- \[Ed\] GPUDevicePlugin: which tests are targeted with this feature
|
||||
- File a bug for testgrid failure
|
||||
- Mark test as flaky and move to a less important tab
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv1-containerd-e2e](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv1-containerd-e2e)
|
||||
- Oom killer tests failing forever
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv1-inplace-pod-resize-containerd-e2e-serial](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv1-inplace-pod-resize-containerd-e2e-serial)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv2-containerd-e2e](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv2-containerd-e2e)
|
||||
- Kevin: [https://testgrid.k8s.io/sig-node-containerd\#node-kubelet-containerd-eviction](https://testgrid.k8s.io/sig-node-containerd#node-kubelet-containerd-eviction)
|
||||
- \[ndixita\] confirm if Device plugin GA feature doesn’t have periodic jobs
|
||||
- Graceful node shutdown doesn’t work with DaemonSets
|
||||
- [https://github.com/kubernetes/kubernetes/issues/122912](https://github.com/kubernetes/kubernetes/issues/122912)
|
||||
- Sig node bugs to discuss
|
||||
- https://github.com/kubernetes/kubernetes/issues/122905
|
||||
|
||||
## 2024/01/31 \[Looking for a host \- canceled if not found\]
|
||||
|
||||
## 2024/01/24
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=i92MuHisqUw](https://www.youtube.com/watch?v=i92MuHisqUw)
|
||||
|
||||
Hosts:
|
||||
|
||||
- Tests: Sergey
|
||||
- Bugs: Sergey
|
||||
|
||||
Agenda:
|
||||
|
||||
* \[Kevin\] PodReadyToStartContainers e2e test PR looking for approval
|
||||
* [https://github.com/kubernetes/kubernetes/pull/121321](https://github.com/kubernetes/kubernetes/pull/121321)
|
||||
* \[Kevin\] Crio-cgroupv2 adding to release-informing
|
||||
* [https://github.com/kubernetes/test-infra/pull/31650](https://github.com/kubernetes/test-infra/pull/31650)
|
||||
* \[Kevin\] ImageFs e2e tests: [https://github.com/kubernetes/kubernetes/pull/121832](https://github.com/kubernetes/kubernetes/pull/121832)
|
||||
* Running using gcp instance (remote=True) fine
|
||||
* CI has node failure with SoftEviction
|
||||
|
||||
Test grid:
|
||||
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#containerd-e2e-ubuntu](https://testgrid.k8s.io/sig-node-containerd#containerd-e2e-ubuntu)
|
||||
- [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv1-containerd-e2e](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv1-containerd-e2e)
|
||||
- [https://testgrid.k8s.io/sig-node-cri-o\#node-kubelet-serial-crio](https://testgrid.k8s.io/sig-node-cri-o#node-kubelet-serial-crio)
|
||||
- [https://github.com/kubernetes/kubernetes/issues/122828](https://github.com/kubernetes/kubernetes/issues/122828)
|
||||
-
|
||||
|
||||
## 2024/01/17
|
||||
|
||||
Recording: [https://youtu.be/A3y\_\_Ivvo1c](https://youtu.be/A3y__Ivvo1c)
|
||||
Hosts:
|
||||
|
||||
- Tests: tzneal
|
||||
- Bugs: peter hunt
|
||||
|
||||
Agenda:
|
||||
|
||||
* Adding CI tests for separate container runtime filesystem and split filesystem
|
||||
* Debugging [https://github.com/kubernetes/kubernetes/pull/121832](https://github.com/kubernetes/kubernetes/pull/121832) has become quite difficult due to hard coding DiskPressure
|
||||
* Have [https://github.com/kubernetes/test-infra/pull/31638](https://github.com/kubernetes/test-infra/pull/31638) to help (needs review/approver). Will clean up once I finish debugging
|
||||
* Added a presubmit for split disk work
|
||||
* [https://github.com/kubernetes/test-infra/pull/31502](https://github.com/kubernetes/test-infra/pull/31502)
|
||||
* \[harche\] Should alpha-features blocking test skip Evented PLEG feature temporarily? [https://github.com/kubernetes/kubernetes/issues/122721\#issuecomment-1895922234](https://github.com/kubernetes/kubernetes/issues/122721#issuecomment-1895922234)
|
||||
- Tzneal \- investigate single group OOM kill failure at [https://testgrid.k8s.io/sig-node-containerd\#cos-cgroupv1-containerd-node-e2e-serial](https://testgrid.k8s.io/sig-node-containerd#cos-cgroupv1-containerd-node-e2e-serial)
|
||||
- Tzneal \- ask sig testing how to cleanup the test grid from old removed test suites
|
||||
- Kevin
|
||||
- [https://github.com/kubernetes/kubernetes/issues/122828\#issue-2086433864](https://github.com/kubernetes/kubernetes/issues/122828#issue-2086433864)
|
||||
- [https://kubernetes.slack.com/archives/C0BP8PW9G/p1705506664344289](https://kubernetes.slack.com/archives/C0BP8PW9G/p1705506664344289)
|
||||
-
|
||||
|
||||
## 2024/01/10 \[Cancelled\]
|
||||
|
||||
Agenda:
|
||||
|
||||
* \[swsehgal\] Looking for some help in promoting sample-device-plugin image
|
||||
* [https://github.com/kubernetes/kubernetes/pull/118534](https://github.com/kubernetes/kubernetes/pull/118534) was merged a while back but the sample device plugin image is still not promoted.
|
||||
* In the past, I had promoted the image ([https://github.com/kubernetes/k8s.io/pull/4862](https://github.com/kubernetes/k8s.io/pull/4862)), but since then we have transitioned to registry.k8s.io, so I am not sure how to obtain the sha of the image corresponding to the latest version of sample-device-plugin.
|
||||
* Has anyone promoted a test image recently?
|
||||
|
||||
## 2024/01/03
|
||||
|
||||
Recording: [https://youtu.be/nw5IhScZGEY](https://youtu.be/nw5IhScZGEY)
|
||||
Hosts:
|
||||
|
||||
- Tests: Sergey
|
||||
- Bugs: Sergey
|
|
@ -0,0 +1,17 @@
|
|||
# Kubernetes SIG-Node CI subgroup notes
|
||||
|
||||
## Jan 8, 2025
|
||||
Hosts:
|
||||
Tests:
|
||||
Bugs triaging
|
||||
|
||||
Agenda:
|
||||
|
||||
- Test-Infra Cleanup
|
||||
- NodeSpecialFeature / NodeAlphaFeature has been dropped and CI looks great
|
||||
- Replicate NodeFeature and Feature
|
||||
- [https://github.com/kubernetes/kubernetes/pull/129166](https://github.com/kubernetes/kubernetes/pull/129166)
|
||||
- CI Containerd 2 migration
|
||||
- [https://github.com/kubernetes/test-infra/issues/34063](https://github.com/kubernetes/test-infra/issues/34063)
|
||||
- Job config generation
|
||||
- [https://github.com/kubernetes/test-infra/pull/34010](https://github.com/kubernetes/test-infra/pull/34010)
|
|
@ -0,0 +1,886 @@
|
|||
# SIG Node Meeting Notes
|
||||
|
||||
## Dec 31, 2024
|
||||
|
||||
* Cancelled
|
||||
|
||||
## Dec 24, 2024
|
||||
|
||||
* Cancelled
|
||||
|
||||
## Dec 17, 2024
|
||||
|
||||
* Cancelled
|
||||
|
||||
## Dec 10, 2024
|
||||
|
||||
- 1.32 retro: [SIG Node 1.32 retro](https://docs.google.com/document/d/1CM_WLChPzAx2VLFCXR0RNvNcqeV0OZKL8iqlBFVrpHY/edit?tab=t.0#heading=h.gmyk4fetligd)
|
||||
|
||||
## Dec 3, 2024
|
||||
|
||||
- \[minna\] asking for some PR feedback [https://github.com/kubernetes/kubernetes/pull/125918](https://github.com/kubernetes/kubernetes/pull/125918)
|
||||
- \[Peter\] We should add a feature gate beta \+ on by default
|
||||
- \[Francesco\] \+ 1 and we should extend
|
||||
- Maybe wait for critical pods to be ready and not just started before we try to start non critical pods
|
||||
- \[Sergey\] Similarly we could extend logic for admission
|
||||
- \[Sergey\] It’s possible this PR may switch starting failure to admission failure (if critical pod starts and fails, the pods that rely on them will fail differently)
|
||||
- \[Sergey\] Add agenda items ASAP, as we will cancel the meeting aggressively in December
|
||||
|
||||
## Nov 26, 2024 \[Canceled due to US holiday\]
|
||||
|
||||
## Nov 19, 2024 \[Canceled due to lack of agenda\]
|
||||
|
||||
## Nov 12, 2024 \[Canceled for KubeCon\]
|
||||
|
||||
## Nov 5, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=1u\_yKruHeZU](https://www.youtube.com/watch?v=1u_yKruHeZU)
|
||||
|
||||
* \[danwinship/surya\] [Redesigning Kubelet Probes](https://docs.google.com/presentation/d/1XujDtyhIkZ7FPDPck9qou-L1O80a_IomCGNf6I5E9X4/edit#slide=id.p)
|
||||
|
||||
* antonio had opened an issue for runtime to do the checks
|
||||
* when the kubelet requests the runtime to do a probe
|
||||
* launching new pods and containers would be heavy
|
||||
* can we re-use the container-monitor process here ? instead of adding new ones?
|
||||
* tcp/http/grpc types of probes
|
||||
* would containerd/cri-o be able to do those probes?
|
||||
* \[mrunal\] containerd would have to learn the split of daemon
|
||||
* \[dawn\] the pod sounds better than what we have today?
|
||||
* cost to the user, though: application-level usage is unpredictable \- this is no worse than what we have today, but there is added complexity for the user (in the per-container case)
|
||||
* probing pod is part of system overhead
|
||||
* will this be a new type of probe? replacement of existing probes?
|
||||
* if it’s a pod probe then some features like ensuring the port is open might be lost?
|
||||
* so maybe we should keep both types of probes and users can
|
||||
* Performance should not regress
|
||||
* checking a file in the filesystem and letting users put what they want?
|
||||
* \[tallclair\] In-Place Pod Resize: status update
|
||||
* [Beta Dashboard](https://github.com/orgs/kubernetes/projects/178/views/1?filterQuery=is%3Aissue+-status%3Adone+-is%3Aclosed+roadmap%3Abeta+&visibleFields=%5B%22Title%22%2C%22Assignees%22%2C%22Status%22%2C145416682%2C87750681%2C%22Linked+pull+requests%22%5D&sortedBy%5Bdirection%5D=asc&sortedBy%5BcolumnId%5D=145416682)
|
||||
|
||||
## Oct 29, 2024 (Canceled)
|
||||
|
||||
Canceled due to lack of agenda.
|
||||
|
||||
## Oct 22, 2024
|
||||
|
||||
- \[Kevin Hannon in place of dims\] cadvisor for 1.32
|
||||
- [https://github.com/google/cadvisor/pull/3609](https://github.com/google/cadvisor/pull/3609) ( Reduce the dependencies we drag into cadvisor AND drag into k/k through cadvisor )
|
||||
- [https://github.com/google/cadvisor/pull/3608](https://github.com/google/cadvisor/pull/3608) ( help the periodic CI job to recover )
|
||||
- fix for [https://github.com/google/cadvisor/issues/3577](https://github.com/google/cadvisor/issues/3577) as well
|
||||
- Release may be needed
|
||||
- https://kubernetes.slack.com/archives/C0BP8PW9G/p1729517493050419
|
||||
- \[Kevin Hannon\] Swap Based Eviction
|
||||
- [https://github.com/kubernetes/kubernetes/pull/128137](https://github.com/kubernetes/kubernetes/pull/128137)
|
||||
-
|
||||
- \[Lakshmi\] Requesting review and feedback on PR
|
||||
- [https://github.com/kubernetes/website/pull/48001](https://github.com/kubernetes/website/pull/48001)
|
||||
- \[pehunt\] libcontainer \+ runc \+ k8s
|
||||
- two pieces
|
||||
- runc 1.2.0 just came out, k8s wants to use it (to get PSI stats) but there are concerns about containerd using a different libcontainer version from cadvisor
|
||||
- [https://kubernetes.slack.com/archives/C0BP8PW9G/p1729606639892799](https://kubernetes.slack.com/archives/C0BP8PW9G/p1729606639892799)
|
||||
- [https://cloud-native.slack.com/archives/CGEQHPYF4/p1729607023643899](https://cloud-native.slack.com/archives/CGEQHPYF4/p1729607023643899)
|
||||
- [https://github.com/google/cadvisor/pull/3083\#issuecomment-2429370533](https://github.com/google/cadvisor/pull/3083#issuecomment-2429370533)
|
||||
- Do we need to wait for 1.2.0 in 2.0, or can we backport, or can we run disjoint? We’ve waited a long time for 1.2.0 and I’d like to use it
|
||||
- libraryfication of libcontainer: currently, we’re vendoring runc libcontainer in k8s, and this means we’re version locked with the runc binary (which doesn’t have k8s as a priority with release cadence)
|
||||
- Discussions on moving the libcontainer/cgroups library out of runc and into its own repo [https://github.com/kubernetes/kubernetes/issues/128157](https://github.com/kubernetes/kubernetes/issues/128157)
|
||||
- [Peter Hunt](mailto:pehunt@redhat.com) to send an email to sig-node mailing list to notify folks of this plan
|
||||
- part of this plan [https://github.com/kubernetes/kubernetes/pull/128245](https://github.com/kubernetes/kubernetes/pull/128245)
|
||||
-
|
||||
|
||||
## Oct 15, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=MyOhDhHRRKk](https://www.youtube.com/watch?v=MyOhDhHRRKk)
|
||||
|
||||
- \[Sergey\] New Feature Gates emulation mode and features GA: [https://github.com/kubernetes/kubernetes/pull/126981\#discussion\_r1799779745](https://github.com/kubernetes/kubernetes/pull/126981#discussion_r1799779745)
|
||||
- Should we keep removing code in kubelet as before? Or just keep it around the same way as we do for API server to minimize possible errors and simply not test it?
|
||||
- \[Chris\] A demo for k8s dynamic batch workloads:
|
||||
[https://github.com/chrishenzie/k8s-dynamic-batch-demo](https://github.com/chrishenzie/k8s-dynamic-batch-demo)
|
||||
- \[pehunt\] (defer to end, only if there’s time) beginnings of swap aware eviction discussion
|
||||
-
|
||||
|
||||
## Oct 8, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=\_Zexxr4pxr8](https://www.youtube.com/watch?v=_Zexxr4pxr8)
|
||||
|
||||
- \[Sergey\] Containerd 2.0 and KEPs: [https://groups.google.com/g/kubernetes-sig-architecture/c/kft-wa929\_Q](https://groups.google.com/g/kubernetes-sig-architecture/c/kft-wa929_Q)
|
||||
- Are we promoting to beta with a single runtime implementation?
|
||||
- What is the production test requirement for the feature? (in case of 2.0 \- how do we measure exposure of the feature to prod?)
|
||||
- ~~\[fromani\] Heads up: KEP 4885 will introduce a new memory manager policy~~
|
||||
- ~~the windows and linux will support different policies~~
|
||||
- ~~do we prefer to postpone the memory manager GA graduation?~~
|
||||
- \[fromani\] unblocking [https://github.com/kubernetes/kubernetes/issues/70585](https://github.com/kubernetes/kubernetes/issues/70585) with a [feature gate](https://kubernetes.slack.com/archives/C5P3FE08M/p1727072265347319)?
|
||||
- \[Eddie\] Request for KEP review: [Mutable CSINode Allocatable Property](https://github.com/kubernetes/enhancements/issues/4876)
|
||||
- \[pehunt\] FYI for approvers: two new KEPs have been added to the milestone and don’t have an approver
|
||||
- [https://github.com/kubernetes/enhancements/issues/3619](https://github.com/kubernetes/enhancements/issues/3619)
|
||||
- [https://github.com/kubernetes/enhancements/issues/4753](https://github.com/kubernetes/enhancements/issues/4753)
|
||||
- \[fromani\] [https://github.com/kubernetes/enhancements/issues/4885](https://github.com/kubernetes/enhancements/issues/4885) lacks approver also. I’m reviewing and almost LGTM (almost \= need a final pass but no outstanding issues after last update)
|
||||
- \[pehunt\] according to [The KEP board](https://github.com/orgs/kubernetes/projects/186/views/7?filterQuery=status%3A%22Considered+for+release%22+cpu&visibleFields=%5B%22Title%22%2C%22Assignees%22%2C%22Status%22%2C126885563%2C130447354%2C130446877%2C130446939%2C130446997%2C133731923%2C133734297%5D&sortedBy%5Bdirection%5D=asc&sortedBy%5BcolumnId%5D=126885563&sortedBy%5Bdirection%5D=desc&sortedBy%5BcolumnId%5D=130447354) it’s [Mrunal Patel](mailto:mpatel@redhat.com)
|
||||
|
||||
## Oct 1, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=8YWCql6rLLk](https://www.youtube.com/watch?v=8YWCql6rLLk)
|
||||
|
||||
- KEP planning part 2
|
||||
- \[ndixita\] Pod Level Resources [\[Critical Scenarios\] Pod Level Resources](https://docs.google.com/presentation/d/1X6U81dzs_j3N0Wtu4ftJ6OU2g2pSL5_QsUTwN9sI9T4/edit?usp=sharing)
|
||||
- \[Lakshmi\] When is container garbage collection deprecated? Is there an alternative recommended way to do container garbage collection?
|
||||
- \[tjons\] run an initContainer only once per rollout of the deployment, not on every scheduled pod.
|
||||
|
||||
## Sep 24, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=GkPrY56\_gB4](https://www.youtube.com/watch?v=GkPrY56_gB4)
|
||||
|
||||
- \[jsturtevant\] Windows KEP updates for Cpu/memory manager: [https://github.com/kubernetes/enhancements/pull/4738](https://github.com/kubernetes/enhancements/pull/4738)
|
||||
- \[tallclair\] InPlacePodVerticalScaling discussion \- part 2 ([slides](https://docs.google.com/presentation/d/1vwwOeMxGPp1woJsI5rh89O8xzvrDK2VBqqMi7J2cpCM/edit#slide=id.p))
|
||||
- KEP: [https://github.com/kubernetes/enhancements/pull/4704](https://github.com/kubernetes/enhancements/pull/4704)
|
||||
- \[johnbelamaric\] Quick PSA: Unless a strong use case comes forward, we plan to remove “classic DRA” in 1.32. See [https://github.com/kubernetes/enhancements/issues/3063\#issuecomment-2305446451](https://github.com/kubernetes/enhancements/issues/3063#issuecomment-2305446451)
|
||||
- Reach out to [kklues@nvidia.com](mailto:kklues@nvidia.com) if you have any questions
|
||||
- \[Lakshmi\] Requesting review and feedback on PR
|
||||
- [https://github.com/kubernetes/website/pull/48001](https://github.com/kubernetes/website/pull/48001)
|
||||
|
||||
## Sep 17, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=iH6KVk9B5DE](https://www.youtube.com/watch?v=iH6KVk9B5DE)
|
||||
|
||||
- \[pehunt\] KEP planning
|
||||
- [Planning table](https://github.com/orgs/kubernetes/projects/186/views/7?filterQuery=status%3A%22Draft+Stage%22%2C%22Proposed+for+consideration%22+&visibleFields=%5B%22Title%22%2C%22Status%22%2C130447354%2C126885563%2C130446877%2C130446939%2C%22Reviewers%22%2C130446997%5D&sortedBy%5Bdirection%5D=desc&sortedBy%5BcolumnId%5D=130447354&sortedBy%5Bdirection%5D=asc&sortedBy%5BcolumnId%5D=126885563)
|
||||
|
||||
## Sep 10, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=9AfQA0DYR0E](https://www.youtube.com/watch?v=9AfQA0DYR0E)
|
||||
|
||||
- \[pehunt\] KEP planning [KEP Board](https://github.com/orgs/kubernetes/projects/186/views/1)
|
||||
|
||||
- \[lauralorenz\] CrashLoopBackOff KEP for 1.32 ([slides](https://docs.google.com/presentation/d/16itbKQiClbP2L7vbBCASEC5Oz6qRKxzLmcohM5_efCQ/edit?slide=id.p#slide=id.g2fb40fe0c6f_0_0) 6-10)
|
||||
- \[harche\] \- Looking for reviews [https://github.com/kubernetes/kubernetes/pull/125982](https://github.com/kubernetes/kubernetes/pull/125982)
|
||||
- This especially affects users with a high number of CPUs per node
|
||||
- \[tallclair\] InPlacePodVerticalScaling discussion ([slides](https://docs.google.com/presentation/d/1vwwOeMxGPp1woJsI5rh89O8xzvrDK2VBqqMi7J2cpCM/edit#slide=id.p))
|
||||
- KEP: [https://github.com/kubernetes/enhancements/pull/4704](https://github.com/kubernetes/enhancements/pull/4704)
|
||||
- \[SergeyKanzhelev\] [https://github.com/kubernetes/enhancements/issues/3386\#issuecomment-2337050862](https://github.com/kubernetes/enhancements/issues/3386#issuecomment-2337050862) Do we want to remove this code for now?
|
||||
- \[T-Lakshmi\] \- Looking for feedback/answers on the questions in [https://github.com/kubernetes/kubernetes/issues/127157](https://github.com/kubernetes/kubernetes/issues/127157): Is the container GC policy replaced by any functionality in the evictionHard and evictionSoft policies, or is it completely deprecated? What are the future plans for the container garbage collection policy?
|
||||
|
||||
|
||||
## Sep 3, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=E8sw-fybnKc](https://www.youtube.com/watch?v=E8sw-fybnKc)
|
||||
|
||||
- \[ndixita\] Pod level resources KEP discussion
|
||||
[\[Public\] Effective Resources & OOM Kill Behavior](https://docs.google.com/document/d/1q3UaDO5wrfP3vZFuDjpZjapYr4k_ui4SAF0s4HAIjKQ/edit?usp=sharing)
|
||||
* OOM group \-\> Pod kill \-\> in the next iteration of KEP
|
||||
- \[lauralorenz\] CrashLoopBackOff KEP for 1.32 ([slides](https://docs.google.com/presentation/d/16itbKQiClbP2L7vbBCASEC5Oz6qRKxzLmcohM5_efCQ/edit?slide=id.p#slide=id.g2fb40fe0c6f_0_0) 6-10) \[bumped to next week but feel free to take a look at slides or discuss x-post in slack\]
|
||||
- \[sreeram-venkitesh\] Zero values for [Sleep Action of PreStop Hook](https://github.com/kubernetes/enhancements/issues/3960)
|
||||
- [KEP-4818: Allow zero value for Sleep Action of PreStop Hook](https://docs.google.com/document/d/1o01gH2ELPY-kjwgemjg3BbWiykx37ag-4VYwwJ1jBTw/edit?usp=sharing)
|
||||
- Draft PR to discuss changes: [https://github.com/kubernetes/kubernetes/pull/127094](https://github.com/kubernetes/kubernetes/pull/127094)
|
||||
- Do we need to do anything particular with rollback of the feature?
|
||||
- Probably not, at least for the kubelet
|
||||
- \[pranav\] Kubelet idle threads [issue](https://github.com/kubernetes/kubernetes/issues/123275)
|
||||
\- raised this [issue](https://github.com/golang/go/issues/68993) in golang upstream
|
||||
\- How can kubelet threads and memory be controlled through Go runtime variables? Is there any other way to do it?
|
||||
|
||||
- \[Kevin Hannon\] [KEP Board](https://github.com/orgs/kubernetes/projects/186/views/1)
|
||||
- Open it up for public viewing?
|
||||
- \[pehunt\] inspired by the release team, we’ve updated the [tracking board](https://github.com/orgs/kubernetes/projects/186/views/7?filterQuery=-status%3ADone+-status%3A%22Not+for+release%22++-status%3ARemoved&visibleFields=%5B%22Title%22%2C%22Status%22%2C130447354%2C130446997%2C130446939%2C%22Reviewers%22%2C130446877%5D&sortedBy%5Bdirection%5D=desc&sortedBy%5BcolumnId%5D=Status) to have more columns
|
||||
|
||||
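The zero-value question for the PreStop sleep action above can be pictured as a relaxed bounds check. This is a minimal sketch, assuming the existing rule is "positive and no longer than the termination grace period"; the function name and the `allow_zero` flag (standing in for a feature gate) are hypothetical and are not the kubelet or API server code.

```python
def validate_sleep_seconds(seconds, grace_period_seconds, allow_zero=False):
    """Return a list of validation error strings (an empty list means valid).

    allow_zero is a hypothetical stand-in for the feature gate discussed in
    the notes; it only lowers the minimum accepted value from 1 to 0.
    """
    minimum = 0 if allow_zero else 1
    errors = []
    if seconds < minimum:
        errors.append(f"sleep seconds must be at least {minimum}, got {seconds}")
    if seconds > grace_period_seconds:
        errors.append(
            f"sleep seconds must not exceed the grace period ({grace_period_seconds})"
        )
    return errors
```

In this sketch a value of zero that was accepted with the flag on is rejected once the flag is off again, which is the kind of case the rollback question is about.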
|
||||
## Aug 27, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=wGbkByo\_NBI](https://www.youtube.com/watch?v=wGbkByo_NBI)
|
||||
|
||||
- \[lauralorenz\] CrashLoopBackOff KEP ([slides](https://docs.google.com/presentation/d/16itbKQiClbP2L7vbBCASEC5Oz6qRKxzLmcohM5_efCQ/edit?usp=sharing))
|
||||
- updates and changes since 1.31 \[5 minutes\]
|
||||
- some discussion on path forward \[10-15 mins if I can get it\]
|
||||
- \[pehunt\] KEP wrangler brainstorm
|
||||
- [SIG Node KEP Wrangler Brainstorming](https://docs.google.com/document/d/1CypSsxdowXk0PmoYLF7h5Q91MjiUPibsLFmwUVR4fKs/edit?usp=sharing)
|
||||
|
||||
## Aug 20, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=KUw2kSFsf2U](https://www.youtube.com/watch?v=KUw2kSFsf2U)
|
||||
|
||||
- \[vinayakankugoyal\] [https://github.com/kubernetes/enhancements/pull/4760/files\#r1699209363](https://github.com/kubernetes/enhancements/pull/4760/files#r1699209363)
|
||||
- Break permissions into smaller buckets so that users can get access to things like healthz without also being able to exec into a pod
|
||||
- We currently don’t commit to supporting these endpoints, but they are being used as if we had. Should we group the endpoints by function to be less prescriptive about what a user gets access to, so we keep the freedom to change them?
|
||||
- \[Peter\] Can we break into read-only/read-write?
|
||||
- Some of the “read-only” endpoints can still be risky to give access to
|
||||
- \[Dawn\] There had been talks in the past about deprecating some of the endpoints
|
||||
- \[Tim\] We’ve been talking about doing so for so long, maybe we do this now instead of trying to find the perfect APIs
|
||||
- \[Tim\] Maybe use healthz as the bucket?
|
||||
- \[Sergey\] should documenting the endpoint be part of this KEP?
|
||||
- Tim volunteered to review/approve
|
||||
- KEP to live under SIG-Auth
|
||||
- \[Kevin Hannon\] different OCI runtime with NodeConformance
|
||||
- [https://github.com/kubernetes/kubernetes/issues/126639](https://github.com/kubernetes/kubernetes/issues/126639)
|
||||
- Presubmit: [https://github.com/kubernetes/test-infra/pull/33298](https://github.com/kubernetes/test-infra/pull/33298)
|
||||
- Periodic: [https://github.com/kubernetes/test-infra/pull/33297](https://github.com/kubernetes/test-infra/pull/33297)
|
||||
- Maybe we should add these tests in CRI-O upstream instead of k8s, to reduce overhead on upstream CI
|
||||
- \[Kevin\] If we switch to crun by default in CRI-O, can we switch upstream k8s tests to crun as well?
|
||||
- \[Sergey\] as long as the test failures are looked into and addressed quickly
|
||||
- Run two versions at the same time, and then eventually switch the crun jobs to be the blocking ones
|
||||
- \[SergeyKanzhelev\] Some org updates:
|
||||
- New google groups will be used soon:
|
||||
- [https://github.com/kubernetes/k8s.io/pull/7140](https://github.com/kubernetes/k8s.io/pull/7140)
|
||||
- [https://github.com/kubernetes/k8s.io/pull/7160](https://github.com/kubernetes/k8s.io/pull/7160)
|
||||
- New version of GitHub projects:
|
||||
- [https://github.com/orgs/kubernetes/projects/151](https://github.com/orgs/kubernetes/projects/151)
|
||||
- [https://github.com/orgs/kubernetes/projects/184](https://github.com/orgs/kubernetes/projects/184)
|
||||
- [https://github.com/orgs/kubernetes/projects/185](https://github.com/orgs/kubernetes/projects/185)
|
||||
- \[yuanliangzhang\] Windows Node graceful shutdown
|
||||
- KEP enhancement draft:
|
||||
[https://github.com/zylxjtu/enhancements/blob/master/keps/sig-node/2000-graceful-node-shutdown/README.md\#background-on-windows-shutdown](https://github.com/zylxjtu/enhancements/blob/master/keps/sig-node/2000-graceful-node-shutdown/README.md#background-on-windows-shutdown)
|
||||
- POC [shoutdown poc · zylxjtu/kubernetes@854ea4b (github.com)](https://github.com/zylxjtu/kubernetes/commit/854ea4bde88c0905241b43f5f80d470967bb909f)
|
||||
- Should we have a new KEP or keep this within the existing KEP?
|
||||
- Needs a reviewer from kubelet side
|
||||
- \[Dawn\] Most of the reviewers in SIG-Node focus on Linux
|
||||
- \[Dawn\] Are there any Windows version requirements?
|
||||
- \[Lin\] I don’t think so
|
||||
- \[Lin\] How far back do we need to support specific versions of Windows nodes?
|
||||
- \[Mark\] probably Windows 2019
|
||||
- \[Mark\] We didn’t add support before because termination wasn’t working right on Windows; that’s fixed now
|
||||
- \[Peter\] [https://github.com/kubernetes/enhancements/pull/4738](https://github.com/kubernetes/enhancements/pull/4738) can be used as a baseline for KEP process
|
||||
- \[Sergey\] If we tie this to the Linux version, we may be blocked on Windows to GA graceful shutdown
|
||||
- \[Sergey\] Ideally, we GA ASAP
|
||||
- \[Mrunal\] Instead of additional KEP, we could add another feature gate
|
||||
- \[Sergey\] Feature gate will still block KEP graduation :-(
|
||||
- endpoints problem: [https://github.com/kubernetes/kubernetes/issues/116965](https://github.com/kubernetes/kubernetes/issues/116965)
|
||||
- \[sotiris\] Triage decision for *Minimum CPU request is displayed when only memory request is configured*
|
||||
- [https://github.com/kubernetes/kubernetes/issues/126195](https://github.com/kubernetes/kubernetes/issues/126195)
|
||||
- \[iholder101\]
|
||||
- Swap debuggability, a long-running discussion \- asking to defer to follow-up KEPs: [https://github.com/kubernetes/kubernetes/pull/125278](https://github.com/kubernetes/kubernetes/pull/125278) and specifically [this comment](https://github.com/kubernetes/kubernetes/pull/125278#issuecomment-2268850735)
|
||||
- [https://github.com/kubernetes/enhancements/pull/4701](https://github.com/kubernetes/enhancements/pull/4701) \- GA plans for swap (KEP-2400)
|
||||
- Dawn to follow up offline
|
||||
- \[SergeyKanzhelev\] [https://github.com/orgs/kubernetes/projects/186/views/1](https://github.com/orgs/kubernetes/projects/186/views/1)
|
||||
- AI:
|
||||
- Either move to proposed for consideration
|
||||
- Or Not for release
|
||||
- \[torredil\] Ensure volumes are unmounted during graceful node shutdown: [https://github.com/kubernetes/kubernetes/pull/125070](https://github.com/kubernetes/kubernetes/pull/125070)
|
||||
- Dawn/Mrunal to look and hopefully approve
|
||||
- \[Mrunal\] Maybe add this to Clayton’s document
|
||||
|
||||
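The kubelet endpoint permission discussion above is easier to follow with the coarse path-to-subresource mapping in view. This is a minimal sketch, assuming the mapping described in the kubelet authorization documentation (`/stats`, `/metrics`, `/logs` and `/spec` get their own `nodes` subresource, everything else falls under `proxy`); the `healthz` bucket and the `fine_grained` flag are hypothetical and only illustrate the proposal, they are not an agreed design.

```python
# Coarse mapping: any path without a dedicated prefix lands in the broad
# "proxy" bucket, which also covers exec-style endpoints.
COARSE_PREFIXES = {
    "/stats": "stats",
    "/metrics": "metrics",
    "/logs": "log",
    "/spec": "spec",
}

# Hypothetical extra bucket, used only to illustrate the idea from the notes.
FINE_GRAINED_PREFIXES = {"/healthz": "healthz"}


def subresource_for_path(path, fine_grained=False):
    """Return the nodes/<subresource> bucket a kubelet request path maps to."""
    prefixes = dict(COARSE_PREFIXES)
    if fine_grained:
        prefixes.update(FINE_GRAINED_PREFIXES)
    for prefix, subresource in prefixes.items():
        # Match the prefix itself or anything below it, but not e.g. "/statsfoo".
        if path == prefix or path.startswith(prefix + "/"):
            return subresource
    return "proxy"
```

With the coarse mapping, granting access to `/healthz` means granting `proxy`, which is the over-broad permission the notes describe; with the extra bucket, `/healthz` can be granted on its own while exec paths stay under `proxy`.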
## Aug 13, 2024 (cancelled)
|
||||
|
||||
MEETING IS CANCELED TODAY due to lack of agenda and vacations
|
||||
|
||||
## Aug 6, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=K-eBDYfHiTM](https://www.youtube.com/watch?v=K-eBDYfHiTM)
|
||||
|
||||
- \[SergeyKanzhelev\] Files kubelet uses: [https://github.com/kubernetes/website/pull/46359/files\#r1600516887](https://github.com/kubernetes/website/pull/46359/files#r1600516887) Docs request
|
||||
- The list in the comment is almost (?) complete
|
||||
- We should document files, but recommending removal of all of them is overkill
|
||||
- Mrunal: maybe even have a clean up command that will clean up those files.
|
||||
- Cleaning up on startup of kubelet \- maybe we need a KEP
|
||||
- Dawn: Kubelet should be responsible for its own files. Other files are created by plugins, which might not clean up properly, and the kubelet has no way to ensure that they do. In that case the K8s vendor is responsible for those files, not the kubelet.
|
||||
- Also end users are not reporting issues back to upstream if they experience issues.
|
||||
- Peter: if kubelet creates a file it should be responsible for deleting it. If file is owned by plugin, kubelet should be resilient to those.
|
||||
- rphillips: Ideally, the plugin’s initialization function should handle cleanup
|
||||
- \[SergeyKanzhelev\] SergeyKanzhelev for approver: [https://github.com/kubernetes/kubernetes/pull/126551](https://github.com/kubernetes/kubernetes/pull/126551)
|
||||
- \[pehunt\] SIG Chair proposal
|
||||
- \[SergeyKanzhelev\] [SIG Node responsiveness improvements](https://docs.google.com/document/d/1AavM205LRi-RNB2xQRduzDo_ChTZ8clacKYtgsbfSAQ/edit#heading=h.tqzvzsaxo0sl)
|
||||
- \[pacoxu\] [issues/116799\#issuecomment-2249301937](https://github.com/kubernetes/kubernetes/issues/116799#issuecomment-2249301937)
|
||||
- In [kubernetes/system-validators\#37](https://github.com/kubernetes/system-validators/pull/37), we refer to kernel long term support: [https://wiki.linuxfoundation.org/civilinfrastructureplatform/start](https://wiki.linuxfoundation.org/civilinfrastructureplatform/start) and [https://endoflife.date/linux](https://endoflife.date/linux)
|
||||
- 4.4 & 4.19 are selected as kernel Super Long Term Support (SLTS) versions, and the Civil Infrastructure Platform (CIP) will provide support until at least 2026\.
|
||||
- For [cgroup v2](https://kubernetes.io/docs/concepts/architecture/cgroups/), Kubernetes recommends kernel 5.8 or later; in the [runc docs](https://github.com/opencontainers/runc/blob/main/docs/cgroup-v2.md), the minimum version is 4.15 and 5.2+ is recommended.
|
||||
- 4.5 starts supporting cgroup v2 io, memory & pids (kernel 4.5 announced that cgroup v2 is no longer experimental)
|
||||
- 4.15 starts supporting cgroup v2 cpu
|
||||
- 4.20 adds PSI support; [KEP-4205](https://github.com/kubernetes/enhancements/issues/4205) is not alpha yet (only the KEP was merged)
|
||||
- 5.2 starts supporting cgroup v2 freezer
|
||||
- 5.8: the [root \`cpu.stat\` file on cgroup v2](https://github.com/kubernetes/kubernetes/issues/103759#issuecomment-926024150) was only added in 5.8.
|
||||
|
||||
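The kernel version list above can be read as a lookup table. This is a minimal sketch, assuming the thresholds exactly as written in these notes (they are not re-checked against kernel documentation); the table and function names are hypothetical.

```python
import re

# (minimum kernel version, cgroup v2 capability), copied from the notes above.
CGROUP_V2_MILESTONES = [
    ((4, 5), "io, memory and pids controllers"),
    ((4, 15), "cpu controller"),
    ((4, 20), "PSI"),
    ((5, 2), "freezer"),
    ((5, 8), "root cpu.stat file"),
]


def parse_kernel_version(release):
    """Extract (major, minor) from a release string like '5.15.0-101-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def cgroup_v2_features(release):
    """List the capabilities from the table that the given kernel has reached."""
    version = parse_kernel_version(release)
    return [name for minimum, name in CGROUP_V2_MILESTONES if version >= minimum]
```

For the 4.19 SLTS kernel mentioned above, this returns only the first two entries, which shows why a 4.x minimum would leave PSI, freezer and the root `cpu.stat` file unavailable. Distribution kernels may backport features, so a version comparison like this is only a heuristic.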
## July 30, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=JGYTQbs6eJk](https://www.youtube.com/watch?v=JGYTQbs6eJk)
|
||||
|
||||
- \[Peter Hunt\] Retrospective from 1.31 release
|
||||
- [SIG Node 1.31 retro](https://docs.google.com/document/d/16Ek41L3ocMDmeJTDTp108PBULaZ6ffEWU8aoEQ8q-DU/edit)
|
||||
- Previous retrospectives:
|
||||
- There was no retro for 1.29 or 1.30
|
||||
- [SIG Node 1.28 retro](https://docs.google.com/document/d/1NaT0rY0o1cNdTxIlgZ5m0TqLDI7AfYn3rBAAl4qT1Bw/edit#heading=h.vn5jbyiup6d)
|
||||
- [SIG Node 1.27 retro](https://docs.google.com/document/d/1DxJH1w_lrEOfflR-TED1vjc0ZYIXO0aDb3vJXPMKCdY/edit#heading=h.99i0ua4v77ap)
|
||||
|
||||
## July 23, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=Wc7yrCLILK8](https://www.youtube.com/watch?v=Wc7yrCLILK8)
|
||||
|
||||
- \[fromani\]\[on behalf of sphrasavath\] resuming work on [KEP 2621](https://github.com/kubernetes/enhancements/issues/2621): Enhance CPU manager with L3 cache awareness
|
||||
- pivot from a new cpumanager policy to a new cpumanager policy option
|
||||
- revised design doc (comment from the enh issue: [https://docs.google.com/document/d/1LpnMjGNsQyHOuVHMktIrjZsdRw9aKZ8djt354nAno6M/edit?usp=sharing](https://docs.google.com/document/d/1LpnMjGNsQyHOuVHMktIrjZsdRw9aKZ8djt354nAno6M/edit?usp=sharing) )
|
||||
- \[Sunnat\] On behalf of Marsik: do not set CPU quota for guaranteed pods
|
||||
- [https://github.com/kubernetes/kubernetes/pull/117030](https://github.com/kubernetes/kubernetes/pull/117030)
|
||||
- \[pehunt\]: ProcMount disabled, or UserNamespaces enabled?
|
||||
- [https://github.com/kubernetes/kubernetes/pull/126291](https://github.com/kubernetes/kubernetes/pull/126291)
|
||||
|
||||
## July 16, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=0iPCt\_FZxSk](https://www.youtube.com/watch?v=0iPCt_FZxSk)
|
||||
|
||||
- \[dawnchen\] FYI: [\[PUBLIC\] Kubernetes: Disrupted pods should be eagerly removed from endpoints](https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit#heading=h.oaga8h3rwgtl)
|
||||
- Primary concern raised so far by Rob Scott is the risk that someone interprets EndpointSlice terminating as one way
|
||||
- More discussion of alternatives
|
||||
|
||||
## July 9, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=RTEtVbZPB-E](https://www.youtube.com/watch?v=RTEtVbZPB-E)
|
||||
|
||||
- \[case\] A group of us were working on a PR [around](https://github.com/kubernetes/kubernetes/issues/40610) adding node labels to the downward API. [KEP-4742](https://github.com/kubernetes/enhancements/pull/4747)
|
||||
- \[harche\] \- Are we calculating the system reserved cpu shares correctly? [https://github.com/kubernetes/kubernetes/issues/72881\#issuecomment-821224980](https://github.com/kubernetes/kubernetes/issues/72881#issuecomment-821224980)
|
||||
- Analysis with various CPU cores \- [System reservation cpu](https://docs.google.com/spreadsheets/d/1N8Xkzu7ArZYKTP0Ob9vMxUnlqAFlsaFoNGbxyAhzzxU/edit?usp=sharing)
|
||||
- \[Derek\] found relevant node allocatable designs [https://github.com/kubernetes/design-proposals-archive/blob/main/node/kubelet-systemd.md](https://github.com/kubernetes/design-proposals-archive/blob/main/node/kubelet-systemd.md) and [https://github.com/kubernetes/design-proposals-archive/blob/main/node/node-allocatable.md](https://github.com/kubernetes/design-proposals-archive/blob/main/node/node-allocatable.md)
|
||||
- \[adil\] I have a question regarding logging: is there a way to disable all logs from k8s components and only get error logs? I tried setting different verbosities but it didn't help much. If there is no way to do it right now, would there be interest in implementing this? The reason we want this is to optimize CPU usage.
|
||||
- pehunt: For all the kubernetes components, you should be able to set the `-v` flag which sets the verbosity of klog. You need to individually set this flag for each kubernetes component, I don’t think there’s a centralized place you can do this today. If you set `-v=1` you should only get the most urgent messages
|
||||
- \[mimowo\] looking for sig-node reviewers for [Fix that PodIP field is temporarily removed for a terminal pod](https://github.com/kubernetes/kubernetes/pull/125404); heads up for the Kubelet issue that it may flip phase from Succeeded to Failed [link](https://github.com/kubernetes/kubernetes/issues/125410)
|
||||
- \[MaRosset\] [Request for review of Windows memory pressure eviction PR](https://github.com/kubernetes/kubernetes/pull/122922)
|
||||
|
||||
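For the system reserved CPU shares question above, it helps to have the usual CPU-to-shares conversion at hand. This is a minimal sketch, assuming the cgroup v1 constants commonly used for this conversion (1024 shares per CPU, a floor of 2 and a ceiling of 262144); the function name is illustrative, and the sketch does not settle whether the kubelet applies the formula correctly to system reserved.

```python
SHARES_PER_CPU = 1024      # cpu.shares weight for one full CPU
MILLI_CPU_PER_CPU = 1000   # 1000m == 1 CPU
MIN_SHARES = 2             # lowest value accepted for cpu.shares
MAX_SHARES = 262144        # highest value accepted for cpu.shares


def milli_cpu_to_shares(milli_cpu):
    """Convert a CPU amount in millicores to a cgroup v1 cpu.shares value."""
    if milli_cpu <= 0:
        return MIN_SHARES
    shares = milli_cpu * SHARES_PER_CPU // MILLI_CPU_PER_CPU
    # Clamp to the range the kernel accepts.
    return max(MIN_SHARES, min(shares, MAX_SHARES))
```

Shares are relative weights, so what a reservation of 500m (512 shares) buys under contention depends on the shares of the sibling cgroups, which is why the analysis across core counts matters.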
## July 2, 2024 \[Canceled for July 4th week\]
|
||||
|
||||
## June 25, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=ExmOu9Twp3A](https://www.youtube.com/watch?v=ExmOu9Twp3A)
|
||||
|
||||
- \[Filip Krepinsky\] Creation of a new WG
|
||||
- discussed in [https://groups.google.com/g/kubernetes-sig-architecture/c/Tb\_3oDMAHrg/m/pJjl6v4mAgAJ](https://groups.google.com/g/kubernetes-sig-architecture/c/Tb_3oDMAHrg/m/pJjl6v4mAgAJ)
|
||||
- Clarify scope: Node vs group of Node, SIG Node vs k8s level, list of problems/scope
|
||||
- \[Pranav Pandey\] Kubelet not releasing idle threads
|
||||
\- discussed [here](https://github.com/kubernetes/kubernetes/issues/123275)
|
||||
\- I think this issue is due to golang, could we confirm this?
|
||||
\- Could we also confirm whether there is a direct way for the kubelet to set the maximum number of threads, by a parameter or something similar?
|
||||
- \[lubomir\] review my small PR that makes a windows/kubelet related change:
|
||||
- [https://github.com/kubernetes/kubernetes/pull/123137](https://github.com/kubernetes/kubernetes/pull/123137)
|
||||
- warn instead of error for unsupported options on Windows
|
||||
- we don't need to exit the kubelet with an error on Windows just because the user is using a config that works on Linux.
|
||||
- old PR where we discussed we should not have different defaults on Windows:
|
||||
- [https://github.com/kubernetes/kubernetes/pull/77710](https://github.com/kubernetes/kubernetes/pull/77710)
|
||||
|
||||
## June 18, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=REmtlcXma\_M](https://www.youtube.com/watch?v=REmtlcXma_M)
|
||||
|
||||
- \[Sergey\] KEPs list for 1.31: [https://github.com/orgs/kubernetes/projects/183/views/1?filterQuery=sig%3Asig-node\&groupedBy%5BcolumnId%5D=Status\&sortedBy%5Bdirection%5D=desc\&sortedBy%5BcolumnId%5D=Status\&sliceBy%5BcolumnId%5D=Status](https://github.com/orgs/kubernetes/projects/183/views/1?filterQuery=sig%3Asig-node&groupedBy%5BcolumnId%5D=Status&sortedBy%5Bdirection%5D=desc&sortedBy%5BcolumnId%5D=Status&sliceBy%5BcolumnId%5D=Status)
|
||||
- \[Dixita\] Support for KEP exception until Friday, June 21
|
||||
- [https://github.com/kubernetes/enhancements/issues/2837](https://github.com/kubernetes/enhancements/issues/2837)
|
||||
- KEP needs to address the following suggestions by Tim Hockin
|
||||
- Default values when one of requests/limits is not set at pod level
|
||||
- Change language for QoS definitions
|
||||
- Stating OOM Kill behavior change
|
||||
- Reasoning
|
||||
- Feature discussions since March 2020
|
||||
- The longer we delay this feature, the harder it becomes to support it alongside the new features added in every release.
|
||||
- Low risk: Alpha phase targets only adding the new fields in the PodSpec so that feature development can start.
|
||||
- Important to unblock AI model use cases
|
||||
- \[mimowo\] looking for sig-node reviews for [Fix that PodIP field is temporarily removed for a terminal pod](https://github.com/kubernetes/kubernetes/pull/125404)
|
||||
|
||||
## June 11, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=A1XwOJxBL0c](https://www.youtube.com/watch?v=A1XwOJxBL0c)
|
||||
|
||||
- \[tallclair\] [\#125393](https://github.com/kubernetes/kubernetes/issues/125393) Should we remove soft admission failure, before AppArmor goes GA?
|
||||
- \[Filip Krepinsky\] Latest NodeMaintenance discussions
|
||||
- [https://github.com/kubernetes/enhancements/pull/4213](https://github.com/kubernetes/enhancements/pull/4213)
|
||||
- \[Sotiris/esotsal\] Static CPU management policy alongside InPlacePodVerticalScaling
|
||||
- [Status / next steps / open questions one slider](https://docs.google.com/presentation/d/1jm80y9rCvjV3P6a5LTQxYv8R5er9Zw705QxANwwLaMg/edit?usp=sharing)
|
||||
|
||||
\- \[vaibhav\] Eviction manager should check the disk usage of dead containers
|
||||
\- [https://github.com/kubernetes/kubernetes/issues/115201](https://github.com/kubernetes/kubernetes/issues/115201)
|
||||
\- Default values of Kubelet’s eviction hard parameters
|
||||
\- [https://github.com/kubernetes/kubernetes/issues/119985](https://github.com/kubernetes/kubernetes/issues/119985)
|
||||
|
||||
- \[pehunt\] [https://github.com/kubernetes/kubernetes/pull/124285](https://github.com/kubernetes/kubernetes/pull/124285) need KEP?
|
||||
- \[harche\] \- [https://github.com/kubernetes/kubernetes/pull/125341](https://github.com/kubernetes/kubernetes/pull/125341) \- changing the secret fetching strategy while creating the pod.
|
||||
- \[pehunt\] sync about [https://github.com/kubernetes/enhancements/pull/4693](https://github.com/kubernetes/enhancements/pull/4693) updates
|
||||
- [https://github.com/kubernetes/enhancements/pull/4693\#discussion\_r1630238957](https://github.com/kubernetes/enhancements/pull/4693#discussion_r1630238957) How do we feel about the Never handling change?
|
||||
|
||||
## June 4, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=3dyVRBR7K7k](https://www.youtube.com/watch?v=3dyVRBR7K7k)
|
||||
|
||||
- \[SergeyKanzhelev\]
|
||||
KEPs for 1.31: [https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.31+](https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.31+)
|
||||
|
||||
Missing lead-opted-in: [https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.31+-label%3Alead-opted-in](https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.31+-label%3Alead-opted-in)
|
||||
|
||||
- \[chrismuellner\] discuss loose linux capability handling in security context: [https://github.com/kubernetes/kubernetes/issues/119569\#issuecomment-2020382413](https://github.com/kubernetes/kubernetes/issues/119569#issuecomment-2020382413)
|
||||
- varying, incomplete implementations for validations
|
||||
- [documentation](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container) inaccurate: \`CAP\_\` prefix allowed? upper/lower case?
|
||||
- \[pehunt\] feedback on whether to exclude critical pods in swap
|
||||
- [https://github.com/kubernetes/kubernetes/pull/125277](https://github.com/kubernetes/kubernetes/pull/125277)
|
||||
- \[ndixita\] Pod level resource spec KEP: [https://github.com/kubernetes/enhancements/pull/4678](https://github.com/kubernetes/enhancements/pull/4678)
|
||||
- \[Filip Krepinsky\] update on the Declarative NodeMaintenance and Evacuation API KEPs:
|
||||
- [https://github.com/kubernetes/enhancements/pull/4213](https://github.com/kubernetes/enhancements/pull/4213)
|
||||
- [https://github.com/kubernetes/enhancements/pull/4565](https://github.com/kubernetes/enhancements/pull/4565)
|
||||
- \[lauralorenz\] CrashLoopBackoff KEP
|
||||
- [https://github.com/kubernetes/enhancements/pull/4604](https://github.com/kubernetes/enhancements/pull/4604)
|
||||
- \[SergeyKanzhelev\] Many flakes reported by release team
|
||||
|
||||
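The capability-name ambiguity raised above (`CAP_` prefix or not, upper or lower case) can be illustrated with a small normalizer. This is a minimal sketch of one possible normalization; the function name is hypothetical, and it does not describe what the API server or any container runtime does today.

```python
def normalize_capability(name):
    """Return an upper-case capability name without the CAP_ prefix.

    Illustration only: it shows that 'CAP_NET_ADMIN', 'net_admin' and
    'NET_ADMIN' can be folded into one spelling before validation.
    """
    upper = name.strip().upper()
    if upper.startswith("CAP_"):
        upper = upper[len("CAP_"):]
    if not upper:
        raise ValueError(f"empty capability name: {name!r}")
    return upper
```

Whether such folding should happen at validation time, in the runtime, or not at all is exactly the open question in the linked issue.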
## May 28, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=RDWC4rtQOCo](https://www.youtube.com/watch?v=RDWC4rtQOCo)
|
||||
|
||||
- KEP freeze is coming (schedule: [https://www.kubernetes.dev/resources/release/](https://www.kubernetes.dev/resources/release/)). [https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.31+](https://github.com/kubernetes/enhancements/issues?q=is%3Aissue+is%3Aopen+label%3Asig%2Fnode+milestone%3Av1.31+)
|
||||
- \[JeffLuoo\] Pod full startup latency metrics to record pod from creation to ready: [https://github.com/kubernetes/kubernetes/issues/124892](https://github.com/kubernetes/kubernetes/issues/124892)
|
||||
- \[dawnchen\] Starting DRA driver for GPU in CNCF / K8s repo
|
||||
- \[Ed\] [https://github.com/kubernetes-sigs/dra-example-driver](https://github.com/kubernetes-sigs/dra-example-driver)
|
||||
- \[John\] from a distributor's perspective \- a community driver would be preferable to a vendor-managed one.
|
||||
- We are talking about allowing space for vendors, if they want/prefer.
|
||||
- Idea is to simplify the life of distributors to have a place to take drivers from.
|
||||
- \[pehunt\] [https://github.com/kubernetes/kubernetes/pull/125038](https://github.com/kubernetes/kubernetes/pull/125038)
|
||||
- \[harche\] cgroup v1 maintenance mode KEP \- Should we feature gate it or not? [https://github.com/kubernetes/enhancements/pull/4572\#discussion\_r1608362477](https://github.com/kubernetes/enhancements/pull/4572#discussion_r1608362477)
|
||||
|
||||
## May 21, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=eSYWzusEZiA](https://www.youtube.com/watch?v=eSYWzusEZiA)
|
||||
|
||||
- \[iholder101\]:
|
||||
- \#[123963](https://github.com/kubernetes/kubernetes/pull/123963): Add swap to kubectl describe node's output
|
||||
- On the one hand we [received feedback](https://github.com/kubernetes/enhancements/pull/4401#discussion_r1479124963) regarding making it easier to debug and monitor swap. On the other hand there is pushback against exposing it through the API. What’s the right balance here?
|
||||
- timezone poll results from two weeks ago: [https://ibb.co/z8R3nXN](https://ibb.co/z8R3nXN).
|
||||
- SIG-Node leadership: does moving back two hours make sense? What is the process to formalize that change?
|
||||
- \[sallyom\]:
|
||||
- [KEP 4639: OCI VolumeSource](https://github.com/kubernetes/enhancements/pull/4642)
|
||||
- alternatives
|
||||
- [https://github.com/kubernetes-retired/csi-driver-image-populator](https://github.com/kubernetes-retired/csi-driver-image-populator)
|
||||
- [https://github.com/warm-metal/container-image-csi-driver](https://github.com/warm-metal/container-image-csi-driver)
|
||||
- [https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1495-volume-populators](https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/1495-volume-populators)
|
||||
- \[pehunt\]: [https://github.com/kubernetes/kubernetes/issues/124333](https://github.com/kubernetes/kubernetes/issues/124333)
|
||||
- compelling case for balancing cluster admin configuration against workloads being punished for it
|
||||
|
||||
\- \[vaibhav\] Eviction manager should check the disk usage of dead containers
|
||||
\- [https://github.com/kubernetes/kubernetes/issues/115201](https://github.com/kubernetes/kubernetes/issues/115201)
|
||||
\- [https://github.com/kubernetes/enhancements/issues/4341](https://github.com/kubernetes/enhancements/issues/4341)
|
||||
|
||||
## May 14, 2024
|
||||
|
||||
No agenda, canceling this week.
|
||||
|
||||
## May 7, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=\_FPa0TVPoY4](https://www.youtube.com/watch?v=_FPa0TVPoY4)
|
||||
|
||||
- \[marquiz/zvonkok\] [KEP-4112: Pass down resources to CRI](https://github.com/kubernetes/enhancements/issues/4112) follow-up
|
||||
- [https://docs.google.com/presentation/d/13TDKyASpMfDrVBSRj4JiU6gFeChx0ws4DTenBN1qUnA/edit?usp=sharing](https://docs.google.com/presentation/d/13TDKyASpMfDrVBSRj4JiU6gFeChx0ws4DTenBN1qUnA/edit?usp=sharing)
|
||||
- \[yujuhong\] cgroup v2 memory usage – bug or working as intended?
|
||||
- [https://github.com/kubernetes/kubernetes/issues/118916](https://github.com/kubernetes/kubernetes/issues/118916)
|
||||
- and discussion in runc \- [https://github.com/opencontainers/runc/pull/3933\#issuecomment-1833599870](https://github.com/opencontainers/runc/pull/3933#issuecomment-1833599870)
|
||||
|
||||
## Apr 30, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=iuZCxtAeoQ8](https://www.youtube.com/watch?v=iuZCxtAeoQ8)
|
||||
|
||||
- \[SergeyKanzhelev\] Annual report last call for comments: [https://github.com/kubernetes/community/pull/7831/](https://github.com/kubernetes/community/pull/7831/files)
|
||||
- \[lauralorenz\] intro on proposed changes to CrashLoopBackoff ([slides](https://docs.google.com/presentation/d/16itbKQiClbP2L7vbBCASEC5Oz6qRKxzLmcohM5_efCQ/edit?slide=id.p#slide=id.p)), this is from [Kubernetes\#57291](https://github.com/kubernetes/kubernetes/issues/57291)
|
||||
- \[iholder101/Peter Hunt\] \#[124060](https://github.com/kubernetes/kubernetes/pull/124060): Avoid swapping memory-backed volumes with tmpfs’ “[noswap](https://www.kernel.org/doc/html/latest/filesystems/tmpfs.html)” option.
|
||||
- How to behave if the option is not supported?
|
||||
- If it is not supported, do we want to fall back to ramfs / BRD / zswap?
|
||||
- How should it be tested, since the CI runs with an old kernel (5.15 \< 6.4)?
|
||||
- Update KEP and issue with the current state
|
||||
- \[iholder101/Peter Hunt\]: In my time zone this meeting takes place at 20:00. Is it acceptable to reschedule this meeting for an earlier time? This might significantly help people from the EMEA region to join.
|
||||
- Defer to next week, hope for more consensus
|
||||
- in the meantime, ask the sig-node mailing list who would be able to attend that previously could not
|
||||
- \[ndixita\]
|
||||
\- kubelet archived logs permissions [https://github.com/kubernetes/kubernetes/pull/124229](https://github.com/kubernetes/kubernetes/pull/124229)
|
||||
Solution: 1\) Config options for users maybe [https://github.com/kubernetes/kubernetes/issues/124228\#issuecomment-2042885888](https://github.com/kubernetes/kubernetes/issues/124228#issuecomment-2042885888)
|
||||
Have a feature gate that is removed later.
|
||||
Sergey: same issue with termination logs. [https://github.com/kubernetes/kubernetes/pull/108076](https://github.com/kubernetes/kubernetes/pull/108076)
|
||||
\- cadvisor enumerates memory and hugepages separately
|
||||
Issue: [https://github.com/kubernetes/kubernetes/issues/84426](https://github.com/kubernetes/kubernetes/issues/84426)
|
||||
[https://github.com/kubernetes/kubernetes/pull/119173/files\#r1307246832](https://github.com/kubernetes/kubernetes/pull/119173/files#r1307246832)
|
||||
- Can we know if this option is planned to be backported, and to which version?
|
||||
|
||||
Recommended solution: fix in cadvisor, and assess backward compatibility (probably add a new field)
|
||||
|
||||
- Question: What is the behavior if huge pages are changed dynamically?
|
||||
|
||||
\- \[Peter Hunt\] Finish KEP Planning
|
||||
|
||||
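The "option not supported" question for tmpfs `noswap` above comes down to a kernel check before mounting. This is a minimal sketch, assuming the 6.4 threshold quoted in the notes; the function names are hypothetical, and simply omitting the option on older kernels is only one of the behaviors under discussion, not a decision.

```python
import re

NOSWAP_MIN_KERNEL = (6, 4)  # threshold taken from the notes above


def kernel_supports_noswap(release):
    """Heuristic check based on the kernel release string, e.g. '5.15.0-105-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return False  # unknown format: treat as unsupported
    return (int(match.group(1)), int(match.group(2))) >= NOSWAP_MIN_KERNEL


def tmpfs_mount_options(release, size_bytes):
    """Build a tmpfs option string, adding 'noswap' only where it should be accepted."""
    options = [f"size={size_bytes}"]
    if kernel_supports_noswap(release):
        options.append("noswap")
    return ",".join(options)
```

On the 5.15 CI kernel mentioned above this yields `size=...` without `noswap`, so the new code path would not be exercised there, which is the testing gap the notes point out. A release string can be obtained with `platform.release()`.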
## Apr 23, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=-TEdQvF7kUE](https://www.youtube.com/watch?v=-TEdQvF7kUE)
|
||||
|
||||
- \[SergeyKanzhelev\] Annual report draft: [https://github.com/kubernetes/community/pull/7831](https://github.com/kubernetes/community/pull/7831) Please add your comment and review the list of KEPs ([https://github.com/kubernetes/community/issues/7777\#issuecomment-2067917685](https://github.com/kubernetes/community/issues/7777#issuecomment-2067917685))
|
||||
|
||||
- \[anishshah\] \- v1.30 release report
|
||||
- [github.com/AnishShah/sig-node-flaky-tests/tree/main](https://github.com/AnishShah/sig-node-flaky-tests/tree/main)
|
||||
- \~10% release blocking tests are flaky
|
||||
- \[jstur\] Follow up on UsageNanoCores CRI [https://github.com/kubernetes/kubernetes/issues/122092\#issuecomment-1956783842](https://github.com/kubernetes/kubernetes/issues/122092#issuecomment-1956783842)
|
||||
- What is the best approach?
|
||||
- A background implementation on the CRI side is in [https://github.com/containerd/containerd/pull/10010](https://github.com/containerd/containerd/pull/10010)
|
||||
- Additional questions if cri is responsible:
|
||||
- costs of having 10s heart beat on CRI side?
|
||||
- what does it mean to have it 10s behind other stats?
|
||||
- backwards compat?
|
||||
- James+Peter+Mike to have a call to sync on this
|
||||
|
||||
\- \[vaibhav\] Eviction manager should check the disk usage of dead containers
|
||||
\- [https://github.com/kubernetes/kubernetes/issues/115201](https://github.com/kubernetes/kubernetes/issues/115201)
|
||||
\- [https://github.com/kubernetes/enhancements/issues/4341](https://github.com/kubernetes/enhancements/issues/4341)
|
||||
|
||||
## Apr 16, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=vjcRUX\_vSbU](https://www.youtube.com/watch?v=vjcRUX_vSbU)
|
||||
|
||||
- \[pehunt\] KEPs planning [https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E/edit](https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E/edit)
|
||||
- \[pehunt\]: [https://github.com/kubernetes/org/issues/4805](https://github.com/kubernetes/org/issues/4805)
|
||||
- Mostly looking for feedback
|
||||
- Some questions/replies are here looking for more opinions: [https://github.com/kubernetes/org/issues/4805\#issuecomment-1985215796](https://github.com/kubernetes/org/issues/4805#issuecomment-1985215796)
|
||||
- \[iholder101/pehunt\]: \#[123963](https://github.com/kubernetes/kubernetes/pull/123963): Add swap to kubectl describe node's output
|
||||
- On the one hand, we [received feedback](https://github.com/kubernetes/enhancements/pull/4401#discussion_r1479124963) about making it easier to debug and monitor swap. On the other hand, there is pushback against exposing it through the API. What’s the right balance here?
|
||||
- \[marquiz/zvonkok\] [KEP-4112: Pass down resources to CRI](https://github.com/kubernetes/enhancements/issues/4112) follow-up
|
||||
- \[iholder101/pehunt\]: timezone poll results from two weeks ago: [https://ibb.co/z8R3nXN](https://ibb.co/z8R3nXN).
|
||||
- SIG-Node leadership: does moving back two hours make sense? What is the process to formalize that change?
|
||||
|
||||
\- \[harche\] \- cgroup v1 support \- Deprecation only or Removal as well?
|
||||
\- [https://github.com/kubernetes/enhancements/issues/4569](https://github.com/kubernetes/enhancements/issues/4569)
|
||||
~~\- \[klueska\] \- KEP update for DRA to match 1.30 implementation~~
|
||||
~~\- [https://github.com/kubernetes/enhancements/pull/4561](https://github.com/kubernetes/enhancements/pull/4561)~~
|
||||
~~\- [dawnchen@google.com](mailto:dawnchen@google.com) to approve~~
|
||||
\- \[anishshah\] \- v1.30 release report
|
||||
\- [github.com/AnishShah/sig-node-flaky-tests/tree/main](https://github.com/AnishShah/sig-node-flaky-tests/tree/main)
|
||||
\- 22/249 sig-node release blocking tests are flaky.
|
||||
|
||||
|
||||
## Apr 9, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=o3AohYi9aQA](https://www.youtube.com/watch?v=o3AohYi9aQA)
|
||||
|
||||
- \[oshestopalova\]: [Soft eviction of pods with long grace periods blocks hard evictions when under resource pressure](https://github.com/kubernetes/kubernetes/issues/123872)
|
||||
- \[iholder101/pehunt\]: timezone poll results from last week: [https://ibb.co/z8R3nXN](https://ibb.co/z8R3nXN).
|
||||
- \[jkyros\] Trying to use InPlacePodVerticalScaling in [Vertical Pod Autoscaler](https://github.com/kubernetes/autoscaler/pull/6652)
|
||||
- does anyone remember why limits are [required for in-place scaling](https://github.com/openshift/kubernetes/blob/258f1d5fb6491ba65fd8201c827e179432430627/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L556)?
|
||||
- naively something like [this](https://github.com/kubernetes/kubernetes/compare/master...jkyros:kubernetes:minimal-patch-fix-inplacepodverticalscaling-limits) fixes it, but probably has consequences
|
||||
- \[Sotiris Salloumis\] Perhaps we can discuss this in [https://kubernetes.slack.com/archives/C06FSK01BGU](https://kubernetes.slack.com/archives/C06FSK01BGU) ?
|
||||
- \[pehunt/eddiezane\]: kubectl cp improvements
|
||||
- \[Sonemaly\]: Start discussion around [Addressing Noisy Neighbor/Split L3 Cache Topology](https://groups.google.com/g/kubernetes-sig-node/c/V1RjCDKcTaY)
|
||||
- \[kad\]: please continue in the mail thread by sharing the scenarios you have and the corner cases you found that are not solved today. We need to look at how it could be done in a way where all other vendors (especially on the ARM side, where assumptions about the presence of L3 might not hold) will not be affected by the proposed changes to the static policy. At the moment, the cache layout detection in the cAdvisor library is partially buggy, and components like the CPU manager do not consume it from MachineInfo at all.
|
||||
- \[Matt Karrmann\] [Configure group OOM Kills at the container level instead of the kubelet level](https://github.com/kubernetes/kubernetes/pull/122813#issuecomment-2004527657)
|
||||
- Follow up with an issue to chat about different cases for pod vs kubelet level configuration
|
||||
|
||||
## Apr 2, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=Ho1kn-1p8Cg](https://www.youtube.com/watch?v=Ho1kn-1p8Cg)
|
||||
|
||||
- \[Sotiris\] InPlacePodVerticalScaling moving forward to beta (todo/needs/planning)
|
||||
- From Jiaxin Shan:
|
||||
- We worked on issues [https://docs.google.com/document/d/1V3DLh3pH3CD-xhhJvAnOq\_oWgPyjO-vj6wY6qdew9H0/edit\#heading=h.ybybfdfputt](https://docs.google.com/document/d/1V3DLh3pH3CD-xhhJvAnOq_oWgPyjO-vj6wY6qdew9H0/edit#heading=h.ybybfdfputt) and most of the issues have been solved or have pending PRs. But this is definitely a subset of the working items moving to beta.
|
||||
- For people who need more context:
|
||||
- [https://github.com/kubernetes/kubernetes/issues/109547](https://github.com/kubernetes/kubernetes/issues/109547)
|
||||
- \[Dixi\] a lot of interest. Maybe we need to meet separately to split tasks?
|
||||
- \[mrunal\] there is a slack channel already. Is it ok to coordinate there?
|
||||
- \[Dixi\] slack may work.
|
||||
- \[Jiaxin\] Let’s work together in that channel.
|
||||
- \[SergeyKanzhelev\] Please review the API again. Many use cases expect to use this feature differently from how the KEP’s API was designed.
|
||||
- \[Jiaxin\] InPlaceVPA performance issue. A few users in the community requested the patch [https://github.com/kubernetes/kubernetes/pull/123941](https://github.com/kubernetes/kubernetes/pull/123941). The PLEG cycle doesn’t take in-place pod status into consideration and never emits update events.
|
||||
- \[matthyx\] (from sidecar WG) postStart hook prevents normal container termination \- how to fix that?
|
||||
- [PUBLIC: Trying to diagram pod lifecycle stuff](https://docs.google.com/presentation/d/1e-qJWe6He2qjt0PFiZtocqP8VbkrwKAI9B4X72L0bGk/edit?usp=sharing&resourcekey=0-N77W7Q5UHuN5dFtekjmAIA)(slide 3\)
|
||||
- [https://github.com/kubernetes/kubernetes/blob/ec301a5cc76f48cdadc77bcfbd686cf40b124ecf/pkg/kubelet/kuberuntime/kuberuntime\_container.go\#L297](https://github.com/kubernetes/kubernetes/blob/ec301a5cc76f48cdadc77bcfbd686cf40b124ecf/pkg/kubelet/kuberuntime/kuberuntime_container.go#L297)
|
||||
- [https://github.com/kubernetes/kubernetes/pull/113883](https://github.com/kubernetes/kubernetes/pull/113883) (check for e2e coverage in PR)
|
||||
- we cover this in our [KEP](https://github.com/kubernetes/enhancements/issues/4438) (to be renamed)
|
||||
- \[pranav\]: Could we implement a feature in Kubelet to limit the number of threads to the number of CPUs available?
|
||||
- [https://github.com/kubernetes/kubernetes/issues/123275](https://github.com/kubernetes/kubernetes/issues/123275)
|
||||
- \[SergeyKanzhelev\] WG Serving proposal: [https://groups.google.com/g/kubernetes-sig-node/c/KGfkpVmNrNc](https://groups.google.com/g/kubernetes-sig-node/c/KGfkpVmNrNc)
|
||||
- \[Anish\] [https://github.com/kubernetes/kubernetes/pull/123782](https://github.com/kubernetes/kubernetes/pull/123782) (ask is for a review).
|
||||
- Issue: Container status changes to ContainerStatusUnknown when evicted due to exceeding ephemeral storage limit.
|
||||
- Root Cause: A race condition removes the container before the container status is updated.
|
||||
- Fix: Check that the pod is finished before cleaning up. Added a check to the existing e2e test.
|
||||
- \[iholder101\]: In my time zone this meeting takes place at 20:00. Is it acceptable to reschedule this meeting for an earlier time? This might significantly help people from the EMEA region to join.
|
||||
|
||||
## Mar 26, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=5TLp233Bisg](https://www.youtube.com/watch?v=5TLp233Bisg)
|
||||
|
||||
- \[tallclair\] [Deprecate & remove Kubelet RunOnce mode](https://github.com/kubernetes/kubernetes/issues/124030)
|
||||
- Mark deprecated in 1.31 and remove in 1.33
|
||||
- Add KEP
|
||||
- \[Sergey\] cgroupv1 removal/deprecation is moving to 1.31
|
||||
- Harshal to open a KEP for 1.31
|
||||
- \[kannon92\] CAdvisor bug on pid stats
|
||||
- [https://github.com/google/cadvisor/pull/3497/files](https://github.com/google/cadvisor/pull/3497/files)
|
||||
- K8s: [https://github.com/kubernetes/kubernetes/pull/123914](https://github.com/kubernetes/kubernetes/pull/123914)
|
||||
|
||||
\- \[Dawn\] Kubecon recap. Slide deck: [Sig Node Intro and Deep Dive](https://docs.google.com/presentation/d/1xOglu8Pfq8TNLp_ehMj7S-L56znOQ-aksiaHhRtULyY/edit?usp=sharing)
|
||||
\- Unconference hw resource model discussion:
|
||||
[\[PUBLIC\] 2024 KubeCon EU - Contrib Summit Unconference](https://docs.google.com/presentation/d/1LIBx8xWR6uelGM38QgcZ9MloRLL3od20z3JuCXBa3_s/edit?usp=sharing)
|
||||
|
||||
## Mar 19, 2024 \[Canceled for KubeCon\]
|
||||
|
||||
## Mar 12, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=-435mh2GyGU](https://www.youtube.com/watch?v=-435mh2GyGU)
|
||||
|
||||
- \[Kevin Hannon\] Upper limit on ImagePullBackOff and fail the pod
|
||||
- [https://github.com/kubernetes/kubernetes/issues/122300](https://github.com/kubernetes/kubernetes/issues/122300)
|
||||
- \[Kevin Hannon\] Flakiness in eviction tests
|
||||
- [https://github.com/kubernetes/kubernetes/issues/123591](https://github.com/kubernetes/kubernetes/issues/123591)
|
||||
- Stats eviction if stats api failure
|
||||
- PIDStats Fix- [https://github.com/kubernetes/kubernetes/pull/123369](https://github.com/kubernetes/kubernetes/pull/123369)
|
||||
- \[Krzysztof Wilczyński\] Current state and the future of the Graceful Node Shutdown support in kubelet.
|
||||
- [KEP 2000](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2000-graceful-node-shutdown/README.md): Graceful Node Shutdown
|
||||
- ~~\[Anish\] ContainerStatusUnknown after ephemeral storage limit is exceeded~~
|
||||
- ~~[https://github.com/kubernetes/kubernetes/issues/122160](https://github.com/kubernetes/kubernetes/issues/122160)~~
|
||||
- \[Hongxiang Jiang\] Calculate oom\_score\_adj in a CPU-agnostic way, taking in consideration Pod Priority too
|
||||
- [https://github.com/kubernetes/kubernetes/issues/78848](https://github.com/kubernetes/kubernetes/issues/78848)
|
||||
|
||||
## Mar 5, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=yBmVPBO9Y9Y](https://www.youtube.com/watch?v=yBmVPBO9Y9Y)
|
||||
|
||||
- \[Sotiris Salloumis\] Static CPU management policy alongside InPlacePodVerticalScaling
|
||||
- [https://github.com/kubernetes/kubernetes/pull/123319](https://github.com/kubernetes/kubernetes/pull/123319)
|
||||
- Is KEP needed? (this PR is an attempt to fix KEP 1287 Alpha Feature Code Issue [\#10](https://github.com/kubernetes/enhancements/issues/1287#issuecomment-1972964844))
|
||||
- [Demo of latest patch](https://drive.google.com/file/d/1WVmDK668OeaBV2a1MZLmZJFR52innQ9M/view?usp=sharing)
|
||||
|
||||
PTAL: [Inplace VPA + core binding](https://docs.google.com/document/d/1V3DLh3pH3CD-xhhJvAnOq_oWgPyjO-vj6wY6qdew9H0/edit?usp=sharing) There’s some discussion about VPA \+ CPU manager static policy
|
||||
|
||||
- \[Dixita, Anish\] Seeking help for bug prioritization and triage for K8s 1.30 release on Wednesday 10AM PST.
|
||||
- \[pehunt\] proc mount PR separate from e2e tests [https://github.com/kubernetes/kubernetes/pull/123520](https://github.com/kubernetes/kubernetes/pull/123520)
|
||||
- \[Kevin Hannon\] CRIO tests failing as of today
|
||||
- [https://github.com/kubernetes/kubernetes/issues/123715](https://github.com/kubernetes/kubernetes/issues/123715)
|
||||
- pehunt opened [https://github.com/kubernetes/kubernetes/pull/123726](https://github.com/kubernetes/kubernetes/pull/123726)
|
||||
|
||||
## Feb 27, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=3IRepUPQ0CU](https://www.youtube.com/watch?v=3IRepUPQ0CU)
|
||||
|
||||
- \[dwestbrook\] Discuss Per Pod Container Updates (i.e. similar to this [issue](https://github.com/kubernetes/kubernetes/issues/110487#issuecomment-1153639507))
|
||||
- [Feature Request – Per Pod Container Updates](https://docs.google.com/document/d/1RyECD6xlIDejcpi818WS1pWz9KjX43W4t9MF1LUPRLY/edit) (request access)
|
||||
- \[chrishenzie\] [Extending containerd 1.X EOL](https://github.com/containerd/containerd/issues/9866) to align with K8s EOL
|
||||
- 1.6 and 1.7 have parallel LTS windows
|
||||
- Will run until next LTS release, which release TBD (could be v2.0, v2.1)
|
||||
- containerd v2.0 contains migration tools/scripts to assist with users of deprecated features
|
||||
- containerd \-c pathToToml [config migrate](https://github.com/containerd/containerd/blob/v2.0.0-beta.2/cmd/containerd/command/config.go#L107)
|
||||
- [https://github.com/containerd/containerd/blob/main/RELEASES.md\#daemon-configuration](https://github.com/containerd/containerd/blob/main/RELEASES.md#daemon-configuration)
|
||||
- containerd has moved packages around in the 2.0 refactoring; see the move script in [https://github.com/containerd/containerd/pull/9365](https://github.com/containerd/containerd/pull/9365). This should aid people involved in containerd plugin development and importing the various packages.
|
||||
- \[SergeyKanzhelev\] Sidecar WG \- new time for the meeting: Seattle 2PM, Paris 11PM, Seoul 7AM (6AM) (Wednesday)
|
||||
|
||||
## Feb 20, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=vEbpXkhm73M](https://www.youtube.com/watch?v=vEbpXkhm73M)
|
||||
|
||||
- \[Kevin Hannon\] Discuss configuration for pod logs location
|
||||
- PR: [https://github.com/kubernetes/kubernetes/pull/112957](https://github.com/kubernetes/kubernetes/pull/112957)
|
||||
- issue: [https://github.com/kubernetes/kubernetes/issues/98473](https://github.com/kubernetes/kubernetes/issues/98473)
|
||||
- Is KEP needed?
|
||||
- Security implications of logs locations
|
||||
- Impact on disk usage
|
||||
- impact on Kata or similar runtimes?
|
||||
- \[Kevin Hannon\] [KEP-4191](https://github.com/kubernetes/kubernetes/pull/122438) blocked until we have a cadvisor release
|
||||
- With freeze coming, is it possible to get a cadvisor release before the freeze?
|
||||
- \[AI: dawnchen@\] Identify the new owner to help? \- Done\!
|
||||
- \[Jeffwan/LingyanYin\]
|
||||
- Need reviewers for this PR \- Configure MemoryRequest for InPlace pod resize in cgroupv2 systems [https://github.com/kubernetes/kubernetes/pull/121218](https://github.com/kubernetes/kubernetes/pull/121218)
|
||||
- [Dixita Narang](mailto:ndixita@gmail.com) to drop a comment and doc link explaining why memory.min shouldn't be set yet
|
||||
- \[AdrianReber\] Graduate "Forensic Container Checkpointing" from Alpha to Beta PR
|
||||
- PR: [https://github.com/kubernetes/kubernetes/pull/123215](https://github.com/kubernetes/kubernetes/pull/123215)
|
||||
- All changes in the PR are based on the KEP discussions
|
||||
- [https://github.com/kubernetes/enhancements/pull/4288](https://github.com/kubernetes/enhancements/pull/4288)
|
||||
- Mainly added tests for existing features as discussed during PRR
|
||||
- Switch from Alpha to Beta feature gate
|
||||
- Added separate sub-resource permission to better control access to the kubelet checkpoint API endpoint
|
||||
- Looking for reviewers
|
||||
- Will probably not be able to make it to the meeting
|
||||
- ~~\[fromani\] Looking for approval review: [https://github.com/kubernetes/kubernetes/pull/121778](https://github.com/kubernetes/kubernetes/pull/121778) (for memory manager GA graduation, kubelet observability/visibility) thanks mrunal\!~~
|
||||
- \[jsturtevant\] KEP 2371 \- CRI container and pod stats \- Issue with UsageNanoCores calculated in CRI [https://github.com/kubernetes/kubernetes/issues/122092\#issuecomment-1942699262](https://github.com/kubernetes/kubernetes/issues/122092#issuecomment-1942699262)
|
||||
- \[kevin hannon\] PID Stats issues in both containerd and crio
|
||||
- [https://github.com/kubernetes/kubernetes/issues/115215](https://github.com/kubernetes/kubernetes/issues/115215)
|
||||
- [https://github.com/kubernetes/kubernetes/pull/123369](https://github.com/kubernetes/kubernetes/pull/123369)
|
||||
- not sure why, on the crio side, it's failing to read any process stats
|
||||
|
||||
## Feb 13, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=WLm7m-8T82A](https://www.youtube.com/watch?v=WLm7m-8T82A)
|
||||
|
||||
- ~~kannon92: self nominating to be a reviewer in sig-node~~
|
||||
- [~~https://github.com/kubernetes/test-infra/pull/31891~~](https://github.com/kubernetes/test-infra/pull/31891)
|
||||
- ~~https://github.com/kubernetes/kubernetes/pull/123202~~
|
||||
- Mrunal and Derek approved the above PRs
|
||||
|
||||
\- \[Vaibhav\] Discuss on the eviction manager issue
|
||||
|
||||
- [https://github.com/kubernetes/kubernetes/issues/115201](https://github.com/kubernetes/kubernetes/issues/115201)
|
||||
- KEP [https://github.com/kubernetes/enhancements/issues/4341](https://github.com/kubernetes/enhancements/issues/4341)
|
||||
|
||||
\- \[Ritika\] Discuss on this issue
|
||||
\- [https://github.com/kubernetes/kubernetes/issues/123176](https://github.com/kubernetes/kubernetes/issues/123176)
|
||||
|
||||
\- Pranav : Kubelet Thread Management and Resource Cleanup Post-High Workload
|
||||
\- Discuss a scenario where Kubelet retains idle threads post-high workload,
|
||||
leading to unnecessary memory consumption.
|
||||
\- Is there a way in Kubernetes to set the maximum number of threads?
|
||||
If not, can the k8s community implement a new parameter for it?
|
||||
[https://github.com/kubernetes/kubernetes/issues/123275](https://github.com/kubernetes/kubernetes/issues/123275)
|
||||
|
||||
- gathering pprofs of the kubelet would be useful to see if there are stuck goroutines
|
||||
- try to restrict the kubelet process in the systemd unit file to cpuset:0, to force the Go runtime to allocate fewer threads and kill them more aggressively, and repeat the test. This would help distinguish a Go runtime thread leak from a kubelet thread leak.
|
||||
|
||||
[https://github.com/golang/go/issues/14592](https://github.com/golang/go/issues/14592)
|
||||
|
||||
|
||||
|
||||
- pehunt: imageRef discussion round 2
|
||||
- Problem: the public pod API field `container.ImageID` is constructed from the container status [ImageRef](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kuberuntime/kuberuntime_container.go#L619) field.
|
||||
- This ImageID is used to compare against the image.ID of the CRI call for [garbage collection](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/images/image_gc_manager.go#L244-L251).
|
||||
- The `container.ImageID` is considered to be a stable API, but is not compatible with the image.ID field.
|
||||
- Options to fix:
|
||||
- return same value as image.ID in container.ImageRef (resolved repoDigest)
|
||||
- problem: two images tagged with different repos but the same digest would thrash in GC
|
||||
- add a resolvedImageID or something to ContainerStatus and pod API for doing GC
|
||||
- both CRI and pod API update
|
||||
- In GC manager, compare image.RepoDigests in addition to image.ID to find a match
|
||||
- TODO:
|
||||
- check exactly what is returned for each field in cri-o and containerd
|
||||
- investigate if we can put together the needed info in image gc manager without CRI/pod API extension
|
||||
- extend them if not
|
||||
- kannon92: (if time) [https://github.com/kubernetes/kubernetes/issues/123247](https://github.com/kubernetes/kubernetes/issues/123247)
|
||||
- Discovered reason for flake in eviction
|
||||
- Summary stats is sometimes failing and the first sort of activePods is ignored
|
||||
- ndixita: highlight from Sig Node CI triage meeting (every Wednesday 10AM PST) [https://github.com/kubernetes/kubernetes/issues/122905](https://github.com/kubernetes/kubernetes/issues/122905)
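
For the imageRef discussion above, the third option (compare `image.RepoDigests` in addition to `image.ID` in the GC manager) could be sketched roughly as follows. This is a minimal illustration, not the kubelet's image GC code: the `Image` struct, the `imageInUse` helper, and the sample references are all hypothetical, and what cri-o and containerd actually return in each field is exactly the open TODO in the notes.

```go
package main

import "fmt"

// Image is a hypothetical, trimmed-down stand-in for an image record
// returned by the runtime; only the two fields from the notes are modelled.
type Image struct {
	ID          string
	RepoDigests []string
}

// imageInUse reports whether img is referenced by any value in usedRefs.
// It matches on the image ID first and then falls back to each repo digest,
// so a container whose reported ImageID is a resolved repo digest still
// protects the image from garbage collection.
func imageInUse(img Image, usedRefs map[string]struct{}) bool {
	if _, ok := usedRefs[img.ID]; ok {
		return true
	}
	for _, digest := range img.RepoDigests {
		if _, ok := usedRefs[digest]; ok {
			return true
		}
	}
	return false
}

func main() {
	// References collected from running containers (made-up values).
	used := map[string]struct{}{
		"quay.io/example/app@sha256:aaaa": {},
	}

	inUse := Image{
		ID:          "sha256:1111",
		RepoDigests: []string{"quay.io/example/app@sha256:aaaa"},
	}
	unused := Image{
		ID:          "sha256:2222",
		RepoDigests: []string{"quay.io/example/other@sha256:bbbb"},
	}

	fmt.Println(imageInUse(inUse, used), imageInUse(unused, used))
}
```

The appeal of this option is that it needs no CRI or pod API change, provided the runtimes populate both fields consistently.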
|
||||
|
||||
## Feb 6, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=WiYzo\_knwfk](https://www.youtube.com/watch?v=WiYzo_knwfk)
|
||||
|
||||
- \[Filip Krepinsky\] Update on [Declarative Node Maintenance](https://github.com/kubernetes/enhancements/pull/4213)
|
||||
- \[Derek\] Update requested to clarify security posture that would prevent cross node privileges
|
||||
- \[pehunt\] [https://github.com/kubernetes/enhancements/pull/3858](https://github.com/kubernetes/enhancements/pull/3858) RRO conversation redux
|
||||
- \[jonathan-innis\] Support for `node.kubernetes.io` resource labels for Gt/Lt requirements on pods
|
||||
- See: [https://kubernetes.slack.com/archives/C0BP8PW9G/p1707165434255259](https://kubernetes.slack.com/archives/C0BP8PW9G/p1707165434255259)
|
||||
- \[Jeffwan/LingyanYin\] Two things:
|
||||
- Next steps for KEP 4176 \- a new static policy for cpu manager [https://github.com/kubernetes/enhancements/pull/4177\#issuecomment-1930204226](https://github.com/kubernetes/enhancements/pull/4177#issuecomment-1930204226)
|
||||
- Need reviewers for this PR \- Configure MemoryRequest for InPlace pod resize in cgroupv2 systems [https://github.com/kubernetes/kubernetes/pull/121218](https://github.com/kubernetes/kubernetes/pull/121218)
|
||||
- \[Vaibhav\] Discuss on the eviction manager issue
|
||||
- [https://github.com/kubernetes/kubernetes/issues/115201](https://github.com/kubernetes/kubernetes/issues/115201)
|
||||
- KEP [https://github.com/kubernetes/enhancements/issues/4341](https://github.com/kubernetes/enhancements/issues/4341)
|
||||
|
||||
## Jan 30, 2024
|
||||
|
||||
Recording: [https://www.youtube.com/watch?v=LLS3qQgQJ6g](https://www.youtube.com/watch?v=LLS3qQgQJ6g)
|
||||
|
||||
- \[pacoxu\] [Fix evented pleg mirro pod & use IsEventedPLEGInUse instead of FG status check](https://github.com/kubernetes/kubernetes/pull/122778) needs approval and we’d better get inputs from [@smarterclayton](https://github.com/smarterclayton) before merge. This bugfix blocked [sig-release-master-blocking\#gce-cos-master-alpha-features](https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features). And 1.30.0-alpha.1 is planned for Jan 30th.
|
||||
- Is this an alpha release cut blocker?
|
||||
- Could we ignore \`EventedPLEG\` in the job? We already disabled it in some presubmit jobs, including \`pull-kubernetes-e2e-kind-alpha-features\` and \`pull-kubernetes-e2e-gce-cos-alpha-features\`.
|
||||
- \[anish\] [KEP-3953: Dynamic node resize](https://github.com/kubernetes/enhancements/issues/3953) \- draft at [KEP-3953: Support for Resizable Nodes](https://docs.google.com/document/d/16M0g2L31JZGSBM_oMxszUK1_5AGeqrdyQKjZn7NeCoY/edit?usp=sharing)
|
||||
Anish, please contact Markus Lehtonen, Francesco Romani, or Alexander Kanevskiy on Slack \- we will include you in the discussion thread about that topic.
|
||||
- ~~\[tallclair\] Expanding Kubelet configuration API~~
|
||||
- ~~Proposal: [https://github.com/kubernetes/kubernetes/issues/122916](https://github.com/kubernetes/kubernetes/issues/122916)~~
|
||||
- ~~Does this need a KEP? (I think no?)~~
|
||||
- \[pehunt\] imageRef usage in the kubelet
|
||||
- context: [https://github.com/cri-o/cri-o/issues/7579](https://github.com/cri-o/cri-o/issues/7579) [https://github.com/cri-o/cri-o/issues/7143](https://github.com/cri-o/cri-o/issues/7143) [https://github.com/kubevirt/kubevirt/pull/10747](https://github.com/kubevirt/kubevirt/pull/10747)
|
||||
- Yuju shared [https://github.com/kubernetes/kubernetes/issues/46255](https://github.com/kubernetes/kubernetes/issues/46255)
|
||||
- \[fromani\] want to improve observability of resource managers: better and more kubelet logs, send kube events on admission failures and in the happy path. Raised as memory manager GA blocker and in general poor observability is a PRR concern. Does this work require a KEP or is an [issue](https://github.com/kubernetes/kubernetes/issues/123037) sufficient?
|
||||
- update KEPs where feasible
|
||||
- For GA KEPs (and in general for this work): update the docs
|
||||
- Keep issues and file the PR when ready
|
||||
- \[marquiz/zvonkok\] [KEP-4112: Pass down resources to CRI](https://github.com/kubernetes/enhancements/issues/4112)
|
||||
- PR: [\#4113](https://github.com/kubernetes/enhancements/pull/4113) ([README.md](https://github.com/kubernetes/enhancements/blob/870d027b872dfa289421353e6a7005d59c210bf0/keps/sig-node/4112-passdown-resources-to-cri/README.md))
|
||||
- KEP intro
|
||||
|
||||
## Jan 23, 2024
|
||||
|
||||
Recording:
|
||||
|
||||
- \[Sergey, Mrunal\] 1.30 Planning [SIG Node - KEP Planning](https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E/edit#bookmark=id.78c7ftvcad73)
|
||||
- \[kannon92, AxeZhan\] KEP4328 for 1.30
|
||||
- [https://github.com/kubernetes/enhancements/pull/4329](https://github.com/kubernetes/enhancements/pull/4329)
|
||||
- sig-scheduling is planning to implement the nodeAffinity type RequiredDuringSchedulingRequiredDuringExecution by adding a new controller. This KEP also needs a review from a sig-node approver, as sig-node is involved as a participating SIG.
|
||||
- Thank you to Dawn for agreeing to review from sig-node.
|
||||
- \[jeffwan, LingyanYin\] two KEPs for 1.30
|
||||
- [https://github.com/kubernetes/enhancements/issues/4176](https://github.com/kubernetes/enhancements/issues/4176)
|
||||
- CPU manager: Adding a static policy option to spread hyperthreads across physical CPUs. Addressed all comments, need approvals
|
||||
- [https://github.com/kubernetes/enhancements/pull/4177\#issuecomment-1903670228](https://github.com/kubernetes/enhancements/pull/4177#issuecomment-1903670228) NRI vs. native cpu manager?
|
||||
- [https://github.com/kubernetes/enhancements/pull/4433](https://github.com/kubernetes/enhancements/pull/4433)
|
||||
- keep inplace VPA KEP alpha for 1.30
|
||||
- \[klueska\] Three KEPs for 1.30
|
||||
- Add CDI devices to device plugin API (**promote to GA**)
|
||||
- [https://github.com/kubernetes/enhancements/issues/4009](https://github.com/kubernetes/enhancements/issues/4009)
|
||||
- Add numeric parameters for dynamic resource allocation (**new KEP**)
|
||||
- Simplification / generalization of overall DRA proposal
|
||||
- Context: [Dynamic Resource Allocation (DRA)](https://docs.google.com/document/d/1XNkTobkyz-MyXhidhTp5RfbMsM-uRCWDoflUMqNcYTk/edit#heading=h.ljj9kaa144nr)
|
||||
- [https://github.com/kubernetes/enhancements/pull/4384](https://github.com/kubernetes/enhancements/pull/4384)
|
||||
- Pass down resources to CRI (**new KEP**)
|
||||
- Needed to support GPUs in Kata Containers
|
||||
- [https://github.com/kubernetes/enhancements/pull/4113](https://github.com/kubernetes/enhancements/pull/4113)
|
||||
|
||||
## Jan 16, 2024
|
||||
|
||||
Recording: [https://youtu.be/NAIQGQHrlN0](https://youtu.be/NAIQGQHrlN0)
|
||||
|
||||
- \[pacoxu\] EventedPLEG bug of static pods start-up. After reverting it to alpha, [sig-release-master-blocking\#gce-cos-master-alpha-features](https://testgrid.k8s.io/sig-release-master-blocking#gce-cos-master-alpha-features) keeps failing. \#[122763](https://github.com/kubernetes/kubernetes/pull/122763) is under review.
|
||||
- Latest PR \- [https://github.com/kubernetes/kubernetes/pull/122778](https://github.com/kubernetes/kubernetes/pull/122778)
|
||||
- \[kannon92 Kevin\] Update on Swap.
|
||||
- [Swap Beta2 Findings](https://docs.google.com/document/d/1S75d_0N0i1taGTGVjQ8YLYymq8D5Yc0aVu-ROMqQkSs/edit?usp=sharing)
|
||||
- [https://github.com/kubernetes/enhancements/pull/4401](https://github.com/kubernetes/enhancements/pull/4401)
|
||||
- NoSwap seems good
|
||||
- UnlimitedSwap and Eviction signal may be needed
|
||||
- We should add eviction signal for swap for UnlimitedSwap
|
||||
- Or drop support for UnlimitedSwap
|
||||
- Kevin to reach out to [dawnchen@google.com](mailto:dawnchen@google.com) about use cases for swap
|
||||
- Kevin to find examples for UnlimitedSwap.
|
||||
- \[pehunt\] proc mount type direction
|
||||
- [https://docs.google.com/document/d/1rYvnhQyi-d8bDgyOGn5FHZKVMgwpygPjksC8ZSBaEPg/edit?usp=sharing](https://docs.google.com/document/d/1rYvnhQyi-d8bDgyOGn5FHZKVMgwpygPjksC8ZSBaEPg/edit?usp=sharing)
|
||||
- Make a KEP update to tie ProcMount behavior to userns (if userns, no masked paths). If there’s pushback, push for ProcMount in Beta
|
||||
- \[AkihiroSuda (unlikely to attend due to the timezone)\] Can I get any reaction to the KEP for Recursive Read-only (RRO) mounts? An explicit rejection would be appreciated more than no action. It has been open for almost a year. If this isn’t going to be accepted I’ll just leave Kubernetes unmodified and change containerd to treat RO as RRO.
|
||||
[https://github.com/kubernetes/enhancements/issues/3857](https://github.com/kubernetes/enhancements/issues/3857) [https://github.com/kubernetes/enhancements/pull/3858](https://github.com/kubernetes/enhancements/pull/3858)
|
||||
- \[AdrianReber\] Open Checkpoint/Restore questions from last week
|
||||
- Checkpoint/Restore demo from container image based on [https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/\#restore-checkpointed-container-k8s](https://kubernetes.io/blog/2022/12/05/forensic-container-checkpointing-alpha/#restore-checkpointed-container-k8s)
|
||||
|
||||
\# curl \-s \--insecure \--cert /var/run/kubernetes/client-admin.crt \--key /var/run/kubernetes/client-admin.key \-X POST "https://localhost:10250/checkpoint/default/counters/counter"
|
||||
\# kubectl alpha checkpoint counters
|
||||
\# newcontainer=$(buildah from scratch)
|
||||
\# buildah add $newcontainer /var/lib/kubelet/checkpoints/checkpoint-\<pod-name\>\_\<namespace-name\>-\<container-name\>-\<timestamp\>.tar /
|
||||
\# buildah config \--annotation=io.kubernetes.cri-o.annotations.checkpoint.name=counter $newcontainer
|
||||
\# buildah commit $newcontainer checkpoint-image:latest
|
||||
\# buildah rm $newcontainer
|
||||
* How would checkpoint/restore work with pods:
|
||||
* Implemented in March 2022 in combination with kubectl drain
|
||||
* https://github.com/adrianreber/cri-o/commits/checkpoint-restore-support-cri-api/
|
||||
* Pause pod (using cgroup)
|
||||
* Loop over all containers in pod and create a checkpoint
|
||||
* Collect pod metadata
|
||||
* Recreate pod based on metadata (no checkpoint)
|
||||
* Restore all containers
|
||||
* Unpause pod
|
||||
* Security review: looking into it
|
||||
* Garbage collection mechanism: not yet thought about
|
||||
* Image-Spec discussion [https://github.com/opencontainers/image-spec/issues/962](https://github.com/opencontainers/image-spec/issues/962)
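
The pod-level checkpoint/restore flow listed above (pause, checkpoint each container, collect metadata, recreate the pod, restore, unpause) can be sketched as a plain orchestration function. Everything here is hypothetical: the `Runtime` interface and its method names are invented for illustration and do not come from the CRI or from the linked cri-o branch, and unpausing the recreated pod rather than the original is an assumption, since the notes only say "Unpause pod".

```go
package main

import "fmt"

// Runtime is a hypothetical abstraction over the steps named in the notes.
type Runtime interface {
	PausePod(pod string) error
	CheckpointContainer(pod, container string) (archive string, err error)
	PodMetadata(pod string) (string, error)
	CreatePod(metadata string) (string, error)
	RestoreContainer(pod, archive string) error
	UnpausePod(pod string) error
}

// checkpointAndRestorePod runs the steps in the order given in the notes
// and returns the name of the recreated pod.
func checkpointAndRestorePod(rt Runtime, pod string, containers []string) (string, error) {
	if err := rt.PausePod(pod); err != nil {
		return "", fmt.Errorf("pause %s: %w", pod, err)
	}
	archives := make([]string, 0, len(containers))
	for _, c := range containers {
		a, err := rt.CheckpointContainer(pod, c)
		if err != nil {
			return "", fmt.Errorf("checkpoint %s/%s: %w", pod, c, err)
		}
		archives = append(archives, a)
	}
	meta, err := rt.PodMetadata(pod)
	if err != nil {
		return "", fmt.Errorf("metadata %s: %w", pod, err)
	}
	newPod, err := rt.CreatePod(meta) // recreated from metadata, no checkpoint
	if err != nil {
		return "", fmt.Errorf("recreate %s: %w", pod, err)
	}
	for _, a := range archives {
		if err := rt.RestoreContainer(newPod, a); err != nil {
			return "", fmt.Errorf("restore %s: %w", a, err)
		}
	}
	if err := rt.UnpausePod(newPod); err != nil { // assumption: unpause the new pod
		return "", fmt.Errorf("unpause %s: %w", newPod, err)
	}
	return newPod, nil
}

// fakeRuntime records each call so the ordering can be inspected.
type fakeRuntime struct{ steps []string }

func (f *fakeRuntime) PausePod(pod string) error {
	f.steps = append(f.steps, "pause "+pod)
	return nil
}

func (f *fakeRuntime) CheckpointContainer(pod, c string) (string, error) {
	f.steps = append(f.steps, "checkpoint "+c)
	return c + ".tar", nil
}

func (f *fakeRuntime) PodMetadata(pod string) (string, error) {
	f.steps = append(f.steps, "metadata "+pod)
	return "meta:" + pod, nil
}

func (f *fakeRuntime) CreatePod(meta string) (string, error) {
	f.steps = append(f.steps, "create "+meta)
	return "restored", nil
}

func (f *fakeRuntime) RestoreContainer(pod, archive string) error {
	f.steps = append(f.steps, "restore "+archive)
	return nil
}

func (f *fakeRuntime) UnpausePod(pod string) error {
	f.steps = append(f.steps, "unpause "+pod)
	return nil
}

func main() {
	rt := &fakeRuntime{}
	newPod, err := checkpointAndRestorePod(rt, "counters", []string{"counter", "sidecar"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("new pod:", newPod)
	for _, s := range rt.steps {
		fmt.Println(s)
	}
}
```

The sketch leaves out the two open questions from the notes, namely the security review and garbage collection of checkpoint archives; a failed step here simply aborts and leaves the original pod paused.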
|
||||
|
||||
## Jan 9, 2024
|
||||
|
||||
Recording: [https://youtu.be/b5jaZux0qCo](https://youtu.be/b5jaZux0qCo)
|
||||
|
||||
Agenda:
|
||||
|
||||
- \[ pehunt \] [https://github.com/kubernetes/kubernetes/pull/117793](https://github.com/kubernetes/kubernetes/pull/117793) ownership. 1.30??
|
||||
- tzneal to take on, no KEP needed
|
||||
- \[kannon92\] [https://github.com/kubernetes/kubernetes/pull/121834](https://github.com/kubernetes/kubernetes/pull/121834) looking for approver
|
||||
- Can we consider backporting this?
|
||||
- Agreement
|
||||
- \[rata\]: UserNS KEP: beta migration in 1.30?
|
||||
- Open a PR to migrate to beta and reach out to gather more feedback
|
||||
- \[tallclair\]: Kubelet config clean up
|
||||
- Now that Dynamic Kubelet config is deprecated & removed, can we move the remaining flags into the Kubelet configuration object?
|
||||
- Derek: look into whether there are any differences in whether the Kubelet needs to be drained on update
|
||||
- Mrunal: Sync with folks working on conf.d
|
||||
- \[rst0git\] Forensic Container Checkpointing:
|
||||
- Provide details about additional checkpoint/restore use cases [https://github.com/kubernetes/enhancements/pull/4305](https://github.com/kubernetes/enhancements/pull/4305)
|
||||
- Graduate "Forensic Container Checkpointing" to Beta [https://github.com/kubernetes/enhancements/pull/4288](https://github.com/kubernetes/enhancements/pull/4288)
|
||||
- Add 'checkpoint' command to kubectl [https://github.com/kubernetes/kubernetes/pull/120898](https://github.com/kubernetes/kubernetes/pull/120898)
|
||||
- Proposal: checkpoint image definition
|
||||
[https://github.com/opencontainers/image-spec/issues/962](https://github.com/opencontainers/image-spec/issues/962)
|
||||
- \[fromani\] proposal to allow the [kubelet to trigger the rescheduling of pods](https://docs.google.com/document/d/1-wJhiNy84w7tzFdo9HqwTu5DrVSuXFLGTUv8FBiRAAc/edit?usp=sharing). (redo from 20240102 because the audience was too small; presented at the batch WG mtg on 20240104) \- expected 5 minutes [presentation](https://github.com/ffromani/ffromani/blob/main/docs/proposal-allow-kubelet-to-trigger-rescheduling.pdf) \+ time for questions/discussion, maybe 10 mins tops?
|
||||
- Include a security section about restricting the node to unbind only its own pods.
|
||||
- \[SergeyKanzhelev, Harche\] [https://github.com/kubernetes/kubernetes/issues/122224](https://github.com/kubernetes/kubernetes/issues/122224) are the backward compat concerns here valid?
|
||||
|
||||
## Jan 2, 2024
|
||||
|
||||
Recording: [https://youtu.be/BHGZs2HJMyU](https://youtu.be/BHGZs2HJMyU)
|
||||
Agenda:
|
||||
|
||||
- \[marquiz\] [QoS resources KEP](https://github.com/kubernetes/enhancements/pull/3004), call for reviews, blockers from sig-node perspective(?)
|
||||
- \[fromani\] proposal to allow the [kubelet to trigger the rescheduling of pods](https://docs.google.com/document/d/1-wJhiNy84w7tzFdo9HqwTu5DrVSuXFLGTUv8FBiRAAc/edit?usp=sharing). Looking for early feedback/possible concerns.
|
||||
- spinoff from DRA conversations; beneficial to improve UX with kubelet admission failures
|
||||
- will be presented to batch WG/sig-scheduling mtgs
|
|
@ -0,0 +1,32 @@
|
|||
# SIG Node Meeting Notes
|
||||
|
||||
## Jan 14, 2025
|
||||
|
||||
* \[Tim Allclair\] Changing the CRI contract for `UpdateContainerResources`, to require it to not intentionally restart a container. Redefine (and probably rename) the "NotRequired" resize restart policy to match, meaning containers won't be deliberately restarted (could still trigger an OOM). Runtimes that can't support in-place resize should return an error (e.g. VM runtimes).
|
||||
|
||||
## Jan 7, 2025
|
||||
|
||||
* \[Tim Allclair\] Shrinking memory limits below memory usage (in-place resize)
|
||||
* See [In Place Pod Resize - Handling Memory Limit Decreases](https://docs.google.com/document/d/1cEFLXKwNOSNLAkzyhoJUgkBW0OiX-9bXB_aJV7OAypw/edit?tab=t.0)
|
||||
* \[sergey\] posting some questions here so as not to interrupt the flow:
|
||||
* should we disable resize on cgroupv1 to avoid diff behavior?
|
||||
* Mrunal: true, but we like the v1 behavior better
|
||||
* How can VPA use the feature safely? This is another example where VPA cannot guarantee a non-disruptive pod resize (best effort)
|
||||
* Tim to follow up with VPA folks
|
||||
* All the suggested checks still leave room for a race condition, and the user experience will be super annoying. Should we reconsider the application of memory.high?
|
||||
* Mrunal: Ask Waiman (Red Hat memory subsystem maintainer) if switching to v1 behavior is possible
|
||||
* Peter: we probably need to do this to fully avoid TOCTOU / OOM kills
|
||||
* Tim: Pause container to avoid TOCTOU?
|
||||
* \[Tim Allclair\] Pod generation, status generation (design proposal)
|
||||
* See proposal section of [\[PUBLIC\] Design proposal: Pod ResizeStatus Improvements](https://docs.google.com/document/d/10m0vdWbjqF_q_f1_N3gVOD5YeNn09BwFcVVycG_MoEY/edit?tab=t.0#heading=h.3rtsrq1ip5xv)
|
||||
* \[Zeel Patel\] [https://github.com/kubernetes/kubernetes/issues/129447](https://github.com/kubernetes/kubernetes/issues/129447)
|
||||
* Create a KEP PR: [https://github.com/kubernetes/enhancements/pull/5022](https://github.com/kubernetes/enhancements/pull/5022)
|
||||
* \[fromani\] I’d be happy to review
|
||||
* \[HirazawaUi\]: Addressing the Issue of Slow Pod State Transitions When Evented PLEG is Enabled
|
||||
* Due to the time zone difference, I regret that I am unable to attend the meeting. I have compiled the entire process of how @hshiina and I attempted to address this issue, as well as my proposal, into a detailed document: [https://docs.google.com/document/d/1TPrY56q9MNW8r1FuzKDFkBBhOjQ0hqi7wJAbIP1O-4g](https://docs.google.com/document/d/1TPrY56q9MNW8r1FuzKDFkBBhOjQ0hqi7wJAbIP1O-4g). I sincerely hope this can be discussed during the meeting.
|
||||
* \[SergeyKanzhelev\] [Kubernetes feature development and Container runtimes](https://docs.google.com/document/d/1y42XrUPrm-6DZby1RQjexYYoNn822IRR6igsOiy_62c/edit?tab=t.0#heading=h.2bbangbf1ha4)
|
||||
* State containerd and cri-o explicitly
|
||||
* \[kannon92\] [Deprecation of NodeFeature in sig-node testing](https://github.com/kubernetes/kubernetes/pull/129166)
|
||||
* \[marquiz\] KEP-4112 Pass down resources to CRI, request for review:
|
||||
[https://github.com/kubernetes/enhancements/pull/4113](https://github.com/kubernetes/enhancements/pull/4113)
|
||||
* 1.33 time is coming\! [https://github.com/kubernetes/sig-release/pull/2706/](https://github.com/kubernetes/sig-release/pull/2706/files)
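
For the memory limit decrease item above, the TOCTOU concern can be illustrated with a small simulation. This is a sketch under stated assumptions, not kubelet or runtime code: `FakeCgroup` is a hypothetical in-memory model whose `usage` and `limit` fields stand in for the cgroup v2 `memory.current` and `memory.max` files, and `shrink_limit` is an invented helper that performs a check-then-set.

```python
class FakeCgroup:
    """Hypothetical in-memory model of a container's memory cgroup."""

    def __init__(self, usage, limit):
        self.usage = usage  # stands in for memory.current
        self.limit = limit  # stands in for memory.max


def shrink_limit(cgroup, new_limit, between_check_and_write=None):
    """Check-then-set: refuse the shrink if usage is already too high."""
    if cgroup.usage >= new_limit:
        return "rejected"
    if between_check_and_write is not None:
        # Models the container allocating after the check but before the write.
        between_check_and_write(cgroup)
    cgroup.limit = new_limit
    # If usage overtook the new limit in the gap, the check did not help.
    return "oom-risk" if cgroup.usage > cgroup.limit else "applied"


def grow_by_20(cgroup):
    cgroup.usage += 20


print(shrink_limit(FakeCgroup(usage=150, limit=200), 100))
print(shrink_limit(FakeCgroup(usage=90, limit=200), 100))
print(shrink_limit(FakeCgroup(usage=90, limit=200), 100, grow_by_20))
```

The third call passes the check and still ends with usage above the new limit, which is the window the discussion refers to. Pausing the container around the check and the write, as suggested above, would close that window in this model.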
|