Kubernetes SIG-Node CI subgroup notes
12/21/2020 Pre-new-year clean wrap up
Attendees (4):
- Sergey Kanzhelev
- Amim Knabben
- Elana Hashman
- Matt Merkes
Agenda:
We will try to summarize all ongoing work on project board: https://github.com/orgs/kubernetes/projects/43
Add all issues to the board and have “unrelated” column
- Triage
12/14/2020
Attendees (7):
- Sergey Kanzhelev
- Artyom Lukianov
- Jorge Alarcon
- Matt Merkes
- Francesco Romani (fromani)
- Amim Knabben
Agenda:
- [knabben] How to increase team velocity on issue responses? After tackling an issue, how do we organize and sync efforts on it?
- Try to use the Slack channel (#sig-node) to discuss issue debugging
- Update the project board with the latest issues
- [merkes] Is it time to fix it or kill it? - https://github.com/kubernetes/test-infra/issues/18973
- Jorge: we definitely need to document the decision. The thinking is that node/e2e is supposed to be NodeConformance, while regular conformance is under test/e2e.
- Should NodeConformance and master run differently?
- Action: let’s make a doc and, once we feel OK with it, review with Dawn and implement
- [dims] Action: review dockershim test coverage
12/07/2020
Attendees:
- jorge
- sergey
- amim
- morgan
- matt
- roy
Agenda:
- [knabben] Morgan: more people is better. Currently there is a disconnect between CI and ongoing work.
- Amim: we have alerting for the jobs
- Sergey: many features are coming in hot for the code freeze date, and many improvements were merged last week. A small PR that breaks tests may block a large number of PRs, so improvements wouldn’t be merged into the target release
- Jorge: test infra is complicated and will be difficult for people to learn. We need a lot of improvements in this area if we are going to block PRs on tests. We need more long-term contributors; this change will work best for them but may inhibit one-feature contributors.
- Sergey: this group makes slower progress but develops knowledge; the proposal from Wojtek is the extreme of constant firefighting
- Jorge: maybe we need an official onboarding (bootcamp) that can improve the quality of contributors. This may be an in-the-middle solution.
- [alejandrox1] we need a central data structure for configuring test environments
- we need a simple way of populating the above
- Morgan: happy to join
- [SergeyKanzhelev] Morgan: yes, it is hard to figure out. Would be great to have an explicit flag. How to know? - many tests are defaulting to ContainerD now.
- We definitely will learn something when we start doing it.
- Action: Sergey to start an issue
- Also follow up with recently added crio jobs
- PR and CI may be different.
- Some scalability tests may be specifying CNI explicitly
- [SergeyKanzhelev] Pod lifecycle test moved to conformance. It was passing in its previous location.
- This test was moved from orphans. Need to investigate why it was passing there and is failing here
- Action: merkes will take a look, and also check whether there was an e-mail about it.
- Original PR: https://github.com/kubernetes/kubernetes/pull/96485/files
- [SergeyKanzhelev] Jorge: will take it and take a look.
- [mhb] https://github.com/kubernetes/org/blob/8dde2c258144702b4130d8bdc7fc0891dcb72422/config/kubernetes/sig-release/teams.yaml#L17
- Maybe too late for 1.20, but worth checking. If nothing will be done, this will simply be merged into master
11/30/2020
Attendees (4 on a call):
- Artyom Lukianov
- Matt Merkes
- Sergey Kanzhelev
- Morgan Bauer
Agenda:
-
Artyom Lukianov
11/23/2020
Cancelled! No agenda for today and it is a “short” week in the US.
11/16/2020
Attendees (# on a call):
- Artyom Lukianov
- Francesco Romani (fromani)
- Matt Merkes
- Sergey Kanzhelev
Agenda:
-
Sergey
-
fromani-
Ruiwen
-
- Mention fromani on test related PRs
11/09/2020
Attendees (# on a call):
- Artyom Lukianov
- Francesco Romani (fromani)
- Sergey Kanzhelev
- Karan Goel (karan, Google)
Agenda:
[merkes] node conformance ci debrief - failing 10-21 to 11-03 due to insecure port change PR#XYZ, bearer-token not plumbed in conformance test-suite mode. Residual concern of the conformance mode being a completely different setup from normal runs. Test output looks different from before.
node-kubelet-serial is the last red tab in sig-node-kubelet
[Sergey] containerd jobs, are those owned by ‘sig-node’ or by ‘containerd’ (much overlap in participants), but I wonder who is concerned about them.
- [Karan] NPD e2e is failing; I have a repro but not a root cause. Would appreciate another set of eyes to investigate.
- https://github.com/kubernetes/kubernetes/pull/96262#issuecomment-723208823
- AI: Artyom will take a look - see the comment https://github.com/kubernetes/kubernetes/pull/96262/files#r520785823
- [Artyom] Fake NUMA - requires a kernel compiled with CONFIG_NUMA and CONFIG_NUMA_EMU
- Any additional thoughts?
- CONFIG_NUMA_EMU is missing on COS. Looking into it. Will update next week or the week after.
- AI: Artyom to create a ticket on test-infra - https://github.com/kubernetes/test-infra/issues/19902
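The kernel requirement for fake NUMA can be checked against a kernel config dump before picking an image for the tests. This is a minimal sketch, assuming the config text has already been read from somewhere such as /boot/config-<release> or a decompressed /proc/config.gz (whether either exists depends on the image, so that part is an assumption). The helper name `missing_numa_options` and the sample fragment are hypothetical.

```python
REQUIRED_OPTIONS = ("CONFIG_NUMA", "CONFIG_NUMA_EMU")


def missing_numa_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to 'y' in a kernel config dump."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


# Illustrative config fragment (made up, not taken from a real COS image).
sample = """\
CONFIG_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_SMP=y
"""
print(missing_numa_options(sample))
```

An empty result means both options are built in. Fake nodes would then still have to be requested on the kernel command line, as described in the kernel fake-NUMA document linked in the 10/19/2020 notes.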
- [Francesco (fromani)] Makes the flow more robust - please check the commit message for more context
- You won’t see these failures in u/s CI (even with /test pull-kubernetes-node-kubelet-serial-topology-manager) because this part of the test only runs on a multi-NUMA box with SR-IOV (https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/topology_manager_test.go#L732)
11/02/2020
Attendees (9 on a call):
- Alukiano
- Sergey Kanzhelev
- Fromani
- Merkes
- Amim Knabben
- morgan
Agenda:
-
Sergey -
Sergey - Please help review issues and comments https://github.com/orgs/kubernetes/projects/43
- New TestGrid alerts at sig-node-kubelet#node-kubelet-conformance since commit/237dae5a5 (Matt) - still WIP
- The PR for the secure port may have broken the test; this needs to be checked. Morgan to take a look.
10/26/2020
Attendees: 6 on call:
- Alukiano
- Matt Merkes
- fromani
Agenda:
- [Sergey] https://k8s-testgrid.appspot.com/sig-node-containerd#containerd-node-conformance
- Roy: This is where the containerd test fails
-
merkes -
alukiano / Roy / Amim: https://github.com/kubernetes/test-infra/issues/19401
1 page write up to formalize the process.
Step 1: Renaming the dashboards is low-hanging fruit and easy to begin with.
Step 2: Test infra - rename scripts and programs with a transition step (in extract_k8s.go, add cos* and cosCI* to coexist with gci* and gciCI* to avoid breakage).
Step 3: Rename cluster/gce/gci to cluster/gce/cos in the kubernetes repo, and also make sure the containerd/cri repo does not hardcode this in its YAML.
10/19/2020
Attendees:
- Sergey, Jorge, Ruiwen
Agenda:
-
Add your agenda items below -
jorge- visibility: document how to take care of Kubernetes
-
fromani- Topology manager test improvements (more in the pipeline)
- RH will port more fixes back to OSS. Yay!
- [alukiano] Need multi-NUMA machines to run tests on NUMA node selection
- Roy: on GCE all NUMA nodes are virtual, so we likely need hardware machines?
- Artem: may need to fake it
- Roy: need to change the command line passing kernel arguments
- Artem: will post the parameters that are required and Roy will follow up.
- The fake NUMA - https://www.kernel.org/doc/html/latest/x86/x86_64/fake-numa-for-cpusets.html
- Sergey to follow up on whether we need this on GCE/GKE/Anthos - justification
- [merkes] (offline update only) Been on call the last couple of weeks and can’t make the meeting today, but I have an AI to update the email as described in last week’s agenda. I will 100% have time to do it in the next couple of days. Freedom at noon!
Sergey to send agenda and open PRs list to the mailing list.
10/12/2020
Attendees:
- Jorge
- Matt Merkes
Agenda:
Add your agenda items below
- For today, please use the Google Hangout link in the invite for this event https://meet.google.com/rzi-qtkm-rbf
- [Matthew] https://github.com/kubernetes/test-infra/issues/18973#issuecomment-705715514
- NodeConformance and NodeFeature tags
-
jorge -
matthew -
matthew
-
- [jorge] Is the mailing list the right place to surface test alerts?
-
matthew -
matthew -
roy -
jorge
10/05/2020
Attendees:
- morgan bauer
- Amim Knabben
- David Porter
- Harshal Patil
- Jorge Alarcon
- merkes
- roy
- Sergey
Agenda:
-
sergey
https://testgrid.k8s.io/sig-node-containerd#containerd-node-conformance
- [Amim/Jorge] let’s add it to community/contributors/devel/sig-node/test-suite.md (?)
Follow ups:
-
Action item: write down this in test guide
-
Action item: Need to find if Docker interested testinfra#17731
- Just need Derek or Dawn approval
-
Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)
- https://github.com/kubernetes/test-infra/issues/19401
- Interesting related issue of renaming: https://github.com/kubernetes/test-infra/issues/19384
-
Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations of node-conformance vs. node-master
- https://github.com/kubernetes/test-infra/issues/18973
- Merkes: added more details.
- Why is running in Docker desired?
- Docker may be a way to contain all the dependencies to make it independent
- But it’s unclear if it makes the tests “independent” as it requires docker.
- Jorge: should we ask sig-architecture SIG?
- Let’s ask sig-architecture if there is a knowledge beyond what’s written
- Merkes to draft it
-
Action: try to land e2e tests documentation into community repo [Sergey]
- Follow the lead here https://github.com/kubernetes/community/pull/5148?
- [Jorge] How do we know which features we are testing?
- Morgan: sounds like we need [Conformance:KEPXYZ] tags
- Merkes: were there any tests added to NodeConformance, and what was meant by that?
- Jorge: we have no documentation anywhere saying what is release blocking
- Merkes, Jorge, Morgan: read some completed KEPs to understand how they approached testing and how we can improve this.
- If a KEP is claimed to be implemented, is there an easy way to check its tests?
- Jorge: +1
9/28/2020
Attendees:
- Matt Merkes
- Sergey Kanzhelev
- Harshal Patil
- Amim Knabben
- Morgan Bauer
Agenda:
-
mhb- https://github.com/kubernetes/test-infra/blob/06a8bab27b300fdd4d65d5711bfd12e41fbc87cc/config/jobs/kubernetes/sig-node/node-kubelet.yaml#L196
- https://github.com/kubernetes/test-infra/blob/06a8bab27b300fdd4d65d5711bfd12e41fbc87cc/jobs/e2e_node/perf-image-config.yaml#L8-L9
- https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-flaky
- Action item: write down this in test guide
-
Follow ups:
- Action item: Need to find if Docker interested testinfra#17731
- No response, if no response again, let’s just remove it.
- Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)
- Create a task to assess and reorganize the gci related tests.
- https://github.com/kubernetes/test-infra/issues/19401
- Interesting related issue of renaming: https://github.com/kubernetes/test-infra/issues/19384
- Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations of node-conformance vs. node-master
- Proposal document doesn’t match reality and no clear way how to achieve this. https://docs.google.com/document/d/1BdNVUGtYO6NDx10x_fueRh_DLT-SVdlPC_SsXjYCHOE/edit#
- Need more information on whether this is how it is supposed to be, then prioritize as appropriate
- Another link: https://kubernetes.io/docs/setup/best-practices/node-conformance/
- Action: try to land e2e tests documentation into community repo [Sergey]
- Action item: Need to find if Docker interested testinfra#17731
Awaiting approval:
-
Action: PreStop test investigation - check if there is another test that tests what description says. If not, fix the test by changing infinite loop to constant work amount https://github.com/kubernetes/kubernetes/pull/94922
- Action: create a tab for RuntimeClass Disruptive test [Sergey] https://github.com/kubernetes/kubernetes/pull/95046
- Action: [Jorge] Follow up on routing prow notifications to mailing list https://github.com/kubernetes/test-infra/pull/19306
- Action: Move the RuntimeClass tests out of node-kubelet-orphans https://github.com/kubernetes/kubernetes/pull/94796 [Harshal]
-
Morgan- Last thing is keeping it from being clean.
-
Diff disruptive vs. serial?
- Not clear
9/21/2020
Attendees:
- Sergey Kanzhelev
- Harshal Patil
- morgan bauer
- David Porter
- Amim Knabben
- Roy Yang
Actions follow up from the last meeting:
- Action item: Need to find if Docker interested testinfra#17731
- Move to the next week
- ci-cri-containerd-e2e-gci-gce-flaky
- Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)
- Action: PreStop test investigation - check if there is another test that tests what description says. If not, fix the test by changing infinite loop to constant work amount
PR open on termination + follow up on filtering
- Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations of node-conformance vs. node-master
- Action: create a tab for RuntimeClass Disruptive test [Sergey]
- Action: try to land e2e tests documentation into community repo [Sergey]
Agenda:
- [karan / roy] Bump 81 -> 85 (where we have image_family)
- For specific image - will do this week
- 85 is going stable this week
- 77 -> 81
- 81 -> 85 python version changed from 2 to 3, breaking containerd builds. But seems to be fixed by https://github.com/containerd/containerd/pull/4559
- Should we remove ContainerD 1.2? -> 1.4
- Morgan helping to move ContainerD to prow
- [knabben] De-flake by simplifying the termination check - https://github.com/kubernetes/kubernetes/pull/94922
- [mhb] Let’s try to understand it first, then ensure it only runs once
- [mhb] Who wants to take a look? Morgan to create an issue
-
Jorge
9/14/2020
Attendees:
- Sergey Kanzhelev
- Matt Merkes
- Amim Knabben
- David Porter
- Jorge Alarcon
Actions follow up from the last meeting:
- Action item: Need to find if Docker interested testinfra#17731
- Action item: find a place to document test dashboards and tabs as well as test matrices. Perhaps have a place on kubernetes.dev or sig-community
- Action item: convert to markdown and find a place for the document [mhb]
- Investigate “node-kubelet-master and node-kubelet-conformance seem like duplicates” https://github.com/kubernetes/test-infra/issues/18973
- Sergey: come up with the critical test move example
Please add your agenda items.
- [karan / roy] I think it’s ok, need to follow up with Roy and Karan
- [knabben] sig-node-containerd/ci-cri-containerd-e2e-gci-gce-flaky
- Does it make sense to run a periodic batch of known flaky tests in 2 hours intervals?
- This grouping has a mix of SIG tests and 2 failing with periodicity:
- [sig-api-machinery] CustomResourcePublishOpenAPI …
- Probably should not be here.
- PreStop graceful pod terminated should wait until preStop hook completes the process:
- This is not true for the current cluster behavior, the Pod is deleted in the grace period and the event FailedPreStopHook is registered
- Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)
- Action: PreStop test investigation - (known issue) - check if there is another test that tests what description says. If not, fix the test by changing infinite loop to constant work amount
- [merkes] Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations
- what is “NodeConformance”?
- NOTE: Found a little more information on the history.
-
sergey-
Jorge - Action: create a tab
-
- [sergey] https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests-kubetest2.md
- https://docs.google.com/document/d/10EKyu4kTo3x7mei2TuGuW8AYeld1EuuSj91AvYNVGX4/edit#heading=h.1c4byrql87wf
- https://docs.google.com/document/d/1BdNVUGtYO6NDx10x_fueRh_DLT-SVdlPC_SsXjYCHOE/edit
- Talked about it at the ContribEx meeting with Joel; he has reservations about how this is going to be kept up to date
- Action: try to land it into community repo [Sergey]
Some stats:
- Test-infra sig/node: PRs: 4 Let’s resolve this one: https://github.com/kubernetes/test-infra/pull/17948
- Test-infra sig/node: issues: 8
- k/k sig node area/test: PRs: 59 (approved): 7 Approved PRs slowly clearing up!
- k/k sig node area/test: issues: 7
- k/k sig node kind/failing-test: PRs: 0 (approved): 0 Yay!
- k/k sig node kind/failing-test: 1
8/31/2020
Please add your agenda items.
- [Sergey] Investigate “node-kubelet-master and node-kubelet-conformance seem like duplicates” https://github.com/kubernetes/test-infra/issues/18973
- Discuss whether we need node-kubelet-serial-alpha tab: https://github.com/kubernetes/test-infra/issues/18972
- [mhb] Where to record the knowledge on what we INTEND to test? Coverage matrix.
- Roy: https://testgrid.k8s.io/sig-node-docker ??
- Morgan: this tab needs to be deleted. The Docker team doesn’t support it any longer (seemingly) testinfra#17731
- Action item: Need to find if Docker interested testinfra#17731
- Docker installed with ContainerD 1.2. So we need to test 1.2, 1.3, 1.4. We need to be explicit on whether we really need to test all of them.
- Action item: find a place to document test dashboards and tabs as well as test matrices. Perhaps have a place on kubernetes.dev or sig-community
- [mhb] Is there some correspondence to https://github.com/kubernetes/kubernetes/blob/master/pkg/features/kube_features.go or are they independent measures?
Action item: convert to markdown and find a place for the document [mhb]
Action item: https://github.com/kubernetes/test-infra/issues/18826
-
Sergey - Next meeting? 9/7 is a US holiday (Labor Day)
- Yes, cancel next week.
8/24/2020
Please add your agenda items.
- [mhb] Action: let’s investigate
- What goes in sig-node-blocking?
- Topology Manager - important for a subset of customers. Maybe not justified as blocking?
- Should we put conformance tests there?
- Jorge: let’s create a process to add to sig node blocking? Based on maintenance commitment from participants
- Also, all tests need to strive to be blocking.
- example proposal: https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md
-
mhb -
- [mhb] Move to next week.
-
Sergey-
For example, this one is not tested as it’s not NodeConformance:
var _ = SIGDescribe("CPU Manager [Serial] [Feature:CPUManager][NodeAlphaFeature:CPUManager]", func() {
-
https://github.com/kubernetes/test-infra/blob/bd3b81b6c985971cf10665af396962f1c3136785/config/jobs/kubernetes/sig-node/node-kubelet.yaml#L257
Also, how do we know which tests are run and which are not?
No tools exist. If anybody is interested, one needs to be created.
Action: create issue to clean up the tab
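As a possible starting point for such a tool, the sketch below pulls the bracketed tags out of a test description and checks the description against a focus pattern. It assumes, for illustration only, that the job focuses on `\[NodeConformance\]`; the real flags live in the linked node-kubelet.yaml. The helper names `tags` and `is_selected` are hypothetical.

```python
import re

TAG_RE = re.compile(r"\[([^\]]+)\]")


def tags(description):
    """Return the bracketed tags in a test description, in order."""
    return TAG_RE.findall(description)


def is_selected(description, focus):
    """True if the description matches the focus regex (substring search)."""
    return re.search(focus, description) is not None


# Description copied from the SIGDescribe line quoted above.
cpu_manager = "CPU Manager [Serial] [Feature:CPUManager][NodeAlphaFeature:CPUManager]"
print(tags(cpu_manager))
print(is_selected(cpu_manager, r"\[NodeConformance\]"))
```

The CPU Manager description carries no NodeConformance tag, so a NodeConformance-focused job would not pick it up, which is the gap noted above.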
- [alejandrox1] Release team is preparing to release 1.19. Please be vigilant with new test failures
- [alejandrox1] Always ask questions :-) All the work that is going on should make sense to us.
8/17/2020
Please add your agenda items.
-
Sergey -
Sergey
| Status | Count |
|---|---|
| FAILING | 16 |
| FLAKY | 26 |
| PASSING | 45 |
| STALE | 18 |
| Grand Total | 105 |
- [Sergey] Policy: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/README.md
- PRs
- 4 types of images
- Presubmits
- Postsubmits
- CI with other tools
- Test with latest tech
- [Action] Morgan to update PRs to point to LTS
- [Action] think about one or two stable images issue
- [karan] PR: https://github.com/kubernetes/test-infra/pull/18877
- Build and release a new image for gcr.io/k8s-testimages/kubekins-e2e:latest-1.18 and above?
8/10/2020
Please add your agenda items.
- [Ning] you’ll receive meeting notifications, emails, etc.
- [alejandrox1] Request write permissions on slack, alias: knabben
- [alejandrox1] TODO: Let’s look over our e2e and see what we actually need
- TODO: come up with a “plan”
- [Dawn] Our perspective as SIG node may be different from SIG testing. We need to maintain our priorities.
- We need to maintain common tests across many vendors. Even though some vendors copy existing tests and run them on different runtimes, we still need to maintain those tests
- [Sergey] https://testgrid.k8s.io/presubmits-kubernetes-blocking
- https://testgrid.k8s.io/sig-release-master-blocking
- https://testgrid.k8s.io/sig-node-kubelet
-
dims - Split tests into two buckets:
- What is actually green and will stay green.
- What we are working on.
- Buckets above will indicate whether something is “approved” by SIG node. Basically a baseline
- Example from SIG Release’s release blocking/informing jobs https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md
- CI jobs that have been broken for months need to be highlighted as not critical for now.
- [Action item] Create tabs and move everything in NODE informing. And slowly move tests to Node blocking. Problem we are solving: if a person is not a part of SIG node - can I ask quickly whether things are OK now or not?
-
Sergey -
Sergey -
Sergey
06/22/2020
- CANCELLED
- Do we need this meeting going forward? If so, move to bi-weekly
- [mhb] categorize the tests here
- https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-orphans
- looking at the prow job definition, it is defined wholly by what it skips --skip="\[Flaky\]|\[NodeConformance\]|\[NodeFeature:.+\]|\[NodeSpecialFeature:.+\]|\[NodeAlphaFeature:.+\]|\[Legacy:.+\]|\[Benchmark\]"
- part of an effort to categorize the tests https://docs.google.com/document/d/1BdNVUGtYO6NDx10x_fueRh_DLT-SVdlPC_SsXjYCHOE/edit#
- added https://github.com/kubernetes/test-infra/commit/6cda335ebfcdace363352b566974ff0dda87a85c
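To make the "defined wholly by what it skips" point concrete, the sketch below applies the skip pattern quoted above to a few test names. Python's `re` stands in for the regexp engine used by the test runner; for this pattern the two should agree, but that is an assumption. The first name is the CPU Manager description from the 8/24/2020 notes; the other names are made up.

```python
import re

# Skip pattern as quoted in the prow job definition above.
SKIP = re.compile(
    r"\[Flaky\]|\[NodeConformance\]|\[NodeFeature:.+\]|\[NodeSpecialFeature:.+\]"
    r"|\[NodeAlphaFeature:.+\]|\[Legacy:.+\]|\[Benchmark\]"
)


def runs_in_orphans(test_name):
    """A test lands in the orphans job only if no skip alternative matches."""
    return SKIP.search(test_name) is None


names = [
    "CPU Manager [Serial] [Feature:CPUManager][NodeAlphaFeature:CPUManager]",
    "Hypothetical kubelet test [NodeConformance]",  # made-up name
    "Hypothetical untagged kubelet test",           # made-up name
    "Hypothetical test [Serial] [Disruptive]",      # made-up name
]
for name in names:
    print(runs_in_orphans(name), name)
```

Anything without one of the listed tags falls through to orphans, including tests tagged only `[Serial]` or `[Disruptive]`, which is why the tab collects uncategorized tests.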
- [alejandrox1] Turns on email notifications from failing e2e jobs in https://testgrid.k8s.io/sig-node-containerd to mailing list https://groups.google.com/forum/#!forum/kubernetes-sig-node-test-failures
- This is different from the notifications people receive from the GitHub team @kubernetes/sig-node-test-failures
-
alejandrox1 -
- [fromani] Rationale: duplicated code between the CPU manager and topology manager e2e tests; the end goal is to clean up and keep the tests working and easy to change/extend
- Initial PR with cleanups which didn’t make it into 1.18 https://github.com/kubernetes/kubernetes/pull/90971 - ptal
06/15/2020
- No meeting today, no items on agenda
- Board is now public with view access:
06/08/2020
- [vpickard] experimenting with a new board for sig-node-testing enhancements
- Helps to view, monitor, manage issues and PRs for sig-node testing
- Currently private
- Started as a way to track issues/PRs that I had been requested to review - getting buried in emails. Found that I was “wasting” time going through emails to find PRs/issues that I needed to spend time on.
- Sig-storage - uses spreadsheet for high level view
- Sig-release - release engineering - uses boards
- Sig-architecture also uses boards
- [karan] Running test suites using FOCUS? Not clear how that works.
- Not covered in alejandrox1’s docs.
-
alejandrox1 - small PR https://github.com/kubernetes/community/pull/4829
-
mhb -
- [mhb] Should we be testing more images?
- COS and Ubuntu were original images - both open source
- Some tests are run on other platforms (RHEL on AWS, for example), export to testgrid
- More images very welcome and encouraged to add to E2E testing - would need some support/knowledge to maintain these images
- [alejandrox1] need to pass management to SIG chairs and owners of this project.
- everyone can join now.
-
alejandrox1 -
- [royyang] a SKIP=\[NodeFeature:RuntimeHandler\] may help for this job
- we could add some additional “annotation” to the test to skip
-
bart0sh - Can we get a community account for running/debugging sig-node E2E tests? Check with wg-k8s-infra.
06/01/2020
-
Follow up for test failure notifications, https://github.com/kubernetes/community/tree/master/sig-node#contact
- can we get on github team @kubernetes/sig-node-test-failures - Test Failures and Triage
- https://github.com/orgs/kubernetes/teams/sig-node-test-failures
-
alejandrox1 - PR to be added to the github team
-
alejandrox1 -
alejandrox1
-
- [alejandrox1] use https://groups.google.com/forum/#!forum/kubernetes-sig-node - this mailing list doesn’t have a ton of traffic currently
- or create a new one in https://github.com/kubernetes/k8s.io/tree/master/groups
- general mailing list guidelines https://github.com/kubernetes/community/blob/master/communication/mailing-list-guidelines.md#mailing-list-creation
- For a general mailing list we need a sig lead or an approver to set it up.
- AI: alejandro - make pr for mailing list
- name suggested in meeting
- contact list to put in pr
- Let’s start with volunteer list at the top of this doc
- start with subscribing to notifications for
- release/merging blocking informing suites
- existing *stable* test suites
-
mhb - what is image policy in general? Where can we write it down? LTS every 6 months, establish a periodic chore to evaluate the swap.
- assuming this ought to be the cos policy, where do we write down “use the lts images”, and use `image_family: cos-XX-lts` ?
- https://cloud.google.com/container-optimized-os/docs/release-notes#lts_image_families
- I like this comment as well, https://github.com/kubernetes/test-infra/pull/17770#issuecomment-636433860
- rollback issue on image_family vs maintenance issue of specific image: pinning
- many images up there and available, why these specific ones?
-
royyang - about COS image
-
Updated doc: https://github.com/kubernetes/kubernetes/pull/91612/
-
Root cos-stable-73 issue: https://github.com/kubernetes/test-infra/pull/17770
-
Clean up and improve images: https://github.com/kubernetes/kubernetes/pull/91543 PTAL, this needs #17770
-
Using image_family is good for some tests, but may be limited when we want to roll back a LTS or stable image.
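The trade-off between a floating image_family and a pinned image can be sketched with a small in-memory model. The image names and families are the LTS entries from the cos-cloud listing in the kickoff notes. Picking the highest version within a family is a simplification of how a family resolves to its newest image, and the `resolve` helper and the config dict shape are hypothetical, not the test-infra image config format.

```python
# Image name -> family, copied from the cos-cloud listing in the kickoff notes.
IMAGES = {
    "cos-69-10895-385-0": "cos-69-lts",
    "cos-73-11647-534-0": "cos-73-lts",
    "cos-77-12371-251-0": "cos-77-lts",
    "cos-81-12871-103-0": "cos-81-lts",
}


def version_key(name):
    """Sort key built from the numeric parts of an image name."""
    return tuple(int(part) for part in name.split("-") if part.isdigit())


def resolve(config, images=IMAGES):
    """Resolve a config to one image name; raise if nothing matches."""
    if "image" in config:  # pinned: reproducible and easy to roll back
        if config["image"] not in images:
            raise LookupError("pinned image not found: " + config["image"])
        return config["image"]
    family = config["image_family"]  # floating: follows new releases
    candidates = [n for n, f in images.items() if f == family]
    if not candidates:
        raise LookupError("no image in family: " + family)
    return max(candidates, key=version_key)


print(resolve({"image_family": "cos-81-lts"}))
print(resolve({"image": "cos-77-12371-251-0"}))

# A hypothetical newer release in the same family (made-up name) moves the
# family result but leaves the pinned result untouched.
newer = dict(IMAGES, **{"cos-81-99999-0-0": "cos-81-lts"})
print(resolve({"image_family": "cos-81-lts"}, newer))
```

The family form needs no maintenance when a new image ships, but rolling back means switching to the pinned form. Raising on a missing image follows the 5/26/2020 suggestion to fail the test if the image is not found.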
-
How to run a single test? Want to debug a couple of tests, and need some steps.
- previous meeting notes from 5/26/2020 on using SKIP=”” & FOCUS=””
- Zhi: issue in k/k to debug
-
-
should we add [NodeFeature: to kinds of tests?
- tag for feature it is specifically testing. help with graduating to GA.
- maybe a kep for it.
- [mhb] Does sig-node monitor them?
- Should we delete them or hand over to sig-scalability
- summarize details in an email. there is an existing PR to ‘fix’ it by giving it a bigger vm to run on. tests added years ago, by a small group of people with not enough time to monitor them. Call for contributors to monitor and manage, and if none, delete.
- related, benchmark job runs.
- cos-69 works, but not later versions.
- [mhb] Match up test runs to image config to make it easier to see the image used in testgrid
- Plan to update remaining image configs (20+) as well
-
mhb -
- [alejandrox1] A lot of work needs approval from approvers who are oversubscribed. What can we do?
- We can learn more, contribute more, and work to gain approval rights to help out.
- [bsdnet] Remove unused code and enhance logging
5/26/2020
-
mhb - Ning Lao, Roy Yang
- How do we know the health of COS images, and which images should we be testing in sig-node? Roy will help us with this.
- The latest LTS should be the one we use, e.g. cos-81-lts
- https://cloud.google.com/container-optimized-os/docs/concepts/release-channels#release_channels
- https://cloud.google.com/container-optimized-os/docs/resources/support-policy
- Need to file an issue with how to deal with COS images?
- can we document what images are being used and why?
- https://github.com/kubernetes/kubernetes/issues/88284
- https://cloud.google.com/container-optimized-os/docs/release-notes
- Suggestion to use regex, like gke. But, fail the test if the image is not found.
- Consider using image family
-
alejandrox1 -
royyang -
Ed -
alejandrox1- WIP (will pr the appropriate bits and pieces back into Kubernetes)
- Sig-testing community update https://docs.google.com/presentation/d/1H-MLhKJJVsQG2eDCEv48M_WAzMc66dKaYMgfOSGQRJM/edit#slide=id.g338ac0a8b6_0_27
- Regarding release-blocking/merge-blocking/release-informing jobs, see notes on 5/20/2020 below
- Create email group for sig-node test failures, update jobs to send email to list when job fails
- github group and/or googlegroups mailing list.
- crib off of what other sigs do.
- [vpickard] Import test results from AWS E2E testing to testgrid
- data is in bucket, but results are not in testgrid
- gs://kubernetes-github-redhat/logs/ci-kubernetes-conformance-node-e2e-containerized-rhel/10653/
- KETTLE issue? Any pointers?
- SIG Testing resources
- [karan] Running test suites using FOCUS? Not clear how that works.
- Covered in docs linked above - will check there.
- [karan] Karan will create one and get that discussion going
- Created https://github.com/kubernetes/test-infra/issues/17714
- [davidporter] Setting up VM for local tests and running node e2e tests: https://gist.github.com/bobbypage/f922d2dea47912786ddc0a0d2fab0fd1
- [davidporter] Open issue to update topology manager to beta from alpha
- [zhi] "Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded." - How to deal with this CPU usage issue?
5/20/2020
Sig-node E2E node-kubelet master started failing, an issue was reported here. This was blocking PR merges, and also this test is release-blocking.
And, here is the slack thread with debugging information. There are some good bits of info, such as how to run some of the tests, some of the false leads we were chasing. Will attempt to incorporate some of this debugging/running tests, etc into this doc.
Overall, a great team effort to debug and get to the root cause!
5/17/2020
- Review Goals
- [Ed] update cos images
- Added info on “COS cloud image” section in this doc
- Updates from volunteers investigating tests
- Which tests should we focus on first?
- Merge blocking, release blocking, release informing jobs
- Kubernetes release blocking jobs https://testgrid.k8s.io/sig-release-master-blocking and https://testgrid.k8s.io/sig-release-master-informing
- https://testgrid.k8s.io/sig-release-master-blocking#node-kubelet-master
- https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features
- Release blocking criteria https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md
- Is there something you need to help make progress on these tests?
- Documentation for how to run e2e node tests https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/e2e-node-tests.md
- Setting up VM for local tests and running node e2e tests: https://gist.github.com/bobbypage/f922d2dea47912786ddc0a0d2fab0fd1
- KIND may not work for debugging/running these tests. E2E tests spin up a VM on GCP, and ssh to that VM. The COS images are used to launch the VM.
- Will need both the test-infra and kubernetes repos to be able to run jobs locally and remotely
- Is there a shared google project for volunteers to use for testing patches and debugging?
- At quota limit right now
- File an issue in kubernetes/k8s.io repo, to ask for shared project for testing/debugging/ssh access
- How to find the code that is running the tests?
- Description of the types of tests https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests
5/11/2020 Kickoff Meeting
- 30 second Introductions - Name, k8s experience, E2E testing experience
- Review this document
- Testgrid overview
- Spreadsheet - Sign Up for specific tests
- Meet weekly - same day/time work for most folks?
History/Overview of E2E testing
What broke last week
- Suspect some scan tool scanned the platform, found some CVEs, and shut down the platform/network.
- Root cause likely caused by old images, with CVE
- Network access shutdown, could not even get in to debug
- Many jobs were pinned to specific google cloud project
- Why were all of the jobs pinned to one specific google cloud project?
Prow Channels
COS cloud image
Many of the jobs use cos-xxx images. What are these images, and where can you find them?
Container-Optimized OS is an operating system image for your Compute Engine VMs that is optimized for running Docker containers. With Container-Optimized OS, you can bring up your Docker containers on Google Cloud Platform quickly, efficiently, and securely. Container-Optimized OS is maintained by Google and is based on the open source Chromium OS project.
https://cloud.google.com/compute/docs/images
https://cloud.google.com/container-optimized-os/docs
https://cloud.google.com/container-optimized-os/docs/concepts/release-channels
Link to release notes with image contents (docker, kernel, k8s version)
https://cloud.google.com/container-optimized-os/docs/release-notes
vpickard@rippleRider$
NAME                       PROJECT    FAMILY      DEPRECATED  STATUS
cos-69-10895-385-0         cos-cloud  cos-69-lts              READY
cos-73-11647-534-0         cos-cloud  cos-73-lts              READY
cos-77-12371-251-0         cos-cloud  cos-77-lts              READY
cos-81-12871-103-0         cos-cloud  cos-81-lts              READY
cos-beta-81-12871-44-0     cos-cloud  cos-beta                READY
cos-dev-84-13078-0-0       cos-cloud  cos-dev                 READY
cos-stable-81-12871-103-0  cos-cloud  cos-stable              READY
vpickard@rippleRider$
archiveSizeBytes: '8233374400'
creationTimestamp: '2020-05-07T21:29:31.522-07:00'
description: 'Google, Container-Optimized OS, 81-12871.103.0 stable, Kernel: ChromiumOS-4.19.112
Kubernetes: 1.17.3 Docker: 19.03.6 Family: cos-81-lts, supports Shielded VM features'
diskSizeGb: '10'
family: cos-81-lts
guestOsFeatures:
- type: UEFI_COMPATIBLE
- type: VIRTIO_SCSI_MULTIQUEUE
id: '4296652415682830020'
https://cloud.google.com/compute/docs/images
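To make the image names in the listing above easier to compare, here is a minimal sketch that splits a name into channel, milestone, and build. The naming pattern (`cos-[channel-]<milestone>-<build>`) is an assumption inferred only from the seven rows in the listing, not taken from the COS documentation, and `parse_image` and `group_by_milestone` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Rows copied from the listing above: (image name, family).
IMAGES = [
    ("cos-69-10895-385-0", "cos-69-lts"),
    ("cos-73-11647-534-0", "cos-73-lts"),
    ("cos-77-12371-251-0", "cos-77-lts"),
    ("cos-81-12871-103-0", "cos-81-lts"),
    ("cos-beta-81-12871-44-0", "cos-beta"),
    ("cos-dev-84-13078-0-0", "cos-dev"),
    ("cos-stable-81-12871-103-0", "cos-stable"),
]

# Assumed pattern, inferred from the rows above:
#   cos-[channel-]<milestone>-<build part>-<build part>-<build part>
NAME_RE = re.compile(
    r"^cos-(?:(?P<channel>[a-z]+)-)?(?P<milestone>\d+)-(?P<build>\d+-\d+-\d+)$"
)


def parse_image(name):
    """Split an image name into channel (may be None), milestone, and build."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized image name: {name}")
    return {
        "channel": match.group("channel"),
        "milestone": int(match.group("milestone")),
        "build": match.group("build").replace("-", "."),
    }


def group_by_milestone(images):
    """Map each milestone number to the families that currently point at it."""
    groups = defaultdict(list)
    for name, family in images:
        groups[parse_image(name)["milestone"]].append(family)
    return dict(groups)


if __name__ == "__main__":
    for milestone, families in sorted(group_by_milestone(IMAGES).items()):
        print(milestone, families)
```

Grouping by milestone shows at a glance that the cos-81-lts, cos-beta, and cos-stable families in this listing all resolve to milestone 81 images.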
GCP Testing Projects
List of projects available for testing.
Documentation/links that describe project
- Number of machines
- Machine specs
- How to access
- Quotas
- Image availability
- Other
New Infra Goals
- Move testing out of google.com
- Move infrastructure more out in the open (#wg-test-infra)
- Capacity planning
- Shared infrastructure
- Shared duty to maintain infrastructure
- Some of the less-critical infra has moved over already
- Create a PROW build cluster out in the open, with Boskos
- What APIs
- What Quota
- What IAM (can this be scripted)
- Allow some users ssh access
Prow
How does this fit into E2E testing?
Configuration of Prow-provisioned clusters lives in YAML files in the prow/cluster directory. There is no intent to have a Prow.yaml file in each tested repo (a la .travis.yml files).
Control plane
- Talk to github
- Spin up pods on clusters
Build clusters
- Boskos on each cluster
Node E2E testing environment
Slack channels: #sig-node, #sig-testing, #testing-ops
Jobs are defined in kubernetes test-infra repo.
Questions:
Where can we find a list of images that can be used in CI and PR tests?
How do we go about creating a sig-node-test-failure notification list?
What is the process for getting access to be able to debug failures in real-time on the system under test?
Kubelet CI jobs are defined in ../test-infra/config/jobs/kubernetes/sig-node/node-kubelet.yaml
- Testgrid-dashboards: sig-release-master-blocking, sig-node-kubelet
- Testgrid-alert-email: kubernetes-sig-node+testgrid@googlegroups.com
- Containers: image: gcr.io/k8s-testimages/kubekins-e2e:v20200420-e830a3a-master
- Where to find all of these images and how to determine which image to use?
- gcp-project=k8s-jnks-gke-gci-soak
- What is the list of projects available and how to determine which project to use?
- gcp-zone=us-west1-b
- List of available zones, which one to use
Kubelet presubmit (Pull Request) jobs are defined in: ../test-infra/config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml
- Containers: image: gcr.io/k8s-testimages/kubekins-e2e:v20200420-e830a3a-master
- Deployment: node
- What are other choices?
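One way to see which jobs are pinned to a project or zone is to scan each job's argument list. This is a minimal sketch under stated assumptions: the `--key=value` flag form is assumed (the notes above only show `gcp-project=...` and `gcp-zone=...`), the `--deployment=node` flag is an illustrative guess based on the "Deployment: node" entry, and `pinned_settings` is a hypothetical helper, not part of test-infra.

```python
def pinned_settings(args, keys=("gcp-project", "gcp-zone")):
    """Return the key/value pairs from `args` whose key is in `keys`.

    Arguments are assumed to look like "--key=value"; anything without
    an "=" or with an unlisted key is ignored.
    """
    found = {}
    for arg in args:
        key, sep, value = arg.lstrip("-").partition("=")
        if sep and key in keys:
            found[key] = value
    return found


if __name__ == "__main__":
    # Values taken from the notes above; the flag spelling is an assumption.
    job_args = [
        "--deployment=node",
        "--gcp-project=k8s-jnks-gke-gci-soak",
        "--gcp-zone=us-west1-b",
    ]
    print(pinned_settings(job_args))
```

A job whose result has no `gcp-project` entry is a candidate for letting a project pool pick the project instead.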
How to run tests
Boskos
What is this and how does it work
- Manages pools of projects that it owns. If no project is specified in the job's YAML file, Boskos chooses a project for the job
- Checks a project out
- Finds a project for the job to run on
- Cleans up after the job once it has run
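The checkout, run, and cleanup cycle described above can be pictured with a toy model. This is only a sketch of the idea, not the Boskos API: the class `ProjectPool`, the state names `free`, `busy`, and `dirty`, and the project names are all assumptions made for illustration.

```python
class ProjectPool:
    """Toy model of a lease-based pool of projects (not the real Boskos API)."""

    def __init__(self, projects):
        # Every project starts out available.
        self.state = {project: "free" for project in projects}
        self.owner = {}

    def acquire(self, job):
        """Check out the first free project for `job`."""
        for project, state in self.state.items():
            if state == "free":
                self.state[project] = "busy"
                self.owner[project] = job
                return project
        raise RuntimeError("no free project in the pool")

    def release(self, project, job):
        """Hand a project back; it stays unavailable until it is cleaned."""
        if self.owner.get(project) != job:
            raise ValueError(f"{project} is not held by {job}")
        del self.owner[project]
        self.state[project] = "dirty"

    def clean(self):
        """Clean every dirty project and return the list that became free."""
        cleaned = [p for p, s in self.state.items() if s == "dirty"]
        for project in cleaned:
            self.state[project] = "free"
        return cleaned


if __name__ == "__main__":
    pool = ProjectPool(["example-project-a", "example-project-b"])
    leased = pool.acquire("example-job")
    pool.release(leased, "example-job")
    print(pool.clean())
```

The point of the intermediate `dirty` state in this sketch is that a project is never handed to the next job until leftovers from the previous one have been removed.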
TestGrid
The kubernetes testgrid is here: https://testgrid.k8s.io/
The repo is here; it links to a video of the testgrid session from the 2018 contributor summit:
https://github.com/kubernetes/test-infra/tree/master/testgrid
Most of the sig-node jobs are under the sig-node tab.
Exception: the recently added CI and PR jobs for Topology Manager are under wg-resource-management. The CPU Manager job was there initially, so the Topology Manager jobs were added there as well. These jobs should likely be in sig-node.
Reading testgrid notes
Top level tab
colors (guesses based on observation, I think this must be defined somewhere in testgrid repo or config)
- red: Failing
- blue: Passing, and flaky does not count against it.
- black: Stale, tests have not run
This seems to hold recursively for tabs of tabs.
We should focus on red tabs.
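Finding the red tabs inside nested tab groups amounts to a recursive walk. The sketch below assumes a hypothetical nested dictionary with `name`, `status`, and `children` keys and a `FAILING` status string; none of this is the TestGrid data model or API, and the tab names are made up.

```python
def failing_tabs(node, path=()):
    """Return the full path of every leaf tab under `node` that is FAILING."""
    here = path + (node["name"],)
    children = node.get("children", [])
    if not children:
        return [here] if node.get("status") == "FAILING" else []
    found = []
    for child in children:
        found.extend(failing_tabs(child, here))
    return found


if __name__ == "__main__":
    # Hypothetical dashboard layout, for illustration only.
    tree = {
        "name": "example-group",
        "children": [
            {"name": "example-tab-1", "status": "PASSING"},
            {
                "name": "example-subgroup",
                "children": [
                    {"name": "example-tab-2", "status": "FAILING"},
                    {"name": "example-tab-3", "status": "STALE"},
                ],
            },
        ],
    }
    for tab_path in failing_tabs(tree):
        print(" / ".join(tab_path))
```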
tab names
Come from annotations in the config files, example:
https://github.com/kubernetes/test-infra/blob/a70b1248bacee4dbc332f796d5a3e38411c3f6d6/config/jobs/kubernetes/sig-node/containerd.yaml#L57-L59
annotations:
  testgrid-dashboards: sig-node-containerd
  testgrid-tab-name: containerd-build
testgrid-dashboards can have multiple entries, so that the test suite shows up in multiple dashboards
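To see which tabs end up on which dashboard, the annotations can be inverted into a dashboard-to-tabs index. This is a minimal sketch: it assumes multiple dashboards are written as a comma-separated string (as in the kubelet job notes earlier in this document), it assumes the tab name falls back to the job name when `testgrid-tab-name` is absent, and the job names and helper names are hypothetical.

```python
from collections import defaultdict


def dashboards_for(annotations):
    """Split the comma-separated testgrid-dashboards value into a list."""
    raw = annotations.get("testgrid-dashboards", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


def index_by_dashboard(jobs):
    """Map each dashboard name to the tab names that appear on it."""
    index = defaultdict(list)
    for job_name, annotations in jobs.items():
        # Assumption: without an explicit tab name, the job name is used.
        tab = annotations.get("testgrid-tab-name", job_name)
        for dashboard in dashboards_for(annotations):
            index[dashboard].append(tab)
    return dict(index)


if __name__ == "__main__":
    # Job names are hypothetical; annotation values come from these notes.
    jobs = {
        "example-containerd-job": {
            "testgrid-dashboards": "sig-node-containerd",
            "testgrid-tab-name": "containerd-build",
        },
        "example-kubelet-job": {
            "testgrid-dashboards": "sig-release-master-blocking, sig-node-kubelet",
        },
    }
    for dashboard, tabs in index_by_dashboard(jobs).items():
        print(dashboard, tabs)
```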
test suite descriptions
Image Policy
- pr/pull/presubmits should be stable
- implies fewer tests and a hardcoded image
- CI runs less often, can be somewhat less stable
- maybe we can split this up
- Overall, why are we picking different images?
- versions of dependencies inside - docker, containerd, runc, selinux, etc
- Safe PR image bump mechanism
- create a mechanism to extract the images used for successful CI runs
- once the count of successful CI runs is high enough, bump the pr images
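The bump mechanism proposed above could look roughly like the sketch below. It is an illustration of the idea, not an existing tool: `image_to_promote` is a hypothetical function, the threshold is arbitrary, and counting only consecutive successes (resetting on a failure) is an assumption that is stricter than a plain count.

```python
def image_to_promote(ci_runs, current_pr_image, min_successes=10):
    """Pick an image that has earned promotion to PR jobs, or None.

    ci_runs is a list of (image, passed) tuples ordered oldest to newest.
    An image qualifies once it has at least `min_successes` consecutive
    passing runs; any failure resets its streak to zero.
    """
    streak = {}
    for image, passed in ci_runs:
        streak[image] = streak.get(image, 0) + 1 if passed else 0

    # Prefer the most recently exercised image that qualifies.
    for image, _ in reversed(ci_runs):
        if image != current_pr_image and streak[image] >= min_successes:
            return image
    return None


if __name__ == "__main__":
    # Image names reused from the COS listing earlier in these notes.
    runs = [("cos-81-12871-103-0", True)] * 3 + [("cos-dev-84-13078-0-0", True)] * 2
    print(image_to_promote(runs, "cos-77-12371-251-0", min_successes=3))
```

Resetting the streak on failure keeps a flaky image from slowly accumulating enough passes to reach PR jobs.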
Q&A
Q: bootstrap.py warning
W0518 20:41:32.851] **************************************************************************
bootstrap.py is deprecated!
test-infra oncall does not support any job still using bootstrap.py.
Please migrate your job to podutils!
https://github.com/kubernetes/test-infra/blob/master/prow/pod-utilities.md
**************************************************************************
Should we be solving this?
Q: How much history is available in testgrid?
About 2 weeks in testgrid. More is available in the GCS buckets that back Prow, which hold 90 days.
Q: Why do we build containerd? Why do we build containerd/cri?
- They are upstream and have their own CI.
- ci-containerd-build is pull-cri-containerd-build
- Maybe the question is, why are there ci jobs vs pull jobs?
- Should we put the PR version into the sig-node tab group?
- End result of this should be
- maybe deleting some of these
- maybe moving some into the existing testgrid panel
- definitely putting information into description annotation of job definition
Q: do we have old tests not cleaning up? How do we check?
List of Jobs & Fields to Fill
Catalogue Existing jobs
https://docs.google.com/spreadsheets/d/1mEU8B2_PmMwwgp-_xnyp7QYMBwcLoA9NNlHwDyMvO0Y/edit#gid=0
In this section, we should list out every job that is running in sig-node, and provide the following:
Job Name
Number of test cases
Intent of test
How long test runs
Brief history of job pass/fail
Importance level to sig-node: low, medium, high
Email contact for failures
Any test cases that are obviously missing
Any redundant tests
Overall state of the test: poor, good, excellent
What resources are required for the test (CPUs, GPUs, NICs, SSD, image type, Container OS)
What version of the image, and why
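The fields listed above map naturally onto one record per job. The sketch below is one possible shape for such a record, with a helper that reports which fields are still blank; the class name, field names, and units are assumptions made for illustration and do not describe the spreadsheet's actual columns.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional

IMPORTANCE_LEVELS = {"low", "medium", "high"}
OVERALL_STATES = {"poor", "good", "excellent"}


@dataclass
class JobRecord:
    """One catalogue entry per sig-node job (hypothetical layout)."""

    name: str
    num_test_cases: Optional[int] = None
    intent: str = ""
    duration_minutes: Optional[float] = None
    pass_fail_history: str = ""
    importance: str = ""          # low, medium, or high
    failure_contact: str = ""
    missing_test_cases: List[str] = field(default_factory=list)
    redundant_tests: List[str] = field(default_factory=list)
    overall_state: str = ""       # poor, good, or excellent
    resources: str = ""           # CPUs, GPUs, NICs, SSD, image type, OS
    image_version: str = ""
    image_rationale: str = ""

    def __post_init__(self):
        # Blank is allowed so that records can be filled in gradually.
        if self.importance and self.importance not in IMPORTANCE_LEVELS:
            raise ValueError(f"bad importance: {self.importance}")
        if self.overall_state and self.overall_state not in OVERALL_STATES:
            raise ValueError(f"bad overall state: {self.overall_state}")

    def unfilled(self):
        """Names of scalar fields that are still None or empty."""
        blank = []
        for spec in fields(self):
            value = getattr(self, spec.name)
            if value is None or value == "":
                blank.append(spec.name)
        return blank


if __name__ == "__main__":
    record = JobRecord(name="example-job", importance="high")
    print(record.unfilled())
```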
References
- Dec 2018 KubeCon Intro to Testing SIG - Aaron Crickenberger
- Dec 2020 - A Tour of CI on The Kubernetes Project