Kubernetes SIG-Node CI subgroup notes

12/21/2020 Pre-new-year clean wrap up

Attendees (4):

Sergey Kanzhelev
Amim Knabben
Elana Hashman
Matt Merkes

Agenda:

We will try to summarize all ongoing work on project board: https://github.com/orgs/kubernetes/projects/43

Add all issues to the board and have an “unrelated” column

Triage
- Test-infra sig/node
  - PRs: 5
  - Issues: 13
- k/k area/test:
  - PRs: 42
  - Issues: 50
- k/k failing-test:
  - PRs: 2
  - Issues: 8

12/14/2020

Attendees (7):

Sergey Kanzhelev
Artyom Lukianov
Jorge Alarcon
Matt Merkes
Francesco Romani (fromani)
Amim Knabben

Agenda:

```
knabben
```
- How to increase team velocity on issue response? After an issue is picked up, how do we organize and sync efforts on it?
  - Try using the Slack channel (#sig-node) to discuss issue debugging
  - Update the project board with the latest issues
```
merkes
```
- Is it time to fix it or kill it? - https://github.com/kubernetes/test-infra/issues/18973
- Jorge: definitely need to document the decision. The thinking is that node/e2e is supposed to be NodeConformance, while regular conformance is under test/e2e.
- Should NodeConformance and master run differently?
- Action: let’s make a doc and once we feel OK - review with Dawn and implement
```
dims
```
- Action: review dockershim test coverage

12/07/2020

Attendees:

jorge
sergey
amim
morgan
matt
roy

Agenda:

```
knabben
```
- Morgan: more people is better. Currently there is a disconnect between CI and ongoing work.
- Amim: we have an alerting for the jobs
- Sergey: many features come in hot right before the code freeze date, and many improvements are merged in the last week. A small PR that breaks tests may therefore block a large number of PRs, and improvements would not make it into the target release.
- Jorge: test infra is complicated and will be difficult for people to learn. We need a lot of improvements in this area if we are going to block PRs on tests. We need more long-term contributors; this change will work best for them, but may inhibit one-feature contributors.
- Sergey: this group makes slower progress but develops knowledge; the proposal from Wojtek is the other extreme, constant firefighting.
- Jorge: maybe we need an official onboarding (bootcamp) that can improve the quality of contributors. This may be a middle-ground solution.
```
alejandrox1
```
- we need a central data structure for configuring test environments
- we need a simple way for populating the above
```
Morgan
```
Morgan: happy to join
```
SergeyKanzhelev
```
- Morgan: yes, it is hard to figure out. Would be great to have an explicit flag. How to know? - many tests are defaulting to ContainerD now.
- We definitely will learn something when we start doing it.
- Action: Sergey to start an issue
  - Also follow up with recently added crio jobs
- PR and CI may be different.
- Some scalability tests may be specifying CNI explicitly
```
SergeyKanzhelev
```
- pod lifecycle moved to conformance. Was passing in previous location.
- This test was moved from orphans. Need to investigate why it was passing there and is failing here
- Action: merkes will take a look, also check whether there was an e-mail about it.
- original PR https://github.com/kubernetes/kubernetes/pull/96485/files
```
SergeyKanzhelev
```
Jorge: will take it and will take a look.
```
mhb
```
- https://github.com/kubernetes/org/blob/8dde2c258144702b4130d8bdc7fc0891dcb72422/config/kubernetes/sig-release/teams.yaml#L17
- Maybe too late for 1.20, but worth checking. If nothing is done, this will simply be merged into master

11/30/2020

Attendees (4 on a call):

Artyom Lukianov
Matt Merkes
Sergey Kanzhelev
Morgan Bauer

Agenda:

```
Artyom Lukianov
```

11/23/2020

Cancelled! No agenda for today and it is a “short” week in the US.

11/16/2020

Attendees (# on a call):

Artyom Lukianov
Francesco Romani (fromani)
Matt Merkes
Sergey Kanzhelev

Agenda:

```
Sergey
```
- ```
Artyom
```
- https://github.com/kubernetes/kubernetes/issues/95041

```
fromani
```
- ```
Ruiwen
```
Mention fromani on test related PRs

11/09/2020

Attendees (# on a call):

Artyom Lukianov
Francesco Romani (fromani)
Sergey Kanzhelev
Karan Goel (karan, Google)

Agenda:
[merkes] node conformance ci debrief - failing 10-21 to 11-03 due to insecure port change PR#XYZ, bearer-token not plumbed in conformance test-suite mode. Residual concern of the conformance mode being a completely different setup from normal runs. Test output looks different from before.
node-kubelet-serial is the last red tab in sig-node-kubelet
[Sergey] containerd jobs: are those owned by ‘sig-node’ or by ‘containerd’? There is much overlap in participants, but I wonder who is concerned about them.

```
Karan
```
- NPD e2e is failing, I have a repro but not a root cause. Would appreciate another set of eyes to investigate.
- https://github.com/kubernetes/kubernetes/pull/96262#issuecomment-723208823
- AI: Artyom will take a look - see the comment https://github.com/kubernetes/kubernetes/pull/96262/files#r520785823
```
Artyom
```
- Fake NUMA - requires kernel compiled with the CONFIG_NUMA and CONFIG_NUMA_EMU
- Any additional thoughts?
- CONFIG_NUMA_EMU is missing on COS. Looking into it. Update next week or the week after.
- AI: Artyom to create a ticket on test-infra - https://github.com/kubernetes/test-infra/issues/19902
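
The kernel requirement above can be checked mechanically. A minimal sketch, assuming the kernel configuration is available as plain text in the usual `CONFIG_X=y` format (where it can be read from varies by image and is not covered here); the function name `missing_kernel_options` and the sample config are hypothetical.

```python
REQUIRED_OPTIONS = ("CONFIG_NUMA", "CONFIG_NUMA_EMU")


def missing_kernel_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not built in (=y) in config_text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value == "y":
            enabled.add(name)
    return [opt for opt in required if opt not in enabled]


# Hypothetical config resembling an image without NUMA emulation.
sample_config = """\
CONFIG_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_X86_64=y
"""

print(missing_kernel_options(sample_config))
```

An empty list would mean both options are present, which is the precondition for trying fake NUMA on that image.
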
```
Francesco (fromani)
```
- Makes the flow more robust - please check the commit message for more context
- You won’t see these failures in upstream CI (even with /test pull-kubernetes-node-kubelet-serial-topology-manager) because this part of the test only runs on a multi-NUMA box with SR-IOV (https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/topology_manager_test.go#L732)

11/02/2020

Attendees (9 on a call):

Alukiano
Sergey Kanzhelev
Fromani
Merkes
Amim Knabben
morgan

Agenda:

```
Sergey
```
```
Sergey
```
Please help review issues and comments https://github.com/orgs/kubernetes/projects/43
New TestGrid alerts at sig-node-kubelet#node-kubelet-conformance since commit/237dae5a5 (Matt) - still WIP
PR for secure port may have broken test. Now need to check it. Morgan to take a look.

10/26/2020

Attendees: 6 on call:

Alukiano
Matt Merkes
fromani

Agenda:

```
Sergey
```
https://k8s-testgrid.appspot.com/sig-node-containerd#containerd-node-conformance (Sergey)
```
Roy
```
This is where the containerd test fails
```
merkes
```
```
alukiano
```
```
[Ro
```
```
Amim
```
https://github.com/kubernetes/test-infra/issues/19401
1 page write up to formalize the process.
Step 1: Renaming dashboards is low-hanging fruit and an easy place to begin.
Step 2: Test infra - rename scripts and programs with a transition step (in extract_k8s.go, add cos* and cosCI* to co-exist with gci* and gciCI* to avoid breakage).
Step 3: Rename cluster/gce/gci to cluster/gce/cos in the kubernetes repo, and make sure the containerd/cri repo does not hardcode the old path in its YAML.

10/19/2020

Attendees:

Sergey, Jorge, Ruiwen

Agenda:

```
Add your agenda items below
```
```
jorge
```
- visibility: document how to take care of Kubernetes
```
fromani
```
- Topology manager test improvements (more in the pipeline)
- RH will port more fixes back to OSS. Yay!
```
alukiano
```
- Need to have multi-numa machines to run tests on numa node selection
- Roy: on GCE all NUMA nodes are virtual, so we likely need hardware machines?
- Artem: may need to fake it
- Roy: need to change the command line passing kernel arguments
- Artem: will post parameters that are required and Roy will follow up.
- The fake NUMA - https://www.kernel.org/doc/html/latest/x86/x86_64/fake-numa-for-cpusets.html
- Sergey to follow up on whether we need this on GCE/GKE/Anthos (justification)
```
merkes - offline update only
```
- Been on call the last couple of weeks and can’t make the meeting today, but I have an AI to update the email as described in last week’s agenda. I will 100% have time to do it in the next couple of days. Freedom at noon!

Sergey to send agenda and open PRs list to the mailing list.

10/12/2020

Attendees:

Jorge
Matt Merkes

Agenda:

```
Add your agenda items below
```
- For today, please use google hangout link in the invite for this event https://meet.google.com/rzi-qtkm-rbf
```
Matthew
```
- https://github.com/kubernetes/test-infra/issues/18973#issuecomment-705715514
- NodeConformance and NodeFeature tags
  - ```
  jorge
```
- ```
matthew
```
  - ```
  matthew
```
```
jorge
```
- Is the mailing list the right place to surface test alerts?
- ```
matthew
```
- ```
matthew
```
- ```
roy
```
- ```
jorge
```
  - https://github.com/kubernetes/community/pull/5244

10/05/2020

Attendees:

morgan bauer
Amim Knabben
David Porter
Harshal Patil
Jorge Alarcon
merkes
roy
Sergey

Agenda:

```
sergey
```

https://testgrid.k8s.io/sig-node-containerd#containerd-node-conformance

```
Amim/Jorge
```
- let’s add it to community/contributors/devel/sig-node/test-suite.md (?)

Follow ups:

Action item: write down this in test guide
Action item: Need to find if Docker interested testinfra#17731
- Just need Derek or Dawn approval
Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)
- https://github.com/kubernetes/test-infra/issues/19401
- Interesting related issue of renaming: https://github.com/kubernetes/test-infra/issues/19384
Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations of node-conformance vs. node-master
- https://github.com/kubernetes/test-infra/issues/18973
- Merkes: added more details.
- Why is running in Docker desired?
  - Docker may be a way to contain all the dependencies to make it independent
  - But it’s unclear if it makes the tests “independent” as it requires docker.
- Jorge: should we ask sig-architecture SIG?
- Let’s ask sig-architecture if there is a knowledge beyond what’s written
- Merkes to draft it
Action: try to land e2e tests documentation into community repo [Sergey]
- Follow the lead here https://github.com/kubernetes/community/pull/5148?
```
Jorge
```
- How do we know which features we are testing?
- Morgan: sounds like we need a [Conformance:KEPXYZ] tag
- Merkes: were there any tests added to NodeConformance, and what was meant by this?
- Jorge: we have no documentation anywhere saying what is release blocking
- Merkes, Jorge, Morgan: read some completed KEPs to understand how they approached testing and how we can improve this.
  - If a KEP is claimed to be implemented, is there an easy way to check its tests?
  - Jorge: +1

9/28/2020

Attendees:

Matt Merkes
Sergey Kanzhelev
Harshal Patil
Amim Knabben
Morgan Bauer

Agenda:

```
mhb
```
Follow ups:
- Action item: Need to find if Docker interested testinfra#17731
  - No response, if no response again, let’s just remove it.
- Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)
  - Create a task to assess and reorganize the gci related tests.
    - https://github.com/kubernetes/test-infra/issues/19401
    - Interesting related issue of renaming: https://github.com/kubernetes/test-infra/issues/19384
- Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations of node-conformance vs. node-master
  - Proposal document doesn’t match reality and no clear way how to achieve this. https://docs.google.com/document/d/1BdNVUGtYO6NDx10x_fueRh_DLT-SVdlPC_SsXjYCHOE/edit#
  - Need more information on whether this is how it is supposed to be, then prioritize as appropriate
  - Another link: https://kubernetes.io/docs/setup/best-practices/node-conformance/
- Action: try to land e2e tests documentation into community repo [Sergey]

Awaiting approval:

Action: PreStop test investigation - check if there is another test that tests what description says. If not, fix the test by changing infinite loop to constant work amount https://github.com/kubernetes/kubernetes/pull/94922
- Action: create a tab for RuntimeClass Disruptive test [Sergey] https://github.com/kubernetes/kubernetes/pull/95046
Action: [Jorge] Follow up on routing prow notifications to mailing list
https://github.com/kubernetes/test-infra/pull/19306
Action: Move the RuntimeClass tests out of node-kubelet-orphans https://github.com/kubernetes/kubernetes/pull/94796 [Harshal]
```
Morgan
```
- Last thing is keeping it from being clean.
Diff disruptive vs. serial?
- Not clear

9/21/2020

Attendees:

Sergey Kanzhelev
Harshal Patil
morgan bauer
David Porter
Amim Knabben
Roy Yang

Actions follow up from the last meeting:

Action item: Need to find if Docker interested testinfra#17731
- Move to the next week
ci-cri-containerd-e2e-gci-gce-flaky
- Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)
- Action: PreStop test investigation - check if there is another test that tests what description says. If not, fix the test by changing infinite loop to constant work amount
  PR open on termination + follow up on filtering
Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations of node-conformance vs. node-master
Action: create a tab for RuntimeClass Disruptive test [Sergey]
Action: try to land e2e tests documentation into community repo [Sergey]

Agenda:

```
karan / roy
```
- Bump 81 -> 85 (where we have image_family)
- For specific image - will do this week
- 85 is going stable this week
- 77 -> 81
- 81 -> 85 python version changed from 2 to 3, breaking containerd builds. But seems to be fixed by https://github.com/containerd/containerd/pull/4559
- Should we remove ContainerD 1.2? -> 1.4
  - Morgan helping to move ContainerD to prow
```
knabben
```
- De-flaky by simplifying the termination check - https://github.com/kubernetes/kubernetes/pull/94922
```
mhb
```
- Let’s try to understand it first, then ensure it only runs once
```
mhb
```
- Who wants to take a look? Morgan to create an issue
```
Jorge
```

9/14/2020

Attendees:

Sergey Kanzhelev
Matt Merkes
Amim Knabben
David Porter
Jorge Alarcon

Actions follow up from the last meeting:

Action item: Need to find if Docker interested testinfra#17731
Action item: find a place to document test dashboards and tabs as well as test matrices. Perhaps have a place on kubernetes.dev or sig-community
Action item: convert to markdown and find a place for the document [mhb]
Investigate “node-kubelet-master and node-kubelet-conformance seem like duplicates” https://github.com/kubernetes/test-infra/issues/18973
Sergey: come up with the critical test move example

Please add your agenda items.

```
karan / roy
```
- I think it’s ok, need to follow up with Roy and Karan
```
knabben
```
- sig-node-containerd/ci-cri-containerd-e2e-gci-gce-flaky
  - Does it make sense to run a periodic batch of known flaky tests in 2 hours intervals?
  - This grouping has a mix of SIG tests and 2 failing with periodicity:
    - [sig-api-machinery] CustomResourcePublishOpenAPI …
      - Probably should not be here.
    - PreStop graceful pod terminated should wait until preStop hook completes the process:
      - This is not true for the current cluster behavior, the Pod is deleted in the grace period and the event FailedPreStopHook is registered
  - Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)
  - Action: PreStop test investigation - (known issue) - check if there is another test that tests what description says. If not, fix the test by changing infinite loop to constant work amount
```
merkes
```
- Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations
- what is “NodeConformance”?
- NOTE: Found a little more information on the history.
```
sergey
```
- ```
Jorge
```
- Action: create a tab
```
sergey
```
- https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests-kubetest2.md
- https://docs.google.com/document/d/10EKyu4kTo3x7mei2TuGuW8AYeld1EuuSj91AvYNVGX4/edit#heading=h.1c4byrql87wf
- https://docs.google.com/document/d/1BdNVUGtYO6NDx10x_fueRh_DLT-SVdlPC_SsXjYCHOE/edit
- Talked about it at the ContribEx meeting with Joel, and he has reservations about how this will be kept up to date
- Action: try to land it into community repo [Sergey]

Some stats:

Test-infra sig/node: PRs: 4 Let’s resolve this one: https://github.com/kubernetes/test-infra/pull/17948
Test-infra sig/node: issues: 8
k/k sig node area/test: PRs: 59 (approved): 7 Approved PRs slowly clearing up!
k/k sig node area/test: issues: 7
k/k sig node kind/failing-test: PRs: 0 (approved): 0 Yay!
k/k sig node kind/failing-test: 1

8/31/2020

Please add your agenda items.

```
Sergey
```
- Investigate “node-kubelet-master and node-kubelet-conformance seem like duplicates” https://github.com/kubernetes/test-infra/issues/18973
- Discuss whether we need node-kubelet-serial-alpha tab: https://github.com/kubernetes/test-infra/issues/18972
```
mhb
```
- Where to record the knowledge on what we INTEND to test? Coverage matrix.
- Roy: https://testgrid.k8s.io/sig-node-docker ??
  - Morgan: this tab needs to be deleted. Docker team doesn’t support it any longer (seemingly) testinfra#17731
  - Action item: Need to find if Docker interested testinfra#17731
- Docker installed with ContainerD 1.2. So we need to test 1.2, 1.3, 1.4. We need to be explicit on whether we really need to test all of them.
- Action item: find a place to document test dashboards and tabs as well as test matrices. Perhaps have a place on kubernetes.dev or sig-community
```
mhb
```
Is there some correspondence to https://github.com/kubernetes/kubernetes/blob/master/pkg/features/kube_features.go or are they independent measures?
Action item: convert to markdown and find a place for the document [mhb]

Action item: https://github.com/kubernetes/test-infra/issues/18826

```
Sergey
```
- Test-infra sig/node: PRs: 6
- Test-infra sig/node: issues: 7
- k/k sig node area/test: PRs: 53 (approved): 11
- k/k sig node area/test: issues: 5
- k/k sig node kind/failing-test: PRs: 3 (approved): 1
- k/k sig node kind/failing-test: 1
Next meeting? 9/7 is a US holiday (Labor Day)
- Yes, cancel next week.

8/24/2020

Please add your agenda items.

```
mhb
```
- Action: let’s investigate
What goes in sig-node-blocking?
- Topology Manager - important for a subset of customers. That may not justify making it blocking?
- Should we put conformance tests there?
- Jorge: let’s create a process to add to sig node blocking? Based on maintenance commitment from participants
  - Also all tests need to strive to be blocking.
  - example proposal: https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md
```
mhb
```
```
mhb
```
- Move to next week.

Sergey

For example, this one is not tested as it’s not NodeConformance:

var _ = SIGDescribe("CPU Manager [Serial] [Feature:CPUManager][NodeAlphaFeature:CPUManager]", func() {

https://github.com/kubernetes/test-infra/blob/bd3b81b6c985971cf10665af396962f1c3136785/config/jobs/kubernetes/sig-node/node-kubelet.yaml#L257
Also how do we know which tests are run and which are not?
No tools exist. If anybody is interested, one needs to be created.
Action: create issue to clean up the tab
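
The notes above say no tool exists for telling which tests a job runs. As a starting point, here is a minimal sketch of one. It assumes a job selects tests by matching focus and skip regular expressions against the full test name, and that test names carry their tags in square brackets. The helper names and the second sample name are hypothetical.

```python
import re

TAG_RE = re.compile(r"\[([^\]]+)\]")


def tags(test_name):
    """Extract bracketed tags such as 'Serial' or 'Feature:CPUManager'."""
    return TAG_RE.findall(test_name)


def would_run(test_name, focus=None, skip=None):
    """Return True if the name matches focus (when given) and does not match skip."""
    if focus and not re.search(focus, test_name):
        return False
    if skip and re.search(skip, test_name):
        return False
    return True


cpu_manager = "CPU Manager [Serial] [Feature:CPUManager][NodeAlphaFeature:CPUManager]"
conformance = "Example test [NodeConformance]"  # hypothetical test name

for name in (cpu_manager, conformance):
    print(tags(name), would_run(name, focus=r"\[NodeConformance\]"))
```

Run over every test name in a suite against each job's focus and skip settings, a check like this would list the tests that no job picks up.
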

```
alejandrox1
```
- Release team is preparing to release 1.19. Please be vigilant with new test failures
```
alejandrox1
```
- Always ask questions :-) All the work that is going on should make sense to us.

8/17/2020

Please add your agenda items.

```
Sergey
```
- Test-infra sig/node: PRs: 6
- Test-infra sig/node: issues: 5
- k/k sig node area/test: PRs: 57 (approved): 17
- k/k sig node area/test: issues: 4
- k/k sig node kind/failing-test: PRs: 3 (approved): 2
- k/k sig node kind/failing-test issues: 1
```
Sergey
```
- https://github.com/kubernetes/test-infra/issues/18826
- https://docs.google.com/spreadsheets/d/1mEU8B2_PmMwwgp-_xnyp7QYMBwcLoA9NNlHwDyMvO0Y/edit#gid=0

FAILING	16
FLAKY	26
PASSING	45
STALE	18
Grand Total	105
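
For a quick sense of proportion, the counts above can be turned into shares. A small sketch using only the numbers listed; the variable names are illustrative.

```python
counts = {"FAILING": 16, "FLAKY": 26, "PASSING": 45, "STALE": 18}

total = sum(counts.values())  # should agree with the Grand Total row
for status, n in counts.items():
    print(f"{status:8} {n:3d} {n / total:6.1%}")
print(f"{'TOTAL':8} {total:3d}")
```
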

```
Sergey
```
- Policy: https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/README.md
- PRs
  - https://github.com/kubernetes/test-infra/pull/17722
  - https://github.com/kubernetes/test-infra/pull/17734
- 4 types of images
  - Presubmits
  - Postsubmits
  - CI with other tools
  - Test with latest tech
- [Action] Morgan to update PRs to point to LTS
- [Action] think about one or two stable images issue
```
karan
```
- PR: https://github.com/kubernetes/test-infra/pull/18877
- Build and release a new image for gcr.io/k8s-testimages/kubekins-e2e:latest-1.18 and above?

8/10/2020

Please add your agenda items.

```
Ning
```
- you’ll receive meeting notifications, emails, etc.
```
alejandrox1
```
- Request write permissions on slack, alias: knabben
```
alejandrox1
```
- TODO: Let’s look over our e2e and see what we actually need
- TODO: come up with a “plan”
```
Dawn
```
- Our perspective as SIG node may be different from SIG testing. We need to maintain our priorities.
- We need to maintain common tests across many vendors. Even though some vendors copy existing tests and run them on different runtimes, we still need to maintain those tests
```
Sergey
```
- https://testgrid.k8s.io/presubmits-kubernetes-blocking
- https://testgrid.k8s.io/sig-release-master-blocking
- https://testgrid.k8s.io/sig-node-kubelet
- ```
dims
```
- Split tests into two buckets:
  - What is actually green and will stay green.
  - What we are working on.
- Buckets above will indicate whether something is “approved” by SIG node. Basically a baseline
  - Example from SIG Release’s release blocking/informing jobs: https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md
- CI jobs that have been broken for months need to be highlighted as not critical for now.
- [Action item] Create tabs and move everything in NODE informing. And slowly move tests to Node blocking. Problem we are solving: if a person is not a part of SIG node - can I ask quickly whether things are OK now or not?
```
Sergey
```
- Test-infra sig/node
  - PRs: 8
  - Issues: 12
- k/k area/test:
  - PRs: 56
  - Issues: 60
- k/k failing-test:
  - PRs: 3
  - Issues: 4
```
Sergey
```
- https://github.com/orgs/kubernetes/projects/43
- https://docs.google.com/spreadsheets/d/1mEU8B2_PmMwwgp-_xnyp7QYMBwcLoA9NNlHwDyMvO0Y/edit#gid=0
```
Sergey
```

06/22/2020

CANCELLED
Do we need this meeting going forward? If so, move to bi-weekly
```
mhb
```
- categorize the tests here
- https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-orphans
- looking at the prow job definition, it is defined wholly by what it skips --skip="\[Flaky\]|\[NodeConformance\]|\[NodeFeature:.+\]|\[NodeSpecialFeature:.+\]|\[NodeAlphaFeature:.+\]|\[Legacy:.+\]|\[Benchmark\]"
- part of an effort to categorize the tests https://docs.google.com/document/d/1BdNVUGtYO6NDx10x_fueRh_DLT-SVdlPC_SsXjYCHOE/edit#
- added https://github.com/kubernetes/test-infra/commit/6cda335ebfcdace363352b566974ff0dda87a85c
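
To see what "defined wholly by what it skips" means in practice, the sketch below applies the skip expression quoted above to a few test names. It assumes skip semantics where any test whose full name matches the regular expression is excluded. The sample test names are hypothetical, and Python's `re` stands in for the Go regexp engine the real runner uses (the pattern only uses syntax common to both).

```python
import re

# The skip expression from the job definition, split across two lines for readability.
SKIP = re.compile(
    r"\[Flaky\]|\[NodeConformance\]|\[NodeFeature:.+\]|\[NodeSpecialFeature:.+\]"
    r"|\[NodeAlphaFeature:.+\]|\[Legacy:.+\]|\[Benchmark\]"
)

# Hypothetical test names.
samples = [
    "Kubelet should do something [NodeConformance]",
    "Container runtime handler [NodeFeature:RuntimeHandler]",
    "Density [Benchmark]",
    "Some untagged test",
    "Some serial test [Serial]",
]

# Whatever is left after skipping is what the orphans job would run.
orphans = [name for name in samples if not SKIP.search(name)]
print(orphans)
```

Anything untagged, or tagged only with labels missing from the expression (such as `[Serial]` here), falls into the orphans job by default, which is why the tab is a useful list of tests still needing categorization.
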
```
alejandrox1
```
- Turns on email notifications from failing e2e jobs in https://testgrid.k8s.io/sig-node-containerd to mailing list https://groups.google.com/forum/#!forum/kubernetes-sig-node-test-failures
- This is different from the notifications people receive from the GitHub team @kubernetes/sig-node-test-failures
```
alejandrox1
```
```
fromani
```
- Rationale: duplicated code between cpu and topology manager e2e test, the end goal is to clean up and keep the tests working and easy to change/extend
- Initial PR with cleanups which didn’t make in 1.18 https://github.com/kubernetes/kubernetes/pull/90971 - ptal

06/15/2020

No meeting today, no items on agenda
Board is now public with view access:
- https://github.com/orgs/kubernetes/projects/43

06/08/2020

```
vpickard
```
- experimenting with new board for sig-node-testing enhancements
  - https://github.com/orgs/kubernetes/projects/43
- Helps to view, monitor, manage issues and PRs for sig-node testing
- Currently private
- Started as a way to track issues/PRs that I had been requested to review - getting buried in emails. Found that I was “wasting” time going through emails to find PRs/issues that I needed to spend time on.
- Sig-storage - uses spreadsheet for high level view
- Sig-release - release engineering - uses boards
- Sig-architecture also uses boards
```
karan
```
- Running test suites using FOCUS? Not clear how that works.
- Not covered in alejandrox1’s docs.
- ```
alejandrox1
```
- small PR https://github.com/kubernetes/community/pull/4829
```
mhb
```
```
mhb
```
- https://github.com/kubernetes/kubernetes/pull/91827
```
mhb
```
- Should we be testing more images?
- COS and Ubuntu were original images - both open source
- Some tests are run on other platforms (RHEL on AWS, for example), export to testgrid
- More images very welcome and encouraged to add to E2E testing - would need some support/knowledge to maintain these images
```
alejandrox1
```
- need to pass management to SIG chairs and owners of this project.
- everyone can join now.
```
alejandrox1
```
```
royyang
```
```
royyang
```
- a SKIP=\[NodeFeature:RuntimeHandler\] may help for this job
- we could add some additional “annotation” to the test to skip
```
bart0sh
```
Can we get a community account for running/debugging sig-node E2E tests? Check with wg-k8s-infra.

06/01/2020

Follow up for test failure notifications, https://github.com/kubernetes/community/tree/master/sig-node#contact
- can we get on github team @kubernetes/sig-node-test-failures - Test Failures and Triage
- https://github.com/orgs/kubernetes/teams/sig-node-test-failures
  - ```
  alejandrox1
```
- PR to be added to the github team
- ```
alejandrox1
```
  - ```
  alejandrox1
```
- ```
alejandrox1
```
  - use https://groups.google.com/forum/#!forum/kubernetes-sig-node - this mailing list doesn’t have a ton of traffic currently
  - or create a new one in https://github.com/kubernetes/k8s.io/tree/master/groups
  - general mailing list guidelines https://github.com/kubernetes/community/blob/master/communication/mailing-list-guidelines.md#mailing-list-creation
    - For a general mailing list we need a sig lead or an approver to set it up.
  - AI: alejandro - make pr for mailing list
    - name suggested in meeting
    - contact list to put in pr
    - Let’s start with volunteer list at the top of this doc
    - start with subscribing to notifications for
      - release/merging blocking informing suites
      - existing *stable* test suites
mhb - what is image policy in general? Where can we write it down? LTS every 6 months, establish a periodic chore to evaluate the swap.
- assuming this ought to be the cos policy, where do we write down “use the lts images”, and use `image_family: cos-XX-lts` ?
- https://cloud.google.com/container-optimized-os/docs/release-notes#lts_image_families
- I like this comment as well, https://github.com/kubernetes/test-infra/pull/17770#issuecomment-636433860
- rollback issue on image_family vs maintenance issue of specific image: pinning
- many images up there and available, why these specific ones?
royyang - about COS image
- Updated doc: https://github.com/kubernetes/kubernetes/pull/91612/
- Root cos-stable-73 issue: https://github.com/kubernetes/test-infra/pull/17770
- Clean up and improve images: https://github.com/kubernetes/kubernetes/pull/91543 PTAL, this needs #17770
- Using image_family is good for some tests, but may be limited when we want to roll back a LTS or stable image.
- How to run a single test? Want to debug a couple of tests, and need some steps.
  - previous meeting notes from 5/26/2020 on using SKIP="" & FOCUS=""
```
Zhi
```
  issue in k/k to debug
should we add [NodeFeature: to kinds of tests?
- tag for feature it is specifically testing. help with graduating to GA.
- maybe a kep for it.
```
mhb
```
- Does sig-node monitor them?
- Should we delete them or hand over to sig-scalability
- summarize details in an email. there is an existing PR to ‘fix’ it by giving it a bigger vm to run on. tests added years ago, by a small group of people with not enough time to monitor them. Call for contributors to monitor and manage, and if none, delete.
- related, benchmark job runs.
- cos-69 works, but not later versions.
```
mhb
```
- Match up test runs to image config to make it easier to see testgrid image used
- Plan to update remaining image configs (20+) as well
```
mhb
```
```
alejandrox1
```
- A lot of work needs approval from approvers who are oversubscribed. What can we do?
- We can learn more, contribute more, and work to gain approval rights to help out.
```
bsdnet
```
- Remove unused code and enhance logging

5/26/2020

```
mhb
```
Ning Lao, Roy Yang
- How do we know the health of COS images, and which images should we be testing in sig-node? Roy will help us with this.
  - Latest LTS should be the one we use, e.g. cos-81-lts
  - https://cloud.google.com/container-optimized-os/docs/concepts/release-channels#release_channels
  - https://cloud.google.com/container-optimized-os/docs/resources/support-policy
- Need to file an issue with how to deal with COS images?
  - can we document what images are being used and why?
  - https://github.com/kubernetes/kubernetes/issues/88284
- https://cloud.google.com/container-optimized-os/docs/release-notes
- Suggestion to use regex, like gke. But, fail the test if the image is not found.
- Consider using image family
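
The regex suggestion, including the "fail if the image is not found" rule, might look like this. A sketch only: the image names are made up for illustration, and `select_image` is a hypothetical helper, not part of test-infra. Picking the lexicographically greatest match is a simplification; real image names may need version-aware sorting.

```python
import re


def select_image(available, pattern):
    """Return the lexicographically greatest image name matching pattern.

    Raises LookupError instead of silently falling back, so a job fails
    loudly when no image matches.
    """
    matches = [name for name in available if re.fullmatch(pattern, name)]
    if not matches:
        raise LookupError(f"no image matches {pattern!r}")
    return max(matches)


# Hypothetical image names.
images = ["cos-81-lts-001", "cos-81-lts-002", "cos-77-lts-009"]
print(select_image(images, r"cos-81-lts-.*"))
```

Failing loudly keeps a stale or deleted image from quietly turning into a different test environment.
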
```
alejandrox1
```
```
royyang
```
```
Ed
```
```
alejandrox1
```
- WIP (will pr the appropriate bits and pieces back into Kubernetes)
Sig-testing community update https://docs.google.com/presentation/d/1H-MLhKJJVsQG2eDCEv48M_WAzMc66dKaYMgfOSGQRJM/edit#slide=id.g338ac0a8b6_0_27
Regarding release-blocking/merge-blocking/release-informing jobs, see notes on 5/20/2020 below
Create email group for sig-node test failures, update jobs to send email to list when job fails
github group and/or googlegroups mailing list.
crib off of what other sigs do.
```
vpickard
```
- Import tests results from AWS E2E testing to testgrid
- data is in bucket, but results are not in testgrid
  - gs://kubernetes-github-redhat/logs/ci-kubernetes-conformance-node-e2e-containerized-rhel/10653/
- KETTLE issue? Any pointers?
  - https://github.com/kubernetes/test-infra/tree/master/kettle
SIG Testing resources
```
karan
```
- Running test suites using FOCUS? Not clear how that works.
- Covered in docs linked above - will check there.
```
karan
```
- Karan will create one and get that discussion going
- Created https://github.com/kubernetes/test-infra/issues/17714
```
davidporter
```
- Setting up VM for local tests and running node e2e tests: https://gist.github.com/bobbypage/f922d2dea47912786ddc0a0d2fab0fd1
```
davidporter
```
- Open issue to update topology manager to beta from alpha
```
zhi
```
Code:QUOTA_EXCEEDED Location: Message:Quota 'CPUS' exceeded." -- How to deal with this CPU usage issue?

5/20/2020

Sig-node E2E node-kubelet master started failing, an issue was reported here. This was blocking PR merges, and also this test is release-blocking.

And, here is the slack thread with debugging information. There are some good bits of info, such as how to run some of the tests, some of the false leads we were chasing. Will attempt to incorporate some of this debugging/running tests, etc into this doc.

Overall, a great team effort to debug and get to the root cause!

5/17/2020

Review Goals
```
Ed
```
- update cos images
- Added info on “COS cloud image” section in this doc
Updates from volunteers investigating tests
Which tests should we focus on first?
- Merge blocking, release blocking, release informing jobs
- Kubernetes release blocking jobs https://testgrid.k8s.io/sig-release-master-blocking and https://testgrid.k8s.io/sig-release-master-informing
- https://testgrid.k8s.io/sig-release-master-blocking#node-kubelet-master
- https://testgrid.k8s.io/sig-release-master-informing#node-kubelet-features
- Release blocking criteria https://github.com/kubernetes/sig-release/blob/master/release-blocking-jobs.md
Is there something you need to help make progress on these tests?
- Documentation for how to run e2e node tests https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/e2e-node-tests.md
- Setting up VM for local tests and running node e2e tests: https://gist.github.com/bobbypage/f922d2dea47912786ddc0a0d2fab0fd1
KIND may not work for debugging/running these tests. E2E tests spin up a VM on GCP, and ssh to that VM. The COS images are used to launch the VM.
Will need both the test-infra and kubernetes repos to be able to run jobs locally and remotely
Is there a shared google project for volunteers to use for testing patches and debugging?
- At quota limit right now
- File an issue in kubernetes/k8s.io repo, to ask for shared project for testing/debugging/ssh access
How to find the code that is running the tests?
Description of the types of tests https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests

5/11/2020 Kickoff Meeting

30 second Introductions - Name, k8s experience, E2E testing experience
Review this document
Testgrid overview
Spreadsheet - Sign Up for specific tests
Meet weekly - same day/time work for most folks?

History/Overview of E2E testing

What broke last week

We suspect some scan tool scanned the platform, found some CVEs, and shut down the platform/network.
Root cause was likely old images with CVEs
Network access was shut down; could not even get in to debug
Many jobs were pinned to a specific Google Cloud project
Why were all of the jobs pinned to one specific Google Cloud project?

Prow Channels

COS cloud image

Many of the jobs use cos-xxx images. What are these images, and where can you find them?

Container-Optimized OS is an operating system image for your Compute Engine VMs that is optimized for running Docker containers. With Container-Optimized OS, you can bring up your Docker containers on Google Cloud Platform quickly, efficiently, and securely. Container-Optimized OS is maintained by Google and is based on the open source Chromium OS project.

https://cloud.google.com/compute/docs/images
https://cloud.google.com/container-optimized-os/docs
https://cloud.google.com/container-optimized-os/docs/concepts/release-channels

Link to release notes with image contents (docker, kernel, k8s version)
https://cloud.google.com/container-optimized-os/docs/release-notes

vpickard@rippleRider$

NAME PROJECT FAMILY DEPRECATED STATUS
cos-69-10895-385-0 cos-cloud cos-69-lts READY
cos-73-11647-534-0 cos-cloud cos-73-lts READY
cos-77-12371-251-0 cos-cloud cos-77-lts READY
cos-81-12871-103-0 cos-cloud cos-81-lts READY
cos-beta-81-12871-44-0 cos-cloud cos-beta READY
cos-dev-84-13078-0-0 cos-cloud cos-dev READY
cos-stable-81-12871-103-0 cos-cloud cos-stable READY

vpickard@rippleRider$

archiveSizeBytes: '8233374400'
creationTimestamp: '2020-05-07T21:29:31.522-07:00'
description: 'Google, Container-Optimized OS, 81-12871.103.0 stable, Kernel: ChromiumOS-4.19.112
Kubernetes: 1.17.3 Docker: 19.03.6 Family: cos-81-lts, supports Shielded VM features'
diskSizeGb: '10'
family: cos-81-lts
guestOsFeatures:
- type: UEFI_COMPATIBLE
- type: VIRTIO_SCSI_MULTIQUEUE
id: '4296652415682830020'
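
Comparing the listing with the image description above (`cos-81-12871-103-0` against `81-12871.103.0`), the image names appear to encode an optional channel prefix, a milestone, and a build number. The sketch below splits a name along those lines. The pattern is inferred only from the seven names listed here, and `parse_cos_image` is a hypothetical helper, not part of gcloud or any COS tooling.

```python
import re

# Pattern inferred from the names in the listing above; it is an assumption,
# not a documented COS naming contract.
_COS_NAME = re.compile(
    r"^cos-(?:(?P<channel>beta|dev|stable)-)?"
    r"(?P<milestone>\d+)-(?P<build>\d+-\d+-\d+)$"
)


def parse_cos_image(name):
    """Split a COS image name into channel, milestone, and dotted build."""
    match = _COS_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized COS image name: {name!r}")
    return {
        # None means the name carries no channel prefix (e.g. cos-81-...).
        "channel": match.group("channel"),
        "milestone": int(match.group("milestone")),
        # The description field writes the build with dots, not dashes.
        "build": match.group("build").replace("-", "."),
    }


if __name__ == "__main__":
    for image in ("cos-81-12871-103-0", "cos-dev-84-13078-0-0"):
        print(image, parse_cos_image(image))
```

This can help when comparing the image pinned in a job definition against the release notes, which are organized by milestone.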

https://cloud.google.com/compute/docs/images

GCP Testing Projects

List of projects available for testing.
Documentation/links that describe project

Number of machines
Machine specs
How to access
Quotas
Image availability
Other

New-infras Goals

Move testing out of google.com
Move infrastructure more out in the open (#wg-test-infra)
- Capacity planning
- Shared infrastructure
- Shared duty to maintain infrastructure
Some of the less-critical infra has moved over already
Create a PROW build cluster out in the open, with Boskos
- What APIs
- What Quota
- What IAM (can this be scripted)
Allow some users ssh access

Prow

How does this fit into E2E testing?

Configuration of Prow-provisioned clusters is in YAML files in the prow/cluster directory. There is no intent to have a Prow.yaml file in each tested repo (à la .travis.yml files).

Control plane

Talk to github
Spin up pods on clusters

Build clusters

Boskos on each cluster

Node E2E testing environment

Slack channels: #sig-node, #sig-testing, #testing-ops

Jobs are defined in kubernetes test-infra repo.

Questions:
Where can we find a list of images that can be used in CI and PR tests?
How do we go about creating a sig-node-test-failure notification list?
What is the process for getting access to be able to debug failures in real-time on the system under test?

Kubelet CI jobs are defined in ../test-infra/config/jobs/kubernetes/sig-node/node-kubelet.yaml

Testgrid-dashboards: sig-release-master-blocking, sig-node-kubelet
Testgrid-alert-email: kubernetes-sig-node+testgrid@googlegroups.com
Containers: image: gcr.io/k8s-testimages/kubekins-e2e:v20200420-e830a3a-master
- Where to find all of these images and how to determine which image to use?
gcp-project=k8s-jnks-gke-gci-soak
- What is the list of projects available and how to determine which project to use?
gcp-zone=us-west1-b
- List of available zones, which one to use

Kubelet presubmit (Pull Request) jobs are defined in: ../test-infra/config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml

Containers: image: gcr.io/k8s-testimages/kubekins-e2e:v20200420-e830a3a-master
Deployment: node
- What are other choices?

How to run tests

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-node/e2e-node-tests.md
- Categories of Tests: https://github.com/kubernetes/community/blob/master/contributors/devel/sig-testing/e2e-tests.md#kinds-of-tests

Boskos

What is this and how does it work?

Boskos manages pools of projects that it owns. Do not specify a project in the job's YAML file, and Boskos will choose a project for the job
Checks a project out
Finds a project for the job to run on
Boskos cleans up after the job runs
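
The checkout, run, and cleanup cycle described above can be pictured with a toy model. This is only an illustration of the idea, assuming a single in-memory pool. `ProjectPool` and its methods are hypothetical names and do not reflect the real Boskos API or its resource states.

```python
class ProjectPool:
    """Toy model of a pool of cloud projects leased to jobs one at a time."""

    def __init__(self, projects):
        self._free = list(projects)   # projects nobody is using
        self._leased = {}             # project name -> job holding it

    def free_count(self):
        return len(self._free)

    def acquire(self, job):
        """Check out the next free project for a job."""
        if not self._free:
            raise RuntimeError("no free project in the pool")
        project = self._free.pop(0)
        self._leased[project] = job
        return project

    def release(self, project):
        """Return a project to the pool once its job has finished."""
        job = self._leased.pop(project)  # KeyError if it was never leased
        # A real system would delete leftover VMs, disks, etc. at this point.
        self._free.append(project)
        return job


if __name__ == "__main__":
    pool = ProjectPool(["project-a", "project-b"])  # placeholder names
    leased = pool.acquire("node-e2e-run-1")
    print("running on", leased, "free:", pool.free_count())
    pool.release(leased)
    print("free after release:", pool.free_count())
```

The point of the model is that the job never names a project itself. It asks the pool and gets whatever is free, which is why pinning `gcp-project` in a job definition works against this scheme.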

TestGrid

The kubernetes testgrid is here: https://testgrid.k8s.io/

The repo is here, and it has a link to a video of the testgrid session from the 2018 contributor summit:
https://github.com/kubernetes/test-infra/tree/master/testgrid

Most of the sig-node jobs are under the sig-node tab.
Exception: the recently added CI and PR jobs for Topology Manager are under the wg-resource-management tab. The CPU Manager job was there initially, so the Topology Manager jobs were added there as well. These jobs should likely be in sig-node.

Reading testgrid notes

Top level tab

colors (guesses based on observation, I think this must be defined somewhere in testgrid repo or config)

red: Failing
blue: Passing; flaky tests do not count against it.
black: Stale, tests have not run

This seems to hold recursively for tabs of tabs.
We should focus on red tabs.

tab names

Come from annotations in the config files, example:
https://github.com/kubernetes/test-infra/blob/a70b1248bacee4dbc332f796d5a3e38411c3f6d6/config/jobs/kubernetes/sig-node/containerd.yaml#L57-L59

annotations:
  testgrid-dashboards: sig-node-containerd
  testgrid-tab-name: containerd-build

testgrid-dashboards can have multiple entries, so that the test suite shows up in multiple dashboards
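
To see which tabs end up on which dashboard, the annotations can be grouped as in the sketch below. It assumes the annotations have already been loaded into plain dicts keyed by job name, that multiple dashboards are written as a comma-separated string (as in the `Testgrid-dashboards` line in the node E2E section), and that a job without `testgrid-tab-name` falls back to its job name. `tabs_by_dashboard` is a hypothetical helper, not part of test-infra.

```python
def dashboards_for(annotations):
    """Split the comma-separated testgrid-dashboards value into names."""
    raw = annotations.get("testgrid-dashboards", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


def tabs_by_dashboard(jobs):
    """Map each dashboard name to the sorted tab names that appear on it.

    `jobs` maps a job name to its annotations dict.
    """
    result = {}
    for job_name, annotations in jobs.items():
        # Assumption: fall back to the job name when no tab name is set.
        tab = annotations.get("testgrid-tab-name", job_name)
        for dashboard in dashboards_for(annotations):
            result.setdefault(dashboard, []).append(tab)
    return {dashboard: sorted(tabs) for dashboard, tabs in result.items()}


if __name__ == "__main__":
    jobs = {
        "ci-containerd-build": {
            "testgrid-dashboards": "sig-node-containerd",
            "testgrid-tab-name": "containerd-build",
        },
        # Job name below is illustrative; the dashboards come from these notes.
        "example-node-kubelet-job": {
            "testgrid-dashboards": "sig-release-master-blocking, sig-node-kubelet",
        },
    }
    for dashboard, tabs in sorted(tabs_by_dashboard(jobs).items()):
        print(dashboard, tabs)
```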

test suite descriptions

come from the yaml for a job

Image Policy

pr/pull/presubmits should be stable
- implies fewer tests and a hardcoded image
ci runs less often and can be somewhat less stable
- maybe we can split this up
Overall, why are we picking different images?
- versions of dependencies inside: docker, containerd, runc, selinux, etc.
Safe PR image bump mechanism
- create a mechanism to extract the images used for successful CI runs
- once the count of successful CI runs is high enough, bump the PR images
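
A minimal sketch of the bump rule proposed above: count passing CI runs per image and report the images that reach a threshold. How run results would be collected is left open. `CIRun`, `images_ready_to_bump`, and the threshold value are hypothetical, and this is one possible reading of "high enough", not an agreed policy.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CIRun:
    """One finished CI run (hypothetical record, not a Prow type)."""
    job: str
    image: str
    passed: bool


def images_ready_to_bump(runs, min_successes):
    """Return images whose passing CI run count reaches min_successes."""
    successes = Counter(run.image for run in runs if run.passed)
    return sorted(
        image for image, count in successes.items() if count >= min_successes
    )


if __name__ == "__main__":
    runs = [
        CIRun("ci-job", "cos-81-12871-103-0", True),
        CIRun("ci-job", "cos-81-12871-103-0", True),
        CIRun("ci-job", "cos-81-12871-103-0", True),
        CIRun("ci-job", "cos-beta-81-12871-44-0", True),
        CIRun("ci-job", "cos-beta-81-12871-44-0", False),
    ]
    # With a threshold of 3, only the first image qualifies for presubmits.
    print(images_ready_to_bump(runs, min_successes=3))
```

A stricter variant could also require zero recent failures for an image before bumping. That choice is not settled in these notes.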

Q&A

Q: bootstrap.py warning

W0518 20:41:32.851] **************************************************************************
bootstrap.py is deprecated!
test-infra oncall does not support any job still using bootstrap.py.
Please migrate your job to podutils!
https://github.com/kubernetes/test-infra/blob/master/prow/pod-utilities.md
**************************************************************************
Should we be solving this?

Q: How much history is available in testgrid?

Testgrid shows about 2 weeks. More is available in the GCS buckets that back Prow, which hold 90 days.

Q: Why do we build containerd? Why do we build containerd/cri?

They are upstream, and have their own ci.
- ci-containerd-build is pull-cri-containerd-build
- Maybe the question is, why are there ci jobs vs pull jobs?
- Should we put the PR version into the sig-node tab group?
- End result of this should be
  - maybe deleting some of these
  - maybe moving some into the existing testgrid panel
  - definitely putting information into description annotation of job definition
Q: do we have old tests not cleaning up? How do we check?
- https://github.com/kubernetes/test-infra/issues/17714#issuecomment-634340589
- https://github.com/kubernetes/kubernetes/issues/89892
- https://github.com/kubernetes/kubernetes/issues/89892#issuecomment-614189404
  - not sure these questions have been answered.

List of Jobs & Fields to Fill

Catalogue Existing jobs

https://docs.google.com/spreadsheets/d/1mEU8B2_PmMwwgp-_xnyp7QYMBwcLoA9NNlHwDyMvO0Y/edit#gid=0

In this section, we should list out every job that is running in sig-node, and provide the following:

Job Name
Number of test cases
Intent of test
How long test runs
Brief history of job pass/fail
Importance level to sig-node: low, medium, high
Email contact for failures
Any test cases that are obviously missing
Any redundant tests
Overall state of the test: poor, good, excellent
What resources are required for the test (CPUs, GPUs, NICs, SSD, image type, Container OS)
What version of the image, and why
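
If the catalogue is ever kept somewhere other than the spreadsheet, the fields above map naturally onto a small record type. The sketch below is one possible shape, with the two rated fields restricted to the values listed above. All class and field names are hypothetical, and the example values are placeholders rather than real measurements.

```python
from dataclasses import dataclass, field
from enum import Enum


class Importance(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class State(Enum):
    POOR = "poor"
    GOOD = "good"
    EXCELLENT = "excellent"


@dataclass
class JobEntry:
    """One catalogue row, mirroring the fields listed above."""
    name: str
    test_case_count: int
    intent: str
    runtime_minutes: float
    history: str
    importance: Importance
    failure_contact: str
    state: State
    missing_cases: list = field(default_factory=list)
    redundant_tests: list = field(default_factory=list)
    resources: str = ""
    image_version: str = ""
    image_rationale: str = ""


def needs_attention(entries):
    """Names of high-importance jobs whose overall state is poor."""
    return [
        entry.name
        for entry in entries
        if entry.importance is Importance.HIGH and entry.state is State.POOR
    ]


if __name__ == "__main__":
    entry = JobEntry(
        name="node-kubelet-master",
        test_case_count=0,              # placeholder, not a real count
        intent="placeholder intent",
        runtime_minutes=0.0,            # placeholder
        history="placeholder history",
        importance=Importance.HIGH,
        failure_contact="kubernetes-sig-node+testgrid@googlegroups.com",
        state=State.POOR,               # placeholder rating
    )
    print(needs_attention([entry]))
```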

References

Dec 2018 KubeCon Intro to Testing SIG - Aaron Crickenberger
Dec 2020 - A Tour of CI on The Kubernetes Project
