community/sig-node/archive/ci-subgroup-notes-2020.md


Kubernetes SIG-Node CI subgroup notes

12/21/2020 Pre-new-year clean wrap up

Attendees (4):

  • Sergey Kanzhelev
  • Amim Knabben
  • Elana Hashman
  • Matt Merkes

Agenda:

We will try to summarize all ongoing work on project board: https://github.com/orgs/kubernetes/projects/43

Add all issues to the board and have an “unrelated” column

12/14/2020

Attendees (7):

  • Sergey Kanzhelev
  • Artyom Lukianov
  • Jorge Alarcon
  • Matt Merkes
  • Francesco Romani (fromani)
  • Amim Knabben

Agenda:

  • knabben
    • How can we increase team velocity on issue response? After tackling an issue, how do we organize and sync efforts on that specific issue?
      • Try to use the Slack channel (#sig-node) to discuss issue debugging
      • Update the project board with the latest issues
  • merkes
    • Is it time to fix it or kill it? - https://github.com/kubernetes/test-infra/issues/18973
    • Jorge: definitely need to document the decision. The thinking is that node/e2e is supposed to be NodeConformance; regular conformance is under test/e2e.
    • Should NodeConformance and master run differently?
    • Action: let's make a doc and, once we feel OK, review with Dawn and implement
  • dims
    • Action: review dockershim test coverage

12/07/2020

Attendees:

  • jorge
  • sergey
  • amim
  • morgan
  • matt
  • roy

Agenda:

  • knabben
    • Morgan: more people is better. Currently there is a disconnect between CI and ongoing work.
    • Amim: we have alerting for the jobs
    • Sergey: many features are coming in hot for the code freeze date, and many improvements are being merged in the last week. So a small PR that breaks tests may block many PRs, and improvements wouldn't be merged into the target release
    • Jorge: test infra is complicated and will be difficult for people to learn. We need a lot of improvements in this area if we are going to block PRs on tests. We need more long-term contributors; this change will work best for them but may inhibit one-feature contributors.
    • Sergey: this group makes slower progress but develops knowledge; the proposal from Wojtek is the extreme of constant firefighting
    • Jorge: maybe we need an official onboarding (bootcamp) that can improve the quality of contributors. This may be a middle-ground solution.
  • alejandrox1
    • we need a central data structure for configuring test environments
    • we need a simple way for populating the above
    • Morgan: happy to join

  • SergeyKanzhelev
    • Morgan: yes, it is hard to figure out. Would be great to have an explicit flag. How to know? - many tests are defaulting to ContainerD now.
    • We definitely will learn something when we start doing it.
    • Action: Sergey to start an issue
      • Also follow up with recently added crio jobs
    • PR and CI may be different.
    • Some scalability tests may be specifying CNI explicitly
  • SergeyKanzhelev
    • pod lifecycle moved to conformance. Was passing in previous location.
    • This test was moved from orphans. Need to investigate why it was passing there and is failing here
    • Action: merkes will take a look, also check whether there was an e-mail about it.
    • original PR https://github.com/kubernetes/kubernetes/pull/96485/files
  • SergeyKanzhelev

    Jorge: will take it and will take a look.

  • mhb

11/30/2020

Attendees (4 on a call):

  • Artyom Lukianov
  • Matt Merkes
  • Sergey Kanzhelev
  • Morgan Bauer

Agenda:

  • Artyom Lukianov

11/23/2020

Cancelled! No agenda for today and it is a “short” week in the US.

11/16/2020

Attendees (# on a call):

  • Artyom Lukianov
  • Francesco Romani (fromani)
  • Matt Merkes
  • Sergey Kanzhelev

Agenda:

  • fromani
    • Ruiwen
  • Mention fromani on test related PRs

11/09/2020

Attendees (# on a call):

  • Artyom Lukianov
  • Francesco Romani (fromani)
  • Sergey Kanzhelev
  • Karan Goel (karan, Google)

Agenda:

  • merkes
    • Node conformance CI debrief: failing 10-21 to 11-03 due to the insecure port change (PR#XYZ); the bearer token was not plumbed in conformance test-suite mode. Residual concern: the conformance mode is a completely different setup from normal runs. Test output looks different from before.
  • node-kubelet-serial: last red tab in sig-node-kubelet
  • Sergey
    • containerd jobs: are those owned by sig-node or by containerd (there is much overlap in participants)? I wonder who is concerned about them.

11/02/2020

Attendees (9 on a call):

  • Alukiano
  • Sergey Kanzhelev
  • Fromani
  • Merkes
  • Amim Knabben
  • morgan

Agenda:

10/26/2020

Attendees: 6 on call:

  • Alukiano
  • Matt Merkes
  • fromani

Agenda:

10/19/2020

Attendees:

  • Sergey, Jorge, Ruiwen

Agenda:

  • Add your agenda items below
  • jorge
    • visibility: document how to take care of Kubernetes
  • fromani
    • Topology manager test improvements (more in the pipeline)
    • RH will port more fixes back to OSS. Yay!
  • alukiano
    • Need to have multi-numa machines to run tests on numa node selection
    • Roy: on GCE all NUMA nodes are virtual, so we likely need hardware machines?
    • Artem: may need to fake it
    • Roy: need to change the command line passing kernel arguments
    • Artem: will post parameters that are required and Roy will follow up.
    • The fake NUMA - https://www.kernel.org/doc/html/latest/x86/x86_64/fake-numa-for-cpusets.html
    • Sergey to follow up on whether we need this on GCE/GKE/Anthos (justification)
  • merkes - offline update only
    • Been on call the last couple of weeks and can't make the meeting today, but I have an AI to update the email as described in last week's agenda. I will 100% have time to do it in the next couple of days. Freedom at noon!

Sergey to send agenda and open PRs list to the mailing list.
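
For the fake NUMA discussion above, a minimal sketch of checking whether a node was booted with the `numa=fake=` kernel parameter. It assumes only the simple integer form `numa=fake=<N>` from the linked kernel documentation; the function name and the sample command line are hypothetical, and on a real node the string would be read from /proc/cmdline.

```python
import re
from typing import Optional

# Matches the simple integer form of the parameter, e.g. "numa=fake=4".
# Other forms of the parameter are deliberately not handled in this sketch.
_FAKE_NUMA = re.compile(r"(?:^|\s)numa=fake=(\d+)(?=\s|$)")


def fake_numa_nodes(cmdline: str) -> Optional[int]:
    """Return the requested number of fake NUMA nodes, or None if absent."""
    match = _FAKE_NUMA.search(cmdline)
    return int(match.group(1)) if match else None


# Hypothetical kernel command line; on a node, read /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro numa=fake=4 quiet"
print(fake_numa_nodes(sample))  # 4
```

A check like this could gate NUMA-dependent tests so they skip cleanly on machines that were not booted with fake NUMA.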

10/12/2020

Attendees:

  • Jorge
  • Matt Merkes

Agenda:

10/05/2020

Attendees:

  • morgan bauer
  • Amim Knabben
  • David Porter
  • Harshal Patil
  • Jorge Alarcon
  • merkes
  • roy
  • Sergey

Agenda:

  • sergey

https://testgrid.k8s.io/sig-node-containerd#containerd-node-conformance

  • Amim/Jorge
    • let's add it to community/contributors/devel/sig-node/test-suite.md (?)

Follow ups:

  • Action item: write this down in the test guide

  • Action item: need to find out whether Docker is interested (testinfra#17731)

    • Just need Derek or Dawn approval
  • Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)

  • Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations of node-conformance vs. node-master

    • https://github.com/kubernetes/test-infra/issues/18973
    • Merkes: added more details.
    • Why is running in Docker desired?
      • Docker may be a way to contain all the dependencies to make it independent
      • But it's unclear whether that makes the tests “independent”, as it requires Docker.
    • Jorge: should we ask sig-architecture?
    • Let's ask sig-architecture whether there is knowledge beyond what's written
    • Merkes to draft it
  • Action: try to land e2e tests documentation into community repo [Sergey]

  • Jorge
    • How do we know which features we are testing?
    • Morgan: sounds like we need [Conformance:KEPXYZ] tags
    • Merkes: were there any tests added to NodeConformance, and what was meant by this?
    • Jorge: nowhere do we have documentation saying what is release-blocking
    • Merkes, Jorge, Morgan: read some completed KEPs to understand how they approached testing and how we can improve this.
      • If a KEP is claimed to be implemented, is there an easy way to check its tests?
      • Jorge: +1

9/28/2020

Attendees:

  • Matt Merkes
  • Sergey Kanzhelev
  • Harshal Patil
  • Amim Knabben
  • Morgan Bauer

Agenda:

Awaiting approval:

9/21/2020

Attendees:

  • Sergey Kanzhelev
  • Harshal Patil
  • morgan bauer
  • David Porter
  • Amim Knabben
  • Roy Yang

Actions follow up from the last meeting:

  • Action item: need to find out whether Docker is interested (testinfra#17731)
    • Move to the next week
  • ci-cri-containerd-e2e-gci-gce-flaky
    • Action: 1- check that tab is filtering by folder (this test is out of scope for this group and should not run as a node e2e test)
    • Action: PreStop test investigation - check if there is another test that tests what description says. If not, fix the test by changing infinite loop to constant work amount
      PR open on termination + follow up on filtering
  • Action item: Escalate to SIG node meeting tomorrow to understand the history/expectations of node-conformance vs. node-master
  • Action: create a tab for RuntimeClass Disruptive test [Sergey]
  • Action: try to land e2e tests documentation into community repo [Sergey]

Agenda:

  • karan / roy
    • Bump 81 -> 85 (where we have image_family)
    • For specific image - will do this week
    • 85 is going stable this week
    • 77 -> 81
    • 81 -> 85 python version changed from 2 to 3, breaking containerd builds. But seems to be fixed by https://github.com/containerd/containerd/pull/4559
    • Should we remove ContainerD 1.2? -> 1.4
      • Morgan helping to move ContainerD to prow
  • knabben
  • mhb
    • Let's try to understand it first, then ensure it only runs once
  • mhb
    • Who wants to take a look? Morgan to create an issue
  • Jorge

9/14/2020

Attendees:

  • Sergey Kanzhelev
  • Matt Merkes
  • Amim Knabben
  • David Porter
  • Jorge Alarcon

Actions follow up from the last meeting:

  • Action item: need to find out whether Docker is interested (testinfra#17731)
  • Action item: find a place to document test dashboards and tabs as well as test matrices. Perhaps have a place on kubernetes.dev or sig-community
  • Action item: convert to markdown and find a place for the document [mhb]
  • Investigate “node-kubelet-master and node-kubelet-conformance seem like duplicates” https://github.com/kubernetes/test-infra/issues/18973
  • Sergey: come up with the critical test move example

Please add your agenda items.

Some stats:

  • Test-infra sig/node: PRs: 4 Let's resolve this one: https://github.com/kubernetes/test-infra/pull/17948
  • Test-infra sig/node: issues: 8
  • k/k sig node area/test: PRs: 59 (approved): 7 Approved PRs slowly clearing up!
  • k/k sig node area/test: issues: 7
  • k/k sig node kind/failing-test: PRs: 0 (approved): 0 Yay!
  • k/k sig node kind/failing-test: 1

8/31/2020

Please add your agenda items.

Action item: https://github.com/kubernetes/test-infra/issues/18826

  • Sergey
    • Test-infra sig/node: PRs: 6
    • Test-infra sig/node: issues: 7
    • k/k sig node area/test: PRs: 53 (approved): 11
    • k/k sig node area/test: issues: 5
    • k/k sig node kind/failing-test: PRs: 3 (approved): 1
    • k/k sig node kind/failing-test: 1
  • Next meeting? 9/7 is a US holiday (Labor Day)
    • Yes, cancel next week.

8/24/2020

Please add your agenda items.

  • mhb
    • Action: let's investigate
  • What goes in sig-node-blocking?
  • mhb
  • mhb
    • Move to next week.
  • Sergey
    • For example, this one is not tested as it's not NodeConformance:

      var _ = SIGDescribe("CPU Manager [Serial] [Feature:CPUManager][NodeAlphaFeature:CPUManager]", func() {
      

https://github.com/kubernetes/test-infra/blob/bd3b81b6c985971cf10665af396962f1c3136785/config/jobs/kubernetes/sig-node/node-kubelet.yaml#L257
Also how do we know which tests are run and which are not?
No tools exist. If anybody is interested, one needs to be created.
Action: create issue to clean up the tab
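
The question of which tests a job runs can be approximated offline. A minimal sketch, assuming a job selects tests with Ginkgo-style focus and skip regular expressions matched against the full test name (a test runs when it matches the focus and does not match the skip). The focus and skip values below are illustrative assumptions, not copied from node-kubelet.yaml, and the function name is hypothetical.

```python
import re


def would_run(test_name: str, focus: str, skip: str) -> bool:
    """Approximate Ginkgo-style selection: focus must match, skip must not.

    An empty focus selects everything; an empty skip excludes nothing.
    """
    focused = re.search(focus, test_name) is not None if focus else True
    skipped = re.search(skip, test_name) is not None if skip else False
    return focused and not skipped


# Illustrative values only; check the job definition for the real ones.
FOCUS = r"\[NodeConformance\]"
SKIP = r"\[Flaky\]|\[Serial\]"

cpu_manager = "CPU Manager [Serial] [Feature:CPUManager][NodeAlphaFeature:CPUManager]"
print(would_run(cpu_manager, FOCUS, SKIP))  # False: no [NodeConformance] tag
```

Running every known test name through each job's focus and skip values would give a rough matrix of which tests no job selects.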

  • alejandrox1
    • Release team is preparing to release 1.19. Please be vigilant with new test failures
  • alejandrox1
    • Always ask questions :-) All the work that is going on should make sense to us.

8/17/2020

Please add your agenda items.

FAILING 16
FLAKY 26
PASSING 45
STALE 18
Grand Total 105
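
A quick arithmetic check of the counts above, showing each status as a share of the total. The variable names are illustrative.

```python
# Status counts from the 8/17 entry.
counts = {"FAILING": 16, "FLAKY": 26, "PASSING": 45, "STALE": 18}

total = sum(counts.values())
shares = {status: round(100 * n / total, 1) for status, n in counts.items()}

print(total)  # 105
for status, pct in shares.items():
    print(f"{status}: {pct}%")
```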

8/10/2020

Please add your agenda items.

06/22/2020

06/15/2020

06/08/2020

  • vpickard
    • experimenting with new board for sig-node-testing enhancements
    • Helps to view, monitor, manage issues and PRs for sig-node testing
    • Currently private
    • Started as a way to track issues/PRs that I had been requested to review - getting buried in emails. Found that I was “wasting” time going through emails to find PRs/issues that I needed to spend time on.
    • Sig-storage - uses spreadsheet for high level view
    • Sig-release - release engineering - uses boards
    • Sig-architecture also uses boards
  • karan
  • mhb
  • mhb
  • mhb
    • Should we be testing more images?
    • COS and Ubuntu were original images - both open source
    • Some tests are run on other platforms (RHEL on AWS, for example), export to testgrid
    • More images very welcome and encouraged to add to E2E testing - would need some support/knowledge to maintain these images
  • alejandrox1
    • need to pass management to SIG chairs and owners of this project.
    • everyone can join now.
  • alejandrox1
  • royyang
  • royyang
    • a SKIP=[NodeFeature:RuntimeHandler] may help for this job
    • we could add some additional “annotation” to the test to skip
  • bart0sh
  • Can we get a community account for running/debugging sig-node E2E tests? Check with wg-k8s-infra.

06/01/2020

5/26/2020

5/20/2020

Sig-node E2E node-kubelet master started failing, an issue was reported here. This was blocking PR merges, and also this test is release-blocking.

And, here is the slack thread with debugging information. There are some good bits of info, such as how to run some of the tests, some of the false leads we were chasing. Will attempt to incorporate some of this debugging/running tests, etc into this doc.

Overall, a great team effort to debug and get to the root cause!

5/17/2020

5/11/2020 Kickoff Meeting

  • 30 second Introductions - Name, k8s experience, E2E testing experience
  • Review this document
  • Testgrid overview
  • Spreadsheet - Sign Up for specific tests
  • Meet weekly - same day/time work for most folks?

History/Overview of E2E testing

What broke last week

  • Suspect some scan tool scanned the platform, found some CVEs, and shut down the platform/network.
  • Likely root cause: old images with CVEs
  • Network access shutdown, could not even get in to debug
  • Many jobs were pinned to specific google cloud project
  • Why were all of the jobs pinned to one specific google cloud project?

Prow Channels

COS cloud image

Many of the jobs use cos-xxx images. What are these images, and where can you find them?

Container-Optimized OS is an operating system image for your Compute Engine VMs that is optimized for running Docker containers. With Container-Optimized OS, you can bring up your Docker containers on Google Cloud Platform quickly, efficiently, and securely. Container-Optimized OS is maintained by Google and is based on the open source Chromium OS project.

https://cloud.google.com/compute/docs/images
https://cloud.google.com/container-optimized-os/docs
https://cloud.google.com/container-optimized-os/docs/concepts/release-channels

Link to release notes with image contents (docker, kernel, k8s version)
https://cloud.google.com/container-optimized-os/docs/release-notes

vpickard@rippleRider$

NAME                       PROJECT    FAMILY      DEPRECATED  STATUS
cos-69-10895-385-0         cos-cloud  cos-69-lts              READY
cos-73-11647-534-0         cos-cloud  cos-73-lts              READY
cos-77-12371-251-0         cos-cloud  cos-77-lts              READY
cos-81-12871-103-0         cos-cloud  cos-81-lts              READY
cos-beta-81-12871-44-0     cos-cloud  cos-beta                READY
cos-dev-84-13078-0-0       cos-cloud  cos-dev                 READY
cos-stable-81-12871-103-0  cos-cloud  cos-stable              READY

vpickard@rippleRider$

archiveSizeBytes: '8233374400'
creationTimestamp: '2020-05-07T21:29:31.522-07:00'
description: 'Google, Container-Optimized OS, 81-12871.103.0 stable, Kernel: ChromiumOS-4.19.112
Kubernetes: 1.17.3 Docker: 19.03.6 Family: cos-81-lts, supports Shielded VM features'
diskSizeGb: '10'
family: cos-81-lts
guestOsFeatures:
- type: UEFI_COMPATIBLE
- type: VIRTIO_SCSI_MULTIQUEUE
id: '4296652415682830020'

https://cloud.google.com/compute/docs/images
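
A minimal sketch of looking up the image for a given family from the cos-cloud listing earlier in this section. It assumes whitespace-separated columns in which the DEPRECATED column is empty, so each data row splits into NAME, PROJECT, FAMILY and STATUS; the function name is hypothetical.

```python
from typing import Dict

LISTING = """\
NAME PROJECT FAMILY DEPRECATED STATUS
cos-69-10895-385-0 cos-cloud cos-69-lts READY
cos-73-11647-534-0 cos-cloud cos-73-lts READY
cos-77-12371-251-0 cos-cloud cos-77-lts READY
cos-81-12871-103-0 cos-cloud cos-81-lts READY
cos-beta-81-12871-44-0 cos-cloud cos-beta READY
cos-dev-84-13078-0-0 cos-cloud cos-dev READY
cos-stable-81-12871-103-0 cos-cloud cos-stable READY
"""


def images_by_family(listing: str) -> Dict[str, str]:
    """Map image family -> image name for rows whose status is READY."""
    result = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skips the header, blanks, and rows with a DEPRECATED value
        name, _project, family, status = fields
        if status == "READY":
            result[family] = name
    return result


print(images_by_family(LISTING)["cos-81-lts"])  # cos-81-12871-103-0
```

Referring to a family rather than a pinned image name is what lets a job follow image updates without a config change.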

GCP Testing Projects

List of projects available for testing.
Documentation/links that describe project

  • Number of machines
  • Machine specs
  • How to access
  • Quotas
  • Image availability
  • Other

New-infras Goals

  • Move testing out of google.com
  • Move infrastructure more out in the open (#wg-test-infra)
    • Capacity planning
    • Shared infrastructure
    • Shared duty to maintain infrastructure
  • Some of the less-critical infra has moved over already
  • Create a PROW build cluster out in the open, with Boskos
    • What APIs
    • What Quota
    • What IAM (can this be scripted)
  • Allow some users ssh access

Prow

How does this fit into E2E testing?

Configuration of Prow-provisioned clusters lives in YAML files in the prow/cluster directory. There is no intent to have a Prow.yaml file in each tested repo (à la .travis.yml files).

Control plane

  • Talk to github
  • Spin up pods on clusters

Build clusters

  • Boskos on each cluster

Node E2E testing environment

Slack channels: #sig-node, #sig-testing, #testing-ops

Jobs are defined in kubernetes test-infra repo.

Questions:
Where can we find a list of images that can be used in CI and PR tests?
How do we go about creating a sig-node-test-failure notification list?
What is the process for getting access to be able to debug failures in real-time on the system under test?

Kubelet CI jobs are defined in ../test-infra/config/jobs/kubernetes/sig-node/node-kubelet.yaml

  • Testgrid-dashboards: sig-release-master-blocking, sig-node-kubelet
  • Testgrid-alert-email: kubernetes-sig-node+testgrid@googlegroups.com
  • Containers: image: gcr.io/k8s-testimages/kubekins-e2e:v20200420-e830a3a-master
    • Where to find all of these images and how to determine which image to use?
  • gcp-project=k8s-jnks-gke-gci-soak
    • What is the list of projects available and how to determine which project to use?
  • gcp-zone=us-west1-b
    • List of available zones, which one to use

Kubelet presubmit (Pull Request) jobs are defined in: ../test-infra/config/jobs/kubernetes/sig-node/sig-node-presubmit.yaml

  • Containers:image: gcr.io/k8s-testimages/kubekins-e2e:v20200420-e830a3a-master
  • Deployment: node
    • What are other choices?

How to run tests

Boskos

What is this and how does it work

  • Manages pools of projects that it owns. Do not specify project in the yaml file, and it will choose a project for the job
  • Checks project out
  • Finds project for job to run on
  • Boskos cleans up job after it runs
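
The checkout and cleanup cycle above can be sketched as a tiny in-memory pool. This is a toy model, not the Boskos API: the class, method and project names are hypothetical, and the free/busy/dirty states are an assumed simplification of how a leased project is recycled.

```python
from typing import Dict, Iterable, Optional


class ProjectPool:
    """Toy model of a pool that leases projects to jobs and recycles them."""

    def __init__(self, projects: Iterable[str]):
        # Every project starts out free.
        self.state: Dict[str, str] = {name: "free" for name in projects}
        self.owner: Dict[str, str] = {}

    def acquire(self, job: str) -> Optional[str]:
        """Check out any free project for the job, or None if none is free."""
        for name, state in self.state.items():
            if state == "free":
                self.state[name] = "busy"
                self.owner[name] = job
                return name
        return None

    def release(self, name: str) -> None:
        """The job is done; the project needs cleanup before reuse."""
        self.state[name] = "dirty"
        self.owner.pop(name, None)

    def clean(self) -> None:
        """Cleanup step: dirty projects become free again."""
        for name, state in self.state.items():
            if state == "dirty":
                self.state[name] = "free"


pool = ProjectPool(["project-a", "project-b"])  # hypothetical project names
leased = pool.acquire("node-e2e-job")
pool.release(leased)
pool.clean()
print(pool.state[leased])  # free
```

The point of the model is that the job never names a project: it asks the pool, and the pool decides.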

TestGrid

The kubernetes testgrid is here: https://testgrid.k8s.io/

Repo is here, has a video link to testgrid session from 2018 contributor summit:
https://github.com/kubernetes/test-infra/tree/master/testgrid

Most of the sig-node jobs are under the sig-node tab.
Exception: the recently added CI and PR jobs for Topology Manager are under wg-resource-management. The CPU Manager job was there initially, so the Topology Manager jobs were added there too. These jobs should likely be in sig-node.

Reading testgrid notes

Top level tab

colors (guesses based on observation, I think this must be defined somewhere in testgrid repo or config)

  • red: Failing
  • blue: Passing, and flaky does not count against it.
  • black: Stale, tests have not run

This seems to hold recursively for tabs of tabs:
We should focus on red tabs.

tab names

Come from annotations in the config files, example:
https://github.com/kubernetes/test-infra/blob/a70b1248bacee4dbc332f796d5a3e38411c3f6d6/config/jobs/kubernetes/sig-node/containerd.yaml#L57-L59

annotations:
  testgrid-dashboards: sig-node-containerd
  testgrid-tab-name: containerd-build

testgrid-dashboards can have multiple entries, so that the test suite shows up in multiple dashboards
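
A minimal sketch of how one job can surface in several dashboards. It assumes the multiple entries are comma-separated, as in the Testgrid-dashboards line for the kubelet CI jobs earlier in these notes. The job records are simplified and hypothetical, including the second tab name.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical, simplified annotation sets; real jobs carry many more fields.
JOBS = [
    {
        "testgrid-dashboards": "sig-node-containerd",
        "testgrid-tab-name": "containerd-build",
    },
    {
        "testgrid-dashboards": "sig-release-master-blocking, sig-node-kubelet",
        "testgrid-tab-name": "node-kubelet-master",  # assumed tab name
    },
]


def tabs_by_dashboard(jobs: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Index tab names by each dashboard listed in the annotation."""
    index = defaultdict(list)
    for annotations in jobs:
        tab = annotations["testgrid-tab-name"]
        for dashboard in annotations["testgrid-dashboards"].split(","):
            index[dashboard.strip()].append(tab)
    return dict(index)


print(tabs_by_dashboard(JOBS)["sig-node-kubelet"])  # ['node-kubelet-master']
```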

test suite descriptions

come from the yaml for a job

Image Policy

  • pr/pull/presubmits should be stable
    • implies fewer tests and a hardcoded image
  • ci runs less often, can be somewhat less stable
    • maybe we can split this up
  • Overall, why are we picking different images?
    • versions of dependencies inside - docker, containerd, runc, selinux, etc
  • Safe PR image bump mechanism
    • create a mechanism to extract the images used for successful CI runs
    • once the count of successful CI runs is high enough, bump the pr images
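
The bump mechanism above can be sketched as a count over CI results. This is only an illustration under stated assumptions: the run records, the threshold and the function name are hypothetical, and the rule that an image with any failed run is excluded is an added assumption, not something decided in the notes.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def bump_candidates(runs: Iterable[Tuple[str, bool]], threshold: int) -> List[str]:
    """Return images with at least `threshold` passing CI runs and no failures."""
    passed = Counter()
    failed = set()
    for image, ok in runs:
        if ok:
            passed[image] += 1
        else:
            failed.add(image)
    return sorted(
        image for image, n in passed.items()
        if n >= threshold and image not in failed
    )


# Hypothetical CI history as (image, run passed) pairs.
history = [
    ("cos-81-12871-103-0", True),
    ("cos-81-12871-103-0", True),
    ("cos-81-12871-103-0", True),
    ("cos-dev-84-13078-0-0", True),
    ("cos-dev-84-13078-0-0", False),
    ("cos-dev-84-13078-0-0", True),
]
print(bump_candidates(history, threshold=3))  # ['cos-81-12871-103-0']
```

The threshold trades off how quickly PR jobs pick up a new image against how much evidence there is that the image is stable.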

Q&A

Q: bootstrap.py warning

W0518 20:41:32.851] **************************************************************************
bootstrap.py is deprecated!
test-infra oncall does not support any job still using bootstrap.py.
Please migrate your job to podutils!
https://github.com/kubernetes/test-infra/blob/master/prow/pod-utilities.md
**************************************************************************
Should we be solving this?

Q: How much history is available in testgrid?

Two weeks in testgrid. More is kept in the GCS buckets that back Prow, which hold 90 days.

Q: Why do we build containerd? Why do we build containerd/cri?

List of Jobs & Fields to Fill

Catalogue Existing jobs

https://docs.google.com/spreadsheets/d/1mEU8B2_PmMwwgp-_xnyp7QYMBwcLoA9NNlHwDyMvO0Y/edit#gid=0

In this section, we should list out every job that is running in sig-node, and provide the following:

Job Name
Number of test cases
Intent of test
How long test runs
Brief history of job pass/fail
Importance level to sig-node: low, medium, high
Email contact for failures
Any test cases that are obviously missing
Any redundant tests
Overall state of the test: poor, good, excellent
What resources are required for the test (CPUs, GPUs, NICs, SSD, image type, Container OS)
What version of the image, and why
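
One way to keep catalogue entries consistent is a small record type that mirrors the fields listed above. This is a sketch only: the class and field names are hypothetical, and the example values are placeholders rather than real data about any job.

```python
from dataclasses import dataclass, field
from typing import List

IMPORTANCE = ("low", "medium", "high")
STATE = ("poor", "good", "excellent")


@dataclass
class JobEntry:
    """One row of the job catalogue, mirroring the fields listed above."""

    name: str
    test_case_count: int
    intent: str
    duration_minutes: int
    history: str
    importance: str
    contact_email: str
    state: str
    resources: str
    image_version: str
    image_rationale: str
    missing_tests: List[str] = field(default_factory=list)
    redundant_tests: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Reject values outside the scales named in the notes.
        if self.importance not in IMPORTANCE:
            raise ValueError(f"importance must be one of {IMPORTANCE}")
        if self.state not in STATE:
            raise ValueError(f"state must be one of {STATE}")


# Placeholder values for illustration only.
entry = JobEntry(
    name="example-job",
    test_case_count=0,
    intent="placeholder intent",
    duration_minutes=0,
    history="unknown",
    importance="high",
    contact_email="kubernetes-sig-node+testgrid@googlegroups.com",
    state="good",
    resources="placeholder",
    image_version="placeholder",
    image_rationale="placeholder",
)
print(entry.importance)  # high
```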

References

  1. Dec 2018 KubeCon Intro to Testing SIG - Aaron Crickenberger
  2. Dec 2020 - A Tour of CI on The Kubernetes Project