community/sig-node/archive/meeting-notes-2021.md

SIG Node Meeting Notes

December 28, 2021 Cancelled

  • [dawnchen] Cancelled

December 21, 2021 Cancelled

  • [dawnchen] Cancelled

December 14, 2021

Total active pull requests: 207 (+6)

Incoming: Created 32, Updated 82
Completed: Closed 9, Merged 20

Highlights:

  • https://github.com/kubernetes/kubernetes/pull/97252 - Dockershim was removed!
  • [pohly, klueska] Dynamic resource allocation KEP discussion
    • Presentation
    • [dawn] As a reference, the same use case was explored by us 4 years ago, and here is the proposal made by jiayingz@ from Google: https://docs.google.com/document/d/1qKiIVs9AMh2Ua5thhtvWqOqW0MSle_RV3lfriO1Aj6U/edit#
    • [Sergey] Latency in scheduling is inherent in the design. Do we want to keep the existing API and fix similar issues to the ones the KEP is solving, assuming there are workloads which need low latency?
    • [Patrick] this is speculation at this point. We emphasize long running jobs, not short running ones.
    • [sergey] ResourceClaim has the same name as PVC; perhaps there should be an explicitly called-out table of differences, which would simplify the review.
    • [Patrick] Big difference - volumes are intentionally non-configurable, while resources are customizable
    • Quota management was one of the concerns before. Plus scheduling latency
    • [Patrick] Quota logic will be managed by vendors addon
    • [Patrick] Implementing the scheduler by the storage vendor is not the ideal solution. Writing an addon may be easier and more natural. Especially if you are using two vendors in the same cluster - implementing custom scheduling wouldn't work
    • [Dawn] yes, multiple vendors is a problem. Extending scheduler is a problem that needs to be solved.
    • 1.24 may be too tight for the alpha.
  • [mrunal] 1.24 planning (ehashman OOO) *
    • Moved to January
  • [vinaykul] In-Place Pod Vertical Scaling - plan for early 1.24 merge
    • PR https://github.com/kubernetes/kubernetes/pull/102884
    • Pod resize E2E tests have been “weakened” for alpha.
      • Resize success verified at cgroup instead of pod status.
      • All 31 tests are passing now.
    • Alpha-blocker issues:
      • Container hash excludes Resources when the in-place-resize feature gate is enabled; toggling the feature gate can restart containers.
        • Please review this incremental change which addresses it.
      • Reviewer claims that code in convertToAPIContainerStatus breaks non-mutating guarantees.
        • It is unclear what part of the code updates or mutates any state. Need a response/clarification.
      • Multiple reviewers have felt that the NodeSwap issue is a blocking issue, but in last week's meeting we felt this may not be an alpha blocker (no CI test failures were seen after the resize E2E tests were weakened, and all tests passed). However, we want to be sure.
        • Can we identify exact reasons why this would (or would not) be alpha blocker?
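    • As a reference for the cgroup-level verification mentioned above, a minimal sketch of checking a resize result directly (pod/container names are hypothetical; assumes a cgroup v2 node where the container sees its own cgroup at /sys/fs/cgroup):
      # After requesting a resize, read the container's effective limits straight from its cgroup.
      # On cgroup v2, cpu.max holds "<quota> <period>" and memory.max holds the byte limit
      # (cgroup v1 equivalents: cpu.cfs_quota_us / cpu.cfs_period_us and memory.limit_in_bytes).
      kubectl exec resize-test-pod -c app -- cat /sys/fs/cgroup/cpu.max
      kubectl exec resize-test-pod -c app -- cat /sys/fs/cgroup/memory.max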

December 7, 2021

Total PRs: 201 (-8)

Incoming: Created 15, Updated 65
Completed: Closed 10, Merged 14
  • [ehashman] Announcements
  • [ehashman] Node 1.23 release retro
    • 1.22 retro link:
    • 1.23 Node KEP Planning
    • SIG Node 1.23 KEPs
      • 8 of 14 implemented
      • 3 exceptions requested
      • 3 exceptions granted
    • Things that went well
      • Really liked that we have a list of KEPs in a Google doc that we talk through; it's good to see them all in one place and see who is working on what; this helped find connections and collaboration (+1)
      • Soft freeze made the process much better from a contributor experience point of view, communication around it was also great, reviewers were comfortable reviewing the load of PRs
      • Fixed a lot of CI issues, CI looks much better than previously (+1)
      • Deprecation of DynamicKubeletConfig went very smoothly, removed from tests
      • For the last 2-3 releases we've been doing a better job of enumerating what we get done; it's providing a good rhythm and focusing everyone's attention (+1)
      • Appreciate that approvers are able to say "I'm not familiar with this area" and hold off on merging code they're not confident with
      • Lots of effort was put into the dockershim deprecation survey and ensuring docker wasn't broken in 1.23, much appreciated
      • Soft code freeze helped a lot with flakes not all piling up at the very end of the release
      • Beta KEPs were removed in the middle of the release so we didn't scramble to get them merged at the last minute
      • We did a good job of coordinating with the container runtimes, not just internal to Kubernetes; much work happening in containerd, CRI-O, and cAdvisor was all well-coordinated (+1 +1 +1)
      • General sentiment of a really successful release for node
      • SIG Node is in a much better position than other SIGs, well organized (from an outside contributor)
    • Things that could have gone better
      • Working on logging and kubelet flags; it was hard to find reviewers for PRs, and spoke with one approver who wasn't familiar with the area (+1)
        • In an ideal situation, someone would know who to pull in for a review, but if we don't have that person it just gets moved to the back of the queue
        • It would be nice to indicate who specializes in which areas of code; the kubelet code owner structure isn't as cleanly articulated as it could be
        • This has been a problem every release; we need more approvers and reviewers overall. Unfortunately, it takes time to train people and we need more people/volunteers participating
      • Sometimes hard to find approver bandwidth as well
      • Last-minute hiccup with dockershim socket for CRI v1
  • [ehashman, mrunal] 1.24 initial KEP planning *
    • [dawn] Overstretched by lack of reviewer bandwidth; have a difficult time prioritizing, expertise for each feature is different. We have 20-30 things on our list with no clear priority.
      • Next pass, let's sort by priority + confidence level?
  • [swsehgal/@alukiano] PodResource API Watch endpoint support (the KEP with List and Watch endpoint support was merged in 1.21, but only the List endpoint (with cpuid and device topology info) was implemented). Please track the Watch endpoint implementation for 1.24. Issue: https://github.com/kubernetes/enhancements/issues/2043
  • [pohly, klueska] Dynamic resource allocation KEP discussion
  • [rata, giuseppe] user namespaces KEP PR
    • Would like to agree on high level and get review from who needs to review
    • We really want to reduce chances as much as possible of a last-minute showstopper
    • General notes
      • Derek/Mrunal: phases make sense, want Tim to have a look just in case
      • Sergey will review to see if we should target 1.24 or 1.25
    • Action(rata):
      • Ping Tim Hockin
      • Add Sergey as a reviewer
  • [vinaykul] In-Place Pod Vertical Scaling - planning for early 1.24 merge
    • PR https://github.com/kubernetes/kubernetes/pull/102884
    • Alpha-blocker issue by Elana: Container hash excludes Resources when the in-place-resize feature gate is enabled; toggling the feature gate can restart containers.
      • Please review this incremental change to PR.
    • Pod resize E2E tests have been “weakened” for alpha.
      • All 31 tests are passing now.
      • Please review this incremental change to the original E2E test.
        • This change skips spec-resources == status.resources verification for alpha but enforces it in beta.
        • It verifies everything else besides that.
        • Resize success is verified by looking at container cgroup values and comparing to expected values after resize.
      • This gives containerd time to add support for new fields in CRI.
        • Beta blocker - cannot flip switch to beta without containerd support.
    • Outstanding issues / TODOs to load-share and fix after merge?
  • [zvonkok] Any comments regarding last week's discussion?
    • Derek: go ahead and open the issue as a placeholder for the decision to be made
  • [mweston, swsehgal] looking for feedback re cpu use cases and which cases are covered in the kubelet already, but perhaps not documented. https://docs.google.com/document/d/1U4jjRR7kw18Rllh-xpAaNTBcPsK5jl48ZAVo7KRqkJk/edit

November 30, 2021

Total PRs: 209 (+10 since last week)

Incoming: Created 18, Updated 62
Completed: Closed 10, Merged 4
  • [klueska] Bug fix for new feature added in 1.23 (please add to the v1.23 milestone)
    • https://github.com/kubernetes/kubernetes/pull/106599
      • [ehashman] This missed test freeze; any fixes will need to be prioritized as release blocking, merged, and backported. We should wait until master reopens if this isn't affecting CI signal
      • We decided to push this out and include it in 1.23.1
  • [SergeyKanzhelev] Dockershim removal end user survey results: https://docs.google.com/document/d/1DiwCRJffBjoLuDguW9auMXS2NTmkp3NAS7SGJQBu8JY/edit#heading=h.wzgwyg229djr
    • Are dev tools updated? local-up-cluster.sh defaults to docker; you can use containerd/cri-o with it, but it's not documented
      • kubeadm also may need some updates
      • node e2es are difficult to run locally without docker
      • Contributor documentation needs to be updated prior to removal
    • 2 questions:
      • Did we give enough notice for removal?
        • Yes - a full year; 1.24 won't be released until next April
      • Did we give sufficient viable options in the time between deprecation and removal?
        • We have 2 runtimes that can be adopted + instructions for migrations
        • Some people have specific monitoring agents/tools that don't support other runtimes; they need their dependencies to migrate
        • This may be beyond what SIG Node can answer
    • [dawn] We have already delayed a year given the request; what would make people migrate? People possibly won't make changes if we keep delaying, because they are not ready
    • [ehashman] We just need to go and deprecate, otherwise people will not update. We need to ensure that we ourselves are prepared for that, and update everything so that we can work without dockershim, too. [+1s from Lantao and Mrunal]
    • [danielle] It will be painful, but once it's done it's done and we won't have to work on it again. Mirantis is working on making dockershim work with CRI (out of tree), so they can always use that.
    • [dawn] fixing OSS scripts and e2e tests with containerd should be the blocker for the dockershim deprecation, so that OSS users should have the out-of-box experience with containerd.
    • [dawn] containerd is a straightforward migration, but we didn't want to switch over to it as a default because it wasn't GA previously
    • [derek] Did we collect docker versions? Are they getting security updates?
      • We didn't have this as a survey question
    • [derek] Are any conformance features tied to a particular runtime? There is at least one feature that doesn't work with dockershim, e.g. cgroups v2
    • Zoom chat dump:
      • Derek Carr to Everyone (10:15 AM)
      • for local-up cluster, it should be just setting CONTAINER_RUNTIME and CONTAINER_RUNTIME_ENDPOINT... here is an example for crio.... export CONTAINER_RUNTIME=remote
      • export CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock'
      • Lantao Liu to Everyone (10:18 AM)
      • +1
      • Mrunal Patel to Everyone (10:18 AM)
      • +1
      • Mike Brown to Everyone (10:20 AM)
      • export CONTAINER_RUNTIME=remote
      • export CONTAINER_RUNTIME_ENDPOINT="unix:///run/containerd/containerd.sock"
      • sudo -E PATH=$PATH ./hack/local-up-cluster.sh
      • nod ^ same as crio
      • Elana Hashman to Everyone (10:20 AM)
      • PRs welcome, folks :)
      • zoom chat is not the best patch tracker :D
      • Mike Brown to Everyone (10:20 AM)
      • we could change local-up to work like crictl does
      • Derek Carr to Everyone (10:20 AM)
      • more just letting us all know that it works today...
      • +1 mike
      • Brian Goff to Everyone (10:21 AM)
      • There is out of tree dockershim anyway.
      • Mike Brown to Everyone (10:21 AM)
      • basically crictl uses a config or.. if not configured loops through the default socks for each container runtime
      • Jack Francis to Everyone (10:21 AM)
      • +1, much experience with the unfortunate human behavioral reality that we need to give folks a concrete incentive to migrate
      • Lantao Liu to Everyone (10:21 AM)
      • There is out of tree dockershim anyway.

      • Yeah, there is backup for those users who can't move away immediately
      • Brian Goff to Everyone (10:24 AM)
      • Also maybe could extend support on <last version w/ dockershim> rather than pushing removal.
      • *pushing out removal
      • The only version getting sec updates is 20.10
      • Jack Francis to Everyone (10:26 AM)
      • I think there is no one answer to “which version of docker are folks using”. The answer is probably “every version”.
      • Danielle Lancashire to Everyone (10:30 AM)
      • "whatever version the OS happens to package when the image was built" * every production image * every user I assume
      • Mike Brown to Everyone (10:32 AM)
      • noting … kubelet node dependencies include some number of system services/utilities (seccomp/apparmor/…) + container runtime (extern/intern in r.current) + runtime engines (runc/crun/kata/…) … Its not just container runtime. Also noting you can run multiple containerd instances at the same time, thus version requirement is scoped to version configured for kubelet to use.
      • Mike Brown to Everyone (10:35 AM)
      • the version of containerd configured for kubelet to use.. can be shared with the installed version of docker.. options
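    • Consolidating the commands from the chat above, a minimal sketch of running local-up-cluster.sh without dockershim (socket paths as shared in the chat; adjust for your setup):
      # CRI-O
      export CONTAINER_RUNTIME=remote
      export CONTAINER_RUNTIME_ENDPOINT='unix:///var/run/crio/crio.sock'
      sudo -E PATH=$PATH ./hack/local-up-cluster.sh
      # containerd: same commands, pointing at the containerd socket instead
      # export CONTAINER_RUNTIME_ENDPOINT='unix:///run/containerd/containerd.sock'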
  • [wzshiming] Requesting SIG Node reviewer status. Due to the time zone, I may not have time to attend the meeting, but I can review PRs while everyone is sleeping. :)
  • [zvonkok] Promote SRO as a kubernetes-sigs project, joining NFD for special resource enablement
    • Ask: want to migrate code into kubernetes-sigs under SIG Node
    • [dawn] Only concerned about latency for initialization if there is a hard requirement on operators
    • [zvonkok] At its core it's just a Helm chart, and you can use just Helm
    • [Derek] Everyone should get a chance to review the slides as a next step, it would be good for people to have a place to collaborate on this
      • After a period of a week for review, we can open an issue
      • Proposed maintainers should be listed on the repo request form
    • [mikebrown] Container Orchestrated Devices is another CNCF project that could do this, although it's not currently well-integrated with Kubernetes
    • [dawn] For SIG Node to sponsor a project, we need to figure out the scope first, then assess maintainability, feasibility, etc. We encourage collaboration to avoid unnecessary duplication, but it's not like we will only allow one implementation
    • There is some intersection between the two projects, so we can collaborate as well
    • [zvonkok] SRO is and was also used to enable "special resources" that are not devices, e.g. software-defined storage (Lustre, Veritas) with out-of-tree drivers; this does not fit completely into container-orchestrated-devices.
    • [zvonkok] On slide 10 we can see that "runtime enablement" is one of the steps in enabling a software/hardware device; this is actually where CDI fits in. It is the low-level part that SRO tries to abstract ("do not care about the peculiarities of the platform/runtime, etc."), so the CDI effort fits perfectly into the complete picture.
  • [zvonkok] Related to that, I wanted to pick up https://github.com/kubernetes/enhancements/pull/1949
    • Feel free!
  • [adrianreber] Forensic Container Checkpointing - looking for early 1.24 approval
  • [mweston & swsehgal] Discussion re CPU management prioritization (quick discussion to set follow-up meeting to then bring results back to sig-node)
    • Will be sharing out a document for feedback, send an email to the mailing list requesting feedback
      • Send out a doodle with the email to make scheduling a smaller group easier
    • Will bring it to a future SIG node meeting and will open a tracking issue
    • Trying to determine a roadmap for what is/isn't supported; get supported behaviours better documented (especially low-hanging fruit), and determine a roadmap for the gaps
  • [vinaykul] In-Place Pod Vertical Scaling - planning for early 1.24 merge
    • PR https://github.com/kubernetes/kubernetes/pull/102884
    • Alpha-blocker issue by Elana: Container hash excludes Resources when the in-place-resize feature gate is enabled; toggling the feature gate can restart containers.
      • We should use the current implementation upon GA+1.
      • Implemented an early prototype of the fix. Will update the PR by next week.
    • Is NodeSwap interop issue an alpha-blocker?
    • Identify any other alpha blockers.
    • 27 of 31 pod resize e2e tests fail due to missing containerd support for new CRI field (all tests pass with dockershim)
      • Chicken-egg problem.
      • What's a good solution? (Decision made at the last meeting, Nov 16):
        • Action for vinaykul: Ensure KEP requires containerd support for CRI change - Beta blocker.
        • Adapt tests to work around lack of support for alpha.
      • <Mike Brown> you could run off master / the upcoming release of containerd, or you could modify the test to support prior releases. In the field, customers may be using older versions of container runtimes - unless the requirement is now to upgrade the container runtime with the Kubernetes release?

November 23, 2021

Cancelled - US Thanksgiving week.

November 16, 2021

Total active pull requests: 199 (-13 from last meeting)

Incoming: Created 38, Updated 100
Completed: Closed 13, Merged 39

Potential PRs to fish out from the closed list:

November 9, 2021

Total active pull requests: 212 (-8 from the last meeting)

Incoming: Created 26, Updated 110
Completed: Closed 15, Merged 19

November 2, 2021

Total active pull requests: 220 (+4 from the last meeting)

  • Announcements
    • Code freeze is Nov. 17 in 2 weeks!
    • Vote in the steering election if you're eligible!
  • [@rata]: user namespaces KEP
    • Agree on high level idea
    • New proposal incorporated feedback from previous discussions
    • Slides to explain the plan & phases proposed
  • [vinaykul] In-Place Pod Vertical Scaling
    • Rebased and caught up with the latest code.
    • Need code review from Lantao & Elana
  • [adrianreber] Forensic Container Checkpointing (1.24)

October 26, 2021

Total active pull requests: 216 (-7 from the last meeting)

Incoming: Created 11, Updated 88
Completed: Closed 7, Merged 11
  • 1.23 KEPs review (soft deadline is reached)

From https://docs.google.com/spreadsheets/d/1P1J1QpayRmh2SNjs8T-wBCb6SgEOdWTRQ7MBol7yibk/edit#gid=936265414

  • KEP 277 (Beta) Ephemeral Containers: PRs merged
  • KEP 1287 (Alpha) In-place Pod update: PR is out: kubernetes/kubernetes#102884
  • KEP 1977 ContainerNotifier: removed from milestone
  • KEP 2040 (Beta) Kubelet CRI Support: PR is out: kubernetes/kubernetes#104575
  • KEP 2133 (Beta) Kubelet credential provider: WIP PR is out, unit tests are failing: kubernetes/kubernetes#105624. Suggest to remove from milestone. How does this impact the timeline for spinning cloud providers out of tree?
  • KEP 2273 (Alpha) VPA CRI Changes: rolled into kubernetes/kubernetes#102884
  • KEP 2400 (Beta) Node system swap support: will spill into 1.24. Remove from milestone - no support in runtimes, also failing tests
  • KEP 2403 (Beta) Extend podresources API to report allocatable resources: PR merged
  • KEP 2535 (Alpha) Ensure Secret Pulled Images: PR is out: kubernetes/kubernetes#94899
  • KEP 2625 (Beta) Add options to reject non-SMT-aligned workload: PR is merged
  • KEP 2712 (Alpha) PriorityClassValueBasedGracefulShutdown: PR is out: kubernetes/kubernetes#102915
  • KEP 2727 (Alpha) Add gRPC probe to Pod: PR is out: kubernetes/kubernetes#102162
  • KEP 2902 (Alpha) Add CPUManager policy option to distribute CPUs across NUMA nodes instead of packing them: PR is merged. Need to make sure to add it to the release tracking sheet.
  • KEP 2371 (Alpha) cAdvisor-less, CRI-full Container and Pod Stats: PR is out: https://github.com/kubernetes/kubernetes/pull/103095
  • [Marlow Weston] Looking to solve CPU allocation items where pinning is important, but also some pods may want to have some cores pinned and others shared. Currently semi-abandoned items are here: (https://github.com/kubernetes/community/pull/2435, https://github.com/kubernetes/enhancements/pull/1319)
    Would like to move forward and put together a team to try again to come up with a cohesive KEP. Already have use cases from four companies, and there are likely more that wish to be involved. Would like an initial list of names, in addition to those already there, so we can critically come up with the use cases we expect to cover and start coming up with what the best design is going forward.
    • Some of these use cases are already supported - are we trying to isolate the management parts of a node or specific pods?
    • If there is a set of use cases, we might be able to pull a group together
  • [adrianreber] Looking for approver feedback on the "Forensic Container Checkpoint" KEP (1.24) https://github.com/kubernetes/enhancements/pull/1990 to be ready once the 1.24 period starts to avoid missing the deadline as I did for 1.23.
    • Dawn wants to review but doesn't have bandwidth at the moment; she will approve
    • Can this be used for things other than forensics like migration?
    • There are a bunch of outstanding questions about how networking, storage, etc. will work on restore with multiple containers. Wanted to rein in the scope
  • [vinaykul] In-place Pod Vertical Scaling
    • Working on addressing the remaining review items this week. Apologies for the delay, as my new role has not given me a big enough block of "focus time" until last week. I'll reach out to reviewers over Slack once I update the PR.
    • kubernetes/kubernetes#102884
    • Should PR be split?
      • Perhaps. SIG-scheduling would prefer scheduler changes in a separate PR. Jiaxin Shan from Bytedance is looking into it in parallel.
      • If we split, both would need to go in quick succession - 2-phase commit
  • [mrunal/mikebrown] CRI PR version support https://github.com/kubernetes/kubernetes/pull/104575
    • Support both v1 and v1alpha1 or just one version?
    • Node overhead for the marshalling/unmarshalling needs to be measured to make a decision.
    • Perf vs. cognitive load is a decision to make at the next meeting
    • Also added to API reviews agenda:

October 19, 2021

Total active pull requests: 223 (-6 in two past weeks)

Incoming: Created 37, Updated 121
Completed: Closed 15, Merged 29
  • [marga] Possible KEP? Exposing OCI hooks up the stack. Draft here. Inspektor Gadget has wanted this for a long time (see "Future" section from 2019), but now there are more users out there that are resorting to workarounds because these are not exposed. I'd like to get this moving as a KEP if y'all agree. (Note the NRI effort.)
    • [Mike Brown] lots of interest. NRI hooks is one of the models

    • [Mrunal Patel] Hooks are wired in CRI-O; an example is the nvidia hook. Hooks are hard to write, so CDI is an effort to simplify hook writing by going for a declarative model that makes the actual hook much simpler.

    • [Alexander Kanevskiy] Prototype that makes NRI more flexible and works for both containerd and CRI-O: https://github.com/containerd/nri/pull/16. Injecting OCI hooks was one of the TODO items as an example of an NRI plugin
      CDI link: https://github.com/container-orchestrated-devices/container-device-interface

    • [Dawn Chen] Lots of the use cases are useful, but hooks are very powerful, and some hook implementations may harm the host and raise security concerns. In the past we wanted to continue discussing; maybe a more declarative approach will work best.

    • [Lantao] A hook can run anything on the host - it has access to the host and container environment. Exposing this in the k8s API or CRI can make the pod non-portable. Ideally we need to avoid environment-specific dependencies

    • [Mike Brown] pre-determined dependencies would be one way to go

    • [Dawn Chen] Are all use cases around tracing and obtaining labels and other runtime information?

    • [Marga] Main use case is tracing and applying the labels. Want to detect container start as early as possible. Other use cases are for security - being able to detect a container before it starts and decline it based on a whitelisted set of images.

    • [Elana] Faster detection of containers - PLEG streaming as opposed to a poll loop (Mike: yes, CRI event subscriptions for pod/container/image lifecycle)

    • [Marga] Other scenarios - would like to get some information, like the PID of a container, but there is no way (perhaps this should be a different KEP).

    • [Dawn] Container information is a scenario for runtime.

    • [Marga] want it in a unified way.

    • [Mrunal] There are edge cases - VM runtime, CRI-O may not have a process

    • [Sergey] Maybe expand motivation section to explain specific scenarios

    • [Dawn] need to discuss use cases individually. Some may be discussed with SIG Auth due to security concerns. Hooks may be the right technology, but we need to start with the use cases.

    • [Elana] Very large scope change as it's written. Need significantly more resources to ensure this lands and is maintained. Not a "1-2 PRs and then you're done" sort of change.

    • [Mike Brown] There's no common way to configure these hooks based on each container runtime. Kubernetes integration work would be quite complicated; there's no obvious generic way to do this right now, it's WIP.

    • [Danielle] Every time that piece of code is touched, it leads to regressions, as we don't have good coverage there.

    • [Brian Goff] Exposing OCI semantics on CRI seems messy.

      As opposed to configuring the runtime to run hooks

      Run anything not on the host *with root privileges

      My gut says modify (or make it possible to configure) the runtime to do what you need rather than changing the API.

      I don't think a hook could deny based on image.

    • [Eric Ernst] Second the security concern. Chance to run a binary on the host (or in the guest, if Kata) without any restrictions...

    • [Mrunal Patel] There are races in resolving a tag at admission vs. runtime pulling it.

    • [Alexander Kanevskiy] For anyone interested in CDI and NRI, and see if those can solve their usecases, welcome to join meetings of CNCF/TAG-Runtime/COD WG: https://github.com/cncf/tag-runtime/blob/master/wg/COD.md

  • [marquiz] Class-based resources in CRI KEP (draft)
    https://github.com/kubernetes/enhancements/pull/3004
    • Looking for feedback
    • Related KubeCon talk: https://sched.co/m4t2
    • TAG-Runtime presentation with examples on how it is used:
    • Annotations should generally be avoided for new KEPs as they are very difficult to manage with version skew, use alpha fields instead
    • Are we doing anything to expose this information to scheduler?
    • Not at the moment, optimizing everything on the node now.
    • What's the difference between blockio and cgroup v2 controls?
    • RDT is about memory bandwidth, it uses cgroup v2 underneath.
    • Classes are used to configure specific controls to scenarios like guaranteed burst or levels of IO support.
    • Need to make sure cgroup v2 support in k8s will work well with this proposal.
  • [Eric Ernst] Request for feedback/review: https://github.com/kubernetes/kubernetes/pull/104886
  • [SergeyKanzhelev] Please distribute this form: https://forms.gle/svCJmhvTv78jGdSx8
  • [SergeyKanzhelev] Follow up on this https://github.com/kubernetes/kubernetes/pull/105215#issuecomment-946916830

October 12, 2021

CANCELLED - KUBECON NA

October 5, 2021

LATE START: shifted by 30m due to availability (10:30 instead of 10:00 PT)

Total PRs: 229 (-1 from the last meeting)

Incoming: Created 23, Updated 85
Completed: Closed 9, Merged 15

September 28, 2021

Total active pull requests: 230 (-8 since last meeting 2 weeks back)

Incoming: Created 29, Updated 146
Completed: Closed 23, Merged 18

Reminder: the soft code freeze date for SIG Node is October 15 - 2 weeks away, including a week of KubeCon.

September 21, 2021 [Cancelled]

Cancelled due to lack of agenda.

September 14, 2021

Total active pull requests: 238

Incoming: Created 32, Updated 79
Completed: Closed 11, Merged 8

September 7, 2021

Total active pull requests: 222

Incoming: Created 18, Updated 67
Completed: Closed 8, Merged 2
  • [madhav] Feedback on addition of new PodCondition for pods that have ephemeral containers created.
  • [clayton/ryan] static pod lifecycle, static pod bugs, and static pod uids
    • Static pod lifecycle regression - summary of issues identified
    • Static pods that change UID, or that have a fixed UID and are deleted and recreated on disk, are broken
      • The fixes to pod workers guaranteed that other kubelet loops finish - those loops depend only on UID, so "delete/recreate with the same UID" is also broken, because the volume manager might be tearing down a static pod with one set of volumes and needs to complete before we let it spin up a new pod (which means we need to decide how to handle that)
    • Admission around static pods with fixed static pod uids is potentially broken
    • Working to identify what the right fix is
      • Suspect it involves limiting what static pods with fixed UIDs can do, as well as supporting the key use case of "only want one instance of a static pod running at the same time", but will need feedback
    • Also, we need a KEP (and then docs) for what static pods are actually supposed to do, what we support, and how we test it
      • Ryan / Clayton to pair up on static pods
      • Clayton to clarify pod safety expectations around static pods (which was what triggered the refactor that led to the regression)
      • Sergey to help with pod admission KEP
  • [aditi] Just a note: please add the Kubelet credential provider KEP to the 1.23 tracking sheet.
  • [vinaykul] In-place Pod Vertical Scaling KEP - status update
    • I just got back from vacation, so I won't be in today's meeting.
    • Thanks Elana for getting the two KEPs tracked towards 1.23
    • I'm planning to address the outstanding items in the next few weeks.

August 31, 2021

Total active pull requests: 213

Incoming: Created 25, Updated 53
Completed: Closed 7, Merged 10
  • [klueska] Please add this KEP to tracking sheet (only SIG leads can edit the sheet)
  • [mrunalp] Limiting number of concurrent image pulls
    • Making it at the runtime level is also possible. It may be easier to do it in the kubelet to minimize back and forth at the CRI API level
    • [Lantao] memory-based limit or number limit?
      • likely specific for go version and likely based on the number
    • [Lantao] another concurrency setting - parallelism on number of layers
    • [Derek] let's try a serial puller as well to compare.
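    • For context on what exists today (the numeric limit discussed above would be new): the main kubelet-level knob is the all-or-nothing --serialize-image-pulls flag (serializeImagePulls in the kubelet config file). A minimal sketch:
      # Default behavior: image pulls are serialized, one at a time.
      # Turning serialization off allows unlimited parallel pulls; there is no bounded cap today.
      kubelet --serialize-image-pulls=false ...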
  • [madhav] Feedback on addition of a new PodCondition on pods that have an EphemeralContainer created
  • [MikeBrown post meeting] (memory, storage, number of concurrent)/(number of network devices/bandwidth)... possibly need a controller at the scheduler level instead of handling failures at the kubelet to reduce thrashing.
  • [marosset] Failures when starting the sandbox container keep the Pod in a pending state indefinitely - should the pod be marked as failed?
    • https://github.com/kubernetes/kubernetes/issues/104635
    • [Derek] PodPhases are designed this way, and the hope is that the sandbox will eventually be created. It is hard to change this behavior/assumption
    • [Alex] should we have a condition or known errors which will indicate the terminate state?
    • [Ibrahim] another example - an image that doesn't match the platform will never start
    • [] CSI failure is another example where the kubelet keeps retrying even if the failure is terminal
    • [Derek] maybe look into a timeout after which the kubelet should stop trying to start the pod and will mark it as failed?
    • There is also a device plugin scenario where pod cannot be scheduled on the node
    • [Derek] if this were only CRI flows, it would be easier
    • [rphillips] found a link after the meeting; there is logic to recreate a sandbox if it failed, so perhaps the error is not being reported in the sandbox statuses [link]

August 24, 2021

August 17, 2021

Total active pull requests: 207 (+5 from the last meeting)
Incoming: Created 21, Updated 73
Completed: Closed 5, Merged 13
  • Finish with KEPs for 1.23 review: https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E
  • [kmala] https://github.com/kubernetes/kubernetes/pull/98934
    • Moved to next week. Need Clayton presence
  • [bobbypage/qiutongs] Complications with cAdvisor/CRI integration (e.g. file system metrics) and thoughts on moving CRI API out of k/k staging
    • let's move CRI to its own repository to avoid circular dependencies
    • long term want to move metrics to CRI.
    • Question: when will this happen and whether moving CRI out of staging is a long term goal anyway?
    • Derek: trying to think of the mechanics of it. It is important to preserve the promise that CRI is versioned with Kubernetes. If we decouple it from k/k, we need to be careful about its versioning.
    • David: maybe we can have a feature freeze date for CRI as part of the process and propagate this across multiple repos.
    • Mike: or have a k8s release branch in the cri repo specifically tied to each k8s release.
    • Dawn: CSI recently had dependencies and release issues. Maybe they can share some experiences.
    • Xiangqian, Xing: k/k has a dependency on CSI. Every release there is an update in k/k. Sometimes development is in the CSI repository. Trying to align these changes and coordinate them with k/k changes. Issues are mitigated by the fact that the SIG has control over the CSI repo.
    • Dawn: heard that there are some regrets about having CSI in a separate repo. Process overhead is high.
    • Xing: also working on moving drivers out of tree. Having them in the repo can cause issues. So there are pros and cons in both cases.
    • Lantao: is it possible to do go module magic? (yes but.. managing n apis via submodule is also problematic)
    • Dawn: let's start an e-mail thread on this topic.
  • [matthyx/adisky] Draft Proposal for Keystone containers KEP https://github.com/kubernetes/enhancements/pull/2869
    • Derek: didn't have time to read through the KEP, questions can be addressed in another meeting later.
    • Matthias: only need a reviewer and then we do it asynchronously.
  • [sjennings/decarr] Notification API KEP discussion
    https://github.com/kubernetes/enhancements/pull/1995

August 10, 2021

Total active pull requests: 202

Incoming: Created 28, Updated 88
Completed: Closed 13, Merged 29
Rotten: #[98542](https://github.com/kubernetes/kubernetes/pull/98542) #[90727](https://github.com/kubernetes/kubernetes/pull/90727) (related bug may be addressed by other PRs) 

August 3, 2021

Total active pull requests: 216

Incoming: Created 41, Updated 54
Completed: Closed 20, Merged 10

Closed PRs - mostly WIP, test validation PRs. A few Rotten: #84032 #99611. Merged PRs are mostly cherry-picks and test updates since we are in test freeze now.

  • 1.22 release date: tomorrow, Aug. 4
  • [ehashman] 1.22 burndown update
  • [matthyx] requesting reviewer and later approver role to help sig-node CI subgroup in:
    • test/e2e/common/OWNERS
    • test/e2e/node/OWNERS
    • test/e2e_node/OWNERS
    • [dawn] Possible areas of approver: top-level node, community repo, node tests in k/k, cadvisor (not a Kubernetes project), node problem detector
    • What about test-infra?
      • [derek] Should probably be tied with node e2e, would be happy with Elana and Sergey to both apply for node test approver, test-infra
    • Need to keep balancing new energy with existing contributors
  • [adrianreber] checkpoint/restore KEP reworked and ready for review
    • https://github.com/kubernetes/enhancements/pull/1990
      • Kep #2008
    • Update the KEP to split it into stages, to note that 1.23 only includes the checkpoint part, with the ability to review the checkpoint later outside of K8s.
    • Adrian to send e-mail when KEP is updated.
    • runc can already do checkpoints. This KEP only adds initiation of the checkpoint. What is this KEP's value-add?
      • Derek:
        • this work adds context to checkpoint around secrets?
        • Problem with restore: tracking which images the pod was restored with, and preventing these images from being used on a later restart, is hard.
      • Adrian:
        • Container engines cannot handle checkpointing containers running in shared namespaces. Want to add it to the CRI API; the container engine will do the implementation. The main value-add is deciding to put it in CRI.
      • Dawn:
        • Since it's only the CRI API and does not leak out to the Pod API, it is fine to implement without the restore for now.

July 27, 2021

Total active pull requests: 204

Incoming: Created 24, Updated 53
Completed: Closed 6, Merged 8
  • Announcement: doc freeze is today!
  • [ehashman] 1.22 burndown
  • [adrianreber] checkpoint/restore KEP reworked and ready for review
    • https://github.com/kubernetes/enhancements/pull/1990
      • Kep #2008
    • Will move this to next week - working through comments from Derek/Mrunal
    • [sergey] Was the discussion from last time about how checkpoint/restore will work across different nodes resolved?
    • We want to store the checkpoint image in a registry so it can be transferred and avoid local storage between nodes
    • Lifecycle: can checkpoint a container, so long as it doesn't use external devices (e.g. GPUs, SR-IOV), once the init container has finished running
    • [vinay] Checkpointing the running container's image?
    • Including its memory footprint/pages and all; it will be available after migration/reboot as it was before
  • [vinaykul] In-place Pod Vertical Scaling KEP for early v1.23 - Status update
    • PR https://github.com/kubernetes/kubernetes/pull/102884
      • API review approved by Tim Hockin
      • Current PR squashed and rebased.
      • Starting work on unresolved issues in kubelet
        • Outstanding issues tracked here.
    • Scheduler changes are a bit more involved than I initially thought. They requested a separate PR that follows the above main PR.
      • Can we do that? (Two PRs need to go in quick succession.)
      • No problem with this, but code will be reverted if one PR misses deadline.

July 20, 2021

Total active pull requests: 193 (-1 from last week)

Incoming: Created 13, Updated 58
Completed: Closed 6, Merged 8
  • [ehashman] Reminder: docs reviewable deadline today!
  • [ehashman] 1.22 burndown:
  • KEPs retrospective for 1.22
    • SIG Node KEPs planning: https://docs.google.com/document/d/1U10J0WwgWXkdYrqWGGvO8iH2HKeerQAlygnqgDgWv4E/edit#bookmark=id.7mxl83pa2zof
    • sig node 1.22 KEPs retrospective
      • 13 of 24 implemented
      • 2 - w/exception
      • 4 - denied with exception
    • Kubernetes 1.22 Release Information: https://www.kubernetes.dev/resources/release/#kubernetes-122
    • Kubernetes 1.22 Enhancements Tracking
    • Things that could have gone better
      • Difficulty with API reviews. Until the last day we didn't have comments from API reviewers. Some PRs were not marked for API review until the last moment.
      • Suggested action: in future releases, ensure all node KEP PRs are tagged with api-review and have a reviewer assigned as early as possible.
      • Reviews came in late for a lot of feature PRs, a lot of them were marked WIP until the very last week
      • Suggested action: KEP PRs need to be marked as ready for review well before the final week before code freeze
      • We asked for exceptions for some things that weren't quite ready
      • Suggested action: in the future, only ask for exceptions when we've engaged with chairs early and the feature is nearly complete
      • PRR didn't catch some of the breaking issues with seccomp-by-default; enabling it by default could break some workloads
    • Things that went well
      • Required e2es to be ready in PRs themselves and that worked really well: gRPC probes were an example and we found quite a few issues by asking for tests
      • Small KEPs like pod FQDN, Size memory backed volumes merged very early
  • [ddebroy] Introduce Draft KEP “Runtime Assisted Mounting”
  • Lantao: what exact problems is it solving? It seems like Kata already solved this with Raw Block (https://github.com/kata-containers/runtime/issues/1354). It would be nice to know more about the pros/cons of this approach, to understand whether and what we need to change in Kubernetes.
  • Mrunal: does it have some intersection with how Kata wants to know more about other things, like device information?
  • Dawn: want the proposal to outline more details on the problem. Why can't it be solved in the current design of CRI?
  • [xyang] Non-graceful node shutdown: https://github.com/kubernetes/enhancements/pull/1116
    • In the KEP, we are proposing to add a “quarantine” taint to a node that is being shutdown. It is to prevent new pods from being scheduled to this node when it comes up before it is cleaned up.
    • Can we rely on the Node Shutdown Manager to apply this taint?
  • David: it is there already. NodeReady state is already set. Reason of NodeReady condition can be checked.
  • Elana: caution against the taints - they are unreliable.
  • Xing: we want the status to be changed when Node is coming back up.
  • Elana: maybe look at the previous status.
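  • For illustration only (the exact taint key and effect were still open questions in the KEP at this point; the values below are just examples), a quarantine-style taint could be applied and cleared like this:
    # Taint the shut-down node so new pods are not scheduled and existing pods are evicted...
    kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
    # ...and remove it once the node has been cleaned up (the trailing "-" removes the taint).
    kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute-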

July 13, 2021

Total active pull requests: 194 (-23 from last two weeks)

Incoming: Created 42, Updated 127
Completed: Closed 23, Merged 43
  • [swsehgal] Are the requirements for being a reviewer of a well-defined subsystem (e.g. Container Manager) the same as the requirements for an entire component (like the kubelet)?
    • Background/Motivation: Hard to find reviewers in the area of CM/CPU Manager/PodResource API (in general in pkg/kubelet/cm)
    • [Dawn] Approver status today is too coarse-grained. We have some subareas (not explicit, but generally known), like resource management, monitoring, security, storage, etc. In the past we had clear owners for these areas. As time passed, some people moved on and some changed interests. So there is now a discussion behind the scenes about how to make these areas explicit.
  • [fromani] final call for cpumanager policy aliases - we need a decision
    • fromani's take: I think this is more of an issue when the feature reaches beta rather than for the alpha stage
    • update: settled in the PR comments
  • [vinaykul] In-place Pod Vertical Scaling KEP - Status update
    • PR https://github.com/kubernetes/kubernetes/pull/102884
      • API review approved by Tim Hockin
      • Most items from Lantao's review addressed and TODOs taken
        • Outstanding issues also tracked here.
    • Cut from 1.22; let's shoot for an early 1.23 check-in.
    • SIG-scheduling review ETA this week.
    • Identify any missing reviewers and loop them in.
  • [Balaji] KEP Add support for pluggable pod admit handler

July 6, 2021

Cancelled due to US holiday

June 29, 2021

  • PR/Issue update

  • Bugs scrub charts:

  • [ehashman] Swap update

  • [ehashman] Requesting Node approver https://github.com/kubernetes/kubernetes/pull/103122

  • [vinaykul] In-place Pod Vertical Scaling KEP - Updates & PR review request

    • PR https://github.com/kubernetes/kubernetes/pull/102884
      • API changes - see commit
      • CRI changes - see commit
      • Core implementation mostly done - see commit
        • Scheduler & RQ accounting left to finish
      • Next major task: E2E tests.
        • @wangchen615 from IBM is working on this
      • Lantao to review the PRs
      • Tim Hockin has it in his backlog
      • E2E needs to be part of the main PR
  • [danielfoehrkn] Auto system/kube-reserved based on actual resource usage

    • Issue: https://github.com/kubernetes/kubernetes/issues/103046
    • [Alexander]: The cgroup hierarchy would have to be parsed to evaluate the actual utilization?
      Expanded question/comment: I like the overall idea about adjusting some of kubelet parameters during the lifecycle of kubelet. However, it might not be a good idea to add complexity into kubelet to be dependent on cgroups implementation (v1 vs v2, linux vs windows, etc). It might be a bit better idea to generalize the way of dynamically updating kubelet configuration and use e.g. external operator(s) that would be evaluating the state of the node (e.g. via some external metrics, like Prometheus) or overall cluster health and make a decision about adjusting reserved portions.
    • [Derek]: Has there testing been done with Enforced Allocatable feature enabled?
      • recommended setup explored
      • needs further investigation
      • are the runtime and kubelet put in separate cgroups?
    • Has cgroupv2 been explored?
      • Cgroupv2 will allow more signal from kernel
    • [Dawn]: Kernel OOM is best effort at the moment
    • Kubectl exec should also be taken into consideration
    • [Eric]: Problems with under-reservation have been highlighted. What about over-reservation?
      • fewer resources advertised to the scheduler
    • [Derek]: An interesting area that should be explored/researched further
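    • For reference, a minimal sketch (values are illustrative) of the static reservation knobs that exist today, which the issue above proposes deriving from observed usage instead:
      kubelet --kube-reserved=cpu=100m,memory=500Mi \
              --system-reserved=cpu=200m,memory=1Gi \
              --enforce-node-allocatable=pods,kube-reserved,system-reserved \
              --kube-reserved-cgroup=/kubelet.slice \
              --system-reserved-cgroup=/system.slice
      # Enforcing kube-reserved/system-reserved requires the corresponding *-cgroup flags;
      # the same settings are also available as fields in the kubelet configuration file.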
  • [bart0sh] request for PR review & approval:

    • promote huge page storage medium size to GA
      • PR to add conformance tests? Needs update to the hardware config
      • Needs approval
  • [vinayakankugoyal] thoughts on https://github.com/cri-o/cri-o/pull/5043 and https://github.com/containerd/containerd/pull/5668 could we support ambient capabilities without making a change to k8s API and CRI API? I think we can and so I opened those PRs. I would love to hear the thoughts of the CRI maintainers though!

PRs status:

Total active pull requests: 217

Incoming: Created 36, Updated 143
Completed: Closed 10, Merged 23

Three-week stats (since the last update):

Incoming: Created 92, Updated 184
Completed: Closed 31, Merged 56

June 22, 2021

June 15, 2021 [Cancelled]

Cancelled as there were no agenda items proposed for the meeting.

Important dates:

June 8, 2021

Total active pull requests: 204 (+1 since the last meeting)

Incoming: Created 29, Updated 93
Completed: Closed 7, Merged 25
  • [SergeyKanzhelev] Currently we have a specific order of container startup. Also, a second container wouldn't start before the prestart hook finished for the previous container. This is not a documented scenario, but there are apps taking a dependency on this behavior (mostly sidecars). Do we want to promote this behavior to conformance?
    • Not a conformance
    • Just have a test - do we have unit tests already?
    • Changing this behavior will require a KEP
    • [ehashman] Conformance tests have a very specific purpose; we should be adding them to reflect universal user expectations, not implementation details. Lots of room to add unit test coverage: kubelet is only at ~56% covered right now
  • [ranchothu] A KEP about taking the LLC cache into account in CPU allocation: in some architectures like AMD Rome/Milan, more than one LLC exists in an individual socket, and the original algorithm may cause a performance decrease in this scenario.
  • Francesco has been in touch with the author - the timing is not good for them. Can we have an APAC-friendly node meeting time?
  • We used to run an APAC time for node for about half a year (11PM PT) but people stopped attending after the first few meetings. Would be open to having another meeting, but people need to attend regularly
    • [fromani] maybe just reserve the slot (once/twice a month) and have the meeting only if there are agenda items?
  • [ike-ma] test-infra node image
    • (Context: https://github.com/kubernetes/test-infra/pull/22453 )
    • Do we have any policy regarding onboarding new images?
      • [PUBLIC][Proposal] Re-categorize the Node E2E tests
      • [dawn] Lantao was the founder for this area. Need an owner for each distro. We used to have policy for this, but not specifically for image lifecycle
      • [lantao] Previously CoreOS was in presubmit and blocked PR submission; it was later removed.
      • [Mrunal] prefer using existing images, will ask Harshal for more details on how the Fedora CoreOS images are used
      • [dawn] the swap feature itself is less OS-distro dependent and depends more on kernel modules
    • Do we have any policy/process regarding image update/release pipeline? eg: update kernel version, containerd version etc?
      • Ubuntu/COS/Fedora/CoreOS - focus on existing image
    • Do we have any best practice/recommendation of test coverage for image-oriented features? Two patterns in use right now: focus on feature hugepage (test coverage) vs smoke run cgroupv2 (test coverage)
      • Prefer to tag on Feature
  • [ehashman] Pod lifecycle rework
  • [jberkus] URGENT: which runtimes does the Kubelet Stream regression affect? Contributor Comms needs answers so we can notify users.
    • Went out in the May patch releases
  • [ehashman] Reminder: June cherry-pick deadline is this Friday (June 11)

June 1, 2021

Total active pull requests: 203

Incoming: Created 28, Updated 74
Completed: Closed 13, Merged 8
  • [ehashman] Reminder: code freeze is July 8 (~5.5 weeks)
  • [Sergey] New APAC time for CI+Triage session
    • tentatively 03:00 UTC on Thursdays (8pm PT on Wednesdays)
  • [swsehgal] Pod Resource API
  • [vinayakankugoyal] KEP-2763 - ambient capabilities support
    • We are proposing changes to CRI API
    • Could someone from sig-node volunteer to be point-of-contact and help with review and approval from the sig-node side?
      • Mike Brown - github.com/mikebrow containerd
      • Mrunal - CRI-O
    • Also reach out to someone from the PSP++ effort?
  • [n4j] Redirect container stdout / stderr to file
    • Does this need a KEP, since this is a non-trivial change and might require a change in the Pod API?
    • Need consensus on the extensibility of the feature i.e. would we support redirect only to file or it can be to external collectors like fluentd
    • Is there a way to control (redirect) the output of the process via the CRI?
      • [mrunal] To an additional file? There are serious performance implications of such a change
      • [n4j] want to be able to set stdout to one file, stderr to another
      • [mrunal] think there is a bigger story for the CRI and what solutions make sense
      • [Dawn] this was discussed from the beginning of the CRI - adding redirects for logs or adding additional file targets. We did not do that for many reasons: using journald or other log mechanisms already covers most cases that this would enable. There are concerns around complexity and performance, which is why the work hasn't been done in the past. Needs a full design proposal, starting from a problem statement, as there are already ways to do this. At a high level, there are workarounds, so we want to hear about why those aren't enough.
      • [Lantao] For a legacy container image, you can mount a logfile from the host. Are there security concerns or other issues?
      • [ehashman] Summary of problem statement from issue… possibly we just need additional documentation because people don't know how to do this?
      • [Lantao] Will follow-up with a workaround on the issue. https://github.com/kubernetes/kubernetes/issues/94892#issuecomment-852321278
    • Who would be point-of-contact for design discussion and some early code feedback?
  • [adisky] How to proceed further on flags to component config migration
    • Currently 96 flags have been marked as deprecated in favor of component config, without any timeline for removal.
    • New flags being added to the kubelet are marked as deprecated on creation. It looks confusing from the user's perspective, and it will also be difficult to track later on which flags need to be removed and when.
    • Should new options be added as config only and flag addition be discouraged?
      • [ehashman] Appears to be owned by a defunct WG (Component Standard)
      • [dims] The WG is gone. Need to stop the bleeding - new CLI params should be avoided and added to KubeletConfig instead. Could add unit tests to check for and prevent this, and then set a timeline for removal of deprecated flags.
      • [aditi] All the kubelet flags are getting added as deprecated.
      • [dims] Need to update community page and/or add a KEP discussing deprecation.
      • [ehashman] Suggest we do a KEP, solicit feedback, and announce and remove all the flags in one release rather than removing them piecemeal which is causing issues for downstream consumers.
      • [dawn] We paused because kubelet has been leading, but there are other components which aren't migrating away from flags. We paused for a year and were waiting on other components. Many flags are GA so we could safely deprecate them, but some need their own deprecation processes (e.g. DynamicKubeletConfig)
      • [Lubomir] +1
      • [Fabrizio] The situation across Kubernetes is confusing for users and platform tools. Not just a kubelet issue, but it's more pressing here because of the deprecation notices.
      • [dims] It would be ideal if things were handled via Component Config, but it's not staffed. Kubelet is a bit special because it usually doesn't run containerized like other components.
      • [Lubomir] The story for kubelet users needs to be consistent because it's very confusing. Would like to see something from SIG Arch that is k8s-wide.
      • [ehashman] Would suggest that we don't try to increase scope, because we already tried that with a WG which is now defunct. Let's keep Cluster Lifecycle driving and working with Node. Might want to get a proposal from them representing user needs and what we might want to change.
      • [dawn] Let's ensure KubeletConfig is working with kubeadm and that we're not adding new flags. Can't necessarily just get rid of all the deprecated flags in one go, as there is more work that needs to be done.
      • [dims] Want to publish a schedule because it makes it easier to get new contributors helping out.
      • Agreed: publishing a schedule, identifying immediate pain points is a reasonable thing to spend time on.

May 25, 2021

  • [ehashman] Fixing termination and status reporting (doc)
  • [yangjunmyfm192085] Exposing container start time in kubelet /metrics/resource endpoint
    • Context: https://github.com/kubernetes/kubernetes/issues/101851
    • [Derek] What is the node start time? When the node boots? When kubelet first marks itself NodeReady?
    • [Elana] Static pods could start before NodeReady happens
    • [Dawn] Static pod usage is discouraged, makes sense to set this to first NodeReady time
    • [Derek] Proposed start time for containers makes sense
    • Need to be clear on what the node start time metric is used for and how it's defined
    • [Elana] Proposal says node metric is not necessarily required, and was added for symmetry - can we potentially not add it?
    • [Dawn] We should provide the most generic data possible, e.g. node has started and is ready to join the cluster
    • [Lantao] Node boot time could be useful in the CPU context
    • [Dawn] NPE uses node boot time as well
    • Action: summarize discussion on bug, bring to SIG Instrumentation for confirmation
  • [verb] kubelet metrics for types of containers
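  • For reference on the endpoint discussed above, the kubelet resource metrics can be fetched through the API server node proxy (node name is illustrative); the proposal would add start-time series alongside the existing CPU/memory metrics:
    kubectl get --raw "/api/v1/nodes/node-1/proxy/metrics/resource"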

May 18, 2021

Total active pull requests: 202 (+1 from the last meeting).

Incoming: Created 24, Updated 94
Completed: Closed 17, Merged 9
  • [ehashman] Node bug scrub
    • Proposed date: June 24-25
  • [ehashman] Swap work breakdown
  • [mrunal/harshal] node e2e - runc update broke crio tests:
    • Enabling this: https://prow.k8s.io/?job=pull-kubernetes-node-crio-e2e as presubmit would have caught it.
    • [Dawn] need to understand how many combinations of systemd versions we need to test. CoreOS was excluded from presubmits before because it was on the latest systemd and was very noisy. The decision was that it wasn't in SIG Node's scope to own this.
    • [Mrunal] this may not be the same, as this version of CRI-O is running in production for many customers. Making this job blocking would make runc updates easier
    • [Dawn] is it a kubelet or a systemd/runc integration issue?
    • [M] it is the kubelet running into the "wrong" codepath.
    • [Elana] Based on previous meetings - the goal of upstream CI is to support CRI, and to test more than one container runtime.
    • [D] if the kubelet supports a cgroup driver, we likely need to run tests for it
    • [Brian Goff] Problem is this was exposed after rc94 was released. CI is needed against runc HEAD
    • [Elana] there were some WIP PRs open with runc HEAD but the lack of the blocking job meant no signal
    • [Dawn] as long as we support these tests it is fine. When CoreOS was breaking people's PRs it was a problem
    • [David] Serial tests may have better signal
    • [Mrunal] making it a runc issue is hard. They don't have enough CI and we might need to contribute
    • [Brian] Periodic job on runc would be better to catch regressions early, before runc releases.
    • [Dawn] In the past we started this conversation. But didnt follow up these details completely
    • [Mrunal] I'm one of the runc maintainers and can facilitate the discussion. Every time runc tags something, people ask for more changes. Frequent updates are hard on runc because of the tag requirement from Kubernetes.
    • [Lantao] The model we use for containerd today is that containerd decides what runc version to use, and we test containerd + runc as a whole. Containerd does update runc periodically, and the runc update needs to go through Kubernetes node e2e tests.
    • [Mike Brown] we run tests against master of containerd. Should also be possible for runc
    • [Mrunal] Situation is different as runc is vendored into k8s
    • [Dawn] should we look into this integration and make sure we test it?
    • [Lantao] Problem is with runc vendoring
    • [Mrunal] yes, tags are rare and k/k cannot take non-tagged deps
    • [Mike Brown] are tags only for release?
    • [Mrunal] Dims and Liggitt own this decision (about tags)
    • [Mike Brown] maybe run both.
    • [David] Question is whether test-infra would allow vendoring the latest for tests
    • [Dawn] kubelet to runc integration is where the challenge is coming from. Maybe runc can tag daily?
    • [Mrunal] lets have a meeting with runc maintainers to work out the plan
    • [Elana] talked with Dims. Opinion: let's get more tags. In-between vendoring is hard.
    • [Brian] lots of diff on runc. It needs to slow down. Daily tagging sounds rough.
    • [Mike] maybe containerd/cAdvisor can run against master runc continuously
    • [Dawn] since k8s is a major runc user, maybe runc is ultimately the best place to run these tests.
    • [Mrunal] about too many diffs and slowing down - I wish we could slow down, but there are so many at-scale issues being discovered now.
    • [Brian Goff] Testing all permutations of runc uses is hard. Maybe have a subset of well-known cases integrated into runc.
    • [Dawn] let's not overreact and slow everything down, or tax everything because of the need to test this integration
  • [Brian Goff] virtual kubelet: the downward API handling was a copy (don't want to import k8s.io/k). Can the downward API parsing be moved someplace else?
    • [Elana] is there an issue tracking this?
    • [Dawn] file an issue, please, to discuss. One con to share: maintaining releases is becoming harder with components moving out and being vendored. Cost increases as there is less ownership and understanding of the components that were moved out. Identifying owners is hard for components that were moved out (like cAdvisor or NPD).
    • Also integration and compatibility issues will arise.
    • [Brian] can sympathize with it. Will open an issue.

May 11, 2021

Total active pull requests: 201 (+2 from the last meeting).

Incoming: Created 26, Updated 102
Completed: Closed 9, Merged 17

May 4, 2021

Total active pull requests: 199 (-9 from the last meeting).

Incoming: Created 24, Updated 98
Completed: Closed 9, Merged 24

Apr 27, 2021

  • Cancelled, no agenda items

Apr 20, 2021

Total active pull requests: 208 (+9 from the last meeting)

Incoming: Created 27, Updated 85
Completed: Closed 10, Merged 8

Apr 13, 2021

Total active pull requests: 199 (-2 from the last meeting)

Incoming: Created 29, Updated 90
Completed: Closed 11, Merged 20

Apr 06, 2021

Total active pull requests: 201 (+19 from the last meeting)

Incoming: Created 40, Updated 125
Completed: Closed 11, Merged 12

Mar 30, 2021

  • Meeting canceled (lots of folks are out of office)

Mar 23rd, 2021

Total active pull requests: 182 (-8 from the last meeting)

Incoming: Created 26, Updated 83
Completed: Closed 6, Merged 27
Swap support behavior, before vs. after alpha:
  • kubelet behavior: before alpha, fails to start on a swap-enabled node by default; after alpha, OK to start on a swap-enabled node by default. (No visible performance/behavior change from the workload point of view.)
  • consuming swap: before alpha, N/A (no workload can consume swap); after alpha, no workload can consume swap by default.
  • limiting swap: before alpha, N/A; after alpha, expose a KubeConfig parameter to set a limit for all containers through CRI. (Experimental only.)
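A minimal sketch of what the alpha behavior in the list above looks like on the kubelet side (flag and feature-gate names per the NodeSwap KEP; values illustrative):

  # Allow the kubelet to start on a node that has swap enabled (alpha behavior above).
  kubelet --feature-gates=NodeSwap=true --fail-swap-on=false ...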

Mar 16th, 2021

  • [SergeyKanzhelev] CI/Triage subgroup updates

Total active pull requests: 190 (-4 from the last meeting)

Incoming: Created 55, Updated 137
Completed: Closed 22, Merged 42

merged: 17 cherry-picks, 9 sig/instrumentation

Mar 9th, 2021

  • [SergeyKanzhelev] CI/Triage subgroup updates

    Total active pull requests: 194 (-13 from the last meeting)

Incoming: Created 69, Updated 154
Completed: Closed 20, Merged 67

https://groups.google.com/g/kubernetes-sig-node/c/yNjFrBdH18Q

Mar 2nd, 2021

  • [SergeyKanzhelev] CI/Triage subgroup updates

Total active pull requests: 207 (+7 from the last meeting)

Incoming: Created 44, Updated 118
Completed: Closed 15, Merged 26

Feb 23rd, 2021

Total active pull requests: 200 (+6 from the last meeting)

Incoming: Created 35, Updated 90
Completed: Closed 13, Merged 19

Feb 16th, 2021

Total active pull requests: 194 (+17 from the last meeting)
Incoming: Created 42, Updated 77
Completed: Closed 6, Merged 19

Feb 9th, 2021

Total active pull requests: 178 (-1 from the last meeting)

Incoming: Created 21, Updated 90
Completed: Closed 15, Merged 8

Feb 2nd, 2021

Total active pull requests: 179 (+4 from the last meeting)

Incoming: Created 38, Updated 97
Completed: Closed 17, Merged 17

Jan 26th, 2021

Total active pull requests: 172 (+7 from the last meeting)

Incoming: Created 33, Updated 91
Completed: Closed 11, Merged 15

Jan 19th, 2021

Total active pull requests: 161 (-4 from the last meeting)

Incoming: Created 29, Updated 65
Completed: Closed 15, Merged 18

Only two rotten PRs - one needed a KEP. The other may be interesting to pick up if anybody is interested: https://github.com/kubernetes/kubernetes/pull/81774

Spilled over from last meeting:

Jan 12th, 2021

Total active pull requests: 164 (-12 from the last meeting) Way to go!!!

Incoming: Created 23, Updated 94
Completed: Closed 19, Merged 16

Please approve lgtm'd PRs from the SIG Node CI group: https://github.com/orgs/kubernetes/projects/43#column-9494828

Jan 5th, 2021

Total active pull requests: 183 (-13 from the last meeting)

Incoming: Created 66, Updated 128
Completed: Closed 34, Merged 45