community/sig-node/archive/meeting-notes-2020.md

SIG Node Meeting Notes (2020)

Future

Dec 22 & 29, 2020 Cancelled

Cancelled. Merry holiday season!

Dec 15, 2020 Cancelled

No agenda. Cancelled. Merry holiday season!

Dec 8, 2020

Total active pull requests: 196 (+6 from the last meeting)

Incoming Completed
Created: 14 Closed: 5
Updated: 60 Merged: 3
  • [@rata]: Friendly ping to review sidecar pre-proposal from Tim Hockin
    • Next steps?
    • Discussed an issue with init containers and where the pod takes its state from. Maybe we need more checkpoints: instead of the runtime reporting the status, the kubelet keeps the status itself. The kubelet knows far more about the status, so it would not mirror the runtime's understanding of container state into the API.
    • Dawn - yes, with checkpointing, sidecars will be a reality. Dependencies are needed.
    • Derek: I want to emphasize that there are people working on making what we already have work reliably, and that effort needs to be appreciated.
    • Dawn: GC behavior is also not very well defined and needs to be formalized or even redesigned.
    • Derek: what's missing in the proposals is a confident implementation proposal that will make things better rather than less reliable.
    • Sergey: we need to try to agree on the north star and split it into the road map later. Mrunal: have a feeling that the north star is a DAG. Rodrigo: maybe more phases and custom phases would be a better north star.
    • Derek: also need to make sure we are addressing the keystone container issue, where a sidecar container failure must stop the main container because it provides security or similar functionality.
    • Dawn: next time let's invite Tim. Maybe not the next week; give people time to read it.
    • Dawn: very important to make sure we list the scenarios to support and the scenarios that will not be supported.
  • [@SergeyKanzhelev] dockershim deprecation - plan review
    • We may need to revisit the plan of not compiling dockershim in 1.22 at the end of January, when we will have the first customers on Windows and some commitments from telemetry vendors.
    • Derek, Dawn: yes, we need to make sure we are not breaking things. Pushing a few releases should be feasible.
    • Dawn: maybe even limit it to Windows-only as an alternative
    • Dawn: important to keep dockershim testing in OSS so we are not breaking anybody.
    • Mike Brown: we may need to keep the tests running even when dockershim is not released with k8s but is still in-tree (1 year after 1.22, as the current plan states). Additionally, there will be a second dockershim external to the kubelet; see the cri-dockershim project. This external shim should be testable in a similar fashion to containerd/CRI-O after adding the proper infra implementation.
  • [@ehashman] SIG Node Triage meeting?
    • Discussed previously
    • Sync on current test health efforts first before adding another meeting?

Dec 1, 2020

Total active pull requests: 188 (+5 from the last meeting)

Incoming Completed
Created: 18 Closed: 5
Updated: 42 Merged: 8
  • [mauriciovasquezbernal]: User ns KEP
    • Reminder asking for review.
  • [SergeyKanzhelev] ContainerD tests: looking for volunteers (TODO: link)
  • [Mrunalp] seccomp enabled by default? Provide a flag to disable
    • [Dawn] - this change will need wide and loud notification well ahead of the release, as this default can break some vendors
    • Seccomp enabled by default has already prevented some security CVEs in container runtimes - an example is gVisor
    • Tim Allclair may also be interested in this topic
    • [Action] Mrunal will start writing a proposal.

Nov 24th, 2020

Total active pull requests: 182 (-6 from the last meeting) (note: the discrepancy in numbers is due to the sig/node label being applied to or removed from PRs).

Incoming Completed
Created: 45 Closed: 16
Updated: 115 Merged: 35

Nov 17, 2020

Cancelled because of kubecon (https://kubernetes.slack.com/archives/C0BP8PW9G/p1605631458154700).

Nov 10, 2020

Total active pull requests: 185 (-11 from two weeks ago)

Incoming Completed
Created: 59 Closed: 24
Updated: 129 Merged: 46

29 PRs with lgtm but not approved: https://github.com/kubernetes/kubernetes/pulls?q=is%3Apr+is%3Aopen+label%3Asig%2Fnode++label%3Algtm+-label%3Aapproved

AI: Sergey - can we exclude all PRs that are not sig/node specific (release manager, API review blocked, etc.)?

Nov 3rd, 2020

Today's SIG Node meeting was cancelled due to the election.

October 27th, 2020

Total active pull requests: 192 (-8 from the last meeting)

Incoming Completed
Created: 24 Closed: 16
Updated: 100 Merged: 16

October 20th, 2020

Total active pull requests: 196 (+22 from the last meeting)

Incoming Completed
Created: 28 Closed: 4
Updated: 51 Merged: 2

October 13th, 2020

Total active pull requests: 172 (-2 from the last meeting)

Incoming Completed
Created: 18 Closed: 6
Updated: 52 Merged: 14

October 6th, 2020

Total active pull requests: 173 (-3 from the last meeting)

Incoming Completed
Created: 16 Closed: 11
Updated: 56 Merged: 8

September 29th, 2020

Total active pull requests: 174 (-8 from the last meeting)

Incoming Completed
Created: 8 Closed: 7
Updated: 64 Merged: 9

September 22nd, 2020

Total active pull requests: 179 (-7 from the last meeting)

Incoming Completed
Created: 16 Closed: 10
Updated: 41 Merged: 13

PRs that potentially need to be fished out of rotten: https://github.com/kubernetes/kubernetes/pull/86071 https://github.com/kubernetes/kubernetes/pull/88741

September 15th, 2020

Total active pull requests: 186 (-4 from the last meeting)

Incoming Completed
Created: 24 Closed: 12
Updated: 41 Merged: 16

September 8th, 2020

Total active pull requests: 188 (-22 from the last meeting)

Incoming Completed
Created: 23 Closed: 16
Updated: 50 Merged: 29

potential to fish out of rotten:

September 1, 2020

Total active pull requests: 209 (-33 from the last meeting)

Incoming Completed
Created: 11 Closed: 6
Updated: 65 Merged: 38
  • [@SergeyKanzhelev/@andrewsykim] Timeouts in exec probe

    • Latest PR (@andrewsykim): https://github.com/kubernetes/kubernetes/pull/94115

    • @louyihua (Jan 28, 2018): https://github.com/kubernetes/kubernetes/pull/58925/ (replaces https://github.com/kubernetes/kubernetes/pull/58510)

    • @tnqn (Jun 24, 2020): https://github.com/kubernetes/kubernetes/pull/92465/

    • tedyu (Jan 16, 2020) https://github.com/kubernetes/kubernetes/pull/87281/

      From Alexander Kanevskiy to Everyone:  10:08 AM
      Let's deprecate dockershim :) it is long overdue
      From Me to Everyone:  10:09 AM
      =) containerd has a similar issue
      but I think I agree in principle
      From Alexander Kanevskiy to Everyone:  10:12 AM
      actually…. “runc exec” doesn't have a way to specify a timeout
      so, an OCI compatible runtime will execute something until it returns
      From michael crosby to Everyone:  10:15 AM
      If you want runc exec to timeout, you need to use context.Context with a timeout; then containerd should handle it when that context is canceled
      We use exec.CommandContext for all calls to external binaries
      From Alexander Kanevskiy to Everyone:  10:19 AM
      so containerd will be the one who kills the “runc exec” process…. in theory that is ok, but it might not always be reliable. would it make sense to integrate “timeout” functionality at a lower level (OCI runtime spec?) so it will reliably clean up processes inside the container
      From Me to Everyone:  10:20 AM
      Yes, I think this is the desire. Basically the first question we wanted to answer is whether we need to support timeouts on exec at all. Then - whatever mechanism we have - how to introduce it in a new version without affecting payloads. And finally we can discuss whether to start with Andrew's PR
      From michael crosby to Everyone:  10:24 AM
      I wouldn't think a timeout in the OCI runtime spec would make sense.  The cancel of a context should unroll things correctly, as exec.CommandContext does a SIGKILL
      
      
      [Dawn] maybe a timeout would not lead to container restart/kill? Ideally the timeout value shouldn't be 1 second; it should be a catch-all bigger value.
      
      [Michael Crosby] let's not have an explicit default timeout at all?
      
      [Andrew] is changing the default a breaking change? [Dawn] Yes, definitely.
      When a readiness probe relies on the default of 1 second, extending it might affect user payloads.
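      The exec.CommandContext approach described in the chat can be sketched in Go. This is a minimal illustration, not kubelet code; runWithTimeout is a hypothetical helper name:

      ```go
      package main

      import (
      	"context"
      	"fmt"
      	"os/exec"
      	"time"
      )

      // runWithTimeout runs an external command and relies on
      // exec.CommandContext to SIGKILL the process once the context's
      // deadline expires - the mechanism discussed above for "runc exec".
      func runWithTimeout(timeout time.Duration, name string, args ...string) error {
      	ctx, cancel := context.WithTimeout(context.Background(), timeout)
      	defer cancel()
      	err := exec.CommandContext(ctx, name, args...).Run()
      	if ctx.Err() == context.DeadlineExceeded {
      		return fmt.Errorf("%s timed out after %v", name, timeout)
      	}
      	return err // nil on success, or the command's own failure
      }

      func main() {
      	// "sleep 5" exceeds the 100ms budget and is killed.
      	fmt.Println(runWithTimeout(100*time.Millisecond, "sleep", "5"))
      }
      ```

      The caller never has to track the child PID: cancelling the context is enough to tear the process down, which is why the chat concluded a timeout in the OCI runtime spec is unnecessary.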
      
  • [@bmcfall] Status of exposing hugepages in pod

    • add support for hugepages in downward API #86102 - Closed due to inactivity
    • @kad: as a workaround for hugepages specifically, inside the container it is possible to read /sys/fs/cgroup/hugetlb/*.limit_in_bytes
      • @bmcfall: Thanks @kad - Followed up on slack.
  • [@bmcfall] Code freeze for 1.20?

August 25, 2020

Total active pull requests: 238 (+20 from the last meeting (2 weeks))

Incoming Completed
Created: 35 Closed: 11
Updated: 78 Merged: 4

August 18, 2020

Cancelled due to the conflict with KubeCon.

August 11, 2020

Total active pull requests: 219 (+0 from last meeting)

Incoming Completed
Created: 11 Closed: 3
Updated: 47 Merged: 8

August 4, 2020

July 28, 2020

July 21, 2020

July 14, 2020

  • [@rata/Kinvolk, @jirving] sidecar ordering (KEP)
    • Just a quick update
  • [@vinaykul] In-Place Pod Vertical Scaling - how do we move forward?
    • emptyDir.memory: Abort attempts to set limit < usage reasonable?
    • As discussed in today's meeting - this conversation has a link to the previous version of the KEP, where we had Status.ResourcesAllocated, and its challenges with scheduler functioning vs Spec.ResourcesAllocated (or ResourcesToAllocate for better naming)
  • [@yash97] Sending events in a distributed manner - KEP discussion.
  • [@mauriciovasquezbernal, @alban, @rata] User namespaces support
  • [kad] SIG Node Resource Management Forum: new meeting time? calendar invites?
  • [@bobbypage] Cherrypick for kubelet reporting incorrect status: https://github.com/kubernetes/kubernetes/pull/93041

July 07, 2020

  • [@vinaykul] In-Place Pod Vertical Scaling for v1.19 - Update & next step
    • Late-stage API review and questions about KEP - not making 1.19
    • @thockin is revisiting completed API review decisions:
      • Is ResourcesAllocated really needed? Why not do local checkpointing?
      • Subresource to set resources & resourcesAllocated
      • How do we handle runtimes that may sometimes require container restart in order to resize? How does that work with restartPolicy=Never
      • Should Restart be the default resize policy?
      • Should RuntimeClass be allowed to disable in-place update so users get a synchronous error? Sounds reasonable to me if the runtime doesn't support this - thoughts?
      • Are RestartOnGrow, RestartOnShrink resize policies a good idea? (Doesn't have to be added today, but quite easy to add if it feels useful)
  • [mrunalp, nalin] /dev/fuse in pods https://github.com/kubernetes/kubernetes/pull/79925
    • Nalin and Mrunal to come up with use cases and design
  • [mrunalp, zvonkok] Rlimits support - https://github.com/kubernetes/kubernetes/issues/3595
  • [@rata/Kinvolk, @jirving] sidecar ordering (KEP)
  • [@AlexeyPerevalov @swatisehgal] Topology Aware Scheduling
    • Topology Exporter Daemon in kubernetes-sig
  • [harche/mrunalp] Fedora images for node testing
  • [@yash97] Informal KEP: sending events from the kubelet directly to the user in a distributed way, instead of receiving them from the API server.
  • [alejandrox1] https://github.com/kubernetes/kubernetes/pull/80917 needs review from SIG Node approver

June 30, 2020

  • [Balaji] Add support for disabling /logs endpoint
  • [@vinaykul] In-Place Pod Vertical Scaling for v1.19 - Update & next step
    • Feature code complete with unit tests, reviewed by @dashpole.
    • Basic e2e test framework done, adding test-cases promised for alpha.
    • Need reviews @liggitt, @thockin, @ahg-g, and Derek, Dawn, sig-testing.
    • Identify process for informing CRI changes to runtime folks.
  • [@mukesh-dua/@guptavishal7982] CPU Reservation for Infrastructure Services Deployed in Kubernetes
    • Would like to introduce the enhancement and its requirements.
  • [@rata/Kinvolk, @alban/Kinvolk, @jirving], sidecar ordering (KEP)
    • PR to update to provisional state and depend on kubelet node shutdown in place
    • Reviewers/approvers?
      • SergeyKanzhelev
    • PR coming ~1 or 2 weeks (hopefully sooner) with design “callout” discussed in the previous meeting, alternatives and history of all decisions made in the past (WIP)
      • Plan to update summary, motivation, etc. too.
  • [@hasheddan] seccomp-operator: https://github.com/kubernetes/org/issues/1873

June 23, 2020

June 16, 2020

June 9, 2020

  • @rata, @alban sidecar ordering (KEP)
  • @alban User namespaces [design-proposal] [issue] [PR]
    • I am planning to work on this in the near future
    • There were some challenges on upgrades and some areas under-defined in the KEP
    • Set of folks (Kinvolk and other interested parties) can maybe update the KEP? It predates ephemeral containers and PID namespace sharing; it may need to clarify what it means for different container types (e.g. ephemeral containers or other cases)
    • see vikas pr from 2018 that got far:
  • [kmala] https://github.com/kubernetes/kubernetes/pull/89667
  • [vinaykul] In-Place Pod Vertical Scaling for v1.19 - status update
    • Core implementation initial review done - thanks @dashpole!
    • First-cut implementation ready.
    • Resource-quota, limit-ranger, e2e tests are next.
  • [@renaudwastaken] Disabling GPU metrics provided by the Kubelet/cadvisor
    • Discussion about enabling this by default in the future (note: k8s deprecation policy is 1+ year)
    • TLDR: Consensus on deprecating the GPU metrics, as a sig we have agreed on a way to collect metrics from out of tree. Need to figure out deprecation.
    • Dawn: Make the announcement through release notes and other channel
    • Derek: Can we tie this with cadvisor vendoring in some way?
    • David: We should tie the deprecation of GPU metrics with the summary API
    • Dawn: Deprecation of the summary API started 3+ years ago (even before the CRI), could this be a baby step for the cleanup?
    • Note: Do not add this as another CLI flag but only a config flag
  • [@renaudwastaken] Cherry pick metrics bug fix to 1.16 and 1.17
  • [vpickard] sig-node testing enhancements update
    • Great efforts from team to understand and fix failing test
    • Created github project board to help manage issues/PR
      • experimenting with board, not public yet
    • 10 PRs complete/merged - doc updates, COS image fixes, test config, fail on missing image
    • 13 PRs in progress - more doc updates, more image cleanup, benchmark test
    • benchmark tests failing - OOM - number of pods lowered from 105 to 90
    • Considering adding additional images once tests are stable with current images (COS and ubuntu)
      • roughly one-month failing-test policy for supported images
      • RHEL uploading results separately
  • [vpickard] sig-node-resource management forum
    • Is it possible to move this mtg to 8 am PDT so folks from US west coast can attend?
    • https://github.com/kubernetes/enhancements/pull/1121 approved, will likely not be implemented until 1.20
    • https://github.com/kubernetes/enhancements/pull/1752 pod level alignment for resource
      • Alex to take a final review of latest comment
    • no topics for Thursday, June 11 (Holiday in Poland, topic moved to 6/18/2020). Will cancel mtg tomorrow if no topic
    • Next scheduled mtg Thursday, June 18.
      • 5G deployment scenarios for pod level resource alignment

June 2, 2020

Meeting is canceled.

May 26, 2020

  • @rata, @alban sidecar ordering (KEP)
    • terminationGracePeriodSeconds
    • multi-level dependency sidecar
    • Retro from earlier in the year (is there a video recording?)
    • Concerns:
      • Moving to alpha without understanding termination sequence
      • Losing data during a node shutdown sequence (Kubernetes running in a train)
      • Device access, CPU, Memory policy
      • More complexity: debugContainer
  • [tedyu] https://github.com/kubernetes/kubernetes/pull/91211 Remove excess log https://github.com/kubernetes/kubernetes/issues/90999 Give static pod deletion grace period https://github.com/kubernetes/kubernetes/pull/91453
  • [vinaykul] In-Place Pod Vertical Scaling for v1.19 - update
    • Kubelet-CRI KEP initial code ready for review.
    • First-cut implementation in progress, ~ETA end of 1st week of June.
    • API-only code changes with review feedback: 5126b9e1
  • [mrunalp] CRI errors - https://github.com/kubernetes/kubernetes/pull/91273
    • Ready for review
  • [vpickard] sig-node testing enhancements update (doc)
    • rescheduled mtg from Monday, May 25th to Tuesday May 26th at 11 am EDT due to Monday being US holiday
    • test spreadsheet updated with all sig-node tests. Signup if interested! Few slots open
    • Priority - merge blocking, release blocking, release informing
    • conformance-node-rhel test - no result
  • [vpickard, bart0sh, mhb]
    • [bart0sh] PR to update cos-stable images https://github.com/kubernetes/test-infra/pull/17617 updated cos image
    • https://github.com/kubernetes/kubernetes/issues/91292 sig-node release-blocking failure on 5/20/2020. COS images had been updated, intermittent failures. Learned that COS image testing was broken (not being tested, silently failing) for the last ~4 weeks. Debugged, replaced broken COS image with newer one. Thanks @bart0sh, @mhb for great debugging!
      • Victor to send COS image PRs to Ning Liao and Roy Yang
        • COS image policy may need updating per Dawn (I don't understand the image policy; maybe Ning and Roy can share?)
      • Create email list and update jobs with email alias to alert folks that are monitoring jobs to avoid fire-drills when release-blocking/merge-blocking tests fail

May 19, 2020

  • [Javier Diaz-Montes] Discuss new feature to set FQDN as hostname of pods, issue #91036, initial draft for PR in #91035. KEP PR in kubernetes/enhancements/1792
  • [vinaykul] In-Place Pod Vertical Scaling v1.19 update & CRI-API design question
    • Updates:
      • API code changes initial review done by Tim Hockin, David Ashpole.
      • First-cut implementation ~3 weeks to PR-ready.
        • David Ashpole is primary reviewer for Kubelet & CRI changes.
    • Concern:
      • CRI clients may return partial or no CPU/memory limit info in the ContainerStatus response. What's the best way to handle this?
        • Option 1: Assume zero means no information returned?
        • Option 2: Add a flag that CRI client can set?
        • Dawn: Prefer Option 1, since ContainerStatus is set with the value read directly from the host, and 0 is an invalid value for the kernel.
  • [vpickard] sig-node testing enhancements update (doc)
    • rescheduled mtg from Monday, May 25th to Tuesday May 26th at 11 am EDT due to Monday being US holiday
    • test spreadsheet updated with all sig-node tests. Signup if interested!
    • Priority - merge blocking, release blocking, release informing
    • [bart0sh] PR to update cos-stable images - needs /approve https://github.com/kubernetes/test-infra/pull/17617
      • follow-up PR will move to latest LTS cos image
    • conformance-node-rhel test - no result
  • [cezaryzukowski, cynepco3hahue, bg.chun, krzwiatrzyk] How can we move forward with the Memory Manager KEP? Could we get any feedback (approval, change request, partial merge of the KEP (prologue sections, Summary through Story 2: Databases), etc.)? Today is the Enhancement Freeze deadline.
  • [vpickard] sig-node resource management forum (doc)
    • Reviewed Memory Manager presentation
    • Some concern about new kubelet flags and impact to user experience (more flags!)
    • For multi-numa hint generation, some discussion about preferred flag, great explanation from @klueska here
    • [bg.chun/review request] Update Topology Manager to support pod-level resource alignment
  • [@rata, @alban] sidecar ordering (KEP)

May 12, 2020

  • [kad] Our experience of advanced resource management (CPU, Memory, etc.) 10 mins demo + ~20 mins for other slides.
  • [cezaryzukowski, cynepco3hahue, bg.chun, krzwiatrzyk] Memory Manager KEP:
  • [joe conway (Crunchy), mrunalp (Red Hat)] - Challenges with running Postgres on Kubernetes https://github.com/kubernetes/kubernetes/issues/90973
  • [vinaykul] In-Place Pod Vertical Scaling v1.19 update & CRI-API design question
    • API changes reviewed by Tim Hockin - one more change coming.
    • First-cut implementation ~3 weeks to PR-ready, identify primary reviewer.
    • Concern: CRI clients may return partial or no CPU/memory limit info in the ContainerStatus response. What's the best way to handle this?
      • Option 1: Assume zero means no information returned?
      • Option 2: Add a flag that CRI client can set?
  • [krzwiatrzyk, bgchun] Request for review - Topology Manager pod-level-single-numa-node policy KEPs update PR
  • [vpickard,jaypipes] sig-node test kickoff update
    • great attendance
    • thanks to @dims for sharing how to navigate around and debug! Will incorporate into documentation
    • sig-node test document
    • sig-node test spreadsheet
    • mtg recording
    • Next: complete spreadsheet with remainder of tests, some cleanup of columns. Volunteers sign up for tests to investigate.
    • Weekly meeting Monday at 1 pm EDT until we get this under control
  • [vpickard] sig-node working group for Topology Aware Scheduling
    • #topology-aware-scheduling channel for discussion
    • meeting logistics - propose Thursday 9 am EDT, will survey if needed
    • meetings will be recorded for those not able to attend
    • Send invite to sig-node group with mtg link
  • seccomp-operator: proposal to move to k-sig https://github.com/saschagrunert/seccomp-operator/blob/master/RFC.md

May 5, 2020

April 28, 2020

Apr 21, 2020

  • [derekwaynecarr] SIG Health Check
    • retro: release blocking test was RED for 10d [4/6-4/16]
    • Q: Are we able to sustain the number of test suites we run? Is pruning required?
    • Q: Carrot and sticks to improve sustainability?
    • Q: Impact of covid situation on members? Holiday week?
    • Q: Volunteers for mentorship?
    • Notes:
      • [dawn] Communication issues around internal changes, and growing new engineers to backfill.
      • [dawn] Stale old image caused the network blocking. Viewed this as an opportunity to grow new members to backfill that role and had a hand-off problem.
      • [mikebrown] GCP issue hit containerd community pre-submits.
      • [dawn] sig built the first-level conformance test; node e2e expanded to cluster level, but we still want to hold more node e2e as release blocking, since it's harder to grow to cluster level and keep deterministic, particularly for tests of different node profiles.
      • [victor] testgrid alert email on test failure is not working: he was not getting notifications for his related tests.
      • [derek] ask for volunteers to audit state of test-infrastructure and provide some recommendations back to sig, jay/victor to coordinate a small sub-group in sig over mailing list to determine next steps.
        • [jaypipes, saran balaji] volunteer from AWS
        • [victor] volunteer from red hat
        • [morgan] volunteer from ibm
        • [ning liao, david porter] volunteer from google
        • [bart0sh] volunteer from intel
        • [daniel mangum] ci-signal lead for 1.19, aaron is working on transition to community owned infrastructure.
  • [SaranBalaji90] Add node-local plugin support for pod admission handler https://github.com/kubernetes/kubernetes/pull/87273 and disabling logging handler in kubelet https://github.com/kubernetes/enhancements/pull/1461
  • [howardjohn] What is needed to get moving on Sidecar Containers?

Apr 14, 2020

Apr 7, 2020

  • [tedyu] Protect log rotation against concurrent symlink removal https://github.com/kubernetes/kubernetes/pull/89160 Prototype for checking container status before symlink removal is attached to the PR.
  • [k.wiatrzyk (krzwiatrzyk), c.zukowski, bg.chun] Enhancements proposals for 5G packet processing in Kubernetes.
    • Towards high-performance 5G packet processing, slide
    • Overall Proposals in detail, docs

March 31, 2020

March 24, 2020

March 17, 2020

  • [vpickard] Please upload latest video recording
  • [liorokman] Enable defining Pod-level resource limit

March 10, 2020

  • [roycaihw] When a kubelet goroutine (e.g. a pod worker) panicked, kubelet should crash and restart, instead of keeping running with a non-functioning pod worker
  • [liorokman] Enable defining Pod-level resource limit
    • https://github.com/kubernetes/enhancements/pull/1592
    • https://github.com/kubernetes/kubernetes/pull/88899
      • Figure out how this impacts the Init container
      • Verify if resources are released for reuse between sibling cgroups in the pod
      • What is the expected effect on hugetlb?
      • What is the interaction with the ResourceOverhead feature?
      • How would this work in NUMA environments?
      • Does it make more sense to not tie this to the QoS functionality, or make this available for BestEffort pods on the pod level?
      • Does this also apply to ephemeral storage (tmpfs) or possibly pids (if we ever made that container level rather than a global default)
  • [vpickard] Topology Manager documentation PRs need review/merge to finish Beta
  • [mattjmcnaughton/dims] KEP- Build Kubelet without Docker
  • [smarterclayton] Looking at pod end to end latency in the kubelet with an eye on the status loop
    • The status loop is very simple and reliable (good!) but fairly slow
    • No KEP yet, gathered some feedback from ashpole and investigating
      • In an e2e run, we take about ~800s overall from the time we detect a status change to the time we successfully write it to the apiserver, summed across all tests
      • With some simple changes, I was able to get that down to less than 200s (which means certain types of pod operations complete a lot faster)
      • Found a few correctness bugs already in our core sync loop
        • Internal kubelet state not in sync with apiserver
        • We check the wrong cache for certain pods
        • We are depending on a live lookup to create the patch, but there is no guarantee that is up to date either
        • Kubelet is sending new data when feature flags are off (bad!)
        • Some internal safety checks in kubelet are just log messages, but potentially should be crashes (we ignored them as logs)
    • Going to move to a KEP for some improvements soon
      • “Improve pod status reporting latency and prioritize important transitions”

March 3, 2020

February 25,2020

February 18,2020

February 4, 2020

January 28, 2020

January 21, 2020

  • [@SaranBalaji90] Get opinions from others on #87252. Adding support for disabling /logs endpoint in kubelet
    • agreement that PR should only add ComponentConfig fields (no CLI args for kubelet)
    • general approval of exposing finer-grained endpoint handlers as long as existing behaviour does not break (w.r.t enableDebugHandlers config)
  • [vinaykul] In-Place Vertical Scaling follow-up
    • Moved KEP to sig-node directory
    • Awaiting review of API changes, test plan, GA criteria
    • Targeting to merge KEPs as implementable before Jan 28 deadline for 1.18
  • [@SaranBalaji90] Discuss KEP #1461, implementing an “out-of-tree” plugin for the pod admit handler. The benefits: no need to update k8s source code whenever there is a new feature in Docker/kernel versions, and preventing pods that tolerate any taints from running on custom worker nodes.

January 14, 2020

  • Update to RuntimeClass / pull annotations KEP (#1448) - [@kkmsft, @patricklang]
    • Mike & Lantao to review since they had feedback and suggestions on the last proposal - if ok, will proceed. Dawn will /approve if they're ok
  • Container namespace targeting #84731 for 1.18? [@verb]
    • Three concerns were raised:
      • How will zombies from the ephemeral container be cleaned up, if at all? For docker will they be configured automatically?
      • What's the behavior for Init Containers?
      • Is there any coordination to do with the user namespace work
    • @verb will send a PR updating the ShareProcessNamespace KEP to address these concerns.
  • [vinaykul] In-Place Vertical Scaling KEP - preparing to merge as implementable
    • Final changes from API review
      • New admission controller instead of subresource
      • Using list of named subresource
    • Test-plan, graduation criteria section added
    • Should we move KEP to sig-node directory?
      • Yes

January 7, 2020

  • [bg.chun] Asking a review for PR #84154 of container isolation of hugepage
    • This change enables the kubelet to set the hugepage limit on the container-level cgroup sandbox when it creates a container. It is the last piece of container isolation of hugepages on the Kubernetes side, except dockershim (see next steps for details).
    • Writing the e2e test on top of this change is almost done. My coworker will open a PR to add a test suite for container isolation soon, but it has a dependency on PR #84154, so it will be tagged WIP.
    • next step
      • On the Kubernetes side, the next steps are adding the e2e test and updating the online documentation.
      • On the container runtime side: 1) changes for CRI-O are merged; 2) changes for containerd are awaiting merge; 3) Docker requires more effort: 1. add a hugepages field to the Docker configuration, 2. update the Docker vendoring in Kubernetes to pick up the new field, 3. update dockershim to set the hugepage limit in the Docker configuration.
  • [vinaykul, quinton-hoole] In-Place Vertical Scaling KEP update
    • @thockin approved!!
    • Next steps:
      • Update PR 1342 with API review change
      • CRI changes review & implementation plan
      • Formal SIG-Node approval, merge as implementable, and start writing code