community/sig-node/archive/meeting-notes-2019.md

SIG Node Meeting Notes

Future

  • Virtual Kubelet (@rbitia)
  • [Jess Frazelle / Kent Rancourt]: Proposal for kubelet feature to freeze pods placed in a hypothetical “frozen” state by the replication controller in response to a scale-down event. Enable pods to be thawed by socket activation.
  • The e2e test “regular resource usage tracking resource tracking for 100 pods per node” has been failing (flaking?) quite consistently in release-blocking dashboards. Should we block the 1.14 release on this? If not, could you help resolve it? Issue #75039 (@mariantalla)
  • Issue to discuss: Hardware topology awareness at node level (including NUMA)
  • OCI Hooks PreStart, PostStop - @alban
  • Hugepages in 1.18 (notes)

December 24 & 31, 2019

  • Happy Holidays! No meeting 🎉🎉

December 17, 2019

December 10, 2019

  • [kkmsft,patricklang] Review alternate proposal to #84486 discussed earlier - annotations in ImageManager with certain fields copied from PodSandboxConfig (runtimehandler). Link
    • ask: can Lantao review since this was based on in-person discussion at Kubecon?
    • Will update existing KEP and change API sections to follow this approach.
  • [kad, RenaudWasTaken] WG-Resource-Management future
    • kad and RenaudWasTaken to come up with written proposal/plan/goals for WG
  • [wojtek-t] Immutable Secrets/ConfigMaps (KEP)
  • [RainbowMango] Request for review and a proposal:
  • [vpickard] Questions about doing E2E tests on NUMA nodes
  • [klueska] Best method to migrate checkpointed state across kubelet upgrade
    • Needed to remove for this week since I can't make it to the meeting anymore
    • Please see this and comment when you have time
  • [jaypipes] Currently working on repro'ing this bug: https://github.com/kubernetes/kubernetes/issues/79159
    • I can reproduce the bug successfully but am a bit confused as to why I never see Allocatable.cpu for the node get decremented. Is this number not intended to be decremented ever? i.e. does the scheduler keep the current number of static cores allocatable to each node in memory?

December 03, 2019

November 12, 2019

November 5, 2019

October 29, 2019

October 22, 2019

October 15, 2019

October 8

October 1

September 24

September 17

Updates:

*   Reviewed with sig-scheduling last Thu, Abdullah and Bobby are fine with it.
*   Updated sig-autoscaling today, requested approval - they will look this week.
*   No word from sig-arch yet for API review request.

September 10

Updates:

*   Restate Kubelet restart resize handling to minimum guarantees.
*   Clarify scheduler role - requested sig-scheduling review

September 03

Key changes:

*   Call out details of Kubelet-APIServer interaction in handling Pod resize.
    *   Scheduling an API review with SIG API.
*   Move eviction of lower priority Pods to Future work section.
*   Remove Pod departure as a trigger for resize retries.
*   Call out details of Static CPU manager policy resize handling.

August 27th

  • [vinaykul] Kick-off review of updated Vertical Scaling KEP. The design has been updated as per consensus achieved from discussions over the past few weeks.

Key changes in this commit:

*   PodSpec holds ResourcesAllocated for the Pod's Containers on the Node
*   resourceallocation subresource for Kubelet to set/update ResourcesAllocated
*   Kubelet restart fault tolerance section from [KEP discussion](https://github.com/kubernetes/enhancements/pull/686#discussion_r311136176) 
*   Remove Resizing PodCondition 

August 20th

August 13th

August 6th

July 30

  • [jaypipes] Request a status update on node-level userspace remapping work from either @derekwaynecarr or @vikaschoudhary16

    • Is this stuck waiting on reviews? Stuck on implementation or design debate? Is there anything that EKS team members can contribute to speed this along?
    • There are resource constraints on the RH side, and the work is de-prioritized. Jay will reach out to the customer who is asking about this feature to determine whether there is an alternate solution to the problem and whether user ID remapping would solve the problem fully in a unique way for the customer. The EKS team may contribute resources to reignite Vikas' original PR if the customer feels user ID remapping is the most viable solution (and the EKS team feels the proposed implementation is viable as well).
  • [derekwaynecarr] Need to fix up our mailing list settings

  • [derekwaynecarr] 1.16 Feature Freeze items

  • [klueska] Blocking TopologyManager PRs:

    (All other PRs currently depend on these and require a rebase once they are merged)

    (Once these are merged, all other changes should be isolated to cm)

  • [vinaykul] Vertical Scaling KEP

  • [verb] #80645: Assigning sig-node as owner of pods integration tests

    • wants approval from Dawn & Derek
  • [tallclair] seccomp to GA

July 23

July 16

July 9

July 2

Meeting canceled

June 25

June 18

June 11

Cancelled.

June 4

  • [patricklang] Fixing issue found with podResources enabled by default - Windows kubelet won't start: "Failed to create listener for podResources endpoint"
    • 2 PRs open with slightly different approaches #78670 / #78671 - feedback? need to merge today and will need approver
    • Device plugin isn't tested on Windows either - @dashpole to look into how something can be disabled on a per-OS basis. Meeting consensus was to look into disabling the device plugin manager, which is not used or tested on Windows
  • [derekwaynecarr] Kubelet in userns
  • PodOverhead
    • API review completed, PR approved but not merged
    • Moving to 1.16
  • RuntimeClass Scheduler
    • API implementation approved, PRs not in
    • Moving to 1.16
  • TerminationPeriod: https://github.com/kubernetes/kubernetes/issues/77873
    • Yuju: asked to understand use case more, may have other options to do it
    • Derek - the Pod API already captures the user intent here, and it not being honored at shutdown could catch them by surprise

May 28

May 21

  • Proposed to cancel meeting due to KubeCon EU

May 14

  • Follow-up review of the In-Place Vertical Scaling KEP after last week's discussion.
    • Identify any items that may need to be addressed to get KEP approved.
    • @vinaykul, @derekwaynecarr @dashpole @dawnchen
  • Do we want to have dedicated meetings for issue triage? (@derekwaynecarr)

May 7

  • Pod Overhead: requests vs limits & QoS - see this comment
    • shouldn't affect QoS
    • BestEffort support should be a policy decision, independent of overhead implementation (e.g. for Kata containers)
    • Overhead should still be considered with BestEffort pods (scheduling, running)
  • Review latest updates to In-Place Vertical Scaling KEP @vinaykul, @DerekCarr @dashpole

April 30

Apr 23

Apr 16

  • Removing cloud provider info from the /spec endpoint - #76291
  • Bringing CRI-ContainerD to Windows (doc) (@patricklang)
    • Background, use cases and some key decision points outlined
    • Ask: Close on list of reviewers this week so this can move to a KEP
    • Ask: For the proposals that require CRI changes, designate team & process to move forward. KEP or ?
  • Who owns the policy around revendoring the Docker API (see PR)? Is it SIG-Node? (@patricklang) This broke Windows tests across the board from 4/12 to 4/14 when we set DOCKER_API_VERSION to override it, but could have been prevented with testing (PR with /test trigger in review). Can we get someone to help push this PR through so /test works and we can prevent this next time?
    • What about pinning the Docker API version so revendoring doesn't change it?
  • FYI, KEP for Ephemeral Containers has moved to implementable. Please chime in with any feedback.

Apr 09

  • Meeting canceled due to lack of agenda topics
  • Please reach out on slack if blocked

Apr 02

Mar 26

Mar 19

Mar 12th

Cancelled

Mar 5th

  • Add maxInitialFailureCount to health probes (@matthyx)
    • target for v1.15
    • Next Step: find reviewers from sig-node for this KEP
  • NUMA Manager: https://github.com/kubernetes/enhancements/issues/693#issuecomment-466728227
  • RuntimeClass beta update [@tallclair] - Summary of changes requested in API review & implications for upgrades.
    • move the RuntimeClass API from a CRD to a core API, since CRD is not fully ready yet; previously existing RuntimeClass objects will need to be re-created;
    • rename runtime_handler to handler, and make the handler field required; this only affects RuntimeClass objects; the RuntimeClassName field of PodSpec can still be left empty;
    • node e2e tests will be added for beta; no conformance tests will be added for now;
  • Do we currently track the review effort from the sig-node side for the sig-windows PRs for v1.14 (@Patrick)?
    • Currently, we don't. The process for v1.14 is very fluid. Please ping sig-node if help is needed.
    • We will improve this for v1.15
  • Runc memory issue: https://github.com/opencontainers/runc/issues/1980
    • Fix under review: https://github.com/opencontainers/runc/pull/1984
    • Basically, instead of copying the runc binary into memory or /tmp, it creates a temporary read-only bind mount into runc state directory to avoid extra memory usage.
    • The fix is still under review.

Feb 26th

  • Separating a CRI for docker from Kubelet (@dims)
    • Previous SIG Node decision: container runtime is part of the os distro and node image. In the future, the container runtime in the test infra is provided and supported by the os distro and node image vendors. (@DawnChen)
    • There is no rush to make a decision on this, since many production deployments still depend on DockerShim even though both cri-containerd and cri-o are production-ready. But it is worth talking about the current status and the goal. (@DawnChen)
    • Sig-windows wants to move to cri-containerd as fast as possible. Some features only exist in cri-containerd and are hard to add to dockershim, e.g. some RuntimeClass features. (@Patrick)
    • It is not the time yet to make the decision, we probably need another year. (@Derek)
    • We have already started adding features not supported by dockershim, and there will be more, e.g. secure pod, runtime class. If people at least agree that we are going to eventually deprecate dockershim, we can continue adding these kinds of features. (@yujuhong)
    • Looking at about a year deprecation clock, once blockers are satisfied.
    • We can start thinking about and documenting dockershim deprecation criteria. That doesn't need to wait until the clock timer starts. (@yujuhong)
    • Would really like to see this in a KEP showing how it lines up with graduating RuntimeClass, getting CRI to 1.0, and giving time for other runtimes to catch up. That should be a well-communicated part of the sig-node roadmap (@Patrick)
    • Who will be the advocate for when we know this is the right thing to do?
    • Who can drive the KEP? @resouer volunteered to write a KEP for this. (@DawnChen)
    • Next Step: we will come back to this topic one month after the KEP is sent to the community. (@DawnChen)
  • Follow up discussion on pod resource overhead issue/RFC doc (@egernst)
    • overview presentation
    • Introduce a pod overhead to the PodSpec, which will be taken into account along with the container requests. This will be populated automatically by an admission controller, which uses the values from the RuntimeClass CRD. The goal is to land this in Kubernetes 1.15 as a separate feature.
    • Why not just add pod level resource limit/requests? (@Derek)
    • Next step: continue the discussion on the KEP
  • Warning / Announcement: runc memory spike issue after the CVE fix; pods with a low memory limit (<10 MB) may no longer run. See https://github.com/opencontainers/runc/issues/1980 (@DawnChen)
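
The pod overhead proposal discussed above (overhead "taken into account along with the container requests") can be illustrated with a minimal Go sketch. This is not the Kubernetes implementation: the `Resources` and `Pod` types and the `EffectiveRequest` function are hypothetical names invented for illustration, and the sketch ignores init containers, which real scheduling accounting also has to consider.

```go
package main

import "fmt"

// Resources is a hypothetical, simplified stand-in for a resource list.
// CPU is in millicores and memory is in bytes.
type Resources struct {
	MilliCPU    int64
	MemoryBytes int64
}

// Pod is a hypothetical, simplified pod: per-container requests plus a
// pod-level overhead that an admission controller would populate from
// the pod's RuntimeClass (assumption based on the notes above).
type Pod struct {
	ContainerRequests []Resources
	Overhead          Resources
}

// EffectiveRequest returns the sum of all container requests plus the
// pod-level overhead.
func EffectiveRequest(p Pod) Resources {
	total := p.Overhead
	for _, r := range p.ContainerRequests {
		total.MilliCPU += r.MilliCPU
		total.MemoryBytes += r.MemoryBytes
	}
	return total
}

func main() {
	pod := Pod{
		ContainerRequests: []Resources{
			{MilliCPU: 500, MemoryBytes: 256 << 20},
			{MilliCPU: 250, MemoryBytes: 128 << 20},
		},
		// Illustrative overhead values for a sandboxed runtime.
		Overhead: Resources{MilliCPU: 100, MemoryBytes: 64 << 20},
	}
	total := EffectiveRequest(pod)
	fmt.Printf("cpu=%dm memory=%dMiB\n", total.MilliCPU, total.MemoryBytes>>20)
}
```

With these made-up numbers the pod is accounted for as 850m CPU and 448 MiB of memory, rather than the 750m and 384 MiB requested by its containers alone.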

Feb 19th

  • Promoting cloud provider node labels to GA (@andrewsykim)
    • Approved by SIG Node
  • Deprecate “containerized” kubelet - #74148 (@dims)
    • Yes from SIG Node
    • This only changes how the Kubelet is run for HyperKube; it does not deprecate HyperKube.
    • We should announce this to kubernetes-dev@ and at the Kubernetes Community meeting

Feb 12th

Feb 5th

Jan 29

Jan 22

Jan 15

  • UserNS remapping: updated proposal, implementation PR and Demo (vikasc)
  • PLEG relist has a race with event channel: #72482 WIP PR: #72709 (@resouer)
  • LTS discussion: https://github.com/kubernetes/community/pull/2911 (tpepper)
    • Plan to formalize the workgroup
    • SIG Node is one of the stakeholders
  • v1.14 release team checkin (spiffxp)
    • If possible, can we run through this in the first 30 mins? Schedule conflict afterwards
    • I'm here in lieu of Claire Laurence, v1.14 Enhancements Lead
    • Enhancement Freeze Jan 29
    • All kubernetes/enhancements issues targeted for v1.14 must have a KEP, even if they didn't before
    • KEP graduation criteria should include a checklist of requirements for alpha/beta/stable, including test plan and upgrade/downgrade plan
    • Review of kubernetes/enhancements issues targeted for v1.13, what needs to be moved to v1.14, what can be closed?
      • Dawn: Done. Re-targeted some to v1.14, and closed others.
    • Review of kubernetes/enhancements issues targeted for v1.14, are these accurate?
      • Will file the rest of features.
  • Graduate HugePages to GA, any objections? (@derekwaynecarr)
  • ReplicaSet controller continuously creating pods failing due to SysctlForbidden (@Suraj) #kubernetes/72593
    • ReplicaSet controller should know about the pod failure
    • Controllers should be smarter to handle these errors and backoff
    • If a user is using sysctl, then it is assumed they are breaking the container and pod boundaries that Kubernetes endorses
    • Ideal to solve it at the controller level rather than at the kubelet level
  • Initial review of PR https://github.com/kubernetes/enhancements/pull/686 (@vinaykul)

Jan 08