community/sig-instrumentation/archive/meeting-notes-2021.md

Triage (2021-12-30)

  • CANCELED FOR YEAR END SHUTDOWN

Agenda (2021-12-23)

  • CANCELED FOR YEAR END SHUTDOWN

Triage (2021-12-16)

Cancelled due to conflict with Contributor Celebration

[NEXT] Agenda (2021-12-09)

Agenda:

  • Announcements and AI follow-up
  • [Leads] Selecting a new TL
    • Action: write an email about Frederic's departure
  • [ehashman] 1.24 KEP discussion
  • [shuaichen] metrics.k8s.io SLO (performance) and pagination
    • Ping ehashman offline for help with this
  • [fromani] (can be postponed, 1.24 or beyond) klog: towards per-flow verbosity
    • Looking for previous history/attempts (if any) and design yay/nay
    • Probably deserves a full KEP, will write depending on the above bullet point
      • POC/usecase description here (caveat: security implications not addressed)

[NEXT] Triage (2021-12-02)

Attendees:

  • ehashman
  • Damien Grisonnet
  • David Ashpole
  • Haoyu Sun
  • Jan Fajerski
  • Kevin Wiesmuller
  • MZ
  • Shuai
  • yongfeng du

Agenda (2021-11-25)

Agenda:

  • CANCELED FOR US THANKSGIVING
    • Leads won't be available

Triage (2021-11-18)

Attendees:

  • Ehashman
  • Logicalhan
  • Dgrisonnet
  • Ben Luddy

https://github.com/prometheus-operator/kube-prometheus/pull/1499

Agenda (2021-11-11)

Agenda:

Triage (2021-11-04)

Agenda (2021-10-28)

Agenda:

Triage (2021-10-21)

Cancelled as all four leads were unavailable to start the meeting.

Agenda (2021-10-14)

Agenda:

  • CANCELED DUE TO KUBECON NA

Triage (2021-10-07)

Attendees:

Agenda (2021-09-30)

Agenda:

  • CANCELED DUE TO NO AGENDA

Triage (2021-09-23)

Attendees:

  • Ehashman
  • Dashpole
  • CatherineF
  • Kefan Yang
  • dgrissonet

Agenda:

Agenda (2021-09-16)

Agenda:

  • CANCELED DUE TO NO AGENDA

Triage (2021-09-09)

Agenda (2021-09-02)

Attendees:

  • Dashpole
  • Brancz
  • Ehashman
  • Erain
  • Catherine Fang
  • Yashika Badaya

Agenda:

  • [logicalhan] Revisit stability classes
  • [ehashman] KEP review for 1.23 *

Agenda (2021-08-19)

Cancelled?

Agenda (2021-08-05)

Attendees:

  • kakkoyun

Agenda:

  • [deads2k] Default metrics cardinality
    • https://github.com/kubernetes/kubernetes/issues/104008 Cardinality regression in 1.22
    • https://github.com/kubernetes/kubernetes/pull/102523
    • [ehashman] In 1.22, a metric was added that accidentally included a namespace dimension. This caused a cardinality explosion that wasn't detected until Red Hat performed upgrade testing in downstream OpenShift: running the e2e tests showed large memory regressions for the Prometheus instances, causing nodes to go not ready.
    • How can we prevent this in the future rather than reacting to this many months later in downstream integration testing?
    • [aojea] SIG Scalability currently isn't gathering all the metrics, which means we can't see a trend in the overall number of metrics.
    • There is a hook in the scalability perf framework that we could potentially use.
    • If you don't run the correctness tests, you don't generate the namespaces, so we wouldn't have caught it. So we need to ensure we run full E2Es/conformance.
    • [aojea] Other CI/e2e jobs don't run a Prometheus, so scalability tests will be the easiest as they have one.
    • [deads2k] Rather than a scalability suite, we could add a [Late] annotation, like in OpenShift, that runs the tests last in the Ginkgo e2e suite; they could then hit a metrics endpoint and block a PR if it causes a large regression.
    • [ehashman] Why not both? Start with scalability tests to get an idea of baseline numbers, perhaps adding e2e blocking tests later once we have the machinery in place.
    • Action (ehashman): Investigate adding total metric counts in scalability tests with existing Prom. File an issue describing what we want in the perf-tests repo.
    • Action (coffeepac): Look into adding metrics-grabber tests to e2e suite.
  • [serathius] klog KEP
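
As an illustration of the "total metric counts" idea from the cardinality discussion above, here is a minimal sketch of how series counts from a metrics endpoint could be compared against a baseline. It is not the tooling the SIG built: the function names, the `max_ratio` and `min_series` thresholds, and the choice to parse the Prometheus text format by hand are all assumptions made for this example.

```python
from collections import Counter


def count_series(exposition_text):
    """Count samples (series) per metric name in Prometheus text-format output.

    Histogram and summary suffixes (_bucket, _sum, _count) are counted as
    separate names, which is good enough for spotting growth.
    """
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or whitespace (value).
        name = line.split("{", 1)[0].split()[0]
        counts[name] += 1
    return counts


def find_regressions(baseline, current, max_ratio=2.0, min_series=10):
    """Return {name: (before, now)} for metrics whose series count grew
    by more than max_ratio compared to the baseline.

    Both thresholds are hypothetical defaults, not agreed-upon values.
    """
    flagged = {}
    for name, now in current.items():
        before = baseline.get(name, 0)
        if now >= min_series and now > max_ratio * max(before, 1):
            flagged[name] = (before, now)
    return flagged
```

A job at the end of an e2e or scalability run could scrape each component's `/metrics` endpoint, pass the body to `count_series`, and fail if `find_regressions` returns anything. A namespace label added by accident would show up as a metric whose series count scales with the number of test namespaces.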

Triage (2021-07-29)

Attendees:

Agenda (2021-07-22)

Attendees:

  • Brancz
  • Logicalhan
  • Dashpole
  • ehashman
  • Dgrisonnet
  • Coffeepac
  • Gaurav Tiwari
  • Catherine Fang
  • Joadavis
  • jpbetz

Agenda:

  • [logicalhan] Metrics stability (adding beta phase)
    • Han reached out to SIG Arch but did not get a response
      • Action: Han to add to next week's SIG Arch agenda
    • WG Reliability suggested that we should add this, but have metric graduation trail feature graduation
      • Will this delay graduation?
      • No: e.g. beta feature has alpha metrics, feature can go GA with a beta metric
      • Could we move back metric requirements in PRR?
      • [ehashman] No, that shouldn't be necessary; the idea of metrics as part of PRR is that beta features are (in 95% of cases) on by default, and therefore people need a way to debug them, so we require people to define how to measure feature perf with metrics. Those metrics don't need to be net new for the feature.
      • [logicalhan] Maybe we could use this as criteria for promoting metrics? If other KEPs rely on them?
  • [logicalhan, lili, jpbetz] Donating auger to SIG Instrumentation
    • Looks like a useful thing, as it relates to observability it falls under our charter
    • Would increase visibility for the project and also hopefully improve diversity of maintainers
    • Agreed, we will accept as SIG Instrumentation project under kubernetes-sigs
    • Action: Han to kick off ownership process

Triage (2021-07-15)

Attendees:

  • Dashpole
  • Ehashman
  • Logicalhan
  • Dgrissonet
  • Catherine
  • Parul

Agenda (2021-07-08)

Triage (2021-06-30)

Attendees:

  • Logicalhan
  • Ehashman
  • dgrisonnet

Agenda (2021-06-24)

  • Reminder: code freeze is July 8
  • [ehashman] KEP review
  • [lilic] Continue discussion around promoting metrics stability that we started in triage
    • Get rid of duplicate set of metrics for watch counts · Issue #102545 · kubernetes/kubernetes
    • https://github.com/kubernetes/kubernetes/pull/102595
      • Issue 1: duplicated metrics, one is a subset of the other
      • Issue 2: the superset metric doesn't comply with the naming guidelines so we can't promote it to stable
      • Issue 3: we don't have a proposal for either of these to promote to stable
      • Issue 4: we don't own this metric, need this to be driven by API machinery, but we can make recommendations to get something promoted
      • [lilic] We haven't fully agreed on what makes it stable - should it be something we can alert or dashboard on?
      • Proposal: propose a new metric that meets the criteria to replace both, and then deprecate both
      • Action: Han to drive (make it better)
  • [logicalhan] Do we need a beta stage for metrics?
    • Since everything is alpha but we don't have a lot of stable metrics, people are relying on alpha
    • Many metrics will never be promoted but we don't have a way to incentivize that
    • Metric graduation doesn't follow the standard feature flag cycle (nothing exists between alpha and GA)
    • Experimental/debug metrics?
    • Action: Need an initial proposal. Han and Frederic volunteer to start a draft.
  • [serathius] Proposal for deprecating klog flags in core k8s components: "Json format should support same set of feature flags as klog" · Issue #99270 · kubernetes/kubernetes
    • Describe case by case why particular flag is hard to implement and should be deprecated.

Attendees:

  • logicalhan
  • ehashman
  • dashpole
  • lili
  • dgrisonnet
  • catherine
  • marek
  • brancz

Triage (2021-06-16)

Agenda (2021-06-10)

Attendees:

  • ehashman
  • logicalhan
  • dashpole
  • Marek
  • joadavis
  • dgrisonnet
  • Pat Christopher
  • Gerassimos
  • Filip
  • Yu Yi
  • Scott
  • [ehashman] Putting out a call for help with metric documentation
    • We have a lot of metrics. We don't have any documentation. Do we want to expand the static analysis we have for stable metrics to alpha ones, and work on improving the documentation available? Perhaps we could add docstring annotations similar to the conformance tests to allow for expanded documentation?
    • Problems with static analysis for alpha ones: doesn't necessarily work for the alpha types. Variable names, concatenated strings, etc. Some metrics are automatically generated (e.g. from kubelet), custom collectors also cause
    • Static analysis is not super resilient; only parses if something is stable
    • Let's not let the perfect be the enemy of the good; we only have 4 metrics in our docs right now, and we could add a lot more
    • Static analysis would need to be improved before we can do more of this stuff
    • Pat Christopher would be interested in digging into this if we can get bugs filed
    • Action: Han to file some bugs detailing the issues with current static analysis to unblock doc generation of metrics
    • We would also need documentation for how the static analysis works (KEP has a lot of detail) -- good starting point for the developer docs
    • After we fix the static analysis, we would need to parse the data and then we can autogenerate the docs for the website
  • [ehashman] Back from SIG Node on https://github.com/kubernetes/kubernetes/issues/101851#issuecomment-848101063
    • Confirmed that we will not add node start time.
  • [logicalhan] Recent issue: someone's trying to remove a metric because it's alpha, but all metrics are alpha so it isn't necessarily safe to remove. How do we handle this?
    • Can't really tell people to follow the stable/alpha policy when there are only 4 stable metrics; most components don't have any stable metrics
    • Can we make a policy for deprecation? Say, if a metric has been in at least 2 releases, there needs to be a 1-release deprecation period?
      • [han] Maybe even longer: if it's been in for 4+ releases, need a 1-release deprecation period
      • All metric removals must be accompanied by ACTION REQUIRED release notes, on both deprecation and removal.
      • We could enforce this with the tooling if metrics had a version they were introduced.
      • [dashpole] What if we only allowed metrics to stay in alpha for a set number of releases? Need to prevent “perma-beta”; this would make it an explicit decision point for the maintainers of each component
      • Can't really force a metric to stable, because some metrics aren't suitable for stable (e.g. constantly changing)
    • Could also introduce a beta metrics phase
      • We would need a KEP for this
      • Action: If anyone wants to pick this up and write a proposal, help is wanted.
    • We need a policy for deprecation of alpha metrics. We've informally had a policy for years but we've never written it down.
      • Right now 90%+ of metrics in kube have no guarantees; we can't just remove things randomly.
      • Suggestion: we need to formalize a policy in our community docs.
      • Action: Han to open a PR for discussion.
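
To make the deprecation idea from the discussion above concrete, here is a small sketch of how tooling might enforce it if every metric recorded the release in which it was introduced. This is one reading of the proposal, not an adopted policy: `MetricInfo`, `can_remove`, and the default thresholds (2 releases shipped, 1-release deprecation period) are hypothetical, and releases are modelled as minor version numbers (e.g. 21 for v1.21).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricInfo:
    name: str
    introduced_minor: int                   # e.g. 19 for v1.19
    deprecated_minor: Optional[int] = None  # release that marked it deprecated


def can_remove(metric, current_minor, min_releases=2, deprecation_releases=1):
    """Decide whether removing `metric` in release `current_minor` is allowed.

    Assumed rule: a metric that has shipped in fewer than `min_releases`
    releases may be removed freely; otherwise it must have been marked
    deprecated for at least `deprecation_releases` releases first.
    """
    releases_shipped = current_minor - metric.introduced_minor
    if releases_shipped < min_releases:
        return True
    if metric.deprecated_minor is None:
        return False
    return current_minor - metric.deprecated_minor >= deprecation_releases
```

Under the longer variant raised in the meeting, a caller would pass `min_releases=4`. The check says nothing about release notes; the ACTION REQUIRED requirement would still have to be enforced separately.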

Agenda (2021-05-27)

  • [serathius/dgrisonnet] Create metrics-api-machinery project that encompasses both core, custom, and external metrics.
    • Name:
      • … <please propose>
      • metrics-api-machinery
    • TODO
      • Create new repo, migrate code, and deprecate (rather than remove) so we don't break backward compatibility
      • Who wants to do the work?

Attendees

Triage (2021-05-19)

Attendees

Agenda (2021-05-13)

Attendees:

  • logicalhan
  • ehashman
  • Andrew Pollack
  • dashpole
  • Marek
  • Yu Yi
  • Solly Ross
  • Joadavis
  • Kristin Barkardottir
  • Nikos Fotiou
  • Pat Christopher
  • John

[CANCELLED] Triage (2021-05-05)

  • Cancelled due to conflict with KubeCon

Agenda (2021-04-29)

Attendees:

  • logicalhan
  • brancz
  • ehashman
  • Andrew Pollack
  • Damien Grisonnet
  • dashpole
  • Kemal Akkoyun
  • Lili Cosic
  • Marek
  • Yu Yi
  • Matthias Loibl

Triage (2021-04-22)

Agenda (2021-04-15)

Issues:

Attendees:

  • logicalhan
  • dashpole
  • ehashman
  • marek
  • Eddie zaneski
  • Scott
  • Joadavis
  • Yu yi
  • Kemal akkoyun

Triage (2021-04-07)

Note: we have begun removing sig/instrumentation labels from Structured Logging PRs in favour of area/logging.

Attendees:

  • ehashman
  • dashpole
  • scott

Agenda (2021-04-01)

Issues:

  • [voutcn] Make webhook-caused critical request failures more visible: doc
  • [coffeepac] moving fluentd-elasticsearch to kubernetes-sigs/instrumentation-tools from cluster/addons (or sending it someplace else, time for it to go)
  • [serathius] wg structured logging formation updates
    • Creation process starts next week

Attendees:

  • dashpole
  • kakkoyun
  • coffeepac
  • voutcn
  • logicalhan

Triage (2021-03-24)

Attendees:

  • logicalhan
  • marek
  • brancz

Agenda (2021-03-18)

Issues:

Attendees:

  • ehashman
  • logicalhan
  • kakkoyun
  • dgrisonnet
  • dashpole
  • bboreham
  • brancz
  • marek
  • lilic
  • Scott Lee
  • Yuchen Zhou
  • metalmatze
  • joadavis

Triage (2021-03-10)

Notes:

Attendees:

  • Ehashman
  • Logicalhan
  • Brancz
  • Dashpole
  • Serathius
  • lilic

Agenda (2021-03-04)

Issues:

AI:

Attendees:

  • logicalhan
  • ehashman
  • dashpole
  • brancz
  • joadavis
  • metalmatze
  • Huang-Wei
  • Scott
  • kakkoyun

Triage (2021-02-24)

Attendees:

  • Ehashman
  • Dashpole
  • Kakkoyun
  • Steve Nguyen
  • Joseph A Davis
  • Serathius

Agenda (2021-02-18)

Issues:

  • [ehashman] Reminder: code freeze is March 9th
  • [ehashman] Annual OWNERS files/org cleanup
  • [ehashman] Annual report is coming!
  • SIG status report for 1.21 feature dev
    • Han: metrics stability has people assigned for all the 1.21 components (selecting stable metrics, escape hatch), on track
    • Marek: would be helpful to have automatic checking for schemas/conventions for structured logging; need better guidance in docs on using keys
    • ACTION: serathius to do some analysis of present labels/keys and work on determining conventions
  • [brancz] otel datapoint

Attendees:

  • ehashman
  • logicalhan
  • voutcn
  • Metalmatze
  • brancz
  • Serathius
  • lilic
  • dgrisonnet
  • kakkoyun
  • dashpole
  • erain

Triage (2021-02-10)

  • triage/unresolved label doesn't remove the needs-triage label
  • What do we do when we don't want to accept a PR but we also want to mark that we've looked at it?

Attendees:

  • Ehashman
  • Dashpole
  • serathius

Agenda (2021-02-04)

Issues:

Attendees:

  • Ehashman
  • Brancz
  • Kakkoyun
  • Logicalhan
  • Yu yi
  • Marek
  • Dashpole
  • ?

Triage (2021-01-27)

Attendees:

  • Ehashman
  • Logicalhan
  • Scott
  • joadavis

Agenda (2021-01-21)

Issues:

Attendees:

  • Ehashman
  • Brancz
  • logicalhan
  • damien grisonnet
  • erain
  • joadavis
  • Serathius
  • kakkoyun

Triage (2021-01-13)

Attendees:

  • Dashpole
  • Logicalhan
  • Ehashman
  • Akonarde

Agenda (2021-01-07)

Issues:

Attendees:

  • Dashpole
  • ehashman
  • erain
  • Akonarde
  • Kakkoyun
  • logicalhan
  • Metalmatze
  • Lilic
  • Brancz
  • joadavis