Update charter

Signed-off-by: John Belamaric <jbelamaric@google.com>
This commit is contained in:
John Belamaric 2024-04-15 14:45:03 -07:00
parent 425840afa0
commit d18506e4a9
1 changed files with 99 additions and 2 deletions

View File

@ -1,3 +1,100 @@
# WG Device Management
# WG Device Management Charter
In progress.
This charter adheres to the conventions described in the [Kubernetes Charter
README] and uses the Roles and Organization Management outlined in
[wg-governance].
## Scope
Enable simple and efficient configuration, sharing, and allocation of
accelerators and other specialized devices. This working group focuses on the
APIs, abstractions, and feature designs needed to configure, target, and share
the necessary hardware for both batch and serving (inference) workloads.
### In scope
- Enable efficient utilization of specialized hardware devices. This includes
sharing one or more resources effectively (many workloads sharing a pool of
devices), as well as sharing individual devices effectively (several workloads
dividing up a single device for sharing).
- Enable workload authors to specify “just enough” details about their workload
requirements to ensure it runs optimally, without having to understand exactly
how the infrastructure team has provisioned the cluster.
- Enable the scheduler to choose the correct place to run a workload the vast
majority of the time (rejections should be extremely rare).
- Enable cluster autoscalers and other node auto-provisioning components to
predict whether creating additional resources will satisfy workload needs,
before provisioning those resources.
- Enable the shift from “pods run on nodes” to “workloads consume capacity”.
This allows Kubernetes to provision sets of pods on top of sets of nodes and
specialized hardware, while taking into account the relationships between
those infrastructure components.
- Enable in-node devices as well as network-accessible devices.
- Minimize workload disruption due to hardware failures.
- Address fragmentation of accelerator due to fractional use.
- Additional problems that may be identified and deemed in scope as we gather
use cases and requirements from WG Serving, WG Batch, and other stakeholders.
- Address all of the above while with a simple API that is a natural extension
of the existing Kubernetes APIs, and avoids or minimizes any transition
effort.
### Out of Scope
- Higher-level workload controller APIs (for example, the equivalent of
Deployment, StatefulSet, or DaemonSet) for specific types of workloads.
- General resource management requirements not related to devices.
## Deliverables
The WG will coordinate the delivery of KEPs and their implementations by the
participating SIGs. Interim artifacts will include documents capturing use
cases, requirements, and designs; however, all of those will eventually result
in KEPs and code owned by SIGs.
Specifically, we expect to need:
- APIs for publishing resource capacity of in-node and network-accessible
devices, as well as sample code to ease creation of drivers to populate this
information.
- APIs for specifying workload resource requirements with respect to devices.
- APIs, algorithms, and implementations for allocating access to and resources on devices, as well as
persisting the results of those allocations.
- APIs, algorithms, and implementations for allowing adminstrators to control
and govern access to devices.
## Stakeholders
- SIG Architecture
- SIG Autoscaling
- SIG Network
- SIG Node
- SIG Scheduling
Additionally a broad set of end users, device vendors, cloud providers,
Kubernetes distribution providers, and ecosystem projects (particularly
autoscaling-related projects) have expressed interest in this effort. There are
five primary groups of stakeholders from each of which we expect multiple participants:
- Device vendors that manufacture accelerators and other specialized hardware
which they would like to make available to Kubernetes users.
- Kubernetes distribution and managed offering providers that would like to make
specialized hardware available to their users.
- Kubernetes ecosystem projects that help manage workloads utilizing these
accelerators (e.g., Karpenter, Kueue, Volcano)
- End user workload authors that will create workloads that take advantage of
the specialized hardware.
- Cluster administrators that operate and govern clusters containing the
specialized hardware.
## Roles and Organization Management
This sig follows adheres to the Roles and Organization Management outlined in [wg-governance]
and opts-in to updates and modifications to [wg-governance].
## Exit Criteria
The working group will disband when the KEPs resulting from these discussions
have reached a terminal state.
[wg-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/wg-governance.md
[Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md