Update charter
Signed-off-by: John Belamaric <jbelamaric@google.com>
This commit is contained in:
		
							parent
							
								
									425840afa0
								
							
						
					
					
						commit
						d18506e4a9
					
				|  | @ -1,3 +1,100 @@ | |||
| # WG Device Management | ||||
| # WG Device Management Charter | ||||
| 
 | ||||
| In progress. | ||||
| This charter adheres to the conventions described in the [Kubernetes Charter | ||||
| README] and uses the Roles and Organization Management outlined in | ||||
| [wg-governance]. | ||||
| 
 | ||||
| ## Scope | ||||
| 
 | ||||
| Enable simple and efficient configuration, sharing, and allocation of | ||||
| accelerators and other specialized devices. This working group focuses on the | ||||
| APIs, abstractions, and feature designs needed to configure, target, and share | ||||
| the necessary hardware for both batch and serving (inference) workloads. | ||||
| 
 | ||||
| ### In scope | ||||
| 
 | ||||
| - Enable efficient utilization of specialized hardware devices. This includes | ||||
|   sharing one or more resources effectively (many workloads sharing a pool of | ||||
|   devices), as well as sharing individual devices effectively (several workloads | ||||
|   dividing up a single device for sharing). | ||||
| - Enable workload authors to specify “just enough” details about their workload | ||||
|   requirements to ensure it runs optimally, without having to understand exactly | ||||
|   how the infrastructure team has provisioned the cluster. | ||||
| - Enable the scheduler to choose the correct place to run a workload the vast | ||||
|   majority of the time (rejections should be extremely rare). | ||||
| - Enable cluster autoscalers and other node auto-provisioning components to | ||||
|   predict whether creating additional resources will satisfy workload needs, | ||||
|   before provisioning those resources. | ||||
| - Enable the shift from “pods run on nodes” to “workloads consume capacity”. | ||||
|   This allows Kubernetes to provision sets of pods on top of sets of nodes and | ||||
|   specialized hardware, while taking into account the relationships between | ||||
|   those infrastructure components. | ||||
| - Enable in-node devices as well as network-accessible devices. | ||||
| - Minimize workload disruption due to hardware failures. | ||||
| - Address fragmentation of accelerator due to fractional use. | ||||
| - Additional problems that may be identified and deemed in scope as we gather | ||||
|   use cases and requirements from WG Serving, WG Batch, and other stakeholders. | ||||
| - Address all of the above while with a simple API that is a natural extension | ||||
|   of the existing Kubernetes APIs, and avoids or minimizes any transition | ||||
|   effort. | ||||
| 
 | ||||
| ### Out of Scope | ||||
| 
 | ||||
| - Higher-level workload controller APIs (for example, the equivalent of | ||||
|   Deployment, StatefulSet, or DaemonSet) for specific types of workloads. | ||||
| - General resource management requirements not related to devices. | ||||
| 
 | ||||
| ## Deliverables | ||||
| 
 | ||||
| The WG will coordinate the delivery of KEPs and their implementations by the | ||||
| participating SIGs. Interim artifacts will include documents capturing use | ||||
| cases, requirements, and designs; however, all of those will eventually result | ||||
| in KEPs and code owned by SIGs. | ||||
| 
 | ||||
| Specifically, we expect to need: | ||||
| 
 | ||||
| - APIs for publishing resource capacity of in-node and network-accessible | ||||
|   devices, as well as sample code to ease creation of drivers to populate this | ||||
|   information. | ||||
| - APIs for specifying workload resource requirements with respect to devices. | ||||
| - APIs, algorithms, and implementations for allocating access to and resources on devices, as well as | ||||
|   persisting the results of those allocations. | ||||
| - APIs, algorithms, and implementations for allowing adminstrators to control | ||||
|   and govern access to devices. | ||||
| 
 | ||||
| ## Stakeholders | ||||
| 
 | ||||
| - SIG Architecture | ||||
| - SIG Autoscaling | ||||
| - SIG Network | ||||
| - SIG Node | ||||
| - SIG Scheduling | ||||
| 
 | ||||
| Additionally a broad set of end users, device vendors, cloud providers, | ||||
| Kubernetes distribution providers, and ecosystem projects (particularly | ||||
| autoscaling-related projects) have expressed interest in this effort. There are | ||||
| five primary groups of stakeholders from each of which we expect multiple participants: | ||||
| 
 | ||||
| - Device vendors that manufacture accelerators and other specialized hardware | ||||
|   which they would like to make available to Kubernetes users. | ||||
| - Kubernetes distribution and managed offering providers that would like to make | ||||
|   specialized hardware available to their users. | ||||
| - Kubernetes ecosystem projects that help manage workloads utilizing these | ||||
|   accelerators (e.g., Karpenter, Kueue, Volcano) | ||||
| - End user workload authors that will create workloads that take advantage of | ||||
|   the specialized hardware. | ||||
| - Cluster administrators that operate and govern clusters containing the | ||||
|   specialized hardware. | ||||
| 
 | ||||
| ## Roles and Organization Management | ||||
| 
 | ||||
| This sig follows adheres to the Roles and Organization Management outlined in [wg-governance] | ||||
| and opts-in to updates and modifications to [wg-governance]. | ||||
| 
 | ||||
| ## Exit Criteria | ||||
| 
 | ||||
| The working group will disband when the KEPs resulting from these discussions | ||||
| have reached a terminal state. | ||||
| 
 | ||||
| [wg-governance]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/wg-governance.md | ||||
| [Kubernetes Charter README]: https://github.com/kubernetes/community/blob/master/committee-steering/governance/README.md | ||||
|  |  | |||
		Loading…
	
		Reference in New Issue