annual report for wg-batch 2024

Kevin Hannon 2025-01-17 10:51:41 -05:00
parent 45b5baa154
commit 549c6b32db
1 changed files with 117 additions and 12 deletions

@@ -2,27 +2,132 @@
## Current initiatives and Project Health
1. What work did the WG do this year that should be highlighted?
<!--
Some example items that might be worth highlighting:
- artifacts
- reports
- white papers
- work not tracked in KEPs
-->
See [2024 Highlights](#2024-highlights).
2. Are there any areas and/or subprojects that your group needs help with (e.g. fewer than 2 active OWNERS)?
None.
### 2024 Highlights
We break our highlights down into subprojects, KEPs, talks, and community adoption.
#### Sub Projects
##### Kueue
Kueue has had 5 releases in 2024.
- [Release 0.6](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.6.0)
- [Release 0.7](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.7.0)
- [Release 0.8](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.8.0)
- [Release 0.9](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.9.0)
- [Release 0.10](https://github.com/kubernetes-sigs/kueue/releases/tag/v0.10.0)
For 2024, the Kueue community would like to highlight topology-aware scheduling, MultiKueue, the Kueue dashboard, KueueCtrl, Deployment/StatefulSet integration for serving, and fair sharing.
Topology-aware scheduling takes data center topology into account when scheduling workloads. Workloads benefit from using interconnects that are physically close together.
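To make the idea concrete, here is a minimal Python sketch of topology-aware placement. It is a toy illustration only, not Kueue's actual algorithm: the label keys `rack` and `block`, the `free_slots` field, and the `pick_domain` helper are all hypothetical names chosen for this example. The sketch looks for the narrowest topology domain that can hold every pod of a workload and widens the search only when no narrower domain fits.

```python
from collections import defaultdict

# Hypothetical inventory: each node carries topology labels and free pod slots.
NODES = [
    {"name": "n1", "labels": {"rack": "r1", "block": "b1"}, "free_slots": 2},
    {"name": "n2", "labels": {"rack": "r1", "block": "b1"}, "free_slots": 2},
    {"name": "n3", "labels": {"rack": "r2", "block": "b1"}, "free_slots": 4},
    {"name": "n4", "labels": {"rack": "r3", "block": "b2"}, "free_slots": 1},
]


def pick_domain(nodes, pods_needed, levels):
    """Return (level, domain, node_names) for the narrowest topology domain
    with enough free slots for all pods, or None if no single domain fits.

    `levels` is ordered from narrowest to widest, e.g. ["rack", "block"].
    """
    for level in levels:
        free = defaultdict(int)      # domain -> total free slots
        members = defaultdict(list)  # domain -> node names
        for node in nodes:
            domain = node["labels"][level]
            free[domain] += node["free_slots"]
            members[domain].append(node["name"])

        fitting = [d for d in free if free[d] >= pods_needed]
        if fitting:
            # Prefer the tightest fit; break ties by domain name for determinism.
            best = min(fitting, key=lambda d: (free[d], d))
            return level, best, sorted(members[best])
    return None


if __name__ == "__main__":
    print(pick_domain(NODES, 3, ["rack", "block"]))
    print(pick_domain(NODES, 6, ["rack", "block"]))
```

With this sample inventory, a 3-pod workload fits inside a single rack, while a 6-pod workload has to widen to a whole block. A real scheduler also has to handle quota, preemption, and per-pod resource requests, all of which this sketch ignores.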
MultiKueue provides a way of dispatching batch workloads to worker clusters. Kueue provides multicluster dispatching for popular batch workloads such as Ray, Job, Kubeflow and JobSet. This feature went beta in 0.9.
A dashboard has been a popular request for Kueue. Users want a visual representation of queueing, and we are happy to announce that a dashboard has been created for Kueue. It landed in Kueue in late 2024, and a big focus of 2025 will be hardening it for production.
KueueCtrl provides a CLI for creating Kueue objects. The plugin is hosted in krew and is easily installed as a kueue plugin.
Deployment/StatefulSet integration makes it possible to use Kueue for serving workloads. Serving creates a need for sharing and preemption among model servers that may use accelerators. Kueue integrates with the popular ways of deploying services (Deployment/StatefulSet).
##### JobSet
JobSet has had 4 releases in 2024.
- [Release 0.4](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.4.0)
- [Release 0.5](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.5.0)
- [Release 0.6](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.6.0)
- [Release 0.7](https://github.com/kubernetes-sigs/jobset/releases/tag/v0.7.0)
A major achievement of JobSet has been the adoption of JobSet as a component for Kubeflow Training Operator V2.
There has been a collaborative effort with the Kubeflow community and the batch community to implement the features needed for this integration.
[Metaflow](https://github.com/Netflix/metaflow/pull/1804) has adopted the use of JobSet for distributed ML training.
##### KJob
[KJob](https://github.com/kubernetes-sigs/kjob?tab=readme-ov-file#kjob) has been started to provide a CLI friendly way for users to submit batch jobs.
The HPC/ML community tends to prefer a CLI over YAML, so the focus was to provide a templated solution for submitting batch jobs.
Another focus of this project is to provide a smooth transition for Slurm users.
#### KEPs
WG-Batch delivered a series of Kubernetes enhancements that improved the experience of running batch workloads on Kubernetes. In 2024, this group proposed or implemented the following KEPs.
- [Job Managed By](https://github.com/kubernetes/enhancements/issues/4368)
  - Promoted to beta.
- [Job Success Policy](https://github.com/kubernetes/enhancements/issues/3998)
  - Promoted to beta.
- [Elastic Index Jobs](https://github.com/kubernetes/enhancements/issues/3715)
  - Promoted to stable.
- [Pod Failure Policy](https://github.com/kubernetes/enhancements/issues/3329)
  - Promoted to stable.
- [Pod Index Label](https://github.com/kubernetes/enhancements/issues/4017)
  - Promoted to stable.
#### Talks
- WG-Batch Update at KubeCon NA 2024
  - Authors: Kevin Hannon and Marcin Wielgus
- Keynote: MultiCluster Batch Jobs Dispatching with Kueue at CERN
  - Authors: Ricardo Rocha and Marcin Wielgus
  - KubeCon NA 2024
- Multitenancy and Fairness at Scale with Kueue: A Case Study
  - Authors: Aldo Culquicondor & Rajat Phull
  - KubeCon NA 2024
- Advanced Resource Management for Running AI/ML Workloads with Kueue
  - Authors: Michał Woźniak & Yuki Iwai
  - KubeCon EU 2024
- Scale Your Batch / Big Data / AI Workloads Beyond the Kubernetes Scheduler
  - Authors: Antonin Stefanutti & Anish Asthana
  - KubeCon EU, March, Paris
- WG-Batch Update at KubeCon EU 2024
  - Authors: Marcin Wielgus
- How the Kubernetes Community is Improving Kubernetes for HPC/AI/ML Workloads
  - Authors: Kevin Hannon
  - FOSDEM 2024
#### Community adoption
- [Kubeflow Training Operator v2](https://github.com/kubeflow/training-operator/blob/0c30f5cd306611f061b6dd529d3c7b7981a7d27c/docs/proposals/2170-kubeflow-training-v2/README.md#kep-2170-kubeflow-training-v2-api) will use JobSet as a critical component for training and fine-tuning.
- [Metaflow supports JobSet](https://github.com/Netflix/metaflow/pull/1804) for distributed training.
- Airflow has built an [integration](https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/_api/airflow/providers/cncf/kubernetes/operators/kueue/index.html) with Kueue.
## Operational
Operational tasks in [wg-governance.md]:
- [ ] [README.md] reviewed for accuracy and updated if needed
- [ ] WG leaders in [sigs.yaml] are accurate and active, and updated if needed
- [ ] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed
- [ ] Updates provided to sponsoring SIGs in 2024
- [x] [README.md] reviewed for accuracy and updated if needed
- [x] WG leaders in [sigs.yaml] are accurate and active, and updated if needed
- [x] Meeting notes and recordings for 2024 are linked from [README.md] and updated/uploaded if needed
- [ ] Updates provided to sponsoring SIGs in 2024
  - [$sig-name](https://git.k8s.io/community/$sig-id/)
    - links to email, meeting notes, slides, or recordings, etc
  - [$sig-name](https://git.k8s.io/community/$sig-id/)