Distributed ML Training and Fine-Tuning on Kubernetes

Kubeflow Trainer


Latest News 🔥

Overview

Kubeflow Trainer is a Kubernetes-native project for fine-tuning large language models (LLMs) and for scalable, distributed training of machine learning (ML) models across various frameworks, including PyTorch, JAX, TensorFlow, and others.

You can also integrate other ML libraries, such as HuggingFace, DeepSpeed, or Megatron-LM, with Kubeflow Trainer to run them on Kubernetes.

Kubeflow Trainer enables you to develop your LLMs with the Kubeflow Python SDK and to build Kubernetes-native Training Runtimes using Kubernetes Custom Resource APIs.
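As a rough illustration of the Custom Resource APIs, a TrainJob references a pre-installed Training Runtime and describes the training workload. The sketch below is a hedged example, not a definitive manifest: the runtime name `torch-distributed` and the exact field layout are assumptions — verify them against the TrainJob CRD installed in your cluster.

```yaml
# Hypothetical sketch of a TrainJob that runs distributed PyTorch
# training by referencing a ClusterTrainingRuntime. Field names may
# differ across Kubeflow Trainer versions; check your installed CRD.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-train-job
spec:
  runtimeRef:
    name: torch-distributed   # assumed name of a pre-installed runtime
  trainer:
    numNodes: 2               # number of training nodes to launch
```

Applying a manifest like this with `kubectl apply -f` asks the Kubeflow Trainer controller to create the distributed training pods according to the referenced runtime's blueprint.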


Kubeflow Trainer Introduction

The following KubeCon + CloudNativeCon 2024 talk provides an overview of Kubeflow Trainer capabilities:


Getting Started

Please check the official Kubeflow Trainer documentation to install and get started with Kubeflow Trainer.
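For orientation, installation typically amounts to applying the project's manifests to an existing Kubernetes cluster. The commands below are a sketch under assumptions: the kustomize overlay paths and the release tag are taken from memory of the documentation and should be confirmed there before use.

```shell
# Hypothetical install sketch -- confirm the exact overlay paths and
# release tag in the official Kubeflow Trainer documentation.
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

# Then install the built-in training runtimes (assumed overlay path):
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"
```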

Community

The following links provide information on how to get involved in the community:

Contributing

Please refer to the CONTRIBUTING guide.

Changelog

Please refer to the CHANGELOG.

Kubeflow Training Operator V1

The Kubeflow Trainer project is currently in alpha status, and APIs may change. If you are using Kubeflow Training Operator V1, please refer to this migration document.

The Kubeflow community will maintain the Training Operator V1 source code in the release-1.9 branch.

You can find the documentation for Kubeflow Training Operator V1 in these guides.

Acknowledgement

This project originally started as a distributed training operator for TensorFlow; we later merged efforts from the other Kubeflow training operators to provide a unified, simplified experience for both users and developers. We are very grateful to everyone who filed issues or helped resolve them, asked and answered questions, and took part in inspiring discussions. We would also like to thank everyone who contributed to and maintained the original operators.