Distributed ML Training and Fine-Tuning on Kubernetes

Kubeflow Trainer


Latest News 🔥

Overview

Kubeflow Trainer is a Kubernetes-native project for fine-tuning large language models (LLMs) and for scalable, distributed training of machine learning (ML) models across various frameworks, including PyTorch, JAX, TensorFlow, and others.

You can also integrate other ML libraries, such as HuggingFace, DeepSpeed, or Megatron-LM, with Kubeflow Trainer to run them on Kubernetes.

Kubeflow Trainer enables you to develop your LLMs with the Kubeflow Python SDK and to build Kubernetes-native Training Runtimes using Kubernetes Custom Resource APIs.
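As a rough illustration of the Custom Resource APIs, a TrainJob references a pre-installed Training Runtime and describes the training workload. The sketch below is a hedged example, not a definitive manifest: the runtime name `torch-distributed` and the exact field layout are assumptions — verify them against the TrainJob CRD installed in your cluster.

```yaml
# Hypothetical sketch of a TrainJob that runs distributed PyTorch
# training by referencing a ClusterTrainingRuntime. Field names may
# differ across Kubeflow Trainer versions; check your installed CRD.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: example-train-job
spec:
  runtimeRef:
    name: torch-distributed   # assumed name of a pre-installed runtime
  trainer:
    numNodes: 2               # number of training nodes to launch
```

Applying a manifest like this with `kubectl apply -f` asks the Kubeflow Trainer controller to create the distributed training pods according to the referenced runtime's blueprint.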


Kubeflow Trainer Introduction

The following KubeCon + CloudNativeCon 2024 talk provides an overview of Kubeflow Trainer capabilities:


Getting Started

Please check the official Kubeflow Trainer documentation to install and get started with Kubeflow Trainer.
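For orientation, installation typically amounts to applying the project's manifests to an existing Kubernetes cluster. The commands below are a sketch under assumptions: the kustomize overlay paths and the release tag are taken from memory of the documentation and should be confirmed there before use.

```shell
# Hypothetical install sketch -- confirm the exact overlay paths and
# release tag in the official Kubeflow Trainer documentation.
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/manager?ref=v2.0.0"

# Then install the built-in training runtimes (assumed overlay path):
kubectl apply --server-side -k "https://github.com/kubeflow/trainer.git/manifests/overlays/runtimes?ref=v2.0.0"
```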

Community

The following links provide information on how to get involved in the community:

Contributing

Please refer to the CONTRIBUTING guide.

Changelog

Please refer to the CHANGELOG.

Kubeflow Training Operator V1

The Kubeflow Trainer project is currently in alpha status, and APIs may change. If you are using Kubeflow Training Operator V1, please refer to this migration document.

The Kubeflow community will maintain the Training Operator V1 source code in the release-1.9 branch.

You can find the documentation for Kubeflow Training Operator V1 in these guides.

Acknowledgement

This project originally started as a distributed training operator for TensorFlow; we later merged efforts from the other Kubeflow training operators to provide a unified, simplified experience for both users and developers. We are very grateful to everyone who filed issues or helped resolve them, asked and answered questions, and took part in inspiring discussions. We would also like to thank everyone who contributed to and maintained the original operators.