Resources

Kubernetes integration status by big data product

Spark

Apache Spark is a distributed data processing framework.

Status

Kubernetes has been supported as a mainline Spark scheduler since release 2.3; see the detailed documentation. This work followed the original Spark on Kubernetes Design Proposal in the apache-spark-on-k8s git repo.
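
As a rough illustration of what using Kubernetes as the Spark scheduler looks like, the sketch below assembles a cluster-mode `spark-submit` command line. The helper name `build_spark_submit`, the API server address, the image name and the jar path are hypothetical placeholders; only the `k8s://` master prefix and the `spark.kubernetes.container.image` and `spark.executor.instances` settings come from the Spark documentation.

```python
import shlex


def build_spark_submit(api_server, image, app_jar, main_class,
                       executors=2, name="spark-pi"):
    """Return the argv list for a cluster-mode spark-submit on Kubernetes.

    Hypothetical helper for illustration; all argument values are supplied
    by the caller and are not defaults of Spark itself.
    """
    return [
        "spark-submit",
        # The k8s:// prefix selects the Kubernetes scheduler backend.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        # Container image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        # local:// refers to a path inside the container image.
        app_jar,
    ]


if __name__ == "__main__":
    cmd = build_spark_submit(
        api_server="https://example.invalid:6443",        # placeholder
        image="registry.example.invalid/spark:latest",    # placeholder
        app_jar="local:///opt/spark/examples/jars/spark-examples.jar",  # placeholder
        main_class="org.apache.spark.examples.SparkPi",
    )
    # Print a shell-safe version of the command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The command is only printed, not executed, so the sketch can be tried without a cluster; the actual values depend on the target cluster and the Spark image in use.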

Activities

Enhancements are under development, with a good overview given in this blog post.

Work is underway for Spark 2.4 to improve support and integration with HDFS.
- Design Document: How Spark on Kubernetes will access Secure HDFS

Shuffle service design:
- Design Document: Improving Spark Shuffle Reliability
- JIRA issue SPARK-25299: Use remote storage for persisting shuffle data

HDFS

Apache Hadoop HDFS is a distributed file system, the persistence layer for Hadoop.

Status

TODO, e.g. "No release yet."

Activities

Airflow

Apache Airflow is a platform to programmatically author, schedule and monitor workflows.

Status

The Kubernetes executor was introduced in Airflow release 1.10.0, with support for Kubernetes 1.10.
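
For orientation, the executor is selected through the `executor` key in the `[core]` section of `airflow.cfg`. The sketch below only renders that fragment; the helper name `executor_config` is hypothetical, and a working deployment needs further Kubernetes-specific settings that are not shown here.

```python
import configparser
import io


def executor_config(executor="KubernetesExecutor"):
    """Render the airflow.cfg fragment that selects the given executor.

    Hypothetical helper for illustration; it writes nothing to disk.
    """
    parser = configparser.ConfigParser()
    parser["core"] = {"executor": executor}
    buf = io.StringIO()
    parser.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(executor_config())
```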

Activities

Airflow roadmap

Flink

Apache Flink is a distributed data processing framework.

Status

Flink 1.6 supports running a session or job cluster on Kubernetes.

Activities