Resources
Kubernetes integration status by big data product
Spark
Apache Spark is a distributed data processing framework.
Status
Kubernetes has been supported as a mainline Spark scheduler since release 2.3; see the detailed documentation. That work followed the original Spark on Kubernetes Design Proposal in the apache-spark-on-k8s git repo.
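As a rough illustration of what using Kubernetes as the Spark scheduler looks like, the sketch below assembles a `spark-submit` command line in Python. The `k8s://` master prefix, cluster deploy mode, and the `spark.kubernetes.container.image` and `spark.executor.instances` settings come from the Spark documentation. The API server URL, image name, jar path, and the helper name `build_spark_submit` are placeholders assumed for this example.

```python
import shlex


def build_spark_submit(api_server, image, executors=2):
    """Return a spark-submit argument list that targets a Kubernetes cluster.

    api_server and image are placeholders supplied by the caller; nothing
    here contacts a cluster, it only builds the command line.
    """
    return [
        "bin/spark-submit",
        # The k8s:// prefix selects the Kubernetes scheduler backend.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--name", "spark-pi",
        "--class", "org.apache.spark.examples.SparkPi",
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        # Assumed jar location inside the container image (hypothetical path).
        "local:///opt/spark/examples/jars/spark-examples.jar",
    ]


# Hypothetical API server URL and image name, for illustration only.
cmd = build_spark_submit("https://k8s.example.com:6443", "example/spark:2.3.0")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The printed command is meant to be run from a Spark distribution directory against a cluster you control; adjust the placeholders to match your environment.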
Activities
Enhancements are under development; this blog post gives a good overview.
- Work is underway for Spark 2.4 to improve support and integration with HDFS.
- Design Document: How Spark on Kubernetes will access Secure HDFS
- Shuffle service design
- Design Document: Improving Spark Shuffle Reliability
- JIRA issue SPARK-25299: Use remote storage for persisting shuffle data
HDFS
Apache Hadoop HDFS is a distributed file system, the persistence layer for Hadoop.
Status
TODO, e.g. "No release yet."
Activities
Airflow
Apache Airflow is a platform to programmatically author, schedule and monitor workflows.
Status
The Kubernetes executor was introduced in Airflow release 1.10.0, with support for Kubernetes 1.10.
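A minimal sketch of how the executor might be selected, assuming Airflow's convention of overriding `airflow.cfg` options through environment variables named `AIRFLOW__{SECTION}__{KEY}`. The helper `airflow_env_var` is a hypothetical name introduced for this example and is not part of Airflow.

```python
def airflow_env_var(section, key):
    """Build the environment variable name that overrides an airflow.cfg option.

    Hypothetical helper: it only formats the name following the
    AIRFLOW__{SECTION}__{KEY} convention; it does not touch Airflow itself.
    """
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Select the Kubernetes executor, equivalent to setting
# "executor = KubernetesExecutor" in the [core] section of airflow.cfg.
env = {airflow_env_var("core", "executor"): "KubernetesExecutor"}
print(env)
```

The resulting mapping can be passed to the scheduler's process environment, for example in a container spec.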
Activities
Flink
Apache Flink is a distributed data processing framework.
Status
Flink 1.6 supports running a session or job cluster on Kubernetes.