Resources

Kubernetes integration status by big data product

Spark

Apache Spark is a distributed data processing framework.

Status

Kubernetes has been supported as a mainline Spark scheduler since release 2.3; see the detailed documentation. That work followed the original Spark on Kubernetes Design Proposal in the apache-spark-on-k8s git repo.
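As a rough illustration of what using Kubernetes as the Spark scheduler looks like, the sketch below assembles a `spark-submit` command line that targets a Kubernetes API server through a `k8s://` master URL in cluster deploy mode. The helper name `build_spark_submit_args`, the API server address, the container image, and the jar path are all hypothetical placeholders, not values taken from this page; check the Spark documentation for the options that apply to your release.

```python
from typing import List


def build_spark_submit_args(
    api_server: str,
    image: str,
    main_class: str,
    app_jar: str,
    instances: int = 2,
) -> List[str]:
    """Assemble a spark-submit argument list for a Kubernetes master.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    return [
        "spark-submit",
        # The k8s:// prefix selects the Kubernetes scheduler backend.
        "--master", f"k8s://{api_server}",
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.executor.instances={instances}",
        # Image used for the driver and executor pods.
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


# All values below are placeholders (assumptions), not real endpoints.
args = build_spark_submit_args(
    api_server="https://k8s.example.com:6443",
    image="example/spark:2.3.0",
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
)
print(" ".join(args))
```

The resulting list can be passed to `subprocess.run(args)` on a machine that has a Spark distribution and cluster credentials available.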

Activities

Enhancements are under development, with a good overview given in this blog post.

HDFS

Apache Hadoop HDFS is a distributed file system, the persistence layer for Hadoop.

Status

TODO, e.g. "No release yet."

Activities

Airflow

Apache Airflow is a platform to programmatically author, schedule and monitor workflows.

Status

The Kubernetes executor was introduced in Airflow release 1.10.0, with support for Kubernetes 1.10.

Activities

Flink

Apache Flink is a distributed data processing framework.

Status

Flink 1.6 supports running a session or job cluster on Kubernetes.

Activities