# Integration with Google Cloud Storage and BigQuery

This document describes how to use Google Cloud services, e.g., Google Cloud Storage (GCS) and BigQuery, as data sources or sinks in `SparkApplication`s. For a detailed tutorial on building Spark applications that access GCS and BigQuery, please refer to [Using Spark on Kubernetes Engine to Process Data in BigQuery](https://cloud.google.com/solutions/spark-on-kubernetes-engine).

A Spark application requires the [GCS](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage) and [BigQuery](https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery) connectors to access GCS and BigQuery using the Hadoop `FileSystem` API. One way to make the connectors available to the driver and executors is to use a custom Spark image with the connectors built in, as this example [Dockerfile](https://github.com/GoogleCloudPlatform/spark-on-k8s-gcp-examples/blob/master/dockerfiles/spark-gcs/Dockerfile) shows. An image built from this Dockerfile is located at `gcr.io/ynli-k8s/spark:v2.3.0-gcs`.

The connectors require certain Hadoop properties to be set properly to function. Hadoop properties can be set either through a custom Hadoop configuration file, namely `core-site.xml`, in a custom image, or via the `spec.hadoopConf` section in a `SparkApplication`. The example Dockerfile mentioned above shows the use of a custom `core-site.xml` and a custom `spark-env.sh` that points the environment variable `HADOOP_CONF_DIR` to the directory in the container where `core-site.xml` is located. The example `core-site.xml` and `spark-env.sh` can be found [here](https://github.com/GoogleCloudPlatform/spark-on-k8s-gcp-examples/tree/master/conf).
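For orientation, a `core-site.xml` for the GCS connector typically needs to register the connector's `FileSystem` implementation classes, at a minimum. The following is a minimal sketch, not the exact contents of the linked example:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Register the GCS connector as the handler for gs:// URIs. -->
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <!-- Register the AbstractFileSystem variant used by newer Hadoop APIs. -->
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  </property>
</configuration>
```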
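The accompanying `spark-env.sh` then only needs to export `HADOOP_CONF_DIR`; the directory used below is an assumed location and must match wherever the image actually places `core-site.xml`:

```bash
# Point Hadoop (and thus the connectors) at the directory containing core-site.xml.
# /etc/hadoop/conf is an assumed path; use the actual location in your image.
export HADOOP_CONF_DIR=/etc/hadoop/conf
```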
The GCS and BigQuery connectors need to authenticate with the GCS and BigQuery services before they can use them. The connectors support using a [GCP service account JSON key file](https://cloud.google.com/iam/docs/creating-managing-service-account-keys) for authentication. The service account must be granted the IAM roles necessary for accessing GCS and/or BigQuery. The [tutorial](https://cloud.google.com/solutions/spark-on-kubernetes-engine) has detailed information on how to create a service account, grant it the right roles, furnish a key, and download a JSON key file.

To tell the connectors to use a service account JSON key file for authentication, the following Hadoop configuration properties must be set:

```
google.cloud.auth.service.account.enable=true
google.cloud.auth.service.account.json.keyfile=<path to the JSON key file>
```

The most common way of getting the service account JSON key file into the driver and executor containers is to mount the key file in through a Kubernetes secret volume. Detailed information on how to create a secret can be found in the [tutorial](https://cloud.google.com/solutions/spark-on-kubernetes-engine); a sketch of the `kubectl` command is also shown after the example manifest below.

Below is an example `SparkApplication` using the custom image at `gcr.io/ynli-k8s/spark:v2.3.0-gcs` with the GCS/BigQuery connectors and the custom Hadoop configuration files above built in. Note that some of the necessary Hadoop configuration properties are set using `spec.hadoopConf`, in addition to the ones set in the built-in `core-site.xml`. They are set here instead of in `core-site.xml` because of their application-specific nature; the ones set in `core-site.xml` apply to all applications using the image.

Also note how the Kubernetes secret named `gcs-bq` that stores the service account JSON key file gets mounted into both the driver and the executors. The environment variable `GCS_PROJECT_ID` must be set when using the image at `gcr.io/ynli-k8s/spark:v2.3.0-gcs`.

```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: foo-gcs-bg
spec:
  type: Java
  mode: cluster
  image: gcr.io/ynli-k8s/spark:v2.3.0-gcs
  imagePullPolicy: Always
  hadoopConf:
    "fs.gs.project.id": "foo"
    "fs.gs.system.bucket": "foo-bucket"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/key.json"
  driver:
    cores: 1
    secrets:
      - name: "gcs-bq"
        path: "/mnt/secrets"
        secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: foo
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: "512m"
    secrets:
      - name: "gcs-bq"
        path: "/mnt/secrets"
        secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: foo
```
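For reference, the `gcs-bq` secret referenced in the manifest can be created from a downloaded key file with `kubectl`. The key must be stored under the file name `key.json` so that it appears at `/mnt/secrets/key.json` once mounted; the local path below is a placeholder:

```bash
# Create the secret holding the service account key; the local path is a placeholder.
kubectl create secret generic gcs-bq --from-file=key.json=/path/to/downloaded/key.json
```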
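With the secret in place, the application can be submitted like any other `SparkApplication`; the manifest file name below is hypothetical:

```bash
kubectl apply -f spark-gcs-bq.yaml
```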