+++ title = "Logging and monitoring" description = "Logging and monitoring for Kubeflow" weight = 100 +++ This guide has information about how to set up logging and monitoring for your Kubeflow deployment. # Logging ## Stackdriver on GKE The default on GKE is to send logs to [Stackdriver logging](https://cloud.google.com/logging/docs/). Stackdriver recently introduced new features for [Kubernetes Monitoring](https://cloud.google.com/monitoring/kubernetes-engine/migration) that are currently in Beta. These features are only available on Kubernetes v1.10 or later and must be explicitly installed. Below are instructions for both versions of Stackdriver Kubernetes. ### Default stackdriver This section contains instructions for using the existing stackdriver support for GKE which is the default. To get the logs for a particular pod you can use the following advanced filter in Stackdriver logging's search UI. ``` resource.type="container" resource.labels.cluster_name="${CLUSTER}" resource.labels.pod_id="${POD_NAME}" ``` where ${POD_NAME} is the name of the pod and ${CLUSTER} is the name of your cluster. The equivalent gcloud command would be ``` gcloud --project=${PROJECT} logging read \ --freshness=24h \ --order asc \ "resource.type=\"container\" resource.labels.cluster_name=\"${CLUSTER}\" resource.labels.pod_id=\"${POD}\" " ``` Kubernetes events for the TFJob are also available in stackdriver and can be obtained using the following query in the UI ``` resource.labels.cluster_name="${CLUSTER}" logName="projects/${PROJECT}/logs/events" jsonPayload.involvedObject.name="${TFJOB}" ``` The equivalent gcloud command is ``` gcloud --project=${PROJECT} logging read \ --freshness=24h \ --order asc \ "resource.labels.cluster_name=\"${CLUSTER}\" jsonPayload.involvedObject.name=\"${TFJOB}\" logName=\"projects/${PROJECT}/logs/events\" " ``` ### Stackdriver Kubernetes This section contains the relevant stackdriver queries and gloud commands if you are using the new [Stackdriver Kubernetes Monitoring](https://cloud.google.com/monitoring/kubernetes-engine) To get the stdout/stderr logs for a particular container you can use the following advanced filter in Stackdriver logging's search UI. ``` resource.type="k8s_container" resource.labels.cluster_name="${CLUSTER}" resource.labels.pod_name="${POD_NAME}" ``` where ${POD_NAME} is the name of the pod and ${CLUSTER} is the name of your cluster. The equivalent gcloud command would be ``` gcloud --project=${PROJECT} logging read \ --freshness=24h \ --order asc \ "resource.type=\"k8s_container\" resource.labels.cluster_name=\"${CLUSTER}\" resource.labels.pod_name=\"${POD_NAME}\" " ``` Events about individual pods can be obtained with the following query ``` resource.type="k8s_pod" resource.labels.cluster_name="${CLUSTER}" resource.labels.pod_name="${POD_NAME}" ``` or via gcloud ``` gcloud --project=${PROJECT} logging read \ --freshness=24h \ --order asc \ "resource.type=\"k8s_pod\" resource.labels.cluster_name=\"${CLUSTER}\" resource.labels.pod_name=\"${POD_NAME}\" " ``` #### Filter with labels The new agents also support querying for logs using pod labels For example: ``` resource.type="k8s_container" resource.labels.cluster_name="${CLUSTER}" metadata.userLables.${LABEL_KEY}="${LABEL_VALUE}" ``` # Monitoring ## Stackdriver on GKE The new [Stackdriver Kubernetes Monitoring](https://cloud.google.com/monitoring/kubernetes-engine) provides single dashboard observability and is compatible with Prometheus data model. 
# Monitoring

## Stackdriver on GKE

The new [Stackdriver Kubernetes Monitoring](https://cloud.google.com/monitoring/kubernetes-engine) provides single-dashboard observability and is compatible with the Prometheus data model. See this [doc](https://cloud.google.com/monitoring/kubernetes-engine/observing) for more details on the dashboard.

By default Stackdriver provides container-level CPU/memory metrics. You can also define custom Prometheus metrics and view them on the Stackdriver dashboard; see this [doc](https://cloud.google.com/monitoring/kubernetes-engine/prometheus) for more detail.

## Prometheus

### Kubeflow Prometheus component

Kubeflow provides a Prometheus [component](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/prometheus.libsonnet). To deploy the Prometheus component:

```
ks generate prometheus prom --projectId=YOUR_PROJECT --clusterName=YOUR_CLUSTER --zone=ZONE
ks apply YOUR_ENV -c prom
```

The Prometheus server scrapes the services that carry the annotation `prometheus.io/scrape=true`. See the [configuration](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/prometheus.yml#L75) and an [example](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/metric-collector.libsonnet#L83) for more detail; a sketch of annotating a Service and checking the scrape targets is shown at the end of this page.

#### Export metrics to Stackdriver

The Prometheus server exports metrics to Stackdriver, as [configured](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/prometheus.yml#L127) in `prometheus.yml`. The component uses an [image](https://github.com/kubeflow/kubeflow/blob/master/kubeflow/gcp/prometheus.libsonnet#L170) provided by Stackdriver; see the Stackdriver [doc](https://cloud.google.com/monitoring/kubernetes-engine/prometheus) for more detail, but you don't need to change anything here. If you don't want to export metrics to Stackdriver, remove the `remote_write` section from `prometheus.yml` and use a native Prometheus [image](https://hub.docker.com/r/prom/prometheus/tags/).

### Metric collector component for IAP (GKE only)

Kubeflow also provides a metric-collector [component](https://github.com/kubeflow/kubeflow/tree/master/metric-collector). This component periodically pings your Kubeflow endpoint and exposes a [metric](https://github.com/kubeflow/kubeflow/blob/master/metric-collector/service-readiness/kubeflow-readiness.py#L21) indicating whether the endpoint is up. To deploy it:

```
ks generate metric-collector mc --targetUrl=YOUR_KF_ENDPOINT
ks apply YOUR_ENV -c mc
```
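### Example: annotating a Service for scraping

As a concrete illustration of the scrape annotation mentioned in the Prometheus section above, the following command adds `prometheus.io/scrape=true` to an existing Service. This is a minimal sketch: `my-service` and the `kubeflow` namespace are placeholders for your own deployment, and depending on how the scrape config is set up you may also need to point Prometheus at the right metrics port or path.

```
# Mark an existing Service so the Prometheus server will scrape it.
# "my-service" and the namespace are placeholders; adjust them for your deployment.
kubectl annotate service my-service -n kubeflow prometheus.io/scrape=true --overwrite
```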
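To confirm that annotated services are actually being scraped, one option is to port-forward to the Prometheus server and look at its targets page. The namespace, service name, and port below are assumptions; substitute whatever your Prometheus component actually created.

```
# Find the Prometheus service created by the component.
kubectl get svc --all-namespaces | grep -i prometheus
# Forward a local port to it (replace the namespace, name, and port with the values found above),
# then open http://localhost:9090/targets in a browser to see the scrape targets.
kubectl port-forward -n kubeflow svc/prometheus 9090:9090
```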