Document installation and usage of logging, monitoring and tracing components (#92)

* Document installation and usage of logging, monitoring and tracing components.
* Minor documentation changes.
2018-07-10 12:39:19 -07:00 · 2018-07-10 12:39:19 -07:00 · 2e8418c97c
parent 924257ded3
commit 2e8418c97c
15 changed files with 470 additions and 0 deletions
--- a/serving/README.md
+++ b/serving/README.md
@ -0,0 +1,16 @@
+# Serving
+
+## Observability
+
+* [Installing Logging, Metrics and Traces](./installing-logging-metrics-traces.md)
+* [Accessing Logs](./accessing-logs.md)
+* [Accessing Metrics](./accessing-metrics.md)
+* [Accessing Traces](./accessing-traces.md)
+* [Debugging Application Issues](./debugging-application-issues.md)
+* [Debugging Performance Issues](./debugging-performance-issues.md)
+
+## Networking
+
+* [Setting up DNS](./DNS.md)
+
+## Scaling
--- a/serving/accessing-logs.md
+++ b/serving/accessing-logs.md
@ -0,0 +1,115 @@
+# Accessing logs
+
+If the logging and monitoring components are not installed yet, follow the
+[installation instructions](./installing-logging-metrics-traces.md) to set up the
+necessary components first.
+
+## Kibana and Elasticsearch
+
+To open the Kibana UI (the visualization tool for [Elasticsearch](https://info.elastic.co)),
+enter the following command:
+
+```shell
+kubectl proxy
+```
+
+This starts a local proxy to the Kubernetes API server on port 8001, through which
+the Kibana service can be reached. The Kibana UI is only exposed within the cluster
+for security reasons.
+
+Navigate to the [Kibana UI](http://localhost:8001/api/v1/namespaces/monitoring/services/kibana-logging/proxy/app/kibana)
+(*It might take a couple of minutes for the proxy to work*).
+
+The Discover tab of the Kibana UI looks like this:
+
+![Kibana UI Discover tab](./images/kibana-discover-tab-annotated.png)
+
+You can change the time frame of logs Kibana displays in the upper right corner
+of the screen. The main search bar is across the top of the Discover page.
+
+As more logs are ingested, new fields will be discovered. To have them indexed,
+go to Management > Index Patterns > Refresh button (on top right) > Refresh
+fields.
+
+<!-- TODO: create a video walkthrough of the Kibana UI -->
+
+### Accessing configuration and revision logs
+
+To access the logs for a configuration, enter the following search query in Kibana:
+
+```text
+kubernetes.labels.knative_dev\/configuration: "configuration-example"
+```
+
+Replace `configuration-example` with your configuration's name. Enter the following
+command to get your configuration's name:
+
+```shell
+kubectl get configurations
+```
+
+To access logs for a revision, enter the following search query in Kibana:
+
+```text
+kubernetes.labels.knative_dev\/revision: "configuration-example-00001"
+```
+
+Replace `configuration-example-00001` with your revision's name.
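+
+Revision names can be looked up in the same way as configuration names. A short
+sketch, assuming the `configuration-example` configuration used above:
+
+```shell
+# List all revisions in the current namespace.
+kubectl get revisions
+
+# Print only the name of the latest revision created for a configuration.
+kubectl get configuration configuration-example -o jsonpath="{.status.latestCreatedRevisionName}"
+```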
+
+### Accessing build logs
+
+To access the logs for a build, enter the following search query in Kibana:
+
+```text
+kubernetes.labels.build\-name: "test-build"
+```
+
+Replace `test-build` with your build's name. The build name is specified in the `.yaml` file as follows:
+
+```yaml
+apiVersion: build.knative.dev/v1alpha1
+kind: Build
+metadata:
+  name: test-build
+```
+
+### Accessing request logs
+
+To access request logs, enter the following search query in Kibana:
+
+```text
+tag: "requestlog.logentry.istio-system"
+```
+
+Request logs contain details about requests served by the revision. Below is a sample request log:
+
+```text
+@timestamp                   July 10th 2018, 10:09:28.000
+destinationConfiguration     configuration-example
+destinationNamespace         default
+destinationRevision          configuration-example-00001
+destinationService           configuration-example-00001-service.default.svc.cluster.local
+latency                      1.232902ms
+method                       GET
+protocol                     http
+referer                      unknown
+requestHost                  route-example.default.example.com
+requestSize                  0
+responseCode                 200
+responseSize                 36
+severity                     Info
+sourceNamespace              istio-system
+sourceService                unknown
+tag                          requestlog.logentry.istio-system
+traceId                      986d6faa02d49533
+url                          /
+userAgent                    curl/7.60.0
+```
+
+### Accessing end-to-end request traces
+
+See the [Accessing Traces](./accessing-traces.md) page for details.
+
+## Stackdriver
+
+Go to the [Google Cloud Console logging page](https://console.cloud.google.com/logs/viewer) for
+the GCP project that stores your logs in Stackdriver.
--- a/serving/accessing-metrics.md
+++ b/serving/accessing-metrics.md
@ -0,0 +1,27 @@
+# Accessing metrics
+
+To open the [Grafana](https://grafana.com/) UI (the visualization tool
+for [Prometheus](https://prometheus.io/)), enter the following command:
+
+```shell
+kubectl port-forward -n monitoring $(kubectl get pods -n monitoring --selector=app=grafana --output=jsonpath="{.items..metadata.name}") 3000
+```
+
+This forwards local port 3000 to the Grafana pod. The Grafana UI is only exposed within
+the cluster for security reasons.
+
+Navigate to the Grafana UI at [http://localhost:3000](http://localhost:3000).
+Select the `Home` button at the top of the page to see the list of pre-installed dashboards (screenshot below):
+
+![Knative Dashboards](./images/grafana1.png)
+
+The following dashboards are pre-installed with Knative Serving:
+
+* **Revision HTTP Requests:** HTTP request count, latency and size metrics per revision and per configuration
+* **Nodes:** CPU, memory, network and disk metrics at node level
+* **Pods:** CPU, memory and network metrics at pod level
+* **Deployment:** CPU, memory and network metrics aggregated at deployment level
+* **Istio, Mixer and Pilot:** Detailed Istio mesh, Mixer and Pilot metrics
+* **Kubernetes:** Dashboards giving insights into cluster health, deployments and capacity usage
+
+To log in as an administrator and modify or add dashboards, sign in with username `admin` and password `admin`.
+Make sure to change the password before exposing the Grafana UI outside the cluster.
--- a/serving/accessing-traces.md
+++ b/serving/accessing-traces.md
@ -0,0 +1,21 @@
+# Accessing request traces
+
+If the logging and monitoring components are not installed yet, follow the
+[installation instructions](./installing-logging-metrics-traces.md) to set up the
+necessary components first.
+
+To open the Zipkin UI (the visualization tool for request traces), enter the following command:
+
+```shell
+kubectl proxy
+```
+
+This starts a local proxy to the Kubernetes API server on port 8001, through which
+the Zipkin service can be reached. The Zipkin UI is only exposed within the cluster
+for security reasons.
+
+Navigate to the [Zipkin UI](http://localhost:8001/api/v1/namespaces/istio-system/services/zipkin:9411/proxy/zipkin/).
+Click on "Find Traces" to see the latest traces. You can search for a trace ID
+or look at traces of a specific application. Click on a trace to see a detailed
+view of a specific call.
+
+<!--TODO: Consider adding a video here. -->
--- a/serving/debugging-application-issues.md
+++ b/serving/debugging-application-issues.md
@ -0,0 +1,130 @@
+# Debugging Issues with Your Application
+
+You deployed your app to Knative Serving, but it isn't working as expected.
+Go through this step-by-step guide to understand what failed.
+
+## Check command line output
+
+Check the output of your deploy command to see whether it succeeded. If the
+deployment was terminated, the output should contain an error message that
+describes why it failed.
+
+This kind of failure is most likely caused by a misconfigured manifest or an
+incorrect command. For example, the following output says that the traffic
+percentages of a route must sum to 100:
+
+```text
+Error from server (InternalError): error when applying patch:
+{"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"serving.knative.dev/v1alpha1\",\"kind\":\"Route\",\"metadata\":{\"annotations\":{},\"name\":\"route-example\",\"namespace\":\"default\"},\"spec\":{\"traffic\":[{\"configurationName\":\"configuration-example\",\"percent\":50}]}}\n"}},"spec":{"traffic":[{"configurationName":"configuration-example","percent":50}]}}
+to:
+&{0xc421d98240 0xc421e77490 default route-example STDIN 0xc421db0488 264682 false}
+for: "STDIN": Internal error occurred: admission webhook "webhook.knative.dev" denied the request: mutation failed: The route must have traffic percent sum equal to 100.
+ERROR: Non-zero return code '1' from command: Process exited with status 1
+```
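+
+As a minimal sketch of a fix for the error above, the same `Route` can be
+re-applied from standard input with traffic percentages that sum to 100. The
+manifest fields are taken from the error output; sending all traffic to
+`configuration-example` is an assumption made only for illustration:
+
+```shell
+# Re-apply the Route with a traffic split that sums to 100.
+# Assumption: all traffic should go to configuration-example.
+kubectl apply -f - <<EOF
+apiVersion: serving.knative.dev/v1alpha1
+kind: Route
+metadata:
+  name: route-example
+  namespace: default
+spec:
+  traffic:
+  - configurationName: configuration-example
+    percent: 100
+EOF
+```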
+
+## Check application logs
+
+Knative Serving provides default out-of-the-box logs for your application.
+Access your application logs as described in [Accessing Logs](./accessing-logs.md).
+
+## Check Route status
+
+Run the following command to get the `status` of the `Route` object with which
+you deployed your application:
+
+```shell
+kubectl get route <route-name> -o yaml
+```
+
+The `conditions` in `status` provide the reason for any failure. For
+details, see Knative
+[Error Conditions and Reporting](../spec/errors.md) (some of these conditions
+are not implemented yet).
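+
+To print only the conditions rather than the whole object, a `jsonpath` output
+template can be used. This is a sketch that assumes each condition carries
+`type`, `status` and `reason` fields, as in the `Revision` example in this guide:
+
+```shell
+# Print one line per condition: type, status and reason (tab-separated).
+kubectl get route <route-name> \
+    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'
+```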
+
+## Check Revision status
+
+If you configure your `Route` with a `Configuration`, run the following
+command to get the name of the `Revision` created for your deployment
+(look up the configuration name in the `Route` .yaml file):
+
+```shell
+kubectl get configuration <configuration-name> -o jsonpath="{.status.latestCreatedRevisionName}"
+```
+
+If you configure your `Route` with a `Revision` directly, look up the revision
+name in the `Route` .yaml file.
+
+Then run the following command:
+
+```shell
+kubectl get revision <revision-name> -o yaml
+```
+
+A ready `Revision` should have the following condition in `status`:
+
+```yaml
+conditions:
+  - reason: ServiceReady
+    status: "True"
+    type: Ready
+```
+
+If you see this condition, check the following to continue debugging:
+
+* [Check Pod status](#check-pod-status)
+* [Check application logs](#check-application-logs)
+* [Check Istio routing](#check-istio-routing)
+
+If you see other conditions, debug further as follows:
+
+* Look up the meaning of the conditions in Knative
+     [Error Conditions and Reporting](../spec/errors.md). Note: some of them
+     are not implemented yet. An alternative is to
+     [check Pod status](#check-pod-status).
+* If you are using `Build` to deploy and the `BuildComplete` condition is not
+     `True`, [check Build status](#check-build-status).
+
+## Check Pod status
+
+To get the `Pod`s for all your deployments:
+
+```shell
+kubectl get pods
+```
+
+This should list all `Pod`s with brief status. For example:
+
+```text
+NAME                                                      READY     STATUS             RESTARTS   AGE
+configuration-example-00001-deployment-659747ff99-9bvr4   2/2       Running            0          3h
+configuration-example-00002-deployment-5f475b7849-gxcht   1/2       CrashLoopBackOff   2          36s
+```
+
+Choose one and use the following command to see detailed information for its
+`status`. Some useful fields are `conditions` and `containerStatuses`:
+
+```shell
+kubectl get pod <pod-name> -o yaml
+```
+
+If you see issues with the `user-container` container in `containerStatuses`, check your application logs as described in [Check application logs](#check-application-logs).
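+
+For a quick look without Kibana, the container's output can also be read
+directly with `kubectl logs`. A sketch, assuming a pod name taken from the
+`kubectl get pods` listing above:
+
+```shell
+# Logs from the application container of a pod.
+kubectl logs <pod-name> -c user-container
+
+# For a container in CrashLoopBackOff, show the logs of the previous (crashed) instance.
+kubectl logs <pod-name> -c user-container --previous
+```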
+
+## Check Build status
+
+If you are using Build to deploy, run the following command to get the Build for
+your `Revision`:
+
+```shell
+kubectl get build $(kubectl get revision <revision-name> -o jsonpath="{.spec.buildName}") -o yaml
+```
+
+The `conditions` in `status` provide the reason for any failure. To
+access build logs, first execute `kubectl proxy` and then open the [Kibana UI](http://localhost:8001/api/v1/namespaces/monitoring/services/kibana-logging/proxy/app/kibana).
+Use any of the following filters within Kibana UI to
+see build logs. _(See [telemetry guide](../telemetry.md) for more information on
+logging and monitoring features of Knative Serving.)_
+
+* All build logs: `_exists_:"kubernetes.labels.build-name"`
+* Build logs for a specific build: `kubernetes.labels.build-name:"<BUILD NAME>"`
+* Build logs for a specific build and step: `kubernetes.labels.build-name:"<BUILD NAME>" AND kubernetes.container_name:"build-step-<BUILD STEP NAME>"`
--- a/serving/debugging-performance-issues.md
+++ b/serving/debugging-performance-issues.md
@ -0,0 +1,101 @@
+# Investigating Performance Issues
+
+You deployed your application or function to Knative Serving, but its performance
+is not meeting expectations. Knative Serving provides various dashboards and tools to
+help investigate such issues. This document goes through these dashboards
+and tools.
+
+## Request metrics
+
+Start your investigation with the "Revision - HTTP Requests" dashboard. To open this dashboard,
+open the Grafana UI as described in [Accessing Metrics](./accessing-metrics.md) and navigate to
+"Knative Serving - Revision HTTP Requests". Select your configuration and revision
+from the menu at the top left of the page. You will see a page like the one below:
+
+![Knative Serving - Revision HTTP Requests](./images/request_dash1.png)
+
+This dashboard gives visibility into the following for each revision:
+
+* Request volume
+* Request volume per HTTP response code
+* Response time
+* Response time per HTTP response code
+* Request and response sizes
+
+This dashboard can show traffic volume or latency discrepancies between different revisions.
+If, for example, a revision's latency is higher than that of other revisions,
+focus your investigation on the offending revision through the rest of this guide.
+
+## Request traces
+
+Next, look into request traces to find out where the time is spent for a single request.
+To access request traces, open the Zipkin UI as described in [Accessing Traces](./accessing-traces.md).
+Select your revision from the "Service Name" drop-down and click the "Find Traces" button.
+This brings up a view like the one below:
+
+![Zipkin - Trace Overview](./images/zipkin1.png)
+
+In the example above, the request spent most of its time in the
+[span](https://github.com/opentracing/specification/blob/master/specification.md#the-opentracing-data-model) right before the last.
+The investigation should now focus on that specific span.
+Clicking on it brings up a view like the one below:
+
+![Zipkin - Span Details](./images/zipkin2.png)
+
+This view shows detailed information about the specific span, such as the
+microservice or external URL that was called. In this example, the call to a
+Grafana URL takes the most time, and the investigation should focus on why
+that URL takes so long.
+
+## Autoscaler metrics
+
+If request metrics or traces do not show any obvious hot spots, or if they show
+that most of the time is spent in your own code, look at the autoscaler metrics
+next. To open the autoscaler dashboard, open the Grafana UI and select the
+"Knative Serving - Autoscaler" dashboard. This brings up a view like the one below:
+
+![Knative Serving - Autoscaler](./images/autoscaler_dash1.png)
+
+This view shows four key metrics from the Knative Serving autoscaler:
+
+* Actual pod count: number of pods that are running a given revision
+* Desired pod count: number of pods that the autoscaler thinks should serve the
+  revision
+* Requested pod count: number of pods that the autoscaler requested from Kubernetes
+* Panic mode: If 0, the autoscaler is operating in [stable mode](https://github.com/knative/serving/blob/master/docs/scaling/DEVELOPMENT.md#stable-mode).
+  If 1, the autoscaler is operating in [panic mode](https://github.com/knative/serving/blob/master/docs/scaling/DEVELOPMENT.md#panic-mode).
+
+If there is a large gap between the actual pod count and the requested pod count,
+the Kubernetes cluster is either unable to allocate new resources fast enough
+or is out of the requested resources.
+
+If there is a large gap between the requested pod count and the desired pod count,
+the Knative Serving autoscaler is likely unable to communicate with the
+Kubernetes master to make the request.
+
+In the example above, the autoscaler requested 18 pods to optimally serve the traffic
+but was only granted 8 pods because the cluster is out of resources.
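+
+When the cluster is out of resources, pods that were requested but not granted
+typically stay in the `Pending` phase. As a hedged sketch, they can be listed
+and inspected with standard `kubectl` commands (the `default` namespace is an
+assumption):
+
+```shell
+# List pods that have been created but are not running yet.
+kubectl get pods -n default --field-selector=status.phase=Pending
+
+# The Events section at the end of the output usually explains why a pod is not scheduled.
+kubectl describe pod <pod-name> -n default
+```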
+
+## CPU and memory usage
+
+You can access the total CPU and memory usage of your revision from the
+"Knative Serving - Revision CPU and Memory Usage" dashboard. Opening it brings up a
+view like the one below:
+
+![Knative Serving - Revision CPU and Memory Usage](./images/cpu_dash1.png)
+
+The first chart shows the rate of CPU usage across all pods serving the revision.
+The second chart shows the total memory consumed across all pods serving the revision.
+Both of these metrics are further divided into per-container usage:
+
+* user-container: This container runs the user code (application, function or container).
+* [istio-proxy](https://github.com/istio/proxy): Sidecar container to form an 
+[Istio](https://istio.io/docs/concepts/what-is-istio/overview.html) mesh.
+* queue-proxy: Knative Serving owned sidecar container to enforce request concurrency limits.
+* autoscaler: Knative Serving owned sidecar container to provide auto scaling for the revision.
+* fluentd-proxy: Sidecar container to collect logs from /var/log.
+
+## Profiling
+
+...To be filled...
--- a/serving/images/autoscaler_dash1.png
+++ b/serving/images/autoscaler_dash1.png
--- a/serving/images/cpu_dash1.png
+++ b/serving/images/cpu_dash1.png
--- a/serving/images/grafana1.png
+++ b/serving/images/grafana1.png
--- a/serving/images/kibana-discover-tab-annotated.png
+++ b/serving/images/kibana-discover-tab-annotated.png
--- a/serving/images/kibana-landing-page-configure-index.png
+++ b/serving/images/kibana-landing-page-configure-index.png
--- a/serving/images/request_dash1.png
+++ b/serving/images/request_dash1.png
--- a/serving/images/zipkin1.png
+++ b/serving/images/zipkin1.png
--- a/serving/images/zipkin2.png
+++ b/serving/images/zipkin2.png
--- a/serving/installing-logging-metrics-traces.md
+++ b/serving/installing-logging-metrics-traces.md
@ -0,0 +1,60 @@
+# Monitoring, Logging and Tracing Installation
+
+Knative Serving offers two different monitoring setups: one that uses Elasticsearch, Kibana, Prometheus and Grafana, and another that uses Stackdriver, Prometheus and Grafana. Installation instructions for both setups follow. Install only one of them; side-by-side installation of the two setups is not supported.
+
+## Elasticsearch, Kibana, Prometheus & Grafana Setup
+
+First run:
+
+```shell
+kubectl apply -R -f config/monitoring/100-common \
+    -f config/monitoring/150-elasticsearch-prod \
+    -f third_party/config/monitoring/common \
+    -f third_party/config/monitoring/elasticsearch \
+    -f config/monitoring/200-common \
+    -f config/monitoring/200-common/100-istio.yaml
+```
+
+Monitor the logging and monitoring components until all of them report `Running` or `Completed`:
+
+```shell
+kubectl get pods -n monitoring --watch
+```
+
+Press CTRL+C to stop watching once all components are ready.
+
+Next, create two indexes in Elasticsearch: one for application logs and one for request traces.
+To create the indexes, run `kubectl proxy` and open the Kibana Index Management UI at this [link](http://localhost:8001/api/v1/namespaces/monitoring/services/kibana-logging/proxy/app/kibana#/management/kibana/index)
+(*it might take a couple of minutes for the proxy to work the first time after the installation*).
+
+On the "Configure an index pattern" page, enter `logstash-*` in the `Index pattern` field, select `@timestamp`
+from `Time Filter field name` and click the `Create` button. See below for a screenshot:
+
+![Create logstash-* index](images/kibana-landing-page-configure-index.png)
+
+To create the second index, select the `Create Index Pattern` button at the top left of the page.
+Enter `zipkin*` in the `Index pattern` field, select `timestamp_millis` from `Time Filter field name`
+and click the `Create` button.
+
+Next, follow the instructions below to access logs, metrics and traces:
+
+* [Accessing Logs](./accessing-logs.md)
+* [Accessing Metrics](./accessing-metrics.md)
+* [Accessing Traces](./accessing-traces.md)
+
+## Stackdriver (logs), Prometheus & Grafana Setup
+
+If your Knative Serving is not built on a GCP-based cluster or you want to send logs to
+another GCP project, you need to build your own Fluentd image and modify the
+configuration first. See:
+
+1. [Fluentd image on Knative Serving](/image/fluentd/README.md)
+2. [Setting up a logging plugin](setting-up-a-logging-plugin.md)
+
+To install the components, run:
+
+```shell
+kubectl apply -R -f config/monitoring/100-common \
+    -f config/monitoring/150-stackdriver-prod \
+    -f third_party/config/monitoring/common \
+    -f config/monitoring/200-common \
+    -f config/monitoring/200-common/100-istio.yaml
+```