Document installation and usage of logging, monitoring and tracing components (#92)

* Document installation and usage of logging, monitoring and tracing components.

* Minor documentation changes.
Mustafa Demirhan 2018-07-10 12:39:19 -07:00 committed by GitHub
parent 924257ded3
commit 2e8418c97c
15 changed files with 470 additions and 0 deletions

serving/README.md Normal file

@ -0,0 +1,16 @@
# Serving
## Observability
* [Installing Logging, Metrics and Traces](./installing-logging-metrics-traces.md)
* [Accessing Logs](./accessing-logs.md)
* [Accessing Metrics](./accessing-metrics.md)
* [Accessing Traces](./accessing-traces.md)
* [Debugging Application Issues](./debugging-application-issues.md)
* [Debugging Performance Issues](./debugging-performance-issues.md)
## Networking
* [Setting up DNS](./DNS.md)
## Scaling

serving/accessing-logs.md Normal file

@ -0,0 +1,115 @@
# Accessing logs
If logging and monitoring components are not installed yet, go through the
[installation instructions](./installing-logging-metrics-traces.md) to set up the
necessary components first.
## Kibana and Elasticsearch
To open the Kibana UI (the visualization tool for [Elasticsearch](https://info.elastic.co)),
enter the following command:
```shell
kubectl proxy
```
This starts a local proxy to the Kubernetes API server on port 8001, through which
the Kibana UI is reachable. For security reasons, the Kibana UI is exposed only within
the cluster.
Navigate to the [Kibana UI](http://localhost:8001/api/v1/namespaces/monitoring/services/kibana-logging/proxy/app/kibana)
(*It might take a couple of minutes for the proxy to work*).
The Discover tab of the Kibana UI looks like this:
![Kibana UI Discover tab](./images/kibana-discover-tab-annotated.png)
You can change the time frame of logs Kibana displays in the upper right corner
of the screen. The main search bar is across the top of the Discover page.
As more logs are ingested, new fields will be discovered. To have them indexed,
go to Management > Index Patterns > Refresh button (on top right) > Refresh
fields.
<!-- TODO: create a video walkthrough of the Kibana UI -->
### Accessing configuration and revision logs
To access the logs for a configuration, enter the following search query in Kibana:
```text
kubernetes.labels.knative_dev\/configuration: "configuration-example"
```
Replace `configuration-example` with your configuration's name. Enter the following
command to get your configuration's name:
```shell
kubectl get configurations
```
To access logs for a revision, enter the following search query in Kibana:
```text
kubernetes.labels.knative_dev\/revision: "configuration-example-00001"
```
Replace `configuration-example-00001` with your revision's name.
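The two search queries above differ only in the label name and value, so they can be generated from the command line. The following is a minimal sketch; the function name `knative_log_query` is hypothetical and not part of Knative or Kibana. It only prints the query text, which you then paste into the Kibana search bar.

```shell
# knative_log_query KIND NAME
# KIND is "configuration" or "revision"; NAME is the object's name.
# Prints the Kibana search query for that object's logs.
# (Hypothetical helper, shown for illustration only.)
knative_log_query() {
  # '\\/' in the printf format emits the literal '\/' that the query needs.
  printf 'kubernetes.labels.knative_dev\\/%s: "%s"\n' "$1" "$2"
}

# Example: build the query for the latest revision of a configuration.
# knative_log_query revision "$(kubectl get configuration configuration-example \
#   -o jsonpath="{.status.latestCreatedRevisionName}")"
```

The commented example assumes the configuration reports `status.latestCreatedRevisionName`; running `kubectl get revisions` is another way to list revision names.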
### Accessing build logs
To access the logs for a build, enter the following search query in Kibana:
```text
kubernetes.labels.build\-name: "test-build"
```
Replace `test-build` with your build's name. The build name is specified in the `.yaml` file as follows:
```yaml
apiVersion: build.knative.dev/v1alpha1
kind: Build
metadata:
  name: test-build
```
### Accessing request logs
To access request logs, enter the following search query in Kibana:
```text
tag: "requestlog.logentry.istio-system"
```
Request logs contain details about requests served by the revision. Below is a sample request log:
```text
@timestamp July 10th 2018, 10:09:28.000
destinationConfiguration configuration-example
destinationNamespace default
destinationRevision configuration-example-00001
destinationService configuration-example-00001-service.default.svc.cluster.local
latency 1.232902ms
method GET
protocol http
referer unknown
requestHost route-example.default.example.com
requestSize 0
responseCode 200
responseSize 36
severity Info
sourceNamespace istio-system
sourceService unknown
tag requestlog.logentry.istio-system
traceId 986d6faa02d49533
url /
userAgent curl/7.60.0
```
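To narrow request logs to one revision or one response code, the `tag` query can be combined with fields from the sample entry using `AND`. The sketch below assumes that `destinationRevision` and `responseCode` have been indexed by Kibana (see the note on refreshing fields); the function name `request_log_query` is hypothetical.

```shell
# request_log_query REVISION [RESPONSE_CODE]
# Prints a Kibana query matching request logs for REVISION, optionally
# restricted to one HTTP response code. (Hypothetical helper.)
# The body runs in a subshell so that `query` does not leak into the caller.
request_log_query() (
  query='tag: "requestlog.logentry.istio-system"'
  query="$query AND destinationRevision: \"$1\""
  if [ -n "${2:-}" ]; then
    query="$query AND responseCode: $2"
  fi
  printf '%s\n' "$query"
)
```

For example, `request_log_query configuration-example-00001 200` prints a query that matches only successful requests served by that revision.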
### Accessing end to end request traces
See the [Accessing Traces](./accessing-traces.md) page for details.
## Stackdriver
Go to the [Google Cloud Console logging page](https://console.cloud.google.com/logs/viewer) for
the GCP project that stores your logs via Stackdriver.


@ -0,0 +1,27 @@
# Accessing metrics
To open the [Grafana](https://grafana.com/) UI (the visualization tool
for [Prometheus](https://prometheus.io/)), enter the following command:
```shell
kubectl port-forward -n monitoring $(kubectl get pods -n monitoring --selector=app=grafana --output=jsonpath="{.items..metadata.name}") 3000
```
This starts a local proxy of Grafana on port 3000. The Grafana UI is only exposed within
the cluster for security reasons.
Navigate to the Grafana UI at [http://localhost:3000](http://localhost:3000).
Select the `Home` button at the top of the page to see the list of pre-installed dashboards (screenshot below):
![Knative Dashboards](./images/grafana1.png)
The following dashboards are pre-installed with Knative Serving:
* **Revision HTTP Requests:** HTTP request count, latency and size metrics per revision and per configuration
* **Nodes:** CPU, memory, network and disk metrics at node level
* **Pods:** CPU, memory and network metrics at pod level
* **Deployment:** CPU, memory and network metrics aggregated at deployment level
* **Istio, Mixer and Pilot:** Detailed Istio mesh, Mixer and Pilot metrics
* **Kubernetes:** Dashboards giving insights into cluster health, deployments and capacity usage
To log in as an administrator and modify or add dashboards, sign in with username `admin` and password `admin`.
Make sure to change the password before exposing the Grafana UI outside the cluster.


@ -0,0 +1,21 @@
# Accessing request traces
If logging and monitoring components are not installed yet, go through the
[installation instructions](./installing-logging-metrics-traces.md) to set up the
necessary components first.
To open the Zipkin UI (the visualization tool for request traces), enter the following command:
```shell
kubectl proxy
```
This starts a local proxy to the Kubernetes API server on port 8001, through which
the Zipkin UI is reachable. For security reasons, the Zipkin UI is exposed only within
the cluster.
Navigate to the [Zipkin UI](http://localhost:8001/api/v1/namespaces/istio-system/services/zipkin:9411/proxy/zipkin/).
Click on "Find Traces" to see the latest traces. You can search for a trace ID
or look at traces of a specific application. Click on a trace to see a detailed
view of a specific call.
<!--TODO: Consider adding a video here. -->


@ -0,0 +1,130 @@
# Debugging Issues with Your Application
You deployed your app to Knative Serving, but it isn't working as expected.
Go through this step-by-step guide to understand what failed.
## Check command line output
Check your deploy command output to see whether it succeeded or not. If your
deployment process was terminated, there should be an error message showing up
in the output that describes the reason why the deployment failed.
This kind of failure is most likely due to either a misconfigured manifest or a
wrong command. For example, the following output says that the traffic percentages
of a route must sum to 100:
```
Error from server (InternalError): error when applying patch:
{"metadata":{"annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"serving.knative.dev/v1alpha1\",\"kind\":\"Route\",\"metadata\":{\"annotations\":{},\"name\":\"route-example\",\"namespace\":\"default\"},\"spec\":{\"traffic\":[{\"configurationName\":\"configuration-example\",\"percent\":50}]}}\n"}},"spec":{"traffic":[{"configurationName":"configuration-example","percent":50}]}}
to:
&{0xc421d98240 0xc421e77490 default route-example STDIN 0xc421db0488 264682 false}
for: "STDIN": Internal error occurred: admission webhook "webhook.knative.dev" denied the request: mutation failed: The route must have traffic percent sum equal to 100.
ERROR: Non-zero return code '1' from command: Process exited with status 1
```
## Check application logs
Knative Serving provides default out-of-the-box logs for your application.
Access your application logs as described on the [Accessing Logs](./accessing-logs.md) page.
## Check Route status
Run the following command to get the `status` of the `Route` object with which
you deployed your application:
```shell
kubectl get route <route-name> -o yaml
```
The `conditions` in `status` provide the reason if there is any failure. For
details, see Knative
[Error Conditions and Reporting](../spec/errors.md) (some of the conditions
are not implemented yet).
## Check Revision status
If you configure your `Route` with a `Configuration`, run the following
command to get the name of the `Revision` created for your deployment
(look up the configuration name in the `Route` .yaml file):
```shell
kubectl get configuration <configuration-name> -o jsonpath="{.status.latestCreatedRevisionName}"
```
If you configure your `Route` with `Revision` directly, look up the revision
name in the `Route` yaml file.
Then run:
```shell
kubectl get revision <revision-name> -o yaml
```
A ready `Revision` should have the following condition in `status`:
```yaml
conditions:
  - reason: ServiceReady
    status: "True"
    type: Ready
```
If you see this condition, check the following to continue debugging:
* [Check Pod status](#check-pod-status)
* [Check application logs](#check-application-logs)
* [Check Istio routing](#check-istio-routing)
If you see other conditions, to debug further:
* Look up the meaning of the conditions in Knative
[Error Conditions and Reporting](../spec/errors.md). Note: some of them
are not implemented yet. An alternative is to
[check Pod status](#check-pod-status).
* If you are using Build to deploy and the `BuildComplete` condition is not
`True`, [check Build status](#check-build-status).
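Instead of reading the full YAML, the `Ready` condition can be extracted with a JSONPath filter. This is a minimal sketch that assumes the `Revision` exposes `status.conditions` as shown above; the function name `revision_ready_status` is hypothetical.

```shell
# revision_ready_status REVISION_NAME
# Prints the status of the Ready condition ("True", "False" or "Unknown")
# for the given Revision. Prints nothing if the condition is not set yet.
# (Hypothetical helper; it wraps the `kubectl get revision` command.)
revision_ready_status() {
  kubectl get revision "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# Example:
# revision_ready_status configuration-example-00001
```

Any value other than `True` means the remaining conditions in `status` are worth inspecting with `kubectl get revision <revision-name> -o yaml`.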
## Check Pod status
To get the `Pod`s for all your deployments:
```shell
kubectl get pods
```
This should list all `Pod`s with brief status. For example:
```text
NAME READY STATUS RESTARTS AGE
configuration-example-00001-deployment-659747ff99-9bvr4 2/2 Running 0 3h
configuration-example-00002-deployment-5f475b7849-gxcht 1/2 CrashLoopBackOff 2 36s
```
Choose one and use the following command to see detailed information for its
`status`. Some useful fields are `conditions` and `containerStatuses`:
```shell
kubectl get pod <pod-name> -o yaml
```
If you see issues with the "user-container" container in `containerStatuses`, check your application logs as described in [Check application logs](#check-application-logs).
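When there are many pods, it can help to list only those whose containers are not all ready. The sketch below filters the table printed by `kubectl get pods`; the function name `unready_pods` is hypothetical, and it assumes the default column layout shown in the example output (NAME, READY, STATUS, ...).

```shell
# unready_pods
# Reads `kubectl get pods` table output on stdin and prints the name and
# status of every pod whose READY column (ready/total) is not complete.
# Note: pods of finished jobs (for example 0/1 Completed) are listed too.
# (Hypothetical helper, shown for illustration only.)
unready_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1, $3 }'
}

# Example:
# kubectl get pods | unready_pods
```

For a pod reported this way, `kubectl logs <pod-name> -c user-container` prints the output of the application container directly.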
## Check Build status
If you are using Build to deploy, run the following command to get the Build for
your `Revision`:
```shell
kubectl get build $(kubectl get revision <revision-name> -o jsonpath="{.spec.buildName}") -o yaml
```
The `conditions` in `status` provide the reason if there is any failure. To
access build logs, first execute `kubectl proxy` and then open [Kibana UI](http://localhost:8001/api/v1/namespaces/monitoring/services/kibana-logging/proxy/app/kibana).
Use any of the following filters within Kibana UI to
see build logs. _(See [telemetry guide](../telemetry.md) for more information on
logging and monitoring features of Knative Serving.)_
* All build logs: `_exists_:"kubernetes.labels.build-name"`
* Build logs for a specific build: `kubernetes.labels.build-name:"<BUILD NAME>"`
* Build logs for a specific build and step: `kubernetes.labels.build-name:"<BUILD NAME>" AND kubernetes.container_name:"build-step-<BUILD STEP NAME>"`


@ -0,0 +1,101 @@
# Investigating Performance Issues
You deployed your application or function to Knative Serving, but its performance
does not meet expectations. Knative Serving provides various dashboards and tools to
help investigate such issues. This document goes through these dashboards
and tools.
## Request metrics
Start your investigation with the "Revision - HTTP Requests" dashboard. To open this dashboard,
open the Grafana UI as described in [Accessing Metrics](./accessing-metrics.md) and navigate to
"Knative Serving - Revision HTTP Requests". Select your configuration and revision
from the menu at the top left of the page. You will see a page like the one below:
![Knative Serving - Revision HTTP Requests](./images/request_dash1.png)
This dashboard gives visibility into the following for each revision:
* Request volume
* Request volume per HTTP response code
* Response time
* Response time per HTTP response code
* Request and response sizes
This dashboard can show traffic volume or latency discrepancies between different revisions.
If, for example, a revision's latency is higher than that of other revisions, then
focus your investigation on the offending revision through the rest of this guide.
## Request traces
Next, look into request traces to find out where the time is spent for a single request.
To access request traces, open the Zipkin UI as described in [Accessing Traces](./accessing-traces.md).
Select your revision from the "Service Name" drop-down and click the "Find Traces" button.
This will bring up a view that looks like the one below:
![Zipkin - Trace Overview](./images/zipkin1.png)
In the example above, we can see that the request spent most of its time in the
[span](https://github.com/opentracing/specification/blob/master/specification.md#the-opentracing-data-model) right before the last.
Investigation should now be focused on that specific span.
Clicking on that span will bring up a view that looks like the one below:
![Zipkin - Span Details](./images/zipkin2.png)
This view shows detailed information about the specific span, such as the
microservice or external URL that was called. In this example, a call to a
Grafana URL is taking the most time, and investigation should focus on why
that URL is taking that long.
## Autoscaler metrics
If request metrics or traces do not show any obvious hot spots, or if they show
that most of the time is spent in your own code, look at autoscaler metrics
next. To open the autoscaler dashboard, open the Grafana UI and select the
"Knative Serving - Autoscaler" dashboard. This will bring up a view that looks like the one below:
![Knative Serving - Autoscaler](./images/autoscaler_dash1.png)
This view shows four key metrics from the Knative Serving autoscaler:
* Actual pod count: number of pods that are running a given revision
* Desired pod count: number of pods that the autoscaler thinks should serve the
revision
* Requested pod count: number of pods that the autoscaler requested from Kubernetes
* Panic mode: If 0, the autoscaler is operating in [stable mode](https://github.com/knative/serving/blob/master/docs/scaling/DEVELOPMENT.md#stable-mode).
If 1, the autoscaler is operating in [panic mode](https://github.com/knative/serving/blob/master/docs/scaling/DEVELOPMENT.md#panic-mode).
If there is a large gap between actual pod count and requested pod count, that
means that the Kubernetes cluster is unable to keep up allocating new
resources fast enough, or that the Kubernetes cluster is out of requested
resources.
If there is a large gap between requested pod count and desired pod count, that
is an indication that Knative Serving autoscaler is unable to communicate with
Kubernetes master to make the request.
In the example above, autoscaler requested 18 pods to optimally serve the traffic
but was only granted 8 pods because the cluster is out of resources.
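The two rules above can be written down as a small check over the three pod counts read from the dashboard. This is only a sketch: the function name `diagnose_pod_counts` is hypothetical, and the threshold for what counts as a "large gap" (default 2 pods here) is an assumption, not a value defined by Knative Serving.

```shell
# diagnose_pod_counts ACTUAL REQUESTED DESIRED [THRESHOLD]
# Prints a hint for each gap larger than THRESHOLD (assumed default: 2).
# The body runs in a subshell so its variables do not leak into the caller.
# (Hypothetical helper, shown for illustration only.)
diagnose_pod_counts() (
  actual=$1 requested=$2 desired=$3 threshold=${4:-2}
  if [ $((requested - actual)) -gt "$threshold" ]; then
    echo "cluster cannot allocate pods fast enough or is out of resources"
  fi
  if [ $((desired - requested)) -gt "$threshold" ]; then
    echo "autoscaler may be unable to reach the Kubernetes master"
  fi
)
```

With the numbers from the example above, and assuming the desired count equals the requested count, `diagnose_pod_counts 8 18 18` reports only the first hint.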
## CPU and memory usage
You can access total CPU and memory usage of your revision from
"Knative Serving - Revision CPU and Memory Usage" dashboard. Opening this will bring up a
view that looks like below:
![Knative Serving - Revision CPU and Memory Usage](./images/cpu_dash1.png)
The first chart shows the rate of CPU usage across all pods serving the revision.
The second chart shows total memory consumed across all pods serving the revision.
Both of these metrics are further divided into per-container usage.
* user-container: This container runs the user code (application, function or container).
* [istio-proxy](https://github.com/istio/proxy): Sidecar container to form an
[Istio](https://istio.io/docs/concepts/what-is-istio/overview.html) mesh.
* queue-proxy: Knative Serving owned sidecar container to enforce request concurrency limits.
* autoscaler: Knative Serving owned sidecar container to provide auto scaling for the revision.
* fluentd-proxy: Sidecar container to collect logs from /var/log.
## Profiling
...To be filled...

Binary file not shown. Size: 94 KiB

Binary file not shown. Size: 167 KiB

serving/images/grafana1.png Normal file
Binary file not shown. Size: 138 KiB

Binary file not shown. Size: 177 KiB

Binary file not shown. Size: 320 KiB

Binary file not shown. Size: 200 KiB

serving/images/zipkin1.png Normal file
Binary file not shown. Size: 66 KiB

serving/images/zipkin2.png Normal file
Binary file not shown. Size: 77 KiB


@ -0,0 +1,60 @@
# Monitoring, Logging and Tracing Installation
Knative Serving offers two different monitoring setups: one that uses Elasticsearch, Kibana, Prometheus and Grafana, and another that uses Stackdriver, Prometheus and Grafana. See below for installation instructions for these two setups. You can install only one of them; side-by-side installation of both is not supported.
## Elasticsearch, Kibana, Prometheus & Grafana Setup
First run:
```shell
kubectl apply -R -f config/monitoring/100-common \
-f config/monitoring/150-elasticsearch-prod \
-f third_party/config/monitoring/common \
-f third_party/config/monitoring/elasticsearch \
-f config/monitoring/200-common \
-f config/monitoring/200-common/100-istio.yaml
```
Monitor the logging and monitoring components until all of them report `Running` or `Completed`:
```shell
kubectl get pods -n monitoring --watch
```
Press CTRL+C to stop watching once all components are ready.
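As an alternative to watching the output by hand, the same check can be polled in a loop. This is a minimal sketch; the function name `pods_settled` is hypothetical, and it assumes the default `kubectl get pods` column layout with STATUS in the third column.

```shell
# pods_settled
# Reads `kubectl get pods` table output on stdin. Succeeds (exit status 0)
# only if at least one pod is listed and every pod reports Running or
# Completed in the STATUS column. (Hypothetical helper.)
pods_settled() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { bad = 1 }
       END { exit (bad || NR < 2) }'
}

# Example: poll every 5 seconds until the monitoring components are up.
# until kubectl get pods -n monitoring | pods_settled; do sleep 5; done
```

Note that a pod can report `Running` before all of its containers are ready, so the READY column is still worth a final look.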
Next, create two indexes in Elasticsearch: one for application logs and one for request traces.
To create the indexes, run `kubectl proxy` and open the Kibana Index Management UI at this [link](http://localhost:8001/api/v1/namespaces/monitoring/services/kibana-logging/proxy/app/kibana#/management/kibana/index)
(*it might take a couple of minutes for the proxy to work the first time after the installation*).
On the "Configure an index pattern" page, enter `logstash-*` in `Index pattern`, select `@timestamp`
from `Time Filter field name`, and click the `Create` button. See below for a screenshot:
![Create logstash-* index](images/kibana-landing-page-configure-index.png)
To create the second index, select the `Create Index Pattern` button at the top left of the page.
Enter `zipkin*` in `Index pattern`, select `timestamp_millis` from `Time Filter field name`,
and click the `Create` button.
Next, see the pages below to access logs, metrics and traces:
* [Accessing Logs](./accessing-logs.md)
* [Accessing Metrics](./accessing-metrics.md)
* [Accessing Traces](./accessing-traces.md)
## Stackdriver (logs), Prometheus & Grafana Setup
If your Knative Serving is not built on a GCP-based cluster or you want to send logs to
another GCP project, you need to build your own Fluentd image and modify the
configuration first. See:
1. [Fluentd image on Knative Serving](/image/fluentd/README.md)
2. [Setting up a logging plugin](setting-up-a-logging-plugin.md)

To install the components, run:
```shell
kubectl apply -R -f config/monitoring/100-common \
-f config/monitoring/150-stackdriver-prod \
-f third_party/config/monitoring/common \
-f config/monitoring/200-common \
-f config/monitoring/200-common/100-istio.yaml
```