From 839b2cf82222f9f1b799024264595fdce40564b5 Mon Sep 17 00:00:00 2001
From: Peter Delaney
Date: Wed, 18 Jul 2018 10:50:32 -0700
Subject: [PATCH] Copy edits to debugging-performance-issues.md (#181)

* Copy edits to debugging-performance-issues.md

* Wording tweaks

Add "the" before "Grafana UI"
Add "the" before "Kubernetes master"
---
 serving/debugging-performance-issues.md | 99 +++++++++++++------------
 1 file changed, 52 insertions(+), 47 deletions(-)

diff --git a/serving/debugging-performance-issues.md b/serving/debugging-performance-issues.md
index 75ae9789d..1a241f939 100644
--- a/serving/debugging-performance-issues.md
+++ b/serving/debugging-performance-issues.md
@@ -1,26 +1,29 @@
 # Investigating Performance Issues

 You deployed your application or function to Knative Serving but its performance
-is not up to the expectations. Knative Serving provides various dashboards and tools to
-help investigate such issues. This document goes through these dashboards
-and tools.
+doesn't meet your expectations. Knative Serving provides various dashboards and tools to
+help investigate such issues. This document reviews these dashboards and tools.

 ## Request metrics

-Start your investigation with "Revision - HTTP Requests" dashboard. To open this dashboard,
-open Grafana UI as described in [Accessing Metrics](./accessing-metrics.md) and navigate to
-"Knative Serving - Revision HTTP Requests". Select your configuration and revision
-from the menu on top left of the page. You will see a page like below:
+Start your investigation with the "Revision - HTTP Requests" dashboard.

-![Knative Serving - Revision HTTP Requests](./images/request_dash1.png)
+1. To open this dashboard, open the Grafana UI as described in
+   [Accessing Metrics](./accessing-metrics.md) and navigate to
+   "Knative Serving - Revision HTTP Requests".

-This dashboard gives visibility into the following for each revision:
+1. Select your configuration and revision from the menu at the top left of the page.
+   You will see a page like this:

-* Request volume
-* Request volume per HTTP response code
-* Response time
-* Response time per HTTP response code
-* Request and response sizes
+   ![Knative Serving - Revision HTTP Requests](./images/request_dash1.png)
+
+   This dashboard gives visibility into the following for each revision:
+
+   * Request volume
+   * Request volume per HTTP response code
+   * Response time
+   * Response time per HTTP response code
+   * Request and response sizes

 This dashboard can show traffic volume or latency discrepancies between different
-revisions. If, for example, a revision's latency is higher than others revisions, then
+revisions. If, for example, a revision's latency is higher than other revisions, then
@@ -29,59 +32,61 @@ focus your investigation on the offending revision through the rest of this guid

 ## Request traces

 Next, look into request traces to find out where the time is spent for a single request.
-To access request traces, open Zipkin UI as described in [Accessing Traces](./accessing-traces.md).
-Select your revision from the "Service Name" drop down and click on "Find Traces" button.
-This will bring up a view that looks like below:

-![Zipkin - Trace Overview](./images/zipkin1.png)

-In the example above, we can see that the request spent most of its time in the
-[span](https://github.com/opentracing/specification/blob/master/specification.md#the-opentracing-data-model) right before the last.
-Investigation should now be focused on that specific span.
+1. To access request traces, open the Zipkin UI as described in [Accessing Traces](./accessing-traces.md).
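+
+   For example, if Zipkin was installed into the `istio-system` namespace (the
+   default for a Knative Serving install that uses Istio) and is listening on its
+   default port 9411, you can forward a local port to it. This is only a sketch;
+   adjust the namespace and label to match your cluster:
+
+   ```shell
+   # Forward a local port to the Zipkin pod. Assumes the istio-system namespace
+   # and the app=zipkin label; change both to match your install.
+   kubectl port-forward --namespace istio-system \
+     $(kubectl get pods --namespace istio-system --selector=app=zipkin \
+       --output=jsonpath="{.items[0].metadata.name}") 9411:9411
+   ```
+
+   The Zipkin UI is then available at `http://localhost:9411`.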
-Clicking on that will bring up a view that looks like below:

-![Zipkin - Span Details](./images/zipkin2.png)

-This view shows detailed information about the specific span, such as the
-micro service or external URL that was called. In this example, call to a
-Grafana URL is taking the most time and investigation should focus on why
-that URL is taking that long.
+1. Select your revision from the "Service Name" dropdown, and then click the "Find Traces" button. You'll
+   get a view that looks like this:
+
+   ![Zipkin - Trace Overview](./images/zipkin1.png)
+
+   In this example, you can see that the request spent most of its time in the
+   [span](https://github.com/opentracing/specification/blob/master/specification.md#the-opentracing-data-model)
+   right before the last, so focus your investigation on that specific span.
+
+1. Click that span to see a view like the following:
+
+   ![Zipkin - Span Details](./images/zipkin2.png)
+
+   This view shows detailed information about the specific span, such as the
+   microservice or external URL that was called. In this example, the call to a
+   Grafana URL is taking the most time. Focus your investigation on why that URL
+   is taking that long.

 ## Autoscaler metrics

 If request metrics or traces do not show any obvious hot spots, or if they show
-that most of the time is spent in your own code, autoscaler metrics should be
-looked next. To open autoscaler dashboard, open Grafana UI and select
-"Knative Serving - Autoscaler" dashboard. This will bring up a view that looks like below:
+that most of the time is spent in your own code, look at autoscaler metrics next.

-![Knative Serving - Autoscaler](./images/autoscaler_dash1.png)
+1. To open the autoscaler dashboard, open the Grafana UI and select
+   the "Knative Serving - Autoscaler" dashboard, which looks like this:

-This view shows four key metrics from Knative Serving autoscaler:
+   ![Knative Serving - Autoscaler](./images/autoscaler_dash1.png)
+
+This view shows four key metrics from the Knative Serving autoscaler:

 * Actual pod count: # of pods that are running a given revision
-* Desired pod count: # of pods that autoscaler thinks that should serve the
- revision
-* Requested pod count: # of pods that autoscaler requested from Kubernetes
-* Panic mode: If 0, autoscaler is operating in [stable mode](https://github.com/knative/serving/blob/master/docs/scaling/DEVELOPMENT.md#stable-mode).
-If 1, autoscaler is operating in [panic mode](https://github.com/knative/serving/blob/master/docs/scaling/DEVELOPMENT.md#panic-mode).
+* Desired pod count: # of pods that the autoscaler thinks should serve the revision
+* Requested pod count: # of pods that the autoscaler requested from Kubernetes
+* Panic mode: If 0, the autoscaler is operating in [stable mode](https://github.com/knative/serving/blob/master/docs/scaling/DEVELOPMENT.md#stable-mode).
+If 1, the autoscaler is operating in [panic mode](https://github.com/knative/serving/blob/master/docs/scaling/DEVELOPMENT.md#panic-mode).

-If there is a large gap between actual pod count and requested pod count, that
-means that the Kubernetes cluster is unable to keep up allocating new
+A large gap between the actual pod count and the requested pod count
+indicates that the Kubernetes cluster is unable to keep up allocating new
 resources fast enough, or that the Kubernetes cluster is out of requested
 resources.
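+
+You can cross-check these numbers against the cluster itself. The following
+sketch assumes that the pods backing your revision carry the
+`serving.knative.dev/revision` label; replace `<revision-name>` with the name
+of your revision:
+
+```shell
+# Actual pod count: the pods currently backing the revision.
+kubectl get pods -l serving.knative.dev/revision=<revision-name>
+
+# Pods stuck in Pending with "Insufficient cpu" or "Insufficient memory" events
+# usually confirm that the cluster is out of requested resources.
+kubectl describe pods -l serving.knative.dev/revision=<revision-name>
+```
+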
-If there is a large gap between requested pod count and desired pod count, that
-is an indication that Knative Serving autoscaler is unable to communicate with
-Kubernetes master to make the request.
+A large gap between the requested pod count and the desired pod count indicates that
+the Knative Serving autoscaler is unable to communicate with the Kubernetes master to make
+the request.

-In the example above, autoscaler requested 18 pods to optimally serve the traffic
+In the preceding example, the autoscaler requested 18 pods to optimally serve the traffic
 but was only granted 8 pods because the cluster is out of resources.

 ## CPU and memory usage

 You can access total CPU and memory usage of your revision from
-"Knative Serving - Revision CPU and Memory Usage" dashboard. Opening this will bring up a
-view that looks like below:
+the "Knative Serving - Revision CPU and Memory Usage" dashboard, which looks like this:

 ![Knative Serving - Revision CPU and Memory Usage](./images/cpu_dash1.png)

@@ -89,11 +94,11 @@ The first chart shows rate of the CPU usage across all pods serving the revision
 The second chart shows total memory consumed across all pods serving the revision.
 Both of these metrics are further divided into per container usage.

-* user-container: This container runs the user code (application, function or container).
+* user-container: This container runs the user code (application, function, or container).
 * [istio-proxy](https://github.com/istio/proxy): Sidecar container to form an
 [Istio](https://istio.io/docs/concepts/what-is-istio/overview.html) mesh.
-* queue-proxy: Knative Serving owned sidecar container to enforce request concurrency limits.
-* autoscaler: Knative Serving owned sidecar container to provide auto scaling for the revision.
+* queue-proxy: Knative Serving-owned sidecar container to enforce request concurrency limits.
+* autoscaler: Knative Serving-owned sidecar container to provide autoscaling for the revision.
 * fluentd-proxy: Sidecar container to collect logs from /var/log.

 ## Profiling
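+
+Before reaching for a profiler, you can get a quick per-container snapshot of the
+same CPU and memory numbers from the command line. This is a sketch that assumes
+the cluster's metrics add-on (Heapster or metrics-server) is installed and that
+the revision's pods carry the `serving.knative.dev/revision` label:
+
+```shell
+# Per-container CPU and memory usage for the pods backing the revision.
+kubectl top pods --containers -l serving.knative.dev/revision=<revision-name>
+```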