Autoscaler metrics and performance investigation guide (#789)

Fixes #578 and #493
* Added a document titled "Investigating Performance Issues" - this document guides users through debugging application performance issues and shows how they can use the observability features offered by Elafros to identify such issues.
* Added metrics and a dashboard for the autoscaler component and documented how to use them in the guide above.
Mustafa Demirhan 2018-05-01 15:01:55 -07:00 committed by GitHub
parent 9cf3c0879c
commit 501e8cca82
6 changed files with 94 additions and 0 deletions

(5 binary image files added: dashboard and trace screenshots, 94 KiB, 167 KiB, 200 KiB, 66 KiB, and 77 KiB; not shown.)

@@ -0,0 +1,94 @@
# Investigating Performance Issues
You deployed your application or function to Elafros, but its performance
is not meeting your expectations. Elafros provides various dashboards and tools to
help investigate such issues. This document walks through those dashboards
and tools.
## Request metrics
Start your investigation with the "Revision - HTTP Requests" dashboard. To open this dashboard,
open the Grafana UI as described in [telemetry.md](../telemetry.md) and navigate to
"Elafros - Revision HTTP Requests". Select your configuration and revision
from the menu at the top left of the page. You will see a page like the one below:
![Elafros - Revision HTTP Requests](images/request_dash1.png)
This dashboard gives visibility into the following for each revision:
* Request volume
* Request volume per HTTP response code
* Response time
* Response time per HTTP response code
* Request and response sizes
This dashboard can reveal traffic volume or latency discrepancies between revisions.
If, for example, one revision's latency is higher than the other revisions',
focus the rest of your investigation on that offending revision.
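These dashboards are backed by Prometheus (see [telemetry.md](../telemetry.md)), so you can also query the request metrics programmatically. Below is a minimal Python sketch that pulls a revision's request rate over the Prometheus HTTP API; the port-forwarded address and the `revision_request_count` metric and `destination_revision` label names are assumptions for illustration and may differ in your installation.
```python
# Minimal sketch: request rate for a revision via the Prometheus HTTP API.
# Assumes Prometheus is port-forwarded to localhost:9090; the metric and
# label names below are illustrative placeholders, not confirmed names.
import requests

PROMETHEUS = "http://localhost:9090"

def request_rate(revision, window="5m"):
    # rate() converts the cumulative request counter into requests/second.
    query = ('sum(rate(revision_request_count{destination_revision="%s"}[%s]))'
             % (revision, window))
    resp = requests.get(PROMETHEUS + "/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

print(request_rate("my-revision-00001"))
```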
## Request traces
Next, look at request traces to find out where time is spent for a single request.
To access request traces, open the Zipkin UI as described in [telemetry.md](../telemetry.md).
Select your revision from the "Service Name" drop-down and click the "Find Traces" button.
This will bring up a view like the one below:
![Zipkin - Trace Overview](images/zipkin1.png)
In the example above, we can see that the request spent most of its time in the
[span](https://github.com/opentracing/specification/blob/master/specification.md#the-opentracing-data-model) right before the last.
The investigation should now focus on that specific span.
Clicking on it brings up a view like the one below:
![Zipkin - Span Details](images/zipkin2.png)
This view shows detailed information about the specific span, such as the
microservice or external URL that was called. In this example, a call to a
Grafana URL takes the most time, and the investigation should focus on why
that URL takes so long.
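Zipkin also exposes an HTTP API, which can be handy for scanning many traces for slow spans instead of clicking through them one by one. Below is a minimal Python sketch against Zipkin's v1 API; the local address and the service name are assumptions to adapt to your setup.
```python
# Minimal sketch: surface the slowest spans for a revision via the Zipkin
# v1 HTTP API, assuming Zipkin is port-forwarded to localhost:9411.
import requests

ZIPKIN = "http://localhost:9411"

def slowest_spans(service, trace_limit=20, top=5):
    resp = requests.get(ZIPKIN + "/api/v1/traces",
                        params={"serviceName": service, "limit": trace_limit})
    resp.raise_for_status()
    # Each trace is a list of spans; span durations are in microseconds.
    spans = [span for trace in resp.json() for span in trace]
    spans.sort(key=lambda s: s.get("duration", 0), reverse=True)
    return [(s["name"], s.get("duration", 0) / 1000.0) for s in spans[:top]]

for name, ms in slowest_spans("my-revision-00001"):
    print("%8.1f ms  %s" % (ms, name))
```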
## Autoscaler metrics
If request metrics or traces do not show any obvious hot spots, or if they show
that most of the time is spent in your own code, look at autoscaler metrics
next. To open the autoscaler dashboard, open the Grafana UI and select the
"Elafros - Autoscaler" dashboard. This will bring up a view like the one below:
![Elafros - Autoscaler](images/autoscaler_dash1.png)
This view shows four key metrics from Elafros autoscaler:
* Actual pod count: # of pods that are running the given revision
* Desired pod count: # of pods that the autoscaler thinks should serve the
revision
* Requested pod count: # of pods that the autoscaler requested from Kubernetes
* Panic mode: If 0, the autoscaler is operating in [stable mode](../../pkg/autoscaler/README.md#stable-mode).
If 1, the autoscaler is operating in [panic mode](../../pkg/autoscaler/README.md#panic-mode).
A large gap between actual pod count and requested pod count means that the
Kubernetes cluster is unable to allocate new resources fast enough, or that it
has run out of the requested resources.
A large gap between requested pod count and desired pod count indicates that
the Elafros autoscaler is unable to communicate with the Kubernetes master to
make the request.
In the example above, the autoscaler requested 18 pods to optimally serve the traffic
but was granted only 8 pods because the cluster is out of resources.
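To watch for these gaps without opening Grafana, you can compare the three pod-count series directly in Prometheus. The sketch below does that; the metric and label names (`autoscaler_actual_pods`, `autoscaler_desired_pods`, `autoscaler_requested_pods`, `revision`) are illustrative stand-ins, not confirmed names, so substitute whatever your installation exports.
```python
# Minimal sketch: compare the three pod-count series for one revision.
# The metric and label names below are illustrative placeholders; check
# which names your autoscaler actually exports to Prometheus.
import requests

PROMETHEUS = "http://localhost:9090"

def instant(query):
    resp = requests.get(PROMETHEUS + "/api/v1/query", params={"query": query})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

selector = '{revision="my-revision-00001"}'
actual = instant("autoscaler_actual_pods" + selector)
desired = instant("autoscaler_desired_pods" + selector)
requested = instant("autoscaler_requested_pods" + selector)

if requested > actual:
    print("cluster is %d pods behind the request" % (requested - actual))
if desired > requested:
    print("autoscaler could not request %d needed pods" % (desired - requested))
```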
## CPU and memory usage
You can see the total CPU and memory usage of your revision on the
"Elafros - Revision CPU and Memory Usage" dashboard. Opening it brings up a
view like the one below:
![Elafros - Revision CPU and Memory Usage](images/cpu_dash1.png)
The first chart shows the rate of CPU usage across all pods serving the revision.
The second chart shows the total memory consumed across all pods serving the revision.
Both of these metrics are further broken down into per-container usage (see the sketch after this list):
* ela-container: This container runs the user code (application, function, or container).
* [istio-proxy](https://github.com/istio/proxy): Sidecar container that forms the
[Istio](https://istio.io/docs/concepts/what-is-istio/overview.html) mesh.
* queue-proxy: Elafros-owned sidecar container that enforces request concurrency limits.
* autoscaler: Elafros-owned sidecar container that provides autoscaling for the revision.
* fluentd-proxy: Sidecar container that collects logs from /var/log.
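The per-container breakdown comes from cAdvisor metrics scraped by Prometheus. As a rough programmatic equivalent, the sketch below sums CPU usage per container across a revision's pods using the standard `container_cpu_usage_seconds_total` counter; the namespace and pod-name pattern are assumptions to adapt to your deployment.
```python
# Minimal sketch: per-container CPU usage (in cores) for a revision's pods,
# using the standard cAdvisor counter container_cpu_usage_seconds_total.
# The namespace and pod-name pattern are placeholders for your deployment.
import requests

PROMETHEUS = "http://localhost:9090"

query = ('sum by (container_name) '
         '(rate(container_cpu_usage_seconds_total'
         '{namespace="default", pod_name=~"my-revision-00001.*"}[5m]))')

resp = requests.get(PROMETHEUS + "/api/v1/query", params={"query": query})
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    name = series["metric"].get("container_name", "<none>")
    print("%-20s %.3f cores" % (name, float(series["value"][1])))
```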
## Profiling
...To be filled...