Autoscaler metrics and performance investigation guide (#789)
Fixes #578 and #493

* Added a document titled "Investigating Performance Issues" - this document will guide users through debugging application performance issues and will show how they can use the observability features offered by Elafros to identify such issues.
* Added metrics and a dashboard for the autoscaler component and documented how it is used in the guide above.
# Investigating Performance Issues

You deployed your application or function to Elafros but its performance is not
meeting your expectations. Elafros provides various dashboards and tools to help
you investigate such issues. This document walks through these dashboards and
tools.

## Request metrics

Start your investigation with the "Revision - HTTP Requests" dashboard. To open this
dashboard, open the Grafana UI as described in [telemetry.md](../telemetry.md) and
navigate to "Elafros - Revision HTTP Requests".

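If Grafana is not already reachable from your machine, the sketch below port-forwards
it to localhost. The `monitoring` namespace and the `app=grafana` label are assumptions
about a typical monitoring install; [telemetry.md](../telemetry.md) has the authoritative
command for your setup.

```bash
# Forward the Grafana pod's HTTP port to localhost:3000.
# Namespace and label selector are assumptions; adjust them to match your install.
kubectl port-forward -n monitoring \
  $(kubectl get pods -n monitoring -l app=grafana \
      -o jsonpath='{.items[0].metadata.name}') \
  3000:3000
```

Grafana is then available at http://localhost:3000.
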
"Elafros - Revision HTTP Requests". Select your configuration and revision
|
||||||
|
from the menu on top left of the page. You will see a page like below:
|
||||||
|
|
||||||
|


This dashboard gives visibility into the following for each revision:

* Request volume
* Request volume per HTTP response code
* Response time
* Response time per HTTP response code
* Request and response sizes

This dashboard can show traffic volume or latency discrepancies between different revisions.
If, for example, a revision's latency is higher than that of other revisions, then
focus your investigation on the offending revision through the rest of this guide.

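The same comparison can be made outside of Grafana by querying Prometheus directly.
Below is a minimal sketch that compares 95th percentile response time per revision.
The Prometheus service name and namespace, the metric name
`revision_request_latencies_bucket`, and the `revision` label are all assumptions;
substitute the names your installation actually exposes.

```bash
# Port-forward Prometheus (service name and namespace are assumptions).
kubectl port-forward -n monitoring svc/prometheus 9090:9090 &

# 95th percentile response time per revision over the last 5 minutes.
# The metric and label names are hypothetical; use the ones your dashboards query.
curl -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (revision, le) (rate(revision_request_latencies_bucket[5m])))'
```
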
## Request traces

Next, look into request traces to find out where the time is spent for a single request.
To access request traces, open the Zipkin UI as described in [telemetry.md](../telemetry.md).

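If the Zipkin UI is not already reachable, the sketch below port-forwards it to
localhost. The `istio-system` namespace and the `app=zipkin` label are assumptions;
[telemetry.md](../telemetry.md) has the authoritative command for your setup.

```bash
# Forward the Zipkin pod's HTTP port to localhost:9411.
# Namespace and label selector are assumptions; adjust them to match your install.
kubectl port-forward -n istio-system \
  $(kubectl get pods -n istio-system -l app=zipkin \
      -o jsonpath='{.items[0].metadata.name}') \
  9411:9411
```

The Zipkin UI is then available at http://localhost:9411.
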
Select your revision from the "Service Name" drop-down and click the "Find Traces" button.
This will bring up a view like the one below:



In the example above, we can see that the request spent most of its time in the
[span](https://github.com/opentracing/specification/blob/master/specification.md#the-opentracing-data-model) right before the last.
Investigation should now focus on that specific span.
Clicking on it will bring up a view like the one below:



This view shows detailed information about the specific span, such as the
microservice or external URL that was called. In this example, a call to a
Grafana URL is taking the most time, and the investigation should focus on why
that URL is taking so long.

## Autoscaler metrics

If request metrics or traces do not show any obvious hot spots, or if they show
that most of the time is spent in your own code, look at the autoscaler metrics
next. To open the autoscaler dashboard, open the Grafana UI and select the
"Elafros - Autoscaler" dashboard. This will bring up a view like the one below:



This view shows four key metrics from the Elafros autoscaler:

* Actual pod count: # of pods that are running a given revision
* Desired pod count: # of pods that the autoscaler thinks should serve the
revision
* Requested pod count: # of pods that the autoscaler requested from Kubernetes
* Panic mode: If 0, the autoscaler is operating in [stable mode](../../pkg/autoscaler/README.md#stable-mode).
If 1, the autoscaler is operating in [panic mode](../../pkg/autoscaler/README.md#panic-mode).


If there is a large gap between the actual pod count and the requested pod count,
the Kubernetes cluster is unable to allocate new resources fast enough, or the
cluster has run out of the requested resources.

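To confirm that the cluster is out of resources, look for revision pods stuck in
`Pending` and for `FailedScheduling` events, for example with the sketch below
(run against the namespace your revision pods are deployed in):

```bash
# List pods that the scheduler has not been able to place yet.
kubectl get pods --field-selector=status.phase=Pending

# Show scheduling failures and their reasons (e.g. insufficient CPU or memory).
kubectl get events --field-selector=reason=FailedScheduling
```
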
If there is a large gap between the requested pod count and the desired pod count,
that indicates that the Elafros autoscaler is unable to communicate with the
Kubernetes master to make the request.

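In that case, the autoscaler's own logs are the next place to look. A minimal sketch,
assuming the container is named `autoscaler` as in the container list under
"CPU and memory usage" below, and with `<revision-pod-name>` as a placeholder:

```bash
# Check the autoscaler's logs for errors talking to the Kubernetes API server.
kubectl logs <revision-pod-name> -c autoscaler | grep -i error
```
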
In the example above, the autoscaler requested 18 pods to optimally serve the traffic
but was only granted 8 pods because the cluster is out of resources.

## CPU and memory usage

You can access the total CPU and memory usage of your revision from the
"Elafros - Revision CPU and Memory Usage" dashboard. Opening it will bring up a
view like the one below:



The first chart shows the rate of CPU usage across all pods serving the revision.
The second chart shows the total memory consumed across all pods serving the revision.
Both of these metrics are further divided into per-container usage; a kubectl sketch
for spot-checking the same numbers follows the list below.

* ela-container: This container runs the user code (application, function, or container).
* [istio-proxy](https://github.com/istio/proxy): Sidecar container to form an
[Istio](https://istio.io/docs/concepts/what-is-istio/overview.html) mesh.
* queue-proxy: Elafros-owned sidecar container to enforce request concurrency limits.
* autoscaler: Elafros-owned sidecar container to provide autoscaling for the revision.
* fluentd-proxy: Sidecar container to collect logs from /var/log.
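
The same per-container numbers can be spot-checked from the command line, provided
the cluster's metrics pipeline (e.g. Heapster or metrics-server) is running. The pod
name below is a placeholder; list your revision's pods first if you do not know it.

```bash
# Show CPU and memory usage broken down by container for one revision pod.
kubectl top pod <revision-pod-name> --containers
```
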
## Profiling
...To be filled...