mirror of https://github.com/knative/docs.git
Move scaling documentation to docs (#1170)
* Add scaling folder.
* Move autoscaling roadmap.
* Pointer to the new scaling docs folder.
* Update README.md
* Fix autoscaler.go link.
This commit is contained in:
parent f976fde906
commit fe22ba2887
@ -0,0 +1,56 @@
# 2018 Autoscaling Roadmap

This is what we hope to accomplish in 2018.

## References

[Autoscaling Design Goals](README.md#design-goals):

1. *Make it fast*
2. *Make it light*
3. *Make everything better*

In 2018 we will focus primarily on making autoscaling correct, fast and light.

## Areas of Interest and Requirements

1. **Correctness**. When scaling from 0-to-1, 1-to-N and back down, error rates must not increase. We must have visibility of correctness over time at small and large scales.
2. **Performance**. When scaling from 1-to-N and back down, autoscaling must maintain reasonable latency. The Knative Serving implementation of autoscaling must be competitive in its ability to serve variable load.
3. **Scale to zero**. Idle ([Reserve](README.md#behavior)) Revisions must cost nothing. Reserve Revisions must serve the first request in 1 second or less.
4. **Development**. Autoscaler development must follow a clear roadmap. Getting started as a developer must be easy and the team must scale horizontally.
5. **Integration**. Autoscaler should be pluggable and support multiple strategies and workloads.

### Correctness

1. **Write autoscaler end-to-end tests** to cover low-scale regressions, runnable by individual developers before checkin. ([#420](https://github.com/knative/serving/issues/420))
2. **Test error rates at high scale** to cover regressions at larger scales (~1000 QPS and ~1000 clients). ([#421](https://github.com/knative/serving/issues/421))
3. **Test error rates around idle states** to cover various scale-to-zero edge cases. ([#422](https://github.com/knative/serving/issues/422))

### Performance

1. **Establish canonical load test scenarios** to prove autoscaler performance and guide development. We need to establish the code to run, the request load to generate, and the performance expected. This will tell us where we need to improve.
2. **Reproducible load tests** which can be run by anyone with minimal setup. These must be transparent and easy to run. They must be meaningful tests which prove autoscaler performance. ([#424](https://github.com/knative/serving/pull/424))
3. **Vertical pod autoscaling** to allow revisions to adapt to the differing requirements of users' code. Rather than applying the same resource requests to all revision deployments, use vertical pod autoscaler (or something) to adjust the resources required. Resource requirements for one revision should be inherited by the next revision so there isn't always a learning period with new revisions.

### Scale to Zero

1. **Implement scale to zero**. Revisions should automatically scale to zero after a period of inactivity, and scale back up from zero to one with the first request. The first requests should succeed (even if latency is high) and then latency should return to normal levels once the revision is active.
2. **Reduce Reserve Revision start time** from 4 seconds to 1 second or less. Maybe we can keep some resources around like the Replica Set or pre-cache images so that Pods are spun up faster.

### Development

1. **Decouple autoscaling from revision controller** to allow the two to evolve separately. The revision controller should not be setting up the autoscaler deployment and service.

### Integration

1. **Autoscaler multitenancy** will allow the autoscaler to remain "always on". It will also reduce the overhead of running one single-tenant autoscaler pod per revision.
2. **Consume custom metrics API** as an abstraction to allow pluggability. This may require another metrics aggregation component to get the queue-proxy produced concurrency metrics behind a custom metrics API. It will also allow autoscaling based on Prometheus metrics through an adapter. A good acceptance criterion is the ability to plug in the vanilla Horizontal Pod Autoscaler (HPA) in lieu of the Knative Serving autoscaler (minus scale-to-zero capabilities).
3. **Autoscale queue-based workloads** in addition to request/reply workloads. The first use-case is the integration of Riff autoscaling into the multitenant autoscaler. The autoscaler must be able to select the appropriate strategy for scaling the revision.

## What We Are Not Doing Yet

These are things we are explicitly leaving off the roadmap. But we might do exploratory work to set them up for later development. Most of these are related to Design Goal #3: *make everything better*.

1. **Use Envoy for single-threading** instead of using the Queue Proxy to enforce serialization of requests to the application container. This only applies in single-threaded mode. It would allow us to remove the Queue Proxy entirely. But it would probably require feature work in Envoy/Istio.
2. **Remove metrics reporting from the Queue Proxy** in order to rely on a common, Knative Serving metrics pipeline. This could mean polling the Pods to get the same metrics as are reported to Prometheus. Or going to Prometheus to get the metrics it has aggregated. It means removing the metrics push from the [Queue Proxy to the Autoscaler](README.md#context).
3. **[Slow Brain](README.md#slow-brain--fast-brain) implementation** to automatically adjust target concurrency to the application's behavior. Instead, we can rely on vertical pod autoscaling for now to size the pod to an average of one request at a time.
@ -0,0 +1,109 @@
# Autoscaling

Knative Serving Revisions are automatically scaled up and down according to incoming traffic.

## Definitions

* Knative Serving **Revision** -- a custom resource which is a running snapshot of the user's code (in a Container) and configuration.
* Knative Serving **Route** -- a custom resource which exposes Revisions to clients via an Istio ingress rule.
* Kubernetes **Deployment** -- a k8s resource which manages the lifecycle of individual Pods running Containers. One of these runs the user code of each Revision.
* Knative Serving **Autoscaler** -- another k8s Deployment (one per Revision) running a single Pod which watches request load on the Pods running user code. It increases and decreases the size of the Deployment running the user code in order to compensate for higher or lower traffic load.
* Knative Serving **Activator** -- a k8s Deployment running a single, multi-tenant Pod (one per Cluster for all Revisions) which catches requests for Revisions with no Pods. It brings up Pods running user code (via the Revision controller) and forwards caught requests.
* **Concurrency** -- the number of requests currently being served at a given moment. More QPS or higher latency means more concurrent requests (for example, a steady 10 requests per second at 500 ms average latency works out to about 5 concurrent requests).

## Behavior

Revisions have three autoscaling states:

1. **Active** when they are actively serving requests,
2. **Reserve** when they are scaled down to 0 Pods but are still in service, and
3. **Retired** when they will no longer receive traffic.

When a Revision is actively serving requests it will increase and decrease the number of Pods to maintain the desired average concurrent requests per Pod. When requests are no longer being served, the Revision will be put in the Reserve state. When the first request arrives, the Revision is put back in the Active state, and the request is queued until the Revision becomes ready.
In the Active state, each Revision has a Deployment which maintains the desired number of Pods. It also has an Autoscaler which watches traffic metrics and adjusts the Deployment's desired number of pods up and down. Each Pod reports its number of concurrent requests each second to the Autoscaler (how many clients are connected at that moment).

In the Reserve state, the Revision has no scheduled Pods and consumes no CPU. The Istio route rule for the Revision points to the single multi-tenant Activator, which will catch traffic for all Reserve Revisions. When the Activator catches a request for a Reserve Revision, it will flip the Revision to an Active state and then forward requests to the Revision when it is ready.

In the Retired state, the Revision has no provisioned resources. No requests will be served for the Revision.
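
For orientation, the three states can be modeled as a simple enumeration. This is only an illustration of the prose above; the actual type and constant names in the Revision API may differ.

```go
package serving

// RevisionServingState is a hypothetical stand-in for the field that records a
// Revision's autoscaling state; the names mirror the prose above, not the real API.
type RevisionServingState string

const (
	// Active: the Revision has Pods and is serving requests.
	RevisionServingStateActive RevisionServingState = "Active"
	// Reserve: scaled down to 0 Pods; traffic is routed to the Activator.
	RevisionServingStateReserve RevisionServingState = "Reserve"
	// Retired: no provisioned resources; no traffic will be served.
	RevisionServingStateRetired RevisionServingState = "Retired"
)
```
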
## Context
```
    +---------------------+
    |        ROUTE        |
    |                     |
    |   +-------------+   |
    |   | Istio Route |-------------+
    |   +-------------+   |         |
    |         |           |         |
    +---------|-----------+         |
              |                     |
              |                     |
              |  inactive           |  active
              |  route              |  route
              |                     |
              |                     |
              |             +-------|-------------------------------+
              V       watch |       V                               |
        +-----------+ first |    +------+  create   +------------+  |
        | Activator |----------->| Pods |<----------| Deployment |  |
        +-----------+       |    +------+           +------------+  |
              |             |       |                   ^           |
              |  activate   |       |                   | resize    |
              +------------>|       |                   |           |
                            |       |  metrics    +------------+   |
                            |       +------------>| Autoscaler |   |
                            |                     +------------+   |
                            |  REVISION                           |
                            +---------------------------------------+
```
## Design Goals

1. **Make it fast**. Revisions should be able to scale from 0 to 1000 concurrent requests in 30 seconds or less.
2. **Make it light**. Wherever possible the system should be able to figure out the right thing to do without the user's intervention or configuration.
3. **Make everything better**. Creating custom components is a short-term strategy to get something working now. The long-term strategy is to make the underlying components better so that custom code can be replaced with configuration. E.g. the Autoscaler should be replaced with the K8s [Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) and [Custom Metrics](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-custom-metrics).

### Slow Brain / Fast Brain

The Knative Serving Autoscaler is split into two parts:

1. **Fast Brain** that maintains the desired level of concurrent requests per Pod (satisfying [Design Goal #1](#design-goals)), and the
2. **Slow Brain** that comes up with the desired level based on CPU, memory and latency statistics (satisfying [Design Goal #2](#design-goals)).

## Fast Brain Implementation

This is subject to change as the Knative Serving implementation changes.

### Code

* [Autoscaler Library](../../pkg/autoscaler/autoscaler.go)
* [Autoscaler Binary](../../cmd/autoscaler/main.go)
* [Queue Proxy Binary](../../cmd/queue/main.go)

### Autoscaler
There is a proxy in the Knative Serving Pods (`queue-proxy`) which is responsible for enforcing request queue parameters (single or multi threaded), and reporting concurrent client metrics to the Autoscaler. If we can get rid of this and just use [Envoy](https://www.envoyproxy.io/docs/envoy/latest/), that would be great (see [Design Goal #3](#design-goals)). The Knative Serving controller injects the identity of the Revision into the queue proxy environment variables. When the queue proxy wakes up, it will find the Autoscaler for the Revision and establish a websocket connection. Every 1 second, the queue proxy pushes a gob serialized struct with the observed number of concurrent requests at that moment.
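
As a rough illustration of that reporting path, here is a minimal sketch of a queue proxy pushing one gob-encoded concurrency sample per second. It is not the actual knative/serving code: the `Stat` fields, the autoscaler address, and the use of `gorilla/websocket` are all assumptions.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"log"
	"time"

	"github.com/gorilla/websocket"
)

// Stat is a hypothetical stand-in for the struct the queue proxy reports;
// the real field names in knative/serving may differ.
type Stat struct {
	PodName            string
	ConcurrentRequests int32
	Time               time.Time
}

// currentConcurrency would be backed by the proxy's request-counting middleware.
func currentConcurrency() int32 { return 0 }

func main() {
	// The autoscaler address would normally be derived from the environment
	// variables injected by the controller; this URL is a placeholder.
	conn, _, err := websocket.DefaultDialer.Dial("ws://autoscaler.example:8080", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	for range time.Tick(1 * time.Second) {
		stat := Stat{
			PodName:            "revision-pod-1",
			ConcurrentRequests: currentConcurrency(),
			Time:               time.Now(),
		}
		var buf bytes.Buffer
		if err := gob.NewEncoder(&buf).Encode(stat); err != nil {
			log.Println("encode:", err)
			continue
		}
		if err := conn.WriteMessage(websocket.BinaryMessage, buf.Bytes()); err != nil {
			log.Println("send:", err)
			return
		}
	}
}
```
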
The Autoscaler is also given the identity of the Revision through environment variables. When it wakes up, it starts a websocket-enabled http server. Queue proxies start sending their metrics to the Autoscaler and it maintains a 60-second sliding window of data points. The Autoscaler has two modes of operation, Panic Mode and Stable Mode.
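
One way to picture the 60-second window is a simple buffer of timestamped samples. The sketch below is illustrative only; the types, and the assumption that samples arrive in time order, are mine rather than the knative/serving implementation.

```go
package autoscaler

import "time"

// sample is one concurrency data point reported by a queue proxy.
type sample struct {
	t     time.Time
	value float64 // observed concurrent requests for one Pod at time t
}

// slidingWindow keeps samples covering the most recent windowSize (e.g. 60s).
// It assumes samples are recorded in non-decreasing time order.
type slidingWindow struct {
	windowSize time.Duration
	samples    []sample
}

// Record adds a data point and drops anything older than the window.
func (w *slidingWindow) Record(t time.Time, v float64) {
	w.samples = append(w.samples, sample{t: t, value: v})
	cutoff := t.Add(-w.windowSize)
	i := 0
	for i < len(w.samples) && w.samples[i].t.Before(cutoff) {
		i++
	}
	w.samples = w.samples[i:]
}

// Average returns the mean concurrency per Pod over the window (0 if empty).
func (w *slidingWindow) Average() float64 {
	if len(w.samples) == 0 {
		return 0
	}
	var sum float64
	for _, s := range w.samples {
		sum += s.value
	}
	return sum / float64(len(w.samples))
}
```
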
#### Stable Mode
In Stable Mode the Autoscaler adjusts the size of the Deployment to achieve the desired average concurrency per Pod (currently [hardcoded](https://github.com/knative/serving/blob/c4a543ecce61f5cac96b0e334e57db305ff4bcb3/cmd/autoscaler/main.go#L36), later provided by the Slow Brain). It calculates the observed concurrency per Pod by averaging all data points over the 60-second window. When it adjusts the size of the Deployment, it bases the desired Pod count on the number of observed Pods in the metrics stream, not the number of Pods in the Deployment spec. This is important to keep the Autoscaler from running away (there is a delay between when the Pod count is increased and when new Pods come online to serve requests and provide a metrics stream).
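
In other words, the stable-mode target is roughly the total observed concurrency divided by the desired per-Pod concurrency, computed against the Pod count actually seen in the metrics stream. A hedged sketch follows; the function name, rounding, and edge-case handling are assumptions.

```go
package autoscaler

import "math"

// desiredPodsStable computes a Stable Mode Pod count. observedAvgConcurrency is
// the mean of all data points in the 60-second window (per Pod), observedPods is
// the number of distinct Pods that reported metrics, and targetConcurrency is
// the desired average per Pod (hardcoded to 1.0 today).
func desiredPodsStable(observedAvgConcurrency, targetConcurrency float64, observedPods int) int {
	if observedPods == 0 || targetConcurrency <= 0 {
		return 0
	}
	// Total concurrency across the Pods we can actually see, divided by the
	// per-Pod target. Basing this on observed Pods (not the Deployment spec)
	// keeps the Autoscaler from running away while new Pods are still starting.
	total := observedAvgConcurrency * float64(observedPods)
	return int(math.Ceil(total / targetConcurrency))
}
```
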
#### Panic Mode
The Autoscaler evaluates its metrics every 2 seconds. In addition to the 60-second window, it also keeps a 6-second window (the panic window). If the 6-second average concurrency reaches 2 times the desired average, then the Autoscaler transitions into Panic Mode. In Panic Mode the Autoscaler bases all its decisions on the 6-second window, which makes it much more responsive to sudden increases in traffic. Every 2 seconds it adjusts the size of the Deployment to achieve the stable, desired average (or a maximum of 10 times the current observed Pod count, whichever is smaller). To prevent rapid fluctuations in the Pod count, the Autoscaler will only increase Deployment size during Panic Mode, never decrease. 60 seconds after the last Panic Mode increase to the Deployment size, the Autoscaler transitions back to Stable Mode and begins evaluating the 60-second window again.
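
A rough sketch of that decision loop follows, reusing `desiredPodsStable` from the previous sketch. The 2x threshold, the 10x cap, and the 60-second timer come from the text; everything else (names, struct layout) is hypothetical.

```go
package autoscaler

import "time"

const (
	panicThresholdFactor = 2.0  // 6-second average reaching 2x the target triggers panic
	maxScaleUpFactor     = 10.0 // never ask for more than 10x the observed Pod count
	panicModeHold        = 60 * time.Second
)

type scaler struct {
	target        float64   // desired average concurrency per Pod
	panicking     bool      // currently in Panic Mode
	lastPanicGrow time.Time // last time Panic Mode increased the Deployment size
	observedPods  int       // Pods currently seen in the metrics stream
}

// evaluate runs every 2 seconds with the 60-second and 6-second window averages
// and returns the desired Deployment size.
func (s *scaler) evaluate(now time.Time, stableAvg, panicAvg float64) int {
	if panicAvg >= panicThresholdFactor*s.target {
		s.panicking = true
	} else if s.panicking && now.Sub(s.lastPanicGrow) > panicModeHold {
		// 60 seconds after the last Panic Mode increase, return to Stable Mode.
		s.panicking = false
	}

	if !s.panicking {
		return desiredPodsStable(stableAvg, s.target, s.observedPods)
	}

	// Panic Mode: base the decision on the 6-second window, cap growth at 10x
	// the observed Pod count, and never scale down.
	want := desiredPodsStable(panicAvg, s.target, s.observedPods)
	if limit := int(maxScaleUpFactor * float64(s.observedPods)); want > limit {
		want = limit
	}
	if want > s.observedPods {
		s.lastPanicGrow = now
		return want
	}
	return s.observedPods
}
```
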
#### Deactivation
When the Autoscaler has observed an average concurrency per Pod of 0.0 for some time ([#305](https://github.com/knative/serving/issues/305)), it will transition the Revision into the Reserve state. This causes the Deployment and the Autoscaler to be turned down (or scaled to 0) and routes all traffic for the Revision to the Activator.
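
A minimal sketch of that check, with the idle timeout left as a parameter since the exact duration is still being decided in [#305](https://github.com/knative/serving/issues/305):

```go
package autoscaler

import "time"

// shouldReserve reports whether the Revision should be flipped to the Reserve
// state: the windowed average concurrency has been exactly zero since zeroSince
// and has stayed there for at least idleTimeout. The name and signature are
// illustrative, not the actual knative/serving code.
func shouldReserve(avgConcurrency float64, zeroSince, now time.Time, idleTimeout time.Duration) bool {
	return avgConcurrency == 0 && !zeroSince.IsZero() && now.Sub(zeroSince) >= idleTimeout
}
```
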
### Activator
The Activator is a single multi-tenant component that catches traffic for all Reserve Revisions. It is responsible for activating the Revisions and then proxying the caught requests to the appropriate Pods. It would be preferable to have a hook in Istio to do this so we can get rid of the Activator (see [Design Goal #3](#design-goals)). When the Activator gets a request for a Reserve Revision, it calls the Knative Serving control plane to transition the Revision to an Active state. It will take a few seconds for all the resources to be provisioned, so more requests might arrive at the Activator in the meantime. The Activator establishes a watch for Pods belonging to the target Revision. Once the first Pod comes up, all enqueued requests are proxied to that Pod. Concurrently, the Knative Serving control plane will update the Istio route rules to take the Activator back out of the serving path.
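
The enqueue-then-proxy behavior might look roughly like the following. This is a hedged sketch, not the actual Activator code: the control-plane call, the Pod watch callback, and the per-Revision bookkeeping are all stand-ins.

```go
package activator

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"sync"
)

// revisionActivator buffers requests for one Reserve Revision until a Pod is ready.
type revisionActivator struct {
	mu         sync.Mutex
	endpoint   *url.URL        // first ready Pod; nil until activation completes
	activating bool            // whether the control plane has already been asked
	pending    []chan *url.URL // requests parked while waiting for an endpoint
}

// activateRevision is a placeholder for the call that asks the Knative Serving
// control plane to move the Revision to the Active state.
func activateRevision(revision string) {}

func (a *revisionActivator) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	a.mu.Lock()
	ep := a.endpoint
	var wait chan *url.URL
	if ep == nil {
		if !a.activating {
			a.activating = true
			go activateRevision(r.Host) // hypothetical control-plane call
		}
		wait = make(chan *url.URL, 1)
		a.pending = append(a.pending, wait)
	}
	a.mu.Unlock()

	if ep == nil {
		ep = <-wait // released by podReady once the watch sees the first Pod
	}
	httputil.NewSingleHostReverseProxy(ep).ServeHTTP(w, r)
}

// podReady is called by a Pod watch when the first Pod for the Revision is up;
// it releases every parked request to be proxied to that Pod.
func (a *revisionActivator) podReady(ep *url.URL) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.endpoint = ep
	for _, ch := range a.pending {
		ch <- ep
	}
	a.pending = nil
}
```
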
## Slow Brain Implementation
*Currently the Slow Brain is not implemented and the desired concurrency level is hardcoded at 1.0 ([code](https://github.com/knative/serving/blob/7f1385cb88ca660378f8afcc78ad4bfcddd83c47/cmd/autoscaler/main.go#L36)).*
@ -0,0 +1,4 @@
# The OWNERS file is used by prow to automatically merge approved PRs.
approvers:
- josephburnett
@ -0,0 +1,3 @@
# Knative Serving Scaling
TODO: write developer/operator facing documentation.