# 2018 Roadmap for Monitoring and Logging
This document captures what we hope to accomplish in 2018 in the monitoring and logging areas for Elafros.
## Overview
We will provide distinct experiences for [operator personas](../product/personas.md#operator-personas),
[developer personas](../product/personas.md#developer-personas) and [contributors](../product/personas.md#contributors).
### Operator Capabilities
* Provide default collection of cluster logs and metrics from infrastructure components such as Kubernetes.
* Provide default dashboards and interfaces for viewing cluster logs and metrics.
* Auto-scale, upgrade and maintain the default logging, metrics, alerting and tracing backends.
* Operators can set custom alerts on cluster events.
* Operators can fine-tune the scale, performance and features of the default logging, metrics, alerting and tracing backends.
* Operators can retrieve a list of all components emitting logs or metrics using a CLI.
* Operators can "tail" logs and metrics using a CLI for a specific component.
* Operators can install extensions that forward logs and metrics to different backends (e.g. Stackdriver).
### Developer Capabilities
* Provide default collection of logs, metrics, and request traces.
* Provide default dashboards and interfaces for viewing logs, metrics and traces, and for setting alerts on the same.
* Developers can set custom application and function alerts.
* Developers can create shared dashboards for logs and metrics for applications and functions.
* Developers can retrieve a list of all components they have access to that are emitting logs and/or metrics using a CLI.
* Developers can "tail" logs and metrics using a CLI for any component they have access to.
### Contributor Capabilities
* Contributors can write extensions that translate logs and metrics into the formats
required by different logging and metrics stores (e.g. Stackdriver).
## Basics
### Milestones: M3 and M4
In this phase, we will enable a shared infrastructure where everyone has access to all data.
No persona-specific experience or access will be provided.
The following items will be installed and secured in a cluster by default,
but we will provide the ability to replace or remove these in a later milestone.
* Prometheus
* Alertmanager
* Prometheus Operator
* Grafana
* Elasticsearch
* Kibana
* Zipkin
* Fluentd
Logs from the following locations will be collected:
* stderr & stdout for all application and function containers
* Build logs
The following metrics will be collected:
* Envoy, Istio Mixer (per request metrics), Istio Pilot
* Node-level and pod-level metrics (CPU, memory, disk and network)
* Elafros controller metrics
Request traces from the Istio proxy, user applications and user functions will be collected by Zipkin.
## Developer Contracts
### Milestones: M4 and M5
In this phase, we will define and implement features for the developer persona.
* [M4 & M5] Define and implement developer contracts for logging, metrics, alerting and tracing.
* [M4] Write step-by-step guidelines for developers to debug issues throughout the lifecycle of their applications and functions.
* [M4] Provide developer samples written in Golang. Support for other languages will come in a later phase.
* [M5] Implement the developer CLI to list components and tail logs, metrics and traces.
## Operator Contracts
### Milestones: M6 and M7
In this phase, we will define and implement features for the operator persona.
* [M6 & M7] Define and implement operator contracts.
* [M6] Write step-by-step guidelines for operators to debug issues in the cluster.
* [M7] Deploy operator-specific instances of the default backends to separate operator access from developer access.
* [M7] Implement the operator CLI to list components and tail logs and metrics.
## Contributor Contracts
### Milestones: M8
In this phase, we will define and implement the features for the contributor persona.
* [M8] Define and implement contracts for plugging in custom logging, metrics, alerting and tracing backends.
We will not provide maintenance, rollout processes, etc. for third-party monitoring, logging, or tracing extensions,
though we may maintain a "contrib" directory for such contributions.
* [M8] Add an extension for one managed solution (e.g. Stackdriver).
## M9 and Onwards
* Allow namespace-specific instances of the default backends for namespace-level access control.
* Implement auto-scaling of the default backends.
* Implement upgrading of the default backends.
* Implement maintenance of the default backends (data retention, daily index creation, etc.).
* Provide developer samples written in Node.js, Java, Python, PHP, .NET and Ruby.
## Out of Scope for 2018
* Improving the underlying logging, monitoring, and tracing systems to support multi-tenancy.