From ecd85f38eb136a0de703d4f2c9f5eea3197ae346 Mon Sep 17 00:00:00 2001
From: Mustafa Demirhan <4033879+mdemirhan@users.noreply.github.com>
Date: Thu, 5 Apr 2018 10:54:53 -0700
Subject: [PATCH] 2018 roadmap for Monitoring and Logging (#521)

Proposed 2018 roadmap for monitoring and logging.
---
 roadmap/monitoring.md | 90 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)
 create mode 100644 roadmap/monitoring.md

diff --git a/roadmap/monitoring.md b/roadmap/monitoring.md
new file mode 100644
index 000000000..1fc8a009a
--- /dev/null
+++ b/roadmap/monitoring.md
@@ -0,0 +1,90 @@
+# 2018 Roadmap for Monitoring and Logging
+
+This document captures what we hope to accomplish in 2018 in the monitoring and logging areas for Elafros.
+
+## Overview
+We will provide distinct experiences for [operator personas](../product/personas.md#operator-personas),
+[developer personas](../product/personas.md#developer-personas) and [contributors](../product/personas.md#contributors).
+
+### Operator Capabilities
+* Provide default collection of cluster logs and metrics from infrastructure components such as Kubernetes.
+* Provide default dashboards and interfaces for viewing cluster logs and metrics.
+* Auto-scale, upgrade and maintain the default logging, metrics, alerting and tracing backends.
+* Operators can set custom alerts on cluster events.
+* Operators can fine-tune the scale, performance and features of the default logging, metrics, alerting and tracing backends.
+* Operators can retrieve a list of all components emitting logs or metrics using a CLI.
+* Operators can "tail" logs and metrics using a CLI for a specific component.
+* Operators can install extensions that forward logs and metrics to different backends (e.g. Stackdriver).
+
+### Developer Capabilities
+* Provide default collection of logs, metrics, and request traces.
+* Provide default dashboards and interfaces for viewing logs, metrics and traces, and for setting alerts on them.
+* Developers can set custom application and function alerts.
+* Developers can create shared dashboards for logs and metrics for applications and functions.
+* Developers can retrieve a list of all components they have access to that are emitting logs or metrics using a CLI.
+* Developers can "tail" logs and metrics using a CLI for any component they have access to.
+
+### Contributor Capabilities
+* Contributors can write extensions that translate logs and metrics into the format
+required by different logging and metrics stores (e.g. Stackdriver).
+
+## Basics
+### Milestones: M3 and M4
+In this phase, we will enable a shared infrastructure where everyone has access to all data.
+No persona-specific experience or access will be provided.
+
+The following items will be installed and secured in a cluster by default,
+but we will provide the ability to replace or remove these in a later milestone.
+* Prometheus
+* Alertmanager
+* Prometheus Operator
+* Grafana
+* Elasticsearch
+* Kibana
+* Zipkin
+* Fluentd
+
+Logs from the following locations will be collected:
+* stderr & stdout for all application and function containers
+* Build logs
+
+The following metrics will be collected:
+* Envoy, Istio Mixer (per-request metrics), Istio Pilot
+* Node- and pod-level metrics (CPU, memory, disk and network)
+* Elafros controller metrics
+
+Request traces from the Istio proxy, user applications and user functions will be collected by Zipkin.
+
+## Developer Contracts
+### Milestones: M4 and M5
+In this phase, we will define and implement features for the developer persona.
+* [M4 & M5] Define and implement developer contracts for logging, metrics, alerting and tracing.
+* [M4] Write step-by-step guidelines for developers to debug issues throughout the lifecycle of their applications and functions.
+* [M4] Provide developer samples written in Go. Support for other languages will come in a later phase.
+* [M5] Implement the developer CLI to list components and tail logs, metrics and traces.
+
+## Operator Contracts
+### Milestones: M6 and M7
+In this phase, we will define and implement features for the operator persona.
+* [M6 & M7] Define and implement operator contracts.
+* [M6] Write step-by-step guidelines for operators to debug issues in the cluster.
+* [M7] Deploy operator-specific instances of the default backends to separate operator access from developer access.
+* [M7] Implement the operator CLI to list components and tail logs and metrics.
+
+## Contributor Contracts
+### Milestones: M8
+In this phase, we will define and implement the features for the contributor persona.
+* [M8] Define and implement contracts for plugging in custom logging, metrics, alerting and tracing backends.
+We will not provide maintenance, rollout processes, etc. for third-party monitoring, logging, or tracing extensions,
+though we may maintain a "contrib" directory for such contributions.
+* [M8] Add an extension for one managed solution (e.g. Stackdriver).
+
+## M9 and Onwards
+* Allow namespace-specific instances of the default backends for namespace-level access control.
+* Implement auto-scaling of the default backends.
+* Implement upgrading of the default backends.
+* Implement maintenance of the default backends (data retention, daily index creation, etc.).
+* Provide developer samples written in Node.js, Java, Python, PHP, .NET and Ruby.
+
+## Out of Scope for 2018
+* Improving the underlying logging, monitoring, and tracing systems to support multi-tenancy.