docs snapshot for crossplane version `master`

2020-02-06 22:19:35 +00:00 · 2020-02-06 22:19:35 +00:00 · b04855fdc9
parent 64498953b3
commit b04855fdc9
1 changed files with 197 additions and 0 deletions
--- a/docs/master/observability-developer-guide.md
+++ b/docs/master/observability-developer-guide.md
@ -0,0 +1,197 @@
+---
+title: Observability Developer Guide
+toc: true
+weight: 730
+indent: true
+---
+# Observability Developer Guide
+
+## Introduction
+
+Observability is crucial to Crossplane users, both those operating Crossplane
+and those using Crossplane to operate their infrastructure. Crossplane currently
+approaches observability via Kubernetes events and structured logs. Time-series
+metrics are desired but [not yet implemented].
+
+## Goals
+
+In short, a non-admin user and an admin user should both be able to debug any
+issues only by inspecting logs and events. There should be no need to rebuild
+the Crossplane binary or to reach out to a Crossplane developer.
+
+A user should be able to:
+
+* Debug an issue without rebuilding the Crossplane binary
+* Understand an issue without contacting a cluster admin
+* Ask a cluster admin to check the logs for more details about the reason the
+  issue happened, if the details are not part of the error message
+
+A cluster admin should be able to:
+
+* Debug an issue without rebuilding the Crossplane binary
+* Debug an issue only by looking at the logs
+* Debug an issue without needing to contact a Crossplane developer
+
+## Error reporting in the logs
+
+Error reporting in the logs is mostly intended for consumption by Crossplane
+cluster admins. A cluster admin should be able to debug any issue by inspecting
+the logs, without needing to add more logs themselves or contact a Crossplane
+developer. This means that logs should contain:
+
+* Error messages, at either the info or debug level as contextually appropriate
+* Any context leading up to an error, typically at debug level, so that the
+  errors can be debugged
+
+## Error reporting as events
+
+Error reporting as Kubernetes events is primarily aimed toward end-users of
+Crossplane who are not cluster admins. Crossplane typically runs as a Kubernetes
+pod, and thus it is unlikely that most users of Crossplane will have access to
+its logs. [Events], on the other hand, are available as top-level Kubernetes
+objects, and show up on the objects they relate to when running
+`kubectl describe`.
+
+Events should be recorded in the following cases:
+
+* A significant operation is taken on a resource
+* The state of a resource is changed
+* An error occurs
+
+The events recorded in these cases can be thought of as forming an event log of
+things that happen for the resources that Crossplane manages. Each event should
+refer back to the relevant controller and resource, and use other fields of the
+Event kind as appropriate.
+
+Examples of how to interact with events can be found in the guide to
+[debugging an application cluster].
+
+## Choosing between methods of error reporting
+
+There are many ways to report errors, such as:
+
+* Metrics
+* Events
+* Logging
+* Tracing
+
+It can be confusing to figure out which one is appropriate in a given situation.
+This section will try to offer advice and a mindset that can be used to help
+make this decision.
+
+Let's set the context by listing the different user scenarios where error
+reporting may be consumed. Here are the typical scenarios as we imagine them:
+
+1. A person **using** a system needs to figure out why things aren't working as
+   expected, and whether they made a mistake that they can correct.
+2. A person **operating** a service needs to monitor the service's **health**,
+   both now and historically.
+3. A person **debugging** a problem which happened in a **live environment**
+   (often an **operator** of the system) needs information to figure out what
+   happened.
+4. A person **developing** the software wants to **observe** what is happening.
+5. A person **debugging** the software in a **development environment**
+   (typically a **developer** of the system) wants to debug a problem (there is
+   a lot of overlap between this and the live environment debugging scenario).
+
+The goal is to satisfy the users in all of the scenarios. We'll refer to the
+scenarios by number.
+
+The short version is: we should do whatever satisfies all of the scenarios.
+Logging and events are the recommendations for satisfying the scenarios,
+although they don't cover scenario 2.
+
+The longer version is:
+
+* Scenario 1 is best served by events in the context of Crossplane, since the
+  users may not have access to read logs or metrics, and even if they did, it
+  would be hard to relate them back to the event the user is trying to
+  understand.
+* Scenario 2 is best served by metrics, because they can be aggregated and
+  understood as a whole, and because they can be used to track things over
+  time.
+* Scenario 3 is best served by logging that contains all the information about
+  and leading up to the event. Request-tracing systems are also useful for this
+  scenario.
+* Scenario 4 is usually logs, maybe at a more verbose level than normal. But it
+  could be an attached debugger or some other type of tool. It could also be a
+  test suite.
+* Scenario 5 is usually either logs, up to the highest imaginable verbosity, or
+  an attached debugging session. If there's a gap in reporting, it could involve
+  adding some print statements to get more logging.
+
+As for the question of how to decide whether to log or not, we believe it helps
+to try to visualize which of the scenarios the error or information in question
+will be used for. We recommend starting with reporting as much information as
+possible, but with configurable runtime behavior so that, for example, debugging
+logs don't show up in production normally.
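+
+As a minimal sketch of configurable runtime behavior, the Go program below uses
+a hypothetical `Logger` type whose `debug` field would be set at startup, for
+example from a command-line flag. The type and its methods are assumptions made
+for illustration; they are not the crossplane-runtime `logging` API.
+
+```go
+package main
+
+import (
+	"fmt"
+	"io"
+	"os"
+	"strings"
+)
+
+// Logger is a hypothetical two-level logger used only for illustration.
+type Logger struct {
+	out   io.Writer
+	debug bool // decided at runtime, for example from a command-line flag
+}
+
+// Info always writes the message.
+func (l Logger) Info(msg string, keysAndValues ...interface{}) {
+	l.write("info", msg, keysAndValues...)
+}
+
+// Debug writes the message only when debug logging is enabled.
+func (l Logger) Debug(msg string, keysAndValues ...interface{}) {
+	if !l.debug {
+		return
+	}
+	l.write("debug", msg, keysAndValues...)
+}
+
+// write formats the level, the message, and any complete key/value pairs.
+func (l Logger) write(level, msg string, keysAndValues ...interface{}) {
+	var b strings.Builder
+	fmt.Fprintf(&b, "level=%s msg=%q", level, msg)
+	for i := 0; i+1 < len(keysAndValues); i += 2 {
+		fmt.Fprintf(&b, " %v=%v", keysAndValues[i], keysAndValues[i+1])
+	}
+	fmt.Fprintln(l.out, b.String())
+}
+
+func main() {
+	prod := Logger{out: os.Stdout, debug: false}
+	// Suppressed, because debug logging is off in this configuration.
+	prod.Debug("observed external resource", "name", "example")
+	// Always written.
+	prod.Info("cannot start controller", "error", "boom")
+}
+```
+
+The call sites report as much as they can, and the decision about what reaches
+the log output is deferred to configuration rather than to a rebuild.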
+
+For the question of what constitutes an error, errors should be actionable by a
+human. See the [Dave Cheney article] on this topic for some more discussion.
+
+## In Practice
+
+Crossplane provides two observability libraries as part of crossplane-runtime:
+
+* [`event`] emits Kubernetes events.
+* [`logging`] produces structured logs. Refer to its package documentation for
+  additional context on its API choices.
+
+Keep the following in mind when using the above libraries:
+
+* [Do] [not] use package level loggers or event recorders. Instantiate them in
+  `main()` and plumb them down to where they're needed.
+* Each [`Reconciler`] implementation should use its own `logging.Logger` and
+  `event.Recorder`. Implementations are strongly encouraged to default to using
+  `logging.NewNopLogger()` and `event.NewNopRecorder()`, and accept a functional
+  logger and recorder via variadic options. See for example the [managed
+  resource reconciler].
+* Each controller should use its name as its event recorder's name, and include
+  its name under the `controller` structured logging key. The controller's name
+  should be of the form `controllertype/resourcekind`, for example
+  `managed/cloudsqlinstance` or `stacks/stackdefinition`. Controller names
+  should always be lowercase.
+* Logs and events should typically be emitted by the `Reconcile` method of the
+  `Reconciler` implementation; not by functions called by `Reconcile`. Author
+  the methods orchestrated by `Reconcile` as if they were a library; prefer
+  surfacing useful information for the `Reconciler` to log (for example by
+  [wrapping errors]) over plumbing loggers and event recorders down to
+  increasingly deeper layers of code.
+* Almost nothing is worth logging at info level. When deciding which logging
+  level to use, consider a production deployment of Crossplane reconciling tens
+  or hundreds of managed resources. If in doubt, pick debug. You can easily
+  increase the log level later if it proves warranted.
+* The above is true even for errors; consider the audience. Is this an error
+  only the Crossplane cluster operator can fix? Does it indicate a significant
+  degradation of Crossplane's functionality? If so, log it at info. If the error
+  pertains to a single Crossplane resource, emit an event instead.
+* Always log errors under the structured logging key `error` (e.g.
+  `log.Debug("boom!", "error", err)`). Many logging implementations (including
+  Crossplane's) add context like stack traces for this key.
+* Emit events liberally; they're rate limited and deduplicated.
+* Follow [API conventions] when emitting events; ensure event reasons are unique
+  and `CamelCase`.
+* Prefer emitting events and logs when a terminal condition is encountered
+  (e.g. `Reconcile` returns) rather than logging logic flow. That is, prefer one
+  log line that reads "encountered an error fooing the bar" over two log lines
+  that read "about to foo the bar" and "encountered an error". Recall that if
+  the audience is a developer debugging Crossplane, they will be provided a
+  stack trace with file and line context when an error is logged.
+* Consider including the `reconcile.Request`, and the resource's UID and
+  resource version (not API version) under the keys `request`, `uid`, and
+  `version`. Doing so allows log readers to determine what specific version of a
+  resource the log pertains to.
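+
+Several of the points above can be sketched together in one self-contained Go
+program. Everything in it is an assumption made for illustration: the `Logger`
+and `Recorder` interfaces are simplified stand-ins rather than the
+crossplane-runtime `logging` and `event` APIs, and `ControllerName`,
+`WithLogger`, `WithRecorder`, and `syncResource` are hypothetical names. The
+sketch also wraps errors with the standard library's `fmt.Errorf` and `%w`
+instead of `github.com/pkg/errors`, to stay dependency free.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+	"strings"
+)
+
+// Logger and Recorder are simplified stand-ins, not crossplane-runtime's API.
+type Logger interface {
+	Debug(msg string, keysAndValues ...interface{})
+}
+
+type Recorder interface {
+	Event(eventType, reason, message string)
+}
+
+type nopLogger struct{}
+
+func (nopLogger) Debug(string, ...interface{}) {}
+
+type nopRecorder struct{}
+
+func (nopRecorder) Event(string, string, string) {}
+
+// ControllerName builds a lowercase controllertype/resourcekind name.
+func ControllerName(controllerType, kind string) string {
+	return strings.ToLower(controllerType + "/" + kind)
+}
+
+// Reconciler is a hypothetical reconciler with no-op observability defaults.
+type Reconciler struct {
+	log    Logger
+	record Recorder
+}
+
+// Option configures a Reconciler.
+type Option func(*Reconciler)
+
+// WithLogger replaces the default no-op logger.
+func WithLogger(l Logger) Option { return func(r *Reconciler) { r.log = l } }
+
+// WithRecorder replaces the default no-op event recorder.
+func WithRecorder(e Recorder) Option { return func(r *Reconciler) { r.record = e } }
+
+// NewReconciler applies options over the no-op defaults.
+func NewReconciler(opts ...Option) *Reconciler {
+	r := &Reconciler{log: nopLogger{}, record: nopRecorder{}}
+	for _, o := range opts {
+		o(r)
+	}
+	return r
+}
+
+// syncResource is written like a library: it wraps and returns the error
+// instead of logging it. It always fails so the error path is exercised.
+func syncResource(name string) error {
+	return fmt.Errorf("cannot sync %s: %w", name, errors.New("connection refused"))
+}
+
+// Reconcile is the single place where logs and events are emitted.
+func (r *Reconciler) Reconcile(request string) error {
+	if err := syncResource(request); err != nil {
+		r.log.Debug("Cannot sync resource", "error", err, "request", request)
+		r.record.Event("Warning", "CannotSync", err.Error())
+		return err
+	}
+	return nil
+}
+
+// printLogger prints each entry with the controller name attached.
+type printLogger struct{ controller string }
+
+func (p printLogger) Debug(msg string, kv ...interface{}) {
+	fmt.Println(append([]interface{}{msg, "controller", p.controller}, kv...)...)
+}
+
+func main() {
+	name := ControllerName("managed", "CloudSQLInstance")
+	r := NewReconciler(WithLogger(printLogger{controller: name}))
+	_ = r.Reconcile("example")
+}
+```
+
+Because the defaults are no-ops, `NewReconciler()` with no options is safe to
+call in tests, while `main()` decides which logger and recorder are plumbed in.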
+
+Finally, when in doubt, aim for consistency with existing Crossplane controller
+implementations.
+
+[not yet implemented]: https://github.com/crossplaneio/crossplane/issues/314
+[Events]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#event-v1-core
+[debugging an application cluster]: https://kubernetes.io/docs/tasks/debug-application-cluster/
+[Dave Cheney article]: https://dave.cheney.net/2015/11/05/lets-talk-about-logging
+[`event`]: https://godoc.org/github.com/crossplaneio/crossplane-runtime/pkg/event
+[`logging`]: https://godoc.org/github.com/crossplaneio/crossplane-runtime/pkg/logging
+[Do]: https://peter.bourgon.org/go-best-practices-2016/#logging-and-instrumentation
+[not]: https://dave.cheney.net/2017/01/23/the-package-level-logger-anti-pattern
+[`Reconciler`]: https://godoc.org/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler
+[managed resource reconciler]: https://github.com/crossplaneio/crossplane-runtime/blob/a6bb0/pkg/reconciler/managed/reconciler.go#L436
+[wrapping errors]: https://godoc.org/github.com/pkg/errors#Wrap
+[API conventions]: https://github.com/kubernetes/community/blob/09f55c6/contributors/devel/sig-architecture/api-conventions.md#events