docs snapshot for crossplane version `master`

This commit is contained in:
Crossplane 2020-02-06 22:19:35 +00:00
parent 64498953b3
commit b04855fdc9
1 changed files with 197 additions and 0 deletions

View File

@ -0,0 +1,197 @@
---
title: Observability Developer Guide
toc: true
weight: 730
indent: true
---
# Observability Developer Guide
## Introduction
Observability is crucial to Crossplane users; both those operating Crossplane
and those using Crossplane to operate their infrastructure. Crossplane currently
approaches observability via Kubernetes events and structured logs. Timeseries
metrics are desired but [not yet implemented].
## Goals
In short, a non-admin user and an admin user should both be able to debug any
issues only by inspecting logs and events. There should be no need to rebuild
the Crossplane binary or to reach out to a Crossplane developer.
A user should be able to:
* Debug an issue without rebuilding the Crossplane binary
* Understand an issue without contacting a cluster admin
* Ask a cluster admin to check the logs for more details about the reason the
issue happened, if the details are not part of the error message
A cluster admin should be able to:
* Debug an issue without rebuilding the Crossplane binary
* Debug an issue only by looking at the logs
* Debug an issue without needing to contact a Crossplane developer
## Error reporting in the logs
Error reporting in the logs is mostly intended for consumption by Crossplane
cluster admins. A cluster admin should be able to debug any issue by inspecting
the logs, without needing to add more logs themselves or contact a Crossplane
developer. This means that logs should contain:
* Error messages, at either the info or debug level as contextually appropriate
* Any context leading up to an error, typically at debug level, so that the
errors can be debugged
## Error reporting as events
Error reporting as Kubernetes events is primarily aimed toward end-users of
Crossplane who are not cluster admins. Crossplane typically runs as a Kubernetes
pod, and thus it is unlikely that most users of Crossplane will have access to
its logs. [Events], on the other hand, are available as top-level Kubernetes
objects, and show up the objects they relate to when running `kubectl describe`.
Events should be recorded in the following cases:
* A significant operation is taken on a resource
* The state of a resource is changed
* An error occurs
The events recorded in these cases can be thought of as forming an event log of
things that happen for the resources that Crossplane manages. Each event should
refer back to the relevant controller and resource, and use other fields of the
Event kind as appropriate.
More details about examples of how to interact with events can be found in the
guide to [debugging an application cluster].
## Choosing between methods of error reporting
There are many ways to report errors, such as:
* Metrics
* Events
* Logging
* Tracing
It can be confusing to figure out which one is appropriate in a given situation.
This section will try to offer advice and a mindset that can be used to help
make this decision.
Let's set the context by listing the different user scenarios where error
reporting may be consumed. Here are the typical scenarios as we imagine them:
1. A person **using** a system needs to figure out why things aren't working as
expected, and whether they made a mistake that they can correct.
2. A person **operating** a service needs to monitor the service's **health**,
both now and historically.
3. A person **debugging** a problem which happened in a **live environment**
(often an **operator** of the system) needs information to figure out what
happened.
4. A person **developing** the software wants to **observe** what is happening.
5. A person **debugging** the software in a **development environment**
(typically a **developer** of the system) wants to debug a problem (there is
a lot of overlap between this and the live environment debugging scenario).
The goal is to satisfy the users in all of the scenarios. We'll refer to the
scenarios by number.
The short version is: we should do whatever satisfies all of the scenarios.
Logging and events are the recommendations for satisfying the scenarios,
although they don't cover scenario 2.
The longer version is:
* Scenario 1 is best served by events in the context of Crossplane, since the
users may not have access to read logs or metrics, and even if they did, it
would be hard to relate them back to the event the user is trying to
understand.
* Scenario 2 is best served by metrics, because they can be aggregated and
understood as a whole. And because they can be used to track things over time.
* Scenario 3 is best served by either logging that contains all the information
about and leading up to the event. Request-tracing systems are also useful for
this scenario.
* Scenario 4 is usually logs, maybe at a more verbose level than normal. But it
could be an attached debugger or some other type of tool. It could also be a
test suite.
* Scenario 5 is usually either logs, up to the highest imaginable verbosity, or
an attached debugging session. If there's a gap in reporting, it could involve
adding some print statements to get more logging.
As for the question of how to decide whether to log or not, we believe it helps
to try to visualize which of the scenarios the error or information in question
will be used for. We recommend starting with reporting as much information as
possible, but with configurable runtime behavior so that, for example, debugging
logs don't show up in production normally.
For the question of what constitutes an error, errors should be actionable by a
human. See the [Dave Cheney article] on this topic for some more discussion.
## In Practice
Crossplane provides two observability libraries as part of crossplane-runtime:
* [`event`] emits Kubernetes events.
* [`logging`] produces structured logs. Refer to its package documentation for
additional context on its API choices.
Keep the following in mind when using the above libraries:
* [Do] [not] use package level loggers or event recorders. Instantiate them in
`main()` and plumb them down to where they're needed.
* Each [`Reconciler`] implementation should use its own `logging.Logger` and
`event.Recorder`. Implementations are strongly encouraged to default to using
`logging.NewNopLogger()` and `event.NewNopRecorder()`, and accept a functional
loggers and recorder via variadic options. See for example the [managed
resource reconciler].
* Each controller should use its name as its event recorder's name, and include
its name under the `controller` structured logging key. The controllers name
should be of the form `controllertype/resourcekind`, for example
`managed/cloudsqlinstance` or `stacks/stackdefinition`. Controller names
should always be lowercase.
* Logs and events should typically be emitted by the `Reconcile` method of the
`Reconciler` implementation; not by functions called by `Reconcile`. Author
the methods orchestrated by `Reconcile` as if they were a library; prefer
surfacing useful information for the `Reconciler` to log (for example by
[wrapping errors]) over plumbing loggers and event recorders down to
increasingly deeper layers of code.
* Almost nothing is worth logging at info level. When deciding which logging
level to use, consider a production deployment of Crossplane reconciling tens
or hundreds of managed resources. If in doubt, pick debug. You can easily
increase the log level later if it proves warranted.
* The above is true even for errors; consider the audience. Is this an error
only the Crossplane cluster operator can fix? Does it indicate a significant
degradation of Crossplane's functionality? If so, log it at info. If the error
pertains to a single Crossplane resource emit an event instead.
* Always log errors under the structured logging key `error` (e.g.
`log.Debug("boom!, "error", err)`). Many logging implementations (including
Crossplane's) add context like stack traces for this key.
* Emit events liberally; they're rate limited and deduplicated.
* Follow [API conventions] when emitting events; ensure event reasons are unique
and `CamelCase`.
* Consider emitting events and logs when a terminal condition is encountered
(e.g. `Reconcile` returns) over logging logic flow. i.e. Prefer one log line
that reads "encountered an error fooing the bar" over two log lines that read
"about to foo the bar" and "encountered an error". Recall that if the audience
is a developer debugging Crossplane they will be provided a stack trace with
file and line context when an error is logged.
* Consider including the `reconcile.Request`, and the resource's UID and
resource version (not API version) under the keys `request`, `uid`, and
`version`. Doing so allows log readers to determine what specific version of a
resource the log pertains to.
Finally, when in doubt, aim for consistency with existing Crossplane controller
implementations.
[not yet implemented]: https://github.com/crossplaneio/crossplane/issues/314
[Events]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.16/#event-v1-core
[debugging an application cluster]: https://kubernetes.io/docs/tasks/debug-application-cluster/
[Dave Cheney article]: https://dave.cheney.net/2015/11/05/lets-talk-about-logging
[`event`]: https://godoc.org/github.com/crossplaneio/crossplane-runtime/pkg/event
[`logging`]: https://godoc.org/github.com/crossplaneio/crossplane-runtime/pkg/logging
[Do]: https://peter.bourgon.org/go-best-practices-2016/#logging-and-instrumentation
[not]: https://dave.cheney.net/2017/01/23/the-package-level-logger-anti-pattern
[`Reconciler`]: https://godoc.org/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler
[managed resource reconciler]: https://github.com/crossplaneio/crossplane-runtime/blob/a6bb0/pkg/reconciler/managed/reconciler.go#L436
[wrapping errors]: https://godoc.org/github.com/pkg/errors#Wrap
[API conventions]: https://github.com/kubernetes/community/blob/09f55c6/contributors/devel/sig-architecture/api-conventions.md#events