In anticipation of removing Transport-related Event types, we want to
separate the concerns of recording transport metrics updates from
reporting them to the metrics endpoint.
The transport module has been split into `Registry` and `Report` types:
`Registry` is responsible for recording updates, and `Report` is
responsible for rendering metrics.
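
The split can be pictured as two handles over one shared structure. The following is only a minimal sketch, not the module's actual code: `Inner`, `new_pair`, `open`, and the `tcp_open_total` metric name are hypothetical stand-ins chosen for illustration.

```rust
use std::collections::BTreeMap;
use std::fmt;
use std::sync::{Arc, Mutex};

/// Hypothetical shared state: a counter of opened transports per direction.
#[derive(Default)]
struct Inner {
    opens: BTreeMap<&'static str, u64>,
}

/// Records updates; this half would be held by the data path.
#[derive(Clone)]
struct Registry(Arc<Mutex<Inner>>);

/// Renders metrics; this half would be held by the metrics endpoint.
#[derive(Clone)]
struct Report(Arc<Mutex<Inner>>);

/// Builds both halves over the same shared state (assumed constructor).
fn new_pair() -> (Registry, Report) {
    let inner = Arc::new(Mutex::new(Inner::default()));
    (Registry(inner.clone()), Report(inner))
}

impl Registry {
    /// Increments the open counter for the given direction.
    fn open(&self, direction: &'static str) {
        if let Ok(mut inner) = self.0.lock() {
            *inner.opens.entry(direction).or_insert(0) += 1;
        }
    }
}

impl fmt::Display for Report {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // If the lock is poisoned, render nothing rather than panic.
        let inner = match self.0.lock() {
            Ok(inner) => inner,
            Err(_) => return Ok(()),
        };
        for (direction, count) in &inner.opens {
            writeln!(f, "tcp_open_total{{direction=\"{}\"}} {}", direction, count)?;
        }
        Ok(())
    }
}

fn main() {
    let (registry, report) = new_pair();
    registry.open("inbound");
    registry.open("outbound");
    print!("{}", report);
}
```

Neither half needs to know about the other's callers, which is the separation the split is meant to provide.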
Following #67 and #68, the `labels::TlsStatus` type can be removed in
favor of extending the underlying `ctx::transport::TlsStatus` type to
implement `FmtLabels`.
The `DstLabels` type has a large and undefined scope. Over time, it's
become critical in a variety of contexts, even outside of the scope of
prometheus endpoint labels.
This change makes `DstLabels` private to the
`telemetry::metrics::labels` module. This more clearly separates the
concerns of prometheus labeling from discovery, etc.
Now, destination metadata includes an index map of arbitrary metadata
labels.
Previously, dst label strings would be formatted eagerly at
service-discovery-time. This change moves this string formatting into
the data path(!). I think this is an acceptable tradeoff temporarily,
while we work to stop constructing RequestLabels in the data path
altogether.
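
To illustrate what formatting in the data path means here, the sketch below keeps the raw key/value metadata and only builds the label string when it is displayed. It is an assumption-laden illustration: `Metadata` and `DstLabelsFmt` are hypothetical names, a `Vec` of pairs stands in for the index map, and the `dst_` prefix is assumed.

```rust
use std::fmt;

/// Stand-in for destination metadata. An ordered `Vec` of pairs is used
/// here in place of an index map so the sketch has no dependencies.
struct Metadata {
    labels: Vec<(String, String)>,
}

/// Hypothetical wrapper that formats labels lazily, at display time.
struct DstLabelsFmt<'a>(&'a Metadata);

impl fmt::Display for DstLabelsFmt<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for (i, (key, value)) in self.0.labels.iter().enumerate() {
            if i > 0 {
                f.write_str(",")?;
            }
            // The `dst_` prefix is an assumption for this sketch.
            write!(f, "dst_{}=\"{}\"", key, value)?;
        }
        Ok(())
    }
}

fn main() {
    let meta = Metadata {
        labels: vec![
            ("deployment".to_string(), "web".to_string()),
            ("namespace".to_string(), "default".to_string()),
        ],
    };
    println!("{}", DstLabelsFmt(&meta));
}
```

The cost is that the string is rebuilt wherever it is formatted, which is the tradeoff described above.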
In order to start dismantling the monolithic ctx structures, this change
removes the root `ctx::Process` type. This simplifies `ctx::Proxy` such
that it is copyable and need not be `Arc`ed.
`telemetry::metrics::labels::Direction` has been updated to decorate
`ctx::Proxy` instead of modeling the same variants directly as an enum.
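
As a rough sketch of the shape this takes (the variant names and label text are assumptions, not copied from the code), a copyable proxy context and a label type that decorates it could look like:

```rust
use std::fmt;

/// Sketch of a small proxy context; being `Copy`, it needs no `Arc`.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
enum Proxy {
    Inbound,
    Outbound,
}

/// Label type that wraps `Proxy` instead of duplicating its variants.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
struct Direction(Proxy);

impl fmt::Display for Direction {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.0 {
            Proxy::Inbound => f.write_str("direction=\"inbound\""),
            Proxy::Outbound => f.write_str("direction=\"outbound\""),
        }
    }
}

fn main() {
    let proxy = Proxy::Outbound;
    let first = Direction(proxy);
    // `proxy` was copied, not moved, so it can be used again.
    let second = Direction(proxy);
    println!("{} {}", first, second);
}
```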
The `telemetry::metrics::transport` module exposes much of its
implementation details to callers, which makes it difficult to make
changes to how the module is structured. In preparation for further
refactors, this change narrows the module's public API:
All labels and scopes types have been made private. A single, public
`Transports` type has been introduced to describe the entire public
interface of the module. This has been crafted to be free of `event`
types and to have minimal external dependencies.
A new `transport::Eos` type has been introduced to replace the
overloaded `labels::Classification` type -- this type was initially
introduced to model _HTTP response_ classification, but it was
reused for transports. This is an undesirable coupling that will have to
be broken when we start to address HTTP response classification
properly.
Various transport-specific labels are defined in the common
`metrics::labels` module.
In preparation for refactoring transport metrics into Sensor and Report
halves, this change moves all transport-specific labels into the
transport metrics module.
The `metrics::Scopes` type exposes its internal implementation to many of
its users.
By extracting the type into its own module, we are forced to provide an
explicit public interface, hiding its IndexMap implementation details.
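
A minimal sketch of the idea follows. The method names (`get_or_default`, `get`, `len`) are hypothetical, and a standard `HashMap` stands in for the `IndexMap` so the example has no dependencies; the point is only that callers see the module's methods and never the map itself.

```rust
mod scopes {
    use std::collections::HashMap;
    use std::hash::Hash;

    /// Holds one scope (a group of metrics) per label set. The map is a
    /// private field, so callers cannot depend on its concrete type.
    pub struct Scopes<L: Hash + Eq, S> {
        inner: HashMap<L, S>,
    }

    impl<L: Hash + Eq, S: Default> Scopes<L, S> {
        pub fn new() -> Self {
            Scopes { inner: HashMap::new() }
        }

        /// Returns the scope for `labels`, creating it if needed.
        pub fn get_or_default(&mut self, labels: L) -> &mut S {
            self.inner.entry(labels).or_insert_with(S::default)
        }

        /// Looks up an existing scope without creating one.
        pub fn get(&self, labels: &L) -> Option<&S> {
            self.inner.get(labels)
        }

        /// Number of scopes currently held.
        pub fn len(&self) -> usize {
            self.inner.len()
        }
    }
}

use scopes::Scopes;

fn main() {
    let mut scopes: Scopes<&'static str, u64> = Scopes::new();
    *scopes.get_or_default("inbound") += 1;
    *scopes.get_or_default("inbound") += 1;
    println!("{:?} {}", scopes.get(&"inbound"), scopes.len());
}
```

With this shape, swapping the backing map later would not touch any caller.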
The current implementation of TLS config reload telemetry has several
layers: a sensor emits events into a dispatcher that updates metrics.
This can be simplified by sharing a common metrics object between the
metrics registry and config module.
The metrics registry is updated to hold a read-only `tls_config_reload::Fmt`,
and the config module holds a `tls_config_reload::Sensor`.
The `Sensor` type holds a strong reference to an inner metrics
structure and acquires a lock on updates.
The `Fmt` type holds a weak reference to the metrics structure so
that the metrics server can print the state as long as it's actually held
in memory. If the `Sensor` is dropped (for instance, because TLS has
been administratively disabled), then no metrics will be formatted by
`Fmt`.
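
The strong/weak arrangement can be sketched with `Arc` and `Weak` from the standard library. This is an illustration under assumptions, not the actual module: `Inner`, `new_pair`, `reloaded`, `failed`, and the metric and label names are hypothetical.

```rust
use std::fmt;
use std::sync::{Arc, Mutex, Weak};

/// Hypothetical counters for reload outcomes.
#[derive(Default)]
struct Inner {
    success: u64,
    failure: u64,
}

/// Holds the strong reference; owned by the config module.
struct Sensor(Arc<Mutex<Inner>>);

/// Holds only a weak reference; owned by the metrics registry.
#[derive(Clone)]
struct Fmt(Weak<Mutex<Inner>>);

/// Builds a connected sensor/formatter pair (assumed constructor).
fn new_pair() -> (Sensor, Fmt) {
    let inner = Arc::new(Mutex::new(Inner::default()));
    let report = Fmt(Arc::downgrade(&inner));
    (Sensor(inner), report)
}

impl Sensor {
    /// Records a successful reload, taking the lock for the update.
    fn reloaded(&self) {
        if let Ok(mut inner) = self.0.lock() {
            inner.success += 1;
        }
    }

    /// Records a failed reload.
    fn failed(&self) {
        if let Ok(mut inner) = self.0.lock() {
            inner.failure += 1;
        }
    }
}

impl fmt::Display for Fmt {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // If the sensor was dropped, the upgrade fails and nothing is printed.
        let shared = match self.0.upgrade() {
            Some(shared) => shared,
            None => return Ok(()),
        };
        let metrics = match shared.lock() {
            Ok(metrics) => metrics,
            Err(_) => return Ok(()),
        };
        writeln!(f, "tls_config_reload_total{{status=\"reloaded\"}} {}", metrics.success)?;
        writeln!(f, "tls_config_reload_total{{status=\"error\"}} {}", metrics.failure)
    }
}

fn main() {
    let (sensor, report) = new_pair();
    sensor.reloaded();
    sensor.failed();
    print!("{}", report);
    drop(sensor);
    // After the sensor is dropped, the formatter renders an empty string.
    assert!(report.to_string().is_empty());
}
```

Because the formatter never keeps the metrics alive on its own, dropping the `Sensor` is enough to make the metrics disappear from the output.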
Details of TLS configuration reload metrics span several modules.
This change consolidates all of this logic into a single
module with a single public type, `TlsConfigReload`, that supports
recording successful reloads, recording errors, and formatting its
current state.
This sets up further simplifications, eventually leading to the removal
of the `Event` API.
The proxy's telemetry system is implemented with a channel: the proxy thread
generates events and the control thread consumes these events to record
metrics and satisfy Tap observations. This design was intended to minimize
latency overhead in the data path.
However, this design leads to substantial CPU overhead: the control thread's
work scales with the proxy thread's work, leading to resource contention in
busy, resource-limited deployments. This design also has drawbacks in
terms of allocation, and it makes it difficult to implement planned features
like payload-aware Tapping.
This change removes the event channel so that all telemetry is recorded
instantaneously in the data path, setting up for further simplifications so
that, eventually, the metrics registry properly uses service lifetimes to
support eviction.
This change has a potentially negative side effect: metrics scrapes obtain
the same lock that the data path uses to write metrics so, if the metrics
server gets heavy traffic, it can directly impact proxy latency. These
effects will be ameliorated by future changes that reduce the need for the
Mutex in the proxy thread.
This branch adds a label displaying the Unix error code name (or the raw
error code, on other operating systems or if the error code was not
recognized) to the metrics generated for TCP connection failures.
It also adds a couple of tests for label formatting.
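
A rough sketch of the labeling logic is shown below. It is not the code from this branch: the `Errno` type and the `errno` label key are hypothetical, only a handful of codes are mapped, and the numeric values are hard-coded Linux errno values to keep the example dependency-free (real code would more likely take them from a platform crate).

```rust
use std::fmt;
use std::io;

/// Hypothetical label wrapping a raw OS error code.
#[derive(Copy, Clone, Debug, PartialEq, Eq, Hash)]
struct Errno(i32);

impl Errno {
    /// Extracts the raw OS error code from an I/O error, if it has one.
    fn from_io(err: &io::Error) -> Option<Errno> {
        err.raw_os_error().map(Errno)
    }
}

impl fmt::Display for Errno {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Assumed Linux errno values; unrecognized codes fall through
        // to the raw number.
        let name = match self.0 {
            32 => "EPIPE",
            104 => "ECONNRESET",
            110 => "ETIMEDOUT",
            111 => "ECONNREFUSED",
            other => return write!(f, "errno=\"{}\"", other),
        };
        write!(f, "errno=\"{}\"", name)
    }
}

fn main() {
    let err = io::Error::from_raw_os_error(111);
    if let Some(errno) = Errno::from_io(&err) {
        println!("{}", errno);
    }
}
```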
Signed-off-by: Eliza Weisman <eliza@buoyant.io>