Common metrics export interfaces for Knative
See the Plan for details on where this is heading.
Current status
The code currently uses OpenCensus to support exporting metrics to multiple backends. Currently, two backends are supported: Prometheus and OpenCensus/OTel.
Metrics export is controlled by a ConfigMap called `config-observability`, which is a key-value map with specific values supported for each of the OpenCensus and Prometheus backends. Hot-reload of the ConfigMap on a running process is supported by directly watching (via the Kubernetes API) the `config-observability` object. Configuration via environment is also supported for use by the `queue-proxy`, which runs with user permissions in the user's namespace.
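As a rough illustration of how a flat key-value ConfigMap can select a backend, here is a minimal sketch. The key names, the accepted values, the Prometheus default, and the `ExportConfig` type are all assumptions made for the example, not a description of the package's actual parsing code.

```go
package main

import "fmt"

// Key names and accepted values below are assumptions made for this sketch;
// consult the package's config.go for the real ones.
const (
	backendKey = "metrics.backend-destination"
	addressKey = "metrics.opencensus-address"
)

// Backend identifies where metrics are exported.
type Backend string

const (
	BackendPrometheus Backend = "prometheus"
	BackendOpenCensus Backend = "opencensus"
)

// ExportConfig is a hypothetical, simplified view of the export settings.
type ExportConfig struct {
	Backend          Backend
	CollectorAddress string // only meaningful for the OpenCensus backend
}

// ParseExportConfig turns the flat key-value data of a ConfigMap into a
// typed config, rejecting backends this sketch does not know about.
func ParseExportConfig(data map[string]string) (ExportConfig, error) {
	// Defaulting to Prometheus when the key is absent is an assumption.
	cfg := ExportConfig{Backend: BackendPrometheus}
	if v := data[backendKey]; v != "" {
		cfg.Backend = Backend(v)
	}
	switch cfg.Backend {
	case BackendPrometheus:
		// No collector address is needed in this sketch.
	case BackendOpenCensus:
		cfg.CollectorAddress = data[addressKey]
	default:
		return ExportConfig{}, fmt.Errorf("unsupported metrics backend %q", cfg.Backend)
	}
	return cfg, nil
}

func main() {
	cfg, err := ParseExportConfig(map[string]string{
		backendKey: "opencensus",
		addressKey: "localhost:55678",
	})
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

With a parser of this shape, hot-reload amounts to calling the same function again whenever the watched object changes and swapping the exporter if the result differs.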
Problems
There are currently 6 supported Golang exporters for OpenCensus. We do not want to build all of those backends into the core of `knative.dev/pkg` and all downstream dependents, and we'd like all the code shipped in `knative.dev/pkg` to be able to be tested without needing any special environment setup.
With the current direct-integration setup, there needs to be initial and ongoing work in `pkg` (which should be high-value, low-churn code) to maintain and update stats exporters which need to be statically linked into ~all Knative binaries. This setup also causes problems for vendors who may want or need to perform an out-of-tree integration (e.g. proprietary or partially-proprietary monitoring stacks).
Another problem is that each vendor's exporter requires different parameters, supplied as Golang `Options` methods which may require complex connections with the Knative ConfigMap. Two examples of this are secrets like API keys and the Prometheus monitoring port (which requires additional service/etc wiring).
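To make the wiring problem concrete, the sketch below shows the kind of per-vendor glue that maps flat configuration onto functional options. Every name here (`settings`, `Option`, `WithAPIKey`, `WithPort`, the `example.port` key) is hypothetical and stands in for whatever a real exporter exposes.

```go
package main

import (
	"fmt"
	"strconv"
)

// settings is a hypothetical stand-in for a vendor exporter's configuration.
type settings struct {
	apiKey string
	port   int
}

// Option mutates exporter settings, following the functional options pattern.
type Option func(*settings)

// WithAPIKey sets a secret that would normally come from a Secret, not a ConfigMap.
func WithAPIKey(key string) Option {
	return func(s *settings) { s.apiKey = key }
}

// WithPort sets the port the exporter serves or connects on.
func WithPort(port int) Option {
	return func(s *settings) { s.port = port }
}

// OptionsFromConfig is the glue code each vendor integration would need:
// it converts string values into typed options and validates them.
func OptionsFromConfig(data map[string]string, apiKey string) ([]Option, error) {
	var opts []Option
	if apiKey != "" {
		opts = append(opts, WithAPIKey(apiKey))
	}
	if raw, ok := data["example.port"]; ok { // hypothetical key
		port, err := strconv.Atoi(raw)
		if err != nil {
			return nil, fmt.Errorf("invalid example.port %q: %w", raw, err)
		}
		opts = append(opts, WithPort(port))
	}
	return opts, nil
}

// apply builds settings from defaults plus the supplied options.
func apply(opts ...Option) settings {
	s := settings{port: 9090} // arbitrary default chosen for the sketch
	for _, o := range opts {
		o(&s)
	}
	return s
}

func main() {
	opts, err := OptionsFromConfig(map[string]string{"example.port": "9091"}, "not-a-real-key")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("%+v\n", apply(opts...))
}
```

Each additional backend would need its own variant of `OptionsFromConfig`, which is the maintenance cost described above.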
See also this doc, where the plan was worked out.
The plan
OpenCensus (and eventually OpenTelemetry) offers a sidecar or host-level agent which speaks the OpenCensus protocol and can proxy from this protocol to multiple backends.
(From OpenCensus Documentation)
We will standardize on export to the OpenCensus export protocol, and encourage vendors to implement their own OpenCensus Agent or Collector DaemonSet, Sidecar, or other OpenCensus Protocol service which connects to their desired monitoring environment. For now, we will use the `config-observability` ConfigMap to provide the OpenCensus endpoint, but we will work with the OpenTelemetry group to define a Kubernetes-friendly standard export path.
Additionally, once the OpenTelemetry agent is stable, we will propose adding the OpenTelemetry agent running on a localhost port as part of the runtime contract.
We need to make sure that the OpenCensus library does not block, fail, or queue metrics in-process excessively in the case where the OpenCensus Agent is not present on the cluster. This will allow us to ship Knative components which attempt to reach out to the Agent if present, and which simply retain local statistics for a short period of time if not.
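The non-blocking, bounded-queue behaviour described above can be sketched with a buffered channel and a `select` with a `default` branch. This is only an illustration of the property we want, not how the OpenCensus library is implemented; `Measurement` and `BoundedBuffer` are hypothetical names, and the sketch bounds retention by count rather than by time.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Measurement is a hypothetical single recorded data point.
type Measurement struct {
	Name  string
	Value float64
}

// BoundedBuffer holds at most a fixed number of measurements and never
// blocks the caller: when it is full, new measurements are dropped.
type BoundedBuffer struct {
	ch      chan Measurement
	dropped int64
}

// NewBoundedBuffer creates a buffer that retains up to capacity measurements.
func NewBoundedBuffer(capacity int) *BoundedBuffer {
	return &BoundedBuffer{ch: make(chan Measurement, capacity)}
}

// Offer tries to enqueue m and reports whether it was accepted.
// It returns immediately even if no agent is draining the buffer.
func (b *BoundedBuffer) Offer(m Measurement) bool {
	select {
	case b.ch <- m:
		return true
	default:
		atomic.AddInt64(&b.dropped, 1)
		return false
	}
}

// Drain removes and returns everything currently buffered without blocking.
func (b *BoundedBuffer) Drain() []Measurement {
	var out []Measurement
	for {
		select {
		case m := <-b.ch:
			out = append(out, m)
		default:
			return out
		}
	}
}

// Dropped returns how many measurements were discarded because the buffer was full.
func (b *BoundedBuffer) Dropped() int64 {
	return atomic.LoadInt64(&b.dropped)
}

func main() {
	buf := NewBoundedBuffer(2)
	for i := 0; i < 3; i++ {
		buf.Offer(Measurement{Name: "request_count", Value: float64(i)})
	}
	fmt.Println("buffered:", len(buf.Drain()), "dropped:", buf.Dropped())
}
```

Dropping the newest measurement keeps the sketch short; a design that evicts the oldest entry or expires entries after a timeout would match "retain for a short period" more closely.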
Concerns
- Unsure about the stability of the OpenCensus Agent (or successor). We're currently investigating this, but the OpenCensus agent seems to have been recommended by several others.
- Running `fluentd` as a sidecar was very big (400MB) and had a large impact on cold start times.
  - Mitigation: run the OpenCensus agent as a DaemonSet (like we do with `fluentd` now).
- Running as a DaemonSet may make it more difficult to ensure that metrics for each namespace end up in the right place.
  - We have this problem with the built-in configurations today, so this doesn't make the problem substantially worse.
  - May want/need some connection between the Agent and the Kubelet to verify sender identities eventually.
  - Only expose OpenCensus Agent on localhost, not outside the node.
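The localhost-only concern can be checked mechanically. The sketch below tests whether a `host:port` listen address is restricted to the loopback interface; the function name is hypothetical, and treating the literal name `localhost` as loopback without resolving it is a simplifying assumption.

```go
package main

import (
	"fmt"
	"net"
)

// IsLoopbackEndpoint reports whether addr (in host:port form) refers only
// to the local node. An empty host or 0.0.0.0 means "all interfaces" and
// is therefore rejected.
func IsLoopbackEndpoint(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	// Assumption: the name "localhost" resolves to a loopback address.
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

func main() {
	for _, addr := range []string{"127.0.0.1:55678", "[::1]:55678", "0.0.0.0:55678", ":55678"} {
		fmt.Printf("%-18s loopback=%v\n", addr, IsLoopbackEndpoint(addr))
	}
}
```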
Steps to reach the goal
- Add OpenCensus Agent as one of the export options.
- Ensure that all tests pass in a non-Google-Cloud connected environment. This is true today. Ensure this on an ongoing basis.
- Google to implement OpenCensus Agent configuration to match what they are doing for Stackdriver now. (No public issue link because this should be in Google's vendor-specific configuration.)
- Document how to configure OpenCensus/OpenTelemetry Agent + Prometheus to achieve the current level of application visibility, and determine a long-term course for how to maintain this as a "bare minimum" supported configuration. https://github.com/knative/docs/pull/3005
- Stop adding exporter features outside of the OpenCensus / OpenTelemetry export as of 0.13 release (03 March 2020). Between now and 0.13, small amounts of additional features can be built in to assist with the bridging process or to support existing products. New products should build on the OpenCensus Agent approach.
- Removal of the Stackdriver OpenCensus Exporter https://github.com/knative/pkg/issues/2173
- Revisit Adopting OpenTelemetry SDKs instead of OpenCensus