* webhook changes for observability
* go mod
* vendor
* Allow the webhook to accept a MetricsProvider. This makes it easier to unit test.
Digging into the OTel implementation, it looks like instruments are cached,
so when different webhooks request the same instrument, the lookup is served
from that cache instead of creating a duplicate.
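As a rough illustration of that caching behavior, here is a minimal unit-test sketch that stands an in-process OTel `MeterProvider` in for the one the webhook would accept; the metric name and the injectable-provider wiring are assumptions, only the OTel SDK calls are standard.
```go
package webhook_test

import (
	"context"
	"testing"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func TestInstrumentsAreCached(t *testing.T) {
	// An in-process MeterProvider stands in for whatever provider the
	// webhook is given; no exporter is needed for a unit test.
	provider := sdkmetric.NewMeterProvider()
	t.Cleanup(func() { _ = provider.Shutdown(context.Background()) })

	// Two "webhooks" asking for the same instrument from the same Meter
	// should be served from the SDK's instrument cache rather than
	// duplicating state. The metric name below is illustrative.
	meter := provider.Meter("knative.dev/pkg/webhook")
	a, err := meter.Float64Histogram("kn.webhook.handler.duration")
	if err != nil {
		t.Fatal(err)
	}
	b, err := meter.Float64Histogram("kn.webhook.handler.duration")
	if err != nil {
		t.Fatal(err)
	}

	// Recording on either handle feeds the same aggregation.
	a.Record(context.Background(), 1)
	b.Record(context.Background(), 2)
}
```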
* passthrough additional OTel interfaces
* linter fixes
* drop global meter in tests
* PR feedback
* PR feedback
* when recording the admission/conversion duration use the attributes from the labeler
* add a comment that otelhttp middleware creates the labeler
* simplify code for readability
* change metric names to allow reuse across different webhook types
* fix linter
* tweak attribute names
* webhook: add options to disable resource_namespace tag in metrics
To add some context, historically, `resource_name` was removed from this
tag list due to its potential to cause high metrics cardinality.
See [knative/pkg#1464][1] for more information.
While that's great, it might not be sufficient for large-scale use
cases where namespaces can be super dynamic (with generateName, too) or
grow fast enough. There is an issue report in
[tektoncd/pipeline#3171][2] that discusses this.
This proposal makes it possible to disable `resource_namespace` tag via
an option function. The default behavior is not changed, so no user
impact if any of existing users rely on this tag. There is no API
contract change either due to the beauty of variadic functions.
Now downstream projects can consume this by overriding `StatsReporter` in
the webhook context options with their own preference. As a caveat here, if a
downstream project does choose to override `StatsReporter`, the default
`ReportMetrics` function shouldn't be called, as the project may now
have a different set of tag keys to report. As such, this function is
only called if the default `StatsReporter` is used. A sketch of the option
shape follows below.
[1]: https://github.com/knative/pkg/pull/1464
[2]: https://github.com/tektoncd/pipeline/issues/3171
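A minimal sketch of the variadic-option shape being described, using illustrative names (`statsReporterOptions`, `WithoutTags`) rather than the actual knative/pkg identifiers:
```go
package webhook

// statsReporterOptions is an illustrative stand-in for the reporter's
// internal configuration; only the option *shape* is the point here.
type statsReporterOptions struct {
	tagKeys []string // tag keys attached to every reported measurement
}

// StatsReporterOption customizes how the default StatsReporter is built.
type StatsReporterOption func(*statsReporterOptions)

// WithoutTags drops the named tag keys (e.g. "resource_namespace"), trading
// label detail for lower metrics cardinality. (Hypothetical option name.)
func WithoutTags(names ...string) StatsReporterOption {
	return func(o *statsReporterOptions) {
		drop := make(map[string]struct{}, len(names))
		for _, n := range names {
			drop[n] = struct{}{}
		}
		kept := o.tagKeys[:0]
		for _, k := range o.tagKeys {
			if _, skip := drop[k]; !skip {
				kept = append(kept, k)
			}
		}
		o.tagKeys = kept
	}
}
```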
* webhook: add StatsReporterOptions in webhook.Options
There are two ways to customize StatsReporter:
1. Use a whole new StatsReporter implementation.
2. Pass Option funcs to customize the default StatsReporter.
Option 1 is less practical at this time due to the metrics registration
conflict: `webhook.RegisterMetrics()` is called regardless of which
StatsReporter implementation is used (which is a problem by itself). The
second option is more practical since it works well without dealing with
metrics conflicts.
The `webhook.Option` in particular allows people to discard certain
metrics tags.
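And a sketch of how a downstream `main()` might thread those options through `webhook.Options`. The `StatsReporterOption`/`WithoutTags` names follow the sketch above and are not confirmed knative/pkg API; `webhook.WithOptions`, `signals.NewContext`, and `sharedmain.MainWithContext` are used as believed to exist, but treat the exact wiring as an assumption.
```go
package main

import (
	"knative.dev/pkg/injection/sharedmain"
	"knative.dev/pkg/signals"
	"knative.dev/pkg/webhook"
)

func main() {
	ctx := webhook.WithOptions(signals.NewContext(), webhook.Options{
		ServiceName: "webhook",
		Port:        8443,
		SecretName:  "webhook-certs",
		// Keep the default reporter, but drop the high-cardinality
		// namespace tag via the hypothetical option sketched above.
		StatsReporterOptions: []webhook.StatsReporterOption{
			webhook.WithoutTags("resource_namespace"), // hypothetical option
		},
	})

	// Admission/conversion controllers would be passed here as
	// additional constructors.
	sharedmain.MainWithContext(ctx, "webhook")
}
```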
* TestRegistrationStopChanFire now uses ephemeral ports
* For TLS servers dial TLS
* have server error logs appear in zap
* log the correct error
* pass ephemeral listeners to the webhook for testing
* make minimum tls version configurable
* change default min TLS version to 1.3
* change opencensus tls min version to 1.3
* Update env var name
Co-authored-by: Dave Protasowski <dprotaso@gmail.com>
* use webhook options to configure min tls version
* add unit tests for webhook tlsMinVersion option
* Update webhook/env.go
Co-authored-by: Dave Protasowski <dprotaso@gmail.com>
* address feedback
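A hedged sketch of the configuration flow: parse the minimum version from an environment variable (the variable name here is an assumption, not the final one) and fall back to the new TLS 1.3 default.
```go
package webhook

import (
	"crypto/tls"
	"fmt"
	"os"
)

// tlsMinVersionFromEnv defaults to TLS 1.3 (the new default above) and
// allows opting back down to TLS 1.2; the env var name is assumed.
func tlsMinVersionFromEnv() (uint16, error) {
	switch v := os.Getenv("WEBHOOK_TLS_MIN_VERSION"); v {
	case "", "1.3":
		return tls.VersionTLS13, nil
	case "1.2":
		return tls.VersionTLS12, nil
	default:
		return 0, fmt.Errorf("unsupported minimum TLS version %q", v)
	}
}
```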
---------
Co-authored-by: Dave Protasowski <dprotaso@gmail.com>
* Implement a new shared "Drainer" handler.
This implements a new `http.Handler` called `Drainer`, which is intended to wrap some inner `http.Handler` business logic with a new outer handler that can respond to Kubelet probes (successfully until told to "Drain()").
This takes over the webhook's relatively new probe handling and lame-duck logic with one key difference. Previously the webhook waited for a fixed period after SIGTERM before exiting, but the new logic waits for this same grace period AFTER THE LAST REQUEST. So if the handler keeps getting (non-probe) requests, the timer continually resets, and once it stops receiving requests for the configured grace period, "Drain()" returns and the webhook exits.
The goal of this work is to better cope with what we believe to be high tail latencies of the API server noticing that a webhook replica is shutting down.
Related: https://github.com/knative/pkg/issues/1509
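A minimal sketch of the Drainer idea, not the knative/pkg implementation (field and helper names are assumptions): probes succeed until `Drain()` is called, every real request resets the quiet-period timer, and `Drain()` only returns once the handler has been idle for the grace period.
```go
package handlers

import (
	"net/http"
	"strings"
	"sync"
	"time"
)

// Drainer wraps Inner: probes succeed until Drain is called, and Drain only
// returns after QuietPeriod has elapsed with no new (non-probe) requests.
type Drainer struct {
	Inner       http.Handler
	QuietPeriod time.Duration

	mu        sync.RWMutex
	draining  bool
	timer     *time.Timer
	done      chan struct{}
	closeOnce sync.Once
}

func (d *Drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if isKubeletProbe(r) {
		d.mu.RLock()
		draining := d.draining
		d.mu.RUnlock()
		if draining {
			// Failing probes lets the Endpoint drop before we stop serving.
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		return
	}

	// A real request pushes the drain deadline out again.
	d.resetTimer()
	d.Inner.ServeHTTP(w, r)
}

// Drain starts failing probes and blocks until the handler has been idle
// (no non-probe requests) for QuietPeriod.
func (d *Drainer) Drain() {
	d.mu.Lock()
	d.draining = true
	if d.done == nil {
		d.done = make(chan struct{})
		d.timer = time.AfterFunc(d.QuietPeriod, func() {
			d.closeOnce.Do(func() { close(d.done) })
		})
	}
	done := d.done
	d.mu.Unlock()
	<-done
}

func (d *Drainer) resetTimer() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		d.timer.Reset(d.QuietPeriod)
	}
}

func isKubeletProbe(r *http.Request) bool {
	return r.Header.Get("k-kubelet-probe") != "" ||
		strings.HasPrefix(r.Header.Get("User-Agent"), "kube-probe/")
}
```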
* Switch to RWLock
* Kubelet probes would result in the webhook writing the HTTP status twice
This doesn't seem to have affected anything; it just wrote out some extra
log messages.
* nits
* nits
* nits
* nits
* Implement the K8s lifecycle in webhook.
The webhook never properly implemented the Kubernetes SIGTERM/SIGKILL
lifecycle, and doesn't even really support readiness probes today. This
change enables folks to use a block like this on their webhook container:
```yaml
readinessProbe: &probe
  periodSeconds: 1
  httpGet:
    scheme: HTTPS
    port: 8443
    httpHeaders:
    - name: k-kubelet-probe
      value: "webhook"
livenessProbe: *probe
```
With this, the webhook won't report as `Ready` until a probe has succeeded,
and when the SIGTERM is received, we will start failing probes for a grace
period (so our Endpoint drops) before shutting down the webhook's HTTP Server.
This was uncovered by running the webhook across 10 replicas in Serving with
the "Goose" (https://github.com/knative/pkg/pull/1316) enabled for the e2e
tests. The failure mode I saw was conversion webhook requests failing across
random tests.
This also moves the Serving probe-detection function into PKG.
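A rough sketch of that lifecycle under the probe configuration above (names, port, and the grace period are illustrative; real code serves TLS and checks errors): answer probes as healthy until SIGTERM, then fail them so the Endpoint drops, and only then shut the server down.
```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var draining atomic.Bool

	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// The probes configured above set the k-kubelet-probe header, which
		// is how kubelet traffic is told apart from admission requests.
		if r.Header.Get("k-kubelet-probe") != "" {
			if draining.Load() {
				w.WriteHeader(http.StatusServiceUnavailable)
			} else {
				w.WriteHeader(http.StatusOK)
			}
			return
		}
		// ... admission/conversion handling would go here ...
	})

	srv := &http.Server{Addr: ":8443", Handler: mux}
	go func() {
		// Real code serves TLS and surfaces this error.
		_ = srv.ListenAndServe()
	}()

	// On SIGTERM: fail probes for a grace period so our Endpoint is removed,
	// then drain in-flight requests and exit.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	<-sigs
	draining.Store(true)
	time.Sleep(30 * time.Second) // illustrative grace period
	_ = srv.Shutdown(context.Background())
}
```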
* Increase the log level when we start to fail probes
* Wait for go routines to terminate on all paths.
* Add new callback pattern to pkg
* include the context
* typo
* Remove the empty instance of unstructured
* initialize the unstructured var
* Eliminate the unneeded pointer
* Pass a pointer to unstructured callback
* Create a validation specific context struct
* Move callback tests to own unit test case
* Switch from converting to decoding
* Update webhook/resourcesemantics/validation/validation.go
Co-Authored-By: Victor Agababov <vagababov@gmail.com>
* don't wrap context and include params
* split validation files
* include 2020 copyright
* include unit test for WithKubeClient
* Don't bother updating copyright date
* Include a unit test for panic
* Move dryRun to context
* Include context dry run unit test
* put the request operation in the context
* eliminate circular dep
* move kubeclient test out of context_test
* don't bother iterating the callback map
* Callback takes a list of supported verbs
* Remove extra type
* Ensure Callback interface is public
* Alias Operation into validation
* alias Operation right in Webhook
* Update webhook/resourcesemantics/validation/validation_admit.go
Co-Authored-By: Victor Agababov <vagababov@gmail.com>
* Update webhook/resourcesemantics/validation/validation_admit_test.go
Co-Authored-By: Victor Agababov <vagababov@gmail.com>
* Update webhook/resourcesemantics/validation/validation_admit_test.go
Co-Authored-By: Victor Agababov <vagababov@gmail.com>
* Update webhook/resourcesemantics/validation/validation_admit.go
Co-Authored-By: Victor Agababov <vagababov@gmail.com>
* Update webhook/resourcesemantics/validation/validation_admit.go
Co-Authored-By: Victor Agababov <vagababov@gmail.com>
* Update webhook/resourcesemantics/validation/validation_admit_test.go
Co-Authored-By: Victor Agababov <vagababov@gmail.com>
* correct parens
* minor style fixes
* Rename Callback to Func
* Fix build error
* Switch callback to take a list with a factory
* keep descriptive names
* update comment
* Drop pointer, correct comments
* Add a unit test to disallow duplicate verbs
* fix comments, struct{} for set
* switch to variadic arg for NewCallback
Co-authored-by: Victor Agababov <vagababov@gmail.com>
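Putting the callback pattern together, a sketch of registering a deletion callback with the variadic-verb `NewCallback`; the GVK, annotation, and callback body are illustrative, and the `validation.NewCallback`/`webhook.Delete` shapes follow the commits above but should be treated as approximate rather than confirmed API.
```go
package callbacks

import (
	"context"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"

	"knative.dev/pkg/webhook"
	"knative.dev/pkg/webhook/resourcesemantics/validation"
)

// blockProtectedDeletes is the Func: it receives the object as
// *unstructured.Unstructured (decoded, not converted) and returns an error
// to deny the request. The annotation key is illustrative.
func blockProtectedDeletes(ctx context.Context, u *unstructured.Unstructured) error {
	if u.GetAnnotations()["example.dev/protected"] == "true" {
		return fmt.Errorf("%s/%s is protected and may not be deleted",
			u.GetNamespace(), u.GetName())
	}
	return nil
}

// Callbacks maps GVKs to callbacks; verbs are passed variadically and
// duplicates are rejected. A downstream webhook would hand this map to the
// validation admission controller constructor.
var Callbacks = map[schema.GroupVersionKind]validation.Callback{
	{Group: "example.dev", Version: "v1", Kind: "Widget"}: validation.NewCallback(
		blockProtectedDeletes, webhook.Delete),
}
```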
* Start the webhook before informers sync.
Some webhooks (e.g. conversion) are required in order to list resources, so delaying them until after informers have synced creates a deadlock when they run in the same process. This change has two key parts:
1. Start the webhook immediately when our process starts, and issue a callback from sharedmain when the informers have synced.
2. Block `Admit` calls until informers have synced (all conversions are exempt), unless the controller has been designated as stateless by implementing `webhook.StatelessAdmissionController`.
Our built-in admission controllers (defaulting, validation, configmap validation) have all been marked as stateless; the main case where we want to block `Admit` calls is when we require the informers to have synced so that indices for Bindings are populated. A sketch of this gate follows below.
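A minimal sketch of that gate, with assumed shapes rather than the knative/pkg implementation: stateless controllers are served immediately, everything else blocks until a "synced" channel closes.
```go
package webhook

import "net/http"

// statelessAdmissionController mirrors the marker-interface idea described
// above (assumed shape and method name): implementing it declares that Admit
// does not read informer-backed state and may be served before caches sync.
type statelessAdmissionController interface {
	ThisTypeDoesNotDependOnInformerState()
}

// gateUntilSynced wraps an admission controller's handler: stateless
// controllers are served immediately, everything else blocks until the
// synced channel is closed (the callback sharedmain issues once informers
// have synced would close it).
func gateUntilSynced(controller interface{}, synced <-chan struct{}, h http.Handler) http.Handler {
	if _, ok := controller.(statelessAdmissionController); ok {
		return h // safe to serve before caches are populated
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		<-synced // block Admit until informers have synced
		h.ServeHTTP(w, r)
	})
}
```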
* Add missing err declaration
* ConversionController implementation
This controller will reconcile target CRDs with the correct
conversion webhook configuration. Specifically, the HTTP path and
CA bundle will be updated.
Additionally, the conversion controller will perform the given
conversions through a hub and spoke model utilizing the
apis.Convertible interface.
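A toy sketch of the hub-and-spoke conversion flow; the `convertible` interface below only mirrors the assumed shape of `apis.Convertible`, and the Widget types are illustrative.
```go
package conversion

import (
	"context"
	"fmt"
)

// convertible mirrors the assumed shape of apis.Convertible: ConvertTo
// writes the receiver into the hub type, ConvertFrom populates the
// receiver from the hub type.
type convertible interface {
	ConvertTo(ctx context.Context, to convertible) error
	ConvertFrom(ctx context.Context, from convertible) error
}

// WidgetV1 is the hub version; spokes only know how to reach it.
type WidgetV1 struct{ Spec string }

func (*WidgetV1) ConvertTo(ctx context.Context, to convertible) error {
	return fmt.Errorf("v1 is the hub and cannot convert to %T", to)
}

func (*WidgetV1) ConvertFrom(ctx context.Context, from convertible) error {
	return fmt.Errorf("v1 is the hub and cannot convert from %T", from)
}

// WidgetV1alpha1 is a spoke version.
type WidgetV1alpha1 struct{ LegacySpec string }

func (w *WidgetV1alpha1) ConvertTo(ctx context.Context, to convertible) error {
	hub, ok := to.(*WidgetV1)
	if !ok {
		return fmt.Errorf("unsupported conversion target %T", to)
	}
	hub.Spec = w.LegacySpec
	return nil
}

func (w *WidgetV1alpha1) ConvertFrom(ctx context.Context, from convertible) error {
	hub, ok := from.(*WidgetV1)
	if !ok {
		return fmt.Errorf("unsupported conversion source %T", from)
	}
	w.LegacySpec = hub.Spec
	return nil
}

// convert is the hub-and-spoke step performed for any pair of spoke
// versions: spoke -> hub, then hub -> spoke. (When either side already is
// the hub, the corresponding step is skipped in practice.)
func convert(ctx context.Context, from, to, hub convertible) error {
	if err := from.ConvertTo(ctx, hub); err != nil {
		return err
	}
	return to.ConvertFrom(ctx, hub)
}
```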
* Webhook now can host ConversionControllers
* injection/sharedmain now supports webhook.ConversionControllers
These conversion controllers will be hosted by the webhook that
sharedmain starts
* support defaulting & include godoc
* Refactor webhook to allow adding conversion support
* pr feedback
* fix memory leak
* We can use mux.Handle
* move admission integration tests to separate file
By combining our validation logic into our mutating webhook, we were previously allowing mutating webhooks evaluated after our own to modify our resources into invalid shapes. There are no guarantees around the ordering of mutating webhooks (that I could find), so the only way to remedy this properly is to split the two apart into separate webhook configurations:
- `defaulting`: which runs during the mutating admission webhook phase
- `validation`: which runs during the validating admission webhook phase.
The diagram in [this post](https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/) is very helpful in illustrating the flow of webhooks.
Fixes: https://github.com/knative/pkg/issues/847
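A sketch of what the split wiring can look like downstream; the `Widget` types, names, and paths are illustrative, and the `defaulting.NewAdmissionController`/`validation.NewAdmissionController`/`certificates.NewController` signatures are assumptions based on knative/pkg rather than confirmed API.
```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime/schema"

	"knative.dev/pkg/configmap"
	"knative.dev/pkg/controller"
	"knative.dev/pkg/injection/sharedmain"
	"knative.dev/pkg/webhook/certificates"
	"knative.dev/pkg/webhook/resourcesemantics"
	"knative.dev/pkg/webhook/resourcesemantics/defaulting"
	"knative.dev/pkg/webhook/resourcesemantics/validation"
)

// types lists the resources both webhooks know about.
var types = map[schema.GroupVersionKind]resourcesemantics.GenericCRD{
	// {Group: "example.dev", Version: "v1", Kind: "Widget"}: &v1.Widget{},
}

// The mutating webhook only defaults; the validating webhook only validates,
// so later mutating webhooks can no longer produce invalid shapes unchecked.
func newDefaultingAdmissionController(ctx context.Context, _ configmap.Watcher) *controller.Impl {
	return defaulting.NewAdmissionController(ctx,
		"defaulting.webhook.example.dev", // MutatingWebhookConfiguration name
		"/defaulting",                    // path served for the mutating phase
		types,
		func(ctx context.Context) context.Context { return ctx },
		true, // disallow unknown fields
	)
}

func newValidationAdmissionController(ctx context.Context, _ configmap.Watcher) *controller.Impl {
	return validation.NewAdmissionController(ctx,
		"validation.webhook.example.dev", // ValidatingWebhookConfiguration name
		"/validation",                    // path served for the validating phase
		types,
		func(ctx context.Context) context.Context { return ctx },
		true,
	)
}

func main() {
	// webhook.Options context setup elided; see the earlier sharedmain sketch.
	sharedmain.Main("webhook",
		certificates.NewController, // keeps the serving cert secret populated
		newDefaultingAdmissionController,
		newValidationAdmissionController,
	)
}
```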
GetCertificate allows us to start in TLS mode and dynamically fetch new certificates as they change. This will eventually allow us to decouple the cert creation process from the core webhook logic and, in a subsequent change, serve the certificate from a secret lister cache.
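A minimal sketch of that flow using the standard `crypto/tls` `GetCertificate` hook; the certificate store and handler here are placeholders for the secret-lister-backed lookup described above.
```go
package main

import (
	"crypto/tls"
	"errors"
	"log"
	"net/http"
	"sync/atomic"
)

// certStore holds the most recently loaded serving certificate; in the
// follow-up described above, a secret lister would call Store whenever the
// cert secret changes. (Names here are illustrative.)
var certStore atomic.Pointer[tls.Certificate]

func serve(handler http.Handler) error {
	server := &http.Server{
		Addr:    ":8443",
		Handler: handler,
		TLSConfig: &tls.Config{
			// GetCertificate is consulted on every handshake, so a rotated
			// certificate takes effect without restarting the server.
			GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
				if cert := certStore.Load(); cert != nil {
					return cert, nil
				}
				return nil, errors.New("serving certificate not yet available")
			},
		},
	}
	// Empty file names: the certificate comes from GetCertificate instead.
	return server.ListenAndServeTLS("", "")
}

func main() {
	// The webhook's admission mux would replace this placeholder handler.
	log.Fatal(serve(http.NotFoundHandler()))
}
```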