- Configure all gRPC clients to check the overall serving status of each
endpoint via the `grpc_health_v1` service.
- Configure all gRPC servers to expose the `grpc_health_v1` service to
any client permitted to access one of the server’s services.
- Modify long-running, deep health checks to set and transition the
overall (empty string) health status of the gRPC server in addition to
the specific service they were configured for.
Fixes #8227
Changes the gRPC client/server histogram buckets from the Prometheus
defaults to better track the long tail of slow requests. Removes the
`.005` and `.25` buckets and adds the larger values `45` and `90`, so the
number of buckets, and therefore the cardinality, stays the same.
```
# Before, the default prometheus buckets
[]float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
# After
[]float64{.01, .025, .05, .1, .5, 1, 2.5, 5, 10, 45, 90}
```
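As a quick sanity check of the claim above, here is a small sketch that compares the two bucket lists. The `bucketFor` helper is hypothetical and only mimics the "smallest upper bound that is greater than or equal to the observation" rule; it is not part of Boulder or the Prometheus client.

```go
package main

import (
	"fmt"
	"sort"
)

// Bucket upper bounds in seconds, copied from the description above.
var (
	before = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
	after  = []float64{.01, .025, .05, .1, .5, 1, 2.5, 5, 10, 45, 90}
)

// bucketFor returns the smallest upper bound that is >= seconds, or "+Inf"
// when the observation exceeds every finite bucket.
func bucketFor(buckets []float64, seconds float64) string {
	i := sort.SearchFloat64s(buckets, seconds)
	if i == len(buckets) {
		return "+Inf"
	}
	return fmt.Sprint(buckets[i])
}

func main() {
	// Same number of buckets before and after, so cardinality is unchanged.
	fmt.Println(len(before), len(after))
	// A 30s request used to be indistinguishable from any other >10s request.
	fmt.Println(bucketFor(before, 30), bucketFor(after, 30))
}
```

With the old buckets a 30 second request only shows up in the implicit `+Inf` bucket; with the new ones it lands in the `45` bucket.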
Fixes https://github.com/letsencrypt/boulder/issues/6384
Directly update:
- go.opentelemetry.io/otel/* from v1.26.0 to v1.27.0
- go.opentelemetry.io/contrib/* from v0.51.0 to v0.52.0
Indirectly update:
- google.golang.org/protobuf from v1.33.0 to v1.34.0
This update breaks some of our existing otel grpc interceptors, but in
return allows us to use the newer grpc StatsHandler mechanism, while
still filtering out health-check requests.
Fixes https://github.com/letsencrypt/boulder/issues/7235
Many services already have --addr and/or --debug-addr flags.
However, they weren't universal, so this PR adds the flags to commands
where they're not currently present.
This makes it easier to use a shared config file but listen on different
ports, for running multiple instances on a single host.
The config options are made optional as well, and removed from
config-next/.
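A minimal sketch of the override behaviour described above, assuming the flag wins whenever it is set and the config value is used otherwise. The `resolveAddr` helper and the port numbers are hypothetical and are not taken from Boulder.

```go
package main

import (
	"flag"
	"fmt"
)

// resolveAddr prefers the command-line value and falls back to the config
// file value when the flag was not set.
func resolveAddr(flagValue, configValue string) string {
	if flagValue != "" {
		return flagValue
	}
	return configValue
}

func main() {
	addr := flag.String("addr", "", "gRPC listen address; overrides the config file")
	debugAddr := flag.String("debug-addr", "", "debug listen address; overrides the config file")
	flag.Parse()

	// Hypothetical values standing in for a parsed, shared config file.
	cfgAddr, cfgDebugAddr := ":9095", ":8005"

	fmt.Println(resolveAddr(*addr, cfgAddr), resolveAddr(*debugAddr, cfgDebugAddr))
}
```

Two instances on one host could then share a config file and differ only in the `--addr` and `--debug-addr` values passed on the command line.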
Add the necessary scaffolding for deep health checking of our various
gRPC components. Each component implementation that also implements the
grpc.checker interface will be checked periodically, and the health
status of the component will be updated accordingly.
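The scaffolding described above can be sketched roughly as follows. The `Health` method name, the `statusBoard` type, and the service name are assumptions made for illustration; the real interface and health server may look different.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// checker is a hypothetical stand-in for the interface described above.
type checker interface {
	Health(ctx context.Context) error
}

// statusBoard records a serving status per service name, loosely modelled on
// a gRPC health server.
type statusBoard struct {
	mu     sync.Mutex
	status map[string]string
}

func (b *statusBoard) set(service, status string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.status[service] = status
}

func (b *statusBoard) get(service string) string {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.status[service]
}

// checkOnce runs a single health check and records the result. A real
// implementation would call this from a ticker loop.
func checkOnce(ctx context.Context, b *statusBoard, service string, c checker) {
	status := "SERVING"
	if err := c.Health(ctx); err != nil {
		status = "NOT_SERVING"
	}
	b.set(service, status)
}

// fakeStore is a toy component whose health can be toggled.
type fakeStore struct{ up bool }

func (f *fakeStore) Health(context.Context) error {
	if !f.up {
		return errors.New("database unreachable")
	}
	return nil
}

// demoChecker returns the recorded status before and after an outage.
func demoChecker() (string, string) {
	board := &statusBoard{status: map[string]string{}}
	store := &fakeStore{up: true}
	checkOnce(context.Background(), board, "sa.StorageAuthority", store)
	healthy := board.get("sa.StorageAuthority")
	store.up = false
	checkOnce(context.Background(), board, "sa.StorageAuthority", store)
	return healthy, board.get("sa.StorageAuthority")
}

func main() {
	fmt.Println(demoChecker())
}
```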
Add the necessary methods to SA to implement the grpc.checker interface
and register these new health checks with Consul.
Additionally:
- Update entry point script to check for ProxySQL readiness.
- Increase the poll rate for gRPC Consul checks from 5s to 2s to reduce
DNS failures on startup that are caused by check failures.
- Change log level for Consul from INFO to ERROR to deal with noisy logs
full of transport failures due to Consul gRPC checks firing before the
SAs are up.
Fixes #6878
Part of #6795
Add a new shared config stanza which all boulder components can use to
configure their OpenTelemetry tracing. This allows components to
specify where their traces should be sent, what their sampling ratio
should be, and whether or not they should respect their parent's
sampling decisions (so that web front-ends can ignore sampling info
coming from outside our infrastructure). It's likely we'll need to
evolve this configuration over time, but this is a good starting point.
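To make the three knobs above concrete, here is a sketch of what such a stanza could look like when parsed from JSON. The struct, its field names, and the example endpoint are hypothetical and are not Boulder's actual config types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tracingConfig is a hypothetical shape for the shared stanza; the field
// names are illustrative only.
type tracingConfig struct {
	// Endpoint is where traces should be sent.
	Endpoint string `json:"endpoint"`
	// SampleRatio is the fraction of traces to sample, between 0 and 1.
	SampleRatio float64 `json:"sampleRatio"`
	// DisableParentSampler makes the component ignore sampling decisions
	// made by its caller, e.g. for web front-ends.
	DisableParentSampler bool `json:"disableParentSampler"`
}

// parseTracingConfig decodes the stanza and rejects out-of-range ratios.
func parseTracingConfig(raw string) (tracingConfig, error) {
	var c tracingConfig
	if err := json.Unmarshal([]byte(raw), &c); err != nil {
		return c, err
	}
	if c.SampleRatio < 0 || c.SampleRatio > 1 {
		return c, fmt.Errorf("sampleRatio must be between 0 and 1, got %g", c.SampleRatio)
	}
	return c, nil
}

func main() {
	c, err := parseTracingConfig(`{
		"endpoint": "collector.example:4317",
		"sampleRatio": 0.01,
		"disableParentSampler": true
	}`)
	fmt.Println(c, err)
}
```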
Add basic OpenTelemetry setup to our existing cmd.StatsAndLogging
helper, so that it gets initialized at the same time as our other
observability helpers. This sets certain default fields on all
traces/spans generated by the service. Currently these include the
service name, the service version, and information about the telemetry
SDK itself. In the future we'll likely augment this with information
about the host and process.
Finally, add instrumentation for the HTTP servers and grpc
clients/servers. This gives us a starting point of being able to monitor
Boulder, but is fairly minimal as this PR is already somewhat unwieldy:
It's really only enough to understand that everything is wired up
properly in the configuration. In subsequent work we'll enhance those
spans with more data, and add more spans for things not automatically
traced here.
Fixes https://github.com/letsencrypt/boulder/issues/6361
---------
Co-authored-by: Aaron Gable <aaron@aarongable.com>
Although #6771 significantly cleaned up how gRPC services stop and clean
up, it didn't make any changes to our HTTP servers or our non-server
(e.g. crl-updater, log-validator) processes. This change finishes the
work.
Add a new helper method cmd.WaitForSignal, which simply blocks until one
of the three signals we care about is received. This easily replaces all
calls to cmd.CatchSignals which passed `nil` as the callback argument,
with the added advantage that it doesn't call os.Exit() and therefore
allows deferred cleanup functions to execute. This new function is
intended to be the last line of main(), allowing the whole process to
exit once it returns.
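A rough sketch of the idea, not the actual cmd.WaitForSignal implementation: the choice of SIGTERM, SIGINT, and SIGHUP as the three signals is an assumption, and the helper names are hypothetical. The point is that blocking on a channel and then returning lets deferred cleanup run, which os.Exit() would skip.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// newSignalChannel subscribes to the three signals this sketch assumes we
// care about.
func newSignalChannel() chan os.Signal {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT, syscall.SIGHUP)
	return sigs
}

// waitForSignal blocks until a signal arrives, then simply returns it.
func waitForSignal(sigs <-chan os.Signal) os.Signal {
	return <-sigs
}

// run models a main() whose last statement is the wait. Because nothing calls
// os.Exit, the deferred cleanup still executes.
func run(sigs <-chan os.Signal) (events []string) {
	defer func() { events = append(events, "deferred cleanup ran") }()
	sig := waitForSignal(sigs)
	events = append(events, "received "+sig.String())
	return events
}

// simulate injects a signal directly into the channel as a stand-in for the
// operating system delivering one.
func simulate() []string {
	sigs := newSignalChannel()
	defer signal.Stop(sigs)
	sigs <- syscall.SIGTERM
	return run(sigs)
}

func main() {
	for _, e := range simulate() {
		fmt.Println(e)
	}
}
```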
Reimplement cmd.CatchSignals as a thin wrapper around cmd.WaitForSignal,
but with the added callback functionality. Also remove the os.Exit()
call from CatchSignals, so that the main goroutine is allowed to finish
whatever it's doing, call deferred functions, and exit naturally.
Update all of our non-gRPC binaries to use one of these two functions.
The vast majority use WaitForSignal, as they run their main processing
loop in a background goroutine. A few (particularly those that can run
either in run-once or in daemonized mode) still use CatchSignals, since
their primary processing happens directly on the main goroutine.
The changes to //test/load-generator are the most invasive, simply
because that binary needed to have a context plumbed into it for proper
cancellation, but it already had a custom struct type named "context"
which needed to be renamed to avoid shadowing.
Fixes https://github.com/letsencrypt/boulder/issues/6794
The CA, RA, and VA have multiple goroutines running alongside the primary
gRPC handling goroutine. These ancillary goroutines should be gracefully
shut down when the process is about to exit. Historically, we have
handled this by putting a call to each of these goroutines' shutdown
functions inside cmd.CatchSignals, so that when a SIGINT is received, all
of the various cleanup routines happen in sequence.
But there's a cleaner way to do it: just use defer! All of these
cleanups need to happen after the primary gRPC server has fully shut
down, so that we know they stick around at least as long as the service
is handling gRPC requests. And when the service receives a SIGINT,
cmd.CatchSignals will call the gRPC server's GracefulStop, which will
cause the server's .Serve() to finally exit, which will cause start() to
exit, which will cause main() to exit, which will cause all deferred
functions to be run.
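The chain of events above can be illustrated with a toy model. `fakeServer` is a hypothetical stand-in for a gRPC server whose Serve() blocks until GracefulStop() is called; the cleanup names are made up.

```go
package main

import "fmt"

// fakeServer models a server whose Serve blocks until GracefulStop is called.
type fakeServer struct{ stopped chan struct{} }

func (s *fakeServer) Serve()        { <-s.stopped }
func (s *fakeServer) GracefulStop() { close(s.stopped) }

// runService models main(): cleanups are deferred up front and only run once
// Serve has returned.
func runService() []string {
	var events []string
	record := func(e string) { events = append(events, e) }
	srv := &fakeServer{stopped: make(chan struct{})}

	func() {
		// Deferred calls run in last-in, first-out order on return.
		defer record("cleanup A")
		defer record("cleanup B")

		// Stand-in for the signal handler calling GracefulStop.
		go srv.GracefulStop()

		srv.Serve()
		record("Serve returned")
	}()
	return events
}

func main() {
	for _, e := range runService() {
		fmt.Println(e)
	}
}
```

The recorded order is "Serve returned", then "cleanup B", then "cleanup A", so the ancillary cleanups never run while the server is still handling requests.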
In addition, remove filterShutdownErrors as the bug which made it
necessary (.Serve() returning an error even when GracefulStop() is
called) was fixed back in 2017. This allows us to call the start()
function in a much more natural way, simply logging any error it returns
instead of calling os.Exit(1) if it returns an error.
This allows us to simplify the exit-handling code in these three
services' main() functions, and lets us be a bit more idiomatic with our
deferred cleanup functions.
Part of #6794
Remove tracing using Beeline from Boulder. The only remnant left behind
is the deprecated configuration, to ensure deployability.
We had previously planned to swap in OpenTelemetry in a single PR, but
that adds significant churn in a single change, so we're doing this as
multiple steps that will each be significantly easier to reason about
and review.
Part of #6361
Add a new gRPC server interceptor (both unary and streaming) which
verifies that the mTLS info set on the persistent connection has a
client cert which contains a name which is allowlisted for the
particular service being called, not just for the overall server.
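A sketch of the core check such an interceptor might perform, assuming gRPC full method names of the form "/pkg.Service/Method". The helper names, service names, and client names below are hypothetical; the real interceptor reads the client certificate from the connection's TLS state instead of taking a slice of names.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceFromMethod extracts "pkg.Service" from a full method name of the
// form "/pkg.Service/Method".
func serviceFromMethod(fullMethod string) (string, bool) {
	parts := strings.Split(fullMethod, "/")
	if len(parts) != 3 || parts[0] != "" || parts[1] == "" || parts[2] == "" {
		return "", false
	}
	return parts[1], true
}

// allowed reports whether any name from the client certificate is on the
// allowlist for the specific service being called.
func allowed(allowlists map[string][]string, fullMethod string, certNames []string) bool {
	service, ok := serviceFromMethod(fullMethod)
	if !ok {
		return false
	}
	for _, want := range allowlists[service] {
		for _, got := range certNames {
			if got == want {
				return true
			}
		}
	}
	return false
}

func main() {
	// Hypothetical per-service allowlists for two services on one server.
	allowlists := map[string][]string{
		"sa.StorageAuthority":         {"ra.example"},
		"sa.StorageAuthorityReadOnly": {"ra.example", "wfe.example"},
	}
	fmt.Println(allowed(allowlists, "/sa.StorageAuthorityReadOnly/GetRegistration", []string{"wfe.example"}))
	fmt.Println(allowed(allowlists, "/sa.StorageAuthority/NewRegistration", []string{"wfe.example"}))
}
```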
This will allow us to make more services -- particularly the CA and the
SA -- more similar to the VA. We will be able to run multiple services
on the same port, while still being able to control access to those
services on a per-client basis. It will also let us split those services
(e.g. into read-only and read-write subsets) much more easily, because a
client will be able to switch which service it is calling without also
having to be reconfigured to call a different address. And finally, it
will allow us to simplify configuration for clients (such as the RA)
which maintain connections to multiple different services on the same
server, as they'll be able to re-use the same address configuration.
Turn bgrpc.NewServer into a builder pattern, with a config-based
initialization, multiple calls to Add to add new gRPC services, and a
final call to Build to produce the start() and stop() functions which
control server behavior. All calls are chainable to produce compact code
in each component's main() function.
This improves the process of creating a new gRPC server in three ways:
1) It avoids the need for generics/templating, which was slightly
verbose.
2) It allows the set of services to be registered on this server to be
known ahead of time.
3) It greatly streamlines adding multiple services to the same server,
which we use today in the VA and will be using soon in the SA and CA.
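The builder flow described above could look something like the following toy version. The `serverBuilder` type, the string-based `Add`, and the service names are assumptions for illustration; the real bgrpc builder registers actual gRPC service implementations.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// serverBuilder is a toy builder; it only tracks service names.
type serverBuilder struct {
	addr     string
	services []string
}

func newServer(addr string) *serverBuilder {
	return &serverBuilder{addr: addr}
}

// Add registers one more service and returns the builder so that calls can
// be chained.
func (sb *serverBuilder) Add(service string) *serverBuilder {
	sb.services = append(sb.services, service)
	return sb
}

// Build returns start and stop functions. start blocks until stop is called.
func (sb *serverBuilder) Build() (start func() error, stop func(), err error) {
	if len(sb.services) == 0 {
		return nil, nil, errors.New("no services were added")
	}
	done := make(chan struct{})
	var once sync.Once
	start = func() error {
		fmt.Printf("serving %v on %s\n", sb.services, sb.addr)
		<-done
		return nil
	}
	stop = func() { once.Do(func() { close(done) }) }
	return start, stop, nil
}

func main() {
	start, stop, err := newServer(":9092").Add("va.VA").Add("va.CAA").Build()
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	go stop() // stand-in for a signal handler
	fmt.Println("start returned:", start())
}
```

Because the full set of services is known by the time Build is called, anything that depends on that set can be prepared before the server starts.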
While we're here, add a new per-service config stanza to the
GRPCServerConfig, so that individual services on the same server can
have their own configuration. For now, only provide a "ClientNames" key,
which will be used in a follow-up PR.
Part of #6454
Collapse most of our boilerplate gRPC creation steps (in particular,
creating default metrics, making the server and listener, registering
the server, creating and registering the health service, filtering
shutdown errors from the output, and gracefully stopping) into a single
function in the existing bgrpc package. This allows all but one of our
server main functions to drop their calls to NewServer and
NewServerMetrics.
To enable this, create a new helper type and method in the bgrpc
package. Conceptually, this could be just a new function, but it must be
attached to a new type so that it can be generic over the type of gRPC
server being created. (Unfortunately, the grpc.RegisterFooServer methods
do not accept an interface type for their second argument).
The only main function which is not updated is the boulder-va, which is
a special case because it creates multiple gRPC servers but (unlike the
CA) serves them all on the same port with the same server and listener.
Part of #6452
Create new gRPC interceptors which are capable of working
on streaming gRPC methods. Add these new interceptors, as
well as the default metrics interceptor provided by grpc-prometheus,
to all of our gRPC clients and servers.
The new interceptors behave virtually identically to their unary
counterparts: they wrap and unwrap our custom errors from the
gRPC metadata, they increment and decrement the in-flight RPC
metric, and they ensure that the RPCs don't fail-fast and do have
enough time left in their deadline to actually finish.
Unfortunately, because the interfaces for unary and streaming
RPCs are so divergent, it's not feasible to share code between the
two kinds of interceptors. While much of the new code is copy-pasted
from the old interceptors, there are subtle differences (such as not
immediately deferring the local context's cancel() function).
Fixes #6356
Add `stylecheck` to our list of lints, since it got separated out from
`staticcheck`. Fix the way we configure both to be clearer and not
rely on regexes.
Additionally fix a number of easy-to-change `staticcheck` and
`stylecheck` violations, allowing us to reduce our number of ignored
checks.
Part of #5681
A recent mysql driver upgrade caused a performance regression. We
believe this may be due to cancellations getting passed through to the
database driver, which as of the upgrade will more aggressively tear
down connections that experienced a cancellation.
Also, we only recently started propagating cancellations all the way
from the frontend in #5404.
This makes it so the driver doesn't see the cancellation.
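One way to hide a cancellation from a callee, shown here as a general sketch and not necessarily the approach taken in this change, is to hand it a fresh context that keeps the caller's deadline but is not cancelled when the caller is. The `shield` helper is hypothetical, and note that this version also drops any values stored on the parent context.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// shield returns a context that keeps the parent's deadline (if any) but is
// not cancelled when the parent is cancelled.
func shield(parent context.Context) (context.Context, context.CancelFunc) {
	if deadline, ok := parent.Deadline(); ok {
		return context.WithDeadline(context.Background(), deadline)
	}
	return context.WithCancel(context.Background())
}

// demoShield cancels the parent and reports the error state of both contexts.
func demoShield() (parentErr, shieldedErr error) {
	parent, cancelParent := context.WithTimeout(context.Background(), time.Minute)
	shielded, cancelShielded := shield(parent)
	defer cancelShielded()

	cancelParent() // e.g. the frontend client went away
	return parent.Err(), shielded.Err()
}

func main() {
	parentErr, shieldedErr := demoShield()
	fmt.Println("parent:", parentErr)
	fmt.Println("shielded:", shieldedErr)
}
```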
Second attempt at #5447
Use the built-in grpc-go client and server interceptor chaining
utilities, instead of the ones provided by go-grpc-middleware.
Simplify our interceptors to call their handlers/invokers directly,
instead of delegating to the metrics interceptor, and add the
metrics interceptor to the chains instead.
Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).
Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.
Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
This allows servers to tell clients to go away after some period of time, which triggers the clients to re-resolve DNS.
Per grpc/grpc#12295, this is the preferred way to do this.
Related: #5307.
In all boulder services, we construct a single tls.Config object
and then pass it into multiple gRPC setup methods. In all boulder
services but one, we pass the object into multiple clients, and
just one server. In general, this is safe, because all of the client
setup happens on the main thread, and the server setup similarly
happens on the main thread before spinning off the gRPC server
goroutine.
In the CA, we do the above and pass the tlsConfig object into two
gRPC server setup functions. Thus the first server goroutine races
with the setup of the second server.
This change removes the post-hoc assignment of MinVersion,
MaxVersion, and CipherSuites on the tls.Config object passed
to grpc.ClientSetup and grpc.NewServer. Instead, it sets those same
values in cmd.TLSConfig.Load, the method responsible for
constructing the tls.Config object before it's passed to
grpc.ClientSetup and grpc.NewServer.
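The shape of the fix can be sketched like this: populate every field when the config is constructed, so that later setup functions only read it. The function names are hypothetical, and the specific TLS version and cipher suite are illustrative assumptions, not values stated in this change.

```go
package main

import (
	"crypto/tls"
	"fmt"
)

// loadTLSConfig builds a fully populated config once, before it is shared.
// The version and cipher suite values here are illustrative assumptions.
func loadTLSConfig() *tls.Config {
	return &tls.Config{
		MinVersion:   tls.VersionTLS12,
		MaxVersion:   tls.VersionTLS12,
		CipherSuites: []uint16{tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256},
	}
}

// setupServer stands in for a gRPC server setup function. It only reads the
// shared config, so calling it for a second server cannot race with a
// goroutine already serving the first.
func setupServer(name string, cfg *tls.Config) string {
	return fmt.Sprintf("%s: TLS %#x to %#x, %d cipher suite(s)",
		name, cfg.MinVersion, cfg.MaxVersion, len(cfg.CipherSuites))
}

func main() {
	cfg := loadTLSConfig()
	fmt.Println(setupServer("first server", cfg))
	fmt.Println(setupServer("second server", cfg))
}
```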
Part of #5159
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.
This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.
Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands.
A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.
Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
During periods of peak load, some RPCs are significantly delayed (on the order of seconds) by client-side blocking. HTTP/2 clients have to obey a "max concurrent streams" setting sent by the server. In Go's HTTP/2 implementation, this value [defaults to 250](https://github.com/golang/net/blob/master/http2/server.go#L56), so the gRPC default is also 250. So whenever there are more than 250 requests in progress at a time, additional requests will be delayed until there is a slot available.
During this peak load, we aren't hitting limits on CPU or memory, so we should increase the max concurrent streams limit to take better advantage of our available resources. This PR adds a config field to do that.
Fixes #3641.
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.
Specifically, it adds two convenience functions, one for clients and one for servers, that each make the necessary metrics object and register it. We run these in the main function of each server.
I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.
Also, update go-grpc-prometheus to get the necessary methods.
```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
Previously, a given binary would have three TLS config fields (CA cert, cert,
key) for its gRPC server, plus each of its configured gRPC clients. In typical
use, we expect all three of those to be the same across both servers and clients
within a given binary.
This change reuses the TLSConfig type already defined for use with AMQP, adds a
Load() convenience function that turns it into a *tls.Config, and configures it
for use with all of the binaries. This should make configuration easier and more
robust, since it more closely matches usage.
This change preserves temporary backwards-compatibility for the
ocsp-updater->publisher RPCs, since those are the only instances of gRPC
currently enabled in production.
There's an off-the-shelf package that provides most of the stats we care about
for gRPC using interceptors. This change vendors go-grpc-prometheus and its
dependencies, and calls out to the interceptors provided by that package from
our own interceptors.
This will allow us to get metrics like latency histograms by call, status codes
by call, and so on.
Fixes #2390.
This change vendors go-grpc-prometheus and its dependencies. Per contributing guidelines, I've run the tests on these dependencies, and they pass:
go test github.com/davecgh/go-spew/spew github.com/grpc-ecosystem/go-grpc-prometheus github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto github.com/pmezard/go-difflib/difflib github.com/stretchr/testify/assert github.com/stretchr/testify/require github.com/stretchr/testify/suite
ok github.com/davecgh/go-spew/spew 0.022s
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.120s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
ok github.com/pmezard/go-difflib/difflib 0.042s
ok github.com/stretchr/testify/assert 0.021s
ok github.com/stretchr/testify/require 0.017s
ok github.com/stretchr/testify/suite 0.012s
As described in #2282, our gRPC code uses mutual TLS to authenticate both clients and servers. However, currently our gRPC servers will accept any client certificate signed by the internal CA we use to authenticate connections. Instead, we would like each server to have a list of which clients it will accept. This will improve security by preventing the compromise of one client private key being used to access endpoints unrelated to its intended scope/purpose.
This PR implements support for gRPC servers to specify a list of accepted client names. A `serverTransportCredentials` implementing `ServerHandshake` uses a `verifyClient` function to enforce that the connecting peer presents a client certificate with a SAN entry that matches an entry on the list of accepted client names.
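A simplified sketch of the SAN check, assuming the accepted names are stored as a set of strings and that both DNS and IP SANs are compared as strings. The signature is hypothetical; the real `verifyClient` works from the TLS connection state and is not reproduced here.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
	"net"
)

// verifyClient accepts the certificate if any DNS or IP SAN is in the set of
// accepted names.
func verifyClient(cert *x509.Certificate, accepted map[string]struct{}) error {
	if cert == nil {
		return errors.New("no client certificate presented")
	}
	for _, name := range cert.DNSNames {
		if _, ok := accepted[name]; ok {
			return nil
		}
	}
	for _, ip := range cert.IPAddresses {
		if _, ok := accepted[ip.String()]; ok {
			return nil
		}
	}
	return errors.New("client certificate SANs do not match any accepted name")
}

// demoVerify checks one certificate that matches by IP SAN and one that does
// not match at all.
func demoVerify() (matchErr, mismatchErr error) {
	accepted := map[string]struct{}{"client.example": {}, "127.0.0.1": {}}
	byIP := &x509.Certificate{IPAddresses: []net.IP{net.ParseIP("127.0.0.1")}}
	stranger := &x509.Certificate{DNSNames: []string{"other.example"}}
	return verifyClient(byIP, accepted), verifyClient(stranger, accepted)
}

func main() {
	matchErr, mismatchErr := demoVerify()
	fmt.Println("matching client:", matchErr)
	fmt.Println("other client:", mismatchErr)
}
```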
The `NewServer` function from `grpc/server.go` is updated to instantiate the `serverTransportCredentials` used by `grpc.NewServer`, specifying an accepted names list populated from the `cmd.GRPCServerConfig.ClientNames` config field.
The pre-existing client and server certificates in `test/grpc-creds/` are replaced by versions that contain SAN entries as well as subject common names. A DNS and an IP SAN entry are added to allow testing both methods of specifying allowed SANs. The `generate.sh` script is converted to use @jsha's `minica` tool (OpenSSL CLI is blech!).
An example client whitelist is added to each of the existing gRPC endpoints in config-next/ to allow the SAN of the test RPC client certificate.
Resolves #2282