boulder

Commit Graph

Author	SHA1	Message	Date
Samantha Frank	1bfc3186c8	grpc: Enable client-side health_v1 health checking (#8254 ) - Configure all gRPC clients to check the overall serving status of each endpoint via the `grpc_health_v1` service. - Configure all gRPC servers to expose the `grpc_health_v1` service to any client permitted to access one of the server’s services. - Modify long-running, deep health checks to set and transition the overall (empty string) health status of the gRPC server in addition to the specific service they were configured for. Fixes #8227	2025-06-18 10:37:20 -04:00
Phil Porada	07d6713736	grpc: client/server histogram bucket change (#7591 ) Changes the default grpc client/server histogram buckets from the defaults to better track the long tail of slow requests. Removes `.005` and `.25` granularity in favor of adding the larger values of `45` and `90` to avoid changing the cardinality. ``` # Before, the default prometheus buckets []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10} # After []float64{.01, .025, .05, .1, .5, 1, 2.5, 5, 10, 45, 90} ``` Fixes https://github.com/letsencrypt/boulder/issues/6384	2024-07-16 10:36:57 -07:00
Phil Porada	472effbb9b	grpc: Switch to go-grpc-middleware/providers/prometheus (#7588 ) While investigating #6384, I noticed that [go-grpc-prometheus](https://github.com/grpc-ecosystem/go-grpc-prometheus?tab=readme-ov-file) was deprecated last year and users should switch to [go-grpc-middleware](https://github.com/grpc-ecosystem/go-grpc-middleware) instead. The default prometheus histogram buckets will continue to be used and [can be found here](`6e3f4b1091/prometheus/histogram.go (L261-L265)`).	2024-07-12 15:02:37 -04:00
dependabot[bot]	6b4577ecc4	update otel dependencies to v1.27.0 and v0.52.0 (#7496 ) Directly update: - go.opentelemetry.io/otel/* from v1.26.0 to v1.27.0 - go.opentelemetry.io/contrib/* from v0.51.0 to v0.52.0 Indirectly update: - google.golang.org/protobuf from v1.33.0 to v1.34.0 This update breaks some of our existing otel grpc interceptors, but in return allows us to use the newer grpc StatsHandler mechanism, while still filtering out health-check requests. Fixes https://github.com/letsencrypt/boulder/issues/7235	2024-05-29 15:46:35 -07:00
Matthew McPherrin	cb5384dcd7	Add --addr and/or --debug-addr flags to all commands (#7175 ) Many services already have --addr and/or --debug-addr flags. However, it wasn't universal, so this PR adds flags to commands where they're not currently present. This makes it easier to use a shared config file but listen on different ports, for running multiple instances on a single host. The config options are made optional as well, and removed from config-next/.	2023-12-07 17:41:01 -08:00
Samantha	124c4cc6f5	grpc/sa: Implement deep health checks (#6928 ) Add the necessary scaffolding for deep health checking of our various gRPC components. Each component implementation that also implements the grpc.checker interface will be checked periodically, and the health status of the component will be updated accordingly. Add the necessary methods to SA to implement the grpc.checker interface and register these new health checks with Consul. Additionally: - Update entry point script to check for ProxySQL readiness. - Increase the poll rate for gRPC Consul checks from 5s to 2s to help with DNS failures, due to check failures, on startup. - Change log level for Consul from INFO to ERROR to deal with noisy logs full of transport failures due to Consul gRPC checks firing before the SAs are up. Fixes #6878 Part of #6795	2023-06-12 13:58:53 -04:00
Samantha	c453ca0571	grpc: Deprecate clientNames field (#6870) - SRE removed in IN-8755 Fixes #6698	2023-05-08 14:49:27 -04:00
Matthew McPherrin	0060e695b5	Introduce OpenTelemetry Tracing (#6750 ) Add a new shared config stanza which all boulder components can use to configure their Open Telemetry tracing. This allows components to specify where their traces should be sent, what their sampling ratio should be, and whether or not they should respect their parent's sampling decisions (so that web front-ends can ignore sampling info coming from outside our infrastructure). It's likely we'll need to evolve this configuration over time, but this is a good starting point. Add basic Open Telemetry setup to our existing cmd.StatsAndLogging helper, so that it gets initialized at the same time as our other observability helpers. This sets certain default fields on all traces/spans generated by the service. Currently these include the service name, the service version, and information about the telemetry SDK itself. In the future we'll likely augment this with information about the host and process. Finally, add instrumentation for the HTTP servers and grpc clients/servers. This gives us a starting point of being able to monitor Boulder, but is fairly minimal as this PR is already somewhat unwieldy: It's really only enough to understand that everything is wired up properly in the configuration. In subsequent work we'll enhance those spans with more data, and add more spans for things not automatically traced here. Fixes https://github.com/letsencrypt/boulder/issues/6361 --------- Co-authored-by: Aaron Gable <aaron@aarongable.com>	2023-04-21 10:46:59 -07:00
Aaron Gable	bd1d27b8e8	Fix non-gRPC process cleanup and exit (#6808 ) Although #6771 significantly cleaned up how gRPC services stop and clean up, it didn't make any changes to our HTTP servers or our non-server (e.g. crl-updater, log-validator) processes. This change finishes the work. Add a new helper method cmd.WaitForSignal, which simply blocks until one of the three signals we care about is received. This easily replaces all calls to cmd.CatchSignals which passed `nil` as the callback argument, with the added advantage that it doesn't call os.Exit() and therefore allows deferred cleanup functions to execute. This new function is intended to be the last line of main(), allowing the whole process to exit once it returns. Reimplement cmd.CatchSignals as a thin wrapper around cmd.WaitForSignal, but with the added callback functionality. Also remove the os.Exit() call from CatchSignals, so that the main goroutine is allowed to finish whatever it's doing, call deferred functions, and exit naturally. Update all of our non-gRPC binaries to use one of these two functions. The vast majority use WaitForSignal, as they run their main processing loop in a background goroutine. A few (particularly those that can run either in run-once or in daemonized mode) still use CatchSignals, since their primary processing happens directly on the main goroutine. The changes to //test/load-generator are the most invasive, simply because that binary needed to have a context plumbed into it for proper cancellation, but it already had a custom struct type named "context" which needed to be renamed to avoid shadowing. Fixes https://github.com/letsencrypt/boulder/issues/6794	2023-04-14 16:22:56 -04:00
Aaron Gable	d6cd589795	Simplify how gRPC services start, stop, and clean up (#6771) The CA, RA, and VA have multiple goroutines running alongside the primary gRPC handling goroutine. These ancillary goroutines should be gracefully shut down when the process is about to exit. Historically, we have handled this by putting a call to each of these goroutines' shutdown functions inside cmd.CatchSignals, so that when a SIGINT is received, all of the various cleanup routines happen in sequence. But there's a cleaner way to do it: just use defer! All of these cleanups need to happen after the primary gRPC server has fully shut down, so that we know they stick around at least as long as the service is handling gRPC requests. And when the service receives a SIGINT, cmd.CatchSignals will call the gRPC server's GracefulStop, which will cause the server's .Serve() to finally exit, which will cause start() to exit, which will cause main() to exit, which will cause all deferred functions to be run. In addition, remove filterShutdownErrors as the bug which made it necessary (.Serve() returning an error even when GracefulStop() is called) was fixed back in 2017. This allows us to call the start() function in a much more natural way, simply logging any error it returns instead of calling os.Exit(1) if it returns an error. This allows us to simplify the exit-handling code in these three services' main() functions, and lets us be a bit more idiomatic with our deferred cleanup functions. Part of #6794	2023-04-05 14:55:57 -07:00
Matthew McPherrin	e1ed1a2ac2	Remove beeline tracing (#6733 ) Remove tracing using Beeline from Boulder. The only remnant left behind is the deprecated configuration, to ensure deployability. We had previously planned to swap in OpenTelemetry in a single PR, but that adds significant churn in a single change, so we're doing this as multiple steps that will each be significantly easier to reason about and review. Part of #6361	2023-03-14 15:14:27 -07:00
Samantha	8227052345	GRPC: Add TODO for Config.GRPCServerConfig.ClientNames (#6718) Part of #6698	2023-03-02 17:17:55 -05:00
Jacob Hoffman-Andrews	c23e59ba59	wfe2: don't pass through client-initiated cancellation (#6608 ) And clean up the code and tests that were used for cancellation pass-through. Fixes #6603	2023-01-26 17:26:15 -08:00
Aaron Gable	257136779c	Add interceptor for per-rpc client auth (#6488 ) Add a new gRPC server interceptor (both unary and streaming) which verifies that the mTLS info set on the persistent connection has a client cert which contains a name which is allowlisted for the particular service being called, not just for the overall server. This will allow us to make more services -- particularly the CA and the SA -- more similar to the VA. We will be able to run multiple services on the same port, while still being able to control access to those services on a per-client basis. It will also let us split those services (e.g. into read-only and read-write subsets) much more easily, because a client will be able to switch which service it is calling without also having to be reconfigured to call a different address. And finally, it will allow us to simplify configuration for clients (such as the RA) which maintain connections to multiple different services on the same server, as they'll be able to re-use the same address configuration.	2022-11-07 13:47:47 -08:00
Aaron Gable	46c8d66c31	bgrpc.NewServer: support multiple services (#6487 ) Turn bgrpc.NewServer into a builder-pattern, with a config-based initialization, multiple calls to Add to add new gRPC services, and a final call to Build to produce the start() and stop() functions which control server behavior. All calls are chainable to produce compact code in each component's main() function. This improves the process of creating a new gRPC server in three ways: 1) It avoids the need for generics/templating, which was slightly verbose. 2) It allows the set of services to be registered on this server to be known ahead of time. 3) It greatly streamlines adding multiple services to the same server, which we use today in the VA and will be using soon in the SA and CA. While we're here, add a new per-service config stanza to the GRPCServerConfig, so that individual services on the same server can have their own configuration. For now, only provide a "ClientNames" key, which will be used in a follow-up PR. Part of #6454	2022-11-04 13:26:42 -07:00
Aaron Gable	9213bd0993	Streamline gRPC server creation (#6457 ) Collapse most of our boilerplate gRPC creation steps (in particular, creating default metrics, making the server and listener, registering the server, creating and registering the health service, filtering shutdown errors from the output, and gracefully stopping) into a single function in the existing bgrpc package. This allows all but one of our server main functions to drop their calls to NewServer and NewServerMetrics. To enable this, create a new helper type and method in the bgrpc package. Conceptually, this could be just a new function, but it must be attached to a new type so that it can be generic over the type of gRPC server being created. (Unfortunately, the grpc.RegisterFooServer methods do not accept an interface type for their second argument). The only main function which is not updated is the boulder-va, which is a special case because it creates multiple gRPC servers but (unlike the CA) serves them all on the same port with the same server and listener. Part of #6452	2022-10-26 15:45:52 -07:00
Aaron Gable	927b1622b7	Add gRPC stream interceptors (#6370 ) Create new gRPC interceptors which are capable of working on streaming gRPC methods. Add these new interceptors, as well as the default metrics interceptor provided by grpc-prometheus, to all of our gRPC clients and servers. The new interceptors behave virtually identically to their unary counterparts: they wrap and unwrap our custom errors from the gRPC metadata, they increment and decrement the in-flight RPC metric, and they ensure that the RPCs don't fail-fast and do have enough time left in their deadline to actually finish. Unfortunately, because the interfaces for unary and streaming RPCs are so divergent, it's not feasible to share code between the two kinds of interceptors. While much of the new code is copy-pasted from the old interceptors, there are subtle differences (such as not immediately deferring the local context's cancel() function). Fixes #6356	2022-09-12 09:28:12 -07:00
Aaron Gable	ab79f96d7b	Fixup staticcheck and stylecheck, and violations thereof (#5897 ) Add `stylecheck` to our list of lints, since it got separated out from `staticcheck`. Fix the way we configure both to be clearer and not rely on regexes. Additionally fix a number of easy-to-change `staticcheck` and `stylecheck` violations, allowing us to reduce our number of ignored checks. Part of #5681	2022-01-20 16:22:30 -08:00
Jacob Hoffman-Andrews	2d2c723d34	Break the chain of cancellations at the SA (#5459) A recent mysql driver upgrade caused a performance regression. We believe this may be due to cancellations getting passed through to the database driver, which as of the upgrade will more aggressively tear down connections that experienced a cancellation. Also, we only recently started propagating cancellations all the way from the frontend in #5404. This makes it so the driver doesn't see the cancellation. Second attempt at #5447	2021-06-24 16:49:32 -07:00
Aaron Gable	229377aabc	Simplify gRPC interceptors (#5435 ) Use the built-in grpc-go client and server interceptor chaining utilities, instead of the ones provided by go-grpc-middleware. Simplify our interceptors to call their handlers/invokers directly, instead of delegating to the metrics interceptor, and add the metrics interceptor to the chains instead.	2021-05-26 10:19:11 -07:00
Aaron Gable	9abb39d4d6	Honeycomb integration proof-of-concept (#5408 ) Add Honeycomb tracing to all Boulder components which act as HTTP servers, gRPC servers, or gRPC clients. Add many values which we currently emit to logs to the trace spans. Add a way to configure the Honeycomb integration to our config files, and by default configure all of our tests to "mute" (send nothing). Followup changes will refine the configuration, attempt to reduce the new dependency load, and introduce better sampling. Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218	2021-05-24 16:13:08 -07:00
Jacob Hoffman-Andrews	b4e483d38b	Add gRPC MaxConnectionAge config. (#5311 ) This allows servers to tell clients to go away after some period of time, which triggers the clients to re-resolve DNS. Per grpc/grpc#12295, this is the preferred way to do this. Related: #5307.	2021-03-01 18:37:47 -08:00
Samantha	07aef67fa6	Refactoring tls.Config mutation out of grpc (#5175 ) In all boulder services, we construct a single tls.Config object and then pass it into multiple gRPC setup methods. In all boulder services but one, we pass the object into multiple clients, and just one server. In general, this is safe, because all of the client setup happens on the main thread, and the server setup similarly happens on the main thread before spinning off the gRPC server goroutine. In the CA, we do the above and pass the tlsConfig object into two gRPC server setup functions. Thus the first server goroutine races with the setup of the second server. This change removes the post-hoc assignment of MinVersion, MaxVersion, and CipherSuites of the tls.Config object passed to grpc.ClientSetup and grpc.NewServer. And adds those same values to the cmd.TLSConfig.Load, the method responsible for constructing the tls.Config object before it's passed to grpc.ClientSetup and grpc.NewServer. Part of #5159	2020-11-12 16:24:16 -08:00
Roland Bracewell Shoemaker	3532dce246	Excise grpc maxConcurrentStreams configuration (#4257)	2019-06-12 09:35:24 -04:00
Roland Bracewell Shoemaker	dbab48b488	Restrict gRPC TLS connections to 1.2 and ECDHE-RSA-CHACHA20-POLY1305 (#3903)	2018-10-24 16:07:12 -04:00
Daniel McCarney	0e07eacb01	gRPC: Rename histogram rpc_lag -> grpc_lag (#3673)	2018-04-26 16:19:13 -04:00
Daniel McCarney	aa810a3142	gRPC: publish RPC latency stat in server interceptor. (#3665 ) We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it. This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram. Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands. A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder. Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.	2018-04-25 15:37:22 -07:00
Jacob Hoffman-Andrews	2a1cd4981a	Allow configuring gRPC's MaxConcurrentStreams (#3642 ) During periods of peak load, some RPCs are significantly delayed (on the order of seconds) by client-side blocking. HTTP/2 clients have to obey a "max concurrent streams" setting sent by the server. In Go's HTTP/2 implementation, this value [defaults to 250](https://github.com/golang/net/blob/master/http2/server.go#L56), so the gRPC default is also 250. So whenever there are more than 250 requests in progress at a time, additional requests will be delayed until there is a slot available. During this peak load, we aren't hitting limits on CPU or memory, so we should increase the max concurrent streams limit to take better advantage of our available resources. This PR adds a config field to do that. Fixes #3641.	2018-04-12 17:17:17 -04:00
Jacob Hoffman-Andrews	68d5cc3331	Restore gRPC metrics (#3265) The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back. Specifically, it adds two convenience functions, one for clients and one for servers, that make the necessary metrics object and register it. We run these in the main function of each server. I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling. Also, update go-grpc-prometheus to get the necessary methods. ``` $ go test github.com/grpc-ecosystem/go-grpc-prometheus/... ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s ? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files] ```	2017-12-07 15:44:55 -08:00
Jacob Hoffman-Andrews	d542960a35	Remove statsd version of RPC stats (#2693) * Remove statsd-style RPC stats. * Remove tests for old code.	2017-04-25 10:10:35 -04:00
Jacob Hoffman-Andrews	510e279208	Simplify gRPC TLS configs. (#2470 ) Previously, a given binary would have three TLS config fields (CA cert, cert, key) for its gRPC server, plus each of its configured gRPC clients. In typical use, we expect all three of those to be the same across both servers and clients within a given binary. This change reuses the TLSConfig type already defined for use with AMQP, adds a Load() convenience function that turns it into a *tls.Config, and configures it for use with all of the binaries. This should make configuration easier and more robust, since it more closely matches usage. This change preserves temporary backwards-compatibility for the ocsp-updater->publisher RPCs, since those are the only instances of gRPC currently enabled in production.	2017-01-06 14:19:18 -08:00
Jacob Hoffman-Andrews	b8a237ffb3	Use grpc-go-prometheus for RPC stats. (#2391 ) There's an off-the-shelf package that provides most of the stats we care about for gRPC using interceptors. This change vendors go-grpc-prometheus and its dependencies, and calls out to the interceptors provided by that package from our own interceptors. This will allow us to get metrics like latency histograms by call, status codes by call, and so on. Fixes #2390. This change vendors go-grpc-prometheus and its dependencies. Per contributing guidelines, I've run the tests on these dependencies, and they pass: go test github.com/davecgh/go-spew/spew github.com/grpc-ecosystem/go-grpc-prometheus github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto github.com/pmezard/go-difflib/difflib github.com/stretchr/testify/assert github.com/stretchr/testify/require github.com/stretchr/testify/suite ok github.com/davecgh/go-spew/spew 0.022s ok github.com/grpc-ecosystem/go-grpc-prometheus 0.120s ? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files] ok github.com/pmezard/go-difflib/difflib 0.042s ok github.com/stretchr/testify/assert 0.021s ok github.com/stretchr/testify/require 0.017s ok github.com/stretchr/testify/suite 0.012s	2016-12-05 14:31:22 -08:00
Daniel McCarney	6c983e8c9e	Implements client whitelisting for gRPC. (#2307) As described in #2282, our gRPC code uses mutual TLS to authenticate both clients and servers. However, currently our gRPC servers will accept any client certificate signed by the internal CA we use to authenticate connections. Instead, we would like each server to have a list of which clients it will accept. This will improve security by preventing the compromise of one client private key being used to access endpoints unrelated to its intended scope/purpose. This PR implements support for gRPC servers to specify a list of accepted client names. A `serverTransportCredentials` implementing `ServerHandshake` uses a `verifyClient` function to enforce that the connecting peer presents a client certificate with a SAN entry that matches an entry on the list of accepted client names. The `NewServer` function from `grpc/server.go` is updated to instantiate the `serverTransportCredentials` used by `grpc.NewServer`, specifying an accepted names list populated from the `cmd.GRPCServerConfig.ClientNames` config field. The pre-existing client and server certificates in `test/grpc-creds/` are replaced by versions that contain SAN entries as well as subject common names. A DNS and an IP SAN entry are added to allow testing both methods of specifying allowed SANs. The `generate.sh` script is converted to use @jsha's `minica` tool (OpenSSL CLI is blech!). An example client whitelist is added to each of the existing gRPC endpoints in config-next/ to allow the SAN of the test RPC client certificate. Resolves #2282	2016-11-08 13:57:34 -05:00
Jacob Hoffman-Andrews	332b019b99	Split grpc/util.go into client and server. (#2212) Having files or packages named util is not great, because they wind up attracting lots of small, unrelated functionality.	2016-09-29 10:53:17 -07:00

34 Commits
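
The histogram bucket change in 07d6713736 (#7591) can be illustrated with a minimal standard-library sketch. The two bucket lists are copied from that commit message; the `smallestBucket` helper and the sample durations are hypothetical and only mimic how a Prometheus histogram assigns an observation to the first bucket whose upper bound is at least the observed value. This is not Boulder's code.

```go
package main

import (
	"fmt"
	"sort"
)

// Bucket upper bounds (in seconds) quoted in the commit message for #7591.
var (
	bucketsBefore = []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}
	bucketsAfter  = []float64{.01, .025, .05, .1, .5, 1, 2.5, 5, 10, 45, 90}
)

// smallestBucket returns the label of the first bucket whose upper bound is
// >= seconds, or "+Inf" when the observation exceeds every bound.
// Hypothetical helper for illustration; a real histogram does this internally.
func smallestBucket(bounds []float64, seconds float64) string {
	// SearchFloat64s returns the smallest index i with bounds[i] >= seconds.
	i := sort.SearchFloat64s(bounds, seconds)
	if i == len(bounds) {
		return "+Inf"
	}
	return fmt.Sprintf("%g", bounds[i])
}

func main() {
	// Sample request durations, chosen only to show the long-tail effect.
	for _, d := range []float64{0.003, 0.2, 30, 60, 120} {
		fmt.Printf("%gs\tbefore: le=%s\tafter: le=%s\n",
			d, smallestBucket(bucketsBefore, d), smallestBucket(bucketsAfter, d))
	}
}
```

Under the old bounds a 30-second request can only be counted in the `+Inf` bucket, while under the new bounds it falls into `le=45`. Both lists have eleven entries, which matches the commit's stated goal of not changing cardinality.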