Commit Graph

36 Commits

Author SHA1 Message Date
Samantha Frank 1bfc3186c8
grpc: Enable client-side health_v1 health checking (#8254)
- Configure all gRPC clients to check the overall serving status of each
endpoint via the `grpc_health_v1` service.
- Configure all gRPC servers to expose the `grpc_health_v1` service to
any client permitted to access one of the server’s services.
- Modify long-running, deep health checks to set and transition the
overall (empty string) health status of the gRPC server in addition to
the specific service they were configured for.

Fixes #8227
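
For reference, client-side health checking in grpc-go is opt-in: the client imports the health package and requests checking via service config. A minimal sketch, with a hypothetical target and insecure credentials for brevity (Boulder uses mTLS):

```
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	_ "google.golang.org/grpc/health" // registers the client-side health checking function
)

func main() {
	// An empty serviceName subscribes to the server's overall ("") status.
	// Health checking requires a load-balancing policy other than pick_first.
	conn, err := grpc.Dial(
		"dns:///sa.example:8153", // hypothetical target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{
			"loadBalancingConfig": [{"round_robin":{}}],
			"healthCheckConfig": {"serviceName": ""}
		}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}
```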
2025-06-18 10:37:20 -04:00
dependabot[bot] 426482781c
build(deps): bump the otel group (#7968)
Update:
- https://github.com/open-telemetry/opentelemetry-go-contrib from 0.55.0 to 0.61.0
- https://github.com/open-telemetry/opentelemetry-go from 1.30.0 to 1.36.0
- several golang.org/x/ packages
- their transitive dependencies
2025-06-06 17:22:48 -07:00
Phil Porada 07d6713736
grpc: client/server histogram bucket change (#7591)
Changes the gRPC client/server histogram buckets from the Prometheus
defaults to better track the long tail of slow requests. Removes the `.005`
and `.25` buckets in favor of the larger values `45` and `90`, keeping the
overall cardinality unchanged.

```
# Before, the default prometheus buckets
[]float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10}

# After
[]float64{.01, .025, .05, .1, .5, 1, 2.5, 5, 10, 45, 90}
```

Fixes https://github.com/letsencrypt/boulder/issues/6384
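
For reference, a sketch of how custom buckets can be wired through the go-grpc-middleware prometheus provider (registry plumbing assumed; see the following commit for the library switch):

```
import (
	grpcprom "github.com/grpc-ecosystem/go-grpc-middleware/providers/prometheus"
	"github.com/prometheus/client_golang/prometheus"
)

func newServerMetrics(reg prometheus.Registerer) *grpcprom.ServerMetrics {
	metrics := grpcprom.NewServerMetrics(
		grpcprom.WithServerHandlingTimeHistogram(
			grpcprom.WithHistogramBuckets(
				[]float64{.01, .025, .05, .1, .5, 1, 2.5, 5, 10, 45, 90},
			),
		),
	)
	reg.MustRegister(metrics)
	return metrics
}
```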
2024-07-16 10:36:57 -07:00
Phil Porada 472effbb9b
grpc: Switch to go-grpc-middleware/providers/prometheus (#7588)
While investigating #6384, I noticed that
[go-grpc-prometheus](https://github.com/grpc-ecosystem/go-grpc-prometheus)
was deprecated last year and users should switch to
[go-grpc-middleware](https://github.com/grpc-ecosystem/go-grpc-middleware)
instead. The default prometheus histogram buckets will continue to be
used; they are defined in `prometheus/histogram.go` (lines 261-265 at
commit `6e3f4b1091`).
2024-07-12 15:02:37 -04:00
forcedebug b33d28c8bd
Remove repeated words in comments (#7445)
Signed-off-by: forcedebug <forcedebug@outlook.com>
2024-04-23 10:30:33 -04:00
Jacob Hoffman-Andrews ac4be89b56
grpc: add NoWaitForReady config field (#6850)
Currently we set WaitForReady(true), which causes gRPC requests to not
fail immediately if no backends are available, but instead wait until
the timeout in case a backend does become available. The downside is
that this behavior masks true connection errors. We'd like to turn it
off.

Fixes #6834
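
A hypothetical sketch of gating the call option on the new field; `NoWaitForReady` is the config name from this change, the rest is illustrative:

```
import "google.golang.org/grpc"

// WaitForReady(false) makes RPCs fail immediately when no backend is
// available, surfacing real connection errors instead of waiting out
// the request deadline.
func callOptions(noWaitForReady bool) []grpc.CallOption {
	return []grpc.CallOption{grpc.WaitForReady(!noWaitForReady)}
}
```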
2023-05-09 16:16:44 -07:00
Matthew McPherrin 0060e695b5
Introduce OpenTelemetry Tracing (#6750)
Add a new shared config stanza which all boulder components can use to
configure their Open Telemetry tracing. This allows components to
specify where their traces should be sent, what their sampling ratio
should be, and whether or not they should respect their parent's
sampling decisions (so that web front-ends can ignore sampling info
coming from outside our infrastructure). It's likely we'll need to
evolve this configuration over time, but this is a good starting point.

Add basic Open Telemetry setup to our existing cmd.StatsAndLogging
helper, so that it gets initialized at the same time as our other
observability helpers. This sets certain default fields on all
traces/spans generated by the service. Currently these include the
service name, the service version, and information about the telemetry
SDK itself. In the future we'll likely augment this with information
about the host and process.

Finally, add instrumentation for the HTTP servers and grpc
clients/servers. This gives us a starting point of being able to monitor
Boulder, but is fairly minimal as this PR is already somewhat unwieldy:
It's really only enough to understand that everything is wired up
properly in the configuration. In subsequent work we'll enhance those
spans with more data, and add more spans for things not automatically
traced here.
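
A minimal sketch of the sampler wiring this describes, using the OpenTelemetry Go SDK (function and parameter names are illustrative):

```
import (
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracing(sampleRatio float64, respectParentSampling bool) {
	sampler := sdktrace.TraceIDRatioBased(sampleRatio)
	if respectParentSampling {
		// Defer to the sampling decision carried by incoming trace context.
		// Web front-ends skip this so external sampling hints are ignored.
		sampler = sdktrace.ParentBased(sampler)
	}
	otel.SetTracerProvider(sdktrace.NewTracerProvider(sdktrace.WithSampler(sampler)))
}
```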

Fixes https://github.com/letsencrypt/boulder/issues/6361

---------

Co-authored-by: Aaron Gable <aaron@aarongable.com>
2023-04-21 10:46:59 -07:00
Matthew McPherrin e1ed1a2ac2
Remove beeline tracing (#6733)
Remove tracing using Beeline from Boulder. The only remnant left behind
is the deprecated configuration, to ensure deployability.

We had previously planned to swap in OpenTelemetry in a single PR, but
that adds significant churn in a single change, so we're doing this as
multiple steps that will each be significantly easier to reason about
and review.

Part of #6361
2023-03-14 15:14:27 -07:00
Jacob Hoffman-Andrews c23e59ba59
wfe2: don't pass through client-initiated cancellation (#6608)
And clean up the code and tests that were used for cancellation
pass-through.

Fixes #6603
2023-01-26 17:26:15 -08:00
Aaron Gable 257136779c
Add interceptor for per-rpc client auth (#6488)
Add a new gRPC server interceptor (both unary and streaming) which
verifies that the mTLS info set on the persistent connection has a
client cert which contains a name which is allowlisted for the
particular service being called, not just for the overall server.

This will allow us to make more services -- particularly the CA and the
SA -- more similar to the VA. We will be able to run multiple services
on the same port, while still being able to control access to those
services on a per-client basis. It will also let us split those services
(e.g. into read-only and read-write subsets) much more easily, because a
client will be able to switch which service it is calling without also
having to be reconfigured to call a different address. And finally, it
will allow us to simplify configuration for clients (such as the RA)
which maintain connections to multiple different services on the same
server, as they'll be able to re-use the same address configuration.
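
A rough sketch of what such an interceptor can look like; the allowlist shape and helper names are illustrative, not Boulder's actual implementation:

```
import (
	"context"
	"crypto/x509"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
	"google.golang.org/grpc/status"
)

// allowed maps a gRPC service name to the client names permitted to call it.
func authUnaryInterceptor(allowed map[string]map[string]bool) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		p, ok := peer.FromContext(ctx)
		if !ok {
			return nil, status.Error(codes.Unauthenticated, "no peer info")
		}
		tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
		if !ok || len(tlsInfo.State.VerifiedChains) == 0 {
			return nil, status.Error(codes.Unauthenticated, "no verified client certificate")
		}
		leaf := tlsInfo.State.VerifiedChains[0][0]
		// info.FullMethod looks like "/package.Service/Method".
		service := strings.SplitN(strings.TrimPrefix(info.FullMethod, "/"), "/", 2)[0]
		if !sanAllowed(leaf, allowed[service]) {
			return nil, status.Error(codes.PermissionDenied, "client not allowlisted for service")
		}
		return handler(ctx, req)
	}
}

func sanAllowed(cert *x509.Certificate, names map[string]bool) bool {
	for _, san := range cert.DNSNames {
		if names[san] {
			return true
		}
	}
	return false
}
```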
2022-11-07 13:47:47 -08:00
Aaron Gable 0a02cdf7e3
Streamline gRPC client creation (#6472)
Remove the need for clients to explicitly call bgrpc.NewClientMetrics,
by moving that call inside bgrpc.ClientSetup. In case ClientSetup is
called multiple times, use the recommended method to gracefully recover
from registering duplicate metrics. This makes gRPC client setup much
more similar to gRPC server setup after the previous server refactoring
change landed.
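
The "recommended method" is presumably Prometheus's AlreadyRegisteredError escape hatch; a sketch under that assumption:

```
import (
	"errors"

	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus"
)

func registerClientMetrics(reg prometheus.Registerer) (*grpc_prometheus.ClientMetrics, error) {
	metrics := grpc_prometheus.NewClientMetrics()
	err := reg.Register(metrics)
	var are prometheus.AlreadyRegisteredError
	if errors.As(err, &are) {
		// An earlier ClientSetup call already registered these; reuse them.
		return are.ExistingCollector.(*grpc_prometheus.ClientMetrics), nil
	}
	return metrics, err
}
```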
2022-10-28 08:45:52 -07:00
Aaron Gable 9213bd0993
Streamline gRPC server creation (#6457)
Collapse most of our boilerplate gRPC creation steps (in particular,
creating default metrics, making the server and listener, registering
the server, creating and registering the health service, filtering
shutdown errors from the output, and gracefully stopping) into a single
function in the existing bgrpc package. This allows all but one of our
server main functions to drop their calls to NewServer and
NewServerMetrics.

To enable this, create a new helper type and method in the bgrpc
package. Conceptually, this could be just a new function, but it must be
attached to a new type so that it can be generic over the type of gRPC
server being created. (Unfortunately, the grpc.RegisterFooServer methods
do not accept an interface type for their second argument).
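
Conceptually, the helper can be shaped like this (names invented for illustration):

```
import "google.golang.org/grpc"

// registrar pairs a gRPC server with a concrete service interface T so the
// generated RegisterFooServer(s grpc.ServiceRegistrar, srv FooServer)
// functions can be called through a common helper.
type registrar[T any] struct {
	server *grpc.Server
}

func (r registrar[T]) Register(register func(grpc.ServiceRegistrar, T), impl T) {
	register(r.server, impl)
}
```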

The only main function which is not updated is the boulder-va, which is
a special case because it creates multiple gRPC servers but (unlike the
CA) serves them all on the same port with the same server and listener.

Part of #6452
2022-10-26 15:45:52 -07:00
Samantha ffad58009e
grpc: Backend discovery improvements (#6394)
- Fork the default `dns` resolver from `grpc-go` to add backend discovery via
  DNS SRV resource records.
- Add new fields for SRV based discovery to `cmd.GRPCClientConfig`.
- Add a new (optional) field `DNSAuthority` for specifying a custom DNS server to
  `cmd.GRPCClientConfig`.
- Add a utility method to `cmd.GRPCClientConfig` to simplify target URI and host
  construction. With three schemes and `DNSAuthority` it makes more sense to
  handle all of this parsing and construction outside of the RPC client
  constructor.

Resolves #6111
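
A sketch of the SRV-lookup core such a fork needs, including pointing lookups at a custom `DNSAuthority` (details differ from Boulder's actual resolver):

```
import (
	"context"
	"net"
	"strconv"
)

// srvTargets looks up _service._proto.name SRV records, optionally against a
// specific DNS server (the DNSAuthority), and returns host:port backends.
func srvTargets(ctx context.Context, dnsAuthority, service, proto, name string) ([]string, error) {
	r := net.DefaultResolver
	if dnsAuthority != "" {
		r = &net.Resolver{
			PreferGo: true,
			Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, network, dnsAuthority)
			},
		}
	}
	_, srvs, err := r.LookupSRV(ctx, service, proto, name)
	if err != nil {
		return nil, err
	}
	var targets []string
	for _, srv := range srvs {
		targets = append(targets, net.JoinHostPort(srv.Target, strconv.Itoa(int(srv.Port))))
	}
	return targets, nil
}
```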
2022-09-23 13:11:59 -07:00
Aaron Gable 927b1622b7
Add gRPC stream interceptors (#6370)
Create new gRPC interceptors which are capable of working
on streaming gRPC methods. Add these new interceptors, as
well as the default metrics interceptor provided by grpc-prometheus,
to all of our gRPC clients and servers.

The new interceptors behave virtually identically to their unary
counterparts: they wrap and unwrap our custom errors from the
gRPC metadata, they increment and decrement the in-flight RPC
metric, and they ensure that the RPCs don't fail-fast and do have
enough time left in their deadline to actually finish.

Unfortunately, because the interfaces for unary and streaming
RPCs are so divergent, it's not feasible to share code between the
two kinds of interceptors. While much of the new code is copy-pasted
from the old interceptors, there are subtle differences (such as not
immediately deferring the local context's cancel() function).

Fixes #6356
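
For reference, the general shape of a streaming client interceptor: wrap the stream and intercept each message. `unwrapError` is a placeholder for Boulder's metadata-unwrapping helper:

```
import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

func streamClientInterceptor(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn, method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
	cs, err := streamer(ctx, desc, cc, method, opts...)
	if err != nil {
		return nil, err
	}
	return wrappedStream{cs}, nil
}

// wrappedStream lets the interceptor see every message on the stream.
type wrappedStream struct {
	grpc.ClientStream
}

func (w wrappedStream) RecvMsg(m interface{}) error {
	// unwrapError stands in for the helper that reconstructs custom error
	// types from gRPC trailer metadata.
	return unwrapError(w.ClientStream.RecvMsg(m), w.Trailer())
}

func unwrapError(err error, md metadata.MD) error { return err } // placeholder
```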
2022-09-12 09:28:12 -07:00
Aaron Gable c706609e79
Update grpc from v1.36.1 to v1.49.0 (#6336)
Changelog: https://github.com/grpc/grpc-go/compare/v1.36.1...v1.49.0

The biggest change for us is that grpc.WithBalancerName has
transitioned from deprecated to fully removed. The fix is to replace
it with a JSON-formatted "default config" object, as demonstrated in
https://github.com/grpc/grpc-go/pull/5232#issuecomment-1106921140.

This should unblock updating other dependencies which want to
transitively update gRPC as well.
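
The replacement looks roughly like this (insecure credentials for brevity):

```
import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func dial(target string) (*grpc.ClientConn, error) {
	// grpc.WithBalancerName("round_robin") is gone; a JSON default service
	// config expresses the same load-balancing policy.
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
}
```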
2022-09-01 13:29:06 -07:00
Samantha 576b6777b5
grpc: Implement a static multiple IP address gRPC resolver (#6270)
- Implement a static resolver for the gRPC dialer under the scheme `static:///`
  which allows the dialer to resolve a backend from a static list of IPv4/IPv6
  addresses passed via the existing JSON config.
- Add config key `serverAddresses` to the `GRPCClientConfig` which, when
  populated, enables static IP resolution of gRPC server backends.
- Set `config-next` to use static gRPC backend resolution for all SA clients.
- Generate a new SA certificate which adds `10.77.77.77` and `10.88.88.88` to
  the SANs.

Resolves #6255
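
The essence of a static resolver is handing the dialer a fixed address list at build time; a condensed sketch (register it with `resolver.Register`):

```
import "google.golang.org/grpc/resolver"

type staticBuilder struct {
	addrs []string // e.g. from the serverAddresses config key
}

func (b staticBuilder) Build(target resolver.Target, cc resolver.ClientConn, opts resolver.BuildOptions) (resolver.Resolver, error) {
	var addrs []resolver.Address
	for _, a := range b.addrs {
		addrs = append(addrs, resolver.Address{Addr: a})
	}
	if err := cc.UpdateState(resolver.State{Addresses: addrs}); err != nil {
		return nil, err
	}
	return nopResolver{}, nil
}

func (b staticBuilder) Scheme() string { return "static" }

// The address list never changes, so ResolveNow is a no-op.
type nopResolver struct{}

func (nopResolver) ResolveNow(resolver.ResolveNowOptions) {}
func (nopResolver) Close()                                {}
```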
2022-08-05 10:20:57 -07:00
Aaron Gable e5a08e3753
Only convert gRPC cancellations into 408s at WFEs (#5566)
Pull the "was the gRPC error a Canceled error" checking code out into a
separate interceptor, and add that interceptor only in the wfe and wfe2
gRPC clients.

Although the vast majority of our cancellations come from the HTTP client
disconnecting (and that cancellation being propagated through our gRPC
stack), there are a few other situations in which we cancel gRPC
connections, including when we receive a quorum of responses from VAs
and no longer need responses from the remaining remote VA(s). This
change ensures that we do not treat those other kinds of cancellations in
the same way that we treat client-initiated cancellations.

Fixes #5444
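
A sketch of the separated interceptor; `errCanceled` stands in for the sentinel the WFEs translate into an HTTP 408:

```
import (
	"context"
	"errors"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errCanceled is a stand-in for the error the WFEs map to HTTP 408.
var errCanceled = errors.New("gRPC connection canceled")

func cancelInterceptor(ctx context.Context, method string, req, reply interface{}, cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	err := invoker(ctx, method, req, reply, cc, opts...)
	if status.Code(err) == codes.Canceled {
		return errCanceled
	}
	return err
}
```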
2021-08-09 10:35:18 -07:00
Aaron Gable 229377aabc
Simplify gRPC interceptors (#5435)
Use the built-in grpc-go client and server interceptor chaining
utilities, instead of the ones provided by go-grpc-middleware.
Simplify our interceptors to call their handlers/invokers directly,
instead of delegating to the metrics interceptor, and add the
metrics interceptor to the chains instead.
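
With the built-in chaining, the wiring reduces to something like the following; the interceptor arguments are hypothetical:

```
import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// Client side: interceptors run in the order listed.
func newClientConn(target string, metrics, timeout grpc.UnaryClientInterceptor) (*grpc.ClientConn, error) {
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()), // brevity; Boulder uses mTLS
		grpc.WithChainUnaryInterceptor(metrics, timeout),
	)
}

// Server side.
func newServer(metrics, auth grpc.UnaryServerInterceptor) *grpc.Server {
	return grpc.NewServer(grpc.ChainUnaryInterceptor(metrics, auth))
}
```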
2021-05-26 10:19:11 -07:00
Aaron Gable 9abb39d4d6
Honeycomb integration proof-of-concept (#5408)
Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).

Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.

Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
2021-05-24 16:13:08 -07:00
Samantha 07aef67fa6
Refactoring tls.Config mutation out of grpc (#5175)
In all boulder services, we construct a single tls.Config object
and then pass it into multiple gRPC setup methods. In all boulder
services but one, we pass the object into multiple clients, and
just one server. In general, this is safe, because all of the client
setup happens on the main thread, and the server setup similarly
happens on the main thread before spinning off the gRPC server
goroutine.

In the CA, we do the above and pass the tlsConfig object into two
gRPC server setup functions. Thus the first server goroutine races
with the setup of the second server.

This change removes the post-hoc assignment of MinVersion,
MaxVersion, and CipherSuites on the tls.Config object passed
to grpc.ClientSetup and grpc.NewServer, and instead sets those
same values in cmd.TLSConfig.Load, the method responsible for
constructing the tls.Config object before it's passed to
grpc.ClientSetup and grpc.NewServer.

Part of #5159
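
Sketch of the fix: the constraints are set once, where the tls.Config is built (the cipher suite matches the restriction introduced in dbab48b488 below), and never mutated afterwards:

```
import (
	"crypto/tls"
	"crypto/x509"
)

func load(rootCAs *x509.CertPool, cert tls.Certificate) *tls.Config {
	// Set version and cipher constraints at construction time, rather than
	// mutating the shared *tls.Config after gRPC setup has begun using it.
	return &tls.Config{
		RootCAs:      rootCAs,
		Certificates: []tls.Certificate{cert},
		MinVersion:   tls.VersionTLS12,
		MaxVersion:   tls.VersionTLS12,
		CipherSuites: []uint16{tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305},
	}
}
```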
2020-11-12 16:24:16 -08:00
Aaron Gable 2d14cfb8d1
Add gRPC Health service to all Boulder services (#5093)
This health service implements the gRPC Health Checking
Protocol, as defined in 
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
and as implemented by the gRPC authors in
https://pkg.go.dev/google.golang.org/grpc/health@v1.29.0

It simply instantiates a health service, and attaches it to the same
gRPC server that is handling requests to the primary (e.g. CA) service.
When the main service would be shut down (e.g. because it caught a
signal), it also sets the status of the service to NOT_SERVING.

This change also imports the health client into our grpc client,
ensuring that all of our grpc clients use the health service to inform
their load-balancing behavior.

This will be used to replace our current usage of polling the debug
port to determine whether a given service is up and running. It may
also be useful for more comprehensive checks and blackbox probing
in the future.

Part of #5074
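
Server-side, the upstream implementation keeps this wiring short; a sketch:

```
import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func attachHealth(srv *grpc.Server) *health.Server {
	hs := health.NewServer() // overall ("") status defaults to SERVING
	healthpb.RegisterHealthServer(srv, hs)
	return hs
}

// On shutdown (e.g. after catching a signal):
//   hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
```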
2020-10-06 12:14:02 -07:00
Jacob Hoffman-Andrews d4168626ad Fix orphan-finder (#4507)
This creates the correct type of backend service for the OCSP generator.
It also adds an invocation of orphan-finder during the integration
tests.

This also adds a minor safety check to SA that I hit while writing the
test. Without this safety check, passing a certificate with no DNSNames
to AddCertificate would result in an obscure MariaDB syntax error
without enough context to track it down. In normal circumstances this
shouldn't be hit, but it will be good to have a solid error message if
we hit it in tests sometime.

Also, this tweaks the .travis.yml so it explicitly sets BOULDER_CONFIG_DIR
to test/config in the default case. Because the docker-compose run
command uses -e BOULDER_CONFIG_DIR="${BOULDER_CONFIG_DIR}",
we were setting a blank BOULDER_CONFIG_DIR in default case.
Since the Python startservers script sets a default if BOULDER_CONFIG_DIR
is not set, we haven't noticed this before. But since this test case relies
on the actual environment variable, it became an issue.

Fixes #4499
2019-10-25 09:51:14 -07:00
Daniel McCarney d1daeee831 Config: serverAddresses -> serverAddress. (#4035)
The plural `serverAddresses` field in gRPC config has been deprecated for a bit now. We've removed the last usages of it in our staging/prod environments and can clear out the related code. Moving forward we only support a singular `serverAddress` and rely on DNS to direct to multiple instances of a given server.
2019-01-25 10:50:53 -08:00
Roland Bracewell Shoemaker dbab48b488 Restrict gRPC TLS connections to 1.2 and ECDHE-RSA-CHACHA20-POLY1305 (#3903) 2018-10-24 16:07:12 -04:00
Roland Bracewell Shoemaker 876c727b6f Update gRPC (#3817)
Fixes #3474.
2018-08-20 10:55:42 -04:00
Jacob Hoffman-Andrews dbcb16543e Start using multiple-IP hostnames for load balancing (#3687)
We'd like to start using the DNS load balancer in the latest version of gRPC. That means putting all IPs for a service under a single hostname (or using a SRV record, but we're not taking that path). This change adds an sd-test-srv to act as our service discovery DNS service. It returns both Boulder IP addresses for any A lookup ending in ".boulder". This change also sets up the Docker DNS for our boulder container to defer to sd-test-srv when it doesn't know an answer.

sd-test-srv doesn't know how to resolve public Internet names like `github.com`. Resolving public names is required for the `godep-restore` test phase, so this change breaks out a copy of the boulder container that is used only for `godep-restore`.

This change implements a shim of a DNS resolver for gRPC, so that we can switch to DNS-based load balancing with the currently vendored gRPC, then when we upgrade to the latest gRPC we won't need a simultaneous config update.

Also, this change introduces a check at the end of the integration test that each backend received at least one RPC, ensuring that we are not sending all load to a single backend.
2018-05-23 09:47:14 -04:00
Daniel McCarney 4f9ee00510 gRPC: publish in-flight RPC gauge in client interceptor. (#3672)
This PR updates the Boulder gRPC clientInterceptor to update a Prometheus gauge stat for each in-flight RPC it dispatches, sliced by service and method.

A unit test is included that uses a custom ChillerServer that lets the test block up a bunch of RPCs, check the in-flight gauge value is increased, unblock the RPCs, and recheck that the in-flight gauge is reduced. To check the gauge value for a specific set of labels a new test-tools.go function GaugeValueWithLabels is added.

Updates #3635
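
The gauge bookkeeping in the client interceptor boils down to something like this (label names are assumptions):

```
import (
	"context"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

func inFlightInterceptor(g *prometheus.GaugeVec) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, fullMethod string, req, reply interface{}, cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		// fullMethod looks like "/package.Service/Method".
		parts := strings.SplitN(strings.TrimPrefix(fullMethod, "/"), "/", 2)
		labels := prometheus.Labels{"service": parts[0], "method": parts[1]}
		g.With(labels).Inc()
		defer g.With(labels).Dec()
		return invoker(ctx, fullMethod, req, reply, cc, opts...)
	}
}
```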
2018-04-27 15:53:54 -07:00
Daniel McCarney aa810a3142 gRPC: publish RPC latency stat in server interceptor. (#3665)
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.

This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.

Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands.

A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.

Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
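
A sketch of the two halves; the metadata key name is illustrative:

```
import (
	"context"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/metadata"
)

const clientRequestTimeKey = "client-request-time" // illustrative key name

// Client side: stamp outgoing RPC metadata with the dispatch time.
func stampRequestTime(ctx context.Context, now time.Time) context.Context {
	return metadata.AppendToOutgoingContext(ctx,
		clientRequestTimeKey, strconv.FormatInt(now.UnixNano(), 10))
}

// Server side: recover the stamp and observe the client-to-server latency.
func observeRequestLatency(ctx context.Context, now time.Time, hist prometheus.Observer) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok || len(md[clientRequestTimeKey]) == 0 {
		return
	}
	ns, err := strconv.ParseInt(md[clientRequestTimeKey][0], 10, 64)
	if err != nil {
		return
	}
	hist.Observe(now.Sub(time.Unix(0, ns)).Seconds())
}
```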
2018-04-25 15:37:22 -07:00
Jacob Hoffman-Andrews 68d5cc3331
Restore gRPC metrics (#3265)
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.

Specifically, it adds two convenience functions, one for clients and one for servers, that makes the necessary metrics object and registers it. We run these in the main function of each server.

I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.

Also, update go-grpc-prometheus to get the necessary methods.

```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok      github.com/grpc-ecosystem/go-grpc-prometheus    0.069s
?       github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
2017-12-07 15:44:55 -08:00
Jacob Hoffman-Andrews d542960a35 Remove statsd version of RPC stats (#2693)
* Remove statsd-style RPC stats.

* Remove tests for old code.
2017-04-25 10:10:35 -04:00
Jacob Hoffman-Andrews d6ba7fcba9 Add some timing histogram stats (#2482)
Previously our gRPC client code called the wrong function, enabling server-side instead of client-side histograms.

Also, add a timing stat for the generate / store combination in OCSP Updater.
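
In go-grpc-prometheus the two histograms are separate opt-ins, which makes the mix-up easy; for reference:

```
import grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"

func enableHistograms() {
	// Server-side handling-time histogram (the one enabled by mistake).
	grpc_prometheus.EnableHandlingTimeHistogram()
	// Client-side handling-time histogram (the one the client code needs).
	grpc_prometheus.EnableClientHandlingTimeHistogram()
}
```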
2017-01-10 11:02:41 -08:00
Jacob Hoffman-Andrews 510e279208 Simplify gRPC TLS configs. (#2470)
Previously, a given binary would have three TLS config fields (CA cert, cert,
key) for its gRPC server, plus each of its configured gRPC clients. In typical
use, we expect all three of those to be the same across both servers and clients
within a given binary.

This change reuses the TLSConfig type already defined for use with AMQP, adds a
Load() convenience function that turns it into a *tls.Config, and configures it
for use with all of the binaries. This should make configuration easier and more
robust, since it more closely matches usage.

This change preserves temporary backwards-compatibility for the
ocsp-updater->publisher RPCs, since those are the only instances of gRPC
currently enabled in production.
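
A condensed sketch of such a Load() helper (field names illustrative; error handling trimmed):

```
import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"os"
)

type TLSConfig struct {
	CACertFile string
	CertFile   string
	KeyFile    string
}

func (t TLSConfig) Load() (*tls.Config, error) {
	caPEM, err := os.ReadFile(t.CACertFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("failed to parse CA certificate")
	}
	cert, err := tls.LoadX509KeyPair(t.CertFile, t.KeyFile)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		RootCAs:      pool, // verify gRPC servers (as a client)
		ClientCAs:    pool, // verify gRPC clients (as a server)
		ClientAuth:   tls.RequireAndVerifyClientCert,
		Certificates: []tls.Certificate{cert},
	}, nil
}
```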
2017-01-06 14:19:18 -08:00
Jacob Hoffman-Andrews b8a237ffb3 Use grpc-go-prometheus for RPC stats. (#2391)
There's an off-the-shelf package that provides most of the stats we care about
for gRPC using interceptors. This change vendors go-grpc-prometheus and its
dependencies, and calls out to the interceptors provided by that package from
our own interceptors.

This will allow us to get metrics like latency histograms by call, status codes
by call, and so on.

Fixes #2390.

This change vendors go-grpc-prometheus and its dependencies. Per contributing guidelines, I've run the tests on these dependencies, and they pass:

go test github.com/davecgh/go-spew/spew github.com/grpc-ecosystem/go-grpc-prometheus github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto github.com/pmezard/go-difflib/difflib github.com/stretchr/testify/assert github.com/stretchr/testify/require github.com/stretchr/testify/suite 
ok      github.com/davecgh/go-spew/spew 0.022s
ok      github.com/grpc-ecosystem/go-grpc-prometheus    0.120s
?       github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
ok      github.com/pmezard/go-difflib/difflib   0.042s
ok      github.com/stretchr/testify/assert      0.021s
ok      github.com/stretchr/testify/require     0.017s
ok      github.com/stretchr/testify/suite       0.012s
2016-12-05 14:31:22 -08:00
Jacob Hoffman-Andrews 27a1446010 Move timeouts into client interceptor. (#2387)
Previously we had custom code in each gRPC wrapper to implement timeouts. Moving
the timeout code into the client interceptor allows us to simplify things and
reduce code duplication.
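
Centralizing the timeout looks roughly like:

```
import (
	"context"
	"time"

	"google.golang.org/grpc"
)

func timeoutInterceptor(timeout time.Duration) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{}, cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		// Every RPC gets the same deadline, instead of each wrapper
		// rolling its own timeout handling.
		ctx, cancel := context.WithTimeout(ctx, timeout)
		defer cancel()
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}
```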
2016-12-05 10:42:26 -05:00
Daniel McCarney 6c983e8c9e Implements client whitelisting for gRPC. (#2307)
As described in #2282, our gRPC code uses mutual TLS to authenticate both clients and servers. However, currently our gRPC servers will accept any client certificate signed by the internal CA we use to authenticate connections. Instead, we would like each server to have a list of which clients it will accept. This will improve security by preventing the compromise of one client private key being used to access endpoints unrelated to its intended scope/purpose.

This PR implements support for gRPC servers to specify a list of accepted client names. A `serverTransportCredentials` implementing `ServerHandshake` uses a `verifyClient` function to enforce that the connecting peer presents a client certificate with a SAN entry that matches an entry on the list of accepted client names.

The `NewServer` function from `grpc/server.go` is updated to instantiate the `serverTransportCredentials` used by `grpc.NewServer`, specifying an accepted names list populated from the `cmd.GRPCServerConfig.ClientNames` config field.

The pre-existing client and server certificates in `test/grpc-creds/` are replaced by versions that contain SAN entries as well as subject common names. A DNS and an IP SAN entry are added to allow testing both methods of specifying allowed SANs. The `generate.sh` script is converted to use @jsha's `minica` tool (OpenSSL CLI is blech!).

An example client whitelist is added to each of the existing gRPC endpoints in config-next/ to allow the SAN of the test RPC client certificate.

Resolves #2282
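
The core of such a verifyClient check is SANs against the accepted-names list; a sketch:

```
import (
	"crypto/x509"
	"fmt"
)

// verifyClient checks that the validated leaf certificate presents at least
// one SAN (DNS name or IP address) on the accepted-names list.
func verifyClient(cert *x509.Certificate, accepted map[string]bool) error {
	for _, name := range cert.DNSNames {
		if accepted[name] {
			return nil
		}
	}
	for _, ip := range cert.IPAddresses {
		if accepted[ip.String()] {
			return nil
		}
	}
	return fmt.Errorf("client certificate SANs not on accepted list: %v, %v",
		cert.DNSNames, cert.IPAddresses)
}
```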
2016-11-08 13:57:34 -05:00
Jacob Hoffman-Andrews 332b019b99 Split grpc/util.go into client and server. (#2212)
Having files or packages named util is not great, because they wind up
attracting lots of small, unrelated functionality.
2016-09-29 10:53:17 -07:00