43 Commits

Author SHA1 Message Date
Jacob Hoffman-Andrews 600010305a
grpc: factor out setup func (#7909)
This uses a pattern that is new to our tests. setup accepts a variadic
list of options, and uses a type switch to make use of those options
during setup. This allows us to pass setup only the options that are
relevant to any given test case, leaving the rest to sensible defaults.
2025-01-20 12:31:57 -05:00
Jacob Hoffman-Andrews 04dec59c67
ra: log User-Agent (#7908)
In the WFE, store the User-Agent in a `context.Context` object. In our
gRPC interceptors, pass that field in a Metadata header, and re-add it
to `Context` on the server side.

Add a test in the gRPC interceptors that User-Agent is properly
propagated.

Note: this adds a new `setup()` function for the gRPC tests that is
currently only used by the new test. I'll upload another PR shortly that
expands the use of that function to more tests.

Fixes https://github.com/letsencrypt/boulder/issues/7792
2025-01-14 13:39:41 -08:00
Aaron Gable e5731a4c23
gRPC: reject request if clock skew is too large (#7686)
Have our gRPC server interceptor check for excessive clock skew between
its own clock and gRPC client clocks. Do this by taking advantage of the
client request timestamp that most clients already supply for the
purpose of measuring cross-service latency. If the included timestamp is
more than 10 minutes from the gRPC server's local time, immediately
error out.

To keep the integration tests -- which heavily rely on clock
manipulation -- working, use build tags to disable this behavior during
integration testing.

Fixes https://github.com/letsencrypt/boulder/issues/7684
2024-08-29 11:32:24 -07:00
Aaron Gable e05d47a10a
Replace explicit int loops with range-over-int (#7434)
This adopts modern Go syntax to reduce the chance of off-by-one errors
and remove unnecessary loop variable declarations.

Fixes https://github.com/letsencrypt/boulder/issues/7227
2024-04-22 10:34:51 -07:00
Phil Porada b8b105453a
Rename protobuf duration fields to <fieldname>NS and populate new duration fields (#7115)
* Renames all int64 fields that hold a time.Duration to `<fieldname>NS` to
indicate they are expressed in nanoseconds.
* Adds new `google.protobuf.Duration` fields to each .proto file where
we previously had been using an int64 field to populate a time.Duration.
* Updates relevant gRPC messages.

Part 1 of 3 for https://github.com/letsencrypt/boulder/issues/7097
2023-10-26 10:46:03 -04:00
Phil Porada 034316ef6a
Rename int64 timestamp related protobuf fields to <fieldname>NS (#7069)
Rename all int64 timestamp fields to `<fieldname>NS` to indicate they
are Unix nanosecond timestamps.

Part 1 of 4 related to
https://github.com/letsencrypt/boulder/issues/7060
2023-09-15 13:49:07 -04:00
Jacob Hoffman-Andrews ac4be89b56
grpc: add NoWaitForReady config field (#6850)
Currently we set WaitForReady(true), which causes gRPC requests to not
fail immediately if no backends are available, but instead wait until
the timeout in case a backend does become available. The downside is
that this behavior masks true connection errors. We'd like to turn it
off.

Fixes #6834
2023-05-09 16:16:44 -07:00
Aaron Gable 58f1c55284
Allow BoulderErrors to be interpreted as grpc.Statuses (#6654)
Add the GRPCStatus method to our BoulderError type, so that the gRPC
server code can automatically set an appropriate Status on all gRPC
responses, based on the kind of error that we return. We still serialize
the whole BoulderError type and details into the response metadata, so
that it can be rehydrated on the client side, but this allows the
gRPC-native Status to be something other than Unknown. As part of this
change, have our custom error serialization code stop manually setting
the gRPC status code to codes.Unknown.

This change allows the default gRPC prometheus metrics to more
accurately report the kinds of errors our gRPC requests experience, and
may allow us to more elegantly transition to using grpc.Status errors in
other places where they're relevant and useful.
2023-02-16 14:17:09 -08:00
Jacob Hoffman-Andrews c23e59ba59
wfe2: don't pass through client-initiated cancellation (#6608)
And clean up the code and tests that were used for cancellation
pass-through.

Fixes #6603
2023-01-26 17:26:15 -08:00
Aaron Gable 257136779c
Add interceptor for per-rpc client auth (#6488)
Add a new gRPC server interceptor (both unary and streaming) which
verifies that the mTLS info set on the persistent connection has a
client cert which contains a name which is allowlisted for the
particular service being called, not just for the overall server.

This will allow us to make more services -- particularly the CA and the
SA -- more similar to the VA. We will be able to run multiple services
on the same port, while still being able to control access to those
services on a per-client basis. It will also let us split those services
(e.g. into read-only and read-write subsets) much more easily, because a
client will be able to switch which service it is calling without also
having to be reconfigured to call a different address. And finally, it
will allow us to simplify configuration for clients (such as the RA)
which maintain connections to multiple different services on the same
server, as they'll be able to re-use the same address configuration.
2022-11-07 13:47:47 -08:00
Aaron Gable 0a02cdf7e3
Streamline gRPC client creation (#6472)
Remove the need for clients to explicitly call bgrpc.NewClientMetrics,
by moving that call inside bgrpc.ClientSetup. In case ClientSetup is
called multiple times, use the recommended method to gracefully recover
from registering duplicate metrics. This makes gRPC client setup much
more similar to gRPC server setup after the previous server refactoring
change landed.
2022-10-28 08:45:52 -07:00
Aaron Gable 9213bd0993
Streamline gRPC server creation (#6457)
Collapse most of our boilerplate gRPC creation steps (in particular,
creating default metrics, making the server and listener, registering
the server, creating and registering the health service, filtering
shutdown errors from the output, and gracefully stopping) into a single
function in the existing bgrpc package. This allows all but one of our
server main functions to drop their calls to NewServer and
NewServerMetrics.

To enable this, create a new helper type and method in the bgrpc
package. Conceptually, this could be just a new function, but it must be
attached to a new type so that it can be generic over the type of gRPC
server being created. (Unfortunately, the grpc.RegisterFooServer methods
do not accept an interface type for their second argument).

The only main function which is not updated is the boulder-va, which is
a special case because it creates multiple gRPC servers but (unlike the
CA) serves them all on the same port with the same server and listener.

Part of #6452
2022-10-26 15:45:52 -07:00
Aaron Gable 927b1622b7
Add gRPC stream interceptors (#6370)
Create new gRPC interceptors which are capable of working
on streaming gRPC methods. Add these new interceptors, as
well as the default metrics interceptor provided by grpc-prometheus,
to all of our gRPC clients and servers.

The new interceptors behave virtually identically to their unary
counterparts: they wrap and unwrap our custom errors from the
gRPC metadata, they increment and decrement the in-flight RPC
metric, and they ensure that the RPCs don't fail-fast and do have
enough time left in their deadline to actually finish.

Unfortunately, because the interfaces for unary and streaming
RPCs are so divergent, it's not feasible to share code between the
two kinds of interceptors. While much of the new code is copy-pasted
from the old interceptors, there are subtle differences (such as not
immediately deferring the local context's cancel() function).

Fixes #6356
2022-09-12 09:28:12 -07:00
Aaron Gable c706609e79
Update grpc from v1.36.1 to v1.49.0 (#6336)
Changelog: https://github.com/grpc/grpc-go/compare/v1.36.1...v1.49.0

The biggest change for us is that grpc.WithBalancerName has
transitioned from deprecated to fully removed. The fix is to replace
it with a JSON-formatted "default config" object, as demonstrated in
https://github.com/grpc/grpc-go/pull/5232#issuecomment-1106921140.

This should unblock updating other dependencies which want to
transitively update gRPC as well.
2022-09-01 13:29:06 -07:00
Aaron Gable d1b211ec5a
Start testing on go1.19 (#6227)
Run the Boulder unit and integration tests with go1.19.

In addition, make a few small changes to allow both sets of
tests to run side-by-side. Mark a few tests, including our lints
and generate checks, as go1.18-only. Reformat a few doc
comments, particularly lists, to abide by go1.19's stricter gofmt.

Causes #6275
2022-08-10 15:30:43 -07:00
Aaron Gable ab79f96d7b
Fixup staticcheck and stylecheck, and violations thereof (#5897)
Add `stylecheck` to our list of lints, since it got separated out from
`staticcheck`. Fix the way we configure both to be clearer and not
rely on regexes.

Additionally fix a number of easy-to-change `staticcheck` and
`stylecheck` violations, allowing us to reduce our number of ignored
checks.

Part of #5681
2022-01-20 16:22:30 -08:00
Aaron Gable eb5d0e9ba9
Update golangci-lint from v1.29.0 to v1.42.1 (#5745)
Update the version of golangci-lint we use in our docker image,
and update the version of the docker image we use in our tests.
Fix a couple places where we were violating lints (ineffective assign
and calling `t.Fatal` from outside the main test goroutine), and add
one lint (using math/rand) to the ignore list.

Fixes #5710
2021-10-22 16:26:59 -07:00
Aaron Gable e5a08e3753
Only convert gRPC cancellations into 408s at WFEs (#5566)
Pull the "was the gRPC error a Canceled error" checking code out into a
separate interceptor, and add that interceptor only in the wfe and wfe2
gRPC clients.

Although the vast majority of our cancelations come from the HTTP client
disconnecting (and that cancelation being propagated through our gRPC
stack), there are a few other situations in which we cancel gRPC
connections, including when we receive a quorum of responses from VAs
and no longer need responses from the remaining remote VA(s). This
change ensures that we do not treat those other kinds of cancelations in
the same way that we treat client-initiated cancelations.

Fixes #5444
2021-08-09 10:35:18 -07:00
Samantha 631f6dfa0c
GRPC: Log user-initiated cancellations as HTTP 408 (#5546)
- Log user-initiated cancellations as HTTP 408 instead of HTTP 500
- Only check status code of `err` if an error was intercepted

Fixes #5444
2021-07-30 16:10:16 -07:00
Jacob Hoffman-Andrews 2d2c723d34
Break the chain of cancellations at the SA (#5459)
A recent mysql driver upgrade caused a performance regression. We
believe this may be due to cancellations getting passed through to the
database driver, which as of the upgrade will more aggressively tear
down connections that experienced a cancellation.

Also, we only recently started propagating cancellations all the way
from the frontend in #5404.

This makes it so the driver doesn't see the cancellation.

Second attempt at #5447
2021-06-24 16:49:32 -07:00
Aaron Gable 6629b49376
Fix grpc test proto generation (#5452)
The //grpc/test_proto/generate.go file was not generating the protos
in its own directory, it was regenerating the VA protos. Therefore the
generated files were out of date, and were relying on an old version
of the go proto library, which we can now remove from our direct deps.

Part of #5443
Part of #5453
2021-06-02 16:19:25 -07:00
Aaron Gable ef1d3c4cde
Standardize on `AssertMetricWithLabelsEquals` (#5371)
Update all of our tests to use `AssertMetricWithLabelsEquals`
instead of combinations of the older `CountFoo` helpers with
simple asserts. This coalesces all of our prometheus inspection
logic into a single function, allowing the deletion of four separate
helper functions.
2021-04-01 15:20:43 -07:00
Jacob Hoffman-Andrews 3c0e414a74
Update interceptors_test to proto3. (#5046)
2020-08-24 16:05:57 -07:00
Jacob Hoffman-Andrews ca26126ca9
Replace master with main. (#4917)
Also, update an example username in mailer tests.
2020-06-30 16:39:39 -07:00
Jacob Hoffman-Andrews bef02e782a
Fix nits found by staticcheck (#4726)
Part of #4700
2020-03-30 10:20:20 -07:00
Roland Bracewell Shoemaker 5b2f11e07e Switch away from old style statsd metrics wrappers (#4606)
In a handful of places I've nuked old stats which are not used in any alerts or dashboards as they either duplicate other stats or don't provide much insight/have never actually been used. If we feel like we need them again in the future it's trivial to add them back.

There aren't many dashboards that rely on old statsd style metrics, but a few will need to be updated when this change is deployed. There are also a few cases where prometheus labels have been changed from camel to snake case; dashboards that use these will also need to be updated. As far as I can tell no alerts are impacted by this change.

Fixes #4591.
2019-12-18 11:08:25 -05:00
Jacob Hoffman-Andrews e3f797f9dc grpc: Add better error message for timeouts. (#4324)
Right now we sometimes get errors like:

rpc error: code = Unknown desc = rpc error:
  code = DeadlineExceeded desc = context deadline exceeded

For instance, when an SA call times out, and the RA returns that
timed-out error to the WFE. These are kind of confusing because they
have two layers of nested gRPC error, and they don't provide additional
information about which SA call timed out.

This change replaces DeadlineExceeded errors with our own error type
that includes the service and the method that were called, as well as
the amount of time it took (which helps understand if timeouts are
happening because earlier calls ate up time towards the deadline).

When the RA->SA NewOrder call times out, and the RA returns that error to WFE:

"InternalErrors":["rpc error: code = Unknown desc =
  sa.StorageAuthority.NewOrder timed out after 14954 ms"]

When the WFE->RA NewOrder call times out:

"InternalErrors":["ra.RegistrationAuthority.NewOrder timed out after 15000 ms"]

Note that this change only handles timeouts at one level deep, which I
think is sufficient for our needs.
2019-07-08 13:47:25 -04:00
Roland Bracewell Shoemaker 6f93942a04 Consistently used stdlib context package (#4229)
2019-05-28 14:36:16 -04:00
Roland Bracewell Shoemaker a9a0846ee9
Remove checks for deployed features (#3881)
Removes the checks for a handful of deployed feature flags in preparation for removing the flags entirely. Also moves all of the currently deprecated flags to a separate section of the flags list so they can be more easily removed once purged from production configs.

Fixes #3880.
2018-10-17 20:29:18 -07:00
Roland Bracewell Shoemaker 876c727b6f Update gRPC (#3817)
Fixes #3474.
2018-08-20 10:55:42 -04:00
Joel Sing f8a023e49c Remove various unnecessary uses of fmt.Sprintf (#3707)
Remove various unnecessary uses of fmt.Sprintf - in particular:

- Avoid calls like t.Error(fmt.Sprintf(...)), where t.Errorf can be used directly.

- Use strconv when converting an integer to a string, rather than using
  fmt.Sprintf("%d", ...). This is simpler and can also detect type errors at
  compile time.

- Instead of using x.Write([]byte(fmt.Sprintf(...))), use fmt.Fprintf(x, ...).
2018-05-11 11:55:25 -07:00
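
Two of the replacements listed in #3707 are shown side by side below. The function `render` is invented for the example; the calls themselves (`strconv.Itoa`, `fmt.Fprintf`) are standard library.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// render writes an id to a buffer twice, once with each preferred form.
func render(id int) string {
	var buf bytes.Buffer

	// Instead of: buf.Write([]byte(fmt.Sprintf("id=%d;", id)))
	fmt.Fprintf(&buf, "id=%d;", id)

	// Instead of: buf.WriteString(fmt.Sprintf("%d", id))
	// strconv.Itoa only accepts an int, so a wrong type fails to compile.
	buf.WriteString(strconv.Itoa(id))

	return buf.String()
}

func main() {
	fmt.Println(render(7))
}
```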
Daniel McCarney 4f9ee00510 gRPC: publish in-flight RPC gauge in client interceptor. (#3672)
This PR updates the Boulder gRPC clientInterceptor to update a Prometheus gauge stat for each in-flight RPC it dispatches, sliced by service and method.

A unit test is included that uses a custom ChillerServer that lets the test block up a bunch of RPCs, check the in-flight gauge value is increased, unblock the RPCs, and recheck that the in-flight gauge is reduced. To check the gauge value for a specific set of labels a new test-tools.go function GaugeValueWithLabels is added.

Updates #3635
2018-04-27 15:53:54 -07:00
Daniel McCarney aa810a3142 gRPC: publish RPC latency stat in server interceptor. (#3665)
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.

This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.

Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands.

A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.

Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
2018-04-25 15:37:22 -07:00
Jacob Hoffman-Andrews a4f9de9e35 Improve nesting of RPC deadlines (#3619)
gRPC passes deadline information through the RPC boundary, but client and server have the same deadline. Ideally we'd like the server to have a slightly tighter deadline than the client, so if one of the server's onward RPCs or other network calls times out, the server can pass back more detailed information to the client, rather than the client timing out the server and losing the opportunity to log more detailed information about which component caused the timeout.

In this change, I subtract 100ms from the deadline on the server side of our interceptors, using our existing serverInterceptor. I also check that there is at least 100ms remaining in which to do useful work, so the server doesn't begin a potentially expensive task only to abort it.

Fixes #3608.
2018-04-06 15:40:18 +01:00
Jacob Hoffman-Andrews 68d5cc3331
Restore gRPC metrics (#3265)
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.

Specifically, it adds two convenience functions, one for clients and one for servers, that make the necessary metrics object and register it. We run these in the main function of each server.

I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.

Also, update go-grpc-prometheus to get the necessary methods.

```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok      github.com/grpc-ecosystem/go-grpc-prometheus    0.069s
?       github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
2017-12-07 15:44:55 -08:00
Jacob Hoffman-Andrews d542960a35 Remove statsd version of RPC stats (#2693)
* Remove statsd-style RPC stats.

* Remove tests for old code.
2017-04-25 10:10:35 -04:00
Jacob Hoffman-Andrews 263db24571 Disable fail-fast for gRPC. (#2397) (#2434)
This is a roll-forward of 5b865f1, with the QueueDeclare and QueueBind changes
in AMQP-RPC removed, and the startup order changes in test/startservers.py
removed. The AMQP-RPC changes caused RabbitMQ permission problems in production,
and the startup order changes depended on the AMQP-RPC changes but were not
required now that we have a unit test also.

This allows us to restart backends with relatively little interruption in
service, provided the backends come up promptly.

Fixes #2389 and #2408
2016-12-15 12:52:34 -08:00
Jacob Hoffman-Andrews 5407a45b02 Revert "Disable fail-fast for gRPC. (#2397)" (#2427)
This reverts commit 5b865f1d63.

The QueueDeclare and QueueBind calls in that change caused AMQP permission
denied errors.
2016-12-13 13:20:08 -08:00
Jacob Hoffman-Andrews 5b865f1d63 Disable fail-fast for gRPC. (#2397)
This allows us to restart backends with relatively little interruption in
service, provided the backends come up promptly.

Fixes #2389 and #2408
2016-12-09 12:03:45 -08:00
Jacob Hoffman-Andrews 27a1446010 Move timeouts into client interceptor. (#2387)
Previously we had custom code in each gRPC wrapper to implement timeouts. Moving
the timeout code into the client interceptor allows us to simplify things and
reduce code duplication.
2016-12-05 10:42:26 -05:00
Roland Bracewell Shoemaker 09483007bd Cleanup gRPC metric formatting (#2218)
Based on experience with the new gRPC staging deployment. gRPC generates `FullMethod` names such as `-ServiceName-MethodName` which can be confusing. For client calls to a service we actually want something formatted like `ServiceName-MethodName` and for server requests we want just `MethodName`.

This PR adds a method to clean up the `FullMethod` names returned by gRPC and formats them the way we expect.
2016-10-14 10:26:13 -07:00
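
One way the cleanup in #2218 could work is sketched below, taking the dashed input form from the commit message at face value. The function name `cleanMethod` and its `trimService` flag are assumptions made for illustration and may not match the code that was merged.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanMethod strips the leading separator from a name such as
// "-ServiceName-MethodName". With trimService set (the server case) it
// also drops the service prefix and returns only the method name.
func cleanMethod(fullMethod string, trimService bool) string {
	m := strings.TrimLeft(fullMethod, "-")
	if trimService {
		if i := strings.LastIndex(m, "-"); i >= 0 {
			return m[i+1:]
		}
	}
	return m
}

func main() {
	fmt.Println(cleanMethod("-ServiceName-MethodName", false)) // client metrics
	fmt.Println(cleanMethod("-ServiceName-MethodName", true))  // server metrics
}
```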
Roland Bracewell Shoemaker e187c92715 Add gRPC client side metrics (#2151)
Fixes #1880.

Updates google.golang.org/grpc and github.com/jmhodges/clock, both test suites pass. A few of the gRPC interfaces changed so this also fixes those breakages.
2016-09-09 15:17:36 -04:00
Roland Bracewell Shoemaker 7b29dba75d Add gRPC server-side interceptor (#1933)
Adds a server side unary RPC interceptor which includes basic stats. We could also use this to add a server request ID to the context.Context to identify the call through the system, but really I'd rather do that on the client side before the RPC is sent which requires the client interceptor implementation upstream. Also updates google.golang.org/grpc.

Updates #1880.
2016-06-20 11:27:32 -04:00