In the WFE, store the request's User-Agent in a `context.Context` object. In our
gRPC interceptors, pass that value as a gRPC metadata header, and re-add it
to the `context.Context` on the server side.
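A minimal sketch of how that propagation could look; the metadata key, context key, and function names here are illustrative, not Boulder's actual identifiers:
```
package example

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

// userAgentKey is an illustrative metadata key (a custom key avoids colliding
// with gRPC's own reserved "user-agent" header).
const userAgentKey = "acme-user-agent"

type uaContextKey struct{}

// uaClientInterceptor copies the User-Agent from the local context into the
// outgoing gRPC metadata before invoking the RPC.
func uaClientInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	if ua, ok := ctx.Value(uaContextKey{}).(string); ok && ua != "" {
		ctx = metadata.AppendToOutgoingContext(ctx, userAgentKey, ua)
	}
	return invoker(ctx, method, req, reply, cc, opts...)
}

// uaServerInterceptor re-adds the User-Agent from the incoming metadata to the
// server-side context so downstream code can read it as before.
func uaServerInterceptor(ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	if md, ok := metadata.FromIncomingContext(ctx); ok {
		if vals := md.Get(userAgentKey); len(vals) > 0 {
			ctx = context.WithValue(ctx, uaContextKey{}, vals[0])
		}
	}
	return handler(ctx, req)
}
```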
Add a test of the gRPC interceptors confirming that the User-Agent is properly
propagated.
Note: this adds a new `setup()` function for the gRPC tests that is
currently only used by the new test. I'll upload another PR shortly that
expands the use of that function to more tests.
Fixes https://github.com/letsencrypt/boulder/issues/7792
Have our gRPC server interceptor check for excessive clock skew between
its own clock and gRPC client clocks. Do this by taking advantage of the
client request timestamp that most clients already supply for the
purpose of measuring cross-service latency. If the included timestamp is
more than 10 minutes from the gRPC server's local time, immediately
error out.
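A minimal sketch of the check, assuming the client's request timestamp arrives as Unix nanoseconds in a metadata field (the key name, error codes, and helper shape here are illustrative):
```
package example

import (
	"context"
	"strconv"
	"time"

	"github.com/jmhodges/clock"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

const clientRequestTimeKey = "client-request-time"

// checkClockSkew errors out if the client's self-reported request time is more
// than 10 minutes away from the server's local clock, in either direction.
func checkClockSkew(ctx context.Context, clk clock.Clock) error {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok {
		return nil // no metadata at all; nothing to compare against
	}
	vals := md.Get(clientRequestTimeKey)
	if len(vals) == 0 {
		return nil // client didn't supply a timestamp
	}
	nanos, err := strconv.ParseInt(vals[0], 10, 64)
	if err != nil {
		return status.Errorf(codes.InvalidArgument, "malformed client request time: %v", err)
	}
	skew := clk.Now().Sub(time.Unix(0, nanos))
	if skew > 10*time.Minute || skew < -10*time.Minute {
		return status.Errorf(codes.FailedPrecondition,
			"client and server clocks differ by %s, which exceeds the 10 minute limit", skew)
	}
	return nil
}
```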
To keep the integration tests -- which heavily rely on clock
manipulation -- working, use build tags to disable this behavior during
integration testing.
Fixes https://github.com/letsencrypt/boulder/issues/7684
Currently we set WaitForReady(true), which causes gRPC requests to not
fail immediately if no backends are available, but instead wait until
the timeout in case a backend does become available. The downside is
that this behavior masks true connection errors. We'd like to turn it
off.
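A rough illustration of the difference, with a placeholder method name and message types (this is not Boulder's actual call site):
```
package example

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// callFailFast shows the behavior we want: with WaitForReady(false) (gRPC's
// default), an RPC over a connection with no ready backend fails right away
// with codes.Unavailable instead of sitting in a queue until the deadline.
func callFailFast(conn *grpc.ClientConn, req, resp interface{}) error {
	ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// Old behavior: grpc.WaitForReady(true) queued the RPC until a backend
	// became ready or the deadline expired, masking real connection errors.
	err := conn.Invoke(ctx, "/example.Service/Method", req, resp, grpc.WaitForReady(false))
	if status.Code(err) == codes.Unavailable {
		// A true connection error surfaces immediately rather than after 15s.
		return fmt.Errorf("backend unavailable: %w", err)
	}
	return err
}
```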
Fixes #6834
Add a new gRPC server interceptor (both unary and streaming) which
verifies that the mTLS info set on the persistent connection has a
client cert which contains a name which is allowlisted for the
particular service being called, not just for the overall server.
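A minimal sketch of the idea, assuming the per-service allowlist is expressed as a map from gRPC service name to permitted client certificate SANs (the type and names are illustrative):
```
package example

import (
	"context"
	"strings"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
	"google.golang.org/grpc/status"
)

// serviceAllowlist maps a gRPC service name to the client certificate SANs
// permitted to call it.
type serviceAllowlist map[string]map[string]bool

func (a serviceAllowlist) unaryInterceptor(ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	// FullMethod looks like "/package.Service/Method"; extract the service.
	parts := strings.Split(info.FullMethod, "/")
	if len(parts) != 3 {
		return nil, status.Errorf(codes.Internal, "unrecognized method %q", info.FullMethod)
	}
	service := parts[1]

	p, ok := peer.FromContext(ctx)
	if !ok {
		return nil, status.Error(codes.Unauthenticated, "no peer info on connection")
	}
	tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
	if !ok || len(tlsInfo.State.VerifiedChains) == 0 {
		return nil, status.Error(codes.Unauthenticated, "connection is not mTLS-verified")
	}

	allowed := a[service]
	for _, name := range tlsInfo.State.VerifiedChains[0][0].DNSNames {
		if allowed[name] {
			return handler(ctx, req)
		}
	}
	return nil, status.Errorf(codes.PermissionDenied,
		"client certificate not allowlisted for service %q", service)
}
```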
This will allow us to make more services -- particularly the CA and the
SA -- more similar to the VA. We will be able to run multiple services
on the same port, while still being able to control access to those
services on a per-client basis. It will also let us split those services
(e.g. into read-only and read-write subsets) much more easily, because a
client will be able to switch which service it is calling without also
having to be reconfigured to call a different address. And finally, it
will allow us to simplify configuration for clients (such as the RA)
which maintain connections to multiple different services on the same
server, as they'll be able to re-use the same address configuration.
Create new gRPC interceptors which are capable of working
on streaming gRPC methods. Add these new interceptors, as
well as the default metrics interceptor provided by grpc-prometheus,
to all of our gRPC clients and servers.
The new interceptors behave virtually identically to their unary
counterparts: they wrap and unwrap our custom errors from the
gRPC metadata, they increment and decrement the in-flight RPC
metric, and they ensure that the RPCs don't fail-fast and do have
enough time left in their deadline to actually finish.
Unfortunately, because the interfaces for unary and streaming
RPCs are so divergent, it's not feasible to share code between the
two kinds of interceptors. While much of the new code is copy-pasted
from the old interceptors, there are subtle differences (such as not
immediately deferring the local context's cancel() function).
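For flavor, here is a sketch of just the deadline/fail-fast portion of a streaming client interceptor, with illustrative names; note the comment about why we can't defer cancel() here the way the unary version does:
```
package example

import (
	"context"
	"fmt"
	"time"

	"google.golang.org/grpc"
)

func streamClientInterceptor(minRemaining time.Duration) grpc.StreamClientInterceptor {
	return func(ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
		method string, streamer grpc.Streamer, opts ...grpc.CallOption) (grpc.ClientStream, error) {
		// Like the unary interceptor: require a deadline with enough time
		// left to do useful work, and don't fail fast.
		deadline, ok := ctx.Deadline()
		if !ok {
			return nil, fmt.Errorf("%s: no deadline set on streaming RPC", method)
		}
		if time.Until(deadline) < minRemaining {
			return nil, fmt.Errorf("%s: insufficient time remaining before deadline", method)
		}
		opts = append(opts, grpc.WaitForReady(true))

		// Unlike the unary interceptor, we cannot create a child context and
		// defer its cancel() here: the stream outlives this function.
		stream, err := streamer(ctx, desc, cc, method, opts...)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", method, err)
		}
		return stream, nil
	}
}
```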
Fixes #6356
These new linters are almost all part of golangci-lint's collection
of default linters, which would all be running if we weren't setting
`disable-all: true`. By adding them, we now have parity with the
default configuration, as well as the additional linters we like.
Adds the following linters:
* unconvert
* deadcode
* structcheck
* typecheck
* varcheck
* wastedassign
We have decided that we don't like the `if err := call(); err != nil`
syntax, because it creates confusing scopes, but we have not cleaned up
all existing instances of that syntax. However, we have now found a
case where that syntax enables a bug: it led readers to believe that
a later `err = call()` statement was assigning to an already-declared `err`
in the local scope, when in fact it was assigning to an
already-declared `err` in the parent scope of a closure. This prevented
our ineffassign and staticcheck linters from analyzing the
lifetime of the `err` variable, so they did not complain when we
never checked the actual value of that error.
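A condensed illustration of the bug class (the database calls and names here are made up, not the actual Boulder code):
```
package example

import (
	"database/sql"
	"log"
)

func outer(db *sql.DB) error {
	rows, err := db.Query("SELECT id FROM orders")
	if err != nil {
		return err
	}
	defer rows.Close()

	process := func() {
		if err := rows.Err(); err != nil { // declares a new, if-scoped err
			log.Print(err)
		}
		// Reads as if it reuses the err declared just above, but that one is
		// already out of scope; this assigns to outer's err instead, and
		// nothing ever checks that value.
		_, err = db.Exec("UPDATE orders SET processed = 1")
	}
	process()
	return nil // the Exec error is silently dropped
}

// Preferred two-line form, which makes the scope unambiguous:
//   err := rows.Err()
//   if err != nil { ... }
```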
This change standardizes on the two-line error checking syntax
everywhere, so that we can more easily ensure that our linters are
correctly analyzing all error assignments.
Add `stylecheck` to our list of lints, since it got separated out from
`staticcheck`. Fix the way we configure both to be clearer and not
rely on regexes.
Additionally fix a number of easy-to-change `staticcheck` and
`stylecheck` violations, allowing us to reduce our number of ignored
checks.
Part of #5681
Pull the "was the gRPC error a Canceled error" checking code out into a
separate interceptor, and add that interceptor only in the wfe and wfe2
gRPC clients.
Although the vast majority of our cancelations come from the HTTP client
disconnecting (and that cancelation being propagated through our gRPC
stack), there are a few other situations in which we cancel gRPC
connections, including when we receive a quorum of responses from VAs
and no longer need responses from the remaining remote VA(s). This
change ensures that we do not treat those other kinds of cancelations in
the same way that we treat client-initiated cancelations.
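A sketch of what the WFE-only interceptor might look like (the sentinel error and names are illustrative):
```
package example

import (
	"context"
	"errors"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// errClientDisconnected marks cancelations that reached us through the gRPC
// layer; in the WFE these almost always mean the HTTP client hung up.
var errClientDisconnected = errors.New("client disconnected before request completed")

// cancelInterceptor is added only to the wfe/wfe2 gRPC clients, so other
// services' deliberate internal cancelations are not misclassified.
func cancelInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	err := invoker(ctx, method, req, reply, cc, opts...)
	if status.Code(err) == codes.Canceled {
		return errClientDisconnected
	}
	return err
}
```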
Fixes #5444
A recent mysql driver upgrade caused a performance regression. We
believe this may be due to cancellations getting passed through to the
database driver, which as of the upgrade will more aggressively tear
down connections that experienced a cancellation.
Also, we only recently started propagating cancellations all the way
from the frontend in #5404.
This makes it so the driver doesn't see the cancellation.
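A sketch of one way to do that: hand the database layer a context that forwards values but drops the deadline and cancellation signal (Go 1.21+ has the similar context.WithoutCancel built in). The type name is illustrative:
```
package example

import (
	"context"
	"time"
)

// noCancelCtx forwards values from its parent but reports no deadline and no
// cancellation, so a driver handed this context never observes the caller
// hanging up and won't tear down its connection in response.
type noCancelCtx struct{ parent context.Context }

var _ context.Context = noCancelCtx{}

func (c noCancelCtx) Deadline() (time.Time, bool)       { return time.Time{}, false }
func (c noCancelCtx) Done() <-chan struct{}             { return nil }
func (c noCancelCtx) Err() error                        { return nil }
func (c noCancelCtx) Value(key interface{}) interface{} { return c.parent.Value(key) }
```
Queries would then run with `noCancelCtx{ctx}` rather than `ctx` itself.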
Second attempt at #5447
Use the built-in grpc-go client and server interceptor chaining
utilities, instead of the ones provided by go-grpc-middleware.
Simplify our interceptors to call their handlers/invokers directly,
instead of delegating to the metrics interceptor, and add the
metrics interceptor to the chains instead.
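A sketch of the resulting wiring; the interceptor parameters are placeholders for our real ones, and the insecure credentials are for illustration only:
```
package example

import (
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// newServerAndClient shows grpc-go's built-in chaining options replacing
// go-grpc-middleware's ChainUnaryServer/ChainUnaryClient, with the metrics
// interceptors sitting in the chain rather than being invoked from inside
// our own interceptors.
func newServerAndClient(addr string, ourServer grpc.UnaryServerInterceptor,
	ourClient grpc.UnaryClientInterceptor) (*grpc.Server, *grpc.ClientConn, error) {

	serverMetrics := grpc_prometheus.NewServerMetrics()
	srv := grpc.NewServer(grpc.ChainUnaryInterceptor(
		serverMetrics.UnaryServerInterceptor(), // metrics first
		ourServer,                              // then timeouts, error wrapping, etc.
	))

	clientMetrics := grpc_prometheus.NewClientMetrics()
	conn, err := grpc.Dial(addr,
		// Insecure credentials for illustration only; real clients use mTLS.
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithChainUnaryInterceptor(
			clientMetrics.UnaryClientInterceptor(),
			ourClient,
		),
	)
	return srv, conn, err
}
```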
Right now we sometimes get errors like:
rpc error: code = Unknown desc = rpc error:
code = DeadlineExceeded desc = context deadline exceeded
This happens, for instance, when an SA call times out and the RA returns that
timed-out error to the WFE. These are kind of confusing because they
have two layers of nested gRPC error, and they don't provide additional
information about which SA call timed out.
This change replaces DeadlineExceeded errors with our own error type
that includes the service and the method that were called, as well as
the amount of time it took (which helps understand if timeouts are
happening because earlier calls ate up time towards the deadline).
When the RA->SA NewOrder call times out and the RA returns that error to the WFE:
"InternalErrors":["rpc error: code = Unknown desc =
sa.StorageAuthority.NewOrder timed out after 14954 ms"]
When the WFE->RA NewOrder call times out:
"InternalErrors":["ra.RegistrationAuthority.NewOrder timed out after 15000 ms"]
Note that this change only handles timeouts at one level deep, which I
think is sufficient for our needs.
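A sketch of how a client-side interceptor can produce those richer messages (the formatting mirrors the examples above; names are illustrative):
```
package example

import (
	"context"
	"fmt"
	"strings"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// timeoutDetailInterceptor replaces a bare DeadlineExceeded status with an
// error that names the service and method and says how long the call took.
func timeoutDetailInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	begin := time.Now()
	err := invoker(ctx, method, req, reply, cc, opts...)
	if status.Code(err) == codes.DeadlineExceeded {
		// method looks like "/sa.StorageAuthority/NewOrder"; reformat it as
		// "sa.StorageAuthority.NewOrder" for the error message.
		name := strings.ReplaceAll(strings.TrimPrefix(method, "/"), "/", ".")
		return fmt.Errorf("%s timed out after %d ms", name, time.Since(begin).Milliseconds())
	}
	return err
}
```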
Removes the checks for a handful of deployed feature flags in preparation for removing the flags entirely. Also moves all of the currently deprecated flags to a separate section of the flags list so they can be more easily removed once purged from production configs.
Fixes #3880.
This PR updates the Boulder gRPC clientInterceptor to update a Prometheus gauge stat for each in-flight RPC it dispatches, sliced by service and method.
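A sketch of the gauge bookkeeping in a unary client interceptor (metric and label names are illustrative):
```
package example

import (
	"context"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

var inFlightRPCs = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "grpc_in_flight_rpcs",
		Help: "Number of in-flight RPCs, sliced by service and method.",
	},
	[]string{"service", "method"},
)

// inFlightInterceptor increments the gauge when an RPC is dispatched and
// decrements it when the RPC completes.
func inFlightInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	// method looks like "/sa.StorageAuthority/NewOrder".
	service, rpc := "unknown", method
	if parts := strings.SplitN(strings.TrimPrefix(method, "/"), "/", 2); len(parts) == 2 {
		service, rpc = parts[0], parts[1]
	}
	labels := prometheus.Labels{"service": service, "method": rpc}
	inFlightRPCs.With(labels).Inc()
	defer inFlightRPCs.With(labels).Dec()
	return invoker(ctx, method, req, reply, cc, opts...)
}
```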
A unit test is included that uses a custom ChillerServer that lets the test block up a bunch of RPCs, check that the in-flight gauge value has increased, unblock the RPCs, and recheck that the in-flight gauge is reduced. To check the gauge value for a specific set of labels, a new test-tools.go function, GaugeValueWithLabels, is added.
Updates #3635
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.
This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.
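A sketch of both halves, assuming the timestamp travels as Unix nanoseconds under an agreed-upon metadata key (names are illustrative):
```
package example

import (
	"context"
	"strconv"
	"time"

	"github.com/jmhodges/clock"
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/metadata"
)

const clientRequestTimeKey = "client-request-time"

var rpcTransportLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name: "grpc_client_to_server_latency_seconds",
	Help: "Delta between a client dispatching an RPC and the server receiving it.",
})

// stampRequestTime is called by the client interceptor to record dispatch time
// in the outgoing metadata.
func stampRequestTime(ctx context.Context, clk clock.Clock) context.Context {
	return metadata.AppendToOutgoingContext(ctx,
		clientRequestTimeKey, strconv.FormatInt(clk.Now().UnixNano(), 10))
}

// observeRequestLatency is called by the server interceptor to publish the
// observed delta, if the client supplied a timestamp.
func observeRequestLatency(ctx context.Context, clk clock.Clock) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok {
		return
	}
	if vals := md.Get(clientRequestTimeKey); len(vals) == 1 {
		if nanos, err := strconv.ParseInt(vals[0], 10, 64); err == nil {
			rpcTransportLatency.Observe(clk.Now().Sub(time.Unix(0, nanos)).Seconds())
		}
	}
}
```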
Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC-aware Boulder commands.
A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.
Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
gRPC passes deadline information through the RPC boundary, but client and server have the same deadline. Ideally we'd like the server to have a slightly tighter deadline than the client, so if one of the server's onward RPCs or other network calls times out, the server can pass back more detailed information to the client, rather than the client timing out the server and losing the opportunity to log more detailed information about which component caused the timeout.
In this change, I subtract 100ms from the deadline on the server side of our interceptors, using our existing serverInterceptor. I also check that there is at least 100ms remaining in which to do useful work, so the server doesn't begin a potentially expensive task only to abort it.
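A sketch of the server-side adjustment (the constant and error wording are illustrative):
```
package example

import (
	"context"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

const returnOverhead = 100 * time.Millisecond

// shaveDeadline tightens the incoming deadline by 100ms so the server can
// still package up and return a detailed error before the client gives up,
// and refuses to start work at all if too little time remains.
func shaveDeadline(ctx context.Context) (context.Context, context.CancelFunc, error) {
	deadline, ok := ctx.Deadline()
	if !ok {
		return nil, nil, status.Error(codes.InvalidArgument, "no deadline set on RPC")
	}
	deadline = deadline.Add(-returnOverhead)
	if time.Until(deadline) < returnOverhead {
		// Don't begin a potentially expensive task only to abort it.
		return nil, nil, status.Error(codes.DeadlineExceeded, "not enough time left on the clock")
	}
	newCtx, cancel := context.WithDeadline(ctx, deadline)
	return newCtx, cancel, nil
}
```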
Fixes #3608.
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.
Specifically, it adds two convenience functions, one for clients and one for servers, that make the necessary metrics object and register it. We run these in the main function of each server.
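Sketched, the two helpers might look like this (names are illustrative; the go-grpc-prometheus metrics types implement prometheus.Collector, so they can be registered directly):
```
package example

import (
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus"
)

// NewServerMetricsRegistered builds the go-grpc-prometheus server metrics and
// registers them with the given (non-global) registry.
func NewServerMetricsRegistered(reg prometheus.Registerer) *grpc_prometheus.ServerMetrics {
	m := grpc_prometheus.NewServerMetrics()
	reg.MustRegister(m)
	return m
}

// NewClientMetricsRegistered does the same for client metrics.
func NewClientMetricsRegistered(reg prometheus.Registerer) *grpc_prometheus.ClientMetrics {
	m := grpc_prometheus.NewClientMetrics()
	reg.MustRegister(m)
	return m
}
```
Each command's main function would then pass the returned metrics objects into its gRPC server or client constructors.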
I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.
Also, update go-grpc-prometheus to get the necessary methods.
```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
This patch removes all usages of the `core.XXXError` types and almost all usages of `probs` outside of the WFE and VA, replacing them with a unified internal error type. Since the VA uses `probs.ProblemDetails` quite extensively in challenges, and currently stores them in the DB, I've saved that work for a follow-up change (it'll also require a migration). Since `ProblemDetails` should only ever be exposed to end-users, all of its related logic should be moved into the `WFE`, but since it still needs to be exposed to the VA and SA I've left it in place for now.
The new internal `errors` package offers the same convenience functions as `probs`, as well as a new, simpler method of testing error types. A few small changes have also been made to error messages, mainly adding the library and function name to internal server errors for easier debugging (i.e. where a number of functions return the exact same errors and there is no other way to distinguish which method threw the error).
Also adds proper encoding of internal errors transferred over gRPC, using `grpc/metadata` instead of the gRPC status codes (the current encoding scheme is kept for `core` and `probs` errors since it will ideally be removed after we deploy this and follow-up changes).
Fixes#2507. Updates #2254 and #2505.
Instead of calling `unwrapError`/`wrapError` in each of the wrapper functions, do it in the server/client interceptors. This means we now consistently do error unwrapping/wrapping.
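A sketch of where the calls now live; the wrapError/unwrapError bodies below are trivial placeholders, not the real metadata-based encoding:
```
package example

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Placeholder helpers; the real ones encode and decode our internal error
// types via gRPC metadata.
func wrapError(err error) error   { return status.Error(codes.Unknown, err.Error()) }
func unwrapError(err error) error { return err }

// wrapServerInterceptor wraps errors exactly once, on the way out of every
// handler, instead of in each generated wrapper function.
func wrapServerInterceptor(ctx context.Context, req interface{},
	info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
	resp, err := handler(ctx, req)
	if err != nil {
		err = wrapError(err)
	}
	return resp, err
}

// unwrapClientInterceptor reverses the wrapping exactly once, on the way back
// into every client.
func unwrapClientInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	err := invoker(ctx, method, req, reply, cc, opts...)
	if err != nil {
		err = unwrapError(err)
	}
	return err
}
```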
Fixes #2509.
We have a number of stats already expressed using the statsd interface. During
the switchover period to direct Prometheus collection, we'd like to make those
stats available both ways. This change automatically exports any stats emitted
using the statsd interface via Prometheus as well.
This is a little tricky because Prometheus expects all stats to be registered
exactly once. Prometheus does offer a mechanism to gracefully recover from
registering a stat more than once by handling a certain error, but it is not
safe for concurrent access. So I added a concurrency-safe wrapper that creates
Prometheus stats on demand and memoizes them.
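A sketch of that wrapper, reduced to counters (names are illustrative):
```
package example

import (
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// promAdapter creates Prometheus counters on demand and memoizes them, so
// statsd-style callers can increment the same stat name repeatedly without
// ever triggering Prometheus' duplicate-registration error, and without
// racing each other.
type promAdapter struct {
	mu       sync.Mutex
	registry prometheus.Registerer
	counters map[string]prometheus.Counter
}

func (a *promAdapter) Inc(name string, value int64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	c, ok := a.counters[name]
	if !ok {
		c = prometheus.NewCounter(prometheus.CounterOpts{
			Name: name,
			Help: "automatically exported statsd-style stat",
		})
		a.registry.MustRegister(c)
		a.counters[name] = c
	}
	c.Add(float64(value))
}
```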
In the process, made a few small required side changes:
- Clean "/" from method names in the gRPC interceptors. They are allowed in
statsd but not in Prometheus.
- Replace "127.0.0.1" with "boulder" as the name of our testing CT log.
Prometheus stats can't start with a number.
- Remove ":" from the CT-log stat names emitted by Publisher. Prometheus stats
can't include it.
- Remove a stray "RA" in front of some rate limit stats, since it was
duplicative (we were emitting "RA.RA..." before).
Note that this means two stat groups in particular are duplicated:
- Gostats* is duplicated with the default process-level stats exported by the
Prometheus library.
- gRPCClient* are duplicated by the stats generated by the go-grpc-prometheus
package.
When writing dashboards and alerts in the Prometheus world, we should be careful
to avoid these two categories, as they will disappear eventually. As a general
rule, if a stat is available with an all-lowercase name, choose that one, as it
is probably the Prometheus-native version.
In the long run we will want to create most stats using the native Prometheus
stat interface, since it allows us to add labels to metrics, which is very
useful. For instance, currently our DNS stats distinguish types of queries by
appending the type to the stat name. This would be more natural as a label in
Prometheus.
This is a roll-forward of 5b865f1, with the QueueDeclare and QueueBind changes
in AMQP-RPC removed, and the startup order changes in test/startservers.py
removed. The AMQP-RPC changes caused RabbitMQ permission problems in production,
and the startup order changes depended on the AMQP-RPC changes but were not
required now that we also have a unit test.
This allows us to restart backends with relatively little interruption in
service, provided the backends come up promptly.
Fixes #2389 and #2408
There's an off-the-shelf package that provides most of the stats we care about
for gRPC using interceptors. This change vendors go-grpc-prometheus and its
dependencies, and calls out to the interceptors provided by that package from
our own interceptors.
This will allow us to get metrics like latency histograms by call, status codes
by call, and so on.
Fixes #2390.
This change vendors go-grpc-prometheus and its dependencies. Per contributing guidelines, I've run the tests on these dependencies, and they pass:
```
go test github.com/davecgh/go-spew/spew github.com/grpc-ecosystem/go-grpc-prometheus github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto github.com/pmezard/go-difflib/difflib github.com/stretchr/testify/assert github.com/stretchr/testify/require github.com/stretchr/testify/suite
ok github.com/davecgh/go-spew/spew 0.022s
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.120s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
ok github.com/pmezard/go-difflib/difflib 0.042s
ok github.com/stretchr/testify/assert 0.021s
ok github.com/stretchr/testify/require 0.017s
ok github.com/stretchr/testify/suite 0.012s
```
Previously we had custom code in each gRPC wrapper to implement timeouts. Moving
the timeout code into the client interceptor allows us to simplify things and
reduce code duplication.
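A sketch of the centralized timeout (field and method names are illustrative):
```
package example

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// timeoutClientInterceptor derives a child context with the configured timeout
// for every outgoing RPC, so the per-RPC wrapper functions no longer need to.
type timeoutClientInterceptor struct {
	timeout time.Duration
}

func (t *timeoutClientInterceptor) intercept(ctx context.Context, method string,
	req, reply interface{}, cc *grpc.ClientConn, invoker grpc.UnaryInvoker,
	opts ...grpc.CallOption) error {
	localCtx, cancel := context.WithTimeout(ctx, t.timeout)
	defer cancel()
	return invoker(localCtx, method, req, reply, cc, opts...)
}
```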
Based on experience with the new gRPC staging deployment. gRPC generates `FullMethod` names such as `-ServiceName-MethodName`, which can be confusing. For client calls to a service we actually want something formatted like `ServiceName-MethodName`, and for server requests we want just `MethodName`.
This PR adds a method to clean up the `FullMethod` names returned by gRPC and format them the way we expect.
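A sketch of such a cleanup helper, assuming the raw name arrives either as "/Service/Method" or already dash-sanitized as "-Service-Method" (the function shape is illustrative):
```
package example

import "strings"

// cleanMethod trims the leading separator and normalizes "/" to "-", yielding
// "Service-Method" for client-side stats; with trimService set it keeps only
// "Method" for server-side stats.
func cleanMethod(fullMethod string, trimService bool) string {
	fullMethod = strings.TrimLeft(fullMethod, "-/")
	fullMethod = strings.ReplaceAll(fullMethod, "/", "-")
	if trimService {
		if idx := strings.LastIndex(fullMethod, "-"); idx >= 0 {
			return fullMethod[idx+1:]
		}
	}
	return fullMethod
}
```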
Fixes #1880.
Updates google.golang.org/grpc and github.com/jmhodges/clock; both test suites pass. A few of the gRPC interfaces changed, so this also fixes those breakages.
Adds a server-side unary RPC interceptor which includes basic stats. We could also use this to add a server request ID to the context.Context to identify the call through the system, but really I'd rather do that on the client side before the RPC is sent, which requires the client interceptor implementation upstream. Also updates google.golang.org/grpc.
Updates #1880.