boulder

Commit Graph

Author	SHA1	Message	Date
Jacob Hoffman-Andrews	bef02e782a	Fix nits found by staticcheck (#4726 ) Part of #4700	2020-03-30 10:20:20 -07:00
Roland Bracewell Shoemaker	5b2f11e07e	Switch away from old style statsd metrics wrappers (#4606 ) In a handful of places I've nuked old stats which are not used in any alerts or dashboards as they either duplicate other stats or don't provide much insight/have never actually been used. If we feel like we need them again in the future it's trivial to add them back. There aren't many dashboards that rely on old statsd style metrics, but a few will need to be updated when this change is deployed. There are also a few cases where prometheus labels have been changed from camel to snake case, dashboards that use these will also need to be updated. As far as I can tell no alerts are impacted by this change. Fixes #4591.	2019-12-18 11:08:25 -05:00
Jacob Hoffman-Andrews	e3f797f9dc	grpc: Add better error message for timeouts. (#4324 ) Right now we sometimes get errors like: rpc error: code = Unknown desc = rpc error: code = DeadlineExceeded desc = context deadline exceeded For instance, when an SA call times out, and the RA returns that timed-out error to the WFE. These are kind of confusing because they have two layers of nested gRPC error, and they don't provide additional information about which SA call timed out. This change replaces DeadlineExceeded errors with our own error type that includes the service and the method that were called, as well as the amount of time it took (which helps understand if timeouts are happening because earlier calls ate up time towards the deadline). When the RA->SA NewOrder call times out, and the RA returns that error to WFE: "InternalErrors":["rpc error: code = Unknown desc = sa.StorageAuthority.NewOrder timed out after 14954 ms"] When the WFE->RA NewOrder call times out: "InternalErrors":["ra.RegistrationAuthority.NewOrder timed out after 15000 ms"] Note that this change only handles timeouts at one level deep, which I think is sufficient for our needs.	2019-07-08 13:47:25 -04:00
Roland Bracewell Shoemaker	6f93942a04	Consistently used stdlib context package (#4229 )	2019-05-28 14:36:16 -04:00
Roland Bracewell Shoemaker	a9a0846ee9	Remove checks for deployed features (#3881 ) Removes the checks for a handful of deployed feature flags in preparation for removing the flags entirely. Also moves all of the currently deprecated flags to a separate section of the flags list so they can be more easily removed once purged from production configs. Fixes #3880.	2018-10-17 20:29:18 -07:00
Roland Bracewell Shoemaker	876c727b6f	Update gRPC (#3817 ) Fixes #3474.	2018-08-20 10:55:42 -04:00
Joel Sing	f8a023e49c	Remove various unnecessary uses of fmt.Sprintf (#3707 ) Remove various unnecessary uses of fmt.Sprintf - in particular: - Avoid calls like t.Error(fmt.Sprintf(...)), where t.Errorf can be used directly. - Use strconv when converting an integer to a string, rather than using fmt.Sprintf("%d", ...). This is simpler and can also detect type errors at compile time. - Instead of using x.Write([]byte(fmt.Sprintf(...))), use fmt.Fprintf(x, ...).	2018-05-11 11:55:25 -07:00
Daniel McCarney	4f9ee00510	gRPC: publish in-flight RPC gauge in client interceptor. (#3672 ) This PR updates the Boulder gRPC clientInterceptor to update a Prometheus gauge stat for each in-flight RPC it dispatches, sliced by service and method. A unit test is included that uses a custom ChillerServer that lets the test block up a bunch of RPCs, check the in-flight gauge value is increased, unblock the RPCs, and recheck that the in-flight gauge is reduced. To check the gauge value for a specific set of labels a new test-tools.go function GaugeValueWithLabels is added. Updates #3635	2018-04-27 15:53:54 -07:00
Daniel McCarney	aa810a3142	gRPC: publish RPC latency stat in server interceptor. (#3665 ) We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it. This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram. Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands. A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder. Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.	2018-04-25 15:37:22 -07:00
Jacob Hoffman-Andrews	a4f9de9e35	Improve nesting of RPC deadlines (#3619 ) gRPC passes deadline information through the RPC boundary, but client and server have the same deadline. Ideally we'd like the server to have a slightly tighter deadline than the client, so if one of the server's onward RPCs or other network calls times out, the server can pass back more detailed information to the client, rather than the client timing out the server and losing the opportunity to log more detailed information about which component caused the timeout. In this change, I subtract 100ms from the deadline on the server side of our interceptors, using our existing serverInterceptor. I also check that there is at least 100ms remaining in which to do useful work, so the server doesn't begin a potentially expensive task only to abort it. Fixes #3608.	2018-04-06 15:40:18 +01:00
Jacob Hoffman-Andrews	68d5cc3331	Restore gRPC metrics (#3265 ) The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back. Specifically, it adds two convenience functions, one for clients and one for servers, that makes the necessary metrics object and registers it. We run these in the main function of each server. I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling. Also, update go-grpc-prometheus to get the necessary methods. ``` $ go test github.com/grpc-ecosystem/go-grpc-prometheus/... ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s ? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files] ```	2017-12-07 15:44:55 -08:00
Jacob Hoffman-Andrews	d542960a35	Remove statsd version of RPC stats (#2693 ) * Remove statsd-style RPC stats. * Remove tests for old code.	2017-04-25 10:10:35 -04:00
Jacob Hoffman-Andrews	263db24571	Disable fail-fast for gRPC. (#2397 ) (#2434 ) This is a roll-forward of `5b865f1`, with the QueueDeclare and QueueBind changes in AMQP-RPC removed, and the startup order changes in test/startservers.py removed. The AMQP-RPC changes caused RabbitMQ permission problems in production, and the startup order changes depended on the AMQP-RPC changes but were not required now that we have a unittest also. This allows us to restart backends with relatively little interruption in service, provided the backends come up promptly. Fixes #2389 and #2408	2016-12-15 12:52:34 -08:00
Jacob Hoffman-Andrews	5407a45b02	Revert "Disable fail-fast for gRPC. (#2397 )" (#2427 ) This reverts commit `5b865f1d63`. The QueueDeclare and QueueBind calls in that change caused AMQP permission denied errors.	2016-12-13 13:20:08 -08:00
Jacob Hoffman-Andrews	5b865f1d63	Disable fail-fast for gRPC. (#2397 ) This allows us to restart backends with relatively little interruption in service, provided the backends come up promptly. Fixes #2389 and #2408	2016-12-09 12:03:45 -08:00
Jacob Hoffman-Andrews	27a1446010	Move timeouts into client interceptor. (#2387 ) Previously we had custom code in each gRPC wrapper to implement timeouts. Moving the timeout code into the client interceptor allows us to simplify things and reduce code duplication.	2016-12-05 10:42:26 -05:00
Roland Bracewell Shoemaker	09483007bd	Cleanup gRPC metric formatting (#2218 ) Based on experience with the new gRPC staging deployment. gRPC generates `FullMethod` names such as `-ServiceName-MethodName` which can be confusing. For client calls to a service we actually want something formatted like `ServiceName-MethodName` and for server requests we want just `MethodName`. This PR adds a method to clean up the `FullMethod` names returned by gRPC and formats them the way we expect.	2016-10-14 10:26:13 -07:00
Roland Bracewell Shoemaker	e187c92715	Add gRPC client side metrics (#2151 ) Fixes #1880. Updates google.golang.org/grpc and github.com/jmhodges/clock, both test suites pass. A few of the gRPC interfaces changed so this also fixes those breakages.	2016-09-09 15:17:36 -04:00
Roland Bracewell Shoemaker	7b29dba75d	Add gRPC server-side interceptor (#1933 ) Adds a server side unary RPC interceptor which includes basic stats. We could also use this to add a server request ID to the context.Context to identify the call through the system, but really I'd rather do that on the client side before the RPC is sent which requires the client interceptor implementation upstream. Also updates google.golang.org/grpc. Updates #1880.	2016-06-20 11:27:32 -04:00

19 Commits