Remove various unnecessary uses of fmt.Sprintf - in particular:
- Avoid calls like t.Error(fmt.Sprintf(...)), where t.Errorf can be used directly.
- Use strconv when converting an integer to a string, rather than using
fmt.Sprintf("%d", ...). This is simpler and can also detect type errors at
compile time.
- Instead of using x.Write([]byte(fmt.Sprintf(...))), use fmt.Fprintf(x, ...).
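For illustration, the three patterns side by side (the variable names are made up; each "instead of" comment shows the call being replaced):

```go
package example

import (
	"fmt"
	"io"
	"strconv"
	"testing"
)

func examples(t *testing.T, w io.Writer, count int, status string) {
	// Instead of t.Error(fmt.Sprintf("got %d", count)), format directly:
	t.Errorf("got %d", count)

	// Instead of fmt.Sprintf("%d", count); strconv.Itoa only accepts an int,
	// so passing the wrong type is a compile-time error rather than bad output:
	_ = strconv.Itoa(count)

	// Instead of w.Write([]byte(fmt.Sprintf("status: %s\n", status))):
	fmt.Fprintf(w, "status: %s\n", status)
}
```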
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.
This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.
Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC-aware boulder commands.
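A minimal sketch of the two interceptors (the metadata key, histogram, and helper names are illustrative, not Boulder's actual identifiers):

```go
package example

import (
	"context"
	"strconv"
	"time"

	"github.com/jmhodges/clock"
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

const clientRequestTimeKey = "client-request-time" // illustrative metadata key

// Client side: stamp the outgoing context with the dispatch time.
func clientInterceptor(clk clock.Clock) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		nanos := strconv.FormatInt(clk.Now().UnixNano(), 10)
		ctx = metadata.AppendToOutgoingContext(ctx, clientRequestTimeKey, nanos)
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}

// Server side: read the timestamp back out and observe the delta.
func serverInterceptor(clk clock.Clock, latency prometheus.Histogram) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		if md, ok := metadata.FromIncomingContext(ctx); ok {
			if vals := md.Get(clientRequestTimeKey); len(vals) > 0 {
				if nanos, err := strconv.ParseInt(vals[0], 10, 64); err == nil {
					latency.Observe(clk.Now().Sub(time.Unix(0, nanos)).Seconds())
				}
			}
		}
		return handler(ctx, req)
	}
}
```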
A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0, and using real clocks would require finding a way to queue/delay requests inside gRPC machinery that isn't exposed to Boulder.
Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
* Randomize order of CT logs when submitting precerts so we maximize the chances we actually exercise all of the logs in a group and not just the first in the list.
* Add metrics for winning logs
The `TotalCertificates` rate limit serves to ensure we don't
accidentally exceed our OCSP signing capacity by issuing too many
certificates within a fixed period. In practice this rate limit has been
fragile and the associated queries have been linked to performance
problems.
Since we now have better means of monitoring our OCSP signing capacity
this commit removes the rate limit and associated code.
This commit updates the `boulder-ra` and `boulder-ca` commands to refuse
to start if their configured `MaxNames` is 0 (the default value). This
should always be set to a positive number.
This commit also updates `csr/csr.go` to always apply the max names
check since it will never be 0 after the change above.
Also refactor `FailOnError` to pull out a separate `Fail` function.
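Roughly, the resulting pattern looks like this (simplified: the error reporting here goes to stderr rather than through Boulder's logger, and the config field access differs per command):

```go
package example

import (
	"fmt"
	"os"
)

// Fail reports a fatal problem and exits nonzero; FailOnError delegates to it
// when an error is present.
func Fail(msg string) {
	fmt.Fprintln(os.Stderr, msg)
	os.Exit(1)
}

func FailOnError(err error, msg string) {
	if err != nil {
		Fail(fmt.Sprintf("%s: %s", msg, err))
	}
}

// In a command's main(), refuse to start on the zero value:
func checkMaxNames(maxNames int) {
	if maxNames == 0 {
		Fail("Config error: MaxNames must be set to a positive number")
	}
}
```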
Related to https://github.com/letsencrypt/boulder/issues/3632
Our various main.go functions gated some key code on whether the TLS
and/or GRPC config fields were present. Now that those fields are fully
deployed in production, we can simplify the code and require them.
Also, rename tls to tlsConfig everywhere to avoid confusion with the tls
package.
Avoid assigning to the same err from two different goroutines in
boulder-ca (fix a race).
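The race is the familiar pattern of two goroutines assigning to one shared err; a simplified before/after (the server variables are hypothetical):

```go
package example

import (
	"log"
	"net"
	"net/http"

	"google.golang.org/grpc"
)

// Racy: both goroutines write the same err variable, so any read of err
// races with the other goroutine's write.
func serveRacy(gs *grpc.Server, l net.Listener, hs *http.Server) {
	var err error
	go func() { err = gs.Serve(l) }()
	go func() { err = hs.ListenAndServe() }()
	_ = err
}

// Fixed: each result is sent over a channel instead of a shared variable.
func serveFixed(gs *grpc.Server, l net.Listener, hs *http.Server) {
	errs := make(chan error, 2)
	go func() { errs <- gs.Serve(l) }()
	go func() { errs <- hs.ListenAndServe() }()
	if err := <-errs; err != nil {
		log.Fatal(err)
	}
}
```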
Add a set of logs which will be submitted to but not relied on for their SCTs. This allows us to test submissions to a particular log, or to submit to a log which is not yet approved by a browser/root program.
Also add a feature which stops cancellation of the remaining submissions when racing to get an SCT from a group of logs.
Additionally add an informational log that always times out in config-next.
Fixes #3464 and fixes #3465.
This updates the PA component to allow authorization challenge types that are globally disabled if the account ID owning the authorization is on a configured whitelist for that challenge type.
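A rough sketch of the check this implies in the policy authority (the field and map names are hypothetical):

```go
package example

// policyAuthority holds which challenge types are globally enabled and, per
// challenge type, which account IDs may use it even when it is disabled.
type policyAuthority struct {
	enabledChallenges   map[string]bool           // e.g. "http-01" -> true
	challengeWhitelists map[string]map[int64]bool // challenge type -> allowed account IDs
}

// challengeAllowedFor reports whether a challenge type may be offered on an
// authorization owned by the given account.
func (pa *policyAuthority) challengeAllowedFor(challengeType string, accountID int64) bool {
	if pa.enabledChallenges[challengeType] {
		return true
	}
	return pa.challengeWhitelists[challengeType][accountID]
}
```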
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.
Specifically, it adds two convenience functions, one for clients and one for servers, that make the necessary metrics objects and register them. We run these in the main function of each server.
I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but that seemed like too much tight coupling.
Also, update go-grpc-prometheus to get the necessary methods.
```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
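For illustration, the two convenience functions can look roughly like this (function names are illustrative):

```go
package example

import (
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus"
)

// newClientMetrics builds the go-grpc-prometheus client metrics and registers
// them on the given (non-global) registry.
func newClientMetrics(stats prometheus.Registerer) *grpc_prometheus.ClientMetrics {
	metrics := grpc_prometheus.NewClientMetrics()
	stats.MustRegister(metrics)
	return metrics
}

// newServerMetrics does the same for the server-side metrics.
func newServerMetrics(stats prometheus.Registerer) *grpc_prometheus.ServerMetrics {
	metrics := grpc_prometheus.NewServerMetrics()
	stats.MustRegister(metrics)
	return metrics
}
```

The returned objects are then handed to the gRPC client and server constructors so their interceptors can record per-RPC metrics.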
Previously, we used prometheus.DefaultRegisterer to register our stats, which uses global state to export its HTTP stats. We also used net/http/pprof's behavior of registering to the default global HTTP ServeMux, via DebugServer, which starts an HTTP server that uses that global ServeMux.
In this change, I merge DebugServer's functions into StatsAndLogging. StatsAndLogging now takes an address parameter and fires off an HTTP server in a goroutine. That HTTP server is newly defined, and doesn't use DefaultServeMux. On it is registered the Prometheus stats handler, and handlers for the various pprof traces. In the process I split StatsAndLogging internally into two functions: makeStats and MakeLogger. I didn't port across the expvar variable exporting, which serves a similar function to Prometheus stats but which we never use.
One nice immediate effect of this change: since StatsAndLogging now requires an address, I noticed a bunch of commands that called StatsAndLogging and passed around the resulting Scope, but never made use of it because they didn't run a DebugServer. Under the old StatsD world, these commands could still have exported their stats by pushing, but since we moved to Prometheus their stats stopped being collected. We haven't used any of these stats, so instead of adding debug ports to all short-lived commands or setting up a push gateway, I simply removed them and switched those commands to initialize only a Logger, no stats.
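A minimal sketch of the debug-server piece, assuming a dedicated ServeMux and the standard pprof handler paths (the function name is illustrative):

```go
package example

import (
	"net/http"
	"net/http/pprof"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// startDebugServer serves Prometheus metrics and the pprof handlers on addr
// without touching the global http.DefaultServeMux.
func startDebugServer(addr string, registry *prometheus.Registry) {
	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.HandlerFor(registry, promhttp.HandlerOpts{}))
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

	go func() {
		_ = http.ListenAndServe(addr, mux)
	}()
}
```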
Fixes #3020.
In order to write integration tests for some features, especially related to rate limiting, rechecking of CAA, and expiration of authzs, orders, and certs, we need to be able to fake the passage of time in integration tests.
To do so, this change switches out all clock.Default() instances for cmd.Clock(), which can be set manually with the FAKECLOCK environment variable. integration-test.py now starts up all servers once before the main body of tests, with FAKECLOCK set to a date 70 days ago, and does some initial setup for a new integration test case. That test case tries to fetch a 70-day-old authz URL, and expects it to 404.
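A sketch of what a cmd.Clock() helper like this looks like, using github.com/jmhodges/clock (the timestamp format and error handling are assumptions):

```go
package example

import (
	"fmt"
	"os"
	"time"

	"github.com/jmhodges/clock"
)

// Clock returns the real clock unless FAKECLOCK is set, in which case it
// returns a fake clock pinned to that time, letting integration tests
// simulate starting the servers e.g. 70 days in the past.
func Clock() (clock.Clock, error) {
	tstamp := os.Getenv("FAKECLOCK")
	if tstamp == "" {
		return clock.Default(), nil
	}
	t, err := time.Parse(time.UnixDate, tstamp)
	if err != nil {
		return nil, fmt.Errorf("parsing FAKECLOCK: %s", err)
	}
	fc := clock.NewFake()
	fc.Set(t)
	return fc, nil
}
```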
In order to make this work, I also had to change a number of our test binaries to shut down cleanly in response to SIGTERM. Without that change, stopping the servers between the setup phase and the main tests caused startservers.check() to fail, because some processes exited with nonzero status.
Note: This is an initial stab at things, to prove out the technique. Long-term, I think we will want to use an idiom where test cases are classes that have a number of optional setup phases that may be run at e.g. 70 days prior and 5 days prior. This could help us avoid a proliferation of global state as we add more time-dependent test cases.
Previously, we would produce an error and a nonzero status code on shutdown, because gRPC's GracefulStop would cause s.Serve() to return an error. Now we filter that specific error and treat it as success. This also allows us to kill processes with SIGTERM instead of SIGKILL in integration tests.
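The filtering amounts to something like the following sketch (exactly which error Serve reports on a deliberate stop has varied across grpc-go versions, so the comparison shown is illustrative):

```go
package example

import (
	"net"

	"google.golang.org/grpc"
)

// filterShutdownError treats the error a deliberately stopped gRPC server
// reports as a clean shutdown and passes every other error through.
func filterShutdownError(err error) error {
	if err == grpc.ErrServerStopped {
		return nil
	}
	return err
}

func serve(s *grpc.Server, l net.Listener) error {
	return filterShutdownError(s.Serve(l))
}
```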
Fixes #2410.
Fixes #2889.
VA now implements two gRPC services: VA and CAA. These both run on the same port, but this allows implementation of the IsCAAValid RPC to skip using the gRPC wrappers, and makes it easier to potentially separate the service into its own package in the future.
RA.NewCertificate now checks the expiration times of authorizations, and will call out to VA to recheck CAA for those authorizations that were not validated recently enough.
Fixes #639.
This resolves something that has bugged me for two-plus years: our DNSResolverImpl is not a DNS resolver, it is a DNS client. This change just makes that obvious.
This used to be used for AMQP queue names. Now that AMQP is gone, these consts
were only used when printing a version string at startup. This changes
VersionString to just use the name of the current program, and removes
`const clientName = ` from many of our main.go's.
Adds a basic truncated modulus hash check for RSA keys that can be used to check keys against the Debian `{openssl,openssh,openvpn}-blacklist` lists of weak keys generated during the [Debian weak key incident](https://wiki.debian.org/SSLkeys).
Testing is gated on adding a new configuration key to the WFE, RA, and CA configs which contains the path to a directory which should contain the weak key lists.
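A rough sketch of the truncated-hash idea (the exact hash input format and truncation are assumptions about how the Debian blacklist entries are derived, not necessarily Boulder's implementation):

```go
package example

import (
	"crypto/rsa"
	"crypto/sha1"
	"fmt"
)

// The Debian blacklist files store only a suffix of a SHA-1 hash per weak
// key, so a small truncated hash is enough to index the loaded lists.
type truncatedHash [10]byte

type WeakRSAKeys struct {
	suffixes map[truncatedHash]struct{}
}

// Known reports whether the key's truncated modulus hash appears on any of
// the loaded weak-key lists.
func (w *WeakRSAKeys) Known(key *rsa.PublicKey) bool {
	_, present := w.suffixes[hashKey(key)]
	return present
}

func hashKey(key *rsa.PublicKey) truncatedHash {
	// Assumption: hash the upper-case hex modulus in the form the Debian
	// tools used ("Modulus=<hex>\n") and keep the last 10 bytes of the SHA-1.
	sum := sha1.Sum([]byte(fmt.Sprintf("Modulus=%X\n", key.N.Bytes())))
	var t truncatedHash
	copy(t[:], sum[len(sum)-10:])
	return t
}
```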
Fixes #157.
This removes the config and code to output to statsd.
- Change `cmd.StatsAndLogging` to output a `Scope`, not a `Statter`.
- Remove the prefixing of component name (e.g. "VA") in front of stats; this was stripped by `autoProm` but now no longer needs to be.
- Delete vendored statsd client.
- Delete `MockStatter` (generated by gomock) and `mocks.Statter` (hand generated) in favor of mocking `metrics.Scope`, which is the interface we now use everywhere.
- Remove a few unused methods on `metrics.Scope`, and update its generated mock.
- Refactor `autoProm` and add `autoRegisterer`, which can be included in a `metrics.Scope`, avoiding global state. `autoProm` now registers everything with the `prometheus.Registerer` it is given.
- Change va_test.go's `setup()` to not return a stats object; instead the individual tests that care about stats override `va.stats` directly.
Fixes #2639, #2733.
Previously, a given binary would have three TLS config fields (CA cert, cert,
key) for its gRPC server, plus each of its configured gRPC clients. In typical
use, we expect all three of those to be the same across both servers and clients
within a given binary.
This change reuses the TLSConfig type already defined for use with AMQP, adds a
Load() convenience function that turns it into a *tls.Config, and configures it
for use with all of the binaries. This should make configuration easier and more
robust, since it more closely matches usage.
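A sketch of the shared type and its Load() helper (field names and error handling here are approximate):

```go
package example

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
)

// TLSConfig holds the three file paths a binary shares between its gRPC
// server and all of its gRPC clients.
type TLSConfig struct {
	CertFile   string
	KeyFile    string
	CACertFile string
}

// Load turns the file paths into a *tls.Config that presents our certificate
// and verifies peers against the internal CA.
func (t TLSConfig) Load() (*tls.Config, error) {
	caPEM, err := ioutil.ReadFile(t.CACertFile)
	if err != nil {
		return nil, fmt.Errorf("reading CA cert: %s", err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("parsing CA cert from %q", t.CACertFile)
	}
	cert, err := tls.LoadX509KeyPair(t.CertFile, t.KeyFile)
	if err != nil {
		return nil, fmt.Errorf("loading key pair: %s", err)
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      roots, // used when this binary acts as a client
		ClientCAs:    roots, // used when it acts as a server
		ClientAuth:   tls.RequireAndVerifyClientCert,
	}, nil
}
```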
This change preserves temporary backwards-compatibility for the
ocsp-updater->publisher RPCs, since those are the only instances of gRPC
currently enabled in production.
This allows finer-grained control of which components can request issuance. The OCSP Updater should not be able to request issuance.
Also, update test/grpc-creds/generate.sh to reissue the certs properly.
Resolves #2417.
Previously we had custom code in each gRPC wrapper to implement timeouts. Moving
the timeout code into the client interceptor allows us to simplify things and
reduce code duplication.
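The interceptor version amounts to a few lines; a minimal sketch:

```go
package example

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// timeoutInterceptor applies a single per-RPC deadline, so the individual
// wrapper methods no longer need their own timeout handling.
func timeoutInterceptor(timeout time.Duration) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		ctx, cancel := context.WithTimeout(ctx, timeout)
		defer cancel()
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}
```

The interceptor is installed once per client connection (e.g. via grpc.WithUnaryInterceptor when dialing) rather than being repeated in every wrapper.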
Adds a gRPC server to the SA, and SA gRPC clients to the WFE, RA, CA, Publisher, OCSP updater, orphan finder, admin revoker, and expiration mailer.
Also adds a CA gRPC client to the OCSP Updater which was missed in #2193.
Fixes #2347.
Implements a less RPC-focused signal catch/shutdown method. Certain things that could probably also use this (e.g. `ocsp-updater`) haven't been given it, as they would require rather substantial changes to allow for a graceful shutdown approach.
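The general shape of the signal handling, as a minimal sketch (the helper name is illustrative):

```go
package example

import (
	"os"
	"os/signal"
	"syscall"
)

// catchSignals blocks until SIGTERM or SIGINT arrives, then runs the provided
// shutdown function.
func catchSignals(shutdown func()) {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	<-sigs
	shutdown()
}
```

A gRPC-based command can run `go catchSignals(grpcServer.GracefulStop)` before blocking in Serve; a command like `ocsp-updater` would need to pass a shutdown function that winds down its own loops, which is the "substantial changes" part.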
Fixes #2298.
With the current gRPC design the CA talks directly to the Publisher when calling SubmitToCT, which crosses security boundaries (secure internal segment -> internet-facing segment). This is dangerous if (however unlikely) the Publisher is compromised and a gRPC exploit allows memory corruption on the caller end of an RPC, which could expose sensitive information or cause arbitrary issuance.
Instead we move the RPC call to the RA, which is in a less sensitive network segment. Switching the call site from the CA to the RA is gated on adding the gRPC PublisherService object to the RA config.
Fixes #2202.
Add feature-flagged support for issuing for IDNs; fixes #597.
This patch expects that clients have performed valid IDNA2008 encoding on any label that includes unicode characters. Invalid encodings (including incompatible IDNA2003 encodings) will be rejected. No script-mixing or script-exclusion checks are performed: we assume that if a name is resolvable it conforms to the registrar's policies on these matters, and that if it uses non-standard scripts in subdomains etc. it is up to browsers to choose how to display those names.
Required a full update of the golang.org/x/net tree to pull in golang.org/x/net/idna; all test suites pass.
Updates #1699.
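For illustration, one way to check that a punycode label is a valid encoding with golang.org/x/net/idna is to require that it decodes and that re-encoding round-trips; this is a simplified sketch, not the actual validation code, and Boulder's policy checks go further:

```go
package example

import (
	"fmt"
	"strings"

	"golang.org/x/net/idna"
)

// checkIDNLabel rejects "xn--" labels that do not decode to unicode and
// re-encode back to exactly the same ASCII form.
func checkIDNLabel(label string) error {
	if !strings.HasPrefix(label, "xn--") {
		return nil // plain ASCII label, nothing to check here
	}
	ulabel, err := idna.ToUnicode(label)
	if err != nil {
		return fmt.Errorf("invalid IDN label %q: %s", label, err)
	}
	alabel, err := idna.ToASCII(ulabel)
	if err != nil || alabel != label {
		return fmt.Errorf("IDN label %q does not round-trip", label)
	}
	return nil
}
```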
Adds a new package, `features`, which exposes methods to set and check if various internal features are enabled. The implementation uses global state to store the features so that services embedded in another service do not each require their own features map in order to check if something is enabled.
Requires a `boulder-tools` image update to include `golang.org/x/tools/cmd/stringer`.
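A sketch of the shape of such a package, using a mutex-guarded global map (simplified: the real package uses typed flags with a generated stringer, per the `boulder-tools` note above):

```go
package features

import "sync"

// FeatureFlag identifies a feature; in the real package the stringer tool
// generates its String() method.
type FeatureFlag int

var (
	mu    sync.RWMutex
	flags = map[FeatureFlag]bool{}
)

// Set records which features are enabled, typically from configuration at
// startup.
func Set(enabled map[FeatureFlag]bool) {
	mu.Lock()
	defer mu.Unlock()
	for f, v := range enabled {
		flags[f] = v
	}
}

// Enabled reports whether a feature is on. Because the state is global, a
// service embedded inside another service sees the same answer without
// carrying its own feature map.
func Enabled(f FeatureFlag) bool {
	mu.RLock()
	defer mu.RUnlock()
	return flags[f]
}
```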
Fixes #1880.
Updates google.golang.org/grpc and github.com/jmhodges/clock; both test suites pass. A few of the gRPC interfaces changed, so this also fixes those breakages.
Introduces the `authorizationLifetimeDays` and `pendingAuthorizationLifetimeDays` configuration options for `RA`.
If the values are missing from the configuration, the code defaults back to the current values (300 and 7 days, respectively).
Fixes #2024.
This PR removes the use of all anonymous struct fields that I introduced as part of my work on splitting up boulder-config (#1962).
The root of the bug was in loading the JSON configuration file into the config struct. The config structs contained several embedded (anonymous) fields, and an embedded (anonymous) field in a struct flattens the corresponding JSON structure. This caused json.Unmarshal to look for a nested field such as AllowedSigningAlgos at the root level of the JSON object rather than at the nested level, and hence not find it.
See https://play.golang.org/p/6uVCsEu3Df for a working example.
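A minimal reproduction of the difference (the field names are borrowed from the description above):

```go
package main

import (
	"encoding/json"
	"fmt"
)

type KeyPolicy struct {
	AllowedSigningAlgos []string
}

// Embedded: json.Unmarshal expects AllowedSigningAlgos at the top level of
// the JSON object, because the inner fields are promoted (flattened).
type ConfigEmbedded struct {
	KeyPolicy
}

// Named: json.Unmarshal expects the field nested under "KeyPolicy".
type ConfigNamed struct {
	KeyPolicy KeyPolicy
}

func main() {
	nested := []byte(`{"KeyPolicy": {"AllowedSigningAlgos": ["RSA"]}}`)

	var e ConfigEmbedded
	_ = json.Unmarshal(nested, &e)
	fmt.Println(e.AllowedSigningAlgos) // [] - the nested field is never found

	var n ConfigNamed
	_ = json.Unmarshal(nested, &n)
	fmt.Println(n.KeyPolicy.AllowedSigningAlgos) // [RSA]
}
```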
This fixes the reported bug: #2018
In #1971 we added the CAASERVFAILExceptions config field and argument to NewDNSResolverImpl. This argument only needs to be passed to the VA, where we do CAA validations. However, I accidentally added code to the RA as well to use this new config field. This change backs that out.
Presently clients may request a new AuthZ be created for a domain that they have already proved authorization over. This results in unnecessary bloat in the authorizations table and duplicated effort.
This commit alters the `NewAuthorization` function of the RA such that before going through the work of creating a new AuthZ it checks whether there already exists a valid AuthZ for the domain/regID that expires in more than 24 hours from the current date. If there is, then we short circuit creation and return the existing AuthZ. When this case occurs the `RA.ReusedValidAuthz` counter is incremented to provide visibility.
Since clients requesting a new AuthZ and getting an AuthZ back expect to turn around and post updates to the corresponding challenges we also return early in `UpdateAuthorization` when asked to update an AuthZ that is already valid. When this case occurs the `RA.ReusedValidAuthzChallenge` counter is incremented.
All of the above behaviour is gated by a new RA config flag, `reuseValidAuthz`. In the default case (false) the RA does **not** reuse any AuthZs and instead maintains the historic behaviour: always creating a new AuthZ when requested, regardless of whether there are already valid AuthZs that could be reused. In the true case (enabled only in `boulder-config-next.json`) the AuthZ reuse described above is enabled.
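A sketch of the reuse check as seen from `NewAuthorization` (the storage-authority method and types are stand-ins, not Boulder's actual interfaces):

```go
package example

import (
	"context"
	"time"
)

// Authorization stands in for Boulder's core authorization type.
type Authorization struct {
	Status  string
	Expires time.Time
}

// authzGetter stands in for the slice of the SA the RA needs here.
type authzGetter interface {
	GetValidAuthorization(ctx context.Context, regID int64, domain string) (*Authorization, error)
}

// reusableAuthz returns an existing valid authorization for regID/domain if
// it expires more than 24 hours from now. A nil result means the RA should
// fall through to creating a new authorization, as before.
func reusableAuthz(ctx context.Context, sa authzGetter, now time.Time,
	regID int64, domain string, reuseValidAuthz bool) *Authorization {
	if !reuseValidAuthz {
		return nil
	}
	existing, err := sa.GetValidAuthorization(ctx, regID, domain)
	if err != nil || existing == nil {
		return nil // nothing to reuse
	}
	if existing.Expires.After(now.Add(24 * time.Hour)) {
		// The real code also increments the RA.ReusedValidAuthz counter here.
		return existing
	}
	return nil
}
```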
Resolves #1854.