This updates va.proto to use proto3 syntax, and updates
all clients of the autogenerated code to use the new types.
In particular, it removes indirection from built-in types
(proto3 uses ints, rather than pointers to ints, for example).
Depends on #5003Fixes#4956
ACME Challenges are well-known strings ("http-01", "dns-01", and
"tlsalpn-01") identifying which kind of challenge should be used
to verify control of a domain. Because they are well-known and
only certain values are valid, it is better to represent them as
something more akin to an enum than as bare strings. This also
improves our ability to ensure that an AcmeChallenge is not
accidentally used as some other kind of string in a different
context. This change also brings them closer in line with the
existing core.AcmeResource and core.OCSPStatus string enums.
Fixes#5009
Previously, canceled remote validations were simply noted and then
dropped on the floor. This should be safe, as they're theoretically
only canceled when the parent span (i.e. the local PerformValidation
RPC) ends. But for the sake of defense-in-depth, it seems better to
correctly mark canceled remote validations as having Problems, so
that their results cannot be accidentally used anywhere.
This results in a test behavior change: if EnforceMultiVA is on, and
some RPCs are canceled, this now results in validation failure. This
should not have any production impact, because remote validations
should only be canceled when the parent RPC early-exits, but that only
happens when EnforceMultiVA is not enabled. These tests now test a
case where the other remote validations were canceled for some other
reason, which should result in validation failure.
Updates the type of the ValidationAuthority's PerformValidation
method to be identical to that of the corresponding auto-generated
grpc method, i.e. directly taking and returning proto message
types, rather than exploded arguments.
This allows all logic to be removed from the VA wrappers, which
will allow them to be fully removed after the migration to proto3.
Also updates all tests and VA clients to adopt the new interface.
Depends on #4983 (do not review first four commits)
Part of #4956
In #3708, we added formatters for the the convenience methods in the
`probs` package.
However, in #4783, @alexzorin pointed out that we were incorrectly
passing an error message through fmt.Sprintf as the format parameter
rather than as a value parameter.
I proposed a fix in #4784, but during code review we concluded that the
underlying problem was the pattern of using format-style functions that
don't have some variant of printf in the name. That makes this wrong:
`probs.DNS(err.Error())`, and this right: `probs.DNS("%s", err)`. Since
that's an easy mistake to make and a hard one to spot during code review,
we're going to stop using this particular pattern and call `fmt.Sprintf`
directly.
This PR reverts #3708 and adds some `fmt.Sprintf` where needed.
In a handful of places I've nuked old stats which are not used in any alerts or dashboards as they either duplicate other stats or don't provide much insight/have never actually been used. If we feel like we need them again in the future it's trivial to add them back.
There aren't many dashboards that rely on old statsd style metrics, but a few will need to be updated when this change is deployed. There are also a few cases where prometheus labels have been changed from camel to snake case, dashboards that use these will also need to be updated. As far as I can tell no alerts are impacted by this change.
Fixes#4591.
In order to move multi perspective validation forward we need to support policy
in Boulder configuration that can relax multi-va requirements temporarily.
A similar mechanism was used in support of the gradual deprecation of the
TLS-SNI-01 challenge type and with the introduction of CAA enforcement and has
shown to be a helpful tool to have available when introducing changes that are
expected to break sites.
When the VA "multiVAPolicyFile" is specified it is assumed to be a YAML file
containing two lists:
1. disabledNames - a list of domain names that are exempt from multi VA
enforcement.
2. disabledAccounts - a list of account IDs that are exempt from multi VA
enforcement.
When a hostname or account ID is added to the policy we'll begin communication
with the related ACME account contact to establish that this is a temporary
measure and the root problem will need to be addressed before an eventual
cut-off date.
Resolves https://github.com/letsencrypt/boulder/issues/4455
This PR changes the VA to return `dns` problem type for errors when performing
HTTP-01 challenges for domains that have no IP addresses, or errors looking up
the IP addresses.
The `va.getAddrs` function is internal to the VA and can return
`berrors.BoulderError`s with a DNS type when there is an error, allowing the
calling code to convert this to a problem when required
using an updated `detailedError` function. This avoids some clunky conversion
the HTTP-01 code was doing that misrepresented DNS level errors as connection
problems with a DNS detail message.
In order to add an integration test for challenge validation that results in
`getAddrs` DNS level errors the Boulder tools image had to be bumped to a tag
that includes the latest `pebble-challtestsrv` that
supports mocking SERVFAILs. It isn't possible to mock this case with internal IP
addresses because our VA test configuration does not filter internal addresses
to support the testing context.
Additionally this branch removes the `UnknownHostProblem` from the `probs`
package:
1. It isn't used anywhere after 532c210
2. It's not a real RFC 8555 problem type. We should/do use the
DNS type for this.
Resolves https://github.com/letsencrypt/boulder/issues/4407
This will allow implementing sub-problems without creating a cyclic
dependency between `core` and `problems`.
The `identifier` package is somewhat small/single-purpose and in the
future we may want to move more "ACME" bits beyond the `identifier`
types into a dedicated package outside of `core`.
In some rare conditions the malformed HTTP response error message that
we match in the VA for HTTP-01 connections to HTTP/2 servers will be
returned as a raw `http.badStringError` that doesn't have a transport
connection broken prefix. In these cases the existing
`test_http2_http01_challenge` integration tests fails because the
`h2SettingsFrameErrRegex` doesn't match the returned error.
To accommodate this we make the `h2SettingsFrameErrRegex` optionally
match the transport connection broken prefix.
In practice it seems the only way to add a specific error for when an initial HTTP-01 challenge request is made to an HTTP/2 server mis-configured on `:80` is with a regex on the error string.
The error returned from the stdlib `http.Client` for HTTP to an HTTP/2 server is just an `errors.ErrorString` instance without any context (once you peel it out of the wrapping `url.Error`):
> Err:(*errors.errorString)(0xc420609bf0)}] errStr=[Get http://example.com/.well-known/acme-challenge/xxxxxxx: net/http: HTTP/1.x transport connection broken: malformed HTTP response "\x00\x00\x12\x04\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x80\x00\x04\x00\x01\x00\x00\x00\x05\x00\xff\xff\xff\x00\x00\x04\b\x00\x00\x00\x00\x00\u007f\xff\x00\x00\x00\x00\b\a\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01"]
Even directly in the stdlib code at the place in `http/response.go` that generates the error it's using a `&badStringError{}` and just putting the byte string that `textproto` read into it.
To detect this case in `detailedError` I added a pre-compiled regex that will match the net/http malformed HTTP response error for raw bytes matching an arbitrarily sized HTTP/2 SETTINGS frame. Per RFC "A SETTINGS frame MUST be sent by both endpoints at the start of a connection" and so this seems like a fairly reliable indicator of an unexpected HTTP/2 response in an HTTP/1.1 context.
Thanks to @mnordhoff for the detailed notes (and RFC refs) in #3416 It made this a lot easier!
Resolves#3416.
When I introduced the new HTTP-01 code I did it in `va/http.go` intending to try and make the very large `va.go` file a little bit smaller. This is the continuation of that work.
* f96ad92 - moves remaining HTTP-01 specific code to `va/http.go`.
* 1efb9a1 - moves TLS-ALPN-01 code into `va/tlsalpn.go`.
* 95ea567 - moves DNS-01 code into `va/dns.go`.
* 6ff0395 - moves unit tests from `va/va_test.go` into `va/http_test.go`, `va/tlsalpn_test.go` and `va/dns_test.go`.
In the end `va/va.go` contains code related to metrics, top level RPCs (e.g. `PerformValidation`), and the multi-VA code. This makes the file lengths much more manageable overall.
Note: There is certainly room for cleaning up some of the older unit test cruft from `va/va_test.go`. For now I only moved it as-is into the challenge specific test files.
* `EnforceMultiVA` to allow configuring multiple VAs but not changing the primary VA's result based on what the remote VAs return.
* `MultiVAFullResults` to allow collecting all of the remote VA results. When all results are collected a JSON log line with the differential between the primary/remote VAs is logged.
Resolves https://github.com/letsencrypt/boulder/issues/4066
* Remove the challenge whitelist
* Reduce the signature for ChallengesFor and ChallengeTypeEnabled
* Some unit tests in the VA were changed from testing TLS-SNI to testing the same behavior
in TLS-ALPN, when that behavior wasn't already tested. For instance timeouts during connect
are now tested.
Fixes#4109
Our integration test test_http_challenge_timeout occasionally fails with
boulder-ra [AUDIT] Could not communicate with VA: rpc
error: code = DeadlineExceeded desc = context deadline exceeded
In at least one of these cases, the VA correctly timed-out its HTTP
request and logged a validation error with the correct error message.
I believe that there is a race between the VA returning its validation
error to the RA, and the RA timing out its gRPC call. By shaving some
time off the context we should more reliably get the response back to
the RA.
The order the primary VA calls `PerformValidation` on configured
remote VAs is also changed to be done in a random order.
Resolves#4087
We're only using the simplified HTTP-01 code from `va/http.go` now 🎉 The old unit tests that still seem relevant are left in place in `va/va_test.go` instead of being moved to `va/http_test.go` to signal that they're a bit crufty and could probably use a separate cleanup. For now I'm hesitant to remove test coverage so I updated them in-place without moving them to a new home.
Resolves https://github.com/letsencrypt/boulder/issues/4089
The `singleDialTimeout` field was previously a global `const` in the
`va` package. Making it a field of the VA impl (and the dialer structs)
makes it easier to test that it is working as expected with a smaller
than normal value.
A new `TestPreresolvedDialerTimeout` unit test is added that tests the
fix from https://github.com/letsencrypt/boulder/pull/4046
Without the fix applied:
```
=== RUN TestPreresolvedDialerTimeout
--- FAIL: TestPreresolvedDialerTimeout (0.49s)
http_test.go:86: fetch didn't timeout after 50ms
FAIL
FAIL github.com/letsencrypt/boulder/va 0.512s
```
With the fix applied:
```
=== RUN TestPreresolvedDialerTimeout
--- PASS: TestPreresolvedDialerTimeout (0.05s)
PASS
ok github.com/letsencrypt/boulder/va 1.075s
```
Marshaling invalid UTF-8 strings to protocol buffers causes an error. This can
happen in VA `PerformValidation` RPC responses if remote servers return invalid
UTF-8 in some ACME challenge contexts. We previously fixed this for HTTP-01 and
DNS-01 but missed a case where TLS-ALPN-01/TLS-SNI-01 challenge response
certificate content was included in error messages without replacing invalid
UTF-8. That's now fixed & unit tests are added.
To aid in diagnosing any future instances the VA is also updated to proactively
attempt to marshal its `PerformValidation` results before handing off to the RPC
wrappers that will do the same. This way if we detect an error in marshaling the
VA can audit log the escaped content for investigation purposes.
Hopefully with these two efforts combined we can avoid any future VA RPC errors
from UTF-8 encoding.
Resolves https://github.com/letsencrypt/boulder/issues/3838
To complete https://github.com/letsencrypt/boulder/issues/3956 the `challtestsrv` is updated such that its existing TLS-ALPN-01 challenge test server will serve HTTP-01 responses with a self-signed certificate when a non-TLS-ALPN-01 request arrives. This lets the TLS-ALPN-01 challenge server double as a HTTPS version of the HTTP challenge server. The `challtestsrv` now also supports adding/remove redirects that will be served to clients when requesting matching paths.
The existing chisel/chisel2 integration tests are updated to use the `challtestsrv` instead of starting their own standalone servers. This centralizes our mock challenge responses and lets us bind the `challtestsrv` to the VA's HTTP port in `startservers.py` without clashing ports later on.
New integration tests are added for HTTP-01 redirect scenarios using the updated `challtestserv`. These test cases cover:
* valid HTTP -> HTTP redirect
* valid HTTP -> HTTPS redirect
* Invalid HTTP -> non-HTTP/HTTPS port redirect
* Invalid HTTP-> non-HTTP/HTTPS protocol scheme redirect
* Invalid HTTP-> bare IP redirect
* Invalid HTTP redirect loop
The new integration tests shook out two fixes that were required for the legacy VA HTTP-01 code (afad22b) and one fix for the challtestsrv mock DNS (59b7d6d).
Resolves https://github.com/letsencrypt/boulder/issues/3956
Continued bugs from the custom dialer approach used by the VA for HTTP-01 (most recently https://github.com/letsencrypt/boulder/issues/3889) motivated a rewrite.
Instead of using a custom dialer to be able to control DNS resolution for HTTP validation requests we can construct URLs for the IP addresses we resolve and overload the Host header. This avoids having to do address resolution within the dialer and eliminates the complexity of the dialer `addrInfoChan`. The only thing left for our custom dialer now is to shave some time off of the provided context to help us discern timeouts before/after connect.
The existing IP preference & fallback behaviour is preserved: e.g. if a host has both IPv6 and IPv4 addresses we connect to the first IPv6 address. If there is a network error connecting to that address (e.g. an error during "dial"), we try once more with the first IPv4 address. No other retries are done. Matching existing behaviour no fallback is done for HTTP level failures on an IPv6 address (e.g. mismatched webroots, redirect loops, etc). A new Prometheus counter "http01_fallbacks" is used to keep track of the number of fallbacks performed.
As a result of moving the layer at which the retry happens a fallback like described above will now produce two validation records: one for the initial IPv6 connection, and one for the IPv4 connection. Neither will have the "addressesTried" field populated, just "addressesResolved" and "addressUsed". Previously with the dialer doing the retry we would have created just one validation record with an IPv4 "addressUsed" field and both an IPv6 and IPv4 address in the "addressesTried" field.
Because this is a big diff for a key part of the VA the new code is gated by the `SimplifiedVAHTTP` feature flag.
Resolves#3889
In the VA, we were rendering a Duration to JSON, which gave an integer
number of nanoseconds rather than a float64 of seconds. Also, in both VA
and WFE we were rendering way more precision than we needed. Millisecond
precision is enough, and since we log latency for every WFE response,
the extra bytes are worth saving.
Previously, if a TLS handshake timed out, we would block forever in
`conn.Handshake()`, leaking both a TCP connection and a goroutine.
This sets a deadline on the underlying TCP connection, ensuring that
`conn.Handshake()` eventually times out.
Fixes#3915
Removes the checks for a handful of deployed feature flags in preparation for removing the flags entirely. Also moves all of the currently deprecated flags to a separate section of the flags list so they can be more easily removed once purged from production configs.
Fixes#3880.
For HTTP-01 challenges that return incorrect responses, the
VA tries to put the first little bit of the HTTP response in the problem
detail.
However, VA needs to be able to serialize the problem detail as a
protobuf to send it to the RA, and protobufs require string types to be
UTF-8. Filter out any invalid UTF-8 sequences and replace them with
REPLACEMENT CHARACTER.
If the context provided to a remote VA's `PerformValidation` is
cancelled we should not treat the returned context cancelled error as an
unexpected error and should instead ignore it as an expected result.
Previously, if a remote VA returned an error that is not a ProblemDetail, the
primary VA would log a ServerInternalProblem but not the underlying error.
This commit updates performRemoteValidation to always return the full error it
receives from a remote VA.
This commit also adds a unittest that checks that the VA still returns a
ServerInternalProblem to the RA, and that the VA audit logs the underlying
error.
Resolves https://github.com/letsencrypt/boulder/issues/3753
This adds support for the account-uri CAA parameter as specified by
section 3 of https://tools.ietf.org/html/draft-ietf-acme-caa-04, allowing
issuance to be restricted to one or more ACME accounts as specified by CAA
records.
When performing CAA checking respect the validation-methods parameter (if
present) and restrict the allowed authorization methods to those specified.
This allows a domain to restrict authorization methods that can be used with
Let's Encrypt.
This is largely based on PR #3003 (by @lukaslihotzki), which was landed and
then later reverted due to issue #3143. The bug the resulted in the previous
code being reverted has been addressed (likely inadvertently) by 76973d0f.
This implementation also includes integration tests for CAA validation-methods.
Fixes issue #3143.
Many of the probs.XYZ calls are of the form probs.XYZ(fmt.Sprintf(...)).
Convert these functions to take a format string and optional arguments,
following the same pattern used in the errors package. Convert the
various call sites to remove the now redundant fmt.Sprintf calls.
A very large number of the logger calls are of the form log.Function(fmt.Sprintf(...)).
Rather than sprinkling fmt.Sprintf at every logger call site, provide formatting versions
of the logger functions and call these directly with the format and arguments.
While here remove some unnecessary trailing newlines and calls to String/Error.
This change cleans up how `va.http01Dialer` works with regards to `core.ValidationRecord`s. Instead of using the record as both an input and a output it now uses a set of inputs and outputs information about addresses via a channel. The validation record is then constructed in the parent scope or in the redirect function instead of the dialer itself.
Fixes#2730, fixes#3109, and fixes#3663.