When a retryable error occurs and there are multiple DNS servers configured it is prudent to change servers before retrying the query. This helps ensure that one dead DNS server won't result in queries failing.
Resolves https://github.com/letsencrypt/boulder/issues/3846
Remove various unnecessary uses of fmt.Sprintf - in particular:
- Avoid calls like t.Error(fmt.Sprintf(...)), where t.Errorf can be used directly.
- Use strconv when converting an integer to a string, rather than using
fmt.Sprintf("%d", ...). This is simpler and can also detect type errors at
compile time.
- Instead of using x.Write([]byte(fmt.Sprintf(...))), use fmt.Fprintf(x, ...).
Distinguish between deadline exceeded vs canceled. Also, combine those
two cases with "out of retries" into a single stat with a label
determining type.
Also instead of repeating the same bucket definitions everywhere just use a single top level var in the metrics package in order to discourage copy/pasting.
Fixes#3607.
Before this change, we would just log "Correct value not found for DNS challenge"
when we got a TXT record that didn't match what we expected. This was different
from the error when no TXT records were found at all, but viewing the error out of
context doesn't make that clear. This change improves the error to specifically say
that we found a TXT record, but it was the wrong one.
Also in this change: if we found multiple TXT records, we mention the number;
and we trim the length of the echoed TXT record.
We used to use TCP because we would request DNSSEC records from Unbound, and
they would always cause truncated records when present. Now that we no longer
request those (#2718), we can use UDP. This is better because the TCP serving
paths in Unbound are likely less thoroughly tested, and not optimized for high
load. In particular this may resolve some availability problems we've seen
recently when trying to upgrade to a more recent Unbound.
Note that this only affects the Boulder->Unbound path. The Unbound->upstream
path is already UDP by default (with TCP fallback for truncated ANSWERs).
There were a bunch of test fixtures in bdns/mocks.go that were only used in va/caa_test.go. This moves them to be in the same file so we have less spooky action at a distance.
One side-effect: We can't construct bdns.DNSError with the internal fields we want, because those fields are unexported. So we switch a couple of mock cases to just return a generic error, and the corresponding test cases to expect that error.
Since the legacy CAA spec does the wrong thing with DNAMEs (treating them as
CNAMEs), and it's hard to reconcile this approach with CNAME handling, and
DNAMEs are extremely rare, reject outright any CAA responses containing DNAMEs.
Also, in the process, fix a bug in the previous LegacyCAA implementation.
Because the processing of records in LookupCAA was gated by `if
answer.Header().RRType == dnsType`, non-CAA responses were filtered out. This
wasn't caught by previous testing, because it was unittesting that mocked out
bdns.
This implements the pre-erratum 5065 version of CAA, behind a feature flag.
This involved refactoring DNSClient.LookupCAA to return a list of CNAMEs in addition to the CAA records, and adding an alternate lookuper that does tree-climbing on single-depth aliases.
Add a logging statement that fires when a remote VA fail causes
overall failure. Also change remoteValidationFailures into a
counter that counts the same thing, instead of a histogram. Since
the histogram had the default bucket sizes, it failed to collect
what we needed, and produced more metrics than necessary.
Fixes#639.
This resolves something that has bugged me for two+ years, our DNSResolverImpl is not a DNS resolver, it is a DNS client. This change just makes that obvious.
This PR introduces a new feature flag "IPv6First".
When the "IPv6First" feature is enabled the VA's HTTP dialer and TLS SNI
(01 and 02) certificate fetch requests will attempt to automatically
retry when the initial connection was to IPv6 and there is an IPv4
address available to retry with.
This resolves https://github.com/letsencrypt/boulder/issues/2623
I'm interested in seeing both how often DNS responses we see are signed (mainly for CAA, but
also interested in other query types). This change adds a new counter, `Authenticated`, that can
be compared against the `Successes` counter to find the percentage of signed responses we see.
The counter is incremented if the `msg.AuthenticatedData` bit is set by the upstream resolver.
Remove `SetEdns0` call in `bdns.exchangeOne`. Since we talk over TCP to the production
resolver and we don't do any local validation of DNSSEC records adding the EDNS0 OPT
record is pointless and confusing. Testing against a local `unbound` instance shows you
don't need to set the DO bit for DNSSEC requests/validation to be done at the resolver
level.
This patch removes all usages of the `core.XXXError` and almost all usages of `probs` outside of the WFE and VA and replaces them with a unified internal error type. Since the VA uses `probs.ProblemDetails` quite extensively in challenges, and currently stores them in the DB I've saved this change for another change (it'll also require a migration). Since `ProblemDetails` should only ever be exposed to end-users all of its related logic should be moved into the `WFE` but since it still needs to be exposed to the VA and SA I've left it in place for now.
The new internal `errors` package offers the same convenience functions as `probs` does as well as a new simpler type testing method. A few small changes have also been made to error messages, mainly adding the library and function name to internal server errors for easier debugging (i.e. where a number of functions return the exact same errors and there is no other way to distinguish which method threw the error).
Also adds proper encoding of internal errors transferred over gRPC (the current encoding scheme is kept for `core` and `probs` errors since it'll be ideally be removed after we deploy this and follow-up changes) using `grpc/metadata` instead of the gRPC status codes.
Fixes#2507. Updates #2254 and #2505.
Presently when the VA performs a DNS-01 challenge verification it
returns the same error for the case where the remote nameserver had the
**wrong** TXT value, and when the remote nameserver had an **empty**
response for the TXT query. It would aid debugging if the user was told
which of the two failure cases was responsible for the overall challenge
failure.
This commit adds a unique error message for the empty TXT records case,
and a unit test/mock to exercise the new the error message.
Resolves#2326
This PR reworks the validateEmail() function from the RA to allow timeouts during DNS validation of MX/A/AAAA records for an email to be non-fatal and match our intention to verify emails best-effort.
Notes:
bdns/problem.go - DNSError.Timeout() was changed to also include context cancellation and timeout as DNS timeouts. This matches what DNSError.Error() was doing to set the error message and supports external callers to Timeout not duplicating the work.
bdns/mocks.go - the LookupMX mock was changed to support always.error and always.timeout in a manner similar to the LookupHost mock. Otherwise the TestValidateEmail unit test for the RA would fail when the MX lookup completed before the Host lookup because the error wouldn't be correct (empty DNS records vs a timeout or network error).
test/config/ra.json, test/config-next/ra.json - the dnsTries and dnsTimeout values were updated such that dnsTries * dnsTimeout was <= the WFE->RA RPC timeout (currently 15s in the test configs). This allows the dns lookups to all timeout without the overall RPC timing out.
Resolves#2260.
The LookupIPv6 flag has been enabled in production and isn't required anymore. This PR removes the flag entirely.
The errA and errAAAA error handling in LookupHost is left as-is, meaning that a non-nil errAAAA will not be returned to the caller. This matches the existing behaviour, and the expectations of the TestDNSLookupHost unit tests.
This commit also removes the tests from TestDNSLookupHost that tested the LookupIPv6 == false behaviours since those are no longer implemented.
Resolves#2191
Updates #1699.
Adds a new package, `features`, which exposes methods to set and check if various internal features are enabled. The implementation uses global state to store the features so that services embedded in another service do not each require their own features map in order to check if something is enabled.
Requires a `boulder-tools` image update to include `golang.org/x/tools/cmd/stringer`.
This PR, makes testing the bdns package more reliable. A race condition in TestMain was resulting in the test running before the test dns server had started. This is fixed by actively polling for the DNS server to be ready before starting the test suite.
Furthermore, a 1 millisecond server read/write timeout was proving to time out on occasion. This is fixed increasing to a 1 second read/write timeout to increase test reliability.
FYI: ran package bdns tests 1000 times with 22 failures previously, after this PR ran 1000 times with 0 failures.
fixes#1317
Several of the `ProblemType`s had convenience functions to instantiate `ProblemDetails`s using their type and a detail message. Where these existed I did a quick scan of the codebase to convert places where callers were explicitly constructing the `ProblemDetails` to use the convenience function.
For the `ProblemType`s that did not have such a function, I created one and then converted callers to use it.
Solves #1837.
When a CAA request to Unbound times out, fall back to checking CAA via Google Public DNS' HTTPS API, through multiple proxies so as to hit geographically distributed paths. All successful multipath responses must be identical in order to succeed, and at most one can fail.
Fixes#1618
exchangeOne used a deferd method which contained a expression as a argument. Because of how defer works the arguments where evaluated immediately (unlike the method) causing the total latency to always be the same.
* Split out CAA checking service (minus logging etc)
* Add example.yml config + follow general Boulder style
* Update protobuf package to correct version
* Add grpc client to va
* Add TLS authentication in both directions for CAA client/server
* Remove go lint check
* Add bcodes package listing custom codes for Boulder
* Add very basic (pull-only) gRPC metrics to VA + caa-service
* Fix all errcheck errors
* Add errcheck to test.sh
* Add a new sa.Rollback method to make handling errors in rollbacks easier.
This also causes a behavior change in the VA. If a HTTP connection is
abruptly closed after serving the headers for a non-200 response, the
reported error will be the read failure instead of the non-200.
Also, stop skipping CAA lookups for the root TLDs. The RFC is unclear on
the desired behavior here, but the ICANNTLD function is nonstandard and
the behavior is strictly more conservative than what we had before.
This unblocks the removal of the ICANNTLD function, which allows us to
stop forking upstream.
Closes#1522
The CAA response checking method has been refactored to have a
easier to follow straight-line control flow. Several bugs in it have
been fixed:
- Firstly, parameters for issue and issuewild directives were not
parsed, so any attempt to specify parameters would result in
a string mismatch with the CA CAA identity (e.g. "letsencrypt.org").
Moreover, the syntax as specified permits leading and trailing
whitespace, so a parameter-free record such as
" letsencrypt.org ; " would not be considered a match.
This has been fixed by stripping whitespace and parameters. The RFC
does not specify the criticality of parameters, so unknown
parameters (currently all parameters) are considered noncritical.
I justify this as follows:
If someone decides to nominate a CA in a CAA record, they can,
with trivial research, determine what parameters, if any, that
CA supports, and presumably in trusting them in the first place
is able to adequately trust that the CA will continue to support
those parameters. The risk from other CAs is zero because other CAs
do not process the parameters because the records in which they
appear they do not relate to them.
- Previously, all of the flag bits were considered to effectively mean
'critical'. However, the RFC specifies that all bits except for the
actual critical bit (decimal 128) should be ignored. In practice,
many people have misunderstood the RFC to mean that the critical bit
is decimal 1, so both bits are interpreted to mean 'critical', but
this change ignores all of the other bits. This ensures that the
remaining six bits are reasonably usable for future standards action
if any need should arise.
- Previously, existence of an "issue" directive but no "issuewild"
directive was essentially equivalent to an unsatisfiable "issuewild"
directive, meaning that no wildcard identifiers could pass the CAA
check. This is contrary to the RFC, which states that issuewild
should default to what is specified for "issue" if no issuewild
directives are specified. (This is somewhat moot since boulder
doesn't currently support wildcard issuance.)
- Conversely, existence of an "issuewild" directive but no "issue"
directive would cause CAA validation for a non-wildcard identifier
to fail, which was contrary to the RFC. This has been fixed.
- More generally, existence of any unknown non-critical directive, say
"foobar", would cause the CAA checking code to act as though an
unsatisfiable "issue" directive existed, preventing any issuance.
This has been fixed.
Test coverage for corner cases is enhanced and provides regression
testing for these bugs.
statsd statistics have been added for tracking the relative frequency
of occurrence of different CAA features and outcomes. I added these on
a whim suspecting that they may be of interest.
Fixes#1436.