When a retryable error occurs and there are multiple DNS servers configured it is prudent to change servers before retrying the query. This helps ensure that one dead DNS server won't result in queries failing.
Resolves https://github.com/letsencrypt/boulder/issues/3846
Distinguish between deadline exceeded vs canceled. Also, combine those
two cases with "out of retries" into a single stat with a label
determining type.
We used to use TCP because we would request DNSSEC records from Unbound, and
they would always cause truncated records when present. Now that we no longer
request those (#2718), we can use UDP. This is better because the TCP serving
paths in Unbound are likely less thoroughly tested, and not optimized for high
load. In particular this may resolve some availability problems we've seen
recently when trying to upgrade to a more recent Unbound.
Note that this only affects the Boulder->Unbound path. The Unbound->upstream
path is already UDP by default (with TCP fallback for truncated ANSWERs).
Since the legacy CAA spec does the wrong thing with DNAMEs (treating them as
CNAMEs), and it's hard to reconcile this approach with CNAME handling, and
DNAMEs are extremely rare, reject outright any CAA responses containing DNAMEs.
Also, in the process, fix a bug in the previous LegacyCAA implementation.
Because the processing of records in LookupCAA was gated by `if
answer.Header().RRType == dnsType`, non-CAA responses were filtered out. This
wasn't caught by previous testing, because it was unittesting that mocked out
bdns.
This implements the pre-erratum 5065 version of CAA, behind a feature flag.
This involved refactoring DNSClient.LookupCAA to return a list of CNAMEs in addition to the CAA records, and adding an alternate lookuper that does tree-climbing on single-depth aliases.
Add a logging statement that fires when a remote VA fail causes
overall failure. Also change remoteValidationFailures into a
counter that counts the same thing, instead of a histogram. Since
the histogram had the default bucket sizes, it failed to collect
what we needed, and produced more metrics than necessary.
Fixes#639.
This resolves something that has bugged me for two+ years, our DNSResolverImpl is not a DNS resolver, it is a DNS client. This change just makes that obvious.
The LookupIPv6 flag has been enabled in production and isn't required anymore. This PR removes the flag entirely.
The errA and errAAAA error handling in LookupHost is left as-is, meaning that a non-nil errAAAA will not be returned to the caller. This matches the existing behaviour, and the expectations of the TestDNSLookupHost unit tests.
This commit also removes the tests from TestDNSLookupHost that tested the LookupIPv6 == false behaviours since those are no longer implemented.
Resolves#2191
This PR, makes testing the bdns package more reliable. A race condition in TestMain was resulting in the test running before the test dns server had started. This is fixed by actively polling for the DNS server to be ready before starting the test suite.
Furthermore, a 1 millisecond server read/write timeout was proving to time out on occasion. This is fixed increasing to a 1 second read/write timeout to increase test reliability.
FYI: ran package bdns tests 1000 times with 22 failures previously, after this PR ran 1000 times with 0 failures.
fixes#1317
* Split out CAA checking service (minus logging etc)
* Add example.yml config + follow general Boulder style
* Update protobuf package to correct version
* Add grpc client to va
* Add TLS authentication in both directions for CAA client/server
* Remove go lint check
* Add bcodes package listing custom codes for Boulder
* Add very basic (pull-only) gRPC metrics to VA + caa-service
* Fix all errcheck errors
* Add errcheck to test.sh
* Add a new sa.Rollback method to make handling errors in rollbacks easier.
This also causes a behavior change in the VA. If a HTTP connection is
abruptly closed after serving the headers for a non-200 response, the
reported error will be the read failure instead of the non-200.
This is more what we expect from a dns server.
dig A nx.google.com @ns2.google.com
; <<>> DiG 9.9.5-3ubuntu0.7-Ubuntu <<>> A nx.google.com @ns2.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 28643
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;nx.google.com. IN A
;; AUTHORITY SECTION:
google.com. 60 IN SOA ns4.google.com. dns-admin.google.com. 112672771 900 900 1800 60
;; Query time: 13 msec
;; SERVER: 216.239.34.10#53(216.239.34.10)
;; WHEN: Thu Jan 21 14:44:06 CET 2016
;; MSG SIZE rcvd: 81
VS
dig A www.google.com @ns2.google.com
; <<>> DiG 9.9.5-3ubuntu0.7-Ubuntu <<>> A www.google.com @ns2.google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 18684
;; flags: qr aa rd; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;www.google.com. IN A
;; ANSWER SECTION:
www.google.com. 300 IN A 64.233.184.99
www.google.com. 300 IN A 64.233.184.105
www.google.com. 300 IN A 64.233.184.106
www.google.com. 300 IN A 64.233.184.104
www.google.com. 300 IN A 64.233.184.147
www.google.com. 300 IN A 64.233.184.103
;; Query time: 13 msec
;; SERVER: 216.239.34.10#53(216.239.34.10)
;; WHEN: Thu Jan 21 14:44:32 CET 2016
;; MSG SIZE rcvd: 128
Previously we would return a detailed errorString, which ProblemDetailsFromDNSError
would turn into a generic, uninformative "Server failure at resolver".
Now we return a new internal dnsError type, which ProblemDetailsFromDNSError can
turn into a more informative message to be shown to the user.
This provides a means to add retries to DNS look ups, and, with some
future work, end retries early if our request deadline is blown. That
future work is tagged with #1292.
Updates #1258
This moves the RTT metrics calculation inside of the DNSResolver. This
cleans up code in the RA and VA and makes some adding retries to the
DNSResolver less ugly to do.
Note: this will put `Rate` and `RTT` after the name of DNS query
type (`A`, `MX`, etc.). I think that's fine and desirable. We aren't
using this data in alerts or many dashboards, yet, so a flag day is
okay.
Fixes#1124
Moves the DNS code from core to dns and renames the dns package to bdns
to be clearer.
Fixes#1260 and will be good to have while we add retries and such.