Commit Graph

64 Commits

Author SHA1 Message Date
Daniel McCarney 94bcebd658
VA: Ignore cancelled errs from remote VAs. (#3827)
If the context provided to a remote VA's `PerformValidation` is
cancelled we should not treat the returned context cancelled error as an
unexpected error and should instead ignore it as an expected result.
2018-08-27 12:20:54 -04:00
Roland Bracewell Shoemaker 1ef93c3809 Support both obsolete and new TLS-ALPN OID (#3819) 2018-08-20 10:51:33 -04:00
Daniel McCarney 2dadd5e09a VA: Log exceptional non-problem remote VA errors. (#3760)
Previously, if a remote VA returned an error that is not a ProblemDetail, the
primary VA would log a ServerInternalProblem but not the underlying error.
This commit updates performRemoteValidation to always return the full error it
receives from a remote VA.

This commit also adds a unittest that checks that the VA still returns a
ServerInternalProblem to the RA, and that the VA audit logs the underlying
error.

Resolves https://github.com/letsencrypt/boulder/issues/3753
2018-06-15 10:53:16 -07:00
Roland Bracewell Shoemaker 813aa788e9 Assume acmeValidation-v1 is wrapped OCTET STRING (#3752)
As defined by the spec.
2018-06-11 14:44:13 -07:00
Joel Sing 9c2859c87b Add support for CAA account-uri validation. (#3736)
This adds support for the account-uri CAA parameter as specified by
section 3 of https://tools.ietf.org/html/draft-ietf-acme-caa-04, allowing
issuance to be restricted to one or more ACME accounts as specified by CAA
records.
2018-06-08 12:08:03 -07:00
Maciej Dębski bb9ddb124e Implement TLS-ALPN-01 and integration test for it (#3654)
This implements newly proposed TLS-ALPN-01 validation method, as described in
https://tools.ietf.org/html/draft-ietf-acme-tls-alpn-01 This challenge type is disabled 
except in the config-next tree.
2018-06-06 13:04:09 -04:00
Joel Sing 2540d59296 Implement CAA validation-methods checking. (#3716)
When performing CAA checking respect the validation-methods parameter (if
present) and restrict the allowed authorization methods to those specified.
This allows a domain to restrict authorization methods that can be used with
Let's Encrypt.

This is largely based on PR #3003 (by @lukaslihotzki), which was landed and
then later reverted due to issue #3143. The bug the resulted in the previous
code being reverted has been addressed (likely inadvertently) by 76973d0f.

This implementation also includes integration tests for CAA validation-methods.

Fixes issue #3143.
2018-05-23 14:32:31 -07:00
Joel Sing 9990d14654 Convert the probs functions to be formatters. (#3708)
Many of the probs.XYZ calls are of the form probs.XYZ(fmt.Sprintf(...)).
Convert these functions to take a format string and optional arguments,
following the same pattern used in the errors package. Convert the
various call sites to remove the now redundant fmt.Sprintf calls.
2018-05-11 11:51:16 -07:00
Joel Sing 8ebdfc60b6 Provide formatting logger functions. (#3699)
A very large number of the logger calls are of the form log.Function(fmt.Sprintf(...)).
Rather than sprinkling fmt.Sprintf at every logger call site, provide formatting versions
of the logger functions and call these directly with the format and arguments.

While here remove some unnecessary trailing newlines and calls to String/Error.
2018-05-10 11:06:29 -07:00
Roland Bracewell Shoemaker b2a2a24dc3 Stop using validation record as an input/output (#3694)
This change cleans up how `va.http01Dialer` works with regards to `core.ValidationRecord`s. Instead of using the record as both an input and a output it now uses a set of inputs and outputs information about addresses via a channel. The validation record is then constructed in the parent scope or in the redirect function instead of the dialer itself.

Fixes #2730, fixes #3109, and fixes #3663.
2018-05-09 11:55:14 -04:00
Roland Bracewell Shoemaker a5ac5fa078 Deprecate IPv6First feature flag (#3684) 2018-05-02 10:22:25 -07:00
Jacob Hoffman-Andrews d0a510664b Remove Timeout field from VA's http.Client. (#3661)
This field was set to singleDialTimeout, but the net/http library treats
it as covering all of dial, write headers, and read headers and body.
Since http01Dialer also uses singleDialTimeout, there's a race between
http01Dialer and net/http to see who will time out first. The result is
that sometimes we give "Timeout after connect" when the error really
should be "Timeout during connect." This issue also inhibits IPv6 to
IPv4 fallback, and tickles a data race that was causing a rare panic in
VA: https://github.com/letsencrypt/boulder/issues/3109.

After this change, the overall HTTP request will get the full deadline
allowed by the RPC context. The dialer will continue to use
singleDialTimeout for each of its two possible dial attempts.
2018-04-23 09:24:23 -04:00
Jacob Hoffman-Andrews a1b98d9163
Use a context when dialing TLS for TLS-SNI (#3648)
This allows us to have fast-running unittests without modifying the global state in singleDialTimeout,
which can become a const.

Fixes #3628.

Builds on top of #3629, review that first.
2018-04-16 15:06:56 -07:00
Jacob Hoffman-Andrews 339ea954bd Give more detailed validation errors in VA (#3629)
In particular, differentiate timeouts during connect (which are usually a firewall problem) from timeouts after connect (which are usually a software problem). In the process, refactor the tests and add testing for specific problem detail messages.

This also switches over the HTTP challenge's dialer to use DialContext, and to shave a little bit of headroom off of the context deadline, so that the dial can report its timeout before the overall context expires, which would lead to an overly generic "deadline exceeded" error, which would then get translated (incorrectly) into a "timeout after connect."

There is an additional error case, Timeout during %s (your server may be slow or overloaded), (where %s can be read or write) which doesn't have any unittests. I believe it may not be possible to trigger this, since read and write timeouts get subsumed by the HTTP or TLS library, but it's worth having as a fallback case. We'll see if it shows up in the logs.

Among the test refactorings, I shortened the timeout on the TLS timeout test to 50ms. Previously this was the long pole making the whole test take 10s. Now it takes ~500 ms overall.

I recommend starting review at https://github.com/letsencrypt/boulder/compare/detailed-va-errors?expand=1#diff-4c51d1d7ca3ec3022d14b42809af0d7eR671 (the changes to detailedError), then reviewing the Dial -> DialContext changes, then the tests.
2018-04-16 12:01:08 -07:00
Jacob Hoffman-Andrews 699c7e4c44 Add a DNS problem type. (#3625)
As specified in ACME. Also, include problem type in the stats.

Fixes #3613.
2018-04-09 12:21:02 -04:00
Roland Bracewell Shoemaker 8167abd5e3 Use internet facing appropriate histogram buckets for DNS latencies (#3616)
Also instead of repeating the same bucket definitions everywhere just use a single top level var in the metrics package in order to discourage copy/pasting.

Fixes #3607.
2018-04-04 08:01:54 -04:00
Jacob Hoffman-Andrews 11434650b7 Check safe browsing at validation time (#3539)
Right now we check safe browsing at new-authz time, which introduces a possible
external dependency when calling new-authz. This is usually fine, since most safe
browsing checks can be satisfied locally, but when requests have to go external,
it can create variance in new-authz timing.

Fixes #3491.
2018-03-09 11:15:05 +00:00
Daniel McCarney 28cc969814 Remove TLS-SNI-02 implementation. (#3516)
This code was never enabled in production. Our original intent was to
ship this as part of the ACMEv2 API. Before that could happen flaws were
identified in TLS-SNI-01|02 that resulted in TLS-SNI-02 being removed
from the ACME protocol. We won't ever be enabling this code and so we
might as well remove it.
2018-03-02 10:56:13 -08:00
Jacob Hoffman-Andrews 1fe8aa8128 Improve errors for DNS challenge (#3317)
Before this change, we would just log "Correct value not found for DNS challenge"
when we got a TXT record that didn't match what we expected. This was different
from the error when no TXT records were found at all, but viewing the error out of
context doesn't make that clear. This change improves the error to specifically say
that we found a TXT record, but it was the wrong one.

Also in this change: if we found multiple TXT records, we mention the number;
and we trim the length of the echoed TXT record.
2018-01-03 15:37:23 -05:00
Daniel McCarney de5fbbdb67
Implement CAA issueWild enforcement for wildcard names (#3266)
This commit implements RFC 6844's description of the "CAA issuewild
property" for CAA records.

We check CAA in two places: at the time of validation, and at the time
of issuance when an authorization is more than 8hours old. Both
locations have been updated to properly enforce issuewild when checking
CAA for a domain corresponding to a wildcard name in a certificate
order.

Resolves https://github.com/letsencrypt/boulder/issues/3211
2017-12-13 12:09:33 -05:00
Daniel McCarney 1c99f91733 Policy based issuance for wildcard identifiers (Round two) (#3252)
This PR implements issuance for wildcard names in the V2 order flow. By policy, pending authorizations for wildcard names only receive a DNS-01 challenge for the base domain. We do not re-use authorizations for the base domain that do not come from a previous wildcard issuance (e.g. a normal authorization for example.com turned valid by way of a DNS-01 challenge will not be reused for a *.example.com order).

The wildcard prefix is stripped off of the authorization identifier value in two places:

When presenting the authorization to the user - ACME forbids having a wildcard character in an authorization identifier.
When performing validation - We validate the base domain name without the *. prefix.
This PR is largely a rewrite/extension of #3231. Instead of using a pseudo-challenge-type (DNS-01-Wildcard) to indicate an authorization & identifier correspond to the base name of a wildcard order name we instead allow the identifier to take the wildcard order name with the *. prefix.
2017-12-04 12:18:10 -08:00
Daniel McCarney 55dd1020c0 Increase VA SingleDialTimeout to 10s. (#3260)
This PR changes the VA's singleDialTimeout value from 5 * time.Second to 10 * time.Second. This will give slower servers a better chance to respond, especially for the multi-VA case where n requests arrive ~simultaneously.

This PR also bumps the RA->VA timeout by 5s and the WFE->RA timeout by 5s to accommodate the increased dial timeout. I put this in a separate commit in case we'd rather deal with this separately.
2017-12-04 09:53:26 -08:00
Roland Bracewell Shoemaker 9da1bea433 Update histogram buckets for latencies that measure things over the internet (#3254)
Updates the buckets for histograms in the publisher, va, and expiration-mailer which are used to measure the latency of operations that go over the internet and therefore are liable to take a lot longer than the default buckets can measure. Uses a standard set of buckets for all three instead of attempting to tune for each one.

Fixes #3217.
2017-11-29 15:13:14 -08:00
Jacob Hoffman-Andrews 51991cd264 Fix logging of hostname in VA. (#3149)
The pbToAuthzMeta method in rpc/pb-marshalling.go only propagates ID and
registrationID, not hostname. So log the "domain" parameter instead.
2017-10-06 11:10:02 -07:00
Daniel McCarney 1794c56eb8 Revert "Add CAA parameter to restrict challenge type (#3003)" (#3145)
This reverts commit 23e2c4a836.
2017-10-04 12:00:44 -07:00
lukaslihotzki 23e2c4a836 Add CAA parameter to restrict challenge type (#3003)
This commit adds CAA `issue` paramter parsing and the `challenge` parameter to permit a single challenge type only. By setting `challenge=dns-01`, the nameserver keeps control over every issued certificate.
2017-10-02 11:59:47 -07:00
Jacob Hoffman-Andrews d2883f12c1 Remove TimingDuration call from VA (#3122)
Also switch over tests.

Fixes #3100
2017-09-28 14:25:22 -07:00
Daniel McCarney 966e02313f Forbid HTTP redirects to non-80/443 ports. (#3115)
Prior to this commit the VA would follow redirects from the initial
HTTP-01 challenge request on port 80 to any other port. In practice the
Let's Encrypt production environment has network egress firewall rules
that drop outbound requests that are not on port 80 or 443. In effect
this meant any challenge request that was redirected from 80 to a port
other than 80/443 was turned into a mysterious connection timeout error.

We have decided to preserve the egress firewall rule and continue to act
conservatively. Only port 80 and 443 should be allowed in redirects.

This commit updates the VA to return a clear error message when
a non-80/443 redirect is made.

To aid in testing/configuration the actual ports enforced are specified
by the va.httpPort and va.httpsPort that are used for the initial
outbound HTTP-01 connection.

The VA TestHTTPRedirectLookup unit test is updated accordingly to test
that a non-80/443 redirect fails with the expected message.

Resolves #3049
2017-09-25 10:19:10 -07:00
Jacob Hoffman-Andrews 568407e5b8 Remote VA logging and stats (#3063)
Add a logging statement that fires when a remote VA fail causes
overall failure. Also change remoteValidationFailures into a
counter that counts the same thing, instead of a histogram. Since
the histogram had the default bucket sizes, it failed to collect
what we needed, and produced more metrics than necessary.
2017-09-11 12:50:50 -07:00
Jacob Hoffman-Andrews 0d69b24fcc Move VA's CAA code into separate file (#3010)
va.go is quite a large file. This splits out the CAA-related code and tests into its own file for simplicity. This is a simple move; no code has been changed, and there is no package split.
2017-08-28 11:24:03 -07:00
Roland Bracewell Shoemaker 05d869b005 Rename DNSResolver -> DNSClient (#2878)
Fixes #639.

This resolves something that has bugged me for two+ years, our DNSResolverImpl is not a DNS resolver, it is a DNS client. This change just makes that obvious.
2017-07-18 08:37:45 -04:00
Daniel McCarney 2f53c202a6 Rename `validationTimeout`, document and clarify purpose. (#2866)
This commit renames the `validationTimeout` to `singleDialTimeout` and
adds a comment describing its purpose.
2017-07-12 11:35:14 -07:00
Roland Bracewell Shoemaker 77f1364e9a Return more detailed error for connection reset in va (#2860)
If we hit a `syscall.ECONNRESET` error return a more useful error than `Error getting validation data`, updates the `TestDetailedError` test to cover this case.

Fixes #2851.
2017-07-11 14:29:31 -07:00
Daniel McCarney 957a68c72b Fix HTTP-01 IPv6 to IPv4 fallback with fresh dialer per conn. (#2852)
The implementation of the dialer used by the HTTP01 challenge, constructed with `resolveAndConstructDialer`, used the same wrapped `net.Dialer` for both the initial IPv6 connection, and any subsequent IPv4 fallback connections. This caused the IPv4 fallback to never succeed for cases where the initial IPv6 connection expended the `validationTimeout`.

This commit updates the http01Dialer (newly renamed from `dialer` since it is in fact specific to HTTP01 challenges) to use a fresh dialer for each connection. To facilitate testing the http01Dialer maintains
a count of how many dialer instances it has constructed. We use this in a unit test to ensure the correct behaviour without a great deal of new mocking/interfaces.

Resolves #2770
2017-07-10 15:41:49 -04:00
Daniel McCarney 7120d72197 Improve error message for IPv6 failure with no IPv4 fallback. (#2844)
This commit improves the rather vague error message that was previously returned if an IPv6 challenge validation failed when IPv6First was enabled and there were no IPv4 addresses left to try as a fallback.

Resolves #2821
2017-07-05 15:54:30 -04:00
Jacob Hoffman-Andrews f710e574b3 Provide specific error for connection refused (#2843)
Fixes #2830.
2017-06-30 16:22:19 -07:00
Roland Bracewell Shoemaker 088b872287 Implement multi VA validation (#2802)
Adds basic multi-path validation functionality. A new method `performRemoteValidation` is added to `boulder-va` which is called if it is configured with a list of remote VA gRPC addresses. In this initial implementation the remote VAs are only used to check the validation result of the main VA, if all of the remote validations succeed but the local validation failed, the overall validation will still fail. Remote VAs use the exact same code as the local VA to perform validation. If the local validation succeeds then a configured quorum of the remote VA successes must be met in order to fully complete the validation.

This implementation assumes that metrics are collected from the remote VAs in order to have visibility into their individual validation latencies etc.

Fixes #2621.
2017-06-29 14:11:01 -07:00
Roland Bracewell Shoemaker 764658ab84 Remove CAA distributed resolver (#2804)
We never used it and it's been superseded by the multi-VA design.
2017-06-15 13:35:50 -04:00
Jacob Hoffman-Andrews f236ca522f Improve validation error messages. (#2791)
Previously, a lot of validations problems would give the message "Failed to
connect to X for ..." This was misleading because the issue was not always a
connection error, and when it was, it was valuable to distinguish between
connection refused and timeout. Also, for HTTP, this message would echo the
first URL in a redirect chain, when we really want the URL that failed.

Renames the misleading "parseHTTPConnError" and removes an inaccurate
check for temporary errors. It also eliminates the "detail" argument, instead
generating all messages inside the function.

Improves the handling of tls.alert errors to actually pass through the error
message, rather than just quietly changing the problem type (which was very
easy to miss).

Gives a specific error message for timeouts.

Preserves the URL from url.Error types and incorporates it into error messages.

Splits the HTTP timeout test into its own test case.
2017-06-12 10:35:22 -04:00
Jacob Hoffman-Andrews 0bfb542514 Use fields, not globals, for stats (#2790)
Following up on #2752, we don't need to use global vars for our Prometheus stats. We already have a custom registry plumbed through using Scope objects. In this PR, expose the MustRegister method of that registry through the Scope interface, and move existing global vars to be fields of objects. This should improve testability somewhat.

Note that this has a bit of an unfortunate side effect: two instances of the same stats-using class (e.g. VA) can't use the same Scope object, because their MustRegister calls will conflict. In practice this is fine since we never instantiate duplicates of the the classes that use stats, but it's something we should keep an eye on.

Updates #2733
2017-06-06 12:09:31 -07:00
Daniel c905cfb8db
Rewords IPv6 -> IPv4 fallback error messages. 2017-05-11 10:35:30 -04:00
Daniel McCarney 47452d6c6c Prefer IPv6 addresses, fall back to IPv4. (#2715)
This PR introduces a new feature flag "IPv6First". 

When the "IPv6First" feature is enabled the VA's HTTP dialer and TLS SNI
(01 and 02) certificate fetch requests will attempt to automatically
retry when the initial connection was to IPv6 and there is an IPv4
address available to retry with.

This resolves https://github.com/letsencrypt/boulder/issues/2623
2017-05-08 13:00:16 -07:00
David Calavera cc5ee3906b Refactor IsSane and IsSane* to return useful errors. (#2685)
This change changes the returning values from boolean to error.

It makes `checkConsistency` an internal function and removes the
optional argument in favor of making checks explicit where they are
used.

It also renames those functions to CheckConsistency* to not
give the impression of still returning boolean values.

Signed-off-by: David Calavera <david.calavera@gmail.com>
2017-04-19 12:08:47 -04:00
Jacob Hoffman-Andrews d849f58cec Fix "valiation typo in VA. (#2676) 2017-04-12 11:12:22 -07:00
Jacob Hoffman-Andrews 4b665e35a6 Use Prometheus stats for VA, WFE, and OCSP Responder (#2628)
Rename HTTPMonitor to MeasuredHandler.
Remove inflight stat (we didn't use it).
Add timing stat by method, endpoint, and status code.
The timing stat subsumes the "rate" stat, so remove that.
WFE now wraps in MeasuredHandler, instead of relying on its cmd/main.go.
Remove FBAdapter stats.
MeasuredHandler tracks stats by method, status code, and endpoint.

In VA, add a Prometheus histogram for validation timing.
2017-04-03 17:03:04 -07:00
Roland Bracewell Shoemaker e2b2511898 Overhaul internal error usage (#2583)
This patch removes all usages of the `core.XXXError` and almost all usages of `probs` outside of the WFE and VA and replaces them with a unified internal error type. Since the VA uses `probs.ProblemDetails` quite extensively in challenges, and currently stores them in the DB I've saved this change for another change (it'll also require a migration). Since `ProblemDetails` should only ever be exposed to end-users all of its related logic should be moved into the `WFE` but since it still needs to be exposed to the VA and SA I've left it in place for now.

The new internal `errors` package offers the same convenience functions as `probs` does as well as a new simpler type testing method. A few small changes have also been made to error messages, mainly adding the library and function name to internal server errors for easier debugging (i.e. where a number of functions return the exact same errors and there is no other way to distinguish which method threw the error).

Also adds proper encoding of internal errors transferred over gRPC (the current encoding scheme is kept for `core` and `probs` errors since it'll be ideally be removed after we deploy this and follow-up changes) using `grpc/metadata` instead of the gRPC status codes.

Fixes #2507. Updates #2254 and #2505.
2017-03-22 23:27:31 -07:00
David Calavera c71c3cff80 Implement TLS-SNI-02 challenge validations. (#2585)
I think these are all the necessary changes to implement TLS-SNI-02 validations, according to the section 7.3 of draft 05:

https://tools.ietf.org/html/draft-ietf-acme-acme-05#section-7.3

I don't have much experience with this code, I'll really appreciate your feedback.

Signed-off-by: David Calavera <david.calavera@gmail.com>
2017-03-22 10:17:59 -07:00
Daniel McCarney e81f7477a3 Fixes outdated IPv6 TODO on `getAddr`. (#2601)
The VA's `getAddr` function prior to this commit had an outdated comment
& a pointer to a TODO for Boulder Issue #593. That issue has been closed
and bdns' `LookupHost` supports AAAA records now. This commit updates
the comment to match the current behaviour and removes the TODO.
2017-03-09 13:20:03 -05:00
Daniel McCarney 3fa950ac58 Improve VA TLS-SNI-01 challenge failure error. (#2527)
Previous to this PR the VA's validateTLSWithZName function would
return an error message containing the SAN names of the leaf certificate
when the validation failed. This commit updates that message to include
the Subject Common Name of the leaf cert in addition to the SANs. The
names are deduplicated to prevent listing a Subj CN twice if its also
a SAN. This will help debug cases where a cert with no SANs is returned
by the server.

In addition, the number of certificates in the chain received from the
server is included in the message. This will hopefully further help
users identify misconfiguration since a TLS SNI 01 challenge response
should have a chain length of 1.

Resolves #2468
2017-01-27 10:05:42 -08:00
Roland Bracewell Shoemaker 170e37c675 Add a special error message if we are trying to talk TLS to a HTTP-only server (#2511)
If the VA fails to validate a TLS-SNI-01 challenge because it is trying to talk TLS to a HTTP-only server return a special error message that is slightly more informative.
2017-01-20 11:36:39 -05:00