This PR reverts 27d531101f and undoes the merge of the `pendingAuthorizations` and `authz` table. This change had unintended performance impacts on the `CountPendingAuthorizations` query that exacerbated load issues and need to be addressed.
This PR reworks the validateEmail() function from the RA to allow timeouts during DNS validation of MX/A/AAAA records for an email to be non-fatal and match our intention to verify emails best-effort.
Notes:
bdns/problem.go - DNSError.Timeout() was changed to also include context cancellation and timeout as DNS timeouts. This matches what DNSError.Error() was doing to set the error message and supports external callers to Timeout not duplicating the work.
bdns/mocks.go - the LookupMX mock was changed to support always.error and always.timeout in a manner similar to the LookupHost mock. Otherwise the TestValidateEmail unit test for the RA would fail when the MX lookup completed before the Host lookup because the error wouldn't be correct (empty DNS records vs a timeout or network error).
test/config/ra.json, test/config-next/ra.json - the dnsTries and dnsTimeout values were updated such that dnsTries * dnsTimeout was <= the WFE->RA RPC timeout (currently 15s in the test configs). This allows the dns lookups to all timeout without the overall RPC timing out.
Resolves#2260.
The RA performs an RPC to the SA's `GetValidAuthorizations` function when attempting to find existing valid authorizations to reuse. Prior to this commit, ff the RPC fails (e.g. due to a timeout) the calling code
logs the failure as a warning but fails to return the error and cease processing. This results in a nil panic when we later try to index `auths`
This commit inserts the missing `return` to ensure we don't process further, thereby resolving #2274.
A test for this fix is provided with `TestReuseAuthorizationFaultySA`. Without f52f340 applied this test recreates the panic observed in #2274 and produces:
```
go test -p 1 -v -race --test.run TestReuseAuthorizationFaultySA github.com/letsencrypt/boulder/ra
=== RUN TestReuseAuthorizationFaultySA
--- FAIL: TestReuseAuthorizationFaultySA (0.04s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal 0xb code=0x1 addr=0x20 pc=0x4be2b8]
```
With f52f340 it passes. Yay!
Fixes#503.
Functionality is gated by the feature flag `AllowKeyRollover`. Since this functionality is only specified in ACME draft-03 and we mostly implement the draft-02 style this takes some liberties in the implementation, which are described in the updated divergences doc. The `key-change` resource is used to side-step draft-03 `url` requirement.
To remove challenges with expired/pending authz's when they are deleted we want to introduce a foreign key relationship to the challenges table's authorizationID field with instruction to cascade on delete (#2155). As pointed out in a comment this is made difficult by the current usage of a separate pendingAuthorizations table for pending authorizations.
To be able to remove the pendingAuthorizations table entirely (#2163) we need to first stop using it. This PR introduces the code changes required to achieve this.
Notes:
The SA's NewPendingAuthorization function was updated to store all new pending auths in the authz table and to ensure the status is StatusPending.
The SA's GetAuthorization, UpdatePendingAuthorization, FinalizeAuthorization, and RevokeAuthorizationsByDomain functions were updated to properly handle the fact that a pending authz could be in either the pendingAuthorizations table, or the authz table, and to do the right thing accordingly.
Several places in the RA unit tests created a pending authorization with a status "Valid", then finalized it later. This broke when NewPendingAuthorization was changed to enforce Pending status before creating the authz row since the FinalizeAuthorization code expected to only finalize Valid rows. To fix this some of the RA unit tests were changed to explicitly set status to Valid before calling FinalizeAuthorization. This matches the true intention of the tests to quickly create a pending & then finalized authorization.
The expired-authz-purger utility was updated to purge from both the pendingAuthorizations and authz table as required.
The return values of RevokeAuthorizationsByDomain have changed slightly. Previously it returned a 2 element array where the first element was the number of pending authorizations revoked and the second element was the number of finalized authorizations revoked. This is changed so that now it is the number of rows from the pendingAuthorizations and authz tables respectively. E.g. the second count for the authz table may now include non-finalized authzs in its count of affected rows. The admin-revoker is the only place that used this SA method and it was updated appropriately to describe the "rows" change.
The "purger" database user needs to have a new GRANT SELECT, DELETE for the authz table in addition to its existing GRANT for the pendingAuthorizations table.
This resolves#2162
Updates #1699.
Adds a new package, `features`, which exposes methods to set and check if various internal features are enabled. The implementation uses global state to store the features so that services embedded in another service do not each require their own features map in order to check if something is enabled.
Requires a `boulder-tools` image update to include `golang.org/x/tools/cmd/stringer`.
Part of #2080.
This change vendors `crypto/x509`, `crypto/x509/pkix`, and `encoding/asn1` from 1d5f6a765d. That commit is a direct child of the Go 1.5.4 release tag, so it contains the same code as the current Go version we are using. In that commit I rewrote imports in those packages so they depend on each other internally rather than calling out to the standard library, which would cause type disagreements.
I changed the imports in each place where we're parsing CSRs, and imported under a different name `oldx509`, both to avoid collisions and make it clear what's going on. Places that only use `x509` to parse certificates are not changed, and will use the current standard library.
This will unblock us from moving to Go 1.6, and subsequently Go 1.7.
Introduces the `authorizationLifetimeDays` and `pendingAuthorizationLifetimeDays` configuration options for `RA`.
If the values are missing from configuration, the code defaults back to the current values (300/7 days).
fixes#2024
This PR adds two optimizations to fix the optimistic lock errors observed in #1986.
First, the WFE now returns early for registration POST's (before invoking the RA and SA) when the POST body is the trivial update (`{"resource":"reg"}`). This prevents any DB operations from being performed when there is no work to be done.
Second, the RA now tracks whether a update actually changes the base registration's `Contact` slice, or `Agreement` string. If the proposed update doesn't change either of these fields then the RA will return early before handing the update to the SA.
Both changes save database operations from being performed needlessly and will help avoid the optimistic lock errors we observed when a problematic client was POSTing the trivial update repeatedly in a short period.
The fix was verified as follows: I checked out master and artificially introduced lock contention into the SA by adding a 2s sleep into `UpdateRegistration` between fetching the `existingRegModel` to get the `LockCol` value and calling `ssa.dbMap.Update`. With the sleep in place & two certbot clients posting matching registration updates the lock contention error is produced as expected. After checking out the `empty-reg-updates` branch, re-adding the sleep to the SA, and performing the same two client reg updates no error is produced.
The RA test uses a `DummyValidationAuthority` which is called from `ra.UpdateAuithorization` in a goroutine. Both the tester and the spawned goroutine want to mutate/read fields on the `DummyValidationAuthority` which causes a race. This PR fixes the race by blocking on a `chan struct{}` until the `DummyValidationAuthority` has mutated the required state.
Fixes#1980.
https://github.com/letsencrypt/boulder/pull/1995
This PR replaces the `x/net/publicsuffix` package with `weppos/publicsuffix-go`.
The conversations that leaded to this decision are #1479 and #1374. To summarize the discussion, the main issue with `x/net/publicsuffix` is that the package compiles the list into the Go source code and doesn't provide a way to easily pull updates (e.g. by re-parsing the original PSL) unless the entire package is recompiled.
The PSL update frequency is almost daily, which makes very hard to recompile the official Golang package to stay up-to-date with all the changes. Moreover, Golang maintainers expressed some concerns about rebuilding and committing changes with a frequency that would keep the package in sync with the original PSL. See https://github.com/letsencrypt/boulder/issues/1374#issuecomment-182429297
`weppos/publicsuffix-go` contains a compiled version of the list that is updated weekly (or more frequently). Moreover, the package can read and parse a PSL from a String or a File which will effectively decouple the Boulder source code with the list itself. The main benefit is that it will be possible to update the definition by simply downloading the latest list and restarting the application (assuming the list is persisted in memory).
Boulder uses MalformedRequestError as a universal error. This pull request adds the RejectedIdentifierError and use it on a blacklist error.
For a client implementation, it is easier and cleaner to use an exception than parse the error message.
1336c42813/policy/pa.go (L131)Fixes#1938
PR in acme : ietf-wg-acme/acme#142
This PR adds a expvar.Int published under the key "lastIssuance" that contains the timestamp of the last successful certificate issuance. This allows easy creation of a script that monitors the RA debug server (port 8002) to ensure that there has been a successful issuance within a set period (e.g. last five minutes).
The underlying expvar.Int code uses the atomic package to ensure safe updates/reads across multiple goroutines.
This resolves#1945 and was selected in place of the more complex circular bucket design. While the timestamp approach doesn't provide the issuance volume as readily it is less complex and meets the immediate need of a reliable external monitoring process hook.
https://github.com/letsencrypt/boulder/pull/1982
It looks like the filesystem on Mac OSX El Captain
(and potentially previous versions) require at least
one second before you get a different timestamp.
Any value lower than one second caused:
- a deadlock error in a test in the reloader module
- a rate limit error in the ra module
See GH-1955.
This PR, adds a check in registration authority for non-ASCII encoded characters in an email address. This is due to a 'funky email implementation'.
Fixes#1350
In https://github.com/letsencrypt/boulder/pull/774 we introduced and account key stored with the challenge. This was a stopgap fix to the now-defunct SimpleHTTP and DNS challenges in the face of https://mailarchive.ietf.org/arch/msg/acme/F71iz6qq1o_QPVhJCV4dqWf-4Yc. However, we no longer offer or implement those challenges, so the extra field is unnecessary. It also take up a huge amount of space in the challenges table, which is our biggest table. SimpleHTTP and DNS challenges were removed in https://github.com/letsencrypt/boulder/pull/1247.
We can provide a follow-up migration to delete the column later, once we have a plan for large migrations without downtime.
Fixes#1909
As reported in #1925 the Certificates per Domain rate limit was being
incorrectly enforced on certificate renewals for FQDN sets that have
been previously issued. This is counter to the described rate
limit policies[0] that detail a separate rate limit for certificates
issued for the "exact same set of Fully Qualified Domain Names".
The bug was caused by the result of `domainsForRateLimiting` overwriting
the original `names []string` provided to the RA's
`checkCertificatesPerNameLimit` function. This meant instead of looking
for an existing FQDN set for the full set of domain names being
requested we checked for an FQDN set for just the eTLD+1's of the
domains. (e.g "www.example.com, foo.example.com, bar.example.com" vs
"example.com").
This commit preserves the original `names` values for doing an FQDN set
lookup and uses the `tldNames` from `domainsForRateLimiting` elsewhere.
This fixes#1925.
A test is added to ensure that `checkCertificatesPerNameLimit` does the
correct thing both with and without an existing FQDN set.
[0] https://community.letsencrypt.org/t/rate-limits-for-lets-encrypt/6769
The RA UpdateRegistration function merges a base registration object with an update by calling Registration.MergeUpdate. Prior to this commit MergeUpdate only allowed the updated registration object to overwrite the Contact field of the existing registration if the updated reg. defined at least one AcmeURL. This prevented clients from being able to outright remove the contact associated with an existing registration.
This commit removes the len() check on the input.Contact in MergeUpdate to allow the r.Contact field to be overwritten by a []*core.AcmeURL(nil) Contact field. Subsequently clients can now send an empty contacts list in the update registration POST in order to remove their reg contact.
Fixes#1846
* Allow removing registration contact.
* Adds a test for `MergeUpdate` contact removal.
* Change `Registration.Contact` type to `*[]*core.AcmeURL`.
* End validateContacts early for empty contacts
* Test removing reg. contact more thoroughly.
Presently clients may request a new AuthZ be created for a domain that they have already proved authorization over. This results in unnecessary bloat in the authorizations table and duplicated effort.
This commit alters the `NewAuthorization` function of the RA such that before going through the work of creating a new AuthZ it checks whether there already exists a valid AuthZ for the domain/regID that expires in more than 24 hours from the current date. If there is, then we short circuit creation and return the existing AuthZ. When this case occurs the `RA.ReusedValidAuthz` counter is incremented to provide visibility.
Since clients requesting a new AuthZ and getting an AuthZ back expect to turn around and post updates to the corresponding challenges we also return early in `UpdateAuthorization` when asked to update an AuthZ that is already valid. When this case occurs the `RA.ReusedValidAuthzChallenge` counter is incremented.
All of the above behaviour is gated by a new RA config flag `reuseValidAuthz`. In the default case (false) the RA does **not** reuse any AuthZ's and instead maintains the historic behaviour; always creating a new AuthZ when requested, irregardless of whether there are already valid AuthZ's that could be reused. In the true case (enabled only in `boulder-config-next.json`) the AuthZ reuse described above is enabled.
Resolves#1854
Resolves#1810 by automatically updating the RA ratelimit.RateLimitConfig whenever the backing config file is changed. Much like the Policy Authority uses a reloader instance to support updating the Hostname policy on the fly, this PR changes the Registration Authority to use a reloader for the rate limit policy file.
Access to the ra.rlPolicies member is protected with a RWMutex now that there is a potential for the values to be reloaded while a reader is active.
A test is introduced to ensure that writing a new policy YAML to the policy config file results in new values being set in the RA's rlPolicies instance.
https://github.com/letsencrypt/boulder/pull/1894