This commit implements RFC 6844's description of the "CAA issuewild
property" for CAA records.
We check CAA in two places: at the time of validation, and at the time
of issuance when an authorization is more than 8 hours old. Both
locations have been updated to properly enforce issuewild when checking
CAA for a domain corresponding to a wildcard name in a certificate
order.
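The selection logic can be sketched roughly as follows. This is a simplified, hypothetical illustration of RFC 6844's property-selection rule (the type and function names are invented for this sketch, not Boulder's actual code): wildcard names are governed by "issuewild" records when any are present, falling back to "issue" records otherwise.

```go
package main

import (
	"fmt"
	"strings"
)

// caaRecord is a simplified stand-in for a parsed CAA resource record.
type caaRecord struct {
	Tag   string // "issue" or "issuewild"
	Value string // permitted issuer domain
}

// relevantRecords picks the CAA records that govern issuance for name.
// Per RFC 6844, wildcard names are governed by "issuewild" records when
// any are present, and fall back to "issue" records otherwise.
func relevantRecords(name string, records []caaRecord) []caaRecord {
	wanted := "issue"
	if strings.HasPrefix(name, "*.") {
		wanted = "issuewild"
		// Fall back to "issue" if no "issuewild" records exist.
		found := false
		for _, r := range records {
			if r.Tag == "issuewild" {
				found = true
				break
			}
		}
		if !found {
			wanted = "issue"
		}
	}
	var out []caaRecord
	for _, r := range records {
		if r.Tag == wanted {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	records := []caaRecord{
		{"issue", "letsencrypt.org"},
		{"issuewild", "other-ca.example"},
	}
	// Only the issuewild record governs the wildcard name.
	fmt.Println(relevantRecords("*.example.com", records))
}
```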
Resolves https://github.com/letsencrypt/boulder/issues/3211
This commit adds the "new-nonce" endpoint to the WFE2. A small unit test
is included. The existing /directory unit tests are updated for the new
endpoint.
When we implemented the new-order issuance flow for the WFE2 we forgot to include the endpoint in the /directory object. This commit adds it and updates associated tests.
Now, rather than LIMIT / OFFSET, this uses the highest id from the last batch in each new batch's query. This makes efficient use of the index, and means the database does not have to scan over a large number of non-expired rows before starting to find any expired rows.
This also changes the structure of the purge function to continually push ids for deletion onto a channel, to be processed by goroutines consuming that channel.
Also, remove the --yes flag and prompting.
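The keyset-pagination pattern described above can be sketched like this. A sorted slice stands in for the database table, and the deletion workers consume ids from a channel; the function and variable names are illustrative, not the purger's actual identifiers. In the real purger the batch fetch would be a query along the lines of `SELECT id FROM ... WHERE id > ? AND expires <= ? ORDER BY id LIMIT ?`.

```go
package main

import (
	"fmt"
	"sync"
)

// fetchBatch returns up to batchSize ids strictly greater than afterID,
// standing in for a keyset-paginated SQL query. Because each batch
// starts from the previous batch's highest id, the "table" never has to
// be scanned from the beginning (unlike LIMIT/OFFSET).
func fetchBatch(ids []int64, afterID int64, batchSize int) []int64 {
	var batch []int64
	for _, id := range ids {
		if id > afterID {
			batch = append(batch, id)
			if len(batch) == batchSize {
				break
			}
		}
	}
	return batch
}

func main() {
	allExpired := []int64{3, 7, 12, 20, 21, 33, 40}
	work := make(chan int64)
	var wg sync.WaitGroup
	var mu sync.Mutex
	var deleted []int64

	// Deletion workers consume ids pushed onto the channel.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range work {
				mu.Lock()
				deleted = append(deleted, id)
				mu.Unlock()
			}
		}()
	}

	afterID := int64(0)
	for {
		batch := fetchBatch(allExpired, afterID, 3)
		if len(batch) == 0 {
			break
		}
		for _, id := range batch {
			work <- id
		}
		afterID = batch[len(batch)-1]
	}
	close(work)
	wg.Wait()
	fmt.Println(len(deleted)) // 7
}
```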
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.
Specifically, it adds two convenience functions, one for clients and one for servers, that makes the necessary metrics object and registers it. We run these in the main function of each server.
I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.
Also, update go-grpc-prometheus to get the necessary methods.
```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
If two requests simultaneously update a challenge for the same
authorization there is a chance that `UpdatePendingAuthorization` will
encounter a Gorp optimistic lock error having read one LockCol value
from a `Select` on the `pendingAuthorizations` table only for it to have
changed by the time an `Update` on the same row is performed.
After closer examination this `Update` is unnecessary! Only
`RA.UpdateAuthorization` calls `SA.UpdatePendingAuthorization` and it
does so only to record updated challenge information by way of
`UpdatePendingAuthorization`'s call to `updateChallenges`. Since no data
in the `pendingAuthorizations` row is being changed we don't need to do
this `Update` at all, avoiding a potential race condition and saving
some database load.
This commit removes the `Update` entirely. Several SA unit tests had to
be updated because they were (ab)using `UpdatePendingAuthorization` to
mutate pendingAuthz rows.
This commit adds a new rate limit to restrict the number of outstanding
pending orders per account. If the threshold for this rate limit is
crossed subsequent new-order requests will return a 429 response.
Note: Since the rate limit object itself defines an `Enabled()` test
based on whether or not it has been configured, there is **not**
a feature flag for this change.
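The configuration-implies-enabled pattern can be sketched like this. This is a hypothetical simplification (the struct and function names are invented here, not Boulder's actual types): the limit is simply considered disabled when no threshold has been configured, so no separate feature flag is needed.

```go
package main

import "fmt"

// RateLimitPolicy is enabled exactly when a threshold is configured.
type RateLimitPolicy struct {
	Threshold int64
}

func (r RateLimitPolicy) Enabled() bool {
	return r.Threshold != 0
}

// checkPendingOrders rejects a new-order request when the account has
// too many outstanding pending orders; the error would surface to the
// client as an HTTP 429 response.
func checkPendingOrders(limit RateLimitPolicy, outstanding int64) error {
	if !limit.Enabled() {
		return nil
	}
	if outstanding >= limit.Threshold {
		return fmt.Errorf("too many outstanding pending orders")
	}
	return nil
}

func main() {
	fmt.Println(checkPendingOrders(RateLimitPolicy{}, 1000))          // <nil>: limit not configured
	fmt.Println(checkPendingOrders(RateLimitPolicy{Threshold: 3}, 5)) // error: over the threshold
}
```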
Resolves https://github.com/letsencrypt/boulder/issues/3246
This PR implements issuance for wildcard names in the V2 order flow. By policy, pending authorizations for wildcard names only receive a DNS-01 challenge for the base domain. We do not re-use authorizations for the base domain that do not come from a previous wildcard issuance (e.g. a normal authorization for example.com turned valid by way of a DNS-01 challenge will not be reused for a *.example.com order).
The wildcard prefix is stripped off of the authorization identifier value in two places:
* When presenting the authorization to the user - ACME forbids having a wildcard character in an authorization identifier.
* When performing validation - We validate the base domain name without the *. prefix.
This PR is largely a rewrite/extension of #3231. Instead of using a pseudo-challenge-type (DNS-01-Wildcard) to indicate that an authorization & identifier correspond to the base name of a wildcard order name, we instead allow the identifier to take the wildcard order name with the *. prefix.
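The prefix stripping itself is trivial; a minimal sketch (the function name is invented for illustration) of the base-domain derivation used in both places:

```go
package main

import (
	"fmt"
	"strings"
)

// baseDomain strips the wildcard prefix, if any, from an order
// identifier. The result is what appears in the authorization
// identifier and what is actually validated via DNS-01.
func baseDomain(identifier string) string {
	return strings.TrimPrefix(identifier, "*.")
}

func main() {
	fmt.Println(baseDomain("*.example.com")) // example.com
	fmt.Println(baseDomain("example.com"))   // example.com (unchanged)
}
```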
This PR changes the VA's singleDialTimeout value from 5 * time.Second to 10 * time.Second. This will give slower servers a better chance to respond, especially for the multi-VA case where n requests arrive ~simultaneously.
This PR also bumps the RA->VA timeout by 5s and the WFE->RA timeout by 5s to accommodate the increased dial timeout. I put this in a separate commit in case we'd rather deal with this separately.
Updates the buckets for histograms in the publisher, va, and expiration-mailer which are used to measure the latency of operations that go over the internet and therefore are liable to take a lot longer than the default buckets can measure. Uses a standard set of buckets for all three instead of attempting to tune for each one.
Fixes #3217.
As described in #3201 a concurrent challenge POST would result in 500 errors if the pending authz row was deleted by a promotion to the authz table underneath another request.
This PR adjusts the SA & RA so that if a pending authz is promoted to a final authz between updates a not found error will be returned instead of a server internal error.
Resolves https://github.com/letsencrypt/boulder/issues/3201
* Move Boulder CPU panel over
* Add basic expiry mailer stats
* Add HSM signatures panel
* Add page splits panel
* Add CT submissions panel, with old and new metrics for now
Makes a couple of changes:
* Change `SubmitToCT` to make submissions to each log in parallel instead of in serial, this prevents a single slow log from eating up the majority of the deadline and causing submissions to other logs to fail
* Remove the 'submissionTimeout' field on the publisher, since submission time is actually bounded by the gRPC timeout and the separate field is misleading
* Add a timeout to the CT client's internal HTTP client so that when log servers hang indefinitely we actually do retries instead of just using the entire submission deadline. Currently set at 2.5 minutes
Fixes #3218.
When counting certificates for rate limiting, we attempted to impose a limit on
the query results to ensure we did not receive so many results that they caused
slowness on the database or SA side. However, that check has never actually been
executed correctly. The check was fixed in #3126, but rolling out that fix
broke issuance for subscribers with rate limit overrides that have allowed them
to exceed the limit.
Because this limit has not been needed in practice over the years, remove it
rather than refining it. The size of the results is loosely governed by our
rate limits (and overrides), and if result sizes from this query become a
performance issue in the future, we can address it then. For now, opt for
simplification.
Fixes #3214.
Pin the version of Prometheus to `1.8.2` because the Prometheus folks have just released v2 and switched the `:latest` tag to it.
Using Prometheus v2 produces an error:
```
prometheus: error: unknown short flag '-c'
```
In #1864 we discussed possible
optimizations to how expiration-mailer and ocsp-updater query the
certificateStatus table. In #2177 we
added the notAfter and isExpired fields for more efficient querying.
However, we forgot to add indexes on these fields. This change adds new indexes
and drops the old indexes, and should result in much more efficient querying in
those two components.
Also, remove a comment that goose couldn't understand.
Running EXPLAINs to show the difference:
For expiration-mailer, before:
```
MariaDB [boulder_sa_integration]> EXPLAIN SELECT cs.serial FROM certificateStatus AS cs WHERE cs.notAfter > DATE_ADD(NOW(), INTERVAL 21 DAY) AND cs.notAfter < DATE_ADD(NOW(), INTERVAL 10 DAY) AND cs.status != "revoked" AND COALESCE(TIMESTAMPDIFF(SECOND, cs.lastExpirationNagSent, cs.notAfter) > 10 * 86400, 1) ORDER BY cs.notAfter ASC LIMIT 100000;
+------+-------------+-------+------+------------------------------+------+---------+------+------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+------+------------------------------+------+---------+------+------+-----------------------------+
| 1 | SIMPLE | cs | ALL | status_certificateStatus_idx | NULL | NULL | NULL | 486 | Using where; Using filesort |
+------+-------------+-------+------+------------------------------+------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)
```
For expiration-mailer, after:
```
MariaDB [boulder_sa_integration]> EXPLAIN SELECT cs.serial FROM certificateStatus AS cs WHERE cs.notAfter < DATE_ADD(NOW(), INTERVAL 21 DAY) AND cs.notAfter < DATE_ADD(NOW(), INTERVAL 10 DAY) AND cs.status != "revoked" AND COALESCE(TIMESTAMPDIFF(SECOND, cs.lastExpirationNagSent, cs.notAfter) > 10 * 86400, 1) ORDER BY cs.notAfter ASC LIMIT 100000;
+------+-------------+-------+-------+---------------+--------------+---------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------+--------------+---------+------+------+------------------------------------+
| 1 | SIMPLE | cs | range | notAfter_idx | notAfter_idx | 6 | NULL | 1 | Using index condition; Using where |
+------+-------------+-------+-------+---------------+--------------+---------+------+------+------------------------------------+
```
For ocsp-updater, before:
```
MariaDB [boulder_sa_integration]> EXPLAIN SELECT cs.serial, cs.status, cs.revokedDate, cs.notAfter FROM certificateStatus AS cs WHERE cs.ocspLastUpdated > DATE_SUB(NOW(), INTERVAL 10 DAY) AND cs.ocspLastUpdated < DATE_SUB(NOW(), INTERVAL 3 DAY) AND NOT cs.isExpired ORDER BY cs.ocspLastUpdated ASC LIMIT 100000;
+------+-------------+-------+-------+---------------------------------------+---------------------------------------+---------+------+------+------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+---------------------------------------+---------------------------------------+---------+------+------+------------------------------------+
| 1 | SIMPLE | cs | range | ocspLastUpdated_certificateStatus_idx | ocspLastUpdated_certificateStatus_idx | 5 | NULL | 1 | Using index condition; Using where |
+------+-------------+-------+-------+---------------------------------------+---------------------------------------+---------+------+------+------------------------------------+
1 row in set (0.00 sec)
```
For ocsp-updater, after:
```
MariaDB [boulder_sa_integration]> EXPLAIN SELECT cs.serial, cs.status, cs.revokedDate, cs.notAfter FROM certificateStatus AS cs WHERE cs.ocspLastUpdated > DATE_SUB(NOW(), INTERVAL 10 DAY) AND cs.ocspLastUpdated < DATE_SUB(NOW(), INTERVAL 3 DAY) AND NOT cs.isExpired ORDER BY cs.ocspLastUpdated ASC LIMIT 100000;
+------+-------------+-------+-------+-------------------------------+-------------------------------+---------+------+------+-----------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-------+-------+-------------------------------+-------------------------------+---------+------+------+-----------------------+
| 1 | SIMPLE | cs | range | isExpired_ocspLastUpdated_idx | isExpired_ocspLastUpdated_idx | 7 | NULL | 1 | Using index condition |
+------+-------------+-------+-------+-------------------------------+-------------------------------+---------+------+------+-----------------------+
1 row in set (0.00 sec)
```
We started running our DB migrations in the background to speed up CI. However,
the semantics of subprocesses and `wait` mean that if a migration fails, the
overall `create_db.sh` doesn't fail. That means, for instance, tests continue to
run, and it's hard to find the resulting error.
This change runs the migrations in serial again so that we can catch such errors
more easily.
This pulls in multilog support (logs sharded by date). As a result,
it also pulls in new dependencies gogo/protobuf (for UnmarshalText) and
golang/protobuf/ptypes (for Timestamp).
Replaces #3202, adding a smaller set of dependencies. See also #3205.
Tests run:
```
$ go test github.com/gogo/protobuf/proto github.com/golang/protobuf/ptypes/... github.com/google/certificate-transparency-go/...
ok github.com/gogo/protobuf/proto 0.063s
ok github.com/golang/protobuf/ptypes 0.009s
? github.com/golang/protobuf/ptypes/any [no test files]
? github.com/golang/protobuf/ptypes/duration [no test files]
? github.com/golang/protobuf/ptypes/empty [no test files]
? github.com/golang/protobuf/ptypes/struct [no test files]
? github.com/golang/protobuf/ptypes/timestamp [no test files]
? github.com/golang/protobuf/ptypes/wrappers [no test files]
ok github.com/google/certificate-transparency-go 1.005s
ok github.com/google/certificate-transparency-go/asn1 0.021s
ok github.com/google/certificate-transparency-go/client 22.034s
? github.com/google/certificate-transparency-go/client/ctclient [no test files]
ok github.com/google/certificate-transparency-go/fixchain 0.145s
? github.com/google/certificate-transparency-go/fixchain/main [no test files]
ok github.com/google/certificate-transparency-go/fixchain/ratelimiter 27.745s
ok github.com/google/certificate-transparency-go/gossip 0.772s
? github.com/google/certificate-transparency-go/gossip/main [no test files]
ok github.com/google/certificate-transparency-go/jsonclient 25.523s
ok github.com/google/certificate-transparency-go/merkletree 0.004s
? github.com/google/certificate-transparency-go/preload [no test files]
? github.com/google/certificate-transparency-go/preload/dumpscts/main [no test files]
? github.com/google/certificate-transparency-go/preload/main [no test files]
ok github.com/google/certificate-transparency-go/scanner 0.010s
? github.com/google/certificate-transparency-go/scanner/main [no test files]
ok github.com/google/certificate-transparency-go/tls 0.026s
ok github.com/google/certificate-transparency-go/x509 0.417s
? github.com/google/certificate-transparency-go/x509/pkix [no test files]
? github.com/google/certificate-transparency-go/x509util [no test files]
```
* Remove non-TLS support from mailer entirely
* Add a config option for trusted roots in expiration-mailer. If unset, it defaults to the system roots, so this does not need to be set in production.
* Use TLS in mail-test-srv, along with an internal root and localhost certificates signed by that root.
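The trusted-roots fallback described in the second bullet might look roughly like this (the function name and config-field plumbing are hypothetical): if a root file is configured it is loaded into a cert pool, otherwise `RootCAs` is left nil so crypto/tls falls back to the system roots.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// mailerTLSConfig builds a TLS config for the SMTP connection. When no
// root file is configured, RootCAs stays nil and crypto/tls uses the
// system roots, so production needs no extra configuration.
func mailerTLSConfig(rootFile string) (*tls.Config, error) {
	conf := &tls.Config{}
	if rootFile != "" {
		pem, err := os.ReadFile(rootFile)
		if err != nil {
			return nil, err
		}
		pool := x509.NewCertPool()
		if !pool.AppendCertsFromPEM(pem) {
			return nil, fmt.Errorf("no certificates parsed from %s", rootFile)
		}
		conf.RootCAs = pool
	}
	return conf, nil
}

func main() {
	conf, err := mailerTLSConfig("") // unset: fall back to system roots
	fmt.Println(conf.RootCAs == nil, err) // true <nil>
}
```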
Fixes https://github.com/letsencrypt/boulder/issues/3205.
Previously, we would only move aside Godeps.json before running `godep save ./...`. However, in order to get a true picture of what is needed, we must also remove the existing `vendor/` directory.
This change also removes some unnecessary dependencies that have piled up over the years, generally test dependencies. Godep used to vendor such dependencies but no longer does.
In #3167 I removed the code that would use this, but forgot to remove the
exporting code. This follows up on that. We don't currently use this for
monitoring, and it's easier to get the current flags from a config file.
In #3167 I removed expvar, thinking it was unused, but it turns out the RA
exports the last issuance time, and core/util.go has a function to export
BuildID, both of which are used in monitoring. This wasn't caught at compile
time because the global expvar package was happy to register the exports even
though there was no handler to serve them.
This PR implements order finalization for the ACME v2 API.
In broad strokes this means:
* Removing the CSR from order objects & the new-order flow
* Adding identifiers to the order object & new-order
* Providing a finalization URL as part of orders returned by new-order
* Adding support to the WFE's Order endpoint to receive finalization POST requests with a CSR
* Updating the RA to accept finalization requests and to ensure orders are fully validated before issuance can proceed
* Updating the SA to allow finding order authorizations & updating orders.
* Updating the CA to accept an Order ID to log when issuing a certificate corresponding to an order object
Resolves #3123
This pulls in google/safebrowsing#74, which introduces a new LookupURLsContext that allows us to pass through timeout information nicely.
Also, update calling code to use LookupURLsContext instead of LookupURLs.
Previously the expired-authz-purger would try to load the ids for all relevant
authzs into memory before doing any work. On a very large table, this would mean
running out of memory. This setting allows limiting how much work will be done
in one chunk.
Also add periodic logging of deletion count.
Fixes #3147.
This is only the migration, so far. Rather than doing the feature-switch dance,
we can wait for this migration to be applied, and then commit the code to start
setting it, with a feature switch to start checking it, which can be turned on
once we've been setting the bit in production for a week.
Having this as an indexed bit on issuedNames allows us to cheaply exclude
renewals from our rate limit queries, so we can avoid the ordering dependency
for renewals vs new issuances on the same domain.
Fixes #3161
There were two bugs in #3167:
All process-level stats got prefixed with "boulder", which broke dashboards.
All request_time stats got dropped, because measured_http was using the prometheus DefaultRegisterer.
To fix, this PR plumbs through a scope object to measured_http, and uses an empty prefix when calling NewProcessCollector().
Go's default is 2: https://golang.org/src/database/sql/sql.go#L686.
Graphs show we are opening 100-200 fresh connections per second on the SA.
Changing this default should reduce that a lot, which should reduce load on both
the SA and MariaDB. This should also improve latency, since every new TCP
connection adds a little bit of latency.
Previously, we used prometheus.DefaultRegisterer to register our stats, which uses global state to export its HTTP stats. We also used net/http/pprof's behavior of registering to the default global HTTP ServeMux, via DebugServer, which starts an HTTP server that uses that global ServeMux.
In this change, I merge DebugServer's functions into StatsAndLogging. StatsAndLogging now takes an address parameter and fires off an HTTP server in a goroutine. That HTTP server is newly defined, and doesn't use DefaultServeMux. On it is registered the Prometheus stats handler, and handlers for the various pprof traces. In the process I split StatsAndLogging internally into two functions: makeStats and MakeLogger. I didn't port across the expvar variable exporting, which serves a similar function to Prometheus stats but which we never use.
One nice immediate effect of this change: since StatsAndLogging now requires an address, I noticed a bunch of commands that called StatsAndLogging and passed around the resulting Scope, but never made use of it because they didn't run a DebugServer. Under the old StatsD world, these commands still could have exported their stats by pushing, but since we moved to Prometheus their stats stopped being collected. We haven't used any of these stats, so instead of adding debug ports to all short-lived commands, or setting up a push gateway, I simply removed them and switched those commands to initialize only a Logger, no stats.