In #3651 we introduced a new parameter to sa.AddCertificate to allow specifying the Issued date. If nil, we defaulted to the current time to maintain deployability guidelines.
Now that this has been deployed everywhere, this PR updates SA.AddCertificate and the gRPC wrappers so that a nil issued argument is rejected with an error.
Unit tests that were previously using nil for the issued time are updated to explicitly set the issued time to the fake clock's now().
Resolves #3657
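A minimal sketch of the new nil handling, assuming a simplified signature. `addCertificate` and `errNilIssued` are hypothetical stand-ins and do not reproduce the real SA code or its error text:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNilIssued is a hypothetical error value; the real SA may word this differently.
var errNilIssued = errors.New("issued time must not be nil")

// addCertificate is a simplified stand-in for SA.AddCertificate. It only
// models the handling of the issued argument, not any database work.
func addCertificate(der []byte, regID int64, issued *time.Time) (time.Time, error) {
	if issued == nil {
		// Previously a nil issued time fell back to the current time.
		return time.Time{}, errNilIssued
	}
	return *issued, nil
}

func main() {
	_, err := addCertificate(nil, 1, nil)
	fmt.Println("nil issued:", err)

	// A fixed time stands in for the fake clock's Now() used by the unit tests.
	now := time.Date(2018, 4, 1, 0, 0, 0, 0, time.UTC)
	got, err := addCertificate(nil, 1, &now)
	fmt.Println("explicit issued:", got, err)
}
```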
Now that #3638 has been deployed to all of the RA instances there are no
more RPC clients using the SA's `CountCertificatesRange` RPC.
This commit deletes the implementation, the RPC definition & wrappers,
and all the test code/mocks.
load generator: send correct ACMEv2 Content-Type on POST.
This PR updates the Boulder load-generator to send the correct ACMEv2 Content-Type header when POSTing to the ACME server. This header is required for ACMEv2; without it, all POST requests to a WFE2 running a test/config-next configuration result in malformed 400 errors. While only required by ACMEv2, this commit sends it for ACMEv1 requests as well. No harm, no foul.
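As an illustration, the sketch below builds a POST with the ACMEv2 Content-Type (`application/jose+json`) and checks it against a toy handler. `newJWSRequest`, `toyWFE`, `statusFor`, and the URL are assumptions made up for this example; they are not the load-generator's or the WFE2's real code:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// joseContentType is the Content-Type ACMEv2 requires on POST requests.
const joseContentType = "application/jose+json"

// newJWSRequest builds a POST carrying an already-serialized JWS body.
func newJWSRequest(url string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", joseContentType)
	return req, nil
}

// toyWFE is a hypothetical handler that rejects any other Content-Type
// with a 400, loosely mimicking the behaviour described above.
func toyWFE(w http.ResponseWriter, r *http.Request) {
	if r.Header.Get("Content-Type") != joseContentType {
		http.Error(w, "malformed: wrong Content-Type", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// statusFor runs the request through toyWFE and returns the status code.
func statusFor(req *http.Request) int {
	rec := httptest.NewRecorder()
	toyWFE(rec, req)
	return rec.Code
}

func main() {
	// The URL is a placeholder; no network traffic is generated.
	req, err := newJWSRequest("http://localhost:4001/acme/new-acct", []byte(`{}`))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("with header:", statusFor(req))
	req.Header.Del("Content-Type")
	fmt.Println("without header:", statusFor(req))
}
```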
integration tests: allow running just the load generator.
Prior to this PR an omission in an if statement in integration-test.py meant that you couldn't invoke test/integration-test.py with just the --load argument to only run the load generator. This commit updates the if to allow this use case.
This PR updates the Boulder gRPC clientInterceptor to update a Prometheus gauge stat for each in-flight RPC it dispatches, sliced by service and method.
A unit test is included that uses a custom ChillerServer that lets the test block up a bunch of RPCs, check the in-flight gauge value is increased, unblock the RPCs, and recheck that the in-flight gauge is reduced. To check the gauge value for a specific set of labels a new test-tools.go function GaugeValueWithLabels is added.
Updates #3635
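The idea can be sketched without Prometheus or gRPC. Here a mutex-guarded map stands in for the labelled gauge, and `intercept` stands in for the client interceptor; all names are hypothetical and chosen only for this example:

```go
package main

import (
	"fmt"
	"sync"
)

// inFlightGauge is a stand-in for a gauge sliced by service and method.
type inFlightGauge struct {
	mu     sync.Mutex
	values map[string]int
}

func newInFlightGauge() *inFlightGauge {
	return &inFlightGauge{values: make(map[string]int)}
}

func (g *inFlightGauge) add(service, method string, delta int) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.values[service+"/"+method] += delta
}

func (g *inFlightGauge) value(service, method string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.values[service+"/"+method]
}

// intercept increments the gauge before dispatching the call and
// decrements it once the call returns, whether or not it failed.
func (g *inFlightGauge) intercept(service, method string, call func() error) error {
	g.add(service, method, 1)
	defer g.add(service, method, -1)
	return call()
}

func main() {
	g := newInFlightGauge()
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	// Block a few calls, much like the test server in the unit test does.
	for i := 0; i < 3; i++ {
		go func() {
			_ = g.intercept("Chiller", "Chill", func() error {
				started <- struct{}{}
				<-release
				return nil
			})
			done <- struct{}{}
		}()
	}
	for i := 0; i < 3; i++ {
		<-started
	}
	fmt.Println("in flight while blocked:", g.value("Chiller", "Chill"))

	close(release)
	for i := 0; i < 3; i++ {
		<-done
	}
	fmt.Println("in flight after unblocking:", g.value("Chiller", "Chill"))
}
```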
Prior to Go 1.9 (https://golang.org/doc/go1.9),
various go commands would expand "./..." to include vendor directories.
We worked around this by listing "./..." then grepping out vendor. Now
that we are on Go 1.10 this is no longer necessary. Remove the TESTPATHS
hack.
We still need to exclude certain test directories when running errcheck,
so some of the "go list" logic gets moved into the errcheck stanza.
Also, as of Go 1.10, running coverage on multiple packages in one run is
supported, so replace the "for" loop in the coverage stanza with a
single command.
Also, remove GITHUB_SECRET_FILE and "die," both of which were unused.
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.
This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.
Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands.
A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.
Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
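A rough sketch of the timestamp round trip, using a plain `map[string]string` in place of gRPC metadata. The key name `client-request-time` and the nanosecond encoding are assumptions for illustration, not necessarily what Boulder uses:

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// clientRequestTimeKey is a hypothetical metadata key.
const clientRequestTimeKey = "client-request-time"

// stampRequest is what the client side would do before dispatching an RPC.
func stampRequest(md map[string]string, now time.Time) {
	md[clientRequestTimeKey] = strconv.FormatInt(now.UnixNano(), 10)
}

// requestLatency is what the server side would do on receipt: parse the
// client's timestamp and return the delta to the server's current time.
func requestLatency(md map[string]string, now time.Time) (time.Duration, error) {
	raw, ok := md[clientRequestTimeKey]
	if !ok {
		return 0, fmt.Errorf("metadata has no %q entry", clientRequestTimeKey)
	}
	nanos, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parsing %q: %v", clientRequestTimeKey, err)
	}
	return now.Sub(time.Unix(0, nanos)), nil
}

// roundTrip stamps metadata at a fixed send time and reads it back as if
// the server received the request after the given delay.
func roundTrip(delay time.Duration) (time.Duration, error) {
	sent := time.Date(2018, 4, 1, 12, 0, 0, 0, time.UTC)
	md := map[string]string{}
	stampRequest(md, sent)
	return requestLatency(md, sent.Add(delay))
}

func main() {
	latency, err := roundTrip(250 * time.Millisecond)
	fmt.Println(latency, err)
}
```

In the real interceptors the resulting duration would be recorded as a histogram observation rather than printed.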
This field was set to singleDialTimeout, but the net/http library treats
it as covering all of dial, write headers, and read headers and body.
Since http01Dialer also uses singleDialTimeout, there's a race between
http01Dialer and net/http to see who will time out first. The result is
that sometimes we give "Timeout after connect" when the error really
should be "Timeout during connect." This issue also inhibits IPv6 to
IPv4 fallback, and tickles a data race that was causing a rare panic in
VA: https://github.com/letsencrypt/boulder/issues/3109.
After this change, the overall HTTP request will get the full deadline
allowed by the RPC context. The dialer will continue to use
singleDialTimeout for each of its two possible dial attempts.
* Randomize order of CT logs when submitting precerts so we maximize the chances we actually exercise all of the logs in a group and not just the first in the list.
* Add metrics for winning logs
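The randomization in the first bullet could look roughly like the following. `shuffled` is a hypothetical helper that works on log names, where the real code would work on log descriptions:

```go
package main

import (
	"fmt"
	"math/rand"
)

// shuffled returns a copy of logs in random order, leaving the input
// untouched so the configured order is preserved elsewhere.
func shuffled(logs []string) []string {
	out := make([]string, len(logs))
	for i, j := range rand.Perm(len(logs)) {
		out[i] = logs[j]
	}
	return out
}

func main() {
	group := []string{"log-a", "log-b", "log-c"}
	fmt.Println(shuffled(group))
}
```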
When the WFE calls the RA, the RA creates a sub context which is cancelled when
the RPC returns. Because we were spawning the publisher RPC calls in a
goroutine with the context from ra.IssueCertificate, that context was canceled
as soon as ra.IssueCertificate returned, which in turn canceled the publisher
RPC calls.
Instead of using the RA RPC context simply use a `context.Background()` so
that the RPC context doesn't break these submissions. Also return to
pre-features.CancelCTSubmissions behavior where precert submissions would
be canceled once we retrieved SCTs from the winning logs instead of relying
on the magic behavior of the RA RPC canceling them itself.
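The cancellation problem can be reproduced in a few lines. This is a toy model: `submit`, `issueCertificate`, and `run` are hypothetical names, and the `release` channel stands in for a CT log answering:

```go
package main

import (
	"context"
	"fmt"
)

// submit models a publisher call that finishes when the log answers
// (release is closed) or fails when its context is cancelled.
func submit(ctx context.Context, release <-chan struct{}) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-release:
		return nil
	}
}

// issueCertificate spawns the submission in a goroutine, using either the
// RPC context (old behaviour) or context.Background() (new behaviour).
func issueCertificate(rpcCtx context.Context, detached bool, release <-chan struct{}) <-chan error {
	subCtx := rpcCtx
	if detached {
		subCtx = context.Background()
	}
	result := make(chan error, 1)
	go func() { result <- submit(subCtx, release) }()
	return result
}

// run simulates the RPC returning (cancelling its context) while the
// submission is still outstanding.
func run(detached bool) error {
	rpcCtx, cancel := context.WithCancel(context.Background())
	release := make(chan struct{})
	result := issueCertificate(rpcCtx, detached, release)
	cancel() // the RPC has returned
	if detached {
		close(release) // the log answers some time later
	}
	return <-result
}

func main() {
	fmt.Println("RPC context:", run(false))
	fmt.Println("background context:", run(true))
}
```

A detached context no longer gets cancelled for free, which is why the submissions need their own explicit cancellation once the winning SCTs are in.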
The Boulder orphan-finder command uses the SA's AddCertificate RPC to add orphaned certificates it finds back to the DB. Prior to this commit this RPC always set the core.Certificate.Issued field to the current time. For the orphan-finder case this meant that the Issued date would incorrectly be set to when the certificate was found, not when it was actually issued. This could cause cert-checker to alarm based on the unusual delta between the cert NotBefore and the core.Certificate.Issued value.
This PR updates the AddCertificate RPC to accept an optional issued timestamp in the request arguments. In the SA layer we address deployability concerns by setting a default value of the current time when none is explicitly provided. This matches the classic behaviour and will let an old RA communicate with a new SA.
This PR updates the orphan-finder to provide an explicit issued time to sa.AddCertificate. The explicit issued time is calculated using the found certificate's NotBefore and the configured backdate.
This lets the orphan-finder set the true issued time in the core.Certificate object, avoiding any cert-checker alarms.
Resolves #3624
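A small sketch of the calculation, under the assumption that the CA sets NotBefore to the issuance time minus the backdate, so that adding the backdate recovers the issued time. `issuedTime` is a hypothetical helper and the one hour backdate is only an example value:

```go
package main

import (
	"fmt"
	"time"
)

// issuedTime recovers the issuance time from a certificate's NotBefore,
// assuming NotBefore was backdated by exactly backdate at issuance.
func issuedTime(notBefore time.Time, backdate time.Duration) time.Time {
	return notBefore.Add(backdate)
}

// exampleIssued applies issuedTime to a fixed NotBefore with a one hour
// backdate (an illustrative value, not necessarily the configured one).
func exampleIssued() time.Time {
	notBefore := time.Date(2018, 3, 28, 19, 50, 7, 0, time.UTC)
	return issuedTime(notBefore, time.Hour)
}

func main() {
	fmt.Println(exampleIssued().Format(time.RFC3339))
}
```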
This PR updates the Boulder github.com/weppos/publicsuffix-go dependency to
weppos/publicsuffix-go@542377b - the tip of master at the time of writing.
Unit tests are confirmed to pass:
    $ go test ./...
    ? github.com/weppos/publicsuffix-go/cmd/load [no test files]
    ok github.com/weppos/publicsuffix-go/net/publicsuffix 0.005s
    ok github.com/weppos/publicsuffix-go/publicsuffix 0.022s
Notably this update adds the .sport TLD and we've had some requests to support issuance for domains under this newly created TLD.
This allows us to have fast-running unittests without modifying the global state in singleDialTimeout,
which can become a const.
Fixes #3628.
Builds on top of #3629, review that first.
In particular, differentiate timeouts during connect (which are usually a firewall problem) from timeouts after connect (which are usually a software problem). In the process, refactor the tests and add testing for specific problem detail messages.
This also switches the HTTP challenge's dialer over to DialContext and shaves a little headroom off the context deadline, so that the dial can report its timeout before the overall context expires. Otherwise the expiry would produce an overly generic "deadline exceeded" error, which would then get translated (incorrectly) into a "timeout after connect."
There is an additional error case, Timeout during %s (your server may be slow or overloaded), (where %s can be read or write) which doesn't have any unittests. I believe it may not be possible to trigger this, since read and write timeouts get subsumed by the HTTP or TLS library, but it's worth having as a fallback case. We'll see if it shows up in the logs.
Among the test refactorings, I shortened the timeout on the TLS timeout test to 50ms. Previously this was the long pole making the whole test take 10s. Now it takes ~500 ms overall.
I recommend starting review at https://github.com/letsencrypt/boulder/compare/detailed-va-errors?expand=1#diff-4c51d1d7ca3ec3022d14b42809af0d7eR671 (the changes to detailedError), then reviewing the Dial -> DialContext changes, then the tests.
Submits final certificates to any configured CT logs. This doesn't introduce a feature flag because it is config gated: any log we want to submit final certificates to needs to have its log description updated to include the `"submitFinalCerts": true` field.
Fixes #3605.
During periods of peak load, some RPCs are significantly delayed (on the order of seconds) by client-side blocking. HTTP/2 clients have to obey a "max concurrent streams" setting sent by the server. In Go's HTTP/2 implementation, this value [defaults to 250](https://github.com/golang/net/blob/master/http2/server.go#L56), so the gRPC default is also 250. So whenever there are more than 250 requests in progress at a time, additional requests will be delayed until there is a slot available.
During this peak load, we aren't hitting limits on CPU or memory, so we should increase the max concurrent streams limit to take better advantage of our available resources. This PR adds a config field to do that.
Fixes #3641.
In #3614 we adjusted the `SA.NewOrder` function to conditionally call `ssa.statusForOrder` on the new order when `features.OrderReadyStatus` was enabled. Unfortunately this call to `ssa.statusForOrder` happened *before* the `req.BeganProcessing` field was initialized with a pointer to a `false` bool. The `ssa.statusForOrder` function (correctly) assumes that `req.BeganProcessing == nil` is illegal and doesn't correspond to a known status. This results in `NewOrder` requests returning a 500 error of the form:
> Internal error - Error creating new order - Order XXX is in an invalid state. No state known for this order's authorizations
Our integration tests missed this because we didn't have a test case that issued for a set of names with one account, and then issued again for the same set of names with the same account.
This PR fixes the original bug by moving the `BeganProcessing` initialization before the call to `statusForOrder`. This PR also adds an integration test to catch this sort of bug again in the future.
Prior to the SA fix this test failed with the 500 server internal error observed by the Certbot team. With the SA fix in place the test passes again.
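The ordering bug can be modelled in isolation. The types and functions below are heavily simplified, hypothetical stand-ins that share only the `BeganProcessing` nil check with the real SA code:

```go
package main

import (
	"errors"
	"fmt"
)

// order is a pared-down stand-in for the order request.
type order struct {
	BeganProcessing *bool
	AuthzsValid     bool
}

// statusForOrder treats a nil BeganProcessing as an invalid state.
func statusForOrder(o *order) (string, error) {
	if o.BeganProcessing == nil {
		return "", errors.New("order is in an invalid state")
	}
	if *o.BeganProcessing {
		return "processing", nil
	}
	if o.AuthzsValid {
		return "ready", nil
	}
	return "pending", nil
}

// newOrderBuggy computes the status before BeganProcessing is initialized.
func newOrderBuggy(authzsValid bool) (string, error) {
	o := &order{AuthzsValid: authzsValid}
	status, err := statusForOrder(o)
	began := false
	o.BeganProcessing = &began // too late: the status call already failed
	return status, err
}

// newOrderFixed initializes BeganProcessing first, as in the fix.
func newOrderFixed(authzsValid bool) (string, error) {
	began := false
	o := &order{AuthzsValid: authzsValid, BeganProcessing: &began}
	return statusForOrder(o)
}

func main() {
	fmt.Println(newOrderBuggy(true))
	fmt.Println(newOrderFixed(true))
}
```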
Finally, this PR disables the `OrderReadyStatus` feature flag in `test/config-next/sa.json`. Certbot's ACME implementation breaks when this flag is enabled (See https://github.com/certbot/certbot/issues/5856). Since Certbot runs integration tests against Boulder with config-next we should be courteous and leave this flag disabled until we are closer to being able to turn it on for staging/prod.
The `TotalCertificates` rate limit serves to ensure we don't
accidentally exceed our OCSP signing capacity by issuing too many
certificates within a fixed period. In practice this rate limit has been
fragile and the associated queries have been linked to performance
problems.
Since we now have better means of monitoring our OCSP signing capacity
this commit removes the rate limit and associated code.
We have updated staging/prod Boulder builds to use Go 1.10.1. This means
we no longer need to support Go 1.10.0 in dev docker images, CI, and our
image building tools.
This commit disables the `OrderReadyStatus` feature flag in
`test/config-next/sa.json`. Certbot's ACME implementation breaks when
this flag is enabled (See
https://github.com/certbot/certbot/issues/5856). Since Certbot runs
integration tests against Boulder with config-next we should be
courteous and leave this flag disabled until we are closer to being able
to turn it on for staging/prod.
In #3614 we adjusted the `SA.NewOrder` function to conditionally call
`ssa.statusForOrder` on the new order when `features.OrderReadyStatus`
was enabled. Unfortunately this call to `ssa.statusForOrder` happened
*before* the `req.BeganProcessing` field was initialized with a pointer
to a `false` bool. The `ssa.statusForOrder` function (correctly) assumes
that `req.BeganProcessing == nil` is illegal and doesn't correspond to
a known status. This results in NewOrder requests returning a 500 error
of the form:
> Internal error - Error creating new order - Order XXX is in an invalid
> state. No state known for this order's authorizations
Our integration tests missed this because we didn't have a test case
that issued for a set of names with one account, and then issued again
for the same set of names with the same account.
This commit fixes the original bug by moving the `BeganProcessing`
initialization before the call to `statusForOrder`. This commit also
adds an integration test to catch this sort of bug again in the
future.
Prior to the SA fix this test failed with the 500 server internal error
observed by the Certbot team. With the SA fix in place the test passes
again.
* SA: Add Order "Ready" status, feature flag.
This commit adds the new "Ready" status to `core/objects.go` and updates
`sa.statusForOrder` to use it conditionally for orders with all valid
authorizations that haven't been finalized yet. This state is used
conditionally based on the `features.OrderReadyStatus` feature flag
since it will likely break some existing clients that expect status
"Processing" for this state. The SA unit test for `statusForOrder` is
updated with a "ready" status test case.
* RA: Enforce order ready status conditionally.
This commit updates the RA to conditionally expect orders that are being
finalized to be in the "ready" status instead of "pending". This is
conditionally enforced based on the `OrderReadyStatus` feature flag.
Along the way the SA was changed to calculate the order status for the
order returned in `sa.NewOrder` dynamically now that it could be
something other than "pending".
* WFE2: Conditionally enforce order ready status for finalization.
Similar to the RA the WFE2 should conditionally enforce that an order's
status is either "ready" or "pending" based on the "OrderReadyStatus"
feature flag.
* Integration: Fix `test_order_finalize_early`.
This commit updates the V2 `test_order_finalize_early` test for the
"ready" status. A nice side-effect of the ready state change is that we
no longer invalidate an order when it is finalized too soon because we
can reject the finalization in the WFE. Subsequently the
`test_order_finalize_early` testcase is also smaller.
* Integration: Test classic behaviour w/o feature flag.
In the previous commit I fixed the integration test for the
`test/config-next` run that has the `OrderReadyStatus` feature flag set
but broke it for the `test/config` run without the feature flag.
This commit updates the `test_order_finalize_early` test to work
correctly based on the feature flag status in both cases.
Previously we updated the RA's issueCertificateInner function to prefix errors returned from the CA with meaningful information about which CA RPC caused the failure. Unfortunately by using fmt.Errorf to do this we're discarding the underlying error type. This can cause unexpected server internal errors downstream if (for e.g.) the CA rejects a CSR with a malformed error (see #3632).
This PR updates the issueCertificateInner error message prefixing to maintain the error type if the underlying error is a berrors.BoulderError. A RA unit test with several mock CAs is added to test the prefixing occurs as expected without loss of error type.
This PR also adds an integration test that ensures we reject a CSR with >100 names with a malformed error. This is not strictly related to this PR but since I wrote it while debugging the root issue I thought I'd include it. To allow this test to pass the pendingAuthorizationsPerAccount in test/rate-limit-policies.yml and associated tests had to be adjusted.
Resolves #3632
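A sketch of type-preserving prefixing. `boulderError` is a simplified stand-in for `berrors.BoulderError`, and `wrapError` and the error type constants are hypothetical names for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error types, standing in for the berrors package's types.
const (
	internalServer = iota
	malformed
)

// boulderError is a simplified stand-in for berrors.BoulderError.
type boulderError struct {
	Type   int
	Detail string
}

func (e *boulderError) Error() string { return e.Detail }

// errPlain is an ordinary error with no Boulder error type attached.
var errPlain = errors.New("connection reset")

// wrapError prefixes err with context. If err is a *boulderError the
// result keeps the same Type; other errors are wrapped with fmt.Errorf,
// which is the behaviour that previously discarded the type.
func wrapError(err error, prefix string) error {
	if be, ok := err.(*boulderError); ok {
		return &boulderError{
			Type:   be.Type,
			Detail: fmt.Sprintf("%s: %s", prefix, be.Detail),
		}
	}
	return fmt.Errorf("%s: %s", prefix, err)
}

func main() {
	caErr := &boulderError{Type: malformed, Detail: "CSR has too many names"}
	wrapped := wrapError(caErr, "issuing precertificate")
	be, ok := wrapped.(*boulderError)
	fmt.Println(ok, be.Type == malformed, wrapped)

	fmt.Println(wrapError(errPlain, "issuing precertificate"))
}
```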
This commit updates the `boulder-ra` and `boulder-ca` commands to refuse
to start if their configured `MaxNames` is 0 (the default value). This
should always be set to a positive number.
This commit also updates `csr/csr.go` to always apply the max names
check since it will never be 0 after the change above.
Also refactor `FailOnError` to pull out a separate `Fail` function.
Related to https://github.com/letsencrypt/boulder/issues/3632
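In outline, the startup check and the now unconditional names check might look like this. `config`, `validateConfig`, and `checkNames` are hypothetical, and this sketch also rejects negative values, which goes slightly beyond the zero check described above:

```go
package main

import (
	"errors"
	"fmt"
)

// config holds only the field relevant to this sketch.
type config struct {
	MaxNames int
}

// validateConfig is what a command could call at startup before serving.
func validateConfig(c config) error {
	if c.MaxNames <= 0 {
		return errors.New("MaxNames must be set to a positive number")
	}
	return nil
}

// checkNames applies the limit unconditionally, which is safe once
// validateConfig guarantees maxNames is never 0.
func checkNames(names []string, maxNames int) error {
	if len(names) > maxNames {
		return fmt.Errorf("CSR contains more than %d DNS names", maxNames)
	}
	return nil
}

func main() {
	fmt.Println(validateConfig(config{}))
	fmt.Println(validateConfig(config{MaxNames: 100}))
	fmt.Println(checkNames([]string{"a.example", "b.example"}, 1))
}
```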
Distinguish between deadline exceeded vs canceled. Also, combine those
two cases with "out of retries" into a single stat with a label
determining type.
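One way to derive such a label is sketched below. The label values and the `failureType` helper are assumptions, and a map stands in for the labelled counter:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// failureType maps an error to the label value for the combined stat.
// Anything that is neither a deadline nor a cancellation is treated as
// having run out of retries.
func failureType(err error) string {
	switch {
	case errors.Is(err, context.DeadlineExceeded):
		return "deadline exceeded"
	case errors.Is(err, context.Canceled):
		return "canceled"
	default:
		return "out of retries"
	}
}

// sampleErrors produces one error of each context-related kind.
func sampleErrors() (deadline, canceled error) {
	// A non-positive timeout yields a context that has already expired.
	dctx, dcancel := context.WithTimeout(context.Background(), -1)
	defer dcancel()
	cctx, ccancel := context.WithCancel(context.Background())
	ccancel()
	return dctx.Err(), cctx.Err()
}

func main() {
	counts := map[string]int{} // stand-in for a counter with a "type" label
	deadline, canceled := sampleErrors()
	for _, err := range []error{deadline, canceled, errors.New("no more retries")} {
		counts[failureType(err)]++
	}
	fmt.Println(counts)
}
```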
gRPC passes deadline information through the RPC boundary, but client and server have the same deadline. Ideally we'd like the server to have a slightly tighter deadline than the client, so if one of the server's onward RPCs or other network calls times out, the server can pass back more detailed information to the client, rather than the client timing out the server and losing the opportunity to log more detailed information about which component caused the timeout.
In this change, I subtract 100ms from the deadline on the server side of our interceptors, using our existing serverInterceptor. I also check that there is at least 100ms remaining in which to do useful work, so the server doesn't begin a potentially expensive task only to abort it.
Fixes #3608.
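The deadline adjustment can be sketched with the standard `context` package alone. `tightenDeadline`, `headroom`, and `errTooLittleTime` are hypothetical names, and the real interceptor differs in its details:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// headroom is the amount shaved off the client's deadline on the server.
const headroom = 100 * time.Millisecond

var errTooLittleTime = errors.New("not enough time left to do useful work")

// tightenDeadline returns a context whose deadline is headroom earlier than
// ctx's. It fails if less than headroom remains, so the server doesn't start
// work it would have to abort. Contexts without a deadline pass through.
func tightenDeadline(ctx context.Context, now time.Time) (context.Context, context.CancelFunc, error) {
	deadline, ok := ctx.Deadline()
	if !ok {
		return ctx, func() {}, nil
	}
	if deadline.Sub(now) < headroom {
		return nil, nil, errTooLittleTime
	}
	tightened, cancel := context.WithDeadline(ctx, deadline.Add(-headroom))
	return tightened, cancel, nil
}

// remainingAfterTighten reports how much time a handler would get when the
// client allowed budget for the whole RPC.
func remainingAfterTighten(budget time.Duration) (time.Duration, error) {
	now := time.Now()
	ctx, cancel := context.WithDeadline(context.Background(), now.Add(budget))
	defer cancel()
	tightened, cancelTightened, err := tightenDeadline(ctx, now)
	if err != nil {
		return 0, err
	}
	defer cancelTightened()
	deadline, _ := tightened.Deadline()
	return deadline.Sub(now), nil
}

func main() {
	fmt.Println(remainingAfterTighten(time.Second))
	fmt.Println(remainingAfterTighten(50 * time.Millisecond))
}
```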
This PR updates the `test/boulder-tools/tag_and_upload.sh` script to template a `Dockerfile` for building multiple copies of `boulder-tools`: one per supported Go version. Unfortunately this is required because only Docker 17+ supports an env var in a Dockerfile `FROM`. It's best if we can stay on package manager installed versions of Docker, which precludes 17+ 😞.
The `docker-compose.yml` is updated to version "3" to allow specifying a `GO_VERSION` env var in the respective Boulder `image` directives. This requires `docker-compose` version 1.10.0+ which in turn requires Docker engine version 1.13.0+. The README is updated to reflect these new requirements. This Docker engine version is commonly available in package managers (e.g. Ubuntu 16.04). A sufficient `docker-compose` version is not, but this is a simple one binary Go application that is easy to update outside of package managers.
The `.travis.yml` config file is updated to set the `GO_VERSION` in the build matrix, allowing build tasks for different Go versions. Since the `docker-compose.yml` now requires `docker-compose` 1.10.0+ the `.travis.yml` also gains a new `before_install` for setting up a modern `docker-compose` version.
Lastly, tools and images are updated to support both Go 1.10 (our current Go version) and Go 1.10.1 (the new point release). By default Go 1.10 is used; we can switch this once staging/prod are updated.
_*TODO*: One thing I haven't implemented yet is a `sed` expression in `tag_and_upload.sh` that updates both `image` lines in `docker-compose.yml` with an up-to-date tag. Putting this up for review while I work on that last creature comfort._
Resolves https://github.com/letsencrypt/boulder/issues/3551
Replaces https://github.com/letsencrypt/boulder/pull/3620 (GH got stuck from a yaml error)
This allows these tools to easily be run in command line mode from
the host machine against a Boulder running inside docker-compose up
without modifying the FAKE_DNS field in docker-compose.yml. This
allows for easier testing of various conditions.
This PR updates the RA such that certificateRequestEvent objects created during issuance and written to the audit log as JSON also include a new Authorizations field. This field is a map of the form map[string]certificateRequestAuthz and can be used to map from an identifier name appearing in the associated certificate to a certificateRequestAuthz object. Each of the certificateRequestAuthz objects holds an authorization ID and the type of challenge that made the authorization valid.
Example Audit log output (with the JSON pulled out and pretty-printed):
    {
      "ID":"0BjPk94KlxExRRIQ3kslRVSJ68KMaTh416chRvq0wyA",
      "Requester":666,
      "SerialNumber":"ff699d91cab5bc84f1bc97fc71e4e27abc1a",
      "VerifiedFields":["subject.commonName","subjectAltName"],
      "CommonName":"rand.44634cbf.xyz",
      "Names":["rand.44634cbf.xyz"],
      "NotBefore":"2018-03-28T19:50:07Z",
      "NotAfter":"2018-06-26T19:50:07Z",
      "RequestTime":"2018-03-28T20:50:07.234038859Z",
      "ResponseTime":"2018-03-28T20:50:07.278848954Z",
      "Authorizations": {
        "rand.44634cbf.xyz" : {
          "ID":"jGt37Rnvfr0nhZn-wLkxrQxc2HBfV4t6TSraRiGnNBM",
          "ChallengeType":"http-01"
        }
      }
    }
Resolves #3253
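For reference, a pared-down sketch of the shape described above. Only the `Names` and `Authorizations` fields are modelled, the struct layouts are inferred from the JSON example rather than copied from the RA, and `marshalEvent` is a hypothetical helper:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// certificateRequestAuthz records which authorization covered a name and
// which challenge type made it valid.
type certificateRequestAuthz struct {
	ID            string
	ChallengeType string
}

// certificateRequestEvent is reduced to the fields relevant here; the real
// event carries more (serial, validity, timestamps, and so on).
type certificateRequestEvent struct {
	Names          []string
	Authorizations map[string]certificateRequestAuthz
}

// marshalEvent renders the event as the JSON written to the audit log.
func marshalEvent(e certificateRequestEvent) (string, error) {
	out, err := json.Marshal(e)
	return string(out), err
}

func main() {
	event := certificateRequestEvent{
		Names: []string{"rand.44634cbf.xyz"},
		Authorizations: map[string]certificateRequestAuthz{
			"rand.44634cbf.xyz": {
				ID:            "jGt37Rnvfr0nhZn-wLkxrQxc2HBfV4t6TSraRiGnNBM",
				ChallengeType: "http-01",
			},
		},
	}
	fmt.Println(marshalEvent(event))
}
```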
* Update `globalsign/certlint` to d4a45be.
This commit updates the `github.com/globalsign/certlint` dependency to
the latest tip of master (d4a45be06892f3e664f69892aca79a48df510be0).
Unit tests are confirmed to pass:
```
$ go test ./...
ok github.com/globalsign/certlint 3.816s
ok github.com/globalsign/certlint/asn1 (cached)
? github.com/globalsign/certlint/certdata [no test files]
? github.com/globalsign/certlint/checks [no test files]
? github.com/globalsign/certlint/checks/certificate/aiaissuers [no test files]
? github.com/globalsign/certlint/checks/certificate/all [no test files]
? github.com/globalsign/certlint/checks/certificate/basicconstraints [no test files]
? github.com/globalsign/certlint/checks/certificate/extensions [no test files]
? github.com/globalsign/certlint/checks/certificate/extkeyusage [no test files]
ok github.com/globalsign/certlint/checks/certificate/internal (cached)
? github.com/globalsign/certlint/checks/certificate/issuerdn [no test files]
? github.com/globalsign/certlint/checks/certificate/keyusage [no test files]
? github.com/globalsign/certlint/checks/certificate/publickey [no test files]
? github.com/globalsign/certlint/checks/certificate/publickey/goodkey [no test files]
ok github.com/globalsign/certlint/checks/certificate/publicsuffix (cached)
? github.com/globalsign/certlint/checks/certificate/revocation [no test files]
? github.com/globalsign/certlint/checks/certificate/serialnumber [no test files]
? github.com/globalsign/certlint/checks/certificate/signaturealgorithm [no test files]
ok github.com/globalsign/certlint/checks/certificate/subject (cached)
ok github.com/globalsign/certlint/checks/certificate/subjectaltname (cached)
? github.com/globalsign/certlint/checks/certificate/validity [no test files]
? github.com/globalsign/certlint/checks/certificate/version [no test files]
? github.com/globalsign/certlint/checks/certificate/wildcard [no test files]
? github.com/globalsign/certlint/checks/extensions/adobetimestamp [no test files]
? github.com/globalsign/certlint/checks/extensions/all [no test files]
? github.com/globalsign/certlint/checks/extensions/authorityinfoaccess [no test files]
? github.com/globalsign/certlint/checks/extensions/authoritykeyid [no test files]
? github.com/globalsign/certlint/checks/extensions/basicconstraints [no test files]
? github.com/globalsign/certlint/checks/extensions/crldistributionpoints [no test files]
? github.com/globalsign/certlint/checks/extensions/ct [no test files]
? github.com/globalsign/certlint/checks/extensions/extkeyusage [no test files]
? github.com/globalsign/certlint/checks/extensions/keyusage [no test files]
? github.com/globalsign/certlint/checks/extensions/nameconstraints [no test files]
ok github.com/globalsign/certlint/checks/extensions/ocspmuststaple (cached)
? github.com/globalsign/certlint/checks/extensions/ocspnocheck [no test files]
? github.com/globalsign/certlint/checks/extensions/pdfrevocation [no test files]
? github.com/globalsign/certlint/checks/extensions/policyidentifiers [no test files]
? github.com/globalsign/certlint/checks/extensions/smimecapabilities [no test files]
? github.com/globalsign/certlint/checks/extensions/subjectaltname [no test files]
? github.com/globalsign/certlint/checks/extensions/subjectkeyid [no test files]
ok github.com/globalsign/certlint/errors (cached)
? github.com/globalsign/certlint/examples/ct [no test files]
? github.com/globalsign/certlint/examples/specificchecks [no test files]
```
* Certchecker: Remove OCSP Must Staple err ignore, fix typos.
This commit removes the explicit ignore for OCSP Must Staple errors that
was added when the upstream `certlint` package didn't understand that
PKIX extension. That problem was resolved and so we can remove the
ignore from `cert-checker`.
This commit also fixes two typos that were fixed upstream and needed to
be reflected in expected error messages in the `certlint` unit test.
* Certchecker: Ignore Certlint CN/SAN == PSL errors.
`globalsign/certlint`, used by `cmd/cert-checker` to vet certs,
improperly flags certificates that have subj CN/SANs equal to a private
entry in the public suffix list as faulty.
This commit adds a regex that will skip errors that match the certlint
PSL error string. Prior to this workaround the addition of a private PSL
entry as a SAN in the `TestCheckCert` test cert fails the test:
```
--- FAIL: TestCheckCert (1.72s)
main_test.go:221: Found unexpected problem 'Certificate subjectAltName "dev-myqnapcloud.com" equals "dev-myqnapcloud.com" from the public suffix list'.
```
With the workaround in place, the test passes again.
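A sketch of such a filter, with the pattern modelled on the failure message quoted above. The exact regex used by `cert-checker` is not shown in this message, so `pslErrorRE` and `filterProblems` are assumptions, and only the subjectAltName wording is matched here:

```go
package main

import (
	"fmt"
	"regexp"
)

// pslErrorRE matches the certlint problem text for a SAN that equals a
// public suffix list entry. The pattern is inferred from a single example.
var pslErrorRE = regexp.MustCompile(
	`^Certificate subjectAltName "[^"]+" equals "[^"]+" from the public suffix list$`)

// filterProblems drops problems that match the PSL pattern and keeps the rest.
func filterProblems(problems []string) []string {
	var kept []string
	for _, p := range problems {
		if pslErrorRE.MatchString(p) {
			continue
		}
		kept = append(kept, p)
	}
	return kept
}

func main() {
	problems := []string{
		`Certificate subjectAltName "dev-myqnapcloud.com" equals "dev-myqnapcloud.com" from the public suffix list`,
		"some other certlint problem",
	}
	fmt.Println(filterProblems(problems))
}
```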
Also instead of repeating the same bucket definitions everywhere just use a single top level var in the metrics package in order to discourage copy/pasting.
Fixes #3607.
This commit addresses two config elements that were defined but not
wired through to the WFE implementation object. Prior to this commit the
`c.WFE.DirectoryCAAIdentity` and `c.WFE.DirectoryWebsite` configuration
values were read and unmarshaled from config but not passed to the WFE.
After this commit these two config options will be picked up by the WFE
impl.