In #3614 we adjusted the `SA.NewOrder` function to conditionally call
`ssa.statusForOrder` on the new order when `features.OrderReadyStatus`
was enabled. Unfortunately this call to `ssa.statusForOrder` happened
*before* the `req.BeganProcessing` field was initialized with a pointer
to a `false` bool. The `ssa.statusForOrder` function (correctly) assumes
that `req.BeganProcessing == nil` is illegal and doesn't correspond to
a known status. This results in NewOrder requests returning a 500 error
of the form:
> Internal error - Error creating new order - Order XXX is in an invalid
> state. No state known for this order's authorizations
Our integration tests missed this because we didn't have a test case
that issued for a set of names with one account, and then issued again
for the same set of names with the same account.
This commit fixes the original bug by moving the `BeganProcessing`
initialization before the call to `statusForOrder`. This commit also
adds an integration test to catch this sort of bug again in the
future.
Prior to the SA fix this test failed with the 500 server internal error
observed by the Certbot team. With the SA fix in place the test passes
again.
* SA: Add Order "Ready" status, feature flag.
This commit adds the new "Ready" status to `core/objects.go` and updates
`sa.statusForOrder` to use it conditionally for orders with all valid
authorizations that haven't been finalized yet. This state is used
conditionally based on the `features.OrderReadyStatus` feature flag
since it will likely break some existing clients that expect status
"Processing" for this state. The SA unit test for `statusForOrder` is
updated with a "ready" status test case.
* RA: Enforce order ready status conditionally.
This commit updates the RA to conditionally expect orders that are being
finalized to be in the "ready" status instead of "pending". This is
conditionally enforced based on the `OrderReadyStatus` feature flag.
Along the way the SA was changed to calculate the order status for the
order returned in `sa.NewOrder` dynamically now that it could be
something other than "pending".
* WFE2: Conditionally enforce order ready status for finalization.
Similar to the RA the WFE2 should conditionally enforce that an order's
status is either "ready" or "pending" based on the "OrderReadyStatus"
feature flag.
* Integration: Fix `test_order_finalize_early`.
This commit updates the V2 `test_order_finalize_early` test for the
"ready" status. A nice side-effect of the ready state change is that we
no longer invalidate an order when it is finalized too soon because we
can reject the finalization in the WFE. Subsequently the
`test_order_finalize_early` testcase is also smaller.
* Integration: Test classic behaviour w/o feature flag.
In the previous commit I fixed the integration test for the
`config/test-next` run that has the `OrderReadyStatus` feature flag set
but broke it for the `config/test` run without the feature flag.
This commit updates the `test_order_finalize_early` test to work
correctly based on the feature flag status in both cases.
Previously we updated the RA's issueCertificateInner function to prefix errors returned from the CA with meaningful information about which CA RPC caused the failure. Unfortunately by using fmt.Errorf to do this we're discarding the underlying error type. This can cause unexpected server internal errors downstream if (for e.g.) the CA rejects a CSR with a malformed error (see #3632).
This PR updates the issueCertificateInner error message prefixing to maintain the error type if the underlying error is a berrors.BoulderError. A RA unit test with several mock CAs is added to test the prefixing occurs as expected without loss of error type.
This PR also adds an integration test that ensures we reject a CSR with >100 names with a malformed error. This is not strictly related to this PR but since I wrote it while debugging the root issue I thought I'd include it. To allow this test to pass the pendingAuthorizationsPerAccount in test/rate-limit-policies.yml and associated tests had to be adjusted.
Resolves#3632
gRPC passes deadline information through the RPC boundary, but client and server have the same deadline. Ideally we'd like the server to have a slightly tighter deadline than the client, so if one of the server's onward RPCs or other network calls times out, the server can pass back more detailed information to the client, rather than the client timing out the server and losing the opportunity to log more detailed information about which component caused the timeout.
In this change, I subtract 100ms from the deadline on the server side of our interceptors, using our existing serverInterceptor. I also check that there is at least 100ms remaining in which to do useful work, so the server doesn't begin a potentially expensive task only to abort it.
Fixes#3608.
This PR updates the `test/boulder-tools/tag_and_upload.sh` script to template a `Dockerfile` for building multiple copies of `boulder-tools`: one per supported Go version. Unfortunately this is required because only Docker 17+ supports an env var in a Dockerfile `FROM`. It's best if we can stay on package manger installed versions of Docker which precludes 17+ 😞.
The `docker-compose.yml` is updated to version "3" to allow specifying a `GO_VERSION` env var in the respective Boulder `image` directives. This requires `docker-compose` version 1.10.0+ which in turn requires Docker engine version 1.13.0+. The README is updated to reflect these new requirements. This Docker engine version is commonly available in package managers (e.g. Ubuntu 16.04). A sufficient `docker-compose` version is not, but this is a simple one binary Go application that is easy to update outside of package managers.
The `.travis.yml` config file is updated to set the `GO_VERSION` in the build matrix, allowing build tasks for different Go versions. Since the `docker-compose.yml` now requires `docker-compose` 1.10.0+ the
`.travis.yml` also gains a new `before_install` for setting up a modern `docker-compose` version.
Lastly tools and images are updated to support both Go 1.10 (our current Go version) and Go 1.10.1 (the new point release). By default Go 1.10 is used, we can switch this once staging/prod are updated.
_*TODO*: One thing I haven't implemented yet is a `sed` expression in `tag_and_upload.sh` that updates both `image` lines in `docker-compose.yml` with an up-to-date tag. Putting this up for review while I work on that last creature comfort._
Resolves https://github.com/letsencrypt/boulder/issues/3551
Replaces https://github.com/letsencrypt/boulder/pull/3620 (GH got stuck from a yaml error)
This allows these tools to easily be run in command line mode from
the host machine against a Boulder running inside docker-compose up
without modifying the FAKE_DNS field in docker-compose.yml. This
allows for easier testing of various conditions.
This commit updates the WFE and WFE2 to have configuration support for
setting a value for the `/directory` object's "meta" field's
optional "caaIdentities" and "website" fields. The config-next wfe/wfe2
configuration are updated with values for these fields. Unit tests are
updated to check that they are sent when expected and not otherwise.
Bonus content: The `test.AssertUnmarshaledEquals` function had a bug
where it would consider two inputs equal when the # of keys differed.
This commit also fixes that bug.
In publisher and in the integration test, check that SCTs are in a
reasonable range. Also, update CreateTestingSignedSCT (used by
ct-test-srv) to produce SCTs correctly with a timetamp in Unix epoch
milliseconds.
- Remove acme-v2 test phase.
- Rename integration-test-v2.py to v2_integration, so it can be imported.
- Import all symbols from v2_integration before running test_*.
- In chisel2:
- Rename DIRECTORY so it doesn't collide.
- Incidental logging and error fixes.
- Merge v1 and v2 load testing into a single function.
- Run cert-checker just once, after all other test cases.
- In v2_integration:
- Remove unnecessary imports.
- Import chisel2 methods in the chisel2 namespace so they don't
collide with chisel methods.
- Remove main and shutdown code.
Previously, each time we defined a new test case in integration-test.py, we had to explicitly call it.
This made it easy to leave out cases without realizing it. After this change, we will automatically
find all functions named "test_" and call them. As a result, I found that we weren't calling
`test_revoked_by_account`, and it was failing. So I fixed it as part of this PR.
Fixes#3518
We shipped this feature flag disabled because at the time Certbot's acme
module had a bug with revocation that sent the wrong content-type. That
has been fixed upstream and the current boulder-tools image has the fix.
We should be able to turn this flag on now for config-next without
breaking tests.
Adds SCT embedding to the certificate issuance flow. When a issuance is requested a precertificate (the requested certificate but poisoned with the critical CT extension) is issued and submitted to the required CT logs. Once the SCTs for the precertificate have been collected a new certificate is issued with the poison extension replace with a SCT list extension containing the retrieved SCTs.
Fixes#2244, fixes#3492 and fixes#3429.
By default the MariaDB/MySQL container starts with a global max
connections limit of 100. The SA is configured in `test/config` and
`test/config-next` to use 100 connections. This doesn't leave any
overhead for `ocsp-updater` connections and can be reached under load
pretty easily.
This commit adjusts the global max connections setting from
`test/create_db.sh`, setting it to a more generous `500` instead of the
default `100`.
`boulder-publisher` and `ct-test-serv` make a number of assumptions about _what_ is
being submitted to a CT log (mainly that it's always a plain certificate) which prevents
proper submission of precertificates and verification/usage of the returned SCTs.
This change adds the ability for the publisher to submit both normal certificates and
precertificates and verify the returned SCTs. This change also adds the ability for
`ct-test-serv` to accept `/ct/v1/add-pre-chain` submissions and create correct SCTs
for precert entry types.
Right now we check safe browsing at new-authz time, which introduces a possible
external dependency when calling new-authz. This is usually fine, since most safe
browsing checks can be satisfied locally, but when requests have to go external,
it can create variance in new-authz timing.
Fixes#3491.
This pulls in an updated Certbot repo with a fix to content-type for the
revoke method.
Also, adds a convenience replacement in tag_and_release.sh to update
docker-compose.yml.
Prior to this commit all of the `newOrder` requests generated by the
load generator had 1 DNS type identifier. In this commit the
`maxNamesPerCert` parameter that was present in the codebase is fixed
(it was never wired through to config before) and used to size the
orders randomly. Each order will have between 1 and `maxNamesPerCert`
identifiers.
For integration testing we set this at 20. 100 would be more realistic
but we'll start small for now.
This commit adds short 15s runs of the load generator against the V1 and
V2 APIs during the three integration test runs (v1 config, v1
config-next, and v2). 15s was selected because 30s caused too much
output and the build log to be truncated.
Presently the latency output is *not* being checked for errors. This was
too flaky in practice.
A fix for a race condition in the load-generator code itself related to
HTTP status code tracking is included in this commit.
The pending authz rate limit also needed to be adjusted to keep the
load-generator from failing requests after hitting 429s.
This code was never enabled in production. Our original intent was to
ship this as part of the ACMEv2 API. Before that could happen flaws were
identified in TLS-SNI-01|02 that resulted in TLS-SNI-02 being removed
from the ACME protocol. We won't ever be enabling this code and so we
might as well remove it.
Previously we introduced the concept of a "pending orders per account
ID" rate limit. After struggling with making an implementation of this
rate limit perform well we reevaluated the problem and decided a "new
orders per account per time window" rate limit would be a better fit for
ACMEv2 overall.
This commit introduces the new newOrdersPerAccount rate limit. The RA
now checks this before creating new pending orders in ra.NewOrder. It
does so after order reuse takes place ensuring the rate limit is only
applied in cases when a distinct new pending order row would be created.
To accomplish this a migration for a new orders field (created) and an
index over created and registrationID is added. It would be possible to
use the existing expires field for this like we've done in the past, but that
was primarily to avoid running a migration on a large table in prod. Since
we don't have that problem yet for V2 tables we can Do The Right Thing
and add a column.
For deployability the deprecated pendingOrdersPerAccount code & SA
gRPC bits are left around. A follow-up PR will be needed to remove
those (#3502).
Resolves#3410
When we originally deployed the inline CT submission code, we wanted to
be conservative in case it increased our issuance error rate. However,
we've established that the success rate is quite good, so we'll remove
some complexity and make things more realistic by removing the code that
avoids returning errors when CT submission fails.
This change updates boulder-tools to use Go 1.10, and references a
newly-pushed image built using that new config.
Since boulder-tools pulls in the latest Certbot master at the time of
build, this also pulls in the latest changes to Certbot's acme module,
which now supports ACME v2. This means we no longer have to check out
the special acme-v2-integration branch in our integration tests.
This also updates chisel2.py to reflect some of the API changes that
landed in the acme module as it was merged to master.
Since we don't need additional checkouts to get the ACMEv2-compatible
version of the acme module, we can include it in the default RUN set for
local tests.
This commit adds new ACMEv2 actions for the load-generator and an
example configuration for load testing ACME v2 against a local boulder
instance.
The load-generator's existing V1 code remains but the two ACME versions
can not be intermixed in the same load generator plan. The load
generator should be making 100% ACME v1 action calls or 100% ACME v2
action calls.
Follow-up work:
* Adding a short load generator run for V1 and V2 to the integration tests/CI
* Randomizing the # of names in pending orders - right now they are always 1 identifier orders.
* Making it easier to save state on an initial run without needing to create an empty JSON file
* DNS-01 support & Wildcard names
I'm going to start chipping at the "Follow-up" items but wanted to get the meat of this PR up for review now.
Resolves#3137
Use `t.Helper` and `t.Fatalf` instead of our own versions.
Remove some unused or single-user helpers.
Make the output of `AssetUnmarshaledEquals` clearer by showing one line per field.
Add a set of logs which will be submitted to but not relied on for their SCTs,
this allows us to test submissions to a particular log or submit to a log which
is not yet approved by a browser/root program.
Also add a feature which stops cancellations of remaining submissions when racing
to get a SCT from a group of logs.
Additionally add an informational log that always times out in config-next.
Fixes#3464 and fixes#3465.
The test for the certificates_per_name rate limit uses an exact domain
name that has an override in the rate limit config file to have a limit
of 0. This works correctly most of the time. However, if that mechanism
fails once (due to some bug), future runs of the integation test will
continue to fail, because there will now be an issued certificate for
"lim.it" in the DB, and subsequent attempts will be considered renewals.
This change adds a random subdomain to the test, so that it's not
eligible for the renewal exemption.
This was added in the theory that it would stop some errors running
`docker compose go generate ./...` but in tests today, that command
succeeds right now even after removing this from the build steps and
rebuilding. This is not surprising, because as Roland pointed out, the
/alt-gopath/ path was never actually referenced anywhere.
We recently reordered the programs in startservers.py to reflect
dependencies, so we wouldn't generate gRPC errors when a process comes
up before its backend. However, we missed the remote VAs. This change
moves them to the beginning.
Also, don't print "all processes started" after each process starts.
chisel had verify_ssl=False. Remove that, and set a sensible default
for REQUESTS_CA_BUNDLE to make it easier to run chisel on the command
line. Port the REQUESTS_CA_BUNDLE change into chisel2 as well.
Removes usage of the `EnforceChallengeDisable` feature, the feature itself is not removed as it is still configured in staging/production, once that is fixed I'll submit another PR removing the actual flag.
This keeps the behavior that when authorizations are retrieved from the SA they have their challenges populated, because that seems to make the most sense to me? It also retains TLS re-validation.
Fixes#3441.
Boulder is fairly noisy about gRPC connection errors. This is a mixed
blessing: Our gRPC configuration will try to reconnect until it hits
an RPC deadline, and most likely eventually succeed. In that case,
we don't consider those to really be errors. However, in cases where
a connection is repeatedly failing, we'd like to see errors in the
logs about connection failure, rather than "deadline exceeded." So
we want to keep logging of gRPC errors.
However, right now we get a lot of these errors logged during
integration tests. They make the output hard to read, and may disguise
more serious errors. So we'd like to avoid causing such errors in
normal integration test operation.
This change reorders the startup of Boulder components by their gRPC
dependencies, so everything's backend is likely to be up and running
before it starts. It also reverses that order for clean shutdowns,
and waits for each process to exit before signalling the next one.
With these changes, I still got connection errors. Taking listenbuddy
out of the gRPC path fixed them. I believe the issue is that
listenbuddy is not a truly transparent proxy. In particular, it
accepts an inbound TCP connection before opening an outbound TCP
connection. If opening that outbound connection results in "connection
refused," it closes the inbound connection. That means gRPC sees a
"connection closed" (or "connection reset"?) rather than "connection
refused". I'm guessing it handles those cases differently, explaining
the different error results.
We've been using listenbuddy to trigger disconnects while Boulder is
running, to ensure that gRPC's reconnect code works. I think we can
probably rely on gRPC's reconnect to work. The initial problem that
led us to start testing this was a configuration problem; now that
we have the configuration we want, we should be fine and don't need
to keep testing reconnects on every integration test run.
For the upcoming SCT embedding changes, it will be useful to have a CT test server that blocks for nontrivial amounts of time before responding. This change introduces a config file for `ct-test-srv` that can be used to set up multiple "personalities" on various ports. Each personality can have a "latencySchedule" that determines how long it will sleep before servicing responding to a submission.
This change also introduces two new "personalities" on :4510 and :4511, plus configures CTLogGroups in the RA. Having four CT log personalities allows us to simulate two nontrivial log groups.
Note: This triggers Publisher to emit audit errors on timed-out submissions. We may want to make Publisher not treat those as errors, and instead only log an error if a whole log group fails.
This commit resolves the case where an error during finalization occurs.
Prior to this commit if an error (expected or otherwise) occurred after
setting an order to status processing at the start of order
finalization the order would be stuck processing forever.
The SA now has a `SetOrderError` RPC that can be used by the RA to
persist an error onto an order. The order status calculation can use
this error to decide if the order is invalid. The WFE is updated to
write the error to the order JSON when displaying the order information.
Prior to this commit the order protobuf had the error field as
a `[]byte`. It doesn't seem like this is the right decision, we have
a specific protobuf type for ProblemDetails and so this commit switches
the error field to use it. The conversion to/from `[]byte` is done with
the model by the SA.
An integration test is included that prior to this commit left an order
in a stuck processing state. With this commit the integration test
passes as expected.
Resolves https://github.com/letsencrypt/boulder/issues/3403