In publisher and in the integration test, check that SCTs are in a
reasonable range. Also, update CreateTestingSignedSCT (used by
ct-test-srv) to produce SCTs with a correct timestamp in Unix epoch
milliseconds.
- Remove acme-v2 test phase.
- Rename integration-test-v2.py to v2_integration, so it can be imported.
- Import all symbols from v2_integration before running test_*.
- In chisel2:
- Rename DIRECTORY so it doesn't collide.
- Incidental logging and error fixes.
- Merge v1 and v2 load testing into a single function.
- Run cert-checker just once, after all other test cases.
- In v2_integration:
- Remove unnecessary imports.
- Import chisel2 methods in the chisel2 namespace so they don't
collide with chisel methods.
- Remove main and shutdown code.
Previously, each time we defined a new test case in integration-test.py, we had to explicitly call it.
This made it easy to leave out cases without realizing it. After this change, we will automatically
find all functions whose names begin with "test_" and call them. As a result, I found that we weren't calling
`test_revoked_by_account`, and it was failing. So I fixed it as part of this PR.
Fixes #3518
Adds SCT embedding to the certificate issuance flow. When issuance is requested, a precertificate (the requested certificate, but poisoned with the critical CT extension) is issued and submitted to the required CT logs. Once the SCTs for the precertificate have been collected, a new certificate is issued with the poison extension replaced by an SCT list extension containing the retrieved SCTs.
Fixes #2244, fixes #3492, and fixes #3429.
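As a rough illustration of that swap (not Boulder's actual CA code), the sketch below drops the RFC 6962 poison extension and appends an SCT list extension; the helper names are invented, and a real implementation would handle SCT serialization through a CT library.

```go
package ca

import (
	"crypto/x509/pkix"
	"encoding/asn1"
)

// RFC 6962 extension OIDs: the CT poison extension carried by the
// precertificate, and the SCT list extension carried by the final certificate.
var (
	oidCTPoison = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 3}
	oidSCTList  = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2}
)

// sctListExtensionValue TLS-encodes the collected SCTs per RFC 6962 §3.3
// (each SCT length-prefixed, then the whole list length-prefixed) and wraps
// the result in a DER OCTET STRING for use as an X.509 extension value.
func sctListExtensionValue(scts [][]byte) ([]byte, error) {
	var list []byte
	for _, sct := range scts {
		list = append(list, byte(len(sct)>>8), byte(len(sct)))
		list = append(list, sct...)
	}
	wrapped := append([]byte{byte(len(list) >> 8), byte(len(list))}, list...)
	return asn1.Marshal(wrapped) // a []byte marshals as an OCTET STRING
}

// finalCertExtensions copies the precertificate's extensions, dropping the
// poison extension and appending the SCT list extension.
func finalCertExtensions(precertExts []pkix.Extension, scts [][]byte) ([]pkix.Extension, error) {
	var out []pkix.Extension
	for _, ext := range precertExts {
		if ext.Id.Equal(oidCTPoison) {
			continue // the final certificate must not carry the poison
		}
		out = append(out, ext)
	}
	sctValue, err := sctListExtensionValue(scts)
	if err != nil {
		return nil, err
	}
	out = append(out, pkix.Extension{Id: oidSCTList, Value: sctValue})
	return out, nil
}
```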
This commit adds short 15s runs of the load generator against the V1 and
V2 APIs during the three integration test runs (v1 config, v1
config-next, and v2). 15s was selected because 30s produced too much output
and caused the build log to be truncated.
Presently the latency output is *not* being checked for errors. This was
too flaky in practice.
A fix for a race condition in the load-generator code itself related to
HTTP status code tracking is included in this commit.
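The race fix itself isn't shown here; as a generic sketch of the bug class, a status-code tally shared by many request goroutines needs synchronization (the type and field names below are hypothetical):

```go
package loadgen

import "sync"

// statusTally counts HTTP response codes across concurrent requests.
// Without the mutex, concurrent map writes from multiple goroutines are a
// data race (and can panic with "concurrent map writes").
type statusTally struct {
	mu     sync.Mutex
	counts map[int]int
}

func newStatusTally() *statusTally {
	return &statusTally{counts: make(map[int]int)}
}

func (s *statusTally) record(code int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[code]++
}
```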
The pending authz rate limit also needed to be adjusted to keep the
load-generator from failing requests after hitting 429s.
The test for the certificates_per_name rate limit uses an exact domain
name that has an override in the rate limit config file to have a limit
of 0. This works correctly most of the time. However, if that mechanism
fails once (due to some bug), future runs of the integration test will
continue to fail, because there will now be an issued certificate for
"lim.it" in the DB, and subsequent attempts will be considered renewals.
This change adds a random subdomain to the test, so that it's not
eligible for the renewal exemption.
Boulder is fairly noisy about gRPC connection errors. This is a mixed
blessing: Our gRPC configuration will try to reconnect until it hits
an RPC deadline, and most likely eventually succeed. In that case,
we don't consider those to really be errors. However, in cases where
a connection is repeatedly failing, we'd like to see errors in the
logs about connection failure, rather than "deadline exceeded." So
we want to keep logging of gRPC errors.
However, right now we get a lot of these errors logged during
integration tests. They make the output hard to read, and may disguise
more serious errors. So we'd like to avoid causing such errors in
normal integration test operation.
This change reorders the startup of Boulder components by their gRPC
dependencies, so that each component's backends are likely to be up and
running before it starts. It also reverses that order for clean shutdowns,
and waits for each process to exit before signalling the next one.
With these changes, I still got connection errors. Taking listenbuddy
out of the gRPC path fixed them. I believe the issue is that
listenbuddy is not a truly transparent proxy. In particular, it
accepts an inbound TCP connection before opening an outbound TCP
connection. If opening that outbound connection results in "connection
refused," it closes the inbound connection. That means gRPC sees a
"connection closed" (or "connection reset"?) rather than "connection
refused". I'm guessing it handles those cases differently, explaining
the different error results.
We've been using listenbuddy to trigger disconnects while Boulder is
running, to ensure that gRPC's reconnect code works. I think we can
probably rely on gRPC's reconnect to work. The initial problem that
led us to start testing this was a configuration problem; now that
we have the configuration we want, we should be fine and don't need
to keep testing reconnects on every integration test run.
Previously, there was a disagreement between WFE and CA as to what the correct
issuer certificate was. Consolidate on test-ca2.pem (h2ppy h2cker fake CA).
Also, the CA configs contained an outdated entry for "IssuerCert", which was not
being used: The CA configs now use an "Issuers" array to allow signing by
multiple issuer certificates at once (for instance when rolling intermediates).
Removed this outdated entry and the CA config code that loaded it. I've
confirmed these changes match what is currently in production.
Added an integration test to check for this problem in the future.
Fixes #3309, thanks to @icing for bringing the issue to our attention!
This also includes changes from #3321 to clarify certificates for WFE.
Now, rather than LIMIT / OFFSET, this uses the highest id from the last batch in each new batch's query. This makes efficient use of the index, and means the database does not have to scan over a large number of non-expired rows before starting to find any expired rows.
This also changes the structure of the purge function to continually push ids for deletion onto a channel, to be processed by goroutines consuming that channel.
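A rough sketch of the id-keyed batching plus channel-fed deletion described above, using plain database/sql; the table and column names are assumptions rather than the purger's actual schema, and error handling is abbreviated.

```go
package purger

import (
	"database/sql"
	"log"
	"sync"
	"time"
)

// feedExpiredIDs pages through expired rows using the highest id seen so far
// instead of LIMIT/OFFSET, so each query can use the primary key index
// without scanning past non-expired rows. It closes ids when done.
func feedExpiredIDs(db *sql.DB, cutoff time.Time, batchSize int, ids chan<- int64) error {
	defer close(ids)
	var lastID int64
	for {
		rows, err := db.Query(
			`SELECT id FROM pendingAuthorizations WHERE id > ? AND expires <= ? ORDER BY id LIMIT ?`,
			lastID, cutoff, batchSize)
		if err != nil {
			return err
		}
		n := 0
		for rows.Next() {
			var id int64
			if err := rows.Scan(&id); err != nil {
				rows.Close()
				return err
			}
			ids <- id
			lastID = id
			n++
		}
		if err := rows.Err(); err != nil {
			return err
		}
		if n == 0 {
			return nil // no more expired rows
		}
	}
}

// deleteWorkers consumes ids from the channel and deletes each row.
func deleteWorkers(db *sql.DB, ids <-chan int64, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ids {
				if _, err := db.Exec(`DELETE FROM pendingAuthorizations WHERE id = ?`, id); err != nil {
					log.Printf("deleting authz %d: %s", id, err)
				}
			}
		}()
	}
	wg.Wait()
}
```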
Also, remove the --yes flag and prompting.
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.
Specifically, it adds two convenience functions, one for clients and one for servers, that make the necessary metrics object and register it. We run these in the main function of each server.
I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.
Also, update go-grpc-prometheus to get the necessary methods.
```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
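For reference, a minimal sketch of what the server-side convenience function might look like, assuming the go-grpc-prometheus and prometheus/client_golang APIs; the helper name and package are illustrative, not Boulder's actual code.

```go
package grpcmetrics

import (
	grpc_prometheus "github.com/grpc-ecosystem/go-grpc-prometheus"
	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

// NewServerMetrics makes a go-grpc-prometheus ServerMetrics object and
// registers it with the given (non-global) registerer, so gRPC metrics show
// up without relying on prometheus.DefaultRegisterer.
func NewServerMetrics(reg prometheus.Registerer) *grpc_prometheus.ServerMetrics {
	metrics := grpc_prometheus.NewServerMetrics()
	reg.MustRegister(metrics)
	return metrics
}

// newGRPCServer shows how the metrics object is then wired into a server.
func newGRPCServer(metrics *grpc_prometheus.ServerMetrics) *grpc.Server {
	srv := grpc.NewServer(
		grpc.UnaryInterceptor(metrics.UnaryServerInterceptor()),
		grpc.StreamInterceptor(metrics.StreamServerInterceptor()),
	)
	// InitializeMetrics pre-populates per-method counters once services are
	// registered on srv.
	metrics.InitializeMetrics(srv)
	return srv
}
```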
There were two bugs in #3167:
1. All process-level stats got prefixed with "boulder", which broke dashboards.
2. All request_time stats got dropped, because measured_http was using the prometheus DefaultRegisterer.
To fix, this PR plumbs through a scope object to measured_http, and uses an empty prefix when calling NewProcessCollector().
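As an illustration of the shape of the fix (not the actual Boulder code), a dedicated registry can be handed to both the process collector and the metrics handler, with an empty namespace so process-level stats keep their standard names. This uses the current client_golang API, which may differ from the version in use at the time.

```go
package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// newRegistry builds an explicit registry (instead of relying on
// prometheus.DefaultRegisterer) and registers a process collector with an
// empty namespace, so process_* metrics are not prefixed with "boulder".
func newRegistry() *prometheus.Registry {
	reg := prometheus.NewRegistry()
	reg.MustRegister(prometheus.NewProcessCollector(prometheus.ProcessCollectorOpts{
		Namespace: "", // empty prefix keeps the standard process_* names
	}))
	return reg
}

// metricsHandler exposes the registry over HTTP; anything measuring request
// times must be given the same registry (or a scope wrapping it) rather than
// quietly using the default one.
func metricsHandler(reg *prometheus.Registry) http.Handler {
	return promhttp.HandlerFor(reg, promhttp.HandlerOpts{})
}
```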
In 2fb247488f we consolidated the
`regModelV2` and `regModelv1` structs to one `regModel` type. In the
process we accidentally stopped setting the to-be-updated registration
model's `LockCol` to the value of the existing registration's `LockCol`.
This meant that the Update was occurring with a where clause `LockCol=0`
(the default value).
In practice, the first update of a registration would succeed (since the
reg row starts with LockCol=0), but updating any reg that had already
been updated once before would modify 0 rows (because the where clause
on `LockCol` didn't match), and this in turn was translated into
a ServerInternal error, since we knew the reg being updated did exist.
This commit updates the SA's `UpdateRegistration` function to properly
set the `LockCol` on the to-be-updated row.
This commit additionally adds an integration test for registration
contact information updating to ensure we don't fall into this trap in
the future.
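Boulder's SA does this through gorp, but the optimistic-locking mechanism involved can be illustrated with plain database/sql; the table and column names below are simplified stand-ins.

```go
package sa

import (
	"database/sql"
	"errors"
)

// updateRegistration is a simplified stand-in for the SA's UpdateRegistration.
// LockCol acts as an optimistic-locking column: the UPDATE only matches when
// the stored LockCol equals the value we read, and bumps it on success.
func updateRegistration(db *sql.DB, id, existingLockCol int64, contact string) error {
	res, err := db.Exec(
		`UPDATE registrations SET contact = ?, LockCol = LockCol + 1
		 WHERE id = ? AND LockCol = ?`,
		contact, id, existingLockCol)
	if err != nil {
		return err
	}
	rows, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if rows == 0 {
		// This is what went wrong before the fix: the model's LockCol was
		// left at its zero value instead of the existing row's value, so any
		// registration that had been updated before matched 0 rows, and the
		// SA turned that into a ServerInternal error.
		return errors.New("registration update modified 0 rows (stale LockCol)")
	}
	return nil
}
```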
Previously, CAA problems were lumped in under "ConnectionProblem" or
"Unauthorized". This should make things clearer and easier to differentiate.
Fixes #3043
Fixes #3020.
In order to write integration tests for some features, especially those related to rate limiting, rechecking of CAA, and expiration of authzs, orders, and certs, we need to be able to fake the passage of time.
To do so, this change switches out all clock.Default() instances for cmd.Clock(), which can be set manually with the FAKECLOCK environment variable. integration-test.py now starts up all servers once before the main body of tests, with FAKECLOCK set to a date 70 days ago, and does some initial setup for a new integration test case. That test case tries to fetch a 70-day-old authz URL, and expects it to 404.
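A minimal sketch of how a cmd.Clock()-style helper can honor FAKECLOCK, assuming the github.com/jmhodges/clock package; the timestamp format parsed here is an assumption, not necessarily what the integration tests actually pass.

```go
package cmd

import (
	"log"
	"os"
	"time"

	"github.com/jmhodges/clock"
)

// Clock returns the real system clock unless FAKECLOCK is set, in which case
// it returns a fake clock pinned to that time. Components that previously
// called clock.Default() directly call this instead, so integration tests can
// move them backwards or forwards in time.
func Clock() clock.Clock {
	fake := os.Getenv("FAKECLOCK")
	if fake == "" {
		return clock.Default()
	}
	t, err := time.Parse(time.RFC3339, fake) // the format is an assumption here
	if err != nil {
		log.Fatalf("parsing FAKECLOCK: %s", err)
	}
	fc := clock.NewFake()
	fc.Set(t)
	return fc
}
```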
In order to make this work, I also had to change a number of our test binaries to shut down cleanly in response to SIGTERM. Without that change, stopping the servers between the setup phase and the main tests caused startservers.check() to fail, because some processes exited with nonzero status.
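The clean-shutdown change amounts to catching SIGTERM and exiting with status 0, so a deliberate stop doesn't read as a crash to the test harness; a generic sketch:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// ... start the test server's listeners here ...

	// Exit cleanly (status 0) on SIGTERM instead of dying with the default
	// signal disposition, which reads as a nonzero exit to the test harness.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, syscall.SIGINT)
	sig := <-sigs
	fmt.Printf("received %s, shutting down cleanly\n", sig)
	os.Exit(0)
}
```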
Note: This is an initial stab at things, to prove out the technique. Long-term, I think we will want to use an idiom where test cases are classes that have a number of optional setup phases that may be run at e.g. 70 days prior and 5 days prior. This could help us avoid a proliferation of global state as we add more time-dependent test cases.
As described in Boulder issue #2800, the implementation of the SA's
`countCertificates` function meant that the renewal exemption for the
Certificates Per Domain rate limit was difficult to work with. To
maximize allotted certificates, clients were required to perform all new
issuances first, followed by the "free" renewals. This arrangement was
difficult to coordinate.
In this PR, `countCertificates` is updated so that renewals are
reliably excluded from the count. To do so, the SA takes the serials it
finds for a given domain from the issuedNames table and cross-references
them with the FQDN sets it can find for the associated serials. With the
FQDN sets, a second query is done to find all the non-renewal FQDN sets
for the serials, giving a count of the total non-renewal issuances to
use for rate limiting.
Resolves #2800
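A rough sketch of the query shape this describes; the table and column names (issuedNames, fqdnSets, reversedName, setHash, issued) are assumptions about the schema, and the real SA splits and parameterizes this differently.

```go
package sa

import (
	"database/sql"
	"time"
)

// countNonRenewals counts issuances for a domain in a window, excluding
// renewals. A renewal here means an FQDN set that had already been issued
// for before the window began.
func countNonRenewals(db *sql.DB, reversedDomain string, earliest, latest time.Time) (int, error) {
	// Step 1: serials issued for the domain in the window, joined to the
	// FQDN set hash each serial was issued for.
	// Step 2: keep only set hashes with no issuance before the window,
	// i.e. exclude sets that were already issued earlier (renewals).
	var count int
	err := db.QueryRow(`
		SELECT COUNT(1)
		FROM issuedNames AS n
		JOIN fqdnSets AS f ON f.serial = n.serial
		WHERE n.reversedName = ?
		  AND n.notBefore BETWEEN ? AND ?
		  AND NOT EXISTS (
		    SELECT 1 FROM fqdnSets AS prior
		    WHERE prior.setHash = f.setHash
		      AND prior.issued < ?
		  )`,
		reversedDomain, earliest, latest, earliest).Scan(&count)
	if err != nil {
		return 0, err
	}
	return count, nil
}
```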
The `submissions_b` count in the integration test's `test_ct_submission` function was initially being populated using `url_a` when it _should_ have been initialized using `url_b`, since it's the count of submissions to log b.
This resolves https://github.com/letsencrypt/boulder/issues/2723
I tested this fix with a branch that ran this test 12 times per build. Prior to this fix multiple builds out of 20 (~4-5) would fail. With this fix, all 20 passed.
In 18f4c5c we introduced a workaround for the CT submission integration
test to allow either exactly the expected number of CT log submissions, or
twice as many, to account for the case where the ocsp-updater and the CA race.
This didn't completely patch over the issue because the number of
submissions can fall between `n` and `2n`.
This commit updates the hack to be even hackier (twice as hacky or your
money back). Now we consider any value *between* `n` and `2n` as a test
pass.
Instead of running it at the current time to clean out leftover cruft, run it with a FAKECLOCK of +1 year so that we catch everything that could get in the way.
Fixes #2148.
Instead of just doing a blanket `DELETE FROM ...`, this changes the `expired-authz-purger` to select all of the expired IDs (for both pending and finalized authorizations), then loop over them, deleting each one and its associated challenges from their respective tables.
Local testing indicates the performance of this is not awful, but we should do a test run on staging to verify. If it ends up taking way too long to run there, the easiest optimization would be to turn the slice of IDs into a channel and run multiple workers looping over the channel and deleting rows, instead of just a single worker.
Makes a few small integration test changes in order to facilitate deleting both pending and finalized authorizations.
Presently the CA and the ocsp-updater can race on the initial
submission of a certificate to the configured logs. This results
in double submitting certificates. In integration tests with the fake CT
server this manifests as an occasional failure of the
`test_ct_submission` test (Issue #2579).
The race we currently experience is expected to be fixed in
the future by a planned redesign, so for now this commit works around the
failure by allowing either the expected number of submissions, or
exactly double the expected number. This fixes #2579. The need to fix the
underlying race was captured in #2610.
The workaround was verified by submitting 10 builds to Travis; all
succeeded.
Simply rely on exceptions from check_output.
Also, factor out common params for check_output into a `run` helper function.
Makes sure we always capture stderr into stdout.
This PR has three primary contributions:
1. The existing code for using the V4 safe browsing API introduced in #2446 had some bugs that are fixed in this PR.
2. A gsb-test-srv is added to provide a mock Google Safebrowsing V4 server for integration testing purposes.
3. A short integration test is added to test end-to-end GSB lookup for an "unsafe" domain.
For 1), most notably, Boulder assumed the new V4 library accepted a directory for its database persistence when it instead expects an existing file to be provided. Additionally, the VA wasn't properly instantiating feature flags, which prevented the V4 API from being used by the VA.
For 2), the test server is designed to have a fixed set of "bad" domains (currently just honest.achmeds.discount.hosting.com). When asked for a database update by a client, it will package up the list of bad domains and send them to the client. When the client is asked to do a URL lookup, it will check the local database for a matching prefix and, if one is found, perform a lookup against the test server. The test server will process the lookup and increment a count of how many times the bad domain was asked about.
For 3), the Boulder startservers.py was updated to start the gsb-test-srv, and the VA is configured to talk to it using the V4 API. The integration test consists of attempting issuance for a domain pre-configured in the gsb-test-srv as a bad domain. If the issuance succeeds, we know the GSB lookup code is faulty. If the issuance fails, we check that the gsb-test-srv received the correct number of lookups for the "bad" domain and fail if the expected count doesn't match reality.
Notes for reviewers:
* The gsb-test-srv has to be started before anything will use it. Right now the v4 library handles database update request failures poorly and will not retry for 30 minutes. See google/safebrowsing#44 for more information.
* There's not an easy way to test for "good" domain lookups, only hits against the list. The design of the V4 API is such that a list of prefixes is delivered to the client in the db update phase, and if the domain in question matches no prefixes then the lookup is deemed unnecessary and not performed. I experimented with sending 256 one-byte prefixes to try and trick the client into always doing a lookup, but the minimum prefix size is 4 bytes and enumerating all possible prefixes seemed gross.
* The test server has a /add endpoint that could be used by integration tests to add new domains to the block list, but it isn't being used presently. The trouble is that the client only updates its database every 30 minutes at present, so adding a new domain will only take effect after the client updates the database.
Resolves #2448
Add a new tiny client called chisel, in place of test.js. This reduces the
number of language runtimes Boulder depends on for its tests. Also, since chisel
uses the acme Python library, we get more testing of that library, which
underlies Certbot. This also gives us more flexibility to hook different parts
of the issuance flows in our tests.
Reorganize integration-test.py itself. There was no clear separation between
specific test cases. Some test cases were added as part of run_node_test; some
were wrapped around it. There is now much closer to one function per test case.
Eventually we may be able to adopt Python's test infrastructure for these test
cases.
Remove some unused imports; consolidate on urllib2 instead of urllib.
For getting serial number and expiration date, replace shelling out to OpenSSL
with using pyOpenSSL, since we already have an in-memory parsed certificate.
Replace ISSUANCE_FAILED, REVOCATION_FAILED, MAILER_FAILED with simple die, since
we don't use these. Later, I'd like to remove the other specific exit codes. We
don't make very good use of them, and it would be more effective to just use
stack traces or, even better, reporting of which test cases failed.
Make single_ocsp_sign responsible for its own subprocess lifecycle.
Skip running startservers if WFE is already running, to make it easier to
iterate against a running Boulder (saves a few seconds of Boulder startup).
We turn arrays into maps with a range loop. Previously, we were taking the
address of the iteration variable in that loop, which produced incorrect
results since the iteration variable is reused and reassigned on each iteration.
Also change the integration test to catch this error.
Fixes #2496
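A generic illustration of the bug class, not Boulder's actual code (note that Go 1.22 later changed loop-variable scoping, so the buggy form no longer misbehaves on current Go):

```go
package main

import "fmt"

type item struct{ Name string }

func main() {
	items := []item{{"a"}, {"b"}, {"c"}}

	// Buggy (pre-Go 1.22): v is a single variable reused each iteration, so
	// every map entry stores the same address, which finally holds "c".
	buggy := make(map[string]*item)
	for _, v := range items {
		buggy[v.Name] = &v
	}

	// Fixed: take the address of the slice element (or of a per-iteration
	// copy), so each entry points at distinct storage.
	fixed := make(map[string]*item)
	for i := range items {
		fixed[items[i].Name] = &items[i]
	}

	fmt.Println(*buggy["a"], *fixed["a"]) // pre-1.22: {c} {a}; Go 1.22+: {a} {a}
}
```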
Previously, the expiration-mailer would always run with the default config, even
if BOULDER_CONFIG_DIR was used to point at config-next. This led to missing some
config parse problems that should have been test failures.
Use labels ending in _key for private key labels.
Create two separate slots in make-softhsm rather than overwriting a single slot.
Update make-softhsm instructions to point out both files to edit.
Improve error output from integration test.
Output base64-encoded DER, as expected by ocsp-responder.
Use flags instead of template for Status, ThisUpdate, NextUpdate.
Provide better help.
Remove old test (wasn't run automatically).
Add it to integration test, and use its output for integration test of issuer ocsp-responder.
Add another slot to boulder-tools HSM image, to store root key.
Under the new defaults, if the syslog section is missing, we'll use the default
config that we use in prod: no logs to stdout, INFO and below to syslog.
This allows us to remove the syslog section from prod configs, and potentially
move it to individual service configs in the future.
* Improve syslog defaults.
* Add stdout logging for purger test.
* Use plain int for sysloglevel.
* Fix JSON syntax
* Fix syslog config for expired-authz-purger.
https://github.com/letsencrypt/boulder/pull/1932
* Add cmd/expired-authz-purger with integration test
* Return count
* gofmt >.>
* add to boulder-config-next.json
* Commit missing file
* Exec on the dbMap
* fprintf the error message
* Review fixes + test
* Review fixes pt. 1
* Review fixes pt. 2 (actually add test file this time :|)
* Fix prompt
* Switch to using flag lib
* Use COUNT(1)
* Revert config -> flag stuff
* Review fixes
* Revert-revert COUNT(1) change
* Review fixes pt. 1
* Nest config struct
* Test review fixes
* Factor out getting future output with FAKECLOCK
* Review fixes pt. 2
* Review fixes pt. 3
This creates a new server, 'mail-test-srv', which is a simplistic SMTP
server that accepts mail and can report the received mail over HTTP.
An integration test is added that uses the new server to test the expiry
mailer.
The FAKECLOCK environment variable is used to force the expiry mailer to
think that the just-issued certificate is about to expire.
Additionally, the expiry mailer is modified to cleanly shut down its
SMTP connections.
Adds a dns-01 type validation to test.js and reworks dns-test-srv to allow changing TXT record values.
Also makes some changes to how integration-test.py works in order to reduce complexity now that the
ct-test-srv is working again.
Fixes https://github.com/letsencrypt/boulder/issues/1212.
This exposes a new constructor in amqp-rpc.go specifically for ActivityMonitor,
which overrides the normal routingKey to be the wildcard "#".
It also adds an expvar for the number of messages processed in ActivityMonitor,
and adds an integration test case that checks that ActivityMonitor has received
more than zero messages.
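A minimal sketch of an expvar message counter like the one described, using only the standard library; the variable name is illustrative.

```go
package main

import (
	"expvar"
	"log"
	"net/http"
)

// messagesProcessed is published by the expvar package under /debug/vars,
// so the integration test can check that the count is greater than zero.
var messagesProcessed = expvar.NewInt("messages_processed")

func handleMessage() {
	// ... process one AMQP delivery ...
	messagesProcessed.Add(1)
}

func main() {
	handleMessage()
	// expvar registers its handler on http.DefaultServeMux at /debug/vars.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```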
In https://github.com/letsencrypt/boulder/pull/1110 we put
the activate command in the wrong place so it didn't run if
LETSENCRYPT_PATH was set.
Also remove SIMPLE_HTTP_PORT which is no longer necessary. It was used to keep
the build passing as the client transitioned ports. The client now defaults to
5002.
This gets us closer to allowing the client repo to use
integration-test.py. They have a different path without "venv" in it for
their virtualenv setup.
Updates #1101
Travis:
* Downloads the Let's Encrypt client
* Installs system requirements for client
* Sets up virtualenv
Dockerfile:
* Buildout for development
* Includes numerous packages needed for integration testing
(including all of the above in Travis)
test.sh:
* If no path is defined for the LE client
* Download the Let's Encrypt client
* Set up virtualenv
test/amqp-integration-test.py:
* Run client test with sensible defaults
* One test: auth for foo.com
This allows us to use the same PKCS#11 key for both cert signing and OCSP
signing, and simplifies config and startup.
This also starts building with -tags pkcs11 in all scripts, which is required
now that the CA can choose between pkcs11 and non-pkcs11.
In order to successfully issue using a pkcs11 key, you'll need to run a version
of Go built off the master branch. The released versions are missing this
commit:
fe40cdd756,
which is necessary for PKCS#11 signing.