Per the documentation we should be using *|UNSUB:url|* not |UNSUB:url|. I also confirmed that the template that predates the Boulder data/ version used *|UNSUB:https://mandrillapp.com/unsub|*.
I also moved the "WARNING" onto the preceding line. The unsubscribe URL is quite long and gnarly once it's been filled in, and the warning could be missed in a MUA that doesn't wrap lines well.
This resolves #2523
Add a new tiny client called chisel, in place of test.js. This reduces the
number of language runtimes Boulder depends on for its tests. Also, since chisel
uses the acme Python library, we get more testing of that library, which
underlies Certbot. This also gives us more flexibility to hook different parts
of the issuance flows in our tests.
Reorganize integration-test.py itself. There was no clear separation between
specific test cases: some were added as part of run_node_test, and some were
wrapped around it. We are now much closer to one function per test case.
Eventually we may be able to adopt Python's test infrastructure for these test
cases.
Remove some unused imports; consolidate on urllib2 instead of urllib.
For getting serial number and expiration date, replace shelling out to OpenSSL
with using pyOpenSSL, since we already have an in-memory parsed certificate.
Replace ISSUANCE_FAILED, REVOCATION_FAILED, MAILER_FAILED with simple die, since
we don't use these. Later, I'd like to remove the other specific exit codes. We
don't make very good use of them, and it would be more effective to just use
stack traces or, even better, reporting of which test cases failed.
Make test_single_ocsp_sign responsible for its own subprocess lifecycle.
Skip running startservers if WFE is already running, to make it easier to
iterate against a running Boulder (saves a few seconds of Boulder startup).
Previously we had `Error` and `ValidationRecords` fields in the `Challenge` protobuf, but they were never populated, which meant that when using gRPC these fields wouldn't be sent to the SA from the RA on a `FinalizeAuthorization` call. This change populates those fields and updates the PB marshaling tests to verify the correct behavior.
Fixes #2514.
Fixes #976.
This implements a new rate limit, InvalidAuthorizationsPerAccount. If a given account fails authorization for a given hostname too many times within the window, subsequent new-authz attempts for that account and hostname will fail early with a rateLimited error. This mitigates the impact of misconfigured clients that constantly retry authorization even though they always fail (e.g., because the hostname no longer resolves).
For the new rate limit, I added a new SA RPC, CountInvalidAuthorizations. I chose to implement this only in gRPC, not in AMQP-RPC, so checking the rate limit is gated on gRPC. See #2406 for some description of the how and why. I also chose to directly use the gRPC interfaces rather than wrapping them in core.StorageAuthority, as a step towards what we will want to do once we've moved fully to gRPC.
Because authorizations don't have a created time, we need to look at the expires time instead. Invalid authorizations retain the expiration they were given when they were created as pending authorizations, so we use now + pendingAuthorizationLifetime as one side of the window for rate limiting, and look backwards from there. Note that this means you could maliciously bypass this rate limit by stacking up pending authorizations over time, then failing them all at once.
Similarly, since this limit is by (account, hostname) rather than just (hostname), you can bypass it by creating multiple accounts. It would be more natural and robust to limit by hostname, like our certificate limits. However, we currently only have two indexes on the authz table: the primary key, and
(`registrationID`,`identifier`,`status`,`expires`)
Since this limit is intended mainly to combat misconfigured clients, I think this is sufficient for now.
Corresponding PR for website: letsencrypt/website#125
This PR has three primary contributions:
1. The existing code for using the V4 safe browsing API introduced in #2446 had some bugs that are fixed in this PR.
2. A gsb-test-srv is added to provide a mock Google Safebrowsing V4 server for integration testing purposes.
3. A short integration test is added to test end-to-end GSB lookup for an "unsafe" domain.
For 1), most notably, Boulder assumed the new V4 library accepted a directory for its database persistence, when it instead expects to be given an existing file. Additionally, the VA wasn't properly instantiating feature flags, which prevented the V4 API from being used by the VA.
For 2), the test server is designed to have a fixed set of "bad" domains (currently just honest.achmeds.discount.hosting.com). When a client asks for a database update, it packages up the list of bad domains and sends them to the client. When the client is asked to do a URL lookup, it checks the local database for a matching prefix and, if one is found, performs a lookup against the test server. The test server processes the lookup and increments a count of how many times the bad domain was asked about.
For 3), Boulder's startservers.py was updated to start the gsb-test-srv, and the VA is configured to talk to it using the V4 API. The integration test consists of attempting issuance for a domain pre-configured in the gsb-test-srv as a bad domain. If the issuance succeeds, we know the GSB lookup code is faulty. If the issuance fails, we check that the gsb-test-srv received the correct number of lookups for the "bad" domain, and fail if the expected count doesn't match reality.
Notes for reviewers:
* The gsb-test-srv has to be started before anything will use it. Right now the v4 library handles database update request failures poorly and will not retry for 30min. See google/safebrowsing#44 for more information.
* There's not an easy way to test for "good" domain lookups, only hits against the list. The design of the V4 API is such that a list of prefixes is delivered to the client in the db update phase, and if the domain in question matches no prefixes then the lookup is deemed unnecessary and not performed. I experimented with sending 256 one-byte prefixes to try to trick the client into always doing a lookup, but the minimum prefix size is 4 bytes and enumerating all possible prefixes seemed gross.
* The test server has a /add endpoint that could be used by integration tests to add new domains to the block list, but it isn't being used presently. The trouble is that the client currently only updates its database every 30 minutes, so adding a new domain only takes effect after the client next updates its database.
Resolves #2448
If the VA fails to validate a TLS-SNI-01 challenge because it is trying to talk TLS to an HTTP-only server, return a special error message that is slightly more informative.
Add a super basic counter for certificate and OCSP signatures so we have a slightly less noisy idea of our current HSM signing performance and where it is going.
Fixes #2438.
Currently services will pass both `core.XXXError` and `probs.XXX` type errors across the gRPC layer. In the future (#2505) we intend to stop passing `probs.XXX` type errors across this layer, but for now we need to support them until that change lands. This patch takes the easiest path to allow this by encoding the `probs.ProblemDetails` to JSON and storing it in the gRPC error body so that it can be passed around.
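The round trip described above can be sketched roughly as below. This is a hedged illustration: the `ProblemDetails` struct and the `wrapProblem`/`unwrapProblem` helpers are stand-ins invented for this sketch, not Boulder's actual types or gRPC plumbing.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ProblemDetails is a minimal stand-in for probs.ProblemDetails; the field
// names here are illustrative, not Boulder's exact definitions.
type ProblemDetails struct {
	Type       string `json:"type"`
	Detail     string `json:"detail"`
	HTTPStatus int    `json:"status"`
}

// wrapProblem encodes the problem as JSON and stores it in the error body,
// so it can cross the gRPC boundary as an opaque string.
func wrapProblem(p *ProblemDetails) error {
	b, err := json.Marshal(p)
	if err != nil {
		return err
	}
	return errors.New(string(b))
}

// unwrapProblem attempts the reverse: decode a received error body back
// into a ProblemDetails.
func unwrapProblem(err error) (*ProblemDetails, bool) {
	var p ProblemDetails
	if json.Unmarshal([]byte(err.Error()), &p) != nil {
		return nil, false
	}
	return &p, true
}

func main() {
	err := wrapProblem(&ProblemDetails{Type: "urn:acme:error:rateLimited", Detail: "too many requests", HTTPStatus: 429})
	p, ok := unwrapProblem(err)
	fmt.Println(ok, p.Type, p.HTTPStatus)
}
```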
Fixes #2497.
Akamai expects a special URL to be used to purge responses to POST'd requests; this change adds those URLs to the existing GET URLs when purging OCSP responses.
Note this assumes all POST requests encode their OCSP request in the same manner that Golang does, which is most likely not true in all cases. In order to mimic all the possible formats we would need to write a bunch of custom ASN.1 definitions that conform to the specific field use/order that browsers etc. use, and generate a URL for each of those as well.
Fixes #996.
Prefer up over start to allow prometheus container to find boulder.
Use ocspMinTimeToExpiry: 0h trick instead of updating DB manually.
Offer command to fill DB.
Offer Prometheus link to throughput graph.
We turn arrays into maps with a range loop. Previously, we were taking the
address of the iteration variable in that range loop, which produced incorrect
results since the iteration variable gets reassigned on each iteration.
Also change the integration test to catch this error.
Fixes #2496
Previously the comment was wrong (left off the leading `\0`), but the base64-encoded `AUTH` line was right.
This change also modified the password from `paswd` to `passwd`.
* Make restarting boulder in docker nicer.
Handle SIGTERM in startservers.py.
Forcibly remove rsyslog pid to avoid error.
* Add explanatory comment.
* Send SIGTERM instead of kill.
* Further improvements.
- Handle SIGINT too.
- Use unbuffered mode for Python so the print statements (like "all servers
running") get printed right away rather than at shutdown
- Squelch an unnecessary OSError about interrupting the wait() call.
Pulls in logging improvements in OCSP Responder and the CT client, plus a handful of API changes. Also, the CT client verifies responses by default now.
This change includes some Boulder diffs to accommodate the API changes.
Previously, the expiration-mailer would always run with the default config, even
if BOULDER_CONFIG_DIR was used to point at config-next. This led to missing some
config parse problems that should have been test failures.
We're planning to add dial timeouts in prod, and want to make it consistent with dev.
Read timeout has to be fairly generous because it has to be at least as high as the longest query we expect to run. Dial timeout can be much more aggressive, because in all normal cases, dial should complete very quickly. This allows us to timeout connections more quickly when the database is slow or unavailable, returning 500 immediately rather than piling up connections and eventually returning 500.
This is mainly useful for services that serve user traffic directly and have a high number of maxDBConns. It's not as important for admin tools like the cert-checker, mailer, and so on.
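For illustration, the go-sql-driver/mysql driver expresses these as DSN parameters (`timeout` for dial, `readTimeout`/`writeTimeout` for I/O). The host, credentials, and durations below are placeholders, not Boulder's production values:

```go
package main

import "fmt"

// dsnWithTimeouts appends dial and read timeouts to a base MySQL DSN.
// An aggressive dial timeout fails fast when the DB is unreachable, while
// the read timeout stays generous enough for the longest expected query.
func dsnWithTimeouts(base, dial, read string) string {
	return fmt.Sprintf("%s?timeout=%s&readTimeout=%s", base, dial, read)
}

func main() {
	fmt.Println(dsnWithTimeouts("sa@tcp(boulder-mysql:3306)/boulder_sa", "5s", "95s"))
}
```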
By default the Boulder dev env's VA config test/config/va.json has
a portConfig that connects to port 5002 for HTTP-01 challenges and
port 5001 for TLS-SNI-01 challenges.
This can be confusing to someone who follows the FAKE_DNS guidelines
for using a client from the host machine but expects the VA to reach out
on port 80 or 443 like production/staging's VA.
This commit updates the README to include a note on the non-standard
ports and how to change them.
Set authorizationLifetimeDays to 60 across both config and config-next.
Set NumSessions to 2 in both config and config-next. A decrease from 10 because pkcs11-proxy (or pkcs11-daemon?) seems to error out under load if you have more sessions than CPUs.
Reorder parallelGenerateOCSPRequests to match config-next.
Remove extra tags for parsing yaml in config objects.
This makes it easier to test monitoring of local Boulder instances, and look at stats when load testing.
Note that this creates a dependency from prom -> boulder that can only be satisfied by `docker-compose up`
or `docker-compose up boulder`, not by `docker-compose run --service-ports boulder` (since
docker-compose doesn't see the latter as a full-fledged boulder instance).
Previously our gRPC client code called the wrong function, enabling server-side instead of client-side histograms.
Also, add a timing stat for the generate / store combination in OCSP Updater.
The result is thrown away, so this adds a slight performance overhead for no
benefit. In theory it would catch malformed certificates in the DB, but that is
the job of cert-checker.
We have a number of stats already expressed using the statsd interface. During
the switchover period to direct Prometheus collection, we'd like to make those
stats available both ways. This change automatically exports any stats exported
using the statsd interface via Prometheus as well.
This is a little tricky because Prometheus expects all stats to be registered
exactly once. Prometheus does offer a mechanism to gracefully recover from
registering a stat more than once by handling a certain error, but it is not
safe for concurrent access. So I added a concurrency-safe wrapper that creates
Prometheus stats on demand and memoizes them.
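The create-on-demand memoization pattern can be sketched as below. To keep the sketch self-contained it uses a toy `Counter` type rather than the real `prometheus.Counter`; the actual wrapper would hold Prometheus collectors and register them with a `prometheus.Registerer`, but the concurrency-safe memoization is the same.

```go
package main

import (
	"fmt"
	"sync"
)

// Counter is a toy stand-in for a Prometheus collector.
type Counter struct {
	mu    sync.Mutex
	value float64
}

func (c *Counter) Add(v float64) {
	c.mu.Lock()
	c.value += v
	c.mu.Unlock()
}

// registry memoizes counters by name so each stat is created (and, in the
// real version, registered with Prometheus) exactly once, even under
// concurrent access.
type registry struct {
	mu       sync.Mutex
	counters map[string]*Counter
}

func (r *registry) counter(name string) *Counter {
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.counters[name]; ok {
		return c
	}
	c := &Counter{}
	r.counters[name] = c
	return c
}

func main() {
	r := &registry{counters: map[string]*Counter{}}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.counter("grpcClient_calls").Add(1)
		}()
	}
	wg.Wait()
	fmt.Println(r.counter("grpcClient_calls").value) // prints 10
}
```

The real Prometheus client also offers `prometheus.AlreadyRegisteredError` for recovering from double registration, but as noted above, checking for it is not itself safe for concurrent access, hence the lock.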
In the process, made a few small required side changes:
- Clean "/" from method names in the gRPC interceptors. They are allowed in
statsd but not in Prometheus.
- Replace "127.0.0.1" with "boulder" as the name of our testing CT log.
Prometheus stats can't start with a number.
- Remove ":" from the CT-log stat names emitted by Publisher. Prometheus stats
can't include it.
- Remove a stray "RA" in front of some rate limit stats, since it was
duplicative (we were emitting "RA.RA..." before).
Note that this means two stat groups in particular are duplicated:
- Gostats* is duplicated with the default process-level stats exported by the
Prometheus library.
- gRPCClient* are duplicated by the stats generated by the go-grpc-prometheus
package.
When writing dashboards and alerts in the Prometheus world, we should be careful
to avoid these two categories, as they will disappear eventually. As a general
rule, if a stat is available with an all-lowercase name, choose that one, as it
is probably the Prometheus-native version.
In the long run we will want to create most stats using the native Prometheus
stat interface, since it allows us to add labels to metrics, which is very
useful. For instance, currently our DNS stats distinguish types of queries by
appending the type to the stat name. This would be more natural as a label in
Prometheus.
Previously, a given binary would have three TLS config fields (CA cert, cert,
key) for its gRPC server, plus each of its configured gRPC clients. In typical
use, we expect all three of those to be the same across both servers and clients
within a given binary.
This change reuses the TLSConfig type already defined for use with AMQP, adds a
Load() convenience function that turns it into a *tls.Config, and configures it
for use with all of the binaries. This should make configuration easier and more
robust, since it more closely matches usage.
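A rough sketch of the `Load()` convenience function described above, using only standard library calls. The struct fields and exact option choices are illustrative, not Boulder's precise definitions:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// TLSConfig mirrors the shape described above: one CA cert plus one
// cert/key pair shared by a binary's gRPC server and clients.
type TLSConfig struct {
	CACertFile string
	CertFile   string
	KeyFile    string
}

// Load turns the file-based config into a *tls.Config usable for both
// server and client authentication against the same CA.
func (t TLSConfig) Load() (*tls.Config, error) {
	caPEM, err := os.ReadFile(t.CACertFile)
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("failed to parse CA certificate")
	}
	cert, err := tls.LoadX509KeyPair(t.CertFile, t.KeyFile)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		RootCAs:      pool, // verifies peers when acting as a client
		ClientCAs:    pool, // verifies peers when acting as a server
		Certificates: []tls.Certificate{cert},
		ClientAuth:   tls.RequireAndVerifyClientCert,
	}, nil
}

func main() {
	// Missing files surface as errors rather than silently degrading TLS.
	_, err := TLSConfig{CACertFile: "no-such-ca.pem"}.Load()
	fmt.Println(err != nil)
}
```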
This change preserves temporary backwards-compatibility for the
ocsp-updater->publisher RPCs, since those are the only instances of gRPC
currently enabled in production.
This allows finer-grained control of which components can request issuance. The OCSP Updater should not be able to request issuance.
Also, update test/grpc-creds/generate.sh to reissue the certs properly.
Resolves #2417
In https://github.com/letsencrypt/boulder/pull/2453 we created
individual client certificates for each gRPC client. The "clientNames"
list in the `config-next/publisher.json` was updated for the new
component-specific SANs, but we neglected to update
`config/publisher.json`. This caused the `ocsp-updater` (which uses gRPC
in the base `config/` to talk to the `publisher`) to fail to connect.
This commit updates `config/publisher.json` to have the same clientNames
as `config-next/publisher.json` and resolves #2465
Previously the instructions assumed you had Go set up on your host, which
somewhat defeats the point of running Boulder inside Docker, since it requires
more initial setup. These instructions make first-time users less likely to hit
the oci runtime error described later in the README.
We recently made changes to the IANA suffixes, and you may want to pull them into the latest Boulder version.
```
➜ publicsuffix-go git:(master) git show -s
commit 3ea542729b4d7056a9d1356c9baf27bcad2bda7f
Author: Simone Carletti <weppos@weppos.net>
Date: Mon Jan 2 18:28:57 2017 +0100
Release 0.3.1
```
```
➜ publicsuffix-go git:(master) go test ./...
? github.com/weppos/publicsuffix-go/cmd/gen [no test files]
? github.com/weppos/publicsuffix-go/cmd/load [no test files]
ok github.com/weppos/publicsuffix-go/net/publicsuffix 0.045s
ok github.com/weppos/publicsuffix-go/publicsuffix 0.091s
```
v0.3.1 is tagged and signed with my PGP key.
https://github.com/weppos/publicsuffix-go/releases/tag/v0.3.1
Previously, all gRPC services used the same client and server certificates. Now,
each service has its own certificate, which it uses for both client and server
authentication, more closely simulating production.
This also adds aliases for each of the relevant hostnames in /etc/hosts. There
may be some issues if Docker decides to rewrite /etc/hosts while Boulder is
running, but this seems to work for now.
Right now we are using a third-party client for the Google Safe Browsing API, but Google has recently released their own [Golang library](https://github.com/google/safebrowsing) which also supports the newer v4 API. Using this library will let us avoid fixing some lingering race conditions & unpleasantness in our fork of `go-safebrowsing-api`.
This PR adds support for using the Google library & the v4 API in place of our existing fork when the `GoogleSafeBrowsingV4` feature flag is enabled in the VA "features" configuration.
Resolves https://github.com/letsencrypt/boulder/issues/1863
Per `CONTRIBUTING.md` I also ran the unit tests for the new dependency:
```
daniel@XXXXXXXXXX:~/go/src/github.com/google/safebrowsing$ go test ./...
ok github.com/google/safebrowsing 3.274s
? github.com/google/safebrowsing/cmd/sblookup [no test files]
? github.com/google/safebrowsing/cmd/sbserver [no test files]
? github.com/google/safebrowsing/cmd/sbserver/statik [no test files]
? github.com/google/safebrowsing/internal/safebrowsing_proto [no test files]
ok github.com/google/safebrowsing/vendor/github.com/golang/protobuf/jsonpb 0.012s
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/jsonpb/jsonpb_test_proto [no test files]
ok github.com/google/safebrowsing/vendor/github.com/golang/protobuf/proto 0.062s
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/proto/proto3_proto [no test files]
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/protoc-gen-go [no test files]
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/protoc-gen-go/descriptor [no test files]
ok github.com/google/safebrowsing/vendor/github.com/golang/protobuf/protoc-gen-go/generator 0.017s
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/protoc-gen-go/grpc [no test files]
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/protoc-gen-go/plugin [no test files]
ok github.com/google/safebrowsing/vendor/github.com/golang/protobuf/ptypes 0.009s
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/ptypes/any [no test files]
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/ptypes/duration [no test files]
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/ptypes/empty [no test files]
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/ptypes/struct [no test files]
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/ptypes/timestamp [no test files]
? github.com/google/safebrowsing/vendor/github.com/golang/protobuf/ptypes/wrappers [no test files]
? github.com/google/safebrowsing/vendor/github.com/rakyll/statik [no test files]
? github.com/google/safebrowsing/vendor/github.com/rakyll/statik/fs [no test files]
ok github.com/google/safebrowsing/vendor/golang.org/x/net/idna 0.003s
```
Prior to this commit, when there was an error from
`wfe.SA.GetRegistrationByKey`, and that error wasn't an instance of
`core.NoSuchRegistrationError`, `verifyPOST` converted the error into
a problem by sending it through `core.ProblemDetailsForError(err, "")`.
In this case, this isn't an appropriate strategy. The only possible
errors that can be sent through this function will not match any of the
`case` statements in `core.ProblemDetailsForError` and will be returned
by the `default` case:
```
default:
// Internal server error messages may include sensitive data, so we do
// not include it.
return probs.ServerInternal(msg)
```
Since `verifyPOST` calls this function with `msg = ""`,
`ProblemDetailsForError` will return an empty `ServerInternalProblem`. When the
caller of `verifyPOST` gives the returned server-internal problem to `sendError`
it will produce: `"Internal error - - %s<nil>"` because the problem's detail
is "" and the error code given to `sendError` is nil.
Having examined the code paths leading into `verifyPOST`,
`core.ProblemDetailsForError` will never match anything but the default case
here, producing a blank message, so the proper fix is to stop using
`ProblemDetailsForError` at all and instead directly instantiate a
`ServerInternalProblem` with a suitable message.
Fixes two bugs in the Akamai cache purging library and one in the `ocsp-updater` and adds some tests to the Akamai library.
* The first was that the backoff logic was broken: the backoff was calculated but then discarded, because the code assumed the sleep happened inside `core.RetryBackoff`, when that function actually just returns the amount of time to back off.
* The second was that the internal HTTP client would only log errors if they were fatal, which was superfluous (the caller would also log fatal errors) and masked what the actual issue was during retries.
* The last, in `ocsp-updater`, was that `path.Join` was used to create a URL, which is not an intended use of the method since it attempts to clean paths. This meant that the scheme prefix `http://` would be 'cleaned' to `http:/`. Since Akamai had no idea what the malformed URLs referred to, it would return 403 Forbidden, which we consider a temporary error, so we would retry until failure.
As pointed out by @webczat (Thanks again!), my last update to `acme-divergences.md`
https://github.com/letsencrypt/boulder/pull/2402 was still a little bit off the mark accuracy-wise.
This PR resolves the problems (fingers crossed! 🎌) that remained with the
documentation of the Section 6.3 "URL field" divergence and the Section 5.4.1
"existing registration" divergence.
Resolves https://github.com/letsencrypt/boulder/issues/2414
This PR modifies the `expiration-mailer` utility to change the subject used in the reminder emails to include a domain name from the expiring certificate.
Previously unless otherwise specified using the `Mailer.Subject` configuration parameter all reminder emails were sent with the subject `Certificate expiration notice`. Both the `test/config/` and `test/config-next` expiration mailer configurations do not override the subject and were using the default.
With this PR, if no `Mailer.Subject` configuration parameter is provided then reminder emails are sent with the subject `Certificate expiration notice for domain "foo.bar.com"` in the case of only one domain in the expiring certificate, and `Certificate expiration for domain "foo.bar.com" (and $(n-1) more)` for the case where there are n > 1 domains (e.g. "(and 1 more)", "(and 2 more)" ...). I explicitly left support for the `Mailer.Subject` override to allow legacy configurations to function.
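The default-subject logic can be sketched as below. The function name is invented for this sketch, and since the description above shows slightly different prefixes for the one-domain and n-domain cases, the sketch uses the first ("Certificate expiration notice ...") consistently:

```go
package main

import "fmt"

// expirySubject builds the default reminder-email subject from the
// expiring certificate's domains, per the description above.
func expirySubject(domains []string) string {
	if len(domains) == 1 {
		return fmt.Sprintf("Certificate expiration notice for domain %q", domains[0])
	}
	// n > 1 domains: name the first and count the rest.
	return fmt.Sprintf("Certificate expiration notice for domain %q (and %d more)", domains[0], len(domains)-1)
}

func main() {
	fmt.Println(expirySubject([]string{"foo.bar.com"}))
	fmt.Println(expirySubject([]string{"foo.bar.com", "baz.com", "qux.org"}))
}
```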
I didn't explicitly add a new unit test for this behaviour because the existing unit tests were exercising both the "configuration override" portion of the subject behaviour, and matching the new expected subject. It would be entirely duplicated code to write a separate test for the subject template.
Resolves #2411
This PR splits up the expiration-mailer's `findExpiringCertificates` query into two parts:
1. One query to find `certificateStatus` serial numbers that match the search criteria
2. Sequential queries to find each `certificate` row for the results from 1.
This removes the `JOIN` on two large tables from the original `findExpiringCertificates` query and lets us shift load away from the database. https://github.com/letsencrypt/boulder/issues/2432 wasn't sufficient to reduce the load of this query.
Resolves https://github.com/letsencrypt/boulder/issues/2425
Skips adding tests for reg-revoke, as doing so would require significant changes to how test.js
functions, which would in turn require reworking a number of the other integration tests.
Updates #2340.
Similar to #2431 the expiration-mailer's `findExpiringCertificates()` query can be optimized slightly by using `certificateStatus.NotAfter` in place of `certificate.expires` in the `WHERE` clause of its query when the `CertStatusOptimizationsMigrated` feature is enabled.
Resolves https://github.com/letsencrypt/boulder/issues/2425