Our Travis output is quite verbose with the WFE output, and it's very
rare that we have to reference it. I'd like to remove the INFO-level
logs (i.e. the logs of every request) so that it's easier to see real
errors, and faster to scroll to the bottom of logs of failed runs.
Retains the existing logging of orphaned certs until we are confident that this
solution can fully replace it (even then we may want to keep it just for auditing, etc.).
Fixes #3636.
Things removed:
* features.EmbedSCTs (and all the associated RA/CA/ocsp-updater code etc)
* ca.enablePrecertificateFlow (and all the associated RA/CA code)
* sa.AddSCTReceipt and sa.GetSCTReceipt RPCs
* publisher.SubmitToCT and publisher.SubmitToSingleCT RPCs
Fixes #3755.
In staging/prod we use `doNotForceCN: false` for both the RA & CA
config. Switching this to `true` is blocked on CABF work that will
likely take considerable time.
In the short-term we should use `doNotForceCN: false` in `test/config`
and only use `doNotForceCN: true` in `test/config-next`.
We're currently stuck on gRPC v1.1 because of a breaking change to certificate validation in gRPC 1.8. Our gRPC balancer uses a static list of multiple hostnames, and expects to validate against those hostnames. However gRPC expects that a service is one hostname, with multiple IP addresses, and validates all those IP addresses against the same hostname. See grpc/grpc-go#2012.
If we follow gRPC's assumptions, we can rip out our custom Balancer and custom TransportCredentials, and will probably have a lower-friction time in general.
This PR is the first step in doing so. In order to satisfy the "multiple IPs, one port" property of gRPC backends in our Docker container infrastructure, we switch to Docker's user-defined networking. This allows us to give the Boulder container multiple IP addresses on different local networks, and gives it different DNS aliases in each network.
In startservers.py, each shard of a service listens on a different DNS alias for that service, and therefore a different IP address. The listening port for each shard of a service is now identical.
This change also updates the gRPC service certificates. Now, each certificate that is used in a gRPC service (as opposed to something that is "only" a client) has three names. For instance, sa1.boulder, sa2.boulder, and sa.boulder (the generic service name). For now, we are validating against the specific hostnames. When we update our gRPC dependency, we will begin validating against the generic service name.
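As an illustration of that validation behavior, here is a minimal sketch (not Boulder's gRPC credentials code) showing that a certificate carrying both the shard-specific name and the generic service name verifies against either; the hostnames are the ones described above:

```go
package main

import (
	"crypto/x509"
	"fmt"
)

func main() {
	// A certificate for the first SA shard carries both its shard-specific
	// name and the generic service name as SANs.
	cert := &x509.Certificate{DNSNames: []string{"sa1.boulder", "sa.boulder"}}

	// Today we validate against the specific hostname; after the gRPC
	// dependency is updated we will validate against the generic name.
	fmt.Println(cert.VerifyHostname("sa1.boulder")) // <nil>
	fmt.Println(cert.VerifyHostname("sa.boulder"))  // <nil>
}
```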
Incidentally, the DNS aliases feature of Docker allows us to get rid of some hackery in entrypoint.sh that inserted entries into /etc/hosts.
Note: Boulder now has a dependency on the DNS aliases feature in Docker. By default, docker-compose run creates a temporary container and doesn't assign any aliases to it. We now need to specify docker-compose run --use-aliases to get the correct behavior. Without --use-aliases, Boulder won't be able to resolve the hostnames it wants to bind to.
load generator: send correct ACMEv2 Content-Type on POST.
This PR updates the Boulder load-generator to send the correct ACMEv2 Content-Type header when POSTing to the ACME server. This is required for ACMEv2, and without it all POST requests to a WFE2 running a test/config-next configuration fail with 400 "malformed" errors. While only required by ACMEv2, this commit sends the header for ACMEv1 requests as well. No harm, no foul.
integration tests: allow running just the load generator.
Prior to this PR, an omission in an if statement in test/integration-test.py meant that you couldn't invoke it with just the --load argument to run only the load generator. This commit updates the conditional to allow that use case.
Adds SCT embedding to the certificate issuance flow. When issuance is requested, a precertificate (the requested certificate, poisoned with the critical CT extension) is issued and submitted to the required CT logs. Once the SCTs for the precertificate have been collected, a new certificate is issued with the poison extension replaced by an SCT list extension containing the retrieved SCTs.
Fixes #2244, fixes #3492, and fixes #3429.
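To make the extension swap concrete, here is a minimal sketch using the RFC 6962 OIDs; the helper name and the pre-serialized SCT list are illustrative, not the actual CA code:

```go
package main

import (
	"crypto/x509/pkix"
	"encoding/asn1"
	"fmt"
)

// OIDs defined by RFC 6962: the CT poison extension appears in the
// precertificate, the SCT list extension appears in the final certificate.
var (
	oidCTPoison = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 3}
	oidSCTList  = asn1.ObjectIdentifier{1, 3, 6, 1, 4, 1, 11129, 2, 4, 2}
)

// replacePoisonWithSCTs swaps the critical poison extension for an SCT list
// extension. sctList is assumed to be an already-serialized
// SignedCertificateTimestampList gathered from the required CT logs.
func replacePoisonWithSCTs(exts []pkix.Extension, sctList []byte) []pkix.Extension {
	out := make([]pkix.Extension, 0, len(exts))
	for _, ext := range exts {
		if ext.Id.Equal(oidCTPoison) {
			out = append(out, pkix.Extension{Id: oidSCTList, Value: sctList})
			continue
		}
		out = append(out, ext)
	}
	return out
}

func main() {
	precertExts := []pkix.Extension{
		{Id: oidCTPoison, Critical: true, Value: []byte{0x05, 0x00}}, // ASN.1 NULL
	}
	fmt.Println(replacePoisonWithSCTs(precertExts, []byte{} /* serialized SCT list */))
}
```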
Removes usage of the `EnforceChallengeDisable` feature. The feature itself is not removed, as it is still configured in staging/production; once that is fixed I'll submit another PR removing the actual flag.
This keeps the behavior that when authorizations are retrieved from the SA they have their challenges populated, because that seems to make the most sense to me. It also retains TLS re-validation.
Fixes #3441.
Boulder is fairly noisy about gRPC connection errors. This is a mixed
blessing: Our gRPC configuration will try to reconnect until it hits
an RPC deadline, and most likely eventually succeed. In that case,
we don't consider those to really be errors. However, in cases where
a connection is repeatedly failing, we'd like to see errors in the
logs about connection failure, rather than "deadline exceeded." So
we want to keep logging of gRPC errors.
However, right now we get a lot of these errors logged during
integration tests. They make the output hard to read, and may disguise
more serious errors. So we'd like to avoid causing such errors in
normal integration test operation.
This change reorders the startup of Boulder components by their gRPC
dependencies, so each component's backends are likely to be up and running
before it starts. It also reverses that order for clean shutdowns,
and waits for each process to exit before signalling the next one.
With these changes, I still got connection errors. Taking listenbuddy
out of the gRPC path fixed them. I believe the issue is that
listenbuddy is not a truly transparent proxy. In particular, it
accepts an inbound TCP connection before opening an outbound TCP
connection. If opening that outbound connection results in "connection
refused," it closes the inbound connection. That means gRPC sees a
"connection closed" (or "connection reset"?) rather than "connection
refused". I'm guessing it handles those cases differently, explaining
the different error results.
We've been using listenbuddy to trigger disconnects while Boulder is
running, to ensure that gRPC's reconnect code works. I think we can
probably rely on gRPC's reconnect to work. The initial problem that
led us to start testing this was a configuration problem; now that
we have the configuration we want, we should be fine and don't need
to keep testing reconnects on every integration test run.
The WFE2 doesn't check any of the feature flags that are configured in
the `test/config/wfe2.json` and `test/config-next/wfe2.json` config
files - we default to acting as if all new features are enabled for the
V2 work. This commit removes the flags from the config to avoid
confusion or expectations that changing the config will disable the
features.
Previously, there was a disagreement between WFE and CA as to what the correct
issuer certificate was. Consolidate on test-ca2.pem (h2ppy h2cker fake CA).
Also, the CA configs contained an outdated entry for "IssuerCert", which was not
being used: The CA configs now use an "Issuers" array to allow signing by
multiple issuer certificates at once (for instance when rolling intermediates).
Removed this outdated entry, and the CA config code that loaded it. I've
confirmed these changes match what is currently in production.
Added an integration test to check for this problem in the future.
Fixes #3309, thanks to @icing for bringing the issue to our attention!
This also includes changes from #3321 to clarify certificates for WFE.
- Encode certificate as PEM.
- Use lowercase for field names.
- Use termsOfServiceAgreed instead of Agreement
- Use a different ToS URL for v2 that points at the v2 HTTPS port.
Resolves #3280
This PR changes the VA's singleDialTimeout value from 5 * time.Second to 10 * time.Second. This will give slower servers a better chance to respond, especially for the multi-VA case where n requests arrive ~simultaneously.
This PR also bumps the RA->VA timeout by 5s and the WFE->RA timeout by 5s to accommodate the increased dial timeout. I put this in a separate commit in case we'd rather deal with this separately.
* Remove non-TLS support from mailer entirely
* Add a config option for trusted roots in expiration-mailer. If unset, it defaults to the system roots, so this does not need to be set in production.
* Use TLS in mail-test-srv, along with an internal root and localhost certificates signed by that root.
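A minimal sketch of what a TLS-only connection to the test mail server looks like; the root path, port, and ServerName here are illustrative assumptions, not the exact test values:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"io/ioutil"
	"log"
	"net/smtp"
)

func main() {
	// Trust only the internal test root that signed mail-test-srv's
	// localhost certificate (path is illustrative).
	rootPEM, err := ioutil.ReadFile("test/mail-test-srv/root-cert.pem")
	if err != nil {
		log.Fatal(err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(rootPEM) {
		log.Fatal("failed to parse root certificate")
	}

	// Connect over TLS from the start; plaintext SMTP is no longer supported.
	conn, err := tls.Dial("tcp", "localhost:9380", &tls.Config{
		RootCAs:    roots,
		ServerName: "localhost",
	})
	if err != nil {
		log.Fatal(err)
	}
	client, err := smtp.NewClient(conn, "localhost")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()
}
```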
This PR is the initial duplication of the WFE to create a WFE2
package. The rationale is briefly explained in `wfe2/README.md`.
Per #2822 this PR only lays the groundwork for further customization
and deduplication. Presently both the WFE and WFE2 are identical except
for the following configuration differences:
* The WFE offers HTTP and HTTPS on 4000 and 4430 respectively; the WFE2
offers HTTP and HTTPS on 4001 and 4431.
* The WFE has a debug port on 8000; the WFE2 uses the next free "8000
range" port and puts its debug service on 8013.
Resolves https://github.com/letsencrypt/boulder/issues/2822
Deletes github.com/streadway/amqp and the various RabbitMQ setup tools, etc. Changes how listenbuddy is used so that it proxies all of the gRPC client -> server connections, letting us test reconnection logic.
Fixes #2640 and #2562.
In order to provide the correct issuer certificate for older certificates after an issuer certificate rollover or when using multiple issuer certificates (e.g. RSA and ECDSA), use the AIA CA Issuer URL embedded in the certificate for the rel="up" link served by WFE. This behaviour is gated behind the UseAIAIssuerURL feature, which defaults to false.
To prevent MitM vulnerabilities in cases where the AIA URL is HTTP-only, it is upgraded to HTTPS.
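A minimal sketch of that HTTP-to-HTTPS upgrade; the function name and example URL are illustrative, not Boulder's exact implementation:

```go
package main

import (
	"fmt"
	"net/url"
)

// upgradeToHTTPS rewrites an http:// AIA CA Issuer URL to https:// before it
// is served as the rel="up" link, so clients are not pointed at a MitM-able URL.
func upgradeToHTTPS(aiaURL string) (string, error) {
	u, err := url.Parse(aiaURL)
	if err != nil {
		return "", err
	}
	if u.Scheme == "http" {
		u.Scheme = "https"
	}
	return u.String(), nil
}

func main() {
	upgraded, _ := upgradeToHTTPS("http://example.com/intermediate.crt")
	fmt.Println(upgraded) // https://example.com/intermediate.crt
}
```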
This also adds a test for the issuer URL returned by the /acme/cert endpoint. wfe/test/178.{crt,key} were regenerated to add the AIA extension required to pass the test.
/acme/cert was changed to return an absolute URL to the issuer endpoint (making it consistent with /acme/new-cert).
Fixes #1663
Based on #1780
Pulls in logging improvements in OCSP Responder and the CT client, plus a handful of API changes. Also, the CT client verifies responses by default now.
This change includes some Boulder diffs to accommodate the API changes.
Set authorizationLifetimeDays to 60 across both config and config-next.
Set NumSessions to 2 in both config and config-next. A decrease from 10 because pkcs11-proxy (or pkcs11-daemon?) seems to error out under load if you have more sessions than CPUs.
Reorder parallelGenerateOCSPRequests to match config-next.
Remove extra tags for parsing yaml in config objects.
We have a number of stats already expressed using the statsd interface. During
the switchover period to direct Prometheus collection, we'd like to make those
stats available both ways. This change automatically exports any stats exported
using the statsd interface via Prometheus as well.
This is a little tricky because Prometheus expects all stats to be registered
exactly once. Prometheus does offer a mechanism to gracefully recover from
registering a stat more than once by handling a certain error, but it is not
safe for concurrent access. So I added a concurrency-safe wrapper that creates
Prometheus stats on demand and memoizes them.
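A minimal sketch of that on-demand, memoized registration; the type and method names are illustrative, not the actual Boulder metrics code:

```go
package metrics

import (
	"strings"
	"sync"

	"github.com/prometheus/client_golang/prometheus"
)

// promAdapter creates Prometheus counters on demand and memoizes them, so a
// statsd-style Inc call can be mirrored to Prometheus without ever
// registering the same stat twice.
type promAdapter struct {
	mu       sync.Mutex
	registry *prometheus.Registry
	counters map[string]prometheus.Counter
}

func newPromAdapter(registry *prometheus.Registry) *promAdapter {
	return &promAdapter{registry: registry, counters: make(map[string]prometheus.Counter)}
}

// Inc mirrors a statsd-style Inc(name, value) call to a Prometheus counter,
// creating and registering the counter the first time a name is seen. Dots in
// statsd names are rewritten to underscores; characters Prometheus rejects
// ("/", ":") are assumed to have been cleaned up already, per the side
// changes listed below.
func (a *promAdapter) Inc(name string, value int64) {
	a.mu.Lock()
	defer a.mu.Unlock()
	promName := strings.Replace(name, ".", "_", -1)
	c, ok := a.counters[promName]
	if !ok {
		c = prometheus.NewCounter(prometheus.CounterOpts{
			Name: promName,
			Help: "automatically exported statsd stat",
		})
		a.registry.MustRegister(c)
		a.counters[promName] = c
	}
	c.Add(float64(value))
}
```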
In the process, made a few small required side changes:
- Clean "/" from method names in the gRPC interceptors. They are allowed in
statsd but not in Prometheus.
- Replace "127.0.0.1" with "boulder" as the name of our testing CT log.
Prometheus stats can't start with a number.
- Remove ":" from the CT-log stat names emitted by Publisher. Prometheus stats
can't include it.
- Remove a stray "RA" in front of some rate limit stats, since it was
duplicative (we were emitting "RA.RA..." before).
Note that this means two stat groups in particular are duplicated:
- Gostats* is duplicated with the default process-level stats exported by the
Prometheus library.
- gRPCClient* are duplicated by the stats generated by the go-grpc-prometheus
package.
When writing dashboards and alerts in the Prometheus world, we should be careful
to avoid these two categories, as they will disappear eventually. As a general
rule, if a stat is available with an all-lowercase name, choose that one, as it
is probably the Prometheus-native version.
In the long run we will want to create most stats using the native Prometheus
stat interface, since it allows us to add labels to metrics, which is very
useful. For instance, currently our DNS stats distinguish types of queries by
appending the type to the stat name. This would be more natural as a label in
Prometheus.
This allows finer-grained control of which components can request issuance. The OCSP Updater should not be able to request issuance.
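One way such a per-component allow list can be expressed is by checking the verified client certificate's SANs against the service's configured clientNames; this is an illustrative sketch under that assumption, not Boulder's actual credentials code:

```go
package grpccreds

import (
	"context"
	"errors"

	"google.golang.org/grpc/credentials"
	"google.golang.org/grpc/peer"
)

// checkClientName rejects RPCs whose verified client certificate does not
// carry a SAN from the service's configured clientNames list.
func checkClientName(ctx context.Context, allowed map[string]bool) error {
	p, ok := peer.FromContext(ctx)
	if !ok {
		return errors.New("no peer information on context")
	}
	tlsInfo, ok := p.AuthInfo.(credentials.TLSInfo)
	if !ok || len(tlsInfo.State.PeerCertificates) == 0 {
		return errors.New("no verified client certificate")
	}
	for _, name := range tlsInfo.State.PeerCertificates[0].DNSNames {
		if allowed[name] {
			return nil
		}
	}
	return errors.New("client certificate not authorized for this service")
}
```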
Also, update test/grpc-creds/generate.sh to reissue the certs properly.
Resolves #2417
In https://github.com/letsencrypt/boulder/pull/2453 we created
individual client certificates for each gRPC client. The "clientNames"
list in the `config-next/publisher.json` was updated for the new
component-specific SANs, but we neglected to update
`config/publisher.json`. This caused the `ocsp-updater` (which uses gRPC
in the base `config/` to talk to the `publisher`) to fail to connect.
This commit updates `config/publisher.json` to have the same clientNames
as `config-next/publisher.json` and resolves #2465.
Previously, all gRPC services used the same client and server certificates. Now,
each service has its own certificate, which it uses for both client and server
authentication, more closely simulating production.
This also adds aliases for each of the relevant hostnames in /etc/hosts. There
may be some issues if Docker decides to rewrite /etc/hosts while Boulder is
running, but this seems to work for now.
Previously all OCSP signing and storage would be serial, which meant it was hard
to exercise the full capacity of our HSM. In this change, we run a limited
number of update and store requests in parallel.
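A minimal sketch of bounding that parallelism with a buffered-channel semaphore; the function and variable names are illustrative, not the ocsp-updater's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// signAndStore stands in for signing a new OCSP response on the HSM and
// writing it back to the database (hypothetical stub).
func signAndStore(serial string) { fmt.Println("updated", serial) }

// updateInParallel runs a bounded number of sign-and-store operations
// concurrently, so the HSM sees more than one request at a time without
// being flooded.
func updateInParallel(serials []string, parallelism int) {
	sem := make(chan struct{}, parallelism)
	var wg sync.WaitGroup
	for _, serial := range serials {
		wg.Add(1)
		sem <- struct{}{} // blocks once `parallelism` goroutines are in flight
		go func(s string) {
			defer wg.Done()
			defer func() { <-sem }()
			signAndStore(s)
		}(serial)
	}
	wg.Wait()
}

func main() {
	updateInParallel([]string{"00aa", "00ab", "00ac"}, 2)
}
```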
This change also updates stats generation in generateOCSPResponses so we can
distinguish stats produced by new OCSP requests from those for existing ones,
and adds a new stat that records how long the SQL query in findStaleOCSPResponses
takes.
- Remove spinner from test.js. It made Travis logs hard to read.
- Listen on all interfaces for debugAddr. This makes it possible to check
Prometheus metrics for instances running in a Docker container.
- Standardize DNS timeouts on 1s and 3 retries across all configs. This ensures
DNS completes within the relevant RPC timeouts.
- Remove RA service queue from VA, since VA no longer uses the callback to RA on
completing a challenge.
In https://groups.google.com/forum/#!topic/mozilla.dev.security.policy/_pSjsrZrTWY, we had a problem with the policy authority configuration, but cert-checker didn't alert about it because it uses the same policy configuration.
This PR adds support for an explicit list of regular expressions used to match forbidden names. The regular expressions are applied after the PA has done its usual validation process in order to act as a defense-in-depth mechanism for cases (.mil, .local, etc) that we know we never want to support, even if the PA thinks they are valid (e.g. due to a policy configuration malfunction).
Initially the forbidden name regexps are (a minimal sketch of applying them follows the list):
- `^\s*$`
- `\.mil$`
- `\.local$`
- `^localhost$`
- `\.localhost$`
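A minimal sketch of applying the list after the PA's normal checks; the function names are illustrative, not cert-checker's actual code:

```go
package main

import (
	"fmt"
	"regexp"
)

// forbiddenNames mirrors the defense-in-depth list above; it is applied after
// the policy authority's usual validation.
var forbiddenNames = []*regexp.Regexp{
	regexp.MustCompile(`^\s*$`),
	regexp.MustCompile(`\.mil$`),
	regexp.MustCompile(`\.local$`),
	regexp.MustCompile(`^localhost$`),
	regexp.MustCompile(`\.localhost$`),
}

// isForbidden reports whether a hostname matches any forbidden pattern.
func isForbidden(name string) bool {
	for _, re := range forbiddenNames {
		if re.MatchString(name) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isForbidden("intranet.local")) // true
	fmt.Println(isForbidden("example.com"))    // false
}
```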
Additionally, the existing cert-checker.json config in both test/config/ and test/config-next/ was missing the hostnamePolicyFile entry required for operation of cert-checker. This PR adds a hostnamePolicyFile entry pointing at the existing test/hostname-policy.json file. The cert checker can now be used in the dev env with `cert-checker -config test/config/cert-checker.json` without error.
Resolves #2366
The ocsp-updater has been switched over to the `config-next/` usage of
gRPC for submitting to the publisher service. This commit removes the
legacy AMQP support for that path.
This does not remove the `rpc/rpc-wrappers.go` implementation of
`NewPublisherClient` at this point because it appears `boulder-ca` may
still be using it.
This PR reworks the validateEmail() function from the RA so that timeouts during DNS validation of MX/A/AAAA records for an email are non-fatal, matching our intention to verify emails on a best-effort basis.
Notes:
bdns/problem.go - DNSError.Timeout() was changed to also treat context cancellation and an exceeded context deadline as DNS timeouts. This matches what DNSError.Error() was already doing when building its error message, and lets external callers of Timeout() avoid duplicating that work (a minimal sketch follows these notes).
bdns/mocks.go - the LookupMX mock was changed to support always.error and always.timeout in a manner similar to the LookupHost mock. Otherwise the TestValidateEmail unit test for the RA would fail when the MX lookup completed before the Host lookup because the error wouldn't be correct (empty DNS records vs a timeout or network error).
test/config/ra.json, test/config-next/ra.json - the dnsTries and dnsTimeout values were updated such that dnsTries * dnsTimeout is <= the WFE->RA RPC timeout (currently 15s in the test configs). This allows all the DNS lookups to time out without the overall RPC timing out.
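A minimal sketch of the Timeout() behavior described in the bdns/problem.go note above; the type here is a simplified stand-in, not the real bdns.DNSError:

```go
package bdns

import (
	"context"
	"net"
)

// dnsError is a simplified stand-in for the real DNSError type; only the
// wrapped underlying error matters for this sketch.
type dnsError struct {
	underlying error
}

// Timeout treats network timeouts, context cancellation, and an exceeded
// context deadline all as DNS timeouts, matching what Error() already does
// when it builds its message.
func (d dnsError) Timeout() bool {
	if netErr, ok := d.underlying.(net.Error); ok && netErr.Timeout() {
		return true
	}
	return d.underlying == context.Canceled || d.underlying == context.DeadlineExceeded
}
```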
Resolves #2260.
The "20160817143417_AddCertStatusNotAfter.sql" db migration adds a "notAfter" column to the certificateStatus database table. This field duplicates the contents of the certificates table "expires" column. This enables performance improvements (see #1864) for both the ocsp-updater and the expiration-mailer utilities.
Since existing rows will have a NULL value in the new field the notafter-backfill utility exists to perform a one-time update of the existing certificateStatus rows to set their notAfter column based on the data that exists in the certificates table.
This follows on https://github.com/letsencrypt/boulder/pull/2177 and requires that the migration be applied & the feature flag set accordingly before use.
Fixes #2237.
This PR adds a `printStatus` function that is called every iteration of the mailer's `run()` loop. The status output is logged at the `info` level and includes the destination email, the current message being sent, the total number of messages to send, and the elapsed time since `run()` started.
The status output can be disabled by lowering the default syslog level in the `notify-mailer` config.
Additionally, this PR adds stats support for the mailer package. Three new stats are
published during the `MailerImpl`'s `SendMail` function (called in a loop by the mailer utilities):
`Mailer.SendMail.Attempts`
`Mailer.SendMail.Successes`
`Mailer.SendMail.Errors`
This PR removes two stats from the `expiration-mailer` that are redundant copies of the above:
`Mailer.Expiration.Errors.SendingNag.SendFailure`
`Mailer.Expiration.Sent`
This resolves #2026.
Another step in completing #1962, which will remove the global configuration file and codegangsta/cli from boulder. 3 more to go!
This PR is a little different from the others in that there was a lot more reliance on codegangsta/cli, especially in the implementation of subcommands. I put some thought into creating our own SubCommand struct, but given the lack of complexity it seemed unnecessary, as the same could be accomplished with slightly more advanced usage of os and flag.
Instead of reading the CA key from a file on disk into memory and using that for signing in `boulder-ca` this patch adds a new Docker container that runs SoftHSM and pkcs11-proxy in order to hold the key and perform signing operations. The pkcs11-proxy module is used by `boulder-ca` to talk to the SoftHSM container.
This exercises (almost) the full pkcs11 path through boulder and will allow testing various HSM related failures in the future as well as simplifying tuning signing performance for benchmarking.
Fixes #703.
For the notify-mailer, this PR fixes a bug with the -end parameter where the default (99999999) would cause a slice index out of range error. This was fixed by setting the -end value to len(m.destinations) in run when it is too large.
For both the notify-mailer and the contact-exporter, a bug was fixed where the required flags were compared against nil even though their defaults were non-nil pointers to the empty string. This resulted in confusing errors when the mandatory arguments were not provided.
This PR also adds a separate example config for both the notify-mailer and the contact-exporter under test/config and test/config-next.
Finally, a documentation string was added describing the overall design & usage of both tools, including example invocations.
Moves the WFE to its own config file.
Each config now belongs in `test/config` and `test/config-next`, analogous to `boulder-config` and `boulder-config-next`.