This enables ocsp-responder to talk to the RA and request freshly signed
OCSP responses.
ocsp/responder/redis_source is moved to ocsp/responder/redis/redis_source.go
and significantly modified. Instead of assuming a response is always available
in Redis, it wraps a live-signing source. When a response is not available,
it attempts a live signing.
If live signing succeeds, the Redis responder returns the result right away
and attempts to write a copy to Redis on a goroutine using a background
context.
To make things more efficient, I eliminate an unneeded ocsp.ParseResponse
from the storage path. And I factored out a FakeResponse helper to make
the unittests more manageable.
Commits should be reviewable one-by-one.
Fixes#6191
Add a new code path to the ctpolicy package which enforces Chrome's new
CT Policy, which requires that SCTs come from logs run by two different
operators, rather than one Google and one non-Google log. To achieve
this, invert the "race" logic: rather than assuming we always have two
groups, and racing the logs within each group against each other, we now
race the various groups against each other, and pick just one arbitrary
log from each group to attempt submission to.
Ensure that the new code path does the right thing by adding a new zlint
which checks that the two SCTs embedded in a certificate come from logs
run by different operators. To support this lint, which needs to have a
canonical mapping from logs to their operators, import the Chrome CT Log
List JSON Schema and autogenerate Go structs from it so that we can
parse a real CT Log List. Also add flags to all services which run these
lints (the CA and cert-checker) to let them load a CT Log List from disk
and provide it to the lint.
Finally, since we now have the ability to load a CT Log List file
anyway, use this capability to simplify configuration of the RA. Rather
than listing all of the details for each log we're willing to submit to,
simply list the names (technically, Descriptions) of each log, and look
up the rest of the details from the log list file.
To support this change, SRE will need to deploy log list files (the real
Chrome log list for prod, and a custom log list for staging) and then
update the configuration of the RA, CA, and cert-checker. Once that
transition is complete, the deletion TODOs left behind by this change
will be able to be completed, removing the old RA configuration and old
ctpolicy race logic.
Part of #5938
Simplify the WFE `RevokeCertificate` API method in three ways:
- Remove most of the logic checking if the requester is authorized to
revoke the certificate in question (based on who is making the
request, what authorizations they have, and what reason they're
requesting). That checking is now done by the RA. Instead, simply
verify that the JWS is authenticated.
- Remove the hard-to-read `authorizedToRevoke` callbacks, and make the
`revokeCertBySubscriberKey` (nee `revokeCertByKeyID`) and
`revokeCertByCertKey` (nee `revokeCertByJWK`) helpers much more
straight-line in their execution logic.
- Call the RA's new `RevokeCertByApplicant` and `RevokeCertByKey` gRPC
methods, rather than the deprecated `RevokeCertificateWithReg`.
This change, without any flag flips, should be invisible to the
end-user. It will slightly change some of our log message formats.
However, by now relying on the new RA gRPC revocation methods, this
change allows us to change our revocation policies by enabling the
`AllowDoubleRevocation` and `MozRevocationReasons` feature flags, which
affect the behavior of those new helpers.
Fixes#5936
Add two new gRPC methods to the SA:
- `RevokeCertByKey` will be used when the API request was signed by the
certificate's keypair, rather than a Subscriber keypair. If the
request is for reason `keyCompromise`, it will ensure that the key is
added to the blocked keys table, and will attempt to "re-revoke" a
certificate that was already revoked for some other reason.
- `RevokeCertByApplicant` supports both the path where the original
subscriber or another account which has proven control over all of the
identifier in the certificate requests revocation via the API. It does
not allow the requested reason to be `keyCompromise`, as these
requests do not represent a demonstration of key compromise.
In addition, add a new feature flag `MozRevocationReasons` which
controls the behavior of these new methods. If the flag is not set, they
behave like they have historically (see above). If the flag is set to true,
then the new methods enforce the upcoming Mozilla policies around
revocation reasons, namely:
- Only the original Subscriber can choose the revocation reason; other
clients will get a set reason code based on the method of requesting
revocation. When the original Subscriber requests reason
`keyCompromise`, this request will be honored, but the key will not be
blocked and other certificates with that key will not also be revoked.
- Revocations signed with the certificate key will always get reason
`keyCompromise`, because we do not know who is sending the request and
therefore must assume that the use of the key in this way represents
compromise. Because these requests will always be fore reason
`keyCompromise`, they will always be added to the blocked keys table
and they will always attempt "re-revocation".
- Revocations authorized via control of all names in the cert will
always get reason `cessationOfOperation`, which is to be used when the
original Subscriber does not control all names in the certificate
anymore.
Finally, update the existing `AdministrativelyRevokeCertificate` method
to use the new helper functions shared by the two new methods.
Part of #5936
Add a new check to GoodKey which attempts to factor the public modulus
of the presented key using Fermat's factorization method. This method
will succeed if and only if the prime factors are very close to each
other -- i.e. almost certainly were not selected independently from a
random uniform distribution, but were instead calculated via some other
less secure method.
To support this new feature, add a new config flag to the RA, CA, and
WFE, which all use the GoodKey checks. As part of adding this new config
value, refactor the GoodKey config items into their own config struct
which can be re-used across all services.
If the new `FermatRounds` config value has not been set, it will default
to zero, causing no factorization to be attempted.
Fixes#5850
Part of #5851
This allows repeated runs using the same hiearchy, and avoids spurious
errors from ocsp-updater saying "This CA doesn't have an issuer cert
with ID XXX"
Fixes#5721
Add a new method to the SA's gRPC interface which takes both an Order
and a list of new Authorizations to insert into the database, and adds
both (as well as the various ancillary rows) inside a transaction.
To enable this, add a new abstraction layer inside the `db/` package
that facilitates inserting many rows at once, as we do for the `authz2`,
`orderToAuthz2`, and `requestedNames` tables in this operation.
Finally, add a new codepath to the RA (and a feature flag to control it)
which uses this new SA method instead of separately calling the
`NewAuthorization` method multiple times. Enable this feature flag in
the config-next integration tests.
This should reduce the failure rate of the new-order flow by reducing
the number of database operations by coalescing multiple inserts into a
single multi-row insert. It should also reduce the incidence of new
authorizations being created in the database but then never exposed to
the subscriber because of a failure later in the new-order flow, both by
reducing failures overall and by adding those authorizations in a
transaction which will be rolled back if there is a later failure.
Fixes#5577
Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).
Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.
Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
This allows servers to tell clients to go away after some period of time, which triggers the clients to re-resolve DNS.
Per grpc/grpc#12295, this is the preferred way to do this.
Related: #5307.
The RA is responsible for contacting Akamai to purge cached OCSP
responses when a certificate is revoked and fresh OCSP responses need to
be served ASAP. In order to do so, it needs to construct the same OCSP
URLs that clients would construct, and that Akamai would cache. In order
to do that, it needs access to the issuing certificate to compute a hash
across its Subject Info and Public Key.
Currently, the RA holds a single issuer certificate in memory, and uses
that cert to compute all OCSP URLs, on the assumption that all certs
we're being asked to revoke were issued by the same issuer.
In order to support issuance from multiple intermediates at the same
time (e.g. RSA and ECDSA), and to support rollover between different
issuers of the same type (we may need to revoke certs issued by two
different issuers for the 90 days in which their end-entity certs
overlap), this commit changes the configuration to provide a list of
issuer certificates instead.
In order to support efficient lookup of issuer certs, this change also
introduces a new concept, the Chain ID. The Chain ID is a truncated hash
across the raw bytes of either the Issuer Info or the Subject Info of a
given cert. As such, it can be used to confirm issuer/subject
relationships between certificates. In the future, this may be a
replacement for our current IssuerID (a truncated hash over the whole
issuer certificate), but for now it is used to map revoked certs to
their issuers inside the RA.
Part of #5120
This adds a new tool, `health-checker`, which is a client of the new
Health Checker Service that has been integrated into all of our
boulder components. This tool takes an address, a timeout, and a
config file. It then attempts to connect to a gRPC Health Service at
the given address, retrying until it hits its timeout, using credentials
specified by the config file.
This is then wrapped by a new function `waithealth` in our Python
helpers, which serves much the same function as `waitport`, but
specifically for services which surface a gRPC Health Service
This in turn requires slight modifications to `startservers`, namely
specifying the address and port on which each service starts its
gRPC listener.
Finally, this change also introduces new credentials for this
health-checker, and adds those credentials as a valid client to
all services' json configs. A similar change would have to be made
to our production configs if we were to establish a long-lived
health checker/prober in prod.
Fixes#5074
This commit consists of three classes of changes:
1) Changing various command main.go files to always behave as they
would have when features.BlockedKeyTable was true. Also changing
one test in the same manner.
2) Removing the BlockedKeyTable flag from configuration in config-next,
because the flag is already live.
3) Moving the BlockedKeyTable flag to the "deprecated" section of
features.go, and regenerating featureflag_strings.go.
A future change will remove the BlockedKeyTable flag (and other
similarly deprecated flags) from features.go entirely.
Fixes#4873
Currently 99.99% of RSA keys we see in certificates at Let's Encrypt are
either 2048, 3072, or 4096 bits, but we support every 8 bit increment
between 2048 and 4096. Supporting these uncommon key sizes opens us up to
having to block much larger ranges of keys when dealing with something
like the Debian weak keys incident. Instead we should just reduce the
set of key sizes we support down to what people actually use.
Fixes#4835.
This ended up taking a lot more work than I expected. In order to make the implementation more robust a bunch of stuff we previously relied on has been ripped out in order to reduce unnecessary complexity (I think I insisted on a bunch of this in the first place, so glad I can kill it now).
In particular this change:
* Removes bhsm and pkcs11-proxy: softhsm and pkcs11-proxy don't play well together, and any softhsm manipulation would need to happen on bhsm, then require a restart of pkcs11-proxy to pull in the on-disk changes. This makes manipulating softhsm from the boulder container extremely difficult, and because of the need to initialize new on each run (described below) we need direct access to the softhsm2 tools since pkcs11-tool cannot do slot initialization operations over the wire. I originally argued for bhsm as a way to mimic a network attached HSM, mainly so that we could do network level fault testing. In reality we've never actually done this, and the extra complexity is not really realistic for a handful of reasons. It seems better to just rip it out and operate directly on a local softhsm instance (the other option would be to use pkcs11-proxy locally, but this still would require manually restarting the proxy whenever softhsm2-util was used, and wouldn't really offer any realistic benefit).
* Initializes the softhsm slots on each integration test run, rather than when creating the docker image (this is necessary to prevent churn in test/cert-ceremonies/generate.go, which would need to be updated to reflect the new slot IDs each time a new boulder-tools image was created since slot IDs are randomly generated)
* Installs softhsm from source so that we can use a more up to date version (2.5.0 vs. 2.2.0 which is in the debian repo)
* Generates the root and intermediate private keys in softhsm and writes out the root and intermediate public keys to /tmp for use in integration tests (the existing test-{ca,root} certs are kept in test/ because they are used in a whole bunch of unit tests. At some point these should probably be renamed/moved to be more representative of what they are used for, but that is left for a follow-up in order to keep the churn in this PR as related to the ceremony work as possible)
Another follow-up item here is that we should really be zeroing out the database at the start of each integration test run, since certain things like certificates and ocsp responses will be signed by a key/issuer that is no longer is use/doesn't match the current key/issuer.
Fixes#4832.
Adds a daemon which monitors the new blockedKeys table and checks for any unexpired, unrevoked certificates that are associated with the added SPKI hashes and revokes them, notifying the user that issued the certificates.
Fixes#4772.
For now this mainly provides an example config and confirms that
log-validator can start up and shut down cleanly, as well as provide a
stat indicating how many log lines it has handled.
This introduces a syslog config to the boulder-tools image that will write
logs to /var/log/program.log. It also tweaks the various .json config
files so they have non-default syslogLevel, to ensure they actually
write something for log-validator to verify.
Prev. we weren't checking the domain portion of an email contact address
very strictly in the RA. This updates the PA to export a function that
can be used to validate the domain the same way we validate domain
portions of DNS type identifiers for issuance.
This also changes the RA to use the `invalidEmail` error type in more
places.
A new Go integration test is added that checks these errors end-to-end
for both account creation and account update.
This change set makes the authz2 storage format the default format. It removes
most of the functionality related to the previous storage format, except for
the SA fallbacks and old gRPC methods which have been left for a follow-up
change in order to make these changes deployable without introducing
incompatibilities.
Fixes#4454.
We occasionally have reason to block public keys from being used in CSRs
or for JWKs. This work adds support for loading a YAML blocked keys list
to the WFE, the RA and the CA (all the components already using the
`goodekey` package).
The list is loaded in-memory and is intended to be used sparingly and
not for more complicated mass blocking scenarios. This augments the
existing debian weak key checking which is specific to RSA keys and
operates on a truncated hash of the key modulus. In comparison the
admin. blocked keys are identified by the Base64 encoding of a SHA256
hash over the DER encoding of the public key expressed as a PKIX subject
public key. For ECDSA keys in particular we believe a more thorough
solution would have to consider inverted curve points but to start we're
calling this approach "Good Enough".
A utility program (`block-a-key`) is provided that can read a PEM
formatted x509 certificate or a JSON formatted JWK and emit lines to be
added to the blocked keys YAML to block the related public key.
A test blocked keys YAML file is included
(`test/example-blocked-keys.yml`), initially populated with a few of the
keys from the `test/` directory. We may want to do a more through pass
through Boulder's source code and add a block entry for every test
private key.
Resolves https://github.com/letsencrypt/boulder/issues/4404
Enables integration tests for authz2 and fixes a few bugs that were flagged up during the process. Disables expired-authorization-purger integration tests if config-next is being used as expired-authz-purger expects to purge some stuff but doesn't know about authz2 authorizations, a new test will be added with #4188.
Fixes#4079.
- docker-rebuild isn't needed now that boulder and bhsm containers run directly off
the boulder-tools image.
- Remove DNS options from RA config.
- Remove GSB options from VA config.
Our existing hostname policy configs use JSON. We would like to switch to YAML
to match the rate limit policy configs and to support commenting/tagging
entries in the hostname policy.
The PA is updated to support both JSON and YAML while we migrate the existing
policy data to the new format. To verify there are no changes in functionality
the existing unit tests were updated to test the same policy expressed in YAML
and JSON form.
In integration tests the `config` tree continues to use a JSON hostname policy
file to test the legacy support. In `config-next` a YAML hostname policy file
is used instead.
The new YAML format allows separating out `HighRiskBlockedNames` (primarily
used to meet the requirement of managing issuance for "high risk" domains per
CABF BRs, mostly static) and `AdminBlockedNames` (used after administrative
action is taken to block a domain. Additions are made with some frequency).
Since the rate at which we change entries in these lists differs it is helpful
to separate them in the policy content.
- Move fakeclock, get_future_output, and random_domain to helpers.py.
- Remove tempdir handling from integration-test.py since it's already
done in helpers.py
- Consolidate handling of config dir into helpers.py, and add
CONFIG_NEXT boolean.
- Move RevokeAtRA config gating into verify_revocation to reduce
redundancy.
- Skip load-balancing test when filter is enabled.
- Ungate test_sct_embedding
- Rework test_ct_submissions, which was out of date. In particular, have a couple of
logs where submitFinalCert: false, and make ct-test-srv store submission counts
by hostnames for better test case isolation.
* Remove the challenge whitelist
* Reduce the signature for ChallengesFor and ChallengeTypeEnabled
* Some unit tests in the VA were changed from testing TLS-SNI to testing the same behavior
in TLS-ALPN, when that behavior wasn't already tested. For instance timeouts during connect
are now tested.
Fixes#4109
I tried dropping the RA->VA timeout to make the
`test_http_challenge_timeout` integration test faster. It seems to flake
in CI so I'm restoring the original 20s timeout. This makes
`test_http_challenge_timeout` slower but c'est la vie.
If an order for a given set of names will fail finalization because of certificate rate limits (certs per domain, certs per fqdn set) there isn't any point in allowing an order for those names to be created. We can stop a lot of requests earlier by enforcing the cert rate limits at new order time as well as finalization time. A new RA `EarlyOrderRateLimit` feature flag controls whether this is done or not.
Resolves#3975
Implements a feature that enables immediate revocation instead of marking a certificate revoked and waiting for the OCSP-Updater to generate the OCSP response. This means that as soon as the request returns from the WFE the revoked OCSP response should be available to the user. This feature requires that the RA be configured to use the standalone Akamai purger service.
Fixes#4031.
Required a little bit of rework of the RA issuance flow (to add parsing of the precert to determine the expiration date, and moving final cert parsing before final cert submission) and RA tests, but I think it shouldn't create any issues...
Fixes#3197.
In db01b0b we removed email validation from the RA. This was the only
use of the `bdns` package by the RA and so we can go one step further
and delete the remaining setup, configuration and `bdns` fields.
Things removed:
* features.EmbedSCTs (and all the associated RA/CA/ocsp-updater code etc)
* ca.enablePrecertificateFlow (and all the associated RA/CA code)
* sa.AddSCTReceipt and sa.GetSCTReceipt RPCs
* publisher.SubmitToCT and publisher.SubmitToSingleCT RPCs
Fixes#3755.
We'd like to start using the DNS load balancer in the latest version of gRPC. That means putting all IPs for a service under a single hostname (or using a SRV record, but we're not taking that path). This change adds an sd-test-srv to act as our service discovery DNS service. It returns both Boulder IP addresses for any A lookup ending in ".boulder". This change also sets up the Docker DNS for our boulder container to defer to sd-test-srv when it doesn't know an answer.
sd-test-srv doesn't know how to resolve public Internet names like `github.com`. Resolving public names is required for the `godep-restore` test phase, so this change breaks out a copy of the boulder container that is used only for `godep-restore`.
This change implements a shim of a DNS resolver for gRPC, so that we can switch to DNS-based load balancing with the currently vendored gRPC, then when we upgrade to the latest gRPC we won't need a simultaneous config update.
Also, this change introduces a check at the end of the integration test that each backend received at least one RPC, ensuring that we are not sending all load to a single backend.
In staging/prod we use `doNotForceCN: false` for both the RA & CA
config. Switching this to `true` is blocked on CABF work that will
likely take considerable time.
In the short-term we should use `doNotForceCN: false` in `test/config`
and only use `doNotForceCN: true` in `test/config-next`.
We're currently stuck on gRPC v1.1 because of a breaking change to certificate validation in gRPC 1.8. Our gRPC balancer uses a static list of multiple hostnames, and expects to validate against those hostnames. However gRPC expects that a service is one hostname, with multiple IP addresses, and validates all those IP addresses against the same hostname. See grpc/grpc-go#2012.
If we follow gRPC's assumptions, we can rip out our custom Balancer and custom TransportCredentials, and will probably have a lower-friction time in general.
This PR is the first step in doing so. In order to satisfy the "multiple IPs, one port" property of gRPC backends in our Docker container infrastructure, we switch to Docker's user-defined networking. This allows us to give the Boulder container multiple IP addresses on different local networks, and gives it different DNS aliases in each network.
In startservers.py, each shard of a service listens on a different DNS alias for that service, and therefore a different IP address. The listening port for each shard of a service is now identical.
This change also updates the gRPC service certificates. Now, each certificate that is used in a gRPC service (as opposed to something that is "only" a client) has three names. For instance, sa1.boulder, sa2.boulder, and sa.boulder (the generic service name). For now, we are validating against the specific hostnames. When we update our gRPC dependency, we will begin validating against the generic service name.
Incidentally, the DNS aliases feature of Docker allows us to get rid of some hackery in entrypoint.sh that inserted entries into /etc/hosts.
Note: Boulder now has a dependency on the DNS aliases feature in Docker. By default, docker-compose run creates a temporary container and doesn't assign any aliases to it. We now need to specify docker-compose run --use-aliases to get the correct behavior. Without --use-aliases, Boulder won't be able to resolve the hostnames it wants to bind to.
When the WFE calls the RA the RA creates a sub context which is cancelled when
the RPC returns. Because we were spawning the publisher RPC calls in a
goroutine with the context from ra.IssueCertificate as soon as
ra.IssueCertificate returned that context was being canceled which in turn
canceled the publisher RPC calls.
Instead of using the RA RPC context simply use a `context.Background()` so
that the RPC context doesn't break these submissions. Also return to
pre-features.CancelCTSubmissions behavior where precert submissions would
be canceled once we retrieved SCTs from the winning logs instead of relying
on the magic behavior of the RA RPC canceling them itself.
Submits final certificates to any configured CT logs. This doesn't introduce a feature flag as it is config gated, any log we want to submit final certificates to needs to have it's log description updated to include the `"submitFinalCerts": true` field.
Fixes#3605.
During periods of peak load, some RPCs are significantly delayed (on the order of seconds) by client-side blocking. HTTP/2 clients have to obey a "max concurrent streams" setting sent by the server. In Go's HTTP/2 implementation, this value [defaults to 250](https://github.com/golang/net/blob/master/http2/server.go#L56), so the gRPC default is also 250. So whenever there are more than 250 requests in progress at a time, additional requests will be delayed until there is a slot available.
During this peak load, we aren't hitting limits on CPU or memory, so we should increase the max concurrent streams limit to take better advantage of our available resources. This PR adds a config field to do that.
Fixes#3641.
gRPC passes deadline information through the RPC boundary, but client and server have the same deadline. Ideally we'd like the server to have a slightly tighter deadline than the client, so if one of the server's onward RPCs or other network calls times out, the server can pass back more detailed information to the client, rather than the client timing out the server and losing the opportunity to log more detailed information about which component caused the timeout.
In this change, I subtract 100ms from the deadline on the server side of our interceptors, using our existing serverInterceptor. I also check that there is at least 100ms remaining in which to do useful work, so the server doesn't begin a potentially expensive task only to abort it.
Fixes#3608.
Adds SCT embedding to the certificate issuance flow. When a issuance is requested a precertificate (the requested certificate but poisoned with the critical CT extension) is issued and submitted to the required CT logs. Once the SCTs for the precertificate have been collected a new certificate is issued with the poison extension replace with a SCT list extension containing the retrieved SCTs.
Fixes#2244, fixes#3492 and fixes#3429.
Right now we check safe browsing at new-authz time, which introduces a possible
external dependency when calling new-authz. This is usually fine, since most safe
browsing checks can be satisfied locally, but when requests have to go external,
it can create variance in new-authz timing.
Fixes#3491.
This code was never enabled in production. Our original intent was to
ship this as part of the ACMEv2 API. Before that could happen flaws were
identified in TLS-SNI-01|02 that resulted in TLS-SNI-02 being removed
from the ACME protocol. We won't ever be enabling this code and so we
might as well remove it.
Add a set of logs which will be submitted to but not relied on for their SCTs,
this allows us to test submissions to a particular log or submit to a log which
is not yet approved by a browser/root program.
Also add a feature which stops cancellations of remaining submissions when racing
to get a SCT from a group of logs.
Additionally add an informational log that always times out in config-next.
Fixes#3464 and fixes#3465.
Removes usage of the `EnforceChallengeDisable` feature, the feature itself is not removed as it is still configured in staging/production, once that is fixed I'll submit another PR removing the actual flag.
This keeps the behavior that when authorizations are retrieved from the SA they have their challenges populated, because that seems to make the most sense to me? It also retains TLS re-validation.
Fixes#3441.
Boulder is fairly noisy about gRPC connection errors. This is a mixed
blessing: Our gRPC configuration will try to reconnect until it hits
an RPC deadline, and most likely eventually succeed. In that case,
we don't consider those to really be errors. However, in cases where
a connection is repeatedly failing, we'd like to see errors in the
logs about connection failure, rather than "deadline exceeded." So
we want to keep logging of gRPC errors.
However, right now we get a lot of these errors logged during
integration tests. They make the output hard to read, and may disguise
more serious errors. So we'd like to avoid causing such errors in
normal integration test operation.
This change reorders the startup of Boulder components by their gRPC
dependencies, so everything's backend is likely to be up and running
before it starts. It also reverses that order for clean shutdowns,
and waits for each process to exit before signalling the next one.
With these changes, I still got connection errors. Taking listenbuddy
out of the gRPC path fixed them. I believe the issue is that
listenbuddy is not a truly transparent proxy. In particular, it
accepts an inbound TCP connection before opening an outbound TCP
connection. If opening that outbound connection results in "connection
refused," it closes the inbound connection. That means gRPC sees a
"connection closed" (or "connection reset"?) rather than "connection
refused". I'm guessing it handles those cases differently, explaining
the different error results.
We've been using listenbuddy to trigger disconnects while Boulder is
running, to ensure that gRPC's reconnect code works. I think we can
probably rely on gRPC's reconnect to work. The initial problem that
led us to start testing this was a configuration problem; now that
we have the configuration we want, we should be fine and don't need
to keep testing reconnects on every integration test run.
For the upcoming SCT embedding changes, it will be useful to have a CT test server that blocks for nontrivial amounts of time before responding. This change introduces a config file for `ct-test-srv` that can be used to set up multiple "personalities" on various ports. Each personality can have a "latencySchedule" that determines how long it will sleep before servicing responding to a submission.
This change also introduces two new "personalities" on :4510 and :4511, plus configures CTLogGroups in the RA. Having four CT log personalities allows us to simulate two nontrivial log groups.
Note: This triggers Publisher to emit audit errors on timed-out submissions. We may want to make Publisher not treat those as errors, and instead only log an error if a whole log group fails.
This change adds a feature flag, TLSSNIRevalidation. When it is enabled, Boulder
will create new authorization objects with TLS-SNI challenges if the requesting
account has issued a certificate with the relevant domain name, and was the most
recent account to do so*. This setting overrides the configured list of
challenges in the PolicyAuthority, so even if TLS-SNI is disabled in general, it
will be enabled for revalidation.
Note that this interacts with EnforceChallengeDisable. Because
EnforceChallengeDisable causes additional checked at validation time and at
issuance time, we need to update those two places as well. We'll send a
follow-up PR with that.
*We chose to make this work only for the most recent account to issue, even if
there were overlapping certificates, because it significantly simplifies the
database access patterns and should work for 95+% of cases.
Note that this change will let an account revalidate and reissue for a domain
even if the previous issuance on that account used http-01 or dns-01. This also
simplifies implementation, and fits within the intent of the mitigation plan: If
someone previously issued for a domain using http-01, we have high confidence
that they are actually the owner, and they are not going to "steal" the domain
from themselves using tls-sni-01.
Also note: This change also doesn't work properly with ReusePendingAuthz: true.
Specifically, if you attempted issuance in the last couple days and failed
because there was no tls-sni challenge, you'll still have an http-01 challenge
lying around, and we'll reuse that; then your client will fail due to lack of
tls-sni challenge again.
This change was joint work between @rolandshoemaker and @jsha.
This updates the PA component to allow authorization challenge types that are globally disabled if the account ID owning the authorization is on a configured whitelist for that challenge type.
This PR implements issuance for wildcard names in the V2 order flow. By policy, pending authorizations for wildcard names only receive a DNS-01 challenge for the base domain. We do not re-use authorizations for the base domain that do not come from a previous wildcard issuance (e.g. a normal authorization for example.com turned valid by way of a DNS-01 challenge will not be reused for a *.example.com order).
The wildcard prefix is stripped off of the authorization identifier value in two places:
When presenting the authorization to the user - ACME forbids having a wildcard character in an authorization identifier.
When performing validation - We validate the base domain name without the *. prefix.
This PR is largely a rewrite/extension of #3231. Instead of using a pseudo-challenge-type (DNS-01-Wildcard) to indicate an authorization & identifier correspond to the base name of a wildcard order name we instead allow the identifier to take the wildcard order name with the *. prefix.
This PR changes the VA's singleDialTimeout value from 5 * time.Second to 10 * time.Second. This will give slower servers a better chance to respond, especially for the multi-VA case where n requests arrive ~simultaneously.
This PR also bumps the RA->VA timeout by 5s and the WFE->RA timeout by 5s to accommodate the increased dial timeout. I put this in a separate commit in case we'd rather deal with this separately.
Makes a couple of changes:
* Change `SubmitToCT` to make submissions to each log in parallel instead of in serial, this prevents a single slow log from eating up the majority of the deadline and causing submissions to other logs to fail
* Remove the 'submissionTimeout' field on the publisher since it is actually bounded by the gRPC timeout as is misleading
* Add a timeout to the CT clients internal HTTP client so that when log servers hang indefinitely we actually do retries instead of just using the entire submission deadline. Currently set at 2.5 minutes
Fixes#3218.
We used to use TCP because we would request DNSSEC records from Unbound, and
they would always cause truncated records when present. Now that we no longer
request those (#2718), we can use UDP. This is better because the TCP serving
paths in Unbound are likely less thoroughly tested, and not optimized for high
load. In particular this may resolve some availability problems we've seen
recently when trying to upgrade to a more recent Unbound.
Note that this only affects the Boulder->Unbound path. The Unbound->upstream
path is already UDP by default (with TCP fallback for truncated ANSWERs).
Fixes#2889.
VA now implements two gRPC services: VA and CAA. These both run on the same port, but this allows implementation of the IsCAAValid RPC to skip using the gRPC wrappers, and makes it easier to potentially separate the service into its own package in the future.
RA.NewCertificate now checks the expiration times of authorizations, and will call out to VA to recheck CAA for those authorizations that were not validated recently enough.
The existing ReusePendingAuthz implementation had some bugs:
It would recycle deactivated authorizations, which then couldn't be fulfilled. (#2840)
Since it was implemented in the SA, it wouldn't get called until after the RA checks the Pending Authorizations rate limit. Which means it wouldn't fulfill its intended purpose of making accounts less likely to get stuck in a Pending Authorizations limited state. (#2831)
This factors out the reuse functionality, which used to be inside an "if" statement in the SA. Now the SA has an explicit GetPendingAuthorization RPC, which gets called from the RA before calling NewPendingAuthorization. This happens to obsolete #2807, by putting the recycling logic for both valid and pending authorizations in the RA.
Adds a basic truncated modulus hash check for RSA keys that can be used to check keys against the Debian `{openssl,openssh,openvpn}-blacklist` lists of weak keys generated during the [Debian weak key incident](https://wiki.debian.org/SSLkeys).
Testing is gated on adding a new configuration key to the WFE, RA, and CA configs which contains the path to a directory which should contain the weak key lists.
Fixes#157.
This removes the config and code to output to statsd.
- Change `cmd.StatsAndLogging` to output a `Scope`, not a `Statter`.
- Remove the prefixing of component name (e.g. "VA") in front of stats; this was stripped by `autoProm` but now no longer needs to be.
- Delete vendored statsd client.
- Delete `MockStatter` (generated by gomock) and `mocks.Statter` (hand generated) in favor of mocking `metrics.Scope`, which is the interface we now use everywhere.
- Remove a few unused methods on `metrics.Scope`, and update its generated mock.
- Refactor `autoProm` and add `autoRegisterer`, which can be included in a `metrics.Scope`, avoiding global state. `autoProm` now registers everything with the `prometheus.Registerer` it is given.
- Change va_test.go's `setup()` to not return a stats object; instead the individual tests that care about stats override `va.stats` directly.
Fixes#2639, #2733.
Prior to this PR if a domain was an exact match to a public suffix
list entry the certificates per name rate limit was applied based on the
count of certificates issued for that exact name and all of its
subdomains.
This PR introduces an exception such that exact public suffix
matches correctly have the certificate per name rate limit applied based
on only exact name matches.
In order to accomplish this a new RPC is added to the SA
`CountCertificatesByExactNames`. This operates similar to the existing
`CountCertificatesByNames` but does *not* include subdomains in the
count, only exact matches to the names provided. The usage of this new
RPC is feature flag gated behind the "CountCertificatesExact" feature flag.
The RA unit tests are updated to test the new code paths both with and
without the feature flag enabled.
Resolves#2681
Deletes github.com/streadway/amqp and the various RabbitMQ setup tools etc. Changes how listenbuddy is used to proxy all of the gRPC client -> server connections so we test reconnection logic.
+49 -8,221 😁Fixes#2640 and #2562.
I think these are all the necessary changes to implement TLS-SNI-02 validations, according to the section 7.3 of draft 05:
https://tools.ietf.org/html/draft-ietf-acme-acme-05#section-7.3
I don't have much experience with this code, I'll really appreciate your feedback.
Signed-off-by: David Calavera <david.calavera@gmail.com>
Set authorizationLifetimeDays to 60 across both config and config-next.
Set NumSessions to 2 in both config and config-next. A decrease from 10 because pkcs11-proxy (or pkcs11-daemon?) seems to error out under load if you have more sessions than CPUs.
Reorder parallelGenerateOCSPRequests to match config-next.
Remove extra tags for parsing yaml in config objects.
Previously, a given binary would have three TLS config fields (CA cert, cert,
key) for its gRPC server, plus each of its configured gRPC clients. In typical
use, we expect all three of those to be the same across both servers and clients
within a given binary.
This change reuses the TLSConfig type already defined for use with AMQP, adds a
Load() convenience function that turns it into a *tls.Config, and configures it
for use with all of the binaries. This should make configuration easier and more
robust, since it more closely matches usage.
This change preserves temporary backwards-compatibility for the
ocsp-updater->publisher RPCs, since those are the only instances of gRPC
currently enabled in production.
Previously, all gRPC services used the same client and server certificates. Now,
each service has its own certificate, which it uses for both client and server
authentication, more closely simulating production.
This also adds aliases for each of the relevant hostnames in /etc/hosts. There
may be some issues if Docker decides to rewrite /etc/hosts while Boulder is
running, but this seems to work for now.
- Remove spinner from test.js. It made Travis logs hard to read.
- Listen on all interfaces for debugAddr. This makes it possible to check
Prometheus metrics for instances running in a Docker container.
- Standardize DNS timeouts on 1s and 3 retries across all configs. This ensures
DNS completes within the relevant RPC timeouts.
- Remove RA service queue from VA, since VA no longer uses the callback to RA on
completing a challenge.
Adds a gRPC server to the SA and SA gRPC Clients to the WFE, RA, CA, Publisher, OCSP updater, orphan finder, admin revoker, and expiration mailer.
Also adds a CA gRPC client to the OCSP Updater which was missed in #2193.
Fixes#2347.
With the current gRPC design the CA talks directly to the Publisher when calling SubmitToCT which crosses security bounadries (secure internal segment -> internet facing segment) which is dangerous if (however unlikely) the Publisher is compromised and there is a gRPC exploit that allows memory corruption on the caller end of a RPC which could expose sensitive information or cause arbitrary issuance.
Instead we move the RPC call to the RA which is in a less sensitive network segment. Switching the call site from the CA -> RA is gated on adding the gRPC PublisherService object to the RA config.
Fixes#2202.
As described in #2282, our gRPC code uses mutual TLS to authenticate both clients and servers. However, currently our gRPC servers will accept any client certificate signed by the internal CA we use to authenticate connections. Instead, we would like each server to have a list of which clients it will accept. This will improve security by preventing the compromise of one client private key being used to access endpoints unrelated to its intended scope/purpose.
This PR implements support for gRPC servers to specify a list of accepted client names. A `serverTransportCredentials` implementing `ServerHandshake` uses a `verifyClient` function to enforce that the connecting peer presents a client certificate with a SAN entry that matches an entry on the list of accepted client names
The `NewServer` function from `grpc/server.go` is updated to instantiate the `serverTransportCredentials` used by `grpc.NewServer`, specifying an accepted names list populated from the `cmd.GRPCServerConfig.ClientNames` config field.
The pre-existing client and server certificates in `test/grpc-creds/` are replaced by versions that contain SAN entries as well as subject common names. A DNS and an IP SAN entry are added to allow testing both methods of specifying allowed SANs. The `generate.sh` script is converted to use @jsha's `minica` tool (OpenSSL CLI is blech!).
An example client whitelist is added to each of the existing gRPC endpoints in config-next/ to allow the SAN of the test RPC client certificate.
Resolves#2282
This PR reworks the validateEmail() function from the RA to allow timeouts during DNS validation of MX/A/AAAA records for an email to be non-fatal and match our intention to verify emails best-effort.
Notes:
bdns/problem.go - DNSError.Timeout() was changed to also include context cancellation and timeout as DNS timeouts. This matches what DNSError.Error() was doing to set the error message and supports external callers to Timeout not duplicating the work.
bdns/mocks.go - the LookupMX mock was changed to support always.error and always.timeout in a manner similar to the LookupHost mock. Otherwise the TestValidateEmail unit test for the RA would fail when the MX lookup completed before the Host lookup because the error wouldn't be correct (empty DNS records vs a timeout or network error).
test/config/ra.json, test/config-next/ra.json - the dnsTries and dnsTimeout values were updated such that dnsTries * dnsTimeout was <= the WFE->RA RPC timeout (currently 15s in the test configs). This allows the dns lookups to all timeout without the overall RPC timing out.
Resolves#2260.
Fixes#503.
Functionality is gated by the feature flag `AllowKeyRollover`. Since this functionality is only specified in ACME draft-03 and we mostly implement the draft-02 style this takes some liberties in the implementation, which are described in the updated divergences doc. The `key-change` resource is used to side-step draft-03 `url` requirement.
Add feature flagged support for issuing for IDNs, fixes#597.
This patch expects that clients have performed valid IDN2008 encoding on any label that includes unicode characters. Invalid encodings (including non-compatible IDN2003 encoding) will be rejected. No script-mixing or script exclusion checks are performed as we assume that if a name is resolvable that it conforms to the registrar's policies on these matters and if it uses non-standard scripts in sub-domains etc that browsers should be the ones choosing how to display those names.
Required a full update of the golang.org/x/net tree to pull in golang.org/x/net/idna, all test suites pass.