Update the hierarchy which the integration tests auto-generate inside
the ./hierarchy folder to include three intermediates of each key type,
two to be actively loaded and one to be held in reserve. To facilitate
this:
- Update the generation script to loop, rather than hard-coding each
intermediate we want
- Improve the filenames of the generated hierarchy to be more readable
- Replace the WFE's AIA endpoint with a thin aia-test-srv so that we
don't have to hardcode NameIDs in our ca.json configs
Having this new hierarchy will make it easier for our integration tests
to validate that new features like "unpredictable issuance" are working
correctly.
Part of https://github.com/letsencrypt/boulder/issues/729
Remove three deprecated feature flags which have already been removed
from all production configs:
- StoreLintingCertificateInsteadOfPrecertificate
- LeaseCRLShards
- AllowUnrecognizedFeatures
Deprecate three flags which are set to true in all production configs:
- CAAAfterValidation
- AllowNoCommonName
- SHA256SubjectKeyIdentifier
IN-9879 tracked the removal of these flags.
Create a new administration tool "bin/admin" as a successor to and
replacement of "admin-revoker".
This new tool supports all the same fundamental capabilities as the old
admin-revoker, including:
- Revoking by serial, by batch of serials, by incident table, and by
private key
- Blocking a key to let bad-key-revoker take care of revocation
- Clearing email addresses from all accounts that use them
Improvements over the old admin-revoker include:
- All commands run in "dry-run" mode by default, to prevent accidental
executions
- All revocation mechanisms allow setting the revocation reason,
skipping blocking the key, indicating that the certificate is malformed,
and controlling the number of parallel workers conducting revocation
- None of the revocation mechanisms parse the cert in question, leaving
that to the RA
- Autogenerated usage information for all subcommands
- A much more modular structure to simplify adding more capabilities in
the future
- Significantly simplified tests with smaller mocks
The new tool has analogues of all of admin-revoker's unit tests, and all
integration tests have been updated to use the new tool instead. A
future PR will remove admin-revoker, once we're sure SRE has had time to
update all of their playbooks.
Fixes https://github.com/letsencrypt/boulder/issues/7135
Fixes https://github.com/letsencrypt/boulder/issues/7269
Fixes https://github.com/letsencrypt/boulder/issues/7268
Fixes https://github.com/letsencrypt/boulder/issues/6927
Part of https://github.com/letsencrypt/boulder/issues/6840
When passing detailed error information between services as gRPC
metadata, ensure that the suberrors being sent contain only ASCII
characters, because gRPC metadata is sent as HTTP headers which only
allow visible ASCII characters.
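For reference, a minimal sketch of the kind of sanitization involved
(illustrative only, not Boulder's exact code):
```
import "strings"

// toVisibleASCII replaces anything outside the printable ASCII range
// (space through '~'), making the string safe as an HTTP header value.
func toVisibleASCII(s string) string {
	var b strings.Builder
	for _, r := range s {
		if r >= 0x20 && r <= 0x7e {
			b.WriteRune(r)
		} else {
			b.WriteByte('?')
		}
	}
	return b.String()
}
```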
Also add a regression test.
These names corresponded to single instances of a service, and were
primarily used for (a) specifying which interface to bind a gRPC port on
and (b) allowing `health-checker` to check individual instances rather
than a service as a whole.
For (a), change the `--grpc-addr` flags to bind to "all interfaces." For
(b), provide a specific IP address and port for health checking. This
required adding a `--hostOverride` flag for `health-checker` because the
service certificates contain hostname SANs, not IP address SANs.
Clarify the situation with nonce services a little bit. Previously we
had one nonce "service" in Consul and got nonces from that (i.e.
randomly between the two nonce-service instances). Now we have two nonce
services in consul, representing multiple datacenters, and one of them
is explicitly configured as the "get" service, while both are configured
as the "redeem" service.
Part of #7245.
Note this change does not yet get rid of the rednet/bluenet distinction,
nor does it get rid of all use of 10.88.88.88. That will be a followup
change.
The RequireCommonName feature flag was our only "inverted" feature flag,
which defaulted to true and had to be explicitly set to false. This
inversion can lead to confusion, especially to readers who expect all Go
default values to be zero values. We plan to remove the ability for our
feature flag system to support default-true flags, which the existence
of this flag blocked. Since this flag has not been set in any real
configs, inverting it is easy.
Part of https://github.com/letsencrypt/boulder/issues/6802
Have the crl-storer download the previous CRL from S3, parse it, and
compare its number against the about-to-be-uploaded CRL. This is not an
atomic operation, so it is not a 100% guarantee, but it is still a
useful safety check to prevent accidentally uploading CRL shards whose
CRL Numbers are not strictly increasing.
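A rough sketch of the check, assuming the previous CRL's DER bytes have
already been fetched from S3 (names here are illustrative):
```
import (
	"crypto/x509"
	"fmt"
)

// checkCRLNumber refuses an upload unless the new CRL's Number is
// strictly greater than the previously-uploaded CRL's Number.
func checkCRLNumber(prevDER []byte, next *x509.RevocationList) error {
	prev, err := x509.ParseRevocationList(prevDER)
	if err != nil {
		return fmt.Errorf("parsing previous CRL: %w", err)
	}
	if next.Number.Cmp(prev.Number) <= 0 {
		return fmt.Errorf("new CRL Number %s is not greater than previous %s", next.Number, prev.Number)
	}
	return nil
}
```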
Part of https://github.com/letsencrypt/boulder/issues/6456
In `//cmd/ceremony`:
* Added `CertificateToCrossSignPath` to the `cross-certificate` ceremony
type. This new input field takes an existing certificate that will be
cross-signed and performs checks against the manually configured data in
each ceremony file.
* Added byte-for-byte subject/issuer comparison checks to root,
intermediate, and cross-certificate ceremonies to detect that signing is
happening as expected.
* Added Fermat factorization check from the `//goodkey` package to all
functions that generate new key material.
In `//linter`:
* The Check function now exports linting certificate bytes. The idea is
that a linting certificate's `tbsCertificate` bytes can be compared
against the final certificate's `tbsCertificate` bytes as a verification
that `x509.CreateCertificate` was deterministic and produced identical
DER bytes after each signing operation.
Other notable changes:
* Re-orders the issuers list in each CA config to match staging and
production. There is an ordering issue mentioned by @aarongable two
years ago on IN-5913 that didn't make its way back to this repository.
> Order here matters – the default chain we serve for each intermediate
should be the first listed chain containing that intermediate.
* Enables `ECDSAForAll` in `config-next` CA configs to match Staging.
* Generates 2x new ECDSA subordinate CAs cross-signed by an RSA root and
adds these chains to the WFE for clients to download.
* Increased the test.sh startup timeout to account for the extra
ceremony run time.
Fixes https://github.com/letsencrypt/boulder/issues/7003
---------
Co-authored-by: Aaron Gable <aaron@letsencrypt.org>
Simplify the index-picking logic in the SA's leaseOldestCrlShard method.
Specifically, more clearly separate it into "missing" and "non-missing"
cases, which require entirely different logic: picking a random missing
shard, or picking the oldest unleased shard, respectively.
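A simplified sketch of the two cases (the types are illustrative, not
the SA's actual ones):
```
import (
	"math/rand"
	"time"
)

type shard struct {
	idx        int
	thisUpdate time.Time
	leased     bool
}

// pickShard prefers a random missing shard; otherwise it takes the
// oldest shard that is not currently leased.
func pickShard(missing, existing []shard) (shard, bool) {
	if len(missing) > 0 {
		return missing[rand.Intn(len(missing))], true
	}
	var oldest *shard
	for i := range existing {
		s := &existing[i]
		if s.leased {
			continue
		}
		if oldest == nil || s.thisUpdate.Before(oldest.thisUpdate) {
			oldest = s
		}
	}
	if oldest == nil {
		return shard{}, false
	}
	return *oldest, true
}
```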
Also change the UpdateCRLShard method to "unlease" shards when they're
updated. This allows the crl-updater to run as quickly as it likes,
while still ensuring that multiple instances do not step on each other's
toes.
The config change for shardWidth and lookbackPeriod instead of
certificateLifetime has been deployed in prod since IN-8445. The
shardWidth change here is just so that the tests neither produce a
bazillion shards, nor have to do a bazillion SA queries for each chunk
within a shard, improving the readability of test logs.
Part of https://github.com/letsencrypt/boulder/issues/7023
Allow gRPC SRV resolver to succeed even when some names are not resolved
successfully. Cross-DC services (e.g. nonce) will fail to resolve when
the link between DCs is severed or one DC is taken offline; this should
not result in hard gRPC service failures.
Fixes #6974
Fix an issue related to the custom gRPC Picker implementation introduced
in #6618. When a nonce contained a prefix not associated with a known
backend, the Picker would continuously rebuild, re-resolve DNS, and
eventually throw a 500 "Server Error" at RPC timeout. The Picker now
promptly returns a 400 "Bad Nonce" error as expected; in response, the
requesting client should retry their request with a fresh nonce.
Additionally:
- WFE unit tests use derived nonces when `"BOULDER_CONFIG_DIR" ==
"test/config-next"`.
- `Balancer.Build()` in "noncebalancer" forces a rebuild until non-zero
backends are available. This matches the
[balancer/roundrobin](d524b40946/balancer/roundrobin/roundrobin.go (L49-L53))
implementation.
- Nonces with no matching backend increment "jose_errors" with label
`"type": "JWSInvalidNonce"` and "nonce_no_backend_found".
- Nonces of incorrect length are now rejected at the WFE and increment
"jose_errors" with label `"type": "JWSMalformedNonce"` instead of
`"type": "JWSInvalidNonce"`.
- Nonces not encoded as base64url are now rejected at the WFE and
increment "jose_errors" with label `"type": "JWSMalformedNonce"` instead
of `"type": "JWSInvalidNonce"`.
Fixes #6969
Part of #6974
Add a new feature flag, LeaseCRLShards, which controls certain aspects
of crl-updater's behavior.
When this flag is enabled, crl-updater calls the new SA.LeaseCRLShard
method before beginning work on a shard. This prevents it from stepping
on the toes of another crl-updater instance which may be working on the
same shard. This is important to prevent two competing instances from
accidentally updating a CRL's Number (which is an integer representation
of its thisUpdate timestamp) *backwards*, which would be a compliance
violation.
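For illustration, one plausible encoding of the Number-from-thisUpdate
relationship (the exact granularity here is an assumption):
```
import (
	"math/big"
	"time"
)

// crlNumber derives the CRL Number from thisUpdate, which is why
// re-issuing a shard with an older thisUpdate would move the Number
// backwards.
func crlNumber(thisUpdate time.Time) *big.Int {
	return big.NewInt(thisUpdate.UnixNano())
}
```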
When this flag is enabled, crl-updater also calls the new
SA.UpdateCRLShard method after finishing work on a shard.
In the future, additional work will be done to make crl-updater use the
"give me the oldest available shard" mode of the LeaseCRLShard method.
Fixes https://github.com/letsencrypt/boulder/issues/6897
This change replaces [gorp] with [borp].
The changes consist of a mass renaming of the import and comments / doc
fixups, plus modifications of many call sites to provide a
context.Context everywhere, since gorp newly requires this (this was one
of the motivating factors for the borp fork).
This also refactors `github.com/letsencrypt/boulder/db.WrappedMap` and
`github.com/letsencrypt/boulder/db.Transaction` to not embed their
underlying gorp/borp objects, but to have them as plain fields. This
ensures that we can only call methods on them that are specifically
implemented in `github.com/letsencrypt/boulder/db`, so we don't miss
wrapping any. This required introducing a `NewWrappedMap` method along
with accessors `SQLDb()` and `BorpDB()` to get at the internal fields
during metrics and logging setup.
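A sketch of the plain-field wrapping (signatures are illustrative, not
the package's exact API):
```
import "github.com/letsencrypt/borp"

// WrappedMap holds the borp object as a plain field rather than
// embedding it, so only explicitly-wrapped methods are callable.
type WrappedMap struct {
	dbMap *borp.DbMap
}

func NewWrappedMap(dbMap *borp.DbMap) *WrappedMap {
	return &WrappedMap{dbMap: dbMap}
}

// BorpDB exposes the inner object for metrics and logging setup only.
func (m *WrappedMap) BorpDB() *borp.DbMap {
	return m.dbMap
}
```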
Fixes #6944
When a user wants their email address deleted from the database but no
longer has access to their account, this allows an administrator to
clear it.
This adds `admin` as an alias for `admin-revoker`, because we'd like the
clear-email sub-command to be a part of that overall tool, but it's not
really revocation related.
Part of #6864
When processing CAA records, keep track of the FQDN at which that CAA
record was found (which may be different from the FQDN for which we are
attempting issuance, since we crawl CAA records upwards from the
requested name to the TLD). Then surface this name upwards so that it
can be included in our own log lines and in the problem documents which
we return to clients.
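A minimal sketch of the upward crawl (simplified):
```
import "strings"

// caaLookupNames returns the candidate FQDNs for "w.x.y.z":
// ["w.x.y.z", "x.y.z", "y.z", "z"]. Whichever of these names the CAA
// record is actually found at is what gets surfaced in log lines and
// problem documents.
func caaLookupNames(fqdn string) []string {
	labels := strings.Split(fqdn, ".")
	names := make([]string, 0, len(labels))
	for i := range labels {
		names = append(names, strings.Join(labels[i:], "."))
	}
	return names
}
```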
Fixes https://github.com/letsencrypt/boulder/issues/3171
Remove the remaining divergences from RFC8555 regarding what error types
we use in certain situations. Specifically:
- use "invalidContact" instead of "invalidEmail";
- use "unsupportedContact" for contact addresses that use a protocol
other than "mailto:"; and
- use "unsupportedIdentifier" for identifiers that specify a type other
than "dns".
This adds Jaeger's all-in-one dev container (with no persistent storage)
to boulder's dev docker-compose. It configures config-next/ to send all
traces there.
A new integration test creates an account and issues a cert, then
verifies the trace contains some set of expected spans.
This test found that async finalize broke spans, so I fixed that and a
few related spots where we make a new context.
We only ever set it to the same value, and then read it back in
make_client, so just hardcode it there instead.
It's a bit spooky-action-at-a-distance and is process-wide with no
synchronization, which means we can't safely use different values
anyway.
Replace inline connect string with a new one in test/vars (that points
to boulder_sa_integration).
Remove comments about interpolateParams=false being required; it is not.
Add clauses to getPrecertByName to ensure it follows its documented
constraints (return the latest one).
Follow-up on #6807. Fixes #6848.
In order to get rid of the orphan queue, we want to make sure that
before we sign a precertificate, we have enough data in the database
that we can fulfill our revocation-checking obligations even if storing
that precertificate in the database fails. That means:
- We should have a row in the certificateStatus table for the serial.
- But we should not serve "good" for that serial until we are positive
the precertificate was issued (BRs 4.9.10).
- We should have a record in the live DB of the proposed certificate's
public key, so the bad-key-revoker can mark it revoked.
- We should have a record in the live DB of the proposed certificate's
names, so it can be revoked if we are required to revoke based on names.
The SA.AddPrecertificate method already achieves these goals for
precertificates by writing to the various metadata tables. This PR
repurposes the SA.AddPrecertificate method to write "proposed
precertificates" instead.
We already create a linting certificate before the precertificate, and
that linting certificate is identical to the precertificate that will be
issued except for the private key used to sign it (and the AKID). So for
instance it contains the right pubkey and SANs, and the Issuer name is
the same as the Issuer name that will be used. So we'll use the linting
certificate as the "proposed precertificate" and store it to the DB,
along with appropriate metadata.
In the new code path, rather than writing "good" for the new
certificateStatus row, we write a new, fake OCSP status string "wait".
This will cause us to return internalServerError to OCSP requests for
that serial (but we won't get such requests because the serial has not
yet been published). After we finish precertificate issuance, we update
the status to "good" with SA.SetCertificateStatusReady.
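A condensed sketch of that ordering (the interface and signatures are
illustrative, not the real gRPC API):
```
import "context"

// saClient captures just the two SA calls used in this sketch.
type saClient interface {
	AddPrecertificate(ctx context.Context, der []byte, serial, status string) error
	SetCertificateStatusReady(ctx context.Context, serial string) error
}

// issuePrecert stores the linting cert as a "proposed precertificate"
// with the fake OCSP status "wait" before any signing happens, and only
// flips the status to "good" once issuance has succeeded.
func issuePrecert(ctx context.Context, sa saClient, lintingDER []byte, serial string, sign func([]byte) ([]byte, error)) ([]byte, error) {
	if err := sa.AddPrecertificate(ctx, lintingDER, serial, "wait"); err != nil {
		return nil, err
	}
	precertDER, err := sign(lintingDER)
	if err != nil {
		return nil, err
	}
	if err := sa.SetCertificateStatusReady(ctx, serial); err != nil {
		return nil, err
	}
	return precertDER, nil
}
```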
Part of #6665
Export new prometheus metrics for the `notBefore` and `notAfter` fields
to track internal certificate validity periods when calling the `Load()`
method for a `*tls.Config`. Each metric is labeled with the `serial`
field.
```
tlsconfig_notafter_seconds{serial="2152072875247971686"} 1.664821961e+09
tlsconfig_notbefore_seconds{serial="2152072875247971686"} 1.664821960e+09
```
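A sketch of how such gauges can be emitted (metric names match the
output above; registration is elided):
```
import (
	"crypto/x509"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	notBefore = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "tlsconfig_notbefore_seconds"},
		[]string{"serial"},
	)
	notAfter = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "tlsconfig_notafter_seconds"},
		[]string{"serial"},
	)
)

// observeCertValidity records a certificate's validity bounds, labeled
// by its serial number.
func observeCertValidity(cert *x509.Certificate) {
	serial := cert.SerialNumber.String()
	notBefore.WithLabelValues(serial).Set(float64(cert.NotBefore.Unix()))
	notAfter.WithLabelValues(serial).Set(float64(cert.NotAfter.Unix()))
}
```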
Fixes https://github.com/letsencrypt/boulder/issues/6829
Update github.com/eggsampler/acme from v3.3.0 to v3.4.0.
Changelog: https://github.com/eggsampler/acme/compare/v3.3.0...v3.4.0
Update the ARI integration test to use the eggsampler/acme client's new
ARI capabilities for making both GET and POST requests. This simplifies
and streamlines the test significantly, and lets us test the POST path.
Fixes #6781
When sending an ARI response, write the Retry-After header before
writing the JSON response body. This is necessary because
http.ResponseWriter implicitly calls WriteHeader whenever Write is
called, flushing all headers to the network and preventing any
additional headers from being written. Unfortunately, the unittests use
httptest.ResponseRecorder, which doesn't seem to enforce this invariant
(it's happy to report headers which were written after the body). Add a
header check to the integration tests, to make up for this deficiency.
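The underlying net/http behavior, sketched:
```
import (
	"encoding/json"
	"net/http"
)

// writeRenewalInfo sets all headers before the first Write. The first
// Write triggers an implicit WriteHeader, which flushes the headers to
// the network; anything set on the header map afterwards is dropped.
func writeRenewalInfo(w http.ResponseWriter, body interface{}, retryAfter string) {
	w.Header().Set("Retry-After", retryAfter) // must precede Write
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(http.StatusOK)
	_ = json.NewEncoder(w).Encode(body)
}
```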
When external clients make POST requests to our ARI endpoint, they're
getting 404s even when a GET request with the same exact CertID
succeeds. Logs show that this is because the SA is returning "method
GetSerialMetadata not implemented" when the WFE attempts that gRPC
request. This is due to an oversight: the GetSerialMetadata method is
not implemented on the SQLStorageAuthorityRO object, only on the
SQLStorageAuthority object. The unit tests did not catch this bug
because they supply a mock SA, which does implement the method in
question.
Update the receiver and add a wrapper so that GetSerialMetadata is
implemented on both the read-write and read-only SA implementation
types. Add a new kind of test assertion which helps ensure this won't
happen again. Add a TODO for an integration test covering the ARI POST
codepath to prevent a regression.
Fixes #6778
Change the SetCommonName flag, introduced in #6706, to
RequireCommonName. Rather than having the flag control both whether or
not a name is hoisted from the SANs into the CN *and* whether or not the
CA is willing to issue certs with no CN, this updated flag now only
controls the latter. By default, the new flag is true, and continues our
current behavior of failing issuance if we cannot set a CN in the cert.
When the flag is set to false, then we are willing to issue certificates
for which the CSR contains no CN and there is no SAN short enough to be
hoisted into the CN field.
When we have rolled out this change, we can move on to the next flag in
this series: HoistCommonName, which will control whether or not a SAN is
hoisted at all, effectively giving the CSRs (and therefore the clients)
full control over whether their certificate contains a SAN.
This change is safe because no environment explicitly sets the
SetCommonName flag to false yet.
Fixes #5112
Add a new feature flag, `SetCommonName`, which defaults to `true`. In
this default state, no behavior changes.
When set to `false` on the CA, this flag will cause the CA to leave the
Subject commonName field of the certificate blank, as is recommended by
the Baseline Requirements Section 7.1.4.2.2(a).
Also slightly modify the behavior of the RA's `matchesCSR()` function,
to allow for both certificates that have a CN and certificates that
don't. It is not feasible to put this behavior behind the same
SetCommonName flag, because that would require an atomic deploy of both
the RA and the CA.
Obsoletes #5112
For consistency, put the error field at the end of unstructured log
lines to make them more ... structured.
Adds the `issuerID` field to the "orphaning certificate" log line in the CA
to match the "orphaning precertificate" log line.
Fixes broken tests as a result of the CA and bdns log line change.
Fixes #5457
Add an integration test which verifies that we reject finalize requests
with CSRs containing a fermat-factorizable public key.
Originally this change was also going to remove our Fermat factorization
implementation from good_key.go, and simply rely on the similar check in
zlint's e_rsa_fermat_factorization check. However, while relying solely
on the lint works, it causes us to block such requests with a 500
serverInternal error, because we consider failing lints to be our fault.
This would be a regression from the current status quo, where such
requests are rejected with a 400 badCSR error and details of the
factorization, so we are leaving our goodkey checks in place.
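For reference, the shape of the check (a minimal sketch of Fermat's
method, not the exact goodkey or zlint code):
```
import "math/big"

var one = big.NewInt(1)

// fermatFactor runs a bounded number of rounds of Fermat's method on an
// RSA modulus n. It succeeds only when the prime factors are close
// together, which is exactly the weakness being rejected.
func fermatFactor(n *big.Int, rounds int) (p, q *big.Int, ok bool) {
	a := new(big.Int).Sqrt(n)
	if new(big.Int).Mul(a, a).Cmp(n) < 0 {
		a.Add(a, one) // start at ceil(sqrt(n))
	}
	b2, b := new(big.Int), new(big.Int)
	for i := 0; i < rounds; i++ {
		b2.Mul(a, a).Sub(b2, n) // b^2 = a^2 - n
		b.Sqrt(b2)
		if new(big.Int).Mul(b, b).Cmp(b2) == 0 { // perfect square: done
			return new(big.Int).Sub(a, b), new(big.Int).Add(a, b), true
		}
		a.Add(a, one)
	}
	return nil, nil, false
}
```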
Update our implementation of ARI to return a renewal window entirely in
the past (i.e., suggesting immediate renewal) if the certificate in
question has been revoked for any reason. This will allow clients which
implement ARI to discover that they need to replace their certificate
without having to query OCSP directly, especially as we move into a
future where OCSP is mostly supplanted by aggregated CRLs.
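A sketch of the rule (the offsets are arbitrary illustrations; the
response shape follows the draft's suggested window):
```
import "time"

type suggestedWindow struct {
	Start time.Time `json:"start"`
	End   time.Time `json:"end"`
}

// renewalWindow returns a window entirely in the past for revoked
// certificates, signalling "renew immediately" to ARI-aware clients.
func renewalWindow(now time.Time, revoked bool) suggestedWindow {
	if revoked {
		return suggestedWindow{Start: now.Add(-2 * time.Hour), End: now.Add(-time.Hour)}
	}
	// Normal case elided: a window near the end of the cert's lifetime.
	return suggestedWindow{}
}
```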
Fixes #6503
Deprecate these feature flags, which are consistently set in both prod
and staging and which we do not expect to change the value of ever
again:
- AllowReRevocation
- AllowV1Registration
- CheckFailedAuthorizationsFirst
- FasterNewOrdersRateLimit
- GetAuthzReadOnly
- GetAuthzUseIndex
- MozRevocationReasons
- RejectDuplicateCSRExtensions
- RestrictRSAKeySizes
- SHA1CSRs
Move each feature flag to the "deprecated" section of features.go.
Remove all references to these feature flags from Boulder application
code, and make the code they were guarding the only path. Deduplicate
tests which were testing both the feature-enabled and feature-disabled
code paths. Remove the flags from all config-next JSON configs (but
leave them in config ones until they're fully deleted, not just
deprecated). Finally, replace a few testdata CSRs used in CA tests,
because they had SHA1WithRSAEncryption signatures that are now rejected.
Fixes #5171
Fixes #6476
Part of #5997
Boulder builds a single binary which is symlinked to the different binary names included in its releases.
However, requiring symlinks isn't always convenient.
This change makes the base `boulder` command usable as any of the other binary names. If the binary is invoked as `boulder`, it runs the second argument as the command name, shifting `boulder` off of os.Args so that all the existing argument parsing can remain unchanged.
This uses the subcommand versions in integration tests, which I think is important to verify this change works. However, we can debate whether that should be merged: since we're using the symlink method in production, that's what we want to test.
Issue #6362 suggests we want to move to a more fully-featured command-line parsing library that has proper subcommand support. This fixes one fragment of that, by providing subcommands, but is definitely nowhere near as nice as it could be with a more fully fleshed out library. Thus this change takes a minimal-touch approach, since we know a larger refactoring is coming.
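A sketch of the dispatch (illustrative, not the exact code):
```
import (
	"os"
	"path/filepath"
)

// commandName returns the binary name to dispatch on. When invoked as
// "boulder", the second argument becomes the command name and is
// shifted off os.Args so existing flag parsing is untouched.
func commandName() string {
	name := filepath.Base(os.Args[0])
	if name == "boulder" && len(os.Args) > 1 {
		name = os.Args[1]
		os.Args = os.Args[1:]
	}
	return name
}
```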
- Add a new gRPC client config field which overrides the dNSName checked in the
certificate presented by the gRPC server.
- Revert all test gRPC credentials to `<service>.boulder`
- Revert all ClientNames in gRPC server configs to `<service>.boulder`
- Set all gRPC clients in `test/config` to use `serverAddress` + `hostOverride`
- Set all gRPC clients in `test/config-next` to use `srvLookup` + `hostOverride`
- Rename incorrect SRV record for `ca` with port `9096` to `ca-ocsp`
- Rename incorrect SRV record for `ca` with port `9106` to `ca-crl`
Resolves #6424
- Add a dedicated Consul container
- Replace `sd-test-srv` with Consul
- Add documentation for configuring Consul
- Re-issue all gRPC credentials for `<service-name>.service.consul`
Part of #6111
Make every function in the Run -> Tick -> tickIssuer -> tickShard chain
return an error. Make that return value a named return (which we usually
avoid) so that we can remove the manual setting of the metric result
label and have the deferred metric handling function take care of that
instead. In addition, let that cleanup function wrap the returned error
(if any) with the identity of the shard, issuer, or tick that is
returning it, so that we don't have to include that info in every
individual error message. Finally, have the functions which spin off
many helpers (Tick and tickIssuer) collect all of their helpers' errors
and only surface that error at the end, to ensure the process completes
even in the presence of transient errors.
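A sketch of the named-return pattern (field and metric names here are
assumptions; construction and registration are elided):
```
import (
	"fmt"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

type crlUpdater struct {
	tickHistogram *prometheus.HistogramVec // labeled by "result"
}

// tickShard uses a named return so one deferred function can both set
// the metric's result label and wrap the error with the shard's
// identity, keeping inner error messages short.
func (cu *crlUpdater) tickShard(shardIdx int) (err error) {
	start := time.Now()
	defer func() {
		result := "success"
		if err != nil {
			result = "failed"
			err = fmt.Errorf("shard %d: %w", shardIdx, err)
		}
		cu.tickHistogram.WithLabelValues(result).Observe(time.Since(start).Seconds())
	}()
	// ... per-shard work ...
	return nil
}
```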
In crl-updater's main, surface the error returned by Run or Tick, to
make debugging easier.
Now that both crl-updater and crl-storer are running in prod,
run this integration test in both test environments as well.
In addition, remove the fake storer grpc client that the updater
used when no storer client was configured, as storer clients
are now configured in all environments.
Update our ACME Renewal Info implementation to parse
the CertID-based request format specified in the current
version of the draft specification.
Part of #6033
Debug and Info messages still go to stdout.
Fix the CAA integration test, which asserted that stderr should be empty
when caa-log-checker finds a problem. That used to be the case because
we never logged to stderr, but now it is not.
Update the logging docs.
Fixes #6324
The ioutil package has been deprecated since go1.16; the various
functions it provided now exist in the os and io packages. Replace all
instances of ioutil with either io or os, as appropriate.
Create a new crl-storer service, which receives CRL shards via gRPC and
uploads them to an S3 bucket. It ignores AWS SDK configuration in the
usual places, in favor of configuration from our standard JSON service
config files. It ensures that the CRLs it receives parse and are signed
by the appropriate issuer before uploading them.
Integrate crl-updater with the new service. It streams bytes to the
crl-storer as it receives them from the CA, without performing any
checking at the same time. This new functionality is disabled if the
crl-updater does not have a config stanza instructing it how to connect
to the crl-storer.
Finally, add a new test component, the s3-test-srv. This acts similarly
to the existing mail-test-srv: it receives requests, stores information
about them, and exposes that information for later querying by the
integration test. The integration test uses this to ensure that a
newly-revoked certificate does show up in the next generation of CRLs
produced.
Fixes #6162
Add a new filter to mail-test-srv, allowing test processes to query
for messages sent from a specific address, not just ones sent to
a specific address. This fixes a race condition in the revocation
integration tests where the number of messages sent to a cert's
contact address would be higher than expected because expiration
mailer sent a message while the test was running. Also reduce
bad-key-revoker's maximum backoff to 2 seconds to ensure that
it continues to run frequently during the integration tests, despite
usually not having any work to do.
While we're here, also improve the comments on various revocation
integration tests, remove some unnecessary cruft, and split the tests
out to explicitly test functionality with the MozRevocationReasons
flag both enabled and disabled. Also, change ocsp_helper's default
output from os.Stdout to ioutil.Discard to prevent hundreds of lines
of log spam when the integration tests fail during a test that uses
that library.
Fixes #6248
Add a new code path to the ctpolicy package which enforces Chrome's new
CT Policy, which requires that SCTs come from logs run by two different
operators, rather than one Google and one non-Google log. To achieve
this, invert the "race" logic: rather than assuming we always have two
groups, and racing the logs within each group against each other, we now
race the various groups against each other, and pick just one arbitrary
log from each group to attempt submission to.
Ensure that the new code path does the right thing by adding a new zlint
which checks that the two SCTs embedded in a certificate come from logs
run by different operators. To support this lint, which needs to have a
canonical mapping from logs to their operators, import the Chrome CT Log
List JSON Schema and autogenerate Go structs from it so that we can
parse a real CT Log List. Also add flags to all services which run these
lints (the CA and cert-checker) to let them load a CT Log List from disk
and provide it to the lint.
Finally, since we now have the ability to load a CT Log List file
anyway, use this capability to simplify configuration of the RA. Rather
than listing all of the details for each log we're willing to submit to,
simply list the names (technically, Descriptions) of each log, and look
up the rest of the details from the log list file.
To support this change, SRE will need to deploy log list files (the real
Chrome log list for prod, and a custom log list for staging) and then
update the configuration of the RA, CA, and cert-checker. Once that
transition is complete, the deletion TODOs left behind by this change
will be able to be completed, removing the old RA configuration and old
ctpolicy race logic.
Part of #5938
Simplify the WFE `RevokeCertificate` API method in three ways:
- Remove most of the logic checking if the requester is authorized to
revoke the certificate in question (based on who is making the
request, what authorizations they have, and what reason they're
requesting). That checking is now done by the RA. Instead, simply
verify that the JWS is authenticated.
- Remove the hard-to-read `authorizedToRevoke` callbacks, and make the
`revokeCertBySubscriberKey` (nee `revokeCertByKeyID`) and
`revokeCertByCertKey` (nee `revokeCertByJWK`) helpers much more
straight-line in their execution logic.
- Call the RA's new `RevokeCertByApplicant` and `RevokeCertByKey` gRPC
methods, rather than the deprecated `RevokeCertificateWithReg`.
This change, without any flag flips, should be invisible to the
end-user. It will slightly change some of our log message formats.
However, by now relying on the new RA gRPC revocation methods, this
change allows us to change our revocation policies by enabling the
`AllowDoubleRevocation` and `MozRevocationReasons` feature flags, which
affect the behavior of those new helpers.
Fixes #5936
- Add new configuration key `throughput`, a mapping which contains all
throughput related akamai-purger settings.
- Deprecate configuration key `purgeInterval` in favor of `purgeBatchInterval` in
the new `throughput` configuration mapping.
- When no `throughput` or `purgeInterval` is provided, the purger uses optimized
default settings which offer 1.9x the throughput of current production settings.
- At startup, all throughput related settings are modeled to ensure that we
don't exceed the limits imposed on us by Akamai.
- Queue is now `[][]string`, instead of `[]string`.
- When a given queue entry is purged we know all 3 of its URLs were purged.
- At startup we know the size of a theoretical request to purge based on the
number of queue entries included.
- Raises the queue size from ~333-thousand cached OCSP responses to
1.25-million, which is roughly 6 hours of work using the optimized default
settings
- Raise `purgeInterval` in test config from 1ms, which violates API limits, to 800ms
Fixes #5984
Reverts letsencrypt/boulder#5963
Turns out the tests are still flaky -- using the `grpc.WaitForReady(true)`
connection option results in sometimes seeing 9 entries added to the
purger queue, and sometimes 10 entries. Reverting because flakiness
on main should not be tolerated.
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.36.1 to 1.44.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.36.1...v1.44.0)
Also update akamai-purger integration test to avoid experimental API.
The `conn.GetState()` API is marked experimental and may change behavior
at any time. It appears to have changed between v1.36.1 and v1.44.0,
and so the akamai-purger integration tests which rely on it break.
Rather than writing our own loop which polls `conn.GetState()`, just
use the stable `WaitForReady(true)` connection option, and apply it to
all connections by setting it as a default option in the dial options.
Light cleanup of akamai-purger and the akamai cache-client. This does not make
any material changes to logic.
- Use `errors.New` and `errors.Is` instead of a custom `ErrFatal` type and
`errors.As`
- Add whitespace to separate chunks of execution and error checking from one
another
- Use `logger.Infof` and `logger.Errorf` instead of wrapped calls to
`fmt.Sprintf`
- Remove capital letters from the beginning of error messages
- Additional comments and removal of some that are no longer accurate
Overhaul the revocation integration tests to comprehensively test
every combination of:
- revoking a cert vs a precert
- revoking via the cert key, the subscriber key, or a separate account
that has validation for all of the names in the cert
- revoking for reason Unspecified vs for reason KeyCompromise
Also update a number of the python tests to verify that they cannot
revoke for reason keyCompromise, but can and do revoke with other
reasons.
These tests are testing functionality that is no longer in use in
production deployments of Boulder. As we go about removing wfe1
functionality, these tests will break, so let's just remove them
wholesale right now. I have verified that all of the tests removed in
this PR are duplicated against wfe2.
One of the changes in this PR is to cease starting up the wfe1 process
in the integration tests at all. However, that component was serving
requests for the AIA Issuer URL, which gets queried by various OCSP and
revocation tests. In order to keep those tests working, this change also
adds an integration-test-only handler to wfe2, and updates the CA
configuration to point at the new handler.
Part of #5681
Add a unit test and an integration test that both exercise the new
experimental ACME Renewal Info endpoint. These tests do not
yet validate the contents of the response, just that the appropriate
HTTP response code is returned, but they will be developed as the
code under test evolves.
Fixes #5674
Update the version of golangci-lint we use in our docker image,
and update the version of the docker image we use in our tests.
Fix a couple places where we were violating lints (ineffective assign
and calling `t.Fatal` from outside the main test goroutine), and add
one lint (using math/rand) to the ignore list.
Fixes #5710
This allows repeated runs using the same hierarchy, and avoids spurious
errors from ocsp-updater saying "This CA doesn't have an issuer cert
with ID XXX"
Fixes #5721
Instead of using the default `json.Unmarshal`, explicitly
construct and use a `json.Decoder` so that we can set the
`DisallowUnknownFields` flag on the decoder. This causes
any unrecognized config keys to result in errors at boulder
startup time.
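The mechanism, sketched:
```
import (
	"bytes"
	"encoding/json"
)

// loadConfig decodes strictly: an unrecognized key in the JSON now
// fails startup instead of being silently ignored.
func loadConfig(data []byte, out interface{}) error {
	d := json.NewDecoder(bytes.NewReader(data))
	d.DisallowUnknownFields()
	return d.Decode(out)
}
```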
Fixes #5643
Previously, caa-log-checker's core algorithm was:
1. Load every single VA (CAA) log file, producing an in-memory map of
names to the time at which they were checked
2. Iterate over the RA (Issuance) log file, checking each issuance event
to see if it occurred less than 8 hours after an event in the
in-memory map.
This consumes significant memory, as the map of all CAA checks is
redundant (contains entries for w.x.y.z, x.y.z, and y.z) and holds
unnecessary data (contains entries for CAA checks that occurred much
more than 8 hours before or after any issuance in the RA log).
Invert this algorithm, as such:
1. Load the RA (Issuance) log file, producing an in-memory map of names
to the time at which they were issued
2. Iterate over each VA (CAA) log file, removing entries from the
in-memory map if they occurred less than 8 hours after the current
CAA checking event.
This reduces the memory consumption of caa-log-checker, because the
total number of issuance events is much smaller and the map does not
need to hold redundant data. The tradeoff is that caa-log-checker can no
longer print partial output as it runs; all results are held until the
very end, when it can inspect the in-memory map to see if it is empty.
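A simplified sketch of the inverted core (log parsing and the name-tree
handling are elided):
```
import "time"

type caaCheck struct {
	name    string
	checked time.Time
}

// checkIssuances starts from the issuance map and deletes entries as
// matching CAA checks are found; whatever remains at the end was issued
// without a fresh-enough CAA check.
func checkIssuances(issuances map[string]time.Time, checks []caaCheck) map[string]time.Time {
	const window = 8 * time.Hour
	for _, c := range checks {
		if issued, ok := issuances[c.name]; ok {
			age := issued.Sub(c.checked)
			if age >= 0 && age < window {
				delete(issuances, c.name) // this issuance is covered
			}
		}
	}
	return issuances // anything left is a potential violation
}
```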
Fixes #5552
Add a new rate limit, identical in implementation to the current
`CertificatesPerFQDNSet` limit, intended to always have both a lower
window and a lower threshold. This allows us to block runaway clients
quickly, and give their owners the ability to fix and try again quickly
(on the order of hours instead of days).
Configure the integration tests to set this new limit at 2 certs per 2
hours. Also increase the existing limit from 5 to 6 certs in 7 days, to
allow clients to hit the first limit three times before being fully
blocked for the week. Also add a new integration test to verify this
behavior.
Note that the new ratelimit must have a window greater than the
configured certificate backdate (currently 1 hour) in order to be
useful.
Fixes #5210
Update orphan-finder's `generateOCSP` function to make its request to
the CA using the certificate's serial number and issuer ID, rather than
the full DER bytes. To facilitate this, add an `IssuerCerts` item to the
orphan-finder config, and add an `issuers` map to its struct, mimicking
fields of the same name and purpose on the RA. Leave the old code path
in the `generateOCSP` method for now, to be fully removed after the new
config has been deployed.
Also update the unittests to use real on-disk certificates instead of
inline strings, and similarly correct the integration test to use a
certificate with the correct Issuer field.
Part of #5079
Fixes #5149
The ocsp-responder takes a path to a certificate file as one of
its config values. It uses this path as one of the inputs when
constructing its DBSource, the object responsible for querying
the database for pregenerated OCSP responses to fulfill requests.
However, this certificate file is not necessary to query the
database; rather, it only acts as a filter: OCSP requests whose
IssuerKeyHash do not match the hash of the loaded certificate are
rejected outright, without querying the DB. In addition, there is
currently only support for a single certificate file in the config.
This change adds support for multiple issuer certificate files in
the config, and refactors the pre-database filtering of bad OCSP
requests into a helper object dedicated solely to that purpose.
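A sketch of the filter (hash-algorithm handling elided):
```
import (
	"encoding/hex"

	"golang.org/x/crypto/ocsp"
)

// issuerFilter rejects OCSP requests whose IssuerKeyHash matches none
// of the configured issuer certificates, before any database query.
type issuerFilter struct {
	keyHashes map[string]struct{} // hex-encoded issuer key hashes
}

func (f *issuerFilter) allowed(req *ocsp.Request) bool {
	_, ok := f.keyHashes[hex.EncodeToString(req.IssuerKeyHash)]
	return ok
}
```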
Fixes #5119
This change adds two new test assertion helpers, `AssertErrorIs`
and `AssertErrorWraps`. The former is a wrapper around `errors.Is`,
and asserts that the error's wrapping chain contains a specific (i.e.
singleton) error. The latter is a wrapper around `errors.As`, and
asserts that the error's wrapping chain contains any error which is
of the given type; it also has the same unwrapping side effect as
`errors.As`, which can be useful for further assertions about the
contents of the error.
It also makes two small changes to our `berrors` package, namely
making `berrors.ErrorType` itself an error rather than just an int,
and giving `berrors.BoulderError` an `Unwrap()` method which
exposes that inner `ErrorType`. This allows us to use the two new
helpers above to make assertions about berrors, rather than
having to hand-roll equality assertions about their types.
Finally, it takes advantage of the two changes above to greatly
simplify many of the assertions in our tests, removing conditional
checks and replacing them with simple assertions.
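Minimal sketches of the two helpers (the real signatures may differ):
```
import (
	"errors"
	"testing"
)

// AssertErrorIs asserts err's wrapping chain contains the target error.
func AssertErrorIs(t *testing.T, err, target error) {
	t.Helper()
	if !errors.Is(err, target) {
		t.Fatalf("expected %q to wrap %q", err, target)
	}
}

// AssertErrorWraps asserts err's chain contains an error of target's
// type; like errors.As, it also unwraps into target for later use.
func AssertErrorWraps(t *testing.T, err error, target interface{}) {
	t.Helper()
	if !errors.As(err, target) {
		t.Fatalf("expected %q to wrap an error of type %T", err, target)
	}
}
```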
This adds a configurable output parameter for ocsp_helper, which
defaults to stdout. This allows suppressing the stdout output when using
ocsp_helper in integration tests. That output was making it hard to
see details about failing tests.
This change adds `req.IssuerID` to the set of fields that the SA's
`AddPrecertificate` method requires be non-zero.
As a result, this also updates many tests, both unit and integration,
to ensure that they supply a value (usually just 1) for that field. The
most complex part of the test changes is a slight refactoring to the
orphan-finder code, which makes it easier to reason about the
separation between log line parsing and building and sending the
request.
Based on #5096
Fixes #5097
Since we now sync caaChecks logs daily instead of continuously,
caa-log-checker can no longer assume that the validation logs it is
checking cover the exact same span of time as the issuance logs. This
commit adds -earliest and -latest parameters so that the script
that drives this tool can restrict verification to a timespan where we
know the data is valid.
Also adds a -debug flag to caa-log-checker to enable debug logs. At the
moment this makes the tool write to stderr how many issuance messages
were evaluated and how many were skipped due to -earliest and
-latest parameters.
This copies over a number of features flags and other settings from
test/config-next that have been applied in prod.
Also, remove the config-next gate on various tests.
Adds a new -expect-reason flag to the checkocsp binary to allow for
verifying the revocation reason of the certificate(s) in question.
This flag has a default value of -1, meaning that no particular
revocation reason will be expected or enforced.
Also updates the -expect-status flag to have the same default (-1) and
behavior, so that when the tool is run interactively it can simply
print the revocation status of each certificate.
Finally, refactors the way the ocsp/helper library declares flags and
accesses their values. This unifies the interface and makes it easy to
extend to allow tests to modify parameters other than expectStatus when
desired.
Fixes #4885
This ended up taking a lot more work than I expected. In order to make the implementation more robust a bunch of stuff we previously relied on has been ripped out in order to reduce unnecessary complexity (I think I insisted on a bunch of this in the first place, so glad I can kill it now).
In particular this change:
* Removes bhsm and pkcs11-proxy: softhsm and pkcs11-proxy don't play well together, and any softhsm manipulation would need to happen on bhsm, then require a restart of pkcs11-proxy to pull in the on-disk changes. This makes manipulating softhsm from the boulder container extremely difficult, and because of the need to initialize slots anew on each run (described below) we need direct access to the softhsm2 tools, since pkcs11-tool cannot do slot initialization operations over the wire. I originally argued for bhsm as a way to mimic a network-attached HSM, mainly so that we could do network-level fault testing. In reality we've never actually done this, and the extra complexity is not really realistic for a handful of reasons. It seems better to just rip it out and operate directly on a local softhsm instance (the other option would be to use pkcs11-proxy locally, but this still would require manually restarting the proxy whenever softhsm2-util was used, and wouldn't really offer any realistic benefit).
* Initializes the softhsm slots on each integration test run, rather than when creating the docker image (this is necessary to prevent churn in test/cert-ceremonies/generate.go, which would need to be updated to reflect the new slot IDs each time a new boulder-tools image was created since slot IDs are randomly generated)
* Installs softhsm from source so that we can use a more up to date version (2.5.0 vs. 2.2.0 which is in the debian repo)
* Generates the root and intermediate private keys in softhsm and writes out the root and intermediate public keys to /tmp for use in integration tests (the existing test-{ca,root} certs are kept in test/ because they are used in a whole bunch of unit tests. At some point these should probably be renamed/moved to be more representative of what they are used for, but that is left for a follow-up in order to keep the churn in this PR as related to the ceremony work as possible)
Another follow-up item here is that we should really be zeroing out the database at the start of each integration test run, since certain things like certificates and ocsp responses will be signed by a key/issuer that is no longer in use/doesn't match the current key/issuer.
Fixes #4832.
Adds a productionized version of our internal tooling to the tree. The
major differences are: it doesn't skip certs with only one name, it
doesn't read in all the va logs in parallel, it only supports reading
one ra log at a time, and it adds unit tests.
Probably it should include an integration test, but that requires
capturing logs on the docker container, which I don't think we currently
do? Probably would make for a good follow-up issue.
Fixes #4698.
Patches:
Make sure all log tailing types call Cleanup
Make sure the http.Response body is closed in all cases
Make sure that the challenge token is always deleted
Adds a daemon which monitors the new blockedKeys table, checks for any unexpired, unrevoked certificates associated with the added SPKI hashes, and revokes them, notifying the user that issued the certificates.
Fixes #4772.
Previously, the test called `.Round(time.Minute)` on the expected
and actual expiration times, intending to perform an "approximately
equal" function.
However, when the expected and actual times differed by a second, but
they happened to fall on opposite sides of a rounding interval (i.e. 30
seconds into a minute), they would be rounded in opposite directions,
resulting in a conclusion that they were not equal.
This change instead defines an acceptable range of plus or minus a
minute for the expiration time, and checks that the actual expiration
time is in that interval.
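The interval check, sketched:
```
import (
	"testing"
	"time"
)

// assertApproxTime replaces the rounding comparison: it accepts any
// actual time within one minute of expected, with no boundary effects.
func assertApproxTime(t *testing.T, expected, actual time.Time) {
	t.Helper()
	if actual.Before(expected.Add(-time.Minute)) || actual.After(expected.Add(time.Minute)) {
		t.Errorf("got %s, want within one minute of %s", actual, expected)
	}
}
```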
In 67ec373a96 we removed "unused" WFE and WFE2
config elements. Unfortunately I missed that one of these elements,
`allowOrigins`, **is** used and without this config in
place CORS is broken.
We have unit tests for the CORS headers but we did not have any end-to-end
integration tests that would catch a problem with the WFE/WFE2 missing the
`allowOrigins` config element.
This commit restores the `allowOrigins` config value across the WFE/WFE2
configs and also adds a very small integration test. That test only checks one
CORS header and only for the HTTP ACMEv2 endpoint but I think it's sufficient
for the moment (and definitely better than nothing).
Prior to fixing the config elements the integration test fails as expected:
```
--- FAIL: TestWFECORS (0.00s)
wfe_test.go:28: "" != "*"
FAIL
FAIL github.com/letsencrypt/boulder/test/integration 0.014s
FAIL
```
The RA should set the expiry of valid authorizations based only on the current time and the configured authorizationLifetime. It should not extend the pending authorization's lifetime by the authorizationLifetime.
Resolves #4617
I didn't gate this with a feature flag. If we think this needs an API announcement and gradual rollout (I don't personally think this change deserves that) then I think we should change the RA config's authorizationLifetimeDays value to 37 days instead of adding a feature flag that we'll have to clean up after the flag date. We can change it back to 30 after the flag date.
In the deep dark history of Boulder we ended up jamming contacts into
a VARCHAR db field. We need to make sure that when contacts are
marshaled the resulting bytes will fit into the column or a 500 will
be returned to the user when the SA RPC fails.
One day we should fix this properly and not return a hacky error message
that's hard for users to understand. Unfortunately that will likely
require a migration or a new DB table. In the shorter term this hack
will prevent 500s which is a clear improvement.
Previously, we weren't checking the domain portion of an email contact address
very strictly in the RA. This updates the PA to export a function that
can be used to validate the domain the same way we validate domain
portions of DNS type identifiers for issuance.
This also changes the RA to use the `invalidEmail` error type in more
places.
A new Go integration test is added that checks these errors end-to-end
for both account creation and account update.
We need the RA's `NewOrder` RPC to return a `berrors.Malformed` instance
when there are too many identifiers. A bare error will be turned into
a server internal problem by the WFE2's `web.ProblemDetailsForError`
call while a `berrors.Malformed` will produce the expected malformed
problem.
This commit fixes the error, updates the unit test, and adds an end-to-end
integration test so we don't mess this up again.
This updates the `github.com/eggsampler/acme` dependency used in our Go-based
integration tests to v3. Notably this fixes a data race we encountered in CI.
With the data race fixed this branch can also revert
54a798b7f6 and resolve
https://github.com/letsencrypt/boulder/issues/4542
I ran a `go mod tidy` to clean up the old `v2` copy of the dep and it also
removed a few stale cfssl/mysql items from the `go.mod`.
Upstream library's tests are confirmed to pass:
```
~/go/src/github.com/eggsampler/acme$ git log --pretty=format:'%h' -n 1
b581dc6
~/go/src/github.com/eggsampler/acme$ make pebble
mkdir -p /home/daniel/go/src/github.com/letsencrypt/pebble
git clone --depth 1 https://github.com/letsencrypt/pebble.git /home/daniel/go/src/github.com/letsencrypt/pebble \
|| (cd /home/daniel/go/src/github.com/letsencrypt/pebble; git checkout -f master && git reset --hard HEAD && git pull -q)
fatal: destination path '/home/daniel/go/src/github.com/letsencrypt/pebble' already exists and is not an empty directory.
Already on 'master'
Your branch is up-to-date with 'le/master'.
HEAD is now at 6c2d514 wfe: compare Identifier.Type with acme.IndentifierIP (#287)
docker-compose -f /home/daniel/go/src/github.com/letsencrypt/pebble/docker-compose.yml up -d
Creating network "pebble_acmenet" with driver "bridge"
Creating pebble_challtestsrv_1 ... done
Creating pebble_pebble_1 ... done
while ! wget --delete-after -q --no-check-certificate "https://localhost:14000/dir" ; do sleep 1 ; done
go clean -testcache
go test -race -coverprofile=coverage_18.txt -covermode=atomic github.com/eggsampler/acme/v3
ok github.com/eggsampler/acme/v3 24.292s coverage: 83.0% of statements
docker-compose -f /home/daniel/go/src/github.com/letsencrypt/pebble/docker-compose.yml down
Stopping pebble_pebble_1 ... done
Stopping pebble_challtestsrv_1 ... done
Removing pebble_pebble_1 ... done
Removing pebble_challtestsrv_1 ... done
Removing network pebble_acmenet
```
Since 6f71c0c switched the Go integration tests to run in parallel, the
`TestPrecertificateOCSP` test has been flaky. To fix the flake, the test
needs to be resilient to the ct-test-srv returning precertificates other
than the one it expects, since other tests are also using it
concurrently.