This is a follow-up fix to
https://github.com/letsencrypt/boulder/pull/6987 which mistakenly did
not update the docker compose default BOULDER_TOOLS_TAG variable from
go1.20.5 to go1.20.6. This would only manifest on developer machines
manually running unit/integration tests, rather than in CI, which
explicitly tests against a matrix of BOULDER_TOOLS_TAG versions.
```
Error response from daemon: manifest for letsencrypt/boulder-tools:go1.20.5_2023-07-11 not found: manifest unknown: manifest unknown
```
Update zlint to v3.5.0, which introduces scaffolding for running lints
over CRLs.
Convert all of our existing CRL checks to structs which match the zlint
interface, and add them to the registry. Then change our linter's
CheckCRL function, and crl-checker's Validate function, to run all lints
in the zlint registry.
Finally, update the ceremony tool to run these lints as well.
This change touches a lot of files, but involves almost no logic
changes. It's all just infrastructure, changing the way our lints and
their tests are shaped, and moving test files into new homes.
Fixes https://github.com/letsencrypt/boulder/issues/6934
Fixes https://github.com/letsencrypt/boulder/issues/6979
This version includes a fix that seems relevant to us:
> The HTTP/1 client did not fully validate the contents of the Host
header. A maliciously crafted Host header could inject additional
headers or entire requests. The HTTP/1 client now refuses to send
requests containing an invalid Request.Host or Request.URL.Host value.
>
> Thanks to Bartek Nowotarski for reporting this issue.
>
> Includes security fixes for CVE-2023-29406 and Go issue
https://go.dev/issue/60374
This was introduced early in Boulder development when we had the concept
of a "short serial" (monotonically increasing) which would be prepended
to random bytes to form the full serial. We wanted to specially report
the case that there were duplicates of a given short serial since it
meant a problem with our monotonicity.
We've long since abandoned that idea, and in any case this code can't be
exercised because sa.SelectCertificate does a LIMIT 1.
The upstream zlint lints are organized not by what kind of certificate
they apply to, but what source they are from. This change rearranges
(and slightly renames) our custom lints to match the same structure.
This will make it easier for us to temporarily add lints (e.g. for our
CRLs) which we intend to upstream to zlint later.
Part of https://github.com/letsencrypt/boulder/issues/6934
Add go1.21rc2 to the matrix of go versions we test against.
Add a new step to our CI workflows (boulder-ci, try-release, and
release) which sets the "GOEXPERIMENT=loopvar" environment variable if
we're running go1.21. This experiment makes it so that loop variables
are scoped only to their single loop iteration, rather than to the whole
loop. This prevents bugs such as our CAA Rechecking incident
(https://bugzilla.mozilla.org/show_bug.cgi?id=1619047). Also add a line
to our docker setup to propagate this environment variable into the
container, where it can affect builds.
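For illustration, a minimal standalone example (not Boulder code) of the loop-variable capture behavior the loopvar experiment changes:
```go
package main

import "fmt"

func main() {
	var fns []func()
	for _, v := range []string{"a", "b", "c"} {
		// Pre-go1.21 semantics: every closure captures the same variable v,
		// which is shared across all iterations of the loop.
		fns = append(fns, func() { fmt.Println(v) })
	}
	for _, f := range fns {
		f() // without loopvar: prints "c" three times; with loopvar: "a", "b", "c"
	}
}
```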
Finally, fix one TLS-ALPN-01 test so that the fake subscriber server is
actually willing to negotiate the acme-tls/1 protocol. That way the ACME
server's TLS client actually waits to (fail to) get the certificate,
instead of dying immediately. This fix is related to the upgrade to
go1.21, not the loopvar experiment.
Fixes https://github.com/letsencrypt/boulder/issues/6950
Add two new methods, LeaseCRLShard and UpdateCRLShard, to the SA gRPC
interface. These methods work in concert both to prevent multiple
instances of crl-updater from stepping on each other's toes, and to lay
the groundwork for a less bursty version of crl-updater in the future.
Introduce a new database table, crlShards, which tracks the thisUpdate
and nextUpdate timestamps of each CRL shard for each issuer. It also has
a "leasedUntil" column, which is likewise a timestamp. Grant the SA user
read-write access to this table.
LeaseCRLShard updates the leasedUntil column of the identified shard to
the given time. It returns an error if the identified shard's
leasedUntil timestamp is already in the future. This provides a
mechanism for crl-updater instances to "lick the cookie", so to speak,
marking CRL shards as "taken" so that multiple crl-updater instances
don't attempt to work on the same shard at the same time. Using a
timestamp has the added benefit that leases are guaranteed to expire,
ensuring that we don't accidentally fail to work on a shard forever.
LeaseCRLShard has a second mode of operation, when a range of potential
shards is given in the request, rather than a single shard. In this
mode, it returns the shard (within the given range) whose thisUpdate
timestamp is oldest. (Shards with no thisUpdate timestamp, including
because the requested range includes shard indices the database doesn't
yet know about, count as older than any shard with any thisUpdate
timestamp.) This allows crl-updater instances which don't care which
shard they're working on to do the most urgent work first.
UpdateCRLShard updates the thisUpdate and nextUpdate timestamps of the
identified shard. This closes the loop with the second mode of
LeaseCRLShard above: by updating the thisUpdate timestamp, the method
marks the shard as no longer urgently needing to be worked on.
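For illustration, a minimal sketch of the single-shard lease semantics against the new table; the column names and error handling here are assumptions, not the SA's actual implementation:
```go
import (
	"context"
	"database/sql"
	"errors"
	"time"
)

// leaseCRLShard is a sketch only: atomically take the lease on one shard by
// advancing leasedUntil, but refuse if another instance's lease is still live.
func leaseCRLShard(ctx context.Context, db *sql.DB, issuerID, shardIdx int64, now, until time.Time) error {
	res, err := db.ExecContext(ctx,
		`UPDATE crlShards SET leasedUntil = ?
		 WHERE issuerID = ? AND idx = ? AND (leasedUntil IS NULL OR leasedUntil < ?)`,
		until, issuerID, shardIdx, now)
	if err != nil {
		return err
	}
	rows, err := res.RowsAffected()
	if err != nil {
		return err
	}
	if rows == 0 {
		// leasedUntil is still in the future: another crl-updater
		// instance currently holds this shard.
		return errors.New("shard is already leased")
	}
	return nil
}
```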
IN-9220 tracks creating this table in staging and production
Part of #6897
This introduces a small new package, `precert`, with one function
`Correspond` that checks a precertificate against a final certificate to
see if they correspond in the relationship described in RFC 6962.
This also modifies the `issuance` package so that RequestFromPrecert
generates an IssuanceRequest that keeps a reference to the
precertificate's bytes. This allows `issuance.Prepare` to do a
correspondence check when preparing to sign the final certificate. Note
in particular that the correspondence check is done against the
_linting_ version of the final certificate. This allows us to catch
correspondence problems before the real, trusted signature is actually
made.
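For illustration, a hedged usage sketch; the exact signature of `precert.Correspond` is assumed here to take the raw DER bytes of each certificate and return an error describing the first mismatch:
```go
// checkCorrespondence is a sketch of how the new package might be used
// during issuance; the surrounding code in the issuance package differs.
func checkCorrespondence(precertDER, lintCertDER []byte) error {
	if err := precert.Correspond(precertDER, lintCertDER); err != nil {
		return fmt.Errorf("precertificate and final (linting) certificate do not correspond: %w", err)
	}
	return nil
}
```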
Fixes #6945
This can take two values (typically the return values of a two-value
function) and panic if the error is non-nil, returning the interesting
value. This is particularly useful for cases where we statically know
the call will succeed.
Thanks to @mcpherrinm for the idea!
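A minimal sketch of such a helper, using placeholder naming (the actual helper's name and package aren't shown here):
```go
// must is a placeholder name: take a (value, error) pair and panic if the
// error is non-nil, otherwise return the value.
func must[T any](val T, err error) T {
	if err != nil {
		panic(err)
	}
	return val
}
```
For example, `must(url.Parse("http://localhost"))` would only panic if the hard-coded URL were somehow malformed.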
Over on the community forum, there have been requests for the new .vn
domains. weppos/publicsuffix-go hasn't had a release tagged in a little
while, so this is the result of:
```
go get github.com/weppos/publicsuffix-go@latest
go mod tidy
go mod vendor
```
The contacts field of an account can be very verbose, and is irrelevant
to the vast majority of requests made by an account -- e.g. creating
orders, validating challenges, and downloading certificates. To reduce
the length of our WFE log lines, remove the Contacts field from all
logs. When we actually need it, we can get it from the database.
Also remove the RequestEvent.TLS field, which is unused.
In WFE, we do a goodkey check when validating a self-authenticated POST
(i.e. when creating an account). For a while, that was a purely local
check, looking at a list of bad keys or bad moduli, or checking for
factorability. At some point we also added a backend check, querying the
SA to see if a key was blocked. However, we did not update this one code
path to distinguish "bad key" from "timeout querying SA." That meant
that sometimes we would give a badPublicKey Problem Document when we
should have given an internalServerError.
Related:
https://github.com/letsencrypt/boulder/issues/6795#issuecomment-1574217398
For "ordinary" errors like "file not found" for some part of the config,
we would prefer to log an error and exit without logging about a panic
and printing a stack trace.
To achieve that, we want to call `defer AuditPanic()` once, at the top
of `cmd/boulder`'s main. That's so early that we haven't yet parsed the
config, which means we haven't yet initialized a logger. We compromise:
`AuditPanic` now calls `log.Get()`, which will retrieve the configured
logger if one has been set up, or will create a default one (which logs
to stderr/stdout).
AuditPanic and Fail/FailOnError now cooperate: Fail/FailOnError panic
with a special type, and AuditPanic checks for that type and prints a
simple message before exiting when it's present.
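For illustration, a rough sketch of that cooperation; names are simplified and output goes to stderr to keep the sketch self-contained, whereas the real code logs via log.Get():
```go
import (
	"fmt"
	"os"
	"runtime/debug"
)

// failure is a sketch of the special panic type mentioned above.
type failure struct{ msg string }

// Fail panics with a failure value instead of exiting directly.
func Fail(msg string) {
	panic(failure{msg})
}

// AuditPanic is deferred at the top of main. An ordinary failure gets a
// plain message; any other panic gets a panic line plus stack trace.
// Either way, exit with nonzero status.
func AuditPanic() {
	if r := recover(); r != nil {
		if f, ok := r.(failure); ok {
			fmt.Fprintln(os.Stderr, f.msg)
		} else {
			fmt.Fprintf(os.Stderr, "panic: %v\n%s\n", r, debug.Stack())
		}
		os.Exit(1)
	}
}
```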
This PR also coincidentally fixes a bug: panicking didn't previously
cause the program to exit with nonzero status, because it recovered the
panic but then did not explicitly exit nonzero.
Fixes #6933
Previously, we had three chained functions for initializing a database:
- InitWrappedDb calls NewDbMap
- NewDbMap calls NewDbMapFromConfig
Since all three are exported, this left me wondering when to call one vs
the others.
It turns out that NewDbMap is only called from tests, so I renamed it to
DBMapForTest to make that clear.
NewDbMapFromConfig is only called internally to the SA, so I unexported
it as newDbMapFromMysqlConfig.
Also, I copied the ParseDSN call into InitWrappedDb, so it doesn't need
to call DBMapForTest. Now InitWrappedDb and DBMapForTest both
independently call newDbMapFromMysqlConfig.
I also noticed that InitDBMetrics was only called internally so I
unexported it.
Add the necessary scaffolding for deep health checking of our various
gRPC components. Each component implementation that also implements the
grpc.checker interface will be checked periodically, and the health
status of the component will be updated accordingly.
Add the necessary methods to SA to implement the grpc.checker interface
and register these new health checks with Consul.
Additionally:
- Update entry point script to check for ProxySQL readiness.
- Increase the poll rate for gRPC Consul checks from 5s to 2s, to help
with DNS failures caused by check failures on startup.
- Change log level for Consul from INFO to ERROR to deal with noisy logs
full of transport failures due to Consul gRPC checks firing before the
SAs are up.
Fixes #6878
Part of #6795
Previously, we allocated a buffer of 8192 bytes to write the stack trace
into. Presumably the intent was to set the _capacity_ to 8192 bytes but
leave the length at 0:
make([]byte, 0, 8192)
However, the code as written set the length to 8192 as well. This meant
that any time we logged a stack trace from a panic, if the stack trace
was less than 8192 bytes, we'd additionally log a bunch of zero bytes.
Setting the capacity was premature optimization anyhow.
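For illustration (not Boulder's exact code), here is how a length-8192 buffer leads to logging trailing zero bytes, and how truncating to the written length avoids it:
```go
package main

import (
	"log"
	"runtime"
)

func main() {
	buf := make([]byte, 8192)
	n := runtime.Stack(buf, false) // n = number of bytes actually written
	log.Print(string(buf))         // old behavior: trailing zero bytes when n < 8192
	log.Print(string(buf[:n]))     // only the captured stack trace
}
```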
Fixes #6931
Remove the load test stage of the integration test, which generates a
superfluous amount of log output.
Turn down logging on the CA and VA from info to error-only.
Part of https://github.com/letsencrypt/boulder/issues/6890
- Update consul container from `1.13.1` to `1.14.2` to match production.
- Specify `grpc_tls`, which is now required (rather than defaulting to
`8503`) when `enable_agent_tls_for_checks` is specified.
Part of #6911
When a user wants their email address deleted from the database but no
longer has access to their account, this allows an administrator to
clear it.
This adds `admin` as an alias for `admin-revoker`, because we'd like the
clear-email sub-command to be a part of that overall tool, but it's not
really revocation related.
Part of #6864
I think ideally we'd only ever call exportMetrics
with a valid time, but that would be a bigger refactor of this code.
This was the fix we lightly decided on in the discussion of #6635.
Fixes #6635
RFC 8659 (CAA; https://www.rfc-editor.org/rfc/rfc8659) says that "A CA
MUST NOT issue certificates for any FQDN if the Relevant RRset for that
FQDN contains a CAA critical Property for an unknown or unsupported
Property Tag."
Let's Encrypt does technically support the iodef property tag: we
recognize it, but then ignore it and never choose to send notifications
to the given contact address. Historically, we have carried around the
iodef property tags in our internal structures as though we might use
them, but all code referencing them was essentially dead code.
As part of a set of simplifications,
https://github.com/letsencrypt/boulder/pull/6886 made it so that we
completely ignore iodef property tags. However, this had the unintended
side-effect of causing iodef property tags with the Critical bit set to
be counted as "unknown critical" tags, which prevent issuance.
This change causes our property tag parsing code to recognize iodef tags
again, so that critical iodef tags don't prevent issuance.
Given two empty slices, one that is equal to nil and one that is not,
AssertDeepEquals used to produce this confusing output:
[[]] !(deep)= [[]]
After this change, it produces:
[[]string(nil)] !(deep)= [[]string{}]
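The new output is consistent with switching from the `%v` formatting verb to `%#v`; a minimal illustration (not the assertion helper's exact code):
```go
package main

import "fmt"

func main() {
	var a []string  // nil slice
	b := []string{} // non-nil empty slice
	fmt.Printf("[%v] !(deep)= [%v]\n", a, b)   // [[]] !(deep)= [[]]
	fmt.Printf("[%#v] !(deep)= [%#v]\n", a, b) // [[]string(nil)] !(deep)= [[]string{}]
}
```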
Adds new prometheus metrics from the configured log list and configured
CT logs to the ctpolicy constructor. `ct_operator_group_size_gauge`
returns the number of configured logs managed by each operator in the
log list. `ct_shard_expiration_seconds` returns a Unix timestamp
representation of the `end_exclusive` field for each configured log in
the `sctLogs` list. For posterity, Boulder retrieves SCTs from logs in
the `sctLogs` list.
```
ct_operator_group_size_gauge{operator="Operator A",source="finalLogs"} 2
ct_operator_group_size_gauge{operator="Operator A",source="sctLogs"} 4
ct_operator_group_size_gauge{operator="Operator B",source="sctLogs"} 2
ct_operator_group_size_gauge{operator="Operator D",source="sctLogs"} 1
ct_operator_group_size_gauge{operator="Operator F",source="finalLogs"} 1
ct_operator_group_size_gauge{operator="Operator F",source="infoLogs"} 1
ct_shard_expiration_seconds{logID="A1 Current",operator="Operator A"} 3.15576e+09
ct_shard_expiration_seconds{logID="A1 Future",operator="Operator A"} 3.47126688e+10
ct_shard_expiration_seconds{logID="A2 Current",operator="Operator A"} 3.15576e+09
ct_shard_expiration_seconds{logID="A2 Past",operator="Operator A"} 0
ct_shard_expiration_seconds{logID="B1",operator="Operator B"} 3.15576e+09
ct_shard_expiration_seconds{logID="B2",operator="Operator B"} 3.15576e+09
ct_shard_expiration_seconds{logID="D1",operator="Operator D"} 3.15576e+09
```
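For illustration, a rough sketch of how such gauges could be registered and populated; `logInfo`, `registerCTMetrics`, and the `sctLogs` map below are stand-ins, not the actual ctpolicy constructor code:
```go
import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// logInfo is a hypothetical stand-in for one configured log shard.
type logInfo struct {
	Name         string
	EndExclusive time.Time
}

func registerCTMetrics(reg prometheus.Registerer, sctLogs map[string][]logInfo) {
	groupSize := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "ct_operator_group_size_gauge",
		Help: "Number of configured logs managed by each operator.",
	}, []string{"operator", "source"})
	shardExpiry := prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "ct_shard_expiration_seconds",
		Help: "Unix timestamp of each configured shard's end_exclusive field.",
	}, []string{"logID", "operator"})
	reg.MustRegister(groupSize, shardExpiry)

	for operator, logs := range sctLogs {
		groupSize.WithLabelValues(operator, "sctLogs").Set(float64(len(logs)))
		for _, l := range logs {
			shardExpiry.WithLabelValues(l.Name, operator).Set(float64(l.EndExclusive.Unix()))
		}
	}
}
```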
Fixes https://github.com/letsencrypt/boulder/issues/5705
Removes the `//ctpolicy/loglist.go` init function, which previously
seeded the math/rand global random generator; as of Go 1.20, math/rand
does this automatically. See the release notes
[here](https://tip.golang.org/doc/go1.20):
> The [math/rand](https://tip.golang.org/pkg/math/rand/) package now
automatically seeds the global random number generator (used by
top-level functions like Float64 and Int) with a random value, and the
top-level [Seed](https://tip.golang.org/pkg/math/rand/#Seed) function
has been deprecated. Programs that need a reproducible sequence of
random numbers should prefer to allocate their own random source, using
rand.New(rand.NewSource(seed)).
Move the creation of the FilterSource outside of the conditional block,
so that the underlying source gets wrapped no matter which kind (either
an inMemorySource or a checkedRedisSource) it is.
This has two advantages: first, it makes static OCSP responders safer
and more accurate, because they base their responses on both the issuer
and the serial, not just the serial; and second, it makes the current
config validation tag, which marks the "issuerCerts" config field as
required with `min=1`, accurate.
We no longer issue OCSP responses for our intermediate certificates,
instead producing CRLs which cover those intermediates. Remove the OCSP
response from our integration test ceremony, remove the configuration
for the static ocsp-responder which serves that response, and remove the
integration test which spins up and checks that responder. Replace all
of the above with new CRLs generated as part of the integration test
ceremony.
A recent refactoring (https://github.com/letsencrypt/boulder/pull/6906)
started treating NXDOMAIN for a CAA lookup as a hard error, when it
should be treated (from Boulder's point of view) as meaning there is an
empty list of resource records.
Removes the `Hostname` and `Port` fields from an http-01
ValidationRecord model prior to storing the record in the database.
Using `"hostname":"example.com","port":"80"` as a snippet of a whole
validation record, we'll save a minimum of 36 bytes for each new http-01
ValidationRecord that gets stored. When retrieving the record, the
ValidationRecord `RehydrateHostPort` method will repopulate the
`Hostname` and `Port` fields from the `URL` field.
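For illustration, a hedged sketch of the rehydration idea; the real `RehydrateHostPort` method is defined on the ValidationRecord and may handle more cases than shown here:
```go
import "net/url"

// rehydrateHostPort is a sketch: repopulate the hostname and port from the
// stored validation URL, assuming a scheme-based default when the URL
// omits an explicit port.
func rehydrateHostPort(rawURL string) (host, port string, err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", "", err
	}
	host = u.Hostname()
	port = u.Port()
	if port == "" {
		if u.Scheme == "https" {
			port = "443"
		} else {
			port = "80"
		}
	}
	return host, port, nil
}
```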
Fixes the main goal of
https://github.com/letsencrypt/boulder/issues/5231.
---------
Co-authored-by: Samantha <hello@entropy.cat>
Enable SA gRPC health checks in Consul ahead of further changes for
#6878. Calls to the `Check` method of the SA's grpc.health.v1.Health
service must respond `SERVING` before the `sa` service will be
advertised in Consul DNS. Consul will continue to poll this service
every 5 seconds.
- Add `bconsul` docker service to the boulder `bluenet` and `rednet`
networks
- Add TLS credentials for `consul.boulder`:
```shell
$ openssl x509 -in consul.boulder/cert.pem -text | grep DNS
DNS:consul.boulder
```
- Update `test/grpc-creds/generate.sh` to add `consul.boulder`
- Update test SA configs to allow `consul.boulder` access to
`grpc.health.v1.Health`
Part of #6878
Add per-shard exponential backoff and retry to crl-updater. Each
individual CRL shard will be retried up to MaxAttempts (default 1)
times, with exponential backoff starting at 1 second and maxing out at 1
minute between each attempt.
This can effectively reduce the parallelism of crl-updater: while a
goroutine is sleeping between attempts of a failing shard, it is not
doing work on another shard. This is a desirable feature, since it means
that crl-updater gently reduces the total load it places on the network
and database when shards start to fail.
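For illustration, a rough sketch of the per-shard retry loop; the names and structure here are assumptions, not crl-updater's actual code:
```go
import (
	"context"
	"time"
)

// updateShardWithRetry is a sketch: attempt a single shard up to maxAttempts
// times, sleeping with exponential backoff (1s doubling, capped at 1 minute)
// between attempts.
func updateShardWithRetry(ctx context.Context, maxAttempts int, update func(context.Context) error) error {
	backoff := time.Second
	const maxBackoff = time.Minute
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			// While sleeping here, this goroutine does no work on other
			// shards, gently reducing load when shards start to fail.
			select {
			case <-time.After(backoff):
			case <-ctx.Done():
				return ctx.Err()
			}
			backoff *= 2
			if backoff > maxBackoff {
				backoff = maxBackoff
			}
		}
		if err = update(ctx); err == nil {
			return nil
		}
	}
	return err
}
```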
Setting this new config parameter is tracked in IN-9140
Fixes https://github.com/letsencrypt/boulder/issues/6895
When processing CAA records, keep track of the FQDN at which that CAA
record was found (which may be different from the FQDN for which we are
attempting issuance, since we crawl CAA records upwards from the
requested name to the TLD). Then surface this name upwards so that it
can be included in our own log lines and in the problem documents which
we return to clients.
Fixes https://github.com/letsencrypt/boulder/issues/3171