Simplify the way we load and handle CT logs: rather than keeping them
grouped by operator, simply keep a flat list and annotate each log with
its operator's name. At submission time, instead of shuffling operator
groups and submitting to one log from each group, shuffle the whole set
of individual logs.
Support tiled logs by similarly annotating each log with whether it is
tiled or not.
Also make the way we know when to stop getting SCTs more robust.
Previously we would stop as soon as we had two, since we knew that they
would be from different operator groups and didn't care about tiled
logs. Instead, introduce an explicit CT policy compliance evaluation
function which tells us if the set of SCTs we have so far forms a
compliant set.
This is not our desired end-state for CT log submission. Ideally we'd
like to: simplify things even further (don't race all the logs, simply
try to submit to two at a time), improve selection (intelligently pick
the next log to submit to, rather than just a random shuffle), and
fine-tune latency (tiled logs should have longer timeouts than classic
ones). Those improvements will come in future PRs.
Part of https://github.com/letsencrypt/boulder/issues/7872
The schema tool used to parse log_list_schema.json doesn't work well
with the updated schema. This is going to be required to support
static-ct-api logs from current Chrome log lists.
Instead, use the loglist3 package inside the certificate-transparency-go
project, which Boulder already uses for CT submission otherwise.
As well, the Log IDs and keys returned from loglist3 have already been
base64 decoded, so this re-encodes them to minimize the impact on the
rest of the codebase and keep this change small.
The test log_list.json file needed to be made a bit more realistic for
loglist3 to parse without base64 or date parsing errors.
The intention here should be to initialize a slice with a capacity of
len(remaining) rather than initializing the length of this slice, so that
the resulting error message doesn't start with empty-string entries.
Replace all of Boulder's usage of the Go stdlib "math/rand" package with
the newer "math/rand/v2" package which first became available in go1.22.
This package has an improved API and faster performance across the
board.
See https://go.dev/blog/randv2 and https://go.dev/blog/chacha8rand for
details.
Have the RA populate the new pubpb.Request.Kind field, along side the
deprecated pubpb.Request.Precert field, so that the publisher has more
detailed information about what kind of CT submission this is.
Fixes https://github.com/letsencrypt/boulder/issues/7161
Adds new prometheus metrics from the configured log list and configured
CT logs to the ctpolicy constructor. `ct_operator_group_size_gauge`
returns the number of configured logs managed by each operator in the
log list. `ct_shard_expiration_seconds` returns a Unix timestamp
representation of the `end_exclusive` field for each configured log in
the `sctLogs` list. For posterity, Boulder retrieves SCTs from logs in
the `sctLogs` list.
```
ct_operator_group_size_gauge{operator="Operator A",source="finalLogs"} 2
ct_operator_group_size_gauge{operator="Operator A",source="sctLogs"} 4
ct_operator_group_size_gauge{operator="Operator B",source="sctLogs"} 2
ct_operator_group_size_gauge{operator="Operator D",source="sctLogs"} 1
ct_operator_group_size_gauge{operator="Operator F",source="finalLogs"} 1
ct_operator_group_size_gauge{operator="Operator F",source="infoLogs"} 1
ct_shard_expiration_seconds{logID="A1 Current",operator="Operator A"} 3.15576e+09
ct_shard_expiration_seconds{logID="A1 Future",operator="Operator A"} 3.47126688e+10
ct_shard_expiration_seconds{logID="A2 Current",operator="Operator A"} 3.15576e+09
ct_shard_expiration_seconds{logID="A2 Past",operator="Operator A"} 0
ct_shard_expiration_seconds{logID="B1",operator="Operator B"} 3.15576e+09
ct_shard_expiration_seconds{logID="B2",operator="Operator B"} 3.15576e+09
ct_shard_expiration_seconds{logID="D1",operator="Operator D"} 3.15576e+09
```
Fixes https://github.com/letsencrypt/boulder/issues/5705
Removes the `//ctpolicy/loglist.go` init function which previously
seeded the math/rand global random generator in favor of Go 1.20
math/rand now doing this automatically. See release notes
[here.](https://tip.golang.org/doc/go1.20)
> The [math/rand](https://tip.golang.org/pkg/math/rand/) package now
automatically seeds the global random number generator (used by
top-level functions like Float64 and Int) with a random value, and the
top-level [Seed](https://tip.golang.org/pkg/math/rand/#Seed) function
has been deprecated. Programs that need a reproducible sequence of
random numbers should prefer to allocate their own random source, using
rand.New(rand.NewSource(seed)).
- Require `letsencrypt/validator` package.
- Add a framework for registering configuration structs and any custom
validators for each Boulder component at `init()` time.
- Add a `validate` subcommand which allows you to pass a `-component`
name and `-config` file path.
- Expose validation via exported utility functions
`cmd.LookupConfigValidator()`, `cmd.ValidateJSONConfig()` and
`cmd.ValidateYAMLConfig()`.
- Add unit test which validates all registered component configuration
structs against test configuration files.
Part of #6052
We rely on the ratelimit/ package in CI to validate our ratelimit
configurations. However, because that package relies on cmd/ just for
cmd.ConfigDuration, many additional dependencies get pulled in.
This refactors just that struct to a separate config package. This was
done using Goland's automatic refactoring tooling, which also organized
a few imports while it was touching them, keeping standard library,
internal and external dependencies grouped.
While the Chrome log_list.json has a `state` stanza for every
log, the all_logs_list.json file does not. This code was originally
tested against the former file, but we are actually using the
latter file in production. Add a check for missing `state` stanzas
to avoid a nil pointer dereference.
Enable the "unparam" linter, which checks for unused function
parameters, unused function return values, and parameters and
return values that always have the same value every time they
are used.
In addition, fix many instances where the unparam linter complains
about our existing codebase. Remove error return values from a
number of functions that never return an error, remove or use
context and test parameters that were previously unused, and
simplify a number of (mostly test-only) functions that always take the
same value for their parameter. Most notably, remove the ability to
customize the RSA Public Exponent from the ceremony tooling,
since it should always be 65537 anyway.
Fixes#6104
The iotuil package has been deprecated since go1.16; the various
functions it provided now exist in the os and io packages. Replace all
instances of ioutil with either io or os, as appropriate.
When a context is cancelled or times out, all child contexts (such as
those from `subCtx, _ := context.WithCancel(ctx)`) are also immediately
cancelled: they all become able to read from their `ctx.Done()` channel
simultaneously. This means that child goroutines can realize that they
have been cancelled and return *before* their parent goroutine has
realized that it has also been cancelled. This results in technically
correct but unhelpful error messages, such as the parent reporting that
it got too many errors from its children, rather than reporting that it
timed out.
In `getOperatorSCTs`, avoid this by always (in the error case) reading
results from all subroutines. When the parent context is cancelled, all
of the child goroutines will quickly exit, and `getOperatorSCTs` will
quickly collect their errors. Then, before blindly returning those errors,
it simply checks its own context and returns that error instead, if it
exists.
Add a new code path to the ctpolicy package which enforces Chrome's new
CT Policy, which requires that SCTs come from logs run by two different
operators, rather than one Google and one non-Google log. To achieve
this, invert the "race" logic: rather than assuming we always have two
groups, and racing the logs within each group against each other, we now
race the various groups against each other, and pick just one arbitrary
log from each group to attempt submission to.
Ensure that the new code path does the right thing by adding a new zlint
which checks that the two SCTs embedded in a certificate come from logs
run by different operators. To support this lint, which needs to have a
canonical mapping from logs to their operators, import the Chrome CT Log
List JSON Schema and autogenerate Go structs from it so that we can
parse a real CT Log List. Also add flags to all services which run these
lints (the CA and cert-checker) to let them load a CT Log List from disk
and provide it to the lint.
Finally, since we now have the ability to load a CT Log List file
anyway, use this capability to simplify configuration of the RA. Rather
than listing all of the details for each log we're willing to submit to,
simply list the names (technically, Descriptions) of each log, and look
up the rest of the details from the log list file.
To support this change, SRE will need to deploy log list files (the real
Chrome log list for prod, and a custom log list for staging) and then
update the configuration of the RA, CA, and cert-checker. Once that
transition is complete, the deletion TODOs left behind by this change
will be able to be completed, removing the old RA configuration and old
ctpolicy race logic.
Part of #5938
Running an older version (v0.0.1-2020.1.4) of `staticcheck` in
whole-program mode (`staticcheck --unused.whole-program=true -- ./...`)
finds various instances of unused code which don't normally show up
as CI issues. I've used this to find and remove a large chunk of the
unused code, to pave the way for additional large deletions accompanying
the WFE1 removal.
Part of #5681
Update all of our tests to use `AssertMetricWithLabelsEquals`
instead of combinations of the older `CountFoo` helpers with
simple asserts. This coalesces all of our prometheus inspection
logic into a single function, allowing the deletion of four separate
helper functions.
Delete the PublisherClientWrapper and PublisherServerWrapper. Update
various structs and functions to expect a pubpb.PublisherClient instead
of a core.Publisher; these two interfaces differ only in that the
auto-generated PublisherClient takes a variadic CallOptions parameter.
Update all mock publishers in tests to match the new interface. Finally,
delete the now-unused core.Publisher interface and some already-unused
mock-generating code.
This deletes a single sanity check (for a nil SCT even when there is a
nil error), but that check was redundant with an identical check in the
only extant client code in ctpolicy.go.
Fixes#5323
This change adds two new test assertion helpers, `AssertErrorIs`
and `AssertErrorWraps`. The former is a wrapper around `errors.Is`,
and asserts that the error's wrapping chain contains a specific (i.e.
singleton) error. The latter is a wrapper around `errors.As`, and
asserts that the error's wrapping chain contains any error which is
of the given type; it also has the same unwrapping side effect as
`errors.As`, which can be useful for further assertions about the
contents of the error.
It also makes two small changes to our `berrors` package, namely
making `berrors.ErrorType` itself an error rather than just an int,
and giving `berrors.BoulderError` an `Unwrap()` method which
exposes that inner `ErrorType`. This allows us to use the two new
helpers above to make assertions about berrors, rather than
having to hand-roll equality assertions about their types.
Finally, it takes advantage of the two changes above to greatly
simplify many of the assertions in our tests, removing conditional
checks and replacing them with simple assertions.
And make corresponding changes to call sites and wrappers.
Note that proto2 vs proto3 is distinction in the syntax of the .proto files
and doesn't change the wire format, so this meets the deployability
guidelines.
In a handful of places I've nuked old stats which are not used in any alerts or dashboards as they either duplicate other stats or don't provide much insight/have never actually been used. If we feel like we need them again in the future it's trivial to add them back.
There aren't many dashboards that rely on old statsd style metrics, but a few will need to be updated when this change is deployed. There are also a few cases where prometheus labels have been changed from camel to snake case, dashboards that use these will also need to be updated. As far as I can tell no alerts are impacted by this change.
Fixes#4591.
This follows up on some refactoring we had done previously but not
completed. This removes various binary-specific config structs from the
common cmd package, and moves them into their appropriate packages. In
the case of CT configs, they had to be moved into their own package to
avoid a dependency loop between RA and ctpolicy.
When a whole log group errors or times out, this can create confusing patterns
in the stats. Including these per-group errors and timeouts as their own
subsection of the stat should hopefully help make things clearer.
Required a little bit of rework of the RA issuance flow (to add parsing of the precert to determine the expiration date, and moving final cert parsing before final cert submission) and RA tests, but I think it shouldn't create any issues...
Fixes#3197.
ctpolicy permutes logs before submitting to them, to give each log a
chance. The stagger feature was meant to sleep for an amount of time
proportional to a log's position in the permuted list. However, it was
actually using the log's position in the un-permuted list, so logs that
appear later in the config would always be submitted to later than logs
earlier in the config.
This fixes that, and does some minor variable renaming for clarity.
Things removed:
* features.EmbedSCTs (and all the associated RA/CA/ocsp-updater code etc)
* ca.enablePrecertificateFlow (and all the associated RA/CA code)
* sa.AddSCTReceipt and sa.GetSCTReceipt RPCs
* publisher.SubmitToCT and publisher.SubmitToSingleCT RPCs
Fixes#3755.
A very large number of the logger calls are of the form log.Function(fmt.Sprintf(...)).
Rather than sprinkling fmt.Sprintf at every logger call site, provide formatting versions
of the logger functions and call these directly with the format and arguments.
While here remove some unnecessary trailing newlines and calls to String/Error.
* Randomize order of CT logs when submitting precerts so we maximize the chances we actually exercise all of the logs in a group and not just the first in the list.
* Add metrics for winning logs
When the WFE calls the RA the RA creates a sub context which is cancelled when
the RPC returns. Because we were spawning the publisher RPC calls in a
goroutine with the context from ra.IssueCertificate as soon as
ra.IssueCertificate returned that context was being canceled which in turn
canceled the publisher RPC calls.
Instead of using the RA RPC context simply use a `context.Background()` so
that the RPC context doesn't break these submissions. Also return to
pre-features.CancelCTSubmissions behavior where precert submissions would
be canceled once we retrieved SCTs from the winning logs instead of relying
on the magic behavior of the RA RPC canceling them itself.
Submits final certificates to any configured CT logs. This doesn't introduce a feature flag as it is config gated, any log we want to submit final certificates to needs to have it's log description updated to include the `"submitFinalCerts": true` field.
Fixes#3605.
This commit updates CTPolicy & the RA to return a distinct error when
the RA is unable to fetch the required SCTs for a certificate when
processing an issuance. This error type is plumbed up to the WFE/WFE2
where the `web/probs.go` code converts it into a server internal error
with a suitable user facing error.
Adds SCT embedding to the certificate issuance flow. When a issuance is requested a precertificate (the requested certificate but poisoned with the critical CT extension) is issued and submitted to the required CT logs. Once the SCTs for the precertificate have been collected a new certificate is issued with the poison extension replace with a SCT list extension containing the retrieved SCTs.
Fixes#2244, fixes#3492 and fixes#3429.
Add a set of logs which will be submitted to but not relied on for their SCTs,
this allows us to test submissions to a particular log or submit to a log which
is not yet approved by a browser/root program.
Also add a feature which stops cancellations of remaining submissions when racing
to get a SCT from a group of logs.
Additionally add an informational log that always times out in config-next.
Fixes#3464 and fixes#3465.
With the CT policy changes, we cancel any outstanding requests in a
group once we've gotten a successful response from any log in that
group. That means various function calls will return early with an error
code indicating cancellation. We want to avoid logging such error codes,
because they were not really errors, they were intentional.
This change introduces a small utility package called "canceled", which
checks for both context.Canceled and the gRPC return code indicating
cancelled.