Commit Graph

32 Commits

Author SHA1 Message Date
Jacob Hoffman-Andrews eda496606d
crl-updater: split temporal/explicit sharding by serial (#7990)
When we turn on explicit sharding, we'll change the CA serial prefix, so
we can know that all issuance from the new prefixes uses explicit
sharding, and all issuance from the old prefixes uses temporal sharding.
This lets us avoid putting a revoked cert in two different CRL shards
(the temporal one and the explicit one).

To achieve this, the crl-updater gets a list of temporally sharded
serial prefixes. When it queries the `certificateStatus` table by date
(`GetRevokedCerts`), it will filter out explicitly sharded certificates:
those that don't have their prefix on the list.

Part of #7094
2025-02-04 11:45:46 -05:00
Jacob Hoffman-Andrews e0221b6bbe
crl-updater: query by explicit shard too (#7973)
Add querying by explicit shard (SA.GetRevokedCertsByShard) in addition
to querying by temporal shard (SA.GetRevokedCerts).

Merge results from both kinds of shard. De-duplicate by serial within a
shard, because the same certificate could wind up in a temporal shard
that matches its explicit shard.

When de-duplicating, validate that revocation reasons are the same or
(very unlikely) represent a re-revocation based on demonstrating key
compromise. This can happen because the two different SA queries occur
at slightly different times.

Add unit testing that CRL entries make it through the whole pipeline
from SA, to CA, to uploader.

Rename some types in the unittest to be more accessible.

Tweak a comment in SA.UpdateRevokedCertificate to make it clear that
status _and_ reason are critical for re-revocation.

Note: This GetRevokedCertsByShard code path will always return zero
certificates right now, because nothing is writing to the
`revokedCertificates` table. Writing to that table is gated on
certificates having CRL URLs in them, which is not yet implemented (and
will be config-gated).

Part of #7094
2025-01-27 10:11:09 -08:00
Aaron Gable e05d47a10a
Replace explicit int loops with range-over-int (#7434)
This adopts modern Go syntax to reduce the chance of off-by-one errors
and remove unnecessary loop variable declarations.

Fixes https://github.com/letsencrypt/boulder/issues/7227
2024-04-22 10:34:51 -07:00
Aaron Gable ab6e023b6f
Simplify issuance.NameID and how it is used (#7260)
Rename "IssuerNameID" to just "NameID". Similarly rename the standalone
functions which compute it to better describe their function. Add a
.NameID() directly to issuance.Issuer, so that callers in other packages
don't have to directly access the .Cert member of an Issuer. Finally,
rearrange the code in issuance.go to be sensibly grouped as concerning
NameIDs, Certificates, or Issuers, rather than all mixed up between the
three.

Fixes https://github.com/letsencrypt/boulder/issues/5152
2024-01-17 12:55:56 -08:00
Phil Porada 6925fad324
Finish migration from int64 timestamps to timestamppb (#7142)
This is a cleanup PR finishing the migration from int64 timestamps to
protobuf `*timestamppb.Timestamps` by removing all usage of the old
int64 fields. In the previous PR
https://github.com/letsencrypt/boulder/pull/7121 all fields were
switched to read from the protobuf timestamppb fields.

Adds a new case to `core.IsAnyNilOrZero` to check various properties of
a `*timestamppb.Timestamp` reducing the visual complexity for receivers.

Fixes https://github.com/letsencrypt/boulder/issues/7060
2023-11-27 13:37:31 -08:00
Aaron Gable 1d31a22245
Export crl shard calculation code for future use (#7127)
Make crl.GetChunkAtTIme an exported function, so that it can be used by
the RA in a future PR without making that PR bigger and harder to
review. Also move it from being a method to an independent function
which takes two new arguments to compensate for the loss of its
receiver.

Also move some tests from batch_test to updater_test, where they should
have been in the first place.

Part of https://github.com/letsencrypt/boulder/issues/7094
2023-11-02 11:21:01 -07:00
Phil Porada a5c2772004
Add and populate new protobuf Timestamp fields (#7070)
* Adds new `google.protobuf.Timestamp` fields to each .proto file where
we had been using `int64` fields as a timestamp.
* Updates relevant gRPC messages to populate the new
`google.protobuf.Timestamp` fields in addition to the old `int64`
timestamp fields.
* Added tests for each `<x>ToPB` and `PBto<x>` functions to ensure that
new fields passed into a gRPC message arrive as intended.
* Removed an unused error return from `PBToCert` and `PBToCertStatus`
and cleaned up each call site.

Built on-top of https://github.com/letsencrypt/boulder/pull/7069
Part 2 of 4 related to
https://github.com/letsencrypt/boulder/issues/7060
2023-10-11 12:12:12 -04:00
Aaron Gable 519c199c9a
Greatly simplify crl-updater's batch mode (#7079)
Replace crl-updater's overly complex RunOnce and updateIssuer methods
with a single, much simpler RunOnce that is modeled off of the
recently-redone continuous Run method's model. Instead of breaking
things down by issuer then shard, simply kick off everything in
parallel. This also improves batch mode's ability to listen for context
cancellations at all the appropriate times.

At the same time, move getShardMappings into the shared updater.go file
because it is used by both the batch and continuous modes of operation,
and improve uniformity of usage of the crlId structure in log output.

Fixes https://github.com/letsencrypt/boulder/issues/7066
2023-09-19 13:40:18 -07:00
Phil Porada 034316ef6a
Rename int64 timestamp related protobuf fields to <fieldname>NS (#7069)
Rename all of int64 timestamp fields to `<fieldname>NS` to indicate they
are Unix nanosecond timestamps.

Part 1 of 4 related to
https://github.com/letsencrypt/boulder/issues/7060
2023-09-15 13:49:07 -04:00
Aaron Gable 102b447e8d
Smoother scheduling and leasing for crl-updater (#7010)
Overhaul crl-updater's default (i.e. non-runOnce) behavior to update
individual CRL shards continuously, rather than updating all shards in a
large batch.

To accomplish this, it spins up one goroutine for each shard of each
issuer this updater is responsible for. Each goroutine is solely
responsible for its assigned shard. It sleeps for a random amount of
time (to stagger their starts), then begins a ticker to wake up every
updateInterval and re-issue its shard.

As part of this change, refactor updater.go into three separate files
(batch.go, continuous.go, and updater.go) containing functions dedicated
to single-run batch processing, long-running continuous processing, and
shared helpers, respectively.

IN-9475 tracks the deprecation of the `updateOffset` config key. The
other configuration changes in this PR do not require production
changes.

Fixes https://github.com/letsencrypt/boulder/issues/7023
2023-09-08 09:16:15 -07:00
Aaron Gable 9a4f0ca678
Deprecate LeaseCRLShards feature (#7009)
This feature flag is enabled in both staging and prod.
2023-08-07 15:17:00 -07:00
Aaron Gable 63319d8cd0
crl-updater: duplicate shard 0 as shard NumShards (#7008)
When crl-updater produces a CRL with shard index 0, have it also produce
an identical CRL with shard index NumShards (e.g. 128). This is the
first step towards having it only produce shards numbered 1 through
NumShards, i.e. transition from using 0-indexing to 1-indexing.

We want to do this because various aspects of Golang and gRPC cannot
tell the difference between "this struct/message has no shard index set"
and "this struct/message has shard index 0 set".

Part of https://github.com/letsencrypt/boulder/issues/7007
2023-07-26 15:32:20 -07:00
Aaron Gable 908421bb98
crl-updater: lease CRL shards to prevent races (#6941)
Add a new feature flag, LeaseCRLShards, which controls certain aspects
of crl-updater's behavior.

When this flag is enabled, crl-updater calls the new SA.LeaseCRLShard
method before beginning work on a shard. This prevents it from stepping
on the toes of another crl-updater instance which may be working on the
same shard. This is important to prevent two competing instances from
accidentally updating a CRL's Number (which is an integer representation
of its thisUpdate timestamp) *backwards*, which would be a compliance
violation.

When this flag is enabled, crl-updater also calls the new
SA.UpdateCRLShard method after finishing work on a shard.

In the future, additional work will be done to make crl-updater use the
"give me the oldest available shard" mode of the LeaseCRLShard method.

Fixes https://github.com/letsencrypt/boulder/issues/6897
2023-07-19 15:11:16 -07:00
Aaron Gable fe523f142d
crl-updater: retry failed shards (#6907)
Add per-shard exponential backoff and retry to crl-updater. Each
individual CRL shard will be retried up to MaxAttempts (default 1)
times, with exponential backoff starting at 1 second and maxing out at 1
minute between each attempt.

This can effectively reduce the parallelism of crl-updater: while a
goroutine is sleeping between attempts of a failing shard, it is not
doing work on another shard. This is a desirable feature, since it means
that crl-updater gently reduces the total load it places on the network
and database when shards start to fail.

Setting this new config parameter is tracked in IN-9140
Fixes https://github.com/letsencrypt/boulder/issues/6895
2023-05-22 12:59:09 -07:00
Aaron Gable 98fa0f07b4
Re-enable errcheck linter (#6819)
Enable the errcheck linter. Update the way we express exclusions to use
the new, non-deprecated, non-regex-based format. Fix all places where we
began accidentally violating errcheck while it was disabled.
2023-04-14 15:41:12 -04:00
Aaron Gable ba34ac6b6e
Use read-only SA clients in wfe, ocsp, and crl (#6484)
In the WFE, ocsp-responder, and crl-updater, switch from using
StorageAuthorityClients to StorageAuthorityReadOnlyClients. This ensures
that these services cannot call methods which write to our database.

Fixes #6454
2022-12-02 13:48:28 -08:00
Aaron Gable 6efd941e3c
Stabilize CRL shard boundaries (#6445)
Add two new config keys to the crl-updater:
* shardWidth, which controls the width of the chunks that we divide all
of time into, with a default value of "16h" (approximately the same as
today's shard width derived from 128 shards covering 90 days); and
* lookbackPeriod, which controls the amount of already-expired
certificates that should be included in our CRLs to ensure that even
certificates which are revoked immediately before they expire still show
up in aborts least one CRL, with a default value of "24h" (approximately
the same as today's lookback period derived from our run frequency of
6h).

Use these two new values to change the way CRL shards are computed.

Previously, we would compute the total time we care about based on the
configured certificate lifetime (to determine how far forward to look)
and the configured update period (to determine how far back to look),
and then divide that time evenly by the number of shards. However, this
method had two fatal flaws. First, if the certificate lifetime is
configured incorrectly, then the CRL updater will fail to query the
database for some certs that should be included in the CRLs. Second, if
the update period is changed, this would change the lookback period,
which in turn would change the shard width, causing all CRL entries to
suddenly change which shard they're in.

Instead, first compute all chunk locations based only on the shard width
and number of shards. Then determine which chunks we need to care about
based on the configured lookback period and by querying the database for
the farthest-future expiration, to ensure we cover all extant
certificates. This may mean that more than one chunk of time will get
mapped to a single shard, but that's okay -- each chunk will remain
mapped to the same shard for the whole time we care about it.

Fixes #6438
Fixes #6440
2022-10-27 15:59:48 -07:00
Aaron Gable 9bd0c7967f
Do CRL shard math with shorter durations (#6425)
Change the anchor time used to stabilize shard boundaries from
the zero time (time.Time{}) to the notBefore date of Let's Encrypt's
first self-signed root certificate.

This prevents us from attempting to compute durations that are
greater than 290 years, the maximum representable duration in
Go. Previously, using the zero time was causing our durations to
all be equal to the maximum duration, removing the utility of the
anchor time and causing our shard boundaries to drift with time.
2022-10-03 11:24:40 -07:00
Matthew McPherrin 6874d909f2
Enable go vet printf-auditing our logger (#6421)
Explicitly inform go vet about the names of our logging methods
which should be checked in the same way as fmt.Printf is. Although
go vet can often find such functions on its own, it can't find these
ones because log.Logger is an interface, not a struct.

In addition, fix several format string mistakes caught by go vet.
2022-09-30 16:37:53 -07:00
Aaron Gable 65a60807cc
Fix new crl-updater numEntries log line (#6420) 2022-09-30 14:07:19 -07:00
Aaron Gable 76583552c2
Simplify crl-updater's simultaneous gRPC streams (#6419)
Previously, we would stream CRL Entries directly from the SA's response
stream into the CA's request stream, and similarly directly stream bytes
from the CA's response stream into the Storer's request stream.

Since we're seeing odd errors and inconsistencies in our gRPC streaming
metrics, simplify these to only conduct one stream at a time. This will
make our streaming and error semantics much simpler, at the cost of
memory usage in the updater.
2022-09-29 14:57:47 -07:00
Jacob Hoffman-Andrews de2574a37a
crl/updater: fix incorrect logging of error (#6401)
Fix instances where an error check was conditioned on something
other than the traditional `err`, such as `myStruct.err`, but then the
error being logged was the `err` from elsewhere in the function.
2022-09-28 09:30:32 -07:00
Aaron Gable d53c90a3bc
Streamline and test crl-updater errors and audit logs (#6382)
Modify the way errors are handled in crl-updater:
- Rather than having each method in the tick, tickIssuer, tickShard
  chain concatenate all errors from its children, simply have them
  summarize the errors. This results in much shorter error messages.
- Rather than having each method log its own errors as it returns, have
  each caller responsible for logging the errors it receives from its
  children.

In addition, add tests for tick, tickIssuer, and tickShard which cover
their simple errors paths, where one of the gRPC requests to the SA, CA,
or CRLStorer encounters an error. These tests let us ensure that errors
are being properly propagated upwards through the layers of indirection
and goroutines in the three methods under test, and that the appropriate
metrics are being incremented and log messages are being printed.

Fixes #6375
2022-09-14 16:41:28 -07:00
Aaron Gable 4eb9a9f06a
crl-updater: log once per shard on success or failure (#6373)
Ensure that crl-updater logs one line per shard, no matter whether
that shard succeeded or failed.
2022-09-12 14:35:34 -07:00
Aaron Gable 7f189f7a3b
Improve how crl-updater formats and surfaces errors (#6369)
Make every function in the Run -> Tick -> tickIssuer -> tickShard chain
return an error. Make that return value a named return (which we usually
avoid) so that we can remove the manual setting of the metric result
label and have the deferred metric handling function take care of that
instead. In addition, let that cleanup function wrap the returned error
(if any) with the identity of the shard, issuer, or tick that is
returning it, so that we don't have to include that info in every
individual error message. Finally, have the functions which spin off
many helpers (Tick and tickIssuer) collect all of their helpers' errors
and only surface that error at the end, to ensure the process completes
even in the presence of transient errors.

In crl-updater's main, surface the error returned by Run or Tick, to
make debugging easier.
2022-09-12 11:36:42 -07:00
Aaron Gable 9228e60159
Fix goroutine leak by canceling gRPC contexts (#6371)
In crl-updater's tickShard, which handles all of the gRPC requests
to the SA, CA, and crl-storer, ensure that any open gRPC connections
get closed when the function returns for any reason by canceling the
context object used in those connections.

This fixes a goroutine leak where the updater would open simultaneous
connections to the SA and CA, get an error from the SA, and then return
without closing the connection to the CA. This left the CA stream open
forever, leading to goroutine and memory leaks in the updater.
2022-09-09 16:03:23 -07:00
Aaron Gable efa0cd0a10
Improve crl metrics (#6358)
Remove the secondsSinceSuccess metric, which was both never
being set and rather useless given the bursty nature of CRL update
scheduling.

Add an "issuer" label to the crl_updater_generated metric, to match
the other metrics exported by the updater and the storer.
2022-09-09 14:22:04 -07:00
Samantha 7ed4cd992e
CRL: Improve shard identification in error messages (#6306)
- Create new package `crl`
- Add a common unique CRL identifier `crl.id` with constructor `crl.Id()`
- Replace `shardIdx` with `crl.Id` in `storer` and `updater` errors
- Add a common type for the `CRLNumber` field `crl.number` with constructor
  `crl.Number()`
- Replace `CRLNumber` construction in CA and CRL package with `crl.Number()`

Resolves #6261
2022-08-23 12:35:00 -07:00
Aaron Gable 6a9bb399f7
Create new crl-storer service (#6264)
Create a new crl-storer service, which receives CRL shards via gRPC and
uploads them to an S3 bucket. It ignores AWS SDK configuration in the
usual places, in favor of configuration from our standard JSON service
config files. It ensures that the CRLs it receives parse and are signed
by the appropriate issuer before uploading them.

Integrate crl-updater with the new service. It streams bytes to the
crl-storer as it receives them from the CA, without performing any
checking at the same time. This new functionality is disabled if the
crl-updater does not have a config stanza instructing it how to connect
to the crl-storer.

Finally, add a new test component, the s3-test-srv. This acts similarly
to the existing mail-test-srv: it receives requests, stores information
about them, and exposes that information for later querying by the
integration test. The integration test uses this to ensure that a
newly-revoked certificate does show up in the next generation of CRLs
produced.

Fixes #6162
2022-08-08 16:22:48 -07:00
Aaron Gable 733bcec941
Standardize on 'shardIdx' to identify crl shards (#6263)
Realized that "ShardID" is a bad name, because a real unique
identifier of a shard would include the issuer, crl number, and
shard number. Switching to "ShardIdx" makes it clearer that
shards within a full and complete CRL are identified by a
zero-indexed integer.
2022-08-02 13:21:26 -07:00
Aaron Gable 694d73d67b
crl-updater: add UpdateOffset config to run on a schedule (#6260)
Add a new config key `UpdateOffset` to crl-updater, which causes it to
run on a regular schedule rather than running immediately upon startup
and then every `UpdatePeriod` after that. It is safe for this new config
key to be omitted and take the default zero value.

Also add a new command line flag `runOnce` to crl-updater which causes
it to immediately run a single time and then exit, rather than running
continuously as a daemon. This will be useful for integration tests and
emergency situations.

Part of #6163
2022-07-29 13:30:16 -07:00
Aaron Gable 436061fb35
CRL: Create crl-updater service (#6212)
Create a new service named crl-updater. It is responsible for
maintaining the full set of CRLs we issue: one "full and complete" CRL
for each currently-active Issuer, split into a number of "shards" which
are essentially CRLs with arbitrary scopes.

The crl-updater is modeled after the ocsp-updater: it is a long-running
standalone service that wakes up periodically, does a large amount of
work in parallel, and then sleeps. The period at which it wakes to do
work is configurable. Unlike the ocsp-responder, it does all of its work
every time it wakes, so we expect to set the update frequency at 6-24
hours.

Maintaining CRL scopes is done statelessly. Every certificate belongs to
a specific "bucket", given its notAfter date. This mapping is generally
unchanging over the life of the certificate, so revoked certificate
entries will not be moving between shards upon every update. The only
exception is if we change the number of shards, in which case all of the
bucket boundaries will be recomputed. For more details, see the comment
on `getShardBoundaries`.

It uses the new SA.GetRevokedCerts method to collect all of the revoked
certificates whose notAfter timestamps fall within the boundaries of
each shard's time-bucket. It uses the new CA.GenerateCRL method to sign
the CRLs. In the future, it will send signed CRLs to the crl-storer to
be persisted outside our infrastructure.

Fixes #6163
2022-07-08 09:34:51 -07:00