Update from go1.23.1 to go1.23.6 for our primary CI and release builds.
This brings in a few security fixes that aren't directly relevant to us.
Add go1.24.0 to our matrix of CI and release versions, to prepare for
switching to this next major version in prod.
Replace all of Boulder's usage of the Go stdlib "math/rand" package with
the newer "math/rand/v2" package which first became available in go1.22.
This package has an improved API and faster performance across the
board.
See https://go.dev/blog/randv2 and https://go.dev/blog/chacha8rand for
details.
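As a rough illustration of what the migration looks like at a call site (the helper names below are hypothetical and not taken from Boulder), the v2 package renames `Intn` to `IntN` and adds a generic `N` that works directly on types such as time.Duration:

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"time"
)

// pickIndex returns a random index in [0, n). The v1 spelling was
// rand.Intn; v2 renames it to rand.IntN.
func pickIndex(n int) int {
	return rand.IntN(n)
}

// pickBackoff returns a random duration in [0, limit). The generic rand.N
// replaces the v1 idiom time.Duration(rand.Int63n(int64(limit))).
func pickBackoff(limit time.Duration) time.Duration {
	return rand.N(limit)
}

func main() {
	fmt.Println(pickIndex(10))
	fmt.Println(pickBackoff(time.Second))
}
```

Both rand.IntN and rand.N panic if their argument is not positive, the same as their v1 counterparts.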
Add a new "issuer" label to the ocsp-responder's ocsp_filter_responses
metric. This allows the count of responses served by ocsp-responder to
be broken down by which intermediate issued the certificate (and OCSP
response) in question.
This approach has the benefit of being minimal. The filterSource is the
only place within ocsp-responder that actually has knowledge of which
intermediate issued the certificate/ocsp response. The HTTP-handling
code above filterSource and the other redis and live-signing sources
below filterSource have no knowledge of the set of issuing
intermediates. They operate solely on the serial, because we guarantee
that our serials are unique across all issuers. So adding the metric
label here means that we don't have to make any other ocsp-responder
code aware of the issuers.
However, this approach has the cost of being somewhat surprising. Every
source has a `counter` metric with a "result" label; adding this
"issuer" label makes the filterSource's metric unique.
Fixes https://github.com/letsencrypt/boulder/issues/7538
Replace "mocks.StorageAuthority" with "sapb.StorageAuthorityClient" in
our test mocks. This improves them by removing implementations of the
methods the tests don't actually need, instead of inheriting lots of
extraneous methods from the huge and cumbersome mocks.StorageAuthority.
This reduces our usage of mocks.StorageAuthority to only the WFE tests
(which create one in the frequently-used setup() function), which will
make refactoring those mocks in the pursuit of
https://github.com/letsencrypt/boulder/issues/7476 much easier.
Part of https://github.com/letsencrypt/boulder/issues/7476
Rename "IssuerNameID" to just "NameID". Similarly rename the standalone
functions which compute it to better describe their function. Add a
.NameID() directly to issuance.Issuer, so that callers in other packages
don't have to directly access the .Cert member of an Issuer. Finally,
rearrange the code in issuance.go to be sensibly grouped as concerning
NameIDs, Certificates, or Issuers, rather than all mixed up between the
three.
Fixes https://github.com/letsencrypt/boulder/issues/5152
The issuance.KeyHash() and issuance.NameHash() methods are used solely
by the OCSP responder filterSource. To avoid any possibility of
confusion between these OCSP-specific values and the normal SKID
extension, move them from the issuance package into the filterSource
itself.
In #6478, we stopped passing through Redis errors to the top-level
Responder object, preferring instead to live-sign. As part of that
change, we logged the Redis errors so they wouldn't disappear. However,
the sample rate for those errors was hard coded to 1-in-1000, instead of
following the LogSampleRate configured in the JSON.
This adds a field to redisSource for logSampleRate, and passes it
through from the JSON config in ocsp-responder/main.go.
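A minimal sketch of 1-in-N log sampling driven by a configured rate. The `sampledLogger` type and its methods are hypothetical stand-ins, not the actual redisSource code, and treating a rate of 0 as "never log" is an assumption:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand/v2"
)

// sampledLogger is a hypothetical stand-in for the logging state that
// redisSource carries.
type sampledLogger struct {
	// logSampleRate of N logs roughly 1 in N errors; 0 disables logging
	// (an assumption made for this sketch).
	logSampleRate int
}

// shouldLog reports whether this particular error should be logged.
func (s sampledLogger) shouldLog() bool {
	return s.logSampleRate > 0 && rand.IntN(s.logSampleRate) == 0
}

// countLogged simulates the given number of Redis errors and counts how
// many would have been logged at the given sample rate.
func countLogged(rate, attempts int) int {
	s := sampledLogger{logSampleRate: rate}
	logged := 0
	for i := 0; i < attempts; i++ {
		if s.shouldLog() {
			logged++
		}
	}
	return logged
}

func main() {
	err := errors.New("connection refused")
	s := sampledLogger{logSampleRate: 1}
	if s.shouldLog() {
		fmt.Printf("redis lookup failed: %s\n", err)
	}
	// With a rate of 1000, roughly 100 of 100000 errors are logged.
	fmt.Println(countLogged(1000, 100000))
}
```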
Part of #7091
This new standard library method returns a context with all of the
original metadata (e.g. tracing spans) still attached, but which will
not be canceled by any cancel funcs, deadlines, or timeouts set on the
parent context. We do this manually in a few places to prevent client
cancellations (usually disconnects) from disrupting our work, so this
just makes that code slightly simpler.
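The behavior described matches context.WithoutCancel, added to the standard library in go1.21; that this is the function meant here is an assumption, since the message does not name it. A small sketch (the key type and trace value are illustrative only):

```go
package main

import (
	"context"
	"fmt"
)

// ctxKey is an illustrative context key type.
type ctxKey struct{}

// detachedState cancels a parent context and reports the state of a
// context derived from it via context.WithoutCancel.
func detachedState() (parentErr, detachedErr error, value string) {
	parent, cancel := context.WithCancel(
		context.WithValue(context.Background(), ctxKey{}, "trace-123"))
	detached := context.WithoutCancel(parent)

	cancel() // simulate the client disconnecting

	// Values (such as tracing metadata) are still visible on the detached
	// context, but the parent's cancellation is not.
	value, _ = detached.Value(ctxKey{}).(string)
	return parent.Err(), detached.Err(), value
}

func main() {
	parentErr, detachedErr, value := detachedState()
	fmt.Println("parent:", parentErr)
	fmt.Println("detached:", detachedErr)
	fmt.Println("value:", value)
}
```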
Fixes https://github.com/letsencrypt/boulder/issues/5506
Run staticcheck as a standalone binary rather than as a library via
golangci-lint. From the golangci-lint help output:
> staticcheck (megacheck): It's a set of rules from staticcheck. It's
not the same thing as the staticcheck binary. The author of staticcheck
doesn't support or approve the use of staticcheck as a library inside
golangci-lint.
We decided to disable ST1000 which warns about incorrect or missing
package comments.
For SA4011, I chose to change the semantics[1] of the for loop rather
than ignoring the SA4011 lint for that line.
Fixes https://github.com/letsencrypt/boulder/issues/6988
[1] https://go.dev/ref/spec#Continue_statements
This change replaces [gorp] with [borp].
The changes consist of a mass renaming of the import and comments / doc
fixups, plus modifications of many call sites to provide a
context.Context everywhere, since borp newly requires this (this was one
of the motivating factors for the borp fork).
This also refactors `github.com/letsencrypt/boulder/db.WrappedMap` and
`github.com/letsencrypt/boulder/db.Transaction` to not embed their
underlying gorp/borp objects, but to have them as plain fields. This
ensures that we can only call methods on them that are specifically
implemented in `github.com/letsencrypt/boulder/db`, so we don't miss
wrapping any. This required introducing a `NewWrappedMap` method along
with accessors `SQLDb()` and `BorpDB()` to get at the internal fields
during metrics and logging setup.
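To illustrate why a plain field is safer than embedding here, a toy sketch follows. The `innerDB`, `embeddedMap`, and `fieldMap` types are hypothetical and much simpler than the real db.WrappedMap:

```go
package main

import "fmt"

// innerDB is a hypothetical stand-in for the underlying borp object.
type innerDB struct{}

func (innerDB) Select(query string) string { return "raw:" + query }
func (innerDB) Exec(query string) string   { return "raw:" + query }

// embeddedMap embeds innerDB, so every innerDB method is promoted to it,
// including ones that nobody remembered to wrap.
type embeddedMap struct{ innerDB }

func (m embeddedMap) Select(query string) string {
	return "wrapped:" + m.innerDB.Select(query)
}

// fieldMap holds innerDB as a plain field, so only explicitly written
// methods are callable on it.
type fieldMap struct{ db innerDB }

func (m fieldMap) Select(query string) string {
	return "wrapped:" + m.db.Select(query)
}

func main() {
	e := embeddedMap{}
	fmt.Println(e.Select("q")) // goes through the wrapper
	fmt.Println(e.Exec("q"))   // promoted method: silently bypasses wrapping

	f := fieldMap{}
	fmt.Println(f.Select("q"))
	// f.Exec("q") would fail to compile, which is the point: a missing
	// wrapper becomes a build error instead of an unwrapped call.
}
```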
Fixes #6944
This adds Jaeger's all-in-one dev container (with no persistent storage)
to boulder's dev docker-compose. It configures config-next/ to send all
traces there.
A new integration test creates an account and issues a cert, then
verifies the trace contains some set of expected spans.
This test found that async finalize broke spans, so I fixed that and a
few related spots where we make a new context.
Enable the errcheck linter. Update the way we express exclusions to use
the new, non-deprecated, non-regex-based format. Fix all places where we
began accidentally violating errcheck while it was disabled.
Deprecate the ROCSPStage6 feature flag. Remove all references to the
`ocspResponse` column from the SA, both when reading from and when
writing to the `certificateStatus` table. This makes it safe to fully
remove that column from the database.
IN-8731 enabled this flag in all environments, so it is safe to
deprecate.
Part of #6285
Delete the ocsp-updater service, and the //ocsp/updater library that
supports it. Remove test configs for the service, and remove references
to the service from other test files.
This service has been fully shut down for an extended period now, and is
safe to remove.
Fixes #6499
- Require `letsencrypt/validator` package.
- Add a framework for registering configuration structs and any custom
validators for each Boulder component at `init()` time.
- Add a `validate` subcommand which allows you to pass a `-component`
name and `-config` file path.
- Expose validation via exported utility functions
`cmd.LookupConfigValidator()`, `cmd.ValidateJSONConfig()` and
`cmd.ValidateYAMLConfig()`.
- Add unit test which validates all registered component configuration
structs against test configuration files.
Part of #6052
Remove tracing using Beeline from Boulder. The only remnant left behind
is the deprecated configuration, to ensure deployability.
We had previously planned to swap in OpenTelemetry in a single PR, but
that adds significant churn in a single change, so we're doing this as
multiple steps that will each be significantly easier to reason about
and review.
Part of #6361
sa: rename AddPrecertificateRequest.IssuerID
to IssuerNameID. This is in preparation for adding a similarly-named
field to AddSerialRequest.
Part of #5152.
Add a new time.Duration field, LagFactor, to both the SA's config struct
and the read-only SA's implementation struct. In the GetRegistration,
GetOrder, and GetAuthorization2 methods, if the database select returned
a NoRows error and a lagFactor duration is configured, then sleep for
the lagFactor duration and retry the select.
This allows us to compensate for the replication lag between our primary
write database and our read-only replica databases. Sometimes clients
will fire requests in rapid succession (such as creating a new order,
then immediately querying the authorizations associated with that
order), and the subsequent requests will fail because they are directed
to read replicas which are lagging behind the primary. Adding this
simple sleep-and-retry will let us mitigate many of these failures,
without adding too much complexity.
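The sleep-and-retry can be sketched as follows. `selectWithLag` and `simulateLaggingReplica` are hypothetical helpers written for this example; the real SA methods have different signatures:

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// selectWithLag runs query once. If it reports sql.ErrNoRows and a
// non-zero lagFactor is configured, it sleeps for lagFactor and runs the
// query exactly one more time.
func selectWithLag(lagFactor time.Duration, query func() (string, error)) (string, error) {
	res, err := query()
	if errors.Is(err, sql.ErrNoRows) && lagFactor > 0 {
		time.Sleep(lagFactor)
		return query()
	}
	return res, err
}

// simulateLaggingReplica models a replica that has not yet seen a row on
// the first read but has caught up by the second.
func simulateLaggingReplica() (string, int, error) {
	calls := 0
	query := func() (string, error) {
		calls++
		if calls == 1 {
			return "", sql.ErrNoRows
		}
		return "order-1", nil
	}
	res, err := selectWithLag(time.Millisecond, query)
	return res, calls, err
}

func main() {
	res, calls, err := simulateLaggingReplica()
	fmt.Println(res, calls, err)
}
```

With a zero lagFactor the first result is returned unchanged, so leaving the field unset preserves the previous behavior.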
Fixes #6593
When the ocsp-responder queries the database for a certificate status,
we want to return a 404 if we don't find a certificate status row for
the serial in question. This is because we often receive requests for
serials which we never issued, and for very old (expired) serials whose
status data we may have purged from the database.
Previously, we did this by checking whether the error returned by the
database was the "ErrNoRows" used by Go's SQL library. However, when the
ocsp-responder uses the SA to get this information, rather than querying
the database directly, the SA's gRPC service returns berrors.NotFound
instead. The code was not checking for this error, and therefore turned
some requests that should have been 404s into 500s.
Check for both kinds of "not found" error, and return a 404 for both.
Add tests to ensure that we return responder.ErrNotFound in both cases.
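A sketch of the combined check. The sentinel errors here are local stand-ins for berrors.NotFound and responder.ErrNotFound (the real Boulder types differ), and `normalizeLookupErr` is a hypothetical name:

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"
)

// Local stand-ins for Boulder's error values; names are illustrative.
var (
	errSANotFound        = errors.New("sa: not found")           // plays berrors.NotFound
	errResponderNotFound = errors.New("no ocsp response found")  // plays responder.ErrNotFound
	errOther             = errors.New("connection reset by peer") // any unrelated failure
)

// normalizeLookupErr maps both kinds of "not found" error onto the single
// error that the HTTP layer turns into a 404. Anything else passes through.
func normalizeLookupErr(err error) error {
	if errors.Is(err, sql.ErrNoRows) || errors.Is(err, errSANotFound) {
		return errResponderNotFound
	}
	return err
}

// wrappedNoRows returns sql.ErrNoRows wrapped with extra context, as a
// direct database lookup might.
func wrappedNoRows() error {
	return fmt.Errorf("selecting certificate status: %w", sql.ErrNoRows)
}

func main() {
	fmt.Println(normalizeLookupErr(wrappedNoRows()))
	fmt.Println(normalizeLookupErr(errSANotFound))
	fmt.Println(normalizeLookupErr(errOther))
}
```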
In #6293, we gave the ocsp-responder the ability to use a gRPC
connection to the SA to get status information for certificates, rather
than using a database connection directly. However, that change
neglected to make the database connection configuration optional: an
ocsp-responder with an SA gRPC client configured would never use its
database connection, but if it wasn't configured it would refuse to
start. Fix this oversight by making the DBConfig stanza optional.
In live.go we use a semaphore to limit how many inflight signing
requests we can have, so a flood of OCSP traffic doesn't overwhelm our CA
instances. If traffic exceeds our capacity to sign responses for long
enough, we want to eventually start fast-rejecting inbound requests that
are unlikely to get serviced before their deadline is reached. To do
that, add a MaxSigningWaiters config field to the OCSP responder.
Note that the files in //semaphore are forked from x/sync/semaphore,
with modifications to add the MaxWaiters field and functionality.
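The fast-reject idea can be sketched with a channel-based semaphore. This is not the forked x/sync/semaphore code; `boundedSem`, `errMaxWaiters`, and the other names are assumptions made for illustration:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

var errMaxWaiters = errors.New("too many waiters")

// boundedSem allows up to cap(slots) concurrent holders and at most
// maxWaiters callers blocked waiting for a slot.
type boundedSem struct {
	slots      chan struct{}
	mu         sync.Mutex
	waiters    int
	maxWaiters int
}

func newBoundedSem(capacity, maxWaiters int) *boundedSem {
	return &boundedSem{slots: make(chan struct{}, capacity), maxWaiters: maxWaiters}
}

// Acquire takes a slot if one is free. Otherwise it either joins the wait
// queue or, if the queue is already full, fails immediately.
func (s *boundedSem) Acquire(ctx context.Context) error {
	select {
	case s.slots <- struct{}{}:
		return nil
	default:
	}

	s.mu.Lock()
	if s.waiters >= s.maxWaiters {
		s.mu.Unlock()
		return errMaxWaiters // fast reject instead of queueing
	}
	s.waiters++
	s.mu.Unlock()
	defer func() {
		s.mu.Lock()
		s.waiters--
		s.mu.Unlock()
	}()

	select {
	case s.slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// Release frees a slot previously taken by Acquire.
func (s *boundedSem) Release() { <-s.slots }

// demoFastReject uses one slot and zero allowed waiters, so a second
// Acquire is rejected until the first holder releases.
func demoFastReject() (first, second, third error) {
	sem := newBoundedSem(1, 0)
	ctx := context.Background()
	first = sem.Acquire(ctx)
	second = sem.Acquire(ctx)
	sem.Release()
	third = sem.Acquire(ctx)
	return first, second, third
}

func main() {
	fmt.Println(demoFastReject())
}
```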
Fixes #6392
In the WFE, ocsp-responder, and crl-updater, switch from using
StorageAuthorityClients to StorageAuthorityReadOnlyClients. This ensures
that these services cannot call methods which write to our database.
Fixes #6454
Create a new gRPC service named StorageAuthorityReadOnly which only
exposes a read-only subset of the existing StorageAuthority service's
methods.
Implement this by splitting the existing SA in half, and having the
read-write half embed and wrap an instance of the read-only half.
Unfortunately, many of our tests use exported read-write methods as part
of their test setup, so the tests are all being performed against the
read-write struct, but they are exercising the same code as the
read-only implementation exposes.
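The embed-and-wrap arrangement looks roughly like the sketch below. The types and methods are hypothetical and far smaller than the real SA; the point is only that read methods are written once and promoted to the read-write type:

```go
package main

import "fmt"

// readOnlySA holds the data and implements only read methods.
type readOnlySA struct {
	orders map[string]string
}

func (ro *readOnlySA) GetOrder(id string) (string, bool) {
	status, ok := ro.orders[id]
	return status, ok
}

// readWriteSA embeds the read-only half, inheriting its read methods, and
// adds the methods that mutate state.
type readWriteSA struct {
	*readOnlySA
}

func (rw *readWriteSA) NewOrder(id, status string) {
	rw.orders[id] = status
}

// orderReader is what a read-only client would be handed.
type orderReader interface {
	GetOrder(id string) (string, bool)
}

func newReadWriteSA() *readWriteSA {
	return &readWriteSA{readOnlySA: &readOnlySA{orders: map[string]string{}}}
}

func main() {
	rw := newReadWriteSA()
	rw.NewOrder("1", "pending") // test setup goes through the read-write half

	// Reads via either half run the same code.
	var reader orderReader = rw.readOnlySA
	fmt.Println(reader.GetOrder("1"))
	fmt.Println(rw.GetOrder("1"))
}
```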
Expose this new service at the SA on the same port as the existing
service, but with (in config-next) different sets of allowed clients. In
the future, read-only clients will be removed from the read-write
service's set of allowed clients.
Part of #6454
Previously, the live-signing routine was looking for
`rocsp.ErrRedisNotFound` errors in order to increment the
`certificate_not_found` metrics. But this was a bug, copy-pasted from
code higher in the file that does a similar check. The live-signing code
actually returns `responder.ErrNotFound`. Check for that error instead,
to properly increment our metrics.
These flags are set in both staging and prod. Deprecate them, make
all code gated behind them the only path, and delete code (multi_source)
which was only accessible when these flags were not set.
Part of #6285
This was masking a bug, because the integration test for OCSP responses
for expired certificates was looking for the "unauthorized" OCSP
response status, which we were returning even though our HTTP-level
response code was 500.
In live.Source, translate berrors.NotFound (returned by RA when the
certificate is expired) into responder.ErrNotFound (which causes an
Unauthorized response rather than a 5xx).
In the Redis source, remove the special case that will return a stale
response if live signing fails, and simply pass through the error from
the live source.
Before this fix, if we found a stale response in Redis, tried to get a
fresh response, and found that the certificate was expired, we would
have served the stale response rather than our usual 404 for expired
certificates. Since that messes with our metrics, we don't want to do
it.
Also, fix an incorrect use of `%w` in log.Warningf.
- Move incidents tables from `boulder_sa` to `incidents_sa` (added in #6344)
- Grant read perms for all tables in `incidents_sa`
- Modify unit tests to account for new schema and grants
- Add database cleaning func for `boulder_sa`
- Adjust cleanup funcs to omit `sql-migrate` tables instead of `goose`
Resolves #6328
Give less confusing names to the metrics in checked_redis_source, e.g.
"revocation_re_sign_success" instead of "sign_and_save_success".
Also use a new enum type as the `cause` parameter to signAndSave, to
make it clear what should be passed.
Finally, in redis_source, split `counter` into two separate Prometheus
counters: one for requests in general, and a separate one for
signAndSave. The counter for signAndSave has two labels: cause and
result.
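A sketch of the enum-typed `cause` parameter. The constant names, label values, and the map-based counter are assumptions for illustration (the real code uses Prometheus counters):

```go
package main

import "fmt"

// signAndSaveCause is a small closed set of reasons for re-signing.
// The specific values here are hypothetical.
type signAndSaveCause string

const (
	causeExpired  signAndSaveCause = "expired"
	causeNotFound signAndSaveCause = "not_found"
	causeMismatch signAndSaveCause = "mismatch"
)

// signCounter stands in for a counter with "cause" and "result" labels.
type signCounter map[[2]string]int

// signAndSave records one signing attempt and returns a description of it.
// Because cause has its own type, passing a plain string variable (such as
// a serial) in that position does not compile.
func (c signCounter) signAndSave(serial string, cause signAndSaveCause, ok bool) string {
	result := "success"
	if !ok {
		result = "failure"
	}
	c[[2]string{string(cause), result}]++
	return fmt.Sprintf("signed %s (cause=%s result=%s)", serial, cause, result)
}

func main() {
	c := signCounter{}
	fmt.Println(c.signAndSave("03aa", causeExpired, true))
	fmt.Println(c.signAndSave("03bb", causeExpired, true))
	fmt.Println(c.signAndSave("03cc", causeMismatch, false))
	// Three signings, but only two distinct label combinations.
	fmt.Println(len(c))
}
```

Note that an untyped string literal would still be accepted for `cause`; the type only guards against passing string variables by mistake.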
Fixes #6339
The third argument to signAndSave is intended to be a "cause", to
provide a description of why we are doing a fresh signing that can
be included in our metric labels.
It was mistakenly being set to the serial number of the cert whose
new OCSP response is being generated, causing the number of
unique labels on this metric to explode.
Part of #6339
Enable the "unparam" linter, which checks for unused function
parameters, unused function return values, and parameters and
return values that always have the same value every time they
are used.
In addition, fix many instances where the unparam linter complains
about our existing codebase. Remove error return values from a
number of functions that never return an error, remove or use
context and test parameters that were previously unused, and
simplify a number of (mostly test-only) functions that always take the
same value for their parameter. Most notably, remove the ability to
customize the RSA Public Exponent from the ceremony tooling,
since it should always be 65537 anyway.
Fixes #6104
Add a new `GetRevocationStatus` gRPC method to the SA which retrieves
only the subset of the certificate status metadata relevant to
revocation, namely whether the certificate has been revoked, when it was
revoked, and the revocation reason. Notably, this method is our first
use of the `google.protobuf.Timestamp` type in a message, which is more
ergonomic and less prone to errors than using unix nanoseconds.
Use this new method in ocsp-responder's checked_redis_source, to avoid
having to send many other pieces of metadata and the full ocsp response
bytes over the network. It provides all the information necessary to
determine if the response from Redis is up-to-date.
Within the checked_redis_source, use this new method in two different
ways: if only a database connection is configured (as is the case today)
then get this information directly from the db; if a gRPC connection to
the SA is available then prefer that instead. This may make requests
slower, but will allow us to remove database access from the hosts which
run the ocsp-responder today, simplifying our network.
The new behavior consists of two pieces, each locked behind a config
gate:
- Performing the smaller database query is only enabled if the
ocsp-responder has the `ROCSPStage3` feature flag enabled.
- Talking to the SA rather than the database directly is only enabled if
the ocsp-responder has an `saService` gRPC stanza in its config.
Fixes #6274
The ioutil package has been deprecated since go1.16; the various
functions it provided now exist in the os and io packages. Replace all
instances of ioutil with either io or os, as appropriate.
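For reference, the most common replacements are shown below in a small round-trip example. The `roundTrip` helper is hypothetical; the stdlib functions are the go1.16 equivalents of the named ioutil calls:

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// roundTrip writes contents to a temp file, reads it back, and then reads
// the result again through an io.Reader.
func roundTrip(contents string) (string, error) {
	dir, err := os.MkdirTemp("", "example") // was ioutil.TempDir
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "file.txt")
	err = os.WriteFile(path, []byte(contents), 0o600) // was ioutil.WriteFile
	if err != nil {
		return "", err
	}

	fromFile, err := os.ReadFile(path) // was ioutil.ReadFile
	if err != nil {
		return "", err
	}

	reader := strings.NewReader(string(fromFile))
	fromReader, err := io.ReadAll(reader) // was ioutil.ReadAll
	if err != nil {
		return "", err
	}
	return string(fromReader), nil
}

func main() {
	fmt.Println(roundTrip("hello"))
}
```

Similarly, ioutil.Discard becomes io.Discard and ioutil.NopCloser becomes io.NopCloser.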
Add checkedRedisSource, a new OCSP Source which gets
responses from Redis, gets metadata from the database, and
only serves the Redis response if it matches the authoritative
metadata. If there is a mismatch, it requests a new OCSP
response from the CA, stores it in Redis, and serves the new
response.
This behavior is locked behind a new ROCSPStage3 feature flag.
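The control flow can be sketched with in-memory maps standing in for Redis, the database, and the CA. All types and names below are hypothetical, and treating a cache miss the same as a mismatch is an assumption of this sketch:

```go
package main

import (
	"errors"
	"fmt"
)

var errNotFound = errors.New("no status for serial")

// status is the authoritative metadata; response is a cached OCSP reply.
type status struct{ revoked bool }

type response struct {
	revoked bool
	signedN int // which signing produced this response (0 = pre-seeded)
}

type checkedSource struct {
	cache map[string]response // plays the role of Redis
	db    map[string]status   // plays the role of the database
	signs int                 // counts requests to the "CA"
}

// sign plays the role of asking the CA for a fresh response.
func (s *checkedSource) sign(st status) response {
	s.signs++
	return response{revoked: st.revoked, signedN: s.signs}
}

// Response serves the cached response only if it agrees with the
// authoritative status; otherwise it re-signs, stores, and serves.
func (s *checkedSource) Response(serial string) (response, error) {
	st, ok := s.db[serial]
	if !ok {
		return response{}, errNotFound
	}
	if cached, ok := s.cache[serial]; ok && cached.revoked == st.revoked {
		return cached, nil
	}
	fresh := s.sign(st)
	s.cache[serial] = fresh
	return fresh, nil
}

// demoMismatch fetches a response, revokes the certificate in the
// database, and fetches again.
func demoMismatch() (before, after response, signs int) {
	s := &checkedSource{
		cache: map[string]response{"03aa": {revoked: false}},
		db:    map[string]status{"03aa": {revoked: false}},
	}
	before, _ = s.Response("03aa")
	s.db["03aa"] = status{revoked: true}
	after, _ = s.Response("03aa")
	return before, after, s.signs
}

func main() {
	fmt.Println(demoMismatch())
}
```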
Part of #6079
For multiSource, split out checkSecondary's metrics into their own
counter. Treat NotFound as a separate error type (so we can more
clearly distinguish the half-hourly pattern of fetches for expired
certificates).
In redisSource, add a histogram for the ages of responses fetched from
cache (regardless of whether they are served or not). This parallels
ocsp_respond_ages in ocsp/responder.go, but may show ages beyond the
compliance limit, even under normal operations, because it is checked
before signAndSave is called.
This prevents a memory leak when the rate of requests that require signing
exceeds the ability of the live.Source to sign them. Such requests
were getting stuck in semaphore.Weighted.Acquire with a context that had
no deadline.
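A sketch of the general fix: bound the wait for a semaphore slot with a deadline so stuck waiters eventually give up. It uses a channel as the semaphore rather than semaphore.Weighted, and the `acquire` helper and timeout value are assumptions:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// acquire waits for a free slot, but never longer than timeout. Without
// the deadline, a caller could block here forever once the slots are
// exhausted, and blocked callers would pile up.
func acquire(ctx context.Context, slots chan struct{}, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	select {
	case slots <- struct{}{}:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoDeadline fills the only slot and then shows that a second caller
// gives up once its deadline passes.
func demoDeadline() (first error, secondTimedOut bool) {
	slots := make(chan struct{}, 1)
	first = acquire(context.Background(), slots, 10*time.Millisecond)
	second := acquire(context.Background(), slots, 10*time.Millisecond)
	return first, errors.Is(second, context.DeadlineExceeded)
}

func main() {
	fmt.Println(demoDeadline())
}
```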
Previously we used "ExpectedFreshness" to control how frequently the
Redis source would request re-signing of stale entries. But that field
also controls whether multi_source is willing to serve a MariaDB
response. It's better to split these into two values.
This enables ocsp-responder to talk to the RA and request freshly signed
OCSP responses.
ocsp/responder/redis_source is moved to ocsp/responder/redis/redis_source.go
and significantly modified. Instead of assuming a response is always available
in Redis, it wraps a live-signing source. When a response is not available,
it attempts a live signing.
If live signing succeeds, the Redis responder returns the result right away
and attempts to write a copy to Redis on a goroutine using a background
context.
To make things more efficient, I eliminated an unneeded ocsp.ParseResponse
from the storage path, and factored out a FakeResponse helper to make
the unit tests more manageable.
Commits should be reviewable one-by-one.
Fixes #6191