#7782 fixed an issue where concurrent requests to the same existing
bucket ignored all but one rate limit spend. However, concurrent
requests to the same empty bucket can still cause multiple
initializations that skip all but one spend. Use BatchSetNotExisting
(SETNX in Redis) to detect this scenario and then fall back to
BatchIncrement (INCRBY in Redis).
#7782 sets the TTL (Time-To-Live) of incremented buckets to the maximum
possible burst for the applied limit. Because this TTL doesn’t match the
TAT, these buckets can become "stale," holding a TAT in the past.
Incrementing these stale buckets by cost * emissionInterval leaves the
new TAT behind the current time, allowing clients who are sometimes idle
to gain extra burst capacity. Instead, use BatchSet (SET in Redis) to
overwrite the TAT to now + cost * emissionInterval. Though this
introduces a similar race condition as empty buckets, it’s less harmful
than granting extra burst capacity.
webpki.go was discarding stdout when "make build" failed. We can make it
print stdout in that context, but it's more straightforward to run "make
build" from the shell script that calls webpki.go, where its stdout will
naturally be emitted.
Inspired by a recent CI run where there was a straightforward build
failure in some of Boulder's code, but it was masked by an error running
webpki.go in the `bsetup` container.
Mostly we refer consistently to token bucket, but these two places (one
of which is soon to be removed) still had the "leaky" terminology, which
is potentially confusing.
Remove `debugAddr` from the `admin` tool, which doesn't use it - or need
it, now that `newStatsRegistry` via `StatsAndLogging` doesn't require
it.
Remove `debugAddr` from `config-next/sfe.json`, as we usually set it on
the CLI instead.
Fixes#7838
Today, we have VA.PerformValidation, a method called by the RA at
challenge time to perform DCV and check CAA. We also have VA.IsCAAValid,
a method invoked by the RA at finalize time when a CAA re-check is
necessary. Both of these methods can be executed on remote VA
perspectives by calling the generic VA.performRemoteValidation.
This change splits VA.PerformValidation into VA.DoDCV and VA.DoCAA,
which are both called on remote VA perspectives by calling the generic
VA.doRemoteOperation. VA.DoDCV, VA.DoCAA, and VA.doRemoteOperation
fulfill the requirements of SC-067 V3: Require Multi-Perspective
Issuance Corroboration by:
- Requiring at least three distinct perspectives, as outlined in the
"Phased Implementation Timeline" in BRs section 3.2.2.9 ("Effective
March 15, 2025").
- Ensuring that the number of non-corroborating (failing) perspectives
remains below the threshold defined by the "Table: Quorum Requirements"
in BRs section 3.2.2.9.
- Ensuring that corroborating (passing) perspectives reside in at least
2 distinct Regional Internet Registries (RIRs) per the "Phased
Implementation Timeline" in BRs section 3.2.2.9 ("Effective March 15,
2026").
- Including an MPIC summary consisting of: passing perspectives, failing
perspectives, passing RIRs, and a quorum met for issuance (e.g., 2/3 or
3/3) in each validation audit log event, per BRs Section 5.4.1,
Requirement 2.8.
When the new SeparateDCVAndCAAChecks feature flag is enabled on the RA,
calls to VA.IsCAAValid (during finalization) and VA.PerformValidation
(during challenge) are replaced with calls to VA.DoCAA and a sequence of
VA.DoDCV followed by VA.DoCAA, respectively.
Fixes#7612Fixes#7614Fixes#7615Fixes#7616
Early drafts of the ACME spec said that clients should retrieve their existing account information by POSTing the empty JSON object `{}` to their account URL. This instruction was removed in the final version of ACME, replaced by the concept of POST-as-GET, which uses a wholly empty body to accomplish the same goal. However, Boulder has continued to incidentally support this behavior: when we receive an empty JSON object, our `updateAccount` code in the RA applies their desired diff (none) on top of their current account, writes it back to the database, and returns the updated account object...which hasn't actually changed. This behavior is also half-tested by `TestEmptyAccount`, but that test is actually testing that the MockRA implements the same behavior as the real RA; it's not truly testing the WFE's behavior.
This PR changes the WFE to explicitly treat receiving the empty JSON object as a request to retrieve the account data unchanged, rather than implicitly relying on internal details of the RA's account-update logic, which are expected to change in #7827.
---------
Co-authored-by: Jacob Hoffman-Andrews <jsha+github@letsencrypt.org>
Github is currently rolling out ubuntu-latest as ubuntu-24.04. Manage
that switch explicitly by running most jobs on 24.04
https://github.com/actions/runner-images/issues/10636
This keeps the release on 20.04 to ensure released binaries can run on
older operating systems (because of CGO/glibc versions)
All 4 usages of the `maps.Keys` function from the
`golang.org/x/exp/maps` package can be refactored to a simpler
alternative. If we need it in the future, it is available in the
standard library since Go 1.23.
- Ensure the Perspective and RIR reported by each remoteVA in the
*vapb.ValidationResult returned by VA.PerformValidation, matches the
expected local configuration when that configuration is present.
- Correct "AfriNIC" to "AFRINIC", everywhere.
Part of https://github.com/letsencrypt/boulder/issues/7819
Pending authz reuse is a nice-to-have feature because it allows us to
create fewer rows in the authz database table when creating new orders.
However, stats show that less than 2% of authorizations that we attach
to new orders are reused pending authzs. And as we move towards using a
more streamlined database schema to store our orders, authorizations,
and validation attempts, disabling pending authz reuse will greatly
simplify our database schema and code.
CPS Compliance Review: our CPS does not speak to whether or not we reuse
pending authorizations for new orders.
IN-10859 tracks enabling this flag in prod
Part of https://github.com/letsencrypt/boulder/issues/7715
For batch operations, include the operation and the number of keys in
the error message. This should help diagnose whether we are getting `i/o
timeout` errors disproportionately for larger requests, or for certain
operations.
Also, make the ignored errors part of the overall WFE request logs,
which allows us to get additional context, like whether certain
requesters or domain names are getting disproportionately many errors.
Related to #7846.
Greatly simplify the two RA unit tests covering failed validations and
account+identifier pausing. Most importantly, directly manipulate the
ratelimit backing store during test setup, to avoid having to "perform"
extra validations.
Fixes https://github.com/letsencrypt/boulder/issues/7812
The purpose of these RA and WFE unit tests is to test how they deal with
certain rate limit conditions, not to test talking to an actual redis
instance. Streamline the tests by having them talk to an in-memory rate
limits store, rather than a redis-backed one.
This reduces the number of validations that get left indefinitely in
"pending" state.
Rename `DrainFinalize()` to `Drain()` to indicate that it now covers
more cases than just finalize.
- Remove undeployed feature flag MultiCAAFullResults
- Perform local CAA checks prior to initiating remote checks, instead of
starting remote checks and proceeding to perform local checks.
- Remove VA.IsCAAValid specific remote validation logic, use
VA.performRemoteOperation instead
- Refactor va.logRemoteResults to be easier to test and omit the RVA
problem
- Drive-by fix: Calculate logEvent.Latency with va.clk.Since() instead
of time.Since() like everything else in VA.performRemoteOperation
- Make performRemoteValidation a more generic function that returns a
new remoteResult interface
- Modify the return value of IsCAAValid and PerformValidation to satisfy
the remoteResult interface
- Include compile time checks and tests that pass an arbitrary operation
- Make the primary VA aware of the expected Perspective and RIR of each
remote VA.
- All Perspectives should be unique, have the primary VA check for
duplicate Perspectives at startup.
- Update test setup functions to ensure that each remote VA client and
corresponding inmem impl have a matching perspective and RIR.
Part of #7819
When a client deactivates a pending authorization, count that towards
their FailedAuthorizationsPerDomainPerAccount and
FailedAuthorizationsForPausingPerDomainPerAccount rate limits. This
should help curb the few clients which constantly create new orders and
authzs, deactivate those pending authzs preventing reuse of them or
their orders, and then rinse and repeat.
Fixes https://github.com/letsencrypt/boulder/issues/7834
Now that the paths with an account (and no `-v3`) are the default,
rename the old-style path constants and handlers to reflect that they
are deprecated.
Part of #7683.
- Remove Perspective and RIR from ValidationRecords
- Make ValidationResultToPB Perspective and RIR aware
- Update comment for VA.PerformValidation
- Make verificationRequestEvent more like doDCVAuditLog
- Update language used in problems created by performRemoteValidation to
be more like doRemoteDCV.
Fix two bugs introduced in #7816:
- Fix localLatency time for CAA rechecking is always 0
- Fix logEvent.InternalError is no longer being written to log
To prepare for the MPIC requirement of having a minimum of 3
perspectives, I added code to `NewValidationAuthorityImpl` to error if
there aren't enough remote VAs configured _and_ the current VA is the
primary perspective. Then I fixed all the tests, which involved adding
some backends in the unittests, and spinning up `remoteva-c` in the
integration tests.
As a reminder, the `boulder va` command always considers itself the
primary perspective, while `boulder remoteva` gives itself a perspective
based on its config.
I wound up backing out the code in `NewValidationAuthorityImpl` because
right now our remote VAs are actually running the `boulder va` command,
so they would error out in prod, even though our actual primary
perspective does have enough backends. So this wound up as a test-only
change.
Previously this was a configuration field.
Ports `maxAllowedFailures()` from `determineMaxAllowedFailures()` in
#7794.
Test updates:
Remove the `maxRemoteFailures` param from `setup` in all VA tests.
Some tests were depending on setting this param directly to provoke
failures.
For example, `TestMultiVAEarlyReturn` previously relied on "zero allowed
failures". Since the number of allowed failures is now 1 for the number
of remote VAs we were testing (2), the VA wasn't returning early with an
error; it was succeeding! To fix that, make sure there are two failures.
Since two failures from two RVAs wouldn't exercise the right situation,
add a third RVA, so we get two failures from three RVAs.
Similarly, TestMultiCAARechecking had several test cases that omitted
this field, effectively setting it to zero allowed failures. I updated
the "1 RVA failure" test case to expect overall success and added a "2
RVA failures" test case to expect overall failure (we previously
expected overall failure from a single RVA failing).
In TestMultiVA I had to change a test for `len(lines) != 1` to
`len(lines) == 0`, because with more backends we were now logging more
errors, and finding e.g. `len(lines)` to be 2.
The zero value for `limit` is invalid, so returning `nil` in error cases
avoids silently returning invalid limits (and means that if code makes a
mistake and references an invalid limit it will be an obvious clear
stack trace).
This is more consistent, since the methods on `limit` use a pointer
receiver. Also, since `limit` is a fairly large object, this saves some
copying.
Related to #7803 and #7797.
Solves CI flakes in TestCertificatesPerDomain and
TestIdentifiersPausedForAccount that are the result of a race on the
Redis database. This has the downside of making failed validations and
successful finalizations take slightly longer.
- Add a new histogram, validationLatency
- Add a VA.observeLatency for observing validation latency
- Refactor to ensure this metric can be observed exclusively within
VA.PerformValidation and VA.IsCAAValid.
- Replace validationTime, localValidationTime, remoteValidationTime,
remoteValidationFailures, caaCheckTime, localCAACheckTime,
remoteCAACheckTime, and remoteCAACheckFailures with validationLatency
We're using `.Keys()` a method of the maps package added in 1.23
(https://pkg.go.dev/maps@master#Keys) but our go.mod states 1.22.0. This
causes some in-IDE linting errors in VSCode.
This means that most traffic will go to the authz URLs with account.
After this has been deployed for 30 days (the max lifetime of an order),
we can remove support for the old paths.
Part of #7683
TestPerformValidation_FailedThenSuccessfulValidationResetsPauseIdentifiersRatelimit
checks for a bucket being empty after a reset. However, that bucket is
based on an account ID that is shared across multiple test cases.
Instead, use a unique account and domain for this test.
Fixes#7812
Return an error and do logging in the caller. This adds early returns on
a number of error conditions, which can prevent nil pointer dereference
in those cases.
Also update the description for AutomaticallyPauseZombieClients.
Follows up #7763.
In the FailedAuthorizations limits, there was code that intentionally
ignored errLimitDisabled errors (`errors.Is(err, errLimitDisabled)`).
However, that that resulted in those functions later using a returned
`limit` value that was invalid (i.e. its zero value). That happened to
trigger some later checks in validateTransaction. Specifically this
check failed:
if txn.cost > txn.limit.Burst {
// error
When txt.limit.Burst is zero, this will always fail.
This problem doesn't really show up in prod, where all the limits are
configured. But it showed up in tests, specifically
TestPerformValidation_FailedValidationsTriggerPauseIdentifiersRatelimit,
where the limits are constructed using a simplified config that leaves
most of them disabled.
In this change, I tried to make handling of errLimitDisabled more
consistent, and always return an allow-only transaction as early as
possible instead of falling through the error condition.
Where that wasn't possible, I used a boolean to record whether the
result of `builder.getLimit()` was valid before referencing any of its
fields.
I also added some "shouldn't happen" errors to catch this problem
earlier if it recurs.
I removed some "skip disabled limit" comments because those say "what
the code does" (which the code also says), not "why the code does it".
Fixes the test failures in #7797.
Add a new WFE & nonce config field, `NonceHMACKey`, which uses the new
`cmd.HMACKeyConfig` type. Deprecate the `NoncePrefixKey` config field.
Generalize the error message when validating `HMACKeyConfig` in
`config`.
Remove the deprecated `UseDerivablePrefix` config field, which is no
longer used anywhere.
Part of #7632
- Added a new key-value ratelimit
`FailedAuthorizationsForPausingPerDomainPerAccount` which is incremented
each time a client fails a validation.
- As long as capacity exists in the bucket, a successful validation
attempt will reset the bucket back to full capacity.
- Upon exhausting bucket capacity, the RA will send a gRPC to the SA to
pause the `account:identifier`. Further validation attempts will be
rejected by the [WFE](https://github.com/letsencrypt/boulder/pull/7599).
- Added a new feature flag, `AutomaticallyPauseZombieClients`, which
enables automatic pausing of zombie clients in the RA.
- Added a new RA metric `paused_pairs{"paused":[bool],
"repaused":[bool], "grace":[bool]}` to monitor use of this new
functionality.
- Updated `ra_test.go` `initAuthorities` to allow accessing the
`*ratelimits.RedisSource` for checking that the new ratelimit functions
as intended.
Co-authored-by: @pgporada
Fixes https://github.com/letsencrypt/boulder/issues/7738
---------
Co-authored-by: Phil Porada <pporada@letsencrypt.org>
Co-authored-by: Phil Porada <philporada@gmail.com>
This adds new handlers under `/acme/authz/` and `/acme/chall/` that
expect to be followed by `{regID}/{authzID}` and
`{regID}/{authzID}/{challengeID}`, respectively. For deployability, the
old handlers continue to work, and the URLs returned for newly created
objects will still point to the paths used by the old handlers
(`/acme/authz-v3/` and `/acme/chall-v3/`).
There are some self-referential URLs in authz and challenge responses,
like the Location header, and the URL of challenges embedded in an
authorization object. This PR updates `prepAuthorizationForDisplay` and
`prepChallengeForDisplay` so those URLs can be generated consistently
with the path that was requested.
For the WFE tests, in most cases I duplicated an entire test and then
updated it to test the `WithAccount` handler. The idea is that once
we're fully switched over to the new format we can delete the tests for
the non-`WithAccount` variants.
Part of #7683