boulder

Commit Graph

Author	SHA1	Message	Date
Samantha Frank	11d543bd98	ratelimits: Correctly handle stale and concurrently initialized buckets (#7886 ) #7782 fixed an issue where concurrent requests to the same existing bucket ignored all but one rate limit spend. However, concurrent requests to the same empty bucket can still cause multiple initializations that skip all but one spend. Use BatchSetNotExisting (SETNX in Redis) to detect this scenario and then fall back to BatchIncrement (INCRBY in Redis). #7782 sets the TTL (Time-To-Live) of incremented buckets to the maximum possible burst for the applied limit. Because this TTL doesn’t match the TAT, these buckets can become "stale," holding a TAT in the past. Incrementing these stale buckets by cost * emissionInterval leaves the new TAT behind the current time, allowing clients who are sometimes idle to gain extra burst capacity. Instead, use BatchSet (SET in Redis) to overwrite the TAT to now + cost * emissionInterval. Though this introduces a similar race condition as empty buckets, it’s less harmful than granting extra burst capacity.	2024-12-17 12:42:51 -05:00
Jacob Hoffman-Andrews	1f9f2bccf5	sa: remove CountFQDNSetTimestamps (#7883 ) This was superseded in #6220 by FQDNTimestampsForWindow and is no longer called.	2024-12-16 12:24:01 -08:00
Jacob Hoffman-Andrews	2678e68806	test: move "make build" for webpki into generate.sh (#7885 ) webpki.go was discarding stdout when "make build" failed. We can make it print stdout in that context, but it's more straightforward to run "make build" from the shell script that calls webpki.go, where its stdout will naturally be emitted. Inspired by a recent CI run where there was a straightforward build failure in some of Boulder's code, but it was masked by an error running webpki.go in the `bsetup` container.	2024-12-13 15:19:22 -08:00
James Renken	62f1a26ccf	wfe: Use separate UpdateRegistrationContact & UpdateRegistrationKey methods (#7827 ) Fixes #7716 Part of #5554	2024-12-13 11:41:59 -05:00
Jacob Hoffman-Andrews	efaa370457	doc: boulder now has Retry-After on all ratelimits (#7876 ) Thanks to MikeMcQ from the forum for [noticing](https://community.letsencrypt.org/t/new-rate-limit-page-in-conflict-with-boulder-variances/229849).	2024-12-12 16:27:58 -05:00
Samantha Frank	1ddd4633f5	DB: Promote pausing schema from config-next to config (#7878 )	2024-12-11 14:38:55 -05:00
Jacob Hoffman-Andrews	40e100c297	doc: replace "leaky" with "token" bucket (#7881 ) Mostly we refer consistently to token bucket, but these two places (one of which is soon to be removed) still had the "leaky" terminology, which is potentially confusing.	2024-12-10 16:39:30 -08:00
James Renken	1b7b9a776b	cmd: Make a debug listen address optional (#7840 ) Remove `debugAddr` from the `admin` tool, which doesn't use it - or need it, now that `newStatsRegistry` via `StatsAndLogging` doesn't require it. Remove `debugAddr` from `config-next/sfe.json`, as we usually set it on the CLI instead. Fixes #7838	2024-12-10 12:25:12 -08:00
Samantha Frank	dda8acc34a	RA/VA: Add MPIC compliant DCV and CAA checks (#7870 ) Today, we have VA.PerformValidation, a method called by the RA at challenge time to perform DCV and check CAA. We also have VA.IsCAAValid, a method invoked by the RA at finalize time when a CAA re-check is necessary. Both of these methods can be executed on remote VA perspectives by calling the generic VA.performRemoteValidation. This change splits VA.PerformValidation into VA.DoDCV and VA.DoCAA, which are both called on remote VA perspectives by calling the generic VA.doRemoteOperation. VA.DoDCV, VA.DoCAA, and VA.doRemoteOperation fulfill the requirements of SC-067 V3: Require Multi-Perspective Issuance Corroboration by: - Requiring at least three distinct perspectives, as outlined in the "Phased Implementation Timeline" in BRs section 3.2.2.9 ("Effective March 15, 2025"). - Ensuring that the number of non-corroborating (failing) perspectives remains below the threshold defined by the "Table: Quorum Requirements" in BRs section 3.2.2.9. - Ensuring that corroborating (passing) perspectives reside in at least 2 distinct Regional Internet Registries (RIRs) per the "Phased Implementation Timeline" in BRs section 3.2.2.9 ("Effective March 15, 2026"). - Including an MPIC summary consisting of: passing perspectives, failing perspectives, passing RIRs, and a quorum met for issuance (e.g., 2/3 or 3/3) in each validation audit log event, per BRs Section 5.4.1, Requirement 2.8. When the new SeparateDCVAndCAAChecks feature flag is enabled on the RA, calls to VA.IsCAAValid (during finalization) and VA.PerformValidation (during challenge) are replaced with calls to VA.DoCAA and a sequence of VA.DoDCV followed by VA.DoCAA, respectively. Fixes #7612 Fixes #7614 Fixes #7615 Fixes #7616	2024-12-10 11:26:08 -05:00
James Renken	071b8c5b35	wfe: Handle empty JSON to /acme/acct like POST-as-GET (#7844 ) Early drafts of the ACME spec said that clients should retrieve their existing account information by POSTing the empty JSON object `{}` to their account URL. This instruction was removed in the final version of ACME, replaced by the concept of POST-as-GET, which uses a wholly empty body to accomplish the same goal. However, Boulder has continued to incidentally support this behavior: when we receive an empty JSON object, our `updateAccount` code in the RA applies their desired diff (none) on top of their current account, writes it back to the database, and returns the updated account object...which hasn't actually changed. This behavior is also half-tested by `TestEmptyAccount`, but that test is actually testing that the MockRA implements the same behavior as the real RA; it's not truly testing the WFE's behavior. This PR changes the WFE to explicitly treat receiving the empty JSON object as a request to retrieve the account data unchanged, rather than implicitly relying on internal details of the RA's account-update logic, which are expected to change in #7827. --------- Co-authored-by: Jacob Hoffman-Andrews <jsha+github@letsencrypt.org>	2024-12-06 16:45:43 -08:00
Matthew McPherrin	7e8b3fa10f	Run most workflows on ubuntu-24.04 (#7875 ) Github is currently rolling out ubuntu-latest as ubuntu-24.04. Manage that switch explicitly by running most jobs on 24.04 https://github.com/actions/runner-images/issues/10636 This keeps the release on 20.04 to ensure released binaries can run on older operating systems (because of CGO/glibc versions)	2024-12-06 13:13:04 -08:00
Eng Zer Jun	13db2a252f	refactor: remove usages of experimental maps package (#7849 ) All 4 usages of the `maps.Keys` function from the `golang.org/x/exp/maps` package can be refactored to a simpler alternative. If we need it in the future, it is available in the standard library since Go 1.23.	2024-12-06 11:50:32 -08:00
Samantha Frank	87104b0a3e	va: Check for RIR and Perspective mismatches at runtime when they're provided (#7841 ) - Ensure the Perspective and RIR reported by each remoteVA in the *vapb.ValidationResult returned by VA.PerformValidation, matches the expected local configuration when that configuration is present. - Correct "AfriNIC" to "AFRINIC", everywhere. Part of https://github.com/letsencrypt/boulder/issues/7819	2024-12-06 14:27:28 -05:00
Aaron Gable	749f9afa6b	Fix RA unit test merge conflict (#7874 )	2024-12-06 08:59:01 -05:00
Aaron Gable	95e5f87f9e	Add feature flag to disable pending authz reuse (#7836 ) Pending authz reuse is a nice-to-have feature because it allows us to create fewer rows in the authz database table when creating new orders. However, stats show that less than 2% of authorizations that we attach to new orders are reused pending authzs. And as we move towards using a more streamlined database schema to store our orders, authorizations, and validation attempts, disabling pending authz reuse will greatly simplify our database schema and code. CPS Compliance Review: our CPS does not speak to whether or not we reuse pending authorizations for new orders. IN-10859 tracks enabling this flag in prod Part of https://github.com/letsencrypt/boulder/issues/7715	2024-12-05 16:14:57 -08:00
Jacob Hoffman-Andrews	27e65f3e9f	ratelimits: add detail to error messages (#7871 ) For batch operations, include the operation and the number of keys in the error message. This should help diagnose whether we are getting `i/o timeout` errors disproportionately for larger requests, or for certain operations. Also, make the ignored errors part of the overall WFE request logs, which allows us to get additional context, like whether certain requesters or domain names are getting disproportionately many errors. Related to #7846.	2024-12-05 15:58:26 -08:00
Jacob Hoffman-Andrews	8c45a4fb04	docs: update CONTRIBUTING.md (#7857 ) Describe what to do if a bot comments on your PR, and update our description of feature flags and their deprecation.	2024-12-04 18:26:08 -08:00
Aaron Gable	aac7c22946	Simplify RA pausing unit tests (#7868 ) Greatly simplify the two RA unit tests covering failed validations and account+identifier pausing. Most importantly, directly manipulate the ratelimit backing store during test setup, to avoid having to "perform" extra validations. Fixes https://github.com/letsencrypt/boulder/issues/7812	2024-12-04 13:51:37 -08:00
Aaron Gable	d962c61067	RA and WFE tests: use inmem rate limit source (#7859 ) The purpose of these RA and WFE unit tests is to test how they deal with certain rate limit conditions, not to test talking to an actual redis instance. Streamline the tests by having them talk to an in-memory rate limits store, rather than a redis-backed one.	2024-12-03 14:52:16 -08:00
Aaron Gable	bac5602c6d	Always use INCRBY for redis rate limits (#7856 ) Deprecate the IncrementRateLimits feature flag, and always use the redis INCRBY instruction to update rate limit TATs. Fixes https://github.com/letsencrypt/boulder/issues/7855	2024-12-02 15:25:33 -08:00
Jacob Hoffman-Andrews	5cdfa3e26c	ra: wait for validations on clean shutdown (#7854 ) This reduces the number of validations that get left indefinitely in "pending" state. Rename `DrainFinalize()` to `Drain()` to indicate that it now covers more cases than just finalize.	2024-12-02 11:00:17 -08:00
Samantha Frank	d64132eebc	VA: Use performValidation for IsCAAValid remote checks (#7850 ) - Remove undeployed feature flag MultiCAAFullResults - Perform local CAA checks prior to initiating remote checks, instead of starting remote checks and proceeding to perform local checks. - Remove VA.IsCAAValid specific remote validation logic, use VA.performRemoteOperation instead - Refactor va.logRemoteResults to be easier to test and omit the RVA problem - Drive-by fix: Calculate logEvent.Latency with va.clk.Since() instead of time.Since() like everything else in VA.performRemoteOperation	2024-11-28 15:24:47 -05:00
Samantha Frank	27a77142ad	VA: Make performRemoteValidation more generic (#7847 ) - Make performRemoteValidation a more generic function that returns a new remoteResult interface - Modify the return value of IsCAAValid and PerformValidation to satisfy the remoteResult interface - Include compile time checks and tests that pass an arbitrary operation	2024-11-27 15:29:33 -05:00
Aaron Gable	ded2e5e610	Remove logging of contact email addresses (#7833 ) Fixes https://github.com/letsencrypt/boulder/issues/7801	2024-11-25 13:33:56 -08:00
Samantha Frank	c3948314ff	va: Make the primary VA aware of the Perspective and RIR of each remote (#7839 ) - Make the primary VA aware of the expected Perspective and RIR of each remote VA. - All Perspectives should be unique, have the primary VA check for duplicate Perspectives at startup. - Update test setup functions to ensure that each remote VA client and corresponding inmem impl have a matching perspective and RIR. Part of #7819	2024-11-25 13:02:03 -05:00
Aaron Gable	7791262815	Rate limit deactivating pending authzs (#7835 ) When a client deactivates a pending authorization, count that towards their FailedAuthorizationsPerDomainPerAccount and FailedAuthorizationsForPausingPerDomainPerAccount rate limits. This should help curb the few clients which constantly create new orders and authzs, deactivate those pending authzs preventing reuse of them or their orders, and then rinse and repeat. Fixes https://github.com/letsencrypt/boulder/issues/7834	2024-11-21 17:17:16 -08:00
Jacob Hoffman-Andrews	53f3cb91c0	wfe: rename deprecated paths and handlers (#7837 ) Now that the paths with an account (and no `-v3`) are the default, rename the old-style path constants and handlers to reflect that they are deprecated. Part of #7683.	2024-11-21 16:51:36 -08:00
Jacob Hoffman-Andrews	01c1488b0f	va: use cancels to early-return. (#7832 ) This allows us to collect a consistent number of error results for logging. Related to #7616.	2024-11-20 13:37:53 -08:00
Samantha Frank	8bf13a90f4	VA: Make PerformValidation more like DoDCV (#7828 ) - Remove Perspective and RIR from ValidationRecords - Make ValidationResultToPB Perspective and RIR aware - Update comment for VA.PerformValidation - Make verificationRequestEvent more like doDCVAuditLog - Update language used in problems created by performRemoteValidation to be more like doRemoteDCV.	2024-11-20 14:13:55 -05:00
Samantha Frank	c2632585f5	core: Move canceled.Is to core.IsCanceled (#7831 ) Small refactor to use errors.Is() rather than the equality operator. Also moves this utility function into core.	2024-11-20 13:10:02 -05:00
Samantha Frank	a8cdaf8989	ratelimit: Remove legacy registrations per IP implementation (#7760 ) Part of #7671	2024-11-19 18:39:21 -05:00
Samantha Frank	65de9fb4b8	VA: Fix two IsCAAValid bugs (#7829 ) Fix two bugs introduced in #7816: - Fix localLatency time for CAA rechecking is always 0 - Fix logEvent.InternalError is no longer being written to log	2024-11-19 11:14:34 -08:00
Jacob Hoffman-Andrews	577a1e38eb	va: prepare to require minimum of 3 RVAs (#7815 ) To prepare for the MPIC requirement of having a minimum of 3 perspectives, I added code to `NewValidationAuthorityImpl` to error if there aren't enough remote VAs configured _and_ the current VA is the primary perspective. Then I fixed all the tests, which involved adding some backends in the unittests, and spinning up `remoteva-c` in the integration tests. As a reminder, the `boulder va` command always considers itself the primary perspective, while `boulder remoteva` gives itself a perspective based on its config. I wound up backing out the code in `NewValidationAuthorityImpl` because right now our remote VAs are actually running the `boulder va` command, so they would error out in prod, even though our actual primary perspective does have enough backends. So this wound up as a test-only change.	2024-11-19 10:23:32 -05:00
Jacob Hoffman-Andrews	a46c388f66	va: compute maxRemoteFailures based on MPIC (#7810 ) Previously this was a configuration field. Ports `maxAllowedFailures()` from `determineMaxAllowedFailures()` in #7794. Test updates: Remove the `maxRemoteFailures` param from `setup` in all VA tests. Some tests were depending on setting this param directly to provoke failures. For example, `TestMultiVAEarlyReturn` previously relied on "zero allowed failures". Since the number of allowed failures is now 1 for the number of remote VAs we were testing (2), the VA wasn't returning early with an error; it was succeeding! To fix that, make sure there are two failures. Since two failures from two RVAs wouldn't exercise the right situation, add a third RVA, so we get two failures from three RVAs. Similarly, TestMultiCAARechecking had several test cases that omitted this field, effectively setting it to zero allowed failures. I updated the "1 RVA failure" test case to expect overall success and added a "2 RVA failures" test case to expect overall failure (we previously expected overall failure from a single RVA failing). In TestMultiVA I had to change a test for `len(lines) != 1` to `len(lines) == 0`, because with more backends we were now logging more errors, and finding e.g. `len(lines)` to be 2.	2024-11-18 15:36:09 -08:00
Jacob Hoffman-Andrews	20fdcbcfe0	ratelimits: always use a pointer for limit (#7804 ) The zero value for `limit` is invalid, so returning `nil` in error cases avoids silently returning invalid limits (and means that if code makes a mistake and references an invalid limit it will be an obvious clear stack trace). This is more consistent, since the methods on `limit` use a pointer receiver. Also, since `limit` is a fairly large object, this saves some copying. Related to #7803 and #7797.	2024-11-15 13:45:07 -08:00
Samantha Frank	3506f09285	RA: Make calls to countCertificateIssued and countFailedValidations synchronous (#7824 ) Solves CI flakes in TestCertificatesPerDomain and TestIdentifiersPausedForAccount that are the result of a race on the Redis database. This has the downside of making failed validations and successful finalizations take slightly longer.	2024-11-15 16:33:51 -05:00
Samantha Frank	3baac6f6df	VA: Consolidate multiple metrics into one histogram (#7816 ) - Add a new histogram, validationLatency - Add a VA.observeLatency for observing validation latency - Refactor to ensure this metric can be observed exclusively within VA.PerformValidation and VA.IsCAAValid. - Replace validationTime, localValidationTime, remoteValidationTime, remoteValidationFailures, caaCheckTime, localCAACheckTime, remoteCAACheckTime, and remoteCAACheckFailures with validationLatency	2024-11-15 15:51:39 -05:00
Samantha Frank	f9fb688427	build: Specify go 1.23.0 in go.mod (#7822 ) We're using `.Keys()` a method of the maps package added in 1.23 (https://pkg.go.dev/maps@master#Keys) but our go.mod states 1.22.0. This causes some in-IDE linting errors in VSCode.	2024-11-15 12:49:04 -08:00
Samantha Frank	b2b5645e16	VA: Cleanup performRemoteValidation (#7814 ) Bring this code more in line with `VA.remoteDoDCV` in #7794. This should make these two easier to diff in review.	2024-11-15 12:25:06 -08:00
Samantha Frank	2502113ac3	VA: Remove logging of RIR and Perspective (#7818 )	2024-11-15 14:54:03 -05:00
Jacob Hoffman-Andrews	56f0ed6419	wfe: orders link to authz IDs with acccount (#7790 ) This means that most traffic will go to the authz URLs with account. After this has been deployed for 30 days (the max lifetime of an order), we can remove support for the old paths. Part of #7683	2024-11-15 10:34:14 -08:00
Jacob Hoffman-Andrews	0d70b12a75	ra: fix unittest for resetting pause limit (#7813 ) TestPerformValidation_FailedThenSuccessfulValidationResetsPauseIdentifiersRatelimit checks for a bucket being empty after a reset. However, that bucket is based on an account ID that is shared across multiple test cases. Instead, use a unique account and domain for this test. Fixes #7812	2024-11-14 15:04:38 -08:00
Jacob Hoffman-Andrews	c39f33e24f	va: fix race in TestMultiVALogging (#7811 )	2024-11-14 14:17:42 -08:00
Jacob Hoffman-Andrews	5e385e440a	ra: clean up countFailedValidations (#7797 ) Return an error and do logging in the caller. This adds early returns on a number of error conditions, which can prevent nil pointer dereference in those cases. Also update the description for AutomaticallyPauseZombieClients. Follows up #7763.	2024-11-14 16:16:36 -05:00
Jacob Hoffman-Andrews	26a9910911	ratelimits: improve disabled limit handling (#7803 ) In the FailedAuthorizations limits, there was code that intentionally ignored errLimitDisabled errors (`errors.Is(err, errLimitDisabled)`). However, that that resulted in those functions later using a returned `limit` value that was invalid (i.e. its zero value). That happened to trigger some later checks in validateTransaction. Specifically this check failed: if txn.cost > txn.limit.Burst { // error When txt.limit.Burst is zero, this will always fail. This problem doesn't really show up in prod, where all the limits are configured. But it showed up in tests, specifically TestPerformValidation_FailedValidationsTriggerPauseIdentifiersRatelimit, where the limits are constructed using a simplified config that leaves most of them disabled. In this change, I tried to make handling of errLimitDisabled more consistent, and always return an allow-only transaction as early as possible instead of falling through the error condition. Where that wasn't possible, I used a boolean to record whether the result of `builder.getLimit()` was valid before referencing any of its fields. I also added some "shouldn't happen" errors to catch this problem earlier if it recurs. I removed some "skip disabled limit" comments because those say "what the code does" (which the code also says), not "why the code does it". Fixes the test failures in #7797.	2024-11-13 16:23:50 -05:00
James Renken	0a27cba9f4	WFE/nonce: Add NonceHMACKey field (#7793 ) Add a new WFE & nonce config field, `NonceHMACKey`, which uses the new `cmd.HMACKeyConfig` type. Deprecate the `NoncePrefixKey` config field. Generalize the error message when validating `HMACKeyConfig` in `config`. Remove the deprecated `UseDerivablePrefix` config field, which is no longer used anywhere. Part of #7632	2024-11-13 10:31:28 -05:00
Jacob Hoffman-Andrews	5be3e99a4d	features: remove deprecated features (#7805 ) Fixes #7802	2024-11-13 10:22:32 -05:00
Jacob Hoffman-Andrews	1d8cf3e212	ra: remove special case for empty DNSNames (#7795 ) This case was added to work around a test case that didn't fill it out; instead, fill DNSNames for that test case.	2024-11-11 11:07:20 -05:00
Kruti Sutaria	a79a830f3b	ratelimits: Auto pause zombie clients (#7763 ) - Added a new key-value ratelimit `FailedAuthorizationsForPausingPerDomainPerAccount` which is incremented each time a client fails a validation. - As long as capacity exists in the bucket, a successful validation attempt will reset the bucket back to full capacity. - Upon exhausting bucket capacity, the RA will send a gRPC to the SA to pause the `account:identifier`. Further validation attempts will be rejected by the [WFE](https://github.com/letsencrypt/boulder/pull/7599). - Added a new feature flag, `AutomaticallyPauseZombieClients`, which enables automatic pausing of zombie clients in the RA. - Added a new RA metric `paused_pairs{"paused":[bool], "repaused":[bool], "grace":[bool]}` to monitor use of this new functionality. - Updated `ra_test.go` `initAuthorities` to allow accessing the `*ratelimits.RedisSource` for checking that the new ratelimit functions as intended. Co-authored-by: @pgporada Fixes https://github.com/letsencrypt/boulder/issues/7738 --------- Co-authored-by: Phil Porada <pporada@letsencrypt.org> Co-authored-by: Phil Porada <philporada@gmail.com>	2024-11-08 13:51:41 -08:00
Jacob Hoffman-Andrews	2058d985cc	Allow account IDs in authz and challenge URLs (#7768 ) This adds new handlers under `/acme/authz/` and `/acme/chall/` that expect to be followed by `{regID}/{authzID}` and `{regID}/{authzID}/{challengeID}`, respectively. For deployability, the old handlers continue to work, and the URLs returned for newly created objects will still point to the paths used by the old handlers (`/acme/authz-v3/` and `/acme/chall-v3/`). There are some self-referential URLs in authz and challenge responses, like the Location header, and the URL of challenges embedded in an authorization object. This PR updates `prepAuthorizationForDisplay` and `prepChallengeForDisplay` so those URLs can be generated consistently with the path that was requested. For the WFE tests, in most cases I duplicated an entire test and then updated it to test the `WithAccount` handler. The idea is that once we're fully switched over to the new format we can delete the tests for the non-`WithAccount` variants. Part of #7683	2024-11-06 11:52:10 -08:00

1 2 3 4 5 ...

6794 Commits All Branches Search

6794 Commits

All Branches