Commit Graph

22 Commits

Author SHA1 Message Date
Samantha Frank 6402a2275f
ratelimits: Remove a metric and some labels that we're not finding useful (#7902) 2024-12-20 08:44:08 -05:00
James Renken 62299362bd
ra/ratelimits: Update tests, use new TransactionBuilder constructor, fix ARI rate limit exception (#7869)
Add a new `ratelimits.NewTransactionBuilderWithLimits` constructor which
takes pre-populated rate limit data, instead of filenames for reading it
off disk.

Use this new constructor to change rate limits during RA tests, instead
of using extra `testdata` files.

Fix ARI renewals' exception from rate limits: consider `isARIRenewal` as
part of the `isRenewal` arg to `checkNewOrderLimits`.

Remove obsolete RA tests for rate limits that are now only checked in
the WFE.

Update the remaining new-order rate limit tests from the deprecated
`ratelimit` package to the new Redis-backed `ratelimits` package.
2024-12-18 14:23:13 -08:00
Samantha Frank 11d543bd98
ratelimits: Correctly handle stale and concurrently initialized buckets (#7886)
#7782 fixed an issue where concurrent requests to the same existing
bucket ignored all but one rate limit spend. However, concurrent
requests to the same empty bucket can still cause multiple
initializations that skip all but one spend. Use BatchSetNotExisting
(SETNX in Redis) to detect this scenario and then fall back to
BatchIncrement (INCRBY in Redis).

#7782 sets the TTL (Time-To-Live) of incremented buckets to the maximum
possible burst for the applied limit. Because this TTL doesn’t match the
TAT, these buckets can become "stale," holding a TAT in the past.
Incrementing these stale buckets by cost * emissionInterval leaves the
new TAT behind the current time, allowing clients who are sometimes idle
to gain extra burst capacity. Instead, use BatchSet (SET in Redis) to
overwrite the TAT to now + cost * emissionInterval. Though this
introduces a race condition similar to the empty-bucket case, it is less
harmful than granting extra burst capacity.
2024-12-17 12:42:51 -05:00
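
A minimal sketch of the per-bucket handling described in #7886 above, using single-key stand-ins for the batched `BatchSetNotExisting`, `BatchIncrement`, and `BatchSet` operations; the interface and helper names are illustrative, not Boulder's actual code:

```go
package sketch

import "time"

// Illustrative single-key stand-ins for the batched Redis operations.
type store interface {
	SetNotExisting(key string, tat time.Time, ttl time.Duration) (created bool, err error) // SETNX
	Increment(key string, delta time.Duration, ttl time.Duration) error                    // INCRBY
	Set(key string, tat time.Time, ttl time.Duration) error                                // SET
}

// spend sketches how empty, stale, and healthy buckets are updated.
func spend(s store, key string, exists bool, tat, now time.Time, cost int64, emissionInterval, ttl time.Duration) error {
	increment := time.Duration(cost) * emissionInterval

	if !exists {
		// Empty bucket: try to initialize it atomically. If a concurrent
		// request initialized it first, fall back to incrementing so this
		// spend isn't silently dropped.
		created, err := s.SetNotExisting(key, now.Add(increment), ttl)
		if err != nil || created {
			return err
		}
		return s.Increment(key, increment, ttl)
	}

	if tat.Before(now) {
		// Stale bucket: its TAT is in the past, so incrementing it would
		// grant extra burst capacity. Overwrite the TAT instead, accepting
		// a race similar to the empty-bucket case.
		return s.Set(key, now.Add(increment), ttl)
	}

	// Healthy existing bucket: increment, preserving concurrent spends.
	return s.Increment(key, increment, ttl)
}
```
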
Jacob Hoffman-Andrews 40e100c297
doc: replace "leaky" with "token" bucket (#7881)
Mostly we refer consistently to token bucket, but these two places (one
of which is soon to be removed) still had the "leaky" terminology, which
is potentially confusing.
2024-12-10 16:39:30 -08:00
Jacob Hoffman-Andrews 27e65f3e9f
ratelimits: add detail to error messages (#7871)
For batch operations, include the operation and the number of keys in
the error message. This should help diagnose whether we are getting `i/o
timeout` errors disproportionately for larger requests, or for certain
operations.

Also, make the ignored errors part of the overall WFE request logs,
which allows us to get additional context, like whether certain
requesters or domain names are getting disproportionately many errors.

Related to #7846.
2024-12-05 15:58:26 -08:00
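
For illustration, the kind of annotation described in #7871 above could look like the following; the function name and message format are assumptions, not the exact wording Boulder uses:

```go
package sketch

import "fmt"

// wrapBatchErr annotates a batch-operation error with the operation name
// and the number of keys involved, so timeouts can be correlated with
// request size. Illustrative only.
func wrapBatchErr(op string, keys []string, err error) error {
	if err == nil {
		return nil
	}
	return fmt.Errorf("%s of %d keys: %w", op, len(keys), err)
}
```
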
Aaron Gable d962c61067
RA and WFE tests: use inmem rate limit source (#7859)
The purpose of these RA and WFE unit tests is to test how they deal with
certain rate limit conditions, not to test talking to an actual Redis
instance. Streamline the tests by having them talk to an in-memory rate
limits store, rather than a Redis-backed one.
2024-12-03 14:52:16 -08:00
Aaron Gable bac5602c6d
Always use INCRBY for redis rate limits (#7856)
Deprecate the IncrementRateLimits feature flag, and always use the Redis
INCRBY instruction to update rate limit TATs.

Fixes https://github.com/letsencrypt/boulder/issues/7855
2024-12-02 15:25:33 -08:00
Kruti Sutaria a79a830f3b
ratelimits: Auto pause zombie clients (#7763)
- Added a new key-value ratelimit
`FailedAuthorizationsForPausingPerDomainPerAccount` which is incremented
each time a client fails a validation.
- As long as capacity exists in the bucket, a successful validation
attempt will reset the bucket back to full capacity.
- Upon exhausting bucket capacity, the RA will send a gRPC request to the
SA to pause the `account:identifier` pair. Further validation attempts will be
rejected by the [WFE](https://github.com/letsencrypt/boulder/pull/7599).
- Added a new feature flag, `AutomaticallyPauseZombieClients`, which
enables automatic pausing of zombie clients in the RA.
- Added a new RA metric `paused_pairs{"paused":[bool],
"repaused":[bool], "grace":[bool]}` to monitor use of this new
functionality.
- Updated `ra_test.go` `initAuthorities` to allow accessing the
`*ratelimits.RedisSource` for checking that the new ratelimit functions
as intended.

Co-authored-by: @pgporada 

Fixes https://github.com/letsencrypt/boulder/issues/7738

---------

Co-authored-by: Phil Porada <pporada@letsencrypt.org>
Co-authored-by: Phil Porada <philporada@gmail.com>
2024-11-08 13:51:41 -08:00
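
A simplified sketch of the pausing flow described in #7763 above; the limiter and SA interfaces and the bucket key format are stand-ins, not Boulder's actual types:

```go
package sketch

import (
	"context"
	"fmt"
)

// Illustrative stand-ins for Boulder's real types.
type pauseDecision struct{ Allowed bool }

type pauseLimiter interface {
	Spend(ctx context.Context, bucketKey string, cost int64) (*pauseDecision, error)
	Reset(ctx context.Context, bucketKey string) error
}

type saPauser interface {
	PauseIdentifier(ctx context.Context, regID int64, ident string) error
}

// onValidationResult sketches the zombie-client flow: each failed
// validation spends from FailedAuthorizationsForPausingPerDomainPerAccount,
// a successful validation resets the bucket to full capacity, and
// exhausting the bucket pauses the account:identifier pair.
func onValidationResult(ctx context.Context, l pauseLimiter, sa saPauser, regID int64, ident string, failed bool) error {
	bucketKey := fmt.Sprintf("%d:%s", regID, ident) // account:identifier
	if !failed {
		return l.Reset(ctx, bucketKey)
	}
	d, err := l.Spend(ctx, bucketKey, 1)
	if err != nil {
		return err
	}
	if !d.Allowed {
		// Capacity exhausted: ask the SA to pause the pair so the WFE
		// rejects further validation attempts.
		return sa.PauseIdentifier(ctx, regID, ident)
	}
	return nil
}
```
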
James Renken 4adc65fb7d
Rate limits: replace redis SET with INCRBY (#7782)
Add a new method, `BatchIncrement`, to issue `IncrBy` (instead of `Set`)
to Redis. This helps prevent the race condition that allows bursts of
near-simultaneous requests to, effectively, spend the same token.

Call this new method when incrementing an existing key. New keys still
need to use `BatchSet` because Redis doesn't have a facility to, within
a single operation, increment _or_ set a default value if none exists.

Add a new feature flag, `IncrementRateLimits`, gating the use of this
new method.

CPS Compliance Review: This feature flag does not change any behaviour
that is described or constrained by our CP/CPS. The closest relation
would just be API availability in general.

Fixes #7780
2024-11-04 11:20:44 -08:00
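
At the Redis level, the distinction described in #7782 above looks roughly like this go-redis sketch, assuming the TAT is stored as integer nanoseconds; the function names are illustrative, not Boulder's `BatchIncrement`/`BatchSet` themselves:

```go
package sketch

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// spendExisting updates an existing bucket with INCRBY. Two racing
// requests that both INCRBY are both applied, whereas with SET the second
// write would overwrite the first and one spend would be lost.
func spendExisting(ctx context.Context, rdb *redis.Client, key string, cost, emissionIntervalNs int64, ttl time.Duration) error {
	if err := rdb.IncrBy(ctx, key, cost*emissionIntervalNs).Err(); err != nil {
		return err
	}
	return rdb.Expire(ctx, key, ttl).Err()
}

// spendNew initializes a bucket with SET. New keys still need SET because
// Redis has no single command that increments a key or, if it is absent,
// initializes it to a chosen default.
func spendNew(ctx context.Context, rdb *redis.Client, key string, nowNs, cost, emissionIntervalNs int64, ttl time.Duration) error {
	return rdb.Set(ctx, key, nowNs+cost*emissionIntervalNs, ttl).Err()
}
```
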
Samantha Frank 6e6c8fe480
ratelimits: Update errors to deep link to individual limits documentation (#7767)
Updates rate limits error messages to deep link to new website docs added in https://github.com/letsencrypt/website/pull/1756.
2024-10-25 13:55:51 -04:00
Samantha Frank 31d0ff0f98
ratelimits: Replace *Decision merging with always returning most restrictive (#7667)
Fix a bug added in #7653 which sometimes attributed an "Allowed"
`Transaction` to the amalgamated "Denied" `*Decision`. Instead, always
return the most restrictive `*Decision` in the batch.

Remove a debug `fmt.Printf()` call added in #7653.
2024-08-15 17:49:34 -04:00
Samantha Frank 6a3e9d725b
ratelimits: Provide verbose user-facing rate limit errors (#7653)
- Instruct callers to call *Decision.Result() to check the result of
rate limit transactions
- Preserve the Transaction within the resulting *Decision
- Generate consistently formatted verbose errors using the metadata
found in the *Decision
- Fix broken key-value rate limits integration test in
TestDuplicateFQDNRateLimit

Fixes #7577
2024-08-12 16:14:15 -04:00
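
Sketched below is the caller-facing shape #7653 describes: the `Decision` fields and error wording here are assumptions, but the pattern is that callers check `Result()` rather than inspecting the decision directly:

```go
package sketch

import (
	"fmt"
	"time"
)

// Illustrative Decision; Boulder's real type preserves the Transaction
// and more metadata than shown here.
type Decision struct {
	allowed   bool
	limitName string
	retryIn   time.Duration
}

// Result returns nil when the transaction was allowed, and otherwise a
// consistently formatted, user-facing error built from the Decision's
// metadata.
func (d *Decision) Result(now time.Time) error {
	if d.allowed {
		return nil
	}
	return fmt.Errorf("too many requests for %q, retry after %s",
		d.limitName, now.Add(d.retryIn).Format(time.RFC3339))
}

// checkLimits shows the caller-side pattern: spend, then check Result.
func checkLimits(now time.Time, spend func() (*Decision, error)) error {
	d, err := spend()
	if err != nil {
		return err // infrastructure failure, e.g. Redis unavailable
	}
	return d.Result(now)
}
```
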
Samantha Frank a6e0fdc80e
ratelimits: Fix latency calculations (#7627) 2024-07-24 22:07:33 -04:00
Samantha 063db40db2
ratelimits: Fix cardinality explosion in overrideUsageGauge (#7548) 2024-06-14 13:34:40 -04:00
Samantha eb49d4487e
ratelimits: Implement batched Spends and Refunds (#7143)
- Move default and override limits, and associated methods, out of the
Limiter to new limitRegistry struct, embedded in a new public
TransactionBuilder.
- Export Transaction and add corresponding Transaction constructor
methods for each limit Name, making Limiter and TransactionBuilder the
API for interacting with the ratelimits package.
- Implement batched Spends and Refunds on the Limiter, the new methods
accept a slice of Transactions.
- Add new boolean fields check and spend to Transaction to support more
complicated cases that can arise in batches:
1. the InvalidAuthorizations limit is checked at New Order time in a
batch with many other limits, but should only be spent when an
Authorization is first considered invalid.
2. the CertificatesPerDomain limit is overridden by
CertificatesPerDomainPerAccount; when this is the case, spends of the
CertificatesPerDomain limit should be "best-effort" but must NOT deny the
request if capacity is lacking.
- Modify the existing Spend/Refund methods to support
Transaction.check/spend and 0 cost Transactions.
- Make bucketId private and add a constructor for each bucket key format
supported by ratelimits.
- Move domainsForRateLimiting() from ra.go to the ratelimits package. This
avoids a circular import issue in ra.go.

Part of #5545
2023-12-07 11:56:02 -05:00
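
The interaction of the new check and spend flags in a batch (#7143 above) might look roughly like the sketch below; the Transaction shape and the helper callbacks are illustrative, not the package's actual implementation:

```go
package sketch

import "context"

// Transaction, with the check/spend flags described above (illustrative).
type Transaction struct {
	BucketKey string
	Cost      int64
	Check     bool // participates in the batch's allow/deny decision
	Spend     bool // actually consumes capacity when allowed
}

// batchSpend sketches how the flags interact. checkOne and spendOne stand
// in for the per-bucket GCRA check and spend operations.
func batchSpend(ctx context.Context, txns []Transaction,
	checkOne func(context.Context, Transaction) (bool, error),
	spendOne func(context.Context, Transaction) error) (bool, error) {

	allowed := true
	for _, txn := range txns {
		if txn.Check {
			ok, err := checkOne(ctx, txn)
			if err != nil {
				return false, err
			}
			if !ok {
				// e.g. InvalidAuthorizations: checked at new-order time,
				// denying the batch, but only spent elsewhere.
				allowed = false
				continue
			}
		}
		if txn.Spend {
			// Spend-only transactions, e.g. CertificatesPerDomain when
			// overridden by CertificatesPerDomainPerAccount, are
			// best-effort: they consume capacity but never deny.
			if err := spendOne(ctx, txn); err != nil {
				return false, err
			}
		}
	}
	return allowed, nil
}
```
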
Samantha ca6314fa48
ratelimits: API improvements necessary for batches and limit fixes (#7117)
The `Limiter` API has been adjusted significantly to improve both safety
and ergonomics, and two `Limit` types have been corrected to match the
legacy implementations.

**Safety**
Previously, the key used for looking up limit overrides and for fetching
individual buckets from the key-value store was constructed within the
WFE. This posed a risk: if the key was malformed, the default limit
would still be enforced, but individual overrides would fail to function
properly. This has been addressed by the introduction of a new
`BucketId` type along with a `BucketId` constructor for each `Limit`
type. Each constructor is responsible for producing a well-formed bucket
key which undergoes the very same validation as any potentially matching
override key.

**Ergonomics**
Previously, each of the `Limiter` methods took a `Limit` name, a bucket
identifier, and a cost to be spent or refunded. To simplify this, each
method now accepts a new `Transaction` type, which provides a cost and
wraps a `BucketId` identifying the specific bucket.

The two changes above, when taken together, make the implementation of
batched rate limit transactions considerably easier, as a batch method
can accept a slice of `Transaction`.

**Limit Corrections**
PR #6947 added all of the existing rate limits which could be made
compatible with the key-value approach. Two of these were implemented
improperly: `CertificatesPerDomain` and `CertificatesPerFQDNSet` were
implemented as `CertificatesPerDomainPerAccount` and
`CertificatesPerFQDNSetPerAccount`, respectively.

Since we do not actually associate these limits with a particular ACME
account, the `regID` portion of each of their bucket keys has been
removed.
2023-11-08 13:29:01 -05:00
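
For illustration, the constructor-enforced validation described in #7117 above might look like this; the key format and function names are assumptions rather than the package's real ones:

```go
package sketch

import (
	"errors"
	"fmt"
	"net"
)

type Limit int

const NewRegistrationsPerIPAddress Limit = iota // illustrative limit

// BucketId wraps a validated, well-formed bucket key for one Limit.
type BucketId struct {
	limit Limit
	key   string
}

// newIPAddressBucketId validates its input the same way a matching
// override key would be validated, so a malformed key fails loudly here
// instead of silently enforcing only the default limit.
func newIPAddressBucketId(limit Limit, ip net.IP) (BucketId, error) {
	if ip == nil || ip.To16() == nil {
		return BucketId{}, errors.New("invalid IP address")
	}
	return BucketId{limit: limit, key: fmt.Sprintf("%d:%s", limit, ip)}, nil
}

// Transaction pairs a BucketId with the cost to spend or refund, so a
// batch method can simply accept a []Transaction.
type Transaction struct {
	bucket BucketId
	cost   int64
}
```
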
Samantha 9aef5839b5
WFE: Add new key-value ratelimits implementation (#7089)
Integrate the key-value rate limits from #6947 into the WFE. Rate limits
are backed by the Redis source added in #7016, and use the SRV record
shard discovery added in #7042.

Part of #5545
2023-10-04 14:12:38 -04:00
Samantha 6223acd987
ratelimit: Add an override usage gauge (#7076)
Add a gauge similar to the one added to our key-value rate limits
implementation in #7044.

Part of #7036
2023-09-13 17:34:51 -04:00
Samantha 636d30f4a9
ratelimit: Overhaul metrics for our existing rate limits (#7054)
- Use constants for each rate limit name to ensure consistency when
labeling metrics
- Consistently check `.Enabled()` outside of each limit check RA method
- Replace the existing checks counter with a latency histogram

Part of #5545
2023-09-11 15:06:16 -04:00
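
As a sketch of the counter-to-histogram change in #7054 above, a latency histogram labeled per limit could look like the following; the metric and label names here are assumptions, not Boulder's actual metrics:

```go
package sketch

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// Illustrative latency histogram, labeled by limit name and outcome.
var rateLimitLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Name: "ratelimit_check_latency_seconds",
	Help: "Latency of legacy rate limit checks.",
}, []string{"limit", "decision"})

func init() {
	prometheus.MustRegister(rateLimitLatency)
}

// observeCheck records one rate limit check, using a constant limit name
// so the metric labels stay consistent across call sites.
func observeCheck(limitName string, start time.Time, allowed bool) {
	decision := "denied"
	if allowed {
		decision = "allowed"
	}
	rateLimitLatency.WithLabelValues(limitName, decision).Observe(time.Since(start).Seconds())
}
```
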
Samantha 077a4e2dc4
ratelimits: Export override utilization metrics (#7044)
Fixes #7036
2023-08-23 13:40:23 -04:00
Samantha 48f211c7ba
ratelimits: Add Redis source (#7016)
Part of #5545
2023-08-10 11:45:04 -04:00
Samantha 055f620c4b
Initial implementation of key-value rate limits (#6947)
This design seeks to reduce read-pressure on our DB by moving rate limit
tabulation to a key-value datastore. This PR provides the following:

- (README.md) a short guide to the schemas, formats, and concepts
introduced in this PR
- (source.go) an interface for storing, retrieving, and resetting a
subscriber bucket
- (name.go) an enumeration of all defined rate limits
- (limit.go) a schema for defining default limits and per-subscriber
overrides
- (limiter.go) a high-level API for interacting with key-value rate
limits
- (gcra.go) an implementation of the Generic Cell Rate Algorithm, a
leaky bucket-style scheduling algorithm, used to calculate the present
or future capacity of a subscriber bucket using spend and refund
operations

Note: the included source implementation is test-only, currently backed
by a simple in-memory map protected by a mutex; implementations using
Redis and potentially other data stores will follow.

Part of #5545
2023-07-21 12:57:18 -04:00
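
For reference, the core GCRA spend decision in #6947 above works roughly as in this sketch, using the usual GCRA terminology (TAT, emission interval, burst offset); it is a simplification, not the package's exact code:

```go
package sketch

import "time"

// gcraSpend decides whether a bucket can absorb `cost` units and returns
// the updated theoretical arrival time (TAT). A bucket with TAT <= now is
// full; each unit of cost pushes the TAT forward by one emission interval,
// and the spend is allowed as long as the new TAT stays within one full
// burst of now.
func gcraSpend(tat, now time.Time, cost, burst int64, emissionInterval time.Duration) (allowed bool, newTAT time.Time) {
	if tat.Before(now) {
		tat = now // an idle bucket refills up to, but never beyond, full
	}
	increment := time.Duration(cost) * emissionInterval
	burstOffset := time.Duration(burst) * emissionInterval

	newTAT = tat.Add(increment)
	if newTAT.Sub(now) > burstOffset {
		return false, tat // insufficient capacity; bucket state unchanged
	}
	return true, newTAT
}
```

A refund is the mirror image: subtract cost * emissionInterval from the TAT, clamping at now so a bucket is never refunded beyond full capacity.
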