Replace the current three-piece setup (enum of feature variables, map of
feature vars to default values, and autogenerated bidirectional maps of
feature variables to and from strings) with a much simpler one-piece
setup: a single struct with one boolean-typed field per feature. This
preserves the overall structure of the package -- a single global
feature set protected by a mutex, and Set, Reset, and Enabled methods --
although the exact function signatures have all changed somewhat.
The executable config format remains the same, so no deployment changes
are necessary. This change does deprecate the AllowUnrecognizedFeatures
feature, as we cannot tell the json config parser to ignore unknown
field names, but that flag is set to False in all of our deployment
environments already.
Fixes https://github.com/letsencrypt/boulder/issues/6802
Fixes https://github.com/letsencrypt/boulder/issues/5229
- Move default and override limits, and associated methods, out of the
Limiter to new limitRegistry struct, embedded in a new public
TransactionBuilder.
- Export Transaction and add corresponding Transaction constructor
methods for each limit Name, making Limiter and TransactionBuilder the
API for interacting with the ratelimits package.
- Implement batched Spends and Refunds on the Limiter, the new methods
accept a slice of Transactions.
- Add new boolean fields check and spend to Transaction to support more
complicated cases that can arise in batches:
1. the InvalidAuthorizations limit is checked at New Order time in a
batch with many other limits, but should only be spent when an
Authorization is first considered invalid.
2. the CertificatesPerDomain limit is overridden by
CertficatesPerDomainPerAccount, when this is the case, spends of the
CertificatesPerDomain limit should be "best-effort" but NOT deny the
request if capacity is lacking.
- Modify the existing Spend/Refund methods to support
Transaction.check/spend and 0 cost Transactions.
- Make bucketId private and add a constructor for each bucket key format
supported by ratelimits.
- Move domainsForRateLimiting() from the ra.go to ratelimits. This
avoids a circular import issue in ra.go.
Part of #5545
This is a cleanup PR finishing the migration from int64 timestamps to
protobuf `*timestamppb.Timestamps` by removing all usage of the old
int64 fields. In the previous PR
https://github.com/letsencrypt/boulder/pull/7121 all fields were
switched to read from the protobuf timestamppb fields.
Adds a new case to `core.IsAnyNilOrZero` to check various properties of
a `*timestamppb.Timestamp` reducing the visual complexity for receivers.
Fixes https://github.com/letsencrypt/boulder/issues/7060
The `Limiter` API has been adjusted significantly to both improve both
safety and ergonomics and two `Limit` types have been corrected to match
the legacy implementations.
**Safety**
Previously, the key used for looking up limit overrides and for fetching
individual buckets from the key-value store was constructed within the
WFE. This posed a risk: if the key was malformed, the default limit
would still be enforced, but individual overrides would fail to function
properly. This has been addressed by the introduction of a new
`BucketId` type along with a `BucketId` constructor for each `Limit`
type. Each constructor is responsible for producing a well-formed bucket
key which undergoes the very same validation as any potentially matching
override key.
**Ergonomics**
Previously, each of the `Limiter` methods took a `Limit` name, a bucket
identifier, and a cost to be spent/ refunded. To simplify this, each
method now accepts a new `Transaction` type which provides a cost, and
wraps a `BucketId` identifying the specific bucket.
The two changes above, when taken together, make the implementation of
batched rate limit transactions considerably easier, as a batch method
can accept a slice of `Transaction`.
**Limit Corrections**
PR #6947 added all of the existing rate limits which could be made
compatible with the key-value approach. Two of these were improperly
implemented;
- `CertificatesPerDomain` and `CertificatesPerFQDNSet`, were implemented
as
- `CertificatesPerDomainPerAccount` and
`CertificatesPerFQDNSetPerAccount`.
Since we do not actually associate these limits with a particular ACME
account, the `regID` portion of each of their bucket keys has been
removed.
The RequireCommonName feature flag was our only "inverted" feature flag,
which defaulted to true and had to be explicitly set to false. This
inversion can lead to confusion, especially to readers who expect all Go
default values to be zero values. We plan to remove the ability for our
feature flag system to support default-true flags, which the existence
of this flag blocked. Since this flag has not been set in any real
configs, inverting it is easy.
Part of https://github.com/letsencrypt/boulder/issues/6802
* Adds new `google.protobuf.Timestamp` fields to each .proto file where
we had been using `int64` fields as a timestamp.
* Updates relevant gRPC messages to populate the new
`google.protobuf.Timestamp` fields in addition to the old `int64`
timestamp fields.
* Added tests for each `<x>ToPB` and `PBto<x>` functions to ensure that
new fields passed into a gRPC message arrive as intended.
* Removed an unused error return from `PBToCert` and `PBToCertStatus`
and cleaned up each call site.
Built on-top of https://github.com/letsencrypt/boulder/pull/7069
Part 2 of 4 related to
https://github.com/letsencrypt/boulder/issues/7060
Integrate the key-value rate limits from #6947 into the WFE. Rate limits
are backed by the Redis source added in #7016, and use the SRV record
shard discovery added in #7042.
Part of #5545
Fix an issue related to the custom gRPC Picker implementation introduced
in #6618. When a nonce contained a prefix not associated with a known
backend, the Picker would continuously rebuild, re-resolve DNS, and
eventually throw a 500 "Server Error" at RPC timeout. The Picker now
promptly returns a 400 "Bad Nonce" error as expected, in response the
requesting client should retry their request with a fresh nonce.
Additionally:
- WFE unit tests use derived nonces when `"BOULDER_CONFIG_DIR" ==
"test/config-next"`.
- `Balancer.Build()` in "noncebalancer" forces a rebuild until non-zero
backends are available. This matches the
[balancer/roundrobin](d524b40946/balancer/roundrobin/roundrobin.go (L49-L53))
implementation.
- Nonces with no matching backend increment "jose_errors" with label
`"type": "JWSInvalidNonce"` and "nonce_no_backend_found".
- Nonces of incorrect length are now rejected at the WFE and increment
"jose_errors" with label `"type": "JWSMalformedNonce"` instead of
`"type": "JWSInvalidNonce"`.
- Nonces not encoded as base64url are now rejected at the WFE and
increment "jose_errors" with label `"type": "JWSMalformedNonce"` instead
of `"type": "JWSInvalidNonce"`.
Fixes#6969
Part of #6974
This was introduced early in Boulder development when we had the concept
of a "short serial" (monotonically increasing) which would be prepended
to random bytes to form the full serial. We wanted to specially report
the case that there were duplicates of a given short serial since it
meant a problem with our monotonicity.
We've long since abandoned that idea, and also this code can't be
exercised because sa.SelectCertificate does a LIMIT 1 anyhow.
This can take two values (typically the return values of a two-value
function) and panic if the error is non-nil, returning the interesting
value. This is particularly useful for cases where we statically know
the call will succeed.
Thanks to @mcpherrinm for the idea!
The contacts field of an account can be very verbose, and is irrelevant
to the vast majority -- e.g. creating orders, validating challenges, and
downloading certificates -- of requests made by an account. To reduce
the length of our WFE log lines, remove the Contacts field from all
logs. When we actually need it, we can get it from the database.
Also remove the RequestEvent.TLS field, which is unused.
In WFE, we do a goodkey check when validating a self-authenticated POST
(i.e. when creating an account). For a while, that was a purely local
check, looking at a list of bad keys or bad moduluses, or checking for
factorability. At some point we also added a backend check, querying the
SA to see if a key was blocked. However, we did not update this one code
path to distinguish "bad key" from "timeout querying SA." That meant
that sometimes we would give a badPublicKey Problem Document when we
should have given an internalServerError.
Related:
https://github.com/letsencrypt/boulder/issues/6795#issuecomment-1574217398
Define a `bJSONWebSignature` struct which embeds a
`*jose.JSONWebSignature`. The only method that can produce a
`bJSONWebSignature` is `wfe.parseJWS` so that we can ensure
safety/sanity checks are performed on the incoming data. Restricts
several methods and functions to take a `jose.Header` as an input
parameter, rather than a full JWS.
Fixes https://github.com/letsencrypt/boulder/issues/5676.
Remove the remaining divergences from RFC8555 regarding what error types
we use in certain situations. Specifically:
- use "invalidContact" instead of "invalidEmail";
- use "unsupportedContact" for contact addresses that use a protocol
other than "mailto:"; and
- use "unsupportedIdentifier" for identifiers that specify a type other
than "dns".
This PR adds a new configuration block specifically for the otelhttp
instrumentation. This block is separate from the existing
"opentelemetry" configuration, and is only relevant when using otelhttp
instrumentation. It does not share any codepath with the existing
configuration, so it is at the top level to indicate which services it
applies to.
There's a bit of plumbing new configuration through. I've adopted the
measured_http package to also set up opentelemetry instead of just
metrics, which should hopefully allow any future changes to be smaller
(just config & there) and more consistent between the wfe2 and ocsp
responder.
There's one option here now, which disables setting
[otelhttp.WithPublicEndpoint](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp#WithPublicEndpoint).
This option is designed to do exactly what we want: Don't accept
incoming spans as parents of the new span created in the server.
Previously we had a setting to disable parent-based sampling to help
with this problem, which doesn't really make sense anymore, so let's
just remove it and simplify that setup path. The default of "false" is
designed to be the safe option. It's set to True in the test/ configs
for integration tests that use traces, and I expect we'll likely set it
true in production eventually once the LBs are configured to handle
tracing themselves.
Fixes#6851
Make minor, non-user-visible changes to how we structure the probs
package. Notably:
- Add new problem types for UnsupportedContact and
UnsupportedIdentifier, which are specified by RFC8555 and which we will
use in the future, but haven't been using historically.
- Sort the problem types and constructor functions to match the
(alphabetical) order given in RFC8555.
- Rename some of the constructor functions to better match their
underlying problem types (e.g. "TLSError" to just "TLS").
- Replace the redundant ProblemDetailsToStatusCode function with simply
always returning a 500 if we haven't properly set the problem's
HTTPStatus.
- Remove the ability to use either the V1 or V2 error namespace prefix;
always use the proper RFC namespace prefix.
Update the document number to the latest version, and remove the /get/
prefix since it now supports both the GET and POST portions of the spec.
Also update one piece of tooling to properly get the ARI URL from the
directory, rather than hard-coding it.
Add a new shared config stanza which all boulder components can use to
configure their Open Telemetry tracing. This allows components to
specify where their traces should be sent, what their sampling ratio
should be, and whether or not they should respect their parent's
sampling decisions (so that web front-ends can ignore sampling info
coming from outside our infrastructure). It's likely we'll need to
evolve this configuration over time, but this is a good starting point.
Add basic Open Telemetry setup to our existing cmd.StatsAndLogging
helper, so that it gets initialized at the same time as our other
observability helpers. This sets certain default fields on all
traces/spans generated by the service. Currently these include the
service name, the service version, and information about the telemetry
SDK itself. In the future we'll likely augment this with information
about the host and process.
Finally, add instrumentation for the HTTP servers and grpc
clients/servers. This gives us a starting point of being able to monitor
Boulder, but is fairly minimal as this PR is already somewhat unwieldy:
It's really only enough to understand that everything is wired up
properly in the configuration. In subsequent work we'll enhance those
spans with more data, and add more spans for things not automatically
traced here.
Fixes https://github.com/letsencrypt/boulder/issues/6361
---------
Co-authored-by: Aaron Gable <aaron@aarongable.com>
When sending an ARI response, write the Retry-After header before
writing the JSON response body. This is necessary because
http.ResponseWriter implicitly calls WriteHeader whenever Write is
called, flushing all headers to the network and preventing any
additional headers from being written. Unfortunately, the unittests use
httptest.ResponseRecorder, which doesn't seem to enforce this invariant
(it's happy to report headers which were written after the body). Add a
header check to the integration tests, to make up for this deficiency.
Change the SetCommonName flag, introduced in #6706, to
RequireCommonName. Rather than having the flag control both whether or
not a name is hoisted from the SANs into the CN *and* whether or not the
CA is willing to issue certs with no CN, this updated flag now only
controls the latter. By default, the new flag is true, and continues our
current behavior of failing issuance if we cannot set a CN in the cert.
When the flag is set to false, then we are willing to issue certificates
for which the CSR contains no CN and there is no SAN short enough to be
hoisted into the CN field.
When we have rolled out this change, we can move on to the next flag in
this series: HoistCommonName, which will control whether or not a SAN is
hoisted at all, effectively giving the CSRs (and therefore the clients)
full control over whether their certificate contains a SAN.
This change is safe because no environment explicitly sets the
SetCommonName flag to false yet.
Fixes#5112
Remove tracing using Beeline from Boulder. The only remnant left behind
is the deprecated configuration, to ensure deployability.
We had previously planned to swap in OpenTelemetry in a single PR, but
that adds significant churn in a single change, so we're doing this as
multiple steps that will each be significantly easier to reason about
and review.
Part of #6361
For simple 404s, there's no need to log an InternalError in addition to
the user-facing error, so pass `nil` to sendError as the internalError
parameter. This cleans up Certificate, GetOrder, and FinalizeOrder; all
the places I could find that checked for `NotFound` and also logged an
unnecessary InternalError.
I also removed a redundant an unnecessary error wrapping, and a
reference to "short serial" which is not a concept we have anymore.
This was mostly unused. The only caller was orphan-finder, which used it
to determine if a certificate was already in the database. But this is
not particularly important functionality, so I've removed it.
[RFC 8555 section
7.4](https://www.rfc-editor.org/rfc/rfc8555.html#section-7.4) states
regarding Orders in the "processing" state:
> "processing": The certificate is being issued. Send a POST-as-GET
> request after the time given in the Retry-After header field of
> the response, if any.
Add a Retry-After header when serving Order objects that are in the
"processing" state. This may help control clients which implement Order
polling but without any built-in backoff. The retry interval is
hard-coded to be 3s, slightly above our current 99th percentile Finalize
latency.
Remove the `MandatoryPOSTasGET` flag from the WFE2.
Update the ACMEv2 divergence doc to note that neither staging nor
production use MandatoryPOSTasGET.
Fixes#6582.
Also remove CSRDNSNames, CSRIPAddresses and CSREmailAddresses.
And add a new log field "DNSNames", for use in new-order, finalize, and
revoke requests.
Add a "RevocationReason" field in the "Extra" section for revoke
requests.
Give ARI improved error messages when no request path is specified and
when parsing of the request path blob fails.
Also, add a tool which can be used to quickly generate ARI requests and
print their results, to make manual spot-checking easier.
Fixes#6629
Assign nonce prefixes for each nonce-service by taking the first eight
characters of the the base64url encoded HMAC-SHA256 hash of the RPC
listening address using a provided key. The provided key must be same
across all boulder-wfe and nonce-service instances.
- Add a custom `grpc-go` load balancer implementation (`nonce`) which
can route nonce redemption RPC messages by matching the prefix to the
derived prefix of the nonce-service instance which created it.
- Modify the RPC client constructor to allow the operator to override
the default load balancer implementation (`round_robin`).
- Modify the `srv` RPC resolver to accept a comma separated list of
targets to be resolved.
- Remove unused nonce-service `-prefix` flag.
Fixes#6404
Originally, WFEs had a built-in nonce service. Then we added a "remote
nonce service" via gRPC, but we kept a fallback path for when the remote
nonce service was not configured, to use a built-in nonce service. This
PR removes that fallback path.
Since the fallback path was relied on by the unittests, this also
refactors the unittests to use a gRPC-style nonce service (but in-memory
for the unittests).
Fixes#6530
Update our implementation of ARI to return a renewal window entirely in
the past (i.e., suggesting immediate renewal) if the certificate in
question has been revoked for any reason. This will allow clients which
implement ARI to discover that they need to replace their certificate
without having to query OCSP directly, especially as we move into a
future where OCSP is mostly supplanted by aggregated CRLs.
Fixes#6503
In the WFE, ocsp-responder, and crl-updater, switch from using
StorageAuthorityClients to StorageAuthorityReadOnlyClients. This ensures
that these services cannot call methods which write to our database.
Fixes#6454
When a client cancels their HTTP request, that propagates into
gRPC-level cancellation. But we don't want those canceled RPCs to show
up as 500s, we want them to show up as 408s. We do that by producing a
special ProblemDetails at the gRPC level. However, for that trick to
work, we need to make sure errors from RPC methods get passed through
web.ProblemDetailsForError. There were some places where we didn't do
this and instead created a from-scratch ProblemDetails. This resulted in
spurious 500s.
My methodology was to look at every method call on each of the WFE's
fields that represents a gRPC backend: `ra`, `sa`, `accountGetter`,
`nonceService`, `remoteNonceServe`. If the error handling for that call
did not use web.ProblemDetailsForError, I changed it to use that.
Fixes#6524
In draft because still needs tests.
Clean up several spots where we were behaving differently on
go1.18 and go1.19, now that we're using go1.19 everywhere. Also
re-enable the lint and generate tests, and fix the various places where
the two versions disagreed on how comments should be formatted.
Also clean up the OldTLS codepaths, now that both go1.19 and our
own feature flags have forbidden TLS < 1.2 everywhere.
Fixes#6011
- Add a new field, `RetryAfter` to `BoulderError`s
- Add logic to wrap/unwrap the value of the `RetryAfter` field to our gRPC error
interceptor
- Plumb `RetryAfter` for `DuplicateCertificateError` emitted by RA to the WFE
client response header
Part of #6256
Suggest that subscribers with certificates impacted by an ongoing revocation
incident renew immediately.
- Make SA method `IncidentsForSerial` a callable RPC
Resolves#6282
Update our ACME Renewal Info implementation to parse
the CertID-based request format specified in the current
version of the draft specification.
Part of #6033