I think ideally we'd only ever call exportMetrics
with a valid time, but that's a bit bigger of a refactor of this code.
This was the fix we lightly decided on in the discussion of #6635.
Fixes #6635
RFC 8659 (CAA; https://www.rfc-editor.org/rfc/rfc8659) says that "A CA
MUST NOT issue certificates for any FQDN if the Relevant RRset for that
FQDN contains a CAA critical Property for an unknown or unsupported
Property Tag."
Let's Encrypt does technically support the iodef property tag: we
recognize it, but then ignore it and never choose to send notifications
to the given contact address. Historically, we have carried around the
iodef property tags in our internal structures as though we might use
them, but all code referencing them was essentially dead code.
As part of a set of simplifications,
https://github.com/letsencrypt/boulder/pull/6886 made it so that we
completely ignore iodef property tags. However, this had the unintended
side-effect of causing iodef property tags with the Critical bit set to
be counted as "unknown critical" tags, which prevent issuance.
This change causes our property tag parsing code to recognize iodef tags
again, so that critical iodef tags don't prevent issuance.
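The rule can be sketched as below. This is a minimal illustration, not Boulder's parsing code: the `caaRecord` type and `blocksIssuance` function are hypothetical names, and the sketch covers only the "unknown critical" check, not issuer matching.

```go
package main

import (
	"fmt"
	"strings"
)

// caaRecord is a hypothetical, simplified view of one CAA resource record.
type caaRecord struct {
	Tag      string
	Value    string
	Critical bool // true when the Issuer Critical flag bit is set
}

// blocksIssuance reports whether any record sets the Critical bit on a
// property tag outside the recognized set. iodef counts as recognized even
// though its value is otherwise ignored.
func blocksIssuance(records []caaRecord) bool {
	for _, r := range records {
		switch strings.ToLower(r.Tag) { // RFC 8659: tag matching is case-insensitive
		case "issue", "issuewild", "iodef":
			// Recognized tag: the Critical bit has no blocking effect.
		default:
			if r.Critical {
				return true
			}
		}
	}
	return false
}

func main() {
	critIodef := []caaRecord{{Tag: "iodef", Value: "mailto:security@example.com", Critical: true}}
	critOther := []caaRecord{{Tag: "madeup", Value: "x", Critical: true}}
	fmt.Println(blocksIssuance(critIodef)) // false
	fmt.Println(blocksIssuance(critOther)) // true
}
```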
Given two empty slices, one that is equal to nil and one that is not,
AssertDeepEquals used to produce this confusing output:
[[]] !(deep)= [[]]
After this change, it produces:
[[]string(nil)] !(deep)= [[]string{}]
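The difference matches what Go's `%v` and `%#v` verbs print for these two values. The sketch below assumes the change amounts to switching verbs; `describe` is a hypothetical helper, not the test utility itself.

```go
package main

import (
	"fmt"
	"reflect"
)

// describe formats v with %v (ambiguous for empty slices) and with %#v
// (Go-syntax form, which distinguishes nil from empty).
func describe(v interface{}) (plain, goSyntax string) {
	return fmt.Sprintf("[%v]", v), fmt.Sprintf("[%#v]", v)
}

func main() {
	var nilSlice []string
	emptySlice := []string{}

	// The two values are not deeply equal, yet %v prints them identically.
	fmt.Println(reflect.DeepEqual(nilSlice, emptySlice)) // false

	p1, g1 := describe(nilSlice)
	p2, g2 := describe(emptySlice)
	fmt.Println(p1, "!(deep)=", p2) // [[]] !(deep)= [[]]
	fmt.Println(g1, "!(deep)=", g2) // [[]string(nil)] !(deep)= [[]string{}]
}
```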
Adds new prometheus metrics from the configured log list and configured
CT logs to the ctpolicy constructor. `ct_operator_group_size_gauge`
returns the number of configured logs managed by each operator in the
log list. `ct_shard_expiration_seconds` returns a Unix timestamp
representation of the `end_exclusive` field for each configured log in
the `sctLogs` list. For posterity, Boulder retrieves SCTs from logs in
the `sctLogs` list.
```
ct_operator_group_size_gauge{operator="Operator A",source="finalLogs"} 2
ct_operator_group_size_gauge{operator="Operator A",source="sctLogs"} 4
ct_operator_group_size_gauge{operator="Operator B",source="sctLogs"} 2
ct_operator_group_size_gauge{operator="Operator D",source="sctLogs"} 1
ct_operator_group_size_gauge{operator="Operator F",source="finalLogs"} 1
ct_operator_group_size_gauge{operator="Operator F",source="infoLogs"} 1
ct_shard_expiration_seconds{logID="A1 Current",operator="Operator A"} 3.15576e+09
ct_shard_expiration_seconds{logID="A1 Future",operator="Operator A"} 3.47126688e+10
ct_shard_expiration_seconds{logID="A2 Current",operator="Operator A"} 3.15576e+09
ct_shard_expiration_seconds{logID="A2 Past",operator="Operator A"} 0
ct_shard_expiration_seconds{logID="B1",operator="Operator B"} 3.15576e+09
ct_shard_expiration_seconds{logID="B2",operator="Operator B"} 3.15576e+09
ct_shard_expiration_seconds{logID="D1",operator="Operator D"} 3.15576e+09
```
Fixes https://github.com/letsencrypt/boulder/issues/5705
Removes the `//ctpolicy/loglist.go` init function which previously
seeded the math/rand global random generator in favor of Go 1.20
math/rand now doing this automatically. See release notes
[here.](https://tip.golang.org/doc/go1.20)
> The [math/rand](https://tip.golang.org/pkg/math/rand/) package now
automatically seeds the global random number generator (used by
top-level functions like Float64 and Int) with a random value, and the
top-level [Seed](https://tip.golang.org/pkg/math/rand/#Seed) function
has been deprecated. Programs that need a reproducible sequence of
random numbers should prefer to allocate their own random source, using
rand.New(rand.NewSource(seed)).
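For code that does need a reproducible sequence, the pattern the release notes recommend looks roughly like this. The `sequence` helper is a hypothetical name used only for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sequence returns n pseudo-random ints below 100 drawn from a private
// source seeded with seed. The global generator is left untouched.
func sequence(seed int64, n int) []int {
	r := rand.New(rand.NewSource(seed))
	out := make([]int, n)
	for i := range out {
		out[i] = r.Intn(100)
	}
	return out
}

func main() {
	// The same seed yields the same sequence on every call.
	fmt.Println(sequence(42, 5))
	fmt.Println(sequence(42, 5))
}
```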
Move the creation of the FilterSource outside of the conditional block,
so that the underlying source gets wrapped no matter which kind (either
an inMemorySource or a checkedRedisSource) it is.
This has two advantages: first, it means that static ocsp responders are
safer and more accurate, because they're now basing their responses on
both the issuer and the serial, not just the serial; and second, it
makes the current config validation tag which marks the "issuerCerts"
config field as required with `min=1` accurate.
We no longer issue OCSP responses for our intermediate certificates,
instead producing CRLs which cover those intermediates. Remove the OCSP
response from our integration test ceremony, remove the configuration
for the static ocsp-responder which serves that response, and remove the
integration test which spins up and checks that responder. Replace all
of the above with new CRLs generated as part of the integration test
ceremony.
A recent refactoring (https://github.com/letsencrypt/boulder/pull/6906)
started treating NXDOMAIN for a CAA lookup as a hard error, when it
should be treated (from Boulder's point of view) as meaning there is an
empty list of resource records.
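The intended behavior, sketched under assumptions: `errNXDOMAIN`, `errSERVFAIL`, `lookupFunc`, and `lookupCAA` are hypothetical stand-ins, not Boulder's resolver types.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for resolver response codes.
var (
	errNXDOMAIN = errors.New("NXDOMAIN")
	errSERVFAIL = errors.New("SERVFAIL")
)

// lookupFunc is a hypothetical resolver signature returning CAA records.
type lookupFunc func(name string) ([]string, error)

// lookupCAA treats NXDOMAIN as "no records" rather than as a failure.
// Every other error is still returned to the caller.
func lookupCAA(resolve lookupFunc, name string) ([]string, error) {
	records, err := resolve(name)
	if errors.Is(err, errNXDOMAIN) {
		return nil, nil
	}
	if err != nil {
		return nil, err
	}
	return records, nil
}

func main() {
	nx := func(string) ([]string, error) { return nil, errNXDOMAIN }
	records, err := lookupCAA(nx, "missing.example.com")
	fmt.Println(len(records), err) // 0 <nil>
}
```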
Removes the `Hostname` and `Port` fields from an http-01
ValidationRecord model prior to storing the record in the database.
Using `"hostname":"example.com","port":"80"` as a snippet of a whole
validation record, we'll save minimum 36 bytes for each new http-01
ValidationRecord that gets stored. When retrieving the record, the
ValidationRecord `RehydrateHostPort` method will repopulate the
`Hostname` and `Port` fields from the `URL` field.
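A rough sketch of how rehydration from the `URL` field could work. The `validationRecord` type and the lowercase `rehydrateHostPort` method here are simplified, hypothetical stand-ins, and the default ports for each scheme are an assumption.

```go
package main

import (
	"fmt"
	"net/url"
)

// validationRecord is a hypothetical, trimmed-down record model.
type validationRecord struct {
	URL      string
	Hostname string
	Port     string
}

// rehydrateHostPort repopulates Hostname and Port from URL. When the URL has
// no explicit port, the scheme's default is used (an assumption).
func (r *validationRecord) rehydrateHostPort() error {
	u, err := url.Parse(r.URL)
	if err != nil {
		return err
	}
	host, port := u.Hostname(), u.Port()
	if host == "" {
		return fmt.Errorf("URL %q has no hostname", r.URL)
	}
	if port == "" {
		switch u.Scheme {
		case "http":
			port = "80"
		case "https":
			port = "443"
		default:
			return fmt.Errorf("unsupported scheme %q", u.Scheme)
		}
	}
	r.Hostname, r.Port = host, port
	return nil
}

func main() {
	rec := validationRecord{URL: "http://example.com/.well-known/acme-challenge/token"}
	if err := rec.rehydrateHostPort(); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(rec.Hostname, rec.Port) // example.com 80

	// The snippet quoted above is 36 bytes long.
	fmt.Println(len(`"hostname":"example.com","port":"80"`)) // 36
}
```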
Fixes the main goal of
https://github.com/letsencrypt/boulder/issues/5231.
---------
Co-authored-by: Samantha <hello@entropy.cat>
Enable SA gRPC health checks in Consul ahead of further changes for
#6878. Calls to the `Check` method of the SA's grpc.health.v1.Health
service must respond `SERVING` before the `sa` service will be
advertised in Consul DNS. Consul will continue to poll this service
every 5 seconds.
- Add `bconsul` docker service to boulder `bluenet` and `rednet`
- Add TLS credentials for `consul.boulder`:
```shell
$ openssl x509 -in consul.boulder/cert.pem -text | grep DNS
DNS:consul.boulder
```
- Update `test/grpc-creds/generate.sh` to add `consul.boulder`
- Update test SA configs to allow `consul.boulder` to access
`grpc.health.v1.Health`
Part of #6878
Add per-shard exponential backoff and retry to crl-updater. Each
individual CRL shard will be retried up to MaxAttempts (default 1)
times, with exponential backoff starting at 1 second and maxing out at 1
minute between each attempt.
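A minimal sketch of that retry schedule. The names (`backoff`, `retry`, `noSleep`, `errShardFailed`) are hypothetical, the doubling factor is an assumption (the description only says "exponential"), and no jitter is modeled.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	baseDelay = time.Second // first wait, per the description above
	maxDelay  = time.Minute // ceiling on any single wait
)

// errShardFailed is a hypothetical error used by the examples.
var errShardFailed = errors.New("shard failed")

// backoff returns the wait before retry number n (n = 0 is the wait after
// the first failed attempt). Doubling each time is an assumption.
func backoff(n int) time.Duration {
	d := baseDelay
	for i := 0; i < n; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// noSleep is a stand-in for time.Sleep so examples run instantly.
func noSleep(time.Duration) {}

// retry runs op up to maxAttempts times, sleeping between failed attempts.
// Values below 1 are treated as 1, so op always runs at least once.
func retry(maxAttempts int, sleep func(time.Duration), op func() error) error {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			sleep(backoff(attempt - 1))
		}
		if err = op(); err == nil {
			return nil
		}
	}
	return err
}

func main() {
	for n := 0; n < 8; n++ {
		fmt.Println(n, backoff(n)) // 1s, 2s, 4s, ... capped at 1m0s
	}
	calls := 0
	err := retry(3, noSleep, func() error {
		calls++
		if calls < 3 {
			return errShardFailed
		}
		return nil
	})
	fmt.Println(calls, err) // 3 <nil>
}
```

In real use, `time.Sleep` would be passed as the `sleep` argument.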
This can effectively reduce the parallelism of crl-updater: while a
goroutine is sleeping between attempts of a failing shard, it is not
doing work on another shard. This is a desirable feature, since it means
that crl-updater gently reduces the total load it places on the network
and database when shards start to fail.
Setting this new config parameter is tracked in IN-9140
Fixes https://github.com/letsencrypt/boulder/issues/6895
When processing CAA records, keep track of the FQDN at which that CAA
record was found (which may be different from the FQDN for which we are
attempting issuance, since we crawl CAA records upwards from the
requested name to the TLD). Then surface this name upwards so that it
can be included in our own log lines and in the problem documents which
we return to clients.
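The upward crawl, and the idea of returning the name at which records were found, can be sketched like this. `findCAA` and its lookup callback are hypothetical, and the sketch ignores CNAME handling and lookup errors.

```go
package main

import (
	"fmt"
	"strings"
)

// findCAA walks from name up toward the TLD and returns the first non-empty
// record set together with the FQDN at which it was found.
func findCAA(lookup func(string) []string, name string) (records []string, foundAt string) {
	labels := strings.Split(strings.TrimSuffix(name, "."), ".")
	for i := range labels {
		candidate := strings.Join(labels[i:], ".")
		if rrs := lookup(candidate); len(rrs) > 0 {
			return rrs, candidate
		}
	}
	return nil, ""
}

func main() {
	// Hypothetical zone data: only the parent domain has a CAA record.
	zone := map[string][]string{
		"example.com": {`0 issue "ca.example.net"`},
	}
	lookup := func(n string) []string { return zone[n] }

	records, foundAt := findCAA(lookup, "www.sub.example.com")
	fmt.Printf("found %d record(s) at %s\n", len(records), foundAt)
}
```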
Fixes https://github.com/letsencrypt/boulder/issues/3171
If the resolver provides EDE (https://www.rfc-editor.org/rfc/rfc8914),
Boulder will automatically expose it in the error message. Most error
messages contain the error RCODE (NXDOMAIN, SERVFAIL, etc.), but when
EDE is present we omit the RCODE in the interest of brevity. In
practice it will almost always be SERVFAIL, and the extended error
information is more informative anyhow.
This will have no effect in production until we configure Unbound to
enable EDE.
Fixes #6875.
---------
Co-authored-by: Matthew McPherrin <mattm@letsencrypt.org>
Occasionally (and just now) I've responded to an issue or thread that
involves this error message:
> The key authorization file from the server did not match this
challenge
"LoqXcYV8q5ONbJQxbmR7SCTNo3tiAXDfowyjxAjEuX0.9jg46WB3rR_AHD-EBXdN7cBkH1WOu0tA3M9fm21mqTI"
!= "\xef\xffAABBCC
and I've found myself looking at Boulder's source code, to check which
way around the values are. I suspect that users are not understanding it
either.
As a follow-up to https://github.com/letsencrypt/boulder/issues/5467, I
did an audit of all places where we call SelectOne to ensure that those
queries can never return more than one result. These four functions were
the only places that weren't already constrained to a single result
through the use of "SELECT COUNT", "LIMIT 1", "WHERE uniqueKey =", or
similar. Limit these functions' queries to always only return a single
result, now that their underlying tables no longer have unique key
constraints.
Additionally, slightly refactor selectRegistration to just take a single
column name rather than a whole WHERE clause.
Fixes https://github.com/letsencrypt/boulder/issues/6521
When the "integration" build tag is set, reduce the stdout prefix to
just a short timestamp, log level, and process name. The other details
(e.g. date, datacenter, and hostname) are not relevant in CI, and only
serve to clutter the logs.
Part of https://github.com/letsencrypt/boulder/issues/6890
Define a `bJSONWebSignature` struct which embeds a
`*jose.JSONWebSignature`. The only method that can produce a
`bJSONWebSignature` is `wfe.parseJWS` so that we can ensure
safety/sanity checks are performed on the incoming data. Restricts
several methods and functions to take a `jose.Header` as an input
parameter, rather than a full JWS.
Fixes https://github.com/letsencrypt/boulder/issues/5676.
In configs, opentelemetry -> openTelemetry
As pointed out in review of #6867, these should match the case of their
corresponding Go identifiers for consistency.
Go's JSON decoding matches keys to struct fields case-insensitively
(part of why we've got a fork in go-jose), so this change should have no
functional impact.
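A small demonstration of that `encoding/json` behavior. The `config` struct and its `endpoint` field are hypothetical, not Boulder's actual config types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// config is a hypothetical config struct with a camelCase JSON key.
type config struct {
	OpenTelemetry struct {
		Endpoint string `json:"endpoint"`
	} `json:"openTelemetry"`
}

// parse decodes data into a config using encoding/json, which prefers an
// exact key match but also accepts a case-insensitive one.
func parse(data string) (config, error) {
	var c config
	err := json.Unmarshal([]byte(data), &c)
	return c, err
}

func main() {
	oldStyle, _ := parse(`{"opentelemetry": {"endpoint": "bjaeger:4317"}}`)
	newStyle, _ := parse(`{"openTelemetry": {"endpoint": "bjaeger:4317"}}`)
	// Both spellings populate the same field.
	fmt.Println(oldStyle.OpenTelemetry.Endpoint == newStyle.OpenTelemetry.Endpoint) // true
}
```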
These external port bindings are not necessary, as the integration test
configs resolve the bjaeger container directly. In addition, these
external port bindings cause problems for rootless docker, so let's
remove them.
Previously if you passed `-h` or `-help` to a sub-sub-command of
admin-revoker it would error out with a red message and a stack trace
(in addition to printing help).
Now, it will print help and exit 1.
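One way to get this behavior with the standard `flag` package is to check for `flag.ErrHelp`. This is a sketch, not admin-revoker's code: `run`, the `serial` flag, and the exit code 2 for other parse errors are all assumptions.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
	"os"
)

// run parses args for a hypothetical sub-command and returns the exit code
// the process should use. A nil out discards all output.
func run(name string, args []string, out io.Writer) int {
	if out == nil {
		out = io.Discard
	}
	fs := flag.NewFlagSet(name, flag.ContinueOnError)
	fs.SetOutput(out)
	serial := fs.String("serial", "", "hypothetical flag: serial to act on")

	err := fs.Parse(args)
	if errors.Is(err, flag.ErrHelp) {
		// Parse has already printed usage; exit 1 with no stack trace.
		return 1
	}
	if err != nil {
		return 2 // assumption: other parse errors use a different code
	}
	fmt.Fprintf(out, "would act on serial %q\n", *serial)
	return 0
}

func main() {
	// A real command would pass os.Args and hand the result to os.Exit.
	code := run("revoke", []string{"-h"}, os.Stdout)
	fmt.Println("exit code:", code)
}
```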
Remove the remaining divergences from RFC8555 regarding what error types
we use in certain situations. Specifically:
- use "invalidContact" instead of "invalidEmail";
- use "unsupportedContact" for contact addresses that use a protocol
other than "mailto:"; and
- use "unsupportedIdentifier" for identifiers that specify a type other
than "dns".
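As an illustration of the three cases, here is a sketch that picks an RFC 8555 error type for a contact or identifier. The function names are hypothetical, and using `net/mail` to judge address validity is an assumption made for the example.

```go
package main

import (
	"fmt"
	"net/mail"
	"strings"
)

// ACME error types from RFC 8555, in the standard error namespace.
const (
	invalidContact        = "urn:ietf:params:acme:error:invalidContact"
	unsupportedContact    = "urn:ietf:params:acme:error:unsupportedContact"
	unsupportedIdentifier = "urn:ietf:params:acme:error:unsupportedIdentifier"
)

// contactProblem returns the error type for a contact URL, or "" if the
// contact is acceptable.
func contactProblem(contact string) string {
	if !strings.HasPrefix(contact, "mailto:") {
		return unsupportedContact
	}
	addr := strings.TrimPrefix(contact, "mailto:")
	if _, err := mail.ParseAddress(addr); err != nil {
		return invalidContact
	}
	return ""
}

// identifierProblem returns the error type for an identifier type, or ""
// if the type is supported.
func identifierProblem(idType string) string {
	if idType != "dns" {
		return unsupportedIdentifier
	}
	return ""
}

func main() {
	fmt.Println(contactProblem("tel:+15555550100"))
	fmt.Println(contactProblem("mailto:not-an-email"))
	fmt.Println(identifierProblem("ip"))
}
```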
Most boulder components have a command line flag to override what gRPC
and debug port they listen on, which is used in tests to run multiple
instances with the same configuration.
However, CA's flag is named "--ca-addr", and not "--addr". This is
inconsistent with SA, RA, VA, nonce, publisher, and purger.
This flag isn't used in production, where we set it in the config file,
so it shouldn't be a breaking change to rename it.
- Make config validation run by default for all Boulder components with
a registered validator.
- Refactor main to parse `boulder` flags directly instead of declaring
them as subcommands.
- Remove the `validate` subcommand and update relevant docs.
- Fix configuration validation for issuer (file source) OCSP responder.
Fixes #6857
Fixes #6763
This PR adds a new configuration block specifically for the otelhttp
instrumentation. This block is separate from the existing
"opentelemetry" configuration, and is only relevant when using otelhttp
instrumentation. It does not share any codepath with the existing
configuration, so it is at the top level to indicate which services it
applies to.
There's a bit of plumbing to pass the new configuration through. I've adopted the
measured_http package to also set up opentelemetry instead of just
metrics, which should hopefully allow any future changes to be smaller
(just config & there) and more consistent between the wfe2 and ocsp
responder.
There's one option here now, which disables setting
[otelhttp.WithPublicEndpoint](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp#WithPublicEndpoint).
This option is designed to do exactly what we want: Don't accept
incoming spans as parents of the new span created in the server.
Previously we had a setting to disable parent-based sampling to help
with this problem, which doesn't really make sense anymore, so let's
just remove it and simplify that setup path. The default of "false" is
designed to be the safe option. It's set to True in the test/ configs
for integration tests that use traces, and I expect we'll likely set it
true in production eventually once the LBs are configured to handle
tracing themselves.
Fixes #6851
Make minor, non-user-visible changes to how we structure the probs
package. Notably:
- Add new problem types for UnsupportedContact and
UnsupportedIdentifier, which are specified by RFC8555 and which we will
use in the future, but haven't been using historically.
- Sort the problem types and constructor functions to match the
(alphabetical) order given in RFC8555.
- Rename some of the constructor functions to better match their
underlying problem types (e.g. "TLSError" to just "TLS").
- Replace the redundant ProblemDetailsToStatusCode function with simply
always returning a 500 if we haven't properly set the problem's
HTTPStatus.
- Remove the ability to use either the V1 or V2 error namespace prefix;
always use the proper RFC namespace prefix.
Currently we set WaitForReady(true), which causes gRPC requests to not
fail immediately if no backends are available, but instead wait until
the timeout in case a backend does become available. The downside is
that this behavior masks true connection errors. We'd like to turn it
off.
Fixes #6834
This adds Jaeger's all-in-one dev container (with no persistent storage)
to boulder's dev docker-compose. It configures config-next/ to send all
traces there.
A new integration test creates an account and issues a cert, then
verifies the trace contains some set of expected spans.
This test found that async finalize broke spans, so I fixed that and a
few related spots where we make a new context.
This upgrades otel to v1.15.0, and the /contrib/ packages to v0.41.0.
Several other packages are upgraded as transitive dependencies, notably
grpc.
This contains a change to grpc: gRPC errors are now mapped into span
errors only if their code is Unknown, DeadlineExceeded, Unimplemented,
Internal, Unavailable, or DataLoss. This should be helpful for us, as we
use grpc errors semantically in Boulder, especially NotFound.
Update the document number to the latest version, and remove the /get/
prefix since it now supports both the GET and POST portions of the spec.
Also update one piece of tooling to properly get the ARI URL from the
directory, rather than hard-coding it.
We only ever set it to the same value, and then read it back in
make_client, so just hardcode it there instead.
It's a bit spooky-action-at-a-distance and is process-wide with no
synchronization, which means we can't safely use different values
anyway.
Replace inline connect string with a new one in test/vars (that points
to boulder_sa_integration).
Remove comments about interpolateParams=false being required; it is not.
Add clauses to getPrecertByName to ensure it follows its documented
constraints (return the latest one).
Follow-up on #6807. Fixes #6848.