This partially reverts https://github.com/letsencrypt/boulder/pull/8203,
which was landed as commit dea81c7381.
It leaves all of the boulder integration test environment changes in
place, while restoring the DNSAllowLoopbackAddresses config key and its
ability to influence the VA's behavior.
We no longer need a code path to resolve reserved IP addresses during
integration tests.
Move to a public IP for the remaining tests, after #8187 did so for many
of them.
Depends on #8187
Move `IsReservedIP` and its supporting vars from `bdns` to `policy`.
Rewrite `IsReservedIP` to:
* Use `netip` because `netip.Prefix` can be used as a map key, allowing
us to define prefix lists more elegantly. This will enable future work
to import prefix lists from IANA's primary source data.
* Return an error including the reserved network's name.
Refactor `IsReservedIP` tests to be table-based.
Fixes#8040
Plumb the userAgent field, used to set http-01 User-Agent headers, from
va/rva configuration through to where User-Agent headers can be set for
DoH queries. Use integration tests to validate that the User-Agent is
set for http-01 challenges, dns-01 challenges over DoH, and CAA checks
over DoH.
Fixes#7963.
Add an `identifier` field to the `va.PerformValidationRequest` proto, which will soon replace its `dnsName` field.
Accept and prefer the `identifier` field in every VA function that uses this struct. Don't (yet) assume it will be present.
Throughout the VA, accept and handle the IP address identifier type. Handling is similar to DNS names, except that `getAddrs` is not called, and consider that:
- IPs are represented in a different field in the `x509.Certificate` struct.
- IPs must be presented as reverse DNS (`.arpa`) names in SNI for [TLS-ALPN-01 challenge requests](https://datatracker.ietf.org/doc/html/rfc8738#name-tls-with-application-layer-).
- IPv6 addresses are enclosed in square brackets when composing or parsing URLs.
For HTTP-01 challenges, accept redirects to bare IP addresses, which were previously rejected.
Fixes#2706
Part of #7311
Replace DCV and CAA checks (PerformValidation and IsCAAValid) in
va/va.go and va/caa.go with their MPIC compliant counterparts (DoDCV and
DoCAA) in va/vampic.go. Deprecate EnforceMultiCAA and EnforceMPIC and
default code paths as though they are both true. Require that RIR and
Perspective be set for primary and remote VAs.
Fixes#7965Fixes#7819
- Ensure the Perspective and RIR reported by each remoteVA in the
*vapb.ValidationResult returned by VA.PerformValidation, matches the
expected local configuration when that configuration is present.
- Correct "AfriNIC" to "AFRINIC", everywhere.
Part of https://github.com/letsencrypt/boulder/issues/7819
- Make the primary VA aware of the expected Perspective and RIR of each
remote VA.
- All Perspectives should be unique, have the primary VA check for
duplicate Perspectives at startup.
- Update test setup functions to ensure that each remote VA client and
corresponding inmem impl have a matching perspective and RIR.
Part of #7819
Previously this was a configuration field.
Ports `maxAllowedFailures()` from `determineMaxAllowedFailures()` in
#7794.
Test updates:
Remove the `maxRemoteFailures` param from `setup` in all VA tests.
Some tests were depending on setting this param directly to provoke
failures.
For example, `TestMultiVAEarlyReturn` previously relied on "zero allowed
failures". Since the number of allowed failures is now 1 for the number
of remote VAs we were testing (2), the VA wasn't returning early with an
error; it was succeeding! To fix that, make sure there are two failures.
Since two failures from two RVAs wouldn't exercise the right situation,
add a third RVA, so we get two failures from three RVAs.
Similarly, TestMultiCAARechecking had several test cases that omitted
this field, effectively setting it to zero allowed failures. I updated
the "1 RVA failure" test case to expect overall success and added a "2
RVA failures" test case to expect overall failure (we previously
expected overall failure from a single RVA failing).
In TestMultiVA I had to change a test for `len(lines) != 1` to
`len(lines) == 0`, because with more backends we were now logging more
errors, and finding e.g. `len(lines)` to be 2.
- Add `Perspective` and `RIR` fields to the remote-va configuration
- Configure RVA ValidationAuthorityImpl instances with the contents of
the JSON configuration
- Configure VA ValidationAuthorityImpl instances with the constant
`va.PrimaryPerspective`
- Log `Perspective` for non-Primary Perspectives, per the MPIC
requirements in section 5.4.1 (2) vii of the BRs. Also log the RIR for
posterity.
- Introduce `ValidationResult` RPC fields `Perspective` and `Rir`, which
are not currently used but will be required for corroboration in #7616
Fixes https://github.com/letsencrypt/boulder/issues/7613
Part of https://github.com/letsencrypt/boulder/issues/7615
Part of https://github.com/letsencrypt/boulder/issues/7616
* Adds a new `remoteva` binary that takes a distinct configuration from
the existing `boulder-va`
* Removed the `boulder-remoteva` name registration from `boulder-va`.
* Existing users of `boulder-remoteva` must either
1. laterally migrate to `boulder-va` which uses that same config, or
2. switch to using `remoteva` with a new config.
Part of https://github.com/letsencrypt/boulder/issues/5294
We run the RVAs in AWS, where we don't have all the same service
discovery infrastructure we do for the primary VAs and the rest of
Boulder. The solution for populating SRV records we have today hasn't
been reliable, so we'd like to experiment with bringing up RVAs paired
1:1 with a local DNS resolver. This brings back some of the previous
static DNS resolver configuration, though it's not a clean revert
because other configuration has changed in the meantime
Previously, `va.IsCAAValid` would only check CAA records from the
primary VA during initial domain control validation, completely ignoring
any configured RVAs. The upcoming
[MPIC](https://github.com/ryancdickson/staging/pull/8) ballot will
require that it be done from multiple perspectives. With the currently
deployed [Multi-Perspective
Validation](https://letsencrypt.org/2020/02/19/multi-perspective-validation.html)
in staging and production, this change brings us in line with the
[proposed phase
3](https://github.com/ryancdickson/staging/pull/8/files#r1368708684).
This change reuses the existing
[MaxRemoteValidationFailures](21fc191273/cmd/boulder-va/main.go (L35))
variable for the required non-corroboration quorum.
> Phase 3: June 15, 2025 - December 14, 2025 ("CAs MUST implement MPIC
in blocking mode*"):
>
> MUST implement MPIC? Yes
> Required quorum?: Minimally, 2 remote perspectives must be used. If
using less than 6 remote perspectives, 1 non-corroboration is allowed.
If using 6 or more remote perspectives, 2 non-corroborations are
allowed.
> MUST block issuance if quorum is not met: Yes.
> Geographic diversity requirements?: Perspectives must be 500km from 1)
the primary perspective and 2) all other perspectives used in the
quorum.
>
> * Note: "Blocking Mode" is a nickname. As opposed to "monitoring mode"
(described in the last milestone), CAs MUST NOT issue a certificate if
quorum requirements are not met from this point forward.
Adds new VA feature flags:
* `EnforceMultiCAA` instructs a primary VA to command each of its
configured RVAs to perform a CAA recheck.
* `MultiCAAFullResults` causes the primary VA to block waiting for all
RVA CAA recheck results to arrive.
Renamed `va.logRemoteValidationDifferentials` to
`va.logRemoteDifferentials` because it can handle initial domain control
validations and CAA rechecking with minimal editing.
Part of https://github.com/letsencrypt/boulder/issues/7061
Replace the current three-piece setup (enum of feature variables, map of
feature vars to default values, and autogenerated bidirectional maps of
feature variables to and from strings) with a much simpler one-piece
setup: a single struct with one boolean-typed field per feature. This
preserves the overall structure of the package -- a single global
feature set protected by a mutex, and Set, Reset, and Enabled methods --
although the exact function signatures have all changed somewhat.
The executable config format remains the same, so no deployment changes
are necessary. This change does deprecate the AllowUnrecognizedFeatures
feature, as we cannot tell the json config parser to ignore unknown
field names, but that flag is set to False in all of our deployment
environments already.
Fixes https://github.com/letsencrypt/boulder/issues/6802
Fixes https://github.com/letsencrypt/boulder/issues/5229
For "ordinary" errors like "file not found" for some part of the config,
we would prefer to log an error and exit without logging about a panic
and printing a stack trace.
To achieve that, we want to call `defer AuditPanic()` once, at the top
of `cmd/boulder`'s main. That's so early that we haven't yet parsed the
config, which means we haven't yet initialized a logger. We compromise:
`AuditPanic` now calls `log.Get()`, which will retrieve the configured
logger if one has been set up, or will create a default one (which logs
to stderr/stdout).
AuditPanic and Fail/FailOnError now cooperate: Fail/FailOnError panic
with a special type, and AuditPanic checks for that type and prints a
simple message before exiting when it's present.
This PR also coincidentally fixes a bug: panicking didn't previously
cause the program to exit with nonzero status, because it recovered the
panic but then did not explicitly exit nonzero.
Fixes#6933
Add the necessary scaffolding for deep health checking of our various
gRPC components. Each component implementation that also implements the
grpc.checker interface will be checked periodically, and the health
status of the component will be updated accordingly.
Add the necessary methods to SA to implement the grpc.checker interface
and register these new health checks with Consul.
Additionally:
- Update entry point script to check for ProxySQL readiness.
- Increase the poll rate for gRPC Consul checks from 5s to 2s to help
with DNS failures, due to check failures, on startup.
- Change log level for Consul from INFO to ERROR to deal with noisy logs
full of transport failures due to Consul gRPC checks firing before the
SAs are up.
Fixes#6878
Part of #6795
Export new prometheus metrics for the `notBefore` and `notAfter` fields
to track internal certificate validity periods when calling the `Load()`
method for a `*tls.Config`. Each metric is labeled with the `serial`
field.
```
tlsconfig_notafter_seconds{serial="2152072875247971686"} 1.664821961e+09
tlsconfig_notbefore_seconds{serial="2152072875247971686"} 1.664821960e+09
```
Fixes https://github.com/letsencrypt/boulder/issues/6829
Add a new shared config stanza which all boulder components can use to
configure their Open Telemetry tracing. This allows components to
specify where their traces should be sent, what their sampling ratio
should be, and whether or not they should respect their parent's
sampling decisions (so that web front-ends can ignore sampling info
coming from outside our infrastructure). It's likely we'll need to
evolve this configuration over time, but this is a good starting point.
Add basic Open Telemetry setup to our existing cmd.StatsAndLogging
helper, so that it gets initialized at the same time as our other
observability helpers. This sets certain default fields on all
traces/spans generated by the service. Currently these include the
service name, the service version, and information about the telemetry
SDK itself. In the future we'll likely augment this with information
about the host and process.
Finally, add instrumentation for the HTTP servers and grpc
clients/servers. This gives us a starting point of being able to monitor
Boulder, but is fairly minimal as this PR is already somewhat unwieldy:
It's really only enough to understand that everything is wired up
properly in the configuration. In subsequent work we'll enhance those
spans with more data, and add more spans for things not automatically
traced here.
Fixes https://github.com/letsencrypt/boulder/issues/6361
---------
Co-authored-by: Aaron Gable <aaron@aarongable.com>
The CA, RA, and VA have multiple goroutines running alongside primary
gRPC handling goroutine. These ancillary goroutines should be gracefully
shut down when the process is about to exit. Historically, we have
handled this by putting a call to each of these goroutine's shutdown
function inside cmd.CatchSignals, so that when a SIGINT is received, all
of the various cleanup routines happen in sequence.
But there's a cleaner way to do it: just use defer! All of these
cleanups need to happen after the primary gRPC server has fully shut
down, so that we know they stick around at least as long as the service
is handling gRPC requests. And when the service receives a SIGINT,
cmd.CatchSignals will call the gRPC server's GracefulStop, which will
cause the server's .Serve() to finally exit, which will cause start() to
exit, which will cause main() to exit, which will cause all deferred
functions to be run.
In addition, remove filterShutdownErrors as the bug which made it
necessary (.Serve() returning an error even when GracefulShutdown() is
called) was fixed back in 2017. This allows us to call the start()
function in a much more natural way, simply logging any error it returns
instead of calling os.Exit(1) if it returns an error.
This allows us to simplify the exit-handling code in these three
services' main() functions, and lets us be a bit more idiomatic with our
deferred cleanup functions.
Part of #6794
- Require `letsencrypt/validator` package.
- Add a framework for registering configuration structs and any custom
validators for each Boulder component at `init()` time.
- Add a `validate` subcommand which allows you to pass a `-component`
name and `-config` file path.
- Expose validation via exported utility functions
`cmd.LookupConfigValidator()`, `cmd.ValidateJSONConfig()` and
`cmd.ValidateYAMLConfig()`.
- Add unit test which validates all registered component configuration
structs against test configuration files.
Part of #6052
Remove tracing using Beeline from Boulder. The only remnant left behind
is the deprecated configuration, to ensure deployability.
We had previously planned to swap in OpenTelemetry in a single PR, but
that adds significant churn in a single change, so we're doing this as
multiple steps that will each be significantly easier to reason about
and review.
Part of #6361
Remove the PortConfig field from both the VA's config struct and from
the NewValidationAuthorityImpl constructor. This config item is no
longer used anywhere, and removing this prevents us from accidentally
overriding the "Authorized Ports" (80 and 443) which are required by the
Baseline Requirements.
Unit tests are still able to override the httpPort and tlsPort fields of
the ValidationAuthorityImpl.
Fixes#3940
Turn bgrpc.NewServer into a builder-pattern, with a config-based
initialization, multiple calls to Add to add new gRPC services, and a
final call to Build to produce the start() and stop() functions which
control server behavior. All calls are chainable to produce compact code
in each component's main() function.
This improves the process of creating a new gRPC server in three ways:
1) It avoids the need for generics/templating, which was slightly
verbose.
2) It allows the set of services to be registered on this server to be
known ahead of time.
3) It greatly streamlines adding multiple services to the same server,
which we use today in the VA and will be using soon in the SA and CA.
While we're here, add a new per-service config stanza to the
GRPCServerConfig, so that individual services on the same server can
have their own configuration. For now, only provide a "ClientNames" key,
which will be used in a follow-up PR.
Part of #6454
Remove the need for clients to explicitly call bgrpc.NewClientMetrics,
by moving that call inside bgrpc.ClientSetup. In case ClientSetup is
called multiple times, use the recommended method to gracefully recover
from registering duplicate metrics. This makes gRPC client setup much
more similar to gRPC server setup after the previous server refactoring
change landed.
Collapse most of our boilerplate gRPC creation steps (in particular,
creating default metrics, making the server and listener, registering
the server, creating and registering the health service, filtering
shutdown errors from the output, and gracefully stopping) into a single
function in the existing bgrpc package. This allows all but one of our
server main functions to drop their calls to NewServer and
NewServerMetrics.
To enable this, create a new helper type and method in the bgrpc
package. Conceptually, this could be just a new function, but it must be
attached to a new type so that it can be generic over the type of gRPC
server being created. (Unfortunately, the grpc.RegisterFooServer methods
do not accept an interface type for their second argument).
The only main function which is not updated is the boulder-va, which is
a special case because it creates multiple gRPC servers but (unlike the
CA) serves them all on the same port with the same server and listener.
Part of #6452
The resulting `boulder` binary can be invoked by different names to
trigger the behavior of the relevant subcommand. For instance, symlinking
and invoking as `boulder-ca` acts as the CA. Symlinking and invoking as
`boulder-va` acts as the VA.
This reduces the .deb file size from about 200MB to about 20MB.
This works by creating a registry that maps subcommand names to `main`
functions. Each subcommand registers itself in an `init()` function. The
monolithic `boulder` binary then checks what name it was invoked with
(`os.Args[0]`), looks it up in the registry, and invokes the appropriate
`main`. To avoid conflicts, all of the old `package main` are replaced
with `package notmain`.
To get the list of registered subcommands, run `boulder --list`. This
is used when symlinking all the variants into place, to ensure the set
of symlinked names matches the entries in the registry.
Fixes#5692
- Add function `validateServerAddress()` to `bdns/servers.go` which ensures that
DNS server addresses are TCP/ UDP dial-able per: https://golang.org/src/net/dial.go?#L281
- Add unit test for `validateServerAddress()` in `bdns/servers_test.go`
- Update `cmd/boulder-va/main.go` to handle `bdns.NewStaticProvider()`
potentially returning an error.
- Update unit tests in `bdns/dns_test.go`:
- Handle `bdns.NewStaticProvider()` potentially returning an error
- Add an IPv6 address to `TestRotateServerOnErr`
- Ensure DNS server addresses are validated by `validateServerAddress` whenever:
- `dynamicProvider.update() is called`
- `staticProvider` is constructed
- Construct server addresses using `net.JoinHostPost()` when
`dynamicProvider.Addrs()` is called
Fixes#5463
Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).
Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.
Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
Abstract out the way that the bdns library keeps track of the
resolvers it uses to do DNS lookups. Create one implementation,
the `StaticProvider`, which behaves exactly the same as the old
mechanism (providing whatever names or addresses were given
in the config). Create another implementation, `DynamicProvider`,
which re-resolves the provided name on a regular basis.
The dynamic provider consumes a single name, does a lookup
on that name for any SRV records suggesting that it is running a
DNS service, and then looks up A records to get the address of
all the names returned by the SRV query. It exports its successes
and failures as a prometheus metric.
Finally, update the tests and config-next configs to work with
this new mechanism. Give sd-test-srv the capability to respond
to SRV queries, and put the names it provides into docker's
default DNS resolver.
Fixes#5306
This field is not used by any production configs, so we can safely
remove it.
Also, add config fields for DNSTimeout and DNSAllowLoopbackAddress
outside of the Common sub-struct, to allow for its removal later.
Part of #5242
Delete the ValidationAuthorityGRPCServer and ...GRPCClient structs,
and update references to instead reference the underlying vapb.VAClient
type directly. Also delete the core.ValidationAuthority interface.
Does not require updating interfaces elsewhere, as the client
wrapper already included the variadic grpc.CallOption parameter.
Fixes#5325
[Go style says](https://blog.golang.org/package-names):
> Avoid stutter. Since client code uses the package name as a prefix
> when referring to the package contents, the names for those contents
> need not repeat the package name. The HTTP server provided by the
> http package is called Server, not HTTPServer. Client code refers to
> this type as http.Server, so there is no ambiguity.
Rename DNSClient, DNSClientImpl, NewDNSClientImpl,
NewTestDNSClientImpl, DNSError, and MockDNSClient to follow those
guidelines.
Unexport DNSClientImpl and MockTimeoutError (was only used internally).
Make New and NewTest return the Client interface rather than a concrete
`impl` type.
This health service implements the gRPC Health Checking
Protocol, as defined in
https://github.com/grpc/grpc/blob/master/doc/health-checking.md
and as implemented by the gRPC authors in
https://pkg.go.dev/google.golang.org/grpc/health@v1.29.0
It simply instantiates a health service, and attaches it to the same
gRPC server that is handling requests to the primary (e.g. CA) service.
When the main service would be shut down (e.g. because it caught a
signal), it also sets the status of the service to NOT_SERVING.
This change also imports the health client into our grpc client,
ensuring that all of our grpc clients use the health service to inform
their load-balancing behavior.
This will be used to replace our current usage of polling the debug
port to determine whether a given service is up and running. It may
also be useful for more comprehensive checks and blackbox probing
in the future.
Part of #5074
Updates the type of the ValidationAuthority's PerformValidation
method to be identical to that of the corresponding auto-generated
grpc method, i.e. directly taking and returning proto message
types, rather than exploded arguments.
This allows all logic to be removed from the VA wrappers, which
will allow them to be fully removed after the migration to proto3.
Also updates all tests and VA clients to adopt the new interface.
Depends on #4983 (do not review first four commits)
Part of #4956
We previously used mixed case names for proto imports
(e.g. both `caPB` and `rapb`), sometimes in the same file.
This change standardizes on the all-lowercase spelling,
which was predominant throughout the codebase.
We've found we need the context offered from logging the error closer to when it
happens in the `bdns` package rather than in the `va`. Adopting the function
requires adapting it slightly. Specifically in the new location we know it won't
be called with any timeout results, with a non-dns error, or with a nil
underlying error.
Having the logging done in `bdns` (and specifically from `exchangeOne`) also
lets us log the wire format of the query and response when we get a `dns.ErrId`
error indicating a query/response ID mismatch. A small unit test is included
that ensures the logging happens as expected.
In case it proves useful for matching against other metrics the DNS ID mismatch
error case also now increments a dedicated prometheus counter vector stat,
`dns_id_mismatch`. The stat is labelled by resolver and query type.
Resolves https://github.com/letsencrypt/boulder/issues/4532