For "ordinary" errors like "file not found" for some part of the config,
we would prefer to log an error and exit without logging about a panic
and printing a stack trace.
To achieve that, we want to call `defer AuditPanic()` once, at the top
of `cmd/boulder`'s main. That's so early that we haven't yet parsed the
config, which means we haven't yet initialized a logger. We compromise:
`AuditPanic` now calls `log.Get()`, which will retrieve the configured
logger if one has been set up, or will create a default one (which logs
to stderr/stdout).
AuditPanic and Fail/FailOnError now cooperate: Fail/FailOnError panic
with a special type, and AuditPanic checks for that type and prints a
simple message before exiting when it's present.
This PR also coincidentally fixes a bug: panicking didn't previously
cause the program to exit with nonzero status, because it recovered the
panic but then did not explicitly exit nonzero.
Fixes#6933
This PR adds a new configuration block specifically for the otelhttp
instrumentation. This block is separate from the existing
"opentelemetry" configuration, and is only relevant when using otelhttp
instrumentation. It does not share any codepath with the existing
configuration, so it is at the top level to indicate which services it
applies to.
There's a bit of plumbing new configuration through. I've adopted the
measured_http package to also set up opentelemetry instead of just
metrics, which should hopefully allow any future changes to be smaller
(just config & there) and more consistent between the wfe2 and ocsp
responder.
There's one option here now, which disables setting
[otelhttp.WithPublicEndpoint](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp#WithPublicEndpoint).
This option is designed to do exactly what we want: Don't accept
incoming spans as parents of the new span created in the server.
Previously we had a setting to disable parent-based sampling to help
with this problem, which doesn't really make sense anymore, so let's
just remove it and simplify that setup path. The default of "false" is
designed to be the safe option. It's set to True in the test/ configs
for integration tests that use traces, and I expect we'll likely set it
true in production eventually once the LBs are configured to handle
tracing themselves.
Fixes#6851
Export new prometheus metrics for the `notBefore` and `notAfter` fields
to track internal certificate validity periods when calling the `Load()`
method for a `*tls.Config`. Each metric is labeled with the `serial`
field.
```
tlsconfig_notafter_seconds{serial="2152072875247971686"} 1.664821961e+09
tlsconfig_notbefore_seconds{serial="2152072875247971686"} 1.664821960e+09
```
Fixes https://github.com/letsencrypt/boulder/issues/6829
Add a new shared config stanza which all boulder components can use to
configure their Open Telemetry tracing. This allows components to
specify where their traces should be sent, what their sampling ratio
should be, and whether or not they should respect their parent's
sampling decisions (so that web front-ends can ignore sampling info
coming from outside our infrastructure). It's likely we'll need to
evolve this configuration over time, but this is a good starting point.
Add basic Open Telemetry setup to our existing cmd.StatsAndLogging
helper, so that it gets initialized at the same time as our other
observability helpers. This sets certain default fields on all
traces/spans generated by the service. Currently these include the
service name, the service version, and information about the telemetry
SDK itself. In the future we'll likely augment this with information
about the host and process.
Finally, add instrumentation for the HTTP servers and grpc
clients/servers. This gives us a starting point of being able to monitor
Boulder, but is fairly minimal as this PR is already somewhat unwieldy:
It's really only enough to understand that everything is wired up
properly in the configuration. In subsequent work we'll enhance those
spans with more data, and add more spans for things not automatically
traced here.
Fixes https://github.com/letsencrypt/boulder/issues/6361
---------
Co-authored-by: Aaron Gable <aaron@aarongable.com>
Although #6771 significantly cleaned up how gRPC services stop and clean
up, it didn't make any changes to our HTTP servers or our non-server
(e.g. crl-updater, log-validator) processes. This change finishes the
work.
Add a new helper method cmd.WaitForSignal, which simply blocks until one
of the three signals we care about is received. This easily replaces all
calls to cmd.CatchSignals which passed `nil` as the callback argument,
with the added advantage that it doesn't call os.Exit() and therefore
allows deferred cleanup functions to execute. This new function is
intended to be the last line of main(), allowing the whole process to
exit once it returns.
Reimplement cmd.CatchSignals as a thin wrapper around cmd.WaitForSignal,
but with the added callback functionality. Also remove the os.Exit()
call from CatchSignals, so that the main goroutine is allowed to finish
whatever it's doing, call deferred functions, and exit naturally.
Update all of our non-gRPC binaries to use one of these two functions.
The vast majority use WaitForSignal, as they run their main processing
loop in a background goroutine. A few (particularly those that can run
either in run-once or in daemonized mode) still use CatchSignals, since
their primary processing happens directly on the main goroutine.
The changes to //test/load-generator are the most invasive, simply
because that binary needed to have a context plumbed into it for proper
cancellation, but it already had a custom struct type named "context"
which needed to be renamed to avoid shadowing.
Fixes https://github.com/letsencrypt/boulder/issues/6794
- Require `letsencrypt/validator` package.
- Add a framework for registering configuration structs and any custom
validators for each Boulder component at `init()` time.
- Add a `validate` subcommand which allows you to pass a `-component`
name and `-config` file path.
- Expose validation via exported utility functions
`cmd.LookupConfigValidator()`, `cmd.ValidateJSONConfig()` and
`cmd.ValidateYAMLConfig()`.
- Add unit test which validates all registered component configuration
structs against test configuration files.
Part of #6052
Remove tracing using Beeline from Boulder. The only remnant left behind
is the deprecated configuration, to ensure deployability.
We had previously planned to swap in OpenTelemetry in a single PR, but
that adds significant churn in a single change, so we're doing this as
multiple steps that will each be significantly easier to reason about
and review.
Part of #6361
We rely on the ratelimit/ package in CI to validate our ratelimit
configurations. However, because that package relies on cmd/ just for
cmd.ConfigDuration, many additional dependencies get pulled in.
This refactors just that struct to a separate config package. This was
done using Goland's automatic refactoring tooling, which also organized
a few imports while it was touching them, keeping standard library,
internal and external dependencies grouped.
Replace the cmd.ServiceConfig embed with just its components (i.e.
DebugAddr and sometimes TLS) in the WFE, crl-updater, ocsp-updater,
ocsp-responder, and expiration-mailer. These services are not gRPC
services, and therefore do not need the full suite of config keys
introduced by cmd.ServiceConfig.
Blocks #6674
Part of #6052
Assign nonce prefixes for each nonce-service by taking the first eight
characters of the the base64url encoded HMAC-SHA256 hash of the RPC
listening address using a provided key. The provided key must be same
across all boulder-wfe and nonce-service instances.
- Add a custom `grpc-go` load balancer implementation (`nonce`) which
can route nonce redemption RPC messages by matching the prefix to the
derived prefix of the nonce-service instance which created it.
- Modify the RPC client constructor to allow the operator to override
the default load balancer implementation (`round_robin`).
- Modify the `srv` RPC resolver to accept a comma separated list of
targets to be resolved.
- Remove unused nonce-service `-prefix` flag.
Fixes#6404
Add go1.20 as a new version to run tests on, and to build release
artifacts from. Fix one test which was failing because it was
accidentally relying on consistent (i.e. unseeded) non-cryptographic
random number generation, which go1.20 now automatically seeds at import
time.
Update the version of golangci-lint used in our docker containers to the
new version that has go1.20 support. Remove a number of nolint comments
that were required due to an old version of the gosec linter.
In the WFE, ocsp-responder, and crl-updater, switch from using
StorageAuthorityClients to StorageAuthorityReadOnlyClients. This ensures
that these services cannot call methods which write to our database.
Fixes#6454
Remove the need for clients to explicitly call bgrpc.NewClientMetrics,
by moving that call inside bgrpc.ClientSetup. In case ClientSetup is
called multiple times, use the recommended method to gracefully recover
from registering duplicate metrics. This makes gRPC client setup much
more similar to gRPC server setup after the previous server refactoring
change landed.
Enable the "unparam" linter, which checks for unused function
parameters, unused function return values, and parameters and
return values that always have the same value every time they
are used.
In addition, fix many instances where the unparam linter complains
about our existing codebase. Remove error return values from a
number of functions that never return an error, remove or use
context and test parameters that were previously unused, and
simplify a number of (mostly test-only) functions that always take the
same value for their parameter. Most notably, remove the ability to
customize the RSA Public Exponent from the ceremony tooling,
since it should always be 65537 anyway.
Fixes#6104
Run the Boulder unit and integration tests with go1.19.
In addition, make a few small changes to allow both sets of
tests to run side-by-side. Mark a few tests, including our lints
and generate checks, as go1.18-only. Reformat a few doc
comments, particularly lists, to abide by go1.19's stricter gofmt.
Causes #6275
The iotuil package has been deprecated since go1.16; the various
functions it provided now exist in the os and io packages. Replace all
instances of ioutil with either io or os, as appropriate.
Enforce that authorizationLifetimeDays and
pendingAuthorizationLifetimeDays values are configured and set
to a value compliant with the v1.8.1 of the CA/B Forum Baseline
Requirements.
Fixes#5923
Add `stylecheck` to our list of lints, since it got separated out from
`staticcheck`. Fix the way we configure both to be clearer and not
rely on regexes.
Additionally fix a number of easy-to-change `staticcheck` and
`stylecheck` violations, allowing us to reduce our number of ignored
checks.
Part of #5681
This means that we don't have to test whether it is nil before
filling it in with the old deprecated values; we can just assume
that it exists whether or not the json config has a matching stanza.
All handling of uninitialized values (i.e. empty strings for key files)
is handled in `goodkey.NewKeyPolicy`.
Followup from #5839.
I chose groupcache/lru as our LRU cache implementation because it's part
of the golang org, written by one of the Go authors, and very simple
and easy to read.
This adds an `AccountGetter` interface that is implemented by both the
AccountCache and the SA. If the WFE config includes an AccountCache field,
it will wrap the SA in an AccountCache with the configured max size and
expiration time.
We set an expiration time on account cache entries because we want a
bounded amount of time that they may be stale by. This will be used in
conjunction with a delay on account-updating pathways to ensure we don't
allow authentication with a deactivated account or changed key.
The account cache stores corepb.Registration objects because protobufs
have an established way to do a deep copy. Deep copies are important so
the cache can maintain its own internal state and ensure nothing external
is modifying it.
As part of this process I changed construction of the WFE. Previously,
"SA" and "RA" were public fields that were mutated after construction. Now
they are parameters to the constructor, along with the new "accountGetter"
parameter.
The cache includes stats for requests categorized by hits and misses.
Add a new check to GoodKey which attempts to factor the public modulus
of the presented key using Fermat's factorization method. This method
will succeed if and only if the prime factors are very close to each
other -- i.e. almost certainly were not selected independently from a
random uniform distribution, but were instead calculated via some other
less secure method.
To support this new feature, add a new config flag to the RA, CA, and
WFE, which all use the GoodKey checks. As part of adding this new config
value, refactor the GoodKey config items into their own config struct
which can be re-used across all services.
If the new `FermatRounds` config value has not been set, it will default
to zero, causing no factorization to be attempted.
Fixes#5850
Part of #5851
These hashes are useful for OCSP computations, as they are the two
values that are used to uniquely identify the issuer of the given cert in
an OCSP request. Here, they are restricted to SHA1 only, as Boulder
only supports SHA1 for OCSP, as per RFC 5019.
In addition, because the `ID`, `NameID`, `NameHash`, and `KeyHash`
are relatively expensive to compute, introduce a new constructor for
`issuance.Certificate` that computes all four values at startup time and
then simply returns the precomputed values when asked.
The resulting `boulder` binary can be invoked by different names to
trigger the behavior of the relevant subcommand. For instance, symlinking
and invoking as `boulder-ca` acts as the CA. Symlinking and invoking as
`boulder-va` acts as the VA.
This reduces the .deb file size from about 200MB to about 20MB.
This works by creating a registry that maps subcommand names to `main`
functions. Each subcommand registers itself in an `init()` function. The
monolithic `boulder` binary then checks what name it was invoked with
(`os.Args[0]`), looks it up in the registry, and invokes the appropriate
`main`. To avoid conflicts, all of the old `package main` are replaced
with `package notmain`.
To get the list of registered subcommands, run `boulder --list`. This
is used when symlinking all the variants into place, to ensure the set
of symlinked names matches the entries in the registry.
Fixes#5692
Remove the last of the gRPC wrapper files. In order to do so:
- Remove the `core.StorageGetter` interface. Replace it with a new
interface (whose methods include the `...grpc.CallOption` arg)
inside the `sa/proto/` package.
- Remove the `core.StorageAdder` interface. There's no real use-case
for having a write-only interface.
- Remove the `core.StorageAuthority` interface, as it is now redundant
with the autogenerated `sapb.StorageAuthorityClient` interface.
- Replace the `certificateStorage` interface (which appears in two
different places) with a single unified interface also in `sa/proto/`.
- Update all test mocks to include the `_ ...grpc.CallOption` arg in
their method signatures so they match the gRPC client interface.
- Delete many methods from mocks which are no longer necessary (mostly
because they're mocking old authz1 methods that no longer exist).
- Move the two `test/inmem/` wrappers into their own sub-packages to
avoid an import cycle.
- Simplify the `satest` package to satisfy one of its TODOs and to
avoid an import cycle.
- Add many methods to the `test/inmem/sa/` wrapper, to accommodate all
of the methods which are called in unittests.
Fixes#5600
Pull the "was the gRPC error a Canceled error" checking code out into a
separate interceptor, and add that interceptor only in the wfe and wfe2
gRPC clients.
Although the vast majority of our cancelations come from the HTTP client
disconnecting (and that cancelation being propagated through our gRPC
stack), there are a few other situations in which we cancel gRPC
connections, including when we receive a quorum of responses from VAs
and no longer need responses from the remaining remote VA(s). This
change ensures that we do not treat those other kinds of cancelations in
the same way that we treat client-initiated cancelations.
Fixes#5444
Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).
Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.
Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
loadChain is an unexported utility function recently added to
boulder-wfe to support the loading and validating of PEM files that
represent a certificate chain
This change moves the core loadChain functionality out of boulder-wfe to
a new exported LoadChain function in the Issuance package. All
boulder-wfe unit tests have been preserved and most of them have been
pared down and added to the Issuance package as well.
Blocks #1669Fixes#5270
This change simplifies and hardens the wfe2's support for having
multiple issuers, and multiple chains for each issuer, configured
and loaded in memory.
The only config-visible change is replacing the old two separate config
values (`certificateChains` and `alternateCertificateChains`) with a
single value (`chains`). This new value does not require the user to
know and hand-code the AIA URLs at which the certificates are available;
instead the chains are simply presented as lists of files. If this new
config value is present, the old config values will be ignored; if it
is not, the old config values will be respected.
Behind the scenes, the chain loading code has been completely changed.
Instead of loading PEM bytes directly from the file, and then asserting
various things (line endings, no trailing bits, etc) about those bytes,
we now parse a certificate from the file, and in-memory recreate the
PEM from that certificate. This approach allows the file loading to be
much more forgiving, while also being stricter: we now check that each
certificate in the chain is correctly signed by the next cert, and that
the last cert in the chain is a self-signed root.
Within the WFE itself, most of the internal structure has been retained.
However, both the internal `issuerCertificates` (used for checking
that certs we are asked to revoke were in fact issued by us) and the
`certificateChains` (used to append chains to end-entity certs when
served to clients) have been updated to be maps keyed by IssuerNameID.
This allows revocation checking to not have to iterate through the
whole list of issuers, and also makes it easy to double-check that
the signatures on end-entity certs are valid before serving them. Actual
checking of the validity will come in a follow-up change, due to the
invasive nature of the necessary test changes.
Fixes#5164
The /issuer-cert endpoint was a holdover from the v1 API, where
it is a critical part of the issuance flow. In the v2 issuance
flow, the issuer certificate is provided directly in the response
for the certificate itself. Thus, this endpoint is redundant.
Stats show that it receives approximately zero traffic (less than
one request per week, all of which are now coming from wget or
browser useragents). It also complicates the refactoring necessary
for the v2 API to support multiple issuers.
As such, it is a safe and easy decision to remove it.
Fixes#5196
Having both of these very similar methods sitting around
only serves to increase confusion. This removes the last
few places which use `cmd.LoadCert` and replaces them
with `core.LoadCert`, and deletes the method itself.
Fixes#5163
This commit consists of three classes of changes:
1) Changing various command main.go files to always behave as they
would have when features.BlockedKeyTable was true. Also changing
one test in the same manner.
2) Removing the BlockedKeyTable flag from configuration in config-next,
because the flag is already live.
3) Moving the BlockedKeyTable flag to the "deprecated" section of
features.go, and regenerating featureflag_strings.go.
A future change will remove the BlockedKeyTable flag (and other
similarly deprecated flags) from features.go entirely.
Fixes#4873
log.Logger, the wrapper type that http.Server.ErrorLog uses will append
a newline to every line before calling Write on the inner logger if the
line doesn't already contain one. This breaks our checksum generation/
verification code because syslog will strip newlines. So that we don't
generate irreproducible checksums we strip the newline that log.Logger
added.
Fixes#4812
In theory we should only receive well-behaved requests, but just in case
there are network issues, this may keep us from waiting forever on a
dead connection.
Also, set the ErrorLog field of our http.Servers so we can collect logs for
unusual problems.
Since Boulder's log system adds checksums to lines, but log-validator
processes entries on a per-line basis, including newlines in log
messages can cause a validation failure.
Closes#4567.
Enabled in `config-next`.
This PR cross-signs the existing issuers (`test-ca-cross.pem`, `test-ca2-cross.pem`) with a new root (`test-root2.key`, `test-root2.pem` = *c2ckling cryptogr2pher f2ke ROOT*).
The cross-signed issuers are referenced in wfe2's configuration, beside the existing `certificateChains` key:
```json
"certificateChains": {
"http://boulder:4430/acme/issuer-cert": [ "test/test-ca2.pem" ],
"http://127.0.0.1:4000/acme/issuer-cert": [ "test/test-ca2.pem" ]
},
"alternateCertificateChains": {
"http://boulder:4430/acme/issuer-cert": [ "test/test-ca2-cross.pem" ],
"http://127.0.0.1:4000/acme/issuer-cert": [ "test/test-ca2-cross.pem" ]
},
```
When this key is populated, the WFE will send links for all alternate certificate chains available for the current end-entity certificate (except for the chain sent in the current response):
Link: <http://localhost:4001/acme/cert/ff5d3d84e777fc91ae3afb7cbc1d2c7735e0/1>;rel="alternate"
For backwards-compatibility, not specifying a chain is the same as specifying `0`: `/acme/cert/{serial} == /acme/cert/{serial}/0` and `0` always refers to the default certificate chain for that issuer (i.e. the value of `certificateChains[aiaIssuerURL]`).
This builds on the work @sh7dm started in #4600. I primarily did some
refactoring, added enforcement of the stale check for authorizations and
challenges, and completed the unit test coverage.
A new Boulder-specific (e.g. not specified by ACME / RFC 8555) API is added for
fetching order, authorization, challenge, and certificate resources by URL
without using POST-as-GET. Since we intend this API to only be used by humans
for debugging and we want to ensure ACME client devs use the standards compliant
method we restrict the GET API to only allowing access to "stale" resources
where the required staleness is defined by the WFE2 "staleTimeout"
configuration value (set to 5m in dev/CI).
Since authorizations don't have a creation date tracked we add
a `authorizationLifetimeDays` and `pendingAuthorizationLifetimeDays`
configuration parameter to the WFE2 that matches the RA's configuration. These
values are subtracted from the authorization expiry to find the creation date to
enforce the staleness check for authz/challenge GETs.
One other note: Resources accessed via the GET API will have Link relation URLs
pointing to the standard ACME API URL. E.g. a GET to a stale challenge will have
a response header with a link "up" relation URL pointing at the POST-as-GET URL
for the associated authorization. I wanted to avoid complicating
`prepAuthorizationForDisplay` and `prepChallengeForDisplay` to be aware of the
GET API and update or exclude the Link relations. This seems like a fine
trade-off since we don't expect machine consumption of the GET API results
(these are for human debugging).
Replaces #4600Resolves#4577
In a handful of places I've nuked old stats which are not used in any alerts or dashboards as they either duplicate other stats or don't provide much insight/have never actually been used. If we feel like we need them again in the future it's trivial to add them back.
There aren't many dashboards that rely on old statsd style metrics, but a few will need to be updated when this change is deployed. There are also a few cases where prometheus labels have been changed from camel to snake case, dashboards that use these will also need to be updated. As far as I can tell no alerts are impacted by this change.
Fixes#4591.