Commit Graph

1509 Commits

Author SHA1 Message Date
Phil Porada c7dc3a8d72
Test against go1.20.6 (#6987)
This version includes a fix that seems relevant to us:

> The HTTP/1 client did not fully validate the contents of the Host
header. A maliciously crafted Host header could inject additional
headers or entire requests. The HTTP/1 client now refuses to send
requests containing an invalid Request.Host or Request.URL.Host value.
> 
> Thanks to Bartek Nowotarski for reporting this issue.
> 
> Includes security fixes for CVE-2023-29406 and Go issue
https://go.dev/issue/60374
2023-07-11 12:50:42 -07:00
Phil Porada 947e199016
Add govulncheck to CI (#6963)
Fixes https://github.com/letsencrypt/boulder/issues/6354

Runs
[govulncheck](https://pkg.go.dev/golang.org/x/vuln/cmd/govulncheck) in
a one-shot container so that PR creation, updates to a PR, and merges
to main can contact the govuln API and check for known vulnerabilities.

Lastly, upgrades the version of golangci-lint to the [latest available
(v1.53.3)](https://github.com/golangci/golangci-lint/releases).

---------

Co-authored-by: Aaron Gable <aaron@letsencrypt.org>
2023-07-11 09:51:20 -04:00
Jacob Hoffman-Andrews cd24b9db20
ca: deprecate StoreLintingCertificateInsteadOfPrecertificate (#6970)
And turn off the orphan queue in config-next.
2023-07-05 10:44:08 -07:00
Aaron Gable cc596bd4eb
Begin testing on go1.21rc2 with loopvar experiment (#6952)
Add go1.21rc2 to the matrix of go versions we test against.

Add a new step to our CI workflows (boulder-ci, try-release, and
release) which sets the "GOEXPERIMENT=loopvar" environment variable if
we're running go1.21. This experiment makes it so that loop variables
are scoped only to their single loop iteration, rather than to the whole
loop. This prevents bugs such as our CAA Rechecking incident
(https://bugzilla.mozilla.org/show_bug.cgi?id=1619047). Also add a line
to our docker setup to propagate this environment variable into the
container, where it can affect builds.

Finally, fix one TLS-ALPN-01 test to have the fake subscriber server
actually willing to negotiate the acme-tls/1 protocol, so that the ACME
server's tls client actually waits to (fail to) get the certificate,
instead of dying immediately. This fix is related to the upgrade to
go1.21, not the loopvar experiment.

Fixes https://github.com/letsencrypt/boulder/issues/6950
2023-06-26 16:35:29 -07:00
Aaron Gable 8224fad20b
Update to go1.20.5 (#6946)
We are already running go1.20.5 in production.
2023-06-20 14:55:37 -07:00
Jacob Hoffman-Andrews a2b2e53045
cmd: fail without panic (#6935)
For "ordinary" errors like "file not found" for some part of the config,
we would prefer to log an error and exit without logging about a panic
and printing a stack trace.

To achieve that, we want to call `defer AuditPanic()` once, at the top
of `cmd/boulder`'s main. That's so early that we haven't yet parsed the
config, which means we haven't yet initialized a logger. We compromise:
`AuditPanic` now calls `log.Get()`, which will retrieve the configured
logger if one has been set up, or will create a default one (which logs
to stderr/stdout).

AuditPanic and Fail/FailOnError now cooperate: Fail/FailOnError panic
with a special type, and AuditPanic checks for that type and prints a
simple message before exiting when it's present.

This PR also coincidentally fixes a bug: panicking didn't previously
cause the program to exit with nonzero status, because it recovered the
panic but then did not explicitly exit nonzero.

Fixes #6933
2023-06-20 12:29:02 -07:00
Samantha 124c4cc6f5
grpc/sa: Implement deep health checks (#6928)
Add the necessary scaffolding for deep health checking of our various
gRPC components. Each component implementation that also implements the
grpc.checker interface will be checked periodically, and the health
status of the component will be updated accordingly.

Add the necessary methods to SA to implement the grpc.checker interface
and register these new health checks with Consul.

Additionally:
- Update entry point script to check for ProxySQL readiness.
- Increase the poll rate for gRPC Consul checks from 5s to 2s to help
with DNS failures, due to check failures, on startup.
- Change log level for Consul from INFO to ERROR to deal with noisy logs
full of transport failures due to Consul gRPC checks firing before the
SAs are up.

Fixes #6878
Part of #6795
2023-06-12 13:58:53 -04:00
Jacob Hoffman-Andrews 2041e8723b
integration: shorten log output (#6894)
Remove the load test stage of the integration test, which generates
superfluous amounts of log.

Turn down logging on the CA and VA from info to error-only.

Part of https://github.com/letsencrypt/boulder/issues/6890
2023-06-05 13:11:19 -04:00
Samantha dc269a63d5
docker: Update consul container to match production (#6913)
- Update consul container from `1.13.1` to `1.14.2` to match production.
- Specify `grpc_tls`, now required instead of defaulted to `8503` when
`enable_agent_tls_for_checks` is specified.

Part of #6911
2023-06-02 14:35:07 -04:00
Jacob Hoffman-Andrews 80e1510819
admin: add clear-email subcommand (#6919)
When a user wants their email address deleted from the database but no
longer has access to their account, this allows an administrator to
clear it.

This adds `admin` as an alias for `admin-revoker`, because we'd like the
clear-email sub-command to be a part of that overall tool, but it's not
really revocation related.

Part of #6864
2023-06-01 14:33:24 -04:00
Jacob Hoffman-Andrews 521eb55d1e
test: better message for different empty slices (#6920)
Given two empty slices, one that is equal to nil and one that is not,
AssertDeepEquals used to produce this confusing output:

    [[]] !(deep)= [[]]

After this change, it produces:

    [[]string(nil)] !(deep)= [[]string{}]
2023-05-26 09:41:23 -07:00
Aaron Gable 4305f64a28
Replace integration test root ocsp with crls (#6905)
We no longer issue OCSP responses for our intermediate certificates,
instead producing CRLs which cover those intermediates. Remove the OCSP
response from our integration test ceremony, remove the configuration
for the static ocsp-responder which serves that response, and remove the
integration test which spins up and checks that responder. Replace all
of the above with new CRLs generated as part of the integration test
ceremony.
2023-05-24 14:22:43 -07:00
Samantha f09a94bd74
consul: Configure gRPC health check for SA (#6908)
Enable SA gRPC health checks in Consul ahead of further changes for
#6878. Calls to the `Check` method of the SA's grpc.health.v1.Health
service must respond `SERVING` before the `sa` service will be
advertised in Consul DNS. Consul will continue to poll this service
every 5 seconds.

- Add `bconsul` docker service to boulder `bluenet` and `rednet`
- Add TLS credentials for `consul.boulder`:
  ```shell
  $ openssl x509 -in consul.boulder/cert.pem -text | grep DNS
                DNS:consul.boulder
  ```
- Update `test/grpc-creds/generate.sh` to add `consul.boulder`
- Update test SA configs to allow `consul.boulder` to access to
`grpc.health.v1.Health`

Part of #6878
2023-05-23 13:16:49 -04:00
Aaron Gable 26adec08cc
Remove go1.20.3 from CI (#6898)
We are no longer be using go1.20.3 in prod.
2023-05-22 14:47:33 -07:00
Aaron Gable fe523f142d
crl-updater: retry failed shards (#6907)
Add per-shard exponential backoff and retry to crl-updater. Each
individual CRL shard will be retried up to MaxAttempts (default 1)
times, with exponential backoff starting at 1 second and maxing out at 1
minute between each attempt.

This can effectively reduce the parallelism of crl-updater: while a
goroutine is sleeping between attempts of a failing shard, it is not
doing work on another shard. This is a desirable feature, since it means
that crl-updater gently reduces the total load it places on the network
and database when shards start to fail.

Setting this new config parameter is tracked in IN-9140
Fixes https://github.com/letsencrypt/boulder/issues/6895
2023-05-22 12:59:09 -07:00
Aaron Gable 3990a08328
Add relevant domain to CAA errors and logs (#6886)
When processing CAA records, keep track of the FQDN at which that CAA
record was found (which may be different from the FQDN for which we are
attempting issuance, since we crawl CAA records upwards from the
requested name to the TLD). Then surface this name upwards so that it
can be included in our own log lines and in the problem documents which
we return to clients.

Fixes https://github.com/letsencrypt/boulder/issues/3171
2023-05-22 15:08:56 -04:00
Matthew McPherrin b7d9f8c2e3
In config-next/, opentelemetry -> openTelemetry for consistency (#6888)
In configs, opentelemetry -> openTelemetry

As pointed out in review of #6867, these should match the case of their
corresponding Go identifiers for consistency.

JSON keys are case-insensitive in Go (part of why we've got a fork in
go-jose),
so this change should have no functional impact.
2023-05-15 17:07:29 -04:00
Aaron Gable 62ff373885
Probs: remove divergences from RFC8555 (#6877)
Remove the remaining divergences from RFC8555 regarding what error types
we use in certain situations. Specifically:
- use "invalidContact" instead of "invalidEmail";
- use "unsupportedContact" for contact addresses that use a protocol
other than "mailto:"; and
- use "unsupportedIdentifier" for identifiers that specify a type other
than "dns".
2023-05-15 12:35:12 -07:00
Matthew McPherrin c21b44bdc2
Rename CA's "--ca-addr" flag to "--addr" (#6889)
Most boulder components have a command line flag to override what gRPC
and debug port they listen on, which is used in tests to run multiple
instances with the same configuration.

However, CA's flag is named "--ca-addr", and not "--addr". This is
inconsistent with SA, RA, VA, nonce, publisher, and purger.

This flag isn't used in production, where we set it in the config file,
so it shouldn't be a breaking change to rename it.
2023-05-15 11:17:07 -07:00
Matthew McPherrin 3aae67b8a9
Opentelemetry: Add option for public endpoints (#6867)
This PR adds a new configuration block specifically for the otelhttp
instrumentation. This block is separate from the existing
"opentelemetry" configuration, and is only relevant when using otelhttp
instrumentation. It does not share any codepath with the existing
configuration, so it is at the top level to indicate which services it
applies to.

There's a bit of plumbing new configuration through. I've adopted the
measured_http package to also set up opentelemetry instead of just
metrics, which should hopefully allow any future changes to be smaller
(just config & there) and more consistent between the wfe2 and ocsp
responder.

There's one option here now, which disables setting
[otelhttp.WithPublicEndpoint](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp#WithPublicEndpoint).
This option is designed to do exactly what we want: Don't accept
incoming spans as parents of the new span created in the server.
Previously we had a setting to disable parent-based sampling to help
with this problem, which doesn't really make sense anymore, so let's
just remove it and simplify that setup path. The default of "false" is
designed to be the safe option. It's set to True in the test/ configs
for integration tests that use traces, and I expect we'll likely set it
true in production eventually once the LBs are configured to handle
tracing themselves.

Fixes #6851
2023-05-12 15:34:34 -04:00
Samantha 310546a14e
VA: Support discovery of DNS resolvers via Consul (#6869)
Deprecate `va.DNSResolver` in favor of backwards compatible
`va.DNSProvider`.

Fixes #6852
2023-05-12 12:54:31 -04:00
Samantha 19c5244088
test: Use consul hostname instead of IP for dnsAuthority (#6883)
Standardize on hostnames for dnsAuthority to match production. 

Related to #6869
2023-05-11 14:13:53 -07:00
Jacob Hoffman-Andrews f295626e4c
ca: remove simulated ISRG OID from config (#6879)
We intend to issue in the future with only the CA/Browser Forum Domain
Validated OID.
2023-05-10 12:39:12 -04:00
Jacob Hoffman-Andrews ac4be89b56
grpc: add NoWaitForReady config field (#6850)
Currently we set WaitForReady(true), which causes gRPC requests to not
fail immediately if no backends are available, but instead wait until
the timeout in case a backend does become available. The downside is
that this behavior masks true connection errors. We'd like to turn it
off.

Fixes #6834
2023-05-09 16:16:44 -07:00
Samantha c453ca0571
grpc: Deprecate clientNames field (#6870)
- SRE removed in IN-8755

Fixes #6698
2023-05-08 14:49:27 -04:00
Samantha 487680629d
cmd: TLSConfig values should be string not *string (#6872)
Fixes #6737
2023-05-08 13:21:42 -04:00
Samantha c9173cc024
boulder-va: Remove deprecated Common fields stanza (#6871)
- SRE removed in IN-8752.

Fixes #6716
2023-05-08 11:47:17 -04:00
Matthew McPherrin 8427245675
OTel Integration test using jaeger (#6842)
This adds Jaeger's all-in-one dev container (with no persistent storage)
to boulder's dev docker-compose. It configures config-next/ to send all
traces there.

A new integration test creates an account and issues a cert, then
verifies the trace contains some set of expected spans.

This test found that async finalize broke spans, so I fixed that and a
few related spots where we make a new context.
2023-05-05 10:41:29 -04:00
Phil Porada f8f45f90a9
Test and build release on go1.20.4 (#6862)
[Go 1.20.4](https://groups.google.com/g/golang-announce/c/MEb0UyuSMsU)
contains a security updates for the html/template package, which we use
in `//cmd/bad-key-revoker`.
2023-05-04 10:55:02 -04:00
Aaron Gable 02fa680b08
Update path to ARI endpoint (#6859)
Update the document number to the latest version, and remove the /get/
prefix since it now supports both the GET and POST portions of the spec.

Also update one piece of tooling to properly get the ARI URL from the
directory, rather than hard-coding it.
2023-05-03 15:20:51 -07:00
Matthew McPherrin b5118dde36
Stop using DIRECTORY env var in integration tests (#6854)
We only ever set it to the same value, and then read it back in
make_client, so just hardcode it there instead.

It's a bit spooky-action-at-a-distance and is process-wide with no
synchronization, which means we can't safely use different values
anyway.
2023-05-03 09:54:48 -04:00
Jacob Hoffman-Andrews a9fc1cb882
Improve cert_storage_failed_test (#6849)
Replace inline connect string with a new one in test/vars (that points
to boulder_sa_integration).

Remove comments about interpolateParams=false being required; it is not.

Add clauses to getPrecertByName to ensure it follows its documented
constraints (return the latest one).

Follow-up on #6807. Fixes #6848.
2023-05-02 15:43:07 -07:00
Jacob Hoffman-Andrews 1c7e0fd1d8
Store linting certificate instead of precertificate (#6807)
In order to get rid of the orphan queue, we want to make sure that
before we sign a precertificate, we have enough data in the database
that we can fulfill our revocation-checking obligations even if storing
that precertificate in the database fails. That means:

- We should have a row in the certificateStatus table for the serial.
- But we should not serve "good" for that serial until we are positive
the precertificate was issued (BRs 4.9.10).
- We should have a record in the live DB of the proposed certificate's
public key, so the bad-key-revoker can mark it revoked.
- We should have a record in the live DB of the proposed certificate's
names, so it can be revoked if we are required to revoke based on names.

The SA.AddPrecertificate method already achieves these goals for
precertificates by writing to the various metadata tables. This PR
repurposes the SA.AddPrecertificate method to write "proposed
precertificates" instead.

We already create a linting certificate before the precertificate, and
that linting certificate is identical to the precertificate that will be
issued except for the private key used to sign it (and the AKID). So for
instance it contains the right pubkey and SANs, and the Issuer name is
the same as the Issuer name that will be used. So we'll use the linting
certificate as the "proposed precertificate" and store it to the DB,
along with appropriate metadata.

In the new code path, rather than writing "good" for the new
certificateStatus row, we write a new, fake OCSP status string "wait".
This will cause us to return internalServerError to OCSP requests for
that serial (but we won't get such requests because the serial has not
yet been published). After we finish precertificate issuance, we update
the status to "good" with SA.SetCertificateStatusReady.

Part of #6665
2023-04-26 13:54:24 -07:00
Aaron Gable 25cae29f70
Resolve TestAkamaiPurgerDrainQueueFails race (#6844)
Fixes https://github.com/letsencrypt/boulder/issues/6837
2023-04-26 11:30:09 -04:00
Phil Porada 17fb1b287f
cmd: Export prometheus metrics for TLS cert notBefore and notAfter fields (#6836)
Export new prometheus metrics for the `notBefore` and `notAfter` fields
to track internal certificate validity periods when calling the `Load()`
method for a `*tls.Config`. Each metric is labeled with the `serial`
field.

```
tlsconfig_notafter_seconds{serial="2152072875247971686"} 1.664821961e+09
tlsconfig_notbefore_seconds{serial="2152072875247971686"} 1.664821960e+09
```

Fixes https://github.com/letsencrypt/boulder/issues/6829
2023-04-24 16:28:05 -04:00
Aaron Gable 5480f1060b
Clean up database schema (#6832)
Make a series of small changes to our test database schema, both to make
it simpler to reason about and to bring it closer in alignment to our
production database schema:
- Incorporate the IssuedNamesDropIndex, Incidents, SimplePartitioning,
and NotUnique migrations into the CombinedSchema, as they have been
fully applied in prod;
- Use CHARSET=utf8mb4 everywhere, instead of just utf8;
- Use UNSIGNED for auto-increment ID columns in the tables where prod
does; and
- Re-sort the tables in CombinedSchema which no longer have foreign key
constraints.

Part of https://github.com/letsencrypt/boulder/issues/6820
2023-04-21 10:37:05 -07:00
Aaron Gable 3ddca2d1b8
Update eggsampler/acme and use it for ARI tests (#6811)
Update github.com/eggsampler/acme from v3.3.0 to v3.4.0.
Changelog: https://github.com/eggsampler/acme/compare/v3.3.0...v3.4.0

Update the ARI integration test to use the eggampler/acme client's new
ARI capabilities for making both GET and POST requests. This simplifies
and streamlines the test significantly, and lets us test the POST path.

Fixes #6781
2023-04-19 14:14:43 -07:00
Phil Porada 0ac848173e
Appease errcheck (#6821)
Check errors during shutdown for several components to appease errcheck.

Related to [1] and [2].

1) https://github.com/letsencrypt/boulder/pull/6808
2) https://github.com/letsencrypt/boulder/pull/6819
2023-04-14 22:32:24 -04:00
Aaron Gable bd1d27b8e8
Fix non-gRPC process cleanup and exit (#6808)
Although #6771 significantly cleaned up how gRPC services stop and clean
up, it didn't make any changes to our HTTP servers or our non-server
(e.g. crl-updater, log-validator) processes. This change finishes the
work.

Add a new helper method cmd.WaitForSignal, which simply blocks until one
of the three signals we care about is received. This easily replaces all
calls to cmd.CatchSignals which passed `nil` as the callback argument,
with the added advantage that it doesn't call os.Exit() and therefore
allows deferred cleanup functions to execute. This new function is
intended to be the last line of main(), allowing the whole process to
exit once it returns.

Reimplement cmd.CatchSignals as a thin wrapper around cmd.WaitForSignal,
but with the added callback functionality. Also remove the os.Exit()
call from CatchSignals, so that the main goroutine is allowed to finish
whatever it's doing, call deferred functions, and exit naturally.

Update all of our non-gRPC binaries to use one of these two functions.
The vast majority use WaitForSignal, as they run their main processing
loop in a background goroutine. A few (particularly those that can run
either in run-once or in daemonized mode) still use CatchSignals, since
their primary processing happens directly on the main goroutine.

The changes to //test/load-generator are the most invasive, simply
because that binary needed to have a context plumbed into it for proper
cancellation, but it already had a custom struct type named "context"
which needed to be renamed to avoid shadowing.

Fixes https://github.com/letsencrypt/boulder/issues/6794
2023-04-14 16:22:56 -04:00
Aaron Gable 98fa0f07b4
Re-enable errcheck linter (#6819)
Enable the errcheck linter. Update the way we express exclusions to use
the new, non-deprecated, non-regex-based format. Fix all places where we
began accidentally violating errcheck while it was disabled.
2023-04-14 15:41:12 -04:00
Phil Porada 56a11f0896
Fix CI failures related to akamai-test-srv (#6815)
Fixes a CI problem introduced by
https://github.com/letsencrypt/boulder/pull/6758 where we could send two
purge requests which caused sporadic CI failures due to an infinite
loop.

Fixes https://github.com/letsencrypt/boulder/issues/6806
2023-04-13 09:56:30 -07:00
Aaron Gable 45329c9472
Deprecate ROCSPStage7 flag (#6804)
Deprecate the ROCSPStage7 feature flag, which caused the RA and CA to
stop generating OCSP responses when issuing new certs and when revoking
certs. (That functionality is now handled just-in-time by the
ocsp-responder.) Delete the old OCSP-generating codepaths from the RA
and CA. Remove the CA's internal reference to an OCSP implementation,
because it no longer needs it.

Additionally, remove the SA's "Issuers" config field, which was never
used.

Fixes #6285
2023-04-12 17:03:06 -07:00
Aaron Gable d6192e7c56
Stop testing go1.20.2 (#6809)
Staging and Prod have fully upgraded to go1.20.3, per IN-8865.
2023-04-10 11:00:25 -07:00
Aaron Gable e55a276efe
CA: Remove deprecated config stanzas (#6595)
These config stanzas have been removed in staging and prod. They used to
configure the separate OCSP and CRL gRPC services provided by the CA
process, but the CA now provides those services on the same port as the
main CA gRPC service.

Fixes #6448
2023-04-07 09:37:34 -07:00
Aaron Gable 94f93361a0
Promote the first SAN from the CSR (#6796)
Rather than promoting the alphabetically-first SAN to be the CN, promote
the SAN which came first in the CSR. This is a reversion to previous
behavior that was changed as a side-effect of:
- https://github.com/letsencrypt/boulder/pull/6706;
- https://github.com/letsencrypt/boulder/pull/6749; and
- https://github.com/letsencrypt/boulder/pull/6757

Fixes https://github.com/letsencrypt/boulder/issues/6801
2023-04-06 14:30:19 -07:00
Aaron Gable 7e994a1216
Deprecate ROCSPStage6 feature flag (#6770)
Deprecate the ROCSPStage6 feature flag. Remove all references to the
`ocspResponse` column from the SA, both when reading from and when
writing to the `certificateStatus` table. This makes it safe to fully
remove that column from the database.

IN-8731 enabled this flag in all environments, so it is safe to
deprecate.

Part of #6285
2023-04-04 15:41:51 -07:00
Phil Porada 8824e347fd
Golang 1.20.3 security release upgrade (#6793)
Release notes: https://groups.google.com/g/golang-announce/c/Xdv6JL9ENs8

This update includes fixes for excessive memory usage when parsing
headers in the net/http package.
2023-04-04 15:33:34 -07:00
Aaron Gable 8c67769be4
Remove ocsp-updater from Boulder (#6769)
Delete the ocsp-updater service, and the //ocsp/updater library that
supports it. Remove test configs for the service, and remove references
to the service from other test files.

This service has been fully shut down for an extended period now, and is
safe to remove.

Fixes #6499
2023-03-31 14:39:04 -07:00
Aaron Gable 22fd579cf2
ARI: write Retry-After header before body (#6787)
When sending an ARI response, write the Retry-After header before
writing the JSON response body. This is necessary because
http.ResponseWriter implicitly calls WriteHeader whenever Write is
called, flushing all headers to the network and preventing any
additional headers from being written. Unfortunately, the unittests use
httptest.ResponseRecorder, which doesn't seem to enforce this invariant
(it's happy to report headers which were written after the body). Add a
header check to the integration tests, to make up for this deficiency.
2023-03-31 10:48:45 -07:00
Aaron Gable 9262ca6e3f
Add grpc implementation tests to all services (#6782)
As a follow-up to #6780, add the same style of implementation test to
all of our other gRPC services. This was not included in that PR just to
keep it small and single-purpose.
2023-03-31 09:52:26 -07:00