Update our ACME Renewal Info implementation to parse
the CertID-based request format specified in the current
version of the draft specification.
Part of #6033
Debug and Info messages still go to stdout.
Fix the CAA integration test, which asserted that stderr should be empty
when caa-log-checker finds a problem. That used to be the case because
we never logged to stderr, but now we do.
Update the logging docs.
Fixes #6324
Right now, Boulder expects to be able to connect to syslog, and panics
if it's not available. We'd like to be able to log to stdout/stderr as a
replacement for syslog.
- Add a detailed timestamp (down to microseconds, same as we collect in
prod via syslog).
- Remove the escape codes for colorizing output.
- Report the severity level numerically rather than with a letter prefix.
Add locking for stdout/stderr and syslog logs. Neither the [syslog] package
nor the [os] package documents concurrency-safety, and the Go rule is: if
it's not documented to be concurrent-safe, it's not. Notably the [log.Logger]
type is documented to be concurrent-safe, and a look at its implementation
shows it uses a Mutex internally.
Remove places that use the singleton `blog.Get()`, and instead pass through
a logger from main in all the places that need it.
[syslog]: https://pkg.go.dev/log/syslog
[os]: https://pkg.go.dev/os
[log.Logger]: https://pkg.go.dev/log#Logger
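As a rough sketch of the locking approach (illustrative names only, not Boulder's actual logging code):

```
package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

// stdoutLogger serializes writes to stdout. Neither os.File nor the
// syslog package documents concurrency-safety, so every write is
// guarded by a mutex, the same way log.Logger does internally.
type stdoutLogger struct {
	mu sync.Mutex
}

// logAt writes one line with a microsecond-resolution timestamp and a
// numeric severity level, and no colorizing escape codes.
func (l *stdoutLogger) logAt(level int, msg string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	ts := time.Now().Format("2006-01-02T15:04:05.000000-07:00")
	fmt.Fprintf(os.Stdout, "%s %d %s\n", ts, level, msg)
}

func main() {
	l := &stdoutLogger{}
	l.logAt(6, "boulder-wfe2 starting") // 6 is the syslog "info" severity
}
```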
Emit PEM output instead of pretty-printed output. Send the pretty-printed
output straight to stdout instead of via a logger, so the internal newlines don't
get escaped.
Fixes #6310
Enable the "unparam" linter, which checks for unused function
parameters, unused function return values, and parameters and
return values that always have the same value every time they
are used.
In addition, fix many instances where the unparam linter complains
about our existing codebase. Remove error return values from a
number of functions that never return an error, remove or use
context and test parameters that were previously unused, and
simplify a number of (mostly test-only) functions that always take the
same value for their parameter. Most notably, remove the ability to
customize the RSA Public Exponent from the ceremony tooling,
since it should always be 65537 anyway.
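A hypothetical illustration (not actual Boulder code) of the kinds of findings unparam produces:

```
package main

import (
	"context"
	"fmt"
)

type order struct{ names []string }

// unparam would flag this function twice: the error return value is
// always nil, and every caller passes the same value for limit.
func truncateNames(o *order, limit int) error {
	if len(o.names) > limit {
		o.names = o.names[:limit]
	}
	return nil
}

// unparam also flags parameters, like ctx here, that are never used.
func countNames(ctx context.Context, o *order) int {
	return len(o.names)
}

func main() {
	o := &order{names: []string{"a.example", "b.example"}}
	_ = truncateNames(o, 100)
	_ = truncateNames(o, 100) // limit is always 100
	fmt.Println(countNames(context.Background(), o))
}
```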
Fixes #6104
Allow the crl-storer to load whole AWS config files. Although
this requires a deployment to maintain an additional config
file for the crl-storer, and one in a format we usually don't
use, it does give us lots of flexibility in setting up things like
role assumption.
Also remove the S3Region config flag, as it is now redundant
with the contents of the config file, and rename the existing
S3CredsFile config key to AWSCredsFile to better represent
its true contents.
Fixes #6308
Add a new `test.AssertNil()` helper to facilitate asserting that a given
unit test result is a non-boxed nil. Update `test.AssertNotNil()` to use
the reflect package's `.IsNil()` method to catch boxed nils.
In Go, variables whose type is constrained to be an interface type (e.g.
a function parameter which takes an interface, or the return value of a
function which returns `error`, itself an interface type) should
actually be thought of as a (T, V) tuple, where T is their underlying
concrete type and V is their underlying value. Thus, there are two ways
for such a variable to be nil-like: it can be truly nil where T=nil and
V is uninitialized, or it can be a "boxed nil" where T is a nillable
type such as a pointer or a slice and V=nil.
Unfortunately, only the former of these is `== nil`. The latter is the
cause of frequent bugs, programmer frustration, a whole entry in the Go
FAQ, and considerable design effort toward removing it in Go 2.
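A self-contained demonstration of the two nil-like states:

```
package main

import "fmt"

type myError struct{}

func (*myError) Error() string { return "my error" }

func mayFail(fail bool) *myError {
	if fail {
		return &myError{}
	}
	return nil // a typed nil pointer
}

func main() {
	// Assigning a nil *myError to an error-typed variable produces a
	// "boxed nil": T = *myError, V = nil.
	var err error = mayFail(false)
	fmt.Println(err == nil) // false! T is set even though V is nil.

	// A truly nil interface has T = nil and V uninitialized.
	var truly error
	fmt.Println(truly == nil) // true
}
```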
Therefore these two test helpers both call `t.Fatal()` when passed a
boxed nil. We want to avoid passing around boxed nils whenever possible,
and having our tests fail whenever we do is a good way to enforce good
nil hygiene.
Fixes #3279
Add a new `GetRevocationStatus` gRPC method to the SA which retrieves
only the subset of the certificate status metadata relevant to
revocation, namely whether the certificate has been revoked, when it was
revoked, and the revocation reason. Notably, this method is our first
use of the `google.protobuf.Timestamp` type in a message, which is more
ergonomic and less prone to errors than using unix nanoseconds.
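For illustration, the ergonomic difference in Go, using the standard `timestamppb` bindings:

```
package main

import (
	"fmt"
	"time"

	"google.golang.org/protobuf/types/known/timestamppb"
)

func main() {
	revokedAt := time.Now()

	// With unix nanoseconds, every caller must remember the unit and
	// the time.Unix(sec, nsec) argument order.
	ns := revokedAt.UnixNano()
	fromNanos := time.Unix(0, ns)

	// With google.protobuf.Timestamp, conversion is explicit and typed.
	ts := timestamppb.New(revokedAt)
	fmt.Println(ts.AsTime().Equal(fromNanos)) // true
}
```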
Use this new method in ocsp-responder's checked_redis_source, to avoid
having to send many other pieces of metadata and the full ocsp response
bytes over the network. It provides all the information necessary to
determine if the response from Redis is up-to-date.
Within the checked_redis_source, use this new method in two different
ways: if only a database connection is configured (as is the case today)
then get this information directly from the db; if a gRPC connection to
the SA is available then prefer that instead. This may make requests
slower, but will allow us to remove database access from the hosts which
run the ocsp-responder today, simplifying our network.
The new behavior consists of two pieces, each locked behind a config
gate:
- Performing the smaller database query is only enabled if the
ocsp-responder has the `ROCSPStage3` feature flag enabled.
- Talking to the SA rather than the database directly is only enabled if
the ocsp-responder has an `saService` gRPC stanza in its config.
Fixes #6274
When rsyslog receives multiple identical log lines in a row, it can
collapse those lines into a single instance of the log line and a
follow-up line saying "message repeated X times". However, that
rsyslog-generated line does not contain our log line checksum, so it
immediately causes log-validator to complain about the line. In
addition, the rsyslog docs themselves state that this feature is a
misfeature and should never be turned on. Despite this, Ubuntu turns the
feature on by default when the rsyslog package is installed from apt.
Add an additional command to our dockerfile which overwrites Ubuntu's
default setting to disable this misfeature, and update our test
environment to use the new docker image.
Fixes #6252
Create a new `ROCSPStage6` feature flag which affects the behavior of
the SA. When enabled, this flag causes the `AddPrecertificate`,
`RevokeCertificate`, and `UpdateRevokedCertificate` methods to ignore
the OCSP response bytes provided by their caller. They will no longer
error out if those bytes are missing, and if the bytes are present they
will still not be written to the database.
This allows us to, in the future, cause the RA and CA to stop generating
those OCSP responses entirely, and stop providing them to the SA,
without causing any errors when we do.
Part of #6079
Run the Boulder unit and integration tests with go1.19.
In addition, make a few small changes to allow both sets of
tests to run side-by-side. Mark a few tests, including our lints
and generate checks, as go1.18-only. Reformat a few doc
comments, particularly lists, to abide by go1.19's stricter gofmt.
Causes #6275
The iotuil package has been deprecated since go1.16; the various
functions it provided now exist in the os and io packages. Replace all
instances of ioutil with either io or os, as appropriate.
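The replacements are mechanical, for example:

```
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

func main() {
	// ioutil.ReadAll(r)  ->  io.ReadAll(r)
	body, _ := io.ReadAll(strings.NewReader("hello"))

	// ioutil.WriteFile / ioutil.ReadFile  ->  os.WriteFile / os.ReadFile
	_ = os.WriteFile("/tmp/ioutil-demo", body, 0644)
	data, _ := os.ReadFile("/tmp/ioutil-demo")

	// ioutil.Discard  ->  io.Discard
	fmt.Fprintln(io.Discard, "dropped")

	fmt.Println(string(data))
}
```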
For some reason ROCSPStage3 was enabled without also enabling
ROCSP Stages 1 and 2. Fix the oversight so we're actually running
all of the first three ROCSP stages in config-next integration tests.
Create a new crl-storer service, which receives CRL shards via gRPC and
uploads them to an S3 bucket. It ignores AWS SDK configuration in the
usual places, in favor of configuration from our standard JSON service
config files. It ensures that the CRLs it receives parse and are signed
by the appropriate issuer before uploading them.
Integrate crl-updater with the new service. It streams bytes to the
crl-storer as it receives them from the CA, without performing any
checking at the same time. This new functionality is disabled if the
crl-updater does not have a config stanza instructing it how to connect
to the crl-storer.
Finally, add a new test component, the s3-test-srv. This acts similarly
to the existing mail-test-srv: it receives requests, stores information
about them, and exposes that information for later querying by the
integration test. The integration test uses this to ensure that a
newly-revoked certificate does show up in the next generation of CRLs
produced.
Fixes #6162
- Implement a static resolver for the gRPC dialer under the scheme `static:///`
which allows the dialer to resolve a backend from a static list of IPv4/IPv6
addresses passed via the existing JSON config.
- Add config key `serverAddresses` to the `GRPCClientConfig` which, when
populated, enables static IP resolution of gRPC server backends.
- Set `config-next` to use static gRPC backend resolution for all SA clients.
- Generate a new SA certificate which adds `10.77.77.77` and `10.88.88.88` to
the SANs.
Resolves #6255
Add checkedRedisSource, a new OCSP Source which gets
responses from Redis, gets metadata from the database, and
only serves the Redis response if it matches the authoritative
metadata. If there is a mismatch, it requests a new OCSP
response from the CA, stores it in Redis, and serves the new
response.
This behavior is locked behind a new ROCSPStage3 feature flag.
Part of #6079
- Move entry for `nonce` service to the second `minica` loop so that DNS names
`nonce1.boulder` and `nonce2.boulder` are added to the SANs
- Remove anachronistic `crl-storer` gRPC cert and key added in #6212
- Modify `ra.checkCertificatesPerFQDNSetLimit()` to use a leaky bucket algorithm (see the sketch after this list)
- Return issuance timestamps from `sa.FQDNSetTimestampsForWindow()` in descending order
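Below is a generic leaky-bucket admission check over past issuance timestamps. It is a sketch of the technique only, not `ra.checkCertificatesPerFQDNSetLimit()`'s actual code, and for simplicity it takes timestamps in ascending order:

```
package main

import (
	"fmt"
	"time"
)

// allowIssuance models a bucket with capacity `limit` that drains at a
// constant rate of limit tokens per window. Each past issuance
// deposited one token; a new issuance is allowed only if the bucket
// still has room for one more token now.
func allowIssuance(issued []time.Time, limit int, window time.Duration, now time.Time) bool {
	rate := float64(limit) / window.Seconds() // tokens drained per second
	level := 0.0
	var prev time.Time
	for _, t := range issued { // issued must be sorted ascending
		if !prev.IsZero() {
			level -= t.Sub(prev).Seconds() * rate
			if level < 0 {
				level = 0
			}
		}
		level++ // this past issuance deposited one token
		prev = t
	}
	if !prev.IsZero() {
		level -= now.Sub(prev).Seconds() * rate
		if level < 0 {
			level = 0
		}
	}
	return level+1 <= float64(limit)
}

func main() {
	now := time.Now()
	past := []time.Time{now.Add(-3 * time.Hour), now.Add(-2 * time.Hour)}
	fmt.Println(allowIssuance(past, 5, 7*24*time.Hour, now)) // true
}
```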
Resolves #6154
Add a new config key `UpdateOffset` to crl-updater, which causes it to
run on a regular schedule rather than running immediately upon startup
and then every `UpdatePeriod` after that. It is safe for this new config
key to be omitted and take the default zero value.
Also add a new command line flag `runOnce` to crl-updater which causes
it to immediately run a single time and then exit, rather than running
continuously as a daemon. This will be useful for integration tests and
emergency situations.
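A sketch of the schedule computation this implies (illustrative, not crl-updater's actual code):

```
package main

import (
	"fmt"
	"time"
)

// nextRun computes the next tick of a schedule that fires every
// updatePeriod, offset from the period boundary by updateOffset.
// With the default zero offset this degenerates to plain period
// boundaries.
func nextRun(now time.Time, updatePeriod, updateOffset time.Duration) time.Time {
	next := now.Truncate(updatePeriod).Add(updateOffset)
	if !next.After(now) {
		next = next.Add(updatePeriod)
	}
	return next
}

func main() {
	fmt.Println(nextRun(time.Now(), 6*time.Hour, 90*time.Minute))
}
```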
Part of #6163
Add a new filter to mail-test-srv, allowing test processes to query
for messages sent from a specific address, not just ones sent to
a specific address. This fixes a race condition in the revocation
integration tests where the number of messages sent to a cert's
contact address would be higher than expected because expiration
mailer sent a message while the test was running. Also reduce
bad-key-revoker's maximum backoff to 2 seconds to ensure that
it continues to run frequently during the integration tests, despite
usually not having any work to do.
While we're here, also improve the comments on various revocation
integration tests, remove some unnecessary cruft, and split the tests
out to explicitly test functionality with the MozRevocationReasons
flag both enabled and disabled. Also, change ocsp_helper's default
output from os.Stdout to ioutil.Discard to prevent hundreds of lines
of log spam when the integration tests fail during a test that uses
that library.
Fixes #6248
Previously, when shutting down a `docker-compose` stack,
bredis_clusterer would take 10s to shut down. This decreases the time to
0.4s.
I believe this is because docker-compose was killing `bash` and waiting
for its children to die (they weren't), then hitting a timeout and hard
killing the container. Now, since `exec` replaces the current pid,
docker-compose can kill redis-server directly.
Previously we used "ExpectedFreshness" to control how frequently the
Redis source would request re-signing of stale entries. But that field
also controls whether multi_source is willing to serve a MariaDB
response. It's better to split these into two values.
Previously challtestsrv.py (used by chisel.py) assumed challtestsrv runs
on localhost. But we can also reach it on the fixed IP 10.77.77.77, and
this allows running chisel2.py from the host in addition to running it
inside a container.
This enables ocsp-responder to talk to the RA and request freshly signed
OCSP responses.
ocsp/responder/redis_source is moved to ocsp/responder/redis/redis_source.go
and significantly modified. Instead of assuming a response is always available
in Redis, it wraps a live-signing source. When a response is not available,
it attempts a live signing.
If live signing succeeds, the Redis responder returns the result right away
and attempts to write a copy to Redis on a goroutine using a background
context.
To make things more efficient, I eliminated an unneeded ocsp.ParseResponse
from the storage path. And I factored out a FakeResponse helper to make
the unit tests more manageable.
Commits should be reviewable one-by-one.
Fixes #6191
Add a collection of lints (structured similarly, but not identically,
to zlint's certificate lints) which check a variety of requirements
based on RFC 5280, the Baseline Requirements, and the Mozilla
Root Store Policy.
Add a method to lint CRLs to the existing linter package which
uses its fake issuer to sign the CRL, calls all of the above lints,
and returns all of their findings. Call this new method from within
the CA's new GenerateCRL method immediately before signing
the real CRL using the real issuer.
Fixes #6188
Create a new service named crl-updater. It is responsible for
maintaining the full set of CRLs we issue: one "full and complete" CRL
for each currently-active Issuer, split into a number of "shards" which
are essentially CRLs with arbitrary scopes.
The crl-updater is modeled after the ocsp-updater: it is a long-running
standalone service that wakes up periodically, does a large amount of
work in parallel, and then sleeps. The period at which it wakes to do
work is configurable. Unlike the ocsp-updater, it does all of its work
every time it wakes, so we expect to set the update frequency at 6-24
hours.
Maintaining CRL scopes is done statelessly. Every certificate belongs to
a specific "bucket", given its notAfter date. This mapping is generally
unchanging over the life of the certificate, so revoked certificate
entries will not be moving between shards upon every update. The only
exception is if we change the number of shards, in which case all of the
bucket boundaries will be recomputed. For more details, see the comment
on `getShardBoundaries`.
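The stateless bucketing idea reduces to something like the following (illustrative only; the real logic lives in `getShardBoundaries`):

```
package main

import (
	"fmt"
	"time"
)

// shardFor maps a certificate to a shard index purely from its
// notAfter date: time is divided into fixed-width buckets, and
// buckets are assigned to shards round-robin. The mapping only
// changes if numShards changes, which moves all the boundaries.
func shardFor(notAfter time.Time, bucketWidth time.Duration, numShards int) int {
	bucket := notAfter.Unix() / int64(bucketWidth.Seconds())
	return int(bucket % int64(numShards))
}

func main() {
	notAfter := time.Date(2022, 9, 15, 12, 0, 0, 0, time.UTC)
	fmt.Println(shardFor(notAfter, 18*time.Hour, 128))
}
```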
It uses the new SA.GetRevokedCerts method to collect all of the revoked
certificates whose notAfter timestamps fall within the boundaries of
each shard's time-bucket. It uses the new CA.GenerateCRL method to sign
the CRLs. In the future, it will send signed CRLs to the crl-storer to
be persisted outside our infrastructure.
Fixes #6163
The `affiliationChanged` revocation reason is only relevant
to certificates which contain Subject Identity Information.
As we only issue DV certificates, which cannot contain such
information, our certificates should not be able to be revoked
for this reason.
See https://groups.google.com/a/mozilla.org/g/dev-security-policy/c/m3-XPcVcJ9M
Add a new CA gRPC method named `GenerateCRL`. In the
style of the existing `GenerateOCSP` method, this new endpoint
is implemented as a separate service, for which the CA binary
spins up an additional gRPC service.
This method uses gRPC streaming for both its input and output.
For input, the stream must contain exactly one metadata message
identifying the crl number, issuer, and timestamp, and then any
number of messages identifying a single certificate which should
be included in the CRL. For output, it simply streams chunks of
bytes.
Fixes #6161
Expanding `$@` means that if a positional parameter has an internal
space, e.g. "foo bar", it will be split into two positional parameters
in the resulting command, e.g. "foo" "bar". Expanding `"$@"` ensures
that such parameters are quoted during expansion, so we still get
"foo bar" in the exec command, which is what we always wanted.
Add a new code path to the ctpolicy package which enforces Chrome's new
CT Policy, which requires that SCTs come from logs run by two different
operators, rather than one Google and one non-Google log. To achieve
this, invert the "race" logic: rather than assuming we always have two
groups, and racing the logs within each group against each other, we now
race the various groups against each other, and pick just one arbitrary
log from each group to attempt submission to.
Ensure that the new code path does the right thing by adding a new zlint
which checks that the two SCTs embedded in a certificate come from logs
run by different operators. To support this lint, which needs to have a
canonical mapping from logs to their operators, import the Chrome CT Log
List JSON Schema and autogenerate Go structs from it so that we can
parse a real CT Log List. Also add flags to all services which run these
lints (the CA and cert-checker) to let them load a CT Log List from disk
and provide it to the lint.
Finally, since we now have the ability to load a CT Log List file
anyway, use this capability to simplify configuration of the RA. Rather
than listing all of the details for each log we're willing to submit to,
simply list the names (technically, Descriptions) of each log, and look
up the rest of the details from the log list file.
To support this change, SRE will need to deploy log list files (the real
Chrome log list for prod, and a custom log list for staging) and then
update the configuration of the RA, CA, and cert-checker. Once that
transition is complete, the deletion TODOs left behind by this change
will be able to be completed, removing the old RA configuration and old
ctpolicy race logic.
Part of #5938
This reverts commit 7ef6913e71.
We turned on the `ExpirationMailerDontLookTwice` feature flag in prod, and it's
working fine but not clearing the backlog. Since
https://github.com/letsencrypt/boulder/pull/6100 fixed the issue that caused us
to (nearly) stop sending mail when we deployed #6057, this should be safe to
roll forward.
The revert of the revert applied cleanly, except for `expiration-mailer/main.go`
and `main_test.go`, particularly around the contents of `processCerts` (where
`sendToOneRegID` was extracted from) and `sendToOneRegID` itself. So those areas
are good targets for extra attention.
Update:
- golangci-lint from v1.42.1 to v1.46.2
- protoc from v3.15.6 to v3.20.1
- protoc-gen-go from v1.26.0 to v1.28.0
- protoc-gen-go-grpc from v1.1.0 to v1.2.0
- fpm from v1.14.0 to v1.14.2
Also remove a reference to go1.17.9 from one last place.
This does result in updating all of our generated .pb.go files, but only
to update the version number embedded in each file's header.
Fixes #6123
We recently landed a fix so the expiration-mailer won't look twice at
the same certificate. This will cause an immediate behavior change when
it is deployed, and that might have surprising effects. Put the fix
behind a feature flag so we can control when it rolls out more
carefully.
The go-redis docs say default is 10 * NumCPU, but the actual code says 5.
Extra context:
2465baaab5/options.go (L143-L145)
2465baaab5/cluster.go (L96-L98)
For Options, the default (documented) is 10 * NumCPUs. For ClusterOptions, the
default (undocumented) is 5 * NumCPUs. We use ClusterOptions. Also worth noting:
for ClusterOptions, the limit is per node.
This was introduced when expiration-mailer was run by cron, and was a
way for expiration-mailer to know something about its expected run
interval so it could send notifications "on time" rather than "just
after" the configured email time.
Now that expiration-mailer runs as a daemon we can simply pull this
value from `Frequency`, which is set to the same value in prod.
Instead write `[]`, a better representation of an empty contact set,
and avoid having literal JSON `null`s in our database.
As part of doing so, add some extra code to //sa/model.go that
bypasses the need for //sa/type-converter.go to do any magic
JSON-to-string-slice conversions for us.
Fixes #6074
When deployed, the newly-parallel expiration-mailer encountered
unexpected difficulties and dropped to apparently sending nearly zero
emails despite not throwing any real errors. Reverting the parallelism
change until we understand and can fix the root cause.
This reverts two commits:
- Allow expiration mailer to work in parallel (#6057)
- Fix data race in expiration-mailer test mocks (#6072)
It also modifies the revert to leave the new `ParallelSends` config key
in place (albeit completely ignored), so that the binary containing this
revert can be safely deployed regardless of config status.
Part of #5682
Previously, each account's email would be sent in serial,
along with several reads from the database (to check for
certificate renewal) and several writes to the database (to update
`certificateStatus.lastExpirationNagSent`). This adds a config field
for the expiration mailer that sets the parallelism it will use.
That means making and using multiple SMTP connections as well. Previously,
`bmail.Mailer` was not safe for concurrent use. It also had a piece of
API awkwardness: after you created a Mailer, you had to call Connect on
it to change its state.
Instead of treating that as a state change on Mailer, I split out a
separate component: `bmail.Conn`. Now, when you call `Mailer.Connect()`,
you get a Conn. You can send mail on that Conn and Close it when you're
done. A single Mailer instance can produce multiple Conns, so Mailer is
now concurrency-safe (while Conn is not).
This involved a moderate amount of renaming and code movement, and
GitHub's move detector is not keeping up 100%, so an eye towards "is
this moved code?" may help. Also, adding `?w=1` to the diff URL ignores
whitespace diffs.
This copies over settings from config-next that are now deployed in prod.
Also, I updated a comment in sd-test-srv to more accurately describe how SRV records work.
go1.17.9 (released 2022-04-12) includes security fixes to the crypto/elliptic and encoding/pem packages, as well as bug fixes to the linker and runtime. See the [Go 1.17.9 milestone](https://github.com/golang/go/issues?q=milestone%3AGo1.17.9+label%3ACherryPickApproved) on our issue tracker for details.
go1.18.1 (released 2022-04-12) includes security fixes to the crypto/elliptic, crypto/x509, and encoding/pem packages, as well as bug fixes to the compiler, linker, runtime, the go command, vet, and the bytes, crypto/x509, and go/types packages. See the [Go 1.18.1 milestone](https://github.com/golang/go/issues?q=milestone%3AGo1.18.1+label%3ACherryPickApproved) on our issue tracker for details.
First commit adding support for tooling to aid in the tracking and remediation
of incidents.
- Add new SA method `IncidentsForSerial`
- Add database models for `incident`s and `incidentCert`s
- Add protobuf type for `incident`
- Add database migrations for `incidents`, `incident_foo`, and `incident_bar`
- Give db user `sa` permissions to `incidents`, `incident_foo`, and
`incident_bar`
Part of #5947
Simplify the WFE `RevokeCertificate` API method in three ways:
- Remove most of the logic checking if the requester is authorized to
revoke the certificate in question (based on who is making the
request, what authorizations they have, and what reason they're
requesting). That checking is now done by the RA. Instead, simply
verify that the JWS is authenticated.
- Remove the hard-to-read `authorizedToRevoke` callbacks, and make the
`revokeCertBySubscriberKey` (nee `revokeCertByKeyID`) and
`revokeCertByCertKey` (nee `revokeCertByJWK`) helpers much more
straight-line in their execution logic.
- Call the RA's new `RevokeCertByApplicant` and `RevokeCertByKey` gRPC
methods, rather than the deprecated `RevokeCertificateWithReg`.
This change, without any flag flips, should be invisible to the
end-user. It will slightly change some of our log message formats.
However, by now relying on the new RA gRPC revocation methods, this
change allows us to change our revocation policies by enabling the
`AllowDoubleRevocation` and `MozRevocationReasons` feature flags, which
affect the behavior of those new helpers.
Fixes #5936
- Add new configuration key `throughput`, a mapping which contains all
throughput related akamai-purger settings.
- Deprecate configuration key `purgeInterval` in favor of `purgeBatchInterval` in
the new `throughput` configuration mapping.
- When no `throughput` or `purgeInterval` is provided, the purger uses optimized
default settings which offer 1.9x the throughput of current production settings.
- At startup, all throughput related settings are modeled to ensure that we
don't exceed the limits imposed on us by Akamai.
- Queue is now `[][]string`, instead of `[]string`.
- When a given queue entry is purged we know all 3 of its URLs were purged.
- At startup we know the size of a theoretical request to purge based on the
number of queue entries included
- Raises the queue size from ~333-thousand cached OCSP responses to
1.25-million, which is roughly 6 hours of work using the optimized default
settings
- Raise `purgeInterval` in test config from 1ms, which violates API limits, to 800ms
Fixes #5984
Adds a rocsp redis client to the SA if cluster information is provided in the
SA config. If a redis cluster is configured, the SA will attempt to write all
new certificate OCSP responses added with sa.AddPrecertificate to the redis
cluster, but will not block or fail on errors.
Fixes: #5871
- Remove GOPATH-style path structure, which isn't needed with Go
modules.
- Remove the check for an existing docker buildx builder instance, since it
was unreliable.
Add two new gRPC methods to the SA:
- `RevokeCertByKey` will be used when the API request was signed by the
certificate's keypair, rather than a Subscriber keypair. If the
request is for reason `keyCompromise`, it will ensure that the key is
added to the blocked keys table, and will attempt to "re-revoke" a
certificate that was already revoked for some other reason.
- `RevokeCertByApplicant` supports both the path where the original
  subscriber requests revocation via the API, and the path where another
  account which has proven control over all of the identifiers in the
  certificate does so. It does not allow the requested reason to be
  `keyCompromise`, as these requests do not represent a demonstration of
  key compromise.
In addition, add a new feature flag `MozRevocationReasons` which
controls the behavior of these new methods. If the flag is not set, they
behave like they have historically (see above). If the flag is set to true,
then the new methods enforce the upcoming Mozilla policies around
revocation reasons, namely:
- Only the original Subscriber can choose the revocation reason; other
clients will get a set reason code based on the method of requesting
revocation. When the original Subscriber requests reason
`keyCompromise`, this request will be honored, but the key will not be
blocked and other certificates with that key will not also be revoked.
- Revocations signed with the certificate key will always get reason
`keyCompromise`, because we do not know who is sending the request and
therefore must assume that the use of the key in this way represents
compromise. Because these requests will always be for reason
`keyCompromise`, they will always be added to the blocked keys table
and they will always attempt "re-revocation".
- Revocations authorized via control of all names in the cert will
always get reason `cessationOfOperation`, which is to be used when the
original Subscriber does not control all names in the certificate
anymore.
Finally, update the existing `AdministrativelyRevokeCertificate` method
to use the new helper functions shared by the two new methods.
Part of #5936
This requires using GODEBUG to enable a couple of things turned off by go1.18 (TLS 1.0/1.1, SHA-1 CSRs).
Also add help for a failure mode of cross builds.
Reverts letsencrypt/boulder#5963
Turns out the tests are still flaky -- using the `grpc.WaitForReady(true)`
connection option results in sometimes seeing 9 entries added to the
purger queue, and sometimes 10 entries. Reverting because flakiness
on main should not be tolerated.
Bumps [google.golang.org/grpc](https://github.com/grpc/grpc-go) from 1.36.1 to 1.44.0.
- [Release notes](https://github.com/grpc/grpc-go/releases)
- [Commits](https://github.com/grpc/grpc-go/compare/v1.36.1...v1.44.0)
Also update akamai-purger integration test to avoid experimental API.
The `conn.GetState()` API is marked experimental and may change behavior
at any time. It appears to have changed between v1.36.1 and v1.44.0,
and so the akamai-purger integration tests which rely on it break.
Rather than writing our own loop which polls `conn.GetState()`, just
use the stable `WaitForReady(true)` connection option, and apply it to
all connections by setting it as a default option in the dial options.
Light cleanup of akamai-purger and the akamai cache-client. This does not make
any material changes to logic.
- Use `errors.New` and `errors.Is` instead of a custom `ErrFatal` type and
`errors.As`
- Add whitespace to separate chunks of execution and error checking from one
another
- Use `logger.Infof` and `logger.Errorf` instead of wrapped calls to
`fmt.Sprintf`
- Remove capital letters from the beginning of error messages
- Additional comments and removal of some that are no longer accurate
When inside a closure, it is important to not accidentally assign
to variables declared outside the scope of the closure. Doing so
causes static analysis tools (such as `errcheck`) to be unable to
evaluate the lifetime of the variable, and unable to determine if
it is appropriately read from before being assigned to again.
Fix two instances where we assign to a variable declared in the
closure's enclosing scope, rather than declaring a new variable
with the same name.
We have decided that we don't like the `if err := call(); err != nil`
syntax, because it creates confusing scopes, but we have not cleaned up
all existing instances of that syntax. However, we have now found a
case where that syntax enables a bug: it caused readers to believe that
a later `err = call()` statement was assigning to an already-declared `err`
in the local scope, when in fact it was assigning to an
already-declared `err` in the parent scope of a closure. This caused our
ineffassign and staticcheck linters to be unable to analyze the
lifetime of the `err` variable, and so they did not complain when we
never checked the actual value of that error.
This change standardizes on the two-line error checking syntax
everywhere, so that we can more easily ensure that our linters are
correctly analyzing all error assignments.
Incidents of key compromise where proof is supplied in the form of a private key
have historically been labor intensive for SRE. This PR seeks to automate the
process of embedded public key validation, querying for issuance, and
revocation and blocking by SPKI hash.
For an example of private keys embedding a mismatched public key, see:
https://blog.hboeck.de/archives/888-How-I-tricked-Symantec-with-a-Fake-Private-Key.html.
Adds two new sub-commands (private-key-block and private-key-revoke) and one new
flag (-dry-run) to admin-revoker. Both new sub-commands validate the
provided private key and provide the operator with an issuance count. Any
blocking and revocation actions are gated by the new '-dry-run' flag, which is
'true' by default.
private-key-block: if -dry-run=false, will immediately block issuance for the
provided key. The operator is informed that bad-key-revoker will eventually
revoke any certificates using the provided key.
private-key-revoke: if -dry-run=false, will revoke all certificates using the
provided key and then block future issuance. This avoids a race with the
bad-key-revoker. This command will execute successfully even if issuance for the
provided key is already blocked.
- Add support for blocking issuance by private key to admin-revoker
- Add support for revoking certificates by private key to admin-revoker
- Create new package called 'privatekey'
- Move private key loading logic from 'issuance' to 'privatekey'
- Add embedded public key verification to 'privatekey'
- Add new field `skipBlockKey` to `AdministrativelyRevokeCertificate` protobuf
- Add check in RA to ensure that only KeyCompromise revocations use
`skipBlockKey`
Fixes #5785
Add `stylecheck` to our list of lints, since it got separated out from
`staticcheck`. Fix the way we configure both to be clearer and not
rely on regexes.
Additionally fix a number of easy-to-change `staticcheck` and
`stylecheck` violations, allowing us to reduce our number of ignored
checks.
Part of #5681
These gRPC methods were only used by the ACMEv1 code paths.
Now that boulder-wfe has been fully removed, we can be confident
that no clients ever call these methods, and can remove them from
the gRPC service interface.
Part of #5816
Running an older version (v0.0.1-2020.1.4) of `staticcheck` in
whole-program mode (`staticcheck --unused.whole-program=true -- ./...`)
finds various instances of unused code which don't normally show up
as CI issues. I've used this to find and remove a large chunk of the
unused code, to pave the way for additional large deletions accompanying
the WFE1 removal.
Part of #5681
When we query DNS for a host, and both the A and AAAA lookups fail or
are empty, combine both errors into a single error rather than only
returning the error from the A lookup.
Fixes #5819
Fixes #5319
Overhaul the revocation integration tests to comprehensively test
every combination of:
- revoking a cert vs a precert
- revoking via the cert key, the subscriber key, or a separate account
that has validation for all of the names in the cert
- revoking for reason Unspecified vs for reason KeyCompromise
Also update a number of the python tests to verify that they cannot
revoke for reason keyCompromise, but can and do revoke with other
reasons.
When looping over multiple Go versions, this script currently exits with an
error because we attempt to create a cross-compiling node even though it
already exists. Fixing this allows subsequent builds to make use of the Docker
cache, reducing the build time by ~400 seconds.
- Only create the cross-compiling node if it doesn't exist
- No longer remove the cross-compiling node on exit
Followup from #5839.
I chose groupcache/lru as our LRU cache implementation because it's part
of the golang org, written by one of the Go authors, and very simple
and easy to read.
This adds an `AccountGetter` interface that is implemented by both the
AccountCache and the SA. If the WFE config includes an AccountCache field,
it will wrap the SA in an AccountCache with the configured max size and
expiration time.
We set an expiration time on account cache entries because we want a
bounded amount of time that they may be stale by. This will be used in
conjunction with a delay on account-updating pathways to ensure we don't
allow authentication with a deactivated account or changed key.
The account cache stores corepb.Registration objects because protobufs
have an established way to do a deep copy. Deep copies are important so
the cache can maintain its own internal state and ensure nothing external
is modifying it.
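A compressed sketch of the shape of this design (names and signatures here are hypothetical, and the real cache also deep-copies via protobufs and emits stats):

```
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/golang/groupcache/lru"
)

// registration stands in for corepb.Registration; the real cache
// stores protobufs so it can deep-copy them with proto.Clone.
type registration struct {
	ID  int64
	Key string
}

// accountGetter is implemented by both the SA client and the cache.
type accountGetter interface {
	GetRegistration(ctx context.Context, id int64) (*registration, error)
}

// cacheEntry carries an expiration time so staleness is bounded.
type cacheEntry struct {
	reg     *registration
	expires time.Time
}

// accountCache wraps another accountGetter. NB: groupcache's
// lru.Cache is not goroutine-safe, so the real thing needs a lock.
type accountCache struct {
	under accountGetter
	lru   *lru.Cache
	ttl   time.Duration
}

func (c *accountCache) GetRegistration(ctx context.Context, id int64) (*registration, error) {
	if v, ok := c.lru.Get(id); ok {
		e := v.(*cacheEntry)
		if time.Now().Before(e.expires) {
			copied := *e.reg // stands in for a protobuf deep copy
			return &copied, nil
		}
	}
	reg, err := c.under.GetRegistration(ctx, id)
	if err != nil {
		return nil, err
	}
	copied := *reg
	c.lru.Add(id, &cacheEntry{reg: &copied, expires: time.Now().Add(c.ttl)})
	return reg, nil
}

type fakeSA struct{}

func (fakeSA) GetRegistration(_ context.Context, id int64) (*registration, error) {
	return &registration{ID: id, Key: "jwk"}, nil
}

func main() {
	c := &accountCache{under: fakeSA{}, lru: lru.New(1000), ttl: 5 * time.Minute}
	reg, _ := c.GetRegistration(context.Background(), 42)
	fmt.Println(reg.ID)
}
```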
As part of this process I changed construction of the WFE. Previously,
"SA" and "RA" were public fields that were mutated after construction. Now
they are parameters to the constructor, along with the new "accountGetter"
parameter.
The cache includes stats for requests categorized by hits and misses.
Add a new check to GoodKey which attempts to factor the public modulus
of the presented key using Fermat's factorization method. This method
will succeed if and only if the prime factors are very close to each
other -- i.e. almost certainly were not selected independently from a
random uniform distribution, but were instead calculated via some other
less secure method.
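The core of Fermat's method is short; a sketch with a bounded number of rounds, as the `FermatRounds` config value suggests:

```
package main

import (
	"fmt"
	"math/big"
)

// fermatFactor tries to write n = a^2 - b^2 = (a-b)(a+b), starting
// from a = ceil(sqrt(n)). It succeeds within a few rounds only when
// n's two prime factors are very close together, which is exactly
// the weak case the GoodKey check wants to reject.
func fermatFactor(n *big.Int, rounds int) (p, q *big.Int, ok bool) {
	a := new(big.Int).Sqrt(n)
	if new(big.Int).Mul(a, a).Cmp(n) < 0 {
		a.Add(a, big.NewInt(1)) // round the square root up
	}
	b2 := new(big.Int)
	for i := 0; i < rounds; i++ {
		b2.Mul(a, a).Sub(b2, n) // b^2 = a^2 - n
		b := new(big.Int).Sqrt(b2)
		if new(big.Int).Mul(b, b).Cmp(b2) == 0 {
			return new(big.Int).Sub(a, b), new(big.Int).Add(a, b), true
		}
		a.Add(a, big.NewInt(1))
	}
	return nil, nil, false
}

func main() {
	// 23 and 29 are close primes, so their product factors immediately.
	p, q, ok := fermatFactor(big.NewInt(23*29), 10)
	fmt.Println(p, q, ok) // 23 29 true
}
```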
To support this new feature, add a new config flag to the RA, CA, and
WFE, which all use the GoodKey checks. As part of adding this new config
value, refactor the GoodKey config items into their own config struct
which can be re-used across all services.
If the new `FermatRounds` config value has not been set, it will default
to zero, causing no factorization to be attempted.
Fixes #5850
Part of #5851
Build a new docker container for the new Go 1.17.5 security release,
which includes a fix for the `net/http` package. Update our CI to run
tests on both our current and the new go versions.
Currently, if `docker buildx` fails, the cross-compilation node, created before
the build starts, will never be deleted. This ensures that the cross-compilation
node is always deleted before `tag_and_upload.sh` exits.
These tests are testing functionality that is no longer in use in
production deployments of Boulder. As we go about removing wfe1
functionality, these tests will break, so let's just remove them
wholesale right now. I have verified that all of the tests removed in
this PR are duplicated against wfe2.
One of the changes in this PR is to cease starting up the wfe1 process
in the integration tests at all. However, that component was serving
requests for the AIA Issuer URL, which gets queried by various OCSP and
revocation tests. In order to keep those tests working, this change also
adds an integration-test-only handler to wfe2, and updates the CA
configuration to point at the new handler.
Part of #5681
If configured, ocsp-updater will write responses to Redis in parallel
with MariaDB, giving up if Redis is slower and incrementing a stat.
Factors out the ShortIDIssuer concept from rocsp-tool into
rocsp_config.
This is the first step in moving OCSP responses from mysql to redis.
Adds support for parallel lookups to mysql and redis. The mysql source
remains the source of truth. If the secondaryLookup [redis] succeeds,
compare against the primaryLookup [mysql] and return if they concur
that the status is the same and the redis source is at least as fresh
as mysql.
There are checks on the database response for `certStatus.IsExpired`,
`certStatus.OCSPLastUpdated.IsZero()` and
`!src.filter.responseMatchesIssuer`.
The expired check isn't necessary for redis because the response will
be set with a ttl and drop out of redis when it reaches the ttl, and
delivering a response for an expired certificate until that happens
isn't a problem.
The `certStatus.OCSPLastUpdated.IsZero()` check is a MySQL check that
isn't needed in redis.
The `responseMatchesIssuer` check is important and will need to be
checked in some form before MySQL is no longer the source of truth.
There is another project to check issuer for responses; it isn't scoped
for this change.
Add a new feature flag `GetAuthzUseIndex` which causes the SA
to add `USE INDEX (regID_identifer_status_expires_idx)` to its authz2
database queries. This should encourage the query planner to actually
use that index instead of falling back to large table-scans.
Fixes #5822
These hashes are useful for OCSP computations, as they are the two
values that are used to uniquely identify the issuer of the given cert in
an OCSP request. Here, they are restricted to SHA1 only, as Boulder
only supports SHA1 for OCSP, as per RFC 5019.
In addition, because the `ID`, `NameID`, `NameHash`, and `KeyHash`
are relatively expensive to compute, introduce a new constructor for
`issuance.Certificate` that computes all four values at startup time and
then simply returns the precomputed values when asked.
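A sketch of how those two SHA1 hashes can be derived from an issuer certificate (illustrative, not the actual `issuance.Certificate` constructor):

```
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha1"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/asn1"
	"fmt"
	"math/big"
	"time"
)

// issuerHashes computes the SHA1 NameHash and KeyHash that identify an
// issuer in an OCSP request: a hash of the DER-encoded subject name,
// and a hash of the subject public key (the BIT STRING inside
// SubjectPublicKeyInfo, excluding the tag and length).
func issuerHashes(issuer *x509.Certificate) (nameHash, keyHash [sha1.Size]byte, err error) {
	nameHash = sha1.Sum(issuer.RawSubject)

	var spki struct {
		Algorithm asn1.RawValue
		PublicKey asn1.BitString
	}
	if _, err = asn1.Unmarshal(issuer.RawSubjectPublicKeyInfo, &spki); err != nil {
		return
	}
	keyHash = sha1.Sum(spki.PublicKey.RightAlign())
	return
}

func main() {
	// Build a throwaway self-signed cert to demonstrate the hashes.
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "demo issuer"},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, _ := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	cert, _ := x509.ParseCertificate(der)
	nh, kh, _ := issuerHashes(cert)
	fmt.Printf("NameHash=%x KeyHash=%x\n", nh, kh)
}
```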
Add a feature flag which causes the SA to switch between using the
traditional read-write database connector (pointed at the primary db)
or the newer read-only database connector (usually pointed at a
replica) when executing the `GetAuthorizations2` query.
This scans the database for certificateStatus rows, gets them signed by the CA, and writes them to Redis.
Also, bump the default PoolSize for Redis to 100.
This reverts commit e3ce816425,
which was reviewed in https://github.com/letsencrypt/boulder/pull/5607.
This change caused database queries to exceed the maximum packet size
and fail. Because this was an opportunistic optimization, reverting it
is the safest course moving forward.
When wait-for-it is trying to connect and failing, bash emits errors on
stderr. This captures those errors and sends them to /dev/null.
This also replaces an internal wait_tcp_port function inside
entrypoint.sh with a call to wait-for-it.sh.
This is a sort of proof of concept of the Redis interaction, which will
evolve into a tool for inspection and manual repair of missing entries,
if we find ourselves needing to do that.
The important bits here are rocsp/rocsp.go and
cmd/rocsp-tool/main.go. Also, the newly-vendored Redis client.
This gets us ready to add writing to Redis from ocsp-updater. The Go
redis client requires different configuration for cluster operation
than non-cluster, so we need to simulate a cluster in our integration
environment. Cluster operation requires a manual initialization step,
which you can do like so:
```
docker-compose up -d bredis
docker-compose exec bredis bash
/test/redis-create.sh
```
I still need to figure out how to make that happen automatically during
integration tests and when you run docker-compose up.
The hex values in redis.config are randomly generated passwords for the
different users.
Fixes #5723
Use `sa.SelectCertificates` instead of `sa.SelectCertificate` to
fetch the entire batch of certificates all at once, instead of doing
up to 10k individual certificate selections in serial.
Add a unit test and an integration test that both exercise the new
experimental ACME Renewal Info endpoint. These tests do not
yet validate the contents of the response, just that the appropriate
HTTP response code is returned, but they will be developed as the
code under test evolves.
Fixes #5674
- Add new function `SelectPrecertificates` to `SA` which returns `[]CertWithID`
- Replace `admin-revoker` calls to `sa.SelectCertificate(s)` with `sa.SelectPrecertificate(s)`
- Add SQL permissions for the `revoker` user to the `precertificates` table
Fixes #5708
Update the version of golangci-lint we use in our docker image,
and update the version of the docker image we use in our tests.
Fix a couple places where we were violating lints (ineffective assign
and calling `t.Fatal` from outside the main test goroutine), and add
one lint (using math/rand) to the ignore list.
Fixes #5710
This allows repeated runs using the same hierarchy, and avoids spurious
errors from ocsp-updater saying "This CA doesn't have an issuer cert
with ID XXX".
Fixes #5721
The expiration mailer processes certificates in batches of size
`certLimit` (default 100). In production, it runs in daemon mode, so it
will go on to the next batch when the current one is done. However, in
local integration tests we rely on it getting all its work done in a
single run. This works when you're running from a clean slate, but if
you've run integration tests a bunch of times, there will be a bunch of
certificates from previous runs that clog up the queue, and it won't
send mail for the specific certificate the integration test is looking
for.
Solution: Set `certLimit` very high in the config.
Also, update the default times for sending mail to match what we have in
prod.
- Replace `gorp.DbMap` with calls that use `sql.DB` directly
- Use `rows.Scan()` and `rows.Next()` to get query results (which opens the door to streaming the results)
- Export function `CertStatusMetadataFields` from `SA`
- Add new function `ScanCertStatusRow` to `SA`
- Add new function `NewDbSettingsFromDBConfig` to `SA`
Fixes #5642
Part of #5715
Previously, `startservers.start()` would implicitly build the binaries.
This splits out `startservers.install()` as a separate step that
must happen first. This is useful because it allows us to ensure the
`ceremony` tool has been built before we run `setupHierarchy`.
Also, add a `-s` flag to `curl` when checking whether start.py resulted
in a successful startup. This reduces the amount of log spam when it
fails to come up.
Remove the last of the gRPC wrapper files. In order to do so:
- Remove the `core.StorageGetter` interface. Replace it with a new
interface (whose methods include the `...grpc.CallOption` arg)
inside the `sa/proto/` package.
- Remove the `core.StorageAdder` interface. There's no real use-case
for having a write-only interface.
- Remove the `core.StorageAuthority` interface, as it is now redundant
with the autogenerated `sapb.StorageAuthorityClient` interface.
- Replace the `certificateStorage` interface (which appears in two
different places) with a single unified interface also in `sa/proto/`.
- Update all test mocks to include the `_ ...grpc.CallOption` arg in
their method signatures so they match the gRPC client interface.
- Delete many methods from mocks which are no longer necessary (mostly
because they're mocking old authz1 methods that no longer exist).
- Move the two `test/inmem/` wrappers into their own sub-packages to
avoid an import cycle.
- Simplify the `satest` package to satisfy one of its TODOs and to
avoid an import cycle.
- Add many methods to the `test/inmem/sa/` wrapper, to accommodate all
of the methods which are called in unittests.
Fixes #5600
In `sa.checkFQDNSetExists`, query both the normal `fqdnSets` and the
`fqdnSets_old` tables. The `fqdnSets` table was recently truncated to
only have 7 days worth of data, but this helper function is used to
bypass other rate limits if there exists a prior certificate for the
exact same set of names, and that functionality cares about at least
90 days worth of data. Therefore we need to query both tables, at least
until `fqdnSets` contains 90 days worth of data again.
Also make a variety of other changes to support this change: creating
the `fqdnSets_old` table in our test environment, documenting various
places where it needs to be cleaned up, and removing some unused code.
Fixes #5671
Instead of using the default `json.Unmarshal`, explicitly
construct and use a `json.Decoder` so that we can set the
`DisallowUnknownFields` flag on the decoder. This causes
any unrecognized config keys to result in errors at boulder
startup time.
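The pattern, in miniature:

```
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

type config struct {
	ListenAddress string
}

func main() {
	raw := []byte(`{"listenAddress": ":8080", "listneAddres": "typo"}`)

	var c config
	decoder := json.NewDecoder(bytes.NewReader(raw))
	decoder.DisallowUnknownFields() // unknown keys now cause errors
	err := decoder.Decode(&c)
	fmt.Println(err) // json: unknown field "listneAddres"
}
```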
Fixes #5643
Add a new method to the SA's gRPC interface which takes both an Order
and a list of new Authorizations to insert into the database, and adds
both (as well as the various ancillary rows) inside a transaction.
To enable this, add a new abstraction layer inside the `db/` package
that facilitates inserting many rows at once, as we do for the `authz2`,
`orderToAuthz2`, and `requestedNames` tables in this operation.
Finally, add a new codepath to the RA (and a feature flag to control it)
which uses this new SA method instead of separately calling the
`NewAuthorization` method multiple times. Enable this feature flag in
the config-next integration tests.
This should reduce the failure rate of the new-order flow by reducing
the number of database operations by coalescing multiple inserts into a
single multi-row insert. It should also reduce the incidence of new
authorizations being created in the database but then never exposed to
the subscriber because of a failure later in the new-order flow, both by
reducing failures overall and by adding those authorizations in a
transaction which will be rolled back if there is a later failure.
Fixes #5577
- Enable `ocsp-updater` to query for serials matching a configurable suffix to
allow for multiple `ocsp-updater` instances at once
- Add field `SerialSuffixShards` to `OCSPUpdaterConfig`
- Add field `serialSuffixShards` to `test/config-next/ocsp-updater.json`
- Add codepath to default to the previous query when `serialSuffixShards` is
missing from the JSON config
Part of #5629
Fixes #5625
Previously, caa-log-checker's core algorithm was:
1. Load every single VA (CAA) log file, producing an in-memory map of
names to the time at which they were checked
2. Iterate over the RA (Issuance) log file, checking each issuance event
to see if it occurred less than 8 hours after an event in the
in-memory map.
This consumes significant memory, as the map of all CAA checks is
redundant (contains entries for w.x.y.z, x.y.z, and y.z) and holds
unnecessary data (contains entries for CAA checks that occurred much
more than 8 hours before or after any issuance in the RA log).
Invert this algorithm, as such:
1. Load the RA (Issuance) log file, producing an in-memory map of names
to the time at which they were issued
2. Iterate over each VA (CAA) log file, removing entries from the
in-memory map if they occurred less than 8 hours after the current
CAA checking event.
This reduces the memory consumption of caa-log-checker, because the
total number of issuance events is much smaller and the map does not
need to hold redundant data. The tradeoff is that caa-log-checker can no
longer print partial output as it runs; all results are held until the
very end, when it can inspect the in-memory map to see if it is empty.
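A sketch of the inverted algorithm, simplified to exact-name matching and a single issuance per name (the real tool also has to handle parent-domain entries and repeated issuances):

```
package main

import (
	"fmt"
	"time"
)

type event struct {
	name string
	at   time.Time
}

// uncoveredIssuances builds a map of issuances up front, then streams
// CAA-check events through it, deleting each issuance that happened
// within 8 hours after a check for the same name. Whatever remains
// was issued without a sufficiently recent CAA check.
func uncoveredIssuances(issuances map[string]time.Time, checks []event) []string {
	for _, c := range checks {
		issued, ok := issuances[c.name]
		if ok && !issued.Before(c.at) && issued.Sub(c.at) < 8*time.Hour {
			delete(issuances, c.name)
		}
	}
	var uncovered []string
	for name := range issuances {
		uncovered = append(uncovered, name)
	}
	return uncovered
}

func main() {
	now := time.Now()
	issuances := map[string]time.Time{
		"a.example": now,
		"b.example": now,
	}
	checks := []event{{"a.example", now.Add(-1 * time.Hour)}}
	fmt.Println(uncoveredIssuances(issuances, checks)) // [b.example]
}
```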
Fixes #5552
Add `acceptableValidityDurations` to cert-checker's config, which
uses `cmd.ConfigDuration` instead of `int` to express the acceptable
validity periods. Deprecate the older int-based `acceptableValidityPeriods`.
This makes it easier to reason about the values in the configs, and
brings this config into line with other configs (such as the CA).
Fixes #5542
Throw away the result of parsing various command-line flags in
cert-checker. Leave the flags themselves in place to avoid breaking
any scripts which pass them, but only respect the values provided by
the config file.
Part of #5489
Delete the expired-authz-purger2 binary, as well as the
various config files, tests, and test helpers that exist to
support it.
This utility is no longer necessary, as it has not been
running for quite some time, and we have developed
alternative means of keeping the growth of the authz
table under control.
Fixes #5568
Delete the boulder-janitor binary, and the various configs
and tests which exist to support it.
This tool has not been actively running in quite some time.
The tables which it covers are either supported by our
more recent partitioning methods, or are rate-limit tables
that we hope to move out of mysql entirely. The cost of
maintaining the janitor is not offset by the benefits it brings
us (or the lack thereof).
Fixes #5569
This changeset adds a second DB connect string for the SA for use in
read-only queries that are not themselves dependencies for read-write
queries. In other words, this is attempting to only catch things like
rate-limit `SELECT`s and other coarse-counting, so we can potentially
move those read queries off the read-write primary database.
It also adds a second DB connect string to the OCSP Updater. This is a
little trickier, as the subsequent `UPDATE`s _are_ dependent on the
output of the `SELECT`, but in this case it's operating on data batches,
and a few seconds' replication latency are several orders of magnitude
below the threshold for update frequency, so any certificates that
aren't caught on run `n` can be caught on run `n+1`.
Since we export DB metrics to Prometheus, this also refactors
`InitDBMetrics` to take a DB Address (host:port tuple) and User out of
the DB connection DSN and include those as labels in the metrics.
Fixes #5550
Fixes #4985
Add go1.17beta1 docker images to the set of things we build,
and integrate go1.17beta1 into the set of environments CI runs.
Fix one test which breaks due to an underlying refactoring in
the `crypto/x509` stdlib package. Fix one other test which breaks
due to new guarantees in the stdlib's TLS ALPN implementation.
Also removes go1.16.5 from CI so we're only running 2 versions.
Fixes #5480
In go1.17, the `x509.CreateCertificate()` method fails if the provided
Signer (private key) and Parent (cert with public key) do not match.
This change both updates the lint library to create and use an issuer
cert whose public key matches the throwaway private key used for lint
signatures, and overhauls its public interface for readability and
simplicity.
Rename the `lint` library to `linter`, to allow other methods to be
renamed to reduce word repetition. Reduce the linter library interface
to three functions: `Check()`, `New()`, and `Linter.Check()` by making
all helper functions private. Refactor the top-level `Check()` method to
rely on `New()` and `Linter.Check()` behind the scenes. Finally, create
a new helper method for creating a lint issuer certificate, call this
new method from `New()`, and store the result in the `Linter` struct.
Part of #5480
These parameters are currently accepted by cert-checker
either via the command line or via the config. Add them to
the config-next config as we move towards deprecating
their CLI equivalents.
Part of #5489
In the CA, compute the notAfter timestamp such that the cert is actually
valid for the intended duration, not for one second longer. In the
Issuance library, compute the validity period by including the full
length of the final second indicated by the notAfter date when
determining if the certificate request matches our profile. Update tests
and config files to match.
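The off-by-one-second fix, in essence:

```
package main

import (
	"fmt"
	"time"
)

func main() {
	validity := 90 * 24 * time.Hour
	notBefore := time.Date(2021, 6, 1, 0, 0, 0, 0, time.UTC)

	// notAfter is inclusive: a cert is valid through the whole final
	// second. To be valid for exactly `validity`, subtract one second
	// rather than adding the full duration.
	notAfter := notBefore.Add(validity - time.Second)

	// When validating a profile, add the final second back before
	// comparing against the intended validity period.
	actual := notAfter.Add(time.Second).Sub(notBefore)
	fmt.Println(actual == validity) // true
}
```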
Fixes #5473
Add a new `acceptableValidityPeriods` field to cert-checker's config.
This field is a list of integers representing validity periods measured
in seconds (so 7776000 is 90 days). This field is multi-valued to enable
transitions between different validity periods (e.g. 90 days + 1 second
to 90 days, or 90 days to 30 days). If the field is not provided,
cert-checker defaults to 90 days.
Also update the way that cert-checker computes the validity period of
the certificates it is checking to include the full width of the final
second represented by the notAfter timestamp.
Finally, update the tests to support this new behavior.
Fixes #5472
In prod, the CA is now configured to issue certificates with
notAfter timestamps 7775999 seconds after their notBefore
timestamp, and to enforce that same difference when validating
issuance requests.
Update our test configs to match.
- Remove field `ECDSAAllowedAccounts` from CA
- Remove `ECDSAAllowedAccounts` from CA tests
- Replace `ECDSAAllowedAccounts` with `ECDSAAllowListFilename` in
`test/config/ca-a.json` and `test/config/ca-b.json`
- Add YAML allow list file at `test/config/ecdsaAllowList.yml`
Fixes #5394
Add tool to audit subscriber registrations for e-mail addresses that
`notify-mailer` is currently configured to skip.
- Add `cmd/contact-auditor` with README
- Add test coverage for `cmd/contact-auditor`
- Add config file at `test/config/contact-auditor`
Part of #5372
Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).
Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.
Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
Add a new rate limit, identical in implementation to the current
`CertificatesPerFQDNSet` limit, intended to always have both a lower
window and a lower threshold. This allows us to block runaway clients
quickly, and give their owners the ability to fix and try again quickly
(on the order of hours instead of days).
Configure the integration tests to set this new limit at 2 certs per 2
hours. Also increase the existing limit from 5 to 6 certs in 7 days, to
allow clients to hit the first limit three times before being fully
blocked for the week. Also add a new integration test to verify this
behavior.
Note that the new ratelimit must have a window greater than the
configured certificate backdate (currently 1 hour) in order to be
useful.
Fixes #5210
Solve a nil pointer dereference of `ecdsaAllowList` in `boulder-ca` by
calling `reloader.New()` in constructor `ca.NewECDSAAllowListFromFile`
instead.
- Add missing entry `ECDSAAllowListFilename` to
`test/config-next/ca-a.json` and `test/config-next/ca-b.json`
- Add missing file `ecdsaAllowList.yml` to `test/config-next`
- Add missing entry `ECDSAAllowedAccounts` to `test/config/ca-a.json`
and `test/config/ca-b.json`
- Move creation of the reloader to `NewECDSAAllowListFromFile`
Fixes #5414