Add a new code path to the ctpolicy package which enforces Chrome's new
CT Policy, which requires that SCTs come from logs run by two different
operators, rather than one Google and one non-Google log. To achieve
this, invert the "race" logic: rather than assuming we always have two
groups, and racing the logs within each group against each other, we now
race the various groups against each other, and pick just one arbitrary
log from each group to attempt submission to.
Ensure that the new code path does the right thing by adding a new zlint
which checks that the two SCTs embedded in a certificate come from logs
run by different operators. To support this lint, which needs to have a
canonical mapping from logs to their operators, import the Chrome CT Log
List JSON Schema and autogenerate Go structs from it so that we can
parse a real CT Log List. Also add flags to all services which run these
lints (the CA and cert-checker) to let them load a CT Log List from disk
and provide it to the lint.
Finally, since we now have the ability to load a CT Log List file
anyway, use this capability to simplify configuration of the RA. Rather
than listing all of the details for each log we're willing to submit to,
simply list the names (technically, Descriptions) of each log, and look
up the rest of the details from the log list file.
To support this change, SRE will need to deploy log list files (the real
Chrome log list for prod, and a custom log list for staging) and then
update the configuration of the RA, CA, and cert-checker. Once that
transition is complete, the deletion TODOs left behind by this change
will be able to be completed, removing the old RA configuration and old
ctpolicy race logic.
Part of #5938
This reverts commit 5fe5859c38.
Per #5973:
> we will eventually want to go back to doing this in boulder-tools, so it's easy
> to run the lints locally. But this is useful so we can unblock testing on go 1.18beta2.
Fix a race condition in one of the TestMultiVA test cases.
There are many other possible flakes in these tests, because
they use real HTTP to talk to a fake remote VA on localhost,
but we can at least trivially remove this race condition.
Fixes#6119
This reverts commit 7ef6913e71.
We turned on the `ExpirationMailerDontLookTwice` feature flag in prod, and it's
working fine but not clearing the backlog. Since
https://github.com/letsencrypt/boulder/pull/6100 fixed the issue that caused us
to (nearly) stop sending mail when we deployed #6057, this should be safe to
roll forward.
The revert of the revert applied cleanly, except for expiration-mailer/main.go
and `main_test.go`, particularly around the contents `processCerts` (where
`sendToOneRegID` was extracted from) and `sendToOneRegID` itself. So those areas
are good targets for extra attention.
Update:
- golangci-lint from v1.42.1 to v1.46.2
- protoc from v3.15.6 to v3.20.1
- protoc-gen-go from v1.26.0 to v1.28.0
- protoc-gen-go-grpc from v1.1.0 to v1.2.0
- fpm from v1.14.0 to v1.14.2
Also remove a reference to go1.17.9 from one last place.
This does result in updating all of our generated .pb.go files, but only
to update the version number embedded in each file's header.
Fixes#6123
If the test unexpectedly failed, it would hit a nil pointer dereference
on line 511 when it tries to access the `.Type` field of nil. Add another
case to handle this.
Profiling showed that the unit tests were spending almost all of
their time in key generation, because the tests were generating
new fake keys for the linter with every call to `setup()`. Instead,
create the linters just once at startup time.
This decreases the runtime of `go test ./ca/...` from ~9.7s to less
than one second.
Stop logging the full CSR bytes at the WFE; we log these bytes in the
CA immediately before attempting to sign a precertificate anyway.
Also unify the way we log precertificate and final certificate signing
events in the CA: an Info containing the request prior to signing, and
then either an Info or an Error containing the result afterwards.
Fixes#6088
We recently landed a fix so the expiration-mailer won't look twice at
the same certificate. This will cause an immediate behavior change when
it is deployed, and that might have surprising effects. Put the fix
behind a feature flag so we can control when it rolls out more
carefully.
findExpiringCertificates had a previously unstated invariant: It is
supposed to only examine each certificate once per nag window. Otherwise,
the set of certificates it has to examine may get clogged up with
certificates it has previously looked at, and it will fall behind. It
does this by updating lastExpirationNagSent after examining each
certificate. For certificates that have already been renewed, we update
this field even though we didn't actually send any mail.
We accidentally broke this invariant in #6057. When that change rolled
out to staging, suddenly our rate of sent mail plummeted to near-zero,
and our nags_at_capacity metric went to 1 for all expiration windows,
and stayed there until we deployed a revert.
The problem was here:
https://github.com/letsencrypt/boulder/pull/6057/files#diff-ae49e23ff8be05aae6145106f04c76fc1f0b7b336c0cdcbe8183f572dedf47c5R261-R263.
We check if `reg.Contacts` is empty, and bail. Prior to #6057,
that check happened _after_ checking for renewal and updating
lastExpirationNagSent. In the code shuffle of #6057, that check got
moved to _before_ checking for renewal. So expiration-mailer's input
became clogged with certificates that had already been renewed, but were
issued by accounts with no contact.
There's a second problem that has existed practically forever: we hit the
`reg.Contacts` check for no-contact-info certificates that have _not_
been renewed, and those also clog up expiration mailer. This problem is
probably causing some amount of slowness today in both prod and staging.
I wrote unittests to check the first and second problem, and verified
that they fail as appropriate. The second one fails with current `main`;
the first one fails if I simulate #6057 by moving the `reg.Contacts`
check higher in the file.
The fix is simple: delete the `reg.Contacts` check entirely. It's valid
to call `sendNags` with an empty contacts list, and will return a nil
error. That allows findExpiringCertificates to proceed to updating
lastExpirationNagSent for those certificates, so they won't be examined
again on the next iteration. I also removed another spurious length check
in `sendNags`, since there's an equivalent length check a few lines lower.
Related to #5682
In multiSource, if the primary source returned before the secondary
source, we would leak the goroutine waiting on results from the secondary
source. This was because that goroutine was writing to a channel that
no longer had any readers.
Fix that by making the channel in getResponse buffered, so getResponse
can always write to its channel, regardless of whether there are readers.
I've confirmed that the unittest fails when run without the fix.
Add functionality to the ubiquitous RequestEvent (aka logEvent) to
allow handlers to suppress the final log line that is printed when
a non-500 response is being sent.
Use this functionality to suppress logging GET requests for the /directory
endpoint. We can expand this in the future to quiet other logs that
are not helpful for metrics or analysis.
Fixes#6094
The go-redis docs say default is 10 * NumCPU, but the actual code says 5.
Extra context:
2465baaab5/options.go (L143-L145)2465baaab5/cluster.go (L96-L98)
For Options, the default (documented) is 10 * NumCPUs. For ClusterOptions, the
default (undocumented) is 5 * NumCPUs. We use ClusterOptions. Also worth noting:
for ClusterOptions, the limit is per node.
This was introduced when expiration-mailer was run by cron, and was a
way for expiration-mailer to know something about its expected run
interval so it could send notifications "on time" rather than "just
after" the configured email time.
Now that expiration-mailer runs as a daemon we can simply pull this
value from `Frequency`, which is set to the same value in prod.
Most of our existing Redis metrics are related to the connection pool,
but some sound like they could be request-oriented (like redis_hits and
redis_misses). Fix that with better naming. Also, join three related
metrics (hits, misses, and timeouts) under a single metric name with
labels so they more easily by grouped and counted.
Move the stat creation into rocsp/metrics.go so the names are in the
same file as the collector.
Add a unittest for rocsp metrics.
We have various long-running loops in expiration-mailer. When a context
is canceled, rather than running through the loops to their end (and
potentially spamming logs with "context canceled" errors), we should
return early.
Fixes#6069
When testing the SA from within the SA package itself, we can directly
call its methods without having to go through the gRPC client interface.
Rather than needing to mock out gRPC client and server streams that
talk to each other over a channel, we can simply mock the server stream
and read from it directly.
The inMemSA adapter work will still be necessary when (for example)
testing an RA which uses a real in-memory SA to serve as its test
backend, rather than mocking out the SA. In that case, we need to wrap
the `storageAuthorityImpl` in something which causes it to match the
`sapb.StorageAuthorityClient` interface expected by the RA. That's what
the `//test/inmem/` package is for, and we'll move this wrapper code there
when we need it.
By testing these methods directly, rather than through an inmem wrapper,
we remove a potential cyclic import that was preventing us from moving
the inmem stream adapter into the `//test/inmem` package itself.
These new linters are almost all part of golangci-lint's collection
of default linters, that would all be running if we weren't setting
`disable-all: true`. By adding them, we now have parity with the
default configuration, as well as the additional linters we like.
Adds the following linters:
* unconvert
* deadcode
* structcheck
* typecheck
* varcheck
* wastedassign
There was a lot of copy-paste code, and in particular one of
the most important pieces of setup information (the value of the
`lastExpirationNagSent` field) we often hidden off to the right of the
screen. This extracts out common logic into helper functions and replaces
manual INSERTs with gorp inserts.
Instead write `[]`, a better representation of an empty contact set,
and avoid having literal JSON `null`s in our database.
As part of doing so, add some extra code to //sa/model.go that
bypasses the need for //sa/type-converter.go to do any magic
JSON-to-string-slice conversions for us.
Fixes#6074
Generate .deb packages for all currently configured Go versions
(usually the current and upcoming versions that we use in prod), rather
than just the one default version. Also ensure that the uploaded
artifacts have 8-character short hashes in their names.
Unfortunately this does require updating Go versions in one additional
place (the release.yml file), since we are no longer parsing it out of the
docker-compose.yml. This is unavoidable without hacks that I consider
to be even uglier than the repetition.
Fixes#6075Fixes#6084
Bring in the release docs from the boulder-release-process repo,
so that they're adjacent to all of our other docs. This allows us to
delete that repo. Also update references to that repo to instead point
to the new doc here.
Also make minor organization updates to other docs to keep the root
of this repository clean.
When deployed, the newly-parallel expiration-mailer encountered
unexpected difficulties and dropped to apparently sending nearly zero
emails despite not throwing any real errors. Reverting the parallelism
change until we understand and can fix the root cause.
This reverts two commits:
- Allow expiration mailer to work in parallel (#6057)
- Fix data race in expiration-mailer test mocks (#6072)
It also modifies the revert to leave the new `ParallelSends` config key
in place (albeit completely ignored), so that the binary containing this
revert can be safely deployed regardless of config status.
Part of #5682
- Fix cert-checker to use the syslog and stdout logging facilities it
reads from the config file instead of having them hard-coded to zero.
- Fix cert-checker to handle a nil response from mariadb if no records
are found.
- Fix comment in log.go to correctly describe when the initialize function
and therefore default values would be used.
Fixes#6067
Although each goroutine gets its own `mocks.mockMailerConn`, each one
of those is racing with the others whenever they try to update the list
of sent messages in their parent `mocks.Mailer`. This leads to data races
in the unit tests (but, thankfully, not in the production code).
Introduce a mutex around the slice of sent messages to prevent the race.
Fixes#6070
Redis recently released version 7.0.0, which has several breaking
changes. The go-redis library that we rely on does not yet support
communicating with a Redis 7.0.0 cluster.
Pin ourselves to the latest non-7.0.0 version, 6.2.7, until such time
as go-redis releases a version with support for 7.0.0.
Fixes#6071
ServiceConfig is only needed for components that act as gRPC services.
For rocsp-tool, which is a gRPC client but is long-running enough to
merit a debug port, we should provide DebugAddr in the config.
By default, Boulder's feature flag code verifies that the list of flags
being set (from a JSON file) maps to actually-existing flags.
However, this gets in the way of a deployment strategy where feature
flags are added to config templates during a staging deploy with "true"
or "false" filled in depending on production or staging status - for
instance, when rolling out a deprecation to staging ahead of production.
If those configs get rolled to prod before the corresponding Boulder
deploy, Boulder will refuse to start up, even though it would be fine to
start up with the unrecognized flag ignored.
The envisioned deployment behavior here is that prod will have
AllowUnrecognizedFeatures: true while staging will have it set to false,
to ensure that misspellings of feature flag names are caught during
staging deploy. As a correlary, this assumes that the list of flags in
configs will be the same between staging and prod, with only their
values changing.
Previously, each accounts email would be sent in serial,
along with several reads from the database (to check for
certificate renewal) and several writes to the database (to update
`certificateStatus.lastExpirationNagSent`). This adds a config field
for the expiration mailer that sets the parallelism it will use.
That means making and using multiple SMTP connections as well. Previously,
`bmail.Mailer` was not safe for concurrent use. It also had a piece of
API awkwardness: after you created a Mailer, you had to call Connect on
it to change its state.
Instead of treating that as a state change on Mailer, I split out a
separate component: `bmail.Conn`. Now, when you call `Mailer.Connect()`,
you get a Conn. You can send mail on that Conn and Close it when you're
done. A single Mailer instance can produce multiple Conns, so Mailer is
now concurrency-safe (while Conn is not).
This involved a moderate amount of renaming and code movement, and
GitHub's move detector is not keeping up 100%, so an eye towards "is
this moved code?" may help. Also adding `?w=1` to the diff URL to ignore
whitespace diffs.
This allows us to narrow permissions by only granting write privileges
to the upload portion of the job (which doesn't run any code from our
repo). It also allows us to verify that the release build works on every
commit, while only generating releases on actual release tags.