Add `identifier` fields, which will soon replace the `dnsName` fields,
to:
- `corepb.Authorization`
- `corepb.Order`
- `rapb.NewOrderRequest`
- `sapb.CountFQDNSetsRequest`
- `sapb.CountInvalidAuthorizationsRequest`
- `sapb.FQDNSetExistsRequest`
- `sapb.GetAuthorizationsRequest`
- `sapb.GetOrderForNamesRequest`
- `sapb.GetValidAuthorizationsRequest`
- `sapb.NewOrderRequest`
Populate these `identifier` fields in every function that creates
instances of these structs.
Use these `identifier` fields instead of `dnsName` fields (at least
preferentially) in every function that uses these structs. When crossing
component boundaries, don't assume they'll be present, for
deployability's sake.
Deployability note: Mismatched `cert-checker` and `sa` versions will be
incompatible because of a type change in the arguments to
`sa.SelectAuthzsMatchingIssuance`.
Part of #7311
Use policy.ValidEmail to vet email addresses before sending expiration
notifications to them. This same check is performed by notify-mailer,
and it helps reduce the number of invalid addresses we attempt to send
to and the number of email bounces we generate.
Additionally, mark certificates as having had a nag email sent if there
are no valid addresses for us to send to, so that we don't constantly
retry them.
Fixes https://github.com/letsencrypt/boulder/issues/5372
This design seeks to reduce read-pressure on our DB by moving rate limit
tabulation to a key-value datastore. This PR provides the following:
- (README.md) a short guide to the schemas, formats, and concepts
introduced in this PR
- (source.go) an interface for storing, retrieving, and resetting a
subscriber bucket
- (name.go) an enumeration of all defined rate limits
- (limit.go) a schema for defining default limits and per-subscriber
overrides
- (limiter.go) a high-level API for interacting with key-value rate
limits
- (gcra.go) an implementation of the Generic Cell Rate Algorithm, a
leaky bucket-style scheduling algorithm, used to calculate the present
or future capacity of a subscriber bucket using spend and refund
operations
Note: the included source implementation is test-only and currently
accomplished using a simple in-memory map protected by a mutex,
implementations using Redis and potentially other data stores will
follow.
Part of #5545
This change replaces [gorp] with [borp].
The changes consist of a mass renaming of the import and comments / doc
fixups, plus modifications of many call sites to provide a
context.Context everywhere, since gorp newly requires this (this was one
of the motivating factors for the borp fork).
This also refactors `github.com/letsencrypt/boulder/db.WrappedMap` and
`github.com/letsencrypt/boulder/db.Transaction` to not embed their
underlying gorp/borp objects, but to have them as plain fields. This
ensures that we can only call methods on them that are specifically
implemented in `github.com/letsencrypt/boulder/db`, so we don't miss
wrapping any. This required introducing a `NewWrappedMap` method along
with accessors `SQLDb()` and `BorpDB()` to get at the internal fields
during metrics and logging setup.
Fixes#6944
Previously, we had three chained calls initializing a database:
- InitWrappedDb calls NewDbMap
- NewDbMap calls NewDbMapFromConfig
Since all three are exporetd, this left me wondering when to call one vs
the others.
It turns out that NewDbMap is only called from tests, so I renamed it to
DBMapForTest to make that clear.
NewDbMapFromConfig is only called internally to the SA, so I made it
unexported it as newDbMapFromMysqlConfig.
Also, I copied the ParseDSN call into InitWrappedDb, so it doesn't need
to call DBMapForTest. Now InitWrappedDb and DBMapForTest both
independently call newDbMapFromMysqlConfig.
I also noticed that InitDBMetrics was only called internally so I
unexported it.
Enable the errcheck linter. Update the way we express exclusions to use
the new, non-deprecated, non-regex-based format. Fix all places where we
began accidentally violating errcheck while it was disabled.
Use constants from the go stdlib time package, such as time.DateTime and
time.RFC3339, when parsing and formatting timestamps. Additionally,
simplify or remove some of our uses of parsing timestamps, such as to
set fake clocks in tests.
Add a new time.Duration field, LagFactor, to both the SA's config struct
and the read-only SA's implementation struct. In the GetRegistration,
GetOrder, and GetAuthorization2 methods, if the database select returned
a NoRows error and a lagFactor duration is configured, then sleep for
lagFactor seconds and retry the select.
This allows us to compensate for the replication lag between our primary
write database and our read-only replica databases. Sometimes clients
will fire requests in rapid succession (such as creating a new order,
then immediately querying the authorizations associated with that
order), and the subsequent requests will fail because they are directed
to read replicas which are lagging behind the primary. Adding this
simple sleep-and-retry will let us mitigate many of these failures,
without adding too much complexity.
Fixes#6593
Right now the expiration mailer does one big SELECT on
`certificateStatus` to find certificates to work on, then several
thousand SELECTs of individual serial numbers in `certificates`.
Since it's more efficient to get that data as a stream from a single
query, rather than thousands of separate queries, turn that into a JOIN.
NOTE: We used to use a JOIN, and switched to the current approach in
#2440 for performance reasons. I _believe_ part of the issue was that at
the time we were not using READ UNCOMMITTED, so we may have been slowing
down the database by requiring it to keep copies of a lot of rows during
the query. Still, it's possible that I've misunderstood the performance
characteristics here and it will still be a regression to use JOIN. So
I've gated the new behavior behind a feature flag.
The feature flag required extracting a new function, `getCerts`. That in
turn required changing some return types so we are not as closely tied
to `core.Certificate`. Instead we use a new local type named
`certDERWithRegId`, which can be provided either by the new code path or
the old code path.
Create a new gRPC service named StorageAuthorityReadOnly which only
exposes a read-only subset of the existing StorageAuthority service's
methods.
Implement this by splitting the existing SA in half, and having the
read-write half embed and wrap an instance of the read-only half.
Unfortunately, many of our tests use exported read-write methods as part
of their test setup, so the tests are all being performed against the
read-write struct, but they are exercising the same code as the
read-only implementation exposes.
Expose this new service at the SA on the same port as the existing
service, but with (in config-next) different sets of allowed clients. In
the future, read-only clients will be removed from the read-write
service's set of allowed clients.
Part of #6454
- Move incidents tables from `boulder_sa` to `incidents_sa` (added in #6344)
- Grant read perms for all tables in `incidents_sa`
- Modify unit tests to account for new schema and grants
- Add database cleaning func for `boulder_sa`
- Adjust cleanup funcs to omit `sql-migrate` tables instead of `goose`
Resolves#6328
This reverts commit 7ef6913e71.
We turned on the `ExpirationMailerDontLookTwice` feature flag in prod, and it's
working fine but not clearing the backlog. Since
https://github.com/letsencrypt/boulder/pull/6100 fixed the issue that caused us
to (nearly) stop sending mail when we deployed #6057, this should be safe to
roll forward.
The revert of the revert applied cleanly, except for expiration-mailer/main.go
and `main_test.go`, particularly around the contents `processCerts` (where
`sendToOneRegID` was extracted from) and `sendToOneRegID` itself. So those areas
are good targets for extra attention.
We recently landed a fix so the expiration-mailer won't look twice at
the same certificate. This will cause an immediate behavior change when
it is deployed, and that might have surprising effects. Put the fix
behind a feature flag so we can control when it rolls out more
carefully.
findExpiringCertificates had a previously unstated invariant: It is
supposed to only examine each certificate once per nag window. Otherwise,
the set of certificates it has to examine may get clogged up with
certificates it has previously looked at, and it will fall behind. It
does this by updating lastExpirationNagSent after examining each
certificate. For certificates that have already been renewed, we update
this field even though we didn't actually send any mail.
We accidentally broke this invariant in #6057. When that change rolled
out to staging, suddenly our rate of sent mail plummeted to near-zero,
and our nags_at_capacity metric went to 1 for all expiration windows,
and stayed there until we deployed a revert.
The problem was here:
https://github.com/letsencrypt/boulder/pull/6057/files#diff-ae49e23ff8be05aae6145106f04c76fc1f0b7b336c0cdcbe8183f572dedf47c5R261-R263.
We check if `reg.Contacts` is empty, and bail. Prior to #6057,
that check happened _after_ checking for renewal and updating
lastExpirationNagSent. In the code shuffle of #6057, that check got
moved to _before_ checking for renewal. So expiration-mailer's input
became clogged with certificates that had already been renewed, but were
issued by accounts with no contact.
There's a second problem that has existed practically forever: we hit the
`reg.Contacts` check for no-contact-info certificates that have _not_
been renewed, and those also clog up expiration mailer. This problem is
probably causing some amount of slowness today in both prod and staging.
I wrote unittests to check the first and second problem, and verified
that they fail as appropriate. The second one fails with current `main`;
the first one fails if I simulate #6057 by moving the `reg.Contacts`
check higher in the file.
The fix is simple: delete the `reg.Contacts` check entirely. It's valid
to call `sendNags` with an empty contacts list, and will return a nil
error. That allows findExpiringCertificates to proceed to updating
lastExpirationNagSent for those certificates, so they won't be examined
again on the next iteration. I also removed another spurious length check
in `sendNags`, since there's an equivalent length check a few lines lower.
Related to #5682
This was introduced when expiration-mailer was run by cron, and was a
way for expiration-mailer to know something about its expected run
interval so it could send notifications "on time" rather than "just
after" the configured email time.
Now that expiration-mailer runs as a daemon we can simply pull this
value from `Frequency`, which is set to the same value in prod.
There was a lot of copy-paste code, and in particular one of
the most important pieces of setup information (the value of the
`lastExpirationNagSent` field) we often hidden off to the right of the
screen. This extracts out common logic into helper functions and replaces
manual INSERTs with gorp inserts.
When deployed, the newly-parallel expiration-mailer encountered
unexpected difficulties and dropped to apparently sending nearly zero
emails despite not throwing any real errors. Reverting the parallelism
change until we understand and can fix the root cause.
This reverts two commits:
- Allow expiration mailer to work in parallel (#6057)
- Fix data race in expiration-mailer test mocks (#6072)
It also modifies the revert to leave the new `ParallelSends` config key
in place (albeit completely ignored), so that the binary containing this
revert can be safely deployed regardless of config status.
Part of #5682
Previously, each accounts email would be sent in serial,
along with several reads from the database (to check for
certificate renewal) and several writes to the database (to update
`certificateStatus.lastExpirationNagSent`). This adds a config field
for the expiration mailer that sets the parallelism it will use.
That means making and using multiple SMTP connections as well. Previously,
`bmail.Mailer` was not safe for concurrent use. It also had a piece of
API awkwardness: after you created a Mailer, you had to call Connect on
it to change its state.
Instead of treating that as a state change on Mailer, I split out a
separate component: `bmail.Conn`. Now, when you call `Mailer.Connect()`,
you get a Conn. You can send mail on that Conn and Close it when you're
done. A single Mailer instance can produce multiple Conns, so Mailer is
now concurrency-safe (while Conn is not).
This involved a moderate amount of renaming and code movement, and
GitHub's move detector is not keeping up 100%, so an eye towards "is
this moved code?" may help. Also adding `?w=1` to the diff URL to ignore
whitespace diffs.
Adds a rocsp redis client to the sa if cluster information is provided in the
sa config. If a redis cluster is configured, all new certificate OCSP
responses added with sa.AddPrecertificate will attempt to be written to
the redis cluster, but will not block or fail on errors.
Fixes: #5871
This reverts commit e3ce816425,
which was reviewed in https://github.com/letsencrypt/boulder/pull/5607.
This change caused database queries to exceed the maximum packet size
and fail. Because this was an opportunistic optimization, reverting it
is the safest course moving forward.
Use `sa.SelectCertificates` instead of `sa.SelectCertificate` to
fetch the entire batch of certificates all at once, instead of doing
up to 10k individual certificate selections in serial.
The resulting `boulder` binary can be invoked by different names to
trigger the behavior of the relevant subcommand. For instance, symlinking
and invoking as `boulder-ca` acts as the CA. Symlinking and invoking as
`boulder-va` acts as the VA.
This reduces the .deb file size from about 200MB to about 20MB.
This works by creating a registry that maps subcommand names to `main`
functions. Each subcommand registers itself in an `init()` function. The
monolithic `boulder` binary then checks what name it was invoked with
(`os.Args[0]`), looks it up in the registry, and invokes the appropriate
`main`. To avoid conflicts, all of the old `package main` are replaced
with `package notmain`.
To get the list of registered subcommands, run `boulder --list`. This
is used when symlinking all the variants into place, to ensure the set
of symlinked names matches the entries in the registry.
Fixes#5692
Remove the last of the gRPC wrapper files. In order to do so:
- Remove the `core.StorageGetter` interface. Replace it with a new
interface (whose methods include the `...grpc.CallOption` arg)
inside the `sa/proto/` package.
- Remove the `core.StorageAdder` interface. There's no real use-case
for having a write-only interface.
- Remove the `core.StorageAuthority` interface, as it is now redundant
with the autogenerated `sapb.StorageAuthorityClient` interface.
- Replace the `certificateStorage` interface (which appears in two
different places) with a single unified interface also in `sa/proto/`.
- Update all test mocks to include the `_ ...grpc.CallOption` arg in
their method signatures so they match the gRPC client interface.
- Delete many methods from mocks which are no longer necessary (mostly
because they're mocking old authz1 methods that no longer exist).
- Move the two `test/inmem/` wrappers into their own sub-packages to
avoid an import cycle.
- Simplify the `satest` package to satisfy one of its TODOs and to
avoid an import cycle.
- Add many methods to the `test/inmem/sa/` wrapper, to accommodate all
of the methods which are called in unittests.
Fixes#5600
Overhaul how expiration-mailer checks to see if a `certIsRenewed`.
First, change the helper function to take the list of names (which
can be hashed into an fqdnSet) and the issuance date. This allows the
search for renewals to be a much simpler linear scan rather than an
ugly outer left join. Second, update the query to examine both the
`fqdnSets` and `fqdnSets_old` tables, to account for the fact that
this code cares about more time (~90d) than the `fqdnSets` table
currently holds.
Also export the SA's `HashNames` method so it can be used by the mailer,
and update the mailer's tests to use correct name hashes instead of
fake human-readable hashes.
Fixes#5672
Remove all error checking and type transformation from the gRPC wrappers
for the following methods on the SA:
- GetRegistration
- GetRegistrationByKey
- NewRegistration
- UpdateRegistration
- DeactivateRegistration
Update callers of these methods to construct the appropriate protobuf
request messages directly, and to consume the protobuf response messages
directly. In many cases, this requires changing the way that clients
handle the `Jwk` field (from expecting a `JSONWebKey` to expecting a
slice of bytes) and the `Contacts` field (from expecting a possibly-nil
pointer to relying on the value of the `ContactsPresent` boolean field).
Implement two new methods in `sa/model.go` to convert directly between
database models and protobuf messages, rather than round-tripping
through `core` objects in between. Delete the older methods that
converted between database models and `core` objects, as they are no
longer necessary.
Update test mocks to have the correct signatures, and update tests to
not rely on `JSONWebKey` and instead use byte slices.
Fixes#5531
This changeset adds a second DB connect string for the SA for use in
read-only queries that are not themselves dependencies for read-write
queries. In other words, this is attempting to only catch things like
rate-limit `SELECT`s and other coarse-counting, so we can potentially
move those read queries off the read-write primary database.
It also adds a second DB connect string to the OCSP Updater. This is a
little trickier, as the subsequent `UPDATE`s _are_ dependent on the
output of the `SELECT`, but in this case it's operating on data batches,
and a few seconds' replication latency are several orders of magnitude
below the threshold for update frequency, so any certificates that
aren't caught on run `n` can be caught on run `n+1`.
Since we export DB metrics to Prometheus, this also refactors
`InitDBMetrics` to take a DB Address (host:port tuple) and User out of
the DB connection DSN and include those as labels in the metrics.
Fixes#5550Fixes#4985
Historically the only database/sql driver setting exposed via JSON
config was maxDBConns. This change adds support for maxIdleConns,
connMaxLifetime, connMaxIdleTime, and renames maxDBConns to
maxOpenConns. The addition of these settings will give our SRE team a
convenient method for tuning the reuse/closure of database connections.
A new struct, DBSettings, has been added to SA. The struct, and each of
it's fields has been commented.
All new fields have been plumbed through to the relevant Boulder
components and exported as Prometheus metrics. Tests have been
added/modified to ensure that the fields are being set. There should be
no loss in coverage
Deployability concerns for the migration from maxDBConns to maxOpenConns
have been addressed with the temporary addition of the helper method
cmd.DBConfig.GetMaxOpenConns(). This method can be removed once
test/config is defaulted to using maxOpenConns. Relevant sections of the
code have TODOs added that link back to an newly opened issue.
Fixes#5199
In a handful of places I've nuked old stats which are not used in any alerts or dashboards as they either duplicate other stats or don't provide much insight/have never actually been used. If we feel like we need them again in the future it's trivial to add them back.
There aren't many dashboards that rely on old statsd style metrics, but a few will need to be updated when this change is deployed. There are also a few cases where prometheus labels have been changed from camel to snake case, dashboards that use these will also need to be updated. As far as I can tell no alerts are impacted by this change.
Fixes#4591.
New types and related infrastructure are added to the `db` package to allow
wrapping gorp DbMaps and Transactions.
The wrapped versions return a special `db.ErrDatabaseOp` error type when errors
occur. The new error type includes additional information such as the operation
that failed and the related table.
Where possible we determine the table based on the types of the gorp function
arguments. Where that isn't possible (e.g. with raw SQL queries) we try to use
a simple regexp approach to find the table name. This isn't great for general
SQL but works well enough for Boulder's existing SQL queries.
To get additional confidence my regexps work for all of Boulder's queries
I temporarily changed the `db` package's `tableFromQuery` function to panic if
the table couldn't be determined. I re-ran the full unit and integration test
suites with this configuration and saw no panics.
Resolves https://github.com/letsencrypt/boulder/issues/4559
When a nag group hits capacity we set the nagsAtCapacity gauge to 1.
This gauge also needs to be reset to 0 when the nag group is no longer
at capacity.
* in boulder-ra we connected to the publisher and created a publisher gRPC client twice for no apparent reason
* in the SA we ignored errors from `getChallenges` in `GetAuthorizations` which could result in a nil challenge being returned in an authorization
Since we can make up to 100 SQL queries from this method (based on the 100-SAN
limit), sometimes it is too slow and we get a timeout for large certificates. By
running some of those queries in parallel, we can speed things up and stop
getting timeouts.
This commit replaces the Boulder dependency on
gopkg.in/square/go-jose.v1 with gopkg.in/square/go-jose.v2. This is
necessary both to stay in front of bitrot and because the ACME v2 work
will require a feature from go-jose.v2 for JWS validation.
The largest part of this diff is cosmetic changes:
Changing import paths
jose.JsonWebKey -> jose.JSONWebKey
jose.JsonWebSignature -> jose.JSONWebSignature
jose.JoseHeader -> jose.Header
Some more significant changes were caused by updates in the API for
for creating new jose.Signer instances. Previously we constructed
these with jose.NewSigner(algorithm, key). Now these are created with
jose.NewSigner(jose.SigningKey{},jose.SignerOptions{}). At present all
signers specify EmbedJWK: true but this will likely change with
follow-up ACME V2 work.
Another change was the removal of the jose.LoadPrivateKey function
that the wfe tests relied on. The jose v2 API removed these functions,
moving them to a cmd's main package where we can't easily import them.
This function was reimplemented in the WFE's test code & updated to fail
fast rather than return errors.
Per CONTRIBUTING.md I have verified the go-jose.v2 tests at the imported
commit pass:
ok gopkg.in/square/go-jose.v2 14.771s
ok gopkg.in/square/go-jose.v2/cipher 0.025s
? gopkg.in/square/go-jose.v2/jose-util [no test files]
ok gopkg.in/square/go-jose.v2/json 1.230s
ok gopkg.in/square/go-jose.v2/jwt 0.073s
Resolves#2880