When deployed, the newly-parallel expiration-mailer encountered
unexpected difficulties and dropped to sending what appeared to be
nearly zero emails, despite not reporting any errors. We are reverting
the parallelism change until we understand and can fix the root cause.
This reverts two commits:
- Allow expiration mailer to work in parallel (#6057)
- Fix data race in expiration-mailer test mocks (#6072)
It also modifies the revert to leave the new `ParallelSends` config key
in place (albeit completely ignored), so that the binary containing this
revert can be safely deployed regardless of config status.
Part of #5682
Previously, each account's email would be sent in serial,
along with several reads from the database (to check for
certificate renewal) and several writes to the database (to update
`certificateStatus.lastExpirationNagSent`). This adds a config field
for the expiration mailer that sets the parallelism it will use.
That means making and using multiple SMTP connections as well. Previously,
`bmail.Mailer` was not safe for concurrent use. It also had a piece of
API awkwardness: after you created a Mailer, you had to call Connect on
it to change its state.
Instead of treating that as a state change on Mailer, I split out a
separate component: `bmail.Conn`. Now, when you call `Mailer.Connect()`,
you get a Conn. You can send mail on that Conn and Close it when you're
done. A single Mailer instance can produce multiple Conns, so Mailer is
now concurrency-safe (while Conn is not).
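Roughly, the new usage looks like this (a sketch; the interface shapes and the `SendMail` signature are assumptions, not the exact `bmail` API):
```go
package sketch

import (
	"log"
	"sync"
)

// Conn and Mailer are a hypothetical minimal view of the bmail API described
// above; the exact SendMail signature is an assumption.
type Conn interface {
	SendMail(to []string, subject, body string) error
	Close() error
}

type Mailer interface {
	// Connect is safe to call from multiple goroutines; each call returns an
	// independent SMTP connection.
	Connect() (Conn, error)
}

// sendInParallel fans a batch of messages out over `parallelism` Conns, each
// owned by a single goroutine, since Conn itself is not concurrency-safe.
func sendInParallel(m Mailer, recipients [][]string, subject, body string, parallelism int) {
	work := make(chan []string)
	var wg sync.WaitGroup
	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			conn, err := m.Connect()
			if err != nil {
				log.Printf("connecting SMTP: %s", err)
				for range work {
					// Keep draining so the producer never blocks forever.
				}
				return
			}
			defer conn.Close()
			for to := range work {
				err = conn.SendMail(to, subject, body)
				if err != nil {
					log.Printf("sending to %v: %s", to, err)
				}
			}
		}()
	}
	for _, to := range recipients {
		work <- to
	}
	close(work)
	wg.Wait()
}
```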
This involved a moderate amount of renaming and code movement, and
GitHub's move detector does not keep up with it completely, so reviewing
with an eye towards "is this moved code?" may help. Adding `?w=1` to the
diff URL also hides whitespace-only diffs.
Adds a rocsp redis client to the sa if cluster information is provided in the
sa config. If a redis cluster is configured, all new certificate OCSP
responses added with sa.AddPrecertificate will be written to the redis
cluster on a best-effort basis: errors do not block or fail the call.
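A sketch of that best-effort behavior, assuming a small writer interface standing in for the rocsp Redis client (`StoreResponse` is a hypothetical method name):
```go
package sketch

import (
	"context"
	"log"
)

// ocspRedisWriter is an illustrative stand-in for the rocsp Redis client;
// StoreResponse is a hypothetical method name.
type ocspRedisWriter interface {
	StoreResponse(ctx context.Context, der []byte) error
}

// storeOCSPBestEffort mirrors the behavior described above: if no Redis
// cluster is configured the call is a no-op, and a write error is logged but
// never propagated, so sa.AddPrecertificate cannot block or fail on Redis.
func storeOCSPBestEffort(ctx context.Context, w ocspRedisWriter, der []byte) {
	if w == nil {
		return
	}
	err := w.StoreResponse(ctx, der)
	if err != nil {
		log.Printf("best-effort OCSP write to redis failed: %s", err)
	}
}
```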
Fixes: #5871
We have decided that we don't like the if err := call(); err != nil
syntax, because it creates confusing scopes, but we have not cleaned up
all existing instances of that syntax. However, we have now found a
case where that syntax enables a bug: It caused readers to believe that
a later err = call() statement was assigning to an already-declared err
in the local scope, when in fact it was assigning to an
already-declared err in the parent scope of a closure. This caused our
ineffassign and staticcheck linters to be unable to analyze the
lifetime of the err variable, and so they did not complain when we
never checked the actual value of that error.
This change standardizes on the two-line error checking syntax
everywhere, so that we can more easily ensure that our linters are
correctly analyzing all error assignments.
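A self-contained illustration of the hazard and of the standardized two-line form (the helpers are hypothetical):
```go
package sketch

import "errors"

func doSetup() error { return nil }                // hypothetical helper
func doWork() error  { return errors.New("boom") } // hypothetical helper

// Discouraged: the inline form declares a new err scoped to the if statement.
// In the bug described above, a closure then did a bare `err = call()`, which
// readers assumed assigned a local variable but which actually wrote to err in
// the enclosing scope, and that value was never checked; neither ineffassign
// nor staticcheck could flag it.
func inlineStyle() {
	var err error
	cleanup := func() {
		if err := doSetup(); err != nil { // new err, shadows the outer one
			return
		}
		err = doWork() // silently assigns the outer err
	}
	cleanup()
	_ = err // in the real bug, this value was never examined at all
}

// Standardized: assign on one line, check on the next, so every error
// assignment is visible to readers and analyzable by the linters.
func twoLineStyle() error {
	err := doSetup()
	if err != nil {
		return err
	}
	err = doWork()
	if err != nil {
		return err
	}
	return nil
}
```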
Boulder components initialize their gorp and gorp-less (non-wrapped) database
clients via two new SA helpers. These helpers handle client construction,
database metric initialization, and (for gorp only) debug logging setup.
Removes transaction isolation parameter `'READ-UNCOMMITTED'` from all database
connections.
Fixes #5715
Fixes #5889
This reverts commit e3ce816425,
which was reviewed in https://github.com/letsencrypt/boulder/pull/5607.
This change caused database queries to exceed the maximum packet size
and fail. Because this was an opportunistic optimization, reverting it
is the safest course moving forward.
Use `sa.SelectCertificates` instead of `sa.SelectCertificate` to
fetch the entire batch of certificates all at once, instead of doing
up to 10k individual certificate selections in serial.
The resulting `boulder` binary can be invoked by different names to
trigger the behavior of the relevant subcommand. For instance, symlinking
and invoking as `boulder-ca` acts as the CA. Symlinking and invoking as
`boulder-va` acts as the VA.
This reduces the .deb file size from about 200MB to about 20MB.
This works by creating a registry that maps subcommand names to `main`
functions. Each subcommand registers itself in an `init()` function. The
monolithic `boulder` binary then checks what name it was invoked with
(`os.Args[0]`), looks it up in the registry, and invokes the appropriate
`main`. To avoid conflicts, all of the old `package main` are replaced
with `package notmain`.
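A sketch of the dispatch mechanism with illustrative helper names (the real registration helper lives in the shared `cmd` package):
```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// registry maps an invocation name (e.g. "boulder-ca") to its main function.
// In Boulder the registration helper lives in the shared cmd package; the
// names here are illustrative.
var registry = map[string]func(){}

func register(name string, cmdMain func()) { registry[name] = cmdMain }

// Each subcommand (now `package notmain`) registers itself, roughly:
//
//	func init() { register("boulder-ca", caMain) }

func main() {
	if len(os.Args) > 1 && os.Args[1] == "--list" {
		for name := range registry {
			fmt.Println(name)
		}
		return
	}
	// Dispatch on the name the binary was invoked as (often a symlink).
	name := filepath.Base(os.Args[0])
	cmdMain, ok := registry[name]
	if !ok {
		fmt.Fprintf(os.Stderr, "unknown subcommand %q\n", name)
		os.Exit(1)
	}
	cmdMain()
}
```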
To get the list of registered subcommands, run `boulder --list`. This
is used when symlinking all the variants into place, to ensure the set
of symlinked names matches the entries in the registry.
Fixes #5692
- Replace `gorp.DbMap` with calls that use `sql.DB` directly
- Use `rows.Scan()` and `rows.Next()` to get query results (which opens the door to streaming the results; see the sketch after this list)
- Export function `CertStatusMetadataFields` from `SA`
- Add new function `ScanCertStatusRow` to `SA`
- Add new function `NewDbSettingsFromDBConfig` to `SA`
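The streaming pattern referenced in the list above might look like this (query, columns, and callback are illustrative):
```go
package sketch

import (
	"context"
	"database/sql"
	"time"
)

// streamCertStatus shows the rows.Next/rows.Scan pattern: results are
// consumed one row at a time instead of being loaded into a slice up front.
// The query, columns, and process callback are illustrative.
func streamCertStatus(ctx context.Context, db *sql.DB, cutoff time.Time, process func(serial string, notAfter time.Time)) error {
	rows, err := db.QueryContext(ctx,
		"SELECT serial, notAfter FROM certificateStatus WHERE notAfter <= ?", cutoff)
	if err != nil {
		return err
	}
	defer rows.Close()
	for rows.Next() {
		var serial string
		var notAfter time.Time
		err = rows.Scan(&serial, &notAfter)
		if err != nil {
			return err
		}
		process(serial, notAfter)
	}
	return rows.Err()
}
```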
Fixes #5642
Part Of #5715
Remove the last of the gRPC wrapper files. In order to do so:
- Remove the `core.StorageGetter` interface. Replace it with a new
interface (whose methods include the `...grpc.CallOption` arg)
inside the `sa/proto/` package.
- Remove the `core.StorageAdder` interface. There's no real use-case
for having a write-only interface.
- Remove the `core.StorageAuthority` interface, as it is now redundant
with the autogenerated `sapb.StorageAuthorityClient` interface.
- Replace the `certificateStorage` interface (which appears in two
different places) with a single unified interface also in `sa/proto/`.
- Update all test mocks to include the `_ ...grpc.CallOption` arg in
their method signatures so they match the gRPC client interface.
- Delete many methods from mocks which are no longer necessary (mostly
because they're mocking old authz1 methods that no longer exist).
- Move the two `test/inmem/` wrappers into their own sub-packages to
avoid an import cycle.
- Simplify the `satest` package to satisfy one of its TODOs and to
avoid an import cycle.
- Add many methods to the `test/inmem/sa/` wrapper, to accommodate all
of the methods which are called in unittests.
Fixes #5600
Overhaul how expiration-mailer checks to see if a `certIsRenewed`.
First, change the helper function to take the list of names (which
can be hashed into an fqdnSet) and the issuance date. This allows the
search for renewals to be a much simpler linear scan rather than an
ugly outer left join. Second, update the query to examine both the
`fqdnSets` and `fqdnSets_old` tables, to account for the fact that
this code cares about more time (~90d) than the `fqdnSets` table
currently holds.
Also export the SA's `HashNames` method so it can be used by the mailer,
and update the mailer's tests to use correct name hashes instead of
fake human-readable hashes.
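A sketch of what the simplified check can look like, assuming `setHash` is the output of the newly-exported `HashNames`; the exact SQL is illustrative:
```go
package sketch

import (
	"context"
	"database/sql"
	"time"
)

// certWasRenewed reports whether any certificate covering exactly the same
// set of names (identified by setHash, e.g. the output of the SA's exported
// HashNames) was issued after this certificate was. Both fqdnSets tables are
// consulted, per the description above; the exact SQL is illustrative.
func certWasRenewed(ctx context.Context, db *sql.DB, setHash []byte, issued time.Time) (bool, error) {
	var count int
	err := db.QueryRowContext(ctx,
		`SELECT COUNT(*) FROM (
		   SELECT issued FROM fqdnSets WHERE setHash = ? AND issued > ?
		   UNION ALL
		   SELECT issued FROM fqdnSets_old WHERE setHash = ? AND issued > ?
		 ) AS recent`,
		setHash, issued, setHash, issued).Scan(&count)
	return count > 0, err
}
```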
Fixes #5672
If the expiry mailer's SELECT statement is regularly returning exactly
as many rows as it is configured to limit at, that means that we are
consistently missing some certs that we should probably be sending
emails about. Change the way this metric is computed to be based
on the number of rows returned directly by the SELECT, rather than
based on the filtered set of rows that ended up corresponding to
final certificates.
This bug was almost certainly causing us to not detect nag groups
being at capacity, given that the expiry-mailer logs consistently say
that nag groups have nearly-but-not-quite as many certs in them as
the configured maximum. This bug was introduced in #4420, when
expiration-mailer gained the ability to silently continue past certain
classes of errors retrieving certificate details.
Remove all error checking and type transformation from the gRPC wrappers
for the following methods on the SA:
- GetRegistration
- GetRegistrationByKey
- NewRegistration
- UpdateRegistration
- DeactivateRegistration
Update callers of these methods to construct the appropriate protobuf
request messages directly, and to consume the protobuf response messages
directly. In many cases, this requires changing the way that clients
handle the `Jwk` field (from expecting a `JSONWebKey` to expecting a
slice of bytes) and the `Contacts` field (from expecting a possibly-nil
pointer to relying on the value of the `ContactsPresent` boolean field).
Implement two new methods in `sa/model.go` to convert directly between
database models and protobuf messages, rather than round-tripping
through `core` objects in between. Delete the older methods that
converted between database models and `core` objects, as they are no
longer necessary.
Update test mocks to have the correct signatures, and update tests to
not rely on `JSONWebKey` and instead use byte slices.
Fixes #5531
This changeset adds a second DB connect string for the SA for use in
read-only queries that are not themselves dependencies for read-write
queries. In other words, this is attempting to only catch things like
rate-limit `SELECT`s and other coarse-counting, so we can potentially
move those read queries off the read-write primary database.
It also adds a second DB connect string to the OCSP Updater. This is a
little trickier, as the subsequent `UPDATE`s _are_ dependent on the
output of the `SELECT`, but in this case it's operating on data batches,
and a few seconds' replication latency are several orders of magnitude
below the threshold for update frequency, so any certificates that
aren't caught on run `n` can be caught on run `n+1`.
Since we export DB metrics to Prometheus, this also refactors
`InitDBMetrics` to take a DB Address (host:port tuple) and User out of
the DB connection DSN and include those as labels in the metrics.
Fixes #5550
Fixes #4985
Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).
Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.
Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
Named field `DB`, in each component configuration struct, acts as the
receiver for the value of `db` when component JSON files are
unmarshalled.
When `cmd.DBConfig` fields are received at the root of a component
configuration struct instead of under `DB`, they are copied to the `DB`
field of the component configuration struct.
Move existing `cmd.DBConfig` values from the root of each component's
JSON configuration in `test/config-next` to `db`.
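An illustrative sketch of the fallback copy (local stand-in types, not the actual `cmd.DBConfig`):
```go
package sketch

import "encoding/json"

// DBConfig stands in for cmd.DBConfig; the field set is illustrative.
type DBConfig struct {
	DBConnectFile string `json:"dbConnectFile"`
	MaxOpenConns  int    `json:"maxOpenConns"`
}

// componentConfig accepts the database settings either at the root of the
// JSON object (legacy layout, via the embedded struct) or nested under "db".
type componentConfig struct {
	DBConfig          // legacy: fields promoted to the root of the JSON object
	DB       DBConfig `json:"db"`
}

// parseConfig unmarshals a component config and copies legacy root-level
// values into DB when "db" was absent, so the rest of the component only
// ever reads cfg.DB.
func parseConfig(raw []byte) (componentConfig, error) {
	var c componentConfig
	err := json.Unmarshal(raw, &c)
	if err != nil {
		return c, err
	}
	if c.DB == (DBConfig{}) {
		c.DB = c.DBConfig
	}
	return c, nil
}
```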
Part of #5275
In #5235 we replaced MaxDBConns in favor of MaxOpenConns.
One week ago MaxDBConns was removed from all dev, staging, and
production configurations. This change completes the removal of
MaxDBConns from all components and test/config.
Fixes #5249
Historically the only database/sql driver setting exposed via JSON
config was maxDBConns. This change adds support for maxIdleConns,
connMaxLifetime, connMaxIdleTime, and renames maxDBConns to
maxOpenConns. The addition of these settings will give our SRE team a
convenient method for tuning the reuse/closure of database connections.
A new struct, DBSettings, has been added to SA. The struct, and each of
its fields, has been commented.
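An illustrative shape for the settings and how they map onto `database/sql` (names mirror the description above, not necessarily the Boulder code):
```go
package sketch

import (
	"database/sql"
	"time"
)

// DBSettings mirrors the settings described above; the struct name and
// fields follow the description, not necessarily the code verbatim.
type DBSettings struct {
	MaxOpenConns    int           // formerly maxDBConns
	MaxIdleConns    int           // idle connections kept in the pool
	ConnMaxLifetime time.Duration // close connections older than this
	ConnMaxIdleTime time.Duration // close connections idle longer than this
}

// Apply maps each setting onto the corresponding database/sql knob.
func (s DBSettings) Apply(db *sql.DB) {
	db.SetMaxOpenConns(s.MaxOpenConns)
	db.SetMaxIdleConns(s.MaxIdleConns)
	db.SetConnMaxLifetime(s.ConnMaxLifetime)
	db.SetConnMaxIdleTime(s.ConnMaxIdleTime)
}
```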
All new fields have been plumbed through to the relevant Boulder
components and exported as Prometheus metrics. Tests have been
added/modified to ensure that the fields are being set. There should be
no loss in coverage.
Deployability concerns for the migration from maxDBConns to maxOpenConns
have been addressed with the temporary addition of the helper method
cmd.DBConfig.GetMaxOpenConns(). This method can be removed once
test/config is defaulted to using maxOpenConns. Relevant sections of the
code have TODOs added that link back to a newly opened issue.
Fixes #5199
Simplify database interactions
This change is a result of an audit of all places where
Go code directly constructs SQL queries and executes them
against a dbMap, with the goal of eliminating all instances
of constructing a well-known object type (such as a
core.CertificateStatus) from explicitly-listed database columns.
Instead, we should be relying on helper functions defined in the
sa itself to determine which columns are relevant for the
construction of any given object.
This audit did not find many places where this was occurring. It
did reveal a few simplifications, which are contained in this
change:
1) Greater use of existing SelectFoo methods provided by models.go
2) Streamlining of various SelectSingularFoo methods to always
select by serial string, rather than user-provided WHERE clause
3) One spot (in ocsp-responder) where using a well-known type seemed
better than using a more minimal custom type
Addresses #4899
In a handful of places I've nuked old stats which are not used in any alerts or dashboards as they either duplicate other stats or don't provide much insight/have never actually been used. If we feel like we need them again in the future it's trivial to add them back.
There aren't many dashboards that rely on old statsd style metrics, but a few will need to be updated when this change is deployed. There are also a few cases where prometheus labels have been changed from camel to snake case, dashboards that use these will also need to be updated. As far as I can tell no alerts are impacted by this change.
Fixes #4591.
New types and related infrastructure are added to the `db` package to allow
wrapping gorp DbMaps and Transactions.
The wrapped versions return a special `db.ErrDatabaseOp` error type when errors
occur. The new error type includes additional information such as the operation
that failed and the related table.
Where possible we determine the table based on the types of the gorp function
arguments. Where that isn't possible (e.g. with raw SQL queries) we try to use
a simple regexp approach to find the table name. This isn't great for general
SQL but works well enough for Boulder's existing SQL queries.
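A sketch of the wrapped-error idea, with field names following the description above rather than the `db` package verbatim:
```go
package sketch

import "fmt"

// ErrDatabaseOp follows the description above; field names are illustrative.
type ErrDatabaseOp struct {
	Op    string // e.g. "select", "insert", "exec"
	Table string // inferred from gorp arguments, or from the SQL via regexp
	Err   error  // the underlying gorp/driver error
}

func (e ErrDatabaseOp) Error() string {
	return fmt.Sprintf("failed to %s %s: %s", e.Op, e.Table, e.Err)
}

// Unwrap lets callers use errors.Is/errors.As on the wrapped error.
func (e ErrDatabaseOp) Unwrap() error { return e.Err }
```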
To get additional confidence my regexps work for all of Boulder's queries
I temporarily changed the `db` package's `tableFromQuery` function to panic if
the table couldn't be determined. I re-ran the full unit and integration test
suites with this configuration and saw no panics.
Resolves https://github.com/letsencrypt/boulder/issues/4559
When a nag group hits capacity we set the nagsAtCapacity gauge to 1.
This gauge also needs to be reset to 0 when the nag group is no longer
at capacity.
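A minimal sketch of writing the gauge in both directions on each run (metric and label names are illustrative):
```go
package sketch

import "github.com/prometheus/client_golang/prometheus"

// nagsAtCapacity is illustrative; the real gauge lives in expiration-mailer.
var nagsAtCapacity = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{Name: "nags_at_capacity", Help: "1 when a nag group hit its query limit"},
	[]string{"nag_group"},
)

// updateCapacityGauge writes the gauge in both directions on every run, so a
// group that drops back below capacity is reported as 0 again.
func updateCapacityGauge(nagGroup string, rowsReturned, limit int) {
	if rowsReturned >= limit {
		nagsAtCapacity.WithLabelValues(nagGroup).Set(1)
		return
	}
	nagsAtCapacity.WithLabelValues(nagGroup).Set(0)
}
```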
This change adds two tables and two methods in the SA, to store precertificates
and serial numbers.
In the CA, when the feature flag is turned on, we generate a serial number, store it,
sign a precertificate and OCSP, store them, and then return the precertificate. Storing
the serial as an additional step before signing the certificate adds an extra layer of
insurance against duplicate serials, and also serves as a check on database availability.
Since an error storing the serial prevents going on to sign the precertificate, this decreases
the chance of signing something while the database is down.
Right now, neither table has read operations available in the SA.
To make this work, I needed to remove the check for duplicate certificateStatus entry
when inserting a final certificate and its OCSP response. I also needed to remove
an error that can occur when expiration-mailer processes a precertificate that lacks
a final certificate. That error would otherwise have prevented further processing of
expiration warnings.
Fixes #4412
This change builds on #4417; please review that first for ease of review.
Go 1.11+ updated the `sql.DBStats` struct with new fields that are of
interest to us. This PR routes these stats to Prometheus by replacing
the existing autoprom stats code with new first-class Prometheus
metrics. Resolves https://github.com/letsencrypt/boulder/issues/4095
The `max_db_connections` stat from the SA is removed because the Go 1.11+
`sql.DBStats.MaxOpenConnections` field will give us a better view of
the same information.
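A sketch of how those `sql.DBStats` fields can be routed to Prometheus (metric names are illustrative):
```go
package sketch

import (
	"database/sql"

	"github.com/prometheus/client_golang/prometheus"
)

// registerDBStats exports a few of the Go 1.11+ sql.DBStats fields as
// first-class Prometheus metrics; metric names are illustrative.
func registerDBStats(reg prometheus.Registerer, db *sql.DB) {
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{Name: "db_open_connections", Help: "Connections currently open, in use or idle"},
		func() float64 { return float64(db.Stats().OpenConnections) },
	))
	reg.MustRegister(prometheus.NewGaugeFunc(
		prometheus.GaugeOpts{Name: "db_max_open_connections", Help: "Configured connection limit"},
		func() float64 { return float64(db.Stats().MaxOpenConnections) },
	))
	reg.MustRegister(prometheus.NewCounterFunc(
		prometheus.CounterOpts{Name: "db_wait_count_total", Help: "Total number of connections waited for"},
		func() float64 { return float64(db.Stats().WaitCount) },
	))
}
```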
The autoprom "reused_authz" stat that was being incremented in
`SA.GetPendingAuthorization` was also removed. It wasn't doing what it
said it was (counting reused authorizations) and was instead counting
the number of times `GetPendingAuthorization` returned an authz.
* in boulder-ra we connected to the publisher and created a publisher gRPC client twice for no apparent reason
* in the SA we ignored errors from `getChallenges` in `GetAuthorizations` which could result in a nil challenge being returned in an authorization
A very large number of the logger calls are of the form log.Function(fmt.Sprintf(...)).
Rather than sprinkling fmt.Sprintf at every logger call site, provide formatting versions
of the logger functions and call these directly with the format and arguments.
While here remove some unnecessary trailing newlines and calls to String/Error.
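A sketch of the formatting variants, with a stand-in for the real logger interface:
```go
package sketch

import "fmt"

// Logger is a stand-in for the project's logging interface.
type Logger interface {
	Info(msg string)
	Err(msg string)
}

// fmtLogger adds formatting variants so call sites can drop the explicit
// fmt.Sprintf wrapper.
type fmtLogger struct{ Logger }

func (l fmtLogger) Infof(format string, a ...interface{}) { l.Info(fmt.Sprintf(format, a...)) }
func (l fmtLogger) Errf(format string, a ...interface{})  { l.Err(fmt.Sprintf(format, a...)) }
```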
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.
This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.
Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands.
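A sketch of the two interceptors, simplified to call `time.Now` directly where the real change threads a clock through; the metadata key and histogram are illustrative:
```go
package sketch

import (
	"context"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
	"google.golang.org/grpc/metadata"
)

const clientRequestTimeKey = "client-request-time"

// clientInterceptor stamps each outgoing RPC with the client's send time.
func clientInterceptor(ctx context.Context, method string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	ctx = metadata.AppendToOutgoingContext(ctx, clientRequestTimeKey,
		strconv.FormatInt(time.Now().UnixNano(), 10))
	return invoker(ctx, method, req, reply, cc, opts...)
}

// serverInterceptor computes now-minus-stamp and observes it on a histogram.
func serverInterceptor(hist prometheus.Histogram) grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo,
		handler grpc.UnaryHandler) (interface{}, error) {
		md, ok := metadata.FromIncomingContext(ctx)
		if ok {
			vals := md.Get(clientRequestTimeKey)
			if len(vals) == 1 {
				ns, err := strconv.ParseInt(vals[0], 10, 64)
				if err == nil {
					hist.Observe(time.Since(time.Unix(0, ns)).Seconds())
				}
			}
		}
		return handler(ctx, req)
	}
}
```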
A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.
Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
Also instead of repeating the same bucket definitions everywhere just use a single top level var in the metrics package in order to discourage copy/pasting.
Fixes #3607.
Prior to this commit, if the expiration-mailer database configuration is
invalid, or the database is unreachable,
`cmd/expiration-mailer/main.go`'s `main()` function tries to call
`sa.SetSQLDebug(dbMap, logger)` before erroring from the DB
initialization failure. This causes a nil panic.
This commit changes the order of two lines such that `sa.SetSQLDebug` is
only called when there was no db setup error.
Our various main.go functions gated some key code on whether the TLS
and/or GRPC config fields were present. Now that those fields are fully
deployed in production, we can simplify the code and require them.
Also, rename tls to tlsConfig everywhere to avoid confusion with the tls
package.
Avoid assigning to the same err from two different goroutines in
boulder-ca (fix a race).
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.
Specifically, it adds two convenience functions, one for clients and one for servers, that make the necessary metrics objects and register them. We run these in the main function of each server.
I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.
Also, update go-grpc-prometheus to get the necessary methods.
```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
Updates the buckets for histograms in the publisher, va, and expiration-mailer, which are used to measure the latency of operations that go over the internet and are therefore liable to take much longer than the default buckets can capture. Uses a standard set of buckets for all three instead of attempting to tune for each one.
Fixes #3217.
* Remove non-TLS support from mailer entirely
* Add a config option for trusted roots in expiration-mailer. If unset, it defaults to the system roots, so this does not need to be set in production.
* Use TLS in mail-test-srv, along with an internal root and localhost certificates signed by that root.
Previously, we used prometheus.DefaultRegisterer to register our stats, which uses global state to export its HTTP stats. We also used net/http/pprof's behavior of registering to the default global HTTP ServeMux, via DebugServer, which starts an HTTP server that uses that global ServeMux.
In this change, I merge DebugServer's functions into StatsAndLogging. StatsAndLogging now takes an address parameter and fires off an HTTP server in a goroutine. That HTTP server is newly defined, and doesn't use DefaultServeMux. On it is registered the Prometheus stats handler, and handlers for the various pprof traces. In the process I split StatsAndLogging internally into two functions: makeStats and MakeLogger. I didn't port across the expvar variable exporting, which serves a similar function to Prometheus stats but which we never use.
One nice immediate effect of this change: Since StatsAndLogging now requires an address, I noticed a bunch of commands that called StatsAndLogging, and passed around the resulting Scope, but never made use of it because they didn't run a DebugServer. Under the old StatsD world, these commands could still have exported their stats by pushing, but since we moved to Prometheus their stats stopped being collected. We haven't used any of these stats, so instead of adding debug ports to all short-lived commands, or setting up a push gateway, I simply removed them and switched those commands to initialize only a Logger, no stats.
Since we can make up to 100 SQL queries from this method (based on the 100-SAN
limit), sometimes it is too slow and we get a timeout for large certificates. By
running some of those queries in parallel, we can speed things up and stop
getting timeouts.
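A sketch of that fan-out using `errgroup` (the per-name check is a hypothetical stand-in for the individual queries):
```go
package sketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// checkNamesInParallel runs the per-name lookups concurrently instead of
// serially; check is a stand-in for the individual SQL query.
func checkNamesInParallel(ctx context.Context, names []string, check func(ctx context.Context, name string) error) error {
	var g errgroup.Group
	for _, name := range names {
		name := name // capture the loop variable for the closure
		g.Go(func() error {
			return check(ctx, name)
		})
	}
	// Wait blocks until every lookup finishes and returns the first error, if any.
	return g.Wait()
}
```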
This commit replaces the Boulder dependency on
gopkg.in/square/go-jose.v1 with gopkg.in/square/go-jose.v2. This is
necessary both to stay in front of bitrot and because the ACME v2 work
will require a feature from go-jose.v2 for JWS validation.
The largest part of this diff is cosmetic changes:
- Changing import paths
- jose.JsonWebKey -> jose.JSONWebKey
- jose.JsonWebSignature -> jose.JSONWebSignature
- jose.JoseHeader -> jose.Header
Some more significant changes were caused by updates in the API
for creating new jose.Signer instances. Previously we constructed
these with jose.NewSigner(algorithm, key). Now these are created with
jose.NewSigner(jose.SigningKey{},jose.SignerOptions{}). At present all
signers specify EmbedJWK: true but this will likely change with
follow-up ACME V2 work.
Another change was the removal of the jose.LoadPrivateKey function
that the wfe tests relied on. The jose v2 API removed these functions,
moving them to a cmd's main package where we can't easily import them.
This function was reimplemented in the WFE's test code & updated to fail
fast rather than return errors.
Per CONTRIBUTING.md I have verified the go-jose.v2 tests at the imported
commit pass:
ok gopkg.in/square/go-jose.v2 14.771s
ok gopkg.in/square/go-jose.v2/cipher 0.025s
? gopkg.in/square/go-jose.v2/jose-util [no test files]
ok gopkg.in/square/go-jose.v2/json 1.230s
ok gopkg.in/square/go-jose.v2/jwt 0.073s
Resolves #2880
This used to be used for AMQP queue names. Now that AMQP is gone, these consts
were only used when printing a version string at startup. This changes
VersionString to just use the name of the current program, and removes
`const clientName = ` from many of our main.go's.
If the feature flag "ReusePendingAuthz" is enabled, a request to create a new authorization object from an account that already has a pending authorization object for the same identifier will return the already-existing authorization object. This should make it less common for people to get stuck in the "too many pending authorizations" state, and reduce DB storage growth.
Fixes #2768
This removes the config and code to output to statsd.
- Change `cmd.StatsAndLogging` to output a `Scope`, not a `Statter`.
- Remove the prefixing of component name (e.g. "VA") in front of stats; this was stripped by `autoProm` but now no longer needs to be.
- Delete vendored statsd client.
- Delete `MockStatter` (generated by gomock) and `mocks.Statter` (hand generated) in favor of mocking `metrics.Scope`, which is the interface we now use everywhere.
- Remove a few unused methods on `metrics.Scope`, and update its generated mock.
- Refactor `autoProm` and add `autoRegisterer`, which can be included in a `metrics.Scope`, avoiding global state. `autoProm` now registers everything with the `prometheus.Registerer` it is given.
- Change va_test.go's `setup()` to not return a stats object; instead the individual tests that care about stats override `va.stats` directly.
Fixes #2639, #2733.
Adds a daemon mode to `expiration-mailer`, triggered by the `--daemon` flag, in order to follow deployability guidelines. If the `--daemon` flag is used, the `mailer.runPeriod` config field is checked for a tick duration; if the value is `0`, it exits.
Super lightweight implementation; OCSP-Updater has some custom ticker code which we use to do fancy things when the method being invoked in the loop takes longer than expected, but that isn't necessary here.
Fixes #2617.
This patch removes all usages of the `core.XXXError` types and almost all usages of `probs` outside of the WFE and VA, and replaces them with a unified internal error type. Since the VA uses `probs.ProblemDetails` quite extensively in challenges, and currently stores them in the DB, I've deferred that change to a follow-up (it'll also require a migration). Since `ProblemDetails` should only ever be exposed to end-users, all of its related logic should be moved into the `WFE`, but since it still needs to be exposed to the VA and SA I've left it in place for now.
The new internal `errors` package offers the same convenience functions as `probs` does as well as a new simpler type testing method. A few small changes have also been made to error messages, mainly adding the library and function name to internal server errors for easier debugging (i.e. where a number of functions return the exact same errors and there is no other way to distinguish which method threw the error).
Also adds proper encoding of internal errors transferred over gRPC (the current encoding scheme is kept for `core` and `probs` errors since it'll be ideally be removed after we deploy this and follow-up changes) using `grpc/metadata` instead of the gRPC status codes.
Fixes #2507. Updates #2254 and #2505.
This commit resolves #2599 by adding support to the expiration-mailer to
treat the subject for email messages as a template. This allows for the
dynamic subject lines from #2435 to be used with a prefix for staging
emails.
Switch from `gorp.v1` to `gorp.v2`. Removes `vendor/gopkg.in/gorp.v1` and vendors `vendor/gopkg/go-gorp/gorp.v2`, all tests pass.
Changes between `v1.7.1` and `v2.0.0`: c87af80f3c...4deece6103
Fixes #2490.
The NotAfter and IsExpired fields on the certificateStatus table
have been migrated in staging & production. Similarly the
CertStatusOptimizationsMigrated feature flag has been turned on after
a successful backfill operation. We have confirmed the optimization is
working as expected and can now clean out the duplicated v1 and v2
models, and the feature flag branching. The notafter-backfill command
is no longer useful and so this commit also cleans it out of the repo.
Note: Some unit tests were sidestepping the SA and inserting
certificateStatus rows explicitly. These tests had to be updated to
set the NotAfter field in order for the queries used by the
ocsp-updater and the expiration-mailer to perform the way the tests
originally expected.
Resolves #2530
If you are the first person to add a feature to a Boulder command, it's very
easy to forget to update the command's config structure to accommodate a
`map[string]bool` entry and to pass it to `features.Set` in `main()`. See
https://github.com/letsencrypt/boulder/issues/2533 for one example. I've
fallen into this trap myself a few times so I'm going to try and save myself
some future grief by fixing it across the board once and for all!
This PR adds a `Features` config entry and a corresponding `features.Set` to:
* ocsp-updater (resolves #2533)
* admin-revoker
* boulder-publisher
* contact-exporter
* expiration-mailer
* expired-authz-purger
* notify-mailer
* ocsp-responder
* orphan-finder
These components were skipped because they already had features supported:
* boulder-ca
* boulder-ra
* boulder-sa
* boulder-va
* boulder-wfe
* cert-checker
I deliberately skipped adding Feature support to:
* single-ocsp (Its only configuration comes from the pkcs11key library and
doesn't support features)
* rabbitmq-setup (No configuration/features, and we'll likely soon be removing this given the gRPC migration)
* notafter-backfill (This is a one-off that will be deleted soon)
Previously, a given binary would have three TLS config fields (CA cert, cert,
key) for its gRPC server, plus each of its configured gRPC clients. In typical
use, we expect all three of those to be the same across both servers and clients
within a given binary.
This change reuses the TLSConfig type already defined for use with AMQP, adds a
Load() convenience function that turns it into a *tls.Config, and configures it
for use with all of the binaries. This should make configuration easier and more
robust, since it more closely matches usage.
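A sketch of what such a `Load()` convenience can look like, with illustrative field names:
```go
package sketch

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

// TLSConfig mirrors the shared CA-cert/cert/key triple described above;
// field names are illustrative.
type TLSConfig struct {
	CACertFile string
	CertFile   string
	KeyFile    string
}

// Load turns the file paths into a *tls.Config usable for both gRPC servers
// and clients within a binary.
func (t TLSConfig) Load() (*tls.Config, error) {
	caPEM, err := os.ReadFile(t.CACertFile)
	if err != nil {
		return nil, err
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("failed to parse CA certificates from %q", t.CACertFile)
	}
	cert, err := tls.LoadX509KeyPair(t.CertFile, t.KeyFile)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		Certificates: []tls.Certificate{cert},
		RootCAs:      roots,
		ClientCAs:    roots,
		ClientAuth:   tls.RequireAndVerifyClientCert,
	}, nil
}
```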
This change preserves temporary backwards-compatibility for the
ocsp-updater->publisher RPCs, since those are the only instances of gRPC
currently enabled in production.
This PR modifies the `expiration-mailer` utility to change the subject used in the reminder emails to include a domain name from the expiring certificate.
Previously unless otherwise specified using the `Mailer.Subject` configuration parameter all reminder emails were sent with the subject `Certificate expiration notice`. Both the `test/config/` and `test/config-next` expiration mailer configurations do not override the subject and were using the default.
With this PR, if no `Mailer.Subject` configuration parameter is provided then reminder emails are sent with the subject `Certificate expiration notice for domain "foo.bar.com"` in the case of only one domain in the expiring certificate, and `Certificate expiration for domain "foo.bar.com" (and $(n-1) more)` for the case where there are n > 1 domains (e.g. "(and 1 more)", "(and 2 more)" ...). I explicitly left support for the `Mailer.Subject` override to allow legacy configurations to function.
I didn't explicitly add a new unit test for this behaviour because the existing unit tests were exercising both the "configuration override" portion of the subject behaviour, and matching the new expected subject. It would be entirely duplicated code to write a separate test for the subject template.
Resolves #2411
This PR splits up the expiration-mailer's `findExpiringCertificates` query into two parts:
1. One query to find `certificateStatus` serial numbers that match the search criteria
2. Sequential queries to find each `certificate` row for the results from 1.
This removes the `JOIN` on two large tables from the original `findExpiringCertificates` query and lets us shift load away from the database. https://github.com/letsencrypt/boulder/issues/2432 wasn't sufficient to reduce the load of this query.
Resolves https://github.com/letsencrypt/boulder/issues/2425
Similar to #2431 the expiration-mailer's `findExpiringCertificates()` query can be optimized slightly by using `certificateStatus.NotAfter` in place of `certificate.expires` in the `WHERE` clause of its query when the `CertStatusOptimizationsMigrated` feature is enabled.
Resolves https://github.com/letsencrypt/boulder/issues/2425
Previously we had custom code in each gRPC wrapper to implement timeouts. Moving
the timeout code into the client interceptor allows us to simplify things and
reduce code duplication.
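A sketch of the interceptor-level timeout:
```go
package sketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// timeoutInterceptor applies one configured timeout to every outgoing RPC,
// so individual wrappers no longer need their own timeout handling.
func timeoutInterceptor(timeout time.Duration) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		ctx, cancel := context.WithTimeout(ctx, timeout)
		defer cancel()
		return invoker(ctx, method, req, reply, cc, opts...)
	}
}
```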
Adds a gRPC server to the SA and SA gRPC Clients to the WFE, RA, CA, Publisher, OCSP updater, orphan finder, admin revoker, and expiration mailer.
Also adds a CA gRPC client to the OCSP Updater which was missed in #2193.
Fixes #2347.
This commit updates the `go-jose` dependency to [v1.1.0](https://github.com/square/go-jose/releases/tag/v1.1.0) (Commit: aa2e30fdd1fe9dd3394119af66451ae790d50e0d). Since the import path changed from `github.com/square/...` to `gopkg.in/square/go-jose.v1/` this means removing the old dep and adding the new one.
The upstream go-jose library added a `[]*x509.Certificate` member to the `JsonWebKey` struct that prevents us from using a direct equality test against two `JsonWebKey` instances. Instead we now must compare the inner `Key` members.
The `TestRegistrationContactUpdate` function from `ra_test.go` was updated to populate the `Key` members used in testing instead of only using KeyID's to allow the updated comparisons to work as intended.
The `Key` field of the `Registration` object was switched from `jose.JsonWebKey` to `*jose.JsonWebKey ` to make it easier to represent a registration w/o a Key versus using a value with a nil `JsonWebKey.Key`.
I verified the upstream unit tests pass per contributing.md:
```
daniel@XXXXX:~/go/src/gopkg.in/square/go-jose.v1$ git show
commit aa2e30fdd1fe9dd3394119af66451ae790d50e0d
Merge: 139276c e18a743
Author: Cedric Staub <cs@squareup.com>
Date: Thu Sep 22 17:08:11 2016 -0700
Merge branch 'master' into v1
* master:
Better docs explaining embedded JWKs
Reject invalid embedded public keys
Improve multi-recipient/multi-sig handling
daniel@XXXXX:~/go/src/gopkg.in/square/go-jose.v1$ go test ./...
ok gopkg.in/square/go-jose.v1 17.599s
ok gopkg.in/square/go-jose.v1/cipher 0.007s
? gopkg.in/square/go-jose.v1/jose-util [no test files]
ok gopkg.in/square/go-jose.v1/json 1.238s
```
### Connect before sending mail, not at startup
Per #2250, when we connect to the remote SMTP server at start-up time by calling `mailer.Connect()` but do not actually call `mailer.SendMail()` until after we have done some potentially expensive/time-consuming work, we are liable to have our connection closed due to timeout.
This PR moves the `Connect()` call in `expiration-mailer` and `notify-mailer` to be closer to where the actual messages are sent via `SendMail()` and resolves #2250.
### Handle SMTP 421 errors gracefully
Issue #2249 describes a case where we see this SMTP error code from the remote server when our connection has been idle for too long. This would manifest when connecting to the remote server at startup, running a very long database query, and then sending mail. This commit allows the mailer to treat SMTP 421 errors as an event that should produce a reconnect attempt and resolves #2249.
A unit test is added to the mailer tests to test that reconnection works when the server sends a SMTP 421 error. Prior to b64e51f and support for SMTP 421 reconnection this test failed in a manner matching issue #2249:
```
go test -p 1 -race --test.run TestReconnectSMTP421
github.com/letsencrypt/boulder/mail
Wrote goodbye msg: 421 1.2.3 green.eggs.and.spam Error: timeout exceeded
Cutting off client early
--- FAIL: TestReconnectSMTP421 (0.00s)
mailer_test.go:257: Expected SendMail() to not fail. Got err: 421
1.2.3 green.eggs.and.spam Error: timeout exceeded
FAIL
FAIL github.com/letsencrypt/boulder/mail 0.023s
```
With b64e51f the test passes and the client gracefully reconnects.
The existing reconnect testing logic in the `mail-test-srv` integration tests is changed such that half of the forced disconnects are a normal clean connection close and half are a SMTP 421. This allows the existing integration test for server disconnects to be reused to test the 421 reconnect logic.
Updates #1699.
Adds a new package, `features`, which exposes methods to set and check if various internal features are enabled. The implementation uses global state to store the features so that services embedded in another service do not each require their own features map in order to check if something is enabled.
Requires a `boulder-tools` image update to include `golang.org/x/tools/cmd/stringer`.
The `MailerImpl` gains a few new fields (`retryBase`, & `retryMax`). These are used with `core.RetryBackoff` in `reconnect()` to implement exponential backoff in a reconnect attempt loop. Both `expiration-mailer` and `notify-mailer` are modified to add CLI args for these 2 flags and to wire them into the `MailerImpl` via its `New()` constructor.
In `MailerImpl`'s `SendMail()` function it now detects when `sendOne` returns an `io.EOF` error indicating that the server closed the connection unexpectedly. When this case occurs `reconnect()` is invoked. If the reconnect succeeds then we invoke `sendOne` again to try and complete the message sending operation that was interrupted by the disconnect.
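A sketch of the reconnect-and-retry flow, with hypothetical helpers standing in for `sendOne`/`reconnect` and a hand-rolled capped backoff in place of `core.RetryBackoff`:
```go
package sketch

import (
	"io"
	"time"
)

// retryMailer is a stand-in for MailerImpl: sendOne and reconnect are the
// hypothetical per-message send and SMTP reconnect operations.
type retryMailer struct {
	retryBase time.Duration
	retryMax  time.Duration
	sendOne   func(to []string, subject, body string) error
	reconnect func() error
}

// sendWithReconnect retries a send that failed with io.EOF (server closed
// the connection), reconnecting with capped exponential backoff first.
func (m *retryMailer) sendWithReconnect(to []string, subject, body string) error {
	err := m.sendOne(to, subject, body)
	if err != io.EOF {
		return err
	}
	backoff := m.retryBase
	for m.reconnect() != nil {
		time.Sleep(backoff)
		backoff *= 2
		if backoff > m.retryMax {
			backoff = m.retryMax
		}
	}
	return m.sendOne(to, subject, body)
}
```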
For integration testing purposes I modified the `mail-test-srv` to support a `-closeChance` parameter between 0 and 100. This controls what % of `MAIL` commands will result in the server immediately closing the client connection before further processing. This allows us to simulate a flaky mailserver. `test/startservers.py` is modified to start the `mail-test-srv` with a 35% close chance to thoroughly test the reconnection logic during the existing `expiration-mailer` integration tests. I took this as a chance to do some slight clean-up of the `mail-test-srv` code (mostly removing global state).
For unit testing purposes I modified the mailer `TestConnect` test to abstract out a server that can operate similar to `mail-test-serv` (e.g. can close connections artificially).
This is testing a server that **closes** a connection, and not a server that **goes away/goes down**. E.g. the `core.RetryBackoff` sleeps themselves are not being tested. The client is disconnected and attempts a reconnection which always succeeds on the first try. To test a "gone away" server would require a more substantial rewrite of the unit tests and the `mail-test-srv`/integration tests. I think this matches the experience we have with MailChimp/Mandril closing long lived connections.
Formerly in expiration-mailer, when we wanted to set `lastNagSent` to `Now()`, we started a transaction, read the object, updated one field, wrote it back, and closed the transaction.
This commit replaces the transaction and instead does a much simpler and more efficient `UPDATE certificateStatus SET lastNagSent = ? where serial = ?;`.
This PR adds a `printStatus` function that is called every iteration of the mailer's `run()` loop. The status output is logged at the `info` level and includes the destination email, the current message being sent, the total number of messages to send, and the elapsed time since `run()` started.
The status output can be disabled by lowering the default syslog level in the `notify-mailer` config.
Additionally, this PR adds stats support for the mailer package. Three new stats are
published during the `MailerImpl`'s `SendMail` function (called in a loop by the mailer utilities):
`Mailer.SendMail.Attempts`
`Mailer.SendMail.Successes`
`Mailer.SendMail.Errors`
This PR removes two stats from the `expiration-mailer` that are redundant copies of the above:
`Mailer.Expiration.Errors.SendingNag.SendFailure`
`Mailer.Expiration.Sent`
This resolves #2026.
Previously, if a certificate was skipped by the expiration mailer due to being
renewed already, we wouldn't update its lastnag time. However, this meant that
already-renewed certificates would clog up the results of the query
expiration-mailer does to find expired certs. Since this query has a limit (1000
in practice), we might find only renewed certificates on each query, even when
there are non-renewed certificates available to alert about. Then we'd never
make forward progress.
This change updates the stored lastExpirationNagSent field when a certificate is
skipped over due to renewal, so that it isn't included in the first-step query.
Fixes #2054
This PR adds a stat that is emitted when any of the nag groups are operating at capacity. The mailer is considered at capacity when the number of certs returned by the query in findExpiringCertificates is equal to the configured -cert_limit.
The at capacity stat names take the form: "Mailer.Expiration.Errors.Nag-XXXXX.AtCapacity" where XXXXX is the String() representation of the nagCheck offset nag time. Allowing the capacity alert to be specified per-nag group. As an example, a nag time of 48hrs with a nag check of 24hrs would produce a stat: "Mailer.Expiration.Errors.Nag-72h0m0s.AtCapacity" when it reached a capacity state.
This will allow creation of an alert for the conditions that caused issue #2002 to manifest.
In order to unit test with a mock statter it was also required to swap out the time.Since calls for equivalent dateB.Sub(dateA) calls using the fake clock.
In #1923 we changed reg.Contact to a pointer, which can be nil if the corresponding data from the DB is the literal string "null". This causes panics in expiration-mailer, which we need to fix.
This change fixes modelToRegistration to always return a pointer to a non-nil slice. It also adds an extra sanity check in expiration-mailer itself.
Fixes #1993
https://github.com/letsencrypt/boulder/pull/1997
This commit adds a new notify-mailer command. Outside of the new command, this PR also:
Adds a new SMTPConfig to cmd/config.go that is shared between the expiration mailer and the notify mailer.
Modifies mail/mailer.go to add an smtpClient interface.
Adds a dryRunClient to mail/mailer.go that implements the smtpClient interface.
Modifies the mail/mailer.go MailerImpl and constructor to use the SMTPConfig and a dialer. The missing functions from the smtpClient interface are added.
The notify-mailer command supports checkpointing through --start and --end parameters. It supports dry runs by using the new dryRunClient from the mail package when given the --dryRun flag. The speed at which emails are sent can be tweaked using the --sleep flag.
Unit tests for notify-mailer's checkpointing behaviour, the checkpoint interval/sleep parameter sanity, the sleep behaviour, and the message content construction are included in main_test.go.
Future work:
A separate command to generate the list of destination emails provided to notify-mailer
Support for using registration IDs as input and resolving the email address at runtime.
Resolves #1928. Credit to @jsha for the initial work - I'm just completing the branch he started.
* Adds `notify-mailer` command.
* Adds a new SMTPConfig to `cmd/config.go` that is shared between the
expiration mailer and the notify mailer.
* Modifies `mail/mailer.go` to add an `smtpClient` interface.
* Adds a `dryRunClient` to `mail/mailer.go` that implements the
`smtpClient` interface.
* Modifies the `mail/mailer.go` `MailerImpl` and constructor to use the
SMTPConfig and a dialer. The missing functions from the `smtpClient`
interface are added.
* Fix errcheck warnings
* Review feedback
* Review feedback pt2
* Fixes #1446 - invalid message-id generation.
* Change -configFile to -config
* Test message ID with friendly email
https://github.com/letsencrypt/boulder/pull/1936
The RA UpdateRegistration function merges a base registration object with an update by calling Registration.MergeUpdate. Prior to this commit MergeUpdate only allowed the updated registration object to overwrite the Contact field of the existing registration if the updated reg. defined at least one AcmeURL. This prevented clients from being able to outright remove the contact associated with an existing registration.
This commit removes the len() check on the input.Contact in MergeUpdate to allow the r.Contact field to be overwritten by a []*core.AcmeURL(nil) Contact field. Subsequently clients can now send an empty contacts list in the update registration POST in order to remove their reg contact.
Fixes #1846
* Allow removing registration contact.
* Adds a test for `MergeUpdate` contact removal.
* Change `Registration.Contact` type to `*[]*core.AcmeURL`.
* End validateContacts early for empty contacts
* Test removing reg. contact more thoroughly.
* rename, change params, restructure
* I'm wondering how I managed that one
* use a metrics.Scope
* move method to SA, update callers
* rerun goimports
* fix compile error
* revert cmd/shell.go
https://github.com/letsencrypt/boulder/pull/1805
* Fix all errcheck errors
* Add errcheck to test.sh
* Add a new sa.Rollback method to make handling errors in rollbacks easier.
This also causes a behavior change in the VA. If an HTTP connection is
abruptly closed after serving the headers for a non-200 response, the
reported error will be the read failure instead of the non-200.
- Remove error signatures from log methods. This means fewer places where errcheck will show ignored errors.
- Pull in latest cfssl to be compatible with errorless log messages.
- Reduce the number of message priorities we support to just those we actually use.
- AuditNotice -> AuditInfo
- Remove InfoObject (only one use, switched to Info)
- Remove EmergencyExit and related functions in favor of panic
- Remove SyslogWriter / AuditLogger separate types in favor of a single interface, Logger, that has all the logging methods on it.
- Merge mock log into logger. This allows us to unexport the internals but still override them in the mock.
- Shorten names to be compatible with Go style: New, Set, Get, Logger, NewMock, etc.
- Use a shorter log format for stdout logs.
- Remove "... Starting" log messages. We have better information in the "Versions" message logged at startup.
Motivation: The AuditLogger / SyslogWriter distinction was confusing and exposed internals only necessary for tests. Some components accepted one type and some accepted the other. This made it hard to consistently use mock loggers in tests. Also, the unnecessarily fat interface for AuditLogger made it hard to meaningfully mock out.
Pass log as an argument to SA. This allows us to mock it out.
Use a mockSA in CA test.
Use mockSA in orphan-finder test.
Improve logging from assert functions: Use our own printing style plus FailNow() so that each failure message isn't prefixed by "test-tools.go:60"
Remove duplicate TraceOn.
Part of #1642.
https://github.com/letsencrypt/boulder/pull/1683
This creates a new server, 'mail-test-srv', which is a simplistic SMTP
server that accepts mail and can report the received mail over HTTP.
An integration test is added that uses the new server to test the expiry
mailer.
The FAKECLOCK environment variable is used to force the expiry mailer to
think that the just-issued certificate is about to expire.
Additionally, the expiry mailer is modified to cleanly shut down its
SMTP connections.
If a certificate has already been issued with the same set of FQDNs, it
is considered to be renewed and no expiration mail is sent.
Also, use the connection string in the test/vars package instead of
copying it all around.