Run staticcheck as a standalone binary rather than as a library via
golangci-lint. From the golangci-lint help output:
> staticcheck (megacheck): It's a set of rules from staticcheck. It's
not the same thing as the staticcheck binary. The author of staticcheck
doesn't support or approve the use of staticcheck as a library inside
golangci-lint.
We decided to disable ST1000 which warns about incorrect or missing
package comments.
For SA4011, I chose to change the semantics[1] of the for loop rather
than ignoring the SA4011 lint for that line.
Fixes https://github.com/letsencrypt/boulder/issues/6988
1. https://go.dev/ref/spec#Continue_statements
When expiration-mailer attempts to send nag emails, and
the result is a "Bad Address" error, mark the certificates in
question as having had their last expiration nag sent, so we
don't keep retrying them every time expiration-mailer runs.
To facilitate this, factor out more of the code which performs
the database updates into a more robust helper function, and
optimize it to perform all of the updates at once.
Fixes #6185
This reverts commit 7ef6913e71.
We turned on the `ExpirationMailerDontLookTwice` feature flag in prod, and it's
working fine but not clearing the backlog. Since
https://github.com/letsencrypt/boulder/pull/6100 fixed the issue that caused us
to (nearly) stop sending mail when we deployed #6057, this should be safe to
roll forward.
The revert of the revert applied cleanly, except for `expiration-mailer/main.go`
and `main_test.go`, particularly around the contents of `processCerts` (where
`sendToOneRegID` was extracted from) and `sendToOneRegID` itself. So those areas
are good targets for extra attention.
When deployed, the newly-parallel expiration-mailer encountered
unexpected difficulties and dropped to apparently sending nearly zero
emails despite not throwing any real errors. Reverting the parallelism
change until we understand and can fix the root cause.
This reverts two commits:
- Allow expiration mailer to work in parallel (#6057)
- Fix data race in expiration-mailer test mocks (#6072)
It also modifies the revert to leave the new `ParallelSends` config key
in place (albeit completely ignored), so that the binary containing this
revert can be safely deployed regardless of config status.
Part of #5682
Previously, each account's email would be sent in serial,
along with several reads from the database (to check for
certificate renewal) and several writes to the database (to update
`certificateStatus.lastExpirationNagSent`). This adds a config field
for the expiration mailer that sets the parallelism it will use.
That means making and using multiple SMTP connections as well. Previously,
`bmail.Mailer` was not safe for concurrent use. It also had a piece of
API awkwardness: after you created a Mailer, you had to call Connect on
it to change its state.
Instead of treating that as a state change on Mailer, I split out a
separate component: `bmail.Conn`. Now, when you call `Mailer.Connect()`,
you get a Conn. You can send mail on that Conn and Close it when you're
done. A single Mailer instance can produce multiple Conns, so Mailer is
now concurrency-safe (while Conn is not).
This involved a moderate amount of renaming and code movement, and
GitHub's move detector is not keeping up 100%, so an eye towards "is
this moved code?" may help. Adding `?w=1` to the diff URL to ignore
whitespace diffs may also help.
errors.As checks for a specific error in a wrapped error chain
(see https://golang.org/pkg/errors/#As) as opposed to asserting
that an error is of a specific type.
Part of #5010
One of the log lines describes the most frequent address corresponding
to a number of accounts, but it actually corresponds to a number of
lines in the input CSV.
Also, now that we escape newlines in log output, the dryRunMailer's
output looks messed up. Split the message body into lines and emit one
log message per line.
If a dryRunClient is used it will cause panics when SendMail is called because the
constructor doesn't populate the prometheus counter. This fix populates the counter.
In a handful of places I've nuked old stats which are not used in any alerts or dashboards as they either duplicate other stats or don't provide much insight/have never actually been used. If we feel like we need them again in the future it's trivial to add them back.
There aren't many dashboards that rely on old statsd style metrics, but a few will need to be updated when this change is deployed. There are also a few cases where prometheus labels have been changed from camel case to snake case; dashboards that use these will also need to be updated. As far as I can tell no alerts are impacted by this change.
Fixes #4591.
This fixes two bugs:
1. `resetAndError` would be called on every error, including `io.EOF`, which is returned when the connection is terminated. Calling `m.client.Reset()` after an `io.EOF` will result in another error, causing us to wrap the `io.EOF` with an `errors.errorString`. This broke a check in `sendMail` that was used to cause a reconnect.
2. There was an error type assertion that assumed the type without checking it, which could result in a panic when an error of an unexpected type was returned.
Instead of just on 401. I pulled the various error codes from a handful of SMTP docs I
could find; they could probably use a once-over by others, though.
Resolves https://github.com/letsencrypt/boulder/issues/4019
I can't find RFC verse and chapter for "401 4.1.3" errors, but [IANA's registry of SMTP enhanced status codes](https://www.iana.org/assignments/smtp-enhanced-status-codes/smtp-enhanced-status-codes.xhtml) does show an entry matching `x.1.3`:
```
X.1.3 | Bad destination mailbox address syntax | 501 | The destination address was syntactically invalid. This can apply to any field in the address. This code is only useful for permanent failures. | [RFC3463] (Standards Track) | G. Vaudreuil | IESG
```
However that entry from IANA says the "associated basic code" is 501, not 401.
Since we wrote this tool to talk to exactly one SMTP server in the world, and it definitely is returning "401 4.1.3" in some cases, I think it's reasonable to handle it as I've done in this PR. Alternative suggestions welcome.
A very large number of the logger calls are of the form log.Function(fmt.Sprintf(...)).
Rather than sprinkling fmt.Sprintf at every logger call site, provide formatting versions
of the logger functions and call these directly with the format and arguments.
While here remove some unnecessary trailing newlines and calls to String/Error.
* Remove non-TLS support from mailer entirely
* Add a config option for trusted roots in expiration-mailer. If unset, it defaults to the system roots, so this does not need to be set in production.
* Use TLS in mail-test-srv, along with an internal root and localhost certificates signed by that root.
### Connect before sending mail, not at startup
Per #2250, when we connect to the remote SMTP server at start-up time by calling `mailer.Connect()`, but do not actually call `mailer.SendMail()` until after we have done some potentially expensive/time-consuming work, we are liable to have our connection closed due to timeout.
This PR moves the `Connect()` call in `expiration-mailer` and `notify-mailer` to be closer to where the actual messages are sent via `SendMail()` and resolves #2250.
### Handle SMTP 421 errors gracefully
Issue #2249 describes a case where we see this SMTP error code from the remote server when our connection has been idle for too long. This would manifest when connecting to the remote server at startup, running a very long database query, and then sending mail. This commit allows the mailer to treat SMTP 421 errors as an event that should produce a reconnect attempt and resolves #2249.
A unit test is added to the mailer tests to test that reconnection works when the server sends an SMTP 421 error. Prior to b64e51f and support for SMTP 421 reconnection this test failed in a manner matching issue #2249:
```
go test -p 1 -race --test.run TestReconnectSMTP421
github.com/letsencrypt/boulder/mail
Wrote goodbye msg: 421 1.2.3 green.eggs.and.spam Error: timeout exceeded
Cutting off client early
--- FAIL: TestReconnectSMTP421 (0.00s)
mailer_test.go:257: Expected SendMail() to not fail. Got err: 421
1.2.3 green.eggs.and.spam Error: timeout exceeded
FAIL
FAIL github.com/letsencrypt/boulder/mail 0.023s
```
With b64e51f the test passes and the client gracefully reconnects.
The existing reconnect testing logic in the `mail-test-srv` integration tests is changed such that half of the forced disconnects are a normal clean connection close and half are an SMTP 421. This allows the existing integration test for server disconnects to be reused to test the 421 reconnect logic.
The `MailerImpl` gains two new fields (`retryBase` and `retryMax`). These are used with `core.RetryBackoff` in `reconnect()` to implement exponential backoff in a reconnect attempt loop. Both `expiration-mailer` and `notify-mailer` are modified to add CLI args for these two flags and to wire them into the `MailerImpl` via its `New()` constructor.
In `MailerImpl`'s `SendMail()` function it now detects when `sendOne` returns an `io.EOF` error indicating that the server closed the connection unexpectedly. When this case occurs `reconnect()` is invoked. If the reconnect succeeds then we invoke `sendOne` again to try and complete the message sending operation that was interrupted by the disconnect.
For integration testing purposes I modified the `mail-test-srv` to support a `-closeChance` parameter between 0 and 100. This controls what % of `MAIL` commands will result in the server immediately closing the client connection before further processing. This allows us to simulate a flaky mailserver. `test/startservers.py` is modified to start the `mail-test-srv` with a 35% close chance to thoroughly test the reconnection logic during the existing `expiration-mailer` integration tests. I took this as a chance to do some slight clean-up of the `mail-test-srv` code (mostly removing global state).
For unit testing purposes I modified the mailer `TestConnect` test to abstract out a server that can operate similar to `mail-test-srv` (e.g. can close connections artificially).
This is testing a server that **closes** a connection, and not a server that **goes away/goes down**. In particular, the `core.RetryBackoff` sleeps themselves are not being tested. The client is disconnected and attempts a reconnection which always succeeds on the first try. To test a "gone away" server would require a more substantial rewrite of the unit tests and the `mail-test-srv`/integration tests. I think this matches the experience we have with MailChimp/Mandrill closing long-lived connections.
While finalizing the testing for #2101 I noticed that the `notify-mailer` would panic when `-dryRun=true` (i.e. the default value):
```
E150646 notify-mailer [AUDIT] Panic caused by err: runtime error: invalid memory address or nil pointer dereference
E150646 notify-mailer [AUDIT] Stack Trace (Current frame) goroutine 1 [running]:
github.com/letsencrypt/boulder/log.(*impl).AuditPanic(0xc820167610)
/home/daniel/go/src/github.com/letsencrypt/boulder/log/log.go:190 +0x190
panic(0x9804a0, 0xc82000e120)
/usr/local/go/src/runtime/panic.go:443 +0x4e9
github.com/letsencrypt/boulder/metrics.(*StatsdScope).Inc(0x0, 0xa9ad30, 0x11, 0x1, 0x0, 0x0)
/home/daniel/go/src/github.com/letsencrypt/boulder/metrics/scope.go:68 +0xd5
github.com/letsencrypt/boulder/mail.(*MailerImpl).SendMail(0xc8201dc360, 0xc8203d5150, 0x1, 0x1, 0x7ffeda89737c, 0x1a, 0xc8201f0000, 0x517, 0x0, 0x0)
/home/daniel/go/src/github.com/letsencrypt/boulder/mail/mailer.go:290 +0x108
main.(*mailer).run(0xc820175ea8, 0x0, 0x0)
/home/daniel/go/src/github.com/letsencrypt/boulder/cmd/notify-mailer/main.go:108 +0x362
main.main()
/home/daniel/go/src/github.com/letsencrypt/boulder/cmd/notify-mailer/main.go:365 +0x142e
```
This was caused by the `NewDryRun` constructor not initializing the `stats` member of the `MailerImpl` and is fixed in this commit.
This PR adds a `printStatus` function that is called every iteration of the mailer's `run()` loop. The status output is logged at the `info` level and includes the destination email, the current message being sent, the total number of messages to send, and the elapsed time since `run()` started.
The status output can be disabled by lowering the default syslog level in the `notify-mailer` config.
Additionally, this PR adds stats support for the mailer package. Three new stats are
published during the `MailerImpl`'s `SendMail` function (called in a loop by the mailer utilities):
`Mailer.SendMail.Attempts`
`Mailer.SendMail.Successes`
`Mailer.SendMail.Errors`
This PR removes two stats from the `expiration-mailer` that are redundant copies of the above:
`Mailer.Expiration.Errors.SendingNag.SendFailure`
`Mailer.Expiration.Sent`
This resolves #2026.
This PR adds a check in the registration authority for non-ASCII characters in an email address. This is due to a 'funky email implementation'.
Fixes #1350
This commit adds a new notify-mailer command. Outside of the new command, this PR also:
Adds a new SMTPConfig to cmd/config.go that is shared between the expiration mailer and the notify mailer.
Modifies mail/mailer.go to add an smtpClient interface.
Adds a dryRunClient to mail/mailer.go that implements the smtpClient interface.
Modifies the mail/mailer.go MailerImpl and constructor to use the SMTPConfig and a dialer. The missing functions from the smtpClient interface are added.
The notify-mailer command supports checkpointing through --start and --end parameters. It supports dry runs by using the new dryRunClient from the mail package when given the --dryRun flag. The speed at which emails are sent can be tweaked using the --sleep flag.
Unit tests for notify-mailer's checkpointing behaviour, the checkpoint interval/sleep parameter sanity, the sleep behaviour, and the message content construction are included in main_test.go.
Future work:
A separate command to generate the list of destination emails provided to notify-mailer
Support for using registration IDs as input and resolving the email address at runtime.
Resolves #1928. Credit to @jsha for the initial work - I'm just completing the branch he started.
* Adds `notify-mailer` command.
* Adds a new SMTPConfig to `cmd/config.go` that is shared between the
expiration mailer and the notify mailer.
* Modifies `mail/mailer.go` to add an `smtpClient` interface.
* Adds a `dryRunClient` to `mail/mailer.go` that implements the
`smtpClient` interface.
* Modifies the `mail/mailer.go` `MailerImpl` and constructor to use the
SMTPConfig and a dialer. The missing functions from the `smtpClient`
interface are added.
* Fix errcheck warnings
* Review feedback
* Review feedback pt2
* Fixes #1446 - invalid message-id generation.
* Change -configFile to -config
* Test message ID with friendly email
https://github.com/letsencrypt/boulder/pull/1936
* Fix all errcheck errors
* Add errcheck to test.sh
* Add a new sa.Rollback method to make handling errors in rollbacks easier.
This also causes a behavior change in the VA. If an HTTP connection is
abruptly closed after serving the headers for a non-200 response, the
reported error will be the read failure instead of the non-200.
This creates a new server, 'mail-test-srv', which is a simplistic SMTP
server that accepts mail and can report the received mail over HTTP.
An integration test is added that uses the new server to test the expiry
mailer.
The FAKECLOCK environment variable is used to force the expiry mailer to
think that the just-issued certificate is about to expire.
Additionally, the expiry mailer is modified to cleanly shut down its
SMTP connections.