Remove `debugAddr` from the `admin` tool, which doesn't use it - or need
it, now that `newStatsRegistry` via `StatsAndLogging` doesn't require
it.
Remove `debugAddr` from `config-next/sfe.json`, as we usually set it on
the CLI instead.
Fixes#7838
OpenTelemetry has "semantic conventions" which are versioned
independently of the software package, as it describes the semantics of
the resources being produced. Previously, we'd combined
`resource.Default()` using the `Merge` function with our own resources.
Merge, however, doesn't handle merging resources with different semantic
conventions. This means that every dependabot PR that bumps otel will
break when the `resources.Default` has a new version.
That doesn't seem worth it for the default resources, so just provide
our own resources which have everything we care about. I've added the
PID which we didn't have before but will be interesting. We will lose
the SDK's version, but I don't think that matters.
For more discussion on this topic, see
https://github.com/open-telemetry/opentelemetry-go/issues/3769
Directly update:
- go.opentelemetry.io/otel/* from v1.26.0 to v1.27.0
- go.opentelemetry.io/contrib/* from v0.51.0 to v0.52.0
Indirectly update:
- google.golang.org/protobuf from v1.33.0 to v1.34.0
This update breaks some of our existing otel grpc interceptors, but in
return allows us to use the newer grpc StatsHandler mechanism, while
still filtering out health-check requests.
Fixes https://github.com/letsencrypt/boulder/issues/7235
Many services already have --addr and/or --debug-addr flags.
However, it wasn't universal, so this PR adds flags to commands where
they're not currently present.
This makes it easier to use a shared config file but listen on different
ports, for running multiple instances on a single host.
The config options are made optional as well, and removed from
config-next/.
From the Go specification:
> "3. If the map is nil, the number of iterations is 0."
https://go.dev/ref/spec#For_range
Therefore, an additional nil check for before the loop is unnecessary.
For "ordinary" errors like "file not found" for some part of the config,
we would prefer to log an error and exit without logging about a panic
and printing a stack trace.
To achieve that, we want to call `defer AuditPanic()` once, at the top
of `cmd/boulder`'s main. That's so early that we haven't yet parsed the
config, which means we haven't yet initialized a logger. We compromise:
`AuditPanic` now calls `log.Get()`, which will retrieve the configured
logger if one has been set up, or will create a default one (which logs
to stderr/stdout).
AuditPanic and Fail/FailOnError now cooperate: Fail/FailOnError panic
with a special type, and AuditPanic checks for that type and prints a
simple message before exiting when it's present.
This PR also coincidentally fixes a bug: panicking didn't previously
cause the program to exit with nonzero status, because it recovered the
panic but then did not explicitly exit nonzero.
Fixes#6933
This PR adds a new configuration block specifically for the otelhttp
instrumentation. This block is separate from the existing
"opentelemetry" configuration, and is only relevant when using otelhttp
instrumentation. It does not share any codepath with the existing
configuration, so it is at the top level to indicate which services it
applies to.
There's a bit of plumbing new configuration through. I've adopted the
measured_http package to also set up opentelemetry instead of just
metrics, which should hopefully allow any future changes to be smaller
(just config & there) and more consistent between the wfe2 and ocsp
responder.
There's one option here now, which disables setting
[otelhttp.WithPublicEndpoint](https://pkg.go.dev/go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp#WithPublicEndpoint).
This option is designed to do exactly what we want: Don't accept
incoming spans as parents of the new span created in the server.
Previously we had a setting to disable parent-based sampling to help
with this problem, which doesn't really make sense anymore, so let's
just remove it and simplify that setup path. The default of "false" is
designed to be the safe option. It's set to True in the test/ configs
for integration tests that use traces, and I expect we'll likely set it
true in production eventually once the LBs are configured to handle
tracing themselves.
Fixes#6851
This adds Jaeger's all-in-one dev container (with no persistent storage)
to boulder's dev docker-compose. It configures config-next/ to send all
traces there.
A new integration test creates an account and issues a cert, then
verifies the trace contains some set of expected spans.
This test found that async finalize broke spans, so I fixed that and a
few related spots where we make a new context.
This moves command() from cmd/ into core.Command() and uses it from the
log and main package, ensuring we have a single implementation of
path.Base(os.Args[0]) instead of scattering that method around. I put it
in core/util.go as similar enough functions like the BuildID and
revision info already live there, though I am not entirely sure it's the
right place.
This came up in @aarongable's review of #6750, where he pointed out I
was manually hardcoding commands instead of using command() or similar.
Add a new shared config stanza which all boulder components can use to
configure their Open Telemetry tracing. This allows components to
specify where their traces should be sent, what their sampling ratio
should be, and whether or not they should respect their parent's
sampling decisions (so that web front-ends can ignore sampling info
coming from outside our infrastructure). It's likely we'll need to
evolve this configuration over time, but this is a good starting point.
Add basic Open Telemetry setup to our existing cmd.StatsAndLogging
helper, so that it gets initialized at the same time as our other
observability helpers. This sets certain default fields on all
traces/spans generated by the service. Currently these include the
service name, the service version, and information about the telemetry
SDK itself. In the future we'll likely augment this with information
about the host and process.
Finally, add instrumentation for the HTTP servers and grpc
clients/servers. This gives us a starting point of being able to monitor
Boulder, but is fairly minimal as this PR is already somewhat unwieldy:
It's really only enough to understand that everything is wired up
properly in the configuration. In subsequent work we'll enhance those
spans with more data, and add more spans for things not automatically
traced here.
Fixes https://github.com/letsencrypt/boulder/issues/6361
---------
Co-authored-by: Aaron Gable <aaron@aarongable.com>
Although #6771 significantly cleaned up how gRPC services stop and clean
up, it didn't make any changes to our HTTP servers or our non-server
(e.g. crl-updater, log-validator) processes. This change finishes the
work.
Add a new helper method cmd.WaitForSignal, which simply blocks until one
of the three signals we care about is received. This easily replaces all
calls to cmd.CatchSignals which passed `nil` as the callback argument,
with the added advantage that it doesn't call os.Exit() and therefore
allows deferred cleanup functions to execute. This new function is
intended to be the last line of main(), allowing the whole process to
exit once it returns.
Reimplement cmd.CatchSignals as a thin wrapper around cmd.WaitForSignal,
but with the added callback functionality. Also remove the os.Exit()
call from CatchSignals, so that the main goroutine is allowed to finish
whatever it's doing, call deferred functions, and exit naturally.
Update all of our non-gRPC binaries to use one of these two functions.
The vast majority use WaitForSignal, as they run their main processing
loop in a background goroutine. A few (particularly those that can run
either in run-once or in daemonized mode) still use CatchSignals, since
their primary processing happens directly on the main goroutine.
The changes to //test/load-generator are the most invasive, simply
because that binary needed to have a context plumbed into it for proper
cancellation, but it already had a custom struct type named "context"
which needed to be renamed to avoid shadowing.
Fixes https://github.com/letsencrypt/boulder/issues/6794
Enable the errcheck linter. Update the way we express exclusions to use
the new, non-deprecated, non-regex-based format. Fix all places where we
began accidentally violating errcheck while it was disabled.
The CA, RA, and VA have multiple goroutines running alongside primary
gRPC handling goroutine. These ancillary goroutines should be gracefully
shut down when the process is about to exit. Historically, we have
handled this by putting a call to each of these goroutine's shutdown
function inside cmd.CatchSignals, so that when a SIGINT is received, all
of the various cleanup routines happen in sequence.
But there's a cleaner way to do it: just use defer! All of these
cleanups need to happen after the primary gRPC server has fully shut
down, so that we know they stick around at least as long as the service
is handling gRPC requests. And when the service receives a SIGINT,
cmd.CatchSignals will call the gRPC server's GracefulStop, which will
cause the server's .Serve() to finally exit, which will cause start() to
exit, which will cause main() to exit, which will cause all deferred
functions to be run.
In addition, remove filterShutdownErrors as the bug which made it
necessary (.Serve() returning an error even when GracefulShutdown() is
called) was fixed back in 2017. This allows us to call the start()
function in a much more natural way, simply logging any error it returns
instead of calling os.Exit(1) if it returns an error.
This allows us to simplify the exit-handling code in these three
services' main() functions, and lets us be a bit more idiomatic with our
deferred cleanup functions.
Part of #6794
- Require `letsencrypt/validator` package.
- Add a framework for registering configuration structs and any custom
validators for each Boulder component at `init()` time.
- Add a `validate` subcommand which allows you to pass a `-component`
name and `-config` file path.
- Expose validation via exported utility functions
`cmd.LookupConfigValidator()`, `cmd.ValidateJSONConfig()` and
`cmd.ValidateYAMLConfig()`.
- Add unit test which validates all registered component configuration
structs against test configuration files.
Part of #6052
Add go1.20 as a new version to run tests on, and to build release
artifacts from. Fix one test which was failing because it was
accidentally relying on consistent (i.e. unseeded) non-cryptographic
random number generation, which go1.20 now automatically seeds at import
time.
Update the version of golangci-lint used in our docker containers to the
new version that has go1.20 support. Remove a number of nolint comments
that were required due to an old version of the gosec linter.
Turn bgrpc.NewServer into a builder-pattern, with a config-based
initialization, multiple calls to Add to add new gRPC services, and a
final call to Build to produce the start() and stop() functions which
control server behavior. All calls are chainable to produce compact code
in each component's main() function.
This improves the process of creating a new gRPC server in three ways:
1) It avoids the need for generics/templating, which was slightly
verbose.
2) It allows the set of services to be registered on this server to be
known ahead of time.
3) It greatly streamlines adding multiple services to the same server,
which we use today in the VA and will be using soon in the SA and CA.
While we're here, add a new per-service config stanza to the
GRPCServerConfig, so that individual services on the same server can
have their own configuration. For now, only provide a "ClientNames" key,
which will be used in a follow-up PR.
Part of #6454
Emit a metric (`version`) with a constant value of '1' labeled by the short
commit-id (`buildId`), build timestamp in RFC3339 format (`buildTime`), and Go
release tag like 'go1.3' (`goVersion`) from which the Boulder binary was built.
Resolves#6405
Right now, Boulder expects to be able to connect to syslog, and panics
if it's not available. We'd like to be able to log to stdout/stderr as a
replacement for syslog.
- Add a detailed timestamp (down to microseconds, same as we collect in
prod via syslog).
- Remove the escape codes for colorizing output.
- Report the severity level numerically rather than with a letter prefix.
Add locking for stdout/stderr and syslog logs. Neither the [syslog] package
nor the [os] package document concurrency-safety, and the Go rule is: if
it's not documented to be concurrent-safe, it's not. Notably the [log.Logger]
package is documented to be concurrent-safe, and a look at its implementation
shows it uses a Mutex internally.
Remove places that use the singleton `blog.Get()`, and instead pass through
a logger from main in all the places that need it.
[syslog]: https://pkg.go.dev/log/syslog
[os]: https://pkg.go.dev/os
[log.Logger]: https://pkg.go.dev/log#Logger
Run the Boulder unit and integration tests with go1.19.
In addition, make a few small changes to allow both sets of
tests to run side-by-side. Mark a few tests, including our lints
and generate checks, as go1.18-only. Reformat a few doc
comments, particularly lists, to abide by go1.19's stricter gofmt.
Causes #6275
As of the update from 1.7.1 to 1.12.1 (#5974), the
`prometheus.NewFooCollector` methods have been deprecated
and replaced by equivalent `collectors.NewFooCollector` methods.
Make the `grpcLogger`'s `Warning` methods call the underlying
`Warning` methods on our own logger, rather than upgrading the
severity to `Error`. Move our two filters from the error methods to
the warning methods, since they are warning-level messages in
gRPC. Improve code organization and documentation to make
this struct easier to read next time.
Running an older version (v0.0.1-2020.1.4) of `staticcheck` in
whole-program mode (`staticcheck --unused.whole-program=true -- ./...`)
finds various instances of unused code which don't normally show up
as CI issues. I've used this to find and remove a large chunk of the
unused code, to pave the way for additional large deletions accompanying
the WFE1 removal.
Part of #5681
Currently, `cmd.FailOnError` both audit-logs the message and error, and
prints it directly to stderr. It does both because originally it only
printed to stderr, and the audit logging capability was added later.
Since audit logs are printed to stderr anyway, and since printing to
stderr without going through the logger produces lines of output that
violate the log-validator's expected checksums, remove the now-redundant
print.
Fixes#5790
We had in place some filtering for grpc errors that we consider
spurious, but that filtering was broken. This change ensures the
filtering gets called regardless of which of the various error/warning
methods grpc calls. This removes a lot of unnecessary red from our
integration test output.
Instead of using the default `json.Unmarshal`, explicitly
construct and use a `json.Decoder` so that we can set the
`DisallowUnknownFields` flag on the decoder. This causes
any unrecognized config keys to result in errors at boulder
startup time.
Fixes#5643
Make the `NonCFSSLSigner` code path the only code path through the CA.
Remove all code related to the old, CFSSL-based code path. Update tests
to supply (or mock) issuers of the new kind. Remove or simplify a few
tests that were testing for behavior only exhibited by the old code
path, such as incrementing certain metrics. Remove code from `//cmd/`
for initializing the CFSSL library. Finally, mark the `NonCFSSLSigner`
feature flag itself as deprecated.
Delete the portions of the vendored CFSSL code which were only used
by these deleted code paths. This does not remove the CFSSL library
entirely, the rest of the cleanup will follow shortly.
Part of #5115
Having both of these very similar methods sitting around
only serves to increase confusion. This removes the last
few places which use `cmd.LoadCert` and replaces them
with `core.LoadCert`, and deletes the method itself.
Fixes#5163
This builds on #4665 and #4781. The problem we had previously was that
we were relying on a goroutine to consume bytes from a pipe in a
non-blocking manner, which meant that log.Fatal would cause us to exit
before writing out the data.
This version implements an io.Writer so we can make sure the log line
gets written in a blocking manner.
The problem with this approach is that there is no way to guarantee the output
is copied to syslog / stdout before shutdown. This is particularly evident when
`log.Fatal` is used, because that calls `os.Exit` immediately after `l.Output`,
creating a race condition where the log line might or might not get printed
before the program exits.
Reverting this change means that in case some component does call `log.Fatal`
we'll still get the output from stdout.
This also changes one instance in cmd/shell.go where we call `log.Fatal` to use
`logger.Errf`.
Since Boulder's log system adds checksums to lines, but log-validator
processes entries on a per-line basis, including newlines in log
messages can cause a validation failure.
Some components, particularly net/http, occasionally output log lines
via log.Print. We'd like to capture these and send them to rsyslog so
all our log data goes to the same place, and so that we can attach log
line checksums to them.
This uses log.SetOutput to change the log output to an io.Pipe,
then consumes that buffer line-by-line in a goroutine and sends it to
our rsyslog logger.
This seems to tickle an unrelated race condition in test/ocsp/helper.go,
so I fixed that too.
Also filters out a noisy and unimportant error from the grpcLog handler.
Fixes#4664Fixes#4628
In a handful of places I've nuked old stats which are not used in any alerts or dashboards as they either duplicate other stats or don't provide much insight/have never actually been used. If we feel like we need them again in the future it's trivial to add them back.
There aren't many dashboards that rely on old statsd style metrics, but a few will need to be updated when this change is deployed. There are also a few cases where prometheus labels have been changed from camel to snake case, dashboards that use these will also need to be updated. As far as I can tell no alerts are impacted by this change.
Fixes#4591.
* cmd: update prometheus.NewProcessCollector args.
There's a new struct `prometheus.ProcessCollectorOpts` that is expected
to be used as the sole argument to `prometheus.NewProcessCollector`. We
don't need to specify `os.Getpid` as the `PidFn` of the struct because
the default is to assume `os.Getpid`. Similarly we don't need to set the
namespace to `""` explicitly, it is the default.
* SA: reimplement db metrics as custom collector.
The modern Prometheus golang API supports translating between legacy
metric sources on the fly with a custom collector. We can use this
approach to collect the metrics from `gorp.DbMap`'s via the `sql.DB`
type's `Stats` function and the returned `sql.DbStats` struct.
This is a cleaner solution overall (we can lose the DB metrics updating
go routine) and it avoids the need to use the now-removed `Set` method
of the `prometheus.Counter` type.
* test: Update CountHistogramSamples.
The `With` function of `prometheus.HistogramVec` types we tend to use as
the argument to `test.CountHistogramSamples` changed to return
a `prometheus.Observer`. Since we only use this function in test
contexts, and only with things that cast back to
a `prometheus.Histogram` we take that approach to fix the problem
without updating call-sites.
The idea expressed in this comment isn't representative of the
Boulder cmds. E.g. There's no top level "App Shell" in use and the
`NewAppShell`, `Action` and `Run` functions ref'd do not exist.
These errors show up in the Publisher at shutdown during integration
test runs, because the Publisher is trying to write responses from RPCs
that were slow due to the ct-test-srv's LatencySchedule. This
specifically happens only for the optional submission of "final"
certificates.
The gRPC INFO log lines clutter up integration test output, and we've never
had a use for them in production (they are mostly about details of
connection status).
A very large number of the logger calls are of the form log.Function(fmt.Sprintf(...)).
Rather than sprinkling fmt.Sprintf at every logger call site, provide formatting versions
of the logger functions and call these directly with the format and arguments.
While here remove some unnecessary trailing newlines and calls to String/Error.