Neither our testing, staging, nor production configs use the
DBConfig.DBConnect config value. Remove it.
To connect to a database, you have to provide a connection URL. These
URLs often contain sensitive information such as DB usernames and
passwords, so we don't store them directly in our configs -- instead, we
store paths to files which contain these strings, and provision those
files via a separate mechanism. We maintained the ability to provide a
URL directly in the config for the sake of easy testing, but have not
used it for that purpose for some time now.
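A minimal sketch of the file-based pattern (the helper name is illustrative, not Boulder's actual API):

```go
package dbcfg

import (
	"fmt"
	"os"
	"strings"
)

// loadDBConnectURL reads a database connection URL from a separately
// provisioned file, so the sensitive URL never appears in the config
// itself; the config stores only the file's path.
func loadDBConnectURL(path string) (string, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("reading DB connect file %q: %w", path, err)
	}
	return strings.TrimSpace(string(b)), nil
}
```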
Turn bgrpc.NewServer into a builder pattern, with a config-based
initialization, multiple calls to Add to add new gRPC services, and a
final call to Build to produce the start() and stop() functions which
control server behavior. All calls are chainable to produce compact code
in each component's main() function.
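A sketch of the intended call shape in a component's main() (the service descriptors and Build arguments are assumptions for illustration, not the exact API):

```go
// Chainable construction: configure, register one or more services, then
// Build to get the start/stop pair that controls the server's lifecycle.
start, stop, err := bgrpc.NewServer(c.VA.GRPC).
	Add(&vapb.VA_ServiceDesc, vaImpl).
	Add(&vapb.CAA_ServiceDesc, vaImpl).
	Build(tlsConfig, scope, clk)
cmd.FailOnError(err, "Unable to set up VA gRPC server")
```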
This improves the process of creating a new gRPC server in three ways:
1) It avoids the need for generics/templating, which was slightly
verbose.
2) It allows the set of services to be registered on this server to be
known ahead of time.
3) It greatly streamlines adding multiple services to the same server,
which we use today in the VA and will be using soon in the SA and CA.
While we're here, add a new per-service config stanza to the
GRPCServerConfig, so that individual services on the same server can
have their own configuration. For now, only provide a "ClientNames" key,
which will be used in a follow-up PR.
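The new stanza might be shaped roughly like this (a sketch based on the description above; everything except the ClientNames key is an assumption):

```go
package cmd

// GRPCServerConfig sketch: each service registered on the server gets its
// own entry, keyed by gRPC service name, carrying per-service settings.
type GRPCServerConfig struct {
	Address  string                       `json:"address"`
	Services map[string]GRPCServiceConfig `json:"services"`
}

// GRPCServiceConfig holds settings for one service; for now just
// ClientNames, the client identities permitted to call this service.
type GRPCServiceConfig struct {
	ClientNames []string `json:"clientNames"`
}
```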
Part of #6454
- Add a new gRPC client config field which overrides the dNSName checked in the
certificate presented by the gRPC server (see the sketch after this list).
- Revert all test gRPC credentials to `<service>.boulder`
- Revert all ClientNames in gRPC server configs to `<service>.boulder`
- Set all gRPC clients in `test/config` to use `serverAddress` + `hostOverride`
- Set all gRPC clients in `test/config-next` to use `srvLookup` + `hostOverride`
- Rename incorrect SRV record for `ca` with port `9096` to `ca-ocsp`
- Rename incorrect SRV record for `ca` with port `9106` to `ca-crl`
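Under the hood, overriding the checked dNSName amounts to controlling the TLS ServerName used for certificate verification, independent of the address dialed. A minimal sketch with real grpc-go and crypto/tls APIs; the address, name, and function are illustrative:

```go
package grpcdial

import (
	"crypto/tls"
	"crypto/x509"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// dialWithHostOverride dials a backend by address while verifying the
// server certificate against a logical name instead. Boulder's config
// plumbing wraps the same idea.
func dialWithHostOverride(addr, hostOverride string, roots *x509.CertPool) (*grpc.ClientConn, error) {
	creds := credentials.NewTLS(&tls.Config{
		ServerName: hostOverride, // dNSName that must appear in the server cert
		RootCAs:    roots,
	})
	return grpc.Dial(addr, grpc.WithTransportCredentials(creds))
}
```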
Resolves #6424
- Fork the default `dns` resolver from `go-grpc` to add backend discovery via
DNS SRV resource records.
- Add new fields for SRV based discovery to `cmd.GRPCClientConfig`
- Add new (optional) field `DNSAuthority` for specifying custom DNS server to
`cmd.GRPCClientConfig`
- Add a utility method to `cmd.GRPCClientConfig` to simplify target URI and host
construction. With three schemes and `DNSAuthority` it makes more sense to
handle all of this parsing and construction outside of the RPC client
constructor.
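To make the three shapes concrete, a hedged sketch of the target URIs such a helper might produce (the exact schemes, hostnames, and method name are assumptions based on this description, not the real output):

```go
// Possible dial targets, one per configuration style (illustrative only):
//
//   static:///10.77.77.77:9095           // serverAddresses: fixed IP list
//   dns:///sa.example.com:9095           // serverAddress: ordinary lookup
//   dns://10.55.55.10:53/sa.example.com  // srvLookup + DNSAuthority: SRV
//                                        // discovery via the forked resolver
//
// Centralizing this in one cmd.GRPCClientConfig method keeps scheme-specific
// parsing out of the RPC client constructors.
target, host, err := c.TargetAndHost() // hypothetical method name
```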
Resolves #6111
Honeycomb was emitting logs directly to stderr like this:
```
WARN: Missing API Key.
WARN: Dataset is ignored in favor of service name. Data will be sent to service name: boulder
```
Fix this by providing a fake API key and replacing "dataset" with "serviceName" in configs. Also add missing Honeycomb configs for crl-updater.
For the stdout-only logger, include checksums and escape newlines.
Debug and Info messages still go to stdout.
Fix the CAA integration test, which asserted that stderr should be empty
when caa-log-checker finds a problem. That used to be the case because
we never logged to stderr, but it no longer is.
Update the logging docs.
Fixes #6324
The ioutil package has been deprecated since go1.16; the various
functions it provided now exist in the os and io packages. Replace all
instances of ioutil with either io or os, as appropriate.
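The replacements are mechanical; the standard library provides a one-to-one mapping (all real go1.16+ APIs):

```go
// Deprecated ioutil call             Replacement (go1.16+)
// ioutil.ReadFile(name)           -> os.ReadFile(name)
// ioutil.WriteFile(name, b, 0644) -> os.WriteFile(name, b, 0644)
// ioutil.ReadAll(r)               -> io.ReadAll(r)
// ioutil.TempDir(dir, pattern)    -> os.MkdirTemp(dir, pattern)
// ioutil.TempFile(dir, pattern)   -> os.CreateTemp(dir, pattern)
// ioutil.NopCloser(r)             -> io.NopCloser(r)
// ioutil.Discard                  -> io.Discard
```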
- Implement a static resolver for the gRPC dialer under the scheme `static:///`
which allows the dialer to resolve a backend from a static list of IPv4/IPv6
addresses passed via the existing JSON config (sketched after this list).
- Add config key `serverAddresses` to the `GRPCClientConfig` which, when
populated, enables static IP resolution of gRPC server backends.
- Set `config-next` to use static gRPC backend resolution for all SA clients.
- Generate a new SA certificate which adds `10.77.77.77` and `10.88.88.88` to
the SANs.
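For reference, this is roughly how a custom scheme plugs into grpc-go's resolver registry. The `google.golang.org/grpc/resolver` interfaces are real; the body is a simplified sketch of the idea, not Boulder's implementation, and assumes a recent grpc-go where `Target.Endpoint()` is a method:

```go
package static

import (
	"strings"

	"google.golang.org/grpc/resolver"
)

type staticBuilder struct{}

func (staticBuilder) Scheme() string { return "static" }

// Build parses the comma-separated address list out of the target endpoint
// and hands the full backend set to gRPC once, up front; there is nothing
// to re-resolve later.
func (staticBuilder) Build(target resolver.Target, cc resolver.ClientConn, _ resolver.BuildOptions) (resolver.Resolver, error) {
	var addrs []resolver.Address
	for _, a := range strings.Split(target.Endpoint(), ",") {
		addrs = append(addrs, resolver.Address{Addr: a})
	}
	err := cc.UpdateState(resolver.State{Addresses: addrs})
	return nopResolver{}, err
}

type nopResolver struct{}

func (nopResolver) ResolveNow(resolver.ResolveNowOptions) {}
func (nopResolver) Close()                                {}

func init() { resolver.Register(staticBuilder{}) }
```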
Resolves #6255
We have decided that we don't like the `if err := call(); err != nil`
syntax, because it creates confusing scopes, but we have not cleaned up
all existing instances of that syntax. However, we have now found a
case where that syntax enables a bug: it caused readers to believe that
a later `err = call()` statement was assigning to an already-declared
`err` in the local scope, when in fact it was assigning to an
already-declared `err` in the parent scope of a closure. This caused our
ineffassign and staticcheck linters to be unable to analyze the
lifetime of the `err` variable, and so they did not complain when we
never checked the actual value of that error.
This change standardizes on the two-line error checking syntax
everywhere, so that we can more easily ensure that our linters are
correctly analyzing all error assignments.
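Concretely, the two forms (a self-contained sketch of the style rule; call is a stand-in):

```go
package errstyle

import "errors"

func call() error { return errors.New("boom") }

// Discouraged: the combined form creates a scope that is easy to misread;
// inside a closure, a later `err = call()` can silently assign to an err
// declared in the parent scope.
func before() error {
	if err := call(); err != nil {
		return err
	}
	return nil
}

// Standardized two-line form: the assignment and the check are separate
// statements, so ineffassign and staticcheck can track every assignment.
func after() error {
	err := call()
	if err != nil {
		return err
	}
	return nil
}
```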
Boulder components initialize their gorp and gorp-less (non-wrapped) database
clients via two new SA helpers. These helpers handle client construction,
database metric initialization, and (for gorp only) debug logging setup.
Removes transaction isolation parameter `'READ-UNCOMMITTED'` from all database
connections.
Fixes #5715. Fixes #5889
Running an older version (v0.0.1-2020.1.4) of `staticcheck` in
whole-program mode (`staticcheck --unused.whole-program=true -- ./...`)
finds various instances of unused code which don't normally show up
as CI issues. I've used this to find and remove a large chunk of the
unused code, to pave the way for additional large deletions accompanying
the WFE1 removal.
Part of #5681
We'd prefer not to send these, as they don't contain a significant
amount of relevant information, and they nearly double the total
number of spans sent. However, not sending them is worse: the resulting
server spans created on the other side of the connection all have
parent span IDs that "don't exist" (are never sent to the tracing
service), breaking all ability of the tracing system to infer causality.
The ideal solution here would be to customize the client wrapper code
to not generate these client spans at all, but that requires upstream
changes that may take some time. In the meantime, let's at least do
this.
This changeset adds a second DB connect string for the SA for use in
read-only queries that are not themselves dependencies for read-write
queries. In other words, this is attempting to only catch things like
rate-limit `SELECT`s and other coarse-counting, so we can potentially
move those read queries off the read-write primary database.
It also adds a second DB connect string to the OCSP Updater. This is a
little trickier, as the subsequent `UPDATE`s _are_ dependent on the
output of the `SELECT`, but in this case it's operating on data batches,
and a few seconds' replication latency are several orders of magnitude
below the threshold for update frequency, so any certificates that
aren't caught on run `n` can be caught on run `n+1`.
Since we export DB metrics to Prometheus, this also refactors
`InitDBMetrics` to take a DB Address (host:port tuple) and User out of
the DB connection DSN and include those as labels in the metrics.
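Extracting those labels can lean on the driver's own DSN parser; mysql.ParseDSN is the real github.com/go-sql-driver/mysql helper, while the surrounding function is a sketch:

```go
package metrics

import "github.com/go-sql-driver/mysql"

// labelsFromDSN pulls host:port and user out of the DSN so they can become
// Prometheus labels instead of staying buried in the opaque connect string.
func labelsFromDSN(dsn string) (addr, user string, err error) {
	cfg, err := mysql.ParseDSN(dsn)
	if err != nil {
		return "", "", err
	}
	return cfg.Addr, cfg.User, nil
}
```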
Fixes #5550. Fixes #4985
The sampling rate integer is used by the span collector to estimate
"how many spans does this span actually represent". This allows accurate
volume comparisons: for example, if you sample successful requests at
a rate of 1/100 and error requests at a rate of 1/10, the trace query
interface will know to scale its query results by those respective
values in order to arrive at accurate error rate estimates.
Previously, this code was returning a sample rate integer of 0 to
indicate that the span was selected for sampling due to an extraordinary
circumstance. This was wrong. This change updates the sample rate int
to be 1, indicating that every such span which exhibited this feature
was sampled, and represents only itself.
Add a check to the Honeycomb SamplerHook so that it never sends
spans which have a "meta.type" of "grpc_client".
This field and value are set automatically by the Honeycomb gRPC
client interceptor, and can't be set by application code (any fields
set by application code have "app." prepended to their name).
Never sending these spans reduces our visibility into in-datacenter
network latency, but also reduces the number of spans sent by
roughly 50%.
Switch from using the honeycomb beeline's built-in sampling
to a sampler hook which bases its sampling decisions on a
hash of the trace ID. This allows us to do "deterministic"
sampling, where every span in a given trace will either be
sent or not (since the trace ID is the same across all spans
in a trace), giving us more complete traces.
This preserves the same simple (single integer) configuration
of the sample rate. The sample rate can be set differently for
different boulder components (e.g. 1 at the WFE, 100 at the
RA, and 1000 at the nonce-service), but the sampling rate
denominator should only increase towards the leaves of a
gRPC request path.
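Putting the last two entries together, a condensed sketch of such a hook in the beeline's SamplerHook shape (func(map[string]interface{}) (bool, int) is the real signature; the hashing details here are illustrative):

```go
package main

import (
	"crypto/sha1"
	"encoding/binary"
	"fmt"
)

const sampleRate = 100 // e.g. keep ~1/100 of traces at this component

// samplerHook decides, per span, whether to send it and what rate it
// represents. Hashing the trace ID makes the decision identical for every
// span in a trace, so traces arrive whole or not at all.
func samplerHook(fields map[string]interface{}) (bool, int) {
	// Drop spans auto-generated by the gRPC client interceptor entirely.
	if fields["meta.type"] == "grpc_client" {
		return false, 0
	}
	traceID, _ := fields["trace.trace_id"].(string)
	sum := sha1.Sum([]byte(traceID))
	if binary.BigEndian.Uint64(sum[:8])%sampleRate != 0 {
		return false, 0
	}
	// Each sent span represents sampleRate spans for volume estimates.
	return true, sampleRate
}

func main() {
	fmt.Println(samplerHook(map[string]interface{}{"trace.trace_id": "abc123"}))
}
```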
Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).
Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.
Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
This allows servers to tell clients to go away after some period of time, which triggers the clients to re-resolve DNS.
Per grpc/grpc#12295, this is the preferred way to do this.
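The server-side knob is grpc-go's keepalive MaxConnectionAge; a minimal sketch (the duration is illustrative, not our production value):

```go
package grpcsrv

import (
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// After MaxConnectionAge (plus a jittered grace period), the server sends
// GOAWAY; clients reconnect and re-resolve DNS in the process.
func newServer() *grpc.Server {
	return grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 5 * time.Minute,
	}))
}
```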
Related: #5307.
A field named `DB` in each component configuration struct acts as the
receiver for the value of `db` when component JSON files are
unmarshalled.
When `cmd.DBConfig` fields are received at the root of a component's
configuration struct instead of under `DB`, copy them to the `DB` field
of the component configuration struct.
Move existing `cmd.DBConfig` values from the root of each component's
JSON configuration in `test/config-next` to `db`.
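Shape-wise, the dual-location handling looks roughly like this (a sketch; Config stands in for each component's real struct, and the comparability of cmd.DBConfig is assumed):

```go
package config

import "github.com/letsencrypt/boulder/cmd"

// DBConfig values may arrive nested under "db" or, in older configs,
// flattened at the root; after unmarshalling, root-level values are
// copied into DB.
type Config struct {
	DB cmd.DBConfig `json:"db"`

	// The embedded struct catches deprecated root-level keys.
	cmd.DBConfig
}

func (c *Config) normalizeDB() {
	if c.DB == (cmd.DBConfig{}) {
		c.DB = c.DBConfig
	}
}
```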
Part of #5275
In #5235 we replaced MaxDBConns with MaxOpenConns.
One week ago MaxDBConns was removed from all dev, staging, and
production configurations. This change completes the removal of
MaxDBConns from all components and test/config.
Fixes #5249
Historically the only database/sql driver setting exposed via JSON
config was maxDBConns. This change adds support for maxIdleConns,
connMaxLifetime, connMaxIdleTime, and renames maxDBConns to
maxOpenConns. The addition of these settings will give our SRE team a
convenient method for tuning the reuse/closure of database connections.
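These four settings map directly onto database/sql's connection-pool knobs; all four calls below are stock standard-library methods, so the JSON keys are a thin layer over them:

```go
package dbtune

import (
	"database/sql"
	"time"
)

// tunePool applies the configured values to the connection pool.
func tunePool(db *sql.DB, maxOpen, maxIdle int, maxLifetime, maxIdleTime time.Duration) {
	db.SetMaxOpenConns(maxOpen) // was maxDBConns
	db.SetMaxIdleConns(maxIdle)
	db.SetConnMaxLifetime(maxLifetime)
	db.SetConnMaxIdleTime(maxIdleTime)
}
```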
A new struct, DBSettings, has been added to the SA. The struct, and each
of its fields, has been commented.
All new fields have been plumbed through to the relevant Boulder
components and exported as Prometheus metrics. Tests have been
added/modified to ensure that the fields are being set. There should be
no loss in coverage.
Deployability concerns for the migration from maxDBConns to maxOpenConns
have been addressed with the temporary addition of the helper method
cmd.DBConfig.GetMaxOpenConns(). This method can be removed once
test/config is defaulted to using maxOpenConns. Relevant sections of the
code have TODOs added that link back to a newly opened issue.
Fixes #5199
In all boulder services, we construct a single tls.Config object
and then pass it into multiple gRPC setup methods. In all boulder
services but one, we pass the object into multiple clients, and
just one server. In general, this is safe, because all of the client
setup happens on the main thread, and the server setup similarly
happens on the main thread before spinning off the gRPC server
goroutine.
In the CA, we do the above and pass the tlsConfig object into two
gRPC server setup functions. Thus the first server goroutine races
with the setup of the second server.
This change removes the post-hoc assignment of MinVersion,
MaxVersion, and CipherSuites on the tls.Config object passed
to grpc.ClientSetup and grpc.NewServer, and instead sets those
same values in cmd.TLSConfig.Load, the method responsible for
constructing the tls.Config object before it's passed to
grpc.ClientSetup and grpc.NewServer.
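The resulting pattern is to fully populate the tls.Config inside Load, before anything shares it; roughly (the pinned versions and cipher suite here are illustrative, not necessarily our production values):

```go
package tlsload

import (
	"crypto/tls"
	"crypto/x509"
)

// loadTLSConfig builds the complete tls.Config up front. Nothing mutates
// it after it is handed to gRPC client and server setup, which eliminates
// the race.
func loadTLSConfig(cert tls.Certificate, roots *x509.CertPool) *tls.Config {
	return &tls.Config{
		MinVersion:   tls.VersionTLS12,
		MaxVersion:   tls.VersionTLS12,
		CipherSuites: []uint16{tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256},
		Certificates: []tls.Certificate{cert},
		RootCAs:      roots,
	}
}
```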
Part of #5159
errors.As checks for a specific error in a wrapped error chain
(see https://golang.org/pkg/errors/#As) as opposed to asserting
that an error is of a specific type.
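For example (a small illustrative helper):

```go
package errex

import (
	"errors"
	"log"
	"net"
)

// classify demonstrates errors.As: it matches a *net.OpError anywhere in
// the wrapped chain, where a plain type assertion on err would not.
func classify(err error) {
	var opErr *net.OpError
	if errors.As(err, &opErr) {
		log.Printf("network failure during %s", opErr.Op)
	}
}
```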
Part of #5010
ACME Challenges are well-known strings ("http-01", "dns-01", and
"tlsalpn-01") identifying which kind of challenge should be used
to verify control of a domain. Because they are well-known and
only certain values are valid, it is better to represent them as
something more akin to an enum than as bare strings. This also
improves our ability to ensure that an AcmeChallenge is not
accidentally used as some other kind of string in a different
context. This change also brings them closer in line with the
existing core.AcmeResource and core.OCSPStatus string enums.
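In Go, such a string enum looks like this (constant names follow the pattern of Boulder's existing core enums; treat the exact identifiers as assumptions):

```go
package core

// AcmeChallenge is a string-typed enum of the well-known challenge names.
// The distinct type keeps a challenge from being passed where some other
// string is expected, while still marshalling as a plain JSON string.
type AcmeChallenge string

const (
	ChallengeTypeHTTP01    AcmeChallenge = "http-01"
	ChallengeTypeDNS01     AcmeChallenge = "dns-01"
	ChallengeTypeTLSALPN01 AcmeChallenge = "tls-alpn-01"
)
```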
Fixes #5009
This follows up on some refactoring we had done previously but not
completed. This removes various binary-specific config structs from the
common cmd package, and moves them into their appropriate packages. In
the case of CT configs, they had to be moved into their own package to
avoid a dependency loop between RA and ctpolicy.
- docker-rebuild isn't needed now that boulder and bhsm containers run directly off
the boulder-tools image.
- Remove DNS options from RA config.
- Remove GSB options from VA config.
* Remove the challenge whitelist
* Reduce the signatures of ChallengesFor and ChallengeTypeEnabled
* Some unit tests in the VA were changed from testing TLS-SNI to testing the same behavior
in TLS-ALPN, where that behavior wasn't already tested. For instance, timeouts during connect
are now tested.
Fixes #4109
The plural `serverAddresses` field in gRPC config has been deprecated for a bit now. We've removed the last usages of it in our staging/prod environments and can clear out the related code. Moving forward we only support a singular `serverAddress` and rely on DNS to direct to multiple instances of a given server.
Since #2633 we generate OCSP at first issuance, so we no longer need
this loop to check for new certificates that need OCSP status generated.
Since the associated SQL query is slow, we should just turn it off.
Also remove the configuration fields for the MissingSCTTick. The code
for that was already deleted.
This required a little rework of the RA issuance flow (adding parsing of the precert to determine the expiration date, and moving final cert parsing before final cert submission) and of the RA tests, but I think it shouldn't create any issues.
Fixes #3197.
Submits final certificates to any configured CT logs. This doesn't introduce a feature flag, as it is config-gated: any log we want to submit final certificates to needs to have its log description updated to include the `"submitFinalCerts": true` field.
Fixes #3605.
During periods of peak load, some RPCs are significantly delayed (on the order of seconds) by client-side blocking. HTTP/2 clients have to obey a "max concurrent streams" setting sent by the server. In Go's HTTP/2 implementation, this value [defaults to 250](https://github.com/golang/net/blob/master/http2/server.go#L56), so the gRPC default is also 250. So whenever there are more than 250 requests in progress at a time, additional requests will be delayed until there is a slot available.
During this peak load, we aren't hitting limits on CPU or memory, so we should increase the max concurrent streams limit to take better advantage of our available resources. This PR adds a config field to do that.
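The knob itself is grpc-go's stock MaxConcurrentStreams server option; plumbing it from config looks roughly like this (the config field and function are illustrative):

```go
package grpcmax

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"
)

// newGRPCServer raises the HTTP/2 max-concurrent-streams limit above the
// 250 default so bursts of in-flight RPCs are not queued client-side.
func newGRPCServer(maxStreams uint32, creds credentials.TransportCredentials) *grpc.Server {
	return grpc.NewServer(
		grpc.MaxConcurrentStreams(maxStreams),
		grpc.Creds(creds),
	)
}
```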
Fixes #3641.
Our various main.go functions gated some key code on whether the TLS
and/or GRPC config fields were present. Now that those fields are fully
deployed in production, we can simplify the code and require them.
Also, rename tls to tlsConfig everywhere to avoid confusion with the tls
package.
Avoid assigning to the same err from two different goroutines in
boulder-ca (fix a race).
Add a set of logs which will be submitted to but not relied on for their SCTs.
This allows us to test submissions to a particular log or submit to a log which
is not yet approved by a browser/root program.
Also add a feature which stops cancellation of remaining submissions when racing
to get an SCT from a group of logs.
Additionally add an informational log that always times out in config-next.
Fixes #3464 and fixes #3465.
This updates the PA component to allow authorization challenge types that are globally disabled if the account ID owning the authorization is on a configured whitelist for that challenge type.
Go's default is 2: https://golang.org/src/database/sql/sql.go#L686.
Graphs show we are opening 100-200 fresh connections per second on the SA.
Changing this default should reduce that a lot, which should reduce load on both
the SA and MariaDB. This should also improve latency, since every new TCP
connection adds a little bit of latency.
We removed most of the component-specific configs out of cmd a long time ago. We
left CAConfig in, because it is used directly in the NewCertificateAuthority
constructor. That means that moving CAConfig into cmd/boulder-ca would have
resulted in a circular dependency.
Eventually we probably want to decompose CAConfig so it's a set of arguments to
NewCertificateAuthority, but as a short term improvement, move the config into
its own package to break the dependency. This has the advantage of removing a
couple of big dependencies from cmd.