Commit Graph

880 Commits

Author SHA1 Message Date
Daniel McCarney 299e53b237 RA,CA: Refuse to start with MaxNames == 0. (#3634)
This commit updates the `boulder-ra` and `boulder-ca` commands to refuse
to start if their configured `MaxNames` is 0 (the default value). This
should always be set to a positive number.

This commit also updates `csr/csr.go` to always apply the max names
check since it will never be 0 after the change above.

Also refactor `FailOnError` to pull out a separate `Fail` function.

Related to https://github.com/letsencrypt/boulder/issues/3632
2018-04-10 10:53:23 -07:00
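A minimal sketch of the startup check described in 299e53b237, assuming a simplified configuration struct. The `config` type and both function names are hypothetical; only the `MaxNames` field and the "refuse to start when it is 0" behaviour come from the commit message.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a hypothetical stand-in for the RA/CA configuration.
// Only the MaxNames field is taken from the commit message.
type config struct {
	MaxNames int
}

var errMaxNamesUnset = errors.New("MaxNames must be set to a positive number")

// validateMaxNames rejects the zero (default) value so that the command
// can refuse to start instead of running without a names limit.
func validateMaxNames(c config) error {
	if c.MaxNames <= 0 {
		return fmt.Errorf("%w (got %d)", errMaxNamesUnset, c.MaxNames)
	}
	return nil
}

// checkNameCount applies the limit unconditionally. That is safe once
// startup guarantees the limit is positive.
func checkNameCount(names []string, limit int) error {
	if len(names) > limit {
		return fmt.Errorf("CSR contains %d names, more than the maximum of %d", len(names), limit)
	}
	return nil
}
```

A command would call `validateMaxNames` once while loading its config and exit on error, after which `checkNameCount` needs no special case for 0.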
Roland Bracewell Shoemaker cc5ec34539 Allow configuration of multiple DNS resolvers (#3612)
* Allow configuration of multiple DNS resolvers
* Use multiple DNS resolvers in integration tests

Fixes #3611.
2018-04-05 11:51:22 -04:00
Daniel McCarney 590dca0fe1
Cert-checker: Update certlint, add CN/SAN==PSL err ignore. (#3600)
* Update `globalsign/certlint` to d4a45be.

This commit updates the `github.com/globalsign/certlint` dependency to
the latest tip of master (d4a45be06892f3e664f69892aca79a48df510be0).

Unit tests are confirmed to pass:
```
$ go test ./...
ok    github.com/globalsign/certlint  3.816s
ok    github.com/globalsign/certlint/asn1 (cached)
?     github.com/globalsign/certlint/certdata [no test files]
?     github.com/globalsign/certlint/checks [no test files]
?     github.com/globalsign/certlint/checks/certificate/aiaissuers  [no
test files]
?     github.com/globalsign/certlint/checks/certificate/all [no test
files]
?     github.com/globalsign/certlint/checks/certificate/basicconstraints
[no test files]
?     github.com/globalsign/certlint/checks/certificate/extensions  [no
test files]
?     github.com/globalsign/certlint/checks/certificate/extkeyusage [no
test files]
ok    github.com/globalsign/certlint/checks/certificate/internal
(cached)
?     github.com/globalsign/certlint/checks/certificate/issuerdn  [no
test files]
?     github.com/globalsign/certlint/checks/certificate/keyusage  [no
test files]
?     github.com/globalsign/certlint/checks/certificate/publickey [no
test files]
?     github.com/globalsign/certlint/checks/certificate/publickey/goodkey
[no test files]
ok    github.com/globalsign/certlint/checks/certificate/publicsuffix
(cached)
?     github.com/globalsign/certlint/checks/certificate/revocation  [no
test files]
?     github.com/globalsign/certlint/checks/certificate/serialnumber
[no test files]
?     github.com/globalsign/certlint/checks/certificate/signaturealgorithm
[no test files]
ok    github.com/globalsign/certlint/checks/certificate/subject (cached)
ok    github.com/globalsign/certlint/checks/certificate/subjectaltname
(cached)
?     github.com/globalsign/certlint/checks/certificate/validity  [no
test files]
?     github.com/globalsign/certlint/checks/certificate/version [no test
files]
?     github.com/globalsign/certlint/checks/certificate/wildcard  [no
test files]
?     github.com/globalsign/certlint/checks/extensions/adobetimestamp
[no test files]
?     github.com/globalsign/certlint/checks/extensions/all  [no test
files]
?     github.com/globalsign/certlint/checks/extensions/authorityinfoaccess
[no test files]
?     github.com/globalsign/certlint/checks/extensions/authoritykeyid
[no test files]
?     github.com/globalsign/certlint/checks/extensions/basicconstraints
[no test files]
?     github.com/globalsign/certlint/checks/extensions/crldistributionpoints
[no test files]
?     github.com/globalsign/certlint/checks/extensions/ct [no test
files]
?     github.com/globalsign/certlint/checks/extensions/extkeyusage  [no
test files]
?     github.com/globalsign/certlint/checks/extensions/keyusage [no test
files]
?     github.com/globalsign/certlint/checks/extensions/nameconstraints
[no test files]
ok    github.com/globalsign/certlint/checks/extensions/ocspmuststaple
(cached)
?     github.com/globalsign/certlint/checks/extensions/ocspnocheck  [no
test files]
?     github.com/globalsign/certlint/checks/extensions/pdfrevocation
[no test files]
?     github.com/globalsign/certlint/checks/extensions/policyidentifiers
[no test files]
?     github.com/globalsign/certlint/checks/extensions/smimecapabilities
[no test files]
?     github.com/globalsign/certlint/checks/extensions/subjectaltname
[no test files]
?     github.com/globalsign/certlint/checks/extensions/subjectkeyid [no
test files]
ok    github.com/globalsign/certlint/errors (cached)
?     github.com/globalsign/certlint/examples/ct  [no test files]
?     github.com/globalsign/certlint/examples/specificchecks  [no test
files]
```

* Certchecker: Remove OCSP Must Staple err ignore, fix typos.

This commit removes the explicit ignore for OCSP Must Staple errors that
was added when the upstream `certlint` package didn't understand that
PKIX extension. That problem was resolved and so we can remove the
ignore from `cert-checker`.

This commit also fixes two typos that were fixed upstream and needed to
be reflected in expected error messages in the `certlint` unit test.

* Certchecker: Ignore Certlint CN/SAN == PSL errors.

`globalsign/certlint`, used by `cmd/cert-checker` to vet certs,
improperly flags certificates that have subj CN/SANs equal to a private
entry in the public suffix list as faulty.

This commit adds a regex that will skip errors that match the certlint
PSL error string. Prior to this workaround the addition of a private PSL
entry as a SAN in the `TestCheckCert` test cert fails the test:

```
--- FAIL: TestCheckCert (1.72s)
  main_test.go:221: Found unexpected problem 'Certificate subjectAltName
  "dev-myqnapcloud.com" equals "dev-myqnapcloud.com" from the public
  suffix list'.
```

With the workaround in place, the test passes again.
2018-04-04 12:20:43 -04:00
Roland Bracewell Shoemaker 8167abd5e3 Use internet facing appropriate histogram buckets for DNS latencies (#3616)
Also, instead of repeating the same bucket definitions everywhere, just use a single top-level var in the metrics package to discourage copy/pasting.

Fixes #3607.
2018-04-04 08:01:54 -04:00
Daniel McCarney 703b134e93 WFE2: Wire missed config elements to WFE object. (#3604)
This commit addresses two config elements that were defined but not
wired through to the WFE implementation object. Prior to this commit the
`c.WFE.DirectoryCAAIdentity` and `c.WFE.DirectoryWebsite` configuration
values were read and unmarshaled from config but not passed to the WFE.
After this commit these two config options will be picked up by the WFE
impl.
2018-03-29 11:01:26 -07:00
Daniel McCarney 57d0141519 cert-checker: Ignore OCSP Must Staple certlint errs. (#3598)
The upstream `certlint` package doesn't understand the RFC 7633 OCSP
Must Staple PKIX Extension and flags its presence as an error. Until
this is resolved upstream this commit updates `cmd/cert-checker` to
ignore the error.

The `TestCheckCert` unit test is updated to add an unsupported extension
and the OCSP must staple extension to its test cert. Only the
unsupported extension should be flagged as a problem.
2018-03-26 10:30:57 -07:00
Daniel McCarney 17922a6d2d
Add CAAIdentities and Website to /directory "meta". (#3588)
This commit updates the WFE and WFE2 to have configuration support for
setting a value for the `/directory` object's "meta" field's
optional "caaIdentities" and "website" fields. The config-next wfe/wfe2
configuration are updated with values for these fields. Unit tests are
updated to check that they are sent when expected and not otherwise.

Bonus content: The `test.AssertUnmarshaledEquals` function had a bug
where it would consider two inputs equal when the # of keys differed.
This commit also fixes that bug.
2018-03-22 16:12:43 -04:00
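The key-count bug mentioned in 17922a6d2d can be illustrated with a small comparison helper. This is a sketch under assumptions, not the actual `test.AssertUnmarshaledEquals` implementation: the function name `unmarshaledEqual` and its error-returning shape are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// unmarshaledEqual compares two JSON objects after unmarshaling them.
// Checking the key counts first prevents the case where every expected
// key matches but the other input carries extra keys.
func unmarshaledEqual(got, expected string) error {
	var gotMap, expMap map[string]interface{}
	if err := json.Unmarshal([]byte(got), &gotMap); err != nil {
		return fmt.Errorf("unmarshaling got: %v", err)
	}
	if err := json.Unmarshal([]byte(expected), &expMap); err != nil {
		return fmt.Errorf("unmarshaling expected: %v", err)
	}
	if len(gotMap) != len(expMap) {
		return fmt.Errorf("expected %d keys, got %d", len(expMap), len(gotMap))
	}
	for key, want := range expMap {
		have, ok := gotMap[key]
		if !ok {
			return fmt.Errorf("missing key %q", key)
		}
		if !reflect.DeepEqual(have, want) {
			return fmt.Errorf("key %q: expected %v, got %v", key, want, have)
		}
	}
	return nil
}
```

Without the length check, comparing `{"a":1,"b":2}` against an expected `{"a":1}` would pass, because the loop only walks the expected keys.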
Daniel McCarney f3a2fd85bc Remove deprecated SubscriberAgreementURL config field. (#3587)
The outer `config.SubscriberAgreementURL` field has been deprecated for
a while in favour of `config.wfe.SubscriberAgreementURL`. After
verifying the prod/staging configurations do not use the legacy field
this commit removes it.
2018-03-22 12:43:53 -07:00
Daniel McCarney 0c4e1daa46 WFE2 Chain File Loading Improvements (#3580)
* Reject WFE2 certificate chain PEM files with CRLF endings.

This commit updates the `boulder-wfe2` command's processing of
certificate chains such that it will reject chain files that contain PEM
encoding with Windows CRLF line endings. Boulder is a UNIX service and
throughout we assume UNIX newlines. CRLF endings in a certificate chain
input file are an error that should be resolved by the operator prior to
startup.

* Add trailing newline to PEM chainfiles automatically.

If a PEM encoded chain file doesn't end with a trailing `\n` the WFE2
should add it. This commit updates the chain file loading to handle this
and adds a corresponding unit test.
2018-03-20 14:54:20 -07:00
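The two chain file rules from 0c4e1daa46 could be sketched as follows. The function name `normalizeChainPEM` is hypothetical, and the sketch leaves out the certificate parsing that a real loader would also perform.

```go
package main

import (
	"bytes"
	"errors"
)

// normalizeChainPEM rejects chain file contents that use CRLF line
// endings and appends a trailing newline when one is missing.
func normalizeChainPEM(contents []byte) ([]byte, error) {
	if bytes.Contains(contents, []byte("\r\n")) {
		return nil, errors.New("chain file contains CRLF line endings")
	}
	// Copy so the caller's slice is never modified by the append below.
	out := append([]byte{}, contents...)
	if !bytes.HasSuffix(out, []byte("\n")) {
		out = append(out, '\n')
	}
	return out, nil
}
```

Rejecting CRLF input outright, rather than silently converting it, keeps the operator aware that the file was produced with the wrong line endings.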
Jacob Hoffman-Andrews 65b88a8dbc Run certlint in cert-checker (#3550)
This pulls in the certlint dependency, which in turn pulls in publicsuffix as a dependency.

Fixes #3549
2018-03-15 17:42:58 +00:00
Daniel McCarney e00ed50cc3 Fix nil panic in mailer logger setup when DB conn fails. (#3545)
Prior to this commit, if the expiration-mailer database configuration is
invalid, or the database is unreachable,
`cmd/expiration-mailer/main.go`'s `main()` function tries to call
`sa.SetSQLDebug(dbMap, logger)` before erroring from the DB
initialization failure. This causes a nil panic.

This commit changes the order of two lines such that `sa.SetSQLDebug` is
only called when there was no db setup error.
2018-03-12 13:46:13 -07:00
Roland Bracewell Shoemaker 9c9e944759 Add SCT embedding (#3521)
Adds SCT embedding to the certificate issuance flow. When an issuance is requested, a precertificate (the requested certificate but poisoned with the critical CT extension) is issued and submitted to the required CT logs. Once the SCTs for the precertificate have been collected, a new certificate is issued with the poison extension replaced with an SCT list extension containing the retrieved SCTs.

Fixes #2244, fixes #3492 and fixes #3429.
2018-03-12 11:58:30 -07:00
Jacob Hoffman-Andrews d654675223 Remove BaseURL from WFE config. (#3540)
For a long time now the WFE has generated URLs based on the incoming
request rather than a hardcoded BaseURL. BaseURL is no longer set in the
prod configs.

This also allows factoring out relativeEndpoint into the web package.
2018-03-09 11:04:02 +00:00
Jacob Hoffman-Andrews 5c4f5e346a Fix pprof handlers. (#3533)
Some of the pprof handlers have to be accessed through
pprof.Handler("string"), while some have to be accessed through an
exported var in pprof. We weren't doing the latter before, which meant
some key handlers like Profile weren't available.
2018-03-08 18:18:13 +00:00
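To illustrate the distinction in 5c4f5e346a: in Go's `net/http/pprof`, the runtime profiles (heap, goroutine and so on) are reached through `pprof.Handler(name)`, while `Index`, `Cmdline`, `Profile`, `Symbol` and `Trace` are exported functions that must be registered directly. The sketch below wires both kinds onto a private mux. `newDebugMux` and `patternFor` are hypothetical names, and the list of runtime profiles is only an example selection.

```go
package main

import (
	"net/http"
	"net/http/pprof"
)

// newDebugMux registers both kinds of pprof handlers on a private mux.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()

	// Handlers exported directly by net/http/pprof.
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)

	// Named runtime profiles, looked up by string.
	for _, name := range []string{"goroutine", "heap", "threadcreate", "block"} {
		mux.Handle("/debug/pprof/"+name, pprof.Handler(name))
	}
	return mux
}

// patternFor reports which registered pattern would serve the given path.
func patternFor(mux *http.ServeMux, path string) string {
	req, err := http.NewRequest(http.MethodGet, path, nil)
	if err != nil {
		return ""
	}
	_, pattern := mux.Handler(req)
	return pattern
}
```

If only the `pprof.Handler` loop were present, a request for `/debug/pprof/profile` would have no dedicated handler, which matches the symptom the commit describes.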
Jacob Hoffman-Andrews 6b8b6a37c0 Update chisel2 and boulder-tools (#3495)
This change updates boulder-tools to use Go 1.10, and references a
newly-pushed image built using that new config.

Since boulder-tools pulls in the latest Certbot master at the time of
build, this also pulls in the latest changes to Certbot's acme module,
which now supports ACME v2. This means we no longer have to check out
the special acme-v2-integration branch in our integration tests.

This also updates chisel2.py to reflect some of the API changes that
landed in the acme module as it was merged to master.

Since we don't need additional checkouts to get the ACMEv2-compatible
version of the acme module, we can include it in the default RUN set for
local tests.
2018-02-28 15:21:40 -08:00
Jacob Hoffman-Andrews 9003dd4522 Improve test-tools. (#3481)
Use `t.Helper` and `t.Fatalf` instead of our own versions.

Remove some unused or single-user helpers.

Make the output of `AssertUnmarshaledEquals` clearer by showing one line per field.
2018-02-27 10:19:24 -05:00
Jacob Hoffman-Andrews 9da5a7e1fc Cleanup: TLS and GRPC configs are mandatory. (#3476)
Our various main.go functions gated some key code on whether the TLS
and/or GRPC config fields were present. Now that those fields are fully
deployed in production, we can simplify the code and require them.
    
Also, rename tls to tlsConfig everywhere to avoid confusion with the tls
package.
    
Avoid assigning to the same err from two different goroutines in
boulder-ca (fix a race).
2018-02-26 10:16:50 -05:00
Roland Bracewell Shoemaker 0b53063a72 ctpolicy: Add informational logs and don't cancel remaining submissions (#3472)
Add a set of logs which will be submitted to but not relied on for their SCTs.
This allows us to test submissions to a particular log or submit to a log which
is not yet approved by a browser/root program.

Also add a feature which stops cancellations of remaining submissions when racing
to get a SCT from a group of logs.

Additionally add an informational log that always times out in config-next.

Fixes #3464 and fixes #3465.
2018-02-23 21:51:50 -05:00
Daniel af97ccae18
Fix OCSPUpdater config parameter 2018-02-21 11:52:13 -05:00
Daniel 36de3bf000
Support Akamai CCU v3 API in cache-client.
This commit adds support for the Akamai CCU v3 API. See
https://developer.akamai.com/api/purge/ccu/resources.html for more information.

The V2 and V3 API are close enough to one another that we can support
both with minimal changes. A new OCSP updater configuration parameter
"AkamaiV3Network" is used to determine if the cache client should use
the V2 API or the V3 API. When empty, V2 is used. When set to either
"production" or "staging", the V3 API is used.
2018-02-21 11:41:32 -05:00
Jacob Hoffman-Andrews c556a1a20d
Reduce spurious errors in integration test (#3436)
Boulder is fairly noisy about gRPC connection errors. This is a mixed
blessing: Our gRPC configuration will try to reconnect until it hits
an RPC deadline, and most likely eventually succeed. In that case,
we don't consider those to really be errors. However, in cases where
a connection is repeatedly failing, we'd like to see errors in the
logs about connection failure, rather than "deadline exceeded." So
we want to keep logging of gRPC errors.

However, right now we get a lot of these errors logged during
integration tests. They make the output hard to read, and may disguise
more serious errors. So we'd like to avoid causing such errors in
normal integration test operation.

This change reorders the startup of Boulder components by their gRPC
dependencies, so everything's backend is likely to be up and running
before it starts. It also reverses that order for clean shutdowns,
and waits for each process to exit before signalling the next one.

With these changes, I still got connection errors. Taking listenbuddy
out of the gRPC path fixed them. I believe the issue is that
listenbuddy is not a truly transparent proxy. In particular, it
accepts an inbound TCP connection before opening an outbound TCP
connection. If opening that outbound connection results in "connection
refused," it closes the inbound connection. That means gRPC sees a
"connection closed" (or "connection reset"?) rather than "connection
refused". I'm guessing it handles those cases differently, explaining
the different error results.

We've been using listenbuddy to trigger disconnects while Boulder is
running, to ensure that gRPC's reconnect code works. I think we can
probably rely on gRPC's reconnect to work. The initial problem that
led us to start testing this was a configuration problem; now that
we have the configuration we want, we should be fine and don't need
to keep testing reconnects on every integration test run.
2018-02-12 18:17:50 -08:00
Roland Bracewell Shoemaker 9e23edf850 Use ctpolicy package in RA (#3422)
And collect the metrics on success/failure rates. Built on top of #3414.

Fixes #3413.
2018-02-08 13:33:42 -08:00
Jacob Hoffman-Andrews 6584d2067b
Return 500s from ocsp-responder. (#3423)
Previously, all errors were treated as Not Found, but we actually want
to treat database errors differently; for instance, by not caching them,
and by setting tighter alerting thresholds for them.

Fixes #3419.
2018-02-06 11:37:44 -08:00
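The error split described in 6584d2067b might look roughly like this. The sentinel errors and the `statusForLookupError` function are hypothetical; the commit message only states that database errors should no longer be reported as Not Found.

```go
package main

import (
	"errors"
	"net/http"
)

// Hypothetical sentinel errors standing in for the responder's real ones.
var (
	errNotFound      = errors.New("no OCSP response found for serial")
	errDBUnavailable = errors.New("database unavailable")
)

// statusForLookupError maps a lookup result to an HTTP status code.
// Only a genuine miss becomes 404. Every other failure becomes 500, so
// that caches and alerting can treat the two cases differently.
func statusForLookupError(err error) int {
	switch {
	case err == nil:
		return http.StatusOK
	case errors.Is(err, errNotFound):
		return http.StatusNotFound
	default:
		return http.StatusInternalServerError
	}
}
```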
Daniel McCarney dae0e4e41d Remove `.mil` check from cert-checker. (#3426)
We're no longer forbidden from issuing `.mil` certificates and shouldn't
flag certs with `.mil` subjects when running `cert-checker`.
2018-02-06 11:02:45 -08:00
Roland Bracewell Shoemaker 62f3978f3b
Add initial CTPolicy impl (#3414)
Adds a package which implements group based SCT retrieval.

Fixes #3412.
2018-02-06 10:52:20 -08:00
Roland Bracewell Shoemaker cdab3a2ef8 Improve wildcard error (#3398) 2018-01-29 10:49:31 -08:00
Roland Bracewell Shoemaker 2a04a85c49 Export max DB connections in boulder-sa and ocsp-responder (#3388)
Fixes #3387.
2018-01-24 09:11:01 -05:00
Daniel McCarney d6a33d1108 Return full cert chain for V2 cert GET. (#3366)
This commit implements a mapping from certificate AIA Issuer URL to PEM
encoded certificate chain. GETs to the V2 Certificate endpoint will
return a full PEM encoded certificate chain in addition to the leaf cert
using the AIA issuer URL of the leaf cert and the configured mapping.

The boulder-wfe2 command builds the chain mapping by reading the
"wfe" config section's "certificateChains" field, specifying a list
of file paths to PEM certificates for each AIA issuer URL. At startup
the PEM file contents are read, verified and separated by a newline.
The resulting populated AIA issuer URL -> PEM cert chain mapping is
given to the WFE for use with the Certificate endpoint.

Resolves #3291
2018-01-19 11:23:44 -08:00
Jacob Hoffman-Andrews 54ca6fe939 Use WillingToIssueWildcard in cert-checker. (#3372)
Fixes #3348 and #3369
2018-01-18 08:36:58 -05:00
Daniel McCarney f969847070 Delete unused WFE/WFE2 cache configuration params. (#3360)
This commit removes `CertCacheDuration`, `CertNoCacheExpirationWindow`,
`IndexCacheDuration` and `IssuerCacheDuration`. These were read from
config values that weren't set in config/config-next into WFE struct
fields that were never referenced in any code.
2018-01-12 15:54:02 -08:00
Maciej Dębski 44984cd84a Implement regID whitelist for allowed challenge types. (#3352)
This updates the PA component to allow authorization challenge types that are globally disabled if the account ID owning the authorization is on a configured whitelist for that challenge type.
2018-01-10 13:44:53 -05:00
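A sketch of the whitelist lookup described in 44984cd84a, assuming the policy is held as two maps. The `policy` type, its field names and the method name are hypothetical; the rule itself (a globally disabled challenge type is still allowed for whitelisted account IDs) is from the commit message.

```go
package main

// policy is a hypothetical, simplified view of the PA's challenge config.
type policy struct {
	// enabledChallenges lists challenge types enabled for everyone.
	enabledChallenges map[string]bool
	// challengeWhitelist maps a challenge type to the registration IDs
	// that may use it even when it is globally disabled.
	challengeWhitelist map[string]map[int64]bool
}

// challengeTypeEnabled reports whether regID may use challType.
func (p policy) challengeTypeEnabled(challType string, regID int64) bool {
	if p.enabledChallenges[challType] {
		return true
	}
	// Indexing a missing (nil) inner map yields false, so unknown
	// challenge types and unlisted accounts are both rejected.
	return p.challengeWhitelist[challType][regID]
}
```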
Jacob Hoffman-Andrews 827f7859f2 Fix issuerCert in test configs. (#3310)
Previously, there was a disagreement between WFE and CA as to what the correct
issuer certificate was. Consolidate on test-ca2.pem (h2ppy h2cker fake CA).
    
Also, the CA configs contained an outdated entry for "IssuerCert", which was not
being used: The CA configs now use an "Issuers" array to allow signing by
multiple issuer certificates at once (for instance when rolling intermediates).
Removed this outdated entry, and the config code for CA to load it. I've
confirmed these changes match what is currently in production.

Added an integration test to check for this problem in the future.

Fixes #3309, thanks to @icing for bringing the issue to our attention!

This also includes changes from #3321 to clarify certificates for WFE.
2018-01-09 07:56:39 -05:00
Jacob Hoffman-Andrews a98a206dd2 Remove references to test-ca.pem. (#3322)
shell_test.go and publisher_test.go had unnecessary references to
../test/test-ca.pem. This change makes them a little more self-contained.

Note: ca/ca_test.go still depends on test-ca.pem, but removing the dependency
turns out to be a little more complicated due to hardcoded expectations in some
of the test cases.
2018-01-05 12:07:12 -08:00
Jacob Hoffman-Andrews 90f7998b15 Speed up expired authz purger (#3267)
Now, rather than LIMIT / OFFSET, this uses the highest id from the last batch in each new batch's query. This makes efficient use of the index, and means the database does not have to scan over a large number of non-expired rows before starting to find any expired rows.

This also changes the structure of the purge function to continually push ids for deletion onto a channel, to be processed by goroutines consuming that channel.

Also, remove the --yes flag and prompting.
2017-12-11 12:05:43 -05:00
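The batching approach in 90f7998b15 can be sketched without a database. Here `fetchFunc` is a hypothetical stand-in for a query of the form `WHERE id > ? ... ORDER BY id LIMIT ?`, and the IDs are `int64` purely for illustration. Each batch starts after the highest id of the previous one, and ids are pushed onto a channel consumed by worker goroutines.

```go
package main

import "sync"

// fetchFunc returns up to limit expired ids greater than afterID, in
// ascending order. It stands in for an indexed database query.
type fetchFunc func(afterID int64, limit int) []int64

// purge walks the table in id order and hands every id to del via a
// channel drained by the given number of workers. It returns the number
// of ids queued for deletion.
func purge(fetch fetchFunc, del func(id int64), batchSize, workers int) int {
	ids := make(chan int64)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ids {
				del(id)
			}
		}()
	}

	total := 0
	var last int64
	for {
		batch := fetch(last, batchSize)
		if len(batch) == 0 {
			break
		}
		for _, id := range batch {
			ids <- id
		}
		total += len(batch)
		// Resume after the highest id seen instead of using OFFSET.
		last = batch[len(batch)-1]
	}
	close(ids)
	wg.Wait()
	return total
}
```

Because each query is anchored on the last id, the database can seek directly into the index rather than scanning and discarding the rows an OFFSET would skip.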
Jacob Hoffman-Andrews 68d5cc3331
Restore gRPC metrics (#3265)
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.

Specifically, it adds two convenience functions, one for clients and one for servers, that make the necessary metrics object and register it. We run these in the main function of each server.

I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.

Also, update go-grpc-prometheus to get the necessary methods.

```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok      github.com/grpc-ecosystem/go-grpc-prometheus    0.069s
?       github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
2017-12-07 15:44:55 -08:00
Roland Bracewell Shoemaker bdea281ae0 Remove CAA SERVFAIL exceptions code (#3262)
Fixes #3080.
2017-12-05 14:39:37 -08:00
Roland Bracewell Shoemaker 9da1bea433 Update histogram buckets for latencies that measure things over the internet (#3254)
Updates the buckets for histograms in the publisher, va, and expiration-mailer which are used to measure the latency of operations that go over the internet and therefore are liable to take a lot longer than the default buckets can measure. Uses a standard set of buckets for all three instead of attempting to tune for each one.

Fixes #3217.
2017-11-29 15:13:14 -08:00
Roland Bracewell Shoemaker d5db80ab12 Various publisher CT fixes (#3219)
Makes a couple of changes:
* Change `SubmitToCT` to make submissions to each log in parallel instead of in serial. This prevents a single slow log from eating up the majority of the deadline and causing submissions to other logs to fail
* Remove the 'submissionTimeout' field on the publisher since it is actually bounded by the gRPC timeout and is misleading
* Add a timeout to the CT client's internal HTTP client so that when log servers hang indefinitely we actually do retries instead of just using the entire submission deadline. Currently set at 2.5 minutes

Fixes #3218.
2017-11-09 10:05:26 -05:00
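The parallel submission idea in d5db80ab12 could be sketched as below. This is not the publisher's code: `submitFunc`, `submitToAll` and `fakeLog` are hypothetical, and a per-log context timeout stands in for the HTTP client timeout the commit mentions. The sketch also creates its own background context, whereas a real caller would pass in its request context so the gRPC deadline still bounds everything.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// submitFunc submits one certificate to one log.
type submitFunc func(ctx context.Context) error

// submitToAll submits to every log concurrently, so one slow log cannot
// consume the time budget of the others. Each submission gets its own
// timeout. The result maps log name to the submission error, if any.
func submitToAll(logs map[string]submitFunc, perLogTimeout time.Duration) map[string]error {
	results := make(map[string]error, len(logs))
	var mu sync.Mutex
	var wg sync.WaitGroup
	for name, submit := range logs {
		wg.Add(1)
		go func(name string, submit submitFunc) {
			defer wg.Done()
			ctx, cancel := context.WithTimeout(context.Background(), perLogTimeout)
			defer cancel()
			err := submit(ctx)
			mu.Lock()
			results[name] = err
			mu.Unlock()
		}(name, submit)
	}
	wg.Wait()
	return results
}

// fakeLog simulates a log that answers after delay, for demonstration.
func fakeLog(delay time.Duration, err error) submitFunc {
	return func(ctx context.Context) error {
		select {
		case <-time.After(delay):
			return err
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```

With a fast and a slow fake log, the fast submission succeeds even though the slow one runs into its timeout, which is the failure mode serial submission could not avoid.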
Jacob Hoffman-Andrews 975456bb08 Switch nagsAtCapacity to Gauge. (#3224)
Fixes #3186
2017-11-08 15:35:25 -08:00
Jacob Hoffman-Andrews 4296dd985a Use TLS in mailer integration tests (#3213)
* Remove non-TLS support from mailer entirely
* Add a config option for trusted roots in expiration-mailer. If unset, it defaults to the system roots, so this does not need to be set in production.
* Use TLS in mail-test-srv, along with an internal root and localhost certificates signed by that root.
2017-11-06 14:57:14 -08:00
Jacob Hoffman-Andrews 3d9b3d4d20 Restore expvar handler. (#3209)
In #3167 I removed expvar, thinking it was unused, but it turns out the RA
exports the last issuance time, and core/util.go has a function to export
BuildID, both of which are used in monitoring. This wasn't caught at compile
time because the global expvar package was happy to register the exports even
though there was no handler to serve them.
2017-11-02 07:05:54 -07:00
Roland Bracewell Shoemaker 29c95f0aed Add a PKCS#11 key generation tool (#3163)
Tested against master SoftHSMv2 and relevant hardware.

Fixes #3125.
2017-10-30 16:09:28 -07:00
Lucas Amorim 7daecf7b23
fix metric name 2017-10-30 11:26:03 -07:00
Lucas Amorim a7a2eaf035
Add metrics to sendNags errors in expiration-mailer 2017-10-29 21:34:41 -07:00
Jacob Hoffman-Andrews bf9ce64aca Update GSB library (#3192)
This pulls in google/safebrowsing#74, which introduces a new LookupURLsContext that allows us to pass through timeout information nicely.

Also, update calling code to use LookupURLsContext instead of LookupURLs.
2017-10-24 08:33:03 -04:00
Jacob Hoffman-Andrews c06dcfaf02 Limit number of authzs purged at once. (#3177)
Previously the expired-authz-purger would try to load the ids for all relevant
authzs into memory before doing any work. On a very large table, this would mean
running out of memory. This setting allows limiting how much work will be done
in one chunk.

Also add periodic logging of deletion count.

Fixes #3147.
2017-10-23 11:20:07 -07:00
Jacob Hoffman-Andrews 6cd777bd8d Fix up stats after #3167 (#3185)
There were two bugs in #3167:

* All process-level stats got prefixed with "boulder", which broke dashboards.
* All request_time stats got dropped, because measured_http was using the prometheus DefaultRegisterer.

To fix, this PR plumbs through a scope object to measured_http, and uses an empty prefix when calling NewProcessCollector().
2017-10-18 11:14:59 -07:00
Jacob Hoffman-Andrews 071fc0120f Remove facebookgo/httpdown. (#3168)
Its purpose is now served by net/http's Shutdown().
2017-10-17 08:55:43 -04:00
Jacob Hoffman-Andrews 600640294d Increase default MaxIdleConns. (#3164)
Go's default is 2: https://golang.org/src/database/sql/sql.go#L686.
Graphs show we are opening 100-200 fresh connections per second on the SA.
Changing this default should reduce that a lot, which should reduce load on both
the SA and MariaDB. This should also improve latency, since every new TCP
connection adds a little bit of latency.
2017-10-16 15:48:17 -07:00
Jacob Hoffman-Andrews f366e45756 Remove global state from metrics gathering (#3167)
Previously, we used prometheus.DefaultRegisterer to register our stats, which uses global state to export its HTTP stats. We also used net/http/pprof's behavior of registering to the default global HTTP ServeMux, via DebugServer, which starts an HTTP server that uses that global ServeMux.

In this change, I merge DebugServer's functions into StatsAndLogging. StatsAndLogging now takes an address parameter and fires off an HTTP server in a goroutine. That HTTP server is newly defined, and doesn't use DefaultServeMux. On it is registered the Prometheus stats handler, and handlers for the various pprof traces. In the process I split StatsAndLogging internally into two functions: makeStats and MakeLogger. I didn't port across the expvar variable exporting, which serves a similar function to Prometheus stats but which we never use.

One nice immediate effect of this change: Since StatsAndLogging now requires an address, I noticed a bunch of commands that called StatsAndLogging, and passed around the resulting Scope, but never made use of it because they didn't run a DebugServer. Under the old StatsD world, these commands still could have exported their stats by pushing, but since we moved to Prometheus their stats stopped being collected. We haven't used any of these stats, so instead of adding debug ports to all short-lived commands, or setting up a push gateway, I simply removed them and switched those commands to initialize only a Logger, no stats.
2017-10-13 11:58:01 -07:00