4092 Commits

Author SHA1 Message Date
Daniel McCarney de5fbbdb67
Implement CAA issueWild enforcement for wildcard names (#3266)
This commit implements RFC 6844's description of the "CAA issuewild
property" for CAA records.

We check CAA in two places: at the time of validation, and at the time
of issuance when an authorization is more than 8 hours old. Both
locations have been updated to properly enforce issuewild when checking
CAA for a domain corresponding to a wildcard name in a certificate
order.

Resolves https://github.com/letsencrypt/boulder/issues/3211
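The issuewild precedence rule described in this commit can be sketched roughly as follows. This is a simplified illustration and not Boulder's implementation: the `caaRecord` type and the `caaPermits` and `issuerDomain` helpers are hypothetical names, and the sketch ignores critical flags, unknown tags, and tree climbing.

```go
package main

import (
	"fmt"
	"strings"
)

// caaRecord is a hypothetical, simplified stand-in for a parsed CAA record.
type caaRecord struct {
	Tag   string // "issue" or "issuewild"
	Value string // e.g. "letsencrypt.org" or ";"
}

// issuerDomain extracts the issuer domain from a CAA value, dropping any
// parameters after the first semicolon.
func issuerDomain(value string) string {
	parts := strings.SplitN(value, ";", 2)
	return strings.TrimSpace(parts[0])
}

// caaPermits reports whether issuerID may issue given the relevant CAA
// record set. For wildcard names, issuewild records take precedence over
// issue records whenever at least one issuewild record is present.
func caaPermits(records []caaRecord, issuerID string, wildcard bool) bool {
	var issue, issuewild []caaRecord
	for _, r := range records {
		switch strings.ToLower(r.Tag) {
		case "issue":
			issue = append(issue, r)
		case "issuewild":
			issuewild = append(issuewild, r)
		}
	}
	relevant := issue
	if wildcard && len(issuewild) > 0 {
		relevant = issuewild
	}
	if len(relevant) == 0 {
		// No applicable restriction in this record set.
		return true
	}
	for _, r := range relevant {
		if d := issuerDomain(r.Value); d != "" && d == issuerID {
			return true
		}
	}
	return false
}

func main() {
	records := []caaRecord{
		{Tag: "issue", Value: "letsencrypt.org"},
		{Tag: "issuewild", Value: ";"},
	}
	fmt.Println(caaPermits(records, "letsencrypt.org", false)) // true
	fmt.Println(caaPermits(records, "letsencrypt.org", true))  // false
}
```

Here the `issuewild` value `;` names no issuer, so wildcard issuance is refused even though the `issue` record would allow a non-wildcard certificate.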
2017-12-13 12:09:33 -05:00
Daniel McCarney 09628bcfa2 WFE2 'new-nonce' endpoint (#3270)
This commit adds the "new-nonce" endpoint to the WFE2. A small unit test
is included. The existing /directory unit tests are updated for the new
endpoint.
2017-12-13 08:29:34 -08:00
Daniel McCarney a099e40b9c Add 'new-order' endpoint to WFE2 directory. (#3269)
When we implemented the new-order issuance flow for the WFE2 we forgot to include the endpoint in the /directory object. This commit adds it and updates associated tests.
2017-12-12 13:43:25 -08:00
Jacob Hoffman-Andrews 90f7998b15 Speed up expired authz purger (#3267)
Now, rather than LIMIT / OFFSET, this uses the highest id from the last batch in each new batch's query. This makes efficient use of the index, and means the database does not have to scan over a large number of non-expired rows before starting to find any expired rows.

This also changes the structure of the purge function to continually push ids for deletion onto a channel, to be processed by goroutines consuming that channel.

Also, remove the --yes flag and prompting.
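A minimal sketch of the batching approach described above, using an in-memory slice in place of the database. The `row` type, the `nextBatch` and `purge` helpers, and the SQL in the comment are assumptions for illustration and do not reflect the purger's actual schema or code.

```go
package main

import (
	"fmt"
	"sync"
)

// row is a hypothetical stand-in for an authorization row.
type row struct {
	ID      int64
	Expired bool
}

// nextBatch mimics a query of the form
//   SELECT id FROM authz WHERE id > ? AND <expired> ORDER BY id LIMIT ?
// against an in-memory table sorted by ID. Starting from the last seen ID
// lets an index skip directly to the next candidates, unlike OFFSET.
func nextBatch(table []row, afterID int64, limit int) []int64 {
	var ids []int64
	for _, r := range table {
		if r.ID > afterID && r.Expired {
			ids = append(ids, r.ID)
			if len(ids) == limit {
				break
			}
		}
	}
	return ids
}

// purge pushes expired IDs onto a channel consumed by worker goroutines and
// returns how many IDs the workers processed.
func purge(table []row, batchSize, workers int) int {
	ids := make(chan int64)
	var mu sync.Mutex
	deleted := 0
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range ids {
				// A real worker would issue a DELETE for the ID here.
				mu.Lock()
				deleted++
				mu.Unlock()
			}
		}()
	}
	var lastID int64
	for {
		batch := nextBatch(table, lastID, batchSize)
		if len(batch) == 0 {
			break
		}
		for _, id := range batch {
			ids <- id
		}
		lastID = batch[len(batch)-1] // highest ID seen so far
	}
	close(ids)
	wg.Wait()
	return deleted
}

func main() {
	var table []row
	for i := int64(1); i <= 10; i++ {
		table = append(table, row{ID: i, Expired: i%2 == 0})
	}
	fmt.Println(purge(table, 2, 3)) // 5
}
```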
2017-12-11 12:05:43 -05:00
Jacob Hoffman-Andrews 68d5cc3331
Restore gRPC metrics (#3265)
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.

Specifically, it adds two convenience functions, one for clients and one for servers, that make the necessary metrics object and register it. We run these in the main function of each server.

I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.

Also, update go-grpc-prometheus to get the necessary methods.

```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok      github.com/grpc-ecosystem/go-grpc-prometheus    0.069s
?       github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
2017-12-07 15:44:55 -08:00
Daniel McCarney cda7b25c23 Do not `Update` pendingAuthz unnecessarily for chal update. (#3263)
If two requests simultaneously update a challenge for the same
authorization there is a chance that `UpdatePendingAuthorization` will
encounter a Gorp optimistic lock error having read one LockCol value
from a `Select` on the `pendingAuthorizations` table only for it to have
changed by the time an `Update` on the same row is performed.

After closer examination this `Update` is unnecessary! Only
`RA.UpdateAuthorization` calls `SA.UpdatePendingAuthorization` and it
does so only to record updated challenge information by way of
`UpdatePendingAuthorization`'s call to `updateChallenges`. Since no data
in the `pendingAuthorizations` row is being changed we don't need to do
this `Update` at all, saving both a potential race condition & some
database load.

This commit removes the `Update` entirely. Several SA unit tests had to
be updated because they were (ab)using `UpdatePendingAuthorization` to
mutate pendingAuthz rows.
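The race described above can be illustrated with a small, single-threaded simulation of version-column optimistic locking. The `pendingRow` type and its `update` method are hypothetical and only mimic the general pattern; they are not Gorp's API.

```go
package main

import (
	"errors"
	"fmt"
)

// errStale plays the role of an optimistic lock error.
var errStale = errors.New("optimistic lock: row changed since it was read")

// pendingRow is a hypothetical row guarded by a version column.
type pendingRow struct {
	Version int64
}

// update succeeds only if the caller's version matches the stored one,
// mirroring "UPDATE ... WHERE version = ?", and then bumps the version.
func (r *pendingRow) update(seenVersion int64) error {
	if r.Version != seenVersion {
		return errStale
	}
	r.Version++
	return nil
}

func main() {
	r := &pendingRow{Version: 1}
	// Two requests read the same version before either one writes.
	seenA, seenB := r.Version, r.Version
	fmt.Println(r.update(seenA)) // <nil>
	fmt.Println(r.update(seenB)) // optimistic lock: row changed since it was read
}
```

Skipping the write entirely, as this commit does, removes the second failure mode because no version comparison happens at all.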
2017-12-06 12:20:37 -08:00
Roland Bracewell Shoemaker bdea281ae0 Remove CAA SERVFAIL exceptions code (#3262)
Fixes #3080.
2017-12-05 14:39:37 -08:00
Daniel McCarney 0684d5fc73
Add pending orders rate limit to new-order. (#3257)
This commit adds a new rate limit to restrict the number of outstanding
pending orders per account. If the threshold for this rate limit is
crossed subsequent new-order requests will return a 429 response.

Note: Since the rate limit object itself defines an `Enabled()`
test based on whether or not it has been configured there is **not**
a feature flag for this change.

Resolves https://github.com/letsencrypt/boulder/issues/3246
2017-12-04 16:36:48 -05:00
Daniel McCarney 1c99f91733 Policy based issuance for wildcard identifiers (Round two) (#3252)
This PR implements issuance for wildcard names in the V2 order flow. By policy, pending authorizations for wildcard names only receive a DNS-01 challenge for the base domain. We do not re-use authorizations for the base domain that do not come from a previous wildcard issuance (e.g. a normal authorization for example.com turned valid by way of a DNS-01 challenge will not be reused for a *.example.com order).

The wildcard prefix is stripped off of the authorization identifier value in two places:

* When presenting the authorization to the user - ACME forbids having a wildcard character in an authorization identifier.
* When performing validation - We validate the base domain name without the *. prefix.

This PR is largely a rewrite/extension of #3231. Instead of using a pseudo-challenge-type (DNS-01-Wildcard) to indicate an authorization & identifier correspond to the base name of a wildcard order name we instead allow the identifier to take the wildcard order name with the *. prefix.
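Stripping the wildcard prefix, as described for presentation and validation above, might look like the following sketch. The `baseDomain` helper is a hypothetical name used only for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// baseDomain returns the order name with any leading "*." removed, plus a
// flag reporting whether the name was a wildcard.
func baseDomain(orderName string) (string, bool) {
	if strings.HasPrefix(orderName, "*.") {
		return strings.TrimPrefix(orderName, "*."), true
	}
	return orderName, false
}

func main() {
	fmt.Println(baseDomain("*.example.com")) // example.com true
	fmt.Println(baseDomain("example.com"))   // example.com false
}
```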
2017-12-04 12:18:10 -08:00
Daniel McCarney 55dd1020c0 Increase VA SingleDialTimeout to 10s. (#3260)
This PR changes the VA's singleDialTimeout value from 5 * time.Second to 10 * time.Second. This will give slower servers a better chance to respond, especially for the multi-VA case where n requests arrive ~simultaneously.

This PR also bumps the RA->VA timeout by 5s and the WFE->RA timeout by 5s to accommodate the increased dial timeout. I put this in a separate commit in case we'd rather deal with this separately.
2017-12-04 09:53:26 -08:00
Roland Bracewell Shoemaker 9da1bea433 Update histogram buckets for latencies that measure things over the internet (#3254)
Updates the buckets for histograms in the publisher, va, and expiration-mailer which are used to measure the latency of operations that go over the internet and therefore are liable to take a lot longer than the default buckets can measure. Uses a standard set of buckets for all three instead of attempting to tune for each one.

Fixes #3217.
2017-11-29 15:13:14 -08:00
Daniel McCarney 171da33513
Don't 500 on missing pendingAuthz (#3248)
As described in #3201, a concurrent challenge POST would result in 500 errors if the pending authz row was deleted by a promotion to the authz table underneath another request.

This PR adjusts the SA & RA so that if a pending authz is promoted to a final authz between updates a not found error will be returned instead of a server internal error.

Resolves https://github.com/letsencrypt/boulder/issues/3201
2017-11-22 15:48:42 -05:00
Jacob Hoffman-Andrews 2fd2f9e230 Remove LegacyCAA implementation. (#3240)
Fixes #3236
2017-11-20 16:09:00 -05:00
Andriy 72330bbedd Fix `TestNormalizeCSR` test condition (#3245)
Previous to this commit `TestNormalizeCSR` was comparing the expected DNSNames in the CSR to themselves, **not** the names in the CSR.
2017-11-17 08:30:09 -05:00
Jacob Hoffman-Andrews 25e5c3ec3c Add test for SA's parallelismPerRPC. (#3241)
Fixes #3138.
2017-11-16 09:15:15 -05:00
Jacob Hoffman-Andrews 0d8190f799 Update Boulderdash (#3232)
* Move Boulder CPU panel over
* Add basic expiry mailer stats
* Add HSM signatures panel
* Add page splits panel
* Add CT submissions panel, with old and new metrics for now
2017-11-13 13:22:45 -05:00
Roland Bracewell Shoemaker d5db80ab12 Various publisher CT fixes (#3219)
Makes a couple of changes:
* Change `SubmitToCT` to make submissions to each log in parallel instead of in serial; this prevents a single slow log from eating up the majority of the deadline and causing submissions to other logs to fail
* Remove the 'submissionTimeout' field on the publisher since it is actually bounded by the gRPC timeout and is misleading
* Add a timeout to the CT client's internal HTTP client so that when log servers hang indefinitely we actually do retries instead of just using the entire submission deadline. Currently set at 2.5 minutes

Fixes #3218.
2017-11-09 10:05:26 -05:00
Jacob Hoffman-Andrews ef0bf7e9d0 Remove TooManyCertificatesError. (#3228)
When counting certificates for rate limiting, we attempted to impose a limit on
the query results to ensure we did not receive so many results that they caused
slowness on the database or SA side. However, that check has never actually been
executed correctly. The check was fixed in #3126, but rolling out that fix
broke issuance for subscribers with rate limit overrides that have allowed them
to exceed the limit.

Because this limit has not been needed in practice over the years, remove it
rather than refining it. The size of the results is loosely governed by our
rate limits (and overrides), and if result sizes from this query become a
performance issue in the future, we can address it then. For now, opt for
simplification.

Fixes #3214.
2017-11-09 09:11:45 -05:00
Robert Kästel 60ca8febb3 Pin version of Prometheus to 1.8.2 (#3230)
Pin version of Prometheus to `1.8.2` because the Prometheus folks have just released v2 and switched the `:latest` tag to that.

Using Prometheus v2 produces an error:

     prometheus: error: unknown short flag '-c'
2017-11-09 09:03:12 -05:00
Jacob Hoffman-Andrews 975456bb08 Switch nagsAtCapacity to Gauge. (#3224)
Fixes #3186
2017-11-08 15:35:25 -08:00
Jacob Hoffman-Andrews 9dc32b010f Add indexes on certificateStatus. (#3225)
In #1864 we discussed possible
optimizations to how expiration-mailer and ocsp-updater query the
certificateStatus table. In #2177 we
added the notAfter and isExpired fields for more efficient querying.
However, we forgot to add indexes on these fields. This change adds new indexes
and drops the old indexes, and should result in much more efficient querying in
those two components.

Also, remove a comment that goose couldn't understand.

Running EXPLAINs to show the difference:

For expiration-mailer, before:

MariaDB [boulder_sa_integration]> EXPLAIN SELECT  cs.serial  FROM certificateStatus AS cs  WHERE cs.notAfter > DATE_ADD(NOW(), INTERVAL 21 DAY)  AND cs.notAfter < DATE_ADD(NOW(), INTERVAL 10 DAY)  AND cs.status != "revoked"  AND COALESCE(TIMESTAMPDIFF(SECOND, cs.lastExpirationNagSent, cs.notAfter) > 10 * 86400, 1)  ORDER BY cs.notAfter ASC  LIMIT 100000;
+------+-------------+-------+------+------------------------------+------+---------+------+------+-----------------------------+
| id   | select_type | table | type | possible_keys                | key  | key_len | ref  | rows | Extra                       |
+------+-------------+-------+------+------------------------------+------+---------+------+------+-----------------------------+
|    1 | SIMPLE      | cs    | ALL  | status_certificateStatus_idx | NULL | NULL    | NULL |  486 | Using where; Using filesort |
+------+-------------+-------+------+------------------------------+------+---------+------+------+-----------------------------+
1 row in set (0.00 sec)

For expiration-mailer, after:

MariaDB [boulder_sa_integration]> EXPLAIN SELECT  cs.serial  FROM certificateStatus AS cs  WHERE cs.notAfter < DATE_ADD(NOW(), INTERVAL 21 DAY)  AND cs.notAfter < DATE_ADD(NOW(), INTERVAL 10 DAY)  AND cs.status != "revoked"  AND COALESCE(TIMESTAMPDIFF(SECOND, cs.lastExpirationNagSent, cs.notAfter) > 10 * 86400, 1)  ORDER BY cs.notAfter ASC  LIMIT 100000;
+------+-------------+-------+-------+---------------+--------------+---------+------+------+------------------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref  | rows | Extra                              |
+------+-------------+-------+-------+---------------+--------------+---------+------+------+------------------------------------+
|    1 | SIMPLE      | cs    | range | notAfter_idx  | notAfter_idx | 6       | NULL |    1 | Using index condition; Using where |
+------+-------------+-------+-------+---------------+--------------+---------+------+------+------------------------------------+

For ocsp-updater, before:

MariaDB [boulder_sa_integration]> EXPLAIN SELECT    cs.serial,    cs.status,    cs.revokedDate,    cs.notAfter    FROM certificateStatus AS cs    WHERE cs.ocspLastUpdated > DATE_SUB(NOW(), INTERVAL 10 DAY)    AND cs.ocspLastUpdated < DATE_SUB(NOW(), INTERVAL 3 DAY)    AND NOT cs.isExpired    ORDER BY cs.ocspLastUpdated ASC    LIMIT 100000;
+------+-------------+-------+-------+---------------------------------------+---------------------------------------+---------+------+------+------------------------------------+
| id   | select_type | table | type  | possible_keys                         | key                                   | key_len | ref  | rows | Extra                              |
+------+-------------+-------+-------+---------------------------------------+---------------------------------------+---------+------+------+------------------------------------+
|    1 | SIMPLE      | cs    | range | ocspLastUpdated_certificateStatus_idx | ocspLastUpdated_certificateStatus_idx | 5       | NULL |    1 | Using index condition; Using where |
+------+-------------+-------+-------+---------------------------------------+---------------------------------------+---------+------+------+------------------------------------+
1 row in set (0.00 sec)

For ocsp-updater, after:

MariaDB [boulder_sa_integration]> EXPLAIN SELECT    cs.serial,    cs.status,    cs.revokedDate,    cs.notAfter    FROM certificateStatus AS cs    WHERE cs.ocspLastUpdated > DATE_SUB(NOW(), INTERVAL 10 DAY)    AND cs.ocspLastUpdated < DATE_SUB(NOW(), INTERVAL 3 DAY)    AND NOT cs.isExpired    ORDER BY cs.ocspLastUpdated ASC    LIMIT 100000;
+------+-------------+-------+-------+-------------------------------+-------------------------------+---------+------+------+-----------------------+
| id   | select_type | table | type  | possible_keys                 | key                           | key_len | ref  | rows | Extra                 |
+------+-------------+-------+-------+-------------------------------+-------------------------------+---------+------+------+-----------------------+
|    1 | SIMPLE      | cs    | range | isExpired_ocspLastUpdated_idx | isExpired_ocspLastUpdated_idx | 7       | NULL |    1 | Using index condition |
+------+-------------+-------+-------+-------------------------------+-------------------------------+---------+------+------+-----------------------+
1 row in set (0.00 sec)
2017-11-08 13:25:30 -08:00
Jacob Hoffman-Andrews 6178688231 Remove background subprocess for DB migrations. (#3226)
We started running our DB migrations in the background to speed up CI. However,
the semantics of subprocesses and `wait` mean that if a migration fails, the
overall `create_db.sh` doesn't fail. That means, for instance, tests continue to
run, and it's hard to find the resulting error.

This change runs the migrations in serial again so that we can catch such errors
more easily.
2017-11-08 09:25:33 -05:00
Jacob Hoffman-Andrews 5928a06d4d Add a missing "2" to commit id. (#3223) 2017-11-07 17:00:05 -05:00
Jacob Hoffman-Andrews 6af3f4e315 Update to latest certificate-transparency-go. (#3207)
This pulls in multilog support (logs sharded by date). As a result,
it also pulls in new dependencies gogo/protobuf (for UnmarshalText) and
golang/protobuf/ptypes (for Timestamp).

Replaces #3202, adding a smaller set of dependencies. See also #3205.

Tests run:

```
$ go test github.com/gogo/protobuf/proto github.com/golang/protobuf/ptypes/... github.com/google/certificate-transparency-go/... 
ok      github.com/gogo/protobuf/proto  0.063s
ok      github.com/golang/protobuf/ptypes       0.009s
?       github.com/golang/protobuf/ptypes/any   [no test files]
?       github.com/golang/protobuf/ptypes/duration      [no test files]
?       github.com/golang/protobuf/ptypes/empty [no test files]
?       github.com/golang/protobuf/ptypes/struct        [no test files]
?       github.com/golang/protobuf/ptypes/timestamp     [no test files]
?       github.com/golang/protobuf/ptypes/wrappers      [no test files]
ok      github.com/google/certificate-transparency-go   1.005s
ok      github.com/google/certificate-transparency-go/asn1      0.021s
ok      github.com/google/certificate-transparency-go/client    22.034s
?       github.com/google/certificate-transparency-go/client/ctclient   [no test files]
ok      github.com/google/certificate-transparency-go/fixchain  0.145s
?       github.com/google/certificate-transparency-go/fixchain/main     [no test files]
ok      github.com/google/certificate-transparency-go/fixchain/ratelimiter      27.745s
ok      github.com/google/certificate-transparency-go/gossip    0.772s
?       github.com/google/certificate-transparency-go/gossip/main       [no test files]
ok      github.com/google/certificate-transparency-go/jsonclient        25.523s
ok      github.com/google/certificate-transparency-go/merkletree        0.004s
?       github.com/google/certificate-transparency-go/preload   [no test files]
?       github.com/google/certificate-transparency-go/preload/dumpscts/main     [no test files]
?       github.com/google/certificate-transparency-go/preload/main      [no test files]
ok      github.com/google/certificate-transparency-go/scanner   0.010s
?       github.com/google/certificate-transparency-go/scanner/main      [no test files]
ok      github.com/google/certificate-transparency-go/tls       0.026s
ok      github.com/google/certificate-transparency-go/x509      0.417s
?       github.com/google/certificate-transparency-go/x509/pkix [no test files]
?       github.com/google/certificate-transparency-go/x509util  [no test files]
```
2017-11-07 07:59:46 -05:00
Jacob Hoffman-Andrews 4296dd985a Use TLS in mailer integration tests (#3213)
* Remove non-TLS support from mailer entirely
* Add a config option for trusted roots in expiration-mailer. If unset, it defaults to the system roots, so this does not need to be set in production.
* Use TLS in mail-test-srv, along with an internal root and localhost certificates signed by that root.
2017-11-06 14:57:14 -08:00
Jacob Hoffman-Andrews 8ed063a901
Revert "Logic error. Always-zero-value-variable used. (#3126)" (#3215)
This reverts commit 887d75f1e0.
2017-11-06 09:36:24 -08:00
Jacob Hoffman-Andrews 5f0cbddd9d Check for unnecessary godeps (#3206)
Fixes https://github.com/letsencrypt/boulder/issues/3205.

Previously, we would only move aside Godeps.json before running `godep save ./...`. However, in order to get a true picture of what is needed, we must also remove the existing `vendor/` directory.

This change also removes some unnecessary dependencies that have piled up over the years, generally test dependencies. Godep used to vendor such dependencies but no longer does.
2017-11-03 14:30:07 -04:00
Roland Bracewell Shoemaker f31d2867b2 Switch publisher to prom stats (#3212)
Magical StatsD style->prom style stats are hard to actually use.

Fixes #2906.
2017-11-03 08:48:18 -04:00
Jacob Hoffman-Andrews d882a7a2d1 Remove export of feature flags. (#3210)
In #3167 I removed the code that would use this, but forgot to remove the
exporting code. This follows up on that. We don't currently use this for
monitoring, and it's easier to get the current flags from a config file.
2017-11-02 07:07:02 -07:00
Jacob Hoffman-Andrews 3d9b3d4d20 Restore expvar handler. (#3209)
In #3167 I removed expvar, thinking it was unused, but it turns out the RA
exports the last issuance time, and core/util.go has a function to export
BuildID, both of which are used in monitoring. This wasn't caught at compile
time because the global expvar package was happy to register the exports even
though there was no handler to serve them.
2017-11-02 07:05:54 -07:00
Jacob Hoffman-Andrews 8103ee0b27 Update godep instructions. (#3208)
These are a little simpler and should be more reliable.
2017-11-02 09:24:11 -04:00
Jacob Hoffman-Andrews 5df083a57e Add ROCA weak key checking (#3189)
Thanks to @titanous for the library!
2017-11-02 08:42:59 -04:00
Daniel McCarney 2f263f8ed5 ACME v2 Finalize order support (#3169)
This PR implements order finalization for the ACME v2 API.

In broad strokes this means:

* Removing the CSR from order objects & the new-order flow
* Adding identifiers to the order object & new-order
* Providing a finalization URL as part of orders returned by new-order
* Adding support to the WFE's Order endpoint to receive finalization POST requests with a CSR
* Updating the RA to accept finalization requests and to ensure orders are fully validated before issuance can proceed
* Updating the SA to allow finding order authorizations & updating orders.
* Updating the CA to accept an Order ID to log when issuing a certificate corresponding to an order object

Resolves #3123
2017-11-01 12:39:44 -07:00
Ben Zarzycki 887d75f1e0 Logic error. Always-zero-value-variable used. (#3126)
The intent here was pretty clear, but an oversight prevented the error condition from being checked.
2017-10-30 16:10:41 -07:00
Roland Bracewell Shoemaker 29c95f0aed Add a PKCS#11 key generation tool (#3163)
Tested against master SoftHSMv2 and relevant hardware.

Fixes #3125.
2017-10-30 16:09:28 -07:00
Jacob Hoffman-Andrews 0882b86e6c
Add metrics to sendNags errors in expiration-mailer (#3198)
Fixes #3176
2017-10-30 12:38:44 -07:00
Lucas Amorim 7daecf7b23
fix metric name 2017-10-30 11:26:03 -07:00
Lucas Amorim a7a2eaf035
Add metrics to sendNags errors in expiration-mailer 2017-10-29 21:34:41 -07:00
Roland Bracewell Shoemaker e2de327f4d Remove unused old script (#3196)
This appears to be from the RabbitMQ era and isn't referenced from anywhere else in the codebase.
2017-10-27 15:15:40 -04:00
Jacob Hoffman-Andrews bf9ce64aca Update GSB library (#3192)
This pulls in google/safebrowsing#74, which introduces a new LookupURLsContext that allows us to pass through timeout information nicely.

Also, update calling code to use LookupURLsContext instead of LookupURLs.
2017-10-24 08:33:03 -04:00
Jacob Hoffman-Andrews c06dcfaf02 Limit number of authzs purged at once. (#3177)
Previously the expired-authz-purger would try to load the ids for all relevant
authzs into memory before doing any work. On a very large table, this would mean
running out of memory. This setting allows limiting how much work will be done
in one chunk.

Also add periodic logging of deletion count.

Fixes #3147.
2017-10-23 11:20:07 -07:00
Jacob Hoffman-Andrews 90278c80fe Revert "Reject CAA responses containing DNAMEs (#3082)" (#3188)
This reverts commit 08d2018c10.

Feedback from root programs:

https://cabforum.org/pipermail/public/2017-October/012293.html
https://cabforum.org/pipermail/public/2017-October/012297.html
https://cabforum.org/pipermail/public/2017-October/012358.html
https://cabforum.org/pipermail/public/2017-October/012320.html

Resolves #3130.
2017-10-23 11:14:56 -07:00
Roland Bracewell Shoemaker e2cc6fbe68 Add test/chisel2.py for ACME v2 testing (#3179)
Pulled out of https://github.com/certbot/certbot/compare/acme-v2 by @jsha; Boulder is the correct place for it to live.
2017-10-19 10:45:51 -07:00
Jacob Hoffman-Andrews da31fc8b70 Add a renewal bit to issuedNames. (#3178)
This is only the migration, so far. Rather than doing the feature-switch dance,
we can wait for this migration to be applied, and then commit the code to start
setting it, with a feature switch to start checking it, which can be turned on
once we've been setting the bit in production for a week.

Having this as an indexed bit on issuedNames allows us to cheaply exclude
renewals from our rate limit queries, so we can avoid the ordering dependency
for renewals vs new issuances on the same domain.

Fixes #3161
2017-10-19 09:29:43 -04:00
Jacob Hoffman-Andrews 6cd777bd8d Fix up stats after #3167 (#3185)
There were two bugs in #3167:

* All process-level stats got prefixed with "boulder", which broke dashboards.
* All request_time stats got dropped, because measured_http was using the prometheus DefaultRegisterer.

To fix, this PR plumbs through a scope object to measured_http, and uses an empty prefix when calling NewProcessCollector().
2017-10-18 11:14:59 -07:00
Roland Bracewell Shoemaker 06d348cab8 Remove references to RabbitMQ (#3184) 2017-10-17 21:42:50 -04:00
Jacob Hoffman-Andrews 071fc0120f Remove facebookgo/httpdown. (#3168)
Its purpose is now served by net/http's Shutdown().
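For reference, a minimal sketch of graceful shutdown using only `net/http`. The `serveOnce` helper, the ephemeral loopback port, and the five-second deadline are assumptions for illustration and are not taken from Boulder.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// serveOnce starts a server on an ephemeral loopback port, makes one
// request, then stops the server with Shutdown. It returns the response
// status code.
func serveOnce() (int, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})}
	serveErr := make(chan error, 1)
	go func() { serveErr <- srv.Serve(ln) }()

	resp, err := http.Get("http://" + ln.Addr().String() + "/")
	if err != nil {
		return 0, err
	}
	io.Copy(io.Discard, resp.Body)
	resp.Body.Close()

	// Shutdown stops accepting new connections and waits, up to the context
	// deadline, for in-flight requests to finish.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return resp.StatusCode, err
	}
	// After Shutdown, Serve returns http.ErrServerClosed.
	if err := <-serveErr; !errors.Is(err, http.ErrServerClosed) {
		return resp.StatusCode, err
	}
	return resp.StatusCode, nil
}

func main() {
	code, err := serveOnce()
	fmt.Println(code, err)
}
```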
2017-10-17 08:55:43 -04:00
Jacob Hoffman-Andrews 600640294d Increase default MaxIdleConns. (#3164)
Go's default is 2: https://golang.org/src/database/sql/sql.go#L686.
Graphs show we are opening 100-200 fresh connections per second on the SA.
Changing this default should reduce that a lot, which should reduce load on both
the SA and MariaDB. This should also improve latency, since every new TCP
connection adds a little bit of latency.
2017-10-16 15:48:17 -07:00
Jacob Hoffman-Andrews 613ce0620f Update minimum required Go version in README. (#3174) 2017-10-14 14:16:48 -04:00
Jacob Hoffman-Andrews f366e45756 Remove global state from metrics gathering (#3167)
Previously, we used prometheus.DefaultRegisterer to register our stats, which uses global state to export its HTTP stats. We also used net/http/pprof's behavior of registering to the default global HTTP ServeMux, via DebugServer, which starts an HTTP server that uses that global ServeMux.

In this change, I merge DebugServer's functions into StatsAndLogging. StatsAndLogging now takes an address parameter and fires off an HTTP server in a goroutine. That HTTP server is newly defined, and doesn't use DefaultServeMux. On it is registered the Prometheus stats handler, and handlers for the various pprof traces. In the process I split StatsAndLogging internally into two functions: makeStats and MakeLogger. I didn't port across the expvar variable exporting, which serves a similar function to Prometheus stats but which we never use.

One nice immediate effect of this change: Since StatsAndLogging now requires an address, I noticed a bunch of commands that called StatsAndLogging, and passed around the resulting Scope, but never made use of it because they didn't run a DebugServer. Under the old StatsD world, these commands still could have exported their stats by pushing, but since we moved to Prometheus their stats stopped being collected. We haven't used any of these stats, so instead of adding debug ports to all short-lived commands, or setting up a push gateway, I simply removed them and switched those commands to initialize only a Logger, no stats.
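The idea of serving debug handlers from a newly defined mux rather than the global one can be sketched with the standard library. The `newDebugMux` and `status` helpers are hypothetical names; the Prometheus handler mentioned above is omitted here because it would require a third-party import.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a private mux with the pprof handlers wired
// explicitly, instead of relying on the registrations that importing
// net/http/pprof makes on http.DefaultServeMux.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// status issues an in-process GET against h and returns the response code.
func status(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest("GET", path, nil))
	return rec.Code
}

func main() {
	mux := newDebugMux()
	fmt.Println(status(mux, "/debug/pprof/cmdline")) // 200
	fmt.Println(status(mux, "/metrics"))             // 404, nothing registered there
}
```

A long-running command would pass such a mux as the `Handler` of an `http.Server` bound to the debug address and start it in a goroutine.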
2017-10-13 11:58:01 -07:00