Does a simpler probe than `blackbox_exporter` would, but directly collects the information we think will aid in debugging publisher outages.
Updates #3821.
Things removed:
* features.EmbedSCTs (and all the associated RA/CA/ocsp-updater code etc)
* ca.enablePrecertificateFlow (and all the associated RA/CA code)
* sa.AddSCTReceipt and sa.GetSCTReceipt RPCs
* publisher.SubmitToCT and publisher.SubmitToSingleCT RPCs
Fixes #3755.
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.
This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.
Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands.
A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.
Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
Our various main.go functions gated some key code on whether the TLS
and/or GRPC config fields were present. Now that those fields are fully
deployed in production, we can simplify the code and require them.
Also, rename tls to tlsConfig everywhere to avoid confusion with the tls
package.
Avoid assigning to the same err from two different goroutines in
boulder-ca (fix a race).
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.
Specifically, it adds two convenience functions, one for clients and one for servers, that make the necessary metrics object and register it. We run these in the main function of each server.
I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.
Also, update go-grpc-prometheus to get the necessary methods.
```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
Makes a couple of changes:
* Change `SubmitToCT` to submit to each log in parallel instead of serially. This prevents a single slow log from consuming most of the deadline and causing submissions to other logs to fail.
* Remove the `submissionTimeout` field on the publisher, since submissions are actually bounded by the gRPC timeout and the field is misleading.
* Add a timeout to the CT client's internal HTTP client so that when log servers hang indefinitely we actually retry instead of using up the entire submission deadline. Currently set to 2.5 minutes.
Fixes #3218.
Previously, we used prometheus.DefaultRegisterer to register our stats, which uses global state to export its HTTP stats. We also used net/http/pprof's behavior of registering to the default global HTTP ServeMux, via DebugServer, which starts an HTTP server that uses that global ServeMux.
In this change, I merge DebugServer's functions into StatsAndLogging. StatsAndLogging now takes an address parameter and fires off an HTTP server in a goroutine. That HTTP server is newly defined, and doesn't use DefaultServeMux. On it is registered the Prometheus stats handler, and handlers for the various pprof traces. In the process I split StatsAndLogging internally into two functions: makeStats and MakeLogger. I didn't port across the expvar variable exporting, which serves a similar function to Prometheus stats but which we never use.
One nice immediate effect of this change: Since StatsAndLogging now requires an address, I noticed a bunch of commands that called StatsAndLogging, and passed around the resulting Scope, but never made use of it because they didn't run a DebugServer. Under the old StatsD world, these commands still could have exported their stats by pushing, but since we moved to Prometheus their stats stopped being collected. We haven't used any of these stats, so instead of adding debug ports to all short-lived commands, or setting up a push gateway, I simply removed them and switched those commands to initialize only a Logger, no stats.
Previously, we would produce an error and a nonzero status code on shutdown,
because gRPC's GracefulStop would cause s.Serve() to return an error. Now we
filter that specific error and treat it as success. This also allows us to kill
processes with SIGTERM instead of SIGKILL in integration tests.
Fixes #2410.
This used to be used for AMQP queue names. Now that AMQP is gone, these consts
were only used when printing a version string at startup. This changes
VersionString to just use the name of the current program, and removes
`const clientName = ` from many of our main.go's.
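
A minimal sketch of deriving the name from the running program instead of a per-command constant. The output format and the `buildID` parameter here are made up for illustration and are not Boulder's actual version string.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// versionString labels the output with the basename of the running binary.
// The format and the buildID parameter are hypothetical.
func versionString(argv0, buildID string) string {
	return fmt.Sprintf("Versions: %s=(%s)", filepath.Base(argv0), buildID)
}

func main() {
	fmt.Println(versionString(os.Args[0], "dev"))
}
```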
Switches imports from `github.com/google/certificate-transparency` to `github.com/google/certificate-transparency-go` and vendors the new code. Also fixes a number of small breakages caused by API changes since the last time we vendored the code. Also updates `github.com/cloudflare/cfssl` since you can't vendor both `github.com/google/certificate-transparency` and `github.com/google/certificate-transparency-go`.
Side note: while doing this, `godep` tried to pull in a number of imports under the `golang.org/x/text` repo that I couldn't find being used anywhere, so I dropped the changes to `Godeps/Godeps.json` and didn't add the vendored dir to the tree. Let's see if this breaks any tests...
All tests pass
```
$ go test ./...
ok github.com/google/certificate-transparency-go 0.640s
ok github.com/google/certificate-transparency-go/asn1 0.005s
ok github.com/google/certificate-transparency-go/client 22.054s
? github.com/google/certificate-transparency-go/client/ctclient [no test files]
ok github.com/google/certificate-transparency-go/fixchain 0.133s
? github.com/google/certificate-transparency-go/fixchain/main [no test files]
ok github.com/google/certificate-transparency-go/fixchain/ratelimiter 27.752s
ok github.com/google/certificate-transparency-go/gossip 0.322s
? github.com/google/certificate-transparency-go/gossip/main [no test files]
ok github.com/google/certificate-transparency-go/jsonclient 25.701s
ok github.com/google/certificate-transparency-go/merkletree 0.006s
? github.com/google/certificate-transparency-go/preload [no test files]
? github.com/google/certificate-transparency-go/preload/dumpscts/main [no test files]
? github.com/google/certificate-transparency-go/preload/main [no test files]
ok github.com/google/certificate-transparency-go/scanner 0.013s
? github.com/google/certificate-transparency-go/scanner/main [no test files]
ok github.com/google/certificate-transparency-go/tls 0.033s
ok github.com/google/certificate-transparency-go/x509 1.071s
? github.com/google/certificate-transparency-go/x509/pkix [no test files]
? github.com/google/certificate-transparency-go/x509util [no test files]
```
```
$ ./test.sh
...
ok github.com/cloudflare/cfssl/api 1.089s coverage: 81.1% of statements
ok github.com/cloudflare/cfssl/api/bundle 1.548s coverage: 87.2% of statements
ok github.com/cloudflare/cfssl/api/certadd 13.681s coverage: 86.8% of statements
ok github.com/cloudflare/cfssl/api/client 1.314s coverage: 55.2% of statements
ok github.com/cloudflare/cfssl/api/crl 1.124s coverage: 75.0% of statements
ok github.com/cloudflare/cfssl/api/gencrl 1.067s coverage: 72.5% of statements
ok github.com/cloudflare/cfssl/api/generator 2.809s coverage: 33.3% of statements
ok github.com/cloudflare/cfssl/api/info 1.112s coverage: 84.1% of statements
ok github.com/cloudflare/cfssl/api/initca 1.059s coverage: 90.5% of statements
ok github.com/cloudflare/cfssl/api/ocsp 1.178s coverage: 93.8% of statements
ok github.com/cloudflare/cfssl/api/revoke 2.282s coverage: 75.0% of statements
ok github.com/cloudflare/cfssl/api/scan 2.729s coverage: 62.1% of statements
ok github.com/cloudflare/cfssl/api/sign 2.483s coverage: 83.3% of statements
ok github.com/cloudflare/cfssl/api/signhandler 1.137s coverage: 26.3% of statements
ok github.com/cloudflare/cfssl/auth 1.030s coverage: 68.2% of statements
ok github.com/cloudflare/cfssl/bundler 15.014s coverage: 85.1% of statements
ok github.com/cloudflare/cfssl/certdb/dbconf 1.042s coverage: 78.9% of statements
ok github.com/cloudflare/cfssl/certdb/ocspstapling 1.919s coverage: 69.2% of statements
ok github.com/cloudflare/cfssl/certdb/sql 1.265s coverage: 65.7% of statements
ok github.com/cloudflare/cfssl/cli 1.050s coverage: 61.9% of statements
ok github.com/cloudflare/cfssl/cli/bundle 1.023s coverage: 0.0% of statements
ok github.com/cloudflare/cfssl/cli/crl 1.669s coverage: 57.8% of statements
ok github.com/cloudflare/cfssl/cli/gencert 9.278s coverage: 83.6% of statements
ok github.com/cloudflare/cfssl/cli/gencrl 1.310s coverage: 73.3% of statements
ok github.com/cloudflare/cfssl/cli/genkey 3.028s coverage: 70.0% of statements
ok github.com/cloudflare/cfssl/cli/ocsprefresh 1.106s coverage: 64.3% of statements
ok github.com/cloudflare/cfssl/cli/revoke 1.081s coverage: 88.2% of statements
ok github.com/cloudflare/cfssl/cli/scan 1.217s coverage: 36.0% of statements
ok github.com/cloudflare/cfssl/cli/selfsign 2.201s coverage: 73.2% of statements
ok github.com/cloudflare/cfssl/cli/serve 1.133s coverage: 39.0% of statements
ok github.com/cloudflare/cfssl/cli/sign 1.210s coverage: 54.8% of statements
ok github.com/cloudflare/cfssl/cli/version 2.475s coverage: 100.0% of statements
ok github.com/cloudflare/cfssl/cmd/cfssl 1.082s coverage: 0.0% of statements
ok github.com/cloudflare/cfssl/cmd/cfssljson 1.016s coverage: 4.0% of statements
ok github.com/cloudflare/cfssl/cmd/mkbundle 1.024s coverage: 0.0% of statements
ok github.com/cloudflare/cfssl/config 2.754s coverage: 67.7% of statements
ok github.com/cloudflare/cfssl/crl 1.063s coverage: 68.3% of statements
ok github.com/cloudflare/cfssl/csr 27.016s coverage: 89.6% of statements
ok github.com/cloudflare/cfssl/errors 1.081s coverage: 81.2% of statements
ok github.com/cloudflare/cfssl/helpers 1.217s coverage: 80.4% of statements
ok github.com/cloudflare/cfssl/helpers/testsuite 7.658s coverage: 65.8% of statements
ok github.com/cloudflare/cfssl/initca 205.809s coverage: 74.2% of statements
ok github.com/cloudflare/cfssl/log 1.016s coverage: 59.3% of statements
ok github.com/cloudflare/cfssl/multiroot/config 1.107s coverage: 77.4% of statements
ok github.com/cloudflare/cfssl/ocsp 1.524s coverage: 77.7% of statements
ok github.com/cloudflare/cfssl/revoke 1.775s coverage: 79.6% of statements
ok github.com/cloudflare/cfssl/scan 1.022s coverage: 1.1% of statements
ok github.com/cloudflare/cfssl/selfsign 1.119s coverage: 70.0% of statements
ok github.com/cloudflare/cfssl/signer 1.019s coverage: 20.0% of statements
ok github.com/cloudflare/cfssl/signer/local 3.146s coverage: 81.2% of statements
ok github.com/cloudflare/cfssl/signer/remote 2.328s coverage: 71.8% of statements
ok github.com/cloudflare/cfssl/signer/universal 2.280s coverage: 67.7% of statements
ok github.com/cloudflare/cfssl/transport 1.028s
ok github.com/cloudflare/cfssl/transport/ca/localca 1.056s coverage: 94.9% of statements
ok github.com/cloudflare/cfssl/transport/core 1.538s coverage: 90.9% of statements
ok github.com/cloudflare/cfssl/transport/kp 1.054s coverage: 37.1% of statements
ok github.com/cloudflare/cfssl/ubiquity 1.042s coverage: 88.3% of statements
ok github.com/cloudflare/cfssl/whitelist 2.304s coverage: 100.0% of statements
```
Fixes #2746.
This removes the config and code to output to statsd.
- Change `cmd.StatsAndLogging` to output a `Scope`, not a `Statter`.
- Remove the prefixing of component name (e.g. "VA") in front of stats; this was stripped by `autoProm` but now no longer needs to be.
- Delete vendored statsd client.
- Delete `MockStatter` (generated by gomock) and `mocks.Statter` (hand generated) in favor of mocking `metrics.Scope`, which is the interface we now use everywhere.
- Remove a few unused methods on `metrics.Scope`, and update its generated mock.
- Refactor `autoProm` and add `autoRegisterer`, which can be included in a `metrics.Scope`, avoiding global state. `autoProm` now registers everything with the `prometheus.Registerer` it is given.
- Change va_test.go's `setup()` to not return a stats object; instead the individual tests that care about stats override `va.stats` directly.
Fixes #2639, #2733.
If you are the first person to add a feature to a Boulder command it's very
easy to forget to update the command's config structure to accommodate a
`map[string]bool` entry and to pass it to `features.Set` in `main()`. See
https://github.com/letsencrypt/boulder/issues/2533 for one example. I've
fallen into this trap myself a few times so I'm going to try and save myself
some future grief by fixing it across the board once and for all!
This PR adds a `Features` config entry and a corresponding `features.Set` to:
* ocsp-updater (resolves #2533)
* admin-revoker
* boulder-publisher
* contact-exporter
* expiration-mailer
* expired-authz-purger
* notify-mailer
* ocsp-responder
* orphan-finder
These components were skipped because they already had features supported:
* boulder-ca
* boulder-ra
* boulder-sa
* boulder-va
* boulder-wfe
* cert-checker
I deliberately skipped adding Feature support to:
* single-ocsp (Its only configuration comes from the pkcs11key library and
doesn't support features)
* rabbitmq-setup (No configuration/features and we'll likely soon be rming this
since the gRPC migration)
* notafter-backfill (This is a one-off that will be deleted soon)
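The wiring that is easy to forget looks roughly like the sketch below. It is an assumption-laden stand-in, not the real `features` package: `ExampleFeature`, `setFeatures`, and `loadFeatures` are hypothetical, and rejecting unknown names is a choice made for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// enabled is the global feature state for this sketch; ExampleFeature is a
// made-up feature name.
var enabled = map[string]bool{"ExampleFeature": false}

// setFeatures applies a map of flags, rejecting names it does not know
// before changing anything.
func setFeatures(flags map[string]bool) error {
	for name := range flags {
		if _, known := enabled[name]; !known {
			return fmt.Errorf("unknown feature %q", name)
		}
	}
	for name, on := range flags {
		enabled[name] = on
	}
	return nil
}

// config is a command's config structure with the map[string]bool entry.
type config struct {
	Features map[string]bool
}

// loadFeatures does what each main() needs to do: parse the config and pass
// its Features entry on.
func loadFeatures(raw []byte) error {
	var c config
	if err := json.Unmarshal(raw, &c); err != nil {
		return err
	}
	return setFeatures(c.Features)
}

func main() {
	err := loadFeatures([]byte(`{"Features": {"ExampleFeature": true}}`))
	fmt.Println(err, enabled["ExampleFeature"])
}
```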
Pulls in logging improvements in OCSP Responder and the CT client, plus a handful of API changes. Also, the CT client verifies responses by default now.
This change includes some Boulder diffs to accommodate the API changes.
Previously, a given binary would have three TLS config fields (CA cert, cert,
key) for its gRPC server, plus each of its configured gRPC clients. In typical
use, we expect all three of those to be the same across both servers and clients
within a given binary.
This change reuses the TLSConfig type already defined for use with AMQP, adds a
Load() convenience function that turns it into a *tls.Config, and configures it
for use with all of the binaries. This should make configuration easier and more
robust, since it more closely matches usage.
This change preserves temporary backwards-compatibility for the
ocsp-updater->publisher RPCs, since those are the only instances of gRPC
currently enabled in production.
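
A sketch of what such a `Load()` convenience function could look like. The field names, error messages, and the choice to require client certificates are assumptions for illustration; the actual type in the tree may differ.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"os"
)

// TLSConfig holds the three file paths shared by a binary's gRPC server and
// clients. Field names are assumptions for this sketch.
type TLSConfig struct {
	CertFile   string
	KeyFile    string
	CACertFile string
}

// Load turns the three paths into a *tls.Config usable on both the server
// side (ClientCAs) and the client side (RootCAs).
func (c TLSConfig) Load() (*tls.Config, error) {
	if c.CertFile == "" || c.KeyFile == "" || c.CACertFile == "" {
		return nil, errors.New("certFile, keyFile and caCertFile are all required")
	}
	caPEM, err := os.ReadFile(c.CACertFile)
	if err != nil {
		return nil, fmt.Errorf("reading CA cert: %w", err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA cert file")
	}
	cert, err := tls.LoadX509KeyPair(c.CertFile, c.KeyFile)
	if err != nil {
		return nil, fmt.Errorf("loading key pair: %w", err)
	}
	return &tls.Config{
		RootCAs:      roots,
		ClientCAs:    roots,
		ClientAuth:   tls.RequireAndVerifyClientCert,
		Certificates: []tls.Certificate{cert},
	}, nil
}

func main() {
	_, err := TLSConfig{}.Load()
	fmt.Println(err)
}
```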
Previously we had custom code in each gRPC wrapper to implement timeouts. Moving
the timeout code into the client interceptor allows us to simplify things and
reduce code duplication.
Adds a gRPC server to the SA and SA gRPC Clients to the WFE, RA, CA, Publisher, OCSP updater, orphan finder, admin revoker, and expiration mailer.
Also adds a CA gRPC client to the OCSP Updater which was missed in #2193.
Fixes #2347.
When the CA and the VA encounter an error from grpc.ServerSetup() they
print a message of the form:
> "Unable to setup XXX gRPC server", where XXX is "CA", or "VA".
Prior to this commit when the publisher encounters the same error it
prints:
> "Failed to setup gRPC server".
This commit updates the Publisher cmd to use the same gRPC error message
format as the CA and the VA.
Implements a less RPC-focused signal catch/shutdown method. Certain things that could probably also use this (e.g. `ocsp-updater`) haven't been given it, as they would require rather substantial changes to allow for a graceful shutdown approach.
Fixes #2298.
Updates #1699.
Adds a new package, `features`, which exposes methods to set and check if various internal features are enabled. The implementation uses global state to store the features so that services embedded in another service do not each require their own features map in order to check if something is enabled.
Requires a `boulder-tools` image update to include `golang.org/x/tools/cmd/stringer`.
Fixes #1576.
Adds a new package mock_metrics, with code generated by gomock, in order to test the change.
Modifies publisher.New to take a metrics.Scope and an SA, and unexport SA.
Moves core of submission loop into a separate function, singleLogSubmit, which can return an error rather than using the continue keyword. This reduces repetition of AuditErr lines, and makes it easier to put error statting in one place.
This PR removes the use of all anonymous struct fields that were introduced by myself as per my work on splitting up boulder-config (#1962).
The root of the bug was related to the loading of the json configuration file into the config struct. The config structs contained several embedded (anonymous) fields. An embedded (anonymous) field in a struct actually results in the flattening of the json structure. This caused json.Unmarshal to look not at the nested level, but at the root level of the json object and hence not find the nested field (i.e. AllowedSigningAlgos).
See https://play.golang.org/p/6uVCsEu3Df for a working example.
This fixes the reported bug: #2018
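A self-contained reproduction of the flattening behaviour, in the same spirit as the playground link. The type names `Settings`, `Embedded`, and `Named` are invented for this sketch; only `AllowedSigningAlgos` comes from the bug description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Settings carries the field that went missing in the bug.
type Settings struct {
	AllowedSigningAlgos []string
}

// Embedded uses an anonymous field, so encoding/json promotes
// AllowedSigningAlgos to the top level of the JSON object.
type Embedded struct {
	Settings
}

// Named uses an ordinary field, so the JSON keeps its nested shape.
type Named struct {
	Settings Settings
}

// nested is how the config file is actually written.
const nested = `{"Settings": {"AllowedSigningAlgos": ["RSA"]}}`

func decodeEmbedded(raw string) []string {
	var c Embedded
	_ = json.Unmarshal([]byte(raw), &c) // no error: the unknown key is just ignored
	return c.AllowedSigningAlgos
}

func decodeNamed(raw string) []string {
	var c Named
	_ = json.Unmarshal([]byte(raw), &c)
	return c.Settings.AllowedSigningAlgos
}

func main() {
	// Prints "0 1": the embedded variant silently finds nothing.
	fmt.Println(len(decodeEmbedded(nested)), len(decodeNamed(nested)))
}
```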
Adds a server side unary RPC interceptor which includes basic stats. We could also use this to add a server request ID to the context.Context to identify the call through the system, but really I'd rather do that on the client side before the RPC is sent which requires the client interceptor implementation upstream. Also updates google.golang.org/grpc.
Updates #1880.
Fixes #799, ensuring that these files at least count towards our coverage numbers and show up on coveralls' "least covered" list. Improving their coverage is part of the overall long-term project of coverage improvement.
In this PR, logger is passed to the following callers:
NewWebFrontEndImpl
NewCertificateAuthorityImpl
NewValidationAuthorityImpl
NewAmqpRPCServer
newUpdater
NewRegistrationAuthorityServer
This reduces the usage of a global singleton logger and allows tests to consistently use a mock logger.
Fixes #1642
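The shape of the change, sketched generically: the constructor receives its logger, so a test can hand in a mock and inspect what was logged. `Logger`, `mockLogger`, and `frontEnd` here are simplified stand-ins, not the real `blog` types or constructors.

```go
package main

import "fmt"

// Logger is a deliberately tiny stand-in for the real logging interface.
type Logger interface {
	Info(msg string)
}

// mockLogger records messages so a test can assert on them.
type mockLogger struct {
	lines []string
}

func (m *mockLogger) Info(msg string) { m.lines = append(m.lines, msg) }

// frontEnd gets its logger from the caller instead of a global singleton.
type frontEnd struct {
	log Logger
}

func newFrontEnd(log Logger) *frontEnd { return &frontEnd{log: log} }

func (f *frontEnd) handle(path string) { f.log.Info("handled " + path) }

func main() {
	mock := &mockLogger{}
	newFrontEnd(mock).handle("/directory")
	fmt.Println(mock.lines)
}
```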
* remove blog.Get() in wfe
* remove blog.Get() from va
* remove blog.Get() from ca
* remove blog.Get() from ocsp updater, amqp rpc server, registration authority server
* removed some pointless logging code
* remove one added newline
* fix format issue
* fix setup function to return *blog.Mock instead of being passed in
* remove useless blog.NewMock() call
* Fix all errcheck errors
* Add errcheck to test.sh
* Add a new sa.Rollback method to make handling errors in rollbacks easier.
This also causes a behavior change in the VA. If an HTTP connection is
abruptly closed after serving the headers for a non-200 response, the
reported error will be the read failure instead of the non-200.
- Remove error signatures from log methods. This means fewer places where errcheck will show ignored errors.
- Pull in latest cfssl to be compatible with errorless log messages.
- Reduce the number of message priorities we support to just those we actually use.
- AuditNotice -> AuditInfo
- Remove InfoObject (only one use, switched to Info)
- Remove EmergencyExit and related functions in favor of panic
- Remove SyslogWriter / AuditLogger separate types in favor of a single interface, Logger, that has all the logging methods on it.
- Merge mock log into logger. This allows us to unexport the internals but still override them in the mock.
- Shorten names to be compatible with Go style: New, Set, Get, Logger, NewMock, etc.
- Use a shorter log format for stdout logs.
- Remove "... Starting" log messages. We have better information in the "Versions" message logged at startup.
Motivation: The AuditLogger / SyslogWriter distinction was confusing and exposed internals only necessary for tests. Some components accepted one type and some accepted the other. This made it hard to consistently use mock loggers in tests. Also, the unnecessarily fat interface for AuditLogger made it hard to meaningfully mock out.
google/certificate-transparency provides a new method, AddChainWithContext,
that allows us to cancel a submission attempt if it takes longer than a
provided timeout using context.WithTimeout. Also refactor the initialization
method and fix a previously broken test (related to Retry-After headers).
In the process, break out AMQP config into its own struct, one per service.
The AMQPConfig struct is included by composition in the config structs that need
it. If any given service lacks an AMQP config of its own, it gets a default
value from the top-level AMQP config struct, for deployability reasons.
Tightens the RPC code to take a specific AMQP config, not an over-broad
cmd.Config.
Shortens construction of specific RPC clients so they instantiate the generic
client connection themselves, simplifying per-service startup code.
Remove unused SetTimeout method on RPC clients.
Consolidate initialization of stats and logging from each main.go into cmd
package.
Define a new config parameter, `StdoutLevel`, that determines the maximum log
level that will be printed to stdout. It can be set to 6 to inhibit debug
messages, or 0 to print only emergency messages, or -1 to print no messages at
all.
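
The threshold rule can be written down in a few lines. This sketch assumes the standard syslog severity numbering (0 is emergency, 7 is debug); the constant and function names are hypothetical.

```go
package main

import "fmt"

// Syslog severities: lower numbers are more severe.
const (
	levelEmerg = 0
	levelInfo  = 6
	levelDebug = 7
)

// printsToStdout reports whether a message at msgLevel is printed when the
// configured maximum is stdoutLevel. A maximum of -1 prints nothing, because
// no severity is that low.
func printsToStdout(msgLevel, stdoutLevel int) bool {
	return msgLevel <= stdoutLevel
}

func main() {
	fmt.Println(printsToStdout(levelDebug, 6))  // debug is inhibited at 6
	fmt.Println(printsToStdout(levelInfo, 6))   // info still prints at 6
	fmt.Println(printsToStdout(levelEmerg, -1)) // nothing prints at -1
}
```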
Remove the existing config parameter `Tag`. Instead, choose the tag from the
basename of the currently running process. Previously all Boulder log messages
had the tag "boulder", but now they will be differentiated by process, like
"boulder-wfe".
Shorten the date format used in stdout logging, and add the current binary's
basename.
Consolidate setup function in audit-logger_test.go.
Note: Most CLI binaries now get their stats and logging from the parameters of
Action. However, a few of our binaries don't use our custom AppShell, and
instead use codegangsta/cli directly. For those binaries, we export the new
StatsAndLogging method from cmd.
Fixes https://github.com/letsencrypt/boulder/issues/852