33 Commits

Author SHA1 Message Date
Aaron Gable d16d3fd067
Remove OCSPStaleMaxAge config value and handling (#4911)
The OCSPStaleMaxAge config value was added in #2419 as part of an
effort to ensure that ocsp-updater's queries of the certificateStatus
table were efficient. It was never intended as a long-term fix:
in #2431 and #2432 the query was updated to index on the much more
efficient isExpired and notAfter columns if a feature flag was set,
and in #2561 that code path was made the default and the flag removed.

However, the `WHERE ocspLastUpdate > ocspStaleMaxAge` clause has
remained in the query. This is redundant, as the ocspStaleMaxAge has
always been set to 5040 hours, or 210 days, significantly longer than
the 90-day expiration of Let's Encrypt certs.

This change removes that clause from the query, and removes the config
scaffolding around it. In addition, it updates the tests to remove
workarounds necessitated by this column, and simplifies and documents
them for future readers.

Fixes #4884
2020-06-29 12:42:51 -07:00
Roland Bracewell Shoemaker 7673f02803
Use cmd/ceremony in integration tests (#4832)
This ended up taking a lot more work than I expected. In order to make the implementation more robust a bunch of stuff we previously relied on has been ripped out in order to reduce unnecessary complexity (I think I insisted on a bunch of this in the first place, so glad I can kill it now).

In particular this change:

* Removes bhsm and pkcs11-proxy: softhsm and pkcs11-proxy don't play well together, and any softhsm manipulation would need to happen on bhsm, then require a restart of pkcs11-proxy to pull in the on-disk changes. This makes manipulating softhsm from the boulder container extremely difficult, and because of the need to initialize new slots on each run (described below) we need direct access to the softhsm2 tools since pkcs11-tool cannot do slot initialization operations over the wire. I originally argued for bhsm as a way to mimic a network attached HSM, mainly so that we could do network level fault testing. In reality we've never actually done this, and the extra complexity is not really realistic for a handful of reasons. It seems better to just rip it out and operate directly on a local softhsm instance (the other option would be to use pkcs11-proxy locally, but this still would require manually restarting the proxy whenever softhsm2-util was used, and wouldn't really offer any realistic benefit).
* Initializes the softhsm slots on each integration test run, rather than when creating the docker image (this is necessary to prevent churn in test/cert-ceremonies/generate.go, which would need to be updated to reflect the new slot IDs each time a new boulder-tools image was created since slot IDs are randomly generated)
* Installs softhsm from source so that we can use a more up to date version (2.5.0 vs. 2.2.0 which is in the debian repo)
* Generates the root and intermediate private keys in softhsm and writes out the root and intermediate public keys to /tmp for use in integration tests (the existing test-{ca,root} certs are kept in test/ because they are used in a whole bunch of unit tests. At some point these should probably be renamed/moved to be more representative of what they are used for, but that is left for a follow-up in order to keep the churn in this PR as related to the ceremony work as possible)
Another follow-up item here is that we should really be zeroing out the database at the start of each integration test run, since certain things like certificates and ocsp responses will be signed by a key/issuer that is no longer in use/doesn't match the current key/issuer.

Fixes #4832.
2020-06-03 15:20:23 -07:00
Jacob Hoffman-Andrews 87fb6028c1
Add log validator to integration tests (#4782)
For now this mainly provides an example config and confirms that
log-validator can start up and shut down cleanly, as well as provide a
stat indicating how many log lines it has handled.

This introduces a syslog config to the boulder-tools image that will write
logs to /var/log/program.log. It also tweaks the various .json config
files so they have non-default syslogLevel, to ensure they actually
write something for log-validator to verify.
2020-04-20 13:33:42 -07:00
Roland Bracewell Shoemaker db01830508
Return OCSP unauthorized status if the certificate is expired (#4380)
The ocsp-updater ocspStaleMaxAge config var has to be bumped up to ~7 months so that when it is run after the six-months-ago run it will actually update the ocsp responses generated during that period and mark the certificate status row as expired.

Fixes #4338.
2019-08-01 14:13:27 -07:00
Roland Bracewell Shoemaker acc44498d1 RA: Make RevokeAtRA feature standard behavior (#4268)
Now that it is live in production and is working as intended we can remove
the old ocsp-updater functionality entirely.

Fixes #4048.
2019-06-20 14:32:53 -04:00
Roland Bracewell Shoemaker 098a761c02 ocsp-updater: Remove integrated akamai purger (#4258)
This is now an external service.

Also bumps up the deadline in the integration test helper which checks for
purging because using the remote service from the ocsp-updater takes a little
longer. Once we remove ocsp-updater revocation support that can probably be
cranked back down to a more reasonable timeframe.
2019-06-12 09:36:53 -04:00
Daniel McCarney d1daeee831 Config: serverAddresses -> serverAddress. (#4035)
The plural `serverAddresses` field in gRPC config has been deprecated for a bit now. We've removed the last usages of it in our staging/prod environments and can clear out the related code. Moving forward we only support a singular `serverAddress` and rely on DNS to direct to multiple instances of a given server.
2019-01-25 10:50:53 -08:00
Jacob Hoffman-Andrews bb15685a0f
No new certificates tick (#4012)
Since #2633 we generate OCSP at first issuance, so we no longer need 
this loop to check for new certificates that need OCSP status generated.
Since the associated SQL query is slow, we should just turn it off.

Also remove the configuration fields for the MissingSCTTick. The code
for that was already deleted.
2019-01-17 14:43:06 -08:00
Roland Bracewell Shoemaker 842739bccd Remove deprecated features that have been purged from prod and staging configs (#4001) 2019-01-15 16:16:35 -08:00
Roland Bracewell Shoemaker cdef80ce67
Remove Akamai CCU v2 support (#3994)
Fixes #3991.
2019-01-08 12:28:11 -08:00
Roland Bracewell Shoemaker ba7a8e8e5d Add fake Akamai purge server for integration testing (#3946)
Fixes #3916.
2018-11-27 09:49:05 -05:00
Daniel McCarney 3e61513364
config, config-next: remove deprecated ocsp-updater fields (#3884) 2018-10-12 13:52:15 -04:00
Roland Bracewell Shoemaker e27f370fd3 Excise code relating to pre-SCT embedding issuance flow (#3769)
Things removed:

* features.EmbedSCTs (and all the associated RA/CA/ocsp-updater code etc)
* ca.enablePrecertificateFlow (and all the associated RA/CA code)
* sa.AddSCTReceipt and sa.GetSCTReceipt RPCs
* publisher.SubmitToCT and publisher.SubmitToSingleCT RPCs

Fixes #3755.
2018-06-28 08:33:05 -04:00
Jacob Hoffman-Andrews b2f5cf39b9
Bring test/config up to date with test/config-next (#3743)
Notably, enable the precertificate flow, RPCHeadroom, and multi-IP hostnames.
Lots of other changes and feature flags too.
2018-06-01 12:00:52 -07:00
Jacob Hoffman-Andrews a4421ae75b Run gRPC backends on multiple IPs instead of multiple ports (#3679)
We're currently stuck on gRPC v1.1 because of a breaking change to certificate validation in gRPC 1.8. Our gRPC balancer uses a static list of multiple hostnames, and expects to validate against those hostnames. However gRPC expects that a service is one hostname, with multiple IP addresses, and validates all those IP addresses against the same hostname. See grpc/grpc-go#2012.

If we follow gRPC's assumptions, we can rip out our custom Balancer and custom TransportCredentials, and will probably have a lower-friction time in general.

This PR is the first step in doing so. In order to satisfy the "multiple IPs, one port" property of gRPC backends in our Docker container infrastructure, we switch to Docker's user-defined networking. This allows us to give the Boulder container multiple IP addresses on different local networks, and gives it different DNS aliases in each network.

In startservers.py, each shard of a service listens on a different DNS alias for that service, and therefore a different IP address. The listening port for each shard of a service is now identical.

This change also updates the gRPC service certificates. Now, each certificate that is used in a gRPC service (as opposed to something that is "only" a client) has three names. For instance, sa1.boulder, sa2.boulder, and sa.boulder (the generic service name). For now, we are validating against the specific hostnames. When we update our gRPC dependency, we will begin validating against the generic service name.

Incidentally, the DNS aliases feature of Docker allows us to get rid of some hackery in entrypoint.sh that inserted entries into /etc/hosts.

Note: Boulder now has a dependency on the DNS aliases feature in Docker. By default, docker-compose run creates a temporary container and doesn't assign any aliases to it. We now need to specify docker-compose run --use-aliases to get the correct behavior. Without --use-aliases, Boulder won't be able to resolve the hostnames it wants to bind to.
2018-05-07 10:38:31 -07:00
Roland Bracewell Shoemaker 0a86573a73 Update integration tests 2018-04-20 13:18:40 -07:00
Jacob Hoffman-Andrews c556a1a20d
Reduce spurious errors in integration test (#3436)
Boulder is fairly noisy about gRPC connection errors. This is a mixed
blessing: Our gRPC configuration will try to reconnect until it hits
an RPC deadline, and most likely eventually succeed. In that case,
we don't consider those to really be errors. However, in cases where
a connection is repeatedly failing, we'd like to see errors in the
logs about connection failure, rather than "deadline exceeded." So
we want to keep logging of gRPC errors.

However, right now we get a lot of these errors logged during
integration tests. They make the output hard to read, and may disguise
more serious errors. So we'd like to avoid causing such errors in
normal integration test operation.

This change reorders the startup of Boulder components by their gRPC
dependencies, so everything's backend is likely to be up and running
before it starts. It also reverses that order for clean shutdowns,
and waits for each process to exit before signalling the next one.
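
The ordering idea can be sketched as a depth-first walk over a dependency map: each service is started only after the backends it calls, and the same list is walked backwards for shutdown. This is a rough illustration rather than the code in this change, and the service names and dependency edges in `main` are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// startOrder returns the services so that each one appears after everything
// it depends on (depth-first post-order). It assumes the map has no cycles.
func startOrder(deps map[string][]string) []string {
	names := make([]string, 0, len(deps))
	for name := range deps {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration is random; sort for a stable result

	visited := make(map[string]bool)
	var order []string
	var visit func(string)
	visit = func(name string) {
		if visited[name] {
			return
		}
		visited[name] = true
		for _, dep := range deps[name] {
			visit(dep) // backends first
		}
		order = append(order, name)
	}
	for _, name := range names {
		visit(name)
	}
	return order
}

// reversed returns a reversed copy, used as the clean-shutdown order.
func reversed(in []string) []string {
	out := make([]string, len(in))
	for i, s := range in {
		out[len(in)-1-i] = s
	}
	return out
}

func main() {
	// Hypothetical dependency edges, for illustration only.
	deps := map[string][]string{
		"sa":  nil,
		"ca":  {"sa"},
		"ra":  {"sa", "ca"},
		"wfe": {"ra", "sa"},
	}
	up := startOrder(deps)
	fmt.Println("start:", up)
	fmt.Println("stop: ", reversed(up))
}
```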

With these changes, I still got connection errors. Taking listenbuddy
out of the gRPC path fixed them. I believe the issue is that
listenbuddy is not a truly transparent proxy. In particular, it
accepts an inbound TCP connection before opening an outbound TCP
connection. If opening that outbound connection results in "connection
refused," it closes the inbound connection. That means gRPC sees a
"connection closed" (or "connection reset"?) rather than "connection
refused". I'm guessing it handles those cases differently, explaining
the different error results.

We've been using listenbuddy to trigger disconnects while Boulder is
running, to ensure that gRPC's reconnect code works. I think we can
probably rely on gRPC's reconnect to work. The initial problem that
led us to start testing this was a configuration problem; now that
we have the configuration we want, we should be fine and don't need
to keep testing reconnects on every integration test run.
2018-02-12 18:17:50 -08:00
Jacob Hoffman-Andrews 827f7859f2 Fix issuerCert in test configs. (#3310)
Previously, there was a disagreement between WFE and CA as to what the correct
issuer certificate was. Consolidate on test-ca2.pem (h2ppy h2cker fake CA).
    
Also, the CA configs contained an outdated entry for "IssuerCert", which was not
being used: The CA configs now use an "Issuers" array to allow signing by
multiple issuer certificates at once (for instance when rolling intermediates).
Removed this outdated entry, and the config code for CA to load it. I've
confirmed these changes match what is currently in production.

Added an integration test to check for this problem in the future.

Fixes #3309, thanks to @icing for bringing the issue to our attention!

This also includes changes from #3321 to clarify certificates for WFE.
2018-01-09 07:56:39 -05:00
Jacob Hoffman-Andrews 0a64fd4066 Bring test/config up-to-date. (#3056)
Methodology: Copy test/config-next/* into test/config/, then manually review
the diffs, removing any diffs that are not yet in production.
2017-09-11 16:55:58 -04:00
Roland Bracewell Shoemaker a46d30945c Purge remaining AMQP code (#2648)
Deletes github.com/streadway/amqp and the various RabbitMQ setup tools etc. Changes how listenbuddy is used to proxy all of the gRPC client -> server connections so we test reconnection logic.

+49 -8,221 😁

Fixes #2640 and #2562.
2017-04-04 15:02:22 -07:00
Jacob Hoffman-Andrews 6719dc17a6 Remove AMQP config and code (#2634)
We now use gRPC everywhere.
2017-04-03 10:39:39 -04:00
Roland Bracewell Shoemaker 18de73f0d8 Pass nil errors through boulder/grpc wrapError/unwrapError (#2544)
Instead of trying to wrap or unwrap them which causes panics.

Also, expand the test_ct_submission integration test to include resubmissions.
2017-02-06 18:19:39 -08:00
Daniel e88db3cd5e
Revert "Revert "Copy all statsd stats to Prometheus. (#2474)" (#2541)"
This reverts commit 9d9e4941a5 and
restores the statsd prometheus code.
2017-02-01 15:48:18 -05:00
Daniel McCarney 9d9e4941a5 Revert "Copy all statsd stats to Prometheus. (#2474)" (#2541)
This reverts commit 58ccd7a71a.

We are seeing multiple boulder components restart when they encounter the stat registration race condition described in https://github.com/letsencrypt/boulder/issues/2540
2017-02-01 12:50:27 -05:00
Jacob Hoffman-Andrews cbde78d58f Harmonize and tweak configs (#2479)
Set authorizationLifetimeDays to 60 across both config and config-next.

Set NumSessions to 2 in both config and config-next. A decrease from 10 because pkcs11-proxy (or pkcs11-daemon?) seems to error out under load if you have more sessions than CPUs.

Reorder parallelGenerateOCSPRequests to match config-next.

Remove extra tags for parsing yaml in config objects.
2017-01-10 13:46:38 -08:00
Jacob Hoffman-Andrews 58ccd7a71a Copy all statsd stats to Prometheus. (#2474)
We have a number of stats already expressed using the statsd interface. During
the switchover period to direct Prometheus collection, we'd like to make those
stats available both ways. This change automatically exports any stats exported
using the statsd interface via Prometheus as well.

This is a little tricky because Prometheus expects all stats to be registered
exactly once. Prometheus does offer a mechanism to gracefully recover from
registering a stat more than once by handling a certain error, but it is not
safe for concurrent access. So I added a concurrency-safe wrapper that creates
Prometheus stats on demand and memoizes them.
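
The create-on-demand-and-memoize pattern looks roughly like the sketch below. It uses only the standard library: `counter` stands in for a Prometheus counter, and `autoRegisterer` and its fields are invented names, so this is an outline of the idea rather than the wrapper added in this change.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical stand-in for a Prometheus counter.
type counter struct {
	mu sync.Mutex
	n  int64
}

func (c *counter) Add(delta int64) {
	c.mu.Lock()
	c.n += delta
	c.mu.Unlock()
}

func (c *counter) Value() int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.n
}

// autoRegisterer creates each named counter exactly once, on first use,
// and hands back the memoized instance on every later call.
type autoRegisterer struct {
	mu       sync.Mutex
	counters map[string]*counter
	created  int // how many counters were actually created
}

func newAutoRegisterer() *autoRegisterer {
	return &autoRegisterer{counters: make(map[string]*counter)}
}

// get is safe for concurrent use: lookup and creation happen under one lock,
// so two goroutines can never both "register" the same name.
func (a *autoRegisterer) get(name string) *counter {
	a.mu.Lock()
	defer a.mu.Unlock()
	if c, ok := a.counters[name]; ok {
		return c
	}
	c := &counter{}
	a.counters[name] = c
	a.created++
	return c
}

func main() {
	reg := newAutoRegisterer()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			reg.get("example_stat").Add(1)
		}()
	}
	wg.Wait()
	fmt.Println(reg.get("example_stat").Value(), reg.created)
}
```

A single mutex around the map is the simplest choice; it serializes lookups, which is acceptable when the lookup is cheap relative to the work being measured.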

In the process, made a few small required side changes:
 - Clean "/" from method names in the gRPC interceptors. They are allowed in
   statsd but not in Prometheus.
 - Replace "127.0.0.1" with "boulder" as the name of our testing CT log.
   Prometheus stats can't start with a number.
 - Remove ":" from the CT-log stat names emitted by Publisher. Prometheus stats
   can't include it.
 - Remove a stray "RA" in front of some rate limit stats, since it was
   duplicative (we were emitting "RA.RA..." before).
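
For comparison, a generic way to make a statsd-style name acceptable under the constraints listed above is to map disallowed characters to underscores. This is not what the change does (it edits the individual names at their source); `sanitizeStatName` is a hypothetical helper, and replacing "." as well is an extra assumption of this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// sanitizeStatName rewrites name so that it contains only letters, digits
// and underscores, and does not start with a digit.
func sanitizeStatName(name string) string {
	var b strings.Builder
	for i, r := range name {
		switch {
		case r >= 'a' && r <= 'z', r >= 'A' && r <= 'Z', r == '_':
			b.WriteRune(r)
		case r >= '0' && r <= '9':
			if i == 0 {
				b.WriteByte('_') // a leading digit gets an underscore prefix
			}
			b.WriteRune(r)
		default:
			b.WriteByte('_') // covers '/', ':', '.' and anything else
		}
	}
	return b.String()
}

func main() {
	fmt.Println(sanitizeStatName("127.0.0.1"))
	fmt.Println(sanitizeStatName("sa/GetCert:latency"))
}
```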

Note that this means two stat groups in particular are duplicated:
 - Gostats* is duplicated with the default process-level stats exported by the
   Prometheus library.
 - gRPCClient* are duplicated by the stats generated by the go-grpc-prometheus
   package.

When writing dashboards and alerts in the Prometheus world, we should be careful
to avoid these two categories, as they will disappear eventually. As a general
rule, if a stat is available with an all-lowercase name, choose that one, as it
is probably the Prometheus-native version.

In the long run we will want to create most stats using the native Prometheus
stat interface, since it allows us to add labels to metrics, which is very
useful. For instance, currently our DNS stats distinguish types of queries by
appending the type to the stat name. This would be more natural as a label in
Prometheus.
2017-01-10 10:30:15 -05:00
Jacob Hoffman-Andrews 0c665b2053 Split up gRPC certificates by service. (#2453)
Previously, all gRPC services used the same client and server certificates. Now,
each service has its own certificate, which it uses for both client and server
authentication, more closely simulating production.

This also adds aliases for each of the relevant hostnames in /etc/hosts. There
may be some issues if Docker decides to rewrite /etc/hosts while Boulder is
running, but this seems to work for now.
2016-12-29 14:53:59 -08:00
Jacob Hoffman-Andrews 26cf552ff9 Sign OCSP in parallel for better performance. (#2422)
Previously all OCSP signing and storage would be serial, which meant it was hard
to exercise the full capacity of our HSM. In this change, we run a limited
number of update and store requests in parallel.
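
Bounded parallelism of this kind is commonly written with a buffered channel used as a semaphore. The sketch below assumes a hypothetical `sign` callback and serial strings; it shows the concurrency limit only and does not reproduce the ocsp-updater's actual signing or storage calls.

```go
package main

import (
	"fmt"
	"sync"
)

// signAll calls sign for every serial while keeping at most maxInFlight
// calls running at once. Results are returned in input order.
func signAll(serials []string, maxInFlight int, sign func(string) string) []string {
	if maxInFlight < 1 {
		maxInFlight = 1 // a zero-capacity semaphore would block forever
	}
	results := make([]string, len(serials))
	sem := make(chan struct{}, maxInFlight)
	var wg sync.WaitGroup
	for i, serial := range serials {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxInFlight workers are busy
		go func(i int, serial string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when done
			results[i] = sign(serial) // each goroutine writes its own index
		}(i, serial)
	}
	wg.Wait()
	return results
}

func main() {
	// The callback is a placeholder for a real signing request.
	out := signAll([]string{"01", "02", "03"}, 2, func(s string) string {
		return "ocsp:" + s
	})
	fmt.Println(out)
}
```

Capping the number in flight keeps the HSM busy without opening more sessions than it can serve.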

This change also changes stats generation in generateOCSPResponses so we can
tell the difference between stats produced by new OCSP requests vs existing ones,
and adds a new stat that records how long the SQL query in findStaleOCSPResponses
takes.
2016-12-12 17:22:44 -08:00
Jacob Hoffman-Andrews 1c1449b284 Improvements to tests and test configs. (#2396)
- Remove spinner from test.js. It made Travis logs hard to read.
- Listen on all interfaces for debugAddr. This makes it possible to check
  Prometheus metrics for instances running in a Docker container.
- Standardize DNS timeouts on 1s and 3 retries across all configs. This ensures
  DNS completes within the relevant RPC timeouts.
- Remove RA service queue from VA, since VA no longer uses the callback to RA on
  completing a challenge.
2016-12-05 14:35:27 -08:00
Daniel McCarney be63da0639 Removes AMQP publishing support from ocsp-updater. (#2341)
The ocsp-updater has been switched over to the `config-next/` usage of
gRPC for submitting to the publisher service. This commit removes the
legacy AMQP support for such.

This does not remove the `rpc/rpc-wrappers.go` implementation of
`NewPublisherClient` at this point because it appears `boulder-ca` may
still be using it.
2016-11-22 13:33:53 -08:00
Jacob Hoffman-Andrews e1bc1e5b29 Update config from config-next. (#2175)
Set feature flags:

"reuseValidAuthz": true,
"authorizationLifetimeDays": 90,
"pendingAuthorizationLifetimeDays": 7,
"CAASERVFAILExceptions": "test/caa-servfail-exceptions.txt",
"lookupIPV6": true,
"allowAuthzDeactivation": true,

Remove BaseURL.
Remove trailing slash on CT log URL.
All files now have trailing newlines.
2016-09-19 14:08:36 -07:00
Jacob Hoffman-Andrews d75a44baa0 Remove "network" and "server" from syslog configs. (#2159)
We removed these from the config object because we never use anything other than
the default empty string, which means "local socket."
2016-09-08 10:08:18 -04:00
Ben Irving 653cc004d0 Split Boulder Config (OCSP Updater) (#2013) 2016-07-06 10:00:52 -04:00