Add Honeycomb tracing to all Boulder components which act as
HTTP servers, gRPC servers, or gRPC clients. Add many values
which we currently emit to logs to the trace spans. Add a way to
configure the Honeycomb integration to our config files, and by
default configure all of our tests to "mute" (send nothing).
Followup changes will refine the configuration, attempt to reduce
the new dependency load, and introduce better sampling.
Part of https://github.com/letsencrypt/dev-misc-tickets/issues/218
This allows servers to tell clients to go away after some period of time, which triggers the clients to re-resolve DNS.
Per grpc/grpc#12295, this is the preferred way to do this.
Related: #5307.
Named field `DB`, in a each component configuration struct, acts as the
receiver for the value of `db` when component JSON files are
unmarshalled.
When `cmd.DBConfig` fields are received at the root of component
configuration struct instead of `DB` copy them to the `DB` field of the
component configuration struct.
Move existing `cmd.DBConfig` values from the root of each component's
JSON configuration in `test/config-next` to `db`
Part of #5275
In #5235 we replaced MaxDBConns in favor of MaxOpenConns.
One week ago MaxDBConns was removed from all dev, staging, and
production configurations. This change completes the removal of
MaxDBConns from all components and test/config.
Fixes#5249
Historically the only database/sql driver setting exposed via JSON
config was maxDBConns. This change adds support for maxIdleConns,
connMaxLifetime, connMaxIdleTime, and renames maxDBConns to
maxOpenConns. The addition of these settings will give our SRE team a
convenient method for tuning the reuse/closure of database connections.
A new struct, DBSettings, has been added to SA. The struct, and each of
it's fields has been commented.
All new fields have been plumbed through to the relevant Boulder
components and exported as Prometheus metrics. Tests have been
added/modified to ensure that the fields are being set. There should be
no loss in coverage
Deployability concerns for the migration from maxDBConns to maxOpenConns
have been addressed with the temporary addition of the helper method
cmd.DBConfig.GetMaxOpenConns(). This method can be removed once
test/config is defaulted to using maxOpenConns. Relevant sections of the
code have TODOs added that link back to an newly opened issue.
Fixes#5199
In all boulder services, we construct a single tls.Config object
and then pass it into multiple gRPC setup methods. In all boulder
services but one, we pass the object into multiple clients, and
just one server. In general, this is safe, because all of the client
setup happens on the main thread, and the server setup similarly
happens on the main thread before spinning off the gRPC server
goroutine.
In the CA, we do the above and pass the tlsConfig object into two
gRPC server setup functions. Thus the first server goroutine races
with the setup of the second server.
This change removes the post-hoc assignment of MinVersion,
MaxVersion, and CipherSuites of the tls.Config object passed
to grpc.ClientSetup and grpc.NewServer. And adds those same
values to the cmd.TLSConfig.Load, the method responsible for
constructing the tls.Config object before it's passed to
grpc.ClientSetup and grpc.NewServer.
Part of #5159
errors.As checks for a specific error in a wrapped error chain
(see https://golang.org/pkg/errors/#As) as opposed to asserting
that an error is of a specific type.
Part of #5010
ACME Challenges are well-known strings ("http-01", "dns-01", and
"tlsalpn-01") identifying which kind of challenge should be used
to verify control of a domain. Because they are well-known and
only certain values are valid, it is better to represent them as
something more akin to an enum than as bare strings. This also
improves our ability to ensure that an AcmeChallenge is not
accidentally used as some other kind of string in a different
context. This change also brings them closer in line with the
existing core.AcmeResource and core.OCSPStatus string enums.
Fixes#5009
This follows up on some refactoring we had done previously but not
completed. This removes various binary-specific config structs from the
common cmd package, and moves them into their appropriate packages. In
the case of CT configs, they had to be moved into their own package to
avoid a dependency loop between RA and ctpolicy.
- docker-rebuild isn't needed now that boulder and bhsm containers run directly off
the boulder-tools image.
- Remove DNS options from RA config.
- Remove GSB options from VA config.
* Remove the challenge whitelist
* Reduce the signature for ChallengesFor and ChallengeTypeEnabled
* Some unit tests in the VA were changed from testing TLS-SNI to testing the same behavior
in TLS-ALPN, when that behavior wasn't already tested. For instance timeouts during connect
are now tested.
Fixes#4109
The plural `serverAddresses` field in gRPC config has been deprecated for a bit now. We've removed the last usages of it in our staging/prod environments and can clear out the related code. Moving forward we only support a singular `serverAddress` and rely on DNS to direct to multiple instances of a given server.
Since #2633 we generate OCSP at first issuance, so we no longer need
this loop to check for new certificates that need OCSP status generated.
Since the associate SQL query is slow, we should just turn it off.
Also remove the configuration fields for the MissingSCTTick. The code
for that was already deleted.
Required a little bit of rework of the RA issuance flow (to add parsing of the precert to determine the expiration date, and moving final cert parsing before final cert submission) and RA tests, but I think it shouldn't create any issues...
Fixes#3197.
Submits final certificates to any configured CT logs. This doesn't introduce a feature flag as it is config gated, any log we want to submit final certificates to needs to have it's log description updated to include the `"submitFinalCerts": true` field.
Fixes#3605.
During periods of peak load, some RPCs are significantly delayed (on the order of seconds) by client-side blocking. HTTP/2 clients have to obey a "max concurrent streams" setting sent by the server. In Go's HTTP/2 implementation, this value [defaults to 250](https://github.com/golang/net/blob/master/http2/server.go#L56), so the gRPC default is also 250. So whenever there are more than 250 requests in progress at a time, additional requests will be delayed until there is a slot available.
During this peak load, we aren't hitting limits on CPU or memory, so we should increase the max concurrent streams limit to take better advantage of our available resources. This PR adds a config field to do that.
Fixes#3641.
Our various main.go functions gated some key code on whether the TLS
and/or GRPC config fields were present. Now that those fields are fully
deployed in production, we can simplify the code and require them.
Also, rename tls to tlsConfig everywhere to avoid confusion with the tls
package.
Avoid assigning to the same err from two different goroutines in
boulder-ca (fix a race).
Add a set of logs which will be submitted to but not relied on for their SCTs,
this allows us to test submissions to a particular log or submit to a log which
is not yet approved by a browser/root program.
Also add a feature which stops cancellations of remaining submissions when racing
to get a SCT from a group of logs.
Additionally add an informational log that always times out in config-next.
Fixes#3464 and fixes#3465.
This updates the PA component to allow authorization challenge types that are globally disabled if the account ID owning the authorization is on a configured whitelist for that challenge type.
Go's default is 2: https://golang.org/src/database/sql/sql.go#L686.
Graphs show we are opening 100-200 fresh connections per second on the SA.
Changing this default should reduce that a lot, which should reduce load on both
the SA and MariaDB. This should also improve latency, since every new TCP
connection adds a little bit of latency.
We removed most of the component-specific configs out of cmd a long time ago. We
left CAConfig in, because it is used directly in the NewCertificateAuthority
constructor. That means that moving CAConfig into cmd/boulder-ca would have
resulted in a circular dependency.
Eventually we probably want to decompose CAConfig so it's a set of arguments to
NewCertificateAuthority, but as a short term improvement, move the config into
its own package to break the dependency. This has the advantage of removing a
couple of big dependencies from cmd.
* CA: Stub IssuePrecertificate gPRC method.
* CA: Implement IssuePrecertificate.
* CA: Test Precertificate flow in TestIssueCertificate().
move verification of certificate storage
IssuePrecertificate tests
Add CT precertificate poison extension to CFSSL whitelist.
CFSSL won't allow us to add an extension to a certificate unless that
certificate is in the whitelist.
According to its documentation, "Extensions requested in the CSR are
ignored, except for those processed by ParseCertificateRequest (mainly
subjectAltName)." Still, at least we need to add tests to make sure a
poison extension in a CSR isn't copied into the final certificate.
This allows us to avoid making invasive changes to CFSSL.
* CA: Test precertificate issuance in TestInvalidCSRs().
* CA: Only support IssuePrecertificate() if it is explicitly enabled.
* CA: Test that we produce CT poison extensions in the valid form.
The poison extension must be critical in order to work correctly. It probably wouldn't
matter as much what the value is, but the spec requires the value to be ASN.1 NULL, so
verify that it is.
Adds a basic truncated modulus hash check for RSA keys that can be used to check keys against the Debian `{openssl,openssh,openvpn}-blacklist` lists of weak keys generated during the [Debian weak key incident](https://wiki.debian.org/SSLkeys).
Testing is gated on adding a new configuration key to the WFE, RA, and CA configs which contains the path to a directory which should contain the weak key lists.
Fixes#157.
If you are the first person to add a feature to a Boulder command its very
easy to forget to update the command's config structure to accommodate a
`map[string]bool` entry and to pass it to `features.Set` in `main()`. See
https://github.com/letsencrypt/boulder/issues/2533 for one example. I've
fallen into this trap myself a few times so I'm going to try and save myself
some future grief by fixing it across the board once and for all!
This PR adds a `Features` config entry and a corresponding `features.Set` to:
* ocsp-updater (resolves#2533)
* admin-revoker
* boulder-publisher
* contact-exporter
* expiration-mailer
* expired-authz-purger
* notify-mailer
* ocsp-responder
* orphan-finder
These components were skipped because they already had features supported:
* boulder-ca
* boulder-ra
* boulder-sa
* boulder-va
* boulder-wfe
* cert-checker
I deliberately skipped adding Feature support to:
* single-ocsp (Its only configuration comes from the pkcs11key library and
doesn't support features)
* rabbitmq-setup (No configuration/features and we'll likely soon be rming this
since the gRPC migration)
* notafter-backfill (This is a one-off that will be deleted soon)
This PR has three primary contributions:
1. The existing code for using the V4 safe browsing API introduced in #2446 had some bugs that are fixed in this PR.
2. A gsb-test-srv is added to provide a mock Google Safebrowsing V4 server for integration testing purposes.
3. A short integration test is added to test end-to-end GSB lookup for an "unsafe" domain.
For 1) most notably Boulder was assuming the new V4 library accepted a directory for its database persistence when it instead expects an existing file to be provided. Additionally the VA wasn't properly instantiating feature flags preventing the V4 api from being used by the VA.
For 2) the test server is designed to have a fixed set of "bad" domains (Currently just honest.achmeds.discount.hosting.com). When asked for a database update by a client it will package the list of bad domains up & send them to the client. When the client is asked to do a URL lookup it will check the local database for a matching prefix, and if found, perform a lookup against the test server. The test server will process the lookup and increment a count for how many times the bad domain was asked about.
For 3) the Boulder startservers.py was updated to start the gsb-test-srv and the VA is configured to talk to it using the V4 API. The integration test consists of attempting issuance for a domain pre-configured in the gsb-test-srv as a bad domain. If the issuance succeeds we know the GSB lookup code is faulty. If the issuance fails, we check that the gsb-test-srv received the correct number of lookups for the "bad" domain and fail if the expected isn't reality.
Notes for reviewers:
* The gsb-test-srv has to be started before anything will use it. Right now the v4 library handles database update request failures poorly and will not retry for 30min. See google/safebrowsing#44 for more information.
* There's not an easy way to test for "good" domain lookups, only hits against the list. The design of the V4 API is such that a list of prefixes is delivered to the client in the db update phase and if the domain in question matches no prefixes then the lookup is deemed unneccesary and not performed. I experimented with sending 256 1 byte prefixes to try and trick the client to always do a lookup, but the min prefix size is 4 bytes and enumerating all possible prefixes seemed gross.
* The test server has a /add endpoint that could be used by integration tests to add new domains to the block list, but it isn't being used presently. The trouble is that the client only updates its database every 30 minutes at present, and so adding a new domain will only take affect after the client updates the database.
Resolves#2448
Set authorizationLifetimeDays to 60 across both config and config-next.
Set NumSessions to 2 in both config and config-next. A decrease from 10 because pkcs11-proxy (or pkcs11-daemon?) seems to error out under load if you have more sessions than CPUs.
Reorder parallelGenerateOCSPRequests to match config-next.
Remove extra tags for parsing yaml in config objects.
Previously, a given binary would have three TLS config fields (CA cert, cert,
key) for its gRPC server, plus each of its configured gRPC clients. In typical
use, we expect all three of those to be the same across both servers and clients
within a given binary.
This change reuses the TLSConfig type already defined for use with AMQP, adds a
Load() convenience function that turns it into a *tls.Config, and configures it
for use with all of the binaries. This should make configuration easier and more
robust, since it more closely matches usage.
This change preserves temporary backwards-compatibility for the
ocsp-updater->publisher RPCs, since those are the only instances of gRPC
currently enabled in production.
This allows finer-grained control of which components can request issuance. The OCSP Updater should not be able to request issuance.
Also, update test/grpc-creds/generate.sh to reissue the certs properly.
Resolves#2417
Previously all OCSP signing and storage would be serial, which meant it was hard
to exercise the full capacity of our HSM. In this change, we run a limited
number of update and store requests in parallel.
This change also changes stats generation in generateOCSPResponses so we can
tell the difference between stats produced by new OCSP requests vs existing ones,
and adds a new stat that records how long the SQL query in findStaleOCSPResponses
takes.
This PR adds a new `OCSPStaleMaxAge` configuration parameter to the `ocsp-updater`. The default value when not provided is 30 days, and this is explicitly added to both `config/ocsp-updater.json` and `config-next/ocsp-updater.json`.
The OCSP updater uses this new parameter in `findStaleOCSPResponses` as a lower bound on the `ocspLastUpdated` field of the certificateStatus table. This is intended to speed up the processing of this query until we can land the proper fixes that require more intensive migrations & backfilling.
The `TestGenerateOCSPResponses` and `TestFindStaleOCSPResponses` unit tests had to be updated to explicitly set the `ocspLastUpdated` field of the certificate status rows that the tests add, because otherwise they are left at a default value of `0` and are excluded by the new `OCSPStaleMaxAge` functionality.
Adds a gRPC server to the SA and SA gRPC Clients to the WFE, RA, CA, Publisher, OCSP updater, orphan finder, admin revoker, and expiration mailer.
Also adds a CA gRPC client to the OCSP Updater which was missed in #2193.
Fixes#2347.
With the current gRPC design the CA talks directly to the Publisher when calling SubmitToCT which crosses security bounadries (secure internal segment -> internet facing segment) which is dangerous if (however unlikely) the Publisher is compromised and there is a gRPC exploit that allows memory corruption on the caller end of a RPC which could expose sensitive information or cause arbitrary issuance.
Instead we move the RPC call to the RA which is in a less sensitive network segment. Switching the call site from the CA -> RA is gated on adding the gRPC PublisherService object to the RA config.
Fixes#2202.
As described in #2282, our gRPC code uses mutual TLS to authenticate both clients and servers. However, currently our gRPC servers will accept any client certificate signed by the internal CA we use to authenticate connections. Instead, we would like each server to have a list of which clients it will accept. This will improve security by preventing the compromise of one client private key being used to access endpoints unrelated to its intended scope/purpose.
This PR implements support for gRPC servers to specify a list of accepted client names. A `serverTransportCredentials` implementing `ServerHandshake` uses a `verifyClient` function to enforce that the connecting peer presents a client certificate with a SAN entry that matches an entry on the list of accepted client names
The `NewServer` function from `grpc/server.go` is updated to instantiate the `serverTransportCredentials` used by `grpc.NewServer`, specifying an accepted names list populated from the `cmd.GRPCServerConfig.ClientNames` config field.
The pre-existing client and server certificates in `test/grpc-creds/` are replaced by versions that contain SAN entries as well as subject common names. A DNS and an IP SAN entry are added to allow testing both methods of specifying allowed SANs. The `generate.sh` script is converted to use @jsha's `minica` tool (OpenSSL CLI is blech!).
An example client whitelist is added to each of the existing gRPC endpoints in config-next/ to allow the SAN of the test RPC client certificate.
Resolves#2282
Right now, we only get single-threaded performance from our HSM, even though it
has multiple cores. We can use the pkcs11key's NewPool function to create a pool
of PKCS#11 sessions, allowing us to take advantage of the HSM's full
performance.
Add feature flagged support for issuing for IDNs, fixes#597.
This patch expects that clients have performed valid IDN2008 encoding on any label that includes unicode characters. Invalid encodings (including non-compatible IDN2003 encoding) will be rejected. No script-mixing or script exclusion checks are performed as we assume that if a name is resolvable that it conforms to the registrar's policies on these matters and if it uses non-standard scripts in sub-domains etc that browsers should be the ones choosing how to display those names.
Required a full update of the golang.org/x/net tree to pull in golang.org/x/net/idna, all test suites pass.
This PR removes the use of the global configuration variable BOULDER_CONFIG. It also removes the global configuration struct cmd.Config. Furthermore, it removes the dependency codegangsta/cli and the last bit of code that was using it cmd/single-ocsp/main.go.
This is the final (hopefully) pull request in the work to remove the reliance on a global configuration structure. Included below is a history of all other pull requests relevant in accomplishing this:
WFE (#1973)
RA (#1974)
SA (#1975)
CA (#1978)
VA (#1979)
Publisher (#2008)
OCSP Updater (#2013)
OCSP Responder (#2017)
Admin Revoker (#2053)
Expiration Mailer (#2036)
Cert Checker (#2058)
Orphan Finder (#2059)
Single OCSP (this PR)
Closes#1962
Another step in completing #1962, which will remove the global configuration file and codegangsta/cli from boulder. 3 more to go!
This PR, is a little bit different than others in that there was a lot more reliance on codegangsta/cli especially in the implementation of subcommands. I put some thought into creating our own SubCommand struct, but given the lack of complexity it seemed unnecessary as the same could be accomplished with slightly more advanced usage of os and flag.
As described in issue #2025 if the contents of a DBConnectFile argument ends in a trailing newline an error is produced when later using the contents as a DB configuration URL.
This PR changes the URL function of the DBConfig type to strip leading and trailing whitespace (including newlines) when reading the DBConnectFile contents. This resolves#2025.
Moves the wfe to it's own config file.
Each config will now belong in `test/config` and `test/config-next` analogous to `boulder-config` and `boulder-config-next`.
Moves the `subscriberAgreementURL` from a top-level config variable to the WFE section. The top level configuration is still respected until we can change the prod configs.
Under the new defaults, if the syslog section is missing, we'll use the default
config that we use in prod: no logs to stdout, INFO and below to syslog.
This allows us to remove the syslog section from prod configs, and potentially
move it to individual service configs in the future.
* Improve syslog defaults.
* Add stdout logging for purger test.
* Use plain int for sysloglevel.
* Fix JSON syntax
* Fix syslog config for expired-authz-purger.
https://github.com/letsencrypt/boulder/pull/1932
This commit adds a new notify-mailer command. Outside of the new command, this PR also:
Adds a new SMTPConfig to cmd/config.go that is shared between the expiration mailer and the notify mailer.
Modifies mail/mailer.go to add an smtpClient interface.
Adds a dryRunClient to mail/mailer.go that implements the smtpClient interface.
Modifies the mail/mailer.go MailerImpl and constructor to use the SMTPConfig and a dialer. The missing functions from the smtpClient interface are added.
The notify-mailer command supports checkpointing through --start and --end parameters. It supports dry runs by using the new dryRunClient from the mail package when given the --dryRun flag. The speed at which emails are sent can be tweaked using the --sleep flag.
Unit tests for notify-mailer's checkpointing behaviour, the checkpoint interval/sleep parameter sanity, the sleep behaviour, and the message content construction are included in main_test.go.
Future work:
A separate command to generate the list of destination emails provided to notify-mailer
Support for using registration IDs as input and resolving the email address at runtime.
Resolves#1928. Credit to @jsha for the initial work - I'm just completing the branch he started.
* Adds `notify-mailer` command.
* Adds a new SMTPConfig to `cmd/config.go` that is shared between the
expiration mailer and the notify mailer.
* Modifies `mail/mailer.go` to add an `smtpClient` interface.
* Adds a `dryRunClient` to `mail/mailer.go` that implements the
`smtpClient` interface.
* Modifies the `mail/mailer.go` `MailerImpl` and constructor to use the
SMTPConfig and a dialer. The missing functions from the `smtpClient`
interface are added.
* Fix errcheck warnings
* Review feedback
* Review feedback pt2
* Fixes#1446 - invalid message-id generation.
* Change -configFile to -config
* Test message ID with friendly email
https://github.com/letsencrypt/boulder/pull/1936
Presently clients may request a new AuthZ be created for a domain that they have already proved authorization over. This results in unnecessary bloat in the authorizations table and duplicated effort.
This commit alters the `NewAuthorization` function of the RA such that before going through the work of creating a new AuthZ it checks whether there already exists a valid AuthZ for the domain/regID that expires in more than 24 hours from the current date. If there is, then we short circuit creation and return the existing AuthZ. When this case occurs the `RA.ReusedValidAuthz` counter is incremented to provide visibility.
Since clients requesting a new AuthZ and getting an AuthZ back expect to turn around and post updates to the corresponding challenges we also return early in `UpdateAuthorization` when asked to update an AuthZ that is already valid. When this case occurs the `RA.ReusedValidAuthzChallenge` counter is incremented.
All of the above behaviour is gated by a new RA config flag `reuseValidAuthz`. In the default case (false) the RA does **not** reuse any AuthZ's and instead maintains the historic behaviour; always creating a new AuthZ when requested, irregardless of whether there are already valid AuthZ's that could be reused. In the true case (enabled only in `boulder-config-next.json`) the AuthZ reuse described above is enabled.
Resolves#1854
* Split CSR testing and name hoisting into own functions, verify CSR in RA & CA
* Move tests around and various other fixes
* 1.5.3 doesn't have the needed stringer
* Move functions to their own lib
* Remove unused imports
* Move MaxCNLength and BadSignatureAlgorithms to csr package
* Always normalizeCSR in VerifyCSR and de-export it
* Update comments
When a CAA request to Unbound times out, fall back to checking CAA via Google Public DNS' HTTPS API, through multiple proxies so as to hit geographically distributed paths. All successful multipath responses must be identical in order to succeed, and at most one can fail.
Fixes#1618
* Delete Policy DB.
This is no longer needed now that we have a JSON policy file.
* Fix tests.
* Revert Dockerfile.
* Fix create_db
* Simplify user addition.
* Fix tests.
* Fix tests
* Review fixes.
https://github.com/letsencrypt/boulder/pull/1773
* Split out CAA checking service (minus logging etc)
* Add example.yml config + follow general Boulder style
* Update protobuf package to correct version
* Add grpc client to va
* Add TLS authentication in both directions for CAA client/server
* Remove go lint check
* Add bcodes package listing custom codes for Boulder
* Add very basic (pull-only) gRPC metrics to VA + caa-service
It's now the default in all cases that it was configurable. When we want to
suppress SQL debug messages, we can simply adjust the logging level to suppress
debug messages in general.
Also, pass a logger to SetSQLDebug rather than calling GetAuditLogger.
This eliminates the need the a database to store the hostname policy,
simplifying deployment. We keep the database for now, as part of our
deployability guidelines: we'll deploy, then switch config to the new style.
This also disables the obsolete whitelist checking code, but doesn't yet change
the function signature for policy.New(), to avoid bloating the pull request.
I'll fully remove the whitelist checking code in a future change when I also
remove the policy database code.
google/certificate-transparency provides a new method, AddChainWithContext,
that allwos us to cancel a submission attempt if it takes longer than a
provided timeout using context.WithTimeout. Also refactor the initialization
method and fix a previously broken test (related to Retry-After headers).
It's behind a new temporary config flag.
Also, check if the CN is over 64 bytes.
This also makes sure the certificate's Subject is not empty if the CN is
empty by always setting the SerialNumber in Subject.
While I was here, I also corrected the logged hex encoding of
SerialNumber so that its prefixed by zeroes correctly. See the use of
core.SerialToString in IssueCertificate.
I also added a test for the no CommonName and no SANs case.
Fixes#40