Things removed:
* features.EmbedSCTs (and all the associated RA/CA/ocsp-updater code etc)
* ca.enablePrecertificateFlow (and all the associated RA/CA code)
* sa.AddSCTReceipt and sa.GetSCTReceipt RPCs
* publisher.SubmitToCT and publisher.SubmitToSingleCT RPCs
Fixes#3755.
We'd like to start using the DNS load balancer in the latest version of gRPC. That means putting all IPs for a service under a single hostname (or using a SRV record, but we're not taking that path). This change adds an sd-test-srv to act as our service discovery DNS service. It returns both Boulder IP addresses for any A lookup ending in ".boulder". This change also sets up the Docker DNS for our boulder container to defer to sd-test-srv when it doesn't know an answer.
sd-test-srv doesn't know how to resolve public Internet names like `github.com`. Resolving public names is required for the `godep-restore` test phase, so this change breaks out a copy of the boulder container that is used only for `godep-restore`.
This change implements a shim of a DNS resolver for gRPC, so that we can switch to DNS-based load balancing with the currently vendored gRPC, then when we upgrade to the latest gRPC we won't need a simultaneous config update.
Also, this change introduces a check at the end of the integration test that each backend received at least one RPC, ensuring that we are not sending all load to a single backend.
Remove various unnecessary uses of fmt.Sprintf - in particular:
- Avoid calls like t.Error(fmt.Sprintf(...)), where t.Errorf can be used directly.
- Use strconv when converting an integer to a string, rather than using
fmt.Sprintf("%d", ...). This is simpler and can also detect type errors at
compile time.
- Instead of using x.Write([]byte(fmt.Sprintf(...))), use fmt.Fprintf(x, ...).
In #3651 we introduced a new parameter to sa.AddCertificate to allow specifying the Issued date. If nil, we defaulted to the current time to maintain deployability guidelines.
Now that this has been deployed everywhere this PR updates SA.AddCertificate and the gRPC wrappers such that a nil issuer argument is rejected with an error.
Unit tests that were previously using nil for the issued time are updated to explicitly set the issued time to the fake clock's now().
Resolves#3657
Now that #3638 has been deployed to all of the RA instances there are no
more RPC clients using the SA's `CountCertificatesRange` RPC.
This commit deletes the implementation, the RPC definition & wrappers,
and all the test code/mocks.
This PR updates the Boulder gRPC clientInterceptor to update a Prometheus gauge stat for each in-flight RPC it dispatches, sliced by service and method.
A unit test is included that uses a custom ChillerServer that lets the test block up a bunch of RPCs, check the in-flight gauge value is increased, unblock the RPCs, and recheck that the in-flight gauge is reduced. To check the gauge value for a specific set of labels a new test-tools.go function GaugeValueWithLabels is added.
Updates #3635
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.
This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.
Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands.
A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.
Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
The Boulder orphan-finder command uses the SA's AddCertificate RPC to add orphaned certificates it finds back to the DB. Prior to this commit this RPC always set the core.Certificate.Issued field to the
current time. For the orphan-finder case this meant that the Issued date would incorrectly be set to when the certificate was found, not when it was actually issued. This could cause cert-checker to alarm based on the unusual delta between the cert NotBefore and the core.Certificate.Issued value.
This PR updates the AddCertificate RPC to accept an optional issued timestamp in the request arguments. In the SA layer we address deployability concerns by setting a default value of the current time when none is explicitly provided. This matches the classic behaviour and will let an old RA communicate with a new SA.
This PR updates the orphan-finder to provide an explicit issued time to sa.AddCertificate. The explicit issued time is calculated using the found certificate's NotBefore and the configured backdate.
This lets the orphan-finder set the true issued time in the core.Certificate object, avoiding any cert-checker alarms.
Resolves#3624
Submits final certificates to any configured CT logs. This doesn't introduce a feature flag as it is config gated, any log we want to submit final certificates to needs to have it's log description updated to include the `"submitFinalCerts": true` field.
Fixes#3605.
During periods of peak load, some RPCs are significantly delayed (on the order of seconds) by client-side blocking. HTTP/2 clients have to obey a "max concurrent streams" setting sent by the server. In Go's HTTP/2 implementation, this value [defaults to 250](https://github.com/golang/net/blob/master/http2/server.go#L56), so the gRPC default is also 250. So whenever there are more than 250 requests in progress at a time, additional requests will be delayed until there is a slot available.
During this peak load, we aren't hitting limits on CPU or memory, so we should increase the max concurrent streams limit to take better advantage of our available resources. This PR adds a config field to do that.
Fixes#3641.
gRPC passes deadline information through the RPC boundary, but client and server have the same deadline. Ideally we'd like the server to have a slightly tighter deadline than the client, so if one of the server's onward RPCs or other network calls times out, the server can pass back more detailed information to the client, rather than the client timing out the server and losing the opportunity to log more detailed information about which component caused the timeout.
In this change, I subtract 100ms from the deadline on the server side of our interceptors, using our existing serverInterceptor. I also check that there is at least 100ms remaining in which to do useful work, so the server doesn't begin a potentially expensive task only to abort it.
Fixes#3608.
Adds SCT embedding to the certificate issuance flow. When a issuance is requested a precertificate (the requested certificate but poisoned with the critical CT extension) is issued and submitted to the required CT logs. Once the SCTs for the precertificate have been collected a new certificate is issued with the poison extension replace with a SCT list extension containing the retrieved SCTs.
Fixes#2244, fixes#3492 and fixes#3429.
Previously we introduced the concept of a "pending orders per account
ID" rate limit. After struggling with making an implementation of this
rate limit perform well we reevaluated the problem and decided a "new
orders per account per time window" rate limit would be a better fit for
ACMEv2 overall.
This commit introduces the new newOrdersPerAccount rate limit. The RA
now checks this before creating new pending orders in ra.NewOrder. It
does so after order reuse takes place ensuring the rate limit is only
applied in cases when a distinct new pending order row would be created.
To accomplish this a migration for a new orders field (created) and an
index over created and registrationID is added. It would be possible to
use the existing expires field for this like we've done in the past, but that
was primarily to avoid running a migration on a large table in prod. Since
we don't have that problem yet for V2 tables we can Do The Right Thing
and add a column.
For deployability the deprecated pendingOrdersPerAccount code & SA
gRPC bits are left around. A follow-up PR will be needed to remove
those (#3502).
Resolves#3410
Prior to this commit when building up the authorizations for a new-order
request we looked for any unexpired pending/valid authorizations owned
by the account and used them for the order. This allows a client to use
the V1 new-authz endpoint in combination with the V2 new-order endpoint
and we do not want to support this behaviour. All V2 authorizations
should be sourced from other V2 orders. This commit implements a new
parameter for the SA's getAuthorizations function that allows filtering
out legacy V1 authorizations by doing a JOIN on the order to
authorizations join table.
Resolves#3328
This commit resolves the case where an error during finalization occurs.
Prior to this commit if an error (expected or otherwise) occurred after
setting an order to status processing at the start of order
finalization the order would be stuck processing forever.
The SA now has a `SetOrderError` RPC that can be used by the RA to
persist an error onto an order. The order status calculation can use
this error to decide if the order is invalid. The WFE is updated to
write the error to the order JSON when displaying the order information.
Prior to this commit the order protobuf had the error field as
a `[]byte`. It doesn't seem like this is the right decision, we have
a specific protobuf type for ProblemDetails and so this commit switches
the error field to use it. The conversion to/from `[]byte` is done with
the model by the SA.
An integration test is included that prior to this commit left an order
in a stuck processing state. With this commit the integration test
passes as expected.
Resolves https://github.com/letsencrypt/boulder/issues/3403
The SA RPC previously called `GetOrderAuthorizations` only returns
**valid, unexpired** authorizations. This commit updates the name to
emphasize that it only returns valid order authzs.
This PR is a rework of what was originally https://github.com/letsencrypt/boulder/pull/3382, integrating the design feedback proposed by @jsha: https://github.com/letsencrypt/boulder/pull/3382#issuecomment-359912549
This PR removes the stored Order status field and replaces it with a value that is calculated on-the-fly by the SA when fetching an order, based on the order's associated authorizations.
In summary (and order of precedence):
* If any of the order's authorizations are invalid, the order is invalid.
* If any of the order's authorizations are deactivated, the order is deactivated.
* If any of the order's authorizations are pending, the order is pending.
* If all of the order's authorizations are valid, and there is a certificate serial, the order is valid.
* If all of the order's authorizations are valid, and we have began processing, but there is no certificate serial, the order is processing.
* If all of the order's authorizations are valid, and we haven't processing, then the order is pending waiting a finalization request.
This avoids having to explicitly update the order status when an associated authorization changes status.
The RA's implementation of new-order is updated to only reuse an existing order if the calculated status is pending. This avoids giving back invalid or deactivated orders to clients.
Resolves#3333
This change adds a feature flag, TLSSNIRevalidation. When it is enabled, Boulder
will create new authorization objects with TLS-SNI challenges if the requesting
account has issued a certificate with the relevant domain name, and was the most
recent account to do so*. This setting overrides the configured list of
challenges in the PolicyAuthority, so even if TLS-SNI is disabled in general, it
will be enabled for revalidation.
Note that this interacts with EnforceChallengeDisable. Because
EnforceChallengeDisable causes additional checked at validation time and at
issuance time, we need to update those two places as well. We'll send a
follow-up PR with that.
*We chose to make this work only for the most recent account to issue, even if
there were overlapping certificates, because it significantly simplifies the
database access patterns and should work for 95+% of cases.
Note that this change will let an account revalidate and reissue for a domain
even if the previous issuance on that account used http-01 or dns-01. This also
simplifies implementation, and fits within the intent of the mitigation plan: If
someone previously issued for a domain using http-01, we have high confidence
that they are actually the owner, and they are not going to "steal" the domain
from themselves using tls-sni-01.
Also note: This change also doesn't work properly with ReusePendingAuthz: true.
Specifically, if you attempted issuance in the last couple days and failed
because there was no tls-sni challenge, you'll still have an http-01 challenge
lying around, and we'll reuse that; then your client will fail due to lack of
tls-sni challenge again.
This change was joint work between @rolandshoemaker and @jsha.
This commit adds pending order reuse. Subsequent to this commit multiple
add-order requests from the same account ID for the same set of order
names will result in only one order being created. Orders are only
reused while they are not expired. Finalized orders will not be reused
for subsequent new-order requests allowing for duplicate order issuance.
Note that this is a second level of reuse, building on the pending
authorization reuse that's done between separate orders already.
To efficiently find an appropriate order ID given a set of names,
a registration ID, and the current time a new orderFqdnSets table is
added with appropriate indexes and foreign keys.
Resolves#3258
* Allow nil `Authz` slice in `GetAuthorizations` response.
The `StorageAuthorityClientWrapper` was enforcing that the response to
a `GetAuthorizations` request did not have `resp.Authz == nil`. This
meant that the RA's `NewOrder` function failed when creating an order
for names that had no existing authorizations to reuse.
This commit updates the wrapper to allow `resp.Authz` to be nil - this
is a valid case when there are no authorizations found.
* Fix SA server wrapper `AddPendingAuthorizations` logic.
Prior to this commit the `StorageAuthorityServerWrapper`'s
`AddPendingAuthorizations` function had an error in the boolean logic
for determining if a request was incomplete. It was rejecting any
requests that had a non-nil `Authz`. This commit fixes the logic so that
it rejects requests that have a **nil** `Authz`.
* Add `newOrderValid` for new-order rpc wrappers.
This commit updates the `StorageAuthorityServerWrapper`'s `NewOrder`
function to use a new pb-marshalling utility function `newOrderValid` to
determine if the provided order is valid or not. Previous to this commit
the `NewOrder` server wrapper used `orderValid` which rejected orders
that had a nil `Id`. This is incorrect because **all** orders provided
to `NewOrder` have a nil id! They haven't been added yet :-)
* Fix SA server wrapper `GetOrder` incomplete response check.
Prior to this commit the `StorageAuthorityClientWrapper`'s `GetOrder`
function was validating that the returned order had a non-nil
`CertificateSerial`. This isn't correct - you can GET an order that
hasn't been finalized with a certificate and it should work. This commit
updates the `GetOrder` function to use the utility `orderValid` function
that allows for a nil `CertificateSerial` but enforces all other fields
are populated as expected.
* Allow nil Authz in `GetOrderAuthorizations` response.
This commit fixes the `StorageAuthorityClientWrapper`'s
`GetOrderAuthorizations` function to not consider a response with a nil
`Authz` array incomplete. This condition happens under normal
circumstances when an attempt to finalize an order is made for an order
that has completed no authorizations.
The go-grpc-prometheus package by default registers its metrics with Prometheus' global registry. In #3167, when we stopped using the global registry, we accidentally lost our gRPC metrics. This change adds them back.
Specifically, it adds two convenience functions, one for clients and one for servers, that makes the necessary metrics object and registers it. We run these in the main function of each server.
I considered adding these as part of StatsAndLogging, but the corresponding ClientMetrics and ServerMetrics objects (defined by go-grpc-prometheus) need to be subsequently made available during construction of the gRPC clients and servers. We could add them as fields on Scope, but this seemed like a little too much tight coupling.
Also, update go-grpc-prometheus to get the necessary methods.
```
$ go test github.com/grpc-ecosystem/go-grpc-prometheus/...
ok github.com/grpc-ecosystem/go-grpc-prometheus 0.069s
? github.com/grpc-ecosystem/go-grpc-prometheus/examples/testproto [no test files]
```
This commit adds a new rate limit to restrict the number of outstanding
pending orders per account. If the threshold for this rate limit is
crossed subsequent new-order requests will return a 429 response.
Note: Since this the rate limit object itself defines an `Enabled()`
test based on whether or not it has been configured there is **not**
a feature flag for this change.
Resolves https://github.com/letsencrypt/boulder/issues/3246
This PR implements order finalization for the ACME v2 API.
In broad strokes this means:
* Removing the CSR from order objects & the new-order flow
* Adding identifiers to the order object & new-order
* Providing a finalization URL as part of orders returned by new-order
* Adding support to the WFE's Order endpoint to receive finalization POST requests with a CSR
* Updating the RA to accept finalization requests and to ensure orders are fully validated before issuance can proceed
* Updating the SA to allow finding order authorizations & updating orders.
* Updating the CA to accept an Order ID to log when issuing a certificate corresponding to an order object
Resolves#3123
For the new-order endpoint only. This does some refactoring of the order of operations in `ra.NewAuthorization` as well in order to reduce the duplication of code relating to creating pending authorizations, existing tests still seem to work as intended... A close eye should be given to this since we don't have integration tests yet that test it end to end. This also changes the inner type of `grpc.StorageAuthorityServerWrapper` to `core.StorageAuthority` so that we can avoid a circular import that is created by needing to import `grpc.AuthzToPB` and `grpc.PBToAuthz` in `sa/sa.go`.
This is a big change but should considerably improve the performance of the new-order flow.
Fixes#2955.
* Remove all of the errors under core. Their purpose is now served by errors, and they were almost entirely unused. The remaining uses were switched to errors.
* Remove errors.NotSupportedError. It was used in only one place (ca.go), and that usage is more appropriately a ServerInternal error.
Stub out IssueCertificateForPrecertificate() enough so that we can continue with the PRs that implement & test it in parallel with PRs that implement and test the calling side (via mock implementations of the CA side).
* CA: Stub IssuePrecertificate gPRC method.
* CA: Implement IssuePrecertificate.
* CA: Test Precertificate flow in TestIssueCertificate().
move verification of certificate storage
IssuePrecertificate tests
Add CT precertificate poison extension to CFSSL whitelist.
CFSSL won't allow us to add an extension to a certificate unless that
certificate is in the whitelist.
According to its documentation, "Extensions requested in the CSR are
ignored, except for those processed by ParseCertificateRequest (mainly
subjectAltName)." Still, at least we need to add tests to make sure a
poison extension in a CSR isn't copied into the final certificate.
This allows us to avoid making invasive changes to CFSSL.
* CA: Test precertificate issuance in TestInvalidCSRs().
* CA: Only support IssuePrecertificate() if it is explicitly enabled.
* CA: Test that we produce CT poison extensions in the valid form.
The poison extension must be critical in order to work correctly. It probably wouldn't
matter as much what the value is, but the spec requires the value to be ASN.1 NULL, so
verify that it is.
Switch certificates and certificateStatus to use autoincrement primary keys to avoid performance problems with clustered indexes (fixes#2754).
Remove empty externalCerts and identifierData tables (fixes#2881).
Make progress towards deleting unnecessary LockCol and subscriberApproved fields (#856, #873) by making them NULLable and not including them in INSERTs and UPDATEs.
The existing ReusePendingAuthz implementation had some bugs:
It would recycle deactivated authorizations, which then couldn't be fulfilled. (#2840)
Since it was implemented in the SA, it wouldn't get called until after the RA checks the Pending Authorizations rate limit. Which means it wouldn't fulfill its intended purpose of making accounts less likely to get stuck in a Pending Authorizations limited state. (#2831)
This factors out the reuse functionality, which used to be inside an "if" statement in the SA. Now the SA has an explicit GetPendingAuthorization RPC, which gets called from the RA before calling NewPendingAuthorization. This happens to obsolete #2807, by putting the recycling logic for both valid and pending authorizations in the RA.
This commit replaces the Boulder dependency on
gopkg.in/square/go-jose.v1 with gopkg.in/square/go-jose.v2. This is
necessary both to stay in front of bitrot and because the ACME v2 work
will require a feature from go-jose.v2 for JWS validation.
The largest part of this diff is cosmetic changes:
Changing import paths
jose.JsonWebKey -> jose.JSONWebKey
jose.JsonWebSignature -> jose.JSONWebSignature
jose.JoseHeader -> jose.Header
Some more significant changes were caused by updates in the API for
for creating new jose.Signer instances. Previously we constructed
these with jose.NewSigner(algorithm, key). Now these are created with
jose.NewSigner(jose.SigningKey{},jose.SignerOptions{}). At present all
signers specify EmbedJWK: true but this will likely change with
follow-up ACME V2 work.
Another change was the removal of the jose.LoadPrivateKey function
that the wfe tests relied on. The jose v2 API removed these functions,
moving them to a cmd's main package where we can't easily import them.
This function was reimplemented in the WFE's test code & updated to fail
fast rather than return errors.
Per CONTRIBUTING.md I have verified the go-jose.v2 tests at the imported
commit pass:
ok gopkg.in/square/go-jose.v2 14.771s
ok gopkg.in/square/go-jose.v2/cipher 0.025s
? gopkg.in/square/go-jose.v2/jose-util [no test files]
ok gopkg.in/square/go-jose.v2/json 1.230s
ok gopkg.in/square/go-jose.v2/jwt 0.073s
Resolves#2880
This is a step towards the long-term goal of eliminating wrappers and a step
towards the short-term goal of making it easier to refactor ca/ca_test.go to
add testing of precertificate-based issuance.
Prior to this PR the SA's `CountRegistrationsByIP` treated IPv6
differently than IPv4 by counting registrations within a /48 for IPv6 as
opposed to exact matches for IPv4. This PR updates
`CountRegistrationsByIP` to treat IPv4 and IPv6 the
same, always matching exactly. The existing RegistrationsPerIP rate
limit policy will be applied against this exact matching count.
A new `CountRegistrationsByIPRange` function is added to the SA that
performs the historic matching process, e.g. for IPv4 it counts exactly
the same as `CountRegistrationsByIP`, but for IPv6 it counts within
a /48.
A new `RegistrationsPerIPRange` rate limit policy is added to allow
configuring the threshold/window for the fuzzy /48 matching registration
limit. Stats for the "Exceeded" and "Pass" events for this rate limit are
separated into a separate `RegistrationsByIPRange` stats scope under
the `RateLimit` scope to allow us to track it separate from the exact
registrations per IP rate limit.
Resolves https://github.com/letsencrypt/boulder/issues/2738
This PR introduces a new feature flag "IPv6First".
When the "IPv6First" feature is enabled the VA's HTTP dialer and TLS SNI
(01 and 02) certificate fetch requests will attempt to automatically
retry when the initial connection was to IPv6 and there is an IPv4
address available to retry with.
This resolves https://github.com/letsencrypt/boulder/issues/2623
This PR removes two berrors that aren't used anywhere in the codebase:
TooManyRequests , a holdover from AMQP, and is no longer used.
UnsupportedIdentifier, used just for rejecting IDNs, which we no longer do.
In addition, the SignatureValidation error was only used by the WFE so it is moved there and unexported.
Note for reviewers: To remove berrors.UnsupportedIdentifierError I replaced the errIDNNotSupported error in policy/pa.go with a berrors.MalformedError with the same name. This allows removing UnsupportedIdentifierError ahead of #2712 which removes the IDNASupport feature flag. This seemed OK to me, but I can restore UnsupportedIdentifierError and clean it up after 2712 if that's preferred.
Resolves#2709
Prior to this PR if a domain was an exact match to a public suffix
list entry the certificates per name rate limit was applied based on the
count of certificates issued for that exact name and all of its
subdomains.
This PR introduces an exception such that exact public suffix
matches correctly have the certificate per name rate limit applied based
on only exact name matches.
In order to accomplish this a new RPC is added to the SA
`CountCertificatesByExactNames`. This operates similar to the existing
`CountCertificatesByNames` but does *not* include subdomains in the
count, only exact matches to the names provided. The usage of this new
RPC is feature flag gated behind the "CountCertificatesExact" feature flag.
The RA unit tests are updated to test the new code paths both with and
without the feature flag enabled.
Resolves#2681
Generate first OCSP response in ca.IssueCertificate instead of ocsp-updater.newCertificateTick
if features.GenerateOCSPEarly is enabled. Adds a new field to the sa.AddCertiifcate RPC for
the OCSP response and only adds it to the certificate status + sets ocspLastUpdated if it is a
non-empty slice. ocsp-updater.newCertificateTick stays the same so we can catch certificates
that were successfully signed + stored but a OCSP response couldn't be generated (for whatever
reason).
Fixes#2477.
Updates the various gRPC/protobuf libs (google.golang.org/grpc/... and github.com/golang/protobuf/proto) and the boulder-tools image so that we can update to the newest github.com/grpc-ecosystem/go-grpc-prometheus. Also regenerates all of the protobuf definition files.
Tests run on updated packages all pass.
Unblocks #2633fixes#2636.
This patch removes all usages of the `core.XXXError` and almost all usages of `probs` outside of the WFE and VA and replaces them with a unified internal error type. Since the VA uses `probs.ProblemDetails` quite extensively in challenges, and currently stores them in the DB I've saved this change for another change (it'll also require a migration). Since `ProblemDetails` should only ever be exposed to end-users all of its related logic should be moved into the `WFE` but since it still needs to be exposed to the VA and SA I've left it in place for now.
The new internal `errors` package offers the same convenience functions as `probs` does as well as a new simpler type testing method. A few small changes have also been made to error messages, mainly adding the library and function name to internal server errors for easier debugging (i.e. where a number of functions return the exact same errors and there is no other way to distinguish which method threw the error).
Also adds proper encoding of internal errors transferred over gRPC (the current encoding scheme is kept for `core` and `probs` errors since it'll be ideally be removed after we deploy this and follow-up changes) using `grpc/metadata` instead of the gRPC status codes.
Fixes#2507. Updates #2254 and #2505.
The gRPC client reconnect code needs to be able to check if a error is temporary so that it can decide if it should attempt to reconnect or just fail and kill the client[1]. By wrapping the error we were receiving in our TLS handshake code we were removing the existing `Temporary` interface on the error. This meant that if a client attempted to reconnect to a server that was in the process of being shutdown, the client would consider that server permanently dead and never retry.
Fix is simple: don't wrap errors that we pass back into the gRPC internals so that they can be properly inspected.
[1]: aefc96d792/clientconn.go (L783)