We'd like to start using the DNS load balancer in the latest version of gRPC. That means putting all IPs for a service under a single hostname (or using an SRV record, but we're not taking that path). This change adds an sd-test-srv to act as our service discovery DNS service. It returns both Boulder IP addresses for any A lookup ending in ".boulder". This change also sets up the Docker DNS for our boulder container to defer to sd-test-srv when it doesn't know an answer.
sd-test-srv doesn't know how to resolve public Internet names like `github.com`. Resolving public names is required for the `godep-restore` test phase, so this change breaks out a copy of the boulder container that is used only for `godep-restore`.
This change implements a DNS resolver shim for gRPC, so that we can switch to DNS-based load balancing with the currently vendored gRPC; then, when we upgrade to the latest gRPC, we won't need a simultaneous config update.
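As a rough illustration of the shim idea, here is a minimal sketch against the older `grpc/naming` API that the vendored gRPC exposes; the interfaces are real, but this is not Boulder's actual implementation:
```go
package dnsresolver

import (
	"errors"
	"net"

	"google.golang.org/grpc/naming"
)

// Resolver is a minimal DNS-backed naming.Resolver.
type Resolver struct{ port string }

func New(port string) *Resolver { return &Resolver{port: port} }

func (r *Resolver) Resolve(target string) (naming.Watcher, error) {
	return &watcher{host: target, port: r.port, stop: make(chan struct{})}, nil
}

type watcher struct {
	host, port string
	sent       bool
	stop       chan struct{}
}

// Next returns one Update per resolved IP on the first call; a real
// implementation would re-resolve periodically instead of blocking.
func (w *watcher) Next() ([]*naming.Update, error) {
	if w.sent {
		<-w.stop
		return nil, errors.New("watcher closed")
	}
	w.sent = true
	ips, err := net.LookupHost(w.host)
	if err != nil {
		return nil, err
	}
	var updates []*naming.Update
	for _, ip := range ips {
		updates = append(updates, &naming.Update{
			Op:   naming.Add,
			Addr: net.JoinHostPort(ip, w.port),
		})
	}
	return updates, nil
}

func (w *watcher) Close() { close(w.stop) }
```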
Also, this change introduces a check at the end of the integration test that each backend received at least one RPC, ensuring that we are not sending all load to a single backend.
This is a quick first pass at audit logging the *dns.CAA records in JSON format from within the VA's IsCAAValid function. This will provide more information for post-hoc analysis of CAA decisions.
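A minimal sketch of what the logged line could look like - the logging call and line format here are assumptions, not the exact `IsCAAValid` code:
```go
package main

import (
	"encoding/json"
	"log"

	"github.com/miekg/dns"
)

func main() {
	// In the VA these records come back from the CAA lookup.
	records := []*dns.CAA{{Tag: "issue", Value: "letsencrypt.org"}}
	encoded, err := json.Marshal(records)
	if err != nil {
		log.Fatal(err)
	}
	// Hypothetical audit line; the real prefix and logger differ.
	log.Printf("AUDIT: CAA records for %s: %s", "example.com", encoded)
}
```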
Per RFC 6844 Section 5.1 the matching of CAA tag identifiers (e.g.
"Issue") is case insensitive. This commit updates the CAA tag processing
to be case insensitive as required by the RFC.
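The substance of the fix is just a case-insensitive comparison; a minimal sketch, assuming `*dns.CAA` from `github.com/miekg/dns` rather than the exact `va/caa.go` code:
```go
package main

import (
	"strings"

	"github.com/miekg/dns"
)

// isIssueTag reports whether a CAA record carries the "issue" tag.
// Per RFC 6844 Section 5.1 the comparison is case insensitive, so
// "Issue", "ISSUE", and "issue" must all match.
func isIssueTag(caa *dns.CAA) bool {
	return strings.EqualFold(caa.Tag, "issue")
}
```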
To exercise the fix this commit adds a test case to the `caaMockDNS`
`LookupCAA` implementation for a hostname (`mixedcase.com`) that has
a CAA record with a mixed case `Issue` tag. Prior to the fix from this
branch being included in `va/caa.go` the test fails:
```
--- FAIL: TestCAAChecking/Bad_(Reserved,_Mixed_case_Issue) (0.00s)
caa_test.go:292: checkCAARecords validity mismatch for
mixedcase.com: got true expected false
```
With the fix applied, the test passes.
This commit updates the `github.com/weppos/publicsuffix-go` dependency to
67ec7c1, the tip of master at the time of writing.
Unit tests are verified to pass:
```
$ go test ./...
? github.com/weppos/publicsuffix-go/cmd/load [no test files]
ok github.com/weppos/publicsuffix-go/net/publicsuffix (cached)
ok github.com/weppos/publicsuffix-go/publicsuffix (cached)
```
Prior to this PR, builds of master were failing in Travis with an error during the `godep-restore` phase of our CI:
```
[ godep-restore] Starting
godep restore
rm -rf Godeps/ vendor/
godep save ./...
diff /dev/fd/63 /dev/fd/62
254c254
< "Comment": "v1.3-28-g3955978",
---
> "Comment": "v1.3.0-28-g3955978",
> git diff --exit-code -- ./vendor/
>
```
The root cause here is the upstream `github.com/go-sql-driver/mysql` project adding a
new tag for the 3955978caca48c1658a4bb7a9c6a0f084e326af3 commit. The new tag
is named to match semver requirements, and godep in CI prefers that tag.
Updating the file to point to this new tag is a sensible fix from our side.
This commit updates the "Comment" field to match what CI expects.
Prior to this commit, builds of master were failing in Travis with an
error during the `godep-restore` phase of our CI:
```
[ godep-restore] Starting
godep restore
rm -rf Godeps/ vendor/
godep save ./...
diff /dev/fd/63 /dev/fd/62
254c254
< "Comment": "v1.3-28-g3955978",
---
> "Comment": "v1.3.0-28-g3955978",
> git diff --exit-code -- ./vendor/
>
```
This seems to be a mysterious difference in the "Comment" field of the
`github.com/go-sql-driver/mysql` dependency. This dep hasn't changed
versions, so given the general level of frustration involved with
debugging Godep, it seems like the easiest path forward is to mimic the
diff.
This commit updates the "Comment" field to match what CI expects.
Two fixes that I found while doing work on the gen-cert tool and setting up the HSM again:
* Accept an empty PIN argument; this allows using the PED alone for login when not using challenge mode.
* Generate a 4-byte key ID for public/private key pairs during key generation; the HSM doesn't generate this field itself, and `letsencrypt/pkcs11key` relies on this attribute to function.
When rechecking CAA, the existing code maps all failures to a CAAError.
This means that any other non-CAA failure (for example, an internal server
error) gets hidden.
Avoid this by reworking recheckCAA to return errors; if we find a
non-CAAError, we return it directly. Revise tests to cover both
situations.
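A minimal sketch of the reworked flow, with an illustrative `caaFailure` error type standing in for Boulder's real error plumbing:
```go
package main

import (
	"errors"
	"fmt"
)

// caaFailure marks a genuine CAA rejection; illustrative only.
type caaFailure struct{ domain string }

func (e *caaFailure) Error() string {
	return fmt.Sprintf("CAA record prevents issuance for %q", e.domain)
}

// recheckCAA now returns plain errors instead of mapping everything
// to a CAAError.
func recheckCAA(domain string) error {
	// ... re-query and evaluate CAA records ...
	return &caaFailure{domain: domain}
}

func handleRecheck(domain string) error {
	err := recheckCAA(domain)
	var cf *caaFailure
	if errors.As(err, &cf) {
		return fmt.Errorf("CAA rejection: %w", cf)
	}
	// Non-CAA failures (e.g. an internal server error) are returned
	// directly instead of being hidden behind a CAA error.
	return err
}

func main() { fmt.Println(handleRecheck("example.com")) }
```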
Updates issue #3143.
The existing CAA tests only test the CAA checks on the validation path and
not the CAA rechecking in the case where an existing authorization is present
(but older than the 8-hour window).
This extends the CAA integration tests to also cover the CAA rechecking
code path, by reusing older authorizations and rejecting issuance via CAA.
Production/staging have been updated to a release built with Go 1.10.2.
This allows us to remove the Go 1.10.1 builds from the Travis matrix and
default to Go 1.10.2.
While we intended to allow legacy ACME v1 accounts created through the WFE to work with the ACME v2 implementation and the WFE2, we neglected to consider that a legacy account would have a Key ID URL that doesn't match the form expected for a V2 account. This caused `wfe2/verify.go`'s `lookupJWK` to reject all POST requests authenticated by a legacy account unless the ACME client took the extra manual step of "fixing" the URL.
This PR adds a configuration parameter to the WFE2 for an allowed legacy key ID prefix. The WFE2 verification logic is updated to allow both the expected key ID prefix and the configured legacy key ID prefix. This will allow us to specify the correct legacy URL in configuration for both staging/prod to allow unmodified V1 ACME accounts to be used with ACME v2.
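A minimal sketch of the verification change - the prefixes shown are illustrative, and the function shape is simplified from `wfe2/verify.go`:
```go
package main

import (
	"fmt"
	"strings"
)

// acceptableKeyID allows both the native V2 account URL prefix and a
// configured legacy prefix for accounts created via the V1 WFE.
func acceptableKeyID(keyID, v2Prefix, legacyPrefix string) bool {
	return strings.HasPrefix(keyID, v2Prefix) ||
		(legacyPrefix != "" && strings.HasPrefix(keyID, legacyPrefix))
}

func main() {
	// Hypothetical prefixes for illustration.
	v2 := "https://example.com/acme/acct/"
	legacy := "https://example.com/acme/reg/"
	fmt.Println(acceptableKeyID("https://example.com/acme/reg/123", v2, legacy)) // true
}
```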
Resolves https://github.com/letsencrypt/boulder/issues/3674
So I took a quick stab at one possible solution for the authz expiration variance discussed over at community.letsencrypt.org/t/inconsistent-datetime-format-in-some-responses/61452
Go's nanosecond precision means a newly created pending authz has one expires datetime, while subsequent requests return a different one because database storage throws away the fractional-second information:
"expires": "2018-04-14T04:43:00.105818284Z",
...
"expires": "2018-04-14T04:43:00Z",
I am not a Go expert so there might be some more widely accepted approach to accomplishing the same thing, please let me know if you would prefer a different solution.
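Concretely, the idea is to drop the fractional seconds before the authz is first serialized, so the fresh response and the database round-trip agree; a minimal sketch:
```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Truncate to whole seconds so the freshly created pending authz
	// and the value later read back from the database serialize
	// identically.
	expires := time.Now().Add(7 * 24 * time.Hour).Truncate(time.Second)
	fmt.Println(expires.UTC().Format(time.RFC3339Nano)) // no fractional part
}
```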
Using both a sync.WaitGroup and a channel is somewhat unnecessary - instead,
synchronize directly on the channel. Additionally, strings in Go are
immutable - as such, using string concatenation in a loop results in
reallocation. Use a string slice and strings.Join, which not only avoids
this, but also cuts down on the lines of code needed.
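A minimal sketch of both patterns, with hypothetical names:
```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Synchronize directly on the channel - no sync.WaitGroup needed
	// when the channel receive already establishes completion.
	done := make(chan string)
	go func() { done <- "worker finished" }()
	fmt.Println(<-done)

	// Build a slice and join once, instead of s += piece in a loop,
	// which reallocates the (immutable) string on every iteration.
	items := []string{"alpha", "beta", "gamma"}
	parts := make([]string, 0, len(items))
	for _, it := range items {
		parts = append(parts, strings.ToUpper(it))
	}
	fmt.Println(strings.Join(parts, ", "))
}
```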
Remove various unnecessary uses of fmt.Sprintf - in particular (a before/after sketch follows the list):
- Avoid calls like t.Error(fmt.Sprintf(...)), where t.Errorf can be used directly.
- Use strconv when converting an integer to a string, rather than using
fmt.Sprintf("%d", ...). This is simpler and can also detect type errors at
compile time.
- Instead of using x.Write([]byte(fmt.Sprintf(...))), use fmt.Fprintf(x, ...).
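Illustrative before/after forms for each pattern above (identifiers are hypothetical):
```go
package example

import (
	"fmt"
	"io"
	"strconv"
	"testing"
)

func patterns(t *testing.T, w io.Writer, got, want int) {
	// Before: t.Error(fmt.Sprintf("got %d, want %d", got, want))
	t.Errorf("got %d, want %d", got, want)

	// Before: s := fmt.Sprintf("%d", got); strconv is simpler and the
	// argument type is checked at compile time.
	s := strconv.Itoa(got)

	// Before: w.Write([]byte(fmt.Sprintf("value: %s\n", s)))
	fmt.Fprintf(w, "value: %s\n", s)
}
```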
Many of the probs.XYZ calls are of the form probs.XYZ(fmt.Sprintf(...)).
Convert these functions to take a format string and optional arguments,
following the same pattern used in the errors package. Convert the
various call sites to remove the now redundant fmt.Sprintf calls.
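A minimal sketch of the converted shape, with a simplified `ProblemDetails`; Boulder's real type and problem-type constants differ:
```go
package probs

import "fmt"

type ProblemType string

const MalformedProblem = ProblemType("malformed")

type ProblemDetails struct {
	Type   ProblemType
	Detail string
}

// Malformed takes a format string and optional arguments, following
// the errors package pattern, so call sites can write
//   probs.Malformed("no such registration: %d", id)
// instead of
//   probs.Malformed(fmt.Sprintf("no such registration: %d", id))
func Malformed(format string, args ...interface{}) *ProblemDetails {
	return &ProblemDetails{
		Type:   MalformedProblem,
		Detail: fmt.Sprintf(format, args...),
	}
}
```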
The test_expired_authz_404() test is currently broken in two ways - firstly,
there is no way for it to distinguish between a 404 from an expired authz
and a 404 from a non-existent authz. Secondly, the test_expired_authz_purger()
test runs and wipes out all of the existing authorizations, including the one
that was set up from setup_seventy_days_ago(), before the expired test runs.
Avoid this by running the expired authorization purger test from later in main().
Also, add a minimal canary that will detect if all authorizations have been purged
(although this still does not guarantee that we got a 404 due to expiration).
A very large number of the logger calls are of the form log.Function(fmt.Sprintf(...)).
Rather than sprinkling fmt.Sprintf at every logger call site, provide formatting versions
of the logger functions and call these directly with the format and arguments.
While here remove some unnecessary trailing newlines and calls to String/Error.
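A minimal sketch of the wrapper shape, assuming a simple logger with one level; Boulder's logger has several:
```go
package blog

import "fmt"

// Logger is a stand-in for Boulder's syslog-backed logger.
type Logger struct{}

func (l *Logger) Err(msg string) { /* write to syslog, etc. */ }

// Errf is the formatting variant, so call sites can write
//   log.Errf("failed to load %q: %s", path, err)
// instead of
//   log.Err(fmt.Sprintf("failed to load %q: %s", path, err))
func (l *Logger) Errf(format string, a ...interface{}) {
	l.Err(fmt.Sprintf(format, a...))
}
```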
In staging/prod we use `doNotForceCN: false` for both the RA & CA
config. Switching this to `true` is blocked on CABF work that will
likely take considerable time.
In the short-term we should use `doNotForceCN: false` in `test/config`
and only use `doNotForceCN: true` in `test/config-next`.
Prior to this commit we had two implementations of ACME challenge
servers for use in tests:
1) test/dns-test-srv - a small fake DNS server used for adding/removing
DNS-01 TXT records and returning fake A/AAAA data.
2) test/load-generator/challenge-servers.go - a small library for
providing an HTTP-01 challenge server.
This commit consolidates both into a dedicated `test/challsrv` package.
The `load-generator` code is updated to use this library package to
implement its HTTP-01 challenge server. This leaves the `load-generator`
as a nice standalone tool that doesn't need coordination between itself
and a separate `challsrv` binary.
To keep the `dns-test-srv` use case of a nice standalone binary that can
be run from `test/startservers.py`, the `test/challsrv` package has
a `test/challsrv/cmd/challsrv` package that provides the `challsrv`
command. This is a standalone binary that can offer both an HTTP-01 and
a DNS-01 challenge server along with a management HTTP interface that
can be used by external programs to add/remove HTTP-01 and DNS-01
challenges.
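As a rough illustration of how an external program might drive the management interface - the port, path, and payload shape here are assumptions, not `challsrv`'s documented API:
```go
package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical request adding a DNS-01 TXT record before validation.
	body := `{"host": "_acme-challenge.example.com.", "value": "tokenDigest"}`
	resp, err := http.Post("http://localhost:8055/set-txt", "application/json",
		strings.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	resp.Body.Close()
}
```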
The Boulder integration tests are updated to use `challsrv` instead of
`dns-test-srv`. Presently only the DNS-01 challenge server of `challsrv`
is used by the integration tests.
TODO: The DNS-01 challenge server is doing a fair number of non-DNS-01
challenge things (fake host data, etc.). This should be cleaned up and
made configurable.
Updates #3652
Merging #3693 broke Certbot's integration tests. I'm working with them on a PR that will make the tests work against both a published and unpublished Boulder container. But the experience made me realize a lot of clients are running Boulder for integration tests, and we shouldn't break them without warning. This republishes the necessary ports, while leaving the unnecessary (debugging) ports unpublished.
This change cleans up how `va.http01Dialer` works with respect to `core.ValidationRecord`s. Instead of using the record as both an input and an output, it now takes a set of inputs and outputs information about addresses via a channel. The validation record is then constructed in the parent scope or in the redirect function instead of in the dialer itself.
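A minimal sketch of the channel-based pattern - the helper and field names are illustrative, not the actual `va` code:
```go
package dialer

import "net"

// pickAddress stands in for the VA's address-selection logic.
func pickAddress(host string) (net.IP, error) {
	ips, err := net.LookupIP(host)
	if err != nil {
		return nil, err
	}
	return ips[0], nil
}

// makeDialer returns a dialer that reports the address it used on a
// channel (which should be buffered), so the caller can construct the
// validation record itself instead of the dialer mutating it.
func makeDialer(addrUsed chan<- net.IP) func(network, addr string) (net.Conn, error) {
	return func(network, addr string) (net.Conn, error) {
		host, port, err := net.SplitHostPort(addr)
		if err != nil {
			return nil, err
		}
		ip, err := pickAddress(host)
		if err != nil {
			return nil, err
		}
		addrUsed <- ip
		return net.Dial(network, net.JoinHostPort(ip.String(), port))
	}
}
```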
Fixes #2730, fixes #3109, and fixes #3663.
Updates `golang.org/x/net` to master (d11bb6cd).
```
$ go test ./...
ok golang.org/x/net/bpf (cached)
ok golang.org/x/net/context (cached)
ok golang.org/x/net/context/ctxhttp (cached)
? golang.org/x/net/dict [no test files]
ok golang.org/x/net/dns/dnsmessage (cached)
ok golang.org/x/net/html (cached)
ok golang.org/x/net/html/atom (cached)
ok golang.org/x/net/html/charset (cached)
ok golang.org/x/net/http/httpguts (cached)
ok golang.org/x/net/http/httpproxy (cached)
ok golang.org/x/net/http2 (cached)
? golang.org/x/net/http2/h2i [no test files]
ok golang.org/x/net/http2/hpack (cached)
ok golang.org/x/net/icmp 0.199s
ok golang.org/x/net/idna (cached)
? golang.org/x/net/internal/iana [no test files]
? golang.org/x/net/internal/nettest [no test files]
ok golang.org/x/net/internal/socket (cached)
ok golang.org/x/net/internal/socks (cached)
ok golang.org/x/net/internal/sockstest (cached)
ok golang.org/x/net/internal/timeseries (cached)
ok golang.org/x/net/ipv4 (cached)
ok golang.org/x/net/ipv6 (cached)
ok golang.org/x/net/nettest (cached)
ok golang.org/x/net/netutil (cached)
ok golang.org/x/net/proxy (cached)
ok golang.org/x/net/publicsuffix (cached)
ok golang.org/x/net/trace (cached)
ok golang.org/x/net/webdav (cached)
ok golang.org/x/net/webdav/internal/xml (cached)
ok golang.org/x/net/websocket (cached)
ok golang.org/x/net/xsrftoken (cached)
```
Fixes #3692.
Originally we published these ports mainly to make it easier to access Boulder
from the host when running things in a container locally. However, with the
recent changes to the Docker network, Boulder is now at a fixed, predictable IP
address on the Docker network that can be reached from the host, so we no longer
need these published ports.
This is a nice change because people running a Boulder test environment may not
want to open up ports on their public interfaces.
We're currently stuck on gRPC v1.1 because of a breaking change to certificate validation in gRPC 1.8. Our gRPC balancer uses a static list of multiple hostnames, and expects to validate against those hostnames. However gRPC expects that a service is one hostname, with multiple IP addresses, and validates all those IP addresses against the same hostname. See grpc/grpc-go#2012.
If we follow gRPC's assumptions, we can rip out our custom Balancer and custom TransportCredentials, and will probably have a lower-friction time in general.
This PR is the first step in doing so. In order to satisfy the "multiple IPs, one port" property of gRPC backends in our Docker container infrastructure, we switch to Docker's user-defined networking. This allows us to give the Boulder container multiple IP addresses on different local networks, and gives it different DNS aliases in each network.
In startservers.py, each shard of a service listens on a different DNS alias for that service, and therefore a different IP address. The listening port for each shard of a service is now identical.
This change also updates the gRPC service certificates. Now, each certificate that is used in a gRPC service (as opposed to something that is "only" a client) has three names. For instance, sa1.boulder, sa2.boulder, and sa.boulder (the generic service name). For now, we are validating against the specific hostnames. When we update our gRPC dependency, we will begin validating against the generic service name.
Incidentally, the DNS aliases feature of Docker allows us to get rid of some hackery in entrypoint.sh that inserted entries into /etc/hosts.
Note: Boulder now has a dependency on the DNS aliases feature in Docker. By default, docker-compose run creates a temporary container and doesn't assign any aliases to it. We now need to specify docker-compose run --use-aliases to get the correct behavior. Without --use-aliases, Boulder won't be able to resolve the hostnames it wants to bind to.
This PR adds Go 1.10.2 to the build matrix along with Go 1.10.1.
After staging/prod have been updated to Go 1.10.2 we can remove Go 1.10.1.
Resolves #3680
Splits out the old `Errors` slice into a public `Error` string and an `InternalErrors` slice. Also removes a number of occurrences of calling `logEvent.AddError` and then immediately calling `wfe.sendError` with either the same internal error (which caused the same error to be logged twice) or no error (which is slightly redundant, since `wfe.sendError` calls `logEvent.AddError` internally).
Fixes #3664.
In #3651 we introduced a new parameter to sa.AddCertificate to allow specifying the Issued date. If nil, we defaulted to the current time to maintain deployability guidelines.
Now that this has been deployed everywhere, this PR updates SA.AddCertificate and the gRPC wrappers such that a nil issued argument is rejected with an error.
Unit tests that were previously using nil for the issued time are updated to explicitly set the issued time to the fake clock's now().
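A minimal sketch of the new rejection - the method signature here is illustrative, not the exact SA interface:
```go
package sa

import (
	"context"
	"errors"
	"time"
)

type SQLStorageAuthority struct{ /* ... */ }

// AddCertificate rejects a nil issued time now that all callers have
// been updated to pass one explicitly; previously nil defaulted to the
// current time for deployability.
func (ssa *SQLStorageAuthority) AddCertificate(ctx context.Context, der []byte, regID int64, issued *time.Time) (string, error) {
	if issued == nil {
		return "", errors.New("AddCertificate: issued time must not be nil")
	}
	// ... store the certificate using *issued ...
	return "", nil
}
```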
Resolves #3657
Now that #3638 has been deployed to all of the RA instances there are no
more RPC clients using the SA's `CountCertificatesRange` RPC.
This commit deletes the implementation, the RPC definition & wrappers,
and all the test code/mocks.
load generator: send correct ACMEv2 Content-Type on POST.
This PR updates the Boulder load-generator to send the correct ACMEv2 Content-Type header when POSTing to the ACME server. This is required for ACMEv2, and without it all POST requests to a WFE2 running a test/config-next configuration result in 400 "malformed" errors. While only required for ACMEv2, this commit sends it for ACMEv1 requests as well. No harm, no foul.
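A minimal sketch of the fix - the request construction is illustrative, but `application/jose+json` is the Content-Type ACMEv2 expects:
```go
package loadgen

import (
	"bytes"
	"net/http"
)

// postJWS sends a signed ACME request body with the Content-Type that
// ACMEv2 requires; without it the WFE2 responds 400 "malformed".
func postJWS(client *http.Client, url string, jwsBody []byte) (*http.Response, error) {
	req, err := http.NewRequest("POST", url, bytes.NewReader(jwsBody))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/jose+json")
	return client.Do(req)
}
```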
integration tests: allow running just the load generator.
Prior to this PR an omission in an if statement in integration-test.py meant that you couldn't invoke test/integration-test.py with just the --load argument to only run the load generator. This commit updates the if to allow this use case.
This PR updates the Boulder gRPC clientInterceptor to update a Prometheus gauge stat for each in-flight RPC it dispatches, sliced by service and method.
A unit test is included that uses a custom ChillerServer, letting the test block a bunch of RPCs, check that the in-flight gauge value has increased, unblock the RPCs, and recheck that the in-flight gauge has decreased. To check the gauge value for a specific set of labels, a new test-tools.go function GaugeValueWithLabels is added.
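A minimal sketch of the interceptor side, assuming a `prometheus.GaugeVec` registered elsewhere; names are illustrative:
```go
package grpcmetrics

import (
	"context"
	"strings"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc"
)

var inFlightRPCs = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{Name: "grpc_in_flight", Help: "Number of in-flight RPCs"},
	[]string{"service", "method"},
)

// interceptor increments the gauge for the RPC's service/method labels
// before dispatch and decrements it when the RPC completes.
func interceptor(ctx context.Context, fullMethod string, req, reply interface{},
	cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
	// fullMethod looks like "/package.Service/Method".
	parts := strings.SplitN(strings.TrimPrefix(fullMethod, "/"), "/", 2)
	method := ""
	if len(parts) == 2 {
		method = parts[1]
	}
	labels := prometheus.Labels{"service": parts[0], "method": method}
	inFlightRPCs.With(labels).Inc()
	defer inFlightRPCs.With(labels).Dec()
	return invoker(ctx, fullMethod, req, reply, cc, opts...)
}
```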
Updates #3635
Prior to Go 1.9 (https://golang.org/doc/go1.9),
various go commands would expand "./..." to include vendor directories.
We worked around this by listing "./..." then grepping out vendor. Now
that we are on Go 1.10 this is no longer necessary. Remove the TESTPATHS
hack.
We still need to exclude certain test directories when running errcheck,
so some of the "go list" logic gets moved into the errcheck stanza.
Also, as of Go 1.10, running coverage on multiple packages in one run is
supported, so replace the "for" loop in the coverage stanza with a
single command.
Also, remove GITHUB_SECRET_FILE and "die," both of which were unused.
We may see RPCs that are dispatched by a client but do not arrive at the server for some time afterwards. To have insight into potential request latency at this layer we want to publish the time delta between when a client sent an RPC and when the server received it.
This PR updates the gRPC client interceptor to add the current time to the gRPC request metadata context when it dispatches an RPC. The server side interceptor is updated to pull the client request time out of the gRPC request metadata. Using this timestamp it can calculate the latency and publish it as an observation on a Prometheus histogram.
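A minimal sketch of both halves - the metadata key is of our own choosing, and the plumbing around it is illustrative:
```go
package grpcmetrics

import (
	"context"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"google.golang.org/grpc/metadata"
)

const requestTimeKey = "client-request-time" // assumed metadata key

// stampRequestTime is called by the client interceptor just before the
// RPC is dispatched.
func stampRequestTime(ctx context.Context, now time.Time) context.Context {
	return metadata.AppendToOutgoingContext(ctx, requestTimeKey,
		strconv.FormatInt(now.UnixNano(), 10))
}

// observeLatency is called by the server interceptor when the RPC
// arrives; it publishes the client-to-server delta on a histogram.
func observeLatency(ctx context.Context, now time.Time, hist prometheus.Histogram) {
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok {
		return
	}
	vals := md[requestTimeKey]
	if len(vals) == 0 {
		return
	}
	ns, err := strconv.ParseInt(vals[0], 10, 64)
	if err != nil {
		return
	}
	hist.Observe(now.Sub(time.Unix(0, ns)).Seconds())
}
```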
Accomplishing the above required wiring a clock through to each of the client interceptors. This caused a small diff across each of the gRPC aware boulder commands.
A small unit test is included in this PR that checks that a latency stat is published to the histogram after an RPC to a test ChillerServer is made. It's difficult to do more in-depth testing because using fake clocks makes the latency 0 and using real clocks requires finding a way to queue/delay requests inside of the gRPC mechanisms not exposed to Boulder.
Updates https://github.com/letsencrypt/boulder/issues/3635 - Still TODO: Explicitly logging latency in the VA, tracking outstanding RPCs as a gauge.
This field was set to singleDialTimeout, but the net/http library treats
it as covering all of dialing, writing headers, and reading headers and body.
Since http01Dialer also uses singleDialTimeout, there's a race between
http01Dialer and net/http to see who will time out first. The result is
that sometimes we give "Timeout after connect" when the error really
should be "Timeout during connect." This issue also inhibits IPv6 to
IPv4 fallback, and tickles a data race that was causing a rare panic in
VA: https://github.com/letsencrypt/boulder/issues/3109.
After this change, the overall HTTP request will get the full deadline
allowed by the RPC context. The dialer will continue to use
singleDialTimeout for each of its two possible dial attempts.
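A minimal sketch of the resulting arrangement - names and the timeout value are illustrative:
```go
package va

import (
	"context"
	"net"
	"net/http"
	"time"
)

const singleDialTimeout = 10 * time.Second // illustrative value

// fetch lets the RPC context's deadline bound the whole HTTP request,
// while each dial attempt (IPv6 first, then IPv4 fallback) keeps its
// own shorter per-attempt timeout.
func fetch(ctx context.Context, url string) (*http.Response, error) {
	transport := &http.Transport{
		DialContext: (&net.Dialer{Timeout: singleDialTimeout}).DialContext,
	}
	// Note: no Client.Timeout - that field covers dial, header write,
	// and header/body read, and would race the dialer's own timeout.
	client := &http.Client{Transport: transport}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return nil, err
	}
	return client.Do(req.WithContext(ctx))
}
```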