We intentionally use a SLEEP in a SQL query to trigger timeout behavior.
This caused integration test failures locally (where unit tests are run
in the same session as the integration tests).
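For reference, a minimal sketch of the technique, assuming a reachable MariaDB/MySQL test instance; the DSN, timeout, and sleep duration are illustrative:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql" // MySQL driver used by Boulder
)

func main() {
	// DSN is illustrative; point it at any reachable test database.
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:3306)/boulder_sa_test")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Give the query less time than the SLEEP needs, so the driver
	// reliably surfaces a timeout error.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	var ignored int
	err = db.QueryRowContext(ctx, "SELECT SLEEP(5)").Scan(&ignored)
	fmt.Println(err) // expect a context deadline exceeded error
}
```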
Previously we inserted data for tracking issued names into the `issuedNames` table
during `sa.AddCertificate`. A more robust solution is to do this during
`sa.AddPrecertificate`, since that is when we've truly committed to
issuing for the names.
The new SA `WriteIssuedNamesPrecert` feature flag enables writing this table
during `AddPrecertificate`. The legacy behaviour continues with the flag
enabled or disabled but is updated to tolerate duplicate INSERT errors so that
it is possible to deploy this change across multiple SA instances safely.
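A minimal sketch of what "tolerate duplicate INSERT errors" can look like; the helper name and column list are assumptions (the real code stores reversed names, elided here), and MariaDB/MySQL error 1062 is the "Duplicate entry" code:

```go
package sa

import (
	"database/sql"
	"errors"

	"github.com/go-sql-driver/mysql"
)

// isDuplicateInsert is a hypothetical helper: it reports whether err is
// MariaDB/MySQL error 1062 ("Duplicate entry ... for key ...").
func isDuplicateInsert(err error) bool {
	var mysqlErr *mysql.MySQLError
	return errors.As(err, &mysqlErr) && mysqlErr.Number == 1062
}

// addIssuedNames writes one row per name and ignores duplicates so that,
// mid-deploy, an SA instance that already wrote the row via the legacy
// path does not cause this INSERT to fail.
func addIssuedNames(tx *sql.Tx, serial string, names []string) error {
	for _, name := range names {
		_, err := tx.Exec(
			"INSERT INTO issuedNames (reversedName, serial, notBefore) VALUES (?, ?, NOW())",
			name, serial)
		if err != nil && !isDuplicateInsert(err) {
			return err
		}
	}
	return nil
}
```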
Along the way I also updated `SA.AddPrecertificate` to perform its two
`INSERT`s in a transaction using the `db.WithTransaction` wrapper.
Resolves https://github.com/letsencrypt/boulder/issues/4565
In the process I tweaked a few variable names in GetAuthorizations2 to
refer to just "authz" instead of "authz2" because it made things
clearer, particularly in the case of authz2IDMap, which is a map of
whether a given ID exists, not a map from authzs to IDs.
Fixes #4564
In the deep dark history of Boulder we ended up jamming contacts into
a VARCHAR db field. We need to make sure that when contacts are
marshaled the resulting bytes will fit into the column or a 500 will
be returned to the user when the SA RPC fails.
One day we should fix this properly and not return a hacky error message
that's hard for users to understand. Unfortunately that will likely
require a migration or a new DB table. In the shorter term this hack
will prevent 500s, which is a clear improvement.
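A minimal sketch of the guard, assuming a hypothetical `maxContactBytes` limit standing in for the real column width:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// maxContactBytes approximates the capacity of the legacy VARCHAR column;
// the real limit lives in the registrations table schema.
const maxContactBytes = 191

// checkContactsFit sketches the check: marshal the contacts exactly as
// the SA will, and reject the request up front if the bytes won't fit,
// instead of letting the INSERT fail and surface as a 500.
func checkContactsFit(contacts []string) error {
	b, err := json.Marshal(contacts)
	if err != nil {
		return err
	}
	if len(b) > maxContactBytes {
		return fmt.Errorf("too many/too long contact fields provided")
	}
	return nil
}

func main() {
	fmt.Println(checkContactsFit([]string{"mailto:admin@example.com"}))
}
```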
Previously we weren't checking the domain portion of an email contact address
very strictly in the RA. This updates the PA to export a function that
can be used to validate the domain the same way we validate the domain
portions of DNS-type identifiers for issuance.
This also changes the RA to use the `invalidEmail` error type in more
places.
A new Go integration test is added that checks these errors end-to-end
for both account creation and account update.
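A rough sketch of the shape of the check; `validDomain` here is a trivial stand-in for the function the PA exports, and the error wording is illustrative:

```go
package main

import (
	"fmt"
	"net/mail"
	"strings"
)

// validDomain stands in for the exported PA function; the real
// implementation applies the same checks used for DNS identifiers.
func validDomain(domain string) error {
	if !strings.Contains(domain, ".") {
		return fmt.Errorf("contact email domain needs at least one dot")
	}
	return nil
}

// validEmail sketches the RA-side check: parse the address, then
// validate its domain portion exactly as we would a DNS identifier.
func validEmail(address string) error {
	parsed, err := mail.ParseAddress(address)
	if err != nil {
		return fmt.Errorf("%q is not a valid e-mail address", address)
	}
	at := strings.LastIndex(parsed.Address, "@")
	if err := validDomain(parsed.Address[at+1:]); err != nil {
		return fmt.Errorf("contact email %q has invalid domain: %s", address, err)
	}
	return nil
}

func main() {
	fmt.Println(validEmail("admin@localhost"))
}
```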
We need the RA's `NewOrder` RPC to return a `berrors.Malformed` instance
when there are too many identifiers. A bare error will be turned into
a server internal problem by the WFE2's `web.ProblemDetailsForError`
call, while a `berrors.Malformed` will produce the expected malformed
problem.
This commit fixes the error, updates the unit test, and adds an end-to-end
integration test so we don't mess this up again.
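A minimal sketch of the fix; the helper and the limit are hypothetical, with `berrors.MalformedError` as the typed-error constructor:

```go
package ra

import (
	berrors "github.com/letsencrypt/boulder/errors"
)

const maxNames = 100 // illustrative; the real limit comes from RA config

// checkIdentifierCount sketches the fix: return a typed berrors.Malformed
// error so web.ProblemDetailsForError maps it to a malformed problem
// instead of a 500.
func checkIdentifierCount(names []string) error {
	if len(names) > maxNames {
		// Before the fix this was a bare error, which the WFE2
		// surfaced as a serverInternal problem.
		return berrors.MalformedError("Order cannot contain more than %d DNS names", maxNames)
	}
	return nil
}
```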
This updates the `github.com/eggsampler/acme` dependency used in our Go-based
integration tests to v3. Notably this fixes a data race we encountered in CI.
With the data race fixed this branch can also revert
54a798b7f6 and resolve
https://github.com/letsencrypt/boulder/issues/4542
I ran `go mod tidy` to clean up the old `v2` copy of the dependency; it also
removed a few stale cfssl/mysql items from the `go.mod`.
Upstream library's tests are confirmed to pass:
```
~/go/src/github.com/eggsampler/acme$ git log --pretty=format:'%h' -n 1
b581dc6
~/go/src/github.com/eggsampler/acme$ make pebble
mkdir -p /home/daniel/go/src/github.com/letsencrypt/pebble
git clone --depth 1 https://github.com/letsencrypt/pebble.git /home/daniel/go/src/github.com/letsencrypt/pebble \
|| (cd /home/daniel/go/src/github.com/letsencrypt/pebble; git checkout -f master && git reset --hard HEAD && git pull -q)
fatal: destination path '/home/daniel/go/src/github.com/letsencrypt/pebble' already exists and is not an empty directory.
Already on 'master'
Your branch is up-to-date with 'le/master'.
HEAD is now at 6c2d514 wfe: compare Identifier.Type with acme.IndentifierIP (#287)
docker-compose -f /home/daniel/go/src/github.com/letsencrypt/pebble/docker-compose.yml up -d
Creating network "pebble_acmenet" with driver "bridge"
Creating pebble_challtestsrv_1 ... done
Creating pebble_pebble_1 ... done
while ! wget --delete-after -q --no-check-certificate "https://localhost:14000/dir" ; do sleep 1 ; done
go clean -testcache
go test -race -coverprofile=coverage_18.txt -covermode=atomic github.com/eggsampler/acme/v3
ok github.com/eggsampler/acme/v3 24.292s coverage: 83.0% of statements
docker-compose -f /home/daniel/go/src/github.com/letsencrypt/pebble/docker-compose.yml down
Stopping pebble_pebble_1 ... done
Stopping pebble_challtestsrv_1 ... done
Removing pebble_pebble_1 ... done
Removing pebble_challtestsrv_1 ... done
Removing network pebble_acmenet
```
When a nag group hits capacity we set the nagsAtCapacity gauge to 1.
This gauge also needs to be reset to 0 when the nag group is no longer
at capacity.
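A minimal sketch of the gauge handling, with approximated metric and label names:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
)

// nagsAtCapacity is labelled per nag group; names here approximate the
// expiration-mailer's real metric.
var nagsAtCapacity = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nags_at_capacity",
		Help: "Whether a nag group is at its certificate capacity.",
	},
	[]string{"nag_group"},
)

func updateCapacityGauge(group string, atCapacity bool) {
	if atCapacity {
		nagsAtCapacity.With(prometheus.Labels{"nag_group": group}).Set(1)
	} else {
		// The fix: clear the gauge once the group drains; otherwise it
		// reads 1 forever after a single saturated run.
		nagsAtCapacity.With(prometheus.Labels{"nag_group": group}).Set(0)
	}
}

func main() {
	prometheus.MustRegister(nagsAtCapacity)
	updateCapacityGauge("24h", true)
	updateCapacityGauge("24h", false)
}
```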
Updates `github.com/weppos/publicsuffix-go` to 3dd5f42, and
`github.com/zmap/zlint` to eea5fe8. Both hashes are the tip of master at
the time of writing.
Unit tests are confirmed to pass:
```
~/go/src/github.com/weppos/publicsuffix-go$ git log --pretty=format:'%h' -n 1
3dd5f42
~/go/src/github.com/weppos/publicsuffix-go$ go test ./...
? github.com/weppos/publicsuffix-go/cmd/load [no test files]
ok github.com/weppos/publicsuffix-go/net/publicsuffix 0.008s
ok github.com/weppos/publicsuffix-go/publicsuffix 0.005s
? github.com/weppos/publicsuffix-go/publicsuffix/generator [no test files]
~/go/src/github.com/zmap/zlint$ git log --pretty=format:'%h' -n 1
eea5fe8
~/go/src/github.com/zmap/zlint$ go test ./...
ok github.com/zmap/zlint 0.240s
? github.com/zmap/zlint/cmd/zlint [no test files]
? github.com/zmap/zlint/cmd/zlint-gtld-update [no test files]
ok github.com/zmap/zlint/lints 0.156s
ok github.com/zmap/zlint/util 0.020s
```
I couldn't get this to work cleanly with
`--log-queries-not-using-indexes` because a couple of queries show up
during integration test runs, seemingly because the tables involved are
small enough that the optimizer finds it faster to skip the index.
Some possible followups:
- Allowlist those queries, or
- Preload the DB with a certain number of certificates before the start
of testing.
This is a small clean-up I spotted while migrating the `WithTransaction` wrapper
out of the `sa` package into `db` during #4544.
The `admin-revoker` utility was using bare transactions with the `db.Rollback`
(previously `sa.Rollback`) helper function instead of the newly exported
`db.WithTransaction` wrapper. The latter is safer, so we should use it here too.
After this change all of the external consumers of the `Rollback` function have
been switched to `WithTransaction`, so we can unexport `Rollback`.
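For context, a sketch of the contract the wrapper provides, written against plain `database/sql`; the real `db.WithTransaction` has a different signature:

```go
package db

import (
	"context"
	"database/sql"
)

// withTransaction sketches what the wrapper guarantees: the callback runs
// inside a transaction that is rolled back on error or panic and committed
// otherwise, so callers can't forget a Rollback on an early return.
func withTransaction(ctx context.Context, db *sql.DB, f func(tx *sql.Tx) error) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer func() {
		if p := recover(); p != nil {
			_ = tx.Rollback()
			panic(p)
		}
	}()
	if err := f(tx); err != nil {
		_ = tx.Rollback()
		return err
	}
	return tx.Commit()
}
```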
The `orphans` Prometheus `CounterVec` is used to count orphans that
couldn't be confirmed saved by the SA and were queued by the CA.
The `adopted_orphans` `CounterVec` is used to count orphans pulled from
the queue by the CA and successfully integrated through to the SA.
Both counter stats are labelled by "type", e.g. "precert" or "cert".
Based on the volume of data Boulder supports we use `BIGINT(20)` for
database ID fields throughout all of our tables except for two that were
missed: `fqdnSets` and `issuedNames`. Prior to this migration both were
using `INT(11)`, allowing only values up to 2,147,483,647. After the
migration is applied the `BIGINT(20)` type allows values up to 2^63-1.
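The migration boils down to widening the two ID columns; a sketch of the statements, run here from Go for illustration (the DSN and exact column definitions are assumptions, and the real migration file is authoritative):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// The statements the migration boils down to.
var widenIDs = []string{
	"ALTER TABLE fqdnSets MODIFY id BIGINT(20) NOT NULL AUTO_INCREMENT",
	"ALTER TABLE issuedNames MODIFY id BIGINT(20) NOT NULL AUTO_INCREMENT",
}

func main() {
	db, err := sql.Open("mysql", "root@tcp(127.0.0.1:3306)/boulder_sa_test")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	for _, stmt := range widenIDs {
		if _, err := db.Exec(stmt); err != nil {
			log.Fatal(err)
		}
	}
}
```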
This avoids needing to send the entire certificate in OCSP generation
RPCs.
Ended up including a few cleanups that made the implementation easier.
Initially I was struggling with how to derive the issuer identification info.
The options I considered:
- Stick the full SPKI hash in certificateStatus, but that takes a significant
  amount of space.
- Configure unique issuer IDs in the CA config, but that would require being
  very careful about keeping the IDs constant and never reusing an ID.
- Store issuers in a lookup table in the database, but that requires figuring
  out how to get the issuer info into the table, etc.
Instead I've gone with what I found to be the easiest solution: deriving a
stable ID from the cert hash. This means we don't need to remember to
configure anything special and the CA config stays the same as it is now.
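A sketch of the idea, assuming an 8-byte SHA-256 prefix; the actual width, endianness, and sign handling in Boulder may differ:

```go
package main

import (
	"crypto/sha256"
	"crypto/x509"
	"encoding/binary"
	"fmt"
)

// idForIssuer sketches the "stable ID from the cert hash" approach: hash
// the issuer certificate's DER and keep a fixed-width prefix as an integer.
func idForIssuer(cert *x509.Certificate) int64 {
	h := sha256.Sum256(cert.Raw)
	// Mask the top bit so the result is non-negative and fits a signed
	// database column.
	return int64(binary.BigEndian.Uint64(h[:8]) & 0x7fffffffffffffff)
}

func main() {
	// With a real issuer certificate loaded, the ID is stable across
	// processes and configs because it depends only on the cert bytes.
	fmt.Println("derive the ID from cert.Raw; no config required")
}
```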
Fixes #4469.
We've found we need the context offered by logging the error closer to where it
happens, in the `bdns` package rather than in the `va`. Adopting the function
requires adapting it slightly. Specifically, in the new location we know it won't
be called with any timeout results, with a non-DNS error, or with a nil
underlying error.
Having the logging done in `bdns` (and specifically from `exchangeOne`) also
lets us log the wire format of the query and response when we get a `dns.ErrId`
error indicating a query/response ID mismatch. A small unit test is included
that ensures the logging happens as expected.
In case it proves useful for matching against other metrics, the DNS ID mismatch
error case also now increments a dedicated Prometheus counter vector stat,
`dns_id_mismatch`. The stat is labelled by resolver and query type.
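A sketch of the `exchangeOne`-side handling; the function shape and logger are illustrative, while `dns.ErrId` and the stat name come from the change itself:

```go
package bdns

import (
	"encoding/base64"

	"github.com/miekg/dns"
	"github.com/prometheus/client_golang/prometheus"
)

var idMismatchCounter = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "dns_id_mismatch",
		Help: "Counter of DNS query/response ID mismatches.",
	},
	[]string{"resolver", "qtype"},
)

// logIDMismatch sketches the new behaviour: on a query/response ID
// mismatch, log the wire format of both messages and bump the stat.
func logIDMismatch(logf func(string, ...interface{}), resolver string, qtype uint16, query, resp *dns.Msg, err error) {
	if err != dns.ErrId {
		return
	}
	queryWire, _ := query.Pack()
	respWire, _ := resp.Pack()
	logf("DNS ID mismatch: query=%s response=%s",
		base64.StdEncoding.EncodeToString(queryWire),
		base64.StdEncoding.EncodeToString(respWire))
	idMismatchCounter.With(prometheus.Labels{
		"resolver": resolver,
		"qtype":    dns.TypeToString[qtype],
	}).Inc()
}
```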
Resolves https://github.com/letsencrypt/boulder/issues/4532
The most recent tagged release of mysql is v1.4.1, from a year ago. It
also happens to pull in an unwanted dependency (appengine) that the
latest commit does not.
Tests pass:
```
$ go test -count=1 github.com/go-sql-driver/mysql
ok github.com/go-sql-driver/mysql 0.068s
```
Fixes #4530
The `boulder-janitor` is extended to cleanup rows from the `orders` table that
have expired beyond the configured grace period, and the associated referencing
rows in `requestedNames`, `orderFqdnSets`, and `orderToAuthz2`.
To make implementing the transaction work for the deletions easier and
consistent, I lifted the SA's `WithTransaction` code and associated functions
to a new shared `db` package. This also let me drop the one-off `janitorDb`
interface from the existing code.
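A sketch of the per-order cleanup the janitor runs inside a transaction; the table and column names mirror the PR, while batching and grace-period checks are elided:

```go
package janitor

import (
	"database/sql"
)

// deleteOrder sketches the cascading cleanup: inside one transaction,
// remove the rows that reference the order before the order row itself.
func deleteOrder(tx *sql.Tx, orderID int64) error {
	for _, q := range []string{
		"DELETE FROM requestedNames WHERE orderID = ?",
		"DELETE FROM orderFqdnSets WHERE orderID = ?",
		"DELETE FROM orderToAuthz2 WHERE orderID = ?",
		"DELETE FROM orders WHERE id = ?",
	} {
		if _, err := tx.Exec(q, orderID); err != nil {
			return err
		}
	}
	return nil
}
```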
There is an associated change to the `GRANT` statements for the `janitor` DB
user to allow it to find/delete the rows related to orders.
Resolves https://github.com/letsencrypt/boulder/issues/4527
This comment was there mainly to indicate that you should get protocol
changes made in ACME before implementing them in Boulder. Since the
protocol is done, this is no longer an issue. In practice we don't often
see people proposing Boulder changes that are incompatible with the
spec, so I don't think we need this line anymore.
Fixes #4541
This brings OCSP signing into alignment with the other components of the
CA in that they use ca.clk, which can be mocked out in unit tests.
This tweaks test_ocsp_exp_unauth to be compatible with the change.
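A minimal sketch of why the mockable clock helps, using the `github.com/jmhodges/clock` fake that Boulder's tests use; the test body is illustrative:

```go
package ca

import (
	"testing"
	"time"

	"github.com/jmhodges/clock"
)

// With ca.clk mockable, tests can pin "now" and assert on the time an
// OCSP response would be stamped with, no sleeps involved.
func TestOCSPSigningTime(t *testing.T) {
	fc := clock.NewFake()
	want := time.Date(2019, 12, 1, 0, 0, 0, 0, time.UTC)
	fc.Set(want)

	// A hypothetical CA under test would read its clock from fc.
	thisUpdate := fc.Now()
	if !thisUpdate.Equal(want) {
		t.Errorf("got thisUpdate %s, want the mocked time %s", thisUpdate, want)
	}
}
```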
Fixes #4441.
Since 6f71c0c switched the Go integration tests to run in parallel, the
`TestPrecertificateOCSP` test has been flaky. To fix the flake, the test
needs to tolerate ct-test-srv returning precertificates other than the
one it expects, since other tests are also using the server concurrently.
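A sketch of the de-flaking approach, scanning every submission rather than assuming ours is the only one; the byte-matching is a crude stand-in for real DER parsing:

```go
package integration

import "bytes"

// findOurPrecert tolerates concurrent submissions: instead of assuming
// the only precertificate ct-test-srv holds is ours, scan every
// submission for the serial we issued.
func findOurPrecert(submissions [][]byte, wantSerial []byte) []byte {
	for _, der := range submissions {
		if bytes.Contains(der, wantSerial) {
			return der
		}
	}
	return nil
}
```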
Newer Go versions seem to give a different pseudoversion for this
dependency at the same commit than when we initially switched to Go
modules for Boulder. Fixing the pseudoversion now so it won't trip up
future updates unexpectedly.
Previously, we referred to "DER encoded PKIX public keys", but PKIX (RFC 5280)
doesn't define a standalone "public key" type. Instead, it defines
SubjectPublicKeyInfo, containing an algorithm and a BIT STRING. As a
result, SPKI and SPKI hash are more commonly used terms, and we're more
likely to get reports based on those. We should mirror that terminology
in our documentation.
When we get a DNS error that has an internal cause (like connection
refused), we return a generic message like "networking error" to the
user to avoid revealing details that would be confusing. However, when
debugging problems with our own services, it's useful to have the
underlying errors.
This adds a helper method in the VA and calls it from each place we use
DNS errors.
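A sketch of the helper's shape; the logger, error classification, and messages are illustrative:

```go
package va

import (
	"fmt"
	"net"
)

// wrapDNSError sketches the helper: keep the generic, user-safe message
// on the error we return, but log the underlying cause for our own
// debugging.
func wrapDNSError(logf func(string, ...interface{}), hostname string, underlying error) error {
	logf("DNS problem for %s: %s", hostname, underlying)
	if _, ok := underlying.(*net.OpError); ok {
		return fmt.Errorf("networking error looking up %s", hostname)
	}
	return fmt.Errorf("DNS problem looking up %s", hostname)
}
```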
LE is popular and aims to popularise certificate issuance, so end users who see
error messages cannot be assumed to be as DNS-experienced as before. The
user-facing error messages in the policy authority file are terse and unobvious
to the point that they are often unlikely to be well understood by those they
are intended to inform, who may be "just trying to get an LE cert for their
domain".
Previously we used a JOIN on the orderToAuthz2 table in order to make sure
we only returned authorizations created using the ACME v2 API. Each time an
order is created, a pivot row (order ID + authz ID) is added to the
orderToAuthz2 table. If a large number of orders are created that all contain
the same authorization, due to reuse, then the JOINed query would return a full
authorization row for each entry in the orderToAuthz2 table with that
authorization ID.
Instead we now filter out these authorizations by doing a second query against
the orderToAuthz2 table. Using this query still requires examining a large number
of rows, but because we don't need to construct a temporary table for the JOIN
and fill it with all the full authorization rows we should save resources.
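A sketch of the second-query filtering; the query text and ID plumbing are illustrative, and the real code builds the IN clause safely from the ID list:

```go
package sa

import (
	"database/sql"
)

// filterByOrderPresence sketches the JOIN replacement: after fetching the
// candidate authorizations, ask orderToAuthz2 which of their IDs exist,
// and return that set so callers can drop the rest.
func filterByOrderPresence(db *sql.DB, query string, authzIDs []interface{}) (map[int64]bool, error) {
	// query is e.g. "SELECT authzID FROM orderToAuthz2 WHERE authzID IN (?,?,...)"
	exists := make(map[int64]bool)
	rows, err := db.Query(query, authzIDs...)
	if err != nil {
		return nil, err
	}
	defer rows.Close()
	for rows.Next() {
		var id int64
		if err := rows.Scan(&id); err != nil {
			return nil, err
		}
		exists[id] = true
	}
	return exists, rows.Err()
}
```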
Fixes #4500.
Errors that were being returned in the checkAlgorithm methods of both wfe and
wfe2 didn't really match up to what was actually being checked. This change
attempts to bring the errors in line with what is actually being tested.
Fixes #4452.
This branch also updates the WFE2 parseJWS function to match the error string fixed in the upstream project for the case where a JWS EC public key fails to unmarshal due to an incorrect length.
Resolves #4300
We now expect that the config dir is always set, so we make that
explicit in the integration test and error if that's not true.
This change also renames the variable to just "config_dir", and removes
the parameter to startservers.start, which is currently never set to
anything other than its default value.
This also explicitly sets the environment variable in .travis.yml.
This creates the correct type of backend service for the OCSP generator.
It also adds an invocation of orphan-finder during the integration
tests.
This also adds a minor safety check to SA that I hit while writing the
test. Without this safety check, passing a certificate with no DNSNames
to AddCertificate would result in an obscure MariaDB syntax error
without enough context to track it down. In normal circumstances this
shouldn't be hit, but it will be good to have a solid error message if
we hit it in tests sometime.
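A minimal sketch of the safety check; the error wording is illustrative:

```go
package sa

import (
	"crypto/x509"
	"errors"
)

// checkAddCertificateNames sketches the guard: fail fast with a readable
// error rather than letting an empty name list produce an obscure MariaDB
// syntax error from an INSERT with no VALUES rows.
func checkAddCertificateNames(cert *x509.Certificate) error {
	if len(cert.DNSNames) == 0 {
		return errors.New("certificate has no DNSNames; refusing to add")
	}
	return nil
}
```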
Also, this tweaks the .travis.yml so it explicitly sets BOULDER_CONFIG_DIR
to test/config in the default case. Because the docker-compose run
command uses `-e BOULDER_CONFIG_DIR="${BOULDER_CONFIG_DIR}"`,
we were setting a blank BOULDER_CONFIG_DIR in the default case.
Since the Python startservers script sets a default if BOULDER_CONFIG_DIR
is not set, we haven't noticed this before. But since this test case relies
on the actual environment variable, it became an issue.
Fixes #4499
Addresses two issues introduced in #4476:
* Keep setting the V2 field in modelToAuthzPB so RPCs returned from new components to old don't cause panics
* Don't return expired orders from the SA, so that users requesting old orders that contain old style authorizations don't cause breakage in the RA
Only stripping ":443" from the request host when http.Request.TLS is non-nil
breaks deployments where Boulder's WFE/WFE2 serve plain HTTP behind a separate
ingress proxy that terminates HTTPS on their behalf.
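A sketch of the fix, stripping the default port unconditionally; the function name is hypothetical:

```go
package wfe

import "strings"

// stripDefaultPort drops ":443" from the Host header whether or not the
// connection to us was TLS, because a proxy may have terminated HTTPS
// before forwarding plain HTTP to the WFE.
func stripDefaultPort(host string) string {
	return strings.TrimSuffix(host, ":443")
}
```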
We need to apply some fixes for bugs introduced in #4476 before it can be deployed; as such we need to revert #4495, since there needs to be a full deploy cycle between these two changes.
This reverts commit 3ae1ae1.
😭
OCSPGeneratorService matches the semantics better, and is what
ocsp-updater uses. It also matches what's in the config-next.
This wasn't caught by integration tests because we don't currently
run orphan-finder in the integration tests. We don't have a good way
to induce failures in the SA on demand.
This change set makes the authz2 storage format the default format. It removes
most of the functionality related to the previous storage format, except for
the SA fallbacks and old gRPC methods which have been left for a follow-up
change in order to make these changes deployable without introducing
incompatibilities.
Fixes #4454.