#8109 updated CI to use 24.04 runners; now update the Docker image to
build on 24.04 and update CI to use it.
Build fixes:
- Unpin mariadb-client-core; 10.3 is no longer provided in the 24.04 apt
repositories
- Use the new pip flag --break-system-packages to comply with PEP 668, which
is now enforced in Python 3.12+ (see the sketch after this list)
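A rough sketch of what the image build's pip invocations now look like (the
requirements file name here is illustrative):
```
# PEP 668 marks 24.04's system Python as externally managed; without this
# flag pip refuses to install into it.
pip3 install --break-system-packages -r requirements.txt
```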
Runtime fixes:
- Start rsyslogd directly due to missing symlink (see:
https://github.com/rsyslog/rsyslog/issues/5611)
- Fix a SyntaxWarning: invalid escape sequence '\w' (Python 3.12 upgraded
invalid escape sequences from a DeprecationWarning to a SyntaxWarning).
- Replace OpenSSL.crypto.load_certificate with
x509.load_pem_x509_certificate due to
d73d0ed417
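A before/after sketch of the certificate-loading change (the PEM path is
invented for illustration):
```
python3 - <<'EOF'
from cryptography import x509

with open("/tmp/cert.pem", "rb") as f:  # hypothetical path
    pem = f.read()

# Before (pyOpenSSL):
#   from OpenSSL import crypto
#   cert = crypto.load_certificate(crypto.FILETYPE_PEM, pem)
# After (the cryptography package directly):
cert = x509.load_pem_x509_certificate(pem)
print(cert.subject)
EOF
```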
Replace the bpkilint container with a new bpkimetal container. Update
our custom lint which calls out to that API to speak PKIMetal's (very
similar) protocol instead. Update our zlint custom configuration to
configure this updated lint.
Fixes https://github.com/letsencrypt/boulder/issues/8009
I occasionally receive timeouts due to pkilint being unresponsive during
local integration tests. Typically this happens after rebooting my
machine, with no containers previously running due to the reboot, and no
container data in disk/memory cache.
Example timeout:
```
16:14:40.485848 3 boulder-ca _PeZ5w0 [AUDIT] Preparing precert failed: issuer=[int rsa b] serial=[7f2ba75acba0b729fc4e1ba5e2f6aacd5921] regID=[1] names=[rand.3ce2c964.xyz] certProfileName=[defaultBoulderCertificateProfile] certProfileHash=[de4c8c8866ed46b1d4af0d79e6b7ecf2d1ea625e26adcbbd3979ececd8fbd05a] err=[tbsCertificate linting failed: failed lint(s): e_pkilint_lint_cabf_serverauth_cert (making POST request to pkilint API: Post "http://10.77.77.9/certificate/cabf-serverauth": context deadline exceeded)]
```
Add the necessary scaffolding for deep health checking of our various
gRPC components. Each component implementation that also implements the
grpc.checker interface will be checked periodically, and the health
status of the component will be updated accordingly.
Add the necessary methods to SA to implement the grpc.checker interface
and register these new health checks with Consul.
Additionally:
- Update entry point script to check for ProxySQL readiness.
- Increase the poll rate for gRPC Consul checks from 5s to 2s to help with
DNS failures on startup caused by failing checks.
- Change log level for Consul from INFO to ERROR to deal with noisy logs
full of transport failures due to Consul gRPC checks firing before the
SAs are up.
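For reference, a Consul registration with a gRPC check of the kind described
above might look roughly like this (service name, address, port, and file
name are all invented):
```
cat > sa.hcl <<'EOF'
service {
  name    = "sa"
  address = "10.77.77.77"      # hypothetical
  port    = 9095               # hypothetical
  check {
    grpc         = "10.77.77.77:9095"
    grpc_use_tls = true
    interval     = "2s"        # the faster poll rate mentioned above
  }
}
EOF
consul services register sa.hcl
```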
Fixes #6878
Part of #6795
Expanding `$@` means that if a positional parameter has an internal space,
e.g. "foo bar", it will be split into two positional parameters in the
resulting command, e.g. "foo" "bar". Expanding `"$@"` ensures that each
positional parameter expands as a single word, so we still get "foo bar" in
the exec command, which is what we want.
When wait-for-it is trying to connect and failing, bash emits errors on
stderr. This redirects those errors to /dev/null.
This also replaces an internal wait_tcp_port function inside
entrypoint.sh with a call to wait-for-it.sh.
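The call ends up looking something like this (the host, port, and exact
wait-for-it arguments are illustrative):
```
# Failed probes inside wait-for-it print bash connection errors on stderr;
# discard them so the entrypoint's output stays readable.
./test/wait-for-it.sh boulder-mysql:3306 --timeout=30 2>/dev/null
```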
This is a sort of proof of concept of the Redis interaction, which will
evolve into a tool for inspection and manual repair of missing entries,
if we find ourselves needing to do that.
The important bits here are rocsp/rocsp.go and
cmd/rocsp-tool/main.go. Also, the newly-vendored Redis client.
This ended up taking a lot more work than I expected. In order to make the implementation more robust, a bunch of stuff we previously relied on has been ripped out to reduce unnecessary complexity (I think I insisted on a bunch of this in the first place, so I'm glad I can kill it now).
In particular this change:
* Removes bhsm and pkcs11-proxy: softhsm and pkcs11-proxy don't play well together, and any softhsm manipulation would need to happen on bhsm and then require a restart of pkcs11-proxy to pull in the on-disk changes. This makes manipulating softhsm from the boulder container extremely difficult, and because of the need to initialize anew on each run (described below) we need direct access to the softhsm2 tools, since pkcs11-tool cannot do slot initialization operations over the wire. I originally argued for bhsm as a way to mimic a network-attached HSM, mainly so that we could do network-level fault testing. In reality we've never actually done this, and the extra complexity is not really realistic for a handful of reasons. It seems better to just rip it out and operate directly on a local softhsm instance (the other option would be to use pkcs11-proxy locally, but this would still require manually restarting the proxy whenever softhsm2-util was used, and wouldn't really offer any realistic benefit).
* Initializes the softhsm slots on each integration test run, rather than when creating the docker image; see the sketch after this list. (This is necessary to prevent churn in test/cert-ceremonies/generate.go, which would need to be updated to reflect the new slot IDs each time a new boulder-tools image was created, since slot IDs are randomly generated.)
* Installs softhsm from source so that we can use a more up-to-date version (2.5.0 vs. the 2.2.0 in the Debian repo)
* Generates the root and intermediate private keys in softhsm and writes out the root and intermediate public keys to /tmp for use in integration tests (the existing test-{ca,root} certs are kept in test/ because they are used in a whole bunch of unit tests. At some point these should probably be renamed/moved to be more representative of what they are used for, but that is left for a follow-up in order to keep the churn in this PR as related to the ceremony work as possible)
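The per-run slot initialization from the second bullet looks something like
this (the label and PINs are invented):
```
# --init-token assigns the token to a new, randomly chosen slot ID, which is
# why this runs at integration-test startup rather than at image build time.
softhsm2-util --init-token --free --label intermediate \
  --pin 5678 --so-pin 1234
softhsm2-util --show-slots
```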
Another follow-up item here is that we should really be zeroing out the database at the start of each integration test run, since certain things like certificates and OCSP responses will be signed by a key/issuer that is no longer in use or doesn't match the current key/issuer.
Fixes #4832.
This is a breaking API change: pkcs11key now takes as input a public key rather than
a private key label. In order to find the private key, it first finds the public key's CKA_ID
in the token, then looks for a private key with the same CKA_ID. From ftp://ftp.rsasecurity.com/pub/pkcs/pkcs-11/v2-30/pkcs-11v2-30b-d6.pdf:
> The CKA_ID field is intended to distinguish among multiple keys. In
> the case of public and private keys, this field assists in handling
> multiple keys held by the same subject; the key identifier for a
> public key and its corresponding private key should be the same.
This does require that both the public key and private key are present and have
appropriate CKA_IDs set. I've verified this is the case in prod. In our integration
testing environment it was not the case, so I've tweaked entrypoint.sh to load
public keys into SoftHSM and set their CKA_ID.
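The entrypoint.sh tweak amounts to something like this (the module path, file
name, PIN, and ID are invented):
```
# Load the public key into the token and give it the same CKA_ID as the
# corresponding private key, so pkcs11key can pair them up.
pkcs11-tool --module /usr/lib/softhsm/libsofthsm2.so --login --pin 5678 \
  --write-object test/test-ca.pubkey.der --type pubkey \
  --label intermediate_key --id 9999
```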
The initial part of this change was written by @cpu. I've reviewed and approved
those commits.
Python 2 is over in 1 month 4 days: https://pythonclock.org/
This rolls forward most of the changes in #4313.
The original change was rolled back in #4323 because it
broke `docker-compose up`. This change fixes those original issues by
(a) making sure `requests` is installed and (b) sourcing a virtualenv
containing the `requests` module before running start.py.
Other notable changes in this:
- Certbot has changed the developer instructions to install specific packages
rather than relying on `letsencrypt-auto --os-packages-only`, so we follow suit.
- Python 3 now has a `bytes` type that is used in some places that used to
provide `str`, and all `str` are now Unicode. That means going from `bytes` to
`str` and back requires explicit `.decode()` and `.encode()` (see the sketch
after this list).
- Moved from urllib2 to requests in many places.
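The bytes/str round trip from the second bullet, as a minimal illustration:
```
python3 - <<'EOF'
s = "hello"                      # str: Unicode text in Python 3
b = s.encode("utf-8")            # explicit str -> bytes
assert b == b"hello"
assert b.decode("utf-8") == s    # explicit bytes -> str
EOF
```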
We'd like to start using the DNS load balancer in the latest version of gRPC. That means putting all IPs for a service under a single hostname (or using a SRV record, but we're not taking that path). This change adds an sd-test-srv to act as our service discovery DNS service. It returns both Boulder IP addresses for any A lookup ending in ".boulder". This change also sets up the Docker DNS for our boulder container to defer to sd-test-srv when it doesn't know an answer.
sd-test-srv doesn't know how to resolve public Internet names like `github.com`. Resolving public names is required for the `godep-restore` test phase, so this change breaks out a copy of the boulder container that is used only for `godep-restore`.
This change implements a shim of a DNS resolver for gRPC, so that we can switch to DNS-based load balancing with the currently vendored gRPC, then when we upgrade to the latest gRPC we won't need a simultaneous config update.
Also, this change introduces a check at the end of the integration test that each backend received at least one RPC, ensuring that we are not sending all load to a single backend.
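For example, an A lookup against sd-test-srv should return both Boulder
addresses (the resolver address and IPs shown here are invented):
```
# Any A query for a name ending in ".boulder" gets both Boulder IPs back.
dig +short @10.77.77.10 sa.boulder A
# 10.77.77.77
# 10.77.77.78
```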
We're currently stuck on gRPC v1.1 because of a breaking change to certificate validation in gRPC 1.8. Our gRPC balancer uses a static list of multiple hostnames, and expects to validate against those hostnames. However gRPC expects that a service is one hostname, with multiple IP addresses, and validates all those IP addresses against the same hostname. See grpc/grpc-go#2012.
If we follow gRPC's assumptions, we can rip out our custom Balancer and custom TransportCredentials, and will probably have a lower-friction time in general.
This PR is the first step in doing so. In order to satisfy the "multiple IPs, one port" property of gRPC backends in our Docker container infrastructure, we switch to Docker's user-defined networking. This allows us to give the Boulder container multiple IP addresses on different local networks, and gives it different DNS aliases in each network.
In startservers.py, each shard of a service listens on a different DNS alias for that service, and therefore a different IP address. The listening port for each shard of a service is now identical.
This change also updates the gRPC service certificates. Now, each certificate that is used in a gRPC service (as opposed to something that is "only" a client) has three names. For instance, sa1.boulder, sa2.boulder, and sa.boulder (the generic service name). For now, we are validating against the specific hostnames. When we update our gRPC dependency, we will begin validating against the generic service name.
Incidentally, the DNS aliases feature of Docker allows us to get rid of some hackery in entrypoint.sh that inserted entries into /etc/hosts.
Note: Boulder now has a dependency on the DNS aliases feature in Docker. By default, docker-compose run creates a temporary container and doesn't assign any aliases to it. We now need to specify docker-compose run --use-aliases to get the correct behavior. Without --use-aliases, Boulder won't be able to resolve the hostnames it wants to bind to.
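Concretely, one-off containers now need the flag (the command shown is
illustrative):
```
# Without --use-aliases the temporary container gets no network aliases,
# and Boulder can't resolve the hostnames it wants to bind to.
docker-compose run --use-aliases boulder ./test.sh
```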
Deletes github.com/streadway/amqp and the various RabbitMQ setup tools, etc. Changes how listenbuddy is used: it now proxies all of the gRPC client -> server connections so we can test reconnection logic.
+49 -8,221 😁 Fixes #2640 and #2562.
* Make restarting boulder in docker nicer.
Handle SIGTERM in startservers.py.
Forcibly remove the rsyslog pid file to avoid an error.
* Add explanatory comment.
* Send SIGTERM instead of kill.
* Further improvements.
- Handle SIGINT too.
- Use unbuffered mode for Python so the print statements (like "all servers
running") get printed right away rather than at shutdown
- Squelch an unnecessary OSError about interrupting the wait() call.
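A minimal sketch of the shutdown handling described above (not the actual
startservers.py):
```
# -u runs Python unbuffered so "all servers running" prints immediately.
python3 -u - <<'EOF'
import signal, sys

def shutdown(signum, frame):
    # Real code would forward SIGTERM to the child processes here.
    sys.exit(0)

signal.signal(signal.SIGTERM, shutdown)
signal.signal(signal.SIGINT, shutdown)
print("all servers running")
signal.pause()   # sleep until a signal arrives instead of busy-waiting
EOF
```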
Previously, all gRPC services used the same client and server certificates. Now,
each service has its own certificate, which it uses for both client and server
authentication, more closely simulating production.
This also adds aliases for each of the relevant hostnames in /etc/hosts. There
may be some issues if Docker decides to rewrite /etc/hosts while Boulder is
running, but this seems to work for now.
Output base64-encoded DER, as expected by ocsp-responder.
Use flags instead of a template for Status, ThisUpdate, NextUpdate.
Provide better help.
Remove the old test (it wasn't run automatically).
Add it to the integration test, and use its output for the integration test of the issuer ocsp-responder.
Add another slot to boulder-tools HSM image, to store root key.
The PKCS11 proxy requires `test/test-ca.key.pem` in DER form. Rather
than generating it in `test/entrypoint.sh` when it doesn't exist and
adding it to the gitignore, we've opted to check it in directly.
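If the DER copy ever needs regenerating, a conversion along these lines
should work (the output file name is illustrative):
```
# openssl pkey handles any private key algorithm, not just RSA.
openssl pkey -in test/test-ca.key.pem -outform DER -out test/test-ca.key.der
```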
Instead of reading the CA key from a file on disk into memory and using that for signing in `boulder-ca` this patch adds a new Docker container that runs SoftHSM and pkcs11-proxy in order to hold the key and perform signing operations. The pkcs11-proxy module is used by `boulder-ca` to talk to the SoftHSM container.
This exercises (almost) the full pkcs11 path through boulder and will allow testing various HSM related failures in the future as well as simplifying tuning signing performance for benchmarking.
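On the client side, pkcs11-proxy is wired up roughly like this (the socket
address and module path are illustrative):
```
# boulder-ca loads the proxy module in place of a local PKCS#11 library;
# the module forwards every PKCS#11 call to the SoftHSM container.
export PKCS11_PROXY_SOCKET="tcp://boulder-hsm:5657"
pkcs11-tool --module /usr/lib/libpkcs11-proxy.so --list-slots
```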
Fixes #703.
That change broke the certbot tests because it switched to a MariaDB
10.1-specific syntax. certbot/certbot#3058 changes the certbot tests to use
Boulder's docker-compose.yml, so they will get MariaDB 10.1 automatically.
* MariaDB 10.1
* MariaDB 10.1 in Docker
* Run docker stuff.
* Improve test.js error.
* Lower log level
* Revert dockerfile to master
* Export debug ports, set FAKE_DNS, and remove container_name.
* Remove typo.
* Make integration-test.py wait for debug ports.
* Use 10.1 and export more Boulder ports.
* Test updates for Docker
Listen on 0.0.0.0 for utility servers.
Make integration-test.py just wait for ports rather than calling startservers.
Run docker-compose in test.sh.
Remove bypass when database exists.
Separate mailer test into its own function in integration test.
Print better errors in test.js.
* Always bring up mysql container.
* Wait for MySQL to come up.
* Put it in travis-before-install.
* Use 127
* Remove manual docker-up.
* Add ifconfig
* Switch to docker-compose run
* It works!
* Remove some spurious env vars.
* Add bash
* try running it
* Add all deps.
* Pass through env.
* Install everything in the Dockerfile.
* Fix install of ruby
* More improvements
* Revert integration test to run directly
Also remove .git from dockerignore and add some packages.
* Revert integration-test.py to master.
* Stop ignoring test/js
* Start from boulder-tools.
* Add boulder-tools.
* Tweak travis.yml
* Separate out docker-compose pull as install.
* Build in install phase; don't bother with go install in Dockerfile
* Add virtualenv
* Actually build rabbitmq-setup
* Remove FAKE_DNS
* Trivial change
* Pull boulder-tools as a separate step so it gets its own timing info.
* Install certbot and protobuf from repos.
* Use certbot from Debian backports.
* Fix clone
* Remove CERTBOT_PATH
* Updates
* Go back to letsencrypt for build.sh
* Remove certbot volume.
* go back to preinstalled letsencrypt
* Restore ENV
* Remove BASH_ENV
* Adapt reloader test so it passes when run as root.
* Fixups for review.
* Revert test.js
* Revert startservers.py
* Revert Makefile.
Make COPY and compilation the last commands in the Dockerfile so that, in the common case, Docker will cache the results of the EXPOSE, WORKDIR and ENV commands. The CMD is eliminated, as entrypoint.sh now defaults to start.py if no arguments are given. The patch eliminates setting MYSQL_CONTAINER in run-docker.sh and docker-compose.yaml, as entrypoint.sh sets the variable on its own when calling create_db.sh. A sketch of the layer ordering follows.
In addition, the patch passes arguments given to run-docker.sh through to entrypoint.sh in the container. This way, running `./run-docker.sh ./test.sh ...` lets you run tests locally.
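The caching idea, sketched (base image, port, and paths are invented, not the
actual Dockerfile):
```
cat > Dockerfile <<'EOF'
FROM golang:1.5
# These layers rarely change, so Docker caches them across rebuilds.
EXPOSE 4000
WORKDIR /go/src/github.com/letsencrypt/boulder
ENV GOBIN=/go/bin
# COPY and compilation come last: only these layers are rebuilt when
# source files change.
COPY . .
RUN go install ./...
EOF
```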
This followup for #1639 replaces localhost with boulder-rabbitmq when tests
run rabbitmq-setup.
Also fixes the log message to report the server name, not 0.0.0.0, when
logging connection attempts.
Use bridged networking.
Add some files to .dockerignore to shrink the build context sent to the
Docker daemon.
Use specific hostnames to contact services, rather than localhost.
Add instructions for adding those hostnames to /etc/hosts in non-Docker config.
Use DSN-style connect strings for DBs.
Remove localhost / 127.0.0.1 rewrite hack from create_db.sh.
Add hosts section with new hostnames.
Remove bin from .dockerignore.
SQL grants go to % (MySQL's any-host wildcard).
Short-circuit DB creation if the DB already exists.
Make `go install` a part of Docker image build so that Docker run is much
faster.
Bind to 0.0.0.0 for OCSP responders so they can be reached from host, and
publish / expose their ports.
Remove ToSServerThread and test.js' fetch of ToS.
Increase the registrationsPerIP rate limit threshold. When issuing from a Docker
host, the 127.0.0.1 override doesn't apply, so the limit is quickly hit.
Update docker-compose for bridged networking. Note: docker-compose doesn't currently work, but should be close.
https://github.com/letsencrypt/boulder/pull/1639
- Separated RabbitMQ into its own container
- cleaned up various Dockerfile-isms
- updated routes to linked containers
- removed nodejs; I have not been able to figure out why it was being installed
(so it could be something that is actually needed)
To set up a dev environment:
You now need `docker-compose`, but running the setup with all the
configurations is as simple as:
```
$ docker-compose build
$ docker-compose up
```
Then you can even run the `test.sh` in the container with:
```
$ docker exec -it boulder_boulder_1 bash
root@container $ ./test.sh
```
This is just an _initial_ first pass at refactoring a bunch of this. There is
a bunch more I want to change and make better.
Also, with regard to database migration taking awhile: I want to try moving
the goose stuff over to the mariadb container. There are just some less savory
things I don't like about starting the db in the background and then running
the migration script :/, since I like to attach to the process on container
start. I do have some thoughts on a `docker exec` command in the mariadb
container which migrates the db... but I'm still trying to think of something
better.
Signed-off-by: Jessica Frazelle <acidburn@docker.com>