Commit Graph

27 Commits

Tianon Gravi e48b7063a1 Move `oci-import` (and SBOM) shell code out of `meta.jq` into an explicit shell script
Also, add a *lot* more validation.

This new script/code assumes we're setting `BASHBREW_META_SCRIPTS` to an appropriate value (pointing to the directory of this repository on-disk).
2025-04-25 16:27:26 -07:00
yosifkit da908ae960
Merge pull request #110 from infosiftr/parallelism-redux
(re-)Implement parallelism in deploy
2025-02-03 16:33:19 -08:00
Tianon Gravi 5adcfa53f4 Add `User-Agent` to all our outgoing requests
This is something we should've been doing all along because we've been contributing to the thundering herd of useless `Go-http-client/1.1` agent values hitting Docker Hub.
2025-01-31 12:12:23 -08:00
Tianon Gravi 8b4e0f376a (re-)Implement parallelism in deploy
This requires the input to be in the correct order (children first), but that was already an implied constraint.

In my testing, this takes our `.test/test.sh --deploy` block testing deploy from ~5s down to ~2.5s.

When we originally introduced this, it collided with some other aging infrastructure in a bad way, kicking over everything, and had to be reverted.  We've since changed that and this should be safe to re-introduce.
2025-01-31 10:22:42 -08:00
yosifkit dfd62f2762
Revert "Implement parallelism in deploy" 2025-01-27 12:59:41 -08:00
Tianon Gravi 75c28536e9 Implement parallelism in deploy
This requires the input to be in the correct order (children first), but that was already an implied constraint.

In my testing, this takes our `.test/test.sh --deploy` block testing deploy from ~5s down to ~2.5s.

I used "show timestamps" in our GHA CI jobs to compare them also, and they're currently at ~7s, and after this change they drop to ~1s.

I've tested this with my personal deploy job, which was taking ~2m40s before, and is ~1m18s after.
2025-01-23 14:12:33 -08:00
Tianon Gravi 0e4a5d2465 Slight rate limit bump 2025-01-22 13:43:23 -08:00
Tianon Gravi 0560d0fc1b Add 504 to the list of responses we retry on
This is "gateway timeout", which we do see from Docker Hub from time to time (probably due to some HTTP middleware, redeploys, etc.).
2025-01-15 10:00:42 -08:00
Tianon Gravi 016bee1b7e Double our Docker Hub rate limit
This increases our rate limit from 100/min up to 200/min (with an immediate burst of up to 200 and at most 200 at once).  This should be generally safe since the code will already very aggressively drain the pool of available "tokens" the minute it sees a 429 (thus only doing a trickle of requests anyhow), and we auto-retry on 429, but only after a delay of a full second per request.

If all goes well, this should help mitigate our amd64 deploy jobs that have crept back up to a full hour runtime (without adding further parallelization complexity yet).
2024-10-14 16:38:12 -07:00
Tianon Gravi 08727fedfd Treat "500 Internal Server Error" and "502 Bad Gateway" the same as a 503
We still get spurious errors from our deploy jobs, and almost every time I dig into one, it ends with something like `failed HEAD: 500 Internal Server Error`.

This updates our auto-retry logic to treat 500 (and 502) the same as 503 (ie, up to three retries, subject to our per-request+overall rate limits).
2024-09-03 16:14:17 -07:00
Tianon Gravi 39c067d57a Revert "Revert "Attempt to speed up deploy""
This reverts commit 4473698dae.
2024-06-26 16:12:51 -07:00
Tianon Gravi 4473698dae Revert "Attempt to speed up deploy"
This reverts commit c70a53466d.

(After fixing the implementation bug, this made things worse, not better 😭)
2024-06-25 17:31:16 -07:00
Tianon Gravi e0a32c05fe Ensure our pre-flight HEAD request is always by-tag if we're pushing a tag
This is a latent bug the system has had for large manifests, which (after the speed-up changes) we now have for *all* manifests instead! 😄

(This also would artificially inflate our speedups since all the pre-flight HEAD requests would be by-digest which are infinitely cacheable and thus very, very fast.)
2024-06-25 17:11:56 -07:00
Joseph Ferguson c70a53466d Attempt to speed up deploy
Always try HEAD request before the PUT request
2024-06-24 16:52:04 -07:00
Tianon Gravi bb72bc31df Implement retries on 503s
This caps total attempts of any given request to at most 3x 503 responses (in other words, the third 503 will "bubble up" as-is).  The intent there is that if there *is* a serious issue with Hub, we don't want to continue retrying aggressively as that would likely contribute to the outage persisting longer.
2024-05-22 11:28:27 -07:00
Tianon Gravi b0dcc102bc Implement more fine-grained mutexes in registryCache
In short, this keeps the "data mutex" as-is, but only acquires it when we have some data to shove in (and only for the limited instructions necessary to update the data), and adds a new mutex per-tag or per-digest to prevent concurrent requests for the exact same upstream content (which is the main reason for this cache in the first place, so feels appropriate).

Without this, our current mutex means that any other efforts to parallelize will ultimately bottleneck on our registry cache.

In order to test this effectively, I've added a `--parallel` flag to `lookup` which just hyper-aggressively runs every single lookup in parallel inside a goroutine.

My first test of this was doing a lookup of the first tag of every DOI repository (so ~147 concurrent lookups in total), and I found it was consistently ~1m57s both with and without this change.  Our Hub rate limiter is pegged at ~100/min, which seems consistent with that result.  If I increase that limit to ~200/min, I was able to achieve a small speedup with this change (~43s down to ~30s).

In other words, for this to actually be effective as a speedup against Docker Hub if we implement parallel deploy, for example, we'll *also* have to increase our rate limit (which I think is fairly safe now that we handle 429 by explicitly emptying the limiter).

In order to *really* test this effectively, I took a different approach.  I spun up a local registry (`docker run -dit -p 127.0.0.1:5000:5000 -p [::1]:5000:5000 --name registry registry`), and copied `hello-world:linux` into it 10,000 times (something like `crane cp hello-world:linux localhost:5000/hello && echo localhost:5000/hello:{1..10000} | xargs -rtn1 -P$(nproc) crane cp localhost:5000/hello` -- could also be done with `jq` and `deploy` if you are ambitious 👀).

Then I used `time ./bin/lookup --parallel $(crane ls --full-ref localhost:5000/hello) > /dev/null` to establish some benchmarks.

Without this change, it's pretty consistently in the 15-20s range on my local system.  With this change, it drops down an order of magnitude to be in the 3-6s range.
2024-05-01 16:38:29 -07:00
Tianon Gravi ab38f954b3
Rewrite deploy to minimize API requests (#38)
* WIP: first pass at deploy

* WIP: second pass, now with more cowbell

* WIP: refactor coverage handling (cleaner, more consistent, no longer has to clobber top-level bin/)

* WIP: the start of more "jq" tests

* WIP: add a few TODOs

* Add a benchmark for `om.OrderedMap.Set`

I was testing a minor memory-usage improvement to `Set`, but it turns out it doesn't actually matter (and this helped me determine that, so I might as well keep it).

* Add explicit `Reference.StringWithKnownDigest` unit test

* WIP: refactor EnsureManifest loop with more correct handling of child manifests vs blobs

* Update to use the new `ociregistry.HTTPError` for more consistent/correct HTTP error handling

* WIP: remove TODO that was implemented elsewhere (and fix error message text / comment text)

* WIP: also normalize descriptor field ordering

* WIP: assume pre-normalized platform (no reason to normalize more than once)

* WIP: initial "deploy" data munging helpers plus tests

* WIP: update Jenkinsfile.deploy to use new deploy code

* WIP: remove example-commands symlink so Git detects rename better

* WIP: add delay for racy registry startup

* WIP: remove trap once it's no longer necessary

* WIP: typo

* WIP: remove unnecessary TODOs
2024-04-18 11:53:23 -07:00
yosifkit 530a1082db
Merge pull request #34 from infosiftr/cache-data
Update registry cache to (ab)use `Data` field of `Descriptor` objects
2024-03-26 17:04:30 -07:00
Tianon Gravi a6e64448b9 Fix `EntryForRegistry` code coverage randomization 2024-03-25 14:46:42 -07:00
Tianon Gravi c2771e3eed Add new `registry.Lookup` wrapper to handle generic `Reference` lookups
This allows us to update `cmd/lookup` to use this new wrapper and thus let us easily/correctly test more edge cases / lookups. 🚀
2024-03-25 14:01:49 -07:00
Tianon Gravi 073cf83236 Update `ociregistry` and deal with the minor breaking changes
This also updates our `replace` to my new upstream fix that deals with `HEAD` on a Docker Hub digest failing when it shouldn't.
2024-03-25 10:14:32 -07:00
Tianon Gravi 05d9b2ebf1 Implement `Resolve{Manifest,Tag,Blob}` in our registry cache
Now that we have `Descriptor` objects (and a way to cache them), this implementation is trivial. 🎉
2024-03-25 10:14:32 -07:00
Tianon Gravi 7cf25c6af9 Update registry cache to (ab)use `Data` field of `Descriptor` objects
This allows us to store full descriptors (necessary to implement `Resolve*`), but with a net decrease in the number of fields we have to juggle / keep in sync.

This does mean consumers need to be careful about how they use the `Descriptor` objects we return (esp. WRT `Data`), but it makes it easier for them to then have `Data` available if they want it (which is something I'd like to use in the future).  This is a net win anyhow because the upstream objects might've contained `Data` fields, so this forces us to deal with them in a sane way we're comfortable with instead of potentially including them verbatim unintentionally. 🚀
2024-03-25 10:13:37 -07:00
yosifkit 6e0d1210aa
Merge pull request #32 from infosiftr/ref-wrapper
Add a `Reference` wrapper type to `ociref.Reference` to handle our "Docker references" logic more cleanly
2024-03-22 17:28:15 -07:00
Tianon Gravi 1c19f8f8c8 Also treat 401 as if it were 404
We recently ran into this again with `arm32v6/hello-world` where 401 from the proxy actually means 404 because of mixing + matching public/private repositories in the same registry/namespace and thus returning a true 404 would be an information leak.  In our case, if we get a 401 at this point, we know for sure it might as well be a 404 because `ociauth` would already have handled any authentication we could possibly do.
2024-03-21 16:10:08 -07:00
Tianon Gravi bdbeb7673d Add a `Reference` wrapper type to `ociref.Reference` to handle our "Docker references" logic more cleanly
This also allows us to add explicit JSON round-tripping which can handle normalizing/denormalizing for us, so our output strings no longer contain `docker.io/[library/]`. 🎉
2024-03-21 15:58:58 -07:00
Tianon Gravi 64b5472227 Refactor data model to support more atomic index generation
This adjusts our data model to store/track data from the registry as an image index (with more data fidelity) instead of as a custom data structure.  One of the most notable benefits is that we can now track the annotation objects for each architecture, but even more importantly, this allows us to generate the indexes we need to push in deploy directly from data that's on-disk (where the old structure would require querying the registry).

Notably, however, this does *not* change deploy (yet) -- those changes are still in progress, but this was a large enough refactor/rewrite that I figured I should start here.

This also switches us from using containerd's registry client library to using [`cuelabs.dev/go/oci/ociregistry`](https://pkg.go.dev/cuelabs.dev/go/oci/ociregistry), which is much closer to the distribution-spec APIs (modeled/designed after them, in fact), where containerd's (and frankly most others') are a higher-level abstraction.  This is important because we're running into rate limits on the raw number of requests, and almost every library for this always starts with a `HEAD` before pushing content, while this will allow us to go directly for a `PUT` (and then only do the "copy child objects" dance if the `PUT` fails).

This also drops our `allTags` data entry (which we only *ever* used the first value of, and then only to identify a `sourceId`/`buildId` in a human-meaningful way, because it's not useful/meaningful for any other use case), and moves `tags` from being arch-specific up to the per-source-object level because it's always identical across all arches for a given `sourceId` so it's silly to copy it N times for every architecture object.
2024-02-28 16:07:13 -08:00