Also, add a *lot* more validation.
This new script/code assumes we're setting `BASHBREW_META_SCRIPTS` to an appropriate value (pointing to the directory of this repository on-disk).
This is something we should've been doing all along, because we've been contributing to the thundering herd of useless `Go-http-client/1.1` `User-Agent` values hitting Docker Hub.
This requires the input to be in the correct order (children first), but that was already an implied constraint.
In my testing, this takes our `.test/test.sh --deploy` block testing deploy from ~5s down to ~2.5s.
When we originally introduced this, it collided with some other aging infrastructure in a bad way, kicking over everything, and had to be reverted. We've since changed that and this should be safe to re-introduce.
I also compared using "show timestamps" in our GHA CI jobs: they're currently at ~7s, and after this change they drop to ~1s.
I've tested this with my personal deploy job, which was taking ~2m40s before, and is ~1m18s after.
This increases our rate limit from 100/min up to 200/min (with an immediate burst of up to 200, and at most 200 at once). This should be generally safe: the code already drains the pool of available "tokens" very aggressively the moment it sees a 429 (thus only doing a trickle of requests anyhow), and we auto-retry on 429, but only after a delay of a full second per request.
If all goes well, this should help mitigate our amd64 deploy jobs that have crept back up to a full hour runtime (without adding further parallelization complexity yet).
We still get spurious errors from our deploy jobs, and almost every time I dig into one, it ends with something like `failed HEAD: 500 Internal Server Error`.
This updates our auto-retry logic to treat 500 (and 502) the same as 503 (ie, up to three retries, subject to our per-request+overall rate limits).
This is a latent bug the system already had for large manifests, which we now have for *all* manifests instead! 😄
(This also would artificially inflate our speedups since all the pre-flight HEAD requests would be by-digest which are infinitely cacheable and thus very, very fast.)
This caps total attempts of any given request to at most 3x 503 responses (in other words, the third 503 will "bubble up" as-is). The intent there is that if there *is* a serious issue with Hub, we don't want to continue retrying aggressively as that would likely contribute to the outage persisting longer.
In short, this keeps the "data mutex" as-is, but only acquires it when we have some data to shove in (and only for the limited instructions necessary to update the data). It also adds a new mutex per-tag or per-digest to prevent concurrent requests for the exact same upstream content (which is the main reason for this cache in the first place, so that feels appropriate).
Without this, our current mutex means that any other efforts to parallelize will ultimately bottleneck on our registry cache.
In order to test this effectively, I've added a `--parallel` flag to `lookup` which just hyper-aggressively runs every single lookup in parallel inside a goroutine.
My first test of this was doing a lookup of the first tag of every DOI repository (so ~147 concurrent lookups in total), and I found it was consistently ~1m57s both with and without this change. Our Hub rate limiter is pegged at ~100/min, which seems consistent with that result. When I increased that limit to ~200/min, I was able to achieve a small speedup with this change (~43s down to ~30s).
In other words, for this to actually be effective as a speedup against Docker Hub if we implement parallel deploy, for example, we'll *also* have to increase our rate limit (which I think is fairly safe now that we handle 429 by explicitly emptying the limiter).
In order to *really* test this effectively, I took a different approach. I spun up a local registry (`docker run -dit -p 127.0.0.1:5000:5000 -p [::1]:5000:5000 --name registry registry`), and copied `hello-world:linux` into it 10,000 times (something like `crane cp hello-world:linux localhost:5000/hello && echo localhost:5000/hello:{1..10000} | xargs -rtn1 -P$(nproc) crane cp localhost:5000/hello` -- could also be done with `jq` and `deploy` if you are ambitious 👀).
Then I used `time ./bin/lookup --parallel $(crane ls --full-ref localhost:5000/hello) > /dev/null` to establish some benchmarks.
Without this change, it's pretty consistently in the 15-20s range on my local system. With this change, it drops down an order of magnitude to be in the 3-6s range.
* WIP: first pass at deploy
* WIP: second pass, now with more cowbell
* WIP: refactor coverage handling (cleaner, more consistent, no longer has to clobber top-level bin/)
* WIP: the start of more "jq" tests
* WIP: add a few TODOs
* Add a benchmark for `om.OrderedMap.Set`
I was testing a minor memory-usage improvement to `Set`, but it turns out it doesn't actually matter (and this helped me determine that, so I might as well keep it).
* Add explicit `Reference.StringWithKnownDigest` unit test
* WIP: refactor EnsureManifest loop with more correct handling of child manifests vs blobs
* Update to use the new `ociregistry.HTTPError` for more consistent/correct HTTP error handling
* WIP: remove TODO that was implemented elsewhere (and fix error message text / comment text)
* WIP: also normalize descriptor field ordering
* WIP: assume pre-normalized platform (no reason to normalize more than once)
* WIP: initial "deploy" data munging helpers plus tests
* WIP: update Jenkinsfile.deploy to use new deploy code
* WIP: remove example-commands symlink so Git detects rename better
* WIP: add delay for racy registry startup
* WIP: remove trap once it's no longer necessary
* WIP: typo
* WIP: remove unnecessary TODOs
This allows us to store full descriptors (necessary to implement `Resolve*`), but with a net decrease in the number of fields we have to juggle / keep in sync.
This does mean consumers need to be careful about how they use the `Descriptor` objects we return (especially WRT `Data`), but it makes it easier for them to then have `Data` available if they want it (which is something I'd like to use in the future). This is a net win anyhow: the upstream objects might've contained `Data` fields, so this forces us to deal with them in a sane way we're comfortable with, instead of potentially including them verbatim unintentionally. 🚀
We recently ran into this again with `arm32v6/hello-world`, where a 401 from the proxy actually means 404: public and private repositories are mixed in the same registry/namespace, so returning a true 404 would be an information leak. In our case, if we get a 401 at this point, we know it might as well be a 404, because `ociauth` would already have handled any authentication we could possibly do.
This also allows us to add explicit JSON round-tripping which can handle normalizing/denormalizing for us, so our output strings no longer contain `docker.io/[library/]`. 🎉
This adjusts our data model to store/track data from the registry as an image index (with more data fidelity) instead of as a custom data structure. One of the most notable benefits is that we can now track the annotation objects for each architecture, but even more importantly, this allows us to generate the indexes we need to push in deploy directly from data that's on-disk (where the old structure would require querying the registry).
Notably, however, this does *not* change deploy (yet) -- those changes are still in progress, but this was a large enough refactor/rewrite that I figured I should start here.
This also switches us from using containerd's registry client library to using [`cuelabs.dev/go/oci/ociregistry`](https://pkg.go.dev/cuelabs.dev/go/oci/ociregistry), which is much closer to the distribution-spec APIs (modeled/designed after them, in fact), where containerd's (and frankly most others) are a higher-level abstraction. This is important because we're running into raw request-count rate limits, and almost every library for this starts with a `HEAD` before pushing content; this switch will allow us to go directly for a `PUT` (and then only do the "copy child objects" dance if the `PUT` fails).
This also drops our `allTags` data entry (which we only *ever* used the first value of, and then only to identify a `sourceId`/`buildId` in a human-meaningful way, because it's not useful/meaningful for any other use case), and moves `tags` from being arch-specific up to the per-source-object level because it's always identical across all arches for a given `sourceId` so it's silly to copy it N times for every architecture object.