zstd:chunked file format

This is rough documentation of the zstd:chunked file format. It's a mix of several formats.

The main concept is a zstd-compressed .tar.zstd file which, when decompressed normally, produces the desired layer .tar with the correct checksum. This mode of operation is only intended to be used by clients which are unaware of zstd:chunked, however.

For clients that are aware of the format, zstd:chunked takes advantage of the fact that (like tar and gzip) a zstd file can consist of multiple independent concatenated frames. It also takes advantage of zstd Skippable Frames to store some metadata at the end of the file which can be used for finding the frames that we need to download. Finally, HTTP Range Requests are used to only download the frames that we actually require.

This document is written from the standpoint of someone trying to consume the format.

Overall layout

The .tar.zstd file will be compressed as a large number of separate frames, such that each individual non-empty regular file has its content contained within a single frame. This typically implies that the tar headers between regular files will also be separated into their own frames.

At the end of the file are three skippable frames:

  • the compressed "manifest" JSON
  • the compressed "tarsplit" JSON
  • the uncompressed "footer"

In order to do anything useful with the file you first need to locate the manifest and tarsplit JSON files in the compressed stream.

Finding the skippable frames

Having the skippable frames at the end of the file makes things a bit more difficult: it's generally not possible to find frames by scanning backwards. There are two ways to solve this problem:

  • the fixed-sized uncompressed footer
  • OCI descriptor annotations

The footer is a fixed-size uncompressed skippable frame, 64 bytes in content length (72 bytes total length). A skippable frame header starts with a 32-bit little-endian magic number (as per the zstd spec, "any value from 0x184D2A50 to 0x184D2A5F"), but we always use 0x184d2a50. It's followed by a 32-bit little-endian size (which is always 64 in our case).

The footer content consists of the following 64-bit little-endian integers, in order:

  • manifest start offset, compressed length, uncompressed length
  • manifest type (always 1)
  • tarsplit start offset, compressed length, uncompressed length
  • magic number: 0x78556e496c554e47 (i.e. the ASCII string "GNUlInUx")

That means that any zstd:chunked file will always end with 72 bytes which look something like:

00000000  50 2a 4d 18 40 00 00 00  6c 91 62 06 00 00 00 00  |P*M.@...l.b.....|
00000010  9e 68 0f 00 00 00 00 00  e7 4e 54 00 00 00 00 00  |.h.......NT.....|
00000020  01 00 00 00 00 00 00 00  12 fa 71 06 00 00 00 00  |..........q.....|
00000030  e2 57 09 00 00 00 00 00  76 07 eb 00 00 00 00 00  |.W......v.......|
00000040  47 4e 55 6c 49 6e 55 78                           |GNUlInUx|
  • 50 2a 4d 18 40 00 00 00: skippable frame, size 0x40 (64)
  • 6c 91 62 06 00 00 00 00: start of the manifest in the compressed stream
  • 9e 68 0f 00 00 00 00 00: length of the manifest in the compressed stream
  • e7 4e 54 00 00 00 00 00: uncompressed size of the manifest
  • 01 00 00 00 00 00 00 00: manifest type (1)
  • 12 fa 71 06 00 00 00 00: start of the tarsplit json in the compressed stream
  • e2 57 09 00 00 00 00 00: length of the tarsplit json in the compressed stream
  • 76 07 eb 00 00 00 00 00: uncompressed size of the tarsplit json
  • 47 4e 55 6c 49 6e 55 78: magic number (GNUlInUx)
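
For illustration, here's a minimal sketch of parsing those trailing 72 bytes in Rust, using only the standard library (the struct and field names are made up for this document):

    const SKIPPABLE_MAGIC: u32 = 0x184d_2a50;
    const FOOTER_MAGIC: u64 = 0x7855_6e49_6c55_4e47; // "GNUlInUx" read as a little-endian u64

    #[derive(Debug)]
    struct Footer {
        manifest_offset: u64,
        manifest_compressed_size: u64,
        manifest_uncompressed_size: u64,
        manifest_type: u64,
        tarsplit_offset: u64,
        tarsplit_compressed_size: u64,
        tarsplit_uncompressed_size: u64,
    }

    fn parse_footer(last_72: &[u8; 72]) -> Option<Footer> {
        let le32 = |i: usize| u32::from_le_bytes(last_72[i..i + 4].try_into().unwrap());
        let le64 = |i: usize| u64::from_le_bytes(last_72[i..i + 8].try_into().unwrap());

        // Skippable frame header: fixed magic plus a content size of 64.
        if le32(0) != SKIPPABLE_MAGIC || le32(4) != 64 {
            return None;
        }
        // Trailing "GNUlInUx" magic number.
        if le64(64) != FOOTER_MAGIC {
            return None;
        }
        Some(Footer {
            manifest_offset: le64(8),
            manifest_compressed_size: le64(16),
            manifest_uncompressed_size: le64(24),
            manifest_type: le64(32),
            tarsplit_offset: le64(40),
            tarsplit_compressed_size: le64(48),
            tarsplit_uncompressed_size: le64(56),
        })
    }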

The OCI descriptor annotations

Of course, you're probably interested in downloading the layer because it's part of an OCI image. In that case, the same information that's contained in the footer will also have been encoded as a set of annotations on the descriptor for the layer:

    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+zstd",
      "digest": "sha256:20574ef181bd366aa8a344c3e869c95a22429feb00f1c4feeb7fb2fd0e8de71c",
      "size": 108745276,
      "annotations": {
        "io.github.containers.zstd-chunked.manifest-checksum": "sha256:44b5219a19eea4bd8414c2938d0561eebdd053fad8110df7957bee86370ba02b",
        "io.github.containers.zstd-chunked.manifest-position": "107123052:1009822:5525223:1",
        "io.github.containers.zstd-chunked.tarsplit-checksum": "sha256:4041c7b1197a991b90eb5e933c3e96e5f60abc56ed4c0bc926a0d5a2e136ebdc",
        "io.github.containers.zstd-chunked.tarsplit-position": "108132882:612322:15402870"
      }
    }

The annotations are:

  • io.github.containers.zstd-chunked.manifest-checksum: a digest of the compressed "manifest" JSON
  • io.github.containers.zstd-chunked.manifest-position: a :-separated 4-tuple containing the manifest location information in the same order as it appears in the footer: offset, compressed size, uncompressed size, manifest type
  • io.github.containers.zstd-chunked.tarsplit-checksum: a digest of the compressed "tarsplit" JSON
  • io.github.containers.zstd-chunked.tarsplit-position: a :-separated 3-tuple containing the tarsplit location information in the same order as it appears in the footer: offset, compressed size, uncompressed size
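
Parsing the position annotations is just splitting on ":" and parsing integers; a small sketch (the function name is illustrative):

    /// Parse "offset:compressed size:uncompressed size[:type]" from the
    /// manifest-position or tarsplit-position annotation.
    fn parse_position(value: &str) -> Option<Vec<u64>> {
        value.split(':').map(|part| part.parse().ok()).collect()
    }

    // parse_position("107123052:1009822:5525223:1")
    //     == Some(vec![107123052, 1009822, 5525223, 1])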

The "manifest" file format

The manifest is obtained by slicing (or fetching) the byte range of the original file that starts at the manifest start offset and spans the specified compressed length. That range contains a single normal compressed zstd frame which you decompress to get the manifest. As mentioned above, the manifest is contained inside a skippable frame, but the offsets for finding the manifest do not include the skippable frame header, so you don't need to do anything about it.
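
As a minimal sketch, assuming the whole blob is already in memory and using the zstd crate (any zstd decoder works here):

    /// Slice the compressed frame out of the blob and decompress it.
    /// The same function works for the tarsplit JSON with its offsets.
    fn read_json_frame(blob: &[u8], offset: u64, compressed_len: u64) -> std::io::Result<Vec<u8>> {
        let start = offset as usize;
        let end = start + compressed_len as usize;
        // The range is a single ordinary zstd frame; the enclosing skippable
        // frame header is not included in these offsets.
        zstd::decode_all(&blob[start..end])
    }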

This file format was originally designed as part of the Container Registry Filesystem project and contains far more information than is required for incremental downloading. It more or less duplicates all of the tar header information in a different format. It was designed to allow a lazy-downloading filesystem implementation that could do metadata lookups without having to fetch tar headers that were scattered around the rest of the file. We can safely ignore most of it.

At the top level, it's a JSON dictionary containing two items, looking something like:

{
  "version": 1,
  "entries": [
    {
      "type": "dir",
      "name": "etc/",
      "mode": 493,
      "modtime": "2025-05-28T02:24:19Z",
      "accesstime": "0001-01-01T00:00:00Z",
      "changetime": "0001-01-01T00:00:00Z",
      "xattrs": {
        "user.overlay.impure": "eQ=="
      }
    },
    {
      "type": "reg",
      "name": "etc/asound.conf",
      "mode": 420,
      "size": 55,
      "modtime": "2025-04-14T00:00:00Z",
      "accesstime": "0001-01-01T00:00:00Z",
      "changetime": "0001-01-01T00:00:00Z",
      "digest": "sha256:3b7fe1f8fd9bb7e1c80fddd31c592bc5b7b7d39e322ded8303c742b2bc1bec31",
      "offset": 210,
      "endOffset": 278
    }
  ]
}

The version is 1.

entries is an array. Each entry is either a complete description of an entry in the .tar file (in which case the type will be "reg", "dir", "hardlink", "symlink", etc.) or an additional chunk (type "chunk"). Chunk entries follow the file to which they belong, allowing that file's content to be split into smaller pieces (which may improve incremental downloads). It's possible to ignore chunks when implementing the file format: even in the presence of chunks, the information required to download a file's complete data is available on the main entry; you'll simply get it as multiple concatenated zstd frames (which, as mentioned above, is still a valid compressed stream).

The important fields for knowing what needs to be downloaded are "digest", "offset", and "endOffset". The digest provides the main mechanism for knowing if we already have a particular file's contents downloaded, and the offsets provide the information we need to fetch it if we don't. The "digest" field is a digest of the uncompressed file data. The "offset" is the offset of the start of the compressed frame and the "endOffset" is the offset past the end of the frame (i.e. you need to subtract 1 from it when turning it into a Range request). "size" is the size of the uncompressed data, and it's useful for verification.

After decompressing, the data from the given range does not contain any extra padding which might be implied by the tar format (ie: rounding up to the next 512-byte block). The uncompressed data should match the "size" and "digest".
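
A small verification sketch, assuming the sha2 and hex crates (the format only says the digest is a sha256 of the uncompressed data; the helper name is invented):

    use sha2::{Digest, Sha256};

    /// Check fetched (and decompressed) content against the manifest entry's
    /// "size" and "digest" fields.
    fn verify(data: &[u8], expected_size: u64, expected_digest: &str) -> bool {
        let actual = format!("sha256:{}", hex::encode(Sha256::digest(data)));
        data.len() as u64 == expected_size && actual == expected_digest
    }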

For purposes of incremental downloads, we really only need the entries of "type": "reg" with a non-zero "size" and "digest" plus "offset" and "endOffset". Those are the entries that will let us find our file content.
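
A sketch of the corresponding types using serde (only the JSON keys come from the format; the Rust names and the filter are invented for this document):

    use serde::Deserialize;

    #[derive(Deserialize)]
    struct Manifest {
        version: u32,
        entries: Vec<Entry>,
    }

    #[derive(Deserialize)]
    struct Entry {
        #[serde(rename = "type")]
        kind: String,
        name: String,
        #[serde(default)]
        size: u64,
        digest: Option<String>,
        offset: Option<u64>,
        #[serde(rename = "endOffset")]
        end_offset: Option<u64>,
        // The remaining fields (mode, times, xattrs, ...) are ignored here.
    }

    /// The entries we actually need for incremental downloads: regular files
    /// with content and a location in the compressed stream.
    fn downloadable<'a>(manifest: &'a Manifest) -> impl Iterator<Item = &'a Entry> {
        manifest.entries.iter().filter(|e| {
            e.kind == "reg"
                && e.size > 0
                && e.digest.is_some()
                && e.offset.is_some()
                && e.end_offset.is_some()
        })
    }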

Chunks

The "type": "chunk" entries contain information about individual file chunks. It's not specified which algorithm is used to determine the chunking, but containers/storage uses a rolling checksum approach. In an addition to the original format, zstd:chunked can also contain "chunkType": "zeros" chunks which are effectively sparse "holes".

The chunk format is not described here because it's not implemented yet: the most obvious approach to doing so would require duplicating file content on disk (chunked and merged) and it's not clear if it's worth it to save a bit of extra downloading.

The "tarsplit" file format

This is the JSON-lines format produced by the tar-split utility used in the implementation of podman, which explains its presence in this file format.

The purpose of this format is to store enough metadata to allow reconstructing the original content of a .tar file which was previously unpacked, assuming we still have the unpacked contents around (referred to by their filenames).

The inclusion of this file essentially reproduces all of the data in the tar headers for a third time.

It looks something like this:

{"type":2,"payload":"ZXRjL1BheEhlYWRlcnMuMAAA...","position":0}
{"type":1,"name":"etc/","payload":null,"position":1}
{"type":2,"payload":"ZXRjL2Fzb3VuZC5jb25mAAA...=","position":2}
{"type":1,"name":"etc/asound.conf","size":55,"payload":"tAt3+IpQDrE=","position":3}
{"type":2,"payload":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...==","position":4}

where each line contains one dictionary which is either "type": 1 or "type": 2.

The "type 1" items are references to file content. In the originally-conceived use of the tool, these would be files that you could read from the filesystem to get the content to include back in the reconstructed tar stream. In our case, we can find those files in the manifest (along with the information about where we can fetch them and if we need to). The important key here is "name" which is the filename: it's exactly the same name as in the manifest. These entries are really only useful for regular files. You can identify that case by the presence of a "size" field: the uncompressed file size (which should match the same key in the manifest). The "payload" field on "type 1" entries contains a base64-encoded crc64 of the file contents.

The "type 2" items are inline content used to encode non-file data. The tar headers end up reproduced in these. It's worth noting that padding is included in these items. That means that the payload of a "type 2" entry following a "type 1" entry that doesn't have a multiple-of-512 file size will start with padding (seen as "AAAA..." in the above example).

Putting it all together

The first step is to get the compressed form of the manifest and the tarsplit JSON files. You can use the OCI annotations if you have them, or do a "suffix" range request for the footer.

Once you have the two JSON files you need to decompress them.

Rebuilding the .tar file contents then essentially works by iterating over the lines in the tarsplit file. For each "type 2" entry, you simply output the inline data. For each "type 1" entry, you look up the equivalent entry in the manifest, check if you already have an object with that checksum, use the range information to download it if not, and then output its content.
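
Here's a sketch of that loop, reusing the Manifest, Entry and TarSplitEntry types from the sketches above; fetch_content stands in for whatever cache/HTTP layer you have and is expected to return the uncompressed file data:

    use base64::{engine::general_purpose::STANDARD as B64, Engine as _};
    use std::io::{Error, ErrorKind, Write};

    fn rebuild_tar(
        tarsplit: &[TarSplitEntry],
        manifest: &Manifest,
        mut fetch_content: impl FnMut(&Entry) -> std::io::Result<Vec<u8>>,
        out: &mut impl Write,
    ) -> std::io::Result<()> {
        for item in tarsplit {
            match item.kind {
                // Inline segments: tar headers, padding, trailing blocks, ...
                2 => {
                    let payload = item.payload.as_deref().unwrap_or("");
                    let data = B64
                        .decode(payload)
                        .map_err(|e| Error::new(ErrorKind::Other, e))?;
                    out.write_all(&data)?;
                }
                // Content references: only regular files with a size carry data.
                1 if item.size.unwrap_or(0) > 0 => {
                    let name = item.name.as_deref().unwrap_or_default();
                    let entry = manifest
                        .entries
                        .iter()
                        .find(|e| e.kind == "reg" && e.name == name)
                        .ok_or_else(|| Error::new(ErrorKind::Other, "entry missing from manifest"))?;
                    out.write_all(&fetch_content(entry)?)?;
                }
                // Other type 1 entries (dirs, symlinks, ...) contribute no bytes here.
                _ => {}
            }
        }
        Ok(())
    }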

In all cases I've seen, the entries in the tarsplit file and the manifest are in the same order, so it's probably possible to create an efficient implementation that decompresses them in parallel, keeping them in sync with each other and avoiding ever having to hold either file fully in memory. In practice, the total amount of data in the manifest is relatively small, and extracting it into some form of lookup table is probably a more reliable (and easier) approach.

Making individual HTTP requests for each range as we require it is probably not a reasonable idea: there's too much overhead and latency. HTTP/2 and HTTP/3 improve things by allowing massive levels of concurrency over a single connection (saving handshaking overhead) but the overhead of the headers is still significant. A better approach is to pre-filter the list of digests, removing the ones we already have, and fetching many ranges per request.
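
For example, a single Range header can carry many byte ranges at once (the server then answers with a multipart/byteranges response that has to be split back into individual frames). A sketch of building that header value from the manifest entries we still need:

    /// Build a multi-range header value for the frames we still need to fetch.
    fn range_header(needed: &[&Entry]) -> String {
        let ranges: Vec<String> = needed
            .iter()
            .filter_map(|e| match (e.offset, e.end_offset) {
                // endOffset is one past the end; HTTP ranges are inclusive.
                (Some(start), Some(end)) if end > start => Some(format!("{}-{}", start, end - 1)),
                _ => None,
            })
            .collect();
        format!("bytes={}", ranges.join(", "))
    }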