zstd:chunked file format
This is rough documentation of the zstd:chunked file format. It's a mix of several formats:
- tar
- zstd
- tar-split
- CRFS Table of Contents JSON
- a "footer"
- a set of annotations on OCI Descriptors
The main concept is a zstd-compressed .tar.zstd file which, when decompressed normally, produces the desired layer .tar with the correct checksum. This mode of operation is only intended to be used by clients which are unaware of zstd:chunked, however.
For clients that are aware of the format, zstd:chunked takes advantage of the fact that (like tar and gzip) a zstd file can consist of multiple independent concatenated frames. It also takes advantage of zstd skippable frames to store some metadata at the end of the file which can be used for finding the frames that we need to download. Finally, HTTP Range Requests are used to only download the frames that we actually require.
This document is written from the standpoint of someone trying to consume the format.
Overall layout
The .tar.zstd file will be compressed as a large number of separate frames, such that each individual non-empty regular file has its content contained within a single frame. This typically implies that the tar headers between regular files will also be separated into their own frames.
At the end of the file are three skippable frames:
- the compressed "manifest" JSON
- the compressed "tarsplit" JSON
- the uncompressed "footer"
In order to do anything useful with the file you first need to locate the manifest and tarsplit JSON files in the compressed stream.
Finding the skippable frames
Having the skippable frames at the end of the file makes things a bit more difficult: it's generally not possible to find frames by scanning backwards. There are two ways to solve this problem:
- the fixed-sized uncompressed footer
- OCI descriptor annotations
The footer
The footer is a fixed-size uncompressed skippable frame, 64 bytes in content length (72 bytes total length). A skippable frame header starts with a 32-bit little-endian magic number (as per the zstd spec, "any value from 0x184D2A50 to 0x184D2A5F") but we always use 0x184d2a50. It's followed by a 32-bit little-endian size (which is always 64 in our case).
The footer content consists of the following 64-bit little-endian integers, in order:
- manifest start offset, compressed length, uncompressed length
- manifest type (always 1)
- tarsplit start offset, compressed length, uncompressed length
- magic number: 0x78556e496c554e47 (i.e. the ASCII string "GNUlInUx")
That means that any zstd:chunked file will always end with 72 bytes which look something like:
00000000 50 2a 4d 18 40 00 00 00 6c 91 62 06 00 00 00 00 |P*M.@...l.b.....|
00000010 9e 68 0f 00 00 00 00 00 e7 4e 54 00 00 00 00 00 |.h.......NT.....|
00000020 01 00 00 00 00 00 00 00 12 fa 71 06 00 00 00 00 |..........q.....|
00000030 e2 57 09 00 00 00 00 00 76 07 eb 00 00 00 00 00 |.W......v.......|
00000040 47 4e 55 6c 49 6e 55 78 |GNUlInUx|
- 50 2a 4d 18 40 00 00 00: skippable frame, size 0x40 (64)
- 6c 91 62 06 00 00 00 00: start of the manifest in the compressed stream
- 9e 68 0f 00 00 00 00 00: length of the manifest in the compressed stream
- e7 4e 54 00 00 00 00 00: uncompressed size of the manifest
- 01 00 00 00 00 00 00 00: manifest type (1)
- 12 fa 71 06 00 00 00 00: start of the tarsplit json in the compressed stream
- e2 57 09 00 00 00 00 00: length of the tarsplit json in the compressed stream
- 76 07 eb 00 00 00 00 00: uncompressed size of the tarsplit json
- 47 4e 55 6c 49 6e 55 78: magic number (GNUlInUx)
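As a rough illustration, parsing those 72 bytes could look like the Go sketch below. The package, struct, and field names are invented for this document (the sketches that follow pretend to live in one hypothetical zstdchunked package) and are not taken from the containers/storage implementation.
package zstdchunked

import (
    "encoding/binary"
    "fmt"
)

const (
    footerSize   = 72                 // 8-byte skippable-frame header + 64 bytes of content
    skippableLow = 0x184d2a50         // the skippable-frame magic number we always use
    footerMagic  = 0x78556e496c554e47 // "GNUlInUx" read as a little-endian integer
)

type footer struct {
    ManifestOffset, ManifestCompressed, ManifestUncompressed uint64
    ManifestType                                             uint64
    TarsplitOffset, TarsplitCompressed, TarsplitUncompressed uint64
}

func parseFooter(b []byte) (*footer, error) {
    if len(b) != footerSize {
        return nil, fmt.Errorf("footer must be %d bytes, got %d", footerSize, len(b))
    }
    if binary.LittleEndian.Uint32(b[0:4]) != skippableLow {
        return nil, fmt.Errorf("not a skippable frame")
    }
    if binary.LittleEndian.Uint32(b[4:8]) != 64 {
        return nil, fmt.Errorf("unexpected skippable frame size")
    }
    if binary.LittleEndian.Uint64(b[64:72]) != footerMagic {
        return nil, fmt.Errorf("missing GNUlInUx magic")
    }
    var f footer
    fields := []*uint64{
        &f.ManifestOffset, &f.ManifestCompressed, &f.ManifestUncompressed,
        &f.ManifestType,
        &f.TarsplitOffset, &f.TarsplitCompressed, &f.TarsplitUncompressed,
    }
    for i, p := range fields {
        *p = binary.LittleEndian.Uint64(b[8+8*i : 16+8*i])
    }
    return &f, nil
}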
The OCI descriptor annotations
Of course, you're probably interested in downloading the layer because it's part of an OCI image. In that case, the same information that's contained in the footer will also have been encoded as a set of annotations on the descriptor for the layer:
{
"mediaType": "application/vnd.oci.image.layer.v1.tar+zstd",
"digest": "sha256:20574ef181bd366aa8a344c3e869c95a22429feb00f1c4feeb7fb2fd0e8de71c",
"size": 108745276,
"annotations": {
"io.github.containers.zstd-chunked.manifest-checksum": "sha256:44b5219a19eea4bd8414c2938d0561eebdd053fad8110df7957bee86370ba02b",
"io.github.containers.zstd-chunked.manifest-position": "107123052:1009822:5525223:1",
"io.github.containers.zstd-chunked.tarsplit-checksum": "sha256:4041c7b1197a991b90eb5e933c3e96e5f60abc56ed4c0bc926a0d5a2e136ebdc",
"io.github.containers.zstd-chunked.tarsplit-position": "108132882:612322:15402870"
}
}
The annotations are:
- io.github.containers.zstd-chunked.manifest-checksum: a digest of the compressed "manifest" JSON
- io.github.containers.zstd-chunked.manifest-position: a :-separated 4-tuple containing the manifest location information in the same order as it appears in the footer: offset, compressed size, uncompressed size, manifest type
- io.github.containers.zstd-chunked.tarsplit-checksum: a digest of the compressed "tarsplit" JSON
- io.github.containers.zstd-chunked.tarsplit-position: a :-separated 3-tuple containing the tarsplit location information in the same order as it appears in the footer: offset, compressed size, uncompressed size
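Parsing the -position annotations is just splitting on : and reading decimal integers; note that the values in this example describe the same layer as the footer hexdump above (107123052 is 0x0662916c, i.e. the 6c 91 62 06 00 00 00 00 bytes). A minimal sketch, with invented names:
package zstdchunked

import (
    "fmt"
    "strconv"
    "strings"
)

type location struct {
    Offset, CompressedSize, UncompressedSize uint64
}

// parsePosition handles both the 4-tuple (manifest) and 3-tuple (tarsplit) forms.
func parsePosition(s string) (loc location, manifestType uint64, err error) {
    parts := strings.Split(s, ":")
    if len(parts) != 3 && len(parts) != 4 {
        return loc, 0, fmt.Errorf("expected 3 or 4 fields, got %d", len(parts))
    }
    nums := make([]uint64, len(parts))
    for i, p := range parts {
        if nums[i], err = strconv.ParseUint(p, 10, 64); err != nil {
            return loc, 0, err
        }
    }
    loc = location{Offset: nums[0], CompressedSize: nums[1], UncompressedSize: nums[2]}
    if len(nums) == 4 {
        manifestType = nums[3]
    }
    return loc, manifestType, nil
}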
The "manifest" file format
The manifest is obtained by slicing (or fetching) the range that starts at the manifest start offset and spans the specified compressed length of the original file. That range contains a single normal compressed zstd frame which you decompress to get the manifest. As mentioned above, the manifest is contained inside of a skippable frame, but the offsets for finding the manifest do not include the skippable frame header, so you don't need to do anything about it.
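As an example, fetching and decompressing the manifest might look roughly like the sketch below, using an explicit Range request and one possible zstd decoder (github.com/klauspost/compress/zstd). The URL is a stand-in for wherever the layer blob can be fetched from; a real registry client would also have to handle authentication.
package zstdchunked

import (
    "fmt"
    "io"
    "net/http"

    "github.com/klauspost/compress/zstd"
)

// fetchFrame downloads [offset, offset+compressedLength) and decompresses it.
func fetchFrame(url string, offset, compressedLength uint64) ([]byte, error) {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    // HTTP ranges are inclusive, hence the -1.
    req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", offset, offset+compressedLength-1))
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusPartialContent {
        return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
    }
    dec, err := zstd.NewReader(resp.Body)
    if err != nil {
        return nil, err
    }
    defer dec.Close()
    return io.ReadAll(dec)
}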
This file format was originally designed as part of the Container Registry Filesystem project and contains far more information than is required for incremental downloading. It more or less duplicates all of the tar header information in a different format. It was designed to allow a lazy-downloading filesystem implementation that could do metadata lookups without having to fetch tar headers that were scattered around the rest of the file. We can safely ignore most of it.
At the top-level, it's a JSON dictionary containing two items, and looking something like:
{
"version": 1,
"entries": [
{
"type": "dir",
"name": "etc/",
"mode": 493,
"modtime": "2025-05-28T02:24:19Z",
"accesstime": "0001-01-01T00:00:00Z",
"changetime": "0001-01-01T00:00:00Z",
"xattrs": {
"user.overlay.impure": "eQ=="
}
},
{
"type": "reg",
"name": "etc/asound.conf",
"mode": 420,
"size": 55,
"modtime": "2025-04-14T00:00:00Z",
"accesstime": "0001-01-01T00:00:00Z",
"changetime": "0001-01-01T00:00:00Z",
"digest": "sha256:3b7fe1f8fd9bb7e1c80fddd31c592bc5b7b7d39e322ded8303c742b2bc1bec31",
"offset": 210,
"endOffset": 278
}
]
}
The version is 1.
entries is an array. Each entry is either a complete description of an entry in the .tar file (in which case the type will be "reg", "dir", "hardlink", "symlink", etc.) or an additional "chunk" (type of "chunk"). Chunk entries follow the file which is chunked, allowing it to be split into smaller chunks (which may improve incremental downloads). It's possible to ignore chunks when implementing the file format: even in the presence of chunks, the information required to download the complete data of the file is available on the main entry; you'll simply get it as multiple concatenated zstd frames (which, as mentioned above, is still a valid compressed file).
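For reference, a rough Go mirror of the fields we'll care about might look like this; the JSON keys come from the example above, and everything else in the manifest is simply left undecoded.
package zstdchunked

import "encoding/json"

type tocEntry struct {
    Type      string `json:"type"`                // "reg", "dir", "symlink", "hardlink", "chunk", ...
    Name      string `json:"name"`
    Size      int64  `json:"size,omitempty"`      // uncompressed size
    Digest    string `json:"digest,omitempty"`    // digest of the uncompressed file data
    Offset    int64  `json:"offset,omitempty"`    // start of the compressed frame
    EndOffset int64  `json:"endOffset,omitempty"` // just past the end of the frame
}

type toc struct {
    Version int        `json:"version"`
    Entries []tocEntry `json:"entries"`
}

func parseManifest(data []byte) (*toc, error) {
    var t toc
    if err := json.Unmarshal(data, &t); err != nil {
        return nil, err
    }
    return &t, nil
}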
The important fields for knowing what needs to be downloaded are "digest", "offset", and "endOffset". The digest provides the main mechanism for knowing if we already have a particular file's contents downloaded, and the offsets provide us with the information we need to fetch it if we don't have it. The "digest" field is a digest of the uncompressed file data. The "offset" is the offset of the start of the compressed frame and the "endOffset" is the offset past the end of the frame (i.e. you need to subtract 1 from this when turning it into a Range request). "size" is the size of the uncompressed data, and it's useful for verification.
After decompressing, the data from the given range does not contain any extra padding which might be implied by the tar format (i.e. rounding up to the next 512-byte block). The uncompressed data should match the "size" and "digest".
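Turning an entry into a Range header and checking the downloaded result could look like this (assuming sha256 digests, as in the examples):
package zstdchunked

import (
    "crypto/sha256"
    "fmt"
)

// endOffset points just past the end of the frame and HTTP ranges are
// inclusive, so subtract 1.
func rangeHeader(offset, endOffset int64) string {
    return fmt.Sprintf("bytes=%d-%d", offset, endOffset-1)
}

// verify checks decompressed file data against the manifest entry's
// "size" and "digest".
func verify(uncompressed []byte, size int64, digest string) error {
    if int64(len(uncompressed)) != size {
        return fmt.Errorf("size mismatch: got %d, want %d", len(uncompressed), size)
    }
    if got := fmt.Sprintf("sha256:%x", sha256.Sum256(uncompressed)); got != digest {
        return fmt.Errorf("digest mismatch: got %s, want %s", got, digest)
    }
    return nil
}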
For purposes of incremental downloads, we really only need the entries of "type": "reg" with a non-zero "size" and "digest", plus "offset" and "endOffset". Those are the entries that will let us find our file content.
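A sketch of that filtering step, reusing the toc types from the manifest sketch above and indexing by name (which is how the tarsplit file will refer back to these entries):
package zstdchunked

// contentEntries reduces the manifest to the entries whose content we may
// need to fetch: regular files with actual data.
func contentEntries(t *toc) map[string]tocEntry {
    byName := make(map[string]tocEntry)
    for _, e := range t.Entries {
        if e.Type == "reg" && e.Size > 0 && e.Digest != "" {
            byName[e.Name] = e
        }
    }
    return byName
}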
Chunks
The "type": "chunk"
entries contain information about individual file chunks. It's not specified which algorithm is
used to determine the chunking, but containers/storage uses a rolling checksum
approach. In an addition to the original format, zstd:chunked can also contain "chunkType": "zeros"
chunks which are
effectively sparse "holes".
The chunk format is not described here because it's not implemented yet: the most obvious approach to doing so would require duplicating file content on disk (chunked and merged) and it's not clear if it's worth it to save a bit of extra downloading.
The "tarsplit" file format
This is the JSON-lines format produced by the tar-split utility used in the implementation of podman, explaining its presence in this file format.
The purpose of this format is to store enough metadata to allow reconstructing the original content of a .tar file which was previously unpacked, assuming we still have the unpacked contents around (referred to by their filenames). The inclusion of this file essentially reproduces all of the data in the tar headers for a third time.
It looks something like this:
{"type":2,"payload":"ZXRjL1BheEhlYWRlcnMuMAAA...","position":0}
{"type":1,"name":"etc/","payload":null,"position":1}
{"type":2,"payload":"ZXRjL2Fzb3VuZC5jb25mAAA...=","position":2}
{"type":1,"name":"etc/asound.conf","size":55,"payload":"tAt3+IpQDrE=","position":3}
{"type":2,"payload":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA...==","position":4}
where each line contains one dictionary which is either "type": 1 or "type": 2.
The "type 1" items are references to file content. In the originally-conceived use of the tool, these would be files
that you could read from the filesystem to get the content to include back in the reconstructed tar stream. In our
case, we can find those files in the manifest (along with the information about where we can fetch them and if we need
to). The important key here is "name"
which is the filename: it's exactly the same name as in the manifest. These
entries are really only useful for regular files. You can identify that case by the presence of a "size"
field: the
uncompressed file size (which should match the same key in the manifest). The "payload"
field on "type 1" entries
contains a base64-encoded crc64 of the file contents.
The "type 2" items are inline content used to encode non-file data. The tar headers end up reproduced in these. It's worth noting that padding is included in these items. That means that the payload of a "type 2" entry following a "type 1" entry that doesn't have a multiple-of-512 file size will start with padding (seen as "AAAA..." in the above example).
Putting it all together
The first step is to get the compressed form of the manifest and the tarsplit JSON files. You can use the OCI annotations if you have them, or do a "suffix" range request for the footer.
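A suffix range request just asks for the last 72 bytes of the blob, for example (again, a real registry client would also need authentication):
package zstdchunked

import (
    "fmt"
    "io"
    "net/http"
)

// fetchFooter retrieves the trailing 72 bytes, ready to pass to parseFooter.
func fetchFooter(url string) ([]byte, error) {
    req, err := http.NewRequest(http.MethodGet, url, nil)
    if err != nil {
        return nil, err
    }
    req.Header.Set("Range", "bytes=-72") // a suffix range: the last 72 bytes
    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusPartialContent {
        return nil, fmt.Errorf("expected 206 Partial Content, got %s", resp.Status)
    }
    return io.ReadAll(resp.Body)
}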
Once you have the two JSON files you need to decompress them.
Rebuilding the .tar file contents then essentially works by iterating over the lines in the tarsplit file. For each "type 2" entry, you simply output the inline data. For each "type 1" entry, you look up the equivalent entry in the manifest, check if you already have an object with that checksum, use the range information to download it if not, and then output its content.
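Put together with the earlier sketches, the loop might look something like this, where fetchAndVerify stands in for "return the object if we already have it locally, otherwise download the range and verify it":
package zstdchunked

import (
    "fmt"
    "io"
)

func rebuildTar(w io.Writer, tarSplit []tarSplitEntry, byName map[string]tocEntry,
    fetchAndVerify func(tocEntry) ([]byte, error)) error {
    for _, e := range tarSplit {
        switch e.Type {
        case 2: // inline data: tar headers and padding, emitted verbatim
            if _, err := w.Write(e.Payload); err != nil {
                return err
            }
        case 1: // file reference: emit the (possibly freshly downloaded) content
            if e.Size == 0 {
                continue // nothing to output for non-regular or empty files
            }
            entry, ok := byName[e.Name]
            if !ok {
                return fmt.Errorf("no manifest entry for %q", e.Name)
            }
            data, err := fetchAndVerify(entry) // the crc64 in e.Payload is ignored here
            if err != nil {
                return err
            }
            if _, err := w.Write(data); err != nil {
                return err
            }
        }
    }
    return nil
}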
In all cases I've seen, the entries in the tarstream and the manifest are in the same order, so it's probably possible to create an efficient implementation that decompresses them in parallel, keeping them in sync with each other and avoiding ever having to fully extract either file. In practice, the total amount of data in the manifest is relatively small, and extracting it to some form of a lookup table is probably a more reliable (and easier) approach.
Making individual HTTP requests for each range as we require it is probably not a reasonable idea: there's too much overhead and latency. HTTP/2 and HTTP/3 improve things by allowing massive levels of concurrency over a single connection (saving handshaking overhead) but the overhead of the headers is still significant. A better approach is to pre-filter the list of digests, removing the ones we already have, and fetching many ranges per request.
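For example, a single Range header can carry many byte ranges, in which case the server replies with a multipart/byteranges body (parsing of which is not shown here); whether a given registry actually supports multi-range requests varies, so treat this as a sketch:
package zstdchunked

import (
    "fmt"
    "strings"
)

// multiRangeHeader builds one Range header value covering many entries.
func multiRangeHeader(entries []tocEntry) string {
    parts := make([]string, 0, len(entries))
    for _, e := range entries {
        parts = append(parts, fmt.Sprintf("%d-%d", e.Offset, e.EndOffset-1))
    }
    return "bytes=" + strings.Join(parts, ",")
}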