From Zero to Zarr¶
From a NumPy array to stored bytes, chunk by chunk.
This page is for people who are new to Zarr. You don't need to know NumPy, HDF5, or anything about file formats. We begin with why Zarr exists, then build up the how one idea at a time, until you understand how Zarr stores an array, why that layout is defined by a written specification, and how a library turns those stored bytes back into an array you can use.
The page comes in three parts:
- Part I: The core idea. The happy path, with pictures and no code.
- Part II: Under the hood. A few deeper sections that go off the happy path. Each one is signposted, so you can read on or skip ahead.
- Part III: Seeing it for real. A short hands-on section with runnable code that ties everything together.
But before the how, a word on the why.
Why we need Zarr¶
In short
Modern instruments and simulations produce arrays of numbers far too big to fit in memory, and that data needs to be stored durably and shared widely. Zarr stores these giant arrays so a reader can cheaply fetch just the piece they want, and so the work of reading them can run in parallel across many CPU cores.
Across science and industry, our instruments and simulations have become extraordinary firehoses of numbers. A satellite streams images of the Earth. A microscope captures gigapixel scans. A gene sequencer reads thousands of genomes. A climate model writes out temperature and wind for every point on the globe, hour after hour. In each case the result has the same shape: a vast grid of numbers, far more than fits in any single computer's memory, and often arriving as a continuous stream.
That data is worth little sitting on one machine. It has to be stored somewhere durable and shareable, so that many people (often scattered across the world) can read and analyze it. Increasingly that somewhere is cloud object storage (such as Amazon S3, Google Cloud Storage, or Azure Blob Storage): cheap, effectively unlimited, and reachable from anywhere. But sheer size makes this hard. Nobody wants to download terabytes just to inspect one corner. What's needed is a way to store these giant grids so a reader can efficiently and cheaply fetch just the piece they want.
Zarr was built to solve exactly this, though it didn't begin with the cloud. It grew out of genomics. Around 2015, Alistair Miles needed to analyze arrays of genetic variation across thousands of malaria-carrying mosquitoes (the Anopheles gambiae 1000 Genomes Project), arrays far too big to fit in memory. His real frustration was speed, and to see why, it helps to understand two things the array formats of the day were already doing: chunking and compression.
First, chunking. To store an array bigger than memory, formats like HDF5 and netCDF already split it into blocks (called chunks) and compress each one. That's what lets you read part of an array without loading all of it: you only fetch and decompress the chunks that cover the part you want. None of this is Zarr's invention. Chunking and compression were well-established ideas, and Zarr deliberately reuses them. The catch with the existing tools was speed: decompression takes CPU work, and for a big analysis that scans millions of values, that work adds up fast.
Here's where speed came in. Reading a chunk means decompressing it, so reading many chunks is a pile of independent decompression jobs, exactly the kind of work you'd want to spread across all your CPU cores at once. But the tools of the day wouldn't let him. In Python, the global interpreter lock (GIL) limits how much work threads can do at the same time, so reading through HDF5 couldn't keep all the cores busy. And the other chunked format he tried could split an array along only its first dimension, while scientific arrays usually have several dimensions. His analyses kept needing pieces that cut across those dimensions, and chunking along just one of them made that painfully slow. One core did all the work while the rest sat idle.
So he built Zarr. It didn't introduce new storage concepts so much as recombine familiar ones (chunks, compression, metadata) in a way that frees the CPU cores to work in parallel: cut an array into chunks across all its dimensions at once, not just one, and decompress them concurrently. Now a read becomes many chunk-decompressions running at the same time, across every core on the machine (and, with tools like Dask, across many machines), so an analysis that crunches the whole array finishes in a fraction of the time. (He tells the story in his early Zarr blog posts.)
Storing data in the cloud came later, and turned out to be a superpower. Because each chunk is simply one key/value entry (as we'll see), Zarr maps naturally onto object storage like S3, which made it a backbone of cloud-native science. Today Zarr is used far beyond genomics: in Earth and climate science (satellite imagery and weather and climate model output, via the Pangeo community), bio-imaging (huge microscopy volumes, via OME-Zarr), astronomy, and machine learning, anywhere people wrestle with large, multi-dimensional grids of numbers.
Strip away the domain (mosquitoes, galaxies, hurricanes) and the object at the center is always the same: an array, a big grid of numbers. So that's where we'll begin. In Part I we'll look at what an array is, then at what happens when one grows too big to fit in memory, and build up from there to how Zarr stores it.
Part I: The core idea¶
An overview of arrays¶
The thing Zarr stores is an array: a grid of values that all share a single data type (almost always shortened to dtype, the term we'll use from here on), arranged by a shape.
Here is a small array with 4 rows and 6 columns of 32-bit integers (we use a non-square shape on purpose, so "rows" and "columns" are never ambiguous):
| 0 | 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 | 16 | 17 |
| 18 | 19 | 20 | 21 | 22 | 23 |
shape = (4, 6) (4 rows, 6 columns); dtype = int32 (every value is a 32-bit integer).
A few terms we'll use throughout:
- dtype: every element has the same type. Here it's a 32-bit integer, so each value takes exactly 4 bytes. A uniform type means the computer knows precisely how many bytes each value occupies, and where each one lives.
- shape: the size along each dimension. Ours is (4, 6): the first number is rows, the second is columns.
- ndim: the number of dimensions (axes). Ours is 2. Arrays can be 0-D, 1-D, 2-D, 3-D, or more.
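As a quick illustration of those three terms, here is a minimal sketch that builds the same grid with NumPy and prints its properties. It assumes NumPy is installed; the variable name `a` is just for illustration.

```python
import numpy as np

# Build the 4x6 grid of 32-bit integers shown above.
a = np.arange(24, dtype="int32").reshape(4, 6)

print(a.shape)     # (4, 6): 4 rows, 6 columns
print(a.dtype)     # int32
print(a.ndim)      # 2
print(a.itemsize)  # 4 bytes per value
print(a.nbytes)    # 96 bytes in total (24 values x 4 bytes)
```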
Why "contiguous memory" matters¶
That grid is a convenient picture. Underneath, the array is one contiguous block of memory: a single run of bytes, with the values laid out row by row (row 0, then row 1, and so on). This is called row-major, or C order ("C" because the C programming language lays out arrays this way, and NumPy follows the same convention by default). The alternative is column-major, or F order ("F" for Fortran, which stores arrays column by column); a few tools, such as MATLAB and R, use it. The two simply disagree on which direction to walk the grid when flattening it into memory. Zarr's default is C order.
| 0 | 1 | 2 | 3 | 4 | 5 | … | 18 | 19 | 20 | 21 | 22 | 23 |
This layout is not just trivia; it has real consequences:
- Reading a whole row is fast: the values are already next to each other, so it's one smooth sequential scan of memory.
- Reading a whole column is slower: the values are far apart (column 0 is at positions 0, 6, 12, 18), so the computer has to hop around in a strided access that is much less friendly to memory and caches.
Keep this in mind. The fact that an array is a contiguous, row-major block is exactly what Zarr has to wrestle with once we start chopping arrays into pieces.
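The row-versus-column difference comes down to simple arithmetic. The sketch below uses plain Python to compute where each element of our 4×6 array sits in the flat, row-major block; `flat_index` is a helper name made up for this example.

```python
# Row-major (C order) layout for a 2-D array: rows are laid end to end.
N_ROWS, N_COLS = 4, 6


def flat_index(row, col, n_cols=N_COLS):
    """Position of element (row, col) in the flat, row-major block."""
    return row * n_cols + col


# A whole row occupies consecutive positions...
row_1 = [flat_index(1, c) for c in range(N_COLS)]
# ...while a whole column is spread out, one row-length apart.
col_0 = [flat_index(r, 0) for r in range(N_ROWS)]

print(row_1)  # [6, 7, 8, 9, 10, 11]
print(col_0)  # [0, 6, 12, 18]
```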
Slicing¶
To read part of an array, you slice it, selecting a rectangular region.
Asking for rows 1–2 and columns 2–4, which Python and NumPy write as a[1:3, 2:5],
picks out the shaded block:
| 0 | 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 | 16 | 17 |
| 18 | 19 | 20 | 21 | 22 | 23 |
a[1:3, 2:5] selects the shaded region (rows 1–2, columns 2–4).
That start:stop bracket notation is Python and NumPy's; other languages and Zarr
implementations express the same idea with their own syntax (Rust's s![1..3, 2..5],
for instance). The notation isn't part of Zarr; what matters here is the universal
concept: picking out a sub-region of the array.
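For readers who want to try the slice themselves, here is a small sketch using NumPy (assumed to be installed). It rebuilds the 4×6 grid and selects the same region.

```python
import numpy as np

a = np.arange(24, dtype="int32").reshape(4, 6)

# Rows 1-2 and columns 2-4: the stop index of each range is excluded.
block = a[1:3, 2:5]
print(block.shape)     # (2, 3)
print(block.tolist())  # [[8, 9, 10], [14, 15, 16]]
```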
When an array outgrows memory¶
The catch with an in-memory array is right there in the name: it lives in memory, and memory runs out. As we saw, real datasets are routinely too big to fit in RAM, need to outlive the program that created them, and must be shared so others can read even a single corner without copying the whole thing.
An array that large is never held in memory all at once; it's written out a piece at a time (more on that in Part II). To make that possible, Zarr starts with one simple idea: don't store the array as a single blob. Split it up.
Chunking: splitting the grid into blocks¶
To store an array that may be enormous, Zarr first cuts the grid into a regular grid of equal-sized blocks called chunks. You choose the chunk shape: the shape of one block.
Let's chunk our 4×6 array with a chunk shape of (2, 3): 2 rows by 3 columns.
That divides it evenly into four chunks. Crucially, the chunks aren't a flat list:
they tile the array, so they form a grid of their own, the chunk grid. Here
the chunk grid has shape (2, 2): two rows and two columns of chunks. Notice the
position labels match the original array: chunk (0, 1) sits top-right, holding
the array's top-right block:
chunk (0, 0):
| 0 | 1 | 2 |
| 6 | 7 | 8 |
chunk (0, 1):
| 3 | 4 | 5 |
| 9 | 10 | 11 |
chunk (1, 0):
| 12 | 13 | 14 |
| 18 | 19 | 20 |
chunk (1, 1):
| 15 | 16 | 17 |
| 21 | 22 | 23 |
A (4, 6) array with chunk shape (2, 3) forms a chunk grid of shape (2, 2). Don't confuse the two: the chunk shape (2, 3) is the size of each block; the chunk grid shape (2, 2) is how many blocks there are along each axis.
Chunking is the key move. Each chunk can be stored, loaded, and compressed on its own, so a program can read just the chunks it needs (that one corner your colleague wanted) without touching the rest.
Note
We deliberately chose a chunk shape that divides the array evenly. But if every chunk has a fixed shape, how can chunks represent an array whose size isn't evenly divisible by the chunk shape? We answer that in Part II, under "when chunks don't divide evenly".
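The relationship between array shape, chunk shape, and chunk grid shape is plain arithmetic. The following sketch is not zarr-python's code; the helper names `chunk_grid_shape` and `chunk_of` are made up for illustration.

```python
import math


def chunk_grid_shape(array_shape, chunk_shape):
    """Number of chunks along each axis (rounding up at the edges)."""
    return tuple(math.ceil(n / c) for n, c in zip(array_shape, chunk_shape))


def chunk_of(element, chunk_shape):
    """Chunk-grid position of the chunk that holds a given element."""
    return tuple(i // c for i, c in zip(element, chunk_shape))


print(chunk_grid_shape((4, 6), (2, 3)))  # (2, 2)
print(chunk_of((1, 4), (2, 3)))          # (0, 1): the top-right chunk
print(chunk_of((3, 0), (2, 3)))          # (1, 0): the bottom-left chunk
```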
A store is just keys and bytes¶
Where do the chunks go? Into a store. According to the Zarr specification, a store is simply a mapping from keys to values, where a key is a text string and a value is a sequence of bytes. In other words, a store is basically a dictionary: hand it a key, get back some bytes.
That abstraction is deliberately humble, because lots of things can play the role of a store: a directory on your disk (keys are file paths), an object-storage bucket like Amazon S3 (keys are object names), a ZIP file, or even plain memory. Zarr treats them all the same way.
Each cell of the chunk grid becomes one value in the store, under a key built from the cell's position. So the four chunk-grid cells become four key→bytes entries, one per cell:
%%{init: {'flowchart': {'nodeSpacing': 14, 'rankSpacing': 55}}}%%
flowchart LR
G00["chunk (0, 0)"] -->|c/0/0| K00["bytes"]
G01["chunk (0, 1)"] -->|c/0/1| K01["bytes"]
G10["chunk (1, 0)"] -->|c/1/0| K10["bytes"]
G11["chunk (1, 1)"] -->|c/1/1| K11["bytes"]
Where does a key like c/0/1 come from? Each array's metadata picks a chunk key
encoding: a rule for turning a chunk's grid position into a key. The default
(chunk key encoding)
works like this: start with the literal prefix c (short for "chunk"), then
append the chunk's grid indices, one per dimension, separated by /. So the chunk
in grid row 1, column 0 becomes c/1/0; for a 3-D array, a chunk at grid
position (2, 0, 1) becomes c/2/0/1. The separator is configurable (. is the
other common choice), and (as we'll see next) the array records which scheme it
uses, so any reader reconstructs exactly the same keys.
That's the whole trick: a big array becomes a handful of key/value entries that any storage system capable of "save these bytes under this name" can hold.
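To make the key rule concrete, here is a sketch that models a store as a plain Python dictionary and builds keys the way the default encoding is described above. `chunk_key` is a hypothetical helper, and the byte strings are placeholders, not real encoded chunks.

```python
def chunk_key(grid_position, separator="/"):
    """Build a store key: the prefix "c", then one index per dimension."""
    return separator.join(["c", *(str(i) for i in grid_position)])


# A store modelled as a dictionary: text keys mapped to bytes values.
store = {}
for position in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    store[chunk_key(position)] = b"placeholder bytes"

print(sorted(store))           # ['c/0/0', 'c/0/1', 'c/1/0', 'c/1/1']
print(chunk_key((2, 0, 1)))    # c/2/0/1
print(chunk_key((1, 0), "."))  # c.1.0
```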
Metadata: making the bytes meaningful¶
A pile of chunk blobs is meaningless on its own. If all you have is the bytes
under c/0/1, how would you know they're 32-bit integers, how big the array is,
or how the chunks tile together?
Zarr answers this with metadata: a small JSON document, stored in the same
store under the key zarr.json, that describes the array. Among its
fields:
- shape: the array's overall shape, e.g. [4, 6].
- data_type: the dtype, e.g. int32.
- chunk_grid: how the array is divided into a regular grid of chunks. Nested inside it is the chunk_shape, the shape of a single chunk, e.g. [2, 3]. (Note the number of chunks along each axis, the chunk grid's own shape (2 × 2 here), isn't stored; it's computed from the array shape and the chunk shape.)
- chunk_key_encoding: the rule that turns chunk positions into keys (the c/0/1 scheme just described, including the separator).
- fill_value: the value for parts of the array that were never written (the spec calls these "uninitialised portions"). More on this in Part II.
- codecs: the codecs (a codec is a coder/decoder: it encodes a chunk's values into stored bytes, and decodes them back) used to turn each chunk's values into the bytes saved in the store. More on this in Part II.
The metadata is the legend that turns anonymous bytes back into your array. We'll
look at a real zarr.json in Part III.
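Here is a small sketch of why the metadata matters. It takes some raw bytes standing in for the value under c/0/1 and uses a trimmed-down metadata document to turn them back into a 2×3 block. The bytes are assumed to be uncompressed and little-endian, and `STRUCT_CODES` is a hypothetical lookup table invented for this example; a real reader would run the full codec pipeline instead.

```python
import json
import struct

# A trimmed-down metadata document with just the fields used below.
metadata = json.loads("""
{
  "shape": [4, 6],
  "data_type": "int32",
  "chunk_grid": {"name": "regular", "configuration": {"chunk_shape": [2, 3]}}
}
""")

# Raw little-endian bytes standing in for the value stored under c/0/1.
raw = struct.pack("<6i", 3, 4, 5, 9, 10, 11)

# Hypothetical table from data_type names to struct format codes.
STRUCT_CODES = {"int32": "i"}

n_rows, n_cols = metadata["chunk_grid"]["configuration"]["chunk_shape"]
code = STRUCT_CODES[metadata["data_type"]]
values = struct.unpack(f"<{n_rows * n_cols}{code}", raw)

# Fold the flat values back into rows, using the chunk shape.
chunk = [list(values[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]
print(chunk)  # [[3, 4, 5], [9, 10, 11]]
```

Without the data_type and chunk_shape fields, the same 24 bytes could be read just as plausibly as 12 16-bit values or 3 64-bit ones.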
The role of the specification¶
Here's the part that surprises newcomers: none of that layout (the zarr.json
fields, the c/0/1 key names, the way chunks are encoded) was invented by any
particular library. It is defined by the
Zarr specification,
a written, public standard.
Because the format is specified independently of any one library, any implementation can read and write it. An array written from Python can be read by an implementation in Rust, JavaScript, or C++, because they all agree on the same spec. This page's examples use zarr-python, but the same data is understood by zarrs (Rust), zarrita.js (JavaScript), TensorStore (C++), and more.
You can read the standard at zarr-specs.readthedocs.io. These examples use Zarr format 3, the current default. An older format, version 2, uses a slightly different on-disk layout; see the v3 migration guide if you meet it.
Part II: Under the hood¶
You now have the core mental model: arrays become chunks, chunks become key/value entries, and metadata explains it all, exactly as the spec prescribes. The next few sections go a little deeper, off the happy path. None of it changes the big picture; it just explains the machinery. Read on if you're curious; skip to Part III if you'd rather see it in action.
Going deeper: how chunks meet memory¶
Remember that an array lives in memory as one contiguous, row-major block. Here's the catch: the values belonging to a single chunk are not next to each other in that block.
Look again at chunk (0, 0): values 0, 1, 2, 6, 7, 8. In the array's flat
memory, those sit in two separate runs, because columns 3, 4, 5 of each row fall
in between:
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | … |
So writing a chunk isn't a straight copy. Zarr must gather the chunk's scattered values into the chunk's own small contiguous block, then encode and store it. Reading does the reverse: decode the chunk into its compact block, then scatter the values back into the right positions of your result array.
| 0 | 1 | 2 | 3 4 5 | 6 | 7 | 8 |
| 0 | 1 | 2 | 6 | 7 | 8 |
This gather/scatter isn't stated in the spec; it's a direct consequence of two spec rules working together: chunks form a regular grid, and a chunk's values are serialized in row-major order. Two practical effects follow, and they're worth remembering when you choose a chunk shape:
- Reading amplifies. To return a slice, Zarr reads every chunk the slice touches, decodes each one completely, and then extracts the part you asked for. Ask for a single value in a million-element chunk, and the whole chunk is still read and decoded.
- Unaligned writing is expensive. If you write a region that doesn't line up with chunk boundaries, Zarr must first read the affected edge chunks, modify the overlapping part, and write them back (a read-modify-write). Writing whole, chunk-aligned regions avoids that round trip.
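The gather and scatter steps can be sketched in a few lines of plain Python. This is an illustration of the idea on our 4×6 array, not how any Zarr implementation is actually written; the function names are made up.

```python
N_ROWS, N_COLS = 4, 6
CHUNK_ROWS, CHUNK_COLS = 2, 3

flat = list(range(N_ROWS * N_COLS))  # the array's row-major memory


def chunk_positions(chunk_row, chunk_col):
    """Flat positions, in row-major order, of one chunk's values."""
    return [
        r * N_COLS + c
        for r in range(chunk_row * CHUNK_ROWS, (chunk_row + 1) * CHUNK_ROWS)
        for c in range(chunk_col * CHUNK_COLS, (chunk_col + 1) * CHUNK_COLS)
    ]


def gather(memory, chunk_row, chunk_col):
    """Copy a chunk's scattered values into one small contiguous list."""
    return [memory[p] for p in chunk_positions(chunk_row, chunk_col)]


def scatter(chunk, memory, chunk_row, chunk_col):
    """Copy a decoded chunk's values back to their places in the array."""
    for p, value in zip(chunk_positions(chunk_row, chunk_col), chunk):
        memory[p] = value


print(gather(flat, 0, 0))  # [0, 1, 2, 6, 7, 8]

# Rebuild an empty array chunk by chunk; the result matches the original.
result = [None] * len(flat)
for cr in range(2):
    for cc in range(2):
        scatter(gather(flat, cr, cc), result, cr, cc)
print(result == flat)  # True
```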
Going deeper: when chunks don't divide evenly¶
Our 4×6 array split cleanly into 2×3 chunks. Real arrays rarely cooperate. What if the array has 5 rows and we keep a chunk height of 2?
The chunk grid simply rounds up: 5 rows with a chunk height of 2 gives row-chunks covering rows 0–1, 2–3, and 4. That last row-chunk only has one real row, but, per the spec, border chunks are always stored at full size. The cells beyond the array's edge are unused; the spec recommends writing the fill value into them.
chunk (2, 0):
| 24 | 25 | 26 |
| fill | fill | fill |
chunk (2, 1):
| 27 | 28 | 29 |
| fill | fill | fill |
So a 5×6 array chunked at (2, 3) quietly stores a row of "phantom" cells holding
the fill value. It's harmless, but it's a small waste, and a good reason to pick a
chunk shape that fits your array's real shape reasonably well. When a shape really
can't fit a regular grid, the
rectilinear chunk grid extension
allows chunks of differing sizes. (For practical guidance on choosing chunk shapes,
see Performance.)
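The cost of an uneven fit is easy to compute. This sketch counts the phantom cells for the 5×6 example and builds one padded border chunk. It assumes a fill value of 0 and values numbered row by row as in the earlier pictures; `border_chunk` is a made-up helper.

```python
import math

FILL = 0  # assumed fill value for this sketch

array_shape, chunk_shape = (5, 6), (2, 3)
grid_shape = tuple(math.ceil(n / c) for n, c in zip(array_shape, chunk_shape))

real_cells = math.prod(array_shape)
stored_cells = math.prod(grid_shape) * math.prod(chunk_shape)

print(grid_shape)                 # (3, 2)
print(stored_cells - real_cells)  # 6 phantom cells


def border_chunk(chunk_row, chunk_col):
    """Values of one chunk, with cells beyond the array's edge set to FILL."""
    rows = range(chunk_row * chunk_shape[0], (chunk_row + 1) * chunk_shape[0])
    cols = range(chunk_col * chunk_shape[1], (chunk_col + 1) * chunk_shape[1])
    return [
        [
            r * array_shape[1] + c
            if r < array_shape[0] and c < array_shape[1]
            else FILL
            for c in cols
        ]
        for r in rows
    ]


print(border_chunk(2, 0))  # [[24, 25, 26], [0, 0, 0]]
```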
Going deeper: codecs (how values become bytes)¶
One more thing happens inside each chunk. The bytes stored under c/0/1 aren't
necessarily the raw values; they're produced by a codec pipeline, a small
ordered assembly line recorded in the metadata. The
specification
defines three kinds of codec, applied in this order:
- array → array codecs (optional, any number): rearrange or transform the values; e.g. a transpose codec changes their order.
- array → bytes codec (exactly one, always required): turns the array of values into a flat sequence of bytes. By default (the bytes codec) it writes them in lexicographical order, which the spec notes is C / row-major order.
- bytes → bytes codecs (optional, any number): transform the bytes; e.g. compression to shrink them, or a checksum to detect corruption.
Because the metadata records the exact pipeline, any spec-compliant reader knows precisely how to run it in reverse and decode a chunk back into values. Per-chunk compression is a big part of why Zarr can store enormous arrays efficiently while staying readable everywhere.
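A two-stage pipeline can be sketched with the standard library alone. Below, `struct` plays the array → bytes step (row-major, little-endian int32) and `zlib` plays a bytes → bytes compressor. Note that `zlib` is only a stand-in chosen because it ships with Python; it is not the compressor a real Zarr array would necessarily use.

```python
import struct
import zlib

chunk = [[3, 4, 5], [9, 10, 11]]  # one 2x3 chunk of int32 values


def encode(chunk_rows):
    # array -> bytes: flatten in row-major order, 4 little-endian bytes each.
    flat = [v for row in chunk_rows for v in row]
    raw = struct.pack(f"<{len(flat)}i", *flat)
    # bytes -> bytes: compress (zlib is a stand-in for the real compressor).
    return zlib.compress(raw)


def decode(stored, n_rows, n_cols):
    # Run the pipeline in reverse: decompress, then rebuild the values.
    raw = zlib.decompress(stored)
    flat = struct.unpack(f"<{n_rows * n_cols}i", raw)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


stored = encode(chunk)
print(decode(stored, 2, 3) == chunk)  # True
```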
Going deeper: sharding (when there are too many chunks)¶
So far, each chunk has been its own store object: one key, one value. That's simple, but it has a limit: small chunks in a very large array produce a huge number of chunks, and therefore a huge number of files or objects. The spec notes this is exactly where file systems (block sizes, inode limits) and object stores start to struggle. On object storage the limit is often about cost as much as performance: many providers bill per request, so millions of tiny objects mean millions of billable operations.
Sharding is the fix, and it adds one layer to the picture. Instead of writing every chunk as a separate object, Zarr can pack a block of neighboring chunks into a single store object called a shard. Inside a shard, the chunks are written one after another, followed by an index recording each chunk's byte offset and length. That index is the clever part: because it records exactly where each chunk sits, a reader can still pull out a single chunk without decoding the whole shard. The chunk shape must divide the shard shape evenly, so a shard always holds a whole number of chunks. The layering becomes array → shards (one object per store key) → chunks:
flowchart LR
subgraph shard ["one shard = one store key (e.g. c/0/0)"]
direction TB
subgraph inner ["chunks packed inside"]
direction LR
IC0["chunk"]
IC1["chunk"]
IC2["chunk"]
IC3["chunk"]
end
IDX["index: offset + length of each chunk"]
end
inner --> IDX
The shard truth
Sharding gives you the best of both worlds: far fewer objects in the
store, but still fine-grained, single-chunk reads within them. The one
thing to keep straight is what now occupies a single store object: without
sharding, one chunk is one stored object; with sharding, the stored
object is the shard, and chunks become pieces inside it. In zarr-python
you set both shapes explicitly: chunks= for the small pieces and shards=
for the bundle. (The formal
specification
calls those small pieces inner chunks; this page just calls them chunks, but
they're the same thing.) You'll also hear "shards are the unit of writing,
chunks are the unit of reading". That's handy guidance, though the spec only
defines the on-disk layout that makes partial reads possible, not a hard rule
about write granularity.
Where does all this get recorded? Reassuringly, sharding adds no new metadata
files: the array still has its single zarr.json. Sharding is simply one of the
codecs from the previous section: a
sharding_indexed
codec in the array's codecs list. The array's chunk shape in chunk_grid becomes the shard shape
(the unit that maps to one store key), while the inner chunk shape sits inside
that codec's own configuration. The shard's index (the offsets and lengths
that locate each inner chunk) isn't in zarr.json at all; it's written inside
each shard object itself, as a small footer (by default at the end). So zarr.json
describes how shards are built, and every shard then carries its own little map to
the chunks within it.
In zarr.json, the sharding_indexed entry in the array's codecs list looks roughly
like this:
{
  "name": "sharding_indexed",
  "configuration": {
    "chunk_shape": [2, 3],
    "codecs": [ ... ],
    "index_codecs": [ ... ],
    "index_location": "end"
  }
}
Two parts are worth recognising. The inner chunk_shape is the size of the chunks
packed inside each shard. And index_location tells a reader where in the shard
to find the index: "end" means the footer described above (it can also be
"start"). The elided codecs and index_codecs lists simply record how the
chunks and the index itself are encoded. Because the sharding_indexed codec is
part of the Zarr specification, any implementation that understands it can open
the shard.
And the index inside a shard is, logically, just a small table: one row per inner chunk, giving where that chunk starts and how long it is:
| inner chunk | byte offset | byte length |
|---|---|---|
| (0, 0) | 0 | 33 |
| (0, 1) | 33 | 33 |
| (1, 0) | 66 | 33 |
| (1, 1) | 99 | 33 |
For the hands-on side of sharding, see Sharding in the array guide.
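To see how an offset-and-length index lets a reader pull one chunk out of a shard, here is a simplified sketch. It assumes the index is a footer of two unsigned 64-bit little-endian integers per inner chunk, with no checksum; a real shard encodes its index through the index_codecs recorded in the metadata, so treat this as an illustration of the idea only. The function names are made up.

```python
import struct

# Four already-encoded inner chunks (placeholder bytes of differing lengths).
inner_chunks = [b"aaaa", b"bb", b"cccccc", b"d"]


def build_shard(chunks):
    """Concatenate the chunks, then append an (offset, length) index footer."""
    body, index = b"", []
    for data in chunks:
        index.append((len(body), len(data)))
        body += data
    footer = b"".join(struct.pack("<QQ", off, size) for off, size in index)
    return body + footer


def read_inner_chunk(shard, n_chunks, i):
    """Fetch inner chunk i using only the index and that chunk's bytes."""
    footer = shard[-16 * n_chunks:]  # 16 bytes (two uint64) per inner chunk
    offset, length = struct.unpack_from("<QQ", footer, 16 * i)
    return shard[offset:offset + length]


shard = build_shard(inner_chunks)
print(read_inner_chunk(shard, 4, 2))  # b'cccccc'
```

On an object store, the two slices in `read_inner_chunk` would correspond to two byte-range requests: one for the footer, one for the chunk.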
Going deeper: groups (organizing many arrays)¶
Real datasets usually hold more than one array. Because store keys are just
strings, they can contain /, which lets Zarr nest things into a hierarchy,
much like folders and files. A group is a node that can contain arrays and
other groups.
flowchart TD
root["/ (root group)"]
root --> temp["temperature (array)"]
root --> grp["measurements (group)"]
grp --> hum["humidity (array)"]
grp --> pressure["pressure (array)"]
Here's the key insight: there are no real folders. The hierarchy is an
illusion created entirely by the key names. Every node (each group and each
array) has its own zarr.json under a key prefixed by its path, and an array's
chunk keys carry the same prefix. The tree above is just these flat keys in the
store:
zarr.json ← root group metadata
temperature/zarr.json ← array metadata
temperature/c/0/0 ← a chunk of "temperature"
temperature/c/0/1
measurements/zarr.json ← subgroup metadata
measurements/humidity/zarr.json ← array metadata
measurements/humidity/c/0/0 ← a chunk of "humidity"
measurements/pressure/zarr.json
measurements/pressure/c/0/0
Reading measurements/humidity just means looking up the keys that start with that
prefix. So nothing new is needed to support hierarchies: the same simple rules
(keys, bytes, and metadata) scale from a single array up to a richly structured
dataset, and they map just as naturally onto a flat object store (where keys never
were folders) as onto a directory on disk. See Groups to work with
them.
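The "hierarchy from key names" idea can be sketched with a dictionary holding the flat keys listed above. The `children` helper is hypothetical; it finds a group's direct members purely by string prefix, which is the point of the illustration.

```python
# The flat keys from the listing above; values are placeholders.
store = {
    "zarr.json": b"{}",
    "temperature/zarr.json": b"{}",
    "temperature/c/0/0": b"",
    "temperature/c/0/1": b"",
    "measurements/zarr.json": b"{}",
    "measurements/humidity/zarr.json": b"{}",
    "measurements/humidity/c/0/0": b"",
    "measurements/pressure/zarr.json": b"{}",
    "measurements/pressure/c/0/0": b"",
}


def children(store, group_path=""):
    """Names of the nodes directly inside a group, found by key prefix alone."""
    prefix = group_path + "/" if group_path else ""
    names = set()
    for key in store:
        if key.startswith(prefix):
            rest = key[len(prefix):]
            # A direct child is any name that has its own zarr.json.
            if rest.endswith("/zarr.json") and rest.count("/") == 1:
                names.add(rest.split("/")[0])
    return sorted(names)


print(children(store))                  # ['measurements', 'temperature']
print(children(store, "measurements"))  # ['humidity', 'pressure']
```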
Going deeper: more than two dimensions¶
We've stayed in 2-D to keep the pictures simple, but nothing about Zarr is limited
to two dimensions: the whole model scales to any number of axes, with one more
index at each step. An N-dimensional array's shape and chunk shape each gain an
axis, its chunk grid becomes N-dimensional, and each chunk key carries N
indices. The store, the metadata, and the codecs are all unchanged.
This matters because real data is usually more than 2-D. A short video clip is 3-D (frames × height × width); an RGB image is 3-D (height × width × colour channel); a microscope scan or a CT volume is a stack of 2-D slices; and a climate field recorded over time is (time × latitude × longitude). Many datasets have four or more axes.
To see the generalisation concretely, picture a 3-D array as a stack of 2-D
arrays. Here are two 4×6 layers stacked into a (2, 4, 6) array (think of them
as two time steps):
layer 0:
| 0 | 1 | 2 | 3 | 4 | 5 |
| 6 | 7 | 8 | 9 | 10 | 11 |
| 12 | 13 | 14 | 15 | 16 | 17 |
| 18 | 19 | 20 | 21 | 22 | 23 |
layer 1:
| 24 | 25 | 26 | 27 | 28 | 29 |
| 30 | 31 | 32 | 33 | 34 | 35 |
| 36 | 37 | 38 | 39 | 40 | 41 |
| 42 | 43 | 44 | 45 | 46 | 47 |
Everything you've already learned carries straight over, just with that extra index:
- Chunk shape gains an axis. A chunk shape of (1, 2, 3) keeps each layer separate (depth 1) and splits each layer into the same 2×3 blocks as before.
- The chunk grid is now 3-D: shape (2, 2, 2), two layers × two chunk-rows × two chunk-columns = eight chunks.
- Keys gain an index. The chunk at grid position (layer, row, col) = (1, 0, 1) is stored under the key c/1/0/1.
- Slicing gains an index too: a[0] selects the whole first layer, and a[0, 1:3, 2:5] selects a region within it.
Adding dimensions changes the numbers (more entries in the shape, more indices in each key) but not the model. And these higher-dimensional arrays are often exactly the ones too big for memory, which brings us to the last piece.
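The 3-D numbers above can be checked with a short sketch. It computes the chunk grid shape for the (2, 4, 6) array and enumerates every chunk key, assuming the default "/" separator.

```python
import itertools
import math

array_shape = (2, 4, 6)  # layers x rows x columns
chunk_shape = (1, 2, 3)

grid_shape = tuple(math.ceil(n / c) for n, c in zip(array_shape, chunk_shape))

# Every chunk-grid position, turned into a key with the default "/" scheme.
keys = [
    "/".join(["c", *map(str, position)])
    for position in itertools.product(*(range(n) for n in grid_shape))
]

print(grid_shape)         # (2, 2, 2)
print(len(keys))          # 8
print(keys[0], keys[-1])  # c/0/0/0 c/1/1/1
print("c/1/0/1" in keys)  # True
```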
Going deeper: working with data bigger than memory¶
Back to the question from Part I: how do you handle an array that's too big for RAM? The answer falls out of everything above. Creating a Zarr array doesn't allocate the whole thing; it just writes the metadata and prepares an (empty) store. You then fill the array a region at a time, and each write only needs that region in memory:
- read or generate one block of data (say, a few chunks' worth),
- write it to the corresponding slice of the array,
- discard it, and move on to the next block.
Because you only need to hold one block in memory at a time, the array on disk can be far larger than your RAM. Writing chunk-aligned blocks keeps each write cheap (no read-modify-write, as we saw earlier). This is also how data streaming in from instruments or simulations gets persisted: block by block, as it arrives. Tools like Dask automate this, computing and writing many chunks in parallel. For the practical recipes, see Optimizing performance and Working with arrays.
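The block-by-block loop can be sketched without any Zarr code at all. In the illustration below, a dictionary stands in for the array on disk and `make_block` stands in for whatever reads or generates your data; both are made up for this example. With zarr-python, the write step would instead be an assignment to a slice of the array, as shown in Part III.

```python
def aligned_regions(array_shape, chunk_shape):
    """Yield one (row_slice, col_slice) per chunk-aligned block (2-D only)."""
    n_rows, n_cols = array_shape
    c_rows, c_cols = chunk_shape
    for r in range(0, n_rows, c_rows):
        for c in range(0, n_cols, c_cols):
            yield (
                slice(r, min(r + c_rows, n_rows)),
                slice(c, min(c + c_cols, n_cols)),
            )


def make_block(rows, cols, n_cols=6):
    """Stand-in for reading or generating one block of data on demand."""
    return [
        [r * n_cols + c for c in range(cols.start, cols.stop)]
        for r in range(rows.start, rows.stop)
    ]


written = {}  # stand-in for the array on disk
for rows, cols in aligned_regions((4, 6), (2, 3)):
    block = make_block(rows, cols)  # only this block is built at a time
    written[(rows.start, cols.start)] = block  # "write" it to its region
    del block  # drop our reference and move on to the next block

print(sorted(written))  # [(0, 0), (0, 3), (2, 0), (2, 3)]
print(written[(2, 3)])  # [[15, 16, 17], [21, 22, 23]]
```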
Part III: Seeing it for real¶
Enough concepts: let's watch the machinery run. We'll create the exact (4, 6)
array chunked at (2, 3) from Part I, then inspect what Zarr actually wrote.
First, create the array:
from pathlib import Path

import numpy as np
import zarr

# Create an empty 4x6 int32 array, chunked 2x3, in a directory on disk.
z = zarr.create_array(
    store="data/understanding-zarr.zarr",
    shape=(4, 6),
    chunks=(2, 3),
    dtype="int32",
)
print(type(z))
Notice what z is: a Zarr array, not a NumPy array. It's a lightweight handle
onto the store; there aren't 24 integers sitting in memory. And create_array has
so far written only metadata: no chunk data at all. We can prove that by listing
everything in the store:
def show_store():
    """Print every file in the store directory, relative to its root."""
    root = Path("data/understanding-zarr.zarr")
    for path in sorted(root.rglob("*")):
        if path.is_file():
            print(path.relative_to(root))


show_store()
Just zarr.json: the metadata, and not a single chunk. Here is what it holds:
import json
metadata = Path("data/understanding-zarr.zarr/zarr.json").read_text()
print(json.dumps(json.loads(metadata), indent=2))
{
  "shape": [
    4,
    6
  ],
  "data_type": "int32",
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [
        2,
        3
      ]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": 0,
  "codecs": [
    {
      "name": "bytes",
      "configuration": {
        "endian": "little"
      }
    },
    {
      "name": "zstd",
      "configuration": {
        "level": 0,
        "checksum": false
      }
    }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}
Every concept from Part I is right there: the shape, the chunk_grid (with
its nested chunk_shape), the data_type, the chunk_key_encoding that produces
c/0/1-style keys, the fill_value, and the codecs pipeline.
The dump also carries a few fields we haven't dwelt on. zarr_format and node_type
are housekeeping: the Zarr format version (3) and whether this node is an array or
a group. attributes is a slot for your own custom metadata, such as names, units,
or descriptions (see Attributes). And storage_transformers is an optional, advanced
extension point: where a codec transforms an individual chunk, a storage transformer
sits between the whole array and the store, able to transform how data is read from
and written to it. No storage transformers are standardised yet, so zarr-python
writes an empty list ([]); you can safely ignore it for now.
Now let's actually store some data. zarr-python deliberately mirrors NumPy's indexing and slicing syntax: you write into the array by assigning to a slice, just as you would with NumPy. That assignment is what triggers Zarr to encode chunks and write their bytes. Creation set up the metadata, but only writing puts chunk data in the store:
# the same 4x6 grid of integers from Part I
source = np.arange(24).reshape(4, 6)
z[:] = source # assigning to a slice writes chunk bytes to the store
show_store()
Now the four chunk values (c/0/0 … c/1/1) have appeared alongside zarr.json,
one object per cell of our 2×2 chunk grid, exactly as the diagrams promised. The
metadata was written at creation; the chunk bytes were written only just now, by the
assignment.
Finally, reading: again using ordinary Python indexing. When you ask for a slice, zarr-python does the reverse of everything in Part II:
flowchart LR
s["slice request<br/>z[1:3, 2:5]"] --> w["work out which<br/>chunks it touches"]
w --> f["fetch those keys<br/>from the store"]
f --> d["decode each<br/>chunk's bytes"]
d --> r["scatter values<br/>into the result"]
It works out which chunks the slice overlaps, fetches only those keys, decodes their bytes, and scatters the values into the result, which comes back as an ordinary NumPy array. The round-trip matches the original data:
opened = zarr.open_array("data/understanding-zarr.zarr", mode="r")
corner = opened[1:3, 2:5] # ordinary slicing, just like NumPy
print(corner)
print("matches NumPy:", bool((corner == source[1:3, 2:5]).all()))
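The first step of that read path, working out which chunks a slice touches, is worth seeing on its own. The sketch below is plain Python, not zarr-python's actual implementation; `chunks_touched` is a made-up helper that assumes a non-empty selection given as one (start, stop) pair per axis.

```python
import itertools


def chunks_touched(selection, chunk_shape):
    """Chunk-grid positions overlapped by a slice.

    `selection` holds one (start, stop) pair per axis, stop excluded.
    """
    per_axis = [
        range(start // size, (stop - 1) // size + 1)
        for (start, stop), size in zip(selection, chunk_shape)
    ]
    return list(itertools.product(*per_axis))


# z[1:3, 2:5] straddles both chunk boundaries, so all four chunks are fetched.
print(chunks_touched([(1, 3), (2, 5)], (2, 3)))
# [(0, 0), (0, 1), (1, 0), (1, 1)]

# z[0:2, 0:3] lines up with one chunk exactly, so only that chunk is fetched.
print(chunks_touched([(0, 2), (0, 3)], (2, 3)))  # [(0, 0)]
```

This is the read amplification from Part II in miniature: a six-value slice that straddles chunk boundaries costs four chunk reads, while a chunk-aligned one costs a single read.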
And here's the real payoff of everything this page has argued. We wrote that store
with zarr-python, but nothing about it is Python-specific: it's just the keys, bytes,
and zarr.json that the specification prescribes. So the exact same directory
can be opened, unchanged, by a Zarr implementation in another language: by
zarrs from Rust, by
zarrita.js from JavaScript in a browser, or by
TensorStore from C++, each reading the same
chunks and reconstructing the same array. That portability isn't a feature
zarr-python adds; it's what being a specification means.
Recap, and where to go next¶
You now have the whole mental model:
- An array is a grid of equally-typed values with a shape, stored in memory as one contiguous, row-major block.
- Zarr splits it into equal-shaped chunks.
- Chunks live in a store as keys mapped to bytes (a folder, a bucket, a zip, memory).
- A metadata document (zarr.json) describes the array so the bytes mean something.
- The layout is fixed by the Zarr specification, so any implementation can read it; zarr-python is just one of them.
- Under the hood: chunks are gathered/scattered to and from memory; uneven chunks get fill-padded edges; codecs compress and transform chunk bytes; sharding bundles inner chunks into shards to avoid too many objects; groups organize many arrays into a hierarchy; and the whole model scales to any number of dimensions.
Ready to use it? Continue with:
- Working with arrays: create, read, and write arrays in zarr-python.
- Groups: build and navigate hierarchies.
- Storage: the stores you can put your data in.
- Optimizing performance: choosing chunk and shard shapes.
- Glossary: quick definitions of chunk, codec, store, and more.
- Zarr specifications: the standard itself.