R, GDAL, Zarr, and the Missing Recipe Format for Raster Datasets

code
news
r-packages
python
zarr
gdal
netcdf
Author

Michael Sumner

Published

May 27, 2026

The virtual-Zarr format that lets xarray describe 16,000 files in a tiny store doesn’t exist in R yet - but the pieces are in place, this post tries to show where they fit together .

A common scenario in geospatial data workflows is multiple raster files. Let’s consider 23 snapshots of OISST sea surface temperature spread across 2025. One NetCDF file per date, each raster is the same grid, four variables each. Let’s see what R does with that. Here we assume files is a vector of paths to the NetCDF files.

(Fully reproducible setup, downloading directly from NCEI is in the appendix.)

str(basename(files))
# chr [1:23] "oisst-avhrr-v02r01.20250101.nc" "oisst-avhrr-v02r01.20250117.nc" "oisst-avhrr-v02r01.20250202.nc" ...
library(terra)
rast(files)
# class       : SpatRaster
# size        : 720, 1440, 92  (nrow, ncol, nlyr)
# resolution  : 0.25, 0.25  (x, y)
# extent      : 0, 360, -90, 90
# coord. ref. : lon/lat WGS 84 (CRS84)
# sources     : oisst-avhrr-v02r01.20250101.nc (4 layers)
#               oisst-avhrr-v02r01.20250117.nc (4 layers)
#               ... and 21 more sources
# varnames    : anom, err, ice, sst, anom, err, ...
# names       : anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0, ...
# time (days) : 2025-01-01 to 2025-12-19 (23 steps)

The result is pretty good. terra::rast() is no slouch: it reads all subdatasets across all files, concatenates band-wise, preserves the time axis, carries units, and records depth metadata. For a lot of workflows this is exactly what is needed, and it’s lazy — no pixel values are read until they are needed.

But look at the names: anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0 — repeated 23 times, 92 layers total. Four variables and 23 timesteps are there, flattened into band order. The n-D structure — this is a (variable × time × lat × lon) array — is implicit in the interleaving, not encoded anywhere. There’s an implicit contract about how this was unrolled, there are conventions for band ordering, workarounds for the _zlev=0 naming. Nothing is wrong, but the band-stack model forces all of that n-D structure into a single axis, and recovering it requires a lot of information that is not usual in the multi-band model.

We can sidestep the variable interleaving with the vrt:// URI syntax:

vrt(
  sprintf("vrt://%s?sd_name=sst", files),
  "/tmp/sst_stack.vrt",
  options = "-separate",
  overwrite = TRUE
)
# class       : SpatRaster
# size        : 720, 1440, 23  (nrow, ncol, nlyr)
# source      : sst_stack.vrt
# names       : r_1, r_2, r_3, ..., r_23

One variable, 23 layers. The vrt:// GDAL protocol is worth knowing: it lets us name a subdataset in a URI query string, vectorizing cleanly with sprintf and using the underlying raster-stack capability of GDAL VRT (-separate). But there’s a trade-off: no time axis, layer names are r_1 through r_23 rather than dates. We fixed the variable interleaving problem and lost the coordinate metadata.

rast(files, "sst") also selects a single subdataset without the vrt:// syntax, and produces the same 23-layer result — but it goes through the ncdf4 backend rather than GDAL, so it doesn’t produce a VRT file and doesn’t carry the same interoperability guarantees. The two approaches look similar but are doing different things underneath, and that distinction matters for what comes next.

stars approaches this differently. In default proxy (lazy) mode it uses GDAL classic raster and keeps variables as separate attributes, file-per-timestep structure explicit, doing this by harvesting metadata from the underlying files as carried by GDAL. stars::read_mdim() goes further — it uses the GDAL multidimensional API and gives named attributes, real dimension coordinates with udunits, proper n-D structure. The representation is right, but it too has limits; details are in the appendix.

None of these approaches is the best — each embeds real expertise and genuine design choices. But with excellent tools like terra and stars, you still can’t hand someone a file that says “this is a year of global SST.” You can hand them code, or you can hand them gigabytes. The code only works if they have the same files in the same place. The gigabytes require a format decision and a copy operation.

There’s no serializable recipe that is light, language-agnostic, and handles a broad variety of cases.

Fifteen minutes with xarray

A key moment for me was while learning Python and xarray, I was impressed that xarray could open these files — a huge list of them (16,337 days of OISST at time of writing) as local files or remote URLs. xarray checks them all for consistency, detects that what is really there is four variables in longitude, latitude, and time, and that time is an implicit concatenation from a single time step in each of those thousands of files. It’s remarkable: one object in memory in Python, ready to lazy-read any part of the four 16337×720×1440 arrays. But that took about 15 minutes to run, let alone the software and computing infrastructure setup I had been through. And then: how do you share that progress with colleagues?

Everybody knows you aren’t supposed to serialize in-memory objects to binary, we all have scary experiences with .Rdata and .rds, and Python has the same with its pickle. (I thought: “hmmm, what’s the xarray equivalent of a VRT?” Spoiler: there isn’t one — and although VirtualiZarr does serve that role, it’s missing an intermediate state where we only record what files were opened and what structure was learned.)


What the VRT already does

Back to our 23 files — a sample of the 16,337+ in the full OISST set as of writing.

Open /tmp/sst_stack.vrt in a text editor (browseURL("/tmp/sst_stack.vrt") works too). It’s a few kilobytes of XML that fully describes a virtual 23-layer SST dataset — source files, spatial extent, band mapping. It’s a tiny file: you could email it, put it in a repository, or patch the source paths to /vsicurl/https://....nc URLs so it works without any local files at all. Anyone with GDAL can open it with rast("sst_stack.vrt") and get exactly what you built, without copying a byte of the underlying data.

That’s a recipe: we’ve described our dataset as a small chunk of text that GDAL can open very quickly.

VRT (Virtual Format) is a plain XML file that describes a virtual raster dataset. It can mosaic, warp, reproject, reband, and combine source files without touching the underlying data. For the 2D raster world it is exactly the recipe format that’s missing from the terra/stars in-memory story. The 2D VRT is one of GDAL’s most underappreciated features — already powering a lot of geospatial infrastructure, especially where band-stacking or tile-mosaicking is sufficient as the solution.

So what about the full n-D case: multiple variables, time coordinates, arbitrary dimensions?


What Zarr figured out

The Python/xarray community has been working on exactly this problem, approaching it from the chunk level rather than the file level.

Zarr is a chunked array format. An array is split into regular chunks; each chunk is compressed independently and stored as a separate object — a file, an S3 object, a range of bytes in a larger store. The chunks themselves are opaque: raw numeric arrays in compressed bytes, not self-describing. What makes Zarr work is the metadata sidecar: a small JSON file that records the shape, dtype, chunk layout, compression codec, and dimension coordinates — everything xarray needs to provide a rich, self-describing interface over the data. This is really good for cloud access: fetch only the chunks you need, in parallel, without a format-specific library. It’s a bit like taking a pile of NetCDF files and turning them inside out — all the metadata (tiny in comparison to the data) is placed upfront in JSON, and all the compressed opaque chunks are stored as individually addressable objects.

The deeper insight comes when you realize that HDF5, netCDF4, GeoTIFF, GRIB, and others already store their data in chunks. The chunks are buried inside files that require format-specific libraries to navigate, but the byte ranges are all there. We just need an index.

kerchunk and its successor VirtualiZarr do exactly that: open your existing files, find where each chunk lives (file path, byte offset, byte length), and write that index as a JSON or Parquet reference store. The bulk data never moves. What you get is a Zarr-shaped recipe that points back into your original files. It’s worth noting that this idea is not new — NASA and Unidata’s DMR++ format did the same thing for OPeNDAP, indexing chunk byte-ranges in HDF5/netCDF files as a sidecar that an OPeNDAP server can serve directly. Cloud-Optimized GeoTIFF is the same idea applied to TIFF: the IFD tile index is moved to the front of the file so a client can read the chunk map in one range request and fetch only what it needs thereafter. The difference is that Zarr’s client-side framing made the pattern composable and ecosystem-wide rather than tied to a specific server or format.

NOTE: this is very much pseudo code showing the overall style - see the Usage VirtualiZarr pages for more examples, and we have a real example in the appendix.

from virtualizarr import open_virtual_dataset
import xarray as xr

# index the chunks — no data copied
combined = xr.concat(
    [open_virtual_dataset(f) for f in files],
    dim="time"
)
combined.virtualize.to_kerchunk("sst_virtual.zarr", format="zarr")

# open it — chunk-aware, lazy, S3-ready, fully structured
ds = xr.open_dataset("sst_virtual.zarr/", engine="kerchunk")
# <xarray.Dataset>
# Dimensions:  (time: 23, zlev: 1, lat: 720, lon: 1440)
# Coordinates:
#   * time     (time) datetime64[ns] 2025-01-01 ... 2025-12-19
#   * lat      (lat) float32 -90.0 ... 89.75
#   * lon      (lon) float32 0.0 ... 359.75
# Data variables: anom, err, ice, sst

The store is kilobytes of metadata, the actual data stays where it is, and the n-D structure is fully encoded. Compare this to the terra output above:

  • sst, anom, err, ice as named variables, not interleaved bands
  • time as a real coordinate with date values, not layer indices r_1..r_23
  • lat, lon, zlev as labelled dimensions with coordinate values
  • the whole thing serialized as a directory anyone with Zarr support can open

This is what stars::read_mdim() is reaching toward in R — the representation is structurally similar. The difference is that the xarray object came from a small Zarr store, not from reconstructing structure at runtime over 23 original netCDF files. That difference is the recipe.


GDAL is the bridge

GDAL already spans both worlds.

GDAL has a multidimensional API (the “mdim” layer) that represents data as named arrays with labelled dimensions — the same model as xarray, not the band-stacked model of classic GDAL rasters. GDAL’s Zarr driver can open VirtualiZarr/kerchunk reference stores through this API. A store created in Python is readable via GDAL — which means it’s reachable from R today.

More importantly, GDAL has gdal mdim mosaic — a tool that takes a collection of raster files and produces a multidim VRT describing them as a unified n-D array dataset. This VRT encodes the array structure, dimension names, coordinate values, and source file mapping. It is the GDAL-native recipe for an n-D dataset — and as shown in the appendix, stars::read_mdim() opens it faithfully, giving you the full multi-file dataset with the right structure, right coordinates, and a serializable description that travels independently of the original files.

Converting a GDAL multidim VRT to a Zarr reference store is a small transformation on kilobytes of XML, not a data operation. That’s the near-term path to R-native VirtualiZarr creation.


The gap: creation, not reading

R can read virtual stores — Zarr and multidim VRT and VirtualiZarr Parquet reference via stars::read_mdim(), gdalraster::mdim_translate or zaro or real Zarr with pizzarr or zarr. What R can’t yet do is create the chunk-level byte-reference stores that VirtualiZarr generates for HDF5 and netCDF.

That’s a frontier, and I’ve been exloring where the pieces could converge:

gdal mdim mosaic creates multidim VRTs from file collections today. The XML fully describes what a VirtualiZarr store needs — shape, chunks, coordinate values, source file paths and byte offsets. This is the hard structural work; emitting the Zarr reference store format from it is a straightforward conversion.

rhdf5 can now open remote HDF5 files at URLs or on S3 (dev version on GitHub). The next step is extracting chunk byte-reference indices directly - a core operation that VirtualiZarr performs in Python (with HDF5 library, GRIB libraries, TIFF libraries some in C++ and some in Rust). Step 7 in the appendix shows remote access, and a pending PR in my rhdf5 fork has already use that to harvest remote byte references from HDF5 and NetCDF.

zaro (pure-R Zarr reader via Arrow) already reads VirtualiZarr Parquet stores. The reading side is there, and needs comparison with pizzarr, and zarr and other zarr-enabled packages.

vrtstack generates multidim VRTs from 2D raster collections in R - the gdal mdim mosaic equivalent for simple cases of multiple 2D raster - without leaving R.

The path: vrtstack or gdal mdim mosaic for file collections → emit Zarr reference store → readable by zaro, xarray, or any Zarr-aware tool. In this way R can become a first-class creator of portable dataset descriptions, not just a consumer of what Python produces.


Why this matters beyond R

A Zarr reference store created from OISST netCDF files is immediately openable in Python, R, GDAL and in theory any tool that speaks Zarr.

STAC is complementary but different: STAC describes where files are and their spatiotemporal footprint. A virtual store describes what the dataset structurally is — how chunks map to array coordinates, what the dimensions mean, how to reassemble a slice. A STAC item can point to a virtual store the way it points to a COG.

The R community has genuine leverage here. GDAL is the common substrate between R and the xarray/Zarr ecosystem. R is not just catching up, there are things we can do that aren’t on any other roadmap.


What would help

If you work with time series of raster files and you’ve felt this friction:

  • Talk about your use case — what files, what structure, what would a portable dataset description let you do that you can’t do today?
  • Try terra::vrt() and try the vrt:// URI syntax for netCDF subdataset selection.
  • Try stars::read_mdim() on a gdal mdim mosaic VRT — the full tour in the appendix shows how these fit together
  • Try remotes::install_github("hypertidy/tidync@multi-input") if you have remote netCDF files and want to skip the download entirely — Step 8 in the appendix shows what this already looks like.
  • Look at vrtstack and zaro on the hypertidy GitHub, vrtstack builds a multidim VRT from 2D rasters (multidim GDAL supports COG, but doesn’t support builiding these), zaro is a read-implementation for Zarr that is primarily based around Arrow.
  • File issues against GDAL’s multidim API when you find gaps — R users exercising it in real workflows makes it better for everyone.
  • Check out gdalraster (and my multidim branch) there’s a long way to go for multidim but gdalraster is a rich package has a super strong basis for faithfully carrying the core facilities of the multidim api exposed to R.

Appendix: a progressive tour on 23 OISST files

Setup: download 23 OISST files from NCEI

All code below uses only this setup, and we use a multidim VRT representation later to act as an intermediate form.

library(glue)

date  <- seq(as.Date("2025-01-01"), by = "16 days", length.out = 23)
base0 <- "https://www.ncei.noaa.gov/data/sea-surface-temperature-optimum-interpolation/v2.1/access/avhrr"
ym    <- format(date, "%Y%m")
ymd   <- format(date, "%Y%m%d")
urls  <- glue("{base0}/{ym}/oisst-avhrr-v02r01.{ymd}.nc")

dir   <- tempdir()
files <- file.path(dir, basename(urls))
download.file(urls, files, method = "libcurl", mode = "wb")

Setup part 2: create a GDAL multidimensional VRT

writeLines(glue::glue("/vsicurl/{urls}"), tf <- tempfile())
system(glue::glue("gdal mdim mosaic @{tf} oisst_2025_sample.vrt"))
## piggyback::pb_upload("oisst_2025_sample.vrt")

The URL structure itself is the catalogue: a simple path template over date components, no database required. 23 files, ~3.5MB each, ~80MB total.

Step 1 — terra: rast(), band-stacked, all variables, time preserved

rast(files)
# class       : SpatRaster
# size        : 720, 1440, 92  (nrow, ncol, nlyr)
# sources     : oisst-avhrr-v02r01.20250101.nc (4 layers)
#               ... and 22 more sources
# varnames    : anom, err, ice, sst, anom, err, ice, sst, ...
# names       : anom_zlev=0, err_zlev=0, ice_zlev=0, sst_zlev=0, ...
# unit        :      Celsius,    Celsius,          %,    Celsius, ...
# time (days) : 2025-01-01 to 2025-12-19 (23 steps)

terra uses the ncdf4 R package for netCDF reads — a design inherited from the raster package - rather than routing through GDAL. This is why its behaviour differs from stars proxy mode. The time axis survives via ncdf4’s own metadata handling. The n-D structure is flattened into 92 bands; recovering it means knowing the 4-variable interleaving convention.

Bundling all bands from all variables that share the same dimensions is a choice made by terra, GDAL has a similar capacity but it is not used here (and is in fact unusable by these files for other reasons).

Step 2 — terra: vrt() with vrt://, single variable, time lost

vrt(
  sprintf("vrt://%s?sd_name=sst", files),
  "/tmp/sst_stack.vrt",
  options = "-separate",
  overwrite = TRUE
)
# class       : SpatRaster
# size        : 720, 1440, 23  (nrow, ncol, nlyr)
# source      : sst_stack.vrt
# names       : r_1, r_2, r_3, ..., r_23

vrt:// URI syntax selects a subdataset inline; ?sd_name=sst is the query param.. The result is a real file — a 2D recipe for this one variable. But trade-offs: time coordinate is gone, layer names are positional. The 2D VRT has no slot for dimension coordinates, it can only carry raw metadata on a dataset or on a band.

Step 3 — stars: read_stars() proxy, GDAL classic raster, variables separated

read_stars(files)
# stars_proxy object with 4 attributes in 92 file(s):
# $sst
#  [1] "[...]/oisst-avhrr-v02r01.20250101.nc:sst"
#  [2] "[...]/oisst-avhrr-v02r01.20250117.nc:sst"
#  ...
# $anom, $err, $ice  (similarly, 23 files each)
#
# dimension(s):
#      from   to offset delta  refsys x/y
# x       1 1440      0  0.25      NA [x]
# y       1  720     90 -0.25      NA [y]
# zlev    1    1  0 [m]    NA udunits
# time    1   23     NA    NA      NA

stars proxy mode uses GDAL classic raster and separates variables as attributes rather than interleaving as bands — a meaningful structural difference from terra. Time is present as a dimension but without coordinate values, it’s fully lazy at this point as made clear by the lists of files in subdataset syntax.

Step 4 — stars: read_mdim() on a single file, right model, wrong scope

stars::read_mdim(files[1L])
# stars object with 4 dimensions and 4 attributes
# attribute(s):
#            Min.  1st Qu.  Median      Mean  3rd Qu.   Max.     NAs
# sst [°C]  -1.80    -1.80   -1.78  -1.70779    -1.64  -1.03   89635
# anom [°C] -0.72    -0.08   -0.02  -0.03531     0.00   0.49   89635
# err [°C]   0.30     0.30    0.30   0.30446     0.30   0.35   89635
# ice [%]    0.19     0.81    0.95   0.87747     0.98   1.00   91877
# dimension(s):
#      from   to  offset  delta          refsys x/y
# lon     1 1440       0   0.25  WGS 84 (CRS84) [x]
# lat     1  720     -90   0.25  WGS 84 (CRS84) [y]
# zlev    1    1  0 [m]      NA         udunits
# time    1    1  17532 [days since 1978-01-01 12:00:00]  NA  udunits

read_mdim() goes through GDAL’s multidimensional API: named attributes, labelled dimensions, udunits-encoded time coordinate, proper spatial reference. This is a rich representation of the dataset.

But pass stars all 23 files and it silently opens only the first, one timestep, and no error. stars has other ways of concatenating files along dimensions but it’s only applied to GDAL classic raster context.

This is a pivot point.

Step 5 — gdal mdim mosaic + read_mdim(), the full tour

gdal mdim mosaic is the missing piece. It takes a collection of files and produces a multidim VRT that describes them as a unified n-D array — dimension coordinates and all:

gdal mdim mosaic \
  $(printf 'NETCDF:%s:sst ' $(cat filelist.txt)) \
  -o /tmp/sst_mdim.vrt

Or from R via vrtstack or a direct system() call. Now hand that VRT to read_mdim():

stars::read_mdim("/tmp/sst_mdim.vrt")
# stars object with 3 dimensions and 1 attribute
# attribute(s):
#           Min.  1st Qu. Median    Mean 3rd Qu.  Max.    NAs
# sst [°C] -1.80    -1.80  -1.55  -1.198   -1.03  29.0 357256
# dimension(s):
#      from   to                        offset  delta          refsys x/y
# lon     1 1440                             0   0.25  WGS 84 (CRS84) [x]
# lat     1  720                           -90   0.25  WGS 84 (CRS84) [y]
# time    1   23  2025-01-01T00:00:00+00:00  16 days         POSIXct

23 timesteps. Named attribute. Real time coordinates as POSIXct with a 16-day delta. Proper spatial reference. And /tmp/sst_mdim.vrt is a file — a few kilobytes of XML — that fully describes this dataset spanning the year. Hand it to a colleague and they get the same thing. Put it in a repository. Point tidync or vapour at it. Open it in Python via GDAL. It’s the recipe.

This is what gdal mdim mosaic unlocks: read_mdim() was always the right model. It just needed the VRT to make the multi-file case work.

Step 6 — skip the download: URL access, who can and can’t

The files we downloaded are publicly accessible over plain HTTPS. Why download at all? Let’s see who can open them directly.

GDAL — just works. GDAL’s /vsicurl/ virtual filesystem handles HTTP range requests transparently, and both terra::rast() (via GDAL for the spatial read) and stars::read_mdim() can open the URLs directly once you prefix them appropriately. No download needed.

# GDAL-backed read direct from URL
stars::read_mdim(paste0("/vsicurl/", urls[1L]))
# stars object with 4 dimensions and 4 attributes
# ... same output as Step 4, no local file required

ncdf4 — the picture is more complicated. ncdf4 routes through the netCDF-C library, which expects either a local file path or an OPeNDAP/Thredds dodsC URL. A plain HTTPS URL fails, and the error is instructive:

ncdf4::nc_open(urls[1])
# syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or
# SCAN_DATASET or SCAN_ERROR
# context: <!DOCTYPE^ HTML PUBLIC ...><h1>Not Found</h1>...
# Error in ncdf4::nc_open: NetCDF: file not found

ncdf4 tried to parse the 404 HTML response as netCDF syntax. But there’s a workaround — the #mode=bytes suffix instructs the HDF5 layer to use HTTP byte-range requests rather than treating the URL as a file path:

ncdf4::nc_open(sprintf("%s#mode=bytes", urls[1]))
# File https://...oisst-avhrr-v02r01.20250101.nc#mode=bytes
# (NC_FORMAT_NETCDF4):
#   4 variables (excluding dimension variables):
#   sst, anom, err, ice ...

It works — but you have to know the magic suffix. And because terra’s netCDF reading is backed by ncdf4, it inherits this same limitation: plain HTTPS URLs don’t work without #mode=bytes, and OPeNDAP dodsC URLs are the usual workaround for terra users working with remote data.

This is the library-identity problem: the same URL provides three different outcomes depending on which library is doing the opening. GDAL’s cloud-native HTTP access is a first-class citizen but the ’/vsi*’ prefix is clearly not popular in some communities; NetCDF’s is an afterthought requiring a suffix that hardly anyone seems to know about. The recipe format inherits these differences — a VRT that references HTTPS URLs is GDAL-native and just works; a workflow built on NetCDF needs extra plumbing.

Step 7 — rhdf5-dev: opening the door to chunk indexing (dev, not yet on CRAN)

The development version of rhdf5 on GitHub can open remote HDF5 files over HTTPS and S3 using the ROS3 virtual file driver — no download, no #mode=bytes workaround, just direct anonymous access:

# remotes::install_github("grimbough/rhdf5")
library(rhdf5)

fapl <- H5Pcreate("H5P_FILE_ACCESS")
H5Pset_fapl_ros3(fapl)   # no credentials → anonymous mode

fid <- H5Fopen(urls[1], flags = "H5F_ACC_RDONLY", fapl = fapl)
fid
# HDF5 FILE
#         name /
#     filename
#
#   name       otype       dclass              dim
# 0 anom  H5I_DATASET    INTEGER  1440 x 720 x 1 x 1
# 1 err   H5I_DATASET    INTEGER  1440 x 720 x 1 x 1
# 2 ice   H5I_DATASET    INTEGER  1440 x 720 x 1 x 1
# 3 lat   H5I_DATASET      FLOAT  720
# 4 lon   H5I_DATASET      FLOAT  1440
# 5 sst   H5I_DATASET    INTEGER  1440 x 720 x 1 x 1
# 6 time  H5I_DATASET      FLOAT  1
# 7 zlev  H5I_DATASET      FLOAT  1

What you’re seeing here is the raw HDF5 structure of the file — variables as datasets, dimensions as separate float arrays, all visible without reading a single data value. This is exactly the level at which VirtualiZarr operates: it walks this structure, reads the chunk B-tree for each dataset, and records the byte offset and length of every chunk.

Once rhdf5 can expose those chunk byte-references — which is the next step in its development — R will have a native path to VirtualiZarr store creation without Python, without shelling out, and working directly on files at HTTPS or S3 URLs. The H5Pset_fapl_ros3() call with no credentials is the anonymous S3/HTTPS access pattern: it works for any publicly accessible endpoint that supports range requests, including NCEI, NASA Earthdata public buckets, and most open data archives.

The reading side (zaro) is already there. The indexing side (rhdf5-dev) is almost there. The assembly side (gdal mdim mosaic → Zarr reference store) is a small conversion on top of infrastructure that already exists. The pieces are in place.

Step 8 — tidync dev: multi-file remote concat in one call (unmerged dev branch)

tidync recently got an update in a dev branch to accept multiple files — long on the wishlist but out of reach in the early years of the package. I had also forgotten that remote access to NetCDF is possible via its own library by appending #mode=bytes to the URL — this means we don’t always need an OPeNDAP-enabled server for remote read. (GDAL uses its own techniques for remote access via /vsicurl/.)

# remotes::install_github("hypertidy/tidync@multi-input")
tidync::tidync(
  sprintf("%s#mode=bytes", urls),
  fast = TRUE,
  concat_dim = list(name = "time", values = date)
)
# not a file:
# ' https://.../oisst-avhrr-v02r01.20250101.nc#mode=bytes '
#
# ... attempting remote connection
#
# Connection succeeded.
# Data Source (23): oisst-avhrr-v02r01.20250101.nc#mode=bytes,
#                   oisst-avhrr-v02r01.20250117.nc#mode=bytes ...
# Concatenated along 'time' (23/23 steps, 23 sources, fast mode)
#
# Grids (5) <dimension family> : <associated variables>
#
# [1]   D3,D2,D1,D0 : sst, anom, err, ice    **ACTIVE GRID**
# [2]   D0          : time
# [3]   D1          : zlev
# [4]   D2          : lat
# [5]   D3          : lon
#
# Dimensions 4 (all active):
#   dim  name  length     min       max  unlim coord_dim
#   D0   time      23   20089     20441   TRUE  TRUE
#   D1   zlev       1       0         0  FALSE  TRUE
#   D2   lat      720  -89.875   89.875  FALSE  TRUE
#   D3   lon     1440    0.125   359.88  FALSE  TRUE

Twenty-three remote netCDF files concatenated into one lazy n-D dataset in a single call — #mode=bytes vectorized over the URL vector, a synthetic time axis injected via concat_dim, all four variables on the active grid, unlimited time dimension, proper coordinates throughout. We have tested this on the full set of 16,000+ URLs and it opens in less than a minute using the fast and concat_dim mechanism to declare the values on the concatenating dimension.

The “not a file / attempting remote connection / Connection succeeded” in the output is tidync doing its own detective work: it knows this isn’t a local path, tries the remote connection, and succeeds. The #mode=bytes suffix — the ncdf4 workaround from Step 6 — unlocks the whole thing.

This is an R equivalent of the xarray open: the same structure, the same laziness, the same multi-file concatenation. The recipe here is the code rather than a store on disk — and the tidync object it produces is fully navigable with hyper_filter() and hyper_array() without touching the data. The serializable recipe (the Zarr store, the multidim VRT) is still the goal for sharing and interoperability, but as a working representation for analysis, this is already there.

In the last section of the appendix we virtualize to the “normal” xarray/Zarr model, but then open that again with different tools. The same store, opened by four different tools across two languages - there’s a recipe, but it’s not a nailed down ecosystem yet.

The xarray counterpart — for comparison

First create the VirtualiZarr version. We use the GDAL mdim VRT hosted above because it’s convenient — and that’s where this effort will start: GDAL already has all the metadata and array concatenating logic, and we can pair that with byte reference extraction improvements to GDAL’s GetRawBlockInfo().

from lxml import etree
import requests
from urllib.parse import urlparse
import xarray as xr
from obstore.store import HTTPStore
from obspec_utils.registry import ObjectStoreRegistry
from virtualizarr import open_virtual_mfdataset
from virtualizarr.parsers import HDFParser

# 1. parse the VRT to get source URLs
r = requests.get(
    "https://github.com/mdsumner/quarto-blog/releases/download/latest/oisst_2025_sample.vrt"
)
tree = etree.fromstring(r.content)
urls = sorted(set(             # sorted: set() loses order, filenames are date-stamped
    el.text.replace("/vsicurl/", "")
    for el in tree.findall(".//SourceFilename")
))

# 2. set up the object store (anonymous HTTP)
parsed = urlparse(urls[0])
bucket = f"{parsed.scheme}://{parsed.netloc}"
registry = ObjectStoreRegistry({bucket: HTTPStore.from_url(bucket)})
parser = HDFParser()

# 3. open as a single virtual dataset — no data read, no files downloaded
combined = open_virtual_mfdataset(
    urls,
    concat_dim="time",
    parser=parser,
    registry=registry,
    combine="nested",
)

# 4. serialize — a directory store describing all 23 files
combined.virtualize.to_kerchunk("sst_virtual.zarr", format="zarr")

Now open with xarray:

# note the trailing slash — this is a Zarr directory store, not a single file
# fsspec needs it to resolve the .zmetadata index correctly
ds = xr.open_dataset("sst_virtual.zarr/", engine="kerchunk")
# <xarray.Dataset> Size: 763MB
# Dimensions:  (time: 23, zlev: 1, lat: 720, lon: 1440)
# Coordinates:
#   * time  (time) datetime64[ns] 2025-01-01T12:00:00 ... 2025-12-19T12:00:00
#   * zlev  (zlev) float32 0.0
#   * lat   (lat) float32 -89.88 -89.62 ... 89.62 89.88
#   * lon   (lon) float32 0.125 0.375 ... 359.6 359.9
# Data variables:
#     anom  (time, zlev, lat, lon) float64 191MB ...
#     err   (time, zlev, lat, lon) float64 191MB ...
#     ice   (time, zlev, lat, lon) float64 191MB ...
#     sst   (time, zlev, lat, lon) float64 191MB ...
# Attributes: (12/37)
#     Conventions:   CF-1.6, ACDD-1.3
#     title:         NOAA/NCEI 1/4 Degree Daily Optimum Interpolation ...
#     metadata_link: https://doi.org/10.25921/RE9P-PT57
#     sensor:        Thermometer, AVHRR

That virtual store is available here:

https://raw.githubusercontent.com/mdsumner/quarto-blog/refs/heads/main/data/sst_virtual.zarr

Open with xarray directly from the URL (note the trailing slash):

xr.open_dataset(
    "https://raw.githubusercontent.com/mdsumner/quarto-blog/refs/heads/main/data/sst_virtual.zarr/",
    engine="kerchunk"
)

Open with GDAL:

from osgeo import gdal
ds = gdal.OpenEx(
    "ZARR:\"/vsicurl/https://raw.githubusercontent.com/mdsumner/quarto-blog/refs/heads/main/data/sst_virtual.zarr\"",
    gdal.OF_MULTIDIM_RASTER
)
gdal mdim info "ZARR:\"/vsicurl/https://raw.githubusercontent.com/mdsumner/quarto-blog/refs/heads/main/data/sst_virtual.zarr\""

Open with zaro:

z <- zaro("virtualizarr://https://raw.githubusercontent.com/mdsumner/quarto-blog/refs/heads/main/data/sst_virtual.zarr")
zaro_meta(z)
# [zaro] found .zmetadata (Zarr V2 consolidated)
# [zaro]   8 arrays: anom, err, ice, lat, lon, sst, time, zlev
# <zaro::ZaroMeta>
#  @ zarr_format    : int 2
#  @ node_type      : chr "group"
#  @ attributes     :List of 37
#  .. $ Conventions: chr "CF-1.6, ACDD-1.3"
#  .. $ title      : chr "NOAA/NCEI 1/4 Degree Daily Optimum Interpolation ..."
#  .. $ metadata_link: chr "https://doi.org/10.25921/RE9P-PT57"
#  ...

Open with stars:

stars::read_mdim(
  "ZARR:\"/vsicurl/https://raw.githubusercontent.com/mdsumner/quarto-blog/refs/heads/main/data/sst_virtual.zarr\""
)
# stars object with 4 dimensions and 4 attributes
# attribute(s), summary of first 1e+05 cells:
#            Min. 1st Qu. Median       Mean 3rd Qu. Max.   NAs
# anom [°C] -1.58   -0.61  -0.43 -0.3861891   -0.20 1.45 89635
# err [°C]   0.30    0.30   0.30  0.3042393    0.30 0.35 89635
# ice [%]    0.16    0.62   0.87  0.7743456    0.96 1.00 93223
# sst [°C]  -1.80   -1.80  -1.66 -1.4256025   -1.29 1.19 89635
# dimension(s):
#      from   to                                 offset
# lon     1 1440                                      0
# lat     1  720                                    -90
# zlev    1    1                                  0 [m]
# time    1   23  17167 [days since 1978-01-01T12:00:00]
#                                    delta         refsys x/y
# lon                                 0.25 WGS 84 (CRS84) [x]
# lat                                 0.25 WGS 84 (CRS84) [y]
# zlev                                  NA        udunits
# time  16 [days since 1978-01-01T12:00:00]        udunits

Oh, and turns out terra::rast() does support multidim:

r <- rast(  "ZARR:\"/vsicurl/https://raw.githubusercontent.com/mdsumner/quarto-blog/refs/heads/main/data/sst_virtual.zarr\"", md  = TRUE)
class       : SpatRaster
size        : 720, 1440, 92  (nrow, ncol, nlyr)
resolution  : 0.25, 0.25  (x, y)
extent      : 0, 360, -90, 90  (xmin, xmax, ymin, ymax)
coord. ref. : lon/lat WGS 84 (CRS84) (OGC:CRS84)
sources     : sst_virtual.zarr
varnames    : anom (Daily sea surface temperature anomalies)
              err (Estimated error standard deviation of analysed_sst)
              ice (Sea ice concentration)
              sst (Daily sea surface temperature)
names       : anom_~v=0_1, anom_~v=0_2, anom_~v=0_3, anom_~v=0_4, anom_~v=0_5, anom_~v=0_6, ...
unit        :     Celsius,     Celsius,     Celsius,     Celsius,     Celsius,     Celsius, ...
depth       : 0
time (days) : 2025-01-01 to 2025-12-19 (23 steps)

There’s a lot here.


Michael Sumner is a research software engineer at the Australian Antarctic Division. Packages mentioned — tidync (rOpenSci reviewed), vapour, vrtstack, zaro — are part of the hypertidy ecosystem.