Icechunk is a transactional storage engine for Zarr and provides git-style versioning for large scientific datasets. This post is about Icechunk from a GDAL perspective and some impact that will have on the R community and its interactions with Zarr.
What is Icechunk?
Zarr is a chunked array format designed for cloud object storage. Icechunk is a Rust library that sits on top of Zarr and adds git-like snapshots combined with ACID transactions - (which means reliable and guaranteed read and write).
(From their docs) this provides:
- Safety: Recover from corrupted or accidentally deleted data using version history
- Consistency: Ensure readers always see complete, valid snapshots - never partial writes
- Reproducibility: Reference any version of your data permanently via commits or tags
Typical usage involves the Python packages xarray and icechunk, but Rust Icechunk can be used directly, in combination with zarrs_icechunk, and now in an upcoming GDAL release 3.14 via the read-only Icechunk driver.
Please be aware that Icechunk in GDAL is very new, and not yet in an officially supported release but if you are interested to get involved it can be used today. (GDAL 3.14 is milestoned for sometime in November 2026).
What works in GDAL and R today
GDAL 3.14 ships a /vsiicechunk/ virtual filesystem and Icechunk driver, and will be available in gdalcubes, gdalraster, sf, terra, and vapour in various forms once that support lands in R packages.
The following examples can be run in this Linux docker container, for example:
docker pull ghcr.io/hypertidy/gdal-r-full:dev
docker run --rm -ti ghcr.io/hypertidy/gdal-r-full:dev(If that doesn’t work, or you want help with docker/GDAL/R just contact me directly)
Here is a single cli call that gives a GDAL multidimensional summary of an Icechunk store.
gdal mdim info \
/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk \
--config AWS_NO_SIGN_REQUEST YES --config AWS_REGION us-west-2Click here for the full mdim info output
Driver: Icechunk
Dimensions:
Name (path) Size Type Direction
----------- ---- ------------ ---------
/init_time 3264
/latitude 721 HORIZONTAL_Y
/lead_time 61
/longitude 1440 HORIZONTAL_X
Coordinates (indexing variables):
Name (path) Dimension Type Unit
----------- ----------- ------- ------------------------
/init_time (init_time) Int64 seconds since 1970-01-01
/latitude (latitude) Float64 degree_north
/lead_time (lead_time) Float64 seconds
/longitude (longitude) Float64 degree_east
Data variables:
Name (path) Type Unit Shape Chunk size
------------------------------------------- ------- ------------------------ --------------------- -----------------
(/init_time):
/expected_forecast_length Float64 seconds [3264] [23360]
/ingested_forecast_length Float64 seconds [3264] [23360]
(/init_time, /lead_time):
/valid_time Int64 seconds since 1970-01-01 [3264, 61] [23360, 61]
(/init_time, /lead_time, /latitude, /longitude):
/dew_point_temperature_2m Float32 degree_Celsius [3264, 61, 721, 1440] [1, 61, 241, 240]
/downward_long_wave_radiation_flux_surface Float32 W m-2 [3264, 61, 721, 1440] [1, 61, 241, 240]
/downward_short_wave_radiation_flux_surface Float32 W m-2 [3264, 61, 721, 1440] [1, 61, 241, 240]
/geopotential_height_500hpa Float32 m [3264, 61, 721, 1440] [1, 61, 241, 240]
/geopotential_height_850hpa Float32 m [3264, 61, 721, 1440] [1, 61, 241, 240]
/geopotential_height_925hpa Float32 m [3264, 61, 721, 1440] [1, 61, 241, 240]
/precipitation_surface Float32 kg m-2 s-1 [3264, 61, 721, 1440] [1, 61, 241, 240]
/pressure_reduced_to_mean_sea_level Float32 Pa [3264, 61, 721, 1440] [1, 61, 241, 240]
/pressure_surface Float32 Pa [3264, 61, 721, 1440] [1, 61, 241, 240]
/temperature_2m Float32 degree_Celsius [3264, 61, 721, 1440] [1, 61, 241, 240]
/temperature_850hpa Float32 degree_Celsius [3264, 61, 721, 1440] [1, 61, 241, 240]
/temperature_925hpa Float32 degree_Celsius [3264, 61, 721, 1440] [1, 61, 241, 240]
/total_cloud_cover_atmosphere Float32 percent [3264, 61, 721, 1440] [1, 61, 241, 240]
/wind_u_100m Float32 m s-1 [3264, 61, 721, 1440] [1, 61, 241, 240]
/wind_u_10m Float32 m s-1 [3264, 61, 721, 1440] [1, 61, 241, 240]
/wind_v_100m Float32 m s-1 [3264, 61, 721, 1440] [1, 61, 241, 240]
/wind_v_10m Float32 m s-1 [3264, 61, 721, 1440] [1, 61, 241, 240]
Scalar arrays:
Name (path) Type Unit
------------ ----- ----
/spatial_ref Int64
Attributes:
Name Type Value
------------------- ------ ----------------------------------------------------------------------------
attribution String "ECMWF AIFS Single forecast data processed by dynamical.org from ECMWF Open
Data."
dataset_id String "ecmwf-aifs-single-forecast"
dataset_version String "0.1.0"
description String "Weather forecasts from the ECMWF Artificial Intelligence Forecasting System
(AIFS) Single model."
forecast_domain String "Forecast lead time 0-360 hours (0-15 days) ahead"
forecast_resolution String "6 hourly"
license String "CC-BY-4.0"
name String "ECMWF AIFS Single forecast"
spatial_domain String "Global"
spatial_resolution String "0.25 degrees (~20km)"
time_domain String "Forecasts initialized 2024-04-01 00:00:00 UTC to Present"
time_resolution String "Forecasts initialized every 6 hours"
Arrays:
- /init_time:
Dimensions: (/init_time)
Shape: [3264]
Chunk size: [23360]
Type: Int64
Unit: seconds since 1970-01-01
Nodata value: 0
Attributes:
Name Type Value
---------------------- ------ ---------------------------------------------
calendar String "proleptic_gregorian"
long_name String "Forecast initialization time"
standard_name String "forecast_reference_time"
statistics_approximate String {"min":"2024-04-01T00:00:00","max":"Present"}
Structural metadata:
COMPRESSOR { "name": "blosc", "configuration": { "typesize": 8, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } }
- /latitude:
Dimensions: (/latitude)
Shape: [721]
Chunk size: [721]
Type: Float64
Unit: degree_north
Nodata value: "NaN"
Attributes:
Name Type Value
---------------------- ------ --------------------
_FillValue String "AAAAAAAA+H8="
axis String "Y"
long_name String "Latitude"
statistics_approximate String {"min":-90,"max":90}
Structural metadata:
COMPRESSOR { "name": "blosc", "configuration": { "typesize": 8, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } }
- /lead_time:
Dimensions: (/lead_time)
Shape: [61]
Chunk size: [61]
Type: Float64
Unit: seconds
Nodata value: "NaN"
Attributes:
Name Type Value
---------------------- ------ --------------------------------------------------
_FillValue String "AAAAAAAA+H8="
dtype String "timedelta64[us]"
long_name String "Forecast lead time"
standard_name String "forecast_period"
statistics_approximate String {"min":"0 days 00:00:00","max":"15 days 00:00:00"}
Structural metadata:
COMPRESSOR { "name": "blosc", "configuration": { "typesize": 8, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } }
- /longitude:
Dimensions: (/longitude)
Shape: [1440]
Chunk size: [1440]
Type: Float64
Unit: degree_east
Nodata value: "NaN"
Attributes:
Name Type Value
---------------------- ------ -------------------------
_FillValue String "AAAAAAAA+H8="
axis String "X"
long_name String "Longitude"
statistics_approximate String {"min":-180,"max":179.75}
Structural metadata:
COMPRESSOR { "name": "blosc", "configuration": { "typesize": 8, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } }
- /dew_point_temperature_2m:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: degree_Celsius
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "2 metre dewpoint temperature"
short_name String "2d"
standard_name String "dew_point_temperature"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /downward_long_wave_radiation_flux_surface:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: W m-2
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Surface downward long-wave radiation flux"
short_name String "sdlwrf"
standard_name String "surface_downwelling_longwave_flux_in_air"
step_type String "avg"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /downward_short_wave_radiation_flux_surface:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: W m-2
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Surface downward short-wave radiation flux"
short_name String "sdswrf"
standard_name String "surface_downwelling_shortwave_flux_in_air"
step_type String "avg"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /expected_forecast_length:
Dimensions: (/init_time)
Shape: [3264]
Chunk size: [23360]
Type: Float64
Unit: seconds
Nodata value: "NaN"
Attributes:
Name Type Value
---------------------- ------ --------------------------------------------------
_FillValue String "AAAAAAAA+H8="
dtype String "timedelta64[us]"
long_name String "Expected forecast length"
statistics_approximate String {"min":"0 days 00:00:00","max":"15 days 00:00:00"}
Structural metadata:
COMPRESSOR { "name": "blosc", "configuration": { "typesize": 8, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } }
- /geopotential_height_500hpa:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: m
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Geopotential height"
short_name String "gh"
standard_name String "geopotential_height"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /geopotential_height_850hpa:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: m
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Geopotential height"
short_name String "gh"
standard_name String "geopotential_height"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /geopotential_height_925hpa:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: m
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Geopotential height"
short_name String "gh"
standard_name String "geopotential_height"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /ingested_forecast_length:
Dimensions: (/init_time)
Shape: [3264]
Chunk size: [23360]
Type: Float64
Unit: seconds
Nodata value: "NaN"
Attributes:
Name Type Value
---------------------- ------ --------------------------------------------------
_FillValue String "AAAAAAAA+H8="
dtype String "timedelta64[us]"
long_name String "Ingested forecast length"
statistics_approximate String {"min":"0 days 00:00:00","max":"15 days 00:00:00"}
Structural metadata:
COMPRESSOR { "name": "blosc", "configuration": { "typesize": 8, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } }
- /precipitation_surface:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: kg m-2 s-1
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ ------------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
comment String "Average precipitation rate since the previous forecast step. Units equivalent
to mm/s."
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Precipitation rate"
short_name String "prate"
standard_name String "precipitation_flux"
step_type String "avg"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /pressure_reduced_to_mean_sea_level:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: Pa
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Pressure reduced to MSL"
short_name String "prmsl"
standard_name String "air_pressure_at_mean_sea_level"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /pressure_surface:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: Pa
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Surface pressure"
short_name String "sp"
standard_name String "surface_air_pressure"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /spatial_ref:
Dimensions: ()
Shape: []
Type: Int64
Nodata value: 0
Attributes:
Name Type Value
--------------------------- ------- -----------------------------------------------------------------------------
comment String "This coordinate reference system matches the source data which follows WMO
conventions of assuming the earth is a perfect sphere with a radius of 6,371,
229m. It is similar to EPSG:4326, but EPSG:4326 uses a more accurate
representation of the earth's shape."
crs_wkt String "GEOGCS[\"unknown\",DATUM[\"unknown\",SPHEROID[\"unknown\",6371229,0]],
PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",
0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],
AXIS[\"Latitude\",NORTH]]"
geographic_crs_name String "unknown"
grid_mapping_name String "latitude_longitude"
horizontal_datum_name String "unknown"
inverse_flattening Float64 0
longitude_of_prime_meridian Float64 0
prime_meridian_name String "Greenwich"
reference_ellipsoid_name String "unknown"
semi_major_axis Float64 6371229
semi_minor_axis Float64 6371229
spatial_ref String "GEOGCS[\"unknown\",DATUM[\"unknown\",SPHEROID[\"unknown\",6371229,0]],
PRIMEM[\"Greenwich\",0,AUTHORITY[\"EPSG\",\"8901\"]],UNIT[\"degree\",
0.0174532925199433,AUTHORITY[\"EPSG\",\"9122\"]],AXIS[\"Longitude\",EAST],
AXIS[\"Latitude\",NORTH]]"
Structural metadata:
COMPRESSOR { "name": "zstd", "configuration": { "level": 0, "checksum": false } }
- /temperature_2m:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: degree_Celsius
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "2 metre temperature"
short_name String "2t"
standard_name String "air_temperature"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /temperature_850hpa:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: degree_Celsius
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Temperature"
short_name String "t"
standard_name String "air_temperature"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /temperature_925hpa:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: degree_Celsius
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Temperature"
short_name String "t"
standard_name String "air_temperature"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /total_cloud_cover_atmosphere:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: percent
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "Total cloud cover"
short_name String "tcc"
standard_name String "cloud_area_fraction"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /valid_time:
Dimensions: (/init_time, /lead_time)
Shape: [3264, 61]
Chunk size: [23360, 61]
Type: Int64
Unit: seconds since 1970-01-01
Nodata value: 0
Attributes:
Name Type Value
---------------------- ------ ---------------------------------------------
calendar String "proleptic_gregorian"
long_name String "Valid time"
standard_name String "time"
statistics_approximate String {"min":"2024-04-01T00:00:00","max":"Present"}
Structural metadata:
COMPRESSOR { "name": "blosc", "configuration": { "typesize": 8, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } }
- /wind_u_100m:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: m s-1
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "100 metre U wind component"
short_name String "100u"
standard_name String "eastward_wind"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /wind_u_10m:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: m s-1
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "10 metre U wind component"
short_name String "10u"
standard_name String "eastward_wind"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /wind_v_100m:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: m s-1
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "100 metre V wind component"
short_name String "100v"
standard_name String "northward_wind"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
- /wind_v_10m:
Dimensions: (/init_time, /lead_time, /latitude, /longitude)
Shape: [3264, 61, 721, 1440]
Chunk size: [1, 61, 241, 240]
Type: Float32
Unit: m s-1
Nodata value: "NaN"
Attributes:
Name Type Value
------------- ------ --------------------------------------------------------------------------
_FillValue String "AAAAAAAA+H8="
coordinates String "expected_forecast_length ingested_forecast_length spatial_ref valid_time"
long_name String "10 metre V wind component"
short_name String "10v"
standard_name String "northward_wind"
step_type String "instant"
Structural metadata:
COMPRESSOR { "name": "sharding_indexed", "configuration": { "chunk_shape": [ 1, 61, 241, 240 ], "codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "blosc", "configuration": { "typesize": 4, "cname": "zstd", "clevel": 3, "shuffle": "shuffle", "blocksize": 0 } } ], "index_codecs": [ { "name": "bytes", "configuration": { "endian": "little" } }, { "name": "crc32c" } ], "index_location": "end" } }
It’s a lot of output! For comparison let’s see what xarray will show as equivalent summary.
<xarray.Dataset> Size: 14TB
Dimensions: (init_time: 3264,
latitude: 721, lead_time: 61,
longitude: 1440)
Coordinates:
* init_time (init_time) datetime64[ns] 26kB ...
* latitude (latitude) float64 6kB 90.0 ....
* lead_time (lead_time) float64 488B 0.0 ...
* longitude (longitude) float64 12kB -180...
Data variables: (12/21)
dew_point_temperature_2m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
downward_long_wave_radiation_flux_surface (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
downward_short_wave_radiation_flux_surface (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
expected_forecast_length (init_time) float64 26kB dask.array<chunksize=(3264,), meta=np.ndarray>
geopotential_height_500hpa (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
geopotential_height_850hpa (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
... ...
total_cloud_cover_atmosphere (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
valid_time (init_time, lead_time) float32 796kB dask.array<chunksize=(3264, 61), meta=np.ndarray>
wind_u_100m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
wind_u_10m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
wind_v_100m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
wind_v_10m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3264, 61, 721, 1440), meta=np.ndarray>
Attributes:
attribution: ECMWF AIFS Single forecast data processed by dynami...
dataset_id: ecmwf-aifs-single-forecast
dataset_version: 0.1.0
description: Weather forecasts from the ECMWF Artificial Intelli...
forecast_domain: Forecast lead time 0-360 hours (0-15 days) ahead
forecast_resolution: 6 hourly
license: CC-BY-4.0
name: ECMWF AIFS Single forecast
spatial_domain: Global
spatial_resolution: 0.25 degrees (~20km)
time_domain: Forecasts initialized 2024-04-01 00:00:00 UTC to Pr...
time_resolution: Forecasts initialized every 6 hoursThat’s a lot denser, but the important thing is that GDAL has access to the same rich underlying Zarr data and note that xarray has clipped the full list of variables in the summary.
There are a few things to unpack in this cli call.
GDAL multidimensional
gdal mdim provides a family of apps for the GDAL multidimensional raster model - this means that array data is presented in its native form with potentially multiple dimensions, and multiple variables like the NetCDF4 or HDF5 file type, whereas the classic 2D raster model (a 2D grid with one or more bands) can only present one variable and must unroll higher dimension out as bands.
The xarray output and the gdal mdim info output above is human readable, for a programmatic summary we can call
gdal mdim info --format json \
/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk \
--config AWS_NO_SIGN_REQUEST YES --config AWS_REGION us-west-2and in R we can do
gdalraster::set_config_option("AWS_NO_SIGN_REQUEST", "YES")
gdalraster::set_config_option("AWS_REGION", "us-west-2")
json <- gdalraster::mdim_info("/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk", cout = FALSE)
cat(substr(json, 1, 1000))
# {
# "type": "group",
# "driver": "Icechunk",
# "name": "/",
# "attributes": {
# "attribution": "ECMWF AIFS Single forecast data processed by dynamical.org from ECMWF Open Data.",
# "dataset_id": "ecmwf-aifs-single-forecast",
# "dataset_version": "0.1.0",
# "description": "Weather forecasts from the ECMWF Artificial Intelligence Forecasting System (AIFS) Single model.",
# "forecast_domain": "Forecast lead time 0-360 hours (0-15 days) ahead",
# "forecast_resolution": "6 hourly",
# "license": "CC-BY-4.0",
# "name": "ECMWF AIFS Single forecast",
# "spatial_domain": "Global",
# "spatial_resolution": "0.25 degrees (~20km)",
# "time_domain": "Forecasts initialized 2024-04-01 00:00:00 UTC to Present",
# "time_resolution": "Forecasts initialized every 6 hours"
# },
# "dimensions": [
# {
# "name": "init_time",
# "full_name": "/init_time",
# "size": 3264,
# "indexing_variable": "/init_time"
# },
# {
# "name": "latitude",
# "full_name": "/latit... or with terra (in-dev, only on github atm)
terra::setGDALconfig("AWS_NO_SIGN_REQUEST", "YES")
terra::setGDALconfig("AWS_REGION", "us-west-2")
terra::rast("/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk")
# class : SpatRaster
# size : 721, 1440, 3384768 (nrow, ncol, nlyr)
# resolution : 0.25, 0.25 (x, y)
# extent : -180.125, 179.875, -90.125, 90.125 (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84 (CRS84) (OGC:CRS84)
# sources : v0.1.0.icechunk
# varnames : dew_point_temperature_2m (2 metre dewpoint temperature)
# downward_long_wave_radiation_flux_surface (Surface downward long-wave radiation flux)
# downward_short_wave_radiation_flux_surface (Surface downward short-wave radiation flux)
# geopotential_height_500hpa (Geopotential height)
# geopotential_height_850hpa (Geopotential height)
# ...
# names : dew_p~600_1, dew_p~600_2, dew_p~600_3, dew_p~600_4, dew_p~600_5, dew_p~600_6, ...
# unit : degree_Celsius, degree_Celsius, degree_Celsius, degree_Celsius, degree_Celsius, degree_Celsius, ...
# depth : 1.71193e+09 to 1.78241e+09 (: 3264 steps)
# time : 0
# Warning message:
# [rast] skipped array (different geometry): /valid_timeor sf/stars
Sys.setenv("AWS_NO_SIGN_REQUEST" = "YES")
Sys.setenv("AWS_REGION" = "us-west-2")
s <- sf::gdal_read_mdim("/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk", proxy = TRUE)
str(s)
List of 4
$ array_list:List of 17
..$ dew_point_temperature_2m : NULL
..$ downward_long_wave_radiation_flux_surface : NULL
..$ downward_short_wave_radiation_flux_surface: NULL
...Click here for the full sf structure info output
List of 4
$ array_list:List of 17
..$ dew_point_temperature_2m : NULL
..$ downward_long_wave_radiation_flux_surface : NULL
..$ downward_short_wave_radiation_flux_surface: NULL
..$ geopotential_height_500hpa : NULL
..$ geopotential_height_850hpa : NULL
..$ geopotential_height_925hpa : NULL
..$ precipitation_surface : NULL
..$ pressure_reduced_to_mean_sea_level : NULL
..$ pressure_surface : NULL
..$ temperature_2m : NULL
..$ temperature_850hpa : NULL
..$ temperature_925hpa : NULL
..$ total_cloud_cover_atmosphere : NULL
..$ wind_u_100m : NULL
..$ wind_u_10m : NULL
..$ wind_v_100m : NULL
..$ wind_v_10m : NULL
$ dimensions:List of 4
..$ init_time:List of 5
.. ..$ from : int 1
.. ..$ to : int 3264
.. ..$ values :List of 1
.. .. ..$ : num [1:3264(1d)] 1.71e+09 1.71e+09 1.71e+09 1.71e+09 1.71e+09 ...
.. .. .. ..- attr(*, "units")= chr "seconds since 1970-01-01"
.. .. .. ..- attr(*, "d_names")= chr "init_time"
.. .. .. ..- attr(*, "attributes")= Named chr [1:4] "proleptic_gregorian" "Forecast initialization time" "forecast_reference_time" "{ \"min\": \"2024-04-01T00:00:00\", \"max\": \"Present\" }"
.. .. .. .. ..- attr(*, "names")= chr [1:4] "calendar" "long_name" "standard_name" "statistics_approximate"
.. ..$ type : chr ""
.. ..$ direction: chr ""
..$ lead_time:List of 5
.. ..$ from : int 1
.. ..$ to : int 61
.. ..$ values :List of 1
.. .. ..$ : num [1:61(1d)] 0 21600 43200 64800 86400 ...
.. .. .. ..- attr(*, "units")= chr "seconds"
.. .. .. ..- attr(*, "d_names")= chr "lead_time"
.. .. .. ..- attr(*, "attributes")= Named chr [1:5] "AAAAAAAA+H8=" "timedelta64[us]" "Forecast lead time" "forecast_period" ...
.. .. .. .. ..- attr(*, "names")= chr [1:5] "_FillValue" "dtype" "long_name" "standard_name" ...
.. ..$ type : chr ""
.. ..$ direction: chr ""
..$ latitude :List of 5
.. ..$ from : int 1
.. ..$ to : int 721
.. ..$ values :List of 1
.. .. ..$ : num [1:721(1d)] 90 89.8 89.5 89.2 89 ...
.. .. .. ..- attr(*, "units")= chr "degree_north"
.. .. .. ..- attr(*, "d_names")= chr "latitude"
.. .. .. ..- attr(*, "attributes")= Named chr [1:4] "AAAAAAAA+H8=" "Y" "Latitude" "{ \"min\": -90.0, \"max\": 90.0 }"
.. .. .. .. ..- attr(*, "names")= chr [1:4] "_FillValue" "axis" "long_name" "statistics_approximate"
.. ..$ type : chr "HORIZONTAL_Y"
.. ..$ direction: chr ""
..$ longitude:List of 5
.. ..$ from : int 1
.. ..$ to : int 1440
.. ..$ values :List of 1
.. .. ..$ : num [1:1440(1d)] -180 -180 -180 -179 -179 ...
.. .. .. ..- attr(*, "units")= chr "degree_east"
.. .. .. ..- attr(*, "d_names")= chr "longitude"
.. .. .. ..- attr(*, "attributes")= Named chr [1:4] "AAAAAAAA+H8=" "X" "Longitude" "{ \"min\": -180.0, \"max\": 179.75 }"
.. .. .. .. ..- attr(*, "names")= chr [1:4] "_FillValue" "axis" "long_name" "statistics_approximate"
.. ..$ type : chr "HORIZONTAL_X"
.. ..$ direction: chr ""
$ srs : chr NA
$ geometry : list()
That is leveraged by stars to give
Sys.setenv("AWS_NO_SIGN_REQUEST" = "YES")
Sys.setenv("AWS_REGION" = "us-west-2")
stars::read_mdim("/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk", proxy = TRUE)
# stars_proxy object with 17 attributes in 17 file(s):
# $dew_point_temperature_2m
# [1] "[...]/v0.1.0.icechunk"
#
# $downward_long_wave_radiation_flux_surface
# [1] "[...]/v0.1.0.icechunk"
#
# ...
#
# $temperature_925hpa
# [1] "[...]/v0.1.0.icechunk"
#
# $total_cloud_cover_atmosphere
# [1] "[...]/v0.1.0.icechunk"
#
# $wind_u_100m
# [1] "[...]/v0.1.0.icechunk"
#
# $wind_u_10m
# [1] "[...]/v0.1.0.icechunk"
#
# $wind_v_100m
# [1] "[...]/v0.1.0.icechunk"
#
# $wind_v_10m
# [1] "[...]/v0.1.0.icechunk"
#
# dimension(s):
# from to offset delta refsys x/y
# longitude 1 1440 -180.1 0.25 NA [x]
# latitude 1 721 90.12 -0.25 NA [y]
# lead_time 1 61 0 [s] 21600 [s] udunits
# init_time 1 3264 2024-04-01 UTC 6 hours POSIXct Please note that this store contains pretty simple regular array data, and so the relationship of data arrays to their coordinates is simple, the map in longitude/latitude is a regular grid and so slices of that can correspond to the normal raster model where a bounding box (or corner coordinates + resolution) is sufficient to georeference the data. Multidimensional data can be more general and more complex than that.
This post won’t explore any more detail of the various ways R packages support mdim.
Multidimensional?
Please hang in here for a few sections, we are diverting down a GDAL and object storage rabbit hole, but it’s not long and then we’ll get back to Icechunk in the naming a thing section.
We can treat this store as raster data in the normal 2D way, but note that there are caveats and real performance implications depending on the way data is laid out in a given store.
gdalinfo /vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk --config AWS_NO_SIGN_REQUEST YES --config AWS_REGION us-west-2
Driver: Icechunk/Icechunk
Files: /vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk
Size is 512, 512
Subdatasets:
SUBDATASET_1_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/dew_point_temperature_2m
SUBDATASET_1_DESC=[3264x61x721x1440] /dew_point_temperature_2m (Float32)
SUBDATASET_2_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/downward_long_wave_radiation_flux_surface
SUBDATASET_2_DESC=[3264x61x721x1440] /downward_long_wave_radiation_flux_surface (Float32)
SUBDATASET_3_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/downward_short_wave_radiation_flux_surface
SUBDATASET_3_DESC=[3264x61x721x1440] /downward_short_wave_radiation_flux_surface (Float32)
SUBDATASET_4_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/geopotential_height_500hpa
SUBDATASET_4_DESC=[3264x61x721x1440] /geopotential_height_500hpa (Float32)
SUBDATASET_5_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/geopotential_height_850hpa
SUBDATASET_5_DESC=[3264x61x721x1440] /geopotential_height_850hpa (Float32)
SUBDATASET_6_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/geopotential_height_925hpa
SUBDATASET_6_DESC=[3264x61x721x1440] /geopotential_height_925hpa (Float32)
SUBDATASET_7_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/precipitation_surface
SUBDATASET_7_DESC=[3264x61x721x1440] /precipitation_surface (Float32)
SUBDATASET_8_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/pressure_reduced_to_mean_sea_level
SUBDATASET_8_DESC=[3264x61x721x1440] /pressure_reduced_to_mean_sea_level (Float32)
SUBDATASET_9_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/pressure_surface
SUBDATASET_9_DESC=[3264x61x721x1440] /pressure_surface (Float32)
SUBDATASET_10_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/temperature_2m
SUBDATASET_10_DESC=[3264x61x721x1440] /temperature_2m (Float32)
SUBDATASET_11_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/temperature_850hpa
SUBDATASET_11_DESC=[3264x61x721x1440] /temperature_850hpa (Float32)
SUBDATASET_12_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/temperature_925hpa
SUBDATASET_12_DESC=[3264x61x721x1440] /temperature_925hpa (Float32)
SUBDATASET_13_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/total_cloud_cover_atmosphere
SUBDATASET_13_DESC=[3264x61x721x1440] /total_cloud_cover_atmosphere (Float32)
SUBDATASET_14_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/valid_time
SUBDATASET_14_DESC=[3264x61] /valid_time (Int64)
SUBDATASET_15_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/wind_u_100m
SUBDATASET_15_DESC=[3264x61x721x1440] /wind_u_100m (Float32)
SUBDATASET_16_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/wind_u_10m
SUBDATASET_16_DESC=[3264x61x721x1440] /wind_u_10m (Float32)
SUBDATASET_17_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/wind_v_100m
SUBDATASET_17_DESC=[3264x61x721x1440] /wind_v_100m (Float32)
SUBDATASET_18_NAME=ZARR:"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}":/wind_v_10m
SUBDATASET_18_DESC=[3264x61x721x1440] /wind_v_10m (Float32)
Corner Coordinates:
Upper Left ( 0.0, 0.0)
Lower Left ( 0.0, 512.0)
Upper Right ( 512.0, 0.0)
Lower Right ( 512.0, 512.0)
Center ( 256.0, 256.0)Those strings like "ZARR:\"/vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk}\":/dew_point_temperature_2m" (note the embedded escaped quotes) can be used by the more tradition GDAL classic raster model.
We’ll come back to the /vsiicechunk and {/vsis3} parts of this string a bit later.
Setting config options and environment variables for GDAL
GDAL config options can be set through a package wrapper - gdalraster::set_config_option("AWS_NO_SIGN_REQUEST", "YES") or terra::setGDALconfig("AWS_NO_SIGN_REQUEST", "YES") - or as process environment variables (Sys.setenv(AWS_NO_SIGN_REQUEST = "YES"), or KEY=value in the shell). Both routes are global and sticky: once set they can affect every other step, spooky-action-at-a-distance that we want to avoid.
That’s why setting config inline - --config KEY=VALUE on the gdal mdim info call is a clean option: it lives and dies in one line, a bit like with in Python or with() in R. The R equivalent for env vars and options is withr::with_envvar() / withr::with_options(), true scoped context managers.
With online storage it can be bewildering what exact configuration you need and what you have, but it does come down to a fairly comprehensible set of ideas. This post is not a guide to online storage access configuration, or a guide to GDAL configuration options but there’s a finite set of concepts in this jungle so don’t be put off by the complexity.
Auth - keyed vs anonymous. If a dataset is not public, you will need something like AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. If it is public, you will need something like AWS_NO_SIGN_REQUEST=YES (or --no-sign-request).
Addressing. Everything is a key inside a bucket. You reference an object as s3://bucket/key (Amazon), gs://bucket/key (Google Storage), abfss://… (Azure storage) - and that’s the whole model. (Azure bends it slightly: a storage account holds containers holding blobs, and the scheme carries the account, abfss://container@account.dfs.core.windows.net/key.)
Endpoint, region, hosting-style. That object is hosted on an endpoint - s3:// is shorthand for s3.<region>.amazonaws.com, gs:// for storage.googleapis.com. A custom endpoint means pointing at your own host - AWS_S3_ENDPOINT=your.endpoint.net - so that s3 in s3://bucket/key resolves to your host and not Amazon’s. Virtual hosting just means the bucket goes before the endpoint in the final url https://bucket.s3.<region>.amazonaws.com/key vs https://my.custom.endpoint.net/bucket/key. It’s confusing, until you know what it does and when you might need it. For non-AWS S3 implementations (MinIO, Ceph) this usually goes together with path-style addressing, AWS_VIRTUAL_HOSTING=NO, since most of them don’t do virtual-host style. Again on a custom S3 AWS_REGION is sometimes needed but probably shouldn’t matter - set it to ““.
GDAL and the virtual filesystem
Everything above is generic - true of any R tool talking to object storage - and GDAL addresses remote objects through its own virtual filesystem (VSI) rather than the s3://-style schemes. Again, bucket, key but with a different prefix:
| cloud-native | GDAL VSI |
|---|---|
s3://bucket/key |
/vsis3/bucket/key |
gs://bucket/key |
/vsigs/bucket/key |
abfss://container/key |
/vsiaz/container/key |
This is the most common first stumble: s3://… handed to terra::rast() or vapour::gdal_raster_data() won’t open; the /vsis3/… form will. Be aware that some packages in R and other langs will do their own heuristics to wrap different protocols: ‘https://’, ‘s3://’, ‘virtualizarr://’ etc - this can be helpful but also very confusing, and this post cannot summarize that landscape. When GDAL is in use my preference is to use GDAL protocols without wrapping. The config options from above (AWS_NO_SIGN_REQUEST, AWS_S3_ENDPOINT, and friends) all still apply unchanged - VSI only changes how you name the object, it doesn’t affect how or if it’s accessible.
The same /vsi* mechanism abstracts over a whole zoo of access methods, all of which GDAL then reads as if they were ordinary local files:
| prefix | what it opens |
|---|---|
/vsicurl/ |
any file over plain HTTP/HTTPS (range-requested, not downloaded whole) |
/vsizip/ |
a file inside a ZIP archive |
/vsitar/, /vsigzip/ |
likewise for TAR / gzip |
/vsimem/ |
an in-memory buffer |
/vsiswift/, /vsioss/ |
OpenStack Swift, Alibaba OSS, … |
/vsistdout/ |
stream output straight to standard out |
The real payoff is that these compose - you chain prefixes right-to-left to stack the abstractions. To read a single GeoTIFF sitting inside a ZIP that lives in an S3 bucket, none of which you download in full:
/vsizip//vsis3/bucket/archive.zip/inside.tifGDAL signs the S3 request, range-reads the ZIP’s central directory, pulls just the bytes for inside.tif, and hands you a raster. That’s the whole world the VSI layer opens up: object storage, archives, compression, and HTTP are all just prefixes over the same “name a thing, read its bytes” idea.
Naming a thing inside a thing
That earlier chain glued its prefixes by plain concatenation - /vsizip/ meets /vsis3/, and since both carry a leading slash you get the doubled //. GDAL finds the archive boundary inside that run by heuristic: it spots the .zip. The brace form, /vsizip/{…}, makes the boundary explicit rather than guessed - and once the inner path carries its own punctuation (a URL with a query string, say) explicit is the only thing that works.
GEBCO 2026 ships its global grid as a single zip behind a download URL ending …gebco_2026_geotiff.zip?download=1. To reach into it without pulling the whole archive, chain /vsizip/ over /vsicurl/, braces around the inner path because of that ?download=1 which masks the fact that it’s a zip file:
/vsizip/{/vsicurl/https://dap.ceda.ac.uk/bodc/gebco/global/gebco_2026/ice_surface_elevation/geotiff/gebco_2026_geotiff.zip?download=1}If we point GDAL at that archive it can’t return a single raster, so it tells you what’s in there instead:
ERROR 4: ... not recognized as being in a supported file format.
The archive contains several files:
…}/gebco_2026_n0.0_s-90.0_w0.0_e90.0_geotiff.tif
…}/gebco_2026_n0.0_s-90.0_w-180.0_e-90.0_geotiff.tif
… (six more files)
…}/GEBCO_Grid_documentation.pdf
…}/GEBCO_Grid_terms_of_use.pdfThat listing is a logical filesystem - real files physically inside the zip in an opaque form, enumerated out of a remote archive. Append one of them after the closing brace and you’ve named a single item; GDAL opens it.
Compare that to the subdataset syntax, multiple subdatasets of classic 2D raster we saw in the Icechunk store above:
ZARR:"/vsiicechunk/{/vsis3/.../v0.1.0.icechunk}":/dew_point_temperature_2mAgain we have bracing of an internal /vsi* item but the thing named after it, /dew_point_temperature_2m, isn’t a file but a variable in a logical hierarchy, a node in the Zarr data model. /vsizip/ descends into the structural contents of an archive, ZARR: over /vsiicechunk/ into the logical contents of a data model - that’s why we see different forms of addressing in stores.
GDAL can list things, here we use gdal vsi list to see the structural contents:
gdal vsi list /vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk \
--config AWS_NO_SIGN_REQUEST YES --config AWS_REGION us-west-2
repo
chunks
manifests
overwritten
snapshots
transactionsok we think, those look like directories so let’s go deeper (eek!):
gdal vsi list /vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk/chunks \
--config AWS_NO_SIGN_REQUEST YES --config AWS_REGION us-west-2
0003DSF0M0N9B5MP3R9G
0010VXY96ABV0RP45T30
0015TP2QVGT990JE9PM0
...yes it’s valid content but not something we can understand - we are seeing the internal structure of a complex sophisticated cloud format. We know it’s Icechunk so lets wield the Icechunk virtualization tool
gdal vsi list /vsiicechunk/{/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk} \
--config AWS_NO_SIGN_REQUEST YES --config AWS_REGION us-west-2
zarr.json
dew_point_temperature_2m
downward_long_wave_radiation_flux_surface
downward_short_wave_radiation_flux_surface
expected_forecast_length
geopotential_height_500hpa
geopotential_height_850hpa
geopotential_height_925hpa
ingested_forecast_length
init_time
latitude
lead_time
longitude
precipitation_surface
pressure_reduced_to_mean_sea_level
pressure_surface
spatial_ref
temperature_2m
temperature_850hpa
temperature_925hpa
total_cloud_cover_atmosphere
valid_time
wind_u_100m
wind_u_10m
wind_v_100m
wind_v_10mand now we are back in logical ground, this is the interpretation that gdal mdim info gave us above, not raw listing of the structural contents.
That shows some of the depth of GDAL abstraction. “Name a thing, read its bytes” it’s not just bytes on a disk. A thing can be an object in a bucket, a member of an archive, a variable in a chunked store, etc - and the path is the recipe for access.
The distinction between material structure (thousands of small manifest files on S3) and logical/virtual structure (the clean array hierarchy you see via /vsiicechunk/) is a key icechunk concept. GDAL’s /vsi* filesystem stack handles this transparently - the same way /vsicurl/, /vsis3/, /vsigs/ abstract over HTTP and object stores.
What about python?
Above we saw how GDAL and some R packages interact with the Icechunk store, how do we do that in Python?
This time we don’t have a single description of the store, we have to use infrastructure to encode the details.
import icechunk
import xarray
storage = icechunk.s3_storage(
bucket="dynamical-ecmwf-aifs-single",
prefix="ecmwf-aifs-single-forecast/v0.1.0.icechunk",
region="us-west-2",
anonymous=True, # the AWS_NO_SIGN_REQUEST=YES equivalent
)
repo = icechunk.Repository.open(storage)
session = repo.readonly_session("main")
ds = xarray.open_zarr(session.store, consolidated=False)
ds
# <xarray.Dataset> Size: 14TB
# Dimensions: (init_time: 3265,
# lead_time: 61, latitude: 721,
# longitude: 1440)
# Coordinates:
# * init_time (init_time) datetime64[ns] 26kB ...
# expected_forecast_length (init_time) timedelta64[us] 26kB dask.array<chunksize=(3265,), meta=np.ndarray>
# ingested_forecast_length (init_time) timedelta64[us] 26kB dask.array<chunksize=(3265,), meta=np.ndarray>
# * lead_time (lead_time) timedelta64[us] 488B ...
# valid_time (init_time, lead_time) datetime64[ns] 2MB dask.array<chunksize=(3265, 61), meta=np.ndarray>
# * latitude (latitude) float64 6kB 90.0 ....
# * longitude (longitude) float64 12kB -180...
# spatial_ref int64 8B ...
# Data variables: (12/17)
# dew_point_temperature_2m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# downward_long_wave_radiation_flux_surface (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# geopotential_height_500hpa (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# geopotential_height_850hpa (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# precipitation_surface (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# downward_short_wave_radiation_flux_surface (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# ... ...
# wind_u_100m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# temperature_925hpa (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# total_cloud_cover_atmosphere (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# wind_u_10m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# wind_v_100m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# wind_v_10m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(1, 61, 241, 240), meta=np.ndarray>
# Attributes:
# dataset_id: ecmwf-aifs-single-forecast
# dataset_version: 0.1.0
# name: ECMWF AIFS Single forecast
# description: Weather forecasts from the ECMWF Artificial Intelli...
# attribution: ECMWF AIFS Single forecast data processed by dynami...
# license: CC-BY-4.0
# spatial_domain: Global
# spatial_resolution: 0.25 degrees (~20km)
# time_domain: Forecasts initialized 2024-04-01 00:00:00 UTC to Pr...
# time_resolution: Forecasts initialized every 6 hours
# forecast_domain: Forecast lead time 0-360 hours (0-15 days) ahead
# forecast_resolution: 6 hourlyWe looked above at this xarray representation but up there we didn’t use the same infrastructure or show the code. The one above was made with
import os
import xarray
os.environ["AWS_NO_SIGN_REQUEST"] = "YES"
os.environ["AWS_REGION"] = "us-west-2"
ds = xarray.open_dataset("/vsis3/dynamical-ecmwf-aifs-single/ecmwf-aifs-single-forecast/v0.1.0.icechunk", engine = "gdalxarray", chunks = {})
# <xarray.Dataset> Size: 14TB
# Dimensions: (init_time: 3265,
# latitude: 721, lead_time: 61,
# longitude: 1440)
# Coordinates:
# * init_time (init_time) datetime64[ns] 26kB ...
# expected_forecast_length (init_time) float64 26kB dask.array<chunksize=(3265,), meta=np.ndarray>
# ingested_forecast_length (init_time) float64 26kB dask.array<chunksize=(3265,), meta=np.ndarray>
# * latitude (latitude) float64 6kB 90.0 ....
# * lead_time (lead_time) float64 488B 0.0 ...
# valid_time (init_time, lead_time) float32 797kB dask.array<chunksize=(3265, 61), meta=np.ndarray>
# * longitude (longitude) float64 12kB -180...
# spatial_ref float32 4B ...
# Data variables: (12/17)
# dew_point_temperature_2m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# downward_long_wave_radiation_flux_surface (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# downward_short_wave_radiation_flux_surface (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# geopotential_height_500hpa (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# geopotential_height_850hpa (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# geopotential_height_925hpa (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# ... ...
# temperature_925hpa (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# total_cloud_cover_atmosphere (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# wind_u_100m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# wind_u_10m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# wind_v_100m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# wind_v_10m (init_time, lead_time, latitude, longitude) float32 827GB dask.array<chunksize=(3265, 61, 721, 1440), meta=np.ndarray>
# Attributes:
# attribution: ECMWF AIFS Single forecast data processed by dynami...
# dataset_id: ecmwf-aifs-single-forecast
# dataset_version: 0.1.0
# description: Weather forecasts from the ECMWF Artificial Intelli...
# forecast_domain: Forecast lead time 0-360 hours (0-15 days) ahead
# forecast_resolution: 6 hourly
# license: CC-BY-4.0
# name: ECMWF AIFS Single forecast
# spatial_domain: Global
# spatial_resolution: 0.25 degrees (~20km)
# time_domain: Forecasts initialized 2024-04-01 00:00:00 UTC to Pr...
# time_resolution: Forecasts initialized every 6 hoursThat seems very similar, and hopefully for the in-dev GDAL xarray backend gdalxarray it’s an equivalent connection (I’m not going to validate any further here in this post).
What this illustrates is while the configuration set up can be confusing, these are the very same generic store-auth settings cast different ways. Is it desirable to consolidate these? Is one necessarily better or more capable? Icechunk-Python auth and addressing through typed constructor arguments, GDAL through the VSI path plus config.
What about not python?
I don’t have a strong opinion on what’s the right moves, I do have a strong opinion though that this not just a Python community. It was hard looking over the fence at xarray and Zarr and going ‘uhoh, this looks like I can’t ignore it’. The landscape and GDAL’s particular take on it is something I had the privilege to take a long deep dive on. I can’t ever remember the typed constructor incantation above - but I also web search EMPTY_DIR to find GDAL_DISABLE_READDIR_ON_OPEN which I just can’t recall though I reach for it often. This for me is about opening up to audiences that are not already in the tent, and maybe some who won’t ever use Python.
Should we have easier wrappers? Write the helper that sniffs an s3:// address and sets the right path-style / anonymous / region knobs? How do we make that cross-language? I love everything out in the open, protocol://path/to/data, a few env vars a few function arguments, but env vars and function calls are hard won expertise.
Is consolidating these worth doing? Is there any scope for that or prior art?
Other R Zarr readers
Beyond GDAL, there is growing native R support for Zarr:
- pizzarr (CRAN) - pure R Zarr V2 reader; r-universe builds include an optional high-performance Rust backend via the zarrs crate
- Rarr (Bioconducator) - A simple native R reader for Zarr Arrays, Rarr is actively developed and has good support for a growing number of codecs
- zarr (CRAN) - a native implementation in R with support for all required features of Zarr version 3.
- CopernicusMarine - Subset and download marine data from EU Copernicus Marine Service Information, has an in-built Zarr engine designed around the service
- zaro (r-universe) - Low-level Zarr interface (V2 and V3) built on Arrow filesystem and codec infrastructure
- RNetCDF, ncdf4 - with the right build set up NetCDF has Zarr support, and this for example has been proved for using with tidync
Work is ongoing on multidim API coverage in the R ecosystem, I particularly want tight api bindings to the GDAL API with gdalraster but that will be a while still in the making.
The GDAL multidim API (used by /vsiicechunk/) is distinct from the classic 2D raster driver - terra now auto-detects which to use for the multidim drivers which are VRT/HDF5/NetCDF/Zarr/GRIB. If terra gives unexpected results, check whether it’s related to recent changes in mdim support.
For some time stars via sf has support for accessing Zarr via the classic or multidimensional raster model of GDAL.
For Icechunk specifically see icechunk.io and see the GDAL icechunk driver docs at gdal.org icechunk and ZARR driver at gdal.org zarr.