Michael D. Sumner



For some time I have used GDAL as a standard tool in my kit, I was introduced to the concept by the rgdal package authors and it slowly dawned on me what it meant to have a geo-spatial data abstraction library. To realize what this meant I had spent a lot of time in R, reading (primarily) MapInfo TAB and MIF format files as well (of course) as shapefiles, and the occasional GeoTIFF.

I already knew how immensely powerful R was, with its epic flexibity and useability and I could just sense there was a brighter way once I understood many more details. As my experience grew I was able to do amazing tasks like, merge a few shapefiles together into one, or plot a window of data from a georeferenced grid. Previously the best I’d done in this space was VBScript in Manifold GIS, which I could use to automate some data tasks - but the prospects of full automation from raw data files to output, end-to-end with a software tool that anyone could use was absolutely mind-blowing. I was super-powered, I remember earning a carton of beer from a colleague of my father, for munging some SHP or TAB files between AGD66 and GDA94 … or something, and I knew I had a bright future ahead.

So what’s the abstraction? GDAL does not care what format the data is in, it could be points, lines, areas, a raster DEM, a time series of remote sensing, or an actual image. It just doesn’t mind, there’s an interpretation in its model for what’s in the file (or data base, or web server) and it will deliver that interpretation to you, very efficiently. If you understand that intepretation you can do a whole lot of amazing stuff. When this works well it’s because the tasks are modular, you have a series of basic tools designed to work together, and it’s up to you as a developer or a user to chain together the pieces to solve your particular problem.

Where does this get difficult?

GDAL is a C++ library, and that’s not accessible to most users or developers. The other key user-interface is the set of command line utilities, these are called gdal_translate, gdalinfo, ogr2ogr, ogrinfo, and many others. The command line is also not that accessible to many users, though it’s more so than C++ - this is why command line is a key topic for Software Carpentry. These interfaces give very high fidelity to the native interpretation provided by the GDAL model.

GDAL is used from many languages, there’s Python, R, Perl, C#, Fortran, and it is bundled into many, many softwares - a very long list. The original author wrote code for some of the most influential geo-spatial software the world has, and some of that is in GDAL, some is locked up forever in propietary forms. He saw this as a problem and very early on engineered the work to be able to be open, in the do anything with me, including privatize me-license called MIT. Have a look in the source code for gdalwarp, you’ll see the company who was the best at raster reprojection in the late 1990s and early 2000s.

Python is surely the closest other language to the native interpretation, but then it’s not that simple, and this is not that story …

R has a very particular interpretation of the GDAL interpretation, it’s called rgdal and if you are familiar with the GDAL model and with R you can see a very clear extra layer there. This extra level is there partly because of when it was done, the goals of the authors, the community response to the amazingly powerful facilities it provided, but also and perhaps mainly because it was very hard. R’s rather peculiar API meant that in the early 2000s the authors had to write in another language, a language between the native GDAL C++ and the R user language - this is the R API, it’s full of SEXP and UNPROTECTs and if you search this issue you’ll see clear signals not to bother - now you should just get on with learning Rcpp.

These extra levels are there in the R API, the hard stuff down in the rgdal/src/ folder but also in the R code itself. There’s a bunch of rigorous rules applied there, things to help protect us from that lower level, and to save from making serious analytical mistakes. All of this was very hard work and very well-intended, but it’s clear that it takes us away from the magic of the GDAL abstraction, we have a contract with rgdal, the R code has a contract with the R API, and the rgdal/src has a contract with GDAL. All of these things divorce R users and developers from the original schemes that GDAL provides, because at the R level rgdal itself has to provide certain guarantees and contracts with both R and with R users. I think that is too much for one package.

Add to this the complex zoo of formats, the other libraries that GDAL requires for full use. The Windows rgdal on CRAN doesn’t include HDF4, or HDF5, or NetCDF, or DODS - there are many missing things in this list, and it’s not clear if it’s because it’s hard, it’s against the license (ECW might be tricky, MrSID most definitely would be), or because no one has asked or maybe no one knows how, or maybe CRAN doesn’t want to. (Would you know how to find out?) All of these things add up to being way too much for one package. It’s kind of impossible, though now there are many more eyes on the problem and progress is being made. Who should decide these things? How would anyone know it’s even an issue?

I wonder if many of us see rgdal as the definition of the GDAL abstraction. I see pretty clearly the difference, and while the package has been extremely useful for me I’ve long wanted lower level control and access to the GDAL core itself. (I had rather influential guidance from extremely expert programmers I’ve worked with, and I’ve discussed GDAL with many others, including employees of various companies, and across various projects and with many users. I assume most R users don’t know much about the details, and why would they want to?).

There is active work to modernize rgdal, and you should be aware of the immensely successful sf and the soon to be stars project. sf is an R interpretation of the Simple Features Standard (GDAL has an interpretation of that standard too, sf starts there when it reads with GDAL). stars will start with GDAL as a model for gridded data, and it’s not yet fleshed out what the details of that will be. None of these interpretations are permanent, though while the simple features standard is unlikely to change, there is no doubt that GDAL will evolve and include more features that don’t involve that standard. These things do change and very few people, relatively, are engaged in the decisions that are made. (GDAL could definitely benefit from more input, also something I’ve long wanted to do more of).

It’s been a long time, but I’ve recently found a way over the key obstacle I had - building a package to compile for use in R with bindings to GDAL itself. I have a lot to thank Roger Bivand and Edzer Pebesma for many years of instruction and guidance in that, in many different ways. I also am extremely grateful to R-Core for the overall environment, and the tireless work done by the CRAN maintainers. I have to mention Jeroen Ooms and Mark Padgham who’ve been extremely helpful very recently. This is something I’ve wanted to be able to do for a really long time, I hope this post helps provide context to why, and I hope it encourages some more interest in the general topic.


My response to the interpretation layers is vapour, this is my version of a very minimal interpretation of the core GDAL facilities. There’s a function to read the attribute data from geometric features, a function to read the geometry data as raw binary, or various text formats, a function to read only the bounding box of each feature, and there’s a function to read the raw pixel values from a local window within a gridded data set.

None of this is new, we have R packages that do these things and the vapour functions will have bugs, and will need maintenance, and maybe no one but me will ever use them. I’ve needed them and I’ve started to learn a whole lot more about what I’m going to do next with them. I recommend that any R user with an interest in geo-spatial or GDAL facilities have a closer look at how they work - and if you know there’s a lower level below the interpretations provide in R you should explore them. Rcpp and the modern tools for R really do make this immensely more easy than in the past (RStudio has intellisense for C++ …).

I also believe strongly that R is well-placed to write the future of multi-dimensional and hierarchical and complex structured and geo-spatial data. Do you know what that future should look like?