Fiona and Rasterio
Data Access for Python Programmers and Future Python Programmers
Sean Gillies @Mapbox
FOSS4G 2014 ∙ Portland ∙ September 10
Hi! Good morning! Thank you very much for coming to this sessiom. My name
is Sean Gillies, I work at Mapbox.
I've been a Python programmer since 2001 and a GIS analyst and programmer
since 1999, with a séjour in the digital classics from 2006 to 2013.
I've been coming to FOSS4G off and on since it was the MapServer User
Meeting.
New Python packages
Fiona and Rasterio are Python interfaces to OGR and GDAL
Each are native GeoJSON speakers
Each embrace the good parts of Python
Each embrace the command line
My talk is about new Python geospatial software and a double challenge in developing it.
Why are we writing new geodata access libraries?
How do you use it yourselves?
A Double Challenge
Helping experienced Python programmers learn GIS concepts
Helping GIS experts learn to be better Python programmers
My talk is about new Python geospatial software and a double challenge in developing it.
Why are we writing new geodata access libraries?
How do you use it yourselves?
Mapbox Cloudless
The Mapbox Satellite Team is making a Cloudless Landsat mosaic of the
world.
This is a base layer for maps created by Mapbox users. Its primary
requirements are to be current and to look good. We'll be working
on products that are better for science in the future.
This is the project that drives a lot of Rasterio development.
It has interesting technical problems.
We have interesting problems
Off-the-shelf solutions are scarce
Trial and error is involved
We need new software
Trial and error isn't our only method, but it's an important one. It
has a big impact on our requirements.
Software requirements
Allow rapid prototyping and iteration
Use dependable algorithms
Scale out to many computers
We need to get new iterations of our algorithms into production
quickly. We need robust numerical libraries. And we need to distribute
the work across many servers. And also know the cost of doing it.
Fulfilling requirements
1 and 3 argue for open source
1 argues for a high-level language with handy multi-dimensional array syntax
2 argues for LAPACK (&c) and GDAL
The fit: Linux, Python, Numpy, Scipy
GDAL Python bindings?
They've served us long and well
But they don't fit well with the good parts of Python
We can do better
We can do better while keeping the power of the GDAL and OGR libraries.
The good parts
What do I mean by the good parts of Python?
I've made a Venn diagram of unique good features of three common languages.
The "good parts" of each. It's not exhaustive, I'm leaving out some good
parts.
We can argue over whether I got these exactly right and I hope we will.
The image is derived from http://commons.wikimedia.org/wiki/File:Venn_diagram_cmyk.svg.
Anybody recognize the yellow language? I'll give you a hint: the other
two are implemented in it. Right: C/C++.
Anybody recognize the magenta language? Right, javascript.
Anybody recognize the blue language? Right, Python.
The good parts of Python are lacking in Javascript and C. Wrapper
generators like SWIG give you the features in the overlap between Python
and C, but have a hard time with the good parts of the dynamic language.
Bottom line is that GDAL's Python bindings aren't a good match for the
good parts of Python.
I'm thinking about the design of D3 and Mike Bostock's explanation of
what its user gain by D3's embrace of the DOM and browser standards.
I'm convinced that spatial programmers can similarly benefit from software that embraces Python.
OGR Example
from osgeo import ogr
# Create ring
ring = ogr.Geometry(ogr.wkbLinearRing)
ring.AddPoint(1179091.1646903288, 712782.8838459781)
ring.AddPoint(1161053.0218226474, 667456.2684348812)
ring.AddPoint(1214704.933941905, 641092.8288590391)
ring.AddPoint(1228580.428455506, 682719.3123998424)
ring.AddPoint(1218405.0658121984, 721108.1805541387)
ring.AddPoint(1179091.1646903288, 712782.8838459781)
# Create polygon
poly = ogr.Geometry(ogr.wkbPolygon)
poly.AddGeometry(ring)
[source: http://pcjericks.github.io/py-gdalogr-cookbook/]
This is the classic way to create an osgeo.ogr geometry object.
It could be tightened up a bit.
Analogy
# C style
nums = []
for i in range(100):
nums.append(i)
vs
# Python style
nums = list(range(100))
I've got an analogy to Python list construction.
First, I'm showing the most imperative, C-like way of making a Python list of the first 100 numbers.
Next, the most idiomaticly Python way of doing it.
Not just less code – faster
$ python -m timeit -s "nums=[]" "for i in xrange(1000):" " nums.append(i)"
10000 loops, best of 3: 76.4 usec per loop
vs
$ python -m timeit -s "nums=list(xrange(1000))"
100000000 loops, best of 3: 0.0108 usec per loop
The idomaticly Python way is faster. ridiculously faster!
Do this in your code.
Provide APIs like this in your Python packages.
Literal syntax: {} for geometries
# Create polygon
poly = {
'type': 'Polygon',
'coordinates': [[
(1179091.1646903288, 712782.8838459781),
(1161053.0218226474, 667456.2684348812),
(1214704.933941905, 641092.8288590391),
(1228580.428455506, 682719.3123998424),
(1218405.0658121984, 721108.1805541387),
(1179091.1646903288, 712782.8838459781) ]]}
30X faster than OGR with AddPoint()
Pay only for what you eat
# This is "dumb" data, no methods.
data = {
'type': 'Polygon',
'coordinates': [[
(1179091.1646903288, 712782.8838459781),
...
(1179091.1646903288, 712782.8838459781) ]]}
# When you need spatial methods, bring in Shapely.
from shapely.geometry import shape
print shape(data).buffer(100.0).area
Winning the Double Challenge
Python programmers get GIS data access via familiar Python idioms (dicts, iterators, &c.)
Future Python programmers learn to do things in the fast and effective Python way
Design of Fiona and Rasterio
A Python package at the top
Extension modules (using Cython) in the middle
Fast loops, typed memoryviews, "nogil" blocks
GDAL shared library on the bottom
Cython takes care of a lot of C extension details for us.
Rasterio uses Cython features like fast loops, typed memoryviews,
and "nogil" blocks.
Reading raster data
import rasterio
with rasterio.open(path) as src:
data = src.read()
open()
gives you a file-like dataset object
read()
gives you a Numpy ndarray
Read windows of data with extended slice-like syntax
Rasterio is geared for band-interleaved data.
New read() method coming in 0.9.
Reading vector data
import fiona
with fiona.open(path) as src:
first = next(src)
open()
gives you a file-like iterator
next()
gives you the next feature record from the iterator
Records are Python dicts
Writing raster data
kwargs = src.meta
with rasterio.open(path, 'w', **kwargs) as dst:
dst.write(arr)
Get keyword args needed to open a dataset for writing from another dataset
write()
takes an ndarray
You can also write to windows of a dataset
Writing vector data
kwargs = src.meta
with fiona.open(path, 'w', **kwargs) as dst:
dst.write(record)
Get keyword args needed to open a dataset for writing from another dataset
write()
takes a feature record dict
Georeferencing
Fiona and Rasterio follow the lead of pyproj
>>> import rasterio
>>> src = rasterio.open('tests/data/RGB.byte.tif')
>>> src.crs
{'units': 'm', 'zone': 18, 'ellps': 'WGS84', 'proj': 'utm', 'no_defs': True}
PROJ.4 items as a dict.
rasterio.features module
rasterio.features.shapes()
yields all the features of an array as GeoJSON-like objects
rasterio.features.rasterize()
"burns" GeoJSON-like objects into an array
Dicts, iterators, tuples, ndarrays
No datasets or layers necessary
{'coordinates': [[(71.0, 6.0), ...]], 'type': 'Polygon'}, ...
rasterio.warp module
rasterio.warp.reproject()
maps elements of one array to another, using cartographic projections
No datasets or layers required
Data created in non-GIS programs can be reprojected for use with GIS programs
In the Beginning...
Was the Command Line
Grep is cool, but
$ gdalinfo tests/data/RGB.byte.tif | grep -c '^Band \d'
3
there's gotta be a better way
Command Line Fun
$ rio --help
Usage: rio [OPTIONS] COMMAND [ARGS]...
Rasterio command line interface.
Options:
-v, --verbose Increase verbosity.
-q, --quiet Decrease verbosity.
--help Show this message and exit.
Commands:
bounds Write bounding boxes to stdout as GeoJSON.
info Print information about a data file.
insp Open a data file and start an interpreter.
merge Merge a stack of raster datasets.
shapes Write the shapes of features.
transform Transform coordinates.
rio info
$ rio info tests/data/RGB.byte.tif --indent 2
{
"count": 3,
"crs": "EPSG:32618",
"dtype": "uint8",
"driver": "GTiff",
"bounds": [
101985.0,
2611485.0,
339315.0,
2826915.0
],
"height": 718,
"width": 791,
"nodata": 0.0
}
Single items
$ rio info tests/data/RGB.byte.tif --count
3
$ rio info tests/data/RGB.byte.tif --crs
EPSG:32618
$ rio info tests/data/RGB.byte.tif --bounds
101985.0 2611485.0 339315.0 2826915.0
rio.shapes | geojsonio
$ rio shapes /tests/data/shade.tif --precision 5 \
> | underscore extract features.112 \
> | geojsonio
Assists from underscore-cli and geojson-cli
New Releases
Fiona 1.2
Rasterio 0.13
Shapely 1.4
Thanks
Mapbox satellite team (Chris, Charlie, Amit, Bruno, Camilla)
Asger Petersen, Mike Toews, Brendan Ward, Kelsey Jordahl, René Buffat, Jacob Wasserman, Oliver Tonnhofer, Joshua Arnott, Phil Elson, Matt Perry
And especially Frank Warmerdam and Even Rouault