Thunderdome for Geospatial Tools in Python. It’s to the Death.
It’s a fight to the death people… that’s why it’s called Thunderdome. This will be no different. Last time we talked about the very basics of the strange world of geo-spatial tools for data engineering. The next most obvious thing do of course is to see what tool is the best. By best I mean what tools can be used to load and do simple manipulation of data in a fast and relatively simple manner.
Luckily the USGS has plenty of free satellite imagery available to the public. You can access EarthExplorer here. I did a quick little search for the county in God’s country where I grew up. It’s the a Landsat satellite NDVI image (normalized vegetation index.) We will use this data for playing around with.
Downloading the .tar file gives me a few tiff files to test tools with.
Let’s start with some simple tests just opening a geotiff and reading a single band as an array. We will do this wall the different Python tools to see what’s up.
import rasterio
from time import time
t1 = time()
dataset = rasterio.open('landsat/LC08_CU_018007_20191019_20191031_C01_V01_SRAEROSOLQA.tif')
band = dataset.read(1)
print(band)
t2 = time()
print(t2-t1)
The first tool we tried is one of the most popular, rasterio. It’s very Pythonic in its design and is used widely because of its easy of access and functionality. So that was pretty quick. Next we should look at the Python api for GDAL. The documentation for GDAL sucks, it’s bare bones and you pretty much have to figure out everything for yourself. It’s notorious for not playing nice with any system and different versions of Python and GDAL in combination are like watching my two kids fight each other.
from osgeo import gdal
from time import time
import numpy as np
t1 = time()
raster = gdal.Open("landsat/LC08_CU_018007_20191019_20191031_C01_V01_SRAEROSOLQA.tif")
band_array = np.array(raster.GetRasterBand(1).ReadAsArray())
t2 = time()
print(band_array)
print(t2-t1)
Interesting. I know a lot of people who use GDAL because they think it’s faster, even though it’s harder to use. It appears that the GDAL Python bindings are a little slower when working in Python. It may not look like much, but if your loading TB’s of data in a image processing library, why waste time you don’t have to?
There is also a old school Python image library called PIL. Might as well try it out.
from PIL import Image
import numpy
from time import time
t1 = time()
im = Image.open("landsat/LC08_CU_018007_20191019_20191031_C01_V01_SRAEROSOLQA.tif")
imarray = numpy.array(im)
t2 = time()
print(imarray)
print(t2-t1)
It appears PIL is a little slower than the other two, no surprise there. Let’s try another simple, lesser known imageio module. It probably win’s the award for easiest to use.
import imageio
from time import time
t1 = time()
image = imageio.imread("landsat/LC08_CU_018007_20191019_20191031_C01_V01_SRAEROSOLQA.tif")
t2 = time()
print(image)
print(t2-t1)
So up to this point we’ve just talked about opening a geotiff file as a numpy array, to do some sort of math or work on the image data itself. But, this is really only a small part of what problem needs to be done with geospatial imagery. What we really need to look at is some geospatial transforms. For example, what if we need to cut out/clip a section from our image. We know which tool to use when doing just doing raster math on arrays in Python, but let’s found out what to use if we need to start transforming our data.
Below is a geojson of a rectangle I drew on this sweet website. We will use this as a “bounding” box to clip this section out of the bigger geotiff image. Below is a view of the section I want to cut out of the image.
{
"type": "FeatureCollection",
"features": [
{
"type": "Feature",
"properties": {},
"geometry": {
"type": "Polygon",
"coordinates": [
[
[
-93.4691047668457,
42.94725053181952
],
[
-93.4164047241211,
42.94725053181952
],
[
-93.4164047241211,
42.96207578988295
],
[
-93.4691047668457,
42.96207578988295
],
[
-93.4691047668457,
42.94725053181952
]
]] }}]}
Here is the code for masking/cutting with rasterio. You can follow the code pretty easily. My only complaint is that it’s not very obvious that the geometries you want to use for cutting in rasterio have to be a list of geometries. It would be nice to just be able to pass the whole geojson file without having to write the following two lines. geometries = box[“features”][0][“geometry”] and geoms = [geometries]
import rasterio
from time import time
import json
from rasterio.mask import mask
t1 = time()
with open('bounding.json', 'r') as file:
box = json.loads(file.read())
geometries = box["features"][0]["geometry"]
geoms = [geometries]
with rasterio.open("landsat/LC08_CU_018007_20191019_20191031_C01_V01_SRAEROSOLQA.tif", "r") as src:
print(src.crs)
out_image, out_transform = mask(src, geoms, crop=True)
out_meta = src.meta.copy()
out_meta.update({"driver": "GTiff",
"height": out_image.shape[1],
"width": out_image.shape[2],
"transform": out_transform})
with rasterio.open('test.tiff', "w", **out_meta) as dest:
dest.write(out_image)
t2 = time()
print(t2-t1)
This process is amazingly fast in rasterio and worked like a charm. Note ( you will have to make sure both your geojson and tiff are in the same CRS.
Let’s try the same thing with the GDAL python bindings. Again, this is a little surprising but it appears to be a little slower than rasterio.
from osgeo import gdal
from time import time
t1 = time()
input_raster = "landsat/LC08_CU_018007_20191019_20191031_C01_V01_SRAEROSOLQA.tif"
output_raster = "test.tif"
input_bound = "bounding.json"
ds = gdal.Warp(destNameOrDestDS=output_raster, srcDSOrSrcDSTab=input_raster, format='GTiff',
cutlineDSName=input_bound, cropToCutline=True)
ds = None
t2 = time()
print(t2-t1)
So it appears if you have your choice when working with geo tools in Python, rasterio is the way to go. It’s faster at opening up files and pulling out data as an array. It appears to be way faster when doing simple geo transforms like cutting out a boundary from a file. This is strange to me as it seems like GDAL is the tool of choice chosen by most professionals working with geo-spatial data. Granted the code for rasterio requires a little more work, but it appears to be worth the effort.
Long live rasterio.