Optimizing Landsat-util
Two weeks ago we launched a new version of Landsat-util, v0.5, our utility for searching and processing Landsat satellite imagery. This version is now faster than it was before. To do this we rewrote most of the internals to use flexible python frameworks.
Below is a deep dive into how we’ve improved Landsat-util to make it faster and easier to use.
Dependency Hell
Landsat-util downloads Landsat files, pulls out the individual bands representing wavelengths of light, corrects contrast, warps them and combines them to make a colored image you can add on a web map.
Initially, landsat-util was written as a command line wrapper to existing pipelines:
- Scale bands with
gdal-translate
- Warp with
gdal-warp
to the correct projection - Combine bands with
ImageMagick
- Contrast correct with
OpenCV
- Pansharpen with
orfeoToolbox
- Gamma correct with
ImageMagick
- Add geographic information back with
gdal_edit
We needed a lot of dependencies to process an image, and they’re not particularly optimized for scripting. ImageMagick, GDAL, orfeoToolbox and openCV are monolithic frameworks that don’t allow for cherry-picking functions. Installing all these frameworks can be a painful experience.
This toolchain combination is also tedious because it creates inherent bottlenecks. To communicate between all tools, we need to write temporary files to disk and read them back in at every step.
Enter rasterio
Rasterio is a great python library written by Sean Gillies at Mapbox to work with raster data. It wraps around gdal and abstracts the band data as Numpy arrays.
By standardizing the input, rasterio allows us to minimize our dependencies and use fast, in-memory, scientific libraries like scikit-image.
Here’s the guts of our new process:
Read in bands with rasterio
{% highlight python %}
with rasterio.drivers():
with rasterio.open(‘LC82040522013123LGN01_B4.TIF’) as band4:
with rasterio.open(‘LC82040522013123LGN01_B3.TIF’) as band3:
with rasterio.open(‘LC82040522013123LGN01_B2.TIF’) as band2:
with rasterio.open(‘LC82040522013123LGN01_B8.TIF’) as band8:
band4_s = band4.read_band(1)
band3_s = band3.read_band(1)
band2_s = band2.read_band(1)
band8_s = band8.read_band(1)
{% endhighlight %}
Scale bands with scikit
{% highlight python %}
from skimage import transform as sktransform
band4_s = sktransform.rescale(band4_s, 2)
band3_s = sktransform.rescale(band3_s, 2)
band2_s = sktransform.rescale(band2_s, 2)
{% endhighlight %}
Warp with rasterio
{% highlight python %}
for color, band in zip([r,g,b,b8], [band4_s, band3_s, band2_s, band8_s]):
reproject(band, color,
src_transform = src.transform,
src_crs = src.crs,
dst_transform = dst_transform,
dst_crs = dst_crs,
resampling = RESAMPLING.nearest)
{% endhighlight %}
Pansharpen using numpy
operations
{% highlight python %}
m = r + b + g
pan = 1/m * b8
r = r * pan
b = b * pan
g = g * pan
{% endhighlight %}
Contrast-correct and gamma correct using scikit-image
{% highlight python %}
# using CLAHE
from skimage.exposure import equalize_adapthist
for band in [r,g,b]:
band = equalize_adapthist(band, clip_limit=0.02)
{% endhighlight %}
Write to disk using rasterio
{% highlight python %}
with rasterio.open(
tiffname,’w’, driver=’GTiff’,
width=dst_shape[1],height=dst_shape[0],
count=3,dtype=numpy.uint8,
nodata=0,
transform=dst_transform,
photometric=’RGB’,
crs=dst_crs) as dst:
for k, arr in [(1, r), (2, g), (3, b)]:
dst.write_band(k, arr)
{% endhighlight %}
The toolchain consists of only python libraries, and no other dependencies. Rasterio inherently supports GeoTiff so we don’t lose geo-information along the way.
Landsat-util is open source, and we encourage developers to improve on our process. Fork our repo!
How fast?
By using rasterio
, numpy
and scikit
, we eliminate the disk bottleneck, and we regain transparent control over the pipeline.
We ran tests on a couple of AWS machines. Each test was conducted 5 times and the resulting times were averaged.
- R3.large: 2-core 2.5 GHz Intel Xeon (Ivy Bridge) with 15GB RAM and 32GB SSD.
- C4.2xlarge: 8-core 2.9GHz Intel Xeon (Haswell) with 15GB RAM and 32GB SSD.
Results
instance type
process type
old-landsat-util
new-landsat-util
speedup
R3.large
non-pansharpened
252.7s
122.05s
2x
R3.large
pansharpened
846.9s
349.17s
2.4x
C4.2xlarge
non-pansharpened
216.6s
106.67s
2x
C4.2xlarge
pansharpened
438s
290s
1.5x
On all but the largest machine, the new landsat-util is at least twice as fast as previously.
And it gets better: a significant amount of time (about a minute on average) is used up to decompress the NASA bundle after downloading it. Landsat-util v0.5 takes advantage of AWS’s new landsat archive of unzipped bands, saving even more time.
We hope you enjoy the new landsat-util! Fork it, modify it, break it or just use it and tell us about all the ways you’re incorporating satellite data in your apps.