How We Publish Satellite Data on the Web
Cloud computing has opened the door for satellite data to make a daily difference in the lives of farmers, emergency responders, policymakers, and many other professionals. Until recently, the size and complexity of satellite-derived data meant that it was only practical for governments and research institutions to invest in. Now, we have the power to process massive amounts of imagery as it is collected, and deliver insights to smartphones in real time.
To accommodate these new users, imagery providers need to adapt their services. We recently worked with Amazon Web Services (AWS) and Astro Digital to design and build MODIS* on AWS with web and app users in mind. This article shares what we learned, and provides guidelines for other data providers looking to improve their digital distribution.
A composite MODIS image of the northeastern United States. Cape Cod can be seen to the east, New York City and Long Island to the south. It was created from 16 days of imagery centered around July 9th, 2016. This data is available on AWS.
Earth on Amazon Web Services has been tremendously successful in making satellite imagery accessible. Landsat on AWS recorded over 1 billion requests in the first year. The majority of applications and machine learning pipelines that use Landsat now pull their data from AWS rather than USGS. While the data is the same, AWS, along with Frank Warmerdam and the Planet team, created an extremely thoughtful data structure based on feedback from end users. AWS has continued this throughout the Earth on AWS program. These lessons come from the combined experiences of putting dozens of data sets on the cloud.
Build for Applications
AWS enables access by building regular folder structures that applications can follow, query piecemeal, or crawl at scale. This supports a full range of uses, from real-time apps that access the data directly to indexing scripts that build vast libraries.
Organize in Buckets
“Buckets” are units of storage on S3, a storage service of AWS. If you need standardized cloud storage, they are a good place to start because they are extensible and scale well. Other AWS services within the same AWS region (e.g., us-east-1) can access buckets quickly at reduced or no cost.
The concept of buckets exists at other cloud providers and often carries similar benefits, just under different names.
Use Predictable Names
MODIS folders use the path PRODUCT/XX/YY/DATE, where XX and YY are tile designations and DATE is the four-digit year followed by the three-digit day of the year. So 2017001 is the first day of 2017.
In theory, this means you can access MODIS on AWS without searching for it, provided you have the spatial location and capture time. Unfortunately, MODIS data files all contain a scene ID, which includes a “processed-at” timestamp. So while the folders were crawl-able, we had to rely on other methods for the data itself.
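As a minimal sketch of the folder convention described above, the path can be assembled from a product name, a tile, and a capture date. The function name `modis_folder` and the zero-padding of the tile designations to two digits are assumptions for illustration; check them against the actual bucket layout.

```python
from datetime import date


def modis_folder(product: str, xx: int, yy: int, day: date) -> str:
    """Build a PRODUCT/XX/YY/DATE folder path for one tile and capture date.

    Assumption: tile designations are zero-padded to two digits.
    """
    # %Y is the four-digit year, %j the zero-padded three-digit day of year.
    return f"{product}/{xx:02d}/{yy:02d}/{day.strftime('%Y%j')}"


# First day of 2017 for a hypothetical tile (12, 4) of product MCD43A4.
print(modis_folder("MCD43A4", 12, 4, date(2017, 1, 1)))
```

This only yields the folder. As noted, the file names inside it include a processing timestamp, so they cannot be derived the same way.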
Include an Index
An index.html page provides an excellent landing page for both humans and machines and should show what files exist for a particular scene.
MODIS filenames are irregular, so including an index.html at the root directory gives applications a way to access full URLs for each file.
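One way an application might use such an index is to collect the links it contains and resolve them into full URLs. The sketch below uses only the Python standard library and assumes the index lists files as ordinary `<a href="...">` links; the sample HTML, the file names, and the example.com base URL are placeholders, not the real bucket contents.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def file_urls(index_html: str, base_url: str) -> list:
    """Return absolute URLs for every link found in an index page."""
    parser = LinkCollector()
    parser.feed(index_html)
    return [urljoin(base_url, href) for href in parser.links]


# Hypothetical index content and base URL, for illustration only.
sample_index = """
<html><body><ul>
  <li><a href="EXAMPLE_SCENE_B01.TIF">Band 1</a></li>
  <li><a href="EXAMPLE_SCENE_meta.json">Metadata</a></li>
</ul></body></html>
"""
base = "https://example.com/PRODUCT/12/04/2017001/"
for url in file_urls(sample_index, base):
    print(url)
```

In practice the HTML would be fetched over HTTP first; that step is left out so the example runs offline.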
Publish Thumbnails
MODIS scenes are 2400x2400 pixels. They come with a thumbnail at 10% of that size, so you can preview the image before spending time and resources getting the full-resolution version. On AWS, you’ll find this thumbnail alongside the full version.
When working interactively with large files, as most imagery is, thumbnails save time and cost.
Publish Metadata
MODIS directories on AWS include full scene metadata in an XML file. Generating an additional JSON metadata file containing this information is relatively straightforward, and makes accessing the data easier. JSON is native to the web, and both faster and safer for applications to parse.
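To illustrate how little is involved in producing the JSON copy, here is a rough sketch that converts a nested XML document into a dictionary and serializes it. The element names in `sample_xml` are invented for the example and do not reflect the real MODIS metadata schema; XML attributes are ignored for brevity.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Recursively convert an XML element into plain dicts, lists, and strings.

    Leaf elements become their stripped text. Repeated child tags are
    gathered into a list. Attributes are ignored in this sketch.
    """
    children = list(elem)
    if not children:
        return (elem.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tag: turn the existing entry into a list and append.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


# Hypothetical metadata, not the actual MODIS XML layout.
sample_xml = """
<Scene>
  <Id>EXAMPLE_SCENE</Id>
  <Band>1</Band>
  <Band>2</Band>
  <Extent><North>50.0</North><South>40.0</South></Extent>
</Scene>
"""
root = ET.fromstring(sample_xml)
metadata = {root.tag: element_to_dict(root)}
print(json.dumps(metadata, indent=2))
```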
Provide Smaller Files (When Possible)
The larger the file, the more resources you need and the harder it becomes to work with. MODIS on AWS publishes one GeoTIFF file for each band, split off from the original data file, so you can download just the ones you need.
MODIS often comes with ancillary data such as quality bands, and these get their own GeoTIFF as well.
Very large scenes could be re-tiled to break them into smaller chunks. However, creating too many files can be counterproductive, as there is overhead involved in accessing each data file.
Provide Overviews
Overviews are downsampled versions of the data that can speed up access at lower zoom levels (“zoomed out”). Not everyone needs these, so they’re better kept in separate files.
MODIS on AWS provides compressed overviews for each band (but not ancillary bands) at 1/2, 1/4, 1/16, and 1/32 sizes. Including four overview levels provides lots of flexibility for remote access.
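For a sense of scale, the arithmetic below works out the overview dimensions for a 2400x2400 scene at the four listed reductions, and sketches how a client might choose which one to request. The helper names and the selection rule (take the coarsest level that still has at least as many pixels per side as the target display) are assumptions for illustration, not part of MODIS on AWS.

```python
FULL_SIZE = 2400                  # pixels per side of a full MODIS scene
OVERVIEW_FACTORS = [2, 4, 16, 32]  # reductions listed in the text


def overview_sizes(full_size=FULL_SIZE, factors=OVERVIEW_FACTORS):
    """Map each reduction factor to the overview's pixels per side."""
    return {f: full_size // f for f in factors}


def pick_factor(target_px, full_size=FULL_SIZE, factors=OVERVIEW_FACTORS):
    """Pick the coarsest level with at least target_px pixels per side.

    Returns 1 (full resolution) when no overview is large enough.
    """
    best = 1
    for f in sorted(factors):
        if full_size // f >= target_px:
            best = f
    return best


print(overview_sizes())
print(pick_factor(500))   # a 500-pixel-wide map view
print(pick_factor(2000))  # needs more detail than any overview offers
```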
File Format, Compression, and Tiling
GeoTIFF is an accessible format, supports decent compression, and can be internally tiled to improve remote access. While it is the most common choice, it’s not the only one.
Compression greatly reduces file sizes at the cost of additional overhead when reading the file. Depending on the file, this additional overhead can greatly outweigh gains in download time. Tiling can also improve read performance, since you read data in pieces.
MODIS on AWS uses deflate compression for its GeoTIFF files (and overview files), with a 512x512 internal tile size. As a general rule, if you need only a portion of the image, and that portion is less than 20% of it, it may be better to read the data remotely; otherwise, download the file before accessing it.
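The rule of thumb can be made concrete by counting how many internal tiles a requested pixel window touches. The sketch below is one possible reading of the 20% guideline, measuring the share of the image by tiles touched rather than by pixels; the function names and that interpretation are assumptions, not a documented policy.

```python
import math

IMAGE_SIZE = 2400  # pixels per side of a MODIS scene
TILE_SIZE = 512    # internal GeoTIFF tile size in pixels


def tiles_for_window(col_off, row_off, width, height, tile=TILE_SIZE):
    """Count the internal tiles a pixel window overlaps."""
    first_col = col_off // tile
    last_col = (col_off + width - 1) // tile
    first_row = row_off // tile
    last_row = (row_off + height - 1) // tile
    return (last_col - first_col + 1) * (last_row - first_row + 1)


def total_tiles(size=IMAGE_SIZE, tile=TILE_SIZE):
    """Count all tiles in the image; edge tiles are partial but still count."""
    per_side = math.ceil(size / tile)
    return per_side * per_side


def should_read_remotely(col_off, row_off, width, height, threshold=0.20):
    """Return True if the window touches less than `threshold` of all tiles."""
    touched = tiles_for_window(col_off, row_off, width, height)
    return touched / total_tiles() < threshold


print(total_tiles())                            # 5 x 5 tiles
print(should_read_remotely(0, 0, 512, 512))     # one tile of 25
print(should_read_remotely(0, 0, 1200, 1200))   # nine tiles of 25
```

A small window that straddles a tile boundary still pulls in every tile it touches, which is why the count is done in tiles rather than pixels.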
Result of Community Outreach
These practices come from the community. AWS organized calls with user groups to figure out which MODIS products were most useful to store, how much of an archive to build, and the best way to store them. They ended up with these:
- MCD43A4: The highest quality scientific data from MODIS sensors, available daily.
- MOD09GA and MYD09GA: Daily readings from bands 1-7.
- MOD09GQ and MYD09GQ: Bands 1 (red) and 2 (near-infrared).
AWS is still backfilling archives. MCD43A4 will go back 5 years, while the daily products will be kept in a 30-day rolling archive. It currently takes 4-5 days from capture for NASA to process the data and make it available, at which point AWS mirrors the data on S3.
*Landsat and MODIS are different types of photographic instruments. MODIS flies aboard two spacecraft: Terra and Aqua. Together these cover the earth every 1-2 days.