MongoDB + Uber H3 = Fast global datasets
If you are a researcher grappling with environmental datasets, slow database queries can hinder your progress and disrupt problem-solving. This article presents a robust solution for accelerating geospatial database queries: an optimized database structure built around Uber’s H3 index. We explain how Uber’s H3 index works and walk through a coding example to help you implement it yourself.
Before delving into the solution, it helps to understand the core issue. Consider NOAA’s Integrated Surface Database (ISD), which comprises approximately 2.6 billion individual weather measurements spanning from 1901 to the present. Our objective is to insert this data into a database that can be queried rapidly, so it can be used in further applications. MongoDB once proposed a technique for this, which required substantial hardware resources. In this article, we combine MongoDB’s approach with Uber’s H3 index to achieve faster query execution.
The document displayed below is a MongoDB document in JSON format, containing a weather station ID (st), a timestamp (ts), a GeoJSON-encoded location object (position), and the actual measurement data.
{
    "st" : "u725053",
    "ts" : ISODate("2013-06-03T22:51:00Z"),
    "position" : {
        "type" : "Point",
        "coordinates" : [
            -96.4,
            39.117
        ]
    },
    "elevation" : 231,
    "airTemperature" : {
        "value" : 21.1,
        "quality" : "1"
    },
    "sky condition" : {
        "cavok" : "N",
        "ceilingHeight" : {
            "determination" : "9",
            "quality" : "1",
            "value" : 1433
        }
    },
    "atmosphericPressure" : {
        "value" : 1009.7,
        "quality" : "5"
    }
    [etc]
}
As demonstrated by Avery on MongoDB’s blog, this query takes about 1500 ms to execute on a single machine with 32 vCPUs, 251 GB of RAM, and 8 x 800 GB SSDs, and about 300 ms on a cluster of 100 machines with 16 GB of RAM each.
db.data.find({"st" :db.data.find({"st" : "u103840",
"ts" : {"$gte": ISODate("1989-01-01"),
"$lt" : ISODate("1990-01-01")}})
Left: Single machine (32 cores, etc.), Right: 100 smaller machines clustered | Image from https://www.mongodb.com/blog/post/weather-century-part-4
Although throwing more hardware at the problem might seem feasible, the crux of the matter is the unoptimized database structure. Since all documents live in a single collection, every document has to be checked during query execution. In a production environment, it is also more common to query by location than by station ID, so our optimization focuses on location.
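For reference, a location-based query against the single, monolithic collection might look like the following sketch (pymongo is assumed, and the weather/data database and collection names are made up). Even with a 2dsphere index, the index and the matching documents are drawn from one collection holding all 2.6 billion measurements:

from pymongo import MongoClient

client = MongoClient()                  # assumption: local MongoDB instance
data = client["weather"]["data"]        # assumption: one collection holds everything

# $near queries require a 2dsphere index on the GeoJSON field.
data.create_index([("position", "2dsphere")])

# All measurements within 50 km of a point, searched across the whole collection.
cursor = data.find({
    "position": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [8.25, 50.0]},
            "$maxDistance": 50_000,  # metres
        }
    }
})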
Introducing Uber’s H3 Index
This leads us to Uber’s H3 index. Uber’s open-source hexagonal hierarchical geospatial indexing system partitions the Earth into hexagonal cells at 16 resolution levels, making it a potent tool for a variety of applications.
Bigger hexagon: lower resolution | Smaller hexagon: higher resolution | Image from https://eng.uber.com/h3/

Every hexagon is represented by a unique 64-bit index, typically written as a 15-character hexadecimal string, e.g.
8843a13687fffff
The resolution can be adapted based on the problem at hand. Each resolution level comprises a specific number of hexagons enveloping the entire globe, ranging from 122 hexagons at resolution 0 to over 500 trillion at resolution 15.
H3 resolution | Average hexagon area (km²) | Average hexagon edge length (km) | Number of cells |
---|---|---|---|
0 | 4.25e+06 | 1.11e+03 | 122 |
1 | 6.07e+05 | 4.19e+02 | 842 |
2 | 8.67e+04 | 1.58e+02 | 5,882 |
3 | 1.24e+04 | 5.98e+01 | 41,162 |
4 | 1.77e+03 | 2.26e+01 | 288,122 |
5 | 2.53e+02 | 8.54e+00 | 2,016,842 |
6 | 3.61e+01 | 3.23e+00 | 14,117,882 |
7 | 5.16e+00 | 1.22e+00 | 98,825,162 |
8 | 7.37e-01 | 4.61e-01 | 691,776,122 |
9 | 1.05e-01 | 1.74e-01 | 4,842,432,842 |
10 | 1.50e-02 | 6.59e-02 | 33,897,029,882 |
11 | 2.15e-03 | 2.49e-02 | 237,279,209,162 |
12 | 3.07e-04 | 9.42e-03 | 1,660,954,464,122 |
13 | 4.39e-05 | 3.56e-03 | 11,626,681,248,842 |
14 | 6.27e-06 | 1.35e-03 | 81,386,768,741,882 |
15 | 8.95e-07 | 5.10e-04 | 569,707,381,193,162 |

Source: https://h3geo.org/docs/core-library/restable/
Uber offers a C library for swift and convenient conversions from coordinates to the H3 index and vice versa. Additionally, bindings are available for all major languages.
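As a quick illustration, the following sketch uses the h3 Python bindings (v3 API is assumed) to convert between coordinates and H3 indexes; the coordinates are the same example point used later in this article:

import h3

# (lat, lon) -> H3 index at resolution 2
h3_index = h3.geo_to_h3(50.0, 8.25, 2)      # '821faffffffffff', as used further below

# H3 index -> centre coordinates of the hexagon
lat, lon = h3.h3_to_geo(h3_index)

# H3 index -> polygon outline of the hexagon (handy for plotting)
boundary = h3.h3_to_geo_boundary(h3_index)

print(h3_index, (lat, lon), len(boundary))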
Optimizing the Database Structure
Having established the foundational components, we can now design our database structure with two significant modifications:
- Each year will have a dedicated database
- Each year’s database will contain 5,882 collections, one for each H3 index at resolution 2
This division partitions the Earth into 5,882 roughly uniform cells, drastically reducing the number of documents that must be scanned for each query, in the ideal case by a factor of up to 5,882. A hexagon at resolution 2 has an average area of roughly 86,750 km² and an average edge length of about 158 km; a sketch of how a measurement is routed to its database and collection follows below.
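To make the routing concrete, here is a minimal sketch of how a single measurement document could be written into its per-year database and per-hexagon collection. It assumes pymongo, the h3 Python bindings (v3 API), and a local MongoDB instance; the helper name insert_measurement is hypothetical:

import h3
from pymongo import MongoClient

client = MongoClient()  # assumption: local MongoDB instance

def insert_measurement(doc, resolution=2):
    # One database per year, derived from the measurement timestamp.
    year = str(doc["ts"].year)
    # One collection per H3 index; GeoJSON stores coordinates as [lon, lat].
    lon, lat = doc["position"]["coordinates"]
    h3_index = h3.geo_to_h3(lat, lon, resolution)
    client[year][h3_index].insert_one(doc)

The same year/H3 routing is applied again at query time, which is what the query workflow below exploits.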
After inserting some data for Germany, only the handful of H3 index collections that cover Germany contain documents; all other collections remain empty.
This methodology is highly scalable: a higher H3 resolution and distributed machines allow even finer partitioning. Smaller areas with denser data (as in our approach at Predly) can also be queried efficiently with this method.
Given the following query:
{
    "location" : {
        $near: {
            $geometry: {
                type: "Point",
                coordinates: [ 8.25, 50 ]
            },
            $maxDistance: 1000,
            $minDistance: 10
        }
    },
    "ts" : ISODate("1989-01-01")
}
The proposed workflow is as follows:
- Determine the years to query (in this case, 1989)
- Identify the desired H3 index: h3.geo_to_h3(50,8.25,2) = ‘821faffffffffff’
- Execute the query for the given database and table
An example implementation in Python could look like this:
import datetime

import h3
from pymongo import MongoClient

db = MongoClient()  # assumes a local MongoDB instance

def make_query(query):
    # get year -> selects the per-year database
    ts = query["ts"]
    if not isinstance(ts, datetime.date):
        raise ValueError("'ts' must be a date or datetime")
    year = ts.year
    # get h3 index (resolution 2) -> selects the per-hexagon collection
    lon, lat = query["location"]["$near"]["$geometry"]["coordinates"]
    h3_index = h3.geo_to_h3(lat, lon, 2)
    return db[str(year)][h3_index].find(query)
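To round things off, here is a usage sketch of make_query with the query from above expressed as a plain Python dictionary (pymongo accepts dicts, and ISODate becomes a datetime object); note that the $near operator requires a 2dsphere index on the GeoJSON field in each collection:

import datetime

query = {
    "location": {
        "$near": {
            "$geometry": {"type": "Point", "coordinates": [8.25, 50]},
            "$maxDistance": 1000,
            "$minDistance": 10,
        }
    },
    "ts": datetime.datetime(1989, 1, 1),
}

for measurement in make_query(query):
    print(measurement["st"], measurement["ts"])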
In a global best-case scenario, this approach reduces the number of documents each query has to scan by a factor of up to 5,882, with a corresponding speed-up. I hope this article has piqued your interest and that you find its insights valuable for your next big geospatial data project.