Geospatial intersects with data.table, geos and sf
Vincent Thorne · Posted 09 Feb 2022
Working with large spatial data sets in R, I was delighted to discover Grant McDermott’s post on how to combine data.table
and geos
for blazing-fast spatial operations compared to the standard sf
(keeping in mind that geos
assumes planar geometries). The benchmark for the spatial operations he presents are very convincing, so decided to give it a go on the data set I’m currently working with, the infamous NYC taxi trip records.
In the example below, I am matching each trip pickup coordinates (points) with one of 263 taxi zones (polygons). In the end, I want to know how many taxi trips started in each zones. To find out to which taxi zone each pair of coordinates belongs, I use st_intersect
from the sf
package, and geos_intersects_matrix
from geos
. Surprisingly, the benchmark reveals that when intersecting points and polygons sf
is about 40% faster compared to the data.table
+geos
team.
I am in no position to say what is happening behind the scenes here, but found it interesting to report and contrast with Grant’s findings: the advantages of using data.table
+geos
might depend on the type of spatial operation. There might also be a better way to implement the data.table
+geos
version in this case that I overlooked — if you have ideas, please let me know!
library(sf)
library(geos)
library(data.table)
library(microbenchmark)
options(timeout=10000)
rm(list = ls())
gc()
url_trips = "https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2015-01.csv"
download.file(url = url_trips, destfile = "trip_records.csv", mode = "wb") # takes some time and is pretty heavy!
trips.dt = fread("trip_records.csv", nrows = 100000) # enough rows to benchmark performance
# Keep only pickup (origin) coordinate columns
cols_drop = grep("pickup_l", names(trips.dt), value = T, invert = T)
trips.dt[, (cols_drop) := NULL]
trips.dt = trips.dt[!(pickup_longitude == 0 | pickup_latitude == 0)] # obviously invalid coordinates
# Generate geom
trips.dt[, `:=`(o_geom = geos_make_point(pickup_longitude,
pickup_latitude,
crs=4326))]
# Generate sf
trips.sf = st_as_sf(trips.dt)
trips.sf = st_transform(trips.sf, 4326)
# Get polygons to intersect
url_zones = "https://s3.amazonaws.com/nyc-tlc/misc/taxi_zones.zip"
download.file(url_zones, destfile = "zones_polygons.zip", mode = "wb")
unzip("zones_polygons.zip", exdir = file.path("zones"))
# sf version
zones.sf = st_read(file.path("zones", "taxi_zones.shp"))
zones.sf = st_transform(zones.sf, 4326)
# geom version
zones.dt = as.data.table(zones.sf)[, geom := as_geos_geometry(geometry, crs = 4326)]
zones.dt[, geometry := NULL]
# Functions to test
geos.dt = function(x) {
geos_intersects_matrix(zones.dt[, geom], trips.dt[, o_geom])
}
sf = function(x) {
st_intersects(zones.sf, trips.sf)
}
# Benchmark
microbenchmark(geos = geos.dt(),
sf = sf(),
times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# geos 3.406296 3.823752 4.388384 4.226851 4.992151 6.192083 10
# sf 2.431363 2.493557 2.894264 2.870200 3.118318 3.941235 10