I am working on a spatial problem using the sf
package in conjunction with dplyr
and purrr
.
I would prefer to perform spatial operations inside a mutate
call, like so:
simple_feature %>%
mutate(geometry_area = map_dbl(geometry, ~ as.double(st_area(.x))))
I like that this approach allows me to run a series of spatial operations using %>%
and mutate
.
I dislike that this approach seems to significantly increase the run-time of the sf
functions (sometimes prohibitively) and I would appreciate hearing suggestions about how to overcome this speed loss.
Here is a reprex that illustrates the speed loss problem in detail.
Please note: this is not a minimal example and requires downloading a few packages and one file from an ESRI REST API. I hope you'll be kind with me ;)
The objective in this example is to add a new column indicating whether each North Carolina county (nc
) intersects with any of the waterbodies polygons (nc_wtr
), as shown in the image below:
I created a function that performs this calculation: st_intersects_any()
Then I benchmark that function on two datasets (nc
and nc_1e4
), first using st_intersects_any()
by itself and then using it inside a mutate
call.
## |TEST | ELAPSED|
## |:------------------|--------:|
## |bm_sf_small | 0.01|
## |bm_sf_dplyr_small | 1.22|
## |bm_sf_large | 0.95|
## |bm_sf_dplyr_large | 122.88|
The benchmarks clearly show that the dplyr
approach is substantially slower, and I'm hoping that someone has a suggestion for reducing or eliminating this speed loss while still using the dplyr
approach.
If there are significantly faster ways to do this using data.table
or some other method that I should check out please let me know about those as well.
Thanks!
# Setup ----
library(lwgeom) # devtools::install_github('r-spatial/lwgeom)
library(tidyverse)
library(sf)
library(esri2sf) # devtools::install_github('yonghah/esri2sf')
library(rbenchmark)
library(knitr)
# Create the new sf function: st_intersects_any ----
st_intersects_any <- function(x, y) {
st_intersects(x, y) %>%
map_lgl(~ length(.x) > 0)
}
# Load data ----
# NC counties
nc <- read_sf(system.file("shape/nc.shp", package = "sf")) %>%
st_transform(32119)
nc_1e4 <- list(nc) %>%
rep(times = 1e2) %>%
reduce(rbind)
# NC watersheds
url <- "https://services.nconemap.gov/secure/rest/services/NC1Map_Watersheds/MapServer/2"
nc_wtr <- esri2sf(url)
## Warning: package 'httr' was built under R version 3.4.2
##
## Attaching package: 'jsonlite'
## The following object is masked from 'package:purrr':
##
## flatten
## [1] "Feature Layer"
## [1] "esriGeometryPolygon"
nc_wtr <- st_transform(nc_wtr, 32119) %>%
st_simplify(dTolerance = 100) # simplify the waterbodies geometries
# plot the data
par(mar = rep(.1, 4))
plot(st_geometry(nc), lwd = 1)
plot(st_geometry(nc_wtr), col = alpha("blue", .3), lwd = 1.5, add = TRUE)
# Benchmark the two approaches
cols <- c("elapsed", "relative")
bm_sf_small <- benchmark({
st_intersects_any(nc, nc_wtr)
}, columns = cols, replications = 1)
bm_sf_dplyr_small <- benchmark({
nc %>% transmute(INT = map_lgl(geometry, st_intersects_any, y = nc_wtr))
}, columns = cols, replications = 1)
## Warning: package 'bindrcpp' was built under R version 3.4.2
bm_sf_large <- benchmark({
st_intersects_any(nc_1e4, nc_wtr)
}, columns = cols, replications = 1)
bm_sf_dplyr_large <- benchmark({
nc_1e4 %>% transmute(INT = map_lgl(geometry, st_intersects_any, y = nc_wtr))
}, columns = cols, replications = 1)
tests <- list(bm_sf_small, bm_sf_dplyr_small, bm_sf_large, bm_sf_dplyr_large)
tbl <- tibble(
TEST = c("bm_sf_small", "bm_sf_dplyr_small", "bm_sf_large", "bm_sf_dplyr_large"),
ELAPSED = map_dbl(tests, "elapsed")
)
kable(tbl,format = "markdown", padding = 2)
## |TEST | ELAPSED|
## |:------------------|--------:|
## |bm_sf_small | 0.01|
## |bm_sf_dplyr_small | 1.22|
## |bm_sf_large | 0.95|
## |bm_sf_dplyr_large | 122.88|
devtools::session_info()
## Session info -------------------------------------------------------------
## setting value
## version R version 3.4.0 (2017-04-21)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## tz America/Los_Angeles
## date 2018-01-31
## Packages -----------------------------------------------------------------
## package * version date source
## assertthat 0.2.0 2017-04-11 CRAN (R 3.4.2)
## backports 1.1.0 2017-05-22 CRAN (R 3.4.0)
## base * 3.4.0 2017-04-21 local
## bindr 0.1 2016-11-13 CRAN (R 3.4.2)
## bindrcpp * 0.2 2017-06-17 CRAN (R 3.4.2)
## broom 0.4.3 2017-11-20 CRAN (R 3.4.3)
## cellranger 1.1.0 2016-07-27 CRAN (R 3.4.2)
## class 7.3-14 2015-08-30 CRAN (R 3.4.0)
## classInt 0.1-24 2017-04-16 CRAN (R 3.4.2)
## cli 1.0.0 2017-11-05 CRAN (R 3.4.2)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.4.2)
## compiler 3.4.0 2017-04-21 local
## crayon 1.3.4 2017-10-30 Github (r-lib/crayon@b5221ab)
## curl 3.0 2017-10-06 CRAN (R 3.4.2)
## datasets * 3.4.0 2017-04-21 local
## DBI 0.7 2017-06-18 CRAN (R 3.4.2)
## devtools 1.13.2 2017-06-02 CRAN (R 3.4.0)
## digest 0.6.13 2017-12-14 CRAN (R 3.4.3)
## dplyr * 0.7.4 2017-09-28 CRAN (R 3.4.2)
## e1071 1.6-8 2017-02-02 CRAN (R 3.4.2)
## esri2sf * 0.1.0 2017-12-12 Github (yonghah/esri2sf@81d211f)
## evaluate 0.10.1 2017-06-24 CRAN (R 3.4.3)
## forcats * 0.2.0 2017-01-23 CRAN (R 3.4.3)
## foreign 0.8-67 2016-09-13 CRAN (R 3.4.0)
## ggplot2 * 2.2.1.9000 2017-12-02 Github (tidyverse/ggplot2@7b5c185)
## glue 1.2.0.9000 2018-01-13 Github (tidyverse/glue@1592ee1)
## graphics * 3.4.0 2017-04-21 local
## grDevices * 3.4.0 2017-04-21 local
## grid 3.4.0 2017-04-21 local
## gtable 0.2.0 2016-02-26 CRAN (R 3.4.2)
## haven 1.1.0 2017-07-09 CRAN (R 3.4.2)
## hms 0.4.0 2017-11-23 CRAN (R 3.4.3)
## htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
## httr * 1.3.1 2017-08-20 CRAN (R 3.4.2)
## jsonlite * 1.5 2017-06-01 CRAN (R 3.4.0)
## knitr 1.18 2017-12-27 CRAN (R 3.4.3)
## lattice 0.20-35 2017-03-25 CRAN (R 3.4.0)
## lazyeval 0.2.1 2017-10-29 CRAN (R 3.4.2)
## lubridate 1.7.1 2017-11-03 CRAN (R 3.4.2)
## lwgeom * 0.1-1 2017-12-16 Github (r-spatial/lwgeom@baf22c6)
## magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
## memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
## methods * 3.4.0 2017-04-21 local
## mnormt 1.5-5 2016-10-15 CRAN (R 3.4.1)
## modelr 0.1.1 2017-07-24 CRAN (R 3.4.2)
## munsell 0.4.3 2016-02-13 CRAN (R 3.4.2)
## nlme 3.1-131 2017-02-06 CRAN (R 3.4.0)
## parallel 3.4.0 2017-04-21 local
## pillar 1.0.99.9001 2018-01-16 Github (r-lib/pillar@9d96835)
## pkgconfig 2.0.1 2017-03-21 CRAN (R 3.4.2)
## plyr 1.8.4 2016-06-08 CRAN (R 3.4.2)
## psych 1.7.8 2017-09-09 CRAN (R 3.4.2)
## purrr * 0.2.4.9000 2017-12-05 Github (tidyverse/purrr@62b135a)
## R6 2.2.2 2017-06-17 CRAN (R 3.4.0)
## rbenchmark * 1.0.0 2012-08-30 CRAN (R 3.4.1)
## Rcpp 0.12.15 2018-01-20 CRAN (R 3.4.3)
## readr * 1.1.1 2017-05-16 CRAN (R 3.4.2)
## readxl 1.0.0 2017-04-18 CRAN (R 3.4.2)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.4.2)
## rlang 0.1.6 2017-12-21 CRAN (R 3.4.3)
## rmarkdown 1.8 2017-11-17 CRAN (R 3.4.2)
## rprojroot 1.3-2 2018-01-03 CRAN (R 3.4.3)
## rvest 0.3.2 2016-06-17 CRAN (R 3.4.2)
## scales 0.5.0.9000 2017-12-02 Github (hadley/scales@d767915)
## sf * 0.6-1 2018-01-24 Github (r-spatial/sf@7ea67a5)
## stats * 3.4.0 2017-04-21 local
## stringi 1.1.6 2017-11-17 CRAN (R 3.4.2)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.4.0)
## tibble * 1.4.1.9000 2018-01-18 Github (tidyverse/tibble@64fedbd)
## tidyr * 0.7.2.9000 2018-01-13 Github (tidyverse/tidyr@74bd48f)
## tidyverse * 1.2.1 2017-11-14 CRAN (R 3.4.3)
## tools 3.4.0 2017-04-21 local
## udunits2 0.13 2016-11-17 CRAN (R 3.4.1)
## units 0.5-1 2018-01-08 CRAN (R 3.4.3)
## utf8 1.1.3 2018-01-03 CRAN (R 3.4.3)
## utils * 3.4.0 2017-04-21 local
## withr 2.1.1.9000 2018-01-13 Github (jimhester/withr@df18523)
## xml2 1.1.1 2017-01-24 CRAN (R 3.4.2)
## yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
you can considerably speed-up this by simply dropping the unnecessary map_lgl
call in the pipe:
bm_sf_dplyr_large_fast <- benchmark({
int_new <- nc_1e4 %>% mutate(INT = st_intersects_any(., nc_wtr))
}, columns = cols, replications = 1)
bm_sf_dplyr_large_fast
# bm_sf_dplyr_large_fast
# elapsed relative
# 1 0.829 1
The huge slow down depends from the fact that mapping over geometry rows is in this case detrimental, because you then do a looped one-to-multi polygon intersection.
Besides the overhead introduced by subsetting, I believe this is much slower than a straight-on multi-to-multi because you are probably mostly losing the "spatial indexing" capabilities of sf
objects, which considerably speed-up intersect operations (see http://r-spatial.org/r/2017/06/22/spatial-index.html). (Also note that I substituted transmute' with
mutate` - also that was introducing some overhead).
HTH
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With