The Background
I have an issue for which a number of solution pathways are possible, but I am convinced there is an as-yet-undiscovered elegant solution leveraging purrr.
The Example Code
I have a large data frame as follows, for which I have included an example below:
library(tibble)
library(ggmap)
library(purrr)
library(dplyr)
# Define Example Data
df <- frame_data(
  ~Street, ~City, ~State, ~Zip, ~lon, ~lat,
  "226 W 46th St", "New York", "New York", 10036, -73.9867, 40.75902,
  "5th Ave", "New York", "New York", 10022, NA, NA,
  "75 Broadway", "New York", "New York", 10006, -74.01205, 40.70814,
  "350 5th Ave", "New York", "New York", 10118, -73.98566, 40.74871,
  "20 Sagamore Hill Rd", "Oyster Bay", "New York", 11771, NA, NA,
  "45 Rockefeller Plaza", "New York", "New York", 10111, -73.97771, 40.75915
)
The Challenge
I would like to geotag all locations for which the lon and lat columns are currently NA. There are many ways I could go about this, one of which is shown below:
# Safe Code is Great Code
safe_geocode <- safely(geocode)
# Identify Data to be Geotagged by Absence of lon and lat
data_to_be_geotagged <- df %>% filter(is.na(lon) | is.na(lat))
# GeoTag Addresses of Missing Data Points
fullAddress <- paste(data_to_be_geotagged$Street,
                     data_to_be_geotagged$City,
                     data_to_be_geotagged$State,
                     data_to_be_geotagged$Zip,
                     sep = ", ")
fullAddress %>%
  map(safe_geocode) %>%
  map("result") %>%
  plyr::ldply()
The Question
While I can get the above to work, and even wrangle the newly identified lon and lat coordinates back into the original data frame, the whole scheme feels dirty. I am convinced there is an elegant way to leverage piping and purrr to go through the data frame and conditionally geotag the locations based on the absence of lon and lat.
I have been down a number of rabbit holes, including purrr::pmap in an attempt to walk through multiple columns in parallel when constructing the full address (as well as rowwise() and by_row()). Nevertheless, I fall short of constructing anything that would qualify as an elegant solution.
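For concreteness, the pmap detour for building the full address can be sketched like this (a sketch only, using a two-row slice of the data above; pmap_chr() walks the columns in parallel and forwards sep on to paste()):

```r
library(tibble)
library(purrr)

# A slice of the example data, address columns only
addresses <- tribble(
  ~Street,         ~City,      ~State,     ~Zip,
  "226 W 46th St", "New York", "New York", 10036,
  "5th Ave",       "New York", "New York", 10022
)

# pmap_chr() calls paste() once per row, passing that row's column
# values as arguments; sep is forwarded unchanged.
fullAddress <- pmap_chr(addresses, paste, sep = ", ")
fullAddress
#> [1] "226 W 46th St, New York, New York, 10036"
#> [2] "5th Ave, New York, New York, 10022"
```

This produces the address vector cleanly, but still leaves the conditional geocoding and the merge back into the data frame unsolved.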
Any insight provided would be most appreciated.
Really, you want to avoid calling geocode any more than necessary, because it is slow and, if you are using Google, you only get 2,500 queries per day. It is therefore best to fill both columns from the same call, which can be done with a list column, by building a new version of the data.frame with do, or with a self-join.
With a list column, you make new versions of lon and lat with ifelse, geocoding where there are NAs and otherwise just copying the existing values. Afterwards, drop the old versions of the columns and unnest the new ones:
library(dplyr)
library(ggmap)
library(tidyr) # For `unnest`
# Evaluate each row separately
df %>% rowwise() %>%
  # Add a list column. If lon or lat is NA,
  mutate(data = ifelse(any(is.na(c(lon, lat))),
                       # return a data.frame of the geocoded results;
                       list(geocode(paste(Street, City, State, Zip))),
                       # otherwise return a data.frame of the existing values.
                       list(data_frame(lon = lon, lat = lat)))) %>%
  # Remove the old columns
  select(-lon, -lat) %>%
  # and unnest the new ones from the list column.
  unnest(data)
## # A tibble: 6 × 6
## Street City State Zip lon lat
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 226 W 46th St New York New York 10036 -73.98670 40.75902
## 2 5th Ave New York New York 10022 -73.97491 40.76167
## 3 75 Broadway New York New York 10006 -74.01205 40.70814
## 4 350 5th Ave New York New York 10118 -73.98566 40.74871
## 5 20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 6 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915
do
do, on the other hand, creates a wholly new data.frame from pieces of the old one. It requires slightly clunky $ notation, with . representing the grouped data.frame piped in. Using if and else instead of ifelse lets you avoid wrapping the results in lists (which they had to be above, anyway).
# Evaluate each row separately
df %>% rowwise() %>%
  # Make a new data.frame from the first four columns plus either the
  # geocoded results or the existing lon/lat
  do(bind_cols(.[1:4], if (any(is.na(c(.$lon, .$lat)))) {
    geocode(paste(.[1:4], collapse = ' '))
  } else {
    .[5:6]
  }))
which returns exactly the same thing as the first version.
If the ifelse is overly confusing, you can instead geocode just a subset and then recombine, binding those rows to the result of an anti_join, i.e. all the rows that are in df but not in the subset (.):
df %>%
  filter(is.na(lon) | is.na(lat)) %>%
  select(1:4) %>%
  bind_cols(geocode(paste(.$Street, .$City, .$State, .$Zip))) %>%
  bind_rows(anti_join(df, ., by = c('Street', 'Zip')))
which returns the same thing, but with the newly geocoded rows at the top. The same approach works with a list column or do, but since there is no need to combine two sets of columns, bind_cols alone will do the trick.
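For reference, the do variant of this subset approach might look roughly like the following (a sketch only; geocode needs network access and, for Google, an API key, so this is untested against the live service):

```r
library(dplyr)
library(ggmap)

df %>%
  # Keep only the rows that still need coordinates
  filter(is.na(lon) | is.na(lat)) %>%
  rowwise() %>%
  # Geocode each remaining row and bind the result to its address columns
  do(bind_cols(.[1:4], geocode(paste(.[1:4], collapse = ' ')))) %>%
  # Reattach the rows that already had coordinates
  bind_rows(anti_join(df, ., by = c('Street', 'Zip')))
```

The extra rowwise()/do() machinery buys nothing here, which is why the plain bind_cols version above is preferable.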
mutate_geocode
ggmap actually includes a mutate_geocode function that adds lon and lat columns when passed a data.frame and the name of a column of addresses. It has one limitation: it accepts nothing more than a bare column name for the address, and thus requires a single column containing the entire address. So while this version could be quite nice, it requires creating and then deleting an extra column holding the full address, which makes it less concise:
df %>%
  filter(is.na(lon) | is.na(lat)) %>%
  select(1:4) %>%
  mutate(address = paste(Street, City, State, Zip)) %>%  # make a full-address column
  mutate_geocode(address) %>%
  select(-address) %>%  # drop the address column again
  bind_rows(anti_join(df, ., by = c('Street', 'Zip')))
## Street City State Zip lon lat
## 1 5th Ave New York New York 10022 -73.97491 40.76167
## 2 20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 3 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915
## 4 350 5th Ave New York New York 10118 -73.98566 40.74871
## 5 75 Broadway New York New York 10006 -74.01205 40.70814
## 6 226 W 46th St New York New York 10036 -73.98670 40.75902
Base R can assign to a subset directly, which makes the idiom here much simpler, even if it requires a lot of subsetting:
df[is.na(df$lon) | is.na(df$lat), c('lon', 'lat')] <-
  geocode(paste(df$Street, df$City, df$State, df$Zip)[is.na(df$lon) | is.na(df$lat)])
Results are the same as the first version. All versions call geocode only twice.
Note that while you could use purrr for the job, it is not particularly better suited than regular dplyr. purrr excels at dealing with lists, and while a list column is one option here, it doesn't really have to be manipulated.
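For completeness, a purrr version of the list-column approach might look roughly like this, with pmap standing in for rowwise (a sketch only; geocode still needs network access and an API key):

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggmap)

df %>%
  # pmap walks the six columns in parallel, one row at a time
  mutate(data = pmap(list(Street, City, State, Zip, lon, lat),
                     function(street, city, state, zip, lon, lat) {
                       if (any(is.na(c(lon, lat)))) {
                         # Missing coordinates: geocode the assembled address
                         geocode(paste(street, city, state, zip))
                       } else {
                         # Otherwise just repackage the existing values
                         data_frame(lon = lon, lat = lat)
                       }
                     })) %>%
  select(-lon, -lat) %>%
  unnest(data)
```

As the note above suggests, this is no shorter than the rowwise version; the list column is created and immediately unnested, so purrr's list-manipulation strengths never come into play.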