Objective: Using R, get latitude and longitude data for a vector of addresses through open.mapquestapi
Point of departure: Since geocode from the ggmap package is restricted to 2,500 queries a day, I needed to find a different way (my data.frame consists of 9M entries). The Data Science Toolkit is not an option, since most of my addresses are based outside the UK/US. I found this excellent snippet on http://rpubs.com/jvoorheis/Micro_Group_Rpres utilizing open.mapquestapi.
geocode_attempt <- function(address) {
  # Build the request URL for the MapQuest Open Geocoding API
  URL2 <- paste("http://open.mapquestapi.com/geocoding/v1/address?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz",
                "&location=", address, "&outFormat=json", "&boundingBox=24,-85,50,-125",
                sep = "")
  # print(URL2)
  URL2 <- gsub(" ", "+", URL2)  # URL-encode spaces in the address
  x <- getURL(URL2)
  x1 <- fromJSON(x)
  if (length(x1$results[[1]]$locations) == 0) {
    return(NA)
  } else {
    # Return the first hit's latitude and longitude as a length-2 vector
    return(c(x1$results[[1]]$locations[[1]]$displayLatLng$lat,
             x1$results[[1]]$locations[[1]]$displayLatLng$lng))
  }
}
geocode_attempt("1241 Kincaid St, Eugene,OR")
We need these libraries:
library(RCurl)
library(rjson)
library(dplyr)
Let's create a mock-up data.frame with 5 addresses.
id <- c(seq(1:5))
street <- c("Alexanderplatz 10", "Friedrichstr 102", "Hauptstr 42", "Bruesseler Platz 2", "Aachener Str 324")
postcode <- c("10178","10117", "31737", "50672", "50931")
city <- c(rep("Berlin", 2), "Rinteln", rep("Koeln",2))
country <- c(rep("DE", 5))
df <- data.frame(id, street, postcode, city, country)
To add a latitude (lat) and a longitude (lon) variable to the data.frame, we could work with a for loop. I will present the code just to demonstrate that the function works in principle.
for (i in 1:5) {
  df$lat[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep = ","))[1]
  df$lon[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep = ","))[2]
}
From a performance standpoint, this code is pretty bad. Even for this small data.frame, my computer took about 9 seconds, most probably due to the web-service queries, but never mind. I could run this code on my 9M rows, but the time required would be tremendous.
My attempt was to utilize the mutate function from the dplyr package. Here is what I tried:
df %>%
mutate(lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])
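For completeness, the timing below was taken by wrapping the pipeline in system.time along these lines (a sketch; the exact call may differ):
system.time(
  df %>%
    mutate(lat = geocode_attempt(paste(street, postcode, city, country, sep = ","))[1],
           lon = geocode_attempt(paste(street, postcode, city, country, sep = ","))[2])
)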
system.time reports only 2.3 seconds. Not too bad. But here is the problem:
id street postcode city country lat lon
1 1 Alexanderplatz 10 10178 Berlin DE 52.52194 13.41348
2 2 Friedrichstr 102 10117 Berlin DE 52.52194 13.41348
3 3 Hauptstr 42 31737 Rinteln DE 52.52194 13.41348
4 4 Bruesseler Platz 2 50672 Koeln DE 52.52194 13.41348
5 5 Aachener Str 324 50931 Koeln DE 52.52194 13.41348
lat and lon are exactly the same for all entries. In my understanding, the mutate function works rowwise, but here lat and lon are the ones calculated from the first row; accordingly, only the first row is correct. Does anyone have an idea why? The code I provided is complete, nothing extra loaded. If you have a performant alternative instead of an optimization of my code, I would be grateful as well.
You might need to vectorize your geocode_attempt function so that it works on a whole column:
vecGeoCode <- Vectorize(geocode_attempt, vectorize.args = c("address"))
And then call:
df %>%
  mutate(lat = vecGeoCode(paste(street, postcode, city, country, sep = ","))[1, ],
         lon = vecGeoCode(paste(street, postcode, city, country, sep = ","))[2, ])
To speed things up, you might want to look at the batch mode of the API to get up to 100 lats and longs in one go.
To use the API's batch requests you could use this function:
geocodeBatch_attempt <- function(address) {
  # URL for batch requests: one &location= parameter per address
  URL <- paste("http://open.mapquestapi.com/geocoding/v1/batch?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz",
               "&location=", paste(address, collapse = "&location="), sep = "")
  URL <- gsub(" ", "+", URL)
  data <- getURL(URL)
  data <- fromJSON(data)
  # Extract lat/lng for each result, or NAs when no location was found
  p <- sapply(data$results, function(x) {
    if (length(x$locations) == 0) {
      c(NA, NA)
    } else {
      c(x$locations[[1]]$displayLatLng$lat, x$locations[[1]]$displayLatLng$lng)
    }
  })
  return(t(p))  # one row per address: column 1 = lat, column 2 = lng
}
To test it:
# make a bigger df from the data (repeat the 5 rows 25 times)
biggerDf <- df[rep(row.names(df), 25), ]
# add a reqId column to split the data into batches of 100 requests
biggerDf$reqId <- seq_along(biggerDf$id) %/% 100
# run the function, first grouping by reqId to send batches of 100 requests
biggerDf %>%
  group_by(reqId) %>%
  mutate(lat = geocodeBatch_attempt(paste(street, postcode, city, country, sep = ","))[, 1],
         lon = geocodeBatch_attempt(paste(street, postcode, city, country, sep = ","))[, 2])
It's really easy to look at mutate() and draw the conclusion that what's happening is similar to what you illustrate in your for loop, but what you're actually seeing there is just a vectorized R function acting on the entire column of the data frame.
I would not be surprised if others had this misconception: the dplyr tutorials don't address the distinction between vectorized and non-vectorized functions, and (even more dangerous) R's recycling rules mean that applying a scalar function won't necessarily raise an error. There's some more discussion of this here.
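To see the recycling behaviour in isolation, here is a minimal sketch (not part of the geocoding code; first_char is a made-up helper) of a scalar function silently producing one value for every row:
library(dplyr)
first_char <- function(x) substr(x[1], 1, 1)  # scalar: only ever looks at x[1]
data.frame(city = c("Berlin", "Rinteln", "Koeln")) %>%
  mutate(initial = first_char(city))
# every row gets "B": the single return value is recycled with no warning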
One option is to rewrite your geocode_attempt so that it can take a vector of addresses.
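One way to do that (a sketch only, assuming geocode_attempt as defined in the question; geocode_attempt_vec is an illustrative name) is to loop over the addresses internally and return one row per address:
geocode_attempt_vec <- function(addresses) {
  res <- vapply(addresses, function(a) {
    out <- geocode_attempt(a)
    if (length(out) == 2) as.numeric(out) else c(NA_real_, NA_real_)  # pad failures with NAs
  }, numeric(2))
  t(res)  # one row per address: column 1 = lat, column 2 = lng
}
Note that this still sends one request per address; it only makes the function fit mutate's column-at-a-time model.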
If you want to keep your function as is, but want dplyr to behave more like something from the -ply family, you have two potential approaches:
The first is to use the grouping variable you have in your data:
df %>%
group_by(id) %>%
mutate(
lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])
The second is to use the rowwise() function described in this answer.
df %>%
rowwise() %>%
mutate(
lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])
The group_by solution is significantly faster on my machine. Not sure why!
Unfortunately, the speed savings you are seeing from dplyr above are likely somewhat illusory: most likely the result of the geocoding function getting called only once (vs. once per row in the loop). There may well be gains, but you'll need to run the timings again.
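If you want to re-run the timings yourself, something along these lines should do (a sketch, assuming the df and geocode_attempt from the question; the numbers will be dominated by network latency):
system.time(
  df %>%
    group_by(id) %>%
    mutate(lat = geocode_attempt(paste(street, postcode, city, country, sep = ","))[1],
           lon = geocode_attempt(paste(street, postcode, city, country, sep = ","))[2])
)
system.time(
  df %>%
    rowwise() %>%
    mutate(lat = geocode_attempt(paste(street, postcode, city, country, sep = ","))[1],
           lon = geocode_attempt(paste(street, postcode, city, country, sep = ","))[2])
)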