Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Geocode batch addresses in R with open mapquestapi

Objective: Using R, get latitude and longitude data for a vector of addresses through open.mapquestapi

Point of departure: Since geocode from the ggmap package is restricted to 2500 queries a day, I needed to find a different way (My data.frame consists of 9M entries). The data-science toolkit is not an option since most of my addresses are based outside the UK/US. I found this excellent snippet on http://rpubs.com/jvoorheis/Micro_Group_Rpres utilizing open.mapquestapi.

geocode_attempt <- function(address) {
    URL2 = paste("http://open.mapquestapi.com/geocoding/v1/address?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz", 
        "&location=", address, "&outFormat='json'", "boundingBox=24,-85,50,-125", 
        sep = "")
    # print(URL2)
    URL2 <- gsub(" ", "+", URL2)
    x = getURL(URL2)
    x1 <- fromJSON(x)
    if (length(x1$results[[1]]$locations) == 0) {
        return(NA)
    } else {
        return(c(x1$results[[1]]$locations[[1]]$displayLatLng$lat, x1$results[[1]]$locations[[1]]$displayLatLng$lng))
    }
}
geocode_attempt("1241 Kincaid St, Eugene,OR")

We need these libraries:

library(RCurl)
library(rjson)
library(dplyr)

Let's create a mock-up data.frame with 5 adresses.

id <- c(seq(1:5))
street <- c("Alexanderplatz 10", "Friedrichstr 102", "Hauptstr 42", "Bruesseler Platz 2", "Aachener Str 324")
postcode <- c("10178","10117", "31737", "50672", "50931")
city <- c(rep("Berlin", 2), "Rinteln", rep("Koeln",2))
country <- c(rep("DE", 5))

df <- data.frame(id, street, postcode, city, country

For a adding a latitude lat and longitude lon variable to the data.frame we could work with a for-Loop. I will present the code, just to demonstrate that the function works in principle.

for(i in 1:5){
  df$lat[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep=","))[1]
  df$lon[i] <- geocode_attempt(paste(df$street[i], df$postcode[i], df$city[i], df$country[i], sep=","))[2]
}

From a performance standpoint, this code is pretty bad. Even for this small data.frame, my computer took about 9sec, most probably due to the webservice query, but never mind. So I could run this code on my 9M rows, but the time would be tremendous.

My attempt was to utilize the mutate function from the dplyr package. Here is what I tried:

df %>%
  mutate(lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
        lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])

system.time stops in only 2.3 seconds. Not too bad. But here is the problem:

 id             street postcode    city country      lat      lon
1  1  Alexanderplatz 10    10178  Berlin      DE 52.52194 13.41348
2  2   Friedrichstr 102    10117  Berlin      DE 52.52194 13.41348
3  3        Hauptstr 42    31737 Rinteln      DE 52.52194 13.41348
4  4 Bruesseler Platz 2    50672   Koeln      DE 52.52194 13.41348
5  5   Aachener Str 324    50931   Koeln      DE 52.52194 13.41348

lat and lon are exactly the same for all entries. In my understanding, the mutate function is working rowwise. But here, lat and lon are the one calculated from the first row. Accordingly, the first row is correct. Does anyone have an idea why? The code I provided is complete. Nothing extra loaded. Any ideas? If you have a performant alternative way instead of an optimization of my code, I would be grateful as well.

like image 264
Daniel Schultz Avatar asked Feb 23 '15 18:02

Daniel Schultz


People also ask

How do I geocode an address in R?

Geocoding can be simply achieved in R using the geocode() function from the ggmap library. The geocode function uses Googles Geocoding API to turn addresses from text to latitude and longitude pairs very simply. There is a usage limit on the geocoding service for free users of 2,500 addresses per IP address per day.

How do I find the latitude and longitude of a geocoder address?

The short story is you need to do: Geocoder geocoder = new Geocoder(this, Locale. getDefault()); List<Address> addresses = geocoder. getFromLocation(lat, lng, 1);

Can you geocode with Google Maps?

Geocoding requests can be sent for every domain in which the main Google Maps application offers geocoding.


2 Answers

You might need to vectorize your geocode_attempt function to do it columnwise:

vecGeoCode<-Vectorize(geocode_attempt,vectorize.args = c('address'))

And then call:

df %>%
        mutate(lat = vecGeoCode(paste(street, postcode, city, country, sep=","))[1,],
               lon =vecGeoCode(paste(street, postcode, city, country, sep=","))[2,])

To speed thing up, you might want to look at the batch mode of the API to get up to 100 lats and longs in one go.

To use the API's batch requests you could use this function:

geocodeBatch_attempt <- function(address) {
  #URL for batch requests
  URL=paste("http://open.mapquestapi.com/geocoding/v1/batch?key=", "Fmjtd%7Cluub2huanl%2C20%3Do5-9uzwdz", 
             "&location=", paste(address,collapse="&location="),sep = "") 

  URL <- gsub(" ", "+", URL)
  data<-getURL(URL)
  data <- fromJSON(data)

  p<-sapply(data$results,function(x){
    if(length(x$locations)==0){
      c(NA,NA)
    } else{
      c(x$locations[[1]]$displayLatLng$lat, x$locations[[1]]$displayLatLng$lng)   
    }})
  return(t(p))
}

To test it:

#make a bigger df from the data (repeat the 5 lines 25 times)
biggerDf<-df[rep(row.names(df), 25), ]

#add a reqId column to split the data in batches of 100 requests 
biggerDf$reqId<-seq_along(biggerDf$id)%/%100

#run the function, first grouping by reqId to send batches of 100 requests
biggerDf %>%
  group_by(reqId) %>%
  mutate(lat = geocodeBatch_attempt(paste(street, postcode, city, country, sep=","))[,1],
         lon =geocodeBatch_attempt(paste(street, postcode, city, country, sep=","))[,2])
like image 142
NicE Avatar answered Oct 17 '22 19:10

NicE


It's really easy to look at mutate() and draw the conclusion that what's happening is similar to what you illustrate in your for loop - but what you're actually seeing there is just a vectorized R function which acting on the entire column of the data frame.

I would not be surprised if others had this misconception - the dplyr tutorials don't address the distinction between vectorized/non-vectorized functions, and (even more dangerous) R's recycling rules mean that applying a scalar function won't necessarily raise an error. There's some more discussion of this here.

One option is to rewrite your geocode_attempt so that it can take a vector of addresses.

If you want to keep your function as is, but want dplyr to behave more like something from the -ply family you have two potential approaches:

The first is to use the grouping variable you have in your data:

df %>%
  group_by(id) %>%
  mutate(
    lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
    lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])

The second is to use rowwise() function described in this answer.

df %>%
  rowwise() %>%
  mutate(
    lat = geocode_attempt(paste(street, postcode, city, country, sep=","))[1],
    lon = geocode_attempt(paste(street, postcode, city, country, sep=","))[2])

The group_by solution is significantly faster on my machine. Not sure why!

Unfortunately the speed savings you are seeing from dplyr above are likely somewhat illusory - most likely the result of the geocoding function getting called only once (vs once per row in the loop). There may well be gains, but you'll need to run the timmings again.

like image 33
Andrew Avatar answered Oct 17 '22 20:10

Andrew