I am querying Freebase to get the genre information for some 10,000 movies.
After reading How to optimise scraping with getURL() in R, I tried to execute the requests in parallel. However, I failed - see below. Besides parallelization, I also read that httr might be a better alternative to RCurl.
My questions are: is it possible to speed up the API calls by using
a) a parallel version of the loop below (on a Windows machine)?
b) alternatives to getURL, such as GET in the httr package?
library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)
df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World",
                        "The Parade", "ParaNorman", "Passengers", "Pink Cadillac",
                        "Pleasantville", "Police Academy", "The Polar Express", "Platoon"),
                 genre=NA)
f_query_freebase <- function(film.title){
  request <- paste0("https://www.googleapis.com/freebase/v1/search?",
                    "filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
                    "&indent=TRUE",
                    "&limit=1",
                    "&output=(/film/film/genre)")
  temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
  data <- fromJSON(temp, simplifyVector=FALSE)
  genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]],
                        function(x){as.character(x$name)}), collapse=" | ")
  return(genre)
}
# Non-parallel version
# ----------------------------------
for (i in df$film){
  df$genre[which(df$film==i)] <- f_query_freebase(i)
}
# Parallel version - Does not work
# ----------------------------------
# Set up parallel computing
cl <- makeCluster(2)
registerDoSNOW(cl)
foreach(i=df$film) %dopar% {
  df$genre[which(df$film==i)] <- f_query_freebase(i)
}
stopCluster(cl)
# --> I get the following error: "Error in { : task 1 failed", further saying that it cannot find the function "getURL".
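For what it's worth, my understanding is that each doSNOW worker is a fresh R session, so packages attached in the master are not available there, and assignments into df inside %dopar% happen on worker-side copies that are discarded. If so, something along these lines (using foreach's .packages and .combine arguments) should be the intended pattern, though I haven't confirmed it:
# Sketch of the usual fix (assuming f_query_freebase as defined above):
# .packages attaches RCurl/jsonlite on each worker, and the results are
# collected with .combine rather than assigned into df inside the loop.
cl <- makeCluster(2)
registerDoSNOW(cl)
df$genre <- foreach(i=df$film, .combine=c,
                    .packages=c("RCurl", "jsonlite")) %dopar% {
  f_query_freebase(i)
}
stopCluster(cl)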
This doesn't achieve parallel requests within a single R session; however, it's an approach I've used to run several requests simultaneously across multiple R sessions, so it may be useful.
You'll want to break the process into a few parts:
Note: this happened to run on Windows, so I used PowerShell. On Mac this could be written in bash.
Use a single PowerShell script to start off multiple R processes (here we divide the work between 3 processes). E.g. save a plain text file with a .ps1 file extension; you can double-click it to run it, or schedule it with Task Scheduler/cron:
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 }
What's it doing? It will open three PowerShell windows; in each one it changes to the Desktop directory, runs extract.R, and passes an argument to the R script (1, 2, and 3 respectively), so each process knows which share of the work is its own. (TIMEOUT 20000 simply keeps each window open after the script finishes.)
Each R process can look like this:
# Get command line argument
arguments <- commandArgs(trailingOnly = TRUE)
process_number <- as.numeric(arguments[1])
api_calls <- read.csv("api_calls.csv")
# Work out which API calls this R script should make
# (e.g. process 1 takes rows 1, 4, 7, ...; process 2 takes rows 2, 5, 8, ...)
indices <- seq(process_number, nrow(api_calls), 3)
api_calls_for_this_process_only <- api_calls[indices, ] # this subsets 1/3 of the API calls
# (the other two processes will take care of the remaining calls)
# Now, make API calls as usual using rvest/jsonlite or whatever you use for that
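To connect this back to part (b) of the question, the per-film call inside extract.R could use httr's GET instead of RCurl's getURL. The following is only a sketch: it mirrors the question's Freebase query, and the film column name and the per-process output filename are assumptions:
library(httr)
library(jsonlite)

# Hypothetical per-film lookup mirroring the question's query;
# GET() builds and URL-encodes the query string itself.
f_query_freebase_httr <- function(film.title) {
  resp <- GET("https://www.googleapis.com/freebase/v1/search",
              query = list(
                filter = paste0("(all alias{full}:\"", film.title, "\" type:\"/film/film\")"),
                limit  = 1,
                output = "(/film/film/genre)"))
  data <- fromJSON(content(resp, as = "text"), simplifyVector = FALSE)
  paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]],
               function(x) as.character(x$name)), collapse = " | ")
}

genres <- sapply(api_calls_for_this_process_only$film, f_query_freebase_httr)

# Each process writes its own file, so the results can be combined afterwards
write.csv(data.frame(film = api_calls_for_this_process_only$film, genre = genres),
          paste0("genres_", process_number, ".csv"), row.names = FALSE)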