
Speed up API calls in R

I am querying Freebase to get the genre information for some 10000 movies.

After reading How to optimise scraping with getURL() in R, I tried to execute the requests in parallel. However, I failed - see below. Besides parallelization, I also read that httr might be a better alternative to RCurl.

My questions are: Is it possible to speed up the API calls by (a) using a parallel version of the loop below (on a Windows machine), or (b) using alternatives to getURL, such as GET from the httr package?

library(RCurl)
library(jsonlite)
library(foreach)
library(doSNOW)

df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World", "The Parade", "ParaNorman", "Passengers", "Pink Cadillac", "Pleasantville", "Police Academy", "The Polar Express", "Platoon"), genre=NA)

f_query_freebase <- function(film.title){

  request <- paste0("https://www.googleapis.com/freebase/v1/search?",
                    "filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"),
                    "&indent=TRUE",
                    "&limit=1",
                    "&output=(/film/film/genre)")

  temp <- getURL(URLencode(request), ssl.verifypeer = FALSE)
  data <- fromJSON(temp, simplifyVector=FALSE)
  genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]], function(x){as.character(x$name)}), collapse=" | ")
  return(genre)
}


# Non-parallel version
# ----------------------------------

for (i in df$film){
  df$genre[which(df$film==i)] <- f_query_freebase(i)      
}


# Parallel version - Does not work
# ----------------------------------

# Set up parallel computing
cl<-makeCluster(2) 
registerDoSNOW(cl)

foreach(i=df$film) %dopar% {
  df$genre[which(df$film==i)] <- f_query_freebase(i)     
}

stopCluster(cl)

# --> I get the following error: "Error in { : task 1 failed", which further says that it could not find the function "getURL".
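
For reference, this error arises because each snow worker starts a fresh R session that has not loaded RCurl or jsonlite, and because assignments made inside %dopar% change the workers' copies of df rather than the original. Below is a minimal, untested sketch of a foreach call that exports the packages and the function and collects the returned genres instead (httr::GET() could likewise replace getURL() inside f_query_freebase):

# Untested sketch: export the required packages/objects to the workers and
# collect the genres from the loop's return value instead of assigning to df.
cl <- makeCluster(2)
registerDoSNOW(cl)

df$genre <- foreach(i = df$film, .combine = c,
                    .packages = c("RCurl", "jsonlite"),
                    .export   = "f_query_freebase") %dopar% {
  f_query_freebase(i)
}

stopCluster(cl)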
asked Apr 10 '14 by majom



1 Answer

This doesn't achieve parallel requests within a single R session; however, it's something I've used to run more than one request at a time (i.e. in parallel) across multiple R sessions, so it may be useful.

At a high level

You'll want to break the process into a few parts:

  1. Get a list of the URLs/API calls you need to make and store it as a CSV/text file (see the sketch after this list)
  2. Use the code below as a template for starting multiple R processes and dividing the work among them
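
As an illustration (not part of the original answer), step 1 could be done for the question's data roughly like this, assuming a two-column CSV with hypothetical film and url columns:

# Sketch (assumption): build one request URL per film, as in the question's
# f_query_freebase(), and save them so every worker process reads the same file.
urls <- vapply(as.character(df$film), function(film.title) {
  URLencode(paste0("https://www.googleapis.com/freebase/v1/search?",
                   "filter=(all alias{full}:\"", film.title, "\" type:\"/film/film\")",
                   "&indent=TRUE&limit=1&output=(/film/film/genre)"))
}, character(1))

write.csv(data.frame(film = df$film, url = urls), "api_calls.csv", row.names = FALSE)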

Note: this happened to run on Windows, so I used PowerShell. On a Mac this could be written in bash.

PowerShell/bash script

Use a single PowerShell script to start multiple R processes (here we divide the work between 3 processes):

e.g. save a plain text file with a .ps1 file extension; you can double-click it to run it, or schedule it with Task Scheduler/cron:

start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 }
start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 }

What's it doing? It will:

  • Go to the Desktop, start the script it finds there called extract.R, and pass it an argument (1, 2, and 3, respectively).

The R processes

Each R process can look like this

# Get command line argument 
arguments <- commandArgs(trailingOnly = TRUE)
process_number <- as.numeric(arguments[1])

api_calls <- read.csv("api_calls.csv")

# Work out which API calls this R process should make
# (e.g. process 1 takes rows 1, 4, 7, ...; process 2 takes rows 2, 5, 8, ...)
indices <- seq(process_number, nrow(api_calls), 3)

api_calls_for_this_process_only <- api_calls[indices, ] # this subsets 1/3 of the API calls
# (the other two processes will take care of the remaining calls)

# Now, make API calls as usual using rvest/jsonlite or whatever you use for that
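
As a hypothetical illustration of that last step (not part of the original answer), assuming api_calls.csv has the film and url columns sketched above, each process could fetch its share of URLs and write its results to its own file:

# Sketch (assumption): fetch each URL with jsonlite and write this process's
# results to its own output file so the three processes never overwrite each other.
library(jsonlite)

genres <- sapply(as.character(api_calls_for_this_process_only$url), function(u) {
  data <- fromJSON(u, simplifyVector = FALSE)
  paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]],
               function(x) as.character(x$name)), collapse = " | ")
})

out <- data.frame(film  = api_calls_for_this_process_only$film,
                  genre = genres)
write.csv(out, paste0("genres_", process_number, ".csv"), row.names = FALSE)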
answered Sep 26 '22 by stevec