
Troubleshooting a Loop in R

Tags: html, loops, list, r

I have this loop in R that is scraping Reddit comments from an API incrementally on an hourly basis (e.g. all comments containing a certain keyword between now and 1 hour ago, 1 hour ago and 2 hours ago, 2 hours ago and 3 hours ago, etc.):

library(jsonlite)

part1 <- "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 <- "h&before="
part3 <- "h&size=500"

results <- list()
for (i in 1:50000) {
  tryCatch({
    # Query comments posted between (i + 1) hours ago and i hours ago
    url_i <- paste0(part1, i + 1, part2, i, part3)
    r_i <- data.frame(fromJSON(url_i))
    results[[i]] <- r_i

    # Print the current iteration and the cumulative number of rows scraped so far
    myvec_i <- sapply(results, NROW)
    print(c(i, sum(myvec_i)))
  }, error = function(e){})
}
final <- do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")

In this loop, I added a line that prints which iteration the loop is currently on, along with the cumulative number of results scraped so far. I also wrapped the body in tryCatch() so that, in the worst case, the loop skips any iteration that produces an error; I was not expecting that to happen very often.

However, I am noticing that this loop produces errors and skips iterations far more often than I expected. In the output below, the left column is the iteration number and the right column is the cumulative result count.

My guess is that between certain hours the API might simply not have recorded any comments containing the keyword, so no new results are added to the list and the cumulative count stays flat, e.g.:

  iteration_number cumulative_results
1            13432               5673
2            13433               5673
3            13434               5673

But in my case, can someone please help me understand why this loop is producing so many errors and skipping so many iterations?

Thank you!

Asked by stats_noob

1 Answer

Your issue is almost certainly caused by rate limits that the Pushshift API imposes on your requests. When you scrape like this, the server may track how many requests a client has made within a certain time window and choose to return an error code (HTTP 429) instead of the requested data. This is called rate limiting, and it is a way for websites to limit abuse, charge customers for usage, or both.
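
If you want to see the status code directly rather than relying on the error that fromJSON() raises, here is a minimal sketch using the httr package (not part of the original script; the URL is just one hourly window built the same way as above):

library(httr)

# Hypothetical single request for one hourly window
url <- "https://api.pushshift.io/reddit/search/comment/?q=trump&after=2h&before=1h&size=500"
resp <- GET(url)
status_code(resp)  # 200 when the request succeeds, 429 when you are being rate limited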

Per this discussion about Pushshift on Reddit, it does look like Pushshift imposes a rate limit of 120 requests per minute (also see their /meta endpoint).
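
You can check what the server currently reports yourself with the same jsonlite call the script already uses; the exact field names in the response are not guaranteed, so inspect the result:

library(jsonlite)

# Fetch Pushshift's /meta endpoint and look at whatever rate-limit fields it exposes
meta <- fromJSON("https://api.pushshift.io/meta")
str(meta)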

I was able to confirm that a script like yours will run into rate limiting by changing this line and re-running your script:

}, error = function(e){})

to:

}, error = function(e){ message(e) })

After a while, I got output like:

HTTP error 429

The solution is to slow yourself down in order to stay within this limit. A straightforward way to do this is to add a call to Sys.sleep(1) inside your for loop, where 1 is the number of seconds to pause execution.

I modified your script as follows:

library(jsonlite)

part1 <- "https://api.pushshift.io/reddit/search/comment/?q=trump&after="
part2 <- "h&before="
part3 <- "h&size=500"

results <- list()
for (i in 1:50000) {
  tryCatch({
    Sys.sleep(1) # Changed. Change the value as needed.
    url_i <- paste0(part1, i + 1, part2, i, part3)
    r_i <- data.frame(fromJSON(url_i))
    results[[i]] <- r_i

    myvec_i <- sapply(results, NROW)
    print(c(i, sum(myvec_i)))
  }, error = function(e){
    message(e) # Changed. Prints to the console on error.
  })
}
final <- do.call(rbind.data.frame, results)
saveRDS(final, "final.RDS")

Note that you may have to use a pause longer than 1 second, and your script will take correspondingly longer to run.
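
If you would rather not slow down every request, another option is to retry only when a request fails, waiting a little longer after each failure. This is just a sketch of that idea, not part of the original answer; fetch_with_retry is a helper name I made up, and it uses only base R plus jsonlite:

library(jsonlite)

# Hypothetical helper: try a URL up to max_tries times, sleeping longer after each failure
fetch_with_retry <- function(url, max_tries = 5, wait = 1) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(data.frame(fromJSON(url)), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(wait * attempt)  # back off: wait 1s, 2s, 3s, ... between retries
  }
  NULL  # give up after max_tries failures
}

# Inside the loop you would then use:
# r_i <- fetch_with_retry(url_i)
# if (!is.null(r_i)) results[[i]] <- r_i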

Answered by amoeba