
Using R for webscraping: HTTP error 503 despite using long pauses in program

Tags: r, web-scraping

I'm trying to search the ProQuest Archiver using R. I'm interested in finding the number of articles for a newspaper containing a certain keyword.

This generally works well with the rvest package. However, the program sometimes breaks down. Here is a minimal example:

library(xml2)
library(rvest)

# Retrieve the title of the first search hit on the page of search results
for (p in seq(0, 150, 10)) {
  searchURL <- paste("http://pqasb.pqarchiver.com/djreprints/results.html?st=advanced&QryTxt=bankruptcy&sortby=CHRON&datetype=6&frommonth=01&fromday=01&fromyear=1908&tomonth=12&today=31&toyear=1908&By=&Title=&at_hist=article&at_hist=editorial_article&at_hist=front_page&type=historic&start=", p, sep="")
  htmlWeb <- read_html(searchURL)
  nodeWeb <- html_node(htmlWeb, ".text tr:nth-child(1) .result_title a") 
  textWeb <- html_text(nodeWeb)
  print(textWeb)
  Sys.sleep(0.1)
}

This works for me sometimes. But if I run this or similar scripts a couple of times, it breaks down at the same point, with an error at p = 120:

Error in open.connection(x, "rb") : HTTP error 503.

I tried to circumvent this by inserting pauses of escalating length, but that doesn't help.

I've also considered:

  • saving which result pages cannot be reached and handling those cases in a separate script
  • changing my IP partway through the program
  • quitting and restarting R partway through the program
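One way to combine escalating pauses with multiple attempts per URL is a small retry wrapper with exponential backoff. This is only a sketch: `read_with_retry` is a hypothetical helper (not from the question), and the fetcher is passed in as `reader` (in a real script it would be `xml2::read_html`) so the retry logic itself can be tested without hitting the network.

```r
# Hedged sketch: retry a fetch with exponentially growing pauses.
# `read_with_retry` is a hypothetical name; `reader` would be
# xml2::read_html in the real script.
read_with_retry <- function(url, reader, max_tries = 5, base_delay = 1) {
  for (attempt in seq_len(max_tries)) {
    # Capture the error instead of aborting the whole loop
    result <- tryCatch(reader(url), error = function(e) e)
    if (!inherits(result, "error")) return(result)          # success
    Sys.sleep(base_delay * 2^(attempt - 1))                 # 1s, 2s, 4s, ...
  }
  stop("failed after ", max_tries, " attempts: ", url)
}
```

With this in place, a transient 503 on one page no longer kills the whole loop; only a URL that fails `max_tries` times raises an error.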

Thanks in advance for any comments.

Asked Mar 05 '26 20:03 by ulima2_

2 Answers

Try being a bit more human-like in the delays. This works for me (multiple tries):

library(xml2)
library(httr)
library(rvest)
library(purrr)
library(dplyr)

to_get <- seq(0, 150, 10)
pb <- progress_estimated(length(to_get))

map_chr(to_get, function(i) {
  pb$tick()$print()
  searchURL <- paste("http://pqasb.pqarchiver.com/djreprints/results.html?st=advanced&QryTxt=bankruptcy&sortby=CHRON&datetype=6&frommonth=01&fromday=01&fromyear=1908&tomonth=12&today=31&toyear=1908&By=&Title=&at_hist=article&at_hist=editorial_article&at_hist=front_page&type=historic&start=", i, sep="")
  htmlWeb <- read_html(searchURL)
  nodeWeb <- html_node(htmlWeb, "td > font.result_title > a")
  textWeb <- html_text(nodeWeb)
  Sys.sleep(sample(10, 1) * 0.1)
  textWeb
}) -> titles

print(trimws(titles))

##  [1] "NEWSPAPER SPECIALS."                                      
##  [2] "NEWSPAPER SPECIALS."                                      
##  [3] "New Jersey Ice Co. Insolvent."                            
##  [4] "NEWSPAPER SPECIALS."                                      
##  [5] "NEWSPAPER SPECIALS"                                       
##  [6] "AMERICAN ICE BEGINNING BUSY SEASON IN IMPROVED CONDITION."
##  [7] "NEWSPAPER SPECIALS"                                       
##  [8] "THE GERMAN REICHSBANK."                                   
##  [9] "U.S. Exploration Co. Bankrupt."                           
## [10] "CHICAGO TRACTION."                                        
## [11] "INCREASING FREIGHT RATES."                                
## [12] "A.O. BROWN & CO."                                         
## [13] "BROAD STREET GOSSIP"                                      
## [14] "Meadows, Williams & Co."                                  
## [15] "FAILURES IN OCTOBER."                                     
## [16] "Supplementary Receiver for Heinze & Co." 

I randomized the sleep-call value, simplified the CSS target a bit, added a progress bar and automagically made a vector. You probably ultimately want a data.frame from this data, so see ?purrr::map_df for that.
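That `purrr::map_df` pointer can be sketched like this. `collect_titles` and `fetch_title` are hypothetical names (not from the answer): `fetch_title` stands in for the `read_html()` + `html_node()` + `html_text()` chain, and is injected so the data-frame assembly can be shown without network access.

```r
library(purrr)

# Hedged sketch: build one data.frame row per results page.
# `fetch_title(i)` is a hypothetical stand-in for the scraping chain
# (read_html + html_node + html_text) used in the answer above.
collect_titles <- function(offsets, fetch_title) {
  map_df(offsets, function(i) {
    data.frame(start = i,
               title = trimws(fetch_title(i)),
               stringsAsFactors = FALSE)
  })
}

# In the real script, something like:
# collect_titles(seq(0, 150, 10), function(i) {
#   html_text(html_node(read_html(paste0(searchURL, i)),
#                       "td > font.result_title > a"))
# })
```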

Answered Mar 07 '26 09:03 by hrbrmstr

In the end, we use a combination of:

  1. random pauses,
  2. randomly changing the user agent (as suggested by hrbrmstr's comment) and
  3. trying multiple times when a URL access returns an error.

It still happens that we cannot access all URLs; in those cases we just save the info on where the failure occurred and move on.
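The three-part approach can be sketched as a single fetch function. This is a hedged illustration, not our exact script: `fetch_with_retries` and the `agents` strings are made-up names, and the `fetcher` argument (which would be `httr::GET(url, httr::user_agent(ua))` in real use) is injected so the retry/pause/fallback logic is testable offline.

```r
# Hedged sketch combining: (1) random pauses, (2) a randomized
# User-Agent, (3) bounded retries with NA recorded on final failure.
# `fetch_with_retries` and `agents` are illustrative names only.
agents <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12)",
  "Mozilla/5.0 (X11; Linux x86_64)"
)

fetch_with_retries <- function(url, fetcher, max_tries = 3,
                               min_pause = 1, max_pause = 5) {
  for (k in seq_len(max_tries)) {
    ua <- sample(agents, 1)                              # (2) random agent
    resp <- tryCatch(fetcher(url, ua), error = function(e) NULL)
    if (!is.null(resp)) return(resp)                     # success
    Sys.sleep(runif(1, min_pause, max_pause))            # (1) random pause
  }
  NA                                                     # (3) record and move on
}
```

Pages that come back as `NA` can then be collected and re-tried in a later pass, as described above.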

Thanks for your comments!

Answered Mar 07 '26 10:03 by ulima2_