
403 Error When Using Rvest to Log Into Website For Scraping

I am trying to scrape a page on a website that requires a login and am consistently getting a 403 error.

I have adapted the code from these two posts for my site: "Using rvest or httr to log in to non-standard forms on a webpage" and "How to reuse a session to avoid repeated login when scraping with rvest?"

library(rvest)
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1")
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, 'username'='user', 'password'='pass')
s <- submit_form(pgsession, filled_form) # s is your logged in session

When the code is run, I get this message:

Submitting with 'NULL'
Warning message:
In request_POST(session, url = url, body = request$values, encode = request$encode,  :
  Forbidden (HTTP 403).

I have also run the code with an updated user_agent, as R.S. suggested in the comments, but I receive the same error as above.

library(rvest)
library(httr)
uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"
pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring))
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(pgform, 'username'='user', 'password'='pass')
s <- submit_form(pgsession, filled_form) # s is your logged in session

If you open the page without logging in, it shows part of the data table at the bottom, directly below the text "Earnings Events Available: 65".

Once logged in, the page shows all 65 events and the table is filled in, which is what I want to download. I have all the code necessary to do that in place but am stuck on the login part.

Thank you for your help.

mks212 asked Oct 22 '16 23:10



2 Answers

Using R.S.'s suggestion, I used RSelenium to log in successfully.

A quick note for fellow Mac users using either Chrome or PhantomJS: I am running El Capitan and had some trouble getting the Mac to recognize the paths to both of the bin files. Moving the bin files to /usr/local/bin fixed it, and they ran without an issue.

Below is the code to do so:

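library(RSelenium)
# Start a local Selenium server.
# Note: startServer() is defunct in recent RSelenium releases, where rsDriver() replaces it.
RSelenium::startServer()
remDr <- remoteDriver(browserName = "chrome")
remDr$open()

# Log in through the site's login page
appURL <- 'https://www.optionslam.com/accounts/login/'
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
# Type the password, then press Enter to submit the form
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))

# Now logged in: open the page to scrape
appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)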

This can also be done with PhantomJS:

library(RSelenium)

pJS <- phantom() # start phantomjs

# Log in through the site's login page
appURL <- 'https://www.optionslam.com/accounts/login/'
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "id_username")$sendKeysToElement(list("user"))
# Type the password, then press Enter to submit the form
remDr$findElement("id", "id_password")$sendKeysToElement(list("password", key='enter'))

# Now logged in: open the page to scrape
appURL <- 'https://www.optionslam.com/earnings/stocks/MSFT?page=-1'
remDr$navigate(appURL)
mks212 answered Oct 07 '22 05:10


Here is how to solve the problem in the original use case with rvest:

   library(rvest)
   library(httr)
   uastring <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36"

   pgsession <- html_session("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", user_agent(uastring))

   pgform <- html_form(pgsession)[[1]]

   filled_form <- set_values(pgform,
                             username = 'un',
                             password = 'ps')

   s <- submit_form(pgsession, filled_form, submit = NULL, config(referer = pgsession$url)) # s is your logged in session

The request requires knowledge of the page you've come from (the referer, which is how HTTP spells the header name). Passing it in the config when submitting the form supplies it:

config(referer = pgsession$url)
Ross Ireland answered Oct 07 '22 03:10
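
For readers on rvest 1.0.0 or later, where html_session(), set_values() and submit_form() were renamed to session(), html_form_set() and session_submit(), the referer approach above might look roughly like the sketch below. The helper name login_with_referer and its arguments are hypothetical, and the sketch assumes, as in the answers above, that the login form is the first form on the page and that its fields are named username and password. It has not been run against the site.

```r
library(rvest)  # assumes rvest >= 1.0.0 (session(), html_form_set(), session_submit())
library(httr)

# Hypothetical helper: log in through the first form found at `url`,
# sending the form page's own URL as the referer of the login request.
login_with_referer <- function(url, username, password, ua) {
  # Open a session with a browser-like user agent
  pgsession <- session(url, user_agent(ua))

  # Assumption: the login form is the first form on the page
  pgform <- html_form(pgsession)[[1]]

  # Assumption: the form fields are named "username" and "password"
  filled_form <- html_form_set(pgform, username = username, password = password)

  # Extra arguments are passed on to httr, so the referer travels with the request
  session_submit(pgsession, filled_form, submit = NULL,
                 config(referer = pgsession$url))
}
```

Calling s <- login_with_referer("https://www.optionslam.com/earnings/stocks/MSFT?page=-1", "un", "ps", uastring) should then return the logged-in session, if the site still behaves as described in this thread.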