Scrape site that asks for cookies consent with rvest

Tags: r, rvest

I'd like to scrape (using rvest) a website that asks users to consent to cookies. If I just scrape the page, rvest only downloads the popup. Here is the code:

library(rvest)
content <- read_html("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c") 
content %>% html_text()

The result seems to be the content of the popup window asking for consent.

Is there a way to ignore or accept the popup or to set a cookie in advance so I can access the main text of the site?

Dominik Vogel asked Oct 16 '20 15:10


2 Answers

As suggested, the website is dynamic, which means its content is built by JavaScript after the page loads. Working out from the .js files how that is done is usually very time-consuming (or outright impossible), but in this case the "network analysis" tab of your browser shows that the page fetches the information you want from an openly accessible API: the request goes to api.karriere.nrw.

Hence you can take the uuid (the posting's identifier in the database) from your url and make a simple GET request to the API. That goes straight to the source without rendering the page through RSelenium, which costs extra time and resources.

Be friendly though, and include some way to contact you in the request, so they can tell you to stop.

library(tidyverse)
library(httr)
library(jsonlite)

### contact header so the operators can reach you (placeholder, replace with your own address)
headers <- c("Email" = "your.name@example.com")

### assuming the url is given and always has the same format
url <- "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"

### extract identifier of job posting (fifth element after splitting on "/")
uuid <- str_split(url, "/")[[1]][5]

### make api call-address
api_url <- str_c("https://api.karriere.nrw/v1.0/stellenausschreibungen/", uuid)

### get results
response <- httr::GET(api_url,
                      httr::add_headers(.headers = headers))

### parse the JSON body into an R object
result <- httr::content(response, as = "text", encoding = "UTF-8") %>% jsonlite::fromJSON()
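
If the URL format ever varies, picking the uuid by its position in the split URL may break. A minimal sketch of a more defensive variant, using only base R: it looks for a UUID-shaped string anywhere in the URL and builds the API address from it. The helper name `build_api_url` is hypothetical, and the API base path is simply the one used above, not something documented by the site.

```r
# Hypothetical helper: find a UUID anywhere in a posting URL and
# build the matching API address (base path taken from the answer above).
build_api_url <- function(url,
                          api_base = "https://api.karriere.nrw/v1.0/stellenausschreibungen/") {
  uuid_pattern <- "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
  # regmatches() returns character(0) when the pattern is not found
  uuid <- regmatches(url, regexpr(uuid_pattern, url))
  if (length(uuid) == 0) {
    stop("No UUID found in URL: ", url)
  }
  paste0(api_base, uuid)
}

api_url <- build_api_url(
  "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"
)
```

The resulting `api_url` can be passed to `httr::GET()` exactly as before; a URL without a UUID stops with an error instead of silently producing a bad request.
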
Datapumpernickel answered Oct 22 '22 06:10


That website isn't static, so I don't think there's a way to scrape it using rvest (I would love to be proved wrong though!); an alternative is to use RSelenium to 'click' the popup then scrape the rendered content, e.g.

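library(tidyverse)
library(rvest)
#install.packages("RSelenium")
library(RSelenium)

# Start a Selenium server and open a Firefox session
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c")

# Dismiss the cookie consent popup
webElem <- remote_driver$findElement(using = "id", value = "popup_close")
webElem$clickElement()

# Read the text of the rendered job posting
out <- remote_driver$findElement(using = "class name", value = "css-1nedt8z")
scraped <- out$getElementText()
scraped

# Close the browser and stop the Selenium server when done
remote_driver$close()
driver[["server"]]$stop()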
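
Once the popup is dismissed, the rendered HTML can also be handed back to rvest instead of reading elements one by one through Selenium. A minimal sketch, assuming the session is still open and that the posting still lives in an element with the class `css-1nedt8z` (generated class names like this tend to change); the helper name `extract_text` and the demo HTML string are made up for illustration.

```r
library(rvest)

# Hypothetical helper: parse an HTML string and return the text of all
# elements matching a CSS selector.
extract_text <- function(page_source, css) {
  page_source %>%
    read_html() %>%
    html_elements(css) %>%
    html_text2()
}

# In a live session the HTML string would come from Selenium:
# page_source <- remote_driver$getPageSource()[[1]]
# Stand-in HTML so the helper can be tried without a browser:
demo_source <- "<html><body><div class='css-1nedt8z'>Job text</div></body></html>"
posting_text <- extract_text(demo_source, ".css-1nedt8z")
```

This keeps Selenium's role limited to rendering and clicking, while the actual extraction uses the familiar rvest selectors.
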

Edit: Supporting info for the "non-static hypothesis":

If you check how the site is rendered in the browser, you will see that loading the "base document" alone is not sufficient; the page also needs its supporting JavaScript to show the content. (Source: Chrome)

jared_mamrot answered Oct 22 '22 05:10