Scrape a site that asks for cookie consent with rvest

Tags: rvest

I'd like to scrape (using rvest) a website that asks users to consent to set cookies. If I just scrape the page, rvest only downloads the popup. Here is the code:

library(rvest)
content <- read_html("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c") 
content %>% html_text()

The result seems to be the content of the popup window asking for consent.

Is there a way to ignore or accept the popup or to set a cookie in advance so I can access the main text of the site?


asked Oct 16 '20 15:10

2 Answers

As suggested, the website is dynamic, which means its content is built in the browser by JavaScript. Reconstructing from the .js files how this is done is usually very time-consuming (or outright impossible), but in this case the "network analysis" tab of your browser's developer tools shows that the information you want is served by an openly visible API: the request to api.karriere.nrw.

Hence you can take the uuid (the posting's identifier in the database) from your URL, make a simple GET request to the API, and go straight to the source. This avoids rendering the page through RSelenium, which costs extra time and resources.

Be friendly, though, and send along some way to contact you, so they can tell you to stop.

library(tidyverse)  # provides stringr (str_split, str_c) and the %>% pipe
library(httr)
library(jsonlite)

### contact header so the site operators can reach you (use your own address)
headers <- c("Email" = "[email protected]")

### assuming the url is given and always has the same format
url <- "https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c"

### extract identifier of job posting (5th piece when splitting on "/")
uuid <- str_split(url, "/")[[1]][5]

### make api call-address
api_url <- str_c("https://api.karriere.nrw/v1.0/stellenausschreibungen/", uuid)

### get results and parse the JSON body into an R object
response <- httr::GET(api_url,
                      httr::add_headers(.headers = headers))
result <- httr::content(response, as = "text", encoding = "UTF-8") %>%
  jsonlite::fromJSON()
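If you need this for more than one posting, the steps above can be wrapped in two small functions. This is only a sketch: the names build_api_url and fetch_posting are made up for illustration, the UUID check assumes every posting URL ends in a standard 8-4-4-4-12 hexadecimal identifier, and the endpoint is the one observed in the browser's network tab, which may change without notice.

```r
# Hypothetical helpers; httr and jsonlite are called via `::`, so they only
# need to be installed, not attached.

# Turn a posting URL into the matching API URL.
build_api_url <- function(page_url) {
  # basename() returns the last path segment and ignores a trailing slash
  uuid <- basename(page_url)
  # Assumption: identifiers follow the usual 8-4-4-4-12 hex UUID layout
  uuid_pattern <- "^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$"
  if (!grepl(uuid_pattern, uuid, ignore.case = TRUE)) {
    stop("URL does not end in a UUID: ", page_url)
  }
  paste0("https://api.karriere.nrw/v1.0/stellenausschreibungen/", uuid)
}

# Request one posting and parse the JSON body.
# `contact` is sent as an "Email" header, as in the answer above.
fetch_posting <- function(page_url, contact) {
  response <- httr::GET(build_api_url(page_url),
                        httr::add_headers(Email = contact))
  httr::stop_for_status(response)  # raise an R error on HTTP 4xx/5xx
  jsonlite::fromJSON(httr::content(response, as = "text", encoding = "UTF-8"))
}
```

Calling str(result, max.level = 1) on the parsed object is a quick way to see which fields the API returns before picking out the ones you need.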


answered Oct 22 '22 06:10

Datapumpernickel

That website isn't static, so I don't think there's a way to scrape it using rvest alone (I would love to be proved wrong though!). An alternative is to use RSelenium to 'click' the popup away and then scrape the rendered content, e.g.:

library(tidyverse)
library(rvest)
#install.packages("RSelenium")
library(RSelenium)

driver <- rsDriver(browser=c("firefox"))
remote_driver <- driver[["client"]]
remote_driver$navigate("https://karriere.nrw/stellenausschreibung/dba41541-8ed9-4449-8f79-da3cda0cc07c")
webElem <- remote_driver$findElement("id", "popup_close")
webElem$clickElement()
out <- remote_driver$findElement(using = "class", value="css-1nedt8z")
scraped <- out$getElementText()
scraped
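Once the popup is closed, the rendered page can also be handed back to rvest, so the usual selectors work again. A minimal sketch, assuming the session from the code above is still open; extract_rendered_text is a made-up helper name, and the class css-1nedt8z is the one used above, which looks auto-generated and may change when the site is rebuilt.

```r
# Hypothetical helper: parse an HTML string (e.g. a Selenium page source)
# with rvest and return the text of every node matching a CSS selector.
extract_rendered_text <- function(page_source, css) {
  doc <- rvest::read_html(page_source)
  nodes <- rvest::html_elements(doc, css)
  rvest::html_text(nodes, trim = TRUE)
}

# Usage with the RSelenium session from above (needs a running browser):
# page_source <- remote_driver$getPageSource()[[1]]
# extract_rendered_text(page_source, ".css-1nedt8z")
#
# Shut down the browser and the Selenium server when finished:
# remote_driver$close()
# driver$server$stop()
```

Keeping the parsing in a function that takes plain HTML means the browser is only needed for the rendering step, and the extraction can be tried out on a saved copy of the page.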

Edit: Supporting info concerning the "non-static hypothesis":

If you check how the site is rendered in the browser, you will see that loading the "base document" alone is not sufficient; the supporting JavaScript is required as well. (Source: Chrome)


answered Oct 22 '22 05:10

jared_mamrot
