 

rvest: how to find required css-selector

I am trying to scrape parts of speeches held in parliament with the rvest package. A CSS selector tool and Chrome's inspector both give me a selector, but I am unable to retrieve the intended (or any) data with it. AFAIK, the site does not rely on JavaScript, so RSelenium or similar should not be required.

Here is the link and my attempt:

library(tidyverse)
library(rvest)
library(xml2)

session_1 <- "https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00001/fnameorig_796482.html"

x <- session_1 %>%  
  rvest::read_html() %>% 
  rvest::html_element("wordsection14") %>% 
  rvest::html_text()

Eventually, I would like to be able to get the text contained in all elements with the class 'wordsection*'.

Would be very grateful for any hint. Many thanks.

zoowalk Avatar asked Jan 24 '26 16:01



1 Answer

tl;dr: The problem is not the CSS selectors, it's the encoding. Specify encoding = "latin1".

read_html('https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00001/fnameorig_796482.html', encoding = "latin1") %>% 
  html_nodes('[class^=WordSection]') %>%
  html_text() %>% 
  length()
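
To see the effect of the encoding argument without requesting the site, here is a minimal sketch on a hand-built byte vector. The markup and the word "Präsident" are made up for illustration; only the encoding argument of read_html() and the WordSection class prefix come from the answer. html_elements() is the current rvest name for html_nodes().

```r
library(rvest)

# Hypothetical page fragment stored as latin1 bytes: 0xE4 is "ä" in latin1
# but is not a valid character on its own in UTF-8.
latin1_bytes <- c(
  charToRaw('<html><body><div class="WordSection1"><p>Pr'),
  as.raw(0xe4),
  charToRaw('sident</p></div></body></html>')
)

# Declare the encoding explicitly, as in the answer above.
speech <- read_html(latin1_bytes, encoding = "latin1") %>%
  html_elements("[class^=WordSection]") %>%
  html_text()

print(speech)
```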

Curl:

You could also use curl.

library(rvest)
library(curl)

text_info <- curl_fetch_memory("https://www.parlament.gv.at/PAKT/VHG/XXVII/NRSITZ/NRSITZ_00001/fnameorig_796482.html") %>%
  {rawToChar(.$content)} %>%
  .[[1]] %>%
  read_html() %>%
  html_nodes("[class^=WordSection]") %>%
  html_text()

CSS Selectors:

Use a CSS attribute = value selector with the starts-with operator ^ to get all the nodes whose class value starts with WordSection.
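
A small sketch of that selector on an inline document (the HTML here is invented for illustration, not taken from the parliament page). It also shows that a bare name such as wordsection14 is read as a tag name rather than a class, so it matches nothing.

```r
library(rvest)

# Hypothetical markup that mimics Word-exported sections
page <- read_html('<html><body>
  <div class="WordSection1"><p>First speech</p></div>
  <div class="WordSection2"><p>Second speech</p></div>
  <div class="Other"><p>Not a speech</p></div>
</body></html>')

# Type selector: looks for a <wordsection14> element, which does not exist
no_match <- page %>% html_elements("wordsection14")

# Attribute selector: class attribute starts with "WordSection"
sections <- page %>%
  html_elements("[class^=WordSection]") %>%
  html_text(trim = TRUE)

print(length(no_match))
print(sections)
```

Note that ^= tests the start of the whole class attribute, so an element whose class list begins with some other class would not be matched.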

Given that there is a lot of nesting, you may decide to use nth-child range selectors or other CSS selector combinations to restrict the match list and avoid repeated material.
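
One way such a restriction could look, sketched on invented markup: if a matching element is nested inside another matching element, its text is returned twice (once on its own, once as part of its parent). A child combinator is one of the "other combinations" that keeps only the outer level. The nesting and class names below are assumptions, not the structure of the actual page.

```r
library(rvest)

# Hypothetical nesting: an inner block also carries a WordSection* class
page <- read_html('<html><body>
  <div class="WordSection1">
    <p>Outer text</p>
    <div class="WordSection1a"><p>Inner text</p></div>
  </div>
</body></html>')

# Matches both the outer and the inner div, so "Inner text" appears twice
all_matches <- page %>% html_elements("[class^=WordSection]")

# Only direct children of <body>, so each piece of text appears once
top_level <- page %>% html_elements("body > [class^=WordSection]")

print(length(all_matches))
print(length(top_level))
```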

Write some custom function(s) to manage string cleaning.
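
As a rough idea of what such a helper could look like, here is a sketch in base R. The function name clean_text and the specific cleaning steps (non-breaking spaces, collapsed whitespace, trimming) are assumptions about what the scraped text needs; adjust them to the actual output.

```r
# Hypothetical helper: normalise whitespace in scraped text
clean_text <- function(x) {
  x <- gsub("\u00a0", " ", x, fixed = TRUE)  # non-breaking spaces to plain spaces
  x <- gsub("[[:space:]]+", " ", x)          # collapse runs of spaces and line breaks
  trimws(x)                                  # drop leading and trailing whitespace
}

cleaned <- clean_text("  Herr  Abgeordneter,\r\n\u00a0 danke.  ")
print(cleaned)
```

It can be applied directly to the character vector returned by html_text().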

You can of course use different css selectors if you so choose.


QHarr Avatar answered Jan 27 '26 06:01



