
webscraping: replacing tags manually

Tags: r, rvest

I am experimenting with "rvest". Fetching the page with "read_html" works fine.

library(rvest)
# suppressMessages(library(dplyr))
library(stringr)
library(XML)

# get house data
houseurl <- "http://boekhoff.de/immobilien/gepflegtes-zweifamilienhaus-in-ellwuerden/"
house <- read_html(houseurl)
house

I run into problems when processing the data. They are described in the comments in the code.

## eliminating <br> tags in the address
# using the following commands causes an error in "html_nodes" later on
str_extract_all(house, "<br>") ## show all line breaks
# replacing each <br> with a whitespace " "
house <- str_replace_all(house, "<br>", " ")

Now I read out the details, but that does not seem to work:

houseattribut <- house %>%
    html_nodes(css = "div.col-2 li p.data-left") %>%
    html_text(trim = TRUE)
# shows "Error in UseMethod("xml_find_all") : ... "
# but all attributes are shown on screen
houseattribut

Without replacing the "br" tags manually it works, but "html_text" runs the strings together:

housedetails <- house %>%
    html_nodes(css = "div.col-2 li p.data-right") %>%
    html_text()
housedetails
# the same error shows "Error in UseMethod("xml_find_all") : ... "
# but all details are shown on screen

housedetails[4]
# in the source there is: "Ellwürder Straße 17<br>26954 Nordenham"
# the <br> tag should become a whitespace

Any hints what I'm doing wrong?

wattnwurm asked Feb 19 '26 19:02


1 Answer

The problem is that read_html returns an xml_document, but str_replace_all turns house into a character vector (chr). When you then try to select nodes, house is no longer an xml_document, and that causes the error.
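
To see the type change for yourself, a minimal sketch like the following may help. The HTML snippet is a made-up stand-in for the real page, and as.character() plus gsub() are used here only to make the conversion explicit:

```r
library(xml2)
library(rvest)

# Made-up stand-in for the real page (an assumption, not the original site)
doc <- read_html("<p class='data-right'>Example Street 17<br>12345 Sampletown</p>")
class(doc)  # includes "xml_document"

# String operations work on the serialized markup, so the result is plain text
txt <- gsub("<br>", " ", as.character(doc), fixed = TRUE)
class(txt)  # "character"

# Parsing the text again yields a document that html_nodes() accepts
doc2 <- read_html(txt)
address <- html_text(html_nodes(doc2, "p.data-right"))
```

The point of the sketch is only the class of each object: node selection needs the parsed document, not its text.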

You need to convert it back to an xml_document, or apply the replacement node by node.

Something like this:

house <- read_html(str_replace_all(house,"<br>", " "))

Full code:

library(rvest)
#> Loading required package: xml2
library(stringr)

houseurl <- "http://boekhoff.de/immobilien/gepflegtes-zweifamilienhaus-in-ellwuerden/"
house <- read_html(houseurl)

house <- read_html(str_replace_all(house,"<br>", " "))

housedetails <- house %>%
    html_nodes(css = "div.col-2 li p.data-right") %>% 
    html_text()

housedetails[4]
#> [1] "Ellwürder Straße 17 26954 Nordenham"
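
The node-by-node alternative mentioned earlier could be sketched roughly as follows. This is one possible approach using xml2, not necessarily the best one: it inserts a <span> holding a single space after each <br> and then removes the <br> nodes, so the document never leaves its parsed form. The HTML fragment and the address in it are hypothetical placeholders for the real page.

```r
library(rvest)
library(xml2)

# Hypothetical fragment mimicking the structure of the page (assumption)
html <- paste0(
  '<div class="col-2"><ul><li>',
  '<p class="data-right">Example Street 17<br>12345 Sampletown</p>',
  '</li></ul></div>'
)
house <- read_html(html)

# Find every <br> in the parsed document
br_nodes <- xml_find_all(house, ".//br")

# Add a <span> containing one space next to each <br>, then drop the <br> itself
xml_add_sibling(br_nodes, "span", " ")
xml_remove(br_nodes)

# The document is still an xml_document, so node selection keeps working
address <- house %>%
    html_nodes(css = "div.col-2 li p.data-right") %>%
    html_text()
address
```

With rvest 1.0.0 or later, html_text2() may also be worth a look, since it turns <br> into a line break ("\n") instead of dropping it.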
Icaro Bombonato answered Feb 21 '26 10:02


