I am using rvest to parse a website, and I'm hitting a wall with these little non-breaking spaces. How does one remove the whitespace that is created by the &nbsp; entity in a parsed HTML document?
library("rvest")
library("stringr")

minimal <- read_html("<!doctype html><title>blah</title> <p>&nbsp;foo")
bodytext <- minimal %>%
  html_node("body") %>%
  html_text()
Now I have extracted the body text:
bodytext
[1] " foo"
However, I can't remove that pesky bit of whitespace! Neither of these has any effect:

str_trim(bodytext)
gsub(pattern = " ", "", bodytext)
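To see why those attempts fail, you can inspect the code points in the extracted string. A minimal sketch (using a literal `"\u00a0foo"` as a stand-in for the rvest output above):

```r
# The leading character prints like a space but is U+00A0 (NO-BREAK SPACE),
# not the ASCII space U+0020, so a literal " " pattern never matches it.
bodytext <- "\u00a0foo"  # stand-in for the scraped body text

utf8ToInt(substr(bodytext, 1, 1))  # 160, the code point of the no-break space
```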
I have run into the same problem and settled on a simple substitution:

gsub(intToUtf8(160), "", bodytext)
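For completeness, the same substitution can also be written with a Unicode escape. A quick sketch, again using a literal string in place of the rvest output:

```r
bodytext <- "\u00a0foo"                     # stand-in for the scraped text

gsub(intToUtf8(160), "", bodytext)          # "foo" — builds the pattern from code point 160
gsub("\u00a0", "", bodytext, fixed = TRUE)  # "foo" — same result via an escape, no regex
```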
jdharrison answered:
gsub("\\W", "", bodytext)
That will work, but \W also strips punctuation and every other non-word character. A more targeted option is:
gsub("[[:space:]]", "", bodytext)
which removes all space characters: tab, newline, vertical tab, form feed, carriage return, space, and possibly other locale-dependent characters. It's a very readable alternative to other, more cryptic regex classes.
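A quick sketch of the difference between the two classes on a literal string (whether [[:space:]] also matches the no-break space itself is locale-dependent, so that part is not shown):

```r
x <- " \tfoo, bar\n"

gsub("[[:space:]]", "", x)  # "foo,bar" — whitespace removed, punctuation kept
gsub("\\W", "", x)          # "foobar"  — everything but word characters removed
```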