I use R and its package xml2 to parse an HTML document. I extracted a piece of the HTML file, which looks like this:
text <- ('<div>
<p><span class="number">1</span>First <span class="small-accent">previous</span></p>
<p><span class="number">2</span>Second <span class="accent">current</span></p>
<p><span class="number">3</span>Third </p>
<p><span class="number">4</span>Fourth <span class="small-accent">last</span> A</p>
</div>')
My goal is to extract information from this text and convert it into a data frame that looks like this:
  number    label text_of_accent type_of_accent
1      1    First       previous   small-accent
2      2   Second        current         accent
3      3    Third
4      4 Fourth A           last   small-accent
I tried the following code:
library(xml2)
library(magrittr)
html_1 <- text %>%
read_html() %>%
xml_find_all( "//span[@class='number']")
number <- html_1 %>% xml_text()
label <- html_1 %>%
xml_parent() %>%
xml_text(trim = TRUE)
text_of_accent <- html_1 %>%
xml_siblings() %>%
xml_text()
type_of_accent <- html_1 %>%
xml_siblings() %>%
xml_attr("class")
Unfortunately, label, text_of_accent, and type_of_accent are not extracted as I expect:
label
[1] "1First previous" "2Second current" "3Third" "4Fourth last A"
text_of_accent
[1] "previous" "current" "last"
type_of_accent
[1] "small-accent" "accent" "small-accent"
Is it possible to achieve my goal with just xml2, or do I need additional tools? At the very least, is it possible to extract only the pieces of text for label?
It can be done with xml2 alone. Your label is wrong because xml_text() returns the text of the current node together with the text of all its child nodes. To avoid this, use the XPath text() to select only the text nodes that belong directly to the current node, and extract those. You also need to check whether the accent span exists and handle the missing case explicitly:
library(xml2)
library(magrittr)

# read in text as html and extract all p nodes as a list
lst <- read_html(text) %>% xml_find_all("//p")

lapply(lst, function(node) {
  # find the first span, which holds the number
  first_span_node <- xml_find_first(node, "./span[@class='number']")
  number <- xml_text(first_span_node, trim = TRUE)

  # use text() to select only the text nodes directly under the current p;
  # trim each piece so that joining them does not produce doubled spaces
  label <- paste(xml_text(xml_find_all(node, "./text()"), trim = TRUE), collapse = " ")

  # find the second span (the accent), if there is one
  accent_node <- xml_find_first(first_span_node, "./following-sibling::span")

  # xml_find_first() returns a zero-length missing node when nothing matches
  if (length(accent_node) != 0) {
    text_of_accent <- xml_text(xml_find_first(accent_node, "./text()"))
    type_of_accent <- xml_attr(accent_node, "class")
  } else {
    text_of_accent <- ""
    type_of_accent <- ""
  }

  c(number = number, label = label,
    text_of_accent = text_of_accent,
    type_of_accent = type_of_accent)
}) %>%
  do.call(rbind, .) %>% as.data.frame()
#   number    label text_of_accent type_of_accent
# 1      1    First       previous   small-accent
# 2      2   Second        current         accent
# 3      3    Third
# 4      4 Fourth A           last   small-accent
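To see the difference between xml_text() on an element and the XPath text() in isolation, here is a small sketch on the fourth paragraph from the question. The class name 'no-such-class' is made up for illustration, only to trigger a lookup that matches nothing; it is not part of the original markup.

```r
library(xml2)

# A single paragraph taken from the question's sample markup
p <- xml_find_first(
  read_html('<p><span class="number">4</span>Fourth <span class="small-accent">last</span> A</p>'),
  "//p"
)

# xml_text() on the element concatenates the text of all descendants
all_text <- xml_text(p)
print(all_text)   # "4Fourth last A"

# "./text()" selects only the text nodes that are direct children of <p>
own_text <- xml_text(xml_find_all(p, "./text()"), trim = TRUE)
print(own_text)   # "Fourth" "A"

# A lookup that matches nothing ('no-such-class' is a hypothetical class)
# yields a missing node rather than an error
missing_span <- xml_find_first(p, "./span[@class='no-such-class']")
missing_text <- xml_text(missing_span)
print(missing_text)
```

The two direct text nodes come back as separate strings, which is why the answer joins them with paste() before building the row. The last lookup shows why the existence check is needed: the result for a span that is not there has to be turned into an empty string by hand.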