
R and xml2: how to read text that is not in children nodes and read information even if node is missing

Tags: r, xpath, xml2

I use R and its package xml2 to parse an HTML document. I extracted a piece of the HTML file, which looks like this:

text <- ('<div>
<p><span class="number">1</span>First&nbsp;<span class="small-accent">previous</span></p>
<p><span class="number">2</span>Second&nbsp;<span class="accent">current</span></p>
<p><span class="number">3</span>Third&nbsp;</p>
<p><span class="number">4</span>Fourth&nbsp;<span class="small-accent">last</span> A</p>
</div>')

My goal is to extract information from the text and convert it into a data frame that looks like this:

  number      label   text_of_accent   type_of_accent
1      1      First         previous     small-accent
2      2     Second          current           accent
3      3      Third                                  
4      4   Fourth A             last     small-accent

I tried the following code:

library(xml2)
library(magrittr)

html_1 <- text %>% 
    read_html() %>% 
    xml_find_all( "//span[@class='number']")  

number <- html_1 %>% xml_text()

label  <- html_1 %>%
    xml_parent() %>% 
    xml_text(trim = TRUE)

text_of_accent <- html_1 %>%
    xml_siblings() %>% 
    xml_text()

type_of_accent <- html_1 %>% 
    xml_siblings() %>%
    xml_attr("class")

Unfortunately, label, text_of_accent, and type_of_accent are not extracted as I expect:

label
[1] "1First previous" "2Second current" "3Third"          "4Fourth last A" 

text_of_accent
[1] "previous" "current"  "last" 

type_of_accent
[1] "small-accent" "accent"       "small-accent"

Is it possible to achieve my goal with just xml2, or do I need additional tools? At the very least, is it possible to extract only the pieces of text for label?

GegznaV asked Mar 15 '17 01:03


1 Answer

Yes, it can be done with xml2 alone. Your label is wrong because xml_text() returns all the text inside a node, including the text of its child nodes. To avoid this, use the XPath text() test to select only the text nodes that are direct children of the current node, then extract their text. You also need to check whether the second span exists and handle the missing case explicitly:

# read in text as html and extract all p nodes as a list
lst <- read_html(text) %>% xml_find_all("//p")

lapply(lst, function(node) {
    # find the first span
    first_span_node = xml_find_first(node, "./span[@class='number']")

    number = xml_text(first_span_node, trim = TRUE)

    # use the text() to find out text nodes from the current position
    label = paste0(xml_text(xml_find_all(node, "./text()")), collapse = " ")

    # find the second span
    accent_node = xml_find_first(first_span_node, "./following-sibling::span")

    # check if the second span exists (a missing node has length 0)
    if(length(accent_node) != 0) {
        text_of_accent = xml_text(xml_find_first(accent_node, "./text()"))
        type_of_accent = xml_text(xml_find_first(accent_node, "./@class"))    
    } else {
        text_of_accent = ""
        type_of_accent = ""
    }

    c(number = number, label = label, 
      text_of_accent = text_of_accent, 
      type_of_accent = type_of_accent)
}) %>% 
do.call(rbind, .) %>% as.data.frame()


#  number     label text_of_accent type_of_accent
#1      1    First        previous   small-accent
#2      2   Second         current         accent
#3      3    Third                               
#4      4 Fourth  A           last   small-accent
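
A possible variation on the code above, offered as a sketch rather than as part of the tested answer: xml_find_first() also accepts a whole node set and returns one result per input node, with a missing node wherever nothing matches, so xml_text() and xml_attr() yield NA for the third paragraph and the explicit if/else can be dropped. The names p_nodes, accent, and result are illustrative only. The whitespace clean-up step is an added assumption about the desired output: it turns the non-breaking space produced by &nbsp; into a regular space so that the last label becomes "Fourth A" instead of "Fourth  A".

```r
library(xml2)

text <- '<div>
<p><span class="number">1</span>First&nbsp;<span class="small-accent">previous</span></p>
<p><span class="number">2</span>Second&nbsp;<span class="accent">current</span></p>
<p><span class="number">3</span>Third&nbsp;</p>
<p><span class="number">4</span>Fourth&nbsp;<span class="small-accent">last</span> A</p>
</div>'

p_nodes <- xml_find_all(read_html(text), "//p")

# One result per <p>; a missing node where there is no second span
accent <- xml_find_first(p_nodes, "./span[@class='number']/following-sibling::span")

# Join each paragraph's own text nodes, then normalise whitespace
label <- vapply(p_nodes, function(p) {
  own_text <- paste(xml_text(xml_find_all(p, "./text()")), collapse = " ")
  own_text <- gsub("\u00a0", " ", own_text, fixed = TRUE)  # nbsp -> space
  trimws(gsub("\\s+", " ", own_text))                       # collapse runs, trim ends
}, character(1))

result <- data.frame(
  number         = xml_text(xml_find_first(p_nodes, "./span[@class='number']")),
  label          = label,
  text_of_accent = xml_text(accent),           # NA where the span is missing
  type_of_accent = xml_attr(accent, "class"),  # NA where the span is missing
  stringsAsFactors = FALSE
)

# Replace the NAs from missing spans with empty strings, as in the question
result[is.na(result)] <- ""
result
```

Skip the last assignment if you would rather keep NA for the missing values; it is there only to match the empty cells in the data frame shown in the question.
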
Psidom answered Oct 21 '22 11:10