
R: How to get parent attributes and node values at the same time?

I have some HTML and R code like the following, and I need to relate each node value to its parent id in a data.frame. Different pieces of information are available for each person.

example <- "<div class='person' id='1'>
<div class='phone'>555-5555</div>
<div class='email'>[email protected]</div>
</div>
<div class='person' id='2'>
<div class='phone'>123-4567</div>
<div class='email'>[email protected]</div>
</div>
<div class='person' id='3'>
<div class='phone'>987-6543</div>
<div class='age'>32</div>
<div class='city'>New York</div>
</div>"

library(XML)

doc <- htmlTreeParse(example, useInternalNodes = TRUE)

values <- xpathSApply(doc, "//*[@class='person']/div", xmlValue)
variables <- xpathSApply(doc, "//*[@class='person']/div", xmlGetAttr, 'class')
id <- xpathSApply(doc, "//*[@class='person']", xmlGetAttr, 'id')

# The problem: create a data.frame(id,variables,values)

With xpathSApply(), I can get the phone, email, and age values as well as the person attributes (id). However, these pieces of information come back isolated from each other, and I need to tie each one to the right data.frame variable and the right person. My real data contains many different kinds of information, so the naming of each variable has to be automatic.

My goal is to create a data.frame like this relating each id to its proper data.

  id variables          values
1  1     phone        555-5555
2  1     email    [email protected]
3  2     phone        123-4567
4  2     email [email protected]
5  3     phone        987-6543
6  3       age              32
7  3      city        New York

I believe I would have to write a function to use inside xpathSApply that gets a person's phone and that person's id at the same time, so that they stay related, but I haven't had any success with that so far.

Can anyone help me?

Erick Damasceno asked Aug 13 '13 21:08


1 Answer

In general, it's not going to be easy:

idNodes <- getNodeSet(doc, "//div[@id]")
ids <- lapply(idNodes, function(x) xmlAttrs(x)['id'])
values <- lapply(idNodes, xpathApply, path = './div[@class]', xmlValue)
attributes <- lapply(idNodes, xpathApply, path = './div[@class]', xmlAttrs)
do.call(rbind.data.frame, mapply(cbind, ids, values, attributes))
  V1              V2    V3
1  1        555-5555 phone
2  1    [email protected] email
3  2        123-4567 phone
4  2 [email protected] email
5  3        987-6543 phone
6  3              32   age
7  3        New York  city

The above will give you attribute and value pairs, assuming they are nested in a div with an associated id.

UPDATE: if you want to wrap it in an xpathApply-type call:

utilFun <- function(x){
  id <- xmlGetAttr(x, 'id')
  # keep element children only; whitespace text nodes are dropped
  kids <- xmlChildren(x, omitNodeTypes = "XMLInternalTextNode")
  values <- sapply(kids, xmlValue)
  attributes <- sapply(kids, xmlAttrs)
  # one row per child; the single id is recycled across the rows
  data.frame(id = id, attributes = attributes, values = values, stringsAsFactors = FALSE)
}
res <- xpathApply(doc, '//div[@id]', utilFun)
do.call(rbind, res)
  id attributes          values
1  1      phone        555-5555
2  1      email    [email protected]
3  2      phone        123-4567
4  2      email [email protected]
5  3      phone        987-6543
6  3        age              32
7  3       city        New York
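
Another option, sketched here under the assumption that every value node is a direct child of a `person` div, is to start from the child nodes and look up each parent's id with `xmlParent()`. The sample HTML is a shortened copy of the one in the question (the email rows are left out), and `parent_id` is a helper name introduced only for this illustration:

```r
library(XML)

# Shortened copy of the question's sample HTML (email rows omitted).
example <- "<div class='person' id='1'>
<div class='phone'>555-5555</div>
</div>
<div class='person' id='3'>
<div class='phone'>987-6543</div>
<div class='age'>32</div>
<div class='city'>New York</div>
</div>"

doc <- htmlParse(example, asText = TRUE)

# Hypothetical helper: read the id attribute of a node's parent element.
parent_id <- function(node) xmlGetAttr(xmlParent(node), "id")

# All three vectors come from the same XPath, so they stay aligned row by row.
path <- "//div[@class='person']/div"
res <- data.frame(
  id        = xpathSApply(doc, path, parent_id),
  variables = xpathSApply(doc, path, xmlGetAttr, "class"),
  values    = xpathSApply(doc, path, xmlValue),
  stringsAsFactors = FALSE
)
res
```

Because each column is built from the same node set in document order, no per-person loop or `rbind` step is needed. The trade-off is that the XPath is evaluated three times.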
jdharrison answered Nov 15 '22 03:11