
Editing XML files in R

Tags: r, xml

I have an XML document with the following element:

<sequence id = "ancestralSequence"> 
    <taxon id="test">
     </taxon>       
    ACAGTTGACACCCTT
</sequence>

and would like to parse a new sequence of characters inside the "taxon" tags. I started looking into the XML package documentation, but have not found a simple solution yet. My code so far:

# load packages
library(XML)

# create a new sequence
newSeq <- "TGTCAATGGAACCTG"

# read the xml
secondPartXml <- xmlTreeParse("generateSequences_secondPart.xml")

fbielejec asked Feb 21 '23 06:02


2 Answers

I'd read it with xmlParse and then get the bit I want with XPath expressions. For example on your test data, here's how to get the value of the text in the sequence tag:

x=xmlParse("test.xml")
xmlValue(xpathApply(x,"//sequence")[[1]])
## [1] "\n            \n    ACAGTTGACACCCTT\n"

-- two blank lines, some spaces, then the bases.

To get the text in the taxon tag:

xmlValue(xpathApply(x,"//sequence/taxon")[[1]])
## [1] "\n     "

-- empty, just a blank line.
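A minimal sketch of the same reading step, assuming an inline copy of the data (the `src` string and the variable names are made up for illustration, so nothing depends on a file on disk). Because the raw values carry line breaks and indentation, it trims them with base R's `trimws` and drops any whitespace-only text nodes:

```r
library(XML)

# Hypothetical inline copy of the data, used instead of reading test.xml
src <- '<data>
  <sequence id="ancestralSequence">
    <taxon id="test">Taxon</taxon>
    ACAGTTGACACCCTT
  </sequence>
</data>'
doc <- xmlParse(src, asText = TRUE)

# text() selects the text nodes directly under <sequence>,
# not the text inside <taxon>
seqs <- trimws(xpathSApply(doc, "//sequence/text()", xmlValue))

# discard any text nodes that held only whitespace
seqs <- seqs[nzchar(seqs)]

# attributes can be pulled out with the same XPath approach
ids <- xpathSApply(doc, "//sequence", xmlGetAttr, "id")

print(seqs)
print(ids)
```

Filtering on `nzchar` means the result should not depend on whether the parser keeps or drops the blank text node in front of the taxon tag.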

Now, to replace one string with another you just have to find the "text node", which is a bit of XML with invisible magic round it so that it looks just like text but isn't, and set its value to something.

Suppose you have some data with a couple of sequences in it, and you want to bracket each sequence with CCCCC at the start and GGGGGGG at the end:

<data>
<sequence id = "ancestralSequence"> 
    <taxon id="test">Taxon
     </taxon>       
    ACAGTTGACACCCTT
</sequence>
<sequence id = "someotherSequence"> 
    <taxon id="thing">Taxoff
     </taxon>       
    GGCGGCGCGGGGGGG
</sequence>
</data>

Here comes the code:

# read in to a tree:
x = xmlParse("test.xml")

# this returns a *list* of text nodes under sequence
# and NOT the text nodes under taxon
nodeSet = xpathApply(x,"//sequence/text()")

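# now we loop over the list returned, and get and modify each node's value;
# invisible() stops the modified strings being printed at the console
invisible(sapply(nodeSet, function(G) {
  text = paste("CCCCC", xmlValue(G), "GGGGGGG", sep = "")
  text = gsub("[^A-Z]", "", text)  # strip whitespace and anything that is not A-Z
  xmlValue(G) = text
}))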

Note that this is done by reference, which is unusual in R. After all that, the object x has changed, although we haven't done anything directly to it. The nodes we play with in the loop are references (pointers) to the data stored in the x object.
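To see the by-reference behaviour on its own, here is a small sketch under assumed data: the compact `src` string, the short sequences and the temporary output file are all invented for the example. It changes the text nodes in a loop, never assigns to `doc`, and then reads the values back from `doc`:

```r
library(XML)

# Hypothetical compact data, with no whitespace between the tags
src <- paste0(
  '<data>',
  '<sequence id="a"><taxon id="t1"/>ACGT</sequence>',
  '<sequence id="b"><taxon id="t2"/>GGCC</sequence>',
  '</data>'
)
doc <- xmlParse(src, asText = TRUE)

# each element of nodes is a reference into doc, not a copy
nodes <- getNodeSet(doc, "//sequence/text()")
for (node in nodes) {
  xmlValue(node) <- paste0("CCCCC", xmlValue(node), "GGGGGGG")
}

# doc itself was never assigned to, yet it reflects the edits
vals <- xpathSApply(doc, "//sequence", xmlValue)
print(vals)

# write the modified document out; a temp file is used here as a placeholder
outFile <- tempfile(fileext = ".xml")
saveXML(doc, file = outFile)
```

If the edits worked on copies instead, `vals` would still hold the original ACGT and GGCC strings.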

Anyway, that should do you. Note that 'parsing' doesn't mean replacing at all; it's about how we analyse syntax in an expression, in this case picking out the tags, attributes, and contents of an XML document.


Spacedman answered Feb 27 '23 11:02


You could try to use replaceNodes and either create a new node which may be easier to work with or replace the text.

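# assumes newSeq is defined as in the question, and that doc is the
# question's file read into an internal document with xmlParse
doc <- xmlParse("generateSequences_secondPart.xml")

# option 1: replace the text with a new element node named "new"
# invisible(replaceNodes(doc[["//sequence/text()"]], newXMLNode("new", newSeq)))

# option 2: replace the text only
invisible(replaceNodes(doc[["//sequence/text()"]], newXMLTextNode(newSeq)))
doc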

<?xml version="1.0"?>
<sequence id="ancestralSequence"><taxon id="test">
     </taxon>TGTCAATGGAACCTG</sequence>
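If the file holds more than one sequence element, an XPath predicate on the id attribute can pick out the one to change. This is a sketch only, assuming a compact inline version of the two-sequence data from the other answer (the `src` string and the `target` name are illustrative):

```r
library(XML)

newSeq <- "TGTCAATGGAACCTG"

# Hypothetical inline data with two sequences, whitespace removed
src <- paste0(
  '<data>',
  '<sequence id="ancestralSequence"><taxon id="test"/>ACAGTTGACACCCTT</sequence>',
  '<sequence id="someotherSequence"><taxon id="thing"/>GGCGGCGCGGGGGGG</sequence>',
  '</data>'
)
doc <- xmlParse(src, asText = TRUE)

# the predicate [@id='...'] restricts the match to one <sequence>
target <- getNodeSet(doc, "//sequence[@id='ancestralSequence']/text()")

# only replace if the XPath actually matched something
if (length(target) > 0) {
  invisible(replaceNodes(target[[1]], newXMLTextNode(newSeq)))
}

result <- xpathSApply(doc, "//sequence", xmlValue)
print(result)
```

Only the first sequence should change; the second keeps its original bases.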

Chris S. answered Feb 27 '23 10:02