
How to replace the text inside an XML element in R?

Tags: replace, r, xml

I have an input XML file:

cat sample.xml

<Text>
    &lt;p&gt;ABC &lt;/p&gt;
</Text>

R script

library(XML)
doc = xmlTreeParse("sample.xml", useInternal = TRUE)
top<-xmlRoot(doc)

sub("&lt;","<",top[[1]])

How can I fix the above problem?

Error Message: Error in as.vector(x, "character") : cannot coerce type 'externalptr' to vector of type 'character'

Edit: The aim is to use the readHTMLTable() function on a particular node in the XML that contains an HTML table. The table is stored with XML markup (&gt; and &lt;) in place of > and <, which needs to be replaced first, as readHTMLTable() cannot handle the XML markup.

Asked by Manish, Jan 31 '13 08:01


2 Answers

And now the answer to your real question:

sample.xml with encoded table:

<Text>
&lt;table&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;2&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2&lt;/td&gt;&lt;td&gt;8&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4&lt;/td&gt;&lt;td&gt;32&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
</Text>

Read it in:

> library(XML)
> doc = xmlTreeParse("sample.xml", useInternal = TRUE)
> top<-xmlRoot(doc)

Convert to text:

> table=xmlValue(top)
> table
[1] "\n<table>\n<tr><td>1</td><td>2</td></tr>\n<tr><td>2</td><td>8</td></tr>\n<tr><td>4</td><td>32</td></tr>\n</table>\n"

This is now ready to feed to readHTMLTable. No string conversion needed:

> readHTMLTable(table)
$`NULL`
  V1 V2
1  1  2
2  2  8
3  4 32
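
Passing the string directly relies on the parser recognising it as markup rather than as a file name. If that ever looks fragile, one option is to parse the text explicitly first. This is only a minimal sketch, assuming htmlParse() from the XML package with asText = TRUE; the inline string stands in for xmlValue(top), and the names html_text, html_doc, tables and tbl are purely illustrative:

```r
library(XML)

# Stand-in for the string returned by xmlValue(top) above
html_text <- paste0(
  "<table>",
  "<tr><td>1</td><td>2</td></tr>",
  "<tr><td>2</td><td>8</td></tr>",
  "<tr><td>4</td><td>32</td></tr>",
  "</table>"
)

# asText = TRUE states that the input is markup, not a file name or URL
html_doc <- htmlParse(html_text, asText = TRUE)

# readHTMLTable() also accepts an already parsed HTML document
# and returns a list with one data frame per table found
tables <- readHTMLTable(html_doc)
tbl <- tables[[1]]
```

Here tbl should hold the same three rows and two columns as the output above.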

Howzat?

Answered by Spacedman, Nov 14 '22 23:11


If your question is how to replace a string in the content of an XML node, then you can check the following code, which uses the sample.xml file you provided:

## Load the XML package
library(XML)
## Parse the XML file
doc <- xmlTreeParse("sample.xml", useInternal = TRUE)
## Select the nodes we want to update
nodes <- getNodeSet(doc, "//Text")
## For each node, apply gsub on the content of the node
## (invisible() only suppresses printing of lapply's return value)
invisible(lapply(nodes, function(n) {
  xmlValue(n) <- gsub("ABC","foobar",xmlValue(n))
}))

Which will give you:

R> doc
<?xml version="1.0"?>
<Text>
    &lt;p&gt;foobar &lt;/p&gt;
</Text>

Here you can see that "ABC" has been replaced by "foobar".

But if you try this code with the substitution you want to achieve (replacing "&lt;" with "<"), it apparently won't work:

doc <- xmlTreeParse("sample.xml", useInternal = TRUE)
nodes <- getNodeSet(doc, "//Text")
lapply(nodes, function(n) {
  xmlValue(n) <- gsub("&lt;","<",xmlValue(n))
})

will give you:

R> doc
<?xml version="1.0"?>
<Text>
    &lt;p&gt;ABC &lt;/p&gt;
</Text>

Why? If you are working with XML files, you should know that some characters, mainly < and & (and, in some contexts, >, " and '), are reserved because they are part of the basic XML syntax. As such, < and & cannot appear literally in the content of a node, otherwise parsing would fail. So they are replaced by entities, which are a way of encoding these characters. For example, "<" is encoded as "&lt;", "&" is encoded as "&amp;", etc.

So here, the content of your node contains a "<" character, which is written as its entity "&lt;" whenever the document is printed or saved. In the parsed document, xmlValue() already returns the decoded text with a literal "<", so gsub("&lt;", "<", ...) finds nothing to replace. And even if you put a "<" into the text yourself, the XML package would immediately escape it as "&lt;" again on output, because it is the text content of a node.
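
The difference between the text held in memory and the serialized form can be checked directly. This is a minimal sketch, assuming xmlParse() with asText = TRUE and an inline copy of the sample with the whitespace removed; the names in_memory and serialized are only illustrative:

```r
library(XML)

# Inline stand-in for sample.xml, without the surrounding whitespace
doc <- xmlParse("<Text>&lt;p&gt;ABC &lt;/p&gt;</Text>", asText = TRUE)
node <- xmlRoot(doc)

# Text as held in the parsed document: entities are already decoded
in_memory <- xmlValue(node)

# Text as written out again: reserved characters are escaped
serialized <- saveXML(node)

# There is no "&lt;" left in the in-memory value for gsub() to match
grepl("&lt;", in_memory, fixed = TRUE)
```

The entity only exists in the serialized string, which is why substituting on xmlValue() changes nothing.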

So, if what you want to achieve is to convert your string "&lt;p&gt;ABC &lt;/p&gt;" to a new XML node "<p>ABC </p>", you can't do it that way. A solution would be to parse your text string, detect the name of the node (here, "p") from it, create a new node with xmlNode(), give it the text content "ABC" and replace the string with the node you just created.
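
That idea can be sketched roughly as follows. This is an illustration under assumptions rather than a definitive implementation: it uses newXMLNode() instead of xmlNode() because the document is parsed into internal nodes, it builds a fresh <Text> element instead of editing the original in place, and it assumes the escaped text contains exactly one well-formed element. The names fragment and new_text are hypothetical:

```r
library(XML)

# Inline stand-in for sample.xml
doc <- xmlParse("<Text>&lt;p&gt;ABC &lt;/p&gt;</Text>", asText = TRUE)
node <- xmlRoot(doc)

# The decoded text content is itself markup, so parse it a second time
fragment <- xmlRoot(xmlParse(trimws(xmlValue(node)), asText = TRUE))

# Build a new <Text> element holding a real child node with the
# detected name ("p") and text content ("ABC ")
new_text <- newXMLNode(xmlName(node))
newXMLNode(xmlName(fragment), xmlValue(fragment), parent = new_text)

# Serialize the rebuilt element to inspect it
result <- saveXML(new_text)
```

Building a separate element sidesteps moving nodes between documents; a text string with several elements or nested markup would need more care than this.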

Another quick and dirty way to do it would be to first replace all the entities in your file without parsing the XML. Something like this:

## Read and write by file name so that no connection is left open
txt <- readLines("sample.xml")
txt <- gsub("&lt;", "<", txt)
txt <- gsub("&gt;", ">", txt)
writeLines(txt, "sample2.xml")
doc2 <- xmlTreeParse("sample2.xml", useInternal = TRUE)

Which gives:

R> doc2
<?xml version="1.0"?>
<Text>
  <p>ABC </p>
</Text>

But this is dangerous, because if there is a "real" "&lt;" entity in your file (one that stands for a literal "<" in the text rather than for the start of a tag), parsing will fail.
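
To make that failure concrete, here is a small sketch with a made-up input in which "&lt;" stands for a genuine less-than sign. The strings and the names original, unescaped and parsed_ok are illustrative assumptions, not part of the original sample:

```r
library(XML)

# Hypothetical content where "&lt;" is a real less-than sign, not a tag
original <- "<Text>1 &lt; 2</Text>"

# The blanket substitution turns it into malformed XML
unescaped <- gsub("&lt;", "<", original, fixed = TRUE)

# Try to parse the result and record whether it succeeded
parsed_ok <- tryCatch({
  xmlParse(unescaped, asText = TRUE)
  TRUE
}, error = function(e) FALSE)
```

With this input parsed_ok should come out FALSE, since "1 < 2" is no longer well-formed XML.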

Answered by juba, Nov 14 '22 23:11