Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

xml_find_all function from xml2 package (R) does not find relevant nodes

Tags:

r

xml

xml2

I am using the xml2 package in R to access xml data, and found that it behaves different on different xml_documents.

On this pet example

library(xml2)
doc <- read_xml( "<MEMBERS>
                      <CUSTOMER>
                         <ID>178</ID>
                         <FIRST.NAME>Alvaro</FIRST.NAME>
                         <LAST.NAME>Juarez</LAST.NAME>
                         <ADDRESS>123 Park Ave</ADDRESS>
                         <ZIP>57701</ZIP>
                      </CUSTOMER>
                      <CUSTOMER>
                         <ID>934</ID>
                         <FIRST.NAME>Janette</FIRST.NAME>
                         <LAST.NAME>Johnson</LAST.NAME>
                         <ADDRESS>456 Candy Ln</ADDRESS>
                         <ZIP>57701</ZIP>
                      </CUSTOMER>  
                   </MEMBERS>")
doc
{xml_document}
<MEMBERS>
[1] <CUSTOMER>\n  <ID>178</ID>\n  <FIRST.NAME>Alvaro</FIRST.NAME>\n  <LAST.NAME>Juarez</LAST.NAME>\n  <ADDRESS>12 ...
[2] <CUSTOMER>\n  <ID>934</ID>\n  <FIRST.NAME>Janette</FIRST.NAME>\n  <LAST.NAME>Johnson</LAST.NAME>\n  <ADDRESS> ...

I can run the following code

xml_find_all(doc, "//FIRST.NAME")
{xml_nodeset (2)}
[1] <FIRST.NAME>Alvaro</FIRST.NAME>
[2] <FIRST.NAME>Janette</FIRST.NAME>

giving me the expected output (finding all nodes with 'FIRST.NAME' tags).

However, if I perform the same action on this xml file:

example <- read_xml(file.path("~/Downloads", "uniprot_subset.xml"))
> example
{xml_document}
<uniprot>
 [1] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="35">\n  <accession>Q6GZX4</accession>\n  <name>001R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Putative tr ...
 [2] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="36">\n  <accession>Q6GZX3</accession>\n  <name>002L_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...
 [3] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2018-06-20" version="22">\n  <accession>Q197F8</accession>\n  <name>002R_IIV3</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacteri ...
 [4] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2017-09-27" version="18">\n  <accession>Q197F7</accession>\n  <name>003L_IIV3</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacteri ...
 [5] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="31">\n  <accession>Q6GZX2</accession>\n  <name>003R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...
 [6] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2017-09-27" version="29">\n  <accession>Q6GZX1</accession>\n  <name>004R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...
 [7] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2017-09-27" version="24">\n  <accession>Q197F5</accession>\n  <name>005L_IIV3</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacteri ...
 [8] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="38">\n  <accession>Q6GZX0</accession>\n  <name>005R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...
 [9] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2019-01-16" version="44">\n  <accession>Q91G88</accession>\n  <name>006L_IIV6</name>\n  <protein>\n    <recommendedName>\n      <fullName>Putative Kil ...
[10] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2017-09-27" version="27">\n  <accession>Q6GZW9</accession>\n  <name>006R_FRG3G</name>\n  <protein>\n    <recommendedName>\n      <fullName>Uncharacter ...

it behaves differently

xml_find_all(example, "//accession")
{xml_nodeset (0)}

Basically, it will not find any nodes with the 'accession' tag, even though they exist and can be accessed by different functions, for instance using

xml_children(xml_children(example)[1])[1]
{xml_nodeset (1)}
[1] <accession>Q6GZX4</accession>

Can anyone tell me why the xml_find_all function does not find any nodes in the latter example?

like image 995
dertomtom Avatar asked Apr 17 '19 12:04

dertomtom


1 Answers

This happens because your pet example does not contain namespaces, but the second XML file does.

example %>% xml_ns()

d1  <-> http://uniprot.org/uniprot
d2  <-> http://uniprot.org/uniprot
d3  <-> http://uniprot.org/uniprot
d4  <-> http://uniprot.org/uniprot
d5  <-> http://uniprot.org/uniprot
d6  <-> http://uniprot.org/uniprot
d7  <-> http://uniprot.org/uniprot
d8  <-> http://uniprot.org/uniprot
d9  <-> http://uniprot.org/uniprot
d10 <-> http://uniprot.org/uniprot

Since each entry has the same namespace, in this case the simplest approach is probably to strip (remove) the namespaces:

example %>% xml_ns_strip()

And xml_find_all should now work as expected:

example %>% xml_find_all("//accession")

{xml_nodeset (10)}
 [1] <accession>Q6GZX4</accession>
 [2] <accession>Q6GZX3</accession>
 [3] <accession>Q197F8</accession>
 [4] <accession>Q197F7</accession>
 [5] <accession>Q6GZX2</accession>
 [6] <accession>Q6GZX1</accession>
 [7] <accession>Q197F5</accession>
 [8] <accession>Q6GZX0</accession>
 [9] <accession>Q91G88</accession>
[10] <accession>Q6GZW9</accession>

If you wanted to retain the namespaces, you could access accessions like so:

example %>% xml_find_all("//d1:accession")

which works in this case because the default name d1 given to the namespace for the first entry maps to the same namespace for all entries.

like image 167
neilfws Avatar answered Nov 14 '22 23:11

neilfws