I am using the xml2 package in R to access XML data, and found that it behaves differently on different xml_document objects.
For this pet example:
library(xml2)
doc <- read_xml( "<MEMBERS>
<CUSTOMER>
<ID>178</ID>
<FIRST.NAME>Alvaro</FIRST.NAME>
<LAST.NAME>Juarez</LAST.NAME>
<ADDRESS>123 Park Ave</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
<CUSTOMER>
<ID>934</ID>
<FIRST.NAME>Janette</FIRST.NAME>
<LAST.NAME>Johnson</LAST.NAME>
<ADDRESS>456 Candy Ln</ADDRESS>
<ZIP>57701</ZIP>
</CUSTOMER>
</MEMBERS>")
doc
{xml_document}
<MEMBERS>
[1] <CUSTOMER>\n <ID>178</ID>\n <FIRST.NAME>Alvaro</FIRST.NAME>\n <LAST.NAME>Juarez</LAST.NAME>\n <ADDRESS>12 ...
[2] <CUSTOMER>\n <ID>934</ID>\n <FIRST.NAME>Janette</FIRST.NAME>\n <LAST.NAME>Johnson</LAST.NAME>\n <ADDRESS> ...
I can run the following code
xml_find_all(doc, "//FIRST.NAME")
{xml_nodeset (2)}
[1] <FIRST.NAME>Alvaro</FIRST.NAME>
[2] <FIRST.NAME>Janette</FIRST.NAME>
giving me the expected output (finding all nodes with 'FIRST.NAME' tags).
However, if I perform the same action on this xml file:
example <- read_xml(file.path("~/Downloads", "uniprot_subset.xml"))
example
{xml_document}
<uniprot>
[1] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="35">\n <accession>Q6GZX4</accession>\n <name>001R_FRG3G</name>\n <protein>\n <recommendedName>\n <fullName>Putative tr ...
[2] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="36">\n <accession>Q6GZX3</accession>\n <name>002L_FRG3G</name>\n <protein>\n <recommendedName>\n <fullName>Uncharacter ...
[3] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2018-06-20" version="22">\n <accession>Q197F8</accession>\n <name>002R_IIV3</name>\n <protein>\n <recommendedName>\n <fullName>Uncharacteri ...
[4] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2017-09-27" version="18">\n <accession>Q197F7</accession>\n <name>003L_IIV3</name>\n <protein>\n <recommendedName>\n <fullName>Uncharacteri ...
[5] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="31">\n <accession>Q6GZX2</accession>\n <name>003R_FRG3G</name>\n <protein>\n <recommendedName>\n <fullName>Uncharacter ...
[6] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2017-09-27" version="29">\n <accession>Q6GZX1</accession>\n <name>004R_FRG3G</name>\n <protein>\n <recommendedName>\n <fullName>Uncharacter ...
[7] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2017-09-27" version="24">\n <accession>Q197F5</accession>\n <name>005L_IIV3</name>\n <protein>\n <recommendedName>\n <fullName>Uncharacteri ...
[8] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2019-01-16" version="38">\n <accession>Q6GZX0</accession>\n <name>005R_FRG3G</name>\n <protein>\n <recommendedName>\n <fullName>Uncharacter ...
[9] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2009-06-16" modified="2019-01-16" version="44">\n <accession>Q91G88</accession>\n <name>006L_IIV6</name>\n <protein>\n <recommendedName>\n <fullName>Putative Kil ...
[10] <entry xmlns="http://uniprot.org/uniprot" dataset="Swiss-Prot" created="2011-06-28" modified="2017-09-27" version="27">\n <accession>Q6GZW9</accession>\n <name>006R_FRG3G</name>\n <protein>\n <recommendedName>\n <fullName>Uncharacter ...
it behaves differently:
xml_find_all(example, "//accession")
{xml_nodeset (0)}
Basically, it does not find any nodes with the 'accession' tag, even though they exist and can be reached with other functions, for instance:
xml_children(xml_children(example)[1])[1]
{xml_nodeset (1)}
[1] <accession>Q6GZX4</accession>
Can anyone tell me why the xml_find_all function does not find any nodes in the latter example?
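This happens because your pet example does not contain namespaces, but the second XML file does. Each entry element declares a default namespace (the xmlns attribute), and an unprefixed name in an XPath expression, such as //accession, only matches elements that are in no namespace.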
example %>% xml_ns()
d1 <-> http://uniprot.org/uniprot
d2 <-> http://uniprot.org/uniprot
d3 <-> http://uniprot.org/uniprot
d4 <-> http://uniprot.org/uniprot
d5 <-> http://uniprot.org/uniprot
d6 <-> http://uniprot.org/uniprot
d7 <-> http://uniprot.org/uniprot
d8 <-> http://uniprot.org/uniprot
d9 <-> http://uniprot.org/uniprot
d10 <-> http://uniprot.org/uniprot
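The effect can be reproduced without the UniProt file. The following is a minimal sketch, assuming a hypothetical two-entry document (the name `small` and its contents are made up for illustration, modelled on the printed output above):

```r
library(xml2)

# Hypothetical miniature of the UniProt file: each <entry> declares
# the same default namespace, as in the printed output above.
small <- read_xml('<uniprot>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX4</accession></entry>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX3</accession></entry>
</uniprot>')

# Unprefixed XPath names only match elements in no namespace,
# so nothing is found here.
n_plain <- length(xml_find_all(small, "//accession"))

# With the auto-generated prefix d1 (see xml_ns(small)), the nodes are found.
n_prefixed <- length(xml_find_all(small, "//d1:accession"))

print(c(plain = n_plain, prefixed = n_prefixed))
```

If this runs as expected, the unprefixed query returns no nodes while the prefixed one returns both accessions.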
Since each entry has the same namespace, in this case the simplest approach is probably to strip (remove) the namespaces:
example %>% xml_ns_strip()
And xml_find_all should now work as expected:
example %>% xml_find_all("//accession")
{xml_nodeset (10)}
[1] <accession>Q6GZX4</accession>
[2] <accession>Q6GZX3</accession>
[3] <accession>Q197F8</accession>
[4] <accession>Q197F7</accession>
[5] <accession>Q6GZX2</accession>
[6] <accession>Q6GZX1</accession>
[7] <accession>Q197F5</accession>
[8] <accession>Q6GZX0</accession>
[9] <accession>Q91G88</accession>
[10] <accession>Q6GZW9</accession>
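Note that the call to xml_ns_strip() above is not assigned to anything: as far as I understand, xml2 documents are modified in place, so the original object loses its default namespaces. A small sketch of this, again assuming a hypothetical two-entry document named `small`:

```r
library(xml2)

# Hypothetical stand-in for the UniProt file (not the real data).
small <- read_xml('<uniprot>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX4</accession></entry>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX3</accession></entry>
</uniprot>')

before <- length(xml_find_all(small, "//accession"))

# No assignment: the document object itself is changed.
xml_ns_strip(small)

after <- length(xml_find_all(small, "//accession"))

print(c(before = before, after = after))
```

If the original, namespaced document is still needed afterwards, one option is simply to read the file again into a second object before stripping.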
If you wanted to retain the namespaces, you could access accessions like so:
example %>% xml_find_all("//d1:accession")
which works in this case because the default name d1 given to the namespace for the first entry maps to the same namespace for all entries.
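Two other ways to keep the namespaces are sketched below, under the same assumption of a hypothetical two-entry document `small`. The prefix `up` is an arbitrary name chosen for illustration; xml_ns_rename() is the xml2 helper for giving auto-generated prefixes more readable names, and local-name() is a standard XPath 1.0 function that ignores the namespace altogether.

```r
library(xml2)

# Hypothetical stand-in for the UniProt file (not the real data).
small <- read_xml('<uniprot>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX4</accession></entry>
  <entry xmlns="http://uniprot.org/uniprot"><accession>Q6GZX3</accession></entry>
</uniprot>')

# Option 1: rename the auto-generated prefix d1 to something readable
# and pass the namespace map explicitly.
ns <- xml_ns_rename(xml_ns(small), d1 = "up")
by_prefix <- xml_text(xml_find_all(small, "//up:accession", ns))

# Option 2: match on the local element name, whatever its namespace.
by_local <- xml_text(xml_find_all(small, "//*[local-name() = 'accession']"))

print(by_prefix)
print(by_local)
```

Both queries should return the same accessions. The first is more explicit about which namespace is meant; the second is convenient for quick exploration but would also match an accession element from any other namespace.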