
R XML: How to retrieve a node with a given value

Here's a snippet of the XML file I am using:

<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility" />
  <revision>
    <id>381202555</id>
    <parentid>381200179</parentid>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
      <id>7181920</id>
    </contributor>
    <minor />
    <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
    <sha1>lo15ponaybcg2sf49sstw9gdjmdetnk</sha1>
    <model>wikitext</model>
    <format>text/x-wiki</format>
  </revision>
</page>
<page>
  <title>AfghanistanGeography</title>
  <ns>0</ns>
  <id>14</id>
  <redirect title="Geography of Afghanistan" />
  <revision>
    <id>407008307</id>
    <parentid>74466619</parentid>
    <timestamp>2011-01-10T03:56:19Z</timestamp>
    <contributor>
      <username>Graham87</username>
      <id>194203</id>
    </contributor>
    <minor />
    <comment>1 revision from [[:nost:AfghanistanGeography]]: import old edit, see [[User:Graham87/Import]]</comment>
    <text xml:space="preserve">#REDIRECT [[Geography of Afghanistan]] {{R from CamelCase}}</text>
    <sha1>0uwuuhiam59ufbu0uzt9lookwtx9f4r</sha1>
    <model>wikitext</model>
    <format>text/x-wiki</format>
  </revision>
</page>
<page>
  <title>AfghanistanPeople</title>
  <ns>0</ns>
  <id>15</id>
  <redirect title="Demography of Afghanistan" />
  <revision>
    <id>135089040</id>
    <parentid>74466558</parentid>
    <timestamp>2007-06-01T13:59:37Z</timestamp>
    <contributor>
      <username>RussBot</username>
      <id>279219</id>
    </contributor>
    <minor />
    <comment>Robot: Fixing [[Special:DoubleRedirects|double-redirect]] -&quot;Demographics of Afghanistan&quot; +&quot;Demography of Afghanistan&quot;</comment>
    <text xml:space="preserve">#REDIRECT [[Demography of Afghanistan]] {{R from CamelCase}}</text>
    <sha1>744dgrl7ef5p53yffn2a989ly1dyr8f</sha1>
    <model>wikitext</model>
    <format>text/x-wiki</format>
  </revision>
</page>

Now, given the value "AccessibleComputing", how do I retrieve the XMLInternalElementNode that corresponds to 'AccessibleComputing'? I tried using getNodeSet with no success.

Thanks.

Updated question

I should have included the entire sample.xml file in the first place. Here it is; the problem I am facing follows:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd" version="0.8" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.21wmf8</generator>
    <case>first-letter</case>
    <namespaces>
      <namespace key="-2" case="first-letter">Media</namespace>
      <namespace key="-1" case="first-letter">Special</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Talk</namespace>
      <namespace key="2" case="first-letter">User</namespace>
      <namespace key="3" case="first-letter">User talk</namespace>
      <namespace key="4" case="first-letter">Wikipedia</namespace>
      <namespace key="5" case="first-letter">Wikipedia talk</namespace>
      <namespace key="6" case="first-letter">File</namespace>
      <namespace key="7" case="first-letter">File talk</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">MediaWiki talk</namespace>
      <namespace key="10" case="first-letter">Template</namespace>
      <namespace key="11" case="first-letter">Template talk</namespace>
      <namespace key="12" case="first-letter">Help</namespace>
      <namespace key="13" case="first-letter">Help talk</namespace>
      <namespace key="14" case="first-letter">Category</namespace>
      <namespace key="15" case="first-letter">Category talk</namespace>
      <namespace key="100" case="first-letter">Portal</namespace>
      <namespace key="101" case="first-letter">Portal talk</namespace>
      <namespace key="108" case="first-letter">Book</namespace>
      <namespace key="109" case="first-letter">Book talk</namespace>
      <namespace key="446" case="first-letter">Education Program</namespace>
      <namespace key="447" case="first-letter">Education Program talk</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace>
      <namespace key="711" case="first-letter">TimedText talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>AccessibleComputing</title>
    <ns>0</ns>
    <id>10</id>
    <redirect title="Computer accessibility" />
    <revision>
      <id>381202555</id>
      <parentid>381200179</parentid>
      <timestamp>2010-08-26T22:38:36Z</timestamp>
      <contributor>
        <username>OlEnglish</username>
        <id>7181920</id>
      </contributor>
      <minor />
      <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
      <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}</text>
      <sha1>lo15ponaybcg2sf49sstw9gdjmdetnk</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>
  <page>
    <title>History</title>
    <ns>0</ns>
    <id>13</id>
    <redirect title="History of " />
    <revision>
      <id>74466652</id>
      <parentid>15898948</parentid>
      <timestamp>2006-09-08T04:15:52Z</timestamp>
      <contributor>
        <username>Rory096</username>
        <id>750223</id>
      </contributor>
      <comment>cat rd</comment>
      <text xml:space="preserve">#REDIRECT [[History of ]] {{R from CamelCase}}</text>
      <sha1>d4tdz2eojqzamnuockahzcbrgd1t9oi</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>
  <page>
    <title>Geography</title>
    <ns>0</ns>
    <id>14</id>
    <redirect title="Geography of " />
    <revision>
      <id>407008307</id>
      <parentid>74466619</parentid>
      <timestamp>2011-01-10T03:56:19Z</timestamp>
      <contributor>
        <username>Graham87</username>
        <id>194203</id>
      </contributor>
      <minor />
      <comment>1 revision from [[:nost:Geography]]: import old edit, see [[User:Graham87/Import]]</comment>
      <text xml:space="preserve">#REDIRECT [[Geography of ]] {{R from CamelCase}}</text>
      <sha1>0uwuuhiam59ufbu0uzt9lookwtx9f4r</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>
  <page>
    <title>People</title>
    <ns>0</ns>
    <id>15</id>
    <redirect title="Demography of " />
    <revision>
      <id>135089040</id>
      <parentid>74466558</parentid>
      <timestamp>2007-06-01T13:59:37Z</timestamp>
      <contributor>
        <username>RussBot</username>
        <id>279219</id>
      </contributor>
      <minor />
      <comment>Robot: Fixing [[Special:DoubleRedirects|double-redirect]] -&quot;Demographics of &quot; +&quot;Demography of &quot;</comment>
      <text xml:space="preserve">#REDIRECT [[Demography of ]] {{R from CamelCase}}</text>
      <sha1>744dgrl7ef5p53yffn2a989ly1dyr8f</sha1>
      <model>wikitext</model>
      <format>text/x-wiki</format>
    </revision>
  </page>
</mediawiki>

How do I get the page node whose title element has the value "AccessibleComputing"? I tried the following:

library(XML)
doc = xmlTreeParse('sample.xml',useInternalNodes=TRUE)
getNodeSet(doc, "//page[title=\"AccessibleComputing\"]")

It returned:

list()
attr(,"class")
[1] "XMLNodeSet"

Expected output:

[[1]]
<page>
  <title>AccessibleComputing</title>
  <ns>0</ns>
  <id>10</id>
  <redirect title="Computer accessibility"/>
  <revision>
    <id>381202555</id>
    <parentid>381200179</parentid>
    <timestamp>2010-08-26T22:38:36Z</timestamp>
    <contributor>
      <username>OlEnglish</username>
      <id>7181920</id>
    </contributor>
    <minor/>
    <comment>[[Help:Reverting|Reverted]] edits by [[Special:Contributions/76.28.186.133|76.28.186.133]] ([[User talk:76.28.186.133|talk]]) to last version by Gurch</comment>
    <text xml:space="preserve">#REDIRECT [[Computer accessibility]] {{R from CamelCase}}    </text>
    <sha1>lo15ponaybcg2sf49sstw9gdjmdetnk</sha1>
    <model>wikitext</model>
    <format>text/x-wiki</format>
  </revision>
</page> 

attr(,"class")
[1] "XMLNodeSet"

I guess my XPath query is incorrect; perhaps the 'siteinfo' node, which appears only once, breaks what I tried. Any suggestions?

asked Dec 07 '25 09:12 by arun kejariwal


1 Answer

To parse your file, I wrapped the page elements in a new root tag:

<pages>
....
</pages>

Then, using xpathSApply, I can retrieve all the title elements:

library(XML)
doc = xmlTreeParse('c:/temp/testxml.xml',useInternalNodes=TRUE)
xpathSApply(doc,'//page/title',xmlValue)
[1] "AccessibleComputing"  "AfghanistanGeography" "AfghanistanPeople"

You can also use getNodeSet:

getNodeSet(doc,'//page/title')
[[1]]
<title>AccessibleComputing</title> 

[[2]]
<title>AfghanistanGeography</title> 

[[3]]
<title>AfghanistanPeople</title> 
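
To go from a title value to its enclosing page node, an XPath predicate can be added to the same kind of query. The following is a minimal sketch, assuming the wrapped file has no default namespace; the inline XML string is an abbreviated stand-in for the real file, and the variable names are only illustrative.

```r
library(XML)

# Abbreviated stand-in for the wrapped snippet (no default namespace).
xml_text <- '<pages>
  <page><title>AccessibleComputing</title><id>10</id></page>
  <page><title>AfghanistanGeography</title><id>14</id></page>
  <page><title>AfghanistanPeople</title><id>15</id></page>
</pages>'

doc <- xmlParse(xml_text, asText = TRUE)

# The predicate [title='...'] keeps only <page> elements that have a
# <title> child whose string value equals the given text.
hits <- getNodeSet(doc, "//page[title='AccessibleComputing']")

# getNodeSet returns a list of nodes; take the first match.
page <- hits[[1]]

# Read a child of the matched page, here its <id>.
page_id <- xmlValue(page[["id"]])
```

Here `page` is the node itself rather than its text, so its other children (ns, redirect, revision) can be reached from it in the same way.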
answered Dec 09 '25 23:12 by agstudy
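
A note on the full sample.xml from the updated question: the empty result there is most likely caused by the default namespace declared on the root element (xmlns="http://www.mediawiki.org/xml/export-0.8/") rather than by the siteinfo node. In XPath 1.0 an unprefixed name such as page matches only elements in no namespace. The sketch below is one possible way to handle this, not part of the answer above: it binds a prefix to that URI through the namespaces argument of getNodeSet. The prefix "mw" is an arbitrary label chosen for illustration, and the inline XML is an abbreviated stand-in for the real file.

```r
library(XML)

# Abbreviated stand-in for sample.xml: same root element and default namespace.
xml_text <- '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/" version="0.8" xml:lang="en">
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page><title>AccessibleComputing</title><id>10</id></page>
  <page><title>History</title><id>13</id></page>
</mediawiki>'

doc <- xmlParse(xml_text, asText = TRUE)

# Unprefixed query: finds nothing, because the elements live in the
# MediaWiki export namespace.
no_ns <- getNodeSet(doc, "//page[title='AccessibleComputing']")

# Bind a prefix of our choosing to the namespace URI and use it on
# every element name in the path, including inside the predicate.
ns <- c(mw = "http://www.mediawiki.org/xml/export-0.8/")
hits <- getNodeSet(doc, "//mw:page[mw:title='AccessibleComputing']",
                   namespaces = ns)

# Alternative that ignores namespaces by comparing local names only.
hits_local <- getNodeSet(
  doc,
  "//*[local-name()='page'][*[local-name()='title']='AccessibleComputing']"
)
```

The prefixed form is usually easier to read; the local-name() form avoids having to know the namespace URI but will also match same-named elements from any other namespace.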


