I need to read an XML file from the internet and reshape it. Here is the XML file and the code I have so far.
library(XML)
url='http://ClinicalTrials.gov/show/NCT00001400?displayxml=true'
doc = xmlParse(url, useInternalNodes = TRUE)
I was able to use some functions from the XML package with success (e.g., getNodeSet), but I am not an expert, and although there are some examples on the internet, I was not able to crack this problem myself. I also know some XPath, but that was 4 years ago, and I am not an expert on sapply and similar functions.
But my goal is this:
I need to remove a whole set of XML child branches about location, for example: <location> ... anything </location>. There can be multiple nodes with location data. I simply don't need that detail in the output. The XML file above always conforms to an XSD schema. The root node is called <clinical_study>.
The resulting simplified file should be written to a new XML file called "data-changed.xml".
I also need to rename and move one branch from its old nested location:
<eligibility>
<criteria>
<textblock>
Inclusion criteria are xyz
</textblock>...
In the new output ("data-changed.xml"), the text should sit in a differently named XML node directly under the root node:
<eligibility_criteria>
Inclusion criteria are xyz
</eligibility_criteria>
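So I need to remove the location nodes, move and rename the eligibility criteria branch, and write the result to a new file.
Any ideas are greatly appreciated.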
Also, if you know of a good (recent!) tutorial on XML parsing in R, or a book chapter that tackles it, please share the reference. (I read the vignettes by Duncan, and they are too advanced and too concise for me.)
Code to remove all location nodes:
r <- xmlRoot(doc)
removeNodes(r[names(r) == "location"])
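The other two steps from the question (moving the criteria text under the root with a new name, and writing the file) could be handled along the following lines. This is only a sketch: it parses a small inline XML string whose contents are made up for illustration instead of downloading the real record, and it assumes the text lives at /clinical_study/eligibility/criteria as described in the question. It also uses an XPath query to find location nodes at any depth, which is a slightly different approach from the root-children filter above.

```r
library(XML)

# Hypothetical stand-in for the ClinicalTrials.gov record; values are invented.
xml_text <- '<clinical_study>
  <brief_title>Example study</brief_title>
  <eligibility>
    <criteria>
      <textblock>Inclusion criteria are xyz</textblock>
    </criteria>
    <gender>Both</gender>
  </eligibility>
  <location><facility><name>Site A</name></facility></location>
  <location><facility><name>Site B</name></facility></location>
</clinical_study>'

doc  <- xmlParse(xml_text, asText = TRUE)
root <- xmlRoot(doc)

# 1. Drop every <location> node, wherever it occurs in the tree.
removeNodes(getNodeSet(doc, "//location"))

# 2. Copy the criteria text into a new node directly under the root,
#    then remove the old nested <criteria> branch.
criteria <- getNodeSet(doc, "/clinical_study/eligibility/criteria")
if (length(criteria) > 0) {
  txt <- trimws(xmlValue(criteria[[1]]))
  newXMLNode("eligibility_criteria", txt, parent = root)
  removeNodes(criteria)
}

# 3. Write the modified document to disk.
saveXML(doc, file = "data-changed.xml")
```

For the real file, replacing the inline string with xmlParse(url) should be enough, provided the element names match. The rest of <eligibility> (for example <gender>) is left in place; only the <criteria> branch is moved.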