
How to read an XML input file, manipulate some nodes (remove and rename some) and write the output to a new XML output file?

Tags:

r

xml

I need to read an XML file from the internet and reshape it. Here is the XML file and the code I have so far.

library(XML)
url <- 'http://ClinicalTrials.gov/show/NCT00001400?displayxml=true'
# useInternalNodes = TRUE keeps the tree as internal nodes that can be modified in place
doc <- xmlParse(url, useInternalNodes = TRUE)

I was able to use some functions from the XML package successfully (e.g., getNodeSet), but I am not an expert. There are some examples on the internet, but I was not able to crack this problem myself. I also know some XPath, but I last used it 4 years ago, and I am not an expert on sapply and similar functions.

But my goal is this:

  1. I need to remove a whole set of XML child branches about location, for example: <location> ... anything </location>. There can be multiple nodes with location data. I simply don't need that detail in the output. The XML file above always conforms to an XSD schema. The root node is called <clinical_study>.

  2. The resulting simplified file should be written to a new XML file called "data-changed.xml".

  3. I also need to rename and move one branch from its old nested location:

    <eligibility> <criteria> <textblock> Inclusion criteria are xyz </textblock> ...

  4. In the new output ("data-changed.xml"), that content should sit in a differently named XML node placed directly under the root node:

    <eligibility_criteria> Inclusion criteria are xyz </eligibility_criteria>

So I need to:

  • read the XML into memory
  • manipulate the tree (prune it somewhere)
  • move some XML nodes to a new place and under a new name and
  • write the resulting XML output file.

Any ideas are greatly appreciated.

Also, if you know of a good (recent!) tutorial on XML parsing in R, or a book chapter that tackles it, please share the reference. (I read the vignettes by Duncan, but they are too advanced and too concise for me.)

userJT asked Dec 22 '22 04:12


1 Answer

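Code to remove all location nodes:

# r is the root node, <clinical_study>
r <- xmlRoot(doc)
# names(r) gives the names of the root's children; remove every child called "location"
removeNodes(r[names(r) == "location"])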
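
The code above covers only the pruning step. A minimal sketch of the remaining steps (moving the criteria text into a new <eligibility_criteria> node under the root and writing the file) might look like the following. It uses a small made-up XML string in place of the real ClinicalTrials.gov record so that it runs offline, and it assumes the text lives at /clinical_study/eligibility/criteria/textblock as described in the question. Removing the old <criteria> branch afterwards is also an assumption about what "move" should mean here.

```r
library(XML)

# Made-up stand-in for the real record; only the shape matters here
xml_text <- '<clinical_study>
  <brief_title>Example study</brief_title>
  <eligibility>
    <criteria>
      <textblock>Inclusion criteria are xyz</textblock>
    </criteria>
  </eligibility>
  <location><facility>Site A</facility></location>
  <location><facility>Site B</facility></location>
</clinical_study>'

# Read the XML into memory as an internal (modifiable) tree
doc <- xmlParse(xml_text, asText = TRUE)
r <- xmlRoot(doc)

# Prune: remove every <location> child of the root
removeNodes(r[names(r) == "location"])

# Move and rename: pull the text out of the nested <textblock> ...
txt <- xpathSApply(doc, "/clinical_study/eligibility/criteria/textblock", xmlValue)
if (length(txt) > 0) {
  # ... create <eligibility_criteria> directly under the root ...
  newXMLNode("eligibility_criteria", txt[1], parent = r)
  # ... and drop the old nested branch (assumption: it is no longer wanted)
  removeNodes(getNodeSet(doc, "/clinical_study/eligibility/criteria"))
}

# Write the modified tree to the output file named in the question
out_file <- "data-changed.xml"
saveXML(doc, file = out_file)
```

For the real record, swapping the xmlParse(xml_text, asText = TRUE) call for the xmlParse(url) call from the question should be the main change needed, provided the URL still serves XML in the same layout.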
Angie Lambarri answered Dec 24 '22 01:12