Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read large (~20 GB) xml file in R?

Tags:

r

large-data

I want to read data from large xml file (20 GB) and manipulate them. I tired to use "xmlParse()" but it gave me memory issue before loading. Is there any efficient way to do this?

My data dump looks like this,

<tags>                                                                                                    
    <row Id="106929" TagName="moto-360" Count="1"/>
    <row Id="106930" TagName="n1ql" Count="1"/>
    <row Id="106931" TagName="fable" Count="1" ExcerptPostId="25824355" WikiPostId="25824354"/>
    <row Id="106932" TagName="deeplearning4j" Count="1"/>
    <row Id="106933" TagName="pystache" Count="1"/>
    <row Id="106934" TagName="jitter" Count="1"/>
    <row Id="106935" TagName="klein-mvc" Count="1"/>
</tags>
like image 835
Karthick Avatar asked Feb 23 '15 06:02

Karthick


1 Answers

In XML package the xmlEventParse function implements SAX (reading XML and calling your function handlers). If your XML is simple enough (repeating elements inside one root element), you can use branches parameter to define function(s) for every element.

Example:

MedlineCitation = function(x, ...) {
  #This is a "branch" function
  #x is a XML node - everything inside element <MedlineCitation>
  # find element <ArticleTitle> inside and print it:
  ns <- getNodeSet(x,path = "//ArticleTitle")
  value <- xmlValue(ns[[1]])
  print(value)
}

Call XML parsing:

xmlEventParse(
  file = "http://www.nlm.nih.gov/databases/dtd/medsamp2015.xml", 
  handlers = NULL, 
  branches = list(MedlineCitation = MedlineCitation)
)

Solution with closure:

Like in Martin Morgan, Storing-specific-xml-node-values-with-rs-xmleventparse:

branchFunction <- function() {
  store <- new.env() 
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//ArticleTitle")
    value <- xmlValue(ns[[1]])
    print(value)
    # if storing something ... 
    # store[[some_key]] <- some_value
  }
  getStore <- function() { as.list(store) }
  list(MedlineCitation = func, getStore=getStore)
}

myfunctions <- branchFunction()

xmlEventParse(
  file = "medsamp2015.xml", 
  handlers = NULL, 
  branches = myfunctions
)

#to see what is inside
myfunctions$getStore()
like image 180
bergant Avatar answered Sep 29 '22 19:09

bergant