I want to read data from a large XML file (20 GB) and manipulate it. I tried to use xmlParse(), but it ran out of memory before the file finished loading. Is there a more efficient way to do this?
My data dump looks like this:
<tags>
<row Id="106929" TagName="moto-360" Count="1"/>
<row Id="106930" TagName="n1ql" Count="1"/>
<row Id="106931" TagName="fable" Count="1" ExcerptPostId="25824355" WikiPostId="25824354"/>
<row Id="106932" TagName="deeplearning4j" Count="1"/>
<row Id="106933" TagName="pystache" Count="1"/>
<row Id="106934" TagName="jitter" Count="1"/>
<row Id="106935" TagName="klein-mvc" Count="1"/>
</tags>
In the XML package, the xmlEventParse function implements SAX parsing: it reads the XML as a stream and calls your handler functions as elements go past, so the whole document never has to fit in memory. If your XML is simple enough (repeating elements inside one root element), you can use the branches parameter to define a function for each element of interest.
Example:
library(XML)

MedlineCitation <- function(x, ...) {
  # This is a "branch" function.
  # x is a complete XML node: everything inside one <MedlineCitation> element.
  # Find the <ArticleTitle> element inside it and print its text:
  ns <- getNodeSet(x, path = "//ArticleTitle")
  value <- xmlValue(ns[[1]])
  print(value)
}
Then call the parser:
xmlEventParse(
  file = "http://www.nlm.nih.gov/databases/dtd/medsamp2015.xml",
  handlers = NULL,
  branches = list(MedlineCitation = MedlineCitation)
)
Or, as in Martin Morgan's answer to Storing specific XML node values with R's xmlEventParse, wrap the branch function in a closure so it can accumulate values in an environment:
branchFunction <- function() {
  store <- new.env()
  func <- function(x, ...) {
    ns <- getNodeSet(x, path = "//ArticleTitle")
    value <- xmlValue(ns[[1]])
    print(value)
    # to keep a value, store it in the enclosing environment:
    # store[[some_key]] <- some_value
  }
  getStore <- function() { as.list(store) }
  list(MedlineCitation = func, getStore = getStore)
}
myfunctions <- branchFunction()

xmlEventParse(
  file = "medsamp2015.xml",
  handlers = NULL,
  branches = myfunctions
)
# to see what was collected
myfunctions$getStore()
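For the <tags> dump in your question, each <row> is an empty element that carries all of its data in attributes, so you don't even need branches: a plain startElement handler is enough. Here is a minimal sketch, assuming the file is called Tags.xml and every <row> has Id, TagName and Count attributes (tagHandler and getStore are illustrative names, not part of the XML package):
tagHandler <- function() {
  store <- new.env()
  n <- 0L
  startElement <- function(name, attrs, ...) {
    if (name == "row") {
      n <<- n + 1L
      # attrs is a named character vector of the element's attributes;
      # keep only the columns we care about (extras like ExcerptPostId are ignored)
      store[[as.character(n)]] <- attrs[c("Id", "TagName", "Count")]
    }
  }
  getStore <- function() {
    rows <- as.list(store)
    rows <- rows[order(as.integer(names(rows)))]  # environments are unordered
    as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  }
  list(startElement = startElement, getStore = getStore)
}

h <- tagHandler()
xmlEventParse(file = "Tags.xml", handlers = list(startElement = h$startElement))
tags <- h$getStore()
head(tags)
For a 20 GB file you would probably want to flush the collected rows to disk in chunks rather than accumulate all of them in memory, but the streaming pattern is the same.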