I have a huge XML file (40 GB). I would like to extract some fields from it without loading the entire file into memory. Any suggestions?
A quick example with XMLEventReader based on a tutorial for SAXParser here (as posted by Rinat Tainov).
I'm sure it can be done better but just to show basic usage:
import scala.io.Source
import scala.xml.pull._

object Main extends App {
  // XMLEventReader produces events on demand, so the file is never held in memory as a whole.
  val xml = new XMLEventReader(Source.fromFile("test.xml"))

  // currNode is the path of currently open elements, innermost first.
  def printText(text: String, currNode: List[String]): Unit = {
    currNode match {
      case List("firstname", "staff", "company") => println("First Name: " + text)
      case List("lastname", "staff", "company") => println("Last Name: " + text)
      case List("nickname", "staff", "company") => println("Nick Name: " + text)
      case List("salary", "staff", "company") => println("Salary: " + text)
      case _ => ()
    }
  }

  def parse(xml: XMLEventReader): Unit = {
    // Every recursive call is in tail position, so deep recursion is not a concern here.
    def loop(currNode: List[String]): Unit = {
      if (xml.hasNext) {
        xml.next() match {
          case EvElemStart(_, label, _, _) =>
            println("Start element: " + label)
            loop(label :: currNode) // push the opened element
          case EvElemEnd(_, label) =>
            println("End element: " + label)
            loop(currNode.tail) // pop the closed element
          case EvText(text) =>
            printText(text, currNode)
            loop(currNode)
          case _ => loop(currNode) // ignore comments, entity references, etc.
        }
      }
    }
    loop(List.empty)
  }

  parse(xml)
}
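As far as I know, `scala.xml.pull` was deprecated and is no longer shipped in newer scala-xml releases, so the code above may not compile on a current setup. Here is a rough sketch of the same idea using the JDK's own StAX API (`javax.xml.stream`). The names `StaxSketch`, `extract` and `onField` are made up for this example, and the element names are the ones from the code above; treat it as a starting point, not a finished solution.

```scala
import java.io.Reader
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object StaxSketch {
  // Streams the document and calls onField(label, text) for each text node
  // found directly under company/staff/<label>. Nothing is kept in memory
  // except the path of currently open elements.
  def extract(input: Reader)(onField: (String, String) => Unit): Unit = {
    val factory = XMLInputFactory.newInstance()
    // Ask the parser to merge adjacent text chunks into a single event.
    factory.setProperty(XMLInputFactory.IS_COALESCING, java.lang.Boolean.TRUE)
    val xml = factory.createXMLStreamReader(input)
    var path = List.empty[String] // innermost element first
    try {
      while (xml.hasNext) {
        xml.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            path = xml.getLocalName :: path
          case XMLStreamConstants.END_ELEMENT =>
            path = path.tail
          case XMLStreamConstants.CHARACTERS =>
            path match {
              case field :: "staff" :: "company" :: Nil if !xml.isWhiteSpace =>
                onField(field, xml.getText)
              case _ => ()
            }
          case _ => () // comments, processing instructions, etc.
        }
      }
    } finally xml.close()
  }
}
```

For the real file you would pass something like `new java.io.FileReader("test.xml")` (or a buffered reader) and do whatever you need inside `onField` instead of collecting everything.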
Use SAXParser; it does not load the entire XML into memory. Here is a good Java example that can easily be used from Scala.
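To give an idea of what that looks like from Scala, here is a minimal sketch built on the JDK's `javax.xml.parsers.SAXParser`. The class and object names (`StaffHandler`, `SaxSketch`) and the `onField` callback are invented for illustration, and it assumes the fields of interest are direct children of a `staff` element, so adapt it to your document's layout.

```scala
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{Attributes, InputSource}
import org.xml.sax.helpers.DefaultHandler

// Calls onField(name, text) for every wanted element that is a direct child of <staff>.
class StaffHandler(wanted: Set[String], onField: (String, String) => Unit)
    extends DefaultHandler {
  private var path = List.empty[String]           // open elements, innermost first
  private val text = new java.lang.StringBuilder  // text of the current element

  override def startElement(uri: String, localName: String,
                            qName: String, attrs: Attributes): Unit = {
    path = qName :: path
    text.setLength(0)
  }

  // SAX may deliver one text node in several chunks, so accumulate them.
  override def characters(ch: Array[Char], start: Int, length: Int): Unit =
    text.append(ch, start, length)

  override def endElement(uri: String, localName: String, qName: String): Unit = {
    path match {
      case field :: "staff" :: _ if wanted(field) => onField(field, text.toString.trim)
      case _ => ()
    }
    path = path.tail
    text.setLength(0)
  }
}

object SaxSketch {
  // The parser pushes events to the handler as it reads; no tree is built.
  def parse(source: InputSource, handler: DefaultHandler): Unit =
    SAXParserFactory.newInstance().newSAXParser().parse(source, handler)
}
```

For a file on disk, `SaxSketch.parse(new InputSource(new java.io.FileInputStream("test.xml")), new StaffHandler(Set("firstname", "salary"), (k, v) => println(k + ": " + v)))` should be enough to try it out.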