
Parsing very large XML lazily

I have a huge XML file (40 GB). I would like to extract some fields from it without loading the entire file into memory. Any suggestions?

asked Nov 01 '12 by Harshal Pandya

2 Answers

A quick example with XMLEventReader, based on the SAXParser tutorial posted by Rinat Tainov.

I'm sure it can be done better, but just to show basic usage:

import scala.io.Source
import scala.xml.pull._

object Main extends App {
  // The pull parser streams events lazily instead of building a DOM in memory
  val xml = new XMLEventReader(Source.fromFile("test.xml"))

  // Print a labelled field when the current element path (innermost first) matches
  def printText(text: String, currNode: List[String]) {
    currNode match {
      case List("firstname", "staff", "company") => println("First Name: " + text)
      case List("lastname", "staff", "company") => println("Last Name: " + text)
      case List("nickname", "staff", "company") => println("Nick Name: " + text)
      case List("salary", "staff", "company") => println("Salary: " + text)
      case _ => ()
    }
  }

  // Walk the event stream, keeping a stack of open element labels as the current path
  def parse(xml: XMLEventReader) {
    def loop(currNode: List[String]) {
      if (xml.hasNext) {
        xml.next match {
          case EvElemStart(_, label, _, _) =>
            println("Start element: " + label)
            loop(label :: currNode)
          case EvElemEnd(_, label) =>
            println("End element: " + label)
            loop(currNode.tail)
          case EvText(text) =>
            printText(text, currNode)
            loop(currNode)
          case _ => loop(currNode)
        }
      }
    }
    loop(List.empty)
  }

  parse(xml)
}
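
For reference, the path matching in printText assumes input shaped roughly like this (the element names match the patterns above; the values are placeholders):

<company>
  <staff>
    <firstname>John</firstname>
    <lastname>Doe</lastname>
    <nickname>JD</nickname>
    <salary>100000</salary>
  </staff>
</company>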
answered Sep 29 '22 by Arjan

Use SAXParser; it will not load the entire XML into memory. Here is a good Java example, which can easily be used from Scala.
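
Below is a minimal sketch of that approach from Scala (not taken from the linked example; the element name "firstname" and the file name "test.xml" are placeholders): a DefaultHandler receives callbacks as the parser streams through the file, so memory use stays roughly constant regardless of file size.

import java.io.File
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

object SaxMain extends App {
  val handler = new DefaultHandler {
    // true while we are inside a <firstname> element
    private var inFirstName = false

    override def startElement(uri: String, localName: String,
                              qName: String, attributes: Attributes): Unit =
      inFirstName = qName == "firstname"

    override def endElement(uri: String, localName: String, qName: String): Unit =
      inFirstName = false

    // characters may be invoked several times per text node; good enough for a demo
    override def characters(ch: Array[Char], start: Int, length: Int): Unit =
      if (inFirstName) println("First Name: " + new String(ch, start, length))
  }

  // The parser streams the file; it never holds the whole document in memory
  SAXParserFactory.newInstance().newSAXParser().parse(new File("test.xml"), handler)
}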

answered Sep 29 '22 by Rinat Tainov