Parsing a very large XML file lazily

I have a huge XML file (40 GB). I would like to extract some fields from it without loading the entire file into memory. Any suggestions?

Harshal Pandya asked Nov 01 '12 19:11

2 Answers

A quick example with XMLEventReader based on a tutorial for SAXParser here (as posted by Rinat Tainov).

I'm sure it can be done better but just to show basic usage:

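import scala.annotation.tailrec
import scala.io.Source
import scala.xml.pull._ // needs scala-xml 1.x; the pull package was removed in scala-xml 2.x

object Main extends App {
  // The reader pulls events from the source on demand, so the file is never fully loaded.
  val xml = new XMLEventReader(Source.fromFile("test.xml"))

  // currNode is the path to the current element, innermost label first.
  def printText(text: String, currNode: List[String]): Unit = {
    currNode match {
      case List("firstname", "staff", "company") => println("First Name: " + text)
      case List("lastname", "staff", "company") => println("Last Name: " + text)
      case List("nickname", "staff", "company") => println("Nick Name: " + text)
      case List("salary", "staff", "company") => println("Salary: " + text)
      case _ => ()
    }
  }

  def parse(xml: XMLEventReader): Unit = {
    // Tail-recursive so that a very long event stream does not grow the stack.
    @tailrec
    def loop(currNode: List[String]): Unit = {
      if (xml.hasNext) {
        xml.next match {
          case EvElemStart(_, label, _, _) =>
            println("Start element: " + label)
            loop(label :: currNode) // descend: push the label onto the path
          case EvElemEnd(_, label) =>
            println("End element: " + label)
            loop(currNode.tail) // ascend: pop the closed element off the path
          case EvText(text) =>
            printText(text, currNode)
            loop(currNode)
          case _ => loop(currNode)
        }
      }
    }
    loop(List.empty)
  }

  parse(xml)
}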

Arjan answered Sep 29 '22 20:09


Use SAXParser; it does not load the entire XML document into memory. Here is a good Java example that can easily be used from Scala.
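To give an idea of what that looks like from Scala, here is a minimal sketch using the JDK's built-in SAX parser. The object name `SaxFields`, the `extract` helper and its `onField` callback are made up for illustration, and the element names in the usage note are only assumed from the other answer's sample file. It also assumes the wanted elements are leaf elements containing plain text.

```scala
import java.io.InputStream
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.Attributes
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper, not part of any library.
object SaxFields {
  // Streams the document and calls onField(elementName, text) for every
  // element whose name is in `wanted`. Only the text of the element
  // currently being read is buffered, never the whole document.
  def extract(in: InputStream, wanted: Set[String])(onField: (String, String) => Unit): Unit = {
    val handler = new DefaultHandler {
      private val text = new java.lang.StringBuilder

      override def startElement(uri: String, localName: String, qName: String, attrs: Attributes): Unit =
        text.setLength(0) // start collecting text for the new element

      // SAX may deliver one text node in several chunks, so append rather than assign.
      override def characters(ch: Array[Char], start: Int, length: Int): Unit =
        text.append(ch, start, length)

      override def endElement(uri: String, localName: String, qName: String): Unit = {
        if (wanted(qName)) onField(qName, text.toString.trim)
        text.setLength(0)
      }
    }
    SAXParserFactory.newInstance().newSAXParser().parse(in, handler)
  }
}
```

It could be called along the lines of `SaxFields.extract(new java.io.FileInputStream("test.xml"), Set("firstname", "salary")) { (name, value) => println(name + ": " + value) }`. Passing a callback instead of returning a list keeps memory use flat even if the file contains millions of matching elements.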

Rinat Tainov answered Sep 29 '22 19:09
