
Parsing large xml files (1G+) in node.js

I'm having a tough time finding a Node package that can parse large XML files (1 GB and up). Our back-end server is primarily Node.js, so I'd hate to have to build another service in another language/platform just to parse the XML and write data to a database. Has anyone had success doing this kind of thing in Node? What did you use? I've looked at a bunch of packages like xml-stream, big-xml, etc., and they all have their own problems. Some can't even compile on Mac (and seem outdated and no longer supported). I don't really need to convert the parsed results into JS objects or anything like that; I just need to make sense of the data and then write it to a database.

asked Sep 13 '18 by u84six

People also ask

Is it possible to parse large XML files in Node.js?

Parsing large XML files (more than 500 MB) can be very tedious in Node.js. Many parsers out there do not handle files of that size and simply throw an error, so you need one that can process the document as a stream.

Why can't I stream a large file in Node.js?

As it turns out, although Node.js is streaming the file input and output, in between it is still attempting to hold the entire file contents in memory, which it can't do with a file that size. By default, Node can hold roughly 1.5 GB in memory at one time, but no more.
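To make that concrete, here is a minimal sketch (the file path is a placeholder of mine) of why streaming works where buffering does not:

import { createReadStream } from 'fs'

// fs.readFileSync('/path/to/huge.xml') would try to hold the whole document in
// memory and blow past the heap limit for a 1 GB+ file. Streaming it instead
// keeps memory use flat: each chunk is handled and then discarded.
let bytesSeen = 0
createReadStream('/path/to/huge.xml')
  .on('data', (chunk) => { bytesSeen += chunk.length })
  .on('end', () => console.log(`streamed ${bytesSeen} bytes`))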

What is the use of xml-stream?

xml-stream parses the XML content and outputs it in an array structure. Now for the interesting part: suppose you have a large XML file, as I do, and you want to extract only the information enclosed in a specific XML node. xml-stream provides the 'preserve' and 'collect' functions to do exactly that; see the sketch below.
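The example the snippet refers to did not survive here, so the following is only a rough sketch of how collect and preserve are typically used (the file path and the item/subitem element names are placeholders of mine):

import { createReadStream } from 'fs'
import XmlStream from 'xml-stream'

const xml = new XmlStream(createReadStream('/path/to/huge.xml'))

// collect: gather every repeated <subitem> child into an array on its parent
xml.collect('subitem')
// preserve: keep the full structure and order of <item> instead of flattening it
xml.preserve('item', true)

// fires once per parsed <item>, after its children are available
xml.on('endElement: item', (item) => {
  console.log(item)
})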

Does a SAX XML parser handle large XML files?

Many parsers out there do not handle large XML files and simply throw an error. A SAX XML parser does handle large files, but because of the complexity of wiring up its events to capture a specific XML node and its data, we do not recommend that package either.
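For comparison, this is roughly what the event-driven style of the sax package looks like; you have to track state yourself to know where you are in the document, which is the complexity referred to above (the file path and the Person element are placeholders of mine):

import { createReadStream } from 'fs'
import sax from 'sax'

// strict mode: malformed XML raises an error instead of being silently repaired
const saxStream = sax.createStream(true)
let insidePerson = false

saxStream.on('opentag', (node) => {
  if (node.name === 'Person') insidePerson = true
})
saxStream.on('text', (text) => {
  // the parser only hands you raw text; you must remember which element it belongs to
  if (insidePerson) process.stdout.write(text)
})
saxStream.on('closetag', (name) => {
  if (name === 'Person') insidePerson = false
})

createReadStream('/path/to/huge.xml').pipe(saxStream)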


1 Answer

The most obvious, but not very helpful, answer is that it depends on the requirements.

In your case, however, it seems pretty straightforward: you need to load large chunks of data, which may or may not fit into memory, do some simple processing, and then write the results to the database. I think that alone is a good reason to externalise the CPU work into separate processes. So it probably makes more sense to first focus on which XML parser does the job for you, and only then on which Node wrapper you want to use for it.

Obviously, any parser that requires the entire document to be loaded into memory before processing is not a valid option. You will need to use streams, and parsers that support that kind of sequential processing.

This leaves you with a few options:

  • Libxml
  • Expat
  • Saxon

Saxon seems to have the highest level of conformance with the recent W3C specs, so if schema validation and the like is important, it might be a good candidate. Otherwise, both Libxml and Expat seem to stack up pretty well performance-wise and come preinstalled on most operating systems.

There are Node wrappers available for all of these:

  • libxmljs – Libxml
  • xml-stream – Expat
  • node-expat – Expat
  • saxon-node – Saxon

My Node implementation would look something like this:

import XmlStream from 'xml-stream'
import { get } from 'http'
import { createWriteStream } from 'fs'

const databaseWriteStream = createWriteStream('/path/to/file.csv')

// http.get hands us the response as a readable stream once the headers arrive
get('http://external.path/to/xml', (xmlFileReadStream) => {
  const xmlParseStream = new XmlStream(xmlFileReadStream)

  // fires each time a closing </Person> tag has been parsed
  xmlParseStream.on('endElement: Person', ({ name, phone, age }) =>
    databaseWriteStream.write(`"${name}","${phone}","${age}"\n`))

  xmlParseStream.on('end', () => databaseWriteStream.end())
})

Of course I have no idea what your database write stream would look like, so here I am just writing it to a file.
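If the target were an actual database, one way to keep the same shape is to wrap the inserts in an object-mode Writable and pass the parsed objects to it instead of CSV lines. This is only a sketch; db.insert is a hypothetical stand-in for whatever client you actually use:

import { Writable } from 'stream'

// Hypothetical replacement for createWriteStream('/path/to/file.csv') above.
const databaseWriteStream = new Writable({
  objectMode: true,
  write(person, _encoding, callback) {
    // db.insert('people', person)           // hypothetical client call
    //   .then(() => callback(), callback)   // propagate back-pressure and errors
    callback()
  }
})

databaseWriteStream.write({ name: 'Ada', phone: '555-0100', age: 36 })
databaseWriteStream.end()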

answered Oct 11 '22 by unitario