I'm having a tough time finding a Node package that can parse large XML files that are 1GB+ in size. Our back-end server is primarily Node.js, so I'd hate to have to build another service in another language/platform just to parse the XML and write data to a database. Has anyone had success doing this kind of thing in Node? What did you use? I've looked at a bunch of packages like xml-stream, big-xml, etc., and they all have their own problems. Some can't even compile on Mac (and seem outdated and no longer supported). I don't really need to convert the parsed results into JS objects or anything like that. I just need to make sense of the data and then write it to a database.
Parsing large XML files (more than 500MB) can be very tedious in Node.js. Many parsers out there do not handle large XML files at all and simply throw an error. The SAX XML parser does handle large files, but because of the complexity of wiring up events to capture a specific XML node and its data, we do not recommend that package either.
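To illustrate that complexity, here is a minimal sketch of capturing a single element with the sax package; the element name phone and the file path are placeholders, and the point is how much manual state tracking is left to you:

import { createReadStream } from 'fs'
import * as sax from 'sax'

// strict mode; sax only emits low-level events and leaves all bookkeeping to you
const saxStream = sax.createStream(true)

let insidePhone = false
let currentPhone = ''

saxStream.on('opentag', (node) => {
  if (node.name === 'phone') insidePhone = true
})
saxStream.on('text', (text) => {
  if (insidePhone) currentPhone += text
})
saxStream.on('closetag', (name) => {
  if (name === 'phone') {
    console.log('found phone:', currentPhone.trim())
    insidePhone = false
    currentPhone = ''
  }
})

createReadStream('/path/to/large.xml').pipe(saxStream)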
As it turns out, although Node.js streams the file input and output, in between it is still attempting to hold the entire file contents in memory, which it can't do with a file that size. Node's V8 heap can hold roughly 1.5GB by default, but no more.
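To make that concrete, here is a minimal sketch (with a placeholder file path) of the difference between buffering the whole file and streaming it:

import { readFile, createReadStream } from 'fs'

// Buffers the whole document in memory; exhausts the heap on multi-gigabyte files
readFile('/path/to/large.xml', (err, data) => { /* ... */ })

// Reads the file in chunks, so memory use stays roughly constant regardless of file size
const stream = createReadStream('/path/to/large.xml')
stream.on('data', (chunk) => { /* hand each chunk to a streaming parser */ })
stream.on('end', () => { /* done */ })

The V8 heap limit can be raised with node --max-old-space-size=<MB>, but for multi-gigabyte XML that only postpones the problem; streaming is the real fix.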
xml-stream parses the XML content and outputs it in an array-like structure. Here comes the interesting part: suppose you have a large XML file, like I do, and you want to extract only the information enclosed in a specific XML node. xml-stream provides 'preserve' and 'collect' functions to do exactly that; see the sketch below. This code was tested on Ubuntu 14.04.
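Since the referenced examples did not survive here, this is a minimal sketch of 'collect' and 'preserve', assuming a <Person> element with repeated <phone> children and a nested <address> (all element names are placeholders):

import { createReadStream } from 'fs'
import XmlStream from 'xml-stream'   // or: const XmlStream = require('xml-stream')

const xml = new XmlStream(createReadStream('/path/to/large.xml'))

// collect: gather repeated <phone> children into an array instead of keeping only the last one
xml.collect('phone')
// preserve: keep the child structure of <address> instead of flattening it into plain text
xml.preserve('address')

xml.on('endElement: Person', (person) => {
  // person.phone is now an array; person.address keeps its nested structure
  console.log(person.name, person.phone)
})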
The most obvious, but not very helpful, answer is that it depends on the requirements.
In your case, however, it seems pretty straightforward: you need to load large chunks of data that may or may not fit into memory, do some simple processing, and then write the results to the database. I think that alone is a good reason to externalise the CPU-heavy work to a separate process. So it would probably make more sense to first focus on which XML parser does the job for you rather than on which Node wrapper you want to use for it.
Obviously, any parser that requires the entire document to be loaded into memory before processing is not a valid option. You will need to use streams for this, and parsers that support that kind of sequential processing.
This leaves you with a few options: Saxon, Libxml and Expat. Saxon seems to have the highest level of conformance to the recent W3C specs, so if schema validation and such is important, it might be a good candidate. Otherwise, both Libxml and Expat stack up pretty well performance-wise and come preinstalled on most operating systems.
There are Node wrappers available for all of these, for example libxmljs for Libxml and node-expat for Expat.
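As a rough sketch of one of those wrappers, here is what feeding a read stream into node-expat could look like; the file path and element handling are placeholders, and it assumes node-expat's chunked write() API:

import { createReadStream } from 'fs'
import * as expat from 'node-expat'

const parser = new expat.Parser('UTF-8')

parser.on('startElement', (name, attrs) => {
  // react to opening tags, e.g. name === 'Person'
})
parser.on('text', (text) => {
  // character data between tags arrives here, possibly in several chunks
})
parser.on('endElement', (name) => {
  // a complete element has been read
})
parser.on('error', (err) => console.error('parse error:', err))

// feed the file to the parser chunk by chunk so nothing is buffered in full
createReadStream('/path/to/large.xml')
  .on('data', (chunk) => parser.write(chunk))
  .on('end', () => { /* all chunks written; the handlers above have seen every element */ })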
My Node implementation would look something like this:
import XmlStream from 'xml-stream'   // assumes esModuleInterop; otherwise: const XmlStream = require('xml-stream')
import { get } from 'http'
import { createWriteStream } from 'fs'

const databaseWriteStream = createWriteStream('/path/to/file.csv')

// http.get hands us the response, which is the readable stream xml-stream needs
get('http://external.path/to/xml', (response) => {
  const xmlParser = new XmlStream(response)
  xmlParser.on('endElement: Person', ({ name, phone, age }) =>
    databaseWriteStream.write(`"${name}","${phone}","${age}"\n`))
  xmlParser.on('end', () => databaseWriteStream.end())
})
Of course I have no idea what your database write stream would look like, so here I am just writing it to a file.
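In case it helps, one hedged sketch of a database-backed write stream: a Writable in object mode that passes each parsed Person to a hypothetical db.insertPerson call, which you would replace with your actual client:

import { Writable } from 'stream'

// Placeholder for whatever database client you actually use
declare const db: { insertPerson(person: unknown): Promise<void> }

const databaseWriteStream = new Writable({
  objectMode: true,   // accept parsed row objects instead of strings/buffers
  write(person, _encoding, callback) {
    db.insertPerson(person)      // hypothetical async insert
      .then(() => callback())    // signal completion so stream backpressure works
      .catch(callback)           // propagate errors to the stream
  }
})

// Usage: push the parsed objects instead of formatting CSV lines
// xmlParser.on('endElement: Person', (person) => databaseWriteStream.write(person))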