I'm having a tough time finding a Node package that can parse large XML files that are 1GB+ in size. Our back-end server is primarily Node.js, so I'd hate to have to build another service in another language/platform just to parse the XML and write data to a database. Has anyone had success doing this kind of thing in Node? What did you use? I've looked at a bunch of packages like xml-stream, big-xml, etc., and they all have their own problems. Some can't even compile on Mac (and seem outdated and no longer supported). I don't really need to convert the parsed results into JS objects or anything like that. I just need to make sense of the data and then write it to a database.
Parsing large XML files (more than 500MB) can be very tedious in Node.js. Many parsers out there do not handle large XML files at all and simply throw an error. The SAX XML parser does handle large files, but because of the complexity of wiring up events to capture a specific XML node and its data, we do not recommend that package either.
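To illustrate that complexity, here is a minimal sketch of capturing a single element with the sax package; the element name phone and the file path are placeholders, and the point is how much manual state tracking is left to you:

import { createReadStream } from 'fs'
import * as sax from 'sax'

// strict mode; sax only emits low-level events and leaves all bookkeeping to you
const saxStream = sax.createStream(true)

let insidePhone = false
let currentPhone = ''

saxStream.on('opentag', (node) => {
  if (node.name === 'phone') insidePhone = true
})
saxStream.on('text', (text) => {
  if (insidePhone) currentPhone += text
})
saxStream.on('closetag', (name) => {
  if (name === 'phone') {
    console.log('found phone:', currentPhone.trim())
    insidePhone = false
    currentPhone = ''
  }
})

createReadStream('/path/to/large.xml').pipe(saxStream)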
As it turns out, although Node.js streams the file input and output, in between it is still attempting to hold the entire file contents in memory, which it can't do with a file that size. Node's V8 heap can hold roughly 1.5GB by default, but no more.
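To make that concrete, here is a minimal sketch (with a placeholder file path) of the difference between buffering the whole file and streaming it:

import { readFile, createReadStream } from 'fs'

// Buffers the whole document in memory; exhausts the heap on multi-gigabyte files
readFile('/path/to/large.xml', (err, data) => { /* ... */ })

// Reads the file in chunks, so memory use stays roughly constant regardless of file size
const stream = createReadStream('/path/to/large.xml')
stream.on('data', (chunk) => { /* hand each chunk to a streaming parser */ })
stream.on('end', () => { /* done */ })

The V8 heap limit can be raised with node --max-old-space-size=<MB>, but for multi-gigabyte XML that only postpones the problem; streaming is the real fix.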
xml-stream parses the XML content and outputs it in an array-like structure. Here comes the interesting part: suppose you have a large XML file, like I do, and you want to extract only the information enclosed in a specific XML node. xml-stream provides 'preserve' and 'collect' functions to do exactly that; see the sketch below. This code was tested on Ubuntu 14.04.
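Since the referenced examples did not survive here, this is a minimal sketch of 'collect' and 'preserve', assuming a <Person> element with repeated <phone> children and a nested <address> (all element names are placeholders):

import { createReadStream } from 'fs'
import XmlStream from 'xml-stream'   // or: const XmlStream = require('xml-stream')

const xml = new XmlStream(createReadStream('/path/to/large.xml'))

// collect: gather repeated <phone> children into an array instead of keeping only the last one
xml.collect('phone')
// preserve: keep the child structure of <address> instead of flattening it into plain text
xml.preserve('address')

xml.on('endElement: Person', (person) => {
  // person.phone is now an array; person.address keeps its nested structure
  console.log(person.name, person.phone)
})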
The most obvious, but not very helpful, answer is that it depends on the requirements.
In your case, however, it seems pretty straightforward: you need to load large chunks of data that may or may not fit into memory, do some simple processing, and then write the results to the database. I think that alone is a good reason to externalise the CPU-heavy work to a separate process. So it would probably make more sense to first focus on which XML parser does the job for you rather than on which Node wrapper you want to use for it.
Obviously, any parser that requires the entire document to be loaded into memory before processing is not a valid option. You will need to use streams for this, and parsers that support that kind of sequential processing.
This leaves you with a few options: Saxon, Libxml and Expat. Saxon seems to have the highest level of conformance to the recent W3C specs, so if schema validation and such is important, it might be a good candidate. Otherwise, both Libxml and Expat stack up pretty well performance-wise and come preinstalled on most operating systems.
There are Node wrappers available for all of these, for example libxmljs for Libxml and node-expat for Expat.
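As a rough sketch of one of those wrappers, here is what feeding a read stream into node-expat could look like; the file path and element handling are placeholders, and it assumes node-expat's chunked write() API:

import { createReadStream } from 'fs'
import * as expat from 'node-expat'

const parser = new expat.Parser('UTF-8')

parser.on('startElement', (name, attrs) => {
  // react to opening tags, e.g. name === 'Person'
})
parser.on('text', (text) => {
  // character data between tags arrives here, possibly in several chunks
})
parser.on('endElement', (name) => {
  // a complete element has been read
})
parser.on('error', (err) => console.error('parse error:', err))

// feed the file to the parser chunk by chunk so nothing is buffered in full
createReadStream('/path/to/large.xml')
  .on('data', (chunk) => parser.write(chunk))
  .on('end', () => { /* all chunks written; the handlers above have seen every element */ })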
My Node implementation would look something like this:
import XmlStream from 'xml-stream'   // assumes esModuleInterop; otherwise: const XmlStream = require('xml-stream')
import { get } from 'http'
import { createWriteStream } from 'fs'

const databaseWriteStream = createWriteStream('/path/to/file.csv')

// http.get hands us the response, which is the readable stream xml-stream needs
get('http://external.path/to/xml', (response) => {
  const xmlParser = new XmlStream(response)
  xmlParser.on('endElement: Person', ({ name, phone, age }) =>
    databaseWriteStream.write(`"${name}","${phone}","${age}"\n`))
  xmlParser.on('end', () => databaseWriteStream.end())
})
Of course I have no idea what your database write stream would look like, so here I am just writing it to a file.
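In case it helps, one hedged sketch of a database-backed write stream: a Writable in object mode that passes each parsed Person to a hypothetical db.insertPerson call, which you would replace with your actual client:

import { Writable } from 'stream'

// Placeholder for whatever database client you actually use
declare const db: { insertPerson(person: unknown): Promise<void> }

const databaseWriteStream = new Writable({
  objectMode: true,   // accept parsed row objects instead of strings/buffers
  write(person, _encoding, callback) {
    db.insertPerson(person)      // hypothetical async insert
      .then(() => callback())    // signal completion so stream backpressure works
      .catch(callback)           // propagate errors to the stream
  }
})

// Usage: push the parsed objects instead of formatting CSV lines
// xmlParser.on('endElement: Person', (person) => databaseWriteStream.write(person))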