I'm relatively new to Node.js. I'm trying to convert 83 XML files that are each around 400MB in size into JSON.
Each file contains data like this (except each element has a large number of additional statements):
<case-file>
  <serial-number>75563140</serial-number>
  <registration-number>0000000</registration-number>
  <transaction-date>20130101</transaction-date>
  <case-file-header>
    <filing-date>19981002</filing-date>
    <status-code>686</status-code>
    <status-date>20130101</status-date>
  </case-file-header>
  <case-file-statements>
    <case-file-statement>
      <type-code>D10000</type-code>
      <text>"MUSIC"</text>
    </case-file-statement>
    <case-file-statement>
      <type-code>GS0351</type-code>
      <text>compact discs</text>
    </case-file-statement>
  </case-file-statements>
  <case-file-event-statements>
    <case-file-event-statement>
      <code>PUBO</code>
      <type>A</type>
      <description-text>PUBLISHED FOR OPPOSITION</description-text>
      <date>20130101</date>
      <number>28</number>
    </case-file-event-statement>
    <case-file-event-statement>
      <code>NPUB</code>
      <type>O</type>
      <description-text>NOTICE OF PUBLICATION</description-text>
      <date>20121212</date>
      <number>27</number>
    </case-file-event-statement>
  </case-file-event-statements>
</case-file>
I have tried a lot of different Node modules, including sax, node-xml, node-expat and xml2json. Obviously, I need to stream the data from the file and pipe it through an XML parser and then convert it to JSON.
I have also read a number of blog posts and the like that attempt to explain, albeit superficially, how to parse XML.
In the Node universe, I tried sax first, but I can't figure out how to extract the data in a format that I can convert to JSON. xml2json won't work on streams. node-xml looks encouraging, but I can't figure out how it parses chunks in any manner that makes sense. node-expat points to the libexpat documentation, which appears to require a Ph.D. Node elementtree does the same, pointing to the Python implementation, but doesn't explain what has been implemented or how to use it.
Can someone point me to an example that I could use to get started?
If you'd like the JavaScript object as a JSON string, you can write:
    // Assuming xmlDoc is the XML DOM Document
    var jsonText = JSON.stringify(xmlToJson(xmlDoc));
This function has been extremely useful in allowing me to quickly disregard XML and use JSON instead.
Generally speaking, JSON is faster to parse and more compact than the equivalent XML.
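The xmlToJson function referred to above is not shown here, and it expects a DOM document, which Node does not provide out of the box. As a rough stand-in, here is a minimal sketch of the same idea that works on a small XML string using only built-in JavaScript. The name xmlFragmentToObject is hypothetical, and the sketch assumes well-formed input with no attributes, self-closing tags, comments, CDATA sections, or entities.

```javascript
// Hypothetical helper, not the xmlToJson function mentioned above.
// Converts one small, well-formed XML fragment into a plain object.
// Assumption: elements have no attributes and are never self-closing.
function xmlFragmentToObject(xml) {
  // Matches an opening tag, a closing tag, or a run of text.
  const tokenPattern = /<(\/?)([^\s<>\/]+)>|([^<]+)/g;
  const root = { children: Object.create(null), text: '' };
  const stack = [root];

  for (const [, slash, name, text] of xml.matchAll(tokenPattern)) {
    const current = stack[stack.length - 1];
    if (text !== undefined) {
      current.text += text; // text content of the current element
    } else if (slash) {
      stack.pop(); // closing tag: return to the parent element
    } else {
      const node = { children: Object.create(null), text: '' };
      if (!current.children[name]) current.children[name] = [];
      current.children[name].push(node);
      stack.push(node);
    }
  }
  return simplify(root);
}

// Leaf elements become strings; repeated child names become arrays.
function simplify(node) {
  const names = Object.keys(node.children);
  if (names.length === 0) return node.text.trim();
  const result = {};
  for (const name of names) {
    const values = node.children[name].map(simplify);
    result[name] = values.length === 1 ? values[0] : values;
  }
  return result;
}

const sample =
  '<case-file-statement><type-code>GS0351</type-code>' +
  '<text>compact discs</text></case-file-statement>';
console.log(JSON.stringify(xmlFragmentToObject(sample)));
```

Note that a child name occurring once yields a single value, while one occurring several times yields an array, so the shape of the JSON depends on the data. A real converter usually offers an option to force arrays for known repeating elements.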
Although this question is quite old, I am sharing my problem & solution, which might be helpful to anyone trying to convert XML to JSON.
The actual problem here is not the conversion but processing huge XML files without having to hold them in memory at once.
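To make the memory point concrete, here is a minimal sketch using only Node's built-in stream and string_decoder modules. It buffers input only until one complete repeating element is available, then emits that element as a string. The names createSplitter and splitStream are hypothetical and do not come from any package. The sketch assumes the repeating element's opening tag has no attributes and that the element is never nested inside itself.

```javascript
const { Transform } = require('stream');
const { StringDecoder } = require('string_decoder');

// Hypothetical helper: returns a feed(text) function that collects every
// complete <tagName>...</tagName> element seen so far.
// Assumptions: no attributes on the opening tag, no self-nesting.
function createSplitter(tagName) {
  const openTag = `<${tagName}>`;
  const closeTag = `</${tagName}>`;
  let buffer = '';

  return function feed(text) {
    const nodes = [];
    buffer += text;
    let start = buffer.indexOf(openTag);
    while (start !== -1) {
      const end = buffer.indexOf(closeTag, start);
      if (end === -1) break; // element still incomplete, wait for more input
      const stop = end + closeTag.length;
      nodes.push(buffer.slice(start, stop));
      buffer = buffer.slice(stop);
      start = buffer.indexOf(openTag);
    }
    // Keep the incomplete element, or else only a short tail in case an
    // opening tag was split across two chunks.
    buffer = start === -1
      ? buffer.slice(-(openTag.length - 1))
      : buffer.slice(start);
    return nodes;
  };
}

// Wraps the splitter in a Transform stream that emits one element per
// 'data' event.
function splitStream(tagName) {
  const feed = createSplitter(tagName);
  // StringDecoder copes with multi-byte characters split across chunks.
  const decoder = new StringDecoder('utf8');
  return new Transform({
    readableObjectMode: true,
    transform(chunk, _encoding, callback) {
      for (const node of feed(decoder.write(chunk))) this.push(node);
      callback();
    },
  });
}

// Possible usage (the file name is a placeholder):
// require('fs').createReadStream('cases.xml')
//   .pipe(splitStream('case-file'))
//   .on('data', (xml) => { /* convert this one element to JSON */ });
```

With this shape, memory use is bounded by the size of one element plus one chunk rather than by the size of the file. Each emitted string is small enough to hand to any ordinary XML to JSON converter.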
Working with almost all of the widely used packages, I came across the following problems:
- A lot of packages support XML to JSON conversion covering all scenarios, but they don't work well with large files.
- Very few packages (like xml-flow, xml-stream) support large XML file conversion, and their conversion process misses a few corner cases where the conversion either fails or gives an unpredictable JSON structure (explained in this SO question).
The ideal solution would be to combine the advantages of both approaches, which is exactly what I did when I came up with the xtreamer node package.
In simple words, xtreamer accepts a repeating node just like xml-flow / xml-stream but emits the repeating XML nodes instead of converted JSON. This provides the following advantages:
- xtreamer can be piped with any readable stream, as it extends transform stream.
- A JSON parser can be hooked into xtreamer, & it will invoke the JSON parser and emit JSON accordingly.
- xtreamer has stream as its only dependency & being a transform stream extension, it can be piped with other streams flexibly.
What if XML structure is not fixed?
I managed to come up with another sax-based node package, xtagger, which reads the XML file and provides the structure of the file in the following format:
structure: { [name: string]: { [hierarchy: number]: number } };
This package allows you to figure out the repeating node name, which can then be passed to xtreamer for parsing.
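To illustrate what such a structure might contain, here is a minimal sketch that builds a map of the same shape from a small XML string. It is not xtagger's implementation: the function names are hypothetical, and reading "hierarchy" as nesting depth and the value as an occurrence count is my assumption. It also works on an in-memory string, so for large files the same counting would have to be driven by a streaming parser such as sax.

```javascript
// Hypothetical helper: counts how often each tag name opens at each depth.
// Assumptions: well-formed XML, no comments or CDATA sections containing
// tag-like text, and no '>' inside attribute values.
function tagStructure(xml) {
  const structure = {};
  let depth = 0;
  // Groups: closing slash, tag name, self-closing slash.
  const tagPattern = /<(\/?)([^\s<>\/!?]+)[^<>]*?(\/?)>/g;

  for (const [, closing, name, selfClosing] of xml.matchAll(tagPattern)) {
    if (closing) {
      depth -= 1;
      continue;
    }
    if (!Object.prototype.hasOwnProperty.call(structure, name)) {
      structure[name] = {};
    }
    structure[name][depth] = (structure[name][depth] || 0) + 1;
    if (!selfClosing) depth += 1;
  }
  return structure;
}

// Picks the tag that repeats at the smallest depth, which is a reasonable
// first guess for the repeating node to split on.
function shallowestRepeatingTag(structure) {
  let best = null;
  for (const [name, byDepth] of Object.entries(structure)) {
    for (const [depth, count] of Object.entries(byDepth)) {
      if (count > 1 && (best === null || Number(depth) < best.depth)) {
        best = { name, depth: Number(depth) };
      }
    }
  }
  return best;
}

const doc =
  '<root><case-file><code>A</code></case-file>' +
  '<case-file><code>B</code><code>C</code></case-file></root>';
console.log(tagStructure(doc));
console.log(shallowestRepeatingTag(tagStructure(doc)));
```

For the sample data in the question, a scan like this would be expected to point at case-file as the repeating node.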
I hope this helps. :)
I doubt this is still relevant after 2-3 years, but in case anyone else stumbles on this, I would say xml-stream on NPM looked rather straightforward to me.
If you're a Windows user who wants to avoid GYP, however, I tried adding a very simple solution using sax to extract children from an XML file one by one. It's called no-gyp-xml-stream, and it may not have a lot of features, but it certainly is simple to use: https://www.npmjs.com/package/no-gyp-xml-stream