Converting large XML file to relational database

Tags: python, javascript, node.js, xml, relational-database

I'm trying to figure out the best way to accomplish the following:

1. Download a large (1 GB) XML file on a daily basis from a third-party website
2. Convert that XML file to a relational database on my server
3. Add functionality to search the database

For the first part, is this something that would need to be done manually, or could it be accomplished with a cron job?

Most of the questions and answers related to XML and relational databases refer to Python or PHP. Could this be done with JavaScript/Node.js as well?

If this question is better suited for a different StackExchange forum, please let me know and I will move it there instead.

Below is a sample of the XML:

<case-file>
  <serial-number>123456789</serial-number>
  <transaction-date>20150101</transaction-date>
  <case-file-header>
    <filing-date>20140101</filing-date>
  </case-file-header>
  <case-file-statements>
    <case-file-statement>
      <code>AQ123</code>
      <text>Case file statement text</text>
    </case-file-statement>
    <case-file-statement>
      <code>BC345</code>
      <text>Case file statement text</text>
    </case-file-statement>
  </case-file-statements>
  <classifications>
    <classification>
      <international-code-total-no>1</international-code-total-no>
      <primary-code>025</primary-code>
    </classification>
  </classifications>
</case-file>

Here's some more information about how these files will be used:

All XML files will be in the same format. There are probably a few dozen elements within each record. The files are updated by a third party on a daily basis (and are available as zipped files on the third-party website). Each day's file represents new case files as well as updated case files.

The goal is to allow a user to search for information and organize those search results on the page (or in a generated PDF/Excel file). For example, a user might want to see all case files that include a particular word within the <text> element. Or a user might want to see all case files that include primary code 025 (<primary-code> element) and that were filed after a particular date (<filing-date> element).

The only data entered into the database will be from the XML files--users won't be adding any of their own information to the database.

asked Nov 13 '15 23:11

Ken

1 Answer

All of these steps can be accomplished using Node.js. There are modules available that will help you with each of these tasks:

1. - node-cron: lets you easily set up cron tasks in your node program. Another option would be to set up a cron task on your operating system (lots of resources available for your favourite OS).
   - download: module to easily download files from a URL.
2. xml-stream: allows you to stream a file and register events that fire when the parser encounters certain XML elements. I have successfully used this module to parse KML files (granted they were significantly smaller than your files).
3. node-postgres: node client for PostgreSQL (I am sure there are clients for many other common RDBMS, PG is the only one I have used so far).
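
For the search step, here is a rough sketch of how the two example searches from the question could be turned into a parameterized PostgreSQL query. The table and column names (`case_files`, `statements`, `classifications`, `serial_number`, and so on) and the `buildSearchQuery` helper are made up for illustration; nothing here comes from the original data source. The function only builds the query object, so it runs without a database:

```javascript
// Hypothetical schema (an assumption, not part of the question):
//   case_files(serial_number, transaction_date, filing_date)
//   statements(serial_number, code, text)
//   classifications(serial_number, primary_code)

// Build a parameterized query from optional search filters.
// Values are kept separate from the SQL text, so user input is never
// concatenated into the statement itself.
function buildSearchQuery(filters) {
  const where = [];
  const values = [];
  // Store a value and return its placeholder ($1, $2, ...).
  const param = (value) => {
    values.push(value);
    return '$' + values.length;
  };

  if (filters.primaryCode) {
    where.push(
      'EXISTS (SELECT 1 FROM classifications c ' +
      'WHERE c.serial_number = f.serial_number AND c.primary_code = ' +
      param(filters.primaryCode) + ')'
    );
  }
  if (filters.filedAfter) {
    where.push('f.filing_date > ' + param(filters.filedAfter));
  }
  if (filters.word) {
    // Note: % and _ inside the word act as LIKE wildcards; escape them if
    // that matters for your users.
    where.push(
      'EXISTS (SELECT 1 FROM statements s ' +
      'WHERE s.serial_number = f.serial_number AND s.text ILIKE ' +
      param('%' + filters.word + '%') + ')'
    );
  }

  const text =
    'SELECT f.serial_number, f.filing_date FROM case_files f' +
    (where.length ? ' WHERE ' + where.join(' AND ') : '') +
    ' ORDER BY f.filing_date';
  return { text: text, values: values };
}

// Example: primary code 025, filed after 1 January 2014.
const query = buildSearchQuery({ primaryCode: '025', filedAfter: '2014-01-01' });
console.log(query.text);
console.log(query.values);
```

node-postgres accepts an object with `text` and `values` as a query config, so the result should be usable as `client.query(query)`. For heavier text search on the `<text>` content, PostgreSQL's full-text search is probably a better fit than `ILIKE`.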

Most of these modules have pretty great examples that will get you started. Here's how you would probably set up the XML streaming part:

var fs = require('fs');
var XmlStream = require('xml-stream');

var xml = fs.createReadStream('path/to/file/on/disk'); // or stream directly from your online source
var xmlStream = new XmlStream(xml);
// collect repeated children into an array instead of keeping only the last one
xmlStream.collect('case-file-statement');
xmlStream.on('endElement: case-file', function(element) {
    // create and execute SQL query/queries here for this element
});
xmlStream.on('end', function() {
    // done reading elements
    // do further processing / query database, etc.
});
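
To give an idea of what could happen inside the `case-file` handler, here is a minimal sketch that flattens one parsed element into rows for three tables. It assumes the parser hands you a plain object keyed by element name, with text-only elements as strings and repeated elements as either an array or a single object; check this against what your parser actually emits. The helper names (`toIsoDate`, `asArray`, `caseFileToRows`) and the row layout are hypothetical:

```javascript
// Convert 'YYYYMMDD' (the date format in the sample XML) to 'YYYY-MM-DD'.
// Returns null for anything that does not look like eight digits.
function toIsoDate(yyyymmdd) {
  const m = /^(\d{4})(\d{2})(\d{2})$/.exec(String(yyyymmdd || ''));
  return m ? m[1] + '-' + m[2] + '-' + m[3] : null;
}

// A repeated child may arrive as an array, a single object, or not at all;
// normalize all three cases to an array.
function asArray(value) {
  if (value === undefined || value === null) return [];
  return Array.isArray(value) ? value : [value];
}

// Turn one parsed <case-file> element into rows for three hypothetical tables.
function caseFileToRows(element) {
  const serial = element['serial-number'];
  const header = element['case-file-header'] || {};
  const statements = (element['case-file-statements'] || {})['case-file-statement'];
  const classes = (element['classifications'] || {})['classification'];
  return {
    caseFile: {
      serial_number: serial,
      transaction_date: toIsoDate(element['transaction-date']),
      filing_date: toIsoDate(header['filing-date'])
    },
    statements: asArray(statements).map(function (s) {
      return { serial_number: serial, code: s.code, text: s.text };
    }),
    classifications: asArray(classes).map(function (c) {
      return { serial_number: serial, primary_code: c['primary-code'] };
    })
  };
}

// Sample object mirroring the XML in the question (assumed parser output shape).
const sample = {
  'serial-number': '123456789',
  'transaction-date': '20150101',
  'case-file-header': { 'filing-date': '20140101' },
  'case-file-statements': {
    'case-file-statement': [
      { code: 'AQ123', text: 'Case file statement text' },
      { code: 'BC345', text: 'Case file statement text' }
    ]
  },
  classifications: {
    classification: { 'international-code-total-no': '1', 'primary-code': '025' }
  }
};

const rows = caseFileToRows(sample);
console.log(rows.caseFile);
console.log(rows.statements.length, 'statements,', rows.classifications.length, 'classification');
```

Since each day's file contains updated case files as well as new ones, the insert for the main row would likely need to be an upsert. If the serial number uniquely identifies a case file (an assumption on my part), PostgreSQL 9.5+ supports `INSERT ... ON CONFLICT (serial_number) DO UPDATE`, and the child rows for that serial number can be deleted and re-inserted in the same transaction.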

answered Nov 14 '22 13:11

forrert
