Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to load an XML file into MySQL using C#?

Question

What is the fastest way to dump a large (> 1GB) XML file into a MySQL database?

Data

The data in question is the StackOverflow Creative Commons Data Dump.

Purpose

This will be used in an offline StackOverflow viewer I am building, since I am looking to do some studying/coding in places where I will not have access to the internet.

I would like to release this to the rest of the StackOverflow membership for their own use when the project is finished.

Problem

Originally, I was reading from XML/writing to DB one record at a time. This took about 10 hours to run on my machine. The hacktastic code I'm using now throws 500 records into an array, then creates an insertion query to load all 500 at once (eg. "INSERT INTO posts VALUES (...), (...), (...) ... ;"). While this is faster, it still takes hours to run. Clearly this is not the best way to go about it, so I'm hoping the big brains on this site will know of a better way.

Constraints

  • I am building the application using C# as a desktop application (i.e. WinForms).
  • I am using MySQL 5.1 as my database. This means that features such as "LOAD XML INFILE filename.xml" are not usable in this project, as this feature is only available in MySQL 5.4 and above. This constraint is largely due to my hope that the project would be useful to people other than myself, and I'd rather not force people to use Beta versions of MySQL.
  • I'd like the data load to be built into my application (i.e. no instructions to "Load the dump into MySQL using 'foo' before running this application.").
  • I'm using MySQL Connector/Net, so anything in the MySql.Data namespace is acceptable.

Thanks for any pointers you can provide!


Ideas so far

stored procedure that loads an entire XML file into a column, then parses it using XPath

  • This didn't work since the file size is subject to the limitations of the max_allowed_packet variable, which is set to 1 MB by default. This is far below the size of the data dump files.
like image 221
AgentConundrum Avatar asked Sep 21 '09 18:09

AgentConundrum


1 Answers

There are 2 parts to this:

  • reading the xml file
  • writing to the database

For reading the xml file, this link http://csharptutorial.blogspot.com/2006/10/reading-xml-fast.html , shows that 1 MB can be read in 2.4 sec using stream reader, that would be 2400 seconds or 40 mins (if my maths is working this late) for 1 GB file.

From what I have read the fastest way to get data into MySQL is to use LOAD DATA.

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

Therefore, if you can read the xml data, write it to files that can be used by LOAD DATA, then run LOAD DATA. The total time may be less than the hours that you are experiancing.

like image 90
Shiraz Bhaiji Avatar answered Oct 11 '22 15:10

Shiraz Bhaiji