 

Converting data from Mongo to MySQL (110M docs, 60Gigs) - Tips and Suggestions?

I have been tasked with porting data from a MongoDB database to a MySQL database. (There are strong reasons for porting - so it has to be done).

The MongoDB collection:

  • Has approx 110 Million documents
  • Weighs 60 GB in size
  • Has indexes for important properties
  • Is hosted on a separate, standalone Windows Server 2008 machine that is not serving any production traffic

The Setup that we have tried:

  • A Large Amazon EC2 Win2008 Server instance with 7.5 Gigs of RAM / 8 Gigs of page file
  • A C# console app which converts the MongoDB data to a local MySQL database

We pick up 1K documents at a time in memory from MongoDB, do the necessary processing, and then save them to the MySQL db in batch writes of 500 at a time.
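
For context, the core of the conversion loop looks roughly like the sketch below. This is a simplified version, assuming the legacy 1.x MongoDB C# driver and the MySql.Data connector; the connection strings, table and column names (target_table, mongo_id, payload) are placeholders rather than our real schema.

using System.Collections.Generic;
using System.Linq;
using System.Text;
using MongoDB.Bson;
using MongoDB.Driver;
using MySql.Data.MySqlClient;

class MongoToMySql
{
    const int ReadBatch = 1000;  // documents pulled from Mongo per page
    const int WriteBatch = 500;  // rows per multi-row INSERT into MySQL

    static void Main()
    {
        var mongo = MongoServer.Create("mongodb://localhost");
        var source = mongo.GetDatabase("sourceDb").GetCollection("docs");

        using (var mysql = new MySqlConnection("server=localhost;database=target;uid=user;pwd=secret"))
        {
            mysql.Open();
            for (int skip = 0; ; skip += ReadBatch)
            {
                // Page through the collection with skip/limit (the approach we started with).
                var page = source.FindAll().SetSkip(skip).SetLimit(ReadBatch).ToList();
                if (page.Count == 0) break;

                // Flush to MySQL in chunks of 500.
                for (int i = 0; i < page.Count; i += WriteBatch)
                    Flush(mysql, page.Skip(i).Take(WriteBatch).ToList());
            }
        }
    }

    static void Flush(MySqlConnection conn, List<BsonDocument> docs)
    {
        // One parameterised multi-row INSERT per batch.
        var sql = new StringBuilder("INSERT INTO target_table (mongo_id, payload) VALUES ");
        sql.Append(string.Join(",", docs.Select((d, i) => string.Format("(@id{0}, @p{0})", i))));

        using (var cmd = new MySqlCommand(sql.ToString(), conn))
        {
            for (int i = 0; i < docs.Count; i++)
            {
                cmd.Parameters.AddWithValue("@id" + i, docs[i]["_id"].ToString());
                cmd.Parameters.AddWithValue("@p" + i, docs[i].ToJson());
            }
            cmd.ExecuteNonQuery();
        }
    }
}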

The problem we are facing is that every ~2.5M docs, the server chokes up and Mongo starts responding very slowly, timing out the app's data fetch operation (free RAM is exhausted by the time 1M documents have been processed).

We are moving ahead slowly by killing the mongod process and starting it again every 2.5M records when it crashes - but I bet we're doing something wrong.

Question:

Should I move the Mongo Server to a Linux based Large Instance and MySQL to the Amazon RDS for this and rewrite the conversion app in PHP? Will it help?

The reason we decided to keep it all on one box was the latency issue of having different servers on different boxes - but I guess that is moot if the box is choking up.

What other things can I try / tips I can use?

Thanks for reading this far!

-- Update 01 --

It's been approximately 6 hours since I restarted my app after making the following changes:

  1. Increased the Mongo read count from 1,000 to 10,000 records at a time, using .skip(10K).limit(10K).
  2. Removed all indexes from the MySQL target database (see the sketch after this list).
  3. Increased the Windows page file from 4 Gigs to 8 Gigs.
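
For item 2, dropping (and later recreating) the indexes is just a couple of ALTER TABLE statements issued from the console app. A rough sketch, with illustrative table/index/column names rather than our actual schema:

using MySql.Data.MySqlClient;

static class IndexToggler
{
    // Dropping secondary indexes before the bulk load avoids paying the
    // index-maintenance cost on every INSERT; recreate them once the load is done.
    public static void DropIndexes(MySqlConnection conn)
    {
        using (var cmd = new MySqlCommand(
            "ALTER TABLE target_table DROP INDEX idx_mongo_id", conn))
        {
            cmd.ExecuteNonQuery();
        }
    }

    public static void RecreateIndexes(MySqlConnection conn)
    {
        using (var cmd = new MySqlCommand(
            "ALTER TABLE target_table ADD INDEX idx_mongo_id (mongo_id)", conn))
        {
            cmd.ExecuteNonQuery();
        }
    }
}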

Memory is at 100% consumption but the app is still running. (Last time it croaked after 52 mins.) Mongo is eating 6.8 Gigs of RAM, MySQL about 450 Megs and the converter app about 400 Megs (approx values).

Processed 11M records so far - but the speed has gone down to 370 records / sec from approx 500 records / sec.

Next step is to isolate the Mongo and MySQL servers onto separate boxes, keeping both in the same Amazon availability zone to minimize latency.

-- Update 02 --

We changed the code to use a Mongo cursor and let it advance automatically, instead of doing the .skip().limit() paging ourselves. This greatly sped up the process: we went from roughly 300 records per second to about 1,250. However, the application started consuming too much memory, would run out of RAM and crash, and needed to be restarted roughly every 2M records.

We used this code snippet:

// Find() returns a cursor; SetBatchSize controls how many documents
// each round trip to the server pulls down.
var docs = db[collectionName].Find(query);
docs.SetBatchSize(numOfResultsToFetchAtATime);
foreach (var d in docs) {
    // do processing
}

What this does is fetch 'numOfResultsToFetchAtATime' records per round trip, and the loop then advances automatically and fetches the next set of records. Mongo manages this progression with a cursor, so it is a lot faster than .skip()/.limit(), where the server has to walk past every skipped document on each query and therefore gets slower as the offset grows.

However, we have still not been able to successfully port this. Will post my reply with code when that happens properly.

-- Update 03: Success --

We finally used @scarpacci's suggestion of doing a mongoexport. Do remember that it is essential that the MongoDB is on a Linux box and not a Windows box.

We first tried doing a mongoexport from Windows on the local MongoDB, and no matter what we tried, it would fail at different places for one large collection (13 Gigs+).

Finally, I restored the DB on a Linux box and mongoexport worked like a charm.
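
For reference, the export itself was nothing fancy; the invocation was along these lines (database and collection names are placeholders):

mongoexport --db sourceDb --collection docs --out docs.json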

There is no ready-made JSON -> MySQL converter, so that part we had to do ourselves. With a little tweaking we were able to reuse our previous app to read the files and write to MySQL directly. It was quick and relatively error-free.

We had some issues with the large files, but splitting the 13GB file into 500 Meg chunks took care of that, and we were able to migrate all of the data to MySQL successfully.
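
The import side was only a small variation on the earlier loop: mongoexport writes one JSON document per line, so each split file can be streamed line by line and pushed into MySQL in the same 500-row batches. A rough sketch, reusing a batch-insert helper like the Flush method shown earlier (file paths and names are placeholders):

using System.Collections.Generic;
using System.IO;
using MongoDB.Bson;
using MySql.Data.MySqlClient;

class JsonFileImporter
{
    const int WriteBatch = 500;

    static void ImportFile(string path, MySqlConnection mysql)
    {
        var buffer = new List<BsonDocument>(WriteBatch);

        // mongoexport emits one JSON document per line, so the file can be
        // streamed without loading a whole 500 Meg chunk into memory.
        foreach (var line in File.ReadLines(path))
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            buffer.Add(BsonDocument.Parse(line));

            if (buffer.Count == WriteBatch)
            {
                Flush(mysql, buffer);   // same multi-row INSERT helper as before
                buffer.Clear();
            }
        }
        if (buffer.Count > 0) Flush(mysql, buffer);
    }

    static void Flush(MySqlConnection conn, List<BsonDocument> docs)
    {
        // Same parameterised multi-row INSERT as in the earlier sketch.
    }
}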

Many thanks to everyone for spending time helping us out. Hope that this explanation helps someone in the future.

asked Jul 21 '12 by saurabhj

1 Answer

I had issues migrating data to SQL Server using .NET once - even though I tried to keep it as lightweight as possible, it was still unacceptably slow. In the end I wrote a quick C++ OLEDB app and things went significantly faster. I'm still trying to figure out what I did wrong in my .NET app, but it's possible the problem is in .NET itself. I wouldn't rewrite the conversion in PHP; I'd go with the performance option and use C++ (grab a tutorial off the web, it's not that hard, at least not for a throwaway app).

So that's one thing to look at first - as well as profiling your C# app to see if it has a memory leak that's slowly bringing the rest of the system to a crawl.
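
A cheap way to rule the converter in or out: log the process's memory counters every few thousand records and see whether managed memory keeps climbing even after a forced collection. Just a sketch, not specific to your code:

using System;
using System.Diagnostics;

static class MemProbe
{
    // Call this every few thousand records; if the numbers only ever go up,
    // the converter is holding on to documents somewhere.
    public static void Report(long recordsProcessed)
    {
        long managed = GC.GetTotalMemory(true);   // true forces a full collection first
        long workingSet = Process.GetCurrentProcess().WorkingSet64;
        Console.WriteLine("{0} records: managed={1:N0} bytes, working set={2:N0} bytes",
                          recordsProcessed, managed, workingSet);
    }
}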

I find it interesting that you restart the MongoDB process rather than anything else. What makes you think it's MongoDB that is dying, and not the other systems? If it's memory usage, then splitting onto separate boxes might make a difference; if its memory is slowly growing, then read smaller chunks. Mongo should be fine at reading data, so if it's not, chances are something you've done is making it hold on to its memory, either in configuration or in your app.

Fire up perfmon and look at the memory, swap, disk I/O and CPU usage of Mongo, your app and your MySQL instance.

answered Sep 30 '22 by gbjbaanb