Before I dive into my question, I wanted to point out that I am doing this partly to get familiar with Node and MongoDB. I realize there are probably better ways to accomplish my final goal, but what I want to get out of this is a general methodology that might apply to other situations.
The goal:
I have a CSV file containing 6+ million geo-IP records. Each record contains 4 fields in total and the file is roughly 180 MB.
I want to process this file and insert each record into a MongoDB collection called "Blocks". Each "Block" will have the 4 fields from the CSV file.
My current approach:
I am using Mongoose to create a "Block" model and a ReadStream to process the file line by line. The code I'm using to process the file and extract the records works, and I can make it print each record to the console if I want to.
For each record in the file, it calls a function that creates a new Block object (using Mongoose), populates the fields and saves it.
This is the code inside of the function that gets called every time a line is read and parsed. The "rec" variable contains an object representing a single record from the file.
var block = new Block();
block.ipFrom = rec.startipnum;
block.ipTo = rec.endipnum;
block.location = rec.locid;

connections++; // one more save in flight

block.save(function(err){
    if(err) throw err;
    //console.log('.');
    records_inserted++;

    // When the last outstanding save finishes, close the connection.
    if( --connections == 0 ){
        mongoose.disconnect();
        console.log( records_inserted + ' records inserted' );
    }
});
The problem:
Since the file is read asynchronously, many lines are processed at the same time, and reading the file is much faster than MongoDB can write. The whole process stalls at around 282,000 records, with the number of concurrent Mongo connections climbing past 5,000. It doesn't crash; it just sits there doing nothing, doesn't seem to recover, and the item count in the Mongo collection stops going up.
What I'm after here is a general approach to solving this problem. How would I cap the number of concurrent Mongo connections? I'd still like to take advantage of inserting multiple records at the same time, but I'm missing a way to regulate the flow.
Thank you in advance.
Not an answer to your exact situation of importing from a .csv file, but rather some notes on doing bulk inserts:
-> First of all, there are no special 'bulk' insert operations; it all comes down to a forEach in the end.
-> If you read a big file asynchronously, the reads will be a lot faster than the write process, so you should reconsider your approach. Start by figuring out how much your setup can handle (or just find out by trial and error).
---> After that, change the way you read the file: you don't need to read every line asynchronously. Learn to wait; use forEach / forEachSeries from Async.js to bring your read rate down to the level MongoDB can write at, and you're good to go. A rough sketch of the idea follows.
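For illustration only, here is a minimal sketch of that idea using Async.js's eachLimit (forEachLimit in older versions). It assumes the CSV has already been parsed into a records array and reuses the Block model and field names from the question; the concurrency limit of 50 is an arbitrary example value, and with 6+ million rows you would more likely pause/resume the read stream than hold everything in memory.

var async = require('async');

// Insert records with at most 50 saves in flight at any one time.
async.eachLimit(records, 50, function (rec, done) {
    var block = new Block();
    block.ipFrom = rec.startipnum;
    block.ipTo = rec.endipnum;
    block.location = rec.locid;
    block.save(done); // done(err) is called when this save finishes
}, function (err) {
    if (err) throw err;
    console.log(records.length + ' records inserted');
    mongoose.disconnect();
});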
I would try the command-line CSV import option from MongoDB - it should do what you are after without having to write any code.
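That command-line tool is mongoimport. A hypothetical invocation for this file might look like the following (the database name, collection name and file name are placeholders; --headerline assumes the first row of the CSV contains the field names, otherwise use --fields to name them):

mongoimport --db geoipdb --collection blocks --type csv --headerline --file blocks.csv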