I have a Java application that needs to read a large amount of data from MongoDB 3.2 and transfer it to Hadoop.
This batch application runs every 4 hours, 6 times a day.
Data Specifications: the collection holds roughly 80,000 documents, totalling a few gigabytes.
Currently I am using MongoTemplate and Morphia to access MongoDB. However, I get an OOM exception when processing this data with the following:
List<MYClass> datalist = datasource.getCollection("mycollection").find().asList();
What is the best way to read this data and populate it to Hadoop?

MongoTemplate::Stream() and write to Hadoop one by one?
batchSize(someLimit) and write the entire batch to Hadoop?
Cursor.batch() and write to HDFS one by one?

Your problem lies in the asList() call. This forces the driver to iterate through the entire cursor (80,000 documents, a few gigabytes), keeping it all in memory. batchSize(someLimit) and Cursor.batch() won't help here, because you traverse the whole cursor no matter what the batch size is.
Instead you can:
1) Iterate the cursor: drop the asList() call and walk the result of datasource.getCollection("mycollection").find() one document at a time
2) Read documents one at a time and feed them into a buffer (say, a list)
3) Every 1000 documents or so, call the Hadoop API to write the batch, clear the buffer, then start again (see the sketch after this list)
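Below is a minimal sketch of those three steps, assuming the plain MongoDB 3.x Java driver and the Hadoop FileSystem API rather than MongoTemplate/Morphia; the host, port, database name, HDFS path, and the choice of writing documents as JSON lines are all placeholders, not part of your setup.

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.bson.Document;

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MongoToHdfsBatch {

    // Flush the buffer to HDFS every 1000 documents (tune to taste).
    private static final int FLUSH_THRESHOLD = 1000;

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml from the classpath
        try (MongoClient mongo = new MongoClient("localhost", 27017);          // placeholder host/port
             FileSystem fs = FileSystem.get(conf);
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     fs.create(new Path("/data/mycollection/batch.json")),     // placeholder HDFS path
                     StandardCharsets.UTF_8))) {

            MongoCollection<Document> collection =
                    mongo.getDatabase("mydb").getCollection("mycollection");   // placeholder database name

            List<Document> buffer = new ArrayList<>(FLUSH_THRESHOLD);

            // Steps 1 and 2: iterate the cursor and buffer documents instead of loading everything at once.
            try (MongoCursor<Document> cursor = collection.find().iterator()) {
                while (cursor.hasNext()) {
                    buffer.add(cursor.next());
                    // Step 3: every FLUSH_THRESHOLD documents, write the batch and clear the buffer.
                    if (buffer.size() >= FLUSH_THRESHOLD) {
                        writeBatch(buffer, out);
                    }
                }
            }
            writeBatch(buffer, out); // write whatever is left in the buffer
        }
    }

    // Writes each buffered document to HDFS as one JSON line, then clears the buffer for reuse.
    private static void writeBatch(List<Document> buffer, BufferedWriter out) throws IOException {
        for (Document doc : buffer) {
            out.write(doc.toJson());
            out.newLine();
        }
        buffer.clear();
    }
}

The MongoTemplate stream() option you mention works the same way: it hands you a cursor-backed iterator, so the identical buffer-and-flush pattern applies. The point in every case is that only about FLUSH_THRESHOLD documents live in memory at a time instead of the whole collection.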