
Read large mongodb data

I have a Java application that needs to read a large amount of data from MongoDB 3.2 and transfer it to Hadoop.

This batch application runs every 4 hours, 6 times a day.

Data Specifications:

  • Documents: 80,000 at a time (every 4 hours)
  • Size: 3 GB

Currently I am using MongoTemplate and Morphia to access MongoDB. However, I get an OOM exception when processing this data using the following:

List<MYClass> datalist = datasource.getCollection("mycollection").find().asList();

What is the best way to read this data and write it to Hadoop?

  • MongoTemplate::Stream() and write to Hadoop one by one?
  • batchSize(someLimit) and write the entire batch to Hadoop?
  • Cursor.batch() and write to HDFS one by one?
Sid asked Sep 28 '17

1 Answer

Your problem lies in the asList() call.

This forces the driver to iterate through the entire cursor (80,000 documents, a few GB) and keep all of it in memory.

batchSize(someLimit) and Cursor.batch() won't help here, because you still traverse the whole cursor no matter what the batch size is; the batch size only controls how many documents are fetched per round trip to the server.

Instead, you can:

1) Iterate the cursor instead of collecting it into a list: datasource.getCollection("mycollection").find() returns an iterable cursor, so loop over it rather than calling asList().

2) Read documents one at a time and feed them into a buffer (say, a list).

3) Every 1,000 documents or so, call the Hadoop API, clear the buffer, and start again (see the sketch below).
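For illustration, here is a rough sketch of that loop using the plain MongoDB Java driver 3.x (rather than MongoTemplate or Morphia). The host, database name, and the writeBatchToHadoop(...) helper are placeholders you would replace with your own connection details and Hadoop/HDFS write:

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.bson.Document;

import java.util.ArrayList;
import java.util.List;

public class MongoToHadoopBatch {

    private static final int BATCH_SIZE = 1000; // flush threshold, tune as needed

    public static void main(String[] args) {
        MongoClient client = new MongoClient("localhost", 27017); // placeholder host/port
        try {
            MongoCollection<Document> collection =
                    client.getDatabase("mydb").getCollection("mycollection"); // placeholder db name

            List<Document> buffer = new ArrayList<>(BATCH_SIZE);

            // Iterate the cursor document by document; only the buffer is held in memory.
            try (MongoCursor<Document> cursor = collection.find().iterator()) {
                while (cursor.hasNext()) {
                    buffer.add(cursor.next());
                    if (buffer.size() >= BATCH_SIZE) {
                        writeBatchToHadoop(buffer);
                        buffer.clear();
                    }
                }
            }
            // Flush whatever is left once the cursor is exhausted.
            if (!buffer.isEmpty()) {
                writeBatchToHadoop(buffer);
            }
        } finally {
            client.close();
        }
    }

    // Placeholder: replace with your actual Hadoop/HDFS write,
    // e.g. serialize the documents and append them to an HDFS file.
    private static void writeBatchToHadoop(List<Document> batch) {
    }
}

This way, at most BATCH_SIZE documents (plus the one currently being read) are held in memory at any time, instead of all 80,000.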

Ori Dar answered Sep 21 '22