
How to query when connecting MongoDB with Apache Spark

I am experimenting with Spark and MongoDB, using the mongodb-hadoop connector to bridge Spark and MongoDB communication. The example at https://github.com/plaa/mongo-spark works well for me.

Then, based on this example, I used a bigger dataset from https://github.com/10gen-interns/big-data-exploration, which has about 6 million records of flight data. What I want to do is query the MongoDB dataset and then do some further processing.

The schema for the flights data is in https://gist.github.com/sweetieSong/6016700

Here is a data example:

{ "_id" : ObjectId( "51bf19c4ca69141e42ddd1f7" ),
  "age" : 27,
  "airTime" : 316,
  "airlineId" : 19805,
  "arrDelay" : -37,
  "arrTime" : Date( 1336304580000 ),
  "carrier" : "AA",
  "carrierId" : "AA",
  "crsArrTime" : Date( 1336306800000 ),
  "crsDepTime" : Date( 1336294800000 ),
  "crsElapsedTime" : 380,
  "date" : Date( 1336262400000 ),
  "dayOfMonth" : 6,
  "dayOfWeek" : 7,
  "depDelay" : -5,
  "depTime" : Date( 1336294500000 ),
  "destAirport" : "LAX",
  "destAirportId" : 12892,
  "destCity" : "Los Angeles, CA",
  "destCityId" : 32575,
  "destState" : "California",
  "destStateId" : "CA",
  "destWAC" : 91,
  "distance" : 2475,
  "diverted" : true,
  "elapsedTime" : 348,
  "flightNum" : 1,
  "month" : 5,
  "numDivAirportLandings" : 0,
  "numFlights" : 1,
  "origAirport" : "JFK",
  "origAirportId" : 12478,
  "origCity" : "New York, NY",
  "origCityId" : 31703,
  "origState" : "New York",
  "origStateId" : "NY",
  "origWAC" : 22,
  "quarter" : 2,
  "tailNum" : "N323AA",
  "taxiIn" : 19,
  "taxiOut" : 13,
  "wheelsOff" : Date( 1336295280000 ),
  "wheelsOn" : Date( 1336303440000 ),
  "year" : 2012 }

My scala code is

import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkContext
import org.bson.BSONObject

val sc = new SparkContext("local", "Scala Word Count")

// Hadoop configuration handed to the mongodb-hadoop connector
val config = new Configuration()
config.set("mongo.input.uri", "mongodb://xx.xx.xx.xx:27017/flying.flights")
config.set("mongo.input.query", "{destAirport: 'LAX'}")
//config.set("mongo.input.query", "{_id.destAirport: 'LAX'}")

// RDD of (document id, document) pairs read through MongoInputFormat
val mongoRDD = sc.newAPIHadoopRDD(config, classOf[com.mongodb.hadoop.MongoInputFormat], classOf[Object], classOf[BSONObject])

println("We're running scala..count ", mongoRDD.count())

For testing purposes, I just want to first get all the records with destAirport 'LAX'. I don't know what the query should look like, so I tried two different query formats: "{destAirport: 'LAX'}" and "{_id.destAirport: 'LAX'}".

When running the application, the console outputs the following info:

INFO MongoCollectionSplitter: Created split: min={ "_id" : { "$oid" : "51bf29d8ca69141e42097d7f"}}, max= { "_id" : { "$oid" : "51bf29dfca69141e420991ad"}}

14/08/05 10:30:51 INFO Executor: Running task ID 751
14/08/05 10:30:51 INFO TaskSetManager: Finished TID 750 in 109 ms on localhost (progress: 751/1192)
14/08/05 10:30:51 INFO DAGScheduler: Completed ResultTask(0, 750)
14/08/05 10:30:51 INFO BlockManager: Found block broadcast_0 locally
14/08/05 10:30:51 INFO NewHadoopRDD: Input split: MongoInputSplit{URI=mongodb://178.62.35.36:27017/flying.flights, authURI=null, min={ "_id" : { "$oid" : "51bf2f95ca69141e421904e5"}}, max={ "_id" : { "$oid" : "51bf2f9dca69141e42191913"}}, query={ "_id.destAirport" : "LAX "}, sort={ }, fields={ }, notimeout=false}
14/08/05 10:30:51 INFO MongoRecordReader: Read 0.0 documents from:
14/08/05 10:30:51 INFO MongoRecordReader: MongoInputSplit{URI=mongodb://178.62.35.36:27017/flying.flights, authURI=null, min={ "_id" : { "$oid" : "51bf2f95ca69141e421904e5"}}, max={ "_id" : { "$oid" : "51bf2f9dca69141e42191913"}}, query={ "_id.destAirport" : "LAX "}, sort={ }, fields={ }, notimeout=false}
14/08/05 10:30:51 INFO Executor: Serialized size of result for 751 is 597
14/08/05 10:30:51 INFO Executor: Sending result for 751 directly to driver
14/08/05 10:30:51 INFO Executor: Finished task ID 751

No matter what the query is (even if I don't set a query at all), Spark always executes 1191 tasks. Each task outputs similar messages, and mongoRDD.count() always returns 0.

My first question is what is the right query?

Moreover, I previously thought that what mongodb-hadoop does is this: MongoDB first queries the whole collection, then sends the results back to Spark for processing. But now it seems that MongoDB splits the collection into many parts, queries each small part, and then sends the results of that part to Spark. Is that the case?

asked Aug 08 '14 by user824624


1 Answer

My first question is what is the right query?

I don't think there is a "right" query; you need to query based on the data you would like to process.

Moreover, I previously thought that what mongodb-hadoop does is this: MongoDB first queries the whole collection, then sends the results back to Spark for processing. But now it seems that MongoDB splits the collection into many parts, queries each small part, and then sends the results of that part to Spark. Is that the case?

I encountered the same issue.

I believe that newAPIHadoopRDD, given the MongoInputSplit.class, does not account for the query when calculating the splits. It is only applied after the splits are calculated. This means that no matter how lean your query may be, the number of splits will remain the same, and will be proportional to the size of the collection.

newAPIHadoopRDD is using the StandaloneMongoSplitter. Note that this class does not use the query to calculate the split boundaries; it just uses mongo's internal "splitVector" command. From the documentation here - http://api.mongodb.org/internal/current/commands.html - it also looks like splitVector does not account for the query.
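For reference, you can run splitVector by hand to see the split points the connector gets back; as far as I can tell it only takes a namespace, a key pattern and a chunk size, with no place for a query filter. Something like this sketch (untested, using the 2.x Java driver that mongo-hadoop ships with; the host, the exact option names, and whether the command must run against the collection's database or admin may vary by server version):

import com.mongodb.{BasicDBObject, MongoClient}

// Sketch: ask the server for the same split points the splitter would use.
val client = new MongoClient("xx.xx.xx.xx", 27017)
val cmd = new BasicDBObject("splitVector", "flying.flights")
  .append("keyPattern", new BasicDBObject("_id", 1))
  .append("maxChunkSize", 8) // target chunk size in MB (assumed option name)
val result = client.getDB("flying").command(cmd)
println(result) // expected to contain a "splitKeys" array of _id boundaries
client.close()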

I don't have a good solution, though. A better approach would be to split the mongo collection only after applying the query, but this requires another implementation of the splitter. There is some good reading about the issue here: http://www.ikanow.com/how-well-does-mongodb-integrate-with-hadoop/
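In the meantime, a possible workaround (a rough sketch, reusing the mongoRDD from your code) is to leave mongo.input.query empty and filter on the Spark side instead; it doesn't save any scanning work, but it should at least give correct results:

// Workaround sketch: load the collection without mongo.input.query and
// filter in Spark. Every split is still read, but the query mismatch per split is avoided.
val laxFlights = mongoRDD.filter { case (_, doc) =>
  doc.get("destAirport") == "LAX"
}
println("LAX flights: " + laxFlights.count())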

answered Oct 11 '22 by tomer