 

MongoDB and using DBRef with Spatial Data

I have a collection with 100 million documents of geometry.

I have a second collection with time data associated with each of those geometries. This will be 365 * 96 * 100 million, or about 3.5 trillion documents.

Rather than store the 100 million geometry entries 35,040 (365 * 96) times over, I want to keep them in separate collections and do some type of JOIN/DBRef/whatever I can in MongoDB.

First and foremost, I want to get a list of GUIDs from the geometry collection using a geoIntersection. That filters the 100 million down to roughly 5,000. Then, using those 5,000 geometry GUIDs, I want to filter the 3.5 trillion documents by GUID and by the additional date criteria I specify, aggregate the data, and find the average. I'm left with 5,000 geometries and 5,000 averages for the date criteria I specified.
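
In rough pymongo terms, the two-step flow I have in mind would look something like the sketch below (database, collection, and field names are just illustrative):

    from datetime import datetime
    from pymongo import MongoClient

    db = MongoClient()["geodata"]  # database/collection/field names are illustrative

    # Step 1: geo-intersect the geometry collection, keeping only the matching GUIDs.
    search_area = {
        "type": "Polygon",
        "coordinates": [[[-98, 30], [-97, 30], [-97, 31], [-98, 31], [-98, 30]]],
    }
    guids = [
        doc["guid"]
        for doc in db.geometries.find(
            {"geometry": {"$geoIntersects": {"$geometry": search_area}}},
            {"guid": 1, "_id": 0},
        )
    ]

    # Step 2: filter the time-series collection by those GUIDs plus a date range,
    # then average the measured value per geometry.
    start, end = datetime(2015, 6, 1), datetime(2015, 6, 8)
    pipeline = [
        {"$match": {"guid": {"$in": guids}, "ts": {"$gte": start, "$lt": end}}},
        {"$group": {"_id": "$guid", "avg_value": {"$avg": "$value"}}},
    ]
    averages = list(db.timeseries.aggregate(pipeline))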

This is basically a JOIN as I know it in SQL. Is this possible in MongoDB, and can it be done efficiently, say in under 10 seconds?

To clarify: as I understand it, this is what DBRefs are used for, but I've read that they are not efficient at all, and that with this much data they wouldn't be a good fit.

ParoX asked Jun 12 '15 at 20:06

People also ask

What is DBRef in MongoDB?

DBRefs are references from one document to another using the value of the first document's _id field, collection name, and, optionally, its database name, as well as any other fields. DBRefs allow you to more easily reference documents stored in multiple collections or databases.
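
For example, using the bson helpers that ship with the pymongo driver, a DBRef is just a small structure naming the target collection, _id, and optionally the database (the values below are made up):

    from bson import ObjectId
    from bson.dbref import DBRef

    # A reference from a time-series document back to its geometry document.
    ref = DBRef("geometries", ObjectId("507f1f77bcf86cd799439011"), database="geodata")

    # Stored in BSON this takes the conventional DBRef shape:
    # {"$ref": "geometries", "$id": ObjectId("507f1f77bcf86cd799439011"), "$db": "geodata"}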

What are the two different ways of Modelling relationships within MongoDB?

MongoDB relationships are the representation of how multiple documents are logically connected to each other in MongoDB. The embedded and referenced methods are the two ways to create such relationships.
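
As a quick sketch of the difference (the document shapes below are purely illustrative):

    # Embedded: the related data lives inside the parent document.
    embedded = {
        "guid": "abc-123",
        "geometry": {"type": "Point", "coordinates": [-97.5, 30.5]},
        "readings": [{"ts": "2015-06-12T00:00", "value": 3.1}],
    }

    # Referenced: the related documents live in a second collection and point
    # back via a key (a plain GUID here; a DBRef works the same way).
    geometry_doc = {"guid": "abc-123", "geometry": {"type": "Point", "coordinates": [-97.5, 30.5]}}
    reading_doc = {"guid": "abc-123", "ts": "2015-06-12T00:00", "value": 3.1}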

What are the data modeling concepts used in MongoDB?

MongoDB provides two types of data models: the embedded data model and the normalized data model. Based on the requirement, you can use either model while preparing your document.


1 Answer

If you're going to be dealing with a geometry and its time-series data together, it makes sense to store them in the same doc. A year's worth of data in 15-minute increments isn't a killer - and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also lets you sparse things up for missing data: you can encode the data differently if it's sparse, rather than indexing into a 35,040-slot array.
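
As a rough sketch of what such a combined document might look like, dense or sparse (the field names and slot encoding are assumptions, not a prescription):

    # Dense layout: one slot per 15-minute interval for the year (365 * 96 = 35,040).
    dense_doc = {
        "guid": "abc-123",
        "geometry": {"type": "Point", "coordinates": [-97.5, 30.5]},
        "year": 2015,
        "values": [0.0] * 35040,  # index = day_of_year * 96 + slot_of_day
    }

    # Sparse layout: only the slots that actually have data, keyed by slot index,
    # so missing intervals cost nothing.
    sparse_doc = {
        "guid": "abc-123",
        "geometry": {"type": "Point", "coordinates": [-97.5, 30.5]},
        "year": 2015,
        "values": {"100": 3.4, "7201": 2.9},
    }

At roughly 35,040 numeric slots per document this stays well under MongoDB's 16 MB document limit, but it's worth measuring with your actual value types.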

A $geoIntersects on a big pile of geometry data will be a performance issue, though. Make sure you have an appropriate index (like a 2dsphere index) on the geometry field to speed things up.
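
A minimal sketch of that index, assuming a GeoJSON field named geometry (names are illustrative):

    from pymongo import MongoClient, GEOSPHERE

    db = MongoClient()["geodata"]  # names are illustrative

    # 2dsphere index so $geoIntersects can use the index instead of
    # examining all 100 million geometry documents.
    db.geometries.create_index([("geometry", GEOSPHERE)])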

If there is any way you can build additional qualifiers into the query that can cheaply eliminate members from the more expensive search, you may make things zippier. Say the search will hit states in the US: you could first intersect the search area with state boundaries to find the states containing the geo data, and use something indexed like a state or postal code to qualify the geometry documents. That would be a really quick pre-search against ~50 documents. If the search boundary was first determined to hit 2 states, and the geometry records included a state field, you have just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query runs. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
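
A hedged sketch of that pre-filtering idea, assuming the geometry documents carry a state field and a small state_boundaries collection exists (both are assumptions for illustration):

    from pymongo import MongoClient

    db = MongoClient()["geodata"]  # names are illustrative
    search_area = {
        "type": "Polygon",
        "coordinates": [[[-98, 30], [-97, 30], [-97, 31], [-98, 31], [-98, 30]]],
    }

    # Cheap pre-search against ~50 state boundary documents.
    states = [
        doc["state"]
        for doc in db.state_boundaries.find(
            {"boundary": {"$geoIntersects": {"$geometry": search_area}}},
            {"state": 1, "_id": 0},
        )
    ]

    # The expensive query now only considers geometries tagged with those states;
    # the equality filter on "state" can be served by a cheap regular index.
    matches = db.geometries.find({
        "state": {"$in": states},
        "geometry": {"$geoIntersects": {"$geometry": search_area}},
    })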

Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.

The Software Barbarian answered Sep 25 '22 at 23:09