Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ridiculously slow mongoDB query on small collection in simple but big database

So I have a super simple database in mongoDB with a few collections:

> show collections
Aggregates <-- count: 92
Users <-- count: 68222
Pages <-- count: 1728288847, about 1.1TB
system.indexes

The Aggregates collection is an aggregate of the Pages collection, and each document looks like this:

> db.Aggregates.findOne()
{
        "_id" : ObjectId("50f237126ba71610eab3aaa5"),
        "daily_total_pages" : 16929799,
        "day" : 21,
        "month" : 9,
        "year" : 2011
}

Very simple. However, let's try and get the total page loads by adding all 92 days daily page loads together:

>>> def get_total():
...     start = datetime.now()
...     print sum([x['daily_total_pages'] for x in c.Aggregates.find()])
...     end = datetime.now()
...     print (end-start).seconds
...
>>> get_total()
1728288847
43

43 seconds?!??!??!?!

Those 92 aggregate results are tiny! I might as well just store them in a text file, this is crazy.

Or are they tiny? How big are they on disk according to mongo?

> db.Aggregates.stats()
{
        "ns" : "c.AggregateResults",
        "count" : 92,
        "size" : 460250104,
        "avgObjSize" : 5002718.521739131,
        "storageSize" : 729464832,
        "numExtents" : 7,
        "nindexes" : 2,
        "lastExtentSize" : 355647488,
        "paddingFactor" : 1.0690000000000066,
        "systemFlags" : 1,
        "userFlags" : 0,
        "totalIndexSize" : 16352,
        "indexSizes" : {
                "_id_" : 8176,
                "date_1" : 8176
        },
        "ok" : 1
}

438 megabytes total for those tiny daily numbers? Each one is approximately 280 bytes, so they should be a maximum of 25~30kb total. So the storage is huge and the query is super slow. Is it possible it could be fragmented on disk? I created the aggregates after inserting the documents into the the full Pages collection.

Anyone have any insights into this madness? :O


Edit: Solved with more specific find() query by Jared. The video below that Sammaye provided also gives some very interesting storage insights.


Edit 2: So I found out that using sys.getsizeof() is a really unreliable method of finding out your document's sizes since it doesn't recurse down any trees. So actually my docs were quite large, and the best thing to do was to use find({}, {'daily_page_loads'}) as a more specific query!

like image 507
LittleBobbyTables Avatar asked Jan 17 '13 18:01

LittleBobbyTables


1 Answers

The avgObjSize is out of line with the 280 byte estimate. It's saying your objects are averaging around 5MB and storageSize is near 1GB. If you're memory-constrained running a query that needs to access all 1GB of file would cause lots of page faults.

Have you tried compacting?

db.runCommand({compact: 'Aggregates'})

or repairing?

db.repairDatabase()

If that doesn't work try pulling back just those fields needed for the sum rather than pulling the whole document. It may be that those documents are actually 5MB and the time is spent pulling data over the wire.

def get_total():
    start = datetime.now()
    print sum([x['daily_total_pages'] for x in c.Aggregates.find({}, {"daily_total_pages": 1})])
    end = datetime.now()
    print (end-start).seconds
like image 127
jared Avatar answered Oct 11 '22 13:10

jared