I've got millions of items ordered by a precomputed score. Each item has many boolean attributes. Let's say there are about ten thousand possible attributes in total, and each item has about a dozen of them.
I'd like to be able to request, in real time (a few milliseconds), the top n items for ~any combination of attributes.
What solution would you recommend? I am looking for something extremely scalable.
--
- We are currently looking at MongoDB with an array index. Do you see any limitations?
- Solr is a possible solution, but we do not need text search capabilities.
MongoDB can handle what you want if you store your objects like this:
{ score:2131, attributes: ["attr1", "attr2", "attr3"], ... }
Then the following query will match all the items that have both attr1 and attr2:
c = db.mycol.find({ attributes: { $all: [ "attr1", "attr2" ] } })
but this one won't match the example item above, because that item does not have attr4:
c = db.mycol.find({ attributes: { $all: [ "attr1", "attr4" ] } })
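To make the $all semantics concrete, here is a small in-memory sketch in plain JavaScript. It does not talk to MongoDB at all; `items` and `matchesAll` are names made up for illustration, and the sample data beyond the first item is assumed.

```javascript
// In-memory illustration of what $all means for an array field.
// `items` and `matchesAll` are illustrative names, not MongoDB API.
const items = [
  { score: 2131, attributes: ["attr1", "attr2", "attr3"] },
  { score: 1500, attributes: ["attr1", "attr4"] },
  { score: 900, attributes: ["attr2", "attr3"] },
];

// An item matches when every required attribute is present in its array.
function matchesAll(item, required) {
  return required.every((attr) => item.attributes.includes(attr));
}

// Only the first item carries both attr1 and attr2.
const both = items.filter((item) => matchesAll(item, ["attr1", "attr2"]));
console.log(both.map((item) => item.score));
```

The first item fails the ["attr1", "attr4"] check for the same reason: it has attr1 but not attr4.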
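The query returns a cursor. If you want the results sorted, add a sort to the query. Use -1 to sort by descending score so the highest-scoring items come first, and limit(n) to keep only the top n (10 here):
c = db.mycol.find({ attributes: { $all: [ "attr1", "attr2" ] } }).sort({ score: -1 }).limit(10)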
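Since the question asks for the top n items, the filter, sort, and limit can be assembled in one place. The following is only a sketch: `buildTopNQuery` is a hypothetical helper, and it assumes "top" means highest score, which the question does not state explicitly.

```javascript
// Hypothetical helper: builds the pieces of a "top n by score" query.
// Assumes a higher score is better, hence the descending sort.
function buildTopNQuery(requiredAttributes, n) {
  if (!Number.isInteger(n) || n <= 0) {
    throw new RangeError("n must be a positive integer");
  }
  return {
    // With no required attributes, use an empty filter so that every
    // item is a candidate instead of passing an empty $all list.
    filter:
      requiredAttributes.length > 0
        ? { attributes: { $all: requiredAttributes } }
        : {},
    sort: { score: -1 },
    limit: n,
  };
}

const topTen = buildTopNQuery(["attr1", "attr2"], 10);
// In the mongo shell this would be used as:
//   db.mycol.find(topTen.filter).sort(topTen.sort).limit(topTen.limit)
console.log(JSON.stringify(topTen));
```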
Have a look at Advanced Queries to see what's possible.
Appropriate indexes can be set up as follows (createIndex is the current name; older shells call it ensureIndex):
db.mycol.createIndex({attributes:1, score:1})
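As a rough mental model of the compound index on attributes and score: an index on an array field holds one key per array element, so each item contributes one (attribute, score) entry per attribute. The sketch below models that in plain JavaScript. It is a conceptual illustration with made-up names (`buildIndexEntries`, `sampleDocs`), not a description of MongoDB's actual storage format.

```javascript
// Conceptual model only: one (attribute, score) entry per array element.
const sampleDocs = [
  { _id: 1, score: 2131, attributes: ["attr1", "attr2", "attr3"] },
  { _id: 2, score: 1500, attributes: ["attr1", "attr4"] },
  { _id: 3, score: 900, attributes: ["attr2"] },
];

function buildIndexEntries(documents) {
  const entries = [];
  for (const doc of documents) {
    for (const attr of doc.attributes) {
      entries.push({ attribute: attr, score: doc.score, id: doc._id });
    }
  }
  // Order by attribute, then by score, mirroring {attributes: 1, score: 1}.
  entries.sort((a, b) => {
    if (a.attribute < b.attribute) return -1;
    if (a.attribute > b.attribute) return 1;
    return a.score - b.score;
  });
  return entries;
}

const indexEntries = buildIndexEntries(sampleDocs);
// All entries for one attribute sit together, already ordered by score.
const attr1Ids = indexEntries
  .filter((entry) => entry.attribute === "attr1")
  .map((entry) => entry.id);
console.log(indexEntries.length, attr1Ids);
```

In this model a single-attribute lookup reads one contiguous, score-ordered run of entries. A query on several attributes still has to combine or filter candidates, so it is worth checking with explain() how such queries behave on real data.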
And you can get performance information using:
db.mycol.find({ attributes: { $all: [ "attr1" ] }}).explain()
explain() reports how many objects were scanned, how long the operation took, and various other statistics.
This is exactly the kind of problem MongoDB can deal with. The fact that your attributes are booleans helps here. A possible schema is listed below:
[
  {
    true_tags: ["attr1", "attr2", "attr3", ...],
    false_tags: ["attr4", "attr5", "attr6", ...]
  },
]
Then we can index true_tags and false_tags, and searching with query operators such as $in and $all should be efficient.
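A query against this schema could be assembled along the following lines. This is a sketch under assumptions: `buildTagFilter` is a hypothetical helper, and it assumes every attribute an item lacks is listed explicitly in false_tags.

```javascript
// Hypothetical helper for the true_tags / false_tags schema.
// Assumes attributes known to be false are stored in false_tags.
function buildTagFilter(mustBeTrue, mustBeFalse) {
  const filter = {};
  // Only add a condition when there is something to require.
  if (mustBeTrue.length > 0) {
    filter.true_tags = { $all: mustBeTrue };
  }
  if (mustBeFalse.length > 0) {
    filter.false_tags = { $all: mustBeFalse };
  }
  return filter;
}

const tagFilter = buildTagFilter(["attr1", "attr2"], ["attr4"]);
// In the mongo shell: db.mycol.find(tagFilter)
console.log(JSON.stringify(tagFilter));
```

With about ten thousand possible attributes and roughly a dozen true per item, an exhaustive false_tags array would be very large, so it may be more practical to store only the attributes you actually need to query as false.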