Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

NoSQL: indexing and keyword-based searching

I have an application that stores items (e.g. web documents). Each item can feature a arbitrary large set of tags. And typical a common query is to retrieve all documents with given set of tags. Well, a pretty common Web application.

Now I'm thinking about a NoSQL database as persistent storage. Various NoSQL systems (e.g. MongoDB) support secondary indexes and with that keyword-based searches. Examples showing how to do it in different systems are easy to find. The problem is, I would like to know what's going on "under the hood", i.e. how/where the secondary indexes stored, and how a query with a list of tags is actually executed. Particularly in systems with many nodes.

I'm aware of solutions based on Map/Reduce or similar. But here I'm interested how the indexing works. Questions I have, for example, are:

  • Does the secondary index only store the item/object id or more?
  • If a query contains k tags, are k subqueries - one for each tag - executed and the k partial results are combined one the initiating node?

Where can I find such information for different NoSQL systems? Thanks a lot for any hints.

Christian

like image 493
Christian Avatar asked Nov 05 '22 08:11

Christian


1 Answers

In MongoDB an index on tags would be done by utilizing the multi-keys feature whereby the database tries to match documents against each element of an array. You would index this tags attribute of a given document which would create a btree that is constructed out of ranges of tags in that array.

You can learn more about multikeys here and can get more information about indexing in MongoDB by watching this presentation: MongoDB Internals

Does the secondary index only store the item/object id or more?

The indexes consist of the indexed field (lets say it's a tags array in your case, then the field would be a single tag) and an offset used to efficiently locate the document in memory. It also has some padding + other overhead as described here

If a query contains k tags, are k subqueries - one for each tag - executed and the k partial results are combined one the initiating node?

It depends, but if, for example, the query were using an $or on the tag field I think the queries are performed in parallel, each in O(log n) time, and the results are combined to form the result set but I'm not sure about this though.

like image 194
Tyler Brock Avatar answered Nov 09 '22 14:11

Tyler Brock