I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.
All other attributes of entities I can store in some fast key-value store.
I hoped I could use Lucene just as an inverted index, but I cannot see how to attach my own custom feature vector, with precomputed weights, to a document. Any recommendations would be much appreciated!
Thank you.
An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. Put simply, it is a hashmap-like data structure that takes you from a word to the documents or web pages containing it.
It is called an inverted index because it is the inversion of a forward index, which maps each document to the content it contains.
To build one, each document is given a document ID, and each term acts as a pointer to the IDs of the documents that contain it. The list of terms is then sorted alphabetically, with each term keeping its pointers to the corresponding document IDs.
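To make the mapping above concrete for the weighted-feature case in the question, here is a minimal Python sketch. The class name InvertedIndex, its methods, and the choice of cosine similarity are assumptions made for illustration; they do not come from Lucene or any other library, and a real index for a couple million entities would need a more compact layout than nested dicts.

```python
from collections import defaultdict
from math import sqrt


class InvertedIndex:
    """Hypothetical in-memory index: feature -> {entity_id: weight}."""

    def __init__(self):
        self.postings = defaultdict(dict)  # feature -> {entity_id: weight}
        self.norms = {}                    # entity_id -> Euclidean norm of its vector

    def add(self, entity_id, features):
        """Index one entity; `features` maps feature -> precomputed weight."""
        for feature, weight in features.items():
            self.postings[feature][entity_id] = weight
        self.norms[entity_id] = sqrt(sum(w * w for w in features.values()))

    def cosine_scores(self, query):
        """Return {entity_id: cosine similarity} for entities sharing a feature with `query`."""
        dots = defaultdict(float)
        # Only the posting lists of the query's features are visited.
        for feature, q_weight in query.items():
            for entity_id, weight in self.postings.get(feature, {}).items():
                dots[entity_id] += q_weight * weight
        q_norm = sqrt(sum(w * w for w in query.values()))
        return {
            entity_id: dot / (q_norm * self.norms[entity_id])
            for entity_id, dot in dots.items()
            if q_norm and self.norms[entity_id]
        }
```

Entities that share no feature with the query never appear in the result, which is the main saving over comparing the query against every stored vector. Other distance functions can reuse the same accumulation loop with a different final normalization.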
I have been doing some similar work and have discovered that Redis's zset (sorted set) is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory-mapped files).
Basically a zset is a set of unique members, each with an associated score, kept ordered by score.
So you can have one sorted set per feature, where each member is a docid and its score is the weight of that feature for that document:
feature -> [ {docid, score}, {docid, score}, ... ]
i.e.
zadd feature score docid
Redis then has some nice commands to merge sets, extract ranges, etc. See zunionstore and zrange (http://redis.io/commands/zunionstore).
It is supposedly very fast and keeps everything in memory, though Redis is not an embedded database.
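To show what the sorted-set layout above buys you, here is a rough pure-Python sketch that mimics the behaviour of zadd, zunionstore with WEIGHTS (default SUM aggregate), and a reverse range query. It does not talk to Redis at all; the helper names zadd, zunion, and top, and the feature keys used in the usage note, are made up for illustration. Using each query feature's weight as the multiplier, the combined score per docid is the dot product between the query vector and the document vector.

```python
from collections import defaultdict

# feature -> {docid: score}; stands in for one Redis sorted set per feature
zsets = defaultdict(dict)


def zadd(key, score, member):
    """Mimics `zadd key score member`: set the member's score in the set at `key`."""
    zsets[key][member] = score


def zunion(weighted_keys):
    """Mimics zunionstore with WEIGHTS and the default SUM aggregate.

    `weighted_keys` maps each key to its multiplier; returns {member: combined score}.
    """
    combined = defaultdict(float)
    for key, multiplier in weighted_keys.items():
        for member, score in zsets.get(key, {}).items():
            combined[member] += multiplier * score
    return dict(combined)


def top(scores, n):
    """Roughly mimics a reverse range query: the n highest-scoring members first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

For example, after zadd("color:red", 0.5, "doc1"), zadd("color:red", 0.2, "doc2"), and zadd("size:large", 1.0, "doc2"), calling zunion({"color:red": 2.0, "size:large": 1.0}) scores doc2 above doc1. Tie-breaking here follows insertion order, which differs from Redis, where members with equal scores are ordered lexicographically.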