
Fast in-memory inverted index

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.

All other attributes of entities I can store in some fast key-value store.

I hoped I could use Lucene purely as an inverted index, but I cannot see how to attach my own custom feature vector, with precomputed weights, to a document. Any recommendations would be much appreciated!

Thank you.

evgenyp asked Jul 07 '11 02:07





1 Answer

I have been doing some similar work and have found that Redis's zset (sorted set) is pretty much what I need (though I am not actually using it right now; I have rolled my own solution based on memory-mapped files).

Basically, a zset is a set of unique members, each with a numeric score, kept ordered by that score.

So you can keep one sorted set per feature, where each member is a document id and its score is that feature's weight for the document:
feature->[ { docid, score }, {docid, score} ..]
i.e.
zadd feature score docid

Redis then has some nice operators to merge sets, extract ranges, etc. See zunionstore and zrange (http://redis.io/commands/zunionstore).
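To make the layout concrete, here is a minimal pure-Python sketch of the same idea, with no Redis involved. It assumes that a weighted union with SUM aggregation (what zunionstore does with its WEIGHTS option) is the merge you want, which amounts to a dot product between the query's feature weights and each document's feature weights. The names `index`, `zadd`, `zunion` and `top_k`, and the sample features and doc ids, are made up for illustration and are not part of any Redis client API.

```python
from collections import defaultdict

# Hypothetical in-process stand-in for the Redis layout:
# one "sorted set" per feature, stored as feature -> {docid: weight}.
index = defaultdict(dict)

def zadd(feature, weight, docid):
    """Mirror the argument order of `ZADD feature weight docid`."""
    index[feature][docid] = weight

def zunion(query):
    """Weighted union with SUM aggregation over the query's features.

    `query` maps feature -> query weight. The result maps docid -> the
    dot product of the query vector and that document's feature vector.
    """
    scores = defaultdict(float)
    for feature, q_weight in query.items():
        # Only documents that actually have this feature are visited.
        for docid, weight in index.get(feature, {}).items():
            scores[docid] += q_weight * weight
    return dict(scores)

def top_k(scores, k):
    """Return the k highest-scoring (docid, score) pairs, ties by docid."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

# Sample data (illustrative only).
zadd("red", 0.5, "doc1")
zadd("red", 1.0, "doc2")
zadd("round", 2.0, "doc1")
zadd("round", 1.0, "doc3")

scores = zunion({"red": 1.0, "round": 0.5})
ranked = top_k(scores, 2)
print(ranked)
```

Other distance functions (cosine, for example) would need the per-document norms stored somewhere as well, e.g. in the key-value store that holds the other entity attributes.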

It is supposedly very fast and keeps everything in memory, though note that Redis runs as a separate server process and is not an embedded db.

Grynn answered Oct 04 '22 12:10
