Fast in-memory inverted index

Tags: indexing, lucene, information-retrieval, lucene.net

I am looking for a fast in-memory implementation of a generic inverted index. All I need is to store features with weights for a couple million entities and use the inverted index to compute similarities between entities using various distance functions.

All other attributes of entities I can store in some fast key-value store.

I hoped I could use Lucene purely as an inverted index, but I cannot see how to attach my own custom feature vector, with precomputed weights, to a document. Any recommendations would be much appreciated!

Thank you.

623

asked Jul 07 '11 02:07

evgenyp

1 Answer

I have been doing similar work and found that a Redis sorted set (zset) is pretty much what I need (though I am not actually using it right now; I rolled my own solution based on memory-mapped files).

Basically, a zset is a set of unique members, each carrying a numeric score, kept ordered by that score.

So you can have one sorted set per feature, where each feature maps to its (docid, score) entries:

feature -> [ {docid, score}, {docid, score}, ... ]

i.e.

zadd feature score docid
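
To make the layout concrete, here is a minimal pure-Python sketch of the same idea: one score-ordered set of docids per feature. It does not talk to Redis; the class `InvertedIndex` and its methods are hypothetical names that only borrow the command names for readability, and the range handling is a simplification of what ZRANGE supports.

```python
from collections import defaultdict


class InvertedIndex:
    """Hypothetical in-memory stand-in for one sorted set per feature."""

    def __init__(self):
        # feature -> {docid: score}
        self.postings = defaultdict(dict)

    def zadd(self, feature, score, docid):
        """Mirror `zadd feature score docid`: set (or overwrite) docid's score."""
        self.postings[feature][docid] = float(score)

    def zrange(self, feature, start=0, stop=-1):
        """Return (docid, score) pairs in ascending score order.

        `stop` is inclusive and -1 means "to the end". Other negative
        indices are not handled in this sketch.
        """
        items = sorted(
            self.postings.get(feature, {}).items(),
            key=lambda kv: (kv[1], kv[0]),  # by score, ties broken by docid
        )
        end = None if stop == -1 else stop + 1
        return items[start:end]


index = InvertedIndex()
index.zadd("color:red", 0.8, "doc1")
index.zadd("color:red", 0.3, "doc2")
index.zadd("size:large", 0.5, "doc1")
red_postings = index.zrange("color:red")
```

The feature names and weights above are made-up sample data. Sorting on every read keeps the sketch short; a real implementation would keep each posting list ordered on insert.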

Redis then has some nice operators to merge sets, extract ranges, etc. See ZUNIONSTORE and ZRANGE (http://redis.io/commands/zunionstore).

Very fast (supposedly) and all in memory (though Redis runs as a separate server process rather than as an embedded DB).
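
As a rough illustration of why the merge operator is useful for similarity, the sketch below emulates in plain Python what a weighted union with summed scores computes: for each docid, the sum over the query's features of query weight times stored score, i.e. a dot product. The function `weighted_union` and the sample data are hypothetical, not part of any Redis client.

```python
from collections import defaultdict


def weighted_union(postings, query_weights):
    """Emulate a weighted union with SUM aggregation over in-memory postings.

    postings:      feature -> {docid: score}
    query_weights: feature -> weight of that feature in the query entity
    Returns (docid, total) pairs, highest total first.
    """
    totals = defaultdict(float)
    for feature, weight in query_weights.items():
        # Only docs that share a feature with the query are ever touched.
        for docid, score in postings.get(feature, {}).items():
            totals[docid] += weight * score
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up sample data: two features, three documents.
postings = {
    "f1": {"a": 1.0, "b": 2.0},
    "f2": {"b": 1.0, "c": 3.0},
}
ranked = weighted_union(postings, {"f1": 2.0, "f2": 1.0})
```

If the stored vectors and the query vector are L2-normalized beforehand, the summed score is the cosine similarity; other distance functions would need a different aggregation step.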

126

answered Oct 04 '22 12:10

Grynn
