
Storing an inverted index

I am working on an information retrieval project. I have built a full inverted index using Hadoop/Python. Hadoop outputs the index as (word, documentlist) pairs, which are written to a file. For quick access, I have created a dictionary (hash table) from that file. My question is: how do I store such an index on disk so that it can still be accessed quickly? At present I am storing the dictionary using the Python pickle module and loading from it, but that brings the whole index into memory at once (or does it?). Please suggest an efficient way of storing and searching through the index.

My dictionary structure is as follows (using nested dictionaries):

{word : {doc1:[locations], doc2:[locations], ....}}

so that I can get the documents containing a word with dictionary[word].keys(), and so on.
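For illustration, a toy instance of this structure might look like the sketch below. The words, document ids, and positions are made up, and the helper names docs_containing and positions are hypothetical, not part of my actual code.

```python
# Toy inverted index in the nested-dictionary shape described above.
# All words, document ids and positions here are invented for illustration.
index = {
    "hadoop": {"doc1": [0, 14], "doc3": [7]},
    "python": {"doc1": [3], "doc2": [1, 9, 22]},
}


def docs_containing(index, word):
    """Return the sorted ids of the documents that contain `word`."""
    return sorted(index.get(word, {}))


def positions(index, word, doc_id):
    """Return the positions of `word` inside `doc_id` (empty list if absent)."""
    return index.get(word, {}).get(doc_id, [])


print(docs_containing(index, "python"))
print(positions(index, "hadoop", "doc1"))
```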

easysid asked Sep 10 '10 19:09





1 Answer

shelve

"At present I am storing the dictionary using python pickle module and loading from it but it brings the whole of index into memory at once (or does it?)."

Yes, it does bring it all in.

Is that a problem? If it's not an actual problem, then stick with it.

If it's a problem, what kind of problem do you have? Too slow? Too fast? Too colorful? Too much memory used? What problem do you have?
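If memory does turn out to be the problem, the shelve suggestion above could be sketched roughly as follows. This is a minimal sketch, not a tuned solution: the function names build_shelf and lookup, the file name index_shelf, and the sample data are all assumptions made for illustration. A shelf stores one pickled value per key in a dbm file, so looking up a word unpickles only that word's postings rather than the whole index.

```python
import os
import shelve
import tempfile


def build_shelf(path, index):
    """Write each word's postings dict to the shelf as a separate entry."""
    # flag="n" always creates a new, empty shelf at `path`.
    with shelve.open(path, flag="n") as db:
        for word, postings in index.items():
            db[word] = postings  # shelf keys must be strings


def lookup(path, word):
    """Fetch the postings for one word; only that entry is unpickled."""
    with shelve.open(path, flag="r") as db:
        return db.get(word, {})


# Hypothetical sample data in the question's {word: {doc: [locations]}} shape.
sample = {
    "hadoop": {"doc1": [0, 14], "doc3": [7]},
    "python": {"doc1": [3], "doc2": [1, 9, 22]},
}

path = os.path.join(tempfile.mkdtemp(), "index_shelf")
build_shelf(path, sample)
print(sorted(lookup(path, "python")))
```

Opening the shelf on every lookup keeps the sketch short; a long-running search process would more likely open it once and reuse the handle.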

S.Lott answered Oct 02 '22 15:10
