
How can I store the inverted document index on a disk?

I know this question has been asked many times on Stack Overflow and elsewhere, but none of the answers I have found are satisfying. Most solutions assume the whole index fits in memory, so it can be written to disk with Java serialization; when the index is needed, the whole thing must be loaded back into memory. Examples of this approach: solution 1, solution 2. That assumption does not always hold, so how should I store an inverted document index on disk when it does not fit in memory?

I would appreciate a solution in Java.

jerry_sjtu asked Mar 15 '12 12:03


People also ask

How do you store an inverted index?

Traditionally, an inverted index is written directly to a file on disk. For boolean retrieval queries (a document either contains all the words in the query or it does not), the postings for each term can be stored contiguously in that file.

What is inverted index in database?

An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple terms, it is a hashmap-like data structure that directs you from a word to a document or a web page.
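As a minimal in-memory sketch of that mapping (the class name `TinyInvertedIndex` and the simple lowercase-and-split tokenizer are assumptions made for illustration, not something from the question), each term maps to the sorted set of document IDs that contain it:

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration: term -> sorted set of document IDs.
final class TinyInvertedIndex {
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Splits the text on non-word characters and records docId under each token. */
    void add(int docId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty first token
            }
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Returns the IDs of documents containing the term (empty if unknown). */
    SortedSet<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }
}
```

This version lives entirely in memory, which is exactly the limitation the question is about; it only shows the shape of the data that has to end up on disk.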

How is a postings list stored?

When postings lists are stored on disk, they are stored (perhaps compressed) as a contiguous run of postings without explicit pointers (as in Figure 1.3), so as to minimize the size of the postings list and the number of disk seeks to read a postings list into memory.
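A rough Java sketch of that layout, under some assumptions: postings are uncompressed 4-byte document IDs, and a small term dictionary (term to offset and count) is kept separately. The names `PostingsFile` and `Entry` are hypothetical. Each list is written back to back, so reading one term takes a single seek plus one sequential read:

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;

final class PostingsFile {
    /** Where one term's postings live inside the postings file. */
    static final class Entry {
        final long offset; // byte offset of the first posting
        final int count;   // number of document IDs in the list
        Entry(long offset, int count) { this.offset = offset; this.count = count; }
    }

    /** Writes all lists contiguously and returns the dictionary term -> (offset, count). */
    static Map<String, Entry> write(Path file, SortedMap<String, int[]> index) throws IOException {
        Map<String, Entry> dictionary = new LinkedHashMap<>();
        long offset = 0;
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file.toFile())))) {
            for (Map.Entry<String, int[]> e : index.entrySet()) {
                int[] docIds = e.getValue();
                dictionary.put(e.getKey(), new Entry(offset, docIds.length));
                for (int id : docIds) {
                    out.writeInt(id); // big-endian, 4 bytes each
                }
                offset += (long) Integer.BYTES * docIds.length;
            }
        }
        return dictionary;
    }

    /** Seeks to one term's list and reads only that list into memory. */
    static int[] read(Path file, Entry entry) throws IOException {
        byte[] raw = new byte[entry.count * Integer.BYTES];
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            raf.seek(entry.offset);
            raf.readFully(raw);
        }
        int[] docIds = new int[entry.count];
        ByteBuffer.wrap(raw).asIntBuffer().get(docIds); // ByteBuffer is big-endian by default
        return docIds;
    }
}
```

In this sketch only the dictionary has to stay in memory (or be persisted in a file of its own); the postings themselves are read on demand. For an index that does not fit in memory at build time, `write` would need to be fed from sorted runs merged on disk rather than from one in-memory map.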

Does Google use inverted index?

Searching through individual pages for keywords and topics would be a very slow process for search engines to identify relevant information. Instead, search engines (including Google) use an inverted index, also known as a reverse index.


1 Answer

I would try JDBM3. It supports tree and hash collections, and the only requirement is that each key or entry fits in memory.

If you have very large entries, I suggest storing each one as a file that can be memory mapped to extract portions of the data. In the lookup table you can map keys to file names (or make the file names the keys).
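A sketch of the one-file-per-entry idea using only the standard `java.nio` API, assuming each entry is a list of 4-byte document IDs. The class and method names (`MappedPostings`, `write`, `readSlice`) are made up for illustration, and the lookup table can be as simple as a `Map<String, Path>` or a file naming convention. Only the requested slice of the file is mapped:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class MappedPostings {
    /**
     * Writes one entry's document IDs to its own file as 4-byte big-endian ints.
     * Buffers the whole entry in memory, so this is only suitable for small demos;
     * a truly large entry would be appended in chunks.
     */
    static void write(Path file, int[] docIds) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(docIds.length * Integer.BYTES);
        buf.asIntBuffer().put(docIds); // the view shares the backing array
        Files.write(file, buf.array());
    }

    /** Maps only postings [from, from + count) of the file and copies them out. */
    static int[] readSlice(Path file, int from, int count) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer map = ch.map(
                    FileChannel.MapMode.READ_ONLY,
                    (long) from * Integer.BYTES,
                    (long) count * Integer.BYTES);
            int[] out = new int[count];
            map.asIntBuffer().get(out);
            return out;
        }
    }
}
```

For example, `MappedPostings.readSlice(path, 1_000_000, 100)` would touch only the pages that hold those 100 postings, regardless of how large the file is.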

Peter Lawrey answered Oct 27 '22 23:10