I wanna build a simple indexing function of search engine without any API, such as Lucene. In the inverted index, I just need to record basic information of each word, e.g. docID, position, and freqence. Now, I have several questions: <ol> <li>What kind of data structure is often used for building inverted index? Multidimensional list?</li> <li>After building the index, how to write it into files? What kind of format in the file? Like a table? Like drawing a index table on paper?</li> </ol>

You can see a very simple implementation of inverted index and search in TinySearchEngine. For your first question, if you want to build a simple (in memory) inverted index the straightforward data structure is a Hash map like this: <pre class="prettyprint lang-scala prettyprint-override"><code>val invertedIndex = new collection.mutable.HashMap[String, List[Posting]] </code></pre> or a Java-esque: <pre class="prettyprint lang-java prettyprint-override"><code>HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Postring>>(); </code></pre> The hash maps each term/word/token to a list of Postings. A <code>Posting</code> is just an object that represents an occurrence of a word inside a document: <pre class="prettyprint lang-scala prettyprint-override"><code>case class Posting(docId:Int, var termFrequency:Int) </code></pre> Indexing a new document is just a matter of tokenizing it (separating in tokens/words) and for each token insert a new Posting in the correct List of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase the termFrequency. There are other ways of doing this. For in memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert <code>Postings</code> once with the correct <code>termFrequency</code> instead of updating it every time. Regarding your second question, there are normally two cases: (1) you have an (almost) immutable index. You index all your data once and if you have new data you can just reindex. There is no need to real-time or indexing many times in an hour, for example. (2) new documents arrive all the time, and you need to search the newly arrived documents as soon as possible. For case (1), you can have at least 2 files: 1 - The Inverted Index file. It lists for each term all <code>Postings</code> (docId/termFrequency pairs). Here represented in plain text, but normally stored as binary data. <pre class="prettyprint"><code> Term1<docId1,termFreq><docId2,termFreq><docId3,termFreq><docId4,termFreq><docId5,termFreq><docId6,termFreq><docId7,termFreq> Term2<docId3,termFreq><docId5,termFreq><docId9,termFreq><docId10,termFreq><docId11,termFreq> Term3<docId1,termFreq><docId3,termFreq><docId10,termFreq> Term4<docId5,termFreq><docId7,termFreq><docId10,termFreq><docId12,termFreq> ... TermN<docId5,termFreq><docId7,termFreq> </code></pre> 2- The offset file. Stores for each term the offset to find its inverted list in the inverted index file. Here I'm representing the offset in characters but you'll normally store binary data, so the offset will be in bytes. This file can be loaded to memory at startup time. When you need to lookup a term inverted list, you lookup its offset and read the inverted list from the file. <pre class="prettyprint"><code>Term1 -> 0 Term2 -> 126 Term3 -> 222 .... </code></pre> Along with this 2 files you can (and generally will) have file(s) to store each term's IDF and each document's norm. For case (2), I'll try to briefly explain how Lucene (and consequently Solr and ElasticSearch) do it. The file format can be the same as explained above. The main difference is when you index new documents in systems like Lucene instead of rebuilding the index from scratch they just create a new one with only the new documents. So every time you have to index something, you do it in a new separated index. To perform a query in this "splitted" index you can run the query against each different index (in parallel) and merge the results together before returning to the user. Lucene calls this "little" indexes <code>segments</code>. The obvious concern here is that you'll get a lot of little segments very quick. To avoid this, you'll need a policy for merging segments and creating larger segments. For example, if you have more than <code>N segments</code> you can decide to merge all segments smaller than <code>10 KBs</code> together.

How to build a simple inverted index?

Tags:

indexing

information-retrieval

I wanna build a simple indexing function of search engine without any API, such as Lucene. In the inverted index, I just need to record basic information of each word, e.g. docID, position, and freqence.

Now, I have several questions:

What kind of data structure is often used for building inverted index? Multidimensional list?
After building the index, how to write it into files? What kind of format in the file? Like a table? Like drawing a index table on paper?

209

asked Sep 20 '12 11:09

Munichong

1 Answers

You can see a very simple implementation of inverted index and search in TinySearchEngine.

For your first question, if you want to build a simple (in memory) inverted index the straightforward data structure is a Hash map like this:

val invertedIndex = new collection.mutable.HashMap[String, List[Posting]]

or a Java-esque:

HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Postring>>();

The hash maps each term/word/token to a list of Postings. A Posting is just an object that represents an occurrence of a word inside a document:

case class Posting(docId:Int, var termFrequency:Int)

Indexing a new document is just a matter of tokenizing it (separating in tokens/words) and for each token insert a new Posting in the correct List of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase the termFrequency. There are other ways of doing this. For in memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert Postings once with the correct termFrequency instead of updating it every time.

Regarding your second question, there are normally two cases:

(1) you have an (almost) immutable index. You index all your data once and if you have new data you can just reindex. There is no need to real-time or indexing many times in an hour, for example.

(2) new documents arrive all the time, and you need to search the newly arrived documents as soon as possible.

For case (1), you can have at least 2 files:

1 - The Inverted Index file. It lists for each term all Postings (docId/termFrequency pairs). Here represented in plain text, but normally stored as binary data.

 Term1<docId1,termFreq><docId2,termFreq><docId3,termFreq><docId4,termFreq><docId5,termFreq><docId6,termFreq><docId7,termFreq>
 Term2<docId3,termFreq><docId5,termFreq><docId9,termFreq><docId10,termFreq><docId11,termFreq>
 Term3<docId1,termFreq><docId3,termFreq><docId10,termFreq>
 Term4<docId5,termFreq><docId7,termFreq><docId10,termFreq><docId12,termFreq>
 ...
 TermN<docId5,termFreq><docId7,termFreq>

2- The offset file. Stores for each term the offset to find its inverted list in the inverted index file. Here I'm representing the offset in characters but you'll normally store binary data, so the offset will be in bytes. This file can be loaded to memory at startup time. When you need to lookup a term inverted list, you lookup its offset and read the inverted list from the file.

Term1 -> 0
Term2 -> 126
Term3 -> 222
....

Along with this 2 files you can (and generally will) have file(s) to store each term's IDF and each document's norm.

For case (2), I'll try to briefly explain how Lucene (and consequently Solr and ElasticSearch) do it.

The file format can be the same as explained above. The main difference is when you index new documents in systems like Lucene instead of rebuilding the index from scratch they just create a new one with only the new documents. So every time you have to index something, you do it in a new separated index.

To perform a query in this "splitted" index you can run the query against each different index (in parallel) and merge the results together before returning to the user.

Lucene calls this "little" indexes segments.

The obvious concern here is that you'll get a lot of little segments very quick. To avoid this, you'll need a policy for merging segments and creating larger segments. For example, if you have more than N segments you can decide to merge all segments smaller than 10 KBs together.

125

answered Sep 17 '22 14:09

Felipe Hummel

Related questions
                            
                                performance difference between subscript indexing and linear indexing
                            
                                Can I force Lua's table indexing to start from zero?
                            
                                Order of columns in a multi-column index in MySQL
                            
                                How to index a list with a TensorFlow tensor?
                            
                                SQL Server - index on a computed column?
                            
                                How do indexes work on views?
                            
                                Finding the index of the first element (e.g "True") from a series/column
                            
                                "Specified key was too long; max key length is 1000 bytes"
                            
                                MySQL distinction between e and é (e acute) - UNIQUE index
                            
                                Lucene Index problems with "-" character
                            
                                Elasticsearch can't update non dynamic settings
                            
                                How to get the index of an integer from a list if the list contains a boolean?
                            
                                pandas dataframe: loc vs query performance
                            
                                Index for nullable column
                            
                                Should I add an index for all fields in the WHERE clause? - MySQL
                            
                                cannot subscript a value of type NSDictionary with an index of type string
                            
                                What does this sentence mean: Clustered indexes are stored physically on the table?
                            
                                Moving all non-clustered indexes to another filegroup in SQL Server
                            
                                How to create MongoDB MultiKey index on attribute of items in an array .NET Driver
                            
                                What's the difference between B-Tree and GiST index methods (in PostgreSQL)?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With