I wanna build a simple indexing function of search engine without any API, such as Lucene. In the inverted index, I just need to record basic information of each word, e.g. docID, position, and freqence.
Now, I have several questions:
What kind of data structure is often used for building inverted index? Multidimensional list?
After building the index, how to write it into files? What kind of format in the file? Like a table? Like drawing a index table on paper?
An inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a document or a set of documents. In simple words, it is a hashmap like data structure that directs you from a word to a document or a web page.
Inverted index In this method, a vector is formed where each document is given a document ID and the terms act as pointers. Then sorting of the list is done in alphabetical order and pointers are maintained to their corresponding document ID.
Searching through individual pages for keywords and topics would be a very slow process for search engines to identify relevant information. Instead, search engines (including Google) use an inverted index, also known as a reverse index.
The inverted index is typically stored on the disk and is loaded on a dynamic basis depending on the query... e.g. if the query is "stack overflow", you hit on the individual lists corresponding to the terms 'stack' and 'overflow'...
You can see a very simple implementation of inverted index and search in TinySearchEngine.
For your first question, if you want to build a simple (in memory) inverted index the straightforward data structure is a Hash map like this:
val invertedIndex = new collection.mutable.HashMap[String, List[Posting]]
or a Java-esque:
HashMap<String, List<Posting>> invertedIndex = new HashMap<String, List<Postring>>();
The hash maps each term/word/token to a list of Postings. A Posting
is just an object that represents an occurrence of a word inside a document:
case class Posting(docId:Int, var termFrequency:Int)
Indexing a new document is just a matter of tokenizing it (separating in tokens/words) and for each token insert a new Posting in the correct List of the hash map. Of course, if a Posting already exists for that term in that specific docId, you increase the termFrequency. There are other ways of doing this. For in memory inverted indexes this is OK, but for on-disk indexes you'd probably want to insert Postings
once with the correct termFrequency
instead of updating it every time.
Regarding your second question, there are normally two cases:
(1) you have an (almost) immutable index. You index all your data once and if you have new data you can just reindex. There is no need to real-time or indexing many times in an hour, for example.
(2) new documents arrive all the time, and you need to search the newly arrived documents as soon as possible.
For case (1), you can have at least 2 files:
1 - The Inverted Index file. It lists for each term all Postings
(docId/termFrequency pairs). Here represented in plain text, but normally stored as binary data.
Term1<docId1,termFreq><docId2,termFreq><docId3,termFreq><docId4,termFreq><docId5,termFreq><docId6,termFreq><docId7,termFreq>
Term2<docId3,termFreq><docId5,termFreq><docId9,termFreq><docId10,termFreq><docId11,termFreq>
Term3<docId1,termFreq><docId3,termFreq><docId10,termFreq>
Term4<docId5,termFreq><docId7,termFreq><docId10,termFreq><docId12,termFreq>
...
TermN<docId5,termFreq><docId7,termFreq>
2- The offset file. Stores for each term the offset to find its inverted list in the inverted index file. Here I'm representing the offset in characters but you'll normally store binary data, so the offset will be in bytes. This file can be loaded to memory at startup time. When you need to lookup a term inverted list, you lookup its offset and read the inverted list from the file.
Term1 -> 0
Term2 -> 126
Term3 -> 222
....
Along with this 2 files you can (and generally will) have file(s) to store each term's IDF and each document's norm.
For case (2), I'll try to briefly explain how Lucene (and consequently Solr and ElasticSearch) do it.
The file format can be the same as explained above. The main difference is when you index new documents in systems like Lucene instead of rebuilding the index from scratch they just create a new one with only the new documents. So every time you have to index something, you do it in a new separated index.
To perform a query in this "splitted" index you can run the query against each different index (in parallel) and merge the results together before returning to the user.
Lucene calls this "little" indexes segments
.
The obvious concern here is that you'll get a lot of little segments very quick. To avoid this, you'll need a policy for merging segments and creating larger segments. For example, if you have more than N segments
you can decide to merge all segments smaller than 10 KBs
together.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With