I am designing the architecture of a full-text search engine. One of the requirements is answering queries over large datasets with low response time. One approach I have found is to split the inverted index into partitions. There are two strategies for this: term-based partitioning and document-based partitioning. Is there any other way to make inverted-index search faster over large datasets?
This video is a talk by Shay Banon, the developer of ElasticSearch, a distributed full-text search engine. In it he discusses the pros and cons of term-based and document-based partitioning.
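Basically, term-based partitioning generates too much network traffic between processes/nodes, since the posting lists for the different terms of a query live on different nodes and have to be shipped across the network to be combined. It is also harder to implement well. Document-based partitioning is much simpler to implement and to get results from: each node holds a complete index for its own subset of documents, answers the query locally, and only the small per-node result lists need to be merged.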
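To make the document-based approach concrete, here is a minimal in-process sketch, not how ElasticSearch or any real engine implements it. The class names `Shard` and `DocPartitionedIndex`, the modulo routing by document id, and the term-frequency scoring are all assumptions made for illustration; a real system would put each shard on its own node and use a proper ranking function.

```python
from collections import defaultdict
import heapq


class Shard:
    """One partition: a complete inverted index over its own documents only."""

    def __init__(self):
        # term -> {doc_id: term frequency in that document}
        self.postings = defaultdict(dict)

    def add(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def search(self, terms, k):
        # AND query: every posting list needed is local to this shard,
        # so the intersection involves no network traffic.
        lists = [self.postings.get(t, {}) for t in terms]
        if not lists:
            return []
        common = set(lists[0]).intersection(*lists[1:])
        # Toy score (an assumption): sum of term frequencies.
        scored = [(sum(p[d] for p in lists), d) for d in common]
        return heapq.nlargest(k, scored)


class DocPartitionedIndex:
    """Hypothetical coordinator: routes documents and scatter-gathers queries."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # Assumed routing rule: integer doc_id modulo the number of shards.
        self.shards[doc_id % len(self.shards)].add(doc_id, text)

    def search(self, query, k=10):
        terms = query.lower().split()
        # Scatter: each shard returns at most k (score, doc_id) pairs.
        partial = [hit for shard in self.shards for hit in shard.search(terms, k)]
        # Gather: merge the small per-shard lists into the global top k.
        return [doc_id for _, doc_id in heapq.nlargest(k, partial)]


if __name__ == "__main__":
    index = DocPartitionedIndex(num_shards=3)
    index.add(0, "inverted index search")
    index.add(1, "distributed search engine")
    index.add(2, "search search engine")
    print(index.search("search engine", k=2))
```

The point of the sketch is that only up to k small (score, doc_id) pairs per shard cross the shard boundary. With term-based partitioning, the posting lists for "search" and "engine" could sit on different nodes, and one of them would have to be transferred in full before the intersection could be computed.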
Moreover, in this lecture Jeffrey Dean also explains the differences and says that Google uses document-based partitioning.
These are the two main ways to distribute a search engine. I'm not aware of any others, but you may want to search the Information Retrieval literature for newer work on the subject.