I was watching Adrien Grand's talk on Lucene's index architecture and a point he makes is that Lucene uses sorted arrays to represent the dictionary part of its inverted indices. What's the reasoning behind using sorted arrays instead of hash tables (the "classic" inverted index data structure)?
Hash tables provide O(1) insertion and access, which to me seems like it would help a lot with quickly processing queries and merging index segments. On the other hand, sorted arrays can only offer O(log N) access and (gasp) O(N) insertion, although merging two sorted arrays is the same complexity as merging two hash tables.
The only downsides to hash tables that I can think of are a larger memory footprint (this could indeed be a problem) and less cache friendliness (although operations like querying a sorted array require binary search which is just as cache unfriendly).
So what's up? The Lucene devs must have had a very good reason for using arrays. Is it something to do with scalability? Disk read speeds? Something else entirely?
Well, I will speculate here (this should probably be a comment, but it's going to be too long).
A `HashMap` is in general a fast look-up structure with O(1) search time on average. But that is only the average case: since Java 8, a `HashMap` converts a heavily collided bucket into a tree of `TreeNode`s, so the search inside such a bucket is O(log n). And even if we treat its search complexity as O(1), that does not mean it is the same time-wise as some other structure's O(1); it just means the cost is constant for each separate data structure.
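To make the lookup comparison concrete, here is a minimal Java sketch (the terms and ordinals are made up for illustration) contrasting binary search over a sorted term array with a `HashMap` lookup:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class TermLookup {
    public static void main(String[] args) {
        // A tiny, lexicographically sorted term dictionary (hypothetical terms).
        String[] sortedTerms = {"apache", "index", "lucene", "search", "term"};

        // O(log n) lookup: binary search yields the term's ordinal, which can
        // then index into parallel arrays of postings metadata.
        int ord = Arrays.binarySearch(sortedTerms, "lucene");
        System.out.println("ordinal of 'lucene' = " + ord); // prints 2

        // The HashMap alternative: O(1) average lookup, but every entry is a
        // separate node object with its own header and references.
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < sortedTerms.length; i++) dict.put(sortedTerms[i], i);
        System.out.println("ordinal of 'lucene' = " + dict.get("lucene"));
    }
}
```

Note that the sorted-array lookup gives you an ordinal for free, which is handy for compact side tables; a `HashMap` would need to store those ordinals explicitly as values.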
Memory. Indeed, I will give an example here. In short, storing 15_000_000 entries in a `HashMap` would require a little over 1 GB of RAM; sorted arrays are probably much more compact, especially since they can hold primitives instead of objects.
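As a back-of-envelope illustration (the per-entry byte counts below are rough assumptions for a typical 64-bit JVM with compressed references, not measurements), compare a boxed `HashMap` against a primitive `long[]` for 15,000,000 entries:

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long n = 15_000_000L;

        // Assumed per-entry costs (illustrative, JVM-dependent):
        long hashMapNode = 32;  // HashMap.Node: header + hash + key/value/next refs
        long boxedKey    = 24;  // a java.lang.Long object wrapping each key
        long tableSlot   = 4;   // reference slot in the bucket array

        long hashMapBytes = n * (hashMapNode + boxedKey + tableSlot);
        long arrayBytes   = n * Long.BYTES;  // a long[] stores each value inline

        System.out.println("HashMap ~ " + hashMapBytes / (1 << 20) + " MiB");
        System.out.println("long[]  ~ " + arrayBytes / (1 << 20) + " MiB");
    }
}
```

Even with generous assumptions, the per-entry object overhead dominates the 8 bytes of actual payload, which is why the primitive array comes out several times smaller.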
Putting entries in a `HashMap` (usually) requires all the keys to be re-hashed when the table grows, and that can be a significant performance hit, since they potentially all have to move to different bucket locations.
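One consequence worth spelling out: two already-sorted dictionaries can be combined in a single sequential pass with no hashing at all, which is essentially what merging two index segments needs. A minimal sketch (the segment contents are hypothetical; in a real merge, duplicate terms would additionally need their postings combined):

```java
import java.util.Arrays;

public class SegmentMerge {
    // Linear two-pointer merge of two sorted term arrays, the way the sorted
    // dictionaries of two segments can be combined without rebuilding a table.
    static String[] merge(String[] a, String[] b) {
        String[] out = new String[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = a[i].compareTo(b[j]) <= 0 ? a[i++] : b[j++];
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    public static void main(String[] args) {
        String[] seg1 = {"apache", "lucene", "term"};
        String[] seg2 = {"index", "search"};
        // The merged result is itself sorted, produced in one sequential
        // pass over both inputs.
        System.out.println(Arrays.toString(merge(seg1, seg2)));
    }
}
```

Merging two hash tables of the same total size is also linear, but every key must be re-hashed and scattered to a random bucket, whereas the sorted merge is a purely sequential read and write.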
Probably one extra point here: searches in ranges. With a hash-based structure those would probably require some `TreeMap` on the side, whereas sorted arrays are much better suited for them. I'm thinking about partitioning an index (maybe they do it internally).
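A sorted array answers prefix and range queries with two binary searches plus a contiguous scan, something a plain `HashMap` cannot do without visiting every key. A sketch, using made-up terms:

```java
import java.util.Arrays;

public class PrefixScan {
    public static void main(String[] args) {
        // Sorted dictionary (hypothetical terms).
        String[] terms = {"index", "indices", "inverted", "lucene", "search"};

        String prefix = "ind";
        // Lower bound: position of the first term >= "ind".
        int lo = Arrays.binarySearch(terms, prefix);
        if (lo < 0) lo = -lo - 1;  // not found: convert to insertion point
        // Upper bound: position of the first term past the prefix range
        // ('\uffff' sorts after every character that can follow the prefix).
        int hi = Arrays.binarySearch(terms, prefix + '\uffff');
        if (hi < 0) hi = -hi - 1;

        // All terms starting with "ind" sit in one contiguous slice.
        System.out.println(Arrays.toString(Arrays.copyOfRange(terms, lo, hi)));
    }
}
```

This matters for an inverted index because wildcard and range queries (`ind*`, `[a TO f]`) expand to exactly such contiguous runs of the term dictionary.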
I have the same idea as you: arrays are usually contiguous memory, so they are probably much easier for a CPU to pre-fetch.
And the last point: put me in their shoes, and I would start with a `HashMap` first... I am sure there are compelling reasons for their decision. I wonder if they have actual tests that prove this choice.