I was watching Adrien Grand's talk on Lucene's index architecture and a point he makes is that Lucene uses sorted arrays to represent the dictionary part of its inverted indices. What's the reasoning behind using sorted arrays instead of hash tables (the "classic" inverted index data structure)?
Hash tables provide O(1) insertion and access, which to me seems like it would help a lot with quickly processing queries and merging index segments. On the other hand, sorted arrays can only offer O(log N) access and (gasp) O(N) insertion, although merging two sorted arrays is the same complexity as merging two hash tables.
The only downsides to hash tables that I can think of are a larger memory footprint (this could indeed be a problem) and less cache friendliness (although operations like querying a sorted array require binary search which is just as cache unfriendly).
So what's up? The Lucene devs must have had a very good reason for using arrays. Is it something to do with scalability? Disk read speeds? Something else entirely?
Well, I will speculate here (this should probably be a comment, but it's going to be too long).
A `HashMap` is in general a fast look-up structure with O(1) search time on average. But that is only the average case: since Java 8, a `HashMap` converts a heavily collided bucket into a tree of `TreeNode`s, so the search inside such a bucket is O(log n). And even if we treat its search complexity as O(1), that does not mean it is the same time-wise as some other structure's O(1); it just means the cost is constant for each separate data structure.
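To make the lookup comparison concrete, here is a minimal Java sketch (the terms and ordinals are made up for illustration) contrasting binary search over a sorted term array with a `HashMap` lookup:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class TermLookup {
    public static void main(String[] args) {
        // A tiny, lexicographically sorted term dictionary (hypothetical terms).
        String[] sortedTerms = {"apache", "index", "lucene", "search", "term"};

        // O(log n) lookup: binary search yields the term's ordinal, which can
        // then index into parallel arrays of postings metadata.
        int ord = Arrays.binarySearch(sortedTerms, "lucene");
        System.out.println("ordinal of 'lucene' = " + ord); // prints 2

        // The HashMap alternative: O(1) average lookup, but every entry is a
        // separate node object with its own header and references.
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < sortedTerms.length; i++) dict.put(sortedTerms[i], i);
        System.out.println("ordinal of 'lucene' = " + dict.get("lucene"));
    }
}
```

Note that the sorted-array lookup gives you an ordinal for free, which is handy for compact side tables; a `HashMap` would need to store those ordinals explicitly as values.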
Memory. Indeed, I will give an example here. In short, storing 15_000_000 entries in a `HashMap` would require a little over 1 GB of RAM; sorted arrays are probably much more compact, especially since they can hold primitives instead of objects.
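As a back-of-envelope illustration (the per-entry byte counts below are rough assumptions for a typical 64-bit JVM with compressed references, not measurements), compare a boxed `HashMap` against a primitive `long[]` for 15,000,000 entries:

```java
public class MemoryEstimate {
    public static void main(String[] args) {
        long n = 15_000_000L;

        // Assumed per-entry costs (illustrative, JVM-dependent):
        long hashMapNode = 32;  // HashMap.Node: header + hash + key/value/next refs
        long boxedKey    = 24;  // a java.lang.Long object wrapping each key
        long tableSlot   = 4;   // reference slot in the bucket array

        long hashMapBytes = n * (hashMapNode + boxedKey + tableSlot);
        long arrayBytes   = n * Long.BYTES;  // a long[] stores each value inline

        System.out.println("HashMap ~ " + hashMapBytes / (1 << 20) + " MiB");
        System.out.println("long[]  ~ " + arrayBytes / (1 << 20) + " MiB");
    }
}
```

Even with generous assumptions, the per-entry object overhead dominates the 8 bytes of actual payload, which is why the primitive array comes out several times smaller.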
Putting entries in a `HashMap` (usually) requires all the keys to be re-hashed when the table grows, and that can be a significant performance hit, since they potentially all have to move to different bucket locations.
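One consequence worth spelling out: two already-sorted dictionaries can be combined in a single sequential pass with no hashing at all, which is essentially what merging two index segments needs. A minimal sketch (the segment contents are hypothetical; in a real merge, duplicate terms would additionally need their postings combined):

```java
import java.util.Arrays;

public class SegmentMerge {
    // Linear two-pointer merge of two sorted term arrays, the way the sorted
    // dictionaries of two segments can be combined without rebuilding a table.
    static String[] merge(String[] a, String[] b) {
        String[] out = new String[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            out[k++] = a[i].compareTo(b[j]) <= 0 ? a[i++] : b[j++];
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }

    public static void main(String[] args) {
        String[] seg1 = {"apache", "lucene", "term"};
        String[] seg2 = {"index", "search"};
        // The merged result is itself sorted, produced in one sequential
        // pass over both inputs.
        System.out.println(Arrays.toString(merge(seg1, seg2)));
    }
}
```

Merging two hash tables of the same total size is also linear, but every key must be re-hashed and scattered to a random bucket, whereas the sorted merge is a purely sequential read and write.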
Probably one extra point here: searches in ranges. With a hash-based structure those would probably require some `TreeMap` on the side, whereas sorted arrays are much better suited for them. I'm thinking about partitioning an index (maybe they do it internally).
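A sorted array answers prefix and range queries with two binary searches plus a contiguous scan, something a plain `HashMap` cannot do without visiting every key. A sketch, using made-up terms:

```java
import java.util.Arrays;

public class PrefixScan {
    public static void main(String[] args) {
        // Sorted dictionary (hypothetical terms).
        String[] terms = {"index", "indices", "inverted", "lucene", "search"};

        String prefix = "ind";
        // Lower bound: position of the first term >= "ind".
        int lo = Arrays.binarySearch(terms, prefix);
        if (lo < 0) lo = -lo - 1;  // not found: convert to insertion point
        // Upper bound: position of the first term past the prefix range
        // ('\uffff' sorts after every character that can follow the prefix).
        int hi = Arrays.binarySearch(terms, prefix + '\uffff');
        if (hi < 0) hi = -hi - 1;

        // All terms starting with "ind" sit in one contiguous slice.
        System.out.println(Arrays.toString(Arrays.copyOfRange(terms, lo, hi)));
    }
}
```

This matters for an inverted index because wildcard and range queries (`ind*`, `[a TO f]`) expand to exactly such contiguous runs of the term dictionary.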
I have the same idea as you: arrays are usually contiguous memory, so they are probably much easier for a CPU to pre-fetch.
And the last point: put me in their shoes, and I would start with a `HashMap` first... I am sure there are compelling reasons for their decision. I wonder if they have actual tests that prove this choice.