Searching huge sorted chunks of data

Question

I have a huge set data records in disk that arranged in sorted order based on some key(s). The data is read into memory a block (thousands of records) at a time. I have to search and display all records matching a key. I was thinking of some binary search based algorithm, but I have some restrictions here.

Records can be only sequentially looked up within a block from the start of the block.
Records with the same key can span multiple blocks (as shown in the figure - 8 spans). In binary search, if I am loading the middle block and if the first record matches, then I have to scan the blocks previous to the matched block.

Can someone help me devise an efficient strategy that could work in C++. Will it be efficient to go with the linear search method.

+---+
| 1 | Block1
| 3 |
| 3 |
| 4 |
+---+
| 4 | Block2
| 6 |
| 7 |
| 8 |
+---+
| 8 | Block3
| 8 |
| 8 |
| 8 |
+---+
| 8 | Block4
| 14|
| 15|
| 16|
+---+

Michael · Accepted Answer

You could build a secondary array that consists of the first entry in each block then run binary search on that array. The indices for the array should corresponding directly with the block indices making it an O(1) lookup to get the corresponding block.

It cuts the worst case from O(n) to O(logn) and is still relatively simple.

Bernd Elkemann · Answer

Your idea, using a binary search, is correct. And you can avoid the linear scans alltogether by saving both the minimum and maximum value in each node. In your example the constructed binary-search-tree will look like this:

Block1    <- (1,4)
                    (1,8)
Block2    <- (4,8)
                          (1,16)
Block3    <- (8,8)
                    (8,16)
Block4    <- (8,16)

....

Having both the max and min values makes it efficient to compute the higher nodes. In your example, if you want to search for "8" you will go ...->...->...->(1,16)->(1,8)->(4,8), so you found the correct block without seeking backward and in the most efficent (log(n)) correct way.

Matthieu M. · Answer

It's a well-known problem, and there is a well-known data structure to solve it, employed mostly in databases :)

The idea is to use a B+ Tree.

The idea is to superpose a kind of Binary Search Tree (except that there'll more than 2 children per node) on top of the structure to search in.

Searching huge sorted chunks of data

Tags:

c++

algorithm

search

binary-search

Arpit

3 Answers

Michael

Bernd Elkemann

Matthieu M.

Recent Activity

Donate For Us

Searching huge sorted chunks of data

Tags:

c++

algorithm

search

binary-search

Arpit

3 Answers

Michael

Bernd Elkemann

Matthieu M.

Related questions

Recent Activity

Donate For Us