Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Searching huge sorted chunks of data

I have a huge set data records in disk that arranged in sorted order based on some key(s). The data is read into memory a block (thousands of records) at a time. I have to search and display all records matching a key. I was thinking of some binary search based algorithm, but I have some restrictions here.

  1. Records can be only sequentially looked up within a block from the start of the block.
  2. Records with the same key can span multiple blocks (as shown in the figure - 8 spans). In binary search, if I am loading the middle block and if the first record matches, then I have to scan the blocks previous to the matched block.

Can someone help me devise an efficient strategy that could work in C++. Will it be efficient to go with the linear search method.

+---+
| 1 | Block1
| 3 |
| 3 |
| 4 |
+---+
| 4 | Block2
| 6 |
| 7 |
| 8 |
+---+
| 8 | Block3
| 8 |
| 8 |
| 8 |
+---+
| 8 | Block4
| 14|
| 15|
| 16|
+---+
like image 776
Arpit Avatar asked Mar 07 '11 22:03

Arpit


3 Answers

You could build a secondary array that consists of the first entry in each block then run binary search on that array. The indices for the array should corresponding directly with the block indices making it an O(1) lookup to get the corresponding block.

It cuts the worst case from O(n) to O(logn) and is still relatively simple.

like image 134
Michael Avatar answered Nov 14 '22 01:11

Michael


Your idea, using a binary search, is correct. And you can avoid the linear scans alltogether by saving both the minimum and maximum value in each node. In your example the constructed binary-search-tree will look like this:

Block1    <- (1,4)
                    (1,8)
Block2    <- (4,8)
                          (1,16)
Block3    <- (8,8)
                    (8,16)
Block4    <- (8,16)

....

Having both the max and min values makes it efficient to compute the higher nodes. In your example, if you want to search for "8" you will go ...->...->...->(1,16)->(1,8)->(4,8), so you found the correct block without seeking backward and in the most efficent (log(n)) correct way.

like image 39
Bernd Elkemann Avatar answered Nov 14 '22 01:11

Bernd Elkemann


It's a well-known problem, and there is a well-known data structure to solve it, employed mostly in databases :)

The idea is to use a B+ Tree.

The idea is to superpose a kind of Binary Search Tree (except that there'll more than 2 children per node) on top of the structure to search in.

like image 2
Matthieu M. Avatar answered Nov 14 '22 01:11

Matthieu M.