I have a huge set data records in disk that arranged in sorted order based on some key(s). The data is read into memory a block (thousands of records) at a time. I have to search and display all records matching a key. I was thinking of some binary search based algorithm, but I have some restrictions here.
Can someone help me devise an efficient strategy that could work in C++. Will it be efficient to go with the linear search method.
+---+
| 1 | Block1
| 3 |
| 3 |
| 4 |
+---+
| 4 | Block2
| 6 |
| 7 |
| 8 |
+---+
| 8 | Block3
| 8 |
| 8 |
| 8 |
+---+
| 8 | Block4
| 14|
| 15|
| 16|
+---+
You could build a secondary array that consists of the first entry in each block then run binary search on that array. The indices for the array should corresponding directly with the block indices making it an O(1) lookup to get the corresponding block.
It cuts the worst case from O(n) to O(logn) and is still relatively simple.
Your idea, using a binary search, is correct. And you can avoid the linear scans alltogether by saving both the minimum and maximum value in each node. In your example the constructed binary-search-tree will look like this:
Block1 <- (1,4)
(1,8)
Block2 <- (4,8)
(1,16)
Block3 <- (8,8)
(8,16)
Block4 <- (8,16)
....
Having both the max and min values makes it efficient to compute the higher nodes. In your example, if you want to search for "8" you will go ...->...->...->(1,16)->(1,8)->(4,8), so you found the correct block without seeking backward and in the most efficent (log(n)) correct way.
It's a well-known problem, and there is a well-known data structure to solve it, employed mostly in databases :)
The idea is to use a B+ Tree.
The idea is to superpose a kind of Binary Search Tree (except that there'll more than 2 children per node) on top of the structure to search in.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With