Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Get interval of binary search tree as fast as sorted array

Currently I am implementing BST in Java for my university project. As we know, BST is quite good in searching a single unit which is O(log n) in a balanced tree.

But how to perform search between value a and b? (a < b)

Let's say I have this tree

│               ┌── 125
│           ┌── 122
│           │   └── 120
│       ┌── 117
│       │   │   ┌── 113
│       │   └── 112
│       │       └── 108
│   ┌── 86
│   │   │   ┌── 85
│   │   └── 72
└── 59
    │           ┌── 56
    │       ┌── 52
    │   ┌── 47
    │   │   │   ┌── 43
    │   │   └── 39
    │   │       │   ┌── 38
    │   │       └── 36
    └── 28
        │       ┌── 18
        │   ┌── 15
        └── 2
            └── 1

I want to create a method range(a,b) to return value between a and b inclusive. (Note: a and b are not necessary in the tree!)

For example: range(53,112) will return 56,59,72,85,86,108,112

Here is my pseudo code

/* recursive method */
range(a,b)
    range(a,b,root);

/* helper method */
range(a,b,node)
    if (a <= node.value <= b)
        if (node.left != null) and (node.value != a)
            range(a,b,node.left)

        print node.value

        if (node.right != null) and (node.value != b)
            range(a,b,node.right)

    else if node.value < a
        if (node.right != null)
            range(a,b,node.right)

    else // node.value > b
        if (node.left != null)
            range(a,b,node.left)

But I think my method is slower.

For example, in a sorted array, we have to perform binary search on a and b and get their respective index. After that, we iterate from index of a to index of b.

Is it true that BST will perform slower on searching multiple values? Is it possible to improve my algorithm to be as fast as a sorted array?

like image 865
karfai Avatar asked Nov 09 '22 06:11

karfai


1 Answers

Depending on how you can return the result, a sorted array may have the huge advantage of not needing to copy the results anywhere. Just returning a pointer+length view into the array is far faster and more cache-friendly than making another copy of the range into another buffer. A tree always has to copy elements out of the tree. Even if you do need a copy (to modify or whatever), memcpy is much much faster than walking a tree.

This isn't an issue if you can process on the fly while walking the tree (like you're doing with print).

I always seem to write answers before googling. It turns out that trees to answer range queries are a thing. Apparently it's usually done for 2D or 3D ranges (where each point has x and y coordinates, for example), which you can't do with a sorted array. I assume this is because even though it's as efficient as possible, it's not as efficient as returning a pointer+length window into a sorted array!

I'm not going to copy/paste the whole algorithm from wikipedia, just the clever idea:

To report the points that lie in the interval [x1, x2], we start by searching for x1 and x2. At some vertex in the tree, the search paths to x1 and x2 will diverge

This is how you efficiently detect whole subtrees that you know will be in your range, see wikipedia and/or google "tree range query" for lots of details.


My pre-googling observation was that you could avoid comparisons and just walk some subtrees. In your example, the left subtree of 86 is guaranteed to all be in the range, because we know they're all >59 and <86, which is a tighter bound than [a..b]. I hadn't thought of a way to look for this special case that wouldn't maybe cost more overhead than it saved.

like image 66
Peter Cordes Avatar answered Nov 15 '22 05:11

Peter Cordes