I have data consisting of keys mapped to values, like this:
---------------------
Key          | Value
---------------------
(0, 0, 0, 0) | a
(0, 0, 0, 1) | b
(0, 1, 0, 1) | c
(0, 1, 1, 0) | d
...
I am looking for a data structure that can efficiently perform search queries over the keys, where a query may specify the key completely or partially. For example:
(0, 0, 0, 0) -> a
(0, *, *, *) -> [a, b, c, d]
(0, 1, *, *) -> [c, d]
The idea I have right now is to implement this using a regular tree, similar to this: leaf nodes represent the values, and non-leaf nodes are parts of the key (i.e. the w, x, y and z nodes are the first, second, third and fourth parts of the key, respectively). A simple BFS could be used to answer any query. The problem is that this tree grows exponentially with each new part of the key.
What data structure/algorithm is more appropriate for this problem? Note that the key parts can be numbers or strings.
An array. Yes, really! You will have no space overhead, no "pointer chasing" overhead, and calculating the indices takes only a little bit math, which processors are rather good at.
Assume you get a partial key as a mask and bits, where the mask has a 0 for the bits that are wildcards and 1 elsewhere, and bits has 0 for the wildcards and whatever you want for the non-wildcard bits.
The algorithm to collect all items that have a key that matches that pattern is:
int key = bits;
do {
    yield items[key];
    key = (key | mask) + 1 & ~mask | bits;
} while (key != bits);
That key = (key | mask) + 1 & ~mask | bits part looks funny; here's how it works.
The | (bitwise OR) sets all the non-wildcard bits to 1. That makes sure the increment keeps carrying through the bits that are not wildcards. After that addition, the bits that were supposed to be "fixed" are destroyed (0 if a carry passed through them, 1 otherwise), so they have to be masked out (the & ~mask) and then set back to the right values (the | bits). The precedence of the operators makes it so that it can largely be written without parentheses. You can also write it as
key = (((key | mask) + 1) & (~mask)) | bits;
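As a sanity check, here is a minimal runnable sketch of that loop in Java (the class, method and array names are mine, not from the answer), applied to the 4-bit keys from the question's table. One detail worth making explicit: the mask should also have 1s in all unused bits above the key, so that the final carry wraps key back to bits and terminates the loop.

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardScan {
    // Collects items[key] for every key matching the pattern (mask, bits).
    // mask: 1-bits at fixed positions (including all unused high bits),
    //       0-bits at wildcard positions.
    // bits: the fixed values, 0 at wildcard positions.
    static List<String> matches(String[] items, int mask, int bits) {
        List<String> out = new ArrayList<>();
        int key = bits;
        do {
            if (items[key] != null) out.add(items[key]); // skip empty slots
            // Force the carry through the fixed bits, then restore them.
            key = (key | mask) + 1 & ~mask | bits;
        } while (key != bits);
        return out;
    }

    public static void main(String[] args) {
        String[] items = new String[16]; // 4 one-bit key parts -> 16 slots
        items[0b0000] = "a";
        items[0b0001] = "b";
        items[0b0101] = "c";
        items[0b0110] = "d";
        // (0, 1, *, *): the two low parts are wildcards.
        System.out.println(matches(items, ~0b0011, 0b0100)); // prints [c, d]
        // (0, *, *, *): only the top part is fixed.
        System.out.println(matches(items, ~0b0111, 0b0000)); // prints [a, b, c, d]
    }
}
```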
This works for any sort of pattern. If you only need "last x bits are variable", you can optimize a bit to:
int wildcards = 0;
int invmask = ~mask;
do {
    yield items[wildcards++ | bits];
} while (wildcards & invmask);
That just runs wildcards from 0 to 2^number-of-wildcards − 1 and ORs the fixed bits in on top.
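A hedged Java rendition of that variant (names invented), again assuming mask covers the unused high bits; the C-style truthiness test becomes an explicit != 0:

```java
import java.util.ArrayList;
import java.util.List;

public class LowWildcardScan {
    // Variant for patterns where only the lowest bits are wildcards:
    // run a plain counter over the wildcard bits and OR in the fixed part.
    static List<String> matches(String[] items, int mask, int bits) {
        List<String> out = new ArrayList<>();
        int invmask = ~mask; // 1-bits at the wildcard positions
        int wildcards = 0;
        do {
            String item = items[wildcards++ | bits];
            if (item != null) out.add(item); // skip empty slots
        } while ((wildcards & invmask) != 0); // stop once the counter leaves the wildcard bits
        return out;
    }
}
```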
In the simplest non-binary case, the parts of the key are still some integral number of bits wide, that is, they range from 0 to 2^n − 1. You can use exactly the same code in that case, but the interpretation of the mask is different: instead of having a single 0 bit for a wildcard or a single 1 bit for a non-wildcard, it has some other number of bits per part (corresponding to the width in bits of that key part).
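For instance (a sketch, with a made-up packing), with two key parts of 2 bits each, a pattern that fixes the high part and wildcards the low one simply clears a whole 2-bit group in the mask:

```java
import java.util.ArrayList;
import java.util.List;

public class WidePartScan {
    // Enumerates all keys matching (mask, bits); identical to the
    // one-bit-per-part loop, only the mask now has a 2-bit group per part.
    static List<Integer> keys(int mask, int bits) {
        List<Integer> out = new ArrayList<>();
        int key = bits;
        do {
            out.add(key);
            key = (key | mask) + 1 & ~mask | bits;
        } while (key != bits);
        return out;
    }

    public static void main(String[] args) {
        // Key = p1 << 2 | p0, each part ranging over 0..3. Pattern (2, *):
        // p1 fixed to 2, p0 wildcarded (mask clears its 2-bit group).
        System.out.println(keys(~0b0011, 0b1000)); // prints [8, 9, 10, 11]
    }
}
```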
For non-powers-of-two, it takes some more trickery. The problem is that a carry has to be generated sooner than it normally would be, in order to satisfy the constraint that a key part is less than some value.
For example, if every key part can be 0, 1, or 2 (but not 3), you can do (not tested):
int key = bits;
int increment = (0x55555555 & ~mask) + 1;
do {
    yield items[key];
    int temp = (key | mask) + increment & ~mask;
    int fix = (temp | (temp >> 1)) & 0x55555555;
    key = temp - fix | bits;
} while (key != bits);
The extra increment is 1 plus a mask of the "difference between the nearest power of two and the maximum value of a key part", which in this case is 1 for every key part, so there's a 1 in every "slot" (the slots are 2 bits wide, the narrowest they can be in this case). It only has those offsets at positions that are wildcards.
Offsetting the key parts so that their highest allowable value maps to "all ones" ensures that the carry propagates through them. However, it means that they are usually left in an invalid state (unless a part receives a carry and wraps to zero). So then comes the annoying part: the offset has to be undone, but only for the key parts that didn't wrap to zero.
That's where fix comes in: it computes a mask of the key parts that aren't zero. That gets more annoying if the key parts are wider, and downright terrible if the key parts aren't all the same size.
Then the last part, key = temp - fix | bits, undoes the offsetting and puts the non-wildcard bits back in. That subtraction never destroys anything: 1 is only subtracted from 2-bit groups whose value is at least 1, so a borrow never leaves a key part.
This way of indexing does waste some space of course, unlike the power-of-two case, because there are "holes" in the array that you can never index into.
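Since the snippet above is marked "not tested", here is a runnable Java transcription (class and method names are mine). As before, the mask is assumed to have 1s in all unused high bits so the loop terminates:

```java
import java.util.ArrayList;
import java.util.List;

public class Base3Scan {
    // Enumerates all keys matching (mask, bits) where every 2-bit key part
    // only takes the values 0, 1 or 2.
    static List<Integer> keys(int mask, int bits) {
        List<Integer> out = new ArrayList<>();
        // One offset per wildcard slot: 4 (slot size) - 1 - 2 (max value) = 1.
        int increment = (0x55555555 & ~mask) + 1;
        int key = bits;
        do {
            out.add(key);
            int temp = (key | mask) + increment & ~mask;
            int fix = (temp | (temp >> 1)) & 0x55555555; // nonzero-slot indicator
            key = temp - fix | bits;
        } while (key != bits);
        return out;
    }

    public static void main(String[] args) {
        // Two 2-bit parts, both wildcards: all 9 base-3 combinations.
        System.out.println(keys(~0b1111, 0b0000)); // prints [0, 1, 2, 4, 5, 6, 8, 9, 10]
        // High part fixed to 1, low part wildcarded.
        System.out.println(keys(~0b0011, 0b0100)); // prints [4, 5, 6]
    }
}
```

Note how the first enumeration skips 3, 7 and so on: those are exactly the "holes" in the array that this encoding can never index.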
If there exists a maximum value M for each part of the keys, you can create a singly-keyed tree by interpreting the keys as numbers written in base M (or in a mixed base). (x, *, *, *) then becomes a range query for [x*M^3, (x+1)*M^3 - 1].
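A possible sketch of this idea in Java (names invented; M = 2 is an assumption matching the question's 0/1 example data): encode each key in base M and keep the entries in a sorted map, so a partial query becomes a contiguous range scan via TreeMap.subMap:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class BaseMIndex {
    static final int M = 2; // assumed bound: every part is in 0..M-1

    // Interpret (w, x, y, z) as a 4-digit number in base M.
    static int encode(int w, int x, int y, int z) {
        return ((w * M + x) * M + y) * M + z;
    }

    // Range scan over the encoded keys, both bounds inclusive.
    static List<String> range(TreeMap<Integer, String> map, int lo, int hi) {
        return new ArrayList<>(map.subMap(lo, true, hi, true).values());
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>();
        map.put(encode(0, 0, 0, 0), "a");
        map.put(encode(0, 0, 0, 1), "b");
        map.put(encode(0, 1, 0, 1), "c");
        map.put(encode(0, 1, 1, 0), "d");
        // (0, 1, *, *) -> everything between (0,1,0,0) and (0,1,1,1).
        System.out.println(range(map, encode(0, 1, 0, 0), encode(0, 1, 1, 1))); // prints [c, d]
    }
}
```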
For strings, you can join the parts with a separator such as |: ('ax','bc','a','x') -> 'ax|bc|a|x'. The separator should not appear in the input strings (it may, but in that case it can interfere with retrieving the right results).
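One way to make such concatenated string keys queryable (a sketch; the '\uffff' sentinel is an assumption that it never appears in a key, and this only handles patterns whose wildcards form a suffix of the parts): store them in a sorted map and turn a query that fixes the leading parts into a prefix scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class JoinedKeyIndex {
    // All keys that start with `prefix` sort between prefix (inclusive)
    // and prefix + '\uffff' (exclusive).
    static List<String> prefixQuery(TreeMap<String, String> map, String prefix) {
        return new ArrayList<>(map.subMap(prefix, true, prefix + '\uffff', false).values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> map = new TreeMap<>();
        map.put("0|0|0|0", "a");
        map.put("0|0|0|1", "b");
        map.put("0|1|0|1", "c");
        map.put("0|1|1|0", "d");
        // (0, 1, *, *) -> all keys starting with "0|1|".
        System.out.println(prefixQuery(map, "0|1|")); // prints [c, d]
    }
}
```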
But if your situation is more difficult than that, you can use objects as keys. In Java, I would create a class for the key and define a comparison operator between instances; for an example, see: How to compare objects by multiple fields.
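A minimal sketch of such a key class (the field names are just for illustration), with equals/hashCode for use as a HashMap key and a field-by-field compareTo for sorted structures:

```java
import java.util.Objects;

public final class PartKey implements Comparable<PartKey> {
    final String w, x, y, z; // the four parts of the key

    PartKey(String w, String x, String y, String z) {
        this.w = w; this.x = x; this.y = y; this.z = z;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof PartKey)) return false;
        PartKey k = (PartKey) o;
        return w.equals(k.w) && x.equals(k.x) && y.equals(k.y) && z.equals(k.z);
    }

    @Override public int hashCode() {
        return Objects.hash(w, x, y, z);
    }

    // Compare field by field, most significant part first.
    @Override public int compareTo(PartKey k) {
        int c = w.compareTo(k.w);
        if (c == 0) c = x.compareTo(k.x);
        if (c == 0) c = y.compareTo(k.y);
        if (c == 0) c = z.compareTo(k.z);
        return c;
    }
}
```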