I have data consisting of keys mapped to values, like this:
---------------------
Key          | Value
---------------------
(0, 0, 0, 0) | a
(0, 0, 0, 1) | b
(0, 1, 0, 1) | c
(0, 1, 1, 0) | d
...
I am looking for a data structure that can efficiently perform search queries over the keys, where a query may specify the key completely or partially. For example:
(0, 0, 0, 0) -> a
(0, *, *, *) -> [a, b, c, d]
(0, 1, *, *) -> [c, d]
The idea I have right now is to implement this using a regular tree, similar to this: leaf nodes represent the values, and non-leaf nodes are parts of the key (i.e. the w, x, y and z nodes are the first, second, third and fourth parts of the key, respectively). A simple BFS could be used to answer any query. The problem is that this tree grows exponentially with each new part of the key.
What data structure/algorithm is more appropriate for this problem? Note that the key parts can be numbers or strings.
An array. Yes, really! You will have no space overhead, no "pointer chasing" overhead, and calculating the indices takes only a little bit math, which processors are rather good at.
Assume you get a partial key as a mask and bits, where the mask has a 0 for the bits that are wildcards and 1 elsewhere, and bits has 0 for the wildcards and whatever you want for the non-wildcard bits.
The algorithm to collect all items that have a key that matches that pattern is:
int key = bits;
do {
    yield items[key];
    key = (key | mask) + 1 & ~mask | bits;
} while (key != bits);
That key = (key | mask) + 1 & ~mask | bits part looks funny; here's how it works.
The | (bitwise OR) sets all the non-wildcard bits to 1. That makes sure the increment keeps carrying through the bits that are not wildcards. After that addition, the bits that were supposed to be "fixed" are destroyed (0 if a carry passed through them, 1 otherwise), so they have to be masked out (the & ~mask) and then set back to the right values (the | bits). The precedence of the operators makes it so that it can largely be written without parentheses. You can also write it as
key = (((key | mask) + 1) & (~mask)) | bits;
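As a sanity check, here is a minimal runnable sketch of that loop in Java (the class, method and array names are mine, not from the answer), applied to the 4-bit keys from the question's table. One detail worth making explicit: the mask should also have 1s in all unused bits above the key, so that the final carry wraps key back to bits and terminates the loop.

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardScan {
    // Collects items[key] for every key matching the pattern (mask, bits).
    // mask: 1-bits at fixed positions (including all unused high bits),
    //       0-bits at wildcard positions.
    // bits: the fixed values, 0 at wildcard positions.
    static List<String> matches(String[] items, int mask, int bits) {
        List<String> out = new ArrayList<>();
        int key = bits;
        do {
            if (items[key] != null) out.add(items[key]); // skip empty slots
            // Force the carry through the fixed bits, then restore them.
            key = (key | mask) + 1 & ~mask | bits;
        } while (key != bits);
        return out;
    }

    public static void main(String[] args) {
        String[] items = new String[16]; // 4 one-bit key parts -> 16 slots
        items[0b0000] = "a";
        items[0b0001] = "b";
        items[0b0101] = "c";
        items[0b0110] = "d";
        // (0, 1, *, *): the two low parts are wildcards.
        System.out.println(matches(items, ~0b0011, 0b0100)); // prints [c, d]
        // (0, *, *, *): only the top part is fixed.
        System.out.println(matches(items, ~0b0111, 0b0000)); // prints [a, b, c, d]
    }
}
```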
This works for any sort of pattern. If you only need "last x bits are variable", you can optimize a bit to:
int wildcards = 0;
int invmask = ~mask;
do {
    yield items[wildcards++ | bits];
} while (wildcards & invmask);
That just runs wildcards from 0 to 2^number-of-wildcards − 1 and ORs the fixed bits in on top.
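A hedged Java rendition of that variant (names invented), again assuming mask covers the unused high bits; the C-style truthiness test becomes an explicit != 0:

```java
import java.util.ArrayList;
import java.util.List;

public class LowWildcardScan {
    // Variant for patterns where only the lowest bits are wildcards:
    // run a plain counter over the wildcard bits and OR in the fixed part.
    static List<String> matches(String[] items, int mask, int bits) {
        List<String> out = new ArrayList<>();
        int invmask = ~mask; // 1-bits at the wildcard positions
        int wildcards = 0;
        do {
            String item = items[wildcards++ | bits];
            if (item != null) out.add(item); // skip empty slots
        } while ((wildcards & invmask) != 0); // stop once the counter leaves the wildcard bits
        return out;
    }
}
```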
In the simplest non-binary case, the parts of the key are still some integral number of bits wide, that is, they range from 0 to 2^n − 1. You can use exactly the same code in that case, but the interpretation of the mask is different: instead of having a single 0 bit for a wildcard or a single 1 bit for a non-wildcard, it has some other number of bits per part (corresponding to the width in bits of that key part).
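For instance (a sketch, with a made-up packing), with two key parts of 2 bits each, a pattern that fixes the high part and wildcards the low one simply clears a whole 2-bit group in the mask:

```java
import java.util.ArrayList;
import java.util.List;

public class WidePartScan {
    // Enumerates all keys matching (mask, bits); identical to the
    // one-bit-per-part loop, only the mask now has a 2-bit group per part.
    static List<Integer> keys(int mask, int bits) {
        List<Integer> out = new ArrayList<>();
        int key = bits;
        do {
            out.add(key);
            key = (key | mask) + 1 & ~mask | bits;
        } while (key != bits);
        return out;
    }

    public static void main(String[] args) {
        // Key = p1 << 2 | p0, each part ranging over 0..3. Pattern (2, *):
        // p1 fixed to 2, p0 wildcarded (mask clears its 2-bit group).
        System.out.println(keys(~0b0011, 0b1000)); // prints [8, 9, 10, 11]
    }
}
```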
For non-powers-of-two, it takes some more trickery. The problem is that a carry has to be generated sooner than it normally would be, in order to satisfy the constraint that a key part is less than some value.
For example, if every key part can be 0, 1, or 2 (but not 3), you can do (not tested):
int key = bits;
int increment = (0x55555555 & ~mask) + 1;
do {
    yield items[key];
    int temp = (key | mask) + increment & ~mask;
    int fix = (temp | (temp >> 1)) & 0x55555555;
    key = temp - fix | bits;
} while (key != bits);
The extra increment is 1 plus a mask of the "difference between the nearest power of two and the maximum value of a key part", which in this case is 1 for every key part, so there's a 1 in every "slot" (the slots are 2 bits wide, the narrowest they can be in this case). It only has those offsets at positions that are wildcards.
Offsetting the key parts so that their highest allowable value maps to "all ones" ensures that the carry propagates through them. However, it means that they are usually left in an invalid state (unless a part receives a carry and wraps to zero). So then comes the annoying part: the offset has to be undone, but only for the key parts that didn't wrap to zero.
That's where fix comes in: it computes a mask of the key parts that aren't zero. That gets more annoying if the key parts are wider, and downright terrible if the key parts aren't all the same size.
Then the last part, key = temp - fix | bits, undoes the offsetting and puts the non-wildcard bits back in. That subtraction never destroys anything: 1 is only subtracted from 2-bit groups whose value is at least 1, so a borrow never leaves a key part.
This way of indexing does waste some space of course, unlike the power-of-two case, because there are "holes" in the array that you can never index into.
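Since the snippet above is marked "not tested", here is a runnable Java transcription (class and method names are mine). As before, the mask is assumed to have 1s in all unused high bits so the loop terminates:

```java
import java.util.ArrayList;
import java.util.List;

public class Base3Scan {
    // Enumerates all keys matching (mask, bits) where every 2-bit key part
    // only takes the values 0, 1 or 2.
    static List<Integer> keys(int mask, int bits) {
        List<Integer> out = new ArrayList<>();
        // One offset per wildcard slot: 4 (slot size) - 1 - 2 (max value) = 1.
        int increment = (0x55555555 & ~mask) + 1;
        int key = bits;
        do {
            out.add(key);
            int temp = (key | mask) + increment & ~mask;
            int fix = (temp | (temp >> 1)) & 0x55555555; // nonzero-slot indicator
            key = temp - fix | bits;
        } while (key != bits);
        return out;
    }

    public static void main(String[] args) {
        // Two 2-bit parts, both wildcards: all 9 base-3 combinations.
        System.out.println(keys(~0b1111, 0b0000)); // prints [0, 1, 2, 4, 5, 6, 8, 9, 10]
        // High part fixed to 1, low part wildcarded.
        System.out.println(keys(~0b0011, 0b0100)); // prints [4, 5, 6]
    }
}
```

Note how the first enumeration skips 3, 7 and so on: those are exactly the "holes" in the array that this encoding can never index.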
If there exists a maximum value M for each part of the keys, you can create a singly-keyed tree by interpreting the keys as numbers written in base M (or in a mixed base). (x, *, *, *) then becomes a range query for [x*M^3, (x+1)*M^3 - 1].
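A possible sketch of this idea in Java (names invented; M = 2 is an assumption matching the question's 0/1 example data): encode each key in base M and keep the entries in a sorted map, so a partial query becomes a contiguous range scan via TreeMap.subMap:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class BaseMIndex {
    static final int M = 2; // assumed bound: every part is in 0..M-1

    // Interpret (w, x, y, z) as a 4-digit number in base M.
    static int encode(int w, int x, int y, int z) {
        return ((w * M + x) * M + y) * M + z;
    }

    // Range scan over the encoded keys, both bounds inclusive.
    static List<String> range(TreeMap<Integer, String> map, int lo, int hi) {
        return new ArrayList<>(map.subMap(lo, true, hi, true).values());
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> map = new TreeMap<>();
        map.put(encode(0, 0, 0, 0), "a");
        map.put(encode(0, 0, 0, 1), "b");
        map.put(encode(0, 1, 0, 1), "c");
        map.put(encode(0, 1, 1, 0), "d");
        // (0, 1, *, *) -> everything between (0,1,0,0) and (0,1,1,1).
        System.out.println(range(map, encode(0, 1, 0, 0), encode(0, 1, 1, 1))); // prints [c, d]
    }
}
```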
For strings, you can join the parts with a separator such as |: ('ax','bc','a','x') -> 'ax|bc|a|x'. The separator should not appear in the input strings (it may, but in that case it can interfere with retrieving the right results).
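One way to make such concatenated string keys queryable (a sketch; the '\uffff' sentinel is an assumption that it never appears in a key, and this only handles patterns whose wildcards form a suffix of the parts): store them in a sorted map and turn a query that fixes the leading parts into a prefix scan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class JoinedKeyIndex {
    // All keys that start with `prefix` sort between prefix (inclusive)
    // and prefix + '\uffff' (exclusive).
    static List<String> prefixQuery(TreeMap<String, String> map, String prefix) {
        return new ArrayList<>(map.subMap(prefix, true, prefix + '\uffff', false).values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> map = new TreeMap<>();
        map.put("0|0|0|0", "a");
        map.put("0|0|0|1", "b");
        map.put("0|1|0|1", "c");
        map.put("0|1|1|0", "d");
        // (0, 1, *, *) -> all keys starting with "0|1|".
        System.out.println(prefixQuery(map, "0|1|")); // prints [c, d]
    }
}
```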
But if your situation is more difficult than that, you can use objects as keys. In Java, I would create a class for the key and define a comparison operator between instances; for an example, see: How to compare objects by multiple fields.
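A minimal sketch of such a key class (the field names are just for illustration), with equals/hashCode for use as a HashMap key and a field-by-field compareTo for sorted structures:

```java
import java.util.Objects;

public final class PartKey implements Comparable<PartKey> {
    final String w, x, y, z; // the four parts of the key

    PartKey(String w, String x, String y, String z) {
        this.w = w; this.x = x; this.y = y; this.z = z;
    }

    @Override public boolean equals(Object o) {
        if (!(o instanceof PartKey)) return false;
        PartKey k = (PartKey) o;
        return w.equals(k.w) && x.equals(k.x) && y.equals(k.y) && z.equals(k.z);
    }

    @Override public int hashCode() {
        return Objects.hash(w, x, y, z);
    }

    // Compare field by field, most significant part first.
    @Override public int compareTo(PartKey k) {
        int c = w.compareTo(k.w);
        if (c == 0) c = x.compareTo(k.x);
        if (c == 0) c = y.compareTo(k.y);
        if (c == 0) c = z.compareTo(k.z);
        return c;
    }
}
```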