Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data structure for partial multi-keys mapping?

I have data consists of keys mapped to values, like this:

---------------------
Key          | Value
---------------------
(0, 0, 0, 0) | a
(0, 0, 0, 1) | b
(0, 1, 0, 1) | c
(0, 1, 1, 0) | d
....

I am looking for a data structure the can efficiently perform search queries over keys, where the queries could be complete or partial specification the key. For example:

(0, 0, 0, 1) -> a
(0, *, *, *) -> [a, b, c, d]
(0, 1, *, *) -> [c, d]

The idea that I've right now is to implement this using a regular tree, similar to this: tree Leaves nodes represent the values and non-leaves nodes are parts of the key (i.e. w,x,y and z nodes are first, second, third and forth part of the key, respectively.). A simple BFS algorithm could be used to answer any query. But the problem is that this tree is growing exponentially with each new part of the key.

What data structure/algorithm is more appropriate to solve this problem? Note that the key parts can be numbers or strings.

like image 502
TinyProton Avatar asked Sep 01 '13 07:09

TinyProton


People also ask

What is the best data structure for mapping multiple keys to the same value?

HashMap will store separate references per key, but they can all point to the same reference.

Which data structure is most appropriate for map data structure?

The map data type is referred to as an associative array because, like an array, it is a collection of values rather than a single value, as an Int or a String is.

Can map have multiple keys?

Unfortunately, the Java Map interface doesn't allow for multiple key types, so we need to find another solution.


2 Answers

An array. Yes really! You will have no space overhead, no "pointer chasing" overhead, and calculating the indices only takes a little bitmath, and processors are really rather good at that.

Assuming you get a partial key as a mask and bits where the mask has a 0 for the bits which are wildcards and 1 elsewhere, and the bits are 0 for the wildcards and whatever you want for the non-wildcards.

The algorithm to collect all items that have a key that matches that pattern is:

int key = bits;
do {
    yield items[key];
    key = (key | mask) + 1 & ~mask | bits;
} while (key != bits);

That key = (key | mask) + 1 & ~mask | bits part looks funny, here's how it works.

The | (bitwise OR) makes all the non-wildcards 1. That makes sure that the increment keeps carrying through the bits that are not wildcards. After that addition, the bits that were supposed to be "fixed" are destroyed (0 if a carry passed through them, 1 otherwise), so they have to be masked out (the & ~mask) and then set back to the right value (| bits). The precedence of the operators makes it so that it can largely be written without parentheses. You can also write it as

key = (((key | mask) + 1) & (~mask)) | bits;

This works for any sort of pattern. If you only need "last x bits are variable", you can optimize a bit to:

int wildcards = 0;
int invmask = ~mask;
do {
    yield items[wildcards++ | bits];
} while (wildcards & invmask);

That just runs from 0 to 2number-of-wildcards and then puts in the fixed bits in the top.

Non-binary keys

In the simplest non-binary case, the parts of the key are still some integral number of bits, that is, they range from 0 to 2n-1. You can use exactly the same code in that case, but the interpretation of the mask is different: instead of having a single 0 bit for a wildcard or a single 1 bit for a non-wildcard, it would have some other number of bits (corresponding to the width in bits of a key-part).

For non-powers-of-two, it takes some more trickery. The problem is that a carry has to be generated sooner it normally would in order to satisfy the constrains that a key-part is less than some value.

For example, if all the key-parts can be 0, 1, or 2 (but not 3), you can do (not tested):

int key = bits;
int increment = (0x55555555 & ~mask) + 1;
do {
    yield items[key];
    int temp = (key | mask) + increment & ~mask;
    int fix = (temp | (temp >> 1)) & 0x55555555;
    key = temp - fix | bits;
} while (key != bits);

The extra increment is 1 plus a mask of the "difference of the nearest power of two with the maximum value of a key-part", which is in this case 1 for every key-part, so there's a 1 in every "slot" (the slots are 2 bits wide, which is the narrowest they can be in this case). It only has those "offsets" at positions that are wildcards.

Offsetting the key-parts so that their highest allowable value maps to "all ones" ensures that the carry propagates through them. However, it means that they are usually left in an invalid state (unless it receives a carry and goes to zero). So then comes the annoying part: the offset has to be undone only for key-parts that didn't wrap to zero.

So there fix comes in. It computes a mask of key-parts that aren't zero. That gets more annoying if the key-parts are wider, and it gets downright terrible if they key-parts aren't all the same size.

Then the last part, key = temp - fix | bits, undoes the offsetting and puts the non-wildcards back in. That subtract never destroys anything, because only 1 is subtracted only from groups of 2 bits that are together at least 1, so the carry never leaves a key-part.

This way of indexing does waste some space of course, unlike the power-of-two case, because there are "holes" in the array that you can never index into.

like image 88
harold Avatar answered Sep 28 '22 18:09

harold


if there exist a maximum(M) value for each part of the keys, you can create a single keyed tree by interpreting the keys as numbers written in base M (or in mixed base)

  • i assume wildcards only appear at one index and all furter are wildcards, this way (x,*,*,*) will be a query for (x*M^3,(x+1)*M^3-1)

for strings:

  • you can use a separating character and token paste the keys (using |:

('ax','bc','a','x') -> 'ax|bc|a|x'

the separator should not appear in the input strings(it may appear, but in that case it may interfere with accessing the asked results)

but...if your situation is difficult you can use objects as keys, in java i would create a class for the key, and define a compare oparator between them

for examples i would quote: How to compare objects by multiple fields

like image 33
Zoltán Haindrich Avatar answered Sep 28 '22 19:09

Zoltán Haindrich