I have been trying to think of a performance efficient way of finding the union of character occurrences in a set of fixed width
strings grouped by index. Something like this;
s1 = "013965"
s2 = "015935"
s3 = "310012"
Resulting in the following where each group of digits exists in all strings at char index n:
out = "[03][1][350][90][631][52]"
I have thought of doing it the very naive way of iterating through every string, at every index while storing the intermediate strings in an array and then iterating through that array to build the output value. However my approach seems to me as a highly inefficient way which is way too far from an asymptotically optimal solution.
If the set of all possible characters is known in advance, let's say their number is n
, with n
being not too high (e.g. 10, if you're doing only digits), you can do this by creating m
boolean arrays of length n
, with m
being the number of positions, or digits in input strings and n
. The n-th position in m-th array will be true
, if the n-th character is present in m-th position in any of input strings. False
will denote, that no such character was in m-th position before.
Then you can iterate over each string, and when you encounter character n
in position m
you mark down true
in n-th position of m-th array. At the end, you will have m
arrays, each describing the content of m-th group
pos[0] = {true, true, false, false, false, true, true, false, true, false}
pos[1] = {true, false, false, false, false, false, false, false, false, false}
pos[2] = {false, false, true, true, false, false, false, false, false, true}
translates to
[0,1,5,6,8] [0] [2,3,9]
As all the structures are direct access arrays, there is no lookup involved, all access is in constant time and you only need to visit every character once, with no comparison involved. Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With