Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find number of times a set is a subset in a list of sets

The problem I'm trying to solve is to find the support of each itemset in transactional data.

For example,

transactions = [
    'b c d',
    'a g' ,
    'a c d e',
    'e f h',
    'a b c g h',
    'd' , 
    'a e g h',
    'b c d',
    'a b f g h',
    'a c d g',
]

will have [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]

So basically for the second transaction a, g, it is a subset of other transactions like 'a g', 'a b c g h', 'a e g h', 'a b f g h', 'a c d g' and hence the count is 5.

Now, initially, I was converting this dataset to a kind of One Hot Encoded transaction using the mlxtend transactional encoder. And used something like

df.progress_apply(lambda x: (df.iloc[:, np.where(x==1)[0]].sum(1)==len(np.where(x==1)[0])).sum(), axis=1)

to get the values.

The idea is like slice the matrix/df with the elements of the present row and then sum across rows. The cases where it is the same as the length of elements of the present row is a subset and hence count it.

However, this worked fine for smaller datasets, and then when I came across the kosarak, I cannot have a dense representation because of OOM error. So, I switched back to countVectorizer and generated a sparse representation and then used a similar logic as the previous one.

Now the issue is, the scipy sparse is 4x slow when doing sum on sparse than dense with a run time of

164 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Even using sets to solve the problem didn't improve things much.

As far, this was my approach and I believe it has O(n2) complexity. Is there any better algorithm/package to speed things up.

Any help is appreciated. Thanks in advance.

like image 241
lu5er Avatar asked Nov 15 '22 02:11

lu5er


1 Answers

Since 2**26 is well below the integer limit on 32-bit integers, you can do this:

digitize = lambda x: np.in1d(list(string.ascii_lowercase), x.split()) @ 2 ** np.arange(26)

digitize converts the strings of letters into a unique bitwise integer for each set of letters. Since the data is bitwise, it can be compared with bit arithmetic.

trans = np.array([digitize(t) for t in transactions])

Out[]: array([ 14,  65,  29, 176, 199,   8, 209,  14, 227,  77], dtype=int32)

(np.bitwise_and.outer(tr, tr) == tr).sum(0)  #bitwise definition of subset, summed over entries

Out[]: array([2, 5, 1, 1, 1, 5, 1, 2, 1, 1])

you could easily make a column of trans and then apply the bitwise function to get your desired output. Should reduce memory usage by not storing those big onehots as well.

like image 66
Daniel F Avatar answered Nov 29 '22 04:11

Daniel F