The problem I'm trying to solve is to find the support of each itemset in transactional data.
For example,
transactions = [
'b c d',
'a g' ,
'a c d e',
'e f h',
'a b c g h',
'd' ,
'a e g h',
'b c d',
'a b f g h',
'a c d g',
]
will have [2, 5, 1, 1, 1, 5, 1, 2, 1, 1]
So basically for the second transaction a, g
, it is a subset of other transactions like 'a g'
, 'a b c g h'
, 'a e g h'
, 'a b f g h'
, 'a c d g'
and hence the count is 5.
Now, initially, I was converting this dataset to a kind of One Hot Encoded transaction using the mlxtend transactional encoder. And used something like
df.progress_apply(lambda x: (df.iloc[:, np.where(x==1)[0]].sum(1)==len(np.where(x==1)[0])).sum(), axis=1)
to get the values.
The idea is like slice the matrix/df with the elements of the present row and then sum across rows. The cases where it is the same as the length of elements of the present row is a subset and hence count it.
However, this worked fine for smaller datasets, and then when I came across the kosarak, I cannot have a dense representation because of OOM error. So, I switched back to countVectorizer and generated a sparse representation and then used a similar logic as the previous one.
Now the issue is, the scipy sparse is 4x slow when doing sum on sparse than dense with a run time of
164 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Even using sets to solve the problem didn't improve things much.
As far, this was my approach and I believe it has O(n2) complexity. Is there any better algorithm/package to speed things up.
Any help is appreciated. Thanks in advance.
Since 2**26 is well below the integer limit on 32-bit integers, you can do this:
digitize = lambda x: np.in1d(list(string.ascii_lowercase), x.split()) @ 2 ** np.arange(26)
digitize
converts the strings of letters into a unique bitwise integer for each set of letters. Since the data is bitwise, it can be compared with bit arithmetic.
trans = np.array([digitize(t) for t in transactions])
Out[]: array([ 14, 65, 29, 176, 199, 8, 209, 14, 227, 77], dtype=int32)
(np.bitwise_and.outer(tr, tr) == tr).sum(0) #bitwise definition of subset, summed over entries
Out[]: array([2, 5, 1, 1, 1, 5, 1, 2, 1, 1])
you could easily make a column of trans
and then apply the bitwise function to get your desired output. Should reduce memory usage by not storing those big onehots as well.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With