I have a huge list containing more than 200,000 lists, like this:
huge_list = [
[23, 18, 19, 36, 42],
[22, 18, 19, 36, 39],
[21, 18, 19, 37, 42]
]
It has the following properties:
I want the result to show how many times each combination occurs across all the lists:
18: 3 (times),
19: 3 (times),
36: 2 (times),
(18, 42): 2 (times),
(19, 42): 2 (times),
(18, 36): 2 (times),
(19, 36): 2 (times),
(18, 19): 3 (times),
(18, 19, 36): 2 (times),
(18, 19, 42): 2 (times) etc.
The slowest, and in practice impossible, way is to generate every combination of 1 number taken from 80, then of 2 taken from 80, then of 3 taken from 80, and so on up to 20 taken from 80, which is an astronomically large number of combinations. That is infeasible on its own, and even more so because huge_list contains over 200,000 lists.
I need something like a Counter, but faster. It should be as fast as possible, because it will get much slower starting from combinations of 12 taken from 80, or even fewer.
This is what I have tried so far:
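To put a rough number on this, here is a small sketch that counts how many combinations that brute-force approach would have to generate. It assumes Python 3.8+ (for math.comb) and takes the 80 numbers and the maximum size of 20 from the description above; the variable names are only illustrative.

```python
from math import comb

# Number of k-element combinations drawn from 80 numbers, for k = 1..20.
per_size = {k: comb(80, k) for k in range(1, 21)}
total = sum(per_size.values())

print(per_size[1], per_size[2], per_size[3])  # 80 3160 82160
print(total > 10**18)                         # True
```

Sizes 1 to 3 are still small, but the total is already beyond 10**18, so enumerating every possible combination up front is not an option.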
mydict = {}
# Compare every list with all the lists that come after it.
while len(huge_list) > 1:
    to_check = huge_list[0]
    del huge_list[0]
    for draw in huge_list:
        for num in to_check:
            # one:
            if num in draw:
                if num in mydict:
                    mydict[num] += 1
                else:
                    mydict[num] = 1
# A number seen in a single pair of lists occurs twice, so bump 1 to 2.
if 1 in mydict.values():
    for key in mydict.keys():
        if mydict[key] == 1:
            mydict[key] += 1
print(mydict)
Result:
{18: 3, 19: 3, 36: 2, 42: 2}
But this only (almost) works for combinations of 1 taken from 80. How can I do it for the other combination sizes? And how can I make it faster than this?
P.S. I only need combinations that the lists have in common; I am not interested in combinations that occur in only one list or in none. Maybe this helps to make it even faster.
You could use the powerset function from more_itertools and feed the results into a collections.Counter:
from more_itertools import powerset
from collections import Counter
from itertools import chain
huge_list = [
[23, 18, 19, 36, 42],
[22, 18, 19, 36, 39],
[21, 18, 19, 37, 42]
]
c = Counter(chain.from_iterable(map(powerset, huge_list)))
print({k if len(k) > 1 else k[0]: v for k, v in c.items() if v > 1 and k})
Results
{18: 3, 19: 3, 36: 2, 42: 2, (18, 19): 3, (18, 36): 2, (18, 42): 2, (19, 36): 2, (19, 42): 2, (18, 19, 36): 2, (18, 19, 42): 2}
This can probably be sped up using pandas, although this seems to be the most efficient way to do it without pandas.
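Since the question only asks for combinations that occur in at least two lists, one possible way to cut the work with just the standard library is to count level by level and prune: a combination of size k can only be shared if all of its sub-combinations of size k-1 are shared too. The following is a minimal sketch of that idea, not a benchmarked solution. The function name shared_combinations and the max_size parameter are hypothetical, and keys are sorted tuples (single numbers are 1-tuples), which differs from the output format above.

```python
from collections import Counter
from itertools import combinations

def shared_combinations(lists, max_size):
    """Count combinations of up to max_size numbers found in at least two lists."""
    result = {}
    frequent = None  # shared combinations of the previous size
    alive = None     # numbers that still belong to some shared combination
    for size in range(1, max_size + 1):
        counts = Counter()
        for lst in lists:
            # Drop numbers that can no longer be part of a shared combination.
            items = sorted(n for n in set(lst) if alive is None or n in alive)
            for combo in combinations(items, size):
                # Prune: every (size - 1)-subset must itself be shared.
                if frequent is None or all(
                    sub in frequent for sub in combinations(combo, size - 1)
                ):
                    counts[combo] += 1
        frequent = {combo for combo, n in counts.items() if n > 1}
        if not frequent:
            break
        alive = {n for combo in frequent for n in combo}
        result.update((combo, counts[combo]) for combo in frequent)
    return result

huge_list = [
    [23, 18, 19, 36, 42],
    [22, 18, 19, 36, 39],
    [21, 18, 19, 37, 42],
]
shared = shared_combinations(huge_list, max_size=3)
print(shared[(18, 19)])  # 3
```

How much this saves depends on the data: if almost every small combination is shared, little gets pruned, so it is worth measuring on the real lists before relying on it.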
P.S.: powerset is also part of the itertools recipes.
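If you would rather not install more_itertools, the powerset recipe from the itertools documentation can be defined locally. This sketch repeats the counting step with that recipe; the name shared is only illustrative, and single numbers are left as 1-tuples here instead of being unwrapped.

```python
from collections import Counter
from itertools import chain, combinations

def powerset(iterable):
    """powerset([1, 2, 3]) --> () (1,) (2,) (3,) (1, 2) (1, 3) (2, 3) (1, 2, 3)"""
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))

huge_list = [
    [23, 18, 19, 36, 42],
    [22, 18, 19, 36, 39],
    [21, 18, 19, 37, 42],
]

c = Counter(chain.from_iterable(map(powerset, huge_list)))
# Keep non-empty combinations that occur in more than one list.
shared = {k: v for k, v in c.items() if v > 1 and k}
print(shared[(18, 19)])  # 3
```

Keep in mind that a list of n numbers has 2**n subsets, so this only stays practical when the individual lists are short.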