I have a big list, which may carry thousands to millions of entries. I set a window of finite size to slide over the list. I need to count the matched elements in the windows and repeat the procedure by sliding the window 1 position forward at a time. Here is a simple example of a list
L = [1 2 1 3 4 5 1 2 1 2 2 2 3 ]
Assuming the window is of the length of 3, it will capture
So total 1 + 1 + 1 + 1 + 3 + 1 = 8 matching pairs. I found the idea to use itertools to find the combination of all elements in a window and develop a code to find all the matching pairs
import itertools
L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3
totalMatch = 0
for n in range(len(L)-winlen+1):
window = [L[n+i] for i in range(winlen)]
A = list(itertools.combinations(window, 2))
match = [a==b for a, b in A]
totalMatch += sum(match)
it works for a short list but for the list and the windows getting large, this code is so slow. I have been working with C++ for years and decide to switch to python, I will appreciate it if there is any hint to improve the efficiency of the code.
Keep track of the data in your window more efficiently? This is O(|L|) instead of your O(|L|*winlen^2). It keeps the window's element counts in ctr
and the window's matches in match
. For example, when a new value enters the window and there are already two instances of that value in the window, you get two new matches. Similarly for a value falling out of the window, it takes as many matches with it as there are other instances of it in the window.
from collections import Counter
L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3
totalMatch = match = 0
ctr = Counter()
for i, x in enumerate(L):
# Remove old element falling out of window
if i >= winlen:
ctr[L[i-winlen]] -= 1
match -= ctr[L[i-winlen]]
# Add new element to window
match += ctr[x]
ctr[x] += 1
# Update the total (for complete windows)
if i >= winlen - 1:
totalMatch += match
print(totalMatch)
Benchmark results with L
and winlen
multiplied by 20:
38.75 ms original
0.18 ms Manuel
38.73 ms original
0.19 ms Manuel
38.87 ms original
0.18 ms Manuel
Benchmark code (also includes testing code for all lists of numbers 1 to 3 of lengths 0 to 9):
from timeit import repeat
import itertools
from itertools import product
from collections import Counter
def original(L, winlen):
totalMatch = 0
for n in range(len(L)-winlen+1):
window = [L[n+i] for i in range(winlen)]
A = list(itertools.combinations(window, 2))
match = [a==b for a, b in A]
totalMatch += sum(match)
return totalMatch
def Manuel(L, winlen):
totalMatch = match = 0
ctr = Counter()
for i, x in enumerate(L):
if i >= winlen:
ctr[L[i-winlen]] -= 1
match -= ctr[L[i-winlen]]
match += ctr[x]
ctr[x] += 1
if i >= winlen - 1:
totalMatch += match
return totalMatch
def test():
for n in range(10):
print('testing', n)
for T in product([1, 2, 3], repeat=n):
L = list(T)
winlen = 3
expect = original(L, winlen)
result = Manuel(L, winlen)
assert result == expect, (L, expect, result)
def bench():
L = [1,2,1,3,4,5,1,2,1,2,2,2,3] * 20
winlen = 3 * 20
for _ in range(3):
for func in original, Manuel:
t = min(repeat(lambda: func(L, winlen), number=1))
print('%6.2f ms ' % (t * 1e3), func.__name__)
print()
test()
bench()
You can use np.bincount
in a for loop, determine the number of combinations for each number/bin, and sum it up with the total.
import numpy as np
L = [1, 2, 1, 3, 4, 5, 1, 2, 1, 2, 2, 2, 3]
winlen = 3
L = np.array(L) # convert to array to speed up indexing
total = 0
for i in range(len(L) - winlen + 1):
bc = np.bincount(L[i:i+winlen]) # bincount on the window
bc = bc[bc>1] # get rid of all single and empty values
bc = bc * (bc-1) // 2 # gauss addition, number of combinations of each number
total += np.sum(bc) # sum up combinations, and add to total
print(total)
# 8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With