Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to improve the pattern matching on a list in Python

I have a big list, which may carry thousands to millions of entries. I set a window of finite size to slide over the list. I need to count the matched elements in the windows and repeat the procedure by sliding the window 1 position forward at a time. Here is a simple example of a list

L = [1 2 1 3 4 5 1 2 1 2 2 2 3 ]

Assuming the window is of the length of 3, it will capture

  1. [1 2 1] which contains one pair of matching elements (1 & 1)
  2. move the windows forward by 1 position => [2 1 3], no matching elements
  3. move the windows forward by 1 position => [1 3 4], no matching elements
  4. move the windows forward by 1 position => [3 4 5], no matching elements
  5. move the windows forward by 1 position => [4 5 1], no matching elements
  6. move the windows forward by 1 position => [5 1 2], no matching elements
  7. move the windows forward by 1 position => [1 2 1], 1 matching elements (1 & 1)
  8. move the windows forward by 1 position => [2 1 2], 1 matching elements (2 & 2)
  9. move the windows forward by 1 position => [1 2 2], 1 matching elements (2 & 2)
  10. move the windows forward by 1 position => [2 2 2], 3 matching elements ([2 2 -], [2 - 2], [- 2 2])
  11. move the windows forward by 1 position => [2 2 3], 1 matching elements (2 & 2)

So total 1 + 1 + 1 + 1 + 3 + 1 = 8 matching pairs. I found the idea to use itertools to find the combination of all elements in a window and develop a code to find all the matching pairs

import itertools
L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3
totalMatch = 0
for n in range(len(L)-winlen+1):
    window = [L[n+i] for i in range(winlen)]
    A = list(itertools.combinations(window, 2))
    match = [a==b for a, b in A]
    totalMatch += sum(match)

it works for a short list but for the list and the windows getting large, this code is so slow. I have been working with C++ for years and decide to switch to python, I will appreciate it if there is any hint to improve the efficiency of the code.

like image 960
user1285419 Avatar asked Feb 04 '23 14:02

user1285419


2 Answers

Keep track of the data in your window more efficiently? This is O(|L|) instead of your O(|L|*winlen^2). It keeps the window's element counts in ctr and the window's matches in match. For example, when a new value enters the window and there are already two instances of that value in the window, you get two new matches. Similarly for a value falling out of the window, it takes as many matches with it as there are other instances of it in the window.

from collections import Counter

L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3

totalMatch = match = 0
ctr = Counter()
for i, x in enumerate(L):
    
    # Remove old element falling out of window
    if i >= winlen:
        ctr[L[i-winlen]] -= 1
        match -= ctr[L[i-winlen]]

    # Add new element to window
    match += ctr[x]
    ctr[x] += 1

    # Update the total (for complete windows)
    if i >= winlen - 1:
        totalMatch += match

print(totalMatch)

Benchmark results with L and winlen multiplied by 20:

 38.75 ms  original
  0.18 ms  Manuel

 38.73 ms  original
  0.19 ms  Manuel

 38.87 ms  original
  0.18 ms  Manuel

Benchmark code (also includes testing code for all lists of numbers 1 to 3 of lengths 0 to 9):

from timeit import repeat
import itertools
from itertools import product
from collections import Counter

def original(L, winlen):
    totalMatch = 0
    for n in range(len(L)-winlen+1):
        window = [L[n+i] for i in range(winlen)]
        A = list(itertools.combinations(window, 2))
        match = [a==b for a, b in A]
        totalMatch += sum(match)
    return totalMatch

def Manuel(L, winlen):
    totalMatch = match = 0
    ctr = Counter()
    for i, x in enumerate(L):
        if i >= winlen:
            ctr[L[i-winlen]] -= 1
            match -= ctr[L[i-winlen]]
        match += ctr[x]
        ctr[x] += 1
        if i >= winlen - 1:
            totalMatch += match
    return totalMatch

def test():
    for n in range(10):
        print('testing', n)
        for T in product([1, 2, 3], repeat=n):
            L = list(T)
            winlen = 3
            expect = original(L, winlen)
            result = Manuel(L, winlen)
            assert result == expect, (L, expect, result)

def bench():
    L = [1,2,1,3,4,5,1,2,1,2,2,2,3] * 20
    winlen = 3 * 20
    for _ in range(3):
        for func in original, Manuel:
            t = min(repeat(lambda: func(L, winlen), number=1))
            print('%6.2f ms ' % (t * 1e3), func.__name__)
        print()

test()
bench()
like image 157
Manuel Avatar answered Feb 06 '23 04:02

Manuel


You can use np.bincount in a for loop, determine the number of combinations for each number/bin, and sum it up with the total.

import numpy as np

L = [1, 2, 1, 3, 4, 5, 1, 2, 1, 2, 2, 2, 3]
winlen = 3

L = np.array(L) # convert to array to speed up indexing

total = 0
for i in range(len(L) - winlen + 1):
    bc = np.bincount(L[i:i+winlen]) # bincount on the window
    bc = bc[bc>1] # get rid of all single and empty values
    bc = bc * (bc-1) // 2 # gauss addition, number of combinations of each number
    total += np.sum(bc) # sum up combinations, and add to total

print(total)
# 8
like image 20
Jan Christoph Terasa Avatar answered Feb 06 '23 04:02

Jan Christoph Terasa