Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Collapse a list of range tuples into the overlapping ranges

I'm looking for the most memory efficient way to solve this problem.

I have a list of tuples representing partial string matches in a sentence:

[(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]

The first value of each tuple is the start position for the match, the second value is the length.

The idea is to collapse the list so that only the longest continue string match is reported. In this case it would be:

[(0,4), (2,6), (22,6)]

I do not want just the longest range, like in algorithm to find longest non-overlapping sequences, but I want all the ranges collapsed by the longest.

In case your wondering, I am using a pure python implementation of the Aho-Corasick for matching terms in a static dictionary to the given text snippet.

EDIT: Due to the nature of these tuple lists, overlapping but not self-contained ranges should be printed out individually. For example, having the words betaz and zeta in the dictionary, the matches for betazeta are [(0,5),(4,8)]. Since these ranges overlap, but none is contained in the other, the answer should be [(0,5),(4,8)]. I have also modified the input dataset above so that this case is covered.

Thanks!

like image 969
Francisco Roque Avatar asked Oct 07 '22 17:10

Francisco Roque


2 Answers

a = [(0, 2), (1, 2), (0, 4), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
b = [set(xrange(i, i + j)) for i, j in a]
c = b.pop().union(*b)
collapsed = sorted(c)
print collapsed
#Maybe this is useful?:
[0, 1, 2, 3, 22, 23, 24, 25, 26, 27]

#But if you want the requested format, then do this:
d = []
start = collapsed[0]
length = 0
for val in collapsed:
    if start + length < val:
        d.append((start,length))
        start = val
        length = 0
    elif val == collapsed[-1]:
        d.append((start,length + 1))
    length += 1
print d
#Output:
[(0,4), (22,6)]
like image 29
fraxel Avatar answered Oct 10 '22 10:10

fraxel


import operator
lst = [(0, 2), (1, 2), (0, 4), (2,6), (23, 2), (22, 6), (26, 2), (26, 2), (26, 2)]
lst.sort(key=operator.itemgetter(1))
for i in reversed(xrange(len(lst)-1)):
    start, length = lst[i]
    for j in xrange(i+1, len(lst)):
        lstart, llength = lst[j]
        if start >= lstart and start + length <= lstart + llength:
            del lst[i]
            break
print lst
#[(0, 4), (2, 6), (22, 6)]
like image 165
Janne Karila Avatar answered Oct 10 '22 08:10

Janne Karila