Finding common list sequences

Tags:

python

I have a dictionary of lists. Each list is a sequence of numbers. No two lists are equal but two or more lists may start with the same sequence of numbers (see my example input below). What I want to do is find these common sequences and make them a new element in the dictionary.

Sample input:

sequences = {
    18: [1, 3, 5, 6, 8, 12, 15, 17, 18],
    19: [1, 3, 5, 6, 9, 13, 14, 16, 19],
    25: [1, 3, 5, 6, 9, 13, 14, 20, 25],
    11: [0, 2, 4, 7, 11],
    20: [0, 2, 4, 10, 20],
    26: [21, 23, 26],
}

Sample output:

expected_output = {
    6: [1, 3, 5, 6],
    18: [8, 12, 15, 17, 18],
    14: [9, 13, 14],
    19: [16, 19],
    25: [20, 25],
    4: [0, 2, 4],
    11: [7, 11],
    20: [10, 20],
    26: [21, 23, 26],
}

The key of each list is its last element. Order doesn't matter.

I have a working code. However, it's very messy. Could somebody suggest a simpler/cleaner solution?

from collections import Counter

def split_lists(sequences):
    # get first elem from each sequence
    firsts = list(map(lambda s: s[0], sequences))

    # get non-duplicate first elements
    not_duplicates = list(map(lambda c: c[0], filter(lambda c: c[1] == 1, Counter(firsts).items())))

    # start the new_sequences with the non-duplicate lists
    new_sequences = dict(map(lambda s: (s[-1], s), filter(lambda s: s[0] in not_duplicates, sequences)))

    # get duplicate first elements
    duplicates = list(map(lambda c: c[0], filter(lambda c: c[1] > 1, Counter(firsts).items())))
    for duplicate in duplicates:
        # get all lists that start with the duplicate element
        duplicate_lists = list(filter(lambda s: s[0] == duplicate, sequences))

        # get the common elements from the duplicate lists and make it a new
        # list to add to our new_sequences dict
        repeated_sequence = sorted(list(set.intersection(*list(map(set, duplicate_lists)))))
        new_sequences[repeated_sequence[-1]] = repeated_sequence

        # get lists from where I left of
        i = len(repeated_sequence)
        sub_lists = list(filter(lambda s: len(s) > 0, map(lambda s: s[i:], duplicate_lists)))

        # recursively split them and store them in new_sequences
        new_sequences.update(split_lists(sub_lists))

    return new_sequences

Also, could you help me figure out the complexity of my algo? Recursion makes me dizzy. My best guess is O(n*m) where n is the number of lists and m the length of the longest list.

565

asked Feb 20 '18 09:02

damores

2 Answers

seperate this into logical functions:

Find out which sequences start with the same element
Find the common elements

same start:

Can be done easily with a defaultdict

from collections import defaultdict
def same_start(sequences):
    same_start = defaultdict(list)
    for seq in sequences:
        same_start[seq[0]].append(seq)
    return same_start.values()

list(same_start(sequences.values()))

[[[1, 3, 5, 6, 8, 12, 15, 17, 18],
  [1, 3, 5, 6, 9, 13, 14, 16, 19],
  [1, 3, 5, 6, 9, 13, 14, 20, 25]],
 [[0, 2, 4, 7, 11], [0, 2, 4, 10, 20]],
 [[21, 23, 26]]]

Find the common elements:

a simple generator that yields values as long as they are all the same

def get_beginning(sequences):
    for values in zip(*sequences):
        v0 = values[0]
        if not all(i == v0 for i in values):
            return
        yield v0

aggregating

def aggregate(same_start):
    for seq in same_start:
        if len(seq) < 2:
            yield  seq[0]
            continue
        start = list(get_beginning(seq))
        yield start
        yield from (i[len(start):] for i in seq)

list(aggregate(same_start(sequences.values())))

[[1, 3, 5, 6],
 [8, 12, 15, 17, 18],
 [9, 13, 14, 16, 19],
 [9, 13, 14, 20, 25],
 [0, 2, 4],
 [7, 11],
 [10, 20],
 [21, 23, 26]]

Further on

If you want to combine sequences 18 and 25, then you can do something like this

def combine(sequences):
    while True:
        s = same_start(sequences)
        if all(len(i) == 1 for i in s):
            return sequences
        sequences = tuple(aggregate(s))

{i[-1]: i for i in combine(sequences.values())}

{4: [0, 2, 4],
 6: [1, 3, 5, 6],
 11: [7, 11],
 14: [9, 13, 14],
 18: [8, 12, 15, 17, 18],
 19: [16, 19],
 20: [10, 20],
 25: [20, 25],
 26: [21, 23, 26]}

147

answered Oct 12 '22 23:10

Maarten Fabré

Using some functional tools this is what I came up with (assuming sequences are sorted). The gist is in find_longest_prefixes:

#!/usr/bin/env python
from itertools import chain, takewhile
from collections import defaultdict

sequences = {
    18: [1, 3, 5, 6, 8, 12, 15, 17, 18],
    19: [1, 3, 5, 6, 9, 13, 14, 16, 19],
    25: [1, 3, 5, 6, 9, 13, 14, 20, 25],
    11: [0, 2, 4, 7, 11],
    20: [0, 2, 4, 10, 20],
    26: [21, 23, 26],
}

def flatmap(f, it):
    return chain.from_iterable(map(f, it))

def all_items_equal(items):
    return len(set(items)) == 1

def group_by_first_item(ls):
    groups = defaultdict(list)
    for l in ls:
        groups[l[0]].append(l)
    return groups.values()

def find_longest_prefixes(ls):
    # takewhile gives us common prefix easily
    longest_prefix = list(takewhile(all_items_equal, zip(*ls)))
    if longest_prefix:
       yield tuple(vs[0] for vs in longest_prefix)
    # yield suffix per iterable
    leftovers = filter(None, tuple(l[len(longest_prefix):] for l in ls))
    leftover_groups = group_by_first_item(leftovers)
    yield from flatmap(find_longest_prefixes, leftover_groups)

# apply the prefix finder to all groups
all_sequences = find_longest_prefixes(sequences.values())

# create the result structure expected
results = {v[-1]: v for v in all_sequences}

print(results)

The value for result is then:

{4: (0, 2, 4),
 6: (1, 3, 5, 6),
 11: (7, 11),
 18: (8, 12, 15, 17, 18),
 19: (9, 13, 14, 16, 19),
 20: (10, 20),
 25: (9, 13, 14, 20, 25),
 26: (21, 23, 26)}

Note that these are tuples which I prefer for their immutability.

answered Oct 13 '22 00:10

Reut Sharabani

Related questions
                            
                                SQLite: return only top 2 results within each group
                            
                                could not broadcast input array from shape (20,310,310) into shape (20)
                            
                                How do you go from a sip.voidptr (QImage.constBits()) to a ctypes void or char pointer?
                            
                                comparing HOG feature vectors without SVM
                            
                                flask-socket.io: frequent time-outs
                            
                                Opencv Python open dng format
                            
                                Python: performing FFT on music file
                            
                                Time limit for mixed integer programming with Python PuLP
                            
                                Why do I receive "WebSocket connection is closed: code = 1000 (OK), no reason"
                            
                                Fuzzy matching a word inside a pyspark dataframe string
                            
                                Inserting list values from a list to another in a specific order in Python
                            
                                ValueError: operands could not be broadcast together with shapes in Naive bayes classifier
                            
                                why is the Numpy matrix multiplication operation called "dot"?
                            
                                django no module name 'config'
                            
                                Change marker in the legend in matplotlib
                            
                                Iterate through node attributes in networkx
                            
                                python selenium: WebDriverException: Message: chrome not reachable
                            
                                Why is random video seeks with OpenCV slow?
                            
                                Why PerformanceWarning when indexed lookup on sorted index?
                            
                                Passing datetime-like object to seaborn.lmplot

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With