Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I merge two lists of tuples based on a key?

I have two lists of tuples that I need to merge. This would be comparable to a JOIN in database terms. The order of the tuples in each list may change. The order of the items in the tuple will not change. The count of items in A should equal the count in B but there may be a difference.

Here are my two lists of tuples. There will be 10,000+ of these tuples in each list so performance is a concern. The first element in each tuple is the key common to each list.

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

The desired output is:

listC = [(u'123', u'a1', u'a2', 123, 789, u'f1', u'Y', u'f2', 2), (u'125', u'c1', u'c2', 156, 852, u'd1', u'N', u'd2', 1)]

Here's the code that I threw together for testing the concept. It works but as you can see, performance is an issue. The performance of this code when running with real data (10K items in each list) is unacceptable as it would take ,potentially, hours to complete.

Here's the code:

for row in listA:
    for item in listB:
        if item[0] == row[0]:
            item = list(item)
            del item[0]
            row = list(row)
            merged.append(tuple(row + item))

How can I merge / join the two lists and achieve better performance?

like image 489
DenaliHardtail Avatar asked Aug 07 '15 22:08

DenaliHardtail


People also ask

How do you merge a list of tuples?

The most Pythonic way to merge multiple lists l0, l1, ..., ln into a list of tuples (grouping together the i -th elements) is to use the zip() function zip(l0, l1, ..., ln) .

Can two tuples be merged?

Concatenating and Multiplying TuplesBecause the + operator can concatenate, it can be used to combine tuples to form a new tuple, though it cannot modify an existing tuple.

How do I merge two lists into a list in Python?

Concatenate Two Lists in Python In almost all simple situations, using list1 + list2 is the way you want to concatenate lists. The edge cases below are better in some situations, but + is generally the best choice. All options covered work in Python 2.3, Python 2.7, and all versions of Python 31.


2 Answers

Inner join two lists of tuples on the first (unique in each list) column using itertools.groupby() suggested by @CoryKramer in the comments:

from itertools import groupby
from operator import itemgetter

def inner_join(a, b):
    L = a + b
    L.sort(key=itemgetter(0)) # sort by the first column
    for _, group in groupby(L, itemgetter(0)):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None: # join
            yield row_a + row_b[1:] # cut 1st column from 2nd row

Example:

result = list(inner_join(listA, listB))
assert result == listC

This solution has O(n*log n) time complexity (your solution (in the question) is O(n*n) that is much worse for n ~ 10000).

It doesn't matter for a small n such as 10**4 in the question but in Python 3.5+ you could use heapq.merge() with key parameter to avoid allocating new list i.e., for O(1) constant memory solution:

from heapq import merge # merge has key parameter in Python 3.5

def inner_join(a, b):
    key = itemgetter(0)
    a.sort(key=key) 
    b.sort(key=key)
    for _, group in groupby(merge(a, b, key=key), key):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None: # join
            yield row_a + row_b[1:] # cut 1st column from 2nd row

Here's a dict-based solution. It is O(n) linear in time and space algorithm:

def inner_join(a, b):
    d = {}
    for row in b:
        d[row[0]] = row
    for row_a in a:
        row_b = d.get(row_a[0])
        if row_b is not None: # join
            yield row_a + row_b[1:]

Here's collections.defaultdict-based solution mentioned by @Padraic Cunningham

from collections import defaultdict
from itertools import chain

def inner_join(a, b):
    d = defaultdict(list)
    for row in chain(a, b):
        d[row[0]].append(row[1:])
    for id, rows in d.iteritems():
        if len(rows) > 1:
            assert len(rows) == 2
            yield (id,) + rows[0] + rows[1]
like image 146
jfs Avatar answered Sep 19 '22 14:09

jfs


Have you used pandas before? This seems to give your desired output:

n [41]:
import pandas as pd
listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

A = pd.DataFrame(listA)
B = pd.DataFrame(listB)

A.merge(B, on=0)
Out[41]:
    0   1_x     2_x     3_x     4_x     1_y     2_y     3_y     4_y
0   123     a1  a2  123     789     f1  Y   f2  2
1   125     c1  c2  156     852     d1  N   d2  1

`A' and 'B' are pandas dataframes which have some of the SQL like functionality built into them, such as merge. If you haven't used pandas, let me know if you need further explanation.

See Database-style DataFrame joining/merging.

like image 41
JoeCondron Avatar answered Sep 19 '22 14:09

JoeCondron