I have two lists of tuples that I need to merge. This would be comparable to a JOIN in database terms. The order of the tuples in each list may change. The order of the items in the tuple will not change. The count of items in A should equal the count in B but there may be a difference. Here are my two lists of tuples. There will be 10,000+ of these tuples in each list so performance is a concern. The first element in each tuple is the key common to each list. <pre class="prettyprint"><code>listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)] listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)] </code></pre> The desired output is: <pre class="prettyprint"><code>listC = [(u'123', u'a1', u'a2', 123, 789, u'f1', u'Y', u'f2', 2), (u'125', u'c1', u'c2', 156, 852, u'd1', u'N', u'd2', 1)] </code></pre> Here's the code that I threw together for testing the concept. It works but as you can see, performance is an issue. The performance of this code when running with real data (10K items in each list) is unacceptable as it would take ,potentially, hours to complete. Here's the code: <pre class="prettyprint"><code>for row in listA: for item in listB: if item[0] == row[0]: item = list(item) del item[0] row = list(row) merged.append(tuple(row + item)) </code></pre> How can I merge / join the two lists and achieve better performance?

Inner join two lists of tuples on the first (unique in each list) column using <code>itertools.groupby()</code> suggested by @CoryKramer in the comments: <pre class="prettyprint"><code>from itertools import groupby from operator import itemgetter def inner_join(a, b): L = a + b L.sort(key=itemgetter(0)) # sort by the first column for _, group in groupby(L, itemgetter(0)): row_a, row_b = next(group), next(group, None) if row_b is not None: # join yield row_a + row_b[1:] # cut 1st column from 2nd row </code></pre> Example: <pre class="prettyprint"><code>result = list(inner_join(listA, listB)) assert result == listC </code></pre> This solution has <code>O(n*log n)</code> time complexity (your solution (in the question) is <code>O(n*n)</code> that is much worse for <code>n ~ 10000</code>). It doesn't matter for a small <code>n</code> such as <code>10**4</code> in the question but in Python 3.5+ you could use <code>heapq.merge()</code> with <code>key</code> parameter to avoid allocating new list i.e., for <code>O(1)</code> constant memory solution: <pre class="prettyprint"><code>from heapq import merge # merge has key parameter in Python 3.5 def inner_join(a, b): key = itemgetter(0) a.sort(key=key) b.sort(key=key) for _, group in groupby(merge(a, b, key=key), key): row_a, row_b = next(group), next(group, None) if row_b is not None: # join yield row_a + row_b[1:] # cut 1st column from 2nd row </code></pre> Here's a dict-based solution. It is <code>O(n)</code> linear in time and space algorithm: <pre class="prettyprint"><code>def inner_join(a, b): d = {} for row in b: d[row[0]] = row for row_a in a: row_b = d.get(row_a[0]) if row_b is not None: # join yield row_a + row_b[1:] </code></pre> Here's <code>collections.defaultdict</code>-based solution mentioned by @Padraic Cunningham <pre class="prettyprint"><code>from collections import defaultdict from itertools import chain def inner_join(a, b): d = defaultdict(list) for row in chain(a, b): d[row[0]].append(row[1:]) for id, rows in d.iteritems(): if len(rows) > 1: assert len(rows) == 2 yield (id,) + rows[0] + rows[1] </code></pre>

How do I merge two lists of tuples based on a key?

Tags:

python

algorithm

python-2.7

I have two lists of tuples that I need to merge. This would be comparable to a JOIN in database terms. The order of the tuples in each list may change. The order of the items in the tuple will not change. The count of items in A should equal the count in B but there may be a difference.

Here are my two lists of tuples. There will be 10,000+ of these tuples in each list so performance is a concern. The first element in each tuple is the key common to each list.

listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

The desired output is:

listC = [(u'123', u'a1', u'a2', 123, 789, u'f1', u'Y', u'f2', 2), (u'125', u'c1', u'c2', 156, 852, u'd1', u'N', u'd2', 1)]

Here's the code that I threw together for testing the concept. It works but as you can see, performance is an issue. The performance of this code when running with real data (10K items in each list) is unacceptable as it would take ,potentially, hours to complete.

Here's the code:

for row in listA:
    for item in listB:
        if item[0] == row[0]:
            item = list(item)
            del item[0]
            row = list(row)
            merged.append(tuple(row + item))

How can I merge / join the two lists and achieve better performance?

489

asked Aug 07 '15 22:08

DenaliHardtail

2 Answers

Inner join two lists of tuples on the first (unique in each list) column using itertools.groupby() suggested by @CoryKramer in the comments:

from itertools import groupby
from operator import itemgetter

def inner_join(a, b):
    L = a + b
    L.sort(key=itemgetter(0)) # sort by the first column
    for _, group in groupby(L, itemgetter(0)):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None: # join
            yield row_a + row_b[1:] # cut 1st column from 2nd row

Example:

result = list(inner_join(listA, listB))
assert result == listC

This solution has O(n*log n) time complexity (your solution (in the question) is O(n*n) that is much worse for n ~ 10000).

It doesn't matter for a small n such as 10**4 in the question but in Python 3.5+ you could use heapq.merge() with key parameter to avoid allocating new list i.e., for O(1) constant memory solution:

from heapq import merge # merge has key parameter in Python 3.5

def inner_join(a, b):
    key = itemgetter(0)
    a.sort(key=key) 
    b.sort(key=key)
    for _, group in groupby(merge(a, b, key=key), key):
        row_a, row_b = next(group), next(group, None)
        if row_b is not None: # join
            yield row_a + row_b[1:] # cut 1st column from 2nd row

Here's a dict-based solution. It is O(n) linear in time and space algorithm:

def inner_join(a, b):
    d = {}
    for row in b:
        d[row[0]] = row
    for row_a in a:
        row_b = d.get(row_a[0])
        if row_b is not None: # join
            yield row_a + row_b[1:]

Here's collections.defaultdict-based solution mentioned by @Padraic Cunningham

from collections import defaultdict
from itertools import chain

def inner_join(a, b):
    d = defaultdict(list)
    for row in chain(a, b):
        d[row[0]].append(row[1:])
    for id, rows in d.iteritems():
        if len(rows) > 1:
            assert len(rows) == 2
            yield (id,) + rows[0] + rows[1]

146

answered Sep 19 '22 14:09

jfs

Have you used pandas before? This seems to give your desired output:

n [41]:
import pandas as pd
listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]

A = pd.DataFrame(listA)
B = pd.DataFrame(listB)

A.merge(B, on=0)
Out[41]:
    0   1_x     2_x     3_x     4_x     1_y     2_y     3_y     4_y
0   123     a1  a2  123     789     f1  Y   f2  2
1   125     c1  c2  156     852     d1  N   d2  1

`A' and 'B' are pandas dataframes which have some of the SQL like functionality built into them, such as merge. If you haven't used pandas, let me know if you need further explanation.

See Database-style DataFrame joining/merging.

answered Sep 19 '22 14:09

JoeCondron

Related questions
                            
                                Iterating the streaming_content attribute on django FileResponse
                            
                                backref lazy='dynamic' - does not support object population - eager loading cannot be applied
                            
                                Tell PyCharm code generated fields of class
                            
                                how to handle long path name for pep8 compliance?
                            
                                name exit is not defined in python
                            
                                In a loop in Python, I assign a new instance of a class to the same variable, but it keeps pointing to the old instance? [duplicate]
                            
                                How to create dummy data with different datatypes in python? [closed]
                            
                                Why is (numpy.nan, 1) == (numpy.nan, 1)?
                            
                                Emitting signals from a Python thread using QObject
                            
                                How to let pyflakes ignore some errors?
                            
                                Difference between functions generating random numbers in numpy
                            
                                Efficient & Pythonic way of finding all possible sublists of a list in given range and the minimum product after multipying all elements in them?
                            
                                Pandas: break categorical column to multiple columns
                            
                                Timezone offset sign reversed by dateutil?
                            
                                sqlAlchemy does not recognise changes to DB made outside of session
                            
                                Sum of float numbers in a list in Python
                            
                                PyCharm: multiple projects in same window, independent version control
                            
                                A list as a key for PySpark's reduceByKey
                            
                                working of \n in python [duplicate]
                            
                                How to stop Python Kafka Consumer in program?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With