I have two lists of tuples that I need to merge. This would be comparable to a JOIN in database terms. The order of the tuples in each list may change. The order of the items in the tuple will not change. The count of items in A should equal the count in B but there may be a difference.
Here are my two lists of tuples. There will be 10,000+ of these tuples in each list so performance is a concern. The first element in each tuple is the key common to each list.
listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]
The desired output is:
listC = [(u'123', u'a1', u'a2', 123, 789, u'f1', u'Y', u'f2', 2), (u'125', u'c1', u'c2', 156, 852, u'd1', u'N', u'd2', 1)]
Here's the code that I threw together for testing the concept. It works but as you can see, performance is an issue. The performance of this code when running with real data (10K items in each list) is unacceptable as it would take ,potentially, hours to complete.
Here's the code:
for row in listA:
for item in listB:
if item[0] == row[0]:
item = list(item)
del item[0]
row = list(row)
merged.append(tuple(row + item))
How can I merge / join the two lists and achieve better performance?
The most Pythonic way to merge multiple lists l0, l1, ..., ln into a list of tuples (grouping together the i -th elements) is to use the zip() function zip(l0, l1, ..., ln) .
Concatenating and Multiplying TuplesBecause the + operator can concatenate, it can be used to combine tuples to form a new tuple, though it cannot modify an existing tuple.
Concatenate Two Lists in Python In almost all simple situations, using list1 + list2 is the way you want to concatenate lists. The edge cases below are better in some situations, but + is generally the best choice. All options covered work in Python 2.3, Python 2.7, and all versions of Python 31.
Inner join two lists of tuples on the first (unique in each list) column using itertools.groupby()
suggested by @CoryKramer in the comments:
from itertools import groupby
from operator import itemgetter
def inner_join(a, b):
L = a + b
L.sort(key=itemgetter(0)) # sort by the first column
for _, group in groupby(L, itemgetter(0)):
row_a, row_b = next(group), next(group, None)
if row_b is not None: # join
yield row_a + row_b[1:] # cut 1st column from 2nd row
Example:
result = list(inner_join(listA, listB))
assert result == listC
This solution has O(n*log n)
time complexity (your solution (in the question) is O(n*n)
that is much worse for n ~ 10000
).
It doesn't matter for a small n
such as 10**4
in the question but in Python 3.5+ you could use heapq.merge()
with key
parameter to avoid allocating new list i.e., for O(1)
constant memory solution:
from heapq import merge # merge has key parameter in Python 3.5
def inner_join(a, b):
key = itemgetter(0)
a.sort(key=key)
b.sort(key=key)
for _, group in groupby(merge(a, b, key=key), key):
row_a, row_b = next(group), next(group, None)
if row_b is not None: # join
yield row_a + row_b[1:] # cut 1st column from 2nd row
Here's a dict-based solution. It is O(n)
linear in time and space algorithm:
def inner_join(a, b):
d = {}
for row in b:
d[row[0]] = row
for row_a in a:
row_b = d.get(row_a[0])
if row_b is not None: # join
yield row_a + row_b[1:]
Here's collections.defaultdict
-based solution mentioned by @Padraic Cunningham
from collections import defaultdict
from itertools import chain
def inner_join(a, b):
d = defaultdict(list)
for row in chain(a, b):
d[row[0]].append(row[1:])
for id, rows in d.iteritems():
if len(rows) > 1:
assert len(rows) == 2
yield (id,) + rows[0] + rows[1]
Have you used pandas before? This seems to give your desired output:
n [41]:
import pandas as pd
listA = [(u'123', u'a1', u'a2', 123, 789), (u'124', u'b1', u'b2', 456, 357), (u'125', u'c1', u'c2', 156, 852)]
listB = [(u'125', u'd1', u'N', u'd2', 1), (u'123', u'f1', u'Y', u'f2', 2)]
A = pd.DataFrame(listA)
B = pd.DataFrame(listB)
A.merge(B, on=0)
Out[41]:
0 1_x 2_x 3_x 4_x 1_y 2_y 3_y 4_y
0 123 a1 a2 123 789 f1 Y f2 2
1 125 c1 c2 156 852 d1 N d2 1
`A' and 'B' are pandas dataframes which have some of the SQL like functionality built into them, such as merge. If you haven't used pandas, let me know if you need further explanation.
See Database-style DataFrame joining/merging.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With