
Merge/Join 2 DataFrames by complex criteria

I have 2 large datasets (roughly 70K to 110K rows each). I want to compare them and find which items from set2 can be found in set1, based on some conditions/criteria.

My current strategy is to sort both lists by common fields, then run nested for loops with conditional tests, aggregating a predefined dict with the items that were found and those that did not match.

Example:

import pandas as pd

list1 = [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 65},
         {'a': 31, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
         {'a': 70, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
         {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 70},]
list2 = [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 65},
         {'a': 145, 'b': '108', 'c': '123', 'd': '84', 'e': 3},
         {'a': 113, 'b': '144', 'c': '183', 'd': '7', 'e': 12},
         {'a': 144, 'b': '60', 'c': '46', 'd': '106', 'e': 148},
         {'a': 57, 'b': '87', 'c': '51', 'd': '95', 'e': 187},
         {'a': 41, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
         {'a': 80, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
         {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 70},
         {'a': 107, 'b': '95', 'c': '81', 'd': '15', 'e': 25},
         {'a': 138, 'b': '97', 'c': '38', 'd': '28', 'e': 171}]

re_dict = {'found': [], 'alien': []}

for L2 in list2:
    for L1 in list1:
        # candidate test: 'a' within range and last character of 'c' matches
        if (L1['a'] - 5 <= L2['a'] <= L2['a'] + 10) and L2['c'][-1:] in L1['c'][-1:]:
            if 65 <= L2['e'] <= 75:
                L2.update({'e': 'some value'})
            re_dict['found'].append(L2)
            list1.remove(L1)  # each item in list1 may be matched only once
            break  # break out of the inner loop
    else:  # the inner loop traversed the entire list: no match found
        re_dict['alien'].append(L2)

The above yields the desired results:

re_dict
{'alien': [{'a': 145, 'b': '108', 'c': '123', 'd': '84', 'e': 3},
  {'a': 113, 'b': '144', 'c': '183', 'd': '7', 'e': 12},
  {'a': 57, 'b': '87', 'c': '51', 'd': '95', 'e': 187},
  {'a': 41, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
  {'a': 107, 'b': '95', 'c': '81', 'd': '15', 'e': 25},
  {'a': 138, 'b': '97', 'c': '38', 'd': '28', 'e': 171}],
 'found': [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 'some value'},
  {'a': 144, 'b': '60', 'c': '46', 'd': '106', 'e': 148},
  {'a': 80, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
  {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 'some value'}]}

So it does the job, but it is obviously not very efficient, and it seems like an ideal job for pandas.

I think it would be ideal if I could merge/join the two DataFrames, but I can't figure out how to merge on a complex criterion. Also, my datasets are not equal in size.

Example:

df1 = pd.DataFrame(list1)
df2 = pd.DataFrame(list2)

pd.merge(df1, df2, on='d', how='outer')
   a_x  b_x  c_x    d  e_x  a_y  b_y  c_y  e_y
0   56   38   11   10   65   56   38   11   65
1   31   12   26   99   71   41   12   26   71
2   70   49   40  227    1   80   49   40    1
3    3   85   32   46   70    3   85   32   70
4  NaN  NaN  NaN   84  NaN  145  108  123    3
5  NaN  NaN  NaN    7  NaN  113  144  183   12
6  NaN  NaN  NaN  106  NaN  144   60   46  148
7  NaN  NaN  NaN   95  NaN   57   87   51  187
8  NaN  NaN  NaN   15  NaN  107   95   81   25
9  NaN  NaN  NaN   28  NaN  138   97   38  171

It merges only when, say, column d is exactly equal in both df1 and df2. What I would prefer is to be able to define a range: if df2['d']-5 <= df1['d'] <= df2['d']+5, the rows in the two DataFrames are still candidates to be merged; only if the test fails should the df1 columns be filled with NaN (as in the example above).

This way, in several steps, I could mimic my nested for loops, and hopefully that would be quicker?
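The closest I can get on my own is a full cross join followed by a vectorized filter (a sketch; the _k column is just a throwaway join key I made up, and I cast the string column to int for the comparison):

# brute-force sketch: cross join df1 and df2 via a constant key, then keep
# only the pairs that pass the +/-5 range test on column "d"
cross = pd.merge(df1.assign(_k=1), df2.assign(_k=1),
                 on='_k', suffixes=('_x', '_y')).drop('_k', axis=1)

d1 = cross['d_x'].astype(int)
d2 = cross['d_y'].astype(int)
candidates = cross[(d2 - 5 <= d1) & (d1 <= d2 + 5)]

But that materializes all len(df1) * len(df2) pairs, which is why I'm hoping for something smarter.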

Any suggestion/hint/example would be greatly appreciated.

Thanks

asked Feb 11 '23 by NarūnasK


1 Answer

pandas currently lacks direct support for "nearby" queries, though I have a pull request up to add some basic functionality (not enough for your use-case).

Fortunately, the scientific Python ecosystem gives you the tools you need to do this yourself.

The efficient way to join on nearby locations is to use a tree data structure, as described nicely in the scikit-learn documentation. Both SciPy and scikit-learn have suitable KDTree implementations.
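To make the tree API concrete, here is a toy sketch (the points are just the d values hand-copied from your example; scikit-learn's KDTree has essentially the same build/query shape as SciPy's):

import numpy as np
from sklearn.neighbors import KDTree

# toy data: the "d" values from df1 (reference set) and df2 (queries)
X = np.array([[10.0], [99.0], [227.0], [46.0]])
queries = np.array([[10.0], [84.0], [7.0], [106.0], [95.0],
                    [99.0], [227.0], [46.0], [15.0], [28.0]])

tree = KDTree(X)                      # build once over the reference set
dist, ind = tree.query(queries, k=1)  # nearest neighbor for every query
# ind[i] is the positional index into X of the row nearest to queries[i];
# dist[i] is the corresponding Euclidean distance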

It's not easy (or efficient) to use entirely ad-hoc rules, but you can do nearest-neighbor lookups efficiently as long as you have a well-defined distance metric. I believe scikit-learn's KDTree even lets you define your own distance metric, but we'll stick with plain Euclidean distance to continue your example:

from scipy.spatial import cKDTree as KDTree
import pandas as pd

# for each row in df2, we want to join the nearest row in df1
# based on the column "d" (stored as strings above, so cast to float first)
join_cols = ['d']
tree = KDTree(df1[join_cols].astype(float))
distance, indices = tree.query(df2[join_cols].astype(float))
df1_near_2 = df1.take(indices).reset_index(drop=True)

# prefix the column names so the two sides stay distinguishable,
# then stitch them together side by side
left = df1_near_2.rename(columns=lambda l: 'x_' + l)
right = df2.rename(columns=lambda l: 'y_' + l)
merged = pd.concat([left, right], axis=1)

This results in:

   x_a x_b x_c  x_d  x_e  y_a  y_b  y_c  y_d  y_e
0   56  38  11   10   65   56   38   11   10   65
1   31  12  26   99   71  145  108  123   84    3
2   56  38  11   10   65  113  144  183    7   12
3   31  12  26   99   71  144   60   46  106  148
4   31  12  26   99   71   57   87   51   95  187
5   31  12  26   99   71   41   12   26   99   71
6   70  49  40  227    1   80   49   40  227    1
7    3  85  32   46   70    3   85   32   46   70
8   56  38  11   10   65  107   95   81   15   25
9   56  38  11   10   65  138   97   38   28  171

If you want to merge based on nearness in multiple columns, it's as simple as setting join_cols = ['c', 'd', 'e'].
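And if you want something like your found/alien split back, one option is to threshold the distances that query already returned (a sketch; the cutoff of 5 mirrors the ±5 tolerance from your question, so it won't reproduce your exact output, which also tests column c):

# distance comes from tree.query above: the gap between each df2 row and
# its nearest df1 row on "d"; rows within the tolerance count as "found"
within = distance <= 5
found = merged[within]
alien = merged[~within]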

answered Feb 13 '23 by shoyer