I have two large datasets (70K to 110K items each). I want to correlate/compare both and find which items from set2 can be found in set1, based on some conditions/criteria.
My current strategy is to sort both lists by common fields, then run nested for loops, perform conditional if tests, and aggregate a predefined dict with the items that were found and those that did not match.
Example:
import pandas as pd
list1 = [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 65},
         {'a': 31, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
         {'a': 70, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
         {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 70}]
list2 = [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 65},
         {'a': 145, 'b': '108', 'c': '123', 'd': '84', 'e': 3},
         {'a': 113, 'b': '144', 'c': '183', 'd': '7', 'e': 12},
         {'a': 144, 'b': '60', 'c': '46', 'd': '106', 'e': 148},
         {'a': 57, 'b': '87', 'c': '51', 'd': '95', 'e': 187},
         {'a': 41, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
         {'a': 80, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
         {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 70},
         {'a': 107, 'b': '95', 'c': '81', 'd': '15', 'e': 25},
         {'a': 138, 'b': '97', 'c': '38', 'd': '28', 'e': 171}]
re_dict = {'found': [], 'alien': []}
for L2 in list2:
    for L1 in list1:
        if (L1['a']-5 <= L2['a'] <= L2['a']+10) and L2['c'][-1:] in L1['c'][-1:]:
            if 65 <= L2['e'] <= 75:
                L2.update({'e': 'some value'})
            re_dict['found'].append(L2)
            list1.remove(L1)  # a matched item cannot match twice
            break  # break out of the inner loop
    else:  # the inner loop traversed the entire list: no match
        re_dict['alien'].append(L2)
The above yields the desired results:
re_dict
{'alien': [{'a': 145, 'b': '108', 'c': '123', 'd': '84', 'e': 3},
           {'a': 113, 'b': '144', 'c': '183', 'd': '7', 'e': 12},
           {'a': 57, 'b': '87', 'c': '51', 'd': '95', 'e': 187},
           {'a': 41, 'b': '12', 'c': '26', 'd': '99', 'e': 71},
           {'a': 107, 'b': '95', 'c': '81', 'd': '15', 'e': 25},
           {'a': 138, 'b': '97', 'c': '38', 'd': '28', 'e': 171}],
 'found': [{'a': 56, 'b': '38', 'c': '11', 'd': '10', 'e': 'some value'},
           {'a': 144, 'b': '60', 'c': '46', 'd': '106', 'e': 148},
           {'a': 80, 'b': '49', 'c': '40', 'd': '227', 'e': 1},
           {'a': 3, 'b': '85', 'c': '32', 'd': '46', 'e': 'some value'}]}
So it does the job, but it is obviously not very efficient, and this seems like an ideal job for pandas.
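For reference, even in pure Python the quadratic scan can be cut down by bucketing list1 on the exact-match part of the criteria (the last character of 'c'), so each row of list2 only scans real candidates. A sketch, keeping the rest of the matching logic as above:

from collections import defaultdict

# sketch: index list1 by the exact-match part of the key (the last
# character of 'c'); the 'a' range test still runs per candidate
buckets = defaultdict(list)
for L1 in list1:
    buckets[L1['c'][-1:]].append(L1)

for L2 in list2:
    for L1 in buckets[L2['c'][-1:]]:
        ...  # apply the 'a' and 'e' tests, then append/remove/break as above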
I think it would be ideal if I could merge/join the two DataFrames, but I can't figure out how to merge on the complex criterion. Also, my datasets are not equal in size.
Example:
df1 = pd.DataFrame(list1)
df2 = pd.DataFrame(list2)
pd.merge(df1,df2,on='d',how='outer')
   a_x  b_x  c_x    d  e_x  a_y  b_y  c_y  e_y
0   56   38   11   10   65   56   38   11   65
1   31   12   26   99   71   41   12   26   71
2   70   49   40  227    1   80   49   40    1
3    3   85   32   46   70    3   85   32   70
4  NaN  NaN  NaN   84  NaN  145  108  123    3
5  NaN  NaN  NaN    7  NaN  113  144  183   12
6  NaN  NaN  NaN  106  NaN  144   60   46  148
7  NaN  NaN  NaN   95  NaN   57   87   51  187
8  NaN  NaN  NaN   15  NaN  107   95   81   25
9  NaN  NaN  NaN   28  NaN  138   97   38  171
It merges only when, say, the d column is exactly equal in both df1 and df2. What I would prefer is to be able to define a range, say df2['d']-5 <= df1['d'] <= df2['d']+5: rows passing that test are candidates to be merged, and only if the test fails are the df1 columns filled with NaN (as in the example above). This way, in several steps, I could mimic my nested for loops, and hopefully that would be quicker?
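For illustration, here is a minimal sketch of that range test using pd.merge_asof (assuming pandas >= 0.20 for direction='nearest'); note it pairs each df2 row with at most one nearest df1 row within the tolerance, not every candidate:

import pandas as pd

# sketch: merge_asof joins each left row to the single nearest right key
# within `tolerance`, leaving the right columns NaN otherwise;
# both frames must be sorted on the (numeric) join key
d1 = df1.assign(d=df1['d'].astype(int)).sort_values('d')
d2 = df2.assign(d=df2['d'].astype(int)).sort_values('d')

near = pd.merge_asof(d2, d1, on='d', direction='nearest',
                     tolerance=5, suffixes=('_2', '_1'))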
Any suggestion/hint/example would be greatly appreciated.
Thanks
pandas currently lacks direct support for "nearby" queries, though I have a pull request up to add some basic functionality (not enough for your use-case).
Fortunately, the scientific Python ecosystem gives you the tools you need to do this yourself.
The efficient way to join on nearby locations is to use a tree data structure, as described nicely in the scikit-learn documentation. Both SciPy and scikit-learn have suitable KDTree implementations.
It's not easy (or efficient) to use entirely ad-hoc rules, but you can do nearest-neighbor lookups efficiently as long as you have a well-defined distance metric. I believe scikit-learn's BallTree even lets you define your own distance metric (see the sketch at the end), but we'll stick to normal Euclidean distance to continue your example:
from scipy.spatial import cKDTree as KDTree
import pandas as pd

# for each row in df2, we want to join the nearest row in df1
# based on the column "d"
join_cols = ['d']

# build the tree on df1 (the numeric strings in 'd' are cast to float)
tree = KDTree(df1[join_cols])

# for every df2 row, the distance to and index of its nearest df1 row
distance, indices = tree.query(df2[join_cols])

# line up the matched df1 rows with df2 and concatenate side by side
df1_near_2 = df1.take(indices).reset_index(drop=True)
left = df1_near_2.rename(columns=lambda l: 'x_' + l)
right = df2.rename(columns=lambda l: 'y_' + l)
merged = pd.concat([left, right], axis=1)
This results in:
   x_a x_b x_c  x_d  x_e  y_a  y_b  y_c  y_d  y_e
0   56  38  11   10   65   56   38   11   10   65
1   31  12  26   99   71  145  108  123   84    3
2   56  38  11   10   65  113  144  183    7   12
3   31  12  26   99   71  144   60   46  106  148
4   31  12  26   99   71   57   87   51   95  187
5   31  12  26   99   71   41   12   26   99   71
6   70  49  40  227    1   80   49   40  227    1
7    3  85  32   46   70    3   85   32   46   70
8   56  38  11   10   65  107   95   81   15   25
9   56  38  11   10   65  138   97   38   28  171
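Since query also returns how far away each nearest match is, you can recover a found/alien split like your re_dict by thresholding the distance (a sketch; max_dist is a hypothetical tolerance mirroring your +/- 5 rule):

# hypothetical tolerance: treat matches farther than this as non-matches
max_dist = 5.0
found_mask = distance <= max_dist

found = merged[found_mask]   # df2 rows with a close-enough df1 partner
alien = df2[~found_mask]     # df2 rows with no df1 row within max_dist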
If you want to merge based on nearness for multiple columns, it's as simple as setting join_cols = ['a', 'd', 'e'] (numeric columns that actually exist in your frames; with Euclidean distance you may also want to rescale the columns so no single one dominates).
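And if you really do need a custom rule, here is a minimal sketch of a user-defined metric with scikit-learn's BallTree (callable metrics are supported there, at a significant speed cost, and the function must be a true distance for the tree to stay correct); d_distance is a hypothetical example, and df1/df2 are the frames from the question:

from sklearn.neighbors import BallTree

def d_distance(x, y):
    # x and y are 1-D arrays of the join columns (here just 'd')
    return abs(x[0] - y[0])

X1 = df1[['d']].astype(float).to_numpy()
X2 = df2[['d']].astype(float).to_numpy()

tree = BallTree(X1, metric=d_distance)  # callable metrics are allowed
dist, idx = tree.query(X2, k=1)         # nearest df1 row per df2 row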