I have two data frames and I need to compare the full set of row combinations and return those combinations that meet a criterion. This turns out to be too intensive for our small Spark cluster (using a cross join), so I am experimenting with this approach in pandas and will eventually see whether Dask can improve on it.
If tables A and B are
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])
Then all combinations look like this, where A-D is a computed column. Say I want to keep only the rows where A-D >= -3:
A B C D E F A-D
1 2 3 4 7 4 -3
1 2 3 6 5 1 -5
1 2 3 8 6 0 -7
4 5 6 4 7 4 0
4 5 6 6 5 1 -2
4 5 6 8 6 0 -4
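For reference, the filtered table above can be produced directly with a plain pandas cross join, then filtering. This is the memory-heavy baseline (it materializes all len(a) * len(b) rows), which is exactly what becomes infeasible at scale; the constant-key merge trick shown here is just one common way to express a cross join:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])

# Cross join via a constant join key (pandas >= 1.2 also supports how='cross'),
# then filter on the criterion.
combos = a.assign(key=1).merge(b.assign(key=1), on='key').drop(columns='key')
combos['A-D'] = combos['a'] - combos['d']
result = combos[combos['A-D'] >= -3]
# result has the three surviving rows from the table above
```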
I attempted to do this with an apply, but it appears I cannot return a multi-row DataFrame from the function (the function builds all combinations of a single row of A with the entire table B and returns the rows that meet the criterion).
Here is the function I was testing:
def return_prox_branches(a, B, cutthresh):
    aa = a['a'] - B['d']
    keep_B = B.loc[aa.values >= cutthresh].copy()
    keep_B['A'] = a['a']
    keep_B['B'] = a['b']
    keep_B['C'] = a['c']
    keep_B['A-D'] = a['a'] - keep_B['d']
    print(keep_B)
    return keep_B
a.apply(return_prox_branches, axis=1, args=(b, -3))
ValueError: cannot copy sequence with size 7 to array axis with dimension 1
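The per-row function itself works; it is apply that fails when the function returns a DataFrame with a different number of rows than it was given. One workaround (a sketch, not part of the original attempt, and slow at millions of rows because of the Python-level loop) is to skip apply, call the function in an ordinary loop, and concatenate the pieces:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])

def return_prox_branches(row, B, cutthresh):
    # Keep the rows of B whose 'd' is within cutthresh of this row's 'a'.
    keep_B = B.loc[(row['a'] - B['d']) >= cutthresh].copy()
    keep_B['A'] = row['a']
    keep_B['B'] = row['b']
    keep_B['C'] = row['c']
    keep_B['A-D'] = row['a'] - keep_B['d']
    return keep_B

# Apply the function row by row and stack the multi-row results.
result = pd.concat(
    (return_prox_branches(row, b, -3) for _, row in a.iterrows()),
    ignore_index=True,
)
```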
In actuality, these two tables have millions of rows each.
Is there a way to make this work in pandas efficiently?
The dict unpacking used below (the ** syntax inside a dict literal) became possible in Python 3.5; see the rationale in PEP 448:
https://www.python.org/dev/peps/pep-0448/#rationale
i, j = np.where(np.subtract.outer(a.a, b.d) >= -3)
pd.DataFrame({**a.iloc[i].to_dict('list'), **b.iloc[j].to_dict('list')})
a b c d e f
0 1 2 3 4 7 4
1 4 5 6 4 7 4
2 4 5 6 6 5 1
i, j = np.where(np.subtract.outer(a.a, b.d) >= -3)
a_ = a.values[i]
b_ = b.values[j]
d = pd.DataFrame(
    np.column_stack([a_, b_]),
    columns=a.columns.append(b.columns)
)
d
a b c d e f
0 1 2 3 4 7 4
1 4 5 6 4 7 4
2 4 5 6 6 5 1
In both cases we depend on an outer subtraction of b.d from a.a. This creates a 2-D array of every possible difference between a value of a.a and a value of b.d. np.where finds the coordinates where this difference is >= -3. I can use these index arrays to slice the original data frames and stitch them together.
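To make the mechanics concrete, here is the intermediate matrix and the index pairs for the toy frames above:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])

# The outer subtraction is a len(a) x len(b) matrix where
# diff[i, j] == a.a[i] - b.d[j].
diff = np.subtract.outer(a.a.values, b.d.values)
# diff == [[-3, -5, -7],
#          [ 0, -2, -4]]

# np.where on the boolean mask yields the (row-of-a, row-of-b) index
# pairs that satisfy the criterion.
i, j = np.where(diff >= -3)
# i == [0, 1, 1]; j == [0, 0, 1]
```

Note this intermediate matrix has len(a) * len(b) cells, so its memory cost is the same order as the cross join itself.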
I have my doubts that you can use this with Dask.
def gen_pseudo(d_):
    def pseudo(d):
        cols = d.columns.append(d_.columns)
        return d_.assign(**d.squeeze()).query('a - d >= -3')[cols]
    return pseudo
a.groupby(level=0).apply(gen_pseudo(b))
a b c d e f
0 0 1 2 3 4 7 4
1 0 4 5 6 4 7 4
1 4 5 6 6 5 1
def pseudo(d, d_):
    cols = d.columns.append(d_.columns)
    return d_.assign(**d.squeeze()).query('a - d >= -3')[cols]
a.groupby(level=0).apply(pseudo, d_=b)
ja = a.columns.get_loc('a')
jb = b.columns.get_loc('d')
pd.DataFrame([
    np.append(ra, rb)
    for ra in a.values
    for rb in b.values
    if ra[ja] - rb[jb] >= -3
], columns=a.columns.append(b.columns))
a b c d e f
0 1 2 3 4 7 4
1 4 5 6 4 7 4
2 4 5 6 6 5 1
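Since the real frames have millions of rows, neither the full outer-subtraction matrix nor the list comprehension will be practical in one shot. A sketch (the function name and chunk size are mine, not from the answer) that applies the outer-subtraction trick one slice of a at a time, bounding the intermediate matrix at chunk x len(b):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])

def filtered_cross(a, b, thresh=-3, chunk=100_000):
    # Process `a` in chunks so the intermediate difference matrix is
    # at most chunk x len(b) instead of len(a) x len(b).
    pieces = []
    for start in range(0, len(a), chunk):
        part = a.iloc[start:start + chunk]
        i, j = np.where(np.subtract.outer(part.a.values, b.d.values) >= thresh)
        pieces.append(pd.DataFrame(
            np.column_stack([part.values[i], b.values[j]]),
            columns=a.columns.append(b.columns),
        ))
    return pd.concat(pieces, ignore_index=True)

out = filtered_cross(a, b)
```

This keeps the vectorized inner step while capping peak memory; the chunk size would be tuned to the machine.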