I have two data frames and I need to compare the full set of row combinations and return those combinations that meet a criterion. This turns out to be too intensive for our small Spark cluster (using a cross join), so I am experimenting with this approach in pandas and will eventually see whether Dask can improve on it.
If tables A and B are
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])
Then all combinations look like this, where A-D is a computed column. Say I want to keep only the rows where A-D >= -3:
A B C D E F A-D
1 2 3 4 7 4 -3
1 2 3 6 5 1 -5
1 2 3 8 6 0 -7
4 5 6 4 7 4 0
4 5 6 6 5 1 -2
4 5 6 8 6 0 -4
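For reference, the filtered table above can be produced directly with a plain pandas cross join, then filtering. This is the memory-heavy baseline (it materializes all len(a) * len(b) rows), which is exactly what becomes infeasible at scale; the constant-key merge trick shown here is just one common way to express a cross join:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])

# Cross join via a constant join key (pandas >= 1.2 also supports how='cross'),
# then filter on the criterion.
combos = a.assign(key=1).merge(b.assign(key=1), on='key').drop(columns='key')
combos['A-D'] = combos['a'] - combos['d']
result = combos[combos['A-D'] >= -3]
# result has the three surviving rows from the table above
```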
I attempted to do this with an apply, but it appears I cannot return a multi-row DataFrame from the function (the function builds all combinations of a single row of A with the entire table B and returns the rows that meet the criterion).
Here is the function I was testing:
def return_prox_branches(a, B, cutthresh):
    aa = a['a'] - B['d']
    keep_B = B.loc[aa.values >= cutthresh].copy()
    keep_B['A'] = a['a']
    keep_B['B'] = a['b']
    keep_B['C'] = a['c']
    keep_B['A-D'] = a['a'] - keep_B['d']
    print(keep_B)
    return keep_B
a.apply(return_prox_branches, axis=1, args=(b, -3))
ValueError: cannot copy sequence with size 7 to array axis with dimension 1
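The per-row function itself works; it is apply that fails when the function returns a DataFrame with a different number of rows than it was given. One workaround (a sketch, not part of the original attempt, and slow at millions of rows because of the Python-level loop) is to skip apply, call the function in an ordinary loop, and concatenate the pieces:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])

def return_prox_branches(row, B, cutthresh):
    # Keep the rows of B whose 'd' is within cutthresh of this row's 'a'.
    keep_B = B.loc[(row['a'] - B['d']) >= cutthresh].copy()
    keep_B['A'] = row['a']
    keep_B['B'] = row['b']
    keep_B['C'] = row['c']
    keep_B['A-D'] = row['a'] - keep_B['d']
    return keep_B

# Apply the function row by row and stack the multi-row results.
result = pd.concat(
    (return_prox_branches(row, b, -3) for _, row in a.iterrows()),
    ignore_index=True,
)
```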
In actuality, these two tables have millions of rows each.
Is there a way to make this work in pandas efficiently?
The dict unpacking used below (the ** syntax inside a dict literal) became possible in Python 3.5; see the rationale in PEP 448:
https://www.python.org/dev/peps/pep-0448/#rationale
i, j = np.where(np.subtract.outer(a.a, b.d) >= -3)
pd.DataFrame({**a.iloc[i].to_dict('list'), **b.iloc[j].to_dict('list')})
a b c d e f
0 1 2 3 4 7 4
1 4 5 6 4 7 4
2 4 5 6 6 5 1
i, j = np.where(np.subtract.outer(a.a, b.d) >= -3)
a_ = a.values[i]
b_ = b.values[j]
d = pd.DataFrame(
    np.column_stack([a_, b_]),
    columns=a.columns.append(b.columns)
)
d
a b c d e f
0 1 2 3 4 7 4
1 4 5 6 4 7 4
2 4 5 6 6 5 1
In both cases we depend on an outer subtraction of b.d from a.a. This creates a 2-D array of every possible difference between a value of a.a and a value of b.d. np.where finds the coordinates where this difference is >= -3. I can use these index arrays to slice the original data frames and stitch them together.
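To make the mechanics concrete, here is the intermediate matrix and the index pairs for the toy frames above:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])

# The outer subtraction is a len(a) x len(b) matrix where
# diff[i, j] == a.a[i] - b.d[j].
diff = np.subtract.outer(a.a.values, b.d.values)
# diff == [[-3, -5, -7],
#          [ 0, -2, -4]]

# np.where on the boolean mask yields the (row-of-a, row-of-b) index
# pairs that satisfy the criterion.
i, j = np.where(diff >= -3)
# i == [0, 1, 1]; j == [0, 0, 1]
```

Note this intermediate matrix has len(a) * len(b) cells, so its memory cost is the same order as the cross join itself.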
I have my doubts that you can use this with Dask.
def gen_pseudo(d_):
    def pseudo(d):
        cols = d.columns.append(d_.columns)
        return d_.assign(**d.squeeze()).query('a - d >= -3')[cols]
    return pseudo
a.groupby(level=0).apply(gen_pseudo(b))
a b c d e f
0 0 1 2 3 4 7 4
1 0 4 5 6 4 7 4
1 4 5 6 6 5 1
def pseudo(d, d_):
    cols = d.columns.append(d_.columns)
    return d_.assign(**d.squeeze()).query('a - d >= -3')[cols]
a.groupby(level=0).apply(pseudo, d_=b)
ja = a.columns.get_loc('a')
jb = b.columns.get_loc('d')
pd.DataFrame([
    np.append(ra, rb)
    for ra in a.values
    for rb in b.values
    if ra[ja] - rb[jb] >= -3
], columns=a.columns.append(b.columns))
a b c d e f
0 1 2 3 4 7 4
1 4 5 6 4 7 4
2 4 5 6 6 5 1
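Since the real frames have millions of rows, neither the full outer-subtraction matrix nor the list comprehension will be practical in one shot. A sketch (the function name and chunk size are mine, not from the answer) that applies the outer-subtraction trick one slice of a at a time, bounding the intermediate matrix at chunk x len(b):

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]), columns=['a', 'b', 'c'])
b = pd.DataFrame(np.array([[4, 7, 4], [6, 5, 1], [8, 6, 0]]), columns=['d', 'e', 'f'])

def filtered_cross(a, b, thresh=-3, chunk=100_000):
    # Process `a` in chunks so the intermediate difference matrix is
    # at most chunk x len(b) instead of len(a) x len(b).
    pieces = []
    for start in range(0, len(a), chunk):
        part = a.iloc[start:start + chunk]
        i, j = np.where(np.subtract.outer(part.a.values, b.d.values) >= thresh)
        pieces.append(pd.DataFrame(
            np.column_stack([part.values[i], b.values[j]]),
            columns=a.columns.append(b.columns),
        ))
    return pd.concat(pieces, ignore_index=True)

out = filtered_cross(a, b)
```

This keeps the vectorized inner step while capping peak memory; the chunk size would be tuned to the machine.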