
Pandas using apply lambda with two different operators

This question is very similar to one I posted before, with just one change. Instead of using only the absolute difference for all the columns, I also want to compare magnitudes for the 'z' column: if the current z is more than 1.1x the previous z, then keep the row.

(More context on the problem: see my earlier question, "Pandas using the previous rank values to filter out current row".)

import pandas as pd

df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})
print(df)
#    rank  x  y     z
# 0     1  0  0  1.00
# 1     1  3  4  3.00
# 2     2  0  0  1.20
# 3     2  3  4  3.25
# 4     3  4  5  3.00
# 5     3  2  5  6.00

Here's what I want the output to be:

output = pd.DataFrame({
    'rank': [1, 1, 2, 3],
    'x': [0, 3, 0, 2],
    'y': [0, 4, 0, 5],
    'z': [1, 3, 1.2, 6],
})
print(output)
#    rank  x  y    z
# 0     1  0  0  1.0
# 1     1  3  4  3.0
# 2     2  0  0  1.2
# 3     3  2  5  6.0

Basically, I want to remove a row if the previous rank has any row whose x and y are both within ±1 of the current row's x and y AND the current row's z is less than 1.1 times that previous row's z.

So, given the rows in rank 1, ANY row in rank 2 with x in [-1, 1], y in [-1, 1] and z < 1.1, OR with x in [2, 4], y in [3, 5] and z < 3.3, should be removed.
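
As a minimal sketch of the rule described above, here it is written as a plain predicate on a single pair of rows. The function name `should_drop` is hypothetical, and the strict `<` on z is an assumption taken from the wording of the question (the answers below use `<=`):

```python
def should_drop(current, previous):
    """Return True if `current` (x, y, z) should be removed because of `previous` (x, y, z).

    Hypothetical helper: x and y must be within +-1 of the previous row,
    and the current z must be below 1.1 times the previous z.
    """
    cx, cy, cz = current
    px, py, pz = previous
    return abs(cx - px) <= 1 and abs(cy - py) <= 1 and cz < 1.1 * pz


# rank 2 row (3, 4, 3.25) against rank 1 row (3, 4, 3): the z limit is 3.3
print(should_drop((3, 4, 3.25), (3, 4, 3)))  # True
# rank 2 row (0, 0, 1.2) against rank 1 row (0, 0, 1): the z limit is 1.1
print(should_drop((0, 0, 1.2), (0, 0, 1)))   # False
```

A row is removed as soon as this holds for at least one row of the previous rank.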

mike_gundy123 asked Sep 16 '21 16:09


2 Answers

Here's a solution using numpy broadcasting:

# Initially, no row is dropped
df['drop'] = False

for r in range(df['rank'].min(), df['rank'].max()):
    # Find the x_min, x_max, y_min, y_max, z_max of the current rank
    cond = df['rank'] == r
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T
    x_min, x_max = x + [[-1], [1]] # use numpy broadcasting to ±1 in one command
    y_min, y_max = y + [[-1], [1]]
    z_max        = z * 1.1

    # Find the x, y, z of the next rank. Raise them by one dimension
    # so that we can build a comparison matrix against x_min, x_max, ...
    cond = df['rank'] == r + 1
    if not cond.any():
        continue
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T[:, :, None]

    # Condition to drop a row
    drop = (
        (x_min <= x) & (x <= x_max) &
        (y_min <= y) & (y <= y_max) &
        (z <= z_max)
    ).any(axis=1)
    df.loc[cond, 'drop'] = drop

# Result
df[~df['drop']]
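
To make the broadcasting step in the loop easier to follow, here is a small standalone sketch on the x column only. The array `nxt` and its values are made up for illustration; only the `x + [[-1], [1]]` idiom and the `[:, None]` reshaping come from the code above:

```python
import numpy as np

x = np.array([0, 3])              # x values of the previous rank (from the question)
x_min, x_max = x + [[-1], [1]]    # (2, 1) list broadcast against (2,) -> two rows
print(x_min.tolist())  # [-1, 2]
print(x_max.tolist())  # [1, 4]

# Hypothetical x values for the next rank, reshaped to a column of shape (3, 1)
nxt = np.array([0, 3, 7])[:, None]

# inside[i, j] is True when next-rank row i lies within +-1 of previous-rank row j
inside = (x_min <= nxt) & (nxt <= x_max)
print(inside.shape)                 # (3, 2)
print(inside.any(axis=1).tolist())  # [True, True, False]
```

The `.any(axis=1)` at the end collapses the comparison matrix to one flag per next-rank row, which is what the loop writes into the 'drop' column.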

Condensed

An even more condensed version (and likely faster). This is a really good way to puzzle your future teammates when they read the code:

r, x, y, z = df[['rank', 'x', 'y', 'z']].T.to_numpy()
rr, xx, yy, zz = [col[:,None] for col in [r, x, y, z]]

drop = (
    (rr == r + 1) &
    (x-1 <= xx) & (xx <= x+1) &
    (y-1 <= yy) & (yy <= y+1) &
    (zz <= z*1.1)
).any(axis=1)

# Result
df[~drop]

What this does is compare every row in df against every row (including itself) and return True (i.e. drop) if:

  • The current row's rank == the other row's rank + 1; and
  • The current row's x, y, z fall within the specified range of the other row's x, y, z
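
The first condition can be looked at in isolation. The sketch below rebuilds only the rank comparison from the condensed version, using the 'rank' column from the question; the name `match` is introduced here for illustration:

```python
import numpy as np

r = np.array([1, 1, 2, 2, 3, 3])  # the 'rank' column from the question
rr = r[:, None]                    # same values as a column, shape (6, 1)

# match[i, j] is True when row i's rank == row j's rank + 1
match = rr == r + 1

# For each row, list the rows of the previous rank it is compared against
for i, row in enumerate(match):
    print(i, np.flatnonzero(row).tolist())
# 0 []
# 1 []
# 2 [0, 1]
# 3 [0, 1]
# 4 [2, 3]
# 5 [2, 3]
```

Since the full comparison builds an n-by-n matrix, memory use grows quadratically with the number of rows.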

Code Different answered Oct 16 '22 03:10


You need to slightly modify my previous code:

def check_previous_group(rank, d, groups):
    if rank - 1 not in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)

    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)

        # get the absolute difference per row with the whole dataset
        # of the previous group: abs(d_prev-s)
        # if all differences are within 1/1/0.1*z for x/y/z
        # (z being the current row's z)
        # for at least one row of the previous group
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*s['z']]).all(axis=1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]

output:

   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0
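
One thing worth noting about the lambda above: it tests the absolute z difference against 10% of the current row's z, which is not quite the same as checking that the current z is at most 1.1 times the previous z. The two tests agree on the sample data, but they can disagree, for instance when the current z is far below the previous one. A small sketch, where the function names and the last pair of values are made up for illustration:

```python
def within_abs(z_cur, z_prev):
    # test used in the lambda above: |z_prev - z_cur| <= 0.1 * z_cur
    return abs(z_prev - z_cur) <= 0.1 * z_cur


def below_ratio(z_cur, z_prev):
    # ratio test: current z is at most 1.1 times the previous z
    return z_cur <= 1.1 * z_prev


# The first three pairs come from the sample data; the last one is hypothetical
for z_cur, z_prev in [(3.25, 3.0), (1.2, 1.0), (6.0, 3.25), (1.0, 3.0)]:
    print(z_cur, z_prev, within_abs(z_cur, z_prev), below_ratio(z_cur, z_prev))
# 3.25 3.0 True True
# 1.2 1.0 False False
# 6.0 3.25 False False
# 1.0 3.0 False True
```

Which of the two is wanted depends on whether rows with a much smaller z than the previous rank should also be removed.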

mozway answered Oct 16 '22 02:10