
Pandas - Updating columns based on several conditions - group by method

I am aiming for the output below.

A B C D E F
0000 ZZZ 987 QW1 8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first or no match
1111 AAA 123 AB1 1 first three-four col and offset
1111 AAA 123 CD1 -1 first three-four col and offset
2222 BBB 456 EF1 -4 first three-four col and offset
2222 BBB 456 GH1 -1 first three-four col and offset
2222 BBB 456 IL1 5 first three-four col and offset
3333 CCC 789 MN1 2 first two col and offset
3333 CCC 101 MN1 -2 first two col and offset
4444 DDD 121 UYT 6 first two col and offset
4444 DDD 131 FB1 -5 first two col and offset
4444 DDD 141 UYT -1 first two col and offset
5555 EEE 151 CB1 3 first two col and offset
5555 EEE 161 CR1 -3 first two col and offset
5555 EEE 161 CR1 -5 first or no match
6666 FFF 111 CB1 4 first or no match
7777 GGG 222 ZB1 10.5 first three-four col and small offset
7777 GGG 222 ZB1 -10 first three-four col and small offset

1st rule) the first three columns must match, regardless of the fourth, which may or may not be equal. Each such combination must have its associated numbers (col E) offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.

2nd rule) the first two columns must match, regardless of the third and fourth, which may or may not be equal. Each such combination must have its associated numbers (col E) offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.

3rd rule) no match.

4th rule) the first three columns must match, regardless of the fourth, which may or may not be equal. Each combination may have a residual difference of AT MOST 0.5 (col E), i.e. NO exact offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.

Please see my code below.

With the code below I can get a similar result; however, it does not work with uneven entries. For instance, the first three rows come out as:

A B C D E F
0000 ZZZ 987 QW1 8 first or no match
0000 ZZZ 987 QW1 -8 first or no match
0000 ZZZ 987 QW1 -8 first or no match

Instead of the following:

A B C D E F
0000 ZZZ 987 QW1 8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first or no match

Code so far:

import numpy as np

m1 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').eq(0)          # Rule 1
m2 = df.groupby(['A', 'B'])['E'].transform('sum').eq(0)               # Rule 2
m3 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').abs().le(0.5)  # Rule 4

df['new'] = np.select([m1, m2, m3],
                      ['first three-four col and offset',
                       'first two col and offset',
                       'first three-four col and small offset'],
                      'first or no match')
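To see why the group-sum approach breaks on uneven entries, consider the first group in isolation (a minimal sketch using the question's column names):

```python
import pandas as pd

df = pd.DataFrame({'A': ['0000'] * 3, 'B': ['ZZZ'] * 3, 'C': [987] * 3,
                   'D': ['QW1'] * 3, 'E': [8, -8, -8]})

# The whole group sums to -8, so transform('sum').eq(0) is False for
# every row, even though the first two rows offset each other to zero.
m1 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').eq(0)
print(m1.tolist())  # → [False, False, False]
```

The group-level sum cannot distinguish the offsetting pair inside the group from the leftover third record, which is exactly what the rules require.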
DaniB asked May 10 '21


2 Answers

Here is a potential start of a solution... I suspect this logic needs to be a lot more robust to handle your real-world dataset.

import numpy as np
import pandas as pd

# Read in your dataframe from this question
df = pd.read_clipboard(dtype={'A': 'str'})

def f2(x):
    # Running total of 'E' within an ('A', 'B') group
    cum = x.cumsum()
    # Reverse, cumsum, reverse back: flags every row up to and
    # including the last point where the running total hits zero
    m = (cum == 0)[::-1].cumsum()[::-1].astype(bool)
    x[m] = 'first two col and offset'
    x[~m] = np.nan
    return x

def f1(x):
    cum = x.cumsum()
    m = (cum == 0)[::-1].cumsum()[::-1].astype(bool)
    x[m] = 'first three col and offset'
    # Same trick for running totals within 0.5 of zero (but not exactly zero)
    cl = ((cum.abs() <= .5) & (cum != 0))[::-1].cumsum()[::-1].astype(bool)
    x[cl] = 'first three col and small offset'
    x[~m & ~cl] = np.nan
    return x
    
df['F2'] = df.groupby(['A','B'])['E'].apply(f2)
df['F1'] = df.groupby(['A', 'B', 'C'])['E'].apply(f1)
df['F'] = df['F1'].fillna(df['F2']).fillna('first or no match')
df = df.drop(['F1', 'F2'], axis=1)

Output:

       A    B    C    D     E                                 F
0   0000  ZZZ  987  QW1   8.0        first three col and offset
1   0000  ZZZ  987  QW1  -8.0        first three col and offset
2   0000  ZZZ  987  QW1  -8.0                 first or no match
3   1111  AAA  123  AB1   1.0        first three col and offset
4   1111  AAA  123  CD1  -1.0        first three col and offset
5   2222  BBB  456  EF1  -4.0        first three col and offset
6   2222  BBB  456  GH1  -1.0        first three col and offset
7   2222  BBB  456  IL1   5.0        first three col and offset
8   3333  CCC  789  MN1   2.0          first two col and offset
9   3333  CCC  101  MN1  -2.0          first two col and offset
10  4444  DDD  121  UYT   6.0          first two col and offset
11  4444  DDD  131  FB1  -5.0          first two col and offset
12  4444  DDD  141  UYT  -1.0          first two col and offset
13  5555  EEE  151  CB1   3.0          first two col and offset
14  5555  EEE  161  CR1  -3.0          first two col and offset
15  5555  EEE  161  CR1  -5.0                 first or no match
16  6666  FFF  111  CB1   4.0                 first or no match
17  7777  GGG  222  ZB1  10.5  first three col and small offset
18  7777  GGG  222  ZB1 -10.0  first three col and small offset

Details:

  • f2 takes a group of 'E' (based on the same 'A' and 'B') and computes a cumulative sum.

  • We then check where the cumulative sum equals 0 to create a boolean series.

  • Reverse that series, using [::-1] slicing with a step of negative one, and apply cumsum again to flag all the records up to and including the point where cumsum == 0.

  • Restore the original order with [::-1] and cast to boolean.

  • Finally, use that boolean series' True values to set 'first two col and offset', and the False values to set np.nan.

f1 does the same, with extra logic to capture the records that offset to close to zero.
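The reversed-cumsum trick can be seen in isolation on a plain Series (a small sketch using the first group's numbers):

```python
import pandas as pd

e = pd.Series([8, -8, -8])
cum = e.cumsum()                                  # 8, 0, -8
# Flag every row up to and including the last zero crossing
m = (cum == 0)[::-1].cumsum()[::-1].astype(bool)  # True, True, False
print(m.tolist())  # → [True, True, False]
```

The first two rows are flagged as offsetting, and the leftover third row stays unflagged, which is the behaviour the group-level sum in the question could not produce.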

Scott Boston answered Oct 16 '22


1. Define a function _thresh_sum

We can define a recursive function that takes a 1D numpy array as input and returns a 1D boolean array of the same shape as the input.

import numpy as np
from numba import jit

@jit(nopython=True)
def _thresh_sum(arr, indices, flags, offset, thresh):
    # Stop if any element on this path is already flagged
    if flags[indices].any(): return

    # Flag the whole path if its absolute sum is within the threshold
    s = np.abs(arr[indices].sum())
    if s <= thresh and len(indices) > 1:
        flags[indices] = True
        return

    # Otherwise extend the path with each remaining element in turn
    for i, _ in enumerate(arr[offset:]):
        new_offset = offset + i + 1
        new_indices = np.array(list(indices) + [offset + i])
        _thresh_sum(arr, new_indices, flags, new_offset, thresh)


def thresh_sum(arr, thresh=0):
    flags = np.full(len(arr), False)
    _thresh_sum(np.array(arr),
                np.array([], dtype='int'), flags, 0, thresh + 1e-6)
    return flags

Intuition behind _thresh_sum

  • Essentially we perform a depth-first search over the input array, checking all possible paths where the vertices of a path are elements of the array.
  • If, for a given path, the absolute sum of all the numbers in the path is less than or equal to the given threshold, we stop searching along the current path and explore other paths with a different starting node.
  • If such a path is found, we flag all the numbers/vertices in it.

We can further improve performance by compiling _thresh_sum directly to machine code using numba's just-in-time compilation.

Note: this is an NP-hard type of problem. As the number of elements per unique group grows, the computational complexity increases roughly as O(2^n).
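As a numba-free sanity check, the same depth-first search can be sketched in plain Python (illustrative only; the name thresh_sum_py is ours, not from the answer):

```python
import numpy as np

def thresh_sum_py(arr, thresh=0):
    """Flag elements belonging to some subset (size >= 2) whose
    absolute sum is within `thresh` of zero. Pure-Python sketch
    of the numba-compiled version."""
    arr = np.asarray(arr, dtype=float)
    flags = np.full(len(arr), False)

    def dfs(indices, offset):
        # Skip paths that reuse an already-flagged element
        if any(flags[i] for i in indices):
            return
        if len(indices) > 1 and abs(sum(arr[i] for i in indices)) <= thresh + 1e-6:
            for i in indices:
                flags[i] = True
            return
        # Extend the path with each remaining element
        for i in range(offset, len(arr)):
            dfs(indices + [i], i + 1)

    dfs([], 0)
    return flags

print(thresh_sum_py([8, -8, -8]).tolist())  # → [True, True, False]
```

The first offsetting pair is flagged and the leftover -8 is not, matching the expected output for the first group.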

2. Group and transform with thresh_sum

Now we group the dataframe as per the given rules 1, 2 and 4, and transform column E with the function thresh_sum for each rule.

m1 = df.groupby(['A', 'B', 'C'])['E'].transform(thresh_sum) # Rule 1
m2 = df[~m1].groupby(['A', 'B'])['E'].transform(thresh_sum)  # Rule 2
m3 = df[~(m1 | m2)].groupby(['A', 'B', 'C'])['E'].transform(thresh_sum, thresh=0.5)  # Rule 4

3. np.select

Based on the calculated boolean masks m1, m2 and m3 corresponding to rules 1, 2 and 4, fill the values in the column F

df['F'] = np.select([m1,
                     m2.reindex(m1.index, fill_value=False),
                     m3.reindex(m1.index, fill_value=False)],
                    ['first three-four col and offset',
                     'first two col and offset',
                     'first three-four col and small offset'],
                    'first or no match')
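The reindex calls exist because m2 and m3 were computed on filtered frames and so cover only a subset of rows; they must be aligned back onto the full index. In isolation (made-up boolean values):

```python
import pandas as pd

m1 = pd.Series([True, False, False], index=[0, 1, 2])
# m2 was computed on df[~m1], so it only covers rows 1 and 2
m2 = pd.Series([True, False], index=[1, 2])

# Rows missing from m2 (here row 0) are filled with False
aligned = m2.reindex(m1.index, fill_value=False)
print(aligned.tolist())  # → [False, True, False]
```

Without the reindex, np.select would fail because the condition arrays have different lengths.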

Result

       A    B    C    D     E                                      F
0   0000  ZZZ  987  QW1   8.0        first three-four col and offset
1   0000  ZZZ  987  QW1  -8.0        first three-four col and offset
2   0000  ZZZ  987  QW1  -8.0                      first or no match
3   1111  AAA  123  AB1   1.0        first three-four col and offset
4   1111  AAA  123  CD1  -1.0        first three-four col and offset
5   2222  BBB  456  EF1  -4.0        first three-four col and offset
6   2222  BBB  456  GH1  -1.0        first three-four col and offset
7   2222  BBB  456  IL1   5.0        first three-four col and offset
8   3333  CCC  789  MN1   2.0               first two col and offset
9   3333  CCC  101  MN1  -2.0               first two col and offset
10  4444  DDD  121  UYT   6.0               first two col and offset
11  4444  DDD  131  FB1  -5.0               first two col and offset
12  4444  DDD  141  UYT  -1.0               first two col and offset
13  5555  EEE  151  CB1   3.0               first two col and offset
14  5555  EEE  161  CR1  -3.0               first two col and offset
15  5555  EEE  161  CR1  -5.0                      first or no match
16  6666  FFF  111  CB1   4.0                      first or no match
17  7777  GGG  222  ZB1  10.5  first three-four col and small offset
18  7777  GGG  222  ZB1 -10.0  first three-four col and small offset
Shubham Sharma answered Oct 17 '22