
Pandas - Updating columns based on several conditions - group by method

I am aiming for the output below.

A B C D E F
0000 ZZZ 987 QW1 8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first or no match
1111 AAA 123 AB1 1 first three-four col and offset
1111 AAA 123 CD1 -1 first three-four col and offset
2222 BBB 456 EF1 -4 first three-four col and offset
2222 BBB 456 GH1 -1 first three-four col and offset
2222 BBB 456 IL1 5 first three-four col and offset
3333 CCC 789 MN1 2 first two col and offset
3333 CCC 101 MN1 -2 first two col and offset
4444 DDD 121 UYT 6 first two col and offset
4444 DDD 131 FB1 -5 first two col and offset
4444 DDD 141 UYT -1 first two col and offset
5555 EEE 151 CB1 3 first two col and offset
5555 EEE 161 CR1 -3 first two col and offset
5555 EEE 161 CR1 -5 first or no match
6666 FFF 111 CB1 4 first or no match
7777 GGG 222 ZB1 10.5 first three-four col and small offset
7777 GGG 222 ZB1 -10 first three-four col and small offset

1st rule) the first three columns must match, regardless of the fourth, which may or may not be equal. Each such combination must have its associated numbers (col E) offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.

2nd rule) the first two columns must match, regardless of the third and fourth, which may or may not be equal. Each such combination must have its associated numbers (col E) offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.

3rd rule) no match.

4th rule) the first three columns must match, regardless of the fourth, which may or may not be equal. Each combination may have a residual difference of AT MOST 0.5 (col E), i.e. NO exact offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.

Please see my code below.

With the code below I can get a similar result; however, it does not work with uneven entries. For instance, the first three rows come out as:

A B C D E F
0000 ZZZ 987 QW1 8 first or no match
0000 ZZZ 987 QW1 -8 first or no match
0000 ZZZ 987 QW1 -8 first or no match

Instead of the following:

A B C D E F
0000 ZZZ 987 QW1 8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first three-four col and offset
0000 ZZZ 987 QW1 -8 first or no match

Code so far:

import numpy as np

m1 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').eq(0)          # Rule 1
m2 = df.groupby(['A', 'B'])['E'].transform('sum').eq(0)               # Rule 2
m3 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').abs().le(0.5)  # Rule 4

df['new'] = np.select([m1, m2, m3],
                      ['first three-four col and offset',
                       'first two col and offset',
                       'first three-four col and small offset'],
                      'first or no match')
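To see why the group-sum approach breaks on uneven entries, consider the first group in isolation (a minimal sketch using the question's column names):

```python
import pandas as pd

df = pd.DataFrame({'A': ['0000'] * 3, 'B': ['ZZZ'] * 3, 'C': [987] * 3,
                   'D': ['QW1'] * 3, 'E': [8, -8, -8]})

# The whole group sums to -8, so transform('sum').eq(0) is False for
# every row, even though the first two rows offset each other to zero.
m1 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').eq(0)
print(m1.tolist())  # → [False, False, False]
```

The group-level sum cannot distinguish the offsetting pair inside the group from the leftover third record, which is exactly what the rules require.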
DaniB asked May 10 '21


2 Answers

Here is a potential start of a solution... I suspect this logic needs to be a lot more robust to handle your real-world dataset.

import numpy as np
import pandas as pd

# Read in your dataframe from this question
df = pd.read_clipboard(dtype={'A': 'str'})

def f2(x):
    # Running total of 'E' within an ('A', 'B') group
    cum = x.cumsum()
    # Reverse, cumsum, reverse back: flags every row up to and
    # including the last point where the running total hits zero
    m = (cum == 0)[::-1].cumsum()[::-1].astype(bool)
    x[m] = 'first two col and offset'
    x[~m] = np.nan
    return x

def f1(x):
    cum = x.cumsum()
    m = (cum == 0)[::-1].cumsum()[::-1].astype(bool)
    x[m] = 'first three col and offset'
    # Same trick for running totals within 0.5 of zero (but not exactly zero)
    cl = ((cum.abs() <= .5) & (cum != 0))[::-1].cumsum()[::-1].astype(bool)
    x[cl] = 'first three col and small offset'
    x[~m & ~cl] = np.nan
    return x
    
df['F2'] = df.groupby(['A','B'])['E'].apply(f2)
df['F1'] = df.groupby(['A', 'B', 'C'])['E'].apply(f1)
df['F'] = df['F1'].fillna(df['F2']).fillna('first or no match')
df = df.drop(['F1', 'F2'], axis=1)

Output:

       A    B    C    D     E                                 F
0   0000  ZZZ  987  QW1   8.0        first three col and offset
1   0000  ZZZ  987  QW1  -8.0        first three col and offset
2   0000  ZZZ  987  QW1  -8.0                 first or no match
3   1111  AAA  123  AB1   1.0        first three col and offset
4   1111  AAA  123  CD1  -1.0        first three col and offset
5   2222  BBB  456  EF1  -4.0        first three col and offset
6   2222  BBB  456  GH1  -1.0        first three col and offset
7   2222  BBB  456  IL1   5.0        first three col and offset
8   3333  CCC  789  MN1   2.0          first two col and offset
9   3333  CCC  101  MN1  -2.0          first two col and offset
10  4444  DDD  121  UYT   6.0          first two col and offset
11  4444  DDD  131  FB1  -5.0          first two col and offset
12  4444  DDD  141  UYT  -1.0          first two col and offset
13  5555  EEE  151  CB1   3.0          first two col and offset
14  5555  EEE  161  CR1  -3.0          first two col and offset
15  5555  EEE  161  CR1  -5.0                 first or no match
16  6666  FFF  111  CB1   4.0                 first or no match
17  7777  GGG  222  ZB1  10.5  first three col and small offset
18  7777  GGG  222  ZB1 -10.0  first three col and small offset

Details:

  • f2 takes a group of 'E' (based on the same 'A' and 'B') and computes a cumulative sum.

  • We then check where the cumulative sum equals 0 to create a boolean series.

  • Reverse that series, using [::-1] slicing with a step of negative one, and apply cumsum again to flag all the records up to and including the point where cumsum == 0.

  • Restore the original order with [::-1] and cast to boolean.

  • Finally, use that boolean series' True values to set 'first two col and offset', and the False values to set np.nan.

f1 does the same, with extra logic to capture the records that offset to close to zero.
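The reversed-cumsum trick can be seen in isolation on a plain Series (a small sketch using the first group's numbers):

```python
import pandas as pd

e = pd.Series([8, -8, -8])
cum = e.cumsum()                                  # 8, 0, -8
# Flag every row up to and including the last zero crossing
m = (cum == 0)[::-1].cumsum()[::-1].astype(bool)  # True, True, False
print(m.tolist())  # → [True, True, False]
```

The first two rows are flagged as offsetting, and the leftover third row stays unflagged, which is the behaviour the group-level sum in the question could not produce.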

Scott Boston answered Oct 16 '22


1. Define a function _thresh_sum

We can define a recursive function that takes a 1D numpy array as input and returns a 1D boolean array of the same shape as the input.

import numpy as np
from numba import jit

@jit(nopython=True)
def _thresh_sum(arr, indices, flags, offset, thresh):
    # Stop if any element on this path is already flagged
    if flags[indices].any(): return

    # Flag the whole path if its absolute sum is within the threshold
    s = np.abs(arr[indices].sum())
    if s <= thresh and len(indices) > 1:
        flags[indices] = True
        return

    # Otherwise extend the path with each remaining element in turn
    for i, _ in enumerate(arr[offset:]):
        new_offset = offset + i + 1
        new_indices = np.array(list(indices) + [offset + i])
        _thresh_sum(arr, new_indices, flags, new_offset, thresh)


def thresh_sum(arr, thresh=0):
    flags = np.full(len(arr), False)
    _thresh_sum(np.array(arr),
                np.array([], dtype='int'), flags, 0, thresh + 1e-6)
    return flags

Intuition behind _thresh_sum

  • Essentially we perform a depth-first search over the input array, checking all possible paths where the vertices of a path are elements of the array.
  • If, for a given path, the absolute sum of all the numbers in the path is less than or equal to the given threshold, we stop searching along the current path and explore other paths with a different starting node.
  • If such a path is found, we flag all the numbers/vertices in it.

We can further improve performance by compiling _thresh_sum directly to machine code using numba's just-in-time compilation.

Note: this is an NP-hard type of problem. As the number of elements per unique group grows, the computational complexity increases roughly as O(2^n).
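As a numba-free sanity check, the same depth-first search can be sketched in plain Python (illustrative only; the name thresh_sum_py is ours, not from the answer):

```python
import numpy as np

def thresh_sum_py(arr, thresh=0):
    """Flag elements belonging to some subset (size >= 2) whose
    absolute sum is within `thresh` of zero. Pure-Python sketch
    of the numba-compiled version."""
    arr = np.asarray(arr, dtype=float)
    flags = np.full(len(arr), False)

    def dfs(indices, offset):
        # Skip paths that reuse an already-flagged element
        if any(flags[i] for i in indices):
            return
        if len(indices) > 1 and abs(sum(arr[i] for i in indices)) <= thresh + 1e-6:
            for i in indices:
                flags[i] = True
            return
        # Extend the path with each remaining element
        for i in range(offset, len(arr)):
            dfs(indices + [i], i + 1)

    dfs([], 0)
    return flags

print(thresh_sum_py([8, -8, -8]).tolist())  # → [True, True, False]
```

The first offsetting pair is flagged and the leftover -8 is not, matching the expected output for the first group.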

2. Group and transform with thresh_sum

Now we group the dataframe as per the given rules 1, 2 and 4, and transform column E with the function thresh_sum for each rule.

m1 = df.groupby(['A', 'B', 'C'])['E'].transform(thresh_sum) # Rule 1
m2 = df[~m1].groupby(['A', 'B'])['E'].transform(thresh_sum)  # Rule 2
m3 = df[~(m1 | m2)].groupby(['A', 'B', 'C'])['E'].transform(thresh_sum, thresh=0.5)  # Rule 4

3. np.select

Based on the calculated boolean masks m1, m2 and m3 corresponding to rules 1, 2 and 4, fill the values in the column F

df['F'] = np.select([m1,
                     m2.reindex(m1.index, fill_value=False),
                     m3.reindex(m1.index, fill_value=False)],
                    ['first three-four col and offset',
                     'first two col and offset',
                     'first three-four col and small offset'],
                    'first or no match')
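The reindex calls exist because m2 and m3 were computed on filtered frames and so cover only a subset of rows; they must be aligned back onto the full index. In isolation (made-up boolean values):

```python
import pandas as pd

m1 = pd.Series([True, False, False], index=[0, 1, 2])
# m2 was computed on df[~m1], so it only covers rows 1 and 2
m2 = pd.Series([True, False], index=[1, 2])

# Rows missing from m2 (here row 0) are filled with False
aligned = m2.reindex(m1.index, fill_value=False)
print(aligned.tolist())  # → [False, True, False]
```

Without the reindex, np.select would fail because the condition arrays have different lengths.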

Result

       A    B    C    D     E                                      F
0   0000  ZZZ  987  QW1   8.0        first three-four col and offset
1   0000  ZZZ  987  QW1  -8.0        first three-four col and offset
2   0000  ZZZ  987  QW1  -8.0                      first or no match
3   1111  AAA  123  AB1   1.0        first three-four col and offset
4   1111  AAA  123  CD1  -1.0        first three-four col and offset
5   2222  BBB  456  EF1  -4.0        first three-four col and offset
6   2222  BBB  456  GH1  -1.0        first three-four col and offset
7   2222  BBB  456  IL1   5.0        first three-four col and offset
8   3333  CCC  789  MN1   2.0               first two col and offset
9   3333  CCC  101  MN1  -2.0               first two col and offset
10  4444  DDD  121  UYT   6.0               first two col and offset
11  4444  DDD  131  FB1  -5.0               first two col and offset
12  4444  DDD  141  UYT  -1.0               first two col and offset
13  5555  EEE  151  CB1   3.0               first two col and offset
14  5555  EEE  161  CR1  -3.0               first two col and offset
15  5555  EEE  161  CR1  -5.0                      first or no match
16  6666  FFF  111  CB1   4.0                      first or no match
17  7777  GGG  222  ZB1  10.5  first three-four col and small offset
18  7777  GGG  222  ZB1 -10.0  first three-four col and small offset
Shubham Sharma answered Oct 17 '22