I am aiming to the below output.
A | B | C | D | E | F |
---|---|---|---|---|---|
0000 | ZZZ | 987 | QW1 | 8 | first three-four col and offset |
0000 | ZZZ | 987 | QW1 | -8 | first three-four col and offset |
0000 | ZZZ | 987 | QW1 | -8 | first or no match |
1111 | AAA | 123 | AB1 | 1 | first three-four col and offset |
1111 | AAA | 123 | CD1 | -1 | first three-four col and offset |
2222 | BBB | 456 | EF1 | -4 | first three-four col and offset |
2222 | BBB | 456 | GH1 | -1 | first three-four col and offset |
2222 | BBB | 456 | IL1 | 5 | first three-four col and offset |
3333 | CCC | 789 | MN1 | 2 | first two col and offset |
3333 | CCC | 101 | MN1 | -2 | first two col and offset |
4444 | DDD | 121 | UYT | 6 | first two col and offset |
4444 | DDD | 131 | FB1 | -5 | first two col and offset |
4444 | DDD | 141 | UYT | -1 | first two col and offset |
5555 | EEE | 151 | CB1 | 3 | first two col and offset |
5555 | EEE | 161 | CR1 | -3 | first two col and offset |
5555 | EEE | 161 | CR1 | -5 | first or no match |
6666 | FFF | 111 | CB1 | 4 | first or no match |
7777 | GGG | 222 | ZB1 | 10.5 | first three-four col and small offset |
7777 | GGG | 222 | ZB1 | -10 | first three-four col and small offset |
1st rule) the first three columns must equal each other - regardless of the fourth, which can either be equal or not. Each combination must have the associated numbers (col E) offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.
2nd rule) the first two columns must equal each other - regardless of the fourth, which can either be equal or not. Each combination must have the associated numbers (col E) offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.
3rd rule) no match.
4th rule) the first three columns must equal each other - regardless of the fourth, which can either be equal or not. Each combination can have a difference of 0.5
AT MOST (col E) and NO offset to zero (can be from 2 to X records combined). IT SHOULD WORK EVEN WITH UNEVEN ENTRIES.
Please see my code below.
Through the below code, I am able to get a similar result, however, this way does not work with uneven entries, for instance the first three rows would result as following:
A | B | C | D | E | F |
---|---|---|---|---|---|
0000 | ZZZ | 987 | QW1 | 8 | first or no match |
0000 | ZZZ | 987 | QW1 | -8 | first or no match |
0000 | ZZZ | 987 | QW1 | -8 | first or no match |
Instead of the following:
A | B | C | D | E | F |
---|---|---|---|---|---|
0000 | ZZZ | 987 | QW1 | 8 | first three-four col and offset |
0000 | ZZZ | 987 | QW1 | -8 | first three-four col and offset |
0000 | ZZZ | 987 | QW1 | -8 | first or no match |
Code so far:
m1 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').eq(0) # Rule 1
m2 = df.groupby(['A', 'B'])['E'].transform('sum').eq(0) # Rule 2
m3 = df.groupby(['A', 'B', 'C'])['E'].transform('sum').abs().le(0.5) # Rule 4
df['new'] = np.select([m1, m2, m3], ['first three-four col and offset',
'first two col and offset', 'first three-four col and small offset'], 'first or no match')
You can replace values of all or selected columns based on the condition of pandas DataFrame by using DataFrame. loc[ ] property. The loc[] is used to access a group of rows and columns by label(s) or a boolean array. It can access and can also manipulate the values of pandas DataFrame.
How to groupby multiple columns in pandas DataFrame and compute multiple aggregations? groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.
Using Loc to Filter With Multiple Conditions The loc function in pandas can be used to access groups of rows or columns by label. Add each condition you want to be included in the filtered result and concatenate them with the & operator. You'll see our code sample will return a pd. dataframe of our filtered rows.
How to Use GroupBy with Multiple Columns in Pandas Step 1: Create sample DataFrame. You can find the sample data from the repository of the notebook or use the link below... Step 2: Group by multiple columns. The columns should be provided as a list to the groupby method. Step 3: GroupBy ...
The reason is dataframe may be having multiple columns and multiple rows. Selective display of columns with limited rows is always the expected view of users. To fulfill the user’s expectations and also help in machine deep learning scenarios, filtering of Pandas dataframe with multiple conditions is much necessary.
Then, we use the apply method using the lambda function which takes as input our function with parameters the pandas columns. Do not forget to set the axis=1, in order to apply the function row-wise. df ['NewColumn_1'] = df.apply(lambda x: myfunc (x ['Age'], x ['Pclass']), axis=1)
Often you may want to group and aggregate by multiple columns of a pandas DataFrame. Fortunately this is easy to do using the pandas .groupby () and .agg () functions. This tutorial explains several examples of how to use these functions in practice.
Here is a potential start of a solution... I suspect this logic needs to be a lot more robust to handle your real world dataset.
#Read in your dataframe from this question
df = pd.read_clipboard(dtype={'A':'str'})
def f2(x):
cum = x.cumsum()
m = (cum == 0)[::-1].cumsum()[::-1].astype(bool)
x[m]='first two col and offset'
x[~m]=np.nan
return x
def f1(x):
cum = x.cumsum()
m = (cum == 0)[::-1].cumsum()[::-1].astype(bool)
x[m]='first three col and offset'
cl = ((cum.abs() <= .5) & (cum != 0))[::-1].cumsum()[::-1].astype(bool)
x[cl] = 'first three col and small offset'
x[~m & ~cl] = np.nan
return x
df['F2'] = df.groupby(['A','B'])['E'].apply(f2)
df['F1'] = df.groupby(['A', 'B', 'C'])['E'].apply(f1)
df['F'] = df['F1'].fillna(df['F2']).fillna('first or no match')
df = df.drop(['F1', 'F2'], axis=1)
Output:
A B C D E F
0 0000 ZZZ 987 QW1 8.0 first three col and offset
1 0000 ZZZ 987 QW1 -8.0 first three col and offset
2 0000 ZZZ 987 QW1 -8.0 first or no match
3 1111 AAA 123 AB1 1.0 first three col and offset
4 1111 AAA 123 CD1 -1.0 first three col and offset
5 2222 BBB 456 EF1 -4.0 first three col and offset
6 2222 BBB 456 GH1 -1.0 first three col and offset
7 2222 BBB 456 IL1 5.0 first three col and offset
8 3333 CCC 789 MN1 2.0 first two col and offset
9 3333 CCC 101 MN1 -2.0 first two col and offset
10 4444 DDD 121 UYT 6.0 first two col and offset
11 4444 DDD 131 FB1 -5.0 first two col and offset
12 4444 DDD 141 UYT -1.0 first two col and offset
13 5555 EEE 151 CB1 3.0 first two col and offset
14 5555 EEE 161 CR1 -3.0 first two col and offset
15 5555 EEE 161 CR1 -5.0 first or no match
16 6666 FFF 111 CB1 4.0 first or no match
17 7777 GGG 222 ZB1 10.5 first three col and small offset
18 7777 GGG 222 ZB1 -10.0 first three col and small offset
Details:
f2, takes a group of 'E' base on the same 'A' and 'B' then computes a cumulative sum.
Then we check to where cumsum equals to 0 to create a boolean series.
Invert, using [::-1] slicing with a negative one step, that series and use cumsum again to flag all the records before the cumsum == 0.
Revert to original order with [::-1] and cast a boolean.
Next, use that bolean series' True values to set 'first two col and offset', then use the false records to set np.nan.
f1, does same except with extra logic to capture the offset close records.
_thresh_sum
We can define a recursive function which takes the input argument as a 1D numpy array and returns the 1D boolean array having the same shape as the input array
from numba import jit
@jit(nopython=True)
def _thresh_sum(arr, indices, flags, offset, thresh):
if flags[indices].any(): return
s = np.abs(arr[indices].sum())
if s <= thresh and len(indices) > 1:
flags[indices] = True
return
for i, _ in enumerate(arr[offset:]):
new_offset = offset + i + 1
new_indices = np.array(list(indices) + [offset + i])
_thresh_sum(arr, new_indices, flags, new_offset, thresh)
def thresh_sum(arr, thresh=0):
flags = np.full(len(arr), False)
_thresh_sum(np.array(arr),
np.array([], dtype='int'), flags, 0, thresh + 1e-6)
return flags
Intuition behind _thresh_sum
We can further improve the performance of code by compiling the function _thresh_sum
directly to machine code with the help of technique called just in time compilation using numba
Note: It is NP-Hard
type problem. As the number of elements per unique group increases the computational complexity of the problem will increase in the order of approximately O(2^n)
.
Group
and transform
with thresh_sum
Now we group the dataframe as per the given rules 1
, 2
, 4
and transform
column E
using the function thresh_sum
corresponding to each rule
m1 = df.groupby(['A', 'B', 'C'])['E'].transform(thresh_sum) # Rule 1
m2 = df[~m1].groupby(['A', 'B'])['E'].transform(thresh_sum) # Rule 2
m3 = df[~(m1 | m2)].groupby(['A', 'B', 'C'])['E'].transform(thresh_sum, thresh=0.5) # Rule 4
np.select
Based on the calculated boolean masks m1
, m2
and m3
corresponding to rules 1
, 2
and 4
, fill the values in the column F
df['F'] = np.select([m1, m2.reindex(m1.index, fill_value=False), m3.reindex(m1.index, fill_value=False)],
['first three-four col and offset', 'first two col and offset', 'first three-four col and small offset'], 'first or no match')
A B C D E F
0 0000 ZZZ 987 QW1 8.0 first three-four col and offset
1 0000 ZZZ 987 QW1 -8.0 first three-four col and offset
2 0000 ZZZ 987 QW1 -8.0 first or no match
3 1111 AAA 123 AB1 1.0 first three-four col and offset
4 1111 AAA 123 CD1 -1.0 first three-four col and offset
5 2222 BBB 456 EF1 -4.0 first three-four col and offset
6 2222 BBB 456 GH1 -1.0 first three-four col and offset
7 2222 BBB 456 IL1 5.0 first three-four col and offset
8 3333 CCC 789 MN1 2.0 first two col and offset
9 3333 CCC 101 MN1 -2.0 first two col and offset
10 4444 DDD 121 UYT 6.0 first two col and offset
11 4444 DDD 131 FB1 -5.0 first two col and offset
12 4444 DDD 141 UYT -1.0 first two col and offset
13 5555 EEE 151 CB1 3.0 first two col and offset
14 5555 EEE 161 CR1 -3.0 first two col and offset
15 5555 EEE 161 CR1 -5.0 first or no match
16 6666 FFF 111 CB1 4.0 first or no match
17 7777 GGG 222 ZB1 10.5 first three-four col and small offset
18 7777 GGG 222 ZB1 -10.0 first three-four col and small offset
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With