Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Data frame segmentation and dropping

I have the following DataFrame in pandas:

A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90], 
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]

I want to create a new column and in that column I want to have values from column A based on the condition on column B. Conditions are if there is no ''txt'' in between two consecutive ''BW'', then I will have those on column C. But if there is ''txt'' between two consecutive ''BW'', I want to drop all those values. So the expected output should look like:

A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90], 
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]
C = [1,10,23, BW, 24,24,55, BW, nan, nan, nan, nan, nan, nan, BW, 43,BW]

I have no clue how to do it. Any help is much appreciated.

like image 551
Aimas Avatar asked May 23 '26 05:05

Aimas


2 Answers

EDIT:

Updated answer which was missing the values of BW in the final df.

import pandas as pd
import numpy as np

BW = 999
txt = -999
A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,2,49,BW,479,BW]

df = pd.DataFrame({'A': A, 'B': B})
df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())
df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.A)
df['C'] = np.where(df['B'] == BW, df['B'], df['C'])
df['C'] = df['C'].astype('Int64')
df = df.drop('group', axis=1)
In [435]: df
Out[435]: 
     A    B     C
0    1   24     1
1   10   23    10
2   23   29    23
3   45  999   999 <-- BW
4   24   49    24
5   24   59    24
6   55   72    55
7   67  999   999 <-- BW
8   73    9  <NA>
9   26  183  <NA>
10  13   17  <NA>
11  96 -999  <NA> <-- txt is in the middle of BW
12  53    2  <NA>
13  23   49  <NA>
14  24  999   999 <-- BW
15  43  479    43
16  90  999   999 <-- BW

You can achieve it like so, assuming BW and txt are specific values I just filled them with some random number to differentiate them

In [277]: BW = 999

In [278]: txt = -999

In [293]: A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
     ...: B = [24,23,29, BW,49,59,72, BW,9,183,17, txt,49,BW,479,BW]

In [300]: df = pd.DataFrame({'A': A, 'B': B})

In [301]: df
Out[301]: 
     A    B
0    1   24
1   10   23
2   23   29
3   45  999
4   24   49
5   24   59
6   55   72
7   67  999
8   73    9
9   26  183
10  13   17
11  96 -999
12  53    2
13  23   49
14  24  999
15  43  479
16  90  999

First lets split the different groups of values, here I am splitting them into unique groups where each group contains the values of B that are between the value BW and the next BW.

In [321]: df = df.assign(group = (df[~df['B'].between(BW,BW)].index.to_series().diff() > 1).cumsum())

In [322]: df
Out[322]: 
     A    B      group
0    1   24 0.00000000
1   10   23 0.00000000
2   23   29 0.00000000
3   45  999        NaN
4   24   49 1.00000000
5   24   59 1.00000000
6   55   72 1.00000000
7   67  999        NaN
8   73    9 2.00000000
9   26  183 2.00000000
10  13   17 2.00000000
11  96 -999 2.00000000
12  53    2 2.00000000
13  23   49 2.00000000
14  24  999        NaN
15  43  479 3.00000000
16  90  999        NaN

Next with the use of np.where() we can replace the values depending on the condition that you set.

In [360]: df['C'] = np.where(df.group == df[df.B == txt].group.values[0], np.nan, df.B)

In [432]: df
Out[432]: 
     A    B      group            C
0    1   24 0.00000000  24.00000000
1   10   23 0.00000000  23.00000000
2   23   29 0.00000000  29.00000000
3   45  999        NaN 999.00000000
4   24   49 1.00000000  49.00000000
5   24   59 1.00000000  59.00000000
6   55   72 1.00000000  72.00000000
7   67  999        NaN 999.00000000
8   73    9 2.00000000          NaN
9   26  183 2.00000000          NaN
10  13   17 2.00000000          NaN
11  96 -999 2.00000000          NaN
12  53    2 2.00000000          NaN
13  23   49 2.00000000          NaN
14  24  999        NaN 999.00000000
15  43  479 3.00000000 479.00000000
16  90  999        NaN 999.00000000

Here we need to set the where B is equal to BW for C back to the values of B.

In [488]: df['C'] = np.where(df['B'] == BW, df['B'], df['C'])

In [489]: df
Out[489]: 
     A    B      group            C
0    1   24 0.00000000  24.00000000
1   10   23 0.00000000  23.00000000
2   23   29 0.00000000  29.00000000
3   45  999        NaN 999.00000000
4   24   49 1.00000000  49.00000000
5   24   59 1.00000000  59.00000000
6   55   72 1.00000000  72.00000000
7   67  999        NaN 999.00000000
8   73    9 2.00000000          NaN
9   26  183 2.00000000          NaN
10  13   17 2.00000000          NaN
11  96 -999 2.00000000          NaN
12  53    2 2.00000000          NaN
13  23   49 2.00000000          NaN
14  24  999        NaN 999.00000000
15  43  479 3.00000000 479.00000000
16  90  999        NaN 999.00000000

Lastly just convert the float column to int and drop the group column which we do not need anymore. If you want to maintain that the NaN values are np.nan then ignore the conversion to Int64.

In [396]: df.C = df.C.astype('Int64')

In [397]: df
Out[397]: 
     A    B      group     C
0    1   24 0.00000000    24
1   10   23 0.00000000    23
2   23   29 0.00000000    29
3   45  999        NaN   999
4   24   49 1.00000000    49
5   24   59 1.00000000    59
6   55   72 1.00000000    72
7   67  999        NaN   999
8   73    9 2.00000000  <NA>
9   26  183 2.00000000  <NA>
10  13   17 2.00000000  <NA>
11  96 -999 2.00000000  <NA>
12  53    2 2.00000000  <NA>
13  23   49 2.00000000  <NA>
14  24  999        NaN   999
15  43  479 3.00000000   479
16  90  999        NaN   999

In [398]: df = df.drop('group', axis=1)

In [435]: df
Out[435]: 
     A    B     C
0    1   24    24
1   10   23    23
2   23   29    29
3   45  999   999
4   24   49    49
5   24   59    59
6   55   72    72
7   67  999   999
8   73    9  <NA>
9   26  183  <NA>
10  13   17  <NA>
11  96 -999  <NA>
12  53    2  <NA>
13  23   49  <NA>
14  24  999   999
15  43  479   479
16  90  999   999
like image 99
yudhiesh Avatar answered May 25 '26 18:05

yudhiesh


I don't know if this is the most efficient way to do it, but you can create a new column called mask from mapping the values in column B the following way: 'BW' to True, 'txt' to False and all other values to np.nan.

Then if you forward fill the NaN from mask, and backward fill the NaN from mask and logically combine the results (set equal to True as long as one of the forward or backward filled columns is False), you can create a column called final_mask where all of the values between consecutive BW containing txt are filled in with True.

You can then use .apply to select the value of column A only when the final_mask is False and column B isn't 'BW', select column B if final_mask is False and column B is 'BW', and np.nan otherwise.

import numpy as np
import pandas as pd

A = [1,10,23,45,24,24,55,67,73,26,13,96,53,23,24,43,90]
B = [24,23,29, 'BW',49,59,72, 'BW',9,183,17, 'txt',2,49,'BW',479,'BW']
df = pd.DataFrame({'A':A,'B':B})

df["mask"] = df["B"].apply(lambda x: True if x == 'BW' else False if x == 'txt' else np.nan)
df["ffill"] = df["mask"].fillna(method="ffill")
df["bfill"] = df["mask"].fillna(method="bfill")
df["final_mask"] = (df["ffill"] == False) | (df["bfill"] == False)

df["C"] = df.apply(lambda x: x['A'] if (
    (x['final_mask'] == False) & (x['B'] != 'BW')) 
    else x['B'] if ((x['final_mask'] == False) & (x['B'] == 'BW')) 
    else np.nan, axis=1
)

>>> df
     A    B   mask  ffill  bfill  final_mask    C
0    1   24    NaN    NaN   True       False    1
1   10   23    NaN    NaN   True       False   10
2   23   29    NaN    NaN   True       False   23
3   45   BW   True   True   True       False   BW
4   24   49    NaN   True   True       False   24
5   24   59    NaN   True   True       False   24
6   55   72    NaN   True   True       False   55
7   67   BW   True   True   True       False   BW
8   73    9    NaN   True  False        True  NaN
9   26  183    NaN   True  False        True  NaN
10  13   17    NaN   True  False        True  NaN
11  96  txt  False  False  False        True  NaN
12  53    2    NaN  False   True        True  NaN
13  23   49    NaN  False   True        True  NaN
14  24   BW   True   True   True       False   BW
15  43  479    NaN   True   True       False   43
16  90   BW   True   True   True       False   BW

Dropping the columns we created along the way:

df.drop(columns=['mask','ffill','bfill','final_mask'])

     A    B    C
0    1   24    1
1   10   23   10
2   23   29   23
3   45   BW   BW
4   24   49   24
5   24   59   24
6   55   72   55
7   67   BW   BW
8   73    9  NaN
9   26  183  NaN
10  13   17  NaN
11  96  txt  NaN
12  53    2  NaN
13  23   49  NaN
14  24   BW   BW
15  43  479   43
16  90   BW   BW
like image 21
Derek O Avatar answered May 25 '26 19:05

Derek O