Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Find first and last non-zero column in each row of a pandas dataframe

I have DataFrame in view of Name and Date with values of weight in cells :

Name        Jan17  Jun18  Dec18    Apr19  count 
Nick         0      1.7     3.7      0     2
Jack         0       0      2.8     3.5    2       
Fox          0      1.7      0       0     1
Rex          1.0     0      3.0     4.2    3
Snack        0       0      2.8     4.4    2
Yosee        0       0       0      4.3    1 
Petty        0.5    1.3     2.8     3.5    4 

Start and Finish should be added to the dataFrame in reference to the next definition:

  1. Start first non zero value in row started from Jan17 column to Apr19
  2. Finish first non zero value in sequence Apr19 till to Jan17

Also, if row has only one non-zero value in row then Start andFinish are the same.

To find first non zero element in row I tried data[col].keys, np.argmax() and it works as expected.

date_col_list = ['Jan17','Jun18','Dec18', 'Apr19']

data['Start']=data[date_col_list].keys([np.argmax(data[date_col_list].values!=0, axis=1)]

Result is:

Name        Jan17  Jun18  Dec18    Apr19  count   Start 
Nick         0      1.7     3.7      0     2      Jun18        
Jack         0       0      2.8     3.5    2      Dec18           
Fox          0      1.7      0       0     1      Jun18    
Rex          1.0     0      3.0     4.2    3      Jan18    
Snack        0       0      2.8     4.4    2      Dec18    
Yosee        0       0       0      4.3    1      Apr19     
Petty        0.5    1.3     2.8     3.5    4      Jan17

To detect values for Finish column I tried to use: np.apply_along_axis as: def func_X(i): return np.argmax(np.where(i!=0))

np.apply_along_axis(func1d = func_X, axis=1, arr=data[date_col_list].values)

Result is error: 'tuple' object has no attribute 'argmax'

Expected dataframe is:

Name        Jan17  Jun18  Dec18    Apr19  count   Start  Finish
Nick         0      1.7     3.7      0     2      Jun18   Dec18     
Jack         0       0      2.8     3.5    2      Dec18   Apr19        
Fox          0      1.7      0       0     1      Jun18   Jun18 
Rex          1.0     0      3.0     4.2    3      Jan18   Apr19 
Snack        0       0      2.8     4.4    2      Dec18   Apr19 
Yosee        0       0       0      4.3    1      Apr19   Apr19  
Petty        0.5    1.3     2.8     3.5    4      Jan17   Apr19  

How can I find Finish in reference to non-zero value in direction from the last column (Apr19) to the first one (Jan17)?

like image 900
Cindy Avatar asked Jun 25 '19 15:06

Cindy


People also ask

How do you find missing values in rows in pandas?

You can extract rows/columns containing missing values from pandas. DataFrame by using the isnull() or isna() method that checks if an element is a missing value.

How do you find the last and first row of a data frame?

You can get the first row with iloc[0] and the last row with iloc[-1] . If you want to get the value of the element, you can do with iloc[0]['column_name'] , iloc[-1]['column_name'] .

Is at and LOC same in pandas?

at is a single element and using . loc maybe a Series or a DataFrame. Returning single value is not the case always. It returns array of values if the provided index is used multiple times.


4 Answers

first_valid_index and last_valid_index

d = df.mask(df == 0).drop(['Name', 'count'], 1)
df.assign(
    Start=d.apply(pd.Series.first_valid_index, 1),
    Finish=d.apply(pd.Series.last_valid_index, 1)
)

    Name  Jan17  Jun18  Dec18  Apr19  count  Start Finish
0   Nick    0.0    1.7    3.7    0.0      2  Jun18  Dec18
1   Jack    0.0    0.0    2.8    3.5      2  Dec18  Apr19
2    Fox    0.0    1.7    0.0    0.0      1  Jun18  Jun18
3    Rex    1.0    0.0    3.0    4.2      3  Jan17  Apr19
4  Snack    0.0    0.0    2.8    4.4      2  Dec18  Apr19
5  Yosee    0.0    0.0    0.0    4.3      1  Apr19  Apr19
6  Petty    0.5    1.3    2.8    3.5      4  Jan17  Apr19

stack then groupby

d = df.mask(df == 0).drop(['Name', 'count'], 1)
def fl(s): return s.xs(s.name).index[[0, -1]]
s, f = d.stack().groupby(level=0).apply(fl).str
df.assign(Start=s, Finish=f)

    Name  Jan17  Jun18  Dec18  Apr19  count  Start Finish
0   Nick    0.0    1.7    3.7    0.0      2  Jun18  Dec18
1   Jack    0.0    0.0    2.8    3.5      2  Dec18  Apr19
2    Fox    0.0    1.7    0.0    0.0      1  Jun18  Jun18
3    Rex    1.0    0.0    3.0    4.2      3  Jan17  Apr19
4  Snack    0.0    0.0    2.8    4.4      2  Dec18  Apr19
5  Yosee    0.0    0.0    0.0    4.3      1  Apr19  Apr19
6  Petty    0.5    1.3    2.8    3.5      4  Jan17  Apr19
like image 185
piRSquared Avatar answered Oct 19 '22 19:10

piRSquared


In your case try something different with dot

s=df.loc[:,'Jan17':'Apr19'].ne(0)
s=s.dot(s.columns+',').str[:-1].str.split(',')
s.str[0],s.str[-1]
Out[899]: 
(0    Jun18
 1    Dec18
 2    Jun18
 3    Jan17
 4    Dec18
 5    Apr19
 6    Jan17
 dtype: object, 0    Dec18
 1    Apr19
 2    Jun18
 3    Apr19
 4    Apr19
 5    Apr19
 6    Apr19
 dtype: object)
 #df['Start'],df['End']=s.str[0],s.str[-1]
like image 41
BENY Avatar answered Oct 19 '22 19:10

BENY


idxmax

mask = df.drop(['Name', 'count'], axis=1) > 0
df.assign(start=mask.idxmax(axis=1), end=mask.iloc[:,::-1].idxmax(axis=1))

    Name  Jan17  Jun18  Dec18  Apr19  count  start    end
0   Nick    0.0    1.7    3.7    0.0      2  Jun18  Dec18
1   Jack    0.0    0.0    2.8    3.5      2  Dec18  Apr19
2    Fox    0.0    1.7    0.0    0.0      1  Jun18  Jun18
3    Rex    1.0    0.0    3.0    4.2      3  Jan17  Apr19
4  Snack    0.0    0.0    2.8    4.4      2  Dec18  Apr19
5  Yosee    0.0    0.0    0.0    4.3      1  Apr19  Apr19
6  Petty    0.5    1.3    2.8    3.5      4  Jan17  Apr19

Drop irrelevant columns, then use idxmax first on the columns, then on the reversed columns to find the first and last valid indices respectively.

like image 39
cs95 Avatar answered Oct 19 '22 19:10

cs95


Using cumsum on the underlying array

m = df.drop(['Name', 'count'], axis=1)
u = m.to_numpy().cumsum(1)

start = (u!=0).argmax(1)
end = u.argmax(1)

df.assign(start=m.columns[start], end=m.columns[end])

    Name  Jan17  Jun18  Dec18  Apr19  count  start    end
0   Nick    0.0    1.7    3.7    0.0      2  Jun18  Dec18
1   Jack    0.0    0.0    2.8    3.5      2  Dec18  Apr19
2    Fox    0.0    1.7    0.0    0.0      1  Jun18  Jun18
3    Rex    1.0    0.0    3.0    4.2      3  Jan17  Apr19
4  Snack    0.0    0.0    2.8    4.4      2  Dec18  Apr19
5  Yosee    0.0    0.0    0.0    4.3      1  Apr19  Apr19
6  Petty    0.5    1.3    2.8    3.5      4  Jan17  Apr19
like image 6
user3483203 Avatar answered Oct 19 '22 21:10

user3483203