
Locate first and last non NaN values in a Pandas DataFrame

I have a Pandas DataFrame indexed by date. There are a number of columns, but many of them are only populated for part of the time series. I'd like to find where the first and last non-NaN values are located so that I can extract the dates and see how long the time series is for a particular column.

Could somebody point me in the right direction as to how I could go about doing something like this? Thanks in advance.

Jason asked Mar 14 '14 11:03




2 Answers

@behzad.nouri's solution worked perfectly: Series.first_valid_index and Series.last_valid_index return the index labels of the first and last non-NaN values, respectively.
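For the date-indexed case in the question, a minimal sketch might look like the following. The column name `x` and the sample dates are made up for illustration; only `first_valid_index` and `last_valid_index` come from the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical date-indexed frame; column "x" is only partly populated.
idx = pd.date_range("2014-01-01", periods=6, freq="D")
df = pd.DataFrame({"x": [np.nan, 1.0, 2.0, np.nan, 3.0, np.nan]}, index=idx)

first = df["x"].first_valid_index()  # date of the first non-NaN value
last = df["x"].last_valid_index()    # date of the last non-NaN value
span = last - first                  # Timedelta between the two dates

print(first.date(), last.date(), span)
```

Because the index holds dates, subtracting the two labels gives the length of the populated stretch directly.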

Jason answered Sep 17 '22 03:09


Here are some helpful examples.

Series

    import numpy as np
    import pandas as pd

    # np.nan is used here because the np.NaN alias was removed in NumPy 2.0
    s = pd.Series([np.nan, 1, np.nan, 3, np.nan], index=list('abcde'))
    s

    a    NaN
    b    1.0
    c    NaN
    d    3.0
    e    NaN
    dtype: float64

    # first valid index
    s.first_valid_index()
    # 'b'

    # first valid position
    s.index.get_loc(s.first_valid_index())
    # 1

    # last valid index
    s.last_valid_index()
    # 'd'

    # last valid position
    s.index.get_loc(s.last_valid_index())
    # 3

Alternative solution using notna and idxmax:

    # first valid index
    s.notna().idxmax()
    # 'b'

    # last valid index
    s.notna()[::-1].idxmax()
    # 'd'
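One difference worth keeping in mind between the two approaches is how they behave when there are no valid values at all. The sketch below uses a made-up all-NaN Series (the name `empty` is just for illustration) to show it:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with no valid values at all.
empty = pd.Series([np.nan, np.nan], index=list("ab"))

# first_valid_index reports the absence of valid data with None.
fvi = empty.first_valid_index()

# notna() is all False here, so idxmax just returns the first label,
# which can be mistaken for a real hit.
via_idxmax = empty.notna().idxmax()

print(fvi, via_idxmax)
```

If all-NaN data is possible, first_valid_index is probably the safer choice, or the idxmax result can be checked against `s.notna().any()`.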

DataFrame

    df = pd.DataFrame({
        'A': [np.nan, 1, np.nan, 3, np.nan],
        'B': [1, np.nan, np.nan, np.nan, np.nan]
    })
    df

         A    B
    0  NaN  1.0
    1  1.0  NaN
    2  NaN  NaN
    3  3.0  NaN
    4  NaN  NaN

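Called directly on a DataFrame, (first|last)_valid_index returns a single label (the first or last row that has any non-NaN value) rather than one label per column. To get a result for each column, apply the Series methods using apply.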

    # first valid index for each column
    df.apply(pd.Series.first_valid_index)

    A    1
    B    0
    dtype: int64

    # last valid index for each column
    df.apply(pd.Series.last_valid_index)

    A    3
    B    0
    dtype: int64

As before, you can also use notna and idxmax. This is slightly more natural syntax.

    # first valid index
    df.notna().idxmax()

    A    1
    B    0
    dtype: int64

    # last valid index
    df.notna()[::-1].idxmax()

    A    3
    B    0
    dtype: int64

cs95 answered Sep 17 '22 03:09
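To tie this back to the original question, here is a rough sketch that collects the first and last valid dates for every column into one table. It assumes a date index; the `summary` frame and its column names (`first`, `last`, `span`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Same values as the example above, but on a hypothetical daily date index.
idx = pd.date_range("2014-01-01", periods=5, freq="D")
df = pd.DataFrame(
    {
        "A": [np.nan, 1, np.nan, 3, np.nan],
        "B": [1, np.nan, np.nan, np.nan, np.nan],
    },
    index=idx,
)

# One row per column of df: first valid date, last valid date, and the gap.
summary = pd.DataFrame(
    {
        "first": df.apply(pd.Series.first_valid_index),
        "last": df.apply(pd.Series.last_valid_index),
    }
)
summary["span"] = summary["last"] - summary["first"]

print(summary)
```

Here span is the time between the first and last valid dates, so a column with a single valid value gets a span of zero.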