I have a sample DataFrame shown below. For each row, I want to check c1 first; if it is not null, take that value, otherwise check c2, and so on. In other words, find the first non-null column and store that value in the column result.
ID   c1   c2   c3   c4   result
1    a    b    a    NaN  a
2    NaN  cc   dd   cc   cc
3    NaN  ee   ff   ee   ee
4    NaN  NaN  gg   gg   gg
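For reference, a minimal sketch that reconstructs this sample frame (assuming the blank cells are NaN):

import numpy as np
import pandas as pd

# Sample data from the table above; blank cells are assumed to be NaN
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'c1': ['a', np.nan, np.nan, np.nan],
    'c2': ['b', 'cc', 'ee', np.nan],
    'c3': ['a', 'dd', 'ff', 'gg'],
    'c4': [np.nan, 'cc', 'ee', 'gg'],
})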
I am using the approach below for now, but I would like to know if there is a better method. (The column names do not follow any pattern; this is just a sample.)
df["result"] = np.where(df["c1"].notnull(), df["c1"], None)
df["result"] = np.where(df["result"].notnull(), df["result"], df["c2"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c3"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c4"])
df["result"] = np.where(df["result"].notnull(), df["result"], "unknown)
When there are many columns, this approach does not scale well.
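For reference, here is how the same chained-np.where idea generalizes with a loop (just a sketch; the column list is specific to this sample):

import numpy as np
import pandas as pd

# Start from the first column, then fall back to each later column in turn
result = df['c1']
for col in ['c2', 'c3', 'c4']:
    result = np.where(pd.notnull(result), result, df[col])
# Anything still missing gets the 'unknown' placeholder
df['result'] = pd.Series(result, index=df.index).fillna('unknown')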
Use back filling of NaNs first, then select the first column with iloc:
df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
Or:
df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
print (df)
   ID   c1   c2  c3   c4 result
0   1    a    b   a  NaN      a
1   2  NaN   cc  dd   cc     cc
2   3  NaN   ee  ff   ee     ee
3   4  NaN  NaN  gg   gg     gg
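To see why this works, here is a sketch of the intermediate frame after bfill(axis=1): each NaN is replaced by the next non-null value in its row, so the first column ends up holding the first non-null value per row.

print(df[['c1','c2','c3','c4']].bfill(axis=1))
#    c1   c2  c3   c4
# 0   a    b   a  NaN
# 1  cc   cc  dd   cc
# 2  ee   ee  ff   ee
# 3  gg   gg  gg   gg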
Performance:
df = pd.concat([df] * 1000, ignore_index=True)
In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.78 ms per loop
In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.7 ms per loop
# jpp's solution
In [222]: %%timeit
...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
...:
...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
...:
1 loop, best of 3: 180 ms per loop
# cᴏʟᴅsᴘᴇᴇᴅ's solution
In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
1 loop, best of 3: 606 ms per loop
Setup
df = df.set_index('ID') # if necessary
df
     c1   c2  c3   c4
ID
1     a    b   a  NaN
2   NaN   cc  dd   cc
3   NaN   ee  ff   ee
4   NaN  NaN  gg   gg
Solution: stack + groupby + first

stack implicitly drops NaNs, so groupby(level=0).first() is guaranteed to give you the first non-null value per row if it exists. Assigning the result back will expose NaNs at any missing indices, which you can fill with a subsequent fillna call.
df['result'] = df.stack().groupby(level=0).first()
# df['result'] = df['result'].fillna('unknown') # if necessary
df
     c1   c2  c3   c4 result
ID
1     a    b   a  NaN      a
2   NaN   cc  dd   cc     cc
3   NaN   ee  ff   ee     ee
4   NaN  NaN  gg   gg     gg
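To see the mechanism, here is a sketch of the intermediate df.stack() for the sample frame: a Series whose MultiIndex pairs each ID with its non-null columns, in column order.

print(df.stack())
# ID
# 1   c1     a
#     c2     b
#     c3     a
# 2   c2    cc
#     c3    dd
#     c4    cc
# 3   c2    ee
#     c3    ff
#     c4    ee
# 4   c3    gg
#     c4    gg
# dtype: object

groupby(level=0).first() then takes the first entry within each ID group: a, cc, ee, gg.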
(Beware: this is slow for larger DataFrames; for performance, use @jezrael's solution.)