I have a sample DataFrame shown below. For each row, I want to check c1 first; if it is not null, take that value, otherwise check c2, and so on. In other words, find the first non-null column and store that value in the column result.
ID   c1   c2   c3   c4   result
1    a    b    a    NaN  a
2    NaN  cc   dd   cc   cc
3    NaN  ee   ff   ee   ee
4    NaN  NaN  gg   gg   gg
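For reference, a minimal sketch that reconstructs this sample frame (assuming the blank cells are NaN):

import numpy as np
import pandas as pd

# Sample data from the table above; blank cells are assumed to be NaN
df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'c1': ['a', np.nan, np.nan, np.nan],
    'c2': ['b', 'cc', 'ee', np.nan],
    'c3': ['a', 'dd', 'ff', 'gg'],
    'c4': [np.nan, 'cc', 'ee', 'gg'],
})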
I am using the approach below for now, but I would like to know if there is a better method. (The column names do not follow any pattern; this is just a sample.)
df["result"] = np.where(df["c1"].notnull(), df["c1"], None)
df["result"] = np.where(df["result"].notnull(), df["result"], df["c2"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c3"])
df["result"] = np.where(df["result"].notnull(), df["result"], df["c4"])
df["result"] = np.where(df["result"].notnull(), df["result"], "unknown)
When there are many columns, this approach does not scale well.
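For reference, here is how the same chained-np.where idea generalizes with a loop (just a sketch; the column list is specific to this sample):

import numpy as np
import pandas as pd

# Start from the first column, then fall back to each later column in turn
result = df['c1']
for col in ['c2', 'c3', 'c4']:
    result = np.where(pd.notnull(result), result, df[col])
# Anything still missing gets the 'unknown' placeholder
df['result'] = pd.Series(result, index=df.index).fillna('unknown')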
Use back filling of NaNs first, then select the first column with iloc:
df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
Or:
df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
print (df)
   ID   c1   c2  c3   c4 result
0   1    a    b   a  NaN      a
1   2  NaN   cc  dd   cc     cc
2   3  NaN   ee  ff   ee     ee
3   4  NaN  NaN  gg   gg     gg
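To see why this works, here is a sketch of the intermediate frame after bfill(axis=1): each NaN is replaced by the next non-null value in its row, so the first column ends up holding the first non-null value per row.

print(df[['c1','c2','c3','c4']].bfill(axis=1))
#    c1   c2  c3   c4
# 0   a    b   a  NaN
# 1  cc   cc  dd   cc
# 2  ee   ee  ff   ee
# 3  gg   gg  gg   gg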
Performance:
df = pd.concat([df] * 1000, ignore_index=True)
In [220]: %timeit df['result'] = df[['c1','c2','c3','c4']].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.78 ms per loop
In [221]: %timeit df['result'] = df.iloc[:, 1:].bfill(axis=1).iloc[:, 0].fillna('unknown')
100 loops, best of 3: 2.7 ms per loop
# jpp's solution
In [222]: %%timeit
...: cols = df.iloc[:, 1:].T.apply(pd.Series.first_valid_index)
...:
...: df['result'] = [df.loc[i, cols[i]] for i in range(len(df.index))]
...:
1 loop, best of 3: 180 ms per loop
# cᴏʟᴅsᴘᴇᴇᴅ's solution
In [223]: %timeit df['result'] = df.stack().groupby(level=0).first()
1 loop, best of 3: 606 ms per loop
Setup
df = df.set_index('ID') # if necessary
df
     c1   c2  c3   c4
ID
1     a    b   a  NaN
2   NaN   cc  dd   cc
3   NaN   ee  ff   ee
4   NaN  NaN  gg   gg
Solution: stack + groupby + first

stack implicitly drops NaNs, so groupby(level=0).first() is guaranteed to give you the first non-null value per row if it exists. Assigning the result back will expose NaNs at any missing indices, which you can fill with a subsequent fillna call.
df['result'] = df.stack().groupby(level=0).first()
# df['result'] = df['result'].fillna('unknown') # if necessary
df
     c1   c2  c3   c4 result
ID
1     a    b   a  NaN      a
2   NaN   cc  dd   cc     cc
3   NaN   ee  ff   ee     ee
4   NaN  NaN  gg   gg     gg
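To see the mechanism, here is a sketch of the intermediate df.stack() for the sample frame: a Series whose MultiIndex pairs each ID with its non-null columns, in column order.

print(df.stack())
# ID
# 1   c1     a
#     c2     b
#     c3     a
# 2   c2    cc
#     c3    dd
#     c4    cc
# 3   c2    ee
#     c3    ff
#     c4    ee
# 4   c3    gg
#     c4    gg
# dtype: object

groupby(level=0).first() then takes the first entry within each ID group: a, cc, ee, gg.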
(Beware: this is slow for larger DataFrames; for performance, use @jezrael's solution.)