I am working with a fairly messy data set that has been individual csv files with slightly different names. It would be too onerous to rename columns in the csv file, partly because I am still discovering all the variations, so I am looking to determine, for a set of columns, in a given row, which field is not NaN and carrying that forward to a new column. Is there a way to do that?
Case in point. Let's say I have a data frame that looks like this:
Index A B
1 15 NaN
2 NaN 11
3 NaN 99
4 NaN NaN
5 12 14
Let's say my desired output from this is to create a new column C such that my data frame will look like the following:
Index A B C
1 15 NaN 15
2 NaN 11 11
3 NaN 99 99
4 NaN NaN NaN
5 12 14 12 (so giving priority to A over B)
How can I accomplish this?
df[df. columns[~df. isnull(). any()]] will give you a DataFrame with only the columns that have no null values, and should be the solution.
isna() in pandas library can be used to check if the value is null/NaN. It will return True if the value is NaN/null.
Pandas DataFrame notnull() Method The notnull() method returns a DataFrame object where all the values are replaced with a Boolean value True for NOT NULL values, and otherwise False.
For a dataframe with an arbitrary number of columns, you can back fill the rows (.bfill(axis=1)
) and take the first column (.iloc[:, 0]
):
df = pd.DataFrame({
'A': [15, None, None, None, 12],
'B': [None, 11, 99, None, 14],
'C': [10, None, 10, 10, 10]})
df['D'] = df.bfill(axis=1).iloc[:, 0]
>>> df
A B C D
0 15 NaN 10 15
1 NaN 11 NaN 11
2 NaN 99 10 99
3 NaN NaN 10 10
4 12 14 10 12
If you just have 2 columns, the cleanest way would be to use where
(the syntax is where([condition], [value if condition is true], [value if condition is false])
(for some reason it took me a while to wrap my head around this).
In [2]: df.A.where(df.A.notnull(),df.B)
Out[2]:
0 15.0
1 11.0
2 99.0
3 NaN
4 12.0
Name: A, dtype: float64
If you have more than two columns, it might be simpler to use max
or min
; this will ignore the null values, however you'll lose the "column prececence" you want:
In [3]: df.max(axis=1)
Out[3]:
0 15.0
1 11.0
2 99.0
3 NaN
4 14.0
dtype: float64
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With