 

Pandas: take whichever column is not NaN

Tags:

python

pandas

I am working with a fairly messy data set that has been split across individual csv files with slightly different column names. It would be too onerous to rename the columns in the csv files, partly because I am still discovering all the variations, so instead I want to determine, for a set of columns in a given row, which field is not NaN, and carry that value forward to a new column. Is there a way to do that?

Case in point. Let's say I have a data frame that looks like this:

Index   A     B
1       15    NaN
2       NaN   11
3       NaN   99
4       NaN   NaN
5       12    14

Let's say my desired output from this is to create a new column C such that my data frame will look like the following:

Index   A     B       C
1       15    NaN     15
2       NaN   11      11
3       NaN   99      99
4       NaN   NaN     NaN
5       12    14      12 (so giving priority to A over B)

How can I accomplish this?

asked Aug 16 '16 by helloB

People also ask

How do I select a column without NaN?

df[df.columns[~df.isnull().any()]] will give you a DataFrame containing only the columns that have no null values.

Is not NaN in pandas?

isna() in the pandas library can be used to check whether a value is null/NaN. It returns True for values that are NaN/null.

How do you get NOT null columns in pandas?

Pandas DataFrame notnull() method: the notnull() method returns a DataFrame object where every value is replaced with a Boolean, True for NOT NULL values and False otherwise.
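The three answers above can be checked with a small sketch (the column names here are made up for illustration):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [15, np.nan], 'B': [1, 2]})

# Keep only the columns with no null values at all
clean = df[df.columns[~df.isnull().any()]]   # only column 'B' survives

# Element-wise null checks: notnull() is the inverse of isna()
mask = df['A'].notnull()                     # True, False
```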


2 Answers

For a dataframe with an arbitrary number of columns, you can backfill along the rows (.bfill(axis=1)) and take the first column (.iloc[:, 0]):

df = pd.DataFrame({
    'A': [15, None, None, None, 12],
    'B': [None, 11, 99, None, 14],
    'C': [10, None, 10, 10, 10]})

df['D'] = df.bfill(axis=1).iloc[:, 0]

>>> df
    A   B   C   D
0  15 NaN  10  15
1 NaN  11 NaN  11
2 NaN  99  10  99
3 NaN NaN  10  10
4  12  14  10  12
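The same left-to-right precedence can also be made explicit with combine_first, which fills NaNs in the caller from its argument; chaining gives A priority over B, then C (a sketch of an equivalent alternative, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [15, None, None, None, 12],
    'B': [None, 11, 99, None, 14],
    'C': [10, None, 10, 10, 10]})

# For each row: take A if present, else B, else C
df['D'] = df['A'].combine_first(df['B']).combine_first(df['C'])
# df['D'] -> 15, 11, 99, 10, 12
```

Unlike bfill, this does not depend on the physical column order in the frame, so you can state any precedence you like.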
answered Sep 20 '22 by Alexander

If you just have 2 columns, the cleanest way is to use where. Note the syntax: s.where(cond, other) keeps the caller's values where cond is True and takes other where it is False (for some reason it took me a while to wrap my head around this).

In [2]: df.A.where(df.A.notnull(),df.B)
Out[2]:
0    15.0
1    11.0
2    99.0
3     NaN
4    12.0
Name: A, dtype: float64
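For the two-column case, fillna is an equivalent and arguably more readable spelling of the same idea (a sketch, reusing the question's A/B frame):

```python
import pandas as pd

df = pd.DataFrame({'A': [15, None, None, None, 12],
                   'B': [None, 11, 99, None, 14]})

# Take A where present, otherwise fall back to B
df['C'] = df['A'].fillna(df['B'])
# df['C'] -> 15, 11, 99, NaN, 12
```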

If you have more than two columns, it might be simpler to use max or min; these ignore null values, but you lose the column precedence you want (note row 4 below takes 14 from B, not 12 from A):

In [3]: df.max(axis=1)
Out[3]:
0    15.0
1    11.0
2    99.0
3     NaN
4    14.0
dtype: float64
answered Sep 20 '22 by maxymoo