
How to implement SQL COALESCE in pandas

Tags:

python

pandas

I have a DataFrame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
     A     B   C
0  1.0   NaN   5
1  2.0  10.0  10
2  NaN   NaN   7 

I want to add a new column 'D' that holds, for each row, the first non-null value among A, B and C (like SQL COALESCE). The expected output is

     A     B   C    D
0  1.0   NaN   5    1.0
1  2.0  10.0  10    2.0
2  NaN   NaN   7    7.0

Thanks in advance!

asked Apr 03 '17 06:04 by Anoop



4 Answers

Another way is to explicitly fill column D from A, B and C, in that order.

df['D'] = np.nan
df['D'] = df.D.fillna(df.A).fillna(df.B).fillna(df.C)
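
For more than a handful of columns, the same chain can be written as a loop. This is only a sketch: the `priority` list is a name I made up for illustration, and it assumes all the columns share the same index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# Hypothetical priority list: earlier columns win over later ones.
priority = ["A", "B", "C"]

# Start from the highest-priority column and fill the remaining gaps in order.
d = df[priority[0]]
for name in priority[1:]:
    d = d.fillna(df[name])

df["D"] = d
print(df)
```

Starting from the first column avoids creating an all-NaN D first, but the result should be the same as the chained version.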
answered Oct 20 '22 23:10 by philshem


Another approach is to use the combine_first method of a pd.Series. Using your example df,

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({"A":[1,2,np.nan],"B":[np.nan,10,np.nan], "C":[5,10,7]})
>>> df
     A     B   C
0  1.0   NaN   5
1  2.0  10.0  10
2  NaN   NaN   7

we have

>>> df.A.combine_first(df.B).combine_first(df.C)
0    1.0
1    2.0
2    7.0
Name: A, dtype: float64

We can use reduce to abstract this pattern to work with an arbitrary number of columns.

>>> from functools import reduce
>>> cols = [df[c] for c in df.columns]
>>> reduce(lambda acc, col: acc.combine_first(col), cols)
0    1.0
1    2.0
2    7.0
Name: A, dtype: float64

Let's put this all together in a function.

>>> def coalesce(*args):
...     return reduce(lambda acc, col: acc.combine_first(col), args)
...
>>> coalesce(*cols)
0    1.0
1    2.0
2    7.0
Name: A, dtype: float64
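
To get the column the question asks for, the result can be assigned back to the frame. The sketch below repeats the function so it runs on its own; the second call with B first is my own addition, just to show that the argument order sets the priority.

```python
from functools import reduce

import numpy as np
import pandas as pd


def coalesce(*args):
    # Fold the Series left to right; earlier arguments take precedence.
    return reduce(lambda acc, col: acc.combine_first(col), args)


df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# Same priority as in the question: A, then B, then C.
df["D"] = coalesce(df["A"], df["B"], df["C"])

# Illustrative variation: prefer B over A.
df["E"] = coalesce(df["B"], df["A"], df["C"])
print(df)
```

Note that combine_first aligns on the index and returns the union of both indexes, so Series with different indexes can produce more rows than either input.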
answered Oct 20 '22 23:10 by yardsale8


I think you need bfill along the rows, then select the first column with iloc:

df['D'] = df.bfill(axis=1).iloc[:,0]
print (df)
     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0

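This is the same as the following, although fillna(method=...) is deprecated as of pandas 2.1, so bfill is the preferred spelling: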

df['D'] = df.fillna(method='bfill',axis=1).iloc[:,0]
print (df)
     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0
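
One thing to watch: bfill on the whole frame uses every column in its current order, including D itself if it already exists. A sketch of how the priority could be made explicit by selecting the columns first (the B-first ordering is just an illustrative assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# Select only the source columns, in priority order, before back-filling.
df["D"] = df[["A", "B", "C"]].bfill(axis=1).iloc[:, 0]

# Illustrative variation: a different priority, ignoring the new column D.
df["E"] = df[["B", "A", "C"]].bfill(axis=1).iloc[:, 0]
print(df)
```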
answered Oct 20 '22 23:10 by jezrael


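Option 1: pandas
(DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this only runs on older versions.)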

df.assign(D=df.lookup(df.index, df.isnull().idxmin(1)))

     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0
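
On pandas versions without DataFrame.lookup, the same idea can be approximated by turning the column labels into positions and indexing the underlying array. This is a sketch of one possible replacement, not the original author's code; the variable names are mine.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 2, np.nan], "B": [np.nan, 10, np.nan], "C": [5, 10, 7]})

# Label of the first non-null column per row: idxmin on a boolean frame
# returns the first False, i.e. the first entry that is not null.
first_valid = df.isnull().idxmin(axis=1)

# Translate the column labels to integer positions, then pick one value per row.
col_pos = df.columns.get_indexer(first_valid)
row_pos = np.arange(len(df))
result = df.assign(D=df.to_numpy()[row_pos, col_pos])
print(result)
```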

Option 2: numpy

v = df.values
j = np.isnan(v).argmin(1)
df.assign(D=v[np.arange(len(v)), j])

     A     B   C    D
0  1.0   NaN   5  1.0
1  2.0  10.0  10  2.0
2  NaN   NaN   7  7.0
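
A row where every column is NaN is worth checking. In the sketch below (the two-row frame is made up for illustration), argmin returns 0 for such a row, so the value picked is the NaN from the first column, which matches SQL COALESCE returning NULL when all inputs are NULL.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose second row is entirely NaN.
df = pd.DataFrame({"A": [1, np.nan], "B": [np.nan, np.nan]})

v = df.to_numpy()
# Position of the first non-NaN value per row; 0 when the whole row is NaN.
j = np.isnan(v).argmin(1)
d = v[np.arange(len(v)), j]
print(df.assign(D=d))
```

This relies on all columns being numeric, since np.isnan does not accept object arrays.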

Naive time test, over the given data and over larger data (timing images not available).
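
Since the timing images are not available here, a comparison along these lines can be rerun locally. This is a rough sketch with assumptions of my own (random data, 10,000 rows, about half the cells missing, five repetitions); it only measures two of the approaches and the numbers will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed test data: random floats with roughly half the cells set to NaN.
rng = np.random.default_rng(0)
values = rng.random((10_000, 3))
values[rng.random(values.shape) < 0.5] = np.nan
big = pd.DataFrame(values, columns=list("ABC"))


def with_bfill(df):
    # Back-fill along each row and take the first column.
    return df.bfill(axis=1).iloc[:, 0]


def with_numpy(df):
    # Index the first non-NaN position per row directly in the array.
    v = df.to_numpy()
    j = np.isnan(v).argmin(1)
    return v[np.arange(len(v)), j]


timings = {
    func.__name__: timeit.timeit(lambda: func(big), number=5)
    for func in (with_bfill, with_numpy)
}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 5 runs")
```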

answered Oct 21 '22 01:10 by piRSquared