I often need a new column that is the best I can achieve from other columns and I have a specific list of preference priorities. I am willing to take the first non null value.
def coalesce(values):
not_none = (el for el in values if el is not None)
return next(not_none, None)
df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
{'third':'B','first':None,'second':'C'},
{'third':'B','first':None,'second':None},
{'third':None,'first':None,'second':None},
{'third':'B','first':'A','second':None}])
df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second','third','first']].apply(coalesce, axis=1)
print df
Results
first second third combo1 combo2
0 A C B A C
1 None C B C C
2 None None B B B
3 None None None None None
4 A None B A B
this code works (and the result are what I want) but it is not very fast.
I get to pick my priorities if I need to [['second','third','first']]
Coalesce somewhat like the function of the same name from tsql.
I suspect that I may have overlooked an easy way to achieve it with good performance on large DataFrames (+400,000 rows)
I know there are lots of ways to fill in missing data which I often use on axis=0 this is what makes me think I may have missed an easy option for axis=1
Can you suggest something nicer/faster... or confirm that this is as good as it gets.
The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.
Use the combine_first() Method to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame. The combine_first() method fills null values in one dataframe with non-null data from the second dataframe to combine two dataframe objects.
Modin is a new library designed to accelerate Pandas by automatically distributing the computation across all of the system's available CPU cores. With that, Modin claims to be able to get nearly linear speedup to the number of CPU cores on your system for Pandas DataFrames of any size.
Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.
The Pandas equivalent to COALESCE
is the method fillna()
:
result = column_a.fillna(column_b)
The result is a column where each value is taken from column_a
if that column provides a non-null value, otherwise the value is taken from column_b
. So your combo1
can be produced with:
df['first'].fillna(df['second']).fillna(df['third'])
giving:
0 A
1 C
2 B
3 None
4 A
And your combo2
can be produced with:
(df['second']).fillna(df['third']).fillna(df['first'])
which returns the new column:
0 C
1 C
2 B
3 None
4 B
If you wanted an efficient operation called coalesce
, it could simply combine columns with fillna()
from left to right and then return the result:
def coalesce(df, column_names):
i = iter(column_names)
column_name = next(i)
answer = df[column_name]
for column_name in i:
answer = answer.fillna(df[column_name])
return answer
print coalesce(df, ['first', 'second', 'third'])
print coalesce(df, ['second', 'third', 'first'])
which gives:
0 A
1 C
2 B
3 None
4 A
0 C
1 C
2 B
3 None
4 B
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With