Is there a better more readable way to coalese columns in pandas

I often need a new column that is the best I can achieve from other columns and I have a specific list of preference priorities. I am willing to take the first non null value.

def coalesce(values):
    not_none = (el for el in values if el is not None)
    return next(not_none, None)

df = pd.DataFrame([{'third':'B','first':'A','second':'C'},
                   {'third':'B','first':None,'second':'C'},
                   {'third':'B','first':None,'second':None},                   
                   {'third':None,'first':None,'second':None},
                   {'third':'B','first':'A','second':None}])

df['combo1'] = df.apply(coalesce, axis=1)
df['combo2'] = df[['second','third','first']].apply(coalesce, axis=1)
print df

Results

  first second third combo1 combo2
0     A      C     B      A      C
1  None      C     B      C      C
2  None   None     B      B      B
3  None   None  None   None   None
4     A   None     B      A      B

this code works (and the result are what I want) but it is not very fast.
I get to pick my priorities if I need to [['second','third','first']]

Coalesce somewhat like the function of the same name from tsql.
I suspect that I may have overlooked an easy way to achieve it with good performance on large DataFrames (+400,000 rows)

I know there are lots of ways to fill in missing data which I often use on axis=0 this is what makes me think I may have missed an easy option for axis=1

Can you suggest something nicer/faster... or confirm that this is as good as it gets.

Is query faster than LOC pandas?

The query function seams more efficient than the loc function. DF2: 2K records x 6 columns. The loc function seams much more efficient than the query function.

How do you coalesce two columns in pandas?

Use the combine_first() Method to Coalesce Values From Multiple Columns Into a Single Column in Pandas DataFrame. The combine_first() method fills null values in one dataframe with non-null data from the second dataframe to combine two dataframe objects.

How can you speed up computations with pandas?

Modin is a new library designed to accelerate Pandas by automatically distributing the computation across all of the system's available CPU cores. With that, Modin claims to be able to get nearly linear speedup to the number of CPU cores on your system for Pandas DataFrames of any size.

What is faster than pandas DataFrame?

Dask runs faster than pandas for this query, even when the most inefficient column type is used, because it parallelizes the computations. pandas only uses 1 CPU core to run the query. My computer has 4 cores and Dask uses all the cores to run the computation.

The Pandas equivalent to COALESCE is the method fillna():

result = column_a.fillna(column_b)

The result is a column where each value is taken from column_a if that column provides a non-null value, otherwise the value is taken from column_b. So your combo1 can be produced with:

df['first'].fillna(df['second']).fillna(df['third'])

giving:

And your combo2 can be produced with:

(df['second']).fillna(df['third']).fillna(df['first'])

which returns the new column:

If you wanted an efficient operation called coalesce, it could simply combine columns with fillna() from left to right and then return the result:

def coalesce(df, column_names):
    i = iter(column_names)
    column_name = next(i)
    answer = df[column_name]
    for column_name in i:
        answer = answer.fillna(df[column_name])
    return answer

print coalesce(df, ['first', 'second', 'third'])
print coalesce(df, ['second', 'third', 'first'])

which gives:

Is there a better more readable way to coalese columns in pandas

Tags:

python

pandas

user2315423

People also ask

1 Answers

Brandon Rhodes

Recent Activity

Donate For Us

Is there a better more readable way to coalese columns in pandas

Tags:

python

pandas

user2315423

People also ask

1 Answers

Brandon Rhodes

Related questions

Recent Activity

Donate For Us